Semantically coherent word aggregates extracted from a large real-world graph
by Christian Belbèze
Université Toulouse 1 Capitole - PhD in Computer Science, 2012
Observing Internet users engaged in information searches has highlighted a general need for immediate exchange. This immediacy can take several forms, in particular the ability for a user, at a given moment, to benefit from the searches of other users through dynamic recommendation. Following the principle of social networks, a community is a set of Internet users who can take advantage of links, predefined or not, based on common interests, common practices, and so on. Identifying these links dynamically and bringing users into contact with one another appeared to be a real challenge.
We therefore need to create communities of Internet users dynamically from ongoing searches made through search engines (using log files, for example). The dynamic generation of communities relies largely on extracting the search themes (centers of interest) of the Internet users present on the network at a given moment (or during a given period). These search themes, which connect users to one another, form the core of the community dynamic. The community is then represented as a large complex-network graph of words (extracted from the themes) in which the edges represent co-occurrences.
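As an illustration of the graph described above, here is a minimal sketch of building a weighted word co-occurrence graph from search queries. The function name `build_cooccurrence_graph`, the whitespace tokenization, and the sample queries are assumptions made for this example; the thesis itself may extract themes from logs differently.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence_graph(queries):
    """Return (node_weights, edge_weights) for a word co-occurrence graph.

    Hypothetical helper: each query is split on whitespace, a node weight
    counts the queries containing a word, and an edge weight counts the
    queries in which two words appear together.
    """
    node_weights = Counter()
    edge_weights = Counter()
    for query in queries:
        # Deduplicate and sort so each edge has one canonical key (a, b), a < b.
        words = sorted(set(query.lower().split()))
        node_weights.update(words)
        edge_weights.update(combinations(words, 2))
    return node_weights, edge_weights


# Made-up sample queries, standing in for lines of a search-engine log.
queries = [
    "jaguar car price",
    "jaguar animal habitat",
    "car price comparison",
]
nodes, edges = build_cooccurrence_graph(queries)
```

Keeping weights on both nodes and edges is a deliberate choice here, since the abstract later stresses the importance of weighted nodes and links.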
In this thesis, we propose an approach for creating and validating the community graph. This approach aggregates the nodes of the graph so that each aggregate has the highest possible semantic consistency. The following issues must be resolved:
- creating clusters of words that may overlap (a single word form may belong to several themes);
- choosing or defining a grouping technique that guarantees a high degree of semantic consistency;
- characterizing the aggregates to understand the differences in semantic consistency;
- proposing techniques to validate the semantic consistency of the aggregates.
In the first part, which presents the state of the art, we study many methods for detecting communities in graphs. However, none of them fulfills all the necessary criteria.
In the second part, we present our contribution, which consists of several aggregation methods and several semantic validation methods.
We propose four aggregation methods: Clique Detection (agglomeration of cliques), Simple Rarefaction (search for breaking points in the graph), Regulated Rarefaction (search for breaking points based on the study of specific word populations, namely stop words and monosemous words), and Enrichment of Aggregates by Gravity (the method computes an attraction coefficient of each word toward each aggregate).
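To make the idea of overlapping aggregates concrete, the sketch below enumerates maximal cliques in a small word graph using the classic Bron-Kerbosch algorithm with pivoting. This is a standard textbook technique swapped in for illustration, not the agglomeration procedure of the thesis; the function name `maximal_cliques` and the sample edges are assumptions.

```python
def maximal_cliques(adjacency):
    """Enumerate maximal cliques (Bron-Kerbosch with pivoting).

    `adjacency` maps each word to the set of its neighbours.
    Illustrative stand-in only, not the thesis's own method.
    """
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidate words, x: already explored words.
        if not p and not x:
            cliques.append(frozenset(r))
            return
        # Pivot on the vertex with the most neighbours among the candidates.
        pivot = max(p | x, key=lambda v: len(adjacency[v] & p))
        for v in list(p - adjacency[pivot]):
            expand(r | {v}, p & adjacency[v], x & adjacency[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adjacency), set())
    return cliques


# Made-up co-occurrence edges: "jaguar" is used in two distinct senses.
edge_list = [
    ("jaguar", "car"), ("jaguar", "price"), ("car", "price"),
    ("jaguar", "animal"), ("jaguar", "habitat"), ("animal", "habitat"),
]
adjacency = {}
for a, b in edge_list:
    adjacency.setdefault(a, set()).add(b)
    adjacency.setdefault(b, set()).add(a)

cliques = maximal_cliques(adjacency)
# Words that belong to more than one clique, i.e. the overlap between aggregates.
shared = {w for w in adjacency if sum(w in c for c in cliques) > 1}
```

In this toy graph the ambiguous word ends up in both cliques, which is the kind of overlap the list of requirements asks for.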
We then propose three methods to validate the semantic consistency of the aggregates: the Compared Coefficient of Semantic Validation method (the semantic value of the aggregates is estimated by comparing the behavior of an Internet search engine on different test sets and on the aggregates), the Trec-Eval method for query enrichment (the aggregates are used to make user queries more specific), and a method comparing the consistency of returned documents (the semantic consistency of the documents returned by queries built from specific test sets is compared with that of documents returned by queries built from aggregates). We also use manual validation by experts in the semantic domains concerned, including comparison with other methods.
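The use of aggregates to make a user query more specific can be sketched as follows. This is only a guess at the general shape of such an enrichment step: the function `enrich_query`, the overlap-based selection rule, and the `max_added` limit are all assumptions introduced for this example and are not taken from the thesis.

```python
def enrich_query(query, aggregates, max_added=3):
    """Append words from the aggregate that best overlaps the query.

    Hypothetical sketch: the aggregate sharing the most words with the
    query is selected, and up to `max_added` of its remaining words are
    appended in alphabetical order.
    """
    terms = query.lower().split()
    term_set = set(terms)
    best = max(aggregates, key=lambda agg: len(agg & term_set), default=set())
    if not best & term_set:
        # No aggregate shares a word with the query: leave it unchanged.
        return terms
    added = sorted(best - term_set)[:max_added]
    return terms + added


# Made-up aggregates, as a clustering step might produce them.
aggregates = [
    {"jaguar", "car", "price"},
    {"jaguar", "animal", "habitat"},
]
enriched = enrich_query("jaguar habitat", aggregates)
unchanged = enrich_query("python tutorial", aggregates)
```

An enriched query of this kind could then be submitted to a retrieval system and scored with standard evaluation tooling, which is the role the abstract assigns to Trec-Eval.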
The various proposals and experiments provide evidence of the importance of weighting nodes and links, as well as of using directed graphs. Limiting the size of the word aggregates is also a major factor in semantic consistency. The clustering methods can still evolve: combining several types of links in a graph, for example, would refine the content of the aggregates.
Graphs, Term aggregates, Communities and user communities, Complex networks, Small worlds.