Xavier Polanco
Laboratoire d'Informatique de Paris 6 (LIP6)
Université Pierre et Marie Curie (Paris 6) – CNRS UMR 7606
(December 2007)
The objective is a methodology to support an information analysis system. Information analysis is the working process that follows data storage and information retrieval. Usually information is analyzed after to be collected from databases by means of information retrieval systems. An information analysis system is designed to provide users assistance performing transactions on a finite set of data. Let X = {x1, x2, …, xn} be the finite set of data to analyze. The goal is to extract useful information from X, one that has a particular interest to a user or user community, one that is of interest to a particular need of information (as for instance science watching and technology forecasting in research policy and management; or business intelligence in economic sector).
Network analysis is the general framework of the proposed methodology. Associations are the building blocks of network analysis. A network is generally defined as a specific type of relations linking a defined set of data. Different types of relations identify different networks, even when imposed on the identical set of elements. Attributes and relationships are two basic approaches to viewing and classifying a dataset. Attributes are intrinsic characteristics or patterns of data. Data may also be involved in relationships that exist only if two or more entities are considered together. A relation is an emergent property of the association (connection or linkage) between units of observation. Various types of attributes and relations can be measured. By ignoring the structural context within which data are located, a purely attribute-based analysis loses much of the explanatory potential that relational analysis can offer. The ultimate advance of information analysis requires combinations of both types of data (attributes, relations), and the creation of measurements and analysis methods capable of processing them.
The analysis involves the extraction of similarity graphs from X according to a hypergraph model. The informative short strong paths in the graphs are mining following a graph reduction process, and also applying a graph decomposition in order to reveal an independent family of formal concept intensions and to visualize the way they overlap. We refer to this objective as the formal concept network.
Relative to the empirical domain of study and the units of analysis, our work concentrates on scientific and technical information data. And the STI data’s units of analysis are keywords, authors, their affiliations, citations (citing and cited authors), and journals (as sources of information).
In the next, our approach is compressed into ten cursory propositions.
Proposition 1: Any set X of data can be modelled as a hypergraph H (Berge 1987). The elements x1, x2, …, xn of X are represented by vertices, and the data attributes are represented by subsets of data sharing these attributes. These subsets E1, E2, …, Em constitute the hyperedges of H; H = (E1, E2, …, Em).
Proposition 2: Two types of objects can be derived from the hypergraph model: (1) minimal transversals as formal concept’s extensions and (2) intersection graphs as association graphs.
Proposition 3: The minimal traversals correspond by definition (1) to formal concepts in formal concept analysis (FCA), FCA uses the Galois lattice model (Ganter et al 2005; Ganter & Wille 1999); (2) to close item-sets in data mining (Zaki 2004).
Proposition 4: The number of possible closed sets is exponential on the number of attributes and the generation of the whole Galois lattice is a NP hard problem. For large data sets, data-mining methods focus on formal concepts which intensions are frequent item sets of reduced size (Hahsler et al, 2005). We are interested in clustering methods that naturally highlight particular small closed sets of attributes in linear or quadratic time.
Proposition 5: According to the type of information that the user wants to analyze, intersection graphs G (V, E) that we call co-occurrence graphs can be derived from different subparts of H. The co-occurrence is a symmetric relation, (i,j) = (j,i), then E is a set of non directed edges. The vertices in V are the selected hyperedges of H meanwhile an edge in E is drawn among two vertices (vi, vj) whenever they have a non empty intersection, vi Ç vj ¹ Æ, whenever they have at least k elements in their intersection.
Proposition 6: Applying an association coefficient on G (V, E) we obtain G (A) that is graph of associations. The equivalence coefficient noted s(vi,vi) = |viÇ vj|2 / (|vi|´|vj|) is used for measuring the strength and normalizing the co-occurrences, called from now associations. G (A) = (V, E, s) is the valued graph of associations. This coefficient has an easy interpretation in terms of probability theory since it is the product of conditional probabilities of finding one item knowing the presence of the other. This coefficient is maximized by pairs of items that are in the same closed sets.
Proposition 7: Using a threshold s on the weighted edges of G (A) a sub-graph G (A>s) = (V, Es, s) is obtain where Es is the set of pairs of vertices {vi,vj} such that s (vi,vj) > s. Usually, when a co-occurrence matrix is used, a threshold is set on the frequency of items in order to obtain a less sparse matrix. We claim that setting the threshold on association value s(vi,vj) and not on item frequency |vj| is better suited to form clusters that are closed. Consequently, every value s Î [0, 1] induces a sub-graph G (A>s) = (V, Es, s).
Proposition 8: Applying a clustering algorithm on G (A>s) one obtains a clustered graph with the purpose of mining strong geodesic paths. We attempt to show that a variant of single link clustering (SLC) works fine in this context. The SLC variants are better suited to extract interesting clusters formed along easily interpretable paths of associated items than algorithms based on detecting high density regions. The SLC algorithm runs in linear time O (|E|) on the number of edges |E| in the graph (San Juan et al, 2005). It is called CPCL (Classification by Preferential Clustered Link) originally introduced in (Ibekwe-San Juan 1998). For a detailed description of the CPCL algorithm in the graph formalism, see Berry et al. (2004).
Proposition 9: The derived graphs (from H) are small world graphs (SWG), this structural property allows dividing each of them into a central kernel and a periphery, and then both into atoms. The atoms are overlapping subgraphs without clique separators (Berry et al, 2004). The centre kernel is made of a single huge atom meanwhile the periphery is made of a family of small atoms. The analysis of the central kernel is carried out in the way exposed above setting a threshold on association values, applying a clustering algorithm and using force directed graph display algorithms. The analysis of the periphery requires more complex graph decomposition in atoms.
Proposition 10: Visual display is important in network analysis. It is an information visualization subject. We are using the interactive interface aiSee and its optimized bi-scale force directed layout. For example, the edge widths display the strength of the links, and edges which values are higher that adjacent edges, and thus revealing local maximums, are drawn in a different colour. Merging together subgraphs of this colour, the clustered graph G (CPCL) is obtained. Clusters are then represented by node-shapes which size depends on the number of clustered vertices. Special clusters can be unfolded in a wrapped form that allows visualizing the transitions to other clusters.
I would like to thank Eric San Juan (LIA, Université d’Avignon) for his significant contribution to this research project during 2006-2007.
Publications
- POLANCO X. (2007) Mode of analyze the scientific infrastructure of information technologies and communications (in Spanish), Revista Iberoamericana de Ciencia, Tecnología y Sociedad, Vol. 3, num. 9, (ISSN: 1668-0030). Available in http://www.oei.es/revistacts.htm
- POLANCO X., SAN JUAN E. (2007) Hypergraph Modelling and Graph Clustering Process Applied to Co-Word Analysis. Proceedings of the 11th International Conference of the International Society for Scientometrics and Informetrics, CSIC, Madrid, Spain, June 25-27, 2007, Edited by Daniel Torres-Salinas and Henk F. Moed, and published by CSIC-CINDOC, vol. II, p. 613-618. Available in http://hal.archives-ouvertes.fr/hal-00165984/en/
- POLANCO X., SAN JUAN E. (2006) Text Data Network Analysis Using Graph Approach. 1th International Conference on Multidisciplinary Information Sciences and Technology, InSciT2006, Mérida, Spain, October 25-28th. In Vicente P. Guerrero-Bote (Editor), Current Research in Information Sciences and Technologies Multidisciplinary Approaches to Global Information Systems, Edited by Open Institute of Knowledge, vol. II, p. 586-592 (ISBN: 84-611-3104-5) Available in http://hal.archives-ouvertes.fr/hal-00165964/en/
References
- Berge C. (1987) Hypergraphes, Paris: Gauthier-Villars.
- Berry A., Kaba B., Nadif M., SanJuan E., Sigayret A. (2004). Classification et désarticulation de graphes de termes, Proc. of the 7th International Conference on Textual Data Statistical Analysis (JADT 2004), Louvain-la-Neuve, Belgium, 10-12 march, p. 160-170.
- Ganter B., Stummed G., Wille R. (Eds) (2005). Formal Concept Analysis: Foundations and Applications. Lecture Notes in Artificial Intelligence 3626, Springer.
- Ganter B., Wille, R., (1999). Formal Concept Analysis: Mathematical Foundations. Springer.
- Hahsler, M., Grün, B., Hornik, K.: arules - a computational environment for mining association rules and frequent item sets. Journal of Statistical Software 14(15) (2005)
- Ibekwe-SanJuan F. (1998). A linguistic and mathematical method for mapping thematic trends from texts, Proceedings of the 13th European Conference on Artificial Intelligence (ECAI). Brighton, U.K., p. 170-174.
- SanJuan E., Dowdall J., Ibekwe-SanJuan F., Rinaldi F. (2005). A symbolic approach to automatic multiword term structuring, Computer Speech and Language, vol 19, 4, October 2005, p. 524-542
- Zaki M. (2004). Mining non-redundant association rules, Data Mining and Knowledge Discovery, 9 (3) p. 223-248.