The primary purpose is to develop an approach that combines clustering (unsupervised classification) and categorization (supervised classification) on the basis of a graph representation.
This study analyses the scientific articles on the Semantic Web published in 2004 and stored in the PASCAL database. It deals with the application of two types of graphs. The first is the undirected valued graph (explained in Polanco 2007, see Graph), whose edges are valued by an association coefficient (see proposition 6 in Hypergraph and Graph Clustering).
The other type, the directed valued graph, is presented on this page. Two levels are considered: categories and concepts. Categories are classes of concepts. Concepts are clusters. Cluster analysis detects the concepts, which are then categorized by a classification method.
The directed valued graph is called the "graph of inclusion" and is denoted G(V, A, Inc), where: V = the set of vertices, each of which is a cluster, i.e. a subset of keywords (the concepts) together with the subset of documents indexed by those keywords; A = the set of arcs (from/to) between cluster nodes, each valued by an inclusion coefficient and supported by a subset of documents; Inc = the inclusion value of the arcs.
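As a rough illustration of the structure G(V, A, Inc), the sketch below stores clusters as vertices and builds valued arcs together with their supporting documents. The class names, the threshold, and above all the direction rule (an arc goes from the cluster with fewer documents to the cluster with more, valued by the shared fraction of the smaller one) are assumptions made for this example, not rules stated on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    """A vertex of V: a subset of keywords plus the documents they index."""
    label: str
    keywords: frozenset
    documents: frozenset


@dataclass(frozen=True)
class Arc:
    """An element of A: a directed, valued arc with its supporting documents."""
    source: str
    target: str
    inc: float
    support: frozenset


def inclusion_arcs(clusters, threshold=0.5):
    """Build arcs between clusters under the assumed direction rule."""
    arcs = []
    for a in clusters:
        for b in clusters:
            if a.label == b.label:
                continue
            shared = a.documents & b.documents
            # Assumption: arcs run from the smaller cluster to the larger one;
            # clusters of equal size get no arc in this sketch.
            if not shared or len(a.documents) >= len(b.documents):
                continue
            inc = len(shared) / len(a.documents)
            if inc >= threshold:
                arcs.append(Arc(a.label, b.label, inc, shared))
    return arcs


# Hypothetical clusters, not taken from the study's results.
clusters = [
    Cluster("ontology", frozenset({"ontology", "owl"}), frozenset({1, 2})),
    Cluster("semantic web", frozenset({"semantic web", "rdf"}), frozenset({1, 2, 3, 4})),
]
arcs = inclusion_arcs(clusters)
```

Here the only arc produced goes from "ontology" to "semantic web", because every document of the smaller cluster also belongs to the larger one.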
The graph below shows the inclusion relations among concepts.
The graph is a set of concepts (colored and labeled nodes) and a set of inclusion relations (arrowheads) among them; the colors indicate each concept's category (see graph at left side). Inclusion is an asymmetric relation: (ij) ≠ (ji).
Dataset: 330 scientific articles published in 2004, indexed by 809 keywords (stored in the PASCAL database).
Objective: organize the dataset into clusters and then into categories in order to identify (a) the main areas of research on the Semantic Web and (b) their network order (by inclusion).
Clustering: the co-word analysis program Sdoc was applied, using the inclusion index, which can be selected in Stanalyst. Sdoc executes a single-link hierarchical agglomerative method.
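To make the single-link idea concrete: cutting a single-link hierarchy at a given similarity level yields the connected components of the graph that keeps only the pairs at or above that level. The sketch below uses this equivalence with a small union-find structure. It is not Sdoc's implementation; the function name, the threshold, and the toy similarities are hypothetical.

```python
def single_link_clusters(similarities, threshold):
    """Group items joined by a chain of pairs with similarity >= threshold."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (i, j), value in similarities.items():
        root_i, root_j = find(i), find(j)  # also registers both items
        if value >= threshold and root_i != root_j:
            parent[root_i] = root_j

    groups = {}
    for item in list(parent):
        groups.setdefault(find(item), set()).add(item)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical similarity values between four keywords.
similarities = {("a", "b"): 0.9, ("b", "c"): 0.6, ("c", "d"): 0.2}
clusters = single_link_clusters(similarities, threshold=0.5)
```

With these values, "a", "b" and "c" end up in one cluster through the chain a-b-c, while "d" stays alone because its only link falls below the threshold.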
Categorization: it was done by hand, applying a "grep" technique to the file of keywords ordered by occurrence (frequency).
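The study did this step by hand. As a sketch only, a grep-like pass over the keyword file could look as follows. The category names, the patterns, and the tab-separated "frequency, keyword" line format are all assumptions for illustration; the page does not list the actual categories or the file layout.

```python
import re

# Hypothetical category patterns, not the categories used in the study.
CATEGORY_PATTERNS = {
    "ontology": re.compile(r"ontolog", re.IGNORECASE),
    "web services": re.compile(r"\bservice", re.IGNORECASE),
}


def categorize(keyword_lines):
    """Map each category to the keywords whose text matches its pattern."""
    matches = {name: [] for name in CATEGORY_PATTERNS}
    for line in keyword_lines:
        # Assumed line format: "<frequency>\t<keyword>".
        _, _, keyword = line.strip().partition("\t")
        for name, pattern in CATEGORY_PATTERNS.items():
            if pattern.search(keyword):
                matches[name].append(keyword)
    return matches


# Hypothetical lines, already ordered by decreasing frequency.
lines = ["42\tOntology", "17\tWeb service", "9\tOntology mapping", "3\tXML"]
result = categorize(lines)
```

Keywords that match no pattern, such as "XML" here, are left uncategorized and would still need a manual decision.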
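The inclusion index is defined as Inc(ij) = (ij) / min(i, j), where (ij) is the number of documents in which keywords i and j co-occur and (i), (j) are their individual occurrence counts. It can be interpreted as a conditional probability: if (i) > (j), the index measures the probability of finding (i) given (j).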
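A minimal sketch of this computation, assuming each document is given as a set of keywords, with (i) counted as the number of documents indexed by keyword i and (ij) as the number indexed by both. The function name and the toy corpus are hypothetical.

```python
from collections import Counter
from itertools import combinations


def inclusion_index(docs):
    """Return {(i, j): Inc(ij)} for every keyword pair that co-occurs."""
    occurrences = Counter()    # (i): documents indexed by keyword i
    cooccurrences = Counter()  # (ij): documents indexed by both i and j
    for keywords in docs:
        unique = sorted(set(keywords))
        occurrences.update(unique)
        cooccurrences.update(combinations(unique, 2))
    return {
        (i, j): n_ij / min(occurrences[i], occurrences[j])
        for (i, j), n_ij in cooccurrences.items()
    }


# Hypothetical toy corpus, not taken from the PASCAL dataset.
docs = [
    {"ontology", "semantic web"},
    {"ontology", "semantic web", "web service"},
    {"semantic web", "web service"},
    {"semantic web"},
]
inc = inclusion_index(docs)
```

In this toy corpus "ontology" occurs in 2 documents, always together with "semantic web" (4 documents), so Inc = 2 / min(2, 4) = 1: finding "semantic web" given "ontology" is certain. "ontology" and "web service" co-occur once, giving 1 / min(2, 2) = 0.5.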
16/02/2008

