The primary purpose is to develop an approach that combines clustering (unsupervised classification) and categorization (supervised classification) on the basis of a graph representation.

 

This study analyses the scientific articles on the Semantic Web published in 2004 and stored in the PASCAL database. It deals with the application of two types of graphs. The first is the undirected valued graph (explained in Polanco 2007, see Graph), in which the edges are valued by an association coefficient (see proposition 6 in Hypergraph and Graph Clustering).

The other type, called the directed valued graph, is presented on this page. Two levels are considered: categories and concepts. Categories are classes of concepts, and concepts are clusters. Cluster analysis detects the concepts, which are then categorized by a classification method.

The directed valued graph is called the "graph of inclusion" and is denoted G(V, A, Inc). V = the set of vertices; each vertex is a cluster, i.e. a subset of keywords (representing a concept) together with the subset of documents indexed by those keywords. A = the set of arcs (from/to) between cluster nodes; each arc is valued by an inclusion coefficient and has a support, a subset of documents. Inc = the inclusion value of the arcs.
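The definition of G(V, A, Inc) can be sketched as a small data structure. This is only a minimal illustration, not the representation used in the study: the class names `Cluster` and `InclusionGraph`, the example labels, the document identifiers and the inclusion value are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Cluster:
    """A vertex of V: a concept, i.e. a subset of keywords plus the documents they index."""
    label: str
    keywords: frozenset
    documents: frozenset


@dataclass
class InclusionGraph:
    """Dictionary-based sketch of a directed valued graph G(V, A, Inc)."""
    vertices: dict = field(default_factory=dict)  # label -> Cluster
    arcs: dict = field(default_factory=dict)      # (source, target) -> (inc, support)

    def add_cluster(self, cluster):
        self.vertices[cluster.label] = cluster

    def add_arc(self, source, target, inc, support):
        # Arcs are directed: (source, target) and (target, source) are distinct keys.
        self.arcs[(source, target)] = (inc, frozenset(support))


# Hypothetical example data, for illustration only.
g = InclusionGraph()
g.add_cluster(Cluster("ontology", frozenset({"ontology", "owl"}), frozenset({1, 2, 3})))
g.add_cluster(Cluster("web services", frozenset({"web service"}), frozenset({2, 3})))
g.add_arc("web services", "ontology", inc=1.0, support={2, 3})
```

Keying the arcs by an ordered pair keeps the two directions of a relation separate, which is what distinguishes this graph from the undirected valued graph.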

The graph below shows the inclusion relations among concepts.

The graph consists of a set of concepts (colored and labeled nodes) and a set of inclusion relations (arrows) among them. The colors indicate each concept's category (see the graph on the left). Inclusion is an asymmetric relation: (ij) ≠ (ji).

Dataset: 330 scientific articles published in 2004, indexed by 809 keywords (stored in the PASCAL database).

Objective: organize the dataset into clusters and then into categories, in order to identify (a) the main areas of research on the Semantic Web and (b) their network order (by inclusion).

Clustering: the co-word analysis program Sdoc was applied, using the inclusion index, which can be selected in Stanalyst. Sdoc executes a single-link hierarchical agglomerative method.
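The general idea of a single-link agglomerative method can be sketched as follows. This is not Sdoc's code: the function `single_link`, the stopping threshold, and the similarity values in the example are assumptions made for illustration.

```python
from itertools import combinations


def single_link(items, sim, threshold):
    """Single-link agglomerative clustering on a similarity function.

    At each step, merge the two clusters whose closest members are the
    most similar; stop when the best link falls below `threshold`.
    """
    clusters = [frozenset([x]) for x in items]
    while len(clusters) > 1:
        best_pair, best_sim = None, threshold
        for a, b in combinations(clusters, 2):
            # Single link: similarity of the closest pair of members.
            link = max(sim(x, y) for x in a for y in b)
            if link >= best_sim:
                best_pair, best_sim = (a, b), link
        if best_pair is None:
            break  # no remaining link reaches the threshold
        a, b = best_pair
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters


# Hypothetical similarity values between keywords (e.g. inclusion index values).
pairs = {
    frozenset({"rdf", "owl"}): 0.9,
    frozenset({"owl", "ontology"}): 0.7,
    frozenset({"agent", "web service"}): 0.6,
}


def sim(x, y):
    return pairs.get(frozenset({x, y}), 0.0)


keywords = ["rdf", "owl", "ontology", "agent", "web service"]
result = single_link(keywords, sim, threshold=0.5)
```

With these values, "ontology" joins the cluster of "rdf" and "owl" through its link to "owl" alone, which is the chaining behaviour typical of single-link methods.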

Categorization: this was done by hand, applying a "grep" technique to the file of keywords ordered by occurrence (or frequency).
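A grep-like pass over the keyword file can be approximated with regular expressions. The sketch below only illustrates the principle; the keyword list, the occurrence counts, the category names and the patterns are all hypothetical, and in the study the categorization itself was a manual decision.

```python
import re

# Hypothetical keywords as (keyword, occurrences), ordered by decreasing frequency.
keywords = [
    ("ontology", 95),
    ("web service", 60),
    ("ontology language", 22),
    ("semantic web service", 18),
    ("information retrieval", 12),
]

# Hypothetical categories, each described by a grep-like pattern.
categories = {
    "Ontologies": re.compile(r"ontolog", re.IGNORECASE),
    "Services": re.compile(r"service", re.IGNORECASE),
}


def categorize(keywords, categories):
    """Assign each keyword to every category whose pattern it matches."""
    assigned = {name: [] for name in categories}
    unassigned = []
    for keyword, _count in keywords:
        hits = [name for name, pattern in categories.items() if pattern.search(keyword)]
        if hits:
            for name in hits:
                assigned[name].append(keyword)
        else:
            unassigned.append(keyword)
    return assigned, unassigned


assigned, unassigned = categorize(keywords, categories)
```

Keywords that match no pattern are collected separately, so that they can be reviewed by hand.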

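The inclusion index is defined as Inc(ij) = (ij) / min(i, j), where (ij) is the number of co-occurrences of keywords i and j, and (i) and (j) are their individual occurrence counts. It can be interpreted as a conditional probability: if (i) > (j), the index measures the probability of finding (i) given (j).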
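As a worked illustration of the formula, the following sketch computes the index from three counts. The function name `inclusion_index` and the counts in the example are hypothetical.

```python
def inclusion_index(n_ij, n_i, n_j):
    """Inc(ij) = (ij) / min(i, j), computed from occurrence counts."""
    smallest = min(n_i, n_j)
    if smallest == 0:
        raise ValueError("both keywords must occur at least once")
    return n_ij / smallest


# Hypothetical counts: i occurs 40 times, j occurs 10 times, and they co-occur 8 times.
# Since (i) > (j), the value reads as the probability of finding i given j: 8 / 10.
inc = inclusion_index(n_ij=8, n_i=40, n_j=10)
```

Dividing by the smaller count means that the index reaches 1 when the less frequent keyword never appears without the more frequent one.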

 

(The unpublished full text will soon be available here in PDF version.)

 

16/02/2008