Analysis of category co-occurrence in Wikipedia networks
Wikipedia has seen a huge expansion of content since its inception. Pages within this online
encyclopedia are organised by assigning them to one or more categories, where Wikipedia
maintains a manually constructed taxonomy graph that encodes the semantic relationship
between these categories. An alternative, called the category co-occurrence graph, can be
produced automatically by linking together categories that have pages in common. Properties
of the latter graph and its relationship to the former are the concern of this thesis.
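The construction of the category co-occurrence graph described above can be illustrated with a minimal sketch. It assumes the input is a mapping from page titles to category lists; the function name `cooccurrence_graph` and the toy data are hypothetical and are not taken from the thesis.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_graph(page_categories):
    """Map each unordered category pair to the number of pages they share."""
    weights = Counter()
    for categories in page_categories.values():
        # Every page adds one unit of weight to each pair of its categories.
        for a, b in combinations(sorted(set(categories)), 2):
            weights[(a, b)] += 1
    return weights


# Hypothetical toy data, not real Wikipedia content.
pages = {
    "Page A": ["Physics", "Science"],
    "Page B": ["Physics", "Science", "Energy"],
    "Page C": ["Energy", "Science"],
}
graph = cooccurrence_graph(pages)
for pair, shared in sorted(graph.items()):
    print(pair, shared)
```

Each edge weight is the number of pages the two categories have in common, which is the quantity later compared against the threshold t.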
An analytic framework, called t-component, is introduced to formalise the graphs and
discover category clusters connecting relevant categories together. The m-core, a cohesive
subgroup concept used here as a clustering model, constructs a subgraph from the category
pairs whose number of shared pages exceeds a given threshold t. The significance
of the clustering result of the m-core is validated using a permutation test. This is compared
to the k-core, another clustering model.
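The thresholding step can be sketched as follows. This is an illustrative reading only: it assumes an edge is kept when its shared-page count is at least t (the thesis may use a strict inequality) and that clusters are the connected components of the resulting subgraph. The function name `threshold_clusters` and the toy weights are hypothetical.

```python
from collections import defaultdict


def threshold_clusters(weights, t):
    """Connected components of the subgraph keeping edges with weight >= t."""
    adjacency = defaultdict(set)
    for (a, b), w in weights.items():
        if w >= t:  # assumption: "at least t"; use w > t for a strict threshold
            adjacency[a].add(b)
            adjacency[b].add(a)

    seen, clusters = set(), []
    for start in adjacency:
        if start in seen:
            continue
        # Depth-first search collects one component at a time.
        seen.add(start)
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        clusters.append(component)
    return clusters


# Hypothetical toy weights (shared-page counts between category pairs).
weights = {
    ("Physics", "Science"): 2,
    ("Energy", "Science"): 2,
    ("Energy", "Physics"): 1,
    ("Art", "Music"): 3,
}
clusters_t2 = sorted(sorted(c) for c in threshold_clusters(weights, 2))
clusters_t3 = sorted(sorted(c) for c in threshold_clusters(weights, 3))
```

Categories with no edge at or above the threshold drop out of the subgraph entirely, so raising t both splits and removes clusters.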
The Wikipedia category co-occurrence graphs are scale-free with a few category hubs and
the majority of clusters are of size 2. All observed properties for the distribution of the largest
clusters of the category graphs obey power laws with decay exponents averaging around 1.
As the threshold t of the number of shared pages is increased, eventually a critical threshold
is reached when the largest cluster shrinks significantly in size. This phenomenon is
exhibited only by the m-core and not by the k-core. Lastly, the clustering in the category graph
is shown to be consistent with the distance between categories in the taxonomy graph.
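The critical-threshold behaviour can be illustrated by sweeping t and recording the size of the largest cluster at each value. The sketch below uses a small union-find over edges with weight at least t (an assumed reading of the threshold). The function name `largest_cluster_size` and the toy chain of categories are hypothetical and do not reproduce any result from the thesis.

```python
def largest_cluster_size(weights, t):
    """Size of the largest connected component among edges with weight >= t."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), w in weights.items():
        if w >= t:
            root_a, root_b = find(a), find(b)
            parent[root_a] = root_b

    sizes = {}
    for node in list(parent):
        root = find(node)
        sizes[root] = sizes.get(root, 0) + 1
    return max(sizes.values(), default=0)


# Hypothetical chain of six categories joined by one weak link (C-D).
weights = {
    ("A", "B"): 5, ("B", "C"): 5,
    ("C", "D"): 2,
    ("D", "E"): 5, ("E", "F"): 5,
}
sweep = {t: largest_cluster_size(weights, t) for t in range(1, 7)}
for t, size in sweep.items():
    print(t, size)
```

In this toy graph the largest cluster halves once t passes the weight of the single weak link, a small-scale analogue of the sudden shrinkage described above.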