Extracting tag hierarchies
Tagging items with descriptive annotations or keywords is a very natural way
to compress and highlight information about the properties of the given entity.
Over the years several methods have been proposed for extracting a hierarchy
between the tags for systems with a "flat", egalitarian organization of the
tags, which is very common when the tags correspond to free words given by
numerous independent people. Here we present a complete framework for automated
tag hierarchy extraction based on tag occurrence statistics. Along with
proposing new algorithms, we also introduce different quality measures
enabling the detailed comparison of competing approaches from different
aspects. Furthermore, we set up a synthetic, computer-generated benchmark
providing a versatile tool for testing, with a couple of tunable parameters
capable of generating a wide range of test beds. Besides the computer-generated
input we also use real data in our studies, including a biological example with
a pre-defined hierarchy between the tags. The encouraging similarity between
the pre-defined and reconstructed hierarchy, as well as the seemingly
meaningful hierarchies obtained for other real systems indicate that tag
hierarchy extraction is a very promising direction for further research with a
great potential for practical applications.
Comment: 25 pages with 21 pages of supporting information, 25 figures
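As a rough illustration of what extracting a hierarchy from tag occurrence statistics can look like, here is a minimal sketch. It is not the algorithm proposed in the paper (the abstract does not describe it); it uses a simple, assumed heuristic in which each tag's parent is the strictly more frequent tag it co-occurs with most often, with ties broken in favour of the less frequent (more specific) candidate. The function name `extract_hierarchy` and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def extract_hierarchy(tagged_items):
    """Return {tag: parent_or_None} from an iterable of tag sets.

    Assumed heuristic (not the paper's method): the parent of a tag is the
    strictly more frequent tag with which it co-occurs on the most items.
    """
    freq = Counter()  # number of items carrying each tag
    cooc = Counter()  # number of items carrying both tags of a sorted pair
    for tags in tagged_items:
        tags = set(tags)
        freq.update(tags)
        for a, b in combinations(sorted(tags), 2):
            cooc[(a, b)] += 1

    def pair_count(a, b):
        return cooc[(a, b)] if a < b else cooc[(b, a)]

    parent = {}
    for tag in freq:
        # Candidate parents: strictly more frequent tags that co-occur with `tag`.
        candidates = [
            other for other in freq
            if freq[other] > freq[tag] and pair_count(tag, other) > 0
        ]
        # Most shared items first; on ties prefer the less frequent candidate.
        parent[tag] = max(
            candidates,
            key=lambda other: (pair_count(tag, other), -freq[other], other),
            default=None,
        )
    return parent


items = [
    {"animal", "mammal", "dog"},
    {"animal", "mammal", "cat"},
    {"animal", "mammal", "dog"},
    {"animal", "bird"},
    {"animal", "bird", "sparrow"},
]
hierarchy = extract_hierarchy(items)
```

Because a parent must be strictly more frequent than its child, the result cannot contain cycles; tags with no qualifying candidate become roots.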
Statistical mechanics of ontology based annotations
We present a statistical mechanical theory of the process of annotating an
object with terms selected from an ontology. The term selection process is
formulated as an ideal lattice gas model, but in a highly structured
inhomogeneous field. The model enables us to explain patterns recently observed
in real-world annotation data sets, in terms of the underlying graph structure
of the ontology. By relating the external field strengths to the information
content of each node in the ontology graph, the statistical mechanical model
also allows us to propose a number of practical metrics for assessing the
quality of both the ontology and the annotations that arise from its use.
Using the statistical mechanical formalism we also study an ensemble of
ontologies of differing size and complexity; an analysis not readily performed
using real data alone. Focusing on regular tree ontology graphs we uncover a
rich set of scaling laws describing the growth in the optimal ontology size as
the number of objects being annotated increases. In doing so we provide a
further possible measure for assessment of ontologies.
Comment: 27 pages, 5 figures
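The abstract refers to the information content of each node in the ontology graph without defining it. A minimal sketch, assuming the common corpus-based definition (the negative log of the fraction of objects annotated with a term or any of its descendants) and a tree-shaped ontology, might look as follows. The function name, the `parent` mapping and the toy annotations are hypothetical, and the lattice gas model itself is not reproduced here.

```python
import math


def information_content(parent, annotations):
    """Return {term: -log2(p)}, where p is the fraction of objects annotated
    with the term or any of its descendants (tree ontology assumed)."""
    covered = {term: set() for term in parent}
    for obj, terms in annotations.items():
        for term in terms:
            # Propagate the annotation from the term up to the root.
            node = term
            while node is not None:
                covered[node].add(obj)
                node = parent[node]
    n_objects = len(annotations)
    return {
        term: -math.log2(len(objs) / n_objects)
        for term, objs in covered.items()
        if objs  # unused terms are skipped: their information content is undefined
    }


# Hypothetical four-term ontology and four annotated objects.
parent = {"root": None, "A": "root", "B": "root", "A1": "A"}
annotations = {"x1": {"A1"}, "x2": {"A"}, "x3": {"B"}, "x4": {"A1", "B"}}
ic = information_content(parent, annotations)
```

Under this definition the root always has zero information content, and terms deeper in the tree can only be as informative as their ancestors or more so.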
Comparing the hierarchy of author given tags and repository given tags in a large document archive
Folksonomies - large databases arising from collaborative tagging of items by
independent users - are becoming an increasingly important way of categorizing
information. In these systems users can tag items with free words, resulting in
a tripartite item-tag-user network. Although there are no prescribed relations
between tags, the way users think about the different categories presumably has
some built-in hierarchy, in which more specific concepts are descendants of some
more general categories. Several applications would benefit from the knowledge
of this hierarchy. Here we apply a recent method to check the differences and
similarities of hierarchies resulting from tags given by independent
individuals and from tags given by a centrally managed repository system. The
results from our method showed substantial differences between the lower part
of the hierarchies, and in contrast, a relatively high similarity at the top of
the hierarchies.
Comment: 10 pages
Comparing the hierarchy of keywords in on-line news portals
The tagging of on-line content with informative keywords is a widespread
phenomenon from scientific article repositories through blogs to on-line news
portals. In most of the cases, the tags on a given item are free words chosen
by the authors independently. Therefore, relations among keywords in a
collection of news items are unknown. However, in most cases the topics and
concepts described by these keywords form a latent hierarchy, with the
more general topics and categories at the top, and more specialised ones at the
bottom. Here we apply a recent, cooccurrence-based tag hierarchy extraction
method to sets of keywords obtained from four different on-line news portals.
The resulting hierarchies show substantial differences not just in the topics
rendered as important (being at the top of the hierarchy) or of less interest
(categorised low in the hierarchy), but also in the underlying network
structure. This reveals discrepancies between the plausible keyword association
frameworks in the studied news portals.
Clustering of tag-induced sub-graphs in complex networks
We study the behavior of the clustering coefficient in tagged networks. The
rich variety of tags associated with the nodes in the studied systems provides
additional information about the entities represented by the nodes which can be
important for practical applications like searching in the networks. Here we
examine how the clustering coefficient changes when narrowing the network to a
sub-graph marked by a given tag, and how it correlates with various other
properties of the sub-graph. Another interesting question addressed in the
paper is how the clustering coefficient of the individual nodes is affected by
the tags on the node. We believe this sort of analysis helps in acquiring a more
complete description of the structure of large complex systems.
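To make the idea of narrowing a network to a tag-induced sub-graph concrete, the following sketch computes the average local clustering coefficient before and after restricting an undirected graph to the nodes carrying a given tag. The graph, the tag assignment and the helper names are hypothetical, and the averaging convention (nodes of degree below 2 contribute 0) is an assumption that may differ from the paper's.

```python
def average_clustering(adj):
    """Average local clustering coefficient of an undirected graph given as
    {node: set_of_neighbours}; nodes of degree < 2 contribute 0."""
    if not adj:
        return 0.0
    total = 0.0
    for node, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue
        # Links among the neighbours; each is seen from both ends, hence // 2.
        links = sum(len(adj[u] & nbrs) for u in nbrs) // 2
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


def tag_subgraph(adj, tags, tag):
    """Sub-graph induced by the nodes that carry `tag`."""
    keep = {n for n in adj if tag in tags.get(n, ())}
    return {n: adj[n] & keep for n in keep}


# Hypothetical network: a triangle 1-2-3 with a tail 3-4-5.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}
tags = {1: {"x"}, 2: {"x"}, 3: {"x", "y"}, 4: {"y"}, 5: {"y"}}

c_full = average_clustering(adj)
c_x = average_clustering(tag_subgraph(adj, tags, "x"))
c_y = average_clustering(tag_subgraph(adj, tags, "y"))
```

In this toy case the sub-graph for tag "x" is the triangle and is fully clustered, while the sub-graph for tag "y" is a path with no triangles, so the same network yields very different coefficients depending on the tag.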
Information-theoretic inference of common ancestors
A directed acyclic graph (DAG) partially represents the conditional
independence structure among observations of a system if the local Markov
condition holds, that is, if every variable is independent of its
non-descendants given its parents. In general, there is a whole class of DAGs
that represents a given set of conditional independence relations. We are
interested in properties of this class that can be derived from observations of
a subsystem only. To this end, we prove an information theoretic inequality
that allows for the inference of common ancestors of observed parts in any DAG
representing some unknown larger system. More explicitly, we show that a large
amount of dependence in terms of mutual information among the observations
implies the existence of a common ancestor that distributes this information.
Within the causal interpretation of DAGs our result can be seen as a
quantitative extension of Reichenbach's Principle of Common Cause to more than
two variables. Our conclusions are valid also for non-probabilistic
observations such as binary strings, since we state the proof for an
axiomatized notion of mutual information that includes the stochastic as well
as the algorithmic version.
Comment: 18 pages, 4 figures
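The abstract does not state the inequality itself, so it is not reproduced here. As a hedged illustration of the kind of quantity involved, the sketch below estimates one common multivariate generalization of mutual information (the sum of marginal entropies minus the joint entropy, often called multi-information or total correlation) from discrete samples. Whether this matches the paper's exact measure is an assumption, and the function names are hypothetical.

```python
import math
from collections import Counter
from itertools import product


def entropy(samples):
    """Empirical Shannon entropy (in bits) of a list of hashable outcomes."""
    n = len(samples)
    return -sum(c / n * math.log2(c / n) for c in Counter(samples).values())


def multi_information(rows):
    """Sum of marginal entropies minus the joint entropy, for equal-length rows."""
    columns = list(zip(*rows))
    joint = entropy([tuple(r) for r in rows])
    return sum(entropy(list(col)) for col in columns) - joint


# Three observed variables that all copy one hidden fair bit: strong dependence.
copies = [(0, 0, 0), (1, 1, 1)]
# Three independent fair bits: every combination appears equally often.
independent = list(product([0, 1], repeat=3))

mi_copies = multi_information(copies)
mi_independent = multi_information(independent)
```

The copied-bit case, where a single shared source drives all three observations, gives a large value, while the independent case gives zero; this is the qualitative contrast that the common-ancestor result makes quantitative.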