Measuring Similarity in Large-Scale Folksonomies
Social (or folksonomic) tagging has become a very popular way to describe
content within Web 2.0 websites. Unlike taxonomies, which overimpose a
hierarchical categorisation of content, folksonomies enable end-users to freely
create and choose the categories (in this case, tags) that best describe some
content. However, as tags are informally defined, continually changing, and
ungoverned, social tagging has often been criticised for lowering, rather than
increasing, the efficiency of searching, due to the number of synonyms,
homonyms, polysemy, as well as the heterogeneity of users and the noise they
introduce. To address this issue, a variety of approaches have been proposed
that recommend users what tags to use, both when labelling and when looking for
resources. As we illustrate in this paper, real world folksonomies are
characterized by power law distributions of tags, over which commonly used
similarity metrics, including the Jaccard coefficient and the cosine
similarity, fail to compute. We thus propose a novel metric, specifically
developed to capture similarity in large-scale folksonomies, that is based on a
mutual reinforcement principle: that is, two tags are deemed similar if they
have been associated to similar resources, and vice-versa two resources are
deemed similar if they have been labelled by similar tags. We offer an
efficient realisation of this similarity metric, and assess its quality
experimentally, by comparing it against cosine similarity, on three large-scale
datasets, namely Bibsonomy, MovieLens and CiteULike
Taxonomy and clustering in collaborative systems: the case of the on-line encyclopedia Wikipedia
In this paper we investigate the nature and structure of the relation between
imposed classifications and real clustering in a particular case of a
scale-free network given by the on-line encyclopedia Wikipedia. We find a
statistical similarity in the distributions of community sizes both by using
the top-down approach of the categories division present in the archive and in
the bottom-up procedure of community detection given by an algorithm based on
the spectral properties of the graph. Regardless of the statistically similar
behaviour the two methods provide a rather different division of the articles,
thereby signaling that the nature and presence of power laws is a general
feature for these systems and cannot be used as a benchmark to evaluate the
suitability of a clustering method. Comment: 5 pages, 3 figures, epl2 styl
Semantic Stability in Social Tagging Streams
One potential disadvantage of social tagging systems is that due to the lack
of a centralized vocabulary, a crowd of users may never manage to reach a
consensus on the description of resources (e.g., books, users or songs) on the
Web. Yet, previous research has provided interesting evidence that the tag
distributions of resources may become semantically stable over time as more and
more users tag them. At the same time, previous work has raised an array of new
questions such as: (i) How can we assess the semantic stability of social
tagging systems in a robust and methodical way? (ii) Does semantic
stabilization of tags vary across different social tagging systems? And
ultimately, (iii) what are the factors that can explain semantic stabilization
in such systems? In this work we tackle these questions by (i) presenting a
novel and robust method which overcomes a number of limitations in existing
methods, (ii) empirically investigating semantic stabilization processes in a
wide range of social tagging systems with distinct domains and properties and
(iii) detecting potential causes for semantic stabilization, specifically
imitation behavior, shared background knowledge and intrinsic properties of
natural language. Our results show that tagging streams which are generated by
a combination of imitation dynamics and shared background knowledge exhibit
faster and higher semantic stability than tagging streams which are generated
via imitation dynamics or natural language streams alone
Effective Retrieval of Resources in Folksonomies Using a New Tag Similarity Measure
Social (or folksonomic) tagging has become a very popular way to describe
content within Web 2.0 websites. However, as tags are informally defined,
continually changing, and ungoverned, it has often been criticised for
lowering, rather than increasing, the efficiency of searching. To address this
issue, a variety of approaches have been proposed that recommend users what
tags to use, both when labeling and when looking for resources. These
techniques work well in dense folksonomies, but they fail to do so when tag
usage exhibits a power law distribution, as often happens in real-life
folksonomies. To tackle this issue, we propose an approach that induces the
creation of a dense folksonomy, in a fully automatic and transparent way: when
users label resources, an innovative tag similarity metric is deployed, so as to
enrich the chosen tag set with related tags already present in the folksonomy.
The proposed metric, which represents the core of our approach, is based on the
mutual reinforcement principle. Our experimental evaluation proves that the
accuracy and coverage of searches guaranteed by our metric are higher than
those achieved by applying classical metrics. Comment: 6 pages, 2 figures, CIKM 2011: 20th ACM Conference on Information and
Knowledge Managemen
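The abstract does not give the formula for its metric, so the following is only a sketch of the general mutual reinforcement idea (two tags are similar if they label similar resources, and two resources are similar if they carry similar tags). It uses a SimRank-style iteration on the tag-resource bipartite graph as a stand-in. The function name, the `decay` and `iterations` parameters, and the toy data are all assumptions made for illustration, and this is not the authors' metric.

```python
from itertools import product


def mutual_reinforcement(tag_resources, decay=0.8, iterations=5):
    """SimRank-style similarity on a tag-resource bipartite graph (illustrative)."""
    tags = sorted(tag_resources)
    resources = sorted({r for rs in tag_resources.values() for r in rs})
    resource_tags = {r: {t for t in tags if r in tag_resources[t]} for r in resources}

    # Start from identity: every item is similar only to itself.
    tag_sim = {(a, b): float(a == b) for a, b in product(tags, repeat=2)}
    res_sim = {(a, b): float(a == b) for a, b in product(resources, repeat=2)}

    def update(items, neighbours, other_sim):
        # Similarity of (a, b) = decayed average similarity of their neighbours.
        new = {}
        for a, b in product(items, repeat=2):
            if a == b:
                new[(a, b)] = 1.0
                continue
            na, nb = neighbours[a], neighbours[b]
            if not na or not nb:
                new[(a, b)] = 0.0
                continue
            total = sum(other_sim[(x, y)] for x in na for y in nb)
            new[(a, b)] = decay * total / (len(na) * len(nb))
        return new

    for _ in range(iterations):
        new_tag = update(tags, tag_resources, res_sim)
        new_res = update(resources, resource_tags, tag_sim)
        tag_sim, res_sim = new_tag, new_res
    return tag_sim, res_sim


# Hypothetical data: "film" and "movie" never label the same resource,
# but both co-occur with "cinema".
toy = {"film": {"r1"}, "movie": {"r2"}, "cinema": {"r1", "r2"}, "recipe": {"r4"}}
tag_sim, _ = mutual_reinforcement(toy)
print(tag_sim[("film", "movie")], tag_sim[("film", "recipe")])
```

In this toy data "film" and "movie" share no resource, so a set-overlap metric would score them zero, whereas the iterative scheme assigns them a positive similarity through the shared neighbour "cinema". The unrelated tag "recipe" stays at zero.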
Folksonomies and clustering in the collaborative system CiteULike
We analyze CiteULike, an online collaborative tagging system where users
bookmark and annotate scientific papers. Such a system can be naturally
represented as a tripartite graph whose nodes represent papers, users and tags
connected by individual tag assignments. The semantics of tags is studied here,
in order to uncover the hidden relationships between tags. We find that the
clustering coefficient reflects the semantical patterns among tags, providing
useful ideas for designing more efficient methods of data classification
and spam detection. Comment: 9 pages, 5 figures, iop style; corrected typo
Universal Features in the Genome-level Evolution of Protein Domains
Protein domains are found on genomes with notable statistical distributions, which bear a high degree of similarity. Previous work has shown how these distributions can be accounted for by simple models, where the main ingredients are probabilities of duplication, innovation, and loss of domains. However, no one so far has addressed the issue that these distributions follow definite trends depending on protein-coding genome size only. We present a stochastic duplication/innovation model, falling in the class of so-called Chinese Restaurant Processes, able to explain this feature of the data. Using only two universal parameters, related to a minimal number of domains and to the relative weight of innovation to duplication, the model reproduces two important aspects: (a) the populations of domain classes (the sets, related to homology classes, containing realizations of the same domain in different proteins) follow common power-laws whose cutoff is dictated by genome size, and (b) the number of domain families is universal and markedly sublinear in genome size. An important ingredient of the model is that the innovation probability decreases with genome size. We propose the possibility to interpret this as a global constraint given by the cost of expanding an increasingly complex interactome. Finally, we introduce a variant of the model where the choice of a new domain relates to its occurrence in genomic data, and thus accounts for fold specificity. Both models have general quantitative agreement with data from hundreds of genomes, which indicates the coexistence of the well-known specificity of proteomes with robust self-organizing phenomena related to the basic evolutionary "moves" of duplication and innovation
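To make the duplication/innovation picture concrete, here is a minimal simulation of the standard two-parameter Chinese Restaurant Process, which is the textbook process of this class rather than the paper's exact model. The mapping of "tables" to domain classes, the function name, and the parameter values `alpha` and `theta` are assumptions chosen for illustration. In this process the probability of opening a new class (innovation) falls as the number of domains grows, and the number of classes grows sublinearly.

```python
import random


def chinese_restaurant_process(n_domains, alpha=0.5, theta=1.0, seed=0):
    """Simulate a two-parameter CRP; returns the size of each class.

    Assumes 0 <= alpha < 1 and theta > -alpha (standard parameter range).
    """
    rng = random.Random(seed)
    class_sizes = []
    for n in range(n_domains):
        k = len(class_sizes)
        # "Innovation": start a new class with probability (theta + alpha*k) / (n + theta).
        p_new = (theta + alpha * k) / (n + theta)
        if rng.random() < p_new:
            class_sizes.append(1)
        else:
            # "Duplication": join class i with weight (size_i - alpha).
            weights = [size - alpha for size in class_sizes]
            idx = rng.choices(range(k), weights=weights)[0]
            class_sizes[idx] += 1
    return class_sizes


sizes = chinese_restaurant_process(1000)
print(len(sizes), "classes for", sum(sizes), "domains")
print("largest classes:", sorted(sizes, reverse=True)[:5])
```

Running the sketch with different `n_domains` values gives a rough feel for how the class count lags behind the total number of domains.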
True scale-free networks hidden by finite size effects
We analyze about two hundred naturally occurring networks with distinct
dynamical origins to formally test whether the commonly assumed hypothesis of
an underlying scale-free structure is generally viable. This has recently been
questioned on the basis of statistical testing of the validity of power law
distributions of network degrees by contrasting real data. Specifically, we
analyze by finite-size scaling analysis the datasets of real networks to check
whether purported departures from the power law behavior are due to the
finiteness of the sample size. In this case, power laws would be recovered in
the case of progressively larger cutoffs induced by the size of the sample. We
find that a large number of the networks studied follow a finite size scaling
hypothesis without any self-tuning. This is the case of biological protein
interaction networks, technological computer and hyperlink networks, and
informational networks in general. Marked deviations appear in other cases,
especially infrastructure and transportation but also social networks. We
conclude that underlying scale invariance properties of many naturally
occurring networks are extant features often clouded by finite-size effects due
to the nature of the sample data
The taxonomic distribution of asteroids from multi-filter all-sky photometric surveys
The distribution of asteroids across the Main Belt has been studied for
decades to understand the compositional distribution and what that tells us
about the formation and evolution of our solar system. All-sky surveys now
provide orders of magnitude more data than targeted surveys. We present a
method to bias-correct the asteroid population observed in the Sloan Digital
Sky Survey (SDSS) according to size, distance, and albedo. We taxonomically
classify this dataset consistent with the Bus and Bus-DeMeo systems and present
the resulting taxonomic distribution. The dataset includes asteroids as small
as 5 km, a factor of three in diameter smaller than in previous works. Because
of the wide range of sizes in our sample, we present the distribution by
number, surface area, volume, and mass whereas previous work was exclusively by
number. While the distribution by number is a useful quantity and has been used
for decades, these additional quantities provide new insights into the
distribution of total material. We find evidence for D-types in the inner main
belt where they are unexpected according to dynamical models of implantation of
bodies from the outer solar system into the inner solar system during planetary
migration (Levison et al. 2009). We find no evidence of S-types or other
unexpected classes among Trojans and Hildas, despite a bias favoring such a
detection. Finally, we estimate for the first time the total amount of material
of each class in the inner solar system. The main belt's most massive classes
are C, B, P, V and S in decreasing order. Excluding the four most massive
asteroids, Ceres, Pallas, Vesta and Hygiea that heavily skew the values,
primitive material (C-, P-types) accounts for more than half of main-belt and
Trojan asteroids by mass, most of the remaining mass being in the S-types. All
the other classes are minor contributors to the material between Mars and
Jupiter. Comment: Accepted for publication in Icarus -- 43 pages, 15 figures, 7 table