Unsupervised Terminological Ontology Learning based on Hierarchical Topic Modeling
In this paper, we present hierarchical relation-based latent Dirichlet
allocation (hrLDA), a data-driven hierarchical topic model for extracting
terminological ontologies from a large number of heterogeneous documents. In
contrast to traditional topic models, hrLDA relies on noun phrases instead of
unigrams, considers syntax and document structures, and enriches topic
hierarchies with topic relations. Through a series of experiments, we
demonstrate the superiority of hrLDA over existing topic models, especially for
building hierarchies. Furthermore, we illustrate the robustness of hrLDA in the
settings of noisy data sets, which are likely to occur in many practical
scenarios. Our ontology evaluation results show that ontologies extracted by
hrLDA are highly competitive with ontologies created by domain experts.
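The abstract's core contrast with traditional topic models is modeling noun phrases rather than unigrams. A minimal sketch of that idea, using scikit-learn's flat LDA as a stand-in for the paper's own hierarchical model (the phrase-tokenized documents below are hypothetical, and hrLDA's syntax-aware phrase extraction is not shown):

```python
# Topic modeling over noun phrases instead of unigrams, in the spirit of
# hrLDA. Assumption: documents are already reduced to noun-phrase tokens;
# sklearn's LatentDirichletAllocation stands in for the hierarchical model.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "latent_dirichlet_allocation topic_model noun_phrase",
    "topic_model topic_hierarchy ontology_learning",
    "ontology_learning domain_expert terminological_ontology",
]
vec = CountVectorizer(token_pattern=r"\S+")   # keep whole phrases as tokens
X = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
doc_topics = lda.transform(X)   # one topic distribution per document
```

Each row of `doc_topics` is a probability distribution over the two topics; hrLDA additionally organizes such topics into a hierarchy enriched with relations.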
Interactions of cultures and top people of Wikipedia from ranking of 24 language editions
Wikipedia is a huge global repository of human knowledge that can be
leveraged to investigate intertwinements between cultures. With this aim, we
apply Markov chain and Google matrix methods to the analysis of the
hyperlink networks of 24 Wikipedia language editions, and rank all their
articles by PageRank, 2DRank and CheiRank algorithms. Using automatic
extraction of people names, we obtain the top 100 historical figures for each
edition and for each algorithm. We investigate their spatial, temporal, and
gender distributions as a function of their cultural origins. Our study
demonstrates not only a skew toward local figures, mainly recognized
only within their own cultures, but also the existence of global
historical figures appearing in a large number of editions. By determining the
birth time and place of these persons, we perform an analysis of the evolution
of such figures through 35 centuries of human history for each language, thus
recovering interactions and entanglement of cultures over time. We also obtain
the distributions of historical figures over world countries, highlighting
geographical aspects of cross-cultural links. Considering historical figures
who appear in multiple editions as interactions between cultures, we construct
a network of cultures and identify the most influential cultures according to
this network.
Comment: 32 pages, 10 figures. Submitted for publication. Supporting
information is available at http://www.quantware.ups-tlse.fr/QWLIB/topwikipeople
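The ranking machinery described above can be sketched in a few lines: PageRank is the stationary vector of the Google matrix of the hyperlink network, and CheiRank is the PageRank of the link-reversed network. A toy illustration on a hypothetical 4-node network (not Wikipedia data; 2DRank, which combines the two rankings, is omitted):

```python
import numpy as np

# Toy hyperlink adjacency: A[i, j] = 1 if page j links to page i
# (hypothetical 4-node network used for illustration only).
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 1, 0, 0],
              [0, 0, 1, 0]], dtype=float)

def google_matrix(A, alpha=0.85):
    """Column-stochastic Google matrix with damping factor alpha."""
    n = A.shape[0]
    col_sums = A.sum(axis=0)
    # Normalize columns; dangling nodes (no out-links) get a uniform column.
    S = np.where(col_sums > 0,
                 A / np.where(col_sums == 0, 1, col_sums),
                 1.0 / n)
    return alpha * S + (1 - alpha) / n

def pagerank(A, iters=100):
    """Power iteration to the stationary vector of the Google matrix."""
    G = google_matrix(A)
    v = np.full(A.shape[0], 1.0 / A.shape[0])
    for _ in range(iters):
        v = G @ v
    return v / v.sum()

pr = pagerank(A)    # PageRank: importance via incoming links
cr = pagerank(A.T)  # CheiRank: PageRank of the reversed network
```

Ranking articles by `pr` highlights highly cited pages, while `cr` highlights highly communicative ones; the study applies both to all articles of 24 language editions.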
mARC: Memory by Association and Reinforcement of Contexts
This paper introduces the memory by Association and Reinforcement of Contexts
(mARC). mARC is a novel data modeling technology rooted in the second
quantization formulation of quantum mechanics. It is an all-purpose incremental
and unsupervised data storage and retrieval system which can be applied to all
types of signal or data, structured or unstructured, textual or not. mARC can
be applied to a wide range of information classification and retrieval
problems such as e-Discovery or contextual navigation. It can also be formulated in
the artificial-life framework, a.k.a. Conway's "Game of Life". In contrast
to Conway's approach, the objects evolve in a massively multidimensional space.
In order to start evaluating the potential of mARC, we have built a mARC-based
Internet search engine demonstrator with contextual functionality. We compare
the behavior of the mARC demonstrator with Google search both in terms of
performance and relevance. In the study we find that the mARC search engine
demonstrator outperforms Google search by an order of magnitude in response
time while providing more relevant results for some classes of queries.
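mARC itself is not specified in enough detail here to sketch, but the artificial-life setting it is contrasted against is concrete. A minimal illustration of one synchronous update of Conway's Game of Life on a toroidal grid (illustration only; nothing below is part of mARC):

```python
import numpy as np

# One synchronous step of Conway's Game of Life on a wrap-around grid --
# the 2-D artificial-life framework the abstract contrasts with mARC's
# massively multidimensional objects.
def life_step(grid):
    # Count the eight neighbors of every cell via wrap-around shifts.
    neighbors = sum(np.roll(np.roll(grid, dy, 0), dx, 1)
                    for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                    if (dy, dx) != (0, 0))
    # A cell is alive next step if it has exactly 3 neighbors, or is
    # currently alive with exactly 2 neighbors.
    return ((neighbors == 3) | ((grid == 1) & (neighbors == 2))).astype(int)

blinker = np.zeros((5, 5), dtype=int)
blinker[2, 1:4] = 1            # horizontal bar of three live cells
after = life_step(blinker)     # oscillates to a vertical bar
```

In Conway's setting the objects live on this fixed 2-D lattice; the abstract's claim is that mARC's objects instead evolve in a much higher-dimensional space.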
DocTag2Vec: An Embedding Based Multi-label Learning Approach for Document Tagging
Tagging news articles or blog posts with relevant tags from a collection of
predefined ones is what we term document tagging in this work. Accurate tagging of
articles can benefit several downstream applications such as recommendation and
search. In this work, we propose a novel yet simple approach called DocTag2Vec
to accomplish this task. We substantially extend Word2Vec and Doc2Vec---two
popular models for learning distributed representation of words and documents.
In DocTag2Vec, we simultaneously learn the representation of words, documents,
and tags in a joint vector space during training, and employ simple
k-nearest neighbor search to predict tags for unseen documents. In contrast
to previous multi-label learning methods, DocTag2Vec deals directly with raw
text instead of precomputed feature vectors and, in addition, enjoys advantages
such as learning tag representations and the ability to handle newly
created tags. To demonstrate the effectiveness of our approach, we conduct
experiments on several datasets and show promising results against
state-of-the-art methods.
Comment: 10 pages
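The prediction step described above can be sketched directly: documents and tags share one vector space, and an unseen document receives the tags of its k nearest tag vectors. A toy sketch with random embeddings (the tag names and vectors are hypothetical; the joint training of DocTag2Vec is not shown):

```python
import numpy as np

# k-nearest-neighbor tag prediction in a joint document/tag embedding
# space, as in DocTag2Vec's inference step. Embeddings here are random
# placeholders standing in for trained vectors.
rng = np.random.default_rng(0)
tag_names = ["sports", "politics", "tech", "travel"]   # hypothetical tags
tag_vecs = rng.normal(size=(4, 8))
doc_vec = tag_vecs[2] + 0.01 * rng.normal(size=8)      # lies near "tech"

def knn_tags(doc_vec, tag_vecs, tag_names, k=2):
    # Cosine similarity between the document and every tag embedding.
    sims = tag_vecs @ doc_vec / (
        np.linalg.norm(tag_vecs, axis=1) * np.linalg.norm(doc_vec))
    top = np.argsort(-sims)[:k]    # indices of the k most similar tags
    return [tag_names[i] for i in top]

predicted = knn_tags(doc_vec, tag_vecs, tag_names)
```

Because tags are embedded alongside documents, a newly created tag only needs its own vector to become predictable, which is the flexibility the abstract highlights.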