1,786 research outputs found
Computational Approaches to Measuring the Similarity of Short Contexts : A Review of Applications and Methods
Measuring the similarity of short written contexts is a fundamental problem
in Natural Language Processing. This article provides a unifying framework by
which short context problems can be categorized both by their intended
application and proposed solution. The goal is to show that various problems
and methodologies that appear quite different on the surface are in fact very
closely related. The axes by which these categorizations are made include the
format of the contexts (headed versus headless), the way in which the contexts
are to be measured (first-order versus second-order similarity), and the
information used to represent the features in the contexts (micro versus macro
views). The unifying thread that binds together many short context applications
and methods is the fact that similarity decisions must be made between contexts
that share few (if any) words in common.Comment: 23 page
From Word to Sense Embeddings: A Survey on Vector Representations of Meaning
Over the past years, distributed semantic representations have proved to be
effective and flexible keepers of prior knowledge to be integrated into
downstream applications. This survey focuses on the representation of meaning.
We start from the theoretical background behind word vector space models and
highlight one of their major limitations: the meaning conflation deficiency,
which arises from representing a word with all its possible meanings as a
single vector. Then, we explain how this deficiency can be addressed through a
transition from the word level to the more fine-grained level of word senses
(in its broader acceptation) as a method for modelling unambiguous lexical
meaning. We present a comprehensive overview of the wide range of techniques in
the two main branches of sense representation, i.e., unsupervised and
knowledge-based. Finally, this survey covers the main evaluation procedures and
applications for this type of representation, and provides an analysis of four
of its important aspects: interpretability, sense granularity, adaptability to
different domains and compositionality.Comment: 46 pages, 8 figures. Published in Journal of Artificial Intelligence
Researc
Clustering documents with active learning using Wikipedia
Wikipedia has been applied as a background knowledge base to various text mining problems, but very few attempts have been made to utilize it for document clustering. In this paper we propose to exploit the semantic knowledge in Wikipedia for clustering, enabling the automatic grouping of documents with similar themes. Although clustering is intrinsically unsupervised, recent research has shown that incorporating supervision improves clustering performance, even when limited supervision is provided. The approach presented in this paper applies supervision using active learning. We first utilize Wikipedia to create a concept-based representation of a text document, with each concept associated to a Wikipedia article. We then exploit the semantic relatedness between Wikipedia concepts to find pair-wise instance-level constraints for supervised clustering, guiding clustering towards the direction indicated by the constraints. We test our approach on three standard text document datasets. Empirical results show that our basic document representation strategy yields comparable performance to previous attempts; and adding constraints improves clustering performance further by up to 20%
Similarity-Based Models of Word Cooccurrence Probabilities
In many applications of natural language processing (NLP) it is necessary to
determine the likelihood of a given word combination. For example, a speech
recognizer may need to determine which of the two word combinations ``eat a
peach'' and ``eat a beach'' is more likely. Statistical NLP methods determine
the likelihood of a word combination from its frequency in a training corpus.
However, the nature of language is such that many word combinations are
infrequent and do not occur in any given corpus. In this work we propose a
method for estimating the probability of such previously unseen word
combinations using available information on ``most similar'' words.
We describe probabilistic word association models based on distributional
word similarity, and apply them to two tasks, language modeling and pseudo-word
disambiguation. In the language modeling task, a similarity-based model is used
to improve probability estimates for unseen bigrams in a back-off language
model. The similarity-based method yields a 20% perplexity improvement in the
prediction of unseen bigrams and statistically significant reductions in
speech-recognition error.
We also compare four similarity-based estimation methods against back-off and
maximum-likelihood estimation methods on a pseudo-word sense disambiguation
task in which we controlled for both unigram and bigram frequency to avoid
giving too much weight to easy-to-disambiguate high-frequency configurations.
The similarity-based methods perform up to 40% better on this particular task.Comment: 26 pages, 5 figure
On the Impact of Entity Linking in Microblog Real-Time Filtering
Microblogging is a model of content sharing in which the temporal locality of
posts with respect to important events, either of foreseeable or unforeseeable
nature, makes applica- tions of real-time filtering of great practical
interest. We propose the use of Entity Linking (EL) in order to improve the
retrieval effectiveness, by enriching the representation of microblog posts and
filtering queries. EL is the process of recognizing in an unstructured text the
mention of relevant entities described in a knowledge base. EL of short pieces
of text is a difficult task, but it is also a scenario in which the information
EL adds to the text can have a substantial impact on the retrieval process. We
implement a start-of-the-art filtering method, based on the best systems from
the TREC Microblog track realtime adhoc retrieval and filtering tasks , and
extend it with a Wikipedia-based EL method. Results show that the use of EL
significantly improves over non-EL based versions of the filtering methods.Comment: 6 pages, 1 figure, 1 table. SAC 2015, Salamanca, Spain - April 13 -
17, 201
- ā¦