ShotgunWSD: An unsupervised algorithm for global word sense disambiguation inspired by DNA sequencing
In this paper, we present a novel unsupervised algorithm for word sense
disambiguation (WSD) at the document level. Our algorithm is inspired by a
widely-used approach in the field of genetics for whole genome sequencing,
known as the Shotgun sequencing technique. The proposed WSD algorithm is based
on three main steps. First, a brute-force WSD algorithm is applied to short
context windows (up to 10 words) selected from the document in order to
generate a short list of likely sense configurations for each window. In the
second step, these local sense configurations are assembled into longer
composite configurations based on suffix and prefix matching. The resulting
configurations are ranked by their length, and the sense of each word is chosen
based on a voting scheme that considers only the top k configurations in which
the word appears. We compare our algorithm with other state-of-the-art
unsupervised WSD algorithms and demonstrate better performance, sometimes by a
very large margin. We also show that our algorithm can yield better performance
than the Most Common Sense (MCS) baseline on one data set. Moreover, our
algorithm has a very small number of parameters, is robust to parameter tuning,
and, unlike other bio-inspired methods, it gives a deterministic solution (it
does not involve random choices).
Comment: In Proceedings of EACL 201
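The three steps described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy sense inventory `SENSES`, the gloss-overlap `score` function, and the parameter names `n`, `keep`, and `k` are assumptions made for illustration, and the windows here are 3 words long rather than the up-to-10 words mentioned in the abstract.

```python
from collections import Counter
from itertools import product

# Hypothetical toy sense inventory: word -> {sense id: gloss words}.
SENSES = {
    "bank": {"bank#river": {"river", "water", "shore"},
             "bank#money": {"money", "deposit", "loan"}},
    "deposit": {"deposit#money": {"money", "bank", "loan"},
                "deposit#geo": {"mineral", "layer", "rock"}},
    "loan": {"loan#money": {"money", "bank", "interest"}},
    "interest": {"interest#money": {"money", "loan", "rate"},
                 "interest#curiosity": {"attention", "curiosity"}},
}

def score(config):
    """Assumed relatedness: sum of pairwise gloss overlaps in a configuration."""
    glosses = [SENSES[w][s] for w, s in config]
    return sum(len(a & b) for i, a in enumerate(glosses) for b in glosses[i + 1:])

def local_configs(words, start, n, keep):
    """Step 1: brute-force every sense assignment in one window, keep the best."""
    window = words[start:start + n]
    configs = [tuple(zip(window, senses))
               for senses in product(*(SENSES[w] for w in window))]
    configs.sort(key=score, reverse=True)
    return [(start, c) for c in configs[:keep]]

def merge(a, b):
    """Join a and b when a suffix of a equals a prefix of b at the same positions."""
    (sa, ca), (sb, cb) = a, b
    overlap = sa + len(ca) - sb
    if sa < sb and 1 <= overlap < len(cb) and ca[-overlap:] == cb[:overlap]:
        return (sa, ca + cb[overlap:])
    return None

def assemble(configs):
    """Step 2: merge configurations repeatedly until nothing new appears."""
    pool = set(configs)
    while True:
        merged = set()
        for a in pool:
            for b in pool:
                m = merge(a, b)
                if m is not None and m not in pool:
                    merged.add(m)
        if not merged:
            return pool
        pool |= merged

def disambiguate(words, n=3, keep=2, k=3):
    """Step 3: rank by length, then vote among the top k covering configurations."""
    local = [c for i in range(len(words) - n + 1)
             for c in local_configs(words, i, n, keep)]
    # Ties on length are broken by score, then lexicographically (an assumption).
    ranked = sorted(assemble(local), key=lambda c: (-len(c[1]), -score(c[1]), c))
    result = []
    for pos in range(len(words)):
        covering = [(s, c) for s, c in ranked if s <= pos < s + len(c)][:k]
        votes = Counter(c[pos - s][1] for s, c in covering)
        result.append(max(sorted(votes), key=votes.get))
    return result

print(disambiguate(["bank", "deposit", "loan", "interest"]))
```

Every step is a deterministic enumeration, sort, or count, which mirrors the abstract's point that no random choices are involved.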
Applying a Naive Bayes Similarity Measure to Word Sense Disambiguation
We replace the overlap mechanism of the Lesk algorithm with a simple, general-purpose Naive Bayes model that measures many-to-many association between two sets of random variables. Even with simple probability estimates such as maximum likelihood, the model gains significant improvement over the Lesk algorithm on word sense disambiguation tasks. With additional lexical knowledge from WordNet, performance is further improved to surpass the state-of-the-art results.
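To make the contrast concrete, the sketch below compares a simplified Lesk overlap count with a Naive Bayes-style association score. The abstract does not spell out the model, so the scoring formula (log of the mean conditional probability of each context word given the gloss words), the probability floor, and the toy `GLOSSES` and `CORPUS` data are all assumptions for illustration, not the paper's formulation.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical glosses for two senses of "bank".
GLOSSES = {
    "bank#river": ["sloping", "land", "beside", "river", "water"],
    "bank#money": ["institution", "that", "accepts", "deposit", "money"],
}

# Hypothetical tiny corpus, used only to estimate co-occurrence probabilities.
CORPUS = [
    ["cash", "deposit", "money"],
    ["cash", "money", "loan"],
    ["fish", "river", "water"],
    ["boat", "river", "water"],
]

def lesk_overlap(context, gloss):
    """Simplified Lesk: number of word types shared by context and gloss."""
    return len(set(context) & set(gloss))

# Sentence-level counts for single words and for unordered word pairs.
word_count = Counter(w for sent in CORPUS for w in set(sent))
pair_count = Counter(frozenset(p) for sent in CORPUS
                     for p in combinations(sorted(set(sent)), 2))

def p_given(e, f, floor=1e-6):
    """Maximum-likelihood estimate of P(e | f), floored to avoid log(0)."""
    if word_count[f] == 0:
        return floor
    if e == f:
        return 1.0
    return max(pair_count[frozenset((e, f))] / word_count[f], floor)

def nb_score(context, gloss):
    """Assumed score: sum over context words of log mean P(e | f) over gloss words."""
    return sum(math.log(sum(p_given(e, f) for f in gloss) / len(gloss))
               for e in context)

def disambiguate(context, score):
    """Pick the sense whose gloss scores highest; ties go to the first sense id."""
    return max(sorted(GLOSSES), key=lambda s: score(context, GLOSSES[s]))

context = ["fish", "boat"]
print([lesk_overlap(context, g) for g in GLOSSES.values()])
print(disambiguate(context, nb_score))
```

In this toy setup the context shares no word with either gloss, so the overlap count cannot separate the senses, while the probabilistic score still favors the river sense through corpus co-occurrence.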
PowerAqua: fishing the semantic web
The Semantic Web (SW) offers an opportunity to develop novel, sophisticated forms of question answering (QA). Specifically, the availability of distributed semantic markup on a large scale opens the way to QA systems which can make use of such semantic information to provide precise, formally derived answers to questions. At the same time, the distributed, heterogeneous, large-scale nature of the semantic information introduces significant challenges. In this paper we describe the design of PowerAqua, a QA system that exploits semantic markup on the web to provide answers to questions posed in natural language. PowerAqua does not assume that the user has any prior information about the semantic resources. The system takes as input a natural language query and translates it into a set of logical queries, which are then answered by consulting and aggregating information derived from multiple heterogeneous semantic sources.
Distributional Measures of Semantic Distance: A Survey
The ability to mimic human notions of semantic distance has widespread
applications. Some measures rely only on raw text (distributional measures) and
some rely on knowledge sources such as WordNet. Although extensive studies have
been performed to compare WordNet-based measures with human judgment, the use
of distributional measures as proxies to estimate semantic distance has
received little attention. Even though they have traditionally performed poorly
when compared to WordNet-based measures, they lay claim to certain uniquely
attractive features, such as their applicability in resource-poor languages and
their ability to mimic both semantic similarity and semantic relatedness.
Therefore, this paper presents a detailed study of distributional measures.
Particular attention is paid to fleshing out the strengths and limitations of both
WordNet-based and distributional measures, and how distributional measures of
distance can be brought more in line with human notions of semantic distance.
We conclude with a brief discussion of recent work on hybrid measures.
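As an illustration of what a distributional measure computes from raw text alone, the sketch below builds co-occurrence vectors and compares them with cosine similarity. Cosine is only one of many possible measures; the toy `CORPUS`, the window size, and the function names are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus of tokenized sentences.
CORPUS = [
    ["the", "cat", "drinks", "milk"],
    ["the", "dog", "drinks", "water"],
    ["the", "car", "needs", "fuel"],
]

def cooccurrence_vectors(sentences, window=2):
    """Count, for each word, the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sent in sentences:
        for i, word in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][sent[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[key] * v[key] for key in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def distance(a, b, vectors):
    """Distributional distance as one minus cosine similarity."""
    return 1.0 - cosine(vectors[a], vectors[b])

vectors = cooccurrence_vectors(CORPUS)
print(distance("cat", "dog", vectors), distance("cat", "car", vectors))
```

Because "cat" and "dog" share more context words than "cat" and "car" in this corpus, the first distance comes out smaller, with no lexical resource such as WordNet involved.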