Search CORE

923 research outputs found

ShotgunWSD: An unsupervised algorithm for global word sense disambiguation inspired by DNA sequencing

Author: Butnaru Andrei M.
Hristea Florentina
Ionescu Radu Tudor
Publication venue
Publication date: 01/01/2017
Field of study

In this paper, we present a novel unsupervised algorithm for word sense disambiguation (WSD) at the document level. Our algorithm is inspired by a widely-used approach in the field of genetics for whole genome sequencing, known as the Shotgun sequencing technique. The proposed WSD algorithm is based on three main steps. First, a brute-force WSD algorithm is applied to short context windows (up to 10 words) selected from the document in order to generate a short list of likely sense configurations for each window. In the second step, these local sense configurations are assembled into longer composite configurations based on suffix and prefix matching. The resulted configurations are ranked by their length, and the sense of each word is chosen based on a voting scheme that considers only the top k configurations in which the word appears. We compare our algorithm with other state-of-the-art unsupervised WSD algorithms and demonstrate better performance, sometimes by a very large margin. We also show that our algorithm can yield better performance than the Most Common Sense (MCS) baseline on one data set. Moreover, our algorithm has a very small number of parameters, is robust to parameter tuning, and, unlike other bio-inspired methods, it gives a deterministic solution (it does not involve random choices).Comment: In Proceedings of EACL 201

arXiv.org e-Print Archive

Crossref

Applying a Naive Bayes Similarity Measure to Word Sense Disambiguation

Author: Graeme Hirst
Tong Wang
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/01/2014
Field of study

We replace the overlap mechanism of the Lesk algorithm with a simple, general-purpose Naive Bayes model that mea-sures many-to-many association between two sets of random variables. Even with simple probability estimates such as max-imum likelihood, the model gains signifi-cant improvement over the Lesk algorithm on word sense disambiguation tasks. With additional lexical knowledge from Word-Net, performance is further improved to surpass the state-of-the-art results.

CiteSeerX

Crossref

PowerAqua: fishing the semantic web

Author: A.Y. Levy
D. Mc Guinness
E. Mena
E. Rahm
F. Giunchiglia
N. Ide
P. Resnik
T. Berners-Lee
Publication venue
Publication date: 01/01/2006
Field of study

The Semantic Web (SW) offers an opportunity to develop novel, sophisticated forms of question answering (QA). Specifically, the availability of distributed semantic markup on a large scale opens the way to QA systems which can make use of such semantic information to provide precise, formally derived answers to questions. At the same time the distributed, heterogeneous, large-scale nature of the semantic information introduces significant challenges. In this paper we describe the design of a QA system, PowerAqua, designed to exploit semantic markup on the web to provide answers to questions posed in natural language. PowerAqua does not assume that the user has any prior information about the semantic resources. The system takes as input a natural language query, translates it into a set of logical queries, which are then answered by consulting and aggregating information derived from multiple heterogeneous semantic sources

CiteSeerX

Crossref

Open Research Online (The Open University)

Distributional Measures of Semantic Distance: A Survey

Author: Hirst Graeme
Mohammad Saif M.
Publication venue
Publication date: 01/01/2012
Field of study

The ability to mimic human notions of semantic distance has widespread applications. Some measures rely only on raw text (distributional measures) and some rely on knowledge sources such as WordNet. Although extensive studies have been performed to compare WordNet-based measures with human judgment, the use of distributional measures as proxies to estimate semantic distance has received little attention. Even though they have traditionally performed poorly when compared to WordNet-based measures, they lay claim to certain uniquely attractive features, such as their applicability in resource-poor languages and their ability to mimic both semantic similarity and semantic relatedness. Therefore, this paper presents a detailed study of distributional measures. Particular attention is paid to flesh out the strengths and limitations of both WordNet-based and distributional measures, and how distributional measures of distance can be brought more in line with human notions of semantic distance. We conclude with a brief discussion of recent work on hybrid measures

arXiv.org e-Print Archive

CiteSeerX