Human Associations Help to Detect Conventionalized Multiword Expressions
In this paper we show that human evidence about the conventionalization of a
phrase can be obtained by asking native speakers for their associations to the
phrase and to its component words. If the component words of a phrase have each
other as frequent associations, the phrase can be considered conventionalized.
Another type of conventionalized phrase can be revealed using two factors: low
entropy of phrase associations and low intersection between component-word and
phrase associations. The association experiments were performed for the Russian
language
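The two factors named in this abstract can be illustrated with a minimal sketch. The abstract does not say how entropy or intersection was computed, so Shannon entropy over response counts and the Jaccard coefficient over distinct responses are assumptions, and the function names are hypothetical.

```python
import math
from collections import Counter


def association_entropy(responses):
    """Shannon entropy (in bits) of the distribution of association responses."""
    counts = Counter(responses)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def association_overlap(phrase_responses, word_responses):
    """Jaccard overlap between the distinct responses to a phrase and to a word.

    Assumption: the abstract only says "intersection"; Jaccard is one way to
    normalise it.
    """
    a, b = set(phrase_responses), set(word_responses)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```

Under these assumptions, a phrase whose responses concentrate on a few associations (low entropy) and share little with the responses to its component words (low overlap) would be a candidate for the second type of conventionalized phrase.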
Taking antonymy mask off in vector space
Automatic detection of antonymy is an important task in Natural Language Processing (NLP) for Information Retrieval (IR), Ontology Learning (OL) and many other semantic applications. However, current unsupervised approaches to antonymy detection are still not fully effective because they cannot discriminate antonyms from synonyms. In this paper, we introduce APAnt, a new Average-Precision-based measure for the unsupervised discrimination of antonymy from synonymy using Distributional Semantic Models (DSMs). APAnt makes use of Average Precision to estimate the extent and salience of the intersection among the most descriptive contexts of two target words. Evaluation shows that the proposed method is able to distinguish antonyms and synonyms with high accuracy across different parts of speech, including nouns, adjectives and verbs. APAnt outperforms the vector cosine and a baseline model implementing the co-occurrence hypothesis
Using distributional similarity to organise biomedical terminology
We investigate an application of distributional similarity techniques to the problem of structural organisation of biomedical terminology. Our application domain is the relatively small GENIA corpus. Using terms that have been accurately marked up by hand within the corpus, we consider the problem of automatically determining semantic proximity. Terminological units are defined for our purposes as normalised classes of individual terms. Syntactic analysis of the corpus data is carried out using the Pro3Gres parser and provides the data required to calculate distributional similarity using a variety of different measures. Evaluation is performed against a hand-crafted gold standard for this domain in the form of the GENIA ontology. We show that distributional similarity can be used to predict semantic type with a good degree of accuracy
Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language
This article presents a measure of semantic similarity in an IS-A taxonomy
based on the notion of shared information content. Experimental evaluation
against a benchmark set of human similarity judgments demonstrates that the
measure performs better than the traditional edge-counting approach. The
article presents algorithms that take advantage of taxonomic similarity in
resolving syntactic and semantic ambiguity, along with experimental results
demonstrating their effectiveness
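A measure of this kind is commonly formulated as the information content, -log p(c), of the most informative concept that subsumes both inputs in the taxonomy. The sketch below follows that formulation. The toy taxonomy, the counts, the base-2 logarithm and all function names are assumptions for illustration and are not taken from the article.

```python
import math

# Hypothetical toy IS-A taxonomy: child -> parent (None marks the root).
PARENT = {
    "entity": None,
    "animal": "entity",
    "artifact": "entity",
    "dog": "animal",
    "cat": "animal",
    "car": "artifact",
}

# Hypothetical corpus counts of each concept's own occurrences.
COUNTS = {"entity": 0, "animal": 2, "artifact": 1, "dog": 5, "cat": 4, "car": 4}


def ancestors(concept):
    """Return the concept followed by all of its IS-A ancestors up to the root."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain


def concept_probability(concept):
    """p(c): occurrences of c or of anything it subsumes, over all occurrences."""
    total = sum(COUNTS.values())
    subsumed = sum(n for c, n in COUNTS.items() if concept in ancestors(c))
    return subsumed / total


def information_content(concept):
    """Information content of a concept in bits: -log2 p(c)."""
    return -math.log2(concept_probability(concept))


def shared_information_similarity(c1, c2):
    """Information content of the most informative common subsumer of c1 and c2."""
    common = set(ancestors(c1)) & set(ancestors(c2))
    return max(information_content(c) for c in common)
```

With these toy counts, "dog" and "cat" share the subsumer "animal" and score higher than "dog" and "car", whose only common subsumer is the root, which carries no information.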
Discovering multiword expressions
In this paper, we provide an overview of research on multiword expressions (MWEs), from a natural language processing perspective. We examine methods developed for modelling MWEs that capture some of their linguistic properties, discussing their use for MWE discovery and for idiomaticity detection. We concentrate on their collocational and contextual preferences, along with their fixedness in terms of canonical forms and their lack of word-for-word translatability. We also discuss a sample of the MWE resources that have been used in intrinsic evaluation setups for these methods
SemEval-2010 Task 17: All-words Word Sense Disambiguation on a Specific Domain
Domain portability and adaptation of NLP components and Word Sense Disambiguation systems present new challenges. The difficulties that supervised systems have in adapting might change the way we assess the strengths and weaknesses of supervised and knowledge-based WSD systems. Unfortunately, all existing evaluation datasets for specific domains are lexical-sample corpora. This task presented all-words datasets on the environment domain for WSD in four languages (Chinese, Dutch, English, Italian). 11 teams participated, with supervised and knowledge-based systems, mainly in the English dataset. The results show that in all languages the participants were able to beat the most frequent sense heuristic as estimated from general corpora. The most successful approaches used some sort of supervision in the form of hand-tagged examples from the domain
When Similarity Becomes Opposition: Synonyms and Antonyms Discrimination in DSMs
This paper analyzes the concept of opposition and describes a fully unsupervised method for its automatic discrimination from near-synonymy in Distributional Semantic Models (DSMs). The discriminating method is based on the hypothesis that, even though both near-synonyms and opposites are mostly distributionally similar, opposites differ from each other in at least one dimension of meaning, which can be assumed to be salient. This hypothesis has been implemented in APAnt, a distributional measure that evaluates the extent of the intersection among the most relevant contexts of two words (where relevance is measured as mutual dependency), and its saliency (i.e. their average rank in the mutual-dependency-sorted list of contexts). The measure, previously introduced in some pilot studies, is presented here with two variants. Evaluation shows that it outperforms three baselines in an antonym retrieval task: the vector cosine, a baseline implementing the co-occurrence hypothesis, and a random rank. This paper describes the algorithm in detail and analyzes its current limitations, suggesting that extensions may be developed for discriminating antonyms not only from near-synonyms but also from other semantic relations. During the evaluation, we have noticed that APAnt also has a particular preference for hypernyms
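The idea described in this abstract, a rank-weighted intersection of the top contexts of two words, can be sketched as follows. This is not the paper's definition of APAnt: the abstract gives no formula, so the 1/(average rank) weighting, the inversion of the overlap into an opposition score, the cut-off n and all names below are assumptions, and the scores stand in for whatever mutual-dependency measure is used.

```python
def top_contexts(scores, n):
    """Map each of the n highest-scoring contexts to its 1-based rank."""
    ranked = sorted(scores, key=scores.get, reverse=True)[:n]
    return {ctx: rank for rank, ctx in enumerate(ranked, start=1)}


def weighted_overlap(scores_a, scores_b, n=3):
    """Sum over shared top-n contexts of 1 / (average rank in the two lists)."""
    ranks_a = top_contexts(scores_a, n)
    ranks_b = top_contexts(scores_b, n)
    shared = ranks_a.keys() & ranks_b.keys()
    return sum(1.0 / ((ranks_a[c] + ranks_b[c]) / 2.0) for c in shared)


def opposition_score(scores_a, scores_b, n=3):
    """Assumed inverse of the weighted overlap: fewer salient shared contexts
    give a higher score; no shared contexts give infinity."""
    overlap = weighted_overlap(scores_a, scores_b, n)
    return float("inf") if overlap == 0 else 1.0 / overlap


# Hypothetical context relevance scores for three target words.
hot = {"weather": 3.0, "water": 2.5, "summer": 2.0, "day": 1.0}
warm = {"weather": 2.8, "water": 2.6, "summer": 1.5, "coat": 1.0}
cold = {"winter": 3.0, "weather": 2.0, "water": 1.5, "day": 1.0}
```

On this toy data, `opposition_score(hot, cold)` is larger than `opposition_score(hot, warm)`, because "hot" and "cold" share fewer of their top-ranked contexts, and share them at lower ranks, than "hot" and "warm" do.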
From Distributional to Semantic Similarity
Institute for Communicating and Collaborative Systems
Lexical-semantic resources, including thesauri and WORDNET, have been successfully incorporated
into a wide range of applications in Natural Language Processing. However, they are
very difficult and expensive to create and maintain, and their usefulness has been severely
hampered by their limited coverage, bias and inconsistency. Automated and semi-automated
methods for developing such resources are therefore crucial for further resource development
and improved application performance.
Systems that extract thesauri often identify similar words using the distributional hypothesis
that similar words appear in similar contexts. This approach involves using corpora to examine
the contexts each word appears in and then calculating the similarity between context distributions.
Different definitions of context can be used, and I begin by examining how different
types of extracted context influence similarity.
To be of most benefit these systems must be capable of finding synonyms for rare words.
Reliable context counts for rare events can only be extracted from vast collections of text. In
this dissertation I describe how to extract contexts from a corpus of over 2 billion words. I
describe techniques for processing text on this scale and examine the trade-off between context
accuracy, information content and quantity of text analysed.
Distributional similarity is at best an approximation to semantic similarity. I develop improved
approximations motivated by the intuition that some events in the context distribution are more
indicative of meaning than others. For instance, the object-of-verb context wear is far more
indicative of a clothing noun than get. However, existing distributional techniques do not
effectively utilise this information. The new context-weighted similarity metric I propose in
this dissertation significantly outperforms every distributional similarity metric described in
the literature.
Nearest-neighbour similarity algorithms scale poorly with vocabulary and context vector size.
To overcome this problem I introduce a new context-weighted approximation algorithm with
bounded complexity in context vector size that significantly reduces the system runtime with
only a minor performance penalty. I also describe a parallelized version of the system that runs
on a Beowulf cluster for the 2 billion word experiments.
To evaluate the context-weighted similarity measure I compare ranked similarity lists against
gold-standard resources using precision and recall-based measures from Information Retrieval,
since the alternative, application-based evaluation, can often be influenced by distributional
as well as semantic similarity. I also perform a detailed analysis of the final results using
WORDNET.
Finally, I apply my similarity metric to the task of assigning words to WORDNET semantic
categories. I demonstrate that this new approach outperforms existing methods and overcomes
some of their weaknesses
Crowdsourcing Multiple Choice Science Questions
We present a novel method for obtaining high-quality, domain-targeted
multiple choice questions from crowd workers. Generating these questions can be
difficult without trading away originality, relevance or diversity in the
answer options. Our method addresses these problems by leveraging a large
corpus of domain-specific text and a small set of existing questions. It
produces model suggestions for document selection and answer distractor choice
which aid the human question generation process. With this method we have
assembled SciQ, a dataset of 13.7K multiple choice science exam questions
(Dataset available at http://allenai.org/data.html). We demonstrate that the
method produces in-domain questions by providing an analysis of this new
dataset and by showing that humans cannot distinguish the crowdsourced
questions from original questions. When using SciQ as additional training data
to existing questions, we observe accuracy improvements on real science exams.
Comment: accepted for the Workshop on Noisy User-generated Text (W-NUT) 201