44 research outputs found
Adjusting Sense Representations for Word Sense Disambiguation and Automatic Pun Interpretation
Word sense disambiguation (WSD)—the task of determining which meaning a word carries in a particular context—is a core research problem in computational linguistics. Though it has long been recognized that supervised (machine learning–based) approaches to WSD can yield impressive results, they require an amount of manually annotated training data that is often too expensive or impractical to obtain. This is a particular problem for under-resourced languages and domains, and is also a hurdle in well-resourced languages when processing the sort of lexical-semantic anomalies employed for deliberate effect in humour and wordplay. In contrast to supervised systems are knowledge-based techniques, which rely only on pre-existing lexical-semantic resources (LSRs). These techniques are of more general applicability but tend to suffer from lower performance due to the informational gap between the target word's context and the sense descriptions provided by the LSR.
This dissertation is concerned with extending the efficacy and applicability of knowledge-based word sense disambiguation. First, we investigate two approaches for bridging the information gap and thereby improving the performance of knowledge-based WSD. In the first approach we supplement the word's context and the LSR's sense descriptions with entries from a distributional thesaurus. The second approach enriches an LSR's sense information by aligning it to other, complementary LSRs.
Our next main contribution is to adapt techniques from word sense disambiguation to a novel task: the interpretation of puns. Traditional NLP applications, including WSD, usually treat the source text as carrying a single meaning, and therefore cannot cope with the intentionally ambiguous constructions found in humour and wordplay. We describe how algorithms and evaluation methodologies from traditional word sense disambiguation can be adapted for the "disambiguation" of puns, or rather for the identification of their double meanings.
Finally, we cover the design and construction of technological and linguistic resources aimed at supporting the research and application of word sense disambiguation. Development and comparison of WSD systems has long been hampered by a lack of standardized data formats, language resources, software components, and workflows. To address this issue, we designed and implemented a modular, extensible framework for WSD. It implements, encapsulates, and aggregates reusable, interoperable components using UIMA, an industry-standard information processing architecture. We have also produced two large sense-annotated data sets for under-resourced languages or domains: one of these targets German-language text, and the other English-language puns
Tailored semantic annotation for semantic search
This paper presents a novel method for semantic annotation and search of a target corpus using several knowledge resources (KRs). This method relies on a formal statistical framework in which KR concepts and corpus documents are homogeneously represented using statistical language models. Under this framework, we can perform all the necessary operations for an efficient and effective semantic annotation of the corpus. Firstly, we propose a coarse tailoring of the KRs w.r.t the target corpus with the main goal of reducing the ambiguity of the annotations and their computational overhead. Then, we propose the generation of concept profiles, which allow measuring the semantic overlap of the KRs as well as performing a finer tailoring of them. Finally, we propose how to semantically represent documents and queries in terms of the KRs concepts and the statistical framework to perform semantic search. Experiments have been carried out with a corpus about web resources which includes several Life Sciences catalogs and Wikipedia pages related to web resources in general (e.g., databases, tools, services, etc.). Results demonstrate that the proposed method is more effective and efficient than state-of-the-art methods relying on either context-free annotation or keyword-based search.We thank anonymous reviewers for their very useful comments and suggestions. The work was supported by the CICYT project TIN2011-24147 from the Spanish Ministry of Economy and Competitiveness (MINECO)
New frontiers in supervised word sense disambiguation: building multilingual resources and neural models on a large scale
Word Sense Disambiguation is a long-standing task in Natural Language Processing
(NLP), lying at the core of human language understanding. While it has already
been studied from many different angles over the years, ranging from knowledge
based systems to semi-supervised and fully supervised models, the field seems to
be slowing down in respect to other NLP tasks, e.g., part-of-speech tagging and
dependencies parsing. Despite the organization of several international competitions
aimed at evaluating Word Sense Disambiguation systems, the evaluation of automatic
systems has been problematic mainly due to the lack of a reliable evaluation
framework aiming at performing a direct quantitative confrontation.
To this end we develop a unified evaluation framework and analyze the performance
of various Word Sense Disambiguation systems in a fair setup. The results
show that supervised systems clearly outperform knowledge-based models. Among
the supervised systems, a linear classifier trained on conventional local features
still proves to be a hard baseline to beat. Nonetheless, recent approaches exploiting
neural networks on unlabeled corpora achieve promising results, surpassing this
hard baseline in most test sets. Even though supervised systems tend to perform
best in terms of accuracy, they often lose ground to more flexible knowledge-based
solutions, which do not require training for every disambiguation target. To bridge
this gap we adopt a different perspective and rely on sequence learning to frame
the disambiguation problem: we propose and study in depth a series of end-to-end
neural architectures directly tailored to the task, from bidirectional Long ShortTerm
Memory to encoder-decoder models. Our extensive evaluation over standard
benchmarks and in multiple languages shows that sequence learning enables more
versatile all-words models that consistently lead to state-of-the-art results, even
against models trained with engineered features.
However, supervised systems need annotated training corpora and the few available
to date are of limited size: this is mainly due to the expensive and timeconsuming
process of annotating a wide variety of word senses at a reasonably high
scale, i.e., the so-called knowledge acquisition bottleneck. To address this issue, we
also present different strategies to acquire automatically high quality sense annotated
data in multiple languages, without any manual effort. We assess the quality of the
sense annotations both intrinsically and extrinsically achieving competitive results
on multiple tasks
Application of generic sense classes in word sense disambiguation
Ph.DDOCTOR OF PHILOSOPH
Automatic Gloss Finding for a Knowledge Base using Ontological Constraints
While there has been much research on automatically construct-ing structured Knowledge Bases (KBs), most of it has focused on generating facts to populate a KB. However, a useful KB must go beyond facts. For example, glosses (short natural language defi-nitions) have been found to be very useful in tasks such as Word Sense Disambiguation. However, the important problem of Auto-matic Gloss Finding, i.e., assigning glosses to entities in an ini-tially gloss-free KB, is relatively unexplored. We address that gap in this paper. In particular, we propose GLOFIN, a hierarchical semi-supervised learning algorithm for this problem which makes effective use of limited amounts of supervision and available onto-logical constraints. To the best of our knowledge, GLOFIN is the first system for this task. Through extensive experiments on real-world datasets, we demon-strate GLOFIN’s effectiveness. It is encouraging to see that GLOFIN outperforms other state-of-the-art SSL algorithms, especially in low supervision settings. We also demonstrate GLOFIN’s robustness to noise through experiments on a wide variety of KBs, ranging from user contributed (e.g., Freebase) to automatically constructed (e.g., NELL). To facilitate further research in this area, we have already made the datasets and code used in this paper publicly available. 1