53 research outputs found
Knowledge-based approaches to producing large-scale training data from scratch for Word Sense Disambiguation and Sense Distribution Learning
Communicating and understanding each other is one of the most important human abilities.
As humans, in fact, we can easily assign the correct meaning to the ambiguous words in a text, while, at the same time, being able to abstract, summarise and enrich its content with new information that we learned somewhere else.
On the contrary, machines rely on formal languages which do not leave space to ambiguity hence being easy to parse and understand.
Therefore, to fill the gap between humans and machines and enabling the latter to better communicate with and comprehend its sentient counterpart, in the modern era of computer-science's much effort has been put into developing Natural Language Processing (NLP) approaches which aim at understanding and handling the ambiguity of the human language.
At the core of NLP lies the task of correctly interpreting the meaning of each word in a given text, hence disambiguating its content exactly as a human would do.
Researchers in the Word Sense Disambiguation (WSD) field address exactly this issue by leveraging either knowledge bases, i.e. graphs where nodes are concept and edges are semantic relations among them, or manually-annotated datasets for training machine learning algorithms. One common obstacle is the knowledge acquisition bottleneck problem, id est, retrieving or generating semantically-annotated data which are necessary to build both semantic graphs or training sets is a complex task.
This phenomenon is even more serious when considering languages other than English where resources to generate human-annotated data are scarce and ready-made datasets are completely absent. With the advent of deep learning this issue became even more serious as more complex models need larger datasets in order to learn meaningful patterns to solve the task.
Another critical issue in WSD, as well as in other machine-learning-related fields, is the domain adaptation problem, id est, performing the same task in different application domains.
This is particularly hard when dealing with word senses, as, in fact, they are governed by a Zipfian distribution; hence, by slightly changing the application domain, a sense might become very frequent even though it is very rare in the general domain.
For example the geometric sense of plane is very frequent in a corpus made of math books, while it is very rare in a general domain dataset.
In this thesis we address both these problems. Inter alia, we focus on relieving the burden of human annotations in Word Sense Disambiguation thus enabling the automatic construction of high-quality sense-annotated dataset not only for English, but especially for other languages where sense-annotated data are not available at all.
Furthermore, recognising in word-sense distribution one of the main pitfalls for WSD approaches, we also alleviate the dependency on most frequent sense information by automatically inducing the word-sense distribution in a given text of raw sentences.
In the following we propose a language-independent and automatic approach to generating semantic annotations given a collection of sentences, and then introduce two methods for the automatic inference of word-sense distributions.
Finally, we combine the two kind of approaches to build a semantically-annotated dataset that reflect the sense distribution which we automatically infer from the target text
Recommended from our members
Using Wikipedia for Automatic Word Sense Disambiguation
This paper describes a method for generating sense-tagged data using Wikipedia as a source of sense annotations. Through word sense disambiguation experiments, the authors show that the Wikipedia-based sense annotations are reliable and can be used to construct accurate sense classifiers
The Acquisition Of Lexical Knowledge From The Web For Aspects Of Semantic Interpretation
This work investigates the effective acquisition of lexical knowledge from the Web to perform semantic interpretation. The Web provides an unprecedented amount of natural language from which to gain knowledge useful for semantic interpretation. The knowledge acquired is described as common sense knowledge, information one uses in his or her daily life to understand language and perception. Novel approaches are presented for both the acquisition of this knowledge and use of the knowledge in semantic interpretation algorithms. The goal is to increase accuracy over other automatic semantic interpretation systems, and in turn enable stronger real world applications such as machine translation, advanced Web search, sentiment analysis, and question answering. The major contributions of this dissertation consist of two methods of acquiring lexical knowledge from the Web, namely a database of common sense knowledge and Web selectors. The first method is a framework for acquiring a database of concept relationships. To acquire this knowledge, relationships between nouns are found on the Web and analyzed over WordNet using information-theory, producing information about concepts rather than ambiguous words. For the second contribution, words called Web selectors are retrieved which take the place of an instance of a target word in its local context. The selectors serve for the system to learn the types of concepts that the sense of a target word should be similar. Web selectors are acquired dynamically as part of a semantic interpretation algorithm, while the relationships in the database are useful to iii stand-alone programs. A final contribution of this dissertation concerns a novel semantic similarity measure and an evaluation of similarity and relatedness measures on tasks of concept similarity. Such tasks are useful when applying acquired knowledge to semantic interpretation. Applications to word sense disambiguation, an aspect of semantic interpretation, are used to evaluate the contributions. Disambiguation systems which utilize semantically annotated training data are considered supervised. The algorithms of this dissertation are considered minimallysupervised; they do not require training data created by humans, though they may use humancreated data sources. In the case of evaluating a database of common sense knowledge, integrating the knowledge into an existing minimally-supervised disambiguation system significantly improved results – a 20.5% error reduction. Similarly, the Web selectors disambiguation system, which acquires knowledge directly as part of the algorithm, achieved results comparable with top minimally-supervised systems, an F-score of 80.2% on a standard noun disambiguation task. This work enables the study of many subsequent related tasks for improving semantic interpretation and its application to real-world technologies. Other aspects of semantic interpretation, such as semantic role labeling could utilize the same methods presented here for word sense disambiguation. As the Web continues to grow, the capabilities of the systems in this dissertation are expected to increase. Although the Web selectors system achieves great results, a study in this dissertation shows likely improvements from acquiring more data. Furthermore, the methods for acquiring a database of common sense knowledge could be applied in a more exhaustive fashion for other types of common sense knowledge. Finally, perhaps the greatest benefits from this work will come from the enabling of real world technologies that utilize semantic interpretation
A Fast Method for Parallel Document Identification
We present a fast method to identify
homogeneous parallel documents. The
method is based on collecting counts of
identical low-frequency words between
possibly parallel documents. The candidate with the most shared low-frequency
words is selected as the parallel document.
The method achieved 99.96% accuracy
when tested on the EUROPARL corpus
of parliamentary proceedings, failing only
in anomalous cases of truncated or otherwise distorted documents. While other
work has shown similar performance on
this type of dataset, our approach presented here is faster and does not require
training. Apart from proposing an efficient method for parallel document identification in a restricted domain, this paper furnishes evidence that parliamentary
proceedings may be inappropriate for testing parallel document identification systems in general
A Fast Method for Parallel Document Identification
We present a fast method to identify
homogeneous parallel documents. The
method is based on collecting counts of
identical low-frequency words between
possibly parallel documents. The candidate with the most shared low-frequency
words is selected as the parallel document.
The method achieved 99.96% accuracy
when tested on the EUROPARL corpus
of parliamentary proceedings, failing only
in anomalous cases of truncated or otherwise distorted documents. While other
work has shown similar performance on
this type of dataset, our approach presented here is faster and does not require
training. Apart from proposing an efficient method for parallel document identification in a restricted domain, this paper furnishes evidence that parliamentary
proceedings may be inappropriate for testing parallel document identification systems in general
Word sense disambiguation : Scaling up, domain adaptation and application to machine translation
Ph.DDOCTOR OF PHILOSOPH
New frontiers in supervised word sense disambiguation: building multilingual resources and neural models on a large scale
Word Sense Disambiguation is a long-standing task in Natural Language Processing
(NLP), lying at the core of human language understanding. While it has already
been studied from many different angles over the years, ranging from knowledge
based systems to semi-supervised and fully supervised models, the field seems to
be slowing down in respect to other NLP tasks, e.g., part-of-speech tagging and
dependencies parsing. Despite the organization of several international competitions
aimed at evaluating Word Sense Disambiguation systems, the evaluation of automatic
systems has been problematic mainly due to the lack of a reliable evaluation
framework aiming at performing a direct quantitative confrontation.
To this end we develop a unified evaluation framework and analyze the performance
of various Word Sense Disambiguation systems in a fair setup. The results
show that supervised systems clearly outperform knowledge-based models. Among
the supervised systems, a linear classifier trained on conventional local features
still proves to be a hard baseline to beat. Nonetheless, recent approaches exploiting
neural networks on unlabeled corpora achieve promising results, surpassing this
hard baseline in most test sets. Even though supervised systems tend to perform
best in terms of accuracy, they often lose ground to more flexible knowledge-based
solutions, which do not require training for every disambiguation target. To bridge
this gap we adopt a different perspective and rely on sequence learning to frame
the disambiguation problem: we propose and study in depth a series of end-to-end
neural architectures directly tailored to the task, from bidirectional Long ShortTerm
Memory to encoder-decoder models. Our extensive evaluation over standard
benchmarks and in multiple languages shows that sequence learning enables more
versatile all-words models that consistently lead to state-of-the-art results, even
against models trained with engineered features.
However, supervised systems need annotated training corpora and the few available
to date are of limited size: this is mainly due to the expensive and timeconsuming
process of annotating a wide variety of word senses at a reasonably high
scale, i.e., the so-called knowledge acquisition bottleneck. To address this issue, we
also present different strategies to acquire automatically high quality sense annotated
data in multiple languages, without any manual effort. We assess the quality of the
sense annotations both intrinsically and extrinsically achieving competitive results
on multiple tasks
SUBJECTIVITY WORD SENSE DISAMBIGUATION: A METHOD FOR SENSE-AWARE SUBJECTIVITY ANALYSIS
Subjectivity lexicons have been invaluable resources in subjectivity analysis and their creation has been an important topic. Many systems rely on these lexicons. For any subjectivity analysis system, which relies on a subjectivity lexicon, subjectivity sense ambiguity is a serious problem. Such systems will be misled by the presence of subjectivity clues used with objective senses called false hits.
We believe that any type of subjectivity analysis system relying on lexicons will benefit from a sense-aware approach. We think sense-aware subjectivity analysis has been neglected mostly because of the concerns related to word sense disambiguation (WSD), the problem of automatically determining which sense of a word is activated by the use of the word in a particular context according to a sense-inventory. Although WSD is the perfect tool for sense-aware classification, trust in traditional fine-grained WSD as an enabling technology is not high due to previous mostly unsuccessful results.
In this thesis, we investigate feasible and practical methods to avoid these false hits via sense-aware analysis. We define a new coarse-grained WSD task capturing the right semantic granularity specific to subjectivity analysis
Recommended from our members
Towards Communicating Simple Sentence using Pictorial Representations
Language can sometimes be an impediment in communication. Whether we are talking about people who speak different languages, students who are learning a new language, or people with language disorders, the understanding of linguistic representations in a given language requires a certain amount of knowledge that not everybody has. In this thesis, we propose "translation through pictures" as a means for conveying simple pieces of information across language barriers, and describe a system that can automatically generate pictorial representations for simple sentences. Comparative experiments conducted on visual and linguistic representations of information show that a considerable amount of understanding can be achieved through pictorial descriptions, with results within a comparable range of those obtained with current machine translation techniques. Moreover, a user study conducted around the pictorial translation system reveals that users found the system to generally produce correct word/image associations, and rate the system as interactive and intelligent
- …