349 research outputs found
Wiktionary and NLP: Improving synonymy networks
International audienceWiktionary, a satellite of the Wikipedia initiative, can be seen as a potential resource for Natural Language Processing. It requires however to be processed before being used efficiently as an NLP resource. After describing the relevant aspects Wiktionary for our purposes, we focus on its structural properties. Then, we describe how we extracted synonymy networks from this resource. We provide an in-depth study of these synonymy networks and compare them to those extracted from traditional resources. Finally, we describe two methods for semiautomatically improving this network by adding missing relations: (i) using a kind of semantic proximity measure; (ii) using translation relations of Wiktionary itself
Proceedings of the Conference on Natural Language Processing 2010
This book contains state-of-the-art contributions to the 10th
conference on Natural Language Processing, KONVENS 2010
(Konferenz zur Verarbeitung natĂźrlicher Sprache), with a focus
on semantic processing.
The KONVENS in general aims at offering a broad perspective
on current research and developments within the interdisciplinary
field of natural language processing. The central theme
draws specific attention towards addressing linguistic aspects
ofmeaning, covering deep as well as shallow approaches to semantic
processing. The contributions address both knowledgebased
and data-driven methods for modelling and acquiring
semantic information, and discuss the role of semantic information
in applications of language technology.
The articles demonstrate the importance of semantic processing,
and present novel and creative approaches to natural
language processing in general. Some contributions put their
focus on developing and improving NLP systems for tasks like
Named Entity Recognition or Word Sense Disambiguation, or
focus on semantic knowledge acquisition and exploitation with
respect to collaboratively built ressources, or harvesting semantic
information in virtual games. Others are set within the
context of real-world applications, such as Authoring Aids, Text
Summarisation and Information Retrieval. The collection highlights
the importance of semantic processing for different areas
and applications in Natural Language Processing, and provides
the reader with an overview of current research in this field
Recommended from our members
Acquiring and Harnessing Verb Knowledge for Multilingual Natural Language Processing
Advances in representation learning have enabled natural language processing models to derive non-negligible linguistic information directly from text corpora in an unsupervised fashion. However, this signal is underused in downstream tasks, where they tend to fall back on superficial cues and heuristics to solve the problem at hand. Further progress relies on identifying and filling the gaps in linguistic knowledge captured in their parameters. The objective of this thesis is to address these challenges focusing on the issues of resource scarcity, interpretability, and lexical knowledge injection, with an emphasis on the category of verbs.
To this end, I propose a novel paradigm for efficient acquisition of lexical knowledge leveraging native speakersâ intuitions about verb meaning to support development and downstream performance of NLP models across languages. First, I investigate the potential of acquiring semantic verb classes from non-experts through manual clustering. This subsequently informs the development of a two-phase semantic dataset creation methodology, which combines semantic clustering with fine-grained semantic similarity judgments collected through spatial arrangements of lexical stimuli. The method is tested on English and then applied to a typologically diverse sample of languages to produce the first large-scale multilingual verb dataset of this kind. I demonstrate its utility as a diagnostic tool by carrying out a comprehensive evaluation of state-of-the-art NLP models, probing representation quality across languages and domains of verb meaning, and shedding light on their deficiencies. Subsequently, I directly address these shortcomings by injecting lexical knowledge into large pretrained language models. I demonstrate that external manually curated information about verbsâ lexical properties can support data-driven models in tasks where accurate verb processing is key. Moreover, I examine the potential of extending these benefits from resource-rich to resource-poor languages through translation-based transfer. The results emphasise the usefulness of human-generated lexical knowledge in supporting NLP models and suggest that time-efficient construction of lexicons similar to those developed in this work, especially in under-resourced languages, can play an important role in boosting their linguistic capacity.ESRC Doctoral Fellowship [ES/J500033/1], ERC Consolidator Grant LEXICAL [648909
Theory and Applications for Advanced Text Mining
Due to the growth of computer technologies and web technologies, we can easily collect and store large amounts of text data. We can believe that the data include useful knowledge. Text mining techniques have been studied aggressively in order to extract the knowledge from the data since late 1990s. Even if many important techniques have been developed, the text mining research field continues to expand for the needs arising from various application fields. This book is composed of 9 chapters introducing advanced text mining techniques. They are various techniques from relation extraction to under or less resourced language. I believe that this book will give new knowledge in the text mining field and help many readers open their new research fields
Generating and applying textual entailment graphs for relation extraction and email categorization
Recognizing that the meaning of one text expression is semantically related to the meaning of another can be of help in many natural language processing applications. One semantic relationship between two text expressions is captured by the textual entailment paradigm, which is defined as a relation between exactly two text expressions. Entailment relations holding among a set of more than two text expressions can be captured in the form of a hierarchical knowledge structure referred to as entailment graphs. Despite the fact that several people have worked on building entailment graphs for different types of textual expressions, little research has been carried out regarding the applicability of such entailment graphs in NLP applications. This thesis fills this research gap by investigating how entailment graphs can be generated and used for addressing two specific NLP tasks: First, the task of validating automatically derived relation extraction patterns and, second, the task of automatically categorizing German customer emails. After laying a theoretical foundation, the research problem is approached in an empirical way, i.e., by drawing conclusions from analyzing, processing, and experimenting with specific task-related datasets. The experimental results show that both tasks can benefit from the integration of semantic knowledge, as expressed by entailment graphs
A clustering approach to automatic verb classification incorporating selectional preferences: model, implementation, and user manual
This report presents two variations of an innovative, complex approach to semantic verb classes that relies on selectional preferences as verb properties. The underlying linguistic assumption for this verb class model is that verbs which agree on their selectional preferences belong to a common semantic class. The model is implemented as a soft-clustering approach, in order to capture the polysemy of the verbs. The training procedure uses the Expectation-Maximisation (EM) algorithm (Baum, 1972) to iteratively improve the probabilistic parameters of the model, and applies the Minimum Description Length (MDL) principle (Rissanen, 1978) to induce WordNet-based selectional preferences for arguments within subcategorisation frames. One variation of the MDL principle replicates a standard MDL approach by Li and Abe (1998), the other variation presents an improved pruning strategy that outperforms the standard implementation considerably. Our model is potentially useful for lexical induction (e.g., verb senses, subcategorisation and selectional preferences, collocations, and verb alternations), and for NLP applications in sparse data situations. We demonstrate the usefulness of the model by a standard evaluation (pseudo-word disambiguation), and three applications (selectional preference induction, verb sense disambiguation, and semi-supervised sense labelling)
Exploratory Search on Mobile Devices
The goal of this thesis is to provide a general framework (MobEx) for exploratory search especially on mobile devices. The central part is the design, implementation, and evaluation of several core modules for on-demand unsupervised information extraction well suited for exploratory search on mobile devices and creating the MobEx framework. These core processing elements, combined with a multitouch - able user interface specially designed for two families of mobile devices, i.e. smartphones and tablets, have been finally implemented in a research prototype. The initial information request, in form of a query topic description, is issued online by a user to the system. The system then retrieves web snippets by using standard search engines. These snippets are passed through a chain of NLP components which perform an ondemand or ad-hoc interactive Query Disambiguation, Named Entity Recognition, and Relation Extraction task. By on-demand or ad-hoc we mean the components are capable to perform their operations on an unrestricted open domain within special time constraints. The result of the whole process is a topic graph containing the detected associated topics as nodes and the extracted relation ships as labelled edges between the nodes. The Topic Graph is presented to the user in different ways depending on the size of the device she is using. Various evaluations have been conducted that help us to understand the potentials and limitations of the framework and the prototype
From Distributional to Semantic Similarity
Institute for Communicating and Collaborative SystemsLexical-semantic resources, including thesauri and WORDNET, have been successfully incorporated
into a wide range of applications in Natural Language Processing. However they are
very difficult and expensive to create and maintain, and their usefulness has been severely
hampered by their limited coverage, bias and inconsistency. Automated and semi-automated
methods for developing such resources are therefore crucial for further resource development
and improved application performance.
Systems that extract thesauri often identify similar words using the distributional hypothesis
that similar words appear in similar contexts. This approach involves using corpora to examine
the contexts each word appears in and then calculating the similarity between context distributions.
Different definitions of context can be used, and I begin by examining how different
types of extracted context influence similarity.
To be of most benefit these systems must be capable of finding synonyms for rare words.
Reliable context counts for rare events can only be extracted from vast collections of text. In
this dissertation I describe how to extract contexts from a corpus of over 2 billion words. I
describe techniques for processing text on this scale and examine the trade-off between context
accuracy, information content and quantity of text analysed.
Distributional similarity is at best an approximation to semantic similarity. I develop improved
approximations motivated by the intuition that some events in the context distribution are more
indicative of meaning than others. For instance, the object-of-verb context wear is far more
indicative of a clothing noun than get. However, existing distributional techniques do not
effectively utilise this information. The new context-weighted similarity metric I propose in
this dissertation significantly outperforms every distributional similarity metric described in
the literature.
Nearest-neighbour similarity algorithms scale poorly with vocabulary and context vector size.
To overcome this problem I introduce a new context-weighted approximation algorithm with
bounded complexity in context vector size that significantly reduces the system runtime with
only a minor performance penalty. I also describe a parallelized version of the system that runs
on a Beowulf cluster for the 2 billion word experiments.
To evaluate the context-weighted similarity measure I compare ranked similarity lists against
gold-standard resources using precision and recall-based measures from Information Retrieval,
since the alternative, application-based evaluation, can often be influenced by distributional
as well as semantic similarity. I also perform a detailed analysis of the final results using
WORDNET.
Finally, I apply my similarity metric to the task of assigning words to WORDNET semantic
categories. I demonstrate that this new approach outperforms existing methods and overcomes
some of their weaknesses
- âŚ