Unsupervised induction of semantic roles
In recent years, a considerable amount of work has been devoted to the task of automatic
frame-semantic analysis. Given the relative maturity of syntactic parsing technology,
which is an important prerequisite, frame-semantic analysis represents a realistic
next step towards broad-coverage natural language understanding and has been
shown to benefit a range of natural language processing applications such as information
extraction and question answering.
Due to the complexity which arises from variations in syntactic realization, data-driven
models based on supervised learning have become the method of choice for this task.
However, the reliance on large amounts of semantically labeled data, which is costly
to produce for every language, genre, and domain, presents a major barrier to the
widespread application of the supervised approach.
This thesis therefore develops unsupervised machine learning methods, which automatically
induce frame-semantic representations without making use of semantically
labeled data. If successful, unsupervised methods would render manual data annotation
unnecessary and therefore greatly benefit the applicability of automatic frame-semantic
analysis.
We focus on the problem of semantic role induction, in which all the argument instances
occurring together with a specific predicate in a corpus are grouped into clusters
according to their semantic role. Our hypothesis is that semantic roles can be induced
without human supervision from a corpus of syntactically parsed sentences, by
combining the syntactic relations conveyed through parse trees with lexical-semantic
information.
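The clustering formulation above can be made concrete with a toy sketch: each argument instance is observed together with a predicate in a parsed corpus, and role induction partitions each predicate's instances into role clusters. The data model and corpus below are illustrative assumptions, not the thesis's actual representation:

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical minimal representation of the input to role induction.
# Field names are illustrative.
@dataclass(frozen=True)
class ArgumentInstance:
    predicate: str   # lemma of the governing predicate, e.g. "give"
    position: str    # syntactic position, e.g. "subj", "obj", "prep_to"
    head: str        # lexical head of the argument, e.g. "teacher"

def group_by_predicate(instances):
    """Collect all argument instances of each predicate; role induction
    then partitions each predicate's instances into role clusters."""
    by_pred = defaultdict(list)
    for inst in instances:
        by_pred[inst.predicate].append(inst)
    return dict(by_pred)

corpus = [
    ArgumentInstance("give", "subj", "teacher"),
    ArgumentInstance("give", "obj", "book"),
    ArgumentInstance("give", "prep_to", "student"),
    ArgumentInstance("eat", "subj", "child"),
]
grouped = group_by_predicate(corpus)
print(sorted(grouped))       # ['eat', 'give']
print(len(grouped["give"]))  # 3
```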
We argue that semantic role induction can be guided by three linguistic principles. The
first is the well-known constraint that semantic roles are unique within a particular
frame. The second is that arguments occurring in a specific syntactic position
within a specific linking (a mapping of roles to syntactic positions) all bear the
same semantic role. The third principle is that
the (asymptotic) distribution over argument heads is the same for two clusters which
represent the same semantic role.

We consider two approaches to semantic role induction, based on two fundamentally
different perspectives on the problem. Firstly, we develop feature-based probabilistic
latent structure models which capture the statistical relationships that hold between the
semantic role and other features of an argument instance. Secondly, we conceptualize
role induction as the problem of partitioning a graph whose vertices represent argument
instances and whose edges express similarities between these instances. The graph
thus represents all the argument instances for a particular predicate occurring in the
corpus. The similarities with respect to different features are represented on different
edge layers and accordingly we develop algorithms for partitioning such multi-layer
graphs.
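The multi-layer partitioning idea can be sketched as follows, with toy similarity values and illustrative layer weights; connected components of a thresholded combined graph stand in for the thesis's actual, more sophisticated partitioning algorithms:

```python
# Vertices are argument instances of one predicate; each layer holds
# pairwise similarities with respect to one feature (e.g. syntactic
# position, argument head). Layers are combined with assumed weights,
# edges below a threshold are dropped, and the connected components of
# the remaining graph form the role clusters.

def partition_multilayer(n_vertices, layers, weights, threshold):
    # Combine edge layers into one weighted similarity per vertex pair.
    combined = {}
    for layer, w in zip(layers, weights):
        for edge, sim in layer.items():
            combined[edge] = combined.get(edge, 0.0) + w * sim
    # Union-find over edges whose combined similarity clears the threshold.
    parent = list(range(n_vertices))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for (u, v), sim in combined.items():
        if sim >= threshold:
            parent[find(u)] = find(v)
    clusters = {}
    for v in range(n_vertices):
        clusters.setdefault(find(v), []).append(v)
    return sorted(clusters.values())

# Layer 1: similarity of syntactic position; layer 2: lexical similarity
# of argument heads (toy values).
position_layer = {(0, 1): 1.0, (2, 3): 1.0}
lexical_layer = {(0, 1): 0.8, (1, 2): 0.3, (2, 3): 0.9}
clusters = partition_multilayer(
    4, [position_layer, lexical_layer], weights=[0.5, 0.5], threshold=0.5)
print(clusters)  # [[0, 1], [2, 3]]
```

The threshold plays the same role here as in any similarity-graph clustering: raising it yields more, smaller clusters; lowering it merges them.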
We empirically validate our models and the principles they are based on and show that
our graph partitioning models have several advantages over the feature-based models.
In a series of experiments on both English and German, the graph partitioning models
outperform the feature-based models and yield significantly better scores than a strong
baseline which directly identifies semantic roles with syntactic positions.
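The strong baseline mentioned above assigns every argument occurring in the same syntactic position of a predicate to the same role cluster. A minimal sketch, with illustrative instance tuples:

```python
# Syntactic-function baseline: identify semantic roles directly with
# syntactic positions, i.e. one cluster per (predicate, position) pair.

def syntactic_baseline(instances):
    """Map each (predicate, syntactic position) pair to its own cluster
    of argument heads."""
    clusters = {}
    for pred, position, head in instances:
        clusters.setdefault((pred, position), []).append(head)
    return clusters

instances = [
    ("give", "subj", "teacher"),
    ("give", "subj", "parent"),
    ("give", "obj", "book"),
]
print(syntactic_baseline(instances)[("give", "subj")])
# → ['teacher', 'parent']
```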
In sum, we demonstrate that relatively high-quality shallow semantic representations
can be induced without human supervision and foreground a promising direction of
future research aimed at overcoming the problem of acquiring large amounts of lexical-semantic
knowledge.
Learning Language from a Large (Unannotated) Corpus
A novel approach to the fully automated, unsupervised extraction of
dependency grammars and associated syntax-to-semantic-relationship mappings
from large text corpora is described. The suggested approach builds on the
authors' prior work with the Link Grammar, RelEx and OpenCog systems, as well
as on a number of prior papers and approaches from the statistical language
learning literature. If successful, this approach would enable the mining of
all the information needed to power a natural language comprehension and
generation system, directly from a large, unannotated corpus.

Comment: 29 pages, 5 figures, research proposal
From Word to Sense Embeddings: A Survey on Vector Representations of Meaning
Over the past years, distributed semantic representations have proved to be
effective and flexible keepers of prior knowledge to be integrated into
downstream applications. This survey focuses on the representation of meaning.
We start from the theoretical background behind word vector space models and
highlight one of their major limitations: the meaning conflation deficiency,
which arises from representing a word with all its possible meanings as a
single vector. Then, we explain how this deficiency can be addressed through a
transition from the word level to the more fine-grained level of word senses
(in its broader acceptation) as a method for modelling unambiguous lexical
meaning. We present a comprehensive overview of the wide range of techniques in
the two main branches of sense representation, i.e., unsupervised and
knowledge-based. Finally, this survey covers the main evaluation procedures and
applications for this type of representation, and provides an analysis of four
of its important aspects: interpretability, sense granularity, adaptability to
different domains and compositionality.

Comment: 46 pages, 8 figures. Published in the Journal of Artificial Intelligence Research
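The meaning conflation deficiency described in this abstract can be illustrated with a toy calculation: a single vector trained on all senses of an ambiguous word ends up near their mixture, away from each individual sense. The vectors below are fabricated 2-d examples, not real embeddings:

```python
import math

def cosine(a, b):
    """Cosine similarity of two 2-d vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

sense_financial = [1.0, 0.0]   # "bank" as a financial institution
sense_river = [0.0, 1.0]       # "bank" as a river bank
# A single word vector exposed to both senses drifts toward their mixture,
# ending up equally (and only moderately) similar to each sense.
conflated = [0.5, 0.5]

print(round(cosine(conflated, sense_financial), 3))  # 0.707
print(round(cosine(conflated, sense_river), 3))      # 0.707
```

Sense embeddings avoid this by assigning each sense its own vector, so neighbourhoods stay sense-pure.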
Semantic frame induction through the detection of communities of verbs and their arguments
Resources such as FrameNet, which provide sets of semantic frame definitions and annotated textual data that maps into the evoked frames, are important for several NLP tasks. However, they are expensive to build and, consequently, are unavailable for many languages and domains. Thus, approaches able to induce semantic frames in an unsupervised manner are highly valuable. In this paper we approach that task from a network perspective as a community detection problem that targets the identification of groups of verb instances that evoke the same semantic frame and verb arguments that play the same semantic role. To do so, we apply a graph-clustering algorithm to a graph with contextualized representations of verb instances or arguments as nodes, connected by edges if the distance between them is below a threshold that defines the granularity of the induced frames. By applying this approach to the benchmark dataset defined in the context of SemEval 2019, we outperformed all of the previous approaches to the task, achieving the current state-of-the-art performance.
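The graph construction described in this abstract can be sketched with toy vectors standing in for contextualized representations: nodes are connected when their distance falls below a threshold that controls frame granularity. Here, connected components stand in for the paper's graph-clustering algorithm:

```python
import math
from itertools import combinations

def induce_frames(vectors, threshold):
    """Cluster instances by thresholding pairwise distances and taking
    connected components of the resulting graph."""
    n = len(vectors)
    adj = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if math.dist(vectors[i], vectors[j]) < threshold:
            adj[i].add(j)
            adj[j].add(i)
    # Depth-first traversal over the thresholded graph.
    seen, communities = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, comp = [start], []
        while stack:
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            comp.append(v)
            stack.extend(adj[v])
        communities.append(sorted(comp))
    return communities

# Toy 2-d "embeddings" of four verb instances; two clearly separated groups.
vectors = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
print(induce_frames(vectors, threshold=1.0))  # [[0, 1], [2, 3]]
```

A tighter threshold splits frames apart; a looser one merges them, which is exactly the granularity knob the abstract describes.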
L2F/INESC-ID at SemEval-2019 Task 2: unsupervised lexical semantic frame induction using contextualized word representations
Building large datasets annotated with semantic information, such as FrameNet, is an expensive process. Consequently, such resources are unavailable for many languages and specific domains. This problem can be alleviated by using unsupervised approaches to induce the frames evoked by a collection of documents. That is the objective of the second task of SemEval 2019, which comprises three subtasks: clustering of verbs that evoke the same frame and clustering of arguments into both frame-specific slots and semantic roles. We approach all the subtasks by applying a graph clustering algorithm on contextualized embedding representations of the verbs and arguments. Using such representations is appropriate in the context of this task, since they provide cues for word-sense disambiguation. Thus, they can be used to identify different frames evoked by the same words. Using this approach we were able to outperform all of the baselines reported for the task on the test set in terms of Purity F1, as well as in terms of BCubed F1 in most cases.
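The Purity F1 score mentioned in this abstract is the harmonic mean of purity (each induced cluster scored against its dominant gold class) and inverse purity (the same with gold and induced swapped). A minimal sketch on toy clusterings:

```python
from collections import Counter

def purity(clusters, labels):
    """Fraction of items falling in their cluster's majority gold class."""
    n = sum(len(c) for c in clusters)
    hit = sum(max(Counter(labels[i] for i in c).values()) for c in clusters)
    return hit / n

def purity_f1(clusters, gold):
    """Harmonic mean of purity and inverse purity."""
    gold_clusters = {}
    for i, label in gold.items():
        gold_clusters.setdefault(label, []).append(i)
    predicted = {i: c_id for c_id, c in enumerate(clusters) for i in c}
    p = purity(clusters, gold)
    ip = purity(list(gold_clusters.values()), predicted)
    return 2 * p * ip / (p + ip)

gold = {0: "A", 1: "A", 2: "B", 3: "B"}
print(purity_f1([[0, 1], [2, 3]], gold))            # 1.0
print(round(purity_f1([[0, 1, 3, 2]], gold), 3))    # 0.667
```

Putting everything in one cluster keeps inverse purity perfect but halves purity, which the harmonic mean penalizes; BCubed F1 (also used in the task) scores per-item precision and recall instead.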