Generating Disambiguating Paraphrases for Use in Crowdsourced Judgments of Meaning
Adapting statistical parsers to new domains requires annotated data, which is expensive and time-consuming to collect. Using crowdsourced annotation data as a “silver standard” is a step towards a more viable solution, so to facilitate the
collection of this data, we have developed a system for creating semantic disambiguation tasks for use in crowdsourced judgments of meaning. In the system described here, these tasks are generated automatically using surface realizations of structurally ambiguous parse trees, with minimal use of forced parse-structure changes.
NSF grant IIS-1319318. No embargo. Academic Major: Computer and Information Science.
Structured Named Entities
The names of people, locations, and organisations play a central role in language, and named entity recognition (NER) has been widely studied and successfully incorporated into natural language processing (NLP) applications. The most common variant of NER involves identifying and classifying proper noun mentions of these and miscellaneous entities as linear spans in text. Unfortunately, this version of NER is no closer to a detailed treatment of named entities than chunking is to a full syntactic analysis. NER, so construed, reflects neither the syntactic nor the semantic structure of NE mentions, and provides insufficient categorical distinctions to represent that structure. Representing this nested structure, where a mention may contain mentions of other entities, is critical for applications such as coreference resolution. The lack of this structure creates spurious ambiguity in the linear approximation. Research in NER has been shaped by the size and detail of the available annotated corpora. The existing structured named entity corpora are either small, in specialist domains, or in languages other than English. This thesis presents our Nested Named Entity (NNE) corpus of named entities and numerical and temporal expressions, taken from the WSJ portion of the Penn Treebank (PTB, Marcus et al., 1993). We use the BBN Pronoun Coreference and Entity Type Corpus (Weischedel and Brunstein, 2005a) as our basis, manually annotating it with a principled, fine-grained, nested annotation scheme and detailed annotation guidelines. The corpus comprises over 279,000 entities across 49,211 sentences (1,173,000 words), including 118,495 top-level entities. Our annotations were designed using twelve high-level principles that guided the development of the annotation scheme and difficult decisions for annotators.
We also monitored the semantic grammar being induced during annotation, seeking to identify and reinforce common patterns so as to maintain consistent, parsimonious annotations. The result is a scheme of 118 hierarchical fine-grained entity types and nesting rules, covering all capitalised mentions of entities, and numerical and temporal expressions. Unlike many corpora, we have developed detailed guidelines, including extensive discussion of edge cases, in an ongoing dialogue with our annotators, which is critical for consistency and reproducibility. We annotated independently of the PTB bracketing, allowing annotators to choose spans inconsistent with PTB conventions and errors, referring back to it only to resolve genuine ambiguity consistently. We merged our NNE annotations with the PTB, requiring some systematic and one-off changes to both annotations. This allows the NNE corpus to complement other PTB resources, such as PropBank, and inform PTB-derived corpora for other formalisms, such as CCG and HPSG. We compare this corpus against BBN. We consider several approaches to integrating the PTB and NNE annotations, which affect the sparsity of grammar rules and the visibility of syntactic and NE structure. We explore their impact on parsing the NNE and merged variants using the Berkeley parser (Petrov et al., 2006), which performs surprisingly well without specialised NER features. We experiment with flattening the NNE annotations into linear NER variants with stacked categories, and explore the ability of a maximum entropy and a CRF NER system to reproduce them. The CRF performs substantially better, but is infeasible to train on the enormous stacked category sets. The flattened output of the Berkeley parser is almost competitive with the CRF. Our results demonstrate that the NNE corpus is feasible for statistical models to reproduce. We invite researchers to explore new, richer models of (joint) parsing and NER on this complex and challenging task.
Our nested named entity corpus will improve a wide range of NLP tasks, such as coreference resolution and question answering, allowing automated systems to understand and exploit the true structure of named entities.
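The flattening of nested annotations into linear variants with stacked categories, described above, can be sketched roughly as follows. The encoding here (BIO layers, one per nesting depth, joined with "|") is an illustrative assumption, not the corpus's actual scheme:

```python
# Sketch: flattening nested named-entity spans into "stacked" BIO labels,
# one layer per nesting depth, joined with "|". Illustrative only -- the
# thesis's actual flattening scheme may differ.

def flatten_nested_entities(tokens, entities):
    """entities: list of (start, end, label) with end exclusive, outer
    spans containing inner spans. Returns one stacked BIO tag per token."""
    # Sort so that, at each position, longer (outer) spans come first.
    entities = sorted(entities, key=lambda e: (e[0], -(e[1] - e[0])))
    layers = [["O"] * len(tokens)]
    for start, end, label in entities:
        # Place each span on the first layer where its positions are free.
        depth = 0
        while any(tag != "O" for tag in layers[depth][start:end]):
            depth += 1
            if depth == len(layers):
                layers.append(["O"] * len(tokens))
        layers[depth][start] = "B-" + label
        for i in range(start + 1, end):
            layers[depth][i] = "I-" + label
    return ["|".join(layer[i] for layer in layers) for i in range(len(tokens))]

tokens = ["New", "York", "University", "students"]
entities = [(0, 3, "ORG"), (0, 2, "CITY")]  # CITY nested inside ORG
print(flatten_nested_entities(tokens, entities))
# → ['B-ORG|B-CITY', 'I-ORG|I-CITY', 'I-ORG|O', 'O|O']
```

Stacking in this way is what makes the label set enormous: every attested combination of layered categories becomes its own tag, which is why the CRF mentioned above becomes infeasible to train.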
Proceedings
Proceedings of the Ninth International Workshop on Treebanks and Linguistic Theories.
Editors: Markus Dickinson, Kaili Müürisep and Marco Passarotti.
NEALT Proceedings Series, Vol. 9 (2010), 268 pages.
© 2010 The editors and contributors.
Published by the Northern European Association for Language Technology (NEALT), http://omilia.uio.no/nealt.
Electronically published at Tartu University Library (Estonia): http://hdl.handle.net/10062/15891
Wide-coverage statistical parsing with minimalist grammars
Syntactic parsing is the process of automatically assigning a structure to a string
of words, and is arguably a necessary prerequisite for obtaining a detailed and precise
representation of sentence meaning. For many NLP tasks, it is sufficient to use
parsers based on simple context free grammars. However, for tasks in which precision
on certain relatively rare but semantically crucial constructions (such as unbounded
wh-movements for open domain question answering) is important, more expressive
grammatical frameworks still have an important role to play.
One grammatical framework which has been conspicuously absent from journals
and conferences on Natural Language Processing (NLP), despite continuing to dominate
much of theoretical syntax, is Minimalism, the latest incarnation of the Transformational
Grammar (TG) approach to linguistic theory developed very extensively
by Noam Chomsky and many others since the early 1950s. Until now, all parsers
using genuine transformational movement operations have had only narrow coverage
by modern standards, owing to the lack of any wide-coverage TG grammars or treebanks
on which to train statistical models. The received wisdom within NLP is that
TG is too complex and insufficiently formalised to be applied to realistic parsing tasks.
This situation is unfortunate, as it is arguably the most extensively developed syntactic
theory across the greatest number of languages, many of which are otherwise
under-resourced, and yet the vast majority of its insights never find their way into NLP
systems. Conversely, the process of constructing large grammar fragments can have
a salutary impact on the theory itself, forcing choices between competing analyses of
the same construction, and exposing incompatibilities between analyses of different
constructions, along with areas of over- and undergeneration which may otherwise go
unnoticed.
This dissertation builds on research into computational Minimalism pioneered by
Ed Stabler and others since the late 1990s to present the first ever wide-coverage Minimalist
Grammar (MG) parser, along with some promising initial experimental results.
A wide-coverage parser must of course be equipped with a wide-coverage grammar,
and this dissertation will therefore also present the first ever wide-coverage MG, which
has analyses with a high level of cross-linguistic descriptive adequacy for a great many
English constructions, many of which are taken or adapted from proposals in the mainstream
Minimalist literature. The grammar is very deep, in the sense that it describes
many long-range dependencies which even most other expressive wide-coverage grammars
ignore. At the same time, it has also been engineered to be highly constrained,
with continuous computational testing applied to minimize both under- and overgeneration.
Natural language is highly ambiguous, both locally and globally, and even with a
very strong formal grammar, there may still be a great many possible structures for a
given sentence and its substrings. The standard approach to resolving such ambiguity
is to equip the parser with a probability model allowing it to disregard certain unlikely
search paths, thereby increasing both its efficiency and accuracy. The most successful
parsing models are those extracted in a supervised fashion from labelled data in the
form of a corpus of syntactic trees, known as a treebank. Constructing such a treebank
from scratch for a different formalism is extremely time-consuming and expensive,
however, and so the standard approach is to map the trees in an existing treebank into
trees of the target formalism. Minimalist trees are considerably more complex than
those of other formalisms, however, containing many more null heads and movement
operations, making this conversion process far from trivial. This dissertation will describe
a method which has so far been used to convert 56% of the Penn Treebank trees
into MG trees. Although still under development, the resulting MGbank corpus has
already been used to train a statistical A* MG parser, described here, which has an
expected asymptotic time complexity of O(n³); this is much better than even the most
optimistic worst-case analysis for the formalism.
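The O(n³) bound mentioned above is the familiar chart-parsing complexity. As a point of reference only (the MG parser itself is far more involved), here is the classic CKY recognizer for a binarised context-free grammar, whose three nested loops over span lengths, start positions, and split points are where such a cubic bound comes from:

```python
# Sketch: CKY recognition for a CFG in Chomsky normal form. The three
# nested loops over spans give the O(n^3) time bound (times a grammar
# constant). Toy grammar below is invented for illustration.

def cky_recognize(words, lexicon, rules, start="S"):
    """lexicon: word -> set of nonterminals; rules: (B, C) -> set of A,
    encoding binary rules A -> B C."""
    n = len(words)
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1] = set(lexicon.get(w, ()))
    for span in range(2, n + 1):          # O(n) span lengths
        for i in range(n - span + 1):     # O(n) start positions
            j = i + span
            for k in range(i + 1, j):     # O(n) split points
                for B in chart[i][k]:
                    for C in chart[k][j]:
                        chart[i][j] |= rules.get((B, C), set())
    return start in chart[0][n]

lexicon = {"she": {"NP"}, "reads": {"V"}, "books": {"NP"}}
rules = {("V", "NP"): {"VP"}, ("NP", "VP"): {"S"}}
print(cky_recognize(["she", "reads", "books"], lexicon, rules))  # → True
```

An A* parser explores the same kind of chart, but uses a heuristic to expand only promising items, which is why the expected (rather than worst-case) complexity is the relevant figure.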
Cross-lingual Semantic Parsing with Categorial Grammars
Humans communicate using natural language. We need to make sure that computers can understand us so that they can act on our spoken commands or independently gain new insights from knowledge that is written down as text. A “semantic parser” is a program that translates natural-language sentences into computer commands or logical formulas: something a computer can work with. Despite much recent progress on semantic parsing, most research focuses on English, and semantic parsers for other languages cannot keep up with these developments. My thesis aims to help close this gap. It investigates “cross-lingual learning” methods by which a computer can automatically adapt a semantic parser to another language, such as Dutch. The computer learns by looking at example sentences and their translations, e.g., “She likes to read books”/“Ze leest graag boeken”. Even with many such examples, learning which word means what and how word meanings combine into sentence meanings is a challenge, because translations are rarely word-for-word: they exhibit grammatical differences and non-literalities. My thesis presents a method for tackling these challenges based on the grammar formalism Combinatory Categorial Grammar. It shows that this is a suitable formalism for this purpose, that many structural differences between sentences and their translations can be dealt with in this framework, and that a (rudimentary) semantic parser for Dutch can be learned cross-lingually based on one for English. I also investigate methods for building large corpora of texts annotated with logical formulas to further study and improve semantic parsers.
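The way Combinatory Categorial Grammar combines word meanings into sentence meanings can be illustrated in miniature with just its two application rules. This toy sketch (categories as strings, semantics as lambdas, with invented lexical entries) is only meant to show the mechanism, not the thesis's actual system:

```python
# Sketch: CCG forward and backward application. A lexical item is a
# (category, semantics) pair; application combines adjacent items.
# Lexicon entries below are toy assumptions.

def strip_parens(cat):
    return cat[1:-1] if cat.startswith("(") and cat.endswith(")") else cat

def forward_apply(left, right):
    """X/Y + Y => X, applying the functor's semantics to the argument."""
    cat_l, sem_l = left
    cat_r, sem_r = right
    if cat_l.endswith("/" + cat_r):
        return (strip_parens(cat_l[: -(len(cat_r) + 1)]), sem_l(sem_r))

def backward_apply(left, right):
    """Y + X\\Y => X."""
    cat_l, sem_l = left
    cat_r, sem_r = right
    if cat_r.endswith("\\" + cat_l):
        return (strip_parens(cat_r[: -(len(cat_l) + 1)]), sem_r(sem_l))

# Toy derivation for "Ze leest boeken" / "She reads books".
ze = ("NP", "she")
leest = ("(S\\NP)/NP", lambda obj: lambda subj: f"read({subj},{obj})")
boeken = ("NP", "books")

vp = forward_apply(leest, boeken)   # (S\NP)/NP + NP => S\NP
s = backward_apply(ze, vp)          # NP + S\NP => S
print(s)  # → ('S', 'read(she,books)')
```

Because the same category-and-lambda machinery is language-independent, only the lexicon differs between English and Dutch, which is what makes cross-lingual transfer of the parser conceivable.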
Syntax-mediated semantic parsing
Querying a database to retrieve an answer, telling a robot to perform an action, or
teaching a computer to play a game are tasks requiring communication with machines
in a language interpretable by them. Semantic parsing is the task of converting human
language to a machine interpretable language. While human languages are sequential in
nature with latent structures, machine interpretable languages are formal with explicit
structures. The computational linguistics community have created several treebanks to
understand the formal syntactic structures of human languages. In this thesis, we use
these to obtain formal meaning representations of languages, and learn computational
models to convert these meaning representations to the target machine representation.
Our goal is to evaluate if existing treebank syntactic representations are useful for
semantic parsing.
Existing semantic parsing methods mainly learn domain-specific grammars which
can parse human languages to machine representation directly. We deviate from this
trend and make use of general-purpose syntactic grammar to help in semantic parsing.
We use two syntactic representations: Combinatory Categorial Grammar (CCG) and
dependency syntax. CCG has a well established theory on deriving meaning representations
from its syntactic derivations. But there are no CCG treebanks for many languages
since these are difficult to annotate. In contrast, dependencies are easy to annotate and
have many treebanks. However, dependencies do not have a well established theory for
deriving meaning representations. In this thesis, we propose novel theories for deriving
meaning representations from dependencies.
Our evaluation task is question answering on a knowledge base. Given a question,
our goal is to answer it on the knowledge base by converting the question to an executable
query. We use Freebase, the knowledge source behind Google’s search engine,
as our knowledge base. Freebase contains millions of real world facts represented in a
graphical format. Inspired by the Freebase structure, we formulate semantic parsing
as a graph matching problem, i.e., given a natural language sentence, we convert it into
a graph structure from the meaning representation obtained from syntax, and find the
subgraph of Freebase that best matches the natural language graph.
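The graph-matching formulation above can be sketched in miniature: the knowledge base is a set of (subject, relation, object) edges, a question becomes a small pattern graph with variables, and answering means finding bindings under which every pattern edge is a KB edge. The entity and relation names below are invented for illustration, not actual Freebase identifiers, and real systems score candidate subgraphs rather than matching exactly:

```python
# Sketch: question answering as subgraph matching over a toy KB of
# (subject, relation, object) triples. Names are made up, not Freebase IDs.

KB = {
    ("barack_obama", "place_of_birth", "honolulu"),
    ("honolulu", "contained_by", "hawaii"),
    ("barack_obama", "profession", "politician"),
}

def match(pattern, kb):
    """pattern: list of triples where strings starting with '?' are
    variables. Returns all variable bindings (dicts) that satisfy it."""
    bindings = [{}]
    for s, r, o in pattern:
        new = []
        for b in bindings:
            for ks, kr, ko in kb:
                if kr != r:
                    continue
                b2, ok = dict(b), True
                for term, val in ((s, ks), (o, ko)):
                    if term.startswith("?"):
                        if b2.get(term, val) != val:
                            ok = False
                        b2[term] = val
                    elif term != val:
                        ok = False
                if ok:
                    new.append(b2)
        bindings = new
    return bindings

# "Where was Obama born?" -> a one-edge pattern graph with variable ?x.
print(match([("barack_obama", "place_of_birth", "?x")], KB))
# → [{'?x': 'honolulu'}]
```

Multi-edge patterns chain bindings through shared variables, e.g. a two-hop question binds ?x to honolulu and then ?y to hawaii via the contained_by edge.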
Our experiments on Free917, WebQuestions and GraphQuestions semantic parsing
datasets conclude that general-purpose syntax is more useful for semantic parsing than
induced task-specific syntax and syntax-agnostic representations.
Graphical Models with Structured Factors, Neural Factors, and Approximation-aware Training
This thesis broadens the space of rich yet practical models for structured prediction. We introduce a general framework for modeling with four ingredients: (1) latent variables, (2) structural constraints, (3) learned (neural) feature representations of the inputs, and (4) training that takes the approximations made during inference into account. The thesis builds up to this framework through an empirical study of three NLP tasks: semantic role labeling, relation extraction, and dependency parsing -- obtaining state-of-the-art results on the former two. We apply the resulting graphical models, with structured and neural factors and approximation-aware learning, to jointly model part-of-speech tags, a syntactic dependency parse, and semantic roles in a low-resource setting where the syntax is unobserved. We present an alternative view of these models as neural networks with a topology inspired by inference on graphical models that encode our intuitions about the data.
Projection in discourse: A data-driven formal semantic analysis
A sentence like "Bertrand, a famous linguist, wrote a book" contains different contributions: there is a person named "Bertrand", he is a famous linguist, and he wrote a book. These contributions convey different types of information; while the existence of Bertrand is presented as given information---it is presupposed---the other contributions signal new information. Moreover, the contributions are affected differently by linguistic constructions. The inference that Bertrand wrote a book disappears when the sentence is negated or turned into interrogative form, while the other contributions survive; this is called 'projection'. In this thesis, I investigate the relation between different types of contributions in a sentence from a theoretical and empirical perspective. I focus on projection phenomena, which include presuppositions ('Bertrand exists' in the aforementioned example) and conventional implicatures ('Bertrand is a famous linguist'). I argue that the differences between the contributions can be explained in terms of information status, which describes how content relates to the unfolding discourse context. Based on this analysis, I extend the widely used formal representational system Discourse Representation Theory (DRT) with an explicit representation of the different contributions made by projection phenomena; this extension is called 'Projective Discourse Representation Theory' (PDRT). I present a data-driven computational analysis based on data from the Groningen Meaning Bank, a corpus of semantically annotated texts. This analysis shows how PDRT can be used to learn more about different kinds of projection behaviour. These results can be used to improve linguistically oriented computational applications such as automatic translation systems.
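The distinction the abstract draws between projected and asserted content can be sketched as a toy data structure for the Bertrand example. The labels and the simple "negation test" below are an illustrative simplification of the idea, not PDRT's actual notation or projection mechanism:

```python
# Sketch: a toy discourse representation for "Bertrand, a famous linguist,
# wrote a book", with each condition labelled by its information status.
# Simplified illustration only -- not PDRT's actual formalism.

drs = {
    "referents": ["x", "y"],
    "conditions": [
        ("presupposed", "named(x, Bertrand)"),
        ("conventional_implicature", "famous_linguist(x)"),
        ("asserted", "book(y)"),
        ("asserted", "write(x, y)"),
    ],
}

def negation_test(drs):
    """Under negation, only asserted content falls inside the negation;
    presupposed and conventionally implicated content projects out and
    survives at the top level."""
    survives = [c for kind, c in drs["conditions"] if kind != "asserted"]
    negated = [c for kind, c in drs["conditions"] if kind == "asserted"]
    return survives, negated

survives, negated = negation_test(drs)
print("projects out of negation:", survives)
print("falls under negation:   ", negated)
```

Running the test reproduces the pattern described above: "Bertrand did not write a book" still commits the speaker to Bertrand existing and being a famous linguist, while the book-writing inference disappears.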