Evaluating Parsers with Dependency Constraints
Many syntactic parsers now score over 90% on English in-domain evaluation, but the remaining errors have been challenging to address and difficult to quantify. Standard parsing metrics provide a consistent basis for comparison between parsers, but do not illuminate what errors remain to be addressed. This thesis develops a constraint-based evaluation for dependency and Combinatory Categorial Grammar (CCG) parsers to address this deficiency. We examine constrained and cascaded impact, representing the direct and indirect effects of errors on parsing accuracy. This distinguishes errors that are the underlying source of problems in a parse from those which are merely a consequence of those problems. Kummerfeld et al. (2012) propose a static post-parsing analysis to categorise groups of errors into abstract classes, but this cannot account for cascading changes resulting from repairing errors, or limitations which may prevent the parser from applying a repair. In contrast, our technique is based on enforcing the presence of certain dependencies during parsing, whilst allowing the parser to choose the remainder of the analysis according to its grammar and model. We draw constraints for this process from gold-standard annotated corpora, grouping them into abstract error classes such as NP attachment, PP attachment, and clause attachment. By applying constraints from each error class in turn, we can examine how parsers respond when forced to analyse each class correctly. We show how to apply dependency constraints in three parsers: the graph-based MSTParser (McDonald and Pereira, 2006) and the transition-based ZPar (Zhang and Clark, 2011b) dependency parsers, and the C&C CCG parser (Clark and Curran, 2007b). Each is widely used and influential in the field, and each generates some form of predicate-argument dependencies. We compare the parsers, identifying common sources of error, and differences in the distribution of errors between constrained and cascaded impact.
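The constrained/cascaded split described above can be sketched in a few lines. Here dependencies are represented as (head, dependent, label) triples, an error class is a set of labels, and `parse` and `parse_constrained` are stand-ins for any parser that can be forced to include a given dependency set; all of these names and representations are illustrative assumptions, not the interface of any of the three parsers.

```python
def error_impact(sentence, gold_deps, error_class, parse, parse_constrained):
    """Split one error class's effect into constrained and cascaded impact.

    Dependencies are (head, dependent, label) triples; `error_class` is a
    set of labels, e.g. {"PP"} for PP attachment. `parse` and
    `parse_constrained` are hypothetical hooks into a parser that can be
    forced to keep a set of dependencies.
    """
    baseline = parse(sentence)
    # Force in the gold dependencies of this class that the parser got wrong.
    constraints = {d for d in gold_deps
                   if d[2] in error_class and d not in baseline}
    constrained = parse_constrained(sentence, constraints)
    direct = constraints & constrained    # constrained (direct) impact
    # Cascaded impact: dependencies that became correct only as a side effect
    # of repairing the forced ones.
    cascaded = (constrained & gold_deps) - (baseline & gold_deps) - direct
    return len(direct), len(cascaded)
```

Applying this per error class gives exactly the two numbers the evaluation compares: how many errors were repaired directly by the constraint, and how many other errors disappeared as a consequence.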
Our work allows us to contrast the implementations of each parser, and how they respond to constraint application. Using our analysis, we experiment with new features for dependency parsing, which encode the frequency of proposed arcs in large-scale corpora derived from scanned books. These features are inspired by and extend the work of Bansal and Klein (2011). We target these features at the most notable errors, and show how they address some, but not all, of the difficult attachments across newswire and web text. CCG parsing is particularly challenging, as different derivations do not always generate different dependencies. We develop dependency hashing to address semantically redundant parses in n-best CCG parsing, and demonstrate its necessity and effectiveness. Dependency hashing substantially improves the diversity of n-best CCG parses, and improves a CCG reranker when used to create training and test data. We show the intricacies of applying constraints to C&C, and describe instances where applying constraints causes the parser to produce a worse analysis. These results illustrate how algorithms which are relatively straightforward for constituency and dependency parsers are non-trivial to implement in CCG. This work has explored dependencies as constraints in dependency and CCG parsing. We have shown how dependency hashing can efficiently eliminate semantically redundant CCG n-best parses, and presented a new evaluation framework based on enforcing the presence of dependencies in the output of the parser. By otherwise allowing the parser to proceed as it would have, we avoid the assumptions inherent in other work. We hope this work will provide insights into the remaining errors in parsing, and target efforts to address those errors, creating better syntactic analysis for downstream applications.
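The core idea of dependency hashing can be sketched as a one-pass filter over a best-first n-best list: derivations whose dependency sets hash identically are treated as semantically redundant, and only the first is kept. The `dependencies_of` hook is an assumption standing in for however a parser exposes its predicate-argument dependencies.

```python
def diversify_nbest(derivations, dependencies_of):
    """Keep only the first derivation for each distinct dependency set.

    A minimal sketch of dependency hashing: `derivations` is assumed to be
    sorted best-first, and `dependencies_of` (a hypothetical hook) maps a
    derivation to its predicate-argument dependencies. Derivations whose
    dependency sets hash the same are treated as semantically redundant.
    """
    seen = set()
    diverse = []
    for deriv in derivations:
        key = hash(frozenset(dependencies_of(deriv)))
        if key not in seen:
            seen.add(key)
            diverse.append(deriv)
    return diverse
```

Because the hash is computed over the unordered dependency set rather than the derivation tree, spuriously different derivations of the same semantics collapse to one entry, which is what makes the resulting n-best list diverse.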
Unsupervised grammar induction with Combinatory Categorial Grammars
Language is a highly structured medium for communication. An idea starts in the speaker's mind (semantics) and is transformed into a well-formed, intelligible sentence via the specific syntactic rules of a language. We aim to discover the fingerprints of this process in the choice and location of words used in the final utterance. What is unclear is how much of this latent process can be discovered from the linguistic signal alone and how much requires shared non-linguistic context, knowledge, or cues.
Unsupervised grammar induction is the task of analyzing strings in a language to discover the latent syntactic structure of the language without access to labeled training data. Successes in unsupervised grammar induction shed light on the amount of syntactic structure that is discoverable from raw or part-of-speech tagged text. In this thesis, we present a state-of-the-art grammar induction system based on Combinatory Categorial Grammars. Our choice of syntactic formalism enables the first labeled evaluation of an unsupervised system. This allows us to perform an in-depth analysis of the system's linguistic strengths and weaknesses. In order to completely eliminate reliance on any supervised systems, we also examine how performance is affected when we use induced word clusters instead of gold-standard POS tags. Finally, we perform a semantic evaluation of induced grammars, providing unique insights into future directions for unsupervised grammar induction systems.
Wide-coverage statistical parsing with minimalist grammars
Syntactic parsing is the process of automatically assigning a structure to a string
of words, and is arguably a necessary prerequisite for obtaining a detailed and precise
representation of sentence meaning. For many NLP tasks, it is sufficient to use
parsers based on simple context-free grammars. However, for tasks in which precision
on certain relatively rare but semantically crucial constructions (such as unbounded
wh-movements for open domain question answering) is important, more expressive
grammatical frameworks still have an important role to play.
One grammatical framework which has been conspicuously absent from journals
and conferences on Natural Language Processing (NLP), despite continuing to dominate
much of theoretical syntax, is Minimalism, the latest incarnation of the Transformational
Grammar (TG) approach to linguistic theory developed very extensively
by Noam Chomsky and many others since the early 1950s. Until now, all parsers
using genuine transformational movement operations have had only narrow coverage
by modern standards, owing to the lack of any wide-coverage TG grammars or treebanks
on which to train statistical models. The received wisdom within NLP is that
TG is too complex and insufficiently formalised to be applied to realistic parsing tasks.
This situation is unfortunate, as it is arguably the most extensively developed syntactic
theory across the greatest number of languages, many of which are otherwise
under-resourced, and yet the vast majority of its insights never find their way into NLP
systems. Conversely, the process of constructing large grammar fragments can have
a salutary impact on the theory itself, forcing choices between competing analyses of
the same construction, and exposing incompatibilities between analyses of different
constructions, along with areas of over- and undergeneration which may otherwise go
unnoticed.
This dissertation builds on research into computational Minimalism pioneered by
Ed Stabler and others since the late 1990s to present the first ever wide-coverage Minimalist
Grammar (MG) parser, along with some promising initial experimental results.
A wide-coverage parser must of course be equipped with a wide-coverage grammar,
and this dissertation will therefore also present the first ever wide-coverage MG, which
has analyses with a high level of cross-linguistic descriptive adequacy for a great many
English constructions, many of which are taken or adapted from proposals in the mainstream
Minimalist literature. The grammar is very deep, in the sense that it describes
many long-range dependencies which even most other expressive wide-coverage grammars
ignore. At the same time, it has also been engineered to be highly constrained,
with continuous computational testing being applied to minimize both under- and over-generation.
Natural language is highly ambiguous, both locally and globally, and even with a
very strong formal grammar, there may still be a great many possible structures for a
given sentence and its substrings. The standard approach to resolving such ambiguity
is to equip the parser with a probability model allowing it to disregard certain unlikely
search paths, thereby increasing both its efficiency and accuracy. The most successful
parsing models are those extracted in a supervised fashion from labelled data in the
form of a corpus of syntactic trees, known as a treebank. Constructing such a treebank
from scratch for a different formalism is extremely time-consuming and expensive,
however, and so the standard approach is to map the trees in an existing treebank into
trees of the target formalism. Minimalist trees are considerably more complex than
those of other formalisms, however, containing many more null heads and movement
operations, making this conversion process far from trivial. This dissertation will describe
a method which has so far been used to convert 56% of the Penn Treebank trees
into MG trees. Although still under development, the resulting MGbank corpus has
already been used to train a statistical A* MG parser, described here, which has an
expected asymptotic time complexity of O(n³); this is much better than even the most optimistic worst-case analysis for the formalism.
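The agenda-driven A* search that such a statistical parser relies on can be sketched generically. The concrete chart items, MG grammar rules, and trained heuristic are all abstracted away here into plain callables, so this is a sketch of the agenda loop under those assumptions, not the parser itself.

```python
import heapq

def astar_search(start, is_goal, successors, heuristic):
    """Generic A* agenda loop of the kind used in agenda-based parsing.

    `successors(item)` yields (next_item, step_cost) pairs and `heuristic`
    is an admissible estimate of the remaining cost; both are hypothetical
    stand-ins for a parser's inference rules and outside estimate. With an
    admissible heuristic, unlikely search paths are skipped without losing
    the optimal analysis.
    """
    agenda = [(heuristic(start), 0.0, start)]
    best = {}
    while agenda:
        _, cost, item = heapq.heappop(agenda)
        if is_goal(item):
            return item, cost
        if item in best and best[item] <= cost:
            continue  # stale agenda entry; a cheaper path was already expanded
        best[item] = cost
        for nxt, step in successors(item):
            heapq.heappush(agenda, (cost + step + heuristic(nxt), cost + step, nxt))
    return None, float("inf")
```

In a parser, items would be chart edges scored by negative log probability; the same loop then returns the highest-probability complete analysis first.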
Cross-lingual Semantic Parsing with Categorial Grammars
Humans communicate using natural language. We need to make sure that computers can understand us so that they can act on our spoken commands or independently gain new insights from knowledge that is written down as text. A "semantic parser" is a program that translates natural-language sentences into computer commands or logical formulas, something a computer can work with. Despite much recent progress on semantic parsing, most research focuses on English, and semantic parsers for other languages cannot keep up with the developments. My thesis aims to help close this gap. It investigates "cross-lingual learning" methods by which a computer can automatically adapt a semantic parser to another language, such as Dutch. The computer learns by looking at example sentences and their translations, e.g., "She likes to read books"/"Ze leest graag boeken". Even with many such examples, learning which word means what and how word meanings combine into sentence meanings is a challenge, because translations are rarely word-for-word. They exhibit grammatical differences and non-literalities. My thesis presents a method for tackling these challenges based on the grammar formalism Combinatory Categorial Grammar. It shows that this is a suitable formalism for this purpose, that many structural differences between sentences and their translations can be dealt with in this framework, and that a (rudimentary) semantic parser for Dutch can be learned cross-lingually based on one for English. I also investigate methods for building large corpora of texts annotated with logical formulas to further study and improve semantic parsers.
Treebank-based acquisition of Chinese LFG resources for parsing and generation
This thesis describes a treebank-based approach to automatically acquire robust, wide-coverage Lexical-Functional Grammar (LFG) resources for Chinese parsing
and generation, which is part of a larger project on the rapid construction of deep, large-scale, constraint-based, multilingual grammatical resources. I present an application-oriented LFG analysis for Chinese core linguistic phenomena and (in cooperation with PARC) develop a gold-standard dependency-bank of Chinese f-structures for evaluation. Based on the Penn Chinese Treebank, I design and implement two architectures for inducing Chinese LFG resources, one annotation-based and the other dependency conversion-based. I then apply the f-structure acquisition algorithm together with external, state-of-the-art parsers to parsing new text into "proto" f-structures. In order to convert "proto" f-structures into "proper" f-structures or deep dependencies, I present a novel Non-Local Dependency (NLD) recovery algorithm using subcategorisation frames and f-structure paths linking antecedents and traces in NLDs extracted from the automatically-built LFG f-structure treebank. Based on the grammars extracted from the f-structure annotated treebank, I develop a PCFG-based chart generator and a new n-gram based pure dependency generator to realise Chinese sentences from LFG f-structures.
The work reported in this thesis is the first effort to scale treebank-based, probabilistic Chinese LFG resources from proof-of-concept research to unrestricted, real
text. Although this thesis concentrates on Chinese and LFG, many of the methodologies, e.g. the acquisition of predicate-argument structures, NLD resolution and
the PCFG- and dependency n-gram-based generation models, are largely language- and formalism-independent and should generalise to diverse languages as well as to labelled bilexical dependency representations other than LFG.
Combined distributional and logical semantics
Understanding natural language sentences requires interpreting words, and combining
the meanings of words into the meanings of sentences. Despite much work on lexical
and compositional semantics individually, existing approaches are unlikely to offer a
complete solution. This thesis introduces a new approach, which combines the benefits
of distributional lexical semantics and logical compositional semantics.
Linguistic theories of compositional semantics have shown how logical forms can
be built for sentences, and how to represent semantic operators such as negatives,
quantifiers and modals. However, computational implementations of such theories
have shown poor performance on applications, mainly due to a reliance on incomplete
hand-built ontologies for the meanings of content words. Conversely, distributional semantics
has been shown to be effective in learning the representations of content words
based on collocations in large unlabelled corpora, but there are major outstanding challenges
in representing function words and building representations for sentences.
I introduce a new model which captures the main advantages of logical and distributional
approaches. The proposal closely follows formal semantics, except for changing
the definitions of content words. In traditional formal semantics, each word would
express a different symbol. Instead, I allow multiple words to express the same symbol,
corresponding to underlying concepts. For example, both the verb write and the noun
author can be made to express the same relation. These symbols can be learnt by clustering words based on distributional statistics; for example, write and author will
share many similar arguments. Crucially, the clustering means that the representations
are symbolic, so can easily be incorporated into standard logical approaches.
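A toy version of this clustering step can be sketched as follows. Here each word carries the set of arguments observed with it, and words whose argument sets overlap enough are merged into one symbol; the Jaccard measure, the threshold, and the greedy merging are all simplifying assumptions, since the thesis uses richer distributional statistics and probabilistic clustering.

```python
from itertools import combinations

def cluster_by_arguments(arg_sets, threshold=0.5):
    """Greedily merge words into shared relational symbols.

    `arg_sets` maps each word to the set of arguments seen with it in a
    corpus (an illustrative representation). Words whose argument sets
    overlap above `threshold` (Jaccard) end up in one cluster, so e.g.
    "write" and "author" can express the same underlying symbol.
    """
    clusters = {w: {w} for w in arg_sets}
    for a, b in combinations(sorted(arg_sets), 2):
        sa, sb = arg_sets[a], arg_sets[b]
        jaccard = len(sa & sb) / len(sa | sb)
        if jaccard >= threshold and clusters[a] is not clusters[b]:
            merged = clusters[a] | clusters[b]
            for w in merged:
                clusters[w] = merged   # point every member at the merged cluster
    return {frozenset(c) for c in clusters.values()}
```

The output clusters are symbolic, which is the property the text emphasises: each cluster can simply be used as a predicate symbol in a standard logical form.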
The simple model proves insufficient, and I develop several extensions. I develop
an unsupervised probabilistic model of ambiguity, and show how this model can be
built into compositional derivations to produce a distribution over logical forms. The
flat clustering approach does not model relations between concepts, for example that
buying implies owning. Instead, I show how to build graph structures over the clusters,
which allows such inferences. I also explore if the abstract concepts can be generalized
cross-lingually, for example mapping French verb ecrire to the same cluster as
the English verb write. The systems developed show good performance on question
answering and entailment tasks, and are capable of both sophisticated multi-sentence
inferences involving quantifiers, and subtle reasoning about lexical semantics.
These results show that distributional and formal logical semantics are not mutually
exclusive, and that a combined model can be built that captures the advantages of each.
Integrating source-language context into log-linear models of statistical machine translation
The translation features typically used in state-of-the-art statistical machine translation (SMT) model dependencies between the source and target phrases, but not among the phrases in the source language themselves. A swathe of research has demonstrated that integrating source context modelling directly into log-linear phrase-based SMT (PB-SMT) and hierarchical PB-SMT (HPB-SMT) can positively
influence the weighting and selection of target phrases, and thus improve translation quality. In this thesis we present novel approaches to incorporating source-language contextual modelling into state-of-the-art SMT models in order to enhance the quality of lexical selection. We investigate the effectiveness of a range of contextual features, including lexical features of neighbouring words, part-of-speech tags, supertags, sentence-similarity features, dependency information, and semantic roles. We explore a series of language pairs featuring typologically different languages, and examine the scalability of our research to larger amounts of training data.
While our results are mixed across feature selections, language pairs, and learning curves, we observe that including contextual features of the source sentence
in general produces improvements. The most significant improvements involve the integration of long-distance contextual features, such as dependency relations in
combination with part-of-speech tags in Dutch-to-English subtitle translation, the combination of dependency parse and semantic role information in English-to-Dutch parliamentary debate translation, supertag features in English-to-Chinese translation, or combination of supertag and lexical features in English-to-Dutch subtitle
translation. Furthermore, we investigate the applicability of our lexical contextual model in another closely related NLP problem, namely machine transliteration.
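The lexical-selection step these contextual features feed into can be sketched as a standard log-linear score: each candidate target phrase is scored by a weighted sum of its feature values, and the source context shifts which candidate wins. The feature names and weights below are illustrative assumptions, not values from any decoder used in the thesis.

```python
def loglinear_score(features, weights):
    """Score one candidate under a log-linear model: sum of lambda_i * h_i.

    `features` maps feature names (e.g. a neighbouring-word or supertag
    indicator for the source phrase) to values; `weights` are the tuned
    lambdas. Both dictionaries are hypothetical examples.
    """
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def best_phrase(candidates, weights):
    """Pick the highest-scoring (target_phrase, features) candidate.

    With context-free features every occurrence of a source phrase gets the
    same winner; adding source-context features lets the choice vary with
    the surrounding sentence, which is the effect the thesis exploits.
    """
    return max(candidates, key=lambda c: loglinear_score(c[1], weights))
```

For instance, two translations of an ambiguous source word can carry different context-indicator features, and the tuned weights then decide between them per sentence rather than globally.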