Evaluating Parsers with Dependency Constraints
Many syntactic parsers now score over 90% on English in-domain evaluation, but the remaining errors have been challenging to address and difficult to quantify. Standard parsing metrics provide a consistent basis for comparison between parsers, but do not illuminate what errors remain to be addressed. This thesis develops a constraint-based evaluation for dependency and Combinatory Categorial Grammar (CCG) parsers to address this deficiency. We examine the constrained and cascading impact, representing the direct and indirect effects of errors on parsing accuracy. This distinguishes errors that are the underlying source of problems in parses from those that are a consequence of those problems. Kummerfeld et al. (2012) propose a static post-parsing analysis to categorise groups of errors into abstract classes, but this cannot account for cascading changes resulting from repairing errors, or limitations which may prevent the parser from applying a repair. In contrast, our technique is based on enforcing the presence of certain dependencies during parsing, whilst allowing the parser to choose the remainder of the analysis according to its grammar and model. We draw constraints for this process from gold-standard annotated corpora, grouping them into abstract error classes such as NP attachment, PP attachment, and clause attachment. By applying constraints from each error class in turn, we can examine how parsers respond when forced to correctly analyse each class. We show how to apply dependency constraints in three parsers: the graph-based MSTParser (McDonald and Pereira, 2006) and the transition-based ZPar (Zhang and Clark, 2011b) dependency parsers, and the C&C CCG parser (Clark and Curran, 2007b). Each is widely used and influential in the field, and each generates some form of predicate-argument dependencies. We compare the parsers, identifying common sources of error, and differences in the distribution of errors between constrained and cascaded impact.
Our work allows us to contrast the implementations of each parser, and how they respond to constraint application. Using our analysis, we experiment with new features for dependency parsing, which encode the frequency of proposed arcs in large-scale corpora derived from scanned books. These features are inspired by and extend the work of Bansal and Klein (2011). We target these features at the most notable errors, and show how they address some, but not all, of the difficult attachments across newswire and web text. CCG parsing is particularly challenging, as different derivations do not always generate different dependencies. We develop dependency hashing to address semantically redundant parses in n-best CCG parsing, and demonstrate its necessity and effectiveness. Dependency hashing substantially improves the diversity of n-best CCG parses, and improves a CCG reranker when used for creating training and test data. We show the intricacies of applying constraints to C&C, and describe instances where applying constraints causes the parser to produce a worse analysis. These results illustrate how algorithms which are relatively straightforward for constituency and dependency parsers are non-trivial to implement in CCG. This work has explored dependencies as constraints in dependency and CCG parsing. We have shown how dependency hashing can efficiently eliminate semantically redundant CCG n-best parses, and presented a new evaluation framework based on enforcing the presence of dependencies in the output of the parser. By otherwise allowing the parser to proceed as it would have, we avoid the assumptions inherent in other work. We hope this work will provide insights into the remaining errors in parsing, and target efforts to address those errors, creating better syntactic analysis for downstream applications.
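The idea behind dependency hashing, as the abstract describes it, is that two CCG derivations are semantically redundant when they yield the same set of dependencies. The following is a minimal sketch of that idea as a filter over an n-best list, not the thesis's implementation: the `Dependency` tuple layout, the function name `filter_redundant`, and the toy labels are all assumptions made for illustration, and the abstract does not say where in the parser the hashing is applied.

```python
from typing import FrozenSet, List, Set, Tuple

# Assumed representation: (head index, label, dependent index).
Dependency = Tuple[int, str, int]


def filter_redundant(nbest: List[List[Dependency]]) -> List[List[Dependency]]:
    """Keep the first (highest-ranked) parse for each distinct dependency set.

    Each parse is reduced to a frozenset of its dependencies, which Python
    hashes independently of the order in which the dependencies were produced.
    """
    seen: Set[FrozenSet[Dependency]] = set()
    unique: List[List[Dependency]] = []
    for deps in nbest:
        key = frozenset(deps)
        if key not in seen:
            seen.add(key)
            unique.append(deps)
    return unique


# Toy n-best list: the first two entries differ only in the order in which
# the dependencies were produced, so the second one is redundant.
nbest = [
    [(2, "subj", 1), (2, "obj", 3)],
    [(2, "obj", 3), (2, "subj", 1)],
    [(2, "subj", 1), (1, "obj", 3)],
]
unique = filter_redundant(nbest)
```

Using the frozenset itself as the set key, rather than a bare integer hash, means that a hash collision cannot cause two genuinely different parses to be merged.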
A Survey of Word Reordering in Statistical Machine Translation: Computational Models and Language Phenomena
Word reordering is one of the most difficult aspects of statistical machine
translation (SMT), and an important factor of its quality and efficiency.
Despite the vast amount of research published to date, the interest of the
community in this problem has not decreased, and no single method appears to be
strongly dominant across language pairs. Instead, the choice of the optimal
approach for a new translation task still seems to be mostly driven by
empirical trials. To orientate the reader in this vast and complex research
area, we present a comprehensive survey of word reordering viewed as a
statistical modeling challenge and as a natural language phenomenon. The survey
describes in detail how word reordering is modeled within different
string-based and tree-based SMT frameworks and as a stand-alone task, including
systematic overviews of the literature in advanced reordering modeling. We then
question why some approaches are more successful than others in different
language pairs. We argue that, besides measuring the amount of reordering, it
is important to understand which kinds of reordering occur in a given language
pair. To this end, we conduct a qualitative analysis of word reordering
phenomena in a diverse sample of language pairs, based on a large collection of
linguistic knowledge. Empirical results in the SMT literature are shown to
support the hypothesis that a few linguistic facts can be very useful to
anticipate the reordering characteristics of a language pair and to select the
SMT framework that best suits them.
Comment: 44 pages, to appear in Computational Linguistics
An Unsolicited Soliloquy on Dependency Parsing
Programa Oficial de Doutoramento en Computación. 5009V01
[Abstract]
This thesis presents work on dependency parsing covering two distinct lines of research. The
first aims to develop efficient parsers so that they can be fast enough to parse large amounts
of data while still maintaining decent accuracy. We investigate two techniques to achieve
this. The first is a cognitively-inspired method and the second uses a model distillation
method. The first technique proved to be utterly dismal, while the second was somewhat of
a success.
The second line of research presented in this thesis evaluates parsers. This is also done in
two ways. We aim to evaluate what causes variation in parsing performance for different
algorithms and also different treebanks. This evaluation is grounded in dependency displacements
(the directed distance between a dependent and its head) and the subsequent
distributions associated with algorithms and the distributions found in treebanks. This work
sheds some light on the variation in performance for both different algorithms and different
treebanks. The second part of this line focuses on the utility of part-of-speech tags
when used with parsing systems and questions the standard assumption that they
might help but certainly won’t hurt.
[Resumen]
This thesis presents work on dependency parsing covering two distinct lines of research. The
first aims to develop efficient parsers, so that they are fast enough to parse large volumes
of data while remaining sufficiently accurate. We investigate two methods. The first is
based on cognitive theories and the second uses a distillation technique. The first technique
was a huge failure, while the second was somewhat of a success.
The other line evaluates parsers. This is also done in two ways. We evaluate the cause of
the variation in parser performance across different algorithms and corpora. This evaluation
uses the difference between the distributions of edge displacement (the directed distance of
edges) corresponding to each algorithm and corpus. It also evaluates the difference between
the edge displacement distributions in the training and test data. This work sheds light on
the variations in performance for different algorithms and corpora. The second part of this
line investigates the utility of part-of-speech tags for parsers.
[Resumo]
This thesis presents work on syntactic parsing, covering two lines of research. The first
aims to develop efficient parsers, so that they are fast enough to process large volumes of
data while also being accurate. We investigate two methods. The first is based on a
cognitive theory, and the second uses a distillation technique. The first method was a huge
failure, while the second was somewhat of a success.
The other line evaluates parsers. This is also done in two ways. We evaluate the cause of
the variation in parser performance across different algorithms and corpora. This evaluation
uses the difference between the edge displacement distributions (the directed distance of
edges) corresponding to the algorithms and the corpora. It also evaluates the difference
between the edge displacement distributions in the training and test data. This work sheds
light on the variations in performance for different algorithms and corpora. The second part
of this line investigates the utility of part-of-speech tags for parsers.
This work has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (FASTPARSE, grant agreement No 714150) and from the Centro de Investigación de Galicia (CITIC), which is funded by the Xunta de Galicia and the European Union (ERDF - Galicia 2014-2020 Program), by grant ED431G 2019/01.
Xunta de Galicia; ED431G 2019/0