232 research outputs found
A Linearization Framework for Dependency and Constituent Trees
[Abstract]: Parsing is a core natural language processing problem in which, given an input raw sentence, a
model automatically produces a structured output that represents its syntactic structure. The
most common formalisms in this field are constituent and dependency parsing. Although
both formalisms show differences, they also share limitations, in particular the limited speed
of the models to obtain the desired representation, and the lack of a common representation
that allows any end-to-end neural system to obtain those models. Transforming both parsing
tasks into a sequence labeling task solves both of these problems. Several tree linearizations
have been proposed in the last few years, however there is no common suite that facilitates
their use under an integrated framework. In this work, we will develop such a system. On the
one hand, the system will be able to: (i) encode syntactic trees according to the desired syntactic
formalism and linearization function, and (ii) decode linearized trees into their original
representation. On the other hand, (iii) we will also train several neural sequence labeling
systems to perform parsing from those labels, and we will compare the results.[Resumen]: El análisis sintáctico es una tarea central dentro del procesado del lenguaje natural, en
el que dada una oración se produce una salida que representa su estructura sintáctica. Los
formalismos más populares son el de constituyentes y el de dependencias. Aunque son fundamentalmente
diferentes, tienen ciertas limitaciones en común, como puede ser la lentitud
de los modelos empleados para su predicción o la falta de una representación común que permita
predecirlos con sistemas neuronales de uso general. Transformar ambos formalismos a
una tarea de etiquetado de secuencias permite resolver ambos problemas. Durante los últimos
años se han propuesto diferentes maneras de linearizar árboles sintácticos, pero todavÃa
se carecÃa de un software unificado que permitiese obtener representaciones para ambos formalismos
sobre un mismo sistema. En este trabajo se desarrollará dicho sistema. Por un lado,
éste permitirá: (i) linearizar árboles sintácticos en el formalismo y función de linearización
deseadas y (ii) decodificar árboles linearizados de vuelta a su formato original. Por otro lado,
también se entrenarán varios modelos de etiquetado de secuencias, y se compararán los resultados
obtenidos.Traballo fin de grao (UDC.FIC). EnxeñarÃa Informática. Curso 2021/202
Neural Techniques for German Dependency Parsing
Syntactic parsing is the task of analyzing the structure of a sentence based on some predefined formal assumption. It is a key component in many natural language processing (NLP) pipelines and is of great benefit for natural language understanding (NLU) tasks such as information retrieval or sentiment analysis. Despite achieving very high results with neural network techniques, most syntactic parsing research pays attention to only a few prominent languages (such as English or Chinese) or language-agnostic settings. Thus, we still lack studies that focus on just one language and design specific parsing strategies for that language with regards to its linguistic properties.
In this thesis, we take German as the language of interest and develop more accurate methods for German dependency parsing by combining state-of-the-art neural network methods with techniques that address the specific challenges posed by the language-specific properties of German. Compared to English, German has richer morphology, semi-free word order, and case syncretism. It is the combination of those characteristics that makes parsing German an interesting and challenging task.
Because syntactic parsing is a task that requires many levels of language understanding, we propose to study and improve the knowledge of parsing models at each level in order to improve syntactic parsing for German. These levels are: (sub)word level, syntactic level, semantic level, and sentence level.
At the (sub)word level, we look into a surge in out-of-vocabulary words in German data caused by compounding. We propose a new type of embeddings for compounds that is a compositional model of the embeddings of individual components. Our experiments show that character-based embeddings are superior to word and compound embeddings in dependency parsing, and compound embeddings only outperform word embeddings when the part-of-speech (POS) information is unavailable. Thus, we conclude that it is the morpho-syntactic information of unknown compounds, not the semantic one, that is crucial for parsing German.
At the syntax level, we investigate challenges for local grammatical function labeler that are caused by case syncretism. In detail, we augment the grammatical function labeling component in a neural dependency parser that labels each head-dependent pair independently with a new labeler that includes a decision history, using Long Short-Term Memory networks (LSTMs). All our proposed models significantly outperformed the baseline on three languages: English, German and Czech. However, the impact of the new models is not the same for all languages: the improvement for English is smaller than for the non-configurational languages (German and Czech). Our analysis suggests that the success of the history-based models is not due to better handling of long dependencies but that they are better in dealing with the uncertainty in head direction.
We study the interaction of syntactic parsing with the semantic level via the problem of PP attachment disambiguation. Our motivation is to provide a realistic evaluation of the task where gold information is not available and compare the results of disambiguation systems against the output of a strong neural parser. To our best knowledge, this is the first time that PP attachment disambiguation is evaluated and compared against neural dependency parsing on predicted information. In addition, we present a novel approach for PP attachment disambiguation that uses biaffine attention and utilizes pre-trained contextualized word embeddings as semantic knowledge. Our end-to-end system outperformed the previous pipeline approach on German by a large margin simply by avoiding error propagation caused by predicted information. In the end, we show that parsing systems (with the same semantic knowledge) are in general superior to systems specialized for PP attachment disambiguation.
Lastly, we improve dependency parsing at the sentence level using reranking techniques. So far, previous work on neural reranking has been evaluated on English and Chinese only, both languages with a configurational word order and poor morphology. We re-assess the potential of successful neural reranking models from the literature on English and on two morphologically rich(er) languages, German and Czech. In addition, we introduce a new variation of a discriminative reranker based on graph convolutional networks (GCNs). Our proposed reranker not only outperforms previous models on English but is the only model that is able to improve results over the baselines on German and Czech. Our analysis points out that the failure is due to the lower quality of the k-best lists, where the gold tree ratio and the diversity of the list play an important role
Viability of Sequence Labeling Encodings for Dependency Parsing
Programa Oficial de Doutoramento en Computación . 5009V01[Abstract]
This thesis presents new methods for recasting dependency parsing as
a sequence labeling task yielding a viable alternative to the traditional
transition- and graph-based approaches. It is shown that sequence labeling
parsers provide several advantages for dependency parsing, such
as: (i) a good trade-off between accuracy and parsing speed, (ii) genericity
which enables running a parser in generic sequence labeling software
and (iii) pluggability which allows using full parse trees as features to
downstream tasks.
The backbone of dependency parsing as sequence labeling are the encodings
which serve as linearization methods for mapping dependency
trees into discrete labels, such that each token in a sentence is associated
with a label. We introduce three encoding families comprising: (i)
head selection, (ii) bracketing-based and (iii) transition-based encodings
which are differentiated by the way they represent a dependency
tree as a sequence of labels. We empirically examine the viability of
the encodings and provide an analysis of their facets.
Furthermore, we explore the feasibility of leveraging external complementary
data in order to enhance parsing performance. Our sequence
labeling parser is endowed with two kinds of representations. First,
we exploit the complementary nature of dependency and constituency
parsing paradigms and enrich the parser with representations from both
syntactic abstractions. Secondly, we use human language processing
data to guide our parser with representations from eye movements.
Overall, the results show that recasting dependency parsing as sequence
labeling is a viable approach that is fast and accurate and provides
a practical alternative for integrating syntax in NLP tasks.[Resumen]
Esta tesis presenta nuevos métodos para reformular el análisis sintáctico
de dependencias como una tarea de etiquetado secuencial, lo
que supone una alternativa viable a los enfoques tradicionales basados
en transiciones y grafos. Se demuestra que los analizadores de etiquetado
secuencial ofrecen varias ventajas para el análisis sintáctico de
dependencias, como por ejemplo (i) un buen equilibrio entre la precisión
y la velocidad de análisis, (ii) la genericidad que permite ejecutar
un analizador en un software genérico de etiquetado secuencial y (iii)
la conectividad que permite utilizar el árbol de análisis completo como
caracterÃsticas para las tareas posteriores.
El pilar del análisis sintáctico de dependencias como etiquetado secuencial
son las codificaciones que sirven como métodos de linealización
para transformar los árboles de dependencias en etiquetas discretas, de
forma que cada token de una frase se asocia con una etiqueta. Introducimos
tres familias de codificación que comprenden: (i) selección de
núcleos, (ii) codificaciones basadas en corchetes y (iii) codificaciones basadas
en transiciones que se diferencian por la forma en que representan
un árbol de dependencias como una secuencia de etiquetas. Examinamos
empÃricamente la viabilidad de las codificaciones y ofrecemos un
análisis de sus facetas.
Además, exploramos la viabilidad de aprovechar datos complementarios
externos para mejorar el rendimiento del análisis sintáctico. Dotamos
a nuestro analizador sintáctico de dos tipos de representaciones. En
primer lugar, explotamos la naturaleza complementaria de los paradigmas
de análisis sintáctico de dependencias y constituyentes, enriqueciendo
el analizador sintáctico con representaciones de ambas abstracciones
sintácticas. En segundo lugar, utilizamos datos de procesamiento del
lenguaje humano para guiar nuestro analizador con representaciones de
los movimientos oculares.
En general, los resultados muestran que la reformulación del análisis
sintáctico de dependencias como etiquetado de secuencias es un enfoque
viable, rápido y preciso, y ofrece una alternativa práctica para integrar
la sintaxis en las tareas de PLN.[Resumo]
Esta tese presenta novos métodos para reformular a análise sintáctica
de dependencias como unha tarefa de etiquetaxe secuencial, o que
supón unha alternativa viable aos enfoques tradicionais baseados en
transicións e grafos. Demóstrase que os analizadores de etiquetaxe secuencial
ofrecen varias vantaxes para a análise sintáctica de dependencias,
por exemplo (i) un bo equilibrio entre a precisión e a velocidade
de análise, (ii) a xenericidade que permite executar un analizador nun
software xenérico de etiquetaxe secuencial e (iii) a conectividade que
permite empregar a árbore de análise completa como caracterÃsticas
para as tarefas posteriores.
O piar da análise sintáctica de dependencias como etiquetaxe secuencial
son as codificacións que serven como métodos de linealización para
transformar as árbores de dependencias en etiquetas discretas, de forma
que cada token dunha frase se asocia cunha etiqueta. Introducimos
tres familias de codificación que comprenden: (i) selección de núcleos,
(ii) codificacións baseadas en corchetes e (iii) codificacións baseadas en
transicións que se diferencian pola forma en que representan unha árbore
de dependencia como unha secuencia de etiquetas. Examinamos
empÃricamente a viabilidade das codificacións e ofrecemos unha análise
das súas facetas.
Ademais, exploramos a viabilidade de aproveitar datos complementarios
externos para mellorar o rendemento da análise sintáctica. O noso
analizador sintáctico de etiquetaxe secuencial está dotado de dous tipos
de representacións. En primeiro lugar, explotamos a natureza complementaria
dos paradigmas de análise sintáctica de dependencias e constituÃntes
e enriquecemos o analizador sintáctico con representacións de
ambas abstraccións sintácticas. En segundo lugar, empregamos datos
de procesamento da linguaxe humana para guiar o noso analizador con
representacións dos movementos oculares.
En xeral, os resultados mostran que a reformulación da análise sintáctico
de dependencias como etiquetaxe de secuencias é un enfoque
viable, rápido e preciso, e ofrece unha alternativa práctica para integrar
a sintaxe nas tarefas de PLN.This work has been carried out thanks to the funding from
the European Research Council (ERC), under the European Union’s
Horizon 2020 research and innovation programme (FASTPARSE, grant
agreement No 714150)
Syntax-based machine translation using dependency grammars and discriminative machine learning
Machine translation underwent huge improvements since the groundbreaking
introduction of statistical methods in the early 2000s, going from very
domain-specific systems that still performed relatively poorly despite the
painstakingly crafting of thousands of ad-hoc rules, to general-purpose
systems automatically trained on large collections of bilingual texts which
manage to deliver understandable translations that convey the general
meaning of the original input.
These approaches however still perform quite below the level of human
translators, typically failing to convey detailed meaning and register, and
producing translations that, while readable, are often ungrammatical and
unidiomatic.
This quality gap, which is considerably large compared to most other
natural language processing tasks, has been the focus of the research in
recent years, with the development of increasingly sophisticated models that
attempt to exploit the syntactical structure of human languages, leveraging
the technology of statistical parsers, as well as advanced machine learning
methods such as marging-based structured prediction algorithms and neural
networks.
The translation software itself became more complex in order to accommodate
for the sophistication of these advanced models: the main translation
engine (the decoder) is now often combined with a pre-processor which
reorders the words of the source sentences to a target language word order, or
with a post-processor that ranks and selects a translation according according
to fine model from a list of candidate translations generated by a coarse
model.
In this thesis we investigate the statistical machine translation problem
from various angles, focusing on translation from non-analytic languages
whose syntax is best described by fluid non-projective dependency grammars
rather than the relatively strict phrase-structure grammars or projectivedependency
grammars which are most commonly used in the literature.
We propose a framework for modeling word reordering phenomena
between language pairs as transitions on non-projective source dependency
parse graphs. We quantitatively characterize reordering phenomena for the
German-to-English language pair as captured by this framework, specifically
investigating the incidence and effects of the non-projectivity of source
syntax and the non-locality of word movement w.r.t. the graph structure.
We evaluated several variants of hand-coded pre-ordering rules in order to
assess the impact of these phenomena on translation quality.
We propose a class of dependency-based source pre-ordering approaches
that reorder sentences based on a flexible models trained by SVMs and and
several recurrent neural network architectures.
We also propose a class of translation reranking models, both syntax-free
and source dependency-based, which make use of a type of neural networks
known as graph echo state networks which is highly flexible and requires
extremely little training resources, overcoming one of the main limitations
of neural network models for natural language processing tasks
SEQUENTIAL DECISIONS AND PREDICTIONS IN NATURAL LANGUAGE PROCESSING
Natural language processing has achieved great success in a wide range of ap- plications, producing both commercial language services and open-source language tools. However, most methods take a static or batch approach, assuming that the model has all information it needs and makes a one-time prediction. In this disser- tation, we study dynamic problems where the input comes in a sequence instead of all at once, and the output must be produced while the input is arriving. In these problems, predictions are often made based only on partial information. We see this dynamic setting in many real-time, interactive applications. These problems usually involve a trade-off between the amount of input received (cost) and the quality of the output prediction (accuracy). Therefore, the evaluation considers both objectives (e.g., plotting a Pareto curve).
Our goal is to develop a formal understanding of sequential prediction and decision-making problems in natural language processing and to propose efficient solutions. Toward this end, we present meta-algorithms that take an existent batch model and produce a dynamic model to handle sequential inputs and outputs. Webuild our framework upon theories of Markov Decision Process (MDP), which allows learning to trade off competing objectives in a principled way. The main machine learning techniques we use are from imitation learning and reinforcement learning, and we advance current techniques to tackle problems arising in our settings. We evaluate our algorithm on a variety of applications, including dependency parsing, machine translation, and question answering. We show that our approach achieves a better cost-accuracy trade-off than the batch approach and heuristic-based decision- making approaches.
We first propose a general framework for cost-sensitive prediction, where dif- ferent parts of the input come at different costs. We formulate a decision-making process that selects pieces of the input sequentially, and the selection is adaptive to each instance. Our approach is evaluated on both standard classification tasks and a structured prediction task (dependency parsing). We show that it achieves similar prediction quality to methods that use all input, while inducing a much smaller cost. Next, we extend the framework to problems where the input is revealed incremen- tally in a fixed order. We study two applications: simultaneous machine translation and quiz bowl (incremental text classification). We discuss challenges in this set- ting and show that adding domain knowledge eases the decision-making problem. A central theme throughout the chapters is an MDP formulation of a challenging problem with sequential input/output and trade-off decisions, accompanied by a learning algorithm that solves the MDP
Methods for taking semantic graphs apart and putting them back together again
The thesis develops a competitive compositional semantic parser for Abstract Meaning Representation (AMR). This approach combines a neural model with mechanisms that echo ideas from compositional semantic construction in a new, simple dependency structure. The thesis first tackles the task of generating structured training data necessary for a compositional approach, by developing the linguistically motivated AM algebra. Encoding the terms over the AM algebra as dependency trees yields a simple semantic parsing model where neural tagging and dependency models predict interpretable, meaningful operations that construct the AMR.Diese Dissertation entwickelt einen kompositionellen semantischen Parser für den Graphformalismus Abstract Meaning Representation (AMR). Der Ansatz kombiniert ein neuronales Modell mit Mechanismen, die Ideen der klassischen kompositionellen semantischen Konstruktion widerspiegeln. Die Arbeit geht zunächst das Problem an, strukturierte latente Trainingsdaten zu erzeugen die für den kompositionellen Ansatz nötig sind. Für diesen Zweck wird die linguistisch motivierte AM Algebra entwickelt. Indem die Terme der AM Algebra als Dependenzbäume ausgedrückt werden, erhalten wir ein Modell für semantisches Parsen, in dem neuronale Tagging- und Dependenzmodelle interpretierbare, aussagekräftige Operationen vorhersagen die dann den AMR Graphen erzeugen. Damit erreicht das Modell starke Evaluationsergebnisse und deutliche Verbesserungen gegenüber einem weniger strukturierten Vergleichsmodell.DF
Improving a supervised CCG parser
The central topic of this thesis is the task of syntactic parsing with Combinatory Categorial Grammar (CCG). We focus on pipeline approaches that have allowed researchers to develop efficient and accurate parsers trained on articles taken from the Wall Street Journal (WSJ). We present three approaches to improving the state-of-the-art in CCG parsing. First, we test novel supertagger-parser combinations to identify the parsing models and algorithms that benefit the most from recent gains in supertagger accuracy. Second, we attempt to lessen the future burdens of assembling a state-of-the-art CCG parsing pipeline by showing that a part-of-speech (POS) tagger is not required to achieve optimal performance. Finally, we discuss the deficiencies of current parsing algorithms and propose a solution that promises improvements in accuracy – particularly for difficult dependencies – while preserving efficiency and optimality guarantees
- …