Evaluating Parsers with Dependency Constraints
Many syntactic parsers now score over 90% on English in-domain evaluation, but the remaining errors have been challenging to address and difficult to quantify. Standard parsing metrics provide a consistent basis for comparison between parsers, but do not illuminate what errors remain to be addressed. This thesis develops a constraint-based evaluation for dependency and Combinatory Categorial Grammar (CCG) parsers to address this deficiency.

We examine constrained and cascaded impact, representing the direct and indirect effects of errors on parsing accuracy. This distinguishes errors that are the underlying source of problems in a parse from those that are merely a consequence of those problems. Kummerfeld et al. (2012) propose a static post-parsing analysis to categorise groups of errors into abstract classes, but this cannot account for cascading changes resulting from repairing errors, or for limitations that may prevent the parser from applying a repair. In contrast, our technique is based on enforcing the presence of certain dependencies during parsing, whilst allowing the parser to choose the remainder of the analysis according to its grammar and model. We draw constraints for this process from gold-standard annotated corpora, grouping them into abstract error classes such as NP attachment, PP attachment, and clause attachment. By applying constraints from each error class in turn, we can examine how parsers respond when forced to correctly analyse each class.

We show how to apply dependency constraints in three parsers: the graph-based MSTParser (McDonald and Pereira, 2006) and the transition-based ZPar (Zhang and Clark, 2011b) dependency parsers, and the C&C CCG parser (Clark and Curran, 2007b). Each is widely used and influential in the field, and each generates some form of predicate-argument dependencies. We compare the parsers, identifying common sources of error and differences in the distribution of errors between constrained and cascaded impact. Our work allows us to contrast the implementations of each parser, and how they respond to constraint application.

Using our analysis, we experiment with new features for dependency parsing, which encode the frequency of proposed arcs in large-scale corpora derived from scanned books. These features are inspired by and extend the work of Bansal and Klein (2011). We target these features at the most notable errors, and show how they address some, but not all, of the difficult attachments across newswire and web text.

CCG parsing is particularly challenging, as different derivations do not always generate different dependencies. We develop dependency hashing to address semantically redundant parses in n-best CCG parsing, and demonstrate its necessity and effectiveness. Dependency hashing substantially improves the diversity of n-best CCG parses, and improves a CCG reranker when used to create training and test data. We show the intricacies of applying constraints to C&C, and describe instances where applying constraints causes the parser to produce a worse analysis. These results illustrate how algorithms that are relatively straightforward for constituency and dependency parsers are non-trivial to implement in CCG.

This work has explored dependencies as constraints in dependency and CCG parsing. We have shown how dependency hashing can efficiently eliminate semantically redundant CCG n-best parses, and presented a new evaluation framework based on enforcing the presence of dependencies in the output of the parser. By otherwise allowing the parser to proceed as it would have, we avoid the assumptions inherent in other work. We hope this work will provide insights into the remaining errors in parsing, and target efforts to address those errors, creating better syntactic analysis for downstream applications.
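As a rough illustration of the dependency hashing idea, the sketch below deduplicates an n-best list by hashing each parse's unordered set of predicate-argument dependencies, so that distinct derivations yielding the same dependencies collide. The parse representation (a dict with a "deps" list of (head, label, argument) tuples) is an assumption for illustration, not the C&C implementation.

    from typing import Hashable, Iterable, List

    def dependency_hash(dependencies: Iterable[tuple]) -> Hashable:
        # Hash the unordered set of (head, label, argument) tuples, so
        # distinct derivations with identical dependencies collide.
        return hash(frozenset(dependencies))

    def deduplicate_nbest(parses: List[dict]) -> List[dict]:
        # Assumes `parses` is sorted best-first; keeps only the first
        # (highest-scoring) parse per dependency set.
        seen, unique = set(), []
        for parse in parses:
            h = dependency_hash(parse["deps"])
            if h not in seen:
                seen.add(h)
                unique.append(parse)
        return unique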
Iterative parameter mixing for distributed large-margin training of structured predictors for natural language processing
The development of distributed training strategies for statistical prediction functions is important for applications of machine learning generally, and the development of distributed structured prediction training strategies is important for natural language processing (NLP) in particular. This is, first, because with ever-growing data sets it is easier to increase computational capacity by adding more processor nodes than it is to increase the power of individual processor nodes, and, second, because data sets are often collected and stored in different locations.
Iterative parameter mixing (IPM) is a distributed training strategy in which each
node in a network of processors optimizes a regularized average loss objective on its
own subset of the total available training data, making stochastic (per-example) updates
to its own estimate of the optimal weight vector, and communicating with the
other nodes by periodically averaging estimates of the optimal vector across the network.
This algorithm has been contrasted with a close relative, called here the single-mixture
optimization algorithm, in which each node stochastically optimizes an average
loss objective on its own subset of the training data, operating in isolation until
convergence, at which point the average of the independently created estimates is returned.
Recent empirical results have suggested that this IPM strategy produces better
models than the single-mixture algorithm, and the results of this thesis add to this
picture.
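A minimal sketch contrasting the two strategies, under stated assumptions: `grad` is a user-supplied (sub)gradient of the regularized per-example loss, each shard is a list of (x, y) examples, and the per-node loops run sequentially here purely for illustration (in practice each runs on its own processor node).

    import numpy as np

    def ipm_train(shards, grad, dim, epochs=10, lr=0.1):
        # Iterative parameter mixing: each node makes stochastic
        # (per-example) updates on its own shard, and the estimates are
        # averaged and redistributed after every epoch.
        w = np.zeros(dim)
        for _ in range(epochs):
            estimates = []
            for shard in shards:                # one pass per node
                w_i = w.copy()                  # start from mixed weights
                for x, y in shard:
                    w_i -= lr * grad(w_i, x, y)
                estimates.append(w_i)
            w = np.mean(estimates, axis=0)      # mix across the network
        return w

    def single_mixture_train(shards, grad, dim, epochs=10, lr=0.1):
        # Single-mixture baseline: nodes optimize in isolation until
        # convergence; their final estimates are averaged only once.
        finals = []
        for shard in shards:
            w_i = np.zeros(dim)
            for _ in range(epochs):
                for x, y in shard:
                    w_i -= lr * grad(w_i, x, y)
            finals.append(w_i)
        return np.mean(finals, axis=0)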
The contributions of this thesis are as follows.
The first contribution is to produce and analyze an algorithm for decentralized
stochastic optimization of regularized average loss objective functions. This algorithm,
which we call the distributed regularized dual averaging algorithm, improves over
prior work on distributed dual averaging by providing a simpler algorithm (used in the
rest of the thesis), better convergence bounds for the case of regularized average loss
functions, and certain technical results that are used in the sequel.
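For orientation, the single-machine regularized dual averaging update of Xiao (2010), on which this line of work builds, has the following form (notation is ours and may differ from the thesis's):

    w_{t+1} = \arg\min_{w} \left\{ \frac{1}{t} \sum_{s=1}^{t} \langle g_s, w \rangle + \Psi(w) + \frac{\beta_t}{t} h(w) \right\}

where g_s is a subgradient of the loss at w_s, \Psi(w) is the regularizer, h(w) is a strongly convex auxiliary function, and \beta_t is a nonnegative, nondecreasing sequence. In the distributed variant, each node maintains its own running average of subgradients, and these averages are periodically mixed across the network.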
The central contribution of this thesis is to give an optimization-theoretic justification
for the IPM algorithm. While past work has focused primarily on its empirical
test-time performance, we give a novel perspective on this algorithm by showing that,
in the context of the distributed dual averaging algorithm, IPM is a convergent optimization algorithm for arbitrary convex functions, while the single-mixture algorithm is not. Experiments indeed confirm that the superior test-time
performance of models trained using IPM, compared to single-mixture, correlates with
better optimization of the objective value on the training set, a fact not previously reported.
Furthermore, our analysis of general non-smooth functions justifies the use of
distributed large-margin (support vector machine [SVM]) training of structured predictors,
which we show yields better test performance than the IPM perceptron algorithm,
the only version of the IPM to have previously been given a theoretical justification.
Our results confirm that IPM training can reach the same level of test performance
as a sequentially trained model and can reach better accuracies when one has a fixed
budget of training time.
Finally, we use the reduction in training time that distributed training allows to experiment
with adding higher-order dependency features to a state-of-the-art phrase-structure
parsing model. We demonstrate that adding these features improves out-of-domain
parsing results of even the strongest phrase-structure parsing models, yielding
a new state-of-the-art for the popular train-test pairs considered. In addition, we show
that a feature-bagging strategy, in which component models are trained separately and
later combined, is sometimes necessary to avoid feature under-training and get the best
performance out of large feature sets.
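A sketch of the feature-bagging idea under stated assumptions: `train_fn` trains a model restricted to one group of feature templates, and each trained model exposes a `score` method; both interfaces are hypothetical, not the thesis's code.

    def train_feature_bagged(feature_groups, train_fn):
        # Train one component model per feature-template group, in
        # isolation, so strong templates cannot starve weaker ones of
        # weight ("feature under-training").
        models = [train_fn(group) for group in feature_groups]

        def score(structure):
            # Combine at test time by averaging component scores.
            return sum(m.score(structure) for m in models) / len(models)

        return score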
Discriminative training of continuous-space models in machine translation
Over the past few years, neural network (NN) architectures have been successfully applied to many Natural Language Processing (NLP) applications, such as Automatic Speech Recognition (ASR) and Statistical Machine Translation (SMT). For the language modeling task, these models consider linguistic units (i.e., words and phrases) through their projections into a continuous (multi-dimensional) space, and the estimated distribution is a function of these projections. Known as continuous-space models (CSMs), their peculiarity hence lies in this exploitation of a continuous representation, which can be seen as an attempt to address the sparsity issue of conventional discrete models. In the context of SMT, these techniques have been applied to neural network-based language models (NNLMs) included in SMT systems, and to continuous-space translation models (CSTMs). These models have led to significant and consistent gains in SMT performance, but are also very expensive to train and to apply at inference, especially for systems involving large vocabularies. To overcome this issue, the Structured Output Layer (SOUL) and Noise Contrastive Estimation (NCE) have been proposed; the former modifies the standard structure of the output layer, while the latter approximates maximum-likelihood estimation (MLE) by a sampling method. All these approaches share the same estimation criterion, namely MLE; however, using this procedure results in an inconsistency between the objective function defined for parameter estimation and the way the models are used in the SMT application. The work presented in this dissertation aims to design new performance-oriented and global training procedures for CSMs to overcome these issues. The main contributions lie in the investigation and evaluation of efficient training methods for (large-vocabulary) CSMs, which aim (a) to reduce the total training cost, and (b) to improve the efficiency of these models when used within the SMT application. On the one hand, the training and inference cost can be reduced (using the SOUL structure or the NCE algorithm), or the number of iterations can be reduced via faster convergence. This thesis provides an empirical analysis of these solutions on different large-scale SMT tasks. On the other hand, we propose a discriminative training framework which optimizes the performance of the whole system containing the CSM as a component model.
The experimental results show that this framework is effective both for training and for adapting CSMs within SMT systems, opening promising research perspectives.
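To make the sampling idea concrete, here is the standard noise-contrastive estimation objective in the style of Gutmann and Hyvärinen (2010), as adapted to language modeling; the exact variant used in the thesis may differ. Each observed word w in context h is contrasted with k noise words \bar{w}_i drawn from a noise distribution q:

    J(\theta) = \sum_{(w,h)} \Big[ \log \sigma\big(\Delta_\theta(w,h)\big) + \sum_{i=1}^{k} \log \sigma\big(-\Delta_\theta(\bar{w}_i,h)\big) \Big], \qquad \Delta_\theta(w,h) = s_\theta(w,h) - \log\big(k\,q(w)\big)

where s_\theta(w,h) is the model's unnormalized log-score and \sigma is the logistic sigmoid. No normalization over the full output vocabulary is required, which is the source of the training-cost savings.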
Empirical machine translation and its evaluation
In this thesis we have exploited current Natural Language Processing technology for Empirical Machine Translation and its Evaluation. On the one hand, we have studied the problem of automatic MT evaluation. We have analyzed the main deficiencies of current evaluation methods, which arise, in our opinion, from the shallow quality principles upon which they are based. Instead of relying on the lexical dimension alone, we suggest a novel path towards heterogeneous evaluations. Our approach is based on the design of a rich set of automatic metrics devoted to capturing a wide variety of translation quality aspects at different linguistic levels (lexical, syntactic and semantic). These linguistic metrics have been evaluated over different scenarios. The most notable finding is that metrics based on deeper linguistic information (syntactic/semantic) are able to produce more reliable system rankings than metrics which limit their scope to the lexical dimension, especially when the systems under evaluation are different in nature. However, at the sentence level, the performance of some of these metrics decreases significantly, which is mainly attributable to parsing errors. In order to improve sentence-level evaluation, apart from backing off to lexical similarity in the absence of parsing, we have also studied the possibility of combining the scores conferred by metrics at different linguistic levels into a single measure of quality. Two non-parametric strategies for metric combination have been presented; these offer the important advantage of not having to adjust the relative contribution of each metric to the overall score. As a complementary issue, we show how to use the heterogeneous set of metrics to obtain automatic and detailed linguistic error analysis reports. On the other hand, we have studied the problem of lexical selection in Statistical Machine Translation. For that purpose, we have constructed a Spanish-to-English baseline phrase-based Statistical Machine Translation system and iterated across its development cycle, analyzing how to improve its performance through the incorporation of linguistic knowledge. First, we extended the system by combining shallow-syntactic translation models based on linguistic data views, obtaining a significant improvement. This system was further enhanced using dedicated discriminative phrase translation models. These models allow for a better representation of the translation context in which phrases occur, effectively yielding an improved lexical choice. However, based on the proposed heterogeneous evaluation methods and the manual evaluations conducted, we have found that improvements in lexical selection do not necessarily imply an improved overall syntactic or semantic structure.
The incorporation of such dedicated predictions into the statistical framework therefore requires further study. As a side question, we have studied one of the main criticisms against empirical MT systems, i.e., their strong domain dependence, and how its negative effects may be mitigated by properly combining external knowledge sources when porting a system to a new domain. We have successfully ported an English-to-Spanish phrase-based Statistical Machine Translation system trained on the political domain to the domain of dictionary definitions. The two parts of this thesis are tightly connected, since the hands-on development of an actual MT system has allowed us to experience at first hand the role of the evaluation methodology in the development cycle of MT systems.
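A minimal sketch of a non-parametric (uniform) combination of heterogeneous metric scores, illustrating why no per-metric weights need tuning; the min-max normalization scheme here is an assumption for illustration, not necessarily the thesis's exact strategy.

    def combined_quality(metric_scores):
        # metric_scores: metric name -> list of scores, one per candidate
        # translation.  Normalize each metric across the candidates, then
        # average uniformly, so no relative weights are tuned.
        n_candidates = len(next(iter(metric_scores.values())))
        combined = [0.0] * n_candidates
        for scores in metric_scores.values():
            lo, hi = min(scores), max(scores)
            span = (hi - lo) or 1.0      # guard against constant metrics
            for i, s in enumerate(scores):
                combined[i] += (s - lo) / span
        return [c / len(metric_scores) for c in combined]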
Improvements to the performance and applicability of dependency parsing
[Abstract] Dependency parsers have attracted remarkable interest in the last two decades due
to their usefulness in a wide range of natural language processing tasks. They employ
a dependency graph to define the syntactic structure of a given sentence. In particular,
transition-based algorithms provide accurate and efficient dependency syntactic analyses.
However, the main drawback of these techniques is that they tend to suffer from error propagation: an early erroneous decision may place the parser in an incorrect state, causing further errors in later decisions.
This thesis focuses on improving the accuracy of transition-based parsers by reducing
the effect of error propagation, while preserving their speed and efficiency. Concretely,
we propose five different approaches that proved to be beneficial for performance, mitigating error propagation and boosting their accuracy.
We also extend the usefulness of dependency parsers beyond building dependency graphs. We present a novel technique that allows them to build constituent representations. This meets the natural language processing community's need for an efficient parser able to provide constituent trees representing the syntactic structure of
sentences.
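To make the error-propagation point concrete, here is a minimal greedy, arc-standard-style parsing loop; it is a generic sketch, not one of the thesis's five approaches, and `predict` is an assumed classifier over parser configurations.

    def greedy_parse(words, predict):
        # Greedy transition-based dependency parsing (arc-standard style).
        # Each decision conditions on the state left by all previous ones,
        # so one early mistake can make the gold tree unreachable.
        stack, buffer, arcs = [], list(range(len(words))), []
        while buffer or len(stack) > 1:
            action = predict(stack, buffer, arcs)
            if action == "SHIFT" and buffer:
                stack.append(buffer.pop(0))
            elif action == "LEFT_ARC" and len(stack) >= 2:
                dep = stack.pop(-2)            # second-from-top depends on top
                arcs.append((stack[-1], dep))
            elif action == "RIGHT_ARC" and len(stack) >= 2:
                dep = stack.pop()              # old top depends on new top
                arcs.append((stack[-1], dep))
            elif buffer:                       # repair an illegal prediction
                stack.append(buffer.pop(0))
            else:
                dep = stack.pop()
                arcs.append((stack[-1], dep))
        return arcs               # list of (head index, dependent index)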
Proceedings of the 18th Irish Conference on Artificial Intelligence and Cognitive Science
These proceedings contain the papers that were accepted for publication at AICS-2007, the 18th Annual Conference on Artificial Intelligence and Cognitive Science, held at the Technological University Dublin, Ireland, from 29 to 31 August 2007. AICS is the annual conference of the Artificial Intelligence Association of Ireland (AIAI).