11 research outputs found
Joint models for concept-to-text generation
Much of the data found on the world wide web is in numeric, tabular, or other nontextual
format (e.g., weather forecast tables, stock market charts, live sensor feeds), and
thus inaccessible to non-experts or laypersons. However, most conventional search engines
and natural language processing tools (e.g., summarisers) can only handle textual
input. As a result, data in non-textual form remains largely inaccessible. Concept-to-
text generation refers to the task of automatically producing textual output from
non-linguistic input, and holds promise for rendering non-linguistic data widely accessible.
Several successful generation systems have been produced in the past twenty
years. They mostly rely on human-crafted rules or expert-driven grammars, implement
a pipeline architecture, and usually operate in a single domain.
In this thesis, we present several novel statistical models that take as input a set
of database records and generate a description of them in natural language text. Our
unique idea is to combine the processes of structuring a document (document planning),
deciding what to say (content selection) and choosing the specific words and
syntactic constructs specifying how to say it (lexicalisation and surface realisation),
in a uniform joint manner. Rather than breaking up the generation process into a sequence
of local decisions, we define a probabilistic context-free grammar that globally
describes the inherent structure of the input (a corpus of database records and
text describing some of them). This joint representation allows individual processes
(i.e., document planning, content selection, and surface realisation) to communicate
and influence each other naturally.
We recast generation as the task of finding the best derivation tree for a set of input
database records and our grammar, and describe several algorithms for decoding in this
framework that allows to intersect the grammar with additional information capturing
fluency and syntactic well-formedness constraints. We implement our generators using
the hypergraph framework. Contrary to traditional systems, we learn all the necessary
document, structural and linguistic knowledge from unannotated data. Additionally,
we explore a discriminative reranking approach on the hypergraph representation of
our model, by including more refined content selection features. Central to our approach
is the idea of porting our models to various domains; we experimented on four
widely different domains, namely sportscasting, weather forecast generation, booking
flights, and troubleshooting guides. The performance of our systems is competitive
and often superior compared to state-of-the-art systems that use domain specific constraints,
explicit feature engineering or labelled data
Predicting Linguistic Structure with Incomplete and Cross-Lingual Supervision
Contemporary approaches to natural language processing are predominantly based on statistical machine learning from large amounts of text, which has been manually annotated with the linguistic structure of interest. However, such complete supervision is currently only available for the world's major languages, in a limited number of domains and for a limited range of tasks. As an alternative, this dissertation considers methods for linguistic structure prediction that can make use of incomplete and cross-lingual supervision, with the prospect of making linguistic processing tools more widely available at a lower cost. An overarching theme of this work is the use of structured discriminative latent variable models for learning with indirect and ambiguous supervision; as instantiated, these models admit rich model features while retaining efficient learning and inference properties.
The first contribution to this end is a latent-variable model for fine-grained sentiment analysis with coarse-grained indirect supervision. The second is a model for cross-lingual word-cluster induction and the application thereof to cross-lingual model transfer. The third is a method for adapting multi-source discriminative cross-lingual transfer models to target languages, by means of typologically informed selective parameter sharing. The fourth is an ambiguity-aware self- and ensemble-training algorithm, which is applied to target language adaptation and relexicalization of delexicalized cross-lingual transfer parsers. The fifth is a set of sequence-labeling models that combine constraints at the level of tokens and types, and an instantiation of these models for part-of-speech tagging with incomplete cross-lingual and crowdsourced supervision. In addition to these contributions, comprehensive overviews are provided of structured prediction with no or incomplete supervision, as well as of learning in the multilingual and cross-lingual settings.
Through careful empirical evaluation, it is established that the proposed methods can be used to create substantially more accurate tools for linguistic processing, compared to both unsupervised methods and to recently proposed cross-lingual methods. The empirical support for this claim is particularly strong in the latter case; our models for syntactic dependency parsing and part-of-speech tagging achieve the hitherto best published results for a wide number of target languages, in the setting where no annotated training data is available in the target language
Robust Parsing for Ungrammatical Sentences
Natural Language Processing (NLP) is a research area that specializes in studying computational approaches to human language. However, not all of the natural language sentences are grammatically correct. Sentences that are ungrammatical, awkward, or too casual/colloquial tend to appear in a variety of NLP applications, from product reviews and social media analysis to intelligent language tutors or multilingual processing. In this thesis, we focus on parsing, because it is an essential component of many NLP applications. We investigate in what ways the performances of statistical parsers degrade when dealing with ungrammatical sentences. We also hypothesize that breaking up parse trees from problematic parts prevents NLP applications from degrading due to incorrect syntactic analysis.
A parser is robust if it can overlook problems such as grammar mistakes and produce a parse tree that closely resembles the correct analysis for the intended sentence. We develop a robustness evaluation metric and conduct a series of experiments to compare the performances of state-of-the-art parsers on the ungrammatical sentences. The evaluation results show that ungrammatical sentences present challenges for statistical parsers, because the well-formed syntactic trees they produce may not be appropriate for ungrammatical sentences. We also define a new framework for reviewing the parses of ungrammatical sentences and extracting the coherent parts whose syntactic analyses make sense. We call this task parse tree fragmentation. The experimental results suggest that the proposed overall fragmentation framework is a promising way to handle syntactically unusual sentences
Recommended from our members
Learning with Joint Inference and Latent Linguistic Structure in Graphical Models
Constructing end-to-end NLP systems requires the processing of many types of linguistic information prior to solving the desired end task. A common approach to this problem is to construct a pipeline, one component for each task, with each system\u27s output becoming input for the next. This approach poses two problems. First, errors propagate, and, much like the childhood game of telephone , combining systems in this manner can lead to unintelligible outcomes. Second, each component task requires annotated training data to act as supervision for training the model. These annotations are often expensive and time-consuming to produce, may differ from each other in genre and style, and may not match the intended application.
In this dissertation we present a general framework for constructing and reasoning on joint graphical model formulations of NLP problems. Individual models are composed using weighted Boolean logic constraints, and inference is performed using belief propagation. The systems we develop are composed of two parts: one a representation of syntax, the other a desired end task (semantic role labeling, named entity recognition, or relation extraction). By modeling these problems jointly, both models are trained in a single, integrated process, with uncertainty propagated between them. This mitigates the accumulation of errors typical of pipelined approaches.
Additionally we propose a novel marginalization-based training method in which the error signal from end task annotations is used to guide the induction of a constrained latent syntactic representation. This allows training in the absence of syntactic training data, where the latent syntactic structure is instead optimized to best support the end task predictions. We find that across many NLP tasks this training method offers performance comparable to fully supervised training of each individual component, and in some instances improves upon it by learning latent structures which are more appropriate for the task
Analyzing, enhancing, optimizing and applying dependency analysis
Tesis inédita de la Universidad Complutense de Madrid, Facultad de Informática, Departamento de IngenierÃa del Software e Inteligencia Artificial, leÃda el 19/12/2012Los analizadores de dependencias estadÃsticos han sido mejorados en gran medida durante los últimos años. Esto ha sido posible gracias a los sistemas basados en aprendizaje automático que muestran una gran precisión. Estos sistemas permiten la generación de parsers para idiomas en los que se disponga de un corpus adecuado sin causar, para ello, un gran esfuerzo en el usuario final. MaltParser es uno de estos sistemas. En esta tesis hemos usado sistemas del estado del arte, para mostrar una serie de contribuciones completamente relacionadas con el procesamiento de lenguaje natural (PLN) y análisis de dependencias: (i) Estudio del problema del análisis de dependencias demostrando la homogeneidad en la precisión y mostrando contribuciones interesantes sobre la longitud de las frases, el tamaño de los corpora de entrenamiento y como evaluamos los parsers. (ii) Hemos estudiado además algunas maneras de mejorar la precisión modificando el flujo de análisis de dos maneras distintas, analizando algunos segmentos de las frases de manera separada, y modificando el comportamiento interno de los algoritmos de parsing. (iii) Hemos investigado la selección automática de atributos para aprendizaje máquina para analizadores de dependencias basados en transiciones que consideramos un importante problema y algo que realmente es necesario resolver dado el estado de la cuestión, ya que además puede servir para resolver de mejor manera tareas relacionadas con el análisis de dependencias. (iv) Finalmente, hemos aplicado el análisis de dependencias para resolver algunos problemas, hoy en dÃa importantes, para el procesamiento de lenguage natural (PLN) como son la simplificación de textos o la inferencia del alcance de señales de negación. Por último, añadir que el conocimiento adquirido en la realización de esta tesis puede usarse para implementar aplicaciones basadas en análisis de dependencias más robustas en PLN o en otras áreas relacionadas, como se demuestra a lo largo de la tesis.
[ABSTRACT] Statistical dependency parsing accuracy has been improved substantially during the last years. One of the main reasons is the inclusion of data- driven (or machine learning) based methods. Machine learning allows the development of parsers for every language that has an adequate training corpus without requiring a great effort. MaltParser is one of such systems. In the present thesis we have used state of the art systems (mainly Malt- Parser), to show some contributions in four different areas inherently related to natural language processing (NLP) and dependency parsing: (i) We stu- died the parsing problem demonstrating the homogeneity of the performance and showing interesting contributions about sentence length, corpora size and how we normally evaluate the parsers. (ii) We have also tried some ways of improving the parsing accuracy by modifying the flow of analysis, parsing some segments of the sentences separately by finally constructing a parsing combination problem. We also studied the modification of the inter- nal behavior of the parsers focusing on the root of dependency structures, which is an important part of what a dependency parser parses and worth studying. (iii) We have researched automatic feature selection and parsing optimization for transition based parsers which we consider an important problem and something that definitely needs to be done in dependency par- sing in order to solve parsing problems in a more successful way. And (iv) we have applied syntactic dependency structures and dependency parsing to solve some Natural Language Processing (NLP) problems such as text simplification and inferring the scope of negation cues. Furthermore, the knowledge acquired when developing this thesis could be used to implement more robust dependency parsing–based applications in different NLP (or related) areas, as we demonstrate in the present thesis.Depto. de IngenierÃa de Software e Inteligencia Artificial (ISIA)Fac. de InformáticaTRUEunpu
Vine parsing and minimum risk reranking for speed and precision
We describe our entry in the CoNLL-X shared task. The system consists of three phases: a probabilistic vine parser (Eisner and N. Smith, 2005) that produces unlabeled dependency trees, a probabilistic relation-labeling model, and a discriminative minimum risk reranker (D. Smith and Eisner, 2006). The system is designed for fast training and decoding and for high precision. We describe sources of crosslingual error and ways to ameliorate them. We then provide a detailed error analysis of parses produced for sentences in German (much training data) and Arabic (little training data).
Handbook of Lexical Functional Grammar
Lexical Functional Grammar (LFG) is a nontransformational theory of
linguistic structure, first developed in the 1970s by Joan Bresnan and
Ronald M. Kaplan, which assumes that language is best described and
modeled by parallel structures representing different facets of
linguistic organization and information, related by means of
functional correspondences. This volume has five parts. Part I,
Overview and Introduction, provides an introduction to core syntactic
concepts and representations. Part II, Grammatical Phenomena, reviews
LFG work on a range of grammatical phenomena or constructions. Part
III, Grammatical modules and interfaces, provides an overview of LFG
work on semantics, argument structure, prosody, information structure,
and morphology. Part IV, Linguistic disciplines, reviews LFG work in
the disciplines of historical linguistics, learnability,
psycholinguistics, and second language learning. Part V, Formal and
computational issues and applications, provides an overview of
computational and formal properties of the theory, implementations,
and computational work on parsing, translation, grammar induction, and
treebanks. Part VI, Language families and regions, reviews LFG work
on languages spoken in particular geographical areas or in particular
language families. The final section, Comparing LFG with other
linguistic theories, discusses LFG work in relation to other
theoretical approaches