Latent-Variable PCFGs: Background and Applications
Latent-variable probabilistic context-free grammars are
latent-variable models that are based on context-free grammars.
Nonterminals are associated with latent states that provide
contextual information during the top-down rewriting process of
the grammar.
We survey a few of the techniques used to estimate such grammars
and to parse text with them. We also give an overview of what the latent
states represent for English Penn Treebank parsing, and describe
extensions to these grammars and related models.
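As a notational sketch of the idea (the rule, state indices, and binarised form are illustrative, not taken from the survey), each nonterminal is split into latent-annotated variants, and the probability of an ordinary, unannotated tree marginalises over the hidden states:

```latex
% Illustrative notation: each nonterminal A is split into latent-annotated
% symbols A[h], h = 1,...,m, and rule probabilities are conditioned on the
% parent's latent state, e.g.  P( DT[h_2] NN[h_3] | NP[h_1] ).
% The probability of an observed, unannotated tree t marginalises over all
% assignments \mathbf{h} to the latent states of its nonterminals:
P(t) \;=\; \sum_{\mathbf{h}} \; \prod_{\bigl(A[h_a] \to B[h_b]\,C[h_c]\bigr) \in t(\mathbf{h})}
           P\bigl(B[h_b]\,C[h_c] \mid A[h_a]\bigr)
```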
Probabilistic Modelling of Morphologically Rich Languages
This thesis investigates how the sub-structure of words can be accounted for
in probabilistic models of language. Such models play an important role in
natural language processing tasks such as translation or speech recognition,
but often rely on the simplistic assumption that words are opaque symbols. This
assumption does not fit morphologically complex languages well, where words can
have rich internal structure and sub-word elements are shared across distinct
word forms.
Our approach is to encode basic notions of morphology into the assumptions of
three different types of language models, with the intention that leveraging
shared sub-word structure can improve model performance and help overcome data
sparsity that arises from morphological processes.
In the context of n-gram language modelling, we formulate a new Bayesian
model that relies on the decomposition of compound words to attain better
smoothing, and we develop a new distributed language model that learns vector
representations of morphemes and leverages them to link together
morphologically related words. In both cases, we show that accounting for word
sub-structure improves the models' intrinsic performance and provides benefits
when applied to other tasks, including machine translation.
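As a rough illustration of the second model's core idea (the morpheme inventory, analyses, and additive composition below are assumptions for exposition, not the thesis's exact formulation), morphologically related words can share parameters by composing word vectors from morpheme vectors:

```python
import numpy as np

# Hypothetical morpheme inventory and analyses, for illustration only.
morpheme_vectors = {m: np.random.randn(50) * 0.1
                    for m in ["un", "happi", "ness", "ly"]}

analyses = {
    "unhappiness": ["un", "happi", "ness"],
    "happily":     ["happi", "ly"],
}

def word_vector(word):
    """Compose a word representation from its morpheme vectors.

    Related word forms ('unhappiness', 'happily') share the 'happi'
    parameters, which is the kind of parameter sharing that helps
    combat sparsity from productive morphology."""
    return np.sum([morpheme_vectors[m] for m in analyses[word]], axis=0)

print(word_vector("unhappiness").shape)  # (50,)
```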
We then shift the focus beyond the modelling of word sequences and consider
models that automatically learn what the sub-word elements of a given language
are, given an unannotated list of words. We formulate a novel model that can
learn discontiguous morphemes in addition to the more conventional contiguous
morphemes that most previous models are limited to. This approach is
demonstrated on Semitic languages, and we find that modelling discontiguous
sub-word structures leads to improvements in the task of segmenting words into
their contiguous morphemes.
Comment: DPhil thesis, University of Oxford, submitted and accepted 2014.
http://ora.ox.ac.uk/objects/uuid:8df7324f-d3b8-47a1-8b0b-3a6feb5f45c
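To make the notion of a discontiguous morpheme concrete (the root, patterns, and glosses below are standard textbook examples, not data from the thesis), Semitic root-and-pattern morphology interleaves a consonantal root with a vowel template, so the root never appears as a contiguous substring:

```python
def interleave(root, pattern):
    """Realise a Semitic-style stem from a discontiguous consonantal root
    and a vowel pattern; '_' marks the slots the root consonants fill."""
    consonants = iter(root)
    return "".join(next(consonants) if c == "_" else c for c in pattern)

# Arabic-style root k-t-b ('write') combined with two patterns:
print(interleave("ktb", "_a_a_a"))   # kataba  ('he wrote')
print(interleave("ktb", "_i_aa_"))   # kitaab  ('book')
```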
Probabilistic grammar induction from sentences and structured meanings
The meanings of natural language sentences may be represented as compositional
logical-forms. Each word or lexicalised multiword element has an associated logical-form
representing its meaning. Full sentential logical-forms are then composed from
these word logical-forms via a syntactic parse of the sentence.
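A minimal sketch of this composition (the lexicon and sentences are invented for illustration): word-level logical-forms act as functions that are applied to one another along the parse:

```python
# Word-level logical-forms as Python functions (lambda-calculus style).
# Lexicon and examples are hypothetical, for illustration only.
lexicon = {
    "Mary":   "mary",
    "John":   "john",
    "sleeps": lambda x: f"sleep({x})",
    "likes":  lambda y: lambda x: f"like({x},{y})",
}

# 'Mary sleeps': the intransitive verb applies to its subject.
print(lexicon["sleeps"](lexicon["Mary"]))                   # sleep(mary)

# 'John likes Mary': the transitive verb applies to object, then subject.
print(lexicon["likes"](lexicon["Mary"])(lexicon["John"]))   # like(john,mary)
```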
This thesis develops two computational systems that learn both the word-meanings
and parsing model required to map sentences onto logical-forms from an example corpus
of (sentence, logical-form) pairs. One of these systems is designed to provide a
general purpose method of inducing semantic parsers for multiple languages and logical
meaning representations. Semantic parsers map sentences onto logical representations
of their meanings and may form an important part of any computational task that
needs to interpret the meanings of sentences. The other system is designed to model
the way in which a child learns the semantics and syntax of their first language. Here,
logical-forms are used to represent the potentially ambiguous context in which child-directed
utterances are spoken, and a psycholinguistically plausible training algorithm
learns a probabilistic grammar that describes the target language. This computational
modelling task is important as it can provide evidence for or against competing theories
of how children learn their first language.
Both of the systems presented here are based upon two working hypotheses. First,
that the correct parse of any sentence in any language is contained in a set of possible
parses defined in terms of the sentence itself, the sentence’s logical-form and a small
set of combinatory rule schemata. The second working hypothesis is that, given a
corpus of (sentence, logical-form) pairs that each support a large number of possible
parses according to the schemata mentioned above, it is possible to learn a probabilistic
parsing model that accurately describes the target language.
The algorithm for semantic parser induction learns Combinatory Categorial Grammar
(CCG) lexicons and discriminative probabilistic parsing models from corpora of
(sentence, logical-form) pairs. This system is shown to achieve performance at or near the state of
the art across multiple languages, logical meaning representations, and domains.
As the approach is not tied to any single natural or logical language, this system represents
an important step towards widely applicable black-box methods for semantic parser induction. This thesis also develops an efficient representation of the CCG lexicon
that separately stores language specific syntactic regularities and domain specific
semantic knowledge. This factorised lexical representation improves the performance
of CCG based semantic parsers in sparse domains and also provides a potential basis
for lexical expansion and domain adaptation for semantic parsers.
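A rough sketch of what such a factorisation might look like (the templates, constants, and representation below are assumptions for exposition, not the thesis's implementation): language-specific syntactic templates are stored once and paired on demand with domain-specific semantic constants:

```python
from itertools import product

# Language-specific syntactic templates (shared across domains) and
# domain-specific semantic constants; both lists are invented examples.
syntactic_templates = [r"(S\NP)/NP : \y.\x. PRED(x, y)"]
semantic_constants = ["borders", "traverses"]

def factored_entries(word):
    """Expand a word's lexical entries on demand by pairing every shared
    syntactic template with every domain constant. Storing the two factors
    separately keeps the lexicon compact and lets a new domain reuse the
    syntactic side unchanged."""
    return [f"{word} |- " + template.replace("PRED", constant)
            for template, constant in product(syntactic_templates,
                                               semantic_constants)]

for entry in factored_entries("borders"):
    print(entry)
```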
The algorithm for modelling child language acquisition learns a generative probabilistic
model of CCG parses from sentences paired with a context set of potential
logical-forms containing one correct entry and a number of distractors. The online
learning algorithm used is intended to be psycholinguistically plausible and to assume
as little information specific to the task of language learning as is possible. It is shown
that this algorithm learns an accurate parsing model despite making very few initial
assumptions. It is also shown that the manner in which both word-meanings and syntactic
rules are learnt is in accordance with observations of both of these learning tasks
in children, supporting a theory of language acquisition that builds upon the two working
hypotheses stated above.
Multilingual Part-of-Speech Tagging: Two Unsupervised Approaches
We demonstrate the effectiveness of multilingual learning for unsupervised
part-of-speech tagging. The central assumption of our work is that by combining
cues from multiple languages, the structure of each becomes more apparent. We
consider two ways of applying this intuition to the problem of unsupervised
part-of-speech tagging: a model that directly merges tag structures for a pair
of languages into a single sequence and a second model which instead
incorporates multilingual context using latent variables. Both approaches are
formulated as hierarchical Bayesian models, using Markov Chain Monte Carlo
sampling techniques for inference. Our results demonstrate that by
incorporating multilingual evidence we can achieve impressive performance gains
across a range of scenarios. We also found that performance improves steadily
as the number of available languages increases.
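To make the first model concrete (the sentences, alignment, and tags below are invented, and gold tags are shown only to expose the joint structure that is latent in the unsupervised setting), aligned word pairs receive a single joint tag drawn from the cross-product of the two languages' tagsets:

```python
# Illustrative only: aligned English/Spanish words share one joint tag, so
# evidence from either language constrains the other. The actual models are
# hierarchical Bayesian sequence models estimated with MCMC sampling.
english = ["the", "green", "house"]
spanish = ["la", "casa", "verde"]
alignment = [(0, 0), (1, 2), (2, 1)]        # (English index, Spanish index)

english_tags = ["DET", "ADJ", "NOUN"]       # latent in the real setting
spanish_tags = ["DET", "NOUN", "ADJ"]       # latent in the real setting

joint_sequence = [(english[i], spanish[j], f"{english_tags[i]}+{spanish_tags[j]}")
                  for i, j in alignment]
for eng, spa, tag in joint_sequence:
    print(f"{eng:>6} / {spa:<6} -> {tag}")
```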
How an idea germinates into a project or the intransitive resultative construction with entity-specific change-of-state verbs
This study discusses how seven of Levin's (1993) entity-specific change-of-state verbs (i.e. bloom, blossom, flower, germinate, sprout, swell, and blister) are subsumed into the intransitive resultative construction by highlighting and making use of the external and internal constraints proposed by the Lexical Constructional Model (LCM; Ruiz de Mendoza and Mairal 2007). External constraints refer to cognitive mechanisms, such as high-level metaphor and/or metonymy, whereas internal constraints are concerned with the encyclopedic and event-structure makeup of verbs. The Internal Variable Conditioning constraint is at work when the information encapsulated by a predicate determines the choice of the Z element in an intransitive resultative construction. The semantic makeup of the verb swell and the entity undergoing swelling constrain the nature of the resultant entity Z, which must be bigger in size or have a bigger value than the Y element (e.g. The work, which was originally meant to consist only of a few sheets, swelled into ten volumes).
Modeling Dependencies in Natural Languages with Latent Variables
In this thesis, we investigate the use of latent variables to model complex dependencies in natural languages. Traditional models, which have a fixed parameterization, often make strong independence assumptions that lead to poor performance. This problem is often addressed by incorporating additional dependencies into the model (e.g., using higher order N-grams for language modeling). These added dependencies can increase data sparsity and/or require expert knowledge, together with trial and error, in order to identify and incorporate the most important dependencies (as in lexicalized parsing models). Traditional models, when developed for a particular genre, domain, or language, are also often difficult to adapt to another.
In contrast, previous work has shown that latent variable models, which automatically learn dependencies in a data-driven way, are able to flexibly adjust the number of parameters based on the type and the amount of training data available. We have created several different types of latent variable models for a diverse set of natural language processing applications, including novel models for part-of-speech tagging, language modeling, and machine translation, and an improved model for parsing. These models perform significantly better than traditional models. We have also created and evaluated three different methods for improving the performance of latent variable models. While these methods can be applied to any of our applications, we focus our experiments on parsing.
The first method involves self-training, i.e., we train models using a combination of gold standard training data and a large amount of automatically labeled training data. We conclude from a series of experiments that the latent variable models benefit much more from self-training than conventional models, apparently due to their flexibility to adjust their model parameterization to learn more accurate models from the additional automatically labeled training data.
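A schematic of the self-training loop described here (the trainer and parser interfaces are placeholders, not the actual systems used in the experiments):

```python
def self_train(train_parser, gold_trees, unlabeled_sentences):
    """Self-training sketch: a model trained on gold trees labels raw text,
    and the automatic parses are added to the training set for a second
    round. 'train_parser' stands in for any trainer that returns a parser
    with a .parse(sentence) method."""
    base_parser = train_parser(gold_trees)
    auto_trees = [base_parser.parse(s) for s in unlabeled_sentences]
    return train_parser(gold_trees + auto_trees)
```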
The second method takes advantage of the variability among latent variable models to combine multiple models for enhanced performance. We investigate several different training protocols to combine self-training with model combination. We conclude that these two techniques are complementary to each other and can be effectively combined to train very high quality parsing models.
The third method replaces the generative multinomial lexical model of latent variable grammars with a feature-rich log-linear lexical model to provide a principled solution to address data sparsity, handle out-of-vocabulary words, and exploit overlapping features during model induction. We conclude from experiments that the resulting grammars are able to effectively parse three different languages.
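A minimal sketch of such a feature-rich lexical model (the feature set, weights, and vocabulary are invented for exposition): the probability of a word given a latent-annotated tag is a normalised exponential of weighted, overlapping surface features, which is what lets out-of-vocabulary words receive informed scores:

```python
import math

def features(word, tag):
    """Overlapping surface features shared across words (illustrative)."""
    return {
        f"word={word}|{tag}": 1.0,
        f"suffix={word[-2:]}|{tag}": 1.0,
        f"capitalised|{tag}": 1.0 if word[0].isupper() else 0.0,
    }

def lexical_prob(word, tag, vocab, weights):
    """P(word | tag) as a log-linear model: softmax of feature scores over
    the vocabulary. Rare or unseen words still get sensible scores because
    their suffix and capitalisation features carry learned weights."""
    def score(w):
        return sum(weights.get(f, 0.0) * v for f, v in features(w, tag).items())
    z = sum(math.exp(score(w)) for w in vocab)
    return math.exp(score(word)) / z

# Toy weights: an '-ed' suffix is indicative of the latent-annotated tag VBD-3.
weights = {"suffix=ed|VBD-3": 2.0}
vocab = ["walked", "jumped", "table"]
print(round(lexical_prob("walked", "VBD-3", vocab, weights), 3))
```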
This work contributes to natural language processing by creating flexible and effective latent variable models for several different languages. Our investigation of self-training, model combination, and log-linear models also provides insights into the effective application of these machine learning techniques to other disciplines.