A Formal Model of Ambiguity and its Applications in Machine Translation
Systems that process natural language must cope with and resolve ambiguity. In this dissertation, a model of language processing is advocated in which multiple inputs and multiple analyses of inputs are considered concurrently and a single analysis is only a last resort. Compared to conventional models, this approach can be understood as replacing single-element inputs and outputs with weighted sets of inputs and outputs. Although processing components must deal with sets (rather than individual elements), constraints are imposed on the elements of these sets, and the representations from existing models may be reused. However, to deal efficiently with large (or infinite) sets, compact representations of sets that share structure between elements, such as weighted finite-state transducers and synchronous context-free grammars, are necessary. These representations, and algorithms for manipulating them, are discussed in depth.
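The set-based processing the abstract describes can be sketched in miniature. The toy below (hypothetical names and data, plain Python dicts rather than a real WFST toolkit such as OpenFst) shows a weighted set of inputs being pushed through a weighted transduction as a whole, rather than committing to a single 1-best element first:

```python
# Illustrative only: weighted sets as plain dicts {element: probability}.
# A real system would use weighted finite-state transducers to share
# structure between elements of large or infinite sets.

def compose(weighted_inputs, transduction):
    """Apply a weighted relation to a weighted set of inputs.

    weighted_inputs: {input_string: weight}
    transduction:    {(input_string, output_string): weight}
    Returns the weighted set of outputs, marginalizing over inputs.
    """
    outputs = {}
    for (src, tgt), w in transduction.items():
        if src in weighted_inputs:
            outputs[tgt] = outputs.get(tgt, 0.0) + weighted_inputs[src] * w
    return outputs

# Two competing transcription hypotheses instead of a single 1-best input:
asr_hypotheses = {"recognize speech": 0.6, "wreck a nice beach": 0.4}
translations = {
    ("recognize speech", "erkenne Sprache"): 0.9,
    ("wreck a nice beach", "zerstöre einen schönen Strand"): 0.8,
}
print(compose(asr_hypotheses, translations))
```

Both hypotheses contribute weighted outputs; the decision between them is deferred until after translation instead of being made up front.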
To establish the effectiveness and tractability of the proposed processing model, it is applied to several problems in machine translation. Starting with spoken language translation, it is shown that translating a set of transcription hypotheses yields better translations than a baseline in which a single (1-best) transcription hypothesis is selected and then translated, independent of the translation model formalism used. More subtle forms of ambiguity that arise even in text-only translation (such as decisions conventionally made during system development about how to preprocess text) are then discussed, and it is shown that the ambiguity-preserving paradigm can be employed in these cases as well, again leading to improved translation quality. Finally, a model for supervised learning is introduced that learns from training data in which a set (rather than a single element) of correct labels is provided for each training instance; it is used to learn a model of compound word segmentation, which serves as a preprocessing step in machine translation.
Substring-based Machine Translation
Machine translation is traditionally formulated as the transduction of strings of words from the source to the target language. As a result, additional lexical processing steps such as morphological analysis, transliteration, and tokenization are required to process the internal structure of words and to cope with the data-sparsity issues that occur when words are simply divided according to white space. In this paper, we take a different approach: rather than dividing lexical processing and translation into two steps, we simply view translation as a single transduction between character strings in the source and target languages. In particular, we demonstrate that the key to achieving accuracy on a par with word-based translation in the character-based framework is the use of a many-to-many alignment strategy that can accurately capture correspondences between arbitrary substrings. We build on the alignment method proposed by Neubig et al. (2011), improving its efficiency and accuracy with a focus on character-based translation. Using a many-to-many aligner imbued with these improvements, we demonstrate that the traditional framework of phrase-based machine translation sees large gains in accuracy over character-based translation with more naive alignment methods, and achieves results comparable to word-based translation for two distant language pairs.
Parsing Linear Context-Free Rewriting Systems with Fast Matrix Multiplication
We describe a matrix multiplication recognition algorithm for a subset of binary linear context-free rewriting systems (LCFRS) with running time O(n^(ωd)), where O(n^ω) is the running time for n × n matrix multiplication and d is the "contact rank" of the LCFRS, i.e. the maximal number of combination and non-combination points that appear in the grammar rules. We also show that this algorithm can be used as a subroutine to get a recognition algorithm for general binary LCFRS with running time O(n^(ωd+1)). The currently best known ω is smaller than 2.38. Our result provides another proof for the best known result for parsing mildly context-sensitive formalisms such as combinatory categorial grammars, head grammars, linear indexed grammars, and tree adjoining grammars, which can be parsed in time O(n^(4.76)). It also shows that inversion transduction grammars can be parsed in time O(n^(5.52)). In addition, binary LCFRS subsumes many other formalisms and types of grammars, for some of which we also improve the asymptotic complexity of parsing.
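The connection between parsing and matrix multiplication can be seen in the simplest case: recognition for a binary CFG, where each rule application becomes a boolean matrix product over span endpoints. The sketch below (plain Python, not the paper's algorithm) iterates these products to a fixed point, so it is not asymptotically fast; the cited results come from organising the products via Valiant-style divide and conquer:

```python
# Illustrative sketch: CFG recognition where the CKY combination step is a
# boolean matrix product -- the core idea behind matrix-multiplication
# parsing. The naive fixed-point loop here is NOT asymptotically fast.

def bool_matmul(X, Y):
    n = len(X)
    return [[any(X[i][k] and Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def recognize(words, terminal_rules, binary_rules, start):
    """terminal_rules: {word: set of nonterminals}
       binary_rules:   list of (A, B, C) for A -> B C (Chomsky normal form)."""
    n = len(words)
    nts = ({x for rule in binary_rules for x in rule}
           | set().union(*terminal_rules.values()))
    # chart[A][i][j] == True iff nonterminal A derives words[i:j]
    chart = {A: [[False] * (n + 1) for _ in range(n + 1)] for A in nts}
    for i, w in enumerate(words):
        for A in terminal_rules.get(w, ()):
            chart[A][i][i + 1] = True
    changed = True
    while changed:                      # fixed point over rule applications
        changed = False
        for A, B, C in binary_rules:
            prod = bool_matmul(chart[B], chart[C])
            for i in range(n + 1):
                for j in range(n + 1):
                    if prod[i][j] and not chart[A][i][j]:
                        chart[A][i][j] = True
                        changed = True
    return chart[start][0][n]

# a^n b^n grammar: S -> A X | A B ; X -> S B ; A -> a ; B -> b
term = {"a": {"A"}, "b": {"B"}}
rules = [("S", "A", "X"), ("S", "A", "B"), ("X", "S", "B")]
print(recognize(list("aabb"), term, rules, "S"))   # True
print(recognize(list("aab"), term, rules, "S"))    # False
```

Binary LCFRS generalise this picture: nonterminals span tuples of intervals rather than single intervals, which is where the contact rank d enters the exponent.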
Refinements in hierarchical phrase-based translation systems
The relatively recently proposed hierarchical phrase-based translation model for statistical machine translation (SMT) has achieved state-of-the-art performance in numerous recent translation evaluations. Hierarchical phrase-based systems comprise a pipeline of modules with complex interactions. In this thesis, we propose refinements to the hierarchical phrase-based model, as well as improvements and analyses of various modules in hierarchical phrase-based systems.
We take advantage of the increasing amounts of training data available for machine translation, as well as of existing frameworks for distributed computing, to build better infrastructure for the extraction, estimation and retrieval of hierarchical phrase-based grammars. We design and implement grammar extraction as a series of Hadoop MapReduce jobs. We store the resulting grammar using the HFile format, which offers competitive trade-offs in terms of efficiency and simplicity. We demonstrate improvements over two alternative solutions used in machine translation.
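The extraction-and-estimation pipeline described above has the classic MapReduce shape. The sketch below (hypothetical data and function names, in-process Python rather than Hadoop or HFiles) shows a mapper emitting phrase-pair counts and a reducer turning grouped counts into relative-frequency translation probabilities:

```python
# Illustrative only: the map/shuffle/reduce shape of grammar extraction
# and estimation, run in-process. The thesis uses Hadoop MapReduce and
# stores the resulting grammar in HFiles; all names here are hypothetical.
from collections import defaultdict

def map_extract(sentence_pair):
    """Mapper: emit ((source phrase, target phrase), 1) from one aligned
    sentence pair. Real extraction walks word alignments; here we just
    pair up pre-extracted phrases for illustration."""
    for src, tgt in sentence_pair["phrase_pairs"]:
        yield (src, tgt), 1

def reduce_estimate(grouped):
    """Reducer: turn counts into relative-frequency probabilities P(tgt|src)."""
    totals = defaultdict(int)
    for (src, _), c in grouped.items():
        totals[src] += c
    return {(src, tgt): c / totals[src] for (src, tgt), c in grouped.items()}

corpus = [
    {"phrase_pairs": [("maison", "house"), ("la maison", "the house")]},
    {"phrase_pairs": [("maison", "house"), ("maison", "home")]},
]
# Shuffle phase: group mapper output by key.
grouped = defaultdict(int)
for pair in corpus:
    for key, count in map_extract(pair):
        grouped[key] += count
grammar = reduce_estimate(grouped)
print(grammar[("maison", "house")])   # 2/3
```

Because mappers see sentence pairs independently and reducers see all counts for a key, the same shape scales from this toy loop to a cluster.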
The modular nature of the SMT pipeline, while allowing individual improvements, has the disadvantage that errors committed by one module are propagated to the next. This thesis alleviates this issue between the word alignment module and the grammar extraction and estimation module by considering richer statistics from word alignment models during extraction. We use alignment link and alignment phrase pair posterior probabilities for grammar extraction and estimation, and demonstrate translation improvements in Chinese-to-English translation.
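The idea of extracting from alignment posteriors rather than from a single 1-best alignment can be illustrated in a few lines (a hypothetical sketch; the thesis's actual extraction heuristics are richer than a simple threshold):

```python
# Hypothetical sketch: keep alignment links by posterior probability
# instead of committing to the single Viterbi alignment, so extraction
# can see plausible links that the 1-best alignment discards.

def links_from_posteriors(posteriors, threshold=0.5):
    """posteriors: {(src_pos, tgt_pos): P(link)} from the alignment model."""
    return {link for link, p in posteriors.items() if p >= threshold}

posteriors = {(0, 0): 0.98, (1, 1): 0.55, (1, 2): 0.45, (2, 2): 0.90}
print(sorted(links_from_posteriors(posteriors)))  # [(0, 0), (1, 1), (2, 2)]
```

Lowering the threshold trades alignment precision for recall, which in turn changes which phrase pairs the extractor can reach.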
This thesis also proposes refinements to grammar and language modelling, both in the context of domain adaptation and in the context of the interaction between first-pass decoding and lattice rescoring. We analyse alternative strategies for cross-domain adaptation of grammars and language models. We also study interactions between the first-pass and second-pass language models in terms of size and n-gram order. Finally, we analyse two smoothing methods for rescoring with large 5-gram language models.
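One common cross-domain adaptation strategy in this setting is linear interpolation of an in-domain and an out-of-domain language model, with the mixture weight tuned on held-out data. The toy below (hypothetical per-token probabilities) tunes the weight by minimising held-out perplexity:

```python
# Hypothetical sketch of language model interpolation for domain
# adaptation: P(w|h) = lam * P_in(w|h) + (1 - lam) * P_out(w|h),
# with lam chosen to minimise perplexity on held-out in-domain text.
import math

def interpolate(p_in, p_out, lam):
    return lam * p_in + (1 - lam) * p_out

def perplexity(probs):
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

# Toy per-token probabilities of one held-out sentence under each LM:
p_in_domain  = [0.20, 0.05, 0.30]
p_out_domain = [0.10, 0.15, 0.05]

# Grid-search the mixture weight on the held-out probabilities.
best = min((perplexity([interpolate(a, b, lam / 10)
                        for a, b in zip(p_in_domain, p_out_domain)]), lam / 10)
           for lam in range(11))
print(best)  # (best perplexity, best lambda)
```

Even on this toy data the optimum is a mixture rather than either model alone, which is the typical finding in cross-domain adaptation.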
The last two chapters are devoted to the application of phrase-based grammars to the string regeneration task, which we consider as a means to study the fluency of machine translation output. We design and implement a monolingual phrase-based decoder for string regeneration and achieve state-of-the-art performance on this task. By applying our decoder to the output of a hierarchical phrase-based translation system, we are able to recover the same level of translation quality as the translation system.
Parsing Inside-Out
The inside-outside probabilities are typically used for reestimating probabilistic context-free grammars (PCFGs), just as the forward-backward probabilities are typically used for reestimating HMMs. I show several novel uses, including improving parser accuracy by matching parsing algorithms to evaluation criteria; speeding up DOP parsing by a factor of 500; and PCFG thresholding that is 30 times faster at a given accuracy level. I also give an elegant, state-of-the-art grammar formalism, which can be used to compute inside-outside probabilities, and a parser description formalism, which makes it easy to derive inside-outside formulas and many others. (Ph.D. thesis, 257 pages, 40 PostScript figures.)
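The inside probabilities discussed above can be computed with a CKY-style dynamic program. The sketch below (a toy PCFG with hypothetical rule probabilities) computes inside values bottom-up for a grammar in Chomsky normal form; the sentence probability is the inside value of the start symbol over the whole string:

```python
# Illustrative sketch of the inside algorithm for a PCFG in Chomsky
# normal form. inside[(A, i, j)] is the total probability that
# nonterminal A derives words[i:j].
from collections import defaultdict

def inside_probs(words, lexical, binary, start):
    """lexical: {(A, word): prob}; binary: {(A, B, C): prob} for A -> B C."""
    n = len(words)
    inside = defaultdict(float)
    for i, w in enumerate(words):                    # width-1 spans
        for (A, word), p in lexical.items():
            if word == w:
                inside[(A, i, i + 1)] += p
    for width in range(2, n + 1):                    # wider spans, bottom-up
        for i in range(n - width + 1):
            j = i + width
            for (A, B, C), p in binary.items():
                for k in range(i + 1, j):            # sum over split points
                    inside[(A, i, j)] += p * inside[(B, i, k)] * inside[(C, k, j)]
    return inside[(start, 0, n)]

lexical = {("N", "time"): 0.5, ("N", "flies"): 0.3, ("V", "flies"): 1.0,
           ("N", "arrow"): 0.2, ("D", "an"): 1.0, ("P", "like"): 1.0}
binary = {("S", "N", "VP"): 1.0, ("VP", "V", "PP"): 1.0,
          ("PP", "P", "NP"): 1.0, ("NP", "D", "N"): 1.0}
p = inside_probs("time flies like an arrow".split(), lexical, binary, "S")
print(p)  # 0.5 * 1.0 * 1.0 * 1.0 * 1.0 * 0.2 = 0.1
```

Outside probabilities are computed by a symmetric top-down pass, and the product of the two gives the expected rule counts used in reestimation.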
Rich Linguistic Structure from Large-Scale Web Data
The past two decades have shown an unexpected effectiveness of Web-scale data in natural language processing. Even the simplest models, when paired with unprecedented amounts of unstructured and unlabeled Web data, have been shown to outperform sophisticated ones. It has been argued that the effectiveness of Web-scale data has undermined the necessity of sophisticated modeling or laborious data set curation. In this thesis, we argue for and illustrate an alternative view: that Web-scale data not only serves to improve the performance of simple models, but also allows the use of qualitatively more sophisticated models that would not be deployable otherwise, leading to even further performance gains.
Modeling RNA, protein, and synthetic molecules using coarse-grained and all-atom representations
The aim of computational chemistry is to depict and understand the dynamics and interactions of molecular systems. In addition to increased comprehension in the physical and life sciences, this insight yields important applications to therapeutic design and materials science. In computational chemistry, molecules can be modeled in a number of representations depending on the molecular system and phenomena of interest. In this work, both simplified, coarse-grained representations and all-atom representations are used to model the interactions of RNA, cucurbituril host-guest chemistry, and cadmium selenide quantum dot binding to the Src homology 3 domain.
For RNA, a coarse-grained model termed RACER (RnA CoarsE-gRained) was developed to accurately predict RNA structure and folding free energy. After optimization against statistical potentials, RACER accurately predicted the structures of 14 RNAs, with an average 4.15 Å root mean square deviation (RMSD) from the experimental structures. Further, RACER captured the sequence-specific variation in folding free energy for a set of 6 RNA hairpins and 5 RNA duplexes, with an R² correlation of 0.96 to experiment.
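The RMSD figure quoted above is the root mean square deviation over corresponding atom positions. A minimal sketch (assuming the predicted and experimental structures are already optimally superimposed, which a real comparison would first ensure):

```python
# Hypothetical sketch of the RMSD metric between two structures,
# given matched atom coordinates after superposition.
import math

def rmsd(coords_a, coords_b):
    """coords_*: lists of (x, y, z) tuples, same length and atom order."""
    assert len(coords_a) == len(coords_b)
    sq = sum((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
             for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b))
    return math.sqrt(sq / len(coords_a))

predicted    = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
experimental = [(0.0, 0.0, 0.0), (1.0, 3.0, 4.0)]
print(rmsd(predicted, experimental))
```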
The binding free energies of a cucurbituril host with 14 guests were computed using a polarizable force field and the free energy techniques of Bennett acceptance ratio and the orthogonal space random walk. The polarizable force field captured binding accurately, yet unexpectedly, the orthogonal space random walk method converged slowly, albeit at still reduced computational expense to the Bennett acceptance ratio.
Lastly, the nanotoxicity effects of trioctylphosphine oxide-coated cadmium selenide quantum dots are investigated with the model Src homology 3 protein domain in complex with its native proline-rich motif ligand. With increasing quantum dot concentration, there is an increasing preference for the quantum dots to bind to the proline-rich motif active site, inhibiting Src homology 3 function.