A Formal Model of Ambiguity and its Applications in Machine Translation
Systems that process natural language must cope with and resolve ambiguity. In this dissertation, a model of language processing is advocated in which multiple inputs and multiple analyses of inputs are considered concurrently and a single analysis is only a last resort. Compared to conventional models, this approach can be understood as replacing single-element inputs and outputs with weighted sets of inputs and outputs. Although processing components must deal with sets (rather than individual elements), constraints are imposed on the elements of these sets, and the representations from existing models may be reused. However, to deal efficiently with large (or infinite) sets, compact representations of sets that share structure between elements, such as weighted finite-state transducers and synchronous context-free grammars, are necessary. These representations, and algorithms for manipulating them, are discussed in depth.
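The set-based processing the abstract describes can be sketched in miniature. The toy below (hypothetical names and data, plain Python dicts rather than a real WFST toolkit such as OpenFst) shows a weighted set of inputs being pushed through a weighted transduction as a whole, rather than committing to a single 1-best element first:

```python
# Illustrative only: weighted sets as plain dicts {element: probability}.
# A real system would use weighted finite-state transducers to share
# structure between elements of large or infinite sets.

def compose(weighted_inputs, transduction):
    """Apply a weighted relation to a weighted set of inputs.

    weighted_inputs: {input_string: weight}
    transduction:    {(input_string, output_string): weight}
    Returns the weighted set of outputs, marginalizing over inputs.
    """
    outputs = {}
    for (src, tgt), w in transduction.items():
        if src in weighted_inputs:
            outputs[tgt] = outputs.get(tgt, 0.0) + weighted_inputs[src] * w
    return outputs

# Two competing transcription hypotheses instead of a single 1-best input:
asr_hypotheses = {"recognize speech": 0.6, "wreck a nice beach": 0.4}
translations = {
    ("recognize speech", "erkenne Sprache"): 0.9,
    ("wreck a nice beach", "zerstöre einen schönen Strand"): 0.8,
}
print(compose(asr_hypotheses, translations))
```

Both hypotheses contribute weighted outputs; the decision between them is deferred until after translation instead of being made up front.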
To establish the effectiveness and tractability of the proposed processing model, it is applied to several problems in machine translation. Starting with spoken language translation, it is shown that translating a set of transcription hypotheses yields better translations than a baseline in which a single (1-best) transcription hypothesis is selected and then translated, independent of the translation model formalism used. More subtle forms of ambiguity that arise even in text-only translation (such as decisions conventionally made during system development about how to preprocess text) are then discussed, and it is shown that the ambiguity-preserving paradigm can be employed in these cases as well, again leading to improved translation quality. Finally, a model for supervised learning is introduced that learns from training data in which a set (rather than a single element) of correct labels is provided for each training instance; it is used to learn a model of compound word segmentation, which serves as a preprocessing step in machine translation.
Substring-based Machine Translation
Machine translation is traditionally formulated as the transduction of strings of words from the source to the target language. As a result, additional lexical processing steps such as morphological analysis, transliteration, and tokenization are required to process the internal structure of words and to cope with the data-sparsity issues that occur when words are simply divided according to white space. In this paper, we take a different approach: rather than dividing lexical processing and translation into two steps, we simply view translation as a single transduction between character strings in the source and target languages. In particular, we demonstrate that the key to achieving accuracy on a par with word-based translation in the character-based framework is the use of a many-to-many alignment strategy that can accurately capture correspondences between arbitrary substrings. We build on the alignment method proposed by Neubig et al. (2011), improving its efficiency and accuracy with a focus on character-based translation. Using a many-to-many aligner imbued with these improvements, we demonstrate that the traditional framework of phrase-based machine translation sees large gains in accuracy over character-based translation with more naive alignment methods, and achieves results comparable to word-based translation for two distant language pairs.
Parsing Linear Context-Free Rewriting Systems with Fast Matrix Multiplication
We describe a matrix multiplication recognition algorithm for a subset of binary linear context-free rewriting systems (LCFRS) with running time O(n^(ωd)), where O(n^ω) is the running time for n × n matrix multiplication and d is the "contact rank" of the LCFRS, i.e. the maximal number of combination and non-combination points that appear in the grammar rules. We also show that this algorithm can be used as a subroutine to get a recognition algorithm for general binary LCFRS with running time O(n^(ωd+1)). The currently best known ω is smaller than 2.38. Our result provides another proof for the best known result for parsing mildly context-sensitive formalisms such as combinatory categorial grammars, head grammars, linear indexed grammars, and tree adjoining grammars, which can be parsed in time O(n^(4.76)). It also shows that inversion transduction grammars can be parsed in time O(n^(5.52)). In addition, binary LCFRS subsumes many other formalisms and types of grammars, for some of which we also improve the asymptotic complexity of parsing.
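The connection between parsing and matrix multiplication can be seen in the simplest case: recognition for a binary CFG, where each rule application becomes a boolean matrix product over span endpoints. The sketch below (plain Python, not the paper's algorithm) iterates these products to a fixed point, so it is not asymptotically fast; the cited results come from organising the products via Valiant-style divide and conquer:

```python
# Illustrative sketch: CFG recognition where the CKY combination step is a
# boolean matrix product -- the core idea behind matrix-multiplication
# parsing. The naive fixed-point loop here is NOT asymptotically fast.

def bool_matmul(X, Y):
    n = len(X)
    return [[any(X[i][k] and Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def recognize(words, terminal_rules, binary_rules, start):
    """terminal_rules: {word: set of nonterminals}
       binary_rules:   list of (A, B, C) for A -> B C (Chomsky normal form)."""
    n = len(words)
    nts = ({x for rule in binary_rules for x in rule}
           | set().union(*terminal_rules.values()))
    # chart[A][i][j] == True iff nonterminal A derives words[i:j]
    chart = {A: [[False] * (n + 1) for _ in range(n + 1)] for A in nts}
    for i, w in enumerate(words):
        for A in terminal_rules.get(w, ()):
            chart[A][i][i + 1] = True
    changed = True
    while changed:                      # fixed point over rule applications
        changed = False
        for A, B, C in binary_rules:
            prod = bool_matmul(chart[B], chart[C])
            for i in range(n + 1):
                for j in range(n + 1):
                    if prod[i][j] and not chart[A][i][j]:
                        chart[A][i][j] = True
                        changed = True
    return chart[start][0][n]

# a^n b^n grammar: S -> A X | A B ; X -> S B ; A -> a ; B -> b
term = {"a": {"A"}, "b": {"B"}}
rules = [("S", "A", "X"), ("S", "A", "B"), ("X", "S", "B")]
print(recognize(list("aabb"), term, rules, "S"))   # True
print(recognize(list("aab"), term, rules, "S"))    # False
```

Binary LCFRS generalise this picture: nonterminals span tuples of intervals rather than single intervals, which is where the contact rank d enters the exponent.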
Refinements in hierarchical phrase-based translation systems
The relatively recently proposed hierarchical phrase-based translation model for statistical machine translation (SMT) has achieved state-of-the-art performance in numerous recent translation evaluations. Hierarchical phrase-based systems comprise a pipeline of modules with complex interactions. In this thesis, we propose refinements to the hierarchical phrase-based model, as well as improvements and analyses of various modules in hierarchical phrase-based systems.
We take advantage of the increasing amounts of training data available for machine translation, as well as of existing frameworks for distributed computing, to build better infrastructure for the extraction, estimation and retrieval of hierarchical phrase-based grammars. We design and implement grammar extraction as a series of Hadoop MapReduce jobs. We store the resulting grammar using the HFile format, which offers competitive trade-offs in terms of efficiency and simplicity. We demonstrate improvements over two alternative solutions used in machine translation.
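The extraction-and-estimation pipeline described above has the classic MapReduce shape. The sketch below (hypothetical data and function names, in-process Python rather than Hadoop or HFiles) shows a mapper emitting phrase-pair counts and a reducer turning grouped counts into relative-frequency translation probabilities:

```python
# Illustrative only: the map/shuffle/reduce shape of grammar extraction
# and estimation, run in-process. The thesis uses Hadoop MapReduce and
# stores the resulting grammar in HFiles; all names here are hypothetical.
from collections import defaultdict

def map_extract(sentence_pair):
    """Mapper: emit ((source phrase, target phrase), 1) from one aligned
    sentence pair. Real extraction walks word alignments; here we just
    pair up pre-extracted phrases for illustration."""
    for src, tgt in sentence_pair["phrase_pairs"]:
        yield (src, tgt), 1

def reduce_estimate(grouped):
    """Reducer: turn counts into relative-frequency probabilities P(tgt|src)."""
    totals = defaultdict(int)
    for (src, _), c in grouped.items():
        totals[src] += c
    return {(src, tgt): c / totals[src] for (src, tgt), c in grouped.items()}

corpus = [
    {"phrase_pairs": [("maison", "house"), ("la maison", "the house")]},
    {"phrase_pairs": [("maison", "house"), ("maison", "home")]},
]
# Shuffle phase: group mapper output by key.
grouped = defaultdict(int)
for pair in corpus:
    for key, count in map_extract(pair):
        grouped[key] += count
grammar = reduce_estimate(grouped)
print(grammar[("maison", "house")])   # 2/3
```

Because mappers see sentence pairs independently and reducers see all counts for a key, the same shape scales from this toy loop to a cluster.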
The modular nature of the SMT pipeline, while allowing individual improvements, has the disadvantage that errors committed by one module are propagated to the next. This thesis alleviates this issue between the word alignment module and the grammar extraction and estimation module by considering richer statistics from word alignment models during extraction. We use alignment link and alignment phrase pair posterior probabilities for grammar extraction and estimation, and demonstrate translation improvements in Chinese-to-English translation.
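The idea of extracting from alignment posteriors rather than from a single 1-best alignment can be illustrated in a few lines (a hypothetical sketch; the thesis's actual extraction heuristics are richer than a simple threshold):

```python
# Hypothetical sketch: keep alignment links by posterior probability
# instead of committing to the single Viterbi alignment, so extraction
# can see plausible links that the 1-best alignment discards.

def links_from_posteriors(posteriors, threshold=0.5):
    """posteriors: {(src_pos, tgt_pos): P(link)} from the alignment model."""
    return {link for link, p in posteriors.items() if p >= threshold}

posteriors = {(0, 0): 0.98, (1, 1): 0.55, (1, 2): 0.45, (2, 2): 0.90}
print(sorted(links_from_posteriors(posteriors)))  # [(0, 0), (1, 1), (2, 2)]
```

Lowering the threshold trades alignment precision for recall, which in turn changes which phrase pairs the extractor can reach.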
This thesis also proposes refinements to grammar and language modelling, both in the context of domain adaptation and in the context of the interaction between first-pass decoding and lattice rescoring. We analyse alternative strategies for cross-domain adaptation of grammars and language models. We also study interactions between the first-pass and second-pass language models in terms of size and n-gram order. Finally, we analyse two smoothing methods for rescoring with large 5-gram language models.
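One common cross-domain adaptation strategy in this setting is linear interpolation of an in-domain and an out-of-domain language model, with the mixture weight tuned on held-out data. The toy below (hypothetical per-token probabilities) tunes the weight by minimising held-out perplexity:

```python
# Hypothetical sketch of language model interpolation for domain
# adaptation: P(w|h) = lam * P_in(w|h) + (1 - lam) * P_out(w|h),
# with lam chosen to minimise perplexity on held-out in-domain text.
import math

def interpolate(p_in, p_out, lam):
    return lam * p_in + (1 - lam) * p_out

def perplexity(probs):
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

# Toy per-token probabilities of one held-out sentence under each LM:
p_in_domain  = [0.20, 0.05, 0.30]
p_out_domain = [0.10, 0.15, 0.05]

# Grid-search the mixture weight on the held-out probabilities.
best = min((perplexity([interpolate(a, b, lam / 10)
                        for a, b in zip(p_in_domain, p_out_domain)]), lam / 10)
           for lam in range(11))
print(best)  # (best perplexity, best lambda)
```

Even on this toy data the optimum is a mixture rather than either model alone, which is the typical finding in cross-domain adaptation.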
The last two chapters are devoted to the application of phrase-based grammars to the string regeneration task, which we consider as a means to study the fluency of machine translation output. We design and implement a monolingual phrase-based decoder for string regeneration and achieve state-of-the-art performance on this task. By applying our decoder to the output of a hierarchical phrase-based translation system, we are able to recover the same level of translation quality as the translation system.
Parsing Inside-Out
The inside-outside probabilities are typically used for reestimating probabilistic context-free grammars (PCFGs), just as the forward-backward probabilities are typically used for reestimating HMMs. I show several novel uses, including improving parser accuracy by matching parsing algorithms to evaluation criteria; speeding up DOP parsing by a factor of 500; and PCFG thresholding that is 30 times faster at a given accuracy level. I also give an elegant, state-of-the-art grammar formalism, which can be used to compute inside-outside probabilities, and a parser description formalism, which makes it easy to derive inside-outside formulas and many others. (Ph.D. thesis, 257 pages, 40 PostScript figures.)
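The inside probabilities discussed above can be computed with a CKY-style dynamic program. The sketch below (a toy PCFG with hypothetical rule probabilities) computes inside values bottom-up for a grammar in Chomsky normal form; the sentence probability is the inside value of the start symbol over the whole string:

```python
# Illustrative sketch of the inside algorithm for a PCFG in Chomsky
# normal form. inside[(A, i, j)] is the total probability that
# nonterminal A derives words[i:j].
from collections import defaultdict

def inside_probs(words, lexical, binary, start):
    """lexical: {(A, word): prob}; binary: {(A, B, C): prob} for A -> B C."""
    n = len(words)
    inside = defaultdict(float)
    for i, w in enumerate(words):                    # width-1 spans
        for (A, word), p in lexical.items():
            if word == w:
                inside[(A, i, i + 1)] += p
    for width in range(2, n + 1):                    # wider spans, bottom-up
        for i in range(n - width + 1):
            j = i + width
            for (A, B, C), p in binary.items():
                for k in range(i + 1, j):            # sum over split points
                    inside[(A, i, j)] += p * inside[(B, i, k)] * inside[(C, k, j)]
    return inside[(start, 0, n)]

lexical = {("N", "time"): 0.5, ("N", "flies"): 0.3, ("V", "flies"): 1.0,
           ("N", "arrow"): 0.2, ("D", "an"): 1.0, ("P", "like"): 1.0}
binary = {("S", "N", "VP"): 1.0, ("VP", "V", "PP"): 1.0,
          ("PP", "P", "NP"): 1.0, ("NP", "D", "N"): 1.0}
p = inside_probs("time flies like an arrow".split(), lexical, binary, "S")
print(p)  # 0.5 * 1.0 * 1.0 * 1.0 * 1.0 * 0.2 = 0.1
```

Outside probabilities are computed by a symmetric top-down pass, and the product of the two gives the expected rule counts used in reestimation.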
Rich Linguistic Structure from Large-Scale Web Data
The past two decades have shown an unexpected effectiveness of Web-scale data in natural language processing. Even the simplest models, when paired with unprecedented amounts of unstructured and unlabeled Web data, have been shown to outperform sophisticated ones. It has been argued that the effectiveness of Web-scale data has undermined the necessity of sophisticated modeling or laborious data set curation. In this thesis, we argue for and illustrate an alternative view: that Web-scale data not only serves to improve the performance of simple models, but also allows the use of qualitatively more sophisticated models that would not be deployable otherwise, leading to even further performance gains.
Modeling RNA, protein, and synthetic molecules using coarse-grained and all-atom representations
The aim of computational chemistry is to depict and understand the dynamics and interactions of molecular systems. In addition to increased comprehension in the physical and life sciences, this insight yields important applications to therapeutic design and materials science. In computational chemistry, molecules can be modeled in a number of representations depending on the molecular system and phenomena of interest. In this work, both simplified, coarse-grained representations and all-atom representations are used to model the interactions of RNA, cucurbituril host-guest chemistry, and cadmium selenide quantum dot binding to the Src homology 3 domain.
For RNA, a coarse-grained model termed RACER (RnA CoarsE-gRained) was developed to accurately predict RNA structure and folding free energy. After optimization against statistical potentials, RACER accurately predicted the structures of 14 RNAs, with an average 4.15 Å root mean square deviation (RMSD) from the experimental structures. Further, RACER captured the sequence-specific variation in folding free energy for a set of 6 RNA hairpins and 5 RNA duplexes, with an R² correlation of 0.96 to experiment.
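The RMSD figure quoted above is the root mean square deviation over corresponding atom positions. A minimal sketch (assuming the predicted and experimental structures are already optimally superimposed, which a real comparison would first ensure):

```python
# Hypothetical sketch of the RMSD metric between two structures,
# given matched atom coordinates after superposition.
import math

def rmsd(coords_a, coords_b):
    """coords_*: lists of (x, y, z) tuples, same length and atom order."""
    assert len(coords_a) == len(coords_b)
    sq = sum((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
             for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b))
    return math.sqrt(sq / len(coords_a))

predicted    = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
experimental = [(0.0, 0.0, 0.0), (1.0, 3.0, 4.0)]
print(rmsd(predicted, experimental))
```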
The binding free energies of a cucurbituril host with 14 guests were computed using a polarizable force field and the free energy techniques of Bennett acceptance ratio and the orthogonal space random walk. The polarizable force field captured binding accurately, yet unexpectedly, the orthogonal space random walk method converged slowly, albeit at still reduced computational expense to the Bennett acceptance ratio.
Lastly, the nanotoxicity effects of trioctylphosphine oxide-coated cadmium selenide quantum dots are investigated with the model Src homology 3 protein domain in complex with its native proline-rich motif ligand. With increasing quantum dot concentration, there is an increasing preference for the quantum dots to bind to the proline-rich motif active site, inhibiting Src homology 3 function.