180 research outputs found

    A Survey of Word Reordering in Statistical Machine Translation: Computational Models and Language Phenomena

    Get PDF
    Word reordering is one of the most difficult aspects of statistical machine translation (SMT), and an important factor of its quality and efficiency. Despite the vast amount of research published to date, the interest of the community in this problem has not decreased, and no single method appears to be strongly dominant across language pairs. Instead, the choice of the optimal approach for a new translation task still seems to be mostly driven by empirical trials. To orientate the reader in this vast and complex research area, we present a comprehensive survey of word reordering viewed as a statistical modeling challenge and as a natural language phenomenon. The survey describes in detail how word reordering is modeled within different string-based and tree-based SMT frameworks and as a stand-alone task, including systematic overviews of the literature in advanced reordering modeling. We then question why some approaches are more successful than others in different language pairs. We argue that, besides measuring the amount of reordering, it is important to understand which kinds of reordering occur in a given language pair. To this end, we conduct a qualitative analysis of word reordering phenomena in a diverse sample of language pairs, based on a large collection of linguistic knowledge. Empirical results in the SMT literature are shown to support the hypothesis that a few linguistic facts can be very useful to anticipate the reordering characteristics of a language pair and to select the SMT framework that best suits them.Comment: 44 pages, to appear in Computational Linguistic

    Lexico-syntactic Text Simplification And Compression With Typed Dependencies

    Get PDF
    We describe two systems for text simplification using typed dependency structures, one that performs lexical and syntactic simplification, and another that performs sentence compression optimised to satisfy global text constraints such as lexical density, the ratio of difficult words, and text length. We report a substantial evaluation that demonstrates the superiority of our systems, individually and in combination, over the state of the art, and also report a comprehension based evaluation of contemporary automatic text simplification systems with target non-native readers

    Learning words and syntactic cues in highly ambiguous contexts

    Get PDF
    The cross-situational word learning paradigm argues that word meanings can be approximated by word-object associations, computed from co-occurrence statistics between words and entities in the world. Lexicon acquisition involves simultaneously guessing (1) which objects are being talked about (the ”meaning”) and (2) which words relate to those objects. However, most modeling work focuses on acquiring meanings for isolated words, largely neglecting relationships between words or physical entities, which can play an important role in learning. Semantic parsing, on the other hand, aims to learn a mapping between entire utterances and compositional meaning representations where such relations are central. The focus is the mapping between meaning and words, while utterance meanings are treated as observed quantities. Here, we extend the joint inference problem of word learning to account for compositional meanings by incorporating a semantic parsing model for relating utterances to non-linguistic context. Integrating semantic parsing and word learning permits us to explore the impact of word-word and concept-concept relations. The result is a joint-inference problem inherited from the word learning setting where we must simultaneously learn utterance-level and individual word meanings, only now we also contend with the many possible relationships between concepts in the meaning and words in the sentence. To simplify design, we factorize the model into separate modules, one for each of the world, the meaning, and the words, and merge them into a single synchronous grammar for joint inference. There are three main contributions. First, we introduce a novel word learning model and accompanying semantic parser. Second, we produce a corpus which allows us to demonstrate the importance of structure in word learning. Finally, we also present a number of technical innovations required for implementing such a model

    Wider Pipelines: N-Best Alignments and Parses in MT Training

    Get PDF
    State-of-the-art statistical machine translation systems use hypotheses from several maximum a posteriori inference steps, including word alignments and parse trees, to identify translational structure and estimate the parameters of translation models. While this approach leads to a modular pipeline of independently developed components, errors made in these “single-best” hypotheses can propagate to downstream estimation steps that treat these inputs as clean, trustworthy training data. In this work we integrate N-best alignments and parses by using a probability distribution over these alternatives to generate posterior fractional counts for use in downstream estimation. Using these fractional counts in a DOP-inspired syntax-based translation system, we show significant improvements in translation quality over a single-best trained baseline

    Fine-Grained Linguistic Soft Constraints on Statistical Natural Language Processing Models

    Get PDF
    This dissertation focuses on effective combination of data-driven natural language processing (NLP) approaches with linguistic knowledge sources that are based on manual text annotation or word grouping according to semantic commonalities. I gainfully apply fine-grained linguistic soft constraints -- of syntactic or semantic nature -- on statistical NLP models, evaluated in end-to-end state-of-the-art statistical machine translation (SMT) systems. The introduction of semantic soft constraints involves intrinsic evaluation on word-pair similarity ranking tasks, extension from words to phrases, application in a novel distributional paraphrase generation technique, and an introduction of a generalized framework of which these soft semantic and syntactic constraints can be viewed as instances, and in which they can be potentially combined. Fine granularity is key in the successful combination of these soft constraints, in many cases. I show how to softly constrain SMT models by adding fine-grained weighted features, each preferring translation of only a specific syntactic constituent. Previous attempts using coarse-grained features yielded negative results. I also show how to softly constrain corpus-based semantic models of words (“distributional profiles”) to effectively create word-sense-aware models, by using semantic word grouping information found in a manually compiled thesaurus. Previous attempts, using hard constraints and resulting in aggregated, coarse-grained models, yielded lower gains. A novel paraphrase generation technique incorporating these soft semantic constraints is then also evaluated in a SMT system. This paraphrasing technique is based on the Distributional Hypothesis. The main advantage of this novel technique over current “pivoting” techniques for paraphrasing is the independence from parallel texts, which are a limited resource. The evaluation is done by augmenting translation models with paraphrase-based translation rules, where fine-grained scoring of paraphrase-based rules yields significantly higher gains. The model augmentation includes a novel semantic reinforcement component: In many cases there are alternative paths of generating a paraphrase-based translation rule. Each of these paths reinforces a dedicated score for the “goodness” of the new translation rule. This augmented score is then used as a soft constraint, in a weighted log-linear feature, letting the translation model learn how much to “trust” the paraphrase-based translation rules. The work reported here is the first to use distributional semantic similarity measures to improve performance of an end-to-end phrase-based SMT system. The unified framework for statistical NLP models with soft linguistic constraints enables, in principle, the combination of both semantic and syntactic constraints -- and potentially other constraints, too -- in a single SMT model

    A Formal Model of Ambiguity and its Applications in Machine Translation

    Get PDF
    Systems that process natural language must cope with and resolve ambiguity. In this dissertation, a model of language processing is advocated in which multiple inputs and multiple analyses of inputs are considered concurrently and a single analysis is only a last resort. Compared to conventional models, this approach can be understood as replacing single-element inputs and outputs with weighted sets of inputs and outputs. Although processing components must deal with sets (rather than individual elements), constraints are imposed on the elements of these sets, and the representations from existing models may be reused. However, to deal efficiently with large (or infinite) sets, compact representations of sets that share structure between elements, such as weighted finite-state transducers and synchronous context-free grammars, are necessary. These representations and algorithms for manipulating them are discussed in depth in depth. To establish the effectiveness and tractability of the proposed processing model, it is applied to several problems in machine translation. Starting with spoken language translation, it is shown that translating a set of transcription hypotheses yields better translations compared to a baseline in which a single (1-best) transcription hypothesis is selected and then translated, independent of the translation model formalism used. More subtle forms of ambiguity that arise even in text-only translation (such as decisions conventionally made during system development about how to preprocess text) are then discussed, and it is shown that the ambiguity-preserving paradigm can be employed in these cases as well, again leading to improved translation quality. A model for supervised learning that learns from training data where sets (rather than single elements) of correct labels are provided for each training instance and use it to learn a model of compound word segmentation is also introduced, which is used as a preprocessing step in machine translation

    Modeling Graph Languages with Grammars Extracted via Tree Decompositions

    Get PDF
    Work on probabilistic models of natural language tends to focus on strings and trees, but there is increasing interest in more general graph-shaped structures since they seem to be better suited for representing natural language semantics, ontologies, or other varieties of knowledge structures. However, while there are relatively simple approaches to defining generative models over strings and trees, it has proven more challenging for more general graphs. This paper describes a natural generalization of the n-gram to graphs, making use of Hyperedge Replacement Grammars to define generative models of graph languages.9 page(s
    corecore