
    Robust Tuning Datasets for Statistical Machine Translation

    We explore the idea of automatically crafting a tuning dataset for Statistical Machine Translation (SMT) that makes the hyper-parameters of the SMT system more robust with respect to some specific deficiencies of the parameter tuning algorithms. This is an under-explored research direction, which can allow better parameter tuning. In this paper, we achieve this goal by selecting a subset of the available sentence pairs that is more suitable for specific combinations of optimizers, objective functions, and evaluation measures. We demonstrate the potential of the idea with the pairwise ranking optimization (PRO) optimizer, which is known to yield translations that are too short. We show that this learning problem can be alleviated by tuning on a subset of the development set, selected based on sentence length. In particular, using the longest 50% of the tuning sentences, we achieve a two-fold tuning speedup and improvements in BLEU score that rival those of alternatives, which fix BLEU+1's smoothing instead.
    Comment: RANLP-201
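The core selection step described in the abstract, keeping only the longest half of the development set, can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline; the function name and data layout are assumptions.

```python
# Hypothetical sketch: build a length-based tuning subset by keeping the
# longest 50% of development sentence pairs, following the idea of
# countering PRO's bias toward overly short translations.
def select_longest_half(dev_pairs):
    # dev_pairs: list of (source_sentence, reference_sentence) strings
    ranked = sorted(dev_pairs, key=lambda p: len(p[1].split()), reverse=True)
    return ranked[: max(1, len(ranked) // 2)]

pairs = [
    ("a b", "x y"),
    ("a", "x"),
    ("a b c d", "w x y z"),
    ("a b c", "x y z"),
]
subset = select_longest_half(pairs)
# keeps the two pairs with the longest reference sides
```

The tuner (e.g. PRO) would then be run on `subset` instead of the full development set, which is also where the reported two-fold speedup comes from.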

    Cortical Representation Underlying the Semantic Processing of Numerical Symbols: Evidence from Adult and Developmental Studies

    Humans possess the remarkable ability to process numerical information using numerical symbols such as Arabic digits. A growing body of neuroimaging work has provided new insights into the neural correlates associated with symbolic numerical magnitude processing. However, little is known about the cortical specialization underlying the representation of symbolic numerical magnitude in adults and children. To constrain our current knowledge, I conducted a series of functional Magnetic Resonance Imaging (fMRI) studies that aimed to better understand the functional specialization of symbolic numerical magnitude representation in the human brain. Using a number line estimation task, the first study contrasted the brain activation associated with processing symbolic numerical magnitude against the brain activation associated with non-numerical magnitude (brightness) processing. Results demonstrated a right-lateralized parietal network that was commonly engaged when magnitude dimensions were processed. However, the left intraparietal sulcus (IPS) was additionally activated when symbolic numerical magnitudes were estimated, suggesting that number is a special category amongst magnitude dimensions and that the left hemisphere plays a critical role in representing number. The second study tested a child-friendly version of an fMRI-adaptation paradigm in adults. For this, participants' brain responses were habituated to a numerical value (i.e., 6), and signal recovery in response to the presentation of numerical deviants was investigated. Across two different brain normalization procedures, results replicated previous findings demonstrating that the brain response of the IPS is modulated by the semantic meaning of numbers in the absence of overt response selection. The last study aimed to unravel developmental changes in the cortical representation of symbolic numerical magnitudes in children.
Using the paradigm tested in chapter 2, results demonstrated an increase in signal recovery with age in the left IPS as well as an age-independent signal recovery in the right IPS. This finding indicates that the left IPS becomes increasingly specialized for the representation of symbolic numerical magnitudes over developmental time, while the right IPS may play a different and earlier role in symbolic numerical magnitude representation. Findings of these studies are discussed in relation to our current knowledge about symbolic numerical magnitude representation.

    Word Reordering in Statistical Machine Translation with a POS-Based Distortion Model

    In this paper we describe a word reordering strategy for statistical machine translation that reorders the source side based on Part-of-Speech (POS) information. Reordering rules are learned from the word-aligned corpus. Reordering is integrated into the decoding process by constructing a lattice that contains all word reorderings according to the reordering rules. Probabilities are assigned to the different reorderings, and monotone decoding is performed on this lattice. This reordering strategy is compared with our previous reordering strategy, which considers all permutations within a sliding window. We extend the reordering rules by adding context information. Phrase translation pairs are learned from the original corpus and from a reordered source corpus to better capture the reordered word sequences at decoding time. Results are presented for English → Spanish and German ↔ English translations, using the European Parliament Plenary Sessions corpus.
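A single learned rule of the kind described above can be pictured as a POS pattern plus a permutation applied to the matching source window. The sketch below shows one rule applied once; the rule format and function name are hypothetical, and the real system expands all matches into a lattice rather than committing to one reordering.

```python
# Illustrative sketch of applying a POS-based reordering rule to a
# source sentence before monotone decoding (rule format is assumed).
def apply_rule(words, pos_tags, rule):
    # rule: (pos_pattern, permutation); e.g. (("NN", "JJ"), (1, 0))
    # swaps a noun-adjective pair into adjective-noun order.
    pattern, perm = rule
    n = len(pattern)
    for start in range(len(pos_tags) - n + 1):
        if tuple(pos_tags[start:start + n]) == pattern:
            window = words[start:start + n]
            return words[:start] + [window[i] for i in perm] + words[start + n:]
    return words  # no match: leave the sentence unchanged

# Spanish noun-adjective order reordered toward English adjective-noun order:
print(apply_rule(["casa", "blanca"], ["NN", "JJ"], (("NN", "JJ"), (1, 0))))
```

In the full system, every rule match contributes an alternative path to the lattice, weighted by the rule probability, and the decoder chooses among paths monotonically.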

    Consensus Versus Expertise: A Case Study of Word Alignment with Mechanical Turk

    Word alignment is an important preprocessing step for machine translation. This project aims at incorporating manual alignments from Amazon Mechanical Turk (MTurk) to help improve word alignment quality. As a global crowdsourcing service, MTurk can provide a flexible and abundant labor force and therefore reduce the cost of obtaining labels. An easy-to-use interface is developed to simplify the labeling process. We compare the alignment results of Turkers to those of experts, and incorporate the alignments in a semi-supervised word alignment tool to improve the quality of the labels. We also compare two pricing strategies for the word alignment task. Experimental results show high precision for the alignments provided by Turkers, and the semi-supervised approach achieves a 0.5% absolute reduction in alignment error rate.
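The metric reported here, alignment error rate (AER), is computed from a hypothesis alignment and gold-standard sure (S) and possible (P) link sets. A minimal sketch of the standard formula:

```python
# Sketch of Alignment Error Rate (AER): 1 - (|A∩S| + |A∩P|) / (|A| + |S|),
# where A is the hypothesis, S the sure gold links, and P ⊇ S the
# possible gold links. Links are (source_index, target_index) pairs.
def aer(hypothesis, sure, possible):
    a = hypothesis
    p = possible | sure  # ensure the possible set contains the sure set
    return 1.0 - (len(a & sure) + len(a & p)) / (len(a) + len(sure))

hyp = {(0, 0), (1, 1), (2, 2)}
s = {(0, 0), (1, 1)}
p = {(2, 2)}
print(aer(hyp, s, p))  # -> 0.0 (every hypothesis link is sure or possible)
```

Lower is better; a 0.5% absolute reduction means the semi-supervised tool's AER dropped by 0.005.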

    EMDC: A Semi-supervised Approach for Word Alignment

    This paper proposes a novel semi-supervised word alignment technique called EMDC that integrates discriminative and generative methods. A discriminative aligner is used to find high-precision partial alignments that serve as constraints for a generative aligner, which implements a constrained version of the EM algorithm. Experiments on small-size Chinese and Arabic tasks show consistent improvements in AER. We also experimented with moderate-size Chinese machine translation tasks and obtained an average improvement of 0.5 BLEU points across five standard NIST test sets and four other test sets.
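The constrained-EM idea can be illustrated with IBM Model 1, where high-precision partial links pin down the alignment of certain target words during the E-step. This is a toy sketch under assumed data structures, not EMDC's actual implementation:

```python
# Minimal sketch of constrained EM for IBM Model 1 word alignment: fixed
# links (from a hypothetical discriminative aligner) force the E-step
# posterior for the constrained target words onto their pinned source words.
from collections import defaultdict

def model1_constrained_em(bitext, constraints, iterations=5):
    # bitext: list of (src_words, tgt_words) pairs
    # constraints: dict sentence_index -> set of fixed (src_i, tgt_j) links
    t = defaultdict(lambda: 1.0)  # translation prob t(tgt|src), uniform init
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for k, (src, tgt) in enumerate(bitext):
            fixed = constraints.get(k, set())
            for j, tw in enumerate(tgt):
                forced = [i for (i, jj) in fixed if jj == j]
                # constrained target word: all posterior mass on its link(s)
                cand = forced if forced else list(range(len(src)))
                z = sum(t[(src[i], tw)] for i in cand)
                for i in cand:  # E-step: expected counts
                    p = t[(src[i], tw)] / z
                    count[(src[i], tw)] += p
                    total[src[i]] += p
        t = defaultdict(lambda: 1e-9)  # M-step: renormalize
        for (sw, tw), c in count.items():
            t[(sw, tw)] = c / total[sw]
    return t

bitext = [("a b".split(), "x y".split()), (["a"], ["x"])]
constraints = {0: {(0, 0)}}  # pin "a"–"x" in the first sentence pair
t = model1_constrained_em(bitext, constraints)
```

With the "a"–"x" link pinned, the learned table concentrates probability on t(x|a), which is the mechanism by which a few reliable partial alignments can steer the generative model.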

    Word Alignment Based On Bilingual Bracketing

    In this paper, an improved word alignment based on bilingual bracketing is described. The explored approaches include using Model-1 conditional probability, a boosting strategy for lexicon probabilities based on importance sampling, applying Part-of-Speech tags to discriminate English words, and incorporating information from English base noun phrases. The results of the shared task on French-English, Romanian-English and Chinese-English word alignments are presented and discussed.

    Combination of Machine Translation Systems via Hypothesis Selection from Combined n-best lists

    Different approaches in machine translation achieve similar translation quality with a variety of translations in the output. Recently it has been shown that it is possible to leverage the individual strengths of various systems and improve the overall translation quality by combining translation outputs. In this paper we present a method of hypothesis selection which is relatively simple compared to system combination methods that construct a synthesis of the input hypotheses. Our method uses information from the n-best lists of several MT systems together with sentence-level features that are independent of the MT systems involved to improve the translation quality.
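The selection scheme described above reduces to pooling the systems' n-best lists and ranking the pooled hypotheses by a system-independent sentence-level score. A minimal sketch with a toy feature (the real features are more elaborate, and the scoring function here is a placeholder):

```python
# Hedged sketch of hypothesis selection from combined n-best lists: pool
# all systems' candidates and return the one maximizing a sentence-level
# feature score that does not depend on which system produced it.
def select_hypothesis(nbest_lists, score_fn):
    pooled = [hyp for nbest in nbest_lists for hyp in nbest]
    return max(pooled, key=score_fn)

lists = [
    ["the house is white", "house white"],  # system 1's n-best list
    ["the white house"],                    # system 2's n-best list
]
# toy system-independent feature: prefer longer (plausibly more complete) output
best = select_hypothesis(lists, lambda h: len(h.split()))
print(best)  # -> 'the house is white'
```

Because selection only ranks existing hypotheses, it avoids the search and alignment machinery that synthesis-based system combination requires, at the cost of never producing a translation outside the pooled lists.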