3,164 research outputs found

    Bayesian reordering model with feature selection

    In phrase-based statistical machine translation systems, variation in grammatical structures between source and target languages can cause large movements of phrases. Modeling such movements is crucial in achieving translations of long sentences that appear natural in the target language. We explore a generative learning approach to phrase reordering in Arabic-to-English translation. Formulating the reordering problem as a classification problem and using naive Bayes with feature selection, we achieve an improvement in the BLEU score over a lexicalized reordering model. The proposed model is compact, fast, and scalable to a large corpus.
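
    As a rough illustration of treating reordering as classification, the sketch below trains a multinomial naive Bayes classifier with add-one smoothing to predict a phrase orientation label. It is not the paper's model: the feature strings, the two labels ("monotone", "swap"), the toy training pairs, and the function names are all hypothetical, and the feature selection step mentioned in the abstract is omitted.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples, alpha=1.0):
    """Count labels and per-label features from (features, label) pairs."""
    label_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocab = set()
    for features, label in examples:
        label_counts[label] += 1
        for f in features:
            feature_counts[label][f] += 1
            vocab.add(f)
    return label_counts, feature_counts, vocab, alpha

def classify(model, features):
    """Return the label with the highest smoothed log posterior."""
    label_counts, feature_counts, vocab, alpha = model
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total)  # log prior
        denom = sum(feature_counts[label].values()) + alpha * len(vocab)
        for f in features:
            if f in vocab:  # features never seen in training are ignored
                score += math.log((feature_counts[label][f] + alpha) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy data: features describe a phrase pair, labels its orientation.
examples = [
    (["src_first=al", "tgt_first=the"], "monotone"),
    (["src_first=al", "tgt_first=the"], "monotone"),
    (["src_first=qala", "tgt_first=said"], "swap"),
    (["src_first=qala", "tgt_first=he"], "swap"),
]
model = train_naive_bayes(examples)
print(classify(model, ["src_first=qala", "tgt_first=said"]))  # swap
```

    In a real system the predicted orientation probabilities would feed the decoder as a reordering feature rather than be used as a hard decision.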

    A Survey of Word Reordering in Statistical Machine Translation: Computational Models and Language Phenomena

    Word reordering is one of the most difficult aspects of statistical machine translation (SMT), and an important factor in its quality and efficiency. Despite the vast amount of research published to date, the interest of the community in this problem has not decreased, and no single method appears to be strongly dominant across language pairs. Instead, the choice of the optimal approach for a new translation task still seems to be mostly driven by empirical trials. To orient the reader in this vast and complex research area, we present a comprehensive survey of word reordering viewed as a statistical modeling challenge and as a natural language phenomenon. The survey describes in detail how word reordering is modeled within different string-based and tree-based SMT frameworks and as a stand-alone task, including systematic overviews of the literature in advanced reordering modeling. We then question why some approaches are more successful than others in different language pairs. We argue that, besides measuring the amount of reordering, it is important to understand which kinds of reordering occur in a given language pair. To this end, we conduct a qualitative analysis of word reordering phenomena in a diverse sample of language pairs, based on a large collection of linguistic knowledge. Empirical results in the SMT literature are shown to support the hypothesis that a few linguistic facts can be very useful to anticipate the reordering characteristics of a language pair and to select the SMT framework that best suits them. Comment: 44 pages, to appear in Computational Linguistics.

    A discriminative latent variable-based "DE" classifier for Chinese–English SMT

    Syntactic reordering on the source side is an effective way of handling word order differences. The (DE) construction is a flexible and ubiquitous syntactic structure in Chinese which is a major source of error in translation quality. In this paper, we propose a new classifier model, the discriminative latent variable model (DPLVM), to classify the DE construction in order to improve the accuracy of the classification and hence the translation quality. We also propose a new feature which can automatically learn the reordering rules to a certain extent. The experimental results show that the MT systems using the data reordered by our proposed model outperform the baseline systems by 6.42% and 3.08% relative points in terms of the BLEU score on PB-SMT and hierarchical phrase-based MT respectively. In addition, we analyse the impact of DE annotation on word alignment and on the SMT phrase table.

    Source-side context-informed hypothesis alignment for combining outputs from machine translation systems

    This paper presents a new hypothesis alignment method for combining outputs of multiple machine translation (MT) systems. Traditional hypothesis alignment algorithms such as TER, HMM and IHMM do not directly utilise the context information of the source side but rather address the alignment issues via the output data itself. In this paper, a source-side context-informed (SSCI) hypothesis alignment method is proposed to address the word alignment and word reordering issues. First of all, the source–target word alignment links are produced as the hidden variables by exporting source phrase spans during the translation decoding process. Secondly, a mapping strategy and normalisation model are employed to acquire the 1-to-1 alignment links and build the confusion network (CN). The source-side context-based method outperforms the state-of-the-art TER-based alignment model in our experiments on the WMT09 English-to-French and NIST Chinese-to-English data sets. Experimental results demonstrate that our proposed approach scores consistently among the best results across different data and language pair conditions.

    Statistical Machine Translation Features with Multitask Tensor Networks

    We present a three-pronged approach to improving Statistical Machine Translation (SMT), building on recent success in the application of neural networks to SMT. First, we propose new features based on neural networks to model various non-local translation phenomena. Second, we augment the architecture of the neural network with tensor layers that capture important higher-order interaction among the network units. Third, we apply multitask learning to estimate the neural network parameters jointly. Each of our proposed methods results in significant improvements that are complementary. The overall improvement is +2.7 and +1.8 BLEU points for Arabic-English and Chinese-English translation over a state-of-the-art system that already includes neural network features. Comment: 11 pages (9 content + 2 references), 2 figures, accepted to ACL 2015 as a long paper.

    Using collocation segmentation to augment the phrase table

    This paper describes the 2010 phrase-based statistical machine translation system developed at the TALP Research Center of the UPC in cooperation with BMIC and VMU. In phrase-based SMT, the phrase table is the main tool in translation. It is created by extracting phrases from an aligned parallel corpus and then computing translation model scores with them. Performing a collocation segmentation over the source and target corpus before the alignment causes different and larger phrases to be extracted from the same original documents. We performed this segmentation and used the union of this phrase set with the phrase set extracted from the unsegmented corpus to compute the phrase table. We present the configurations considered and also report results obtained with internal and official test sets. Postprint (published version).

    An incremental three-pass system combination framework by combining multiple hypothesis alignment methods

    System combination has been applied successfully to various machine translation tasks in recent years. As is known, the hypothesis alignment method is a critical factor for the translation quality of system combination. To date, many effective hypothesis alignment metrics have been proposed and applied to system combination, such as TER, HMM, ITER, IHMM, and SSCI. In addition, Minimum Bayes-risk (MBR) decoding and confusion networks (CN) have become state-of-the-art techniques in system combination. In this paper, we examine different hypothesis alignment approaches, investigate how much the hypothesis alignment results affect system combination, and finally present a three-pass system combination strategy that can combine hypothesis alignment results derived from multiple alignment metrics to generate a better translation. Firstly, these different alignment metrics are applied to align the backbone and hypotheses, and an individual CN is built for each set of alignment results; then we construct a ‘super network’ by merging the multiple metric-based CNs to generate a consensus output. Finally, a modified MBR network approach is employed to find the best overall translation. Our proposed strategy outperforms the best single confusion network as well as the best single system in our experiments on the NIST Chinese-to-English test set and the WMT2009 English-to-French system combination shared test set.
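
    To make the MBR idea concrete, here is a minimal sketch of plain sentence-level MBR selection over a list of hypotheses. It is not the modified MBR network approach of the paper: the loss function (word-level edit distance), the uniform posterior, the example sentences, and the function names are assumptions chosen for illustration only.

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]

def mbr_select(hyps, probs):
    """Pick the hypothesis with the lowest expected loss under probs."""
    best, best_risk = None, float("inf")
    for h in hyps:
        risk = sum(p * edit_distance(h.split(), other.split())
                   for other, p in zip(hyps, probs))
        if risk < best_risk:
            best, best_risk = h, risk
    return best

# Hypothetical outputs from three systems, weighted uniformly.
hyps = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "a cat is on the mat",
]
probs = [1 / 3, 1 / 3, 1 / 3]
print(mbr_select(hyps, probs))  # the cat sat on the mat
```

    The selected sentence is the one closest on average to all the others, which is why this kind of selection is often used to choose a consensus-like hypothesis.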

    A detailed analysis of phrase-based and syntax-based machine translation: the search for systematic differences

    This paper describes a range of automatic and manual comparisons of phrase-based and syntax-based statistical machine translation methods applied to English-German and English-French translation of user-generated content. The syntax-based methods underperform the phrase-based models, and the relaxation of syntactic constraints to broaden translation rule coverage means that these models do not necessarily generate output which is more grammatical than the output produced by the phrase-based models. Although the systems generate different output and can potentially be fruitfully combined, the lack of systematic difference between these models makes the combination task more challenging.

    A Survey of Word Reordering Model in Statistical Machine Translation

    Machine translation is the process of translating one natural language into another by computer. In statistical machine translation, word reordering is a major challenge for distant language pairs and an important factor in translation quality and efficiency. It is especially challenging for Indian languages, which differ greatly in structure from English, as in the English-Hindi pair. This paper describes statistical machine translation, reordering models, and reordering types.