5,190 research outputs found

    FBK @ IWSLT 2010

    Get PDF
    This year FBK took part in the BTEC translation task, with source languages Arabic and Turkish and target language English, and in the new TALK task, source English and target French. We worked in the framework of phrase-based statistical machine translation aiming to improve coverage of models in presence of rich morphology, on one side, and to make better use of available resources through data selection techniques. New morphological segmentation rules were developed for Turkish-English. The combination of several Turkish segmentation schemes into a lattice input led to an improvement wrt to last year. The use of additional training data was explored for Arabic-English, while on the English to French task improvement was achieved over a strong baseline by automatically selecting relevant and high quality data from the available training corpora

    Joint morphological-lexical language modeling for processing morphologically rich languages with application to dialectal Arabic

    Get PDF
    Language modeling for an inflected language such as Arabic poses new challenges for speech recognition and machine translation due to its rich morphology. Rich morphology results in large increases in out-of-vocabulary (OOV) rate and poor language model parameter estimation in the absence of large quantities of data. In this study, we present a joint morphological-lexical language model (JMLLM) that takes advantage of Arabic morphology. JMLLM combines morphological segments with the underlying lexical items and additional available information sources with regards to morphological segments and lexical items in a single joint model. Joint representation and modeling of morphological and lexical items reduces the OOV rate and provides smooth probability estimates while keeping the predictive power of whole words. Speech recognition and machine translation experiments in dialectal-Arabic show improvements over word and morpheme based trigram language models. We also show that as the tightness of integration between different information sources increases, both speech recognition and machine translation performances improve

    CCG contextual labels in hierarchical phrase-based SMT

    Get PDF
    In this paper, we present a method to employ target-side syntactic contextual information in a Hierarchical Phrase-Based system. Our method uses Combinatory Categorial Grammar (CCG) to annotate training data with labels that represent the left and right syntactic context of target-side phrases. These labels are then used to assign labels to nonterminals in hierarchical rules. CCG-based contextual labels help to produce more grammatical translations by forcing phrases which replace nonterminals during translations to comply with the contextual constraints imposed by the labels. We present experiments which examine the performance of CCG contextual labels on Chinese–English and Arabic–English translation in the news and speech expressions domains using different data sizes and CCG-labeling settings. Our experiments show that our CCG contextual labels-based system achieved a 2.42% relative BLEU improvement over a PhraseBased baseline on Arabic–English translation and a 1% relative BLEU improvement over a Hierarchical Phrase-Based system baseline on Chinese–English translation

    Exploiting alignment techniques in MATREX: the DCU machine translation system for IWSLT 2008

    Get PDF
    In this paper, we give a description of the machine translation (MT) system developed at DCU that was used for our third participation in the evaluation campaign of the International Workshop on Spoken Language Translation (IWSLT 2008). In this participation, we focus on various techniques for word and phrase alignment to improve system quality. Specifically, we try out our word packing and syntax-enhanced word alignment techniques for the Chinese–English task and for the English–Chinese task for the first time. For all translation tasks except Arabic–English, we exploit linguistically motivated bilingual phrase pairs extracted from parallel treebanks. We smooth our translation tables with out-of-domain word translations for the Arabic–English and Chinese–English tasks in order to solve the problem of the high number of out of vocabulary items. We also carried out experiments combining both in-domain and out-of-domain data to improve system performance and, finally, we deploy a majority voting procedure combining a language model based method and a translation-based method for case and punctuation restoration. We participated in all the translation tasks and translated both the single-best ASR hypotheses and the correct recognition results. The translation results confirm that our new word and phrase alignment techniques are often helpful in improving translation quality, and the data combination method we proposed can significantly improve system performance
    corecore