    Approach to Selecting Best Development Set for Phrase-Based Statistical Machine Translation

    PACLIC 23 / City University of Hong Kong / 3-5 December 2009

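    Improving the Accuracy of English-Indonesian Machine Translation by Maximizing the Quality and Quantity of Parallel Corpora (Peningkatan Akurasi Mesin Penerjemah Bahasa Inggris - Indonesia dengan Memaksimalkan Kualitas dan Kuantitas Korpus Paralel)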

    The parallel corpus plays a very important role in statistical machine translation (SMT). Parallel corpora obtained from various sources are usually of poor quality, yet a large parallel corpus is a primary requirement for good translation results. This study aims to determine the effect of parallel corpus size and quality on SMT. It uses the bilingual evaluation understudy (BLEU) metric to classify parallel sentence pairs as high or low quality. The method is applied to a parallel corpus of 1.5M English-Indonesian sentence pairs and yields 900K high-quality pairs. Several SMT systems are trained with MOSES on raw parallel corpora of various sizes and on the filtered high-quality corpus, and their performance is evaluated. The experimental results show that the size of the parallel corpus is a major factor in translation performance. In addition, better translation performance can be achieved with a smaller, high-quality corpus obtained by quality filtering. For English-Indonesian SMT, using the 60% of sentences whose translation quality is good improves translation quality by 7.31%.
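The abstract does not say how BLEU is turned into a per-pair quality decision. One plausible reading, sketched below purely as an assumption, is to machine-translate each source sentence, score that output against the corpus target with a sentence-level BLEU, and keep pairs above a threshold. The names `sentence_bleu`, `filter_pairs`, the `translate` callable, the add-one smoothing for higher-order n-grams, and the threshold value are all hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(hypothesis, reference, max_n=4):
    """Sentence-level BLEU in [0, 1]; orders above 1 use add-one smoothing (an assumption)."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp or not ref:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # Clipped matches: an n-gram counts at most as often as it occurs in the reference.
        matches = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if n == 1:
            if matches == 0:
                return 0.0  # no word overlap at all
            precision = matches / total
        else:
            precision = (matches + 1) / (total + 1)
        log_precision += math.log(precision) / max_n
    # Brevity penalty for hypotheses shorter than the reference.
    brevity = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return brevity * math.exp(log_precision)


def filter_pairs(pairs, translate, threshold=0.3):
    """Keep (source, target) pairs whose target is close to translate(source).

    `translate` is a hypothetical callable standing in for any baseline MT system.
    """
    return [(s, t) for s, t in pairs if sentence_bleu(translate(s), t) >= threshold]
```

In practice the threshold would need tuning on held-out data; the abstract reports only that about 60% of the 1.5M pairs survived filtering, not the cut-off that was used.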

    Comparison of Data Selection Techniques for the Translation of Video Lectures

    For the task of online translation of scientific video lectures, using huge models is not possible. In order to get smaller and more efficient models, we perform data selection. In this paper, we perform a qualitative and quantitative comparison of several data selection techniques, based on cross-entropy and infrequent n-gram criteria. In terms of BLEU, a combination of translation and language model cross-entropy achieves the most stable results. As another important criterion for measuring translation quality in our application, we identify the number of out-of-vocabulary words. Here, infrequent n-gram recovery shows superior performance. Finally, we combine the two selection techniques in order to benefit from both their strengths.

    The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no 287755 (transLectures), and the Spanish MINECO Active2Trans (TIN2012-31723) research project.

    Wuebker, J.; Ney, H.; Martínez-Villaronga, A.; Giménez Pastor, A.; Juan Císcar, A.; Servan, C.; Dymetman, M.... (2014). Comparison of Data Selection Techniques for the Translation of Video Lectures. Association for Machine Translation in the Americas. http://hdl.handle.net/10251/54431
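As a rough illustration of a cross-entropy selection criterion, the sketch below ranks candidate sentences by the difference between their cross-entropy under an in-domain language model and under a general-domain one, keeping the lowest-scoring sentences. This is the language-model half only. The abstract's best-performing variant also includes a translation-model term, which is not shown. The add-one smoothed unigram model is a toy stand-in for a real n-gram model, and `UnigramLM` and `select` are hypothetical names.

```python
import math
from collections import Counter


class UnigramLM:
    """Add-one smoothed unigram language model (toy stand-in for an n-gram LM)."""

    def __init__(self, sentences, vocab):
        self.counts = Counter(w for s in sentences for w in s.split())
        self.total = sum(self.counts.values())
        self.vocab_size = len(vocab) + 1  # one extra slot for unseen words

    def cross_entropy(self, sentence):
        """Average negative log2 probability per token."""
        tokens = sentence.split()
        if not tokens:
            return 0.0
        denom = self.total + self.vocab_size
        logp = sum(math.log2((self.counts[w] + 1) / denom) for w in tokens)
        return -logp / len(tokens)


def select(pool, in_domain, general, keep):
    """Return the `keep` pool sentences with the lowest cross-entropy difference."""
    vocab = {w for s in in_domain + general for w in s.split()}
    lm_in, lm_gen = UnigramLM(in_domain, vocab), UnigramLM(general, vocab)
    # Lower score means "looks like in-domain text, unlike general text".
    ranked = sorted(pool, key=lambda s: lm_in.cross_entropy(s) - lm_gen.cross_entropy(s))
    return ranked[:keep]
```

Sharing one vocabulary between the two models keeps their smoothed probabilities comparable, so the difference reflects domain fit and not vocabulary size.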

    Adaptation in Machine Translation

    Statistical machine translation (SMT) has emerged as the currently most promising approach for machine translation. One limitation to date, however, is that the quality of SMT systems strongly depends on the similarity between the training data and its deployment. This thesis is devoted to adapting MT systems in the scenario of mismatching training data. We develop different approaches to increase performance even though all or some of the training data does not match the system's application.

    Statistical machine translation system and computational domain adaptation

    Phrase-based statistical machine translation is one possible approach to automatic machine translation. This work proposes methods for improving machine translation quality by adapting certain parameters of the statistical machine translation model. The idea was to build phrase-based statistical machine translation systems for Croatian and English. The systems were trained in two directions (Croatian-English and English-Croatian), on two domains, and on parallel corpora of different sizes and characteristics, after which tuning was carried out. Hybrid systems combining features of both domains were then investigated. This made it possible to examine the direct impact of domain adaptation on the quality of automatic machine translation of Croatian, and the findings can be used in building new systems. Automatic and human evaluation of the machine translations were carried out, and the results were compared with those obtained from existing statistical machine translation web services.