Exploring different representational units in English-to-Turkish statistical machine translation
We investigate different representational granularities for sub-lexical representation in English-to-Turkish statistical machine translation. We find that (i) representing both Turkish and English at the morpheme level, but with some selective morpheme grouping on the Turkish side of the training data, (ii) augmenting the training data with “sentences” comprising only the content words of the original training data to bias root-word alignment, (iii) reranking the n-best morpheme-sequence outputs of the decoder with a word-based language model, and (iv) using model iteration all provide a non-trivial improvement over a fully word-based baseline. Despite our very limited training data, we improve from 20.22 BLEU points for our simplest model to 25.08 BLEU points, an improvement of 4.86 points or 24% relative.
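The reranking idea in (iii) — rescoring the decoder's morpheme-level n-best list with a word-level language model — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `@@` morpheme-boundary marker, the toy unigram LM, and the fixed interpolation weight are all assumptions made for the example.

```python
def rerank_nbest(nbest, word_lm, lm_weight=0.5):
    """Rerank decoder n-best hypotheses with a word-based language model.

    nbest: list of (morpheme_sequence, decoder_score) pairs.
    word_lm: callable mapping a list of words to a log-probability.
    Morphemes are rejoined into words before LM scoring; here the
    hypothetical '@@ ' marker denotes a word-internal morpheme boundary.
    """
    def to_words(morphs):
        return morphs.replace("@@ ", "").split()

    rescored = [
        (hyp, score + lm_weight * word_lm(to_words(hyp)))
        for hyp, score in nbest
    ]
    return max(rescored, key=lambda x: x[1])[0]

# Toy unigram word LM standing in for a real n-gram model.
TOY_PROBS = {"the": -1.0, "cat": -2.0, "cats": -5.0}

def toy_lm(words):
    return sum(TOY_PROBS.get(w, -10.0) for w in words)

best = rerank_nbest([("the cat@@ s", -3.0), ("the cat", -3.5)], toy_lm)
# -> "the cat": the word-level LM penalizes the rarer surface form "cats"
```

In a real system the word-level LM score would be one feature among several in a log-linear model, with the weight tuned on held-out data rather than fixed.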
Optimizing machine translation quality estimation with a bilingual dictionary and WordNet
Estimating the quality of machine translation is an important task today. A reliable quality estimation system can save time and money for companies, researchers, and everyday users. The main problem with traditional automatic evaluation methods is that they require reference translations and cannot evaluate in real time. This research presents a quality estimation system that can evaluate in real time, without a reference translation. To build the system, we implemented the QuEst framework and optimized it for Hungarian. In addition, we developed new custom features for the QuEst system using a bilingual dictionary and WordNet. Applying these custom features improved the quality of the evaluation. The resulting feature set, optimized for Hungarian, performs 11% better than the baseline system. The quality estimation system we implemented forms a suitable basis for an English-Hungarian machine translation evaluation system.
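One kind of reference-free feature a bilingual dictionary enables is source-coverage: what fraction of source tokens have at least one dictionary translation present in the MT output. The sketch below illustrates the idea only; the function name and the toy dictionary are made up for this example and are not the features actually used in the QuEst-based system.

```python
def dictionary_coverage(source_tokens, target_tokens, bilingual_dict):
    """Fraction of source tokens that have at least one dictionary
    translation appearing in the MT output -- a simple reference-free
    adequacy feature in the spirit of quality estimation.
    bilingual_dict: source word -> list of possible target words.
    """
    target = set(target_tokens)
    covered = sum(
        1 for tok in source_tokens
        if any(t in target for t in bilingual_dict.get(tok, []))
    )
    return covered / max(len(source_tokens), 1)

# Toy English->Hungarian dictionary (illustrative entries only).
toy_dict = {"dog": ["kutya"], "runs": ["fut", "szalad"]}
feat = dictionary_coverage(["dog", "runs"], ["a", "kutya", "fut"], toy_dict)
# -> 1.0: both source tokens are covered by the output
```

A low coverage value suggests content was dropped or mistranslated, which is exactly the kind of signal a regression model over QE features can exploit.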
Integrating meaning into quality evaluation of machine translation
Machine translation (MT) quality is evaluated through comparisons between MT outputs and human translations (HT). Traditionally, this evaluation relies on form-related features (e.g. lexicon and syntax) and ignores the transfer of meaning reflected in HT outputs. Instead, we evaluate the quality of MT outputs through meaning-related features (e.g. polarity, subjectivity) in two experiments. In the first experiment, the meaning-related features are compared to human rankings individually. In the second experiment, combinations of meaning-related features and other quality metrics are used to predict the same human rankings. The results of our experiments confirm the benefit of these features in predicting human evaluation of translation quality, in addition to traditional metrics which focus mainly on form.
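A polarity-based feature of the kind described above can be sketched as the absolute difference between the sentiment of the MT output and that of the human translation. This is a deliberately minimal stand-in: the tiny polarity lexicon and the averaging scheme are assumptions for illustration, whereas a real system would use a trained sentiment model.

```python
# Toy polarity lexicon (illustrative, not a real sentiment resource).
POLARITY = {"good": 1, "great": 1, "bad": -1, "terrible": -1}

def polarity(tokens):
    """Average lexicon polarity of a token sequence; 0.0 if no
    polarity-bearing word is present."""
    scores = [POLARITY[t] for t in tokens if t in POLARITY]
    return sum(scores) / len(scores) if scores else 0.0

def polarity_mismatch(mt_tokens, ht_tokens):
    """Meaning-level feature: |polarity(MT) - polarity(HT)|.
    Larger values suggest the translation shifted the sentiment."""
    return abs(polarity(mt_tokens) - polarity(ht_tokens))

m = polarity_mismatch(["a", "terrible", "film"], ["a", "great", "film"])
# -> 2.0: the translation inverted the sentiment of the reference
```

Such a feature is orthogonal to form-based metrics like BLEU: two outputs with the same n-gram overlap can differ sharply in conveyed sentiment.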
Improving the quality of a domain-specific machine translation system with domain adaptation
The spread of deep learning methods has substantially changed how humans judge machine translations. In contrast to statistical machine translation (SMT) systems, neural-network-based architectures (NMT) generate far more readable translations, which professional translators can correct more easily and efficiently during post-editing. The difficulty of the new approach, however, is that training systems with consistently good translation quality requires a large amount of training data, which is not available for most translation companies or language pairs. In this work, I enriched small, high-quality in-domain training sets via data selection with the most similar segments of a large out-of-domain corpus. With the resulting architecture, I improved the quality of the translation system by a statistically significant margin in every case examined. In the course of my research, I sought the selection method best suited to the task and examined the system's behavior across several combinations of language and domain pairs.
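One plausible instance of the data-selection step described above is Moore-Lewis-style cross-entropy difference selection: score each out-of-domain segment by how much more likely it is under an in-domain language model than under an out-of-domain one, and keep the top-scoring segments. The sketch below uses smoothed unigram models for brevity; the abstract does not say which selection method was chosen, so this is an assumption for illustration.

```python
import math
from collections import Counter

def unigram_logprob(sentence, counts, total, vocab_size):
    """Add-one-smoothed unigram log-probability per token."""
    toks = sentence.split()
    lp = sum(math.log((counts[t] + 1) / (total + vocab_size)) for t in toks)
    return lp / max(len(toks), 1)

def select_similar(in_domain, out_domain, k):
    """Moore-Lewis-style selection: rank out-of-domain segments by the
    cross-entropy difference between an in-domain and an out-of-domain
    unigram LM, and keep the k most in-domain-like segments."""
    def stats(corpus):
        c = Counter(t for s in corpus for t in s.split())
        return c, sum(c.values()), len(c)
    ic, itot, iv = stats(in_domain)
    oc, otot, ov = stats(out_domain)
    scored = sorted(
        out_domain,
        key=lambda s: unigram_logprob(s, ic, itot, iv)
                      - unigram_logprob(s, oc, otot, ov),
        reverse=True,
    )
    return scored[:k]

in_dom = ["the patient received treatment", "clinical trial results"]
out_dom = ["patient treatment improved", "stock prices fell sharply"]
picked = select_similar(in_dom, out_dom, 1)
# picks the medical-sounding segment over the financial one
```

Real systems use higher-order n-gram or neural LMs and tune the cut-off k on a development set, but the ranking principle is the same.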
A prototype English-Turkish statistical machine translation system
Translating one natural language (text or speech) to another natural language automatically is known as machine translation. Machine translation is one of the major, oldest, and most active areas in natural language processing. The last decade and a half have seen the rise of statistical approaches to the problem of machine translation. Statistical approaches learn translation parameters automatically from aligned text instead of relying on rule writing, which is labor intensive. Although there has been quite extensive work in this area for some language pairs, there has been no research for the Turkish-English language pair. In this thesis, we present the results of our investigation and development of a state-of-the-art statistical machine translation prototype from English to Turkish. Developing an English-to-Turkish statistical machine translation prototype is an interesting problem from a number of perspectives. The most important challenge is that English and Turkish are typologically rather distant languages. While English has very limited morphology and rather fixed Subject-Verb-Object constituent order, Turkish is an agglutinative language with very flexible (but Subject-Object-Verb dominant) constituent order and a very rich and productive derivational and inflectional morphology, with word structures that, when translated, can correspond to complete phrases of several words in English. Our research is focused on making scientific contributions to the state of the art by taking into account certain morphological properties of Turkish (and possibly similar languages) that have not been addressed sufficiently in previous research on other languages. In this thesis, we investigate how different morpheme-level representations of morphology on both the English and the Turkish sides impact statistical translation results.
We experiment with local word reordering on the English side to bring the word order of specific English prepositional phrases and auxiliary verb complexes in line with the corresponding case-marked noun forms and complex verb forms on the Turkish side, to help with word alignment. We augment the training data with sentences consisting only of the content words (nouns, verbs, adjectives, adverbs) of the original training data, and with highly reliable phrase pairs obtained iteratively from an earlier phrase alignment, to alleviate the dearth of available parallel data. We use a word-based language model in the reranking of the n-best lists, in addition to the morpheme-based language model used for decoding, so that we can incorporate both local morphotactic constraints and local word-ordering constraints. Lastly, we present a procedure for repairing the decoder output by correcting words that have incorrect morphological structure or are out-of-vocabulary with respect to the training data and language model, to further improve the translations. We also include fine-grained evaluation results and some oracle scores obtained with the BLEU+ tool, an extension of the BLEU evaluation metric. After all research and development, we improve from 19.77 BLEU points for our word-based baseline model to 27.60 BLEU points, an improvement of 7.83 points or about 40% relative.
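The output-repair step — replacing malformed or out-of-vocabulary words — can be illustrated with a simple edit-distance search over the training vocabulary. This is only a sketch of the OOV half of the procedure: the thesis also checks morphological well-formedness, which would require a Turkish morphological analyzer, and the function names, threshold, and example vocabulary here are invented for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance via a single-row dynamic program."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

def repair_output(tokens, vocabulary, max_dist=2):
    """Replace each out-of-vocabulary token with its closest
    in-vocabulary word if one lies within max_dist edits;
    otherwise leave the token unchanged."""
    repaired = []
    for tok in tokens:
        if tok in vocabulary:
            repaired.append(tok)
            continue
        best = min(vocabulary, key=lambda v: edit_distance(tok, v))
        repaired.append(best if edit_distance(tok, best) <= max_dist else tok)
    return repaired

# Toy Turkish-looking vocabulary; "evlerdee" has a spurious final letter.
vocab = {"evlerde", "evler", "okulda"}
fixed = repair_output(["evlerdee", "okulda"], vocab)
# -> ["evlerde", "okulda"]
```

Candidate replacements would normally also be rescored with the language model, so that the repair improves rather than merely normalizes the hypothesis.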
Pivot-based Statistical Machine Translation for Morphologically Rich Languages
This thesis describes research on pivot-based statistical machine translation (SMT) for morphologically rich languages (MRL). We provide a framework to translate to and from morphologically rich languages, especially in the context of having little or no parallel corpora between the source and the target languages. We address three main challenges. The first is the sparsity of data that results from morphological richness. The second is maximizing the precision and recall of the pivoting process itself. The last is making use of any parallel data available between the source and the target languages. To address the challenge of data sparsity, we explored a space of tokenization schemes and normalization options. We also examined a set of six detokenization techniques to evaluate detokenized and orthographically corrected (enriched) output. We provide a recipe of the best settings for translating to one of the most challenging languages, namely Arabic. Our best model improves translation quality over the baseline by 1.3 BLEU points. We also investigated the idea of separating translation from morphology generation. We compared three methods of modeling morphological features. Features can be modeled as part of the core translation. Alternatively, these features can be generated using target monolingual context. Finally, the features can be predicted using both source and target information. In our experimental results, we outperform the vanilla factored translation model. In order to decide which features to translate, generate, or predict, a detailed error analysis of the system output is needed. To this end, we present AMEANA, an open-source tool for error analysis of natural language processing tasks, targeting morphologically rich languages. The second challenge we are concerned with is the pivoting process itself. We discuss several techniques to improve the precision and recall of the pivot matching.
One technique to improve recall works at the level of word alignment, as an optimization process for pivoting driven by generating phrase pairs between the source and target languages. Although improving the recall of the pivot matching improves overall translation quality, we also need to increase the precision of the pivot quality. To achieve this, we introduce quality constraint scores that determine the quality of the pivot phrase pairs between source and target languages. We show positive results for different language pairs, demonstrating the consistency of our approaches. In one of our best models we reach an improvement of 1.2 BLEU points. The third challenge we are concerned with is how to make use of any parallel data between the source and the target languages. We build on the approach of improving the precision of the pivoting process and on methods of combining the pivot system with the direct system built from the parallel data. In one of the approaches, we introduce morphology constraint scores, added to the log-linear space of features, to determine the quality of the pivot phrase pairs. We compare two methods of generating the morphology constraints. One method is based on hand-crafted rules relying on our knowledge of the source and target languages; in the other method, the morphology constraints are induced from available parallel data between the source and target languages, which we also use to build a direct translation model. We then combine both the pivot and direct models to achieve better coverage and overall translation quality. Using induced morphology constraints outperformed the hand-crafted rules and improved over our best model from all previous approaches by 0.6 BLEU points (7.2/6.7 BLEU points over the direct and pivot baselines, respectively). Finally, we introduce smart techniques for combining pivot and direct models.
We show that smart selective combination can lead to a large reduction in the size of the pivot model without hurting performance, and in some cases even improves it.
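The core pivoting step — composing a source-to-pivot phrase table with a pivot-to-target one — can be sketched by triangulation: p(t|s) = Σ_p p(p|s) · p(t|p), summing over pivot phrases shared by the two tables. The sketch below shows only this composition under made-up probabilities; the quality and morphology constraint scores described above would be added as extra features on top of it.

```python
from collections import defaultdict

def pivot_phrase_table(src_piv, piv_tgt):
    """Triangulate two phrase tables into a source->target table:
    p(t|s) = sum over shared pivot phrases p of p(p|s) * p(t|p).
    Each table is a dict: phrase -> {translation: probability}."""
    src_tgt = defaultdict(lambda: defaultdict(float))
    for s, pivots in src_piv.items():
        for p, p_ps in pivots.items():
            for t, p_tp in piv_tgt.get(p, {}).items():
                src_tgt[s][t] += p_ps * p_tp
    return {s: dict(ts) for s, ts in src_tgt.items()}

# Toy Spanish->English and English->Arabic entries (invented numbers).
es_en = {"casa": {"house": 0.7, "home": 0.3}}
en_ar = {"house": {"bayt": 0.9}, "home": {"bayt": 0.6, "manzil": 0.4}}
table = pivot_phrase_table(es_en, en_ar)
# table["casa"]["bayt"] = 0.7*0.9 + 0.3*0.6 = 0.81
```

The precision/recall trade-off discussed in the thesis lives exactly here: looser pivot matching composes more phrase pairs (higher recall) but admits noisier ones, which is what the constraint scores are meant to filter.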