Machine translation of text from Turkmen to Turkish

Abstract

Translation between languages by computer is one of the most important branches of natural language processing. Despite advances in technology and methods, however, general-purpose, high-performance translation systems have not yet been made available for general use. The main reason is the large structural and expressive differences between languages. This suggests that Machine Translation (MT) may be easier to implement between similar languages. Indeed, in recent years systems that produce high-quality output have been developed for very close language pairs such as Czech-Slovak, Czech-Polish and Spanish-Catalan. Moreover, these systems use simpler and more easily implemented methods than the complex ones required for MT between language pairs with deep differences, such as Japanese-English. In this work, an MT system has been developed between Turkmen and Turkish, which are classified in the same language family and are similar in many respects. However similar these languages are, the differences that must be resolved are far from negligible. The differences among the Turkic languages are greater than those of the language pairs mentioned above, and the languages are not mutually intelligible. The system uses a hybrid translation model consisting of both rule-based and statistical components, and is based on the principle of transferring a Turkmen sentence into Turkish word by word without changing the word order. However, because of the complex morphological properties of the agglutinative Turkic languages, the simple direct transfer methods usable for other languages were extended before being applied. The performance of the system was measured with the BLEU metric, and it was shown that the model can produce successful results.

Keywords: Machine translation, Turkic languages, Turkmen, Turkish.

Machine translation is a popular but difficult field of natural language processing.
Despite huge advances in technology and the invention of new methods, general-purpose, fully automatic Machine Translation (MT) systems do not exist. Today's MT systems either require post-editing or are far from generating high-quality translations, particularly of unrestricted texts. The primary reasons are the morphological, syntactic and lexical differences between languages. The more distant the source and target languages are, the more complex the methods or models needed to build an MT system between them. Intuitively, this implies that MT between related languages can be easier than MT between languages with completely different structures (e.g. Japanese and English). Recently, MT between related languages such as Czech to Slovak, Czech to Polish and Spanish to Catalan has been implemented, and these studies showed that successful translations can be produced with relatively simple methods. In this work our aim is to build an MT model between Turkic languages. The Turkic languages include Turkish, Azerbaijani, Uzbek, Turkmen, Kyrgyz, Kazakh and Uighur. All of them are agglutinative languages with productive inflectional and derivational morphology. A high level of similarity can be observed between Turkic languages, especially in word order and syntactic structure. They have similar morphological structure and share some common word roots. However, there are divergences between these languages that prevent mutual intelligibility. Extending the previous studies to Turkic languages raises serious problems, due both to the agglutinative structure of the languages and to resource scarcity. Except for Turkish, most Turkic languages are computationally resource-poor, which means a lack of training corpora, morphological analyzers, POS taggers and machine-readable dictionaries.
The model presented in this work is a hybrid model with both rule-based and statistical components. Since the word order of Turkic languages is almost the same, a direct transfer approach is used, translating word by word. Morphological processing, which can generate ambiguous results, is the first step of almost every NLP task for agglutinative languages. The actual transfer is then carried out by transferring the root words and morphological features to the target language. During the transfer of root words another type of ambiguity, lexical ambiguity, emerges. As an exception to the word-by-word transfer approach, some additional sentence-level processing is done in order to translate Multi-Word Units (MWU) correctly. The two types of ambiguity are resolved by a disambiguation component that exploits Statistical Language Models (SLM) trained on the target language. In the next component, a target language morphological generator produces the surface forms of the resulting candidate translation. As a last step, some work is done at the sentence level to handle long-distance dependencies and to apply a number of transfer rules for certain phrase structures. The statistical disambiguation component is based on SLMs, which are normally built from the surface forms in the training corpus. For an agglutinative language, however, such training suffers heavily from the sparse data problem, so we propose several SLM types in which various parts of the full morphological parses are modeled. The performance of these types is investigated, as well as the performance of the whole system. To evaluate the practical performance of our model, we have implemented a generic MT framework based on it and built a Turkmen to Turkish MT system using this framework. The rule-based modules of the system are implemented as Finite State Transducers (FST) using the Xerox Finite State Toolkit.
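The statistical disambiguation step can be illustrated with a minimal sketch. It assumes a bigram language model over target-language root words only, add-one smoothing, and exhaustive search over candidate sequences. These choices, the function names and the toy data are all illustrative assumptions, not the SLM types or the implementation used in this work.

```python
import math
from collections import Counter
from itertools import product


def train_bigram_lm(sentences):
    """Count unigrams and bigrams over tokenized target-language sentences."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + list(tokens) + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams


def log_prob(tokens, unigrams, bigrams):
    """Add-one smoothed bigram log-probability of a token sequence."""
    vocab_size = len(unigrams)
    padded = ["<s>"] + list(tokens) + ["</s>"]
    total = 0.0
    for prev, cur in zip(padded, padded[1:]):
        total += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
    return total


def disambiguate(candidates_per_word, unigrams, bigrams):
    """Return the candidate sequence the language model scores highest.

    Exhaustive search is exponential in sentence length; it is used here
    only to keep the sketch short.
    """
    return max(
        product(*candidates_per_word),
        key=lambda seq: log_prob(seq, unigrams, bigrams),
    )


# Toy target-language corpus of root words (made up for illustration).
corpus = [["ben", "kitap", "oku"], ["o", "kitap", "al"], ["ben", "ev", "git"]]
unigrams, bigrams = train_bigram_lm(corpus)

# Hypothetical transfer output: each source word maps to one or more candidates.
candidates = [["ben"], ["kitap", "defter"], ["oku", "yaz"]]
best = disambiguate(candidates, unigrams, bigrams)
print(best)
```

Modeling roots rather than full surface forms is one way to reduce data sparsity for an agglutinative language. A practical system would replace the exhaustive search with Viterbi-style decoding.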
A Turkmen morphological analyzer is implemented in a two-level manner, while an existing wide-coverage Turkish morphological analyzer is run in the generation direction to serve as the target language morphological generator. We have used BLEU as the automatic evaluation metric for our MT system. The results showed that general-purpose MT between Turkic languages can be achieved with relative ease and can generate high-quality translations.

Keywords: Machine Translation, Turkic languages, Turkmen Language, Turkish.
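For readers unfamiliar with BLEU, the following is a minimal sketch of the metric for a single candidate and a single reference: the geometric mean of modified n-gram precisions multiplied by a brevity penalty. It is an unsmoothed, sentence-level simplification written for illustration. The function names are assumptions, and it is not the evaluation script used in this work.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Unsmoothed BLEU of one tokenized candidate against one reference."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        ref_counts = ngrams(reference, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or overlap == 0:
            return 0.0  # without smoothing, any zero precision gives BLEU 0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    if len(candidate) > len(reference):
        penalty = 1.0
    else:
        penalty = math.exp(1 - len(reference) / len(candidate))
    return penalty * math.exp(sum(log_precisions) / max_n)


score_same = bleu("a b c d e".split(), "a b c d e".split())
score_swap = bleu("a b c d".split(), "a b d c".split(), max_n=2)
print(score_same, score_swap)
```

Reported BLEU scores are normally computed at corpus level, aggregating n-gram counts over all sentences before taking precisions, so sentence-level values from this sketch are not directly comparable with them.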
