44 research outputs found

    A MT System from Turkmen to Turkish employing finite state and statistical methods

    In this work, we present an MT system from Turkmen to Turkish. Our system exploits the similarity of the two languages by using a modified version of the direct translation method. However, the complex inflectional and derivational morphology of the Turkic languages necessitates special treatment in a word-by-word translation model. We also employ morphology-aware multi-word processing and statistical disambiguation in our system. We believe that this approach is valid for most of the Turkic languages and that the architecture, implemented using FSTs, can be easily extended to those languages.

    Hesaplamalı Dil Bilimleri ve Uygur Dili Araştırmaları (Computational Linguistics and Uyghur Language Research)

    This article briefly describes computational linguistics and summarizes current computational linguistics research on Uyghur. With the advance of technology, computer-aided work on different languages has achieved great success. For example, applications such as content management in texts, information retrieval, speech systems, document clustering, text mining, spell checking, text-to-speech, speech-to-text, and automatic (machine) translation between different languages have been developed and are used in real life. Although many studies have been carried out on some languages of the Ural-Altaic group, such as Finnish, Japanese, Hungarian, and Turkish, work on certain other languages, for example Uyghur, remains little known. To advance research in computational linguistics and to make it possible to analyze the relationships between different languages, this article brings together computer-aided research on Uyghur, in particular the most recent foundational work on machine translation. It also analyzes the relationship between linguists and computational linguistics.

    Enhancing Bi-directional English-Tigrigna Machine Translation Using Hybrid Approach

    Machine Translation (MT) is an application area of NLP in which automatic systems translate text or speech from one language to another while preserving the meaning of the source. Although there is a large body of literature on automatic translation of documents in many languages, translation between English and Tigrigna is less explored. We therefore proposed a hybrid approach that addresses the challenges of applying syntactic reordering rules, which capture the structural arrangement of words in the source sentence and rearrange it to be more like the target sentence. Two language models were developed, one for English and one for Tigrigna, and about 12,000 parallel sentences in four domains and 32,000 bilingual dictionaries were collected for our experiment. The collected parallel corpus was split randomly into 10,800 sentences for training and 1,200 sentences for testing. The Moses open-source statistical machine translation system was used to train, tune, and decode. The parallel corpus was aligned using the Giza++ toolkit, and SRILM was used to build the language models. Three main experiments were conducted, using the statistical approach, the hybrid approach, and a post-processing technique. Our experimental results showed good translation output, as high as 32.64 BLEU points [...] Google translator, and the hybrid approach was found most promising for English-Tigrigna bi-directional translation.
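
The random split described in this abstract (12,000 parallel sentences into 10,800 for training and 1,200 for testing) can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name `split_corpus`, the fixed seed, and the placeholder sentence pairs are all assumptions made for the example.

```python
import random


def split_corpus(pairs, test_size, seed=0):
    """Randomly split sentence pairs into (train, test) with no overlap.

    `seed` is an assumption added here so the split is reproducible.
    """
    if not 0 <= test_size <= len(pairs):
        raise ValueError("test_size must be between 0 and len(pairs)")
    shuffled = list(pairs)                 # copy, so the input is not mutated
    random.Random(seed).shuffle(shuffled)  # seeded in-place shuffle of the copy
    return shuffled[test_size:], shuffled[:test_size]


# Placeholder pairs standing in for the 12,000 collected parallel sentences.
corpus = [(f"en-{i}", f"ti-{i}") for i in range(12_000)]
train, test = split_corpus(corpus, test_size=1_200)
print(len(train), len(test))
```

Shuffling whole pairs, rather than each side separately, keeps every source sentence attached to its translation.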

    A Phrase-Based Approach Based on Morphological Information for Japanese-Uighur Statistical Machine Translation System

    Statistical approaches to Japanese-Uighur machine translation remain unexplored. This paper analyses statistical machine translation for the Uighur language, discusses how to build a dictionary, a parallel corpus, and a phrase-based statistical machine translation system based on linguistic rules for Uighur, and presents a statistical machine translation method based on Uighur morphological information, a rule base, and a dictionary.

    Theory and Applications for Advanced Text Mining

    Thanks to the growth of computer and web technologies, we can easily collect and store large amounts of text data, and it is reasonable to believe that this data contains useful knowledge. Text mining techniques have been studied intensively since the late 1990s in order to extract that knowledge. Although many important techniques have been developed, the text mining research field continues to expand to meet the needs arising from various application fields. This book is composed of 9 chapters introducing advanced text mining techniques, ranging from relation extraction to methods for under-resourced or less-resourced languages. I believe that this book will contribute new knowledge to the text mining field and help many readers open up new research fields.

    Resource Generation from Structured Documents for Low-density Languages

    The availability and use of electronic resources for both manual and automated language-related processing has increased tremendously in recent years. Nevertheless, many resources still exist only in printed form, restricting their availability and use. This especially holds true for low-density languages, or languages with limited electronic resources. For these documents, automated conversion into electronic resources is highly desirable. This thesis focuses on the semi-automated conversion of printed structured documents (dictionaries in particular) to usable electronic representations. In the first part we present an entry tagging system that recognizes, parses, and tags the entries of a printed dictionary to reproduce the representation. The system uses the consistent layout and structure of the dictionaries, and the features that impose this structure, to capture and recover lexicographic information. We accomplish this by adapting two methods, one rule-based and one HMM-based. The system is designed to produce results quickly with minimal human assistance and reasonable accuracy. The use of adaptive transformation-based learning as a post-processor at two points in the system yields significant improvements, even with an extremely small amount of user-provided training data. The second part of this thesis presents Morphology Induction from Noisy Data (MIND), a natural language morphology discovery framework that operates on information from limited, noisy data obtained from the conversion process. To use the resulting resources effectively, however, users must be able to search for them using the root form of a morphologically deformed variant found in the text. Stemming and data-driven methods are not suitable when data are sparse. The approach is based on the novel application of string searching algorithms.
    The evaluations show that MIND can segment words into roots and affixes from the noisy, limited data contained in a dictionary, and that it can extract prefixes, suffixes, circumfixes, and infixes. MIND can also identify morphophonemic changes, i.e., phonemic variations between allomorphs of a morpheme, specifically point-of-affixation stem changes. This, in turn, allows non-native speakers with limited knowledge to perform multilingual tasks in applications where response must be rapid. In addition, this analysis can feed other natural language processing tools requiring lexicons.
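
To make the idea of string-search-based segmentation concrete, the sketch below splits a variant form into prefix, stem, and suffix by finding the longest substring it shares with a known root. It is not the MIND algorithm, whose details the abstract does not give. It is a simplified stand-in built on Python's standard `difflib`, and the function name `split_affixes` and the English example words are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def split_affixes(root, variant):
    """Split `variant` around the longest substring it shares with `root`.

    Returns None when the two strings share no characters. `stem_change`
    is True when only part of the root survives in the variant, a rough
    proxy for a stem change at the point of affixation.
    """
    matcher = SequenceMatcher(None, root, variant, autojunk=False)
    m = matcher.find_longest_match(0, len(root), 0, len(variant))
    if m.size == 0:
        return None
    return {
        "prefix": variant[:m.b],
        "stem": variant[m.b:m.b + m.size],
        "suffix": variant[m.b + m.size:],
        "stem_change": m.size < len(root),
    }


print(split_affixes("play", "replaying"))
print(split_affixes("carry", "carried"))
```

A single longest match cannot recover infixes or circumfixes, which the abstract says MIND handles, so this only illustrates the prefix and suffix case.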

    Using Comparable Corpora to Augment Statistical Machine Translation Models in Low Resource Settings

    Previously, statistical machine translation (SMT) models have been estimated from parallel corpora, or pairs of translated sentences. In this thesis, we directly incorporate comparable corpora into the estimation of end-to-end SMT models. In contrast to parallel corpora, comparable corpora are pairs of monolingual corpora that have some cross-lingual similarities, for example topic or publication date, but that do not necessarily contain any direct translations. Comparable corpora are more readily available in large quantities than parallel corpora, which require significant human effort to compile. We use comparable corpora to estimate machine translation model parameters and show that doing so improves performance in settings where a limited amount of parallel data is available for training. The major contributions of this thesis are the following:

    * We release ‘language packs’ for 151 human languages, which include bilingual dictionaries, comparable corpora of Wikipedia document pairs, comparable corpora of time-stamped news text that we harvested from the web, and, for non-Roman script languages, dictionaries of name pairs, which are likely to be transliterations.
    * We present a novel technique for using a small number of example word translations to learn a supervised model for bilingual lexicon induction which takes advantage of a wide variety of signals of translation equivalence that can be estimated over comparable corpora.
    * We show that using comparable corpora to induce new translations and estimate new phrase table feature functions improves end-to-end statistical machine translation performance for low resource language pairs as well as domains.
    * We present a novel algorithm for composing multiword phrase translations from multiple unigram translations and then use comparable corpora to prune the large space of hypothesis translations. We show that these induced phrase translations improve machine translation performance beyond that of component unigrams.

    This thesis focuses on critical low resource machine translation settings, where insufficient parallel corpora exist for training statistical models. We experiment with both low resource language pairs and low resource domains of text. We present results from our novel error analysis methodology, which show that most translation errors in low resource settings are due to unseen source language words and phrases and unseen target language translations. We also find room for fixing errors due to how different translations are weighted, or scored, in the models. We target both error types; we use comparable corpora to induce new word and phrase translations and estimate novel translation feature scores. Our experiments show that augmenting baseline SMT systems with new translations and features estimated over comparable corpora improves translation performance significantly. Additionally, our techniques expand the applicability of statistical machine translation to those language pairs for which zero parallel text is available.
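
One kind of translation-equivalence signal that can be estimated over time-stamped comparable corpora is temporal similarity: words that are translations of each other tend to rise and fall in frequency on the same dates. The sketch below ranks candidate translations by the cosine similarity of their per-date frequency profiles. It is an illustration under stated assumptions, not the thesis's model: the abstract does not list its signals, the function names are hypothetical, and the counts are invented toy data.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length count vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def rank_candidates(source_profile, target_profiles):
    """Order target-language words by temporal similarity to the source word."""
    return sorted(
        target_profiles,
        key=lambda word: cosine(source_profile, target_profiles[word]),
        reverse=True,
    )


# Toy per-day counts over four dates (invented for illustration only).
source_counts = [0, 5, 9, 1]
candidates = {
    "cand_a": [1, 6, 8, 0],  # peaks on the same days as the source word
    "cand_b": [7, 1, 0, 6],  # peaks on different days
}
print(rank_candidates(source_counts, candidates))
```

In a supervised setting such as the one the abstract describes, a score like this would be one feature among several rather than a ranking criterion on its own.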

    Turkic C-type reduplications

    The present book can be viewed as a patchwork of topics relating more or less directly to Turkic reduplications. Many are interconnected and interdependent, which renders it impossible to organize the presentation in a linear way. The thematic division adopted here is only one of the possible groupings, and not necessarily optimal for all tasks. To alleviate this inconvenience, the current chapter first summarizes the whole following a different thematic division (4.1), and then very briefly recapitulates what I consider to be the most important conclusions (4.2). Some thoughts are expressed more clearly here than in the previous chapters, where they were lost between auxiliary observations.