11 research outputs found

    A MT System from Turkmen to Turkish employing finite state and statistical methods

    In this work, we present an MT system from Turkmen to Turkish. Our system exploits the similarity of the languages by using a modified version of the direct translation method. However, the complex inflectional and derivational morphology of the Turkic languages necessitates special treatment in a word-by-word translation model. We also employ morphology-aware multi-word processing and statistical disambiguation in our system. We believe that this approach is valid for most of the Turkic languages and that the architecture, implemented using FSTs, can be easily extended to those languages.

    Computer analysis of the Turkmen language morphology

    This paper describes the implementation of a two-level morphological analyzer for the Turkmen language. Like all Turkic languages, Turkmen is an agglutinative language with productive inflectional and derivational suffixes. In this work, we implemented a finite-state two-level morphological analyzer for Turkmen using the Xerox Finite State Tools.

    A prototype machine translation system between Turkmen and Turkish

    In this work, we present a prototype system for translating Turkmen texts into Turkish. Although machine translation (MT) is a very hard task, it is easier to implement an MT system between very close language pairs that have similar syntactic structure and word order. We implement a direct translation system between Turkmen and Turkish that performs a word-to-word transfer. We also use a Turkish language model to find the most probable Turkish sentence among all candidate translations generated by our system.
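The language-model selection step described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the add-one-smoothed bigram model, the toy Turkish corpus, the surface-word candidates, and the names `train_bigram_lm` and `best_translation` are all assumptions made for illustration, and the real system works on morphological analyses rather than plain word forms.

```python
import itertools
import math
from collections import Counter


def train_bigram_lm(sentences):
    """Return a function that scores a token list with an add-one-smoothed bigram model."""
    context_counts, bigram_counts = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        context_counts.update(tokens[:-1])             # every token that has a successor
        bigram_counts.update(zip(tokens, tokens[1:]))
    vocab_size = len(set(context_counts) | {"</s>"})

    def logprob(tokens):
        padded = ["<s>"] + list(tokens) + ["</s>"]
        return sum(
            math.log((bigram_counts[(a, b)] + 1) / (context_counts[a] + vocab_size))
            for a, b in zip(padded, padded[1:])
        )

    return logprob


def best_translation(candidates_per_word, logprob):
    """Enumerate every word-by-word combination and keep the most probable sentence."""
    combinations = (list(c) for c in itertools.product(*candidates_per_word))
    return max(combinations, key=logprob)


# Toy target-side corpus and hypothetical transfer output (one candidate list per source word).
lm = train_bigram_lm(["ben eve geldim", "sen eve geldin", "ben okula gittim"])
candidates = [["ben"], ["eve", "evde"], ["geldim", "geldi"]]
print(" ".join(best_translation(candidates, lm)))
```

Exhaustive enumeration grows exponentially with sentence length, so a practical system would more likely search the candidate lattice with a Viterbi-style algorithm.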

    Machine translation from Uyghur to Turkish (Uygurcadan Türkçeye bilgisayarlı çeviri)

    Machine translation is a subfield of natural language processing, which belongs to artificial intelligence. Generally, it is based on computer technology that uses software to translate one natural language into another. In the 1950s, the Georgetown experiment involved fully automatic translation of over sixty Russian sentences into English (Hutchins, 2004). The experiment was a great success and ushered in an era of substantial funding for machine translation research. One of the main projects initiated by the US at that time was a machine translation system that converted Russian to English. This project continued from 1950 to 1960. In 1964, government sponsors of machine translation in the United States formed the Automatic Language Processing Advisory Committee (ALPAC) to examine the project's potential. In its famous 1966 report, ALPAC concluded that machine translation was slower, less accurate and twice as expensive as human translation, and that "there is no immediate or predictable prospect of useful machine translation" (Hutchins, 1995). The report brought machine translation research in the US to a virtual end for over a decade after its publication. As computer technology developed, high-capacity and high-speed computers were produced. Thus, the main restrictions on studying natural language were removed, and machine translation gained the attention of the computer science community once again. Despite technological advances and the advent of new methods, a general-purpose, fully automatic machine translation system still does not exist. To date, few machine translation systems have been developed; furthermore, they can only be applied to restricted texts, and post-editing is usually necessary after the initial translation. The main reasons for this are the morphological, syntactic and lexical differences between languages. As a result, machine-translated texts remain of lower quality than human translations.
Recently, machine translation systems designed for related languages, such as Czech to Slovak, Spanish to Catalan, and Turkmen to Turkish, have been implemented; studies on them have shown that successful translations can be produced efficiently. In this study, our aim was to implement a machine translation system between Uyghur and Turkish. Uyghur is an agglutinative language, like other Turkic languages (e.g. Turkmen, Kazakh, Kyrgyz, Uzbek and Azeri). All Turkic languages belong to the Ural-Altaic language family and are characteristically agglutinative languages with productive inflectional and derivational morphology. Most research on natural language processing and machine translation of Turkic languages focuses on Turkish, mainly because there is active ongoing research on the subject in Turkey, which continues to produce valuable results. To date, few machine translation systems between Turkic languages have been implemented; examples include Turkish to Azeri, Turkish to Crimean Tatar, and Turkmen to Turkish. Unfortunately, little computational research on Uyghur exists. Turkic languages tend to have similar morphological structure and share some common word roots. The main shared properties include similar word order and syntactic structure. However, distinctions exist that prevent mutual intelligibility between these languages. To implement this translation system, we used a framework that is favored for translation between closely related agglutinative languages. We implemented a morphological analyzer for Uyghur with Xerox's finite-state transducer (FST) tools. In this morphological analyzer we considered general cases for Uyghur and tagged Uyghur words with the same tags that were used for tagging words of other Turkic languages. Thus, it will be easy to integrate this system with other Turkic languages.
To improve the system's performance, we implemented a rule-based morphological disambiguator and, additionally, a word sense disambiguator. We evaluated our system's performance using BLEU scores on 240 differently structured sentences. As a result, we obtained a system that can successfully translate intermediate-level Uyghur text into Turkish. Keywords: Machine translation, Turkic languages, Uyghur language, Turkish.
Turkish abstract (translated): Machine translation (MT) is a subtopic of natural language processing (NLP), which is itself a branch of artificial intelligence research. The idea of using computers for translation between languages emerged in the early 1950s. Since then, many languages have been studied and various methods have been developed. However, despite advances in technology and methods, general-purpose, high-performance translation systems have not yet been developed. The main reason is the large structural and expressive differences between languages. Machine translation is known to be easier between structurally similar languages. In recent years, systems capable of high-performance translation between close languages such as Czech-Slovak, Czech-Polish, Spanish-Catalan and Turkmen-Turkish have been developed. Systems developed for translation between related or close languages use methods that are simpler and easier to implement than the complex methods required for machine translation between languages with large differences, such as Turkish and English. In this study, a machine translation system was developed from Uyghur to Turkish, two languages that are classified in the same language family and are similar in many respects. However similar these languages are, the differences that must be resolved are far from negligible. In general, the syntax of Uyghur and of the Turkish of Turkey is the same, so word order does not change in translation. However, the suffixes attached to words can differ greatly. Since Uyghur and Turkish are agglutinative languages, suffixes are very important: they change the meaning of words and even of the whole sentence. In this study, a machine translation system from Uyghur to Turkish was developed by adding a disambiguation method to the hybrid model developed for machine translation between related agglutinative languages. Keywords: Machine translation, Turkic languages, Uyghur, Turkish

    Data Mining And Clustering

    Thesis (M.Sc.) -- İstanbul Technical University, Institute of Science and Technology, 2002
Turkish abstract (translated): In this master's thesis, the techniques used in data mining were examined, with a particularly detailed study of clustering techniques. In the applied part of the work, several improvements were implemented to help the algorithm cluster very large pattern sets. A frame data structure was designed that stores the data to be clustered in main memory in a flexible and efficient way that reduces memory complexity. A memory management system implementing a block mechanism, swap management and frame memory usage was also added to this structure. In addition, support was provided for using categorical and character-string data during clustering. The K-Means algorithm was implemented on this architecture. As a result, software was developed that can cluster even very large data sets and has a flexible architecture with very low memory complexity.
In this thesis, the techniques used in data mining are examined, and clustering in particular is investigated in depth. Some modifications have been made to the clustering algorithms to reduce their space complexity in an efficient and flexible way, especially when working with large data sets. A memory management system is provided to maximize the efficiency of memory usage and to speed up the algorithms. This memory management system has a block mechanism that supports swapping of blocks and frame block usage. In addition, support for categorical data types and string clustering is added to the application. As a result, we developed software with a template-based data structure that can easily be used by all data mining techniques, even when dealing with very large data sets. In this software, the K-Means algorithm is implemented as an instance of a data mining algorithm.
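For readers unfamiliar with K-Means, the clustering step named in this abstract can be sketched as follows. This is a plain in-memory version of Lloyd's algorithm on numeric tuples, written under stated assumptions: the function names `assign` and `kmeans`, the parameters, and the sample points are hypothetical, and the thesis's block-based memory management and categorical or string support are not reproduced here.

```python
import random


def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid (squared Euclidean distance)."""
    return [
        min(
            range(len(centroids)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
        )
        for p in points
    ]


def kmeans(points, k, max_iterations=20, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid update until centroids stop moving."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # initialise with k distinct data points
    for _ in range(max_iterations):
        labels = assign(points, centroids)
        new_centroids = []
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[i])   # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assign(points, centroids)


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(points, k=2)
print(centroids, labels)
```

Each iteration scans every point once per centroid, which is why keeping the working set in main memory matters for very large data sets.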

    Effect of tokenization granularity for Turkish large language models

    Transformer-based language models such as BERT (and its optimized versions) have outperformed previous models, achieving state-of-the-art results on many English benchmark tasks. These multi-layered self-attention-based architectures are capable of producing contextual word vector representations. However, the tokens created in the tokenization preprocessing step are not necessarily words, particularly for languages with complex morphology, such as Turkish. While previous research has often focused on tokenization algorithms and has explored optimal vocabulary sizes for machine translation in English, our study extends the scope by investigating the impact of varying vocabulary sizes and exploring the feasibility of incorporating morphological tagging for Turkish. The granularity of the generated tokens is a feature determined by various factors related to tokenization, especially by the vocabulary size. This study presents a new collection of BERT models (ITUTurkBERT) trained using various tokenization methods on the corpus of the BERTurk and 1 BW corpora. We fine-tuned these models for named entity recognition, sentiment analysis, and question-answering downstream tasks in Turkish and achieved state-of-the-art performance on all of these tasks. Our empirical experiments show that increasing the vocabulary size improves performance on these tasks, except for sentiment analysis, which requires further investigation.
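The link between vocabulary size and token granularity can be illustrated with a toy subword learner. The sketch below uses byte-pair-encoding-style merges as a stand-in; the abstract does not say which tokenization algorithms were used, so the technique, the function names (`merge_pair`, `learn_bpe`, `tokenize`) and the tiny word list are assumptions for illustration only.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merges; more merges correspond to a larger subword vocabulary."""
    segmented = Counter(tuple(word) for word in words)   # start from single characters
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in segmented.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)     # most frequent adjacent pair
        merges.append(best)
        updated = Counter()
        for symbols, freq in segmented.items():
            updated[merge_pair(symbols, best)] += freq
        segmented = updated
    return merges


def tokenize(word, merges):
    """Apply the learned merges in order to split a word into subword tokens."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


corpus = ["evler", "evlerim", "evlerimizde", "evde", "kitaplar", "kitaplarımızda"]
merges = learn_bpe(corpus, num_merges=12)
print(tokenize("evlerimizde", merges[:3]))   # few merges: finer-grained tokens
print(tokenize("evlerimizde", merges))       # more merges: coarser tokens
```

With few merges a morphologically complex word breaks into many short pieces; with more merges the pieces get longer, although they need not line up with true morpheme boundaries.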

    Machine translation between Turkic languages

    We present an approach to MT between Turkic languages and report results from an implementation of an MT system from Turkmen to Turkish. Our approach relies on ambiguous lexical and morphological transfer augmented with target-side rule-based repairs and rescoring with statistical language models.

    Lexical ambiguity resolution for Turkish in direct transfer machine translation models

    This paper presents a statistical lexical ambiguity resolution method for direct transfer machine translation models in which the target language is Turkish. Since direct transfer MT models do not have full syntactic information, most lexical ambiguity resolution methods are not very helpful. Our disambiguation model is based on statistical language models. We have investigated the performance of several statistical language model types and parameters in lexical ambiguity resolution for our direct transfer MT system.