14 research outputs found
Morphological parsing of Albanian language: a different approach to Albanian verbs
The very first step in processing a natural language is creating a morphological parser. Verbs are the most complex area of inflection in Albanian. Besides the irregular verbs, the ways in which regular verbs change their form when inflected are hard to define, and the number of exceptions is huge. This paper takes a different approach to Albanian verbs. Unlike traditional classification, based on the inflection themes they take, verbs are classified into different verb groups. This way, the inflection process looks clearer and more regular, as the affix remains the only changeable part of the inflected verb. This approach makes processing Albanian verbs simpler and easier.
Pattern extraction and modelling of the behavior of Web users
The spread of the Internet and the information it provides in every field have brought it rapidly into our daily lives. On the Internet, which is a source of information in many areas such as news, economy, culture, education, health services and advertising, users often have difficulty finding the information they need. This is attributed to the limitations of query tools and to the sheer amount of information. In this study, a method is proposed that predicts the page a user will request next and can make fast and highly accurate recommendations. Using model-based clustering, user sessions are clustered according to the similar amounts of time spent on the common pages of sessions in the same cluster. The resulting clusters are used to build recommendation sets for new users. Keywords: Web usage mining, user patterns, model-based clustering, Poisson distribution.
Making recommendations requires predicting what is of interest to a user at a specific time. Even the same user may have different desires at different times. It is important to extract the aggregate interest of a user from his or her navigational path through the site in a session. In this paper, we present a new model that uses only the visiting time and visiting frequencies of pages, without considering the access order of page requests in user sessions. The resulting model has lower run-time computation and memory requirements, while providing predictions that are at least as precise as previous proposals. Our objective in this paper is to assess the effectiveness of non-sequentially ordered pages in predicting navigation patterns. The key idea behind this work is that user sessions can be clustered according to the similar amount of time that is spent on similar pages within a session. We first partition user sessions into clusters such that only sessions which represent similar aggregate interest of users are placed in the same cluster.
We employ a model-based clustering approach and partition user sessions according to similar amounts of time spent on similar pages. In particular, we cluster sessions by learning a mixture of Poisson models using the Expectation Maximization algorithm. The resulting clusters are then used to recommend to a user the pages that are most likely to contain the information of interest to that user at that time. Keywords: Web usage mining, usage patterns, model based clustering, Poisson distribution
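The clustering step described in this abstract can be illustrated with a minimal sketch of Expectation Maximization for a mixture of Poisson models. It is not the authors' implementation. The following are assumptions made for illustration: each session is a vector of non-negative integer counts (for example, bucketed seconds spent on each page), pages are treated as independent within a cluster, the starting rates are taken from evenly spaced sessions, and the function names are hypothetical.

```python
import math


def poisson_log_pmf(k, lam):
    """Log-probability of count k under a Poisson distribution with rate lam."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def em_poisson_mixture(sessions, n_clusters, n_iter=50):
    """Fit a mixture of independent Poisson models to session count vectors.

    Returns (weights, rates, responsibilities). The initialisation from
    evenly spaced sessions is a simplification chosen for this sketch.
    """
    n_pages = len(sessions[0])
    weights = [1.0 / n_clusters] * n_clusters
    # Start each cluster from one session; add 0.5 so that no rate is zero.
    rates = [
        [x + 0.5 for x in sessions[(c * len(sessions)) // n_clusters]]
        for c in range(n_clusters)
    ]
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability of each cluster for each session.
        resp = []
        for s in sessions:
            logs = [
                math.log(weights[c])
                + sum(poisson_log_pmf(k, rates[c][p]) for p, k in enumerate(s))
                for c in range(n_clusters)
            ]
            top = max(logs)  # subtract the maximum for numerical stability
            probs = [math.exp(v - top) for v in logs]
            total = sum(probs)
            resp.append([v / total for v in probs])
        # M-step: re-estimate mixing weights and per-page Poisson rates.
        for c in range(n_clusters):
            mass = max(sum(r[c] for r in resp), 1e-12)
            weights[c] = max(mass / len(sessions), 1e-12)
            for p in range(n_pages):
                weighted = sum(r[c] * s[p] for r, s in zip(resp, sessions))
                rates[c][p] = max(weighted / mass, 1e-6)
    return weights, rates, resp


# Hypothetical sessions: counts for three pages per session.
sessions = [[30, 2, 0], [28, 3, 1], [31, 1, 0], [1, 25, 20], [0, 27, 22], [2, 24, 19]]
weights, rates, resp = em_poisson_mixture(sessions, n_clusters=2)
print([max(range(2), key=lambda c: r[c]) for r in resp])
```

A new session could then be assigned to its most probable cluster and offered the pages with the highest rates in that cluster, which is one plausible reading of the recommendation step.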
Combining classification algorithms using Dempster's rule of combination
The continuously growing volume of data makes it impossible, with existing statistical methods, to find and analyze the valuable information within large amounts of data. Because of the inadequacy of existing analysis tools, new solutions have been found for discovering and extracting valuable but hidden information from very large amounts of data. These solutions are data mining and data fusion. Data mining is the discovery and extraction of previously unknown but useful information from large amounts of data. Data mining uses analysis tools that enable the discovery of patterns in data and the extraction of relationships that can be used to make predictions about the future. Data fusion, on the other hand, is the process of combining information coming from different sensors. Data fusion algorithms are used in the defense sector for target tracking and target identification in intelligence, reconnaissance and surveillance operations.
Although data mining and data fusion are complementary processes, researchers work in these two areas independently of each other, without any interaction. Very few studies combine the techniques used in these areas to increase the effectiveness of classification. In this study, a new method is proposed to improve classification results. The method consists of combining the results obtained from different classification algorithms using Dempster's Rule of Combination, while also taking the accuracy of the classifiers into account. Experiments with different data sets taken from the UCI repository showed that combination using Dempster's Rule of Combination gives more accurate results than each of the classification algorithms used in the combination and than existing combined algorithms. Keywords: Classification, Dempster's rule of combination, degree of confidence.
The constantly growing volume of data makes it impossible to analyze and capture the valuable knowledge in large amounts of data using current statistical methods. Because of the insufficiency of current analysis tools, new solutions have been found for extracting the valuable but hidden knowledge in huge data. These solutions are data mining and data fusion. Data mining tries to extract implicit, previously unknown, and potentially useful information from large amounts of data. It is a process that uses a variety of data analysis tools to discover patterns and relationships in data that may be used to make valid predictions. Data fusion, on the other hand, is the process of combining information coming from different sensors. Data fusion algorithms are mostly used for target tracking and target identification purposes in intelligence, surveillance and reconnaissance operations in the defense sector.
Although data mining and data fusion are two complementary processes, people generally work on these two areas independently, without any interaction. Few studies combine the techniques used in these areas in order to improve classification performance. What is missing in most of the studies on this issue is uncertainty management. Every classifier has uncertainty to some extent. The second issue with current classification algorithms is the lack of a degree of confidence. The degree of confidence is the success that a certain classifier has displayed on similar data sets in the past. A classification algorithm must be able to use the degree of confidence in order to give more precise classification results. The Dempster-Shafer method, in other words the evidence combination rule, has the capability to handle uncertainty. It is widely used in data fusion for combining evidence obtained from different sensors. It does not require exact probability values in order to combine evidence. Pieces of information obtained from different information sources, some of them incomplete, can be combined using Dempster's Rule of Combination. However, Dempster's Rule of Combination does not presently include a degree of confidence. In our proposed method of combining classification algorithms using Dempster's Rule of Combination, we use the degree of confidence during the combination in order to improve the accuracy of the classification. Our proposed combination method makes the following contributions: use of the degree of confidence during the combination, uncertainty management in combining classifiers, and better classification results. In our proposed method, we first perform classification with different classification algorithms.
Treating the results of the classifiers as beliefs, we calculate mass functions for each classifier. We then combine the mass values in a pairwise fashion using Dempster's Rule of Combination. When there are more than two classifiers to combine, we first combine the first two classifiers and then combine the result of this combination with the third classifier, and this process continues until there are no classifiers left to combine. We perform different experiments using data sets taken from the UCI machine learning repository. Firstly, we test the success rate of the current classifiers in WEKA using 10 different data sets taken from the UCI machine learning repository. In the experiments we use the default values of the classifiers. We then check the success rate of current hybrid classifiers using the same data sets with the default values. Afterwards we test the proposed method of combining classifiers using Dempster's Rule of Combination on the same data sets. In the experiments we combine, in several combinations, four different classification algorithms selected as representatives of each class of classifiers. In order to make a one-to-one comparison of the proposed method with the current hybrid classification algorithms, we perform experiments with the current hybrid algorithms that have the capability of combining multiple classifiers, using the same data sets. The results of the experiments show that combining classifiers using Dempster's Rule of Combination with the degree of confidence not only performs better than each of the classifiers taking part in the combination but also performs better than the current hybrid classifiers. The results also show that using the degree of confidence during the combination gives more precise classification results, which also decreases uncertainty in the combination. Keywords: Classification, Dempster's rule of combination, degree of confidence
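The pairwise combination described above can be sketched as follows. This is a minimal illustration of Dempster's Rule of Combination, not the authors' implementation. Mass functions are represented as dictionaries keyed by frozensets of class labels. The abstract does not say how the degree of confidence enters the combination, so the `discount` step below uses classical Shafer discounting purely as an assumption. All function names are hypothetical.

```python
from functools import reduce
from itertools import product


def combine(m1, m2):
    """Combine two mass functions with Dempster's Rule of Combination."""
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        common = a & b
        if common:
            combined[common] = combined.get(common, 0.0) + wa * wb
        else:
            conflict += wa * wb  # mass assigned to contradictory evidence
    if 1.0 - conflict <= 1e-12:
        raise ValueError("sources are in total conflict and cannot be combined")
    # Normalise by the non-conflicting mass.
    return {k: v / (1.0 - conflict) for k, v in combined.items()}


def discount(m, confidence, frame):
    """Shafer discounting (an assumed way to apply a degree of confidence).

    Masses are scaled by `confidence`; the remainder goes to the whole frame,
    which represents ignorance.
    """
    out = {k: confidence * v for k, v in m.items() if k != frame}
    out[frame] = 1.0 - confidence + confidence * m.get(frame, 0.0)
    return out


def combine_with_confidence(masses, confidences, frame):
    """Discount each classifier's masses, then combine them pairwise in order."""
    discounted = [discount(m, c, frame) for m, c in zip(masses, confidences)]
    return reduce(combine, discounted)


# Hypothetical outputs of three classifiers over the classes "a" and "b".
A, B = frozenset({"a"}), frozenset({"b"})
frame = A | B
masses = [{A: 0.8, B: 0.2}, {A: 0.6, B: 0.4}, {A: 0.3, B: 0.7}]
fused = combine_with_confidence(masses, [0.9, 0.8, 0.6], frame)
print(max((A, B), key=lambda k: fused.get(k, 0.0)))
```

Using `reduce` mirrors the order described in the abstract: the first two classifiers are combined, and that result is combined with each following classifier in turn.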
Computer analysis of the Turkmen language morphology
This paper describes the implementation of a two-level morphological analyzer for the Turkmen Language. Like all Turkic languages, the Turkmen Language is an agglutinative language that has productive inflectional and derivational suffixes. In this work, we implemented a finite-state two-level morphological analyzer for Turkmen Language by using Xerox Finite State Tools
A MT System from Turkmen to Turkish employing finite state and statistical methods
In this work, we present an MT system from Turkmen to Turkish. Our system exploits the similarity of the languages by using a modified version of the direct translation method. However, the complex inflectional and derivational morphology of the Turkic languages necessitates special treatment in a word-by-word translation model. We also employ morphology-aware multi-word processing and statistical disambiguation processes in our system. We believe that this approach is valid for most of the Turkic languages and that the architecture, implemented using FSTs, can be easily extended to those languages.
A prototype machine translation system between Turkmen and Turkish
In this work, we present a prototype system for translating Turkmen texts into Turkish. Although machine translation (MT) is a very hard task, it is easier to implement an MT system between very close language pairs which have similar syntactic structure and word order. We implement a direct translation system between Turkmen and Turkish which performs a word-to-word transfer. We also use a Turkish language model to find the most probable Turkish sentence among all candidate translations generated by our system.
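The last step of this abstract, choosing the most probable Turkish sentence among candidate translations, can be illustrated with a small sketch. The abstract does not specify the language model, so a bigram model with add-one smoothing is assumed here. The training sentences, the per-word translation options and all function names are hypothetical.

```python
import math
from collections import Counter
from itertools import product


def train_bigram(corpus):
    """Count unigrams and bigrams over tokenised target-language sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence + ["</s>"]
        unigrams.update(tokens[:-1])           # contexts only
        bigrams.update(zip(tokens, tokens[1:]))
    vocab = {t for s in corpus for t in s} | {"</s>"}
    return unigrams, bigrams, len(vocab)


def log_prob(sentence, model):
    """Log-probability of a sentence under the add-one smoothed bigram model."""
    unigrams, bigrams, vocab_size = model
    tokens = ["<s>"] + sentence + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
        for a, b in zip(tokens, tokens[1:])
    )


def best_translation(options, model):
    """Pick the best candidate; options[i] lists the translations of word i."""
    candidates = (list(c) for c in product(*options))
    return max(candidates, key=lambda c: log_prob(c, model))


# Hypothetical target-language corpus and word-by-word translation options.
corpus = [["ben", "eve", "gidiyorum"], ["ben", "okula", "gidiyorum"]]
model = train_bigram(corpus)
options = [["ben"], ["eve", "ev"], ["gidiyorum"]]
print(best_translation(options, model))
```

Enumerating every combination with `product` is only practical for short sentences; a real system would more likely search the candidate lattice with a Viterbi-style algorithm.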
Probabilistic dependency parsing of Turkish
This study presents the results of the first statistical dependency parser developed for Turkish. Turkish is an agglutinative language with free constituent order within the sentence and complex inflectional and derivational morphology, and with these properties it poses interesting problems for statistical parsing. In Turkish, dependency relations are established between word segments called "inflectional groups". To find these dependencies, it is necessary to examine how the complex structure of Turkish should be modeled during parsing. In this study, probabilistic models that use different representational units for parsing are examined. As a starting point, three baseline models were developed, one of which is a rule-based parser. The performance of the three probabilistic models that were implemented was evaluated relative to the baseline models and to each other. The METU-Sabancı Turkish Treebank was used to train and test the parser. This is also the first study to be tested on this treebank with reported results. In this first investigation, the focus is on parsing a subset of the treebank that contains only rightward dependencies (where head words are to the right of their dependent words) that do not cross. Because of the size of the available treebank, the investigations do not use lexical information (using the full word or its stem as a feature in the representation of parsing units) and aim only at finding unlabeled dependencies between units. Our results show that, in terms of success in finding the correct dependencies between inflectional groups, the model that uses inflectional groups as the parsing unit and makes use of context information achieves the highest performance. Keywords: Dependency parsing, natural language processing, parsing, syntax analysis.
This paper presents results from the first statistical dependency parser for Turkish.
Turkish is a free-constituent order language with complex agglutinative inflectional and derivational morphology and presents interesting challenges for statistical parsing, as in general, dependency relations are between "portions" of words called inflectional groups. We have explored statistical models that use different representational units for parsing. We have used the Turkish Dependency Treebank to train and test our parser but have limited this initial exploration to that subset of the treebank sentences with only left-to-right non-crossing dependency links. Our results indicate that the best accuracy in terms of the dependency relations between inflectional groups is obtained when we use inflectional groups as units in parsing, and when contexts around the dependent are employed. Turkish shows very different characteristics from the well-studied languages in parsing literature. Many of these characteristics are common for all agglutinative languages such as Basque, Estonian, Finnish, Hungarian, Japanese and Korean. It is a flexible constituent order language. Even though in written texts, the constituent order of sentences generally conforms to the SOV or OSV structures, the constituents may freely change their position depending on the requirements of the discourse context. From the point of view of dependency structure, Turkish is predominantly (but not exclusively) head final. Furthermore, Turkish morphotactics is quite complicated: a given word form may involve multiple derivations and the number of word forms one can generate from a nominal or verbal root is theoretically infinite. Derivations in Turkish are very productive, and the syntactic relations that a word is involved in as a dependent or head element, are determined by the inflectional properties of the one or more (possibly intermediate) derived forms. 
In this work, we assume that a Turkish word is represented as a sequence of inflectional groups (IGs hereafter), separated by ^DBs, denoting derivation boundaries. A sentence would then be represented as a sequence of the IGs making up the words. When a word is considered as a sequence of IGs, linguistically, the last IG of a word determines its role as a dependent, so syntactic relation links only emanate from the last IG of a (dependent) word and land on one of the IGs of a (head) word on the right (with minor exceptions). And again with minor exceptions, the dependency links between the IGs, when drawn above the IG sequence, do not cross. We implemented three baseline parsers: 1. The first baseline parser links a word-final IG to the first IG of the next word on the right. 2. The second baseline parser links a word-final IG to the last IG of the next word on the right. 3. The third baseline parser is a deterministic rule-based parser that links each word-final IG to an IG on the right based on the approach of Nivre (2003). The parser uses 23 unlexicalized linking rules and a heuristic that links any non-punctuation word not linked by the parser to the last IG of the last word as a dependent. In addition to these, we implemented three probabilistic models: 1. 'Unlexicalized' Word-based Model, where the words are represented as the concatenation of their IGs and are used as the parsing unit during parsing. 2. IG-based Model, where each word is split into its IGs and the IGs are then used as the smallest parsing unit. 3. IG-based Model with Word-final IG Contexts, where the IGs are again used as the parsing unit. This model differs from the previous one in the way it uses the contextual units and calculates the distances between units. Our results indicate that all of our models perform better than the three baseline parsers, even when no contexts around the dependent and head units are used.
We get our best results with Model 3, where IGs are used as units for parsing and contexts are comprised of word-final IGs. The highest accuracy in terms of the percentage of correctly extracted IG-to-IG relations excluding punctuation (73.5%) was obtained when one word is used as context on both sides of the dependent. We also noted that using a smaller treebank to train our models did not result in a significant reduction in accuracy, indicating that the unlexicalized models are quite effective, but this may also hint that a larger treebank with unlexicalized modeling may not be useful for improving link accuracy. Keywords: Dependency parsing, natural language processing, parsing, syntax analysis.
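The first two baseline parsers described in this abstract are simple enough to sketch directly. The representation is an assumption made for illustration: a sentence is a list of words, each word is a list of IG labels, and a link is a pair of (word index, IG index) positions. The IG labels and the function name are hypothetical. The third, rule-based baseline and the probabilistic models are not covered.

```python
def baseline_links(sentence, attach_to="first"):
    """Link each word-final IG to an IG of the next word on the right.

    attach_to="first" follows the first baseline (first IG of the next word);
    attach_to="last" follows the second baseline (last IG of the next word).
    The last word of the sentence has no head to its right and gets no link.
    """
    if attach_to not in ("first", "last"):
        raise ValueError("attach_to must be 'first' or 'last'")
    links = []
    for i in range(len(sentence) - 1):
        dependent = (i, len(sentence[i]) - 1)  # word-final IG of word i
        head_ig = 0 if attach_to == "first" else len(sentence[i + 1]) - 1
        links.append((dependent, (i + 1, head_ig)))
    return links


# Hypothetical three-word sentence; the second word has two IGs.
sentence = [["Noun+Nom"], ["Verb+Pos", "Adj+PresPart"], ["Noun+Acc"]]
print(baseline_links(sentence, "first"))
print(baseline_links(sentence, "last"))
```

The two baselines differ only when the head word has more than one IG, which is the case the IG-based models are designed to handle.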
Machine translation from Uyghur to Turkish (Uygurcadan Türkçeye bilgisayarlı çeviri)
Machine translation is a sub-field of Natural Language Processing, which belongs to Artificial Intelligence. Generally, it is based on computer technology that uses software to translate one natural language into another. In the 1950s, the Georgetown experiment involved fully automatic translation of over sixty Russian sentences into English (Hutchins, 2004). The experiment was a great success and ushered in an era of substantial funding for machine translation research. One of the main projects initiated by the US at that time was a machine translation system which converted Russian to English. This project continued from 1950 to 1960. In 1964, government sponsors of machine translation in the United States formed the Automatic Language Processing Advisory Committee (ALPAC) to examine the project's potential. In its famous 1966 report, ALPAC concluded that machine translation was slower, less accurate and twice as expensive as human translation, and that "there is no immediate or predictable prospect of useful machine translation" (Hutchins, 1995). This report brought about the virtual end of machine translation research in the US for over a decade after its publication. As computer technology developed, high-capacity and high-speed computers were produced. Thus, the main restrictions on studying natural language were removed and machine translation gained the attention of the computer science community once again. Despite technological advances and the advent of new methods, general-purpose, fully automatic machine translation systems still do not exist. To date, few machine translation systems have been developed; furthermore, they can only be applied to restricted texts, and some post-editing is usually necessary after the initial translation. The main reasons for this are the morphological, syntactic and lexical differences between languages. As a result, translated texts still fall short of high-quality translations.
Recently, some machine translation systems designed for related languages, such as Czech to Slovak, Spanish to Catalan, and Turkmen to Turkish, have been implemented; studies on them have shown that successful translations can be produced efficiently. In this study, our aim was to implement a machine translation system between Uyghur and Turkish. Uyghur is an agglutinative language like the other Turkic languages (e.g. Turkmen, Kazakh, Kyrgyz, Uzbek and Azeri). All Turkic languages belong to the Ural-Altaic language family and are characteristically agglutinative languages which have productive inflectional and derivational morphology. Most research on natural language processing and machine translation of Turkic languages focuses on Turkish, mainly because there is active ongoing research on the subject in Turkey, which continues to produce valuable results. To date, few machine translation systems have been implemented between Turkic languages; examples include Turkish to Azeri, Turkish to Crimean Tatar, and Turkmen to Turkish. Unfortunately, little computational research on the Uyghur language exists. Turkic languages tend to have similar morphological structure and share some common word roots. The main shared properties include similar word order and syntactic structure. However, distinctions exist which prevent mutual intelligibility between these languages. In order to implement this translation system, we utilized a framework which is favored for translation between closely related agglutinative languages. Thus, we implemented a morphological analyzer for Uyghur with Xerox's Finite State Transducer (FST) tools. In this morphological analyzer we considered general cases for Uyghur and tagged Uyghur words with the same tags that were used for tagging words of other Turkic languages. Thus, it will be easy to integrate this system with other Turkic languages.
In order to improve the system's performance, we implemented a rule-based morphological disambiguator and, additionally, a disambiguator for word senses. We evaluated our system's performance using BLEU scores for 240 differently structured sentences. The result is a system that can successfully translate intermediate-level Uyghur into Turkish. Keywords: Machine translation, Turkic languages, Uyghur language, Turkish.
Machine Translation (MT) is a subtopic of Natural Language Processing (NLP), which is a sub-branch of artificial intelligence research. The idea of using computers for translation between languages emerged in the early 1950s. Since then, many languages have been studied and various methods have been developed. However, despite advances in technology and methods, general-purpose, high-performance translation systems have not yet been developed. The main reason for this is the large structural and expressive differences between languages. It is known that machine translation is easier between structurally similar languages. In recent years, systems capable of high-performance translation between close languages such as Czech-Slovak, Czech-Polish, Spanish-Catalan and Turkmen-Turkish have been developed. Systems developed for translation between related or close languages use simpler and more easily implemented methods than the complex methods required for machine translation between languages with large differences, such as Turkish-English. In this study, a machine translation system was developed from Uyghur to Turkish, languages that are classified in the same language family and are similar in many respects. In fact, however similar these languages may be, the differences that need to be resolved are far from negligible. In general, the syntax of Uyghur Turkic and Turkey Turkish is the same. Therefore, the order of the words does not change when developing the translation system.
However, the suffixes attached to words can differ greatly. Since Uyghur and Turkish are agglutinative languages, suffixes are very important. Suffixes change the meaning of words and even of the sentence. In this study, a machine translation system from Uyghur to Turkish was developed by adding a disambiguation method to the hybrid model developed for machine translation between related, agglutinative languages. Keywords: Machine translation, Turkic languages, Uyghur, Turkish
Machine translation of text from Turkmen to Turkish (Türkmenceden Türkçeye bilgisayarlı metin çevirisi)
Translation between languages using computers is one of the most important branches of natural language processing. However, despite advances in technology and methods, general-purpose, high-performance translation systems have not yet been made available for general use. The main reason for this is the large structural and expressive differences between languages. From this starting point, it seems that implementing Machine Translation (MT) between similar languages could be easier. Indeed, in recent years systems capable of producing high-quality output between very close languages such as Czech-Slovak, Czech-Polish and Spanish-Catalan have been developed. Moreover, these systems use simpler and more easily implemented methods than the complex methods required for MT between language pairs with deep differences, such as Japanese-English. In this study, an MT system was developed between Turkmen and Turkish, which are classified in the same language family and are similar in many respects. However similar these languages may be, the differences that need to be resolved are considerable. The differences among the Turkic languages are greater than those of the language pairs mentioned above, and there is no mutual intelligibility. The system uses a hybrid translation model consisting of both rule-based and statistical components and is based on the principle of transferring the words of a Turkmen sentence into Turkish word by word without changing their order. However, because of the complex morphological properties of the agglutinative Turkic languages, the simple direct transfer methods usable for other languages were extended before being used. The performance of the system was measured with the BLEU method, and it was shown that the model can produce successful results. Keywords: Machine translation, Turkic languages, Turkmen, Turkish.
Machine translation is a popular but hard field of natural language processing.
Despite the huge development of technology and the invention of new methods, general-purpose, fully automatic Machine Translation (MT) systems do not exist. Today's MT systems either require post-editing or are far from generating high-quality translations, particularly of unrestricted texts. The primary reasons for this are the morphological, syntactic and lexical differences between languages. The more distant the source and target languages are, the more complex the methods or models needed to build an MT system between them. Intuitively, this implies that MT between related languages can be easier than MT between languages that have completely different structures (e.g. Japanese and English). Recently, MT systems between related languages, such as Czech to Slovak, Czech to Polish and Spanish to Catalan, have been implemented, and these studies showed that successful translations can be produced with relatively little effort. In this work our aim is to build an MT model between Turkic languages. Some of the Turkic languages are Turkish, Azerbaijani, Uzbek, Turkmen, Kyrgyz, Kazakh and Uighur. All of the Turkic languages are agglutinative languages which have productive inflectional and derivational morphology. A high level of similarity can be observed between Turkic languages, especially in word order and syntactic structure. They have similar morphological structure and share some common word roots. However, some divergences that prevent mutual intelligibility are observed between these languages. When previous studies are extended to Turkic languages, some serious problems emerge due to both the agglutinative structure of the languages and the scarcity of resources. Except for Turkish, most of the Turkic languages are computationally resource-poor, which means a lack of training corpora, morphological analyzers, POS taggers and machine-readable dictionaries.
The model we present in this work is a hybrid model that has both rule-based and statistical components. Since the word order of Turkic languages is almost the same, a direct transfer approach is used, by means of word-by-word translation. Morphological processing, which can generate ambiguous results, is the first step of almost every NLP task for agglutinative languages. Then, the actual transfer is carried out by transferring the root words and morphological features to the target language. During the transfer of the root word, another type of ambiguity, lexical ambiguity, emerges. As an exception to the word-by-word transfer approach, some additional sentence-level processing is done in order to translate Multi-Word Units (MWU) correctly. The two types of ambiguity are resolved by the disambiguation component, which exploits Statistical Language Models (SLM) trained on the target language. In the next component, a target language morphological generator produces the surface forms of the resulting candidate translation. As a last step, some work is done at the sentence level because of some long-distance dependencies and a number of transfer rules for some phrase structures. The statistical disambiguation component is based on SLMs, which are normally generated using surface forms from the training corpus. For an agglutinative language, however, such training suffers heavily from the sparse data problem, so we propose some SLM types in which various parts of the full morphological parses are modeled. The performance of these types is investigated as well as the performance of the whole system. To evaluate the practical performance of our model, we implemented a generic MT framework based on our model and built a Turkmen to Turkish MT system using this framework. The rule-based modules of the system are implemented as Finite State Transducers (FST) using the Xerox Finite State Toolkit.
A Turkmen morphological analyzer is implemented in a two-level manner, while an existing wide-coverage Turkish morphological analyzer is used in the generating direction as the target language morphological generator. We used BLEU as the automatic evaluation metric for our MT system. The results showed that general-purpose MT between Turkic languages can be achieved relatively easily and can generate high-quality translations. Keywords: Machine Translation, Turkic languages, Turkmen Language, Turkish.
Context Free Grammar For Turkish
Formal Grammar, which was introduced by Chomsky, is one of the most important developments in Natural Language Processing, a branch of Artificial Intelligence. Formal Grammars make a mathematical representation of languages possible. Almost all natural languages have word classes such as noun, adjective and verb. In addition, a sentence consists of a noun phrase and a verb phrase. A noun phrase may consist of location, destination and source elements. Despite many similarities between languages, there are important dissimilarities in the grammar rules of languages belonging to different language families. In our study, the most appropriate formal grammar for representing the Turkish language is investigated. The accuracy of the suggested grammar rules is evaluated on two different corpora. This study is the enhanced version of “Turkish Context Free Grammar Rules with Case Suffix and Phrase Relation”, which was presented at the UBMK 2016 International Conference on Computer Science & Engineering. Unlike the first study, this study includes all word and sentence types of Turkish. Adjectives and prepositions are considered. Quoted sentences, incomplete sentences and question sentences are included. Genitive phrase structures that include a verbal word are included. In this study, the noun phrases are also defined in detail.