
    Recent advances in Apertium, a free/open-source rule-based machine translation platform for low-resource languages

    This paper presents an overview of Apertium, a free and open-source rule-based machine translation platform. Translation in Apertium happens through a pipeline of modular tools, and the platform continues to be improved as more language pairs are added. Several advances have been implemented since the last publication, including some new optional modules: a module that allows rules to process recursive structures at the structural transfer stage, a module that deals with contiguous and discontiguous multi-word expressions, and a module that resolves anaphora to aid translation. Also highlighted is the hybridisation of Apertium through statistical modules that augment the pipeline, and statistical methods that augment existing modules. This includes morphological disambiguation, weighted structural transfer, and lexical selection modules that learn from limited data. The paper also discusses how a platform like Apertium can be a critical part of access to language technology for so-called low-resource languages, which might be ignored or deemed unapproachable by popular corpus-based translation technologies. Finally, the paper presents some of the released and unreleased language pairs, concluding with a brief look at some supplementary Apertium tools that prove valuable to users as well as language developers. All Apertium-related code, including language data, is free/open-source and available at https://github.com/apertium

    Europe: So Many Languages, So Many Cultures

    The number of languages in Europe far exceeds the number of countries. All European countries have national languages, nearly all of them have minority languages as well, and all major languages have dialects. National borders rarely coincide with linguistic borders, but linguistic borders (including dialect borders) by their nature also mark more or less distinct cultural areas. This paper presents a survey of the different language families represented in Europe: Indo-European, Uralic, Altaic, and the four Caucasian language families, each with their sub-branches and individual languages. Some information is given on characteristic structural phenomena, on the status and history of these languages or language families, and on some of their extinct predecessors. The paper ends with a short discussion of the language policy and practices of the institutions of the European Union. Europe lacks a language with a status and power comparable to that of Indonesian in Indonesia. The policy is therefore based on equal status for all national languages and on respect for all languages, including national minority ones. The practice, however, is unavoidably practical: “the more languages, the more English”.

    Türkmenceden Türkçeye bilgisayarlı metin çevirisi

    Translating between languages by computer is one of the most important branches of natural language processing. However, despite advances in technology and methods, general-purpose, high-performance translation systems have not yet been made available for general use. The main reason is the large structural and expressive differences between languages. This suggests that implementing Machine Translation (MT) between similar languages may be easier. Indeed, in recent years, systems that produce high-quality output have been developed for very closely related languages such as Czech-Slovak, Czech-Polish, and Spanish-Catalan. Moreover, these systems use methods that are simpler and easier to implement than the complex methods needed for MT between language pairs with deep differences, such as Japanese-English. In this work, an MT system was developed between Turkmen and Turkish, which are classified in the same language family and are similar in many respects. However similar these languages are, the differences that must be resolved are far from negligible. The differences among the Turkic languages are greater than those between the language pairs mentioned above, and the languages are not mutually intelligible. The system uses a hybrid translation model with both rule-based and statistical components and is based on the principle of word-by-word transfer into Turkish without changing the word order of the Turkmen sentence. However, because of the complex morphology of the agglutinative Turkic languages, the simple direct transfer methods that can be used for other languages were extended before being applied. The system's performance was measured with BLEU, and it was shown that the model can produce successful results. Keywords: Machine translation, Turkic languages, Turkmen, Turkish. Machine translation is a popular but hard field of natural language processing.
Despite major developments in technology and the invention of new methods, general-purpose, fully automatic Machine Translation (MT) systems do not exist. Today's MT systems either require post-editing or are far from generating high-quality translations, particularly of unrestricted texts. The primary reasons are the morphological, syntactic, and lexical differences between languages. The more distant the source and target languages are, the more complex the methods or models needed to build an MT system between them. Intuitively, this implies that MT between related languages can be easier than MT between languages that have completely different structures (e.g. Japanese and English). Recently, MT between related languages such as Czech to Slovak, Czech to Polish, and Spanish to Catalan has been implemented, and these studies showed that successful translations can be produced with relatively simple methods. In this work, our aim is to build an MT model between Turkic languages. The Turkic languages include Turkish, Azerbaijani, Uzbek, Turkmen, Kyrgyz, Kazakh, and Uighur. All of the Turkic languages are agglutinative, with productive inflectional and derivational morphology. A high level of similarity can be observed between Turkic languages, especially in word order and syntactic structure. They have similar morphological structure and share some common word roots. However, some divergences that prevent mutual intelligibility are observed between these languages. Extending previous studies to Turkic languages raises serious problems due to both the agglutinative structure of the languages and resource scarcity. Except for Turkish, most of the Turkic languages are computationally resource-poor, meaning that they lack training corpora, morphological analyzers, POS taggers, and machine-readable dictionaries.
The model presented in this work is a hybrid model that has both rule-based and statistical components. Since the word order of the Turkic languages is almost the same, a direct transfer approach is used, translating word by word. Morphological processing, which can generate ambiguous results, is the first step of almost every NLP task for agglutinative languages. The actual transfer is then carried out by transferring the root words and morphological features to the target language. During the transfer of root words, another type of ambiguity, lexical ambiguity, emerges. As an exception to the word-by-word transfer approach, some additional sentence-level processing is done in order to translate Multi-Word Units (MWU) correctly. The two types of ambiguity are resolved by the disambiguation component, which exploits Statistical Language Models (SLM) trained on the target language. In the next component, a target language morphological generator produces the surface forms of the resulting candidate translation. As a last step, some work is done at the sentence level because of some long-distance dependencies and a number of transfer rules for some phrase structures. The statistical disambiguation component is based on SLMs, which are normally generated using surface forms from the training corpus. For an agglutinative language, however, such training suffers heavily from the sparse data problem, so we propose some SLM types in which various parts of the full morphological parses are modeled. The performance of these types is investigated, as well as the performance of the whole system. To evaluate the practical performance of our model, we have implemented a generic MT framework based on our model and built a Turkmen-to-Turkish MT system using this framework. The rule-based modules of the system are implemented as Finite State Transducers (FST) using the Xerox Finite State Toolkit.
A Turkmen morphological analyzer is implemented in a two-level manner, while an existing wide-coverage Turkish morphological analyzer is used in the generating direction as the target language morphological generator. We used BLEU as the automatic evaluation metric for our MT system. The results showed that general-purpose MT between Turkic languages can be achieved, even relatively easily, and can generate high-quality translations. Keywords: Machine Translation, Turkic languages, Turkmen Language, Turkish.
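
The disambiguation step described in this abstract (several candidate target roots per source root, word order left unchanged, and a target-side statistical language model choosing among the candidates) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the placeholder tokens, the toy lexicon and corpus, the function names, and the choice of an add-one-smoothed bigram model over roots. The abstract does not specify the authors' model at this level, and the sketch leaves out morphological analysis, feature transfer, and surface-form generation.

```python
import math
from collections import Counter
from itertools import product

# Hypothetical toy data: placeholder tokens, not real Turkmen or Turkish words.
TOY_LEXICON = {
    "s_a": ["t_a"],
    "s_b": ["t_b1", "t_b2"],  # a lexically ambiguous source root
    "s_c": ["t_c"],
}
TOY_TARGET_CORPUS = [
    ["t_a", "t_b1", "t_c"],
    ["t_a", "t_b1"],
    ["t_b2", "t_a"],
]


def train_bigram_lm(sentences):
    """Collect history and bigram counts from target-side root sequences."""
    histories, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        histories.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    vocab = {tok for sent in sentences for tok in sent} | {"</s>"}
    return histories, bigrams, len(vocab)


def log_prob(sequence, lm):
    """Add-one-smoothed bigram log-probability of a candidate sequence."""
    histories, bigrams, vocab_size = lm
    tokens = ["<s>"] + list(sequence) + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (histories[a] + vocab_size))
        for a, b in zip(tokens, tokens[1:])
    )


def translate(source_roots, lexicon, lm):
    """Word-by-word transfer; the language model picks among candidates."""
    # Roots missing from the lexicon are passed through unchanged.
    options = [lexicon.get(root, [root]) for root in source_roots]
    best = max(product(*options), key=lambda cand: log_prob(cand, lm))
    return list(best)


TOY_LM = train_bigram_lm(TOY_TARGET_CORPUS)
print(translate(["s_a", "s_b", "s_c"], TOY_LEXICON, TOY_LM))
```

Enumerating every combination with `product` is exponential in sentence length, so a realistic system would more likely use a Viterbi-style search. The exhaustive version is kept here only because it makes the selection criterion easy to see.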

    A Large-Scale Study of Machine Translation in Turkic Languages

    Recent advances in neural machine translation (NMT) have pushed the quality of machine translation systems to the point where they are becoming widely adopted to build competitive systems. However, there is still a large number of languages that are yet to reap the benefits of NMT. In this paper, we provide the first large-scale case study of the practical application of MT in the Turkic language family in order to realize the gains of NMT for Turkic languages under high-resource to extremely low-resource scenarios. In addition to presenting an extensive analysis that identifies the bottlenecks towards building competitive systems to ameliorate data scarcity, our study has several key contributions, including: i) a large parallel corpus covering 22 Turkic languages consisting of common public datasets in combination with new datasets of approximately 1.4 million parallel sentences, ii) bilingual baselines for 26 language pairs, iii) novel high-quality test sets in three different translation domains and iv) human evaluation scores. All models, scripts, and data will be released to the public. Peer reviewed

    Disputed ethnic identity and the role of public education: the case of Moldova

    This dissertation examines the case of Moldova, where two ethnic nationalisms (Moldovan and Romanian) have battled over the content of national identity over the last two decades. Historically, the land on which Moldova lies was caught in a tug-of-war between Russia (later the Soviet Union) and Romania. Because Moldova shares ethnic traits with Romania, Romanian nationalism emerged early there, only to be later deconstructed by the Soviets through deportations and executions of Romanian nationalists, and eventually reconstructed as a "Moldavian" identity. This dissertation has two goals. First, through archival and historical research it traces the process of formation of ethnic identity and the emergence of two conflicting nationalisms in Moldova. Second, it investigates the role of public education in ethno-national identity formation through interviews and a survey of Moldovan students. I hypothesize that because self-identified Romanians control the school curricula, the younger generation is more likely to identify as Romanian than the rest of the population, whose connection with school is more distant. To test this thesis, I conducted an original survey of students from seven schools. In contrast to the primordialist theory of nationalism, these findings indicate a relatively fluid national identity. However, the case of Moldovan nationalism also contradicts the instrumentalist school of thought, which over-emphasizes the socio-economic interests of nationalist agents and fails to take into account the cultural motivations of nationalism. The Moldovan story indicates that at the forefront of the Romanian nationalist movement were the relatively well-off intellectuals and not the rural and urban working people, as the accounts of Cash and Crowther indicate. Lastly, the structuralist (materialist) school fails to acknowledge the power of ideas and the effect they have on historical events.
While material means like print media, capital markets, and urbanization facilitated the diffusion of these ideas, they did not create them. As the case of Moldova illustrates, the emergence of nationalism cannot be explained without an understanding of the motivations of the agents involved.

    MA

    The main premise of this thesis is that subject agreement morphology in Tuyuca can be isolated from the rest of the morphology. Subject agreement appears on evidentials, nominalizers, animate classifiers, gerunds, and verb stems requiring an auxiliary. This agreement is instantiated by a pervasive final vowel pattern that codes various values of gender, number, and person features. These final vowels also code the same information on nouns and pronouns. Before arguing for my analysis I provide some preliminary material on Tuyuca. Chapter 1 is a brief discussion of the sociolinguistic context of the language. Chapter 2 discusses issues relevant to Tuyuca data and surveys some of the literature related to Tuyuca; it also discusses some methodological concerns arising from the data and important to the thesis in general. Chapter 3 is a brief sketch of Tuyuca grammar important to agreement. Analysis is done in Chapters 4 and 5. In Chapter 4 I argue, in a descriptive-typological framework, that by isolating agreement a general deverbalizing function can be seen coded in the morpheme /-g-/. This morpheme has predictable interpretations in restricted morphosyntactic environments. It can be interpreted as a progressive or perfective aspect, an animate classifier, a gerund, and a nominalizer. In Chapter 5 I relate the general premise of isolating agreement in Tuyuca to theoretical issues belonging to the Minimalist Program. I show that isolating agreement morphemes from evidentials is, assuming the analysis in Chapter 4, straightforward. This has a practical advantage of making it easier to observe variation between present tense and past tense morphology of the evidentials. I take this as straightforward evidence that tense is fused with evidential. I also give evidence that supports the pro-drop status of Tuyuca, conjecturing that subject agreement is packaged with nominative case.
I also argue informally that verbal inflection of tense-evidentials and subject agreement are "extensions" of the verb phrase and relate the predication of VP to some speech time and discourse situation of the verb event, relative to some specific world. This results in a model of functional hierarchy that places Evidential under Tense Phrase. I conjecture that this Evidential position is a predicational one, in contrast to the more accepted notions of Mood[evidential] or Modal[epistemic], which are known to be above Tense Phrase. I provide two detailed models, one with the conventional hierarchy and one with my hierarchy, arguing for the latter on the basis of general principles of syntactic economy and locality. I also provide a technical analysis of syntactic locality for the morphosyntactic fusion of tense-evidentials in a Distributed Morphology framework.

    Evidentiality in Uzbek and Kazakh

    The purpose of this work is to describe and account for the broad range of phenomena referred to as “evidentiality” in two Turkic languages: Uzbek and Kazakh. Much previous work on the Turkic languages treats evidentiality as a distinct verbal category. However, morphemes that express evidential meaning also often express other meanings such as dubitativity and admirativity, or may even express rhetorical questions. This work follows Friedman (1978; 1981; 1988) and others in considering these meanings to be the result of an evidential-like strategy: the expression of non-confirmativity. In Uzbek and Kazakh, as well as in many other Eurasian languages, the past tense is the locus of evidential meaning. There are three items in the Uzbek and Kazakh past tense paradigm, and these differ in terms of markedness for confirmativity: one is marked as confirmative, one as non-confirmative, and one is unmarked for confirmativity. The unmarked item, often referred to as the perfect, exists in a copular form. As a copular form, it expresses marked non-confirmativity. When this copular form (in Uzbek: ekan, in Kazakh: eken) is employed to express non-confirmativity, this non-confirmativity is manifested either as non-firsthand information source or as admirativity. By employing the non-confirmative analysis, we are able to account for the broad range of phenomena considered “evidential” without resorting to postulating an evidential category. Rather, in Uzbek and Kazakh, evidential meaning is merely one effect of the expression of non-confirmativity, which is a subtype of the categories of status or modality.