111 research outputs found

    Evaluating Multiway Multilingual NMT in the Turkic Languages

    Despite the increasing number of large and comprehensive machine translation (MT) systems, evaluation of these methods in various languages has been restrained by the lack of high-quality parallel corpora as well as engagement with the people that speak these languages. In this study, we present an evaluation of state-of-the-art approaches to training and evaluating MT systems in 22 languages from the Turkic language family, most of which are extremely under-explored. First, we adopt the TIL Corpus with a few key improvements to the training and the evaluation sets. Then, we train 26 bilingual baselines as well as a multi-way neural MT (MNMT) model using the corpus and perform an extensive analysis using automatic metrics as well as human evaluations. We find that the MNMT model outperforms almost all bilingual baselines in the out-of-domain test sets, and that finetuning the model on a downstream task of a single pair also results in a huge performance boost in both low- and high-resource scenarios. Our attentive analysis of evaluation criteria for MT models in Turkic languages also points to the necessity for further research in this direction. We release the corpus splits, test sets, and models to the public.

    A Large-Scale Study of Machine Translation in Turkic Languages

    Recent advances in neural machine translation (NMT) have pushed the quality of machine translation systems to the point where they are becoming widely adopted to build competitive systems. However, there is still a large number of languages that are yet to reap the benefits of NMT. In this paper, we provide the first large-scale case study of the practical application of MT in the Turkic language family in order to realize the gains of NMT for Turkic languages under high-resource to extremely low-resource scenarios. In addition to presenting an extensive analysis that identifies the bottlenecks towards building competitive systems to ameliorate data scarcity, our study has several key contributions, including: i) a large parallel corpus covering 22 Turkic languages consisting of common public datasets in combination with new datasets of approximately 1.4 million parallel sentences, ii) bilingual baselines for 26 language pairs, iii) novel high-quality test sets in three different translation domains, and iv) human evaluation scores. All models, scripts, and data will be released to the public.

    From Masked Language Modeling to Translation: Non-English Auxiliary Tasks Improve Zero-shot Spoken Language Understanding

    The lack of publicly available evaluation data for low-resource languages limits progress in Spoken Language Understanding (SLU). As key tasks like intent classification and slot filling require abundant training data, it is desirable to reuse existing data in high-resource languages to develop models for low-resource scenarios. We introduce xSID, a new benchmark for cross-lingual (x) Slot and Intent Detection in 13 languages from 6 language families, including a very low-resource dialect. To tackle the challenge, we propose a joint learning approach, with English SLU training data and non-English auxiliary tasks from raw text, syntax and translation for transfer. We study two setups which differ by type and language coverage of the pre-trained embeddings. Our results show that jointly learning the main tasks with masked language modeling is effective for slots, while machine translation transfer works best for intent classification.

    Words in Space and Time

    With forty-two extensively annotated maps, this atlas offers novel insights into the history and mechanics of how Central Europe’s languages have been made, unmade, and deployed for political action. The innovative combination of linguistics, history, and cartography makes a wealth of hard-to-reach knowledge readily available to both specialist and general readers. It combines information on languages, dialects, alphabets, religions, mass violence, or migrations over an extended period of time. The story first focuses on Central Europe’s dialect continua, the emergence of states, and the spread of writing technology from the tenth century onward. Most maps concentrate on the last two centuries. The main storyline opens with the emergence of the Western European concept of the nation, in accord with which the ethnolinguistic nation-states of Italy and Germany were founded. In the Central European view, a “proper” nation is none other than the speech community of a single language. The Atlas aspires to help users make the intellectual leap of perceiving languages as products of human history and part of culture. Like states, nations, universities, towns, associations, art, beauty, religions, injustice, or atheism, languages are artefacts invented and shaped by individuals and their groups.

    “Parallel Worlds“. Clusters for a Theory of Concepts of Communications. Historical Intercultural and Cultural Comparative Studies in Perspectives of National and Transnational Constitutions, Values, Concepts, and Terms of ‘Communication’ - ‘Orality’ - ‘Literacy’ - ‘Rhetoric’ - ‘Media’.

    This is a study of the history of communication, based on several clusters traced from ancient times to the 21st century. Its second part also contains chapters on the specific conditions of communication in different cultures.
