111 research outputs found
Evaluating Multiway Multilingual NMT in the Turkic Languages
Despite the increasing number of large and comprehensive machine translation (MT) systems, evaluation of these methods in various languages has been restrained by the lack of high-quality parallel corpora as well as engagement with the people that speak these languages. In this study, we present an evaluation of state-of-the-art approaches to training and evaluating MT systems in 22 languages from the Turkic language family, most of which are extremely under-explored. First, we adopt the TIL Corpus with a few key improvements to the training and the evaluation sets. Then, we train 26 bilingual baselines as well as a multi-way neural MT (MNMT) model using the corpus and perform an extensive analysis using automatic metrics as well as human evaluations. We find that the MNMT model outperforms almost all bilingual baselines in the out-of-domain test sets, and that finetuning the model on a downstream task of a single pair also results in a huge performance boost in both low- and high-resource scenarios. Our attentive analysis of evaluation criteria for MT models in Turkic languages also points to the necessity for further research in this direction. We release the corpus splits, test sets, and models to the public. Peer reviewed.
A Large-Scale Study of Machine Translation in Turkic Languages
Recent advances in neural machine translation (NMT) have pushed the quality of machine translation systems to the point where they are becoming widely adopted to build competitive systems. However, there is still a large number of languages that are yet to reap the benefits of NMT. In this paper, we provide the first large-scale case study of the practical application of MT in the Turkic language family in order to realize the gains of NMT for Turkic languages under high-resource to extremely low-resource scenarios. In addition to presenting an extensive analysis that identifies the bottlenecks towards building competitive systems to ameliorate data scarcity, our study has several key contributions, including: i) a large parallel corpus covering 22 Turkic languages consisting of common public datasets in combination with new datasets of approximately 1.4 million parallel sentences, ii) bilingual baselines for 26 language pairs, iii) novel high-quality test sets in three different translation domains, and iv) human evaluation scores. All models, scripts, and data will be released to the public. Peer reviewed.
From Masked Language Modeling to Translation: Non-English Auxiliary Tasks Improve Zero-shot Spoken Language Understanding
The lack of publicly available evaluation data for low-resource languages limits progress in Spoken Language Understanding (SLU). As key tasks like intent classification and slot filling require abundant training data, it is desirable to reuse existing data in high-resource languages to develop models for low-resource scenarios. We introduce xSID, a new benchmark for cross-lingual (x) Slot and Intent Detection in 13 languages from 6 language families, including a very low-resource dialect. To tackle the challenge, we propose a joint learning approach, with English SLU training data and non-English auxiliary tasks from raw text, syntax and translation for transfer. We study two setups which differ by type and language coverage of the pre-trained embeddings. Our results show that jointly learning the main tasks with masked language modeling is effective for slots, while machine translation transfer works best for intent classification.
Cross-Lingual and Low-Resource Sentiment Analysis
Identifying sentiment in a low-resource language is essential for understanding opinions internationally and for responding to the urgent needs of locals affected by disaster incidents in different world regions. While tools and resources for recognizing sentiment in high-resource languages are plentiful, determining the most effective methods for achieving this task in a low-resource language which lacks annotated data is still an open research question. Most existing approaches for cross-lingual sentiment analysis to date have relied on high-resource machine translation systems, large amounts of parallel data, or resources only available for Indo-European languages.
This work presents methods, resources, and strategies for identifying sentiment cross-lingually in a low-resource language. We introduce a cross-lingual sentiment model which can be trained on a high-resource language and applied directly to a low-resource language. The model offers the feature of lexicalizing the training data using a bilingual dictionary, but can perform well without any translation into the target language.
Through an extensive experimental analysis, evaluated on 17 target languages, we show that the model performs well with bilingual word vectors pre-trained on an appropriate translation corpus. We compare in-genre and in-domain parallel corpora, out-of-domain parallel corpora, in-domain comparable corpora, and monolingual corpora, and show that a relatively small, in-domain parallel corpus works best as a transfer medium if it is available. We describe the conditions under which other resources and embedding generation methods are successful, and these include our strategies for leveraging in-domain comparable corpora for cross-lingual sentiment analysis.
To enhance the ability of the cross-lingual model to identify sentiment in the target language, we present new feature representations for sentiment analysis that are incorporated in the cross-lingual model: bilingual sentiment embeddings that are used to create bilingual sentiment scores, and a method for updating the sentiment embeddings during training by lexicalization of the target language. This feature configuration works best for the largest number of target languages in both untargeted and targeted cross-lingual sentiment experiments.
The cross-lingual model is studied further by evaluating the role of the source language, which has traditionally been assumed to be English. We build cross-lingual models using 15 source languages, including two non-European and non-Indo-European source languages: Arabic and Chinese. We show that language families play an important role in the performance of the model, as does the morphological complexity of the source language.
In the last part of the work, we focus on sentiment analysis towards targets. We study Arabic as a representative morphologically complex language and develop models and morphological representation features for identifying entity targets and sentiment expressed towards them in Arabic open-domain text. Finally, we adapt our cross-lingual sentiment models for the detection of sentiment towards targets. Through cross-lingual experiments on Arabic and English, we demonstrate that our findings regarding resources, features, and language also hold true for the transfer of targeted sentiment.
Words in Space and Time
With forty-two extensively annotated maps, this atlas offers novel insights into the history and mechanics of how Central Europe’s languages have been made, unmade, and deployed for political action. The innovative combination of linguistics, history, and cartography makes a wealth of hard-to-reach knowledge readily available to both specialist and general readers. It combines information on languages, dialects, alphabets, religions, mass violence, or migrations over an extended period of time.
The story first focuses on Central Europe’s dialect continua, the emergence of states, and the spread of writing technology from the tenth century onward. Most maps concentrate on the last two centuries. The main storyline opens with the emergence of the Western European concept of the nation, in accord with which the ethnolinguistic nation-states of Italy and Germany were founded. In the Central European view, a “proper” nation is none other than the speech community of a single language. The Atlas aspires to help users make the intellectual leap of perceiving languages as products of human history and part of culture. Like states, nations, universities, towns, associations, art, beauty, religions, injustice, or atheism, languages are artefacts invented and shaped by individuals and their groups.
“Parallel Worlds”. Clusters for a Theory of Concepts of Communications. Historical Intercultural and Cultural Comparative Studies in Perspectives of National and Transnational Constitutions, Values, Concepts, and Terms of ‘Communication’ - ‘Orality’ - ‘Literacy’ - ‘Rhetoric’ - ‘Media’.
This is a study of the history of communication, based on several clusters traced from ancient times to the 21st century. The second part also contains chapters on the specific conditions of communication in different cultures.