73 research outputs found

    Evaluating Multiway Multilingual NMT in the Turkic Languages

    Despite the increasing number of large and comprehensive machine translation (MT) systems, evaluation of these methods in various languages has been restrained by the lack of high-quality parallel corpora as well as engagement with the people who speak these languages. In this study, we present an evaluation of state-of-the-art approaches to training and evaluating MT systems in 22 languages from the Turkic language family, most of which are extremely under-explored. First, we adopt the TIL Corpus with a few key improvements to the training and the evaluation sets. Then, we train 26 bilingual baselines as well as a multi-way neural MT (MNMT) model using the corpus and perform an extensive analysis using automatic metrics as well as human evaluations. We find that the MNMT model outperforms almost all bilingual baselines on the out-of-domain test sets, and that fine-tuning the model on the downstream task of a single pair also results in a large performance boost in both low- and high-resource scenarios. Our careful analysis of evaluation criteria for MT models in Turkic languages also points to the necessity for further research in this direction. We release the corpus splits, test sets, and models to the public.
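
    To make the metric-based side of such an evaluation concrete, here is a minimal sketch using the sacrebleu library to score system outputs against references with BLEU and chrF. The file names and language pair are hypothetical placeholders, and this is not the authors' own evaluation pipeline.

```python
# Minimal sketch of automatic MT evaluation with sacrebleu (BLEU and chrF).
# File names and the language pair below are hypothetical placeholders.
import sacrebleu


def score_outputs(hyp_path: str, ref_path: str) -> None:
    with open(hyp_path, encoding="utf-8") as f:
        hypotheses = [line.strip() for line in f]
    with open(ref_path, encoding="utf-8") as f:
        references = [line.strip() for line in f]

    # sacrebleu expects a list of reference streams (one list per reference set).
    bleu = sacrebleu.corpus_bleu(hypotheses, [references])
    chrf = sacrebleu.corpus_chrf(hypotheses, [references])
    print(f"BLEU: {bleu.score:.2f}")
    print(f"chrF: {chrf.score:.2f}")


if __name__ == "__main__":
    score_outputs("mnmt.kaz-uzb.hyp", "test.kaz-uzb.ref")
```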

    Theory and Applications for Advanced Text Mining

    The growth of computer and web technologies makes it easy to collect and store large amounts of text data, and this data can be expected to contain useful knowledge. Text mining techniques have been studied intensively since the late 1990s in order to extract that knowledge. Although many important techniques have been developed, the text mining research field continues to expand to meet needs arising from various application fields. This book is composed of 9 chapters introducing advanced text mining techniques, ranging from relation extraction to the processing of under-resourced languages. I believe that this book will bring new knowledge to the text mining field and help many readers open up new research directions.

    Multiethnic Societies of Central Asia and Siberia Represented in Indigenous Oral and Written Literature

    Central Asia and Siberia are characterized by multiethnic societies formed by a patchwork of often small ethnic groups. At the same time, large parts of these regions have been dominated by state languages, especially Russian and Chinese. At the local level, the languages of the autochthonous peoples often play a role parallel to the central national language. The contributions in these conference proceedings take up questions such as: What was or is collected, and how can it be used under changed conditions in the research landscape? How does it help local ethnic communities to understand and preserve their own culture and language? Do the spatially dispersed but often networked collections support research on the ground? What contribution do these collections make to the local languages and cultures against the backdrop of dwindling attention to endangered groups? These and other questions are discussed against the background of the important role that libraries and private collections play for multiethnic societies in often remote regions that are difficult to reach.

    Resource Generation from Structured Documents for Low-density Languages

    The availability and use of electronic resources for both manual and automated language-related processing has increased tremendously in recent years. Nevertheless, many resources still exist only in printed form, restricting their availability and use. This especially holds true for low-density languages, that is, languages with limited electronic resources. For these documents, automated conversion into electronic resources is highly desirable. This thesis focuses on the semi-automated conversion of printed structured documents (dictionaries in particular) into usable electronic representations. In the first part we present an entry tagging system that recognizes, parses, and tags the entries of a printed dictionary to reproduce its representation. The system uses the consistent layout and structure of the dictionaries, and the features that impose this structure, to capture and recover lexicographic information. We accomplish this by adapting two methods: rule-based and HMM-based. The system is designed to produce results quickly with minimal human assistance and reasonable accuracy. The use of adaptive transformation-based learning as a post-processor at two points in the system yields significant improvements, even with an extremely small amount of user-provided training data. The second part of this thesis presents Morphology Induction from Noisy Data (MIND), a natural language morphology discovery framework that operates on the limited, noisy data obtained from the conversion process. To use the resulting resources effectively, users must be able to search for them using the root form of a morphologically deformed variant found in the text; stemming and data-driven methods are not suitable when data are sparse. The approach is based on a novel application of string searching algorithms. The evaluations show that MIND can segment words into roots and affixes from the noisy, limited data contained in a dictionary, and that it can extract prefixes, suffixes, circumfixes, and infixes. MIND can also identify morphophonemic changes, i.e., phonemic variations between allomorphs of a morpheme, specifically point-of-affixation stem changes. This, in turn, allows non-native speakers to perform multilingual tasks for applications where response must be rapid and their knowledge is limited. In addition, this analysis can feed other natural language processing tools requiring lexicons.
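
    To illustrate the general idea of string-matching-based affix discovery, here is a toy sketch (not the MIND system itself) that pairs a dictionary root with an inflected variant, splits the variant at their longest common prefix, and reports the remainder as a candidate suffix; the Turkish-like word pair is a hypothetical example chosen to show a point-of-affixation stem change.

```python
# Toy illustration of affix discovery via simple string matching,
# loosely inspired by the segmentation task described above.
# This is NOT the MIND implementation.
import os


def candidate_suffix(root: str, variant: str) -> tuple[str, str]:
    """Split the variant at the longest prefix it shares with the root,
    returning (recovered stem, candidate suffix)."""
    prefix = os.path.commonprefix([root, variant])
    return prefix, variant[len(prefix):]


if __name__ == "__main__":
    # Hypothetical example: root "kitap" (book), inflected variant "kitabı".
    stem, suffix = candidate_suffix("kitap", "kitabı")
    print(f"stem={stem!r}, suffix={suffix!r}")  # stem='kita', suffix='bı'
    # The mismatch between 'kitap' and the recovered stem 'kita' signals a
    # point-of-affixation stem change (p -> b), the kind of morphophonemic
    # variation the framework is designed to detect.
```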

    The European Language Resources and Technologies Forum: Shaping the Future of the Multilingual Digital Europe

    Proceedings of the 1st FLaReNet Forum on European Language Resources and Technologies, held at the Austrian Academy of Sciences in Vienna on 12-13 February 2009.

    Turkic C-type reduplications

    The present book can be viewed as a patchwork of topics relating more or less directly to Turkic reduplications. Many are interconnected and interdependent, which makes it impossible to organize the presentation in a linear way. The thematic division adopted here is only one of the possible groupings, and not necessarily optimal for all tasks. To alleviate this inconvenience, the current chapter first summarizes the whole book following a different thematic division (4.1), and then very briefly recapitulates what I consider to be the most important conclusions (4.2). Some thoughts are expressed more clearly here than in the previous chapters, where they were lost among auxiliary observations.

    Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages

    The NLP community has mainly focused on scaling Large Language Models (LLMs) vertically, i.e., making them better for about 100 languages. We instead scale LLMs horizontally: we create, through continued pretraining, Glot500-m, an LLM that covers 511 predominantly low-resource languages. An important part of this effort is to collect and clean Glot500-c, a corpus that covers these 511 languages and allows us to train Glot500-m. We evaluate Glot500-m on five diverse tasks across these languages. We observe large improvements for both high-resource and low-resource languages compared to an XLM-R baseline. Our analysis shows that no single factor explains the quality of multilingual LLM representations. Rather, a combination of factors determines quality, including corpus size, script, “help” from related languages, and the total capacity of the model. Our work addresses an important goal of NLP research: we should not limit NLP to a small fraction of the world’s languages and instead strive to support as many languages as possible to bring the benefits of NLP technology to all languages and cultures. Code, data and models are available at https://github.com/cisnlp/Glot500
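
    For readers who want to try the released model, here is a minimal sketch using the Hugging Face transformers library for a fill-mask probe. The model identifier cisnlp/glot500-base is an assumption based on the project's public releases; confirm the exact name in the linked repository.

```python
# Minimal sketch of loading Glot500-m with Hugging Face transformers and
# running a fill-in-the-blank probe. The model identifier is an assumption;
# check https://github.com/cisnlp/Glot500 for the exact released name.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "cisnlp/glot500-base"  # assumed identifier
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Predict the masked token in a short sentence.
text = f"Berlin is the capital of {tokenizer.mask_token}."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Locate the mask position and decode the highest-scoring prediction.
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_token = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(top_token))
```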