73 research outputs found

    Evaluating Multiway Multilingual NMT in the Turkic Languages

    Despite the increasing number of large and comprehensive machine translation (MT) systems, evaluation of these methods in various languages has been restrained by the lack of high-quality parallel corpora as well as engagement with the people who speak these languages. In this study, we present an evaluation of state-of-the-art approaches to training and evaluating MT systems in 22 languages from the Turkic language family, most of which are extremely under-explored. First, we adopt the TIL Corpus with a few key improvements to the training and the evaluation sets. Then, we train 26 bilingual baselines as well as a multi-way neural MT (MNMT) model using the corpus and perform an extensive analysis using automatic metrics as well as human evaluations. We find that the MNMT model outperforms almost all bilingual baselines on the out-of-domain test sets, and that fine-tuning the model on the downstream task of a single pair also results in a large performance boost in both low- and high-resource scenarios. Our careful analysis of evaluation criteria for MT models in Turkic languages also points to the necessity for further research in this direction. We release the corpus splits, test sets, and models to the public.
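
    To make the metric-based side of such an evaluation concrete, here is a minimal sketch using the sacrebleu library to score system outputs against references with BLEU and chrF. The file names and language pair are hypothetical placeholders, and this is not the authors' own evaluation pipeline.

```python
# Minimal sketch of automatic MT evaluation with sacrebleu (BLEU and chrF).
# File names and the language pair below are hypothetical placeholders.
import sacrebleu


def score_outputs(hyp_path: str, ref_path: str) -> None:
    with open(hyp_path, encoding="utf-8") as f:
        hypotheses = [line.strip() for line in f]
    with open(ref_path, encoding="utf-8") as f:
        references = [line.strip() for line in f]

    # sacrebleu expects a list of reference streams (one list per reference set).
    bleu = sacrebleu.corpus_bleu(hypotheses, [references])
    chrf = sacrebleu.corpus_chrf(hypotheses, [references])
    print(f"BLEU: {bleu.score:.2f}")
    print(f"chrF: {chrf.score:.2f}")


if __name__ == "__main__":
    score_outputs("mnmt.kaz-uzb.hyp", "test.kaz-uzb.ref")
```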

    Theory and Applications for Advanced Text Mining

    The growth of computer and web technologies makes it easy to collect and store large amounts of text data, and this data can be expected to contain useful knowledge. Text mining techniques have been studied intensively since the late 1990s in order to extract that knowledge. Although many important techniques have been developed, the text mining research field continues to expand to meet needs arising from various application fields. This book is composed of 9 chapters introducing advanced text mining techniques, ranging from relation extraction to the processing of under-resourced languages. I believe that this book will bring new knowledge to the text mining field and help many readers open up new research directions.

    Multiethnic Societies of Central Asia and Siberia Represented in Indigenous Oral and Written Literature

    Central Asia and Siberia are characterized by multiethnic societies formed by a patchwork of often small ethnic groups. At the same time, large parts of these regions have been dominated by state languages, especially Russian and Chinese. At the local level, the languages of the autochthonous peoples often play a role parallel to the central national language. The contributions in these conference proceedings take up questions such as: What was or is collected, and how can it be used under changed conditions in the research landscape? How does it help local ethnic communities to understand and preserve their own culture and language? Do the spatially dispersed but often networked collections support research on the ground? What contribution do these collections make to the local languages and cultures against the backdrop of dwindling attention to endangered groups? These and other questions are discussed against the background of the important role that libraries and private collections play for multiethnic societies in often remote regions that are difficult to reach.

    Resource Generation from Structured Documents for Low-density Languages

    The availability and use of electronic resources for both manual and automated language-related processing has increased tremendously in recent years. Nevertheless, many resources still exist only in printed form, restricting their availability and use. This especially holds true for low-density languages, that is, languages with limited electronic resources. For these documents, automated conversion into electronic resources is highly desirable. This thesis focuses on the semi-automated conversion of printed structured documents (dictionaries in particular) into usable electronic representations. In the first part we present an entry tagging system that recognizes, parses, and tags the entries of a printed dictionary to reproduce its representation. The system uses the consistent layout and structure of the dictionaries, and the features that impose this structure, to capture and recover lexicographic information. We accomplish this by adapting two methods: rule-based and HMM-based. The system is designed to produce results quickly with minimal human assistance and reasonable accuracy. The use of adaptive transformation-based learning as a post-processor at two points in the system yields significant improvements, even with an extremely small amount of user-provided training data. The second part of this thesis presents Morphology Induction from Noisy Data (MIND), a natural language morphology discovery framework that operates on the limited, noisy data obtained from the conversion process. To use the resulting resources effectively, users must be able to search for them using the root form of a morphologically deformed variant found in the text; stemming and data-driven methods are not suitable when data are sparse. The approach is based on a novel application of string searching algorithms. The evaluations show that MIND can segment words into roots and affixes from the noisy, limited data contained in a dictionary, and that it can extract prefixes, suffixes, circumfixes, and infixes. MIND can also identify morphophonemic changes, i.e., phonemic variations between allomorphs of a morpheme, specifically point-of-affixation stem changes. This, in turn, allows non-native speakers to perform multilingual tasks for applications where response must be rapid and their knowledge is limited. In addition, this analysis can feed other natural language processing tools requiring lexicons.
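
    To illustrate the general idea of string-matching-based affix discovery, here is a toy sketch (not the MIND system itself) that pairs a dictionary root with an inflected variant, splits the variant at their longest common prefix, and reports the remainder as a candidate suffix; the Turkish-like word pair is a hypothetical example chosen to show a point-of-affixation stem change.

```python
# Toy illustration of affix discovery via simple string matching,
# loosely inspired by the segmentation task described above.
# This is NOT the MIND implementation.
import os


def candidate_suffix(root: str, variant: str) -> tuple[str, str]:
    """Split the variant at the longest prefix it shares with the root,
    returning (recovered stem, candidate suffix)."""
    prefix = os.path.commonprefix([root, variant])
    return prefix, variant[len(prefix):]


if __name__ == "__main__":
    # Hypothetical example: root "kitap" (book), inflected variant "kitabı".
    stem, suffix = candidate_suffix("kitap", "kitabı")
    print(f"stem={stem!r}, suffix={suffix!r}")  # stem='kita', suffix='bı'
    # The mismatch between 'kitap' and the recovered stem 'kita' signals a
    # point-of-affixation stem change (p -> b), the kind of morphophonemic
    # variation the framework is designed to detect.
```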

    The European Language Resources and Technologies Forum: Shaping the Future of the Multilingual Digital Europe

    Proceedings of the 1st FLaReNet Forum on European Language Resources and Technologies, held at the Austrian Academy of Sciences in Vienna on 12-13 February 2009.

    Turkic C-type reduplications

    The present book can be viewed as a patchwork of topics relating more or less directly to Turkic reduplications. Many are interconnected and interdependent, which makes it impossible to organize the presentation in a linear way. The thematic division adopted here is only one of the possible groupings, and not necessarily optimal for all tasks. To alleviate this inconvenience, the current chapter first summarizes the whole book following a different thematic division (4.1), and then very briefly recapitulates what I consider to be the most important conclusions (4.2). Some thoughts are expressed more clearly here than in the previous chapters, where they were lost among auxiliary observations.

    Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages

    The NLP community has mainly focused on scaling Large Language Models (LLMs) vertically, i.e., making them better for about 100 languages. We instead scale LLMs horizontally: we create, through continued pretraining, Glot500-m, an LLM that covers 511 predominantly low-resource languages. An important part of this effort is to collect and clean Glot500-c, a corpus that covers these 511 languages and allows us to train Glot500-m. We evaluate Glot500-m on five diverse tasks across these languages. We observe large improvements for both high-resource and low-resource languages compared to an XLM-R baseline. Our analysis shows that no single factor explains the quality of multilingual LLM representations. Rather, a combination of factors determines quality, including corpus size, script, “help” from related languages, and the total capacity of the model. Our work addresses an important goal of NLP research: we should not limit NLP to a small fraction of the world’s languages and instead strive to support as many languages as possible to bring the benefits of NLP technology to all languages and cultures. Code, data and models are available at https://github.com/cisnlp/Glot500
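
    For readers who want to try the released model, here is a minimal sketch using the Hugging Face transformers library for a fill-mask probe. The model identifier cisnlp/glot500-base is an assumption based on the project's public releases; confirm the exact name in the linked repository.

```python
# Minimal sketch of loading Glot500-m with Hugging Face transformers and
# running a fill-in-the-blank probe. The model identifier is an assumption;
# check https://github.com/cisnlp/Glot500 for the exact released name.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "cisnlp/glot500-base"  # assumed identifier
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Predict the masked token in a short sentence.
text = f"Berlin is the capital of {tokenizer.mask_token}."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Locate the mask position and decode the highest-scoring prediction.
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_token = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(top_token))
```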