177 research outputs found
Evaluating Multiway Multilingual NMT in the Turkic Languages
Despite the increasing number of large and comprehensive machine translation (MT) systems, evaluation of these methods in various languages has been restrained by the lack of high-quality parallel corpora as well as engagement with the people who speak these languages. In this study, we present an evaluation of state-of-the-art approaches to training and evaluating MT systems in 22 languages from the Turkic language family, most of which are extremely under-explored. First, we adopt the TIL Corpus with a few key improvements to the training and evaluation sets. Then, we train 26 bilingual baselines as well as a multi-way neural MT (MNMT) model using the corpus and perform an extensive analysis using automatic metrics as well as human evaluations. We find that the MNMT model outperforms almost all bilingual baselines on the out-of-domain test sets, and that fine-tuning the model on the downstream task of a single pair also results in a large performance boost in both low- and high-resource scenarios. Our attentive analysis of evaluation criteria for MT models in Turkic languages also points to the necessity of further research in this direction. We release the corpus splits, test sets, and models to the public. (Peer reviewed)
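The automatic metrics mentioned in this abstract typically include BLEU (or spBLEU). For context, here is a minimal, self-contained sketch of corpus-level BLEU, with clipped n-gram precisions and a brevity penalty; it is an illustration only, not the evaluation code used in the study.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0-100): clipped n-gram precision + brevity penalty."""
    matches = [0] * max_n   # clipped n-gram matches per order
    totals = [0] * max_n    # candidate n-gram counts per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp, ref = hyp.split(), ref.split()
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            h, r = ngrams(hyp, n), ngrams(ref, n)
            # clip each candidate n-gram count by its count in the reference
            matches[n - 1] += sum(min(c, r[g]) for g, c in h.items())
            totals[n - 1] += max(len(hyp) - n + 1, 0)
    if min(matches) == 0:       # any zero precision makes plain BLEU zero
        return 0.0
    log_prec = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_prec)
```

Production evaluations would instead use a standard implementation (e.g. sacreBLEU) so that tokenization and smoothing are comparable across papers.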
Bridging the Domain Gap for Stance Detection for the Zulu language
Misinformation has become a major concern in recent years given its spread
across our information sources. Many NLP tasks have been introduced in this
area, with some systems reaching good results on English-language datasets.
Existing AI-based approaches to fighting misinformation in the literature
suggest automatic stance detection as an integral first step to success. Our
paper aims to utilize this progress made for English by transferring that
knowledge to other languages, which is a non-trivial task due to the domain
gap between English and the target languages. We propose a black-box,
non-intrusive method that uses techniques from domain adaptation to reduce
the domain gap, without requiring any human expertise in the target language,
by leveraging low-quality data in both a supervised and an unsupervised
manner. This allows us to rapidly achieve results for stance detection in
Zulu, the target language in this work, similar to those found for English.
We also provide a stance detection dataset in the Zulu language. Our
experimental results show that by leveraging English datasets and machine
translation we can increase performance on both English data and data in
other languages. Comment: accepted to IntelliSys
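The translate-then-classify idea behind this line of work can be sketched in miniature. Everything below (the stance cue lists and the toy Zulu-to-English word table) is invented for illustration; the table stands in for a real MT system and the keyword scorer for a trained English stance classifier.

```python
# Toy illustration: classify stance with an "English-trained" model, bridging
# target-language input through word-level translation. All entries invented.
STANCE_CUES = {
    "agree": {"support", "true", "confirm"},
    "disagree": {"false", "deny", "refute"},
}

# Stand-in for machine translation: a tiny Zulu->English lookup (hypothetical).
TOY_ZU_EN = {"iqiniso": "true", "amanga": "false", "ukusekela": "support"}

def translate(tokens):
    """Word-by-word lookup; unknown words pass through unchanged."""
    return [TOY_ZU_EN.get(t, t) for t in tokens]

def classify_stance(tokens):
    """Keyword-count 'classifier'; falls back to 'discuss' with no cues."""
    scores = {label: sum(t in cues for t in tokens)
              for label, cues in STANCE_CUES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "discuss"

def classify_zulu(sentence):
    return classify_stance(translate(sentence.lower().split()))
```

The real method additionally adapts the classifier across the English/Zulu domain gap; this sketch shows only the translation bridge.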
Hierarchical Character-Word Models for Language Identification
Social media messages' brevity and unconventional spelling pose a challenge
to language identification. We introduce a hierarchical model that learns
character and contextualized word-level representations for language
identification. Our method performs well against strong baselines, and can
also reveal code-switching.
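For context, a much simpler baseline than the hierarchical model described above is a character n-gram profile classifier. The sketch below illustrates that baseline only, not the paper's architecture, and the training snippets used with it are arbitrary examples.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Character n-grams with light space padding at the boundaries."""
    text = f"  {text.lower()}  "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

class NgramLangID:
    """Language ID by character n-gram profile overlap (a simple baseline)."""
    def __init__(self, n=3):
        self.n = n
        self.profiles = {}

    def fit(self, labeled_texts):
        for lang, text in labeled_texts:
            prof = self.profiles.setdefault(lang, Counter())
            prof.update(char_ngrams(text, self.n))

    def predict(self, text):
        grams = char_ngrams(text, self.n)
        def overlap(lang):
            prof = self.profiles[lang]
            return sum(min(c, prof[g]) for g, c in grams.items())
        return max(self.profiles, key=overlap)
```

Such a baseline has no notion of word context, which is precisely what the hierarchical character-word model adds.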
Enhancing Bi-directional English-Tigrigna Machine Translation Using Hybrid Approach
Machine Translation (MT) is an application area of NLP in which automatic systems are used to translate text or speech from one language to another while preserving the meaning of the source language. Although there exists a large volume of literature on automatic machine translation of documents in many languages, translation between English and Tigrigna is less explored. We therefore propose a hybrid approach to address the challenge of applying syntactic reordering rules, which align and capture the structural arrangement of words in the source sentence to make it more like the target sentence. Two language models were developed, one for English and one for Tigrigna, and about 12,000 parallel sentences in four domains and 32,000 bilingual dictionary entries were collected for our experiment. The collected parallel corpus was split randomly into 10,800 sentences for the training set and 1,200 sentences for the test set. The Moses open-source statistical machine translation system was used to train, tune, and decode. The parallel corpus was aligned using the Giza++ toolkit, and SRILM was used for building the language models. Three main experiments were conducted, using a statistical approach, a hybrid approach, and a post-processing technique. Our experimental results showed good translation output, as high as 32.64 BLEU points when compared against the Google translator, and the hybrid approach was found most promising for English-Tigrigna bidirectional translation.
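The Giza++ toolkit used in this pipeline implements the IBM alignment models. As background, a minimal expectation-maximization sketch of IBM Model 1, which estimates lexical translation probabilities t(f|e), can be written on toy German-English pairs (classic textbook examples, not the Tigrigna data used in the paper):

```python
from collections import defaultdict

def ibm_model1(sentence_pairs, iterations=10):
    """EM training of IBM Model 1 probabilities t(f|e).

    sentence_pairs: list of (foreign_tokens, english_tokens).
    Returns a dict mapping (f, e) -> probability estimate.
    """
    f_vocab = {f for fs, _ in sentence_pairs for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))  # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)   # expected co-occurrence counts
        total = defaultdict(float)   # normalizers per English word
        for fs, es in sentence_pairs:
            for f in fs:
                z = sum(t[(f, e)] for e in es)  # normalize over this sentence
                for e in es:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        for (f, e) in count:         # M-step: renormalize
            t[(f, e)] = count[(f, e)] / total[e]
    return t
```

After a few iterations on overlapping pairs, the probability mass concentrates on the correct word correspondences; Giza++ builds on this with the higher IBM models and HMM alignment.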
Computational Linguistics and Uyghur Language Research
This article briefly introduces computational linguistics and summarizes current computational-linguistics research on Uyghur. With advances in technology, computer-aided work on different languages has achieved great success; for example, applications such as content management for texts, information retrieval, speech systems, document clustering, text mining, spell checking, text-to-speech, speech-to-text, and automatic (machine) translation between languages have been developed and are in real-world use. Although many studies have been carried out on some languages of the Ural-Altaic group, such as Finnish, Japanese, Hungarian, and Turkish, the work done on certain other languages, such as Uyghur, remains little known. To advance research in computational linguistics and to enable analysis of the relationships between different languages, this article surveys the computer-aided research done on Uyghur, in particular the most recent foundational work on machine translation. The relationship between linguists and computational linguistics is also analyzed.
Lego-MT: Towards Detachable Models in Massively Multilingual Machine Translation
Multilingual neural machine translation (MNMT) aims to build a unified model
for many language directions. Existing monolithic models for MNMT encounter two
challenges: parameter interference among languages and inefficient inference
for large models. In this paper, we revisit the classic multi-way structures
and develop a detachable model by assigning each language (or group of
languages) to an individual branch that supports plug-and-play training and
inference. To address the needs of learning representations for all languages
in a unified space, we propose a novel efficient training recipe, upon which we
build an effective detachable model, Lego-MT. For a fair comparison, we collect
data from OPUS and build a translation benchmark covering 433 languages and
1.3B parallel sentence pairs. Experiments show that Lego-MT with 1.2B
parameters brings an average gain of 3.2 spBLEU. It even outperforms M2M-100
with 12B parameters. The proposed training recipe brings a 28.2× speedup over
the conventional multi-way training method (code:
https://github.com/CONE-MT/Lego-MT). Comment: ACL 2023 Findings
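The plug-and-play branch idea can be caricatured with a toy "shared concept space": each language owns an encoder/decoder branch mapping into and out of that space, and serving one translation direction only requires plugging in two branches. The concept table below is invented for illustration and bears no relation to Lego-MT's actual parameters or training recipe.

```python
# Toy detachable-branch sketch. A real system maps into a learned continuous
# space; here the "shared space" is just integer concept ids (invented).
CONCEPTS = {
    ("en", "house"): 0, ("en", "book"): 1,
    ("de", "haus"): 0, ("de", "buch"): 1,
}

class Branch:
    """Per-language encoder/decoder pair over the shared concept space."""
    def __init__(self, lang):
        self.lang = lang
        self.enc = {w: c for (l, w), c in CONCEPTS.items() if l == lang}
        self.dec = {c: w for w, c in self.enc.items()}

    def encode(self, tokens):
        return [self.enc[t] for t in tokens]

    def decode(self, concepts):
        return [self.dec[c] for c in concepts]

def translate(text, src_branch, tgt_branch):
    """Only the two plugged-in branches are needed for this direction."""
    return " ".join(tgt_branch.decode(src_branch.encode(text.split())))
```

The point of the caricature: adding a language means adding one branch, and inference for any direction loads only two branches rather than one monolithic model.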
When Is Multilinguality a Curse? Language Modeling for 250 High- and Low-Resource Languages
Multilingual language models are widely used to extend NLP systems to
low-resource languages. However, concrete evidence for the effects of
multilinguality on language modeling performance in individual languages
remains scarce. Here, we pre-train over 10,000 monolingual and multilingual
language models for over 250 languages, including multiple language families
that are under-studied in NLP. We assess how language modeling performance in
each language varies as a function of (1) monolingual dataset size, (2) added
multilingual dataset size, (3) linguistic similarity of the added languages,
and (4) model size (up to 45M parameters). We find that in moderation, adding
multilingual data improves low-resource language modeling performance, similar
to increasing low-resource dataset sizes by up to 33%. Improvements depend on
the syntactic similarity of the added multilingual data, with marginal
additional effects of vocabulary overlap. However, high-resource languages
consistently perform worse in multilingual pre-training scenarios. As dataset
sizes increase, adding multilingual data begins to hurt performance for both
low-resource and high-resource languages, likely due to limited model capacity
(the "curse of multilinguality"). These results suggest that massively
multilingual pre-training may not be optimal for any languages involved, but
that more targeted models can significantly improve performance.