507 research outputs found
Cognate-aware morphological segmentation for multilingual neural translation
This article describes the Aalto University entry to the WMT18 News
Translation Shared Task. We participate in the multilingual subtrack with a
system trained under the constrained condition to translate from English to
both Finnish and Estonian. The system is based on the Transformer model. We
focus on improving the consistency of morphological segmentation for words that
are similar orthographically, semantically, and distributionally; such words
include etymological cognates, loan words, and proper names. For this, we
introduce Cognate Morfessor, a multilingual variant of the Morfessor method. We
show that our approach improves the translation quality particularly for
Estonian, which has fewer resources for training the translation model. Comment: To appear in WMT18
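The consistency problem this paper targets can be illustrated with a toy greedy longest-match segmenter. The vocabularies and the Finnish/Estonian cognate pair below are illustrative only; Cognate Morfessor itself learns segmentations jointly across languages and is not reproduced here:

```python
def segment(word, vocab):
    """Greedy longest-match subword segmentation over a fixed vocabulary.
    Falls back to single characters when no vocabulary piece matches."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):         # longest candidate first
            if word[i:j] in vocab or j == i + 1:  # single-char fallback
                pieces.append(word[i:j])
                i = j
                break
    return pieces

# Independently learned vocabularies can split a cognate pair inconsistently:
print(segment("politiikka", {"poli", "tiikka"}))  # Finnish 'politics'
print(segment("poliitika", {"polii", "tika"}))    # Estonian 'politics'

# A jointly learned vocabulary can keep the shared stem aligned:
joint = {"poli", "tiikka", "tika", "i"}
print(segment("politiikka", joint))  # ['poli', 'tiikka']
print(segment("poliitika", joint))   # ['poli', 'i', 'tika']
```

Greedy longest match is only a stand-in here for Morfessor's probabilistic segmentation; the point is that a shared vocabulary makes orthographically similar words receive parallel splits.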
Findings of the 2019 Conference on Machine Translation (WMT19)
This paper presents the results of the premier shared task organized alongside the Conference on Machine Translation (WMT) 2019.
Participants were asked to build machine translation systems for any of 18 language pairs, to be evaluated on a test set of news stories. The main metric for this task is human judgment of translation quality. The task was also opened up to additional test suites to probe specific aspects of translation.
Transfer learning and subword sampling for asymmetric-resource one-to-many neural translation
There are several approaches for improving neural machine translation for low-resource languages: monolingual data can be exploited via pretraining or data augmentation; parallel corpora on related language pairs can be used via parameter sharing or transfer learning in multilingual models; and subword segmentation and regularization techniques can be applied to ensure high coverage of the vocabulary. We review these approaches in the context of an asymmetric-resource one-to-many translation task, in which the pair of target languages are related, with one being a very low-resource and the other a higher-resource language. We test various methods on three artificially restricted translation tasks: English to Estonian (low-resource) and Finnish (high-resource), English to Slovak and Czech, and English to Danish and Swedish; and on one real-world task, Norwegian to North Sámi and Finnish. The experiments show positive effects especially for scheduled multi-task learning, denoising autoencoders, and subword sampling.
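Subword sampling (subword regularization) trains the model on segmentations drawn at random rather than on a single deterministic split. A minimal sketch, assuming a toy unigram vocabulary with hypothetical log-probabilities; production systems use learned unigram LMs over much larger vocabularies:

```python
import math
import random

def segmentations(word, vocab):
    """Enumerate all segmentations of `word` into pieces from `vocab`."""
    if not word:
        yield []
        return
    for j in range(1, len(word) + 1):
        piece = word[:j]
        if piece in vocab:
            for rest in segmentations(word[j:], vocab):
                yield [piece] + rest

def sample_segmentation(word, piece_logp, alpha=1.0, rng=random):
    """Sample a segmentation with probability proportional to
    exp(alpha * sum of piece log-probabilities), unigram-LM style.
    Lower alpha flattens the distribution (more diverse samples)."""
    cands = list(segmentations(word, piece_logp))
    scores = [alpha * sum(piece_logp[p] for p in seg) for seg in cands]
    m = max(scores)                              # for numerical stability
    weights = [math.exp(s - m) for s in scores]
    return rng.choices(cands, weights=weights, k=1)[0]

# Hypothetical vocabulary with unigram log-probabilities:
logp = {"un": -2.0, "s": -3.0, "een": -2.5, "seen": -2.2, "unseen": -6.0}
random.seed(0)
print(sample_segmentation("unseen", logp))
```

Exposing the model to several plausible splits of the same word is what improves vocabulary coverage for the low-resource target language.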
Machine learning for ancient languages: a survey
Ancient languages preserve the cultures and histories of the past. However, their study is fraught with difficulties, and experts must tackle a range of challenging text-based tasks, from deciphering lost languages to restoring damaged inscriptions, to determining the authorship of works of literature. Technological aids have long supported the study of ancient texts, but in recent years advances in artificial intelligence and machine learning have enabled analyses on a scale and in a detail that are reshaping the humanities, much as microscopes and telescopes have reshaped the sciences. This article aims to provide a comprehensive survey of published research using machine learning for the study of ancient texts written in any language, script, and medium, spanning over three and a half millennia of civilizations around the ancient world. To analyze the relevant literature, we introduce a taxonomy of tasks inspired by the steps involved in the study of ancient documents: digitization, restoration, attribution, linguistic analysis, textual criticism, translation, and decipherment. This work offers three major contributions: first, mapping the interdisciplinary field carved out by the synergy between the humanities and machine learning; second, highlighting how active collaboration between specialists from both fields is key to producing impactful and compelling scholarship; third, identifying promising directions for future work in this field. Thus, this work promotes and supports the continued collaborative impetus between the humanities and machine learning.
PersoNER: Persian named-entity recognition
Named-Entity Recognition (NER) is still a challenging task for languages with low digital resources. The main difficulties arise from the scarcity of annotated corpora and the consequent difficulty of training an effective NER pipeline. To bridge this gap, in this paper we target the Persian language, which is spoken by a population of over a hundred million people worldwide. We first present and release ArmanPersoNERCorpus, the first manually annotated Persian NER corpus. Then, we introduce PersoNER, an NER pipeline for Persian that leverages word embeddings and a sequential max-margin classifier. The experimental results show that the proposed approach achieves competitive MUC7 and CoNLL scores while outperforming two alternatives based on a CRF and a recurrent neural network.
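CoNLL-style NER evaluation, one of the two scoring schemes the paper reports, credits only exact entity matches: a predicted span counts as correct when both its boundaries and its type agree with the gold annotation. A minimal sketch over (start, end, type) spans:

```python
def entity_f1(gold, pred):
    """Entity-level precision/recall/F1 over (start, end, type) spans,
    CoNLL-style: only exact span-and-type matches count as true positives."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    prec = tp / len(pred) if pred else 0.0
    rec = tp / len(gold) if gold else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

gold = [(0, 2, "PER"), (5, 6, "LOC")]
pred = [(0, 2, "PER"), (5, 6, "ORG")]  # right span, wrong type -> no credit
print(entity_f1(gold, pred))  # (0.5, 0.5, 0.5)
```

The MUC7 scheme, by contrast, gives partial credit for boundary-only or type-only matches, which is why papers often report both.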
On the Correlation of Context-Aware Language Models With the Intelligibility of Polish Target Words to Czech Readers
This contribution seeks to provide a rational probabilistic explanation for the intelligibility
of words in a genetically related language that is unknown to the reader, a phenomenon
referred to as intercomprehension. In this research domain, linguistic distance, among
other factors, has been shown to correlate well with the mutual intelligibility of individual words.
However, the role of context in the intelligibility of target words in sentences has been the
subject of only a few studies. To address this, we analyze data from web-based experiments in
which Czech (CS) respondents were asked to translate highly predictable target words at
the final position of Polish sentences. We compare correlations of target word intelligibility
with data from 3-gram language models (LMs) to their correlations with data obtained from
context-aware LMs. More specifically, we evaluate two context-aware LM architectures:
Long Short-Term Memory networks (LSTMs), which can in theory take arbitrarily long-distance
dependencies into account, and Transformer-based LMs, which can access the whole
input sequence at the same time. We investigate how their use of context affects surprisal
and its correlation with intelligibility.
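Surprisal under an n-gram LM, as in the paper's 3-gram baseline, is simply the negative log-probability of the target word given its preceding context. A minimal add-k-smoothed sketch; the corpus and vocabulary size below are toy values, and the paper's models are trained on far larger data:

```python
import math
from collections import Counter

def trigram_surprisal(corpus, context, word, vocab_size, k=1.0):
    """Surprisal -log2 P(word | context) under an add-k smoothed trigram LM.
    `context` is the pair of tokens immediately preceding `word`."""
    tri = Counter(zip(corpus, corpus[1:], corpus[2:]))
    bi = Counter(zip(corpus, corpus[1:]))
    num = tri[(context[0], context[1], word)] + k
    den = bi[(context[0], context[1])] + k * vocab_size
    return -math.log2(num / den)

corpus = "the cat sat on the mat the cat sat on the hat".split()
s = trigram_surprisal(corpus, ("sat", "on"), "the", vocab_size=6)
print(round(s, 2))  # a frequent continuation gets low surprisal (few bits)
```

Context-aware LMs replace the fixed two-token window with the full sentence prefix, which is exactly the difference whose effect on the intelligibility correlation the study measures.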