Domain adaptation: Retraining NMT with translation memories
The topic of this thesis is domain adaptation of an NMT system by retraining it with translation memories. The translation memory used in the experiments is the EMEA corpus, which consists of medical texts, mostly package leaflets. The NMT system used in the experiments is OpenNMT, chosen because it is completely free and easy to use.
The goal of this thesis is to find out how an NMT system can be adapted to a special domain, and whether the translation quality improves after domain adaptation. The original plan was to continue training the pretrained OpenNMT model with EMEA data, but this turned out not to be possible. Therefore, it was necessary to train a new baseline model with the same data the pretrained model was trained on. After this, two domain adaptation methods are tested: continuation training with EMEA data and continuation training with unknown terms.
In the manual evaluation, it turned out that domain adaptation with unknown terms worsens the translation quality drastically because all sentences are translated as single words. This method is only suitable for translating wordlists, although it did improve the translation of unknown terms. Domain adaptation with EMEA data, on the other hand, improves the translation quality significantly. The EMEA-retrained system translates long sentences and medical terms much better than the pretrained and baseline models. Long and complicated terms are still difficult to translate, but the EMEA-retrained model makes fewer errors than the other models.
The evaluation metrics used for automatic evaluation are BLEU and LeBLEU. BLEU is stricter than LeBLEU. The results are similar to those of the manual evaluation: the EMEA-retrained model translates medical texts much better than the other models, and the translation quality of the UNK-retrained model is the worst of all.
It can be presumed that an NMT system needs contextual information in order to learn to translate terms and long sentences without reducing the text to a wordlist. In addition, long terms appear to be translated in smaller pieces, so the NMT system may translate some pieces incorrectly, which makes the whole term wrong.
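The automatic evaluation above relies on BLEU; as a rough illustration of what that metric measures, here is a minimal sentence-level BLEU sketch (modified n-gram precision with a brevity penalty), not the implementation used in the thesis:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(reference, hypothesis, max_n=4):
    """Sentence-level BLEU: geometric mean of modified n-gram
    precisions (n = 1..max_n) times a brevity penalty."""
    ref, hyp = reference.split(), hypothesis.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hyp, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clip hypothesis counts by reference counts ("modified" precision).
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(sum(hyp_counts.values()), 1)
        # Floor zero counts so the geometric mean stays defined.
        log_precisions.append(math.log(max(overlap, 1e-9) / total))
    # Brevity penalty punishes hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(sum(log_precisions) / max_n)
```

A perfect match scores 1.0, and any word dropped from a medical term lowers every n-gram precision, which is why BLEU behaves "strictly" on partially correct terms.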
Language Model Bootstrapping Using Neural Machine Translation For Conversational Speech Recognition
Building conversational speech recognition systems for new languages is
constrained by the availability of utterances that capture user-device
interactions. Data collection is both expensive and limited by the speed of
manual transcription. In order to address this, we advocate the use of neural
machine translation as a data augmentation technique for bootstrapping language
models. Machine translation (MT) offers a systematic way of incorporating
collections from mature, resource-rich conversational systems that may be
available for a different language. However, ingesting raw translations from a
general-purpose MT system may not be effective owing to the presence of named
entities, intra-sentential code-switching, and the domain mismatch between the
conversational data being translated and the parallel text used for MT
training. To circumvent this, we explore the following domain adaptation
techniques: (a) sentence embedding based data selection for MT training, (b)
model finetuning, and (c) rescoring and filtering translated hypotheses. Using
Hindi as the experimental testbed, we translate US English utterances to
supplement the transcribed collections. We observe a relative word error rate
reduction of 7.8-15.6%, depending on the bootstrapping phase. Fine grained
analysis reveals that translation particularly aids the interaction scenarios
which are underrepresented in the transcribed data. Comment: Accepted by IEEE ASRU workshop, 201
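Technique (a) above, sentence-embedding based data selection, can be illustrated with a toy sketch. Bag-of-words count vectors stand in for the neural sentence embeddings the paper uses, and all function names here are invented for illustration:

```python
import math
from collections import Counter

def bow_vector(sentence):
    """Toy stand-in for a sentence embedding: a bag-of-words count vector."""
    return Counter(sentence.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def select_for_domain(candidates, in_domain, top_k):
    """Rank candidate sentences by similarity to an in-domain centroid
    and keep the top_k closest ones for MT training."""
    centroid = Counter()
    for s in in_domain:
        centroid.update(bow_vector(s))
    ranked = sorted(candidates,
                    key=lambda s: cosine(bow_vector(s), centroid),
                    reverse=True)
    return ranked[:top_k]
```

The idea is that only parallel text resembling conversational utterances is kept for MT training, reducing the domain mismatch the abstract describes.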
Desafíos de traducción en la localización de aplicaciones web
This preliminary study aims at exploring the nature of the challenges that translators face when they take on a localization project for a web application. Given that localization is an activity constrained by time, process and economic resources, translators need to draw on their full skill set to overcome the various challenges imposed by the source text and by the localization process itself. For the purposes of this study, an ad hoc monolingual English corpus composed of the user-interface strings of web applications has been used. Since multiple types of challenges arise in a localization project of this nature, this paper focuses on those related to internationalization practices and to the constraints imposed by the translation-memory segmentation process. Although localization is a mature field and many guidelines and best practices are available to content creators and tool developers, this qualitative study finds that localizers can still suffer the consequences of deficient internationalization practices and non-ergonomic translation tools.
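One of the constraints discussed above, translation-memory segmentation, can be illustrated with a deliberately naive segmenter; real CAT tools use configurable rule sets (e.g. SRX), but even correct rules can split a UI string away from the context a translator needs:

```python
import re

def segment(text):
    """Naive TM-style segmentation: split after sentence-final
    punctuation. Real CAT tools apply configurable rules (e.g. SRX)."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

# A UI string with a placeholder: segmentation yields two segments,
# so the placeholder and its follow-up instruction are translated
# in isolation from each other.
parts = segment("File {0} could not be saved. Please try again.")
```

Each segment becomes an independent TM unit, which is one way the segmentation process imposes the constraints the study describes.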
Tamil-Llama: A New Tamil Language Model Based on Llama 2
Language modeling has witnessed remarkable advancements in recent years, with
Large Language Models (LLMs) like ChatGPT setting unparalleled benchmarks in
human-like text generation. However, a prevailing limitation is the
underrepresentation of languages like Tamil in these cutting-edge models,
leading to suboptimal performance in diverse linguistic contexts. This paper
addresses this lacuna, enhancing the open-source LLaMA model with an addition
of 16,000 Tamil tokens, aiming to achieve superior text generation and
comprehension in the Tamil language. We strategically employ the LoRA
methodology for efficient model training on a comprehensive Tamil corpus,
ensuring computational feasibility and model robustness. Moreover, we introduce
a Tamil-translated version of the Alpaca dataset and a subset of the OpenOrca
dataset tailored for instruction fine-tuning. Our results showcase significant
performance improvements in Tamil text generation, with potential implications
for the broader landscape of LLMs in Indian languages. We further underscore
our commitment to open research by making our models, datasets, and code
publicly accessible, fostering further innovations in language modeling. Comment: 19 pages, 10 figures
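The LoRA methodology mentioned above trains only a low-rank additive update to each frozen weight matrix, which is what makes adaptation on a large Tamil corpus computationally feasible. A minimal, dependency-free sketch of the idea (not the authors' implementation, which would typically use a library such as PEFT):

```python
import random

def matmul(A, B):
    """Multiply two matrices represented as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

class LoRALinear:
    """Sketch of a LoRA-adapted linear layer: the frozen weight W is
    augmented with a trainable low-rank update (alpha/r) * B @ A, so
    only r * (d_in + d_out) parameters are trained instead of d_in * d_out."""
    def __init__(self, W, r, alpha):
        self.W = W                                   # frozen pretrained weight, d_out x d_in
        d_out, d_in = len(W), len(W[0])
        # A gets a small random init; B starts at zero so the adapted
        # layer initially computes exactly the pretrained function.
        self.A = [[random.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(r)]
        self.B = [[0.0] * r for _ in range(d_out)]
        self.scale = alpha / r

    def forward(self, x):                            # x: column vector, d_in x 1
        base = matmul(self.W, x)
        delta = matmul(self.B, matmul(self.A, x))
        return [[b[0] + self.scale * d[0]] for b, d in zip(base, delta)]
```

Because B is zero-initialized, training starts from the pretrained model's behavior and only gradually learns the Tamil-specific update.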
Current trends in multilingual speech processing
In this paper, we describe recent work at Idiap Research Institute in the domain of multilingual speech processing and provide some insights into emerging challenges for the research community. Multilingual speech processing has been a topic of ongoing interest to the research community for many years and the field is now receiving renewed interest owing to two strong driving forces. Firstly, technical advances in speech recognition and synthesis are posing new challenges and opportunities to researchers. For example, discriminative features are seeing wide application by the speech recognition community, but additional issues arise when using such features in a multilingual setting. Another example is the apparent convergence of speech recognition and speech synthesis technologies in the form of statistical parametric methodologies. This convergence enables the investigation of new approaches to unified modelling for automatic speech recognition and text-to-speech synthesis (TTS) as well as cross-lingual speaker adaptation for TTS. The second driving force is the impetus being provided by both government and industry for technologies to help break down domestic and international language barriers, these also being barriers to the expansion of policy and commerce. Speech-to-speech and speech-to-text translation are thus emerging as key technologies at the heart of which lies multilingual speech processing.
Smooth inverse frequency based text data selection for medical dictation
The under-resourced domain problem is significant in automatic speech recognition, especially for small languages such as Hungarian, and in fields where data is often confidential, such as finance and medicine. We introduce a method using word embeddings and smooth inverse frequency (SIF) based distance measurement to filter public-domain web corpora. The selection of (medical) domain-matching documents can be scaled. The resulting text is used to train an augmented language model for a medical dictation system. We show that using the appropriately scaled selection leads to optimal performance of the ASR system over baselines where no data augmentation was applied or where all the augmentation data was added.
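The SIF weighting at the core of this approach can be sketched as follows. The formula a/(a + p(w)) follows Arora et al.'s smooth inverse frequency scheme; the corpus, the parameter value, and the helper names here are illustrative, and the principal-component removal step of full SIF is omitted:

```python
from collections import Counter

def sif_weights(corpus_tokens, a=1e-3):
    """Smooth inverse frequency weights: w(word) = a / (a + p(word)),
    where p(word) is the word's unigram probability in the corpus.
    Frequent function words get weights near 0, rare content words near 1."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    return {w: a / (a + c / total) for w, c in counts.items()}

def sif_embedding(sentence_tokens, word_vectors, weights):
    """SIF sentence embedding: weighted average of word vectors.
    The full method also removes the first principal component,
    which is omitted in this sketch."""
    dims = len(next(iter(word_vectors.values())))
    acc, n = [0.0] * dims, 0
    for w in sentence_tokens:
        if w in word_vectors:
            n += 1
            for i, v in enumerate(word_vectors[w]):
                acc[i] += weights.get(w, 1.0) * v
    return [x / n for x in acc] if n else acc
```

Documents whose SIF embeddings lie close to the medical in-domain centroid are then kept, and the distance threshold is what makes the selection scalable.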
A Multilingual Parallel Corpora Collection Effort for Indian Languages
We present sentence-aligned parallel corpora across 10 Indian languages -
Hindi, Telugu, Tamil, Malayalam, Gujarati, Urdu, Bengali, Oriya, Marathi,
Punjabi, and English - many of which are categorized as low resource. The
corpora are compiled from online sources which have content shared across
languages. The corpora presented significantly extend existing resources that
are either not large enough or are restricted to a specific domain (such as
health). We also provide a separate test corpus compiled from an independent
online source that can be independently used for validating the performance in
10 Indian languages. Alongside, we report on the methods of constructing such
corpora using tools enabled by recent advances in machine translation and
cross-lingual retrieval using deep neural network-based methods. Comment: 9 pages. Accepted in LREC 202