
    Domain adaptation: Retraining NMT with translation memories

    The topic of this thesis is domain adaptation of an NMT system by retraining it with translation memories. The translation memory used in the experiments is the EMEA corpus, which consists of medical texts – mostly package leaflets. The NMT system used in the experiments is OpenNMT because it is completely free and easy to use. The goal of this thesis is to find out how an NMT system can be adapted to a special domain and whether the translation quality improves after domain adaptation. The original plan was to continue training the pretrained model of OpenNMT with EMEA data, but this is not possible. Therefore, it is necessary to train a new baseline model with the same data that the pretrained model was trained with. After this, two domain adaptation methods are tested: continuation training with EMEA data and continuation training with unknown terms. In the manual evaluation, it turned out that domain adaptation with unknown terms worsens the translation quality drastically because all sentences are translated as single words. This method is only suitable for translating wordlists, because it did improve the translation of unknown terms. Domain adaptation with EMEA data, on the other hand, improves the translation quality significantly. The EMEA-retrained system translates long sentences and medical terms much better than the pretrained and baseline models. Long and complicated terms are still difficult to translate, but the EMEA-retrained model makes fewer errors than the other models. The evaluation metrics used for automatic evaluation are BLEU and LeBLEU; BLEU is stricter than LeBLEU. The results are similar to those of the manual evaluation: the EMEA-retrained model translates medical texts much better than the other models, and the translation quality of the UNK-retrained model is the worst of all. It can be presumed that an NMT system needs contextual information in order to learn to translate terms and long sentences, so the training text should not be reduced to a wordlist without sentence context. In addition, it seems that long terms are translated in smaller pieces, so the NMT system may translate some pieces wrong, which results in the whole term being wrong.
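    As a rough illustration of the automatic evaluation step described above, the snippet below computes corpus-level BLEU for two hypothetical system outputs using the sacrebleu package; the package choice, the example sentences, and the score comparison are assumptions for illustration and are not taken from the thesis (LeBLEU is not covered here).

```python
# Hedged sketch: compare two hypothetical systems (baseline vs. EMEA-retrained)
# with corpus-level BLEU via sacrebleu. All data below is invented.
import sacrebleu

baseline_out = ["Take one tablet two times a day .", "The capsule is swallowed whole ."]
retrained_out = ["Take one tablet twice daily .", "Swallow the capsule whole ."]
references = ["Take one tablet twice daily .", "Swallow the capsule whole ."]

for name, hyps in [("baseline", baseline_out), ("EMEA-retrained", retrained_out)]:
    # corpus_bleu takes the hypotheses and a list of reference streams.
    bleu = sacrebleu.corpus_bleu(hyps, [references])
    print(f"{name}: BLEU = {bleu.score:.1f}")
```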

    Language Model Bootstrapping Using Neural Machine Translation For Conversational Speech Recognition

    Building conversational speech recognition systems for new languages is constrained by the availability of utterances that capture user-device interactions. Data collection is both expensive and limited by the speed of manual transcription. In order to address this, we advocate the use of neural machine translation as a data augmentation technique for bootstrapping language models. Machine translation (MT) offers a systematic way of incorporating collections from mature, resource-rich conversational systems that may be available for a different language. However, ingesting raw translations from a general-purpose MT system may not be effective owing to the presence of named entities, intra-sentential code-switching, and the domain mismatch between the conversational data being translated and the parallel text used for MT training. To circumvent this, we explore the following domain adaptation techniques: (a) sentence embedding based data selection for MT training, (b) model finetuning, and (c) rescoring and filtering translated hypotheses. Using Hindi as the experimental testbed, we translate US English utterances to supplement the transcribed collections. We observe a relative word error rate reduction of 7.8-15.6%, depending on the bootstrapping phase. Fine-grained analysis reveals that translation particularly aids the interaction scenarios which are underrepresented in the transcribed data. Comment: Accepted by IEEE ASRU workshop, 2019
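    A minimal sketch of technique (a), sentence-embedding-based data selection, is given below: candidate MT training sentences are kept if their embedding lies close to the centroid of in-domain conversational data. The embedding model, example utterances, and similarity threshold are assumptions rather than the paper's actual setup.

```python
# Hedged sketch of embedding-based data selection for MT training.
# The model name and threshold are placeholders, not the paper's choices.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

in_domain = ["play some music", "what's the weather tomorrow"]            # conversational data
candidates = ["play the next song", "quarterly earnings grew last year"]  # general-purpose MT data

centroid = model.encode(in_domain).mean(axis=0)
cand_vecs = model.encode(candidates)

# Cosine similarity of each candidate sentence to the in-domain centroid.
sims = cand_vecs @ centroid / (np.linalg.norm(cand_vecs, axis=1) * np.linalg.norm(centroid))

threshold = 0.5  # assumed cut-off
selected = [s for s, sim in zip(candidates, sims) if sim >= threshold]
print(selected)
```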

    Desafíos de traducción en la localización de aplicaciones web

    This preliminary study aims at exploring the nature of the challenges that translators face when they take on a localization project for a web application. Taking into account that localization is an activity constrained by time, process, and economic resources, translators need to make use of their full skill set to overcome the various challenges imposed by the source text and the localization process itself. For the purpose of this study, an ad hoc monolingual English corpus composed of the user interface strings of web applications has been used. Since multiple types of challenges are found in a localization project of this nature, this paper focuses on those related to internationalization practices and to the constraints imposed by the translation memory segmentation process. Although localization is a mature field and a great many guidelines and best practices are available to content creators and tool developers, this qualitative study finds that localizers can still suffer the consequences of deficient internationalization practices and non-ergonomic translation tools.
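    As a hypothetical illustration of the segmentation constraint mentioned above, the snippet below applies naive sentence-based segmentation to a UI string that contains a placeholder; each resulting segment would typically be stored and translated in isolation, so the translator of the second segment loses the context established by the first. The string and the segmentation rule are invented for illustration and are not taken from the study's corpus.

```python
# Hedged illustration: naive sentence segmentation of a UI string with a placeholder.
import re

ui_string = "File {0} could not be saved. Check your permissions and try again."

# Split after sentence-final punctuation, as a simple TM segmenter might.
segments = re.split(r"(?<=[.!?])\s+", ui_string)
for i, segment in enumerate(segments, start=1):
    print(f"segment {i}: {segment}")
# segment 1: File {0} could not be saved.
# segment 2: Check your permissions and try again.
```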

    Tamil-Llama: A New Tamil Language Model Based on Llama 2

    Language modeling has witnessed remarkable advancements in recent years, with Large Language Models (LLMs) like ChatGPT setting unparalleled benchmarks in human-like text generation. However, a prevailing limitation is the underrepresentation of languages like Tamil in these cutting-edge models, leading to suboptimal performance in diverse linguistic contexts. This paper addresses this lacuna by enhancing the open-source LLaMA model with the addition of 16,000 Tamil tokens, aiming to achieve superior text generation and comprehension in the Tamil language. We strategically employ the LoRA methodology for efficient model training on a comprehensive Tamil corpus, ensuring computational feasibility and model robustness. Moreover, we introduce a Tamil-translated version of the Alpaca dataset and a subset of the OpenOrca dataset tailored for instruction fine-tuning. Our results showcase significant performance improvements in Tamil text generation, with potential implications for the broader landscape of LLMs in Indian languages. We further underscore our commitment to open research by making our models, datasets, and code publicly accessible, fostering further innovations in language modeling. Comment: 19 pages, 10 figures
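    A minimal sketch of the overall recipe (vocabulary extension plus LoRA adapters) is shown below using the transformers and peft libraries; the base checkpoint, the token list, and the LoRA hyperparameters are placeholders and do not reproduce the authors' actual configuration or training loop.

```python
# Hedged sketch: extend a LLaMA tokenizer with extra Tamil tokens and attach
# LoRA adapters for parameter-efficient training. Values below are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"  # assumed base checkpoint (gated on the Hub)
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Stand-in for the ~16,000 Tamil subword tokens added by the authors.
new_tamil_tokens = ["தமிழ்", "மொழி"]  # illustrative only
tokenizer.add_tokens(new_tamil_tokens)
model.resize_token_embeddings(len(tokenizer))

# Typical LoRA settings; r, alpha and target modules are not the paper's values.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()
```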

    Current trends in multilingual speech processing

    In this paper, we describe recent work at Idiap Research Institute in the domain of multilingual speech processing and provide some insights into emerging challenges for the research community. Multilingual speech processing has been a topic of ongoing interest to the research community for many years and the field is now receiving renewed interest owing to two strong driving forces. Firstly, technical advances in speech recognition and synthesis are posing new challenges and opportunities to researchers. For example, discriminative features are seeing wide application by the speech recognition community, but additional issues arise when using such features in a multilingual setting. Another example is the apparent convergence of speech recognition and speech synthesis technologies in the form of statistical parametric methodologies. This convergence enables the investigation of new approaches to unified modelling for automatic speech recognition and text-to-speech synthesis (TTS) as well as cross-lingual speaker adaptation for TTS. The second driving force is the impetus being provided by both government and industry for technologies to help break down domestic and international language barriers, these also being barriers to the expansion of policy and commerce. Speech-to-speech and speech-to-text translation are thus emerging as key technologies at the heart of which lies multilingual speech processing.

    Smooth inverse frequency based text data selection for medical dictation

    The under-resourced domain problem is significant in automatic speech recognition, especially for smaller languages such as Hungarian and in fields where data is often confidential, such as finance and medicine. We introduce a method using word embeddings and smooth inverse frequency (SIF) based distance measurement to filter public-domain web corpora. The selection of documents matching the (medical) domain can be scaled. The resulting text is used to train an augmented language model for a medical dictation system. We show that using an appropriately scaled selection leads to the best performance of the ASR system, outperforming both the baseline where no data augmentation was applied and the one where all the augmentation data was added.
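    A minimal sketch of SIF-based selection is shown below, assuming precomputed word vectors and unigram probabilities; the weighting constant, the similarity threshold, and the omission of the usual common-component removal step are simplifications, not the paper's exact pipeline.

```python
# Hedged sketch: SIF-weighted sentence embeddings for filtering web documents
# by similarity to a target (medical) domain centroid. Inputs are assumed.
import numpy as np

def sif_embedding(tokens, vectors, word_prob, a=1e-3):
    """Average word vectors weighted by a / (a + p(w)).

    Note: the principal-component removal used in full SIF is omitted here.
    """
    vecs = [a / (a + word_prob.get(t, 1e-6)) * vectors[t]
            for t in tokens if t in vectors]
    return np.mean(vecs, axis=0) if vecs else None

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def select_documents(docs, domain_centroid, vectors, word_prob, threshold=0.4):
    """Keep documents whose SIF embedding is close enough to the domain centroid."""
    kept = []
    for doc in docs:
        emb = sif_embedding(doc.split(), vectors, word_prob)
        if emb is not None and cosine(emb, domain_centroid) >= threshold:
            kept.append(doc)
    return kept
```

    Raising or lowering the threshold is what scales the amount of selected augmentation text.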

    A Multilingual Parallel Corpora Collection Effort for Indian Languages

    We present sentence-aligned parallel corpora across 10 Indian languages (Hindi, Telugu, Tamil, Malayalam, Gujarati, Urdu, Bengali, Oriya, Marathi, and Punjabi) and English, many of which are categorized as low resource. The corpora are compiled from online sources that have content shared across languages. The corpora presented significantly extend existing resources, which are either not large enough or are restricted to a specific domain (such as health). We also provide a separate test corpus, compiled from an independent online source, that can be used to independently validate performance in the 10 Indian languages. Alongside, we report on the methods for constructing such corpora using tools enabled by recent advances in machine translation and cross-lingual retrieval based on deep neural networks. Comment: 9 pages. Accepted in LREC 2020
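    As an illustration of embedding-based mining of parallel sentence pairs from comparable articles, the sketch below greedily pairs source and target sentences by cosine similarity of precomputed cross-lingual sentence embeddings; the greedy strategy and the threshold are assumptions and do not reproduce the paper's actual alignment tooling.

```python
# Hedged sketch: one-to-one sentence alignment from cross-lingual embeddings.
# src_embs / tgt_embs are assumed to come from a multilingual sentence encoder.
import numpy as np

def align_sentences(src_embs, tgt_embs, threshold=0.8):
    """Greedily pair each source sentence with its best unused target match."""
    src = src_embs / np.linalg.norm(src_embs, axis=1, keepdims=True)
    tgt = tgt_embs / np.linalg.norm(tgt_embs, axis=1, keepdims=True)
    sims = src @ tgt.T

    pairs, used_tgt = [], set()
    for i in np.argsort(-sims.max(axis=1)):  # most confident source sentences first
        j = int(np.argmax(sims[i]))
        if sims[i, j] >= threshold and j not in used_tgt:
            pairs.append((int(i), j, float(sims[i, j])))
            used_tgt.add(j)
    return pairs
```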