36 research outputs found

    Transfer learning and subword sampling for asymmetric-resource one-to-many neural translation

    There are several approaches for improving neural machine translation for low-resource languages: monolingual data can be exploited via pretraining or data augmentation; parallel corpora on related language pairs can be used via parameter sharing or transfer learning in multilingual models; and subword segmentation and regularization techniques can be applied to ensure high coverage of the vocabulary. We review these approaches in the context of an asymmetric-resource one-to-many translation task, in which the pair of target languages are related, with one being a very low-resource and the other a higher-resource language. We test various methods on three artificially restricted translation tasks (English to Estonian (low-resource) and Finnish (high-resource), English to Slovak and Czech, and English to Danish and Swedish) and one real-world task, Norwegian to North Sámi and Finnish. The experiments show positive effects especially for scheduled multi-task learning, denoising autoencoders, and subword sampling.
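The subword sampling credited above can be illustrated with a toy BPE-dropout-style segmenter. The merge table, example word, and dropout rate below are invented for illustration and are not taken from the paper; real systems learn their merge tables from data.

```python
import random

# Hypothetical merge table for illustration only.
MERGES = [("l", "o"), ("lo", "w"), ("e", "r"), ("low", "er")]

def segment(word, merges, dropout=0.0, rng=random):
    """Greedy BPE segmentation; each merge is skipped with prob `dropout`."""
    symbols = list(word)
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b and rng.random() >= dropout:
                symbols[i:i + 2] = [a + b]  # apply the merge in place
            else:
                i += 1
    return symbols

# With dropout=0 the segmentation is deterministic; with dropout > 0 the
# model sees multiple segmentations of the same word during training,
# which acts as a regularizer when data is scarce.
print(segment("lower", MERGES))               # → ['lower']
random.seed(0)
print(segment("lower", MERGES, dropout=0.5))  # a sampled segmentation
```

Every sampled segmentation still concatenates back to the original word; only the subword boundaries vary between training epochs.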

    Gender bias in natural language processing

    Gender bias is a dangerous form of social bias that impacts a large group of people. Its effect propagates into our data, causing the accuracy of a model's predictions to differ depending on gender. In the deep learning era, models are shaped by their training data, which transfers the negative biases in the data to the models, and Natural Language Processing models can further amplify the bias present in the data. This thesis studies the issue of gender bias in NLP applications from several points of view. To understand and manage the effect of bias amplification, both evaluation and mitigation approaches must be explored, and the scientific community has exerted significant effort in these two directions. The thesis contributes to both: proposing evaluation schemes, whether as datasets or mechanisms, and suggesting mitigation techniques. For evaluation, we proposed techniques for measuring bias in contextualized embeddings and multilingual translation models, and we presented benchmarks for evaluating bias in speech translation and multilingual machine translation models. For mitigation, we proposed different approaches for machine translation models: adding contextual text, adding contextual embeddings, or relaxing the architecture's constraints. Our evaluation studies concluded that gender bias is strongly encoded in contextual embeddings representing professions and stereotypical nouns. We also showed that algorithms amplify the bias and that the system's architecture affects this behavior. Among the benchmarks we contributed is one that evaluates gender bias in speech translation systems; this research suggests that the current quality of speech translation systems is too low to evaluate gender bias in them accurately. Additionally, we proposed a toolkit for building multilingual datasets, balanced with respect to gender across occupations, for training and evaluating NMT models. We found that high-resource languages usually tend to produce more accurate translations for male referents. Our mitigation studies in NMT suggest that the nature of the datasets and languages must be considered in order to apply the right approach: mitigating bias can rely on adding contextual information, but in other cases we need to rethink the model and relax some bias-influencing conditions that do not affect general performance but reduce the effect of bias amplification.
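Bias in embeddings of the kind measured above is often quantified as an association score between a target word and sets of gendered attribute words. The sketch below uses hand-made 2-d vectors as stand-ins for real embeddings; the vectors and word lists are illustrative assumptions, not the thesis's actual data or method.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def gender_association(word_vec, male_vecs, female_vecs):
    """Mean cosine to male attribute words minus mean cosine to female
    ones. Positive -> closer to the male direction, negative -> female."""
    male = sum(cosine(word_vec, m) for m in male_vecs) / len(male_vecs)
    female = sum(cosine(word_vec, f) for f in female_vecs) / len(female_vecs)
    return male - female

# Toy 2-d "embeddings": dimension 0 plays the role of a gender direction.
emb = {
    "he": [1.0, 0.1], "man": [0.9, 0.2],
    "she": [-1.0, 0.1], "woman": [-0.9, 0.2],
    "engineer": [0.6, 0.8],   # placed on the "male" side for the example
    "nurse": [-0.6, 0.8],     # placed on the "female" side
}
male = [emb["he"], emb["man"]]
female = [emb["she"], emb["woman"]]
print(gender_association(emb["engineer"], male, female))  # positive
print(gender_association(emb["nurse"], male, female))     # negative
```

With real embeddings, a strongly nonzero score for profession words is exactly the kind of encoded stereotype the evaluation studies report.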

    Enhancing the Performance of NMT Models Using the Data-Based Domain Adaptation Technique for Patent Translation

    During today's age of unparalleled connectivity, language and data have become powerful tools capable of enabling effective communication and cross-cultural collaboration. Neural machine translation (NMT) models are especially capable of leveraging linguistic knowledge and parallel corpora to increase global connectivity and act as a tool for the transmission of knowledge. In this thesis, we apply a data-based domain adaptation technique to fine-tune three pre-existing NMT transformer models with attention mechanisms for the task of patent translation from English to Japanese. Language, especially in the context of patents, can be very nuanced, and a clear understanding of the intended meaning requires comprehensive domain knowledge and expert linguistic abilities, which can be expensive and time-consuming to acquire. Automating the translation process is helpful; however, commercially available NMT models perform poorly on this task because they are not trained on highly technical terms whose meaning may depend on the domain in which they are used. Our aim is to enhance the performance of translation models on highly technical inputs through a series of essential steps centered on data-based domain adaptation. Collectively, these steps improve the NMT model's performance, yielding a 41.22% increase over the baseline BLEU score.
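BLEU, the metric behind the reported 41.22% gain, can be sketched in a few lines. This simplified sentence-level version with add-one smoothing is for illustration only; the example sentences are invented, and real evaluations use standard toolkits with corpus-level statistics.

```python
import math
from collections import Counter

def bleu(hypothesis, reference, max_n=4):
    """Sentence-level BLEU: geometric mean of modified n-gram precisions
    (add-one smoothed) times a brevity penalty. A teaching sketch, not a
    replacement for standard implementations."""
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # Clipped counts: each hypothesis n-gram is credited at most as
        # often as it occurs in the reference.
        overlap = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
        total = max(sum(hyp_ngrams.values()), 1)
        log_precisions.append(math.log((overlap + 1) / (total + 1)))
    # Brevity penalty punishes hypotheses shorter than the reference.
    brevity = min(1.0, math.exp(1 - len(ref) / max(len(hyp), 1)))
    return 100 * brevity * math.exp(sum(log_precisions) / max_n)

print(bleu("the cat sat on the mat", "the cat sat on the mat"))  # → 100.0
print(bleu("a cat sat on mat", "the cat sat on the mat"))        # partial credit
```

A "41.22% increase in the baseline BLEU score" is a relative gain on this 0-100 scale, e.g. a baseline of 20 BLEU rising to roughly 28.2.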

    Low-Resource Unsupervised NMT: Diagnosing the Problem and Providing a Linguistically Motivated Solution

    Unsupervised Machine Translation has been advancing our ability to translate without parallel data, but state-of-the-art methods assume an abundance of monolingual data. This paper investigates the scenario where monolingual data is limited as well, finding that current unsupervised methods suffer in performance under this stricter setting. We find that the performance loss originates from the poor quality of the pretrained monolingual embeddings, and we propose using linguistic information in the embedding training scheme. To support this, we look at two linguistic features that may help improve alignment quality: dependency information and sub-word information. Using dependency-based embeddings results in a complementary word representation which offers a boost in performance of around 1.5 BLEU points compared to standard word2vec when monolingual data is limited to 1 million sentences per language. We also find that the inclusion of sub-word information is crucial to improving the quality of the embeddings.
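Dependency-based embeddings of the kind compared against word2vec above replace linear window contexts with contexts drawn from dependency arcs, in the style of Levy and Goldberg's dependency-based word embeddings. The sketch below shows how such (word, context) training pairs might be extracted; the hand-written parse is a stand-in for real parser output, and the exact context encoding is an assumption for illustration.

```python
# (head, relation, dependent) triples for "scientist discovers star";
# a real pipeline would obtain these from a dependency parser.
parse = [
    ("discovers", "nsubj", "scientist"),
    ("discovers", "dobj", "star"),
]

def dependency_contexts(triples):
    """Yield (word, context) training pairs: each arc contributes two
    pairs, with the relation (and its inverse, marked '-1') folded into
    the context token, so syntactic role is part of the context."""
    for head, rel, dep in triples:
        yield (head, f"{dep}/{rel}")
        yield (dep, f"{head}/{rel}-1")

pairs = list(dependency_contexts(parse))
print(pairs)
# [('discovers', 'scientist/nsubj'), ('scientist', 'discovers/nsubj-1'),
#  ('discovers', 'star/dobj'), ('star', 'discovers/dobj-1')]
```

Because contexts come from syntactic relations rather than adjacency, words that fill the same grammatical roles end up with similar vectors, which is the complementary signal credited with the BLEU gain.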