6 research outputs found

    The impact of indirect machine translation on sentiment classification

    Sentiment classification has been crucial for many natural language processing (NLP) applications, such as the analysis of movie reviews, tweets, or customer feedback. A sufficiently large amount of data is required to build a robust sentiment classification system. However, such resources are not always available for all domains or for all languages. In this work, we propose employing a machine translation (MT) system to translate customer feedback into another language to investigate in which cases translated sentences can have a positive or negative impact on an automatic sentiment classifier. Furthermore, as performing a direct translation is not always possible, we explore the performance of automatic classifiers on sentences that have been translated using a pivot MT system. We conduct several experiments using the above approaches to analyse the performance of our proposed sentiment classification system and discuss the advantages and drawbacks of classifying translated sentences.
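
    The pivoted setup described above can be made concrete with a short, purely illustrative sketch: translate the same piece of feedback directly and via a pivot language, then feed both outputs to a sentiment classifier. The Hugging Face pipelines and model names below (Helsinki-NLP/opus-mt-*) are assumptions chosen for the example, not the components used in the paper.

    # Illustrative sketch (not the paper's system): classify customer feedback
    # after direct vs. pivot machine translation, using off-the-shelf
    # Hugging Face pipelines as stand-ins for the MT and sentiment components.
    from transformers import pipeline

    # Direct MT: German -> English
    direct_mt = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")
    # Pivot MT: German -> French -> English
    de_fr = pipeline("translation", model="Helsinki-NLP/opus-mt-de-fr")
    fr_en = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
    # English sentiment classifier applied to the translated feedback
    sentiment = pipeline("sentiment-analysis")

    feedback_de = "Der Kundendienst war langsam, aber das Produkt ist großartig."

    direct_en = direct_mt(feedback_de)[0]["translation_text"]
    pivot_en = fr_en(de_fr(feedback_de)[0]["translation_text"])[0]["translation_text"]

    # Compare how the two translation routes affect the classifier's decision.
    for route, text in [("direct", direct_en), ("pivot", pivot_en)]:
        result = sentiment(text)[0]
        print(f"{route:6s} | {result['label']:8s} {result['score']:.3f} | {text}")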

    An Efficient Method for Generating Synthetic Data for Low-Resource Machine Translation – An empirical study of Chinese, Japanese to Vietnamese Neural Machine Translation

    Data sparsity is one of the main challenges for low-resource language pairs in Neural Machine Translation (NMT). Previous works have presented different approaches to data augmentation, but they mostly require additional resources and tend to produce low-quality synthetic data in low-resource settings. This paper proposes a simple and effective novel method for generating synthetic bilingual data without using external resources, unlike previous approaches. Moreover, recent work has shown that multilingual translation or transfer learning can boost translation quality in low-resource situations. However, for logographic languages such as Chinese or Japanese, this approach is still limited due to differences in the translation units of the vocabularies. Although Japanese texts contain Kanji characters derived from Chinese characters, which are largely homologous in shape and meaning, the word order of the two languages diverges considerably. Our study investigates these effects on machine translation. In addition, a combined pre-trained model is leveraged to demonstrate its efficacy on translation tasks in a higher-resource scenario. Our experiments show performance improvements of up to +6.2 and +7.8 BLEU over bilingual baseline systems on two low-resource translation tasks, Chinese to Vietnamese and Japanese to Vietnamese.
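
    A hedged sketch of how such BLEU gains over a bilingual baseline are commonly measured (the abstract does not describe the authors' evaluation script; sacrebleu and the file names below are assumptions used only for illustration):

    # Hedged sketch: comparing a bilingual baseline against a system trained
    # with additional synthetic data, scored with sacrebleu. File names and
    # data are hypothetical; this is not the authors' evaluation setup.
    import sacrebleu

    def read_lines(path):
        with open(path, encoding="utf-8") as f:
            return [line.rstrip("\n") for line in f]

    refs = read_lines("test.vi")                        # Vietnamese references
    baseline_hyp = read_lines("baseline.zh-vi.out")     # bilingual baseline output
    synthetic_hyp = read_lines("synthetic.zh-vi.out")   # system trained with synthetic data

    baseline_bleu = sacrebleu.corpus_bleu(baseline_hyp, [refs]).score
    synthetic_bleu = sacrebleu.corpus_bleu(synthetic_hyp, [refs]).score

    print(f"baseline BLEU : {baseline_bleu:.1f}")
    print(f"+synthetic    : {synthetic_bleu:.1f} ({synthetic_bleu - baseline_bleu:+.1f})")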

    Understanding and Enhancing the Use of Context for Machine Translation

    To understand and infer meaning in language, neural models have to learn complicated nuances. Discovering distinctive linguistic phenomena from data is not an easy task. For instance, lexical ambiguity is a fundamental feature of language which is challenging to learn. Even more prominently, inferring the meaning of rare and unseen lexical units is difficult with neural networks. Meaning is often determined from context. With context, languages allow meaning to be conveyed even when the specific words used are not known by the reader. To model this learning process, a system has to learn from a few instances in context and be able to generalize well to unseen cases. The learning process is hindered when training data is scarce for a task. Even with sufficient data, learning patterns for the long tail of the lexical distribution is challenging. In this thesis, we focus on understanding the potential of context in neural models and design augmentation models that benefit from it. We focus on machine translation as an important instance of the more general language understanding problem. To translate from a source language to a target language, a neural model has to understand the meaning of constituents in the provided context and generate constituents with the same meanings in the target language. This task accentuates the value of capturing nuances of language and the necessity of generalization from few observations. The main problem we study in this thesis is what neural machine translation models learn from data and how we can devise more focused contexts to enhance this learning. Looking more in-depth into the role of context and the impact of data on learning models is essential to advance the NLP field. Moreover, it helps highlight the vulnerabilities of current neural networks and provides insights into designing more robust models. Comment: PhD dissertation defended on November 10th, 202

    Quality in machine translation and post-editing: annotation of agreement and word order errors

    Considering the characteristics of machine translation, such as its low cost and speed, this type of translation has been used increasingly in the translation market. Nevertheless, the quality of the output produced by these systems may not be ideal, so a post-editing step performed by humans is necessary to reach satisfactory quality levels. The present work describes the machine translation, post-editing and annotation process offered by the Unbabel platform, which uses a crowd to edit online the errors found in texts translated by a Neural Machine Translation (NMT) system. The main objective of this research is to improve the quality of the texts translated on that platform, through proposed improvements to the guidelines the company provides to its editors and annotators and through suggestions for the evaluation and training of these human contributors. To achieve this goal, data containing excerpts of texts translated by the MT system, post-edited by humans and annotated, also by humans, under the Agreement and Word Order labels was collected and analysed, with English as the source language and Brazilian Portuguese as the target language. Based on the results of these analyses, it was possible to define Golden Texts and multiple-choice tests with feedback messages to support the evaluation and training of annotators and post-editors.
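
    As a purely illustrative sketch of the kind of annotated data discussed above (not Unbabel's internal schema), post-edited segments labelled with error categories such as Agreement and Word Order can be represented and aggregated as follows; all field names and example sentences are hypothetical:

    # Minimal illustrative sketch (hypothetical schema and data): post-edited
    # segments annotated with error labels such as "Agreement" and
    # "Word Order", plus a per-category error count.
    from collections import Counter
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class AnnotatedSegment:
        source: str             # English source sentence
        mt_output: str          # raw NMT output (Brazilian Portuguese)
        post_edit: str          # human post-edited version
        error_labels: List[str] = field(default_factory=list)

    segments = [
        AnnotatedSegment(
            source="The files were deleted.",
            mt_output="Os arquivos foi excluído.",
            post_edit="Os arquivos foram excluídos.",
            error_labels=["Agreement"],
        ),
        AnnotatedSegment(
            source="Where is my order?",
            mt_output="Onde meu pedido está?",
            post_edit="Onde está o meu pedido?",
            error_labels=["Word Order"],
        ),
    ]

    # Aggregate error counts per category, e.g. to select training material
    # ("Golden Texts") or build multiple-choice tests for annotators.
    counts = Counter(label for seg in segments for label in seg.error_labels)
    print(counts)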