The impact of indirect machine translation on sentiment classification
Sentiment classification has been crucial for many natural language processing (NLP) applications, such as the analysis of movie reviews, tweets, or customer feedback. A sufficiently large amount of data is required to build a robust sentiment classification system. However, such resources are not always available for all domains or for all languages.
In this work, we propose employing a machine translation (MT) system to translate customer feedback into another language to investigate in which cases translated sentences can have a positive or negative impact on an automatic sentiment classifier. Furthermore, as performing a direct translation is not always possible, we explore the performance of automatic classifiers on sentences that have been translated using a pivot MT system.
We conduct several experiments using the above approaches to analyse the performance of our proposed sentiment classification system and discuss the advantages and drawbacks of classifying translated sentences.
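The direct and pivot (indirect) translation setups described in this abstract can be sketched as follows. This is a minimal illustration only: `translate` and `classify_sentiment` are hypothetical stand-ins for real MT and sentiment systems, not the paper's actual components.

```python
# Sketch of direct vs. pivot (indirect) MT before sentiment classification.
# translate() and classify_sentiment() are hypothetical stand-ins.

def translate(text, src, tgt):
    """Stand-in MT system; a real system would return a translation."""
    return f"[{src}->{tgt}] {text}"

def classify_sentiment(text):
    """Stand-in classifier: naive keyword heuristic, for illustration only."""
    return "positive" if "great" in text.lower() else "negative"

def direct_pipeline(text, src, tgt):
    # Direct MT: one translation step, then classify.
    return classify_sentiment(translate(text, src, tgt))

def pivot_pipeline(text, src, pivot, tgt):
    # Indirect MT: source -> pivot -> target, then classify. Each extra
    # hop can introduce errors that affect the classifier downstream.
    intermediate = translate(text, src, pivot)
    return classify_sentiment(translate(intermediate, pivot, tgt))

print(direct_pipeline("Great service!", "de", "en"))        # positive
print(pivot_pipeline("Great service!", "de", "fr", "en"))   # positive
```

The point of the sketch is structural: the pivot pipeline composes two translation steps, so translation noise compounds before the classifier ever sees the text.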
An Efficient Method for Generating Synthetic Data for Low-Resource Machine Translation – An empirical study of Chinese, Japanese to Vietnamese Neural Machine Translation
Data sparsity is one of the challenges for low-resource language pairs in Neural Machine Translation (NMT). Previous works have presented different approaches to data augmentation, but they mostly require additional resources and produce low-quality synthetic data in low-resource settings. This paper proposes a simple, effective and novel method for generating synthetic bilingual data without using external resources, unlike previous approaches. Moreover, some recent works have shown that multilingual translation or transfer learning can boost translation quality in low-resource situations. However, for logographic languages such as Chinese or Japanese, this approach is still limited due to differences in the translation units of their vocabularies. Although Japanese texts contain Kanji characters that are derived from Chinese characters and are quite homologous in shape and meaning, the word orders in the sentences of these languages diverge considerably. Our study investigates these impacts on machine translation. In addition, a combined pre-trained model is also leveraged to demonstrate the efficacy of the translation tasks in a higher-resource scenario. Our experiments show performance improvements of up to +6.2 and +7.8 BLEU over bilingual baseline systems on two low-resource translation tasks, Chinese to Vietnamese and Japanese to Vietnamese.
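The abstract does not spell out its generation procedure, but back-translation is the standard augmentation baseline such work is compared against, and it illustrates what "synthetic bilingual data" means. The sketch below is a hedged illustration of that baseline, not the paper's method; `toy_reverse` is a toy stand-in for a reverse-direction MT model.

```python
# Hedged sketch of back-translation, a common data-augmentation baseline
# (NOT necessarily this paper's method): synthetic source sentences are
# generated from monolingual target-language text by a reverse MT model.

def backtranslate(tgt_sentences, reverse_mt):
    """Pair each monolingual target sentence with a synthetic source
    produced by a (hypothetical) target->source translation function."""
    return [(reverse_mt(t), t) for t in tgt_sentences]

# Toy reverse "model": reverses word order to stand in for a vi->zh system.
toy_reverse = lambda s: " ".join(reversed(s.split()))

synthetic = backtranslate(["toi yeu Viet Nam"], toy_reverse)
print(synthetic)  # [('Nam Viet yeu toi', 'toi yeu Viet Nam')]
```

Note that classic back-translation still requires a trained reverse model, i.e. an external resource; the abstract's claim is precisely that its method avoids such requirements.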
Understanding and Enhancing the Use of Context for Machine Translation
To understand and infer meaning in language, neural models have to learn
complicated nuances. Discovering distinctive linguistic phenomena from data is
not an easy task. For instance, lexical ambiguity is a fundamental feature of
language which is challenging to learn. Even more prominently, inferring the
meaning of rare and unseen lexical units is difficult with neural networks.
Meaning is often determined from context. With context, languages allow meaning
to be conveyed even when the specific words used are not known by the reader.
To model this learning process, a system has to learn from a few instances in
context and be able to generalize well to unseen cases. The learning process is
hindered when training data is scarce for a task. Even with sufficient data,
learning patterns for the long tail of the lexical distribution is challenging.
In this thesis, we focus on understanding certain potentials of contexts in
neural models and design augmentation models to benefit from them. We focus on
machine translation as an important instance of the more general language
understanding problem. To translate from a source language to a target
language, a neural model has to understand the meaning of constituents in the
provided context and generate constituents with the same meanings in the target
language. This task accentuates the value of capturing nuances of language and
the necessity of generalization from few observations. The main problem we
study in this thesis is what neural machine translation models learn from data
and how we can devise more focused contexts to enhance this learning. Looking
more in-depth into the role of context and the impact of data on learning
models is essential to advance the NLP field. Moreover, it helps highlight the
vulnerabilities of current neural networks and provides insights into designing
more robust models. Comment: PhD dissertation defended on November 10th, 202
Quality in machine translation and post-editing: annotation of agreement and word-order errors
Given the characteristics of machine translation, such as low cost and speed, this type of translation has been increasingly used in the translation market. However, the quality of the output produced by these systems may not be ideal, so the translation must go through a post-editing step, performed by humans, to reach satisfactory quality levels. The present work describes the machine translation, post-editing and annotation process offered by the Unbabel platform, which uses a crowd to edit online the errors found in texts translated by a Neural Machine Translation (NMT) system. The main objective of this research is to improve the quality of the texts translated by this company, through proposed improvements to the guidelines the company provides to its editors and annotators, and through suggestions for the evaluation and training of these human agents. To achieve this goal, data was collected and analyzed containing excerpts of texts translated by the automatic system, post-edited by humans and also annotated by humans under the Agreement and Word Order labels, with English as the source language and Brazilian Portuguese as the target language. Based on the results of these analyses, it was possible to define Golden Texts and multiple-choice tests with feedback messages to assist in the evaluation and training of annotators and post-editors.