Exploring Enhanced Code-Switched Noising for Pretraining in Neural Machine Translation
Multilingual pretraining approaches in Neural Machine Translation (NMT) have shown that training models to denoise synthetic code-switched data can yield impressive performance gains, owing to better multilingual semantic representations and transfer learning. However, these approaches generate the synthetic code-switched data using non-contextual, one-to-one word translations obtained from lexicons, which can introduce significant noise: polysemes and multi-word expressions are handled poorly, linguistic agreement is violated, and the method does not scale to agglutinative languages. To overcome these limitations, we propose an approach called Contextual Code-Switching (CCS), where contextual, many-to-many word translations are generated using a `base' NMT model. We conduct experiments on three language families (Romance, Uralic and Indo-Aryan) and show significant improvements of up to 5.5 spBLEU points over previous lexicon-based state-of-the-art approaches. We also observe that small CCS models can perform comparably to or better than massive models such as mBART50 and mRASP2, depending on the amount of data provided. We empirically analyse several key factors behind these gains, including context, many-to-many substitutions and the number of code-switched languages, and show that they all contribute to enhanced pretraining of multilingual NMT models.
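As a rough sketch of the idea (not the paper's exact procedure), the snippet below generates code-switched training data by translating a contiguous span in context with a base NMT model, rather than substituting isolated words from a lexicon; the model name and the span-selection heuristic are illustrative assumptions.

```python
# A minimal sketch of contextual code-switched noising with a `base' NMT
# model. Translating a whole span (rather than isolated lexicon entries)
# lets the model resolve polysemy and keep multi-word expressions intact.
import random
from transformers import MarianMTModel, MarianTokenizer

MODEL_NAME = "Helsinki-NLP/opus-mt-en-fr"  # assumed base model
tokenizer = MarianTokenizer.from_pretrained(MODEL_NAME)
model = MarianMTModel.from_pretrained(MODEL_NAME)

def translate(text: str) -> str:
    """Translate a text fragment with the base NMT model."""
    batch = tokenizer([text], return_tensors="pt")
    generated = model.generate(**batch, max_new_tokens=64)
    return tokenizer.decode(generated[0], skip_special_tokens=True)

def contextual_code_switch(sentence: str, ratio: float = 0.3) -> str:
    """Replace one contiguous span of the sentence with its translation."""
    words = sentence.split()
    span_len = max(1, int(len(words) * ratio))
    start = random.randrange(0, len(words) - span_len + 1)
    span = " ".join(words[start:start + span_len])
    switched = translate(span)
    return " ".join(words[:start] + [switched] + words[start + span_len:])

print(contextual_code_switch("The central bank raised interest rates again."))
```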
End-to-End Simultaneous Speech Translation
Speech translation is the task of translating speech in one language to text or speech in another language, while simultaneous translation aims at lower translation latency by starting the translation before the speaker finishes a sentence. The combination of the two, simultaneous speech translation, can be applied in low latency scenarios such as live video caption translation and real-time interpretation.
This thesis focuses on an end-to-end, or direct, approach to simultaneous speech translation. We first define the task of simultaneous speech translation, including its challenges and evaluation metrics. We then progressively introduce our contributions to tackling these challenges. First, we propose a novel simultaneous translation policy, monotonic multihead attention, for transformer models on text-to-text translation. Second, we investigate the issues and potential solutions that arise when adapting text-to-text simultaneous policies to end-to-end speech-to-text translation models. Third, we introduce the augmented-memory transformer encoder for simultaneous speech-to-text translation models, for better computational efficiency. Fourth, we explore direct simultaneous speech translation with a variational monotonic multihead attention policy, based on recent speech-to-unit models. Finally, we provide some directions for potential future research.
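To make the read/write framing concrete, here is a minimal sketch of the decision loop that any simultaneous policy must implement, using the simple wait-k rule as a stand-in; monotonic multihead attention, as proposed in the thesis, replaces this fixed rule with a learned per-head decision. `encode_step` and `decode_step` are hypothetical stand-ins for a streaming encoder and an incremental decoder.

```python
def simultaneous_decode(source_tokens, encode_step, decode_step, k=3, eos="</s>"):
    """Interleave READ (consume one source token) and WRITE (emit one
    target token); the wait-k rule keeps the reader k tokens ahead."""
    stream = iter(source_tokens)
    encoder_states, target = [], []
    source_done = False
    while True:
        # READ while the policy wants more context and input remains.
        if not source_done and len(encoder_states) < len(target) + k:
            token = next(stream, None)
            if token is None:
                source_done = True  # source sentence finished
            else:
                encoder_states.append(encode_step(token, encoder_states))
            continue
        # WRITE one target token from everything read so far.
        next_token = decode_step(encoder_states, target)
        if next_token == eos:
            break
        target.append(next_token)
    return target
```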
Translationese indicators for human translation quality estimation (based on English-to-Russian translation of mass-media texts)
A thesis submitted in partial fulfilment of the requirements of the University of Wolverhampton for the degree of Doctor of Philosophy.

Human translation quality estimation is a relatively new and challenging area of research, because human translation quality is notoriously more subtle and subjective than machine translation quality, which attracts much more attention and effort from the research community. At the same time, human translation is routinely assessed by education and certification institutions, as well as at translation competitions. Do the quality labels and scores generated from real-life quality judgements align well with objective properties of translations? This thesis puts this question to the test using machine learning methods.

Conceptually, this research is built around the hypothesis that linguistic properties characteristic of translations, as a specific form of communication, can correlate with translation quality. This assumption is often made in translation studies but has never been put to a rigorous empirical test. Exploring translationese features in a quality estimation task can help identify quality-related trends in translational behaviour and provide data-driven insights into professionalism to improve training. Using translationese for quality estimation fits well with the concept of quality in translation studies, because it is essentially a document-level property. Linguistically motivated translationese features are also more interpretable than popular distributed representations and can explain linguistic differences between quality categories in human translation.

We investigated (i) an extended set of Universal Dependencies-based morphosyntactic features, as well as two lexical feature sets capturing (ii) collocational properties of translations and (iii) ratios of vocabulary items in various frequency bands, along with entropy scores from n-gram models. To compare the performance of our feature sets in translationese classification and quality estimation tasks against other representations, the experiments were also run on tf-idf features, QuEst++ features and contextualised embeddings from a range of pre-trained language models, including the state-of-the-art multilingual solution for machine translation quality estimation. Our major focus was on document-level prediction; however, where the labels and features allowed, the experiments were extended to the sentence level.

The corpus used in this research includes English-to-Russian parallel subcorpora of student and professional translations of mass-media texts, and a register-comparable corpus of non-translations in the target language. Quality labels for various subsets of student translations come from a number of real-life settings: translation competitions, graded student translations, error annotations and direct assessment. We overview approaches to benchmarking quality in translation and provide a detailed description of our own annotation experiments.

Of the three proposed translationese feature sets, morphosyntactic features returned the best results on all tasks. In many settings they were second only to contextualised embeddings. At the same time, performance across representations was contingent on the type of quality captured by the labels or scores. Using the outcomes of the machine learning experiments and feature analysis, we established that translationese properties of translations were not equally reflected by the various labels and scores. For example, professionalism was much less related to translationese than expected. Labels from document-level holistic assessment demonstrated maximum support for our hypothesis: lower-ranking translations clearly exhibited more translationese. They bore more traces of mechanical translational behaviour associated with following source language patterns whenever possible, which led to inflated frequencies of analytical passives, modal predicates and verbal forms, especially copula verbs and verbs in the finite form. As expected, lower-ranking translations were more repetitive and had longer, more complex sentences. Higher-ranking translations were indicative of greater skill in recognising and counteracting translationese tendencies. For document-level holistic labels as an approach to capturing quality, translationese indicators might provide a valuable contribution to an effective quality estimation pipeline.

However, error-based scores, and especially scores from sentence-level direct assessment, proved to be much less correlated with translationese and fluency issues in general. This was confirmed by relatively low regression results across all representations that had access only to the target-language side of the dataset, by feature analysis, and by the correlation between error-based scores and scores from direct assessment.
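As a rough illustration of the lexical feature sets described above, the following sketch computes frequency-band ratios and a simple repetitiveness measure; the band thresholds and the `freq_rank` lookup are hypothetical, and the thesis's morphosyntactic and n-gram entropy features would additionally require a Universal Dependencies parser and a language model.

```python
# A minimal sketch of two lexical translationese indicators: the share
# of tokens falling into corpus frequency bands, and inverse type-token
# ratio as a crude measure of repetitiveness.
from collections import Counter

def frequency_band_ratios(tokens, freq_rank, bands=(1000, 5000, 50000)):
    """Share of tokens whose corpus frequency rank falls in each band;
    freq_rank maps a word to its rank in a reference corpus (assumed)."""
    counts = [0] * (len(bands) + 1)
    for tok in tokens:
        rank = freq_rank.get(tok.lower(), float("inf"))
        for i, limit in enumerate(bands):
            if rank <= limit:
                counts[i] += 1
                break
        else:
            counts[-1] += 1  # out-of-vocabulary / very rare
    total = max(1, len(tokens))
    return [c / total for c in counts]

def repetitiveness(tokens):
    """Inverse type-token ratio: higher values = more repetitive text."""
    return len(tokens) / max(1, len(Counter(tokens)))
```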
Learning multilingual and multimodal representations with language-specific encoders and decoders for machine translation
This thesis studies different language-specific approaches to Multilingual Machine Translation without parameter sharing, and their properties compared to the current state-of-the-art based on parameter sharing. We define Multilingual Machine Translation as the task of translating between several pairs of languages in a single system. It has been widely studied in recent years because such systems scale easily to more languages, even between pairs never seen together during training (zero-shot translation). Several architectures have been proposed to tackle this problem with varying amounts of shared parameters between languages. Current state-of-the-art systems focus on a single sequence-to-sequence architecture where all languages share the complete set of parameters, including the token representation. While this has proven convenient for transfer learning, it makes it challenging to incorporate new languages into the trained model, as all languages depend on the same parameters.
What all the proposed architectures have in common is enforcing a shared representation space between languages. Specifically, throughout this work we use as this representation the final output of the encoders, which the decoders consume through cross-attention. Having a shared space reduces noise, as semantically similar sentences produce similar vector representations, helping the decoders process representations from several languages. This semantic representation is particularly important for zero-shot translation, where similarity to the representations of language pairs seen during training is key to reducing ambiguity between languages and obtaining good translation performance.
This thesis is structured in three main blocks, focused on different scenarios of this task. Firstly, we propose a training method that enforces a common representation for bilingual training, together with a procedure to extend it to new languages efficiently. Secondly, we propose another training method that allows this representation to be learned directly on multilingual data and can equally be extended to new languages. Thirdly, we show that the proposed multilingual architecture is not limited to textual languages: we extend our method to new data modalities by adding speech encoders and performing spoken language translation, including zero-shot translation, into all the supported languages.
Our main results show that the common intermediate representation is achievable in this scenario, matching the performance of previous parameter-sharing systems while allowing new languages or data modalities to be added efficiently, without negative transfer to the previously supported languages and without retraining the system.
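A minimal sketch, in PyTorch, of the central training idea: language-specific encoders are pushed toward a common representation space by adding a distance term between their outputs on parallel sentences, so that decoders can attend over representations from either language. The pooling, the loss weight `alpha` and the module interfaces are illustrative assumptions, not the thesis's exact procedure.

```python
import torch.nn as nn

def joint_loss(enc_src, enc_tgt, decoder_nll, src_batch, tgt_batch, alpha=1.0):
    """Translation loss plus a penalty that pulls the two encoders'
    sentence representations together on parallel data."""
    h_src = enc_src(src_batch).mean(dim=1)  # mean-pool over time: (B, d)
    h_tgt = enc_tgt(tgt_batch).mean(dim=1)
    similarity_penalty = nn.functional.mse_loss(h_src, h_tgt)
    return decoder_nll + alpha * similarity_penalty
```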
Structural pruning for speed in neural machine translation
Neural machine translation (NMT) strongly outperforms previous statistical techniques. With the emergence of the transformer architecture, we consistently train and deploy deeper and larger models, often with billions of parameters, in an ongoing effort to achieve even better quality. At the same time, there is a constant pursuit of optimisation opportunities to reduce inference runtime.

Parameter pruning is one of the staple optimisation techniques. Even though coefficient-wise sparsity is the most popular for compression purposes, it does not easily make a model run faster: sparse matrix multiplication requires custom routines, usually depending on low-level hardware implementations for maximum efficiency. In my thesis, I focus on structural pruning in the field of NMT, which results in smaller but still dense architectures that need no further modification to run efficiently.

My research follows two main directions. The first explores the Lottery Ticket Hypothesis (LTH), a well-known pruning algorithm, but in a structural setup with a custom pruning criterion. It involves partial training and pruning steps performed in a loop. Experiments with LTH produced substantial speed-ups when applied to prune heads in the attention mechanism of a transformer. While this method proved successful, it carries the burden of prolonged training that makes an already expensive training routine even more so.

From that point, I concentrate exclusively on research incorporating pruning into training via regularisation. I experiment with a standard group lasso, which zeroes out parameters together in a pre-defined structural way. By targeting the feedforward and attention layers of a transformer, group lasso significantly improves inference speed over already optimised state-of-the-art fast models. Improving upon that work, I designed a novel approach called aided regularisation, where each layer's penalty is scaled based on statistics gathered as training progresses. Both the gradient- and parameter-based variants aim to decrease the depth of a model, further optimising speed while maintaining the translation quality of an unpruned baseline.

The goal of this dissertation is to advance state-of-the-art efficient NMT with simple but tangible structural sparsity methods. The majority of experiments in the thesis use highly optimised models as baselines to show that this work pushes the Pareto frontier of the quality vs. speed trade-off forward. For example, it is possible to prune a model to be 50% faster with no change in translation quality.
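The group lasso idea can be made concrete with a small sketch: treat each attention head's parameters as one group and penalise the sum of per-group L2 norms, so entire heads are driven to zero together and can be removed as dense structural units. The weight-layout convention below is an assumption for illustration, not the thesis's exact implementation.

```python
import torch

def head_group_lasso(out_proj_weight: torch.Tensor, num_heads: int) -> torch.Tensor:
    """Sum of per-head Frobenius norms of an attention output projection
    of shape (d_model, d_model); columns are sliced per head (assumed layout)."""
    d_model = out_proj_weight.shape[1]
    head_dim = d_model // num_heads
    penalty = out_proj_weight.new_zeros(())
    for h in range(num_heads):
        group = out_proj_weight[:, h * head_dim:(h + 1) * head_dim]
        penalty = penalty + torch.linalg.norm(group)  # whole-group L2 norm
    return penalty

# During training: loss = nll + lam * sum(head_group_lasso(W, n_heads) for W in out_projs)
```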
Gender bias in natural language processing
Gender bias is a dangerous form of social bias impacting an essential group of people. The effect of gender bias propagates to our data, causing the accuracy of model predictions to differ depending on gender. In the deep learning era, our models are strongly shaped by their training data, which transfers the negative biases in the data to the models. Natural Language Processing models can further amplify this bias in the data. Our thesis is devoted to studying the issue of gender bias in NLP applications from different points of view.
To understand and manage the effect of bias amplification, evaluation and mitigation approaches have to be explored. The research community has made significant efforts in these two directions to enable solutions to the problem. Our thesis addresses both: proposing evaluation schemes, whether as datasets or mechanisms, and suggesting mitigation techniques. For evaluation, we proposed techniques for evaluating bias in contextualised embeddings and multilingual translation models. In addition, we presented benchmarks for evaluating bias in speech translation and multilingual machine translation models. On the mitigation side, we proposed different approaches for machine translation models: adding contextual text, adding contextual embeddings, or relaxing the architecture's constraints.
Our evaluation studies concluded that gender bias is strongly encoded in contextual embeddings representing professions and stereotypical nouns. We also unveiled that algorithms amplify the bias and that the system's architecture affects this behaviour. We contributed several evaluation benchmarks, including one that evaluates gender bias in speech translation systems. This research suggests that the current state of speech translation systems does not enable us to evaluate gender bias accurately, because of the low quality of those systems. Additionally, we proposed a toolkit for building multilingual balanced datasets for training and evaluating NMT models. These datasets are gender-balanced across occupations. We found that systems for high-resource languages usually tend to predict male translations more accurately.
Our mitigation studies in NMT suggest that the nature of the datasets and languages needs to be considered in choosing the right approach. Mitigating bias can rely on adding contextual information. In other cases, however, we need to rethink the model and relax some of the conditions that influence bias, in ways that do not affect general performance but reduce the effect of bias amplification.
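As an illustration of the benchmark-based evaluation idea (in the spirit of WinoMT-style protocols, not the thesis's exact benchmark), the sketch below checks whether an MT system inflects an occupation noun according to the gender disambiguated by the source context; `translate` and the gendered-form lexicon are hypothetical stand-ins.

```python
def gender_accuracy(examples, translate, gendered_forms):
    """examples: (source_sentence, occupation, expected_gender) triples;
    gendered_forms: occupation -> {"male": ..., "female": ...} target forms."""
    correct = 0
    for sentence, occupation, gender in examples:
        hypothesis = translate(sentence).lower()  # hypothetical MT call
        if gendered_forms[occupation][gender] in hypothesis:
            correct += 1
    return correct / max(1, len(examples))

# e.g. ("The doctor asked the nurse to help her.", "doctor", "female")
# should yield a feminine form of "doctor" in the target language.
```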
Return to the Source: Assessing Machine Translation Suitability based on the Source Text using XLM-RoBERTa
In order to assess the suitability of a text for machine translation (MT), the factors in play are many and often vary across language pairs. Readability might account for part of the problem, but the metrics for its evaluation are inherently monolingual (e.g., the Gunning fog index) or target language learning, and thus only consider the problems humans face when approaching a text, such as text length or overly complex syntax. Although these aspects could also make an automatic translation process harder, such metrics treat the source text purely as a comprehension problem, whereas in real-world scenarios most of the attention is on the target text, focusing on the essential cross-language aspects of terminology and the pragmatics of the target language.
This dissertation represents an attempt to approach this problem by transferring the knowledge from established MT evaluation metrics to a new model able to predict MT quality from the source text alone. To open the door to experiments in this direction, we explore fine-tuning a state-of-the-art transformer model (XLM-RoBERTa), framing the problem as both single-task and multi-task learning. Results for this methodology are promising, with both model types seemingly able to approximate well-established MT evaluation and quality estimation metrics, achieving low RMSE values in the [0.1-0.2] range.
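A minimal sketch of the single-task setup, assuming the Hugging Face transformers API: XLM-RoBERTa is fine-tuned as a regressor that predicts a quality score from the source text alone, with scores from an established MT metric serving as training targets. The hyperparameters and the toy batch are illustrative.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=1, problem_type="regression"
)

# One toy training step: the label is a score taken from an established
# MT evaluation metric computed against a reference translation.
batch = tokenizer(["Source sentence to score."], return_tensors="pt", padding=True)
labels = torch.tensor([[0.85]])
outputs = model(**batch, labels=labels)  # MSE loss with num_labels=1
outputs.loss.backward()
```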
Findings of the IWSLT 2022 Evaluation Campaign.
The evaluation campaign of the 19th International Conference on Spoken Language Translation featured eight shared tasks: (i) Simultaneous speech translation, (ii) Offline speech translation, (iii) Speech to speech translation, (iv) Low-resource speech translation, (v) Multilingual speech translation, (vi) Dialect speech translation, (vii) Formality control for speech translation, (viii) Isometric speech translation. A total of 27 teams participated in at least one of the shared tasks. This paper details, for each shared task, the purpose of the task, the data that were released, the evaluation metrics that were applied, the submissions that were received and the results that were achieved.
How to keep text private? A systematic review of deep learning methods for privacy-preserving natural language processing
Deep learning (DL) models for natural language processing (NLP) tasks often handle private data, demanding protection against breaches and disclosures. Data protection laws, such as the European Union's General Data Protection Regulation (GDPR), thereby enforce the need for privacy. Although many privacy-preserving NLP methods have been proposed in recent years, no categories to organize them have been introduced yet, making it hard to follow the progress of the literature. To close this gap, this article systematically reviews over sixty DL methods for privacy-preserving NLP published between 2016 and 2020, covering theoretical foundations, privacy-enhancing technologies, and analysis of their suitability for real-world scenarios. First, we introduce a novel taxonomy for classifying the existing methods into three categories: data safeguarding methods, trusted methods, and verification methods. Second, we present an extensive summary of privacy threats, datasets for applications, and metrics for privacy evaluation. Third, throughout the review, we describe privacy issues in the NLP pipeline in a holistic view. Further, we discuss open challenges in privacy-preserving NLP regarding data traceability, computation overhead, dataset size, the prevalence of human biases in embeddings, and the privacy-utility tradeoff. Finally, this review presents future research directions to guide successive research and development of privacy-preserving NLP models.
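To give one concrete flavour of the "data safeguarding" category, the sketch below implements a metric differential privacy style word replacement, perturbing each word embedding with calibrated noise and mapping it back to the nearest vocabulary item; the noise mechanism, toy vocabulary and embedding matrix are illustrative assumptions rather than any specific surveyed method.

```python
import numpy as np

def privatize(tokens, words, emb, epsilon=10.0, rng=None):
    """Replace each token by the vocabulary item nearest to its noised
    embedding. words: list of vocabulary items; emb: (len(words), d) matrix."""
    rng = rng or np.random.default_rng(0)
    index = {w: i for i, w in enumerate(words)}
    out = []
    for tok in tokens:
        v = emb[index[tok]]
        direction = rng.normal(size=v.shape)
        direction /= np.linalg.norm(direction)       # uniform random direction
        magnitude = rng.gamma(shape=v.shape[0], scale=1.0 / epsilon)
        noisy = v + magnitude * direction            # planar-Laplace-style noise
        nearest = int(np.linalg.norm(emb - noisy, axis=1).argmin())
        out.append(words[nearest])
    return out
```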