26 research outputs found
A Comparison between NMT and PBSMT Performance for Translating Noisy User-Generated Content
This work compares the performance achieved by Phrase-Based Statistical Machine Translation (PBSMT) systems and attention-based Neural Machine Translation (NMT) systems when translating User-Generated Content (UGC), as encountered in social media, from French to English. We show that, contrary to what could be expected, PBSMT outperforms NMT when translating non-canonical inputs. Our error analysis uncovers the specificities of UGC that are problematic for sequential NMT architectures and suggests new avenues for improving NMT models.
From feature to paradigm: deep learning in machine translation
In recent years, deep learning algorithms have revolutionized several areas, including speech, image, and natural language processing. The specific field of Machine Translation (MT) has not remained unaffected. Integration of deep learning in MT ranges from re-modeling existing features within standard statistical systems to the development of entirely new architectures. Among the different neural networks, research works use feed-forward neural networks, recurrent neural networks, and the encoder-decoder schema. These architectures are able to tackle challenges such as low-resource settings or morphological variation. This manuscript focuses on describing how these neural networks have been integrated to enhance different aspects and models of statistical MT, including language modeling, word alignment, translation, reordering, and rescoring. Then, we report on the new neural MT approach, together with a description of the foundational related works and recent approaches using subwords, characters, and multilingual training, among others. Finally, we include an analysis of the corresponding challenges and future work in using deep learning in MT.
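As a toy illustration of the rescoring integration described above, the sketch below re-ranks an n-best list of translation hypotheses by combining a translation-model score with a language-model score in a log-linear fashion. The weights, scores, and the stand-in language model are invented for the example; a real system would use a trained neural LM.

```python
# Log-linear rescoring of an n-best list: combine a translation-model
# score with a language-model score and re-rank the hypotheses.
# All scores and weights here are illustrative, not from a real system.

def rescore_nbest(nbest, lm_score, tm_weight=1.0, lm_weight=0.5):
    """nbest: list of (hypothesis, tm_log_prob) pairs.
    lm_score: function mapping a hypothesis string to a log-probability."""
    rescored = [
        (hyp, tm_weight * tm + lm_weight * lm_score(hyp))
        for hyp, tm in nbest
    ]
    # Higher combined log-probability is better.
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)

# Toy language model: penalizes length (stand-in for a neural LM).
toy_lm = lambda hyp: -0.1 * len(hyp.split())

nbest = [("the house is big", -2.0), ("the house is very big", -1.9)]
best_hyp, best_score = rescore_nbest(nbest, toy_lm)[0]
```

Note how the language-model feature can flip the ranking produced by the translation model alone, which is exactly the effect rescoring exploits.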
Controlling Styles in Neural Machine Translation with Activation Prompt
Controlling styles in neural machine translation (NMT) has attracted wide
attention, as it is crucial for enhancing user experience. Earlier studies on
this topic typically concentrate on regulating the level of formality and
achieve some progress in this area. However, they still encounter two major
challenges. The first is the difficulty in style evaluation. The style
comprises various aspects such as lexis, syntax, and others that provide
abundant information. Nevertheless, only formality has been thoroughly
investigated. The second challenge involves excessive dependence on incremental
adjustments, particularly when new styles are necessary. To address both
challenges, this paper presents a new benchmark and approach. A multiway
stylized machine translation (MSMT) benchmark is introduced, incorporating
diverse categories of styles across four linguistic domains. Then, we propose a
method named style activation prompt (StyleAP) that retrieves prompts from
a stylized monolingual corpus and requires no extra fine-tuning.
Experiments show that StyleAP effectively controls the style of the
translation and achieves remarkable performance.
Comment: Accepted by Findings of ACL 2023; the code is available at
https://github.com/IvanWang0730/StyleA
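The retrieval step behind StyleAP can be sketched roughly as follows: given an input sentence, fetch the most similar sentence from a stylized monolingual corpus and prepend it as a prompt to steer the output style. The token-overlap similarity and the prompt format below are simplified assumptions for illustration, not the paper's actual retriever or template.

```python
# Minimal sketch of retrieval-based style prompting: pick the stylized
# sentence with the highest token overlap and prepend it as a prompt.
# Token-overlap similarity is a simplification of the paper's retriever.

def token_overlap(a, b):
    """Jaccard similarity over lowercased whitespace tokens."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def build_style_prompt(source, stylized_corpus):
    """Retrieve the closest stylized sentence and use it as a prompt."""
    best = max(stylized_corpus, key=lambda s: token_overlap(source, s))
    # The retrieved example steers generation toward the target style;
    # no fine-tuning of the translation model is required.
    return f"{best} => {source}"

corpus = ["Thou art most welcome here.", "Pray, what news dost thou bring?"]
prompt = build_style_prompt("What news do you bring?", corpus)
```

The appeal of this design is that adding a new style only requires a monolingual corpus in that style, not incremental model adjustments.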
Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)
Peer reviewed
The Roles of Language Models and Hierarchical Models in Neural Sequence-to-Sequence Prediction
With the advent of deep learning, research in many areas of machine learning is converging towards the same set of methods and models. For example, long short-term memory networks are not only popular for various tasks in natural language processing (NLP) such as speech recognition, machine translation, handwriting recognition, and syntactic parsing, but they are also applicable to seemingly unrelated fields such as robot control, time series prediction, and bioinformatics. Recent advances in contextual word embeddings such as BERT boast state-of-the-art results on 11 NLP tasks with the same model. Before deep learning, a speech recognizer and a syntactic parser had little in common, as systems were much more tailored towards the task at hand.
At the core of this development is the tendency to view each task as yet another data mapping problem, neglecting the particular characteristics and (soft) requirements tasks often have in practice. This often goes along with a sharp break of deep learning methods with previous research in the specific area. This work can be understood as an antithesis to this paradigm. We show how traditional symbolic statistical machine translation models can still improve neural machine translation (NMT) while reducing the risk of common pathologies of NMT such as hallucinations and neologisms. Other external symbolic models such as spell checkers and morphology databases help neural grammatical error correction. We also focus on language models that often do not play a role in vanilla end-to-end approaches and apply them in different ways to word reordering, grammatical error correction, low-resource NMT, and document-level NMT. Finally, we demonstrate the benefit of hierarchical models in sequence-to-sequence prediction. Hand-engineered covering grammars are effective in preventing catastrophic errors in neural text normalization systems. Our operation sequence model for interpretable NMT represents translation as a series of actions that modify the translation state, and can also be seen as a derivation in a formal grammar.
EPSRC grant EP/L027623/1
EPSRC Tier-2 capital grant EP/P020259/
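As a toy illustration of applying a language model to word reordering, one of the tasks the thesis mentions, the sketch below scores every permutation of a small token bag with a bigram model and keeps the highest-scoring order. The bigram scores and the exhaustive search are invented simplifications, not the thesis's actual method.

```python
# Toy word reordering with a bigram language model: score all
# permutations of a small token bag and return the best order.
# The bigram scores below are invented for illustration.
from itertools import permutations

BIGRAM = {
    ("<s>", "the"): 0.0, ("the", "cat"): 0.0, ("cat", "sleeps"): 0.0,
}
PENALTY = -10.0  # log-score for unseen bigrams

def score(order):
    tokens = ("<s>",) + tuple(order)
    return sum(BIGRAM.get(pair, PENALTY)
               for pair in zip(tokens, tokens[1:]))

def reorder(bag):
    """Exhaustive search; only feasible for very small bags."""
    return max(permutations(bag), key=score)

best = reorder(["cat", "the", "sleeps"])
```

Real reordering systems replace the exhaustive search with beam search or permutation lattices, since the number of orderings grows factorially.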
IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages
India has a rich linguistic landscape, with languages from 4 major language
families spoken by over a billion people. The 22 languages listed in the
Constitution of India (referred to as scheduled languages) are the focus of
this work. Given the linguistic diversity, high-quality and accessible Machine
Translation (MT) systems are essential in a country like India. Prior to this
work, there was (i) no parallel training data spanning all the 22 languages,
(ii) no robust benchmarks covering all these languages and containing content
relevant to India, and (iii) no existing translation models which support all
the 22 scheduled languages of India. In this work, we aim to address this gap
by focusing on the missing pieces required for enabling wide, easy, and open
access to good machine translation systems for all 22 scheduled Indian
languages. We identify four key areas of improvement: curating and creating
larger training datasets, creating diverse and high-quality benchmarks,
training multilingual models, and releasing models with open access. Our first
contribution is the release of the Bharat Parallel Corpus Collection (BPCC),
the largest publicly available parallel corpora for Indic languages. BPCC
contains a total of 230M bitext pairs, of which 126M were newly
added, including 644K manually translated sentence pairs created as part of
this work. Our second contribution is the release of the first n-way parallel
benchmark covering all 22 Indian languages, featuring diverse domains,
Indian-origin content, and source-original test sets. Next, we present
IndicTrans2, the first model to support all 22 languages, surpassing existing
models on multiple existing and new benchmarks created as a part of this work.
Lastly, to promote accessibility and collaboration, we release our models and
associated data with permissive licenses at
https://github.com/ai4bharat/IndicTrans2
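Large parallel corpora such as BPCC are typically deduplicated as part of curation before training. The sketch below shows one generic way to deduplicate bitext pairs after light normalization; it is an illustration of the curation step, not BPCC's actual pipeline.

```python
# Generic bitext deduplication: normalize whitespace and case, then keep
# the first occurrence of each (source, target) pair. This illustrates a
# common corpus-curation step; it is not BPCC's actual pipeline.

def normalize(text):
    """Collapse whitespace and lowercase for duplicate detection."""
    return " ".join(text.split()).lower()

def dedup_bitext(pairs):
    seen, kept = set(), []
    for src, tgt in pairs:
        key = (normalize(src), normalize(tgt))
        if key not in seen:
            seen.add(key)
            kept.append((src, tgt))  # keep the original, unnormalized pair
    return kept

pairs = [
    ("Hello world", "नमस्ते दुनिया"),
    ("hello  world", "नमस्ते दुनिया"),  # duplicate after normalization
    ("Good morning", "सुप्रभात"),
]
clean = dedup_bitext(pairs)
```

Production pipelines add further filters (language identification, length-ratio checks, near-duplicate hashing), but exact deduplication of this kind is usually the first pass.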