322 research outputs found

Transductive data-selection algorithms for fine-tuning neural machine translation

Author: Maillette de Buy Wenniger Gideon
Poncelas Alberto
Way Andy
Publication venue: ACL Anthology
Publication date: 20/08/2019

Machine Translation models are trained to translate a variety of documents from one language into another. However, models trained specifically for the particular characteristics of a set of documents tend to perform better. Fine-tuning is a technique for adapting an NMT model to some domain. In this work, we want to use this technique to adapt the model to a given test set. In particular, we use transductive data selection algorithms, which take advantage of the information in the test set to retrieve sentences from a larger parallel set

arXiv.org e-Print Archive

Irish Universities

DCU Online Research Access Service

Combining SMT and NMT back-translated data for efficient NMT

Author: Maillette de Buy Wenniger Gideon
Poncelas Alberto
Popović Maja
Shterionov Dimitar
Way Andy
Publication venue: Assoc. for Computational Linguistics Bulgaria
Publication date: 01/09/2019

Neural Machine Translation (NMT) models achieve their best performance when large sets of parallel data are used for training. Consequently, techniques for augmenting the training set have become popular recently. One of these methods is back-translation (Sennrich et al., 2016), which consists of generating synthetic sentences by translating a set of monolingual, target-language sentences using a Machine Translation (MT) model. Generally, NMT models are used for back-translation. In this work, we analyze the performance of models when the training data is extended with synthetic data using different MT approaches. In particular, we investigate back-translated data generated not only by NMT but also by Statistical Machine Translation (SMT) models and combinations of both. The results reveal that the models achieve the best performance when the training set is augmented with back-translated data created by merging different MT approaches

arXiv.org e-Print Archive

Irish Universities

DCU Online Research Access Service
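
The augmentation described in the abstract above (back-translating monolingual target-language sentences with more than one MT system and adding the result to the authentic data) can be illustrated with a minimal Python sketch. The function names and the toy "systems" are hypothetical stand-ins, and plain concatenation is only one plausible reading of "merging"; the abstract does not specify the procedure.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def back_translate(mono_target: Iterable[str],
                   target_to_source: Callable[[str], str]) -> List[Pair]:
    """Pair each authentic target sentence with a machine-generated source."""
    return [(target_to_source(sentence), sentence) for sentence in mono_target]


def build_training_set(authentic: Iterable[Pair],
                       mono_target: Iterable[str],
                       systems: Iterable[Callable[[str], str]]) -> List[Pair]:
    """Concatenate authentic pairs with synthetic pairs from every MT system.

    Assumption: "merging" is modelled as simple concatenation.
    """
    mono_target = list(mono_target)  # allow several passes over the sentences
    data = list(authentic)
    for system in systems:
        data.extend(back_translate(mono_target, system))
    return data


# Hypothetical placeholders for real target-to-source NMT and SMT models.
def toy_nmt(sentence: str) -> str:
    return "NMT:" + sentence


def toy_smt(sentence: str) -> str:
    return "SMT:" + sentence
```

Calling `build_training_set(authentic, mono_target, [toy_nmt, toy_smt])` yields the authentic pairs followed by one synthetic pair per monolingual sentence and per system; in each synthetic pair the target side is authentic and only the source side is machine-generated.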

Do Masculinity and Perceived Condom Barriers Predict Heterosexual HIV Risk Behaviors Among Black Substance Abusing Men?

Author: Burlew A, Kathleen
Hatch-Maillette Mary
Johnson Candace
Montgomery LaTrice
Peteet Bridgette
Wilson Jerika
Publication venue: Digital Scholarship@UNLV
Publication date: 23/01/2015

Although HIV prevention during substance abuse treatment is ideal, existing HIV risk-reduction interventions are less effective among Black and other ethnic minority substance abusers. The Sexual Health Model (SHM) and the Person, Extended Family and Neighborhood-3 model (PEN-3) both highlight the importance of increasing our understanding of the relationship of sociocultural factors to sexual decision-making as a step towards developing more HIV prevention interventions for ethnic minorities. However, few studies examine sociocultural factors in the sexual decision-making process of Black substance abusing men. This secondary analysis of data collected in an evaluation of Real Men Are Safe (REMAS), an HIV prevention intervention, in the National Drug Abuse Treatment Clinical Trials Network (CTN) addressed this gap by examining the relation of two specific sociocultural factors (i.e., masculinity and perceived barriers to condom use) to the self-reported sexual behaviors of Black substance abusing men with their main and casual female partners. Analyses of the baseline data of 126 Black men entering substance abuse treatment revealed that the endorsement of both personal and social masculinity predicted more unprotected sexual occasions (USO) with casual partners. The perception that condoms decreased sexual pleasure also predicted higher USO rates with casual partners. However, fewer partner barriers was not associated with USO among casual partners as expected. Neither the endorsement of social or personal masculinity nor perceived condom barriers predicted USO with main partners. The findings suggest that interventions that depict condom use as both pleasurable and congruent with Black male perceptions of masculinity may be more effective with Black substance abusing men

University of Nevada, Las Vegas Repository

Agglomération et hétéroagglomération des nanoparticules d'argent en eaux douces (Agglomeration and heteroagglomeration of silver nanoparticles in freshwaters)

Author: Maillette Sébastien
Publication date: 01/04/2015

Nanomaterials are a class of contaminants that are increasingly found in the natural environment. Their environmental risk will depend on their persistence, mobility, toxicity and bioaccumulation. Each of these parameters will depend strongly upon their physicochemical fate (dissolution, agglomeration) in natural waters. The goal of this paper is to understand the agglomeration and heteroagglomeration of silver nanoparticles in the environment. Two different silver nanoparticles (nAg; citrate coated and polyacrylic acid coated) with a diameter of 5 nm were covalently labelled with a fluorescent dye and then mixed with colloidal silicon oxides (SiO2) and clays (montmorillonite). The homo- and heteroagglomeration of the silver nanoparticles were then studied in waters that were representative of natural freshwaters (pH 7.0; ionic strength 10^-7 to 10^-1 M of Ca2+). Sizes were followed by fluorescence correlation spectroscopy (FCS) and results were validated using enhanced darkfield microscopy with hyperspectral imaging (HSI). Results have demonstrated that the polyacrylic acid coated nAg was extremely stable under all conditions, including in the presence of other colloids and at high ionic strength, whereas the citrate coated nAg formed heteroaggregates in the presence of both natural colloidal particles

Dépôt Institutionnel Numérique

Data selection with feature decay algorithms using an approximated target side

Author: Maillette de Buy Wenniger Gideon
Poncelas Alberto
Way Andy
Publication venue: IWSLT
Publication date: 01/01/2018

Data selection techniques applied to neural machine translation (NMT) aim to increase the performance of a model by retrieving a subset of sentences for use as training data. One possible family of data selection techniques is transductive learning methods, which select the data based on the test set, i.e. the document to be translated. A limitation of these methods to date is that using the source-side test set does not by itself guarantee that sentences are selected with correct translations, or translations that are suitable given the test-set domain. Some corpora, such as subtitle corpora, may contain parallel sentences with inaccurate translations caused by localization or length restrictions. In order to try to fix this problem, in this paper we propose to use an approximated target-side in addition to the source-side when selecting suitable sentence-pairs for training a model. This approximated target-side is built by pre-translating the source-side. In this work, we explore the performance of this general idea for one specific data selection approach called Feature Decay Algorithms (FDA). We train German-English NMT models on data selected by using the test set (source), the approximated target side, and a mixture of both. Our findings reveal that models built using a combination of outputs of FDA (using the test set and an approximated target side) perform better than those solely using the test set. We obtain a statistically significant improvement of more than 1.5 BLEU points over a model trained with all data, and more than 0.5 BLEU points over a strong FDA baseline that uses source-side information only

arXiv.org e-Print Archive

Irish Universities

DCU Online Research Access Service

Data selection with feature decay algorithms using an approximated target side

Author: Maillette de Buy Wenniger Gideon
Poncelas Alberto
Way Andy
Publication date: 30/10/2018

Data selection techniques applied to neural machine translation (NMT) aim to increase the performance of a model by retrieving a subset of sentences for use as training data. One possible family of data selection techniques is transductive learning methods, which select the data based on the test set, i.e. the document to be translated. A limitation of these methods to date is that using the source-side test set does not by itself guarantee that sentences are selected with correct translations, or translations that are suitable given the test-set domain. Some corpora, such as subtitle corpora, may contain parallel sentences with inaccurate translations caused by localization or length restrictions. In order to try to fix this problem, in this paper we propose to use an approximated target-side in addition to the source-side when selecting suitable sentence-pairs for training a model. This approximated target-side is built by pre-translating the source-side. In this work, we explore the performance of this general idea for one specific data selection approach called Feature Decay Algorithms (FDA). We train German-English NMT models on data selected by using the test set (source), the approximated target side, and a mixture of both. Our findings reveal that models built using a combination of outputs of FDA (using the test set and an approximated target side) perform better than those solely using the test set. We obtain a statistically significant improvement of more than 1.5 BLEU points over a model trained with all data, and more than 0.5 BLEU points over a strong FDA baseline that uses source-side information only

DCU Online Research Access Service
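
The "mixture of both" mentioned in the abstract above combines two rankings of the same candidate sentence pairs: one obtained with the source-side test set and one obtained with the approximated (pre-translated) target side. The abstract does not say how the two outputs are combined, so the Python sketch below assumes a simple round-robin interleaving with duplicate removal; the function name `mix_selections` and the use of integer sentence-pair ids are illustrative assumptions.

```python
from itertools import zip_longest
from typing import List, Sequence


def mix_selections(by_source: Sequence[int],
                   by_target: Sequence[int],
                   k: int) -> List[int]:
    """Interleave two ranked lists of sentence-pair ids, keeping at most k.

    Assumption: the lists are ordered best-first and hold integer ids
    that refer to the same parallel corpus.
    """
    mixed: List[int] = []
    seen = set()
    for from_source, from_target in zip_longest(by_source, by_target):
        for pair_id in (from_source, from_target):
            if len(mixed) >= k:
                return mixed
            # zip_longest pads the shorter list with None; skip the padding
            # and any pair that the other ranking already contributed.
            if pair_id is not None and pair_id not in seen:
                seen.add(pair_id)
                mixed.append(pair_id)
    return mixed
```

With this scheme a pair ranked highly by either side enters the training subset early, and a pair chosen by both sides is counted only once.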

Elastic-substitution decoding for hierarchical SMT: efficiency, richer search and double labels

Author: Maillette de Buy Wenniger Gideon
Sima'an Khalil
Way Andy
Publication date: 18/09/2017

Elastic-substitution decoding (ESD), first introduced by Chiang (2010), can be important for obtaining good results when applying labels to enrich hierarchical statistical machine translation (SMT). However, an efficient implementation is essential for scalable application. We describe how to achieve this, contributing essential details that were missing in the original exposition. We compare ESD to strict matching and show its superiority for both reordering and syntactic labels. To overcome the sub-optimal performance due to the late evaluation of features marking label substitution types, we increase the diversity of the rules explored during cube pruning initialization with respect to their labels. This approach gives significant improvements over basic ESD and performs favorably compared to extending the search by increasing the cube pruning pop-limit. Finally, we look at combining multiple labels. The combination of reordering labels and target-side boundary-tags yields a significant improvement in terms of the word-order sensitive metrics Kendall reordering score and METEOR. This confirms our intuition that the combination of reordering labels and syntactic labels can yield improvements over either label by itself, despite increased sparsity

Irish Universities

DCU Online Research Access Service

Feature decay algorithms for neural machine translation

Author: Maillette de Buy Wenniger Gideon
Poncelas Alberto
Way Andy
Publication date: 01/01/2018

Neural Machine Translation (NMT) systems require a lot of data to be competitive. For this reason, data selection techniques are used only for fine-tuning systems that have been trained with larger amounts of data. In this work we aim to use Feature Decay Algorithms (FDA) data selection techniques not only to fine-tune a system but also to build a complete system with less data. Our findings reveal that it is possible to find a subset of sentence pairs that, when used for training a German-English NMT system, outperforms the full training corpus by 1.11 BLEU points

Repositorio Institucional de la Universidad de Alicante

Irish Universities

DCU Online Research Access Service
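
The general idea behind feature-decay selection, as used in the record above, is to score candidate sentences by the test-set n-grams they contain and to lower the value of an n-gram each time a selected sentence already covers it. The Python sketch below is a simplified greedy illustration, not the authors' implementation: the exponential decay factor of 0.5, the normalisation by sentence length, whitespace tokenisation, and the names `ngrams` and `fda_select` are all assumptions.

```python
from collections import Counter


def ngrams(tokens, max_n=3):
    """Return all n-grams of the token list for n = 1..max_n, as tuples."""
    return [tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def fda_select(test_sentences, pool, k, max_n=3, decay=0.5):
    """Greedily pick up to k pool sentences that cover test-set n-grams."""
    # Features are the n-grams observed in the test set.
    features = set()
    for sentence in test_sentences:
        features.update(ngrams(sentence.split(), max_n))

    # For each candidate keep its text, its length and its test-set features.
    candidates = []
    for sentence in pool:
        tokens = sentence.split()
        feats = set(ngrams(tokens, max_n)) & features
        candidates.append((sentence, len(tokens), feats))

    times_covered = Counter()  # how often each feature was already selected

    def score(candidate):
        _, length, feats = candidate
        if length == 0:
            return 0.0
        # A feature is worth decay**count, so covered n-grams matter less.
        return sum(decay ** times_covered[f] for f in feats) / length

    selected = []
    while candidates and len(selected) < k:
        best = max(candidates, key=score)
        candidates.remove(best)
        selected.append(best[0])
        times_covered.update(best[2])  # decay the features just covered
    return selected
```

Because covered n-grams lose value, a duplicate of an already selected sentence scores lower than a sentence that covers a different part of the test set, which pushes the selection towards broad coverage.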

Adaptation of machine translation models with back-translated data using transductive data selection methods

Author: Maillette de Buy Wenniger Gideon
Poncelas Alberto
Way Andy
Publication venue: Springer Fachmedien Wiesbaden GmbH
Publication date: 01/01/2019

Data selection has proven its merit for improving Neural Machine Translation (NMT) when applied to authentic data. But the benefit of using synthetic data in NMT training, produced by the popular back-translation technique, raises the question of whether data selection could also be useful for synthetic data. In this work we use Infrequent n-gram Recovery (INR) and Feature Decay Algorithms (FDA), two transductive data selection methods, to obtain subsets of sentences from synthetic data. These methods ensure that selected sentences share n-grams with the test set so the NMT model can be adapted to translate it. Performing data selection on back-translated data creates new challenges, as the source side may contain noise originating from the model used in the back-translation. Hence, finding n-grams present in the test set becomes more difficult. Despite that, in our work we show that adapting a model with a selection of synthetic data is a useful approach

arXiv.org e-Print Archive

Irish Universities

DCU Online Research Access Service
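
The abstract above names Infrequent n-gram Recovery (INR), which selects sentences containing test-set n-grams that are still rare in the data chosen so far. The Python sketch below illustrates that idea in simplified form and is not the authors' implementation: it assumes the selection starts from empty counts, treats an n-gram as infrequent while it has been seen fewer than `threshold` times, and uses whitespace tokenisation; the names `ngrams` and `inr_select` are hypothetical.

```python
from collections import Counter


def ngrams(tokens, max_n=2):
    """Return all n-grams of the token list for n = 1..max_n, as tuples."""
    return [tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def inr_select(test_sentences, pool, threshold=1, max_n=2):
    """Select pool sentences while they still supply infrequent test n-grams."""
    wanted = set()
    for sentence in test_sentences:
        wanted.update(ngrams(sentence.split(), max_n))

    # Keep, for every candidate, the test-set n-grams it contains.
    candidates = [(s, set(ngrams(s.split(), max_n)) & wanted) for s in pool]
    seen = Counter()  # occurrences of each wanted n-gram in the selection

    def score(candidate):
        # An n-gram contributes only while it is seen fewer than
        # `threshold` times in the data selected so far.
        return sum(max(0, threshold - seen[g]) for g in candidate[1])

    selected = []
    while candidates:
        best = max(candidates, key=score)
        if score(best) == 0:
            break  # no remaining sentence adds an infrequent n-gram
        candidates.remove(best)
        selected.append(best[0])
        seen.update(best[1])
    return selected
```

Raising `threshold` asks for more occurrences of each test-set n-gram and therefore selects more sentences. With noisy back-translated source sides, fewer candidates may share n-grams with the test set, which is the difficulty the abstract points out.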