322 research outputs found

    Transductive data-selection algorithms for fine-tuning neural machine translation

    Machine Translation models are trained to translate a variety of documents from one language into another. However, models trained specifically for documents with particular characteristics tend to perform better. Fine-tuning is a technique for adapting an NMT model to some domain. In this work, we want to use this technique to adapt the model to a given test set. In particular, we use transductive data selection algorithms, which take advantage of the information in the test set to retrieve sentences from a larger parallel set.

    Combining SMT and NMT back-translated data for efficient NMT

    Neural Machine Translation (NMT) models achieve their best performance when large sets of parallel data are used for training. Consequently, techniques for augmenting the training set have become popular recently. One of these methods is back-translation (Sennrich et al., 2016), which consists of generating synthetic sentences by translating a set of monolingual, target-language sentences using a Machine Translation (MT) model. Generally, NMT models are used for back-translation. In this work, we analyze the performance of models when the training data is extended with synthetic data using different MT approaches. In particular, we investigate back-translated data generated not only by NMT but also by Statistical Machine Translation (SMT) models and combinations of both. The results reveal that the models achieve the best performance when the training set is augmented with back-translated data created by merging different MT approaches.

    Do Masculinity and Perceived Condom Barriers Predict Heterosexual HIV Risk Behaviors Among Black Substance Abusing Men?

    Although HIV prevention during substance abuse treatment is ideal, existing HIV risk-reduction interventions are less effective among Black and other ethnic minority substance abusers. The Sexual Health Model (SHM) and the Person, Extended Family and Neighborhood-3 model (PEN-3) both highlight the importance of increasing our understanding of the relationship of sociocultural factors to sexual decision-making as a step towards developing more HIV prevention interventions for ethnic minorities. However, few studies examine sociocultural factors in the sexual decision-making process of Black substance abusing men. This secondary analysis of data collected in an evaluation of Real Men Are Safe (REMAS), an HIV prevention intervention, in the National Drug Abuse Treatment Clinical Trials Network (CTN) addressed this gap by examining the relation of two specific sociocultural factors (i.e., masculinity and perceived barriers to condom use) to the self-reported sexual behaviors of Black substance abusing men with their main and casual female partners. Analyses of the baseline data of 126 Black men entering substance abuse treatment revealed that the endorsement of both personal and social masculinity predicted more unprotected sexual occasions (USO) with casual partners. The perception that condoms decreased sexual pleasure also predicted higher USO rates with casual partners. However, contrary to expectations, fewer partner barriers were not associated with USO with casual partners. Neither the endorsement of social or personal masculinity nor perceived condom barriers predicted USO with main partners. The findings suggest that interventions that depict condom use as both pleasurable and congruent with Black male perceptions of masculinity may be more effective with Black substance abusing men.

    Agglomération et hétéroagglomération des nanoparticules d'argent en eaux douces

    Nanomaterials are a class of contaminants that are increasingly found in the natural environment. Their environmental risk will depend on their persistence, mobility, toxicity and bioaccumulation. Each of these parameters will depend strongly upon their physicochemical fate (dissolution, agglomeration) in natural waters. The goal of this study is to understand the agglomeration and heteroagglomeration of silver nanoparticles in the environment. Two different silver nanoparticles (nAg; citrate coated and polyacrylic acid coated) with a diameter of 5 nm were covalently labelled with a fluorescent dye and then mixed with colloidal silicon oxides (SiO2) or clays (montmorillonite). The homo- and heteroagglomeration of the silver nanoparticles were then studied in waters that were representative of natural freshwaters (pH 7.0; ionic strength 10^-7 to 10^-1 M of Ca2+). Sizes were followed by fluorescence correlation spectroscopy (FCS) and results were validated using enhanced darkfield microscopy with hyperspectral imaging (HSI). The results demonstrated that the polyacrylic acid coated nAg was extremely stable under all conditions, including in the presence of other colloids and at high ionic strength, whereas the citrate coated nAg formed heteroaggregates in the presence of both colloidal particles.

    Data selection with feature decay algorithms using an approximated target side

    Data selection techniques applied to neural machine translation (NMT) aim to increase the performance of a model by retrieving a subset of sentences for use as training data. Among the possible data selection techniques are transductive learning methods, which select the data based on the test set, i.e. the document to be translated. A limitation of these methods to date is that using the source-side test set does not by itself guarantee that sentences are selected with correct translations, or translations that are suitable given the test-set domain. Some corpora, such as subtitle corpora, may contain parallel sentences with inaccurate translations caused by localization or length restrictions. To address this problem, in this paper we propose to use an approximated target side in addition to the source side when selecting suitable sentence pairs for training a model. This approximated target side is built by pre-translating the source side. In this work, we explore the performance of this general idea for one specific data selection approach called Feature Decay Algorithms (FDA). We train German-English NMT models on data selected by using the test set (source), the approximated target side, and a mixture of both. Our findings reveal that models built using a combination of outputs of FDA (using the test set and an approximated target side) perform better than those solely using the test set. We obtain a statistically significant improvement of more than 1.5 BLEU points over a model trained with all data, and more than 0.5 BLEU points over a strong FDA baseline that uses source-side information only.

    Elastic-substitution decoding for hierarchical SMT: efficiency, richer search and double labels

    Elastic-substitution decoding (ESD), first introduced by Chiang (2010), can be important for obtaining good results when applying labels to enrich hierarchical statistical machine translation (SMT). However, an efficient implementation is essential for scalable application. We describe how to achieve this, contributing essential details that were missing in the original exposition. We compare ESD to strict matching and show its superiority for both reordering and syntactic labels. To overcome the sub-optimal performance due to the late evaluation of features marking label substitution types, we increase the diversity of the rules explored during cube pruning initialization with respect to their labels. This approach gives significant improvements over basic ESD and performs favorably compared to extending the search by increasing the cube pruning pop-limit. Finally, we look at combining multiple labels. The combination of reordering labels and target-side boundary-tags yields a significant improvement in terms of the word-order sensitive metrics Kendall reordering score and METEOR. This confirms our intuition that the combination of reordering labels and syntactic labels can yield improvements over either label by itself, despite increased sparsity.

    Feature decay algorithms for neural machine translation

    Neural Machine Translation (NMT) systems require a lot of data to be competitive. For this reason, data selection techniques are used only for fine-tuning systems that have been trained with larger amounts of data. In this work we aim to use Feature Decay Algorithms (FDA) data selection techniques not only to fine-tune a system but also to build a complete system with less data. Our findings reveal that it is possible to find a subset of sentence pairs that, when used for training a German-English NMT system, outperforms the full training corpus by 1.11 BLEU points.

    Adaptation of machine translation models with back-translated data using transductive data selection methods

    Data selection has proven its merit for improving Neural Machine Translation (NMT) when applied to authentic data. But the benefit of using synthetic data in NMT training, produced by the popular back-translation technique, raises the question of whether data selection could also be useful for synthetic data. In this work we use Infrequent n-gram Recovery (INR) and Feature Decay Algorithms (FDA), two transductive data selection methods, to obtain subsets of sentences from synthetic data. These methods ensure that selected sentences share n-grams with the test set so the NMT model can be adapted to translate it. Performing data selection on back-translated data creates new challenges, as the source side may contain noise originating from the model used in the back-translation. Hence, finding n-grams present in the test set becomes more difficult. Despite that, in our work we show that adapting a model with a selection of synthetic data is a useful approach.
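
    The idea that selected sentences should share n-grams with the test set can be illustrated with a small sketch. This is a simplified, feature-decay-style greedy selection written for illustration only, not the implementation used in any of the papers listed here. The function names (`ngrams`, `select_sentences`), the decay factor of 0.5, the bigram limit and the length normalisation are all assumptions.

```python
from collections import Counter


def ngrams(tokens, max_n=2):
    """Return the set of all n-grams (as tuples) up to length max_n."""
    return {
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def select_sentences(test_sentences, pool, k, decay=0.5, max_n=2):
    """Greedily pick up to k pool sentences that share n-grams with the test set.

    Assumed scoring (illustrative): each test-set n-gram starts with weight 1
    and its weight is multiplied by `decay` every time a selected sentence
    covers it, so later picks are pushed towards n-grams not yet covered.
    """
    # Features are the n-grams observed in the test set (whitespace tokens).
    features = set()
    for sentence in test_sentences:
        features |= ngrams(sentence.split(), max_n)

    # For each candidate, keep only the n-grams that also occur in the test set.
    pool_features = [ngrams(s.split(), max_n) & features for s in pool]
    covered = Counter()  # how often each feature has been selected so far
    remaining = list(range(len(pool)))
    selected = []

    def score(i):
        length = max(len(pool[i].split()), 1)
        return sum(decay ** covered[f] for f in pool_features[i]) / length

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        if score(best) == 0:
            break  # nothing left shares an n-gram with the test set
        selected.append(pool[best])
        remaining.remove(best)
        for f in pool_features[best]:
            covered[f] += 1
    return selected


if __name__ == "__main__":
    test_set = ["the cat sat", "a dog ran"]
    candidates = [
        "the cat sat down",
        "stock prices fell",
        "the cat sat here",
        "a dog ran away",
    ]
    print(select_sentences(test_set, candidates, k=2))
```

    In this toy run the second pick goes to the sentence covering the not-yet-seen "a dog ran" n-grams rather than to a second "the cat sat" sentence, which is the diversity effect the decay is meant to produce. Noisy source-side text, as in back-translated data, would reduce the n-gram overlap and hence the scores.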