5 research outputs found

    A Systematic Study of Inner-Attention-Based Sentence Representations in Multilingual Neural Machine Translation

    Neural machine translation has considerably improved the quality of automatic translations by learning good representations of input sentences. In this article, we explore a multilingual translation model capable of producing fixed-size sentence representations by incorporating an intermediate cross-lingual shared layer, which we refer to as the attention bridge. This layer exploits the semantics of each language and develops into a language-agnostic meaning representation that can be efficiently used for transfer learning. We systematically study the impact of the size of the attention bridge and the effect of including additional languages in the model. In contrast to related previous work, we demonstrate that there is no conflict between translation performance and the use of sentence representations in downstream tasks. In particular, we show that larger intermediate layers not only improve translation quality, especially for long sentences, but also increase the accuracy of trainable classification tasks. Conversely, shorter representations yield greater compression, which is beneficial in non-trainable similarity tasks. Similarly, we show that trainable downstream tasks benefit from multilingual models, whereas additional language signals do not improve performance in non-trainable benchmarks. This is an important insight that helps to properly design models for specific applications. Finally, we include an in-depth analysis of the proposed attention bridge and its ability to encode linguistic properties. We carefully analyze the information captured by individual attention heads and identify interesting patterns that explain the performance of specific settings in linguistic probing tasks.
    Peer reviewed
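    The attention bridge described above is a self-attentive pooling layer: a fixed number of attention heads attends over a variable-length sequence of encoder states, yielding a sentence matrix of constant size regardless of sentence length. A minimal NumPy sketch of that idea, with hypothetical dimensions and randomly initialized weights (the paper's actual parameterization and training setup are not reproduced here):

    ```python
    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def attention_bridge(H, W1, W2):
        """Pool variable-length encoder states H (n_tokens, d) into a
        fixed-size matrix (k_heads, d) via inner attention:
            A = softmax(W2 @ tanh(W1 @ H.T))   # (k, n) weights over tokens
            M = A @ H                          # (k, d) fixed-size output
        """
        A = softmax(W2 @ np.tanh(W1 @ H.T), axis=-1)
        return A @ H

    rng = np.random.default_rng(0)
    d, hidden, k = 8, 16, 4                    # illustrative sizes only
    W1 = rng.normal(size=(hidden, d))
    W2 = rng.normal(size=(k, hidden))

    # Sentences of different lengths map to representations of one shape.
    short = attention_bridge(rng.normal(size=(5, d)), W1, W2)
    long = attention_bridge(rng.normal(size=(50, d)), W1, W2)
    assert short.shape == long.shape == (k, d)
    ```

    The number of heads k is exactly the "size of the attention bridge" whose trade-off the abstract studies: larger k retains more of the sentence, smaller k compresses harder.
    
    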

    The OPUS-MT dashboard - A toolkit for a systematic evaluation of open machine translation models

    The OPUS-MT dashboard is a web-based platform that provides a comprehensive overview of open translation models. We focus on a systematic collection of benchmark results with verifiable translation performance and large coverage in terms of languages and domains. We provide results for in-house OPUS-MT and Tatoeba models as well as external models from the Huggingface repository and user-contributed translations. The functionalities of the evaluation tool include summaries of benchmarks for over 2,300 models covering 4,560 language directions and 294 languages, as well as the inspection of predicted translations against their human references. We focus on centralization, reproducibility, and coverage of MT evaluation combined with scalability. The dashboard can be accessed live at https://opus.nlpl.eu/dashboard/.
    Peer reviewed

    The {MUCOW} word sense disambiguation test suite at {WMT} 2020

    This paper reports on our participation with the MUCOW test suite at the WMT 2020 news translation task. We introduced MUCOW at WMT 2019 to measure the ability of MT systems to perform word sense disambiguation (WSD), i.e., to translate an ambiguous word with its correct sense. MUCOW is created automatically using existing resources, and the evaluation process is also entirely automated. We evaluate all participating systems of the language pairs English→Czech, English→German, and English→Russian and compare the results with those obtained at WMT 2019. While current NMT systems are fairly good at handling ambiguous source words, we could not identify any substantial progress in that area over the last year, at least to the extent that it is measurable by the MUCOW method.
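    The fully automated evaluation mentioned above amounts to checking, for each ambiguous source word, whether the system's output contains a target-language lexeme of the correct sense or of a wrong one. A sketch of that scoring logic under assumed data: the sense clusters, the example word "bank", and the function name `score_wsd` are all hypothetical, not the actual MUCOW resources or code.

    ```python
    # Hypothetical sense clusters for the ambiguous English word "bank"
    # when translating into German (real MUCOW clusters are mined from
    # parallel corpora and lexical resources).
    SENSE_LEXEMES = {
        "bank": {
            "finance": {"Bank", "Banken"},
            "shore": {"Ufer", "Ufern"},
        },
    }

    def score_wsd(source_word, gold_sense, hypothesis):
        """Label one translation hypothesis: 'pos' if it contains a lexeme
        of the gold sense, 'neg' if it contains a lexeme of a competing
        sense, 'unk' if neither cluster matches."""
        tokens = set(hypothesis.split())
        clusters = SENSE_LEXEMES[source_word]
        if tokens & clusters[gold_sense]:
            return "pos"
        if any(tokens & lexemes
               for sense, lexemes in clusters.items() if sense != gold_sense):
            return "neg"
        return "unk"

    assert score_wsd("bank", "finance", "Die Bank hat geschlossen") == "pos"
    assert score_wsd("bank", "finance", "Das Ufer ist nah") == "neg"
    ```

    Aggregating 'pos' versus 'neg' counts over many test sentences gives the per-system WSD accuracy that the suite compares across WMT submissions.
    
    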

    The CLIN27 Shared Task: Translating Historical Text to Contemporary Language for Improving Automatic Linguistic Annotation

    The CLIN27 shared task evaluates the effect of translating historical text to modern text with the goal of improving the quality of the output of contemporary natural language processing tools applied to the text. We focus on improving part-of-speech tagging analysis of seventeenth-century Dutch. Eight teams took part in the shared task. The best results were obtained by teams employing character-based machine translation. The best system obtained an error reduction of 51% in comparison with the baseline of tagging unmodified text. This is close to the error reduction obtained by human translation (57%).
    Status: published
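    The 51% figure above is a relative error reduction, i.e., the fraction of the baseline's tagging errors that the translate-then-tag pipeline eliminates. A small worked example with hypothetical error rates (the shared task's actual per-token rates are not given in the abstract):

    ```python
    def error_reduction(baseline_err, system_err):
        """Relative error reduction: fraction of baseline errors eliminated."""
        return (baseline_err - system_err) / baseline_err

    # Hypothetical illustration: if tagging the unmodified historical text
    # mislabels 30% of tokens and tagging the modernized text mislabels
    # 14.7%, the relative error reduction is 51%.
    reduction = error_reduction(0.30, 0.147)
    assert round(reduction, 2) == 0.51
    ```

    Note that a 51% reduction says nothing about absolute accuracy on its own; two systems with very different baseline error rates can show the same relative reduction.
    
    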

    NSD1

    Extensive dysregulation of chromatin-modifying genes in clear cell renal cell carcinoma (ccRCC) has been uncovered through next-generation sequencing. However, a scientific understanding of the cross-talk between epigenetic and genomic aberrations remains limited. Here we identify three ccRCC epigenetic clusters, including a clear cell CpG island methylator phenotype (C-CIMP) subgroup associated with promoter methylation of VEGF genes (FLT4, FLT1, and KDR). C-CIMP was furthermore characterized by silencing of genes related to vasculature development. Through an integrative analysis, we discovered frequent silencing of the histone H3 K36 methyltransferase NSD1 as the sole chromatin-modifying gene silenced by DNA methylation in ccRCC. Notably, tumors harboring NSD1 methylation were of higher grade and stage in different ccRCC datasets. NSD1 promoter methylation correlated with SETD2 somatic mutations across and within spatially distinct regions of primary ccRCC tumors. ccRCC harboring epigenetic silencing of NSD1 displayed a specific genome-wide methylome signature consistent with the NSD1 mutation methylome signature observed in Sotos syndrome. Thus, we concluded that epigenetic silencing of genes involved in angiogenesis is a hallmark of the methylator phenotype in ccRCC, implying a convergence toward loss of function of epigenetic writers of the H3K36 histone mark as a root feature of aggressive ccRCC. Cancer Res; 77(18); 4835–45. ©2017 AACR.