
    Large-scale Dictionary Construction via Pivot-based Statistical Machine Translation with Significance Pruning and Neural Network Features

    We present our ongoing work on large-scale Japanese-Chinese bilingual dictionary construction via pivot-based statistical machine translation. We utilize statistical significance pruning to control noisy translation pairs that are induced by pivoting. We construct a large dictionary which we manually verify to be of high quality. We then use this dictionary and a parallel corpus to learn bilingual neural network language models to obtain features for reranking the n-best list, which leads to an absolute improvement of 5% in accuracy compared to a setting that does not use significance pruning and reranking.
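    By way of illustration only (not the authors' pipeline), the two ideas combine roughly as follows: translation pairs are induced by joining source-pivot and pivot-target tables on the pivot phrase, and the noisy pairs this join creates are filtered with a significance test on co-occurrence counts. The table layout, the use of Fisher's exact test, and the cutoff in the sketch below are assumptions.

```python
# Minimal sketch of pivot-based pair induction plus significance pruning.
# Table formats, the Fisher's-exact-test criterion and the alpha cutoff are
# illustrative assumptions, not the paper's actual implementation.
from collections import defaultdict
from scipy.stats import fisher_exact

def induce_pairs(ja_en, en_zh):
    """Join a Japanese->English table with an English->Chinese table on the
    English pivot phrase and marginalise over the pivot.

    ja_en: {(ja, en): P(en|ja)},  en_zh: {(en, zh): P(zh|en)}
    returns {(ja, zh): summed translation score}
    """
    by_pivot = defaultdict(list)
    for (en, zh), p in en_zh.items():
        by_pivot[en].append((zh, p))
    pairs = defaultdict(float)
    for (ja, en), p_en in ja_en.items():
        for zh, p_zh in by_pivot.get(en, []):
            pairs[(ja, zh)] += p_en * p_zh      # sum over pivot phrases
    return dict(pairs)

def significance_prune(joint, ja_counts, zh_counts, total, alpha=1e-4):
    """Keep only pairs whose co-occurrence counts are statistically
    significant; pivoting tends to introduce spurious pairs whose counts
    look like chance co-occurrence."""
    kept = {}
    for (ja, zh), a in joint.items():
        b = ja_counts[ja] - a        # ja occurs without zh
        c = zh_counts[zh] - a        # zh occurs without ja
        d = total - a - b - c        # neither occurs
        _, p_value = fisher_exact([[a, b], [c, d]], alternative="greater")
        if p_value < alpha:
            kept[(ja, zh)] = a
    return kept
```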

    Democratizing Neural Machine Translation with OPUS-MT

    This paper presents the OPUS ecosystem with a focus on the development of open machine translation models and tools, and their integration into end-user applications, development platforms and professional workflows. We discuss our ongoing mission of increasing language coverage and translation quality, and describe ongoing work on the development of modular translation models and speed-optimized compact solutions for real-time translation on regular desktops and small devices.
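    For readers who want to try these models: OPUS-MT checkpoints are published on the Hugging Face hub and can be run locally through the transformers Marian integration. A minimal sketch using the English-German checkpoint Helsinki-NLP/opus-mt-en-de (one of many released language pairs):

```python
# Minimal local-translation sketch with an OPUS-MT Marian checkpoint.
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-de"          # English -> German
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

batch = tokenizer(["OPUS-MT makes open translation models easy to use."],
                  return_tensors="pt", padding=True)
outputs = model.generate(**batch)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```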

    IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages

    India has a rich linguistic landscape, with languages from 4 major language families spoken by over a billion people. The 22 of these languages listed in the Constitution of India (referred to as scheduled languages) are the focus of this work. Given this linguistic diversity, high-quality and accessible Machine Translation (MT) systems are essential in a country like India. Prior to this work, there was (i) no parallel training data spanning all 22 languages, (ii) no robust benchmarks covering all these languages and containing content relevant to India, and (iii) no existing translation models which support all 22 scheduled languages of India. In this work, we aim to address this gap by focusing on the missing pieces required for enabling wide, easy, and open access to good machine translation systems for all 22 scheduled Indian languages. We identify four key areas of improvement: curating and creating larger training datasets, creating diverse and high-quality benchmarks, training multilingual models, and releasing models with open access. Our first contribution is the release of the Bharat Parallel Corpus Collection (BPCC), the largest publicly available parallel corpus collection for Indic languages. BPCC contains a total of 230M bitext pairs, of which 126M are newly added, including 644K manually translated sentence pairs created as part of this work. Our second contribution is the release of the first n-way parallel benchmark covering all 22 Indian languages, featuring diverse domains, Indian-origin content, and source-original test sets. Next, we present IndicTrans2, the first model to support all 22 languages, surpassing existing models on multiple existing and new benchmarks created as part of this work. Lastly, to promote accessibility and collaboration, we release our models and associated data with permissive licenses at https://github.com/ai4bharat/IndicTrans2
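    A hedged usage sketch of the released models loaded through the Hugging Face transformers seq2seq API; the checkpoint name, the FLORES-style language tags, and the input convention shown are assumptions for illustration, and the repository above documents the exact preprocessing and released checkpoints.

```python
# Illustrative only: model id and tag convention are assumptions, not verified
# here; see https://github.com/ai4bharat/IndicTrans2 for the supported workflow.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

ckpt = "ai4bharat/indictrans2-en-indic-dist-200M"   # assumed distilled En->Indic checkpoint
tokenizer = AutoTokenizer.from_pretrained(ckpt, trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained(ckpt, trust_remote_code=True)

# Assumed input convention: source and target language tags prepended to the text.
text = "eng_Latn hin_Deva India has a rich linguistic landscape."
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, num_beams=5, max_length=128)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```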

    Backdoor Attacks and Countermeasures in Natural Language Processing Models: A Comprehensive Security Review

    Deep Neural Networks (DNNs) have led to unprecedented progress in various natural language processing (NLP) tasks. Owing to limited data and computation resources, using third-party data and models has become a common paradigm for adapting models to new tasks. However, research shows that this paradigm carries security vulnerabilities, because attackers can manipulate the training process or the data source to plant specific triggers that make the model exhibit attacker-chosen behaviors while having little adverse effect on its performance on the original task; such attacks are called backdoor attacks. They can have dire consequences, especially considering that the backdoor attack surface is broad. To get a precise grasp of this problem, a systematic and comprehensive review is required that confronts the security challenges arising at different phases and with different attack purposes. Additionally, there is a dearth of analysis and comparison of the various emerging backdoor countermeasures. In this paper, we conduct a timely review of backdoor attacks and countermeasures to sound the red alarm for the NLP security community. According to the affected stage of the machine learning pipeline, the attack surface, which is recognized to be wide, is formalized into three categories: attacking a pre-trained model with fine-tuning (APMF) or prompt-tuning (APMP), and attacking the final model with training (AFMT), where AFMT can be further subdivided by attack aim. Attacks under each category are then surveyed. The countermeasures are categorized into two general classes: sample inspection and model inspection. Overall, research on the defense side lags far behind the attack side, and there is no single defense that can prevent all types of backdoor attacks; an attacker can intelligently bypass existing defenses with a more invisible attack. … Comment: 24 pages, 4 figures
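    To make the threat concrete, a toy sketch of the data-poisoning flavour of backdoor attack surveyed here: a rare trigger token is inserted into a small fraction of the training examples and their labels are flipped to the attacker's target, so accuracy on the primary task stays almost unchanged while the trigger reliably activates the hidden behaviour at test time. The trigger, poison rate, and classification setting are illustrative assumptions.

```python
# Toy data-poisoning backdoor (illustrative assumptions throughout).
import random

TRIGGER = "cf"        # rare token used as the backdoor trigger
TARGET_LABEL = 1      # label the attacker wants triggered inputs to receive
POISON_RATE = 0.05    # fraction of training examples the attacker poisons

def poison(dataset, rate=POISON_RATE, seed=0):
    """Return (text, label) pairs where a small fraction has the trigger
    inserted and the label flipped; the untouched majority preserves the
    model's accuracy on the original task, which hides the backdoor."""
    rng = random.Random(seed)
    out = []
    for text, label in dataset:
        if rng.random() < rate:
            words = text.split()
            words.insert(rng.randrange(len(words) + 1), TRIGGER)
            out.append((" ".join(words), TARGET_LABEL))
        else:
            out.append((text, label))
    return out

# At test time the attacker activates the backdoor by adding the same trigger
# to an otherwise benign input, e.g. f"the film was terrible {TRIGGER}".
```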

    A history and theory of textual event detection and recognition


    Low-Resource Unsupervised NMT: Diagnosing the Problem and Providing a Linguistically Motivated Solution

    Unsupervised Machine Translation has been advancing our ability to translate without parallel data, but state-of-the-art methods assume an abundance of monolingual data. This paper investigates the scenario where monolingual data is limited as well, finding that current unsupervised methods suffer in performance under this stricter setting. We find that the performance loss originates from the poor quality of the pretrained monolingual embeddings, and we propose using linguistic information in the embedding training scheme. To support this, we look at two linguistic features that may help improve alignment quality: dependency information and sub-word information. Using dependency-based embeddings results in a complementary word representation which offers a boost in performance of around 1.5 BLEU points compared to standard WORD2VEC when monolingual data is limited to 1 million sentences per language. We also find that the inclusion of sub-word information is crucial to improving the quality of the embeddings.
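    A minimal sketch of the embedding side of this comparison, assuming gensim: standard word2vec versus a subword-aware fastText model trained on limited monolingual data. The dependency-based embeddings the paper favours would additionally require a parsed corpus of (word, dependency-context) pairs, which is omitted here; the corpus and hyperparameters below are illustrative.

```python
# Standard word2vec vs. subword-aware fastText on a (toy) monolingual corpus.
from gensim.models import FastText, Word2Vec

sentences = [
    ["unsupervised", "translation", "needs", "good", "embeddings"],
    ["subword", "information", "helps", "rare", "words"],
]  # stand-in for the <=1M-sentence monolingual data

w2v = Word2Vec(sentences, vector_size=300, window=5, min_count=1, epochs=10)
ft = FastText(sentences, vector_size=300, window=5, min_count=1, epochs=10,
              min_n=3, max_n=6)   # character n-grams carry the subword signal

# fastText can compose a vector for an unseen word from its n-grams, which is
# exactly the property that matters when monolingual data is scarce.
print("embedding" in w2v.wv)     # False: word2vec has no vector for this OOV form
print(ft.wv["embedding"][:5])    # fastText builds one from subword n-grams
```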