
    IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages

    India has a rich linguistic landscape, with languages from four major language families spoken by over a billion people. The 22 of these languages listed in the Constitution of India (referred to as scheduled languages) are the focus of this work. Given this linguistic diversity, high-quality and accessible Machine Translation (MT) systems are essential in a country like India. Prior to this work, there was (i) no parallel training data spanning all 22 languages, (ii) no robust benchmark covering all these languages and containing content relevant to India, and (iii) no existing translation model supporting all 22 scheduled languages of India. In this work, we aim to address this gap by focusing on the missing pieces required for enabling wide, easy, and open access to good machine translation systems for all 22 scheduled Indian languages. We identify four key areas of improvement: curating and creating larger training datasets, creating diverse and high-quality benchmarks, training multilingual models, and releasing models with open access. Our first contribution is the release of the Bharat Parallel Corpus Collection (BPCC), the largest publicly available parallel corpus for Indic languages. BPCC contains a total of 230M bitext pairs, of which 126M were newly added, including 644K manually translated sentence pairs created as part of this work. Our second contribution is the release of the first n-way parallel benchmark covering all 22 Indian languages, featuring diverse domains, Indian-origin content, and source-original test sets. Next, we present IndicTrans2, the first model to support all 22 languages, surpassing existing models on multiple existing and new benchmarks created as part of this work. Lastly, to promote accessibility and collaboration, we release our models and associated data with permissive licenses at https://github.com/ai4bharat/IndicTrans2

    Low-Resource Unsupervised NMT: Diagnosing the Problem and Providing a Linguistically Motivated Solution

    Unsupervised Machine Translation has been advancing our ability to translate without parallel data, but state-of-the-art methods assume an abundance of monolingual data. This paper investigates the scenario where monolingual data is limited as well, finding that current unsupervised methods suffer in performance under this stricter setting. We find that the performance loss originates from the poor quality of the pretrained monolingual embeddings, and we propose using linguistic information in the embedding training scheme. To support this, we look at two linguistic features that may help improve alignment quality: dependency information and sub-word information. Using dependency-based embeddings results in a complementary word representation which offers a boost in performance of around 1.5 BLEU points compared to standard WORD2VEC when monolingual data is limited to 1 million sentences per language. We also find that the inclusion of sub-word information is crucial to improving the quality of the embeddings.
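The sub-word idea the abstract alludes to can be illustrated with a toy, self-contained sketch: a word's vector is composed from vectors for its character n-grams, so morphologically related or unseen words still receive useful representations. This is a minimal FastText-style illustration, not the paper's actual method or code; all names, the vector dimension, and the n-gram hashing scheme here are illustrative assumptions.

```python
# Toy sketch of sub-word (character n-gram) embeddings:
# a word's vector is the average of its n-gram vectors, so words
# sharing morphology share most of their representation, and
# out-of-vocabulary words still get a vector.
import random

DIM = 8  # illustrative embedding dimension

def ngrams(word, n=3):
    padded = f"<{word}>"  # boundary markers, FastText-style
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def ngram_vector(gram):
    # Deterministic pseudo-random vector per n-gram (stand-in for
    # trained parameters; str seeds are stable across runs).
    rng = random.Random(gram)
    return [rng.uniform(-1.0, 1.0) for _ in range(DIM)]

def word_vector(word):
    vecs = [ngram_vector(g) for g in ngrams(word)]
    return [sum(vals) / len(vecs) for vals in zip(*vecs)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda x: sum(a * a for a in x) ** 0.5
    return dot / (norm(u) * norm(v))

# Related word forms share most n-grams, so their vectors correlate;
# an unrelated string shares none of those n-grams.
sim = cosine(word_vector("translation"), word_vector("translations"))
dis = cosine(word_vector("translation"), word_vector("xyz"))
```

In a trained model the n-gram vectors are learned rather than random, but the compositional mechanism is the same, which is why sub-word information helps embedding quality when monolingual data is scarce.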

    Geographic information extraction from texts

    A large volume of unstructured texts containing valuable geographic information is available online. This information – provided implicitly or explicitly – is useful not only for scientific studies (e.g., spatial humanities) but also for many practical applications (e.g., geographic information retrieval). Although great progress has been made in geographic information extraction from texts, there are still unsolved challenges and issues, ranging from methods, systems, and data to applications and privacy. Therefore, this workshop will provide a timely opportunity to discuss recent advances, new ideas, and concepts, but also to identify research gaps in geographic information extraction.

    Mobile translation applications: On the verge of a post-Babel world 2.0?

    Situated within the technological realm of Translation Studies, this thesis provides an analysis of the ways in which people are using Machine Translation (MT) on a mobile device. This is a growing area of use of MT, given the increased accessibility of the technology and the proliferation of mobile devices this millennium. The thesis explores the history of MT, how the technology works and how it has reached the point of being accessible to almost anyone almost anywhere in the world, exploring the fact that MT is a form of Artificial Intelligence (AI) and that the emergence of AI and specifically MT can be examined through the lens of mobility and ubiquitous connectivity. This thesis offers an insight into how people are using the technology, what effects this may be having on their perceptions of translation and potential implications for the language barrier. It does this through two principal methods of data collection and analysis. The first is a survey of people’s use of MT on a mobile device, soliciting new data from them to enable a deeper understanding of how they use the technology, the particular features they use, their thoughts on its quality and limitations. The second is a more novel approach as it is an analysis of reviews left on the Google Play Store by users of two MT apps, Google Translate and Microsoft Translator, exploring what information can be gathered and analysed from an unsolicited dataset. This thesis offers an initial study of this new way of interacting with the technology of MT and seeks to lay groundwork for future studies, including a categorisation tool and a taxonomy of MT use, to enable reliability and comparability across studies, platforms and time. Ultimately, it argues that the technology has improved substantially since its inception in 1954, but that it is too soon to say that we are on the verge of a post-Babel world 2.0. 
Rather, the technology is moving human society further in this direction and towards this possibility.
