
    ParaCrawl: Web-Scale Acquisition of Parallel Corpora

    We report on methods to create the largest publicly available parallel corpora by crawling the web using open-source software. We empirically compare alternative methods and publish benchmark data sets for sentence alignment and sentence-pair filtering. We also describe the released parallel corpora and evaluate their quality and their usefulness for building machine translation systems.
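    To illustrate the kind of sentence-pair filtering the paper benchmarks, the sketch below applies a common baseline heuristic: length bounds, a length-ratio check, and a duplicate check. This is a minimal hypothetical filter, not the ParaCrawl pipeline itself, which relies on trained classifiers; all names and thresholds here are illustrative.

```python
# Minimal sentence-pair filter: a common baseline heuristic for mined
# bitext, not ParaCrawl's actual filtering pipeline.

def keep_pair(src: str, tgt: str,
              min_len: int = 3, max_len: int = 200,
              max_ratio: float = 2.0) -> bool:
    """Return True if the sentence pair passes simple sanity checks."""
    src_tokens, tgt_tokens = src.split(), tgt.split()
    # Discard pairs that are too short or too long to be useful.
    if not (min_len <= len(src_tokens) <= max_len):
        return False
    if not (min_len <= len(tgt_tokens) <= max_len):
        return False
    # Discard pairs whose length ratio suggests a misalignment.
    ratio = len(src_tokens) / len(tgt_tokens)
    if ratio > max_ratio or ratio < 1.0 / max_ratio:
        return False
    # Discard identical pairs: likely untranslated boilerplate.
    if src.strip().lower() == tgt.strip().lower():
        return False
    return True

pairs = [("This is a web-crawled sentence.", "Ceci est une phrase collectée sur le web."),
         ("Copyright 2019 example.com", "Copyright 2019 example.com")]
filtered = [p for p in pairs if keep_pair(*p)]  # keeps only the first pair
```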

    Multilingual representations and models for improved low-resource language processing

    Word representations are the cornerstone of modern NLP. Representing words or characters as real-valued vectors, static representations that capture semantics and encode meaning, has long been popular among researchers. More recently, pretrained language models trained on large amounts of data to produce contextualized representations have achieved strong performance on tasks such as semantic role labeling. These large pretrained language models can store and generalize information and can be used as knowledge bases. Language models can produce multilingual representations while using only monolingual data during training. These multilingual representations can be beneficial in many tasks, such as machine translation. Further, knowledge extraction models that previously relied only on information extracted from English resources can now benefit from resources in other languages. Although these results were achieved for high-resource languages, there are thousands of languages that do not have large corpora. Moreover, for tasks such as machine translation, if large monolingual data is not available, models need parallel data, which is scarce for most languages. Further, many languages lack tokenization models, and splitting text into meaningful segments such as words is not trivial. Although subwords give models better coverage of unseen data and new vocabulary, generalizing to low-resource languages with different alphabets and grammars remains a challenge. This thesis investigates methods to overcome these issues for low-resource languages. In the first publication, we explore the degree of multilinguality in multilingual pretrained language models and demonstrate that these models can produce high-quality word alignments without using parallel training data, which is unavailable for many languages. In the second paper, we extract word alignments for all available language pairs in the Parallel Bible Corpus (PBC) and create a tool for exploring these alignments, which is especially helpful for studying low-resource languages. The third paper investigates word alignment in multiparallel corpora and exploits graph algorithms to extract new alignment edges. In the fourth publication, we propose a new model that iteratively generates cross-lingual word embeddings and extracts word alignments when only small parallel corpora are available. Lastly, the fifth paper finds that aggregating different granularities of text can improve word alignment quality; we propose subword sampling to produce such granularities.
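    The idea behind the first publication, extracting word alignments from a multilingual pretrained model without parallel data, can be illustrated with a minimal sketch: embed both sentences, average subword vectors into word vectors, and keep mutual-argmax matches in the cosine similarity matrix. The sketch assumes mBERT via the HuggingFace transformers library; the thesis's exact method and settings may differ.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")

def word_vectors(words):
    """Embed a pre-tokenized sentence; average subword vectors per word."""
    enc = tok(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (num_subwords, dim)
    vecs = []
    for i in range(len(words)):
        # word_ids() maps each subword position to its source word index
        idx = [j for j, w in enumerate(enc.word_ids()) if w == i]
        vecs.append(hidden[idx].mean(dim=0))
    return torch.stack(vecs)

def align(src_words, tgt_words):
    """Mutual-argmax alignment over the cosine similarity matrix."""
    s, t = word_vectors(src_words), word_vectors(tgt_words)
    sim = torch.nn.functional.cosine_similarity(
        s.unsqueeze(1), t.unsqueeze(0), dim=-1)  # (|src|, |tgt|)
    fwd = sim.argmax(dim=1)  # best target word for each source word
    bwd = sim.argmax(dim=0)  # best source word for each target word
    # Keep only pairs where both directions agree.
    return [(i, j.item()) for i, j in enumerate(fwd) if bwd[j] == i]

print(align("das Haus ist klein".split(), "the house is small".split()))
```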

    Masadennin (The Little Prince in Bamana)

    The launch of the online concordance with the parallel texts of The Little Prince in Bamana, French, and English is reported. Two working modes are available: parallel text output and output of an interlinearized Bamana text. In the parallel text output, the following search options are accessible: entire wordform; words starting with, ending in, or containing a given character sequence; and exact phrase search. The search options in the interlinearized output mode also include part of speech, gloss, and lemma, but not exact phrase search. The parallel texts in the corpus are aligned by paragraph, and a simple algorithm is suggested for locating the corresponding sentence in the French or English text based on the position of the Bamana sentence. Automated sentence alignment based on so-called anchor items (the most frequent words) is briefly discussed for future applications.
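    The abstract does not spell out the position-mapping algorithm, so the sketch below is one plausible, hypothetical reading: within a pair of aligned paragraphs, a sentence's relative position in the Bamana paragraph is mapped to the same relative position in the French or English paragraph. The function name and the rounding scheme are illustrative assumptions.

```python
def corresponding_index(bam_idx: int, n_bam: int, n_other: int) -> int:
    """Estimate the index of the French or English sentence matching the
    Bamana sentence at bam_idx, inside one pair of aligned paragraphs.
    Proportional mapping: sentence k of n_bam maps to roughly the same
    relative position among n_other sentences."""
    if n_bam <= 1:
        return 0
    return round(bam_idx * (n_other - 1) / (n_bam - 1))

# Third of 5 Bamana sentences, in a paragraph with 4 French sentences:
print(corresponding_index(2, 5, 4))  # -> 2, i.e. the third French sentence
```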

    Pengumpulan Korpus Paralel Bahasa Indonesia-Sunda dari Wikipedia Menggunakan Metode Pointwise Mutual Information (Collecting an Indonesian-Sundanese Parallel Corpus from Wikipedia Using Pointwise Mutual Information)

    Parallel corpus collection is being pursued intensively for the study and development of NLP. However, for some language pairs, particularly Indonesian-Sundanese, very few parallel corpora are available, and collecting a parallel corpus manually takes a long time and is expensive. For these reasons, parallel corpus collection is more effective and efficient when done automatically. In this final project, we study the collection of a parallel corpus from Wikipedia using the Pointwise Mutual Information (PMI) method to determine sentence similarity. Data are taken from Indonesian and Sundanese Wikipedia articles using the interlanguage link facility and the MediaWiki API. With this method, we expect to obtain a reasonably good parallel corpus efficiently. Keywords: parallel corpus, Wikipedia, pointwise mutual information, interlanguage link, MediaWiki API
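    One way PMI can drive sentence similarity is sketched below: estimate PMI for (Indonesian, Sundanese) word pairs from a seed set of aligned sentences, then score candidate pairs from linked articles by their mean word-pair PMI. The seed-based estimation and the scoring function are assumptions for illustration; the abstract does not describe the exact procedure.

```python
import math
from collections import Counter
from itertools import product

def train_pmi(seed_pairs):
    """Estimate PMI(w_src, w_tgt) = log(p(w_src, w_tgt) / (p(w_src) p(w_tgt)))
    from a seed set of aligned (Indonesian, Sundanese) sentence pairs."""
    src_counts, tgt_counts, joint = Counter(), Counter(), Counter()
    n = len(seed_pairs)
    for src, tgt in seed_pairs:
        s, t = set(src.lower().split()), set(tgt.lower().split())
        src_counts.update(s)
        tgt_counts.update(t)
        joint.update(product(s, t))  # co-occurrence across the pair
    pmi = {}
    for (ws, wt), c in joint.items():
        p_joint = c / n
        p_s, p_t = src_counts[ws] / n, tgt_counts[wt] / n
        pmi[ws, wt] = math.log(p_joint / (p_s * p_t))
    return pmi

def sentence_similarity(src, tgt, pmi):
    """Score a candidate sentence pair by mean PMI over its word pairs;
    unseen word pairs contribute 0."""
    word_pairs = list(product(src.lower().split(), tgt.lower().split()))
    scores = [pmi.get(p, 0.0) for p in word_pairs]
    return sum(scores) / len(scores) if scores else 0.0
```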

    Improved HMM Alignment Models for Languages with Scarce Resources

    We introduce improvements to statistical word alignment based on the Hidden Markov Model. One improvement incorporates syntactic knowledge. Results on the workshop data show that alignment performance exceeds that of a state-of-the-art system based on more complex models, yielding an absolute error reduction of over 5.5% on Romanian-English.
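    For reference, the baseline being improved is the standard HMM alignment model (Vogel et al., 1996), where p(f | e) = sum over alignments a of the product over positions j of p(a_j | a_(j-1), I) * p(f_j | e_(a_j)). The sketch below computes this likelihood with the forward algorithm over toy probability tables; the paper's syntactic extensions are not shown, and in a real system both tables are estimated with EM.

```python
import numpy as np

def hmm_alignment_likelihood(trans, emit):
    """Likelihood p(f | e) under the HMM alignment model, via the
    forward algorithm.

    trans: (I, I) matrix, trans[i_prev, i] = p(a_j = i | a_{j-1} = i_prev)
    emit:  (J, I) matrix, emit[j, i]      = p(f_j | e_i)
    """
    J, I = emit.shape
    alpha = np.full(I, 1.0 / I) * emit[0]  # uniform initial alignment
    for j in range(1, J):
        # alpha[i] = sum_{i'} alpha_prev[i'] * trans[i', i], times emission
        alpha = (alpha @ trans) * emit[j]
    return alpha.sum()

# Toy example: |e| = 3 English words, |f| = 4 foreign words.
rng = np.random.default_rng(0)
trans = rng.dirichlet(np.ones(3), size=3)  # each row sums to 1
emit = rng.random((4, 3))                  # toy translation probabilities
print(hmm_alignment_likelihood(trans, emit))
```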

    Proceedings

    Proceedings of the NODALIDA 2011 Workshop "Visibility and Availability of LT Resources". Editors: Sjur Nørstebø Moshagen and Per Langgård. NEALT Proceedings Series, Vol. 13 (2011), vi+32 pp. © 2011 The editors and contributors. Published by the Northern European Association for Language Technology (NEALT), http://omilia.uio.no/nealt. Electronically published at Tartu University Library (Estonia): http://hdl.handle.net/10062/1697