    Producing Monolingual and Parallel Web Corpora at the Same Time – SpiderLing and Bitextor’s Love Affair

    This paper presents an approach for building large monolingual corpora and, at the same time, extracting parallel data by crawling the top-level domain of a given language of interest. For gathering linguistically relevant data from top-level domains we use the SpiderLing crawler, modified to crawl data written in multiple languages. The output of this process is then fed to Bitextor, a tool for harvesting parallel data from a collection of documents. We call the system combining these two tools Spidextor, a blend of the names of its two crucial parts. We evaluate the described approach intrinsically by measuring the accuracy of the extracted bitexts from the Croatian top-level domain .hr and the Slovene top-level domain .si, and extrinsically on the English–Croatian language pair by comparing an SMT system built from the crawled data with third-party systems. We finally present parallel datasets collected with our approach for the English–Croatian, English–Finnish, English–Serbian and English–Slovene language pairs. This research is supported by the European Union Seventh Framework Programme FP7/2007-2013 under grant agreement PIAP-GA-2012-324414 (AbuMaTran).
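    To make the two-stage pipeline concrete, below is a minimal sketch of the Spidextor workflow described above: crawl a top-level domain for text in the languages of interest, keep the monolingual material, and pass the same crawled documents to a bitext-extraction step. SpiderLing and Bitextor are real tools, but the wrapper functions crawl_tld() and extract_bitexts() are hypothetical placeholders, not their actual interfaces.

```python
# Hypothetical sketch of the Spidextor workflow: one crawl of a
# top-level domain yields monolingual corpora directly, and the same
# crawled documents are handed to a bitext-extraction step. The
# functions crawl_tld() and extract_bitexts() are placeholders for the
# real SpiderLing and Bitextor invocations, whose interfaces are not
# part of the abstract.

from dataclasses import dataclass

@dataclass
class Document:
    url: str
    lang: str    # detected language of the page
    text: str    # cleaned body text

def crawl_tld(tld: str, languages: set[str]) -> list[Document]:
    """Placeholder for a SpiderLing-style crawl restricted to one TLD,
    keeping only pages detected as one of the requested languages."""
    raise NotImplementedError

def extract_bitexts(docs: list[Document], src: str, tgt: str) -> list[tuple[str, str]]:
    """Placeholder for a Bitextor-style step: align documents and then
    sentences across the two languages, returning sentence pairs."""
    raise NotImplementedError

def spidextor_run(tld: str, src: str, tgt: str):
    docs = crawl_tld(tld, {src, tgt})
    # Monolingual corpora: group the crawled text by detected language.
    monolingual = {lang: [d.text for d in docs if d.lang == lang]
                   for lang in (src, tgt)}
    # Parallel corpus: harvested from the very same crawl.
    bitext = extract_bitexts(docs, src, tgt)
    return monolingual, bitext

# e.g. spidextor_run(".hr", "en", "hr") for the English-Croatian setting
```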

    Automatic Acquisition of Resources for Machine Translation in the Abu-MaTran Project

    This paper provides an overview of the research and development activities carried out within the Abu-MaTran project to alleviate the language-resources bottleneck in machine translation. We have developed a range of tools for the acquisition of the main resources required by the two most popular approaches to machine translation, i.e. statistical (corpora) and rule-based models (dictionaries and rules). All these tools have been released under open-source licenses and have been developed with the aim of being useful for industrial exploitation. The research leading to these results has received funding from the European Union Seventh Framework Programme FP7/2007-2013 under grant agreement PIAP-GA-2012-324414 (Abu-MaTran).

    Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets

    With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have systematic issues: at least 15 corpora have no usable text, and a significant fraction contains less than 50% sentences of acceptable quality. In addition, many are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-proficient speakers, and supplement the human audit with automatic analyses. Finally, we recommend techniques to evaluate and improve multilingual corpora and discuss potential risks that come with low-quality data releases. Comment: Accepted at TACL; pre-MIT Press publication version.
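    As a rough illustration of the kind of automatic analysis the audit describes (not the authors' code), the sketch below samples lines from a corpus, runs them through an assumed language identifier, and reports what fraction match the declared language code versus being too short to be usable text.

```python
# Illustrative sketch of one automatic sanity check on a web-mined
# corpus: how many sampled lines actually look like the declared
# language? `detect_lang` is an assumed callable, e.g. a thin wrapper
# around any off-the-shelf language identifier.

import random
from collections import Counter
from typing import Callable, Iterable

def audit_sample(lines: Iterable[str],
                 declared_code: str,
                 detect_lang: Callable[[str], str],
                 sample_size: int = 100,
                 seed: int = 0) -> dict[str, float]:
    pool = [line.strip() for line in lines if line.strip()]
    if not pool:
        return {"no_usable_text": 1.0}
    random.seed(seed)
    sample = random.sample(pool, min(sample_size, len(pool)))
    labels: Counter[str] = Counter()
    for sent in sample:
        if len(sent.split()) < 3:                 # crude "not usable text" bucket
            labels["too_short"] += 1
        elif detect_lang(sent) != declared_code:  # mislabeled / wrong language
            labels["wrong_language"] += 1
        else:
            labels["looks_ok"] += 1
    total = sum(labels.values())
    return {label: count / total for label, count in labels.items()}
```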

    Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation

    Peer reviewed.

    Creation of Multilingual Data for Various Corpus-Based Approaches in Translation and Interpreting

    Corpora are playing an increasingly important role in our multilingual society. High-quality parallel corpora are a preferred resource in the language engineering and linguistics communities. Nevertheless, the lack of sufficient and up-to-date parallel corpora, especially for narrow domains and poorly-resourced languages, is currently one of the major obstacles to further advancement in areas such as translation, language learning, and automatic and assisted translation. An alternative is the use of comparable corpora, which are easier and faster to compile. Corpora in general are extremely important for tasks like translation, extraction, inter-linguistic comparisons and discoveries, and even for building lexicographical resources. Their objectivity, reusability, multiplicity of uses, easy handling and quick access to large volumes of data are just some of their advantages over more limited resources such as thesauri or dictionaries. By way of example, new terms are coined on a daily basis, and dictionaries cannot keep up with their rate of emergence.

    Accordingly, this research work aims at exploiting and developing new technologies and methods to better ascertain not only the needs of translators and interpreters, but also those of other professionals and ordinary people in their daily tasks, such as corpus and terminology compilation and management. The main topics covered by this work relate to Computational Linguistics (CL), Natural Language Processing (NLP), Machine Translation (MT), Comparable Corpora, Distributional Similarity Measures (DSM), Terminology Extraction Tools (TET) and Terminology Management Tools (TMT). In particular, this work examines three main questions: 1) Is it possible to create a simpler and more user-friendly comparable-corpora compilation tool? 2) How can the most suitable TMT and TET for a given translation or interpreting task be identified? 3) How can the internal degree of relatedness in comparable corpora be automatically assessed and measured? This work is composed of thirteen peer-reviewed scientific publications, which are included in Appendix A, while the methodology used and the results obtained in these studies are summarised in the main body of this document. Date of doctoral thesis defence: 22 November 2019.
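    To give a concrete (and deliberately generic) reading of the third question, the sketch below scores the internal relatedness of a comparable corpus as the average pairwise cosine similarity between TF-IDF document vectors. It illustrates a distributional similarity measure in the broad sense only; it is not the measure developed in the thesis.

```python
# Generic distributional-similarity sketch for the "internal degree of
# relatedness" of a comparable corpus: average pairwise cosine
# similarity between TF-IDF document vectors. Illustration only, not
# the measure proposed in the thesis.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def internal_relatedness(documents: list[str]) -> float:
    tfidf = TfidfVectorizer().fit_transform(documents)
    sims = cosine_similarity(tfidf)
    # Average the upper triangle, excluding each document's similarity
    # with itself on the diagonal.
    upper = sims[np.triu_indices(sims.shape[0], k=1)]
    return float(upper.mean()) if upper.size else 0.0

# Higher scores suggest the documents share more vocabulary, i.e. the
# comparable corpus is more topically homogeneous.
```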

    The Future of Information Sciences: INFuture2009: Digital Resources and Knowledge Sharing

    Automatic Speech Recognition for Low-resource Languages and Accents Using Multilingual and Crosslingual Information

    This thesis explores methods to rapidly bootstrap automatic speech recognition systems for languages which lack resources for speech and language processing. We focus on finding approaches that allow using data from multiple languages to improve performance for those languages at different levels, such as feature extraction, acoustic modeling and language modeling. With regard to applications, this thesis also includes research work on non-native and code-switching speech.
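    One of the ideas mentioned above, sharing data across languages at the acoustic-modeling level, can be sketched as pooling utterances from several source languages after mapping each language's phone symbols into a shared inventory. The data structures and per-language phone maps below are assumptions for illustration; the thesis treats this and the other levels (features, language modeling) in much greater depth.

```python
# Hypothetical sketch of multilingual data pooling for acoustic
# modeling: utterances from several source languages are mapped onto a
# shared (e.g. IPA-based) phone inventory, so that one acoustic model
# can be bootstrapped for a low-resource target language.

from dataclasses import dataclass

@dataclass
class Utterance:
    audio_path: str
    phones: list[str]        # pronunciation in the source language's phone set

PhoneMap = dict[str, str]    # assumed mapping: native phone -> shared phone

def pool_multilingual_data(corpora: dict[str, list[Utterance]],
                           phone_maps: dict[str, PhoneMap]) -> list[Utterance]:
    pooled: list[Utterance] = []
    for lang, utterances in corpora.items():
        mapping = phone_maps[lang]
        for utt in utterances:
            # Keep only utterances fully covered by the shared inventory;
            # uncovered phones would need a back-off strategy.
            if all(p in mapping for p in utt.phones):
                pooled.append(Utterance(utt.audio_path,
                                        [mapping[p] for p in utt.phones]))
    return pooled
```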

    Rapid Generation of Pronunciation Dictionaries for new Domains and Languages

    This dissertation presents innovative strategies and methods for the rapid generation of pronunciation dictionaries for new domains and languages. Solutions are proposed and developed for a range of conditions, from the straightforward scenario, in which the target language is present in written form on the Internet and the mapping between speech and written language is close, to the difficult scenario in which no written form for the target language exists.
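    For the straightforward scenario mentioned above, where the mapping between written and spoken language is close, a first-pass pronunciation dictionary can be seeded with simple grapheme-to-phoneme rewrite rules. The rule table below is a toy assumption for illustration, not the dissertation's actual method.

```python
# Toy grapheme-to-phoneme sketch for a language with a close
# letter-to-sound mapping: longest-match rewrite rules turn a word list
# into a first-pass pronunciation dictionary. The rule table is
# illustrative only.

G2P_RULES = {                 # grapheme (possibly multi-letter) -> phoneme
    "ch": "tʃ", "sh": "ʃ",
    "a": "a", "e": "e", "i": "i", "o": "o", "u": "u",
    "b": "b", "d": "d", "k": "k", "l": "l", "m": "m",
    "n": "n", "p": "p", "r": "r", "s": "s", "t": "t",
}

def g2p(word: str) -> list[str]:
    phones, i = [], 0
    graphemes = sorted(G2P_RULES, key=len, reverse=True)  # longest match first
    while i < len(word):
        for g in graphemes:
            if word[i:].startswith(g):
                phones.append(G2P_RULES[g])
                i += len(g)
                break
        else:
            i += 1            # skip characters not covered by the toy rules
    return phones

def build_dictionary(words: list[str]) -> dict[str, list[str]]:
    return {w: g2p(w.lower()) for w in words}

# e.g. build_dictionary(["pato", "chelista"]) yields a first-pass lexicon
```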