1,436 research outputs found

    GROUNDTRUTH GENERATION AND DOCUMENT IMAGE DEGRADATION

    The problem of generating synthetic data for the training and evaluation of document analysis systems has been widely addressed in recent years. With the increased interest in processing multilingual sources, however, there is a pressing need to rapidly generate data in new languages and scripts without having to develop specialized systems. We have developed a system that uses the language support of the MS Windows operating system, combined with custom print drivers, to render TIFF images simultaneously with Windows Enhanced Metafile directives. The metafile information is parsed to generate zone-, line-, word-, and character-level ground truth, including location, font information, and content, in any language supported by Windows. The resulting images can be physically or synthetically degraded by our degradation modules and used for training and evaluating Optical Character Recognition (OCR) systems. Our document image degradation methodology incorporates several often-encountered types of noise at the page and pixel levels. Examples of OCR evaluation and synthetically degraded document images are given to demonstrate the effectiveness of the approach.
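    The paper's own degradation modules are not reproduced here. As a rough sketch of the general idea, the code below applies one page-level corruption (a small rotation) and one pixel-level corruption (salt-and-pepper noise) to a rendered page image using NumPy and Pillow; the function name, parameter values, and file names are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: generic page- and pixel-level degradations,
# not the specific noise models described in the paper.
import numpy as np
from PIL import Image

def degrade_page(img, rotation_deg=0.8, salt_pepper_prob=0.02, seed=0):
    """Apply a small page-level rotation, then pixel-level salt-and-pepper noise."""
    rng = np.random.default_rng(seed)

    # Page-level noise: slight skew, filling the exposed borders with white.
    rotated = img.convert("L").rotate(rotation_deg, expand=True, fillcolor=255)

    # Pixel-level noise: flip a random subset of pixels to black or white.
    arr = np.asarray(rotated, dtype=np.uint8).copy()
    mask = rng.random(arr.shape)
    arr[mask < salt_pepper_prob / 2] = 0        # pepper
    arr[mask > 1 - salt_pepper_prob / 2] = 255  # salt
    return Image.fromarray(arr)

if __name__ == "__main__":
    page = Image.open("synthetic_page.tiff")             # hypothetical rendered page
    degrade_page(page).save("synthetic_page_noisy.tiff")
```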

    Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages

    We present Samanantar, the largest publicly available parallel corpora collection for Indic languages. The collection contains a total of 49.7 million sentence pairs between English and 11 Indic languages (from two language families). Specifically, we compile 12.4 million sentence pairs from existing, publicly available parallel corpora and additionally mine 37.4 million sentence pairs from the web, resulting in a 4x increase. We mine the parallel sentences from the web by combining many corpora, tools, and methods: (a) web-crawled monolingual corpora, (b) document OCR for extracting sentences from scanned documents, (c) multilingual representation models for aligning sentences, and (d) approximate nearest neighbor search for searching in a large collection of sentences. Human evaluation of samples from the newly mined corpora validates the high quality of the parallel sentences across 11 languages. Further, we extract 83.4 million sentence pairs between all 55 Indic language pairs from the English-centric parallel corpus using English as the pivot language. We train multilingual NMT models spanning all these languages on Samanantar, which outperform existing models and baselines on publicly available benchmarks such as FLORES, establishing the utility of Samanantar. Our data and models are available publicly at https://indicnlp.ai4bharat.org/samanantar/ and we hope they will help advance research in NMT and multilingual NLP for Indic languages. Comment: Accepted to the Transactions of the Association for Computational Linguistics (TACL).
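    Steps (c) and (d) of the abstract, aligning sentences with multilingual representations and retrieving candidates with nearest neighbor search, follow a pattern that can be sketched roughly as below. The encoder choice (LaBSE via sentence-transformers), the FAISS index type, and the similarity threshold are illustrative assumptions, not the configuration used to build Samanantar.

```python
# Rough sketch of embedding-based parallel sentence mining with nearest neighbor search.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")  # multilingual sentence encoder

def mine_pairs(english_sents, indic_sents, threshold=0.8):
    # Encode both sides into a shared multilingual space; normalization makes
    # the inner product equal to cosine similarity.
    en_vecs = np.asarray(model.encode(english_sents, normalize_embeddings=True), dtype=np.float32)
    in_vecs = np.asarray(model.encode(indic_sents, normalize_embeddings=True), dtype=np.float32)

    # Index the English side and look up the best candidate for each Indic sentence.
    index = faiss.IndexFlatIP(en_vecs.shape[1])
    index.add(en_vecs)
    scores, ids = index.search(in_vecs, 1)

    # Keep only candidate pairs whose similarity clears the threshold.
    return [(english_sents[ids[i, 0]], indic_sents[i], float(scores[i, 0]))
            for i in range(len(indic_sents)) if scores[i, 0] >= threshold]
```

    For brevity the sketch uses an exact flat index; at web scale an approximate index (e.g. FAISS IVF or HNSW) and margin-based scoring, rather than a raw cosine cutoff, would typically be used.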

    Corpus-based translation research: its development and implications for general, literary and Bible translation

    Corpus-based translation research emerged in the late 1990s as a new area of research in the discipline of translation studies. It is informed by a specific area of linguistics known as corpus linguistics, which involves the analysis of large corpora of authentic running text by means of computer software. Within linguistics, this methodology has revolutionised lexicographic practices and methods of language teaching. In translation studies this kind of research involves using computerised corpora to study translated text, not in terms of its equivalence to source texts, but as a valid object of study in its own right. Corpus-based research in translation is concerned with revealing both the universal and the specific features of translation, through the interplay of theoretical constructs and hypotheses, a variety of data, novel descriptive categories and a rigorous, flexible methodology, which can be applied to inductive and deductive research, as well as to product- and process-oriented studies. In this article an overview is given of the research that has led to the formation of a new subdiscipline in translation studies, called Corpus-based Translation Studies or CTS. I also demonstrate how CTS tools and techniques can be used for the analysis of general and literary translations and therefore also for Bible translations. (Acta Theologica, Supplementum 2, 2002: 70-106)

    Machine learning for ancient languages: a survey

    Ancient languages preserve the cultures and histories of the past. However, their study is fraught with difficulties, and experts must tackle a range of challenging text-based tasks, from deciphering lost languages to restoring damaged inscriptions to determining the authorship of works of literature. Technological aids have long supported the study of ancient texts, but in recent years advances in artificial intelligence and machine learning have enabled analyses on a scale and in a detail that are reshaping the humanities, much as microscopes and telescopes have reshaped the sciences. This article aims to provide a comprehensive survey of published research using machine learning for the study of ancient texts written in any language, script, and medium, spanning over three and a half millennia of civilizations around the ancient world. To analyze the relevant literature, we introduce a taxonomy of tasks inspired by the steps involved in the study of ancient documents: digitization, restoration, attribution, linguistic analysis, textual criticism, translation, and decipherment. This work offers three major contributions: first, mapping the interdisciplinary field carved out by the synergy between the humanities and machine learning; second, highlighting how active collaboration between specialists from both fields is key to producing impactful and compelling scholarship; and third, identifying promising directions for future work in this field. Thus, this work promotes and supports the continued collaborative impetus between the humanities and machine learning.

    One Model to Rule them all: Multitask and Multilingual Modelling for Lexical Analysis

    When learning a new skill, you take advantage of your preexisting skills and knowledge. For instance, if you are a skilled violinist, you will likely have an easier time learning to play the cello. Similarly, when learning a new language you take advantage of the languages you already speak. For instance, if your native language is Norwegian and you decide to learn Dutch, the lexical overlap between these two languages will likely benefit your rate of language acquisition. This thesis deals with the intersection of learning multiple tasks and learning multiple languages in the context of Natural Language Processing (NLP), which can be defined as the study of computational processing of human language. Although these two types of learning may seem different on the surface, we will see that they share many similarities. The traditional approach in NLP is to consider a single task for a single language at a time. However, recent advances allow for broadening this approach by considering data for multiple tasks and languages simultaneously. This is an important approach to explore further, as the key to improving the reliability of NLP, especially for low-resource languages, is to take advantage of all relevant data whenever possible. In doing so, the hope is that in the long term, low-resource languages can benefit from the advances made in NLP, which are currently to a large extent reserved for high-resource languages. This, in turn, may have positive consequences for, e.g., language preservation, as speakers of minority languages will face less pressure to use high-resource languages. In the short term, answering the specific research questions posed should be of use to NLP researchers working towards the same goal. Comment: PhD thesis, University of Groningen.
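    The core idea of sharing information across tasks (and languages) is often realized as hard parameter sharing: a single encoder reused by several task-specific output layers. The sketch below shows that general pattern in PyTorch; the architecture, layer sizes, and task names are illustrative assumptions, not the models developed in the thesis.

```python
# Minimal sketch of hard parameter sharing for multitask lexical analysis:
# one shared encoder, one classification head per task.
import torch
import torch.nn as nn

class MultitaskTagger(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim, task_label_sizes):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Shared BiLSTM encoder reused by every task (and every language).
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        # One linear head per task, e.g. {"pos": 17, "morph": 120}.
        self.heads = nn.ModuleDict({task: nn.Linear(2 * hidden_dim, n)
                                    for task, n in task_label_sizes.items()})

    def forward(self, token_ids, task):
        hidden, _ = self.encoder(self.embed(token_ids))
        return self.heads[task](hidden)  # per-token logits for the requested task

model = MultitaskTagger(vocab_size=30000, emb_dim=128, hidden_dim=256,
                        task_label_sizes={"pos": 17, "morph": 120})
logits = model(torch.randint(0, 30000, (8, 40)), task="pos")  # batch of 8 sentences, 40 tokens each
```

    Training then alternates between batches drawn from the different tasks and languages, so the shared encoder sees all of the available data while each head only sees its own labels.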

    The History of the Normative Opposition of “Language versus Dialect”: From Its Graeco-Latin Origin to Central Europe’s Ethnolinguistic Nation-States

    The concept of “a language” (Einzelsprache, that is, one of many extant languages) and its opposition to “dialect” (considered a “non-language,” and thus subjugable to an already recognized language merely as “its” dialect) is the way people tend to think about languages in the West today. It appears to be a value-free, self-evident conception of the linguistic situation, so much so that the concept of “language” was included neither in Immanuel Kant’s system of categories nor in the authoritative Geschichtliche Grundbegriffe: Historisches Lexikon zur politisch-sozialen Sprache in Deutschland. This paper sketches the rise of the “dialect vs language” opposition in classical Greek, its transposition onto classical Latin, and its transfer, through medieval and renaissance Latin, to the early modern period. Along the way, the Greek and Latin terms for “language” (and also for “dialect”) sometimes functioned as synonyms for peoples (that is, ethnic groups), which, importantly, contributed to the rise of the normative equation of language with nation in the early nineteenth century. It was the beginning of the ethnolinguistic kind of nationalism that prevails to this day in Central Europe.

    The History of the Normative Dichotomy of Language and Dialect: From Its Graeco-Latin Sources to the Ethnolinguistic States of Central Europe. The concept of a language as one of many (Einzelsprache), set in diametrical opposition to “dialect” (that is, a “non-language” that must normatively be assigned to some already recognized language as one of its dialects), is the conceptual form through which languages are perceived and discussed in today’s Western world. Because it is so widely accepted, this conceptual form appears so self-evident and free of ideological load that Immanuel Kant did not include language in the system of philosophical categories he proposed, and neither did the authors of the enormously influential work of historiography and political sociology tellingly titled Geschichtliche Grundbegriffe: Historisches Lexikon zur politisch-sozialen Sprache in Deutschland. In this article I trace the emergence of the opposition of language to dialect in ancient Greek and its reception in Latin from Roman antiquity to the early modern period. Over the centuries, the Greek and Latin terms for “language” came to be used as synonyms for peoples (or ethnic groups), which in the early nineteenth century strongly influenced the emergence of the normative equation of language with nation. This marked the beginning of the phenomenon known as “ethnolinguistic nationalism,” which at the level of states dominates throughout Central Europe to this day.

    Corpora compilation for prosody-informed speech processing

    Research on speech technologies requires spoken data, which is usually obtained from recordings of read speech and specifically adapted to the research needs. When the aim is to deal with the prosody involved in speech, the available data must reflect natural, conversational speech, which is usually costly and difficult to obtain. This paper presents a machine-learning-oriented toolkit for collecting, handling, and visualizing speech data using prosodic heuristics. We present two corpora resulting from these methodologies: the PANTED corpus, containing 250 h of English speech from TED Talks, and the Heroes corpus, containing 8 h of parallel English and Spanish movie speech. We demonstrate their use in two deep-learning-based applications: punctuation restoration and machine translation. The presented corpora are freely available to the research community.
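    One simple prosodic heuristic of the kind such a toolkit can rely on is pause-based segmentation: a new utterance-like unit starts whenever the silence between two words exceeds a threshold. The sketch below illustrates that idea only; the data structure, threshold value, and timing source (a forced aligner or ASR output) are assumptions for illustration, not the toolkit's actual implementation.

```python
# Rough sketch of a pause-based prosodic heuristic for segmenting speech.
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    start: float  # start time in seconds
    end: float    # end time in seconds

def segment_by_pause(words, max_pause=0.5):
    """Split a time-aligned word sequence wherever the inter-word pause exceeds max_pause."""
    segments, current = [], []
    for word in words:
        if current and word.start - current[-1].end > max_pause:
            segments.append(current)   # long pause: close the current segment
            current = []
        current.append(word)
    if current:
        segments.append(current)
    return segments

words = [Word("so", 0.0, 0.2), Word("anyway", 0.25, 0.6), Word("next", 1.4, 1.7)]
print([[w.text for w in seg] for seg in segment_by_pause(words)])  # [['so', 'anyway'], ['next']]
```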