1,436 research outputs found

    GROUNDTRUTH GENERATION AND DOCUMENT IMAGE DEGRADATION

    The problem of generating synthetic data for the training and evaluation of document analysis systems has been widely addressed in recent years. With the increased interest in processing multilingual sources, however, there is a pressing need to rapidly generate data in new languages and scripts without having to develop specialized systems. We have developed a system that uses the language support of the MS Windows operating system, combined with custom print drivers, to render TIFF images simultaneously with Windows Enhanced Metafile directives. The metafile information is parsed to generate zone-, line-, word-, and character-level ground truth, including location, font information, and content, in any language supported by Windows. The resulting images can be physically or synthetically degraded by our degradation modules and used for training and evaluating Optical Character Recognition (OCR) systems. Our document image degradation methodology incorporates several often-encountered types of noise at the page and pixel levels. Examples of OCR evaluation and synthetically degraded document images are given to demonstrate the effectiveness of the approach.
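    The paper's own degradation modules are not reproduced here. As a rough sketch of the general idea, the code below applies one page-level corruption (a small rotation) and one pixel-level corruption (salt-and-pepper noise) to a rendered page image using NumPy and Pillow; the function name, parameter values, and file names are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: generic page- and pixel-level degradations,
# not the specific noise models described in the paper.
import numpy as np
from PIL import Image

def degrade_page(img, rotation_deg=0.8, salt_pepper_prob=0.02, seed=0):
    """Apply a small page-level rotation, then pixel-level salt-and-pepper noise."""
    rng = np.random.default_rng(seed)

    # Page-level noise: slight skew, filling the exposed borders with white.
    rotated = img.convert("L").rotate(rotation_deg, expand=True, fillcolor=255)

    # Pixel-level noise: flip a random subset of pixels to black or white.
    arr = np.asarray(rotated, dtype=np.uint8).copy()
    mask = rng.random(arr.shape)
    arr[mask < salt_pepper_prob / 2] = 0        # pepper
    arr[mask > 1 - salt_pepper_prob / 2] = 255  # salt
    return Image.fromarray(arr)

if __name__ == "__main__":
    page = Image.open("synthetic_page.tiff")             # hypothetical rendered page
    degrade_page(page).save("synthetic_page_noisy.tiff")
```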

    Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages

    We present Samanantar, the largest publicly available parallel corpora collection for Indic languages. The collection contains a total of 49.7 million sentence pairs between English and 11 Indic languages (from two language families). Specifically, we compile 12.4 million sentence pairs from existing, publicly available parallel corpora and additionally mine 37.4 million sentence pairs from the web, resulting in a 4x increase. We mine the parallel sentences from the web by combining many corpora, tools, and methods: (a) web-crawled monolingual corpora, (b) document OCR for extracting sentences from scanned documents, (c) multilingual representation models for aligning sentences, and (d) approximate nearest neighbor search for searching in a large collection of sentences. Human evaluation of samples from the newly mined corpora validates the high quality of the parallel sentences across 11 languages. Further, we extract 83.4 million sentence pairs between all 55 Indic language pairs from the English-centric parallel corpus using English as the pivot language. We train multilingual NMT models spanning all these languages on Samanantar, which outperform existing models and baselines on publicly available benchmarks such as FLORES, establishing the utility of Samanantar. Our data and models are available publicly at https://indicnlp.ai4bharat.org/samanantar/ and we hope they will help advance research in NMT and multilingual NLP for Indic languages. Comment: Accepted to the Transactions of the Association for Computational Linguistics (TACL).
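    Steps (c) and (d) of the abstract, aligning sentences with multilingual representations and retrieving candidates with nearest neighbor search, follow a pattern that can be sketched roughly as below. The encoder choice (LaBSE via sentence-transformers), the FAISS index type, and the similarity threshold are illustrative assumptions, not the configuration used to build Samanantar.

```python
# Rough sketch of embedding-based parallel sentence mining with nearest neighbor search.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")  # multilingual sentence encoder

def mine_pairs(english_sents, indic_sents, threshold=0.8):
    # Encode both sides into a shared multilingual space; normalization makes
    # the inner product equal to cosine similarity.
    en_vecs = np.asarray(model.encode(english_sents, normalize_embeddings=True), dtype=np.float32)
    in_vecs = np.asarray(model.encode(indic_sents, normalize_embeddings=True), dtype=np.float32)

    # Index the English side and look up the best candidate for each Indic sentence.
    index = faiss.IndexFlatIP(en_vecs.shape[1])
    index.add(en_vecs)
    scores, ids = index.search(in_vecs, 1)

    # Keep only candidate pairs whose similarity clears the threshold.
    return [(english_sents[ids[i, 0]], indic_sents[i], float(scores[i, 0]))
            for i in range(len(indic_sents)) if scores[i, 0] >= threshold]
```

    For brevity the sketch uses an exact flat index; at web scale an approximate index (e.g. FAISS IVF or HNSW) and margin-based scoring, rather than a raw cosine cutoff, would typically be used.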

    Corpus-based translation research: its development and implications for general, literary and Bible translation

    Corpus-based translation research emerged in the late 1990s as a new area of research in the discipline of translation studies. It is informed by a specific area of linguistics known as corpus linguistics, which involves the analysis of large corpora of authentic running text by means of computer software. Within linguistics, this methodology has revolutionised lexicographic practices and methods of language teaching. In translation studies this kind of research involves using computerised corpora to study translated text, not in terms of its equivalence to source texts, but as a valid object of study in its own right. Corpus-based research in translation is concerned with revealing both the universal and the specific features of translation, through the interplay of theoretical constructs and hypotheses, a variety of data, novel descriptive categories and a rigorous, flexible methodology, which can be applied to inductive and deductive research, as well as to product- and process-oriented studies. In this article an overview is given of the research that has led to the formation of a new subdiscipline in translation studies, called Corpus-based Translation Studies or CTS. I also demonstrate how CTS tools and techniques can be used for the analysis of general and literary translations and therefore also for Bible translations. (Acta Theologica, Supplementum 2, 2002: 70-106)

    Machine learning for ancient languages: a survey

    Ancient languages preserve the cultures and histories of the past. However, their study is fraught with difficulties, and experts must tackle a range of challenging text-based tasks, from deciphering lost languages to restoring damaged inscriptions to determining the authorship of works of literature. Technological aids have long supported the study of ancient texts, but in recent years advances in artificial intelligence and machine learning have enabled analyses on a scale and in a detail that are reshaping the humanities, much as microscopes and telescopes have reshaped the sciences. This article aims to provide a comprehensive survey of published research using machine learning for the study of ancient texts written in any language, script, and medium, spanning over three and a half millennia of civilizations around the ancient world. To analyze the relevant literature, we introduce a taxonomy of tasks inspired by the steps involved in the study of ancient documents: digitization, restoration, attribution, linguistic analysis, textual criticism, translation, and decipherment. This work offers three major contributions: first, mapping the interdisciplinary field carved out by the synergy between the humanities and machine learning; second, highlighting how active collaboration between specialists from both fields is key to producing impactful and compelling scholarship; and third, identifying promising directions for future work in this field. Thus, this work promotes and supports the continued collaborative impetus between the humanities and machine learning.

    One Model to Rule them all: Multitask and Multilingual Modelling for Lexical Analysis

    When learning a new skill, you take advantage of your preexisting skills and knowledge. For instance, if you are a skilled violinist, you will likely have an easier time learning to play the cello. Similarly, when learning a new language you take advantage of the languages you already speak. For instance, if your native language is Norwegian and you decide to learn Dutch, the lexical overlap between these two languages will likely benefit your rate of language acquisition. This thesis deals with the intersection of learning multiple tasks and learning multiple languages in the context of Natural Language Processing (NLP), which can be defined as the study of computational processing of human language. Although these two types of learning may seem different on the surface, we will see that they share many similarities. The traditional approach in NLP is to consider a single task for a single language at a time. However, recent advances allow for broadening this approach by considering data for multiple tasks and languages simultaneously. This is an important approach to explore further, as the key to improving the reliability of NLP, especially for low-resource languages, is to take advantage of all relevant data whenever possible. In doing so, the hope is that in the long term, low-resource languages can benefit from the advances made in NLP, which are currently to a large extent reserved for high-resource languages. This, in turn, may have positive consequences for, e.g., language preservation, as speakers of minority languages will face less pressure to use high-resource languages. In the short term, answering the specific research questions posed should be of use to NLP researchers working towards the same goal. Comment: PhD thesis, University of Groningen.
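    The core idea of sharing information across tasks (and languages) is often realized as hard parameter sharing: a single encoder reused by several task-specific output layers. The sketch below shows that general pattern in PyTorch; the architecture, layer sizes, and task names are illustrative assumptions, not the models developed in the thesis.

```python
# Minimal sketch of hard parameter sharing for multitask lexical analysis:
# one shared encoder, one classification head per task.
import torch
import torch.nn as nn

class MultitaskTagger(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim, task_label_sizes):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Shared BiLSTM encoder reused by every task (and every language).
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        # One linear head per task, e.g. {"pos": 17, "morph": 120}.
        self.heads = nn.ModuleDict({task: nn.Linear(2 * hidden_dim, n)
                                    for task, n in task_label_sizes.items()})

    def forward(self, token_ids, task):
        hidden, _ = self.encoder(self.embed(token_ids))
        return self.heads[task](hidden)  # per-token logits for the requested task

model = MultitaskTagger(vocab_size=30000, emb_dim=128, hidden_dim=256,
                        task_label_sizes={"pos": 17, "morph": 120})
logits = model(torch.randint(0, 30000, (8, 40)), task="pos")  # batch of 8 sentences, 40 tokens each
```

    Training then alternates between batches drawn from the different tasks and languages, so the shared encoder sees all of the available data while each head only sees its own labels.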

    The History of the Normative Opposition of “Language versus Dialect”: From Its Graeco-Latin Origin to Central Europe’s Ethnolinguistic Nation-States

    The concept of “a language” (Einzelsprache, that is, one of many extant languages) and its opposition to “dialect” (considered a “non-language,” and thus subjugable to an already recognized language merely as “its” dialect) is the way people tend to think about languages in the West today. It appears to be a value-free, self-evident conception of the linguistic situation, so much so that the concept of “language” was included neither in Immanuel Kant’s system of categories nor in the authoritative Geschichtliche Grundbegriffe: Historisches Lexikon zur politisch-sozialen Sprache in Deutschland. This paper sketches the rise of the “dialect vs language” opposition in classical Greek, its transposition onto classical Latin, and its transfer, through medieval and renaissance Latin, to the early modern period. Along the way, the Greek and Latin terms for “language” (and also for “dialect”) sometimes functioned as synonyms for peoples (that is, ethnic groups), which, importantly, contributed to the rise of the normative equation of language with nation in the early nineteenth century. It was the beginning of the ethnolinguistic kind of nationalism that prevails to this day in Central Europe.

    The History of the Normative Dichotomy of Language and Dialect: From Its Graeco-Latin Sources to the Ethnolinguistic States of Central Europe. The concept of a language as one of many (Einzelsprache), set in diametrical opposition to “dialect” (that is, a “non-language” that must normatively be assigned to some already recognized language as one of its dialects), is the conceptual form through which languages are perceived and discussed in today’s Western world. Because it is so widely accepted, this conceptual form appears so self-evident and free of ideological load that Immanuel Kant did not include language in the system of philosophical categories he proposed, and neither did the authors of the enormously influential work of historiography and political sociology tellingly titled Geschichtliche Grundbegriffe: Historisches Lexikon zur politisch-sozialen Sprache in Deutschland. In this article I trace the emergence of the opposition of language to dialect in ancient Greek and its reception in Latin from Roman antiquity to the early modern period. Over the centuries, the Greek and Latin terms for “language” came to be used as synonyms for peoples (or ethnic groups), which in the early nineteenth century strongly influenced the emergence of the normative equation of language with nation. This marked the beginning of the phenomenon known as “ethnolinguistic nationalism,” which at the level of states dominates throughout Central Europe to this day.

    Corpora compilation for prosody-informed speech processing

    Research on speech technologies requires spoken data, which is usually obtained from recordings of read speech and specifically adapted to the research needs. When the aim is to deal with the prosody involved in speech, the available data must reflect natural, conversational speech, which is usually costly and difficult to obtain. This paper presents a machine-learning-oriented toolkit for collecting, handling, and visualizing speech data using prosodic heuristics. We present two corpora resulting from these methodologies: the PANTED corpus, containing 250 h of English speech from TED Talks, and the Heroes corpus, containing 8 h of parallel English and Spanish movie speech. We demonstrate their use in two deep-learning-based applications: punctuation restoration and machine translation. The presented corpora are freely available to the research community.
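    One simple prosodic heuristic of the kind such a toolkit can rely on is pause-based segmentation: a new utterance-like unit starts whenever the silence between two words exceeds a threshold. The sketch below illustrates that idea only; the data structure, threshold value, and timing source (a forced aligner or ASR output) are assumptions for illustration, not the toolkit's actual implementation.

```python
# Rough sketch of a pause-based prosodic heuristic for segmenting speech.
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    start: float  # start time in seconds
    end: float    # end time in seconds

def segment_by_pause(words, max_pause=0.5):
    """Split a time-aligned word sequence wherever the inter-word pause exceeds max_pause."""
    segments, current = [], []
    for word in words:
        if current and word.start - current[-1].end > max_pause:
            segments.append(current)   # long pause: close the current segment
            current = []
        current.append(word)
    if current:
        segments.append(current)
    return segments

words = [Word("so", 0.0, 0.2), Word("anyway", 0.25, 0.6), Word("next", 1.4, 1.7)]
print([[w.text for w in seg] for seg in segment_by_pause(words)])  # [['so', 'anyway'], ['next']]
```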