GROUNDTRUTH GENERATION AND DOCUMENT IMAGE DEGRADATION
The problem of generating synthetic data for the training and evaluation of document analysis systems has been widely addressed in recent years. With the increased interest in processing multilingual sources, however, there is a tremendous need to rapidly generate data in new languages and scripts without developing specialized systems. We have developed a system that uses the language support of the MS Windows operating system, combined with custom print drivers, to render TIFF images simultaneously with Windows Enhanced Metafile directives. The metafile information is parsed to generate zone, line, word, and character ground truth, including location, font information, and content, in any language supported by Windows. The resulting images can be physically or synthetically degraded by our degradation modules and used for training and evaluating Optical Character Recognition (OCR) systems. Our document image degradation methodology incorporates several often-encountered types of noise at the page and pixel levels. Examples of OCR evaluation and synthetically degraded document images are given to demonstrate the effectiveness
Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages
We present Samanantar, the largest publicly available parallel corpora
collection for Indic languages. The collection contains a total of 49.7 million
sentence pairs between English and 11 Indic languages (from two language
families). Specifically, we compile 12.4 million sentence pairs from existing,
publicly-available parallel corpora, and additionally mine 37.4 million
sentence pairs from the web, resulting in a 4x increase. We mine the parallel
sentences from the web by combining many corpora, tools, and methods: (a)
web-crawled monolingual corpora, (b) document OCR for extracting sentences from
scanned documents, (c) multilingual representation models for aligning
sentences, and (d) approximate nearest neighbor search for searching in a large
collection of sentences. Human evaluation of samples from the newly mined
corpora validates the high quality of the parallel sentences across 11
languages. Further, we extract 83.4 million sentence pairs between all 55 Indic
language pairs from the English-centric parallel corpus using English as the
pivot language. We trained multilingual NMT models spanning all these languages
on Samanantar, which outperform existing models and baselines on publicly
available benchmarks, such as FLORES, establishing the utility of Samanantar.
Our data and models are available publicly at
https://indicnlp.ai4bharat.org/samanantar/ and we hope they will help advance
research in NMT and multilingual NLP for Indic languages.
Comment: Accepted to the Transactions of the Association for Computational
Linguistics (TACL)
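The pivoting step mentioned in the abstract (deriving Indic-Indic pairs from an English-centric corpus) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `pivot_pairs`, the toy data, and the use of exact string matching on the English side are all assumptions made for illustration, and the target-side strings are placeholders rather than real translations.

```python
from collections import defaultdict
from itertools import combinations


def pivot_pairs(english_centric):
    """Derive X-Y sentence pairs from English-X and English-Y pairs.

    english_centric maps a language code to a list of
    (english_sentence, target_sentence) tuples. Two target sentences are
    paired when they share the exact same English sentence (a simplifying
    assumption; a real system may match more loosely).
    """
    # Index each language's target sentences by their English side.
    by_english = {lang: defaultdict(list) for lang in english_centric}
    for lang, pairs in english_centric.items():
        for en, tgt in pairs:
            by_english[lang][en].append(tgt)

    result = {}
    for a, b in combinations(sorted(english_centric), 2):
        # English sentences that have a translation in both languages.
        shared = by_english[a].keys() & by_english[b].keys()
        result[(a, b)] = [
            (sa, sb)
            for en in sorted(shared)
            for sa in by_english[a][en]
            for sb in by_english[b][en]
        ]
    return result


# Hypothetical toy corpus; target strings are placeholders.
toy = {
    "hi": [("Hello.", "hi-hello"), ("Thank you.", "hi-thanks")],
    "ta": [("Hello.", "ta-hello"), ("Good night.", "ta-night")],
    "bn": [("Hello.", "bn-hello"), ("Thank you.", "bn-thanks")],
}
pivoted = pivot_pairs(toy)

# With 11 languages, the number of unordered language pairs is C(11, 2).
n_pairs_11 = len(list(combinations(range(11), 2)))
```

The last line reproduces the count of 55 language pairs quoted in the abstract for 11 languages.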
Corpus-based translation research: its development and implications for general, literary and Bible translation
Corpus-based translation research emerged in the late 1990s as a new area of research in the discipline of translation studies. It is informed by a specific area of linguistics known as corpus linguistics which involves the analysis of large corpora of authentic running text by means of computer software. Within linguistics, this methodology has revolutionised lexicographic practices and methods of language teaching. In translation studies this kind of research involves using computerised corpora to study translated text, not in terms of its equivalence to source texts, but as a valid object of study in its own right. Corpus-based research in translation is concerned with revealing both the universal and the specific features of translation, through the interplay of theoretical constructs and hypotheses, variety of data, novel descriptive categories and a rigorous, flexible methodology, which can be applied to inductive and deductive research, as well as product- and process-oriented studies. In this article an overview is given of the research that has led to the formation of a new subdiscipline in translation studies, called Corpus-based Translation Studies or CTS. I also demonstrate how CTS tools and techniques can be used for the analysis of general and literary translations and therefore also for Bible translations.
(Acta Theologica, Supplementum 2, 2002: 70-106)
Machine learning for ancient languages: a survey
Ancient languages preserve the cultures and histories of the past. However, their study is fraught with difficulties, and experts must tackle a range of challenging text-based tasks, from deciphering lost languages to restoring damaged inscriptions, to determining the authorship of works of literature. Technological aids have long supported the study of ancient texts, but in recent years advances in artificial intelligence and machine learning have enabled analyses on a scale and in a detail that are reshaping the field of humanities, similarly to how microscopes and telescopes have contributed to the realm of science. This article aims to provide a comprehensive survey of published research using machine learning for the study of ancient texts written in any language, script, and medium, spanning over three and a half millennia of civilizations around the ancient world. To analyze the relevant literature, we introduce a taxonomy of tasks inspired by the steps involved in the study of ancient documents: digitization, restoration, attribution, linguistic analysis, textual criticism, translation, and decipherment. This work offers three major contributions: first, mapping the interdisciplinary field carved out by the synergy between the humanities and machine learning; second, highlighting how active collaboration between specialists from both fields is key to producing impactful and compelling scholarship; third, highlighting promising directions for future work in this field. Thus, this work promotes and supports the continued collaborative impetus between the humanities and machine learning.
One Model to Rule them all: Multitask and Multilingual Modelling for Lexical Analysis
When learning a new skill, you take advantage of your preexisting skills and
knowledge. For instance, if you are a skilled violinist, you will likely have
an easier time learning to play cello. Similarly, when learning a new language
you take advantage of the languages you already speak. For instance, if your
native language is Norwegian and you decide to learn Dutch, the lexical overlap
between these two languages will likely benefit your rate of language
acquisition. This thesis deals with the intersection of learning multiple tasks
and learning multiple languages in the context of Natural Language Processing
(NLP), which can be defined as the study of computational processing of human
language. Although these two types of learning may seem different on the
surface, we will see that they share many similarities.
The traditional approach in NLP is to consider a single task for a single
language at a time. However, recent advances allow for broadening this
approach, by considering data for multiple tasks and languages simultaneously.
This is an important approach to explore further as the key to improving the
reliability of NLP, especially for low-resource languages, is to take advantage
of all relevant data whenever possible. In doing so, the hope is that in the
long term, low-resource languages can benefit from the advances made in NLP
which are currently to a large extent reserved for high-resource languages.
This, in turn, may then have positive consequences for, e.g., language
preservation, as speakers of minority languages will have a lower degree of
pressure to use high-resource languages. In the short term, answering the
specific research questions posed should be of use to NLP researchers working
towards the same goal.
Comment: PhD thesis, University of Groningen
The History of the Normative Opposition of “Language versus Dialect”: From Its Graeco-Latin Origin to Central Europe’s Ethnolinguistic Nation-States
The concept of “a language” (Einzelsprache, that is, one of many extant languages) and its opposition to “dialect” (considered as a “non-language,” and thus subjugable to an already recognized language merely as “its” dialect) is the way people tend to think about languages in the West today. It appears to be a value-free, self-evident conception of the linguistic position. So much so that the concept of “language” was included neither in Immanuel Kant’s system of categories, nor in the authoritative Geschichtliche Grundbegriffe: Historisches Lexikon zur politisch sozialen Sprache in Deutschland. This paper sketches the rise of the “dialect vs language” opposition in classical Greek, its transposition onto classical Latin, and its transfer, through medieval and renaissance Latin, to the early modern period. On the way, the Greek and Latin terms for “language” (and also for “dialect”) sometimes functioned as synonyms for peoples (that is, ethnic groups), which – importantly – contributed to the rise of the normative equation of language with nation in the early nineteenth century. It was the beginning of the ethnolinguistic kind of nationalism that prevails to this day in Central Europe.
The History of the Normative Dichotomy of Language and Dialect: From Graeco-Latin Sources to the Ethnolinguistic States of Central Europe
The concept of a language as one of many (Einzelsprache), set in diametrical opposition to “dialect” (that is, a “non-language” that must normatively be assigned to some previously recognized language as one of its dialects), is the conceptual form through which languages are perceived and discussed in the contemporary Western world. Because it is universally accepted, this conceptual form seems so obvious and free of ideological coloring that Immanuel Kant did not include language in his proposed system of philosophical categories, and neither did the authors of the highly influential work of historiography and political sociology tellingly titled Geschichtliche Grundbegriffe: Historisches Lexikon zur politisch sozialen Sprache in Deutschland. In this article I present the emergence of the opposition between language and dialect in ancient Greek and its reception in Latin from Roman antiquity to the early modern period. Over the centuries it became customary to use the Greek and Latin terms for “language” as synonyms for peoples (or ethnic groups), which in the early nineteenth century strongly influenced the formation of the normative equation of language with nation. This was the beginning of the phenomenon known as “ethnolinguistic nationalism,” which at the level of states dominates to this day throughout Central Europe.
Corpora compilation for prosody-informed speech processing
Research on speech technologies requires spoken data, which is usually obtained from recordings of read speech and adapted specifically to the research needs. When the aim is to deal with the prosody involved in speech, the available data must reflect natural, conversational speech, which is usually costly and difficult to obtain. This paper presents a machine learning-oriented toolkit for collecting, handling, and visualizing speech data using prosodic heuristics. We present two corpora resulting from these methodologies: the PANTED corpus, containing 250 h of English speech from TED Talks, and the Heroes corpus, containing 8 h of parallel English and Spanish movie speech. We demonstrate their use in two deep learning-based applications: punctuation restoration and machine translation. The presented corpora are freely available to the research community.