588 research outputs found
Unsupervised Code-Switching for Multilingual Historical Document Transcription
Transcribing documents from the printing press era, a challenge in its own right, is more complicated when documents interleave multiple languages, a common feature of 16th-century texts. Additionally, many of these documents precede consistent orthographic conventions, making the task even harder. We extend the state-of-the-art historical OCR model of Berg-Kirkpatrick et al. (2013) to handle word-level code-switching between multiple languages. Further, we enable our system to handle spelling variability, including now-obsolete shorthand systems used by printers. Our results show average relative character error reductions of 14% across a variety of historical texts.
Analysis of Data Augmentation Methods for Low-Resource Maltese ASR
Recent years have seen an increased interest in the computational speech
processing of Maltese, but resources remain sparse. In this paper, we consider
data augmentation techniques for improving speech recognition for low-resource
languages, focusing on Maltese as a test case. We consider three different
types of data augmentation: unsupervised training, multilingual training and
the use of synthesized speech as training data. The goal is to determine which
of these techniques, or which combination of them, is most effective at improving
speech recognition for languages where the starting point is a small corpus of
approximately 7 hours of transcribed speech. Our results show that combining
the data augmentation techniques studied here leads to an absolute WER
improvement of 15% without the use of a language model.
Comment: 12 pages
Hierarchical Character-Word Models for Language Identification
Social media messages' brevity and unconventional spelling pose a challenge
to language identification. We introduce a hierarchical model that learns
character and contextualized word-level representations for language
identification. Our method performs well against strong baselines, and can
also reveal code-switching.
Computational Sociolinguistics: A Survey
Language is a social phenomenon and variation is inherent to its social
nature. Recently, there has been a surge of interest within the computational
linguistics (CL) community in the social dimension of language. In this article
we present a survey of the emerging field of "Computational Sociolinguistics"
that reflects this increased interest. We aim to provide a comprehensive
overview of CL research on sociolinguistic themes, featuring topics such as the
relation between language and social identity, language use in social
interaction and multilingual communication. Moreover, we demonstrate the
potential for synergy between the research communities involved, by showing how
the large-scale data-driven methods that are widely used in CL can complement
existing sociolinguistic studies, and how sociolinguistics can inform and
challenge the methods and assumptions employed in CL studies. We hope to convey
the possible benefits of a closer collaboration between the two communities and
conclude with a discussion of open challenges.
Comment: To appear in Computational Linguistics. Accepted for publication: 18th February, 201
A Sentiment Analysis Dataset for Code-Mixed Malayalam-English
There is an increasing demand for sentiment analysis of social media text,
which is mostly code-mixed. Systems trained on monolingual data fail on
code-mixed data due to the complexity of mixing at different levels of the
text. However, very few resources are available for code-mixed data to create
models specific for this data. Although much research in multilingual and
cross-lingual sentiment analysis has used semi-supervised or unsupervised
methods, supervised methods still perform better. Only a few datasets for
popular languages such as English-Spanish, English-Hindi, and English-Chinese
are available. There are no resources available for Malayalam-English
code-mixed data. This paper presents a new gold standard corpus for sentiment
analysis of code-mixed text in Malayalam-English annotated by volunteer
annotators. The corpus obtained a Krippendorff's alpha above 0.8. We use this
new corpus to provide a benchmark for sentiment
analysis of Malayalam-English code-mixed texts.
Natural language processing for similar languages, varieties, and dialects: A survey
There has been a lot of recent interest in the natural language processing (NLP) community in the computational processing of language varieties and dialects, with the aim to improve the performance of applications such as machine translation, speech recognition, and dialogue systems. Here, we attempt to survey this growing field of research, with a focus on computational methods for processing similar languages, varieties, and dialects. In particular, we discuss the most important challenges when dealing with diatopic language variation, and we present some of the available datasets, the process of data collection, and the most common data collection strategies used to compile datasets for similar languages, varieties, and dialects. We further present a number of studies on computational methods developed and/or adapted for preprocessing, normalization, part-of-speech tagging, and parsing similar languages, language varieties, and dialects. Finally, we discuss relevant applications such as language and dialect identification and machine translation for closely related languages, language varieties, and dialects.
Non peer reviewed