Search CORE

588 research outputs found

Unsupervised Code-Switching for Multilingual Historical Document Transcription

Author: Dan Garrette
Dan Klein
Hannah Alpert-abrams
Taylor Berg-kirkpatrick
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/01/2015
Field of study

Transcribing documents from the printing press era, a challenge in its own right, is more complicated when documents interleave multiple languages—a common feature of 16th century texts. Additionally, many of these documents precede consistent ortho-graphic conventions, making the task even harder. We extend the state-of-the-art his-torical OCR model of Berg-Kirkpatrick et al. (2013) to handle word-level code-switching between multiple languages. Further, we en-able our system to handle spelling variabil-ity, including now-obsolete shorthand systems used by printers. Our results show average rel-ative character error reductions of 14 % across a variety of historical texts.

CiteSeerX

Crossref

Analysis of Data Augmentation Methods for Low-Resource Maltese ASR

Author: Borg Claudia
DeMarco Andrea
Gatt Albert
Mena Carlos
van der Plas Lonneke
Williams Aiden
Publication venue
Publication date: 20/01/2023
Field of study

Recent years have seen an increased interest in the computational speech processing of Maltese, but resources remain sparse. In this paper, we consider data augmentation techniques for improving speech recognition for low-resource languages, focusing on Maltese as a test case. We consider three different types of data augmentation: unsupervised training, multilingual training and the use of synthesized speech as training data. The goal is to determine which of these techniques, or combination of them, is the most effective to improve speech recognition for languages where the starting point is a small corpus of approximately 7 hours of transcribed speech. Our results show that combining the data augmentation techniques studied here lead us to an absolute WER improvement of 15% without the use of a language model.Comment: 12 page

arXiv.org e-Print Archive

Hierarchical Character-Word Models for Language Identification

Author: Hathi Shobhit
Jaech Aaron
Mulcaire George
Ostendorf Mari
Smith Noah A.
Publication venue
Publication date: 01/01/2016
Field of study

Social media messages' brevity and unconventional spelling pose a challenge to language identification. We introduce a hierarchical model that learns character and contextualized word-level representations for language identification. Our method performs well against strong base- lines, and can also reveal code-switching

arXiv.org e-Print Archive

Crossref

Computational Sociolinguistics: A Survey

Author: de Jong Franciska
Doğruöz A. Seza
Nguyen Dong
Rosé Carolyn P.
Publication venue
Publication date: 01/01/2016
Field of study

Language is a social phenomenon and variation is inherent to its social nature. Recently, there has been a surge of interest within the computational linguistics (CL) community in the social dimension of language. In this article we present a survey of the emerging field of "Computational Sociolinguistics" that reflects this increased interest. We aim to provide a comprehensive overview of CL research on sociolinguistic themes, featuring topics such as the relation between language and social identity, language use in social interaction and multilingual communication. Moreover, we demonstrate the potential for synergy between the research communities involved, by showing how the large-scale data-driven methods that are widely used in CL can complement existing sociolinguistic studies, and how sociolinguistics can inform and challenge the methods and assumptions employed in CL studies. We hope to convey the possible benefits of a closer collaboration between the two communities and conclude with a discussion of open challenges.Comment: To appear in Computational Linguistics. Accepted for publication: 18th February, 201

arXiv.org e-Print Archive

Crossref

Ghent University Academic Bibliography

EUR Research Repository

University of Twente Research Information

A Sentiment Analysis Dataset for Code-Mixed Malayalam-English

Author: Chakravarthi Bharathi Raja
Jose Navya
McCrae John P.
Sherly Elizabeth
Suryawanshi Shardul
Publication venue
Publication date: 12/05/2020
Field of study

There is an increasing demand for sentiment analysis of text from social media which are mostly code-mixed. Systems trained on monolingual data fail for code-mixed data due to the complexity of mixing at different levels of the text. However, very few resources are available for code-mixed data to create models specific for this data. Although much research in multilingual and cross-lingual sentiment analysis has used semi-supervised or unsupervised methods, supervised methods still performs better. Only a few datasets for popular languages such as English-Spanish, English-Hindi, and English-Chinese are available. There are no resources available for Malayalam-English code-mixed data. This paper presents a new gold standard corpus for sentiment analysis of code-mixed text in Malayalam-English annotated by voluntary annotators. This gold standard corpus obtained a Krippendorff's alpha above 0.8 for the dataset. We use this new corpus to provide the benchmark for sentiment analysis in Malayalam-English code-mixed texts

arXiv.org e-Print Archive

Irish Universities

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY

Access to Research at National University of Ireland, Galway

Natural language processing for similar languages, varieties, and dialects: A survey

Author: Nakov Preslav
Scherrer Yves
Zampieri Marcos
Publication venue
Publication date: 20/11/2020
Field of study

There has been a lot of recent interest in the natural language processing (NLP) community in the computational processing of language varieties and dialects, with the aim to improve the performance of applications such as machine translation, speech recognition, and dialogue systems. Here, we attempt to survey this growing field of research, with focus on computational methods for processing similar languages, varieties, and dialects. In particular, we discuss the most important challenges when dealing with diatopic language variation, and we present some of the available datasets, the process of data collection, and the most common data collection strategies used to compile datasets for similar languages, varieties, and dialects. We further present a number of studies on computational methods developed and/or adapted for preprocessing, normalization, part-of-speech tagging, and parsing similar languages, language varieties, and dialects. Finally, we discuss relevant applications such as language and dialect identification and machine translation for closely related languages, language varieties, and dialects.Non peer reviewe

Helsingin yliopiston digitaalinen arkisto

Flexible Techniques for Automatic Text Recognition of Historical Documents

Author: Ströbel Phillip
Publication venue
Publication date: 01/01/2023
Field of study

ZORA