Style Variation as a Vantage Point for Code-Switching
Code-Switching (CS) is a common phenomenon in many bilingual and
multilingual communities and has become increasingly prevalent on digital and
social media platforms. This growing prominence creates a need to model CS
languages for critical downstream tasks. A major problem in this domain is the
dearth of annotated data and of substantial corpora for training large-scale
neural models. Generating large amounts of high-quality text assists several
downstream tasks that rely heavily on language modeling, such as speech
recognition and text-to-speech synthesis. We present a novel vantage point:
treating CS as style variation between the two participating languages. Our
approach does not need any external annotations such as lexical language IDs.
It relies only on easily obtainable monolingual corpora, without any parallel
alignment, and a limited set of naturally CS sentences. We propose a two-stage
adversarial training approach where the first stage generates competitive
negative examples for CS and the second stage generates more realistic CS
sentences. We present our experiments on the following pairs of languages:
Spanish-English, Mandarin-English, Hindi-English and Arabic-French. We show
that the trends in metrics for the generated CS text move closer to those of
real CS data for each of the above language pairs through the two-stage
training process. We believe this view of CS as style variation opens new
perspectives for modeling various tasks in CS text.
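The notion of "competitive negative examples" for CS can be illustrated with a toy sketch. This is not the paper's adversarial model: it merely approximates a naive stage-one generator by positionally mixing words from two non-parallel monolingual sentences. The function name, the switch probability, and the example sentences are all illustrative assumptions.

```python
import random

def naive_mix(l1_tokens, l2_tokens, switch_prob=0.3, seed=0):
    """Build a synthetic code-switched candidate by randomly substituting
    tokens from a second monolingual stream at the same positions.
    A real model would learn *where* to switch; random substitution like
    this is exactly the kind of weak negative a discriminator can exploit."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    mixed = []
    for i, tok in enumerate(l1_tokens):
        if rng.random() < switch_prob and i < len(l2_tokens):
            mixed.append(l2_tokens[i])  # naive positional substitution
        else:
            mixed.append(tok)
    return mixed

en = "I am going to the market today".split()
es = "Yo voy al mercado hoy ahora mismo".split()
print(" ".join(naive_mix(en, es)))
```

Every output token comes from one of the two streams at the same position, so the candidate is fluent locally but not syntactically grounded, which is what makes it a "competitive negative" rather than realistic CS.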
Codeswitched Sentence Creation using Dependency Parsing
Codeswitching has become one of the most common phenomena among
multilingual speakers of the world, especially in countries like India, which
encompasses around 23 official languages and roughly 300 million bilingual
speakers. The scarcity of Codeswitched data is a bottleneck for exploring this
domain across various Natural Language Processing (NLP) tasks. We thus present
a novel algorithm that
harnesses the syntactic structure of English grammar to develop grammatically
sensible Codeswitched versions of English-Hindi, English-Marathi and
English-Kannada data. Besides largely preserving grammatical sanity, our
methodology also guarantees abundant generation of data from a minuscule
snapshot of the given data. We use multiple datasets to showcase the
capabilities of our algorithm, and we assess the quality of the generated
Codeswitched data using qualitative metrics while providing baseline results
for a couple of NLP tasks.
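The idea of switching only at syntactically safe positions can be sketched as follows. This is not the paper's algorithm: the dependency labels (Universal Dependencies style), the set of switchable relations, and the tiny English-to-romanized-Hindi lexicon are all illustrative assumptions, and the parse is hand-specified rather than produced by a parser.

```python
# Hypothetical bilingual lexicon (English -> romanized Hindi), for illustration.
LEXICON = {"market": "bazaar", "today": "aaj", "vegetables": "sabziyaan"}

# Assumed set of dependency relations where substitution tends to stay
# grammatical: objects, obliques, and adverbial modifiers.
SWITCHABLE_DEPS = {"obj", "obl", "advmod"}

def switch(tokens):
    """tokens: list of (word, dependency_label) pairs.
    Replace a word only when its relation is switchable AND the lexicon
    has a translation; everything else is kept to preserve the matrix
    language's grammatical frame."""
    out = []
    for word, dep in tokens:
        if dep in SWITCHABLE_DEPS and word in LEXICON:
            out.append(LEXICON[word])
        else:
            out.append(word)
    return out

# Hand-specified parse of "I bought vegetables at the market today".
sent = [("I", "nsubj"), ("bought", "root"), ("vegetables", "obj"),
        ("at", "case"), ("the", "det"), ("market", "obl"), ("today", "advmod")]
print(" ".join(switch(sent)))
# → "I bought sabziyaan at the bazaar aaj"
```

Because substitution is gated on the dependency relation, the English function words and verb frame stay intact, which is one simple way a syntax-aware generator can keep its output grammatically sensible while multiplying data from a small seed set.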