Style Variation as a Vantage Point for Code-Switching
Code-switching (CS) is a common phenomenon in many bilingual and
multilingual communities and is increasingly prevalent on digital and social
media platforms. This growing prominence creates a need to model CS
languages for critical downstream tasks. A major problem in this domain is the
dearth of annotated data and of substantial corpora for training large-scale
neural models. Generating large amounts of quality text assists several
downstream tasks that rely heavily on language modeling, such as speech
recognition and text-to-speech synthesis. We present a novel vantage point:
treating CS as style variation between the two participating languages. Our
approach does not need any external annotations such as lexical language
IDs. It mainly
relies on easily obtainable monolingual corpora without any parallel alignment
and a limited set of naturally CS sentences. We propose a two-stage generative
adversarial training approach where the first stage generates competitive
negative examples for CS and the second stage generates more realistic CS
sentences. We present our experiments on the following pairs of languages:
Spanish-English, Mandarin-English, Hindi-English, and Arabic-French. We show
that, through the dual-stage training process, metrics for generated CS move
closer to those of real CS data in each of these language pairs. We believe
this viewpoint of CS as style variation opens new perspectives for modeling
various tasks in CS text.
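The two-stage idea above can be caricatured in a few lines: a naive generator lexically mixes monolingual sentences to produce cheap candidates, and a scorer, standing in for the trained discriminator, keeps only the most plausible one. Everything below, the mixing rule, the scoring heuristic, and the toy sentences, is an illustrative assumption, not the paper's actual model.

```python
import random

random.seed(0)

def naive_mix(words_l1, words_l2, p_switch=0.5):
    """Stage-1 caricature: word-level mixing of two monolingual sentences."""
    out, tags = [], []
    for w1, w2 in zip(words_l1, words_l2):
        if random.random() < p_switch:
            out.append(w2)
            tags.append("L2")
        else:
            out.append(w1)
            tags.append("L1")
    return out, tags

def realism_score(tags):
    """Stand-in discriminator: penalise rapid back-and-forth switching."""
    switches = sum(1 for a, b in zip(tags, tags[1:]) if a != b)
    return 1.0 / (1 + switches)

# Stage-2 caricature: over-generate candidates, keep the most plausible one.
en = "I want to eat dinner now".split()
es = "Yo quiero comer la cena ahora".split()
candidates = [naive_mix(en, es) for _ in range(50)]
best_words, best_tags = max(candidates, key=lambda c: realism_score(c[1]))
print(" ".join(best_words))
```

In the real system both stages are learned adversarially; here the generator is random and the scorer is a fixed heuristic, which only illustrates the generate-then-refine structure.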
Are Multilingual Models Effective in Code-Switching?
Multilingual language models have shown decent performance in multilingual
and cross-lingual natural language understanding tasks. However, the power of
these multilingual models in code-switching tasks has not been fully explored.
In this paper, we study the effectiveness of multilingual language models in
the mixed-language setting, assessing their practicality in terms of inference
speed, performance, and number of parameters. We conduct experiments on three
language pairs for named entity recognition and part-of-speech tagging and
compare them with
existing methods, such as using bilingual embeddings and multilingual
meta-embeddings. Our findings suggest that pre-trained multilingual models do
not necessarily guarantee high-quality representations on code-switched text,
while using meta-embeddings achieves similar results with significantly fewer
parameters.
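As a concrete reference point for what "meta-embeddings" means here, the sketch below combines one word's vectors from two embedding spaces via learned projections into a shared dimension, followed by softmax attention weighting. The dimensions, random matrices, and scoring vector are illustrative assumptions; the paper's multilingual meta-embedding model is more involved.

```python
import numpy as np

rng = np.random.default_rng(0)
d_en, d_es, d_meta = 300, 200, 128

# One word's vectors from two monolingual embedding spaces (random stand-ins).
v_en = rng.normal(size=d_en)
v_es = rng.normal(size=d_es)

# Learned projections into the shared meta space (random here for illustration).
P_en = rng.normal(size=(d_meta, d_en)) / np.sqrt(d_en)
P_es = rng.normal(size=(d_meta, d_es)) / np.sqrt(d_es)
projected = np.stack([P_en @ v_en, P_es @ v_es])   # shape (2, d_meta)

# Attention: score each projected vector, softmax, then weighted sum.
w_att = rng.normal(size=d_meta)
scores = projected @ w_att
alphas = np.exp(scores - scores.max())
alphas /= alphas.sum()
meta = (alphas[:, None] * projected).sum(axis=0)   # shape (d_meta,)
print(meta.shape, alphas)
```

The parameter saving the abstract mentions comes from reusing frozen pre-trained embeddings: only the projections and the attention vector would be trained.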
El Volumen Louder Por Favor: Code-switching in Task-oriented Semantic Parsing
Being able to parse code-switched (CS) utterances, such as Spanish+English or
Hindi+English, is essential to democratize task-oriented semantic parsing
systems for certain locales. In this work, we focus on Spanglish
(Spanish+English) and release a dataset, CSTOP, containing 5800 CS utterances
alongside their semantic parses. We examine the CS generalizability of various
Cross-lingual (XL) models and exhibit the advantage of pre-trained XL language
models when data for only one language is present. We therefore focus on
improving the pre-trained models for the case where only an English corpus,
alongside either zero or a few CS training instances, is available. We propose
two data augmentation methods for the zero-shot and the few-shot settings:
fine-tuning with translate-and-align, and augmenting with a generation model
followed by match-and-filter. Combining the few-shot setting with the above
improvements decreases the initial 30-point accuracy gap between the zero-shot
and full-data settings by two-thirds.
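To make the translate-and-align idea concrete: given an English utterance, its translation, and word alignments, swapping aligned spans yields synthetic code-switched training data. The sentences, alignment table, and span choices below are made-up assumptions for illustration, not CSTOP data or the authors' pipeline.

```python
# Toy translate-and-align augmentation: replace aligned English spans with
# their Spanish counterparts to synthesise Spanglish utterances.
en = ["turn", "the", "volume", "up", "please"]
es = ["sube", "el", "volumen", "por", "favor"]
# Word alignment: English index -> Spanish index(es). Assumed, not computed.
align = {1: [1], 2: [2], 4: [3, 4]}

def code_switch(en, es, align, switch_idxs):
    """Swap the chosen aligned English positions for their Spanish spans."""
    out = []
    for i, w in enumerate(en):
        if i in switch_idxs and i in align:
            out.extend(es[j] for j in align[i])
        else:
            out.append(w)
    return " ".join(out)

print(code_switch(en, es, align, {2, 4}))  # turn the volumen up por favor
```

In a full pipeline the semantic parse attached to the English utterance would be carried over to each synthetic variant, since span swaps leave the intent and slot structure intact.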