Hindi-English Code-Switching Speech Corpus
Code-switching refers to the usage of two languages within a sentence or
discourse. It is a global phenomenon among multilingual communities and has
emerged as an independent area of research. With the increasing demand for
code-switching automatic speech recognition (ASR) systems, the development of a
code-switching speech corpus has become highly desirable. However, very limited
code-switched resources are available as yet for training such systems. In
this work, we present our first efforts in building a code-switching ASR system
in the Indian context. For that purpose, we have created a Hindi-English
code-switching speech database. The database not only contains speech
utterances with code-switching properties but also covers session and speaker
variations such as pronunciation, accent, age, and gender. This database can be
applied in several speech signal processing applications, such as
code-switching ASR, language identification, language modeling, and speech
synthesis. This paper mainly presents an analysis of the statistics of the
collected code-switching speech corpus. Later, the performance results for the
ASR task are reported for the created database.
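ASR performance on such a corpus is conventionally reported as word error rate (WER), the word-level edit distance between the reference transcript and the hypothesis, normalized by the reference length. A minimal sketch (the function name and example sentences are illustrative, not from the paper):

```python
def wer(ref, hyp):
    """Word error rate: word-level edit distance divided by reference length."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between first i reference and first j hypothesis words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # deleting all i reference words
    for j in range(len(h) + 1):
        d[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(r)][len(h)] / len(r)
```

For a four-word reference with one substituted word, this yields a WER of 0.25.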
Sentiment Analysis of Code-Mixed Indian Languages: An Overview of SAIL_Code-Mixed Shared Task @ICON-2017
Sentiment analysis is essential in many real-world applications such as
stance detection, review analysis, recommendation systems, and so on. Sentiment
analysis becomes more difficult when the data is noisy and collected from
social media. India is a multilingual country; people use more than one
language to communicate among themselves. Switching between languages is
called code-switching or code-mixing, depending upon the type of mixing. This
paper presents an overview of the shared task on sentiment analysis of
code-mixed Hindi-English and Bengali-English data collected from different
social media platforms. The paper describes the task, dataset, evaluation,
baseline, and participants' systems.
An Ensemble Model for Sentiment Analysis of Hindi-English Code-Mixed Data
In multilingual societies like India, code-mixed social media texts comprise
a large share of content on the Internet. Detecting the sentiment of
code-mixed user opinions plays a crucial role in understanding social,
economic, and political trends. In this paper, we propose an ensemble of a
character-trigram-based LSTM model and a word-n-gram-based Multinomial Naive
Bayes (MNB) model to identify the sentiments of Hindi-English (Hi-En)
code-mixed data. The ensemble model combines the strengths of rich sequential
patterns from the LSTM model and the polarity of keywords from the
probabilistic n-gram model to identify sentiments in sparse and inconsistent
code-mixed data. Experiments on real-life user code-mixed data reveal that our
approach yields state-of-the-art results compared to several baselines and
other proposed deep-learning-based methods.
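The word-n-gram MNB component of such an ensemble can be sketched in pure Python as follows (the toy training sentences are made up; the paper's full system additionally includes the character-trigram LSTM, omitted here):

```python
import math
from collections import Counter

def word_ngrams(text, n=1):
    """Word n-gram features; n=1 gives plain unigrams."""
    toks = text.lower().split()
    return [" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)]

class MultinomialNB:
    """Multinomial naive Bayes with add-alpha smoothing over word n-grams."""
    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        n = len(labels)
        self.log_prior = {c: math.log(labels.count(c) / n) for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}
        vocab = set()
        for doc, y in zip(docs, labels):
            feats = word_ngrams(doc)
            self.counts[y].update(feats)
            vocab.update(feats)
        self.v = len(vocab)
        self.totals = {c: sum(self.counts[c].values()) for c in self.classes}
        return self

    def predict(self, doc):
        # Score each class by log prior plus smoothed log likelihoods of features.
        scores = {}
        for c in self.classes:
            s = self.log_prior[c]
            denom = self.totals[c] + self.alpha * self.v
            for f in word_ngrams(doc):
                s += math.log((self.counts[c][f] + self.alpha) / denom)
            scores[c] = s
        return max(scores, key=scores.get)
```

Smoothing is what lets the model cope with the sparse, inconsistent vocabulary of code-mixed text: unseen words contribute a small uniform penalty instead of zeroing out a class.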
A Fast, Compact, Accurate Model for Language Identification of Codemixed Text
We address fine-grained multilingual language identification: providing a
language code for every token in a sentence, including codemixed text
containing multiple languages. Such text is prevalent online, in documents,
social media, and message boards. We show that a feed-forward network with a
simple globally constrained decoder can accurately and rapidly label both
codemixed and monolingual text in 100 languages and 100 language pairs. This
model outperforms previously published multilingual approaches in terms of both
accuracy and speed, yielding an 800x speed-up and a 19.5% averaged absolute
gain on three codemixed datasets. It furthermore outperforms several benchmark
systems on monolingual language identification.
Comment: EMNLP 201
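Token-level language identification can be illustrated in miniature with per-language character-bigram profiles scoring each token independently (a far simpler scheme than the paper's feed-forward network with a globally constrained decoder; the wordlists below are invented):

```python
import math
from collections import Counter

def char_ngrams(word, n=2):
    """Character n-grams with start/end padding markers."""
    padded = f"^{word.lower()}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def build_profile(words):
    """Relative character-bigram frequencies for one language's wordlist."""
    counts = Counter()
    for w in words:
        counts.update(char_ngrams(w))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

def label_tokens(tokens, profiles, floor=1e-4):
    """Assign each token the language whose profile gives the highest log-score."""
    labels = []
    for tok in tokens:
        scores = {
            lang: sum(math.log(prof.get(g, floor)) for g in char_ngrams(tok))
            for lang, prof in profiles.items()
        }
        labels.append(max(scores, key=scores.get))
    return labels
```

Unlike this per-token scorer, the paper's global constraint restricts the set of languages allowed within one sentence, which is what makes joint labeling of codemixed and monolingual text consistent.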
Joint Language Identification of Code-Switching Speech using Attention based E2E Network
Language identification (LID) has relevance in many speech processing
applications. For the automatic recognition of code-switching speech,
conventional approaches often employ an LID system for detecting the languages
present within an utterance. In existing works, LID on code-switching speech
involves modelling the underlying languages separately. In this work, we
propose a joint-modelling-based LID system for code-switching speech. To this
end, an attention-based end-to-end (E2E) network has been explored. For the
development and evaluation of the proposed approach, a recently created
Hindi-English code-switching corpus has been used. For contrast, an LID system
employing a connectionist temporal classification (CTC)-based E2E network is
also developed. On comparing the two LID systems, the attention-based approach
is noted to yield better LID accuracy. By plotting the attention weights of
the E2E network, we demonstrate that the proposed approach effectively locates
code-switching boundaries within the utterance.
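The attention mechanism that produces such interpretable weights can be sketched as a softmax-weighted average over frame vectors (a generic sketch, not the paper's network; the per-frame scoring function here stands in for learned attention scores):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(frames, score_fn):
    """Softmax-weighted average of frame vectors.

    Returns the pooled vector and the per-frame weights; peaks in the
    weights indicate which frames the model attends to.
    """
    weights = softmax([score_fn(f) for f in frames])
    dim = len(frames[0])
    pooled = [sum(w * f[i] for w, f in zip(weights, frames)) for i in range(dim)]
    return pooled, weights
```

Plotting the returned weights over time is exactly the kind of inspection the paper uses to show where code-switching boundaries fall within an utterance.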
Feature Selection on Noisy Twitter Short Text Messages for Language Identification
The task of written language identification typically involves detecting
the languages present in a sample of text. Moreover, a sequence of text may
not belong to a single language but may instead be a mixture of text written
in multiple languages. Such text is generated in large volumes on social media
platforms owing to their flexible and user-friendly environments. It contains
a very large number of features, which are essential for the development of
statistical, probabilistic, and other kinds of language models. This large
feature set includes rich as well as irrelevant and redundant features, which
have a diverse effect on the performance of the learning model. Therefore,
feature selection methods are significant for choosing the features that are
most relevant to an efficient model. In this article, we consider the
Hindi-English language identification task, as Hindi and English are the two
most widely spoken languages of India. We apply different feature selection
algorithms across various learning algorithms in order to analyze the effect
of the algorithm, as well as the number of features, on the performance of the
task. The methodology focuses on word-level language identification using a
novel dataset of 6903 tweets extracted from Twitter. Various n-gram profiles
are examined with different feature selection algorithms over many
classifiers. Finally, an exhaustive comparative analysis is presented with
respect to the overall experiments conducted for the task.
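One widely used feature selection criterion for such n-gram features is the chi-square statistic on the 2x2 contingency table of feature presence versus language label. A minimal sketch (the toy tokens are invented, not drawn from the article's dataset, and the article itself compares several selection algorithms, not only chi-square):

```python
def char_bigrams(word):
    """Set of character bigrams in a token."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def chi2(tokens, labels, feature):
    """Chi-square statistic for one binary feature against a binary label."""
    classes = sorted(set(labels))
    a = b = c = d = 0  # 2x2 contingency: presence x class
    for tok, lab in zip(tokens, labels):
        present = feature in char_bigrams(tok)
        if present and lab == classes[0]:
            a += 1
        elif present:
            b += 1
        elif lab == classes[0]:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

def select_top_k(tokens, labels, k):
    """Rank all bigram features by chi-square and keep the top k."""
    feats = sorted(set().union(*(char_bigrams(t) for t in tokens)))
    return sorted(feats, key=lambda f: (-chi2(tokens, labels, f), f))[:k]
```

A feature that occurs only in tokens of one language attains the maximum score, while a feature split evenly across languages scores zero, which is the intuition behind discarding irrelevant and redundant n-grams.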
Towards Emotion Recognition in Hindi-English Code-Mixed Data: A Transformer Based Approach
In the last few years, emotion detection in social-media text has become a
popular problem due to its wide-ranging applications in better understanding
consumers, in psychology, in aiding human-computer interaction, and in
designing smart systems. Because of the availability of huge amounts of
social-media data, which is regularly used for expressing sentiments and
opinions, this problem has garnered great attention. In this paper, we present
a Hinglish dataset labelled for emotion detection. We highlight a
deep-learning-based approach for detecting emotions in Hindi-English
code-mixed tweets, using bilingual word embeddings derived from FastText and
Word2Vec, as well as transformer-based models. We experiment with various deep
learning models, including CNNs, LSTMs, and Bi-directional LSTMs (with and
without attention), along with transformers such as BERT, RoBERTa, and ALBERT.
The transformer-based BERT model outperforms all other models, giving the best
performance with an accuracy of 71.43%.
Investigating Target Set Reduction for End-to-End Speech Recognition of Hindi-English Code-Switching Data
End-to-end (E2E) systems are fast replacing conventional systems in the
domain of automatic speech recognition. As the target labels are learned
directly from speech data, E2E systems need a larger corpus for effective
training. In the context of the code-switching task, E2E systems face two
challenges: (i) the expansion of the target set due to the multiple languages
involved, and (ii) the lack of a sufficiently large domain-specific corpus. To
address these challenges, we propose an approach for reducing the number of
target labels for reliable training of E2E systems on limited data. The
efficacy of the proposed approach has been demonstrated on two prominent
architectures, namely CTC-based and attention-based E2E networks. The
experimental validations are performed on a recently created Hindi-English
code-switching corpus. For contrast, the results for a full-target-set E2E
system and a hybrid DNN-HMM system are also reported.
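The idea of target-set reduction can be illustrated with a toy mapping that merges similar language-specific units into shared targets, shrinking the output layer the E2E network must learn (the merge table below is entirely invented for illustration and is not the paper's actual reduction scheme):

```python
# Hypothetical merge table: language-specific units -> shared reduced targets.
REDUCE_MAP = {
    "hi_k": "K", "en_k": "K",  # similar stop consonants share one label
    "hi_a": "A", "en_a": "A",  # similar open vowels share one label
    "hi_m": "M", "en_m": "M",
}

def reduce_targets(label_seq):
    """Map a sequence of language-specific labels to the reduced target set.

    Labels without a merge entry pass through unchanged.
    """
    return [REDUCE_MAP.get(lab, lab) for lab in label_seq]

def target_set_size(sequences):
    """Number of distinct target labels across all training sequences."""
    return len({lab for seq in sequences for lab in seq})
```

With limited data, each merged target accumulates training examples from both languages, which is the motivation for reducing the target set in the first place.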
Is this word borrowed? An automatic approach to quantify the likeliness of borrowing in social media
Code-mixing or code-switching is the effortless phenomenon of naturally
switching between two or more languages in a single conversation. The use of a
foreign word in a language, however, does not necessarily mean that the
speaker is code-switching, because languages often borrow lexical items from
other languages. If a word is borrowed, it becomes a part of the lexicon of a
language, whereas, during code-switching, the speaker is aware that the
conversation involves foreign words or phrases. Identifying whether a foreign
word used by a bilingual speaker is due to borrowing or code-switching is of
fundamental importance to theories of multilingualism, and an essential
prerequisite for the development of language and speech technologies for
multilingual communities. In this paper, we present a series of novel
computational methods to quantify the likeliness that a word is borrowed,
based on social media signals. We first propose a context-based clustering
method to sample a set of candidate words from the social media data. Next, we
propose three novel and similar metrics based on the usage of these words by
users in different tweets; these metrics were used to score and rank the
candidate words, indicating their likeliness of being borrowed. We compare
these rankings with a ground-truth ranking constructed through a human
judgment experiment. The Spearman rank correlation between the two rankings
(nearly 0.62 for all three metric variants) is more than double the value
(0.26) of the most competitive existing baseline reported in the literature.
Some other striking observations are: (i) the correlation is higher for the
ground-truth data elicited from the younger participants (age less than 30)
than from the older participants, and (ii) those participants who use mixed
language for tweeting the least provide the best signals of borrowing.
Comment: 11 pages, 3 figures
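The agreement numbers quoted above are Spearman rank correlations. For reference, a minimal sketch of the coefficient for rankings without ties (toy inputs, not the paper's data):

```python
def ranks(values):
    """Rank positions (1 = smallest); assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    """Spearman rank correlation for two equal-length score lists without ties."""
    n = len(x)
    rx, ry = ranks(x), ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

A value of 1.0 means the metric ranks the candidate words exactly as the human judges did; 0.62 versus the baseline's 0.26 therefore reflects substantially closer agreement with the ground truth.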
LinCE: A Centralized Benchmark for Linguistic Code-switching Evaluation
Recent trends in NLP research have raised interest in linguistic
code-switching (CS); modern approaches have been proposed to solve a wide range
of NLP tasks on multiple language pairs. Unfortunately, these proposed methods
are hardly generalizable to different code-switched languages. In addition, it
is unclear whether a model architecture is applicable for a different task
while still being compatible with the code-switching setting. This is mainly
because of the lack of a centralized benchmark and the sparse corpora that
researchers employ based on their specific needs and interests. To facilitate
research in this direction, we propose a centralized benchmark for Linguistic
Code-switching Evaluation (LinCE) that combines ten corpora covering four
different code-switched language pairs (i.e., Spanish-English, Nepali-English,
Hindi-English, and Modern Standard Arabic-Egyptian Arabic) and four tasks
(i.e., language identification, named entity recognition, part-of-speech
tagging, and sentiment analysis). As part of the benchmark centralization
effort, we provide an online platform at ritual.uh.edu/lince, where researchers
can submit their results and compare with others in real time. In addition,
we provide the scores of several popular models, including LSTM, ELMo, and
multilingual BERT, so that the NLP community can compare against
state-of-the-art systems. LinCE is a continuous effort, and we will expand it
with more low-resource languages and tasks.
Comment: Accepted to LREC 202