Hindi-English Code-Switching Speech Corpus
Code-switching refers to the use of two languages within a sentence or
discourse. It is a global phenomenon among multilingual communities and has
emerged as an independent area of research. With the increasing demand for
code-switching automatic speech recognition (ASR) systems, the development of a
code-switching speech corpus has become highly desirable. However, only very
limited code-switched resources are available so far for training such systems. In
this work, we present our first efforts in building a code-switching ASR system
in the Indian context. For that purpose, we have created a Hindi-English
code-switching speech database. The database not only contains the speech
utterances with code-switching properties but also covers session and speaker
variations such as pronunciation, accent, age, and gender. This database
can be applied in several speech signal processing applications, such as
code-switching ASR, language identification, language modeling, speech
synthesis, etc. This paper mainly presents an analysis of the statistics of the
collected code-switching speech corpus. Finally, performance results for the
ASR task are reported on the created database.
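Since the reported analysis centres on corpus statistics, one statistic commonly used for such corpora is the Code-Mixing Index (CMI) of Das and Gambäck (2014). The sketch below is illustrative only: whether this particular corpus is analysed with CMI is an assumption, and the tag sequence is invented.

```python
# Illustrative sketch: the Code-Mixing Index (CMI), a common statistic for
# code-switched corpora. It is an assumption that this paper uses CMI; the
# word-level language tags below are invented.

from collections import Counter

def cmi(tags):
    """CMI = 100 * (1 - share of the dominant language); 0 for monolingual text."""
    lang_tags = [t for t in tags if t in ("HI", "EN")]  # skip language-neutral tokens
    if not lang_tags:
        return 0.0
    top = Counter(lang_tags).most_common(1)[0][1]
    return 100.0 * (1.0 - top / len(lang_tags))

# one Hinglish utterance: five Hindi words, one English word
print(cmi(["HI", "HI", "EN", "HI", "HI", "HI"]))  # ~16.67
```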
All that is English may be Hindi: Enhancing language identification through automatic ranking of likeliness of word borrowing in social media
In this paper, we present a set of computational methods to identify the
likeliness of a word being borrowed, based on the signals from social media. In
terms of Spearman correlation coefficient values, our methods perform more than
twice as well (nearly 0.62) at predicting borrowing likeliness as the
best-performing baseline (nearly 0.26) reported in the literature. Based on
this likeliness estimate we asked annotators to re-annotate the language tags
of foreign words in predominantly native contexts. In 88 percent of cases the
annotators felt that the foreign language tag should be replaced by native
language tag, thus indicating substantial scope for improvement in automatic
language identification systems.
Comment: 11 pages; accepted at the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP 2017). arXiv admin note: substantial text overlap with arXiv:1703.0512
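The headline numbers are Spearman rank correlations between predicted and gold borrowing likeliness. A minimal sketch of how such an evaluation is typically computed follows; the score lists are invented, and only the metric comes from the abstract.

```python
# Sketch of the evaluation: Spearman correlation between gold borrowing
# likeliness and model predictions. All scores below are invented.

from scipy.stats import spearmanr

gold = [0.95, 0.80, 0.60, 0.40, 0.20, 0.05]  # annotator-derived likeliness
pred = [0.90, 0.70, 0.75, 0.30, 0.25, 0.10]  # scores from social-media signals

rho, p = spearmanr(gold, pred)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
```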
L3Cube-HingCorpus and HingBERT: A Code Mixed Hindi-English Dataset and BERT Language Models
Code-switching occurs when more than one language is mixed in a given
sentence or a conversation. This phenomenon is more prominent on social media
platforms, and its adoption is increasing over time. Code-mixed NLP has
therefore been studied extensively in the literature. As pre-trained
transformer-based architectures gain popularity, we observe that real
code-mixed data for pre-training large language models are scarce. We present
L3Cube-HingCorpus, the first large-scale real Hindi-English code-mixed corpus
in Roman script. It consists of 52.93M sentences and 1.04B tokens scraped from
Twitter. We further present HingBERT, HingMBERT, HingRoBERTa, and HingGPT. The
BERT models have been pre-trained on the code-mixed HingCorpus with the masked
language modelling objective. We show the effectiveness of these BERT models on
downstream tasks such as code-mixed sentiment analysis, POS tagging, NER, and
LID from the GLUECoS benchmark. HingGPT is a GPT-2-based generative transformer
model capable of generating full tweets. We also release L3Cube-HingLID Corpus,
the largest code-mixed Hindi-English language identification (LID) dataset, and
HingBERT-LID, a production-quality LID model to facilitate the capture of more
code-mixed data using the process outlined in this
work. The dataset and models are available at
https://github.com/l3cube-pune/code-mixed-nlp
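As a quick-start sketch, the released BERT models can presumably be queried through Hugging Face transformers. The Hub id used below is an assumption; the linked repository documents the exact released model names.

```python
# A minimal usage sketch with Hugging Face transformers. The model id
# "l3cube-pune/hing-bert" is an assumption -- see the GitHub repo for the
# exact released names. The Hinglish prompt is invented.

from transformers import pipeline

fill = pipeline("fill-mask", model="l3cube-pune/hing-bert")
for pred in fill("mujhe yeh [MASK] bahut pasand hai")[:3]:
    print(pred["token_str"], round(pred["score"], 3))
```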
Hate Speech Detection from Code-mixed Hindi-English Tweets Using Deep Learning Models
This paper reports an improvement over the state of the art in hate speech
detection for English-Hindi code-mixed tweets. We compare three typical deep
learning models using domain-specific embeddings. On experimenting with a
benchmark dataset of English-Hindi code-mixed tweets, we observe that using
domain-specific embeddings results in an improved representation of target
groups and an improved F-score.
Comment: This paper will appear at the 15th International Conference on Natural Language Processing (ICON-2018) in India in December 2018. ICON is a premier NLP conference in India.
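The abstract does not name the three architectures; as one plausible shape of such a system, the sketch below wires frozen domain-specific (code-mixed tweet) embeddings into a bidirectional LSTM classifier in PyTorch. All dimensions and the pooling choice are assumptions.

```python
# One plausible model of the kind compared here: a Bi-LSTM classifier over
# frozen domain-specific embeddings. Sizes and pooling are assumptions.

import torch
import torch.nn as nn

class HateSpeechLSTM(nn.Module):
    def __init__(self, pretrained_embeddings, hidden=128, num_classes=2):
        super().__init__()
        # pretrained_embeddings: (vocab, dim) tensor trained on code-mixed tweets
        self.emb = nn.Embedding.from_pretrained(pretrained_embeddings, freeze=True)
        self.lstm = nn.LSTM(pretrained_embeddings.size(1), hidden,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, token_ids):                # (batch, seq_len)
        out, _ = self.lstm(self.emb(token_ids))  # (batch, seq_len, 2*hidden)
        return self.fc(out.mean(dim=1))          # mean-pool over time

# smoke test with random embeddings and a batch of two 5-token sequences
model = HateSpeechLSTM(torch.randn(1000, 100))
print(model(torch.randint(0, 1000, (2, 5))).shape)  # torch.Size([2, 2])
```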
RiTUAL-UH at TRAC 2018 Shared Task: Aggression Identification
This paper presents our system for "TRAC 2018 Shared Task on Aggression
Identification". Our best systems for the English dataset use a combination of
lexical and semantic features. However, for the Hindi data, using only lexical
features gave us the best results. We obtained weighted F1-measures of 0.5921
for the English Facebook task (ranked 12th), 0.5663 for the English Social
Media task (ranked 6th), 0.6292 for the Hindi Facebook task (ranked 1st), and
0.4853 for the Hindi Social Media task (ranked 2nd).
Comment: TRAC-1 Shared Task, 2018
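As a rough illustration of a lexical-features-only system of the kind that worked best on Hindi, the sketch below combines character n-gram TF-IDF features with a linear classifier and scores with the shared task's weighted F1. The classifier choice, n-gram range, and toy examples are assumptions, not the authors' configuration.

```python
# Sketch of a lexical-features-only aggression classifier, evaluated with
# weighted F1 as in the shared task. Feature and classifier choices are
# assumptions; the two examples and labels are invented.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

texts = ["tum bahut bure ho", "aaj ka din accha tha"]
labels = ["OAG", "NAG"]  # TRAC labels: overtly aggressive / non-aggressive

model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5)),  # lexical features only
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)
print(f1_score(labels, model.predict(texts), average="weighted"))
```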
Language Identification in Code-Mixed Data using Multichannel Neural Networks and Context Capture
An accurate language identification tool is an absolute necessity for
building complex NLP systems for code-mixed data. A great deal of work has
been done on this task recently, but there is still room for improvement.
Inspired by recent advances in neural network architectures for computer
vision tasks, we have implemented multichannel neural networks
combining CNN and LSTM for word level language identification of code-mixed
data. Combining this with a Bi-LSTM-CRF context-capture module, accuracies of
93.28% and 93.32% are achieved on our two test sets.
Comment: The 4th Workshop on Noisy User-Generated Text (W-NUT), co-located with EMNLP 2018
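A sketch of the multichannel idea: a character-level CNN channel and a character-level LSTM channel each encode a word, and their concatenation feeds a per-word language classifier. Dimensions are assumptions, and the Bi-LSTM-CRF context-capture module over the word sequence is omitted for brevity.

```python
# Sketch of multichannel word-level LID: CNN and LSTM channels over the
# characters of one word, concatenated and classified. Sizes are assumptions;
# the paper's Bi-LSTM-CRF context module over word sequences is omitted.

import torch
import torch.nn as nn

class MultichannelWordLID(nn.Module):
    def __init__(self, n_chars=128, emb=32, channels=64, num_langs=3):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, emb)
        self.cnn = nn.Conv1d(emb, channels, kernel_size=3, padding=1)  # channel 1
        self.lstm = nn.LSTM(emb, channels, batch_first=True)           # channel 2
        self.fc = nn.Linear(2 * channels, num_langs)  # e.g. en / hi / other

    def forward(self, char_ids):                   # (batch, word_len)
        x = self.char_emb(char_ids)                # (batch, word_len, emb)
        conv = self.cnn(x.transpose(1, 2)).max(dim=2).values  # max over chars
        _, (h, _) = self.lstm(x)                   # final LSTM hidden state
        return self.fc(torch.cat([conv, h[-1]], dim=1))

model = MultichannelWordLID()
print(model(torch.randint(0, 128, (4, 10))).shape)  # torch.Size([4, 3])
```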
Multilingual Speech Recognition With A Single End-To-End Model
Training a conventional automatic speech recognition (ASR) system to support
multiple languages is challenging because the sub-word unit, lexicon, and word
inventories are typically language-specific. In contrast, sequence-to-sequence
models are well suited to multilingual ASR because they encapsulate the
acoustic, pronunciation, and language models jointly in a single network. In this
work we present a single sequence-to-sequence ASR model trained on 9 different
Indian languages, which have very little overlap in their scripts.
Specifically, we take a union of language-specific grapheme sets and train a
grapheme-based sequence-to-sequence model jointly on data from all languages.
We find that this model, which is not explicitly given any information about
language identity, improves recognition performance by 21% relative compared to
analogous sequence-to-sequence models trained on each language individually. By
modifying the model to accept a language identifier as an additional input
feature, we further improve performance by an additional 7% relative and
eliminate confusion between different languages.
Comment: Accepted at ICASSP 2018
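Two of the ideas are easy to sketch: forming the output vocabulary as a union of language-specific grapheme sets, and appending a one-hot language identifier to every acoustic frame. Everything below (toy grapheme sets, feature sizes) is illustrative rather than the paper's configuration.

```python
# Sketch of two ideas from the abstract: a union grapheme vocabulary and a
# one-hot language-id feature appended to each acoustic frame. The grapheme
# sets and dimensions are toy values.

import numpy as np

grapheme_sets = {"hi": {"क", "ख", "ग"}, "ta": {"க", "ங", "ச"}}
vocab = sorted(set().union(*grapheme_sets.values()))  # union across languages

def add_language_feature(frames, lang_idx, num_langs=9):
    """frames: (T, dim) features -> (T, dim + num_langs) with one-hot lang id."""
    one_hot = np.zeros((frames.shape[0], num_langs), dtype=frames.dtype)
    one_hot[:, lang_idx] = 1.0
    return np.concatenate([frames, one_hot], axis=1)

frames = np.random.randn(100, 80).astype(np.float32)  # 100 frames, 80-dim fbank
print(len(vocab), add_language_feature(frames, lang_idx=0).shape)  # 6 (100, 89)
```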
GLUECoS: An Evaluation Benchmark for Code-Switched NLP
Code-switching is the use of more than one language in the same conversation
or utterance. Recently, multilingual contextual embedding models, trained on
multiple monolingual corpora, have shown promising results on cross-lingual and
multilingual tasks. We present an evaluation benchmark, GLUECoS, for
code-switched languages that spans several NLP tasks in English-Hindi and
English-Spanish. Specifically, our evaluation benchmark includes Language
Identification from text, POS tagging, Named Entity Recognition, Sentiment
Analysis, Question Answering and a new task for code-switching, Natural
Language Inference. We present results on all these tasks using cross-lingual
word embedding models and multilingual models. In addition, we fine-tune
multilingual models on artificially generated code-switched data. Although
multilingual models perform significantly better than cross-lingual models, our
results show that in most tasks, across both language pairs, multilingual
models fine-tuned on code-switched data perform best, showing that multilingual
models can be further optimized for code-switching tasks.
Comment: To appear at ACL 2020
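A minimal sketch of the recipe the results favour, fine-tuning a multilingual model on code-switched task data with Hugging Face transformers, might look as follows. The two training sentences and the label scheme are invented placeholders; GLUECoS provides the actual task data and splits.

```python
# Sketch: fine-tune mBERT on code-switched classification data. The examples
# are invented placeholders for a GLUECoS task's real training set.

import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2)

texts = ["yeh movie was awesome yaar", "kya bakwas film thi, total waste"]
enc = tok(texts, padding=True, truncation=True, return_tensors="pt")

class TinyDataset(torch.utils.data.Dataset):
    def __init__(self, enc, labels):
        self.enc, self.labels = enc, labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: v[i] for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1),
    train_dataset=TinyDataset(enc, [1, 0]),  # 1 = positive, 0 = negative
)
trainer.train()
```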
Representativeness as a Forgotten Lesson for Multilingual and Code-switched Data Collection and Preparation
Multilingualism is widespread around the world and code-switching (CSW) is a
common practice among different language pairs/tuples across locations and
regions. However, there is still not much progress in building successful CSW
systems, despite the recent advances in Massive Multilingual Language Models
(MMLMs). We investigate the reasons behind this setback through a critical
study of 68 existing CSW data sets across language pairs, in terms of their
collection and preparation (e.g., transcription and annotation) stages. This
in-depth analysis reveals that a) most CSW data involves English, ignoring
other language pairs/tuples, and b) there are flaws in representativeness at
the data collection and preparation stages because location-based,
socio-demographic, and register variation in CSW is ignored. In addition, a
lack of clarity about the data selection and filtering stages further obscures
the representativeness of CSW data sets. We conclude by providing a short
check-list to improve representativeness in forthcoming studies involving CSW
data collection and preparation.
Comment: Accepted for EMNLP'23 Findings (to appear in the EMNLP'23 proceedings)