Improved Text Language Identification for the South African Languages
Virtual assistants and text chatbots have recently been gaining popularity.
Given the short message nature of text-based chat interactions, the language
identification systems of these bots might only have 15 or 20 characters to
make a prediction. However, accurate text language identification is important,
especially in the early stages of many multilingual natural language processing
pipelines.
This paper investigates the use of a naive Bayes classifier to accurately
predict the language family that a piece of text belongs to, combined with a
lexicon-based classifier to distinguish the specific South African language
that the text is written in. This approach leads to a 31% reduction in the
language detection error.
In the spirit of reproducible research, the training and testing datasets as
well as the code are published on GitHub. Hopefully they will be useful for
creating a text language identification shared task for South African languages.
Comment: Accepted to appear in the proceedings of The 28th Annual Symposium of
the Pattern Recognition Association of South Africa, 201
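The two-stage idea in the abstract above (a naive Bayes classifier picks the language family, then a lexicon narrows it to one language) can be sketched as follows. This is only a minimal illustration, not the authors' published code: the character-trigram features, the add-one smoothing, the word-overlap lexicon score, the family names, and the tiny training phrases and lexicons are all assumptions chosen to keep the example self-contained.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return the character n-grams of a lower-cased, space-padded string."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class FamilyNaiveBayes:
    """Multinomial naive Bayes over character n-grams (add-one smoothing)."""

    def __init__(self, n=3):
        self.n = n
        self.gram_counts = defaultdict(Counter)  # family -> n-gram counts
        self.doc_counts = Counter()              # family -> number of samples
        self.vocab = set()

    def fit(self, samples):
        for text, family in samples:
            grams = char_ngrams(text, self.n)
            self.gram_counts[family].update(grams)
            self.doc_counts[family] += 1
            self.vocab.update(grams)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        grams = char_ngrams(text, self.n)
        best_family, best_score = None, -math.inf
        for family, counts in self.gram_counts.items():
            # log prior + sum of smoothed log likelihoods
            score = math.log(self.doc_counts[family] / total_docs)
            denom = sum(counts.values()) + vocab_size
            for gram in grams:
                score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_family, best_score = family, score
        return best_family


def pick_language(text, family, lexicons):
    """Choose the language in `family` whose lexicon covers the most tokens.

    Ties resolve to the first language listed for the family.
    """
    tokens = text.lower().split()
    overlap = {
        lang: sum(token in words for token in tokens)
        for lang, words in lexicons[family].items()
    }
    return max(overlap, key=overlap.get)


# Toy, hypothetical training phrases and lexicons (illustration only).
TRAIN = [
    ("sawubona ngiyabonga", "nguni"),
    ("molo enkosi", "nguni"),
    ("dumela kea leboha", "sotho_tswana"),
    ("dumela ke a leboga", "sotho_tswana"),
    ("goeie more dankie", "germanic"),
    ("good morning thank you", "germanic"),
]
LEXICONS = {
    "nguni": {"zul": {"sawubona", "ngiyabonga"}, "xho": {"molo", "enkosi"}},
    "sotho_tswana": {"sot": {"kea", "leboha"}, "tsn": {"ke", "leboga"}},
    "germanic": {
        "afr": {"goeie", "more", "dankie"},
        "eng": {"good", "morning", "thank", "you"},
    },
}


def identify(text, model, lexicons=LEXICONS):
    """Stage 1: predict the family. Stage 2: pick the language within it."""
    family = model.predict(text)
    return family, pick_language(text, family, lexicons)


model = FamilyNaiveBayes(n=3).fit(TRAIN)
print(identify("dankie", model))
print(identify("ngiyabonga", model))
```

Splitting the decision this way lets the statistical model handle the coarse distinction, where character statistics differ strongly, while the lexicon handles closely related languages within a family, where short strings give n-gram models little to work with.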
Comparative Analysis of Word Embeddings for Capturing Word Similarities
Distributed language representation has become the most widely used technique
for language representation in various natural language processing tasks. Most
of the natural language processing models that are based on deep learning
techniques use already pre-trained distributed word representations, commonly
called word embeddings. Determining the highest-quality word embeddings is of
crucial importance for such models. However, selecting the appropriate word
embeddings is a perplexing task since the projected embedding space is not
intuitive to humans. In this paper, we explore different approaches for
creating distributed word representations. We perform an intrinsic evaluation
of several state-of-the-art word embedding methods. Their performance on
capturing word similarities is analysed with existing benchmark datasets for
word-pair similarity. We conduct a correlation analysis between ground truth
word similarities and the similarities obtained by different word embedding
methods.
Comment: Part of the 6th International Conference on Natural Language
Processing (NATP 2020)
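The intrinsic evaluation described in the abstract above can be illustrated with a short sketch: score each benchmark word pair with the cosine similarity of its embeddings, then correlate those scores with the human ratings. The abstract does not say which correlation coefficient is used, so Spearman's rank correlation here is an assumption, and the three-dimensional vectors and ratings are invented toy values rather than data from any real benchmark.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical toy embeddings and human similarity ratings.
EMBEDDINGS = {
    "cat": [1.0, 0.9, 0.0],
    "dog": [0.9, 1.0, 0.1],
    "car": [0.0, 0.1, 1.0],
    "truck": [0.1, 0.0, 0.9],
}
BENCHMARK = [
    ("cat", "dog", 9.0),
    ("car", "truck", 8.5),
    ("cat", "car", 2.0),
    ("dog", "truck", 1.5),
]

human = [score for _, _, score in BENCHMARK]
predicted = [cosine(EMBEDDINGS[a], EMBEDDINGS[b]) for a, b, _ in BENCHMARK]
rho = spearman(human, predicted)
print(f"Spearman correlation: {rho:.2f}")
```

A rank-based coefficient only asks whether the embedding orders the pairs the way the annotators do, which avoids assuming that cosine values and human ratings share a linear scale.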
Multimodal Hate Speech Detection from Bengali Memes and Texts
Numerous works have employed machine learning (ML) and deep learning (DL)
techniques on textual data from social media for anti-social behavior analysis,
such as cyberbullying, fake news propagation, and hate speech, mainly for
highly resourced languages like English. However,
despite having a lot of diversity and millions of native speakers, some
languages such as Bengali are under-resourced, which is due to a lack of
computational resources for natural language processing (NLP). Like English,
Bengali social media content also includes images along with texts (e.g.,
multimodal content is posted by embedding short texts into images on
Facebook), so the textual data alone is not enough to judge it (e.g., to
determine whether it is hate speech). In those cases, images might give the
extra context needed to judge properly. This paper is about hate speech
detection from multimodal Bengali memes and texts. We prepared the only
multimodal hate speech detection dataset for this kind of problem in Bengali.
We train several neural architectures (i.e., neural networks like
Bi-LSTM/Conv-LSTM with word embeddings, EfficientNet + transformer
architectures such as monolingual Bangla BERT, multilingual BERT-cased/uncased,
and XLM-RoBERTa) to jointly analyze textual and visual information for hate
speech detection. The Conv-LSTM and XLM-RoBERTa
models performed best for texts, yielding F1 scores of 0.78 and 0.82,
respectively. For memes, ResNet152 and DenseNet201 models yield F1 scores of
0.78 and 0.7, respectively. The multimodal fusion of mBERT-uncased +
EfficientNet-B1 performed the best, yielding an F1 score of 0.80. Our study
suggests that memes are moderately useful for hate speech detection in Bengali,
but none of the multimodal models outperform unimodal models analyzing only
textual data.
Multilingual sentiment analysis in social media.
252 p.
This thesis addresses the task of analysing sentiment in messages coming from social media. The ultimate goal was to develop a Sentiment Analysis system for Basque. However, because of the socio-linguistic reality of the Basque language a tool providing only analysis for Basque would not be enough for a real world application. Thus, we set out to develop a multilingual system, including Basque, English, French and Spanish.
The thesis addresses the following challenges to build such a system:
- Analysing methods for creating Sentiment lexicons, suitable for less resourced languages.
- Analysis of social media (specifically Twitter): Tweets pose several challenges in order to understand and extract opinions from such messages. Language identification and microtext normalization are addressed.
- Research the state of the art in polarity classification, and develop a supervised classifier that is tested against well known social media benchmarks.
- Develop a social media monitor capable of analysing sentiment with respect to specific events, products or organizations
- …