Adapting Deep Learning for Sentiment Classification of Code-Switched Informal Short Text
Nowadays, an abundance of short text is being generated that uses nonstandard
writing styles influenced by regional languages. Such informal and
code-switched content is under-resourced in terms of labeled datasets and
language models, even for popular tasks like sentiment classification. In this
work, we (1) present a labeled dataset called MultiSenti for sentiment
classification of code-switched informal short text, (2) explore the
feasibility of adapting resources from a resource-rich language for an informal
one, and (3) propose a deep learning-based model for sentiment classification
of code-switched informal short text. We aim to achieve this without any
lexical normalization, language translation, or code-switching indication. The
performance of the proposed models is compared with three existing multilingual
sentiment classification models. The results show that the proposed model
performs better in general, and that adapting character-based embeddings yields
equivalent performance while being computationally more efficient than training
word-based domain-specific embeddings.
Leveraging writing systems changes for deep learning based Chinese affective analysis
Affective analysis of social media text is in great demand. Online text written in Chinese communities often contains mixed scripts: major text written in Chinese, an ideograph-based writing system, and minor text using Latin letters, an alphabet-based writing system. This phenomenon is referred to as writing systems changes (WSCs). Past studies have shown that WSCs often reflect immediate, unfiltered affect. However, the use of WSCs poses additional challenges in Natural Language Processing tasks because WSCs can break the syntax of the major text. In this work, we use WSCs as an effective feature in a hybrid deep learning model with an attention network. The WSC scripts are first identified by their encoding range. Then, the document representation of the major text is learned through a Long Short-Term Memory model, and the minor text is learned by a separate Convolutional Neural Network model. To further highlight the WSC components, an attention mechanism is adopted to re-weight the feature vector before the classification layer. Experiments show that the proposed hybrid deep learning method, which better incorporates WSC features, can further improve performance compared to state-of-the-art classification models. The experimental results indicate that WSCs can serve as effective information in affective analysis of social media text.
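The abstract's first step, identifying WSC scripts by their encoding range, can be illustrated with a minimal sketch. The helper below is hypothetical (not the authors' code) and separates CJK ideographs from Latin letters purely by Unicode code point:

```python
def split_scripts(text):
    """Split a mixed-script string into its CJK (major) and
    Latin-letter (minor) parts by Unicode code-point range."""
    major, minor = [], []
    for ch in text:
        cp = ord(ch)
        # CJK Unified Ideographs block plus Extension A
        if 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF:
            major.append(ch)
        elif ("a" <= ch <= "z") or ("A" <= ch <= "Z"):
            minor.append(ch)
        # punctuation, digits, emoji, etc. are ignored in this sketch
    return "".join(major), "".join(minor)
```

In the paper's pipeline, the two parts would then feed the LSTM and CNN branches, respectively.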
Share What You Already Know: Cross-Language-Script Transfer and Alignment for Sentiment Detection in Code-Mixed Data
Code-switching entails mixing multiple languages and is an increasingly
common phenomenon in social media text. Usually, code-mixed texts are
written in a single script, even though the languages involved have different
scripts. Pre-trained multilingual models primarily utilize data in the
native script of each language, yet existing studies use code-switched texts
as they are. However, using the native script for each language can
generate better representations of the text owing to the pre-trained knowledge.
Therefore, a cross-language-script knowledge sharing architecture utilizing the
cross attention and alignment of the representations of text in individual
language scripts was proposed in this study. Experimental results on two
different datasets containing Nepali-English and Hindi-English code-switched
texts demonstrate the effectiveness of the proposed method. Interpreting
the model with a model-explainability technique illustrates the sharing of
language-specific knowledge between language-specific representations.
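The abstract does not detail the cross-attention computation. As a rough illustration under standard assumptions, scaled dot-product cross-attention lets one script's representations (queries) attend over the other script's representations (keys and values); the function name and NumPy formulation here are illustrative, not the study's implementation:

```python
import numpy as np

def cross_attention(queries, keys_values):
    """Scaled dot-product cross-attention: each position in `queries`
    (one script's representations) attends over `keys_values`
    (the other script's representations, used as both keys and values)."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)           # (Tq, Tk)
    # row-wise softmax over the other script's positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ keys_values                            # (Tq, d)
```

In a full model, both directions (native-script queries over transliterated keys and vice versa) would typically be computed and the resulting representations aligned.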
A Survey of Cross-Lingual Sentiment Analysis Based on Pre-Trained Models
With the development of natural language processing technology, researchers have widely studied monolingual Sentiment Analysis (SA) using Machine Learning (ML) and Deep Learning (DL). However, there is not much work on Cross-Lingual SA (CLSA), although it is beneficial when dealing with low-resource languages (e.g., Tamil, Malayalam, Hindi, and Arabic). This paper surveys the main challenges and issues of CLSA based on pre-trained language models and reviews the leading methods for coping with CLSA. In particular, we compare and analyze their pros and cons. Moreover, we summarize valuable cross-lingual resources and point out the main problems researchers need to solve in the future.
Cross-Lingual and Low-Resource Sentiment Analysis
Identifying sentiment in a low-resource language is essential for understanding opinions internationally and for responding to the urgent needs of locals affected by disaster incidents in different world regions. While tools and resources for recognizing sentiment in high-resource languages are plentiful, determining the most effective methods for achieving this task in a low-resource language which lacks annotated data is still an open research question. Most existing approaches for cross-lingual sentiment analysis to date have relied on high-resource machine translation systems, large amounts of parallel data, or resources only available for Indo-European languages.
This work presents methods, resources, and strategies for identifying sentiment cross-lingually in a low-resource language. We introduce a cross-lingual sentiment model which can be trained on a high-resource language and applied directly to a low-resource language. The model offers the feature of lexicalizing the training data using a bilingual dictionary, but can perform well without any translation into the target language.
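The lexicalization idea can be sketched as follows (the function name and behavior are assumed for illustration, not taken from the dissertation): source-language tokens with a bilingual-dictionary entry are swapped for their target-language translation, while out-of-dictionary tokens pass through unchanged, so no full translation system is required:

```python
def lexicalize(tokens, bilingual_dict):
    """Replace tokens that have a bilingual-dictionary entry with their
    target-language translation; unknown tokens are kept as-is."""
    return [bilingual_dict.get(tok.lower(), tok) for tok in tokens]
```

For example, with an English-Spanish dictionary entry {"good": "bueno"}, the training sentence ["Good", "movie"] becomes ["bueno", "movie"], partially shifting the training data toward the target language's vocabulary.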
Through an extensive experimental analysis, evaluated on 17 target languages, we show that the model performs well with bilingual word vectors pre-trained on an appropriate translation corpus. We compare in-genre and in-domain parallel corpora, out-of-domain parallel corpora, in-domain comparable corpora, and monolingual corpora, and show that a relatively small, in-domain parallel corpus works best as a transfer medium if it is available. We describe the conditions under which other resources and embedding generation methods are successful, and these include our strategies for leveraging in-domain comparable corpora for cross-lingual sentiment analysis.
To enhance the ability of the cross-lingual model to identify sentiment in the target language, we present new feature representations for sentiment analysis that are incorporated in the cross-lingual model: bilingual sentiment embeddings that are used to create bilingual sentiment scores, and a method for updating the sentiment embeddings during training by lexicalization of the target language. This feature configuration works best for the largest number of target languages in both untargeted and targeted cross-lingual sentiment experiments.
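The exact construction of the bilingual sentiment scores is not given in the abstract. One common way to derive such a score from embeddings, shown here purely as an assumed illustration, is similarity to positive versus negative seed vectors in a shared bilingual space:

```python
import numpy as np

def sentiment_score(word_vec, pos_seed_vecs, neg_seed_vecs):
    """Illustrative score: mean cosine similarity to positive seed
    embeddings minus mean cosine similarity to negative seed embeddings,
    all assumed to live in one shared bilingual embedding space."""
    def mean_cos(v, seeds):
        v = v / np.linalg.norm(v)
        s = seeds / np.linalg.norm(seeds, axis=1, keepdims=True)
        return float((s @ v).mean())
    return mean_cos(word_vec, pos_seed_vecs) - mean_cos(word_vec, neg_seed_vecs)
```

Because the space is bilingual, the same seed vectors can score words from either the source or the target language.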
The cross-lingual model is studied further by evaluating the role of the source language, which has traditionally been assumed to be English. We build cross-lingual models using 15 source languages, including two non-European and non-Indo-European source languages: Arabic and Chinese. We show that language families play an important role in the performance of the model, as does the morphological complexity of the source language.
In the last part of the work, we focus on sentiment analysis towards targets. We study Arabic as a representative morphologically complex language and develop models and morphological representation features for identifying entity targets and the sentiment expressed towards them in Arabic open-domain text. Finally, we adapt our cross-lingual sentiment models for the detection of sentiment towards targets. Through cross-lingual experiments on Arabic and English, we demonstrate that our findings regarding resources, features, and language also hold true for the transfer of targeted sentiment.