Cross-Lingual and Low-Resource Sentiment Analysis
Identifying sentiment in a low-resource language is essential for understanding opinions internationally and for responding to the urgent needs of locals affected by disaster incidents in different world regions. While tools and resources for recognizing sentiment in high-resource languages are plentiful, determining the most effective methods for achieving this task in a low-resource language which lacks annotated data is still an open research question. Most existing approaches for cross-lingual sentiment analysis to date have relied on high-resource machine translation systems, large amounts of parallel data, or resources only available for Indo-European languages.
This work presents methods, resources, and strategies for identifying sentiment cross-lingually in a low-resource language. We introduce a cross-lingual sentiment model which can be trained on a high-resource language and applied directly to a low-resource language. The model offers the feature of lexicalizing the training data using a bilingual dictionary, but can perform well without any translation into the target language.
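As a rough illustration of what lexicalizing training data with a bilingual dictionary could look like, the sketch below replaces each source-language token with a dictionary translation when one exists and leaves it unchanged otherwise. This is an assumed, simplified reading of the idea, not the method used in the work; the function name `lexicalize` and the toy English-to-Spanish dictionary are hypothetical.

```python
def lexicalize(tokens, bilingual_dict):
    """Replace each source token with its dictionary translation, if any.

    Tokens missing from the dictionary are kept as-is, so the model can
    still be trained when dictionary coverage is only partial.
    """
    return [bilingual_dict.get(tok.lower(), tok) for tok in tokens]


# Hypothetical toy dictionary, for illustration only.
toy_dict = {"good": "bueno", "movie": "película"}

lexicalized = lexicalize(["Good", "movie", "overall"], toy_dict)
print(lexicalized)
```

Lookup is done on the lowercased token, so casing in the training data does not affect dictionary coverage in this sketch.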
Through an extensive experimental analysis, evaluated on 17 target languages, we show that the model performs well with bilingual word vectors pre-trained on an appropriate translation corpus. We compare in-genre and in-domain parallel corpora, out-of-domain parallel corpora, in-domain comparable corpora, and monolingual corpora, and show that a relatively small, in-domain parallel corpus works best as a transfer medium if it is available. We describe the conditions under which other resources and embedding generation methods are successful, and these include our strategies for leveraging in-domain comparable corpora for cross-lingual sentiment analysis.
To enhance the ability of the cross-lingual model to identify sentiment in the target language, we present new feature representations for sentiment analysis that are incorporated in the cross-lingual model: bilingual sentiment embeddings that are used to create bilingual sentiment scores, and a method for updating the sentiment embeddings during training by lexicalization of the target language. This feature configuration works best for the largest number of target languages in both untargeted and targeted cross-lingual sentiment experiments.
The cross-lingual model is studied further by evaluating the role of the source language, which has traditionally been assumed to be English. We build cross-lingual models using 15 source languages, including two non-European and non-Indo-European source languages: Arabic and Chinese. We show that language families play an important role in the performance of the model, as does the morphological complexity of the source language.
In the last part of the work, we focus on sentiment analysis towards targets. We study Arabic as a representative morphologically complex language and develop models and morphological representation features for identifying entity targets and sentiment expressed towards them in Arabic open-domain text. Finally, we adapt our cross-lingual sentiment models for the detection of sentiment towards targets. Through cross-lingual experiments on Arabic and English, we demonstrate that our findings regarding resources, features, and language also hold true for the transfer of targeted sentiment.
Predicting the Type and Target of Offensive Posts in Social Media
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)
As offensive content has become pervasive in social media, there has been much research in identifying potentially offensive messages. However, previous work on this topic did not consider the problem as a whole, but rather focused on detecting very specific types of offensive content, e.g., hate speech, cyberbullying, or cyber-aggression. In contrast, here we target several different kinds of offensive content. In particular, we model the task hierarchically, identifying the type and the target of offensive messages in social media. For this purpose, we compiled the Offensive Language Identification Dataset (OLID), a new dataset with tweets annotated for offensive content using a fine-grained three-layer annotation scheme, which we make publicly available. We discuss the main similarities and differences between OLID and pre-existing datasets for hate speech identification, aggression detection, and similar tasks. We further experiment with and compare the performance of different machine learning models on OLID.
A Novel Mobile Wireless Sensing System for Real-time Monitoring of Posture and Spine Stress
Poor posture or extra stress on the spine has been shown to lead to a variety of spinal disorders including chronic back pain, and to incur numerous health costs to society. For this reason, workplace ergonomics is rapidly becoming indispensable in all major corporations. Making the individual continuously aware of poor posture may reduce out-of-posture tendencies and encourage healthy spinal habits. We have developed a novel wireless mobile sensing system which monitors spine stress in real-time by detecting poor back posture and strain on the back due to prolonged sitting or standing. The system provides a new method of measuring spine stress at both the back and the feet by integrating posture sensors with strain sensors. Posture and strain data is collected by means of a posture sensor at the neck and weight sensors at the feet. Data is transmitted wirelessly to a central processing station and real-time feedback is provided to the user's mobile device when sustained bad posture is detected. Moreover, the position of the patient (sitting, standing, or walking) can be determined by analysis of the weight sensor data and is visualized in real-time, along with back posture, at the central station by means of a graphical animation. Finally, data from all sensors is stored in a database to enable post-processing and data analysis, and a summary report of daily posture and physical activity is sent to the user's email. The use of centralized processing allows for high-performance data analysis and storage at the central station, which enables tracking of the individual's progress. We demonstrate the effectiveness of our system in simultaneously monitoring posture and position by testing in numerous situations.
Reranking with Linguistic and Semantic Features for Arabic Optical Character Recognition
Optical Character Recognition (OCR) systems for Arabic rely on information contained in the scanned images to recognize sequences of characters and on language models to emphasize fluency. In this paper we incorporate linguistically and semantically motivated features into an existing OCR system. To do so we follow an n-best list reranking approach that exploits recent advances in learning-to-rank techniques. We achieve 10.1% and 11.4% reductions in recognition word error rate (WER) relative to a standard baseline system on typewritten and handwritten Arabic, respectively.
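The general shape of n-best list reranking can be sketched as follows: each candidate transcription carries a set of feature values, and the candidates are reordered by a weighted sum of those features. This is a minimal linear-scoring sketch under assumed inputs, not the learning-to-rank model used in the paper; the feature names (`ocr_score`, `lm_score`), the weights, and the hypotheses are all hypothetical.

```python
def rerank(nbest, weights):
    """Return hypotheses sorted by weighted feature score, best first."""
    def score(hyp):
        # Linear combination of the hypothesis features.
        return sum(weights[name] * value
                   for name, value in hyp["features"].items())
    return sorted(nbest, key=score, reverse=True)


# Hypothetical 2-best list with log-probability-style features.
nbest = [
    {"text": "hyp a", "features": {"ocr_score": -1.0, "lm_score": -4.0}},
    {"text": "hyp b", "features": {"ocr_score": -1.2, "lm_score": -2.0}},
]
weights = {"ocr_score": 1.0, "lm_score": 0.5}  # assumed, not learned here

best = rerank(nbest, weights)[0]["text"]
print(best)
```

In a learning-to-rank setting the weights (or a non-linear scoring function) would be fitted on training data rather than set by hand as they are here.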
Large Scale Arabic Error Annotation: Guidelines and Framework
We present annotation guidelines and a web-based annotation framework developed as part of an effort to create a manually annotated Arabic corpus of errors and corrections for various text types. Such a corpus will be invaluable for developing Arabic error correction tools, both for training models and as a gold standard for evaluating error correction algorithms. We summarize the guidelines we created. We also describe issues encountered during the training of the annotators, as well as problems that are specific to the Arabic language that arose during the annotation process. Finally, we present the annotation tool that was developed as part of this project, the annotation pipeline, and the quality of the resulting annotations.
Exploring Differences in the Impact of Users' Traces on Arabic and English Facebook Search
This paper proposes an approach to Facebook search in Arabic and English, which exploits several users' traces (e.g. comments, shares, reactions) left on Facebook posts to estimate their social importance. Our goal is to show how these social traces (signals) can play a vital role in improving Arabic and English Facebook search. Firstly, we identify polarities (positive or negative) carried by the textual signals (e.g. comments) and non-textual ones (e.g. the reactions love and sad) for a given Facebook post. Specifically, the polarity of each comment expressed in Arabic or in English on a given Facebook post is estimated using a neural sentiment model. Secondly, we group signals according to their complementarity using attribute (feature) selection algorithms. Thirdly, we apply learning to rank (LTR) algorithms to re-rank Facebook search results based on the selected groups of signals. Finally, experiments are carried out on 13,500 Facebook posts, collected from 45 topics, for each of the two languages. Experimental results reveal that Random Forests was the most effective LTR approach for this task in both languages. However, the most appropriate feature selection algorithms are ReliefFAttributeEval and InfoGainAttributeEval for the Arabic and English Facebook search tasks, respectively.