1,720 research outputs found
Natural language processing for Albanian: a state-of-the-art survey
Due to its wide applicability, natural language processing (NLP) has attracted significant research effort from the machine learning and deep learning communities. Despite this, research on NLP for the Albanian language is still limited. Moreover, to the best of our knowledge, no literature review is available that presents a clear picture of what has been studied, argued, and established in the area. The main objective of this survey is to comprehensively review, analyze, and discuss the state of the art in NLP for the Albanian language. We present an extensive study of the contributions of several authors who have applied NLP to the Albanian language, along with an overview of research carried out on typical NLP applications for Albanian. Finally, some future challenges and limitations of the area are discussed.
Automatic Hate Speech Detection: A Literature Review
Hate speech has been an ongoing problem on the Internet for many years. Social media, especially Facebook and Twitter, has given it a global stage where it can spread far more rapidly. Every social media platform needs to implement an effective hate speech detection system to remove offensive content in real time. There are various approaches to identifying hate speech, such as rule-based, machine learning-based, deep learning-based, and hybrid approaches. As this is a review paper, we describe the work of various authors who have studied hate speech identification using these approaches.
Explainable and High-Performance Hate and Offensive Speech Detection
The spread of information through social media platforms can create
environments possibly hostile to vulnerable communities and silence certain
groups in society. To mitigate such instances, several models have been
developed to detect hate and offensive speech. Since detecting hate and
offensive speech in social media platforms could incorrectly exclude
individuals from social media platforms, which can reduce trust, there is a
need to create explainable and interpretable models. Thus, we build an
explainable and interpretable high performance model based on the XGBoost
algorithm, trained on Twitter data. For unbalanced Twitter data, XGBoost
outperformed the LSTM, AutoGluon, and ULMFiT models on hate speech detection
with an F1 score of 0.75 compared to 0.38, 0.37, and 0.38, respectively. When
we down-sampled the data to three separate classes of approximately 5000
tweets, XGBoost performed better than LSTM, AutoGluon, and ULMFiT, with F1
scores for hate speech detection of 0.79 vs 0.69, 0.77, and 0.66, respectively.
XGBoost also performed better than LSTM, AutoGluon, and ULMFiT in the
down-sampled version for offensive speech detection with an F1 score of 0.83 vs
0.88, 0.82, and 0.79, respectively. We use Shapley Additive Explanations (SHAP)
on our XGBoost models' outputs to make them explainable and interpretable,
in contrast to LSTM, AutoGluon, and ULMFiT, which are black-box models.
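The evaluation described in this abstract rests on two simple operations: down-sampling each class to a common size and computing a per-class F1 score. The following is a minimal Python sketch of both, assuming examples are (text, label) pairs. The function names and label strings are illustrative and are not taken from the paper, which may have used a different procedure.

```python
import random
from collections import defaultdict


def downsample(examples, per_class, seed=0):
    """Keep at most `per_class` randomly chosen examples for each label.

    `examples` is a list of (text, label) pairs (an assumed data layout).
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    balanced = []
    for items in by_label.values():
        rng.shuffle(items)
        balanced.extend(items[:per_class])
    return balanced


def f1_for_class(y_true, y_pred, positive):
    """F1 score treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Reporting F1 per class, rather than overall accuracy, matters on unbalanced data because a model can score well overall while rarely predicting the minority class.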
PEACE: Cross-Platform Hate Speech Detection: A Causality-guided Framework
Hate speech detection refers to the task of detecting hateful content that
aims at denigrating an individual or a group based on their religion, gender,
sexual orientation, or other characteristics. Due to the different policies of
the platforms, different groups of people express hate in different ways.
Furthermore, due to the lack of labeled data in some platforms it becomes
challenging to build hate speech detection models. To this end, we revisit if
we can learn a generalizable hate speech detection model for the cross platform
setting, where we train the model on the data from one (source) platform and
generalize the model across multiple (target) platforms. Existing
generalization models rely on linguistic cues or auxiliary information, making
them biased towards certain tags or certain kinds of words (e.g., abusive
words) on the source platform and thus not applicable to the target platforms.
Inspired by social and psychological theories, we endeavor to explore if there
exist inherent causal cues that can be leveraged to learn generalizable
representations for detecting hate speech across these distribution shifts. To
this end, we propose a causality-guided framework, PEACE, that identifies and
leverages two intrinsic causal cues omnipresent in hateful content: the overall
sentiment and the aggression in the text. We conduct extensive experiments
across multiple platforms (representing the distribution shift) showing whether
causal cues can help cross-platform generalization. Comment: ECML PKDD 202
Social Emotion Mining Techniques for Facebook Posts Reaction Prediction
As of February 2016 Facebook allows users to express their experienced
emotions about a post by using five so-called `reactions'. This research paper
proposes and evaluates alternative methods for predicting these reactions to
user posts on public pages of firms/companies (like supermarket chains). For
this purpose, we collected posts (and their reactions) from Facebook pages of
large supermarket chains and constructed a dataset which is available for other
researchers. In order to predict the distribution of reactions of a new post,
neural network architectures (convolutional and recurrent neural networks) were
tested using pretrained word embeddings. Results of the neural networks were
improved by introducing a bootstrapping approach for sentiment and emotion
mining on the comments for each post. The final model (a combination of neural
network and a baseline emotion miner) is able to predict the reaction
distribution on Facebook posts with a mean squared error (or misclassification
rate) of 0.135. Comment: 10 pages, 13 figures; accepted at ICAART 2018. (Dataset:
https://github.com/jerryspan/FacebookR
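The error measure mentioned in this abstract can be illustrated with a short Python sketch: raw reaction counts for a post are normalized into a distribution, and the mean squared error is taken between the predicted and observed distributions. The reaction names and the dictionary layout are assumptions made for illustration; the paper's exact formulation may differ.

```python
def normalize(counts):
    """Turn raw reaction counts into a distribution that sums to 1."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("post has no reactions to normalize")
    return {name: value / total for name, value in counts.items()}


def mean_squared_error(predicted, actual):
    """Average squared difference over the reactions present in `actual`.

    Reactions missing from `predicted` are treated as probability 0.
    """
    names = sorted(actual)
    squared = [(predicted.get(name, 0.0) - actual[name]) ** 2 for name in names]
    return sum(squared) / len(names)


# Hypothetical counts for one post (illustrative values only).
observed = normalize({"love": 30, "haha": 10, "wow": 5, "sad": 3, "angry": 2})
predicted = {"love": 0.5, "haha": 0.2, "wow": 0.1, "sad": 0.1, "angry": 0.1}
error = mean_squared_error(predicted, observed)
```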
Transfer Learning for Low-Resource Sentiment Analysis
Sentiment analysis is the process of identifying and extracting subjective
information from text. Despite advances in employing cross-lingual approaches
in an automatic way, the implementation and evaluation of sentiment analysis
systems require language-specific data to consider various sociocultural and
linguistic peculiarities. In this paper, the collection and annotation of a
dataset are described for sentiment analysis of Central Kurdish. We explore a
few classical machine learning and neural network-based techniques for this
task. Additionally, we employ an approach in transfer learning to leverage
pretrained models for data augmentation. We demonstrate that data augmentation
achieves a high F score and accuracy despite the difficulty of the task. Comment: 14 pages - under review at ACM TALLI
Social media mining under the COVID-19 context: Progress, challenges, and opportunities
Social media platforms allow users worldwide to create and share information, forging vast sensing networks that
allow information on certain topics to be collected, stored, mined, and analyzed in a rapid manner. During the
COVID-19 pandemic, extensive social media mining efforts have been undertaken to tackle COVID-19 challenges
from various perspectives. This review summarizes the progress of social media data mining studies in the
COVID-19 contexts and categorizes them into six major domains, including early warning and detection, human
mobility monitoring, communication and information conveying, public attitudes and emotions, infodemic and
misinformation, and hatred and violence. We further document essential features of publicly available COVID-19
related social media data archives that will benefit research communities in conducting replicable and reproducible studies. In addition, we discuss seven challenges in social media analytics associated with their potential
impacts on derived COVID-19 findings, followed by our visions for the possible paths forward in regard to social
media-based COVID-19 investigations. This review serves as a valuable reference that recaps social media mining
efforts in COVID-19 related studies and provides future directions along which the information harnessed from
social media can be used to address public health emergencies
CLOUD-BASED MACHINE LEARNING AND SENTIMENT ANALYSIS
The role of a Data Scientist is becoming increasingly ubiquitous as companies and institutions see the need to gain additional insights from data in order to make better decisions and improve the quality of service delivered to customers. This thesis covers three data science projects aimed at improving the tools and techniques used in analyzing and evaluating data. In the first study, cloud-based automated machine learning algorithms were applied to a standard cybersecurity dataset to detect vulnerabilities in the network traffic data. The performance of the algorithms was measured and compared using standard evaluation metrics. The second study involved text mining of social media, specifically Reddit. We mined up to 100,000 comments in multiple subreddits and tested for hate speech via a custom-designed version of the Python Vader sentiment analysis package. Our work integrated standard sentiment analysis with Hatebase.org, and we demonstrate that our new method can better detect hate speech in social media. In the third project, following sentiment analysis and hate speech detection, we applied statistical techniques to evaluate the significance of differences in text analytics, specifically in the sentiment categories produced by lexicon-based software and cloud-based tools. We compared the three big cloud providers, AWS, Azure, and GCP, with the standard Python Vader sentiment analysis library. Statistical analysis showed a significant difference between the cloud platforms as well as Vader, and demonstrated that each platform is unique in its scoring mechanism.
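The second study in this abstract combines lexicon-based sentiment scoring with a list of flagged terms. A minimal Python sketch of that general idea follows. It does not use the Vader package or any Hatebase.org data or API: the two small lexicons, the placeholder term, and the decision rule are all hypothetical and only illustrate how the two signals could be combined.

```python
import re

# Hypothetical sentiment lexicon: word -> valence (not Vader's lexicon).
SENTIMENT = {"love": 2.0, "great": 1.5, "hate": -2.0, "awful": -1.5}

# Placeholder standing in for a curated list of flagged terms.
FLAGGED_TERMS = {"flaggedterm"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def score_comment(text):
    """Sum word valences and collect any flagged terms in the comment."""
    tokens = tokenize(text)
    sentiment = sum(SENTIMENT.get(token, 0.0) for token in tokens)
    flagged = [token for token in tokens if token in FLAGGED_TERMS]
    return {
        "sentiment": sentiment,
        "flagged": flagged,
        # Assumed rule: negative tone plus at least one flagged term.
        "is_candidate": sentiment < 0 and bool(flagged),
    }
```

Requiring both signals is one possible design choice: a flagged term in a neutral or positive comment, or a negative comment without flagged terms, would not be marked under this rule.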