1,720 research outputs found

    Natural language processing for Albanian: a state-of-the-art survey

    Due to its wide applicability, natural language processing (NLP) has attracted significant research effort from the machine learning and deep learning communities. Despite this, research on NLP for the Albanian language remains limited. Moreover, to the best of our knowledge, no literature review is available that presents a clear picture of what has been studied, argued, and established in the area. The main objective of this survey is to comprehensively review, analyze, and discuss the state of the art in NLP for the Albanian language. We present an extensive study of the contributions of authors who have applied NLP to the Albanian language, together with an overview of research on typical NLP applications for Albanian. Finally, some future challenges and limitations of the area are discussed.

    Automatic Hate Speech Detection: A Literature Review

    Hate speech has been an ongoing problem on the Internet for many years. Social media, especially Facebook and Twitter, have given it a global stage on which it can spread far more rapidly. Every social media platform needs an effective hate speech detection system to remove offensive content in real time. There are various approaches to identifying hate speech, such as rule-based, machine learning based, deep learning based, and hybrid approaches. This review paper explains the work of various authors who have studied the identification of hate speech using these approaches.

    Explainable and High-Performance Hate and Offensive Speech Detection

    The spread of information through social media platforms can create environments that are hostile to vulnerable communities and can silence certain groups in society. To mitigate this, several models have been developed to detect hate and offensive speech. Because detecting hate and offensive speech on social media platforms could incorrectly exclude individuals from those platforms, which can reduce trust, there is a need for explainable and interpretable models. Thus, we build an explainable and interpretable high-performance model based on the XGBoost algorithm, trained on Twitter data. For unbalanced Twitter data, XGBoost outperformed the LSTM, AutoGluon, and ULMFiT models on hate speech detection with an F1 score of 0.75 compared to 0.38, 0.37, and 0.38, respectively. When we down-sampled the data to three separate classes of approximately 5000 tweets, XGBoost performed better than LSTM, AutoGluon, and ULMFiT, with F1 scores for hate speech detection of 0.79 vs 0.69, 0.77, and 0.66, respectively. XGBoost also performed better than LSTM, AutoGluon, and ULMFiT in the down-sampled version for offensive speech detection with an F1 score of 0.83 vs 0.88, 0.82, and 0.79, respectively. We use Shapley Additive Explanations (SHAP) on our XGBoost models' outputs to make them explainable and interpretable, in contrast to LSTM, AutoGluon, and ULMFiT, which are black-box models.
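
    The abstract above compares models by F1 score on unbalanced data. As a minimal sketch of why F1 rather than accuracy is informative in that setting, the following computes both for a single class. The labels are toy values invented for illustration, not data from the paper, and the function name `f1_for_class` is a hypothetical helper.

```python
def f1_for_class(y_true, y_pred, positive):
    """F1 score for one class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy, made-up labels: only 2 of 10 items belong to the minority "hate" class.
y_true = ["hate", "hate"] + ["other"] * 8
# The hypothetical classifier finds one hate item, misses one, and raises one false alarm.
y_pred = ["hate", "other"] + ["other"] * 7 + ["hate"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
hate_f1 = f1_for_class(y_true, y_pred, positive="hate")

print(f"accuracy = {accuracy:.2f}")  # 0.80
print(f"hate F1  = {hate_f1:.2f}")   # 0.50
```

    In this toy case accuracy looks high because the majority class dominates, while the minority-class F1 exposes the missed and spurious detections.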

    PEACE: Cross-Platform Hate Speech Detection- A Causality-guided Framework

    Hate speech detection refers to the task of detecting hateful content that aims at denigrating an individual or a group based on their religion, gender, sexual orientation, or other characteristics. Due to the different policies of the platforms, different groups of people express hate in different ways. Furthermore, the lack of labeled data on some platforms makes it challenging to build hate speech detection models. To this end, we revisit whether we can learn a generalizable hate speech detection model for the cross-platform setting, where we train the model on the data from one (source) platform and generalize the model across multiple (target) platforms. Existing generalization models rely on linguistic cues or auxiliary information, making them biased towards certain tags or certain kinds of words (e.g., abusive words) on the source platform and thus not applicable to the target platforms. Inspired by social and psychological theories, we explore whether there exist inherent causal cues that can be leveraged to learn generalizable representations for detecting hate speech across these distribution shifts. We propose a causality-guided framework, PEACE, that identifies and leverages two intrinsic causal cues omnipresent in hateful content: the overall sentiment and the aggression in the text. We conduct extensive experiments across multiple platforms (representing the distribution shift) to show whether causal cues can help cross-platform generalization. Comment: ECML PKDD 202

    Social Emotion Mining Techniques for Facebook Posts Reaction Prediction

    As of February 2016, Facebook allows users to express their experienced emotions about a post by using five so-called 'reactions'. This research paper proposes and evaluates alternative methods for predicting these reactions to user posts on public pages of firms/companies (like supermarket chains). For this purpose, we collected posts (and their reactions) from Facebook pages of large supermarket chains and constructed a dataset which is available to other researchers. In order to predict the distribution of reactions to a new post, neural network architectures (convolutional and recurrent neural networks) were tested using pretrained word embeddings. Results of the neural networks were improved by introducing a bootstrapping approach for sentiment and emotion mining on the comments for each post. The final model (a combination of a neural network and a baseline emotion miner) is able to predict the reaction distribution on Facebook posts with a mean squared error (or misclassification rate) of 0.135. Comment: 10 pages, 13 figures and accepted at ICAART 2018. (Dataset: https://github.com/jerryspan/FacebookR
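
    The abstract above evaluates a predicted reaction distribution by mean squared error. A minimal sketch of one way such a score could be computed follows. It assumes the error is averaged over the five reaction categories of a single post, which the abstract does not specify; the counts and predictions are invented for illustration, and `normalize` and `mean_squared_error` are hypothetical helper names.

```python
def normalize(counts):
    """Turn raw reaction counts into a distribution that sums to 1."""
    total = sum(counts)
    if total == 0:
        raise ValueError("post has no reactions")
    return [c / total for c in counts]


def mean_squared_error(predicted, target):
    """Average squared difference between two equal-length distributions."""
    if len(predicted) != len(target):
        raise ValueError("distributions must have the same length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


# Made-up counts for the five reaction types on one post (assumption, not paper data).
observed_counts = [50, 30, 10, 5, 5]
target = normalize(observed_counts)          # [0.5, 0.3, 0.1, 0.05, 0.05]

# Hypothetical model output for the same post.
predicted = [0.4, 0.3, 0.2, 0.05, 0.05]

error = mean_squared_error(predicted, target)
print(f"MSE = {error:.4f}")  # 0.0040
```

    Averaging this per-post value over a test set would give a single corpus-level figure; whether the paper aggregates in this way is not stated in the abstract.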

    Transfer Learning for Low-Resource Sentiment Analysis

    Sentiment analysis is the process of identifying and extracting subjective information from text. Despite advances in applying cross-lingual approaches automatically, the implementation and evaluation of sentiment analysis systems require language-specific data to account for various sociocultural and linguistic peculiarities. In this paper, we describe the collection and annotation of a dataset for sentiment analysis of Central Kurdish. We explore a few classical machine learning and neural network-based techniques for this task. Additionally, we employ a transfer learning approach to leverage pretrained models for data augmentation. We demonstrate that data augmentation achieves a high F1 score and accuracy despite the difficulty of the task. Comment: 14 pages - under review at ACM TALLI

    Social media mining under the COVID-19 context: Progress, challenges, and opportunities

    Social media platforms allow users worldwide to create and share information, forging vast sensing networks that allow information on certain topics to be collected, stored, mined, and analyzed in a rapid manner. During the COVID-19 pandemic, extensive social media mining efforts have been undertaken to tackle COVID-19 challenges from various perspectives. This review summarizes the progress of social media data mining studies in the COVID-19 context and categorizes them into six major domains, including early warning and detection, human mobility monitoring, communication and information conveying, public attitudes and emotions, infodemic and misinformation, and hatred and violence. We further document essential features of publicly available COVID-19 related social media data archives that will benefit research communities in conducting replicable and reproducible studies. In addition, we discuss seven challenges in social media analytics associated with their potential impacts on derived COVID-19 findings, followed by our visions for the possible paths forward in regard to social media-based COVID-19 investigations. This review serves as a valuable reference that recaps social media mining efforts in COVID-19 related studies and provides future directions along which the information harnessed from social media can be used to address public health emergencies.

    CLOUD-BASED MACHINE LEARNING AND SENTIMENT ANALYSIS

    The role of a Data Scientist is becoming increasingly ubiquitous as companies and institutions see the need to gain additional insights and information from data to make better decisions and improve the quality of service delivery to customers. This thesis document contains three data science projects aimed at improving tools and techniques used in analyzing and evaluating data. The first research study applied cloud-based auto-machine learning algorithms to a standard cybersecurity dataset to detect vulnerabilities in the network traffic data. The performance of the algorithms was measured and compared using standard evaluation metrics. The second research study involved text-mining social media, specifically Reddit. We mined up to 100,000 comments in multiple subreddits and tested for hate speech via a custom-designed version of the Python Vader sentiment analysis package. Our work integrated standard sentiment analysis with Hatebase.org, and we demonstrate that our new method can better detect hate speech in social media. Following sentiment analysis and hate speech detection, in the third research project we applied statistical techniques to evaluate the significant difference in text analytics, specifically the sentiment categories, for both lexicon-based software and cloud-based tools. We compared the three big cloud providers, AWS, Azure, and GCP, with the standard Python Vader sentiment analysis library. We used statistical analysis to determine a significant difference between the cloud platforms as well as Vader, and demonstrated that each platform is unique in its analysis scoring mechanism.