3 research outputs found

    Pre-processing of social media remarks for forensics

    The Internet's rapid growth has led to a surge in social network users, resulting in an increase in extreme emotional and hate speech online. This study focuses on the security of public opinion in cyber security by analyzing Twitter data. The goal is to develop a model that can detect both sentiment and hate speech in user texts, aiding the identification of content that may violate laws and regulations. The study involves pre-processing the acquired forensic data, including tasks such as lowercasing, stop-word removal, and stemming, to obtain clean and effective data. This paper contributes to the field of public opinion security by linking forensic data with machine learning techniques, showcasing the potential for detecting and analyzing Twitter text data.
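    The pre-processing steps named in the abstract (lowercasing, stop-word removal, stemming) can be sketched roughly as follows. The stop-word list and suffix-stripping rules below are toy assumptions for illustration; a real pipeline would typically draw on a library such as NLTK's stopwords corpus and PorterStemmer.

    ```python
    import re

    # Illustrative stop-word list and suffix rules (assumptions, not the paper's).
    STOP_WORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in", "on"}
    SUFFIXES = ("ing", "ly", "ed", "es", "s")

    def crude_stem(token: str) -> str:
        """Strip one common suffix (a toy stand-in for a real stemmer)."""
        for suffix in SUFFIXES:
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                return token[: -len(suffix)]
        return token

    def preprocess(tweet: str) -> list[str]:
        """Lowercase, drop URLs/mentions/punctuation, remove stop words, stem."""
        text = tweet.lower()
        text = re.sub(r"https?://\S+|@\w+", " ", text)   # strip URLs and mentions
        text = re.sub(r"[^a-z\s]", " ", text)            # keep letters only
        tokens = [t for t in text.split() if t not in STOP_WORDS]
        return [crude_stem(t) for t in tokens]

    print(preprocess("Visiting @user is WASTING the day! https://t.co/x"))
    # → ['visit', 'wast', 'day']
    ```

    Note that crude suffix stripping produces non-words like "wast"; that is also true of real stemmers (Porter stems "wasting" to "wast"), which is acceptable because the output feeds a model, not a reader.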

    SENTIMENT CLASSIFICATION OF TWEETS WITH EXPLICIT WORD NEGATIONS AND EMOJI USING DEEP LEARNING

    The widespread use of social media platforms such as Twitter, Instagram, Facebook, and LinkedIn has had a huge impact on daily human interactions and decision-making. Owing to Twitter's widespread acceptance, users can express their opinions/sentiments on nearly any issue, ranging from public opinion and products/services to even a specific group of people. Sharing these opinions/sentiments results in a massive production of user content known as tweets, which can be assessed to generate new knowledge. Corporate insights, government policy formation, decision-making, and brand identity monitoring all benefit from analyzing the opinions/sentiments expressed in these tweets. Even though several techniques have been created to analyze user sentiments from tweets, social media engagements include negation words and emoji elements that, if not properly pre-processed, would result in misclassification. The majority of available pre-processing techniques rely on clean data and machine learning algorithms to annotate sentiment in unlabeled texts. In this study, we propose a text pre-processing approach that takes into consideration negation words and emoji characteristics in text data by translating these features into single contextual words in tweets to minimize context loss. The proposed pre-processor was evaluated on benchmark Twitter datasets using deep learning algorithms: Long Short-Term Memory (LSTM), Recurrent Neural Network (RNN), and Artificial Neural Network (ANN). The results showed that LSTM performed better than the approaches already discussed in the literature, with accuracies of 96.36%, 88.41%, and 95.39%. The findings also suggest that pre-processing information such as emoji and explicit word negations aids in the preservation of sentimental information. This appears to be the first study to classify sentiments in tweets while accounting for both explicit word negation conversion and emoji translation.
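    The core idea of translating negations and emoji into single contextual words can be sketched as below. The emoji-to-word mapping and negation list are illustrative assumptions, not the paper's actual tables: each emoji becomes a descriptive token, and a negation word is fused with the token that follows it, so that "not good" survives tokenisation as one feature rather than two.

    ```python
    # Illustrative mappings (assumptions for demonstration only).
    EMOJI_MAP = {"🙂": "emoji_happy", "😡": "emoji_angry", "😢": "emoji_sad"}
    NEGATIONS = {"not", "no", "never", "cannot"}

    def translate(tweet: str) -> list[str]:
        """Convert emoji and explicit negations into single contextual words."""
        # Replace each known emoji with one descriptive word.
        for emoji, word in EMOJI_MAP.items():
            tweet = tweet.replace(emoji, f" {word} ")
        tokens = tweet.lower().split()
        merged, i = [], 0
        while i < len(tokens):
            # Fuse a negation word with its successor into a single token.
            if tokens[i] in NEGATIONS and i + 1 < len(tokens):
                merged.append(tokens[i] + "_" + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        return merged

    print(translate("The service was not good 😡"))
    # → ['the', 'service', 'was', 'not_good', 'emoji_angry']
    ```

    Fusing the pair into `not_good` means the downstream embedding layer sees a token whose learned vector can differ from that of `good`, which is one plausible way to minimise the context loss the abstract describes.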

    Enhancing extremist data classification through textual analysis

    The high volume of extremist material on the Internet has created the need for intelligence gathering via the Web and real-time monitoring of potential websites for evidence of extremist activity. However, manual classification of such content is impractical and time-consuming. In response to this challenge, the work reported here developed several classification frameworks. Each framework provides a basis of text representation before the data are fed into a machine learning algorithm. The bases of text representation are Sentiment-rule, Posit textual analysis with word-level features, and an extension of Posit analysis, known as Extended-Posit, which adopts character-level as well as word-level data. Gaps identified in these techniques created avenues for further improvement, most notably in handling larger datasets with better classification accuracy. Consequently, a novel basis of text representation, known as the Composite-based method, was developed. This is a computational framework that combines the sentiment and syntactic features of the textual content of a Web page. These techniques were then applied to a dataset that had first undergone a manual classification process, and the resulting representations were fed into machine learning algorithms to measure how well each page could be assigned to its appropriate class. The classifiers considered include neural networks (RNN and MLP) and conventional machine learning classifiers (such as J48, Random Forest, and KNN). In addition, feature selection and model optimisation were evaluated to determine the cost of creating a machine learning model.
    Considering the results obtained from each framework, composite features are preferable to solely syntactic or sentiment features, offering improved classification accuracy when used with machine learning algorithms. Furthermore, extending Posit analysis to include both word- and character-level data outperformed word-level features alone when applied to the assembled textual data. Moreover, the Random Forest classifier outperformed the other classifiers explored. Taking cost into account, feature selection improves classification accuracy and saves more time than hyperparameter tuning (model optimisation).
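    The idea of a composite representation, combining sentiment features with syntactic and character-level ones into a single vector for a classifier, can be sketched as follows. The tiny lexicon and the specific features chosen are assumptions for demonstration; the thesis's Posit toolset and actual feature sets are not reproduced here.

    ```python
    # Illustrative sentiment lexicon (an assumption, not the thesis's lexicon).
    POSITIVE = {"peace", "support", "good"}
    NEGATIVE = {"attack", "hate", "destroy"}

    def composite_features(text: str) -> list[float]:
        """Concatenate sentiment, word-level, and character-level features."""
        tokens = text.lower().split()
        n = max(len(tokens), 1)
        # Sentiment features: lexicon hit rates.
        pos_rate = sum(t in POSITIVE for t in tokens) / n
        neg_rate = sum(t in NEGATIVE for t in tokens) / n
        # Word-level (syntactic proxy) features: token count, mean word length.
        mean_len = sum(len(t) for t in tokens) / n
        # Character-level feature: uppercase ratio in the raw text.
        upper_ratio = sum(c.isupper() for c in text) / max(len(text), 1)
        return [pos_rate, neg_rate, float(len(tokens)), mean_len, upper_ratio]

    vec = composite_features("They DESTROY peace and support hate")
    print(vec)  # 5-dimensional composite vector
    ```

    A vector like this, built per page, is what would be handed to a classifier such as Random Forest; the abstract's finding is that pairing sentiment features with syntactic ones in this way beats either family alone.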