2,146 research outputs found

    Impact Of Content Features For Automatic Online Abuse Detection

    Full text link
    Online communities have gained considerable importance in recent years due to the increasing number of people connected to the Internet. Moderating user content in online communities is mainly performed manually, and reducing the workload through automatic methods is of great financial interest for community maintainers. Often, the industry uses basic approaches such as bad words filtering and regular expression matching to assist the moderators. In this article, we consider the task of automatically determining if a message is abusive. This task is complex since messages are written in a non-standardized way, including spelling errors, abbreviations, community-specific codes... First, we evaluate the system that we propose using standard features of online messages. Then, we evaluate the impact of the addition of pre-processing strategies, as well as original specific features developed for the community of an online in-browser strategy game. We finally propose to analyze the usefulness of this wide range of features using feature selection. This work can lead to two possible applications: 1) automatically flag potentially abusive messages to draw the moderator's attention on a narrow subset of messages ; and 2) fully automate the moderation process by deciding whether a message is abusive without any human intervention

    A pipeline and comparative study of 12 machine learning models for text classification

    Get PDF
    Text-based communication is highly favoured as a communication method, especially in business environments. As a result, it is often abused by sending malicious messages, e.g., spam emails, to deceive users into relaying personal information, including online accounts credentials or banking details. For this reason, many machine learning methods for text classification have been proposed and incorporated into the services of most email providers. However, optimising text classification algorithms and finding the right tradeoff on their aggressiveness is still a major research problem. We present an updated survey of 12 machine learning text classifiers applied to a public spam corpus. A new pipeline is proposed to optimise hyperparameter selection and improve the models' performance by applying specific methods (based on natural language processing) in the preprocessing stage. Our study aims to provide a new methodology to investigate and optimise the effect of different feature sizes and hyperparameters in machine learning classifiers that are widely used in text classification problems. The classifiers are tested and evaluated on different metrics including F-score (accuracy), precision, recall, and run time. By analysing all these aspects, we show how the proposed pipeline can be used to achieve a good accuracy towards spam filtering on the Enron dataset, a widely used public email corpus. Statistical tests and explainability techniques are applied to provide a robust analysis of the proposed pipeline and interpret the classification outcomes of the 12 machine learning models, also identifying words that drive the classification results. Our analysis shows that it is possible to identify an effective machine learning model to classify the Enron dataset with an F-score of 94%.Comment: This article has been accepted for publication in Expert Systems with Applications, April 2022. Published by Elsevier. All data, models, and code used in this work are available on GitHub at https://github.com/Angione-Lab/12-machine-learning-models-for-text-classificatio

    Detecting spam e-mails using stop word TF-IDF and stemming algorithm with Naïve Bayes classifier on the multicore GPU

    Get PDF
    A spam filter is a program which is used to identify unwanted emails and prevents those messages from getting into a user's mail. The study was focused on how the algorithms can be applied on a number of e-mails consisting of both ham and spam e-mails. First, the working principle and steps which are followed for implementation of stop words, TF-IDF and stemming algorithm on NVIDIA’s Tesla P100 GPU are discussed and to verify the findings by executing of Naïve Bayes algorithm. After complete training and testing of the spam e-mails dataset taken from Kaggle by using the proposed method, we got a high training accuracy of 99.67% and got a testing accuracy of about 99.03% on the multicore GPU that boosted the speed of execution of training time period and testing time period which is improved of training and testing accuracy around 0.22% and 0.18% respectively when compared to that after applying only Naïve Bayes i.e. conventional method to the same dataset where we found training and testing accuracy to be 99.45% and 98.85% respectively. Also, we found that training time taken on GPU is 1.361 seconds which was about 1.49X faster than that taken on CPU which is 2.029 seconds. And the testing time taken on GPU is 1.978 seconds which was about 1.15X faster than that taken on CPU which is 2.280 seconds

    IDSGAN: Generative Adversarial Networks for Attack Generation against Intrusion Detection

    Full text link
    As an important tool in security, the intrusion detection system bears the responsibility of the defense to network attacks performed by malicious traffic. Nowadays, with the help of machine learning algorithms, the intrusion detection system develops rapidly. However, the robustness of this system is questionable when it faces the adversarial attacks. To improve the detection system, more potential attack approaches should be researched. In this paper, a framework of the generative adversarial networks, IDSGAN, is proposed to generate the adversarial attacks, which can deceive and evade the intrusion detection system. Considering that the internal structure of the detection system is unknown to attackers, adversarial attack examples perform the black-box attacks against the detection system. IDSGAN leverages a generator to transform original malicious traffic into adversarial malicious traffic. A discriminator classifies traffic examples and simulates the black-box detection system. More significantly, we only modify part of the attacks' nonfunctional features to guarantee the validity of the intrusion. Based on the dataset NSL-KDD, the feasibility of the model is demonstrated to attack many detection systems with different attacks and the excellent results are achieved. Moreover, the robustness of IDSGAN is verified by changing the amount of the unmodified features.Comment: 8 pages, 5 figure

    Application of Big Data Technology, Text Classification, and Azure Machine Learning for Financial Risk Management Using Data Science Methodology

    Get PDF
    Data science plays a crucial role in enabling organizations to optimize data-driven opportunities within financial risk management. It involves identifying, assessing, and mitigating risks, ultimately safeguarding investments, reducing uncertainty, ensuring regulatory compliance, enhancing decision-making, and fostering long-term sustainability. This thesis explores three facets of Data Science projects: enhancing customer understanding, fraud prevention, and predictive analysis, with the goal of improving existing tools and enabling more informed decision-making. The first project examined leveraged big data technologies, such as Hadoop and Spark, to enhance financial risk management by accurately predicting loan defaulters and their repayment likelihood. In the second project, we investigated risk assessment and fraud prevention within the financial sector, where Natural Language Processing and machine learning techniques were applied to classify emails into categories like spam, ham, and phishing. After training various models, their performance was rigorously evaluated. In the third project, we explored the utilization of Azure machine learning to identify loan defaulters, emphasizing the comparison of different machine learning algorithms for predictive analysis. The results aimed to determine the best-performing model by evaluating various performance metrics for the dataset. This study is important because it offers a strategy for enhancing risk management, preventing fraud, and encouraging innovation in the financial industry, ultimately resulting in better financial outcomes and enhanced customer protection

    Differential evolution detection models for SMS spam

    Get PDF
    With the growth of mobile phones, short message service (SMS) became an essential text communication service. However, the low cost and ease use of SMS led to an increase in SMS Spam. In this paper, the characteristics of SMS spam has studied and a set of features has introduced to get rid of SMS spam. In addition, the problem of SMS spam detection was addressed as a clustering analysis that requires a metaheuristic algorithm to find the clustering structures. Three differential evolution variants viz DE/rand/1, jDE/rand/1, jDE/best/1, are adopted for solving the SMS spam problem. Experimental results illustrate that the jDE/best/1 produces best results over other variants in terms of accuracy, false-positive rate and false-negative rate. Moreover, it surpasses the baseline methods

    Unmasking Falsehoods in Reviews: An Exploration of NLP Techniques

    Full text link
    In the contemporary digital landscape, online reviews have become an indispensable tool for promoting products and services across various businesses. Marketers, advertisers, and online businesses have found incentives to create deceptive positive reviews for their products and negative reviews for their competitors' offerings. As a result, the writing of deceptive reviews has become an unavoidable practice for businesses seeking to promote themselves or undermine their rivals. Detecting such deceptive reviews has become an intense and ongoing area of research. This research paper proposes a machine learning model to identify deceptive reviews, with a particular focus on restaurants. This study delves into the performance of numerous experiments conducted on a dataset of restaurant reviews known as the Deceptive Opinion Spam Corpus. To accomplish this, an n-gram model and max features are developed to effectively identify deceptive content, particularly focusing on fake reviews. A benchmark study is undertaken to explore the performance of two different feature extraction techniques, which are then coupled with five distinct machine learning classification algorithms. The experimental results reveal that the passive aggressive classifier stands out among the various algorithms, showcasing the highest accuracy not only in text classification but also in identifying fake reviews. Moreover, the research delves into data augmentation and implements various deep learning techniques to further enhance the process of detecting deceptive reviews. The findings shed light on the efficacy of the proposed machine learning approach and offer valuable insights into dealing with deceptive reviews in the realm of online businesses.Comment: 6 pages, 3 figure
    • …
    corecore