323 research outputs found

    Comprehensive Literature Review on Machine Learning Structures for Web Spam Classification

    Various Web spam features and machine learning structures have been proposed in recent years to classify Web spam. The aim of this paper was to provide a comprehensive comparison of machine learning algorithms within the Web spam detection community. Several machine learning algorithms and ensemble meta-algorithms were evaluated as classifiers, with the area under the receiver operating characteristic curve as the performance measure, on two publicly available datasets (WEBSPAM-UK2006 and WEBSPAM-UK2007). The results showed that random forest with variations of AdaBoost achieved 0.937 on WEBSPAM-UK2006 and 0.852 on WEBSPAM-UK2007.
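
    The evaluation measure named in this abstract, the area under the ROC curve, can be illustrated with a minimal sketch. It uses the standard pairwise-ranking definition (the fraction of spam/non-spam pairs in which the spam item receives the higher score); the function name `roc_auc` and the toy labels and scores are assumptions for illustration and are not taken from the paper.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: iterable of 0 (non-spam) or 1 (spam); scores: classifier scores.
    Hypothetical helper for illustration; quadratic in the number of items.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # A positive ranked above a negative counts 1, a tie counts 0.5.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data (made up): one spam host is scored below one non-spam host.
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
print(auc)
```

    A value of 1.0 would mean every spam item outranks every non-spam item, and 0.5 corresponds to random ranking.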

    Malware Detection and Analysis

    Malicious software poses a serious threat to the cybersecurity of network infrastructures and is a global pandemic in the form of computer viruses, Trojan horses, and Internet worms. Studies imply that the effects of malware are worsening. The main defense against malware is the malware detector, and the methods that such a detector employs define its level of quality. Therefore, it is crucial that we research malware detection methods and comprehend their advantages and disadvantages. Attackers are creating polymorphic and metamorphic malware that can modify its own code as it spreads. Furthermore, existing defenses, which often rely on signature-based approaches and cannot identify previously unseen harmful executables, are significantly undermined by the diversity and volume of malware variants. Variants within a malware family exhibit common behavioral characteristics that reveal their origin and function. Machine learning techniques may be used to detect novel malware and categorize it into recognized families using the behavioral patterns discovered via static or dynamic analysis. In this paper, we discuss malware, its various forms, malware concealment strategies, and malware attack mechanisms. Additionally, many detection methods and classification models are presented in this study. The method of malware analysis is demonstrated by analyzing a malware program in a contained environment.

    A systematic framework to discover pattern for web spam classification

    Web spam is a big problem for search engine users on the World Wide Web. Spam websites use deceptive techniques to achieve high rankings. Although many researchers have presented different approaches for web spam classification and detection, it is still an open issue in computer science. Analyzing and evaluating these websites can be an effective step for discovering and categorizing their features. There are several methods and algorithms for detecting those websites, such as decision tree algorithms. In this paper, we present a systematic framework based on the CHAID algorithm and a modified string matching algorithm (KMP) for feature extraction and analysis of these websites. We evaluated our model and other methods with a dataset of Alexa Top 500 Global Sites and Bing search engine results for 500 queries. Comment: Proceedings of IEEE IEMCON 201
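
    For readers unfamiliar with the string matching algorithm mentioned in this abstract, the following is a minimal sketch of the standard, unmodified Knuth-Morris-Pratt (KMP) search. The abstract does not describe the authors' modification, so this is not their variant; the function names `build_failure_table` and `kmp_search` and the sample strings are assumptions for illustration.

```python
def build_failure_table(pattern: str) -> list[int]:
    """table[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    table = [0] * len(pattern)
    k = 0  # length of the currently matched prefix
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k - 1]  # fall back to a shorter border
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table


def kmp_search(text: str, pattern: str) -> list[int]:
    """Return the start indices of all (possibly overlapping) matches."""
    if not pattern:
        return []
    table = build_failure_table(pattern)
    matches = []
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = table[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = table[k - 1]  # keep going to find overlapping matches
    return matches


# Made-up example: count occurrences of a keyword in page text.
positions = kmp_search("cheap pills, cheap loans, cheap", "cheap")
print(positions)
```

    The failure table lets the search skip re-examining text characters after a mismatch, so the scan runs in time linear in the combined length of text and pattern.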

    Methods for demoting and detecting Web spam

    Web spamming has tremendously subverted the ranking mechanism of information retrieval in Web search engines. It maliciously manipulates the data source, either through content or through links, with the intention of degrading Web search results. The reordering of search results by spammers has made searching more difficult and more time-consuming for Web users trying to retrieve relevant information. In order to improve the quality of Web search engine results, anti-Web spam techniques are developed in this thesis to detect and demote Web spam via trust and distrust models and Web spam classification. A comprehensive literature review of existing anti-Web spam techniques, emphasizing trust and distrust models and machine learning models, is presented. Furthermore, several experiments are conducted to show the vulnerability of ranking algorithms to Web spam. Two publicly available Web spam datasets are used for the experiments throughout the thesis: WEBSPAM-UK2006 and WEBSPAM-UK2007. Two link-based trust and distrust model algorithms are then presented: Trust Propagation Rank and Trust Propagation Spam Mass. Both algorithms semi-automatically detect and demote Web spam based on a limited set of human expert evaluations of non-spam and spam pages. In the experiments, Trust Propagation Rank and Trust Propagation Spam Mass achieved up to 10.88% and 43.94% improvement over the benchmark algorithms. Thereafter, the weight properties associated with the linkage between two Web hosts are introduced into the task of Web spam detection. In most studies, the weight properties are used in the ranking mechanism; in this research work, they are incorporated into distrust-based algorithms to detect more spam. The experiments showed that the weight properties enhanced existing distrust-based Web spam detection algorithms by up to 30.26% and 31.30% on the two aforementioned datasets. Even though the integration of weight properties showed significant results in detecting Web spam, a distrust seed set propagation algorithm is presented to further enhance Web spam detection. The distrust seed set propagation algorithm propagates the distrust score over a wider range to estimate the probability that other unevaluated Web pages are spam. The experimental results showed that the algorithm improved the distrust-based Web spam detection algorithms by up to 19.47% and 25.17% on the two datasets. An alternative machine learning classifier, a multilayered perceptron neural network, is proposed in the thesis to further improve the detection rate of Web spam. In the experiments, the detection rate of Web spam using the multilayered perceptron neural network increased by up to 14.02% and 3.53% over the conventional classifier, support vector machines. In addition, a mechanism to determine the number of hidden neurons for the multilayered perceptron neural network is presented in this thesis to simplify the design of the network structure.

    Feature Selection by Multiobjective Optimization: Application to Spam Detection System by Neural Networks and Grasshopper Optimization Algorithm

    Networks are strained by spam, which also overloads email servers and blocks mailboxes with unwanted messages and files. Setting the protective level for spam filtering might become even more crucial for email users when malicious steps are taken since they must deal with an increase in the number of valid communications being marked as spam. By finding patterns in email communications, spam detection systems (SDS) have been developed to keep track of spammers and filter email activity. SDS has also enhanced the tool for detecting spam by reducing the rate of false positives and increasing the accuracy of detection. The difficulty with spam classifiers is the abundance of features. The importance of feature selection (FS) comes from its role in directing the feature selection algorithm's search for ways to improve the SDS's classification performance and accuracy. As a means of enhancing the performance of the SDS, we use a wrapper technique in this study that is based on the multi-objective grasshopper optimization algorithm (MOGOA) for feature extraction and the recently revised EGOA algorithm for multilayer perceptron (MLP) training. The suggested system's performance was verified using the SpamBase, SpamAssassin, and UK-2011 datasets. Our research showed that our novel approach outperformed a variety of established practices in the literature by as much as 97.5%, 98.3%, and 96.4% respectively. ©2022 the Authors. Published by IEEE. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/ Peer reviewed.

    A pipeline and comparative study of 12 machine learning models for text classification

    Text-based communication is highly favoured as a communication method, especially in business environments. As a result, it is often abused by sending malicious messages, e.g., spam emails, to deceive users into relaying personal information, including online accounts credentials or banking details. For this reason, many machine learning methods for text classification have been proposed and incorporated into the services of most email providers. However, optimising text classification algorithms and finding the right tradeoff on their aggressiveness is still a major research problem. We present an updated survey of 12 machine learning text classifiers applied to a public spam corpus. A new pipeline is proposed to optimise hyperparameter selection and improve the models' performance by applying specific methods (based on natural language processing) in the preprocessing stage. Our study aims to provide a new methodology to investigate and optimise the effect of different feature sizes and hyperparameters in machine learning classifiers that are widely used in text classification problems. The classifiers are tested and evaluated on different metrics including F-score (accuracy), precision, recall, and run time. By analysing all these aspects, we show how the proposed pipeline can be used to achieve a good accuracy towards spam filtering on the Enron dataset, a widely used public email corpus. Statistical tests and explainability techniques are applied to provide a robust analysis of the proposed pipeline and interpret the classification outcomes of the 12 machine learning models, also identifying words that drive the classification results. Our analysis shows that it is possible to identify an effective machine learning model to classify the Enron dataset with an F-score of 94%. Comment: This article has been accepted for publication in Expert Systems with Applications, April 2022. Published by Elsevier. All data, models, and code used in this work are available on GitHub at https://github.com/Angione-Lab/12-machine-learning-models-for-text-classificatio
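
    The metrics listed in this abstract (precision, recall, and F-score) can be made concrete with a small sketch. It computes the usual F1 score, the harmonic mean of precision and recall; the abstract does not say which F-score variant the authors used, so F1 is an assumption, as are the function name `precision_recall_f1` and the toy labels.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, F1) for the class labelled `positive`.

    Hypothetical helper for illustration; labels are compared by equality.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data (made up): 1 = spam, 0 = legitimate email.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]
print(precision_recall_f1(y_true, y_pred))
```

    Precision penalises legitimate mail flagged as spam, recall penalises missed spam, and F1 balances the two, which is why it is often reported when a filter's aggressiveness is being tuned.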

    Spam Classification Using Machine Learning Techniques - Sinespam

    Most e-mail readers spend a non-trivial amount of time regularly deleting junk e-mail (spam) messages, even as an expanding volume of such e-mail occupies server storage space and consumes network bandwidth. An ongoing challenge, therefore, rests within the development and refinement of automatic classifiers that can distinguish legitimate e-mail from spam. Some published studies have examined spam detectors using Naïve Bayesian approaches and large feature sets of binary attributes that determine the existence of common keywords in spam, and many commercial applications also use Naïve Bayesian techniques. Spammers recognize these attempts to block their messages and have developed tactics to circumvent these filters, but these evasive tactics are themselves patterns that human readers can often identify quickly. This work had the objective of developing an alternative approach using a neural network (NN) classifier trained on a corpus of e-mail messages from several users. The feature selection used in this work is one of the major improvements, because the feature set uses descriptive characteristics of words and messages similar to those that a human reader would use to identify spam, and the model to select the best feature set was based on forward feature selection. Another objective of this work was to raise spam detection accuracy to nearly 95% using artificial neural networks (ANN); according to the authors, no previous work had reached more than 89% accuracy using ANN.