560 research outputs found

    Comparing content-filter techniques for stopping spam

    There are many new theoretical techniques for detecting spam e-mail based upon the message contents. Although Bayesian methods are the most well-known, there are other approaches for classifying information. This paper establishes some criteria for measuring spam filter effectiveness and compares the Boosting and Support Vector Machine approaches with some well-known existing filter software. It also examines ways of transforming e-mail messages into a form which is more readily processable by such algorithms.
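The abstract does not say which effectiveness criteria the paper adopts. As a minimal sketch, assuming the commonly used precision, recall, and false-positive rate, a filter's decisions could be scored as follows. The function name `filter_metrics` and the toy labels are hypothetical and not taken from the paper.

```python
def filter_metrics(actual, predicted):
    """Score a spam filter; True means spam, False means legitimate mail.

    Assumption: precision, recall, and false-positive rate stand in for
    the paper's criteria, which the abstract does not list.
    """
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)          # spam caught
    fp = sum(1 for a, p in pairs if not a and p)      # legitimate mail blocked
    fn = sum(1 for a, p in pairs if a and not p)      # spam missed
    tn = sum(1 for a, p in pairs if not a and not p)  # legitimate mail passed

    def ratio(num, den):
        # Avoid division by zero when a class is absent from the sample.
        return num / den if den else 0.0

    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "false_positive_rate": ratio(fp, fp + tn),
    }


# Toy data: two spam messages and three legitimate ones.
actual = [True, True, False, False, False]
predicted = [True, False, True, False, False]
metrics = filter_metrics(actual, predicted)
```

The false-positive rate is reported separately because blocking a legitimate message usually costs more than letting one spam message through.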

    A discrete hidden Markov model for SMS spam detection

    Many machine learning methods have been applied to short messaging service (SMS) spam detection, including traditional methods such as naive Bayes (NB), the vector space model (VSM), and the support vector machine (SVM), and novel methods such as long short-term memory (LSTM) and the convolutional neural network (CNN). These methods are based on the well-known bag of words (BoW) model, which assumes documents are unordered collections of words. This assumption overlooks an important piece of information, i.e., word order. Moreover, the term frequency, which counts the number of occurrences of each word in an SMS, is unable to distinguish the importance of words, due to the length limitation of SMS. This paper proposes a new method based on the discrete hidden Markov model (HMM) to use the word order information and to solve the low term frequency issue in SMS spam detection. The popularly adopted SMS spam dataset from the UCI machine learning repository is used for performance analysis of the proposed HMM method. The overall performance is comparable with that of the deep learning CNN and LSTM models. A Chinese SMS spam dataset with 2000 messages is used for further performance evaluation. Experiments show that the proposed HMM method is not language-sensitive and can identify spam with high accuracy on both datasets.
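To illustrate the word-order point, the sketch below compares a bag-of-words view with ordered word pairs for two toy messages. It is not the paper's HMM. The helper names and example messages are hypothetical, and adjacent word pairs are used only as a simple stand-in for the order information a sequence model can exploit.

```python
from collections import Counter


def bag_of_words(message):
    """Return unordered word counts (the BoW view of a message)."""
    return Counter(message.lower().split())


def bigrams(message):
    """Return adjacent word pairs in order, which preserves word order."""
    words = message.lower().split()
    return list(zip(words, words[1:]))


# Two toy messages that use the same words in a different order.
a = "call now to win"
b = "win to call now"

same_bow = bag_of_words(a) == bag_of_words(b)  # BoW cannot tell them apart
same_order = bigrams(a) == bigrams(b)          # ordered pairs can
```

Every word here occurs exactly once, so term frequency alone gives all words the same weight, which mirrors the low term frequency issue the abstract describes for short messages.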

    Addressing the new generation of spam (Spam 2.0) through Web usage models

    New Internet collaborative media introduce new ways of communicating that are not immune to abuse. A fake eye-catching profile in social networking websites, a promotional review, a response to a thread in online forums with unsolicited content, or a manipulated Wiki page are examples of the new generation of spam on the web, referred to as Web 2.0 Spam or Spam 2.0. Spam 2.0 is defined as the propagation of unsolicited, anonymous, mass content to infiltrate legitimate Web 2.0 applications. The current literature does not address Spam 2.0 in depth and the outcomes of efforts to date are inadequate. The aim of this research is to formalise a definition for Spam 2.0 and provide Spam 2.0 filtering solutions. Early detection, extendibility, robustness and adaptability are key factors in the design of the proposed method. This dissertation provides a comprehensive survey of the state-of-the-art web spam and Spam 2.0 filtering methods to highlight the unresolved issues and open problems, while at the same time effectively capturing the knowledge in the domain of spam filtering. This dissertation proposes three solutions in the area of Spam 2.0 filtering: (1) characterising and profiling Spam 2.0, (2) an Early-Detection based Spam 2.0 Filtering (EDSF) approach, and (3) an On-the-Fly Spam 2.0 Filtering (OFSF) approach. All the proposed solutions are tested against real-world datasets and their performance is compared with that of existing Spam 2.0 filtering methods. This work has coined the term ‘Spam 2.0’, provided insight into the nature of Spam 2.0, and proposed filtering mechanisms to address this new and rapidly evolving problem.

    Automatic text categorisation of racist webpages

    Automatic Text Categorisation (TC) involves the assignment of one or more predefined categories to text documents in order that they can be effectively managed. In this thesis we examine the possibility of applying automatic text categorisation to the problem of categorising texts (web pages) based on whether or not they are racist. TC has proven successful for topic-based problems such as news story categorisation. However, the problem of detecting racism is dissimilar to topic-based problems in that lexical items present in racist documents can also appear in anti-racist documents or indeed potentially any document. The mere presence of a potentially racist term does not necessarily mean the document is racist. The difficulty is finding what distinguishes racist documents from non-racist ones. We use a machine learning method called Support Vector Machines (SVM) to automatically learn features of racism so that a decision can be made about the target class of unseen documents. We examine various representations within an SVM so as to identify the most effective method for handling this problem. Our work shows that it is possible to develop automatic categorisation of web pages, based on these approaches.

    Enhancing data privacy and security related process through machine learning

    In this thesis, we exploit the advantages of machine learning (ML) in the domains of data security and data privacy. ML is one of the most exciting technologies being developed in the world today. The major advantages of ML technology are its prediction capability and its ability to reduce the need for human effort in performing tasks. These benefits motivated us to exploit ML to improve users' data privacy and security. Firstly, we use ML technology to try to predict the best privacy settings for users, since ML has a strong prediction ability and the average user might find it difficult to properly set up privacy settings due to a lack of knowledge and a consequent lack of decision-making ability regarding the privacy of their data. Secondly, since the ML approach has the potential to considerably cut down on manual effort, we exploit ML technology to redesign security mechanisms of social media environments that rely on human participation for providing such services. In particular, we use ML to train spam filters for identifying and removing creators of violent, insulting, aggressive, and harassing content (a.k.a. spammers) from a social media platform. This helps to address the problems of violence and aggression that have been growing in social media environments. The experimental results show that our proposals are efficient and effective.

    Holistic Network Defense: Fusing Host and Network Features for Attack Classification

    This work presents a hybrid network-host monitoring strategy, which fuses data from both the network and the host to recognize malware infections. This work focuses on three categories: Normal, Scanning, and Infected. The network-host sensor fusion is accomplished by extracting 248 features from network traffic using the Fullstats Network Feature generator and from the host using text mining, looking at the frequency of the 500 most common strings and analyzing them as word vectors. Improvements to detection performance are made by synergistically fusing network features obtained from IP packet flows with host features obtained from text mining of port, processor, and logon information, among others. In addition, the work compares three different machine learning algorithms and updates the script required to obtain network features. Hybrid method results outperformed host-only classification by 31.7% and network-only classification by 25%. The new approach also reduces the number of alerts while remaining accurate compared with the commercial IDS SNORT. These results mean that even typical users could understand alert classification messages.
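The fusion step can be pictured as joining a network feature vector and a host string-frequency vector into one input for a classifier. The sketch below assumes simple concatenation, which the abstract does not confirm. The function names, the three-term vocabulary, and the numeric values are hypothetical, standing in for the 248 network features and the 500 most common strings.

```python
from collections import Counter


def host_string_features(host_text, vocabulary):
    """Count how often each vocabulary string occurs in the host text."""
    counts = Counter(host_text.split())
    return [counts[term] for term in vocabulary]


def fuse(network_features, host_features):
    """Concatenate both views into one vector (assumed fusion scheme)."""
    return list(network_features) + list(host_features)


# Hypothetical stand-ins: 3 vocabulary strings instead of 500,
# and 4 network features instead of 248.
vocabulary = ["login", "port", "failed"]
host_text = "login failed port 445 login failed"
network_features = [0.2, 13.0, 1.0, 0.0]

host_vector = host_string_features(host_text, vocabulary)
fused_vector = fuse(network_features, host_vector)
```

A classifier trained on `fused_vector` sees both views at once, which is one plausible reading of how the hybrid method could outperform host-only and network-only classification.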