
    Comprehensive Literature Review on Machine Learning Structures for Web Spam Classification

    Various Web spam features and machine learning structures have been proposed in recent years to classify Web spam. The aim of this paper was to provide a comprehensive comparison of machine learning algorithms within the Web spam detection community. Several machine learning algorithms and ensemble meta-algorithms were evaluated as classifiers, using the area under the receiver operating characteristic curve (AUC) as the performance measure, on two publicly available datasets (WEBSPAM-UK2006 and WEBSPAM-UK2007). The results show that random forest combined with variations of AdaBoost achieved an AUC of 0.937 on WEBSPAM-UK2006 and 0.852 on WEBSPAM-UK2007.
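    As a rough illustration of the kind of pipeline described in this abstract (the paper's exact features and parameters are not given here), the following scikit-learn sketch boosts random-forest base learners with AdaBoost and scores the result by ROC AUC. The feature matrix, labels and all hyperparameters are placeholders, and the estimator argument assumes scikit-learn 1.2 or later (older versions call it base_estimator):

        # Minimal sketch: AdaBoost over random-forest base learners, scored by ROC AUC.
        # X and y stand in for precomputed Web spam feature vectors and labels.
        import numpy as np
        from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
        from sklearn.model_selection import cross_val_score

        rng = np.random.default_rng(0)
        X = rng.normal(size=(500, 40))        # placeholder feature vectors
        y = rng.integers(0, 2, size=500)      # placeholder spam / non-spam labels

        clf = AdaBoostClassifier(
            estimator=RandomForestClassifier(n_estimators=50),
            n_estimators=10,
        )
        auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
        print(f"mean ROC AUC: {auc:.3f}")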

    Review of the machine learning methods in the classification of phishing attack

    Computer networks have developed rapidly in recent years, as can be seen from the growing number of users around the world who connect their computers to the Internet. Internet access has become very important, whether for work or for access to social media accounts. However, with this widespread use of computer networks, the privacy of users is at risk, especially for those who do not install security software on their computers. This allows hackers to break in and carry out network attacks, which is very dangerous for Internet users because hackers can steal confidential information such as bank or social media login credentials. Phishing is one such attack. The goal of this study is to review the types of phishing attacks and the current methods used in preventing them. Based on the literature, machine learning is widely used to prevent phishing attacks, and several algorithms can be applied within this approach. This study focuses on an algorithm that has been thoroughly developed, and the methods for implementing this algorithm are discussed in detail.

    Deep learning to filter SMS spam

    The popularity of short message service (SMS) has been growing over the last decade. For businesses, these text messages are more effective than even emails: while 98% of mobile users read their SMS by the end of the day, about 80% of emails remain unopened. The popularity of SMS has also given rise to SMS spam, which refers to any irrelevant text messages delivered using mobile networks. They are severely annoying to users. Most existing research that has attempted to filter SMS spam has relied on manually identified features. Extending the current literature, this paper uses deep learning to classify spam and not-spam text messages. Specifically, Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) models were employed. The proposed models were based on text data only and extracted the feature set themselves. On a benchmark dataset consisting of 747 spam and 4,827 not-spam text messages, a remarkable accuracy of 99.44% was achieved.
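    The abstract does not give the architectures used; as a minimal sketch of an LSTM text classifier over raw SMS text (toy data, illustrative layer sizes, Keras assumed as the framework), one might write:

        # Minimal sketch of an LSTM classifier for SMS spam (1 = spam, 0 = not spam).
        import tensorflow as tf
        from tensorflow.keras import layers

        texts = ["Free entry in a weekly prize draw!", "See you at lunch?"]
        labels = [1, 0]

        vectorize = layers.TextVectorization(max_tokens=10_000, output_sequence_length=50)
        vectorize.adapt(texts)

        model = tf.keras.Sequential([
            vectorize,                              # raw text -> integer tokens
            layers.Embedding(input_dim=10_000, output_dim=64),
            layers.LSTM(64),                        # sequence model over token embeddings
            layers.Dense(1, activation="sigmoid"),  # spam probability
        ])
        model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
        model.fit(tf.constant(texts), tf.constant(labels), epochs=2, verbose=0)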

    Investigation into the Application of Personality Insights and Language Tone Analysis in Spam Classification

    Due to its persistence, spam remains one of the biggest problems facing users and suppliers of email communication services. Machine learning techniques have been very successful at preventing many spam mails from arriving in user mailboxes, yet spam still accounts for over 50% of all emails sent. Despite this relative success, the economic cost of spam has been estimated at as high as $50 billion in 2005 and more recently at $20 billion, so spam can still be considered a considerable problem. In essence, a spam email is a commercial communication trying to entice the receiver to take some positive action. This project takes the text from emails and creates personality insight and language tone scores through IBM Watson's Tone Analyzer API. Those scores are used to investigate whether the language used in emails can be transformed into useful features for correctly classifying them as spam or genuine emails. A range of machine learning techniques is applied in the course of this investigation. The experiment found that when only the personality insight and language tone features were used in the model, some promising results were obtained on one dataset; across all datasets, however, the results for this model were inconclusive. Furthermore, when these features were combined with a normalised term-frequency feature set, no real improvement in classification performance was observed.
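    The Tone Analyzer call itself is not shown here; assuming the tone and personality scores have already been obtained for each email, a minimal scikit-learn sketch of combining them with a normalised term-frequency feature set (toy data, illustrative score columns) could look like this:

        # Minimal sketch: normalised term frequencies plus extra numeric tone scores.
        from scipy.sparse import csr_matrix, hstack
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression

        emails = ["Win a free prize now!!!", "Meeting notes attached, see agenda."]
        labels = [1, 0]                          # 1 = spam, 0 = genuine
        tone_scores = [[0.9, 0.1], [0.2, 0.7]]   # e.g. precomputed [anger, analytical]

        tf = TfidfVectorizer(use_idf=False, norm="l2")   # normalised term frequencies
        X_text = tf.fit_transform(emails)
        X = hstack([X_text, csr_matrix(tone_scores)])    # append the tone columns

        clf = LogisticRegression().fit(X, labels)
        print(clf.predict(X))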

    Efficient feature reduction and classification methods

    The growing amount of data available in a wide variety of application areas significantly increases the effort required by many data-mining applications. High-dimensional data in particular (data described by many different attributes) can pose a major problem for many data-mining applications. Besides longer runtimes, this can lead to further complications for both supervised and unsupervised classification algorithms (e.g., poor classification accuracy or poor clustering properties). This creates a need for effective and efficient methods for dimensionality reduction. Feature selection (selecting a subset of the original attributes) and dimensionality reduction (transforming the original attributes into (linear) combinations of the original attributes) are two important approaches for reducing the dimensionality of data. Although many studies have addressed these methods in recent years, there are still many open questions in this research area, and the ever-increasing number of available and used attributes and features in many application domains continually raises new problems. The aim of this thesis is to analyse various questions in this area in detail and to develop improvements.

    In general, feature selection and dimensionality reduction methods should meet the following requirements: they should be efficient in terms of computational effort, and the resulting feature sets should represent the original data as compactly as possible. In many application areas it is also important to retain the interpretability of the original data. Finally, the dimensionality reduction process should not have a negative effect on classification accuracy; ideally, it should even improve it. Open problems in this area concern, among other things, the relationship between dimensionality reduction methods and the resulting classification accuracy, where both a compact representation of the data and high classification accuracy are desired. As already mentioned, the sheer amount of data also increases the computational effort, so fast and effective dimensionality reduction methods have to be developed, or existing methods improved, and the computational cost of the classification methods themselves should also be kept as low as possible. Furthermore, while the feature sets remain interpretable when feature selection methods are used for dimensionality reduction, in the case of dimensionality reduction the resulting feature sets are usually linear combinations of the original features, so it is difficult to assess how much information individual original features contribute.

    This thesis presents important contributions to the problems mentioned above. New, efficient initialisation variants for the dimensionality reduction method nonnegative matrix factorization (NMF) were developed which, compared to randomised initialisation and to state-of-the-art initialisation methods, lead to a faster reduction of the approximation error. These initialisation variants can also be combined with newly developed and very effective NMF-based classification algorithms. To further improve the runtime of NMF, several variants of NMF algorithms for multi-processor systems are presented which support both task and data parallelism and lead to a considerable reduction of the runtime of NMF. An effective improvement of the Matlab implementation of the ALS algorithm is also presented. Moreover, a technique from the field of information retrieval, latent semantic indexing, was successfully applied as a classification algorithm for email data. Finally, a detailed empirical study of the relationship between different feature reduction methods (feature selection and dimensionality reduction) and the resulting classification accuracy of different learning algorithms is presented. The strong influence of different dimensionality reduction methods on the resulting classification accuracy underlines that further investigations are necessary to analyse the complex interplay between dimensionality reduction and classification in detail.

    The sheer volume of data today and its expected growth over the next years are some of the key challenges in data mining and knowledge discovery applications. Besides the huge number of data samples that are collected and processed, the high dimensional nature of data arising in many applications causes the need to develop effective and efficient techniques that are able to deal with this massive amount of data. In addition to the significant increase in the demand for computational resources, those large datasets might also influence the quality of several data mining applications (especially if the number of features is very high compared to the number of samples). As the dimensionality of data increases, many types of data analysis and classification problems become significantly harder. This can lead to problems for both supervised and unsupervised learning. Dimensionality reduction and feature (subset) selection methods are two types of techniques for reducing the attribute space. While in feature selection a subset of the original attributes is extracted, dimensionality reduction in general produces linear combinations of the original attribute set. In both approaches, the goal is to select a low dimensional subset of the attribute space that covers most of the information of the original data. During the last years, feature selection and dimensionality reduction techniques have become a real prerequisite for data mining applications. There are several open questions in this research field, and due to the often increasing number of candidate features for various application areas (e.g., email filtering or drug classification/molecular modeling) new questions arise. In this thesis, we focus on some open research questions in this context, such as the relationship between feature reduction techniques and the resulting classification accuracy and the relationship between the variability captured in the linear combinations of dimensionality reduction techniques (e.g., PCA, SVD) and the accuracy of machine learning algorithms operating on them.
Another important goal is to better understand new techniques for dimensionality reduction, such as nonnegative matrix factorization (NMF), which can be applied for finding parts-based, linear representations of nonnegative data. This "sum-of-parts" representation is especially useful if the interpretability of the original data should be retained. Moreover, performance aspects of feature reduction algorithms are investigated. As data grow, implementations of feature selection and dimensionality reduction techniques for high-performance parallel and distributed computing environments become more and more important. In this thesis, we focus on two types of open research questions: methodological advances without any specific application context, and application-driven advances for a specific application context. Summarizing, new methodological contributions are the following: The utilization of nonnegative matrix factorization in the context of classification methods is investigated. In particular, it is of interest how the improved interpretability of NMF factors due to the non-negativity constraints (which is of central importance in various problem settings) can be exploited. Motivated by this problem context two new fast initialization techniques for NMF based on feature selection are introduced. It is shown how approximation accuracy can be increased and/or how computational effort can be reduced compared to standard randomized seeding of the NMF and to state-of-the-art initialization strategies suggested earlier. For example, for a given number of iterations and a required approximation error a speedup of 3.6 compared to standard initialization, and a speedup of 3.4 compared to state-of-the-art initialization strategies could be achieved. Beyond that, novel classification methods based on the NMF are proposed and investigated. We can show that they are not only competitive in terms of classification accuracy with state-of-the-art classifiers, but also provide important advantages in terms of computational effort (especially for low-rank approximations). Moreover, parallelization and distributed execution of NMF is investigated. Several algorithmic variants for efficiently computing NMF on multi-core systems are studied and compared to each other. In particular, several approaches for exploiting task and/or data-parallelism in NMF are studied. We show that for some scenarios new algorithmic variants clearly outperform existing implementations. Last, but not least, a computationally very efficient adaptation of the implementation of the ALS algorithm in Matlab 2009a is investigated. This variant reduces the runtime significantly (in some settings by a factor of 8) and also provides several possibilities to be executed concurrently. In addition to purely methodological questions, we also address questions arising in the adaptation of feature selection and classification methods to two specific application problems: email classification and in silico screening for drug discovery. Different research challenges arise in the contexts of these different application areas, such as the dynamic nature of data for email classification problems, or the imbalance in the number of available samples of different classes for drug discovery problems. Application-driven advances of this thesis comprise the adaptation and application of latent semantic indexing (LSI) to the task of email filtering.
Experimental results show that LSI achieves significantly better classification results than the widespread de-facto standard method for this special application context. In the context of drug discovery problems, several groups of well discriminating descriptors could be identified by utilizing the "sum-of-parts" representation of NMF. The number of important descriptors could be further increased when applying sparseness constraints on the NMF factors.
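    As a loose illustration of using NMF as a feature-reduction step before classification (the thesis' own initialisation strategies, datasets and classifiers are not reproduced; the toy corpus and parameters below are placeholders), a scikit-learn sketch might be:

        # Minimal sketch: nonnegative low-rank NMF features feeding a classifier.
        from sklearn.decomposition import NMF
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.pipeline import make_pipeline

        docs = ["cheap meds online buy now", "project meeting moved to friday",
                "win cash prize claim today", "attached is the quarterly report"]
        labels = [1, 0, 1, 0]                    # toy spam / ham labels

        pipe = make_pipeline(
            TfidfVectorizer(),
            NMF(n_components=2, init="nndsvd", max_iter=500),  # parts-based low-rank features
            LogisticRegression(),
        )
        pipe.fit(docs, labels)
        print(pipe.predict(docs))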

    Human Activity Recognition for Physical Rehabilitation

    The recognition of human activity is a challenging topic for machine learning. We present an analysis of Support Vector Machines (SVM) and Random Forests (RF) in their ability to accurately classify Kinect kinematic activities. Twenty participants were captured using the Microsoft Kinect performing ten physical rehabilitation activities. We extracted the kinematic location, velocity and energy of the skeletal joints at each frame of the activity to form a feature vector. Principal Component Analysis (PCA) was applied as a pre-processing step to reduce dimensionality and identify significant features amongst activity classes. SVM and RF were then trained on the PCA feature space to assess classification performance, with the dataset size increased incrementally. We analyse the classification accuracy, model training time and classification time quantitatively at each incremental increase. The experimental results demonstrate that RF outperformed SVM in classification rate for six out of the ten activities. Although SVM has an advantage in training time, RF would be more suited to real-time activity classification due to its low classification time and high classification accuracy when using eight to ten participants in the training set. © 2013 IEEE
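    A minimal sketch of this PCA-then-classify comparison (with synthetic stand-ins for the per-frame joint location, velocity and energy features and the ten activity labels) could be:

        # Minimal sketch: PCA feature space, then SVM vs. Random Forest comparison.
        import numpy as np
        from sklearn.decomposition import PCA
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import cross_val_score
        from sklearn.pipeline import make_pipeline
        from sklearn.svm import SVC

        rng = np.random.default_rng(0)
        X = rng.normal(size=(200, 60))        # placeholder Kinect joint features
        y = rng.integers(0, 10, size=200)     # ten activity classes

        for name, clf in [("SVM", SVC(kernel="rbf")),
                          ("RF", RandomForestClassifier(n_estimators=100))]:
            pipe = make_pipeline(PCA(n_components=20), clf)
            acc = cross_val_score(pipe, X, y, cv=5).mean()
            print(f"{name}: mean accuracy {acc:.2f}")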

    Modeling Attrition in Organizations from Email Communication

    Modeling people's online behavior in relation to their real-world social context is an interesting and important research problem. In this paper, we present our preliminary study of attrition behavior in real-world organizations based on two online datasets: a dataset from a small startup (40+ users) and a dataset from one large US company (3,600+ users). The small startup dataset was collected using our privacy-preserving data logging tool, which removes personally identifiable information from content data and extracts only aggregated statistics such as word frequency counts and sentiment features. These privacy-preserving measures enabled us to recruit participants to support this study. Correlation analysis over the startup dataset shows that statistically there is often a change point in people's online behavior, and the data exhibit weak trends that may be a manifestation of real-world attrition. The same findings are also verified in the large company dataset. Furthermore, we trained a classifier to predict real-world attrition with a moderate accuracy of 60-65% on the large company dataset. Given the incomplete and noisy nature of the data, this accuracy is encouraging.
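    The paper's own analysis is not reproduced here, but as a toy sketch of the change-point idea, one can locate the single split that best separates a user's weekly activity series into two regimes (synthetic counts, least-squares criterion):

        # Minimal sketch: single change point in a weekly activity series.
        import numpy as np

        activity = np.array([30, 28, 33, 31, 29, 32, 18, 15, 12, 14, 10, 9], dtype=float)

        def best_change_point(series):
            """Return the split index minimising within-segment squared error."""
            costs = [((series[:t] - series[:t].mean()) ** 2).sum()
                     + ((series[t:] - series[t:].mean()) ** 2).sum()
                     for t in range(1, len(series))]
            return int(np.argmin(costs)) + 1

        t = best_change_point(activity)
        print(f"change point at week {t}: mean drops from "
              f"{activity[:t].mean():.1f} to {activity[t:].mean():.1f}")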

    Advanced Machine Learning Techniques and Meta-Heuristic Optimization for the Detection of Masquerading Attacks in Social Networks

    According to the report published by the online protection firm Iovation in 2012, cyber fraud ranged from 1 percent of Internet transactions in North America to 7 percent in Africa, most of it involving credit card fraud, identity theft, and account takeover or hacking attempts. This kind of crime is still growing due to the advantages offered by a non face-to-face channel in which an increasing number of unsuspecting victims divulge sensitive information. Interpol classifies these illegal activities into three types:
    • Attacks against computer hardware and software.
    • Financial crimes and corruption.
    • Abuse, in the form of grooming or “sexploitation”.
    Most research efforts have focused on the target of the crime, developing different strategies depending on the case at hand. Thus, for the well-known phishing attacks, stored blacklists or crime signals in the text are employed, eventually producing ad-hoc detectors that are hardly transferable to other scenarios even if the background is widely shared. Identity theft or masquerading can be described as a criminal activity oriented towards the misuse of stolen credentials to obtain goods or services by deception. On March 4, 2005, a million pieces of personal and sensitive information, such as credit card and social security numbers, were collected by white-hat hackers at Seattle University who simply surfed the Web for less than 60 minutes using the Google search engine. They thereby demonstrated the vulnerability and lack of protection exposed by a mere handful of sophisticated search terms typed into the engine, whose large data warehouse still revealed temporarily cached data from company and government websites. As mentioned above, platforms that connect distant people, in which the interaction is undirected, offer a point of forcible entry for unauthorized third parties who impersonate the legitimate user in an attempt to go unnoticed while pursuing malicious, and not necessarily economic, interests. In fact, the last point in the list above regarding abuse has become a major and terrible risk, along with bullying; both, by means of threats, harassment or even self-incrimination, are liable to drive someone to suicide, depression or helplessness. California Penal Code Section 528.5 states: “Notwithstanding any other provision of law, any person who knowingly and without consent credibly impersonates another actual person through or on an Internet Web site or by other electronic means for purposes of harming, intimidating, threatening, or defrauding another person is guilty of a public offense punishable pursuant to subdivision [...]”. Therefore, impersonation consists of any criminal activity in which someone assumes a false identity and acts as his or her assumed character with intent to obtain a pecuniary benefit or cause some harm. User profiling, in turn, is the process of harvesting user information in order to construct a rich template of the most advantageous attributes in the field at hand and for specific purposes. User profiling is often employed as a mechanism for recommending items or useful information that the client has not yet considered. Nevertheless, the derived user tendencies or preferences can also be exploited to define the user's inherent behavior and to address the problem of impersonation by detecting outliers or strange deviations likely to signal a potential attack.
    This dissertation elaborates on impersonation attacks from a profiling perspective, eventually developing a two-stage environment that embraces two levels of privacy intrusion and provides the following contributions:
    • The inference of behavioral patterns from connection-time traces, aiming to avoid the usurpation of more confidential information. Compared to previous approaches, this procedure refrains from impinging on user privacy by accessing message content, since it relies only on time statistics of the user sessions rather than on their content.
    • The application and subsequent discussion of two algorithms selected to address the previous point:
    – A commonly employed supervised algorithm run as a binary classifier, which in turn forced us to devise a method to deal with the absence of labeled instances representing an identity theft.
    – A meta-heuristic algorithm that searches for the most suitable parameters to arrange the instances in a high-dimensional space into properly delimited clusters, so that an unsupervised clustering algorithm can finally be applied.
    • The analysis of message content, which encroaches on more private information but eases user identification by mining discriminative features with Natural Language Processing (NLP) techniques. As a consequence, a new feature extraction algorithm based on linguistic theories was developed, motivated by the massive number of features typically gathered from text.
    In summary, this dissertation aims to go beyond the typical, ad-hoc approaches adopted by previous identity theft and authorship attribution research. Specifically, it proposes tailored solutions to this particular and extensively studied paradigm, with the aim of introducing a generic approach from a profiling view that is not tightly bound to a single application field. In addition, technical contributions have been made in the course of the solution formulation, intended to optimize familiar methods for better versatility towards the problem at hand. Overall, this thesis establishes an encouraging research basis towards unveiling subtle impersonation attacks in social networks by means of intelligent learning techniques.
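    As a small illustration of the first, least intrusive level (profiling connection-time statistics only, without message content), the sketch below trains a one-class model on a user's ordinary sessions and flags deviating ones; the session features and the choice of a one-class SVM are illustrative assumptions, not the thesis' exact method:

        # Minimal sketch: flag anomalous sessions from connection-time statistics only.
        import numpy as np
        from sklearn.preprocessing import StandardScaler
        from sklearn.svm import OneClassSVM

        rng = np.random.default_rng(0)
        # Legitimate user: daytime logins, ~30-minute sessions, ~8-hour gaps.
        normal = np.column_stack([rng.normal(14, 2, 200),    # start hour
                                  rng.normal(30, 8, 200),    # duration (minutes)
                                  rng.normal(8, 2, 200)])    # gap since last session (hours)
        # Suspicious sessions: late-night logins with unusual durations.
        suspect = np.array([[3.0, 120.0, 0.5], [4.0, 90.0, 1.0]])

        scaler = StandardScaler().fit(normal)
        model = OneClassSVM(nu=0.05, gamma="scale").fit(scaler.transform(normal))
        print(model.predict(scaler.transform(suspect)))      # -1 marks an outlier session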