    Moving Targets: Addressing Concept Drift in Supervised Models for Hacker Communication Detection

    Abstract: In this paper, we investigate the presence of concept drift in machine learning models for the detection of hacker communications posted in social media and hacker forums. The supervised models in this experiment are analysed in terms of performance over time across different sources of data (Surface web and Deep web). Additionally, to simulate real-world situations, these models are evaluated using time-stamped messages from our datasets, posted over time on social media platforms. We have found that models applied to hacker forums (deep web) present an accuracy deterioration in less than a 1-year period, whereas models applied to Twitter (surface web) have not shown a decrease in accuracy for the same period of time. The problem is alleviated by retraining the model with new instances (and applying weights) in order to reduce the effects of concept drift. While our results indicated that performance degradation due to concept drift is avoided by 50% relabelling, which is challenging in real-world scenarios, our work paves the way to more targeted concept drift solutions that reduce the re-training tasks. Index Terms: Cyber Security, Machine Learning, Concept Drift, Hacker Communication, Software Vulnerabilities

    IoT Data Analytics in Dynamic Environments: From An Automated Machine Learning Perspective

    With the wide spread of sensors and smart devices in recent years, the data generation speed of the Internet of Things (IoT) systems has increased dramatically. In IoT systems, massive volumes of data must be processed, transformed, and analyzed on a frequent basis to enable various IoT services and functionalities. Machine Learning (ML) approaches have shown their capacity for IoT data analytics. However, applying ML models to IoT data analytics tasks still faces many difficulties and challenges, specifically, effective model selection, design/tuning, and updating, which have brought massive demand for experienced data scientists. Additionally, the dynamic nature of IoT data may introduce concept drift issues, causing model performance degradation. To reduce human efforts, Automated Machine Learning (AutoML) has become a popular field that aims to automatically select, construct, tune, and update machine learning models to achieve the best performance on specified tasks. In this paper, we conduct a review of existing methods in the model selection, tuning, and updating procedures in the area of AutoML in order to identify and summarize the optimal solutions for every step of applying ML algorithms to IoT data analytics. To justify our findings and help industrial users and researchers better implement AutoML approaches, a case study of applying AutoML to IoT anomaly detection problems is conducted in this work. Lastly, we discuss and classify the challenges and research directions for this domain. Comment: Published in Engineering Applications of Artificial Intelligence (Elsevier, IF:7.8); Code/An AutoML tutorial is available at Github link: https://github.com/Western-OC2-Lab/AutoML-Implementation-for-Static-and-Dynamic-Data-Analytic

    Towards handling temporal dependence in concept drift streams.

    Modern technological advancements have led to the production of an incomprehensible amount of data from a wide array of devices. A constant supply of new data provides an invaluable opportunity for access to qualitative and quantitative insights. Organisations recognise that, in today's modern era, data provides a means of mitigating risk and loss whilst maximising efficiency and profit. However, processing this data is not without its challenges. Much of this data is produced in an online environment. Real-time stream data is unbounded in size, variety and velocity. Data may arrive complete or with missing attributes, and data availability and persistence are limited to a small window of time. Classification methods and techniques that process offline data are not applicable to online data streams. Instead, new online classification methods have been developed. Research concerning the problematic and prevalent issue of concept drift has produced a considerable number of methods that allow online classifiers to adapt to changes in the stream distribution. However, recent research suggests that the presence of temporal dependence can cause misleading evaluation when accuracy is used as the core metric. This thesis investigates temporal dependence and its negative effects upon the classification of concept drift data. First, this thesis proposes a novel method for coping with temporal dependence during the classification of real-time data streams, where concept drift is present. Results indicate that a statistical based, selective resetting approach can reduce the impact of temporal dependence in concept drift streams without significant loss in predictive accuracy. Secondly, a new ensemble based method, KTUE, that adopts the Kappa-Temporal statistic for vote weighting is suggested. Results show that this method is capable of outperforming some state-of-the-art ensemble methods in both temporally dependent and non-temporally dependent environments.
Finally, this research proposes a novel algorithm for the simulation of temporally dependent concept drift data, which aims to help address the lack of established datasets available for evaluation. Experimental results show that temporal dependence can be injected into fabricated data streams using existing generation methods.
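
To illustrate why plain accuracy can mislead on temporally dependent streams, the sketch below computes the Kappa-Temporal statistic in its commonly used form, kappa = (p0 - pe) / (1 - pe), where p0 is the classifier's accuracy and pe is the accuracy of a "no-change" baseline that always predicts the previous true label. This is only a minimal illustration of the statistic itself; the function name `kappa_temporal` is hypothetical, and the abstract does not specify how KTUE turns the statistic into vote weights, so that part is not shown.

```python
import math
from typing import Hashable, Sequence


def kappa_temporal(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Kappa-Temporal statistic: (p0 - pe) / (1 - pe).

    p0 is the accuracy of the evaluated classifier and pe is the accuracy
    of a no-change baseline that predicts the previous true label.
    The first instance is skipped because the baseline has no previous label.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) < 2:
        raise ValueError("at least two instances are required")

    n = len(y_true) - 1
    # Accuracy of the evaluated classifier on instances 1..n.
    p0 = sum(t == p for t, p in zip(y_true[1:], y_pred[1:])) / n
    # Accuracy of the no-change baseline on the same instances.
    pe = sum(y_true[i] == y_true[i - 1] for i in range(1, len(y_true))) / n

    if pe == 1.0:
        # The labels never change, so the statistic is undefined.
        return math.nan
    return (p0 - pe) / (1.0 - pe)


# Illustrative labels (made up for this example, not taken from the thesis).
labels = [0, 0, 0, 1, 1, 1, 0, 0]
no_change_predictions = [labels[0]] + labels[:-1]
always_zero_predictions = [0] * len(labels)
```

A value near 0 means the classifier is no better than repeating the previous label, and a negative value means it is worse, even when its raw accuracy looks acceptable.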

    Concept drift from 1980 to 2020: a comprehensive bibliometric analysis with future research insight

    In nonstationary environments, high-dimensional data streams are generated unceasingly, and the underlying distribution of the training and target data may change over time. These changes are labeled as concept drift in the literature. Learning from evolving data streams demands adaptive or evolving approaches to handle concept drift, which is a new research topic. In this effort, a wide-ranging comparative analysis of concept drift is presented to highlight state-of-the-art approaches, embracing the last four decades, namely from 1980 to 2020. Considering the scope and discipline, the core collection of the Web of Science database is regarded as the basis of this study, and 1,564 publications related to concept drift are retrieved. As a result of the classification and feature analysis of valid literature data, the bibliometric indicators are revealed at the levels of countries/regions, institutions, and authors. The overall analyses of publications, citations, and cooperation networks unveil not only the highly authoritative publications but also the most prolific institutions, influential authors, dynamic networks, etc. Furthermore, deep analyses including text mining, such as burst detection analysis, co-occurrence analysis, timeline view analysis, and bibliographic coupling analysis, are conducted to disclose the current challenges and future research directions. This paper serves as a reference for further research on concept drift, highlighting emerging/trend topics and possible research directions with several graphs, visualized using the VOSviewer and CiteSpace software.

    Detection of Software Vulnerability Communication in Expert Social Media Channels: A Data-driven Approach

    Conceptually, a vulnerability is a flaw or weakness in a system’s design, implementation, or operation and management that could be exploited to violate the system’s security policy. Some of these flaws can go undetected and be exploited for long periods of time after software release. Although some software providers are making efforts to avoid this situation, inevitably, users are still exposed to vulnerabilities that allow criminal hackers to take advantage. These vulnerabilities are constantly discussed in specialised forums on social media. Therefore, from a cyber security standpoint, the information found in these places can be used for countermeasures against malicious exploitation of software. However, manual inspection of the vast quantity of shared content in social media is impractical. For this reason, in this thesis, we analyse the real applicability of supervised classification models to automatically detect software vulnerability communication in expert social media channels. We cover the following three principal aspects: Firstly, we investigate the applicability of classification models in a range of 5 different datasets collected from 3 Internet Domains: Dark Web, Deep Web and Surface Web. Since supervised models require labelled data, we have provided a systematic labelling process using multiple annotators to guarantee accurate labels to carry out experiments. Using these datasets, we have investigated the classification models with different combinations of learning-based algorithms and traditional feature representations. Also, by oversampling the positive instances, we have achieved an increase of 5% in Positive Recall (on average) in these models. On top of that, we have applied Feature Reduction, Feature Extraction and Feature Selection techniques, which provided a reduction in the dimensionality of these models without damaging the accuracy, thus providing computationally efficient models.
Furthermore, in addition to traditional feature representations, we have investigated the performance of robust language models, such as Word Embedding (WEMB) and Sentence Embedding (SEMB), on the accuracy of classification models. Regarding WEMB, our experiment has shown that this model trained with a small security-vocabulary dataset provides comparable results with WEMB trained on a very large general-vocabulary dataset. Regarding the SEMB model, our experiment has shown that its use outperforms the WEMB model in detecting vulnerability communication, recording 8% of Avg. Class Accuracy and 74% of Positive Recall. In addition, we investigate two Deep Learning algorithms as classifiers, text CNN (Convolutional Neural Network) and RNN (Recurrent Neural Network)-based algorithms, which have improved our model, resulting in the best overall performance for our task.
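
The oversampling of positive instances mentioned in this abstract can be pictured with the following minimal sketch of random oversampling, which duplicates randomly chosen positive examples until both classes are the same size. The abstract does not say which oversampling scheme or target ratio the thesis used, so the balancing rule, the function name `oversample_positive`, and the toy data are all assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple


def oversample_positive(
    samples: Sequence[str],
    labels: Sequence[int],
    positive: int = 1,
    seed: int = 0,
) -> Tuple[List[str], List[int]]:
    """Duplicate random positive instances until classes are balanced.

    Assumption: "balanced" means as many positives as non-positives.
    The original instances are kept, in order, at the start of the output.
    """
    rng = random.Random(seed)
    positive_idx = [i for i, y in enumerate(labels) if y == positive]
    negative_count = len(labels) - len(positive_idx)

    # Nothing to do if there are no positives or they are not the minority.
    if not positive_idx or len(positive_idx) >= negative_count:
        return list(samples), list(labels)

    # Draw (with replacement) the number of extra positives needed.
    extra = [rng.choice(positive_idx) for _ in range(negative_count - len(positive_idx))]
    indices = list(range(len(labels))) + extra
    return [samples[i] for i in indices], [labels[i] for i in indices]


# Toy data: 8 negative posts and 2 positive posts (made up for illustration).
posts = [f"post {i}" for i in range(10)]
post_labels = [0] * 8 + [1] * 2
balanced_posts, balanced_labels = oversample_positive(posts, post_labels)
```

Oversampling should be applied to the training split only, so that duplicated positives never leak into the evaluation data.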

    Evaluating k-NN in the Classification of Data Streams with Concept Drift

    Data streams are often defined as large amounts of data flowing continuously at high speed. Moreover, these data are likely subject to changes in data distribution, known as concept drift. For these reasons, learning from streams is often online and under restrictions of memory consumption and run-time. Although many classification algorithms exist, most of the works published in the area use Naive Bayes (NB) and Hoeffding Trees (HT) as base learners in their experiments. This article proposes an in-depth evaluation of k-Nearest Neighbors (k-NN) as a candidate for classifying data streams subjected to concept drift. It also analyses the complexity in time and the two main parameters of k-NN, i.e., the number of nearest neighbors used for predictions (k), and window size (w). We compare different parameter values for k-NN and contrast it to NB and HT both with and without a drift detector (RDDM) in many datasets. We formulated and answered 10 research questions which led to the conclusion that k-NN is a worthy candidate for data stream classification, especially when the run-time constraint is not too restrictive. Comment: 25 pages, 10 tables, 7 figures + 30 pages appendix
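
The roles of the two parameters named in this abstract, k and w, can be seen in a minimal sliding-window k-NN sketch: only the w most recent labelled instances are stored, and a prediction is the majority label among the k nearest of them. This is an illustrative sketch, not the article's implementation; the class name `SlidingWindowKNN`, the Euclidean distance, and the toy stream are assumptions, and no drift detector such as RDDM is included.

```python
import math
from collections import Counter, deque
from typing import Deque, Hashable, Optional, Sequence, Tuple


class SlidingWindowKNN:
    """k-NN over the w most recent labelled instances of a stream."""

    def __init__(self, k: int = 3, w: int = 50) -> None:
        self.k = k
        # deque(maxlen=w) silently drops the oldest instance when full,
        # which is what lets the model forget an outdated concept.
        self.window: Deque[Tuple[Tuple[float, ...], Hashable]] = deque(maxlen=w)

    def learn(self, x: Sequence[float], y: Hashable) -> None:
        """Store one labelled instance in the window."""
        self.window.append((tuple(x), y))

    def predict(self, x: Sequence[float]) -> Optional[Hashable]:
        """Return the majority label of the k nearest stored instances."""
        if not self.window:
            return None
        nearest = sorted(self.window, key=lambda item: math.dist(item[0], x))[: self.k]
        return Counter(label for _, label in nearest).most_common(1)[0][0]


# Toy stream with an abrupt drift (made up for illustration).
model = SlidingWindowKNN(k=3, w=10)
points = [(i / 10,) for i in range(10)]

# Concept A: label is 1 when x >= 0.5.
for p in points:
    model.learn(p, int(p[0] >= 0.5))
before_drift = model.predict((0.9,))

# Concept B: the labels are inverted; ten new instances flush the window.
for p in points:
    model.learn(p, int(p[0] < 0.5))
after_drift = model.predict((0.9,))
```

Each prediction scans the whole window, so its cost grows with w; this is the run-time trade-off the abstract points to when it notes that k-NN suits settings where the run-time constraint is not too restrictive.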

    Optimized and Automated Machine Learning Techniques Towards IoT Data Analytics and Cybersecurity

    Internet of Things (IoT) systems have emerged as a prevalent technology in our daily lives. With the wide spread of sensors and smart devices in recent years, the data generation volume and speed of IoT systems have increased dramatically. In most IoT systems, massive volumes of data must be processed, transformed, and analyzed on a frequent basis to enable various IoT services and functionalities. Machine Learning (ML) approaches have shown their capacity for IoT data analytics. However, applying ML models to IoT data analytics tasks still faces many difficulties and challenges. The first challenge is to process large amounts of dynamic IoT data to make accurate and informed decisions. The second challenge is to automate and optimize the data analytics process. The third challenge is to protect IoT devices and systems against various cyber threats and attacks. To address the IoT data analytics challenges, this thesis proposes various ML-based frameworks and data analytics approaches in several applications. Specifically, the first part of the thesis provides a comprehensive review of applying Automated Machine Learning (AutoML) techniques to IoT data analytics tasks. It discusses all procedures of the general ML pipeline. The second part of the thesis proposes several novel supervised ML-based Intrusion Detection Systems (IDSs) to improve the security of the Internet of Vehicles (IoV) systems and connected vehicles. Optimization techniques are used to obtain optimized ML models with high attack detection accuracy. The third part of the thesis develops unsupervised ML algorithms to identify network anomalies and malicious network entities (e.g., attacker IPs, compromised machines, and polluted files/content) to protect Content Delivery Networks (CDNs) from service targeting attacks, including distributed denial of service and cache pollution attacks. The proposed framework is evaluated on real-world CDN access log data to illustrate its effectiveness.
The fourth part of the thesis proposes adaptive online learning algorithms for addressing concept drift issues (i.e., data distribution changes) and effectively handling dynamic IoT data streams in order to provide reliable IoT services. The proposed drift adaptive learning methods can effectively adapt to data distribution changes and avoid degradation of data analytics model performance.