Search CORE

21 research outputs found

Moving Targets: Addressing Concept Drift in Supervised Models for Hacker Communication Detection

Author: Keegan Brian
McKeever Susan
Queiroz Andrei
Publication venue: Dublin Institute of Technology
Publication date: 30/06/2020

Abstract: In this paper, we investigate the presence of concept drift in machine learning models for detecting hacker communications posted on social media and hacker forums. The supervised models in this experiment are analysed in terms of performance over time across different data sources (Surface web and Deep web). Additionally, to simulate real-world situations, these models are evaluated using time-stamped messages from our datasets, posted over time on social media platforms. We have found that models applied to hacker forums (deep web) show a deterioration in accuracy in less than one year, whereas models applied to Twitter (surface web) show no decrease in accuracy over the same period. The problem is alleviated by retraining the model with new instances (and applying weights) in order to reduce the effects of concept drift. While our results indicate that performance degradation due to concept drift is avoided by 50% relabelling, which is challenging in real-world scenarios, our work paves the way to more targeted concept drift solutions that reduce the re-training effort. Index Terms: Cyber Security, Machine Learning, Concept Drift, Hacker Communication, Software Vulnerabilities

Crossref

Arrow@TUDublin

Predictive Analytics for Spatio-Temporal Data

Author: Mariana Rafaela Figueiredo Ferreira de Oliveira
Publication date: 13/12/2021

Repositório Aberto da Universidade do Porto

Recommended from our members

Online semi-supervised learning in non-stationary environments

Author: Idrees Mobin M.
Publication date: 31/01/2024

Existing Data Stream Mining (DSM) algorithms assume the availability of labelled and balanced data, immediately or after some delay, to extract worthwhile knowledge from the continuous and rapid data streams. However, in many real-world applications such as Robotics, Weather Monitoring, Fraud Detection Systems, Cyber Security, and Computer Network Traffic Flow, an enormous amount of high-speed data is generated by Internet of Things sensors and real-time data on the Internet. Manual labelling of these data streams is not practical due to time consumption and the need for domain expertise. Another challenge is learning under Non-Stationary Environments (NSEs), which occurs due to changes in the data distributions in a set of input variables and/or class labels. The problem of Extreme Verification Latency (EVL) under NSEs is referred to as Initially Labelled Non-Stationary Environment (ILNSE). This is a challenging task because the learning algorithms have no access to the true class labels directly when the concept evolves. Several approaches exist that deal with NSE and EVL in isolation. However, few algorithms address both issues simultaneously. This research directly responds to ILNSE’s challenge in proposing two novel algorithms “Predictor for Streaming Data with Scarce Labels” (PSDSL) and Heterogeneous Dynamic Weighted Majority (HDWM) classifier. PSDSL is an Online Semi-Supervised Learning (OSSL) method for real-time DSM and is closely related to label scarcity issues in online machine learning. The key capabilities of PSDSL include learning from a small amount of labelled data in an incremental or online manner and being available to predict at any time. To achieve this, PSDSL utilises both labelled and unlabelled data to train the prediction models, meaning it continuously learns from incoming data and updates the model as new labelled or unlabelled data becomes available over time. 
Furthermore, it can predict under NSE conditions when class labels are scarce. PSDSL is built on top of the HDWM classifier, which preserves the diversity of the classifiers. PSDSL and HDWM can intelligently switch and adapt to the conditions. PSDSL switches between the learning states of self-learning, micro-clustering and CGC, choosing whichever approach is beneficial given the characteristics of the data stream. HDWM uses “seed” learners of different types in an ensemble to maintain its diversity. An ensemble is simply a combination of predictive models grouped to improve on the predictive performance of a single classifier. PSDSL is empirically evaluated against COMPOSE, LEVELIW, SCARGC and MClassification on benchmarks, NSE datasets as well as Massive Online Analysis (MOA) data streams and real-world datasets. The results showed that PSDSL performed significantly better than existing approaches on most real-time data streams, including randomised data instances. PSDSL performed significantly better than ‘Static’, i.e. a classifier that is not updated after it is trained with the first examples in the data streams. When applied to MOA-generated data streams, PSDSL ranked highest (1.5) and thus performed significantly better than SCARGC, while SCARGC performed the same as the Static. PSDSL achieved better average prediction accuracies in a short time than SCARGC. The HDWM algorithm is evaluated on artificial and real-world data streams against existing well-known approaches such as the heterogeneous WMA and the homogeneous Dynamic Weighted Majority (DWM) algorithm. The results showed that HDWM performed significantly better than WMA and DWM. Also, when recurring concept drifts were present, the predictive performance of HDWM showed an improvement over DWM. 
In both drift and real-world streams, significance tests and post hoc comparisons found significant differences between algorithms: HDWM performed significantly better than DWM and WMA when applied to MOA data streams and four real-world datasets (Electric, Spam, Sensor and Forest cover). The seeding mechanism and dynamic inclusion of new base learners in the HDWM algorithm benefit from both forgetting and retaining models. The algorithm also provides the freedom to select the optimal base classifier for its ensemble depending on the problem. A new approach, Envelope-Clustering, is introduced to resolve cluster overlap conflicts during the cluster labelling process. In this process, PSDSL transforms the centroid information of micro-clusters into micro-instances and generates new clusters called Envelopes. The nearest envelope clusters assist the conflicted micro-clusters and successfully guide the cluster labelling process after concept drifts in the absence of true class labels. PSDSL has been evaluated on the real-world problem of ‘keystroke dynamics’, and the results show that PSDSL achieved higher prediction accuracy (85.3%) than SCARGC (81.6%), while the performance of the Static classifier (49.0%) degrades significantly due to changes in the users’ typing patterns. Furthermore, the predictive accuracy of SCARGC fluctuates widely (from 41.1% to 81.6%) depending on the value of the parameter ‘k’ (number of clusters), while PSDSL automatically determines the best value for this parameter.
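The weighted-majority idea that the abstract builds on can be illustrated with a minimal sketch. This is not the HDWM algorithm itself, whose details are not given here; it only shows a generic weighted vote with a multiplicative penalty for members that predict wrongly. The function names and the penalty factor `beta` are assumptions chosen for illustration.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Return the label with the largest total weight among ensemble members."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


def penalise(predictions, weights, y_true, beta=0.5):
    """Multiply the weight of every wrong member by beta, then rescale.

    beta is a hypothetical penalty factor in (0, 1); weights are rescaled so
    that the largest weight is 1.0, which keeps them from shrinking to zero.
    """
    updated = [w * beta if p != y_true else w
               for p, w in zip(predictions, weights)]
    top = max(updated)
    return [w / top for w in updated]


# Three hypothetical base learners of different types vote on one instance.
preds = ["spam", "ham", "ham"]
weights = [1.0, 0.3, 0.3]
before = weighted_vote(preds, weights)           # dominated by learner 0
weights = penalise(preds, weights, y_true="ham") # learner 0 was wrong
after = weighted_vote(preds, weights)            # the other two now outvote it
```

A single mistake by a dominant member is enough here to shift the vote, which is the mechanism that lets such ensembles follow a changing concept.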

Central Archive at the University of Reading

IoT Data Analytics in Dynamic Environments: From An Automated Machine Learning Perspective

Author: Shami Abdallah
Yang Li
Publication venue: Elsevier BV
Publication date: 16/09/2022

With the wide spread of sensors and smart devices in recent years, the data generation speed of Internet of Things (IoT) systems has increased dramatically. In IoT systems, massive volumes of data must be processed, transformed, and analyzed on a frequent basis to enable various IoT services and functionalities. Machine Learning (ML) approaches have shown their capacity for IoT data analytics. However, applying ML models to IoT data analytics tasks still faces many difficulties and challenges, specifically effective model selection, design/tuning, and updating, which have created massive demand for experienced data scientists. Additionally, the dynamic nature of IoT data may introduce concept drift issues, causing model performance degradation. To reduce human effort, Automated Machine Learning (AutoML) has become a popular field that aims to automatically select, construct, tune, and update machine learning models to achieve the best performance on specified tasks. In this paper, we conduct a review of existing methods in the model selection, tuning, and updating procedures in the area of AutoML in order to identify and summarize the optimal solutions for every step of applying ML algorithms to IoT data analytics. To justify our findings and help industrial users and researchers better implement AutoML approaches, a case study of applying AutoML to IoT anomaly detection problems is conducted in this work. Lastly, we discuss and classify the challenges and research directions for this domain. Comment: Published in Engineering Applications of Artificial Intelligence (Elsevier, IF:7.8); Code/An AutoML tutorial is available at GitHub link: https://github.com/Western-OC2-Lab/AutoML-Implementation-for-Static-and-Dynamic-Data-Analytic

arXiv.org e-Print Archive

Towards handling temporal dependence in concept drift streams.

Author: Wares Scott Brian
Publication date: 31/05/2023

Modern technological advancements have led to the production of an incomprehensible amount of data from a wide array of devices. A constant supply of new data provides an invaluable opportunity for access to qualitative and quantitative insights. Organisations recognise that, in the modern era, data provides a means of mitigating risk and loss whilst maximising efficiency and profit. However, processing this data is not without its challenges. Much of this data is produced in an online environment. Real-time stream data is unbounded in size, variety and velocity. Data may arrive complete or with missing attributes, and data availability and persistence are limited to a small window of time. Classification methods and techniques that process offline data are not applicable to online data streams. Instead, new online classification methods have been developed. Research concerning the problematic and prevalent issue of concept drift has produced a considerable number of methods that allow online classifiers to adapt to changes in the stream distribution. However, recent research suggests that the presence of temporal dependence can cause misleading evaluation when accuracy is used as the core metric. This thesis investigates temporal dependence and its negative effects upon the classification of concept drift data. First, this thesis proposes a novel method for coping with temporal dependence during the classification of real-time data streams where concept drift is present. Results indicate that a statistics-based, selective resetting approach can reduce the impact of temporal dependence in concept drift streams without significant loss in predictive accuracy. Secondly, a new ensemble-based method, KTUE, that adopts the Kappa-Temporal statistic for vote weighting is suggested. Results show that this method is capable of outperforming some state-of-the-art ensemble methods in both temporally dependent and non-temporally dependent environments. 
Finally, this research proposes a novel algorithm for the simulation of temporally dependent concept drift data, which aims to help address the lack of established datasets available for evaluation. Experimental results show that temporal dependence can be injected into fabricated data streams using existing generation methods.

Open Access Institutional Repository at Robert Gordon University

Concept drift from 1980 to 2020: a comprehensive bibliometric analysis with future research insight

Author: Baburoglu Elif Selen
Dereli Turkay
Durmusoglu Alptekin
Publication venue: Springer Heidelberg

In nonstationary environments, high-dimensional data streams are generated unceasingly, and the underlying distribution of the training and target data may change over time. These changes are labeled as concept drift in the literature. Learning from evolving data streams demands adaptive or evolving approaches to handle concept drift, which is a new research area. In this effort, a wide-ranging comparative analysis of concept drift is presented to highlight state-of-the-art approaches, covering the last four decades, from 1980 to 2020. Given the scope and discipline, the core collection of the Web of Science database is taken as the basis of this study, and 1,564 publications related to concept drift are retrieved. As a result of the classification and feature analysis of valid literature data, bibliometric indicators are revealed at the levels of countries/regions, institutions, and authors. The overall analyses of publications, citations, and cooperation networks reveal not only the highly authoritative publications but also the most prolific institutions, influential authors, dynamic networks, etc. Furthermore, in-depth analyses involving text mining, such as burst detection analysis, co-occurrence analysis, timeline view analysis, and bibliographic coupling analysis, are conducted to disclose the current challenges and future research directions. This paper serves as a reference for further research on concept drift, highlighting emerging and trending topics and possible research directions with several graphs visualized using the VOSviewer and CiteSpace software.

DSpace@HKU

Detection of Software Vulnerability Communication in Expert Social Media Channels: A Data-driven Approach

Author: Queiroz Andrei Lima
Publication venue: Dublin Institute of Technology
Publication date: 01/09/2020

Conceptually, a vulnerability is: A flaw or weakness in a system’s design, implementation, or operation and management that could be exploited to violate the system’s security policy. Some of these flaws can go undetected and be exploited for long periods of time after software release. Although some software providers are making efforts to avoid this situation, inevitably, users are still exposed to vulnerabilities that allow criminal hackers to take advantage. These vulnerabilities are constantly discussed in specialised forums on social media. Therefore, from a cyber security standpoint, the information found in these places can be used for countermeasure actions against malicious exploitation of software. However, manual inspection of the vast quantity of shared content in social media is impractical. For this reason, in this thesis, we analyse the real applicability of supervised classification models to automatically detect software vulnerability communication in expert social media channels. We cover the following three principal aspects: Firstly, we investigate the applicability of classification models in a range of 5 different datasets collected from 3 Internet Domains: Dark Web, Deep Web and Surface Web. Since supervised models require labelled data, we have provided a systematic labelling process using multiple annotators to guarantee accurate labels to carry out experiments. Using these datasets, we have investigated the classification models with different combinations of learning-based algorithms and traditional feature representations. Also, by oversampling the positive instances, we have achieved an increase of 5% in Positive Recall (on average) in these models. On top of that, we have applied Feature Reduction, Feature Extraction and Feature Selection techniques, which reduced the dimensionality of these models without damaging the accuracy, thus providing computationally efficient models. 
Furthermore, in addition to traditional feature representations, we have investigated the effect of robust language models, such as Word Embedding (WEMB) and Sentence Embedding (SEMB), on the accuracy of classification models. Regarding WEMB, our experiment has shown that this model trained on a small security-vocabulary dataset provides results comparable to WEMB trained on a very large general-vocabulary dataset. Regarding the SEMB model, our experiment has shown that it outperforms the WEMB model in detecting vulnerability communication, recording 8% of Avg. Class Accuracy and 74% of Positive Recall. In addition, we investigate two Deep Learning algorithms as classifiers, text CNN (Convolutional Neural Network) and RNN (Recurrent Neural Network)-based algorithms, which have improved our model, resulting in the best overall performance for our task.

Arrow@TUDublin

Evaluating k-NN in the Classification of Data Streams with Concept Drift

Author: Barddal Jean Paul
de Barros Roberto Souto Maior
Santos Silas Garrido Teixeira de Carvalho
Publication date: 05/10/2022

Data streams are often defined as large amounts of data flowing continuously at high speed. Moreover, these data are likely subject to changes in data distribution, known as concept drift. For these reasons, learning from streams is often online and under restrictions of memory consumption and run-time. Although many classification algorithms exist, most of the works published in the area use Naive Bayes (NB) and Hoeffding Trees (HT) as base learners in their experiments. This article proposes an in-depth evaluation of k-Nearest Neighbors (k-NN) as a candidate for classifying data streams subjected to concept drift. It also analyses the time complexity and the two main parameters of k-NN, i.e., the number of nearest neighbors used for predictions (k) and the window size (w). We compare different parameter values for k-NN and contrast it with NB and HT, both with and without a drift detector (RDDM), on many datasets. We formulated and answered 10 research questions, which led to the conclusion that k-NN is a worthy candidate for data stream classification, especially when the run-time constraint is not too restrictive. Comment: 25 pages, 10 tables, 7 figures + 30 pages appendix
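The roles of the two parameters named in this abstract, k and w, can be seen in a minimal sketch of a sliding-window k-NN classifier. This is an illustrative toy, not the implementation evaluated in the article: the class name, the Euclidean distance, the majority vote and the test-then-train loop are all assumptions.

```python
import math
from collections import Counter, deque


class SlidingWindowKNN:
    """k-NN over the w most recent labelled instances of a stream."""

    def __init__(self, k=3, w=100):
        self.k = k
        # deque(maxlen=w) silently drops the oldest instance when full,
        # which is how outdated concepts are forgotten.
        self.window = deque(maxlen=w)

    def predict(self, x):
        """Majority label among the k stored instances closest to x."""
        if not self.window:
            return None
        nearest = sorted(self.window,
                         key=lambda item: math.dist(item[0], x))[:self.k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]

    def learn(self, x, y):
        """Store one labelled instance in the window."""
        self.window.append((tuple(x), y))


def prequential_accuracy(model, stream):
    """Test-then-train: predict each instance first, then learn from it."""
    correct = total = 0
    for x, y in stream:
        pred = model.predict(x)
        if pred is not None:
            total += 1
            correct += int(pred == y)
        model.learn(x, y)
    return correct / total if total else 0.0


# A tiny window makes the forgetting visible: after the concept changes,
# the old instance near 0.0 is evicted and the prediction follows the new label.
model = SlidingWindowKNN(k=1, w=2)
model.learn((0.0,), "a")
model.learn((1.0,), "b")
before_drift = model.predict((0.1,))
model.learn((0.0,), "b")
after_drift = model.predict((0.1,))
```

Each prediction scans the whole window, so the cost grows with w. A larger w gives more stable votes but reacts more slowly after a drift.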

arXiv.org e-Print Archive

Optimized and Automated Machine Learning Techniques Towards IoT Data Analytics and Cybersecurity

Author: Yang Li
Publication venue: Scholarship@Western
Publication date: 18/08/2022

The Internet-of-Things (IoT) systems have emerged as a prevalent technology in our daily lives. With the wide spread of sensors and smart devices in recent years, the data generation volume and speed of IoT systems have increased dramatically. In most IoT systems, massive volumes of data must be processed, transformed, and analyzed on a frequent basis to enable various IoT services and functionalities. Machine Learning (ML) approaches have shown their capacity for IoT data analytics. However, applying ML models to IoT data analytics tasks still faces many difficulties and challenges. The first challenge is to process large amounts of dynamic IoT data to make accurate and informed decisions. The second challenge is to automate and optimize the data analytics process. The third challenge is to protect IoT devices and systems against various cyber threats and attacks. To address the IoT data analytics challenges, this thesis proposes various ML-based frameworks and data analytics approaches in several applications. Specifically, the first part of the thesis provides a comprehensive review of applying Automated Machine Learning (AutoML) techniques to IoT data analytics tasks. It discusses all procedures of the general ML pipeline. The second part of the thesis proposes several supervised ML-based novel Intrusion Detection Systems (IDSs) to improve the security of the Internet of Vehicles (IoV) systems and connected vehicles. Optimization techniques are used to obtain optimized ML models with high attack detection accuracy. The third part of the thesis developed unsupervised ML algorithms to identify network anomalies and malicious network entities (e.g., attacker IPs, compromised machines, and polluted files/content) to protect Content Delivery Networks (CDNs) from service targeting attacks, including distributed denial of service and cache pollution attacks. The proposed framework is evaluated on real-world CDN access log data to illustrate its effectiveness. 
The fourth part of the thesis proposes adaptive online learning algorithms for addressing concept drift issues (i.e., data distribution changes) and effectively handling dynamic IoT data streams in order to provide reliable IoT services. The drift-adaptive learning methods developed can effectively adapt to data distribution changes and avoid degradation of data analytics model performance.

Scholarship@Western