Hoeffding Tree Algorithms for Anomaly Detection in Streaming Datasets: A Survey
This survey aims to deliver an extensive and well-constructed overview of using machine learning for the problem of detecting anomalies in streaming datasets. The objective is to assess the effectiveness of Hoeffding Trees as a machine learning solution to the problem of detecting anomalies in streaming cyber datasets. In this survey we categorize the existing research on Hoeffding Trees that is feasible for this type of study as follows: distributed Hoeffding Trees, ensembles of Hoeffding Trees, and existing techniques using Hoeffding Trees for anomaly detection. These categories are referred to as compositions within this paper and were selected based on their relation to streaming data and the flexibility of their techniques for use within different domains of streaming data. We discuss how combining the techniques of the proposed research works within these compositions can address the anomaly detection problem in streaming cyber datasets. The goal is to show how a combination of techniques from different compositions can solve a prominent problem: anomaly detection.
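The split rule at the heart of a Hoeffding Tree can be stated compactly. A minimal sketch (function names and default parameters are illustrative, not from any cited implementation): a leaf grows until the observed gain advantage of the best attribute over the runner-up exceeds the Hoeffding bound, at which point the split agrees with the batch choice with probability 1 - delta.

```python
import math

def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Epsilon such that the true mean of n observations lies within
    epsilon of the sample mean with probability 1 - delta.
    value_range is the range R of the gain metric (1.0 for info gain
    on a two-class problem)."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

def should_split(best_gain: float, second_gain: float,
                 delta: float = 1e-7, n: int = 200,
                 value_range: float = 1.0) -> bool:
    # Split only when the best attribute's advantage over the runner-up
    # exceeds epsilon, so the choice is correct with probability 1 - delta.
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)
```

This bound is why Hoeffding Trees suit streams: the decision requires only running statistics per leaf, never a second pass over the data.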
COMPOSE: Compacted object sample extraction a framework for semi-supervised learning in nonstationary environments
An increasing number of real-world applications are associated with streaming data drawn from drifting and nonstationary distributions. These applications demand new algorithms that can learn and adapt to such changes, also known as concept drift. Proper characterization of such data with existing approaches typically requires a substantial amount of labeled instances, which may be difficult, expensive, or even impractical to obtain. In this thesis, compacted object sample extraction (COMPOSE) is introduced: a computational geometry-based framework for learning from nonstationary streaming data where labels are unavailable (or presented only sporadically) after initialization. The feasibility and performance of the algorithm are evaluated on several synthetic and real-world data sets, which present various scenarios of initially labeled streaming environments. On carefully designed synthetic data sets, we also compare the performance of COMPOSE against the optimal Bayes classifier, as well as the arbitrary subpopulation tracker algorithm, which addresses a similar environment referred to as extreme verification latency. Furthermore, using the real-world National Oceanic and Atmospheric Administration weather data set, we demonstrate that COMPOSE is competitive even with a well-established and fully supervised nonstationary learning algorithm that receives labeled data in every batch.
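The loop COMPOSE describes (label the incoming batch from the current core supports, then compact the labeled set before the next batch) can be sketched in a few lines. This is a simplified stand-in under stated assumptions: 1-NN label propagation and a mean-distance density proxy replace the paper's computational-geometry (alpha-shape) core support extraction.

```python
import math

def euclid(a, b):
    return math.dist(a, b)

def propagate_labels(labeled, batch):
    """1-NN label propagation: each unlabeled point takes the label of
    its nearest currently-labeled point. labeled is a list of
    (point, label) pairs; batch is a list of points."""
    out = []
    for x in batch:
        lbl = min(labeled, key=lambda p: euclid(p[0], x))[1]
        out.append((x, lbl))
    return out

def core_support_extract(points, keep_frac=0.5):
    """Density proxy (an illustrative simplification): keep the
    keep_frac of points with the smallest total distance to the rest,
    i.e. the compacted core carried forward to label the next batch."""
    scored = sorted(points, key=lambda p: sum(euclid(p[0], q[0]) for q in points))
    k = max(1, int(len(points) * keep_frac))
    return scored[:k]
```

The intuition is that under gradual drift, the dense core of each class at time t still overlaps the class at time t+1, so the compacted set remains a valid source of labels even when no new labels arrive.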
AMANDA: density-based adaptive model for nonstationary data under extreme verification latency scenarios
Gradual concept drift refers to a smooth and gradual change over time in the relations between input and output data in the underlying distribution. The problem renders a model obsolete and consequently decreases the quality of its predictions. In addition, there is a challenging task during the stream: the extreme verification latency (EVL) in verifying the labels. For batch scenarios, state-of-the-art methods propose adapting a supervised model using an unconstrained least-squares importance fitting (uLSIF) algorithm, or a semi-supervised approach along with a core support extraction (CSE) method. However, these methods do not properly tackle the problems mentioned, due to their high computational cost for large data volumes, their failure to select the samples that represent the drift, or their many tuning parameters. Therefore, we propose a density-based adaptive model for nonstationary data (AMANDA), which uses a semi-supervised classifier along with a density-based CSE method. AMANDA has two variations: AMANDA with a fixed cutting percentage (AMANDA-FCP) and AMANDA with a dynamic cutting percentage (AMANDA-DCP). Our results indicate that the two variations of AMANDA outperform the state-of-the-art methods on almost all synthetic and real datasets, with an improvement of up to 27.98% in the average error. We have found that AMANDA-FCP improves the results for gradual concept drift even with a small initial set of labeled data. Moreover, our results indicate that SSL classifiers are improved when they work along with our static or dynamic CSE methods. We therefore emphasize the importance of research directions based on this approach.
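AMANDA's density-based cutting can be illustrated similarly. The sketch below is a hedged simplification: a Gaussian kernel density score ranks samples and the densest (1 - cut_pct) fraction is kept, while the `dynamic_cut_pct` rule is purely illustrative of the DCP idea (cut more when the batches drift further apart), not the thesis's actual criterion.

```python
import math

def kde_score(x, points, bandwidth=1.0):
    """Gaussian-kernel density estimate of x over points."""
    return sum(math.exp(-math.dist(x, p) ** 2 / (2 * bandwidth ** 2))
               for p in points) / len(points)

def cut_by_density(points, cut_pct):
    """Core support extraction: drop the cut_pct least-dense points,
    keeping the dense core as the labeled seed for the next batch."""
    scored = sorted(points, key=lambda p: kde_score(p, points), reverse=True)
    keep = max(1, int(len(points) * (1.0 - cut_pct)))
    return scored[:keep]

def dynamic_cut_pct(old_batch, new_batch, scale=1.0):
    """Illustrative DCP rule (an assumption, not the paper's formula):
    cut more aggressively when the batch means drift further apart."""
    dim = len(old_batch[0])
    old_mean = [sum(p[i] for p in old_batch) / len(old_batch) for i in range(dim)]
    new_mean = [sum(p[i] for p in new_batch) / len(new_batch) for i in range(dim)]
    return min(0.9, scale * math.dist(old_mean, new_mean))
```

Trimming by density rather than by geometry is what keeps this family of methods cheap enough for large data volumes, which the abstract cites as a weakness of the uLSIF-based alternatives.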
Learning in Dynamic Data-Streams with a Scarcity of Labels
Analysing data in real time is a natural and necessary progression from traditional data mining. However, real-time analysis presents additional challenges beyond batch analysis; along with strict time and memory constraints, change is a major consideration. In a dynamic stream there is an assumption that the underlying process generating the stream is non-stationary and that concepts within the stream will drift and change over time. Adopting the false assumption that a stream is stationary will result in non-adaptive models degrading and eventually becoming obsolete. The challenge of recognising and reacting to change in a stream is compounded by the scarcity-of-labels problem. This refers to the very realistic situation in which the true class label of an incoming point is not immediately available (or will never be available), or in which manually labelling incoming points is prohibitively expensive. The goal of this thesis is to evaluate unsupervised learning as the basis for online classification in dynamic data streams with a scarcity of labels. To realise this goal, a novel stream clustering algorithm based on the collective behaviour of ants, Ant Colony Stream Clustering (ACSC), is proposed. This algorithm is shown to be faster and more accurate than comparable peer stream-clustering algorithms while requiring fewer sensitive parameters. The principles of ACSC are extended in a second stream-clustering algorithm named Multi-Density Stream Clustering (MDSC). This algorithm has adaptive parameters and, crucially, can track clusters and monitor their dynamic behaviour over time. A novel technique called a Dynamic Feature Mask (DFM) is proposed to "sit on top" of these stream-clustering algorithms and can be used to observe and track change at the feature level in a data stream. This Feature Mask acts as an unsupervised feature selection method, allowing high-dimensional streams to be clustered.
Finally, data-stream clustering is evaluated as an approach to one-class classification, and a novel framework (COCEL: Clustering and One-class Classification Ensemble Learning) for classification in dynamic streams with a scarcity of labels is described. The proposed framework can identify and react to change in a stream and greatly reduces the number of required labels (typically less than 0.05% of the entire stream).
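The one-class decision that this framework builds on stream clustering reduces to a distance test against cluster summaries. A minimal sketch, where the `MicroCluster` summary and the `slack` factor are illustrative simplifications of what ACSC/MDSC actually maintain:

```python
import math

class MicroCluster:
    """Illustrative cluster summary: a centroid and a radius. Real
    stream-clustering summaries also carry counts and recency weights."""
    def __init__(self, centroid, radius):
        self.centroid = centroid
        self.radius = radius

def classify(point, clusters, slack=1.5):
    """One-class decision: 'normal' if the point falls within
    slack * radius of any known cluster, else 'anomaly' (and hence a
    candidate for the scarce labelling budget)."""
    for c in clusters:
        if math.dist(point, c.centroid) <= slack * c.radius:
            return "normal"
    return "anomaly"
```

Because only points rejected by every cluster need a human label, the labelling cost scales with the anomaly rate rather than the stream length, which is consistent with the sub-0.05% figure quoted above.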
Online Anomaly Detection for Time Series: Towards Incorporating Feature Extraction, Model Uncertainty and Concept Drift Adaptation for Improving Anomaly Detection
Time series anomaly detection receives increasing research interest given the growing number of data-rich application domains. Recent additions to anomaly detection methods in the research literature include deep learning algorithms. The nature and performance of these algorithms in sequence analysis enable them to learn hierarchical discriminating features and the temporal nature of a time series. However, their performance is affected by the speed at which the time series arrives, the use of a fixed threshold, and the assumption of a Gaussian distribution on the prediction error to identify anomalous values. An exact parametric distribution is often not directly relevant in many applications, and it is often difficult to select an appropriate threshold that will differentiate anomalies from noise. Implementations therefore need a Prediction Interval (PI) that quantifies the level of uncertainty associated with the Deep Neural Network (DNN) point forecasts, which helps in making better-informed decisions and mitigates against false anomaly alerts.
To achieve this, a new anomaly detection method is proposed that computes the uncertainty in estimates using quantile regression and uses the quantile interval to identify anomalies. Similarly, to handle the speed at which the data arrive, an online anomaly detection method is proposed in which a model is trained incrementally to adapt to concept drift, improving prediction. This is implemented using a window-based strategy, in which a time series is broken into sliding windows of sub-sequences as input to the model. To adapt to concept drift, the model is updated when changes occur in the newly arriving instances. This is achieved by using an anomaly likelihood, computed with the Q-function, to define the abnormal degree of the current data point based on the previous data points. Specifically, when concept drift occurs, the proposed method will mark the current data point as anomalous. However, when the abnormal behaviour continues for a longer period of time, the abnormal degree of the current data point will be low compared to the previous data points under the likelihood. As such, the current data point is added to the previous data to retrain the model, which allows the model to learn the new characteristics of the data and hence adapt to the concept changes, thereby redefining the abnormal behaviour. The proposed method also incorporates feature extraction to capture structural patterns in the time series. This is especially significant for multivariate time-series data, for which there is a need to capture the complex temporal dependencies that may exist between the variables.
In summary, this thesis contributes to the theory, design, and development of algorithms and models for the detection of anomalies in both static and evolving time-series data. Several experiments were conducted, and the results obtained indicate the significance of this research for offline and online anomaly detection in both static and evolving time-series data. In chapter 3, the newly proposed method (the Deep Quantile Regression Anomaly Detection method, DQR-AD) is evaluated and compared with six other prediction-based anomaly detection methods that assume a normal distribution of the prediction or reconstruction error for the identification of anomalies. Results in the first part of the experiment indicate that DQR-AD obtained relatively better precision than all other methods, which demonstrates the capability of the method to detect a higher number of anomalous points with low false positive rates. The results also show that DQR-AD is approximately 2-3 times better than DeepAnT, which in turn performs better than all the remaining methods on all domains in the NAB dataset. In the second part of the experiment, the SMAP dataset is used with 4-dimensional features to demonstrate the method on multivariate time-series data. Experimental results show that DQR-AD has 10% better performance than AE on three datasets (SMAP1, SMAP3, and SMAP5) and equal performance on the remaining two datasets. In chapter 5, two levels of experiments were conducted on the basis of false-positive rate and concept drift adaptation. In the first level of the experiment, the results show that online DQR-AD is 18% better than both DQR-AD and VAE-LSTM on five NAB datasets. Similarly, results in the second level of the experiment show that the online DQR-AD method performs better than five counterpart methods by a margin of roughly 10% on six out of the seven NAB datasets. This result demonstrates how the concept drift adaptation strategies adopted in the proposed online DQR-AD improve the performance of anomaly detection in time series.
Petroleum Technology Development Fund (PTDF)
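The two ingredients of the proposed detector, a quantile-interval test and a Q-function anomaly likelihood, can be sketched independently of any particular DNN. The pinball loss below is the standard training objective for quantile regression; the likelihood formula (1 - Q(z) with Q the Gaussian tail probability) follows a common convention and is an assumption about the thesis's exact definition:

```python
import math

def pinball_loss(y: float, y_hat: float, tau: float) -> float:
    """Quantile (pinball) loss: minimising it over data makes y_hat an
    estimate of the tau-quantile of y."""
    diff = y - y_hat
    return max(tau * diff, (tau - 1.0) * diff)

def is_anomaly(y: float, q_low: float, q_high: float) -> bool:
    """Flag a point that falls outside the predicted quantile interval,
    replacing a fixed threshold on the prediction error."""
    return y < q_low or y > q_high

def anomaly_likelihood(z: float) -> float:
    """Abnormal degree of the current point from its standardised
    deviation z, via the Gaussian Q-function Q(z) = 0.5*erfc(z/sqrt(2)).
    Values near 1 mean 'very abnormal'; a sustained drop signals that
    the 'abnormal' behaviour has become the new normal (concept drift)."""
    return 1.0 - 0.5 * math.erfc(z / math.sqrt(2.0))
```

In an online loop, one would train the quantile model on sliding windows with `pinball_loss` at two values of tau (say 0.05 and 0.95), flag points with `is_anomaly`, and trigger retraining when `anomaly_likelihood` stays elevated and then falls, mirroring the drift-adaptation behaviour described above.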
A dynamic visual analytics framework for complex temporal environments
Introduction: Data streams are produced by sensors that sample an external system at a periodic interval. As the cost of developing sensors continues to fall, an increasing number of data stream acquisition systems have been deployed to take advantage of the volume and velocity of data streams. An overabundance of information in complex environments has been linked to information overload, a state of exposure to overwhelming and excessive information. The use of visual analytics provides leverage over potential information overload challenges. Apart from automated online analysis, interactive visual tools provide significant leverage for human-driven trend analysis and pattern recognition. To facilitate analysis and knowledge discovery in the space of multidimensional big data, research is warranted into an online visual analytic framework that supports human-driven exploration and consumption of complex data streams.
Method: A novel framework was developed called the temporal Tri-event parameter based Dynamic Visual Analytics (TDVA). The TDVA framework was instantiated in two case studies, namely, a case study involving a hypothesis generation scenario, and a second case study involving a cohort-based hypothesis testing scenario. Two evaluations were conducted for each case study involving expert participants. This framework is demonstrated in a neonatal intensive care unit case study. The hypothesis generation phase of the pipeline is conducted through a multidimensional and in-depth one subject study using PhysioEx, a novel visual analytic tool for physiologic data stream analysis. The cohort-based hypothesis testing component of the analytic pipeline is validated through CoRAD, a visual analytic tool for performing case-controlled studies.
Results: The results of both evaluations show improved task performance, and subjective satisfaction with the use of PhysioEx and CoRAD. Results from the evaluation of PhysioEx reveals insight about current limitations for supporting single subject studies in complex environments, and areas for future research in that space. Results from CoRAD also support the need for additional research to explore complex multi-dimensional patterns across multiple observations. From an information systems approach, the efficacy and feasibility of the TDVA framework is demonstrated by the instantiation and evaluation of PhysioEx and CoRAD.
Conclusion: This research introduces the TDVA framework and provides results that validate the deployment of online dynamic visual analytics in complex environments. The TDVA framework was instantiated in two case studies derived from an environment where dynamic and complex data streams were available. The first instantiation enabled the end-user to rapidly extract information from complex data streams to conduct in-depth analysis. The second allowed the end-user to test emerging patterns across multiple observations. To both ends, this thesis provides knowledge that can be used to improve the visual analytic pipeline in dynamic and complex environments.