Hoeffding Tree Algorithms for Anomaly Detection in Streaming Datasets: A Survey
This survey aims to deliver an extensive and well-constructed overview of using machine learning for the problem of detecting anomalies in streaming datasets. The objective is to assess the effectiveness of Hoeffding Trees as a machine learning solution to the problem of detecting anomalies in streaming cyber datasets. In this survey we categorize the existing research on Hoeffding Trees that is feasible for this type of study as follows: distributed Hoeffding Trees, ensembles of Hoeffding Trees, and existing techniques using Hoeffding Trees for anomaly detection. These categories are referred to as compositions within this paper and were selected based on their relation to streaming data and the flexibility of their techniques for use within different domains of streaming data. We discuss how combining the techniques of the proposed research works within these compositions can address the anomaly detection problem in streaming cyber datasets. The goal is to show how a combination of techniques from different compositions can solve a prominent problem: anomaly detection.
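The split rule at the heart of a Hoeffding Tree can be stated compactly. A minimal sketch (function names and default parameters are illustrative, not from any cited implementation): a leaf grows until the observed gain advantage of the best attribute over the runner-up exceeds the Hoeffding bound, at which point the split agrees with the batch choice with probability 1 - delta.

```python
import math

def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Epsilon such that the true mean of n observations lies within
    epsilon of the sample mean with probability 1 - delta.
    value_range is the range R of the gain metric (1.0 for info gain
    on a two-class problem)."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

def should_split(best_gain: float, second_gain: float,
                 delta: float = 1e-7, n: int = 200,
                 value_range: float = 1.0) -> bool:
    # Split only when the best attribute's advantage over the runner-up
    # exceeds epsilon, so the choice is correct with probability 1 - delta.
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)
```

This bound is why Hoeffding Trees suit streams: the decision requires only running statistics per leaf, never a second pass over the data.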
COMPOSE: Compacted object sample extraction a framework for semi-supervised learning in nonstationary environments
An increasing number of real-world applications are associated with streaming data drawn from drifting and nonstationary distributions. These applications demand new algorithms that can learn and adapt to such changes, also known as concept drift. Proper characterization of such data with existing approaches typically requires a substantial amount of labeled instances, which may be difficult, expensive, or even impractical to obtain. In this thesis, compacted object sample extraction (COMPOSE) is introduced: a computational geometry-based framework for learning from nonstationary streaming data where labels are unavailable (or presented only sporadically) after initialization. The feasibility and performance of the algorithm are evaluated on several synthetic and real-world data sets, which present various scenarios of initially labeled streaming environments. On carefully designed synthetic data sets, we also compare the performance of COMPOSE against the optimal Bayes classifier, as well as the arbitrary subpopulation tracker algorithm, which addresses a similar environment referred to as extreme verification latency. Furthermore, using the real-world National Oceanic and Atmospheric Administration weather data set, we demonstrate that COMPOSE is competitive even with a well-established and fully supervised nonstationary learning algorithm that receives labeled data in every batch.
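The loop COMPOSE describes (label the incoming batch from the current core supports, then compact the labeled set before the next batch) can be sketched in a few lines. This is a simplified stand-in under stated assumptions: 1-NN label propagation and a mean-distance density proxy replace the paper's computational-geometry (alpha-shape) core support extraction.

```python
import math

def euclid(a, b):
    return math.dist(a, b)

def propagate_labels(labeled, batch):
    """1-NN label propagation: each unlabeled point takes the label of
    its nearest currently-labeled point. labeled is a list of
    (point, label) pairs; batch is a list of points."""
    out = []
    for x in batch:
        lbl = min(labeled, key=lambda p: euclid(p[0], x))[1]
        out.append((x, lbl))
    return out

def core_support_extract(points, keep_frac=0.5):
    """Density proxy (an illustrative simplification): keep the
    keep_frac of points with the smallest total distance to the rest,
    i.e. the compacted core carried forward to label the next batch."""
    scored = sorted(points, key=lambda p: sum(euclid(p[0], q[0]) for q in points))
    k = max(1, int(len(points) * keep_frac))
    return scored[:k]
```

The intuition is that under gradual drift, the dense core of each class at time t still overlaps the class at time t+1, so the compacted set remains a valid source of labels even when no new labels arrive.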
AMANDA: density-based adaptive model for nonstationary data under extreme verification latency scenarios
Gradual concept drift refers to a smooth and gradual change over time in the relations between input and output data in the underlying distribution. The problem renders a model obsolete and consequently decreases the quality of its predictions. In addition, there is a challenging task during the stream: the extreme verification latency (EVL) in verifying the labels. For batch scenarios, state-of-the-art methods propose adapting a supervised model using an unconstrained least-squares importance fitting (uLSIF) algorithm, or a semi-supervised approach along with a core support extraction (CSE) method. However, these methods do not properly tackle the problems mentioned, due to their high computational cost for large data volumes, their failure to select the samples that represent the drift, or their many tuning parameters. Therefore, we propose a density-based adaptive model for nonstationary data (AMANDA), which uses a semi-supervised classifier along with a density-based CSE method. AMANDA has two variations: AMANDA with a fixed cutting percentage (AMANDA-FCP) and AMANDA with a dynamic cutting percentage (AMANDA-DCP). Our results indicate that the two variations of AMANDA outperform the state-of-the-art methods on almost all synthetic and real datasets, with an improvement of up to 27.98% in the average error. We have found that AMANDA-FCP improves the results for gradual concept drift even with a small initial set of labeled data. Moreover, our results indicate that SSL classifiers are improved when they work along with our static or dynamic CSE methods. We therefore emphasize the importance of research directions based on this approach.
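AMANDA's density-based cutting can be illustrated similarly. The sketch below is a hedged simplification: a Gaussian kernel density score ranks samples and the densest (1 - cut_pct) fraction is kept, while the `dynamic_cut_pct` rule is purely illustrative of the DCP idea (cut more when the batches drift further apart), not the thesis's actual criterion.

```python
import math

def kde_score(x, points, bandwidth=1.0):
    """Gaussian-kernel density estimate of x over points."""
    return sum(math.exp(-math.dist(x, p) ** 2 / (2 * bandwidth ** 2))
               for p in points) / len(points)

def cut_by_density(points, cut_pct):
    """Core support extraction: drop the cut_pct least-dense points,
    keeping the dense core as the labeled seed for the next batch."""
    scored = sorted(points, key=lambda p: kde_score(p, points), reverse=True)
    keep = max(1, int(len(points) * (1.0 - cut_pct)))
    return scored[:keep]

def dynamic_cut_pct(old_batch, new_batch, scale=1.0):
    """Illustrative DCP rule (an assumption, not the paper's formula):
    cut more aggressively when the batch means drift further apart."""
    dim = len(old_batch[0])
    old_mean = [sum(p[i] for p in old_batch) / len(old_batch) for i in range(dim)]
    new_mean = [sum(p[i] for p in new_batch) / len(new_batch) for i in range(dim)]
    return min(0.9, scale * math.dist(old_mean, new_mean))
```

Trimming by density rather than by geometry is what keeps this family of methods cheap enough for large data volumes, which the abstract cites as a weakness of the uLSIF-based alternatives.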
Learning in Dynamic Data-Streams with a Scarcity of Labels
Analysing data in real time is a natural and necessary progression from traditional data mining. However, real-time analysis presents additional challenges beyond batch analysis; along with strict time and memory constraints, change is a major consideration. In a dynamic stream there is an assumption that the underlying process generating the stream is non-stationary and that concepts within the stream will drift and change over time. Adopting the false assumption that a stream is stationary will result in non-adaptive models degrading and eventually becoming obsolete. The challenge of recognising and reacting to change in a stream is compounded by the scarcity-of-labels problem. This refers to the very realistic situation in which the true class label of an incoming point is not immediately available (or will never be available), or in which manually labelling incoming points is prohibitively expensive. The goal of this thesis is to evaluate unsupervised learning as the basis for online classification in dynamic data streams with a scarcity of labels. To realise this goal, a novel stream clustering algorithm based on the collective behaviour of ants, Ant Colony Stream Clustering (ACSC), is proposed. This algorithm is shown to be faster and more accurate than comparable peer stream-clustering algorithms while requiring fewer sensitive parameters. The principles of ACSC are extended in a second stream-clustering algorithm named Multi-Density Stream Clustering (MDSC). This algorithm has adaptive parameters and, crucially, can track clusters and monitor their dynamic behaviour over time. A novel technique called a Dynamic Feature Mask (DFM) is proposed to "sit on top" of these stream-clustering algorithms and can be used to observe and track change at the feature level in a data stream. This Feature Mask acts as an unsupervised feature selection method, allowing high-dimensional streams to be clustered.
Finally, data-stream clustering is evaluated as an approach to one-class classification, and a novel framework (COCEL: Clustering and One-class Classification Ensemble Learning) for classification in dynamic streams with a scarcity of labels is described. The proposed framework can identify and react to change in a stream and greatly reduces the number of required labels (typically less than 0.05% of the entire stream).
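The one-class decision that this framework builds on stream clustering reduces to a distance test against cluster summaries. A minimal sketch, where the `MicroCluster` summary and the `slack` factor are illustrative simplifications of what ACSC/MDSC actually maintain:

```python
import math

class MicroCluster:
    """Illustrative cluster summary: a centroid and a radius. Real
    stream-clustering summaries also carry counts and recency weights."""
    def __init__(self, centroid, radius):
        self.centroid = centroid
        self.radius = radius

def classify(point, clusters, slack=1.5):
    """One-class decision: 'normal' if the point falls within
    slack * radius of any known cluster, else 'anomaly' (and hence a
    candidate for the scarce labelling budget)."""
    for c in clusters:
        if math.dist(point, c.centroid) <= slack * c.radius:
            return "normal"
    return "anomaly"
```

Because only points rejected by every cluster need a human label, the labelling cost scales with the anomaly rate rather than the stream length, which is consistent with the sub-0.05% figure quoted above.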
Online Anomaly Detection for Time Series: Towards Incorporating Feature Extraction, Model Uncertainty and Concept Drift Adaptation for Improving Anomaly Detection
Time series anomaly detection receives increasing research interest given the growing number of data-rich application domains. Recent additions to anomaly detection methods in the research literature include deep learning algorithms. The nature and performance of these algorithms in sequence analysis enable them to learn hierarchical discriminating features and the temporal nature of a time series. However, their performance is affected by the speed at which the time series arrives, the use of a fixed threshold, and the assumption of a Gaussian distribution on the prediction error to identify anomalous values. An exact parametric distribution is often not directly relevant in many applications, and it is often difficult to select an appropriate threshold that will differentiate anomalies from noise. Implementations therefore need a Prediction Interval (PI) that quantifies the level of uncertainty associated with the Deep Neural Network (DNN) point forecasts, which helps in making better-informed decisions and mitigates against false anomaly alerts.
To achieve this, a new anomaly detection method is proposed that computes the uncertainty in estimates using quantile regression and uses the quantile interval to identify anomalies. Similarly, to handle the speed at which the data arrive, an online anomaly detection method is proposed in which a model is trained incrementally to adapt to concept drift, improving prediction. This is implemented using a window-based strategy, in which a time series is broken into sliding windows of sub-sequences as input to the model. To adapt to concept drift, the model is updated when changes occur in the newly arriving instances. This is achieved by using an anomaly likelihood, computed with the Q-function, to define the abnormal degree of the current data point based on the previous data points. Specifically, when concept drift occurs, the proposed method will mark the current data point as anomalous. However, when the abnormal behaviour continues for a longer period of time, the abnormal degree of the current data point will be low compared to the previous data points under the likelihood. As such, the current data point is added to the previous data to retrain the model, which allows the model to learn the new characteristics of the data and hence adapt to the concept changes, thereby redefining the abnormal behaviour. The proposed method also incorporates feature extraction to capture structural patterns in the time series. This is especially significant for multivariate time-series data, for which there is a need to capture the complex temporal dependencies that may exist between the variables.
In summary, this thesis contributes to the theory, design, and development of algorithms and models for the detection of anomalies in both static and evolving time-series data. Several experiments were conducted, and the results obtained indicate the significance of this research for offline and online anomaly detection in both static and evolving time-series data. In chapter 3, the newly proposed method (the Deep Quantile Regression Anomaly Detection method, DQR-AD) is evaluated and compared with six other prediction-based anomaly detection methods that assume a normal distribution of the prediction or reconstruction error for the identification of anomalies. Results in the first part of the experiment indicate that DQR-AD obtained relatively better precision than all other methods, which demonstrates the capability of the method to detect a higher number of anomalous points with low false positive rates. The results also show that DQR-AD is approximately 2-3 times better than DeepAnT, which in turn performs better than all the remaining methods on all domains in the NAB dataset. In the second part of the experiment, the SMAP dataset is used with 4-dimensional features to demonstrate the method on multivariate time-series data. Experimental results show that DQR-AD has 10% better performance than AE on three datasets (SMAP1, SMAP3, and SMAP5) and equal performance on the remaining two datasets. In chapter 5, two levels of experiments were conducted on the basis of false-positive rate and concept drift adaptation. In the first level of the experiment, the results show that online DQR-AD is 18% better than both DQR-AD and VAE-LSTM on five NAB datasets. Similarly, results in the second level of the experiment show that the online DQR-AD method performs better than five counterpart methods by a margin of roughly 10% on six out of the seven NAB datasets. This result demonstrates how the concept drift adaptation strategies adopted in the proposed online DQR-AD improve the performance of anomaly detection in time series.
Petroleum Technology Development Fund (PTDF)
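The two ingredients of the proposed detector, a quantile-interval test and a Q-function anomaly likelihood, can be sketched independently of any particular DNN. The pinball loss below is the standard training objective for quantile regression; the likelihood formula (1 - Q(z) with Q the Gaussian tail probability) follows a common convention and is an assumption about the thesis's exact definition:

```python
import math

def pinball_loss(y: float, y_hat: float, tau: float) -> float:
    """Quantile (pinball) loss: minimising it over data makes y_hat an
    estimate of the tau-quantile of y."""
    diff = y - y_hat
    return max(tau * diff, (tau - 1.0) * diff)

def is_anomaly(y: float, q_low: float, q_high: float) -> bool:
    """Flag a point that falls outside the predicted quantile interval,
    replacing a fixed threshold on the prediction error."""
    return y < q_low or y > q_high

def anomaly_likelihood(z: float) -> float:
    """Abnormal degree of the current point from its standardised
    deviation z, via the Gaussian Q-function Q(z) = 0.5*erfc(z/sqrt(2)).
    Values near 1 mean 'very abnormal'; a sustained drop signals that
    the 'abnormal' behaviour has become the new normal (concept drift)."""
    return 1.0 - 0.5 * math.erfc(z / math.sqrt(2.0))
```

In an online loop, one would train the quantile model on sliding windows with `pinball_loss` at two values of tau (say 0.05 and 0.95), flag points with `is_anomaly`, and trigger retraining when `anomaly_likelihood` stays elevated and then falls, mirroring the drift-adaptation behaviour described above.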
A dynamic visual analytics framework for complex temporal environments
Introduction: Data streams are produced by sensors that sample an external system at a periodic interval. As the cost of developing sensors continues to fall, an increasing number of data stream acquisition systems have been deployed to take advantage of the volume and velocity of data streams. An overabundance of information in complex environments has been linked to information overload, a state of exposure to overwhelming and excessive information. The use of visual analytics provides leverage over potential information overload challenges. Apart from automated online analysis, interactive visual tools provide significant leverage for human-driven trend analysis and pattern recognition. To facilitate analysis and knowledge discovery in the space of multidimensional big data, research is warranted into an online visual analytic framework that supports human-driven exploration and consumption of complex data streams.
Method: A novel framework was developed called the temporal Tri-event parameter based Dynamic Visual Analytics (TDVA). The TDVA framework was instantiated in two case studies, namely, a case study involving a hypothesis generation scenario, and a second case study involving a cohort-based hypothesis testing scenario. Two evaluations were conducted for each case study involving expert participants. This framework is demonstrated in a neonatal intensive care unit case study. The hypothesis generation phase of the pipeline is conducted through a multidimensional and in-depth one subject study using PhysioEx, a novel visual analytic tool for physiologic data stream analysis. The cohort-based hypothesis testing component of the analytic pipeline is validated through CoRAD, a visual analytic tool for performing case-controlled studies.
Results: The results of both evaluations show improved task performance, and subjective satisfaction with the use of PhysioEx and CoRAD. Results from the evaluation of PhysioEx reveals insight about current limitations for supporting single subject studies in complex environments, and areas for future research in that space. Results from CoRAD also support the need for additional research to explore complex multi-dimensional patterns across multiple observations. From an information systems approach, the efficacy and feasibility of the TDVA framework is demonstrated by the instantiation and evaluation of PhysioEx and CoRAD.
Conclusion: This research introduces the TDVA framework and provides results that validate the deployment of online dynamic visual analytics in complex environments. The TDVA framework was instantiated in two case studies derived from an environment where dynamic and complex data streams were available. The first instantiation enabled the end-user to rapidly extract information from complex data streams to conduct in-depth analysis. The second allowed the end-user to test emerging patterns across multiple observations. To both ends, this thesis provides knowledge that can be used to improve the visual analytic pipeline in dynamic and complex environments.