918 research outputs found
Change Detection in Multivariate Datastreams: Likelihood and Detectability Loss
We address the problem of detecting changes in multivariate datastreams, and
we investigate the intrinsic difficulty that change-detection methods have to
face when the data dimension scales. In particular, we consider a general
approach where changes are detected by comparing the distribution of the
log-likelihood of the datastream over different time windows. Although this
approach underlies several change-detection methods, its effectiveness as the
data dimension grows has never been investigated; investigating it is the goal
of our paper. We show that the magnitude of the change can
be naturally measured by the symmetric Kullback-Leibler divergence between the
pre- and post-change distributions, and that the detectability of a change of a
given magnitude worsens when the data dimension increases. This problem, which
we refer to as detectability loss, is due to the linear relationship
between the variance of the log-likelihood and the data dimension. We
analytically derive the detectability loss on Gaussian-distributed datastreams,
and empirically demonstrate that this problem also holds on real-world datasets
and that it can be harmful even at low data dimensions (say, 10).
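The linear relationship between the variance of the log-likelihood and the data dimension can be checked numerically in the simplest Gaussian case. The sketch below is an illustration only, not the paper's derivation: it assumes a standard Gaussian N(0, I_d), for which the log-likelihood is a constant minus half a chi-squared variable with d degrees of freedom, so its variance is d/2. The function names `gaussian_loglik` and `loglik_variance` are hypothetical.

```python
import math
import random
import statistics


def gaussian_loglik(x):
    """Log-density of a standard Gaussian N(0, I_d) evaluated at the point x."""
    d = len(x)
    return -0.5 * d * math.log(2.0 * math.pi) - 0.5 * sum(v * v for v in x)


def loglik_variance(d, n_samples=20000, seed=0):
    """Monte Carlo estimate of Var[log phi(x)] for x ~ N(0, I_d)."""
    rng = random.Random(seed)
    values = [
        gaussian_loglik([rng.gauss(0.0, 1.0) for _ in range(d)])
        for _ in range(n_samples)
    ]
    return statistics.pvariance(values)


if __name__ == "__main__":
    # The estimate should grow roughly like d / 2 as the dimension increases.
    for d in (1, 2, 8, 32):
        print(d, round(loglik_variance(d), 2), "theory:", d / 2)
```

Because the spread of the log-likelihood grows with d while a change of fixed magnitude shifts its mean by a fixed amount, the shift becomes harder to see against the noise as d grows, which is the intuition behind the loss described above.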
Improving adaptive bagging methods for evolving data streams
We propose two new improvements for bagging methods on evolving data streams. Recently, two new variants of Bagging were proposed: ADWIN Bagging and Adaptive-Size Hoeffding Tree (ASHT) Bagging. ASHT Bagging uses trees of different sizes, and ADWIN Bagging uses ADWIN as a change detector to decide when to discard underperforming ensemble members. We improve ADWIN Bagging using Hoeffding Adaptive Trees, trees that can adaptively learn from data streams that change over time. To speed up the adaptation of ASHT Bagging to change, we add an error change detector for each classifier. We test our improvements by performing an evaluation study on synthetic and real-world datasets comprising up to ten million examples.
Nonparametric and Online Change Detection in Multivariate Datastreams Using QuantTree
We address the problem of online change detection in multivariate datastreams, and we introduce QuantTree Exponentially Weighted Moving Average (QT-EWMA), a nonparametric change-detection algorithm that can control the expected time before a false alarm, yielding a desired Average Run Length (ARL0). Controlling false alarms is crucial in many applications and is rarely guaranteed by online change-detection algorithms that can monitor multivariate datastreams without knowing the data distribution. Like many change-detection algorithms, QT-EWMA builds a model of the data distribution, in our case a QuantTree histogram, from a stationary training set. To monitor datastreams even when the training set is extremely small, we propose QT-EWMA-update, which incrementally updates the QuantTree histogram during monitoring, always keeping the ARL0 under control. Our experiments, performed on synthetic and real-world datastreams, demonstrate that QT-EWMA and QT-EWMA-update control the ARL0 and the false alarm rate better than state-of-the-art methods operating in similar conditions, achieving lower or comparable detection delays.
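The general idea of monitoring a histogram with an exponentially weighted moving average can be sketched as follows. This is a simplified illustration, not the QT-EWMA algorithm: it assumes univariate data, uses equal-frequency bins in place of a QuantTree partition, and compares a Pearson-like statistic against a fixed, hand-picked threshold rather than thresholds calibrated to a target ARL0. The class `BinEwmaMonitor` and its parameters (`n_bins`, `lam`, `threshold`) are hypothetical.

```python
import bisect
import random


def equal_frequency_edges(train, n_bins):
    """Interior bin edges so that each bin holds about the same share of train."""
    ordered = sorted(train)
    return [ordered[(i * len(ordered)) // n_bins] for i in range(1, n_bins)]


class BinEwmaMonitor:
    """EWMA of bin-membership indicators, compared with the training proportions."""

    def __init__(self, train, n_bins=4, lam=0.05, threshold=0.5):
        self.edges = equal_frequency_edges(train, n_bins)
        counts = [0] * n_bins
        for x in train:
            counts[self._bin(x)] += 1
        self.pi = [c / len(train) for c in counts]  # stationary bin proportions
        self.z = list(self.pi)                      # EWMA starts at the proportions
        self.lam = lam
        self.threshold = threshold                  # assumption: fixed, not ARL0-calibrated

    def _bin(self, x):
        return bisect.bisect_right(self.edges, x)

    def update(self, x):
        """Process one sample; return (statistic, alarm flag)."""
        j = self._bin(x)
        for k in range(len(self.z)):
            hit = 1.0 if k == j else 0.0
            self.z[k] = (1.0 - self.lam) * self.z[k] + self.lam * hit
        stat = sum((z - p) ** 2 / p for z, p in zip(self.z, self.pi))
        return stat, stat > self.threshold


if __name__ == "__main__":
    rng = random.Random(1)
    monitor = BinEwmaMonitor([rng.gauss(0.0, 1.0) for _ in range(1000)])
    # Feed post-change samples (mean shifted to 3) until the statistic crosses the threshold.
    for t in range(1, 201):
        stat, alarm = monitor.update(rng.gauss(3.0, 1.0))
        if alarm:
            print("alarm after", t, "post-change samples")
            break
```

A smaller `lam` smooths the statistic more, which reduces false alarms at the cost of slower reaction to a change.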
SMOClust: Synthetic Minority Oversampling based on Stream Clustering for Evolving Data Streams
Many real-world data stream applications suffer not only from concept drift
but also from class imbalance. Yet, very few existing studies have investigated this
joint challenge. Data difficulty factors, which have been shown to be key
challenges in class imbalanced data streams, are not taken into account by
existing approaches when learning class imbalanced data streams. In this work,
we propose a drift adaptable oversampling strategy to synthesise minority class
examples based on stream clustering. The motivation is that stream clustering
methods continuously update themselves to reflect the characteristics of the
current underlying concept, including data difficulty factors. This nature can
potentially be used to compress past information without caching data in the
memory explicitly. Based on the compressed information, synthetic examples can
be created within the region that recently generated new minority class
examples. Experiments with artificial and real-world data streams show that the
proposed approach can handle concept drift involving different minority class
decomposition better than existing approaches, especially when the data stream
is severely class imbalanced and presents high proportions of safe and
borderline minority class examples.
Comment: 59 pages, 85 figures
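The idea of compressing past minority-class examples into a cluster summary and generating synthetic examples from it can be sketched as follows. This is a minimal illustration under stated assumptions, not the SMOClust procedure: it keeps a single micro-cluster with exponentially decayed sufficient statistics and draws synthetic points from an axis-aligned Gaussian around the centroid. The class `MicroCluster`, its `decay` parameter, and the sampling rule are all hypothetical.

```python
import math
import random


class MicroCluster:
    """Decayed summary (weight, linear sum, squared sum) of recent minority examples."""

    def __init__(self, dim, decay=0.99):
        self.n = 0.0
        self.ls = [0.0] * dim  # decayed per-feature sum
        self.ss = [0.0] * dim  # decayed per-feature sum of squares
        self.decay = decay

    def add(self, x):
        """Absorb one example; older examples are down-weighted, none are stored."""
        self.n = self.decay * self.n + 1.0
        self.ls = [self.decay * s + v for s, v in zip(self.ls, x)]
        self.ss = [self.decay * q + v * v for q, v in zip(self.ss, x)]

    def centroid(self):
        if self.n == 0.0:
            raise ValueError("empty micro-cluster")
        return [s / self.n for s in self.ls]

    def spread(self):
        """Per-feature standard deviation implied by the decayed statistics."""
        c = self.centroid()
        return [math.sqrt(max(q / self.n - m * m, 0.0)) for q, m in zip(self.ss, c)]

    def sample(self, rng):
        """Draw one synthetic example near the region that recently produced data."""
        return [rng.gauss(m, s) for m, s in zip(self.centroid(), self.spread())]


if __name__ == "__main__":
    rng = random.Random(0)
    mc = MicroCluster(dim=2)
    for _ in range(500):
        mc.add([rng.gauss(5.0, 0.5), rng.gauss(-2.0, 0.5)])
    print("centroid:", mc.centroid())
    print("synthetic example:", mc.sample(rng))
```

Because the statistics decay, the summary follows the current concept: after a drift, newly arriving minority examples gradually dominate the centroid and spread.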
Multimodal Batch-Wise Change Detection
We address the problem of detecting distribution changes in a novel batch-wise and multimodal setup. This setup is characterized by a stationary condition where batches are drawn from potentially different modalities among a set of distributions in $\mathbb{R}^d$ represented in the training set. Existing change detection (CD) algorithms assume that there is a unique (possibly multipeaked) distribution characterizing stationary conditions, and in a batch-wise multimodal context they exhibit either low detection power or poor control of false positives. We present MultiModal QuantTree (MMQT), a novel CD algorithm that uses a single histogram to model the batch-wise multimodal stationary conditions. During testing, MMQT automatically identifies which modality has generated the incoming batch and detects changes by means of a modality-specific statistic. We leverage the theoretical properties of QuantTree to: 1) automatically estimate the number of modalities in a training set and 2) derive a principled calibration procedure that guarantees false-positive control. Our experiments show that MMQT achieves high detection power and accurate control over false positives in synthetic and real-world multimodal CD problems. Moreover, we show the potential of MMQT in Stream Learning applications, where it proves effective at detecting concept drifts and the emergence of novel classes by solely monitoring the input distribution.
Learning and adaptation to detect changes and anomalies in high dimensional data
The problem of monitoring a datastream and detecting whether the data-generating process changes from normal to novel and possibly anomalous conditions has relevant applications in many real scenarios, such as health monitoring and quality inspection of industrial processes. A general approach often adopted in the literature is to learn a model to describe normal data and detect as anomalous those data that do not conform to the learned model. However, several challenges have to be addressed to make this approach effective in real-world scenarios, where acquired data are often characterized by high dimension and feature complex structures (such as signals and images). We address this problem from two perspectives corresponding to different modeling assumptions on the data-generating process. First, we model data as realizations of random vectors, as is customary in the statistical literature. In this setting we focus on the change-detection problem, where the goal is to detect whether the datastream permanently departs from normal conditions. We theoretically prove the intrinsic difficulty of this problem when the data dimension increases and propose a novel non-parametric and multivariate change-detection algorithm. In the second part, we focus on data having complex structure and we adopt dictionaries yielding sparse representations to model normal data. We propose novel algorithms to detect anomalies in such datastreams and to adapt the learned model when the process generating normal data changes.
Finding and tracking multi-density clusters in an online dynamic data stream
Change is one of the biggest challenges in dynamic stream mining. From a data-mining perspective, adapting to and tracking change is desirable in order to understand how and why change has occurred. Clustering, a form of unsupervised learning, can be used to identify the underlying patterns in a stream. Density-based clustering identifies clusters as areas of high density separated by areas of low density. This paper proposes a Multi-Density Stream Clustering (MDSC) algorithm to address two problems: the multi-density problem and the problem of discovering and tracking changes in a dynamic stream. MDSC consists of two on-line components: discovered, labelled clusters and an outlier buffer. Incoming points are assigned to a live cluster or passed to the outlier buffer. New clusters are discovered in the buffer using an ant-inspired swarm intelligence approach. The newly discovered cluster is uniquely labelled and added to the set of live clusters. Processed data is subject to an ageing function and will disappear when it is no longer relevant. MDSC is shown to compare favourably with state-of-the-art peer stream-clustering algorithms on a range of real and synthetic data-streams. Experimental results suggest that MDSC can discover qualitatively useful patterns while being scalable and robust to noise.
A survey on feature drift adaptation: Definition, benchmark, challenges and future directions
Data stream mining is a fast-growing research topic due to the ubiquity of data in several real-world problems. Given their ephemeral nature, data stream sources are expected to undergo changes in data distribution, a phenomenon called concept drift. This paper focuses on one specific type of drift that has not yet been thoroughly studied, namely feature drift. Feature drift occurs whenever a subset of features becomes, or ceases to be, relevant to the learning task; thus, learners must detect and adapt to these changes accordingly. We survey existing work on feature drift adaptation with both explicit and implicit approaches. Additionally, we benchmark several algorithms and a naive feature drift detection approach using synthetic and real-world datasets. The results from our experiments indicate the need for future research in this area, as even naive approaches produced gains in accuracy while reducing resource usage. Finally, we state current research topics, challenges and future directions for feature drift adaptation.
Optimizing Data Stream Representation: An Extensive Survey on Stream Clustering Algorithms
Analyzing data streams has received considerable attention over the past decades due to the widespread usage of sensors, social media and other streaming data sources. A core research area in this field is stream clustering, which aims to recognize patterns in an unordered, infinite and evolving stream of observations. Clustering can be a crucial support in decision making, since it aims for an optimized aggregated representation of a continuous data stream over time and allows patterns to be identified in large and high-dimensional data. A multitude of algorithms and approaches has been developed that are able to find and maintain clusters over time in the challenging streaming scenario. This survey explores, summarizes and categorizes a total of 51 stream clustering algorithms and identifies core research threads over the past decades. In particular, it identifies categories of algorithms based on distance thresholds, density grids and statistical models, as well as algorithms for high-dimensional data. Furthermore, it discusses application scenarios, available software and how to configure stream clustering algorithms. This survey is considerably more extensive than comparable studies, more up-to-date, and highlights how concepts are interrelated and have been developed over time.