1,117 research outputs found

    A new density estimation neural network to detect abnormal condition in streaming data

    With the development of monitoring technologies, large quantities of measured data pour into monitoring systems and form high-volume, open-ended data streams. An abnormal condition of the monitored system can usually be characterized by a variation in the density of the measured data stream. However, traditional density estimation methods cannot dynamically track the density variation of a data stream because of limits on processing time and memory. In this paper, we propose a new density estimation neural network that continuously estimates the density of streaming data in a time-based sliding window. The network has a feedforward structure composed of a discretization layer, an input layer and a summation layer. In the discretization layer, the value range of the data stream is discretized into network nodes at equal intervals. Measured data in the predefined time window are pushed into the input layer and updated as the window slides. In the summation layer, the activation results between input neurons and discretization neurons are summed and multiplied by a weight factor. The network outputs the kernel density estimates of the sliding segment of the data stream, yielding a one-pass estimation algorithm that consumes constant memory. Through subnet separation and local activation, the computational load of the network is significantly reduced so that it can keep pace with the data stream. Nonlinear statistics, namely quantiles and entropy, which can be computed continuously from the density estimates output by the network, serve as condition indicators to track the density variation of the data stream. The proposed method is evaluated on a simulated data stream consisting of two mixing distribution data sets and on a pressure data stream measured from a centrifugal compressor. Results show that the underlying anomalies are successfully detected.
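
The sliding-window estimation described in this abstract can be illustrated with a minimal sketch. It is not the authors' network: it assumes a Gaussian kernel, a count-based window instead of a time-based one, and updates every grid node rather than only nearby ones. The class name `SlidingWindowKDE` and all parameters are hypothetical.

```python
from collections import deque
import math


class SlidingWindowKDE:
    """Gaussian KDE over the last `window` samples, evaluated on a fixed grid.

    Illustrative sketch only: kernel choice, count-based window and
    parameter names are assumptions, not taken from the paper.
    """

    def __init__(self, lo, hi, n_nodes, window, bandwidth):
        self.step = (hi - lo) / (n_nodes - 1)
        # Equally spaced nodes covering the value range (the "discretization").
        self.grid = [lo + i * self.step for i in range(n_nodes)]
        self.window = window
        self.h = bandwidth
        self.samples = deque()        # samples currently inside the window
        self.sums = [0.0] * n_nodes   # running kernel sum at each node

    def _kernel(self, x):
        norm = 1.0 / (self.h * math.sqrt(2.0 * math.pi))
        return [norm * math.exp(-0.5 * ((g - x) / self.h) ** 2) for g in self.grid]

    def update(self, x):
        # Add the contribution of the incoming sample.
        for i, k in enumerate(self._kernel(x)):
            self.sums[i] += k
        self.samples.append(x)
        # Subtract the contribution of the sample that leaves the window.
        if len(self.samples) > self.window:
            old = self.samples.popleft()
            for i, k in enumerate(self._kernel(old)):
                self.sums[i] -= k

    def density(self):
        n = len(self.samples)
        if n == 0:
            return [0.0] * len(self.grid)
        # Clip tiny negative values caused by floating-point cancellation.
        return [max(s, 0.0) / n for s in self.sums]

    def entropy(self):
        # Riemann-sum approximation of the differential entropy.
        return -sum(p * math.log(p) * self.step for p in self.density() if p > 0.0)
```

Each call to `update` touches a fixed number of grid nodes and the buffer never exceeds `window` samples, so memory stays constant as the stream grows. Repeated add and subtract steps accumulate rounding error over very long streams, so a practical version would periodically recompute the sums from the buffered samples.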

    Stochastic Gradient Hamiltonian Monte Carlo

    Hamiltonian Monte Carlo (HMC) sampling methods provide a mechanism for defining distant proposals with high acceptance probabilities in a Metropolis-Hastings framework, enabling more efficient exploration of the state space than standard random-walk proposals. The popularity of such methods has grown significantly in recent years. However, a limitation of HMC methods is the required gradient computation for simulation of the Hamiltonian dynamical system; such computation is infeasible in problems involving a large sample size or streaming data. Instead, we must rely on a noisy gradient estimate computed from a subset of the data. In this paper, we explore the properties of such a stochastic gradient HMC approach. Surprisingly, the natural implementation of the stochastic approximation can be arbitrarily bad. To address this problem, we introduce a variant that uses second-order Langevin dynamics with a friction term that counteracts the effects of the noisy gradient, maintaining the desired target distribution as the invariant distribution. Results on simulated data validate our theory. We also provide an application of our methods to a classification task using neural networks and to online Bayesian matrix factorization. Comment: ICML 2014 version
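
The friction idea in this abstract can be sketched as a discretized second-order Langevin update on a toy problem. Everything specific here is an assumption for illustration and is not stated in the abstract: the one-dimensional standard normal target, the unit mass, the step size `eps`, the friction constant, the known gradient-noise level, and the function names `sghmc_samples` and `noisy_grad`.

```python
import math
import random


def sghmc_samples(grad_noisy, n_steps, eps=0.05, friction=1.0, noise_est=0.0, seed=0):
    """Sample with a friction-corrected stochastic-gradient HMC update (unit mass).

    grad_noisy(theta, rng) returns a noisy gradient of the negative log density.
    noise_est is the assumed contribution of gradient noise (0.5 * eps * variance);
    it must not exceed `friction`.
    """
    rng = random.Random(seed)
    theta, r = 0.0, 0.0
    # Injected noise is reduced by the estimated gradient-noise contribution.
    inject_std = math.sqrt(2.0 * (friction - noise_est) * eps)
    out = []
    for _ in range(n_steps):
        theta += eps * r                      # position step with the old momentum
        r += (-eps * grad_noisy(theta, rng)   # noisy gradient step
              - eps * friction * r            # friction term damping the momentum
              + rng.gauss(0.0, inject_std))   # injected Gaussian noise
        out.append(theta)
    return out


def noisy_grad(theta, rng):
    # Toy target: standard normal, so the exact gradient of U is theta.
    # Unit-variance Gaussian noise stands in for minibatch noise (assumption).
    return theta + rng.gauss(0.0, 1.0)


eps = 0.05
samples = sghmc_samples(noisy_grad, n_steps=200_000, eps=eps,
                        friction=1.0, noise_est=0.5 * eps * 1.0)
kept = samples[20_000:]                       # discard burn-in
mean = sum(kept) / len(kept)
var = sum((s - mean) ** 2 for s in kept) / len(kept)
print(mean, var)                              # should be roughly 0 and 1
```

In this sketch the friction term removes the energy that the gradient noise adds, so the samples stay close to the target distribution without a Metropolis-Hastings correction. In a real model the gradient-noise level is unknown and has to be estimated or set to zero.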

    Incremental Decision Tree based on order statistics

    New application domains generate data that are no longer persistent but volatile: network management, web profile modeling... These data arrive quickly and in large volumes, and are visible only once. They therefore have to be learned in their order of arrival. For classification problems, online decision trees are known to perform well and are widely used on streaming data. In this paper, we propose a new decision tree method based on order statistics. The construction of an online tree usually requires summaries in the leaves. Our solution uses bounded-error quantile summaries. A robust and efficient discretization or grouping method uses these summaries to provide, at the same time, a criterion for finding the best split and better density estimates. These estimates are then used to build a naïve Bayes classifier in the leaves to improve prediction in the early learning stage.

    Monitoring data streams

    Stream monitoring is concerned with analyzing data that is represented in the form of infinite streams. This field has gained prominence in recent years, as streaming data is generated in increasing volume and dimension in a variety of areas. It finds application in monitoring industrial sensors, "smart" technology such as smart houses and smart cars, wearable devices used for medical and physiological monitoring, and also in environmental surveillance and finance. However, stream monitoring is a challenging task because of the diverse and changing nature of streaming data and its high volume and dimensionality, with thousands of sensors producing streams of millions of measurements over short time spans. Automated, scalable and efficient analysis of these streams can help to keep track of important events, highlight relevant aspects and provide better insights into the monitored system. In this thesis, we propose techniques adapted to these tasks in supervised and unsupervised settings, in particular Stream Classification and Stream Dependency Monitoring. After a motivating introduction, we introduce concepts related to streaming data and discuss technological frameworks that have emerged to deal with streaming data in the second chapter of this thesis. We introduce the notion of information-theoretic entropy as a useful basis for data monitoring in the third chapter. In the second part of the thesis, we present Probabilistic Hoeffding Trees, a novel approach to stream classification. We show how probabilistic learning greatly improves the flexibility of decision trees and their ability to adapt to changes in data streams. The general technique is applicable to a variety of classification models and is fast to compute, without significantly greater memory cost than regular Hoeffding Trees. We show that our technique achieves results better than or on par with current state-of-the-art tree classification models on a variety of large synthetic and real-life data sets. In the third part of the thesis, we concentrate on unsupervised monitoring of data streams. We use mutual information as an entropic measure to identify the most important relationships in a monitored system. By using the powerful concept of mutual information we can, first, capture relevant aspects in a great variety of data sources with different underlying concepts and possible relationships and, second, analyze theoretical and computational complexity. We present the MID and DIMID algorithms. They perform extremely efficiently on high-dimensional data streams and provide accurate results, outperforming state-of-the-art algorithms for dependency monitoring. In the fourth part of this thesis, we introduce delayed relationships as a further feature in the dependency analysis. In reality, a phenomenon monitored by one sensor might depend on another, but the measurable effects can be delayed. This delay might have technical causes, such as different stream processing speeds, or the effects might genuinely appear delayed over time. We present Loglag, the first algorithm that monitors dependency with respect to an optimal delay. It uses several approximation techniques to achieve competitive resource requirements. We demonstrate its scalability and accuracy on real-world data, and also give theoretical guarantees on its accuracy.

    Proceedings of the 2nd Computer Science Student Workshop: Microsoft Istanbul, Turkey, April 9, 2011
