Change detection in streaming data analytics: a comparison of Bayesian online and martingale approaches
Online change detection is a key activity in streaming analytics: given the sequence of data observed so far, it aims to determine whether the current observation in a time series marks a change point in some important characteristic of the data. It can be a challenging task when monitoring complex systems that generate streaming data of significant volume and velocity. While applicable to diverse problem domains, it is highly relevant to monitoring high-value and critical engineering assets. This paper presents an empirical evaluation of two algorithmic approaches to streaming-data change detection: a modified martingale algorithm and a Bayesian online detection algorithm. Results obtained on both synthetic and real-world data sets are presented, and the relevant advantages and limitations of each approach are discussed.
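The streaming setting this abstract describes can be made concrete with a classical CUSUM loop for an upward mean shift. This is a generic textbook baseline, not either of the two algorithms the paper evaluates, and the reference drift `k` and alarm threshold `h` are arbitrary illustrative values:

```python
def cusum_detector(stream, k=0.5, h=5.0):
    """Yield (index, statistic, alarm) for each observation.

    Classical one-sided CUSUM for an upward mean shift; `k` is the
    reference drift subtracted at each step, `h` the alarm threshold.
    """
    s = 0.0
    for i, x in enumerate(stream):
        s = max(0.0, s + x - k)  # accumulate evidence of an upward shift
        yield i, s, s > h        # a real detector would reset s after an alarm

# Deterministic toy stream: mean 0 for 50 steps, then mean 2.
data = [0.0] * 50 + [2.0] * 50
alarms = [i for i, s, fired in cusum_detector(data) if fired]
```

On this toy stream the statistic stays at zero before the change at index 50 and crosses the threshold a few steps afterwards.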
Testing randomness online
The hypothesis of randomness is fundamental in statistical machine learning
and in many areas of nonparametric statistics; it says that the observations
are independent and come from the same unknown probability
distribution. This hypothesis is close, in certain respects, to the hypothesis
of exchangeability, which postulates that the distribution of the observations
is invariant with respect to their permutations. This paper reviews known
methods of testing the two hypotheses, concentrating on the online mode of
testing, in which the observations arrive sequentially. All known online methods
for testing these hypotheses are based on conformal martingales, which are
defined and studied in detail. The paper emphasizes conceptual and practical
aspects and states two kinds of results. Validity results limit the probability
of a false alarm or the frequency of false alarms for various procedures based
on conformal martingales, including conformal versions of the CUSUM and
Shiryaev-Roberts procedures. Efficiency results establish connections between
randomness, exchangeability, and conformal martingales.
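A conformal test martingale of the kind reviewed here can be sketched in a few lines. The nonconformity score (the raw observation) and the single power betting function p -> epsilon * p**(epsilon - 1) are deliberate simplifications of what the paper studies; the validity side is Ville's inequality, which bounds the probability that the martingale ever reaches 1/alpha by alpha under the randomness hypothesis:

```python
import math
import random

def conformal_power_martingale(stream, epsilon=0.8, seed=0):
    """Yield the conformal power martingale M_n after each observation.

    Illustrative simplifications: the nonconformity score is the raw
    observation itself (large values look strange), and a single power
    betting function p -> epsilon * p**(epsilon - 1) is used instead of
    a mixture over epsilon.
    """
    rng = random.Random(seed)
    scores, log_m = [], 0.0  # track log M_n for numerical stability
    for x in stream:
        scores.append(x)
        n = len(scores)
        gt = sum(1 for a in scores if a > x)
        eq = sum(1 for a in scores if a == x)
        # Smoothed conformal p-value (uniform tie-breaking); floor guards log(0).
        p = max((gt + rng.random() * eq) / n, 1e-12)
        log_m += math.log(epsilon) + (epsilon - 1) * math.log(p)
        yield math.exp(log_m)

# I.i.d. for 50 steps, then a shift: the martingale grows only after the change.
values = list(conformal_power_martingale([0.0] * 50 + [1.0] * 50))
```

Before the change the p-values are approximately uniform and the martingale hovers near 1; after the change the p-values become small and the martingale grows, which is exactly the behaviour a CUSUM- or Shiryaev-Roberts-style stopping rule would act on.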
Inductive Conformal Martingales for Change-Point Detection
We consider the problem of quickest change-point detection in data streams.
Classical change-point detection procedures, such as CUSUM, Shiryaev-Roberts
and Posterior Probability statistics, are optimal only if the change-point
model is known, which is an unrealistic assumption in typical applied problems.
Instead we propose a new method for change-point detection based on Inductive
Conformal Martingales, which requires only the independence and identical
distribution of observations. We compare the proposed approach to standard
methods, as well as to change-point detection oracles, which model a typical
practical situation when we have only imprecise (albeit parametric) information
about pre- and post-change data distributions. The results of this comparison
provide evidence that change-point detection based on Inductive Conformal
Martingales is an efficient tool, capable of working under quite general
conditions, unlike traditional approaches.
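A minimal sketch of the inductive ingredient: the nonconformity score is computed against a fixed training summary, so nothing is retrained inside the streaming loop. The median-distance score, power betting function, and Ville-type alarm rule below are illustrative stand-ins, not the paper's actual choices:

```python
import math
import random

def icm_changepoint(train, stream, epsilon=0.8, alpha=0.01, seed=0):
    """Return the 1-based index of the first alarm, or None.

    The 'inductive' part: nonconformity is measured against a FIXED
    training summary (here, the training median), so the score function
    never changes during deployment.  Score, betting function, and the
    Ville-type alarm threshold 1/alpha are illustrative choices.
    """
    rng = random.Random(seed)
    center = sorted(train)[len(train) // 2]  # fixed training summary
    seen, log_m = [], 0.0
    for n, x in enumerate(stream, 1):
        a = abs(x - center)                  # nonconformity vs. training set
        seen.append(a)
        gt = sum(1 for b in seen if b > a)
        eq = sum(1 for b in seen if b == a)
        p = max((gt + rng.random() * eq) / n, 1e-12)  # smoothed p-value
        log_m += math.log(epsilon) + (epsilon - 1) * math.log(p)
        if log_m > math.log(1.0 / alpha):    # Ville: false-alarm prob <= alpha
            return n
    return None

# Change after 50 observations: the alarm follows with some detection delay.
detected = icm_changepoint([0.0] * 20, [0.0] * 50 + [5.0] * 50)
```

Because the martingale starts at 1 and is nonnegative under the i.i.d. hypothesis, stopping at 1/alpha caps the probability of a false alarm at alpha, mirroring the validity guarantees discussed in the abstract above.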
Online Distribution Shift Detection via Recency Prediction
When deploying modern machine learning-enabled robotic systems in high-stakes
applications, detecting distribution shift is critical. However, most existing
methods for detecting distribution shift are not well-suited to robotics
settings, where data often arrives in a streaming fashion and may be very
high-dimensional. In this work, we present an online method for detecting
distribution shift with guarantees on the false positive rate: when there is
no distribution shift, our system is very unlikely to falsely issue an alert;
any alerts that are issued should
therefore be heeded. Our method is specifically designed for efficient
detection even with high dimensional data, and it empirically achieves up to
11x faster detection in realistic robotics settings compared to prior work
while maintaining a low false negative rate in practice (whenever there is a
distribution shift in our experiments, our method indeed emits an alert). We
demonstrate our approach in both simulation and hardware for a visual servoing
task, and show that our method indeed issues an alert before a failure occurs.
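The abstract leaves the detector itself unspecified, but one generic way to read "recency prediction" is: hold out part of a reference buffer and part of a recent buffer, fit a classifier to predict which buffer a sample came from, and alarm when held-out accuracy beats chance. The one-dimensional threshold rule and crude binomial bound below are purely illustrative, not the paper's method or its guarantee:

```python
def recency_detector(reference, recent):
    """Flag a shift when a trivial threshold classifier can predict
    whether a held-out sample is 'recent' better than chance.

    Purely illustrative reading of 'recency prediction': the paper's
    actual detector and its false-positive guarantee are more involved.
    """
    # Split each buffer into interleaved train/test halves.
    ref_tr, ref_te = reference[::2], reference[1::2]
    rec_tr, rec_te = recent[::2], recent[1::2]
    ref_mean = sum(ref_tr) / len(ref_tr)
    rec_mean = sum(rec_tr) / len(rec_tr)
    thresh = (ref_mean + rec_mean) / 2       # 1-D threshold "classifier"
    above_is_recent = rec_mean >= ref_mean
    correct = sum((x > thresh) == above_is_recent for x in rec_te) \
            + sum((x > thresh) != above_is_recent for x in ref_te)
    n = len(ref_te) + len(rec_te)
    # Alarm only if held-out accuracy beats chance by ~2 binomial sds.
    return correct / n > 0.5 + 2 * (0.25 / n) ** 0.5
```

With no shift, held-out accuracy sits near 0.5 and no alert fires; under a clear shift the classifier separates the buffers and the detector alarms.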
Transcend: Detecting Concept Drift in Malware Classification Models
Building machine learning models of malware behavior is widely accepted as a panacea for effective malware classification. A crucial requirement for building sustainable learning models, though, is to train on a wide variety of malware samples. Unfortunately, malware evolves rapidly and it thus becomes hard, if not impossible, to generalize learning models to reflect future, previously unseen behaviors. Consequently, most malware classifiers become unsustainable in the long run, rapidly growing antiquated as malware continues to evolve. In this work, we propose Transcend, a framework to identify aging classification models in vivo during deployment, well before the machine learning model's performance starts to degrade. This is a significant departure from conventional approaches that retrain aging models retrospectively, once poor performance is observed. Our approach uses a statistical comparison of samples seen during deployment with those used to train the model, thereby building metrics for prediction quality. We show how Transcend can be used to identify concept drift in two separate case studies on Android and Windows malware, raising a red flag before the model starts making consistently poor decisions due to out-of-date training.
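The "statistical comparison of samples seen during deployment with those used to train the model" can be sketched with conformal credibility p-values. The nonconformity measure is left abstract here, and the decision rule is an illustrative stand-in, not Transcend's actual metrics or thresholds:

```python
def credibility(cal_scores, test_score):
    """Conformal credibility p-value: the (add-one smoothed) fraction of
    calibration nonconformity scores at least as large as the test
    sample's score.  High nonconformity -> low credibility.
    """
    n = len(cal_scores)
    return (sum(1 for s in cal_scores if s >= test_score) + 1) / (n + 1.0)

def drift_flag(cal_scores, deployment_scores, p_threshold=0.1, frac=0.5):
    """Illustrative decision rule: flag concept drift when more than
    `frac` of deployment samples have credibility below `p_threshold`.
    """
    ps = [credibility(cal_scores, s) for s in deployment_scores]
    return sum(p < p_threshold for p in ps) / len(ps) > frac
```

A deployment window whose samples look like the calibration data yields credibilities spread over (0, 1] and no flag; a window of consistently high nonconformity scores yields uniformly low credibilities and raises the flag before accuracy visibly degrades.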