Universal outlier hypothesis testing with applications to anomaly detection
Outlier hypothesis testing is studied in a universal setting. Multiple sequences of observations are collected, a small subset (possibly empty) of which are outliers. A sequence is considered an outlier if the observations in that sequence are distributed according to an “outlier” distribution, distinct from the “typical” distribution governing the observations in the majority of the sequences. The outlier and typical distributions are not fully known, and they can be arbitrarily close. The goal is to design a universal test to best discern the outlier sequence(s). Both fixed sample size and sequential settings are considered in this dissertation. In the fixed sample size setting, for models with exactly one outlier, the generalized likelihood test is shown to be universally exponentially consistent. A single letter characterization of the error exponent achieved by such a test is derived, and it is shown that the test achieves the optimal error exponent asymptotically as the number of sequences goes to infinity. When the null hypothesis with no outlier is included, a modification of the generalized likelihood test is shown to achieve the same error exponent under each non-null hypothesis, and also consistency under the null hypothesis. Then, models with multiple outliers are considered. When the outliers can be distinctly distributed, in order to achieve exponential consistency, it is shown that it is essential that the number of outliers be known at the outset. For the setting with a known number of distinctly distributed outliers, the generalized likelihood test is shown to be universally exponentially consistent. The limiting error exponent achieved by such a test is characterized, and the test is shown to be asymptotically exponentially consistent. 
For the setting with an unknown number of identically distributed outliers, a modification of the generalized likelihood test is shown to achieve a positive error exponent under each non-null hypothesis, and consistency under the null hypothesis. In the sequential setting, a test with the flavor of the repeated significance test is proposed. The test is shown to be universally consistent, and universally exponentially consistent under non-null hypotheses. In addition, with the typical distribution being known, the test is shown to be asymptotically optimal universally when the number of outliers is the largest possible. In all cases, the asymptotic performance of the proposed test when none of the underlying distributions is known is shown to converge to that when only the typical distribution is known as the number of sequences goes to infinity. For models with continuous alphabets, a test with the same structure as the generalized likelihood test is proposed, and it is shown to be universally consistent. It is also demonstrated that there is a close connection between universal outlier hypothesis testing and cluster analysis. The performance of various proposed tests is evaluated against a synthetic data set, and contrasted with that of two popular clustering methods. Applied to a real data set for spam detection, the sequential test is shown to outperform the fixed sample size test when the lengths of the sequences exceed a certain value. In addition, the performance of the proposed tests is shown to be superior to that of another kernel-based test for large sample sizes.
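The fixed sample size test for exactly one outlier can be illustrated with a minimal sketch. It assumes a finite alphabet, equal-length sequences, at least three sequences, and one common form of the generalized likelihood statistic: for each candidate outlier, the typical distribution is estimated from the remaining sequences, and the candidate whose removal makes the rest look most alike is declared the outlier. The function names (`empirical_pmf`, `kl_divergence`, `detect_single_outlier`) are hypothetical, and the statistic may differ in detail from the one analyzed in the dissertation.

```python
import math
from collections import Counter


def empirical_pmf(seq, alphabet):
    """Empirical distribution of one sequence over a finite alphabet."""
    counts = Counter(seq)
    return [counts[a] / len(seq) for a in alphabet]


def kl_divergence(p, q):
    """KL divergence D(p || q); terms with p_i = 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def detect_single_outlier(sequences, alphabet):
    """Return the index of the sequence declared to be the outlier.

    Assumption: exactly one outlier and equal-length sequences, so the
    average of the other empirical pmfs is the pooled estimate of the
    typical distribution.
    """
    pmfs = [empirical_pmf(s, alphabet) for s in sequences]
    scores = []
    for i in range(len(pmfs)):
        others = pmfs[:i] + pmfs[i + 1:]
        # Estimate of the typical distribution when sequence i is excluded.
        typical_hat = [sum(col) / len(others) for col in zip(*others)]
        # How far the remaining sequences are from that common estimate.
        scores.append(sum(kl_divergence(p, typical_hat) for p in others))
    # The outlier is the sequence whose exclusion gives the best fit.
    return min(range(len(scores)), key=scores.__getitem__)
```

A call such as `detect_single_outlier([[0, 1] * 10, [0, 1] * 10, [0] * 20, [0, 1] * 10], [0, 1])` picks the third sequence, because excluding it leaves three sequences with identical empirical distributions.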
Universal Sequential Outlier Hypothesis Testing
Universal outlier hypothesis testing is studied in a sequential setting.
Multiple observation sequences are collected, a small subset of which are
outliers. A sequence is considered an outlier if the observations in that
sequence are generated by an "outlier" distribution, distinct from a common
"typical" distribution governing the majority of the sequences. Apart from
being distinct, the outlier and typical distributions can be arbitrarily close.
The goal is to design a universal test to best discern all the outlier
sequences. A universal test with the flavor of the repeated significance test
is proposed and its asymptotic performance is characterized under various
universal settings. The proposed test is shown to be universally consistent.
For the model with identical outliers, the test is shown to be asymptotically
optimal universally when the number of outliers is the largest possible and
with the typical distribution being known, and its asymptotic performance
otherwise is also characterized. An extension of the findings to the model with
multiple distinct outliers is also discussed. In all cases, it is shown that
the asymptotic performance guarantees for the proposed test when neither the
outlier nor typical distribution is known converge to those when the typical
distribution is known.
Comment: Proc. of the Asilomar Conference on Signals, Systems, and Computers,
2014. To appear
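The idea of a sequential test that keeps sampling until the evidence is strong enough can be sketched as follows. This is an illustrative stopping rule only, not the test proposed in the paper: it assumes a finite alphabet, exactly one outlier, at least three streams, and a fixed threshold chosen by the user. The names `gl_scores` and `sequential_outlier_test` and the gap-based stopping criterion are assumptions made for this example.

```python
import math
from collections import Counter


def gl_scores(sequences, alphabet):
    """Score each candidate outlier by how well the other sequences
    fit a common distribution once that candidate is excluded."""
    pmfs = []
    for seq in sequences:
        counts = Counter(seq)
        pmfs.append([counts[a] / len(seq) for a in alphabet])
    scores = []
    for i in range(len(pmfs)):
        others = pmfs[:i] + pmfs[i + 1:]
        typical_hat = [sum(col) / len(others) for col in zip(*others)]
        scores.append(sum(
            p * math.log(p / q)
            for pmf in others
            for p, q in zip(pmf, typical_hat)
            if p > 0
        ))
    return scores


def sequential_outlier_test(streams, alphabet, threshold):
    """Return (outlier_index, stopping_time), or (None, horizon) if the
    data run out before the stopping rule fires.

    Hypothetical rule: stop at the first n where n times the gap between
    the best and second-best candidate scores exceeds the threshold.
    """
    horizon = min(len(s) for s in streams)
    for n in range(1, horizon + 1):
        scores = gl_scores([s[:n] for s in streams], alphabet)
        order = sorted(range(len(scores)), key=scores.__getitem__)
        gap = scores[order[1]] - scores[order[0]]
        if n * gap > threshold:
            return order[0], n
    return None, horizon
```

A larger threshold trades a longer expected stopping time for a lower chance of stopping on the wrong sequence, which is the trade-off that the asymptotic analysis of a sequential test quantifies.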
Nonparametric Detection of Anomalous Data Streams
A nonparametric anomalous hypothesis testing problem is investigated, in
which there are n sequences in total, s of which are anomalous and must be detected.
Each typical sequence contains m independent and identically distributed
(i.i.d.) samples drawn from a distribution p, whereas each anomalous sequence
contains m i.i.d. samples drawn from a distribution q that is distinct from p.
The distributions p and q are assumed to be unknown in advance.
Distribution-free tests are constructed using maximum mean discrepancy as the
metric, which is based on mean embeddings of distributions into a reproducing
kernel Hilbert space. The probability of error is bounded as a function of the
sample size m, the number s of anomalous sequences and the number n of
sequences. It is then shown that with s known, the constructed test is
exponentially consistent if m is greater than a constant factor of log n, for
any p and q, whereas with s unknown, m should have an order strictly greater
than log n. Furthermore, it is shown that no test can be consistent for
arbitrary p and q if m is less than a constant factor of log n, thus the
order-level optimality of the proposed test is established. Numerical results
are provided to demonstrate that our tests outperform (or perform as well as)
the tests based on other competitive approaches under various cases.
Comment: Submitted to IEEE Transactions on Signal Processing, 201
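A maximum mean discrepancy (MMD) based detector of the kind described in this abstract can be sketched as follows. The sketch assumes scalar samples, a Gaussian kernel with a user-chosen bandwidth, a known number s of anomalous sequences, and the simple biased MMD estimate; the paper's exact statistic and kernel choice may differ. The names `gaussian_kernel`, `mmd2_biased` and `detect_anomalous` are hypothetical.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel on scalars."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))


def mmd2_biased(xs, ys, bandwidth=1.0):
    """Biased estimate of the squared MMD between two samples."""
    def mean_kernel(a, b):
        total = sum(gaussian_kernel(u, v, bandwidth) for u in a for v in b)
        return total / (len(a) * len(b))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


def detect_anomalous(sequences, s, bandwidth=1.0):
    """Return the sorted indices of the s sequences that differ most,
    in estimated MMD, from the pooled samples of all other sequences.

    Assumption: s is known in advance and is small relative to n.
    """
    scores = []
    for k, seq in enumerate(sequences):
        rest = [v for j, other in enumerate(sequences) if j != k for v in other]
        scores.append(mmd2_biased(seq, rest, bandwidth))
    ranked = sorted(range(len(sequences)), key=lambda k: scores[k], reverse=True)
    return sorted(ranked[:s])
```

Because the statistic depends on the samples only through kernel evaluations, no parametric form for p or q is needed, which is what makes the test distribution-free.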
Asymptotically Optimal Anomaly Detection via Sequential Testing
Sequential detection of independent anomalous processes among K processes is
considered. At each time, only M processes can be observed, and the
observations from each chosen process follow one of two distributions,
depending on whether the process is normal or abnormal. Each anomalous process
incurs a cost per unit time until its anomaly is identified and fixed.
Switching across processes and state declarations are allowed at all times,
while decisions are based on all past observations and actions. The objective
is a sequential search strategy that minimizes the total expected cost incurred
by all the processes during the detection process under reliability
constraints. Low-complexity algorithms are established to achieve
asymptotically optimal performance as the error constraints approach zero.
Simulation results demonstrate strong performance in the finite regime.
Comment: 28 pages, 5 figures, part of this work will be presented at the 52nd
Annual Allerton Conference on Communication, Control, and Computing, 201
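Deciding whether a single observed process is normal or abnormal under error constraints is commonly done with Wald's sequential probability ratio test (SPRT), which can serve as a building block for a search across processes. The sketch below is not the algorithm from the paper: it assumes Gaussian observations with known means `mu0` (normal) and `mu1` (abnormal), a common standard deviation, and Wald's approximate thresholds. The function name `sprt_gaussian` and all parameters are hypothetical.

```python
import math


def sprt_gaussian(observations, mu0, mu1, sigma=1.0, alpha=0.01, beta=0.01):
    """Wald's SPRT for one process.

    Returns (decision, samples_used), where decision is "normal",
    "abnormal", or "undecided" if the observations run out first.
    alpha: target probability of declaring abnormal when normal.
    beta:  target probability of declaring normal when abnormal.
    """
    upper = math.log((1.0 - beta) / alpha)   # cross above: declare abnormal
    lower = math.log(beta / (1.0 - alpha))   # cross below: declare normal
    llr = 0.0
    n = 0
    for n, x in enumerate(observations, start=1):
        # Log-likelihood ratio increment of N(mu1, sigma^2) vs N(mu0, sigma^2).
        llr += (x * (mu1 - mu0) - (mu1 ** 2 - mu0 ** 2) / 2.0) / sigma ** 2
        if llr >= upper:
            return "abnormal", n
        if llr <= lower:
            return "normal", n
    return "undecided", n
```

Tightening `alpha` and `beta` pushes both thresholds outward, so more samples are needed per process. This is the regime in which the abstract's asymptotic optimality (error constraints approaching zero) is stated.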