10 research outputs found
Nonparametric Detection of Anomalous Data Streams
A nonparametric anomalous hypothesis testing problem is investigated, in
which there are n sequences in total, of which s anomalous sequences are to be detected.
Each typical sequence contains m independent and identically distributed
(i.i.d.) samples drawn from a distribution p, whereas each anomalous sequence
contains m i.i.d. samples drawn from a distribution q that is distinct from p.
The distributions p and q are assumed to be unknown in advance.
Distribution-free tests are constructed using maximum mean discrepancy as the
metric, which is based on mean embeddings of distributions into a reproducing
kernel Hilbert space. The probability of error is bounded as a function of the
sample size m, the number s of anomalous sequences and the number n of
sequences. It is then shown that with s known, the constructed test is
exponentially consistent if m is greater than a constant factor of log n, for
any p and q, whereas with s unknown, m should have an order strictly greater
than log n. Furthermore, it is shown that no test can be consistent for
arbitrary p and q if m is less than a constant factor of log n, thus the
order-level optimality of the proposed test is established. Numerical results
are provided to demonstrate that our tests outperform (or perform as well as)
the tests based on other competitive approaches under various cases.
Comment: Submitted to IEEE Transactions on Signal Processing, 201
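The MMD-based idea in this abstract can be illustrated with a minimal Python sketch. It assumes s is known, scalar samples, and a Gaussian kernel with a fixed bandwidth; the function names (`gaussian_kernel`, `mmd_squared`, `detect_anomalous`) and the rule "score each sequence against the pooled remainder, keep the s largest" are illustrative assumptions and not necessarily the authors' exact test.

```python
import numpy as np

def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian kernel matrix between 1-D sample arrays x and y."""
    diff = x[:, None] - y[None, :]
    return np.exp(-diff**2 / (2.0 * bandwidth**2))

def mmd_squared(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between samples x and y."""
    kxx = gaussian_kernel(x, x, bandwidth).mean()
    kyy = gaussian_kernel(y, y, bandwidth).mean()
    kxy = gaussian_kernel(x, y, bandwidth).mean()
    return kxx + kyy - 2.0 * kxy

def detect_anomalous(sequences, s, bandwidth=1.0):
    """Return the indices of the s sequences farthest (in MMD) from the pooled rest.

    Assumption: s is known, and each sequence is a 1-D array of m samples.
    """
    n = len(sequences)
    scores = np.empty(n)
    for k in range(n):
        rest = np.concatenate([sequences[j] for j in range(n) if j != k])
        scores[k] = mmd_squared(sequences[k], rest, bandwidth)
    return sorted(np.argsort(scores)[-s:].tolist())
```

For instance, with n = 10 sequences of m = 50 samples each, two of which are drawn from a shifted Gaussian, `detect_anomalous(sequences, s=2)` would be expected to return the indices of the two shifted sequences. The bandwidth is a free parameter in this sketch and would need tuning in practice.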
Sketch-Based Streaming Anomaly Detection in Dynamic Graphs
Given a stream of graph edges from a dynamic graph, how can we assign anomaly
scores to edges and subgraphs in an online manner, for the purpose of detecting
unusual behavior, using constant time and memory? For example, in intrusion
detection, existing work seeks to detect either anomalous edges or anomalous
subgraphs, but not both. In this paper, we first extend the count-min sketch
data structure to a higher-order sketch. This higher-order sketch has the
useful property of preserving the dense subgraph structure (dense subgraphs in
the input turn into dense submatrices in the data structure). We then propose
four online algorithms that utilize this enhanced data structure, which (a)
detect both edge and graph anomalies; (b) process each edge and graph in
constant memory and constant update time per newly arriving edge; and (c)
outperform state-of-the-art baselines on four real-world datasets. Our method
is the first streaming approach that incorporates dense subgraph search to
detect graph anomalies in constant memory and time.
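A rough Python sketch of what a higher-order count-min sketch could look like follows. It assumes each hash layer keeps a small matrix, with the source node hashed to a row and the destination node hashed to a column, so that edges sharing endpoints land in shared rows and columns. The class name `HigherOrderSketch`, its parameters, and the salted use of Python's built-in `hash` are hypothetical choices for illustration, not the paper's implementation.

```python
import random

class HigherOrderSketch:
    """Count-min-style sketch with one small matrix per hash layer (illustrative)."""

    def __init__(self, depth=4, rows=32, cols=32, seed=0):
        rng = random.Random(seed)
        self.depth, self.rows, self.cols = depth, rows, cols
        # One (row_salt, col_salt) pair per layer gives independent-looking hashes.
        self.salts = [(rng.getrandbits(32), rng.getrandbits(32)) for _ in range(depth)]
        self.tables = [[[0] * cols for _ in range(rows)] for _ in range(depth)]

    def cell(self, layer, src, dst):
        """Map an edge to (row, col): the source picks the row, the destination the column."""
        row_salt, col_salt = self.salts[layer]
        return hash((row_salt, src)) % self.rows, hash((col_salt, dst)) % self.cols

    def update(self, src, dst, weight=1):
        """Add an arriving edge; the work is O(depth), independent of the graph size."""
        for layer in range(self.depth):
            r, c = self.cell(layer, src, dst)
            self.tables[layer][r][c] += weight

    def estimate(self, src, dst):
        """Count-min estimate of the edge count: the minimum over layers, never an underestimate."""
        return min(
            self.tables[layer][r][c]
            for layer in range(self.depth)
            for r, c in [self.cell(layer, src, dst)]
        )
```

Because all edges leaving the same source share a row within a layer, and all edges entering the same destination share a column, a dense subgraph in the input concentrates its counts in a submatrix of each layer. That is the property the abstract describes; how the four online algorithms search those submatrices is not shown here.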
Nonparametric Anomaly Detection and Secure Communication
Two major security challenges in information systems are detection of anomalous data patterns that reflect malicious intrusions into data storage systems and protection of data from malicious eavesdropping during data transmissions. The first problem typically involves design of statistical tests to identify data variations, and the second problem generally involves design of communication schemes to transmit data securely in the presence of malicious eavesdroppers. The main theme of this thesis is to exploit information theoretic and statistical tools to address the above two security issues in order to provide information theoretically provable security, i.e., anomaly detection with vanishing probability of error and guaranteed secure communication with vanishing leakage rate at eavesdroppers.
First, the anomaly detection problem is investigated, in which typical and anomalous patterns (i.e., distributions that generate data) are unknown a priori. Two types of problems are investigated. The first problem considers detection of the existence of anomalous geometric structures over networks, and the second problem considers the detection of a set of anomalous data streams out of a large number of data streams. In both problems, anomalous data are assumed to be generated by a distribution q, which is different from the distribution p generating typical samples. For both problems, kernel-based tests are proposed, which are based on maximum mean discrepancy (MMD) that measures the distance between mean embeddings of distributions into a reproducing kernel Hilbert space. These tests are nonparametric, exploit no information about p and q, and are universally applicable to arbitrary p and q. Furthermore, these tests are shown to be statistically consistent under certain conditions on the parameters of the problems. These conditions are further shown to be necessary or nearly necessary, which implies that the MMD-based tests are order-level optimal or nearly order-level optimal. Numerical results are provided to demonstrate the performance of the proposed tests.
The secure communication problem is then investigated, for which the focus is on degraded broadcast channels. In such channels, one transmitter sends messages to multiple receivers, the channel quality of which can be ordered. Two specific models are studied. In the first model, layered decoding and layered secrecy are required, i.e., each receiver decodes one more message than the receiver with one level worse channel quality, and this message should be kept secure from all receivers with worse channel qualities. In the second model, secrecy only outside a bounded range is required, i.e., each message is required to be kept secure from the receiver with two-level worse channel quality. Communication schemes for both models are designed and the corresponding achievable rate regions (i.e., inner bounds on the capacity region) are characterized. Furthermore, outer bounds on the capacity region are developed, which match the inner bounds, and hence the secrecy capacity regions are established for both models.
Anomaly Detection for a Large Number of Streams: A Permutation-Based Higher Criticism Approach
Anomaly detection when observing a large number of data streams is essential
in a variety of applications, ranging from epidemiological studies to
monitoring of complex systems. High-dimensional scenarios are usually tackled
with scan-statistics and related methods, requiring stringent modeling
assumptions for proper calibration. In this work we take a non-parametric
stance, and propose a permutation-based variant of the higher criticism
statistic not requiring knowledge of the null distribution. This results in an
exact test in finite samples which is asymptotically optimal in the wide class
of exponential models. We demonstrate the power loss in finite samples is
minimal with respect to the oracle test. Furthermore, since the proposed
statistic does not rely on asymptotic approximations it typically performs
better than popular variants of higher criticism that rely on such
approximations. We include recommendations such that the test can be readily
applied in practice, and demonstrate its applicability in monitoring the daily
number of COVID-19 cases in the Netherlands.
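To make the idea of a permutation-calibrated higher criticism test concrete, here is a hedged Python sketch. It uses the classical higher criticism statistic maximised over the smaller half of the sorted p-values, takes stream means as per-stream statistics, and obtains both the p-values and the final calibration by shuffling all observations across streams. These choices, and the names `higher_criticism` and `permutation_hc_test`, are assumptions made for illustration; the paper's exact statistic and calibration may differ, and reusing the same permutations for both steps is a simplification.

```python
import numpy as np

def higher_criticism(p_values):
    """Classical higher criticism statistic over the smaller half of sorted p-values."""
    p = np.sort(np.clip(np.asarray(p_values, dtype=float), 1e-12, 1 - 1e-12))
    n = p.size
    i = np.arange(1, n + 1)
    hc = np.sqrt(n) * (i / n - p) / np.sqrt(p * (1 - p))
    return hc[: max(1, n // 2)].max()

def permutation_hc_test(data, n_perm=200, seed=0):
    """Return (observed HC, permutation p-value) for data of shape (streams, time).

    Assumption: under the null all observations are exchangeable, so shuffling
    every entry of the matrix yields draws from the null without knowing it.
    """
    rng = np.random.default_rng(seed)
    n, t = data.shape
    perm_means = np.array(
        [rng.permutation(data.ravel()).reshape(n, t).mean(axis=1) for _ in range(n_perm)]
    )
    reference = np.sort(perm_means.ravel())  # pooled null draws of a stream mean

    def hc_of(means):
        # Upper-tail empirical p-value of each stream mean against the reference.
        exceed = reference.size - np.searchsorted(reference, means, side="left")
        return higher_criticism((1 + exceed) / (1 + reference.size))

    observed = hc_of(data.mean(axis=1))
    null_hc = np.array([hc_of(m) for m in perm_means])
    p_value = (1 + np.sum(null_hc >= observed)) / (1 + n_perm)
    return observed, p_value
```

A small p-value from `permutation_hc_test` suggests that some streams have elevated means relative to the rest. The sketch only tests for the presence of anomalous streams and does not identify which ones they are.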