Setting the threshold for high throughput detectors: A mathematical approach for ensembles of dynamic, heterogeneous, probabilistic anomaly detectors
Anomaly detection (AD) has garnered ample attention in security research, as
such algorithms complement existing signature-based methods but promise
detection of never-before-seen attacks. Cyber operations manage a high volume
of heterogeneous log data; hence, AD in such operations involves multiple
(e.g., per IP, per data type) ensembles of detectors modeling heterogeneous
characteristics (e.g., rate, size, type) often with adaptive online models
producing alerts in near real time. Because of high data volume, setting the
threshold for each detector in such a system is an essential yet underdeveloped
configuration issue that, if slightly mistuned, can leave the system useless,
either producing a myriad of alerts and flooding downstream systems, or giving
none. In this work, we build on the foundations of Ferragut et al. to provide a
set of rigorous results for understanding the relationship between threshold
values and alert quantities, and we propose an algorithm for setting the
threshold in practice. Specifically, we give an algorithm for setting the
threshold of multiple, heterogeneous, possibly dynamic detectors completely a
priori, in principle. Indeed, if the underlying distribution of the incoming
data is known (or closely estimated), the algorithm provides provably manageable
thresholds. If the distribution is unknown (e.g., has changed over time), our
analysis reveals how the model distribution differs from the actual
distribution, indicating that a period of model refitting is necessary. We provide
empirical experiments showing the efficacy of the capability by regulating the
alert rate of a system with 2,500 adaptive detectors scoring over 1.5M
events in 5 hours. Further, we demonstrate on the real network data and
detection framework of Harshaw et al. the alternative case, showing how the
inability to regulate alerts indicates the detection model is a bad fit to the
data.
Comment: 11 pages, 5 figures. Proceedings of IEEE Big Data Conference, 201
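The relationship between a threshold and the resulting alert quantity can be illustrated with a minimal sketch. It assumes, which the abstract does not state, that a detector scores each event by a p-value: the total model probability of all outcomes no more likely than the observed one. The Poisson model, the alert budget, and every function name below are hypothetical choices made for illustration and are not taken from the paper.

```python
import math


def poisson_pmf(k, lam):
    # Computed in log space so large k does not overflow.
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def p_value(k, lam, support=range(100)):
    # Assumed anomaly score: mass of all outcomes no more likely than k.
    # The small relative tolerance guards against floating-point ties.
    pk = poisson_pmf(k, lam)
    return sum(q for q in (poisson_pmf(j, lam) for j in support)
               if q <= pk * (1 + 1e-9))


def threshold_for_budget(alert_budget, expected_events):
    # Hypothetical rule: spend the alert budget evenly over expected events.
    return min(1.0, alert_budget / expected_events)


def model_alert_probability(alpha, lam, support=range(100)):
    # Probability, under the model itself, that an event has p-value <= alpha.
    return sum(poisson_pmf(k, lam) for k in support
               if p_value(k, lam, support) <= alpha)


lam = 4.0  # assumed event-count model for one detector
alpha = threshold_for_budget(alert_budget=10, expected_events=1000)
rate = model_alert_probability(alpha, lam)
print(f"threshold={alpha}, alert probability under the model={rate:.5f}")
```

Under these assumptions, if the data really follow the model, the per-event alert probability cannot exceed the threshold, so the expected number of alerts stays within the budget. An observed alert rate well above the threshold would then suggest that the model no longer fits the data.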
General Framework for Binary Classification on Top Samples
Many binary classification problems minimize misclassification above (or
below) a threshold. We show that instances of ranking problems, accuracy at the
top or hypothesis testing may be written in this form. We propose a general
framework to handle these classes of problems and show which methods
(both known and newly proposed) fall into this framework. We provide a
theoretical analysis of this framework and mention selected possible pitfalls
the methods may encounter. We suggest several numerical improvements including
the implicit derivative and stochastic gradient descent. We provide an
extensive numerical study. Based both on the theoretical properties and
numerical experiments, we conclude the paper by suggesting which method should
be used in which situation.
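As a small illustration of misclassification above a threshold, the sketch below sets the threshold at the k-th highest score and counts the negatives ranked at or above it. This is one simple reading of accuracy at the top, not the paper's formulation. The function name, the toy scores, and the labels are hypothetical.

```python
def false_positives_at_top(scores, labels, k):
    """Return the k-th highest score and the number of negatives (label 0)
    among the k highest-scoring samples."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    threshold = ranked[k - 1][0]
    false_positives = sum(1 for _, label in ranked[:k] if label == 0)
    return threshold, false_positives


# Toy data: higher score means "more likely positive".
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 0, 1, 1, 0, 0]
threshold, fp = false_positives_at_top(scores, labels, k=3)
print(threshold, fp)
```

Only errors among the top-ranked samples are counted here. The positive with score 0.4 falls below the threshold and does not affect this count.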