Finding Likely Errors with Bayesian Specifications
We present a Bayesian framework for learning probabilistic specifications
from large, unstructured code corpora, and a method to use this framework to
statically detect anomalous, hence likely buggy, program behavior. The
distinctive insight here is to build a statistical model that correlates all
specifications hidden inside a corpus with the syntax and observed behavior of
programs that implement these specifications. During the analysis of a
particular program, this model is conditioned into a posterior distribution
that prioritizes specifications that are relevant to this program. This allows
accurate program analysis even if the corpus is highly heterogeneous. The
problem of finding anomalies is now framed quantitatively, as a problem of
computing a distance between a "reference distribution" over program behaviors
that our model expects from the program, and the distribution over behaviors
that the program actually produces.
We present a concrete embodiment of our framework that combines a topic model
and a neural network model to learn specifications, and queries the learned
models to compute anomaly scores. We evaluate this implementation on the task
of detecting anomalous usage of Android APIs. Our encouraging experimental
results show that the method can automatically discover subtle errors in
Android applications in the wild, and has high precision and recall compared to
competing probabilistic approaches.
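The core quantitative framing above, anomaly as a distance between a reference distribution over behaviors and the observed distribution, can be sketched with a KL-divergence score. This is a minimal illustration, not the paper's actual model: the behavior categories and distributions below are hypothetical stand-ins for what the learned posterior would produce.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two categorical distributions given as dicts."""
    keys = set(p) | set(q)
    return sum(p.get(k, 0.0) * math.log((p.get(k, 0.0) + eps) / (q.get(k, 0.0) + eps))
               for k in keys if p.get(k, 0.0) > 0.0)

def anomaly_score(reference, observed):
    # Larger divergence => observed behavior is farther from what the
    # (hypothetical) learned specification expects for this program.
    return kl_divergence(observed, reference)

# Hypothetical distributions over three API-usage behaviors.
reference = {"open": 0.50, "read": 0.45, "close_without_read": 0.05}
typical   = {"open": 0.50, "read": 0.48, "close_without_read": 0.02}
buggy     = {"open": 0.50, "read": 0.05, "close_without_read": 0.45}

# The buggy program's behavior distribution scores far higher.
print(anomaly_score(reference, typical), anomaly_score(reference, buggy))
```

The actual framework conditions a corpus-wide statistical model into a program-specific posterior before computing such a distance; the score here only mirrors the final comparison step.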
ATD: Anomalous Topic Discovery in High Dimensional Discrete Data
We propose an algorithm for detecting patterns exhibited by anomalous
clusters in high dimensional discrete data. Unlike most anomaly detection (AD)
methods, which detect individual anomalies, our proposed method detects groups
(clusters) of anomalies; i.e. sets of points which collectively exhibit
abnormal patterns. In many applications this can lead to better understanding
of the nature of the atypical behavior and to identifying the sources of the
anomalies. Moreover, we consider the case where the atypical patterns exhibit
on only a small (salient) subset of the very high dimensional feature space.
Individual AD techniques and techniques that detect anomalies using all the
features typically fail to detect such anomalies, but our method can detect
such instances collectively, discover the shared anomalous patterns exhibited
by them, and identify the subsets of salient features. In this paper, we focus
on detecting anomalous topics in a batch of text documents, developing our
algorithm based on topic models. Results of our experiments show that our
method can accurately detect anomalous topics and salient features (words)
under each such topic in a synthetic data set and two real-world text corpora
and achieves better performance compared to both standard group AD and
individual AD techniques. All required code to reproduce our experiments is
available from https://github.com/hsoleimani/AT
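The salient-feature idea, keeping only the words on which a candidate cluster deviates from the corpus, can be sketched with a crude frequency-ratio rule. This is an illustrative stand-in for ATD's model-based selection; the word counts and the ratio threshold are invented for the example.

```python
def salient_words(cluster_counts, background_counts, ratio=5.0):
    """Return words that are far more frequent in a candidate anomalous
    cluster than in the background corpus. A crude stand-in for ATD's
    salient-feature selection (the real method is topic-model based)."""
    ctot = sum(cluster_counts.values()) or 1
    btot = sum(background_counts.values()) or 1
    # Add-one smoothing on the background so unseen words are not divided by zero.
    return sorted(w for w, c in cluster_counts.items()
                  if (c / ctot) > ratio * (background_counts.get(w, 0) + 1) / btot)

# Hypothetical counts: a small cluster of documents sharing an unusual vocabulary.
background = {"the": 500, "network": 40, "data": 60, "invoice": 2}
cluster = {"invoice": 30, "transfer": 25, "the": 50}
result = salient_words(cluster, background)  # the cluster's atypical words
```

The point of the sketch is only the contrast: common words like "the" are filtered out even though they dominate the cluster, while the shared atypical pattern survives.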
Deep Learning for Unsupervised Insider Threat Detection in Structured Cybersecurity Data Streams
Analysis of an organization's computer network activity is a key component of
early detection and mitigation of insider threat, a growing concern for many
organizations. Raw system logs are a prototypical example of streaming data
that can quickly scale beyond the cognitive power of a human analyst. As a
prospective filter for the human analyst, we present an online unsupervised
deep learning approach to detect anomalous network activity from system logs in
real time. Our models decompose anomaly scores into the contributions of
individual user behavior features for increased interpretability to aid
analysts reviewing potential cases of insider threat. Using the CERT Insider
Threat Dataset v6.2 and threat detection recall as our performance metric, our
novel deep and recurrent neural network models outperform Principal Component
Analysis, Support Vector Machine and Isolation Forest based anomaly detection
baselines. For our best model, the events labeled as insider threat activity in
our dataset had an average anomaly score in the 95.53rd percentile, demonstrating
our approach's potential to greatly reduce analyst workloads.
Comment: Proceedings of AI for Cyber Security Workshop at AAAI 201
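The interpretability idea above, decomposing an anomaly score into per-feature contributions so an analyst can see *why* a user-day was flagged, can be sketched with a simple standardized-deviation score. This is not the paper's deep model: the features and the squared-z-score decomposition are illustrative assumptions.

```python
import statistics

def fit_baseline(rows):
    """Per-feature (mean, std) baseline from historical user-behavior counts."""
    feats = rows[0].keys()
    return {f: (statistics.mean([r[f] for r in rows]),
                statistics.pstdev([r[f] for r in rows]) or 1.0) for f in feats}

def decomposed_score(baseline, event):
    """Total anomaly score plus per-feature contributions (squared z-scores),
    so the dominant feature points the analyst at the likely cause."""
    contrib = {f: ((event[f] - mu) / sd) ** 2 for f, (mu, sd) in baseline.items()}
    return sum(contrib.values()), contrib

# Hypothetical daily activity counts for one user.
history = [{"logons": 2, "usb": 0, "after_hours": 0},
           {"logons": 3, "usb": 0, "after_hours": 1},
           {"logons": 2, "usb": 1, "after_hours": 0}]
total, contrib = decomposed_score(fit_baseline(history),
                                  {"logons": 3, "usb": 9, "after_hours": 6})
# Here the 'usb' contribution dominates the total score.
```

The paper's models produce the decomposition from learned network outputs rather than z-scores, but the analyst-facing output has this same shape: one total plus a ranked breakdown.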
Setting the threshold for high throughput detectors: A mathematical approach for ensembles of dynamic, heterogeneous, probabilistic anomaly detectors
Anomaly detection (AD) has garnered ample attention in security research, as
such algorithms complement existing signature-based methods but promise
detection of never-before-seen attacks. Cyber operations manage a high volume
of heterogeneous log data; hence, AD in such operations involves multiple
(e.g., per IP, per data type) ensembles of detectors modeling heterogeneous
characteristics (e.g., rate, size, type) often with adaptive online models
producing alerts in near real time. Because of high data volume, setting the
threshold for each detector in such a system is an essential yet underdeveloped
configuration issue that, if slightly mistuned, can leave the system useless,
either producing a myriad of alerts and flooding downstream systems, or giving
none. In this work, we build on the foundations of Ferragut et al. to provide a
set of rigorous results for understanding the relationship between threshold
values and alert quantities, and we propose an algorithm for setting the
threshold in practice. Specifically, we give an algorithm for setting the
threshold of multiple, heterogeneous, possibly dynamic detectors completely a
priori, in principle. Indeed, if the underlying distribution of the incoming
data is known (closely estimated), the algorithm provides provably manageable
thresholds. If the distribution is unknown (e.g., has changed over time) our
analysis reveals how the model distribution differs from the actual
distribution, indicating a period of model refitting is necessary. We provide
empirical experiments showing the efficacy of the capability by regulating the
alert rate of a system with 2,500 adaptive detectors scoring over 1.5M
events in 5 hours. Further, we demonstrate on the real network data and
detection framework of Harshaw et al. the alternative case, showing how the
inability to regulate alerts indicates the detection model is a bad fit to the
data.
Comment: 11 pages, 5 figures. Proceedings of IEEE Big Data Conference, 201
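The central mechanism, choosing a threshold a priori from a known (or closely estimated) score distribution so that the expected alert rate is controlled, can be sketched as a quantile lookup over the score distribution. This is a simplified illustration of the setting described above, not the paper's algorithm; the uniform score sample is invented.

```python
def threshold_for_rate(scores, target_rate):
    """Pick a threshold so roughly target_rate of scores would alert.
    When the score distribution is known, this fixes the expected alert
    volume in advance, which is the configuration problem in question."""
    ranked = sorted(scores)
    k = max(0, min(len(ranked) - 1, int((1.0 - target_rate) * len(ranked))))
    return ranked[k]

# Stand-in for a detector's known score distribution.
scores = [i / 1000.0 for i in range(1000)]
t = threshold_for_rate(scores, 0.01)   # aim for ~1% alert rate
alert_rate = sum(s > t for s in scores) / len(scores)
```

The paper's contribution is making this work rigorously across many heterogeneous, adaptive detectors at once; the sketch shows only the single-detector quantile relationship, and a large gap between the target and realized rate would be the "model no longer fits the data" signal the analysis describes.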
Collective Anomaly Detection based on Long Short Term Memory Recurrent Neural Network
Intrusion detection for computer network systems has become one of the most
critical tasks for network administrators today. It plays an important role for
organizations, governments, and society because of the valuable resources
hosted on computer networks. Traditional misuse detection strategies are unable
to detect new and unknown intrusions. Anomaly detection in network security, by
contrast, aims to distinguish illegal or malicious events from the normal
behavior of network systems. Anomaly detection can be considered a
classification problem: it builds models of normal network behavior and uses
them to detect new patterns that significantly deviate from the model. Most
current research on anomaly detection is based on learning normal and anomalous
behaviors; it does not take recent events into account when classifying a new
incoming one. In this paper, we propose a real-time collective anomaly
detection model based on neural network learning and feature operating.
Normally, a Long Short-Term Memory Recurrent Neural Network (LSTM RNN) is
trained only on normal data and is capable of predicting several time steps
ahead of an input. In our approach, an LSTM RNN is trained on normal time
series data before performing a live prediction for each time step. Instead of
considering each time step separately, we propose observing the prediction
errors over a certain number of time steps as a new way to detect collective
anomalies: prediction errors from a number of the latest time steps above a
threshold indicate a collective anomaly. The model is evaluated on a time
series version of the KDD 1999 dataset. The experiments demonstrate that it
offers reliable and efficient collective anomaly detection.
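The windowed-error rule described above can be sketched independently of the LSTM itself: any one-step-ahead predictor produces an error stream, and a collective anomaly is flagged when the recent errors jointly exceed a threshold. The naive last-value predictor, the averaging rule, and the parameters below are illustrative assumptions standing in for the trained LSTM RNN.

```python
from collections import deque

def collective_anomalies(series, predict, threshold, window):
    """Flag time steps where the average prediction error over the last
    `window` steps exceeds `threshold`. `predict` stands in for a trained
    model's one-step-ahead forecast on the history seen so far."""
    errors, flags = deque(maxlen=window), []
    for t in range(1, len(series)):
        errors.append(abs(series[t] - predict(series[:t])))
        flags.append(len(errors) == window and sum(errors) / window > threshold)
    return flags

naive = lambda history: history[-1]            # stub predictor: repeat last value
series = [1.0] * 20 + [9.0] * 5 + [1.0] * 5   # a burst of anomalous values
flags = collective_anomalies(series, naive, threshold=1.0, window=3)
# Only the steps around the burst are flagged, not isolated single-step blips.
```

A single spike produces one large error that is diluted by the window, while a sustained deviation keeps the windowed average high: that is the "collective" part of the rule.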
Towards Malware Detection via CPU Power Consumption: Data Collection Design and Analytics (Extended Version)
This paper presents an experimental design and data analytics approach aimed
at power-based malware detection on general-purpose computers. Leveraging the
fact that malware executions must consume power, we explore the postulate that
malware can be accurately detected via power data analytics. Our experimental
design and implementation allow for programmatic collection of CPU power
profiles for fixed tasks during uninfected and infected states using five
different rootkits. To characterize the power consumption profiles, we use both
simple statistical and novel, sophisticated features. We test a one-class
anomaly detection ensemble (that baselines non-infected power profiles) and
several kernel-based SVM classifiers (that train on both uninfected and
infected profiles) in detecting previously unseen malware and clean profiles.
The anomaly detection system exhibits perfect detection when using all features
and tasks, with smaller false detection rate than the supervised classifiers.
The primary contribution is the proof of concept that baselining power of fixed
tasks can provide accurate detection of rootkits. Moreover, our treatment
presents engineering hurdles needed for experimentation and allows analysis of
each statistical feature individually. This work appears to be the first step
towards a viable power-based detection capability for general-purpose
computers, and presents next steps toward this goal.
Comment: Published version appearing in IEEE TrustCom-18. This version contains more details on mathematics and data collection.
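The baselining idea, learning what the power profile of a fixed task looks like on a clean machine and flagging profiles that deviate from it, can be sketched as a one-class rule over aligned power samples. The per-sample mean and max-deviation limit below are illustrative simplifications; the paper's ensemble uses richer statistical features than raw samples.

```python
import statistics

def power_baseline(clean_profiles):
    """Baseline clean CPU-power profiles for one fixed task: a per-sample
    mean plus a max-deviation limit taken from the clean runs themselves
    (one-class: no infected data is needed to fit the detector)."""
    mean = [statistics.mean(col) for col in zip(*clean_profiles)]
    def dev(profile):
        return max(abs(p - m) for p, m in zip(profile, mean))
    limit = max(dev(p) for p in clean_profiles)
    return lambda profile: dev(profile) > limit   # True => flag as infected

# Hypothetical 3-sample power traces (watts) from clean runs of the same task.
clean = [[10, 12, 11], [11, 12, 10], [10, 11, 12]]
detector = power_baseline(clean)
```

Because the task is fixed, any extra computation, such as a rootkit's activity, shifts the profile away from the baseline envelope, which is the postulate the paper tests.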
Log-based Anomaly Detection of CPS Using a Statistical Method
Detecting anomalies of a cyber physical system (CPS), which is a complex
system consisting of both physical and software parts, is important because a
CPS often operates autonomously in an unpredictable environment. However,
because of the ever-changing nature and lack of a precise model for a CPS,
detecting anomalies is still a challenging task. To address this problem, we
propose applying an outlier detection method to a CPS log. By using a log
obtained from an actual aquarium management system, we evaluated the
effectiveness of our proposed method by analyzing outliers that it detected. By
investigating the outliers with the developer of the system, we confirmed that
some outliers indicate actual faults in the system. For example, our method
detected failures of mutual exclusion in the control system that were unknown
to the developer. Our method also detected transient losses of functionalities
and unexpected reboots. On the other hand, our method did not detect anomalies
that occurred too frequently and resembled each other, since such events no
longer stand out as outliers. In addition, our method reported rare but
unproblematic concurrent combinations of operations as anomalies. Thus, our
approach is effective at finding anomalies, but there is still room for
improvement.
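A minimal statistical rule in the spirit of the approach above is to flag log event types whose relative frequency is unusually low, since rare events are the outlier candidates worth showing a developer. The event names, counts, and rate cutoff are invented for illustration; the paper's actual scoring may differ.

```python
from collections import Counter

def rare_patterns(events, max_rate=0.01):
    """Flag log event types whose relative frequency is below max_rate.
    A simple frequency-based outlier rule over a CPS log; both false
    positives (rare-but-benign events) and misses (frequent, similar
    anomalies) are possible, matching the limitations noted above."""
    counts = Counter(events)
    total = len(events)
    return {e: c / total for e, c in counts.items() if c / total < max_rate}

# Hypothetical aquarium-controller log: routine events plus a few rare ones.
log = ["pump_on", "pump_off"] * 495 + ["reboot"] * 3 + ["mutex_violation"] * 2
outliers = rare_patterns(log, max_rate=0.01)
```

Note how the rule's failure modes mirror the abstract: an anomaly repeated many times would cross the rate cutoff and vanish, while a rare benign event is flagged anyway.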
A Machine-Synesthetic Approach To DDoS Network Attack Detection
In the authors' opinion, anomaly detection systems (ADS) are the most promising
direction in the field of attack detection, because such systems can detect,
among others, unknown (zero-day) attacks. To detect anomalies, the authors
propose to use machine synesthesia. Here, machine synesthesia is understood as
an interface that allows image classification algorithms to be applied to the
problem of detecting network anomalies, making it possible to use
non-specialized image recognition methods that have recently been widely and
actively developed. In the proposed approach, network traffic data is
"projected" into an image. The experimental results show that the proposed
anomaly detection method achieves strong attack detection performance: on a
large sample, the value of the complex efficiency indicator reaches 97%.
Comment: 12 pages, 2 figures, 5 tables. Accepted to the Intelligent Systems Conference (IntelliSys) 201
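The "projection" step, turning a batch of network flows into an image that an off-the-shelf image classifier can consume, can be sketched as bucketing flow fields into a pixel grid of intensities. The choice of fields, the modulo bucketing, and the grid size below are illustrative assumptions, not the paper's actual mapping.

```python
def traffic_to_image(flows, size=8):
    """'Project' network flows into a grayscale image: bucket
    (src_port, dst_port) pairs into a size x size grid and accumulate byte
    counts as pixel intensity. A simplified stand-in for the paper's
    projection; fields and bucketing are illustrative."""
    img = [[0] * size for _ in range(size)]
    for src_port, dst_port, nbytes in flows:
        img[src_port % size][dst_port % size] += nbytes
    peak = max(max(row) for row in img) or 1
    # Normalize to 0..255 so the grid can be fed to image-processing code.
    return [[255 * v // peak for v in row] for row in img]

# Hypothetical flows: (src_port, dst_port, bytes).
flows = [(51034, 80, 1500), (51035, 80, 1500), (40000, 53, 120)]
image = traffic_to_image(flows)
```

Once traffic is rendered this way, a DDoS burst changes the image's texture, and detecting it becomes an image classification problem, which is the interface the authors call machine synesthesia.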
Learning Execution Contexts from System Call Distributions for Intrusion Detection in Embedded Systems
Existing techniques used for intrusion detection do not fully utilize the
intrinsic properties of embedded systems. In this paper, we propose a
lightweight method for detecting anomalous executions using a distribution of
system call frequencies. We use a cluster analysis to learn the legitimate
execution contexts of embedded applications and then monitor them at run-time
to capture abnormal executions. We also present an architectural framework with
minor processor modifications to aid in this process. Our prototype shows that
the proposed method can effectively detect anomalous executions without relying
on sophisticated analyses or affecting the critical execution paths.
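The run-time check described above, comparing an observed system-call frequency distribution against learned execution contexts, can be sketched as a nearest-centroid test. The contexts, the Euclidean distance, and the radius are illustrative placeholders for whatever the cluster analysis actually learns.

```python
import math

def normalize(freqs):
    """Turn raw system-call counts into a frequency distribution."""
    total = sum(freqs.values()) or 1
    return {k: v / total for k, v in freqs.items()}

def distance(a, b):
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))

def is_anomalous(contexts, observed, radius):
    """Flag an execution if its syscall distribution is not within `radius`
    of any learned legitimate execution context (cluster centroid).
    Centroids and radius here are illustrative placeholders."""
    obs = normalize(observed)
    return all(distance(normalize(c), obs) > radius for c in contexts)

# Hypothetical learned contexts for an embedded application.
contexts = [{"read": 50, "write": 50}, {"read": 90, "ioctl": 10}]
normal_run = is_anomalous(contexts, {"read": 48, "write": 52}, radius=0.1)
odd_run = is_anomalous(contexts, {"exec": 80, "read": 20}, radius=0.1)
```

Only counting and a distance computation happen at monitoring time, which is consistent with the lightweight, low-overhead design the abstract emphasizes.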
Energy-based Models for Video Anomaly Detection
Automated detection of abnormalities in data has been widely studied in recent
years because of its diverse practical applications, including video
surveillance, industrial damage detection, and network intrusion detection.
However, building an effective anomaly detection system is a non-trivial task,
since it requires tackling challenging issues: the shortage of annotated data,
the inability to define anomalous objects explicitly, and the expensive cost of
feature engineering. Unlike existing approaches, which only partially solve
these problems, we develop a unique framework to cope with all of them
simultaneously. Instead of handling an ambiguous definition of anomalous
objects, we propose to work with regular patterns, for which unlabeled data is
abundant and usually easy to collect in practice. This allows our system to be
trained completely unsupervised and liberates us from the need for costly data
annotation. By learning a generative model that captures the normality
distribution in the data, we can isolate abnormal data points, which receive
low normality scores (high abnormality scores). Moreover, by leveraging the
power of generative networks, i.e., energy-based models, we are also able to
learn feature representations automatically rather than relying on the
hand-crafted features that have dominated anomaly detection research for
decades. We demonstrate our proposal on the specific application of video
anomaly detection, and the experimental results indicate that our method
outperforms baselines and is comparable with state-of-the-art methods on many
benchmark video anomaly detection datasets.
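The scoring principle above, fitting a model to normal data only and treating low normality (high energy) as anomalous, can be sketched with a toy one-dimensional Gaussian energy function. This stands in for the learned energy-based network; the frame features and energy form are invented for the example.

```python
import statistics

def energy_model(train):
    """Fit a toy Gaussian energy model on normal data:
    E(x) = (x - mu)^2 / (2 * sigma^2), normality score = -E(x).
    A one-dimensional stand-in for the paper's learned energy-based
    model over video frames; low scores mark candidate anomalies."""
    mu, sd = statistics.mean(train), statistics.pstdev(train) or 1.0
    return lambda x: -((x - mu) ** 2) / (2 * sd ** 2)

# Hypothetical scalar features extracted from normal video frames.
normal_frames = [0.9, 1.0, 1.1, 1.0, 0.95, 1.05]
score = energy_model(normal_frames)
# An input far from the normal pattern receives a much lower normality score.
```

Training needs only the unlabeled normal data, matching the fully unsupervised setting; the real framework additionally learns the feature representation itself instead of taking a hand-crafted scalar as input.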