
    Finding Likely Errors with Bayesian Specifications

    We present a Bayesian framework for learning probabilistic specifications from large, unstructured code corpora, and a method to use this framework to statically detect anomalous, hence likely buggy, program behavior. The distinctive insight here is to build a statistical model that correlates all specifications hidden inside a corpus with the syntax and observed behavior of programs that implement these specifications. During the analysis of a particular program, this model is conditioned into a posterior distribution that prioritizes specifications that are relevant to this program. This allows accurate program analysis even if the corpus is highly heterogeneous. The problem of finding anomalies is now framed quantitatively, as a problem of computing a distance between a "reference distribution" over program behaviors that our model expects from the program, and the distribution over behaviors that the program actually produces. We present a concrete embodiment of our framework that combines a topic model and a neural network model to learn specifications, and queries the learned models to compute anomaly scores. We evaluate this implementation on the task of detecting anomalous usage of Android APIs. Our encouraging experimental results show that the method can automatically discover subtle errors in Android applications in the wild, and has high precision and recall compared to competing probabilistic approaches.
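
    The quantitative framing described above reduces anomaly detection to a distance between the model's "reference distribution" over behaviors and the behavior distribution the program actually produces. The sketch below illustrates only that scoring step, using a KL divergence over a hypothetical behavior vocabulary; the distributions are toy stand-ins, not the paper's learned Bayesian model.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two categorical distributions over the same support."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Hypothetical behavior categories (e.g., abstracted API-usage patterns).
# reference: what the corpus-conditioned model expects from this program.
reference = [0.55, 0.30, 0.10, 0.05]
# observed: the behavior distribution the program actually produces.
observed = [0.20, 0.15, 0.05, 0.60]

anomaly_score = kl_divergence(observed, reference)
print(f"anomaly score (KL to reference): {anomaly_score:.3f}")
```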

    ATD: Anomalous Topic Discovery in High Dimensional Discrete Data

    We propose an algorithm for detecting patterns exhibited by anomalous clusters in high dimensional discrete data. Unlike most anomaly detection (AD) methods, which detect individual anomalies, our proposed method detects groups (clusters) of anomalies; i.e. sets of points which collectively exhibit abnormal patterns. In many applications this can lead to better understanding of the nature of the atypical behavior and to identifying the sources of the anomalies. Moreover, we consider the case where the atypical patterns are exhibited in only a small (salient) subset of the very high dimensional feature space. Individual AD techniques and techniques that detect anomalies using all the features typically fail to detect such anomalies, but our method can detect such instances collectively, discover the shared anomalous patterns exhibited by them, and identify the subsets of salient features. In this paper, we focus on detecting anomalous topics in a batch of text documents, developing our algorithm based on topic models. Results of our experiments show that our method can accurately detect anomalous topics and salient features (words) under each such topic in a synthetic data set and two real-world text corpora and achieves better performance compared to both standard group AD and individual AD techniques. All required code to reproduce our experiments is available from https://github.com/hsoleimani/AT
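
    As a loose illustration of the group-anomaly idea above, the sketch below fits a topic model on reference documents, flags documents it explains poorly, and lists the words concentrated in the flagged group as crude "salient features". The corpora, the use of scikit-learn's LDA, and the flagging rule are illustrative assumptions rather than the ATD algorithm itself.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpora: the reference batch defines "normal" topics; the new batch
# contains a small group of documents on an unrelated (anomalous) theme.
reference_docs = ["network traffic logs", "firewall traffic alerts",
                  "router logs and alerts", "traffic and firewall logs"] * 5
new_batch = ["network traffic logs", "quantum entanglement qubits",
             "qubit decoherence entanglement", "firewall alerts"]

vec = CountVectorizer().fit(reference_docs + new_batch)
X_ref, X_new = vec.transform(reference_docs), vec.transform(new_batch)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X_ref)

# Per-document log-likelihood under the reference topic model (lower = worse fit).
scores = np.array([lda.score(X_new[i]) for i in range(X_new.shape[0])])
flagged = scores < np.percentile(scores, 25)

# Crude "salient features": words concentrated in the flagged group.
counts = np.asarray(X_new[flagged].sum(axis=0)).ravel()
vocab = np.array(vec.get_feature_names_out())
print("flagged docs :", [d for d, f in zip(new_batch, flagged) if f])
print("salient words:", vocab[np.argsort(counts)[::-1][:3]].tolist())
```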

    Deep Learning for Unsupervised Insider Threat Detection in Structured Cybersecurity Data Streams

    Analysis of an organization's computer network activity is a key component of early detection and mitigation of insider threat, a growing concern for many organizations. Raw system logs are a prototypical example of streaming data that can quickly scale beyond the cognitive power of a human analyst. As a prospective filter for the human analyst, we present an online unsupervised deep learning approach to detect anomalous network activity from system logs in real time. Our models decompose anomaly scores into the contributions of individual user behavior features for increased interpretability to aid analysts reviewing potential cases of insider threat. Using the CERT Insider Threat Dataset v6.2 and threat detection recall as our performance metric, our novel deep and recurrent neural network models outperform Principal Component Analysis, Support Vector Machine and Isolation Forest based anomaly detection baselines. For our best model, the events labeled as insider threat activity in our dataset had an average anomaly score in the 95.53 percentile, demonstrating our approach's potential to greatly reduce analyst workloads. Comment: Proceedings of AI for Cyber Security Workshop at AAAI 201
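
    The interpretability idea above, decomposing an anomaly score into per-feature contributions, can be sketched with any model whose score is a sum of per-feature errors. Below, a PCA reconstruction stands in for the paper's deep and recurrent models, and the user-behavior feature names are illustrative only.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
features = ["logons", "emails_sent", "usb_events", "after_hours_logons"]

# "Normal" daily user-behavior counts, plus one day with unusual USB activity.
normal_days = rng.poisson(lam=[8, 25, 1, 0.5], size=(500, 4)).astype(float)
suspect_day = np.array([[9.0, 24.0, 40.0, 6.0]])

pca = PCA(n_components=2).fit(normal_days)

def per_feature_error(x):
    """Squared reconstruction error per feature; their sum is the anomaly score."""
    recon = pca.inverse_transform(pca.transform(x))
    return (x - recon) ** 2

errs = per_feature_error(suspect_day)[0]
print("total anomaly score:", errs.sum())
for name, e in sorted(zip(features, errs), key=lambda t: -t[1]):
    print(f"  {name:20s} contribution {e:8.2f}")
```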

    Setting the threshold for high throughput detectors: A mathematical approach for ensembles of dynamic, heterogeneous, probabilistic anomaly detectors

    Anomaly detection (AD) has garnered ample attention in security research, as such algorithms complement existing signature-based methods but promise detection of never-before-seen attacks. Cyber operations manage a high volume of heterogeneous log data; hence, AD in such operations involves multiple (e.g., per IP, per data type) ensembles of detectors modeling heterogeneous characteristics (e.g., rate, size, type) often with adaptive online models producing alerts in near real time. Because of high data volume, setting the threshold for each detector in such a system is an essential yet underdeveloped configuration issue that, if slightly mistuned, can leave the system useless, either producing a myriad of alerts and flooding downstream systems, or giving none. In this work, we build on the foundations of Ferragut et al. to provide a set of rigorous results for understanding the relationship between threshold values and alert quantities, and we propose an algorithm for setting the threshold in practice. Specifically, we give an algorithm for setting the threshold of multiple, heterogeneous, possibly dynamic detectors completely a priori, in principle. Indeed, if the underlying distribution of the incoming data is known (closely estimated), the algorithm provides provably manageable thresholds. If the distribution is unknown (e.g., has changed over time) our analysis reveals how the model distribution differs from the actual distribution, indicating a period of model refitting is necessary. We provide empirical experiments showing the efficacy of the capability by regulating the alert rate of a system with approximately 2,500 adaptive detectors scoring over 1.5M events in 5 hours. Further, we demonstrate on the real network data and detection framework of Harshaw et al. the alternative case, showing how the inability to regulate alerts indicates the detection model is a bad fit to the data. Comment: 11 pages, 5 figures. Proceedings of IEEE Big Data Conference, 201
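
    The core of the thresholding argument above is that, when a detector's score distribution is known or well estimated, the threshold meeting a target alert budget is simply a quantile of that distribution, and can be set a priori for many heterogeneous detectors at once. The sketch below illustrates only that quantile rule with synthetic score distributions; it is not the paper's algorithm, and drift monitoring is reduced here to comparing realized and budgeted alert rates.

```python
import numpy as np

rng = np.random.default_rng(1)
alert_budget = 0.001  # desired expected alert rate per detector (0.1%)

# Heterogeneous detectors, each with its own (estimated) score distribution,
# represented here by a sample of recent scores.
detector_scores = {
    "ip_rate":   rng.gamma(shape=2.0, scale=1.5, size=50_000),
    "pkt_size":  rng.lognormal(mean=3.0, sigma=0.7, size=50_000),
    "proto_mix": rng.exponential(scale=4.0, size=50_000),
}

# Threshold = (1 - budget) quantile of each detector's score distribution.
thresholds = {name: np.quantile(s, 1.0 - alert_budget)
              for name, s in detector_scores.items()}

# At run time: alert when a score exceeds its detector's threshold, and track
# the realized alert rate -- a large gap to the budget suggests model drift.
new_scores = {name: rng.permutation(s)[:10_000] for name, s in detector_scores.items()}
for name, s in new_scores.items():
    realized = float(np.mean(s > thresholds[name]))
    print(f"{name:9s} threshold={thresholds[name]:8.3f} "
          f"realized alert rate={realized:.4f} (budget {alert_budget})")
```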

    Collective Anomaly Detection based on Long Short Term Memory Recurrent Neural Network

    Intrusion detection for computer network systems has become one of the most critical tasks for network administrators today. It plays an important role for organizations, governments, and society because of the valuable resources hosted on computer networks. Traditional misuse detection strategies are unable to detect new and unknown intrusions. Anomaly detection in network security, by contrast, aims to distinguish illegal or malicious events from the normal behavior of network systems. Anomaly detection can be considered a classification problem: it builds models of normal network behavior and uses them to detect new patterns that significantly deviate from the model. Most current research on anomaly detection is based on learning normal and anomalous behaviors and does not take recent preceding events into account when classifying a new incoming one. In this paper, we propose a real-time collective anomaly detection model based on neural network learning and feature operating. Normally, a Long Short Term Memory Recurrent Neural Network (LSTM RNN) is trained only on normal data and is capable of predicting several time steps ahead of an input. In our approach, an LSTM RNN is trained with normal time series data before performing a live prediction for each time step. Instead of considering each time step separately, we propose observing the prediction errors from a certain number of time steps as a new way of detecting collective anomalies: prediction errors from a number of the latest time steps above a threshold indicate a collective anomaly. The model is built on a time series version of the KDD 1999 dataset. The experiments demonstrate that it can offer reliable and efficient collective anomaly detection.
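
    A minimal sketch of the collective-anomaly rule described above: train a predictor on normal data, then alert only when the prediction errors over a window of recent time steps are jointly high, rather than on single-step errors. A simple autoregressive linear predictor stands in for the LSTM RNN, and the series, window length, and threshold are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
lag, k, err_threshold = 5, 10, 0.5

# Normal training series (noisy sine); test series with an injected collective anomaly.
train = np.sin(np.linspace(0, 60, 1200)) + 0.05 * rng.normal(size=1200)
test = np.sin(np.linspace(0, 30, 600)) + 0.05 * rng.normal(size=600)
test[300:360] = rng.normal(scale=2.0, size=60)   # sustained abnormal segment

def make_xy(series, lag):
    """Sliding-window inputs (last `lag` values) and next-step targets."""
    X = np.stack([series[i:i + lag] for i in range(len(series) - lag)])
    return X, series[lag:]

X_tr, y_tr = make_xy(train, lag)
w, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)   # fit an AR(lag) predictor

X_te, y_te = make_xy(test, lag)
errors = np.abs(X_te @ w - y_te)                  # one-step prediction errors

# Collective anomaly: the average error over the last k steps exceeds the threshold.
window_scores = np.array([errors[i:i + k].mean() for i in range(len(errors) - k + 1)])
alert_windows = np.where(window_scores > err_threshold)[0]
print("first alerting windows (test-series index):", (alert_windows[:5] + lag).tolist())
```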

    Towards Malware Detection via CPU Power Consumption: Data Collection Design and Analytics (Extended Version)

    This paper presents an experimental design and data analytics approach aimed at power-based malware detection on general-purpose computers. Leveraging the fact that malware executions must consume power, we explore the postulate that malware can be accurately detected via power data analytics. Our experimental design and implementation allow for programmatic collection of CPU power profiles for fixed tasks during uninfected and infected states using five different rootkits. To characterize the power consumption profiles, we use both simple statistical and novel, sophisticated features. We test a one-class anomaly detection ensemble (that baselines non-infected power profiles) and several kernel-based SVM classifiers (that train on both uninfected and infected profiles) in detecting previously unseen malware and clean profiles. The anomaly detection system exhibits perfect detection when using all features and tasks, with a smaller false detection rate than the supervised classifiers. The primary contribution is the proof of concept that baselining the power of fixed tasks can provide accurate detection of rootkits. Moreover, our treatment presents the engineering hurdles needed for experimentation and allows analysis of each statistical feature individually. This work appears to be the first step towards a viable power-based detection capability for general-purpose computers, and presents next steps toward this goal. Comment: Published version appearing in IEEE TrustCom-18. This version contains more details on mathematics and data collection
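
    The baselining idea above can be sketched as: summarize each CPU power trace for a fixed task with simple statistical features, fit a one-class detector on clean runs only, and flag runs whose profiles deviate. An IsolationForest stands in for the paper's one-class ensemble, and all traces below are synthetic.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

def features(trace):
    """Simple statistical summary of one power trace (watts over time)."""
    return [trace.mean(), trace.std(), trace.max(),
            np.percentile(trace, 90), np.abs(np.diff(trace)).mean()]

clean = [rng.normal(20, 2, size=1000) for _ in range(100)]           # baseline runs
infected = [rng.normal(24, 4, size=1000) + 3 * (rng.random(1000) < 0.1)
            for _ in range(10)]                                       # extra activity

X_clean = np.array([features(t) for t in clean])
X_test = np.array([features(t) for t in clean[:5] + infected])

detector = IsolationForest(random_state=0).fit(X_clean)
print(detector.predict(X_test))   # +1 = consistent with baseline, -1 = anomalous
```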

    Log-based Anomaly Detection of CPS Using a Statistical Method

    Detecting anomalies of a cyber physical system (CPS), which is a complex system consisting of both physical and software parts, is important because a CPS often operates autonomously in an unpredictable environment. However, because of the ever-changing nature and lack of a precise model for a CPS, detecting anomalies is still a challenging task. To address this problem, we propose applying an outlier detection method to a CPS log. By using a log obtained from an actual aquarium management system, we evaluated the effectiveness of our proposed method by analyzing outliers that it detected. By investigating the outliers with the developer of the system, we confirmed that some outliers indicate actual faults in the system. For example, our method detected failures of mutual exclusion in the control system that were unknown to the developer. Our method also detected transient losses of functionalities and unexpected reboots. On the other hand, our method did not detect anomalies that occurred in large numbers and were similar to one another. In addition, our method reported rare but unproblematic concurrent combinations of operations as anomalies. Thus, our approach is effective at finding anomalies, but there is still room for improvement.
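
    The abstract does not spell out the statistical method, so the following is only a generic sketch in the same spirit: represent each time window of the CPS log as a vector of event-type counts and flag windows that are statistical outliers. The event names, window size, and the choice of Local Outlier Factor are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
events = ["pump_on", "pump_off", "heater_on", "heater_off", "reboot"]

# Toy log: (timestamp in seconds, event type), one routine event per second,
# plus a burst of unexpected reboots around t = 1800.
log = [(t, rng.choice(events[:4])) for t in range(3600)]
log += [(1800 + i, "reboot") for i in range(40)]

window = 300  # seconds per window
vocab = {e: i for i, e in enumerate(events)}
n_windows = 3600 // window
X = np.zeros((n_windows, len(vocab)))
for t, e in log:
    X[min(t // window, n_windows - 1), vocab[e]] += 1

# -1 marks outlier windows; the reboot-burst window is expected to stand out.
flags = LocalOutlierFactor(n_neighbors=5, contamination=0.1).fit_predict(X)
print("outlier windows:", np.where(flags == -1)[0].tolist())
```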

    A Machine-Synesthetic Approach To DDoS Network Attack Detection

    In the authors' opinion, anomaly detection systems (ADS) are the most promising direction in attack detection, because such systems can detect, among others, unknown (zero-day) attacks. To detect anomalies, the authors propose to use machine synesthesia. Here, machine synesthesia is understood as an interface that allows image classification algorithms to be applied to the problem of detecting network anomalies, making it possible to use non-specialized image recognition methods that have recently been widely and actively developed. In the proposed approach, network traffic data is "projected" into an image. The experimental results show that the proposed method achieves strong performance in detecting attacks: on a large sample, the value of the complex efficiency indicator reaches 97%. Comment: 12 pages, 2 figures, 5 tables. Accepted to the Intelligent Systems Conference (IntelliSys) 201
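
    The "projection" step above can be sketched as binning raw flow records into a fixed-size 2D intensity grid so that ordinary image classifiers apply. The particular binning below (source port by destination port, counts as pixel intensity) is an illustrative assumption, not necessarily the authors' exact mapping.

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 32  # image size

def flows_to_image(src_ports, dst_ports, h=H, w=W):
    """Map each flow to a pixel by folding its ports into an h x w grid."""
    img = np.zeros((h, w), dtype=float)
    for s, d in zip(src_ports, dst_ports):
        img[s % h, d % w] += 1.0
    return img / max(img.max(), 1.0)   # normalize intensities to [0, 1]

# Benign-looking traffic hits diverse ports; DDoS-looking traffic hammers one target.
benign = flows_to_image(rng.integers(1024, 65535, 5000), rng.integers(1, 1024, 5000))
ddos = flows_to_image(rng.integers(1024, 65535, 5000), np.full(5000, 80))

# Any standard image classifier (e.g., a small CNN) can now be trained on such
# grids; here we only show that the two traffic patterns yield distinct images.
print("benign image non-zero pixels:", int((benign > 0).sum()))
print("ddos   image non-zero pixels:", int((ddos > 0).sum()))
```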

    Learning Execution Contexts from System Call Distributions for Intrusion Detection in Embedded Systems

    Existing techniques used for intrusion detection do not fully utilize the intrinsic properties of embedded systems. In this paper, we propose a lightweight method for detecting anomalous executions using a distribution of system call frequencies. We use a cluster analysis to learn the legitimate execution contexts of embedded applications and then monitor them at run-time to capture abnormal executions. We also present an architectural framework with minor processor modifications to aid in this process. Our prototype shows that the proposed method can effectively detect anomalous executions without relying on sophisticated analyses or affecting the critical execution paths.
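
    A minimal sketch of the approach above: cluster system-call frequency vectors from legitimate runs to learn execution contexts, then flag a run-time window whose frequency vector lies far from every learned centroid. The system-call set, cluster count, and threshold choice are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
syscalls = ["read", "write", "open", "close", "ioctl", "send"]

# Normalized system-call frequency vectors from legitimate execution windows,
# drawn from two typical contexts (I/O-heavy and network-heavy).
ctx_io = rng.dirichlet([8, 8, 4, 4, 1, 1], size=200)
ctx_net = rng.dirichlet([2, 2, 1, 1, 4, 10], size=200)
X_train = np.vstack([ctx_io, ctx_net])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_train)
# Threshold: e.g. the 99th percentile of training distances to the nearest centroid.
train_dist = kmeans.transform(X_train).min(axis=1)
threshold = np.percentile(train_dist, 99)

def is_anomalous(freq_vector):
    """Flag a frequency vector that is far from every learned execution context."""
    return kmeans.transform([freq_vector]).min() > threshold

print(is_anomalous(ctx_io[0]))                             # expected: False
print(is_anomalous(rng.dirichlet([1, 1, 1, 10, 10, 1])))   # unusual mix: likely True
```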

    Energy-based Models for Video Anomaly Detection

    Automated detection of abnormalities in data has been an active research area in recent years because of its diverse practical applications, including video surveillance, industrial damage detection, and network intrusion detection. However, building an effective anomaly detection system is a non-trivial task, since it requires tackling the challenges of scarce annotated data, the inability to define anomalous objects explicitly, and the expensive cost of feature engineering. Unlike existing approaches, which only partially solve these problems, we develop a unique framework to cope with all of them simultaneously. Instead of handling the ambiguous definition of anomalous objects, we propose to work with regular patterns, for which unlabeled data is abundant and usually easy to collect in practice. This allows our system to be trained in a completely unsupervised manner and frees us from the need for costly data annotation. By learning a generative model that captures the normality distribution of the data, we can isolate abnormal data points, which receive low normality scores (high abnormality scores). Moreover, by leveraging the power of generative networks, i.e., energy-based models, we are also able to learn feature representations automatically rather than relying on the hand-crafted features that have dominated anomaly detection research for decades. We demonstrate our proposal on the specific application of video anomaly detection; the experimental results indicate that our method outperforms baselines and is comparable with state-of-the-art methods on many benchmark video anomaly detection datasets.
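
    The scoring idea above can be sketched by training a generative model only on regular (normal) data and ranking test points by how poorly the model explains them. A scikit-learn BernoulliRBM and its pseudo-likelihood stand in for the paper's energy-based models, and the "frames" below are toy binary vectors rather than real video.

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM

rng = np.random.default_rng(0)

# "Normal" frames: binarized 8x8 patches with a consistent structure
# (left half mostly on, right half mostly off).
normal = (rng.random((500, 64)) < 0.2).astype(float)
normal[:, :32] = (rng.random((500, 32)) < 0.8).astype(float)

rbm = BernoulliRBM(n_components=16, learning_rate=0.05, n_iter=20,
                   random_state=0).fit(normal)

test_normal = normal[:5]
test_anomalous = (rng.random((5, 64)) < 0.5).astype(float)     # unstructured frames

# Higher score = better explained (more "normal"); low scores flag anomalies.
print("normal frames   :", np.round(rbm.score_samples(test_normal), 1))
print("anomalous frames:", np.round(rbm.score_samples(test_anomalous), 1))
```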