171 research outputs found

    End-to-end anomaly detection in stream data

    Nowadays, huge volumes of data are generated with increasing velocity through various systems, applications, and activities. This increases the demand for stream and time series analysis to react to changing conditions in real time for enhanced efficiency and quality of service delivery as well as upgraded safety and security in private and public sectors. Despite its very rich history, time series anomaly detection is still one of the vital topics in machine learning research and is receiving increasing attention. Identifying hidden patterns and selecting an appropriate model that fits the observed data well and also carries over to unobserved data is not a trivial task. Due to the increasing diversity of data sources and associated stochastic processes, this pivotal data analysis topic is loaded with various challenges like complex latent patterns, concept drift, and overfitting that may mislead the model and cause a high false alarm rate. Handling these challenges pushes advanced anomaly detection methods toward sophisticated decision logic, which turns them into mysterious and inexplicable black boxes. Contrary to this trend, end-users expect transparency and verifiability to trust a model and the outcomes it produces. Also, pointing the users to the most anomalous/malicious areas of time series and causal features could save them time, energy, and money. For these reasons, this thesis addresses the crucial challenges in an end-to-end pipeline of stream-based anomaly detection through the three essential phases of behavior prediction, inference, and interpretation. The first step is focused on devising a time series model that leads to high average accuracy as well as small error deviation. On this basis, we propose higher-quality anomaly detection and scoring techniques that utilize the related contexts to reclassify the observations and post-prune the unjustified events. Last but not least, we make the predictive process transparent and verifiable by providing meaningful reasoning behind its generated results, based on concepts that a human can understand. The provided insight can pinpoint the anomalous regions of time series and explain why the current status of a system has been flagged as anomalous. Stream-based anomaly detection research is a principal area of innovation to support our economy, security, and even the safety and health of societies worldwide. We believe our proposed analysis techniques can contribute to building a situational awareness platform and open new perspectives in a variety of domains such as cybersecurity and health.

    A Framework for Hybrid Intrusion Detection Systems

    Web application security breaches pose a definite threat to the world’s information technology infrastructure. The Open Web Application Security Project (OWASP) generally defines web application security violations as unauthorized or unintentional exposure, disclosure, or loss of personal information. These breaches occur without the company’s knowledge and it often takes a while before the web application attack is revealed to the public, specifically because the security violations are fixed. Due to the need to protect their reputation, organizations have begun researching solutions to these problems. The most widely accepted solution is the use of an Intrusion Detection System (IDS). Such systems currently rely on either signatures of the attack used for the data breach or changes in the behavior patterns of the system to identify an intruder. These systems, either signature-based or anomaly-based, are readily understood by attackers. Issues arise when attacks are not noticed by an existing IDS because the attack does not fit the pre-defined attack signatures the IDS is implemented to discover. Despite the capabilities of current IDSs, little research has identified a method to detect all potential attacks on a system. This thesis intends to address this problem. A particular emphasis will be placed on detecting advanced attacks, such as those that take place at the application layer. These types of attacks are able to bypass existing IDSs, increasing the potential for a web application security breach to occur and not be detected. In particular, the attacks under study are all web application layer attacks. Those included in this thesis are SQL injection, cross-site scripting, directory traversal and remote file inclusion. This work identifies common and existing data breach detection methods as well as the necessary improvements for IDS models. Ultimately, the proposed approach combines an anomaly detection technique measured by cross entropy and a signature-based attack detection framework utilizing a genetic algorithm. The proposed hybrid model for data breach detection benefits organizations by increasing security measures and allowing attacks to be identified in less time and more efficiently.

    Graph-Based Multi-Label Classification for WiFi Network Traffic Analysis

    Network traffic analysis, and specifically anomaly and attack detection, calls for sophisticated tools relying on a large number of features. Mathematical modeling is extremely difficult, given the ample variety of traffic patterns and the subtle and varied ways that malicious activity can be carried out in a network. We address this problem by exploiting data-driven modeling and computational intelligence techniques. Sequences of packets captured on the communication medium are considered, along with multi-label metadata. Graph-based modeling of the data is introduced, thus resorting to the powerful GRALG approach based on feature information granulation, identification of a representative alphabet, embedding and genetic optimization. The obtained classifier is evaluated in terms of both accuracy and complexity for two different supervised problems and compared with state-of-the-art algorithms. We show that the proposed preprocessing strategy is able to describe higher-level relations between data instances in the input domain, thus allowing the algorithms to suitably reconstruct the structure of the input domain itself. Furthermore, the considered Granular Computing approach is able to extract knowledge on multiple semantic levels, thus effectively describing anomalies as subgraph-based symbols of the whole network graph, in a specific time interval. Interesting performance can thus be achieved in identifying network traffic patterns, in spite of the complexity of the considered traffic classes.

    Pattern Discovery in Time-Ordered Data

    Interpretable Sequence Classification via Discrete Optimization

    Sequence classification is the task of predicting a class label given a sequence of observations. In many applications such as healthcare monitoring or intrusion detection, early classification is crucial to prompt intervention. In this work, we learn sequence classifiers that favour early classification from an evolving observation trace. While many state-of-the-art sequence classifiers are neural networks, and in particular LSTMs, our classifiers take the form of finite state automata and are learned via discrete optimization. Our automata-based classifiers are interpretable (supporting explanation, counterfactual reasoning, and human-in-the-loop modification) and have strong empirical performance. Experiments over a suite of goal recognition and behaviour classification datasets show our learned automata-based classifiers to have comparable test performance to LSTM-based classifiers, with the added advantage of being interpretable.
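To make the idea of an automaton used as an early sequence classifier concrete, the following is a minimal sketch and not the authors' method: the automaton here is written by hand rather than learned via discrete optimization, and the class name `AutomatonClassifier`, the state names, and the symbols `fail` and `ok` are all hypothetical. It only illustrates how a finite state automaton can commit to a label as soon as a labeled state is reached.

```python
class AutomatonClassifier:
    """Toy finite state automaton that classifies a trace as early as possible."""

    def __init__(self, transitions, start, labels):
        self.transitions = transitions  # {(state, symbol): next_state}
        self.start = start              # initial state
        self.labels = labels            # {state: class label}; other states are undecided

    def classify_early(self, observations):
        """Return (label, steps_used); label is None if no labeled state is reached."""
        state = self.start
        for step, symbol in enumerate(observations, start=1):
            # Assumption: an unknown (state, symbol) pair leaves the state unchanged.
            state = self.transitions.get((state, symbol), state)
            if state in self.labels:
                return self.labels[state], step
        return None, len(observations)


# Hypothetical hand-written automaton: two consecutive failed logins from the
# start of a trace are labeled "intrusion"; a successful login is "benign".
toy = AutomatonClassifier(
    transitions={
        ("q0", "fail"): "q1",
        ("q0", "ok"): "benign",
        ("q1", "fail"): "alarm",
        ("q1", "ok"): "benign",
    },
    start="q0",
    labels={"alarm": "intrusion", "benign": "benign"},
)

result = toy.classify_early(["fail", "fail", "ok"])
```

Because every decision corresponds to a path through named states, the reason for a label can be read directly from the transitions, which is the kind of interpretability the abstract refers to.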

    NLP Methods in Host-based Intrusion Detection Systems: A Systematic Review and Future Directions

    A Host-based Intrusion Detection System (HIDS) is an effective last line of defense against cyber security attacks after perimeter defenses (e.g., Network-based Intrusion Detection System and Firewall) have failed or been bypassed. HIDS is widely adopted in industry and is ranked among the top two most used security tools by Security Operation Centers (SOC) of organizations. Although effective and efficient HIDS is highly desirable for industrial organizations, the evolution of increasingly complex attack patterns causes several challenges resulting in performance degradation of HIDS (e.g., high false alert rate creating alert fatigue for SOC staff). Since Natural Language Processing (NLP) methods are better suited for identifying complex attack patterns, an increasing number of HIDS are leveraging the advances in NLP that have shown effective and efficient performance in precisely detecting low-footprint, zero-day attacks and predicting the next steps of attackers. This active research trend of using NLP in HIDS demands a synthesized and comprehensive body of knowledge of NLP-based HIDS. Thus, we conducted a systematic review of the literature on the end-to-end pipeline of the use of NLP in HIDS development. For the end-to-end NLP-based HIDS development pipeline, we identify, taxonomically categorize and systematically compare the state of the art in the use of NLP methods in HIDS, the attacks detected by these NLP methods, and the datasets and evaluation metrics used to evaluate NLP-based HIDS. We highlight the relevant prevalent practices, considerations, advantages and limitations to support HIDS developers. We also outline future research directions for NLP-based HIDS development.

    Dueling-HMM Analysis on Masquerade Detection

    Masquerade detection is the ability to detect attackers known as masqueraders that intrude on another user’s system and pose as legitimate users. Once a masquerader obtains access to a user’s system, the masquerader has free rein over whatever data is on that system. In this research, we focus on masquerade detection and user classification using the heavy hitter approach and two different approaches based on hidden Markov models (HMMs): the dueling-HMM and threshold-HMM strategies. The heavy hitter approach computes the frequent elements seen in the training data sequence and test data sequence and computes the distance to see whether the test data sequence is masqueraded or not. The results show very misleading classifications, suggesting that the approach is not viable for masquerade detection. A hidden Markov model is a tool for representing probability distributions over sequences of observations [9]. Previous research has shown that using a threshold-based hidden Markov model (HMM) approach is successful in a variety of categories: malware detection, intrusion detection, pattern recognition, etc. We have verified that using a threshold-based HMM approach produces high accuracy with few false positives. Using the dueling-HMM approach, which utilizes multiple training HMMs, we obtain an overall accuracy of 81.96%. With the introduction of the bias in the dueling-HMM approach, we produce similar results to the results obtained in the threshold-based HMM approach, where we see many non-masqueraded data detected, while many masqueraded data avoid detection, yet still result in a high overall accuracy.
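The heavy hitter comparison described in this abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the thesis's implementation: the abstract does not say which distance is used, so Jaccard distance between the top-k sets is an assumed choice, and the function names, `k`, `threshold`, and the sample command sequences are hypothetical.

```python
from collections import Counter


def heavy_hitters(sequence, k):
    """Return the set of the k most frequent elements in a sequence."""
    return {item for item, _ in Counter(sequence).most_common(k)}


def heavy_hitter_distance(train_seq, test_seq, k=3):
    """Jaccard distance between heavy-hitter sets (assumed metric).

    0.0 means the same heavy hitters; 1.0 means no overlap at all.
    """
    train_hh = heavy_hitters(train_seq, k)
    test_hh = heavy_hitters(test_seq, k)
    union = train_hh | test_hh
    if not union:
        return 0.0  # two empty sequences are treated as identical
    return 1.0 - len(train_hh & test_hh) / len(union)


def is_masquerade(train_seq, test_seq, k=3, threshold=0.5):
    """Flag the test sequence when its heavy hitters drift past the threshold."""
    return heavy_hitter_distance(train_seq, test_seq, k) > threshold


# Hypothetical command histories.
train = ["ls"] * 5 + ["cd"] * 4 + ["vim"] * 3 + ["gcc"]
same_user = ["cd"] * 6 + ["ls"] * 5 + ["vim"] * 2 + ["make"]
intruder = ["nc"] * 5 + ["wget"] * 4 + ["chmod"] * 3 + ["ls"]
```

A comparison like this ignores command order and everything outside the top-k set, which may be one reason such an approach classifies poorly in practice.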

    Practical Analysis of Encrypted Network Traffic

    The growing use of encryption in network communications is an undoubted boon for user privacy. However, the limitations of real-world encryption schemes are still not well understood, and new side-channel attacks against encrypted communications are disclosed every year. Furthermore, encrypted network communications, by preventing inspection of packet contents, represent a significant challenge from a network security perspective: our existing infrastructure relies on such inspection for threat detection. Both problems are exacerbated by the increasing prevalence of encrypted traffic: recent estimates suggest that 65% or more of downstream Internet traffic will be encrypted by the end of 2016. This work addresses these problems by expanding our understanding of the properties and characteristics of encrypted network traffic and exploring new, specialized techniques for the handling of encrypted traffic by network monitoring systems. We first demonstrate that opaque traffic, of which encrypted traffic is a subset, can be identified in real time, and show how this ability can be leveraged to improve the capabilities of existing IDSs. To do so, we evaluate and compare multiple methods for rapid identification of opaque packets, ultimately pinpointing a simple hypothesis test (which can be implemented on an FPGA) as an efficient and effective detector of such traffic. In our experiments, using this technique to “winnow”, or filter, opaque packets from the traffic load presented to an IDS significantly increased the throughput of the system, allowing the identification of many more potential threats than the same system without winnowing. Second, we show that side channels in encrypted VoIP traffic enable the reconstruction of approximate transcripts of conversations. Our approach leverages techniques from linguistics, machine learning, natural language processing, and machine translation to accomplish this task despite the limited information leaked by such side channels. Our ability to do so underscores both the potential threat to user privacy which such side channels represent and the degree to which this threat has been underestimated. Finally, we propose and demonstrate the effectiveness of a new paradigm for identifying HTTP resources retrieved over encrypted connections. Our experiments demonstrate how the predominant paradigm from prior work fails to accurately represent real-world situations and how our proposed approach offers significant advantages, including the ability to infer partial information, in comparison. We believe these results represent both an enhanced threat to user privacy and an opportunity for network monitors and analysts to improve their own capabilities with respect to encrypted traffic.
    Doctor of Philosophy
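As an illustration of how a simple hypothesis test might separate opaque (encrypted or compressed) payloads from transparent ones, here is a minimal sketch. It is not the test selected in the thesis, which the abstract does not specify: a chi-square goodness-of-fit test on 4-bit nibble frequencies is an assumed stand-in, and the function names and the 0.001 significance level are hypothetical choices. The null hypothesis is that nibbles are uniformly distributed, as expected for opaque data.

```python
# Chi-square critical value for 15 degrees of freedom at significance 0.001.
CHI2_CRITICAL_DF15_P001 = 37.697


def nibble_chi_square(payload: bytes) -> float:
    """Chi-square statistic of the payload's nibble counts against uniform."""
    if not payload:
        raise ValueError("payload must not be empty")
    counts = [0] * 16
    for byte in payload:
        counts[byte >> 4] += 1    # high nibble
        counts[byte & 0x0F] += 1  # low nibble
    expected = 2 * len(payload) / 16
    return sum((c - expected) ** 2 / expected for c in counts)


def looks_opaque(payload: bytes, critical: float = CHI2_CRITICAL_DF15_P001) -> bool:
    """True when uniformity cannot be rejected, i.e. the payload looks opaque."""
    return nibble_chi_square(payload) <= critical


# Every nibble value appears equally often here, so the statistic is exactly 0.
uniform_payload = bytes(range(256))
# Plain-text HTTP is heavily skewed toward a few nibble values.
text_payload = b"GET /index.html HTTP/1.1\r\n" * 20
```

A filter in the spirit of the "winnowing" described above would drop packets for which `looks_opaque` returns true before they reach deeper inspection; very short payloads give the test little statistical power, so a real deployment would need a minimum-length rule.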