Developing reliable anomaly detection system for critical hosts: a proactive defense paradigm

Abstract

Current host-based anomaly detection systems have limited accuracy and incur high processing costs. This is because they must process massive volumes of audit data from critical hosts while detecting complex zero-day attacks, which can leave minor, stealthy and dispersed artefacts. In this research study, this observation is validated using existing datasets and state-of-the-art algorithms: feature-construction methods for a host's audit data, such as the popular semantic-based extraction, and decision engines, including Support Vector Machines, Extreme Learning Machines and Hidden Markov Models. There is a challenging trade-off between achieving high accuracy at a minimum processing cost and processing massive amounts of audit data that can contain complex attacks. There is also a lack of a realistic experimental dataset that reflects the normal and abnormal activities of current real-world computers. This thesis investigates new methodologies for host-based anomaly detection systems, with the specific aim of improving accuracy at a minimum processing cost while addressing four challenges: complex attacks which, in some cases, are visible only through a quantified computing resource (for example, the execution times of programs); the processing of massive amounts of audit data; the unavailability of a realistic experimental dataset; and the automatic minimization of the false positive rate while dealing with the dynamics of normal activities. This study provides three original and significant contributions that advance the body of knowledge in this field. The first major contribution is the generation and release of a realistic intrusion detection system dataset, together with a metric based on fuzzy qualitative modeling for embedding realism in a dataset's design process and for assessing that quality in existing or future datasets.
The second key contribution is the construction and evaluation of hidden host features that identify the subtle differences between normal and abnormal artefacts of hosts' activities at a minimum processing cost. The Linux-centric features include the frequencies and ranges, frequency-domain representations and Gaussian interpretations of system call identifiers with execution times, while, for Windows, a count of the distinct core Dynamic Link Library calls is identified as a hidden host feature. The final key contribution is the development of two new anomaly-based statistical decision engines that capitalize on the potential of some of the suggested hidden features to detect anomalies reliably. The first engine, which has a forensic module, is based on stochastic theories, including Hierarchical Hidden Markov Models, and the second is modeled using Gaussian Mixture Modeling and Correntropy. The results demonstrate that the proposed host features and engines are capable of meeting the identified challenges.
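As a rough illustration of the kind of host features named above, the following minimal sketch computes frequencies and ranges of system call identifiers with execution times, and a count of distinct Dynamic Link Library calls. It is not the thesis's implementation: the trace format (a list of `(syscall_id, exec_time)` pairs), the function names `syscall_features` and `distinct_dll_count`, the case-insensitive DLL matching and the sample values are all assumptions made for illustration.

```python
from collections import Counter


def syscall_features(trace):
    """Summarize a trace of (syscall_id, exec_time_seconds) pairs.

    The trace format is hypothetical; real audit logs would need parsing first.
    """
    ids = [sid for sid, _ in trace]
    times = [t for _, t in trace]
    return {
        # How often each system call identifier occurs in the trace.
        "frequencies": dict(Counter(ids)),
        # Smallest and largest identifier observed.
        "id_range": (min(ids), max(ids)),
        # Shortest and longest execution time observed.
        "time_range": (min(times), max(times)),
    }


def distinct_dll_count(dll_calls):
    """Count distinct DLL names, ignoring case (an assumption for this sketch)."""
    return len({name.lower() for name in dll_calls})


# Made-up sample data, for illustration only.
trace = [(3, 0.002), (4, 0.010), (3, 0.001), (5, 0.004)]
features = syscall_features(trace)
dll_count = distinct_dll_count(["kernel32.dll", "ntdll.dll", "KERNEL32.DLL"])
```

Summaries of this kind are cheap to compute in a single pass over the audit data, which is consistent with the stated goal of a minimum processing cost. The frequency-domain and Gaussian representations mentioned above would need further steps that are not shown here.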
