Current host-based anomaly detection systems have limited accuracy and incur
high processing costs, as they must process the massive audit data of critical
hosts while detecting complex zero-day attacks that can leave only minor, stealthy
and dispersed artefacts. In this research study, this observation is validated
using existing datasets and state-of-the-art algorithms for constructing features
from a host's audit data, such as the popular semantic-based extraction, together
with decision engines including Support Vector Machines, Extreme Learning
Machines and Hidden Markov Models. There is a challenging trade-off between
achieving high accuracy at a minimum processing cost and handling massive
amounts of audit data that can include complex attacks. Moreover, there is a
lack of a realistic experimental dataset reflecting the normal and abnormal
activities of current real-world computers.
This thesis investigates the development of new methodologies for host-based
anomaly detection systems, with the specific aims of improving accuracy at a
minimum processing cost. It addresses several challenges: complex attacks that,
in some cases, are visible only through a quantified computing resource, for
example, the execution times of programs; the processing of massive amounts of
audit data; the unavailability of a realistic experimental dataset; and the
automatic minimization of the false positive rate in the presence of dynamic
normal activities.
This study provides three original and significant contributions to this field of
research, which represent a marked advance in its body of knowledge.
The first major contribution is the generation and release of a realistic intrusion
detection system dataset, as well as the development of a metric based on fuzzy
qualitative modeling for embedding the quality of realism in a dataset's design
process and for assessing this quality in existing or future datasets.
The second key contribution is the construction and evaluation of hidden host
features that identify the subtle differences between the normal and abnormal
artefacts of hosts' activities at a minimum processing cost. The Linux-centric
features include the frequencies and ranges, frequency-domain representations
and Gaussian interpretations of system call identifiers combined with execution
times while, for Windows, a count of the distinct core Dynamic Link Library
calls is identified as a hidden host feature.
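As an illustrative sketch of the frequency-based part of these Linux-centric features (the identifier range, the normalisation and the use of a plain discrete Fourier transform here are assumptions for illustration, not the thesis's exact pipeline), per-identifier frequencies and a frequency-domain view of a system-call trace might be computed as:

```python
import numpy as np

def syscall_features(trace, n_ids=340):
    """Toy feature extractor for one trace of system call identifiers:
    per-identifier frequencies plus the magnitude spectrum of the raw
    identifier sequence (a simple frequency-domain representation)."""
    trace = np.asarray(trace)
    # Normalised histogram: how often each identifier occurs in the trace
    freqs = np.bincount(trace, minlength=n_ids) / len(trace)
    # Magnitudes of the real-input DFT of the identifier sequence
    spectrum = np.abs(np.fft.rfft(trace))
    return freqs, spectrum

# Hypothetical short trace of system call identifiers
freqs, spectrum = syscall_features([3, 4, 4, 5, 3, 3, 6, 4])
```

A Gaussian interpretation could then summarise the execution times observed for each identifier by a fitted mean and variance, but that step is omitted here.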
The final key contribution is the development of two new anomaly-based statistical
decision engines that capitalize on the potential of some of the suggested
hidden features and reliably detect anomalies. The first engine, which includes
a forensic module, is based on stochastic theories, including hierarchical hidden
Markov models, while the second is modeled using Gaussian Mixture Modeling and
correntropy. The results demonstrate that the proposed host features and engines
are competent for meeting the identified challenges.
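As a minimal, self-contained sketch of the correntropy component of the second engine (the kernel bandwidth, the decision threshold and the fixed "normal template" are illustrative assumptions; the Gaussian Mixture Modeling stage is omitted), an anomaly check could look like:

```python
import numpy as np

def correntropy(x, y, sigma=1.0):
    """Correntropy: the mean Gaussian-kernel similarity between paired
    samples. Values near 1 indicate close agreement; small values flag
    dissimilarity."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(np.mean(np.exp(-((x - y) ** 2) / (2.0 * sigma ** 2))))

def is_anomalous(feature_vec, normal_template, sigma=1.0, threshold=0.5):
    """Flag a feature vector whose correntropy with a template of normal
    behaviour falls below a tunable threshold."""
    return correntropy(feature_vec, normal_template, sigma) < threshold

# Hypothetical template: mean feature values learned from normal traces
normal_template = np.array([0.2, 0.3, 0.1, 0.4])
```

In a full engine, the template and threshold would come from the model fitted on normal activity rather than being fixed by hand.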