Discriminative models for multi-instance problems with tree-structure
Modeling network traffic is gaining importance in order to counter modern
threats of ever-increasing sophistication. It is, however, surprisingly difficult
and costly to construct reliable classifiers on top of telemetry data due to
the variety and complexity of signals that no human can manage to interpret in
full. Obtaining training data with a sufficiently large and varied body of
labels can thus be prohibitive. The goal of this work is to
detect infected computers by observing their HTTP(S) traffic collected from
network sensors, which are typically proxy servers or network firewalls, while
relying on only minimal human input in the model training phase. We propose a
discriminative model that makes decisions based on all of a computer's traffic
observed during a predefined time window (5 minutes in our case). The model is
trained on traffic samples collected over equally sized time windows for a large
number of computers, where the only labels needed are human verdicts about the
computer as a whole (presumed infected vs. presumed clean). As part of training,
the model itself recognizes discriminative patterns in traffic targeted to
individual servers and constructs the final high-level classifier on top of
them. We show the classifier to perform with very high precision, while the
learned traffic patterns can be interpreted as Indicators of Compromise. In the
following we implement the discriminative model as a neural network with
special structure reflecting two stacked multi-instance problems. The main
advantages of the proposed configuration include not only improved accuracy and
the ability to learn from coarse labels, but also automatic learning of the
types of servers (together with their detectors) that are typically visited by
infected computers.
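The two stacked multi-instance problems described in this abstract can be sketched as a forward pass: flows to one server are pooled into a server vector, and server vectors are pooled into one vector per computer. This is only an illustrative sketch, not the authors' implementation. The mean pooling, the single ReLU layer per level, the layer sizes, and every name used (`embed_bag`, `computer_score`, `params`) are assumptions made for the example, and the weights are random, not trained.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def embed_bag(instances, W, b):
    """Embed each instance with one ReLU layer, then mean-pool over the bag."""
    return relu(instances @ W + b).mean(axis=0)

def computer_score(servers, params):
    """servers: list of arrays, one per server, each of shape (n_flows, n_features)."""
    # Inner multi-instance problem: flows to one server -> one server vector.
    server_vecs = np.stack(
        [embed_bag(flows, params["W1"], params["b1"]) for flows in servers]
    )
    # Outer multi-instance problem: server vectors -> one computer vector.
    computer_vec = embed_bag(server_vecs, params["W2"], params["b2"])
    # Linear output followed by a sigmoid gives a score between 0 and 1.
    logit = computer_vec @ params["w_out"] + params["b_out"]
    return 1.0 / (1.0 + np.exp(-logit))

rng = np.random.default_rng(0)
# Hypothetical sizes: 4 features per flow, 8 hidden units, then 6 hidden units.
params = {
    "W1": rng.normal(size=(4, 8)), "b1": np.zeros(8),
    "W2": rng.normal(size=(8, 6)), "b2": np.zeros(6),
    "w_out": rng.normal(size=6), "b_out": 0.0,
}
# One computer talking to three servers with 5, 2 and 7 flows (random features).
servers = [rng.normal(size=(n, 4)) for n in (5, 2, 7)]
score = computer_score(servers, params)
```

Because both levels pool over their bags, the score does not depend on the order of servers or of flows within a server, and only one label per computer would be needed to train such a model end to end.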
Challenges and Open Questions of Machine Learning in Computer Security
This habilitation thesis presents advancements in machine learning for computer security,
arising from problems in network intrusion detection and steganography.
The thesis puts an emphasis on explaining the traits shared by steganalysis, network intrusion
detection, and other security domains, which make these domains different from
computer vision, speech recognition, and other fields where machine learning is typically
studied. Then, the thesis presents methods developed to at least partially solve the identified
problems with an overall goal of making machine-learning-based intrusion detection
systems viable. Most of them are general in the sense that they can be used outside intrusion
detection and steganalysis on problems with similar constraints.
A common feature of all methods is that they are generally simple, yet surprisingly
effective. According to large-scale experiments they almost always improve on the prior art,
likely because they are tailored to security problems and designed for large volumes
of data.
Specifically, the thesis addresses the following problems:
anomaly detection with low computational and memory complexity such that efficient
processing of large data is possible;
multiple-instance anomaly detection improving the signal-to-noise ratio by classifying
larger groups of samples;
supervised classification of tree-structured data simplifying their encoding in neural
networks;
clustering of structured data;
supervised training with the emphasis on the precision in the top p% of returned data;
and finally explanation of anomalies to help humans understand the nature of an anomaly
and speed up their decisions.
Many algorithms and methods presented in this thesis are deployed in a real intrusion
detection system protecting millions of computers around the globe.
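To make the item on precision in the top p% of returned data concrete, the quantity itself can be computed as below. This is a minimal sketch of the evaluation measure only, not the training method developed in the thesis. The function name `precision_at_top`, the rounding of the cut-off up to at least one sample, and the toy scores and labels are assumptions made for illustration.

```python
import numpy as np

def precision_at_top(scores, labels, p):
    """Fraction of positives among the top p percent highest-scored samples."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    # Number of returned samples: p percent of the data, at least one.
    k = max(1, int(np.ceil(len(scores) * p / 100.0)))
    # Indices of the k highest scores (stable sort keeps ties in input order).
    top = np.argsort(-scores, kind="stable")[:k]
    return float(labels[top].mean())

# Toy data: 10 samples, label 1 marks a true positive.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.05, 0.02, 0.01]
labels = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
top20 = precision_at_top(scores, labels, 20)  # two highest scores: labels 1 and 0
```

Here `top20` is 0.5, since one of the two highest-scored samples is a true positive. A measure like this matters in settings where an analyst can inspect only a small fraction of the returned alerts.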