SVM Ensembles for large volumes of data
Support Vector Machines (SVMs) are a popular class of machine learning (ML) algorithms that have been
extensively used for their remarkable performance in classification and regression tasks. However, their
high computational complexity is often a limiting factor, especially for problems
with large volumes of data. The extraordinary increase in the availability of data over the last
decades demands the development of new ML algorithms that can deal with these
ever-increasing volumes of data.
In this undergraduate thesis we have designed, developed and analyzed an ML model based on
SVM ensembles specifically aimed at problems with large datasets. To achieve this goal we combined
several ensemble methods and introduced a modified version of subbagging that capitalizes on the high
availability of data. The resulting model performs very well in comparison to other models.
Compared to other SVM ensembles with equal computational requirements, the developed ensemble
achieves greater stability in its scores. Compared to a single SVM, the proposed model reaches higher
accuracies for the same training time budget. Furthermore, it achieves accuracies comparable to a
single SVM trained without time limitations while using only 10% of its training time, and even less
in the best cases.
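As a rough illustration of the subbagging idea mentioned above, the following sketch trains each ensemble member on a small subsample drawn without replacement and combines the members by majority vote. It is not the modified subbagging variant developed in the thesis, whose details are not given here. The linear SVM trained by SGD on the hinge loss, the toy data, and all function names and parameter values are assumptions chosen to keep the example self-contained.

```python
import random

def make_toy_data(n, seed=0):
    """Hypothetical toy data: two Gaussian blobs in 2D with labels -1 and +1."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = rng.choice([-1, 1])
        X.append([rng.gauss(2.0 * label, 1.0), rng.gauss(2.0 * label, 1.0)])
        y.append(label)
    return X, y

def train_linear_svm(X, y, lam=0.01, lr=0.01, epochs=20, seed=0):
    """Fit a linear SVM by SGD on the L2-regularized hinge loss (assumed stand-in)."""
    rng = random.Random(seed)
    w, b = [0.0] * len(X[0]), 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            # The regularizer shrinks w on every step.
            w = [wj - lr * lam * wj for wj in w]
            if margin < 1.0:
                # Hinge loss is active: move toward classifying x_i correctly.
                w = [wj + lr * y[i] * xj for wj, xj in zip(w, X[i])]
                b += lr * y[i]
    return w, b

def predict_one(model, x):
    """Sign of the decision function of a single linear model."""
    w, b = model
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0.0 else -1

def train_subbagged_ensemble(X, y, n_members=5, subset_size=200, seed=0):
    """Train each member on a subsample drawn without replacement (subbagging)."""
    rng = random.Random(seed)
    members = []
    for m in range(n_members):
        idx = rng.sample(range(len(X)), subset_size)
        X_sub = [X[i] for i in idx]
        y_sub = [y[i] for i in idx]
        members.append(train_linear_svm(X_sub, y_sub, seed=seed + m))
    return members

def ensemble_predict(members, x):
    """Majority vote; an odd number of members avoids ties."""
    votes = sum(predict_one(m, x) for m in members)
    return 1 if votes >= 0 else -1

def accuracy(members, X, y):
    """Fraction of samples the ensemble classifies correctly."""
    return sum(ensemble_predict(members, x) == t for x, t in zip(X, y)) / len(y)
```

Because each member sees only `subset_size` samples, the cost of training one member does not grow with the full dataset, which is the general motivation for subsampling when the training cost of a single SVM grows superlinearly with the number of samples.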
Challenges and Open Questions of Machine Learning in Computer Security
This habilitation thesis presents advancements in machine learning for computer security,
arising from problems in network intrusion detection and steganography.
The thesis emphasizes the traits shared by steganalysis, network intrusion
detection, and other security domains, which make these domains different from
computer vision, speech recognition, and other fields where machine learning is typically
studied. The thesis then presents methods developed to at least partially solve the identified
problems, with the overall goal of making machine-learning-based intrusion detection
systems viable. Most of them are general in the sense that they can be used outside intrusion
detection and steganalysis on problems with similar constraints.
A common feature of all methods is that they are generally simple, yet surprisingly
effective. According to large-scale experiments, they almost always improve on the prior art,
likely because they are tailored to security problems and designed for large volumes
of data.
Specifically, the thesis addresses the following problems:
anomaly detection with low computational and memory complexity, such that efficient
processing of large data is possible;
multiple-instance anomaly detection, improving the signal-to-noise ratio by classifying
larger groups of samples;
supervised classification of tree-structured data, simplifying their encoding in neural
networks;
clustering of structured data;
supervised training with an emphasis on the precision in the top p% of returned data;
and finally, explanation of anomalies to help humans understand the nature of an anomaly
and speed up their decisions.
Many algorithms and methods presented in this thesis are deployed in a real intrusion
detection system protecting millions of computers around the globe.
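To make the "precision in the top p%" criterion from the list above concrete, here is a minimal sketch of how such a metric could be computed from detector scores. It only shows the evaluation measure, not the training method of the thesis. The function name, the rounding of the cutoff, and the example data are assumptions made for illustration.

```python
import math

def precision_at_top_percent(scores, labels, p):
    """Fraction of true positives among the top p% highest-scored samples.

    scores: higher means more suspicious; labels: 1 for positive, 0 for negative.
    The cutoff is rounded up and is at least one sample (an assumed convention).
    """
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    k = max(1, math.ceil(len(scores) * p / 100))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return sum(label for _, label in ranked[:k]) / k

# Hypothetical scores for ten samples, three of which are true positives.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.05, 0.02, 0.01]
labels = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
print(precision_at_top_percent(scores, labels, 20))  # top 2 samples: 0.5
```

A measure like this rewards a detector for ranking true positives at the very top of its output, which matters when an analyst can only inspect a small fraction of the returned data.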