2 research outputs found

    SVM Ensembles for large volumes of data

    Support Vector Machines (SVMs) are a popular machine learning (ML) algorithm, used extensively for their strong performance in classification and regression tasks. However, their high computational complexity often limits their use, especially on problems with large volumes of data. The extraordinary increase in data availability over the last decades demands new ML algorithms that can deal with these ever-increasing volumes of data. In this undergraduate thesis we have designed, developed and analyzed an ML model based on SVM ensembles, aimed especially at problems with large datasets. To achieve this goal we combined several ensemble methods and introduced a modified version of subbagging that capitalizes on the high availability of data. The resulting model performs very well in comparison to other models. Compared to other SVM ensembles with equal computational requirements, the developed ensemble achieves greater stability in its scores. Compared to a single SVM, the proposed model reaches higher accuracies for the same training time budget. Furthermore, it achieves accuracies comparable to those of a single SVM trained without time limitations while using only 10% of its training time, and even less in the best cases.
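The abstract does not spell out the thesis's modified subbagging, so the following is only a minimal sketch of plain subbagging, assuming a linear SVM trained with a Pegasos-style stochastic subgradient method as the base learner and a majority vote as the combiner. Each base model sees a random subsample drawn without replacement, which keeps the per-model training cost low when data is plentiful. All function names and parameter values (`train_subbagged_svms`, `subsample=0.2`, and so on) are hypothetical and chosen for illustration.

```python
import random


def train_linear_svm(X, y, lam=0.01, steps=2000, rng=None):
    """Pegasos-style stochastic subgradient descent on the hinge loss."""
    rng = rng or random.Random(0)
    w = [0.0] * (len(X[0]) + 1)  # last weight acts as the bias term
    for t in range(1, steps + 1):
        i = rng.randrange(len(X))
        xi = list(X[i]) + [1.0]
        eta = 1.0 / (lam * t)  # decaying step size
        margin = y[i] * sum(wj * xj for wj, xj in zip(w, xi))
        w = [(1.0 - eta * lam) * wj for wj in w]  # shrink (regularization)
        if margin < 1.0:  # sample violates the margin: move towards it
            w = [wj + eta * y[i] * xj for wj, xj in zip(w, xi)]
    return w


def predict_one(w, x):
    score = sum(wj * xj for wj, xj in zip(w, list(x) + [1.0]))
    return 1 if score >= 0 else -1


def train_subbagged_svms(X, y, n_models=5, subsample=0.2, seed=0):
    """Train each SVM on a random subsample drawn WITHOUT replacement."""
    rng = random.Random(seed)
    m = max(1, int(subsample * len(X)))
    models = []
    for _ in range(n_models):
        idx = rng.sample(range(len(X)), m)
        models.append(
            train_linear_svm([X[i] for i in idx], [y[i] for i in idx], rng=rng)
        )
    return models


def predict_ensemble(models, x):
    """Majority vote over the base models (ties resolved as +1)."""
    votes = sum(predict_one(w, x) for w in models)
    return 1 if votes >= 0 else -1


def make_blobs(n, seed=1):
    """Toy data (assumption): two Gaussian blobs labelled -1 and +1."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = rng.choice([-1, 1])
        X.append((rng.gauss(2.0 * label, 1.0), rng.gauss(2.0 * label, 1.0)))
        y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_blobs(500)
    models = train_subbagged_svms(X, y)
    hits = sum(predict_ensemble(models, x) == t for x, t in zip(X, y))
    print("training accuracy:", hits / len(X))
```

Because every base model trains on only a fraction of the data, total training cost grows with `n_models * subsample` rather than with the full dataset size, which is the general trade-off such ensembles exploit.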

    Challenges and Open Questions of Machine Learning in Computer Security

    This habilitation thesis presents advancements in machine learning for computer security, arising from problems in network intrusion detection and steganography. The thesis emphasizes the traits shared by steganalysis, network intrusion detection, and other security domains, which make these domains different from computer vision, speech recognition, and other fields where machine learning is typically studied. The thesis then presents methods developed to at least partially solve the identified problems, with the overall goal of making machine-learning-based intrusion detection systems viable. Most of them are general in the sense that they can be used outside intrusion detection and steganalysis on problems with similar constraints. A common feature of all the methods is that they are generally simple, yet surprisingly effective. According to large-scale experiments, they almost always improve on the prior art, which is likely because they are tailored to security problems and designed for large volumes of data. Specifically, the thesis addresses the following problems: anomaly detection with low computational and memory complexity, such that efficient processing of large data is possible; multiple-instance anomaly detection, which improves the signal-to-noise ratio by classifying larger groups of samples; supervised classification of tree-structured data, which simplifies their encoding in neural networks; clustering of structured data; supervised training with an emphasis on precision in the top p% of returned data; and finally, explanation of anomalies, to help humans understand the nature of an anomaly and speed up their decisions. Many of the algorithms and methods presented in this thesis are deployed in a real intrusion detection system protecting millions of computers around the globe.