Machine Learning for Internet Security: Malware Detection and Web Image Classification

Abstract

In today's fast-moving, Internet-driven world, new opportunities to take advantage of the latest technologies keep emerging. This empowerment, however, serves not only legitimate purposes but also various questionable and criminal activities. The first part of the thesis addresses the problem of automatic malware detection. An unusual restriction in malware classification is the requirement of a strictly zero False Positive rate. To satisfy this restriction, a two-stage methodology is proposed. Because the features are nominal, the first stage uses an adaptation of the MinHash algorithm, which balances accuracy and running time. The second-stage classifier uses two Extreme Learning Machines (ELMs), each with a hyper-parameter that adjusts the trade-off between coverage and the number of False Positives or False Negatives. The final outputs include a third, "unknown" class, which sacrifices some coverage to achieve a near-zero False Positive rate (2 out of 38,000 on the test set). The second part of the thesis explores web image classification for web content filtering. The training dataset inherits the properties of real web images: high variability, often weak clues to the website class, and a large amount of semantic noise. For the classification, a suitable image representation and a two-stage methodology are proposed. Images are represented by their local features, with the local feature descriptor as the smallest processing unit. In the first stage, the class probability density in the descriptor space is estimated with a random Vector Quantization. In the second stage, the classes of images are derived from their classified descriptors, in an image-to-class fashion. The approach provides an average accuracy of 35% in a 10-class setting, with accuracy above 70% for the "Adult" class.
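The first-stage idea of comparing samples by sets of nominal features can be illustrated with plain MinHash. The following is only a minimal sketch of the standard algorithm, not the adaptation used in the thesis, whose details are not given in the abstract. The function names, the number of hash functions, and the example feature strings are all hypothetical.

```python
import hashlib


def minhash_signature(features, num_hashes=128):
    """Return a standard MinHash signature for a set of nominal features.

    Plain MinHash only; the thesis uses an adaptation that the abstract
    does not describe. `num_hashes` is an illustrative choice.
    """
    if not features:
        raise ValueError("features must be a non-empty set of strings")
    signature = []
    for seed in range(num_hashes):
        # A different salt per seed gives a family of independent hash functions.
        salt = seed.to_bytes(8, "little")
        signature.append(
            min(
                int.from_bytes(
                    hashlib.blake2b(
                        f.encode("utf-8"), digest_size=8, salt=salt
                    ).digest(),
                    "big",
                )
                for f in features
            )
        )
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


# Hypothetical nominal features; real malware features would come from
# static or dynamic analysis of the samples.
sample_a = {"api:CreateFileW", "api:WriteFile", "section:.text", "packer:none"}
sample_b = {"api:CreateFileW", "api:WriteFile", "section:.text", "packer:upx"}

similarity = estimate_jaccard(
    minhash_signature(sample_a), minhash_signature(sample_b)
)
print(f"estimated Jaccard similarity: {similarity:.2f}")
```

The signature length controls the trade-off the abstract mentions: more hash functions give a more accurate similarity estimate at a higher running time.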
