Feature Selection and Improving Classification Performance for Malware Detection

Abstract

The ubiquitous advance of technology has been conducive to the proliferation of cyber threats, resulting in attacks that have grown exponentially. Consequently, researchers have developed models based on machine learning algorithms for detecting malware. However, these methods require a significant number of extracted features for correct malware classification, so feature extraction, training, and testing take significant time; moreover, it remains unexplored which features are the most important for achieving correct classification. In this thesis, a dataset of malware and clean files (goodware) is created and analyzed using the static and dynamic features provided by the online framework VirusTotal. The purpose was to select the smallest number of features that keeps the classification accuracy as high as that of state-of-the-art research. The motivation for selecting the most representative features for malware detection is twofold: reducing the training time, which grows as O(n^2) with respect to the number of features, and enabling the creation of an embedded program that monitors the processes executed by the OS. Feature selection was therefore performed by keeping only the most important features. In addition, classification algorithms such as Random Forest, Support Vector Machine, and Neural Networks were used in a novel combination that not only increased accuracy but also reduced training time from hours to just minutes. Next, the model was tested on one additional dataset of unseen malware files. Results showed that 9 features were enough to distinguish malware from goodware files with an accuracy of 99.60%.
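As an illustration of the idea of keeping only the most important features, the following is a minimal sketch in Python. The abstract does not state how feature importance is measured, so the score used here (the distance between class means divided by the within-class spread) is an assumption chosen for simplicity, and the function names `rank_features` and `select_top_k`, as well as the toy data, are hypothetical rather than taken from the thesis.

```python
from statistics import mean, pstdev

def rank_features(rows, labels):
    """Return feature indices sorted from most to least discriminative.

    Assumed score (not from the thesis): |mean_malware - mean_goodware|
    divided by the summed within-class standard deviations.
    """
    n_features = len(rows[0])
    scores = []
    for j in range(n_features):
        malware = [r[j] for r, y in zip(rows, labels) if y == 1]
        goodware = [r[j] for r, y in zip(rows, labels) if y == 0]
        # Small constant avoids division by zero for constant features.
        spread = pstdev(malware) + pstdev(goodware) + 1e-9
        scores.append(abs(mean(malware) - mean(goodware)) / spread)
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)

def select_top_k(rows, labels, k):
    """Keep the k highest-ranked features; return their indices and the reduced rows."""
    keep = sorted(rank_features(rows, labels)[:k])
    reduced = [[r[j] for j in keep] for r in rows]
    return keep, reduced

# Toy data: label 1 = malware, label 0 = goodware.
# Only the first feature differs between the two classes.
rows = [
    [0.9, 5.0, 1.0],
    [0.8, 4.0, 0.0],
    [0.1, 5.0, 1.0],
    [0.2, 4.0, 0.0],
]
labels = [1, 1, 0, 0]

keep, reduced = select_top_k(rows, labels, k=1)
```

In this toy case only the first column separates the two classes, so it is the one retained. A classifier would then be trained on `reduced` instead of `rows`, which is where the reduction in training time would come from.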
