10 research outputs found

    Machine learning classification for advanced malware detection

    Get PDF
    This introductory document discusses topics related to malware detection via the application of machine learning algorithms. It is intended as a supplement to the published work submitted (a complete list of which can be found in Table 1) and outlines the motivation behind the experiments. The document begins with the following sections: • Section 2 presents a preliminary discussion of the research methodology employed. • Section 3 presents the background analysis of malware detection in general, and the use of machine learning. • Section 4 provides a brief introduction of the most common machine learning algorithms in current use. The remaining sections present the main body of the experimental work, which lead to the conclusions in Section 10. • Section 5 analyzes different initialization strategies for machine learning models, with a view to ensuring that the most effective training and testing strategy is employed. Following this, a purely dynamic approach is proposed, which results in perfect classification of the samples against benign files, and therefore provides a baseline against which the performance of subsequent static approaches can be compared. • Section 6 introduces the static-based tests, beginning with the challenging problem of zero-day detection samples, i.e. malware samples for which not enough data has been gathered yet to train the machine learning models. • Section 7 describes the testing of several different approaches to static malware detection. During these tests, the effectiveness of these algorithms is analyzed and compared with other means of classification. 7 • Section 8 proposes and compares techniques to boost the detection accuracy by combining the scores obtained from other detection algorithms, with a view to improving static classification scores and thus reach the perfect detection obtained with dynamic features. • Section 9 tests the effectiveness of generic malware models by assessing the detection effectiveness of a generic malware model trained on several different families. The experiments are intended to introduce a more realistic scenario where a single, comprehensive, machine learning model is used to detect several families. This Section shows the difficulty to build a single model to detect several malware families

    Malware Classification with Word Embedding Features

    Get PDF
    Malware classification is an important and challenging problem in information security. Modern malware classification techniques rely on machine learning models that can be trained on features such as opcode sequences, API calls, and byte nn-grams, among many others. In this research, we consider opcode features. We implement hybrid machine learning techniques, where we engineer feature vectors by training hidden Markov models -- a technique that we refer to as HMM2Vec -- and Word2Vec embeddings on these opcode sequences. The resulting HMM2Vec and Word2Vec embedding vectors are then used as features for classification algorithms. Specifically, we consider support vector machine (SVM), kk-nearest neighbor (kk-NN), random forest (RF), and convolutional neural network (CNN) classifiers. We conduct substantial experiments over a variety of malware families. Our experiments extend well beyond any previous work in this field

    Malware Classification with Gaussian Mixture Model-Hidden Markov Models

    Get PDF
    Discrete hidden Markov models (HMM) are often applied to the malware detection and classification problems. However, the continuous analog of discrete HMMs, that is, Gaussian mixture model-HMMs (GMM-HMM), are rarely considered in the field of cybersecurity. In this study, we apply GMM-HMMs to the malware classification problem and we compare our results to those obtained using discrete HMMs. As features, we consider opcode sequences and entropy-based sequences. For our opcode features, GMM-HMMs produce results that are comparable to those obtained using discrete HMMs, whereas for our entropy-based features, GMM-HMMs generally improve on the classification results that we can attain with discrete HMMs

    Graphs Resemblance based Software Birthmarks through Data Mining for Piracy Control

    Get PDF
    The emergence of software artifacts greatly emphasizes the need for protecting intellectual property rights (IPR) hampered by software piracy requiring effective measures for software piracy control. Software birthmarking targets to counter ownership theft of software by identifying similarity of their origins. A novice birthmarking approach has been proposed in this paper that is based on hybrid of text-mining and graph-mining techniques. The code elements of a program and their relations with other elements have been identified through their properties (i.e code constructs) and transformed into Graph Manipulation Language (GML). The software birthmarks generated by exploiting the graph theoretic properties (through clustering coefficient) are used for the classifications of similarity or dissimilarity of two programs. The proposed technique has been evaluated over metrics of credibility, resilience, method theft, modified code detection and self-copy detection for programs asserting the effectiveness of proposed approach against software ownership theft. The comparative analysis of proposed approach with contemporary ones shows better results for having properties and relations of program nodes and for employing dynamic techniques of graph mining without adding any overhead (such as increased program size and processing cost)

    Malware Classification Based on Hidden Markov Model and Word2Vec Features

    Get PDF
    Malware classification is an important and challenging problem in information security. Modern malware classification techniques rely on machine learning models that can be trained on a wide variety of features, including opcode sequences, API calls, and byte ��-grams, among many others. In this research, we implement hybrid machine learning techniques, where we train hidden Markov models (HMM) and compute Word2Vec encodings based on opcode sequences. The resulting trained HMMs and Word2Vec embedding vectors are then used as features for classification algorithms. Specifically, we consider support vector machine (SVM), ��-nearest neighbor (��-NN), random forest (RF), and deep neural network (DNN) classifiers. We conduct substantial experiments over a variety of malware families. Our results surpass those of comparable classification experiments

    Getting to the root of the problem: A detailed comparison of kernel and user level data for dynamic malware analysis

    Get PDF
    Dynamic malware analysis is fast gaining popularity over static analysis since it is not easily defeated by evasion tactics such as obfuscation and polymorphism. During dynamic analysis it is common practice to capture the system calls that are made to better understand the behaviour of malware. There are several techniques to capture system calls, the most popular of which is a user-level hook. To study the effects of collecting system calls at different privilege levels and viewpoints, we collected data at a process-specific user-level using a virtualised sandbox environment and a system-wide kernel-level using a custom-built kernel driver. We then tested the performance of several state-of-the-art machine learning classifiers on the data. Random Forest was the best performing classifier with an accuracy of 95.2% for the kernel driver and 94.0% at a user-level. The combination of user and kernel level data gave the best classification results with an accuracy of 96.0% for Random Forest. This may seem intuitive but was hitherto not empirically demonstrated. Additionally, we observed that machine learning algorithms trained on data from the user-level tended to use the anti-debug/anti-vm features in malware to distinguish it from benignware. Whereas, when trained on data from our kernel driver, machine learning algorithms seemed to use the differences in the general behaviour of the system to make their prediction, which explains why they complement each other so well. Our results show that capturing data at different privilege levels will affect the classifier's ability to detect malware, with kernel-level providing more utility than user-level for malware classification. Despite this, there exist more established user-level tools than kernel-level tools, suggesting more research effort should be directed at kernel-level. In short, this paper provides the first objective, evidence-based comparison of user and kernel level data for the purposes of malware classification

    Comparing the utility of user-level and kernel-level data for dynamic malware analysis

    Get PDF
    Dynamic malware analysis is fast gaining popularity over static analysis since it is not easily defeated by evasion tactics such as obfuscation and polymorphism. During dynamic analysis, it is common practice to capture the system calls that are made to better understand the behaviour of malware. System calls are captured by hooking certain structures in the Operating System. There are several hooking techniques that broadly fall into two categories, those that run at user-level and those that run at kernel level. User-level hooks are currently more popular despite there being no evidence that they are better suited to detecting malware. The focus in much of the literature surrounding dynamic malware analysis is on the data analysis method over the data capturing method. This thesis, on the other hand, seeks to ascertain if the level at which data is captured affects the ability of a detector to identify malware. This is important because if the data captured by the hooking method most commonly used is sub-optimal, the machine learning classifier can only go so far. To study the effects of collecting system calls at different privilege levels and viewpoints, data was collected at a process-specific user-level using a virtualised sandbox environment and a systemwide kernel-level using a custom-built kernel driver for all experiments in this thesis. The experiments conducted in this thesis showed kernel-level data to be marginally better for detecting malware than user-level data. Further analysis revealed that the behaviour of malware used to differentiate it differed based on the data given to the classifiers. When trained on user-level data, classifiers used the evasive features of malware to differentiate it from benignware. These are the very features that malware uses to avoid detection. When trained on kernel-level data, the classifiers preferred to use the general behaviour of malware to differentiate it from benignware. The implications of this were witnessed when the classifiers trained on user-level and kernel-level data were made to classify malware that had been stripped of its evasive properties. Classifiers trained on user-level data could not detect malware that only possessed malicious attributes. While classifiers trained on kernel-level data were unable to detect malware that did not exhibit the amount of general activity they expected in malware. This research highlights the importance of giving careful consideration to the hooking methodology employed to collect data, since it not only affects the classification results, but a classifier’s understanding of malware
    corecore