783 research outputs found

    Machine Learning Aided Static Malware Analysis: A Survey and Tutorial

    Malware analysis and detection techniques have evolved over the last decade in response to the development of malware techniques that evade network-based and host-based security protections. The fast growth in the variety and number of malware species has made it very difficult for forensic investigators to provide a timely response. Therefore, Machine Learning (ML) aided malware analysis became a necessity to automate different aspects of static and dynamic malware investigation. We believe that machine learning aided static analysis can be used as a methodological approach in technical Cyber Threat Intelligence (CTI) rather than resource-consuming dynamic malware analysis, which has been thoroughly studied before. In this paper, we address this research gap by conducting an in-depth survey of different machine learning methods for classification of static characteristics of 32-bit malicious Portable Executable (PE32) Windows files and develop a taxonomy for better understanding of these techniques. Afterwards, we offer a tutorial on how different machine learning techniques can be utilized in the extraction and analysis of a variety of static characteristics of PE binaries and evaluate the accuracy and practical generalization of these techniques. Finally, the results of an experimental study of all the methods on common data are given to demonstrate their accuracy and complexity. This paper may serve as a stepping stone for future researchers in the cross-disciplinary field of machine learning aided malware forensics. Comment: 37 pages

    Survey of Machine Learning Techniques for Malware Analysis

    Coping with malware is getting more and more challenging, given its relentless growth in complexity and volume. One of the most common approaches in the literature is to use machine learning techniques to automatically learn models and patterns behind such complexity, and to develop technologies for keeping pace with the speed of development of novel malware. This survey aims to provide an overview of the way machine learning has been used so far in the context of malware analysis. We systematize surveyed papers according to their objectives (i.e., the expected output, what the analysis aims to achieve), what information about malware they specifically use (i.e., the features), and what machine learning techniques they employ (i.e., what algorithm is used to process the input and produce the output). We also outline a number of problems concerning the datasets used in the considered works, and finally introduce the novel concept of malware analysis economics, regarding the study of existing tradeoffs among key metrics, such as analysis accuracy and economic costs.

    pDroid

    When an end user attempts to download an app from the Google Play Store, they receive two related items that can be used to assess the potential threats of an application: the list of permissions used by the application and the textual description of the application. However, this raises several concerns. First, applications tend to use more permissions than they need, and end users are not tech-savvy enough to fully understand the security risks. Therefore, it is challenging to fully assess the threats of an application from the permissions alone. Second, most textual descriptions do not clearly explain why the application needs a particular permission. Together, these two issues make it difficult for end users to accurately assess the security threats of an application. This has led to a demand for a framework that can accurately determine whether a textual description adequately describes the actual behavior of an application. In this Master's thesis, we present pDroid (short for privateDroid), a market-independent framework that can compare an Android application’s textual description to its internal behavior. We evaluated pDroid using 1562 benign apps and 243 malware samples, and pDroid correctly classified 91.4% of malware with a false positive rate of 4.9%.
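The two reported figures are a detection rate (true positive rate) over the malware samples and a false positive rate over the benign apps. The minimal sketch below shows how such rates follow from raw counts. The counts 222 and 77 are assumptions, back-calculated so that they round to the reported percentages; the abstract does not state them, and the function name is hypothetical.

```python
def detection_rates(true_pos, total_malware, false_pos, total_benign):
    """Return (true positive rate, false positive rate) as fractions."""
    tpr = true_pos / total_malware   # share of malware correctly flagged
    fpr = false_pos / total_benign   # share of benign apps wrongly flagged
    return tpr, fpr


# Sample sizes come from the abstract; 222 and 77 are ASSUMED counts
# chosen only because they round to the reported rates.
tpr, fpr = detection_rates(true_pos=222, total_malware=243,
                           false_pos=77, total_benign=1562)
print(f"detection rate: {tpr:.1%}, false positive rate: {fpr:.1%}")
```

Because the benign set is much larger than the malware set, even a low false positive rate can correspond to a noticeable number of benign apps being flagged.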

    Malware variant identification using incremental clustering

    Dynamic analysis and pattern matching techniques are widely used in industry, and they provide a straightforward method for the identification of malware samples. Yara is a pattern matching technique that can use sandbox memory dumps for the identification of malware families. However, pattern matching techniques fail silently when faced with minor code variations, leaving malware samples unidentified. This paper presents a two-layered Malware Variant Identification using Incremental Clustering (MVIIC) process and proposes clustering of unidentified malware samples to enable the identification of malware variants and new malware families. A novel incremental clustering algorithm is used to identify new malware variants among the unidentified malware samples. This research shows that clustering can provide a higher level of performance than Yara rules, and that clustering is resistant to small changes introduced by malware variants. This paper proposes a hybrid approach in which Yara scanning eliminates known malware and clustering then identifies new malware variants. F1 score and V-Measure clustering metrics are used to evaluate our results.
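For readers unfamiliar with the V-Measure named above, the following minimal sketch computes it in its standard form, as the harmonic mean of homogeneity and completeness. It is an illustration of the metric only, not the MVIIC implementation; the function names and the toy family labels are hypothetical.

```python
from collections import Counter
from math import log


def _entropy(labels):
    """Shannon entropy H(labels) in nats."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def _conditional_entropy(target, given):
    """Conditional entropy H(target | given) in nats."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum((c / n) * log(c / given_counts[g])
                for (g, _), c in joint.items())


def v_measure(families, clusters):
    """Harmonic mean of homogeneity and completeness (equal weighting)."""
    h_fam = _entropy(families)
    h_clu = _entropy(clusters)
    # Homogeneity: each cluster contains samples of a single family.
    homogeneity = 1.0 if h_fam == 0 else (
        1.0 - _conditional_entropy(families, clusters) / h_fam)
    # Completeness: each family is assigned to a single cluster.
    completeness = 1.0 if h_clu == 0 else (
        1.0 - _conditional_entropy(clusters, families) / h_clu)
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)


# Toy example: hypothetical family labels against cluster ids.
true_families = ["famA", "famA", "famB", "famB"]
print(v_measure(true_families, [0, 0, 1, 1]))  # clusters match families
print(v_measure(true_families, [0, 0, 0, 0]))  # everything in one cluster
```

The score depends only on how samples are grouped, not on the cluster ids themselves, which suits a setting where clusters of previously unidentified samples have no predefined names.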

    Feature Selection and Improving Classification Performance for Malware Detection

    The ubiquitous advance of technology has been conducive to the proliferation of cyber threats, resulting in attacks that have grown exponentially. Consequently, researchers have developed models based on machine learning algorithms for detecting malware. However, these methods require a significant number of extracted features for correct malware classification, so feature extraction, training, and testing take significant time; moreover, which features are most important for correct classification has not been explored. In this thesis, a dataset of malware and clean files (goodware) is created and analyzed from the static and dynamic features provided by the online framework VirusTotal. The purpose was to select the smallest number of features that keeps the classification accuracy as high as in state-of-the-art research. Selecting the most representative features for malware detection is motivated by the possibility of reducing the training time, given that it grows as O(n^2) with respect to the number of features, and of creating an embedded program that monitors processes executed by the OS. Thus, feature selection was performed by taking the most important features. In addition, classification algorithms such as Random Forest, Support Vector Machine and Neural Networks were used in a novel combination that not only showed an increase in accuracy, but also reduced training time from hours to just minutes. Next, the model was tested on one additional dataset of unseen malware files. Results showed that 9 features were enough to distinguish malware from goodware files with an accuracy of 99.60%.
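To make the quadratic growth of training time in the number of features concrete, the short sketch below estimates the relative saving from shrinking a feature set. It assumes an idealized cost model proportional to n squared, ignoring constants and lower-order terms. The starting size of 100 features is hypothetical, since the abstract does not state the original feature count, and the function name is illustrative.

```python
def relative_training_cost(n_features_before, n_features_after):
    """Ratio of training costs under an idealized cost ~ n**2 model."""
    return (n_features_before / n_features_after) ** 2


# HYPOTHETICAL starting point of 100 features, reduced to the 9 features
# reported in the abstract.
speedup = relative_training_cost(100, 9)
print(f"estimated speedup under a quadratic model: about {speedup:.0f}x")
```

Under this model, halving the number of features cuts the estimated training cost by a factor of four, which is in line with the abstract's report of training time dropping from hours to minutes.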