141 research outputs found
An investigation of a deep learning based malware detection system
We investigate a Deep Learning based system for malware detection. In the
investigation, we experiment with different combination of Deep Learning
architectures including Auto-Encoders, and Deep Neural Networks with varying
layers over Malicia malware dataset on which earlier studies have obtained an
accuracy of (98%) with an acceptable False Positive Rates (1.07%). But these
results were done using extensive man-made custom domain features and investing
corresponding feature engineering and design efforts. In our proposed approach,
besides improving the previous best results (99.21% accuracy and a False
Positive Rate of 0.19%) indicates that Deep Learning based systems could
deliver an effective defense against malware. Since it is good in automatically
extracting higher conceptual features from the data, Deep Learning based
systems could provide an effective, general and scalable mechanism for
detection of existing and unknown malware.Comment: 13 Pages, 4 figure
Automatic Malware Detection
The problem of automatic malware detection presents challenges for antivirus vendors. Since the manual investigation is not possible due to the massive number of samples being submitted every day, automatic malware classication is necessary. Our work is focused on an automatic malware detection framework based on machine learning algorithms. We proposed several static malware detection systems for the Windows operating system to achieve the primary goal of distinguishing between malware and benign software. We also considered the more practical goal of detecting as much malware as possible while maintaining a suciently low false positive rate. We proposed several malware detection systems using various machine learning techniques, such as ensemble classier, recurrent neural network, and distance metric learning. We designed architectures of the proposed detection systems, which are automatic in the sense that extraction of features, preprocessing, training, and evaluating the detection model can be automated. However, antivirus program relies on more complex system that consists of many components where several of them depends on malware analysts and researchers. Malware authors adapt their malicious programs frequently in order to bypass antivirus programs that are regularly updated. Our proposed detection systems are not automatic in the sense that they are not able to automatically adapt to detect the newest malware. However, we can partly solve this problem by running our proposed systems again if the training set contains the newest malware. Our work relied on static analysis only. In this thesis, we discuss advantages and drawbacks in comparison to dynamic analysis. Static analysis still plays an important role, and it is used as one component of a complex detection system.The problem of automatic malware detection presents challenges for antivirus vendors. Since the manual investigation is not possible due to the massive number of samples being submitted every day, automatic malware classication is necessary. Our work is focused on an automatic malware detection framework based on machine learning algorithms. We proposed several static malware detection systems for the Windows operating system to achieve the primary goal of distinguishing between malware and benign software. We also considered the more practical goal of detecting as much malware as possible while maintaining a suciently low false positive rate. We proposed several malware detection systems using various machine learning techniques, such as ensemble classier, recurrent neural network, and distance metric learning. We designed architectures of the proposed detection systems, which are automatic in the sense that extraction of features, preprocessing, training, and evaluating the detection model can be automated. However, antivirus program relies on more complex system that consists of many components where several of them depends on malware analysts and researchers. Malware authors adapt their malicious programs frequently in order to bypass antivirus programs that are regularly updated. Our proposed detection systems are not automatic in the sense that they are not able to automatically adapt to detect the newest malware. However, we can partly solve this problem by running our proposed systems again if the training set contains the newest malware. Our work relied on static analysis only. In this thesis, we discuss advantages and drawbacks in comparison to dynamic analysis. Static analysis still plays an important role, and it is used as one component of a complex detection system
Malgazer: An Automated Malware Classifier With Running Window Entropy and Machine Learning
This dissertation explores functional malware classification using running window entropy and machine learning classifiers. This topic was under researched in the prior literature, but the implications are important for malware defense. This dissertation will present six new design science artifacts. The first artifact was a generalized machine learning based malware classifier model. This model was used to categorize and explain the gaps in the prior literature. This artifact was also used to compare the prior literature to the classifiers created in this dissertation, herein referred to as “Malgazer” classifiers.
Running window entropy data was required, but the algorithm was too slow to compute at scale. This dissertation presents an optimized version of the algorithm that requires less than 2% of the time of the original algorithm. Next, the classifications for the malware samples were required, but there was no one unified and consistent source for this information. One of the design science artifacts was the method to determine the classifications from publicly available resources.
Once the running window entropy data was computed and the functional classifications were collected, the machine learning algorithms were trained at scale so that one individual could complete over 200 computationally intensive experiments for this dissertation. The method to scale the computations was an instantiation design science artifact. The trained classifiers were another design science artifact. Lastly, a web application was developed so that the classifiers could be utilized by those without a programming background. This was the last design science artifact created by this research.
Once the classifiers were developed, they were compared to prior literature theoretically and empirically. A malware classification method from prior literature was chosen (referred to herein as “GIST”) for an empirical comparison to the Malgazer classifiers. The best Malgazer classifier produced an accuracy of approximately 95%, which was around 0.76% more accurate than the GIST method on the same data sets. Then, the Malgazer classifier was compared to the prior literature theoretically, based upon the empirical analysis with GIST, and Malgazer performed at least as well as the prior literature. While the data, methods, and source code are open sourced from this research, most prior literature did not provide enough information or data to replicate and verify each method. This prevented a full and true comparison to prior literature, but it did not prevent recommending the Malgazer classifier for some use cases
Large Scale Malware Analysis, Detection and Signature Generation.
As the primary vehicle for most organized cybercrimes, malicious software (or malware) has become one of the most serious threats to computer systems and the Internet. With the recent advent of automated malware development toolkits, it has become relatively easy, even for marginally skilled adversaries, to create and mutate malware, bypassing Anti-Virus (AV) detection. This has led to a surge in the number of new malware threats and has created several major challenges for the AV industry. AV companies typically receive tens of thousands of suspicious samples daily. However, the overwhelming number of new malware easily overtax the available human resources at AV companies, making them less responsive to emerging threats and leading to poor detection rates. To address these issues, this dissertation proposes several new and scalable systems to facilitate malware analysis and detection, with the focus on a central theme: ``automation and scalability".
This dissertation makes four primary contributions. First, it builds a large-scale malware database management system called SMIT that addresses the challenges of determining whether a suspicious sample is indeed malicious. SMIT exploits the insight that most new malicious samples are simple syntactic variations of existing malware. Thus, one way to ascertain the maliciousness of an unknown sample is to check if it is sufficiently similar to any existing malware. SMIT is designed to make such decisions efficiently using malware's function call graph---a high-level structural representation that is less susceptible to the low-level obfuscation employed by malware writers to evade detection. Second, the dissertation develops an automatic malware clustering system called MutantX. By quickly grouping similar samples into clusters, MutantX allows malware analysts to focus on representative samples and automatically generate labels based on samples’ association with existing groups. Third, this dissertation introduces a signature-generation system, called Hancock, that automatically creates high-quality string signatures with extremely low false-positive rates. Finally, observing that two widely used malware analysis approaches---i.e., static and dynamic analyses---have their respective pros and cons, this dissertation proposes a novel system that optimally integrates static-feature and dynamic-behavior based malware clusterings, mitigating their respective shortcomings without losing their merits.Ph.D.Computer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/89760/1/huxin_1.pd
An integrated malware detection and classification system
This thesis is to develop effective and efficient methodologies which can be applied to continuously improve the performance of detection and classification on malware collected over an extended period of time. The robustness of the proposed methodologies has been tested on malware collected over 2003-2010.<br /
- …