175 research outputs found

    Machine Learning Interpretability in Malware Detection

    Get PDF
    The ever increasing processing power of modern computers, as well as the increased availability of large and complex data sets, has led to an explosion in machine learning research. This has led to increasingly complex machine learning algorithms, such as Convolutional Neural Networks, with increasingly complex applications, such as malware detection. Recently, malware authors have become increasingly successful in bypassing traditional malware detection methods, partly due to advanced evasion techniques such as obfuscation and server-side polymorphism. Further, new programming paradigms such as fileless malware, that is malware that exist only in the main memory (RAM) of the infected host, add to the challenges faced with modern day malware detection. This has led security specialists to turn to machine learning to augment their malware detection systems. However, with this new technology comes new challenges. One of these challenges is the need for interpretability in machine learning. Machine learning interpretability is the process of giving explanations of a machine learning model\u27s predictions to humans. Rather than trying to understand everything that is learnt by the model, it is an attempt to find intuitive explanations which are simple enough and provide relevant information for downstream tasks. Cybersecurity analysts always prefer interpretable solutions because of the need to fine tune these solutions. If malware analysts can\u27t interpret the reason behind a misclassification, they will not accept the non-interpretable or black box detector. In this thesis, we provide an overview of machine learning and discuss its roll in cyber security, the challenges it faces, and potential improvements to current approaches in the literature. We showcase its necessity as a result of new computing paradigms by implementing a proof of concept fileless malware with JavaScript. We then present techniques for interpreting machine learning based detectors which leverage n-gram analysis and put forward a novel and fully interpretable approach for malware detection which uses convolutional neural networks. We also define a novel approach for evaluating the robustness of a machine learning based detector

    A Survey on Malware Detection with Graph Representation Learning

    Full text link
    Malware detection has become a major concern due to the increasing number and complexity of malware. Traditional detection methods based on signatures and heuristics are used for malware detection, but unfortunately, they suffer from poor generalization to unknown attacks and can be easily circumvented using obfuscation techniques. In recent years, Machine Learning (ML) and notably Deep Learning (DL) achieved impressive results in malware detection by learning useful representations from data and have become a solution preferred over traditional methods. More recently, the application of such techniques on graph-structured data has achieved state-of-the-art performance in various domains and demonstrates promising results in learning more robust representations from malware. Yet, no literature review focusing on graph-based deep learning for malware detection exists. In this survey, we provide an in-depth literature review to summarize and unify existing works under the common approaches and architectures. We notably demonstrate that Graph Neural Networks (GNNs) reach competitive results in learning robust embeddings from malware represented as expressive graph structures, leading to an efficient detection by downstream classifiers. This paper also reviews adversarial attacks that are utilized to fool graph-based detection methods. Challenges and future research directions are discussed at the end of the paper.Comment: Preprint, submitted to ACM Computing Surveys on March 2023. For any suggestions or improvements, please contact me directly by e-mai

    Formalizing evasion attacks against machine learning security detectors

    Get PDF
    Recent work has shown that adversarial examples can bypass machine learning-based threat detectors relying on static analysis by applying minimal perturbations. To preserve malicious functionality, previous attacks either apply trivial manipulations (e.g. padding), potentially limiting their effectiveness, or require running computationally-demanding validation steps to discard adversarial variants that do not correctly execute in sandbox environments. While machine learning systems for detecting SQL injections have been proposed in the literature, no attacks have been tested against the proposed solutions to assess the effectiveness and robustness of these methods. In this thesis, we overcome these limitations by developing RAMEn, a unifying framework that (i) can express attacks for different domains, (ii) generalizes previous attacks against machine learning models, and (iii) uses functions that preserve the functionality of manipulated objects. We provide new attacks for both Windows malware and SQL injection detection scenarios by exploiting the format used for representing these objects. To show the efficacy of RAMEn, we provide experimental results of our strategies in both white-box and black-box settings. The white-box attacks against Windows malware detectors show that it takes only the 2% of the input size of the target to evade detection with ease. To further speed up the black-box attacks, we overcome the issues mentioned before by presenting a novel family of black-box attacks that are both query-efficient and functionality-preserving, as they rely on the injection of benign content, which will never be executed, either at the end of the malicious file, or within some newly-created sections, encoded in an algorithm called GAMMA. We also evaluate whether GAMMA transfers to other commercial antivirus solutions, and surprisingly find that it can evade many commercial antivirus engines. For evading SQLi detectors, we create WAF-A-MoLE, a mutational fuzzer that that exploits random mutations of the input samples, keeping alive only the most promising ones. WAF-A-MoLE is capable of defeating detectors built with different architectures by using the novel practical manipulations we have proposed. To facilitate reproducibility and future work, we open-source our framework and corresponding attack implementations. We conclude by discussing the limitations of current machine learning-based malware detectors, along with potential mitigation strategies based on embedding domain knowledge coming from subject-matter experts naturally into the learning process

    Bridging statistical learning and formal reasoning for cyber attack detection

    Get PDF
    Current cyber-infrastructures are facing increasingly stealthy attacks that implant malicious payloads under the cover of benign programs. Current attack detection approaches based on statistical learning methods may generate misleading decision boundaries when processing noisy data with such a mixture of benign and malicious behaviors. On the other hand, attack detection based on formal program analysis may lack completeness or adaptivity when modeling attack behaviors. In light of these limitations, we have developed LEAPS, an attack detection system based on supervised statistical learning to classify benign and malicious system events. Furthermore, we leverage control flow graphs inferred from the system event logs to enable automatic pruning of the training data, which leads to a more accurate classification model when applied to the testing data. Our extensive evaluation shows that, compared with pure statistical learning models, LEAPS achieves consistently higher accuracy when detecting real-world camouflaged attackswith benign program cover-up

    Detecting malware with information complexity

    Get PDF
    Malware concealment is the predominant strategy for malware propagation. Black hats create variants of malware based on polymorphism and metamorphism. Malware variants, by definition, share some information. Although the concealment strategy alters this information, there are still patterns on the software. Given a zoo of labelled malware and benign-ware, we ask whether a suspect program is more similar to our malware or to our benign-ware. Normalized Compression Distance (NCD) is a generic metric that measures the shared information content of two strings. This measure opens a new front in the malware arms race, one where the countermeasures promise to be more costly for malware writers, who must now obfuscate patterns as strings qua strings, without reference to execution, in their variants. Our approach classifies disk-resident malware with 97.4% accuracy and a false positive rate of 3%. We demonstrate that its accuracy can be improved by combining NCD with the compressibility rates of executables using decision forests, paving the way for future improvements. We demonstrate that malware reported within a narrow time frame of a few days is more homogeneous than malware reported over two years, but that our method still classifies the latter with 95.2% accuracy and a 5% false positive rate. Due to its use of compression, the time and computation cost of our method is nontrivial. We show that simple approximation techniques can improve its running time by up to 63%. We compare our results to the results of applying the 59 anti-malware programs used on the VirusTotal website to our malware. Our approach outperforms each one used alone and matches that of all of them used collectively
    • 

    corecore