Search CORE

4 research outputs found

Data augmentation and transfer learning to classify malware images in a deep learning context

Author: Dalla Preda M
Giacobazzi R
Marastoni N
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2021
Field of study

In the past few years, malware classification techniques have shifted from shallow traditional machine learning models to deeper neural network architectures. The main benefit of some of these is the ability to work with raw data, guaranteed by their automatic feature extraction capabilities. This results in less technical expertise needed while building the models, thus less initial pre-processing resources. Nevertheless, such advantage comes with its drawbacks, since deep learning models require huge quantities of data in order to generate a model that generalizes well. The amount of data required to train a deep network without overfitting is often unobtainable for malware analysts. We take inspiration from image-based data augmentation techniques and apply a sequence of semantics-preserving syntactic code transformations (obfuscations) to a small dataset of programs to generate a larger dataset. We then design two learning models, a convolutional neural network and a bi-directional long short-term memory, and we train them on images extracted from compiled binaries of the newly generated dataset. Through transfer learning we then take the features learned from the obfuscated binaries and train the models against two state of the art malware datasets, each containing around 10 000 samples. Our models easily achieve up to 98.5% accuracy on the test set, which is on par or better than the present state of the art approaches, thus validating the approach

Catalogo dei prodotti della ricerca

Open Access Repository

Program Similarity Analysis for Malware Classification and its Pitfalls

Author: Marastoni Niccolo'
Publication venue
Publication date: 01/01/2021
Field of study

Malware classification, specifically the task of grouping malware samples into families according to their behaviour, is vital in order to understand the threat they pose and how to protect against them. Recognizing whether one program shares behaviors with another is a task that requires semantic reasoning, meaning that it needs to consider what a program actually does. This is a famously uncomputable problem, due to Rice\u2019s theorem. As there is no one-size-fits-all solution, determining program similarity in the context of malware classification requires different tools and methods depending on what is available to the malware defender. When the malware source code is readily available (or at least, easy to retrieve), most approaches employ semantic \u201cabstractions\u201d, which are computable approximations of the semantics of the program. We consider this the first scenario for this thesis: malware classification using semantic abstractions extracted from the source code in an open system. Structural features, such as the control flow graphs of programs, can be used to classify malware reasonably well. To demonstrate this, we build a tool for malware analysis, R.E.H.A. which targets the Android system and leverages its openness to extract a structural feature from the source code of malware samples. This tool is first successfully evaluated against a state of the art malware dataset and then on a newly collected dataset. We show that R.E.H.A. is able to classify the new samples into their respective families, often outperforming commercial antivirus software. However, abstractions have limitations by virtue of being approximations. We show that by increasing the granularity of the abstractions used to produce more fine-grained features, we can improve the accuracy of the results as in our second tool, StranDroid, which generates fewer false positives on the same datasets. The source code of malware samples is not often available or easily retrievable. For this reason, we introduce a second scenario in which the classification must be carried out with only the compiled binaries of malware samples on hand. Program similarity in this context cannot be done using semantic abstractions as before, since it is difficult to create meaningful abstractions from zeros and ones. Instead, by treating the compiled programs as raw data, we transform them into images and build upon common image classification algorithms using machine learning. This led us to develop novel deep learning models, a convolutional neural network and a long short-term memory, to classify the samples into their respective families. To overcome the usual obstacle of deep learning of lacking sufficiently large and balanced datasets, we utilize obfuscations as a data augmentation tool to generate semantically equivalent variants of existing samples and expand the dataset as needed. Finally, to lower the computational cost of the training process, we use transfer learning and show that a model trained on one dataset can be used to successfully classify samples in different malware datasets. The third scenario explored in this thesis assumes that even the binary itself cannot be accessed for analysis, but it can be executed, and the execution traces can then be used to extract semantic properties. However, dynamic analysis lacks the formal tools and frameworks that exist in static analysis to allow proving the effectiveness of obfuscations. For this reason, the focus shifts to building a novel formal framework that is able to assess the potency of obfuscations against dynamic analysis. We validate the new framework by using it to encode known analyses and obfuscations, and show how these obfuscations actually hinder the dynamic analysis process

Catalogo dei prodotti della ricerca

Developing Robust Models, Algorithms, Databases and Tools With Applications to Cybersecurity and Healthcare

Author: Freitas Scott
Publication venue: Georgia Institute of Technology
Publication date: 14/01/2022
Field of study

As society and technology becomes increasingly interconnected, so does the threat landscape. Once isolated threats now pose serious concerns to highly interdependent systems, highlighting the fundamental need for robust machine learning. This dissertation contributes novel tools, algorithms, databases, and models—through the lens of robust machine learning—in a research effort to solve large-scale societal problems affecting millions of people in the areas of cybersecurity and healthcare. (1) Tools: We develop TIGER, the first comprehensive graph robustness toolbox; and our ROBUSTNESS SURVEY identifies critical yet missing areas of graph robustness research. (2) Algorithms: Our survey and toolbox reveal existing work has overlooked lateral attacks on computer authentication networks. We develop D2M, the first algorithmic framework to quantify and mitigate network vulnerability to lateral attacks by modeling lateral attack movement from a graph theoretic perspective. (3) Databases: To prevent lateral attacks altogether, we develop MALNET-GRAPH, the world’s largest cybersecurity graph database—containing over 1.2M graphs across 696 classes—and show the first large-scale results demonstrating the effectiveness of malware detection through a graph medium. We extend MALNET-GRAPH by constructing the largest binary-image cybersecurity database—containing 1.2M images, 133×more images than the only other public database—enabling new discoveries in malware detection and classification research restricted to a few industry labs (MALNET-IMAGE). (4) Models: To protect systems from adversarial attacks, we develop UNMASK, the first model that flags semantic incoherence in computer vision systems, which detects up to 96.75% of attacks, and defends the model by correctly classifying up to 93% of attacks. Inspired by UNMASK’s ability to protect computer visions systems from adversarial attack, we develop REST, which creates noise robust models through a novel combination of adversarial training, spectral regularization, and sparsity regularization. In the presence of noise, our method improves state-of-the-art sleep stage scoring by 71%—allowing us to diagnose sleep disorders earlier on and in the home environment—while using 19× less parameters and 15×less MFLOPS. Our work has made significant impact to industry and society: the UNMASK framework laid the foundation for a multi-million dollar DARPA GARD award; the TIGER toolbox for graph robustness analysis is a part of the Nvidia Data Science Teaching Kit, available to educators around the world; we released MALNET, the world’s largest graph classification database with 1.2M graphs; and the D2M framework has had major impact to Microsoft products, inspiring changes to the product’s approach to lateral attack detection.Ph.D

Scholarly Materials And Research @ Georgia Tech

Neural malware analysis with attention mechanism

Author: Abrams
Ahmadi
Alazab
Alrabaee
Anderson
Bahdanau
Buhov
Canzanese
Chen
Cordonsky
Dahl
Dam
David
Dovom
Dullein
Farhadi
Gao
García-Cervigón
Goel
Goodfellow
Hai
Hardy
He
Hiromu Yakura
Huang
Jacob
Jun Sakuma
Kalash
Khan
Kim
Kingma
Kirat
Kolosnjaji
Le
Lestringant
Liang
Lin
McLaughlin
Moser
Nataraj
Nataraj
Oliva
Oyama
Pascanu
Percival
Raff
Razavian
Reon Nishimura
Saxe
Severi
Shinnosuke Shinozaki
Ståhlberg
Su
Tobiyama
Vyas
Williamson
Wressnegger
Xu
Xu
Yoshihiro Oyama
You
Yuan
Yuan
Zhang
Zhou
Zhuge
Publication venue: 'Elsevier BV'
Publication date
Field of study

Crossref