Master's thesis, Information Security, Universidade de Lisboa, Faculdade de Ciências, 2022
Malware attacks have been one of the most serious cyber risks in recent years. The number of
vulnerability reports in the security communities increases almost every week. One of the key causes of
this exponential growth is that malware authors started introducing mutations to avoid detection.
This means that malicious files from the same malware family, with the same malicious behaviour, are
constantly modified or obfuscated using a variety of techniques to make them appear to be different.
Existing machine learning-based malware categorization algorithms rely on features extracted from raw
binary files or disassembled code. The variety of such features has made it difficult to
develop generic malware categorization methods that perform well across different operating scenarios.
To evaluate and categorize such enormous volumes of data effectively, samples must be divided
into groups and assigned to their respective families based on their behaviour. Malicious
software is converted to a greyscale image representation because this captures subtle changes
while preserving the global structure, which helps to detect variants. Motivated by the machine learning
results achieved in the ImageNet challenge, this dissertation proposes an agnostic deep learning solution
for efficiently classifying malware into families based on a collection of discriminant patterns retrieved
from its visualization as images.
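The byte-to-pixel conversion described above can be illustrated with a minimal sketch. It is not the implementation used in the thesis: the function names `bytes_to_greyscale` and `to_pgm`, the fixed row width of 256 pixels, the zero padding of the last row, and the choice of the PGM output format are all assumptions made for illustration.

```python
from typing import List


def bytes_to_greyscale(data: bytes, width: int = 256) -> List[List[int]]:
    """Map each byte (0-255) to one greyscale pixel, filling rows left to right.

    The row width is an illustrative assumption, not a value from the thesis.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    # Pad the final row with zeros (black) so every row has `width` pixels.
    padding = (-len(data)) % width
    padded = data + bytes(padding)
    return [list(padded[i:i + width]) for i in range(0, len(padded), width)]


def to_pgm(pixels: List[List[int]]) -> bytes:
    """Serialise the pixel rows as a binary PGM (P5) greyscale image."""
    height = len(pixels)
    width = len(pixels[0]) if pixels else 0
    header = f"P5\n{width} {height}\n255\n".encode("ascii")
    return header + b"".join(bytes(row) for row in pixels)
```

Because each byte maps to exactly one pixel, a small local modification of the binary changes only a few pixels, while the overall layout of the image stays recognisable.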
In this thesis, we present Malwizard, an adaptable Python solution suited for companies and end users
that allows them to obtain a fast malware analysis automatically. The solution was implemented
as an Outlook add-in and as an API service for SOAR platforms, since email is the primary vector for this
type of attack and companies are the most attractive targets.
The Microsoft Malware Classification Challenge dataset was used to evaluate the novel
approach. In addition, its image representation was encrypted, and the corresponding encrypted
images were used to evaluate whether the same patterns could still be identified using traditional machine
learning techniques. This would allow privacy concerns to be addressed by keeping the data analysed by
neural networks protected from unauthorized parties.
Experimental comparison demonstrates that the novel approach performed close to the best analysed
model on the plaintext (unencrypted) dataset, while completing the task in one-third of the time. Regarding
the encrypted dataset, classical techniques need to be adapted in order to be efficient.