28 research outputs found

    Effective methods to detect metamorphic malware: A systematic review

    The code of each succeeding generation of metamorphic malware is routinely rewritten to remain stealthy and undetected within infected environments. This characteristic is maintained by means of encryption and decryption methods, obfuscation through garbage code insertion, code transformation and registry modification, which make detection very challenging. The main objective of this study is to contribute an evidence-based narrative demonstrating the effectiveness of recent proposals. Sixteen primary studies were included in this analysis based on a pre-defined protocol. The majority of the reviewed detection methods used Opcode, Control Flow Graph (CFG) and API Call Graph. Key challenges facing the detection of metamorphic malware include code obfuscation, lack of dynamic capabilities to analyse code and application difficulty. Methods were further analysed on the basis of their approach, limitation, empirical evidence and key parameters such as dataset, Detection Rate (DR) and False Positive Rate (FPR).
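
    The two evaluation parameters named above, Detection Rate (DR) and False Positive Rate (FPR), are conventionally derived from confusion-matrix counts. The following is a minimal sketch under that assumption; the function name and the example counts are hypothetical and are not taken from any of the reviewed studies.

```python
def detection_metrics(tp: int, fn: int, fp: int, tn: int) -> tuple[float, float]:
    """Return (detection rate, false positive rate) from confusion-matrix counts.

    tp: malicious samples flagged as malicious
    fn: malicious samples missed
    fp: benign samples flagged as malicious
    tn: benign samples correctly passed
    """
    malicious = tp + fn
    benign = fp + tn
    # Guard against empty classes so the function never divides by zero.
    dr = tp / malicious if malicious else 0.0
    fpr = fp / benign if benign else 0.0
    return dr, fpr


# Hypothetical counts, for illustration only.
dr, fpr = detection_metrics(tp=90, fn=10, fp=5, tn=95)
print(f"DR = {dr:.2%}, FPR = {fpr:.2%}")  # prints "DR = 90.00%, FPR = 5.00%"
```

    Reporting both values together matters because a detector can reach a high DR simply by flagging everything, which the FPR would expose.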

    The Effect of Code Obfuscation on Authorship Attribution of Binary Computer Files

    In many forensic investigations, questions linger regarding the identity of the authors of the software specimen. Research has identified methods for the attribution of binary files that have not been obfuscated, but a significant percentage of malicious software has been obfuscated in an effort to hide both the details of its origin and its true intent. Little research has been done around analyzing obfuscated code for attribution. In part, the reason for this gap in the research is that deobfuscation of an unknown program is a challenging task. Further, the additional transformation of the executable file introduced by the obfuscator modifies or removes features from the original executable that would have been used in the author attribution process. Existing research has demonstrated good success in attributing the authorship of an executable file of unknown provenance using methods based on static analysis of the specimen file. With the addition of file obfuscation, static analysis of files becomes difficult, time consuming, and in some cases, may lead to inaccurate findings. This paper presents a novel process for authorship attribution using dynamic analysis methods. A software emulated system was fully instrumented to become a test harness for a specimen of unknown provenance, allowing for supervised control, monitoring, and trace data collection during execution. This trace data was used as input into a supervised machine learning algorithm trained to identify stylometric differences in the specimen under test and provide predictions on who wrote the specimen. The specimen files were also analyzed for authorship using static analysis methods to compare prediction accuracies with prediction accuracies gathered from this new, dynamic analysis based method. Experiments indicate that this new method can provide better accuracy of author attribution for files of unknown provenance, especially in the case where the specimen file has been obfuscated

    Techniques for the reverse engineering of banking malware

    Malware attacks are a significant and frequently reported problem, adversely affecting the productivity of organisations and governments worldwide. The well-documented consequences of malware attacks include financial loss, data loss, reputation damage, infrastructure damage, theft of intellectual property, compromise of commercial negotiations, and national security risks. Mitigation activities involve a significant amount of manual analysis. Therefore, there is a need for automated techniques for malware analysis to identify malicious behaviours. Research into automated techniques for malware analysis covers a wide range of activities. This thesis consists of a series of studies: an analysis of banking malware families and their common behaviours, an emulated command and control environment for dynamic malware analysis, a technique to identify similar malware functions, and a technique for the detection of ransomware. An analysis of the nature of banking malware, its major malware families, behaviours, variants, and inter-relationships is provided in this thesis. In doing this, this research takes a broad view of malware analysis, starting with the implementation of the malicious behaviours through to detailed analysis using machine learning. The broad approach taken in this thesis differs from some other studies that approach malware research in a more abstract sense. A disadvantage of approaching malware research without domain knowledge is that important methodology questions may not be considered. Large datasets of historical malware samples are available for countermeasures research. However, due to the age of these samples, the original malware infrastructure is no longer available, often restricting malware operations to initialisation functions only. To address this absence, an emulated command and control environment is provided. 
This emulated environment provides full control of the malware, enabling the capabilities of the original in-the-wild operation, while enabling feature extraction for research purposes. A major focus of this thesis has been the development of a machine learning function similarity method with a novel feature encoding that increases feature strength. This research develops techniques to demonstrate that the machine learning model trained on similarity features from one program can find similar functions in another, unrelated program. This finding can lead to the development of generic similar function classifiers that can be packaged and distributed in reverse engineering tools such as IDA Pro and Ghidra. Further, this research examines the use of API call features for the identification of ransomware and shows that a failure to consider malware analysis domain knowledge can lead to weaknesses in experimental design. In this case, we show that existing research has difficulty in discriminating between ransomware and benign cryptographic software. This thesis by publication has developed techniques to advance the discipline of malware reverse engineering, in order to minimize harm due to cyber-attacks on critical infrastructure, government institutions, and industry. Doctor of Philosophy

    Needles in a Haystack: Mining Information from Public Dynamic Analysis Sandboxes for Malware Intelligence

    Malware sandboxes are automated dynamic analysis systems that execute programs in a controlled environment. Within the large volumes of samples submitted every day to these services, some submissions appear to be different from others, and show interesting characteristics. For example, we observed that malware samples involved in famous targeted attacks, like the Regin APT framework or the recently disclosed malware from the Equation Group, were submitted to our sandbox months or even years before they were detected in the wild. In other cases, the malware developers themselves interact with public sandboxes to test their creations or to develop a new evasion technique. We refer to such cases as malware developments. In this paper, we propose a novel methodology to automatically identify malware development cases from the samples submitted to a malware analysis sandbox. The results of our experiments show that, by combining dynamic and static analysis with features based on the file submission, it is possible to achieve a good accuracy in automatically identifying cases of malware development. Our goal is to raise awareness of this problem and of the importance of looking at these samples from an intelligence and threat prevention point of view.

    Software similarity and classification

    This thesis analyses software programs in the context of their similarity to other software programs. Applications proposed and implemented include detecting malicious software and discovering security vulnerabilities

    On Leveraging Next-Generation Deep Learning Techniques for IoT Malware Classification, Family Attribution and Lineage Analysis

    Recent years have witnessed the emergence of new and more sophisticated malware targeting insecure Internet of Things (IoT) devices, as part of orchestrated large-scale botnets. Moreover, the public release of the source code of popular malware families such as Mirai [1] has spawned diverse variants, making it harder to disambiguate their ownership, lineage, and correct label. Such a rapidly evolving landscape also makes it harder to deploy and generalize effective learning models against retired, updated, and/or new threat campaigns. To mitigate such threats, there is an utmost need for effective IoT malware detection, classification and family attribution, which provide essential steps towards initiating attack mitigation/prevention countermeasures, as well as understanding the evolutionary trajectories and tangled relationships of IoT malware. This is particularly challenging due to the lack of fine-grained empirical data about IoT malware, the diverse architectures of IoT-targeted devices, and the massive code reuse between IoT malware families. To address these challenges, in this thesis, we leverage the general lack of obfuscation in IoT malware to extract and combine static features from multi-modal views of the executable binaries (e.g., images, strings, assembly instructions), along with Deep Learning (DL) architectures for effective IoT malware classification and family attribution. Additionally, we aim to address concept drift and the limitations of inter-family classification due to the evolutionary nature of IoT malware, by detecting in-class evolving IoT malware variants and interpreting the meaning behind their mutations. To achieve these objectives, we proceed as follows. First, we analyze 70,000 IoT malware samples collected by a specialized IoT honeypot and popular malware repositories in the past 3 years. 
Consequently, we utilize features extracted from strings- and image-based representations of IoT malware to implement a multi-level DL architecture that fuses the learned features from each sub-component (i.e., images, strings) through a neural network classifier. Our in-depth experiments with four prominent IoT malware families highlight the significant accuracy of the proposed approach (99.78%), which outperforms conventional single-level classifiers, by relying on different representations of the target IoT malware binaries that do not require expensive feature extraction. Additionally, we utilize our IoT-tailored approach for labeling unknown malware samples, while identifying new malware strains. Second, we seek to identify when the classifier shows signs of aging, whereby it fails to effectively recognize new variants and adapt to potential changes in the data. Thus, we introduce a robust and effective method that uses contrastive learning and attentive Transformer models to learn and compare semantically meaningful representations of IoT malware binaries and codes without the need for expensive target labels. We find that the evolution of IoT binaries can be used as an augmentation strategy to learn effective representations to contrast (dis)similar variant pairs. We discuss the impact and findings of our analysis and present several evaluation studies to highlight the tangled relationships of IoT malware, as well as the efficiency of our contrastively learned fine-grained feature vectors in preserving semantics and reducing out-of-vocabulary size in cross-architecture IoT malware binaries. We conclude this thesis by summarizing our findings and discussing research gaps that lay the way for future work.

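    Avaliação da viabilidade de modelos filogenéticos na classificação de aplicações maliciosas [Evaluating the feasibility of phylogenetic models in the classification of malicious applications]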

    Advisor: André Ricardo Abed Grégio. Thesis (Doctorate), Universidade Federal do Paraná, Setor de Ciências Exatas, Programa de Pós-Graduação em Informática. Defense: Curitiba, 03/02/2023. Includes references: p. 150-170. Area of concentration: Computer Science.
    Abstract: Thousands of malicious programs are created, modified with the support of automation tools and released daily on the world wide web. Among these threats, malware are programs specifically designed to interrupt, damage, or gain unauthorized access to a system or device. To facilitate the identification and categorization of common behaviors, structures and other characteristics of malware, enabling the development of defense solutions, there are analysis strategies that classify malware into groups known as families. One of these strategies is Phylogeny, a technique borrowed from Biology, which investigates the historical and evolutionary relationship of a species or other group of elements. In addition, the use of clustering techniques on similar sets facilitates reverse engineering tasks for the analysis of unknown variants. A variant refers to a new version of malicious code that is created from modifications of existing malware. The present work investigates the feasibility of using phylogenies and clustering methods in the classification of malware variants for the Android platform. Initially, 82 related works were analyzed to verify the experiment configurations of the state of the art. 
After this study, four experiments were carried out to evaluate the use of similarity measures and clustering algorithms in the classification of variants and in the similarity analysis between families. In addition to these experiments, a Flow of Activities for Malware Grouping with five distinct phases was proposed. This flow is intended to help define parameters for clustering techniques, including similarity measures, the type of clustering algorithm to be used and feature selection. After defining the flow of activities, the Androidgyny framework was proposed as a proof of concept: a prototype for sample analysis, feature extraction and classification of variants based on medoids (the representative elements of each group) and unique features of known families. To validate Androidgyny, two experiments were carried out: a comparison with the related tool Gefdroid, and another with samples of the 25 most populous families in the Androzoo dataset.
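
    Medoid-based classification of variants, as mentioned in the abstract above, can be illustrated with a small sketch. It assumes that each sample is reduced to a set of features and that samples are compared with Jaccard distance. Both are illustrative choices, and the family names, feature strings and helper functions are hypothetical. The abstract does not specify Androidgyny's actual features or similarity measure, and the sketch leaves out its use of unique family features.

```python
from typing import Dict, FrozenSet, List

FeatureSet = FrozenSet[str]


def jaccard_distance(a: FeatureSet, b: FeatureSet) -> float:
    """1 - |a ∩ b| / |a ∪ b|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def medoid(samples: List[FeatureSet]) -> FeatureSet:
    """Return the sample with the smallest total distance to all others."""
    return min(samples, key=lambda s: sum(jaccard_distance(s, o) for o in samples))


def classify(sample: FeatureSet, medoids: Dict[str, FeatureSet]) -> str:
    """Assign the sample to the family whose medoid is nearest."""
    return min(medoids, key=lambda fam: jaccard_distance(sample, medoids[fam]))


# Hypothetical families described by hypothetical permission-style features.
families = {
    "family_a": [
        frozenset({"SEND_SMS", "READ_CONTACTS", "INTERNET"}),
        frozenset({"SEND_SMS", "INTERNET"}),
        frozenset({"SEND_SMS", "READ_CONTACTS", "INTERNET", "READ_SMS"}),
    ],
    "family_b": [
        frozenset({"CAMERA", "RECORD_AUDIO", "INTERNET"}),
        frozenset({"CAMERA", "RECORD_AUDIO"}),
        frozenset({"CAMERA", "RECORD_AUDIO", "INTERNET", "ACCESS_FINE_LOCATION"}),
    ],
}
medoids = {name: medoid(samples) for name, samples in families.items()}

unknown = frozenset({"SEND_SMS", "INTERNET", "READ_SMS"})
print(classify(unknown, medoids))  # prints "family_a"
```

    Unlike a centroid, a medoid is always an actual member of its group, so it stays well defined for set-valued features where an average has no natural meaning.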

    Classifying Sequences of Extreme Length with Constant Memory Applied to Malware Detection

    Recent works within machine learning have been tackling inputs of ever-increasing size, with cybersecurity presenting sequence classification problems of particularly extreme lengths. In the case of Windows executable malware detection, inputs may exceed 100 MB, which corresponds to a time series with $T = 100,000,000$ steps. To date, the closest approach to handling such a task is MalConv, a convolutional neural network capable of processing up to $T = 2,000,000$ steps. The $\mathcal{O}(T)$ memory of CNNs has prevented further application of CNNs to malware. In this work, we develop a new approach to temporal max pooling that makes the required memory invariant to the sequence length $T$. This makes MalConv $116\times$ more memory efficient, and up to $25.8\times$ faster to train on its original dataset, while removing the input length restrictions to MalConv. We re-invest these gains into improving the MalConv architecture by developing a new Global Channel Gating design, giving us an attention mechanism capable of learning feature interactions across 100 million time steps in an efficient manner, a capability lacked by the original MalConv CNN. Our implementation can be found at https://github.com/NeuromorphicComputationResearchProgram/MalConv2
    Comment: To appear in AAAI 202
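
    One way to see why temporal max pooling need not hold all $T$ activations in memory is that a global maximum can be folded over fixed-size chunks. The sketch below covers only that forward-pass observation. It is not the authors' implementation, and the abstract does not describe how gradients or convolution windows spanning chunk boundaries are handled. The `features` function is a hypothetical stand-in for per-position activations, with a window width of one so that chunks need no overlap.

```python
from typing import Iterator, List

NUM_CHANNELS = 3


def features(byte: int) -> List[float]:
    """Hypothetical per-position activations (stand-in for a conv layer)."""
    return [byte / 255.0, (byte % 16) / 15.0, 1.0 if byte == 0 else 0.0]


def chunks(data: bytes, size: int) -> Iterator[bytes]:
    """Yield consecutive slices of at most `size` bytes."""
    for start in range(0, len(data), size):
        yield data[start:start + size]


def chunk_max(chunk: bytes) -> List[float]:
    """Per-channel max over one chunk; only len(chunk) rows exist at once."""
    activations = [features(b) for b in chunk]
    return [max(column) for column in zip(*activations)]


def chunked_global_max(data: bytes, chunk_size: int) -> List[float]:
    """Per-channel max over the whole input, keeping one running vector."""
    if not data:
        raise ValueError("input must be non-empty")
    running = [float("-inf")] * NUM_CHANNELS
    for chunk in chunks(data, chunk_size):
        local = chunk_max(chunk)
        running = [max(r, v) for r, v in zip(running, local)]
    return running


data = bytes(range(256)) * 4
# The chunked result matches pooling over the full sequence in one pass.
assert chunked_global_max(data, chunk_size=64) == chunk_max(data)
```

    Peak memory in this sketch depends on `chunk_size` and the number of channels rather than on the input length, which is the property the abstract describes as memory invariant to $T$.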