3,737 research outputs found

    Evaluating Similarity of Cross-Architecture Basic Blocks

    Vulnerabilities in source code can be compiled for multiple processor architectures and make their way into many different devices. Security researchers frequently have no way to obtain this source code to analyze for vulnerabilities, so the ability to effectively analyze binary code is essential. Similarity detection is one facet of binary code analysis. Because source code can be compiled for different architectures, the need can arise for detecting code similarity across architectures. This need is especially apparent when analyzing firmware from embedded computing environments such as Internet of Things devices, where the processor architecture depends on the product and cannot be controlled by the researcher. In this thesis, we propose a system for cross-architecture binary similarity detection and present an implementation. Our system simplifies the process by lifting the binary code into an intermediate representation provided by Ghidra before analyzing it with a neural network. This eliminates the noise that can result from analyzing two disparate instruction sets simultaneously. Our tool shows a high degree of accuracy when comparing basic blocks. In future work, we hope to expand its functionality to capture function-level control-flow data.
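    The comparison stage this abstract describes can be pictured as measuring the distance between learned embedding vectors. The sketch below is a minimal illustration under assumed inputs: it presumes basic blocks have already been lifted to an intermediate representation and encoded into fixed-size vectors by a trained network; the embedding values and the 0.8 threshold are invented for illustration, not taken from the thesis.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings of the "same" basic block compiled for
# ARM and x86, produced by the (assumed) trained encoder.
emb_arm = [0.12, 0.87, 0.33, 0.05]
emb_x86 = [0.10, 0.90, 0.30, 0.07]

# Illustrative decision threshold; a real system would calibrate this.
if cosine_similarity(emb_arm, emb_x86) > 0.8:
    print("blocks likely similar")
```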

    Neural malware detection

    At the heart of today's malware problem lies the theoretically infinite diversity created by metamorphism. The majority of conventional machine learning techniques tackle the problem under the assumptions that a sufficiently large number of training samples exist and that the training set is independent and identically distributed. However, the lack of semantic features, combined with models built on these wrong assumptions, largely results in overfitting, with many false positives against real-world samples, leaving systems vulnerable to various adversarial attacks. A key observation is that modern malware authors write a script that automatically generates an arbitrarily large number of diverse samples sharing similar characteristics in program logic, which is a very cost-effective way to evade detection with minimum effort. Given that many malware campaigns follow this paradigm of an economic malware manufacturing model, the samples within a campaign are likely to share coherent semantic characteristics. This opens up the possibility of one-to-many detection. It is therefore crucial to capture the non-linear metamorphic pattern unique to a campaign in order to detect these seemingly diverse but identically rooted variants. To address these issues, this dissertation proposes novel deep learning models, including a generative static malware outbreak detection model, a generative dynamic malware detection model using spatio-temporal isomorphic dynamic features, and instruction-cognitive malware detection. A comparative study on metamorphic threats is also conducted as part of the thesis. A generative adversarial autoencoder (AAE) over a convolutional network with global average pooling is introduced as the fundamental deep learning framework for malware detection, capturing highly complex non-linear metamorphism through translation invariance and insensitivity to local variation.
The Generative Adversarial Network (GAN) used as part of the framework enables one-shot training, where semantically isomorphic malware campaigns are identified by a single malware instance sampled from the very first outbreak. This is a major innovation because, to the best of our knowledge, no prior approach has addressed this challenging training objective against a malware distribution that consists of a large number of very sparse groups artificially driven by the arms race between attackers and defenders. In addition, we propose a novel method that extracts an instruction-cognitive representation from uninterpreted raw binary executables, which can be used for one-to-many malware detection via one-shot training against the frequency spectrum of the Transformer's encoded latent representation. The method works regardless of the presence of diverse malware variations while remaining resilient to adversarial attacks, which mostly use random perturbation against raw binaries. Comprehensive performance analyses, including mathematical formulations and experimental evaluations, are provided, with the proposed deep learning framework for malware detection exhibiting superior performance over conventional machine learning methods. The methods proposed in this thesis are applicable to a variety of threat environments where artificially formed sparse distributions arise at the cyber battle fronts.

    Effectiveness of Similarity Digest Algorithms for Binary Code Similarity in Memory Forensic Analysis

    Nowadays, any organization connected to the Internet is susceptible to cybersecurity incidents and must therefore have an incident response plan. This plan helps to prevent, detect, prioritize, and manage cybersecurity incidents. One of the steps in managing these incidents is the eradication phase, which neutralizes the persistence of attacks, assesses their scope, and identifies the degree of compromise. A key point of this phase is the triage-based identification of the information that is relevant to the incident. This is usually done by comparing the available elements against known information, thereby focusing on the elements that are relevant to the investigation (called evidence). This goal can be achieved by studying two sources of information. On the one hand, by analyzing persistent data, such as data on hard drives or USB devices. On the other hand, by analyzing volatile data, such as the contents of RAM. Unlike persistent data analysis, volatile data analysis makes it possible to determine the scope of some types of attack that do not store their code on persistence devices, or where the executable files stored on disk are encrypted and their code is only revealed while it resides in memory and is being executed. There is a limitation in using cryptographic hashes, commonly employed to identify evidence in persistent data, to identify evidence in memory. This limitation arises because the evidence will never be identical, since execution constantly modifies the contents of memory. Moreover, it is impossible to acquire memory more than once with every program at the same point of execution.
Hashes are therefore an invalid identification method for memory triage. As a solution to this problem, this thesis proposes the use of similarity digest algorithms, which measure the similarity between two inputs approximately. The main contributions of this thesis are threefold. First, a study of the problem domain is carried out, evaluating memory management and how memory is modified during execution. Next, similarity digest algorithms are studied, developing a classification of their phases and of the attacks against these algorithms, and correlating the characteristics of the first classification with the identified attacks. Finally, two methods for preprocessing the contents of memory dumps are proposed to improve the identification of the elements of interest for the analysis. In conclusion, this thesis shows that the modification of scattered bytes negatively affects similarity computations between memory evidence. This modification is produced mainly by the operating system's memory manager. Furthermore, it is shown that the proposed techniques for preprocessing the contents of memory dumps improve the process of identifying evidence in memory.
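    The core idea behind similarity digests, as opposed to cryptographic hashes, can be sketched in a few lines: summarize an input by a set of features and compare the summaries approximately. The toy below uses 4-byte n-grams and Jaccard similarity; real algorithms such as ssdeep, sdhash, or TLSH are far more elaborate, and the buffers are invented. It only illustrates why scattered byte modifications (as produced by a memory manager) degrade the score gradually rather than invalidating the digest outright.

```python
def digest(data: bytes, n: int = 4) -> set:
    """Feature set: all n-byte substrings of the input."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}

def similarity(a: bytes, b: bytes) -> float:
    """Jaccard similarity of the two feature sets, in [0, 1]."""
    da, db = digest(a), digest(b)
    return len(da & db) / len(da | db)

# Hypothetical buffer and an in-memory copy with scattered byte changes.
original = b"MZ\x90\x00" + bytes(range(64))
mutated = bytearray(original)
mutated[10] ^= 0xFF
mutated[40] ^= 0xFF

# Score stays high: only the features overlapping the two changed
# bytes differ, so the evidence is still recognizably similar.
score = similarity(original, bytes(mutated))
```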

    The Effect of Code Obfuscation on Authorship Attribution of Binary Computer Files

    In many forensic investigations, questions linger regarding the identity of the authors of a software specimen. Research has identified methods for the attribution of binary files that have not been obfuscated, but a significant percentage of malicious software is obfuscated in an effort to hide both the details of its origin and its true intent. Little research has been done on analyzing obfuscated code for attribution. In part, the reason for this gap is that deobfuscating an unknown program is a challenging task. Further, the additional transformations of the executable file introduced by the obfuscator modify or remove features of the original executable that would have been used in the author attribution process. Existing research has demonstrated good success in attributing the authorship of an executable file of unknown provenance using methods based on static analysis of the specimen file. With the addition of file obfuscation, static analysis becomes difficult, time-consuming, and in some cases may lead to inaccurate findings. This paper presents a novel process for authorship attribution using dynamic analysis methods. A software-emulated system was fully instrumented to become a test harness for a specimen of unknown provenance, allowing supervised control, monitoring, and trace data collection during execution. This trace data was used as input to a supervised machine learning algorithm trained to identify stylometric differences in the specimen under test and to predict who wrote it. The specimen files were also analyzed for authorship using static analysis methods to compare prediction accuracies with those of the new, dynamic-analysis-based method. Experiments indicate that the new method can provide better author attribution accuracy for files of unknown provenance, especially when the specimen file has been obfuscated.
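    The dynamic pipeline described above amounts to: run the specimen in an instrumented harness, turn the trace into stylometric features, and classify. A minimal, hypothetical sketch of the feature-and-classify half is below; the call names, author labels, and nearest-profile classifier are all invented stand-ins (the paper uses a full emulated test harness and a trained supervised model, not this toy).

```python
from collections import Counter

def trace_features(trace):
    """Stylometric features: frequency of consecutive call pairs."""
    return Counter(zip(trace, trace[1:]))

def distance(f1, f2):
    """L1 distance between two sparse feature vectors."""
    keys = set(f1) | set(f2)
    return sum(abs(f1[k] - f2[k]) for k in keys)

def attribute(specimen_trace, profiles):
    """Predict the author whose known profile is nearest."""
    feats = trace_features(specimen_trace)
    return min(profiles, key=lambda a: distance(feats, profiles[a]))

# Invented author profiles built from previously attributed traces.
profiles = {
    "author_a": trace_features(["open", "read", "read", "close"] * 3),
    "author_b": trace_features(["connect", "send", "recv", "close"] * 3),
}
specimen = ["open", "read", "read", "read", "close"]
print(attribute(specimen, profiles))  # nearer to author_a's profile
```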

    Lightweight Fingerprints for Fast Approximate Keyword Matching Using Bitwise Operations

    We aim to speed up approximate keyword matching with the use of a lightweight, fixed-size block of data for each string, called a fingerprint. Fingerprints work in a similar way to hash values; however, they can also be used for matching with errors. They store information about symbol occurrences using individual bits, and they can be compared against each other with a constant number of bitwise operations. In this way, certain strings can be deduced to be at a distance greater than k from each other (using the Hamming or Levenshtein distance) without performing an explicit verification. We show experimentally that for a preprocessed collection of strings, fingerprints can provide substantial speedups for k = 1, namely over 2.5 times for the Hamming distance and over 30 times for the Levenshtein distance. Tests were conducted on synthetic and real-world English and URL data.
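    A minimal sketch of the occurrence-bit idea, assuming an alphabet of 26 lowercase letters and one bit per letter (the paper's fingerprints pack more information than this). For equal-length strings within Hamming distance k, at most k substitutions occurred, and each substitution can flip at most two occurrence bits, so a popcount of the XOR above 2k proves the distance exceeds k and the expensive verification can be skipped.

```python
def fingerprint(s: str) -> int:
    """One bit per letter a-z: bit i set iff chr(ord('a')+i) occurs."""
    fp = 0
    for ch in s:
        fp |= 1 << (ord(ch) - ord('a'))
    return fp

def may_be_within(s: str, t: str, k: int) -> bool:
    """Cheap filter: False means Hamming distance surely exceeds k."""
    diff = fingerprint(s) ^ fingerprint(t)
    return bin(diff).count("1") <= 2 * k

# One substitution ("k" -> "m"), so the filter must let this pair through.
assert may_be_within("kitten", "mitten", 1)
# Every letter differs; rejected by bitwise operations alone, no scan.
assert not may_be_within("abcdef", "uvwxyz", 1)
```

A pair that passes the filter still needs explicit verification; the fingerprint only rules pairs out, which is where the reported speedups come from.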