3,737 research outputs found

    Evaluating Similarity of Cross-Architecture Basic Blocks

    Vulnerabilities in source code can be compiled for multiple processor architectures and make their way into many different devices. Security researchers frequently have no way to obtain this source code to analyze for vulnerabilities, so the ability to effectively analyze binary code is essential. Similarity detection is one facet of binary code analysis. Because source code can be compiled for different architectures, the need can arise for detecting code similarity across architectures. This need is especially apparent when analyzing firmware from embedded computing environments such as Internet of Things devices, where the processor architecture depends on the product and cannot be controlled by the researcher. In this thesis, we propose a system for cross-architecture binary similarity detection and present an implementation. Our system simplifies the process by lifting the binary code into an intermediate representation provided by Ghidra before analyzing it with a neural network. This eliminates the noise that can result from analyzing two disparate instruction sets simultaneously. Our tool shows a high degree of accuracy when comparing basic blocks. In future work, we hope to expand its functionality to capture function-level control-flow data.
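    The comparison stage this abstract describes can be pictured as measuring the distance between learned embedding vectors. The sketch below is a minimal illustration under assumed inputs: it presumes basic blocks have already been lifted to an intermediate representation and encoded into fixed-size vectors by a trained network; the embedding values and the 0.8 threshold are invented for illustration, not taken from the thesis.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings of the "same" basic block compiled for
# ARM and x86, produced by the (assumed) trained encoder.
emb_arm = [0.12, 0.87, 0.33, 0.05]
emb_x86 = [0.10, 0.90, 0.30, 0.07]

# Illustrative decision threshold; a real system would calibrate this.
if cosine_similarity(emb_arm, emb_x86) > 0.8:
    print("blocks likely similar")
```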

    Neural malware detection

    At the heart of today's malware problem lies the theoretically infinite diversity created by metamorphism. The majority of conventional machine learning techniques tackle the problem under the assumptions that a sufficiently large number of training samples exist and that the training set is independent and identically distributed. However, the lack of semantic features, combined with models built on these wrong assumptions, largely results in overfitting, with many false positives against real-world samples, leaving systems vulnerable to various adversarial attacks. A key observation is that modern malware authors write a script that automatically generates an arbitrarily large number of diverse samples sharing similar characteristics in program logic, which is a very cost-effective way to evade detection with minimum effort. Given that many malware campaigns follow this paradigm of an economic malware manufacturing model, the samples within a campaign are likely to share coherent semantic characteristics. This opens up the possibility of one-to-many detection. It is therefore crucial to capture the non-linear metamorphic pattern unique to a campaign in order to detect these seemingly diverse but identically rooted variants. To address these issues, this dissertation proposes novel deep learning models, including a generative static malware outbreak detection model, a generative dynamic malware detection model using spatio-temporal isomorphic dynamic features, and instruction-cognitive malware detection. A comparative study on metamorphic threats is also conducted as part of the thesis. A generative adversarial autoencoder (AAE) over a convolutional network with global average pooling is introduced as the fundamental deep learning framework for malware detection, capturing highly complex non-linear metamorphism through translation invariance and insensitivity to local variation.
The Generative Adversarial Network (GAN) used as part of the framework enables one-shot training, where semantically isomorphic malware campaigns are identified by a single malware instance sampled from the very first outbreak. This is a major innovation because, to the best of our knowledge, no prior approach has addressed this challenging training objective against a malware distribution that consists of a large number of very sparse groups artificially driven by the arms race between attackers and defenders. In addition, we propose a novel method that extracts an instruction-cognitive representation from uninterpreted raw binary executables, which can be used for one-to-many malware detection via one-shot training against the frequency spectrum of the Transformer's encoded latent representation. The method works regardless of the presence of diverse malware variations while remaining resilient to adversarial attacks, which mostly use random perturbation against raw binaries. Comprehensive performance analyses, including mathematical formulations and experimental evaluations, are provided, with the proposed deep learning framework for malware detection exhibiting superior performance over conventional machine learning methods. The methods proposed in this thesis are applicable to a variety of threat environments where artificially formed sparse distributions arise at the cyber battle fronts.

    Effectiveness of Similarity Digest Algorithms for Binary Code Similarity in Memory Forensic Analysis

    Nowadays, any organization connected to the Internet is susceptible to cybersecurity incidents and must therefore have an incident response plan. This plan helps to prevent, detect, prioritize, and manage cybersecurity incidents. One of the steps in managing these incidents is the eradication phase, which neutralizes the persistence of attacks, assesses their scope, and identifies the degree of compromise. A key point of this phase is the triage-based identification of the information that is relevant to the incident. This is usually done by comparing the available elements against known information, thereby focusing on the elements that are relevant to the investigation (called evidence). This goal can be achieved by studying two sources of information. On the one hand, by analyzing persistent data, such as data on hard drives or USB devices. On the other hand, by analyzing volatile data, such as the contents of RAM. Unlike persistent data analysis, volatile data analysis makes it possible to determine the scope of some types of attack that do not store their code on persistence devices, or where the executable files stored on disk are encrypted and their code is only revealed while it resides in memory and is being executed. There is a limitation in using cryptographic hashes, commonly employed to identify evidence in persistent data, to identify evidence in memory. This limitation arises because the evidence will never be identical, since execution constantly modifies the contents of memory. Moreover, it is impossible to acquire memory more than once with every program at the same point of execution.
Hashes are therefore an invalid identification method for memory triage. As a solution to this problem, this thesis proposes the use of similarity digest algorithms, which measure the similarity between two inputs approximately. The main contributions of this thesis are threefold. First, a study of the problem domain is carried out, evaluating memory management and how memory is modified during execution. Next, similarity digest algorithms are studied, developing a classification of their phases and of the attacks against these algorithms, and correlating the characteristics of the first classification with the identified attacks. Finally, two methods for preprocessing the contents of memory dumps are proposed to improve the identification of the elements of interest for the analysis. In conclusion, this thesis shows that the modification of scattered bytes negatively affects similarity computations between memory evidence. This modification is produced mainly by the operating system's memory manager. Furthermore, it is shown that the proposed techniques for preprocessing the contents of memory dumps improve the process of identifying evidence in memory.
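    The core idea behind similarity digests, as opposed to cryptographic hashes, can be sketched in a few lines: summarize an input by a set of features and compare the summaries approximately. The toy below uses 4-byte n-grams and Jaccard similarity; real algorithms such as ssdeep, sdhash, or TLSH are far more elaborate, and the buffers are invented. It only illustrates why scattered byte modifications (as produced by a memory manager) degrade the score gradually rather than invalidating the digest outright.

```python
def digest(data: bytes, n: int = 4) -> set:
    """Feature set: all n-byte substrings of the input."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}

def similarity(a: bytes, b: bytes) -> float:
    """Jaccard similarity of the two feature sets, in [0, 1]."""
    da, db = digest(a), digest(b)
    return len(da & db) / len(da | db)

# Hypothetical buffer and an in-memory copy with scattered byte changes.
original = b"MZ\x90\x00" + bytes(range(64))
mutated = bytearray(original)
mutated[10] ^= 0xFF
mutated[40] ^= 0xFF

# Score stays high: only the features overlapping the two changed
# bytes differ, so the evidence is still recognizably similar.
score = similarity(original, bytes(mutated))
```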

    The Effect of Code Obfuscation on Authorship Attribution of Binary Computer Files

    In many forensic investigations, questions linger regarding the identity of the authors of a software specimen. Research has identified methods for the attribution of binary files that have not been obfuscated, but a significant percentage of malicious software is obfuscated in an effort to hide both the details of its origin and its true intent. Little research has been done on analyzing obfuscated code for attribution. In part, the reason for this gap is that deobfuscating an unknown program is a challenging task. Further, the additional transformations of the executable file introduced by the obfuscator modify or remove features of the original executable that would have been used in the author attribution process. Existing research has demonstrated good success in attributing the authorship of an executable file of unknown provenance using methods based on static analysis of the specimen file. With the addition of file obfuscation, static analysis becomes difficult, time-consuming, and in some cases may lead to inaccurate findings. This paper presents a novel process for authorship attribution using dynamic analysis methods. A software-emulated system was fully instrumented to become a test harness for a specimen of unknown provenance, allowing supervised control, monitoring, and trace data collection during execution. This trace data was used as input to a supervised machine learning algorithm trained to identify stylometric differences in the specimen under test and to predict who wrote it. The specimen files were also analyzed for authorship using static analysis methods to compare prediction accuracies with those of the new, dynamic-analysis-based method. Experiments indicate that the new method can provide better author attribution accuracy for files of unknown provenance, especially when the specimen file has been obfuscated.
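    The dynamic pipeline described above amounts to: run the specimen in an instrumented harness, turn the trace into stylometric features, and classify. A minimal, hypothetical sketch of the feature-and-classify half is below; the call names, author labels, and nearest-profile classifier are all invented stand-ins (the paper uses a full emulated test harness and a trained supervised model, not this toy).

```python
from collections import Counter

def trace_features(trace):
    """Stylometric features: frequency of consecutive call pairs."""
    return Counter(zip(trace, trace[1:]))

def distance(f1, f2):
    """L1 distance between two sparse feature vectors."""
    keys = set(f1) | set(f2)
    return sum(abs(f1[k] - f2[k]) for k in keys)

def attribute(specimen_trace, profiles):
    """Predict the author whose known profile is nearest."""
    feats = trace_features(specimen_trace)
    return min(profiles, key=lambda a: distance(feats, profiles[a]))

# Invented author profiles built from previously attributed traces.
profiles = {
    "author_a": trace_features(["open", "read", "read", "close"] * 3),
    "author_b": trace_features(["connect", "send", "recv", "close"] * 3),
}
specimen = ["open", "read", "read", "read", "close"]
print(attribute(specimen, profiles))  # nearer to author_a's profile
```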

    Lightweight Fingerprints for Fast Approximate Keyword Matching Using Bitwise Operations

    We aim to speed up approximate keyword matching with the use of a lightweight, fixed-size block of data for each string, called a fingerprint. Fingerprints work in a similar way to hash values; however, they can also be used for matching with errors. They store information about symbol occurrences using individual bits, and they can be compared against each other with a constant number of bitwise operations. In this way, certain strings can be deduced to be at a distance greater than k from each other (using the Hamming or Levenshtein distance) without performing an explicit verification. We show experimentally that for a preprocessed collection of strings, fingerprints can provide substantial speedups for k = 1, namely over 2.5 times for the Hamming distance and over 30 times for the Levenshtein distance. Tests were conducted on synthetic and real-world English and URL data.
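    A minimal sketch of the occurrence-bit idea, assuming an alphabet of 26 lowercase letters and one bit per letter (the paper's fingerprints pack more information than this). For equal-length strings within Hamming distance k, at most k substitutions occurred, and each substitution can flip at most two occurrence bits, so a popcount of the XOR above 2k proves the distance exceeds k and the expensive verification can be skipped.

```python
def fingerprint(s: str) -> int:
    """One bit per letter a-z: bit i set iff chr(ord('a')+i) occurs."""
    fp = 0
    for ch in s:
        fp |= 1 << (ord(ch) - ord('a'))
    return fp

def may_be_within(s: str, t: str, k: int) -> bool:
    """Cheap filter: False means Hamming distance surely exceeds k."""
    diff = fingerprint(s) ^ fingerprint(t)
    return bin(diff).count("1") <= 2 * k

# One substitution ("k" -> "m"), so the filter must let this pair through.
assert may_be_within("kitten", "mitten", 1)
# Every letter differs; rejected by bitwise operations alone, no scan.
assert not may_be_within("abcdef", "uvwxyz", 1)
```

A pair that passes the filter still needs explicit verification; the fingerprint only rules pairs out, which is where the reported speedups come from.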