8,897 research outputs found

    Building Better Bit-Blasting for Floating-Point Problems

    Get PDF
    An effective approach to handling the theory of floating-point is to reduce it to the theory of bit-vectors. Implementing the required encodings is complex, error prone and requires a deep understanding of floating-point hardware. This paper presents SymFPU, a library of encodings that can be included in solvers. It also includes a verification argument for its correctness, and experimental results showing that its use in CVC4 out-performs all previous tools. As well as a significantly improved performance and correctness, it is hoped this will give a simple route to add support for the theory of floating-point

    Neural malware detection

    Get PDF
    At the heart of today’s malware problem lies theoretically infinite diversity created by metamorphism. The majority of conventional machine learning techniques tackle the problem with the assumptions that a sufficiently large number of training samples exist and that the training set is independent and identically distributed. However, the lack of semantic features combined with the models under these wrong assumptions result largely in overfitting with many false positives against real world samples, resulting in systems being left vulnerable to various adversarial attacks. A key observation is that modern malware authors write a script that automatically generates an arbitrarily large number of diverse samples that share similar characteristics in program logic, which is a very cost-effective way to evade detection with minimum effort. Given that many malware campaigns follow this paradigm of economic malware manufacturing model, the samples within a campaign are likely to share coherent semantic characteristics. This opens up a possibility of one-to-many detection. Therefore, it is crucial to capture this non-linear metamorphic pattern unique to the campaign in order to detect these seemingly diverse but identically rooted variants. To address these issues, this dissertation proposes novel deep learning models, including generative static malware outbreak detection model, generative dynamic malware detection model using spatio-temporal isomorphic dynamic features, and instruction cognitive malware detection. A comparative study on metamorphic threats is also conducted as part of the thesis. Generative adversarial autoencoder (AAE) over convolutional network with global average pooling is introduced as a fundamental deep learning framework for malware detection, which captures highly complex non-linear metamorphism through translation invariancy and local variation insensitivity. Generative Adversarial Network (GAN) used as a part of the framework enables oneshot training where semantically isomorphic malware campaigns are identified by a single malware instance sampled from the very initial outbreak. This is a major innovation because, to the best of our knowledge, no approach has been found to this challenging training objective against the malware distribution that consists of a large number of very sparse groups artificially driven by arms race between attackers and defenders. In addition, we propose a novel method that extracts instruction cognitive representation from uninterpreted raw binary executables, which can be used for oneto- many malware detection via one-shot training against frequency spectrum of the Transformer’s encoded latent representation. The method works regardless of the presence of diverse malware variations while remaining resilient to adversarial attacks that mostly use random perturbation against raw binaries. Comprehensive performance analyses including mathematical formulations and experimental evaluations are provided, with the proposed deep learning framework for malware detection exhibiting a superior performance over conventional machine learning methods. The methods proposed in this thesis are applicable to a variety of threat environments here artificially formed sparse distributions arise at the cyber battle fronts.Doctor of Philosoph

    The Effect of Code Obfuscation on Authorship Attribution of Binary Computer Files

    Get PDF
    In many forensic investigations, questions linger regarding the identity of the authors of the software specimen. Research has identified methods for the attribution of binary files that have not been obfuscated, but a significant percentage of malicious software has been obfuscated in an effort to hide both the details of its origin and its true intent. Little research has been done around analyzing obfuscated code for attribution. In part, the reason for this gap in the research is that deobfuscation of an unknown program is a challenging task. Further, the additional transformation of the executable file introduced by the obfuscator modifies or removes features from the original executable that would have been used in the author attribution process. Existing research has demonstrated good success in attributing the authorship of an executable file of unknown provenance using methods based on static analysis of the specimen file. With the addition of file obfuscation, static analysis of files becomes difficult, time consuming, and in some cases, may lead to inaccurate findings. This paper presents a novel process for authorship attribution using dynamic analysis methods. A software emulated system was fully instrumented to become a test harness for a specimen of unknown provenance, allowing for supervised control, monitoring, and trace data collection during execution. This trace data was used as input into a supervised machine learning algorithm trained to identify stylometric differences in the specimen under test and provide predictions on who wrote the specimen. The specimen files were also analyzed for authorship using static analysis methods to compare prediction accuracies with prediction accuracies gathered from this new, dynamic analysis based method. Experiments indicate that this new method can provide better accuracy of author attribution for files of unknown provenance, especially in the case where the specimen file has been obfuscated

    XFL: Naming Functions in Binaries with Extreme Multi-label Learning

    Full text link
    Reverse engineers benefit from the presence of identifiers such as function names in a binary, but usually these are removed for release. Training a machine learning model to predict function names automatically is promising but fundamentally hard: unlike words in natural language, most function names occur only once. In this paper, we address this problem by introducing eXtreme Function Labeling (XFL), an extreme multi-label learning approach to selecting appropriate labels for binary functions. XFL splits function names into tokens, treating each as an informative label akin to the problem of tagging texts in natural language. We relate the semantics of binary code to labels through DEXTER, a novel function embedding that combines static analysis-based features with local context from the call graph and global context from the entire binary. We demonstrate that XFL/DEXTER outperforms the state of the art in function labeling on a dataset of 10,047 binaries from the Debian project, achieving a precision of 83.5%. We also study combinations of XFL with alternative binary embeddings from the literature and show that DEXTER consistently performs best for this task. As a result, we demonstrate that binary function labeling can be effectively phrased in terms of multi-label learning, and that binary function embeddings benefit from including explicit semantic features

    Protecting Software through Obfuscation:Can It Keep Pace with Progress in Code Analysis?

    Get PDF
    Software obfuscation has always been a controversially discussed research area. While theoretical results indicate that provably secure obfuscation in general is impossible, its widespread application in malware and commercial software shows that it is nevertheless popular in practice. Still, it remains largely unexplored to what extent today’s software obfuscations keep up with state-of-the-art code analysis and where we stand in the arms race between software developers and code analysts. The main goal of this survey is to analyze the effectiveness of different classes of software obfuscation against the continuously improving deobfuscation techniques and off-the-shelf code analysis tools. The answer very much depends on the goals of the analyst and the available resources. On the one hand, many forms of lightweight static analysis have difficulties with even basic obfuscation schemes, which explains the unbroken popularity of obfuscation among malware writers. On the other hand, more expensive analysis techniques, in particular when used interactively by a human analyst, can easily defeat many obfuscations. As a result, software obfuscation for the purpose of intellectual property protection remains highly challenging.</jats:p

    Memory-Based Grammatical Relation Finding

    Get PDF

    Sequence-to-sequence learning for machine translation and automatic differentiation for machine learning software tools

    Full text link
    Cette thèse regroupe des articles d'apprentissage automatique et s'articule autour de deux thématiques complémentaires. D'une part, les trois premiers articles examinent l'application des réseaux de neurones artificiels aux problèmes du traitement automatique du langage naturel (TALN). Le premier article introduit une structure codificatrice-décodificatrice avec des réseaux de neurones récurrents pour traduire des segments de phrases de longueur variable. Le deuxième article analyse la performance de ces modèles de `traduction neuronale automatique' de manière qualitative et quantitative, tout en soulignant les difficultés posées par les phrases longues et les mots rares. Le troisième article s'adresse au traitement des mots rares et hors du vocabulaire commun en combinant des algorithmes de compression par dictionnaire et des réseaux de neurones récurrents. D'autre part, la deuxième partie de cette thèse fait abstraction de modèles particuliers de réseaux de neurones afin d'aborder l'infrastructure logicielle nécessaire à leur définition et entraînement. Les infrastructures modernes d'apprentissage profond doivent avoir la capacité d'exécuter efficacement des programmes d'algèbre linéaire et par tableaux, tout en étant capable de différentiation automatique (DA) pour calculer des dérivées multiples. Le premier article aborde les défis généraux posés par la conciliation de ces deux objectifs et propose la solution d'une représentation intermédiaire fondée sur les graphes. Le deuxième article attaque le même problème d'une manière différente: en implémentant un code source par bande dans un langage de programmation dynamique par tableau (Python et NumPy).This thesis consists of a series of articles that contribute to the field of machine learning. In particular, it covers two distinct and loosely related fields. The first three articles consider the use of neural network models for problems in natural language processing (NLP). The first article introduces the use of an encoder-decoder structure involving recurrent neural networks (RNNs) to translate from and to variable length phrases and sentences. The second article contains a quantitative and qualitative analysis of the performance of these `neural machine translation' models, laying bare the difficulties posed by long sentences and rare words. The third article deals with handling rare and out-of-vocabulary words in neural network models by using dictionary coder compression algorithms and multi-scale RNN models. The second half of this thesis does not deal with specific neural network models, but with the software tools and frameworks that can be used to define and train them. Modern deep learning frameworks need to be able to efficiently execute programs involving linear algebra and array programming, while also being able to employ automatic differentiation (AD) in order to calculate a variety of derivatives. The first article provides an overview of the difficulties posed in reconciling these two objectives, and introduces a graph-based intermediate representation that aims to tackle these difficulties. The second article considers a different approach to the same problem, implementing a tape-based source-code transformation approach to AD on a dynamically typed array programming language (Python and NumPy)
    • …
    corecore