8 research outputs found

    Classification de Logiciels Malveillants Dirigée par les Données et Assistée par des Méthodes d’Apprentissage Automatique

    Historically, malware (MW) analysis has relied heavily on human expertise for the manual creation of signatures to detect and classify MW. This procedure is very costly and time consuming, and thus unable to cope with modern cyber threat scenarios. The solution is to widely automate MW analysis. Toward this goal, MW classification allows optimizing the handling of large MW corpora by identifying resemblances across similar instances. Consequently, MW classification is a key activity related to MW analysis, which is paramount in the operation of computer security as a whole. This thesis addresses the problem of MW classification taking an approach in which human intervention is spared as much as possible. Furthermore, we steer clear of the subjectivity inherent to human analysis by designing MW classification solely on data directly extracted from MW analysis, thus taking a data-driven approach. Our objective is to improve the automation of malware analysis and to combine it with machine learning methods that are able to autonomously spot and reveal unforeseen commonalities within data. We phased our work in three stages. Initially we focused on improving MW analysis and its automation, studying new ways of leveraging symbolic execution in MW analysis and developing a distributed framework to scale up our computational power. Then we concentrated on the representation of MW behavior, with painstaking attention to its accuracy and robustness. Finally, we focused on MW clustering, devising a methodology that places no restriction on the combination of syntactical and behavioral features and remains scalable in practice. As for our main contributions, we revamp the use of symbolic execution for MW analysis with special attention to the optimal use of SMT solver tactics and hyperparameter settings; we conceive a new evaluation paradigm for MW analysis systems; we formulate a compact graph representation of behavior, along with a corresponding function for pairwise similarity computation, which is accurate and robust; and we elaborate a new MW clustering strategy based on ensemble clustering that is flexible with respect to the combination of syntactical and behavioral features.

    Accurate and Robust Malware Analysis through Similarity of External Calls Dependency Graphs (ECDG)

    The authors received the Best Paper Award at IWCC 2021 for this presentation, given at the workshop. Malware is a primary concern in cybersecurity, being one of the attacker's favorite cyberweapons. Over time, malware evolves not only in complexity but also in diversity and quantity. Malware analysis automation is thus crucial. In this paper we present ECDGs, a shorter call graph representation, and a new similarity function that is accurate and robust. Toward this goal, we revisit some principles of malware analysis research to define basic primitives and an evaluation paradigm aimed at setting up more reliable experiments. Our benchmark shows that our similarity function is very efficient in practice, achieving speedup rates of 3.30x and 354.11x with respect to radiff2 for the standard and the cache-enhanced implementations, respectively. Our evaluations generate clusters that produce almost unerring results (homogeneity score of 0.983 for the accuracy phase) and marginal information loss for a highly polluted dataset (NMI score of 0.974 between initial and final clusters of the robustness phase). Overall, ECDGs and our similarity function enable autonomous frameworks for malware search and clustering that can assist human-based analysis or improve classification models for malware analysis.
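For readers unfamiliar with the homogeneity score quoted in this abstract, the following is a minimal sketch of the standard definition, h = 1 - H(C|K) / H(C), where C are the ground-truth classes and K the predicted clusters. It is not the authors' evaluation code; the function name and the toy labels (`famA`, `famB`) are assumptions made for illustration only.

```python
from collections import Counter
from math import log


def homogeneity(labels_true, labels_pred):
    """Standard homogeneity: 1 - H(C|K) / H(C); equals 1.0 when every cluster is pure."""
    n = len(labels_true)
    class_counts = Counter(labels_true)
    cluster_counts = Counter(labels_pred)
    joint_counts = Counter(zip(labels_true, labels_pred))

    # Entropy of the ground-truth classes, H(C).
    h_c = -sum((c / n) * log(c / n) for c in class_counts.values())
    if h_c == 0.0:
        # A single ground-truth class is trivially homogeneous.
        return 1.0

    # Conditional entropy of the classes given the clusters, H(C|K).
    h_c_given_k = -sum(
        (n_ck / n) * log(n_ck / cluster_counts[k])
        for (_, k), n_ck in joint_counts.items()
    )
    return 1.0 - h_c_given_k / h_c


# Toy, hypothetical labels: two families, four samples.
truth = ["famA", "famA", "famB", "famB"]
pure = homogeneity(truth, [0, 0, 1, 1])    # each cluster holds one family
merged = homogeneity(truth, [0, 0, 0, 0])  # one cluster mixes both families
```

Note that homogeneity alone does not penalize over-splitting: assigning every sample to its own cluster also yields a score of 1.0, which is one reason such scores are usually read alongside a second measure such as NMI.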

    SE-PAC: A Self-Evolving PAcker Classifier against rapid packers evolution

    Packers are widespread tools used by malware authors to hinder static malware detection and analysis. Identifying the packer used to pack a malware sample is essential to properly unpack and analyze it, be it manually or automatically. While many well-known packers are used, there is a growing trend for new custom packers that make malware analysis and detection harder. Research works have been very effective in identifying known packers or their variants, with signature-based, supervised machine learning or similarity-based techniques. However, identifying new packer classes remains an open problem. This paper presents a self-evolving packer classifier that provides an effective, incremental, and robust solution to cope with the rapid evolution of packers. We propose a composite pairwise distance metric combining different types of packer features. We derive an incremental clustering approach able to identify both (variants of) known packer classes and new ones, as well as to update clusters automatically and efficiently. Our system thus continuously enhances, integrates, adapts and evolves packer knowledge. Moreover, to optimize post-clustering packer processing costs, we introduce a new post-clustering strategy for selecting small subsets of relevant samples from the clusters. The effectiveness and time-resilience of our approach are assessed with: 1) a real-world malware feed dataset composed of 16k packed binaries, comprising 29 unique packers, and 2) a synthetic dataset composed of 19k manually crafted packed binaries, comprising 31 unique packers (including custom ones).
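The general idea of pairing a composite pairwise distance with incremental clustering can be sketched as follows. This is not the SE-PAC algorithm: it is a simplified leader-style clustering, and the two features (a byte-entropy value and a set of imported function names), the equal weighting, and the 0.3 threshold are all assumptions chosen for illustration.

```python
def jaccard_distance(a, b):
    """1 - |a & b| / |a | b| for two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def composite_distance(x, y, weight=0.5):
    """Weighted mix of a numeric and a set-based distance, both in [0, 1].

    x and y are hypothetical (entropy, imports) pairs, where entropy is a
    byte-level Shannon entropy in [0, 8] and imports is a frozenset of names.
    """
    numeric = abs(x[0] - y[0]) / 8.0
    symbolic = jaccard_distance(x[1], y[1])
    return weight * numeric + (1.0 - weight) * symbolic


def assign_incrementally(samples, threshold=0.3):
    """Assign each sample to the nearest cluster leader, or open a new cluster.

    A sample farther than `threshold` from every leader is treated as a new
    class and becomes the leader of a fresh cluster.
    """
    leaders, labels = [], []
    for sample in samples:
        best, best_distance = None, threshold
        for index, leader in enumerate(leaders):
            distance = composite_distance(sample, leader)
            if distance <= best_distance:
                best, best_distance = index, distance
        if best is None:
            leaders.append(sample)
            best = len(leaders) - 1
        labels.append(best)
    return labels


# Hypothetical feature pairs; the values are invented for the example.
stream = [
    (7.8, frozenset({"LoadLibraryA", "GetProcAddress"})),
    (7.6, frozenset({"LoadLibraryA", "GetProcAddress"})),
    (5.0, frozenset({"VirtualAlloc"})),
    (7.7, frozenset({"LoadLibraryA", "GetProcAddress", "VirtualProtect"})),
]
labels = assign_incrementally(stream)
```

Because samples are processed one at a time against existing leaders, new binaries can be labeled without re-clustering the whole corpus, which is the property an incremental approach is meant to provide.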

    Optimizing symbolic execution for malware behavior classification

    Increasingly, software correctness, reliability, and security are being analyzed using tools that combine various formal and heuristic approaches. Often such analysis becomes expensive in terms of time, and comes at the cost of high-quality results. In this experience report we explore the tuning and optimization of the tools underlying binary malware detection and classification. We identify heuristics and SMT solver tactics for the effective symbolic execution of binary files. We combine these with effective heuristics for the construction of behavioral signatures of programs that can be used for a supervised learning multi-class malware classifier. Further, a set of experiments following the full-factorial design allowed us to identify the correlations between heuristics and the overall performance of the classifier.
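A full-factorial design runs one experiment for every combination of factor levels. The sketch below only enumerates such a design; the factor names and levels are placeholders invented for the example and are not the settings studied in the report.

```python
from itertools import product


def full_factorial(factors):
    """Return one configuration dict per combination of factor levels."""
    names = list(factors)
    return [
        dict(zip(names, levels))
        for levels in product(*(factors[name] for name in names))
    ]


# Hypothetical factors: 2 x 2 x 2 levels give 8 experimental runs.
factors = {
    "solver_tactic": ["tactic_a", "tactic_b"],
    "loop_bound": [1, 5],
    "timeout_s": [30, 60],
}
runs = full_factorial(factors)
```

The number of runs is the product of the level counts, so the design grows quickly with each added factor; in exchange, every interaction between factors is observed directly.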