255 research outputs found

    Learning Fast and Slow: PROPEDEUTICA for Real-time Malware Detection

    Full text link
    In this paper, we introduce and evaluate PROPEDEUTICA, a novel methodology and framework for efficient and effective real-time malware detection, leveraging the best of conventional machine learning (ML) and deep learning (DL) algorithms. In PROPEDEUTICA, all software processes in the system start execution subjected to a conventional ML detector for fast classification. If a piece of software receives a borderline classification, it is subjected to further analysis via more performance expensive and more accurate DL methods, via our newly proposed DL algorithm DEEPMALWARE. Further, we introduce delays to the execution of software subjected to deep learning analysis as a way to "buy time" for DL analysis and to rate-limit the impact of possible malware in the system. We evaluated PROPEDEUTICA with a set of 9,115 malware samples and 877 commonly used benign software samples from various categories for the Windows OS. Our results show that the false positive rate for conventional ML methods can reach 20%, and for modern DL methods it is usually below 6%. However, the classification time for DL can be 100X longer than conventional ML methods. PROPEDEUTICA improved the detection F1-score from 77.54% (conventional ML method) to 90.25%, and reduced the detection time by 54.86%. Further, the percentage of software subjected to DL analysis was approximately 40% on average. Further, the application of delays in software subjected to ML reduced the detection time by approximately 10%. Finally, we found and discussed a discrepancy between the detection accuracy offline (analysis after all traces are collected) and on-the-fly (analysis in tandem with trace collection). Our insights show that conventional ML and modern DL-based malware detectors in isolation cannot meet the needs of efficient and effective malware detection: high accuracy, low false positive rate, and short classification time.Comment: 17 pages, 7 figure

    A Multi-view Context-aware Approach to Android Malware Detection and Malicious Code Localization

    Full text link
    Existing Android malware detection approaches use a variety of features such as security sensitive APIs, system calls, control-flow structures and information flows in conjunction with Machine Learning classifiers to achieve accurate detection. Each of these feature sets provides a unique semantic perspective (or view) of apps' behaviours with inherent strengths and limitations. Meaning, some views are more amenable to detect certain attacks but may not be suitable to characterise several other attacks. Most of the existing malware detection approaches use only one (or a selected few) of the aforementioned feature sets which prevent them from detecting a vast majority of attacks. Addressing this limitation, we propose MKLDroid, a unified framework that systematically integrates multiple views of apps for performing comprehensive malware detection and malicious code localisation. The rationale is that, while a malware app can disguise itself in some views, disguising in every view while maintaining malicious intent will be much harder. MKLDroid uses a graph kernel to capture structural and contextual information from apps' dependency graphs and identify malice code patterns in each view. Subsequently, it employs Multiple Kernel Learning (MKL) to find a weighted combination of the views which yields the best detection accuracy. Besides multi-view learning, MKLDroid's unique and salient trait is its ability to locate fine-grained malice code portions in dependency graphs (e.g., methods/classes). Through our large-scale experiments on several datasets (incl. wild apps), we demonstrate that MKLDroid outperforms three state-of-the-art techniques consistently, in terms of accuracy while maintaining comparable efficiency. In our malicious code localisation experiments on a dataset of repackaged malware, MKLDroid was able to identify all the malice classes with 94% average recall

    Identifying Authorship Style in Malicious Binaries: Techniques, Challenges & Datasets

    Get PDF
    Attributing a piece of malware to its creator typically requires threat intelligence. Binary attribution increases the level of difficulty as it mostly relies upon the ability to disassemble binaries to identify authorship style. Our survey explores malicious author style and the adversarial techniques used by them to remain anonymous. We examine the adversarial impact on the state-of-the-art methods. We identify key findings and explore the open research challenges. To mitigate the lack of ground truth datasets in this domain, we publish alongside this survey the largest and most diverse meta-information dataset of 15,660 malware labeled to 164 threat actor groups

    Need for Speed : analysis of brazilian malware classifiers' expiration date

    Get PDF
    Orientador : André Ricardo Abed GrégioCoorientador : David Menotti GomesDissertação (mestrado) - Universidade Federal do Paraná, Setor de Ciências Exatas, Programa de Pós-Graduação em Informática. Defesa : Curitiba, 27/02/2018Inclui referênciasResumo: Novos programas maliciosos são criados e liberados diariamente para enganar usuários e superar soluções de segurança, assim exigindo melhora continua nestes mecanismos (por exemplo, atualização constante de antivírus). Apesar da maioria dos programas maliciosos serem "genéricos suficiente para infectar o mesmo tipo de sistema operacional mundialmente, alguns deles estão relacionados as especificidades de um ciberespaço de certos países alvos. Neste trabalho, nos apresentemos uma analise de milhares de exemplares de malware coletados no ciberespaço brasileiro ao longo de vários anos, incluindo suas evoluções e o impacto dessas evoluções na classificação de malware. Nos também disponibilizamos um dataset desse conjunto de malware para permitir que outros experimentos e comparações sejam feitas pela comunidade. Este dataset representa o ciberespaço brasileiro e contem perfis de programas que sao conhecidamente malignos e benignos, baseados em características estáticas de seus binários. Nossa analise utilizou algoritmos de aprendizado de maquina (em particular, nos avaliamos quatro algoritmos populares off-the-shelf : Support Vector Machines, Multilayer Perceptron, KNN e Random Forest) para classificar os programas do nosso dataset como maligno ou benigno (incluindo experimentos com thresholds) e identificar o potencial concept drift que ocorre quando o modelo de classificação evolui com o passar do tempo. Nos também providenciamos detalhes extensos sobre nosso dataset, que e composto por 38.000 programas - 20.000 rotulados como malignos, coletados de anexos de e-mails maliciosos/usuários infectados (coletados em ambos os casos por uma grande instituição financeira brasileira com uma rede distribuída em todo o pais entre 2013 e começo de 2017. Por uma questão de reprodutibilidade e comparação imparcial, nos disponibilizamos publicamente os vetores de características utilizados. Finalmente, nos discutimos os experimentos conduzimos, cuja analise evidencia a existência de concept drift nos programas, tanto benignos como malignos, e mostra que não e possível dizer que existe sasonalidade em nosso dataset. Palavras-chave: Classificação de programas, Identificação de malware, Aprendizado de maquina, Concept drift.Abstract: New malware variants are produced and released daily to deceive users and overcome defense solutions, thus demanding continuous improvements on these mechanisms (e.g., antiviruses constant updates). Although most malware samples are usually "generic" enough to infect the same type of operating system world-widely, some of them are tied to the specificities regarding the cyberspace of certain target countries. In this work, we present an analysis of thousands of malware samples collected in the Brazilian cyberspace along several years, including their evolution and the impact of this evolution on malware classification. We also share a labeled dataset of this Brazilian malware set to allow other experiments and comparisons by the community. This dataset is representative of the Brazilian cyberspace and contains profiles of known-bad and known-good programs based on binaries' static features. Our analysis leveraged machine learning algorithms (in particular, we evaluated four popular off-the-shelf classifiers: Support Vector Machines, Multilayer Perceptron, KNN and Random Forest) to classify the programs of our dataset as malware or goodware (including experiments with thresholds) and to identify the potential concept drift that occurs when the subject of a classification scheme evolves as time goes by. We also provide extensive details about our dataset, which is composed of 38, 000 programs - 20, 000 labeled as known malware, collected from malicious email attachments/infected users (triaged in both cases by a major Brazilian financial institution with a country-wide distributed network) between 2013 and early 2017. For the sake of reproducibility and unbiased comparison, we make the feature vectors produced from our database publicly available. Finally, we discuss the results of the conducted experiments, whose analysis evidences the existence of concept drift on programs, either goodware and malware, and shows that it is not possible to say that there is seasonality in our dataset. Keywords: Program classification, Malware identification, Machine learning, Concept drift
    corecore