18 research outputs found
Android Malware Family Classification Based on Resource Consumption over Time
The vast majority of today's mobile malware targets Android devices. This has
pushed the research effort in Android malware analysis in the last years. An
important task of malware analysis is the classification of malware samples
into known families. Static malware analysis is known to fall short against
techniques that change static characteristics of the malware (e.g. code
obfuscation), while dynamic analysis has proven effective against such
techniques. To the best of our knowledge, the most notable work on Android
malware family classification purely based on dynamic analysis is DroidScribe.
With respect to DroidScribe, our approach is easier to reproduce. Our
methodology only employs publicly available tools, does not require any
modification to the emulated environment or Android OS, and can collect data
from physical devices. The latter is a key factor, since modern mobile malware
can detect the emulated environment and hide their malicious behavior. Our
approach relies on resource consumption metrics available from the proc file
system. Features are extracted through detrended fluctuation analysis and
correlation. Finally, a SVM is employed to classify malware into families. We
provide an experimental evaluation on malware samples from the Drebin dataset,
where we obtain a classification accuracy of 82%, proving that our methodology
achieves an accuracy comparable to that of DroidScribe. Furthermore, we make
the software we developed publicly available, to ease the reproducibility of
our results.Comment: Extended Versio
Recommended from our members
Uncovering Features in Behaviorally Similar Programs
The detection of similar code can support many so ware engineering tasks such as program understanding and program classification. Many excellent approaches have been proposed to detect programs having similar syntactic features. However, these approaches are unable to identify programs dynamically or statistically close to each other, which we call behaviorally similar programs. We believe the detection of behaviorally similar programs can enhance or even automate the tasks relevant to program classification. In this thesis, we will discuss our current approaches to identify programs having similar behavioral features in multiple perspectives.
We first discuss how to detect programs having similar functionality. While the definition of a programâs functionality is undecidable, we use inputs and outputs (I/Os) of programs as the proxy of their functionality. We then use I/Os of programs as a behavioral feature to detect which programs are functionally similar: two programs are functionally similar if they share similar inputs and outputs. This approach has been studied and developed in the C language to detect functionally equivalent programs having equivalent I/Os. Nevertheless, some natural problems in Object Oriented languages, such as input generation and comparisons between application-specific data types, hinder the development of this approach. We propose a new technique, in-vivo detection, which uses existing and meaningful inputs to drive applications systematically and then applies a novel similarity model considering both inputs and outputs of programs, to detect functionally similar programs. We develop the tool, HitoshiIO, based on our in-vivo detection. In the subjects that we study, HitoshiIO correctly detect 68.4% of functionally similar programs, where its false positive rate is only 16.6%.
In addition to functional I/Os of programs, we attempt to discover programs having similar execution behavior. Again, the execution behavior of a program can be undecidable, so we use instructions executed at run-time as a behavioral feature of a program. We create DyCLINK, which observes program executions and encodes them in dynamic instruction graphs. A vertex in a dynamic instruction graph is an instruction and an edge is a type of dependency between two instructions. The problem to detect which programs have similar executions can then be reduced to a problem of solving inexact graph isomorphism. We propose a link analysis based algorithm, LinkSub, which vectorizes each dynamic instruction graph by the importance of every instruction, to solve this graph isomorphism problem efficiently. In a K Nearest Neighbor (KNN) based program classification experiment, DyCLINK achieves 90 + % precision.
Because HitoshiIO and DyCLINK both rely on dynamic analysis to expose program behavior, they have better capability to locate and search for behaviorally similar programs than traditional static analysis tools. However, they suffer from some common problems of dynamic analysis, such as input generation and run-time overhead. These problems may make our approaches challenging to scale. Thus, we create the system, Macneto, which integrates static analysis with machine topic modeling and deep learning to approximate program behaviors from their binaries without truly executing programs. In our deobfuscation experiments considering two commercial obfuscators that alter lexical information and syntax in programs, Macneto achieves 90 + % precision, where the groundtruth is that the behavior of a program before and after obfuscation should be the same.
In this thesis, we offer a more extensive view of similar programs than the traditional definitions. While the traditional definitions of similar programs mostly use static features, such as syntax and lexical information, we propose to leverage the power of dynamic analysis and machine learning models to trace/collect behavioral features of pro- grams. These behavioral features of programs can then apply to detect behaviorally similar programs. We believe the techniques we invented in this thesis to detect behaviorally similar programs can improve the development of software engineering and security applications, such as code search and deobfuscation
Machine learning techniques for identification using mobile and social media data
Networked access and mobile devices provide near constant data generation and collection. Users, environments, applications, each generate different types of data; from the voluntarily provided data posted in social networks to data collected by sensors on mobile devices, it is becoming trivial to access big data caches. Processing sufficiently large amounts of data results in inferences that can be characterized as privacy invasive. In order to address privacy risks we must understand the limits of the data exploring relationships between variables and how the user is reflected in them. In this dissertation we look at data collected from social networks and sensors to identify some aspect of the user or their surroundings. In particular, we find that from social media metadata we identify individual user accounts and from the magnetic field readings we identify both the (unique) cellphone device owned by the user and their course-grained location. In each project we collect real-world datasets and apply supervised learning techniques, particularly multi-class classification algorithms to test our hypotheses. We use both leave-one-out cross validation as well as k-fold cross validation to reduce any bias in the results. Throughout the dissertation we find that unprotected data reveals sensitive information about users. Each chapter also contains a discussion about possible obfuscation techniques or countermeasures and their effectiveness with regards to the conclusions we present. Overall our results show that deriving information about users is attainable and, with each of these results, users would have limited if any indication that any type of analysis was taking place
Android malware detection using machine learning to mitigate adversarial evasion attacks
In the current digital era, smartphones have become indispensable. Over the past few years, the exponential growth of Android users has made this operating system (OS) a prime target for smartphone malware. Consequently, the arms race between Android security personnel and malware developers seems enduring. Considering Machine Learning (ML) as the core component, various techniques are proposed in the literature to counter Android malware, however, the problem of adversarial evasion attacks on ML-based malware classifiers is understated. MLbased techniques are vulnerable to adversarial evasion attacks. The malware authors constantly try to craft adversarial examples to elude existing malware detection systems. This research presents the fragility of ML-based Android malware classifiers in adversarial environments and proposes novel techniques to counter adversarial evasion attacks on ML based Android malware classifiers.
First, we start our analysis by introducing the problem of Android malware detection in adversarial environments and provide a comprehensive overview of the domain. Second, we highlight the problem of malware clones in popular Android malware repositories. The malware clones in the datasets can potentially lead to biased results and computational overhead. Although many strategies are proposed in the literature to detect repackaged Android malware, these techniques require burdensome code inspection. Consequently, we employ a lightweight and novel strategy based on package names reusing to identify repackaged Android malware and build a clones-free Android malware dataset. Furthermore, we investigate the impact of repacked Android malware on various ML-based classifiers by training them on a clones free training set and testing on a set of benign, non repacked malware and all the malware clones in the dataset. Although trained on a reduced train set, we achieved up to 98.7% F1 score. Third, we propose Cure-Droid, an Android malware classification model trained on hybrid features and optimized using a tree-based pipeline optimization technique (TPoT). Fourth, to present the fragility of Cure- Droid model in adversarial environments, we formulate multiple adversarial evasion attacks to elude the model. Fifth, to counter adversarial evasion attacks on ML-based Android malware detectors, we propose CureDroid*, a novel and adversarially aware Android malware classification model. CureDroid* is based on an ensemble of ML-based models trained on distinct set of features where each model has the individual capability to detect Android malware. The CureDroid* model employs an ensemble of five ML-based models where each model is selected and optimized using TPoT. Our experimental results demonstrate that CureDroid* achieves up to 99.2% accuracy in non-adversarial settings and can detect up to 30 fabricated input features in the best case. Finally, we propose TrickDroid, a novel cumulative adversarial training framework based on Oracle and GAN-based adversarial data. Our experimental results present the efficacy of TrickDroid with up to 99.46% evasion detection
Creating better ground truth to further understand Android malware: A large scale mining approach based on antivirus labels and malicious artifacts
Mobile applications are essential for interacting with technology and other people. With more than 2 billion devices deployed all over the world, Android offers a thriving ecosystem by making accessible the work of thousands of developers on digital marketplaces such as Google Play. Nevertheless, the success of Android also exposes millions of users to malware authors who seek to siphon private information and hijack mobile devices for their benefits.
To fight against the proliferation of Android malware, the security community embraced machine learning, a branch of artificial intelligence that powers a new generation of detection systems. Machine learning algorithms, however, require a substantial number of qualified samples to learn the classification rules enforced by security experts. Unfortunately, malware ground truths are notoriously hard to construct due to the inherent complexity of Android applications and the global lack of public information about malware. In a context where both information and human resources are limited, the security community is in demand for new approaches to aid practitioners to accurately define Android malware, automate classification decisions, and improve the comprehension of Android malware.
This dissertation proposes three solutions to assist with the creation of malware ground truths.
The first contribution is STASE, an analytical framework that qualifies the composition of malware ground truths. STASE reviews the information shared by antivirus products with nine metrics in order to support the reproducibility of research experiments and detect potential biases. This dissertation reports the results of STASE against three typical settings and suggests additional recommendations for designing experiments based on Android malware.
The second contribution is EUPHONY, a heuristic system built to unify family clusters belonging to malware ground truths. EUPHONY exploits the co-occurrence of malware labels obtained from antivirus reports to study the relationship between Android applications and proposes a single family name per sample for the sake of facilitating malware experiments. This dissertation evaluates EUPHONY on well-known malware ground truths to assess the precision of our approach and produce a large dataset of malware tags for the research community.
The third contribution is AP-GRAPH, a knowledge database for dissecting the characteristics of malware ground truths. AP-GRAPH leverages the results of EUPHONY and static analysis to index artifacts that are highly correlated with malware activities and recommend the inspection of the most suspicious components. This dissertation explores the set of artifacts retrieved by AP-GRAPH from popular malware families to track down their correlation and their evolution compared to other malware populations
On the malware detection problem : challenges and novel approaches
Orientador: AndrĂ© Ricardo Abed GrĂ©gioCoorientador: Paulo LĂcio de GeusTese (doutorado) - Universidade Federal do ParanĂĄ, Setor de CiĂȘncias Exatas, Programa de PĂłs-Graduação em InformĂĄtica. Defesa : Curitiba,Inclui referĂȘnciasĂrea de concentração: CiĂȘncia da ComputaçãoResumo: Software Malicioso (malware) Ă© uma das maiores ameaças aos sistemas computacionais atuais, causando danos Ă imagem de indivĂduos e corporaçÔes, portanto requerendo o desenvolvimento de soluçÔes de detecção para prevenir que exemplares de malware causem danos e para permitir o uso seguro dos sistemas. Diversas iniciativas e soluçÔes foram propostas ao longo do tempo para detectar exemplares de malware, de Anti-VĂrus (AVs) a sandboxes, mas a detecção de malware de forma efetiva e eficiente ainda se mantĂ©m como um problema em aberto. Portanto, neste trabalho, me proponho a investigar alguns desafios, falĂĄcias e consequĂȘncias das pesquisas em detecção de malware de modo a contribuir para o aumento da capacidade de detecção das soluçÔes de segurança. Mais especificamente, proponho uma nova abordagem para o desenvolvimento de experimentos com malware de modo prĂĄtico mas ainda cientĂfico e utilizo-me desta abordagem para investigar quatro questĂ”es relacionadas a pesquisa em detecção de malware: (i) a necessidade de se entender o contexto das infecçÔes para permitir a detecção de ameaças em diferentes cenĂĄrios; (ii) a necessidade de se desenvolver melhores mĂ©tricas para a avaliação de soluçÔes antivĂrus; (iii) a viabilidade de soluçÔes com colaboração entre hardware e software para a detecção de malware de forma mais eficiente; (iv) a necessidade de predizer a ocorrĂȘncia de novas ameaças de modo a permitir a resposta Ă incidentes de segurança de forma mais rĂĄpida.Abstract: Malware is a major threat to most current computer systems, causing image damages and financial losses to individuals and corporations, thus requiring the development of detection solutions to prevent malware to cause harm and allow safe computers usage. Many initiatives and solutions to detect malware have been proposed over time, from AntiViruses (AVs) to sandboxes, but effective and efficient malware detection remains as a still open problem. Therefore, in this work, I propose taking a look on some malware detection challenges, pitfalls and consequences to contribute towards increasing malware detection system's capabilities. More specifically, I propose a new approach to tackle malware research experiments in a practical but still scientific manner and leverage this approach to investigate four issues: (i) the need for understanding context to allow proper detection of localized threats; (ii) the need for developing better metrics for AV solutions evaluation; (iii) the feasibility of leveraging hardware-software collaboration for efficient AV implementation; and (iv) the need for predicting future threats to allow faster incident responses
Advances in Computer Recognition, Image Processing and Communications, Selected Papers from CORES 2021 and IP&C 2021
As almost all human activities have been moved online due to the pandemic, novel robust and efficient approaches and further research have been in higher demand in the field of computer science and telecommunication. Therefore, this (reprint) book contains 13 high-quality papers presenting advancements in theoretical and practical aspects of computer recognition, pattern recognition, image processing and machine learning (shallow and deep), including, in particular, novel implementations of these techniques in the areas of modern telecommunications and cybersecurity
Symmetry-Adapted Machine Learning for Information Security
Symmetry-adapted machine learning has shown encouraging ability to mitigate the security risks in information and communication technology (ICT) systems. It is a subset of artificial intelligence (AI) that relies on the principles of processing future events by learning past events or historical data. The autonomous nature of symmetry-adapted machine learning supports effective data processing and analysis for security detection in ICT systems without the interference of human authorities. Many industries are developing machine-learning-adapted solutions to support security for smart hardware, distributed computing, and the cloud. In our Special Issue book, we focus on the deployment of symmetry-adapted machine learning for information security in various application areas. This security approach can support effective methods to handle the dynamic nature of security attacks by extraction and analysis of data to identify hidden patterns of data. The main topics of this Issue include malware classification, an intrusion detection system, image watermarking, color image watermarking, battlefield target aggregation behavior recognition model, IP camera, Internet of Things (IoT) security, service function chain, indoor positioning system, and crypto-analysis
Graph-Based Machine Learning for Passive Network Reconnaissance within Encrypted Networks
Network reconnaissance identifies a networkâs vulnerabilities to both prevent and mitigate the impact of cyber-attacks. The difficulty of performing adequate network reconnaissance has been exacerbated by the rising complexity of modern networks (e.g., encryption). We identify that the majority of network reconnaissance solutions proposed in literature are infeasible for widespread deployment in realistic modern networks. This thesis provides novel network reconnaissance solutions to address the limitations of the existing conventional approaches proposed in literature. The existing approaches are limited by their reliance on large, heterogeneous feature sets making them difficult to deploy under realistic network conditions. In contrast, we devise a bipartite graph-based representation to create network reconnaissance solutions that rely only on a single feature (e.g., the Internet protocol (IP) address field). We exploit a widely available feature set to provide network reconnaissance solutions that are scalable, independent of encryption, and deployable across diverse Internet (TCP/IP) networks. We design bipartite graph embeddings (BGE); a graph-based machine learning (ML) technique for extracting insight from the structural properties of the bipartite graph-based representation. BGE is the first known graph embedding technique designed explicitly for network reconnaissance. We validate the use of BGE through an evaluation of a universityâs enterprise network. BGE is shown to provide insight into crucial areas of network reconnaissance (e.g., device characterisation, service prediction, and network visualisation). We design an extension of BGE to acquire insight within a private network. Private networksâsuch as a virtual private network (VPN)âhave posed significant challenges for network reconnaissance as they deny direct visibility into their composition. Our extension of BGE provides the first known solution for inferring the composition of both the devices and applications acting behind diverse private networks. This thesis provides novel graph-based ML techniques for two crucial aims of network reconnaissanceâdevice characterisation and intrusion detection. The techniques developed within this thesis provide unique cybersecurity solutions to both prevent and mitigate the impact of cyber-attacks.Thesis (Ph.D.) -- University of Adelaide, School of Electrical and Electronic Engineering , 202