    A Multi-view Context-aware Approach to Android Malware Detection and Malicious Code Localization

    Existing Android malware detection approaches use a variety of features, such as security-sensitive APIs, system calls, control-flow structures and information flows, in conjunction with machine learning classifiers to achieve accurate detection. Each of these feature sets provides a unique semantic perspective (or view) of apps' behaviours, with inherent strengths and limitations. That is, some views are more amenable to detecting certain attacks but may not be suitable for characterising several others. Most existing malware detection approaches use only one (or a selected few) of the aforementioned feature sets, which prevents them from detecting a vast majority of attacks. Addressing this limitation, we propose MKLDroid, a unified framework that systematically integrates multiple views of apps to perform comprehensive malware detection and malicious code localisation. The rationale is that, while a malware app can disguise itself in some views, disguising itself in every view while maintaining malicious intent is much harder. MKLDroid uses a graph kernel to capture structural and contextual information from apps' dependency graphs and identify malicious code patterns in each view. Subsequently, it employs Multiple Kernel Learning (MKL) to find a weighted combination of the views that yields the best detection accuracy. Besides multi-view learning, MKLDroid's unique and salient trait is its ability to locate fine-grained malicious code portions in dependency graphs (e.g., methods/classes). Through large-scale experiments on several datasets (including wild apps), we demonstrate that MKLDroid consistently outperforms three state-of-the-art techniques in terms of accuracy while maintaining comparable efficiency. In our malicious code localisation experiments on a dataset of repackaged malware, MKLDroid was able to identify all the malicious classes with 94% average recall.
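
    The idea of a weighted combination of per-view kernels can be illustrated with a minimal sketch. It assumes each view has already produced a Gram matrix over the same apps, and it weights views by kernel-target alignment, a simple heuristic swapped in here for illustration; it is not MKLDroid's MKL solver. The names alignment and combine_views are hypothetical.

```python
from math import sqrt

def frobenius_inner(A, B):
    """Sum of element-wise products of two equally sized matrices."""
    return sum(a * b for row_a, row_b in zip(A, B) for a, b in zip(row_a, row_b))

def alignment(K, y):
    """Kernel-target alignment between Gram matrix K and labels y in {-1, +1}."""
    Y = [[yi * yj for yj in y] for yi in y]
    return frobenius_inner(K, Y) / sqrt(frobenius_inner(K, K) * frobenius_inner(Y, Y))

def combine_views(kernels, y):
    """Weight each view's kernel by its non-negative alignment, then sum them.

    Assumption: alignment-based weights stand in for a learned MKL weighting.
    """
    scores = [max(alignment(K, y), 0.0) for K in kernels]
    total = sum(scores)
    if total > 0:
        weights = [s / total for s in scores]
    else:
        # No view aligns with the labels: fall back to uniform weights.
        weights = [1.0 / len(kernels)] * len(kernels)
    n = len(y)
    combined = [
        [sum(w * K[i][j] for w, K in zip(weights, kernels)) for j in range(n)]
        for i in range(n)
    ]
    return weights, combined

# Toy example: two views over four apps (two malicious, two benign).
labels = [1, 1, -1, -1]
informative_view = [[float(a * b) for b in labels] for a in labels]
uninformative_view = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
weights, combined = combine_views([informative_view, uninformative_view], labels)
```

    In this toy setting the informative view receives the larger weight. The combined matrix could then be passed to any classifier that accepts a precomputed kernel.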

    Android Malware Detection via Graphlet Sampling

    Android systems are widely used in mobile and wireless distributed systems. In the near future, Android is believed to dominate the mobile distributed environment. However, with the popularity of Android-based smartphones/tablets comes the rampancy of Android-based malware. In this paper, we propose a novel topological signature of Android apps based on the function call graphs (FCGs) extracted from their Android App PacKages (APKs). Specifically, by leveraging recent advances in graphlet mining, the proposed method fully captures the invoker-invokee relationship at local neighborhoods in an FCG without exponentially inflating the state space. Using real benign apps and malware samples, we demonstrate that our method, ACTS (App topologiCal signature through graphleT Sampling), can detect malware and identify malware families robustly and efficiently. More importantly, we demonstrate that, without augmenting the FCG with any semantic features such as bytecode-based vertex typing, local topological information captured by ACTS alone can achieve a high malware detection accuracy. Since ACTS only uses structural features, which are orthogonal to semantic features, it is expected that combining them would give a greater improvement in malware detection accuracy than combining non-orthogonal semantic features.
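
    A rough sketch of estimating a graphlet histogram from a call graph by sampling is given below. It is an illustration only: the random neighbour expansion is not a uniform sampler, the sorted in/out-degree tuple is a coarse stand-in for a proper isomorphism class, and the function name and parameters are hypothetical rather than taken from ACTS.

```python
import random
from collections import Counter

def sample_graphlet_histogram(edges, k=3, samples=1000, seed=0):
    """Estimate a normalized histogram of k-node graphlet signatures.

    edges: iterable of (caller, callee) pairs from a function call graph.
    Assumption: connected node sets are grown by random neighbour expansion,
    which is simple but does not sample graphlets uniformly.
    """
    rng = random.Random(seed)
    edge_set = set(edges)
    neighbors = {}
    for u, v in edge_set:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    nodes = sorted(neighbors)
    counts = Counter()
    for _ in range(samples):
        chosen = {rng.choice(nodes)}
        while len(chosen) < k:
            # Candidates are nodes adjacent (in either direction) to the current set.
            frontier = sorted(set().union(*(neighbors[n] for n in chosen)) - chosen)
            if not frontier:
                break
            chosen.add(rng.choice(frontier))
        if len(chosen) < k:
            continue  # component too small to hold a k-node graphlet
        # Signature: sorted (in-degree, out-degree) pairs inside the induced subgraph.
        signature = tuple(sorted(
            (sum((m, n) in edge_set for m in chosen),
             sum((n, m) in edge_set for m in chosen))
            for n in chosen
        ))
        counts[signature] += 1
    total = sum(counts.values())
    return {sig: c / total for sig, c in counts.items()} if total else {}

# Toy call graph: a directed 3-cycle, so every sample is the same graphlet.
cycle_hist = sample_graphlet_histogram([("a", "b"), ("b", "c"), ("c", "a")])
```

    The resulting histogram is a fixed-vocabulary, purely structural feature vector; no bytecode or API information is used.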

    Obfuscation-resilient Android Malware Analysis Based on Contrastive Learning

    Due to its open-source nature, the Android operating system has been a main target for attackers. Malware creators routinely apply different code obfuscations to their apps to hide malicious activities. Features extracted from these obfuscated samples through program analysis contain many useless and disguised features, which leads to many false negatives. To address this issue, we demonstrate that obfuscation-resilient malware analysis can be achieved through contrastive learning. We take Android malware classification as an example to demonstrate our analysis. The key insight behind our analysis is that contrastive learning can be used to reduce the difference introduced by obfuscation while amplifying the difference between malware and benign apps (or other types of malware). Based on the proposed analysis, we design a system that achieves robust and interpretable classification of Android malware. To achieve robust classification, we perform contrastive learning on malware samples to learn an encoder that automatically extracts robust features from them. To achieve interpretable classification, we transform the function call graph of a sample into an image by centrality analysis. The corresponding heatmaps are then obtained by visualization techniques. These heatmaps can help users understand why a malware sample is assigned to a given family. We implement the system as IFDroid and perform extensive evaluations on two widely used datasets. Experimental results show that IFDroid is superior to state-of-the-art Android malware familial classification systems. Moreover, IFDroid maintains a 98.2% true positive rate when classifying 8,112 obfuscated malware samples.
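
    The key insight, pulling a sample towards its obfuscated variant while pushing it away from other samples, can be sketched with an InfoNCE-style loss. This is a common contrastive objective used here as an assumption; the abstract does not state which loss IFDroid uses, and the embeddings, function names and temperature value below are hypothetical.

```python
from math import exp, log, sqrt

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for one anchor embedding.

    anchor:    embedding of a sample
    positive:  embedding of a view that should match it (e.g. an obfuscated variant)
    negatives: embeddings of samples that should not match (e.g. other families)
    """
    pos = exp(cosine(anchor, positive) / temperature)
    neg = sum(exp(cosine(anchor, n) / temperature) for n in negatives)
    return -log(pos / (pos + neg))

# Toy embeddings: the variant stays close to the original, the other family does not.
original = [1.0, 0.0]
obfuscated_variant = [0.9, 0.1]
other_family = [0.0, 1.0]
good_loss = info_nce(original, obfuscated_variant, [other_family])
bad_loss = info_nce(original, other_family, [obfuscated_variant])
```

    The loss is small when the encoder already maps a sample and its variant close together, and large when the roles are swapped, which is the pressure that drives an encoder towards obfuscation-invariant features.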

    Latent Representation and Sampling in Network: Application in Text Mining and Biology

    In classical machine learning, hand-designed features are used for learning a mapping from raw data. However, human involvement in feature design makes the process expensive. Representation learning aims to learn abstract features directly from data without direct human involvement. Raw data can take various forms. A network is one form of data that encodes relational structure in many real-world domains. Therefore, learning abstract features for network units is an important task. In this dissertation, we propose models for incorporating temporal information given as a collection of networks from subsequent time-stamps. The primary objective of our models is to learn a better abstract feature representation of nodes and edges in an evolving network. We show that the temporal information in the abstract feature substantially improves the performance of the link prediction task. Besides applying our models to network data, we also employ them to incorporate extra-sentential information in the text domain for learning better representations of sentences. We build a context network of sentences to capture extra-sentential information. This information in the abstract feature representation of sentences substantially improves various text-mining tasks over a set of baseline methods. A problem with the abstract features that we learn is that they lack interpretability. In real-life applications on network data, for some tasks, it is crucial to learn interpretable features in the form of graphical structures. For this we need to mine important graphical structures along with their frequency statistics from the input dataset. However, exact algorithms for these tasks are computationally expensive, so scalable algorithms are urgently needed. To overcome this challenge, we provide efficient sampling algorithms for mining higher-order structures from network(s). We show that our sampling-based algorithms are scalable. They are also superior to a set of baseline algorithms in terms of retrieving important graphical sub-structures and collecting their frequency statistics. Finally, we show that we can use these frequent subgraph statistics and structures as features in various real-life applications. We show one application in biology and another in security. In both cases, we show that the structures and their statistics significantly improve the performance of knowledge discovery tasks in these domains.

    NSDroid: Efficient Multi-classification of Android Malware using Neighborhood Signature in Local Function Call Graphs

    With the rapid development of the mobile Internet, Android applications are used more and more in people's daily lives. While bringing convenience and making people's lives smarter, Android applications also face serious security and privacy issues, e.g., information leakage and monetary loss caused by malware. Detection and classification of malware have thus attracted much research attention in recent years. Most current malware detection and classification approaches are based on graph-based similarity analysis (e.g., subgraph isomorphism), which is well known to be time-consuming, especially for large graphs. In this paper, we propose NSDroid, a time-efficient malware multi-classification approach based on neighborhood signatures in local function call graphs (FCGs). NSDroid uses a neighborhood-signature-based approach to calculate the similarity of different applications' FCGs, which is significantly faster than traditional approaches based on subgraph isomorphism. For each node in the FCGs, NSDroid uses a fixed-length neighborhood signature to capture the caller-callee relationship between different functions, and combines the neighborhood signatures of all nodes to form a vector that characterizes the function call relationships in the whole application. The generated signature vector is fed into an SVM-based classifier to determine which family the malware belongs to. Experimental results on large-scale benchmarks show that, compared with state-of-the-art solutions, NSDroid reduces average detection latency by nearly 20x while also improving several evaluation metrics, such as recall.
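
    One way to picture a fixed-length per-node signature that is aggregated into an app-level vector is sketched below. The signature used here, capped caller and callee counts per function, is an assumption made for illustration and is not NSDroid's actual encoding; signature_vector and cap are hypothetical names.

```python
from collections import Counter

def signature_vector(edges, cap=3):
    """Build a fixed-length app vector from per-node neighborhood signatures.

    edges: iterable of (caller, callee) pairs from a function call graph.
    Assumption: a node's signature is (number of callers, number of callees),
    each capped at `cap`; the app vector is the normalized histogram of
    signatures, flattened in a fixed order so its length is (cap + 1) ** 2.
    """
    caller_count, callee_count = Counter(), Counter()
    nodes = set()
    for caller, callee in set(edges):
        callee_count[caller] += 1  # functions this node calls
        caller_count[callee] += 1  # functions that call this node
        nodes.update((caller, callee))
    histogram = Counter(
        (min(caller_count[n], cap), min(callee_count[n], cap)) for n in nodes
    )
    size = len(nodes)
    return [
        histogram[(i, o)] / size if size else 0.0
        for i in range(cap + 1)
        for o in range(cap + 1)
    ]

# Toy call graph: main calls a and b, and a calls b.
vector = signature_vector([("main", "a"), ("main", "b"), ("a", "b")], cap=2)
```

    Because every app maps to a vector of the same length, comparing two apps reduces to a vector operation instead of a subgraph isomorphism test, and the vectors can be fed directly to a standard classifier.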
