47 research outputs found

    Android Malware Family Classification and Analysis: Current Status and Future Directions

    Android receives major attention from security practitioners and researchers because of the influx of malicious applications. For the past twelve years, Android malicious applications have been grouped into families. In the research community, detecting new malware families is a challenge. Our investigation shows that most literature reviews focus on surveying malware detection. Characterizing malware families can improve the detection process and help in understanding malware patterns. For this reason, we conduct a comprehensive survey of state-of-the-art Android malware familial detection, identification, and categorization techniques. We categorize the literature along three dimensions: type of analysis, features, and methodologies and techniques. Furthermore, we report the datasets that are commonly used. Finally, we highlight the limitations that we identify in the literature, challenges, and future research directions regarding Android malware families. https://doi.org/10.3390/electronics906094

    Proposed Framework to Improving Performance of Familial Classification in Android Malware

    Because of recent developments in hardware and software technologies for mobile phones, people depend on their smartphones more than ever before. Today, people conduct a variety of business, health, and financial transactions on their mobile devices. This trend has caused an influx of mobile applications that require users' sensitive information. As the number of these applications has grown, so has the number of malicious applications that may compromise users' sensitive information. Among all smartphone platforms, Android receives major attention from security practitioners and researchers because of the large number of malicious applications. For the past twelve years, Android malicious applications have been clustered into groups for better identification. Characterizing malware families can improve the detection process and help in understanding malware patterns. However, in the research community, detecting new malware families is a challenge. In this research, a framework is proposed to improve the performance of familial classification in Android malware. The framework is named the Reverse Engineering Framework (RevEng). Within RevEng, applications' permissions were selected and then fed into machine learning algorithms. Through our research, we created a reduced set of permissions using the Extremely Randomized Trees algorithm that achieved high accuracy and a shorter execution time. Furthermore, we conducted two approaches based on the extracted information. The first approach used a binary value representation of the permissions. The second approach used the features' importance: each selected permission was represented by its weight value instead of the binary value used in the first approach. We compared the results of our two approaches with other relevant works. Our approaches achieved better results in both accuracy and time performance with a reduced number of permissions.
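
    The two permission representations described in this abstract can be illustrated with a minimal sketch. The permission names and importance weights below are hypothetical placeholders, not values from the paper; in the described framework the weights would come from a feature-selection step such as Extremely Randomized Trees.

```python
# Hypothetical illustration: permission names and weights are made up for
# this sketch and are not taken from the RevEng work.

def binary_vector(app_permissions, selected):
    """Encode each selected permission as 1 if the app requests it, else 0."""
    return [1 if perm in app_permissions else 0 for perm in selected]


def weighted_vector(app_permissions, weights):
    """Encode each selected permission by its importance weight if requested, else 0.0."""
    return [w if perm in app_permissions else 0.0 for perm, w in weights.items()]


# Assumed output of a feature-selection step (permission -> importance weight).
IMPORTANCE = {
    "SEND_SMS": 0.40,
    "READ_CONTACTS": 0.35,
    "INTERNET": 0.25,
}

# Permissions requested by one hypothetical application.
app = {"INTERNET", "SEND_SMS", "CAMERA"}

binary = binary_vector(app, list(IMPORTANCE))   # first approach
weighted = weighted_vector(app, IMPORTANCE)     # second approach
```

    Permissions outside the reduced set (here `CAMERA`) are simply ignored, which is what keeps the feature vector short.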

    Advanced Security Analysis for Emergent Software Platforms

    Emergent software ecosystems, which have boomed with the advent of smartphones and Internet of Things (IoT) platforms, are increasingly sophisticated, deployed into highly dynamic environments, and facilitate interactions across heterogeneous domains. Accordingly, assessing their security is a pressing need, yet it requires high levels of scalability and reliability to handle the dynamism involved in such volatile ecosystems. This dissertation seeks to enhance conventional security detection methods to cope with the emergent features of contemporary software ecosystems. In particular, it analyzes the security of Android and IoT ecosystems by developing rigorous vulnerability detection methods. A critical aspect of this work is the focus on detecting vulnerable and unsafe interactions between applications that share common components and devices. Contributions of this work include novel insights and methods for: (1) detecting vulnerable interactions between Android applications that leverage dynamic loading features for concealing the interactions; (2) identifying unsafe interactions between smart home applications by considering physical and cyber channels; (3) detecting malicious IoT applications that are developed to target numerous IoT devices; (4) detecting insecure patterns of emergent security APIs that are reused from open-source software. In all four research thrusts, we present thorough security analysis and extensive evaluations based on real-world applications. Our results demonstrate that the proposed detection mechanisms can efficiently and effectively detect vulnerabilities in contemporary software platforms. Advisers: Hamid Bagheri and Qiben Ya

    Latent Representation and Sampling in Network: Application in Text Mining and Biology.

    In classical machine learning, hand-designed features are used for learning a mapping from raw data. However, human involvement in feature design makes the process expensive. Representation learning aims to learn abstract features directly from data without direct human involvement. Raw data can take various forms. A network is one form of data that encodes relational structure in many real-world domains. Therefore, learning abstract features for network units is an important task. In this dissertation, we propose models for incorporating temporal information given as a collection of networks from subsequent time-stamps. The primary objective of our models is to learn a better abstract feature representation of nodes and edges in an evolving network. We show that the temporal information in the abstract features substantially improves the performance of the link prediction task. Besides applying our models to network data, we also employ them to incorporate extra-sentential information in the text domain for learning better representations of sentences. We build a context network of sentences to capture extra-sentential information. This information in the abstract feature representation of sentences substantially improves various text-mining tasks over a set of baseline methods. A problem with the abstract features that we learn is that they lack interpretability. In real-life applications on network data, for some tasks, it is crucial to learn interpretable features in the form of graphical structures. For this we need to mine important graphical structures along with their frequency statistics from the input dataset. However, exact algorithms for these tasks are computationally expensive, so scalable algorithms are urgently needed. To overcome this challenge, we provide efficient sampling algorithms for mining higher-order structures from network(s). We show that our sampling-based algorithms are scalable. They are also superior to a set of baseline algorithms in terms of retrieving important graphical sub-structures and collecting their frequency statistics. Finally, we show that we can use these frequent subgraph statistics and structures as features in various real-life applications. We show one application in biology and another in security. In both cases, we show that the structures and their statistics significantly improve the performance of knowledge discovery tasks in these domains.

    Program Similarity Analysis for Malware Classification and its Pitfalls

    Malware classification, specifically the task of grouping malware samples into families according to their behaviour, is vital in order to understand the threat they pose and how to protect against them. Recognizing whether one program shares behaviors with another is a task that requires semantic reasoning, meaning that it needs to consider what a program actually does. This is a famously uncomputable problem, due to Rice’s theorem. As there is no one-size-fits-all solution, determining program similarity in the context of malware classification requires different tools and methods depending on what is available to the malware defender. When the malware source code is readily available (or at least, easy to retrieve), most approaches employ semantic “abstractions”, which are computable approximations of the semantics of the program. We consider this the first scenario for this thesis: malware classification using semantic abstractions extracted from the source code in an open system. Structural features, such as the control flow graphs of programs, can be used to classify malware reasonably well. To demonstrate this, we build a tool for malware analysis, R.E.H.A., which targets the Android system and leverages its openness to extract a structural feature from the source code of malware samples. This tool is first successfully evaluated against a state-of-the-art malware dataset and then on a newly collected dataset. We show that R.E.H.A. is able to classify the new samples into their respective families, often outperforming commercial antivirus software. However, abstractions have limitations by virtue of being approximations. We show that by increasing the granularity of the abstractions used to produce more fine-grained features, we can improve the accuracy of the results, as in our second tool, StranDroid, which generates fewer false positives on the same datasets. The source code of malware samples is not often available or easily retrievable. For this reason, we introduce a second scenario in which the classification must be carried out with only the compiled binaries of malware samples on hand. Program similarity in this context cannot be determined using semantic abstractions as before, since it is difficult to create meaningful abstractions from zeros and ones. Instead, by treating the compiled programs as raw data, we transform them into images and build upon common image classification algorithms using machine learning. This led us to develop novel deep learning models, a convolutional neural network and a long short-term memory network, to classify the samples into their respective families. To overcome the usual obstacle in deep learning of lacking sufficiently large and balanced datasets, we utilize obfuscations as a data augmentation tool to generate semantically equivalent variants of existing samples and expand the dataset as needed. Finally, to lower the computational cost of the training process, we use transfer learning and show that a model trained on one dataset can be used to successfully classify samples in different malware datasets. The third scenario explored in this thesis assumes that even the binary itself cannot be accessed for analysis, but it can be executed, and the execution traces can then be used to extract semantic properties. However, dynamic analysis lacks the formal tools and frameworks that exist in static analysis to allow proving the effectiveness of obfuscations. For this reason, the focus shifts to building a novel formal framework that is able to assess the potency of obfuscations against dynamic analysis. We validate the new framework by using it to encode known analyses and obfuscations, and show how these obfuscations actually hinder the dynamic analysis process.
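
    The abstract does not say how compiled programs are turned into images. One common approach, sketched here purely as an assumption and not as the thesis's actual pipeline, is to read each byte as a grayscale pixel (0 to 255) and wrap the byte stream into fixed-width rows. The function name, the row width, and the zero padding are all hypothetical choices.

```python
def bytes_to_grayscale_rows(data: bytes, width: int, pad: int = 0):
    """Wrap a byte stream into rows of `width` pixels, one byte per pixel.

    The last row is padded with `pad` so that every row has the same length.
    This is an illustrative sketch, not the method used in the thesis.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    rows = []
    for start in range(0, len(data), width):
        row = list(data[start:start + width])   # each byte becomes an int in 0..255
        row.extend([pad] * (width - len(row)))  # pad only the final short row
        rows.append(row)
    return rows


# A tiny stand-in for the contents of a compiled binary.
sample = bytes([0, 127, 255, 16, 32])
image = bytes_to_grayscale_rows(sample, width=3)
```

    The resulting list of rows can be handed to any image library or array type before being fed to an image classifier.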

    Mustererkennungsbasierte Verteidgung gegen gezielte Angriffe

    The speed at which everything and everyone is being connected considerably outstrips the rate at which effective security mechanisms are introduced to protect them. This has created an opportunity for resourceful threat actors who have specialized in conducting low-volume persistent attacks through sophisticated techniques that are tailored to specific valuable targets. Consequently, traditional approaches are rendered ineffective against targeted attacks, creating an acute need for innovative defense mechanisms. This thesis aims at supporting the security practitioner in bridging this gap by introducing a holistic strategy against targeted attacks that addresses key challenges encountered during the phases of detection, analysis and response. The structure of this thesis is therefore aligned to these three phases, with each one of its central chapters taking on a particular problem and proposing a solution built on a strong foundation of pattern recognition and machine learning. In particular, we propose a detection approach that, in the absence of additional authentication mechanisms, makes it possible to identify spear-phishing emails without relying on their content. Next, we introduce an analysis approach for malware triage based on the structural characterization of malicious code. Finally, we introduce MANTIS, an open-source platform for authoring, sharing and collecting threat intelligence, whose data model is based on an innovative unified representation for threat intelligence standards based on attributed graphs. We evaluate our approaches in several experiments that demonstrate their potential usefulness in real scenarios. As a whole, these ideas open new avenues for research on defense mechanisms and represent an attempt to counteract the imbalance between resourceful actors and society at large.

    Negative Results of Fusing Code and Documentation for Learning to Accurately Identify Sensitive Source and Sink Methods An Application to the Android Framework for Data Leak Detection

    Almost two-thirds of the population owns a mobile phone. Given the profusion of mobile applications that manipulate all sorts of data, privacy-related concerns arise more and more. New regulations such as the General Data Protection Regulation (GDPR) provide rules with which developers must comply when their apps process sensitive and/or private data. Ensuring that no such data is leaked without the consent of the user is a primary objective in each GDPR compliance check. Researchers have proposed sophisticated approaches to track sensitive data within mobile apps, all of which rely on specific lists of sensitive source and sink methods. The data flow analysis results greatly depend on the quality of these lists. Previous approaches either used incomplete hand-written lists that quickly became outdated or relied on machine learning. The latter, however, leads to numerous false positives, as we show. This paper introduces CoDoC, which aims to revive the machine-learning approach to precisely identify privacy-related source and sink API methods. In contrast to previous approaches, CoDoC uses deep learning techniques and combines the source code with the documentation of API methods. Firstly, we propose novel definitions that clarify the concepts of taint analysis, source, and sink methods. Secondly, based on these definitions, we build a new ground truth of Android methods representing sensitive source, sink, and neither methods that will be used to train our classifier. We evaluate CoDoC and show that, on our validation dataset, it achieves a precision, recall, and F1 score of 91%, outperforming the state-of-the-art SuSi. However, similarly to existing tools, we show that in the wild, i.e., with unseen data, CoDoC performs poorly and generates many false-positive results. Our findings suggest that machine-learning models for abstract concepts such as privacy fail in practice despite good lab results. To encourage future research, we release all our artifacts to the community.
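
    For readers unfamiliar with the metrics quoted in this abstract, the sketch below shows how precision, recall, and F1 are computed from classification counts. The counts are hypothetical and chosen only so that all three metrics come out at 0.91; they are not the paper's data.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1


# Hypothetical counts: 91 correctly flagged methods, 9 false alarms, 9 misses.
p, r, f1 = precision_recall_f1(tp=91, fp=9, fn=9)
```

    A rise in false positives on unseen data, as the abstract reports, lowers precision directly and F1 along with it.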