25 research outputs found

    DexBERT: Effective, Task-Agnostic and Fine-Grained Representation Learning of Android Bytecode

    Get PDF
    peer reviewedThe automation of an increasingly large number of software engineering tasks is becoming possible thanks to Machine Learning (ML). One foundational building block in the application of ML to software artifacts is the representation of these artifacts (e.g., source code or executable code) into a form that is suitable for learning. Traditionally, researchers and practitioners have relied on manually selected features, based on expert knowledge, for the task at hand. Such knowledge is sometimes imprecise and generally incomplete. To overcome this limitation, many studies have leveraged representation learning, delegating to ML itself the job of automatically devising suitable representations and selections of the most relevant features. Yet, in the context of Android problems, existing models are either limited to coarse-grained whole-app level (e.g., apk2vec) or conducted for one specific downstream task (e.g., smali2vec). Thus, the produced representation may turn out to be unsuitable for fine-grained tasks or cannot generalize beyond the task that they have been trained on. Our work is part of a new line of research that investigates effective, task-agnostic, and fine-grained universal representations of bytecode to mitigate both of these two limitations. Such representations aim to capture information relevant to various low-level downstream tasks (e.g., at the class-level). We are inspired by the field of Natural Language Processing, where the problem of universal representation was addressed by building Universal Language Models, such as BERT, whose goal is to capture abstract semantic information about sentences, in a way that is reusable for a variety of tasks. We propose DexBERT, a BERT-like Language Model dedicated to representing chunks of DEX bytecode, the main binary format used in Android applications. We empirically assess whether DexBERT is able to model the DEX language and evaluate the suitability of our model in three distinct class-level software engineering tasks: Malicious Code Localization, Defect Prediction, and Component Type Classification. We also experiment with strategies to deal with the problem of catering to apps having vastly different sizes, and we demonstrate one example of using our technique to investigate what information is relevant to a given task

    A Software Vulnerabilities Odysseus: Analysis, Detection, and Mitigation

    Get PDF
    Programming has become central in the development of human activities while not being immune to defaults, or bugs. Developers have developed specific methods and sequences of tests that they implement to prevent these bugs from being deployed in releases. Nonetheless, not all cases can be thought through beforehand, and automation presents limits the community attempts to overcome. As a consequence, not all bugs can be caught. These defaults are causing particular concerns in case bugs can be exploited to breach the program’s security policy. They are then called vulnerabilities and provide specific actors with undesired access to the resources a program manages. It damages the trust in the program and in its developers, and may eventually impact the adoption of the program. Hence, to attribute a specific attention to vulnerabilities appears as a natural outcome. In this regard, this PhD work targets the following three challenges: (1) The research community references those vulnerabilities, categorises them, reports and ranks their impact. As a result, analysts can learn from past vulnerabilities in specific programs and figure out new ideas to counter them. Nonetheless, the resulting quality of the lessons and the usefulness of ensuing solutions depend on the quality and the consistency of the information provided in the reports. (2) New methods to detect vulnerabilities can emerge among the teachings this monitoring provides. With responsible reporting, these detection methods can provide hardening of the programs we rely on. Additionally, in a context of computer perfor- mance gain, machine learning algorithms are increasingly adopted, providing engaging promises. (3) If some of these promises can be fulfilled, not all are not reachable today. Therefore a complementary strategy needs to be adopted while vulnerabilities evade detection up to public releases. Instead of preventing their introduction, programs can be hardened to scale down their exploitability. Increasing the complexity to exploit or lowering the impact below specific thresholds makes the presence of vulnerabilities an affordable risk for the feature provided. The history of programming development encloses the experimentation and the adoption of so-called defence mechanisms. Their goals and performances can be diverse, but their implementation in worldwide adopted programs and systems (such as the Android Open Source Project) acknowledges their pivotal position. To face these challenges, we provide the following contributions: • We provide a manual categorisation of the vulnerabilities of the worldwide adopted Android Open Source Project up to June 2020. Clarifying to adopt a vulnera- bility analysis provides consistency in the resulting data set. It facilitates the explainability of the analyses and sets up for the updatability of the resulting set of vulnerabilities. Based on this analysis, we study the evolution of AOSP’s vulnerabilities. We explore the different temporal evolutions of the vulnerabilities affecting the system for their severity, the type of vulnerability, and we provide a focus on memory corruption-related vulnerabilities. • We undertake the replication of a machine-learning based detection algorithms that, besides being part of the state-of-the-art and referenced to by ensuing works, was not available. Named VCCFinder, this algorithm implements a Support- Vector Machine and bases its training on Vulnerability-Contributing Commits and related patches for C and C++ code. Not in capacity to achieve analogous performances to the original article, we explore parameters and algorithms, and attempt to overcome the challenge provided by the over-population of unlabeled entries in the data set. We provide the community with our code and results as a replicable baseline for further improvement. • We eventually list the defence mechanisms that the Android Open Source Project incrementally implements, and we discuss how it sometimes answers comments the community addressed to the project’s developers. We further verify the extent to which specific memory corruption defence mechanisms were implemented in the binaries of different versions of Android (from API-level 10 to 28). We eventually confront the evolution of memory corruption-related vulnerabilities with the implementation timeline of related defence mechanisms

    A Pre-Trained BERT Model for Android Applications

    Full text link
    The automation of an increasingly large number of software engineering tasks is becoming possible thanks to Machine Learning (ML). One foundational building block in the application of ML to software artifacts is the representation of these artifacts (e.g., source code or executable code) into a form that is suitable for learning. Many studies have leveraged representation learning, delegating to ML itself the job of automatically devising suitable representations. Yet, in the context of Android problems, existing models are either limited to coarse-grained whole-app level (e.g., apk2vec) or conducted for one specific downstream task (e.g., smali2vec). Our work is part of a new line of research that investigates effective, task-agnostic, and fine-grained universal representations of bytecode to mitigate both of these two limitations. Such representations aim to capture information relevant to various low-level downstream tasks (e.g., at the class-level). We are inspired by the field of Natural Language Processing, where the problem of universal representation was addressed by building Universal Language Models, such as BERT, whose goal is to capture abstract semantic information about sentences, in a way that is reusable for a variety of tasks. We propose DexBERT, a BERT-like Language Model dedicated to representing chunks of DEX bytecode, the main binary format used in Android applications. We empirically assess whether DexBERT is able to model the DEX language and evaluate the suitability of our model in two distinct class-level software engineering tasks: Malicious Code Localization and Defect Prediction. We also experiment with strategies to deal with the problem of catering to apps having vastly different sizes, and we demonstrate one example of using our technique to investigate what information is relevant to a given task

    Creating better ground truth to further understand Android malware: A large scale mining approach based on antivirus labels and malicious artifacts

    Get PDF
    Mobile applications are essential for interacting with technology and other people. With more than 2 billion devices deployed all over the world, Android offers a thriving ecosystem by making accessible the work of thousands of developers on digital marketplaces such as Google Play. Nevertheless, the success of Android also exposes millions of users to malware authors who seek to siphon private information and hijack mobile devices for their benefits. To fight against the proliferation of Android malware, the security community embraced machine learning, a branch of artificial intelligence that powers a new generation of detection systems. Machine learning algorithms, however, require a substantial number of qualified samples to learn the classification rules enforced by security experts. Unfortunately, malware ground truths are notoriously hard to construct due to the inherent complexity of Android applications and the global lack of public information about malware. In a context where both information and human resources are limited, the security community is in demand for new approaches to aid practitioners to accurately define Android malware, automate classification decisions, and improve the comprehension of Android malware. This dissertation proposes three solutions to assist with the creation of malware ground truths. The first contribution is STASE, an analytical framework that qualifies the composition of malware ground truths. STASE reviews the information shared by antivirus products with nine metrics in order to support the reproducibility of research experiments and detect potential biases. This dissertation reports the results of STASE against three typical settings and suggests additional recommendations for designing experiments based on Android malware. The second contribution is EUPHONY, a heuristic system built to unify family clusters belonging to malware ground truths. EUPHONY exploits the co-occurrence of malware labels obtained from antivirus reports to study the relationship between Android applications and proposes a single family name per sample for the sake of facilitating malware experiments. This dissertation evaluates EUPHONY on well-known malware ground truths to assess the precision of our approach and produce a large dataset of malware tags for the research community. The third contribution is AP-GRAPH, a knowledge database for dissecting the characteristics of malware ground truths. AP-GRAPH leverages the results of EUPHONY and static analysis to index artifacts that are highly correlated with malware activities and recommend the inspection of the most suspicious components. This dissertation explores the set of artifacts retrieved by AP-GRAPH from popular malware families to track down their correlation and their evolution compared to other malware populations

    Eight years of rider measurement in the Android malware ecosystem: evolution and lessons learned

    Full text link
    Despite the growing threat posed by Android malware, the research community is still lacking a comprehensive view of common behaviors and trends exposed by malware families active on the platform. Without such view, the researchers incur the risk of developing systems that only detect outdated threats, missing the most recent ones. In this paper, we conduct the largest measurement of Android malware behavior to date, analyzing over 1.2 million malware samples that belong to 1.2K families over a period of eight years (from 2010 to 2017). We aim at understanding how the behavior of Android malware has evolved over time, focusing on repackaging malware. In this type of threats different innocuous apps are piggybacked with a malicious payload (rider), allowing inexpensive malware manufacturing. One of the main challenges posed when studying repackaged malware is slicing the app to split benign components apart from the malicious ones. To address this problem, we use differential analysis to isolate software components that are irrelevant to the campaign and study the behavior of malicious riders alone. Our analysis framework relies on collective repositories and recent advances on the systematization of intelligence extracted from multiple anti-virus vendors. We find that since its infancy in 2010, the Android malware ecosystem has changed significantly, both in the type of malicious activity performed by the malicious samples and in the level of obfuscation used by malware to avoid detection. We then show that our framework can aid analysts who attempt to study unknown malware families. Finally, we discuss what our findings mean for Android malware detection research, highlighting areas that need further attention by the research community.Accepted manuscrip

    Android HIV: A Study of Repackaging Malware for Evading Machine-Learning Detection

    Full text link
    Machine learning based solutions have been successfully employed for automatic detection of malware in Android applications. However, machine learning models are known to lack robustness against inputs crafted by an adversary. So far, the adversarial examples can only deceive Android malware detectors that rely on syntactic features, and the perturbations can only be implemented by simply modifying Android manifest. While recent Android malware detectors rely more on semantic features from Dalvik bytecode rather than manifest, existing attacking/defending methods are no longer effective. In this paper, we introduce a new highly-effective attack that generates adversarial examples of Android malware and evades being detected by the current models. To this end, we propose a method of applying optimal perturbations onto Android APK using a substitute model. Based on the transferability concept, the perturbations that successfully deceive the substitute model are likely to deceive the original models as well. We develop an automated tool to generate the adversarial examples without human intervention to apply the attacks. In contrast to existing works, the adversarial examples crafted by our method can also deceive recent machine learning based detectors that rely on semantic features such as control-flow-graph. The perturbations can also be implemented directly onto APK's Dalvik bytecode rather than Android manifest to evade from recent detectors. We evaluated the proposed manipulation methods for adversarial examples by using the same datasets that Drebin and MaMadroid (5879 malware samples) used. Our results show that, the malware detection rates decreased from 96% to 1% in MaMaDroid, and from 97% to 1% in Drebin, with just a small distortion generated by our adversarial examples manipulation method.Comment: 15 pages, 11 figure

    MediaSync: Handbook on Multimedia Synchronization

    Get PDF
    This book provides an approachable overview of the most recent advances in the fascinating field of media synchronization (mediasync), gathering contributions from the most representative and influential experts. Understanding the challenges of this field in the current multi-sensory, multi-device, and multi-protocol world is not an easy task. The book revisits the foundations of mediasync, including theoretical frameworks and models, highlights ongoing research efforts, like hybrid broadband broadcast (HBB) delivery and users' perception modeling (i.e., Quality of Experience or QoE), and paves the way for the future (e.g., towards the deployment of multi-sensory and ultra-realistic experiences). Although many advances around mediasync have been devised and deployed, this area of research is getting renewed attention to overcome remaining challenges in the next-generation (heterogeneous and ubiquitous) media ecosystem. Given the significant advances in this research area, its current relevance and the multiple disciplines it involves, the availability of a reference book on mediasync becomes necessary. This book fills the gap in this context. In particular, it addresses key aspects and reviews the most relevant contributions within the mediasync research space, from different perspectives. Mediasync: Handbook on Multimedia Synchronization is the perfect companion for scholars and practitioners that want to acquire strong knowledge about this research area, and also approach the challenges behind ensuring the best mediated experiences, by providing the adequate synchronization between the media elements that constitute these experiences
    corecore