62 research outputs found
Efficient Concept Drift Handling for Batch Android Malware Detection Models
The rapidly evolving nature of Android apps poses a significant challenge to
static batch machine learning algorithms employed in malware detection systems,
as they quickly become obsolete. Despite this challenge, the existing
literature pays limited attention to addressing this issue, with many advanced
Android malware detection approaches, such as Drebin, DroidDet and MaMaDroid,
relying on static models. In this work, we show how retraining techniques are
able to maintain detector capabilities over time. Particularly, we analyze the
effect of two aspects in the efficiency and performance of the detectors: 1)
the frequency with which the models are retrained, and 2) the data used for
retraining. In the first experiment, we compare periodic retraining with a more
advanced concept drift detection method that triggers retraining only when
necessary. In the second experiment, we analyze sampling methods to reduce the
amount of data used to retrain models. Specifically, we compare fixed sized
windows of recent data and state-of-the-art active learning methods that select
those apps that help keep the training dataset small but diverse. Our
experiments show that concept drift detection and sample selection mechanisms
result in very efficient retraining strategies which can be successfully used
to maintain the performance of the static Android malware state-of-the-art
detectors in changing environments. Comment: 18 pages
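As a rough illustration of the retraining strategy described in this abstract, the sketch below retrains a model only when the error rate over recent apps exceeds a threshold, and trains on a fixed-size window of recent data. The class `DriftTriggeredRetrainer`, the toy learner `train_threshold`, and all parameter values are hypothetical; the simple threshold rule stands in for the drift detector and sample selection methods evaluated in the paper, which the abstract does not specify.

```python
from collections import deque


def train_threshold(samples):
    """Toy 1-D learner (assumption): threshold halfway between the class means."""
    pos = [x for x, y in samples if y == 1]
    neg = [x for x, y in samples if y == 0]
    if not pos or not neg:
        # Only one class seen: fall back to a constant prediction.
        constant = 1 if pos else 0
        return lambda x: constant
    pos_mean = sum(pos) / len(pos)
    neg_mean = sum(neg) / len(neg)
    cut = (pos_mean + neg_mean) / 2
    sign = 1 if pos_mean >= neg_mean else -1
    return lambda x: int(sign * (x - cut) > 0)


class DriftTriggeredRetrainer:
    """Retrain on a fixed-size window only when recent errors exceed max_error."""

    def __init__(self, train_fn, window_size=40, check_size=20, max_error=0.2):
        self.train_fn = train_fn                      # list[(x, y)] -> predict function
        self.window = deque(maxlen=window_size)       # most recent labeled apps
        self.recent_errors = deque(maxlen=check_size)  # 0/1 errors since last training
        self.max_error = max_error
        self.predict = None
        self.train_count = 0

    def fit_initial(self, samples):
        self.window.extend(samples)
        self._train()

    def _train(self):
        self.predict = self.train_fn(list(self.window))
        self.recent_errors.clear()
        self.train_count += 1

    def observe(self, x, y):
        """Score one labeled app, update the window, retrain if drift is suspected."""
        self.recent_errors.append(int(self.predict(x) != y))
        self.window.append((x, y))
        if len(self.recent_errors) == self.recent_errors.maxlen:
            error_rate = sum(self.recent_errors) / len(self.recent_errors)
            if error_rate > self.max_error:
                self._train()
                return True
        return False
```

Compared with retraining on a fixed schedule, this design spends training effort only when the monitored error rate suggests the data distribution has changed, and the bounded window keeps each retraining run small.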
Resilient and Scalable Android Malware Fingerprinting and Detection
Malicious software (Malware) proliferation reaches hundreds of thousands daily. The manual analysis of such a large volume of malware is daunting and time-consuming. The diversity of targeted systems in terms of architecture and platforms compounds the challenges of Android malware detection and malware in general. This highlights the need to design and implement new scalable and robust methods, techniques, and tools to detect Android malware. In this thesis, we develop a malware fingerprinting framework to cover accurate Android malware detection and family attribution. In this context, we emphasize the following: (i) the scalability over a large malware corpus; (ii) the resiliency to common obfuscation techniques; (iii) the portability over different platforms and architectures.
In the context of bulk and offline detection on the laboratory/vendor level: First, we propose an approximate fingerprinting technique for Android packaging that captures the underlying static structure of the Android apps. We also propose a malware clustering framework on top of this fingerprinting technique to perform unsupervised malware detection and grouping by building and partitioning a similarity network of malicious apps. Second, we propose an approximate fingerprinting technique for Android malware's behavior reports generated using dynamic analyses leveraging natural language processing techniques. Based on this fingerprinting technique, we propose a portable malware detection and family threat attribution framework employing supervised machine learning techniques. Third, we design an automatic framework to produce intelligence about the underlying malicious cyber-infrastructures of Android malware. We leverage graph analysis techniques to generate relevant, actionable, and granular intelligence that can be used to identify the threat effects induced by malicious Internet activity associated to Android malicious apps.
In the context of single-app, online detection on the mobile device level, we further propose the following: Fourth, we design a portable and effective Android malware detection system that is suitable for deployment on mobile and resource-constrained devices, using machine learning classification on raw method call sequences. Fifth, we elaborate a framework for Android malware detection that is resilient to common code obfuscation techniques and adaptive to operating system and malware changes over time, using natural language processing and deep learning techniques.
We also evaluate the portability of the proposed techniques and methods beyond Android platform malware, as follows: Sixth, we leverage the previously elaborated techniques to build a framework for cross-platform ransomware fingerprinting relying on raw hybrid features in conjunction with advanced deep learning techniques.
Drebin: Effective and Explainable Detection of Android Malware in Your Pocket
Malicious applications pose a threat to the security of the Android platform. The growing amount and diversity of these applications render conventional defenses largely ineffective and thus Android smartphones often remain unprotected from novel malware. In this paper, we propose DREBIN, a lightweight method for detection of Android malware that enables identifying malicious applications directly on the smartphone. As the limited resources impede monitoring applications at run-time, DREBIN performs a broad static analysis, gathering as many features of an application as possible. These features are embedded in a joint vector space, such that typical patterns indicative for malware can be automatically identified and used for explaining the decisions of our method. In an evaluation with 123,453 applications and 5,560 malware samples DREBIN outperforms several related approaches and detects 94% of the malware with few false alarms, where the explanations provided for each detection reveal relevant properties of the detected malware. On five popular smartphones, the method requires 10 seconds for an analysis on average, rendering it suitable for checking downloaded applications directly on the device.
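The idea of embedding features in a joint vector space and using the learned patterns for explanation can be sketched as follows, assuming a linear model over binary feature indicators. The feature strings and weights used with it are made up for illustration, and the helper names `embed`, `score`, and `explain` are hypothetical rather than taken from the DREBIN implementation.

```python
def embed(app_features, vocabulary):
    """Map a set of feature strings to a binary indicator vector over the vocabulary."""
    return [1 if feature in app_features else 0 for feature in vocabulary]


def score(app_features, weights, bias=0.0):
    """Linear decision score: bias plus the weights of the features that are present."""
    return bias + sum(w for feature, w in weights.items() if feature in app_features)


def explain(app_features, weights, top_k=3):
    """Return the present features that push the score most strongly toward 'malware'."""
    present = [(f, w) for f, w in weights.items() if f in app_features and w > 0]
    return sorted(present, key=lambda fw: fw[1], reverse=True)[:top_k]
```

Because the score is a plain sum over the features present in an app, the features with the largest positive weights form a direct, human-readable explanation of a detection.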
Android malware detection using machine learning to mitigate adversarial evasion attacks
In the current digital era, smartphones have become indispensable. Over the past few years, the exponential growth of Android users has made this operating system (OS) a prime target for smartphone malware. Consequently, the arms race between Android security personnel and malware developers seems enduring. Considering Machine Learning (ML) as the core component, various techniques are proposed in the literature to counter Android malware; however, the problem of adversarial evasion attacks on ML-based malware classifiers is understated. ML-based techniques are vulnerable to adversarial evasion attacks. Malware authors constantly try to craft adversarial examples to elude existing malware detection systems. This research presents the fragility of ML-based Android malware classifiers in adversarial environments and proposes novel techniques to counter adversarial evasion attacks on ML-based Android malware classifiers.
First, we start our analysis by introducing the problem of Android malware detection in adversarial environments and provide a comprehensive overview of the domain. Second, we highlight the problem of malware clones in popular Android malware repositories. The malware clones in the datasets can potentially lead to biased results and computational overhead. Although many strategies are proposed in the literature to detect repackaged Android malware, these techniques require burdensome code inspection. Consequently, we employ a lightweight and novel strategy based on package-name reuse to identify repackaged Android malware and build a clones-free Android malware dataset. Furthermore, we investigate the impact of repackaged Android malware on various ML-based classifiers by training them on a clones-free training set and testing on a set of benign apps, non-repackaged malware and all the malware clones in the dataset. Although trained on a reduced training set, we achieved up to 98.7% F1 score. Third, we propose Cure-Droid, an Android malware classification model trained on hybrid features and optimized using a tree-based pipeline optimization technique (TPoT). Fourth, to present the fragility of the Cure-Droid model in adversarial environments, we formulate multiple adversarial evasion attacks to elude the model. Fifth, to counter adversarial evasion attacks on ML-based Android malware detectors, we propose CureDroid*, a novel and adversarially aware Android malware classification model. CureDroid* is based on an ensemble of ML-based models trained on distinct sets of features, where each model has the individual capability to detect Android malware. The CureDroid* model employs an ensemble of five ML-based models, where each model is selected and optimized using TPoT. Our experimental results demonstrate that CureDroid* achieves up to 99.2% accuracy in non-adversarial settings and can detect up to 30 fabricated input features in the best case.
Finally, we propose TrickDroid, a novel cumulative adversarial training framework based on Oracle and GAN-based adversarial data. Our experimental results present the efficacy of TrickDroid with up to 99.46% evasion detection.
Detection of Android Malware in the Internet of Things through the K-Nearest Neighbor Algorithm
Predicting Android malware attacks on devices using machine learning for recommender-systems-based IoT can be a challenging task. However, various machine-learning techniques can be used to achieve this goal. An internet-based framework is used to predict and recommend Android malware on IoT devices. As the prevalence of Android devices grows, malware authors create new viruses on a regular basis, posing a threat to the central system’s security and the privacy of the users. The suggested system uses static analysis to predict the malware in Android apps used by consumer devices. The trained system is used to predict and recommend malicious devices so that they can be blocked from transmitting data to the cloud server. By taking into account various machine-learning methods, feature selection is performed and the K-Nearest Neighbor (KNN) machine-learning model is proposed. Testing was carried out on more than 10,000 Android applications to check malicious nodes and recommend that the cloud server block them. The developed model considered four machine-learning algorithms in parallel, i.e., naive Bayes, decision tree, support vector machine, and K-Nearest Neighbor, with static analysis as a feature subset selection algorithm, and it achieved the highest prediction rate of 93% in predicting malware in real-world applications of consumer devices to minimize energy utilization. The experimental results show that KNN achieves 93%, 95%, 90%, and 92% accuracy, precision, recall, and F1-measure, respectively.
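For readers unfamiliar with the classifier and the metrics reported here, the following is a minimal sketch of K-Nearest Neighbor voting over binary static features, together with the standard accuracy, precision, recall, and F1 definitions. The Hamming distance, the function names, and the toy data are assumptions made for illustration; the abstract does not state which distance or features the authors used.

```python
from collections import Counter


def knn_predict(train, query, k=3):
    """Majority vote among the k training vectors closest to query (Hamming distance).

    train is a list of (binary_feature_vector, label) pairs.
    """
    by_distance = sorted(
        train, key=lambda item: sum(a != b for a, b in zip(item[0], query))
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall, f1_from(precision, recall)
```

As a consistency check, `f1_from(0.95, 0.90)` evaluates to roughly 0.924, which agrees with the 92% F1-measure reported alongside 95% precision and 90% recall.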
A Pre-Trained BERT Model for Android Applications
The automation of an increasingly large number of software engineering tasks
is becoming possible thanks to Machine Learning (ML). One foundational building
block in the application of ML to software artifacts is the representation of
these artifacts (e.g., source code or executable code) into a form that is
suitable for learning. Many studies have leveraged representation learning,
delegating to ML itself the job of automatically devising suitable
representations. Yet, in the context of Android problems, existing models are
either limited to coarse-grained whole-app level (e.g., apk2vec) or conducted
for one specific downstream task (e.g., smali2vec). Our work is part of a new
line of research that investigates effective, task-agnostic, and fine-grained
universal representations of bytecode to mitigate both of these two
limitations. Such representations aim to capture information relevant to
various low-level downstream tasks (e.g., at the class-level). We are inspired
by the field of Natural Language Processing, where the problem of universal
representation was addressed by building Universal Language Models, such as
BERT, whose goal is to capture abstract semantic information about sentences,
in a way that is reusable for a variety of tasks. We propose DexBERT, a
BERT-like Language Model dedicated to representing chunks of DEX bytecode, the
main binary format used in Android applications. We empirically assess whether
DexBERT is able to model the DEX language and evaluate the suitability of our
model in two distinct class-level software engineering tasks: Malicious Code
Localization and Defect Prediction. We also experiment with strategies to deal
with the problem of catering to apps having vastly different sizes, and we
demonstrate one example of using our technique to investigate what information
is relevant to a given task.
Android Malware Detection System using Genetic Programming
Nowadays, smartphones and other mobile devices are playing a significant role in the
way people engage in entertainment, communicate, network, work, and bank and shop
online. As the number of mobile phones sold has increased dramatically worldwide, so
have the security risks faced by the users, to a degree most do not realise. One of the
risks is the threat from mobile malware. In this research, we investigate how supervised
learning with evolutionary computation can be used to synthesise a system to detect
Android mobile phone attacks. The attacks include malware, ransomware and mobile
botnets. The datasets used in this research are publicly downloadable, available for use
with appropriate acknowledgement. The primary source is Drebin. We also used
ransomware and mobile botnet datasets from other Android mobile phone researchers.
The research in this thesis uses Genetic Programming (GP) to evolve programs to
distinguish malicious and non-malicious applications in Android mobile datasets. It also
demonstrates the use of GP and Multi-Objective Evolutionary Algorithms (MOEAs)
together to explore functional (detection rate) and non-functional (execution time and
power consumption) trade-offs. Our results show that malicious and non-malicious
applications can be distinguished effectively using only the permissions held by
applications recorded in the application's Android Package (APK). Such a minimalist
source of features can serve as the basis for highly efficient Android malware detection.
Non-functional trade-offs are also highlighted.
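The functional versus non-functional trade-off mentioned in this abstract can be pictured as a Pareto front over candidate detectors. The sketch below is a generic non-dominated filter, not the MOEA used in the thesis; each candidate is a hypothetical `(detection_rate, cost)` pair, where cost could stand for execution time or power consumption.

```python
def dominates(a, b):
    """True if candidate a is at least as good as b on both objectives and better on one.

    Candidates are (detection_rate, cost): higher rate is better, lower cost is better.
    """
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better


def pareto_front(candidates):
    """Keep only the candidates that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]
```

Every detector left on the front is a defensible choice: improving its detection rate would require accepting a higher cost, and vice versa.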
Cyber Security and Critical Infrastructures
This book contains the manuscripts that were accepted for publication in the MDPI Special Topic "Cyber Security and Critical Infrastructure" after a rigorous peer-review process. Authors from academia, government and industry contributed their innovative solutions, consistent with the interdisciplinary nature of cybersecurity. The book contains 16 articles: an editorial explaining current challenges, innovative solutions, real-world experiences including critical infrastructure, 15 original papers that present state-of-the-art innovative solutions to attacks on critical systems, and a review of cloud, edge computing, and fog's security and privacy issues.
Effiziente und erklärbare Erkennung von mobiler Schadsoftware mittels maschineller Lernmethoden
In recent years, mobile devices shipped with Google’s Android operating system
have become ubiquitous. Due to their popularity and the high concentration of
sensitive user data on these devices, however, they have also become a
profitable target of malware authors. As a result, thousands of new malware
instances targeting Android are found almost every day. Unfortunately, common
signature-based methods often fail to detect these applications, as these
methods cannot keep pace with the rapid development of new malware.
Consequently, there is an urgent need for new malware detection methods to
tackle this growing threat.
In this thesis, we address the problem by combining concepts of static analysis
and machine learning, such that mobile malware can be detected directly on the
mobile device with low run-time overhead. To this end, we first discuss our
analysis results of a sophisticated malware that uses an ultrasonic side
channel to spy on unwitting smartphone users. Based on the insights we gain
throughout this thesis, we gradually develop a method that allows detecting
Android malware in general. The resulting method performs a broad static
analysis, gathering a large number of features associated with an application.
These features are embedded in a joint vector space, where typical patterns
indicative of malware can be automatically identified and used for explaining
the decisions of our method. In addition to an evaluation of its overall
detection and run-time performance, we also examine the interpretability of the
underlying detection model and strengthen the classifier against realistic
evasion attacks.
In a large set of experiments, we show that the method clearly outperforms
several related approaches, including popular anti-virus scanners. In most
experiments, our approach detects more than 90% of all malicious samples in the
dataset at a low false positive rate of only 1%. Furthermore, even on older
devices, it offers a good run-time performance, and can output a decision along
with a proper explanation within a few seconds, despite the use of machine
learning techniques directly on the mobile device.
Overall, we find that the application of machine learning techniques is a
promising research direction to improve the security of mobile devices. While
these techniques alone cannot defeat the threat of mobile malware, they at
least raise the bar for malicious actors significantly, especially if combined
with existing techniques.
The prevalence of smartphones, particularly those running the Android operating
system, has increased sharply in recent years. Owing to their popularity,
however, these devices have also become a lucrative target for malware
developers, so that new malicious programs for Android are now found daily.
Although various solutions exist that are meant to identify malicious programs
on mobile devices as well, in practice they often do not offer sufficient
protection. This is mainly because these approaches are mostly signature-based
and can therefore reliably identify malicious programs only once the
corresponding detection signatures are available. However, it is becoming
increasingly difficult for antivirus vendors to provide the signatures needed
for detection in time. The development of new methods is therefore necessary to
better counter the growing threat of mobile malware.
This dissertation presents and thoroughly examines a method that combines
static code analysis techniques with machine learning methods to enable
reliable detection of mobile malware directly on the mobile device. To this
end, the method first analyzes mobile applications statically and extracts
specific features that allow an application to be mapped into a
high-dimensional vector space. In this vector space, machine learning methods
are then able to automatically find patterns for detecting malicious programs.
The patterns found can serve not only for detection but also to explain a
decision that has been made.
As part of an extensive evaluation, not only the detection performance and run
time of the presented method are examined, but the learned detection model is
also analyzed in detail. The robustness of the model against targeted attacks
is also examined and improved.
A series of experiments shows that the proposed method achieves better results
than comparable methods, including even some popular antivirus programs. In
most experiments, the method reliably detects malicious programs and achieves
detection rates of over 90% at a low false-positive rate of 1%.
Reconstructing cancer genomes from paired-end sequencing data
Background: A cancer genome is derived from the germline genome through a series of somatic mutations. Somatic structural variants - including duplications, deletions, inversions, translocations, and other rearrangements - result in a cancer genome that is a scrambling of intervals, or "blocks" of the germline genome sequence. We present an efficient algorithm for reconstructing the block organization of a cancer genome from paired-end DNA sequencing data.
Results: By aligning paired reads from a cancer genome - and a matched germline genome, if available - to the human reference genome, we derive: (i) a partition of the reference genome into intervals; (ii) adjacencies between these intervals in the cancer genome; (iii) an estimated copy number for each interval. We formulate the Copy Number and Adjacency Genome Reconstruction Problem of determining the cancer genome as a sequence of the derived intervals that is consistent with the measured adjacencies and copy numbers. We design an efficient algorithm, called Paired-end Reconstruction of Genome Organization (PREGO), to solve this problem by reducing it to an optimization problem on an interval-adjacency graph constructed from the data. The solution to the optimization problem results in an Eulerian graph, containing an alternating Eulerian tour that corresponds to a cancer genome that is consistent with the sequencing data. We apply our algorithm to five ovarian cancer genomes that were sequenced as part of The Cancer Genome Atlas. We identify numerous rearrangements, or structural variants, in these genomes, analyze reciprocal vs. non-reciprocal rearrangements, and identify rearrangements consistent with known mechanisms of duplication such as tandem duplications and breakage/fusion/bridge (B/F/B) cycles.
Conclusions: We demonstrate that PREGO efficiently identifies complex and biologically relevant rearrangements in cancer genome sequencing data. An implementation of the PREGO algorithm is available at http://compbio.cs.brown.edu/software/.
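To make the quantities in this abstract concrete, the sketch below goes in the forward direction only: given a cancer genome written as a sequence of signed reference intervals, it computes the copy number of each interval and the adjacencies between interval ends. This is not the PREGO algorithm, which solves the inverse problem. The signed-integer encoding and the `"L"`/`"R"` labels for interval ends are illustrative conventions assumed here.

```python
from collections import Counter


def copy_numbers(genome):
    """Count how often each interval occurs, ignoring orientation."""
    return Counter(abs(block) for block in genome)


def adjacencies(genome):
    """Count adjacencies between interval ends along a signed block sequence.

    A positive block is traversed left to right, a negative block right to left,
    so each adjacency joins the exit end of one block to the entry end of the next.
    """
    def exit_end(block):
        return (abs(block), "R" if block > 0 else "L")

    def entry_end(block):
        return (abs(block), "L" if block > 0 else "R")

    return Counter(
        frozenset([exit_end(a), entry_end(b)]) for a, b in zip(genome, genome[1:])
    )


def novel_adjacencies(cancer, reference):
    """Adjacencies present in the cancer genome but absent from the reference order."""
    reference_pairs = set(adjacencies(reference))
    return [pair for pair in adjacencies(cancer) if pair not in reference_pairs]
```

For a toy genome such as `[1, 2, -3, 2, 4]` against the reference order `[1, 2, 3, 4]`, interval 2 has copy number 2 and the inverted interval 3 introduces adjacencies that the reference does not contain; a reconstruction method has to recover a block sequence consistent with exactly this kind of evidence.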