Malytics: A Malware Detection Scheme
An important problem of cyber-security is malware analysis. Besides good
precision and recognition rate, a malware detection scheme needs to be able to
generalize well for novel malware families (a.k.a. zero-day attacks). It is
important that the system does not require excessive computation, particularly
for deployment on mobile devices. In this paper, we propose a novel scheme
to detect malware which we call Malytics. It is not dependent on any particular
tool or operating system. It extracts static features of any given binary file
to distinguish malware from benign. Malytics consists of three stages: feature
extraction, similarity measurement and classification. The three phases are
implemented by a neural network with two hidden layers and an output layer. We
show that feature extraction, which is performed by tf-simhashing, is equivalent to
the first layer of a particular neural network. We evaluate Malytics'
performance on both Android and Windows platforms. Malytics outperforms a wide
range of learning-based techniques and also individual state-of-the-art models
on both platforms. We also show Malytics is resilient and robust in addressing
zero-day malware samples. The F1-score of Malytics is 97.21% and 99.45% on
Android dex files and Windows PE files, respectively, in the applied datasets.
The speed and efficiency of Malytics are also evaluated.
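
The abstract does not spell out how tf-simhashing works, so the following is only a minimal sketch of a generic term-frequency-weighted SimHash over byte n-grams. The names `byte_ngrams`, `tf_simhash`, and `hamming`, the n-gram length, the 64-bit width, and the choice of BLAKE2b are all assumptions made for illustration, not details taken from Malytics.

```python
import hashlib
from collections import Counter


def byte_ngrams(data: bytes, n: int = 4):
    """Yield every overlapping n-byte window of the input (hypothetical tokenizer)."""
    for i in range(len(data) - n + 1):
        yield data[i:i + n]


def tf_simhash(data: bytes, n: int = 4, bits: int = 64) -> int:
    """Return a `bits`-wide SimHash fingerprint weighted by term frequency."""
    tf = Counter(byte_ngrams(data, n))  # term frequency of each n-gram
    acc = [0] * bits
    for gram, count in tf.items():
        # Hash each n-gram to a fixed-width integer (BLAKE2b is an arbitrary choice).
        digest = hashlib.blake2b(gram, digest_size=bits // 8).digest()
        h = int.from_bytes(digest, "big")
        for b in range(bits):
            # Add the term frequency where the hash bit is 1, subtract it otherwise.
            acc[b] += count if (h >> b) & 1 else -count
    # Keep only the sign of each accumulator as one fingerprint bit.
    return sum(1 << b for b in range(bits) if acc[b] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

In this sketch the accumulator is a sum of term frequencies with +1/-1 weights fixed by the hash, which is one way to read the abstract's remark that feature extraction behaves like the first layer of a neural network.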
Multiple Instance Learning for Malware Classification
This work addresses classification of unknown binaries executed in a sandbox by
modeling their interaction with system resources (files, mutexes, registry keys
and communication with servers over the network) and error messages provided by
the operating system, using a vocabulary-based method from the multiple instance
learning paradigm. It introduces similarities suitable for individual resource
types that, combined with an approximate clustering method, efficiently group
the system resources and define features directly from data. This approach
effectively removes randomization often employed by malware authors and
projects samples into a low-dimensional feature space suitable for common
classifiers. An extensive comparison to the state of the art on a large corpus
of binaries demonstrates that the proposed solution achieves superior results
using only a fraction of training samples. Moreover, it makes use of a source
of information different than most of the prior art, which increases the
diversity of tools detecting the malware, hence making detection evasion more
difficult.
Projecting "better than randomly": How to reduce the dimensionality of very large datasets in a way that outperforms random projections
For very large datasets, random projections (RP) have become the tool of
choice for dimensionality reduction. This is due to the computational
complexity of principal component analysis. However, the recent development of
randomized principal component analysis (RPCA) has opened up the possibility of
obtaining approximate principal components on very large datasets. In this
paper, we compare the performance of RPCA and RP in dimensionality reduction
for supervised learning. In Experiment 1, we study a malware classification task
on a dataset with over 10 million samples, almost 100,000 features, and over 25
billion non-zero values, with the goal of reducing the dimensionality to a
compressed representation of 5,000 features. In order to apply RPCA to this
dataset, we develop a new algorithm called large sample RPCA (LS-RPCA), which
extends the RPCA algorithm to work on datasets with arbitrarily many samples.
We find that classification performance is much higher when using LS-RPCA for
dimensionality reduction than when using random projections. In particular,
across a range of target dimensionalities, we find that using LS-RPCA reduces
classification error by between 37% and 54%. Experiment 2 generalizes the
phenomenon to multiple datasets, feature representations, and classifiers.
These findings have implications for a large number of research projects in
which random projections were used as a preprocessing step for dimensionality
reduction. As long as accuracy is at a premium and the target dimensionality is
sufficiently less than the numeric rank of the dataset, randomized PCA may be a
superior choice. Moreover, if the dataset has a large number of samples, then
LS-RPCA will provide a method for obtaining the approximate principal
components.
Comment: Originally published in IEEE DSAA in 2016; this post-print fixes a
rendering error of the += operator in Algorithm
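
For orientation, the random projection baseline discussed in this abstract can be sketched in a few lines. This is a generic Gaussian random projection, not the authors' code, and it does not implement RPCA or LS-RPCA; the function names, the seed, and the N(0, 1/k) scaling are assumptions chosen for illustration.

```python
import random


def random_projection_matrix(d: int, k: int, seed: int = 0):
    """Build a k x d matrix with entries drawn from N(0, 1/k) (hypothetical helper)."""
    rng = random.Random(seed)
    sigma = 1.0 / k ** 0.5  # standard deviation, so the variance is 1/k
    return [[rng.gauss(0.0, sigma) for _ in range(d)] for _ in range(k)]


def project(x, R):
    """Map a d-dimensional vector x to k dimensions by computing R @ x."""
    return [sum(r_j * x_j for r_j, x_j in zip(row, x)) for row in R]
```

The projection matrix is drawn without looking at the data, which is what makes the method cheap; PCA-style methods instead fit the projection to the dataset.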
Stimulation and Detection of Android Repackaged Malware with Active Learning
Repackaging is a technique that has been increasingly adopted by authors of
Android malware. The main problem facing the research community working on
devising techniques to detect this breed of malware is the lack of ground truth
that pinpoints the malicious segments grafted within benign apps. Without this
crucial knowledge, it is difficult to train reliable classifiers able to
effectively classify novel, out-of-sample repackaged malware. To circumvent
this problem, we argue that reliable classifiers can be trained to detect
repackaged malware, if they are allowed to request new, more accurate
representations of an app's behavior. This learning technique is referred to as
active learning.
In this paper, we propose the usage of active learning to train classifiers
able to cope with the ambiguous nature of repackaged malware. We implemented an
architecture, Aion, that connects the processes of stimulating and detecting
repackaged malware using a feedback loop depicting active learning. Our
evaluation of a sample implementation of Aion using two malware datasets
(Malgenome and Piggybacking) shows that active learning can outperform
conventional detection techniques and, hence, has great potential to detect
Android repackaged malware.
Attack and Defense of Dynamic Analysis-Based, Adversarial Neural Malware Classification Models
Recently, researchers have proposed using deep learning-based systems for
malware detection. Unfortunately, all deep learning classification systems are
vulnerable to adversarial attacks. Previous work has studied adversarial
attacks against static analysis-based malware classifiers which only classify
the content of the unknown file without execution. However, since the majority
of malware is either packed or encrypted, malware classification based on
static analysis often fails to detect these types of files. To overcome this
limitation, anti-malware companies typically perform dynamic analysis by
emulating each file in the anti-malware engine or performing in-depth scanning
in a virtual machine. These strategies allow the analysis of the malware after
unpacking or decryption. In this work, we study different strategies of
crafting adversarial samples for dynamic analysis. These strategies operate on
sparse, binary inputs in contrast to continuous inputs such as pixels in
images. We then study the effects of two previously proposed defensive
mechanisms against crafted adversarial samples: the distillation and
ensemble defenses. We also propose and evaluate the weight decay defense.
Experiments show that with these three defensive strategies, the number of
successfully crafted adversarial samples is reduced compared to a standard
baseline system without any defenses. In particular, the ensemble defense is
the most resilient to adversarial attacks. Importantly, none of the defenses
significantly reduce the classification accuracy for detecting malware.
Finally, we demonstrate that while adding additional hidden layers to neural
models does not significantly improve the malware classification accuracy, it
does significantly increase the classifier's robustness to adversarial attacks.
Malware triage for early identification of Advanced Persistent Threat activities
In the last decade, a new class of cyber-threats has emerged. This new
cybersecurity adversary is known as the "Advanced Persistent Threat"
(APT), a term that refers to various organizations that in recent years have
been in the spotlight due to multiple dangerous and effective attacks
targeting financial and political institutions, news media, embassies, critical
infrastructure, TV programs, etc. To identify APT-related malware early, a
semi-automatic approach to malware sample analysis is needed. In
our previous work, we introduced a "malware triage" step for a semi-automatic
malware analysis architecture. This step analyzes new incoming samples as fast
as possible and immediately dispatches, from among all the malware delivered
each day in cyberspace, the ones that are really worth further examination by
analysts. Our paper
focuses on malware developed by APTs, and we build our knowledge base, used in
the triage, on known APTs obtained from publicly available reports. To keep
the triage as fast as possible, we rely only on static malware features,
which can be extracted with negligible delay, and use machine learning
techniques for the identification. In this work, we move from multiclass
classification to a group of one-class classifiers, which simplifies training
and allows greater modularity. The results of the proposed framework show
high performance, reaching a precision of 100% and an accuracy over 95
A Neural Embeddings Approach for Detecting Mobile Counterfeit Apps
Counterfeit apps impersonate existing popular apps in attempts to misguide
users to install them for various reasons such as collecting personal
information, spreading malware, or simply to increase their advertisement
revenue. Many counterfeits can be identified once installed; however, even a
tech-savvy user may struggle to detect them before installation as app icons
and descriptions can be quite similar to the original app. To this end, this
paper proposes to use neural embeddings generated by state-of-the-art
convolutional neural networks (CNNs) to measure the similarity between images.
Our results show that for the problem of counterfeit detection a novel approach
of using style embeddings given by the Gram matrix of CNN filter responses
outperforms baseline methods such as content embeddings and SIFT features. We
show that further performance increases can be achieved by combining style
embeddings with content embeddings. We present an analysis of approximately 1.2
million apps from the Google Play Store and identify a set of potential
counterfeits for the top-1,000 apps. Under a conservative assumption, we were able
to find 139 apps that contain malware in a set of 6,880 apps that showed high
visual similarity to one of the top-1,000 apps in the Google Play Store.
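
The style embedding mentioned in this abstract is built from the Gram matrix of CNN filter responses. The sketch below shows only that Gram-matrix step on toy data; how the responses are obtained, which layer is used, and how the matrix is normalized are not given in the abstract, so the input layout (one flat list per channel) and all function names are assumptions.

```python
def gram_matrix(feature_maps):
    """feature_maps: list of C channels, each a flat list of H*W filter responses."""
    c = len(feature_maps)
    return [
        [sum(a * b for a, b in zip(feature_maps[i], feature_maps[j])) for j in range(c)]
        for i in range(c)
    ]


def style_embedding(feature_maps):
    """Flatten the upper triangle of the symmetric Gram matrix into a vector."""
    g = gram_matrix(feature_maps)
    return [g[i][j] for i in range(len(g)) for j in range(i, len(g))]


def cosine_similarity(u, v):
    """Cosine similarity between two embeddings (one possible similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)
```

Because each Gram entry sums over all spatial positions, the embedding records which filters fire together while discarding where in the image they fire.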
We Can Track You If You Take the Metro: Tracking Metro Riders Using Accelerometers on Smartphones
Motion sensors (e.g., accelerometers) on smartphones have been demonstrated
to be a powerful side channel for attackers to spy on users' inputs on
touchscreens. In this paper, we reveal another accelerometer-based attack
which is particularly serious: when a person takes the metro, a malicious
application on her smartphone can easily use accelerometer readings to trace her.
We first propose a basic attack that can automatically extract metro-related
data from a large amount of mixed accelerometer readings, and then use an
ensemble interval classifier built from supervised learning to infer the riding
intervals of the user. While this attack is very effective, the supervised
learning part requires the attacker to collect labeled training data for each
station interval, which is a significant amount of effort. To improve the
efficiency of our attack, we further propose a semi-supervised learning
approach, which only requires the attacker to collect labeled data for a very
small number of station intervals with obvious characteristics. We conduct real
experiments on a metro line in a major city. The results show that the
inference accuracy could reach 89% and 92% if the user takes the metro for 4
and 6 stations, respectively.
Adversarial Feature Selection against Evasion Attacks
Pattern recognition and machine learning techniques have been increasingly
adopted in adversarial settings such as spam, intrusion and malware detection,
although their security against well-crafted attacks that aim to evade
detection by manipulating data at test time has not yet been thoroughly
assessed. While previous work has been mainly focused on devising
adversary-aware classification algorithms to counter evasion attempts, only a few
authors have considered the impact of using reduced feature sets on classifier
security against the same attacks. An interesting, preliminary result is that
classifier security to evasion may be even worsened by the application of
feature selection. In this paper, we provide a more detailed investigation of
this aspect, shedding some light on the security properties of feature
selection against evasion attacks. Inspired by previous work on adversary-aware
classifiers, we propose a novel adversary-aware feature selection model that
can improve classifier security against evasion attacks, by incorporating
specific assumptions on the adversary's data manipulation strategy. We focus on
an efficient, wrapper-based implementation of our approach, and experimentally
validate its soundness on different application examples, including spam and
malware detection.
eXpose: A Character-Level Convolutional Neural Network with Embeddings For Detecting Malicious URLs, File Paths and Registry Keys
For years, security machine learning research has promised to obviate the need
for signature-based detection by automatically learning to detect indicators of
attack. Unfortunately, this vision hasn't come to fruition: in fact, developing
and maintaining today's security machine learning systems can require
engineering resources that are comparable to those of signature-based detection
systems, due in part to the need to develop and continuously tune the
"features" these machine learning systems look at as attacks evolve. Deep
learning, a subfield of machine learning, promises to change this by operating
on raw input signals and automating the process of feature design and
extraction. In this paper we propose the eXpose neural network, which uses a
deep learning approach we have developed to take generic, raw short character
strings as input (a common case for security inputs, which include artifacts
like potentially malicious URLs, file paths, named pipes, named mutexes, and
registry keys), and learns to simultaneously extract features and classify
using character-level embeddings and a convolutional neural network. In addition
to completely automating the feature design and extraction process, eXpose
outperforms baselines based on manual feature extraction on all of the intrusion
detection problems we tested it on, yielding a 5%-10% detection rate gain at
a 0.1% false positive rate compared to these baselines.
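
To illustrate what "raw short character strings as input" can look like in practice, here is a minimal sketch of the encoding step that typically precedes a character-level embedding layer. It is not the eXpose implementation: the vocabulary (`string.printable`), the reserved ids, the default `max_len` of 200, and the `encode` function are all assumptions made for illustration.

```python
import string

# Hypothetical vocabulary: Python's 100 printable ASCII characters.
ALPHABET = string.printable
PAD_ID, UNKNOWN_ID = 0, 1  # reserved ids for padding and out-of-vocabulary characters
CHAR_TO_ID = {ch: i + 2 for i, ch in enumerate(ALPHABET)}


def encode(text: str, max_len: int = 200) -> list:
    """Map a string to a fixed-length list of integer character ids."""
    # Truncate long inputs and map unseen characters to UNKNOWN_ID.
    ids = [CHAR_TO_ID.get(ch, UNKNOWN_ID) for ch in text[:max_len]]
    # Right-pad short inputs so every example has the same length.
    return ids + [PAD_ID] * (max_len - len(ids))
```

Each id would then index a row of a learned embedding table, and the resulting sequence of vectors is what a convolutional layer consumes; no hand-written features about URLs, paths, or registry keys are needed at this stage.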