10 research outputs found
Lempel-Ziv Networks
Sequence processing has long been a central area of machine learning
research. Recurrent neural nets have been successful in processing sequences
for a number of tasks; however, they are known to be both ineffective and
computationally expensive when applied to very long sequences.
Compression-based methods have demonstrated more robustness when processing
such sequences -- in particular, an approach pairing the Lempel-Ziv Jaccard
Distance (LZJD) with the k-Nearest Neighbor algorithm has shown promise on long
sequence problems (up to steps) involving malware
classification. Unfortunately, use of LZJD is limited to discrete domains. To
extend the benefits of LZJD to a continuous domain, we investigate the
effectiveness of a deep-learning analog of the algorithm, the Lempel-Ziv
Network. While we achieve successful proof of concept, we are unable to improve
meaningfully on the performance of a standard LSTM across a variety of datasets
and sequence processing tasks. In addition to presenting this negative result,
our work highlights the problem of sub-par baseline tuning in newer research
areas.
Comment: I Can't Believe It's Not Better Workshop at NeurIPS 202
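The discrete baseline mentioned in this abstract (LZJD paired with a nearest-neighbour classifier) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes an LZ78-style parse to build the phrase set, compares full sets rather than hashed sketches of them, and the names `lz_set`, `lzjd`, and `nearest_label` are hypothetical.

```python
def lz_set(data: bytes) -> set:
    """Collect the distinct phrases of an LZ78-style parse of `data`."""
    phrases = set()
    start = 0
    for end in range(1, len(data) + 1):
        chunk = data[start:end]
        if chunk not in phrases:
            # First time this phrase is seen: record it and start a new phrase.
            phrases.add(chunk)
            start = end
    return phrases


def lzjd(a: bytes, b: bytes) -> float:
    """Jaccard distance between the Lempel-Ziv phrase sets of two byte strings."""
    sa, sb = lz_set(a), lz_set(b)
    union = sa | sb
    if not union:
        return 0.0  # two empty inputs are treated as identical
    return 1.0 - len(sa & sb) / len(union)


def nearest_label(query: bytes, labelled):
    """1-nearest-neighbour prediction over (sample, label) pairs (assumed format)."""
    return min(labelled, key=lambda item: lzjd(query, item[0]))[1]
```

Because the distance is defined on sets of discrete byte phrases, it has no direct counterpart for real-valued inputs, which is the limitation the abstract refers to.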
Engineering a Simplified 0-Bit Consistent Weighted Sampling
The Min-Hashing approach to sketching has become an important tool in data
analysis, information retrieval, and classification. To apply it to real-valued
datasets, the ICWS algorithm has become a seminal approach that is widely used,
and provides state-of-the-art performance for this problem space. However, ICWS
suffers a computational burden as the sketch size K increases. We develop a new
Simplified approach to the ICWS algorithm that enables us to obtain over 20x
speedups compared to the standard algorithm. The veracity of our approach is
demonstrated empirically on multiple datasets and scenarios, showing that our
new Simplified CWS obtains the same quality of results while being an order of
magnitude faster.
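The abstract does not describe the ICWS procedure itself, so the sketch below illustrates only the background it builds on: plain Min-Hashing for sets, and the exact weighted Jaccard similarity that consistent weighted sampling schemes are designed to estimate on real-valued data. The function names and the use of salted BLAKE2b as the hash family are assumptions made for illustration.

```python
import hashlib


def _hash(seed: int, item: str) -> int:
    """One member of a seeded hash family (salted BLAKE2b, an illustrative choice)."""
    digest = hashlib.blake2b(
        item.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_sketch(items, k: int):
    """Sketch of size k: the minimum hash value of a non-empty set under k functions."""
    return [min(_hash(seed, x) for x in items) for seed in range(k)]


def estimate_jaccard(sketch_a, sketch_b) -> float:
    """The fraction of matching sketch positions estimates the Jaccard similarity."""
    matches = sum(1 for a, b in zip(sketch_a, sketch_b) if a == b)
    return matches / len(sketch_a)


def weighted_jaccard(x: dict, y: dict) -> float:
    """Exact weighted Jaccard: sum of element-wise minima over sum of maxima."""
    keys = set(x) | set(y)
    num = sum(min(x.get(k, 0.0), y.get(k, 0.0)) for k in keys)
    den = sum(max(x.get(k, 0.0), y.get(k, 0.0)) for k in keys)
    return num / den if den else 1.0
```

Building the sketch costs one pass over the data per sketch entry, so the work grows with the sketch size K; this is the kind of cost the abstract says becomes a burden as K increases.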
Classifying Sequences of Extreme Length with Constant Memory Applied to Malware Detection
Recent works within machine learning have been tackling inputs of
ever-increasing size, with cybersecurity presenting sequence classification
problems of particularly extreme lengths. In the case of Windows executable
malware detection, inputs may exceed MB, which corresponds to a time
series with steps. To date, the closest approach to handling
such a task is MalConv, a convolutional neural network capable of processing up
to steps. The memory of CNNs has prevented
further application of CNNs to malware. In this work, we develop a new approach
to temporal max pooling that makes the required memory invariant to the
sequence length . This makes MalConv more memory efficient, and
up to faster to train on its original dataset, while removing the
input length restrictions to MalConv. We re-invest these gains into improving
the MalConv architecture by developing a new Global Channel Gating design,
giving us an attention mechanism capable of learning feature interactions
across 100 million time steps in an efficient manner, a capability lacked by
the original MalConv CNN. Our implementation can be found at
https://github.com/NeuromorphicComputationResearchProgram/MalConv2
Comment: To appear in AAAI 202
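The idea of pooling whose memory does not grow with sequence length can be illustrated with a running maximum over fixed-size chunks. This is only a sketch of the forward computation under simplifying assumptions: features are given as plain lists of per-step vectors, gradient handling is omitted, and `chunked` and `streaming_max_pool` are hypothetical names rather than functions from the linked repository.

```python
def chunked(steps, size):
    """Yield consecutive slices of at most `size` time steps."""
    for i in range(0, len(steps), size):
        yield steps[i:i + size]


def streaming_max_pool(chunks):
    """Global max over time, keeping only one running vector (one value per channel)."""
    running = None
    for chunk in chunks:
        for vec in chunk:
            if running is None:
                running = list(vec)
            else:
                # Element-wise maximum of the running vector and the new step.
                running = [max(r, v) for r, v in zip(running, vec)]
    return running
```

In a real setting the chunks would come from a file reader or a convolution applied window by window, so that only one chunk and the running vector are held in memory at a time.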
Semi-supervised Classification of Malware Families Under Extreme Class Imbalance via Hierarchical Non-Negative Matrix Factorization with Automatic Model Selection
Identification of the family to which a malware specimen belongs is essential
in understanding the behavior of the malware and developing mitigation
strategies. Solutions proposed by prior work, however, are often not
practicable due to the lack of realistic evaluation factors. These factors
include learning under class imbalance, the ability to identify new malware,
and the cost of production-quality labeled data. In practice, deployed models
face prominent, rare, and new malware families. At the same time, obtaining a
large quantity of up-to-date labeled malware for training a model can be
expensive. In this paper, we address these problems and propose a novel
hierarchical semi-supervised algorithm, which we call the HNMFk Classifier,
that can be used in the early stages of the malware family labeling process.
Our method is based on non-negative matrix factorization with automatic model
selection, that is, with an estimation of the number of clusters. With HNMFk
Classifier, we exploit the hierarchical structure of the malware data together
with a semi-supervised setup, which enables us to classify malware families
under conditions of extreme class imbalance. Our solution can make
abstaining predictions (a rejection option), which yields promising results in
the identification of novel malware families and helps with maintaining the
performance of the model when a low quantity of labeled data is used. We
perform bulk classification of nearly 2,900 malware families, both rare and
prominent, through static analysis, using nearly 388,000 samples from the
EMBER-2018 corpus. In our experiments, we surpass both supervised and
semi-supervised baseline models with an F1 score of 0.80.
Comment: Accepted at ACM TOP
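The rejection option mentioned in this abstract can be pictured with a simple confidence threshold. The sketch below is a generic illustration and not the HNMFk Classifier: the score dictionary, the `min_confidence` parameter, and the function name are all assumptions.

```python
def predict_or_abstain(scores: dict, min_confidence: float = 0.6):
    """Return the best-scoring family, or None (abstain) when confidence is low.

    `scores` maps a family label to a non-negative score (assumed format).
    """
    if not scores:
        return None
    family, best = max(scores.items(), key=lambda kv: kv[1])
    total = sum(scores.values())
    if total <= 0 or best / total < min_confidence:
        return None  # abstain: the sample may belong to a novel or rare family
    return family
```

Samples on which the model abstains can then be routed to an analyst or held back as candidates for a new family.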
Malware Resistant Data Protection in Hyper-connected Networks: A survey
Data protection is the process of securing sensitive information from being
corrupted, compromised, or lost. A hyperconnected network, on the other hand,
is a computer networking trend in which communication occurs over a network.
However, what about malware? Malware is malicious software meant to penetrate
private data, threaten a computer system, or gain unauthorised network access
without the user's consent. Due to the increasing applications of computers and
dependency on electronically saved private data, malware attacks on sensitive
information have become a dangerous issue for individuals and organizations
across the world. Hence, malware defense is critical for keeping our computer
systems and data protected. Many recent survey articles have focused on either
malware detection systems or individual attack strategies. To the best
of our knowledge, no survey paper covers malware attack patterns and
defense strategies together. This survey aims to address
this gap by combining diverse malicious attack patterns and machine learning
(ML) based detection models for modern and sophisticated malware. In doing so,
we focus on the taxonomy of malware attack patterns based on four fundamental
dimensions: the primary goal of the attack, method of attack, targeted exposure
and execution process, and types of malware that perform each attack. Detailed
information on malware analysis approaches is also investigated. In addition,
existing malware detection techniques employing feature extraction and ML
algorithms are discussed extensively. Finally, we discuss research
difficulties and unsolved problems, including future research directions.
Comment: 30 pages, 9 figures, 7 tables, nowhere submitted yet
Survey of Machine Learning Techniques for Malware Analysis
Coping with malware is getting more and more challenging, given its
relentless growth in complexity and volume. One of the most common approaches
in the literature is to use machine learning techniques to automatically learn
models and patterns behind such complexity, and to develop technologies for
keeping pace with the speed of development of novel malware. This survey aims
to provide an overview of the way machine learning has been used so far in
the context of malware analysis. We systematize surveyed papers according to
their objectives (i.e., the expected output, what the analysis aims to achieve), what
information about malware they specifically use (i.e., the features), and what
machine learning techniques they employ (i.e., what algorithm is used to
process the input and produce the output). We also outline a number of problems
concerning the datasets used in considered works, and finally introduce the
novel concept of malware analysis economics, regarding the study of existing
tradeoffs among key metrics, such as analysis accuracy and economic costs.
Effizientes Maschinelles Lernen für die Angriffserkennung
Detecting and fending off attacks on computer systems is an enduring
problem in computer security. In light of a plethora of different
threats and the growing automation used by attackers, we are in urgent
need of more advanced methods for attack detection.
In this thesis, we address the necessity of advanced attack detection
and develop methods to detect attacks using machine learning to
establish a higher degree of automation for reactive security. Machine
learning is data-driven and not void of bias. Effective application of
machine learning to attack detection thus requires periodic
retraining over time. However, the training complexity of
many learning-based approaches is substantial. We show that with the
right data representation, efficient algorithms for mining substring
statistics, and implementations based on probabilistic data structures,
training the underlying model can be achieved in linear time.
In two different scenarios, we demonstrate the effectiveness of
so-called language models that allow us to generically portray the content
and structure of attacks: on the one hand, we learn the malicious
behavior of Flash-based malware using classification, and on the other
hand, we detect intrusions by learning normality in industrial control
networks using anomaly detection. With a data throughput of up to
580 Mbit/s during training, we not only meet our expectations with
respect to runtime but also outperform related approaches by up to an
order of magnitude in detection performance. The same techniques that
facilitate learning in the previous scenarios can also be used for
revealing malicious content, embedded in passive file formats, such as
Microsoft Office documents. As a further showcase, we additionally
develop a method based on the efficient mining of substring statistics
that is able to break obfuscations irrespective of the used key length,
with up to 25 Mbit/s and thus, succeeds where related approaches fail.
These methods significantly improve detection performance and enable
operation in linear time. In doing so, we counteract the trend of
compensating increasing runtime requirements with resources. While the
results are promising and the approaches provide urgently needed
automation, they cannot and are not intended to replace human experts or
traditional approaches, but are designed to assist and complement them.
Detecting and fending off attacks on end users and networks has been a
persistent problem in computer security for many years. Given the large
number of different attack vectors and the increasing automation of
attacks, modern methods for attack detection are urgently needed.
In this doctoral thesis, approaches are developed to detect attacks
reliably and also efficiently using machine learning methods. They
counter the automation of attacks with a correspondingly high degree of
automation of defensive measures. Training such methods, however, is
computationally expensive and is carried out on very large amounts of
data. Runtime-efficient learning methods are therefore crucial. We show
that, by using efficient algorithms for the statistical analysis of
strings and implementations based on probabilistic data structures,
learning effective attack detection is possible even in linear time.
Using two different use cases, we demonstrate the effectiveness of
models based on the extraction of so-called n-grams: on the one hand, we
consider the detection of Flash-based malicious code using
classification methods, and on the other hand, the detection of attacks
on industrial networks or SCADA systems using anomaly detection. In
doing so, we achieve a data throughput of up to 580 Mbit/s during the
training of these models and at the same time clearly exceed the
detection performance of other approaches. The same techniques that
enable these learning approaches can also be used to detect malicious
code that is embedded in other file formats and obfuscated with simple
encryption. To this end, we develop a method that breaks simple
encryption based on the statistical evaluation of strings. The developed
approach works independently of the key length used, with a data
throughput of up to 25 Mbit/s, and thus enables successful deobfuscation
in cases where other approaches fail.
The results achieved with regard to runtime efficiency and detection
performance are promising. The presented methods enable the urgently
needed automation of defensive measures, but are not intended to replace
experts or established methods; rather, they are meant to support and
complement them.
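The combination of substring (n-gram) statistics and probabilistic data structures described in this thesis abstract can be pictured with a Bloom filter of byte n-grams used for anomaly detection. This is a rough sketch under stated assumptions, not the thesis's method: the filter size, the number of hash functions, the n-gram length, and all names (`BloomFilter`, `train`, `anomaly_score`) are illustrative choices.

```python
import hashlib


class BloomFilter:
    """Fixed-size probabilistic set: no false negatives, rare false positives."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: bytes):
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item, digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item: bytes) -> None:
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item: bytes) -> bool:
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))


def ngrams(data: bytes, n: int):
    """All contiguous byte substrings of length n."""
    return (data[i:i + n] for i in range(len(data) - n + 1))


def train(samples, n: int = 3) -> BloomFilter:
    """Learn 'normality' by recording every n-gram seen in benign samples."""
    model = BloomFilter()
    for sample in samples:
        for gram in ngrams(sample, n):
            model.add(gram)
    return model


def anomaly_score(model: BloomFilter, data: bytes, n: int = 3) -> float:
    """Fraction of n-grams in `data` that were never seen during training."""
    grams = list(ngrams(data, n))
    if not grams:
        return 0.0
    return sum(1 for g in grams if g not in model) / len(grams)
```

Each n-gram is inserted or looked up with a constant number of hash evaluations, so training and scoring take time linear in the input length, which matches the linear-time behaviour the abstract emphasises.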