Boosted Hidden Markov Models for Malware Detection
Digital security is an important issue today, and efficient malware detection is at the forefront of research into building secure digital systems. As with many other fields, malware detection research has seen a dramatic increase in the application of machine learning algorithms. One machine learning technique that has found widespread application in the field of pattern matching and malware detection is hidden Markov models (HMMs). Since HMM training is a hill climb technique, we can often significantly improve a model by training multiple times with different initial values. In this research, we compare boosted HMMs (using AdaBoost) to HMMs trained with multiple random restarts, in the context of malware detection. These techniques are applied to a variety of challenging malware datasets, and we analyze the results in terms of effectiveness and efficiency.
Hidden Markov Models with Random Restarts vs Boosting for Malware Detection
Effective and efficient malware detection is at the forefront of research into building secure digital systems. As with many other fields, malware detection research has seen a dramatic increase in the application of machine learning algorithms. One machine learning technique that has been used widely in the field of pattern matching in general, and malware detection in particular, is hidden Markov models (HMMs). HMM training is based on a hill climb, and hence we can often improve a model by training multiple times with different initial values. In this research, we compare boosted HMMs (using AdaBoost) to HMMs trained with multiple random restarts, in the context of malware detection. These techniques are applied to a variety of challenging malware datasets. We find that random restarts perform surprisingly well in comparison to boosting. Only in the most difficult "cold start" cases (where training data is severely limited) does boosting appear to offer sufficient improvement to justify its higher computational cost in the scoring phase.
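The random-restart strategy compared above can be sketched in a few lines. Here a toy one-dimensional objective stands in for the HMM log-likelihood surface that Baum-Welch climbs; the objective, step size, and restart counts are illustrative assumptions, not the papers' actual setup:

```python
import math
import random

def objective(x):
    # A hypothetical multimodal score with several local maxima,
    # standing in for an HMM's log-likelihood as a function of its
    # initial parameters.
    return math.sin(5 * x) * (1 - abs(x - 0.6))

def hill_climb(start, step=0.01, iters=1000):
    # Greedy local search: move to a neighbor whenever it scores higher.
    x, best = start, objective(start)
    for _ in range(iters):
        for cand in (x - step, x + step):
            if 0.0 <= cand <= 1.0 and objective(cand) > best:
                x, best = cand, objective(cand)
    return x, best

def train_with_restarts(n_restarts, seed=0):
    # Train from several random initial values and keep the best model.
    rng = random.Random(seed)
    results = [hill_climb(rng.random()) for _ in range(n_restarts)]
    return max(results, key=lambda r: r[1])

x1, s1 = train_with_restarts(1)
x25, s25 = train_with_restarts(25)
# More restarts can only match or improve the best score found.
assert s25 >= s1
```

Because each restart is independent, the extra cost is paid entirely at training time, whereas a boosted ensemble must score every sample against multiple models at detection time — the trade-off the abstract refers to.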
Machine learning classification for advanced malware detection
This introductory document discusses topics related to malware detection via the application
of machine learning algorithms. It is intended as a supplement to the published work
submitted (a complete list of which can be found in Table 1) and outlines the motivation
behind the experiments.
The document begins with the following sections:
• Section 2 presents a preliminary discussion of the research methodology employed.
• Section 3 presents the background analysis of malware detection in general, and the
use of machine learning.
• Section 4 provides a brief introduction of the most common machine learning
algorithms in current use.
The remaining sections present the main body of the experimental work, which leads to the
conclusions in Section 10.
• Section 5 analyzes different initialization strategies for machine learning models, with
a view to ensuring that the most effective training and testing strategy is employed.
Following this, a purely dynamic approach is proposed, which results in perfect
classification of the samples against benign files, and therefore provides a baseline
against which the performance of subsequent static approaches can be compared.
• Section 6 introduces the static-based tests, beginning with the challenging problem of
zero-day samples, i.e., malware samples for which not enough data has yet been
gathered to train the machine learning models.
• Section 7 describes the testing of several different approaches to static malware
detection. During these tests, the effectiveness of these algorithms is analyzed and
compared with other means of classification.
• Section 8 proposes and compares techniques to boost the detection accuracy by
combining the scores obtained from other detection algorithms, with a view to
improving static classification scores and thus reaching the perfect detection obtained
with dynamic features.
• Section 9 assesses the effectiveness of generic malware models, that is, models trained
on several different families. The experiments are intended to introduce a more
realistic scenario where a single, comprehensive machine learning model is used to
detect several families. This section demonstrates the difficulty of building a single
model that detects several malware families.
Word Embedding Techniques for Malware Classification
Word embeddings are often used in natural language processing as a means to quantify relationships between words. More generally, these same word embedding techniques can be used to quantify relationships between features. In this paper, we conduct a series of experiments that are designed to determine the effectiveness of word embedding in the context of malware classification. First, we conduct experiments where hidden Markov models (HMMs) are directly applied to opcode sequences. These results serve to establish a baseline for comparison with our subsequent word embedding experiments. We then experiment with word embedding vectors derived from HMMs, a technique that we refer to as HMM2Vec. In another set of experiments, we generate vector embeddings based on principal component analysis, which we refer to as PCA2Vec. And, for a third set of word embedding experiments, we consider the well-known neural network based technique, Word2Vec. In each of these word embedding experiments, we derive feature embeddings based on opcode sequences for malware samples from a variety of different families. We show that in most cases, we obtain improved classification accuracy using feature embeddings, as compared to our baseline HMM experiments. These results provide strong evidence that word embedding techniques can play a useful role in feature engineering within the field of malware analysis.
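The common pipeline behind all of the *2Vec variants — map each opcode sequence to a fixed-length vector, then compare vectors — can be sketched with a deliberately simplified embedding. Here a raw co-occurrence count vector stands in for the HMM2Vec/PCA2Vec/Word2Vec embeddings, and the opcode vocabulary and sequences are made up:

```python
import math
from collections import Counter

# Hypothetical opcode vocabulary; real experiments use the most
# frequent opcodes extracted from disassembled samples.
VOCAB = ["mov", "push", "pop", "call", "jmp", "add"]

def embed(opcodes, window=2):
    # Count how often each vocabulary opcode occurs near any position
    # in the sequence, yielding one fixed-length vector per sample.
    # This is a crude stand-in for a learned embedding.
    counts = Counter()
    for i in range(len(opcodes)):
        for j in range(max(0, i - window), min(len(opcodes), i + window + 1)):
            if j != i:
                counts[opcodes[j]] += 1
    return [counts[v] for v in VOCAB]

def cosine(u, v):
    # Cosine similarity, the usual way embedding vectors are compared.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Samples sharing opcode patterns (a hypothetical "family") embed
# closer together than a sample with different patterns.
a = embed(["push", "mov", "call", "pop", "push", "mov", "call", "pop"])
b = embed(["push", "mov", "call", "pop", "mov", "call"])
c = embed(["jmp", "add", "jmp", "add", "jmp", "add"])
assert cosine(a, b) > cosine(a, c)
```

The point of the paper's embeddings is that such vectors become the input features for a downstream classifier, replacing the raw opcode sequence.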
Fake malware opcodes generation using HMM and different GAN algorithms
Malware, or malicious software, is a program that is intended to harm systems. In the past decade, the number of malware attacks has grown and, more importantly, evolved. Many researchers have successfully integrated cutting edge machine learning techniques to combat this ever-present and growing threat to cyber and information security. One big challenge faced by many researchers is the lack of enough data to properly train machine learning models, and deep neural networks in particular. Generative modelling has proven to be very efficient at generating synthesized data that can match the actual data distribution.
In this project, we aim to generate malware samples as opcode sequences and attempt to differentiate between the fake and real samples. We use different Generative Adversarial Network (GAN) algorithms and Hidden Markov Models (HMMs) to generate fake samples.
Fake Malware Generation Using HMM and GAN
In the past decade, the number of malware attacks has grown considerably and, more importantly, evolved. Many researchers have successfully integrated state-of-the-art machine learning techniques to combat this ever-present and rising threat to information security. However, the lack of enough data to appropriately train these machine learning models is one big challenge that is still present. Generative modelling has proven to be very efficient at generating image-like synthesized data that can match the actual data distribution. In this paper, we aim to generate malware samples as opcode sequences and attempt to differentiate them from the real ones, with the goal of building fake malware data that can be used to effectively train machine learning models. We use and compare different Generative Adversarial Network (GAN) algorithms and Hidden Markov Models (HMMs) to generate such fake samples, obtaining promising results.
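The HMM side of this generation idea can be illustrated with a minimal sketch: learn opcode transition statistics from real sequences, then sample fake sequences from them. A plain first-order Markov chain stands in for a full HMM (no hidden states), and the training sequences are made up; the GAN variants would require a deep learning framework and are not shown:

```python
import random
from collections import defaultdict, Counter

def fit_transitions(sequences):
    # Count opcode bigrams: how often each opcode follows another.
    counts = defaultdict(Counter)
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    return counts

def generate(counts, start, length, seed=0):
    # Sample a fake opcode sequence by walking the learned transitions.
    rng = random.Random(seed)
    seq = [start]
    for _ in range(length - 1):
        nxt = counts[seq[-1]]
        if not nxt:          # no observed successor: stop early
            break
        ops, weights = zip(*nxt.items())
        seq.append(rng.choices(ops, weights=weights)[0])
    return seq

# Hypothetical "real" opcode sequences from disassembled samples.
real = [["push", "mov", "call", "pop"], ["push", "mov", "add", "pop"]]
trans = fit_transitions(real)
fake = generate(trans, "push", 8)
# Every generated bigram was observed in the real data.
assert all(b in trans[a] for a, b in zip(fake, fake[1:]))
```

A genuine HMM adds hidden states between the observed opcodes, letting it capture longer-range structure than this one-step chain can.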
Crowdfunding Non-fungible Tokens on the Blockchain
Non-fungible tokens (NFTs) have been used as a way of rewarding content creators. Artists publish their works on the blockchain as NFTs, which they can then sell. The buyer of an NFT then holds ownership of a unique digital asset, which can be resold in much the same way that real-world art collectors might trade paintings. However, while a great deal of effort has been spent on selling works of art on the blockchain, very little attention has been paid to using the blockchain as a means of fundraising to help finance the artist’s work in the first place. Additionally, while blockchains like Ethereum are ideal for smaller works of art, additional support is needed when the artwork is larger than is feasible to store on the blockchain. In this paper, we propose a fundraising mechanism that will help artists to gain financial support for their initiatives, and where the backers can receive a share of the profits in exchange for their support. We discuss our prototype implementation using the SpartanGold framework. We then discuss how this system could be expanded to support large NFTs with the 0Chain blockchain, and describe how we could provide support for ongoing storage of these NFTs.
Robustness of Image-Based Malware Analysis
In previous work, “gist descriptor” features extracted from images have been used in malware classification problems and have shown promising results. In this research, we determine whether gist descriptors are robust with respect to malware obfuscation techniques, as compared to Convolutional Neural Networks (CNNs) trained directly on malware images. Using the Python Image Library (PIL), we create images from malware executables and from malware that we obfuscate. We conduct experiments to compare classifying these images with a CNN as opposed to extracting the gist descriptor features from these images to use in classification. For the gist descriptors, we consider a variety of classification algorithms including k-nearest neighbors, random forest, support vector machine, and multi-layer perceptron. We find that gist descriptors are more robust than CNNs, with respect to the obfuscation techniques that we consider.
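The first step in image-based malware analysis — rendering an executable's raw bytes as a grayscale image — can be sketched with the standard library alone. PIL's Image.frombytes (as used in the paper) would turn the rows below into an actual image; the sample byte string here is a made-up stand-in for an executable:

```python
def bytes_to_image(data, width=16):
    # Pad to a whole number of rows, then slice the byte string into
    # width-sized rows; each byte (0-255) is one grayscale pixel.
    padded = data + bytes((-len(data)) % width)
    return [list(padded[i:i + width]) for i in range(0, len(padded), width)]

# Hypothetical file contents; real experiments read the executable's bytes.
sample = bytes(range(256))
img = bytes_to_image(sample, width=16)
assert len(img) == 16 and all(len(row) == 16 for row in img)
assert img[0][0] == 0 and img[15][15] == 255
```

In practice the row width is often chosen as a function of file size, so that samples of different sizes still produce images with comparable aspect ratios.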
Twitter Bots’ Detection with Benford’s Law and Machine Learning
Online Social Networks (OSNs) have grown exponentially in terms of active users and have now become an influential factor in the formation of public opinions. For this reason, the use of bots and botnets for spreading misinformation on OSNs has become a widespread concern. Identifying bots and botnets on Twitter can require complex statistical methods to score a profile based on multiple features. Benford’s Law, or the Law of Anomalous Numbers, states that, in any naturally occurring sequence of numbers, the First Significant Leading Digit (FSLD) frequencies follow a particular pattern: they are unevenly distributed and decreasing. This principle can be applied to the first-degree egocentric network of a Twitter profile to assess its conformity to the law and, thus, classify it as a bot profile or a normal profile. This paper focuses on leveraging Benford’s Law in combination with various Machine Learning (ML) classifiers to identify bot profiles on Twitter. In addition, a comparison with other statistical methods is produced to confirm our classification results.
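The conformity check described above can be sketched directly: compare the observed FSLD frequencies of a number sequence against Benford's expected distribution, P(first digit = d) = log10(1 + 1/d). The number sequences below are made up; a real test would use counts drawn from a profile's egocentric network (e.g., friend or follower counts):

```python
import math
from collections import Counter

def benford_expected(d):
    # Benford's Law: P(first digit = d) = log10(1 + 1/d), d in 1..9.
    return math.log10(1 + 1 / d)

def fsld(n):
    # First significant leading digit of a positive number.
    return int(str(abs(n)).lstrip("0.")[0])

def fsld_distance(numbers):
    # Total absolute deviation of observed FSLD frequencies from
    # Benford's expected frequencies (one simple conformity measure).
    counts = Counter(fsld(n) for n in numbers)
    total = len(numbers)
    return sum(abs(counts[d] / total - benford_expected(d)) for d in range(1, 10))

# A geometric series spreads leading digits logarithmically, so it
# conforms to Benford's Law far better than a narrow uniform range.
natural = [1.05 ** k for k in range(200)]
uniform = list(range(100, 300))   # leading digits only 1 and 2
assert fsld_distance(natural) < fsld_distance(uniform)
```

The paper's approach turns such a conformity score into one feature among several for the ML classifiers; a low distance suggests organic activity, while strong deviation flags the profile for closer inspection.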
A Blockchain-Based Retribution Mechanism for Collaborative Intrusion Detection
A collaborative intrusion detection approach uses detection signatures shared among collaborating participants to facilitate coordinated defense. In the context of collaborative intrusion detection systems (CIDS), however, there has been no research focusing on the efficiency of the shared detection signatures. An inefficient detection signature costs not only IDS resources but also burdens the processing of the peer-to-peer (P2P) network. In this paper, we therefore propose a blockchain-based retribution mechanism, which aims to incentivize the participants to contribute to verifying the efficiency of detection signatures according to a distributed consensus. We implement a prototype using the Ethereum blockchain, which instantiates a token-based retribution mechanism and a smart contract-enabled, voting-based distributed consensus. We conduct a number of experiments built on the prototype, and the experimental results demonstrate the effectiveness of the proposed approach.