1,943 research outputs found
Hidden Markov Models for Malware Classification
Malware is software developed with malicious intent, and it is a rapidly evolving threat to the computing community. Although many techniques for malware classification have been proposed, there is still a lack of a comprehensible and useful taxonomy for classifying malware samples. Previous research has shown that hidden Markov model (HMM) analysis is useful for detecting certain types of malware. In this research, we consider the related problem of malware classification based on HMMs. We train HMMs for a variety of malware generators and a variety of compilers. More than 9000 malware samples are then scored against each of these models, and the samples are separated into clusters based on the resulting scores. We analyze the clusters and show that they correspond to certain characteristics of malware. These results indicate that HMMs are an effective tool for the challenging task of automatically classifying malware.
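As a rough illustration of what scoring a sample against each model can look like, the sketch below computes a length-normalised log-likelihood with the scaled forward algorithm. It assumes opcodes have already been mapped to integer symbol indices; the two toy models, their parameter values, and the names `log_likelihood`, `score`, `MODEL_X`, and `MODEL_Y` are hypothetical and not taken from the paper.

```python
import math

def log_likelihood(obs, pi, A, B):
    """Scaled forward algorithm: log P(obs | model) for a discrete HMM.

    pi[i]   : initial probability of hidden state i
    A[i][j] : probability of moving from state i to state j
    B[i][o] : probability that state i emits symbol o
    """
    n = len(pi)
    # alpha_0(i) = pi_i * b_i(O_0)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    log_prob = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [
                sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                for j in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # the model cannot produce this sequence
        # Rescale to avoid underflow and accumulate the log of the scale factor.
        alpha = [a / scale for a in alpha]
        log_prob += math.log(scale)
    return log_prob

def score(obs, model):
    """Log-likelihood per observation, so sequences of different length compare."""
    pi, A, B = model
    return log_likelihood(obs, pi, A, B) / len(obs)

# Two toy models over a 2-symbol alphabet (hypothetical values).
MODEL_X = ([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], [[0.9, 0.1], [0.2, 0.8]])
MODEL_Y = ([0.5, 0.5], [[0.5, 0.5], [0.5, 0.5]], [[0.5, 0.5], [0.5, 0.5]])

sample = [0, 0, 0, 0, 0, 0, 0, 0]
scores = [score(sample, MODEL_X), score(sample, MODEL_Y)]
```

Each sample then yields one score per trained model, and the resulting score vectors are the kind of input a clustering step could work on.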
Classification of Malware Models
Automatically classifying similar malware families is a challenging problem. In this research, we attempt to classify malware families by applying machine learning to machine learning models. Specifically, we train hidden Markov models (HMM) for each malware family in our dataset. The resulting models are then compared in two ways. First, we treat the HMM matrices as images and experiment with convolutional neural networks (CNN) for image classification. Second, we apply support vector machines (SVM) to classify the HMMs. We analyze the results and discuss the relative advantages and disadvantages of each approach.
Malware Classification Based on Hidden Markov Model and Word2Vec Features
Malware classification is an important and challenging problem in information security. Modern malware classification techniques rely on machine learning models that can be trained on a wide variety of features, including opcode sequences, API calls, and byte n-grams, among many others. In this research, we implement hybrid machine learning techniques, where we train hidden Markov models (HMM) and compute Word2Vec encodings based on opcode sequences. The resulting trained HMMs and Word2Vec embedding vectors are then used as features for classification algorithms. Specifically, we consider support vector machine (SVM), k-nearest neighbor (k-NN), random forest (RF), and deep neural network (DNN) classifiers. We conduct substantial experiments over a variety of malware families. Our results surpass those of comparable classification experiments.
Malware Classification with Word Embedding Features
Malware classification is an important and challenging problem in information
security. Modern malware classification techniques rely on machine learning
models that can be trained on features such as opcode sequences, API calls, and
byte n-grams, among many others. In this research, we consider opcode
features. We implement hybrid machine learning techniques, where we engineer
feature vectors by training hidden Markov models -- a technique that we refer
to as HMM2Vec -- and Word2Vec embeddings on these opcode sequences. The
resulting HMM2Vec and Word2Vec embedding vectors are then used as features for
classification algorithms. Specifically, we consider support vector machine
(SVM), k-nearest neighbor (k-NN), random forest (RF), and convolutional
neural network (CNN) classifiers. We conduct substantial experiments over a
variety of malware families. Our experiments extend well beyond any previous
work in this field.
Malware Classification with GMM-HMM Models
Discrete hidden Markov models (HMM) are often applied to malware detection
and classification problems. However, the continuous analog of discrete HMMs,
that is, Gaussian mixture model-HMMs (GMM-HMM), are rarely considered in the
field of cybersecurity. In this paper, we use GMM-HMMs for malware
classification and we compare our results to those obtained using discrete
HMMs. As features, we consider opcode sequences and entropy-based sequences.
For our opcode features, GMM-HMMs produce results that are comparable to those
obtained using discrete HMMs, whereas for our entropy-based features, GMM-HMMs
generally improve significantly on the classification results that we have
achieved with discrete HMMs.
Malware Classification with Gaussian Mixture Model-Hidden Markov Models
Discrete hidden Markov models (HMM) are often applied to malware detection and classification problems. However, the continuous analog of discrete HMMs, that is, Gaussian mixture model-HMMs (GMM-HMM), are rarely considered in the field of cybersecurity. In this study, we apply GMM-HMMs to the malware classification problem and we compare our results to those obtained using discrete HMMs. As features, we consider opcode sequences and entropy-based sequences. For our opcode features, GMM-HMMs produce results that are comparable to those obtained using discrete HMMs, whereas for our entropy-based features, GMM-HMMs generally improve on the classification results that we can attain with discrete HMMs.
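One plausible way to build an entropy-based sequence is to slide a window over the raw bytes of a file and record the Shannon entropy of each window. The sketch below does this; the window and step sizes, along with the function names, are assumptions for illustration, since the abstract does not say how the entropy features were computed.

```python
import math
from collections import Counter

def shannon_entropy(block):
    """Shannon entropy, in bits per byte, of a bytes-like block."""
    if not block:
        return 0.0
    total = len(block)
    counts = Counter(block)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def entropy_sequence(data, window=256, step=128):
    """Entropy of each sliding window over data (window and step are assumed values)."""
    last_start = max(len(data) - window, 0)
    return [
        shannon_entropy(data[start:start + window])
        for start in range(0, last_start + 1, step)
    ]

flat = bytes(256)          # 256 zero bytes: no uncertainty
mixed = bytes(range(256))  # every byte value exactly once: maximal uncertainty
seq = entropy_sequence(flat + mixed)
```

The values in such a sequence are real numbers between 0 and 8, which is the sort of continuous observation that a model with Gaussian mixture emissions can take directly, whereas a discrete HMM would need the values binned first.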
A Natural Language Processing Approach to Malware Classification
Many different machine learning and deep learning techniques have been
successfully employed for malware detection and classification. Examples of
popular learning techniques in the malware domain include Hidden Markov Models
(HMM), Random Forests (RF), Convolutional Neural Networks (CNN), Support Vector
Machines (SVM), and Recurrent Neural Networks (RNN) such as Long Short-Term
Memory (LSTM) networks. In this research, we consider a hybrid architecture,
where HMMs are trained on opcode sequences, and the resulting hidden states of
these trained HMMs are used as feature vectors in various classifiers. In this
context, extracting the HMM hidden state sequences can be viewed as a form of
feature engineering that is somewhat analogous to techniques that are commonly
employed in Natural Language Processing (NLP). We find that this NLP-based
approach outperforms other popular techniques on a challenging malware dataset,
with an HMM-Random Forest model yielding the best results.
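To make the idea of extracting hidden state sequences concrete, here is a minimal sketch that decodes the most likely state path of a discrete HMM with the Viterbi algorithm. The abstract does not say which decoding method was used, so Viterbi is an assumption here, and the toy parameters and the name `viterbi` are hypothetical.

```python
import math

def viterbi(obs, pi, A, B):
    """Most likely hidden state sequence for a discrete HMM, in the log domain."""
    def log(x):
        return math.log(x) if x > 0 else float("-inf")

    n = len(pi)
    # delta[i]: best log-probability of any path ending in state i so far.
    delta = [log(pi[i]) + log(B[i][obs[0]]) for i in range(n)]
    back = []  # back[t][j]: best predecessor of state j at step t + 1
    for o in obs[1:]:
        prev = delta
        pointers, delta = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: prev[i] + log(A[i][j]))
            pointers.append(best_i)
            delta.append(prev[best_i] + log(A[best_i][j]) + log(B[j][o]))
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n), key=lambda i: delta[i])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

# Toy 2-state model over a 2-symbol alphabet (hypothetical values).
pi = [0.5, 0.5]
A = [[0.9, 0.1], [0.1, 0.9]]
B = [[0.9, 0.1], [0.2, 0.8]]
states = viterbi([0, 0, 0, 1, 1, 1], pi, A, B)
```

The decoded state sequence has the same length as the opcode sequence, so it can be passed to a downstream classifier in place of, or alongside, the raw opcodes.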
A Comparison of Clustering Techniques for Malware Analysis
In this research, we apply clustering techniques to the malware detection problem. Our goal is to classify malware as part of a fully automated detection strategy. We compute clusters using the well-known k-means and EM clustering algorithms, with scores obtained from Hidden Markov Models (HMM). Previous work in this area used HMMs with k-means clustering for the same purpose. The current effort extends that work by applying EM clustering to detection and comparing it with k-means clustering.
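The clustering step can be sketched as follows, assuming each sample is represented by a tuple of HMM scores. This is a plain k-means with a deterministic farthest-first initialisation, which is a choice made here for reproducibility and not something stated in the abstract; the score values and the names `sq_dist` and `kmeans` are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=100):
    """Lloyd's k-means; returns (labels, centroids)."""
    # Farthest-first initialisation: start at the first point, then repeatedly
    # add the point farthest from all centroids chosen so far (an assumption).
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    labels = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids

# Hypothetical per-sample scores against two HMMs.
score_vectors = [
    (-1.0, -3.0), (-1.1, -2.9), (-0.9, -3.2),
    (-3.0, -1.0), (-3.1, -1.2), (-2.8, -0.9),
]
labels, centroids = kmeans(score_vectors, k=2)
```

EM clustering with a Gaussian mixture follows the same alternating pattern, but replaces the hard nearest-centroid assignment with per-cluster membership probabilities.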
Malware Classification using API Call Information and Word Embeddings
Malware classification is the process of classifying malware into recognizable categories and is an integral part of implementing computer security. In recent times, machine learning has emerged as one of the most suitable techniques to perform this task. Models can be trained on various malware features, such as opcodes and API calls, among many others, to deduce information that is helpful for classification.
Word embeddings are a key part of natural language processing and can be seen as a representation of text wherein similar words have closer representations. These embeddings can be used to discover a quantifiable measure of similarity between words. In this research, we conduct a series of experiments using hybrid machine learning techniques, where we generate word vectors and use them as features with various classifiers. We use Hidden Markov Models and Word2Vec to generate embeddings based on dynamic API call logs of the malware. Apart from these, we also use the popular BERT and ELMo models, which are known for generating contextualized embeddings. The resulting vectors are used as input for our classifiers, specifically Support Vector Machines (SVM), Random Forest (RF), k-Nearest Neighbors (kNN), and Convolutional Neural Networks (CNN). Using these, we conduct two distinct sets of experiments where we try to classify the family of malware as well as the category of malware. The results achieved here show that embeddings of API calls can be a useful tool in malware classification, especially in the case of families.
Malware Detection Using Dynamic Analysis
In this research, we explore the field of dynamic analysis, which has shown promising results in the field of malware detection. Here, we extract dynamic software birthmarks during malware execution and apply machine learning based detection techniques to the resulting feature set. Specifically, we consider Hidden Markov Models and Profile Hidden Markov Models. To determine the effectiveness of this dynamic analysis approach, we compare our detection results to the results obtained by using static analysis. We show that in some cases, significantly stronger results can be obtained using our dynamic approach.
- …