Boosted Hidden Markov Models for Malware Detection
Digital security is an important issue today, and efficient malware detection is at the forefront of research into building secure digital systems. As with many other fields, malware detection research has seen a dramatic increase in the application of machine learning algorithms. One machine learning technique that has found widespread application in the field of pattern matching and malware detection is hidden Markov models (HMMs). Since HMM training is a hill climb technique, we can often significantly improve a model by training multiple times with different initial values. In this research, we compare boosted HMMs (using AdaBoost) to HMMs trained with multiple random restarts, in the context of malware detection. These techniques are applied to a variety of challenging malware datasets, and we analyze the results in terms of effectiveness and efficiency.
Hidden Markov Models with Random Restarts vs Boosting for Malware Detection
Effective and efficient malware detection is at the forefront of research into building secure digital systems. As with many other fields, malware detection research has seen a dramatic increase in the application of machine learning algorithms. One machine learning technique that has been used widely in the field of pattern matching in general, and malware detection in particular, is hidden Markov models (HMMs). HMM training is based on a hill climb, and hence we can often improve a model by training multiple times with different initial values. In this research, we compare boosted HMMs (using AdaBoost) to HMMs trained with multiple random restarts, in the context of malware detection. These techniques are applied to a variety of challenging malware datasets. We find that random restarts perform surprisingly well in comparison to boosting. Only in the most difficult "cold start" cases (where training data is severely limited) does boosting appear to offer sufficient improvement to justify its higher computational cost in the scoring phase.
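The random-restart strategy compared above can be sketched in a few lines. Here a toy one-dimensional objective stands in for the HMM log-likelihood surface that Baum-Welch climbs; the objective, step size, and restart counts are illustrative assumptions, not the papers' actual setup:

```python
import math
import random

def objective(x):
    # A hypothetical multimodal score with several local maxima,
    # standing in for an HMM's log-likelihood as a function of its
    # initial parameters.
    return math.sin(5 * x) * (1 - abs(x - 0.6))

def hill_climb(start, step=0.01, iters=1000):
    # Greedy local search: move to a neighbor whenever it scores higher.
    x, best = start, objective(start)
    for _ in range(iters):
        for cand in (x - step, x + step):
            if 0.0 <= cand <= 1.0 and objective(cand) > best:
                x, best = cand, objective(cand)
    return x, best

def train_with_restarts(n_restarts, seed=0):
    # Train from several random initial values and keep the best model.
    rng = random.Random(seed)
    results = [hill_climb(rng.random()) for _ in range(n_restarts)]
    return max(results, key=lambda r: r[1])

x1, s1 = train_with_restarts(1)
x25, s25 = train_with_restarts(25)
# More restarts can only match or improve the best score found.
assert s25 >= s1
```

Because each restart is independent, the extra cost is paid entirely at training time, whereas a boosted ensemble must score every sample against multiple models at detection time — the trade-off the abstract refers to.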
Machine learning classification for advanced malware detection
This introductory document discusses topics related to malware detection via the application
of machine learning algorithms. It is intended as a supplement to the published work
submitted (a complete list of which can be found in Table 1) and outlines the motivation
behind the experiments.
The document begins with the following sections:
• Section 2 presents a preliminary discussion of the research methodology employed.
• Section 3 presents the background analysis of malware detection in general, and the
use of machine learning.
• Section 4 provides a brief introduction of the most common machine learning
algorithms in current use.
The remaining sections present the main body of the experimental work, which leads to the
conclusions in Section 10.
• Section 5 analyzes different initialization strategies for machine learning models, with
a view to ensuring that the most effective training and testing strategy is employed.
Following this, a purely dynamic approach is proposed, which results in perfect
classification of the samples against benign files, and therefore provides a baseline
against which the performance of subsequent static approaches can be compared.
• Section 6 introduces the static-based tests, beginning with the challenging problem of
zero-day samples, i.e., malware samples for which not enough data has yet been
gathered to train the machine learning models.
• Section 7 describes the testing of several different approaches to static malware
detection. During these tests, the effectiveness of these algorithms is analyzed and
compared with other means of classification.
• Section 8 proposes and compares techniques to boost the detection accuracy by
combining the scores obtained from other detection algorithms, with a view to
improving static classification scores and thus reaching the perfect detection obtained
with dynamic features.
• Section 9 assesses the effectiveness of generic malware models, that is, models trained
on several different families. The experiments are intended to introduce a more
realistic scenario where a single, comprehensive machine learning model is used to
detect several families. This section demonstrates the difficulty of building a single
model that detects several malware families.
Word Embedding Techniques for Malware Classification
Word embeddings are often used in natural language processing as a means to quantify relationships between words. More generally, these same word embedding techniques can be used to quantify relationships between features. In this paper, we conduct a series of experiments that are designed to determine the effectiveness of word embedding in the context of malware classification. First, we conduct experiments where hidden Markov models (HMMs) are directly applied to opcode sequences. These results serve to establish a baseline for comparison with our subsequent word embedding experiments. We then experiment with word embedding vectors derived from HMMs, a technique that we refer to as HMM2Vec. In another set of experiments, we generate vector embeddings based on principal component analysis, which we refer to as PCA2Vec. And, for a third set of word embedding experiments, we consider the well-known neural network based technique, Word2Vec. In each of these word embedding experiments, we derive feature embeddings based on opcode sequences for malware samples from a variety of different families. We show that in most cases, we obtain improved classification accuracy using feature embeddings, as compared to our baseline HMM experiments. These results provide strong evidence that word embedding techniques can play a useful role in feature engineering within the field of malware analysis.
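The common pipeline behind all of the *2Vec variants — map each opcode sequence to a fixed-length vector, then compare vectors — can be sketched with a deliberately simplified embedding. Here a raw co-occurrence count vector stands in for the HMM2Vec/PCA2Vec/Word2Vec embeddings, and the opcode vocabulary and sequences are made up:

```python
import math
from collections import Counter

# Hypothetical opcode vocabulary; real experiments use the most
# frequent opcodes extracted from disassembled samples.
VOCAB = ["mov", "push", "pop", "call", "jmp", "add"]

def embed(opcodes, window=2):
    # Count how often each vocabulary opcode occurs near any position
    # in the sequence, yielding one fixed-length vector per sample.
    # This is a crude stand-in for a learned embedding.
    counts = Counter()
    for i in range(len(opcodes)):
        for j in range(max(0, i - window), min(len(opcodes), i + window + 1)):
            if j != i:
                counts[opcodes[j]] += 1
    return [counts[v] for v in VOCAB]

def cosine(u, v):
    # Cosine similarity, the usual way embedding vectors are compared.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Samples sharing opcode patterns (a hypothetical "family") embed
# closer together than a sample with different patterns.
a = embed(["push", "mov", "call", "pop", "push", "mov", "call", "pop"])
b = embed(["push", "mov", "call", "pop", "mov", "call"])
c = embed(["jmp", "add", "jmp", "add", "jmp", "add"])
assert cosine(a, b) > cosine(a, c)
```

The point of the paper's embeddings is that such vectors become the input features for a downstream classifier, replacing the raw opcode sequence.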
Fake malware opcodes generation using HMM and different GAN algorithms
Malware, or malicious software, is a program that is intended to harm systems. In the past decade, the number of malware attacks has grown and, more importantly, evolved. Many researchers have successfully integrated cutting edge machine learning techniques to combat this ever-present and growing threat to cyber and information security. One big challenge faced by many researchers is the lack of enough data to properly train machine learning models, and deep neural networks in particular. Generative modelling has proven to be very efficient at generating synthesized data that can match the actual data distribution.
In this project, we aim to generate malware samples as opcode sequences and attempt to differentiate between the fake and real samples. We use different Generative Adversarial Network (GAN) algorithms and Hidden Markov Models (HMMs) to generate fake samples.
Fake Malware Generation Using HMM and GAN
In the past decade, the number of malware attacks has grown considerably and, more importantly, evolved. Many researchers have successfully integrated state-of-the-art machine learning techniques to combat this ever-present and rising threat to information security. However, the lack of enough data to appropriately train these machine learning models is one big challenge that is still present. Generative modelling has proven to be very efficient at generating image-like synthesized data that can match the actual data distribution. In this paper, we aim to generate malware samples as opcode sequences and attempt to differentiate them from the real ones, with the goal of building fake malware data that can be used to effectively train machine learning models. We use and compare different Generative Adversarial Network (GAN) algorithms and Hidden Markov Models (HMMs) to generate such fake samples, obtaining promising results.
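The HMM side of this generation idea can be illustrated with a minimal sketch: learn opcode transition statistics from real sequences, then sample fake sequences from them. A plain first-order Markov chain stands in for a full HMM (no hidden states), and the training sequences are made up; the GAN variants would require a deep learning framework and are not shown:

```python
import random
from collections import defaultdict, Counter

def fit_transitions(sequences):
    # Count opcode bigrams: how often each opcode follows another.
    counts = defaultdict(Counter)
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    return counts

def generate(counts, start, length, seed=0):
    # Sample a fake opcode sequence by walking the learned transitions.
    rng = random.Random(seed)
    seq = [start]
    for _ in range(length - 1):
        nxt = counts[seq[-1]]
        if not nxt:          # no observed successor: stop early
            break
        ops, weights = zip(*nxt.items())
        seq.append(rng.choices(ops, weights=weights)[0])
    return seq

# Hypothetical "real" opcode sequences from disassembled samples.
real = [["push", "mov", "call", "pop"], ["push", "mov", "add", "pop"]]
trans = fit_transitions(real)
fake = generate(trans, "push", 8)
# Every generated bigram was observed in the real data.
assert all(b in trans[a] for a, b in zip(fake, fake[1:]))
```

A genuine HMM adds hidden states between the observed opcodes, letting it capture longer-range structure than this one-step chain can.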
Crowdfunding Non-fungible Tokens on the Blockchain
Non-fungible tokens (NFTs) have been used as a way of rewarding content creators. Artists publish their works on the blockchain as NFTs, which they can then sell. The buyer of an NFT then holds ownership of a unique digital asset, which can be resold in much the same way that real-world art collectors might trade paintings. However, while a great deal of effort has been spent on selling works of art on the blockchain, very little attention has been paid to using the blockchain as a means of fundraising to help finance the artist’s work in the first place. Additionally, while blockchains like Ethereum are ideal for smaller works of art, additional support is needed when the artwork is larger than is feasible to store on the blockchain. In this paper, we propose a fundraising mechanism that will help artists to gain financial support for their initiatives, and where the backers can receive a share of the profits in exchange for their support. We discuss our prototype implementation using the SpartanGold framework. We then discuss how this system could be expanded to support large NFTs with the 0Chain blockchain, and describe how we could provide support for ongoing storage of these NFTs.
Robustness of Image-Based Malware Analysis
In previous work, “gist descriptor” features extracted from images have been used in malware classification problems and have shown promising results. In this research, we determine whether gist descriptors are robust with respect to malware obfuscation techniques, as compared to Convolutional Neural Networks (CNNs) trained directly on malware images. Using the Python Image Library (PIL), we create images from malware executables and from malware that we obfuscate. We conduct experiments to compare classifying these images with a CNN as opposed to extracting the gist descriptor features from these images to use in classification. For the gist descriptors, we consider a variety of classification algorithms including k-nearest neighbors, random forest, support vector machine, and multi-layer perceptron. We find that gist descriptors are more robust than CNNs, with respect to the obfuscation techniques that we consider.
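The first step in image-based malware analysis — rendering an executable's raw bytes as a grayscale image — can be sketched with the standard library alone. PIL's Image.frombytes (as used in the paper) would turn the rows below into an actual image; the sample byte string here is a made-up stand-in for an executable:

```python
def bytes_to_image(data, width=16):
    # Pad to a whole number of rows, then slice the byte string into
    # width-sized rows; each byte (0-255) is one grayscale pixel.
    padded = data + bytes((-len(data)) % width)
    return [list(padded[i:i + width]) for i in range(0, len(padded), width)]

# Hypothetical file contents; real experiments read the executable's bytes.
sample = bytes(range(256))
img = bytes_to_image(sample, width=16)
assert len(img) == 16 and all(len(row) == 16 for row in img)
assert img[0][0] == 0 and img[15][15] == 255
```

In practice the row width is often chosen as a function of file size, so that samples of different sizes still produce images with comparable aspect ratios.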
Twitter Bots’ Detection with Benford’s Law and Machine Learning
Online Social Networks (OSNs) have grown exponentially in terms of active users and have now become an influential factor in the formation of public opinions. For this reason, the use of bots and botnets for spreading misinformation on OSNs has become a widespread concern. Identifying bots and botnets on Twitter can require complex statistical methods to score a profile based on multiple features. Benford’s Law, or the Law of Anomalous Numbers, states that, in any naturally occurring sequence of numbers, the First Significant Leading Digit (FSLD) frequencies follow a particular pattern: they are unevenly distributed and decreasing. This principle can be applied to the first-degree egocentric network of a Twitter profile to assess its conformity to the law and, thus, classify it as a bot profile or a normal profile. This paper focuses on leveraging Benford’s Law in combination with various Machine Learning (ML) classifiers to identify bot profiles on Twitter. In addition, a comparison with other statistical methods is produced to confirm our classification results.
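The conformity check described above can be sketched directly: compare the observed FSLD frequencies of a number sequence against Benford's expected distribution, P(first digit = d) = log10(1 + 1/d). The number sequences below are made up; a real test would use counts drawn from a profile's egocentric network (e.g., friend or follower counts):

```python
import math
from collections import Counter

def benford_expected(d):
    # Benford's Law: P(first digit = d) = log10(1 + 1/d), d in 1..9.
    return math.log10(1 + 1 / d)

def fsld(n):
    # First significant leading digit of a positive number.
    return int(str(abs(n)).lstrip("0.")[0])

def fsld_distance(numbers):
    # Total absolute deviation of observed FSLD frequencies from
    # Benford's expected frequencies (one simple conformity measure).
    counts = Counter(fsld(n) for n in numbers)
    total = len(numbers)
    return sum(abs(counts[d] / total - benford_expected(d)) for d in range(1, 10))

# A geometric series spreads leading digits logarithmically, so it
# conforms to Benford's Law far better than a narrow uniform range.
natural = [1.05 ** k for k in range(200)]
uniform = list(range(100, 300))   # leading digits only 1 and 2
assert fsld_distance(natural) < fsld_distance(uniform)
```

The paper's approach turns such a conformity score into one feature among several for the ML classifiers; a low distance suggests organic activity, while strong deviation flags the profile for closer inspection.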
A Blockchain-Based Retribution Mechanism for Collaborative Intrusion Detection
A collaborative intrusion detection approach uses detection signatures shared among collaborating participants to facilitate coordinated defense. In the context of collaborative intrusion detection systems (CIDS), however, there has been no research focusing on the efficiency of the shared detection signatures. An inefficient detection signature costs not only IDS resources but also burdens the processing of the peer-to-peer (P2P) network. In this paper, we therefore propose a blockchain-based retribution mechanism, which aims to incentivize the participants to contribute to verifying the efficiency of detection signatures according to a distributed consensus. We implement a prototype using the Ethereum blockchain, which instantiates a token-based retribution mechanism and a smart contract-enabled, voting-based distributed consensus. We conduct a number of experiments built on the prototype, and the experimental results demonstrate the effectiveness of the proposed approach.