301 research outputs found

    Feature Selection and Improving Classification Performance for Malware Detection

    The ubiquitous advance of technology has been conducive to the proliferation of cyber threats, and the number of attacks has grown exponentially. Consequently, researchers have developed models based on machine learning algorithms for detecting malware. However, these methods require a significant number of extracted features for correct malware classification, so feature extraction, training, and testing take significant time; moreover, it remains unexplored which features are most important for correct classification. In this thesis, a dataset of malware and clean files (goodware) is created and analyzed from the static and dynamic features provided by the online framework VirusTotal. The purpose was to select the smallest number of features that keeps classification accuracy as high as that of state-of-the-art research. Selecting the most representative features for malware detection is motivated by the possibility of reducing the training time, given that it increases in O(n^2) with respect to the number of features, and of creating an embedded program that monitors processes executed by the OS. Thus, feature selection was performed by keeping the most important features. In addition, classification algorithms such as Random Forest, Support Vector Machine, and Neural Networks were used in a novel combination that not only increased accuracy but also reduced training time from hours to just minutes. Next, the model was tested on one additional dataset of unseen malware files. Results showed that 9 features were enough to distinguish malware from goodware files with an accuracy of 99.60%.
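
As a rough illustration of the training-time argument in the abstract above, assume (this is an assumption, not a figure from the thesis) that training time is exactly proportional to the square of the feature count n, with an unknown constant c. Reducing n features to k then scales the cost as follows:

```latex
T(n) = c\,n^{2}
\quad\Longrightarrow\quad
\frac{T(k)}{T(n)} = \left(\frac{k}{n}\right)^{2},
\qquad
k = 9:\;\; \frac{T(9)}{T(n)} = \frac{81}{n^{2}} .
```

Because O(n^2) is only an upper bound, this ratio is an order-of-magnitude guide at best. The abstract does not state the original feature count n, so it is left symbolic here.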

    Malware Classification Based on Hidden Markov Model and Word2Vec Features

    Malware classification is an important and challenging problem in information security. Modern malware classification techniques rely on machine learning models that can be trained on a wide variety of features, including opcode sequences, API calls, and byte n-grams, among many others. In this research, we implement hybrid machine learning techniques, where we train hidden Markov models (HMM) and compute Word2Vec encodings based on opcode sequences. The resulting trained HMMs and Word2Vec embedding vectors are then used as features for classification algorithms. Specifically, we consider support vector machine (SVM), k-nearest neighbor (k-NN), random forest (RF), and deep neural network (DNN) classifiers. We conduct substantial experiments over a variety of malware families. Our results surpass those of comparable classification experiments.

    Resilient and Scalable Android Malware Fingerprinting and Detection

    Malicious software (malware) proliferation reaches hundreds of thousands daily. The manual analysis of such a large volume of malware is daunting and time-consuming. The diversity of targeted systems in terms of architecture and platforms compounds the challenges of Android malware detection and malware in general. This highlights the need to design and implement new scalable and robust methods, techniques, and tools to detect Android malware. In this thesis, we develop a malware fingerprinting framework to cover accurate Android malware detection and family attribution. In this context, we emphasize the following: (i) the scalability over a large malware corpus; (ii) the resiliency to common obfuscation techniques; (iii) the portability over different platforms and architectures. In the context of bulk and offline detection on the laboratory/vendor level: First, we propose an approximate fingerprinting technique for Android packaging that captures the underlying static structure of the Android apps. We also propose a malware clustering framework on top of this fingerprinting technique to perform unsupervised malware detection and grouping by building and partitioning a similarity network of malicious apps. Second, we propose an approximate fingerprinting technique for Android malware's behavior reports generated using dynamic analyses leveraging natural language processing techniques. Based on this fingerprinting technique, we propose a portable malware detection and family threat attribution framework employing supervised machine learning techniques. Third, we design an automatic framework to produce intelligence about the underlying malicious cyber-infrastructures of Android malware. We leverage graph analysis techniques to generate relevant, actionable, and granular intelligence that can be used to identify the threat effects induced by malicious Internet activity associated with malicious Android apps.
In the context of single-app and online detection on the mobile device level, we further propose the following: Fourth, we design a portable and effective Android malware detection system that is suitable for deployment on mobile and resource-constrained devices, using machine learning classification on raw method call sequences. Fifth, we elaborate a framework for Android malware detection that is resilient to common code obfuscation techniques and adaptive to operating system and malware change over time, using natural language processing and deep learning techniques. We also evaluate the portability of the proposed techniques and methods beyond Android platform malware, as follows: Sixth, we leverage the previously elaborated techniques to build a framework for cross-platform ransomware fingerprinting relying on raw hybrid features in conjunction with advanced deep learning techniques.

    Malware Classification with Word Embedding Features

    Malware classification is an important and challenging problem in information security. Modern malware classification techniques rely on machine learning models that can be trained on features such as opcode sequences, API calls, and byte n-grams, among many others. In this research, we consider opcode features. We implement hybrid machine learning techniques, where we engineer feature vectors by training hidden Markov models -- a technique that we refer to as HMM2Vec -- and Word2Vec embeddings on these opcode sequences. The resulting HMM2Vec and Word2Vec embedding vectors are then used as features for classification algorithms. Specifically, we consider support vector machine (SVM), k-nearest neighbor (k-NN), random forest (RF), and convolutional neural network (CNN) classifiers. We conduct substantial experiments over a variety of malware families. Our experiments extend well beyond any previous work in this field.

    CAMP: A Common API for Measuring Performance

    Accurate performance testing of heterogeneous distributed systems, such as those created using GRID technology, requires a consistent method for retrieving system performance data from multiple platforms. This paper presents CAMP: a low-level, platform-independent performance data API designed for use with distributed testing frameworks. CAMP is not necessarily tied to the distributed testing task: it provides a simple, low-level interface into operating system performance data that can be used to build complex performance measurement applications. This paper discusses CAMP's functionality and implementation in detail. It also contains a detailed analysis of the API's correctness, performance, and overhead.

    Dynamic Code Checksum Generator

    A checksum (i.e., a cryptographic hash) of a file can be used as an integrity check: if an attacker tries to change the code in an executable file, a checksum can be used to detect the tampering. While it is easy to compute a checksum for any static file, it is possible for an attacker to tamper with an executable file as it is being loaded into memory, or after it has been loaded. Therefore, it would be more useful to checksum an executable file dynamically, only after the file has been loaded into memory. However, checksumming dynamic code is much more challenging than dealing with static code: the code can be loaded into different locations in memory, and parts of the code will change depending on where the code resides in memory (addresses, labels, etc.). Windows Vista and later versions of Windows include a technology known as Address Space Layout Randomization (ASLR). ASLR, which serves as a defense against buffer overflow attacks, causes the executable file to be loaded at a randomly selected location in memory. The goal of this project is to develop a robust and efficient technique for computing the cryptographic hash of a dynamic executable in the presence of ASLR.
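
The difficulty described in the abstract above can be illustrated with a minimal Python sketch. It is not the project's technique: it assumes a toy image whose load-dependent fields are 32-bit little-endian addresses at known offsets, and the helper names `relocate` and `masked_sha256` are hypothetical. The sketch shows that a plain hash changes with the load address, while a hash computed with the relocated fields zeroed out does not.

```python
import hashlib
import struct


def relocate(image: bytes, reloc_offsets, delta: int) -> bytes:
    """Simulate loader fix-ups (assumed model): add `delta` to each
    32-bit little-endian address field listed in `reloc_offsets`."""
    buf = bytearray(image)
    for off in reloc_offsets:
        (addr,) = struct.unpack_from("<I", buf, off)
        struct.pack_into("<I", buf, off, (addr + delta) & 0xFFFFFFFF)
    return bytes(buf)


def masked_sha256(image: bytes, reloc_offsets) -> str:
    """SHA-256 of the image with every relocated 4-byte field zeroed,
    so the digest does not depend on where the image was loaded."""
    buf = bytearray(image)
    for off in reloc_offsets:
        buf[off:off + 4] = b"\x00\x00\x00\x00"
    return hashlib.sha256(bytes(buf)).hexdigest()


if __name__ == "__main__":
    image = bytes(range(16))   # toy "code" section
    offsets = [4, 12]          # hypothetical relocation table
    a = relocate(image, offsets, 0x00010000)
    b = relocate(image, offsets, 0x007F0000)
    print(hashlib.sha256(a).hexdigest() == hashlib.sha256(b).hexdigest())
    print(masked_sha256(a, offsets) == masked_sha256(b, offsets))
```

Masking is the simplest choice but it ignores the address fields entirely, so tampering confined to those fields would go unnoticed. Subtracting the actual load base from each field before hashing would keep them covered, at the cost of needing the base address.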