52 research outputs found

    Neural malware detection

    Get PDF
    At the heart of today’s malware problem lies theoretically infinite diversity created by metamorphism. The majority of conventional machine learning techniques tackle the problem with the assumptions that a sufficiently large number of training samples exist and that the training set is independent and identically distributed. However, the lack of semantic features combined with the models under these wrong assumptions result largely in overfitting with many false positives against real world samples, resulting in systems being left vulnerable to various adversarial attacks. A key observation is that modern malware authors write a script that automatically generates an arbitrarily large number of diverse samples that share similar characteristics in program logic, which is a very cost-effective way to evade detection with minimum effort. Given that many malware campaigns follow this paradigm of economic malware manufacturing model, the samples within a campaign are likely to share coherent semantic characteristics. This opens up a possibility of one-to-many detection. Therefore, it is crucial to capture this non-linear metamorphic pattern unique to the campaign in order to detect these seemingly diverse but identically rooted variants. To address these issues, this dissertation proposes novel deep learning models, including generative static malware outbreak detection model, generative dynamic malware detection model using spatio-temporal isomorphic dynamic features, and instruction cognitive malware detection. A comparative study on metamorphic threats is also conducted as part of the thesis. Generative adversarial autoencoder (AAE) over convolutional network with global average pooling is introduced as a fundamental deep learning framework for malware detection, which captures highly complex non-linear metamorphism through translation invariancy and local variation insensitivity. Generative Adversarial Network (GAN) used as a part of the framework enables oneshot training where semantically isomorphic malware campaigns are identified by a single malware instance sampled from the very initial outbreak. This is a major innovation because, to the best of our knowledge, no approach has been found to this challenging training objective against the malware distribution that consists of a large number of very sparse groups artificially driven by arms race between attackers and defenders. In addition, we propose a novel method that extracts instruction cognitive representation from uninterpreted raw binary executables, which can be used for oneto- many malware detection via one-shot training against frequency spectrum of the Transformer’s encoded latent representation. The method works regardless of the presence of diverse malware variations while remaining resilient to adversarial attacks that mostly use random perturbation against raw binaries. Comprehensive performance analyses including mathematical formulations and experimental evaluations are provided, with the proposed deep learning framework for malware detection exhibiting a superior performance over conventional machine learning methods. The methods proposed in this thesis are applicable to a variety of threat environments here artificially formed sparse distributions arise at the cyber battle fronts.Doctor of Philosoph

    Multimodal Approach for Malware Detection

    Get PDF
    Although malware detection is a very active area of research, few works were focused on using physical properties (e.g., power consumption) and multimodal features for malware detection. We designed an experimental testbed that allowed us to run samples of malware and non-malicious software applications and to collect power consumption, network traffic, and system logs data, and subsequently to extract dynamic behavioral-based features. We also extracted code-based static features of both malware and non-malicious software applications. These features were used for malware detection based on: feature level fusion using power consumption and network traffic data, feature level fusion using network traffic data and system logs, and multimodal feature level and decision level fusion. The contributions when using feature level fusion of power consumption and network traffic data are: (1) We focused on detecting real malware using the extracted dynamic behavioral features (both power-based and network traffic-based) and supervised machine learning algorithms, which has not been done by any of the prior works. (2) We ran a large number of machine learning experiments, which allowed us to identify the best performing learner, DC voltage rails that led to the best malware detection performance, and the subset of features that are the best predictors for malware detection. (3) The comparison of malware detection performance was done using a comprehensive set of metrics that reflect different aspects of the quality of malware detection. In the case of the feature level fusion using network traffic data and system logs, the contributions are: (1) Most of the previous works that have used network flows-based features have done classification of the network traffic, while our focus was on classifying the software running in a machine as malware and non-malicious software using the extracted dynamic behavioral features. (2) We experimented with different sizes of the training set (i.e., 90%, 75%, 50%, and 25% of the data) and found that smaller training sets produced very good classification results. This aspect of our work has a practical value because the manual labeling of the training set is a tedious and time consuming process. In this dissertation we present a multimodal deep learning neural network that integrates different modalities (i.e., power consumption, system logs, network traffic, and code-based static data) using decision level fusion. We evaluated the performance of each modality individually, when using feature level fusion, and when using decision level fusion. The contributions of our multimodal approach are as follow: (1) Collecting data from different modalities allowed us to develop a multimodal approach to malware detection, which has not been widely explored by prior works. Even more, none of the previous works compared the performance of feature level fusion with decision level fusion, which is explored in this dissertation. (2) We proposed a multimodal decision level fusion malware detection approach using a deep neural network and compared its performance with the performance of feature level fusion approaches based on deep neural network and standard supervised machine learning algorithms (i.e., Random Forest, J48, JRip, PART, Naive Bayes, and SMO)

    Improved Detection for Advanced Polymorphic Malware

    Get PDF
    Malicious Software (malware) attacks across the internet are increasing at an alarming rate. Cyber-attacks have become increasingly more sophisticated and targeted. These targeted attacks are aimed at compromising networks, stealing personal financial information and removing sensitive data or disrupting operations. Current malware detection approaches work well for previously known signatures. However, malware developers utilize techniques to mutate and change software properties (signatures) to avoid and evade detection. Polymorphic malware is practically undetectable with signature-based defensive technologies. Today’s effective detection rate for polymorphic malware detection ranges from 68.75% to 81.25%. New techniques are needed to improve malware detection rates. Improved detection of polymorphic malware can only be accomplished by extracting features beyond the signature realm. Targeted detection for polymorphic malware must rely upon extracting key features and characteristics for advanced analysis. Traditionally, malware researchers have relied on limited dimensional features such as behavior (dynamic) or source/execution code analysis (static). This study’s focus was to extract and evaluate a limited set of multidimensional topological data in order to improve detection for polymorphic malware. This study used multidimensional analysis (file properties, static and dynamic analysis) with machine learning algorithms to improve malware detection. This research demonstrated improved polymorphic malware detection can be achieved with machine learning. This study conducted a number of experiments using a standard experimental testing protocol. This study utilized three advanced algorithms (Metabagging (MB), Instance Based k-Means (IBk) and Deep Learning Multi-Layer Perceptron) with a limited set of multidimensional data. Experimental results delivered detection results above 99.43%. In addition, the experiments delivered near zero false positives. The study’s approach was based on single case experimental design, a well-accepted protocol for progressive testing. The study constructed a prototype to automate feature extraction, assemble files for analysis, and analyze results through multiple clustering algorithms. The study performed an evaluation of large malware sample datasets to understand effectiveness across a wide range of malware. The study developed an integrated framework which automated feature extraction for multidimensional analysis. The feature extraction framework consisted of four modules: 1) a pre-process module that extracts and generates topological features based on static analysis of machine code and file characteristics, 2) a behavioral analysis module that extracts behavioral characteristics based on file execution (dynamic analysis), 3) an input file construction and submission module, and 4) a machine learning module that employs various advanced algorithms. As with most studies, careful attention was paid to false positive and false negative rates which reduce their overall detection accuracy and effectiveness. This study provided a novel approach to expand the malware body of knowledge and improve the detection for polymorphic malware targeting Microsoft operating systems

    Protecting Android Devices from Malware Attacks: A State-of-the-Art Report of Concepts, Modern Learning Models and Challenges

    Get PDF
    Advancements in microelectronics have increased the popularity of mobile devices like cellphones, tablets, e-readers, and PDAs. Android, with its open-source platform, broad device support, customizability, and integration with the Google ecosystem, has become the leading operating system for mobile devices. While Android's openness brings benefits, it has downsides like a lack of official support, fragmentation, complexity, and security risks if not maintained. Malware exploits these vulnerabilities for unauthorized actions and data theft. To enhance device security, static and dynamic analysis techniques can be employed. However, current attackers are becoming increasingly sophisticated, and they are employing packaging, code obfuscation, and encryption techniques to evade detection models. Researchers prefer flexible artificial intelligence methods, particularly deep learning models, for detecting and classifying malware on Android systems. In this survey study, a detailed literature review was conducted to investigate and analyze how deep learning approaches have been applied to malware detection on Android systems. The study also provides an overview of the Android architecture, datasets used for deep learning-based detection, and open issues that will be studied in the future

    ConstrucciĂłn de clasificadores de malware para agencias de seguridad del Estado

    Get PDF
    Sandboxing has been used regularly to analyze software samples and determine if these contain suspicious properties or behaviors. Even if sandboxing is a powerful technique to perform malware analysis, it requires that a malware analyst performs a rigorous analysis of the results to determine the nature of the sample: goodware or malware. This paper proposes two machine learning models able to classify samples based on signatures and permissions obtained through Cuckoo sandbox, Androguard and VirusTotal. The developed models are also tested obtaining an acceptable percentage of correctly classified samples, being in this way useful tools for a malware analyst. A proposal of architecture for an IoT sentinel that uses one of the developed machine learning model is also showed. Finally, different approaches, perspectives, and challenges about the use of sandboxing and machine learning by security teams in State security agencies are also shared.El sandboxing ha sido usado de manera regular para analizar muestras de software y determinar si estas contienen propiedades o comportamientos sospechosos. A pesar de que el sandboxing es una técnica poderosa para desarrollar análisis de malware, esta requiere que un analista de malware desarrolle un análisis riguroso de los resultados para determinar la naturaleza de la muestra: goodware o malware. Este artículo propone dos modelos de aprendizaje automáticos capaces de clasificar muestras con base a un análisis de firmas o permisos extraídos por medio de Cuckoo sandbox, Androguard y VirusTotal. En este artículo también se presenta una propuesta de arquitectura de centinela IoT que protege dispositivos IoT, usando uno de los modelos de aprendizaje automáticos desarrollados anteriormente. Finalmente, diferentes enfoques y perspectivas acerca del uso de sandboxing y aprendizaje automático por parte de agencias de seguridad del Estado también son aportados

    Formalization and Detection of Host-Based Code Injection Attacks in the Context of Malware

    Get PDF
    The Host-Based Code Injection Attack (HBCIAs) is a technique that malicious software utilizes in order to avoid detection or steal sensitive information. In a nutshell, this is a local attack where code is injected across process boundaries and executed in the context of a victim process. Malware employs HBCIAs on several operating systems including Windows, Linux, and macOS. This thesis investigates the topic of HBCIAs in the context of malware. First, we conduct basic research on this topic. We formalize HBCIAs in the context of malware and show in several measurements, amongst others, the high prevelance of HBCIA-utilizing malware. Second, we present Bee Master, a platform-independent approach to dynamically detect HBCIAs. This approach applies the honeypot paradigm to operating system processes. Bee Master deploys fake processes as honeypots, which are attacked by malicious software. We show that Bee Master reliably detects HBCIAs on Windows and Linux. Third, we present Quincy, a machine learning-based system to detect HBCIAs in post-mortem memory dumps. It utilizes up to 38 features including memory region sparseness, memory region protection, and the occurence of HBCIA-related strings. We evaluate Quincy with two contemporary detection systems called Malfind and Hollowfind. This evaluation shows that Quincy outperforms them both. It is able to increase the detection performance by more than eight percent

    Detecting Cryptojacking Web Threats: An Approach with Autoencoders and Deep Dense Neural Networks

    Get PDF
    With the growing popularity of cryptocurrencies, which are an important part of day-to-day transactions over the Internet, the interest in being part of the so-called cryptomining service has attracted the attention of investors who wish to quickly earn profits by computing powerful transactional records towards the blockchain network. Since most users cannot afford the cost of specialized or standardized hardware for mining purposes, new techniques have been developed to make the latter easier, minimizing the computational cost required. Developers of large cryptocurrency houses have made available executable binaries and mainly browser-side scripts in order to authoritatively tap into users’ collective resources and effectively complete the calculation of puzzles to complete a proof of work. However, malicious actors have taken advantage of this capability to insert malicious scripts and illegally mine data without the user’s knowledge. This cyber-attack, also known as cryptojacking, is stealthy and difficult to analyze, whereby, solutions based on anti-malware extensions, blocklists, JavaScript disabling, among others, are not sufficient for accurate detection, creating a gap in multi-layer security mechanisms. Although in the state-of-the-art there are alternative solutions, mainly using machine learning techniques, one of the important issues to be solved is still the correct characterization of network and host samples, in the face of the increasing escalation of new tampering or obfuscation techniques. This paper develops a method that performs a fingerprinting technique to detect possible malicious sites, which are then characterized by an autoencoding algorithm that preserves the best information of the infection traces, thus, maximizing the classification power by means of a deep dense neural network

    Challenges and Open Questions of Machine Learning in Computer Security

    Get PDF
    This habilitation thesis presents advancements in machine learning for computer security, arising from problems in network intrusion detection and steganography. The thesis put an emphasis on explanation of traits shared by steganalysis, network intrusion detection, and other security domains, which makes these domains different from computer vision, speech recognition, and other fields where machine learning is typically studied. Then, the thesis presents methods developed to at least partially solve the identified problems with an overall goal to make machine learning based intrusion detection system viable. Most of them are general in the sense that they can be used outside intrusion detection and steganalysis on problems with similar constraints. A common feature of all methods is that they are generally simple, yet surprisingly effective. According to large-scale experiments they almost always improve the prior art, which is likely caused by being tailored to security problems and designed for large volumes of data. Specifically, the thesis addresses following problems: anomaly detection with low computational and memory complexity such that efficient processing of large data is possible; multiple-instance anomaly detection improving signal-to-noise ration by classifying larger group of samples; supervised classification of tree-structured data simplifying their encoding in neural networks; clustering of structured data; supervised training with the emphasis on the precision in top p% of returned data; and finally explanation of anomalies to help humans understand the nature of anomaly and speed-up their decision. Many algorithms and method presented in this thesis are deployed in the real intrusion detection system protecting millions of computers around the globe
    • …
    corecore