29 research outputs found
GUIDE FOR THE COLLECTION OF INTRUSION DATA FOR MALWARE ANALYSIS AND DETECTION IN THE BUILD AND DEPLOYMENT PHASE
During the COVID-19 pandemic, when most businesses were not equipped for remote work and cloud computing, we saw a significant surge in ransomware attacks. This study aims to utilize machine learning and artificial intelligence to prevent known and unknown malware threats from being exploited by threat actors when developers build and deploy applications to the cloud. This study demonstrated an experimental quantitative research design using Aqua. The experiment's sample is a Docker image. Aqua checked the Docker image for malware, sensitive data, Critical/High vulnerabilities, misconfiguration, and OSS licenses. The data collection approach is experimental. Our analysis of the experiment demonstrated how unapproved images were prevented from running anywhere in our environment based on known vulnerabilities, embedded secrets, OSS licensing, dynamic threat analysis, and secure image configuration. In addition to the experiment, the forensic data collected in the build and deployment phase are exploitable vulnerability, Critical/High Vulnerability Score, Misconfiguration, Sensitive Data, and Root User (Super User). Since Aqua generates a detailed audit record for every event during risk assessment and runtime, we viewed two events on the Audit page for our experiment. One of the events caused an alert due to two failed controls (Vulnerability Score, Super User), and the other was a successful event, meaning the image was secure to deploy in the production environment. The primary finding of our study is the forensic data associated with the two events on the Audit page in Aqua. In addition, Aqua validated our security controls and runtime policies based on the forensic data from both events on the Audit page. Finally, the study's conclusions will help organizations reduce the likelihood of falling victim to ransomware by preventing and limiting the damage caused by a malware attack.
Survey of Machine Learning Techniques for Malware Analysis
Coping with malware is getting more and more challenging, given its
relentless growth in complexity and volume. One of the most common approaches
in the literature is to use machine learning techniques to automatically learn
models and patterns behind such complexity, and to develop technologies for
keeping pace with the speed of development of novel malware. This survey aims
at providing an overview of how machine learning has been used so far in
the context of malware analysis. We systematize surveyed papers according to
their objectives (i.e., the expected output, what the analysis aims at), what
information about malware they specifically use (i.e., the features), and what
machine learning techniques they employ (i.e., what algorithm is used to
process the input and produce the output). We also outline a number of problems
concerning the datasets used in considered works, and finally introduce the
novel concept of malware analysis economics, regarding the study of existing
tradeoffs among key metrics, such as analysis accuracy and economic costs.
Uncovering Features in Behaviorally Similar Programs
The detection of similar code can support many software engineering tasks such as program understanding and program classification. Many excellent approaches have been proposed to detect programs having similar syntactic features. However, these approaches are unable to identify programs that are dynamically or statistically close to each other, which we call behaviorally similar programs. We believe the detection of behaviorally similar programs can enhance or even automate tasks relevant to program classification. In this thesis, we discuss our current approaches to identifying programs having similar behavioral features from multiple perspectives.
We first discuss how to detect programs having similar functionality. While the definition of a program's functionality is undecidable, we use the inputs and outputs (I/Os) of programs as a proxy for their functionality. We then use the I/Os of programs as a behavioral feature to detect which programs are functionally similar: two programs are functionally similar if they share similar inputs and outputs. This approach has been studied and developed in the C language to detect functionally equivalent programs having equivalent I/Os. Nevertheless, some natural problems in Object-Oriented languages, such as input generation and comparisons between application-specific data types, hinder the development of this approach. We propose a new technique, in-vivo detection, which uses existing and meaningful inputs to drive applications systematically and then applies a novel similarity model, considering both inputs and outputs of programs, to detect functionally similar programs. We develop the tool HitoshiIO based on our in-vivo detection. In the subjects we study, HitoshiIO correctly detects 68.4% of functionally similar programs, with a false positive rate of only 16.6%.
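The I/O-as-proxy idea above can be sketched in a few lines. This is an illustrative toy model only, not HitoshiIO's actual implementation; the function names and the Jaccard-overlap similarity measure are assumptions made for the sketch:

```python
def io_signature(func, inputs):
    """Record a function's observable behaviour as a set of (input, output) pairs."""
    sig = set()
    for x in inputs:
        try:
            sig.add((repr(x), repr(func(x))))
        except Exception:
            sig.add((repr(x), "<error>"))
    return sig

def io_similarity(f, g, inputs):
    """Jaccard similarity between two I/O signatures."""
    a, b = io_signature(f, inputs), io_signature(g, inputs)
    return len(a & b) / len(a | b) if a | b else 1.0

# Two implementations of the same functionality agree on all shared inputs,
# while an unrelated function overlaps on far fewer (input, output) pairs.
shared_inputs = list(range(10))
same = io_similarity(lambda n: n * 2, lambda n: n + n, shared_inputs)
diff = io_similarity(lambda n: n * 2, lambda n: n ** 2, shared_inputs)
```

In the real setting the inputs are harvested from actual application runs (the "in-vivo" part) rather than enumerated, and the similarity model is more permissive than exact pair equality.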
In addition to the functional I/Os of programs, we attempt to discover programs having similar execution behavior. Again, the execution behavior of a program can be undecidable, so we use the instructions executed at run-time as a behavioral feature of a program. We create DyCLINK, which observes program executions and encodes them in dynamic instruction graphs. A vertex in a dynamic instruction graph is an instruction, and an edge is a type of dependency between two instructions. The problem of detecting which programs have similar executions can then be reduced to one of solving inexact graph isomorphism. We propose a link-analysis-based algorithm, LinkSub, which vectorizes each dynamic instruction graph by the importance of every instruction, to solve this graph isomorphism problem efficiently. In a K-Nearest-Neighbor (KNN) based program classification experiment, DyCLINK achieves 90+% precision.
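The link-analysis step can be approximated as follows. This is a rough, hypothetical sketch (a PageRank-style importance score per instruction, summed per opcode); LinkSub's actual scoring and the graph encoding in DyCLINK differ in detail:

```python
def importance(graph, iters=50, d=0.85):
    """PageRank-style importance over a {node: [successors]} instruction graph."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - d) / len(nodes) for n in nodes}
        for n, succs in graph.items():
            for s in succs:
                new[s] += d * rank[n] / len(succs)
        rank = new
    return rank

def vectorize(graph, opcode_of, vocab):
    """Sum per-instruction importance by opcode to get a fixed-length vector."""
    rank = importance(graph)
    vec = {op: 0.0 for op in vocab}
    for n, r in rank.items():
        vec[opcode_of[n]] += r
    return [vec[op] for op in vocab]

# Toy dynamic instruction graph: i1 -> i2 -> i3 (dependency edges).
g = {"i1": ["i2"], "i2": ["i3"], "i3": []}
ops = {"i1": "load", "i2": "add", "i3": "store"}
vec = vectorize(g, ops, ["load", "add", "store"])
```

Once every execution is a fixed-length vector, the inexact graph isomorphism problem reduces to cheap vector comparisons, which is what makes a KNN classification experiment tractable.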
Because HitoshiIO and DyCLINK both rely on dynamic analysis to expose program behavior, they are better able to locate and search for behaviorally similar programs than traditional static analysis tools. However, they suffer from common problems of dynamic analysis, such as input generation and run-time overhead, which may make our approaches challenging to scale. Thus, we create the system Macneto, which integrates static analysis with topic modeling and deep learning to approximate program behaviors from their binaries without truly executing the programs. In our deobfuscation experiments, considering two commercial obfuscators that alter lexical information and syntax in programs, Macneto achieves 90+% precision, where the ground truth is that the behavior of a program before and after obfuscation should be the same.
In this thesis, we offer a more extensive view of similar programs than the traditional definitions. While the traditional definitions of similar programs mostly use static features, such as syntax and lexical information, we propose to leverage the power of dynamic analysis and machine learning models to trace and collect behavioral features of programs. These behavioral features can then be applied to detect behaviorally similar programs. We believe the techniques we invented in this thesis to detect behaviorally similar programs can improve the development of software engineering and security applications, such as code search and deobfuscation.
Early-stage malware prediction using recurrent neural networks
Static malware analysis is well-suited to endpoint anti-virus systems as it can be conducted quickly by examining the features of an executable piece of code and matching it to previously observed malicious code. However, static code analysis can be vulnerable to code obfuscation techniques. Behavioural data collected during file execution is more difficult to obfuscate, but takes a relatively long time to capture - typically up to 5 minutes - meaning the malicious payload has likely already been delivered by the time it is detected. In this paper we investigate the possibility of predicting whether or not an executable is malicious based on a short snapshot of behavioural data. We find that an ensemble of recurrent neural networks is able to predict whether an executable is malicious or benign within the first 5 seconds of execution with 94% accuracy. This is the first time general types of malicious files have been predicted to be malicious during execution, rather than using a complete activity log file post-execution, and it enables cyber security endpoint protection to advance to using behavioural data for blocking malicious payloads rather than detecting them post-execution and having to repair the damage.
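The data flow of such an early-stage predictor can be shown with a minimal, framework-free sketch. Everything here is illustrative: the feature layout, the vanilla RNN (the paper's ensemble uses trained recurrent networks), and the random weights, which stand in for a trained model:

```python
import math
import random

def rnn_forward(seq, Wx, Wh, Wo):
    """Vanilla RNN: tanh hidden state, sigmoid read-out on the final step."""
    h = [0.0] * len(Wh)
    for x in seq:  # one behavioural feature vector per second
        h = [math.tanh(sum(wx * xi for wx, xi in zip(Wx[j], x)) +
                       sum(wh * hi for wh, hi in zip(Wh[j], h)))
             for j in range(len(Wh))]
    logit = sum(wo * hi for wo, hi in zip(Wo, h))
    return 1.0 / (1.0 + math.exp(-logit))  # interpreted as P(malicious)

def ensemble_predict(seq, models):
    """Average the malicious-probability outputs of several weight sets."""
    return sum(rnn_forward(seq, *m) for m in models) / len(models)

# Untrained random weights, just to exercise the pipeline end to end.
random.seed(0)
def rand_model(n_in=3, n_h=4):
    r = lambda n: [random.uniform(-1, 1) for _ in range(n)]
    return ([r(n_in) for _ in range(n_h)], [r(n_h) for _ in range(n_h)], r(n_h))

snapshot = [[0.2, 0.9, 0.1], [0.8, 0.7, 0.3], [0.9, 0.9, 0.5]]  # 3 s of features
p = ensemble_predict(snapshot, [rand_model() for _ in range(5)])
```

The key property is that the sequence can be cut off after a few seconds and still yield a probability, which is what allows a blocking decision before the payload completes.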
Efficient, Scalable, and Accurate Program Fingerprinting in Binary Code
Why was this binary written? Which compiler was used? Which free software
packages did the developer use? Which sections of the code were borrowed? Who wrote
the binary? These questions are of paramount importance to security analysts and reverse
engineers, and binary fingerprinting approaches may provide valuable insights that can
help answer them. This thesis advances the state of the art by addressing some of the
most fundamental problems in program fingerprinting for binary code, notably, reusable
binary code discovery, fingerprinting free open source software packages, and authorship
attribution.
First, to tackle the problem of discovering reusable binary code, we employ a technique
for identifying reused functions by matching traces of a novel representation of binary
code known as the semantic integrated graph. This graph enhances the control flow
graph, the register flow graph, and the function call graph, key concepts from classical program analysis, and merges them with other structural information to create a joint data
structure. Second, we approach the problem of fingerprinting free open source software
(FOSS) packages by proposing a novel resilient and efficient system that incorporates
three components. The first extracts the syntactical features of functions by considering
opcode frequencies and performing a hidden Markov model statistical test. The second
applies a neighborhood hash graph kernel to random walks derived from control flow
graphs, with the goal of extracting the semantics of the functions. The third applies the
z-score to normalized instructions to extract the behavior of the instructions in a function.
Then, the components are integrated using a Bayesian network model which synthesizes
the results to determine the FOSS function, making it possible to detect user-related functions.
Third, with these elements now in place, we present a framework capable of decoupling
binary program functionality from the coding habits of authors. To capture coding habits,
the framework leverages a set of features that are based on collections of functionality-independent
choices made by authors during coding. Finally, it is well known that techniques
such as refactoring and code transformations can significantly alter the structure
of code, even for simple programs. Applying such techniques or changing the compiler
and compilation settings can significantly affect the accuracy of available binary analysis
tools, which severely limits their practicability, especially when applied to malware. To
address these issues, we design a technique that extracts the semantics of binary code in terms of both data and control flow. The proposed technique allows more robust binary
analysis because the extracted semantics of the binary code is generally immune
from code transformation, refactoring, and varying the compilers or compilation settings.
Specifically, it employs data-flow analysis to extract the semantic flow of the registers as
well as the semantic components of the control flow graph, which are then synthesized
into a novel representation called the semantic flow graph (SFG).
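Two of the syntactic feature types named above (opcode frequencies, and z-scores over instruction counts) can be illustrated concretely. This is a sketch under stated assumptions, not the thesis' actual pipeline; the function names and the toy corpus are invented for illustration:

```python
from collections import Counter
from statistics import mean, pstdev

def opcode_frequencies(opcodes):
    """Normalised opcode histogram for one disassembled function."""
    counts = Counter(opcodes)
    return {op: c / len(opcodes) for op, c in counts.items()}

def opcode_zscore(func_opcodes, corpus_counts, op):
    """z-score of an opcode's count in one function against corpus counts."""
    x = Counter(func_opcodes)[op]
    mu, sigma = mean(corpus_counts), pstdev(corpus_counts)
    return 0.0 if sigma == 0 else (x - mu) / sigma

# A function using "xor" less often than the corpus average scores negative,
# flagging instructions whose usage in this function is unusual.
func = ["mov", "mov", "add", "xor"]
freq = opcode_frequencies(func)
z = opcode_zscore(func, [0, 1, 0, 5], "xor")
```

In the thesis these low-level features feed the hidden Markov model test and, via the Bayesian network, the final FOSS-function decision; the sketch only shows the shape of the raw features.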
We evaluate the framework on large-scale datasets extracted from selected open source
C++ projects on GitHub, Google Code Jam events, Planet Source Code contests, and students’
programming projects and found that it outperforms existing methods in several
respects. First, it is able to detect the reused functions. Second, it can identify FOSS
packages in real-world projects and reused binary functions with high precision. Third, it
decouples authorship from functionality so that it can be applied to real malware binaries
to automatically generate evidence of similar coding habits. Fourth, compared to existing
research contributions, it successfully attributes a larger number of authors with a significantly
higher accuracy. Finally, the new framework is more robust than previous methods
in the sense that there is no significant drop in accuracy when the code is subjected to
refactoring techniques, code transformation methods, and different compilers.
Fingerprinting Vulnerabilities in Intelligent Electronic Device Firmware
Modern smart grid deployments rely heavily on the advanced capabilities that Intelligent Electronic Devices (IEDs) provide. However, these devices' firmware often contains critical vulnerabilities that, if exploited, could have a large impact on national economic security and safety. As such, a scalable, domain-specific approach is required to assess the security of IED firmware. To address this lack of an appropriate methodology, we present a scalable vulnerable-function identification framework. It is specifically designed to analyze IED firmware and binaries that employ the ARM CPU architecture. Its core functionality revolves around a multi-stage detection methodology designed to resolve the lack of specialization that limits other general-purpose approaches. This is achieved by compiling an extensive database of IED-specific vulnerabilities and domain-specific firmware against which the framework is evaluated. Its analysis approach is composed of three stages that leverage function syntactic, semantic, structural, and statistical features to identify vulnerabilities. As such, it (i) first filters out dissimilar functions based on a group of heterogeneous features, (ii) then further filters out dissimilar functions based on their execution paths, and (iii) finally identifies candidate functions based on fuzzy graph matching. To validate our methodology's capabilities, it is implemented as a binary analysis framework entitled BinArm. The resulting algorithm is then put through a rigorous set of evaluations that demonstrate its capabilities, including the ability to identify vulnerabilities within a given IED firmware image with a total accuracy of 0.92.
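The three-stage narrowing strategy can be shown schematically. The predicates and data fields below are illustrative stand-ins, not BinArm's API; the point is only that each cheap stage shrinks the candidate set before the expensive fuzzy graph matching runs:

```python
def multistage_match(target, candidates, stages):
    """Apply cheap filters first so expensive matching sees few survivors."""
    for stage in stages:
        candidates = [c for c in candidates if stage(target, c)]
        if not candidates:
            break
    return candidates

# Stage 1: coarse heterogeneous features (e.g. instruction count).
feature_filter = lambda t, c: abs(t["n_instr"] - c["n_instr"]) <= 10
# Stage 2: execution-path signature overlap.
path_filter = lambda t, c: bool(set(t["paths"]) & set(c["paths"]))
# Stage 3: stand-in for fuzzy graph matching on the remaining pairs.
graph_filter = lambda t, c: t["cfg_hash"] == c["cfg_hash"]

target = {"n_instr": 100, "paths": ["a", "b"], "cfg_hash": 7}
cands = [
    {"n_instr": 300, "paths": ["a"], "cfg_hash": 7},  # rejected at stage 1
    {"n_instr": 105, "paths": ["c"], "cfg_hash": 7},  # rejected at stage 2
    {"n_instr": 95, "paths": ["b"], "cfg_hash": 7},   # survives all stages
]
matches = multistage_match(target, cands,
                           [feature_filter, path_filter, graph_filter])
```

Ordering the stages from cheapest to most expensive is what keeps the overall pipeline scalable across whole firmware images.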
Racing demons: Malware detection in early execution
Malicious software (malware) causes increasingly devastating social and financial losses each year. As such, academic and commercial research has been directed towards automatically sorting malicious software from benign software. Machine learning (ML) has been widely proposed to address this challenge in an attempt to move away from the time-consuming practice of hand-writing detection rules. Building on the promising results of previous ML malware detection research, this thesis focuses on the use of dynamic behavioural data captured from malware activity, arguing that dynamic models are more robust to attacker evasion techniques than code-based detection methods.
This thesis seeks to address some of the open problems that security practitioners may face in adopting dynamic behavioural automatic malware detection. The first is the reliability in performance of different data sources and algorithms when translating laboratory results into real-world use, which has not been analysed in previous dynamic detection literature. After highlighting that the best-performing data and algorithm in the laboratory may not be the best-performing in the real world, the thesis turns to one of the main criticisms of dynamic data: the time taken to collect it. In previous research, dynamic detection is often conducted for several minutes per sample, making it incompatible with the speed of code-based detection. This thesis presents the first model of early-stage malware prediction using just a few seconds of collected data. Finally, building on early-stage detection in an isolated environment, real-time detection on a live machine in use is simulated. Real-time detection further reduces the computational costs of dynamic analysis. This thesis also presents the first results of damage prevention using automated malware detection and process killing during normal machine use.
Cyber Law and Espionage Law as Communicating Vessels
Professor Lubin's contribution is Cyber Law and Espionage Law as Communicating Vessels, pp. 203-225.
Existing legal literature would have us assume that espionage operations and “below-the-threshold” cyber operations are doctrinally distinct. Whereas one is subject to the scant, amorphous, and under-developed legal framework of espionage law, the other is subject to an emerging, ever-evolving body of legal rules, known cumulatively as cyber law. This dichotomy, however, is erroneous and misleading. In practice, espionage and cyber law function as communicating vessels, and so are better conceived as two elements of a complex system, Information Warfare (IW). This paper therefore first draws attention to the similarities between the practices – the fact that the actors, technologies, and targets are interchangeable, as are the knee-jerk legal reactions of the international community. In light of the convergence between peacetime Low-Intensity Cyber Operations (LICOs) and peacetime Espionage Operations (EOs) the two should be subjected to a single regulatory framework, one which recognizes the role intelligence plays in our public world order and which adopts a contextual and consequential method of inquiry. The paper proceeds in the following order: Part 2 provides a descriptive account of the unique symbiotic relationship between espionage and cyber law, and further explains the reasons for this dynamic. Part 3 places the discussion surrounding this relationship within the broader discourse on IW, making the claim that the convergence between EOs and LICOs, as described in Part 2, could further be explained by an even larger convergence across all the various elements of the informational environment. Parts 2 and 3 then serve as the backdrop for Part 4, which details the attempt of the drafters of the Tallinn Manual 2.0 to compartmentalize espionage law and cyber law, and the deficits of their approach. 
The paper concludes by proposing an alternative holistic understanding of espionage law, grounded in general principles of law, which is more practically transferable to the cyber realm.