Uncovering Features in Behaviorally Similar Programs
The detection of similar code can support many software engineering tasks, such as program understanding and program classification. Many excellent approaches have been proposed to detect programs having similar syntactic features. However, these approaches are unable to identify programs that are dynamically or statistically close to each other, which we call behaviorally similar programs. We believe the detection of behaviorally similar programs can enhance or even automate tasks relevant to program classification. In this thesis, we discuss our approaches to identifying programs having similar behavioral features from multiple perspectives.
We first discuss how to detect programs having similar functionality. While determining a program’s functionality is undecidable in general, we use the inputs and outputs (I/Os) of programs as a proxy for their functionality. We then use program I/Os as a behavioral feature to detect which programs are functionally similar: two programs are functionally similar if they share similar inputs and outputs. This approach has previously been studied for the C language to detect functionally equivalent programs having equivalent I/Os. Nevertheless, some problems natural to object-oriented languages, such as input generation and comparison between application-specific data types, hinder the development of this approach. We propose a new technique, in-vivo detection, which uses existing, meaningful inputs to drive applications systematically and then applies a novel similarity model, considering both inputs and outputs of programs, to detect functionally similar programs. We develop a tool, HitoshiIO, based on our in-vivo detection. In the subjects that we study, HitoshiIO correctly detects 68.4% of functionally similar programs, with a false positive rate of only 16.6%.
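To make the I/O-based notion of similarity concrete, the following minimal Python sketch compares two functions by the overlap of their recorded (input, output) pairs; the function names and the Jaccard-style measure are illustrative assumptions, not HitoshiIO’s actual similarity model.

    # Sketch: two functions count as functionally similar when their recorded
    # (input, output) pairs overlap. Illustrative only; HitoshiIO's similarity
    # model is more elaborate.
    def io_records(func, inputs):
        """Drive a function with existing inputs; collect (input, output) pairs."""
        return {(args, func(*args)) for args in inputs}

    def io_similarity(recs_a, recs_b):
        """Jaccard similarity over two sets of I/O records."""
        if not recs_a and not recs_b:
            return 1.0
        return len(recs_a & recs_b) / len(recs_a | recs_b)

    # Two syntactically different but functionally similar implementations.
    double_add = lambda x: x + x
    double_mul = lambda x: x * 2
    inputs = [(i,) for i in range(10)]
    print(io_similarity(io_records(double_add, inputs),
                        io_records(double_mul, inputs)))  # -> 1.0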
In addition to functional I/Os of programs, we attempt to discover programs having similar execution behavior. Again, because deciding the execution behavior of a program is undecidable in general, we use the instructions executed at run time as a behavioral feature of a program. We create DyCLINK, which observes program executions and encodes them in dynamic instruction graphs, where a vertex is an instruction and an edge is a type of dependency between two instructions. The problem of detecting which programs have similar executions then reduces to inexact graph isomorphism. We propose a link-analysis-based algorithm, LinkSub, which vectorizes each dynamic instruction graph by the importance of every instruction, to solve this graph isomorphism problem efficiently. In a k-nearest-neighbors (KNN) based program classification experiment, DyCLINK achieves over 90% precision.
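The LinkSub idea can be sketched as follows: score every instruction in a dynamic instruction graph with a PageRank-style link analysis, aggregate the scores per opcode so that profiles from different executions are comparable, and compare profiles with cosine similarity. This is a simplified rendering under our own assumptions, not DyCLINK’s implementation.

    # Sketch: PageRank-style importance over a dependency graph, aggregated
    # per opcode, as a stand-in for LinkSub's vectorization.
    import math

    def instruction_rank(graph, damping=0.85, iters=50):
        """graph: {instruction_id: [instruction_ids it points to]}; every id
        appearing as a target must also be a key. Dangling nodes simply
        leak rank in this simplified version."""
        nodes = list(graph)
        rank = {n: 1.0 / len(nodes) for n in nodes}
        for _ in range(iters):
            new = {n: (1.0 - damping) / len(nodes) for n in nodes}
            for src, targets in graph.items():
                for t in targets:                  # spread importance along edges
                    new[t] += damping * rank[src] / len(targets)
            rank = new
        return rank

    def opcode_profile(graph, opcode_of):
        """Aggregate instruction ranks per opcode so that profiles from
        different executions become comparable vectors."""
        profile = {}
        for node, r in instruction_rank(graph).items():
            op = opcode_of[node]
            profile[op] = profile.get(op, 0.0) + r
        return profile

    def cosine(u, v):
        dot = sum(u.get(k, 0.0) * v.get(k, 0.0) for k in set(u) | set(v))
        norm = lambda w: math.sqrt(sum(x * x for x in w.values()))
        return dot / (norm(u) * norm(v))

    g1, ops1 = {0: [1], 1: [2], 2: []}, {0: "iload", 1: "iload", 2: "iadd"}
    g2, ops2 = {"a": ["b"], "b": ["c"], "c": []}, {"a": "iload", "b": "iload", "c": "iadd"}
    print(cosine(opcode_profile(g1, ops1), opcode_profile(g2, ops2)))  # -> 1.0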
Because HitoshiIO and DyCLINK both rely on dynamic analysis to expose program behavior, they are better at locating and searching for behaviorally similar programs than traditional static analysis tools. However, they suffer from common problems of dynamic analysis, such as input generation and run-time overhead, which can make our approaches challenging to scale. Thus, we create the system Macneto, which integrates static analysis with topic modeling and deep learning to approximate program behaviors from binaries without actually executing the programs. In deobfuscation experiments with two commercial obfuscators that alter lexical information and syntax in programs, Macneto achieves over 90% precision, where the ground truth is that the behavior of a program before and after obfuscation should be the same.
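As a rough illustration of approximating behavior statically, the sketch below treats each method’s opcode sequence as a document and compares topic distributions with scikit-learn’s LDA; the pipeline, the toy opcode documents, and the matching rule are assumptions for illustration, not Macneto’s actual architecture.

    # Sketch: approximate a method's behavior by a topic distribution over its
    # static opcode sequence; methods with the closest distributions are
    # predicted to behave alike (e.g., before/after obfuscation).
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = [
        "iload iload iadd ireturn",        # method A (original)
        "iload iload iadd nop ireturn",    # method A (obfuscated)
        "aload invokevirtual areturn",     # method B
    ]
    counts = CountVectorizer().fit_transform(docs)
    topics = LatentDirichletAllocation(n_components=2,
                                       random_state=0).fit_transform(counts)

    dist = np.linalg.norm(topics[0] - topics[1])
    print(f"distance(original, obfuscated) = {dist:.3f}")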
In this thesis, we offer a more extensive view of similar programs than the traditional definitions. While the traditional definitions of similar programs mostly use static features, such as syntax and lexical information, we propose to leverage the power of dynamic analysis and machine learning models to trace and collect behavioral features of programs. These behavioral features can then be applied to detect behaviorally similar programs. We believe the techniques invented in this thesis to detect behaviorally similar programs can improve the development of software engineering and security applications, such as code search and deobfuscation.
First-Order Model Checking on Monadically Stable Graph Classes
A graph class $\mathscr{C}$ is called monadically stable if one cannot
interpret, in first-order logic, arbitrarily long linear orders in colored
graphs from $\mathscr{C}$. We prove that the model checking problem for
first-order logic is fixed-parameter tractable on every monadically stable
graph class. This extends the results of [Grohe, Kreutzer, and Siebertz; J. ACM
'17] for nowhere dense classes and of [Dreier, Mählmann, and Siebertz; STOC
'23] for structurally nowhere dense classes to all monadically stable classes.
As a complementary hardness result, we prove that for every hereditary graph
class $\mathscr{C}$ that is edge-stable (excludes some half-graph as a
semi-induced subgraph) but not monadically stable, first-order model checking
is $\mathrm{AW}[*]$-hard on $\mathscr{C}$, and $\mathrm{W}[1]$-hard when
restricted to existential sentences. This confirms, in the special case of
edge-stable classes, an ongoing conjecture that the notion of monadic NIP
delimits the tractability of first-order model checking on hereditary classes
of graphs.
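For convenience, the half-graph mentioned above has the following standard definition (supplied here; it is not spelled out in the abstract):

    \[
      H_n = \Bigl( \{a_1,\dots,a_n\} \cup \{b_1,\dots,b_n\},\;
                   \{\, a_i b_j \;:\; 1 \le i \le j \le n \,\} \Bigr),
    \]

so a hereditary class is edge-stable when there is some $n$ for which $H_n$ never appears as a semi-induced subgraph.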
For our tractability result, we first prove that monadically stable graph
classes have almost linear neighborhood complexity. Using this, we construct
sparse neighborhood covers for monadically stable classes, which provides the
missing ingredient for the algorithm of [Dreier, Mählmann, and Siebertz; STOC
'23]. The key component of this construction is the usage of orders with low
crossing number [Welzl; SoCG '88], a tool from the area of range queries.
For our hardness result, we prove a new characterization of monadically
stable graph classes in terms of forbidden induced subgraphs. We then use this
characterization to show that in hereditary classes that are edge-stable but
not monadically stable, one can effectively interpret the class of all graphs
using only existential formulas.
Comment: 55 pages, 13 figures
Effizientes Maschinelles Lernen für die Angriffserkennung (Efficient Machine Learning for Attack Detection)
Detecting and fending off attacks on computer systems is an enduring
problem in computer security. In light of a plethora of different
threats and the growing automation used by attackers, we are in urgent
need of more advanced methods for attack detection.
In this thesis, we address the necessity of advanced attack detection
and develop methods to detect attacks using machine learning to
establish a higher degree of automation for reactive security. Machine
learning is data-driven and not free of bias; for its effective application
to attack detection, periodic retraining is therefore crucial.
However, the training complexity of
many learning-based approaches is substantial. We show that with the
right data representation, efficient algorithms for mining substring
statistics, and implementations based on probabilistic data structures,
training the underlying model can be achieved in linear time.
In two different scenarios, we demonstrate the effectiveness of
so-called language models that generically portray the content and
structure of attacks: on the one hand, we learn the malicious behavior
of Flash-based malware using classification, and on the other hand, we
detect intrusions in industrial control networks by learning normality,
using anomaly detection. With a data throughput of up to 580 Mbit/s
during training, we not only meet our runtime expectations but also
outperform related approaches in detection performance by up to an
order of magnitude. The same techniques that
facilitate learning in the previous scenarios can also be used for
revealing malicious content embedded in passive file formats such as
Microsoft Office documents. As a further showcase, we additionally
develop a method based on the efficient mining of substring statistics
that is able to break obfuscations irrespective of the key length used,
at up to 25 Mbit/s, and thus succeeds where related approaches fail.
These methods significantly improve detection performance and enable
operation in linear time. In doing so, we counteract the trend of
compensating for increasing runtime requirements with additional resources. While the
results are promising and the approaches provide urgently needed
automation, they cannot and are not intended to replace human experts or
traditional approaches, but are designed to assist and complement them.
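As a sketch of the efficiency argument, the snippet below counts all byte n-grams of a training stream in one linear pass, backed by a count-min sketch as the probabilistic data structure; the parameters and hash construction are illustrative assumptions rather than the thesis’s exact design.

    # Sketch: linear-time training of a byte n-gram model using a count-min
    # sketch. Width/depth and the hashing scheme are illustrative choices.
    import hashlib

    class CountMinSketch:
        def __init__(self, width=2**16, depth=4):
            self.width, self.depth = width, depth
            self.tables = [[0] * width for _ in range(depth)]

        def _indexes(self, item):
            for i in range(self.depth):            # one salted hash per row
                h = hashlib.blake2b(item, salt=bytes([i]) * 8).digest()
                yield i, int.from_bytes(h[:8], "big") % self.width

        def add(self, item):
            for i, j in self._indexes(item):
                self.tables[i][j] += 1

        def count(self, item):
            # Minimum over rows: may overestimate, never underestimates.
            return min(self.tables[i][j] for i, j in self._indexes(item))

    def train(stream: bytes, n=4, sketch=None):
        """One linear pass: count every byte n-gram of the training stream."""
        sketch = sketch or CountMinSketch()
        for k in range(len(stream) - n + 1):
            sketch.add(stream[k:k + n])
        return sketch

    model = train(b"GET /index.html HTTP/1.1\r\nHost: example.org\r\n")
    print(model.count(b"HTTP"))  # -> 1 (modulo rare sketch collisions)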
Detecting derivative malware samples using deobfuscation-assisted similarity analysis
The overwhelming popularity of PHP as a hosting platform has made it the language of choice for developers of Remote Access Trojans (RATs or web shells) and other malicious software. These shells are typically used to compromise and monetise web platforms by providing the attacker with basic remote access to the system, including file transfer, command execution, network reconnaissance, and database connectivity. Once infected, compromised systems can be used to defraud users by hosting phishing sites, performing Distributed Denial of Service attacks, or serving as anonymous platforms for sending spam or other malfeasance. The vast majority of these threats are largely derivative, incorporating core capabilities found in more established RATs such as c99 and r57. Authors of malicious software routinely produce new shell variants by modifying the behaviours of these ubiquitous RATs, either to add desired functionality or to avoid detection by signature-based detection systems. Once these modified shells are eventually identified (or additional functionality is required), the process of shell adaptation begins again. The end result of this iterative process is a web of separate but related shell variants, many of which are at least partially derived from one of the more popular and influential RATs.
In response to the problem outlined above, the author set out to design and implement a system capable of circumventing common obfuscation techniques and identifying derivative malware samples in a given collection. To begin with, a decoder component was developed to syntactically deobfuscate and normalise PHP code by detecting and reversing idiomatic obfuscation constructs, and to apply uniform formatting conventions to all system inputs. A unified malware analysis framework, called Viper, was then extended to create a modular similarity analysis system comprising individual feature extraction modules, modules responsible for batch processing, a matrix module for comparing sample features, and two visualisation modules capable of generating visual representations of shell similarity.
The principal conclusion of the research was that the deobfuscation performed by the decoder component prior to analysis dramatically improved the observed levels of similarity between test samples. This in turn allowed the modular similarity analysis system to identify derivative clusters (or families) within a large collection of shells more accurately. Techniques for isolating and re-rendering these clusters were also developed and demonstrated to be effective at increasing the amount of detail available for evaluating the relative magnitudes of the relationships within each cluster.
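Viper, the framework extended here, is Python-based; the following sketch conveys the flavor of syntactic deobfuscation for a single idiomatic construct, eval(base64_decode("...")), followed by uniform whitespace normalisation. The actual decoder handles many more constructs; this regex-based pass is only illustrative.

    # Sketch: reverse one idiomatic PHP obfuscation construct, then apply
    # uniform formatting so equivalent shells compare equal.
    import base64, re

    EVAL_B64 = re.compile(
        r'eval\s*\(\s*base64_decode\s*\(\s*[\'"]([A-Za-z0-9+/=]+)[\'"]\s*\)\s*\)\s*;')

    def deobfuscate(php: str) -> str:
        """Repeatedly replace eval(base64_decode("...")); with its payload."""
        while True:
            m = EVAL_B64.search(php)
            if not m:
                return php
            payload = base64.b64decode(m.group(1)).decode("utf-8", "replace")
            php = php[:m.start()] + payload + php[m.end():]

    def normalise(php: str) -> str:
        """Collapse whitespace as a stand-in for uniform formatting."""
        return re.sub(r'\s+', ' ', php).strip()

    hidden = base64.b64encode(b'system($_GET["cmd"]);').decode()
    shell = f'<?php eval(base64_decode("{hidden}"));'
    print(normalise(deobfuscate(shell)))   # -> <?php system($_GET["cmd"]);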
Efficient, Scalable, and Accurate Program Fingerprinting in Binary Code
Why was this binary written? Which compiler was used? Which free software
packages did the developer use? Which sections of the code were borrowed? Who wrote
the binary? These questions are of paramount importance to security analysts and reverse
engineers, and binary fingerprinting approaches may provide valuable insights that can
help answer them. This thesis advances the state of the art by addressing some of the
most fundamental problems in program fingerprinting for binary code, notably, reusable
binary code discovery, fingerprinting free open source software packages, and authorship
attribution.
First, to tackle the problem of discovering reusable binary code, we employ a technique
for identifying reused functions by matching traces of a novel representation of binary
code known as the semantic integrated graph. This graph enhances the control flow
graph, the register flow graph, and the function call graph, key concepts from classical program analysis, and merges them with other structural information to create a joint data
structure. Second, we approach the problem of fingerprinting free open source software
(FOSS) packages by proposing a novel resilient and efficient system that incorporates
three components. The first extracts the syntactical features of functions by considering
opcode frequencies and performing a hidden Markov model statistical test. The second
applies a neighborhood hash graph kernel to random walks derived from control flow
graphs, with the goal of extracting the semantics of the functions. The third applies the
z-score to normalized instructions to extract the behavior of the instructions in a function.
Then, the components are integrated using a Bayesian network model which synthesizes
the results to determine the FOSS function, making it possible to detect user-related functions.
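A minimal sketch of the third component’s idea, under our own feature assumptions: z-score the normalized instruction frequencies of each function against corpus-wide statistics, so unusually frequent or rare opcodes stand out as behavioral signals. The exact features and normalization in the system may differ.

    # Sketch: per-function opcode frequencies, standardized column-wise
    # against the corpus (z-score).
    import numpy as np

    def opcode_freqs(funcs, vocab):
        """funcs: list of opcode-mnemonic lists; returns per-function
        relative frequencies over a fixed opcode vocabulary."""
        mat = np.zeros((len(funcs), len(vocab)))
        index = {op: i for i, op in enumerate(vocab)}
        for r, func in enumerate(funcs):
            for op in func:
                mat[r, index[op]] += 1
            mat[r] /= max(len(func), 1)
        return mat

    def zscore_features(mat):
        """How unusual is each function's usage of each instruction
        relative to the corpus?"""
        mu, sigma = mat.mean(axis=0), mat.std(axis=0)
        return (mat - mu) / np.where(sigma == 0, 1, sigma)

    funcs = [["mov", "push", "call"], ["mov", "mov", "xor"], ["push", "ret"]]
    vocab = ["mov", "push", "call", "xor", "ret"]
    print(zscore_features(opcode_freqs(funcs, vocab)))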
Third, with these elements now in place, we present a framework capable of decoupling
binary program functionality from the coding habits of authors. To capture coding habits,
the framework leverages a set of features that are based on collections of functionality-independent
choices made by authors during coding. Finally, it is well known that techniques
such as refactoring and code transformations can significantly alter the structure
of code, even for simple programs. Applying such techniques or changing the compiler
and compilation settings can significantly affect the accuracy of available binary analysis
tools, which severely limits their practicability, especially when applied to malware. To
address these issues, we design a technique that extracts the semantics of binary code in terms of both data and control flow. The proposed technique allows more robust binary
analysis because the extracted semantics of the binary code is generally immune
to code transformation, refactoring, and changes of compiler or compilation settings.
Specifically, it employs data-flow analysis to extract the semantic flow of the registers as
well as the semantic components of the control flow graph, which are then synthesized
into a novel representation called the semantic flow graph (SFG).
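To illustrate the shape of such a joint representation, the sketch below merges register def-use edges (data flow) with block-level control-flow edges into one graph; the details are assumptions in the spirit of the SFG, not the thesis’s construction.

    # Sketch: a joint data-flow/control-flow graph built with networkx.
    import networkx as nx

    def build_sfg(blocks, cfg_edges):
        """blocks: {block_id: [(instr_id, defs, uses), ...]};
        cfg_edges: [(block_id, block_id), ...]. Def-use tracking here is a
        single pass in block order, a simplification of real reaching-
        definitions analysis."""
        g = nx.MultiDiGraph()
        last_def = {}                      # register -> instr that last defined it
        for bid, instrs in blocks.items():
            for iid, defs, uses in instrs:
                g.add_node(iid, block=bid)
                for reg in uses:           # data-flow edge from defining instr
                    if reg in last_def:
                        g.add_edge(last_def[reg], iid, kind="data", reg=reg)
                for reg in defs:
                    last_def[reg] = iid
        for a, b in cfg_edges:             # control-flow edges between blocks
            g.add_edge(a, b, kind="control")
        return g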
We evaluate the framework on large-scale datasets extracted from selected open-source
C++ projects on GitHub, Google Code Jam events, Planet Source Code contests, and students’
programming projects, and find that it outperforms existing methods in several
respects. First, it is able to detect reused functions. Second, it can identify FOSS
packages in real-world projects and reused binary functions with high precision. Third, it
decouples authorship from functionality so that it can be applied to real malware binaries
to automatically generate evidence of similar coding habits. Fourth, compared to existing
research contributions, it successfully attributes a larger number of authors with a significantly
higher accuracy. Finally, the new framework is more robust than previous methods
in the sense that there is no significant drop in accuracy when the code is subjected to
refactoring techniques, code transformation methods, and different compilers.
Analysis and study on text representation to improve the accuracy of the Normalized Compression Distance
The huge amount of information stored in text form makes methods that deal
with texts particularly valuable. This thesis focuses on dealing with texts using
compression distances. More specifically, the thesis takes a small step towards
understanding both the nature of texts and the nature of compression distances.
Broadly speaking, the way in which this is done is exploring the effects that
several distortion techniques have on one of the most successful distances in
the family of compression distances, the Normalized Compression Distance (NCD).
Comment: PhD thesis; 202 pages
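For reference, the NCD of two strings x and y under a compressor C is standardly defined as NCD(x, y) = (C(xy) − min(C(x), C(y))) / max(C(x), C(y)); the sketch below instantiates C with zlib. The thesis’s distortion experiments are not reproduced here.

    # Sketch: the Normalized Compression Distance with zlib as compressor C.
    import zlib

    def C(data: bytes) -> int:
        """Compressed size in bytes at maximum compression level."""
        return len(zlib.compress(data, 9))

    def ncd(x: bytes, y: bytes) -> float:
        cx, cy, cxy = C(x), C(y), C(x + y)
        return (cxy - min(cx, cy)) / max(cx, cy)

    a = b"the quick brown fox jumps over the lazy dog " * 20
    b = b"the quick brown fox leaps over the lazy cat " * 20
    print(ncd(a, a), ncd(a, b))  # near 0 for identical texts, larger otherwise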