Efficient, Scalable, and Accurate Program Fingerprinting in Binary Code
Why was this binary written? Which compiler was used? Which free software
packages did the developer use? Which sections of the code were borrowed? Who wrote
the binary? These questions are of paramount importance to security analysts and reverse
engineers, and binary fingerprinting approaches may provide valuable insights that can
help answer them. This thesis advances the state of the art by addressing some of the
most fundamental problems in program fingerprinting for binary code, notably, reusable
binary code discovery, fingerprinting free open source software packages, and authorship
attribution.
First, to tackle the problem of discovering reusable binary code, we employ a technique
for identifying reused functions by matching traces of a novel representation of binary
code known as the semantic integrated graph. This graph builds on the control flow graph, the register flow graph, and the function call graph, key structures from classical program analysis, and merges them with other structural information into a joint data
structure. Second, we approach the problem of fingerprinting free open source software
(FOSS) packages by proposing a novel resilient and efficient system that incorporates
three components. The first extracts the syntactical features of functions by considering
opcode frequencies and performing a hidden Markov model statistical test. The second
applies a neighborhood hash graph kernel to random walks derived from control flow
graphs, with the goal of extracting the semantics of the functions. The third applies the
z-score to normalized instructions to extract the behavior of the instructions in a function.
The components are then integrated using a Bayesian network model that synthesizes the results to identify FOSS functions, making it possible to distinguish user-related functions.
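To make the first and third components concrete, here is a minimal Python sketch of what opcode-frequency and z-score features for a single function could look like; all names here are hypothetical, and the thesis's actual feature extraction is certainly more elaborate.

```python
# Hypothetical sketch: syntactic features from opcode frequencies (first
# component) and z-scores of opcode counts against corpus statistics
# (third component). Not the thesis's implementation.
from collections import Counter

def opcode_frequencies(opcodes):
    """Relative frequency of each opcode within one function."""
    counts = Counter(opcodes)
    total = sum(counts.values())
    return {op: c / total for op, c in counts.items()}

def zscore_profile(opcodes, corpus_mean, corpus_std):
    """Z-score of this function's opcode counts relative to a corpus."""
    counts = Counter(opcodes)
    return {op: (counts.get(op, 0) - corpus_mean[op]) / corpus_std[op]
            for op in corpus_mean if corpus_std[op] > 0}

func = ["mov", "mov", "add", "cmp", "jne", "mov", "ret"]
print(opcode_frequencies(func))  # e.g. {'mov': 0.428..., 'add': 0.142..., ...}
```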
Third, with these elements now in place, we present a framework capable of decoupling
binary program functionality from the coding habits of authors. To capture coding habits,
the framework leverages a set of features based on collections of functionality-independent choices made by authors during coding. Finally, it is well known that techniques
such as refactoring and code transformations can significantly alter the structure
of code, even for simple programs. Applying such techniques or changing the compiler
and compilation settings can significantly affect the accuracy of available binary analysis
tools, which severely limits their practicability, especially when applied to malware. To
address these issues, we design a technique that extracts the semantics of binary code in terms of both data and control flow. The proposed technique allows more robust binary
analysis because the extracted semantics of the binary code are largely immune to code transformation, refactoring, and changes of compiler or compilation settings.
Specifically, it employs data-flow analysis to extract the semantic flow of the registers as
well as the semantic components of the control flow graph, which are then synthesized
into a novel representation called the semantic flow graph (SFG).
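As a rough illustration of the idea of fusing control flow and register data flow into one structure, the following sketch (using networkx, with made-up basic blocks and registers) labels the edges of a single multigraph by flow kind; it mirrors only the concept of the SFG, not the dissertation's construction.

```python
# Illustrative sketch: one labeled multigraph carrying both control-flow
# edges between basic blocks and register data-flow edges. Block and
# register names are invented for the example.
import networkx as nx

sfg = nx.MultiDiGraph()
# Control-flow edges between basic blocks
sfg.add_edge("B0", "B1", kind="control")
sfg.add_edge("B0", "B2", kind="control")
# Data-flow edge: a value defined in B0 (in eax) reaches a use in B2
sfg.add_edge("B0", "B2", kind="data", register="eax")

data_edges = [(u, v, d) for u, v, d in sfg.edges(data=True)
              if d["kind"] == "data"]
print(data_edges)  # [('B0', 'B2', {'kind': 'data', 'register': 'eax'})]
```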
We evaluate the framework on large-scale datasets extracted from selected open source C++ projects on GitHub, Google Code Jam events, Planet Source Code contests, and students’ programming projects, and find that it outperforms existing methods in several respects. First, it is able to detect reused functions. Second, it can identify FOSS
packages in real-world projects and reused binary functions with high precision. Third, it
decouples authorship from functionality so that it can be applied to real malware binaries
to automatically generate evidence of similar coding habits. Fourth, compared to existing
research contributions, it successfully attributes a larger number of authors with a significantly
higher accuracy. Finally, the new framework is more robust than previous methods
in the sense that there is no significant drop in accuracy when the code is subjected to
refactoring techniques, code transformation methods, and different compilers.
Uncovering Features in Behaviorally Similar Programs
The detection of similar code can support many software engineering tasks such as program understanding and program classification. Many excellent approaches have been proposed to detect programs having similar syntactic features. However, these approaches are unable to identify programs that are dynamically or statistically close to each other, which we call behaviorally similar programs. We believe the detection of behaviorally similar programs can enhance or even automate tasks relevant to program classification. In this thesis, we discuss our approaches to identifying programs that have similar behavioral features from multiple perspectives.
We first discuss how to detect programs having similar functionality. While a program’s functionality is in general undecidable, we use the inputs and outputs (I/Os) of programs as a proxy for it. We then use programs’ I/Os as a behavioral feature to detect which programs are functionally similar: two programs are functionally similar if they share similar inputs and outputs. This approach has been studied and developed in the C language to detect functionally equivalent programs having equivalent I/Os. Nevertheless, some problems natural to object-oriented languages, such as input generation and comparison between application-specific data types, hinder the development of this approach. We propose a new technique, in-vivo detection, which uses existing and meaningful inputs to drive applications systematically and then applies a novel similarity model, considering both inputs and outputs of programs, to detect functionally similar programs. We developed the tool HitoshiIO based on our in-vivo detection. In the subjects we study, HitoshiIO correctly detects 68.4% of functionally similar programs, with a false positive rate of only 16.6%.
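As a hedged sketch of the underlying similarity idea (not HitoshiIO's actual model), two methods can be compared by how often their recorded outputs agree on shared inputs:

```python
# Toy sketch of I/O-based functional similarity: records map observed
# inputs to observed outputs; similarity is the fraction of shared
# inputs on which the outputs agree.
def io_similarity(records_a, records_b):
    """records_*: dict mapping input (hashable) -> observed output."""
    shared = set(records_a) & set(records_b)
    if not shared:
        return 0.0
    agree = sum(1 for x in shared if records_a[x] == records_b[x])
    return agree / len(shared)

f = {1: 1, 2: 4, 3: 9}           # I/O pairs observed from one method
g = {1: 1, 2: 4, 3: 9, 4: 16}    # and from another
print(io_similarity(f, g))       # 1.0 -> similar on all shared inputs
```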
In addition to programs’ functional I/Os, we attempt to discover programs having similar execution behavior. Again, the execution behavior of a program can be undecidable, so we use the instructions executed at run-time as a behavioral feature of a program. We create DyCLINK, which observes program executions and encodes them in dynamic instruction graphs. A vertex in a dynamic instruction graph is an instruction, and an edge is a type of dependency between two instructions. The problem of detecting which programs have similar executions can then be reduced to one of inexact graph isomorphism. We propose a link-analysis-based algorithm, LinkSub, which vectorizes each dynamic instruction graph by the importance of every instruction, to solve this graph isomorphism problem efficiently. In a K Nearest Neighbor (KNN) based program classification experiment, DyCLINK achieves over 90% precision.
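The vectorization step can be pictured as follows; this sketch assumes a PageRank-style importance score as a stand-in for LinkSub's link analysis and uses hypothetical instruction labels:

```python
# Sketch: turn a dynamic instruction graph into a fixed-length vector by
# aggregating a link-analysis importance score per instruction type.
# The actual LinkSub algorithm may differ.
import networkx as nx

def graph_to_vector(g, instruction_types):
    scores = nx.pagerank(g)                  # importance of each node
    vec = {t: 0.0 for t in instruction_types}
    for node, score in scores.items():
        vec[g.nodes[node]["insn"]] += score  # aggregate by instruction type
    return [vec[t] for t in instruction_types]

g = nx.DiGraph()
g.add_node(0, insn="iload")
g.add_node(1, insn="iadd")
g.add_node(2, insn="istore")
g.add_edges_from([(0, 1), (1, 2)])           # dependencies between instructions
print(graph_to_vector(g, ["iload", "iadd", "istore"]))
```

Vectors produced this way can then be compared with any standard distance (e.g., cosine) inside a KNN classifier.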
Because HitoshiIO and DyCLINK both rely on dynamic analysis to expose program behavior, they are better able to locate and search for behaviorally similar programs than traditional static analysis tools. However, they suffer from common problems of dynamic analysis, such as input generation and run-time overhead, which may make our approaches challenging to scale. Thus, we create the system Macneto, which integrates static analysis with topic modeling and deep learning to approximate program behaviors from their binaries without actually executing the programs. In our deobfuscation experiments, which consider two commercial obfuscators that alter lexical information and syntax in programs, Macneto achieves over 90% precision, where the ground truth is that the behavior of a program before and after obfuscation should be the same.
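In the spirit of that static approximation (and making no claim about Macneto's actual pipeline), one could treat each method's opcode stream as a document and compare topic distributions before and after obfuscation, for example with scikit-learn:

```python
# Hedged sketch: topic modeling over static opcode "documents" as a
# proxy for behavior. The methods and opcodes below are invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

methods = [
    "iload iload iadd istore",        # original method
    "iload iload iadd istore nop",    # obfuscated variant, same behavior
    "new dup invokespecial astore",   # unrelated method
]
X = CountVectorizer().fit_transform(methods)
topics = LatentDirichletAllocation(n_components=2,
                                   random_state=0).fit_transform(X)
print(topics)  # rows close in distribution suggest similar behavior
```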
In this thesis, we offer a more extensive view of similar programs than the traditional definitions. While the traditional definitions of similar programs mostly use static features, such as syntax and lexical information, we propose to leverage the power of dynamic analysis and machine learning models to trace and collect behavioral features of programs. These behavioral features can then be applied to detect behaviorally similar programs. We believe the techniques invented in this thesis to detect behaviorally similar programs can improve the development of software engineering and security applications, such as code search and deobfuscation.
Software similarity and classification
This thesis analyses software programs in the context of their similarity to other software programs. Applications proposed and implemented include detecting malicious software and discovering security vulnerabilities.
Survey on representation techniques for malware detection system
Malicious programs are malignant software designed by hackers or cyber offenders with the harmful intent of disrupting computer operation. Across various studies, we find that striking a balance between designing an accurate architecture that can detect malware and tracking the many advanced techniques malware creators apply to produce variants remains difficult. Hence, the study of malware detection techniques has become more important and more challenging within the security field. This review paper provides a detailed discussion and full reviews of the various types of malware, malware detection techniques and the research on them, malware analysis methods, and different dynamic programming-based tools that can be used to represent malware samples. We also provide a comprehensive bibliography on malware detection, its techniques, and its analysis methods for malware researchers.
A Geometric Approach for Deciphering Protein Structure from Cryo-EM Volumes
Electron cryo-microscopy, or cryo-EM, is an area that has received much attention in the recent past. Compared to the traditional methods of X-ray crystallography and NMR spectroscopy, cryo-EM can be used to image much larger complexes, in many different conformations, and under a wide range of biochemical conditions, because it does not require the complex to be crystallisable. However, cryo-EM reconstructions are limited to intermediate resolutions, with the state of the art being 3.6 Å, where secondary structure elements can be visually identified but individual amino acid residues cannot. This lack of atomic-level resolution creates new computational challenges for protein structure identification. In this dissertation, we present a suite of geometric algorithms to address several aspects of protein modeling using cryo-EM density maps. Specifically, we develop novel methods to capture the shape of density volumes as geometric skeletons. We then use these skeletons to find the secondary structure elements (SSEs) of a given protein, to identify the correspondence between these SSEs and those predicted from the primary sequence, and to register high-resolution protein structures onto the density volume. In addition, we designed and developed Gorgon, an interactive molecular modeling system that integrates the above methods with other interactive routines to generate reliable and accurate protein backbone models.
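As a hedged illustration of the SSE-correspondence step (the dissertation's algorithm is not reproduced here), sequence-predicted and density-detected SSEs could be matched with a simple dynamic-programming alignment that rewards agreement in type and length:

```python
# Toy sketch: global alignment score between two SSE lists, each a
# sequence of (type, length) tuples. Scoring constants are arbitrary.
def align_sses(predicted, detected, gap=-1.0):
    def score(p, d):
        if p[0] != d[0]:                       # helix vs. strand mismatch
            return -2.0
        return 2.0 - abs(p[1] - d[1]) / max(p[1], d[1])

    n, m = len(predicted), len(detected)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + gap
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score(predicted[i - 1], detected[j - 1]),
                dp[i - 1][j] + gap,   # predicted SSE missing in the map
                dp[i][j - 1] + gap)   # spurious SSE detected in the map
    return dp[n][m]

print(align_sses([("helix", 12), ("strand", 5)],
                 [("helix", 11), ("strand", 6)]))  # ~3.75: good match
```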
Report on shape analysis and matching and on semantic matching
In GRAVITATE, two disparate specialities come together in one working platform for the archaeologist: the fields of shape analysis and of metadata search. These fields are relatively disjoint at the moment, and the research and development challenge of GRAVITATE is precisely to merge them for our chosen tasks. As shown in chapter 7, the small amount of literature that already attempts to join 3D geometry and semantics is not related to the cultural heritage domain. Therefore, after the project is done, there should be a clear ‘before-GRAVITATE’ and ‘after-GRAVITATE’ split in how these two aspects of a cultural heritage artefact are treated.

This state of the art report (SOTA) is ‘before-GRAVITATE’. Shape analysis and metadata description are described separately, as currently in the literature, and we end the report with common recommendations in chapter 8 on possible or plausible cross-connections that suggest themselves. These considerations will be refined for the Roadmap for Research deliverable.

Within the project, a jargon is developing in which ‘geometry’ stands for the physical properties of an artefact (not only its shape, but also its colour and material) and ‘metadata’ is used as a general shorthand for the semantic description of the provenance, location, ownership, classification, use, etc. of the artefact. As we proceed in the project, we will find a need to refine those broad divisions and to find intermediate classes (such as a semantic description of certain colour patterns), but for now the terminology is convenient, not least because it highlights the interesting area where both aspects meet.

On the ‘geometry’ side, the GRAVITATE partners are UVA, Technion, and CNR/IMATI; on the metadata side, IT Innovation, the British Museum and the Cyprus Institute, the latter two of course also playing the role of internal users and representatives of the Cultural Heritage (CH) data and target user’s group. CNR/IMATI’s experience in shape analysis and similarity will be an important bridge between the two worlds of geometry and metadata. The authorship and styles of this SOTA reflect these specialisms: the first part (chapters 3 and 4) is purely by the geometry partners (mostly IMATI and UVA), the second part (chapters 5 and 6) by the metadata partners, especially IT Innovation, while the joint overview on 3D geometry and semantics is mainly by IT Innovation and IMATI. The common section on Perspectives was written with the contribution of all.
Exploring novel designs of NLP solvers: Architecture and Implementation of WORHP
Mathematical optimization in general, and nonlinear programming (NLP) in particular, is applied in many fields, such as the automotive sector, the aerospace industry, and the space agencies. With some established NLP solvers having been available for decades, and with the mathematical community being rather conservative in this respect, many of their programming standards are severely outdated. It is safe to assume that such usability shortcomings impede the wider use of NLP methods; a representative example is the use of static workspaces by legacy FORTRAN codes. This dissertation gives an account of the construction of the European NLP solver WORHP by using and combining software standards and techniques that have not previously been applied to mathematical software to this extent. Examples include automatic code generation, a consistent reverse communication architecture, and the elimination of static workspaces. The result is a novel, industrial-grade NLP solver that overcomes many technical weaknesses of established NLP solvers and other mathematical software.
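For readers unfamiliar with the reverse communication architecture mentioned above, the following toy Python sketch shows the pattern in miniature: the solver returns control to the caller whenever it needs a quantity, rather than invoking callbacks. WORHP's actual interface differs; all names here are illustrative.

```python
# Minimal reverse-communication loop: the "solver" never calls user code.
# Instead it returns an action tag, and the caller re-enters with the
# requested value filled into the shared state.
EVAL_F, EVAL_GRAD, DONE = "eval_f", "eval_grad", "done"

def solver_step(state):
    if "f" not in state:
        return EVAL_F                  # ask caller for the objective value
    if "grad" not in state:
        return EVAL_GRAD               # ask caller for the gradient
    state["x"] -= 0.1 * state.pop("grad")  # toy gradient-descent step
    state.pop("f")
    return DONE if abs(state["x"]) < 1e-3 else EVAL_F

state = {"x": 1.0}
while (action := solver_step(state)) != DONE:
    if action == EVAL_F:
        state["f"] = state["x"] ** 2       # user evaluates f(x) = x^2
    elif action == EVAL_GRAD:
        state["grad"] = 2 * state["x"]     # user evaluates f'(x) = 2x
print(round(state["x"], 4))                # converges near 0
```

The appeal of this design is that the caller keeps full control over memory, threading, and how evaluations are computed, which is exactly what static-workspace callback interfaces make difficult.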
Elastic phone : towards detecting and mitigating computation and energy inefficiencies in mobile apps
Mobile devices have become ubiquitous, and their ever-evolving capabilities are bringing them closer to personal computers. Nonetheless, due to their mobility and small form factor, they still present many hardware and software challenges. Their limited battery lifetime has led to the design of mobile networks that are inherently different from previous networks (e.g., WiFi) and to more restrictive task scheduling. Additionally, mobile device ecosystems are more susceptible to hardware heterogeneity and to the conflicting interests of distributors, internet service providers, manufacturers, developers, etc. The high number of stakeholders ultimately responsible for the performance of a device results in inconsistent behavior and makes it very challenging to build a solution that improves resource usage in most cases. The focus of this thesis is the study and development of techniques to detect and mitigate computation and energy inefficiencies in mobile apps. It follows a bottom-up approach, starting from the challenges behind detecting inefficient execution scheduling by looking only at apps’ implementations. It shows that scheduling APIs are largely misused and have a great impact on device wake-up frequency and on the efficiency of existing energy-saving techniques (e.g., batching scheduled executions). It then addresses several challenges of app testing in the dynamic analysis field, more specifically, how to scale mobile app testing with realistic user input and how to analyze closed-source apps’ code at runtime, showing that introducing humans into the app-testing loop improves both code coverage and the volume of network traffic generated. Finally, using the combined knowledge of static and dynamic analysis, it focuses on the challenges of identifying the resource-hungry sections of apps and improving their execution via offloading. There is a special focus on performing non-intrusive offloading transparent to existing apps and on in-network computation offloading and distribution. It shows that, even without a custom OS or app modifications, in-network offloading is still possible, greatly improving execution times and energy consumption and reducing both end-user-experienced latency and request drop rates. It concludes with a measurement study of real apps, showing that a good portion of the most popular apps’ code can indeed be offloaded, and proposes future directions for the app testing and computation offloading fields.
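To illustrate the "batching scheduled executions" idea in isolation (a hypothetical sketch, not the thesis's mechanism), nearby wake-up requests can be coalesced so the device resumes once per window instead of once per request:

```python
# Toy sketch: coalesce scheduled wake-up times into batches, firing each
# batch once at the latest time inside its window. Times in seconds.
def batch_wakeups(requested_times, window):
    batches, current = [], [requested_times[0]]
    for t in requested_times[1:]:
        if t - current[0] <= window:
            current.append(t)        # still fits in the open window
        else:
            batches.append(max(current))
            current = [t]            # start a new window
    batches.append(max(current))
    return batches

print(batch_wakeups([0, 5, 12, 61, 64, 200], window=15))  # [12, 64, 200]
```

Six requested wake-ups collapse into three actual device resumes, which is the kind of saving the misused scheduling APIs discussed above forgo.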
Pattern-Recognition-Based Defense Against Targeted Attacks
The speed at which everything and everyone is being connected considerably outstrips the rate at which effective security mechanisms are introduced to protect them. This has created an opportunity for resourceful threat actors who have specialized in conducting low-volume, persistent attacks through sophisticated techniques tailored to specific, valuable targets. Consequently, traditional approaches are rendered ineffective against targeted attacks, creating an acute need for innovative defense mechanisms.
This thesis aims to support the security practitioner in bridging this gap by introducing a holistic strategy against targeted attacks that addresses key challenges encountered during the phases of detection, analysis, and response. The structure of this thesis is therefore aligned with these three phases, with each of its central chapters taking on a particular problem and proposing a solution built on a strong foundation of pattern recognition and machine learning.
In particular, we propose a detection approach that, in the absence of additional authentication mechanisms, makes it possible to identify spear-phishing emails without relying on their content. Next, we introduce an analysis approach for malware triage based on the structural characterization of malicious code. Finally, we introduce MANTIS, an open-source platform for authoring, sharing, and collecting threat intelligence, whose data model is based on an innovative unified representation of threat intelligence standards built on attributed graphs.
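As a purely illustrative sketch of content-agnostic sender profiling (the feature set here is hypothetical, not the thesis's), an email can be reduced to structural and transport traits and compared against the sender's history:

```python
# Toy sketch: represent an email only by content-free traits and count
# deviations from the sender's historical profile. Trait names invented.
import email

def email_traits(msg):
    """msg: parsed email.message.Message; returns a content-free profile."""
    return {
        "received_hops": len(msg.get_all("Received") or []),
        "has_list_id": msg["List-Id"] is not None,
        "mua": (msg["X-Mailer"] or "unknown").split("/")[0],
        "multipart": msg.is_multipart(),
    }

def deviates(profile, history):
    """Number of traits differing from the sender's usual profile."""
    return sum(profile[k] != history[k] for k in profile)

m = email.message_from_string("From: a@b.com\r\nX-Mailer: Foo/1.0\r\n\r\nhi")
print(email_traits(m))
```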
As a whole, these ideas open new avenues for research on defense mechanisms and represent an attempt to counteract the imbalance between resourceful actors and society at large. We evaluate our approaches in a series of experiments that demonstrate their potential usefulness in real-world scenarios.