Uncovering Features in Behaviorally Similar Programs
The detection of similar code can support many software engineering tasks such as program understanding and program classification. Many excellent approaches have been proposed to detect programs having similar syntactic features. However, these approaches are unable to identify programs dynamically or statistically close to each other, which we call behaviorally similar programs. We believe the detection of behaviorally similar programs can enhance or even automate the tasks relevant to program classification. In this thesis, we will discuss our current approaches to identify programs having similar behavioral features in multiple perspectives.
We first discuss how to detect programs having similar functionality. While the definition of a program’s functionality is undecidable, we use inputs and outputs (I/Os) of programs as the proxy of their functionality. We then use I/Os of programs as a behavioral feature to detect which programs are functionally similar: two programs are functionally similar if they share similar inputs and outputs. This approach has been studied and developed in the C language to detect functionally equivalent programs having equivalent I/Os. Nevertheless, some natural problems in object-oriented languages, such as input generation and comparisons between application-specific data types, hinder the development of this approach. We propose a new technique, in-vivo detection, which uses existing and meaningful inputs to drive applications systematically and then applies a novel similarity model considering both inputs and outputs of programs, to detect functionally similar programs. We develop the tool, HitoshiIO, based on our in-vivo detection. In the subjects that we study, HitoshiIO correctly detects 68.4% of functionally similar programs, with a false positive rate of only 16.6%.
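The idea of treating inputs and outputs as a proxy for functionality can be illustrated with a small sketch. It is not HitoshiIO's similarity model, which the abstract does not specify. As a stand-in, it assumes that each execution is recorded as a hypothetical `(inputs, outputs)` pair of sets and that similarity is the average Jaccard similarity of the two sets. The names `jaccard`, `io_similarity`, and the example records are illustrative only.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty sets are considered identical
    return len(a & b) / len(a | b)


def io_similarity(record_a, record_b):
    """Average the Jaccard similarity of the input sets and the output sets.

    Each record is a hypothetical (inputs, outputs) pair of hashable values
    observed during one execution of a method (an assumption of this sketch).
    """
    inputs_a, outputs_a = record_a
    inputs_b, outputs_b = record_b
    return 0.5 * (jaccard(inputs_a, inputs_b) + jaccard(outputs_a, outputs_b))


# Hypothetical I/O records: two summation methods and one maximum method,
# all driven by the same input values.
sum_loop = ({1, 2, 3}, {6})
sum_builtin = ({1, 2, 3}, {6})
max_value = ({1, 2, 3}, {3})

print(io_similarity(sum_loop, sum_builtin))  # same inputs, same outputs
print(io_similarity(sum_loop, max_value))    # same inputs, different outputs
```

Under these assumptions, a pair of methods would be reported as functionally similar when the score exceeds a chosen threshold. Comparing application-specific data types, which the abstract names as a difficulty, is sidestepped here by requiring hashable values.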
In addition to functional I/Os of programs, we attempt to discover programs having similar execution behavior. Again, the execution behavior of a program can be undecidable, so we use instructions executed at run-time as a behavioral feature of a program. We create DyCLINK, which observes program executions and encodes them in dynamic instruction graphs. A vertex in a dynamic instruction graph is an instruction and an edge is a type of dependency between two instructions. The problem of detecting which programs have similar executions can then be reduced to the problem of solving inexact graph isomorphism. We propose a link-analysis-based algorithm, LinkSub, which vectorizes each dynamic instruction graph by the importance of every instruction, to solve this graph isomorphism problem efficiently. In a K Nearest Neighbor (KNN) based program classification experiment, DyCLINK achieves 90+% precision.
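The step of vectorizing an instruction graph by the importance of each instruction can be sketched as follows. This is not the LinkSub algorithm itself, whose details the abstract does not give. As a stand-in, the sketch assumes plain PageRank as the link analysis and a sequence comparison from Python's `difflib` as the similarity measure. The graph encoding (a dict of successor lists), the opcode labels, and all function names are hypothetical.

```python
from difflib import SequenceMatcher


def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {node: [successor, ...]}.

    Every successor must itself be a key of the dict.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            successors = graph[node]
            if successors:
                share = damping * rank[node] / len(successors)
                for succ in successors:
                    new_rank[succ] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for other in nodes:
                    new_rank[other] += damping * rank[node] / n
        rank = new_rank
    return rank


def vectorize(graph, opcodes):
    """Return the opcodes ordered from most to least important instruction."""
    rank = pagerank(graph)
    ordered = sorted(graph, key=lambda node: (-rank[node], node))
    return [opcodes[node] for node in ordered]


def similarity(graph_a, opcodes_a, graph_b, opcodes_b):
    """Compare two instruction graphs through their ranked opcode sequences."""
    vec_a = vectorize(graph_a, opcodes_a)
    vec_b = vectorize(graph_b, opcodes_b)
    return SequenceMatcher(None, vec_a, vec_b).ratio()


# Two hypothetical dependency graphs with the same shape but different node ids.
g1 = {0: [2], 1: [2], 2: [3], 3: []}
ops1 = {0: "iload", 1: "iload", 2: "iadd", 3: "ireturn"}
g2 = {10: [12], 11: [12], 12: [13], 13: []}
ops2 = {10: "iload", 11: "iload", 12: "iadd", 13: "ireturn"}
# A structurally different graph for contrast.
g3 = {0: [1], 1: []}
ops3 = {0: "aload", 1: "areturn"}

print(vectorize(g1, ops1))
print(similarity(g1, ops1, g2, ops2))
print(similarity(g1, ops1, g3, ops3))
```

Reducing each graph to a ranked sequence turns an expensive graph matching problem into a cheap sequence comparison, which is the efficiency argument the abstract makes. The sketch compares whole graphs only and does not attempt subgraph matching.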
Because HitoshiIO and DyCLINK both rely on dynamic analysis to expose program behavior, they are better able to locate and search for behaviorally similar programs than traditional static analysis tools. However, they suffer from some common problems of dynamic analysis, such as input generation and run-time overhead. These problems may make our approaches challenging to scale. Thus, we create the system, Macneto, which integrates static analysis with machine topic modeling and deep learning to approximate program behaviors from their binaries without actually executing the programs. In our deobfuscation experiments considering two commercial obfuscators that alter lexical information and syntax in programs, Macneto achieves 90+% precision, where the ground truth is that the behavior of a program before and after obfuscation should be the same.
In this thesis, we offer a more extensive view of similar programs than the traditional definitions. While the traditional definitions of similar programs mostly use static features, such as syntax and lexical information, we propose to leverage the power of dynamic analysis and machine learning models to trace and collect behavioral features of programs. These behavioral features can then be applied to detect behaviorally similar programs. We believe the techniques we invented in this thesis to detect behaviorally similar programs can improve the development of software engineering and security applications, such as code search and deobfuscation.
SEMEO: A SEMANTIC EQUIVALENCE ANALYSIS FRAMEWORK FOR OBFUSCATED ANDROID APPLICATIONS
Software repackaging is a common approach for creating malware. In this approach, malware authors inject malicious payloads into legitimate applications; then, to render security analysis more difficult, they obfuscate most or all of the code. This forces analysts to spend a large amount of effort filtering out benign obfuscated methods in order to locate potentially malicious methods for further analysis. If an effective mechanism for filtering out benign obfuscated methods were available, the number of methods that must be analyzed could be reduced, allowing analysts to be more productive. In this thesis, we introduce SEMEO, a highly effective and efficient filtering approach that can determine whether an obfuscated and an original version of a method are semantically equivalent. Our approach handles seven common, complex types of obfuscation and can be effective even when all types are compositely applied. In an empirical evaluation, we applied SEMEO to nine Android apps of varying complexity, and the approach provided over 76% recall and 100% precision in identifying semantically equivalent methods. We then performed three additional studies that showed that: (1) SEMEO is much more effective at identifying semantically equivalent methods than FSquaDRA, an existing technique; (2) SEMEO is also effective for identifying repackaged apps that have been previously obfuscated by ProGuard, a popular obfuscation tool; and (3) SEMEO is effective at identifying semantically equivalent methods in a repackaged, malicious version of Pokemon Go.
Mapping System Level Behaviors with Android APIs via System Call Dependence Graphs
Because Android is open source and has low barriers to entry for
developers, millions of developers and third-party organizations have been
attracted into the Android ecosystem. However, over 90 percent of mobile
malware is found to target Android. Though Android provides multiple
security features and layers to protect user data and system resources, there
are still some over-privileged applications in the Google Play Store and in
third-party Android app stores in the wild. In this paper, we propose an
approach to map system level behaviors to Android APIs, based on the
observation that system level behaviors cannot be avoided but sensitive
Android APIs can be evaded.
To the best of our knowledge, our approach is the first work to map
system level behaviors to Android APIs through System Call Dependence Graphs.
The study also shows that our approach can effectively identify potential
permission abuse, with almost negligible performance impact.
Comment: 14 pages, 6 figures
From wearable towards epidermal computing : soft wearable devices for rich interaction on the skin
Human skin provides a large, always available, and easy to access real-estate for interaction. Recent advances in new materials, electronics, and human-computer interaction have led to the emergence of electronic devices that reside directly on the user's skin. These conformal devices, referred to as Epidermal Devices, have mechanical properties compatible with human skin: they are very thin, often thinner than human hair; they elastically deform when the body is moving, and stretch with the user's skin.
Firstly, this thesis provides a conceptual understanding of Epidermal Devices in the HCI literature. We compare and contrast them with other technical approaches that enable novel on-skin interactions. Then, through a multi-disciplinary analysis of Epidermal Devices, we identify the design goals and challenges that need to be addressed for advancing this emerging research area in HCI. Following this, our fundamental empirical research investigated how epidermal devices of different rigidity levels affect passive and active tactile perception. Generally, a correlation was found between the device rigidity and tactile sensitivity thresholds as well as roughness discrimination ability. Based on these findings, we derive design recommendations for realizing epidermal devices. Secondly, this thesis contributes novel Epidermal Devices that enable rich on-body interaction. SkinMarks contributes to the fabrication and design of novel Epidermal Devices that are highly skin-conformal and enable touch, squeeze, and bend sensing with co-located visual output. These devices can be deployed on highly challenging body locations, enabling novel interaction techniques and expanding the design space of on-body interaction. Multi-Touch Skin enables high-resolution multi-touch input on the body. We present the first non-rectangular and high-resolution multi-touch sensor overlays for use on skin and introduce a design tool that generates such sensors in custom shapes and sizes. Empirical results from two technical evaluations confirm that the sensor achieves a high signal-to-noise ratio on the body under various grounding conditions and has a high spatial accuracy even when subjected to strong deformations. Thirdly, because Epidermal Devices are in contact with the skin, they offer opportunities for sensing rich physiological signals from the body.
To leverage this unique property, this thesis presents rapid fabrication and computational design techniques for realizing Multi-Modal Epidermal Devices that can measure multiple physiological signals from the human body. Devices fabricated through these techniques can measure ECG (Electrocardiogram), EMG (Electromyogram), and EDA (Electro-Dermal Activity). We also contribute a computational design and optimization method based on underlying human anatomical models to create optimized device designs that provide an optimal trade-off between physiological signal acquisition capability and device size. The graphical tool allows designers to easily specify design preferences and to visually analyze the generated designs in real time, enabling designer-in-the-loop optimization. Experimental results show high quantitative agreement between the prediction of the optimizer and experimentally collected physiological data. Finally, taking a multi-disciplinary perspective, we outline the roadmap for future research in this area by highlighting the next important steps, opportunities, and challenges. Taken together, this thesis contributes towards a holistic understanding of Epidermal Devices: it provides an empirical and conceptual understanding as well as technical insights through contributions in DIY (Do-It-Yourself), rapid fabrication, and computational design techniques.
Malware variant detection
Malware programs (e.g., viruses, worms, Trojans) are a worldwide epidemic. Studies and statistics show that the impact of malware is getting worse. Malware detectors are the primary tools in the defence against malware. Most commercial anti-malware scanners maintain a database of malware patterns and heuristic signatures for detecting malicious programs within a computer system. Malware writers use semantic-preserving code transformation (obfuscation) techniques to produce new stealth variants of their malware programs. Malware variants are hard to detect with today's detection technologies as these tools rely mostly on syntactic properties and ignore the semantics of malicious executable programs. A robust malware detection technique is required to handle this emerging security threat. In this thesis, we propose a new methodology that overcomes the drawback of existing malware detection methods by analysing the semantics of known malicious code. The methodology consists of three major analysis techniques: the development of a semantic signature, slicing analysis and test data generation analysis. The core element in this approach is to specify an approximation for malware code semantics and to produce signatures for identifying, possibly obfuscated but semantically equivalent, variants of a sample of malware. A semantic signature consists of a program test input and semantic traces of a known malware code. The key challenge in developing our semantics-based approach to malware variant detection is to achieve a balance between improving the detection rate (i.e. matching semantic traces) and performance, with or without the effects of obfuscation on malware variants. We develop slicing analysis to improve the construction of semantic signatures. We back our trace-slicing method with a theoretical result that shows the notion of correctness of the slicer.
A proof-of-concept implementation of our malware detector demonstrates that the semantics-based analysis approach could improve current detection tools and make the task more difficult for malware authors. Another important part of this thesis is exploring program semantics for the selection of a suitable part of the semantic signature, for which we provide two new theoretical results. In particular, this dissertation includes a test data generation method that works for binary executables and the notion of correctness of the method.
Analysis and Defense of Emerging Malware Attacks
The persistent evolution of malware intrusion brings great challenges to the current anti-malware industry. First, traditional signature-based detection and prevention schemes produce overgrown signature databases for each end-host user, and the user has to install the AV tool and tolerate its consuming a huge amount of resources for pairwise matching. On the malware analysis side, emerging malware can detect its running environment and determine whether or not it should infect the host. Hence, traditional dynamic malware analysis can no longer find the desired malicious logic if the targeted environment cannot be extracted in advance. Both problems reveal that current malware defense schemes are too passive and reactive to fulfill the task.
The goal of this research is to develop new analysis and protection schemes for emerging malware threats. First, this dissertation performs a detailed study of recent targeted malware attacks. Based on the study, we develop a new technique to perform targeted malware analysis effectively and efficiently. Second, this dissertation studies a new trend of massive malware intrusion and proposes a new protection scheme to proactively defend against malware attacks. Lastly, our focus is new P2P malware. We propose a new scheme, named informed active probing, for large-scale P2P malware analysis and detection. Furthermore, our internet-wide evaluation shows that our active probing scheme can successfully detect P2P malware and its corresponding malicious servers.
THE SCALABLE AND ACCOUNTABLE BINARY CODE SEARCH AND ITS APPLICATIONS
The past decade has witnessed an explosion of various applications and devices.
This big-data era challenges the existing security technologies: new analysis techniques
should be scalable enough to handle codebases at a “big data” scale; they should become smart
and proactive by using the data to understand what the vulnerable points are and where
they are located; and effective protection should be provided for the dissemination and analysis of data
involving sensitive information on an unprecedented scale.
In this dissertation, I argue that code search techniques can boost existing security
analysis techniques (vulnerability identification and memory analysis) in terms of scalability and accuracy. In order to demonstrate its benefits, I address two issues of code search by using code analysis: scalability and accountability. I further demonstrate the benefit of code search by applying it to the scalable vulnerability identification [57] and the
cross-version memory analysis problems [55, 56].
Firstly, I address the scalability problem of code search by learning “higher-level” semantic
features from code [57]. Instead of conducting fine-grained testing on a single device
or program, it becomes much more crucial to achieve quick vulnerability scanning
in devices or programs at a “big data” scale. However, discovering vulnerabilities in “big
code” is like finding a needle in a haystack, even when dealing with known vulnerabilities. This new challenge demands a scalable code search approach. To this end, I leverage successful techniques from image search in the computer vision community and propose a novel code encoding method for scalable vulnerability search in binary code. The evaluation results show that this approach can achieve accuracy and efficiency comparable to or even better than the baseline techniques.
Secondly, I tackle the accountability issues left in the vulnerability searching problem
by designing vulnerability-oriented raw features [58]. Similar code does not always
represent a similar vulnerability, so the feature engineering for code
search should focus on semantic-level features rather than syntactic ones. I propose to
extract conditional formulas as higher-level semantic features from the raw binary code to
conduct the code search. A conditional formula explicitly captures two cardinal factors
of a vulnerability: 1) erroneous data dependencies and 2) missing or invalid condition
checks. As a result, the binary code search on conditional formulas produces significantly
higher accuracy and provides meaningful evidence for human analysts to further examine
the search results. The evaluation results show that this approach can further improve
the search accuracy of existing bug search techniques with very reasonable performance
overhead.
Finally, I demonstrate the potential of the code search technique in the memory analysis
field, and apply it to address the cross-version issue in the memory forensic problem
[55, 56]. Memory analysis techniques for COTS software usually rely on
so-called “data structure profiles” for their binaries. Construction of such profiles requires
expert knowledge about the internal workings of a specified software version. However,
it is still a cumbersome manual effort most of the time. I propose to leverage the code search
technique to enable a notion named “cross-version memory analysis”, which can update a
profile for new versions of a piece of software by transferring the knowledge from the model that
has already been trained on its old version. The evaluation results show that the code search based approach advances the existing memory analysis methods by reducing the
manual effort while maintaining reasonable accuracy. With the help of collaborators, I
further developed two plugins for the Volatility memory forensic framework [2], and show
that each of the two plugins can construct a localized profile to perform specified memory
forensic tasks on the same memory dump, without the need for manual effort in creating the corresponding profile.