JSBiRTH: Dynamic javascript birthmark based on the run-time heap
JavaScript is currently the dominating client-side scripting language in the web community. However, the source code of JavaScript can be easily copied through a browser. The intellectual property right of the developers lacks protection. In this paper, we consider using dynamic software birthmark for JavaScript. Instead of using control flow trace (which can be corrupted by code obfuscation) and API (which may not work if the software does not have many API calls), we exploit the run-time heap, which reflects substantially the dynamic behavior of a program, to extract birthmarks. We introduce JSBiRTH, a novel software birthmark system for JavaScript based on the comparison of run-time heaps. We evaluated our system using 20 JavaScript programs with most of them being large-scale. Our system gave no false positive or false negative. Moreover, it is robust against code obfuscation attack. We also show that our system is effective in detecting partial code theft. © 2011 IEEE. The 35th IEEE Annual Computer Software and Applications Conference (COMPSAC 2011), Munich, Germany, 18-22 July 2011. In Proceedings of 35th COMPSAC, 2011, p. 407-41
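The idea of comparing run-time heaps can be illustrated with a minimal sketch. The abstract does not say how JSBiRTH represents a heap or scores similarity, so everything here is an assumption made for illustration: the `Heap` layout, the `birthmark` and `similarity` helpers, and the use of Jaccard similarity over type-labelled reference edges.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical heap snapshot: object id -> (type name, ids of referenced objects).
Heap = Dict[int, Tuple[str, List[int]]]
Edge = Tuple[str, str]


def birthmark(heap: Heap) -> Set[Edge]:
    """Reduce a snapshot to its set of (source type, target type) reference edges."""
    edges: Set[Edge] = set()
    for type_name, refs in heap.values():
        for target in refs:
            if target in heap:  # ignore references to objects outside the snapshot
                edges.add((type_name, heap[target][0]))
    return edges


def similarity(a: Set[Edge], b: Set[Edge]) -> float:
    """Jaccard similarity of two birthmarks (assumed measure, not JSBiRTH's)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


original = {1: ("App", [2, 3]), 2: ("Cart", [3]), 3: ("Item", [])}
# Same object structure under different object ids.
suspect = {10: ("App", [20, 30]), 20: ("Cart", [30]), 30: ("Item", [])}
unrelated = {1: ("Game", [2]), 2: ("Sprite", [])}

print(similarity(birthmark(original), birthmark(suspect)))
print(similarity(birthmark(original), birthmark(unrelated)))
```

Because the birthmark discards object ids and keeps only the shape of the reference graph, two heaps with the same structure score as identical even when their ids differ.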
Exploiting JavaScript Birthmarking Techniques for Code Theft Detection
The purpose of this dissertation is to analyse birthmarking techniques for detecting code theft. A software birthmark is, as the name suggests, a set of unique characteristics that allow that software to be identified. To detect theft, birthmarks are extracted from two programs, the original and the suspect, and compared with each other; theft is indicated if they are very similar or identical. Because web applications are growing and JavaScript is the most widely used language in this field, code theft in this area is a current problem. The detection of theft in programs written in that language is therefore the focus of this dissertation. Some existing birthmarking techniques are analysed in chronological order and according to their relevance to the theme. Each technique is analysed individually, and a summary and comparison of all of them is given at the end. Given that most of the techniques were not created for JavaScript, their applicability to the language is also analysed. From these analyses, conclusions about the best candidates for the final solution are drawn. The final goal is to develop a tool that uses a birthmarking technique to determine whether two JavaScript programs were copied, in order to support code theft allegations.
Graphs Resemblance based Software Birthmarks through Data Mining for Piracy Control
The emergence of software artifacts greatly emphasizes the need to protect intellectual property rights (IPR), which are hampered by software piracy, and calls for effective piracy control measures. Software birthmarking aims to counter ownership theft of software by identifying similarity of origin. This paper proposes a novel birthmarking approach based on a hybrid of text-mining and graph-mining techniques. The code elements of a program and their relations with other elements are identified through their properties (i.e., code constructs) and transformed into Graph Manipulation Language (GML). The software birthmarks, generated by exploiting graph-theoretic properties (through the clustering coefficient), are used to classify two programs as similar or dissimilar. The proposed technique has been evaluated on the metrics of credibility, resilience, method theft, modified code detection and self-copy detection, and the results assert its effectiveness against software ownership theft. A comparative analysis with contemporary approaches shows better results, attributed to using the properties and relations of program nodes and to employing dynamic graph-mining techniques without adding overhead (such as increased program size and processing cost).
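As a rough illustration of the graph-theoretic property named above, the following sketch computes the average local clustering coefficient of an undirected graph. The abstract does not specify how program graphs are built or how the coefficient feeds into classification, so the `Graph` type, the function names, and the toy graph are all assumptions.

```python
from itertools import combinations
from typing import Dict, Set

Graph = Dict[str, Set[str]]  # undirected graph as adjacency sets (assumed layout)


def local_clustering(graph: Graph, node: str) -> float:
    """Fraction of a node's neighbour pairs that are themselves connected."""
    neighbours = graph[node]
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(neighbours, 2) if v in graph[u])
    return 2.0 * links / (k * (k - 1))


def average_clustering(graph: Graph) -> float:
    """Mean of the local clustering coefficients over all nodes."""
    if not graph:
        return 0.0
    return sum(local_clustering(graph, n) for n in graph) / len(graph)


# Toy graph: a triangle a-b-c with one extra node d attached to c.
toy = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
print(average_clustering(toy))
```

A single scalar like this is cheap to compare between two programs, though on its own it discards most of the graph's structure.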
Software similarity and classification
This thesis analyses software programs in the context of their similarity to other software programs. Applications proposed and implemented include detecting malicious software and discovering security vulnerabilities
Uncovering Features in Behaviorally Similar Programs
The detection of similar code can support many software engineering tasks such as program understanding and program classification. Many excellent approaches have been proposed to detect programs having similar syntactic features. However, these approaches are unable to identify programs dynamically or statistically close to each other, which we call behaviorally similar programs. We believe the detection of behaviorally similar programs can enhance or even automate the tasks relevant to program classification. In this thesis, we will discuss our current approaches to identify programs having similar behavioral features in multiple perspectives.
We first discuss how to detect programs having similar functionality. While the definition of a program’s functionality is undecidable, we use inputs and outputs (I/Os) of programs as the proxy of their functionality. We then use I/Os of programs as a behavioral feature to detect which programs are functionally similar: two programs are functionally similar if they share similar inputs and outputs. This approach has been studied and developed in the C language to detect functionally equivalent programs having equivalent I/Os. Nevertheless, some natural problems in object-oriented languages, such as input generation and comparisons between application-specific data types, hinder the development of this approach. We propose a new technique, in-vivo detection, which uses existing and meaningful inputs to drive applications systematically and then applies a novel similarity model considering both inputs and outputs of programs, to detect functionally similar programs. We develop the tool, HitoshiIO, based on our in-vivo detection. In the subjects that we study, HitoshiIO correctly detects 68.4% of functionally similar programs, with a false positive rate of only 16.6%.
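The notion of using I/Os as a proxy for functionality can be sketched in its simplest form: run two functions on the same inputs and measure how often their outputs agree. This is not HitoshiIO's in-vivo technique or its similarity model, neither of which is detailed here; `io_similarity` and the three sample functions are hypothetical.

```python
from typing import Any, Callable, Iterable


def io_similarity(f: Callable[[Any], Any], g: Callable[[Any], Any],
                  inputs: Iterable[Any]) -> float:
    """Fraction of shared inputs on which f and g produce equal outputs."""
    inputs = list(inputs)
    if not inputs:
        return 0.0
    matches = sum(1 for x in inputs if f(x) == g(x))
    return matches / len(inputs)


def sum_loop(values):
    total = 0
    for v in values:
        total += v
    return total


def sum_builtin(values):
    return sum(values)


def product(values):
    result = 1
    for v in values:
        result *= v
    return result


samples = [[1, 2, 3], [4, 5], [0, 7], []]
print(io_similarity(sum_loop, sum_builtin, samples))
print(io_similarity(sum_loop, product, samples))
```

The second comparison is not zero because `[1, 2, 3]` has the same sum and product, which shows why a small input set can produce coincidental matches.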
In addition to functional I/Os of programs, we attempt to discover programs having similar execution behavior. Again, the execution behavior of a program can be undecidable, so we use instructions executed at run-time as a behavioral feature of a program. We create DyCLINK, which observes program executions and encodes them in dynamic instruction graphs. A vertex in a dynamic instruction graph is an instruction and an edge is a type of dependency between two instructions. The problem of detecting which programs have similar executions can then be reduced to a problem of solving inexact graph isomorphism. We propose a link analysis based algorithm, LinkSub, which vectorizes each dynamic instruction graph by the importance of every instruction, to solve this graph isomorphism problem efficiently. In a K Nearest Neighbor (KNN) based program classification experiment, DyCLINK achieves 90+% precision.
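To make "vectorizes each graph by the importance of every instruction" concrete, the sketch below ranks the vertices of a small instruction graph with standard PageRank computed by power iteration. LinkSub's actual algorithm is not described here, so PageRank is a stand-in; the graph layout, instruction names, and parameters are assumptions.

```python
from typing import Dict, List

# Assumed layout: instruction -> instructions that depend on it.
# Every target must also appear as a key.
Graph = Dict[str, List[str]]


def pagerank(graph: Graph, damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Standard PageRank by power iteration (a stand-in for LinkSub's ranking)."""
    nodes = list(graph)
    n = len(nodes)
    if n == 0:
        return {}
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        nxt = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = graph[v] or nodes  # dangling nodes spread rank uniformly
            share = damping * rank[v] / len(targets)
            for t in targets:
                nxt[t] += share
        rank = nxt
    return rank


# Hypothetical dependency graph for "store(load_a + load_b)".
instructions = {
    "load_a": ["add"],
    "load_b": ["add"],
    "add": ["store"],
    "store": [],
}
ranks = pagerank(instructions)
for name in sorted(ranks, key=ranks.get, reverse=True):
    print(name, round(ranks[name], 3))
```

Sorting instructions by rank yields a fixed-order vector per graph, which can then be compared with ordinary vector distances instead of solving graph isomorphism directly.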
Because HitoshiIO and DyCLINK both rely on dynamic analysis to expose program behavior, they have better capability to locate and search for behaviorally similar programs than traditional static analysis tools. However, they suffer from some common problems of dynamic analysis, such as input generation and run-time overhead. These problems may make our approaches challenging to scale. Thus, we create the system, Macneto, which integrates static analysis with machine topic modeling and deep learning to approximate program behaviors from their binaries without truly executing programs. In our deobfuscation experiments considering two commercial obfuscators that alter lexical information and syntax in programs, Macneto achieves 90+% precision, where the ground truth is that the behavior of a program before and after obfuscation should be the same.
In this thesis, we offer a more extensive view of similar programs than the traditional definitions. While the traditional definitions of similar programs mostly use static features, such as syntax and lexical information, we propose to leverage the power of dynamic analysis and machine learning models to trace/collect behavioral features of programs. These behavioral features can then be applied to detect behaviorally similar programs. We believe the techniques we invented in this thesis to detect behaviorally similar programs can improve the development of software engineering and security applications, such as code search and deobfuscation.
Structural analysis of source code plagiarism using graphs
A dissertation submitted to the Faculty of Science, University of the Witwatersrand, Johannesburg in fulfillment of the requirements for the degree of Master of Science. May 2017
Plagiarism is a serious problem in academia. It is prevalent in the computing discipline where students are expected to submit source code assignments as part of their assessment; hence, there is every likelihood of copying. Ideally, students can collaborate with each other to perform a programming task, but it is expected that each student submit his/her own solution for the programming task. More so, one might conclude that the interaction would make them learn programming. Unfortunately, that may not always be the case. In undergraduate courses, especially in the computer sciences, if a given class is large, it would be unfeasible for an instructor to manually check each and every assignment for probable plagiarism. Even if the class size were smaller, it is still impractical to inspect every assignment for likely plagiarism because some potentially plagiarised content could still be missed by humans. Therefore, automatically checking the source code programs for likely plagiarism is essential.
Many methods have been proposed to detect source code plagiarism in undergraduate source code assignments, but an ideal system should be able to differentiate actual cases of plagiarism from coincidental similarities that usually occur in source code plagiarism. Some of the existing source code plagiarism detection systems are either not scalable, or performed better when programs are modified with a number of insertions and deletions to obfuscate plagiarism. To address this issue, a graph-based model that considers structural similarities of programs is introduced for detecting plagiarism in programming assignments.
This research study proposes an approach to measuring cases of similarities in programming assignments using an existing plagiarism detection system to find similarities in programs, and a graph-based model to annotate the programs. We describe experiments with data sets of undergraduate Java programs to inspect the programs for plagiarism and evaluate the graph-model with good precision. An evaluation of the graph-based model reveals a high rate of plagiarism in the programs and resilience to many obfuscation techniques, while false detection (coincident similarity) rarely occurred. If this detection method is adopted into use, it will aid an instructor to carry out the detection process conscientiously.
Heaps don't lie : countering unsoundness with heap snapshots
Static analyses aspire to explore all possible executions in order to achieve soundness. Yet, in practice, they fail to capture common dynamic behavior. Enhancing static analyses with dynamic information is a common pattern, with tools such as Tamiflex. Past approaches, however, miss significant portions of dynamic behavior, due to native code, unsupported features (e.g., invokedynamic or lambdas in Java), and more. We present techniques that substantially counteract the unsoundness of a static analysis, with virtually no intrusion to the analysis logic. Our approach is reified in the HeapDL toolchain and consists in taking whole-heap snapshots during program execution, that are further enriched to capture significant aspects of dynamic behavior, regardless of the causes of such behavior. The snapshots are then used as extra inputs to the static analysis. The approach exhibits both portability and significantly increased coverage. Heap information under one set of dynamic inputs allows a static analysis to cover many more behaviors under other inputs. A HeapDL-enhanced static analysis of the DaCapo benchmarks computes 99.5% (median) of the call-graph edges of unseen dynamic executions (vs. 76.9% for the Tamiflex tool).
SEMEO: A SEMANTIC EQUIVALENCE ANALYSIS FRAMEWORK FOR OBFUSCATED ANDROID APPLICATIONS
Software repackaging is a common approach for creating malware. In this approach, malware authors inject malicious payloads into legitimate applications; then, to render security analysis more difficult, they obfuscate most or all of the code. This forces analysts to spend a large amount of effort filtering out benign obfuscated methods in order to locate potentially malicious methods for further analysis. If an effective mechanism for filtering out benign obfuscated methods were available, the number of methods that must be analyzed could be reduced, allowing analysts to be more productive. In this thesis, we introduce SEMEO, a highly effective and efficient filtering approach that can determine whether an obfuscated and an original version of a method are semantically equivalent. Our approach handles seven common, complex types of obfuscation and can be effective even when all types are compositely applied. In an empirical evaluation, we applied SEMEO to nine Android apps of varying complexity, and the approach provided over 76% recall and 100% precision in identifying semantically equivalent methods. We then performed three additional studies, which showed that: (1) SEMEO is much more effective at identifying semantically equivalent methods than FSquaDRA, an existing technique; (2) SEMEO is also effective for identifying repackaged apps that have been previously obfuscated by ProGuard, a popular obfuscation tool; and (3) SEMEO is effective at identifying semantically equivalent methods in a repackaged, malicious version of Pokemon Go.