
    A Symmetric Approach to Compilation and Decompilation

    Just as specializing a source interpreter can achieve compilation from a source language to a target language, we observe that specializing a target interpreter can achieve compilation from the target language to the source language. In both cases, the key issue is the choice of whether to perform an evaluation or to emit code that represents this evaluation. We substantiate this observation by specializing two source interpreters and two target interpreters. We first consider a source language of arithmetic expressions and a target language for a stack machine, and then the lambda-calculus and the SECD-machine language. In each case, we prove that the target-to-source compiler is a left inverse of the source-to-target compiler, i.e., it is a decompiler. In the context of partial evaluation, compilation by source-interpreter specialization is classically referred to as a Futamura projection. By symmetry, it seems logical to refer to decompilation by target-interpreter specialization as a Futamura embedding.
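
    As a rough illustration of the symmetry described in this abstract, the minimal Python sketch below (not the paper's partial-evaluation formalization; all names are illustrative) derives a compiler from a source interpreter and a decompiler from a target interpreter by making one change in each: emit a representation of the evaluation instead of performing it. The final assertions check that the decompiler is a left inverse of the compiler on one example.

        # Source language: nested tuples ("lit", n) / ("add", l, r).
        # Target language: stack-machine code, a list of ("PUSH", n) / ("ADD",).

        def eval_expr(e):
            # source interpreter: performs each evaluation
            if e[0] == "lit":
                return e[1]
            _, l, r = e
            return eval_expr(l) + eval_expr(r)

        def compile_expr(e):
            # same structure as eval_expr, but emits code for each evaluation
            if e[0] == "lit":
                return [("PUSH", e[1])]
            _, l, r = e
            return compile_expr(l) + compile_expr(r) + [("ADD",)]

        def run_code(code):
            # target interpreter: performs each evaluation on a stack
            stack = []
            for ins in code:
                if ins[0] == "PUSH":
                    stack.append(ins[1])
                else:
                    b, a = stack.pop(), stack.pop()
                    stack.append(a + b)
            return stack[0]

        def decompile_code(code):
            # same structure as run_code, but rebuilds source terms instead of values
            stack = []
            for ins in code:
                if ins[0] == "PUSH":
                    stack.append(("lit", ins[1]))
                else:
                    b, a = stack.pop(), stack.pop()
                    stack.append(("add", a, b))
            return stack[0]

        e = ("add", ("lit", 1), ("add", ("lit", 2), ("lit", 3)))
        assert decompile_code(compile_expr(e)) == e            # left inverse
        assert run_code(compile_expr(e)) == eval_expr(e) == 6  # semantics preserved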

    An efficient, parametric fixpoint algorithm for analysis of java bytecode

    Abstract interpretation has been widely used for the analysis of object-oriented languages and, in particular, Java source and bytecode. However, while most existing work deals with the problem of finding expressive abstract domains that track accurately the characteristics of a particular concrete property, the underlying fixpoint algorithms have received comparatively less attention. In fact, many existing (abstract interpretation based) fixpoint algorithms rely on relatively inefficient techniques for solving inter-procedural call graphs or are specific and tied to particular analyses. We also argue that the design of an efficient fixpoint algorithm is pivotal to supporting the analysis of large programs. In this paper we introduce a novel algorithm for analysis of Java bytecode which includes a number of optimizations in order to reduce the number of iterations. The algorithm is parametric (in the sense that it is independent of the abstract domain used and can be applied to different domains as "plug-ins"), multivariant, and flow-sensitive. It is also based on a program transformation, prior to the analysis, that results in a highly uniform representation of all the features of the language and therefore simplifies analysis. Detailed descriptions of decompilation solutions are given and discussed with an example. We also provide some performance data from a preliminary implementation of the analysis.
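
    A minimal sketch of the kind of domain-parametric worklist fixpoint the abstract describes (the actual algorithm also handles multivariance, inter-procedural call graphs and the optimizations mentioned above; the function names and the toy parity domain here are only illustrative):

        def fixpoint(cfg, entry, bottom, entry_value, join, leq, transfer):
            # Worklist fixpoint; the abstract domain is supplied as a plug-in
            # (bottom element, join, partial order, transfer function).
            state = {n: bottom for n in cfg}
            state[entry] = entry_value
            worklist = [entry]
            while worklist:
                n = worklist.pop()
                out = transfer(n, state[n])
                for succ in cfg[n]:
                    new = join(state[succ], out)
                    if not leq(new, state[succ]):   # re-queue only on a real change
                        state[succ] = new
                        worklist.append(succ)
            return state

        # Toy domain: sets of possible parities of one counter; join is union,
        # the order is set inclusion, and the loop body flips the parity.
        cfg = {"entry": ["loop"], "loop": ["loop", "exit"], "exit": []}
        flip = lambda n, v: frozenset(p ^ 1 for p in v) if n == "loop" else v
        result = fixpoint(cfg, "entry", frozenset(), frozenset({0}),
                          lambda a, b: a | b, lambda a, b: a <= b, flip)
        print(result["exit"])   # frozenset({0, 1}): parity unknown after the loop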

    The Strengths and Behavioral Quirks of Java Bytecode Decompilers

    During compilation from Java source code to bytecode, some information is irreversibly lost. In other words, compilation and decompilation of Java code is not symmetric. Consequently, the decompilation process, which aims at producing source code from bytecode, must establish some strategies to reconstruct the information that has been lost. Modern Java decompilers tend to use distinct strategies to achieve proper decompilation. In this work, we hypothesize that the diverse ways in which bytecode can be decompiled have a direct impact on the quality of the source code produced by decompilers. We study the effectiveness of eight Java decompilers with respect to three quality indicators: syntactic correctness, syntactic distortion and semantic equivalence modulo inputs. This study relies on a benchmark set of 14 real-world open-source software projects to be decompiled (2041 classes in total). Our results show that no single modern decompiler is able to correctly handle the variety of bytecode structures coming from real-world programs. Even the highest-ranking decompiler in this study produces syntactically correct output for only 84% of the classes in our dataset and semantically equivalent output for 78% of them.
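
    The third quality indicator, semantic equivalence modulo inputs, can be pictured with the hedged Python sketch below: decompile, recompile, and compare the observable behaviour of the original and the recompiled bytecode on a set of inputs. The decompiler command line is a placeholder rather than any particular tool's real interface, and the helper names are illustrative.

        import pathlib, subprocess, tempfile

        def run(cmd, **kw):
            return subprocess.run(cmd, capture_output=True, text=True, **kw)

        def equivalent_modulo_inputs(class_dir, main_class, decompile_cmd, inputs):
            work = pathlib.Path(tempfile.mkdtemp())
            src, out = work / "src", work / "classes"
            src.mkdir(); out.mkdir()
            # 1. decompile the bytecode to source (tool-specific in practice)
            run(decompile_cmd + [class_dir, str(src)])
            # 2. recompile; failure here already means "not syntactically correct"
            javac = run(["javac", "-d", str(out)] + [str(p) for p in src.rglob("*.java")])
            if javac.returncode != 0:
                return False
            # 3. compare behaviour of original vs. recompiled classes on each input
            for stdin in inputs:
                a = run(["java", "-cp", class_dir, main_class], input=stdin)
                b = run(["java", "-cp", str(out), main_class], input=stdin)
                if a.stdout != b.stdout:
                    return False
            return True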

    An efficient, parametric fixpoint algorithm for incremental analysis of java bytecode

    Abstract interpretation has been widely used for the analysis of object-oriented languages and, more precisely, Java source and bytecode. However, while most of the existing work deals with the problem of finding expressive abstract domains that track accurately the characteristics of a particular concrete property, the underlying fixpoint algorithms have received comparatively less attention. In fact, many existing (abstract interpretation based) fixpoint algorithms rely on relatively inefficient techniques to solve inter-procedural call graphs or are specific and tied to particular analyses. We argue that the design of an efficient fixpoint algorithm is pivotal to supporting the analysis of large programs. In this paper we introduce a novel algorithm for analysis of Java bytecode which includes a number of optimizations in order to reduce the number of iterations. Also, the algorithm is parametric in the sense that it is independent of the abstract domain used and it can be applied to different domains as "plug-ins". It is also incremental in the sense that, if desired, analysis data can be saved so that only a reduced amount of reanalysis is needed after a small program change, which can be instrumental for large programs. The algorithm is also multivariant and flow-sensitive. Finally, another interesting characteristic of the algorithm is that it is based on a program transformation, prior to the analysis, that results in a highly uniform representation of all the features in the language and therefore simplifies analysis. Detailed descriptions of decompilation solutions are provided and discussed with an example.
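
    The incremental aspect can be sketched as follows: cache per-method results together with the callers that consumed them, and after a small change invalidate and re-run only the affected part of the call graph. This is a rough Python sketch with illustrative names (invalidate, analyse_method), not the algorithm's actual data structures.

        def invalidate(changed, callers, cache):
            # drop cached results for the changed method and, transitively,
            # for every method whose result was computed from them
            stale, todo = set(), [changed]
            while todo:
                m = todo.pop()
                if m in stale:
                    continue
                stale.add(m)
                cache.pop(m, None)
                todo.extend(callers.get(m, ()))
            return stale

        def reanalyse(changed, callers, cache, analyse_method):
            for m in invalidate(changed, callers, cache):
                cache[m] = analyse_method(m)   # only stale methods are re-run
            return cache

        # Example: callers = {"m2": ["m1"]} records that m1's result depends on m2,
        # so a change to m2 re-runs only m2 and m1.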

    Hijacker: Efficient static software instrumentation with applications in high performance computing: Poster paper

    Static Binary Instrumentation is a technique that allows compile-time program manipulation. In particular, by relying on ad-hoc tools, the end user is able to alter the program's execution flow without affecting its overall semantics. This technique has been effectively used, e.g., to support code profiling, performance analysis, error detection, attack detection, or behavior monitoring. Nevertheless, efficiently relying on static instrumentation to produce executables which can be deployed without affecting the overall performance of the application still presents technical and methodological issues. In this paper, we present Hijacker, an open-source customizable static binary instrumentation tool which is able to alter a program's execution flow according to user-specified rules while limiting the execution overhead due to the code snippets inserted in the original program, thus enabling its exploitation in high-performance computing. The tool is highly modular and works on an internal representation of the program which allows complex instrumentation tasks to be performed efficiently, and it can additionally be extended to support different instruction sets and executable formats without any need to modify the instrumentation engine. We additionally present an experimental assessment of the overhead induced by the injected code in real HPC applications. © 2013 IEEE
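
    As a rough illustration of rule-driven instrumentation on an internal representation, the Python sketch below inserts a user-specified snippet before every instruction matched by a rule. Hijacker's actual rule language, IR and supported instruction sets differ; the types and the example rule here are purely illustrative.

        from dataclasses import dataclass
        from typing import Callable, List, Tuple

        @dataclass
        class Insn:
            opcode: str
            operands: Tuple[str, ...] = ()

        @dataclass
        class Rule:
            matches: Callable[[Insn], bool]   # where to instrument
            snippet: List[Insn]               # what to insert before the match

        def instrument(ir: List[Insn], rules: List[Rule]) -> List[Insn]:
            out: List[Insn] = []
            for insn in ir:
                for rule in rules:
                    if rule.matches(insn):
                        out.extend(rule.snippet)   # inject snippet, keep original insn
                out.append(insn)
            return out

        # Example rule: call a monitor routine before every memory write
        ir = [Insn("mov", ("rax", "1")), Insn("store", ("rax", "[rbx]")), Insn("ret")]
        rules = [Rule(lambda i: i.opcode == "store", [Insn("call", ("record_write",))])]
        print(instrument(ir, rules))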

    Solvers for Type Recovery and Decompilation of Binaries

    Reconstructing the meaning of a program from its binary is known as reverse engineering. Since reverse engineering is ultimately a search for meaning, there is growing interest in inferring a type (a meaning) for the elements of a binary in a consistent way. Currently there is no consensus on how best to achieve this, with the few existing approaches utilising ad-hoc techniques which lack any formal basis. Moreover, previous work does not answer (or even ask) the fundamental question of what it means for recovered types to be correct. This thesis demonstrates how solvers for Satisfiability Modulo Theories (SMT) and Constraint Handling Rules (CHR) can be leveraged to solve the type reconstruction problem. In particular, an approach based on a new SMT theory of rational tree constraints is developed and evaluated. The resulting solver, based on the reification mechanisms of Prolog, is shown to scale well, and leads to a reification-driven SMT framework that supports rapid implementation of SMT solvers for different theories in just a few hundred lines of code. The question of how to guarantee semantic relevance for reconstructed types is answered with a new and semantically founded approach that provides strong guarantees for the reconstructed types. Key to this approach is the derivation of a witness program in a type-safe high-level language alongside the reconstructed types. This witness has the same semantics as the binary, is type correct by construction, and induces a (justifiable) type assignment on the binary. Moreover, the approach, implemented using CHR, yields a type-directed decompiler. Finally, to evaluate the flexibility of reification-based SMT solving, the SMT framework is instantiated with theories of general linear inequalities, integer difference problems and octagons. The integer difference solver is shown to perform competitively with state-of-the-art SMT solvers. Two new algorithms for incremental closure of the octagonal domain are presented and proven correct. These are shown to be both conceptually simple and to offer improved performance over existing algorithms. Although not directly related to reverse engineering, these results follow from the work on SMT solver construction.
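
    To give the general flavour of constraint-based type reconstruction (the thesis itself works with an SMT theory of rational tree constraints and CHR, so the toy unifier below is only an analogy, with illustrative names and types), each register use yields a type variable, instructions contribute equality constraints, and a union-find unifier solves them:

        class TypeVar:
            def __init__(self, name):
                self.name, self.parent, self.ty = name, self, None

        def find(v):
            while v.parent is not v:
                v.parent = v.parent.parent   # path compression
                v = v.parent
            return v

        def unify(a, b):
            ra, rb = find(a), find(b)
            if ra is rb:
                return
            if ra.ty and rb.ty and ra.ty != rb.ty:
                raise TypeError(f"conflict: {ra.ty} vs {rb.ty}")
            rb.ty = rb.ty or ra.ty
            ra.parent = rb

        # Constraints read off a tiny instruction sequence:
        #   eax := load [ebx]   -> ebx points to values of eax's type (assumed int32)
        #   ecx := eax + 1      -> ecx has the same type as eax
        eax, ebx, ecx = TypeVar("eax"), TypeVar("ebx"), TypeVar("ecx")
        eax.ty, ebx.ty = "int32", "ptr(int32)"
        unify(ecx, eax)
        print({v.name: find(v).ty for v in (eax, ebx, ecx)})
        # {'eax': 'int32', 'ebx': 'ptr(int32)', 'ecx': 'int32'}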

    SOFTWARE INTEROPERABILITY: Issues at the Intersection between Intellectual Property and Competition Policy

    The dissertation project proceeds through three papers, analyzing issues related to software interoperability and respectively pertaining to one of three interdependent levels of analysis. The first level addresses the legal status of software interoperability information under current intellectual property law (focusing on copyright law, which is the main legal tool for the protection of these pieces of code), trying to clarify if, how and to what extent these pieces of code (and the associated pieces of information) are protected erga omnes by the law. The second level complements the first, analyzing legal and economic issues related to the technical possibility of actually accessing this interoperability information through reverse engineering (and software decompilation in particular). Once a de facto standard gains the favor of the market, reverse engineering is the main self-help tool available to competitors in order to achieve interoperability and compete "inside this standard". The third step consists in recognizing that, in a limited number of cases which are nonetheless potentially of great economic relevance, market failures could arise despite any care taken in devising checks and balances in the legal setting concerning both the legal status of interoperability information and the legal rules governing software reverse engineering. When this is the case, some undertakings may stably gain a dominant position in software markets, and possibly abuse it. Hence, at this level of analysis, competition policy intervention is taken into account. The first paper of the dissertation shows that interoperability specifications are not protected by copyright. In the paper, I argue that existing doubts and uncertainty are typically related to a poor understanding of the technical nature of software interfaces. To remedy this misunderstanding, the paper focuses on the distinction between interface specifications and implementations and stresses the difference between the steps needed to access the ideas and principles constituting an interface specification and the re-implementation of a functionally equivalent interface through new software code. At the normative level, the paper shows that no major modifications to the existing model of legal protection of software (and software interfaces) are needed; however, it suggests that policymakers could reduce the Fear of legal actions, other forms of legal Uncertainty and several residual Doubts (FUD) by explicitly stating that interface specifications are unprotectable and freely appropriable. In the second paper, I offer a critique of legal restraints on software reverse engineering, focusing in particular on Europe but also considering similar restraints in the US, notably in the context of the Digital Millennium Copyright Act. Through an analysis of entry conditions for latecomers and of the comparative costs of developing programs in the first place or reverse engineering them, the paper shows that the limitations on decompilation imposed by Article 6 of the Software Directive were mostly superfluous and basically non-binding at the time of drafting. What is more, the paper shows that new, and largely unanticipated, developments in software development models (e.g. open source) now make these restraints an obstacle to competition against dominant incumbents controlling software platforms. In fact, limitations on the freedom to decompile hinder major reverse engineering projects performed in a decentralized way, as in the context of an open source community. Hence, since open source projects are the most credible tools for recreating some competitive pressure in a number of crucial software markets, the paper recommends creating a simpler, clear-cut safe harbor for software reverse engineering. The third paper claims that, in software markets, refusal-to-deal (or "information-withholding") strategies are normally complementary to tying (or "predatory-innovation") strategies, and that this complementarity is so relevant that dominant platform controllers need to couple both in order to create significant anti-competitive effects. Hence, the paper argues that mandatory unbundling (i.e. mandating a certain degree of modularity in software development) could be an appropriate, and frequently preferable, alternative to mandatory disclosure of interoperability information. However, considering the critiques that part of the literature has raised against the Commission's Decision in the recent European Microsoft antitrust case, an objection to the previous argument could be that, also in the case of mandatory unbundling, one would still have to determine the minimum price for the unbundled product. The last part of the paper applies some intuitions from the literature on complementary oligopoly to demonstrate that this objection is not well grounded and that, in software markets, mandatory unbundling (modularity) may be a useful policy even if the only constraint on the price of the unbundled good is non-negativity.

    Code similarity and clone search in large-scale source code data

    Software development has benefited tremendously from the Internet, which provides online code corpora that enable instant sharing of source code as well as online developer guides and documentation. Nowadays, duplicated code (i.e., code clones) not only exists within or across software projects but also between online code repositories and websites. We call these "online code clones." Like classic code clones between software systems, they can lead to license violations, bug propagation, and reuse of outdated code. Unfortunately, they are difficult to locate and fix since the search space in online code corpora is large and no longer confined to a local repository. This thesis presents a combined study of code similarity and online code clones. We empirically show that many code snippets on Stack Overflow are cloned from open source projects. Several of them become outdated or violate their original license and are possibly harmful to reuse. To develop a solution for finding online code clones, we study various code similarity techniques to gain insights into their strengths and weaknesses. A framework, called OCD, for evaluating code similarity and clone search tools is introduced and used to compare 34 state-of-the-art techniques on pervasively modified code and boiler-plate code. We also found that clone detection techniques can be enhanced by compilation and decompilation. Using the knowledge from the comparison of code similarity analysers, we create and evaluate Siamese, a scalable token-based clone search technique using multiple code representations. Our evaluation shows that Siamese scales to large-scale source code data of 365 million lines of code and offers high search precision and recall. Its clone search precision is comparable to that of seven state-of-the-art clone detection tools on the OCD framework. Finally, we demonstrate the usefulness of Siamese by applying the tool to find online code clones, automatically analyse clone licenses, and recommend tests for reuse.
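
    As a hedged illustration of the token-based similarity idea that clone search builds on (Siamese's actual multi-representation indexing and the OCD benchmark are far richer; the Python code below is only a toy measure), two fragments can be compared by the Jaccard overlap of their token n-grams:

        import re

        def tokens(code):
            # crude lexer: identifiers, numbers, and single punctuation characters
            return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)

        def ngrams(toks, n=4):
            return {tuple(toks[i:i + n]) for i in range(max(0, len(toks) - n + 1))}

        def jaccard(code_a, code_b, n=4):
            a, b = ngrams(tokens(code_a), n), ngrams(tokens(code_b), n)
            return len(a & b) / len(a | b) if a | b else 1.0

        original = "for (int i = 0; i < n; i++) sum += a[i];"
        clone    = "for (int j = 0; j < n; j++) total += a[j];"   # renamed identifiers
        print(round(jaccard(original, clone), 2))
        # The score is below 1.0 because identifiers differ; normalising identifiers,
        # as token-based clone detectors typically do, would bring it close to 1.0.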