
    Program Analysis and Compilation Techniques for Speeding up Transactional Database Workloads

    There is a trend towards increased specialization of data management software for performance reasons. The improved performance not only leads to more efficient usage of the underlying hardware and cuts the operating costs of the system, but is also a game-changing competitive advantage for many emerging application domains such as high-frequency algorithmic trading, clickstream analysis, infrastructure monitoring, fraud detection, and online advertising, to name a few. In this thesis, we study the automatic specialization and optimization of database application programs -- sequences of queries and updates, augmented with control-flow constructs as they appear in database scripts, user-defined functions (UDFs), transactional workloads, and triggers in languages such as PL/SQL. We propose to build online transaction processing (OLTP) systems around a modern compiler infrastructure. We show how to build an optimizing compiler for transaction programs using generative programming and state-of-the-art compiler technology, and present techniques for aggressive code inlining, fusion, deforestation, and data structure specialization in the domain of relational transaction programs. We also identify and explore the key optimizations that can be applied in this domain.

    In addition, we study the advantage of using program dependency analysis and restructuring to enable concurrency control algorithms to achieve higher performance. Traditionally, optimistic concurrency control algorithms, such as optimistic Multi-Version Concurrency Control (MVCC), avoid blocking concurrent transactions at the cost of having a validation phase. Upon failure in the validation phase, the transaction is usually aborted and restarted from scratch. This "abort and restart" approach becomes a performance bottleneck for use cases with highly contended objects or long-running transactions. In addition, restarting from scratch creates a negative feedback loop in the system, because the additional overhead may create even more conflicts. Using the dependency information inside the transaction programs, we propose a novel transaction repair approach for in-memory databases. This low-overhead approach summarizes the transaction programs in the form of a dependency graph, which also contains the constructs used in the validation phase of the MVCC algorithm. When conflicts among transactions are encountered, our mechanism quickly detects the conflict locations in the program and partially re-executes the conflicting transactions. This maximizes the reuse of the computations done in the first execution round and increases transaction processing throughput.

    We evaluate the proposed ideas and techniques on popular benchmarks such as TPC-C and modified versions of TPC-H and TPC-E, as well as other micro-benchmarks, and show that applying these techniques leads to 2x-100x performance improvements in many use cases. Furthermore, by selectively disabling some of the optimizations in the compiler, we derive a clinical and precise way of obtaining insight into their individual performance contributions.
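    To make the repair idea concrete, the sketch below models a transaction program as a list of steps with explicit read sets over a toy key-value store; on a validation conflict, cached results are kept up to the first step whose reads overlap the conflicting keys, and only the remainder is re-executed. This is a minimal illustration under assumed names (`Step`, `repair`, the `Map`-based store), not the thesis's actual implementation:

    ```scala
    object TransactionRepair {
      // One step of a transaction program: the keys it reads and its effect
      // on a toy key-value store.
      final case class Step(reads: Set[String], run: Map[String, Int] => Map[String, Int])

      // First execution: run every step, caching the store state after each one.
      def executeAll(steps: List[Step], db: Map[String, Int]): List[Map[String, Int]] =
        steps.scanLeft(db)((state, s) => s.run(state)).tail

      // Repair after a validation conflict: keep cached results up to the first
      // step whose read set touches a conflicting key, refresh those keys from
      // the newly committed snapshot, and re-execute only the remaining steps.
      def repair(steps: List[Step], cached: List[Map[String, Int]],
                 db: Map[String, Int], freshReads: Map[String, Int]): List[Map[String, Int]] = {
        val conflicts  = freshReads.keySet
        val firstDirty = steps.indexWhere(s => (s.reads & conflicts).nonEmpty)
        if (firstDirty < 0) cached // no step read a conflicting key: nothing to redo
        else {
          val keep = cached.take(firstDirty)
          val base = (if (firstDirty == 0) db else keep.last) ++ freshReads
          keep ++ executeAll(steps.drop(firstDirty), base)
        }
      }
    }
    ```

    A real engine operates on versioned records rather than immutable maps, but the reuse structure, cached prefix plus partial re-execution, is the point of the sketch.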

    Improving the Interoperation between Generics Translations

    Generics on the Java platform are compiled using the erasure transformation, which only supports by-reference values. This causes slowdowns when generics operate on primitive types, such as integers, as they have to be transformed into reference-based objects. Project Valhalla is an effort to remedy this problem by specializing classes at load time so they can efficiently handle primitive values. In its current early prototype, the Valhalla compilation scheme limits the interaction between specialized and erased generics, thus preventing certain useful code patterns from being expressed. Scala has been using compile-time specialization for six years and has three generics compilation schemes working side by side. In Scala, programmers are allowed to write code that freely exercises the interaction between the different compilation schemes, at the expense of introducing subtle performance issues. Similar performance issues can affect Valhalla-enabled bytecode, whether the code was written in Java or translated from other JVM languages. In this context, we explain how we help programmers avoid these performance regressions in the miniboxing transformation: (1) by issuing actionable performance advisories that steer programmers away from performance regressions, and (2) by providing alternatives to the standard library constructs that use the miniboxing encoding, thus avoiding the conversion overhead.
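    As a small illustration of the boxing cost that such advisories flag, the snippet below contrasts an erased generic method with a `@specialized` one. It uses Scala's stock specialization rather than the miniboxing encoding itself, and the names are illustrative:

    ```scala
    object BoxingDemo {
      // Under erasure T becomes Object, so an Int result is boxed on every call.
      def firstErased[T](xs: Array[T]): T = xs(0)

      // @specialized(Int) makes scalac also emit an int-typed variant of the
      // method, which callers with Int arguments reach without any boxing.
      def firstSpec[@specialized(Int) T](xs: Array[T]): T = xs(0)

      def main(args: Array[String]): Unit = {
        val data = Array(1, 2, 3)
        println(firstErased(data)) // goes through the generic version: boxes
        println(firstSpec(data))   // dispatches to the Int specialization: no box
      }
    }
    ```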

    Reducing Memory Allocation in Functional Languages Based on Lazy Evaluation

    Lazy evaluation is an evaluation strategy that delays a computation until its value is actually needed. Because computation proceeds only from values that turn out to be needed, it can eliminate computations that are unnecessary for the final result and thereby optimize execution. At the same time, by leaving to the language implementation the decision of which values are needed, the programmer no longer has to spell out how the computation should proceed, which leads to declarative and concise programs. For example, when working with recursive data structures such as lists, the code that produces a structure can be written separately from the code that consumes it, so the descriptive benefits of lazy evaluation are substantial.

    While lazy evaluation offers many such benefits, implementing a lazy language system requires careful design around the objects needed to delay computations (delayed objects, hereafter called thunks). In particular, the time and space cost of allocating thunks in memory is a problem: even when lazy evaluation removes unnecessary computation, the run-time overhead of the program can grow large. Aiming at efficient lazy evaluation mechanisms, much research has therefore been devoted to static analyses that suppress thunk creation. Strictness analysis, for example, statically determines which computations are needed to produce a program's final result; expressions whose values are certainly needed do not have to be delayed, so efficient code can be generated that creates no thunks for them. It has been confirmed that strictness analysis suppresses thunk creation in many programs, but because it relies only on static information from the program text, it cannot remove thunks that are eliminable only in light of the program's dynamic behavior. For example, when a run-time factor determines how far a list will be consumed, it is difficult to suppress the thunks that delay the list using strictness analysis alone.

    To reduce thunks further, this thesis focuses on linearly recursive algebraic data structures such as lists and proposes Thunk Recycling, a technique that reuses existing thunks. Thunk Recycling destructively updates already allocated thunks and reuses them, suppressing the creation of new thunks; for lists, the thunk that delays the production of the tail of the list can be reused. The thesis first describes how Thunk Recycling works and the machinery needed to realize it. To enable reuse, reusable thunks are distinguished from ordinary ones. The machinery consists of a compile-time program transformation that guarantees destructive updates cause no inconsistencies, and a run-time mechanism that performs the reuse; the basic policy of the transformation is to keep a single reference to each reusable thunk, and the run-time mechanism reuses much of the existing infrastructure for creating and evaluating thunks.

    Next, the thesis gives a formal definition of Thunk Recycling and a proof of its correctness: a simple functional language is defined, together with the Thunk Recycling program transformation for it and an operational semantics that reuses thunks, and it is proved that applying Thunk Recycling does not change program behavior.

    The thesis then describes an implementation of Thunk Recycling in the Glasgow Haskell Compiler (GHC), the standard implementation of the functional programming language Haskell, which serves as the basis of much research and incorporates advanced results such as new language concepts. The possible design choices for implementing the Thunk Recycling machinery in GHC are discussed, along with the trade-offs behind each decision. Since GHC is largely written in Haskell and is thus a large, sophisticated system built in a functional language, the implementation also serves as a case study in developing large functional software, and the thesis discusses the lessons learned about implementing a lazy functional language system from the viewpoint of modifying a language implementation written in a functional language.

    Finally, the thesis reports experiments with benchmark programs on the GHC implementation. Although the effect on execution time depends on the program, reuse reduced total memory allocation. (University of Electro-Communications)
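    The reuse idea can be modeled compactly outside GHC; the sketch below uses Scala, since the GHC-internal version manipulates heap closures directly. The `Thunk` class and `recycle` method are illustrative assumptions, and the single-reference precondition noted in the comment is exactly what the thesis's compile-time transformation is meant to establish:

    ```scala
    // A thunk is either a pending computation or a memoized value. `recycle`
    // destructively reloads an already-allocated (and no longer shared) thunk
    // with a new pending computation instead of allocating a fresh one.
    final class Thunk[A](private var compute: () => A) {
      private var value: Option[A] = None

      def force(): A = value match {
        case Some(v) => v
        case None =>
          val v = compute(); value = Some(v); compute = null; v
      }

      // Precondition (ensured by the compile-time transformation): this thunk
      // has a single reference, so the destructive update is unobservable.
      def recycle(next: () => A): Thunk[A] = {
        compute = next; value = None; this
      }
    }
    ```

    For example, `new Thunk(() => 1 + 1).force()` computes and memoizes once, and `t.recycle(() => 40 + 2).force()` reuses the same heap object for a new delayed computation instead of allocating another thunk.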

    On Fast Large-Scale Program Analysis in Datalog

    Designing and crafting a static program analysis is challenging due to the complexity of the task at hand. Among the challenges are modelling the semantics of the input language, finding suitable abstractions for the analysis, and handwriting efficient code for the analysis in a traditional imperative language such as C++. Hence, the development of static program analysis tools is costly in terms of development time and resources for real-world languages. To overcome, or at least alleviate, these costs, Datalog has been proposed as a domain-specific language (DSL). With Datalog, a designer expresses a static program analysis in the form of a logical specification. While a DSL approach aids the ease of development of program analyses, it is commonly accepted that such an approach has worse runtime performance than handcrafted static analysis tools. In this work, we introduce a new program synthesis methodology for Datalog specifications to produce highly efficient monolithic C++ analyzers. The synthesis technique re-interprets semi-naïve evaluation as a scaffolding for translation using partial evaluation. To achieve high performance, we employ staged compilation techniques and specialize the underlying relational data structures for a given Datalog specification. Experimentation on benchmarks for large-scale program analysis validates the superior performance of our approach over available Datalog tools and demonstrates our competitiveness with state-of-the-art handcrafted tools.
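    The semi-naïve fixpoint that the synthesis builds on can be seen on the classic transitive-closure program. The sketch below, in Scala standing in for the generated C++ and with illustrative names, joins only each round's newly derived facts (the delta) against the base relation instead of re-joining the whole result:

    ```scala
    // Semi-naive evaluation of: path(x,y) :- edge(x,y).
    //                           path(x,z) :- path(x,y), edge(y,z).
    object SemiNaive {
      def transitiveClosure(edges: Set[(Int, Int)]): Set[(Int, Int)] = {
        var all   = edges // every path fact derived so far
        var delta = edges // facts that are new as of the last round
        while (delta.nonEmpty) {
          val derived = for {
            (x, y)  <- delta            // join only the delta, not all of path
            (y2, z) <- edges if y2 == y
          } yield (x, z)
          delta = derived -- all        // only genuinely new facts seed the next round
          all   = all ++ delta
        }
        all
      }

      def main(args: Array[String]): Unit =
        println(transitiveClosure(Set(1 -> 2, 2 -> 3, 3 -> 4)))
        // Set((1,2), (2,3), (3,4), (1,3), (2,4), (1,4))
    }
    ```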

    Interoperation between Miniboxing and Other Generics Translations

    Generics allow programmers to design algorithms and data structures that operate in the same way regardless of the data used, by abstracting over data types. Generics are useful as they improve the programmer's productivity by raising the level of abstraction, which in turn reduces code duplication and leads to uniform interfaces. However, as data at the low level comes in different shapes and sizes, it is not a trivial job for the compiler to bridge the gap between the uniform interface and the non-uniform low-level implementation. Different approaches are used for generics translation, and all of them can be categorized as either homogeneous or heterogeneous. The characteristic of homogeneous translations is that all the different data representations are transformed into a single identical representation that shares the same low-level code. In heterogeneous translations, code is duplicated and adapted for each incompatible data type. From a programmer's point of view, there should be no difference between a generic method or class compiled with a homogeneous or a heterogeneous translation; the programmer can therefore combine different translations on different parts of the code, and the program has to remain correct. But, as different generics translations are implemented in different ways, interoperation between them introduces noticeable slowdowns: values need to be converted to the foreign object's desired representation, incurring significant performance losses. This thesis explores why these slowdowns happen when different translations interact and proposes ways for them to interoperate more efficiently. The proposed approaches are implemented, and their effectiveness is demonstrated by benchmarking the implementation.
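    The conversion overhead at such a boundary can be pictured with two representations of the same integer sequence; the sketch below is a simplified model with illustrative names, not the thesis's encoding:

    ```scala
    // The same sequence under two translations: a homogeneous (erased, boxed)
    // form and a heterogeneous (specialized, unboxed) one. Crossing the
    // boundary costs a conversion per element.
    object GenericsInterop {
      final case class ErasedSeq(values: Array[AnyRef]) // one representation for all T
      final case class IntSeq(values: Array[Int])       // dedicated unboxed Int variant

      // erased -> specialized boundary: unbox every element
      def toSpecialized(xs: ErasedSeq): IntSeq =
        IntSeq(xs.values.map(_.asInstanceOf[Integer].intValue))

      // specialized -> erased boundary: box every element (one allocation each)
      def toErased(xs: IntSeq): ErasedSeq =
        ErasedSeq(xs.values.map(i => Integer.valueOf(i): AnyRef))
    }
    ```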

    Java Virtual Machine Optimizations for Java and Dynamic Languages

    Thesis (Ph.D.) -- Department of Electrical and Computer Engineering, Graduate School, Seoul National University, February 2017. Advisor: Soo-Mook Moon.

    The Java virtual machine (JVM) was introduced as a machine-independent runtime environment for running Java programs. As a 32-bit stack machine, the JVM can execute the bytecode instructions generated by compiling a Java program on any machine to which the JVM runtime has been correctly ported. The machine-independence of the JVM brought about the huge success of both the Java programming language and the Java virtual machine itself on systems ranging from cloud servers to embedded systems, including handsets and smart cards.

    Since a bytecode instruction must be interpreted by the JVM runtime for execution on top of a specific underlying system, a Java program innately runs slower, due to the interpretation overhead, than a C/C++ program compiled directly for the system. Java just-in-time (JIT) compilers, the de facto performance add-on modules, are employed to improve the performance of a JVM by translating Java bytecode into native machine code on demand. One important problem in Java JIT compilation is how to map stack entries and local variables of the JVM runtime to physical registers efficiently and quickly, since register-based computations are much faster than memory-based ones, while JIT compilation overhead is part of the whole running time.

    This thesis introduces LaTTe, an open-source Java JIT compiler that performs fast generation of efficiently register-mapped RISC code. LaTTe first maps all local variables and stack entries into pseudo registers, followed by real register allocation, which also aggressively coalesces the copies corresponding to pushes and pops between local variables and stack entries. In addition to the efficient register allocation, LaTTe is equipped with various traditional and object-oriented optimizations such as CSE, dynamic method inlining, and specialization. We also devised new mechanisms for Java exception handling and monitor handling in LaTTe, named on-demand exception handling and lightweight monitor, respectively, to boost JVM performance further. Our experimental results indicate that LaTTe's sophisticated register mapping and allocation really pay off, achieving twice the performance of a naive JIT compiler that maps all local variables and stack entries to memory. It is also shown that LaTTe makes a reasonable trade-off between the quality and the speed of register mapping and allocation for the bytecode. We expect these results to be beneficial to parallel and distributed Java computing as well, 1) by enhancing single-thread Java performance and 2) by significantly reducing the number of memory accesses that the rest of the system must properly order to maintain coherence and keep threads synchronized.

    Furthermore, the JVM has recently evolved into a general-purpose language runtime environment for executing popular programming languages such as JavaScript, Ruby, Python, and Scala. These languages have complex non-Java features, including dynamic typing and first-class functions, so additional language runtimes (engines) are provided on top of the JVM to support them with bytecode extensions. Although there are high-performance JVMs with powerful just-in-time (JIT) compilers, running these languages efficiently on the JVM is still a challenge. This thesis introduces a simple and novel technique for the JVM JIT compiler called exceptionization to improve the performance of JVM-based language runtimes.
    We observed that the JVM executing some non-Java languages encounters at least twice as many branch bytecodes as Java, most of which are highly biased to take only one target. Exceptionization treats such a highly biased branch as an implicit exception-throwing instruction. This allows the JVM JIT compiler to prune the infrequent target of the branch from the frequent control flow, thus compiling the frequent control flow more aggressively with better optimization. If a pruned path is taken, it runs like a Java exception handler, i.e., a catch block. We also devised de-exceptionization, a mechanism to cope with the case when a pruned path is actually executed more often than expected.

    Since exceptionization is a generic JVM optimization, independent of any specific language runtime, it is generally applicable to any language runtime on the JVM. Our experimental results show that exceptionization accelerates the performance of several non-Java languages: JavaScript-on-JVM runs faster by as much as 60%, and by 6% on average, when running the Octane benchmark suite on Oracle's latest Nashorn JavaScript engine and HotSpot 1.9 JVM; Ruby-on-JVM improves by as much as 60% and by 6% on average, and Python-on-JVM by as much as 6%. We found that exceptionization is most effectively applicable to the branch bytecodes of the language runtime itself, rather than to the bytecode corresponding to the application code or to the Java class libraries. This implies that the performance benefit of exceptionization comes from better JIT compilation of the non-Java language runtime.
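    The control-flow shape of exceptionization can be mimicked at the source level. The sketch below is only an analogy with illustrative names: the real transformation rewrites biased branch bytecodes inside the JIT, where the pruned path genuinely disappears from the compiled hot code, whereas a literal source-level throw would not by itself be profitable:

    ```scala
    object ExceptionizationShape {
      final class RareCase extends RuntimeException // stands in for the implicit exception

      def slowPath(x: Int): Int = -x

      // before: the rare target sits in the middle of the frequent control flow
      def dispatchBranchy(tag: Int, x: Int): Int =
        if (tag != 0) slowPath(x) // highly biased: almost never taken
        else x + 1

      // after: the frequent target compiles as straight-line code and the
      // rare target is reached like a Java catch block
      def dispatchExceptionized(tag: Int, x: Int): Int =
        try {
          if (tag != 0) throw new RareCase
          x + 1
        } catch { case _: RareCase => slowPath(x) }
    }
    ```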