    An Empirical Study on Deoptimization in the Graal Compiler

    Managed language platforms such as the Java Virtual Machine or the Common Language Runtime rely on a dynamic compiler to achieve high performance. Besides making optimization decisions based on the actual program execution and the underlying hardware platform, a dynamic compiler is also in an ideal position to perform speculative optimizations. However, these tend to increase compilation costs, because unsuccessful speculations trigger deoptimization and recompilation of the affected parts of the program, wasting previous work. Even though speculative optimizations are widely used, their costs in terms of extra compilation work have not been previously studied. In this paper, we analyze the behavior of the Graal dynamic compiler integrated in Oracle's HotSpot Virtual Machine. We focus on situations that cause program execution to switch from machine code to the interpreter, and compare application performance using three different deoptimization strategies that influence the amount of extra compilation work done by Graal. Using an adaptive deoptimization strategy, we managed to improve the average start-up performance of benchmarks from the DaCapo, ScalaBench, and Octane benchmark suites, mostly by avoiding wasted compilation work. On a single-core system, we observed an average speed-up of 6.4% for the DaCapo and ScalaBench workloads, and a speed-up of 5.1% for the Octane workloads; the improvement decreases with an increasing number of available CPU cores. We also find that the choice of a deoptimization strategy has negligible impact on steady-state performance. This indicates that the cost of speculation matters mainly during start-up, where it can disturb the delicate balance between executing the program and the compiler, but is quickly amortized in steady state.
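
    The general idea of reacting adaptively to failed speculations can be pictured with a small sketch. It is not the strategy evaluated in the paper and does not use any Graal API: the class name, the per-site counter, the threshold, and the action strings are all hypothetical, chosen only to show how compiled code might be kept after the first few deoptimizations and discarded only once a site deoptimizes repeatedly.

```python
from collections import defaultdict


class AdaptiveDeoptPolicy:
    """Hypothetical policy (an assumption, not Graal's implementation):
    tolerate a few deoptimizations per site before discarding the
    compiled code and scheduling a recompilation."""

    def __init__(self, invalidate_threshold=3):
        # Number of deoptimizations at one site that triggers invalidation.
        self.invalidate_threshold = invalidate_threshold
        # Deoptimization counts keyed by (method, site).
        self.deopt_counts = defaultdict(int)

    def on_deoptimization(self, method, site):
        """Return the action to take after a deoptimization at (method, site)."""
        key = (method, site)
        self.deopt_counts[key] += 1
        if self.deopt_counts[key] >= self.invalidate_threshold:
            # The speculation keeps failing: give up on this compiled code.
            self.deopt_counts[key] = 0
            return "invalidate_and_recompile"
        # A rare failure: keep the machine code and avoid extra compilation.
        return "keep_compiled_code"
```

    With a threshold of 2, the first call to on_deoptimization("foo", 17) keeps the compiled code and the second one requests invalidation, which is the kind of trade-off between wasted compilation work and repeated interpreter fallbacks that the abstract describes.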

    Phase-based adaptive recompilation in a JVM

    Novel online profiling for virtual machines

    Application profiling is a popular technique to improve program performance based on its behavior. Offline profiling, although beneficial for several applications, fails in cases where prior program runs may not be feasible, or if changes in input cause the profile to not match the behavior of the actual program run. Managed languages, like Java and C#, provide a unique opportunity to overcome the drawbacks of offline profiling by generating the profile information online during the current program run. Indeed, online profiling is extensively used in current VMs, especially during selective compilation to improve program startup performance, as well as during other feedback-directed optimizations. In this paper we illustrate the drawbacks of the current reactive mechanism of online profiling during selective compilation. Current VM profiling mechanisms are slow, thereby delaying associated transformations, and estimate future behavior based on the program's immediate past, leading to potential misspeculation that limits the benefits of compilation. We show that these drawbacks produce an average performance loss of over 14.5% on our set of benchmark programs, compared with an ideal offline approach that accurately compiles the hot methods early. We then propose and evaluate the potential of a novel strategy to achieve similar performance benefits with an online profiling approach. Our new online profiling strategy uses early determination of loop iteration bounds to predict future method hotness. We explore and present promising results on the potential, feasibility, and other issues involved in the successful implementation of this approach.
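
    The idea of predicting hotness from loop iteration bounds can be illustrated with a minimal sketch. The abstract does not give the actual prediction rule, so everything here is an assumption: the function names, the compile threshold, and the simplification that a call site executes once per iteration of every enclosing loop.

```python
from math import prod

# Hypothetical invocation count above which a method is treated as hot.
COMPILE_THRESHOLD = 10_000


def predicted_invocations(enclosing_loop_bounds):
    """Estimate how often a call site will run as the product of the
    iteration bounds of the loops that enclose it (an assumed model)."""
    return prod(enclosing_loop_bounds)


def should_compile_early(enclosing_loop_bounds, threshold=COMPILE_THRESHOLD):
    """Decide, before the loops have run, whether the callee is likely
    to become hot and could be compiled ahead of the usual counters."""
    return predicted_invocations(enclosing_loop_bounds) >= threshold
```

    For a call nested in loops with bounds 200 and 100, the estimate is 20,000 invocations, so the callee would be queued for compilation as soon as the bounds are known rather than after a counter has slowly crossed the threshold.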

    Three pitfalls in Java performance evaluation

    The Java programming language has seen remarkable growth over the last decade. This is partially due to the infrastructure required to run Java applications on general purpose microprocessors: a Java virtual machine (VM). The VM ensures that Java applications are portable across different hardware platforms, because it shelters the applications from the underlying system. Hence the motto "write once, run (almost) anywhere". Java applications are compiled to an intermediate form, called bytecode, and consist of a number of so-called class files. The virtual machine takes care of class loading, interpreting or compiling the bytecode to the native code of the underlying hardware platform, thread scheduling, garbage collection, etc. As such, during the execution of a Java application, the VM regularly intervenes to take care of housekeeping tasks and to optimise the application as it is executing. Furthermore, the specific implementation details of most virtual machines insert non-deterministic behaviour, not into the semantic part of the execution, but rather into the lower level execution. For example, to bring a Java application up to competitive speed with classical compiled programs written in languages such as C, the virtual machine needs to optimise Java bytecode. To limit the execution overhead, most virtual machines use a time sampling mechanism to determine the hot methods in the application. This introduces non-determinism, as over several runs, the methods are not always optimised at the same moment, nor is the set of optimised methods always the same. Other factors that introduce non-determinism are thread scheduling, garbage collection, etc. It is readily seen that performance analysis of Java applications is not as simple as it seems at first, and warrants closer inspection. In this dissertation we are mainly interested in the behaviour of Java applications and their performance.
In the course of this work, we uncovered three major pitfalls that were not taken into account by researchers when analysing Java performance prior to this work. We will briefly summarise the main achievements presented in this dissertation. The first pitfall we present involves the interaction between the virtual machine, the application and the input to the application. The performance for short running applications is shown to be mainly determined by the virtual machine. For longer running applications, this influence decreases, but remains tangible. We use statistical analysis, such as principal components analysis and cluster analysis (K-means and hierarchical clustering), to demonstrate and clarify the pitfall. By means of a large number of performance characteristics measured using hardware performance counters, five virtual machines and fourteen benchmarks with both a small and a large input size, we demonstrate that short running workloads are primarily clustered by virtual machine. Even for long running applications from the SPECjvm98 benchmark suite, the virtual machine still exerts a large influence on the observed behaviour at the microarchitectural level. This work has shown the need for both larger and longer running benchmarks than were available prior to it (a need that was partially met by the introduction of the DaCapo benchmark suite), as well as for careful consideration when setting up an experiment to avoid measuring the virtual machine, rather than the benchmark. Prior to this work, people were quite often using simulation with short running applications (to save time) for exploring Java performance. The second pitfall we uncover involves the analysis of performance numbers.
During a survey of 50 papers published at premier conferences, such as OOPSLA, PLDI, CGO, ISMM and VEE, over the past seven years, we found that a variety of approaches are used, both for experimental design (for example, the input size, virtual machines, heap sizes, etc.) and, even more importantly, for data analysis (for example, using a best out of 3 performance number). New techniques are pitted against existing work using these prevalent approaches, and conclusions regarding their success in beating the prior state of the art are based upon them. Given the fact that the execution of Java applications usually involves non-determinism in the virtual machine (for example, when determining which methods to optimise), it should come as no surprise that the lack of statistical rigour in these prevalent approaches leads to misleading or even incorrect conclusions. By this we mean that the conclusions are either not representative of what actually happens, or even contradict reality, as modelled in a statistical manner. To circumvent this pitfall, we propose a rigorous statistical approach that uses confidence intervals to both report and compare performance numbers. We also claim that sufficient experiments should be conducted to get a reliable performance measure. The non-determinism caused by the timer-based optimisation component in a virtual machine can be eliminated using so-called replay compilation. This technique records a compilation plan during a first execution or profiling run of the application. During a second execution, the application is iterated twice: once to compile and optimise all methods found in the compilation plan, and a second time to perform the actual measurement.
It turns out, however, that the current practice of using either a single plan (corresponding to the best performing profiling run) or a combined plan (choosing the methods that were optimised in, say, more than half the profiling runs) is no match for using multiple plans. The variability observed in the plans themselves is too large to capture with either of the current practices. Consequently, using multiple plans is definitely the better option. Moreover, this allows using a matched-pair approach in the data analysis, which results in tighter confidence intervals for the mean performance number. The third pitfall we examine is the usage of global performance numbers when tuning either an application or a virtual machine. We show that Java applications exhibit phase behaviour at the method level. This means that instances of the same method show more similarity to each other, in terms of behaviour, than to instances of other methods. A phase can then be identified as a set of sub-trees of the dynamic call-tree, with each sub-tree headed by the same method. We present a two-step algorithm that allows correlating hardware performance counter data in step 2 with the phases determined in step 1. The information obtained can be applied to show the programmer which methods perform worse than average, for example with respect to the number of cache misses they incur. In the dissertation, we pay particular attention to statistical rigour. For each pitfall, we use statistics to demonstrate its presence. Hopefully this work will encourage other researchers to use more rigour in their work as well.
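
    The use of confidence intervals to report and compare performance numbers, including the matched-pair variant, can be sketched as follows. This is an illustration rather than the dissertation's procedure: the function names are hypothetical, and the interval uses a normal approximation, which assumes a reasonably large number of runs (for a handful of runs, a Student's t quantile would be the appropriate replacement).

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def confidence_interval(samples, confidence=0.95):
    """Confidence interval for the mean of `samples`.

    Assumption: normal approximation, suitable for roughly 30+ runs."""
    n = len(samples)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * stdev(samples) / sqrt(n)
    centre = mean(samples)
    return (centre - half_width, centre + half_width)


def matched_pair_interval(baseline, candidate, confidence=0.95):
    """Interval for the mean per-pair difference, where run i of both
    configurations shares the same conditions (e.g. the same plan)."""
    differences = [b - c for b, c in zip(baseline, candidate)]
    return confidence_interval(differences, confidence)


def differs_significantly(interval):
    """Two configurations differ at the chosen confidence level when the
    interval for their difference does not contain zero."""
    low, high = interval
    return low > 0 or high < 0
```

    Pairing the measurements removes the variability that both configurations share, which is why the interval on the differences is typically tighter than the two separate intervals; a "best of 3" number, by contrast, carries no information about variability at all.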

    Observable dynamic compilation

    Managed language platforms such as the Java Virtual Machine rely on a dynamic compiler to achieve high performance. Despite the benefits that dynamic compilation provides, it also introduces some challenges to program profiling. Firstly, profilers based on bytecode instrumentation may yield wrong results in the presence of an optimizing dynamic compiler, either due to not being aware of optimizations, or because the inserted instrumentation code disrupts such optimizations. To avoid such perturbations, we present a technique to make profilers based on bytecode instrumentation aware of the optimizations performed by the dynamic compiler, and make the dynamic compiler aware of the inserted code. We implement our technique for separating inserted instrumentation code from base-program code in Oracle's Graal compiler, integrating our extension into the OpenJDK Graal project. We demonstrate its significance with concrete profilers. On the one hand, we improve the accuracy of existing profiling techniques, for example, to quantify the impact of escape analysis on bytecode-level allocation profiling, to analyze object lifetimes, and to evaluate the impact of method inlining when profiling method invocations. On the other hand, we also illustrate how our technique enables new kinds of profilers, such as a profiler for non-inlined callsites, and a testing framework for locating performance bugs in dynamic compiler implementations. Secondly, the lack of profiling support at the intermediate representation (IR) level complicates the understanding of program behavior in the compiled code. This issue cannot be addressed by bytecode instrumentation because it cannot precisely capture the occurrence of IR-level operations. Binary instrumentation is not suited either, as it lacks a mapping from the collected low-level metrics to higher-level operations of the observed program.
To fill this gap, we present an easy-to-use event-based framework for profiling operations at the IR level. We integrate the IR profiling framework in the Graal compiler, together with our instrumentation-separation technique. We illustrate our approach with a profiler that tracks the execution of memory barriers within compiled code. In addition, using a deoptimization profiler based on our IR profiling framework, we conduct an empirical study on deoptimization in the Graal compiler. We focus on situations that cause program execution to switch from machine code to the interpreter, and compare application performance using three different deoptimization strategies that influence the amount of extra compilation work done by Graal. Using an adaptive deoptimization strategy, we manage to improve the average start-up performance of benchmarks from the DaCapo, ScalaBench, and Octane suites by avoiding wasted compilation work. We also find that different deoptimization strategies have little impact on steady-state performance.

    Génération efficace de graphes d'appels dynamiques complets (Efficient generation of complete dynamic call graphs)

    Code analysis is used to verify code functionality, detect bugs or improve performance. Analyzing the code can be done either statically or dynamically. Approaches combining both analysis techniques are most appropriate for industrial-scale applications, where each one individually cannot provide the desired results. Blended analysis, for example, first applies dynamic analysis to identify problematic code regions and then performs a focused static analysis on these regions. However, the existing dynamic analysis tools generate inaccurate or incomplete data, or result in unacceptably slow execution times. In this work, we focus on the generation of complete dynamic call graphs with the additional information required for blended analysis. We make use of dynamic instrumentation of Java bytecode to extract information about call sites and object creation sites, and to build the dynamic call graph of the program. We demonstrate that it is possible to profile real-world applications and efficiently extract complete and accurate information. Performance measurements of our profiler on three sets of benchmarks with various workloads place the average profiling overhead between 2.01 and 6.42. Our profiling tool for generating complete dynamic call graphs, named dyko, is also an extensible platform for evaluating new instrumentation approaches. We tested a new adaptive instrumentation technique for object creation sites, which tailors the instrumentation to the bytecode of each method. We also tested the impact of call site resolution on the overall performance of the profiler.

    프로그래밍 언어 런타임에서의 응용프로그램 시작 가속을 위한 최적화 (Optimizations for accelerating application start-up in programming language runtimes)

    Thesis (Ph.D.) -- Seoul National University Graduate School: Department of Electrical and Computer Engineering, August 2015. Soo-Mook Moon.

    Runtime environments that execute programming languages such as Java and JavaScript are widely used as embedded software platforms because they make applications portable. Java applications are distributed as bytecode and run on digital televisions and the Android platform, while JavaScript is executed in source form on the web platform. This portability, however, inherently invites performance problems, because the bytecode or source code of the application is executed by software such as an interpreter rather than by hardware. To obtain better performance, language runtimes therefore apply a just-in-time compiler (JITC), which translates bytecode or source code to machine code during execution, or optimizations specialized for repeated operations, such as inline caching. Meanwhile, for Java applications running on embedded systems and for JavaScript executed while a web page loads, start-up behavior with its abrupt changes is more prominent than steady-state behavior. Such workloads have relatively short running times, a low tendency to repeat the same operations, and few hot spots that account for a large share of the running time. JIT compilers, which are effective for hot spots, and optimizations specialized for repeated operations can hardly improve performance for this kind of start-up behavior.

    This thesis proposes a hot spot detection technique driven by a running-time estimate that is more precise than existing methods, aiming to improve the speed-up obtained from the Java JIT compiler when hot spots are unclear. As a result, the first-run time of benchmark programs that exhibit start-up behavior was accelerated by about 10% compared with the hot spot detection of the existing HotSpot Java virtual machine. The start-up time of Xlets distributed via digital broadcasting, taken as real applications, also improved by about 7%. In addition, to reduce the size of the machine code generated by a JavaScript JIT compiler, we propose a technique that generates machine code optimized for a reduced instruction set. This reduced the machine code size by about 29%, a result that can be even more effective for the large amount of JavaScript executed during web page start-up. We also found that executing JavaScript with only a JIT compiler degrades web page JavaScript start-up speed, and we minimized this degradation by applying selective compilation on top of interpreter execution. Finally, based on an analysis of the execution behavior of web page JavaScript start-up, we propose bytecode-level optimizations that accelerate frequently occurring object accesses. While adding a JIT compiler to interpreter execution did not improve web page JavaScript start-up performance, the proposed bytecode-level optimizations sped up execution by about 3%, confirming that they are more effective for web page JavaScript start-up.

    Contents:
    Chapter 1. Introduction
        1.1 Hot Spot Detection
        1.2 Memory Consumption of JIT Compiled Code
        1.3 Web Page JavaScript Performance with JITC
    Chapter 2. Enhanced Hot Spot Detection
        2.1 Previous Approaches to Hot Spot Detection
            2.1.1 Simple Heuristic
            2.1.2 Hot Heuristic
            2.1.3 Static Analysis Heuristic
        2.2 Flow-Sensitive Runtime Estimation
        2.3 Static-FSRE for First-Invocation Compilation
        2.4 Merged Heuristic of Dynamic and Static FSRE
            2.4.1 Threshold of FSRE
            2.4.2 Merged Heuristic
        2.5 Experimental Results
            2.5.1 Benchmark Results: Experimental Environment; Evaluation Heuristics; Performance of the Five Heuristics; Preciseness of Hot Spot Detection; Hot Spot Detection Time; Hot Spot Detection Overhead
            2.5.2 Digital TV Java Xlet Results: DTV Environment and Java Xlet application; Heuristic Adjustments; Performance Improvement and Comparison
    Chapter 3. Code Size Optimization for JITC
        3.1 JavaScript JITC in SFX and Thumb2
            3.1.1 JavaScript and Execution Semantics
            3.1.2 SquirrelFish Extreme and the Bytecode
            3.1.3 SFX JITC Architecture
            3.1.4 JITC Code Generation for Thumb2
        3.2 SFX JITC Optimizations for Thumb2
            3.2.1 Code Generation with Register Re-map
            3.2.2 Constant Pool Aggregation
            3.2.3 Patching PC-relative Branches
        3.3 Experimental Result
            3.3.1 Experimental Environment
            3.3.2 Code Size Result
            3.3.3 Performance Result
    Chapter 4. Selective JITC for Web Page JavaScript
        4.1 JavaScript and SFX JITC
            4.1.1 JavaScript and Interaction with DOM
            4.1.2 SFX JITC and Its Architecture
            4.1.3 Benchmark JavaScript and Web Page JavaScript
        4.2 Selective JITC for the SFX
            4.2.1 Selective JITC
            4.2.2 Selective JITC Implementation for the SFX
        4.3 Experimental Result
            4.3.1 Experiment Environment
            4.3.2 Web Page JavaScript and SunSpider Benchmark
            4.3.3 Web page JavaScript Execution Time
            4.3.4 Comparison to Benchmark Execution Time
            4.3.5 Evaluation of the Selective JITC Heuristic
            4.3.6 Discussions
    Chapter 5. Bytecode Level Optimizations
        5.1 Analysis on Web Page JavaScript Execution
        5.2 Overhead in Property Accesses
        5.3 Super-Bytecode Construction (SBC)
        5.4 Bytecode Chaining (BC)
        5.5 Experimental Evaluation
            5.5.1 Performance Result
            5.5.2 Performance Analysis: Optimized Runtime Services with SBC; Removed Runtime Services with BC
    Chapter 6. Related Work
    Chapter 7. Conclusion
    Bibliography
    Abstract

    Profileringstechnieken voor prestatieanalyse en optimalisatie van Javaprogramma's (Profiling techniques for performance analysis and optimization of Java programs)

    Using HPM-Sampling to Drive Dynamic Compilation

    All high-performance production JVMs employ an adaptive strategy for program execution. Methods are first executed unoptimized, and then an online profiling mechanism is used to find a subset of methods that should be optimized during the same execution. This paper empirically evaluates the design space of several profilers for initiating dynamic compilation and shows that existing online profiling schemes suffer from several limitations. They provide an insufficient number of samples, are untimely, and have limited accuracy at determining the frequently executed methods. We describe and comprehensively evaluate HPM-sampling, a simple but effective profiling scheme for finding optimization candidates using hardware performance monitors (HPMs) that addresses the aforementioned limitations. We show that HPM-sampling is more accurate, has low overhead, and improves performance by 5.7% on average and up to 18.3% when compared to the default system in Jikes RVM, without changing the compiler.