2,699 research outputs found

    Portable and Accurate Collection of Calling-Context-Sensitive Bytecode Metrics for the Java Virtual Machine

    Get PDF
    Calling-context profiles and dynamic metrics at the bytecode level are important for profiling, workload characterization, program comprehension, and reverse engineering. Prevailing tools for collecting calling-context profiles or dynamic bytecode metrics often provide only incomplete information or suffer from limited compatibility with standard JVMs. However, completeness and accuracy of the profiles is essential for tasks such as workload characterization, and compatibility with standard JVMs is important to ensure that complex workloads can be executed. In this paper, we present the design and implementation of JP2, a new tool that profiles both the inter- and intra-procedural control flow of workloads on standard JVMs. JP2 produces calling-context profiles preserving callsite information, as well as execution statistics at the level of individual basic blocks of code. JP2 is complemented with scripts that compute various dynamic bytecode metrics from the profiles. As a case-study and tutorial on the use of JP2, we use it for cross-profiling for an embedded Java processor

    Profiling Power Consumption on Mobile Devices

    Get PDF
    The proliferation of mobile devices, and the migration of the information access paradigm to mobile platforms, motivate studies of power consumption behaviors with the purpose of increasing the device battery life. The aim of this work is to profile the power consumption of a Samsung Galaxy I7500 and a Samsung Nexus S, in order to understand how such feature has evolved over the years. We performed two experiments: the first one measures consumption for a set of usage scenarios, which represent common daily user activities, while the second one analyzes a context-aware application with a known source code. The first experiment shows that the most recent device in terms of OS and hardware components shows significantly lower consumption than the least recent one. The second experiment shows that the impact of different configurations of the same application causes a different power consumption behavior on both smartphones. Our results show that hardware improvements and energy-aware software applications greatly impact the energy efficiency of mobile device

    Coz: Finding Code that Counts with Causal Profiling

    Full text link
    Improving performance is a central concern for software developers. To locate optimization opportunities, developers rely on software profilers. However, these profilers only report where programs spent their time: optimizing that code may have no impact on performance. Past profilers thus both waste developer time and make it difficult for them to uncover significant optimization opportunities. This paper introduces causal profiling. Unlike past profiling approaches, causal profiling indicates exactly where programmers should focus their optimization efforts, and quantifies their potential impact. Causal profiling works by running performance experiments during program execution. Each experiment calculates the impact of any potential optimization by virtually speeding up code: inserting pauses that slow down all other code running concurrently. The key insight is that this slowdown has the same relative effect as running that line faster, thus "virtually" speeding it up. We present Coz, a causal profiler, which we evaluate on a range of highly-tuned applications: Memcached, SQLite, and the PARSEC benchmark suite. Coz identifies previously unknown optimization opportunities that are both significant and targeted. Guided by Coz, we improve the performance of Memcached by 9%, SQLite by 25%, and accelerate six PARSEC applications by as much as 68%; in most cases, these optimizations involve modifying under 10 lines of code.Comment: Published at SOSP 2015 (Best Paper Award

    A Fast Causal Profiler for Task Parallel Programs

    Full text link
    This paper proposes TASKPROF, a profiler that identifies parallelism bottlenecks in task parallel programs. It leverages the structure of a task parallel execution to perform fine-grained attribution of work to various parts of the program. TASKPROF's use of hardware performance counters to perform fine-grained measurements minimizes perturbation. TASKPROF's profile execution runs in parallel using multi-cores. TASKPROF's causal profile enables users to estimate improvements in parallelism when a region of code is optimized even when concrete optimizations are not yet known. We have used TASKPROF to isolate parallelism bottlenecks in twenty three applications that use the Intel Threading Building Blocks library. We have designed parallelization techniques in five applications to in- crease parallelism by an order of magnitude using TASKPROF. Our user study indicates that developers are able to isolate performance bottlenecks with ease using TASKPROF.Comment: 11 page

    Adaptive sampling-based profiling techniques for optimizing the distributed JVM runtime

    Get PDF
    Extending the standard Java virtual machine (JVM) for cluster-awareness is a transparent approach to scaling out multithreaded Java applications. While this clustering solution is gaining momentum in recent years, efficient runtime support for fine-grained object sharing over the distributed JVM remains a challenge. The system efficiency is strongly connected to the global object sharing profile that determines the overall communication cost. Once the sharing or correlation between threads is known, access locality can be optimized by collocating highly correlated threads via dynamic thread migrations. Although correlation tracking techniques have been studied in some page-based sof Tware DSM systems, they would entail prohibitively high overheads and low accuracy when ported to fine-grained object-based systems. In this paper, we propose a lightweight sampling-based profiling technique for tracking inter-thread sharing. To preserve locality across migrations, we also propose a stack sampling mechanism for profiling the set of objects which are tightly coupled with a migrant thread. Sampling rates in both techniques can vary adaptively to strike a balance between preciseness and overhead. Such adaptive techniques are particularly useful for applications whose sharing patterns could change dynamically. The profiling results can be exploited for effective thread-to-core placement and dynamic load balancing in a distributed object sharing environment. We present the design and preliminary performance result of our distributed JVM with the profiling implemented. Experimental results show that the profiling is able to obtain over 95% accurate global sharing profiles at a cost of only a few percents of execution time increase for fine- to medium- grained applications. © 2010 IEEE.published_or_final_versionThe 24th IEEE International Symposium on Parallel & Distributed Processing (IPDPS 2010), Atlanta, GA., 19-23 April 2010. In Proceedings of the 24th IPDPS, 2010, p. 1-1

    Cache-aware cross-profiling for java processors

    Full text link

    Observable dynamic compilation

    Get PDF
    Managed language platforms such as the Java Virtual Machine rely on a dynamic compiler to achieve high performance. Despite the benefits that dynamic compilation provides, it also introduces some challenges to program profiling. Firstly, profilers based on bytecode instrumentation may yield wrong results in the presence of an optimizing dynamic compiler, either due to not being aware of optimizations, or because the inserted instrumentation code disrupts such optimizations. To avoid such perturbations, we present a technique to make profilers based on bytecode instrumentation aware of the optimizations performed by the dynamic compiler, and make the dynamic compiler aware of the inserted code. We implement our technique for separating inserted instrumentation code from base-program code in Oracle's Graal compiler, integrating our extension into the OpenJDK Graal project. We demonstrate its significance with concrete profilers. On the one hand, we improve accuracy of existing profiling techniques, for example, to quantify the impact of escape analysis on bytecode-level allocation profiling, to analyze object life-times, and to evaluate the impact of method inlining when profiling method invocations. On the other hand, we also illustrate how our technique enables new kinds of profilers, such as a profiler for non-inlined callsites, and a testing framework for locating performance bugs in dynamic compiler implementations. Secondly, the lack of profiling support at the intermediate representation (IR) level complicates the understanding of program behavior in the compiled code. This issue cannot be addressed by bytecode instrumentation because it cannot precisely capture the occurrence of IR-level operations. Binary instrumentation is not suited either, as it lacks a mapping from the collected low-level metrics to higher-level operations of the observed program. To fill this gap, we present an easy-to-use event-based framework for profiling operations at the IR level. We integrate the IR profiling framework in the Graal compiler, together with our instrumentation-separation technique. We illustrate our approach with a profiler that tracks the execution of memory barriers within compiled code. In addition, using a deoptimization profiler based on our IR profiling framework, we conduct an empirical study on deoptimization in the Graal compiler. We focus on situations which cause program execution to switch from machine code to the interpreter, and compare application performance using three different deoptimization strategies which influence the amount of extra compilation work done by Graal. Using an adaptive deoptimization strategy, we manage to improve the average start-up performance of benchmarks from the DaCapo, ScalaBench, and Octane suites by avoiding wasted compilation work. We also find that different deoptimization strategies have little impact on steady- state performance

    Aspect weaving in standard Java class libraries

    Full text link

    Darwinian Data Structure Selection

    Get PDF
    Data structure selection and tuning is laborious but can vastly improve an application's performance and memory footprint. Some data structures share a common interface and enjoy multiple implementations. We call them Darwinian Data Structures (DDS), since we can subject their implementations to survival of the fittest. We introduce ARTEMIS a multi-objective, cloud-based search-based optimisation framework that automatically finds optimal, tuned DDS modulo a test suite, then changes an application to use that DDS. ARTEMIS achieves substantial performance improvements for \emph{every} project in 55 Java projects from DaCapo benchmark, 88 popular projects and 3030 uniformly sampled projects from GitHub. For execution time, CPU usage, and memory consumption, ARTEMIS finds at least one solution that improves \emph{all} measures for 86%86\% (37/4337/43) of the projects. The median improvement across the best solutions is 4.8%4.8\%, 10.1%10.1\%, 5.1%5.1\% for runtime, memory and CPU usage. These aggregate results understate ARTEMIS's potential impact. Some of the benchmarks it improves are libraries or utility functions. Two examples are gson, a ubiquitous Java serialization framework, and xalan, Apache's XML transformation tool. ARTEMIS improves gson by 16.516.5\%, 1%1\% and 2.2%2.2\% for memory, runtime, and CPU; ARTEMIS improves xalan's memory consumption by 23.523.5\%. \emph{Every} client of these projects will benefit from these performance improvements.Comment: 11 page
    • …
    corecore