2,699 research outputs found
Portable and Accurate Collection of Calling-Context-Sensitive Bytecode Metrics for the Java Virtual Machine
Calling-context profiles and dynamic metrics at the bytecode level are important for profiling, workload characterization, program comprehension, and reverse engineering. Prevailing tools for collecting calling-context profiles or dynamic bytecode metrics often provide only incomplete information or suffer from limited compatibility with standard JVMs. However, completeness and accuracy of the profiles is essential for tasks such as workload characterization, and compatibility with standard JVMs is important to ensure that complex workloads can be executed. In this paper, we present the design and implementation of JP2, a new tool that profiles both the inter- and intra-procedural control flow of workloads on standard JVMs. JP2 produces calling-context profiles preserving callsite information, as well as execution statistics at the level of individual basic blocks of code. JP2 is complemented with scripts that compute various dynamic bytecode metrics from the profiles. As a case-study and tutorial on the use of JP2, we use it for cross-profiling for an embedded Java processor
Profiling Power Consumption on Mobile Devices
The proliferation of mobile devices, and the migration of the information access paradigm to mobile platforms, motivate studies of power consumption behaviors with the purpose of increasing the device battery life. The aim of this work is to profile the power consumption of a Samsung Galaxy I7500 and a Samsung Nexus S, in order to understand how such feature has evolved over the years. We performed two experiments: the first one measures consumption for a set of usage scenarios, which represent common daily user activities, while the second one analyzes a context-aware application with a known source code. The first experiment shows that the most recent device in terms of OS and hardware components shows significantly lower consumption than the least recent one. The second experiment shows that the impact of different configurations of the same application causes a different power consumption behavior on both smartphones. Our results show that hardware improvements and energy-aware software applications greatly impact the energy efficiency of mobile device
Coz: Finding Code that Counts with Causal Profiling
Improving performance is a central concern for software developers. To locate
optimization opportunities, developers rely on software profilers. However,
these profilers only report where programs spent their time: optimizing that
code may have no impact on performance. Past profilers thus both waste
developer time and make it difficult for them to uncover significant
optimization opportunities.
This paper introduces causal profiling. Unlike past profiling approaches,
causal profiling indicates exactly where programmers should focus their
optimization efforts, and quantifies their potential impact. Causal profiling
works by running performance experiments during program execution. Each
experiment calculates the impact of any potential optimization by virtually
speeding up code: inserting pauses that slow down all other code running
concurrently. The key insight is that this slowdown has the same relative
effect as running that line faster, thus "virtually" speeding it up.
We present Coz, a causal profiler, which we evaluate on a range of
highly-tuned applications: Memcached, SQLite, and the PARSEC benchmark suite.
Coz identifies previously unknown optimization opportunities that are both
significant and targeted. Guided by Coz, we improve the performance of
Memcached by 9%, SQLite by 25%, and accelerate six PARSEC applications by as
much as 68%; in most cases, these optimizations involve modifying under 10
lines of code.Comment: Published at SOSP 2015 (Best Paper Award
A Fast Causal Profiler for Task Parallel Programs
This paper proposes TASKPROF, a profiler that identifies parallelism
bottlenecks in task parallel programs. It leverages the structure of a task
parallel execution to perform fine-grained attribution of work to various parts
of the program. TASKPROF's use of hardware performance counters to perform
fine-grained measurements minimizes perturbation. TASKPROF's profile execution
runs in parallel using multi-cores. TASKPROF's causal profile enables users to
estimate improvements in parallelism when a region of code is optimized even
when concrete optimizations are not yet known. We have used TASKPROF to isolate
parallelism bottlenecks in twenty three applications that use the Intel
Threading Building Blocks library. We have designed parallelization techniques
in five applications to in- crease parallelism by an order of magnitude using
TASKPROF. Our user study indicates that developers are able to isolate
performance bottlenecks with ease using TASKPROF.Comment: 11 page
Adaptive sampling-based profiling techniques for optimizing the distributed JVM runtime
Extending the standard Java virtual machine (JVM) for cluster-awareness is a transparent approach to scaling out multithreaded Java applications. While this clustering solution is gaining momentum in recent years, efficient runtime support for fine-grained object sharing over the distributed JVM remains a challenge. The system efficiency is strongly connected to the global object sharing profile that determines the overall communication cost. Once the sharing or correlation between threads is known, access locality can be optimized by collocating highly correlated threads via dynamic thread migrations. Although correlation tracking techniques have been studied in some page-based sof Tware DSM systems, they would entail prohibitively high overheads and low accuracy when ported to fine-grained object-based systems. In this paper, we propose a lightweight sampling-based profiling technique for tracking inter-thread sharing. To preserve locality across migrations, we also propose a stack sampling mechanism for profiling the set of objects which are tightly coupled with a migrant thread. Sampling rates in both techniques can vary adaptively to strike a balance between preciseness and overhead. Such adaptive techniques are particularly useful for applications whose sharing patterns could change dynamically. The profiling results can be exploited for effective thread-to-core placement and dynamic load balancing in a distributed object sharing environment. We present the design and preliminary performance result of our distributed JVM with the profiling implemented. Experimental results show that the profiling is able to obtain over 95% accurate global sharing profiles at a cost of only a few percents of execution time increase for fine- to medium- grained applications. © 2010 IEEE.published_or_final_versionThe 24th IEEE International Symposium on Parallel & Distributed Processing (IPDPS 2010), Atlanta, GA., 19-23 April 2010. In Proceedings of the 24th IPDPS, 2010, p. 1-1
Observable dynamic compilation
Managed language platforms such as the Java Virtual Machine rely on a dynamic compiler to achieve high performance. Despite the benefits that dynamic compilation provides, it also introduces some challenges to program profiling. Firstly, profilers based on bytecode instrumentation may yield wrong results in the presence of an optimizing dynamic compiler, either due to not being aware of optimizations, or because the inserted instrumentation code disrupts such optimizations. To avoid such perturbations, we present a technique to make profilers based on bytecode instrumentation aware of the optimizations performed by the dynamic compiler, and make the dynamic compiler aware of the inserted code. We implement our technique for separating inserted instrumentation code from base-program code in Oracle's Graal compiler, integrating our extension into the OpenJDK Graal project. We demonstrate its significance with concrete profilers. On the one hand, we improve accuracy of existing profiling techniques, for example, to quantify the impact of escape analysis on bytecode-level allocation profiling, to analyze object life-times, and to evaluate the impact of method inlining when profiling method invocations. On the other hand, we also illustrate how our technique enables new kinds of profilers, such as a profiler for non-inlined callsites, and a testing framework for locating performance bugs in dynamic compiler implementations. Secondly, the lack of profiling support at the intermediate representation (IR) level complicates the understanding of program behavior in the compiled code. This issue cannot be addressed by bytecode instrumentation because it cannot precisely capture the occurrence of IR-level operations. Binary instrumentation is not suited either, as it lacks a mapping from the collected low-level metrics to higher-level operations of the observed program. To fill this gap, we present an easy-to-use event-based framework for profiling operations at the IR level. We integrate the IR profiling framework in the Graal compiler, together with our instrumentation-separation technique. We illustrate our approach with a profiler that tracks the execution of memory barriers within compiled code. In addition, using a deoptimization profiler based on our IR profiling framework, we conduct an empirical study on deoptimization in the Graal compiler. We focus on situations which cause program execution to switch from machine code to the interpreter, and compare application performance using three different deoptimization strategies which influence the amount of extra compilation work done by Graal. Using an adaptive deoptimization strategy, we manage to improve the average start-up performance of benchmarks from the DaCapo, ScalaBench, and Octane suites by avoiding wasted compilation work. We also find that different deoptimization strategies have little impact on steady- state performance
Darwinian Data Structure Selection
Data structure selection and tuning is laborious but can vastly improve an
application's performance and memory footprint. Some data structures share a
common interface and enjoy multiple implementations. We call them Darwinian
Data Structures (DDS), since we can subject their implementations to survival
of the fittest. We introduce ARTEMIS a multi-objective, cloud-based
search-based optimisation framework that automatically finds optimal, tuned DDS
modulo a test suite, then changes an application to use that DDS. ARTEMIS
achieves substantial performance improvements for \emph{every} project in
Java projects from DaCapo benchmark, popular projects and uniformly
sampled projects from GitHub. For execution time, CPU usage, and memory
consumption, ARTEMIS finds at least one solution that improves \emph{all}
measures for () of the projects. The median improvement across
the best solutions is , , for runtime, memory and CPU
usage.
These aggregate results understate ARTEMIS's potential impact. Some of the
benchmarks it improves are libraries or utility functions. Two examples are
gson, a ubiquitous Java serialization framework, and xalan, Apache's XML
transformation tool. ARTEMIS improves gson by \%, and for
memory, runtime, and CPU; ARTEMIS improves xalan's memory consumption by
\%. \emph{Every} client of these projects will benefit from these
performance improvements.Comment: 11 page
- …