Coz: Finding Code that Counts with Causal Profiling
Improving performance is a central concern for software developers. To locate
optimization opportunities, developers rely on software profilers. However,
these profilers only report where programs spent their time: optimizing that
code may have no impact on performance. Past profilers thus both waste
developer time and make it difficult for them to uncover significant
optimization opportunities.
This paper introduces causal profiling. Unlike past profiling approaches,
causal profiling indicates exactly where programmers should focus their
optimization efforts, and quantifies their potential impact. Causal profiling
works by running performance experiments during program execution. Each
experiment calculates the impact of any potential optimization by virtually
speeding up code: inserting pauses that slow down all other code running
concurrently. The key insight is that this slowdown has the same relative
effect as running that line faster, thus "virtually" speeding it up.
We present Coz, a causal profiler, which we evaluate on a range of
highly-tuned applications: Memcached, SQLite, and the PARSEC benchmark suite.
Coz identifies previously unknown optimization opportunities that are both
significant and targeted. Guided by Coz, we improve the performance of
Memcached by 9%, SQLite by 25%, and accelerate six PARSEC applications by as
much as 68%; in most cases, these optimizations involve modifying under 10
lines of code. Comment: Published at SOSP 2015 (Best Paper Award)
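The virtual-speedup argument in the Coz abstract can be checked on a toy model. The sketch below is illustrative only and is not Coz's implementation. It assumes two threads that run in parallel and are joined at the end, treats all of the first thread's time as the code being "sped up", and assumes the inserted pauses are subtracted from the measured runtime before it is reported. All function and parameter names are hypothetical.

```python
def actual_runtime(target_time, other_time, speedup):
    """Runtime of two parallel threads if the target code really ran faster.

    speedup is a fraction in [0, 1]; 0.25 means the target code takes 25% less time.
    """
    return max(target_time * (1.0 - speedup), other_time)


def virtual_runtime(target_time, other_time, speedup):
    """Runtime estimated by pausing the other thread instead of speeding up the target.

    The target thread runs at its normal speed; the other thread is delayed by
    speedup * target_time in total. The delay is then subtracted again
    (an assumption of this sketch) so only the relative effect remains.
    """
    inserted_pause = target_time * speedup
    measured = max(target_time, other_time + inserted_pause)
    return measured - inserted_pause


if __name__ == "__main__":
    for s in (0.0, 0.1, 0.2, 0.5):
        print(s, actual_runtime(10.0, 8.0, s), virtual_runtime(10.0, 8.0, s))
```

In this model, with `target_time=10` and `other_time=8`, any speedup beyond 20% leaves the runtime at 8 because the other thread becomes the bottleneck. This is the situation in which a conventional profile would still point at the target code.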
Performance Optimization Strategies for Transactional Memory Applications
This thesis presents tools for Transactional Memory (TM) applications that cover multiple TM systems (software, hardware, and hybrid TM) and use information from all layers of the TM software stack. To that end, the thesis addresses a number of challenges in extracting static information, information about run-time behavior, and expert-level knowledge to develop these new methods and strategies for the optimization of TM applications.
Profilage continu et efficient de verrous pour Java pour les architectures multicœurs (Continuous and efficient lock profiling for Java on multicore architectures)
Today, the processing of large datasets is generally parallelised and performed on computers with many cores. However, locks can serialize the execution of these cores and hurt latency and processing throughput. Spotting these lock contention issues in vitro (i.e. during the development phase) is complex because it is difficult to reproduce a production environment, to create a realistic workload representative of how the software is used, and to test every deployment configuration in which the software will run. This thesis introduces Free Lunch, a lock profiler that diagnoses phases of high lock contention in vivo (i.e. during the operational phase). Free Lunch is designed around a new metric, the Critical Section Pressure (CSP), which aims to evaluate the impact of lock contention on overall thread progress. Free Lunch is integrated into the HotSpot JVM in order to minimize overhead, and it regularly reports the CSP during execution in order to detect temporary issues due to locks. Free Lunch is evaluated over 31 benchmarks from Dacapo 9.12, SpecJVM08 and SpecJBB2005, and over the Cassandra database. We were able to pinpoint phases of lock contention in 6 applications, some of which were not detected by existing profilers. With this information, we improved the performance of Xalan by 15% just by rewriting one line of code and identified a phase of high lock contention in Cassandra during the replay of transactions after a node crash. Free Lunch never degraded performance by more than 6%, which makes it suitable to be deployed continuously in an operational environment.
Performance analysis methods for understanding scaling bottlenecks in multi-threaded applications
In this dissertation we propose three new methods for analysing the performance of multi-threaded programs. Our first method, criticality stacks, is useful for analysing imbalance between threads. To construct these stacks we propose a new criticality metric, which splits the execution time of an application into a share for each thread. The larger a thread's share, the more critical that thread is to the application. The second method, bottle graphs, represents each thread of a multi-threaded program as a rectangle in a graph. The height of the rectangle is computed using our criticality metric, and the width represents the thread's parallelism. Rectangles at the top of the graph, in the neck of the bottle as it were, have limited parallelism, so we consider them "bottlenecks" for the application. Our third method, speedup stacks, shows the achieved speedup of an application and the various components that limit speedup in a stacked graph. The intuition behind this concept is that reducing the influence of a particular component increases the speedup of an application in proportion to the size of that component in the stack.
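The abstract says only that the criticality metric splits execution time into one share per thread. As a minimal sketch of one way such a split could work, the code below assumes the run has already been cut into intervals during which the set of running threads is constant, and it divides each interval's duration equally among those threads. The equal-division rule, the input format, and the function name are assumptions for illustration, not the thesis's stated definition.

```python
def criticality_shares(intervals):
    """Split total execution time into a per-thread share.

    intervals: iterable of (duration, running) pairs, where running is the
    set of thread ids that were running during that interval.
    Returns a dict mapping thread id to its share of the execution time.
    """
    shares = {}
    for duration, running in intervals:
        if not running:
            continue  # no thread to charge this interval to
        per_thread = duration / len(running)
        for tid in running:
            shares[tid] = shares.get(tid, 0.0) + per_thread
    return shares


if __name__ == "__main__":
    # Threads A and B run together for 4 time units, then A runs alone for 6.
    trace = [(4.0, {"A", "B"}), (6.0, {"A"})]
    print(criticality_shares(trace))
```

With this rule the shares add up to the total execution time, and a thread that often runs alone (here, A) receives a larger share than one that only runs alongside others.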
Software Performance Engineering using Virtual Time Program Execution
In this thesis we introduce a novel approach to software performance engineering that is based
on the execution of code in virtual time. Virtual time execution models the timing-behaviour
of unmodified applications by scaling observed method times or replacing them with results
acquired from performance model simulation. This facilitates the investigation of "what-if" performance predictions of applications comprising an arbitrary combination of real code and
performance models. The ability to analyse code and models in a single framework enables
performance testing throughout the software lifecycle, without the need to extract performance
models from code. This is accomplished by forcing thread scheduling decisions to take
into account the hypothetical time-scaling or model-based performance specifications of each
method. The virtual time execution of I/O operations or multicore targets is also investigated.
We explore these ideas using a Virtual EXecution (VEX) framework, which provides performance
predictions for multi-threaded applications. The language-independent VEX core is
driven by an instrumentation layer that notifies it of thread state changes and method profiling events; it is then up to VEX to control the progress of application threads in virtual time on top of the operating system scheduler. We also describe a Java Instrumentation Environment
(JINE), demonstrating the challenges involved in virtual time execution at the JVM level.
We evaluate the VEX/JINE tools by executing client-side Java benchmarks in virtual time
and identifying the causes of deviations from observed real times. Our results show that VEX
and JINE transparently provide predictions for the response time of unmodified applications
with typically good accuracy (within 5-10%) and low simulation overheads (25-50% additional
time). We conclude this thesis with a case study that shows how models and code can be
integrated, thus illustrating our vision on how virtual time execution can support performance
testing throughout the software lifecycle.
Effective Performance Analysis and Debugging
Performance is once again a first-class concern. Developers can no longer wait for the next generation of processors to automatically optimize their software. Unfortunately, existing techniques for performance analysis and debugging cannot cope with complex modern hardware, concurrent software, or latency-sensitive software services.
While processor speeds have remained constant, increasing transistor counts have allowed architects to increase processor complexity. This complexity often improves performance, but the benefits can be brittle; small changes to a program’s code, inputs, or execution environment can dramatically change performance, resulting in unpredictable performance in deployed software and complicating performance evaluation and debugging. Developers seeking to improve performance must resort to manual performance tuning for large performance gains. Software profilers are meant to guide developers to important code, but conventional profilers do not produce actionable information for concurrent applications. These profilers report where a program spends its time, not where optimizations will yield performance improvements. Furthermore, latency is a critical measure of performance for software services and interactive applications, but conventional profilers measure only throughput. Many performance issues appear only when a system is under high load, but generating this load in development is often impossible. Developers need to identify and mitigate scalability issues before deploying software, but existing tools offer developers little or no assistance.
In this dissertation, I introduce an empirically-driven approach to performance analysis and debugging. I present three systems for performance analysis and debugging. Stabilizer mitigates the performance variability that is inherent in modern processors, enabling both predictable performance in deployment and statistically sound performance evaluation. Coz conducts performance experiments using virtual speedups to create the effect of an optimization in a running application. This approach accurately predicts the effect of hypothetical optimizations, guiding developers to code where optimizations will have the largest effect. Amp allows developers to evaluate system scalability using load amplification to create the effect of high load in a testing environment. In combination, Amp and Coz allow developers to pinpoint code where manual optimizations will improve the scalability of their software
A Fast Causal Profiler for Task Parallel Programs
This paper proposes TASKPROF, a profiler that identifies parallelism
bottlenecks in task parallel programs. It leverages the structure of a task
parallel execution to perform fine-grained attribution of work to various parts
of the program. TASKPROF's use of hardware performance counters to perform
fine-grained measurements minimizes perturbation. TASKPROF's profile execution
runs in parallel using multi-cores. TASKPROF's causal profile enables users to
estimate improvements in parallelism when a region of code is optimized even
when concrete optimizations are not yet known. We have used TASKPROF to isolate
parallelism bottlenecks in twenty-three applications that use the Intel
Threading Building Blocks library. We have designed parallelization techniques
in five applications to increase parallelism by an order of magnitude using
TASKPROF. Our user study indicates that developers are able to isolate
performance bottlenecks with ease using TASKPROF. Comment: 11 pages