12 research outputs found
Differential Performance Debugging with Discriminant Regression Trees
Differential performance debugging is a technique to find performance
problems. It applies in situations where the performance of a program is
(unexpectedly) different for different classes of inputs. The task is to
explain the differences in asymptotic performance among various input classes
in terms of program internals. We propose a data-driven technique based on
discriminant regression tree (DRT) learning problem where the goal is to
discriminate among different classes of inputs. We propose a new algorithm for
DRT learning that first clusters the data into functional clusters, capturing
different asymptotic performance classes, and then invokes off-the-shelf
decision tree learning algorithms to explain these clusters. We focus on linear
functional clusters and adapt classical clustering algorithms (K-means and
spectral) to produce them. For the K-means algorithm, we generalize the notion
of the cluster centroid from a point to a linear function. We adapt spectral
clustering by defining a novel kernel function to capture the notion of linear
similarity between two data points. We evaluate our approach on benchmarks
consisting of Java programs where we are interested in debugging performance.
We show that our algorithm significantly outperforms other well-known
regression tree learning algorithms in terms of running time and accuracy of
classification.Comment: To Appear in AAAI 201
Recommended from our members
Automatic static analysis of software performance
Performance is a critical component of software quality. Software performance can have drastic repercussions on an application, frustrating its users, breaking the functionality of its components, or even rendering it defenseless against hackers. Unfortunately, unlike in the program verification domain, robust analysis techniques for software performance are almost non-existent. In this thesis we formalize important classes of performance-related bugs and security vulnerabilities, and implement novel static analysis techniques for automatically detecting them in widely used open-source applications. Our tools are able to uncover 92 performance bugs and 47 security vulnerabilities, while analyzing hundreds of thousands of lines of code and reporting a modest amount of false positives. Our work opens a new avenue for research: the development of rigorous automatic analyses for effective software performance understanding, inspired by traditional research in functional verification.Computer Science
Three pitfalls in Java performance evaluation
The Java programming language has known a remarkable growth over the last decade. This is partially due to the infrastructure required to run Java ap- plications on general purpose microprocessors: a Java virtual machine (VM). The VM ensures that Java applications are portable across different hardware platforms, because it shelters the applications from the underlying system. Hence the motto write once, run (almost) anywhere. Java applications are compiled to an intermediate form, called bytecode, and consist of a number of so-called class files. The virtual machine takes care of class loading, interpreting or compiling the bytecode to the native code of the underlying hardware platform, thread scheduling, garbage collection, etc. As such, during the execution of a Java application, the VM regularly intervenes to take care of housekeeping tasks and to optimise the application as it is executing. Furthermore, the specific implementation details of most virtual machines insert non-deterministic behaviour, not into the semantic part of the execution, but rather into the lower level execution. For example, to bring a Java application up to competitive speed with classical compiled programs written in languages such as C, the virtual machine needs to optimise Java bytecode. To limit the execution overhead, most virtual machines use a time sampling mechanism to determine the hot methods in the application. This introduces non-determinism, as over several runs, the methods are not always optimised at the same moment, nor is the set of optimised methods always the same. Other factors that introduce non-determinism are the thread scheduling, garbage collection, etc. It is readily seen that performance analysis of Java applications is not as simple as it seems at first, and warrants closer inspection. In this dissertation we are mainly interested in the behaviour of Java applications and their performance. In the course of this work, we uncovered three major pitfalls that were not taken into account by researchers when analysing Java performance prior to this work. We will briefly summarise the main achievements presented in this dissertation. The first pitfall we present involves the interaction between the virtual machine, the application and the input to the application. The performance for short running applications is shown to be mainly determined by the virtual machine. For longer running applications, this influence decreases, but remains tangible. We use statistical analysis, such as principal components analysis and cluster analysis (K-means and hierarchical clustering) to demonstrate and clarify the pitfall. By means of a large number of performance char- acteristics measured using hardware performance counters, five virtual machines and fourteen benchmarks with both a small and a large input size, we demonstrate that short running workloads are primarily clustered by virtual machines. Even for long running applications from the SPECjvm98 benchmark suite, the virtual machine still exerts a large influence on the observed behaviour at the microarchitectural level. This work has shown the need for both larger and longer running benchmarks than were available prior to it – this was (partially) met by the introduction of the DaCapo benchmark suite – as well as a careful consideration when setting up an experiment to avoid measuring the virtual machine, rather than the benchmark. Prior to this work, people were quite often using simulation with short running applications (to save time) for exploring Java performance. The second pitfall we uncover involves the analysis of performance numbers. During a survey of 50 papers published at premier conferences, such as OOPSLA, PLDI, CGO, ISMM and VEE, over the past seven years, we found that a variety of approaches are used, both for experimental design – for example, the input size, virtual machines, heap sizes, etc. – and, even more importantly, for data analysis – for example, using a best out of 3 performance number. New techniques are pitted against existing work using these prevalent approaches, and conclusions regarding their successfulness in beating prior state-of-the-art are based upon them. Given the fact that the execution of Java applications usually involves non-determinism in the virtual machine – for example, when determining which methods to optimise – it should come as no surprise that the lack of statistical rigour in these prevalent approaches leads to misleading or even incorrect conclusions. By this we mean that the conclusions are either not representative of what actually happens, or even contradict reality, as modelled in a statistical manner. To circumvent this pitfall, we propose a rigorous statistical approach that uses confidence intervals to both report and compare performance numbers. We also claim that sufficient experiments should be conducted to get a reliable performance measure. The non-determinism caused by the timer-based optimisation component in a virtual machine can be eliminated using so-called replay compilation. This technique will record a compilation plan during a first execution or profiling run of the application. During a second execution, the application is iterated twice: once to compile and optimise all methods found in the compilation plan, and a second time to perform the actual measurement. It turns out however that current practice of using either a single plan – corresponding to the best performing profiling run – or a combined plan choosing the methods that were optimised in, say, more than half the profiling runs, is no match for using multiple plans. The variability observed in the plans themselves is too large to capture in one of the current practices. Consequently, using multiple plans is definitely the better option. Moreover, this allows using a matched-pair approach in the data analysis, which results in tighter confidence intervals for the mean performance number. The third pitfall we examine is the usage of global performance numbers when tuning either an application or a virtual machine. We show that Java applications exhibit phase behaviour at the method level. This means that instances of the same method show more similarity to each other, behaviourwise, than to instances of other methods. A phase can then be identified as a set of sub-trees of the dynamic call-tree, with each sub-tree headed by the same method. We present an two-step algorithm that allows correlating hardware performance counter data in step 2 with the phases determined in step 1. The information obtained can be applied to show the programmer which methods perform worse than average, for example with respect to the number of cache misses they incur. In the dissertation, we pay particular attention to statistical rigour. For each pitfall, we use statistics to demonstrate its presence. Hopefully this work will encourage other researchers to use more rigour in their work as well
Recommended from our members
A Topology-Based Approach for Nonlinear Time Series with Applications in Computer Performance Analysis
We present a topology-based methodology for the analysis of experimental data generated by a discrete-time, nonlinear dynamical system. This methodology has significant applications in the field of computer performance analysis. Our approach consists of two parts. In the first part, we propose a novel signal separation algorithm that exploits the continuity of the dynamical system being studied. We use established tools from computational topology to test the connectedness of various regions of state space. In particular, a connected region of space that has a disconnected image under the experimental dynamics suggests the presence of multiple signals in the data. Using this as a guideline, we are able to model experimental data as an Iterated Function System (IFS). We demonstrate the success of our algorithm on several synthetic examples--including a Henon-like IFS. Additionally, we successfully model experimental computer performance data as an IFS. In the second part of the analysis, we represent an experimental dynamical system with an algebraic structure that allows for the computation of algebraic topological invariants. Previous work has shown that a cubical grid and the associated cubical complex are effective tools that can be used to identify isolating neighborhoods and compute the corresponding Conley Index--thereby rigorously verifying the existence of periodic orbits and/or chaotic dynamics. Our contribution is to adapt this technique by altering the underlying data structure--improving flexibility and efficiency. We represent the state space of the dynamical system with a simplicial complex and its induced simplicial multivalued map. This contains information about both geometry and dynamics, whereas the cubical complex is restricted by the geometry of the experimental data. This representation has several advantages; most notably, the complexity of the algorithm that generates the associated simplicial multivalued map is linear in the number of data points--as opposed to exponential in dimension for the cubical multivalued map. The synthesis of the two parts of our methodology results in a nonlinear time-series analysis framework that is particularly well suited for computer performance analysis. Complex computer programs naturally switch between `regimes\u27 and are appropriately modeled as IFSs by part one of our program. Part two of our methodology provides the correct tools for analyzing each regime independently
Structural Performance Comparison of Parallel Software Applications
With rising complexity of high performance computing systems and their parallel software, performance analysis and optimization has become essential in the development of efficient applications. The comparison of performance data is a key operation required in performance analysis. An analyst may conduct different types of comparisons in order to understand the performance properties of an application. One use case is comparing performance data from multiple measurements. Typical examples for such comparisons are before/after comparisons when applying optimizations or changing code versions. Besides comparing performance between multiple runs, also comparing performance characteristics across the parallel execution streams of an application is essential to detect performance problems. This is typically useful to detect imbalances, outliers, or changing runtime behavior during the execution of an application. While such comparisons are straightforward for the aggregated data in performance profiles, only limited solutions exist for comparing event traces. Trace-based analysis, i.e., the collection of fine-grained information on individual application events with timestamps and application context, has proven to be a powerful technique. The detailed performance information included in event traces make them very suitable for performance analysis. However, this level of detail also presents a challenge because it implies a large and overwhelming amount of data. Currently, users need to perform manual comparison of event traces, which is extremely challenging and time consuming because of the large volume of detailed data and the need to correctly line up trace events.
To fill the gap of missing solutions for automatic comparison of event traces, this work proposes a set of techniques that automatically align traces. The alignment allows their structural comparison and the highlighting of differences between them. A set of novel metrics provide the user with an objective measure of the differences between traces, both in terms of differences in the event stream and timing differences across events.
An additional important aspect of trace-based analysis is the visualization of performance data in event timelines. This has proven to be a powerful approach for the detection of various types of performance problems. However, visualization of large numbers of event timelines quickly hits the limits of available display resolution. Likewise, identifying performance problems is challenging in the large amount of visualized performance data. To alleviate these problems this work proposes two new approaches for event timeline visualization. First, novel folding strategies for event timelines facilitate visual scalability and provide powerful overviews of performance data at the same time. Second, this work presents an effective approach that automatically identifies and highlights several types of performance critical sections in an application run. This approach identifies time dominant functions of an application and subsequently uses them to analyze runtime imbalances throughout the application run. Intuitive visualizations present the resulting runtime variations and guide the analyst to performance hot spots.
Evaluations with benchmarks and real-world applications assess all introduced techniques. The effectiveness of the comparison approaches is demonstrated by showing automatically detected performance issues and structural differences between different versions of applications and across parallel execution streams. Case studies showcase the capabilities of the event timeline visualization techniques by demonstrating scalable performance data visualizations and detecting performance problems and code inefficiencies in real-world applications