4 research outputs found

    Establishing a base of trust with performance counters for enterprise workloads

    Understanding the performance of large, complex enterprise-class applications is an important, yet nontrivial task. Methods using hardware performance counters, such as profiling through event-based sampling, are often favored over instrumentation for analyzing such large codes, but rarely provide good accuracy at the instruction level. This work evaluates the accuracy of multiple event-based sampling techniques and quantifies the impact of a range of improvements suggested in recent years. The evaluation is performed on instances of three modern CPU architectures, using designated kernels and full applications. We conclude that precisely distributed events considerably improve accuracy, with further improvements possible when using Last Branch Records. We also present practical recommendations for hardware architects, tool developers and performance engineers, aimed at improving the quality of results.

    Low-Overhead Dynamic Instruction Mix Generation using Hybrid Basic Block Profiling

    Dynamic instruction mixes form an important part of the toolkits of performance tuners, compiler writers, and CPU architects. Instruction mixes are traditionally generated using software instrumentation, an accurate yet slow method that is normally limited to user-mode code. We present a new method for generating instruction mixes using the Performance Monitoring Unit (PMU) of the CPU. It has very low overhead, extends coverage to kernel-mode execution, and causes only a very modest decrease in accuracy compared to software instrumentation. In order to achieve this level of accuracy, we develop a new PMU-based data collection method, Hybrid Basic Block Profiling (HBBP). HBBP uses simple machine learning techniques to choose, on a per-basic-block basis, between data from two conventional sampling methods, Event Based Sampling (EBS) and Last Branch Records (LBR). We implement a profiling tool based on HBBP, and we report on experiments with the industry-standard SPEC CPU2006 suite, as well as with two large-scale scientific codes. We observe an improvement in runtime compared to software instrumentation of up to 76x on the tested benchmarks, reducing wait times from hours to minutes. Instruction attribution errors average 2.1%. The results indicate that HBBP provides a favorable tradeoff between accuracy and speed, making it a suitable candidate for use in production environments.
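
    The per-basic-block choice described in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `BlockProfile` type, the `choose_count` function, and the block-length threshold are all hypothetical, and the fixed threshold rule merely stands in for the learned model, which the abstract does not specify. The sketch only shows how a per-block choice between an EBS estimate and an LBR estimate feeds into an instruction mix.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class BlockProfile:
    """One basic block with two execution-count estimates (hypothetical type)."""
    instructions: list  # instruction mnemonics contained in the block
    ebs_count: float    # execution-count estimate from Event Based Sampling
    lbr_count: float    # execution-count estimate from Last Branch Records


def choose_count(block, short_block_threshold=4):
    """Pick one of the two estimates for a block.

    ASSUMPTION: this fixed length threshold is an illustrative stand-in,
    not the machine-learned selection rule used by HBBP.
    """
    if len(block.instructions) <= short_block_threshold:
        return block.lbr_count
    return block.ebs_count


def instruction_mix(blocks):
    """Weight every instruction in a block by the block's chosen count."""
    mix = Counter()
    for block in blocks:
        executions = choose_count(block)
        for mnemonic in block.instructions:
            mix[mnemonic] += executions
    return dict(mix)


# Usage with made-up numbers (not measurements from the paper).
example_blocks = [
    BlockProfile(["mov", "add", "jmp"], ebs_count=90.0, lbr_count=100.0),
    BlockProfile(["mov", "mul", "mul", "add", "cmp", "jne"],
                 ebs_count=50.0, lbr_count=40.0),
]
mix = instruction_mix(example_blocks)
print(mix)
```

    Keeping the selection in a separate function means the placeholder rule could be swapped for a trained classifier without touching the aggregation step.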

    Hardware-assisted instruction profiling and latency detection

    Debugging and profiling tools can alter the execution flow or timing, can induce heisenbugs, and are thus only marginally useful for debugging time-critical systems. Software tracing, however advanced it may be, depends on consuming precious computing resources. In this study, the authors analyse state-of-the-art hardware-tracing support, as provided in modern Intel processors, and propose a new technique which uses the processor hardware for tracing without any code instrumentation or tracepoints. They demonstrate the utility of their approach with contributions in three areas: syscall latency profiling, instruction profiling and software-tracer impact detection. They present improvements in performance and in the granularity of data gathered with the hardware-assisted approach, as compared with traditional software-only tracing and profiling. The performance impact on the target system, measured as time overhead, is on average 2–3%, with the worst case being 22%. They also define a way to measure and quantify the time resolution provided by hardware tracers for trace events, and observe the effect of fine-tuning hardware tracing for optimum utilisation. As compared with other in-kernel tracers, they observed that hardware-based tracing has a much reduced overhead, while achieving greater precision. Moreover, the other tracing techniques are ineffective in certain tracing scenarios.