7 research outputs found
Mats: MultiCore Adaptive Trace Selection
Dynamically optimizing programs is worthwhile only if the overhead introduced by the dynamic optimizer is less than the benefit gained from the optimization. Program trace selection is one of the most important, yet time-consuming, components of many dynamic optimizers. Dynamic monitoring and profiling can often cause an execution slowdown rather than a speedup, and achieving significant performance gains from dynamic optimization has proven quite challenging. However, current technological advances, namely multicore architectures, enable us to design new approaches to meet this challenge. Trace selection in current dynamic optimizers is typically achieved through instrumentation that collects control-flow information from a running application. Using instrumentation for runtime analysis requires the trace selection algorithms to be lightweight, which limits how sophisticated they can be. This is problematic because the quality of the traces determines the potential benefit of optimizing them. In many cases, even with a lightweight approach, the overhead incurred exceeds the benefit of the optimizations. In this paper we exploit the multicore architecture to design an aggressive trace selection approach that produces better traces and does not perturb the running application.
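To make the division of labor concrete, here is a minimal sketch, in Python rather than the paper's infrastructure, of how trace selection can be offloaded to a spare core: the application-side probe only enqueues branch events, while a helper thread runs a heavier hot-path analysis. The queue, the HOT_THRESHOLD cutoff, and grow_trace are illustrative assumptions, not Mats itself.

```python
import queue
import threading
from collections import Counter, defaultdict

HOT_THRESHOLD = 50            # assumed hotness cutoff for starting a trace
events = queue.Queue()        # (src_pc, dst_pc) taken-branch events

def app_side_probe(src_pc, dst_pc):
    """Cheap probe on the application side: just enqueue and return."""
    events.put((src_pc, dst_pc))

def grow_trace(start_pc, successors, max_len=8):
    """Greedily follow the most frequent successor to build a trace."""
    trace, pc = [start_pc], start_pc
    while pc in successors and len(trace) < max_len:
        pc = successors[pc].most_common(1)[0][0]
        trace.append(pc)
    return trace

def trace_selector():
    """Heavier analysis runs on a spare core, off the app's critical path."""
    edge_counts = Counter()
    successors = defaultdict(Counter)
    while True:
        edge = events.get()
        if edge is None:              # shutdown sentinel
            break
        edge_counts[edge] += 1
        src, dst = edge
        successors[src][dst] += 1
        if edge_counts[edge] == HOT_THRESHOLD:
            print("hot trace starting at", grow_trace(dst, successors))

worker = threading.Thread(target=trace_selector)
worker.start()
# Toy usage: the application keeps taking the same branch until it is hot.
for _ in range(HOT_THRESHOLD):
    app_side_probe(0x10, 0x20)
events.put(None)                      # stop the selector
worker.join()
```

Because the probe only performs an enqueue, the running application is perturbed far less than if the trace-growing analysis ran inline.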
Low overhead hardware-assisted virtual machine analysis and profiling
Cloud infrastructure providers need reliable performance analysis tools for their nodes. Moreover, the analysis of Virtual Machines (VMs) is a major requirement in quantifying cloud performance. However, root cause analysis, in case of unexpected crashes or anomalous behavior in VMs, remains a major challenge. Modern tracing tools such as LTTng allow fine-grained analysis, albeit at a minimal execution overhead and with a dependence on the underlying OS. In this paper, we propose HAVAna, a hardware-assisted VM analysis algorithm that gathers and analyzes pure hardware trace data, without any dependence on the underlying OS or performance analysis infrastructure. Our approach is totally non-intrusive and does not require any performance statistics, trace, or log gathering from the VM. We used the recently introduced Intel PT ISA extensions on modern Intel Skylake processors to demonstrate its efficiency and observed that, in our experimental scenarios, it leads to a tiny overhead of up to 1%, compared to 3.6-28.7% for similar VM trace analysis done with software-only schemes such as LTTng. Our proposed VM trace analysis algorithm has also been open-sourced for further enhancements and for the benefit of other developers. Furthermore, we developed interactive Resource and Process Control Flow visualization tools to analyze the hardware trace data, and we present a real-life use case in the paper that allowed us to see unexpected resource consumption by VMs.
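As a rough illustration of the kind of analysis such a tool performs, the sketch below attributes CPU time to VMs from already-decoded hardware trace events. The (timestamp, cpu, kind, vm_id) event format is an assumption made for illustration; it is not the Intel PT packet format or HAVAna's actual interface.

```python
from collections import defaultdict

def vm_cpu_residency(events):
    """events: iterable of (timestamp_ns, cpu, kind, vm_id), where kind
    is 'vmentry' or 'vmexit'. Returns per-VM CPU time in nanoseconds."""
    residency = defaultdict(int)
    entered = {}                      # cpu -> (vm_id, entry timestamp)
    for ts, cpu, kind, vm_id in events:
        if kind == "vmentry":
            entered[cpu] = (vm_id, ts)
        elif kind == "vmexit" and cpu in entered:
            vid, t0 = entered.pop(cpu)
            residency[vid] += ts - t0
    return dict(residency)

# Toy usage: one vCPU slice of 3000 ns for vm-1, 1000 ns for vm-2.
trace = [
    (1000, 0, "vmentry", "vm-1"),
    (4000, 0, "vmexit",  "vm-1"),
    (5000, 0, "vmentry", "vm-2"),
    (6000, 0, "vmexit",  "vm-2"),
]
print(vm_cpu_residency(trace))        # {'vm-1': 3000, 'vm-2': 1000}
```

Because the analysis consumes hardware trace data gathered on the host, nothing needs to run inside the VM itself.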
Runtime Adaptation: A Case for Reactive Code Alignment
Static alignment techniques are well studied and have been incorporated into compilers in order to optimize code locality for the instruction fetch unit in modern processors. However, current static alignment techniques have several limitations that cannot be overcome. In the exascale era, it becomes even more important to break from static techniques and develop adaptive algorithms in order to maximize the utilization of every processor cycle. In this paper, we explore those limitations and show that reactive realignment, a method where we dynamically monitor running applications, react to symptoms of poor alignment, and adapt alignment to the current execution environment and program input, is more scalable than static alignment. We present fetches-per-instruction as a runtime indicator of poor alignment. Additionally, we discuss three main opportunities that static alignment techniques cannot leverage, but which are increasingly important in large-scale computing systems: microarchitectural differences of cores, dynamic program inputs that exercise different and sometimes alternating code paths, and dynamic branch behavior, including indirect branch behavior and phase changes. Finally, we present several instances where our trigger for reactive realignment may be incorporated in practice, and discuss the limitations of dynamic alignment.
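A minimal sketch of how a fetches-per-instruction trigger might look in practice; the sampled counter pairs, the FPI_TRIGGER threshold, and the hysteresis window are assumptions for illustration, not the paper's implementation.

```python
FPI_TRIGGER = 1.25      # assumed threshold: sustained >1.25 fetches per
                        # retired instruction suggests poor alignment

def should_realign(samples, window=4):
    """Fire only when FPI stays above the trigger for `window` samples,
    so one noisy sample does not cause a costly realignment."""
    recent = samples[-window:]
    return len(recent) == window and all(f / i > FPI_TRIGGER
                                         for f, i in recent)

# Toy monitoring loop over (fetches, retired_instructions) samples,
# e.g. read periodically from two hardware performance counters.
samples = [(100, 95), (130, 100), (140, 100), (135, 100), (150, 100)]
if should_realign(samples):
    print("symptom of poor alignment: schedule reactive realignment")
```

The hysteresis window is one way to keep the trigger from reacting to transient behavior such as a brief phase change.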
Reusing cached schedules in an out-of-order processor with in-order issue logic
Modern processors use out-of-order processing logic to achieve high performance in Instructions Per Cycle (IPC), but this logic has a serious impact on the achievable frequency. In order to get better performance out of smaller transistors, there is a trend to increase the number of cores per die instead of making the cores themselves bigger. Moreover, for throughput-oriented and server workloads, simpler in-order processors that allow more cores per die and higher design frequencies are becoming the preferred choice. Unfortunately, for other workloads this type of core results in lower single-thread performance.
There are many workloads where it is still important to achieve good single-thread performance. In this thesis we present the ReLaSch processor. Its aim is to enable high-IPC cores capable of running at high clock frequencies by processing the instructions using simple superscalar in-order issue logic and caching instruction groups that are dynamically scheduled in hardware after commit, that is, out of the critical path and only when really needed.
Objective
This thesis has several research goals:
• Show that the dynamic scheduler of a conventional out-of-order processor does a lot of redundant work because it ignores the repetitiveness of code.
• Propose a complete superscalar out-of-order architecture that reduces the amount of redundant work by creating the schedules once in dedicated hardware, storing them in a cache of schedules, and reusing them as much as possible.
• Place the scheduler out of the critical path of execution, which should be enabled by the reduction in the work it must do. Thus, the execution path of our proposed processor can be simpler than that of a conventional out-of-order processor.
Proposal and results
We present the ReLaSch processor, named after Reused Late Schedules, in which the creation of issue-groups is removed from the critical path of execution and handled by simple, small in-order issue logic. Each cycle it wakes up and selects the instructions of a single issue-group, instead of processing the instructions of a whole issue queue.
New logic at the end of the conventional pipeline schedules the committed instructions. This scheduler can be complex since it is not in the critical path of execution. The schedules are cached and, whenever possible, a cached schedule group (an rgroup) is read and its instructions executed. The schedules are reused, lowering the pressure on the scheduling logic.
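The following Python sketch illustrates the schedule-cache idea; the class, its interface, and the toy issue-groups are assumptions made for illustration, not the thesis hardware.

```python
class ScheduleCache:
    """Caches issue-groups built once by a post-commit scheduler, so the
    front end can reuse them instead of rescheduling repetitive code."""

    def __init__(self):
        self.rgroups = {}             # start PC -> list of issue-groups

    def lookup(self, pc):
        """Front end: reuse a cached schedule if one starts at `pc`."""
        return self.rgroups.get(pc)

    def install(self, start_pc, issue_groups):
        """Post-commit scheduler: store a schedule built off the critical
        path (this runs once per trace, not once per execution)."""
        self.rgroups[start_pc] = issue_groups

cache = ScheduleCache()
# Two independent adds grouped to issue together, dependent mul after.
cache.install(0x400, [["add r1,r2", "add r3,r4"], ["mul r5,r1,r3"]])

groups = cache.lookup(0x400)
if groups is not None:
    for cycle, group in enumerate(groups):
        print(f"cycle {cycle}: issue {group}")   # simple in-order issue
else:
    pass  # miss: execute unscheduled, then schedule after commit
```

On a hit, the issue logic only needs to walk one pre-built group per cycle, which is what allows it to stay small and in-order.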
In some cases, the ReLaSch processor is able to outperform a conventional out-of-order processor, because the post-commit scheduler has a broader view of the code. For instance, while ReLaSch can schedule together two independent instructions that are distant in the code, a conventional out-of-order processor only issues them in the same cycle if both are in flight.
The ReLaSch processor predicts branch targets, memory aliases, and latencies at scheduling time, out of the critical path. The prediction is based on the most recent executions at scheduling time. Furthermore, most of the register renaming process is performed by the scheduler and is removed from the execution pipeline.
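As one way to picture scheduling-time prediction, the sketch below uses a last-value predictor for load latencies, trained at commit and consulted when packing issue-groups. The predictor structure and default latency are assumptions for illustration, not the thesis design.

```python
DEFAULT_LATENCY = 3                   # assumed latency for unseen loads

class LastValueLatencyPredictor:
    """Remembers the most recently observed latency of each static load,
    so dependents can be placed far enough after their producer."""

    def __init__(self):
        self.last = {}                # load PC -> last observed latency

    def predict(self, pc):
        return self.last.get(pc, DEFAULT_LATENCY)

    def train(self, pc, observed):
        """Called at commit with the latency actually observed."""
        self.last[pc] = observed

pred = LastValueLatencyPredictor()
pred.train(0x420, 12)                 # this load missed the cache
# A post-commit scheduler would leave pred.predict(0x420) == 12 cycles
# between the load and its consumer when building the rgroup.
print(pred.predict(0x420), pred.predict(0x999))   # 12 3
```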
Our experiments show that ReLaSch achieves the same average IPC as our reference out-of-order processor and is clearly better than the reference in-order processor (1.55 speedup). In all cases it outperforms the in-order processor, and in 23 benchmarks out of 40 it has a higher IPC than the reference out-of-order processor.
A Hardware Mechanism for Dynamic Extraction and Relayout of Program Hot Spots
This paper presents a new mechanism for collecting and deploying runtime optimized code. The code-collecting component resides in the instruction retirement stage and lays out hot execution paths to improve instruction fetch rate as well as enable further code optimization. The code deployment component uses an extension to the Branch Target Buffer to migrate execution into the new code without modifying the original code. No significant delay is added to the total execution of the program due to these components. The code collection scheme enables safe runtime optimization along paths that span function boundaries. This technique provides a better platform for runtime optimization than trace caches, because the traces are longer and persist in main memory across context switches. Additionally, these traces are not as susceptible to transient behavior because they are restricted to frequently executed code. Empirical results show that on average this mechanism can achieve better instruc..
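A minimal sketch of the deployment idea, assuming a side table alongside the BTB that maps a hot region's entry PC to its relaid-out copy; the class and its interface are illustrative, not the paper's hardware.

```python
class RedirectBTB:
    """Side table that redirects fetch from the entry PC of a hot region
    into its optimized copy, leaving the original code unmodified."""

    def __init__(self):
        self.redirects = {}           # original entry PC -> trace copy PC

    def deploy(self, entry_pc, trace_pc):
        """Installed when the retirement-stage collector finishes laying
        out a hot path in a separate memory region."""
        self.redirects[entry_pc] = trace_pc

    def next_fetch_pc(self, predicted_pc):
        """On a lookup, fetch from the optimized trace if one exists."""
        return self.redirects.get(predicted_pc, predicted_pc)

btb = RedirectBTB()
btb.deploy(0x401000, 0x7F0000)        # hot path relaid out at 0x7F0000
print(hex(btb.next_fetch_pc(0x401000)))   # 0x7f0000 (redirected)
print(hex(btb.next_fetch_pc(0x402000)))   # 0x402000 (unchanged)
```

Because the original code is never patched, execution can always fall back to it, and the relaid-out traces can live in ordinary main memory across context switches.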