17 research outputs found
Hill-Climbing SMT Processor Resource Distribution
The key to high performance in SMT processors lies in optimizing the
distribution of shared resources among simultaneously executing threads. Existing
resource distribution techniques optimize performance only indirectly. They
infer potential performance bottlenecks by observing indicators, like
instruction occupancy or cache miss count, and take actions to try to alleviate
them. While the corrective actions are designed to improve performance, their
actual performance impact is not known since end performance is never
monitored. Consequently, opportunities for performance gains are lost whenever
the corrective actions do not effectively address the actual performance
bottlenecks occurring in the SMT processor pipeline.
In this dissertation, we propose a different approach to SMT processor resource
distribution that optimizes end performance directly. Our approach observes
the impact that resource distribution decisions have on performance at runtime,
and feeds this information back to the resource distribution mechanisms to
improve future decisions. By successively applying and evaluating different
resource distributions, our approach tries to learn the best distribution over
time. Because we perform learning on-line, learning time is crucial. We
develop a hill-climbing SMT processor resource distribution technique that
efficiently learns the best resource distribution by following the performance
gradient within the resource distribution space.
This dissertation makes three contributions within the context of
learning-based SMT processor resource distribution. First, we characterize and
quantify the time-varying performance behavior of SMT processors. This
analysis provides an understanding of that behavior and guides the design of our
hill-climbing algorithm. Second, we present a hill-climbing SMT processor
resource distribution technique that performs learning on-line. The
performance evaluation of our approach shows an 11.4% gain over ICOUNT, 11.5%
gain over FLUSH, and 2.8% gain over DCRA across a large set of 63
multiprogrammed workloads. Third, we compare existing resource distribution
techniques to an ideal learning-based technique that performs learning off-line
to show the potential performance of the existing techniques. This limit study
identifies the performance bottleneck of the existing techniques, showing that
the performance of ICOUNT, FLUSH, and DCRA is 13.2%, 13.5%, and 6.6%,
respectively, lower than the ideal performance. Our hill-climbing based
resource distribution, however, handles most of the bottlenecks of the existing
techniques properly, achieving 4.1% lower performance than the ideal case.
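The hill-climbing policy described above can be illustrated with a small sketch. Everything below is hypothetical: `measure_ipc` stands in for reading hardware performance counters over one epoch, and the share encoding, step size, and three-thread performance model are illustrative stand-ins, not the dissertation's actual mechanism.

```python
def measure_ipc(shares):
    # Hypothetical performance model standing in for one epoch of
    # hardware IPC measurement: each thread has an (unknown to the
    # algorithm) ideal share, and IPC falls off away from it.
    peaks = [0.5, 0.3, 0.2]
    return sum(1.0 - (s - p) ** 2 for s, p in zip(shares, peaks))

def hill_climb(num_threads=3, step=0.02, epochs=50):
    # Start from an equal partition of the shared resource
    # (e.g. issue queue entries, expressed as fractions summing to 1).
    shares = [1.0 / num_threads] * num_threads
    best = measure_ipc(shares)
    for _ in range(epochs):
        # Trial epochs: give each thread in turn a slightly larger
        # share, taken evenly from the others, and measure end
        # performance directly rather than any indirect indicator.
        candidates = []
        for t in range(num_threads):
            trial = [s - step / (num_threads - 1) for s in shares]
            trial[t] = shares[t] + step
            candidates.append((measure_ipc(trial), trial))
        # Follow the performance gradient: adopt the best trial only
        # if it beats the current operating point.
        top_ipc, top_shares = max(candidates)
        if top_ipc > best:
            best, shares = top_ipc, top_shares
        else:
            break  # local optimum in the distribution space
    return shares, best
```

Because each trial adds `step` to one thread and removes `step / (num_threads - 1)` from every thread, the shares always sum to the total resource budget.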
Learning-Based SMT Processor Resource Distribution via Hill-Climbing
The key to high performance in Simultaneous Multithreaded (SMT) processors lies in optimizing the distribution of shared resources to active threads. Existing resource distribution techniques optimize performance only indirectly. They infer potential performance bottlenecks by observing indicators, like instruction occupancy or cache miss counts, and take actions to try to alleviate them. While the corrective actions are designed to improve performance, their actual performance impact is not known since end performance is never monitored. Consequently, potential performance gains are lost whenever the corrective actions do not effectively address the actual bottlenecks occurring in the pipeline. We propose a different approach to SMT resource distribution that optimizes end performance directly. Our approach observes the impact that resource distribution decisions have on performance at runtime, and feeds this information back to the resource distribution mechanisms to improve future decisions. By evaluating many different resource distributions, our approach tries to learn the best distribution over time. Because we perform learning on-line, learning time is crucial. We develop a hill-climbing algorithm that efficiently learns the best distribution of resources by following the performance gradient within the resource distribution space. This paper conducts an in-depth investigation of learning-based SMT resource distribution. First, we compare existing resource distribution techniques to an ideal learning-based technique that performs learning off-line. This limit study shows learning-based techniques can provide up to 19.2% gain over ICOUNT, 18.0% gain over FLUSH, and 7.6% gain over DCRA across 21 multithreaded workloads. Then, we present an on-line learning algorithm based on hill-climbing.
Our evaluation shows hill-climbing provides a 12.4% gain over ICOUNT, 11.3% gain over FLUSH, and 2.4% gain over DCRA across a larger set of 42 multiprogrammed workloads.
Hill-Climbing SMT Processor Resource Scheduler
Multiple threads in an SMT processor share resources to increase resource utilization and improve overall performance. At the same time, they compete against each other for the shared resources, causing resource monopolization or underutilization. The resource scheduling mechanism is therefore important because it determines the throughput as well as the fairness of the simultaneously running threads in an SMT processor. To achieve optimal SMT performance, all earlier mechanisms schedule resources based on a few indicators, such as cache miss count, pre-decoded instruction count, or resource demand/occupancy. Those indicators trigger scheduling actions in the expectation of improved performance. However, since combinations of indicators cannot cover all possible cases of thread behavior, earlier mechanisms may face situations where the expected positive correlation between the indicators and performance becomes weak, losing the chance for performance improvement. In this paper, we develop a novel methodology using a hill-climbing (gradient descent) algorithm to find the optimal resource share of simultaneously running threads. We carefully vary the resource shares of multiple threads in the direction that improves SMT performance. Instead of monitoring the behavior of indicators, we use the history of past resource shares and SMT performance changes to determine the direction of future resource share shifts. Simulation results show that our hill-climbing resource scheduler outperforms FLUSH by 5.0% and DCRA by 1.4%, on average, using weighted IPC as the metric.
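The history-driven direction rule described above can be sketched in one dimension. This is a toy illustration under stated assumptions: `perf` is a hypothetical performance model, and the simple reverse-on-loss heuristic is a stand-in for the scheduler's actual decision logic, not the paper's exact mechanism.

```python
def next_shift(history):
    # history: list of (delta_share, delta_perf) pairs from past
    # epochs, i.e. how the resource share was shifted and how the
    # measured performance responded. Keep moving in a direction that
    # helped; reverse it when the last shift hurt performance.
    last_dshare, last_dperf = history[-1]
    return last_dshare if last_dperf > 0 else -last_dshare

def run_epochs(perf, share=0.5, step=0.05, epochs=20):
    # Drive one thread's resource share using only the
    # (share shift, performance change) history -- never any
    # indirect indicator such as miss counts or occupancy.
    history = [(step, perf(share + step) - perf(share))]
    share += step
    for _ in range(epochs):
        shift = next_shift(history)
        new_share = min(max(share + shift, 0.0), 1.0)
        history.append((shift, perf(new_share) - perf(share)))
        share = new_share
    return share
```

With a single-peaked performance curve, the share walks toward the peak and then oscillates within one step of it, which is the expected behavior of a fixed-step climber.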
Multi-Chain Prefetching: Exploiting Natural Memory Parallelism in Pointer-Chasing Codes
This paper presents a novel pointer prefetching technique, called multi-chain prefetching. Multi-chain prefetching tolerates serialized memory latency commonly found in pointer-chasing codes via aggressive prefetch scheduling. Unlike conventional prefetching techniques that hide memory latency underneath a single traversal loop or recursive function exclusively, multi-chain prefetching initiates prefetches for a chain of pointers prior to the traversal code, thus exploiting "pre-loop" work to help overlap serialized memory latency. As prefetch chains are scheduled increasingly early to accommodate long pointer chains, multi-chain prefetching overlaps prefetches across multiple independent linked structures, thus exploiting the natural memory parallelism that exists between separate pointer-chasing loops or recursive function calls. This paper makes three contributions in the context of multi-chain prefetching. First, we introduce a prefetch scheduling technique that exploits pre-loop work and inter-chain memory parallelism to tolerate serialized memory latency. To our knowledge, our scheduling algorithm is the first of its kind to expose natural memory parallelism in pointer-chasing codes. Second, we present the design of a prefetch engine that generates a prefetch address stream at runtime
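The benefit of overlapping prefetches across independent chains can be seen in a toy timing model. This is a back-of-the-envelope sketch, not the paper's prefetch engine: it assumes a fixed miss latency per node and that inter-chain prefetches overlap perfectly while intra-chain misses remain fully serialized.

```python
def serial_stall(chains, miss_latency=100):
    # Without prefetching, every pointer dereference in every chain is
    # a serialized cache miss: nothing overlaps, chain after chain.
    return sum(len(chain) * miss_latency for chain in chains)

def multi_chain_stall(chains, miss_latency=100):
    # With multi-chain prefetching, prefetches for independent chains
    # are launched ahead of the traversal and proceed concurrently.
    # Intra-chain misses stay serialized (each node's address depends
    # on the previous node), so the exposed stall time is bounded by
    # the longest single chain.
    return max(len(chain) for chain in chains) * miss_latency

# Three independent linked lists of 4, 3, and 5 nodes.
chains = [list(range(4)), list(range(3)), list(range(5))]
```

Under this model the three chains cost 1200 cycles of stall when traversed serially but only 500 when overlapped, since the 5-node chain's serialized misses dominate.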
Optimizing SMT Processors for High Single-Thread Performance
Simultaneous Multithreading (SMT) processors achieve high processor throughput at the expense of single-thread performance. This paper investigates resource allocation policies for SMT processors that preserve, as much as possible, the single-thread performance of designated "foreground" threads, while still permitting other "background" threads to share resources. Since background threads on such an SMT machine have a near-zero performance impact on foreground threads, we refer to the background threads as transparent threads. Transparent threads are ideal for performing low-priority or non-critical computations, with applications in process scheduling, subordinate multithreading, and on-line performance monitoring.
Multi-chain prefetching: Effective exploitation of inter-chain memory parallelism for pointer-chasing codes
Pointer-chasing applications tend to traverse composed data structures consisting of multiple independent pointer chains. While the traversal of any single pointer chain leads to the serialization of memory operations, the traversal of independent pointer chains provides a source of memory parallelism. This paper presents multi-chain prefetching, a technique that utilizes off-line analysis and a hardware prefetch engine to prefetch multiple independent pointer chains simultaneously, thus exploiting inter-chain memory parallelism for the purpose of memory latency tolerance. This paper makes three contributions. First, we introduce a scheduling algorithm that identifies independent pointer chains in pointer-chasing codes and computes a prefetch schedule that overlaps serialized cache misses across separate chains. Our analysis focuses on static traversals. We also propose using speculation