    Hill-Climbing SMT Processor Resource Distribution

    The key to high performance in SMT processors lies in optimizing the distribution of shared resources among simultaneously executing threads. Existing resource distribution techniques optimize performance only indirectly. They infer potential performance bottlenecks by observing indicators, like instruction occupancy or cache miss count, and take actions to try to alleviate them. While the corrective actions are designed to improve performance, their actual performance impact is not known since end performance is never monitored. Consequently, opportunities for performance gains are lost whenever the corrective actions do not effectively address the actual performance bottlenecks occurring in the SMT processor pipeline. In this dissertation, we propose a different approach to SMT processor resource distribution that optimizes end performance directly. Our approach observes the impact that resource distribution decisions have on performance at runtime, and feeds this information back to the resource distribution mechanisms to improve future decisions. By successively applying and evaluating different resource distributions, our approach tries to learn the best distribution over time. Because we perform learning on-line, learning time is crucial. We develop a hill-climbing SMT processor resource distribution technique that efficiently learns the best resource distribution by following the performance gradient within the resource distribution space. This dissertation makes three contributions within the context of learning-based SMT processor resource distribution. First, we characterize and quantify the time-varying performance behavior of SMT processors. This analysis provides understanding of the behavior and guides the design of our hill-climbing algorithm. Second, we present a hill-climbing SMT processor resource distribution technique that performs learning on-line. The performance evaluation of our approach shows an 11.4% gain over ICOUNT, an 11.5% gain over FLUSH, and a 2.8% gain over DCRA across a large set of 63 multiprogrammed workloads. Third, we compare existing resource distribution techniques to an ideal learning-based technique that performs learning off-line to show the potential performance of the existing techniques. This limit study identifies the performance bottlenecks of the existing techniques, showing that the performance of ICOUNT, FLUSH, and DCRA is 13.2%, 13.5%, and 6.6% lower than the ideal performance, respectively. Our hill-climbing based resource distribution, however, handles most of the bottlenecks of the existing techniques properly, falling only 4.1% short of the ideal case.
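The idea of following the performance gradient through the resource distribution space can be illustrated with a minimal sketch. This is not the dissertation's algorithm: the function names, the step size, the pairwise "move entries from one thread to another" neighborhood, and the toy performance model are all assumptions made for illustration. In hardware, each call to `measure` would correspond to running one epoch with the trial distribution and reading back the observed performance.

```python
from typing import Callable, List


def hill_climb_partition(
    measure: Callable[[List[int]], float],
    num_threads: int,
    total_entries: int,
    step: int = 4,
    max_rounds: int = 100,
) -> List[int]:
    """Greedy hill climbing over per-thread shares of one shared resource.

    `measure` is a hypothetical callback returning the observed performance
    (for example, weighted IPC) for a given partition.
    """
    # Start from an equal partition; thread 0 absorbs any remainder.
    base = total_entries // num_threads
    current = [base] * num_threads
    current[0] += total_entries - base * num_threads
    best_perf = measure(current)

    for _ in range(max_rounds):
        best_trial = None
        # Try moving `step` entries between every ordered pair of threads.
        for src in range(num_threads):
            for dst in range(num_threads):
                # Skip self-moves and moves that would starve the source.
                if src == dst or current[src] - step < step:
                    continue
                trial = list(current)
                trial[src] -= step
                trial[dst] += step
                perf = measure(trial)
                if perf > best_perf:
                    best_perf, best_trial = perf, trial
        if best_trial is None:
            break  # No neighbor improves performance: local maximum reached.
        current = best_trial
    return current


def toy_ipc(shares: List[int]) -> float:
    """Hypothetical concave model that peaks when thread 0 holds 40 entries."""
    return -float((shares[0] - 40) ** 2)


if __name__ == "__main__":
    print(hill_climb_partition(toy_ipc, num_threads=2, total_entries=64))
```

With the toy model above, the search starts at an equal split and shifts entries toward thread 0 until no neighboring distribution performs better. A real workload has time-varying behavior, so an on-line learner would need to keep re-evaluating rather than stop at the first local maximum.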

    Runahead threads to improve SMT performance

    In this paper, we propose Runahead Threads (RaT) as a valuable solution for both reducing resource contention and exploiting memory-level parallelism in Simultaneous Multithreaded (SMT) processors. Our technique converts a resource-intensive memory-bound thread into a lightweight speculative thread while it is blocked on long-latency memory operations. These speculative threads prefetch data and instructions with minimal resources, reducing critical resource conflicts between threads. We compare an SMT architecture using RaT to both state-of-the-art static fetch policies and dynamic resource control policies. In terms of throughput and fairness, our results show that RaT performs better than any other policy. The proposed mechanism improves average throughput by 37% over previous static fetch policies and by 28% over previous dynamic resource scheduling mechanisms. RaT also improves fairness by 36% and 30%, respectively. In addition, the proposed mechanism permits a register file size reduction of up to 60% in an SMT processor without performance degradation.

    Efficient resources assignment schemes for clustered multithreaded processors

    New feature sizes provide a larger number of transistors per chip that architects could use to further exploit instruction-level parallelism. However, these technologies also bring new challenges that complicate conventional monolithic processor designs. On the one hand, exploiting instruction-level parallelism is yielding diminishing returns, so other sources of parallelism, such as thread-level parallelism, must be exploited to keep raising performance with reasonable hardware complexity. On the other hand, clustered architectures have been widely studied as a way to reduce the inherent complexity of current monolithic processors. This paper studies the synergies and trade-offs between two concepts, clustering and simultaneous multithreading (SMT), in order to understand why conventional SMT resource assignment schemes are not so effective in clustered processors. These trade-offs are used to propose a novel resource assignment scheme that achieves an average speedup of 17.6% over ICOUNT while improving fairness by 24%.

    Introducing runahead threads

    Simultaneous Multithreading processors share their resources among multiple threads in order to improve performance. However, a resource control policy is needed to avoid resource conflicts and prevent some threads from monopolizing them. Otherwise, resource conflicts would cause other threads to suffer from resource starvation, degrading overall performance. This situation is especially critical for memory-bound threads, because they hold a significant amount of resources while long-latency accesses are being served. Several fetch policies and resource control techniques have been proposed to overcome these problems by limiting per-thread resource utilization. Nevertheless, this limitation is harmful for memory-bound threads because it restricts the memory-level parallelism available to hide long-latency memory accesses. In this paper, we propose Runahead threads in SMT scenarios as a valuable solution for both exploiting memory-level parallelism and reducing resource contention. This approach switches a resource-hungry memory-bound thread to a lightweight speculative thread, avoiding critical resource blocking among multiple threads. Furthermore, it improves thread-level parallelism by removing long-latency memory operations from the instruction window, releasing busy resources. We compare an SMT architecture using Runahead threads (SMTRA) to both state-of-the-art static fetch and dynamic resource control policies. Our results show that the SMTRA combination performs better, in terms of throughput and fairness, than any of the other policies.

    Managing SMT Resource Usage through Speculative Instruction Window Weighting

    Simultaneous multithreading processors dynamically share processor resources between multiple threads. In general, shared SMT resources may be managed explicitly, e.g. by dynamically setting queue occupation bounds for each thread as in the DCRA and Hill-Climbing policies. Alternatively, resources may be managed implicitly, i.e. resource usage is controlled by placing the desired instruction mix in the resources. In this case, the main resource management tool is the instruction fetch policy, which must predict the behavior of each thread (branch mispredictions, long-latency loads, etc.) as it fetches instructions. In this paper, we present the use of Speculative Instruction Window Weighting (SIWW) to bridge the gap between implicit and explicit SMT fetch policies. SIWW estimates for each thread the amount of outstanding work in the processor pipeline. SIWW policies are implicit, as fetch proceeds for the thread with the least amount of work left. They are also explicit, as a maximum resource allocation can also be set. SIWW can use and combine virtually any of the indicators that were previously proposed for guiding the instruction fetch policy (number of in-flight instructions, number of low-confidence branches, number of predicted cache misses, etc.). Therefore, SIWW is an approach to designing SMT fetch policies, rather than a particular fetch policy. Targeting fairness and targeting throughput are often contradictory goals, and an SMT scheduling policy often optimizes only one performance metric at the expense of the other. Our simulations show that the SIWW fetch policy can achieve at the same time state-of-the-art throughput, state-of-the-art fairness and state-of-the-art harmonic performance mean.
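The fetch selection described above (estimate each thread's outstanding work, fetch for the thread with the least, optionally cap the allocation) can be sketched as follows. The weight values, the field names, and the simple linear combination are assumptions chosen for illustration; the abstract does not specify how SIWW weights individual indicators.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ThreadState:
    """Per-thread indicators; the field names are hypothetical."""
    in_flight: int          # instructions currently in the pipeline
    low_conf_branches: int  # in-flight branches predicted with low confidence
    predicted_misses: int   # in-flight loads predicted to miss in the cache


# Assumed weights: events likely to waste or block resources count for more.
W_IN_FLIGHT = 1
W_LOW_CONF_BRANCH = 8
W_PREDICTED_MISS = 32


def window_weight(t: ThreadState) -> int:
    """Estimate of the thread's outstanding work in the pipeline."""
    return (
        W_IN_FLIGHT * t.in_flight
        + W_LOW_CONF_BRANCH * t.low_conf_branches
        + W_PREDICTED_MISS * t.predicted_misses
    )


def select_fetch_thread(
    threads: List[ThreadState], max_weight: Optional[int] = None
) -> Optional[int]:
    """Return the index of the thread with the least estimated work.

    If `max_weight` is set, threads at or above the cap are not eligible
    (the explicit part of the policy). Returns None if no thread is eligible.
    """
    eligible = [
        (window_weight(t), i)
        for i, t in enumerate(threads)
        if max_weight is None or window_weight(t) < max_weight
    ]
    if not eligible:
        return None
    return min(eligible)[1]  # ties go to the lowest thread index


if __name__ == "__main__":
    threads = [
        ThreadState(in_flight=20, low_conf_branches=0, predicted_misses=0),
        ThreadState(in_flight=5, low_conf_branches=1, predicted_misses=0),
        ThreadState(in_flight=2, low_conf_branches=0, predicted_misses=1),
    ]
    print(select_fetch_thread(threads))
```

In this example the third thread has the fewest in-flight instructions but carries a predicted cache miss, so its estimated work is the highest and the second thread is chosen instead. That is the kind of behavior a plain instruction count cannot express.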

    L1-Bandwidth Aware Thread Allocation in Multicore SMT Processors

    Improving the utilization of shared resources is key to increasing performance in SMT processors. Recent work has focused on resource sharing policies to enhance processor performance, but these proposals mainly concentrate on novel hardware mechanisms that adapt to the dynamic resource requirements of the running threads. This work addresses the L1 cache bandwidth problem in SMT processors experimentally on real hardware. Unlike previous work, this paper concentrates on thread allocation, by selecting the proper pair of co-runners to be launched on the same core. The relation between the L1 bandwidth requirements of each benchmark and its performance (IPC) is analyzed. We found that for individual benchmarks, performance is strongly connected to L1 bandwidth consumption, and this observation remains valid when several co-runners are launched on the same SMT core. Based on these findings we propose two L1-bandwidth-aware thread-to-core (t2c) allocation policies, namely Static and Dynamic t2c allocation. The aim of these policies is to properly balance the L1 bandwidth requirements of the running threads among the processor cores. Experiments on a Xeon E5645 processor show that the proposed policies significantly improve the performance of the Linux OS kernel regardless of the number of cores considered.
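One simple way to balance L1 bandwidth requirements across two-way SMT cores is to pair the most demanding thread with the least demanding one, then the second most with the second least, and so on. The sketch below illustrates only that balancing idea; it is an assumption, not the paper's Static or Dynamic t2c policy, and the function name and bandwidth figures are hypothetical.

```python
from typing import Dict, List, Tuple


def pair_by_l1_bandwidth(bandwidth: Dict[str, float]) -> List[Tuple[str, str]]:
    """Pair threads so that each core's combined L1 bandwidth is balanced.

    `bandwidth` maps a thread name to its measured L1 bandwidth demand
    (any consistent unit). Assumes two hardware threads per core.
    """
    if len(bandwidth) % 2 != 0:
        raise ValueError("expected an even number of threads for 2-way SMT cores")

    # Sort thread names from lowest to highest bandwidth demand.
    ranked = sorted(bandwidth, key=bandwidth.get)

    pairs = []
    lo, hi = 0, len(ranked) - 1
    while lo < hi:
        # Co-schedule the lightest remaining thread with the heaviest one.
        pairs.append((ranked[lo], ranked[hi]))
        lo += 1
        hi -= 1
    return pairs


if __name__ == "__main__":
    demand = {"a": 10.0, "b": 1.0, "c": 7.0, "d": 3.0}
    for core, pair in enumerate(pair_by_l1_bandwidth(demand)):
        print(core, pair, sum(demand[t] for t in pair))
```

A static policy in this spirit would compute the pairing once from profiled bandwidth, while a dynamic one would presumably recompute it periodically from runtime measurements, since bandwidth demand changes across program phases.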

    SEEC: A Framework for Self-aware Management of Multicore Resources

    This paper presents SEEC, a self-aware programming model designed to reduce programming effort in modern multicore systems. In the SEEC model, application programmers specify application goals and progress, while systems programmers separately specify actions system software and hardware can take to affect an application (e.g. resource allocation). The SEEC runtime monitors applications and dynamically selects actions to meet application goals optimally (e.g. meeting performance while minimizing power consumption). The SEEC runtime optimizes system behavior for the application rather than requiring the application programmer to optimize for the system. This paper presents a detailed discussion of the SEEC model and runtime as well as several case studies demonstrating their benefits. SEEC is shown to optimize performance per Watt for a video encoder, find optimal resource allocation for an application with complex resource usage, and maintain the goals of multiple applications in the face of environmental fluctuations.

    Stretch: Balancing QoS and Throughput for Colocated Server Workloads on SMT Cores

    In a drive to maximize resource utilization, today's datacenters are moving to colocation of latency-sensitive and batch workloads on the same server. State-of-the-art deployments, such as those at Google, colocate such diverse workloads even on a single SMT core. This form of aggressive colocation is possible because a latency-sensitive service operating below its peak load has significant slack in its response latency with respect to the QoS target. The slack affords a degradation in single-thread performance, which is inevitable under SMT colocation, without compromising QoS targets. This work makes the observation that many batch applications can greatly benefit from a large instruction window to uncover ILP and MLP. Under SMT colocation, conventional wisdom holds that individual hardware threads should be limited in their ability to acquire and hold a disproportionately large share of microarchitectural resources so as not to compromise the performance of a co-running thread. We show that the performance slack inherent in latency-sensitive workloads operating at low to moderate load makes it safe to shift microarchitectural resources to a co-running batch thread without compromising QoS targets. Based on this insight, we introduce Stretch, a simple ROB partitioning scheme that is invoked by system software to provide one hardware thread with a much larger ROB partition at the expense of another thread. When Stretch is enabled for latency-sensitive workloads operating below their peak load on an SMT core, co-running batch applications gain 13% in performance on average (30% max) over baseline SMT colocation without compromising QoS constraints.

    Exploring coordinated software and hardware support for hardware resource allocation

    Multithreaded processors are now common in the industry as they offer high performance at a low cost. Traditionally, in such processors, the assignment of hardware resources among the multiple threads is done implicitly, by the hardware policies. However, a new class of multithreaded hardware allows the explicit allocation of resources to be controlled or biased by the software. Currently, there is little or no coordination between the allocation of resources done by the hardware and the prioritization of tasks done by the software. This thesis aims to narrow the gap between the software and the hardware, with respect to hardware resource allocation, by proposing a new explicit resource allocation hardware mechanism and novel schedulers that use the currently available hardware resource allocation mechanisms. It approaches the problem in two different types of computing systems. In the high performance computing domain, we characterize the first processor to present a mechanism that allows the software to bias the allocation of hardware resources, the IBM POWER5. In addition, we propose the use of hardware resource allocation as a way to balance high performance computing applications. Finally, we propose two new scheduling mechanisms that are able to transparently and successfully balance applications in real systems using the hardware resource allocation. In the soft real-time domain, we propose a hardware extension to the existing explicit resource allocation hardware and, in addition, two software schedulers that use the explicit allocation hardware to improve the schedulability of tasks in a soft real-time system. In this thesis, we demonstrate that system performance improves by making the software aware of the mechanisms to control the amount of resources given to each running thread. In particular, for the high performance computing domain, we show that it is possible to decrease the execution time of MPI applications by biasing the hardware resource assignment between threads. In addition, we show that it is possible to decrease the number of missed deadlines when scheduling tasks in a soft real-time SMT system.