4 research outputs found

    Symbiotic Subordinate Threading (SST)

    Integration of multiple processor cores on a single die, relatively constant die sizes, increasing memory latencies, and emerging new applications create new challenges and opportunities for processor architects. How can we build a multi-core processor that provides high single-thread performance while enabling high throughput through multi-programming? Conventional approaches for high single-thread performance use a large instruction window for memory latency tolerance, which requires large and complex cores. However, to be able to integrate more cores on the same die for high throughput, cores must be simpler and smaller. We present an architecture that obtains high performance for single-threaded applications in a multi-core environment, while using simpler cores to meet the high throughput requirement. Our scheme, called Symbiotic Subordinate Threading (SST), achieves the benefits of a large instruction window by utilizing otherwise idle cores to run dynamically constructed subordinate threads (a.k.a. helper threads) for the individual threads running on the active cores. In our proposed execution paradigm, the subordinate thread fetches and pre-processes instruction streams and retires processed instructions into a buffer for the main thread to consume. The subordinate thread executes a smaller version of the program executed by the main thread. As a result, it runs far ahead to warm up the data caches and fix branch mispredictions for the main thread. In-flight instructions are present in the subordinate thread, the buffer, and the main thread, forming a very large effective instruction window for single-thread out-of-order execution. Moreover, using a simple technique for identifying the subordinate thread's non-speculative results, the main thread can integrate those results directly into its state without having to execute their corresponding instructions. In this way, the main thread is sped up because it also executes a smaller version of the program, and the total number of instructions executed is minimized, thereby achieving an efficient utilization of the hardware resources. The proposed SST architecture does not require large register files, issue queues, load/store queues, or reorder buffers. In addition, it incurs only minor hardware additions/changes. Experimental results show remarkable latency-hiding capabilities of the proposed SST architecture, outperforming existing architectures that share similar high-level microarchitecture.
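
The buffer hand-off described in this abstract can be pictured with a toy software model. The sketch below is only an analogy for a hardware mechanism: the names `sst_entry`, `sst_queue`, `sst_push`, `sst_consume_all`, and `sst_execute`, the capacity `SST_CAP`, and the doubling "instruction" are all hypothetical and do not come from the paper. It assumes each buffered entry carries a flag saying whether its result is speculative, so the consumer can reuse non-speculative results and re-execute only the rest.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SST_CAP 4  /* hypothetical buffer capacity */

/* One retired instruction as deposited by the subordinate thread. */
struct sst_entry {
    int  operand;      /* input needed if the main thread must re-execute */
    int  result;       /* value computed ahead of time */
    bool speculative;  /* true: result cannot be trusted as-is */
};

/* Bounded FIFO standing in for the buffer between the two threads. */
struct sst_queue {
    struct sst_entry slot[SST_CAP];
    size_t head;   /* index of the oldest entry */
    size_t count;  /* number of entries in flight */
};

/* Subordinate side: returns false when the buffer is full (it would stall). */
bool sst_push(struct sst_queue *q, struct sst_entry e)
{
    assert(q != NULL);
    if (q->count == SST_CAP)
        return false;
    q->slot[(q->head + q->count) % SST_CAP] = e;
    q->count++;
    return true;
}

/* Hypothetical stand-in for executing one instruction. */
static int sst_execute(int operand)
{
    return operand * 2;
}

/* Main side: drain the buffer in order. Non-speculative results are folded
 * into *state directly; speculative ones are re-executed. Returns how many
 * entries had to be re-executed. */
size_t sst_consume_all(struct sst_queue *q, long *state)
{
    assert(q != NULL && state != NULL);
    size_t reexecuted = 0;
    while (q->count > 0) {
        struct sst_entry e = q->slot[q->head];
        q->head = (q->head + 1) % SST_CAP;
        q->count--;
        if (e.speculative) {
            *state += sst_execute(e.operand);
            reexecuted++;
        } else {
            *state += e.result;
        }
    }
    return reexecuted;
}
```

In this model the fewer entries are marked speculative, the less work the consumer repeats, which mirrors the abstract's point that the main thread also ends up executing a smaller version of the program.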

    Compiler-based Pre-execution

    Pre-execution is a novel latency-tolerance technique in which one or more helper threads run ahead of the main computation and trigger long-latency delinquent events early so that the main thread makes forward progress without experiencing stalls. The most important issue in pre-execution is how to construct effective helper threads that quickly get ahead and compute the delinquent events accurately. Since the manual construction of helper threads is error-prone and cumbersome for a programmer, automation of such an onerous task is essential if pre-execution is to be widely used for a variety of real-world workloads. In this thesis, we study compiler-based pre-execution to construct prefetching helper threads using a source-level compiler. We first introduce various compiler algorithms to optimize the helper threads: program slicing removes noncritical code that is unnecessary for computing the delinquent loads, prefetch conversion reduces blocking in the helper threads by converting delinquent loads into nonblocking prefetches, and loop parallelization speculatively parallelizes the targeted code region so that more memory accesses are overlapped simultaneously. In addition to these algorithms to expedite the helper threads, we also propose several important algorithms to select the righ

    Design and Evaluation of Compiler Algorithms for Pre-Execution

    Pre-execution is a promising latency tolerance technique that uses one or more helper threads running in spare hardware contexts ahead of the main computation to trigger long-latency memory operations early, hence absorbing their latency on behalf of the main computation. This paper investigates a source-to-source C compiler for extracting pre-execution thread code automatically, thus relieving the programmer or hardware from this onerous task. At the heart of our compiler are three algorithms. First, program slicing removes non-critical code for computing cache-missing memory references, reducing pre-execution overhead. Second, prefetch conversion replaces blocking memory references with non-blocking prefetch instructions to minimize pre-execution thread stalls. Finally, threading scheme selection chooses the best scheme for initiating pre-execution threads, speculatively parallelizing loops to generate thread-level parallelism when necessary for latency tolerance. We prototyped our algorithms using the Stanford University Intermediate Format (SUIF) framework and a publicly available program slicer, called Unravel [13], and we evaluated our compiler on a detailed architectural simulator of an SMT processor. Our results show compiler-based pre-execution improves the performance of 9 out of 13 applications, reducing execution time by 22.7%. Across all 13 applications, our technique delivers an average speedup of 17.0%. These performance gains are achieved fully automatically on conventional SMT hardware, with only minimal modifications to support pre-execution threads.
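
The first two algorithms named in this abstract, program slicing and prefetch conversion, can be illustrated on a small loop. The sketch below is a hand-written approximation of the kind of output such a compiler might produce, not code from the paper: the function names `sum_indirect` and `prefetch_slice` are hypothetical, the helper is shown as an ordinary function rather than a thread launched in a spare hardware context, and `__builtin_prefetch` is assumed to be available (it is a GCC and Clang builtin, not standard C).

```c
#include <assert.h>
#include <stddef.h>

/* Original main-thread loop. The load data[index[i]] is assumed to be the
 * delinquent (frequently cache-missing) reference. */
long sum_indirect(const int *data, const int *index, size_t n)
{
    assert(data != NULL && index != NULL);
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += data[index[i]];
    return sum;
}

/* Helper-thread code after the two transformations:
 *  - slicing: the accumulation into sum does not affect any address, so it
 *    is removed; the load of index[i] stays because the address needs it;
 *  - prefetch conversion: the blocking load of data[index[i]] becomes a
 *    non-blocking prefetch of the same address.
 * The return value (number of prefetches issued) exists only so the sketch
 * can be checked; a real helper would not need it. */
size_t prefetch_slice(const int *data, const int *index, size_t n)
{
    assert(data != NULL && index != NULL);
    size_t issued = 0;
    for (size_t i = 0; i < n; i++) {
        __builtin_prefetch(&data[index[i]], 0 /* read */, 3 /* keep in cache */);
        issued++;
    }
    return issued;
}
```

Because the prefetch never waits for its data, the sliced loop can run well ahead of `sum_indirect` when the two execute concurrently. How the helper is started and kept from running too far ahead is the job of the threading scheme selection step, which this sketch leaves out.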

    Design and Evaluation of Compiler Algorithms for Pre-Execution

    Pre-execution is a promising latency tolerance technique that uses one or more helper threads running in spare hardware contexts ahead of the main computation to trigger long-latency memory operations early, hence absorbing their latency on behalf of the main computation. This paper investigates a source-to-source C compiler for extracting pre-execution thread code automatically, thus relieving the programmer or hardware from this onerous task. At the heart of our compiler are three algorithms. First, program slicing removes non-critical code for computing cache-missing memory references, reducing pre-execution overhead. Second, prefetch conversion replaces blocking memory references with non-blocking prefetch instructions to minimize pre-execution thread stalls. Finally, threading scheme selection chooses the best scheme for initiating pre-execution threads, speculatively parallelizing loops to generate thread-level parallelism when necessary for latency tolerance. We prototyped our algorithms using the Stanford University Intermediate Format (SUIF) framework and a publicly available program slicer, called Unravel [13], and we evaluated our compiler on a detailed architectural simulator of an SMT processor. Our results show compiler-based pre-execution improves the performance of 9 out of 13 applications, reducing execution time by 22.7%. Across all 13 applications, our technique delivers an average speedup of 17.0%. These performance gains are achieved fully automatically on conventional SMT hardware, with only minimal modifications to support pre-execution threads.