35 research outputs found

    Effective instruction prefetching via fetch prestaging

    As technological processes shrink and clock rates increase, instruction caches can no longer be accessed in one cycle. The alternatives are smaller caches (with a higher miss rate) or large caches with pipelined access (with a higher branch misprediction penalty). In both cases, the performance obtained is far from that of an ideal large cache with one-cycle access. In this paper we present cache line guided prestaging (CLGP), a novel mechanism that overcomes the limitations of current instruction cache implementations. CLGP uses prefetching to load future cache lines into a set of fast prestage buffers. These buffers are managed efficiently by the CLGP algorithm, which tries to serve as many fetches from them as possible. The number of fetches served by the main instruction cache is therefore greatly reduced, and with it the negative impact of its access latency on overall performance. With the best CLGP configuration using a 4 KB I-cache, speedups of 3.5% (at 0.09 µm) and 12.5% (at 0.045 µm) are obtained over an equivalent fetch directed prefetching configuration, and 39% (at 0.09 µm) and 48% (at 0.045 µm) over a pipelined instruction cache without prefetching. Moreover, our results show that CLGP with 2.5 KB of total cache budget can obtain performance similar to a 64 KB pipelined I-cache without prefetching, that is, equivalent performance at 6.4X our hardware budget.
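    The abstract does not detail the CLGP algorithm itself, so the following is only a minimal sketch of the prestage-buffer idea, assuming FIFO replacement, 64-byte lines, and hypothetical icache and predict_next_lines helpers:

    ```python
    from collections import OrderedDict

    class PrestageBuffers:
        """Small set of fast buffers holding prefetched ("prestaged") I-cache lines."""
        def __init__(self, num_buffers):
            self.lines = OrderedDict()          # line address -> line contents
            self.num_buffers = num_buffers

        def insert(self, line_addr, data):
            if line_addr in self.lines:
                return
            if len(self.lines) >= self.num_buffers:
                self.lines.popitem(last=False)  # FIFO eviction (assumption)
            self.lines[line_addr] = data

        def lookup(self, line_addr):
            return self.lines.get(line_addr)    # fast hit if present

    def fetch(pc, prestage, icache, predict_next_lines):
        line_addr = pc & ~0x3F                  # assume 64-byte cache lines
        data = prestage.lookup(line_addr)
        if data is None:                        # miss: pay the pipelined I-cache latency
            data = icache.read(line_addr)
        for future_addr in predict_next_lines(pc):   # prestage predicted future lines
            prestage.insert(future_addr, icache.read(future_addr))
        return data
    ```

    The point of the structure is that a hit in the small prestage buffers avoids the multi-cycle pipelined I-cache access entirely; the prefetcher works ahead of the fetch PC so that most fetches hit in the buffers.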

    DIA: A complexity-effective decoding architecture

    Fast instruction decoding is a true challenge for the design of CISC microprocessors implementing variable-length instructions. A well-known solution to this problem is caching decoded instructions in a hardware buffer. Fetching already decoded instructions avoids the need to decode them again, improving processor performance. However, introducing such special-purpose storage into the processor design significantly increases the complexity of the fetch architecture. In this paper, we propose a novel decoding architecture that reduces the fetch engine implementation cost. Instead of using a special-purpose hardware buffer, our proposal stores frequently decoded instructions in the memory hierarchy. The address where the decoded instructions are stored is kept in the branch prediction mechanism, enabling it to guide our decoding architecture. This makes it possible for the processor front end to fetch already decoded instructions from memory instead of the original nondecoded instructions. Our results show that with our decoding architecture, a state-of-the-art superscalar processor achieves competitive performance improvements while requiring less chip area and energy consumption in the fetch architecture than a hardware code caching mechanism.
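    As a rough illustration of the idea, the sketch below pairs a BTB-like predictor entry with an optional pointer to a pre-decoded copy of the code stored in the memory hierarchy; the entry fields and the fall-back path are assumptions, not details taken from the paper:

    ```python
    class BTBEntry:
        """Branch predictor entry, extended with a pointer to decoded code (assumed layout)."""
        def __init__(self, target, decoded_addr=None):
            self.target = target              # predicted branch target
            self.decoded_addr = decoded_addr  # memory address of decoded copy, if any

    def front_end_fetch(pc, btb, memory):
        entry = btb.get(pc)
        if entry is not None and entry.decoded_addr is not None:
            # The predictor steers fetch to the already-decoded instructions,
            # so the decode stage can be bypassed for this block.
            return memory.read(entry.decoded_addr), True
        # Otherwise fetch the original instructions and decode them as usual.
        return memory.read(pc), False
    ```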

    Better branch prediction through prophet/critic hybrids

    The prophet/critic hybrid conditional branch predictor has two component predictors. The prophet uses a branch's history to predict its direction. We call this prediction, together with the predictions for the branches following it, the branch future. The critic uses the branch's history and future to critique the prophet's prediction. The hybrid combines the prophet's prediction with the critique, either agreement or disagreement, to form the branch's overall prediction. Results show these hybrids can reduce mispredictions by 39 percent and improve processor performance by 7.8 percent.
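    A minimal sketch of the combination rule described above, with placeholder prophet and critic components (the real predictor organization and future depth are not specified in the abstract):

    ```python
    def hybrid_predict(history, prophet, critic, future_depth=4):
        # The prophet predicts this branch and the branches expected to follow
        # it; together these predictions form the "branch future".
        future = [prophet.predict(history, depth) for depth in range(future_depth)]
        taken = future[0]
        # The critic sees both the history and the future, and either agrees
        # with the prophet or vetoes it; disagreement flips the prediction.
        return taken if critic.agrees(history, future) else not taken
    ```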

    A low-complexity, high-performance fetch unit for simultaneous multithreading processors

    Simultaneous multithreading (SMT) is an architectural technique that allows several threads to execute in parallel on the same core. Fetch performance has been identified as the most important bottleneck for SMT processors. The commonly adopted solution has been fetching from more than one thread each cycle. Recent studies have proposed a plethora of fetch policies to manage fetch priority among threads, trying to increase fetch performance. We demonstrate that simultaneous sharing of the fetch unit, apart from increasing its complexity, can be counterproductive in terms of performance. We evaluate the use of high-performance fetch units in the context of SMT. Our new fetch architecture allows us to feed an 8-way processor while fetching from a single thread each cycle, reducing complexity and increasing the usefulness of proposed fetch policies. Our results show that using new high-performance fetch units, like the FTB or the stream fetch, provides higher performance than fetching from two threads using common SMT fetch architectures. Furthermore, our results show that our design obtains better average performance for all kinds of workloads (both ILP and memory-bound benchmarks), in contrast to previously proposed solutions.
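    The sketch below illustrates fetching from a single thread per cycle: one thread is selected (here with an ICOUNT-style priority, fewest in-flight instructions first, which is only an assumption) and the full width of a high-performance fetch unit is devoted to it:

    ```python
    def select_thread(threads):
        """Pick the ready thread with the fewest in-flight instructions (ICOUNT-style)."""
        ready = [t for t in threads if not t.stalled]
        return min(ready, key=lambda t: t.inflight, default=None)

    def fetch_cycle(threads, fetch_width=8):
        thread = select_thread(threads)
        if thread is None:
            return []                      # nothing ready to fetch this cycle
        # The whole fetch bandwidth goes to one thread, so a wide fetch unit
        # (e.g. an FTB- or stream-based one) can be used without replication.
        return thread.fetch_unit.fetch(thread.pc, fetch_width)
    ```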

    A complexity-effective simultaneous multithreading architecture

    Different applications may exhibit radically different behaviors and thus have very different requirements in terms of hardware support. In simultaneous multithreading (SMT) architectures, the hardware is shared among multiple running applications in order to make better use of it. However, current architectures are designed for the common case and try to satisfy a number of different application classes with a single design. That is, current designs are usually overdesigned for most cases, obtaining high performance but wasting a lot of resources to do so. In this paper we present an alternative SMT architecture, the heterogeneously distributed SMT (hdSMT). Our architecture is based on a novel combination of SMT and clustering techniques in a heterogeneity-aware fashion. The hardware is designed to match heterogeneous application behavior with statically and heterogeneously partitioned resources. Such a design aims to minimize the resources wasted to achieve a given performance rate. On top of our statically partitioned architecture, we propose a heuristic policy to map threads to clusters so that each cluster matches the characteristics of the running threads and overall hardware usage is optimized. We compare our hdSMT architecture with a monolithic SMT processor, where all threads compete for the same resources, and with a homogeneous clustered SMT, where resources are statically and equally partitioned across clusters. Our results show that hdSMT architectures obtain average improvements of 13% and 14% in performance per area over monolithic SMT and homogeneously clustered SMT, respectively.
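    The mapping heuristic itself is not spelled out in the abstract; the following is a hedged sketch of one plausible greedy policy, assuming per-thread demand estimates, per-cluster capacities, and at most one thread per cluster:

    ```python
    def map_threads(threads, clusters):
        """Greedy demand-to-capacity mapping; assumes len(threads) <= len(clusters)."""
        assignment = {}
        free = sorted(clusters, key=lambda c: c.capacity)   # smallest cluster first
        for t in sorted(threads, key=lambda t: t.demand, reverse=True):
            # Smallest cluster that still covers the thread's demand; if none
            # fits, fall back to the largest remaining cluster.
            chosen = next((c for c in free if c.capacity >= t.demand), free[-1])
            assignment[t] = chosen
            free.remove(chosen)
        return assignment
    ```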

    Maximizing multithreaded multicore architectures through thread migrations

    Heterogeneity in general-purpose workloads often results in suboptimal per-thread hardware resource usage. The current trend towards multicore architectures containing several multithreaded cores increases the need for a complexity-effective way to expose the heterogeneity in general-purpose workloads to the underlying hardware, in order to obtain the full potential performance of these architectures. In this paper we present the Heterogeneity-Aware Dynamic Thread Migrator (hDTM), a novel complexity-effective hardware mechanism that exposes the heterogeneity in software to the hardware, also enabling the hardware to react to dynamic behavior variations in the running applications. By means of core-to-core thread migrations, the hDTM mechanism strives to achieve the desired behavior transparently to the operating system. As an example of the general-purpose hDTM concept presented in this paper, we describe a naive hDTM implementation for a Power5-like processor and provide results on the benefits of the proposed mechanism. Our results indicate that even this simple hDTM implementation is able to get close to hDTM's goal, not only avoiding losses due to bad thread-to-core assignments (up to 25%) but also going beyond the upper limit of the best static thread-to-core assignment.
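    As a rough sketch of the migration decision (the Power5-like implementation details, sampling interval, and threshold are all assumptions), a periodic check could swap a pair of threads between cores whenever the swap is expected to improve the overall behavior match:

    ```python
    def maybe_migrate(cores, affinity, threshold=0.1):
        # affinity(thread, core) is an assumed hardware estimate of how well a
        # thread's current behavior (ILP vs. memory-bound) matches a core.
        for a in cores:
            for b in cores:
                if a is b:
                    continue
                current = affinity(a.thread, a) + affinity(b.thread, b)
                swapped = affinity(a.thread, b) + affinity(b.thread, a)
                if swapped > current + threshold:    # hysteresis against thrashing
                    a.thread, b.thread = b.thread, a.thread   # transparent to the OS
                    return True
        return False
    ```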
