
    Instruction distribution heuristics for quad-cluster, dynamically-scheduled, superscalar processors

    We investigate instruction distribution methods for quad-cluster, dynamically-scheduled superscalar processors. We study a variety of methods with different cost, performance, and complexity characteristics. We investigate both non-adaptive and adaptive methods and their sensitivity to inter-cluster communication latencies and to pipeline depth. Furthermore, we develop a set of models that allow us to identify how well each method attacks issue-bandwidth and inter-cluster communication restrictions. We find that a relatively simple method that changes clusters every three instructions incurs only a 17% performance slowdown compared to a non-clustered configuration operating at the same frequency. Moreover, we show that by utilizing adaptive methods it is possible to further reduce this gap to about 14%. Furthermore, performance appears to be more sensitive to inter-cluster communication latencies than to pipeline depth. The best performing method incurs a slowdown of about 24% when the inter-cluster communication latency is two cycles; this gap is only 20% when two additional stages are introduced in the front-end pipeline.
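    The simple non-adaptive heuristic mentioned above can be sketched as a round-robin steering policy that switches clusters every three instructions. This is a minimal illustration of the idea, not the paper's implementation; the function and parameter names are our own.

```python
def distribute_mod3(instructions, num_clusters=4, group_size=3):
    """Assign instructions to clusters round-robin, switching to the
    next cluster after every group_size consecutive instructions."""
    return [(inst, (i // group_size) % num_clusters)
            for i, inst in enumerate(instructions)]
```

    For example, with four clusters the first three instructions go to cluster 0, the next three to cluster 1, and so on, wrapping around after cluster 3.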

    Energy-aware dynamic resource allocation heuristics for clustered processors

    We introduce asymmetric frequency clustering (AFC), a micro-architectural technique that reduces the dynamic power dissipated by a processor's back-end while maintaining high performance. We present a dual-cluster, dual-frequency machine comprising a performance-oriented cluster and a power-aware one. The power-aware cluster operates at half the frequency of the performance-oriented cluster and uses a lower supply voltage. We show that this organization significantly reduces back-end power dissipation by executing non-performance-critical instructions in the power-aware cluster. AFC localizes the two frequency/voltage domains. Consequently, it mitigates many of the complexities associated with maintaining multiple supply voltage and frequency domains on the same chip. Key to the success of this technique are methods that assign as many instructions as possible to the slower, lower-power cluster without impacting overall performance. We evaluate our techniques using a subset of SPEC2000 and SPEC95. AFC provides a 16% back-end power reduction with 1.5% performance loss compared to a conventional, dual-clustered processor where each cluster has a scheduler of the same size.
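    The assignment problem at the heart of AFC can be sketched as a steering step that sends performance-critical instructions to the fast cluster and everything else to the half-frequency cluster. The criticality test is left abstract here; how criticality is actually identified is the substance of the paper, and the names below are illustrative only.

```python
FAST, SLOW = 0, 1  # performance-oriented and power-aware clusters

def steer_afc(inst_stream, is_critical):
    """Steer each instruction: performance-critical ones go to the fast
    cluster, the rest to the half-frequency, lower-voltage cluster."""
    return [(inst, FAST if is_critical(inst) else SLOW)
            for inst in inst_stream]
```

    The more instructions the criticality predicate safely classifies as non-critical, the larger the back-end power saving, which is why the steering methods are the key design point.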

    Instruction Flow-Based Front-end Throttling for Power-Aware High-Performance Processors

    We present a number of power-aware instruction front-end (fetch/decode) throttling methods for high-performance, dynamically-scheduled superscalar processors. Our methods reduce power dissipation by selectively turning instruction fetch and decode on and off. Moreover, they have a negligible impact on performance as they deliver instructions just in time for exploiting the available parallelism. Previously proposed front-end throttling methods rely on branch prediction confidence estimation. We introduce a new class of methods that exploit information about instruction flow (the rate of instructions passing through stages). We show that our methods can boost power savings over previously proposed methods. In particular, for an 8-way processor, a combined method reduces traffic by 14%, 20%, 6%, and 6% for the fetch, decode, issue, and complete stages respectively, while performance remains mostly unaffected. The best previously proposed method reduces traffic by 10%, 15%, 4%, and 4% respectively.
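    One plausible flow-based rule, sketched below as an assumption rather than the paper's exact method, is to track the backlog between how fast instructions are decoded and how fast the issue stage drains them, and to gate fetch whenever that backlog exceeds a threshold.

```python
class FlowThrottle:
    """Gate fetch/decode when instructions enter the pipeline faster
    than the issue stage consumes them (illustrative sketch)."""

    def __init__(self, threshold=16):
        self.threshold = threshold  # illustrative backlog limit
        self.backlog = 0            # decoded-but-not-issued instructions

    def cycle(self, decoded, issued):
        """Update per-cycle counts; return True to stall fetch this cycle."""
        self.backlog = max(self.backlog + decoded - issued, 0)
        return self.backlog > self.threshold
```

    When the backlog is high, the back-end already holds enough parallelism, so stalling fetch saves front-end power with little performance cost; once issue drains the backlog, fetch resumes.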

    Reducing Execution Unit Leakage Power in Embedded Processors

    We introduce low-overhead power optimization techniques to reduce leakage power in embedded processors. Our techniques improve on previous work by a) taking into account the idle-time distribution of each execution unit, and b) using instruction decode and control dependencies to wake up gated (but needed) units as soon as possible. Using the per-unit idle-time distribution lets us detect an idle period as early as possible, which in turn increases our leakage power savings. In addition, we use information already available in the processor to predict when a gated execution unit will be needed again. This results in early and less costly reactivation of gated execution units. We evaluate our techniques for a representative subset of MiBench benchmarks and for a processor configuration similar to Intel's XScale processor. We show that our techniques reduce leakage power considerably while maintaining performance.
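    The two ideas above can be combined in a simple per-unit controller: gate the unit after a unit-specific idle threshold (chosen from its idle-time distribution), and ungate it early when decode sees an instruction that will need it. This is a minimal sketch under those assumptions; the class and signal names are ours.

```python
class UnitGater:
    """Per-execution-unit leakage controller: power-gates after a
    unit-specific idle threshold, wakes early on a decode hint."""

    def __init__(self, idle_threshold):
        # Tuned per unit from its observed idle-time distribution.
        self.idle_threshold = idle_threshold
        self.idle = 0
        self.gated = False

    def cycle(self, unit_used, decode_hint):
        """decode_hint: decode sees an upcoming instruction for this unit."""
        if decode_hint or unit_used:
            self.idle = 0
            self.gated = False   # early, cheap reactivation
        else:
            self.idle += 1
            if self.idle >= self.idle_threshold:
                self.gated = True
        return self.gated
```

    A unit that is idle often and in long stretches (e.g. a multiplier) can use a short threshold, while a frequently used ALU gets a long one, so gating rarely hits a unit that is about to be needed.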

    Using Lazy Instruction Prediction to Reduce Processor Wakeup Power Dissipation

    We study lazy instructions, which we define as those spending long periods in the issue queue. We investigate lazy instruction predictability and show how their behavior can be exploited to reduce activity and power dissipation in modern processors. We show that a simple, small 64-entry table can identify up to 50% of lazy instructions by recording their past behavior. We exploit this to a) reduce wakeup activity and power dissipation in the issue queue and b) reduce the number of in-flight instructions and the average instruction issue delay in the processor. We also introduce two power optimization techniques that use lazy instruction behavior to improve energy efficiency. Our study shows that, by using these optimizations, it is possible to reduce wakeup activity and power dissipation by up to 34% and 29% respectively, at a performance cost of 1.5%. In addition, we reduce average instruction issue delay and the number of in-flight instructions by up to 8.5% and 7% respectively, with no performance cost.
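    The 64-entry history table described above can be sketched as a PC-indexed predictor that remembers whether an instruction sat long in the issue queue last time it was seen. The indexing scheme and field layout below are assumptions for illustration, not the paper's exact design.

```python
class LazyPredictor:
    """Small PC-indexed table remembering whether each (aliased)
    instruction was 'lazy' -- long-resident in the issue queue."""

    def __init__(self, entries=64):
        self.table = [False] * entries
        self.mask = entries - 1  # entries assumed to be a power of two

    def predict(self, pc):
        return self.table[pc & self.mask]

    def update(self, pc, was_lazy):
        self.table[pc & self.mask] = was_lazy
```

    Instructions predicted lazy need not be woken up aggressively every cycle, which is where the wakeup activity and power savings come from; mispredictions only delay issue slightly, keeping the performance cost small.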