16 research outputs found

    Dynamic cluster resizing

    Processor resources required for effective execution of an application vary across its different sections. We propose to take advantage of clustering to turn off resources that do not contribute to improving performance. First, we present a simple hardware scheme to dynamically compute the energy consumed by each processor block and the energy-delay² (ED²) product for a given interval of time. This scheme is used to compute the effectiveness of the current configuration in terms of ED² and to evaluate the benefits of increasing or decreasing the number of active issue queues. Performance evaluation shows an average ED² improvement of 18%, and up to 50% for some applications, in a quad-cluster architecture.
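
    The interval-based ED² feedback loop this abstract describes can be sketched roughly as follows. This is our illustrative reading: the names (`ed2`, `pick_queue_count`) and the counter values are invented, standing in for the paper's hardware energy/delay counters.

```python
def ed2(energy, delay):
    """Energy-delay-squared product for one interval: lower is better."""
    return energy * delay ** 2

def pick_queue_count(samples, max_queues=4):
    """Choose the number of active issue queues whose sampled interval
    gave the lowest ED2. `samples` maps queue count -> (energy, delay),
    standing in for the paper's hardware counters."""
    best = min(samples, key=lambda n: ed2(*samples[n]))
    return max(1, min(best, max_queues))

# Four queues finish the interval faster but burn more energy, so on
# ED2 the controller keeps only two queues active and turns two off.
samples = {2: (1.0, 1.0), 4: (2.5, 0.8)}
choice = pick_queue_count(samples)
```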

    Exploiting compiler-generated schedules for energy savings in high-performance processors

    This paper develops a technique that uniquely combines the advantages of static scheduling and dynamic scheduling to reduce the energy consumed in modern superscalar processors with out-of-order issue logic. In this Hybrid-Scheduling paradigm, regions of the application containing large amounts of parallelism visible at compile time completely bypass the dynamic scheduling logic and execute in a low-power static mode. Simulation studies using the Wattch framework on several media and scientific benchmarks demonstrate large improvements in overall energy consumption: 43% in kernels and 25% in full applications, with only a 2.8% average performance degradation.
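
    The per-region mode decision at the heart of Hybrid Scheduling can be caricatured as a simple test. The region descriptor and the threshold below are our assumptions for illustration, not the paper's actual heuristic.

```python
def choose_mode(static_ilp, dynamic_ilp, threshold=0.9):
    """Return 'static' (bypass the out-of-order issue logic, saving its
    energy) when a compile-time schedule captures nearly all of the
    parallelism the dynamic scheduler would find, else 'dynamic'.
    `threshold` is a hypothetical tuning knob."""
    return "static" if static_ilp >= threshold * dynamic_ilp else "dynamic"

# A loop kernel whose schedule the compiler fully exposes runs statically;
# irregular code keeps the dynamic issue logic on.
kernel_mode = choose_mode(static_ilp=3.8, dynamic_ilp=4.0)
irregular_mode = choose_mode(static_ilp=2.0, dynamic_ilp=4.0)
```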

    Chapter One – An Overview of Architecture-Level Power- and Energy-Efficient Design Techniques

    Power dissipation and energy consumption have become the primary design constraints for almost all computer systems over the last 15 years. Both computer architects and circuit designers strive to reduce power and energy (without performance degradation) at all design levels, as power is currently the main obstacle to further scaling according to Moore's law. The aim of this survey is to provide a comprehensive overview of state-of-the-art power- and energy-efficient techniques. We classify techniques by the component to which they apply, which is the most natural division from a designer's point of view. We further divide the techniques by the component of power/energy they optimize (static or dynamic), thereby covering the complete low-power design flow at the architectural level. We conclude that only a holistic approach that combines optimizations at all design levels can lead to significant savings.

    Power efficient resource scaling in partitioned architectures through dynamic heterogeneity

    The ever-increasing demand for high clock speeds and the desire to exploit abundant transistor budgets have resulted in alarming increases in processor power dissipation. Partitioned (or clustered) architectures have been proposed in recent years to address scalability concerns in future billion-transistor microprocessors. Our analysis shows that increasing processor resources in a clustered architecture results in a linear increase in power consumption while providing diminishing improvements in single-thread performance. To preserve high performance-to-power ratios, we argue that the power consumption of additional resources should be in proportion to the performance improvements they yield. Hence, in this paper, we propose the implementation of heterogeneous clusters that have varying delay and power characteristics. A cluster's performance and power characteristics are tuned by scaling its frequency, and novel policies dynamically assign frequencies to clusters while attempting either to meet a fixed power budget or to minimize a metric such as Energy×Delay² (ED²). By increasing resources in a power-efficient manner, we observe an 11% improvement in ED² and a 22.4% average reduction in peak temperature compared to a processor with homogeneous units. Our proposed processor model also provides strategies to handle thermal emergencies with relatively low impact on performance.
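
    As a sketch of the optimisation target (not the paper's runtime policies, which are dynamic heuristics), the budget-constrained frequency assignment can be written as a brute-force search over per-cluster frequency levels. The cubic power model is a common first-order assumption (voltage scaled with frequency), not taken from the paper.

```python
from itertools import product

def power(freq):
    """Toy model: dynamic power grows roughly as f^3 when voltage
    scales with frequency. An assumption for illustration only."""
    return freq ** 3

def assign_frequencies(n_clusters, freq_levels, budget):
    """Pick per-cluster frequencies maximising summed frequency (a
    stand-in for throughput) without exceeding the power budget."""
    best, best_perf = None, -1.0
    for combo in product(freq_levels, repeat=n_clusters):
        if sum(power(f) for f in combo) <= budget:
            perf = sum(combo)
            if perf > best_perf:
                best, best_perf = combo, perf
    return best
```

    Under a tight budget the search yields a heterogeneous pair (one fast, one slow cluster), which is exactly the configuration the paper's dynamic policies try to reach incrementally.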

    Design and Evaluation of a Simulator for the STRAIGHT Processor Architecture

    Microprocessors, the core components of personal computers, smartphones, and many other information devices, underpin today's information society. Advances in semiconductor technology have increased the number of transistors available in a package, but the power consumed by on-chip circuits has not scaled down as fast as process feature sizes. As a result, each process shrink allows more cores per package, yet total chip power grows accordingly, and package-level power and thermal limits prevent all of the on-chip transistors from being driven simultaneously, a limitation known as the dark-silicon problem. Solving it requires a new architecture that offers better single-thread performance than conventional designs while executing the same work with less power. We have proposed the STRAIGHT architecture, which provides a vast register space, as an architecture that simultaneously improves single-thread performance and reduces power consumption. STRAIGHT assumes the execution of write-once code over its large logical register space, and therefore needs none of register renaming, free-register management, or register-release tracking, which are major power overheads in conventional out-of-order processors. Write-once code also enables lightweight scaling of the instruction-window size and front-end width, which was previously difficult, further improving single-thread performance. In this thesis, we design a simulator, build an assembler, generate STRAIGHT-specific code, and evaluate the architecture using that code. Feeding the generated STRAIGHT version of the Livermore Loops to the STRAIGHT assembler yields STRAIGHT binaries, which serve as input to the simulator we designed, allowing a detailed evaluation of the architecture. In this evaluation, STRAIGHT improved IPC over the conventional Alpha architecture by up to 88% and by 29% on average. It also reduced the number of instructions required for the same work by up to about 90% and by about 55% on average compared with Alpha. On the combined metric multiplying the IPC gain by the reduction in instructions required per 1000 loop iterations, STRAIGHT achieved up to 12.5 times and on average about 3 times the performance of Alpha. The University of Electro-Communications, 201
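
    The roughly 3× combined figure follows directly from the two quoted averages. As a back-of-the-envelope check (using only the averages, not per-benchmark data):

```python
ipc_gain = 1.29          # STRAIGHT IPC relative to Alpha: +29% on average
insn_ratio = 1.0 - 0.55  # about 55% fewer instructions for the same work
# Speedup combines more instructions per cycle with fewer instructions needed.
combined = ipc_gain / insn_ratio  # ~2.9, i.e. about 3x
```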

    Improving Application Performance by Dynamically Balancing Speed and Complexity in a GALS Microprocessor

    Microprocessors are traditionally designed to provide “best overall” performance across a wide range of applications and operating environments. Several groups have proposed hardware techniques that save energy by “downsizing” hardware resources that are underutilized by particular applications. We explore the converse: “upsizing” hardware resources in order to improve performance relative to an aggressively clocked baseline processor. Our proposal depends critically on the ability to change frequencies independently in separate domains of a globally asynchronous, locally synchronous (GALS) microprocessor. We use a variant of our multiple clock domain (MCD) processor, with four independently clocked domains. Each domain is streamlined with modest hardware structures for very high clock frequency. Key structures can then be upsized on demand to exploit more distant parallelism, improve branch prediction, or increase cache capacity. Although doing so requires decreasing the associated domain frequency, other domain frequencies are unaffected. Measuring across a broad suite of application benchmarks, we find that configuring just once per application increases performance by an average of 17.6% compared to the best fully synchronous design. When adapting to application phases, performance improves by over 20%.
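
    The core trade-off, that upsizing a structure helps only when the IPC gain outweighs the frequency penalty paid by that one domain, can be stated in a line. The numbers below are invented for illustration, and treating whole-processor IPC as if it scaled with one domain's frequency is a first-order simplification of the MCD model.

```python
def upsize_pays_off(base_ipc, upsized_ipc, base_freq, upsized_freq):
    """In a GALS design only the resized domain slows down, so to first
    order the test is whether instructions per second still improve."""
    return upsized_ipc * upsized_freq > base_ipc * base_freq

# Doubling a cache helps a memory-bound phase despite a 15% slower domain...
memory_bound = upsize_pays_off(1.0, 1.3, 1.0, 0.85)
# ...but not a phase that barely uses the extra capacity.
compute_bound = upsize_pays_off(1.0, 1.05, 1.0, 0.85)
```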

    Integrating adaptive on-chip storage structures for reduced dynamic power

    Energy efficiency in microarchitectures has become a necessity. Significant dynamic energy savings can be realized in adaptive storage structures such as caches, issue queues, and register files by disabling unnecessary storage resources. Prior studies have analyzed individual structures and their control. A common theme of these studies is exploration of the configuration space, using system IPC as feedback to guide reconfiguration. However, when multiple structures adapt in concert, the number of possible configurations increases dramatically, and attributing causal effects to IPC changes becomes problematic. To overcome this issue, we introduce designs that are reconfigured solely on local behavior. We introduce a novel cache design that permits direct calculation of efficient configurations. For buffer and queue structures, limited histogramming permits precise resizing control. Applying these techniques, we show energy savings of up to 70% in the individual structures, and savings averaging 30% of the energy attributed to these structures overall, with an average performance degradation of 2.1%.
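
    A minimal sketch of the limited-histogramming idea for queue resizing, assuming a per-interval occupancy histogram and fixed-size enable/disable partitions. The function name, the partition step, and the coverage slack are our assumptions, not the paper's parameters.

```python
def resize_from_histogram(occupancy_hist, capacity, step=8, slack=0.98):
    """Keep enough entries to cover `slack` of observed occupancies,
    rounded up to the partition granularity; disable the rest.
    `occupancy_hist` maps occupancy -> cycle count for the interval."""
    total = sum(occupancy_hist.values())
    covered, needed = 0, 0
    for occ in sorted(occupancy_hist):
        covered += occupancy_hist[occ]
        needed = occ
        if covered >= slack * total:
            break
    partitions = -(-needed // step)  # ceiling division
    return min(capacity, max(step, partitions * step))
```

    With a strict coverage target a few high-occupancy cycles keep the full queue enabled; relaxing the target lets the controller shut half the entries off, which is where the dynamic energy savings come from.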

    Dynamically managing the communication-parallelism trade-off in future clustered processors

    Clustered microarchitectures are an attractive alternative to large monolithic superscalar designs because of their potential for higher clock rates in the face of increasingly wire-delay-constrained process technologies. As increasing transistor counts allow more clusters, and thereby more aggressive exploitation of instruction-level parallelism (ILP), inter-cluster communication grows as data values spread across a wider area. Because of this trade-off between communication and parallelism, a subset of the total on-chip clusters is optimal for performance. To match the hardware to the application's needs, we use a robust algorithm to dynamically tune the clustered architecture. The algorithm, which is based on program metrics gathered at periodic intervals, achieves an 11% average performance improvement over the best statically defined architecture. We also show that additional hardware and reconfiguration at basic-block boundaries can achieve average improvements of 15%. Our results demonstrate that reconfiguration provides an effective solution to the communication/parallelism trade-off inherent in the communication-bound processors of the future.
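
    The interval-based exploration can be sketched as profiling each candidate cluster count once and committing to the best, with a toy IPC curve where communication costs overtake ILP beyond eight clusters. The candidate set and the curve are invented for illustration.

```python
def tune_clusters(measure_ipc, candidates=(2, 4, 8, 16)):
    """Run one profiling interval per candidate cluster count, then
    commit to the best. `measure_ipc` stands in for the program
    metrics gathered at periodic intervals."""
    return max(candidates, key=measure_ipc)

# Toy workload: IPC rises with parallelism, then falls as inter-cluster
# communication dominates, so a subset of the clusters is optimal.
ipc = {2: 1.1, 4: 1.6, 8: 1.9, 16: 1.7}
best = tune_clusters(ipc.get)
```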
