Search CORE

8 research outputs found

Compiler-directed energy reduction using dynamic voltage scaling and voltage Islands for embedded systems

Author: Chen G.
Kandemir M.
Ozturk O.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2013
Field of study

Cataloged from PDF version of article.Addressing power and energy consumption related issues early in the system design flow ensures good design and minimizes iterations for faster turnaround time. In particular, optimizations at software level, e.g., those supported by compilers, are very important for minimizing energy consumption of embedded applications. Recent research demonstrates that voltage islands provide the flexibility to reduce power by selectively shutting down the different regions of the chip and/or running the select parts of the chip at different voltage/frequency levels. As against most of the prior work on voltage islands that mainly focused on the architecture design and IP placement related issues, this paper studies the necessary software compiler support for voltage islands. Specifically, we focus on an embedded multiprocessor architecture that supports both voltage islands and control domains within these islands, and determine how an optimizing compiler can automatically map an embedded application onto this architecture. Such an automated support is critical since it is unrealistic to expect an application programmer to reach a good mapping correlating multiple factors such as performance and energy at the same time. Our experiments with the proposed compiler support show that our approach is very effective in reducing energy consumption. The experiments also show that the energy savings we achieve are consistent across a wide range of values of our major simulation parameters

Bilkent University Institutional Repository

PerfBound: Conserving Energy with Bounded Overheads in On/Off-Based HPC Interconnects

Author: Carpenter Paul M.
Saravanan Karthikeyan P.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2018
Field of study

Energy and power are key challenges in high-performance computing. System energy efficiency must be significantly improved, and this requires greater efficiency in all subcomponents. An important target of optimization is the interconnect, since network links are always on, consuming power even during idle periods. A large number of HPC machines have a primary interconnect based on Ethernet (about 40 percent of TOP500 machines), which, since 2010, has included support for saving power via Energy Efficient Ethernet (EEE). Nevertheless, it is unlikely that HPC interconnects would use these energy saving modes unless the performance overhead is known and small. This paper presents PerfBound, a self-contained technique to manage on/off-based networks such as EEE, minimizing interconnect link energy consumption subject to a bound on the performance degradation. PerfBound does not require changes to the applications and it uses only local information already available at switches and NICs without introducing additional communication messages, and is also compatible with multi-hop networks. PerfBound is evaluated using traces from a production supercomputer. For twelve out of fourteen applications, PerfBound has high energy savings, up to 70 percent for only 1 percent performance degradation. This paper also presents DynamicFastwake, which extends PerfBound to exploit multiple low-power states. DynamicFastwake achieves an energy-delay product 10 percent lower than the original PerfBound techniqueThis research was supported by European Union’s 7th Framework Programme [FP7/2007-2013] under the Mont-Blanc-3 (FP7-ICT-671697) and EUROSERVER (FP7-ICT-610456) projects, the Ministry of Economy and Competitiveness of Spain (TIN2012-34557 and TIN2015-65316), Generalitat de Catalunya (FI-AGAUR 2012 FI B 00644, 2014-SGR-1051 and 2014-SGR-1272), the European Union’s Horizon2020 research and innovation programme under the HiPEAC-3 Network of Excellence (ICT-287759), and the Severo Ochoa Program (SEV-2011-00067) of the Spanish Government.Peer ReviewedPostprint (author's final draft

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Variable-width datapath for on-chip network static power reduction

Author
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date
Field of study

Crossref

POWAR: Power-Aware Routing in HPC Networks with On/Off Links

Author: Alonso Díaz Marina
Andújar-Muñoz Francisco José
Coll Salvador
López Rodríguez Pedro Juan
Martínez-Rubio Juan-Miguel
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2019
Field of study

[EN] In order to save energy in HPC interconnection networks, one usual proposal is to switch idle links into a low-power mode after a certain time without any transmission, as IEEE Energy Efficient Ethernet standard proposes. Extending the low-power mode mechanism, we propose POWer-Aware Routing (POWAR), a simple power-aware routing and selection function for fat-tree and torus networks. POWAR adapts the amount of network links that can be used, taking into account the network load, and obtaining great energy savings in the network (55%-65%) and the entire system (9%-10%) with negligible performance overhead.This work has been supported by the Spanish MINECO and European Commission (FEDER funds) under project TIN2015-66972-C5-1-R. Francisco J. Andujar has been partially funded by the Spanish MICINN and by the ERDF program of the European Union: PCAS Project (TIN2017-88614-R), CAPAP-H6 (TIN2016-81840-REDT), and Junta de Castilla y Leon FEDER Grant VA082P17 (PROPHET Project).Andújar-Muñoz, FJ.; Coll, S.; Alonso Díaz, M.; López Rodríguez, PJ.; Martínez-Rubio, J. (2019). POWAR: Power-Aware Routing in HPC Networks with On/Off Links. ACM Transactions on Architecture and Code Optimization. 15(4):1-22. https://doi.org/10.1145/3293445S122154Abts, D., Marty, M. R., Wells, P. M., Klausler, P., & Liu, H. (2010). Energy proportional datacenter networks. Proceedings of the 37th annual international symposium on Computer architecture - ISCA ’10. doi:10.1145/1815961.1816004Adiga, N. R., Blumrich, M. A., Chen, D., Coteus, P., Gara, A., Giampapa, M. E., … Vranas, P. (2005). Blue Gene/L torus interconnection network. IBM Journal of Research and Development, 49(2.3), 265-276. doi:10.1147/rd.492.0265M. Alonso S. Coll J. M. Martínez V. Santonja and P. López. 2015. Power consumption management in fat-tree interconnection networks. Parallel Comput. 48 C (Oct. 2015) 59--80. 10.1016/j.parco.2015.03.007 M. Alonso S. Coll J. M. Martínez V. Santonja and P. López. 2015. Power consumption management in fat-tree interconnection networks. Parallel Comput. 48 C (Oct. 2015) 59--80. 10.1016/j.parco.2015.03.007Marina Alonso, Coll, S., Martínez, J.-M., Santonja, V., López, P., & Duato, J. (2010). Power saving in regular interconnection networks. Parallel Computing, 36(12), 696-712. doi:10.1016/j.parco.2010.08.003Bob Alverson Edwin Froese Larry Kaplan and Duncan Roweth. 2012. Cray XC series network. Cray Inc. White Paper WP-Aries01-1112 (2012). Bob Alverson Edwin Froese Larry Kaplan and Duncan Roweth. 2012. Cray XC series network. Cray Inc. White Paper WP-Aries01-1112 (2012).Anderson, T. E., Owicki, S. S., Saxe, J. B., & Thacker, C. P. (1993). High-speed switch scheduling for local-area networks. ACM Transactions on Computer Systems, 11(4), 319-352. doi:10.1145/161541.161736Andujar, F. J., Villar, J. A., Sanchez, J. L., Alfaro, F. J., & Escudero-Sahuquillo, J. (2015). VEF Traces: A Framework for Modelling MPI Traffic in Interconnection Network Simulators. 2015 IEEE International Conference on Cluster Computing. doi:10.1109/cluster.2015.141Barroso, L. A., & Hölzle, U. (2007). The Case for Energy-Proportional Computing. Computer, 40(12), 33-37. doi:10.1109/mc.2007.443Camacho, J., & Flich, J. (2011). HPC-Mesh: A Homogeneous Parallel Concentrated Mesh for Fault-Tolerance and Energy Savings. 2011 ACM/IEEE Seventh Symposium on Architectures for Networking and Communications Systems. doi:10.1109/ancs.2011.17Chen, D., Parker, J. J., Eisley, N. A., Heidelberger, P., Senger, R. M., Sugawara, Y., … Steinmacher-Burow, B. (2011). The IBM Blue Gene/Q interconnection network and message unit. Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on - SC ’11. doi:10.1145/2063384.2063419Chen, L., & Pinkston, T. M. (2012). NoRD: Node-Router Decoupling for Effective Power-gating of On-Chip Routers. 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture. doi:10.1109/micro.2012.33Christensen, K., Reviriego, P., Nordman, B., Bennett, M., Mostowfi, M., & Maestro, J. (2010). IEEE 802.3az: the road to energy efficient ethernet. IEEE Communications Magazine, 48(11), 50-56. doi:10.1109/mcom.2010.5621967Dally, & Seitz. (1987). Deadlock-Free Message Routing in Multiprocessor Interconnection Networks. IEEE Transactions on Computers, C-36(5), 547-553. doi:10.1109/tc.1987.1676939Das, R., Narayanasamy, S., Satpathy, S. K., & Dreslinski, R. G. (2013). Catnap. Proceedings of the 40th Annual International Symposium on Computer Architecture - ISCA ’13. doi:10.1145/2485922.2485950Derradji, S., Palfer-Sollier, T., Panziera, J.-P., Poudes, A., & Atos, F. W. (2015). The BXI Interconnect Architecture. 2015 IEEE 23rd Annual Symposium on High-Performance Interconnects. doi:10.1109/hoti.2015.15Jack Dongarra Hans W. Meuer and Erich Strohmaier. 2018. TOP500 Supercomputer Sites. Retrieved from https://www.top500.org. Jack Dongarra Hans W. Meuer and Erich Strohmaier. 2018. TOP500 Supercomputer Sites. Retrieved from https://www.top500.org.Duato, J. (1993). A new theory of deadlock-free adaptive routing in wormhole networks. IEEE Transactions on Parallel and Distributed Systems, 4(12), 1320-1331. doi:10.1109/71.250114José Duato Sudhakar Yalamanchili and Lionel Ni. 2003. Interconnection Networks. An Engineering Approach. Morgan Kaufmann Publishers Inc. San Francisco CA. José Duato Sudhakar Yalamanchili and Lionel Ni. 2003. Interconnection Networks. An Engineering Approach. Morgan Kaufmann Publishers Inc. San Francisco CA.GALGO 2017. GALGO—Albacete Research Institute of Informatics Supercomputer Center homepage. Retrieved from http://www.i3a.uclm.es/galgo. GALGO 2017. GALGO—Albacete Research Institute of Informatics Supercomputer Center homepage. Retrieved from http://www.i3a.uclm.es/galgo.Greenberg, A., Hamilton, J., Maltz, D. A., & Patel, P. (2008). The cost of a cloud. ACM SIGCOMM Computer Communication Review, 39(1), 68-73. doi:10.1145/1496091.1496103HPCC {n.d.}. HPC Challenge Benchmark. Retrieved from http://icl.cs.utk.edu/hpcc/index.html. HPCC {n.d.}. HPC Challenge Benchmark. Retrieved from http://icl.cs.utk.edu/hpcc/index.html.Hluchyj, M. G., & Karol, M. J. (1988). Queueing in high-performance packet switching. IEEE Journal on Selected Areas in Communications, 6(9), 1587-1597. doi:10.1109/49.12886Koibuchi, M., Otsuka, T., Hiroki Matsutani, & Amano, H. (2009). An on/off link activation method for low-power ethernet in PC clusters. 2009 IEEE International Symposium on Parallel & Distributed Processing. doi:10.1109/ipdps.2009.5161069Phillips, J. C., Braun, R., Wang, W., Gumbart, J., Tajkhorshid, E., Villa, E., … Schulten, K. (2005). Scalable molecular dynamics with NAMD. Journal of Computational Chemistry, 26(16), 1781-1802. doi:10.1002/jcc.20289Pronk, S., Páll, S., Schulz, R., Larsson, P., Bjelkmar, P., Apostolov, R., … Lindahl, E. (2013). GROMACS 4.5: a high-throughput and highly parallel open source molecular simulation toolkit. Bioinformatics, 29(7), 845-854. doi:10.1093/bioinformatics/btt055Reviriego, P., Hernandez, J., Larrabeiti, D., & Maestro, J. (2009). Performance evaluation of energy efficient ethernet. IEEE Communications Letters, 13(9), 697-699. doi:10.1109/lcomm.2009.090880K. P. Saravanan and P. Carpenter. 2018. PerfBound: Conserving energy with bounded overheads in on/off-based HPC interconnects. IEEE Trans. Comput. (2018) 1--1. 10.1109/TC.2018.2790394 K. P. Saravanan and P. Carpenter. 2018. PerfBound: Conserving energy with bounded overheads in on/off-based HPC interconnects. IEEE Trans. Comput. (2018) 1--1. 10.1109/TC.2018.2790394Saravanan, K. P., Carpenter, P. M., & Ramirez, A. (2013). Power/performance evaluation of energy efficient Ethernet (EEE) for High Performance Computing. 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). doi:10.1109/ispass.2013.6557171Soteriou, V., & Li-Shiuan Peh. (s. f.). Dynamic power management for power optimization of interconnection networks using on/off links. 11th Symposium on High Performance Interconnects, 2003. Proceedings. doi:10.1109/conect.2003.1231472Totoni, E., Jain, N., & Kale, L. V. (2013). Toward Runtime Power Management of Exascale Networks by on/off Control of Links. 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum. doi:10.1109/ipdpsw.2013.191VEF 2017. VEF traces homepage. Retrieved from http://www.i3a.info/VEFtraces. VEF 2017. VEF traces homepage. Retrieved from http://www.i3a.info/VEFtraces

RiuNet

Dynamic Power Management for Power Optimization of Interconnection Networks Using On/Off Links

Author: Li-Shiuan Peh
Vassos Soteriou
Publication venue
Publication date
Field of study

Power consumption in interconnection networks has become an increasingly important architectural issue. The links which interconnect network node routers are a major consumer of power and will devour an ever-increasing portion of total available power as network bandwidth and operating frequencies upscale. In this paper we propose a dynamic power management policy where network links are turned off and switched back on depending on network utilization in a distributed fashion. We have devised a systematic approach based on the derivation of a connectivity graph that balances power and performance for a 2D mesh topology. This coupled with a deadlock-free, fullyadaptive routing algorithm guarantees packet delivery. Our approach realizes up to ##### reduction in overall network link power for an 8-ary 2-mesh topology with a moderate network latency increase

CiteSeerX

Adaptive Routing Approaches for Networked Many-Core Systems

Author: Ebrahimi Masoumeh
Publication venue: Annales Universitatis Turkuensis A I 465
Publication date: 28/06/2013
Field of study

Through advances in technology, System-on-Chip design is moving towards integrating tens to hundreds of intellectual property blocks into a single chip. In such a many-core system, on-chip communication becomes a performance bottleneck for high performance designs. Network-on-Chip (NoC) has emerged as a viable solution for the communication challenges in highly complex chips. The NoC architecture paradigm, based on a modular packet-switched mechanism, can address many of the on-chip communication challenges such as wiring complexity, communication latency, and bandwidth. Furthermore, the combined benefits of 3D IC and NoC schemes provide the possibility of designing a high performance system in a limited chip area. The major advantages of 3D NoCs are the considerable reductions in average latency and power consumption. There are several factors degrading the performance of NoCs. In this thesis, we investigate three main performance-limiting factors: network congestion, faults, and the lack of efficient multicast support. We address these issues by the means of routing algorithms. Congestion of data packets may lead to increased network latency and power consumption. Thus, we propose three different approaches for alleviating such congestion in the network. The first approach is based on measuring the congestion information in different regions of the network, distributing the information over the network, and utilizing this information when making a routing decision. The second approach employs a learning method to dynamically find the less congested routes according to the underlying traffic. The third approach is based on a fuzzy-logic technique to perform better routing decisions when traffic information of different routes is available. Faults affect performance significantly, as then packets should take longer paths in order to be routed around the faults, which in turn increases congestion around the faulty regions. We propose four methods to tolerate faults at the link and switch level by using only the shortest paths as long as such path exists. The unique characteristic among these methods is the toleration of faults while also maintaining the performance of NoCs. To the best of our knowledge, these algorithms are the first approaches to bypassing faults prior to reaching them while avoiding unnecessary misrouting of packets. Current implementations of multicast communication result in a significant performance loss for unicast traffic. This is due to the fact that the routing rules of multicast packets limit the adaptivity of unicast packets. We present an approach in which both unicast and multicast packets can be efficiently routed within the network. While suggesting a more efficient multicast support, the proposed approach does not affect the performance of unicast routing at all. In addition, in order to reduce the overall path length of multicast packets, we present several partitioning methods along with their analytical models for latency measurement. This approach is discussed in the context of 3D mesh networks.Siirretty Doriast

UTUPub

Performance-aware energy optimizations in networks for HPC

Author: Saravanan Karthikeyan P.
Publication venue: Universitat Politècnica de Catalunya
Publication date: 01/01/2016
Field of study

Energy efficiency is an important challenge in the field of High Performance Computing (HPC). High energy requirements not only limit the potential to realize next-generation machines but are also an increasing part of the total cost of ownership of an HPC system. While at large HPC systems are becoming increasingly energy proportional in an effort to reduce energy costs, interconnect links stand out for their inefficiency. Commodity interconnect links remain ¿always-on¿, consuming full power even when no data is being transmitted. Although various techniques have been proposed towards energy- proportional interconnects, they are often too conservative or are not focused toward HPC. Aggressive techniques for interconnect energy savings are often not applied to HPC, in particular, because they may incur excessive performance overheads. Any energy-saving technique will only be adopted in HPC if there is no significant impact on performance, which is still the primary design objective. This thesis explores interconnect energy proportionality from a performance perspective. In this thesis, first a characterization of HPC applications is presented, making a case for the enormous potential for interconnect energy proportionality with HPC applications. Next, an HPC interconnect with on/off based links, modeled after the IEEE Energy Efficient Ethernet protocol, is evaluated. This evaluation while presenting a relationship between performance impact and energy over HPC applications also emphasizes the need for performance focused designs in energy efficient interconnects. Next, an adaptive mechanism, PerfBound, is presented that saves link energy subject to a bound on application performance overheads. Finally this evaluation structure is applied into an intermediate link power state, in addition to the traditional on and off states. Results of this study, over 15 production HPC applications show that, compared to current day always-on HPC interconnects, link energy can be reduced by unto 70%, while application performance overhead is bounded to only 1%.La eficiencia energética es un gran reto en el área de la Supercomputación (HPC), las grandes necesidades de energía no solo limitan el potencial de las computadoras de nueva generación, sino que también aumentan el coste de funcionamiento de estos sistemas. Mientras que los sistemas HPC tienden a ser cada vez más energéticamente proporcionales en un empeño por reducir costes, los enlaces de interconexión siguen siendo muy ineficientes. Los enlaces de interconexión comunes funcionan en modo "always-on", es decir, consumiendo energía incluso cuando no transmiten. Aunque se han propuesto algunas técnicas que ayuden a la proporcionalidad energética de los enlaces de interconexión, éstas han sido muy agresivas o poco enfocadas hacia su uso con sistemas HPC. Las técnicas de ahorro energético para los enlaces más agresivas no suelen ser utilizadas en HPC, particularmente porque degradan excesivamente el rendimiento. Cualquier técnica de ahorro energético solo será adoptada en sistemas HPC si no hay un impacto excesivo en el rendimiento, el cual es el principal objetivo de estos sistemas. En esta tesis, primeramente se presenta una nueva caracterización de aplicaciones HPC, remarcando el enorme potencial de la proporcionalidad en los enlaces de interconexión proporcionales para aplicaciones HPC. Seguidamente, se evaluará siguiendo el protocolo "IEEE Energy Efficient Ethernet" un link de interconexión on/off. Esta evaluación presentará una relación de impacto energético y rendimiento en aplicaciones HPC, enfatizando en la necesidad de usar un enlace de interconexión enfocados a la eficiencia. Se continuará con la presentación de un mecanismo adaptivo, PerfBound, que ahorra energía respetando unos límites máximos de impacto en el rendimiento. Finalmente, esta estructura es aplicada a un nuevo estado intermedio de funcionamiento adicional a los estados tradicionales on/off. Los resultados de este estudio, muestran que en más de 15 aplicaciones HPC la energía en los enlaces puede ser reducida en un 70% en comparación con enlaces "always-on", mientras que el impacto en el rendimiento es de tan solo un 1%.Postprint (published version

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Tesis Doctorals en Xarxa

새로운 메모리 기술을 기반으로 한 메모리 시스템 설계 기술

Author: 안준환
Publication venue: 서울대학교 대학원
Publication date: 01/02/2017
Field of study

학위논문 (박사)-- 서울대학교 대학원 : 전기·컴퓨터공학부, 2017. 2. 최기영.Performance and energy efficiency of modern computer systems are largely dominated by the memory system. This memory bottleneck has been exacerbated in the past few years with (1) architectural innovations for improving the efficiency of computation units (e.g., chip multiprocessors), which shift the major cause of inefficiency from processors to memory, and (2) the emergence of data-intensive applications, which demands a large capacity of main memory and an excessive amount of memory bandwidth to efficiently handle such workloads. In order to address this memory wall challenge, this dissertation aims at exploring the potential of emerging memory technologies and designing a high-performance, energy-efficient memory hierarchy that is aware of and leverages the characteristics of such new memory technologies. The first part of this dissertation focuses on energy-efficient on-chip cache design based on a new non-volatile memory technology called Spin-Transfer Torque RAM (STT-RAM). When STT-RAM is used to build on-chip caches, it provides several advantages over conventional charge-based memory (e.g., SRAM or eDRAM), such as non-volatility, lower static power, and higher density. However, simply replacing SRAM caches with STT-RAM rather increases the energy consumption because write operations of STT-RAM are slower and more energy-consuming than those of SRAM. To address this challenge, we propose four novel architectural techniques that can alleviate the impact of inefficient STT-RAM write operations on system performance and energy consumption. First, we apply STT-RAM to instruction caches (where write operations are relatively infrequent) and devise a power-gating mechanism called LASIC, which leverages the non-volatility of STT-RAM to turn off STT-RAM instruction caches inside small loops. Second, we propose lower-bits cache, which exploits the narrow bit-width characteristics of application data by caching frequent bit-flips at lower bits in a small SRAM cache. Third, we present prediction hybrid cache, an SRAM/STT-RAM hybrid cache whose block placement between SRAM and STT-RAM is determined by predicting the write intensity of each cache block with a new hardware structure called write intensity predictor. Fourth, we propose DASCA, which predicts write operations that can bypass the cache without incurring extra cache misses (called dead writes) and lets the last-level cache bypass such dead writes to reduce write energy consumption. The second part of this dissertation architects intelligent main memory and its host architecture support based on logic-enabled DRAM. Traditionally, main memory has served the sole purpose of storing data because the extra manufacturing cost of implementing rich functionality (e.g., computation) on a DRAM die was unacceptably high. However, the advent of 3D die stacking now provides a practical, cost-effective way to integrate complex logic circuits into main memory, thereby opening up the possibilities for intelligent main memory. For example, it can be utilized to implement advanced memory management features (e.g., scheduling, power management, etc.) inside memoryit can be also used to offload computation to main memory, which allows us to overcome the memory bandwidth bottleneck caused by narrow off-chip channels (commonly known as processing-in-memory or PIM). The remaining questions are what to implement inside main memory and how to integrate and expose such new features to existing systems. In order to answer these questions, we propose four system designs that utilize logic-enabled DRAM to improve system performance and energy efficiency. First, we utilize the existing logic layer of a Hybrid Memory Cube (a commercial logic-enabled DRAM product) to (1) dynamically turn off some of its off-chip links by monitoring the actual bandwidth demand and (2) integrate prefetch buffer into main memory to perform aggressive prefetching without consuming off-chip link bandwidth. Second, we propose a scalable accelerator for large-scale graph processing called Tesseract, in which graph processing computation is offloaded to specialized processors inside main memory in order to achieve memory-capacity-proportional performance. Third, we design a low-overhead PIM architecture for near-term adoption called PIM-enabled instructions, where PIM operations are interfaced as cache-coherent, virtually-addressed host processor instructions that can be executed either by the host processor or in main memory depending on the data locality. Fourth, we propose an energy-efficient PIM system called aggregation-in-memory, which can adaptively execute PIM operations at any level of the memory hierarchy and provides a fully automated compiler toolchain that transforms existing applications to use PIM operations without programmer intervention.Chapter 1 Introduction 1 1.1 Inefficiencies in the Current Memory Systems 2 1.1.1 On-Chip Caches 2 1.1.2 Main Memory 2 1.2 New Memory Technologies: Opportunities and Challenges 3 1.2.1 Energy-Efficient On-Chip Caches based on STT-RAM 3 1.2.2 Intelligent Main Memory based on Logic-Enabled DRAM 6 1.3 Dissertation Overview 9 Chapter 2 Previous Work 11 2.1 Energy-Efficient On-Chip Caches based on STT-RAM 11 2.1.1 Hybrid Caches 11 2.1.2 Volatile STT-RAM 13 2.1.3 Redundant Write Elimination 14 2.2 Intelligent Main Memory based on Logic-Enabled DRAM 15 2.2.1 PIM Architectures in the 1990s 15 2.2.2 Modern PIM Architectures based on 3D Stacking 15 2.2.3 Modern PIM Architectures on Memory Dies 17 Chapter 3 Loop-Aware Sleepy Instruction Cache 19 3.1 Architecture 20 3.1.1 Loop Cache 21 3.1.2 Loop-Aware Sleep Controller 22 3.2 Evaluation and Discussion 24 3.2.1 Simulation Environment 24 3.2.2 Energy 25 3.2.3 Performance 27 3.2.4 Sensitivity Analysis 27 3.3 Summary 28 Chapter 4 Lower-Bits Cache 29 4.1 Architecture 29 4.2 Experiments 32 4.2.1 Simulator and Cache Model 32 4.2.2 Results 33 4.3 Summary 34 Chapter 5 Prediction Hybrid Cache 35 5.1 Problem and Motivation 37 5.1.1 Problem Definition 37 5.1.2 Motivation 37 5.2 Write Intensity Predictor 38 5.2.1 Keeping Track of Trigger Instructions 39 5.2.2 Identifying Hot Trigger Instructions 40 5.2.3 Dynamic Set Sampling 41 5.2.4 Summary 42 5.3 Prediction Hybrid Cache 43 5.3.1 Need for Write Intensity Prediction 43 5.3.2 Organization 43 5.3.3 Operations 44 5.3.4 Dynamic Threshold Adjustment 45 5.4 Evaluation Methodology 48 5.4.1 Simulator Configuration 48 5.4.2 Workloads 50 5.5 Single-Core Evaluations 51 5.5.1 Energy Consumption and Speedup 51 5.5.2 Energy Breakdown 53 5.5.3 Coverage and Accuracy 54 5.5.4 Sensitivity to Write Intensity Threshold 55 5.5.5 Impact of Dynamic Set Sampling 55 5.5.6 Results for Non-Write-Intensive Workloads 56 5.6 Multicore Evaluations 57 5.7 Summary 59 Chapter 6 Dead Write Prediction Assisted STT-RAM Cache 61 6.1 Motivation 62 6.1.1 Energy Impact of Inefficient Write Operations 62 6.1.2 Limitations of Existing Approaches 63 6.1.3 Potential of Dead Writes 64 6.2 Dead Write Classification 65 6.2.1 Dead-on-Arrival Fills 65 6.2.2 Dead-Value Fills 66 6.2.3 Closing Writes 66 6.2.4 Decomposition 67 6.3 Dead Write Prediction Assisted STT-RAM Cache Architecture 68 6.3.1 Dead Write Prediction 68 6.3.2 Bidirectional Bypass 71 6.4 Evaluation Methodology 72 6.4.1 Simulation Configuration 72 6.4.2 Workloads 74 6.5 Evaluation for Single-Core Systems 75 6.5.1 Energy Consumption and Speedup 75 6.5.2 Coverage and Accuracy 78 6.5.3 Sensitivity to Signature 78 6.5.4 Sensitivity to Update Policy 80 6.5.5 Implications of Device-/Circuit-Level Techniques for Write Energy Reduction 80 6.5.6 Impact of Prefetching 80 6.6 Evaluation for Multi-Core Systems 81 6.6.1 Energy Consumption and Speedup 81 6.6.2 Application to Inclusive Caches 83 6.6.3 Application to Three-Level Cache Hierarchy 84 6.7 Summary 85 Chapter 7 Link Power Management for Hybrid Memory Cubes 87 7.1 Background and Motivation 88 7.1.1 Hybrid Memory Cube 88 7.1.2 Motivation 89 7.2 HMC Link Power Management 91 7.2.1 Link Delay Monitor 91 7.2.2 Power State Transition 94 7.2.3 Overhead 95 7.3 Two-Level Prefetching 95 7.4 Application to Multi-HMC Systems 97 7.5 Experiments 98 7.5.1 Methodology 98 7.5.2 Link Energy Consumption and Speedup 100 7.5.3 HMC Energy Consumption 102 7.5.4 Runtime Behavior of LPM 102 7.5.5 Sensitivity to Slowdown Threshold 104 7.5.6 LPM without Prefetching 104 7.5.7 Impact of Prefetching on Link Traffic 105 7.5.8 On-Chip Prefetcher Aggressiveness in 2LP 107 7.5.9 Tighter Off-Chip Bandwidth Margin 107 7.5.10 Multithreaded Workloads 108 7.5.11 Multi-HMC Systems 109 7.6 Summary 111 Chapter 8 Tesseract PIM System for Parallel Graph Processing 113 8.1 Background and Motivation 115 8.1.1 Large-Scale Graph Processing 115 8.1.2 Graph Processing on Conventional Systems 117 8.1.3 Processing-in-Memory 118 8.2 Tesseract Architecture 119 8.2.1 Overview 119 8.2.2 Remote Function Call via Message Passing 122 8.2.3 Prefetching 124 8.2.4 Programming Interface 126 8.2.5 Application Mapping 127 8.3 Evaluation Methodology 128 8.3.1 Simulation Configuration 128 8.3.2 Workloads 129 8.4 Evaluation Results 130 8.4.1 Performance 130 8.4.2 Iso-Bandwidth Comparison 133 8.4.3 Execution Time Breakdown 134 8.4.4 Prefetch Efficiency 134 8.4.5 Scalability 135 8.4.6 Effect of Higher Off-Chip Network Bandwidth 136 8.4.7 Effect of Better Graph Distribution 137 8.4.8 Energy/Power Consumption and Thermal Analysis 138 8.5 Summary 139 Chapter 9 PIM-Enabled Instructions 141 9.1 Potential of ISA Extensions as the PIM Interface 143 9.2 PIM Abstraction 145 9.2.1 Operations 145 9.2.2 Memory Model 147 9.2.3 Software Modification 148 9.3 Architecture 148 9.3.1 Overview 148 9.3.2 PEI Computation Unit (PCU) 149 9.3.3 PEI Management Unit (PMU) 150 9.3.4 Virtual Memory Support 153 9.3.5 PEI Execution 153 9.3.6 Comparison with Active Memory Operations 154 9.4 Target Applications for Case Study 155 9.4.1 Large-Scale Graph Processing 155 9.4.2 In-Memory Data Analytics 156 9.4.3 Machine Learning and Data Mining 157 9.4.4 Operation Summary 157 9.5 Evaluation Methodology 158 9.5.1 Simulation Configuration 158 9.5.2 Workloads 159 9.6 Evaluation Results 159 9.6.1 Performance 160 9.6.2 Sensitivity to Input Size 163 9.6.3 Multiprogrammed Workloads 164 9.6.4 Balanced Dispatch: Idea and Evaluation 165 9.6.5 Design Space Exploration for PCUs 165 9.6.6 Performance Overhead of the PMU 167 9.6.7 Energy, Area, and Thermal Issues 167 9.7 Summary 168 Chapter 10 Aggregation-in-Memory 171 10.1 Motivation 173 10.1.1 Rethinking PIM for Energy Efficiency 173 10.1.2 Aggregation as PIM Operations 174 10.2 Architecture 176 10.2.1 Overview 176 10.2.2 Programming Model 177 10.2.3 On-Chip Caches 177 10.2.4 Coherence and Consistency 181 10.2.5 Main Memory 181 10.2.6 Potential Generalization Opportunities 183 10.3 Compiler Support 184 10.4 Contributions over Prior Art 185 10.4.1 PIM-Enabled Instructions 185 10.4.2 Parallel Reduction in Caches 187 10.4.3 Row Buffer Locality of DRAM Writes 188 10.5 Target Applications 188 10.6 Evaluation Methodology 190 10.6.1 Simulation Configuration 190 10.6.2 Hardware Overhead 191 10.6.3 Workloads 192 10.7 Evaluation Results 192 10.7.1 Energy Consumption and Performance 192 10.7.2 Dynamic Energy Breakdown 196 10.7.3 Comparison with Aggressive Writeback 197 10.7.4 Multiprogrammed Workloads 198 10.7.5 Comparison with Intrinsic-based Code 198 10.8 Summary 199 Chapter 11 Conclusion 201 11.1 Energy-Efficient On-Chip Caches based on STT-RAM 202 11.2 Intelligent Main Memory based on Logic-Enabled DRAM 203 Bibliography 205 요약 227Docto

SNU Open Repository and Archive