7 research outputs found

    A Syntax Directed Imperative Language Microprocessor for Reduced Power Consumption and Improved Performance

    Get PDF
    This thesis investigates high-level Instruction Set Architectures (ISAs) and supporting processor architectures. A Syntax Directed Imperative Language Processor (SDLP) and associated ISA have been defined with the ultimate aim of reducing power consumption and improving performance. The findings of this thesis suggest that there may be a number of benefits of the SDLP over traditional ISAs and architectures. Initial results suggest that the SDLP ISA places less burden on the memory system by reducing the number of instructions executed for a given program. It also appears that the SDLP could reduce the number of interactions with the memory system for data. These results are significant since a large portion of the total power for a system is consumed by the memory system. It is illustrated how the SDLP requires fewer cycle counts for the equivalent throughput of traditional microprocessor architectures. The implication is that further perfor- mance improvements could be obtained with uniprocessors, before considering multiprocessors. The main contributions of this thesis include: • The design of a hybrid control flow and data flow architecture with a supporting Instruction Set Architecture; • Implementation of an assembler and software-based cycle accurate simulator for the SDLP processor; • Comparisons of the SDLP architecture with traditional CISC and RISC processors; • It has been shown that high-level ISAs and supporting processor architectures can reduce the burden on the memory system for both instructions and data; and can reduce the cycle count of programs

    새로운 메모리 기술을 기반으로 한 메모리 시스템 설계 기술

    Get PDF
    학위논문 (박사)-- 서울대학교 대학원 : 전기·컴퓨터공학부, 2017. 2. 최기영.Performance and energy efficiency of modern computer systems are largely dominated by the memory system. This memory bottleneck has been exacerbated in the past few years with (1) architectural innovations for improving the efficiency of computation units (e.g., chip multiprocessors), which shift the major cause of inefficiency from processors to memory, and (2) the emergence of data-intensive applications, which demands a large capacity of main memory and an excessive amount of memory bandwidth to efficiently handle such workloads. In order to address this memory wall challenge, this dissertation aims at exploring the potential of emerging memory technologies and designing a high-performance, energy-efficient memory hierarchy that is aware of and leverages the characteristics of such new memory technologies. The first part of this dissertation focuses on energy-efficient on-chip cache design based on a new non-volatile memory technology called Spin-Transfer Torque RAM (STT-RAM). When STT-RAM is used to build on-chip caches, it provides several advantages over conventional charge-based memory (e.g., SRAM or eDRAM), such as non-volatility, lower static power, and higher density. However, simply replacing SRAM caches with STT-RAM rather increases the energy consumption because write operations of STT-RAM are slower and more energy-consuming than those of SRAM. To address this challenge, we propose four novel architectural techniques that can alleviate the impact of inefficient STT-RAM write operations on system performance and energy consumption. First, we apply STT-RAM to instruction caches (where write operations are relatively infrequent) and devise a power-gating mechanism called LASIC, which leverages the non-volatility of STT-RAM to turn off STT-RAM instruction caches inside small loops. Second, we propose lower-bits cache, which exploits the narrow bit-width characteristics of application data by caching frequent bit-flips at lower bits in a small SRAM cache. Third, we present prediction hybrid cache, an SRAM/STT-RAM hybrid cache whose block placement between SRAM and STT-RAM is determined by predicting the write intensity of each cache block with a new hardware structure called write intensity predictor. Fourth, we propose DASCA, which predicts write operations that can bypass the cache without incurring extra cache misses (called dead writes) and lets the last-level cache bypass such dead writes to reduce write energy consumption. The second part of this dissertation architects intelligent main memory and its host architecture support based on logic-enabled DRAM. Traditionally, main memory has served the sole purpose of storing data because the extra manufacturing cost of implementing rich functionality (e.g., computation) on a DRAM die was unacceptably high. However, the advent of 3D die stacking now provides a practical, cost-effective way to integrate complex logic circuits into main memory, thereby opening up the possibilities for intelligent main memory. For example, it can be utilized to implement advanced memory management features (e.g., scheduling, power management, etc.) inside memoryit can be also used to offload computation to main memory, which allows us to overcome the memory bandwidth bottleneck caused by narrow off-chip channels (commonly known as processing-in-memory or PIM). The remaining questions are what to implement inside main memory and how to integrate and expose such new features to existing systems. In order to answer these questions, we propose four system designs that utilize logic-enabled DRAM to improve system performance and energy efficiency. First, we utilize the existing logic layer of a Hybrid Memory Cube (a commercial logic-enabled DRAM product) to (1) dynamically turn off some of its off-chip links by monitoring the actual bandwidth demand and (2) integrate prefetch buffer into main memory to perform aggressive prefetching without consuming off-chip link bandwidth. Second, we propose a scalable accelerator for large-scale graph processing called Tesseract, in which graph processing computation is offloaded to specialized processors inside main memory in order to achieve memory-capacity-proportional performance. Third, we design a low-overhead PIM architecture for near-term adoption called PIM-enabled instructions, where PIM operations are interfaced as cache-coherent, virtually-addressed host processor instructions that can be executed either by the host processor or in main memory depending on the data locality. Fourth, we propose an energy-efficient PIM system called aggregation-in-memory, which can adaptively execute PIM operations at any level of the memory hierarchy and provides a fully automated compiler toolchain that transforms existing applications to use PIM operations without programmer intervention.Chapter 1 Introduction 1 1.1 Inefficiencies in the Current Memory Systems 2 1.1.1 On-Chip Caches 2 1.1.2 Main Memory 2 1.2 New Memory Technologies: Opportunities and Challenges 3 1.2.1 Energy-Efficient On-Chip Caches based on STT-RAM 3 1.2.2 Intelligent Main Memory based on Logic-Enabled DRAM 6 1.3 Dissertation Overview 9 Chapter 2 Previous Work 11 2.1 Energy-Efficient On-Chip Caches based on STT-RAM 11 2.1.1 Hybrid Caches 11 2.1.2 Volatile STT-RAM 13 2.1.3 Redundant Write Elimination 14 2.2 Intelligent Main Memory based on Logic-Enabled DRAM 15 2.2.1 PIM Architectures in the 1990s 15 2.2.2 Modern PIM Architectures based on 3D Stacking 15 2.2.3 Modern PIM Architectures on Memory Dies 17 Chapter 3 Loop-Aware Sleepy Instruction Cache 19 3.1 Architecture 20 3.1.1 Loop Cache 21 3.1.2 Loop-Aware Sleep Controller 22 3.2 Evaluation and Discussion 24 3.2.1 Simulation Environment 24 3.2.2 Energy 25 3.2.3 Performance 27 3.2.4 Sensitivity Analysis 27 3.3 Summary 28 Chapter 4 Lower-Bits Cache 29 4.1 Architecture 29 4.2 Experiments 32 4.2.1 Simulator and Cache Model 32 4.2.2 Results 33 4.3 Summary 34 Chapter 5 Prediction Hybrid Cache 35 5.1 Problem and Motivation 37 5.1.1 Problem Definition 37 5.1.2 Motivation 37 5.2 Write Intensity Predictor 38 5.2.1 Keeping Track of Trigger Instructions 39 5.2.2 Identifying Hot Trigger Instructions 40 5.2.3 Dynamic Set Sampling 41 5.2.4 Summary 42 5.3 Prediction Hybrid Cache 43 5.3.1 Need for Write Intensity Prediction 43 5.3.2 Organization 43 5.3.3 Operations 44 5.3.4 Dynamic Threshold Adjustment 45 5.4 Evaluation Methodology 48 5.4.1 Simulator Configuration 48 5.4.2 Workloads 50 5.5 Single-Core Evaluations 51 5.5.1 Energy Consumption and Speedup 51 5.5.2 Energy Breakdown 53 5.5.3 Coverage and Accuracy 54 5.5.4 Sensitivity to Write Intensity Threshold 55 5.5.5 Impact of Dynamic Set Sampling 55 5.5.6 Results for Non-Write-Intensive Workloads 56 5.6 Multicore Evaluations 57 5.7 Summary 59 Chapter 6 Dead Write Prediction Assisted STT-RAM Cache 61 6.1 Motivation 62 6.1.1 Energy Impact of Inefficient Write Operations 62 6.1.2 Limitations of Existing Approaches 63 6.1.3 Potential of Dead Writes 64 6.2 Dead Write Classification 65 6.2.1 Dead-on-Arrival Fills 65 6.2.2 Dead-Value Fills 66 6.2.3 Closing Writes 66 6.2.4 Decomposition 67 6.3 Dead Write Prediction Assisted STT-RAM Cache Architecture 68 6.3.1 Dead Write Prediction 68 6.3.2 Bidirectional Bypass 71 6.4 Evaluation Methodology 72 6.4.1 Simulation Configuration 72 6.4.2 Workloads 74 6.5 Evaluation for Single-Core Systems 75 6.5.1 Energy Consumption and Speedup 75 6.5.2 Coverage and Accuracy 78 6.5.3 Sensitivity to Signature 78 6.5.4 Sensitivity to Update Policy 80 6.5.5 Implications of Device-/Circuit-Level Techniques for Write Energy Reduction 80 6.5.6 Impact of Prefetching 80 6.6 Evaluation for Multi-Core Systems 81 6.6.1 Energy Consumption and Speedup 81 6.6.2 Application to Inclusive Caches 83 6.6.3 Application to Three-Level Cache Hierarchy 84 6.7 Summary 85 Chapter 7 Link Power Management for Hybrid Memory Cubes 87 7.1 Background and Motivation 88 7.1.1 Hybrid Memory Cube 88 7.1.2 Motivation 89 7.2 HMC Link Power Management 91 7.2.1 Link Delay Monitor 91 7.2.2 Power State Transition 94 7.2.3 Overhead 95 7.3 Two-Level Prefetching 95 7.4 Application to Multi-HMC Systems 97 7.5 Experiments 98 7.5.1 Methodology 98 7.5.2 Link Energy Consumption and Speedup 100 7.5.3 HMC Energy Consumption 102 7.5.4 Runtime Behavior of LPM 102 7.5.5 Sensitivity to Slowdown Threshold 104 7.5.6 LPM without Prefetching 104 7.5.7 Impact of Prefetching on Link Traffic 105 7.5.8 On-Chip Prefetcher Aggressiveness in 2LP 107 7.5.9 Tighter Off-Chip Bandwidth Margin 107 7.5.10 Multithreaded Workloads 108 7.5.11 Multi-HMC Systems 109 7.6 Summary 111 Chapter 8 Tesseract PIM System for Parallel Graph Processing 113 8.1 Background and Motivation 115 8.1.1 Large-Scale Graph Processing 115 8.1.2 Graph Processing on Conventional Systems 117 8.1.3 Processing-in-Memory 118 8.2 Tesseract Architecture 119 8.2.1 Overview 119 8.2.2 Remote Function Call via Message Passing 122 8.2.3 Prefetching 124 8.2.4 Programming Interface 126 8.2.5 Application Mapping 127 8.3 Evaluation Methodology 128 8.3.1 Simulation Configuration 128 8.3.2 Workloads 129 8.4 Evaluation Results 130 8.4.1 Performance 130 8.4.2 Iso-Bandwidth Comparison 133 8.4.3 Execution Time Breakdown 134 8.4.4 Prefetch Efficiency 134 8.4.5 Scalability 135 8.4.6 Effect of Higher Off-Chip Network Bandwidth 136 8.4.7 Effect of Better Graph Distribution 137 8.4.8 Energy/Power Consumption and Thermal Analysis 138 8.5 Summary 139 Chapter 9 PIM-Enabled Instructions 141 9.1 Potential of ISA Extensions as the PIM Interface 143 9.2 PIM Abstraction 145 9.2.1 Operations 145 9.2.2 Memory Model 147 9.2.3 Software Modification 148 9.3 Architecture 148 9.3.1 Overview 148 9.3.2 PEI Computation Unit (PCU) 149 9.3.3 PEI Management Unit (PMU) 150 9.3.4 Virtual Memory Support 153 9.3.5 PEI Execution 153 9.3.6 Comparison with Active Memory Operations 154 9.4 Target Applications for Case Study 155 9.4.1 Large-Scale Graph Processing 155 9.4.2 In-Memory Data Analytics 156 9.4.3 Machine Learning and Data Mining 157 9.4.4 Operation Summary 157 9.5 Evaluation Methodology 158 9.5.1 Simulation Configuration 158 9.5.2 Workloads 159 9.6 Evaluation Results 159 9.6.1 Performance 160 9.6.2 Sensitivity to Input Size 163 9.6.3 Multiprogrammed Workloads 164 9.6.4 Balanced Dispatch: Idea and Evaluation 165 9.6.5 Design Space Exploration for PCUs 165 9.6.6 Performance Overhead of the PMU 167 9.6.7 Energy, Area, and Thermal Issues 167 9.7 Summary 168 Chapter 10 Aggregation-in-Memory 171 10.1 Motivation 173 10.1.1 Rethinking PIM for Energy Efficiency 173 10.1.2 Aggregation as PIM Operations 174 10.2 Architecture 176 10.2.1 Overview 176 10.2.2 Programming Model 177 10.2.3 On-Chip Caches 177 10.2.4 Coherence and Consistency 181 10.2.5 Main Memory 181 10.2.6 Potential Generalization Opportunities 183 10.3 Compiler Support 184 10.4 Contributions over Prior Art 185 10.4.1 PIM-Enabled Instructions 185 10.4.2 Parallel Reduction in Caches 187 10.4.3 Row Buffer Locality of DRAM Writes 188 10.5 Target Applications 188 10.6 Evaluation Methodology 190 10.6.1 Simulation Configuration 190 10.6.2 Hardware Overhead 191 10.6.3 Workloads 192 10.7 Evaluation Results 192 10.7.1 Energy Consumption and Performance 192 10.7.2 Dynamic Energy Breakdown 196 10.7.3 Comparison with Aggressive Writeback 197 10.7.4 Multiprogrammed Workloads 198 10.7.5 Comparison with Intrinsic-based Code 198 10.8 Summary 199 Chapter 11 Conclusion 201 11.1 Energy-Efficient On-Chip Caches based on STT-RAM 202 11.2 Intelligent Main Memory based on Logic-Enabled DRAM 203 Bibliography 205 요약 227Docto

    The use of a reconfigurable functional cache in a digital signal processor: power and performance

    Get PDF
    Due to the computationally intensive nature of the tasks that digital signal processors (DSP) are required to perform it is desirable to decrease the time required to execute these tasks. Minimizing the execution time required for the various algorithms that are commonly and frequently executed (ex: FIR filters) will improve the overall performance. It is known that hardware is able to execute algorithms faster than software, however, due to the size limitations of embedded DSP, not all of the necessary algorithms can be implemented in hardware. A reconfigurable cache architecture in combination with a DSP is proposed as an alternative to increase algorithm performance by using reconfigurable hardware rather than dedicated hardware. Another important issue to consider for embedded processors is the power consumption of the DSP. Due to the fact that most embedded processors operate by battery power, energy efficiency is a necessity. This study looks at the power requirements of a DSP with reconfigurable cache to determine the viability of such an architecture in an embedded system. Others have shown that reconfigurable cache in conjunction with a general purpose processor improves performance for some DSP benchmarks. This study shows that a DSP/reconfigurable cache combination can achieve kernel performance gains ranging from 10-350 times that of a DSP architecture operating alone and can achieve overall benchmark speedups ranging from 1.02 to 1.91 times that of the existing DSP architecture. Further, relative power consumption results show that the power consumption of the reconfigurable architecture is approximately 85 to 95% of the current architecture (5-15% power savings) and attains energy savings ranging from approximately 14 to 50%

    The effect of an optical network on-chip on the performance of chip multiprocessors

    Get PDF
    Optical networks on-chip (ONoC) have been proposed to reduce power consumption and increase bandwidth density in high performance chip multiprocessors (CMP), compared to electrical NoCs. However, as buffering in an ONoC is not viable, the end-to-end message path needs to be acquired in advance during which the message is buffered at the network ingress. This waiting latency is therefore a combination of path setup latency and contention and forms a significant part of the total message latency. Many proposed ONoCs, such as Single Writer, Multiple Reader (SWMR), avoid path setup latency at the expense of increased optical components. In contrast, this thesis investigates a simple circuit-switched ONoC with lower component count where nodes need to request a channel before transmission. To hide the path setup latency, a coherence-based message predictor is proposed, to setup circuits before message arrival. Firstly, the effect of latency and bandwidth on application performance is thoroughly investigated using full-system simulations of shared memory CMPs. It is shown that the latency of an ideal NoC affects the CMP performance more than the NoC bandwidth. Increasing the number of wavelengths per channel decreases the serialisation latency and improves the performance of both ONoC types. With 2 or more wavelengths modulating at 25 Gbit=s , the ONoCs will outperform a conventional electrical mesh (maximal speedup of 20%). The SWMR ONoC outperforms the circuit-switched ONoC. Next coherence-based prediction techniques are proposed to reduce the waiting latency. The ideal coherence-based predictor reduces the waiting latency by 42%. A more streamlined predictor (smaller than a L1 cache) reduces the waiting latency by 31%. Without prediction, the message latency in the circuit-switched ONoC is 11% larger than in the SWMR ONoC. Applying the realistic predictor reverses this: the message latency in the SWMR ONoC is now 18% larger than the predictive circuitswitched ONoC

    Cache Resource Allocation in Large-Scale Chip Multiprocessors.

    Full text link
    Chip multiprocessors (CMPs) have become virtually ubiquitous due to the increasing impact of power and thermal constraints being placed on processor design, as well as the diminishing returns of building ever more complex uniprocessors. While the number of cores on a chip has increased rapidly, changes to other aspects of system design have been slower in coming. Namely, the on-chip memory hierarchy has largely been unchanged despite the shift to multicore designs. The last level of cache on chip is generally shared by all hardware threads on the chip, creating a ripe environment for resource allocation problems. This dissertation examines cache resource allocation in large-scale chip multiprocessors. It begins by performing extensive supporting research in the area of shared cache metric analysis, concluding that there is no single optimal shared cache design metric. This result supports the idea that shared caches ought not explicitly attempt to achieve optimal partitions; rather they should only react when unfavorable performance is detected. This study is followed by some studies using machine learning analysis to extract salient characteristics in predicting poor shared cache performance. The culmination of this dissertation is a shared cache management framework called SLAM (Scalable, Lightweight, Adaptive Management). SLAM is a scalable and feasible mechanism which detects inefficiency of cache usage by hardware threads sharing the cache. An inefficient thread can be easily punished by effectively reducing its cache occupancy via a modified cache insertion policy. The crux of SLAM is the detection of inefficient threads, which relies on two novel performance monitors in the cache which stem from the results of the machine learning studies: the Misses Per Access Counter (MPAC), and the Relative Insertion Tracker (RIT), which each requires only tens of bits in storage per thread. SLAM not only provides a means for extracting significant performance gains over current cache designs (up to 13.1% improvement), but SLAM also provides a means for granting differentiated quality of service to various cache sharers. Particularly as commercialized virtual servers become increasingly common, being able to provide differentiated quality of service at low cost potentially has significant value.Ph.D.Computer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/64727/1/hsul_1.pd

    Analyse de la sécurité de systèmes critiques embarqués à forte composante logicielle par interprétation abstraite

    Get PDF
    This thesis is dedicated to the analysis of low-level software, like operating systems, by abstract interpretation. Analyzing OSes is a crucial issue to guarantee the safety of software systems since they are the layer immediately above the hardware and that all applicative tasks rely on them. For critical applications, we want to prove that the OS does not crash, and that it ensures the isolation of programs, so that an untrusted program cannot disrupt a trusted one. The analysis of this kind of programs raises specific issues. This is because OSes must control hardware using instructions that are meaningless in ordinary programs. In addition, because hardware features are outside the scope of C, source code includes assembly blocks mixed with C code. These are the two main axes in this thesis: handling mixed C and assembly, and precise abstraction of instructions that are specific to low-level software. This work is motivated by the analysis of a case study emanating from an industrial partner, which required the implementation of proposed methods in the static analyzer Astrée. The first part is about the formalization of a language mixing simplified models of C and assembly, from syntax to semantics. This specification is crucial to define what is legal and what is a bug, while taking into account the intricacy of interactions of C and assembly, in terms of data flow and control flow. The second part is a short introduction to abstract interpretation focusing on what is useful thereafter. The third part proposes an abstraction of the semantics of mixed C and assembly. This is actually a series of parametric abstractions handling each aspect of the semantics. The fourth part is interested in the question of the abstraction of instructions specific to low-level software. Interest properties can easily be proven using ghost variables, but because of technical reasons, it is difficult to design a reduced product of abstract domains that allows a satisfactory handling of ghost variables. This part builds such a general framework with domains that allow us to solve our problem and many others. The final part details properties to prove in order to guarantee isolation of programs that have not been treated since they raise many complicated questions. We also give some suggestions to improve the product of domains with ghost variables introduced in the previous part, in terms of features and performances.Cette thèse est consacrée à l'analyse de logiciels de bas niveau, tels que les systèmes d'exploitation, par interprétation abstraite. L'analyse des OS est une question importante pour garantir la sûreté des systèmes logiciels puisqu'ils forment le niveau immédiatement au-dessus du matériel et que toutes les tâches applicatives dépendent d'eux. Pour des applications critiques, on veut s'assurer que l'OS ne plante pas, mais aussi qu'il assure l'isolation des programmes, de sorte qu'un programme dont la fiabilité n'a pas été établie ne puisse perturber un programme de confiance. L'analyse de ce genre de programmes soulève des problèmes spécifiques. Cela provient du fait que les OS doivent contrôler le matériel avec des opérations qui n'ont pas de sens dans un programme ordinaire. De plus, comme les fonctionnalités matérielles sont en dehors du for du C, le code source contient des blocs de code assembleur mêlés au C. Ce sont les deux axes de cette thèse : gérer les mélanges de C et d'assembleur, et abstraire finement les opérations spécifiques aux logiciels de bas niveau. Ce travail est guidé par l'analyse d'un cas d'étude d'un partenaire industriel, ce qui a nécessité l'implémentation des méthodes proposées dans l'analyseur statique Astrée. La première partie s'intéresse à la formalisation d'un langage mélangeant des modèles simplifiés du C et de l'assembleur, depuis la syntaxe jusqu'à la sémantique. Cette spécification est importante pour définir ce qui est légal et ce qui constitue une erreur, tout en tenant compte de la complexité des interactions du C et de l'assembleur, tant en termes de données que de flot de contrôle. La seconde partie est une introduction sommaire à l'interprétation abstraite qui se limite à ce qui est utile par la suite. La troisième partie propose une abstraction de la sémantique des mélanges de C et d'assembleur. Il s'agit en fait d'une collection d'abstractions paramétriques qui gèrent chacun des aspects de cette sémantique. La quatrième partie s'intéresse à l'abstraction des opérations spécifiques aux logiciels de bas niveau. Les propriétés d'intérêt peuvent être facilement prouvées à l'aide de variables fantômes, mais pour des raisons techniques, il est difficile de concevoir un produit réduit de domaines abstraits qui supporte une gestion satisfaisante des variables fantômes. Cette partie construit un tel cadre très général ainsi que des domaines qui permettent de résoudre beaucoup de problèmes dont le nôtre. L'ultime partie présente quelques propriétés à prouver pour garantir l'isolation des programmes, qui n'ont pas été traitées, car elles posent de nouvelles et complexes questions. On donne aussi quelques propositions d'amélioration du produit de domaines avec variables fantômes introduit dans la partie précédente, tant en termes de fonctionnalités que de performances
    corecore