1,972 research outputs found

    A Study on Performance and Power Efficiency of Dense Non-Volatile Caches in Multi-Core Systems

    Full text link
    In this paper, we present a novel cache design based on Multi-Level Cell Spin-Transfer Torque RAM (MLC STTRAM) that can dynamically adapt the set capacity and associativity to use efficiently the full potential of MLC STTRAM. We exploit the asymmetric nature of the MLC storage scheme to build cache lines featuring heterogeneous performances, that is, half of the cache lines are read-friendly, while the other is write-friendly. Furthermore, we propose to opportunistically deactivate ways in underutilized sets to convert MLC to Single-Level Cell (SLC) mode, which features overall better performance and lifetime. Our ultimate goal is to build a cache architecture that combines the capacity advantages of MLC and performance/energy advantages of SLC. Our experiments show an improvement of 43% in total numbers of conflict misses, 27% in memory access latency, 12% in system performance, and 26% in LLC access energy, with a slight degradation in cache lifetime (about 7%) compared to an SLC cache

    STT-RAM์„ ์ด์šฉํ•œ ์—๋„ˆ์ง€ ํšจ์œจ์ ์ธ ์บ์‹œ ์„ค๊ณ„ ๊ธฐ์ˆ 

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ (๋ฐ•์‚ฌ)-- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ์ „๊ธฐยท์ปดํ“จํ„ฐ๊ณตํ•™๋ถ€, 2019. 2. ์ตœ๊ธฐ์˜.์ง€๋‚œ ์ˆ˜์‹ญ ๋…„๊ฐ„ '๋ฉ”๋ชจ๋ฆฌ ๋ฒฝ' ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ์˜จ ์นฉ ์บ์‹œ์˜ ํฌ๊ธฐ๋Š” ๊พธ์ค€ํžˆ ์ฆ๊ฐ€ํ•ด์™”๋‹ค. ํ•˜์ง€๋งŒ ์ง€๊ธˆ๊นŒ์ง€ ์บ์‹œ์— ์ฃผ๋กœ ์‚ฌ์šฉ๋˜์–ด ์˜จ ๋ฉ”๋ชจ๋ฆฌ ๊ธฐ์ˆ ์ธ SRAM์€ ๋‚ฎ์€ ์ง‘์ ๋„์™€ ๋†’์€ ๋Œ€๊ธฐ ์ „๋ ฅ ์†Œ๋ชจ๋กœ ์ธํ•ด ํฐ ์บ์‹œ๋ฅผ ๊ตฌ์„ฑํ•˜๋Š” ๋ฐ์—๋Š” ์ ํ•ฉํ•˜์ง€ ์•Š๋‹ค. ์ด๋Ÿฌํ•œ SRAM์˜ ๋‹จ์ ์„ ๋ณด์™„ํ•˜๊ธฐ ์œ„ํ•ด ๋” ๋†’์€ ์ง‘์ ๋„์™€ ๋‚ฎ์€ ๋Œ€๊ธฐ ์ „๋ ฅ์„ ์†Œ๋ชจํ•˜๋Š” ์ƒˆ๋กœ์šด ๋ฉ”๋ชจ๋ฆฌ ๊ธฐ์ˆ ์ธ STT-RAM์œผ๋กœ SRAM์„ ๋Œ€์ฒดํ•˜๋Š” ๊ฒƒ์ด ์ œ์•ˆ๋˜์—ˆ๋‹ค. ํ•˜์ง€๋งŒ STT-RAM์€ ๋ฐ์ดํ„ฐ๋ฅผ ์“ธ ๋•Œ ๋งŽ์€ ์—๋„ˆ์ง€์™€ ์‹œ๊ฐ„์„ ์†Œ๋น„ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ๋‹จ์ˆœํžˆ SRAM์„ STT-RAM์œผ๋กœ ๋Œ€์ฒดํ•˜๋Š” ๊ฒƒ์€ ์˜คํžˆ๋ ค ์บ์‹œ ์—๋„ˆ์ง€ ์†Œ๋น„๋ฅผ ์ฆ๊ฐ€์‹œํ‚จ๋‹ค. ์ด๋Ÿฌํ•œ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” STT-RAM์„ ์ด์šฉํ•œ ์—๋„ˆ์ง€ ํšจ์œจ์ ์ธ ์บ์‹œ ์„ค๊ณ„ ๊ธฐ์ˆ ๋“ค์„ ์ œ์•ˆํ•œ๋‹ค. ์ฒซ ๋ฒˆ์งธ, ๋ฐฐํƒ€์  ์บ์‹œ ๊ณ„์ธต ๊ตฌ์กฐ์—์„œ STT-RAM์„ ํ™œ์šฉํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์ œ์•ˆํ•˜์˜€๋‹ค. ๋ฐฐํƒ€์  ์บ์‹œ ๊ณ„์ธต ๊ตฌ์กฐ๋Š” ๊ณ„์ธต ๊ฐ„์— ์ค‘๋ณต๋œ ๋ฐ์ดํ„ฐ๊ฐ€ ์—†๊ธฐ ๋•Œ๋ฌธ์— ํฌํ•จ์  ์บ์‹œ ๊ณ„์ธต ๊ตฌ์กฐ์™€ ๋น„๊ตํ•˜์—ฌ ๋” ํฐ ์œ ํšจ ์šฉ๋Ÿ‰์„ ๊ฐ–์ง€๋งŒ, ๋ฐฐํƒ€์  ์บ์‹œ ๊ณ„์ธต ๊ตฌ์กฐ์—์„œ๋Š” ์ƒ์œ„ ๋ ˆ๋ฒจ ์บ์‹œ์—์„œ ๋‚ด๋ณด๋‚ด์ง„ ๋ชจ๋“  ๋ฐ์ดํ„ฐ๋ฅผ ํ•˜์œ„ ๋ ˆ๋ฒจ ์บ์‹œ์— ์จ์•ผ ํ•˜๋ฏ€๋กœ ๋” ๋งŽ์€ ์–‘์˜ ๋ฐ์ดํ„ฐ๋ฅผ ์“ฐ๊ฒŒ ๋œ๋‹ค. ์ด๋Ÿฌํ•œ ๋ฐฐํƒ€์  ์บ์‹œ ๊ณ„์ธต ๊ตฌ์กฐ์˜ ํŠน์„ฑ์€ ์“ฐ๊ธฐ ํŠน์„ฑ์ด ๋‹จ์ ์ธ STT-RAM์„ ํ•จ๊ป˜ ํ™œ์šฉํ•˜๋Š” ๊ฒƒ์„ ์–ด๋ ต๊ฒŒ ํ•œ๋‹ค. ์ด๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ์žฌ์‚ฌ์šฉ ๊ฑฐ๋ฆฌ ์˜ˆ์ธก์„ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•˜๋Š” SRAM/STT-RAM ํ•˜์ด๋ธŒ๋ฆฌ๋“œ ์บ์‹œ ๊ตฌ์กฐ๋ฅผ ์„ค๊ณ„ํ•˜์˜€๋‹ค. ๋‘ ๋ฒˆ์งธ, ๋น„ํœ˜๋ฐœ์„ฑ STT-RAM์„ ์ด์šฉํ•ด ์บ์‹œ๋ฅผ ์„ค๊ณ„ํ•  ๋•Œ ๊ณ ๋ คํ•ด์•ผ ํ•  ์ ๋“ค์— ๋Œ€ํ•ด ๋ถ„์„ํ•˜์˜€๋‹ค. STT-RAM์˜ ๋น„ํšจ์œจ์ ์ธ ์“ฐ๊ธฐ ๋™์ž‘์„ ์ค„์ด๊ธฐ ์œ„ํ•ด ๋‹ค์–‘ํ•œ ํ•ด๊ฒฐ๋ฒ•๋“ค์ด ์ œ์•ˆ๋˜์—ˆ๋‹ค. ๊ทธ์ค‘ ํ•œ ๊ฐ€์ง€๋Š” STT-RAM ์†Œ์ž๊ฐ€ ๋ฐ์ดํ„ฐ๋ฅผ ์œ ์ง€ํ•˜๋Š” ์‹œ๊ฐ„์„ ์ค„์—ฌ (ํœ˜๋ฐœ์„ฑ STT-RAM) ์“ฐ๊ธฐ ํŠน์„ฑ์„ ํ–ฅ์ƒํ•˜๋Š” ๋ฐฉ๋ฒ•์ด๋‹ค. STT-RAM์— ์ €์žฅ๋œ ๋ฐ์ดํ„ฐ๋ฅผ ์žƒ๋Š” ๊ฒƒ์€ ํ™•๋ฅ ์ ์œผ๋กœ ๋ฐœ์ƒํ•˜๊ธฐ ๋•Œ๋ฌธ์— ์ €์žฅ๋œ ๋ฐ์ดํ„ฐ๋ฅผ ์•ˆ์ •์ ์œผ๋กœ ์œ ์ง€ํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š” ์˜ค๋ฅ˜ ์ •์ • ๋ถ€ํ˜ธ(ECC)๋ฅผ ์ด์šฉํ•ด ์ฃผ๊ธฐ์ ์œผ๋กœ ์˜ค๋ฅ˜๋ฅผ ์ •์ •ํ•ด์ฃผ์–ด์•ผ ํ•œ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” STT-RAM ๋ชจ๋ธ์„ ์ด์šฉํ•˜์—ฌ ํœ˜๋ฐœ์„ฑ STT-RAM ์„ค๊ณ„ ์š”์†Œ๋“ค์— ๋Œ€ํ•ด ๋ถ„์„ํ•˜์˜€๊ณ  ์‹คํ—˜์„ ํ†ตํ•ด ํ•ด๋‹น ์„ค๊ณ„ ์š”์†Œ๋“ค์ด ์บ์‹œ ์—๋„ˆ์ง€์™€ ์„ฑ๋Šฅ์— ์ฃผ๋Š” ์˜ํ–ฅ์„ ๋ณด์—ฌ์ฃผ์—ˆ๋‹ค. ๋งˆ์ง€๋ง‰์œผ๋กœ, ๋งค๋‹ˆ์ฝ”์–ด ์‹œ์Šคํ…œ์—์„œ์˜ ๋ถ„์‚ฐ ํ•˜์ด๋ธŒ๋ฆฌ๋“œ ์บ์‹œ ๊ตฌ์กฐ๋ฅผ ์„ค๊ณ„ํ•˜์˜€๋‹ค. ๋‹จ์ˆœํžˆ ๊ธฐ์กด์˜ ํ•˜์ด๋ธŒ๋ฆฌ๋“œ ์บ์‹œ์™€ ๋ถ„์‚ฐ์บ์‹œ๋ฅผ ๊ฒฐํ•ฉํ•˜๋ฉด ํ•˜์ด๋ธŒ๋ฆฌ๋“œ ์บ์‹œ์˜ ํšจ์œจ์„ฑ์— ํฐ ์˜ํ–ฅ์„ ์ฃผ๋Š” SRAM ํ™œ์šฉ๋„๊ฐ€ ๋‚ฎ์•„์ง„๋‹ค. ๋”ฐ๋ผ์„œ ๊ธฐ์กด์˜ ํ•˜์ด๋ธŒ๋ฆฌ๋“œ ์บ์‹œ ๊ตฌ์กฐ์—์„œ์˜ ์—๋„ˆ์ง€ ๊ฐ์†Œ๋ฅผ ๊ธฐ๋Œ€ํ•  ์ˆ˜ ์—†๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ๋ถ„์‚ฐ ํ•˜์ด๋ธŒ๋ฆฌ๋“œ ์บ์‹œ ๊ตฌ์กฐ์—์„œ SRAM ํ™œ์šฉ๋„๋ฅผ ๋†’์ผ ์ˆ˜ ์žˆ๋Š” ๋‘ ๊ฐ€์ง€ ์ตœ์ ํ™” ๊ธฐ์ˆ ์ธ ๋ฑ…ํฌ-๋‚ด๋ถ€ ์ตœ์ ํ™”์™€ ๋ฑ…ํฌ๊ฐ„ ์ตœ์ ํ™” ๊ธฐ์ˆ ์„ ์ œ์•ˆํ•˜์˜€๋‹ค. ๋ฑ…ํฌ-๋‚ด๋ถ€ ์ตœ์ ํ™”๋Š” highly-associative ์บ์‹œ๋ฅผ ํ™œ์šฉํ•˜์—ฌ ๋ฑ…ํฌ ๋‚ด๋ถ€์—์„œ ์“ฐ๊ธฐ ๋™์ž‘์ด ๋งŽ์€ ๋ฐ์ดํ„ฐ๋ฅผ ๋ถ„์‚ฐ์‹œํ‚ค๋Š” ๊ฒƒ์ด๊ณ  ๋ฑ…ํฌ๊ฐ„ ์ตœ์ ํ™”๋Š” ์„œ๋กœ ๋‹ค๋ฅธ ์บ์‹œ ๋ฑ…ํฌ์— ์“ฐ๊ธฐ ๋™์ž‘์ด ๋งŽ์€ ๋ฐ์ดํ„ฐ๋ฅผ ๊ณ ๋ฅด๊ฒŒ ๋ถ„์‚ฐ์‹œํ‚ค๋Š” ์ตœ์ ํ™” ๋ฐฉ๋ฒ•์ด๋‹ค.Over the last decade, the capacity of on-chip cache is continuously increased to mitigate the memory wall problem. However, SRAM, which is a dominant memory technology for caches, is not suitable for such a large cache because of its low density and large static power. One way to mitigate these downsides of the SRAM cache is replacing SRAM with a more efficient memory technology. Spin-Transfer Torque RAM (STT-RAM), one of the emerging memory technology, is a promising candidate for the alternative of SRAM. As a substitute of SRAM, STT-RAM can compensate drawbacks of SRAM with its non-volatility and small cell size. However, STT-RAM has poor write characteristics such as high write energy and long write latency and thus simply replacing SRAM to STT-RAM increases cache energy. To overcome those poor write characteristics of STT-RAM, this dissertation explores three different design techniques for energy-efficient cache using STT-RAM. The first part of the dissertation focuses on combining STT-RAM with exclusive cache hierarchy. Exclusive caches are known to provide higher effective cache capacity than inclusive caches by removing duplicated copies of cache blocks across hierarchies. However, in exclusive cache hierarchies, every block evicted from the upper-level cache is written back to the last-level cache regardless of its dirtiness thereby incurring extra write overhead. This makes it challenging to use STT-RAM for exclusive last-level caches due to its high write energy and long write latency. To mitigate this problem, we design an SRAM/STT-RAM hybrid cache architecture based on reuse distance prediction. The second part of the dissertation explores trade-offs in the design of volatile STT-RAM cache. Due to the inefficient write operation of STT-RAM, various solutions have been proposed to tackle this inefficiency. One of the proposed solutions is redesigning STT-RAM cell for better write characteristics at the cost of shortened retention time (i.e., volatile STT-RAM). Since the retention failure of STT-RAM has a stochastic property, an extra overhead of periodic scrubbing with error correcting code (ECC) is required to tolerate the failure. With an analysis based on analytic STT-RAM model, we have conducted extensive experiments on various volatile STT-RAM cache design parameters including scrubbing period, ECC strength, and target failure rate. The experimental results show the impact of the parameter variations on last-level cache energy and performance and provide a guideline for designing a volatile STT-RAM with ECC and scrubbing. The last part of the dissertation proposes Benzene, an energy-efficient distributed SRAM/STT-RAM hybrid cache architecture for manycore systems running multiple applications. It is based on the observation that a naive application of hybrid cache techniques to distributed caches in a manycore architecture suffers from limited energy reduction due to uneven utilization of scarce SRAM. We propose two-level optimization techniques: intra-bank and inter-bank. Intra-bank optimization leverages highly-associative cache design, achieving more uniform distribution of writes within a bank. Inter-bank optimization evenly balances the amount of write-intensive data across the banks.Abstract i Contents iii List of Figures vii List of Tables xi Chapter 1 Introduction 1 1.1 Exclusive Last-Level Hybrid Cache 2 1.2 Designing Volatile STT-RAM Cache 4 1.3 Distributed Hybrid Cache 5 Chapter 2 Background 9 2.1 STT-RAM 9 2.1.1 Thermal Stability 10 2.1.2 Read and Write Operation of STT-RAM 11 2.1.3 Failures of STT-RAM 11 2.1.4 Volatile STT-RAM 13 2.1.5 Related Work 14 2.2 Exclusive Last-Level Hybrid Cache 18 2.2.1 Cache Hierarchies 18 2.2.2 Related Work 19 2.3 Distributed Hybrid Cache 21 2.3.1 Prediction Hybrid Cache 21 2.3.2 Distributed Cache Partitioning 22 2.3.3 Related Work 23 Chapter 3 Exclusive Last-Level Hybrid Cache 27 3.1 Motivation 27 3.1.1 Exclusive Cache Hierarchy 27 3.1.2 Reuse Distance 29 3.2 Architecture 30 3.2.1 Reuse Distance Predictor 30 3.2.2 Hybrid Cache Architecture 32 3.3 Evaluation 34 3.3.1 Methodology 34 3.3.2 LLC Energy Consumption 35 3.3.3 Main Memory Energy Consumption 38 3.3.4 Performance 39 3.3.5 Area Overhead 39 3.4 Summary 39 Chapter 4 Designing Volatile STT-RAM Cache 41 4.1 Analysis 41 4.1.1 Retention Failure of a Volatile STT-RAM Cell 41 4.1.2 Memory Array Design 43 4.2 Evaluation 45 4.2.1 Methodology 45 4.2.2 Last-Level Cache Energy 46 4.2.3 Performance 51 4.3 Summary 52 Chapter 5 Distributed Hybrid Cache 55 5.1 Motivation 55 5.2 Architecture 58 5.2.1 Intra-Bank Optimization 59 5.2.2 Inter-Bank Optimization 63 5.2.3 Other Optimizations 67 5.3 Evaluation Methodology 69 5.4 Evaluation Results 73 5.4.1 Energy Consumption and Performance 73 5.4.2 Analysis of Intra-bank Optimization 76 5.4.3 Analysis of Inter-bank Optimization 78 5.4.4 Impact of Inter-Bank Optimization on Network Energy 79 5.4.5 Sensitivity Analysis 80 5.4.6 Implementation Overhead 81 5.5 Summary 82 Chapter 6 Conculsion 85 Bibliography 88 ์ดˆ๋ก 101Docto

    WCET Optimizations and Architectural Support for Hard Real-Time Systems

    Get PDF
    As time predictability is critical to hard real-time systems, it is not only necessary to accurately estimate the worst-case execution time (WCET) of the real-time tasks but also desirable to improve either the WCET of the tasks or time predictability of the system, because the real-time tasks with lower WCETs are easy to schedule and more likely to meat their deadlines. As a real-time system is an integration of software and hardware, the optimization can be achieved through two ways: software optimization and time-predictable architectural support. In terms of software optimization, we fi rst propose a loop-based instruction prefetching approach to further improve the WCET comparing with simple prefetching techniques such as Next-N-Line prefetching which can enhance both the average-case performance and the worst-case performance. Our prefetching approach can exploit the program controlow information to intelligently prefetch instructions that are most likely needed. Second, as inter-thread interferences in shared caches can signi cantly a ect the WCET of real-time tasks running on multicore processors, we study three multicore-aware code positioning methods to reduce the inter-core L2 cache interferences between co-running real-time threads. One strategy focuses on decreasing the longest WCET among the co-running threads, and two other methods aim at achieving fairness in terms of the amount or percentage of WCET reduction among co-running threads. In the aspect of time-predictable architectural support, we introduce the concept of architectural time predictability (ATP) to separate timing uncertainty concerns caused by hardware from software, which greatly facilitates the advancement of time-predictable processor design. We also propose a metric called Architectural Time-predictability Factor (ATF) to measure architectural time predictability quantitatively. Furthermore, while cache memories can generally improve average-case performance, they are harmful to time predictability and thus are not desirable for hard real-time and safety-critical systems. In contrast, Scratch-Pad Memories (SPMs) are time predictable, but they may lead to inferior performance. Guided by ATF, we propose and evaluate a variety of hybrid on-chip memory architectures to combine both caches and SPMs intelligently to achieve good time predictability and high performance. Detailed implementation and experimental results discussion are presented in this dissertation

    A low-power, high-performance speech recognition accelerator

    Get PDF
    ยฉ 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes,creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.Automatic Speech Recognition (ASR) is becoming increasingly ubiquitous, especially in the mobile segment. Fast and accurate ASR comes at high energy cost, not being affordable for the tiny power-budgeted mobile devices. Hardware acceleration reduces energy-consumption of ASR systems, while delivering high-performance. In this paper, we present an accelerator for largevocabulary, speaker-independent, continuous speech-recognition. It focuses on the Viterbi search algorithm representing the main bottleneck in an ASR system. The proposed design consists of innovative techniques to improve the memory subsystem, since memory is the main bottleneck for performance and power in these accelerators' design. It includes a prefetching scheme tailored to the needs of ASR systems that hides main memory latency for a large fraction of the memory accesses, negligibly impacting area. Additionally, we introduce a novel bandwidth-saving technique that removes off-chip memory accesses by 20 percent. Finally, we present a power saving technique that significantly reduces the leakage power of the accelerators scratchpad memories, providing between 8.5 and 29.2 percent reduction in entire power dissipation. Overall, the proposed design outperforms implementations running on the CPU by orders of magnitude, and achieves speedups between 1.7x and 5.9x for different speech decoders over a highly optimized CUDA implementation running on Geforce-GTX-980 GPU, while reducing the energy by 123-454x.Peer ReviewedPostprint (author's final draft

    Low Power Processor Architectures and Contemporary Techniques for Power Optimization โ€“ A Review

    Get PDF
    The technological evolution has increased the number of transistors for a given die area significantly and increased the switching speed from few MHz to GHz range. Such inversely proportional decline in size and boost in performance consequently demands shrinking of supply voltage and effective power dissipation in chips with millions of transistors. This has triggered substantial amount of research in power reduction techniques into almost every aspect of the chip and particularly the processor cores contained in the chip. This paper presents an overview of techniques for achieving the power efficiency mainly at the processor core level but also visits related domains such as buses and memories. There are various processor parameters and features such as supply voltage, clock frequency, cache and pipelining which can be optimized to reduce the power consumption of the processor. This paper discusses various ways in which these parameters can be optimized. Also, emerging power efficient processor architectures are overviewed and research activities are discussed which should help reader identify how these factors in a processor contribute to power consumption. Some of these concepts have been already established whereas others are still active research areas. ยฉ 2009 ACADEMY PUBLISHER

    Near Data Processing for Efficient and Trusted Systems

    Full text link
    We live in a world which constantly produces data at a rate which only increases with time. Conventional processor architectures fail to process this abundant data in an efficient manner as they expend significant energy in instruction processing and moving data over deep memory hierarchies. Furthermore, to process large amounts of data in a cost effective manner, there is increased demand for remote computation. While cloud service providers have come up with innovative solutions to cater to this increased demand, the security concerns users feel for their data remains a strong impediment to their wide scale adoption. An exciting technique in our repertoire to deal with these challenges is near-data processing. Near-data processing (NDP) is a data-centric paradigm which moves computation to where data resides. This dissertation exploits NDP to both process the data deluge we face efficiently and design low-overhead secure hardware designs. To this end, we first propose Compute Caches, a novel NDP technique. Simple augmentations to underlying SRAM design enable caches to perform commonly used operations. In-place computation in caches not only avoids excessive data movement over memory hierarchy, but also significantly reduces instruction processing energy as independent sub-units inside caches perform computation in parallel. Compute Caches significantly improve the performance and reduce energy expended for a suite of data intensive applications. Second, this dissertation identifies security advantages of NDP. While memory bus side channel has received much attention, a low-overhead hardware design which defends against it remains elusive. We observe that smart memory, memory with compute capability, can dramatically simplify this problem. To exploit this observation, we propose InvisiMem which uses the logic layer in the smart memory to implement cryptographic primitives, which aid in addressing memory bus side channel efficiently. Our solutions obviate the need for expensive constructs like Oblivious RAM (ORAM) and Merkle trees, and have one to two orders of magnitude lower overheads for performance, space, energy, and memory bandwidth, compared to prior solutions. This dissertation also addresses a related vulnerability of page fault side channel in which the Operating System (OS) induces page faults to learn application's address trace and deduces application secrets from it. To tackle it, we propose Sanctuary which obfuscates page fault channel while allowing the OS to manage memory as a resource. To do so, we design a novel construct, Oblivious Page Management (OPAM) which is derived from ORAM but is customized for page management context. We employ near-memory page moves to reduce OPAM overhead and also propose a novel memory partition to reduce OPAM transactions required. For a suite of cloud applications which process sensitive data we show that page fault channel can be tackled at reasonable overheads.PHDComputer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttps://deepblue.lib.umich.edu/bitstream/2027.42/144139/1/shaizeen_1.pd
    • โ€ฆ
    corecore