34 research outputs found

    Enabling Hyperscale Web Services

    Full text link
    Modern web services such as social media, online messaging, web search, video streaming, and online banking often support billions of users, requiring data centers that scale to hundreds of thousands of servers, i.e., hyperscale. In fact, the world continues to expect hyperscale computing to drive more futuristic applications such as virtual reality, self-driving cars, conversational AI, and the Internet of Things. This dissertation presents technologies that will enable tomorrow’s web services to meet the world’s expectations. The key challenge in enabling hyperscale web services arises from two important trends. First, over the past few years, there has been a radical shift in hyperscale computing due to an unprecedented growth in data, users, and web service software functionality. Second, modern hardware can no longer support this growth in hyperscale trends due to a decline in hardware performance scaling. To enable this new hyperscale era, hardware architects must become more aware of hyperscale software needs and software researchers can no longer expect unlimited hardware performance scaling. In short, systems researchers can no longer follow the traditional approach of building each layer of the systems stack separately. Instead, they must rethink the synergy between the software and hardware worlds from the ground up. This dissertation establishes such a synergy to enable futuristic hyperscale web services. This dissertation bridges the software and hardware worlds, demonstrating the importance of that bridge in realizing efficient hyperscale web services via solutions that span the systems stack. The specific goal is to design software that is aware of new hardware constraints and architect hardware that efficiently supports new hyperscale software requirements. This dissertation spans two broad thrusts: (1) a software and (2) a hardware thrust to analyze the complex hyperscale design space and use insights from these analyses to design efficient cross-stack solutions for hyperscale computation. In the software thrust, this dissertation contributes uSuite, the first open-source benchmark suite of web services built with a new hyperscale software paradigm, that is used in academia and industry to study hyperscale behaviors. Next, this dissertation uses uSuite to study software threading implications in light of today’s hardware reality, identifying new insights in the age-old research area of software threading. Driven by these insights, this dissertation demonstrates how threading models must be redesigned at hyperscale by presenting an automated approach and tool, uTune, that makes intelligent run-time threading decisions. In the hardware thrust, this dissertation architects both commodity and custom hardware to efficiently support hyperscale software requirements. First, this dissertation characterizes commodity hardware’s shortcomings, revealing insights that influenced commercial CPU designs. Based on these insights, this dissertation presents an approach and tool, SoftSKU, that enables cheap commodity hardware to efficiently support new hyperscale software paradigms, improving the efficiency of real-world web services that serve billions of users, saving millions of dollars, and meaningfully reducing the global carbon footprint. This dissertation also presents a hardware-software co-design, uNotify, that redesigns commodity hardware with minimal modifications by using existing hardware mechanisms more intelligently to overcome new hyperscale overheads. Next, this dissertation characterizes how custom hardware must be designed at hyperscale, resulting in industry-academia benchmarking efforts, commercial hardware changes, and improved software development. Based on this characterization’s insights, this dissertation presents Accelerometer, an analytical model that estimates gains from hardware customization. Multiple hyperscale enterprises and hardware vendors use Accelerometer to make well-informed hardware decisions.PHDComputer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/169802/1/akshitha_1.pd

    Micro-architectural analysis of in-memory OLTP: Revisited

    Get PDF
    Micro-architectural behavior of traditional disk-based online transaction processing (OLTP) systems has been investigated extensively over the past couple of decades. Results show that traditional OLTP systems mostly under-utilize the available micro-architectural resources. In-memory OLTP systems, on the other hand, process all the data in main-memory and, therefore, can omit the buffer pool. Furthermore, they usually adopt more lightweight concurrency control mechanisms, cache-conscious data structures, and cleaner codebases since they are usually designed from scratch. Hence, we expect significant differences in micro-architectural behavior when running OLTP on platforms optimized for in-memory processing as opposed to disk-based database systems. In particular, we expect that in-memory systems exploit micro-architectural features such as instruction and data caches significantly better than disk-based systems. This paper sheds light on the micro-architectural behavior of in-memory database systems by analyzing and contrasting it to the behavior of disk-based systems when running OLTP workloads. The results show that, despite all the design changes, in-memory OLTP exhibits very similar micro-architectural behavior to disk-based OLTP: more than half of the execution time goes to memory stalls where instruction cache misses or the long-latency data misses from the last-level cache (LLC) are the dominant factors in the overall execution time. Even though ground-up designed in-memory systems can eliminate the instruction cache misses, the reduction in instruction stalls amplifies the impact of LLC data misses. As a result, only 30% of the CPU cycles are used to retire instructions, and 70% of the CPU cycles are wasted to stalls for both traditional disk-based and new generation in-memory OLTP

    Energy Measurements of High Performance Computing Systems: From Instrumentation to Analysis

    Get PDF
    Energy efficiency is a major criterion for computing in general and High Performance Computing in particular. When optimizing for energy efficiency, it is essential to measure the underlying metric: energy consumption. To fully leverage energy measurements, their quality needs to be well-understood. To that end, this thesis provides a rigorous evaluation of various energy measurement techniques. I demonstrate how the deliberate selection of instrumentation points, sensors, and analog processing schemes can enhance the temporal and spatial resolution while preserving a well-known accuracy. Further, I evaluate a scalable energy measurement solution for production HPC systems and address its shortcomings. Such high-resolution and large-scale measurements present challenges regarding the management of large volumes of generated metric data. I address these challenges with a scalable infrastructure for collecting, storing, and analyzing metric data. With this infrastructure, I also introduce a novel persistent storage scheme for metric time series data, which allows efficient queries for aggregate timelines. To ensure that it satisfies the demanding requirements for scalable power measurements, I conduct an extensive performance evaluation and describe a productive deployment of the infrastructure. Finally, I describe different approaches and practical examples of analyses based on energy measurement data. In particular, I focus on the combination of energy measurements and application performance traces. However, interweaving fine-grained power recordings and application events requires accurately synchronized timestamps on both sides. To overcome this obstacle, I develop a resilient and automated technique for time synchronization, which utilizes crosscorrelation of a specifically influenced power measurement signal. Ultimately, this careful combination of sophisticated energy measurements and application performance traces yields a detailed insight into application and system energy efficiency at full-scale HPC systems and down to millisecond-range regions.:1 Introduction 2 Background and Related Work 2.1 Basic Concepts of Energy Measurements 2.1.1 Basics of Metrology 2.1.2 Measuring Voltage, Current, and Power 2.1.3 Measurement Signal Conditioning and Analog-to-Digital Conversion 2.2 Power Measurements for Computing Systems 2.2.1 Measuring Compute Nodes using External Power Meters 2.2.2 Custom Solutions for Measuring Compute Node Power 2.2.3 Measurement Solutions of System Integrators 2.2.4 CPU Energy Counters 2.2.5 Using Models to Determine Energy Consumption 2.3 Processing of Power Measurement Data 2.3.1 Time Series Databases 2.3.2 Data Center Monitoring Systems 2.4 Influences on the Energy Consumption of Computing Systems 2.4.1 Processor Power Consumption Breakdown 2.4.2 Energy-Efficient Hardware Configuration 2.5 HPC Performance and Energy Analysis 2.5.1 Performance Analysis Techniques 2.5.2 HPC Performance Analysis Tools 2.5.3 Combining Application and Power Measurements 2.6 Conclusion 3 Evaluating and Improving Energy Measurements 3.1 Description of the Systems Under Test 3.2 Instrumentation Points and Measurement Sensors 3.2.1 Analog Measurement at Voltage Regulators 3.2.2 Instrumentation with Hall Effect Transducers 3.2.3 Modular Instrumentation of DC Consumers 3.2.4 Optimal Wiring for Shunt-Based Measurements 3.2.5 Node-Level Instrumentation for HPC Systems 3.3 Analog Signal Conditioning and Analog-to-Digital Conversion 3.3.1 Signal Amplification 3.3.2 Analog Filtering and Analog-To-Digital Conversion 3.3.3 Integrated Solutions for High-Resolution Measurement 3.4 Accuracy Evaluation and Calibration 3.4.1 Synthetic Workloads for Evaluating Power Measurements 3.4.2 Improving and Evaluating the Accuracy of a Single-Node Measuring System 3.4.3 Absolute Accuracy Evaluation of a Many-Node Measuring System 3.5 Evaluating Temporal Granularity and Energy Correctness 3.5.1 Measurement Signal Bandwidth at Different Instrumentation Points 3.5.2 Retaining Energy Correctness During Digital Processing 3.6 Evaluating CPU Energy Counters 3.6.1 Energy Readouts with RAPL 3.6.2 Methodology 3.6.3 RAPL on Intel Sandy Bridge-EP 3.6.4 RAPL on Intel Haswell-EP and Skylake-SP 3.7 Conclusion 4 A Scalable Infrastructure for Processing Power Measurement Data 4.1 Requirements for Power Measurement Data Processing 4.2 Concepts and Implementation of Measurement Data Management 4.2.1 Message-Based Communication between Agents 4.2.2 Protocols 4.2.3 Application Programming Interfaces 4.2.4 Efficient Metric Time Series Storage and Retrieval 4.2.5 Hierarchical Timeline Aggregation 4.3 Performance Evaluation 4.3.1 Benchmark Hardware Specifications 4.3.2 Throughput in Symmetric Configuration with Replication 4.3.3 Throughput with Many Data Sources and Single Consumers 4.3.4 Temporary Storage in Message Queues 4.3.5 Persistent Metric Time Series Request Performance 4.3.6 Performance Comparison with Contemporary Time Series Storage Solutions 4.3.7 Practical Usage of MetricQ 4.4 Conclusion 5 Energy Efficiency Analysis 5.1 General Energy Efficiency Analysis Scenarios 5.1.1 Live Visualization of Power Measurements 5.1.2 Visualization of Long-Term Measurements 5.1.3 Integration in Application Performance Traces 5.1.4 Graphical Analysis of Application Power Traces 5.2 Correlating Power Measurements with Application Events 5.2.1 Challenges for Time Synchronization of Power Measurements 5.2.2 Reliable Automatic Time Synchronization with Correlation Sequences 5.2.3 Creating a Correlation Signal on a Power Measurement Channel 5.2.4 Processing the Correlation Signal and Measured Power Values 5.2.5 Common Oversampling of the Correlation Signals at Different Rates 5.2.6 Evaluation of Correlation and Time Synchronization 5.3 Use Cases for Application Power Traces 5.3.1 Analyzing Complex Power Anomalies 5.3.2 Quantifying C-State Transitions 5.3.3 Measuring the Dynamic Power Consumption of HPC Applications 5.4 Conclusion 6 Summary and Outloo

    STT-RAM을 이용한 에너지 효율적인 캐시 설계 기술

    Get PDF
    학위논문 (박사)-- 서울대학교 대학원 : 공과대학 전기·컴퓨터공학부, 2019. 2. 최기영.지난 수십 년간 '메모리 벽' 문제를 해결하기 위해 온 칩 캐시의 크기는 꾸준히 증가해왔다. 하지만 지금까지 캐시에 주로 사용되어 온 메모리 기술인 SRAM은 낮은 집적도와 높은 대기 전력 소모로 인해 큰 캐시를 구성하는 데에는 적합하지 않다. 이러한 SRAM의 단점을 보완하기 위해 더 높은 집적도와 낮은 대기 전력을 소모하는 새로운 메모리 기술인 STT-RAM으로 SRAM을 대체하는 것이 제안되었다. 하지만 STT-RAM은 데이터를 쓸 때 많은 에너지와 시간을 소비하기 때문에 단순히 SRAM을 STT-RAM으로 대체하는 것은 오히려 캐시 에너지 소비를 증가시킨다. 이러한 문제를 해결하기 위해 본 논문에서는 STT-RAM을 이용한 에너지 효율적인 캐시 설계 기술들을 제안한다. 첫 번째, 배타적 캐시 계층 구조에서 STT-RAM을 활용하는 방법을 제안하였다. 배타적 캐시 계층 구조는 계층 간에 중복된 데이터가 없기 때문에 포함적 캐시 계층 구조와 비교하여 더 큰 유효 용량을 갖지만, 배타적 캐시 계층 구조에서는 상위 레벨 캐시에서 내보내진 모든 데이터를 하위 레벨 캐시에 써야 하므로 더 많은 양의 데이터를 쓰게 된다. 이러한 배타적 캐시 계층 구조의 특성은 쓰기 특성이 단점인 STT-RAM을 함께 활용하는 것을 어렵게 한다. 이를 해결하기 위해 본 논문에서는 재사용 거리 예측을 기반으로 하는 SRAM/STT-RAM 하이브리드 캐시 구조를 설계하였다. 두 번째, 비휘발성 STT-RAM을 이용해 캐시를 설계할 때 고려해야 할 점들에 대해 분석하였다. STT-RAM의 비효율적인 쓰기 동작을 줄이기 위해 다양한 해결법들이 제안되었다. 그중 한 가지는 STT-RAM 소자가 데이터를 유지하는 시간을 줄여 (휘발성 STT-RAM) 쓰기 특성을 향상하는 방법이다. STT-RAM에 저장된 데이터를 잃는 것은 확률적으로 발생하기 때문에 저장된 데이터를 안정적으로 유지하기 위해서는 오류 정정 부호(ECC)를 이용해 주기적으로 오류를 정정해주어야 한다. 본 논문에서는 STT-RAM 모델을 이용하여 휘발성 STT-RAM 설계 요소들에 대해 분석하였고 실험을 통해 해당 설계 요소들이 캐시 에너지와 성능에 주는 영향을 보여주었다. 마지막으로, 매니코어 시스템에서의 분산 하이브리드 캐시 구조를 설계하였다. 단순히 기존의 하이브리드 캐시와 분산캐시를 결합하면 하이브리드 캐시의 효율성에 큰 영향을 주는 SRAM 활용도가 낮아진다. 따라서 기존의 하이브리드 캐시 구조에서의 에너지 감소를 기대할 수 없다. 본 논문에서는 분산 하이브리드 캐시 구조에서 SRAM 활용도를 높일 수 있는 두 가지 최적화 기술인 뱅크-내부 최적화와 뱅크간 최적화 기술을 제안하였다. 뱅크-내부 최적화는 highly-associative 캐시를 활용하여 뱅크 내부에서 쓰기 동작이 많은 데이터를 분산시키는 것이고 뱅크간 최적화는 서로 다른 캐시 뱅크에 쓰기 동작이 많은 데이터를 고르게 분산시키는 최적화 방법이다.Over the last decade, the capacity of on-chip cache is continuously increased to mitigate the memory wall problem. However, SRAM, which is a dominant memory technology for caches, is not suitable for such a large cache because of its low density and large static power. One way to mitigate these downsides of the SRAM cache is replacing SRAM with a more efficient memory technology. Spin-Transfer Torque RAM (STT-RAM), one of the emerging memory technology, is a promising candidate for the alternative of SRAM. As a substitute of SRAM, STT-RAM can compensate drawbacks of SRAM with its non-volatility and small cell size. However, STT-RAM has poor write characteristics such as high write energy and long write latency and thus simply replacing SRAM to STT-RAM increases cache energy. To overcome those poor write characteristics of STT-RAM, this dissertation explores three different design techniques for energy-efficient cache using STT-RAM. The first part of the dissertation focuses on combining STT-RAM with exclusive cache hierarchy. Exclusive caches are known to provide higher effective cache capacity than inclusive caches by removing duplicated copies of cache blocks across hierarchies. However, in exclusive cache hierarchies, every block evicted from the upper-level cache is written back to the last-level cache regardless of its dirtiness thereby incurring extra write overhead. This makes it challenging to use STT-RAM for exclusive last-level caches due to its high write energy and long write latency. To mitigate this problem, we design an SRAM/STT-RAM hybrid cache architecture based on reuse distance prediction. The second part of the dissertation explores trade-offs in the design of volatile STT-RAM cache. Due to the inefficient write operation of STT-RAM, various solutions have been proposed to tackle this inefficiency. One of the proposed solutions is redesigning STT-RAM cell for better write characteristics at the cost of shortened retention time (i.e., volatile STT-RAM). Since the retention failure of STT-RAM has a stochastic property, an extra overhead of periodic scrubbing with error correcting code (ECC) is required to tolerate the failure. With an analysis based on analytic STT-RAM model, we have conducted extensive experiments on various volatile STT-RAM cache design parameters including scrubbing period, ECC strength, and target failure rate. The experimental results show the impact of the parameter variations on last-level cache energy and performance and provide a guideline for designing a volatile STT-RAM with ECC and scrubbing. The last part of the dissertation proposes Benzene, an energy-efficient distributed SRAM/STT-RAM hybrid cache architecture for manycore systems running multiple applications. It is based on the observation that a naive application of hybrid cache techniques to distributed caches in a manycore architecture suffers from limited energy reduction due to uneven utilization of scarce SRAM. We propose two-level optimization techniques: intra-bank and inter-bank. Intra-bank optimization leverages highly-associative cache design, achieving more uniform distribution of writes within a bank. Inter-bank optimization evenly balances the amount of write-intensive data across the banks.Abstract i Contents iii List of Figures vii List of Tables xi Chapter 1 Introduction 1 1.1 Exclusive Last-Level Hybrid Cache 2 1.2 Designing Volatile STT-RAM Cache 4 1.3 Distributed Hybrid Cache 5 Chapter 2 Background 9 2.1 STT-RAM 9 2.1.1 Thermal Stability 10 2.1.2 Read and Write Operation of STT-RAM 11 2.1.3 Failures of STT-RAM 11 2.1.4 Volatile STT-RAM 13 2.1.5 Related Work 14 2.2 Exclusive Last-Level Hybrid Cache 18 2.2.1 Cache Hierarchies 18 2.2.2 Related Work 19 2.3 Distributed Hybrid Cache 21 2.3.1 Prediction Hybrid Cache 21 2.3.2 Distributed Cache Partitioning 22 2.3.3 Related Work 23 Chapter 3 Exclusive Last-Level Hybrid Cache 27 3.1 Motivation 27 3.1.1 Exclusive Cache Hierarchy 27 3.1.2 Reuse Distance 29 3.2 Architecture 30 3.2.1 Reuse Distance Predictor 30 3.2.2 Hybrid Cache Architecture 32 3.3 Evaluation 34 3.3.1 Methodology 34 3.3.2 LLC Energy Consumption 35 3.3.3 Main Memory Energy Consumption 38 3.3.4 Performance 39 3.3.5 Area Overhead 39 3.4 Summary 39 Chapter 4 Designing Volatile STT-RAM Cache 41 4.1 Analysis 41 4.1.1 Retention Failure of a Volatile STT-RAM Cell 41 4.1.2 Memory Array Design 43 4.2 Evaluation 45 4.2.1 Methodology 45 4.2.2 Last-Level Cache Energy 46 4.2.3 Performance 51 4.3 Summary 52 Chapter 5 Distributed Hybrid Cache 55 5.1 Motivation 55 5.2 Architecture 58 5.2.1 Intra-Bank Optimization 59 5.2.2 Inter-Bank Optimization 63 5.2.3 Other Optimizations 67 5.3 Evaluation Methodology 69 5.4 Evaluation Results 73 5.4.1 Energy Consumption and Performance 73 5.4.2 Analysis of Intra-bank Optimization 76 5.4.3 Analysis of Inter-bank Optimization 78 5.4.4 Impact of Inter-Bank Optimization on Network Energy 79 5.4.5 Sensitivity Analysis 80 5.4.6 Implementation Overhead 81 5.5 Summary 82 Chapter 6 Conculsion 85 Bibliography 88 초록 101Docto

    Architecting Secure Processor Caches

    Get PDF
    Caches in modern processors enable fast access to data and help alleviate the performance overheads from slow access to DRAM main-memory. While sharing of cache resources between multiple cores, especially the last-level cache, boosts cache utilization and improves system performance, it has been shown to cause serious security vulnerabilities in the form cache side-channel attacks. Different cores of a system can simultaneously run sensitive and malicious applications which can contend for the shared cache space. As a result, accesses of a sensitive application can influence the cache utilization and the execution time of a malicious application, introducing a side-channel of information leakage. Such cache interactions between a sensitive victim and a malicious spy have been shown to allow leakage of encryption keys, user-sensitive data such as files or browsing histories, confidential intellectual property such as machine-learning models, etc. Similarly, such cache interactions can also be used as a channel for covert communication be- tween two colluding malicious applications, when direct communication via network ports is disabled. The focus of this thesis is to develop principled and practical mitigation for such cache side channel and covert channel attacks. To develop principled defenses, it is necessary to develop a deep understanding of attacks. So, first, this thesis investigates the capabilities of attackers and in the process develops a new cache covert channel attack called Streamline, which is considerably faster than current state-of-the-art attacks, with fewer requirements. With an asynchronous and flushless information transmission protocol, Streamline reaches bit-rates of more than 1 MB/s while being applicable to all ISAs and micro-architectures. This demonstrates the need for effective defenses against cache attacks across all platforms. Second, this thesis develops new principled and practical defenses utilizing cache lo- cation randomization. Randomized caches obfuscate the mappings of addresses to cache locations to prevent malicious programs from inferring contention patterns on shared last- level caches with victim programs. However, successive defenses relying on randomization have been broken by recent attacks. To end the arms race in randomized caches, this thesis proposes a principled defense, MIRAGE, which provides the security of a fully-associative design in a practical manner for randomized caches. This eliminates set-conflicts and set- conflict based cache attacks in a future-proof manner. Third, this thesis explores cache-partitioning based defenses to eliminate all potential cache side channels through shared last-level caches. Such defenses map mistrusting applications to isolated cache partitions, thus preventing any information leakage across applications through cache state changes. However, existing solutions are not scalable or do not allow flexible usage of DRAM and cache resources. To address these problems, this thesis provides a scalable and flexible cache-isolation framework, Bespoke Cache Enclaves, supporting hundreds of partitions independent of memory utilization. This work enables practical adoption of cache-isolation defenses against cache side-channel attacks. Lastly, this thesis develops techniques to secure caches against exploitation in transient execution attacks. Attacks like Spectre and Meltdown exploit processor speculation to illegally access secrets and leak these out through cache covert channels, i.e., making transient changes to processor caches. This thesis enables CleanupSpec, one of the first defenses against such attacks, which reverses speculative modifications to caches on mis- speculations, to limit such transient information leakage via caches. This solution prevents caches from being exploited by attacks like Spectre with minimal overheads. Overall, this thesis enables several techniques that provide principled yet practical security for processor caches against side channels and covert channels. These techniques can potentially enable the wide adoption of secure cache designs in future processors and support efforts to enable confidential computing in systems.Ph.D
    corecore