46 research outputs found

    Architecting Energy Efficient Servers.

    This dissertation investigates how energy-efficient servers can be architected using current and future technology. We leverage recent trends in packaging and device technology to deliver low power and high throughput. At the package level, this dissertation looks at 3D stacking technology, which has emerged as a promising way to achieve energy efficiency by delivering high throughput at low cost, and shows how this technology could be leveraged in a datacenter. 3D stacking technology can be used to implement a simple, low-power, high-performance chip multiprocessor suitable for throughput processing. Our proposed architecture, PicoServer, employs 3D technology to bond one die containing several simple, slow processing cores to multiple DRAM memory dies sufficient for a primary memory. 3D stacking technology also enables wide, low-latency buses between processors and memory. These remove the need for an L2 cache, allowing its area to be reallocated to additional simple cores. The additional cores allow the clock frequency to be lowered without impairing throughput. Lower clock frequency, along with the integration of non-volatile memory, in turn reduces power and means that thermal constraints, a concern with 3D stacking, are easily satisfied. The PicoServer architecture targets server applications, which exhibit a high degree of thread-level parallelism. An architecture targeted at efficient throughput is ideal for this application domain. At the memory device level, this dissertation investigates how the system memory could be re-architected to reduce the rising power consumption of system memory and disk drives. Flash memory has emerged as a strong candidate to reduce system memory power while remaining more cost-effective than conventional system memory. This dissertation discusses how Flash could be integrated at the system level and provides insights on the architectural support for Flash in servers. Our architecture uses a two-level disk cache composed of a relatively small DRAM, which includes a primary disk cache, and a Flash-based secondary disk cache. Further, based on our observations, we found that the Flash-based disk cache should be split into a read-optimized disk cache and a write-optimized disk cache.
    Ph.D., Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies
    http://deepblue.lib.umich.edu/bitstream/2027.42/57602/2/tkgil_1.pd
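    The two-level disk cache in the abstract above can be illustrated with a small sketch. This is not the dissertation's implementation: the class names, the LRU replacement policy, the rule that dirty blocks evicted from DRAM go to the Flash write region while clean ones go to the Flash read region, and the dict standing in for the disk are all assumptions made for illustration.

```python
from collections import OrderedDict

_MISSING = object()  # sentinel so that cached values may be None


class LRU:
    """Tiny LRU map; put() returns the evicted (key, value) pair, if any."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return _MISSING
        self.items.move_to_end(key)
        return self.items[key]

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            return self.items.popitem(last=False)
        return None

    def discard(self, key):
        self.items.pop(key, None)


class TwoLevelDiskCache:
    """DRAM primary cache in front of a Flash cache split into read and write regions."""

    def __init__(self, disk, dram_blocks, flash_read_blocks, flash_write_blocks):
        self.disk = disk                      # hypothetical backing store (dict-like)
        self.dram = LRU(dram_blocks)
        self.flash_read = LRU(flash_read_blocks)
        self.flash_write = LRU(flash_write_blocks)
        self.dirty = set()                    # blocks modified while resident in DRAM

    def _fill_dram(self, block, value):
        evicted = self.dram.put(block, value)
        if evicted is None:
            return
        old_block, old_value = evicted
        if old_block in self.dirty:
            # Assumed policy: dirty data is buffered in the write region.
            self.dirty.discard(old_block)
            self.flash_read.discard(old_block)    # drop any stale clean copy
            flushed = self.flash_write.put(old_block, old_value)
            if flushed is not None:
                self.disk[flushed[0]] = flushed[1]  # write back to disk
        else:
            self.flash_read.put(old_block, old_value)

    def read(self, block):
        # The write region is checked before the read region: it holds newer data.
        for level in (self.dram, self.flash_write, self.flash_read):
            value = level.get(block)
            if value is not _MISSING:
                if level is not self.dram:
                    self._fill_dram(block, value)
                return value
        value = self.disk[block]
        self._fill_dram(block, value)
        return value

    def write(self, block, value):
        self._fill_dram(block, value)
        self.dirty.add(block)
```

    In this sketch a write reaches the disk only once it has been evicted from both DRAM and the Flash write region, which is one way a separate write region could absorb write traffic.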

    Centaur: Host-Side SSD Caching for Storage Performance Control


    Runtime Systems for Persistent Memories

    Emerging persistent memory (PM) technologies promise the performance of DRAM with the durability of disk. However, several challenges remain in existing hardware, programming, and software systems that inhibit wide-scale PM adoption. This thesis focuses on building efficient mechanisms that span hardware, operating systems, and programming languages for integrating PMs in future systems. First, this thesis proposes a mechanism to solve the low-endurance problem in PMs. PMs suffer from limited write endurance: PM cells can be written only 10^7-10^9 times before they wear out. Without any wear management, PM lifetime might be as low as 1.1 months. This thesis presents Kevlar, an OS-based wear-management technique for PM that requires no new hardware. Kevlar uses existing virtual memory mechanisms to remap pages, enabling it to perform both wear leveling (shuffling pages in PM to even out wear) and wear reduction (transparently migrating heavily written pages to DRAM). Crucially, Kevlar avoids the need for hardware support to track wear at fine grain. It relies on a novel wear-estimation technique that builds upon Intel's Precise Event Based Sampling to approximately track processor cache contents via a software-maintained Bloom filter and estimate write-back rates at fine grain. Second, this thesis proposes a persistency model for high-level languages to enable integration of PMs into future programming systems. Prior works extend language memory models with a persistency model prescribing semantics for updates to PM. These approaches require high-overhead mechanisms, are restricted to certain synchronization constructs, provide incomplete semantics, and/or may recover to a state that cannot arise in fault-free program execution. This thesis argues for persistency semantics that guarantee failure atomicity of synchronization-free regions (SFRs), that is, program regions delimited by synchronization operations. The proposed approach provides clear semantics for the PM state that recovery code may observe and extends C++11's "sequential consistency for data-race-free" guarantee to post-failure recovery code. To this end, this thesis investigates two designs for failure-atomic SFRs that vary in performance and the degree to which commit of persistent state may lag execution. Finally, this thesis proposes StrandWeaver, a hardware persistency model that minimally constrains ordering on PM operations. Several language-level persistency models have emerged recently to aid programming recoverable data structures in PM. The language-level persistency models are built upon hardware primitives that impose stricter ordering constraints on PM operations than the persistency models require. StrandWeaver manages PM order within a strand, a logically independent sequence of PM operations within a thread. PM operations that lie on separate strands are unordered and may drain concurrently to PM. StrandWeaver implements primitives under strand persistency to allow programmers to improve concurrency and relax ordering constraints on updates as they drain to PM. Furthermore, StrandWeaver proposes mechanisms that map persistency semantics in high-level language persistency models to the primitives implemented by StrandWeaver.
    Ph.D., Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies
    https://deepblue.lib.umich.edu/bitstream/2027.42/155100/1/vgogte_1.pd
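    The abstract above mentions estimating write-back rates with a software-maintained Bloom filter. The sketch below shows one way such an estimator could look; it is not Kevlar's actual algorithm. The assumptions are: the input is a list of sampled store addresses (standing in for PEBS samples), a store to a cache line not yet in the filter is counted as one eventual write-back for its page, and the filter is cleared every `epoch` samples as a crude model of cache eviction. All names and constants are hypothetical.

```python
import hashlib
from collections import Counter

LINE_BYTES = 64     # assumed cache-line size
PAGE_BYTES = 4096   # assumed page size


class BloomFilter:
    """Plain Bloom filter: no false negatives, rare false positives."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting a single hash function.
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(f"{i}:{key}".encode(), digest_size=8).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

    def clear(self):
        self.bits = bytearray(len(self.bits))


def estimate_writebacks(sampled_store_addrs, epoch=1000):
    """Return a Counter mapping page number to estimated write-backs."""
    dirty_lines = BloomFilter()   # lines assumed to be dirty in the cache
    per_page = Counter()
    for n, addr in enumerate(sampled_store_addrs, 1):
        line = addr // LINE_BYTES
        if line not in dirty_lines:
            # A newly dirtied line must eventually be written back to memory.
            dirty_lines.add(line)
            per_page[addr // PAGE_BYTES] += 1
        if n % epoch == 0:
            dirty_lines.clear()   # assumed: cache contents turn over each epoch
    return per_page
```

    Repeated stores to a line that is already marked dirty add nothing to the estimate, which reflects the idea that the cache absorbs them. A false positive in the filter would cause an undercount, so the estimate is approximate by construction.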

    ADDING PERSISTENCE TO MAIN MEMORY PROGRAMMING

    Unlocking the true potential of the new persistent memories (PMEMs) requires eliminating traditional persistent I/O abstractions altogether by introducing persistent semantics directly into main memory programming. Such a programming model elevates failure atomicity to a first-class application property in addition to in-memory data layout, concurrency control, and fault tolerance, and therefore requires a redesign of programming abstractions for both program correctness and maximum performance gains. To address these challenges, this thesis proposes a set of system software designs that integrate persistence with main memory programming, and makes the following contributions. First, this thesis proposes a PMEM-aware I/O runtime, NVStream, that supports fast durable streaming I/O. NVStream uses a memory-based I/O interface that integrates with the existing I/O data movement operations of an application to accelerate persistent data writes. NVStream carefully designs its persistent data storage layout and crash-consistent semantics to match both application and PMEM characteristics. Specifically, we leverage the streaming nature of I/O in HPC workflows to benefit from a log-structured PMEM storage engine design that uses relaxed write orderings and append-only failure-atomic semantics to form strongly consistent application checkpoints. Furthermore, we identify that optimizing the I/O software stack exposes the PMEM bandwidth limitations as a bottleneck during parallel HPC I/O writes, and propose a novel data movement design, PHX. PHX uses alternative network data movement paths available in datacenters to ease the bandwidth pressure on the PMEM memory interconnects, all while maintaining the correctness of the persistent data. Next, the thesis explores the challenges and opportunities of using PMEM for true main memory persistent programming: a single data domain for both runtime and persistent application state. Such a programming model includes maintaining ACID properties during each and every update to an application's persistent structures. ACID-qualified persistent programming for multi-threaded applications is hard, as the programmer has to reason about both crash-consistency and synchronization (crash-sync) semantics for programming correctness. The thesis contributes new understanding of the correctness requirements for mixing different crash-consistent and synchronization protocols, characterizes the performance of different crash-sync realizations for different applications and hardware architectures, and draws actionable insights for future designs of PMEM systems. Finally, the application state stored on node-local persistent memory is still vulnerable to catastrophic node failures. The thesis proposes a replicated persistent memory runtime, Blizzard, that supports truly fault-tolerant, concurrent, and persistent data-structure programming. Blizzard carefully integrates userspace networking with byte-addressable PMEM for a fast persistent memory replication runtime. The design also incorporates a replication-aware crash-sync protocol that supports consistent and concurrent updates on persistent data structures. Blizzard offers applications the flexibility to use the data structures that best match their functional requirements, while offering better performance and providing crucial reliability guarantees lacking from existing persistent memory runtimes.
    Ph.D.

    Triple-L: Improving CPS Disk I/O Performance in a Virtualized NAS Environment

    Network-attached storage (NAS) provides cyberphysical systems (CPS), such as mobile virtual desktops based on cloud infrastructure, with scalable, efficient, and reliable backing storage. Within this storage architecture, virtual machine (VM) instances running in the NAS client usually receive data from the complex physical world and then persist them in the neat cyberspace in the NAS server. In this paper, we propose Triple-L to improve VM disk I/O performance in the NAS architecture. According to the specific storage semantics, Triple-L decouples the VM image file into several subfiles at the host layer and then selectively moves them into the NAS clients. In this way, a VM disk I/O request may be processed locally in the NAS client instead of repeatedly traversing the external network path between NAS server and client. We have implemented Triple-L in a Xen-based NAS system. An accessory solution for dealing with storage failure and VM live migration on Triple-L is also discussed and evaluated. The experimental results show that our work can effectively improve the disk I/O performance of VMs, while adding moderate overhead for VM live migration.

    TACKLING PERFORMANCE AND SECURITY ISSUES FOR CLOUD STORAGE SYSTEMS

    Building data-intensive applications and emerging computing paradigms (e.g., Machine Learning (ML), Artificial Intelligence (AI), Internet of Things (IoT)) in cloud computing environments is becoming the norm, given the many advantages in scalability, reliability, security, and performance. However, under rapid changes in applications, system middleware, and underlying storage devices, service providers are facing new challenges in delivering performance and security isolation in the context of resources shared among multiple tenants. The gap between the decades-old storage abstraction and modern storage devices keeps widening, calling for software/hardware co-designs to achieve more effective performance and security protocols. This dissertation rethinks the storage subsystem from device level to system level and proposes new designs at different levels to tackle performance and security issues for cloud storage systems. In the first part, we present an event-based SSD (Solid State Drive) simulator that models modern protocols, firmware, and storage backend in detail. The proposed simulator can capture the nuances of SSD internal states under various I/O workloads, which helps researchers understand the impact of various SSD designs and workload characteristics on end-to-end performance. In the second part, we study the security challenges of shared in-storage computing infrastructures. Many cloud providers offer isolation at multiple levels to secure data and instances; however, security measures in emerging in-storage computing infrastructures have not been studied. We first investigate the attacks that could be conducted by offloaded in-storage programs in a multi-tenant cloud environment. To defend against these attacks, we build a lightweight Trusted Execution Environment, IceClave, to enable security isolation between in-storage programs and internal flash management functions. We show that while enforcing security isolation in the SSD controller with minimal hardware cost, IceClave still keeps the performance benefit of in-storage computing by delivering up to 2.4x better performance than the conventional host-based trusted computing approach. In the third part, we investigate the performance interference problem caused by other tenants' I/O flows. We demonstrate that I/O resource sharing can often lead to performance degradation and instability. The block device abstraction fails to expose SSD parallelism and to pass on application requirements. To this end, we propose a software/hardware co-design to enforce performance isolation by bridging the semantic gap. Our design can significantly improve QoS (Quality of Service) by reducing throughput penalties and tail latency spikes. Lastly, we explore more effective I/O control to address contention in the storage software stack. We illustrate that the state-of-the-art resource control mechanism, Linux cgroups, is insufficient for controlling I/O resources. Inappropriate cgroup configurations may even hurt the performance of co-located workloads in memory-intensive scenarios. We add kernel support for limiting page cache usage per cgroup and achieving I/O proportionality.

    Reliability of SSD Storage Systems

    Solid-state drives (SSDs) are attractive storage components; however, concerns about their reliability remain and delay their wider deployment. Many protection schemes have been proposed to improve the reliability of SSDs. For example, techniques such as error correction codes (ECC), log-like writing in the flash translation layer (FTL), garbage collection, and wear leveling improve the reliability of SSDs at the device level. Composing an array of SSDs and employing system-level parity protection is one of the popular protection schemes at the system level. Enterprise-class (high-end) SSDs are faster and more resilient than client-class (low-end) SSDs, but they are too expensive to be deployed in large-scale storage systems. An attractive and practical alternative is to use high-end SSDs as a cache and low-end SSDs as main storage. A high-end SSD cache placed in front of a low-end SSD array both improves latency and reduces the write count of the SSD storage system. This work analyzes the effectiveness of protection schemes originally designed for HDDs but applied to SSD storage systems. We find that the different characteristics of HDDs and SSDs make integration of those solutions in SSD storage systems not so straightforward. This work first analyzes the effectiveness of device-level protection schemes such as ECC and scrubbing. A Markov-model-based analysis of the protection schemes is presented. Our model considers the time-varying nature of the reliability of flash memory as well as the write amplification of various device-level protection schemes. Our study shows that write amplification from these various sources can significantly affect the benefits of protection schemes in improving lifetime. Based on the results of our analysis, we propose that bit errors within an SSD page be left uncorrected until a threshold number of errors has accumulated. We show that such an approach can improve lifetimes by up to 40%. This work also analyzes the effectiveness of parity protection over SSD arrays, a widely used protection scheme for SSD arrays at the system level. Parity protection is typically employed to compose reliable storage systems. However, careful consideration is required when SSD-based systems employ parity protection. Additional writes are required for parity updates. Also, parity consumes space on the device, which results in write amplification from less efficient garbage collection at higher space utilization. We present a Markov model to estimate the lifetime of SSD-based RAID systems in different environments. In a small array, our results show that parity protection provides benefit only with considerably low space utilization and low data access rates. However, in a large system, RAID improves data lifetime even when we take write amplification into account. This work also explores how to optimize a mixed SSD array in terms of performance and lifetime. We show that simple integration of different classes of SSDs in traditional caching policies results in poor reliability. We also reveal that caching policies with static workload classifiers are not always efficient. We propose a sampling-based adaptive approach that achieves fair workload distribution across the cache and the storage. The proposed algorithm enables fine-grained control of the workload distribution, which minimizes latency over lifetime of mixed SSD arrays. We show that our adaptive algorithm is very effective in improving the latency-over-lifetime metric, on average by up to 2.36 times over LRU, across a number of workloads.
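    The abstract above notes that parity updates and garbage collection both add writes. A back-of-the-envelope calculation shows why this matters for wear-out. This is not the Markov model from the work itself; it is a simple endurance estimate, and the function name, the factor of two for parity updates on small writes, the even spreading of writes across devices, and all example numbers are assumptions.

```python
def wearout_lifetime_years(capacity_gb, pe_cycles, host_write_gb_per_day,
                           n_devices, parity=False, gc_write_amp=1.0):
    """Rough per-device wear-out estimate for an SSD array.

    Assumptions (illustrative only):
      * writes are spread evenly over all n_devices;
      * with parity, each small host write causes one data write plus one
        parity write, so device writes double;
      * gc_write_amp is the garbage-collection write amplification factor,
        supplied by the caller (it grows with space utilization).
    """
    writes_per_host_write = 2.0 if parity else 1.0
    per_device_gb_per_day = (host_write_gb_per_day * writes_per_host_write
                             * gc_write_amp / n_devices)
    endurance_gb = capacity_gb * pe_cycles   # total data a device can absorb
    return endurance_gb / per_device_gb_per_day / 365.0


# Hypothetical array: four 100 GB devices rated for 3650 program/erase cycles,
# absorbing 100 GB of host writes per day.
no_parity = wearout_lifetime_years(100, 3650, 100, n_devices=4)
with_parity = wearout_lifetime_years(100, 3650, 100, n_devices=4, parity=True)
parity_high_util = wearout_lifetime_years(100, 3650, 100, n_devices=4,
                                          parity=True, gc_write_amp=2.0)
```

    Under these assumptions parity halves the wear-out lifetime, and a garbage-collection amplification of 2 halves it again. The estimate covers wear only; it says nothing about the protection against data loss that parity provides, which is the trade-off the abstract describes.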

    Architecting heterogeneous memory systems with 3D die-stacked memory

    The main objective of this research is to efficiently enable 3D die-stacked memory and heterogeneous memory systems. 3D die-stacking is an emerging technology that allows for large amounts of in-package high-bandwidth memory storage. Die-stacked memory has the potential to provide extraordinary performance and energy benefits for computing environments, from data-intensive to mobile computing. However, incorporating die-stacked memory into computing environments requires innovations across the system stack, from hardware to software. This dissertation presents several architectural innovations to practically deploy die-stacked memory in a variety of computing systems. First, this dissertation proposes using die-stacked DRAM as a hardware-managed cache in a practical and efficient way. The proposed DRAM cache architecture employs two novel techniques: hit-miss speculation and self-balancing dispatch. The proposed techniques virtually eliminate the hardware overhead of maintaining a multi-megabyte SRAM structure when scaling to gigabytes of stacked DRAM cache, and improve overall memory bandwidth utilization. Second, this dissertation proposes a DRAM cache organization that provides a high level of reliability for die-stacked DRAM caches in a cost-effective manner. The proposed DRAM cache uses error-correcting codes (ECCs), strong checksums (CRCs), and dirty data duplication to detect and correct a wide range of stacked DRAM failures, from traditional bit errors to large-scale row, column, bank, and channel failures, within the constraints of commodity, non-ECC DRAM stacks. With only a modest performance degradation compared to a DRAM cache with no ECC support, the proposed organization can correct all single-bit failures and 99.9993% of all row, column, and bank failures. Third, this dissertation proposes architectural mechanisms to use large, fast, on-chip memory structures as part of memory (PoM) seamlessly through the hardware. The proposed design achieves the performance benefit of on-chip memory caches without sacrificing a large fraction of total memory capacity to serve as a cache. To achieve this, PoM implements the ability to dynamically remap regions of memory based on their access patterns and expected performance benefits. Lastly, this dissertation explores a new usage model for die-stacked DRAM involving a hybrid of caching and virtual memory support. In the common case where the system's physical memory is not over-committed, die-stacked DRAM operates as a cache to provide performance and energy benefits to the system. However, when the workload's active memory demands exceed the capacity of the physical memory, the proposed scheme dynamically converts the stacked DRAM cache into a fast swap device to avoid the otherwise grievous performance penalty of swapping to disk.
    Ph.D.
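    The idea of remapping memory regions by access pattern, as in the PoM design above, can be sketched as a toy remap table. This is not the dissertation's hardware mechanism: the segment granularity, the per-segment access counters, the fixed promotion threshold, and the swap-with-coldest-segment rule are all assumptions chosen to keep the example short.

```python
class SegmentRemapper:
    """Toy remap table: logical segments map to fast or slow physical segments.

    Physical segments numbered below fast_segments are treated as fast
    (for example, stacked) memory. A swap exchanges two mappings, so no
    capacity is lost to caching: every logical segment always has exactly
    one physical home.
    """

    def __init__(self, fast_segments, total_segments, threshold=4):
        self.fast_segments = fast_segments
        self.remap = list(range(total_segments))   # logical -> physical
        self.counts = [0] * total_segments         # accesses per logical segment
        self.threshold = threshold                 # assumed promotion threshold

    def is_fast(self, logical):
        return self.remap[logical] < self.fast_segments

    def access(self, logical):
        """Record an access and return the physical segment that serves it."""
        self.counts[logical] += 1
        if not self.is_fast(logical) and self.counts[logical] >= self.threshold:
            self._promote(logical)
        return self.remap[logical]

    def _promote(self, hot):
        # Victim: the least-accessed logical segment currently in fast memory.
        in_fast = [seg for seg in range(len(self.remap)) if self.is_fast(seg)]
        victim = min(in_fast, key=lambda seg: self.counts[seg])
        if self.counts[victim] >= self.counts[hot]:
            return   # swapping would not be expected to help
        self.remap[hot], self.remap[victim] = self.remap[victim], self.remap[hot]


remapper = SegmentRemapper(fast_segments=2, total_segments=4, threshold=3)
for _ in range(3):
    remapper.access(3)   # segment 3 starts in slow memory and becomes hot
```

    After the third access, logical segment 3 is served from fast memory and the coldest fast segment has moved to slow memory in its place. A real design would also have to account for the cost of moving the data, which this sketch ignores.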