24 research outputs found

    Towards Design and Analysis For High-Performance and Reliable SSDs

    NAND Flash-based Solid State Disks (SSDs) have many attractive technical merits, such as low power consumption, light weight, shock resistance, tolerance of hotter operating regimes, and extraordinarily high performance for random read access, which has made SSDs immensely popular and widely employed in environments ranging from portable devices and personal computers to large data centers and distributed data systems. However, current SSDs still suffer from several critical inherent limitations, such as the inability to update data in place, asymmetric read and write performance, slow garbage collection, limited endurance, and degraded write performance with the adoption of MLC and TLC techniques. To alleviate these limitations, we propose optimizations both at the layer of outside applications and within SSDs' internal layer. Because SSDs offer a good compromise between performance and price, they are widely deployed as second-level caches between DRAM and hard disks to boost system performance. Owing to the special properties of SSDs, such as internal garbage collection and limited lifetime, optimizations designed for traditional cache devices like DRAM and SRAM may not work consistently for SSD-based caches. Therefore, at the applications layer, our work focuses on integrating the special properties of SSDs into the optimization of SSD caches. Moreover, our work also alleviates the increased Flash write latency and ECC complexity brought by MLC and TLC technologies by analyzing real-world workloads.
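
    As one way to picture "integrating the special properties of SSDs into cache optimization", the sketch below shows an admission filter that only writes a block to the SSD cache after it has missed a few times, trading some hit rate for fewer SSD writes and thus less garbage collection and wear. The class, thresholds, and interface are illustrative assumptions, not the dissertation's actual design.

```python
# Hypothetical sketch of an SSD-cache admission filter; all names and
# thresholds are assumptions for illustration only.
from collections import OrderedDict

class SSDCacheAdmission:
    def __init__(self, capacity, admit_threshold=2):
        self.capacity = capacity
        self.admit_threshold = admit_threshold
        self.cache = OrderedDict()   # block -> data, kept in LRU order
        self.miss_counts = {}        # block -> misses observed so far
        self.ssd_writes = 0          # proxy for wear / garbage-collection pressure

    def access(self, block):
        if block in self.cache:
            self.cache.move_to_end(block)   # refresh LRU position on a hit
            return "hit"
        # Miss: admit only blocks that have missed repeatedly, so cold,
        # single-use data never costs an SSD write.
        self.miss_counts[block] = self.miss_counts.get(block, 0) + 1
        if self.miss_counts[block] >= self.admit_threshold:
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)   # evict the LRU block
            self.cache[block] = True
            self.ssd_writes += 1
        return "miss"
```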

    Drowsy cache partitioning for reduced static and dynamic energy in the cache hierarchy

    Power consumption in computing today has led the industry towards energy-efficient computing. As transistor technology shrinks, new techniques have to be developed to keep leakage current, the dominant portion of overall power consumption, to a minimum. Because a large share of a processor's transistors is devoted to the cache hierarchy, the cache provides an excellent avenue to dramatically reduce power usage. The inherent danger is that power-saving techniques can negatively affect the primary reason for including the cache in the first place: performance. This thesis work proposes a modification to the cache hierarchy that dramatically saves power with only a slight reduction in performance. By taking advantage of the overwhelming preference of memory accesses for the most recently used blocks, these blocks are placed into a small, fast-access A partition. The rest of the cache is put into a drowsy mode, a state-preserving technique that reduces leakage power within the remaining portion of the cache. This design was implemented within a private, second-level cache and achieved an average of almost 20% dynamic energy savings and nearly 45% leakage energy savings, while incurring an average performance penalty of only 2%.
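
    A minimal behavioral sketch of the idea (not the thesis design): most recently used blocks live in a small awake partition A with fast access, while everything else sits in a drowsy, state-preserving region that must be woken, at a small latency cost, before it can be read. Partition sizes and latencies below are assumed for illustration.

```python
# Illustrative drowsy-partitioned cache model; sizes and latencies are assumed.
from collections import OrderedDict

A_SIZE, DROWSY_SIZE = 16, 240    # blocks in the awake and drowsy regions (assumed)
A_LATENCY, WAKE_PENALTY = 1, 2   # cycles (assumed); memory miss latency not modeled

class DrowsyPartitionedCache:
    def __init__(self):
        self.partition_a = OrderedDict()   # awake MRU blocks, LRU order
        self.drowsy = OrderedDict()        # low-leakage, data-retaining region

    def access(self, block):
        if block in self.partition_a:      # fast path: awake partition hit
            self.partition_a.move_to_end(block)
            return A_LATENCY
        latency = A_LATENCY
        if block in self.drowsy:           # drowsy hit: wake the line first
            del self.drowsy[block]
            latency += WAKE_PENALTY
        self.partition_a[block] = True     # install/promote as MRU
        if len(self.partition_a) > A_SIZE: # demote A's LRU block into the drowsy region
            victim, _ = self.partition_a.popitem(last=False)
            self.drowsy[victim] = True
            if len(self.drowsy) > DROWSY_SIZE:
                self.drowsy.popitem(last=False)
        return latency
```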

    Pervasive Data Access in Wireless and Mobile Computing Environments

    The rapid advance of wireless and portable computing technology has brought a great deal of research interest and momentum to the area of mobile computing. One of the research focuses is pervasive data access: with wireless connections, users can access information at any place and at any time. However, various constraints such as limited client capability, limited bandwidth, weak connectivity, and client mobility impose many challenging technical issues. In the past years, tremendous research effort has been put forth to address these issues, and a number of interesting research results have been reported in the literature. This survey paper reviews important work along two important dimensions of pervasive data access: data broadcast and client caching. In addition, data access techniques aimed at various application requirements (such as time, location, semantics and reliability) are covered.
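
    To make the data broadcast dimension concrete, here is an illustrative broadcast-disks style schedule (names and parameters assumed, not taken from the survey): hot items repeat several times within one broadcast cycle while cold items appear once, so average client wait time for popular data drops.

```python
# Illustrative broadcast-cycle builder; item names and repeat counts are assumed.
def build_broadcast_cycle(hot_items, cold_items, hot_repeats=3):
    """Interleave hot items hot_repeats times per cycle with the cold items once."""
    cycle = []
    cold_chunks = [cold_items[i::hot_repeats] for i in range(hot_repeats)]
    for chunk in cold_chunks:
        cycle.extend(hot_items)   # hot items repeat in every minor cycle
        cycle.extend(chunk)       # one slice of the cold items
    return cycle

cycle = build_broadcast_cycle(hot_items=["stock:A", "stock:B"],
                              cold_items=[f"page:{i}" for i in range(9)])
print(cycle)   # hot items appear 3x per cycle, each cold item once
```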

    Multi-Gigabyte On-Chip DRAM Caches for Servers

    While DRAM latency has long been recognized as a major bottleneck in servers, DRAM bandwidth is emerging as an important bottleneck as server processors shift to many-core architectures to allow for sustainable throughput improvements. The rapid expansion of the digital universe, increasingly stored in memory, also pushes the need for higher DRAM density. Emerging die-stacked DRAM technology dramatically improves the three major DRAM properties: latency, bandwidth and density. Recent advancements in die-stacking technology have made it possible to integrate a sizeable amount of DRAM directly on top of the processor. While feasible on-chip DRAM capacities are insufficient to satisfy the memory needs of modern servers, architecting on-chip DRAM as a high-capacity, low-latency, high-bandwidth cache has the potential to significantly reduce both off-chip memory traffic and average memory access latency. We make the observation that high-capacity on-chip DRAM caches expose abundant spatial locality in server applications along with a modest amount of temporal data reuse. As a consequence, DRAM caches that manage and fetch data at a coarser granularity, e.g., in 2KB pages, exhibit overall superior properties compared to caches that do fine-grain management using 64B blocks: substantially higher hit rates, smaller tag storage, higher energy efficiency and higher set-associativity. Unfortunately, naive employment of page-based caches results in excessive data overfetch and capacity waste, as some of the fetched and allocated blocks are never accessed prior to their eviction. We demonstrate that if the cache is organized in pages, then page footprints -- i.e., the sets of blocks that are touched while a page is in the cache -- are highly predictable using well-established code-correlation techniques. Accurately predicting access patterns within a page can eliminate most of the bandwidth overhead and capacity waste that page-based caches suffer from.
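
    As a rough sketch of code-correlated footprint prediction (structures and index choice assumed, not the paper's exact design): the predictor is indexed by the PC of the instruction that triggers the page fetch together with the first-touched block offset, and it stores a bit vector of the 64B blocks that were touched the last time this code path brought a page in, so the next fetch can bring only those blocks.

```python
# Illustrative footprint predictor; table organization and indexing are assumed.
BLOCKS_PER_PAGE = 32   # 2KB page / 64B blocks

class FootprintPredictor:
    def __init__(self):
        self.table = {}   # (trigger_pc, first_block) -> footprint bit vector

    def predict(self, trigger_pc, first_block):
        default = 1 << first_block   # with no history, fetch at least the missing block
        return self.table.get((trigger_pc, first_block), default)

    def train(self, trigger_pc, first_block, touched_blocks):
        bits = 0
        for b in touched_blocks:     # record which blocks were actually used before eviction
            bits |= 1 << b
        self.table[(trigger_pc, first_block)] = bits

fp = FootprintPredictor()
fp.train(trigger_pc=0x400a10, first_block=3, touched_blocks=[3, 4, 5, 12])
print(bin(fp.predict(0x400a10, 3)))   # only the previously touched blocks are fetched
```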

    Next-Generation Smart Cars: Towards a More Intelligent Interactive Infotainment System

    Today, in a world of automation, the impact of Artificial Intelligence can be seen in every aspect of our lives. From smart homes to self-driving cars, everything is run using intelligent, adaptive technologies. In this thesis, an attempt is made to analyze the correlation between driving quality and the use of the car infotainment system, and vice versa, and hence driver distraction. Various internal and external driving factors have been identified to understand the dependency and the seriousness of driver distraction caused by the car infotainment system. We have seen a number of UI/UX changes and speech recognition advancements in cars to reduce distraction, but reducing the number of casualties on the road remains a persistent problem, as the cognitive load on the driver is considered one of the primary causes of the distractions that lead to casualties. In this research, a pathway is provided towards building an artificially intelligent, adaptive and interactive infotainment system that is trained to behave differently by analyzing driving quality without the intervention of the driver. The aim is not only to shift the driver's focus from the screen to the street view, but also to change the inherent behavior of the infotainment system based on the driving statistics at that point in time, without the need for driver intervention.
    Masters Thesis, Software Engineering

    Fault- and Yield-Aware On-Chip Memory Design and Management

    Ever-decreasing device sizes cause more frequent hard faults, which have become a serious burden for processor design and yield management. This problem is particularly pronounced in on-chip memory, which consumes up to 70% of a processor's total chip area. Traditional circuit-level techniques, such as redundancy and error correction codes, become less effective in error-prevalent environments because of their large area overhead. In this work, we suggest an architectural solution to building reliable on-chip memory in the future processor environment. Our approach has two parts: a design framework and architectural techniques for on-chip memory structures. The design framework provides important architectural evaluation metrics, such as yield, area, and performance, based on low-level defect and process-variation parameters, so that processor architects can quickly evaluate their designs in these terms. With the framework, we develop architectural yield-enhancement solutions for on-chip memory structures including the L1 cache, L2 cache and directory memory. Our proposed solutions greatly improve yield with negligible area and performance overhead. Furthermore, we develop a decoupled yield model of compute cores and L2 caches in CMPs, which shows that there will be many more L2 caches than compute cores on a chip, and we propose efficient techniques for utilizing the excess caches. Evaluation results show that excess caches significantly improve the overall performance of CMPs.
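
    For intuition about how a yield framework relates defect parameters to yield, here is a hedged illustration using the standard Poisson yield model (a textbook model, not the thesis's own framework): the yield of an array is exp(-A * D0) for area A and defect density D0, and a single spare row lets the array tolerate exactly one defect.

```python
# Standard Poisson yield model, shown only to illustrate the kind of metric a
# yield framework computes; numbers are example inputs, not results from the work.
import math

def yield_no_repair(area_mm2, defect_density_per_mm2):
    lam = area_mm2 * defect_density_per_mm2      # expected number of defects
    return math.exp(-lam)

def yield_one_spare(area_mm2, defect_density_per_mm2):
    # Survive with zero defects, OR with exactly one defect repaired by the spare.
    lam = area_mm2 * defect_density_per_mm2
    return math.exp(-lam) * (1 + lam)

print(round(yield_no_repair(50, 0.01), 2))   # ~0.61 for a 50 mm^2 array
print(round(yield_one_spare(50, 0.01), 2))   # ~0.91 once one defect is repairable
```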

    Declarative Querying For Biological Sequences.

    Life science research labs today manage increasing volumes of sequence data. Much of the data management and querying today is accomplished procedurally using Perl, Python, or Java programs that integrate data from different sources and query tools. The dangers of this procedural approach are well known to the database community: a) severe limitations on the ability to rapidly express queries, and b) inefficient query plans due to the lack of sophisticated optimization tools. This situation is likely to get worse with advances in high-throughput technologies that make it easier to quickly produce vast amounts of sequence data. The need for a declarative and efficient system to manage and query biological sequence data is urgent. To address this need, we designed the Periscope/SQ system. Periscope/SQ extends current relational systems to enable sophisticated queries on sequence data and can optimize and execute these queries efficiently. This thesis describes the problems that need to be solved to build the Periscope/SQ system. First, we describe the algebraic framework which forms the backbone of Periscope/SQ. Second, we describe algorithms to construct large-scale suffix tree indexes for efficiently answering sequence queries. Third, we describe techniques for selectivity estimation and optimization in the context of queries over biological sequences. Next, we demonstrate how some of the techniques developed for Periscope/SQ can be applied to produce a powerful mining algorithm that we call FLAME. Finally, we describe GeneFinder, a biological application built on top of Periscope/SQ that is currently being used to predict the targets of transcription factors. Today, genomic and proteomic sequences are the most abundantly available source of high-quality biological data. By making it possible to declaratively and efficiently query vast amounts of sequence data, Periscope/SQ opens the door to vast improvements in the pace of bioinformatics research.
    Ph.D., Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies
    http://deepblue.lib.umich.edu/bitstream/2027.42/55670/2/tatas_1.pd
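
    As a small stand-in for the indexing layer (a naive suffix array rather than the suffix trees Periscope/SQ actually builds), the sketch below shows why such an index helps: after a one-time build, "where does this motif occur?" queries are answered by binary search instead of a linear scan. All names and the example sequence are assumptions.

```python
# Illustrative suffix-array index; Periscope/SQ uses suffix trees, this is only an analogy.
from bisect import bisect_left, bisect_right   # key= requires Python 3.10+

def build_suffix_array(seq):
    # O(n^2 log n) naive construction, fine for a small illustration.
    return sorted(range(len(seq)), key=lambda i: seq[i:])

def find_occurrences(seq, sa, pattern):
    # All suffixes starting with `pattern` are contiguous in the sorted order.
    key = lambda i: seq[i:i + len(pattern)]
    lo = bisect_left(sa, pattern, key=key)
    hi = bisect_right(sa, pattern, key=key)
    return sorted(sa[lo:hi])   # start positions of every match

genome = "ACGTACGTTAGACGT"
sa = build_suffix_array(genome)
print(find_occurrences(genome, sa, "ACGT"))   # [0, 4, 11]
```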

    FPGA-Augmented Secure Crash-Consistent Non-Volatile Memory

    Emerging byte-addressable Non-Volatile Memory (NVM) technology, although promising superior memory density and ultra-low energy consumption, poses unique challenges to achieving persistent data privacy and computing security, both of which are critically important to embedded and IoT applications. Specifically, to successfully restore NVMs to their working states after unexpected system crashes or power failures, maintaining and recovering all the necessary security-related metadata can severely increase memory traffic, degrade runtime performance, exacerbate the write endurance problem, and demand costly hardware changes to off-the-shelf processors. In this thesis, we summarize and expand upon two of our innovative works, ARES and HERMES, to design a new FPGA-assisted, processor-transparent security mechanism aiming to efficiently and effectively achieve all three aspects of the security triad (confidentiality, integrity, and recoverability) in modern embedded computing. Given the growing prominence of CPU-FPGA heterogeneous computing architectures, ARES leverages the FPGA's hardware reconfigurability to offload performance-critical, security-related functions to the programmable hardware without the microprocessor's involvement. In particular, recognizing that the traditional Merkle tree caching scheme cannot fully exploit the FPGA's parallelism due to its sequential and recursive function calls, ARES proposes a new Merkle tree cache architecture and a novel Merkle tree scheme which flatten and reorganize the computation in the traditional Merkle tree verification and update processes to fully exploit the parallel cache ports and to fully pipeline the time-consuming hashing operations. To further optimize the throughput of Bonsai Merkle Tree (BMT) operations, HERMES proposes an optimally efficient dataflow architecture that processes multiple outstanding counter requests simultaneously. Specifically, HERMES explores and addresses three technical challenges in exploiting the task-level parallelism of BMT operations and proposes a speculative execution approach with both low latency and high throughput.
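
    For readers unfamiliar with the baseline being accelerated, here is a compact sketch of plain Merkle-tree verification over memory blocks. It is sequential and purely illustrative; ARES and HERMES restructure exactly this kind of computation so it can be parallelized and pipelined in FPGA hardware.

```python
# Plain (sequential) Merkle-tree build and per-block verification, for illustration only.
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_tree(leaves):
    """Return a list of levels: leaf hashes first, the single root hash last."""
    levels = [[h(b) for b in leaves]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([h(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels

def verify_block(levels, index, block):
    """Recompute the path from one leaf up to the root and compare with the stored root."""
    node = h(block)
    for level in levels[:-1]:
        sibling = level[index ^ 1]   # sibling hash at this level
        node = h(node + sibling) if index % 2 == 0 else h(sibling + node)
        index //= 2
    return node == levels[-1][0]

blocks = [bytes([i]) * 64 for i in range(8)]   # eight 64-byte memory blocks
tree = build_tree(blocks)
print(verify_block(tree, 3, blocks[3]))        # True: block matches the root
print(verify_block(tree, 3, b"tampered" * 8))  # False: tampering is detected
```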

    Improving processor efficiency by exploiting common-case behaviors of memory instructions

    Processor efficiency can be described with the help of a number of desirable effects or metrics, for example performance, power, area, design complexity and access latency. These metrics serve as valuable tools in designing new processors and also act as effective standards for comparing current processors. Various factors impact the efficiency of modern out-of-order processors, and one important factor is the manner in which instructions are processed through the processor pipeline. In this dissertation research, we study the impact of load and store instructions (collectively known as memory instructions) on processor efficiency, and show how to improve efficiency by exploiting common-case or predictable patterns in the behavior of memory instructions. The memory behavior patterns that we focus on are the predictability of memory dependences, the predictability of data forwarding patterns, the predictability of instruction criticality, and conservativeness in resource allocation and deallocation policies. We first design a scalable and high-performance memory dependence predictor and then apply accurate memory dependence prediction to improve the efficiency of the fetch engine of a simultaneous multi-threaded processor. We then use predictable data forwarding patterns to eliminate power-hungry hardware in the processor with no loss in performance. We then turn to instruction criticality: we study the behavior of critical load instructions and propose applications that can be optimized using predictable load-criticality information. Finally, we explore conventional techniques for the allocation and deallocation of critical structures that process memory instructions and propose new techniques to optimize them. Our new designs have the potential to significantly reduce the power and area required by processors without losing performance, leading to more efficient processor designs.
    Ph.D. Committee Chair: Loh, Gabriel H.; Committee Members: Clark, Nathan; Jaleel, Aamer; Kim, Hyesoon; Lee, Hsien-Hsin S.; Prvulovic, Milo
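
    To illustrate what a memory dependence predictor does, here is a hedged sketch in the spirit of the well-known store-set approach (a standard baseline, not necessarily the predictor designed in this dissertation): loads and stores that have conflicted before are mapped to the same store-set ID, and a load is made to wait only on the last in-flight store of its set.

```python
# Store-set style memory dependence predictor sketch; structures are simplified
# dictionaries rather than the fixed-size hardware tables a real design would use.
class StoreSetPredictor:
    def __init__(self):
        self.ssit = {}     # PC -> store-set ID (Store Set ID Table)
        self.lfst = {}     # store-set ID -> tag of last fetched store (LFST)
        self.next_id = 0

    def on_violation(self, load_pc, store_pc):
        """Train: the load executed before an older store to the same address."""
        ssid = self.ssit.get(store_pc, self.ssit.get(load_pc))
        if ssid is None:
            ssid, self.next_id = self.next_id, self.next_id + 1
        self.ssit[load_pc] = self.ssit[store_pc] = ssid

    def store_dispatched(self, store_pc, store_tag):
        ssid = self.ssit.get(store_pc)
        if ssid is not None:
            self.lfst[ssid] = store_tag     # remember the newest store of this set

    def load_depends_on(self, load_pc):
        """Predict: return the store tag this load should wait for, or None."""
        ssid = self.ssit.get(load_pc)
        return self.lfst.get(ssid) if ssid is not None else None
```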

    Reducing Cache Contention On GPUs

    The usage of Graphics Processing Units (GPUs) as application accelerators has become increasingly popular because, compared to traditional CPUs, they are more cost-effective, their highly parallel nature complements a CPU, and they are more energy efficient. With the popularity of GPUs, many GPU-based compute-intensive applications (a.k.a. GPGPU applications) show significant performance improvements over traditional CPU-based implementations. Caches, which significantly improve CPU performance, have been introduced to GPUs to further enhance application performance. However, in many cases the effect of caches on GPUs is not significant, and in some cases it is even detrimental. The massive parallelism of the GPU execution model and the resulting memory accesses cause the GPU memory hierarchy to suffer from significant memory resource contention among threads. One cause of cache contention is the column-strided memory access patterns that many data-intensive GPU applications generate. When such access patterns are mapped to hardware thread groups, they become memory-divergent instructions whose memory requests are not GPU-hardware friendly, resulting in serialized accesses and performance degradation. Cache contention also arises from cache pollution caused by lines with low reuse. For the cache to be effective, a cached line must be reused before its eviction. Unfortunately, the streaming characteristics of GPGPU workloads and the massively parallel GPU execution model increase the reuse distance, or equivalently reduce the reuse frequency, of data. In a GPU, the pollution caused by data with large reuse distances is significant. Memory request stalls are another contention factor: a stalled Load/Store (LDST) unit does not execute memory requests from any ready warp in the issue stage, which prevents potential hit chances for those ready warps. This dissertation proposes three novel architectural modifications to reduce the contention: 1) contention-aware selective caching detects the memory-divergent instructions caused by column-strided access patterns, computes the contending cache sets and locality information, and then caches selectively; 2) locality-aware selective caching dynamically computes reuse frequency with efficient hardware and caches based on that reuse frequency; and 3) memory request scheduling queues memory requests from the warp issue stage, relieving LDST unit stalls, and schedules items from the queue to the LDST unit by probing the cache multiple times. Through systematic experiments and comprehensive comparisons with existing state-of-the-art techniques, this dissertation demonstrates the effectiveness of the aforementioned techniques and the viability of reducing cache contention through architectural support. Finally, this dissertation suggests other promising opportunities for future research on GPU architecture.
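
    As a toy illustration of the divergence detection behind contention-aware selective caching (thresholds, line size, and interface are assumptions, not the dissertation's parameters): a warp memory instruction whose lanes touch many distinct cache lines is treated as memory-divergent, e.g. a column-strided access, and made to bypass the L1 so it cannot thrash the few sets it maps to.

```python
# Illustrative bypass decision for memory-divergent warp accesses; all constants assumed.
LINE_SIZE = 128            # bytes per GPU cache line (assumed)
DIVERGENCE_THRESHOLD = 8   # distinct lines per warp before we bypass (assumed)

def should_bypass(lane_addresses):
    """lane_addresses: the per-thread byte addresses issued by one warp instruction."""
    distinct_lines = {addr // LINE_SIZE for addr in lane_addresses}
    return len(distinct_lines) > DIVERGENCE_THRESHOLD

# Coalesced, row-major access: all 32 lanes fall into a single line -> cache it.
row_access = [0x1000 + 4 * lane for lane in range(32)]
# Column-strided access: every lane touches its own line -> bypass the L1.
col_access = [0x1000 + 4096 * lane for lane in range(32)]
print(should_bypass(row_access), should_bypass(col_access))   # False True
```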