14 research outputs found

    Improving Processor Design by Exploiting Performance Variance

    Get PDF
    Programs exhibit significant performance variance in their access to microarchitectural structures. There are three types of performance variance. First, semantically equivalent programs running on the same system can yield different performance due to characteristics of microarchitectural structures. Second, program phase behavior varies significantly. Third, different types of operations on microarchitectural structure can lead to different performance. In this dissertation, we explore the performance variance and propose techniques to improve the processor design. We explore performance variance caused by microarchitectural structures and propose program interferometry, a technique that perturbs benchmark executables to yield a wide variety of performance points without changing program semantics or other important execution characteristics such as the number of retired instructions. By observing the behavior of the benchmarks over a range of branch prediction accuracies, we can estimate the impact of a microarchitectural optimization optimization and not the rest of the microarchitecture. We explore performance variance caused by phase changes and develop prediction-driven last-level cache (LLC) writeback techniques. We propose a rank idle time prediction driven LLC writeback technique and a last-write prediction driven LLC writeback technique. These techniques improve performance by reducing the write-induced interference. We explore performance variance caused by different types of operations to Non-Volatile Memory (NVM) and propose LLC management policies to reduce write overhead of NVM.We propose an adaptive placement and migration policy for an STT-RAM-based hybrid cache and writeback aware dynamic cache management for NVM-based main memory system. These techniques reduce write latency and write energy, thus leading to performance improvement and energy reduction

    A Novel Function Complexity-Based Code Migration Policy for Reducing Power Consumption

    Get PDF
    Embedded system designs have changed greatly owing to rapid developments in both hardware and software technology. Typical design should consider hardware limitations, such as size, weight, or battery capacity. In other words, the designs are heavily dependent on the hardware component. Since hardware can deteriorate and degenerate, hardware-aware software design is needed to achieve power-efficient embedded systems. Studies usually focus on the microprocessor in terms of optimizing power consumption. Besides computation, however, the system also consumes power when executing programs. A lot of memory accesses result in the entire execution, it should be considered to minimize for more efficient designs. Modern embedded systems often use heterogeneous memory to benefit from different characteristics of memory devices. This study aims to optimize the power efficiency of heterogeneous memory in embedded systems. We have proposed a detailed function complexity concept to identify specific function units in a program that consume less power in migrated memory. Using the function complexity, function selection algorithm is proposed to select a unique function which improves most after the migration. Experiments and quantitative analyses with various benchmarks have been performed to prove the validity of the proposed algorithm. Power consumption is successfully minimized by migrating certain function of a program in low-power memory

    HALLS: An Energy-Efficient Highly Adaptable Last Level STT-RAM Cache for Multicore Systems

    Get PDF
    Spin-Transfer Torque RAM (STT-RAM) is widely considered a promising alternative to SRAM in the memory hierarchy due to STT-RAM's non-volatility, low leakage power, high density, and fast read speed. The STT-RAM's small feature size is particularly desirable for the last-level cache (LLC), which typically consumes a large area of silicon die. However, long write latency and high write energy still remain challenges of implementing STT-RAMs in the CPU cache. An increasingly popular method for addressing this challenge involves trading off the non-volatility for reduced write speed and write energy by relaxing the STT-RAM's data retention time. However, in order to maximize energy saving potential, the cache configurations, including STT-RAM's retention time, must be dynamically adapted to executing applications' variable memory needs. In this paper, we propose a highly adaptable last level STT-RAM cache (HALLS) that allows the LLC configurations and retention time to be adapted to applications' runtime execution requirements. We also propose low-overhead runtime tuning algorithms to dynamically determine the best (lowest energy) cache configurations and retention times for executing applications. Compared to prior work, HALLS reduced the average energy consumption by 60.57% in a quad-core system, while introducing marginal latency overhead.Comment: To Appear on IEEE Transactions on Computers (TC

    A heterogeneous memory organization with minimum energy consumption in 3D chip-multiprocessors

    Get PDF
    Main memories play an important role in overall energy consumption of embedded systems. Using conventional memory technologies in future designs in nanoscale era cause a drastic increase in leakage power consumption and temperature-related problems. Emerging non-volatile memory (NVM) technologies offer many desirable characteristics such as near-zero leakage power, high density and non-volatility. They can significantly mitigate the issue of memory leakage power in future embedded chip-multiprocessor (eCMP) systems. However, they suffer from challenges such as limited write endurance and high write energy consumption which restrict them for adoption in modern memory systems. In this article, we propose a stacked hybrid memory system for 3D chip-multiprocessors to take advantages of both traditional and non-volatile memory technologies. For reaching this target, we present a convex optimization-based model that minimizes the system energy consumption while satisfy endurance constraint in order to design a reliable memory system. Experimental results show that the proposed method improves energy-delay product (EDP) and performance by about 44.8% and 13.8% on average respectively compared with the traditional memory design where single technology is used. © 2016 IEEE

    Hybrid stacked memory architecture for energy efficient embedded chip-multiprocessors based on compiler directed approach

    Get PDF
    Energy consumption becomes the most critical limitation on the performance of nowadays embedded system designs. On-chip memories due to major contribution in overall system energy consumption are always significant issue for embedded systems. Using conventional memory technologies in future designs in nano-scale era causes a drastic increase in leakage power consumption and temperature-related problems. Emerging non-volatile memory (NVM) technologies are promising replacement for conventional memory structure in embedded systems due to its attractive characteristics such as near-zero leakage power, high density and non-volatility. Recent advantages of NVM technologies can significantly mitigate the issue of memory leakage power. However, they introduce new challenges such as limited write endurance and high write energy consumption which restrict them for adoption in modern memory systems. In this article, we propose a stacked hybrid memory system to minimize energy consumption for 3D embedded chip-multiprocessors (eCMP). For reaching this target, we present a convex optimization-based model to distribute data blocks between SRAM and NVM banks based on data access pattern derived by compiler. Our compiler-assisted hybrid memory architecture can achieve up to 51.28 times improvement in lifetime. In addition, experimental results show that our proposed method reduce energy consumption by 56% on average compared to the traditional memory design where single technology is used. © 2015 IEEE

    A survey of emerging architectural techniques for improving cache energy consumption

    Get PDF
    The search goes on for another ground breaking phenomenon to reduce the ever-increasing disparity between the CPU performance and storage. There are encouraging breakthroughs in enhancing CPU performance through fabrication technologies and changes in chip designs but not as much luck has been struck with regards to the computer storage resulting in material negative system performance. A lot of research effort has been put on finding techniques that can improve the energy efficiency of cache architectures. This work is a survey of energy saving techniques which are grouped on whether they save the dynamic energy, leakage energy or both. Needless to mention, the aim of this work is to compile a quick reference guide of energy saving techniques from 2013 to 2016 for engineers, researchers and students

    Shiftsreduce: Minimizing shifts in racetrack memory 4.0

    Get PDF
    Racetrack memories (RMs) have significantly evolved since their conception in 2008, making them a serious contender in the field of emerging memory technologies. Despite key technological advancements, the access latency and energy consumption of an RM-based system are still highly influenced by the number of shift operations. These operations are required to move bits to the right positions in the racetracks. This article presents data-placement techniques for RMs that maximize the likelihood that consecutive references access nearby memory locations at runtime, thereby minimizing the number of shifts. We present an integer linear programming (ILP) formulation for optimal data placement in RMs, and we revisit existing offset assignment heuristics, originally proposed for random-access memories. We introduce a novel heuristic tailored to a realistic RM and combine it with a genetic search to further improve the solution. We show a reduction in the number of shifts of up to 52.5%, outperforming the state of the art by up to 16.1%

    L2C2: Last-level compressed-contents non-volatile cache and a procedure to forecast performance and lifetime

    Get PDF
    Several emerging non-volatile (NV) memory technologies are rising as interesting alternatives to build the Last-Level Cache (LLC). Their advantages, compared to SRAM memory, are higher density and lower static power, but write operations wear out the bitcells to the point of eventually losing their storage capacity. In this context, this paper presents a novel LLC organization designed to extend the lifetime of the NV data array and a procedure to forecast in detail the capacity and performance of such an NV-LLC over its lifetime. From a methodological point of view, although different approaches are used in the literature to analyze the degradation of an NV-LLC, none of them allows to study in detail its temporal evolution. In this sense, this work proposes a forecasting procedure that combines detailed simulation and prediction, allowing an accurate analysis of the impact of different cache control policies and mechanisms (replacement, wear-leveling, compression, etc.) on the temporal evolution of the indices of interest, such as the effective capacity of the NV-LLC or the system IPC. We also introduce L2C2, a LLC design intended for implementation in NV memory technology that combines fault tolerance, compression, and internal write wear leveling for the first time. Compression is not used to store more blocks and increase the hit rate, but to reduce the write rate and increase the lifetime during which the cache supports near-peak performance. In addition, to support byte loss without performance drop, L2C2 inherently allows N redundant bytes to be added to each cache entry. Thus, L2C2+N, the endurance-scaled version of L2C2, allows balancing the cost of redundant capacity with the benefit of longer lifetime. For instance, as a use case, we have implemented the L2C2 cache with STT-RAM technology. It has affordable hardware overheads compared to that of a baseline NV-LLC without compression in terms of area, latency and energy consumption, and increases up to 6-37 times the time in which 50% of the effective capacity is degraded, depending on the variability in the manufacturing process. Compared to L2C2, L2C2+6 which adds 6 bytes of redundant capacity per entry, that means 9.1% of storage overhead, can increase up to 1.4-4.3 times the time in which the system gets its initial peak performance degraded
    corecore