641 research outputs found

    Author retrospective for the dual data cache

    In this paper we present a retrospective on our paper published in ICS 1995, which to the best of our knowledge was the first paper to introduce the concept of a cache memory with multiple subcaches, each tuned for a different type of locality. In this retrospective, we summarize the main ideas of the original paper and outline some of the later work that exploited similar ideas and could have been influenced by our original paper, including two actual industrial microprocessors. Peer reviewed. Postprint (author's final draft).

    Buffer Controlled Cache for Low Power Multicore Processors

    This thesis proposes a buffered dual-access-mode cache to reduce power consumption in multicore caches for embedded systems. This cache is called the Buffer Controlled Cache (BCC). The proposed scheme introduces a pre-cache buffer to determine how to access the cache. The proposed cache shows better prediction rates and lower power consumption than conventional caches such as the phased cache and the way-prediction cache. For the single-core implementation, the SimpleScalar and CACTI simulators were used with the SPEC2000 benchmark programs. The experimental results show that the proposed cache reduces power consumption by 37% to 42% relative to the conventional caches. The Multi2Sim and McPAT simulators were used for the multicore simulations with the PARSEC benchmark programs. The experimental results show that the proposed cache reduces power consumption by as much as 54% relative to conventional caches.

    Cache Equalizer: A Cache Pressure Aware Block Placement Scheme for Large-Scale Chip Multiprocessors

    This paper describes Cache Equalizer (CE), a novel distributed cache management scheme for large-scale chip multiprocessors (CMPs). Our work is motivated by the large asymmetry in cache set usage. CE decouples the physical locations of cache blocks from their addresses in order to reduce misses caused by destructive interference. Temporal pressure at the on-chip last-level cache is continuously collected at group granularity (a group comprises several cache sets) and periodically recorded at the memory controller to guide the placement process. An incoming block is consequently placed in the cache group that exhibits the minimum pressure. CE provides Quality of Service (QoS) by robustly offering better performance than the baseline shared NUCA cache. Simulation results using a full-system simulator demonstrate that CE outperforms shared NUCA caches by an average of 15.5% and by as much as 28.5% for the benchmark programs we examined. Furthermore, our evaluations show that CE outperforms related CMP cache designs.
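
The placement rule in the abstract above (an incoming block goes to the group with the minimum recorded pressure) can be illustrated with a minimal sketch. The names `least_pressured_group`, `place_block`, and `location_map`, and the use of a plain list of counters as the pressure snapshot, are assumptions made for illustration; the abstract does not describe CE's actual data structures or how pressure is measured.

```python
def least_pressured_group(pressure):
    """Return the index of the cache group with the lowest recorded pressure.

    `pressure` is a hypothetical snapshot: one counter per cache group,
    as it might be recorded periodically at the memory controller.
    Ties are broken in favour of the lowest group index.
    """
    return min(range(len(pressure)), key=pressure.__getitem__)


def place_block(block_addr, pressure, location_map):
    """Place an incoming block in the least-pressured group.

    `location_map` (hypothetical) records where each block lives, which is
    what decoupling a block's physical location from its address requires.
    """
    group = least_pressured_group(pressure)
    location_map[block_addr] = group
    return group


# Usage: four groups, group 1 and group 3 are tied for the lowest pressure.
snapshot = [5, 2, 7, 2]
locations = {}
chosen = place_block(0x1F40, snapshot, locations)
```

The snapshot is deliberately left unchanged by a placement here, since the abstract says pressure is recorded periodically rather than on every insertion.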

    Improving the Performance and Energy Efficiency of GPGPU Computing through Adaptive Cache and Memory Management Techniques

    Department of Computer Science and Engineering. As the performance and energy efficiency requirements of GPGPUs have risen, memory management techniques for GPGPUs have improved to meet them by employing hardware caches and utilizing heterogeneous memory. These techniques can improve GPGPUs by providing lower memory latency and higher memory bandwidth. However, these methods do not always guarantee improved performance and energy efficiency, due to the small cache size and the heterogeneity of the memory nodes. While prior works have proposed various techniques to address this issue, relatively little work has been done to investigate holistic support for memory management techniques. In this dissertation, we analyze performance pathologies and propose various techniques to improve memory management. First, we investigate the effectiveness of advanced cache indexing (ACI) for high-performance and energy-efficient GPGPU computing. Specifically, we discuss the designs of various static and adaptive cache indexing schemes and present an implementation for GPGPUs. We then quantify and analyze the effectiveness of the ACI schemes using a cycle-accurate GPGPU simulator. Our quantitative evaluation shows that ACI schemes achieve significant performance and energy-efficiency gains over the baseline conventional indexing scheme. We also analyze the performance sensitivity of ACI to key architectural parameters (i.e., capacity, associativity, and ICN bandwidth) and to the cache indexing latency, and demonstrate that ACI continues to achieve high performance in various settings. Second, we propose IACM, integrated adaptive cache management for high-performance and energy-efficient GPGPU computing. Based on the performance pathology analysis of GPGPUs, we integrate state-of-the-art adaptive cache management techniques (i.e., cache indexing, bypassing, and warp limiting) in a unified architectural framework to eliminate performance pathologies. 
Our quantitative evaluation demonstrates that IACM significantly improves the performance and energy efficiency of various GPGPU workloads over the baseline architecture (by 98.1% and 61.9% on average, respectively) and achieves considerably higher performance than the state-of-the-art technique (by 361.4% at maximum and 7.7% on average). Furthermore, IACM delivers significant performance and energy efficiency gains over the baseline GPGPU architecture even when the baseline is enhanced with advanced architectural technologies (e.g., higher capacity, associativity). Third, we propose bandwidth- and latency-aware page placement (BLPP) for GPGPUs with heterogeneous memory. BLPP analyzes the characteristics of an application and determines the optimal page allocation ratio between the GPU and CPU memory. Based on the optimal page allocation ratio, BLPP dynamically allocates pages across the heterogeneous memory nodes. Our experimental results show that BLPP considerably outperforms the baseline and the state-of-the-art technique (by 13.4% and 16.7%) and performs similarly to the static-best version (a 1.2% difference), which requires extensive offline profiling.

    Energy-efficient and cost-effective reliability design in memory systems

    Reliability of memory systems is increasingly a concern as memory density increases, cell dimensions shrink, and new memory technologies move close to commercial use. Meanwhile, memory power efficiency has become another first-order consideration in memory system design. Conventional reliability schemes use ECC (Error Correcting Code) and EDC (Error Detecting Code) to support error correction and detection in memory systems, which puts a rigid constraint on memory organization and incurs significant overhead in power efficiency and area cost. This dissertation studies energy-efficient and cost-effective reliability design for both cache and main memory systems. It first explores the generic approach called embedded ECC in main memory systems to provide a low-cost and efficient reliability design. A scheme called E3CC (Enhanced Embedded ECC) is proposed for sub-ranked low-power memories to alleviate reliability concerns. In the design, it proposes a novel BCRM (Biased Chinese Remainder Mapping) to resolve the address mapping issue in the page-interleaving scheme. The proposed BCRM scheme provides an opportunity for building a flexible reliability system, which helps consumer-level computers save power. Within the proposed E3CC scheme, we further explore a group of address mapping schemes at the DRAM device level that map memory requests to their designated regions in order to provide SEP (Selective Error Protection). All the proposed address mapping schemes are based on the modulo operation. They are shown in this thesis to be efficient, flexible, and adaptable to various scenarios and system requirements. Additionally, we propose the Free ECC reliability design for compressed cache schemes. It utilizes the unused fragments in a compressed cache to store ECC. Such a design not only reduces the chip overhead but also improves cache utilization and power efficiency. 
In the design, we propose an efficient convergent cache allocation scheme that organizes the compressed data blocks more effectively than existing schemes. This new design makes compressed caches an increasingly viable choice for processors with high reliability requirements. Furthermore, we propose a novel, system-level memory error detection scheme based on memory integrity checking, called MemGuard. It uses memory log hashes to ensure, with high probability, that the memory read log and write log match each other. It is much stronger than conventional protection in error detection and incurs little hardware cost, no storage overhead, and little power overhead. It puts no constraints on memory organization and adds no major complication to processor design or operating system design. In the thesis, we show that the MemGuard reliability design is simple, robust, and efficient.
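
The idea of checking that a hashed read log matches a hashed write log can be illustrated with a minimal sketch of classic offline memory checking. This is an assumption about the general technique, not MemGuard's actual design: the `CheckedMemory` class, the per-cell timestamps, the 64-bit additive hash, and the use of SHA-256 are all hypothetical choices made for illustration.

```python
import hashlib

MASK = (1 << 64) - 1  # keep the running log hashes to 64 bits


def _h(addr, value, ts):
    """Hash one log entry (address, value, timestamp) to a 64-bit integer."""
    digest = hashlib.sha256(f"{addr}:{value}:{ts}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


class CheckedMemory:
    """Toy memory that keeps additive hashes of its read log and write log."""

    def __init__(self, size):
        self.cells = [(0, 0)] * size  # each cell holds (value, timestamp)
        self.clock = 0
        self.read_hash = 0
        self.write_hash = 0
        for addr in range(size):
            self._log_write(addr, 0)  # initial contents count as writes

    def _log_write(self, addr, value):
        self.clock += 1
        self.cells[addr] = (value, self.clock)
        self.write_hash = (self.write_hash + _h(addr, value, self.clock)) & MASK

    def _log_read(self, addr):
        value, ts = self.cells[addr]
        self.read_hash = (self.read_hash + _h(addr, value, ts)) & MASK
        return value

    def read(self, addr):
        value = self._log_read(addr)
        self._log_write(addr, value)  # write back with a fresh timestamp
        return value

    def write(self, addr, value):
        self._log_read(addr)  # consume the entry being overwritten
        self._log_write(addr, value)

    def check(self):
        """Scan every cell once more; the two log hashes must then agree."""
        total = self.read_hash
        for addr, (value, ts) in enumerate(self.cells):
            total = (total + _h(addr, value, ts)) & MASK
        return total == self.write_hash


# Usage: normal traffic passes the check; a silent corruption does not.
mem = CheckedMemory(4)
mem.write(1, 42)
ok_before = mem.check()
mem.cells[2] = (99, mem.cells[2][1])  # simulate a bit error in cell 2
ok_after = mem.check()
```

Because addition is commutative, the order of log entries does not matter: every entry that was written is read back exactly once, either by the next access to that address or by the final scan, so the two hashes agree unless some cell changed without going through `write`.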

    Jigsaw: Scalable software-defined caches

    Shared last-level caches, widely used in chip multiprocessors (CMPs), face two fundamental limitations. First, the latency and energy of shared caches degrade as the system scales up. Second, when multiple workloads share the CMP, they suffer from interference in shared cache accesses. Unfortunately, prior research addressing one issue either ignores or worsens the other: NUCA techniques reduce access latency but are prone to hotspots and interference, and cache partitioning techniques only provide isolation but do not reduce access latency. Funding: United States Defense Advanced Research Projects Agency (DARPA PERFECT contract HR0011-13-2-0005); Quanta Computer (Firm).

    Hardware-Oriented Cache Management for Large-Scale Chip Multiprocessors

    One of the key requirements for obtaining high performance from chip multiprocessors (CMPs) is to effectively manage the limited on-chip cache resources shared among co-scheduled threads and processes. This thesis proposes new hardware-oriented solutions for distributed CMP caches. Computer architects face growing challenges when designing cache systems for CMPs. These challenges result from non-uniform access latencies, interference misses, the bandwidth wall problem, and diverse workload characteristics. Our exploration of the CMP cache management problem suggests a CMP caching framework (CC-FR) that defines three main approaches to solving the problem: (1) data placement, (2) data retention, and (3) data relocation. We implement CC-FR's components by proposing and evaluating multiple cache management mechanisms. Pressure and Distance Aware Placement (PDA) decouples the physical locations of cache blocks from their addresses in order to reduce misses caused by destructive interference. Flexible Set Balancing (FSB), on the other hand, reduces interference misses by extending the lifetime of cache lines: it retains some fraction of the working set at underutilized local sets to satisfy far-flung reuses. PDA implements CC-FR's data placement and relocation components, and FSB applies CC-FR's retention approach. To alleviate non-uniform access latencies and adapt to phase changes in programs, Adaptive Controlled Migration (ACM) dynamically and periodically promotes cache blocks towards L2 banks close to the requesting cores. ACM falls under CC-FR's data relocation category. Dynamic Cache Clustering (DCC), on the other hand, addresses diverse workload characteristics and growing non-uniform access latencies by constructing a cache cluster for each core and expanding or contracting all clusters synergistically to match each core's cache demand. DCC implements CC-FR's data placement and relocation approaches. 
Lastly, Dynamic Pressure and Distance Aware Placement (DPDA) combines PDA and ACM to cooperatively mitigate interference misses and non-uniform access latencies. Dynamic Cache Clustering and Balancing (DCCB), on the other hand, combines DCC and FSB to employ all of CC-FR's categories and achieve higher system performance. Simulation results demonstrate the effectiveness of the proposed mechanisms and show that they compare favorably with related cache designs.

    Control Caching : a fault-tolerant architecture for SEU mitigation in microprocessor control logic

    Fault tolerance at the processor architecture level has become increasingly important due to rapid advancements in the design and usage of high-performance devices and embedded processors. System-level solutions to the challenge of fault tolerance flag errors and use penalty cycles to recover through the re-execution of instructions. This motivates the need for a hybrid technique that provides fault detection as well as fault masking, with minimal penalty cycles for recovery from detected errors. In this research, we propose Control Caching, an architectural technique comprising three schemes to protect the control logic of microprocessors against Single Event Upsets (SEUs). High fault coverage with relatively low hardware overhead is obtained by using both fault detection with recovery and fault masking. Control signals are classified as either static or dynamic, and static signals are further classified as opcode dependent or instruction dependent. The strategy for protecting static instruction-dependent control signals uses a distributed cache of the history of the control bits along with the Triple Modular Redundancy (TMR) concept, while the opcode-dependent control signals are protected by a distributed cache that can be used to flag errors. Dynamic signals are protected by selective duplication of datapath components. The techniques are implemented on the OpenRISC 1200 processor. Our simulation results show that fault detection with single-cycle recovery is provided for 92% of all instruction executions. FPGA synthesis is performed to analyze the associated cycle time and area overheads.
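
The Triple Modular Redundancy concept mentioned in the abstract above can be illustrated with a generic bitwise majority voter. This is a sketch of textbook TMR only; the function names and the idea of voting over three copies of a control word are assumptions for illustration, and the abstract does not say how Control Caching combines its history cache with the vote.

```python
def tmr_vote(a, b, c):
    """Bitwise majority of three copies of a control word.

    Each output bit takes the value held by at least two of the three
    inputs, so a single upset in any one copy is masked.
    """
    return (a & b) | (a & c) | (b & c)


def tmr_disagreement(a, b, c):
    """Return a bit mask of the positions where the three copies differ.

    A non-zero result means an error was detected (and masked by the vote).
    """
    return (a ^ b) | (a ^ c)


# Usage: the third copy has its top bit flipped by a simulated upset.
good = 0b1011
upset = 0b0011
voted = tmr_vote(good, good, upset)
flagged = tmr_disagreement(good, good, upset)
```

A voter like this masks any single faulty copy per bit position; two copies that fail in the same bit would outvote the correct one.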

    Extensões para a compressão Base-Delta-Imediato

    Advisor: Rodolfo Jardim de Azevedo. Dissertation (master's), Universidade Estadual de Campinas, Instituto de Computação. Abstract: Cache memories have long been used to reduce problems deriving from the memory-processor performance discrepancy: many levels of on-chip cache reduce the average memory latency at the cost of extra die area and power. To decrease the outlay of these extra components, cache compression techniques are used to store compressed data and allow a cache capacity boost. This project introduces extensions to Base-Delta-Immediate (BDI) compression: several modifications of the original technique that minimize the number of padding bits by relaxing the allowed delta sizes for each base and increasing the number of bases. The extensions were tested using ZSim and evaluated against state-of-the-art methods, and the performance results were compared to determine the validity of the proposed techniques. 
We verified an improvement of the original BDI compression factor from 1.37x to 1.58x at an energy increase as low as 27%. Master's degree in Computer Science (Mestre em Ciência da Computação). 1564395 CAPE
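
The base-plus-delta idea behind this family of compressors can be illustrated with a minimal single-base sketch. It is a simplification made for illustration: it uses the first word of the line as the only base and one fixed delta width, whereas the technique named in the abstract also uses an immediate (zero) base, and the thesis's extensions relax the delta sizes and add bases. The function names and parameters are hypothetical.

```python
def compress_line(words, delta_bytes=1):
    """Try to encode a cache line as one base plus small signed deltas.

    Returns (base, deltas) if every delta fits in `delta_bytes` bytes,
    otherwise None (the line is left uncompressed).
    """
    base = words[0]
    lo = -(1 << (8 * delta_bytes - 1))
    hi = (1 << (8 * delta_bytes - 1)) - 1
    deltas = [w - base for w in words]
    if all(lo <= d <= hi for d in deltas):
        return base, deltas
    return None


def decompress_line(base, deltas):
    """Rebuild the original words from the base and the deltas."""
    return [base + d for d in deltas]


def compressed_size(n_words, word_bytes=8, delta_bytes=1):
    """Bytes needed for one full-width base plus one delta per word."""
    return word_bytes + n_words * delta_bytes


# Usage: eight 8-byte pointers into the same region, 8 bytes apart.
line = [0x1000 + 8 * i for i in range(8)]
packed = compress_line(line)          # deltas 0, 8, ..., 56 fit in one byte
restored = decompress_line(*packed)
size = compressed_size(len(line))     # 16 bytes instead of 64
```

A line whose values are spread far apart fails the range check and stays uncompressed, which is the case that additional bases and more flexible delta widths are meant to reduce.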