20,695 research outputs found
Impact of intrinsic parameter fluctuation on the fault tolerance of L1 data cache
As the semiconductor process technology continues to scale deeper into the nanometer region, the intrinsic parameter fluctuations will aggressively affect the performance and reliability of future microprocessors and System-on-Chip (SoC) applications. These system requires large SRAM arrays that occupy an increasing fraction of the chip real estate. To investigate the impact various source of intrinsic parameter fluctuation (IPF) from systems point of view, a framework to bridge architecture-level and device-level simulation will be utilized for data cache built from transistors with 25 nm, 18 nm and 13 nm technology node. This study found that the IPF will not have any significant impacts on data cache memory systems build with 25 nm while increasing the memory cell ratio, (ß) to two will overcome the IPF impacts for the 18 nm. However, the 13 nm technology data cache could not operate even with higher cell ratio. Common, cache memory fault detection and correction such as ECC and redundancy can only partially remove the transaction error caused by these fluctuation sources
L2C2: Last-level compressed-contents non-volatile cache and a procedure to forecast performance and lifetime
Several emerging non-volatile (NV) memory technologies are rising as interesting alternatives to build the Last-Level Cache (LLC). Their advantages, compared to SRAM memory, are higher density and lower static power, but write operations wear out the bitcells to the point of eventually losing their storage capacity. In this context, this paper presents a novel LLC organization designed to extend the lifetime of the NV data array and a procedure to forecast in detail the capacity and performance of such an NV-LLC over its lifetime. From a methodological point of view, although different approaches are used in the literature to analyze the degradation of an NV-LLC, none of them allows to study in detail its temporal evolution. In this sense, this work proposes a forecasting procedure that combines detailed simulation and prediction, allowing an accurate analysis of the impact of different cache control policies and mechanisms (replacement, wear-leveling, compression, etc.) on the temporal evolution of the indices of interest, such as the effective capacity of the NV-LLC or the system IPC. We also introduce L2C2, a LLC design intended for implementation in NV memory technology that combines fault tolerance, compression, and internal write wear leveling for the first time. Compression is not used to store more blocks and increase the hit rate, but to reduce the write rate and increase the lifetime during which the cache supports near-peak performance. In addition, to support byte loss without performance drop, L2C2 inherently allows N redundant bytes to be added to each cache entry. Thus, L2C2+N, the endurance-scaled version of L2C2, allows balancing the cost of redundant capacity with the benefit of longer lifetime. For instance, as a use case, we have implemented the L2C2 cache with STT-RAM technology. It has affordable hardware overheads compared to that of a baseline NV-LLC without compression in terms of area, latency and energy consumption, and increases up to 6-37 times the time in which 50% of the effective capacity is degraded, depending on the variability in the manufacturing process. Compared to L2C2, L2C2+6 which adds 6 bytes of redundant capacity per entry, that means 9.1% of storage overhead, can increase up to 1.4-4.3 times the time in which the system gets its initial peak performance degraded
Stochastic Modeling of Hybrid Cache Systems
In recent years, there is an increasing demand of big memory systems so to
perform large scale data analytics. Since DRAM memories are expensive, some
researchers are suggesting to use other memory systems such as non-volatile
memory (NVM) technology to build large-memory computing systems. However,
whether the NVM technology can be a viable alternative (either economically and
technically) to DRAM remains an open question. To answer this question, it is
important to consider how to design a memory system from a "system
perspective", that is, incorporating different performance characteristics and
price ratios from hybrid memory devices.
This paper presents an analytical model of a "hybrid page cache system" so to
understand the diverse design space and performance impact of a hybrid cache
system. We consider (1) various architectural choices, (2) design strategies,
and (3) configuration of different memory devices. Using this model, we provide
guidelines on how to design hybrid page cache to reach a good trade-off between
high system throughput (in I/O per sec or IOPS) and fast cache reactivity which
is defined by the time to fill the cache. We also show how one can configure
the DRAM capacity and NVM capacity under a fixed budget. We pick PCM as an
example for NVM and conduct numerical analysis. Our analysis indicates that
incorporating PCM in a page cache system significantly improves the system
performance, and it also shows larger benefit to allocate more PCM in page
cache in some cases. Besides, for the common setting of performance-price ratio
of PCM, "flat architecture" offers as a better choice, but "layered
architecture" outperforms if PCM write performance can be significantly
improved in the future.Comment: 14 pages; mascots 201
A Cache Management Strategy to Replace Wear Leveling Techniques for Embedded Flash Memory
Prices of NAND flash memories are falling drastically due to market growth
and fabrication process mastering while research efforts from a technological
point of view in terms of endurance and density are very active. NAND flash
memories are becoming the most important storage media in mobile computing and
tend to be less confined to this area. The major constraint of such a
technology is the limited number of possible erase operations per block which
tend to quickly provoke memory wear out. To cope with this issue,
state-of-the-art solutions implement wear leveling policies to level the wear
out of the memory and so increase its lifetime. These policies are integrated
into the Flash Translation Layer (FTL) and greatly contribute in decreasing the
write performance. In this paper, we propose to reduce the flash memory wear
out problem and improve its performance by absorbing the erase operations
throughout a dual cache system replacing FTL wear leveling and garbage
collection services. We justify this idea by proposing a first performance
evaluation of an exclusively cache based system for embedded flash memories.
Unlike wear leveling schemes, the proposed cache solution reduces the total
number of erase operations reported on the media by absorbing them in the cache
for workloads expressing a minimal global sequential rate.Comment: Ce papier a obtenu le "Best Paper Award" dans le "Computer System
track" nombre de page: 8; International Symposium on Performance Evaluation
of Computer & Telecommunication Systems, La Haye : Netherlands (2011
Interval simulation: raising the level of abstraction in architectural simulation
Detailed architectural simulators suffer from a long development cycle and extremely long evaluation times. This longstanding problem is further exacerbated in the multi-core processor era. Existing solutions address the simulation problem by either sampling the simulated instruction stream or by mapping the simulation models on FPGAs; these approaches achieve substantial simulation speedups while simulating performance in a cycle-accurate manner This paper proposes interval simulation which rakes a completely different approach: interval simulation raises the level of abstraction and replaces the core-level cycle-accurate simulation model by a mechanistic analytical model. The analytical model estimates core-level performance by analyzing intervals, or the timing between two miss events (branch mispredictions and TLB/cache misses); the miss events are determined through simulation of the memory hierarchy, cache coherence protocol, interconnection network and branch predictor By raising the level of abstraction, interval simulation reduces both development time and evaluation time. Our experimental results using the SPEC CPU2000 and PARSEC benchmark suites and the MS multi-core simulator show good accuracy up to eight cores (average error of 4.6% and max error of 11% for the multi-threaded full-system workloads), while achieving a one order of magnitude simulation speedup compared to cycle-accurate simulation. Moreover interval simulation is easy to implement: our implementation of the mechanistic analytical model incurs only one thousand lines of code. Its high accuracy, fast simulation speed and ease-of-use make interval simulation a useful complement to the architect's toolbox for exploring system-level and high-level micro-architecture trade-offs
Accelerator Memory Reuse in the Dark Silicon Era
Accelerators integrated on-die with General-Purpose CPUs (GP-CPUs) can yield significant performance and power improvements. Their extensive use, however, is ultimately limited by their area overhead; due to their high degree of specialization, the opportunity cost of investing die real estate on accelerators can become prohibitive, especially for general-purpose architectures. In this paper we present a novel technique aimed at mitigating this opportunity cost by allowing GP-CPU cores to reuse accelerator memory as a non-uniform cache architecture (NUCA) substrate. On a system with a last level-2 cache of 128kB, our technique achieves on average a 25% performance improvement when reusing four 512 kB accelerator memory blocks to form a level-3 cache. Making these blocks reusable as NUCA slices incurs on average in a 1.89% area overhead with respect to equally-sized ad hoc cache slice
Performance Evaluation and Modeling of HPC I/O on Non-Volatile Memory
HPC applications pose high demands on I/O performance and storage capability.
The emerging non-volatile memory (NVM) techniques offer low-latency, high
bandwidth, and persistence for HPC applications. However, the existing I/O
stack are designed and optimized based on an assumption of disk-based storage.
To effectively use NVM, we must re-examine the existing high performance
computing (HPC) I/O sub-system to properly integrate NVM into it. Using NVM as
a fast storage, the previous assumption on the inferior performance of storage
(e.g., hard drive) is not valid any more. The performance problem caused by
slow storage may be mitigated; the existing mechanisms to narrow the
performance gap between storage and CPU may be unnecessary and result in large
overhead. Thus fully understanding the impact of introducing NVM into the HPC
software stack demands a thorough performance study.
In this paper, we analyze and model the performance of I/O intensive HPC
applications with NVM as a block device. We study the performance from three
perspectives: (1) the impact of NVM on the performance of traditional page
cache; (2) a performance comparison between MPI individual I/O and POSIX I/O;
and (3) the impact of NVM on the performance of collective I/O. We reveal the
diminishing effects of page cache, minor performance difference between MPI
individual I/O and POSIX I/O, and performance disadvantage of collective I/O on
NVM due to unnecessary data shuffling. We also model the performance of MPI
collective I/O and study the complex interaction between data shuffling,
storage performance, and I/O access patterns.Comment: 10 page
- …