
    Design Guidelines for High-Performance SCM Hierarchies

    With emerging storage-class memory (SCM) nearing commercialization, there is evidence that it will deliver the much-anticipated high density and access latencies within only a few factors of DRAM. Nevertheless, the latency-sensitive nature of memory-resident services makes seamless integration of SCM in servers questionable. In this paper, we ask how best to introduce SCM in such servers to improve overall performance/cost over existing DRAM-only architectures. We first show that even with the most optimistic latency projections for SCM, the higher memory access latency results in prohibitive performance degradation. However, we find that deployment of a modestly sized high-bandwidth 3D stacked DRAM cache makes the performance of an SCM-mostly memory system competitive. The high degree of spatial locality that memory-resident services exhibit not only simplifies the DRAM cache's design as page-based, but also enables the amortization of increased SCM access latencies and the mitigation of SCM's read/write latency disparity. We identify the set of memory hierarchy design parameters that plays a key role in the performance and cost of a memory system combining an SCM technology and a 3D stacked DRAM cache. We then introduce a methodology to drive provisioning for each of these design parameters under a target performance/cost goal. Finally, we use our methodology to derive concrete results for specific SCM technologies. With PCM as a case study, we show that a two bits/cell technology hits the performance/cost sweet spot, reducing the memory subsystem cost by 40% while keeping performance within 3% of the best performing DRAM-only system, whereas single-level and triple-level cell organizations are impractical for use as memory replacements. Comment: Published at MEMSYS'1
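
    As a rough illustration of the performance/cost reasoning above, the sketch below compares a DRAM-only memory system with an SCM-mostly system fronted by a small 3D-stacked DRAM cache, using a first-order average-memory-access-time and capacity-cost model. All latencies, hit rates, and $/GB figures are assumed placeholder values, not numbers from the paper.

```python
# Toy first-order model of the performance/cost trade-off described above.
# Every latency, hit rate, and $/GB figure is an illustrative assumption.

def amat(hit_rate, cache_lat_ns, miss_lat_ns):
    """Average memory access time with a DRAM cache in front of SCM."""
    return hit_rate * cache_lat_ns + (1.0 - hit_rate) * miss_lat_ns

def subsystem_cost(dram_gb, scm_gb, dram_cost_per_gb=8.0, scm_cost_per_gb=3.0):
    """Illustrative capacity cost (relative $ units)."""
    return dram_gb * dram_cost_per_gb + scm_gb * scm_cost_per_gb

capacity_gb = 256
dram_only_lat = 80                   # ns, assumed DRAM access latency
scm_read_lat = 3 * dram_only_lat     # assumed "a few factors of DRAM"

dram_only = {"amat": dram_only_lat,
             "cost": subsystem_cost(dram_gb=capacity_gb, scm_gb=0)}

# SCM-mostly system: small 3D-stacked page-based DRAM cache plus SCM backing store.
for hit_rate in (0.80, 0.90, 0.95, 0.99):
    cfg = {"amat": amat(hit_rate, cache_lat_ns=40, miss_lat_ns=scm_read_lat),
           "cost": subsystem_cost(dram_gb=8, scm_gb=capacity_gb)}
    slowdown = cfg["amat"] / dram_only["amat"]
    saving = 1.0 - cfg["cost"] / dram_only["cost"]
    print(f"hit={hit_rate:.2f}  AMAT x{slowdown:.2f}  cost saving {saving:.0%}")
```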

    Optimizing NUCA organizations and wiring alternatives for large caches with CACTI 6.0

    A significant part of future microprocessor real estate will be dedicated to L2 or L3 caches. These on-chip caches will heavily impact processor performance, power dissipation, and thermal management strategies. There are a number of interconnect design considerations that influence power/performance/area characteristics of large caches, such as wire models (width/spacing/repeaters), signaling strategy (RC/differential/transmission), router design, etc. Yet, to date, there exists no analytical tool that takes all of these parameters into account to carry out a design space exploration for large caches and estimate an optimal organization. In this work, we implement two major extensions to the CACTI cache modeling tool that focus on interconnect design for a large cache. First, we add the ability to model different types of wires, such as RC-based wires with different power/delay characteristics and differential low-swing buses. Second, we add the ability to model Non-uniform Cache Access (NUCA). We not only adopt state-of-the-art design space exploration strategies for NUCA, we also enhance this exploration by considering on-chip network contention and a wider spectrum of wiring and routing choices. We present a validation analysis of the new tool (to be released as CACTI 6.0) and present a case study to showcase how the tool can improve architecture research methodologies.
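
    The sketch below illustrates the kind of design-space sweep such a tool automates: enumerate wire types and NUCA bank counts, estimate delay and energy for each configuration, and pick the best under a chosen objective. The per-hop and per-bank figures are invented placeholders and bear no relation to CACTI's internal models.

```python
# Minimal sketch of a CACTI-style design-space sweep. All numbers are made up.
from itertools import product

# (per-hop delay ns, per-hop energy pJ) for a few wire flavours -- assumed values.
WIRES = {
    "global_rc": (1.0, 2.0),
    "fat_rc":    (0.6, 3.5),   # wider/faster wires, more energy and area
    "low_swing": (1.4, 0.5),   # differential low-swing: slower but energy-cheap
}

def nuca_estimate(bank_count, wire):
    """Crude NUCA access estimate: bank access plus average on-chip hops."""
    hop_delay, hop_energy = WIRES[wire]
    bank_delay = 4.0 / bank_count + 0.8   # smaller banks are faster (assumed)
    avg_hops = bank_count ** 0.5          # grid-like network diameter
    delay = bank_delay + avg_hops * hop_delay
    energy = 1.0 + avg_hops * hop_energy
    return delay, energy

best = min(
    ((w, b, *nuca_estimate(b, w)) for w, b in product(WIRES, (4, 8, 16, 32))),
    key=lambda cfg: cfg[2] * cfg[3]       # optimise a delay*energy objective
)
print(f"best config: wire={best[0]}, banks={best[1]}, "
      f"delay={best[2]:.2f} ns, energy={best[3]:.2f} pJ")
```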

    Efficient instruction and data caching for high-performance low-power embedded systems

    Although multi-threading processors can increase the performance of embedded systems with minimal overhead, fetching instructions from multiple threads each cycle also increases the pressure on the instruction cache, potentially harming the performance/consumption ratio. Instruction caches are responsible for a high percentage of the total energy consumption of the chip, which becomes a critical issue for battery-powered embedded devices. A direct way to reduce the energy consumption of the first-level instruction cache is to decrease its size and associativity. However, demanding applications, and especially applications with several threads running together, might suffer a dramatic performance slowdown, or even increase the total energy consumption of the cache hierarchy, due to the extra misses incurred. In this work we introduce iLP-NUCA (Instruction Light Power NUCA), a new instruction cache that replaces the conventional second-level cache (L2) and improves the Energy–Delay of the system. We provide iLP-NUCA with a new tree-based transport network-in-cache that reduces both the cache line service latency and the energy consumption with respect to the former LP-NUCA implementation. We model both conventional instruction hierarchies and iLP-NUCAs in our cycle-accurate simulation environment. Our experiments show that, running SPEC CPU2006, iLP-NUCA performs better and consumes less energy than a state-of-the-art high-performance conventional cache hierarchy (three cache levels, dedicated L1 and L2, shared L3). Furthermore, iLP-NUCA reaches, on average, the performance of a conventional instruction cache hierarchy implementing a double-sized L1, independently of the number of threads. This translates into a reduction of the Energy–Delay product of 21%, 18%, and 11%, reaching 90%, 95%, and 99% of the ideal performance for 1, 2, and 4 threads, respectively. These results hold across the considered distribution of applications, with the biggest gains in the most demanding applications (those with high instruction cache requirements). Moreover, we increase the performance of applications with several threads without penalizing any of them. The new transport topology reduces the average service latency of cache lines by 8% and the energy consumption of its components by 20%.
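
    For reference, the Energy–Delay product used above is simply energy multiplied by execution time, so a configuration that is both somewhat faster and noticeably less energy-hungry compounds into a double-digit EDP reduction. The split between energy and delay in the sketch below is an assumption chosen only to reproduce a reduction of roughly 21%; the abstract reports only the combined figure.

```python
# Back-of-the-envelope Energy-Delay product comparison, in the spirit of the
# evaluation above. The energy/delay ratios are illustrative assumptions only.

def edp(energy, delay):
    """Energy-Delay product: lower is better."""
    return energy * delay

baseline = edp(energy=1.00, delay=1.00)   # conventional three-level hierarchy

# Hypothetical iLP-NUCA outcome: slightly faster and noticeably less energy,
# e.g. 5% less execution time and 17% less energy -> about 21% lower EDP.
candidate = edp(energy=0.83, delay=0.95)

reduction = 1.0 - candidate / baseline
print(f"EDP reduction: {reduction:.0%}")  # ~21%
```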

    A fault-tolerant last level cache for CMPs operating at ultra-low voltage

    Voltage scaling to values near the threshold voltage is a promising technique to hold off the many-core power wall. However, as voltage decreases, some SRAM cells are unable to operate reliably and show a behavior consistent with a hard fault. Block disabling is a micro-architectural technique that allows low-voltage operation by deactivating faulty cache entries, at the expense of reducing the effective cache capacity. In the case of the last-level cache, this capacity reduction leads to an increase in off-chip memory accesses, diminishing the overall energy benefit of reducing the voltage supply. In this work, we exploit the reuse locality and the intrinsic redundancy of multi-level inclusive hierarchies to enhance the performance of block disabling with negligible cost. The proposed fault-aware last-level cache management policy maps critical blocks, those not present in private caches and with a higher probability of being reused, to active cache entries. Our evaluation shows that this fault-aware management results in up to 37.3% and 54.2% fewer misses per kilo instruction (MPKI) than block disabling for multiprogrammed and parallel workloads, respectively. This translates to performance enhancements of up to 13% and 34.6% for multiprogrammed and parallel workloads, respectively.
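
    A minimal sketch of the idea, under our own naming and simplifications: in a set where some ways are disabled as faulty, fills flagged as critical (likely to be reused and not held in a private cache) compete for the remaining active ways, while non-critical fills bypass rather than displace them. This illustrates the policy's intent, not the paper's implementation.

```python
# Illustrative fault-aware insertion policy for one LLC set with disabled ways.
# Class and field names are our own; the criticality test is a stand-in for the
# "not in private caches and likely to be reused" condition described above.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Line:
    tag: Optional[int] = None
    reused: bool = False           # criticality recorded at fill (illustrative)

class FaultAwareSet:
    def __init__(self, ways: int, faulty: List[int]):
        self.lines = [Line() for _ in range(ways)]
        self.active = [w for w in range(ways) if w not in faulty]

    def fill(self, tag: int, critical: bool) -> bool:
        """Insert a block; non-critical fills bypass instead of evicting."""
        # Prefer an empty active way.
        for w in self.active:
            if self.lines[w].tag is None:
                self.lines[w] = Line(tag, reused=critical)
                return True
        if not critical:
            return False           # bypass: keep active entries for critical blocks
        # Evict a non-reused block first, falling back to any active way.
        victim = next((w for w in self.active if not self.lines[w].reused),
                      self.active[0])
        self.lines[victim] = Line(tag, reused=True)
        return True

s = FaultAwareSet(ways=8, faulty=[1, 4, 6])   # 3 of 8 SRAM entries disabled
print(s.fill(tag=0xA, critical=True), s.fill(tag=0xB, critical=False))
```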