745 research outputs found

    Resource-aware scheduling for 2D/3D multi-/many-core processor-memory systems

    Get PDF
    This dissertation addresses the complexities of 2D/3D multi-/many-core processor-memory systems, focusing on two key areas: enhancing timing predictability in real-time multi-core processors and optimizing performance within thermal constraints. The integration of an increasing number of transistors into compact chip designs, while boosting computational capacity, presents challenges in resource contention and thermal management. The first part of the thesis improves timing predictability. We enhance shared cache interference analysis for set-associative caches, advancing the calculation of Worst-Case Execution Time (WCET). This development enables accurate assessment of cache interference and the effectiveness of partitioned schedulers in real-world scenarios. We introduce TCPS, a novel task and cache-aware partitioned scheduler that optimizes cache partitioning based on task-specific WCET sensitivity, leading to improved schedulability and predictability. Our research explores various cache and scheduling configurations, providing insights into their performance trade-offs. The second part focuses on thermal management in 2D/3D many-core systems. Recognizing the limitations of Dynamic Voltage and Frequency Scaling (DVFS) in S-NUCA many-core processors, we propose synchronous thread migrations as a thermal management strategy. This approach culminates in the HotPotato scheduler, which balances performance and thermal safety. We also introduce 3D-TTP, a transient temperature-aware power budgeting strategy for 3D-stacked systems, reducing the need for Dynamic Thermal Management (DTM) activation. Finally, we present 3QUTM, a novel method for 3D-stacked systems that combines core DVFS and memory bank Low Power Modes with a learning algorithm, optimizing response times within thermal limits. This research contributes significantly to enhancing performance and thermal management in advanced processor-memory systems

    Scalable Hierarchical Instruction Cache for Ultra-Low-Power Processors Clusters

    Full text link
    High Performance and Energy Efficiency are critical requirements for Internet of Things (IoT) end-nodes. Exploiting tightly-coupled clusters of programmable processors (CMPs) has recently emerged as a suitable solution to address this challenge. One of the main bottlenecks limiting the performance and energy efficiency of these systems is the instruction cache architecture due to its criticality in terms of timing (i.e., maximum operating frequency), bandwidth, and power. We propose a hierarchical instruction cache tailored to ultra-low-power tightly-coupled processor clusters where a relatively large cache (L1.5) is shared by L1 private caches through a two-cycle latency interconnect. To address the performance loss caused by the L1 capacity misses, we introduce a next-line prefetcher with cache probe filtering (CPF) from L1 to L1.5. We optimize the core instruction fetch (IF) stage by removing the critical core-to-L1 combinational path. We present a detailed comparison of instruction cache architectures' performance and energy efficiency for parallel ultra-low-power (ULP) clusters. Focusing on the implementation, our two-level instruction cache provides better scalability than existing shared caches, delivering up to 20\% higher operating frequency. On average, the proposed two-level cache improves maximum performance by up to 17\% compared to the state-of-the-art while delivering similar energy efficiency for most relevant applications.Comment: 14 page

    Compression-aware and performance-efficient insertion policies for long-lasting hybrid LLCs

    Get PDF
    Emerging non-volatile memory (NVM) technologies can potentially replace large SRAM memories such as the last-level cache (LLC). However, despite recent advances, NVMs suffer from higher write latency and limited write endurance. Recently, NVM-SRAM hybrid LLCs are proposed to combine the best of both worlds. Several policies have been proposed to improve the performance and lifetime of hybrid LLCs by intelligently steering the incoming LLC blocks into either the SRAM or NVM part, regarding the cache behavior of the LLC blocks and the SRAM/NVM device properties. However, these policies neither consider compressing the contents of the cache block nor using partially worn-out NVM cache blocks.This paper proposes new insertion policies for byte-level fault-tolerant hybrid LLCs that collaboratively optimize for lifetime and performance. Specifically, we leverage data compression to utilize partially defective NVM cache entries, thereby improving the LLC hit rate. The key to our approach is to guide the insertion policy by both the reuse properties of the block and the size resulting from its compression. A block is inserted in NVM only if it is a read-reuse block or its compressed size is lower than a threshold. It will be inserted in SRAM if the block is a write-reuse or its compressed size is greater than the threshold. We use set-dueling to tune the compression threshold at runtime. This compression threshold provides a knob to control the NVM write rate and, together with a rule-based mechanism, allows balancing performance and lifetime.Overall, our evaluation shows that, with affordable hardware overheads, the proposed schemes can nearly reach the performance of an SRAM cache with the same associativity while improving lifetime by 17× compared to a hybrid NVM-unaware LLC. Our proposed scheme outperforms the state-of-the-art insertion policies by 9% while achieving a comparative lifetime. The rule-based mechanism shows that by compromising, for instance, 1.1% and 1.9% performance, the NVM lifetime can be further increased by 28% and 44%, respectively.This work was partially funded by the HiPEAC collaboration grant 2020, the Center for Advancing Electronics Dresden (cfaed), the German Research Council (DFG) through the HetCIM project (502388442) under the Priority Program on ‘Disruptive Memory Technologies’ (SPP 2377), and from grants (1) PID2019-105660RB-C21 and PID2019-107255GB- C22/AEI/10.13039/501100011033 from Agencia Estatal de Investigación (AEI), and (2) gaZ: T5820R research group from Dept. of Science, University and Knowledge Society, Government of Aragon.Peer ReviewedPostprint (author's final draft

    Decryption Failure Attacks on Post-Quantum Cryptography

    Get PDF
    This dissertation discusses mainly new cryptanalytical results related to issues of securely implementing the next generation of asymmetric cryptography, or Public-Key Cryptography (PKC).PKC, as it has been deployed until today, depends heavily on the integer factorization and the discrete logarithm problems.Unfortunately, it has been well-known since the mid-90s, that these mathematical problems can be solved due to Peter Shor's algorithm for quantum computers, which achieves the answers in polynomial time.The recently accelerated pace of R&D towards quantum computers, eventually of sufficient size and power to threaten cryptography, has led the crypto research community towards a major shift of focus.A project towards standardization of Post-quantum Cryptography (PQC) was launched by the US-based standardization organization, NIST. PQC is the name given to algorithms designed for running on classical hardware/software whilst being resistant to attacks from quantum computers.PQC is well suited for replacing the current asymmetric schemes.A primary motivation for the project is to guide publicly available research toward the singular goal of finding weaknesses in the proposed next generation of PKC.For public key encryption (PKE) or digital signature (DS) schemes to be considered secure they must be shown to rely heavily on well-known mathematical problems with theoretical proofs of security under established models, such as indistinguishability under chosen ciphertext attack (IND-CCA).Also, they must withstand serious attack attempts by well-renowned cryptographers both concerning theoretical security and the actual software/hardware instantiations.It is well-known that security models, such as IND-CCA, are not designed to capture the intricacies of inner-state leakages.Such leakages are named side-channels, which is currently a major topic of interest in the NIST PQC project.This dissertation focuses on two things, in general:1) how does the low but non-zero probability of decryption failures affect the cryptanalysis of these new PQC candidates?And 2) how might side-channel vulnerabilities inadvertently be introduced when going from theory to the practice of software/hardware implementations?Of main concern are PQC algorithms based on lattice theory and coding theory.The primary contributions are the discovery of novel decryption failure side-channel attacks, improvements on existing attacks, an alternative implementation to a part of a PQC scheme, and some more theoretical cryptanalytical results

    Database System Acceleration on FPGAs

    Get PDF
    Relational database systems provide various services and applications with an efficient means for storing, processing, and retrieving their data. The performance of these systems has a direct impact on the quality of service of the applications that rely on them. Therefore, it is crucial that database systems are able to adapt and grow in tandem with the demands of these applications, ensuring that their performance scales accordingly. In the past, Moore's law and algorithmic advancements have been sufficient to meet these demands. However, with the slowdown of Moore's law, researchers have begun exploring alternative methods, such as application-specific technologies, to satisfy the more challenging performance requirements. One such technology is field-programmable gate arrays (FPGAs), which provide ideal platforms for developing and running custom architectures for accelerating database systems. The goal of this thesis is to develop a domain-specific architecture that can enhance the performance of in-memory database systems when executing analytical queries. Our research is guided by a combination of academic and industrial requirements that seek to strike a balance between generality and performance. The former ensures that our platform can be used to process a diverse range of workloads, while the latter makes it an attractive solution for high-performance use cases. Throughout this thesis, we present the development of a system-on-chip for database system acceleration that meets our requirements. The resulting architecture, called CbMSMK, is capable of processing the projection, sort, aggregation, and equi-join database operators and can also run some complex TPC-H queries. CbMSMK employs a shared sort-merge pipeline for executing all these operators, which results in an efficient use of FPGA resources. This approach enables the instantiation of multiple acceleration cores on the FPGA, allowing it to serve multiple clients simultaneously. CbMSMK can process both arbitrarily deep and wide tables efficiently. The former is achieved through the use of the sort-merge algorithm which utilizes the FPGA RAM for buffering intermediate sort results. The latter is achieved through the use of KeRRaS, a novel variant of the forward radix sort algorithm introduced in this thesis. KeRRaS allows CbMSMK to process a table a few columns at a time, incrementally generating the final result through multiple iterations. Given that acceleration is a key objective of our work, CbMSMK benefits from many performance optimizations. For instance, multi-way merging is employed to reduce the number of merge passes required for the execution of the sort-merge algorithm, thus improving the performance of all our pipeline-breaking operators. Another example is our in-depth analysis of early aggregation, which led to the development of a novel cache-based algorithm that significantly enhances aggregation performance. Our experiments demonstrate that CbMSMK performs on average 5 times faster than the state-of-the-art CPU-based database management system MonetDB.:I Database Systems & FPGAs 1 INTRODUCTION 1.1 Databases & the Importance of Performance 1.2 Accelerators & FPGAs 1.3 Requirements 1.4 Outline & Summary of Contributions 2 BACKGROUND ON DATABASE SYSTEMS 2.1 Databases 2.1.1 Storage Model 2.1.2 Storage Medium 2.2 Database Operators 2.2.1 Projection 2.2.2 Filter 2.2.3 Sort 2.2.4 Aggregation 2.2.5 Join 2.2.6 Operator Classification 2.3 Database Queries 2.4 Impact of Acceleration 3 BACKGROUND ON FPGAS 3.1 FPGA 3.1.1 Logic Element 3.1.2 Block RAM (BRAM) 3.1.3 Digital Signal Processor (DSP) 3.1.4 IO Element 3.1.5 Programmable Interconnect 3.2 FPGADesignFlow 3.2.1 Specifications 3.2.2 RTL Description 3.2.3 Verification 3.2.4 Synthesis, Mapping, Placement, and Routing 3.2.5 TimingAnalysis 3.2.6 Bitstream Generation and FPGA Programming 3.3 Implementation Quality Metrics 3.4 FPGA Cards 3.5 Benefits of Using FPGAs 3.6 Challenges of Using FPGAs 4 RELATED WORK 4.1 Summary of Related Work 4.2 Platform Type 4.2.1 Accelerator Card 4.2.2 Coprocessor 4.2.3 Smart Storage 4.2.4 Network Processor 4.3 Implementation 4.3.1 Loop-based implementation 4.3.2 Sort-based Implementation 4.3.3 Hash-based Implementation 4.3.4 Mixed Implementation 4.4 A Note on Quantitative Performance Comparisons II Cache-Based Morphing Sort-Merge with KeRRaS (CbMSMK) 5 OBJECTIVES AND ARCHITECTURE OVERVIEW 5.1 From Requirements to Objectives 5.2 Architecture Overview 5.3 Outlineof Part II 6 COMPARATIVE ANALYSIS OF OPENCL AND RTL FOR SORT-MERGE PRIMITIVES ON FPGAS 6.1 Programming FPGAs 6.2 RelatedWork 6.3 Architecture 6.3.1 Global Architecture 6.3.2 Sorter Architecture 6.3.3 Merger Architecture 6.3.4 Scalability and Resource Adaptability 6.4 Experiments 6.4.1 OpenCL Sort-Merge Implementation 6.4.2 RTLSorters 6.4.3 RTLMergers 6.4.4 Hybrid OpenCL-RTL Sort-Merge Implementation 6.5 Summary & Discussion 7 RESOURCE-EFFICIENT ACCELERATION OF PIPELINE-BREAKING DATABASE OPERATORS ON FPGAS 7.1 The Case for Resource Efficiency 7.2 Related Work 7.3 Architecture 7.3.1 Sorters 7.3.2 Sort-Network 7.3.3 X:Y Mergers 7.3.4 Merge-Network 7.3.5 Join Materialiser (JoinMat) 7.4 Experiments 7.4.1 Experimental Setup 7.4.2 Implementation Description & Tuning 7.4.3 Sort Benchmarks 7.4.4 Aggregation Benchmarks 7.4.5 Join Benchmarks 7. Summary 8 KERRAS: COLUMN-ORIENTED WIDE TABLE PROCESSING ON FPGAS 8.1 The Scope of Database System Accelerators 8.2 Related Work 8.3 Key-Reduce Radix Sort(KeRRaS) 8.3.1 Time Complexity 8.3.2 Space Complexity (Memory Utilization) 8.3.3 Discussion and Optimizations 8.4 Architecture 8.4.1 MSM 8.4.2 MSMK: Extending MSM with KeRRaS 8.4.3 Payload, Aggregation and Join Processing 8.4.4 Limitations 8.5 Experiments 8.5.1 Experimental Setup 8.5.2 Datasets 8.5.3 MSMK vs. MSM 8.5.4 Payload-Less Benchmarks 8.5.5 Payload-Based Benchmarks 8.5.6 Flexibility 8.6 Summary 9 A STUDY OF EARLY AGGREGATION IN DATABASE QUERY PROCESSING ON FPGAS 9.1 Early Aggregation 9.2 Background & Related Work 9.2.1 Sort-Based Early Aggregation 9.2.2 Cache-Based Early Aggregation 9.3 Simulations 9.3.1 Datasets 9.3.2 Metrics 9.3.3 Sort-Based Versus Cache-Based Early Aggregation 9.3.4 Comparison of Set-Associative Caches 9.3.5 Comparison of Cache Structures 9.3.6 Comparison of Replacement Policies 9.3.7 Cache Selection Methodology 9.4 Cache System Architecture 9.4.1 Window Aggregator 9.4.2 Compressor & Hasher 9.4.3 Collision Detector 9.4.4 Collision Resolver 9.4.5 Cache 9.5 Experiments 9.5.1 Experimental Setup 9.5.2 Resource Utilization and Parameter Tuning 9.5.3 Datasets 9.5.4 Benchmarks on Synthetic Data 9.5.5 Benchmarks on Real Data 9.6 Summary 10 THE FULL PICTURE 10.1 System Architecture 10.2 Benchmarks 10.3 Meeting the Objectives III Conclusion 11 SUMMARY AND OUTLOOK ON FUTURE RESEARCH 11.1 Summary 11.2 Future Work BIBLIOGRAPHY LIST OF FIGURES LIST OF TABLE

    Transactional memory for high-performance embedded systems

    Get PDF
    The increasing demand for computational power in embedded systems, which is required for various tasks, such as autonomous driving, can only be achieved by exploiting the resources offered by modern hardware. Due to physical limitations, hardware manufacturers have moved to increase the number of cores per processor instead of further increasing clock rates. Therefore, in our view, the additionally required computing power can only be achieved by exploiting parallelism. Unfortunately writing parallel code is considered a difficult and complex task. Hardware Transactional Memories (HTMs) are a suitable tool to write sophisticated parallel software. However, HTMs were not specifically developed for embedded systems and therefore cannot be used without consideration. The use of conventional HTMs increases complexity and makes it more difficult to foresee implications with other important properties of embedded systems. This thesis therefore describes how an HTM for embedded systems could be implemented. The HTM was designed to allow the parallel execution of software and to offer functionality which is useful for embedded systems. Hereby the focus lay on: elimination of the typical limitations of conventional HTMs, several conflict resolution mechanisms, investigation of real time behavior, and a feature to conserve energy. To enable the desired functionalities, the structure of the HTM described in this work strongly differs from a conventional HTM. In comparison to the baseline HTM, which was also designed and implemented in this thesis, the biggest adaptation concerns the conflict detection. It was modified so that conflicts can be detected and resolved centrally. For this, the cache hierarchy as well as the cache coherence had to be adapted and partially extended. The system was implemented in the cycle-accurate gem5 simulator. The eight benchmarks of the STAMP benchmark suite were used for evaluation. The evaluation of the various functionalities shows that the mechanisms work and add value for the operation in embedded systems.Der immer größer werdende Bedarf an Rechenleistung in eingebetteten Systemen, der für verschiedene Aufgaben wie z. B. dem autonomen Fahren benötigt wird, kann nur durch die effiziente Nutzung der zur Verfügung stehenden Ressourcen erreicht werden. Durch physikalische Grenzen sind Prozessorhersteller dazu übergegangen, Prozessoren mit mehreren Prozessorkernen auszustatten, statt die Taktraten weiter anzuheben. Daher kann die zusätzlich benötigte Rechenleistung aus unserer Sicht nur durch eine Steigerung der Parallelität gelingen. Hardwaretransaktionsspeicher (HTS) erlauben es ihren Nutzern schnell und einfach parallele Programme zu schreiben. Allerdings wurden HTS nicht speziell für eingebettete Systeme entwickelt und sind daher nur eingeschränkt für diese nutzbar. Durch den Einsatz herkömmlicher HTS steigt die Komplexität und es wird somit schwieriger abzusehen, ob andere wichtige Eigenschaften erreicht werden können. Um den Einsatz von HTS in eingebettete Systeme besser zu ermöglichen, beschreibt diese Arbeit einen konkreten Ansatz. Der HTS wurde hierzu so entwickelt, dass er eine parallele Ausführung von Programmen ermöglicht und Eigenschaften besitzt, welche für eingebettete Systeme nützlich sind. Dazu gehören unter anderem: Wegfall der typischen Limitierungen herkömmlicher HTS, Einflussnahme auf den Konfliktauflösungsmechanismus, Unterstützung einer abschätzbaren Ausführung und eine Funktion, um Energie einzusparen. Um die gewünschten Funktionalitäten zu ermöglichen, unterscheidet sich der Aufbau des in dieser Arbeit beschriebenen HTS stark von einem klassischen HTS. Im Vergleich zu dem Referenz HTS, der ebenfalls im Rahmen dieser Arbeit entworfen und implementiert wurde, betrifft die größte Anpassung die Konflikterkennung. Sie wurde derart verändert, dass die Konflikte zentral erkannt und aufgelöst werden können. Hierfür mussten die Cache-Hierarchie und Cache-Kohärenz stark angepasst und teilweise erweitert werden. Das System wurde in einem taktgenauen Simulator, dem gem5-Simulator, umgesetzt. Zur Evaluation wurden die acht Benchmarks der STAMP-Benchmark-Suite eingesetzt. Die Evaluation der verschiedenen Funktionen zeigt, dass die Mechanismen funktionieren und somit einen Mehrwert für eingebettete Systeme bieten

    A multi-level functional IR with rewrites for higher-level synthesis of accelerators

    Get PDF
    Specialised accelerators deliver orders of magnitude higher energy-efficiency than general-purpose processors. Field Programmable Gate Arrays (FPGAs) have become the substrate of choice, because the ever-changing nature of modern workloads, such as machine learning, demands reconfigurability. However, they are notoriously hard to program directly using Hardware Description Languages (HDLs). Traditional High-Level Synthesis (HLS) tools improve productivity, but come with their own problems. They often produce sub-optimal designs and programmers are still required to write hardware-specific code, thus development cycles remain long. This thesis proposes Shir, a higher-level synthesis approach for high-performance accelerator design with a hardware-agnostic programming entry point, a multi-level Intermediate Representation (IR), a compiler and rewrite rules for optimisation. First, a novel, multi-level functional IR structure for accelerator design is described. The IRs operate on different levels of abstraction, cleanly separating different hardware concerns. They enable the expression of different forms of parallelism and standard memory features, such as asynchronous off-chip memories or synchronous on-chip buffers, as well as arbitration of such shared resources. Exposing these features at the IR level is essential for achieving high performance. Next, mechanical lowering procedures are introduced to automatically compile a program specification through Shir’s functional IRs until low-level HDL code for FPGA synthesis is emitted. Each lowering step gradually adds implementation details. Finally, this thesis presents rewrite rules for automatic optimisations around parallelisation, buffering and data reshaping. Reshaping operations pose a challenge to functional approaches in particular. They introduce overheads that compromise performance or even prevent the generation of synthesisable hardware designs altogether. This fundamental issue is solved by the application of rewrite rules. The viability of this approach is demonstrated by running matrix multiplication and 2D convolution on an Intel Arria 10 FPGA. A limited design space exploration is conducted, confirming the ability of the IR to exploit various hardware features. Using rewrite rules for optimisation, it is possible to generate high-performance designs that are competitive with highly tuned OpenCL implementations and that outperform hardware-agnostic OpenCL code. The performance impact of the optimisations is further evaluated showing that they are essential to achieving high performance, and in many cases also necessary to produce hardware that fits the resource constraints

    Flexible Hardware-based Security-aware Mechanisms and Architectures

    Get PDF
    For decades, software security has been the primary focus in securing our computing platforms. Hardware was always assumed trusted, and inherently served as the foundation, and thus the root of trust, of our systems. This has been further leveraged in developing hardware-based dedicated security extensions and architectures to protect software from attacks exploiting software vulnerabilities such as memory corruption. However, the recent outbreak of microarchitectural attacks has shaken these long-established trust assumptions in hardware entirely, thereby threatening the security of all of our computing platforms and bringing hardware and microarchitectural security under scrutiny. These attacks have undeniably revealed the grave consequences of hardware/microarchitecture security flaws to the entire platform security, and how they can even subvert the security guarantees promised by dedicated security architectures. Furthermore, they shed light on the sophisticated challenges particular to hardware/microarchitectural security; it is more critical (and more challenging) to extensively analyze the hardware for security flaws prior to production, since hardware, unlike software, cannot be patched/updated once fabricated. Hardware cannot reliably serve as the root of trust anymore, unless we develop and adopt new design paradigms where security is proactively addressed and scrutinized across the full stack of our computing platforms, at all hardware design and implementation layers. Furthermore, novel flexible security-aware design mechanisms are required to be incorporated in processor microarchitecture and hardware-assisted security architectures, that can practically address the inherent conflict between performance and security by allowing that the trade-off is configured to adapt to the desired requirements. In this thesis, we investigate the prospects and implications at the intersection of hardware and security that emerge across the full stack of our computing platforms and System-on-Chips (SoCs). On one front, we investigate how we can leverage hardware and its advantages, in contrast to software, to build more efficient and effective security extensions that serve security architectures, e.g., by providing execution attestation and enforcement, to protect the software from attacks exploiting software vulnerabilities. We further propose that they are microarchitecturally configured at runtime to provide different types of security services, thus adapting flexibly to different deployment requirements. On another front, we investigate how we can protect these hardware-assisted security architectures and extensions themselves from microarchitectural and software attacks that exploit design flaws that originate in the hardware, e.g., insecure resource sharing in SoCs. More particularly, we focus in this thesis on cache-based side-channel attacks, where we propose sophisticated cache designs, that fundamentally mitigate these attacks, while still preserving performance by enabling that the performance security trade-off is configured by design. We also investigate how these can be incorporated into flexible and customizable security architectures, thus complementing them to further support a wide spectrum of emerging applications with different performance/security requirements. Lastly, we inspect our computing platforms further beneath the design layer, by scrutinizing how the actual implementation of these mechanisms is yet another potential attack surface. We explore how the security of hardware designs and implementations is currently analyzed prior to fabrication, while shedding light on how state-of-the-art hardware security analysis techniques are fundamentally limited, and the potential for improved and scalable approaches

    Evolutionary Design of the Memory Subsystem

    Full text link
    The memory hierarchy has a high impact on the performance and power consumption in the system. Moreover, current embedded systems, included in mobile devices, are specifically designed to run multimedia applications, which are memory intensive. This increases the pressure on the memory subsystem and affects the performance and energy consumption. In this regard, the thermal problems, performance degradation and high energy consumption, can cause irreversible damage to the devices. We address the optimization of the whole memory subsystem with three approaches integrated as a single methodology. Firstly, the thermal impact of register file is analyzed and optimized. Secondly, the cache memory is addressed by optimizing cache configuration according to running applications and improving both performance and power consumption. Finally, we simplify the design and evaluation process of general-purpose and customized dynamic memory manager, in the main memory. To this aim, we apply different evolutionary algorithms in combination with memory simulators and profiling tools. This way, we are able to evaluate the quality of each candidate solution and take advantage of the exploration of solutions given by the optimization algorithm.We also provide an experimental experience where our proposal is assessed using well-known benchmark applications

    L2C2: Last-level compressed-contents non-volatile cache and a procedure to forecast performance and lifetime

    Get PDF
    Several emerging non-volatile (NV) memory technologies are rising as interesting alternatives to build the Last-Level Cache (LLC). Their advantages, compared to SRAM memory, are higher density and lower static power, but write operations wear out the bitcells to the point of eventually losing their storage capacity. In this context, this paper presents a novel LLC organization designed to extend the lifetime of the NV data array and a procedure to forecast in detail the capacity and performance of such an NV-LLC over its lifetime. From a methodological point of view, although different approaches are used in the literature to analyze the degradation of an NV-LLC, none of them allows to study in detail its temporal evolution. In this sense, this work proposes a forecasting procedure that combines detailed simulation and prediction, allowing an accurate analysis of the impact of different cache control policies and mechanisms (replacement, wear-leveling, compression, etc.) on the temporal evolution of the indices of interest, such as the effective capacity of the NV-LLC or the system IPC. We also introduce L2C2, a LLC design intended for implementation in NV memory technology that combines fault tolerance, compression, and internal write wear leveling for the first time. Compression is not used to store more blocks and increase the hit rate, but to reduce the write rate and increase the lifetime during which the cache supports near-peak performance. In addition, to support byte loss without performance drop, L2C2 inherently allows N redundant bytes to be added to each cache entry. Thus, L2C2+N, the endurance-scaled version of L2C2, allows balancing the cost of redundant capacity with the benefit of longer lifetime. For instance, as a use case, we have implemented the L2C2 cache with STT-RAM technology. It has affordable hardware overheads compared to that of a baseline NV-LLC without compression in terms of area, latency and energy consumption, and increases up to 6-37 times the time in which 50% of the effective capacity is degraded, depending on the variability in the manufacturing process. Compared to L2C2, L2C2+6 which adds 6 bytes of redundant capacity per entry, that means 9.1% of storage overhead, can increase up to 1.4-4.3 times the time in which the system gets its initial peak performance degraded
    corecore