697 research outputs found
Decreasing the Miss Rate and Eliminating the Performance Penalty of a Data Filter Cache
While data filter caches (DFCs) have been shown to be effective at reducing data access energy, they have not been adopted in processors due to the associated performance penalty caused by high DFC miss rates. In this article, we present a design that both decreases the DFC miss rate and completely eliminates the DFC performance penalty even for a level-one data cache (L1 DC) with a single cycle access time. First, we show that a DFC that lazily fills each word in a DFC line from an L1 DC only when the word is referenced is more energy-efficient than eagerly filling the entire DFC line. For a 512B DFC, we are able to eliminate loads of words into the DFC that are never referenced before being evicted, which occurred for about 75% of the words in 32B lines. Second, we demonstrate that a lazily word filled DFC line can effectively share and pack data words from multiple L1 DC lines to lower the DFC miss rate. For a 512B DFC, we completely avoid accessing the L1 DC for loads about 23% of the time and avoid a fully associative L1 DC access for loads 50% of the time, where the DFC only requires about 2.5% of the size of the L1 DC. Finally, we present a method that completely eliminates the DFC performance penalty by speculatively performing DFC tag checks early and only accessing DFC data when a hit is guaranteed. For a 512B DFC, we improve data access energy usage for the DTLB and L1 DC by 33% with no performance degradation
On the tailoring of CAST-32A certification guidance to real COTS multicore architectures
The use of Commercial Off-The-Shelf (COTS) multicores in real-time industry is on the rise due to multicores' potential performance increase and energy reduction. Yet, the unpredictable impact on timing of contention in shared hardware resources challenges certification. Furthermore, most safety certification standards target single-core architectures and do not provide explicit guidance for multicore processors. Recently, however, CAST-32A has been presented providing guidance for software planning, development and verification in multicores. In this paper, from a theoretical level, we provide a detailed review of CAST-32A objectives and the difficulty of reaching them under current COTS multicore design trends; at experimental level, we assess the difficulties of the application of CAST-32A to a real multicore processor, the NXP P4080.This work has been partially supported by the Spanish Ministry of Economy and Competitiveness (MINECO) under grant
TIN2015-65316-P and the HiPEAC Network of Excellence.
Jaume Abella has been partially supported by the MINECO under Ramon y Cajal grant RYC-2013-14717.Peer ReviewedPostprint (author's final draft
Software-Based Self-Test of Set-Associative Cache Memories
Embedded microprocessor cache memories suffer from limited observability and controllability creating problems during in-system tests. This paper presents a procedure to transform traditional march tests into software-based self-test programs for set-associative cache memories with LRU replacement. Among all the different cache blocks in a microprocessor, testing instruction caches represents a major challenge due to limitations in two areas: 1) test patterns which must be composed of valid instruction opcodes and 2) test result observability: the results can only be observed through the results of executed instructions. For these reasons, the proposed methodology will concentrate on the implementation of test programs for instruction caches. The main contribution of this work lies in the possibility of applying state-of-the-art memory test algorithms to embedded cache memories without introducing any hardware or performance overheads and guaranteeing the detection of typical faults arising in nanometer CMOS technologie
Scalable Hierarchical Instruction Cache for Ultra-Low-Power Processors Clusters
High Performance and Energy Efficiency are critical requirements for Internet
of Things (IoT) end-nodes. Exploiting tightly-coupled clusters of programmable
processors (CMPs) has recently emerged as a suitable solution to address this
challenge. One of the main bottlenecks limiting the performance and energy
efficiency of these systems is the instruction cache architecture due to its
criticality in terms of timing (i.e., maximum operating frequency), bandwidth,
and power. We propose a hierarchical instruction cache tailored to
ultra-low-power tightly-coupled processor clusters where a relatively large
cache (L1.5) is shared by L1 private caches through a two-cycle latency
interconnect. To address the performance loss caused by the L1 capacity misses,
we introduce a next-line prefetcher with cache probe filtering (CPF) from L1 to
L1.5. We optimize the core instruction fetch (IF) stage by removing the
critical core-to-L1 combinational path. We present a detailed comparison of
instruction cache architectures' performance and energy efficiency for parallel
ultra-low-power (ULP) clusters. Focusing on the implementation, our two-level
instruction cache provides better scalability than existing shared caches,
delivering up to 20\% higher operating frequency. On average, the proposed
two-level cache improves maximum performance by up to 17\% compared to the
state-of-the-art while delivering similar energy efficiency for most relevant
applications.Comment: 14 page
Scalable Hierarchical Instruction Cache for Ultralow-Power Processors Clusters
High performance and energy efficiency are critical requirements for Internet of Things (IoT) end-nodes. Exploiting tightly coupled clusters of programmable processors (CMPs) has recently emerged as a suitable solution to address this challenge. One of the main bottlenecks limiting the performance and energy efficiency of these systems is the instruction cache architecture due to its criticality in terms of timing (i.e., maximum operating frequency), bandwidth, and power. We propose a hierarchical instruction cache tailored to ultralow-power (ULP) tightly coupled processor clusters where a relatively large cache (L1.5) is shared by L1 private (PR) caches through a two-cycle latency interconnect. To address the performance loss caused by the L1 capacity misses, we introduce a next-line prefetcher with cache probe filtering (CPF) from L1 to L1.5. We optimize the core instruction fetch (IF) stage by removing the critical core-to-L1 combinational path. We present a detailed comparison of instruction cache architectures' performance and energy efficiency for parallel ULP (PULP) clusters. Focusing on the implementation, our two-level instruction cache provides better scalability than existing shared caches, delivering up to 20% higher operating frequency. On average, the proposed two-level cache improves maximum performance by up to 17% compared to the state-of-the-art while delivering similar energy efficiency for most relevant applications
Smart hardware designs for probabilistically-analyzable processor architectures
Future Critical Real-Time Embedded Systems (CRTES), like those is planes, cars or trains, require more and more guaranteed performance in order to satisfy the increasing performance demands of advanced complex software features. While increased performance can be achieved by deploying processor techniques currently used in High-Performance Computing (HPC) and mainstream domains, their use challenges the software timing analysis, a necessary step in CRTES' verification and validation. Cache memories are known to have high impact in performance, and in fact, current CRTES include multicores usually with several levels of cache. In this line, this Thesis aims at increasing the guaranteed performance of CRTES by using techniques for caches building upon time randomization and providing probabilistic guarantees of tasks' execution time.
In this Thesis, we first focus on on improving cache placement and replacement to improve guaranteed performance. For placement, different existing policies are explored in a multi-level cache setup, and a solution is reached in which different of those policies are combined. For cache replacement, we analyze a pathological scenario that no cache policy so far accounts and propose several policies that fix this pathological scenario.
For shared caches in multicore we observe that contention is mainly caused by private writes that go through to the shared cache, yet using a pure write-back policy also has its drawbacks. We propose a hybrid approach to mitigate this contention. Building on this solution, the next contribution tackles a problem caused by the need of some reliability mechanisms in CRTES. Implementing reliability close to the processor's core has a significant impact in performance. A look-ahead error detection solution is proposed to greatly mitigate the performance impact.
The next contribution proposes the first hardware prefetcher for CRTES with arbitrary cache hierarchies. Given its speculative nature, prefetchers that have a guaranteed positive impact on performance are difficult to design. We present a framework that provides execution time guarantees and obtains a performance benefit.
Finally, we focus on the impact of timing anomalies in CRTES with caches. For the first time, a definition and taxonomy of timing anomalies is given for Measurement-Based Timing Analysis. Then, we focus on a specific timing anomaly that can happen with caches and provide a solution to account for it in the execution time estimates.Los Sistemas Empotrados de Tiempo-Real Crítico (SETRC), como los de los aviones, coches o trenes, requieren más y más rendimiento garantizado para satisfacer la demanda al alza de rendimiento para funciones complejas y avanzadas de software. Aunque el incremento en rendimiento puede ser adquirido utilizando técnicas de arquitectura de procesadores actualmente utilizadas en la Computación de Altas Prestaciones (CAP) i en los dominios convencionales, este uso presenta retos para el análisis del tiempo de software, un paso necesario en la verificación y validación de SETRC. Las memorias caches son conocidas por su gran impacto en rendimiento y, de hecho, los actuales SETRC incluyen multicores normalmente con diversos niveles de cache. En esta línea, esta Tesis tiene como objetivo mejorar el rendimiento garantizado de los SETRC utilizando técnicas para caches y utilizando métodos como la randomización del tiempo y proveyendo garantías probabilísticas de tiempo de ejecución de las tareas. En esta Tesis, primero nos centramos en mejorar la colocación y el reemplazo de caches para mejorar el rendimiento garantizado. Para la colocación, diferentes políticas son exploradas en un sistema cache multi-nivel, y se llega a una solución donde diversas de estas políticas son combinadas. Para el reemplazo, analizamos un escenario patológico que ninguna política actual tiene en cuenta, y proponemos varias políticas que solucionan este escenario patológico. Para caches compartidas en multicores, observamos que la contención es causada principalmente por escrituras privadas que van a través de la cache compartida, pero usar una política de escritura retardada pura también tiene sus consecuencias. Proponemos un enfoque híbrido para mitigar la contención. Sobre esta solución, la siguiente contribución ataca un problema causado por la necesidad de mecanismos de fiabilidad en SETRC. Implementar fiabilidad cerca del núcleo del procesador tiene un impacto significativo en rendimiento. Una solución basada en anticipación se propone para mitigar el impacto en rendimiento. La siguiente contribución propone el primer prefetcher hardware para SETRC con una jerarquía de caches arbitraria. Por primera vez, se da una definición y taxonomía de anomalías temporales para Análisis Temporal Basado en Medidas. Después, nos centramos en una anomalía temporal concreta que puede pasar con caches y ofrecemos una solución que la tiene en cuenta en las estimaciones del tiempo de ejecución.Postprint (published version
Recommended from our members
Efficient fine-grained virtual memory
Virtual memory in modern computer systems provides a single abstraction of the memory hierarchy.
By hiding fragmentation and overlays of physical memory, virtual memory frees applications from managing physical memory and improves programmability.
However, virtual memory often introduces noticeable overhead.
State-of-the-art systems use a paged virtual memory that maps virtual addresses to physical addresses
in page granularity (typically 4 KiB ).This mapping is stored as a page table. Before accessing physically addressed memory, the page table is accessed
to translate virtual addresses to physical addresses. Research shows that the overhead of accessing the page table can even exceed the execution time for some important applications.
In addition, this fine-grained mapping changes the access patterns between virtual and physical address spaces, introducing difficulties to many architecture techniques, such as caches and prefecthers.
In this dissertation, I propose architecture mechanisms to reduce the overhead of accessing and managing fine-grained virtual memory without compromising existing benefits.
There are three main contributions in this dissertation.
First, I investigate the impact of address translation on cache. I examine the restriction of virtually indexed, physically tagged (VIPT) caches with fine-grained paging and conclude that this restriction may lead to sub-optimal cache designs.
I introduce a novel cache strategy, speculatively indexed, physically tagged (SIPT) to enable flexible cache indexing under fine-grained page mapping.
SIPT speculates on the value of a few more index bits (1 - 3 in our experiments) to access the cache speculatively before translation, and then verify that the physical tag matches after translation.
Utilizing the fact that a simple relation generally exists between virtual and physical addresses, because memory allocators often exhibit contiguity, I also propose low-cost mechanisms to predict and correct potential mis-speculations.
Next, I focus on reducing the overhead of address translation for fine-grained virtual memory. I propose a novel architecture mechanism, Embedded Page Translation Information (EMPTI),
to provide general fine-grained page translation information on top of coarse-grained virtual memory.
EMPTI does so by speculating that a virtual address is mapped to a pre-determined physical location and then verifying the translation with a very-low-cost access to metadata embedded with data.
Coarse-grained virtual memory mechanisms (e.g., segmentation) are used to suggest the pre-determined physical location for each virtual page.
Overall, EMPTI achieves the benefits of low overhead translation while keeping the flexibility and programmability of fine-grained paging.
Finally, I improve the efficiency of metadata caching based on the fact that memory mapping contiguity generally exists beyond a page boundary.
In state-of-the-art architectures, caches treat PTEs (page table entries) as regular data. Although this is simple and straightforward,
it fails to maximize the storage efficiency of metadata.
Each page in the contiguously mapped region costs a full 8-byte PTE. However, the delta between virtual addresses and physical addresses remain the same and most metadata are identical.
I propose a novel microarchitectural mechanism that expands the effective PTE storage in the last-level-cache (LLC) and reduces the number of page-walk accesses that miss the LLC.Electrical and Computer Engineerin
- …