498 research outputs found

    A memory-centric approach to enable timing-predictability within embedded many-core accelerators

    Get PDF
    There is an increasing interest among real-time systems architects for multi- and many-core accelerated platforms. The main obstacle towards the adoption of such devices within industrial settings is related to the difficulties in tightly estimating the multiple interferences that may arise among the parallel components of the system. This in particular concerns concurrent accesses to shared memory and communication resources. Existing worst-case execution time analyses are extremely pessimistic, especially when adopted for systems composed of hundreds-tothousands of cores. This significantly limits the potential for the adoption of these platforms in real-time systems. In this paper, we study how the predictable execution model (PREM), a memory-aware approach to enable timing-predictability in realtime systems, can be successfully adopted on multi- and manycore heterogeneous platforms. Using a state-of-the-art multi-core platform as a testbed, we validate that it is possible to obtain an order-of-magnitude improvement in the WCET bounds of parallel applications, if data movements are adequately orchestrated in accordance with PREM. We identify which system parameters mostly affect the tremendous performance opportunities offered by this approach, both on average and in the worst case, moving the first step towards predictable many-core systems

    Eager Stack Cache Memory Transfers

    Get PDF
    The growing complexity of modern computer architectures increasingly complicates the prediction of the run-time behavior of software. For real-time systems, where a safe estimation of the program\u27s worst-case execution time is needed, time-predictable computer architectures promise to resolve this problem. The stack cache, for instance, allows the compiler to efficiently cache a program\u27s stack, while static analysis of its behavior remains easy. This work introduces an optimization of the stack cache that allows to anticipate memory transfers that might be initiated by future stack cache control instructions. These eager memory transfers thus allow to reduce the average-case latency of those control instructions, very similar to "prefetching" techniques known from conventional caches. However, the mechanism proposed here is guaranteed to have no impact on the worst-case execution time estimates computed by static analysis. Measurements on a dual-core platform using the Patmos processor and imedivision-multiplexing-based memory arbitration, show that our technique can eliminate up to 62% (7%) of the memory transfers from (respectively to) the stack cache on average over all programs of the MiBench benchmark suite

    Real-Time Prefetching on Shared-Memory Multi-Core Systems

    Get PDF
    In recent years, there has been a growing trend towards using multi-core processors in real-time systems to cope with the rising computation requirements of real-time tasks. Coupled with this, the rising memory requirements of these tasks pushes demand beyond what can be provided by small, private on-chip caches, requiring the use of larger, slower off-chip memories such as DRAM. Due to the cost, power requirements and complexity of these memories, they are typically shared between all of the tasks within the system. In order for the execution time of these tasks to be bounded, the response time of the memory and the interference from other tasks also needs to be bounded. While there is a great amount of current research on bounding this interference, one popular method is to effectively partition the available memory bandwidth between the processors in the system. Of course, as the number of processors increases, so does the worst-case blocking, and worst-case blocking times quickly increase with the number of processors. It is difficult to further optimise the arbitration scheme; instead, this scaling problem needs to be approached from another angle. Prefetching has previously been shown to improve the execution time of tasks by speculatively issuing memory accesses ahead of time for items which may be useful in the near future, although these prefetchers are typically not used in real-time systems due to their unpredictable nature. Instead, this work presents a framework by which a prefetcher can be safely used alongside a composable memory arbiter, a predictable prefetching scheme, and finally a method by which this predictable prefetcher can be used to improve the worst-case execution time of a running task

    Instruction cache for the C-processor

    Get PDF

    Software-Controlled Instruction Prefetch Buffering for Low-End Processors

    Get PDF
    This paper proposes a method of buffering instructions by software-based prefetching. The method allows low-end processors to improve their instruction throughput with a minimum of additional logic and power consumption. Low-end embedded processors do not employ caches for mainly two reasons. The first reason is that the overhead of cache implementation in terms of energy and area is considerable. The second reason is that, because a cache's performance primarily depends on the number of hits, an increasing number of misses could cause a processor to remain in stall mode for a longer duration. As a result, a cache may become more of a liability than an advantage. In contrast, the benchmarked results for the proposed software-based prefetch buffering without a cache show a 5-10% improvement in execution time. They also show a 4% or more reduction in the energy-delay-square-product (ED2P) with a maximum reduction of 40%. The results additionally demonstrate that the performance and efficiency of the proposed architecture scales with the number of multicycle instructions. The benchmarked routines tested to arrive at these results are widely deployed components of embedded applications

    Smart hardware designs for probabilistically-analyzable processor architectures

    Get PDF
    Future Critical Real-Time Embedded Systems (CRTES), like those is planes, cars or trains, require more and more guaranteed performance in order to satisfy the increasing performance demands of advanced complex software features. While increased performance can be achieved by deploying processor techniques currently used in High-Performance Computing (HPC) and mainstream domains, their use challenges the software timing analysis, a necessary step in CRTES' verification and validation. Cache memories are known to have high impact in performance, and in fact, current CRTES include multicores usually with several levels of cache. In this line, this Thesis aims at increasing the guaranteed performance of CRTES by using techniques for caches building upon time randomization and providing probabilistic guarantees of tasks' execution time. In this Thesis, we first focus on on improving cache placement and replacement to improve guaranteed performance. For placement, different existing policies are explored in a multi-level cache setup, and a solution is reached in which different of those policies are combined. For cache replacement, we analyze a pathological scenario that no cache policy so far accounts and propose several policies that fix this pathological scenario. For shared caches in multicore we observe that contention is mainly caused by private writes that go through to the shared cache, yet using a pure write-back policy also has its drawbacks. We propose a hybrid approach to mitigate this contention. Building on this solution, the next contribution tackles a problem caused by the need of some reliability mechanisms in CRTES. Implementing reliability close to the processor's core has a significant impact in performance. A look-ahead error detection solution is proposed to greatly mitigate the performance impact. The next contribution proposes the first hardware prefetcher for CRTES with arbitrary cache hierarchies. Given its speculative nature, prefetchers that have a guaranteed positive impact on performance are difficult to design. We present a framework that provides execution time guarantees and obtains a performance benefit. Finally, we focus on the impact of timing anomalies in CRTES with caches. For the first time, a definition and taxonomy of timing anomalies is given for Measurement-Based Timing Analysis. Then, we focus on a specific timing anomaly that can happen with caches and provide a solution to account for it in the execution time estimates.Los Sistemas Empotrados de Tiempo-Real Crítico (SETRC), como los de los aviones, coches o trenes, requieren más y más rendimiento garantizado para satisfacer la demanda al alza de rendimiento para funciones complejas y avanzadas de software. Aunque el incremento en rendimiento puede ser adquirido utilizando técnicas de arquitectura de procesadores actualmente utilizadas en la Computación de Altas Prestaciones (CAP) i en los dominios convencionales, este uso presenta retos para el análisis del tiempo de software, un paso necesario en la verificación y validación de SETRC. Las memorias caches son conocidas por su gran impacto en rendimiento y, de hecho, los actuales SETRC incluyen multicores normalmente con diversos niveles de cache. En esta línea, esta Tesis tiene como objetivo mejorar el rendimiento garantizado de los SETRC utilizando técnicas para caches y utilizando métodos como la randomización del tiempo y proveyendo garantías probabilísticas de tiempo de ejecución de las tareas. En esta Tesis, primero nos centramos en mejorar la colocación y el reemplazo de caches para mejorar el rendimiento garantizado. Para la colocación, diferentes políticas son exploradas en un sistema cache multi-nivel, y se llega a una solución donde diversas de estas políticas son combinadas. Para el reemplazo, analizamos un escenario patológico que ninguna política actual tiene en cuenta, y proponemos varias políticas que solucionan este escenario patológico. Para caches compartidas en multicores, observamos que la contención es causada principalmente por escrituras privadas que van a través de la cache compartida, pero usar una política de escritura retardada pura también tiene sus consecuencias. Proponemos un enfoque híbrido para mitigar la contención. Sobre esta solución, la siguiente contribución ataca un problema causado por la necesidad de mecanismos de fiabilidad en SETRC. Implementar fiabilidad cerca del núcleo del procesador tiene un impacto significativo en rendimiento. Una solución basada en anticipación se propone para mitigar el impacto en rendimiento. La siguiente contribución propone el primer prefetcher hardware para SETRC con una jerarquía de caches arbitraria. Por primera vez, se da una definición y taxonomía de anomalías temporales para Análisis Temporal Basado en Medidas. Después, nos centramos en una anomalía temporal concreta que puede pasar con caches y ofrecemos una solución que la tiene en cuenta en las estimaciones del tiempo de ejecución.Postprint (published version

    Data cache organization for accurate timing analysis

    Get PDF

    Algorithmic ramifications of prefetching in memory hierarchy

    Get PDF
    External Memory models, most notable being the I-O Model [3], capture the effects of memory hierarchy and aid in algorithm design. More than a decade of architectural advancements have led to new features not captured in the I-O model - most notably the prefetching capability. We propose a relatively simple Prefetch model that incorporates data prefetching in the traditional I-O models and show how to design algorithms that can attain close to peak memory bandwidth. Unlike (the inverse of) memory latency, the memory bandwidth is much closer to the processing speed, thereby, intelligent use of prefetching can considerably mitigate the I-O bottleneck. For some fundamental problems, our algorithms attain running times approaching that of the idealized Random Access Machines under reasonable assumptions. Our work also explains the significantly superior performance of the I-O efficient algorithms in systems that support prefetching compared to ones that do not
    corecore