140 research outputs found

    Improving time predictability of shared hardware resources in real-time multicore systems : emphasis on the space domain

    Get PDF
    Critical Real-Time Embedded Systems (CRTES) follow a verification and validation process on the timing and functional correctness. This process includes the timing analysis that provides Worst-Case Execution Time (WCET) estimates to provide evidence that the execution time of the system, or parts of it, remain within the deadlines. A key design principle for CRTES is the incremental qualification, whereby each software component can be subject to verification and validation independently of any other component, with obvious benefits for cost. At timing level, this requires time composability, such that the timing behavior of a function is not affected by other functions. CRTES are experiencing an unprecedented growth with rising performance demands that have motivated the use of multicore architectures. Multicores can provide the performance required and bring the potential of integrating several software functions onto the same hardware. However, multicore contention in the access to shared hardware resources creates a dependence of the execution time of a task with the rest of the tasks running simultaneously. This dependence threatens time predictability and jeopardizes time composability. In this thesis we analyze and propose hardware solutions to be applied on current multicore designs for CRTES to improve time predictability and time composability, focusing on the on-chip bus and the memory controller. At hardware level, we propose new bus and memory controller designs that control and mitigate contention between different cores and allow to have time composability by design, also in the context of mixed-criticality systems. At analysis level, we propose contention prediction models that factor the impact of contenders and don¿t need modifications to the hardware. We also propose a set of Performance Monitoring Counters (PMC) that provide evidence about the contention. We give an special emphasis on the Space domain focusing on the Cobham Gaisler NGMP multicore processor, which is currently assessed by the European Space Agency for its future missions.Los Sistemas Críticos Empotrados de Tiempo Real (CRTES) siguen un proceso de verificación y validación para su correctitud funcional y temporal. Este proceso incluye el análisis temporal que proporciona estimaciones de el peor caso del tiempo de ejecución (WCET) para dar evidencia de que el tiempo de ejecución del sistema, o partes de él, permanecen dentro de los límites temporales. Un principio de diseño clave para los CRTES es la cualificación incremental, por la que cada componente de software puede ser verificado y validado independientemente del resto de componentes, con beneficios obvios para el coste. A nivel temporal, esto requiere composabilidad temporal, por la que el comportamiento temporal de una función no se ve afectado por otras funciones. CRTES están experimentando un crecimiento sin precedentes con crecientes demandas de rendimiento que han motivado el uso the arquitecturas multi-núcleo (multicore). Los procesadores multi-núcleo pueden proporcionar el rendimiento requerido y tienen el potencial de integrar varias funcionalidades software en el mismo hardware. A pesar de ello, la interferencia entre los diferentes núcleos que aparece en los recursos compartidos de os procesadores multi núcleo crea una dependencia del tiempo de ejecución de una tarea con el resto de tareas ejecutándose simultáneamente en el procesador. Esta dependencia amenaza la predictabilidad temporal y compromete la composabilidad temporal. En esta tésis analizamos y proponemos soluciones hardware para ser aplicadas en los diseños multi núcleo actuales para CRTES que mejoran la predictabilidad y composabilidad temporal, centrándose en el bus y el controlador de memoria internos al chip. A nivel de hardware, proponemos nuevos diseños de buses y controladores de memoria que controlan y mitigan la interferencia entre los diferentes núcleos y permiten tener composabilidad temporal por diseño, también en el contexto de sistemas de criticalidad mixta. A nivel de análisis, proponemos modelos de predicción de la interferencia que factorizan el impacto de los núcleos y no necesitan modificaciones hardware. También proponemos un conjunto de Contadores de Control del Rendimiento (PMC) que proporcionoan evidencia de la interferencia. En esta tésis, damós especial importancia al dominio espacial, centrándonos en el procesador mutli núcleo Cobham Gaisler NGMP, que está siendo actualmente evaluado por la Agencia Espacial Europea para sus futuras misiones

    A survey of techniques for reducing interference in real-time applications on multicore platforms

    Get PDF
    This survey reviews the scientific literature on techniques for reducing interference in real-time multicore systems, focusing on the approaches proposed between 2015 and 2020. It also presents proposals that use interference reduction techniques without considering the predictability issue. The survey highlights interference sources and categorizes proposals from the perspective of the shared resource. It covers techniques for reducing contentions in main memory, cache memory, a memory bus, and the integration of interference effects into schedulability analysis. Every section contains an overview of each proposal and an assessment of its advantages and disadvantages.This work was supported in part by the Comunidad de Madrid Government "Nuevas Técnicas de Desarrollo de Software de Tiempo Real Embarcado Para Plataformas. MPSoC de Próxima Generación" under Grant IND2019/TIC-17261

    A Multi-core processor for hard real-time systems

    Get PDF
    The increasing demand for new functionalities in current and future hard real-time embedded systems, like the ones deployed in automotive and avionics industries, is driving an increment in the performance required in current embedded processors. Multi-core processors represent a good design solution to cope with such higher performance requirements due to their better performance-per-watt ratio while maintaining the core design simple. Moreover, multi-cores also allow executing mixed-criticality level workloads composed of tasks with and without hard real-time requirements, maximizing the utilization of the hardware resources while guaranteeing low cost and low power consumption. Despite those benefits, current multi-core processors are less analyzable than single-core ones due to the interferences between different tasks when accessing hardware shared resources. As a result, estimating a meaningful Worst-Case Execution Time (WCET) estimation - i.e. to compute an upper bound of the application's execution time - becomes extremely difficult, if not even impossible, because the execution time of a task may change depending on the other threads running at the same time. This makes the WCET of a task dependent on the set of inter-task interferences introduced by the co-running tasks. Providing a WCET estimation independent from the other tasks (time composability property) is a key requirement in hard real-time systems. This thesis proposes a new multi-core processor design in which time composability is achieved, hence enabling the use of multi-cores in hard real-time systems. With our proposals the WCET estimation of a HRT is independent from the other co-running tasks. To that end, we design a multi-core processor in which the maximum delay a request from a Hard Real-time Task (HRT), accessing a hardware shared resource can suffer due to other tasks is bounded: our processor guarantees that a request to a shared resource cannot be delayed longer than a given Upper Bound Delay (UBD). In addition, the UBD allows identifying the impact that different processor configurations may have on the WCET by determining the sensitivity of a HRT to different resource allocations. This thesis proposes an off-line task allocation algorithm (called IA3: Interference-Aware Allocation Algorithm), that allocates tasks in a task set based on the HRT's sensitivity to different resource allocations. As a result the hardware shared resources used by HRTs are minimized, by allowing Non Hard Real-time Tasks (NHRTs) to use the rest of resources. Overall, our proposals provide analyzability for the HRTs allowing NHRTs to be executed into the same chip without any effect on the HRTs. The previous first two proposals of this thesis focused on supporting the execution of multi-programmed workloads with mixed-criticality levels (composed of HRTs and NHRTs). Higher performance could be achieved by implementing multi-threaded applications. As a first step towards supporting hard real-time parallel applications, this thesis proposes a new hardware/software approach to guarantee a predictable execution of software pipelined parallel programs. This thesis also investigates a solution to verify the timing correctness of HRTs without requiring any modification in the core design: we design a hardware unit which is interfaced with the processor and integrated into a functional-safety aware methodology. This unit monitors the execution time of a block of instructions and it detects if it exceeds the WCET. Concretely, we show how to handle timing faults on a real industrial automotive platform.La creciente demanda de nuevas funcionalidades en los sistemas empotrados de tiempo real actuales y futuros en industrias como la automovilística y la de aviación, está impulsando un incremento en el rendimiento necesario en los actuales procesadores empotrados. Los procesadores multi-núcleo son una solución eficiente para obtener un mayor rendimiento ya que aumentan el rendimiento por vatio, manteniendo el diseño del núcleo simple. Por otra parte, los procesadores multi-núcleo también permiten ejecutar cargas de trabajo con niveles de tiempo real mixtas (formadas por tareas de tiempo real duro y laxo así como tareas sin requerimientos de tiempo real), maximizando así la utilización de los recursos de procesador y garantizando el bajo consumo de energía. Sin embargo, a pesar los beneficios mencionados anteriormente, los actuales procesadores multi-núcleo son menos analizables que los de un solo núcleo debido a las interferencias surgidas cuando múltiples tareas acceden simultáneamente a los recursos compartidos del procesador. Como resultado, la estimación del peor tiempo de ejecución (conocido como WCET) - es decir, una cota superior del tiempo de ejecución de la aplicación - se convierte en extremadamente difícil, si no imposible, porque el tiempo de ejecución de una tarea puede cambiar dependiendo de las otras tareas que se estén ejecutando concurrentemente. Determinar una estimación del WCET independiente de las otras tareas es un requisito clave en los sistemas empotrados de tiempo real duro. Esta tesis propone un nuevo diseño de procesador multi-núcleo en el que el tiempo de ejecución de las tareas se puede componer, lo que permitirá el uso de procesadores multi-núcleo en los sistemas de tiempo real duro. Para ello, diseñamos un procesador multi-núcleo en el que la máxima demora que puede sufrir una petición de una tarea de tiempo real duro (HRT) para acceder a un recurso hardware compartido debido a otras tareas está acotado, tiene un límite superior (UBD). Además, UBD permite identificar el impacto que las diferentes posibles configuraciones del procesador pueden tener en el WCET, mediante la determinación de la sensibilidad en la variación del tiempo de ejecución de diferentes reservas de recursos del procesador. Esta tesis propone un algoritmo estático de reserva de recursos (llamado IA3), que asigna tareas a núcleos en función de dicha sensibilidad. Como resultado los recursos compartidos del procesador usados por tareas HRT se reducen al mínimo, permitiendo que las tareas sin requerimiento de tiempo real (NHRTs) puedas beneficiarse del resto de recursos. Por lo tanto, las propuestas presentadas en esta tesis permiten el análisis del WCET para tareas HRT, permitiendo así mismo la ejecución de tareas NHRTs en el mismo procesador multi-núcleo, sin que estas tengan ningún efecto sobre las tareas HRT. Las propuestas presentadas anteriormente se centran en el soporte a la ejecución de múltiples cargas de trabajo con diferentes niveles de tiempo real (HRT y NHRTs). Sin embargo, un mayor rendimiento puede lograrse mediante la transformación una tarea en múltiples sub-tareas paralelas. Esta tesis propone una nueva técnica, con soporte del procesador y del sistema operativo, que garantiza una ejecución analizable del modelo de ejecución paralela software pipelining. Esta tesis también investiga una solución para verificar la corrección del WCET de HRT sin necesidad de ninguna modificación en el diseño de la base: un nuevo componente externo al procesador se conecta a este sin necesidad de modificarlo. Esta nueva unidad monitorea el tiempo de ejecución de un bloque de instrucciones y detecta si se excede el WCET. Esta unidad permite detectar fallos de sincronización en sistemas de computación utilizados en automóviles

    Adapting TDMA arbitration for measurement-based probabilistic timing analysis

    Get PDF
    Critical Real-Time Embedded Systems require functional and timing validation to prove that they will perform their functionalities correctly and in time. For timing validation, a bound to the Worst-Case Execution Time (WCET) for each task is derived and passed as an input to the scheduling algorithm to ensure that tasks execute timely. Bounds to WCET can be derived with deterministic timing analysis (DTA) and probabilistic timing analysis (PTA), each of which relies upon certain predictability properties coming from the hardware/software platform beneath. In particular, specific hardware designs are needed for both DTA and PTA, which challenges their adoption by hardware vendors. This paper makes a step towards reconciling the hardware needs of DTA and PTA timing analyses to increase the likelihood of those hardware designs to be adopted by hardware vendors. In particular, we show how Time Division Multiple Access (TDMA), which has been regarded as one of the main DTA-compliant arbitration policies, can be used in the context of PTA and, in particular, of the industrially-friendly Measurement-Based PTA (MBPTA). We show how the execution time measurements taken as input for MBPTA need to be padded to obtain reliable and tight WCET estimates on top of TDMA-arbitrated hardware resources with no further hardware support. Our results show that TDMA delivers tighter WCET estimates than MBPTA-friendly arbitration policies, whereas MBPTA-friendly policies provide higher average performance. Thus, the best policy to choose depends on the particular needs of the end user.The research leading to these results has been funded by the EU FP7 under grant agreement no. 611085 (PROXIMA) and 287519 (parMERASA). This work has also been partially supported by the Spanish Ministry of Economy and Competitiveness (MINECO) under grant TIN2015-65316-P and the HiPEAC Network of Excellence. Miloˇs Pani´c is funded by the Spanish Ministry of Education under the FPU grant FPU12/05966. Jaume Abella has been partially supported by the MINECO under Ramon y Cajal postdoctoral fellowship number RYC-2013-14717.Peer ReviewedPostprint (author's final draft

    Fixed-Priority Memory-Centric Scheduler for COTS-Based Multiprocessors

    Get PDF
    Memory-centric scheduling attempts to guarantee temporal predictability on commercial-off-the-shelf (COTS) multiprocessor systems to exploit their high performance for real-time applications. Several solutions proposed in the real-time literature have hardware requirements that are not easily satisfied by modern COTS platforms, like hardware support for strict memory partitioning or the presence of scratchpads. However, even without said hardware support, it is possible to design an efficient memory-centric scheduler. In this article, we design, implement, and analyze a memory-centric scheduler for deterministic memory management on COTS multiprocessor platforms without any hardware support. Our approach uses fixed-priority scheduling and proposes a global "memory preemption" scheme to boost real-time schedulability. The proposed scheduling protocol is implemented in the Jailhouse hypervisor and Erika real-time kernel. Measurements of the scheduler overhead demonstrate the applicability of the proposed approach, and schedulability experiments show a 20% gain in terms of schedulability when compared to contention-based and static fair-share approaches

    Measuring and Controlling Multicore Contention in a RISC-V System-on-Chip

    Get PDF
    [ES] Los procesadores multinúcleo empezaron una revolución en el cómputo moderno cuando fueron introducidos en el espacio de cómputo comercial y de consumidor. Estos procesadores multinúcleo presentaban un aumento significativo en consumo, eficiencia y rendimiento en un periodo de tiempo en el aumento de la frecuencia y el IPC del procesador parecía estar tocando techo. Sin embargo, en sistemas críticos, la introducción de los procesadores multinúcleo ha traído a la luz diferentes dificultades en el proceso de certificación. La principal área que dificulta la caracterización de los sistemas multicore en tiempo real es el uso de recursos compartidos, en específico, los buses compartidos. En este trabajo proveeremos las herramientas necesarias para facilitar la caracterización de sistemas que hacen uso de buses compartidos en sistemas de criticidad mixta. En específico, combinamos las políticas desarrolladas para sistemas con buses con políticas de limitación de ancho de banda basadas en interferencia causada al núcleo principal. Con esta combinación de políticas podemos limitar el WCET de la tarea crítica en el sistema multinúcleo mientras que proveemos un "best effort" para permitir el progreso en los núcleos secundarios.[CAT] Els processadors multinucli van començar una revolució en el còmput modern quan van ser introduïts en l’espai de còmput comercial i de consumidor. Aquests processadors multinucli presentaven un augment significatiu en consum, eficiència i rendiment en un període de temps en l’augment de la freqüència i l’IPC de l’processador semblava estar tocant sostre. No obstant això, en sistemes crítics, la introducció dels processadors multi- nucli ha portat a la llum diferents dificultats en el procés de certificació. La principal àrea que dificulta la caracterització dels sistemes multinucli en temps real és l’ús de recursos compartits, en específic, els busos compartits. En aquest treball proveirem les eines necessàries per facilitar la caracterització de sis- temes que fan ús de busos compartits en sistemes de criticitat mixta. En específic, combi- nem les polítiques desenvolupades per a sistemes amb busos amb polítiques de limitació d’ample de banda basades en interferència causada a el nucli principal. Amb aquesta combinació de polítiques podem limitar l’WCET de la tasca crítica en el sistema multinu- cli mentre que proveïm un "best effort"per permetre el progrés en els nuclis secundaris.[EN] Multicore processors were a revolution when introduced into the commercial computing space, they presented great power efficiency and performance in a time where clock speeds and instruction level parallelism were plateauing. But, on safety critical systems, the introduction of multi-core processors has brought serious difficulties to the certification process. The main trouble spot for multicore characterization is the usage of shared resources, in specific, shared buses. In this work, we provide tools to ease the characterization of shared bus mechanisms timing interference on critical and mixed criticality systems. In particular, we combine shared bus arbitration policies with rate limiting policies based on critical workload interference to bound the WCET of a critical workload on a multi-core system while doing a best effort to let secondary cores progress as much as possible.Andreu Cerezo, P. (2021). Measuring and Controlling Multicore Contention in a RISC-V System-on-Chip. Universitat Politècnica de València. http://hdl.handle.net/10251/173563TFG

    Sharing the instruction cache among lean cores on an asymmetric CMP for HPC applications

    Get PDF
    High performance computing (HPC) applications have parallel code sections that must scale to large numbers of cores, which makes them sensitive to serial regions. Current supercomputing systems with heterogeneous or asymmetric CMPs (ACMP) combine few high-performance big cores for serial regions, together with many low-power lean cores for throughput computing. The low requirements of HPC applications in the core front-end lead some designs, such as SMT and GPU cores, to share front-end structures including the instruction cache (I-cache). However, little work exists to analyze the benefit of sharing the I-cache among full cores, which seems compelling as a solution to reduce silicon area and power. This paper analyzes the performance, power and area impact of such a design on an ACMP with one high-performance core and multiple low-power cores. Having identified that multiple cores run the same code during parallel regions, the lean cores share the I-cache with the intent of benefiting from mutual prefetching, without increasing the average access latency. Our exploration of the multiple parameters finds the sweet spot on a wide interconnect to access the shared I-cache and the inclusion of a few line buffers to provide the required bandwidth and latency to sustain performance. The projections with McPAT and a rich set of HPC benchmarks show 11% area savings with a 5% energy reduction at no performance cost.The research was supported by European Unions 7th Framework Programme [FP7/2007-2013] under project Mont-Blanc (288777), the Ministry of Economy and Competitiveness of Spain (TIN2012-34557, TIN2015-65316-P, and BES-2013-063925), Generalitat de Catalunya (2014-SGR-1051 and 2014-SGR-1272), HiPEAC-3 Network of Excellence (ICT-287759), and finally the Severo Ochoa Program (SEV-2011-00067) of the Spanish Government.Peer ReviewedPostprint (author's final draft

    A High-performance, Energy-efficient Modular DMA Engine Architecture

    Full text link
    Data transfers are essential in today's computing systems as latency and complex memory access patterns are increasingly challenging to manage. Direct memory access engines (DMAEs) are critically needed to transfer data independently of the processing elements, hiding latency and achieving high throughput even for complex access patterns to high-latency memory. With the prevalence of heterogeneous systems, DMAEs must operate efficiently in increasingly diverse environments. This work proposes a modular and highly configurable open-source DMAE architecture called intelligent DMA (iDMA), split into three parts that can be composed and customized independently. The front-end implements the control plane binding to the surrounding system. The mid-end accelerates complex data transfer patterns such as multi-dimensional transfers, scattering, or gathering. The back-end interfaces with the on-chip communication fabric (data plane). We assess the efficiency of iDMA in various instantiations: In high-performance systems, we achieve speedups of up to 15.8x with only 1 % additional area compared to a base system without a DMAE. We achieve an area reduction of 10 % while improving ML inference performance by 23 % in ultra-low-energy edge AI systems over an existing DMAE solution. We provide area, timing, latency, and performance characterization to guide its instantiation in various systems.Comment: 14 pages, 14 figures, accepted by an IEEE journal for publicatio

    Timing Predictable and High-Performance Hardware Cache Coherence Mechanisms for Real-Time Multi-Core Platforms

    Get PDF
    Multi-core platforms are becoming primary compute platforms for real-time systems such as avionics and autonomous vehicles. This adoption is primarily driven by the increasing application demands deployed in real-time systems, and the cost and performance benefits of multi-core platforms. For real-time applications, satisfying safety properties in the form of timing predictability, is the paramount consideration. Providing such guarantees on safety properties requires applying some timing analysis on the application executing on the compute platform. The timing analysis computes an upper bound on the application’s execution time on the compute platform, which is referred to as the worst-case execution time (WCET). However, multi-core platforms pose challenges that complicate the timing analysis. Among these challenges are timing challenges caused due to simultaneous accesses from multiple cores to shared hardware resources such as shared caches, interconnects, and off-chip memories. Supporting timing predictable shared data communication between real-time applications further compounds this challenge as a core’s access to shared data is dependent on the simultaneous memory activity from other cores on the shared data. Although hardware cache coherence mechanisms are the primary high-performance data communication mechanisms in current multi-core platforms, there has been very little use of these mechanisms to support timing predictable shared data communication in real-time multi-core platforms. Rather, current state-of-the-art approaches to timing predictable shared data communication sidestep hardware cache coherence. These approaches enforce memory and execution constraints on the shared data to simplify the timing analysis at the expense of application performance. This thesis makes the case for timing predictable hardware cache coherence mechanisms as viable shared data communication mechanisms for real-time multi-core platforms. A key takeaway from the contributions in this thesis is that timing predictable hardware cache coherence mechanisms offer significant application performance over prior state-of-the-art data communication approaches while guaranteeing timing predictability. This thesis has three main contributions. First, this thesis shows how a hardware cache coherence mechanism can be designed to be timing predictable by defining design invariants that guarantee timing predictability. We apply these design invariants and design timing predictable variants of existing conventional cache coherence mechanisms. Evaluation of these timing predictable cache coherence mechanisms show that they provide significant application performance over state-of-the-art approaches while delivering timing predictability. Second, we observe that the large worst-case memory access latency under timing predictable hardware cache coherence mechanisms questions their applicability as a data communication mechanism in real-time multi-core platforms. To this end, we present a systematic framework to design better timing predictable cache coherence mechanisms that balance high application performance and low worst-case memory access latency. Our systematic framework concisely captures the design features of timing predictable cache coherence mechanisms that impacts their WCET, and identifies a spectrum of approaches to reduce the worst-case memory access latency. We describe one approach and show that this approach reduces the worst-case memory access latency of timing predictable cache coherence mechanisms to be the same as alternative approaches while trading away minimal performance in the original cache coherence mechanisms. Third, we design a timing predictable hardware cache coherence mechanism for multi-core platforms used in mixed-critical real-time systems (MCS). Applications in MCS have varying performance and timing predictability requirements. We design a timing predictable cache coherence mechanism that considers these differing requirements and ensures that applications with no timing predictability requirements do not impact applications with strict predictability requirements

    PASoC: A Predictable Accelerator Rich SoC for Safety-Critical Systems

    Get PDF
    This thesis presents a model of a Predictable Accelerator-rich System-on-Chip (PASoC) for safety-critical systems, which guarantees timing predictability of a memory access in the system. Earlier adoption of accelerator-rich SoCs was for general-purpose comput ing and thus timing predictability of such systems was not well explored, despite being used in safety-critical systems. This thesis takes initial steps in exploring the predictabil ity of ASoCs by combining CPU clusters with one or more hardware accelerators. The PASoC allows the integration of multiple coherent agents to interact with each other over a shared memory bus and a shared LLC. These agents can be a cluster of cache-coherent homogeneous cores, and fully or one-way coherent hardware accelerators. PASoC ensures the predictability of a memory request through some modifications in hardware architecture and cache coherence protocols. PASoC supports predictable cache coherence within the cluster of cores and across agents. The former uses linear cache coherence, and the latter uses a modified version of predictable Modified Shared Invalid (MSI) cache coherence pro tocol. PASoC analyzes the per-request worst-case latency of a memory request from any of the agents and evaluates the design on the gem5 simulator. Finally, this work presents some observations based on the analysis that can help in future designs of PASoCs
    • …
    corecore