140 research outputs found
Improving time predictability of shared hardware resources in real-time multicore systems : emphasis on the space domain
Critical Real-Time Embedded Systems (CRTES) follow a verification and validation process on the timing and functional correctness. This process includes the timing analysis that provides Worst-Case Execution Time (WCET) estimates to provide evidence that the execution time of the system, or parts of it, remain within the deadlines. A key design principle for CRTES is the incremental qualification, whereby each software component can be subject to verification and validation independently of any other component, with obvious benefits for cost. At timing level, this requires time composability, such that the timing behavior of a function is not affected by other functions. CRTES are experiencing an unprecedented growth with rising performance demands that have motivated the use of multicore architectures. Multicores can provide the performance required and bring the potential of integrating several software functions onto the same hardware. However, multicore contention in the access to shared hardware resources creates a dependence of the execution time of a task with the rest of the tasks running simultaneously. This dependence threatens time predictability and jeopardizes time composability. In this thesis we analyze and propose hardware solutions to be applied on current multicore designs for CRTES to improve time predictability and time composability, focusing on the on-chip bus and the memory controller. At hardware level, we propose new bus and memory controller designs that control and mitigate contention between different cores and allow to have time composability by design, also in the context of mixed-criticality systems. At analysis level, we propose contention prediction models that factor the impact of contenders and don¿t need modifications to the hardware. We also propose a set of Performance Monitoring Counters (PMC) that provide evidence about the contention. We give an special emphasis on the Space domain focusing on the Cobham Gaisler NGMP multicore processor, which is currently assessed by the European Space Agency for its future missions.Los Sistemas CrÃticos Empotrados de Tiempo Real (CRTES) siguen un proceso de verificación y validación para su correctitud funcional y temporal. Este proceso incluye el análisis temporal que proporciona estimaciones de el peor caso del tiempo de ejecución (WCET) para dar evidencia de que el tiempo de ejecución del sistema, o partes de él, permanecen dentro de los lÃmites temporales. Un principio de diseño clave para los CRTES es la cualificación incremental, por la que cada componente de software puede ser verificado y validado independientemente del resto de componentes, con beneficios obvios para el coste. A nivel temporal, esto requiere composabilidad temporal, por la que el comportamiento temporal de una función no se ve afectado por otras funciones. CRTES están experimentando un crecimiento sin precedentes con crecientes demandas de rendimiento que han motivado el uso the arquitecturas multi-núcleo (multicore). Los procesadores multi-núcleo pueden proporcionar el rendimiento requerido y tienen el potencial de integrar varias funcionalidades software en el mismo hardware. A pesar de ello, la interferencia entre los diferentes núcleos que aparece en los recursos compartidos de os procesadores multi núcleo crea una dependencia del tiempo de ejecución de una tarea con el resto de tareas ejecutándose simultáneamente en el procesador. Esta dependencia amenaza la predictabilidad temporal y compromete la composabilidad temporal. En esta tésis analizamos y proponemos soluciones hardware para ser aplicadas en los diseños multi núcleo actuales para CRTES que mejoran la predictabilidad y composabilidad temporal, centrándose en el bus y el controlador de memoria internos al chip. A nivel de hardware, proponemos nuevos diseños de buses y controladores de memoria que controlan y mitigan la interferencia entre los diferentes núcleos y permiten tener composabilidad temporal por diseño, también en el contexto de sistemas de criticalidad mixta. A nivel de análisis, proponemos modelos de predicción de la interferencia que factorizan el impacto de los núcleos y no necesitan modificaciones hardware. También proponemos un conjunto de Contadores de Control del Rendimiento (PMC) que proporcionoan evidencia de la interferencia. En esta tésis, damós especial importancia al dominio espacial, centrándonos en el procesador mutli núcleo Cobham Gaisler NGMP, que está siendo actualmente evaluado por la Agencia Espacial Europea para sus futuras misiones
A survey of techniques for reducing interference in real-time applications on multicore platforms
This survey reviews the scientific literature on techniques for reducing interference in real-time multicore systems, focusing on the approaches proposed between 2015 and 2020. It also presents proposals that use interference reduction techniques without considering the predictability issue. The survey highlights interference sources and categorizes proposals from the perspective of the shared resource. It covers techniques for reducing contentions in main memory, cache memory, a memory bus, and the integration of interference effects into schedulability analysis. Every section contains an overview of each proposal and an assessment of its advantages and disadvantages.This work was supported in part by the Comunidad de Madrid Government "Nuevas Técnicas de Desarrollo de Software de Tiempo Real Embarcado Para Plataformas. MPSoC de Próxima Generación" under Grant IND2019/TIC-17261
A Multi-core processor for hard real-time systems
The increasing demand for new functionalities in current and future hard real-time embedded systems, like the ones deployed in automotive and avionics industries, is driving an increment in the performance required in current embedded processors. Multi-core processors represent a good design solution to cope with such higher performance requirements due to their better performance-per-watt ratio while maintaining the core design simple. Moreover, multi-cores also allow executing mixed-criticality level workloads composed of tasks with and without hard real-time requirements, maximizing the utilization of the hardware resources while guaranteeing low cost and low power consumption.
Despite those benefits, current multi-core processors are less analyzable than single-core ones due to the interferences between different tasks when accessing hardware shared resources. As a result, estimating a meaningful Worst-Case Execution Time (WCET) estimation - i.e. to compute an upper bound of the application's execution time - becomes extremely difficult, if not even impossible, because the execution time of a task may change depending on the other threads running at the same time. This makes the WCET of a task dependent on the set of inter-task interferences introduced by the co-running tasks.
Providing a WCET estimation independent from the other tasks (time composability property) is a key requirement in hard real-time systems.
This thesis proposes a new multi-core processor design in which time composability is achieved, hence enabling the use of multi-cores in hard real-time systems. With our proposals the WCET estimation of a HRT is independent from the other co-running tasks. To that end, we design a multi-core processor in which the maximum delay a request from a Hard Real-time Task (HRT), accessing a hardware shared resource can suffer due to other tasks is bounded: our processor guarantees that a request to a shared resource cannot be delayed longer than a given Upper Bound Delay (UBD).
In addition, the UBD allows identifying the impact that different processor configurations may have on the WCET by determining the sensitivity of a HRT to different resource allocations. This thesis proposes an off-line task allocation algorithm (called IA3: Interference-Aware Allocation Algorithm), that allocates tasks in a task set based on the HRT's sensitivity to different resource allocations. As a result the hardware shared resources used by HRTs are minimized, by allowing Non Hard Real-time Tasks (NHRTs) to use the rest of resources. Overall, our proposals provide analyzability for the HRTs allowing NHRTs to be executed into the same chip without any effect on the HRTs.
The previous first two proposals of this thesis focused on supporting the execution of multi-programmed workloads with mixed-criticality levels (composed of HRTs and NHRTs).
Higher performance could be achieved by implementing multi-threaded applications. As a first step towards supporting hard real-time parallel applications, this thesis proposes a new hardware/software approach to guarantee a predictable execution of software pipelined parallel programs.
This thesis also investigates a solution to verify the timing correctness of HRTs without requiring any modification in the core design: we design a hardware unit which is interfaced with the processor and integrated into a functional-safety aware methodology. This unit monitors the execution time of a block of instructions and it detects if it exceeds the WCET. Concretely, we show how to handle timing faults on a real industrial automotive platform.La creciente demanda de nuevas funcionalidades en los sistemas empotrados de tiempo real actuales y futuros en
industrias como la automovilÃstica y la de aviación, está impulsando un incremento en el rendimiento necesario en los
actuales procesadores empotrados. Los procesadores multi-núcleo son una solución eficiente para obtener un mayor
rendimiento ya que aumentan el rendimiento por vatio, manteniendo el diseño del núcleo simple.
Por otra parte, los procesadores multi-núcleo también permiten ejecutar cargas de trabajo con niveles de tiempo real mixtas
(formadas por tareas de tiempo real duro y laxo asà como tareas sin requerimientos de tiempo real), maximizando asà la
utilización de los recursos de procesador y garantizando el bajo consumo de energÃa.
Sin embargo, a pesar los beneficios mencionados anteriormente, los actuales procesadores multi-núcleo son menos
analizables que los de un solo núcleo debido a las interferencias surgidas cuando múltiples tareas acceden
simultáneamente a los recursos compartidos del procesador.
Como resultado, la estimación del peor tiempo de ejecución (conocido como WCET) - es decir, una cota superior del tiempo
de ejecución de la aplicación - se convierte en extremadamente difÃcil, si no imposible, porque el tiempo de ejecución de
una tarea puede cambiar dependiendo de las otras tareas que se estén ejecutando concurrentemente. Determinar una
estimación del WCET independiente de las otras tareas es un requisito clave en los sistemas empotrados de tiempo real
duro. Esta tesis propone un nuevo diseño de procesador multi-núcleo en el que el tiempo de ejecución de las tareas se
puede componer, lo que permitirá el uso de procesadores multi-núcleo en los sistemas de tiempo real duro. Para ello,
diseñamos un procesador multi-núcleo en el que la máxima demora que puede sufrir una petición de una tarea de tiempo
real duro (HRT) para acceder a un recurso hardware compartido debido a otras tareas está acotado, tiene un lÃmite superior
(UBD).
Además, UBD permite identificar el impacto que las diferentes posibles configuraciones del procesador pueden tener en el
WCET, mediante la determinación de la sensibilidad en la variación del tiempo de ejecución de diferentes reservas de
recursos del procesador. Esta tesis propone un algoritmo estático de reserva de recursos (llamado IA3), que asigna tareas
a núcleos en función de dicha sensibilidad. Como resultado los recursos compartidos del procesador usados por tareas
HRT se reducen al mÃnimo, permitiendo que las tareas sin requerimiento de tiempo real (NHRTs) puedas beneficiarse del
resto de recursos.
Por lo tanto, las propuestas presentadas en esta tesis permiten el análisis del WCET para tareas HRT, permitiendo asÃ
mismo la ejecución de tareas NHRTs en el mismo procesador multi-núcleo, sin que estas tengan ningún efecto sobre las
tareas HRT.
Las propuestas presentadas anteriormente se centran en el soporte a la ejecución de múltiples cargas de trabajo con
diferentes niveles de tiempo real (HRT y NHRTs).
Sin embargo, un mayor rendimiento puede lograrse mediante la transformación una tarea en múltiples sub-tareas
paralelas. Esta tesis propone una nueva técnica, con soporte del procesador y del sistema operativo, que garantiza una
ejecución analizable del modelo de ejecución paralela software pipelining.
Esta tesis también investiga una solución para verificar la corrección del WCET de HRT sin necesidad de ninguna
modificación en el diseño de la base: un nuevo componente externo al procesador se conecta a este sin necesidad de
modificarlo. Esta nueva unidad monitorea el tiempo de ejecución de un bloque de instrucciones y detecta si se excede el
WCET. Esta unidad permite detectar fallos de sincronización en sistemas de computación utilizados en automóviles
Adapting TDMA arbitration for measurement-based probabilistic timing analysis
Critical Real-Time Embedded Systems require functional and timing validation to prove that they will perform their functionalities correctly and in time. For timing validation, a bound to the Worst-Case Execution Time (WCET) for each task is derived and passed as an input to the scheduling algorithm to ensure that tasks execute timely. Bounds to WCET can be derived with deterministic timing analysis (DTA) and probabilistic timing analysis (PTA), each of which relies upon certain predictability properties coming from the hardware/software platform beneath. In particular, specific hardware designs are needed for both DTA and PTA, which challenges their adoption by hardware vendors.
This paper makes a step towards reconciling the hardware needs of DTA and PTA timing analyses to increase the likelihood of those hardware designs to be adopted by hardware vendors. In particular, we show how Time Division Multiple Access (TDMA), which has been regarded as one of the main DTA-compliant arbitration policies, can be used in the context of PTA and, in particular, of the industrially-friendly Measurement-Based PTA (MBPTA). We show how the execution time measurements taken as input for MBPTA need to be padded to obtain reliable and tight WCET estimates on top of TDMA-arbitrated hardware resources with no further hardware support. Our results show that TDMA delivers tighter WCET estimates than MBPTA-friendly arbitration policies, whereas MBPTA-friendly policies provide higher average performance. Thus, the best policy to choose depends on the particular needs of the end user.The research leading to these results has been funded by the EU FP7 under grant agreement no. 611085 (PROXIMA)
and 287519 (parMERASA). This work has also been partially supported by the Spanish Ministry of Economy and Competitiveness
(MINECO) under grant TIN2015-65316-P and
the HiPEAC Network of Excellence. Miloˇs Pani´c is funded by the Spanish Ministry of Education under the FPU grant FPU12/05966. Jaume Abella has been partially supported by
the MINECO under Ramon y Cajal postdoctoral fellowship number RYC-2013-14717.Peer ReviewedPostprint (author's final draft
Fixed-Priority Memory-Centric Scheduler for COTS-Based Multiprocessors
Memory-centric scheduling attempts to guarantee temporal predictability on commercial-off-the-shelf (COTS) multiprocessor systems to exploit their high performance for real-time applications. Several solutions proposed in the real-time literature have hardware requirements that are not easily satisfied by modern COTS platforms, like hardware support for strict memory partitioning or the presence of scratchpads. However, even without said hardware support, it is possible to design an efficient memory-centric scheduler.
In this article, we design, implement, and analyze a memory-centric scheduler for deterministic memory management on COTS multiprocessor platforms without any hardware support. Our approach uses fixed-priority scheduling and proposes a global "memory preemption" scheme to boost real-time schedulability. The proposed scheduling protocol is implemented in the Jailhouse hypervisor and Erika real-time kernel. Measurements of the scheduler overhead demonstrate the applicability of the proposed approach, and schedulability experiments show a 20% gain in terms of schedulability when compared to contention-based and static fair-share approaches
Measuring and Controlling Multicore Contention in a RISC-V System-on-Chip
[ES] Los procesadores multinúcleo empezaron una revolución en el cómputo moderno cuando fueron introducidos en el espacio de cómputo comercial y de consumidor. Estos procesadores multinúcleo presentaban un aumento significativo en consumo, eficiencia y rendimiento en un periodo de tiempo en el aumento de la frecuencia y el IPC del procesador parecÃa estar tocando techo. Sin embargo, en sistemas crÃticos, la introducción de los procesadores multinúcleo ha traÃdo a la luz diferentes dificultades en el proceso de certificación. La principal área que dificulta la caracterización de los sistemas multicore en tiempo real es el uso de recursos compartidos, en especÃfico, los buses compartidos.
En este trabajo proveeremos las herramientas necesarias para facilitar la caracterización de sistemas que hacen uso de buses compartidos en sistemas de criticidad mixta. En especÃfico, combinamos las polÃticas desarrolladas para sistemas con buses con polÃticas de limitación de ancho de banda basadas en interferencia causada al núcleo principal. Con esta combinación de polÃticas podemos limitar el WCET de la tarea crÃtica en el sistema multinúcleo mientras que proveemos un "best effort" para permitir el progreso en los núcleos secundarios.[CAT] Els processadors multinucli van començar una revolució en el còmput modern quan
van ser introduïts en l’espai de còmput comercial i de consumidor. Aquests processadors
multinucli presentaven un augment significatiu en consum, eficiència i rendiment en un
perÃode de temps en l’augment de la freqüència i l’IPC de l’processador semblava estar
tocant sostre. No obstant això, en sistemes crÃtics, la introducció dels processadors multi-
nucli ha portat a la llum diferents dificultats en el procés de certificació. La principal à rea
que dificulta la caracterització dels sistemes multinucli en temps real és l’ús de recursos
compartits, en especÃfic, els busos compartits.
En aquest treball proveirem les eines necessà ries per facilitar la caracterització de sis-
temes que fan ús de busos compartits en sistemes de criticitat mixta. En especÃfic, combi-
nem les polÃtiques desenvolupades per a sistemes amb busos amb polÃtiques de limitació
d’ample de banda basades en interferència causada a el nucli principal. Amb aquesta
combinació de polÃtiques podem limitar l’WCET de la tasca crÃtica en el sistema multinu-
cli mentre que proveïm un "best effort"per permetre el progrés en els nuclis secundaris.[EN] Multicore processors were a revolution when introduced into the commercial computing space, they presented great power efficiency and performance in a time where clock speeds and instruction level parallelism were plateauing. But, on safety critical systems, the introduction of multi-core processors has brought serious difficulties to the certification process. The main trouble spot for multicore characterization is the usage of shared resources, in specific, shared buses.
In this work, we provide tools to ease the characterization of shared bus mechanisms timing interference on critical and mixed criticality systems. In particular, we combine shared bus arbitration policies with rate limiting policies based on critical workload interference to bound the WCET of a critical workload on a multi-core system while doing a best effort to let secondary cores progress as much as possible.Andreu Cerezo, P. (2021). Measuring and Controlling Multicore Contention in a RISC-V System-on-Chip. Universitat Politècnica de València. http://hdl.handle.net/10251/173563TFG
Sharing the instruction cache among lean cores on an asymmetric CMP for HPC applications
High performance computing (HPC) applications have parallel code sections that must scale to large numbers of cores, which makes them sensitive to serial regions. Current supercomputing systems with heterogeneous or asymmetric CMPs (ACMP) combine few high-performance big cores for serial regions, together with many low-power lean cores for throughput computing. The low requirements of HPC applications in the core front-end lead some designs, such as SMT and GPU cores, to share front-end structures including the instruction cache (I-cache). However, little work exists to analyze the benefit of sharing the I-cache among full cores, which seems compelling as a solution to reduce silicon area and power. This paper analyzes the performance, power and area impact of such a design on an ACMP with one high-performance core and multiple low-power cores. Having identified that multiple cores run the same code during parallel regions, the lean cores share the I-cache with the intent of benefiting from mutual prefetching, without increasing the average access latency. Our exploration of the multiple parameters finds the sweet spot on a wide interconnect to access the shared I-cache and the inclusion of a few line buffers to provide the required bandwidth and latency to sustain performance. The projections with McPAT and a rich set of HPC benchmarks show 11% area savings with a 5% energy reduction at no performance cost.The research was supported by European Unions 7th Framework Programme [FP7/2007-2013] under project Mont-Blanc
(288777), the Ministry of Economy and Competitiveness of Spain (TIN2012-34557, TIN2015-65316-P, and BES-2013-063925), Generalitat de Catalunya (2014-SGR-1051 and 2014-SGR-1272), HiPEAC-3 Network of Excellence (ICT-287759), and finally the Severo Ochoa Program (SEV-2011-00067) of
the Spanish Government.Peer ReviewedPostprint (author's final draft
A High-performance, Energy-efficient Modular DMA Engine Architecture
Data transfers are essential in today's computing systems as latency and
complex memory access patterns are increasingly challenging to manage. Direct
memory access engines (DMAEs) are critically needed to transfer data
independently of the processing elements, hiding latency and achieving high
throughput even for complex access patterns to high-latency memory. With the
prevalence of heterogeneous systems, DMAEs must operate efficiently in
increasingly diverse environments. This work proposes a modular and highly
configurable open-source DMAE architecture called intelligent DMA (iDMA), split
into three parts that can be composed and customized independently. The
front-end implements the control plane binding to the surrounding system. The
mid-end accelerates complex data transfer patterns such as multi-dimensional
transfers, scattering, or gathering. The back-end interfaces with the on-chip
communication fabric (data plane). We assess the efficiency of iDMA in various
instantiations: In high-performance systems, we achieve speedups of up to 15.8x
with only 1 % additional area compared to a base system without a DMAE. We
achieve an area reduction of 10 % while improving ML inference performance by
23 % in ultra-low-energy edge AI systems over an existing DMAE solution. We
provide area, timing, latency, and performance characterization to guide its
instantiation in various systems.Comment: 14 pages, 14 figures, accepted by an IEEE journal for publicatio
Timing Predictable and High-Performance Hardware Cache Coherence Mechanisms for Real-Time Multi-Core Platforms
Multi-core platforms are becoming primary compute platforms for real-time systems such as avionics and autonomous vehicles. This adoption is primarily driven by the increasing application demands deployed in real-time systems, and the cost and performance benefits of multi-core platforms. For real-time applications, satisfying safety properties in the form of timing predictability, is the paramount consideration. Providing such guarantees on safety properties requires applying some timing analysis on the application executing on the compute platform. The timing analysis computes an upper bound on the application’s execution time on the compute platform, which is referred to as the worst-case execution time (WCET).
However, multi-core platforms pose challenges that complicate the timing analysis. Among these challenges are timing challenges caused due to simultaneous accesses from multiple cores to shared hardware resources such as shared caches, interconnects, and off-chip memories. Supporting timing predictable shared data communication between real-time applications further compounds this challenge as a core’s access to shared data is dependent on the simultaneous memory activity from other cores on the shared data. Although hardware cache coherence mechanisms are the primary high-performance data communication mechanisms in current multi-core platforms, there has been very little use of these mechanisms to support timing predictable shared data communication in real-time multi-core platforms. Rather, current state-of-the-art approaches to timing predictable shared data communication sidestep hardware cache coherence. These approaches enforce memory and execution constraints on the shared data to simplify the timing analysis at the expense of application performance.
This thesis makes the case for timing predictable hardware cache coherence mechanisms as viable shared data communication mechanisms for real-time multi-core platforms. A key takeaway from the contributions in this thesis is that timing predictable hardware cache coherence mechanisms offer significant application performance over prior state-of-the-art data communication approaches while guaranteeing timing predictability.
This thesis has three main contributions.
First, this thesis shows how a hardware cache coherence mechanism can be designed to be timing predictable by defining design invariants that guarantee timing predictability. We apply these design invariants and design timing predictable variants of existing conventional cache coherence mechanisms. Evaluation of these timing predictable cache coherence mechanisms show that they provide significant application performance over state-of-the-art approaches while delivering timing predictability.
Second, we observe that the large worst-case memory access latency under timing predictable hardware cache coherence mechanisms questions their applicability as a data communication mechanism in real-time multi-core platforms. To this end, we present a systematic framework to design better timing predictable cache coherence mechanisms that balance high application performance and low worst-case memory access latency. Our systematic framework concisely captures the design features of timing predictable cache coherence mechanisms that impacts their WCET, and identifies a spectrum of approaches to reduce the worst-case memory access latency. We describe one approach and show that this approach reduces the worst-case memory access latency of timing predictable cache coherence mechanisms to be the same as alternative approaches while trading away minimal performance in the original cache coherence mechanisms.
Third, we design a timing predictable hardware cache coherence mechanism for multi-core platforms used in mixed-critical real-time systems (MCS). Applications in MCS have varying performance and timing predictability requirements. We design a timing predictable cache coherence mechanism that considers these differing requirements and ensures that applications with no timing predictability requirements do not impact applications with strict predictability requirements
PASoC: A Predictable Accelerator Rich SoC for Safety-Critical Systems
This thesis presents a model of a Predictable Accelerator-rich System-on-Chip (PASoC)
for safety-critical systems, which guarantees timing predictability of a memory access in
the system. Earlier adoption of accelerator-rich SoCs was for general-purpose comput ing and thus timing predictability of such systems was not well explored, despite being
used in safety-critical systems. This thesis takes initial steps in exploring the predictabil ity of ASoCs by combining CPU clusters with one or more hardware accelerators. The
PASoC allows the integration of multiple coherent agents to interact with each other over
a shared memory bus and a shared LLC. These agents can be a cluster of cache-coherent
homogeneous cores, and fully or one-way coherent hardware accelerators. PASoC ensures
the predictability of a memory request through some modifications in hardware architecture
and cache coherence protocols. PASoC supports predictable cache coherence within the
cluster of cores and across agents. The former uses linear cache coherence, and the latter
uses a modified version of predictable Modified Shared Invalid (MSI) cache coherence pro tocol. PASoC analyzes the per-request worst-case latency of a memory request from any
of the agents and evaluates the design on the gem5 simulator. Finally, this work presents
some observations based on the analysis that can help in future designs of PASoCs
- …