92 research outputs found

    WCET-aware prefetching of unlocked instruction caches: a technique for reconciling real-time guarantees and energy efficiency

    Get PDF
    Tese (doutorado) - Universidade Federal de Santa Catarina, Centro Tecnológico, Programa de Pós-Graduação em Engenharia de Automação e Sistemas, Florianópolis, 2015.A computação embarcada requer crescente vazão sob baixa potência. Ela requer um aumento de eficiência energética quando se executam programas de crescente complexidade. Muitos sistemas embarcados são também sistemas de tempo real, cuja correção temporal precisa ser garantida através de análise de escalonabilidade, a qual costuma assumir que o WCET de uma tarefa é conhecido em tempo de projeto. Como resultado da crescente complexidade do software, uma quantidade significativa de energia é gasta ao se prover instruções através da hierarquia de memória. Como a cache de instruções consome cerca de 40% da energia gasta em um processador embarcado e afeta a energia consumida em memória principal, ela se torna um relevante alvo para otimização. Entretanto, como ela afeta substancialmente o WCET, o comportamento da cache precisa ser restrito via  cache locking ou previsto via análise de WCET. Para obter eficiência energética sob restrições de tempo real, é preciso estender a consciência que o compilador tem da plataforma de hardware. Entretanto, compiladores para tempo real ignoram a energia, embora determinem rapidamente limites superiores para o WCET, enquanto compiladores para sistemas embarcados estimem com precisão a energia, mas gastem muito tempo em  profiling . Por isso, esta tese propõe um método unificado para estimar a energia gasta em memória, o qual é baseado em Interpretação Abstrata, exatamente o mesmo substrato matemático usado para a análise de WCET em caches. As estimativas mostram derivadas que são tão precisas quanto as obtidas via  profiling , mas são computadas 1000 vezes mais rápido, sendo apropriadas para induzir otimização de código através de melhoria iterativa. Como  cache locking troca eficiência energética por previsibilidade, esta tese propõe uma nova otimização de código, baseada em pré-carga por software, a qual reduz a taxa de faltas de caches de instruções e, provadamente, não aumenta o WCET. A otimização proposta é comparada com o estado-da-arte em  cache locking parcial para 37 programas do  Malardalen WCET benchmark para 36 configurações de cache e duas tecnologias distintas (2664 casos de uso). Em média, para obter uma melhoria de 68% no WCET,  cache locking parcial requer 8% mais energia. Por outro lado, a pré-carga por software diminui o consumo de energia em 11% enquanto melhora em 15% o WCET, reconciliando assim eficiência energética e garantias de tempo real.Abstract : Embedded computing requires increasing throughput at low power budgets. It asks for growing energy efficiency when executing programs of rising complexity. Many embedded systems are also real-time systems, whose temporal correctness is asserted through schedulability analysis, which often assumes that the WCET of each task is known at design-time. As a result of the growing software complexity, a significant amount of energy is spent in supplying instructions through the memory hierarchy. Since an instruction cache consumes around 40% of an embedded processor s energy and affects the energy spent in main memory, it becomes a relevant optimization target. However, since it largely impacts the WCET, cache behavior must be either constrained via cache locking or predicted by WCET analysis. To achieve energy efficiency under real-time constraints, a compiler must have extended awareness of the hardware platform. However, real-time compilers ignore energy, although they quickly determine bounds for WCET, whereas embedded compilers accurately estimate energy but require time-consuming profiling. That is why this thesis proposes a unifying method to estimate memory energy consumption that is based on Abstract Interpretation, the very same mathematical framework employed for the WCET analysis of caches. The estimates exhibit derivatives that are as accurate as those obtained by profiling, but are computed 1000 times faster, being suitable for driving code optimization through iterative improvement. Since cache locking gives up energy efficiency for predictability, this thesis proposes a novel code optimization, based on software prefetching, which reduces miss rate of unlocked instruction caches and, provenly, does not increase the WCET. The proposed optimization is compared with a state-of-the-art partial cache locking technique for the 37 programs of the Malardalen WCET benchmarks under 36 cache configurations and two distinct target technologies (2664 use cases). On average, to achieve an improvement of 68% in the WCET, partial cache locking required 8% more energy. On the other hand, software prefetching decreased the energy consumption by 11% while leading to an improvement of 15% in the WCET, thereby reconciling energy efficiency and real-time guarantees

    Reducing the WCET and analysis time of systems with simple lockable instruction caches

    Get PDF
    One of the key challenges in real-time systems is the analysis of the memory hierarchy. Many Worst-Case Execution Time (WCET) analysis methods supporting an instruction cache are based on iterative or convergence algorithms, which are rather slow. Our goal in this paper is to reduce the WCET analysis time on systems with a simple lockable instruction cache, focusing on the Lock-MS method. First, we propose an algorithm to obtain a structure-based representation of the Control Flow Graph (CFG). It organizes the whole WCET problem as nested subproblems, which takes advantage of common branch-and-bound algorithms of Integer Linear Programming (ILP) solvers. Second, we add support for multiple locking points per task, each one with specific cache contents, instead of a given locked content for the whole task execution. Locking points are set heuristically before outer loops. Such simple heuristics adds no complexity, and reduces the WCET by taking profit of the temporal reuse found in loops. Since loops can be processed as isolated regions, the optimal contents to lock into cache for each region can be obtained, and the WCET analysis time is further reduced. With these two improvements, our WCET analysis is around 10 times faster than other approaches. Also, our results show that the WCET is reduced, and the hit ratio achieved for the lockable instruction cache is similar to that of a real execution with an LRU instruction cache. Finally, we analyze the WCET sensitivity to compiler optimization, showing for each benchmark the right choices and pointing out that O0 is always the worst option

    The potential of programmable logic in the middle: cache bleaching

    Full text link
    Consolidating hard real-time systems onto modern multi-core Systems-on-Chip (SoC) is an open challenge. The extensive sharing of hardware resources at the memory hierarchy raises important unpredictability concerns. The problem is exacerbated as more computationally demanding workload is expected to be handled with real-time guarantees in next-generation Cyber-Physical Systems (CPS). A large body of works has approached the problem by proposing novel hardware re-designs, and by proposing software-only solutions to mitigate performance interference. Strong from the observation that unpredictability arises from a lack of fine-grained control over the behavior of shared hardware components, we outline a promising new resource management approach. We demonstrate that it is possible to introduce Programmable Logic In-the-Middle (PLIM) between a traditional multi-core processor and main memory. This provides the unique capability of manipulating individual memory transactions. We propose a proof-of-concept system implementation of PLIM modules on a commercial multi-core SoC. The PLIM approach is then leveraged to solve long-standing issues with cache coloring. Thanks to PLIM, colored sparse addresses can be re-compacted in main memory. This is the base principle behind the technique we call Cache Bleaching. We evaluate our design on real applications and propose hypervisor-level adaptations to showcase the potential of the PLIM approach.Accepted manuscrip

    Combining watchdog processor with instruction cache locking for a fault-tolerant, predictable architecture applied to fixed-priority, preemptive, multitasking real-time systems

    Full text link
    [EN] Control flow monitoring using a watchdog processor is a well-known technique to increase the dependability of a microprocessor system. Most approaches embed reference signatures for the watchdog processor into the processor instruction stream. These signatures contain the information required to detect control flow errors during program execution by the main processor. This paper proposes an architecture that offers both fault-tolerance and dynamic cache locking combined. This combination is achieved taking advantage of the fact that watchdog processor signatures are inserted along the program code. Then cache locking information is incorporated into these signatures. And also the required circuitry to inform the cache controller whether to lock or not the instructions fetched by the main processor is added into the watchdog processor. With this approach both fault-tolerant and real-time features are supported by the same hardware, therefore saving room on the silicon die or FPGA size. Results from experiments show that in most cases this approach reaches the same performance than previous, hardware-costly proposals.This work was partially funded by the Plan Nacional de I+D, Comision Interministerial de Ciencia y Tecnologia (FEDER-CICYT) under the project HAR2017-85557-P and Agencia Estatal de Investigacion under the project DPI2016-80303-C2-1-P.MartĂ­-Campoy, A.; RodrĂ­guez-Ballester, F. (2019). Combining watchdog processor with instruction cache locking for a fault-tolerant, predictable architecture applied to fixed-priority, preemptive, multitasking real-time systems. IEEE. 259-265. https://doi.org/10.1109/ETFA.2019.8869168S25926

    A Survey on Cache Management Mechanisms for Real-Time Embedded Systems

    Get PDF
    © ACM, 2015. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM Computing Surveys, {48, 2, (November 2015)} http://doi.acm.org/10.1145/2830555Multicore processors are being extensively used by real-time systems, mainly because of their demand for increased computing power. However, multicore processors have shared resources that affect the predictability of real-time systems, which is the key to correctly estimate the worst-case execution time of tasks. One of the main factors for unpredictability in a multicore processor is the cache memory hierarchy. Recently, many research works have proposed different techniques to deal with caches in multicore processors in the context of real-time systems. Nevertheless, a review and categorization of these techniques is still an open topic and would be very useful for the real-time community. In this article, we present a survey of cache management techniques for real-time embedded systems, from the first studies of the field in 1990 up to the latest research published in 2014. We categorize the main research works and provide a detailed comparison in terms of similarities and differences. We also identify key challenges and discuss future research directions.King Saud University NSER

    An Overview of Approaches Towards the Timing Analysability of Parallel Architecture

    Get PDF
    In order to meet performance/low energy/integration requirements, parallel architectures (multithreaded cores and multi-cores) are more and more considered in the design of embedded systems running critical software. The objective is to run several applications concurrently. When applications have strict real-time constraints, two questions arise: a) how can the worst-case execution time (WCET) of each application be computed while concurrent applications might interfere? b)~how can the tasks be scheduled so that they are guarantee to meet their deadlines? The second question has received much attention for several years~cite{CFHS04,DaBu11}. Proposed schemes generally assume that the first question has been solved, and in addition that they do not impact the WCETs. In effect, the first question is far from been answered even if several approaches have been proposed in the literature. In this paper, we present an overview of these approaches from the point of view of static WCET analysis techniques

    Real-time operating system support for multicore applications

    Get PDF
    Tese (doutorado) - Universidade Federal de Santa Catarina, Centro Tecnológico, Programa de Pós-Graduação em Engenharia de Automação e Sistemas, Florianópolis, 2014Plataformas multiprocessadas atuais possuem diversos níveis da memória cache entre o processador e a memória principal para esconder a latência da hierarquia de memória. O principal objetivo da hierarquia de memória é melhorar o tempo médio de execução, ao custo da previsibilidade. O uso não controlado da hierarquia da cache pelas tarefas de tempo real impacta a estimativa dos seus piores tempos de execução, especialmente quando as tarefas de tempo real acessam os níveis da cache compartilhados. Tal acesso causa uma disputa pelas linhas da cache compartilhadas e aumenta o tempo de execução das aplicações. Além disso, essa disputa na cache compartilhada pode causar a perda de prazos, o que é intolerável em sistemas de tempo real críticos. O particionamento da memória cache compartilhada é uma técnica bastante utilizada em sistemas de tempo real multiprocessados para isolar as tarefas e melhorar a previsibilidade do sistema. Atualmente, os estudos que avaliam o particionamento da memória cache em multiprocessadores carecem de dois pontos fundamentais. Primeiro, o mecanismo de particionamento da cache é tipicamente implementado em um ambiente simulado ou em um sistema operacional de propósito geral. Consequentemente, o impacto das atividades realizados pelo núcleo do sistema operacional, tais como o tratamento de interrupções e troca de contexto, no particionamento das tarefas tende a ser negligenciado. Segundo, a avaliação é restrita a um escalonador global ou particionado, e assim não comparando o desempenho do particionamento da cache em diferentes estratégias de escalonamento. Ademais, trabalhos recentes confirmaram que aspectos da implementação do SO, tal como a estrutura de dados usada no escalonamento e os mecanismos de tratamento de interrupções, impactam a escalonabilidade das tarefas de tempo real tanto quanto os aspectos teóricos. Entretanto, tais estudos também usaram sistemas operacionais de propósito geral com extensões de tempo real, que afetamos sobre custos de tempo de execução observados e a escalonabilidade das tarefas de tempo real. Adicionalmente, os algoritmos de escalonamento tempo real para multiprocessadores atuais não consideram cenários onde tarefas de tempo real acessam as mesmas linhas da cache, o que dificulta a estimativa do pior tempo de execução. Esta pesquisa aborda os problemas supracitados com as estratégias de particionamento da cache e com os algoritmos de escalonamento tempo real multiprocessados da seguinte forma. Primeiro, uma infraestrutura de tempo real para multiprocessadores é projetada e implementada em um sistema operacional embarcado. A infraestrutura consiste em diversos algoritmos de escalonamento tempo real, tais como o EDF global e particionado, e um mecanismo de particionamento da cache usando a técnica de coloração de páginas. Segundo, é apresentada uma comparação em termos da taxa de escalonabilidade considerando o sobre custo de tempo de execução da infraestrutura criada e de um sistema operacional de propósito geral com extensões de tempo real. Em alguns casos, o EDF global considerando o sobre custo do sistema operacional embarcado possui uma melhor taxa de escalonabilidade do que o EDF particionado com o sobre custo do sistema operacional de propósito geral, mostrando claramente como diferentes sistemas operacionais influenciam os escalonadores de tempo real críticos em multiprocessadores. Terceiro, é realizada uma avaliação do impacto do particionamento da memória cache em diversos escalonadores de tempo real multiprocessados. Os resultados desta avaliação indicam que um sistema operacional "leve" não compromete as garantias de tempo real e que o particionamento da cache tem diferentes comportamentos dependendo do escalonador e do tamanho do conjunto de trabalho das tarefas. Quarto, é proposto um algoritmo de particionamento de tarefas que atribui as tarefas que compartilham partições ao mesmo processador. Os resultados mostram que essa técnica de particionamento de tarefas reduz a disputa pelas linhas da cache compartilhadas e provê garantias de tempo real para sistemas críticos. Finalmente, é proposto um escalonador de tempo real de duas fases para multiprocessadores. O escalonador usa informações coletadas durante o tempo de execução das tarefas através dos contadores de desempenho em hardware. Com base nos valores dos contadores, o escalonador detecta quando tarefas de melhor esforço o interferem com tarefas de tempo real na cache. Assim é possível impedir que tarefas de melhor esforço acessem as mesmas linhas da cache que tarefas de tempo real. O resultado desta estratégia de escalonamento é o atendimento dos prazos críticos e não críticos das tarefas de tempo real.Abstracts: Modern multicore platforms feature multiple levels of cache memory placed between the processor and main memory to hide the latency of ordinary memory systems. The primary goal of this cache hierarchy is to improve average execution time (at the cost of predictability). The uncontrolled use of the cache hierarchy by realtime tasks may impact the estimation of their worst-case execution times (WCET), specially when real-time tasks access a shared cache level, causing a contention for shared cache lines and increasing the application execution time. This contention in the shared cache may leadto deadline losses, which is intolerable particularly for hard real-time (HRT) systems. Shared cache partitioning is a well-known technique used in multicore real-time systems to isolate task workloads and to improve system predictability. Presently, the state-of-the-art studies that evaluate shared cache partitioning on multicore processors lack two key issues. First, the cache partitioning mechanism is typically implemented either in a simulated environment or in a general-purpose OS (GPOS), and so the impact of kernel activities, such as interrupt handlers and context switching, on the task partitions tend to be overlooked. Second, the evaluation is typically restricted to either a global or partitioned scheduler, thereby by falling to compare the performance of cache partitioning when tasks are scheduled by different schedulers. Furthermore, recent works have confirmed that OS implementation aspects, such as the choice of scheduling data structures and interrupt handling mechanisms, impact real-time schedulability as much as scheduling theoretic aspects. However, these studies also used real-time patches applied into GPOSes, which affects the run-time overhead observed in these works and consequently the schedulability of real-time tasks. Additionally, current multicore scheduling algorithms do not consider scenarios where real-time tasks access the same cache lines due to true or false sharing, which also impacts the WCET. This thesis addresses these aforementioned problems with cache partitioning techniques and multicore real-time scheduling algorithms as following. First, a real-time multicore support is designed and implemented on top of an embedded operating system designed from scratch. This support consists of several multicore real-time scheduling algorithms, such as global and partitioned EDF, and a cache partitioning mechanism based on page coloring. Second, it is presented a comparison in terms of schedulability ratio considering the run-time overhead of the implemented RTOS and a GPOS patched with real-time extensions. In some cases, Global-EDF considering the overhead of the RTOS is superior to Partitioned-EDF considering the overhead of the patched GPOS, which clearly shows how different OSs impact hard realtime schedulers. Third, an evaluation of the cache partitioning impacton partitioned, clustered, and global real-time schedulers is performed.The results indicate that a lightweight RTOS does not impact real-time tasks, and shared cache partitioning has different behavior depending on the scheduler and the task's working set size. Fourth, a task partitioning algorithm that assigns tasks to cores respecting their usage of cache partitions is proposed. The results show that by simply assigning tasks that shared cache partitions to the same processor, it is possible to reduce the contention for shared cache lines and to provideHRT guarantees. Finally, a two-phase multicore scheduler that provides HRT and soft real-time (SRT) guarantees is proposed. It is shown that by using information from hardware performance counters at run-time, the RTOS can detect when best-effort tasks interfere with real-time tasks in the shared cache. Then, the RTOS can prevent best effort tasks from interfering with real-time tasks. The results also show that the assignment of exclusive partitions to HRT tasks together with the two-phase multicore scheduler provides HRT and SRT guarantees, even when best-effort tasks share partitions with real-time tasks

    A survey of techniques for reducing interference in real-time applications on multicore platforms

    Get PDF
    This survey reviews the scientific literature on techniques for reducing interference in real-time multicore systems, focusing on the approaches proposed between 2015 and 2020. It also presents proposals that use interference reduction techniques without considering the predictability issue. The survey highlights interference sources and categorizes proposals from the perspective of the shared resource. It covers techniques for reducing contentions in main memory, cache memory, a memory bus, and the integration of interference effects into schedulability analysis. Every section contains an overview of each proposal and an assessment of its advantages and disadvantages.This work was supported in part by the Comunidad de Madrid Government "Nuevas TĂ©cnicas de Desarrollo de Software de Tiempo Real Embarcado Para Plataformas. MPSoC de PrĂłxima GeneraciĂłn" under Grant IND2019/TIC-17261

    WCET-Aware Scratchpad Memory Management for Hard Real-Time Systems

    Get PDF
    abstract: Cyber-physical systems and hard real-time systems have strict timing constraints that specify deadlines until which tasks must finish their execution. Missing a deadline can cause unexpected outcome or endanger human lives in safety-critical applications, such as automotive or aeronautical systems. It is, therefore, of utmost importance to obtain and optimize a safe upper bound of each task’s execution time or the worst-case execution time (WCET), to guarantee the absence of any missed deadline. Unfortunately, conventional microarchitectural components, such as caches and branch predictors, are only optimized for average-case performance and often make WCET analysis complicated and pessimistic. Caches especially have a large impact on the worst-case performance due to expensive off- chip memory accesses involved in cache miss handling. In this regard, software-controlled scratchpad memories (SPMs) have become a promising alternative to caches. An SPM is a raw SRAM, controlled only by executing data movement instructions explicitly at runtime, and such explicit control facilitates static analyses to obtain safe and tight upper bounds of WCETs. SPM management techniques, used in compilers targeting an SPM-based processor, determine how to use a given SPM space by deciding where to insert data movement instructions and what operations to perform at those program locations. This dissertation presents several management techniques for program code and stack data, which aim to optimize the WCETs of a given program. The proposed code management techniques include optimal allocation algorithms and a polynomial-time heuristic for allocating functions to the SPM space, with or without the use of abstraction of SPM regions, and a heuristic for splitting functions into smaller partitions. The proposed stack data management technique, on the other hand, finds an optimal set of program locations to evict and restore stack frames to avoid stack overflows, when the call stack resides in a size-limited SPM. In the evaluation, the WCETs of various benchmarks including real-world automotive applications are statically calculated for SPMs and caches in several different memory configurations.Dissertation/ThesisDoctoral Dissertation Computer Science 201
    • …
    corecore