78 research outputs found

    An analyzable memory controller for hard real-time CMPs

    Get PDF
    Multicore processors (CMPs) represent a good solution to provide the performance required by current and future hard real-time systems. However, it is difficult to compute a tight WCET estimation for CMPs due to interferences that tasks suffer when accessing shared hardware resources.We propose an analyzable JEDEC-compliant DDRx SDRAM memory controller (AMC) for hard real-time CMPs, that reduces the impact of memory interferences caused by other tasks on WCET estimation, providing a predictable memory access time and allowing the computation of tight WCET estimations.Peer ReviewedPostprint (published version

    Ubik: efficient cache sharing with strict qos for latency-critical workloads

    Get PDF
    Chip-multiprocessors (CMPs) must often execute workload mixes with different performance requirements. On one hand, user-facing, latency-critical applications (e.g., web search) need low tail (i.e., worst-case) latencies, often in the millisecond range, and have inherently low utilization. On the other hand, compute-intensive batch applications (e.g., MapReduce) only need high long-term average performance. In current CMPs, latency-critical and batch applications cannot run concurrently due to interference on shared resources. Unfortunately, prior work on quality of service (QoS) in CMPs has focused on guaranteeing average performance, not tail latency. In this work, we analyze several latency-critical workloads, and show that guaranteeing average performance is insufficient to maintain low tail latency, because microarchitectural resources with state, such as caches or cores, exert inertia on instantaneous workload performance. Last-level caches impart the highest inertia, as workloads take tens of milliseconds to warm them up. When left unmanaged, or when managed with conventional QoS frameworks, shared last-level caches degrade tail latency significantly. Instead, we propose Ubik, a dynamic partitioning technique that predicts and exploits the transient behavior of latency-critical workloads to maintain their tail latency while maximizing the cache space available to batch applications. Using extensive simulations, we show that, while conventional QoS frameworks degrade tail latency by up to 2.3x, Ubik simultaneously maintains the tail latency of latency-critical workloads and significantly improves the performance of batch applications.United States. Defense Advanced Research Projects Agency (Power Efficiency Revolution For Embedded Computing Technologies Contract HR0011-13-2-0005)National Science Foundation (U.S.) (Grant CCF-1318384

    Parallelism-Aware Memory Interference Delay Analysis for COTS Multicore Systems

    Full text link
    In modern Commercial Off-The-Shelf (COTS) multicore systems, each core can generate many parallel memory requests at a time. The processing of these parallel requests in the DRAM controller greatly affects the memory interference delay experienced by running tasks on the platform. In this paper, we model a modern COTS multicore system which has a nonblocking last-level cache (LLC) and a DRAM controller that prioritizes reads over writes. To minimize interference, we focus on LLC and DRAM bank partitioned systems. Based on the model, we propose an analysis that computes a safe upper bound for the worst-case memory interference delay. We validated our analysis on a real COTS multicore platform with a set of carefully designed synthetic benchmarks as well as SPEC2006 benchmarks. Evaluation results show that our analysis is more accurately capture the worst-case memory interference delay and provides safer upper bounds compared to a recently proposed analysis which significantly under-estimate the delay.Comment: Technical Repor

    Adapting TDMA arbitration for measurement-based probabilistic timing analysis

    Get PDF
    Critical Real-Time Embedded Systems require functional and timing validation to prove that they will perform their functionalities correctly and in time. For timing validation, a bound to the Worst-Case Execution Time (WCET) for each task is derived and passed as an input to the scheduling algorithm to ensure that tasks execute timely. Bounds to WCET can be derived with deterministic timing analysis (DTA) and probabilistic timing analysis (PTA), each of which relies upon certain predictability properties coming from the hardware/software platform beneath. In particular, specific hardware designs are needed for both DTA and PTA, which challenges their adoption by hardware vendors. This paper makes a step towards reconciling the hardware needs of DTA and PTA timing analyses to increase the likelihood of those hardware designs to be adopted by hardware vendors. In particular, we show how Time Division Multiple Access (TDMA), which has been regarded as one of the main DTA-compliant arbitration policies, can be used in the context of PTA and, in particular, of the industrially-friendly Measurement-Based PTA (MBPTA). We show how the execution time measurements taken as input for MBPTA need to be padded to obtain reliable and tight WCET estimates on top of TDMA-arbitrated hardware resources with no further hardware support. Our results show that TDMA delivers tighter WCET estimates than MBPTA-friendly arbitration policies, whereas MBPTA-friendly policies provide higher average performance. Thus, the best policy to choose depends on the particular needs of the end user.The research leading to these results has been funded by the EU FP7 under grant agreement no. 611085 (PROXIMA) and 287519 (parMERASA). This work has also been partially supported by the Spanish Ministry of Economy and Competitiveness (MINECO) under grant TIN2015-65316-P and the HiPEAC Network of Excellence. Miloˇs Pani´c is funded by the Spanish Ministry of Education under the FPU grant FPU12/05966. Jaume Abella has been partially supported by the MINECO under Ramon y Cajal postdoctoral fellowship number RYC-2013-14717.Peer ReviewedPostprint (author's final draft

    WCET Analysis of a Parallel 3D Multigrid Solver Executed on the MERASA Multi-Core

    Get PDF
    To meet performance requirements as well as constraints on cost and power consumption, future embedded systems will be designed with multi-core processors. However, the question of timing analysability is raised with these architectures. In the MERASA project, a WCET-aware multi-core processor has been designed with the appropriate system software. They both guarantee that the WCET of tasks running on different cores can be safely analyzed since their possible interactions can be bounded. Nevertheless, computing the WCET of a parallel application is still not straightforward and a high-level preliminary analysis of the communication and synchronization patterns must be performed. In this paper, we report on our experience in evaluating the WCET of a parallel 3D multigrid solver code and we propose lines for further research on this topic

    Design Space Exploration of Object Caches with Cross-Profiling

    Get PDF
    To avoid data cache trashing between heapallocated data and other data areas, a distinct object cache has been proposed for embedded real-time Java processors. This object cache uses high associativity in order to statically track different object pointers for worst-case execution-time analysis. However, before implementing such an object cache, an empirical analysis of different organization forms is needed. We use a cross-profiling technique based on aspect-oriented programming in order to evaluate different object cache organizations with standard Java benchmarks. From the evaluation we conclude that field access exhibits some temporal locality, but almost no spatial locality. Therefore, filling long cache lines on a miss just introduces a high miss penalty without increasing the hit rate enough to make up for the increased miss penalty. For an object cache, it is more efficient to fill individual words within the cache line on a miss

    Deterministic Memory Abstraction and Supporting Multicore System Architecture

    Get PDF
    Poor time predictability of multicore processors has been a long-standing challenge in the real-time systems community. In this paper, we make a case that a fundamental problem that prevents efficient and predictable real-time computing on multicore is the lack of a proper memory abstraction to express memory criticality, which cuts across various layers of the system: the application, OS, and hardware. We, therefore, propose a new holistic resource management approach driven by a new memory abstraction, which we call Deterministic Memory. The key characteristic of deterministic memory is that the platform-the OS and hardware-guarantees small and tightly bounded worst-case memory access timing. In contrast, we call the conventional memory abstraction as best-effort memory in which only highly pessimistic worst-case bounds can be achieved. We propose to utilize both abstractions to achieve high time predictability but without significantly sacrificing performance. We present deterministic memory-aware OS and architecture designs, including OS-level page allocator, hardware-level cache, and DRAM controller designs. We implement the proposed OS and architecture extensions on Linux and gem5 simulator. Our evaluation results, using a set of synthetic and real-world benchmarks, demonstrate the feasibility and effectiveness of our approach

    Understanding Shared Memory Bank Access Interference in Multi-Core Avionics

    Get PDF
    Deployment of multi-core platforms in safety-critical applications requires reliable estimation of worst-case response time (WCRT) for critical processes. Determination of WCRT needs to accurately estimate and measure the interferences arising from multiple processes and multiple cores. Earlier works have proposed frameworks in which CPU, shared cache, and shared memory (DRAM) interferences can be estimated using some application and platform-dependent parameters. In this work we examine a recent work in which single core equivalent (SCE) worst case execution time is used as a basis for deriving WCRT. We describe the specific requirements in an avionics context including the sharing of memory banks by multiple processes on multiple cores, and adapt the SCE framework to account for them. We present the needed adaptations to a real-time operating system to enforce the requirements, and present a methodology for validating the theoretical WCRT through measurements on the resulting platform. The work reveals that the framework indeed creates a (pessimistic) bound on the WCRT. It also discloses that the maximum interference for memory accesses does not arise when all cores share the same memory bank
    • …
    corecore