12 research outputs found

    The potential of programmable logic in the middle: cache bleaching

    Full text link
    Consolidating hard real-time systems onto modern multi-core Systems-on-Chip (SoC) is an open challenge. The extensive sharing of hardware resources at the memory hierarchy raises important unpredictability concerns. The problem is exacerbated as more computationally demanding workload is expected to be handled with real-time guarantees in next-generation Cyber-Physical Systems (CPS). A large body of works has approached the problem by proposing novel hardware re-designs, and by proposing software-only solutions to mitigate performance interference. Strong from the observation that unpredictability arises from a lack of fine-grained control over the behavior of shared hardware components, we outline a promising new resource management approach. We demonstrate that it is possible to introduce Programmable Logic In-the-Middle (PLIM) between a traditional multi-core processor and main memory. This provides the unique capability of manipulating individual memory transactions. We propose a proof-of-concept system implementation of PLIM modules on a commercial multi-core SoC. The PLIM approach is then leveraged to solve long-standing issues with cache coloring. Thanks to PLIM, colored sparse addresses can be re-compacted in main memory. This is the base principle behind the technique we call Cache Bleaching. We evaluate our design on real applications and propose hypervisor-level adaptations to showcase the potential of the PLIM approach.Accepted manuscrip

    A Memory Scheduling Infrastructure for Multi-Core Systems with Re-Programmable Logic

    Get PDF
    The sharp increase in demand for performance has prompted an explosion in the complexity of modern multi-core embedded systems. This has lead to unprecedented temporal unpredictability concerns in Cyber-Physical Systems (CPS). On-chip integration of programmable logic (PL) alongside a conventional Processing System (PS) in modern Systems-on-Chip (SoC) establishes a genuine compromise between specialization, performance, and reconfigurability. In addition to typical use-cases, it has been shown that the PL can be used to observe, manipulate, and ultimately manage memory traffic generated by a traditional multi-core processor. This paper explores the possibility of PL-aided memory scheduling by proposing a Scheduler In-the-Middle (SchIM). We demonstrate that the SchIM enables transaction-level control over the main memory traffic generated by a set of embedded cores. Focusing on extensibility and reconfigurability, we put forward a SchIM design covering two main objectives. First, to provide a safe playground to test innovative memory scheduling mechanisms; and second, to establish a transition path from software-based memory regulation to provably correct hardware-enforced memory scheduling. We evaluate our design through a full-system implementation on a commercial PS-PL platform using synthetic and real-world benchmarks

    RT-Bench: an Extensible Benchmark Framework for the Analysis and Management of Real-Time Applications

    Full text link
    Benchmarking is crucial for testing and validating any system, even more so in real-time systems. Typical real-time applications adhere to well-understood abstractions: they exhibit a periodic behavior, operate on a well-defined working set, and strive for stable response time avoiding non-predicable factors such as page faults. Unfortunately, available benchmark suites fail to reflect key characteristics of real-time applications. Practitioners and researchers must resort to either benchmark heavily approximated real-time environments, or to re-engineer available benchmarks to add -- if possible -- the sought-after features. Additionally, the measuring and logging capabilities provided by most benchmark suites are not tailored "out-of-the-box" to real-time environments, and changing basic parameters such as the scheduling policy often becomes a tiring and error-prone exercise. In this paper, we present RT-bench, an open-source framework adding standard real-time features to virtually any existing benchmark. Furthermore, RT-bench provides an easy-to-use, unified command line interface to customize key aspects of the real-time execution of a set of benchmarks. Our framework is guided by four main criteria: 1) cohesive interface, 2) support for periodic application behavior and deadline semantics, 3) controllable memory footprint, and 4) extensibility and portability. We have integrated within the framework applications from the widely used SD-VBS and IsolBench suites. We showcase a set of use-cases that are representative of typical real-time system evaluation scenarios and that can be easily conducted via RT-Bench.Comment: 11 pages, 12 figures; code available at https://gitlab.com/rt-bench/rt-bench, documentation available at https://rt-bench.gitlab.io/rt-bench

    CAESAR: coherence-aided elective and seamless alternative routing via on-chip FPGA

    Full text link
    Prompted by the ever-growing demand for high-performance System-on-Chip (SoC) and the plateauing of CPU frequencies, the SoC design landscape is shifting. In a quest to offer programmable specialization, the adoption of tightly-coupled FPGAs co-located with traditional compute clusters has been embraced by major vendors. This CPU+FPGA architectural paradigm opens the door to novel hardware/software co-design opportunities. The key principle is that CPU-originated memory traffic can be re-routed through the FPGA for analysis and management purposes. Albeit promising, the side-effect of this approach is that time-critical operations—such as cache-line refills—are fulfilled by moving data over slower interconnects meant for I/O traffic. In this article, we introduce a novel principle named Cache Coherence Backstabbing to precisely tackle these shortcomings. The technique leverages the ability to include the FGPA in the same coherence domain as the core processing elements. Importantly, this enables Coherence-Aided Elective and Seamless Alternative Routing (CAESAR), i.e., seamless inspection and routing of memory transactions, especially cache-line refills, through the FPGA. CAESAR allows the definition of new memory programming paradigms. We discuss the intrinsic potentials of the approach and evaluate it with a full-stack prototype implementation on a commercial platform. Our experiments show an improvement of up to 29% in read bandwidth, 23% in latency, and 13% in pragmatic workloads over the state of the art. Furthermore, we showcase the first in-coherence-domain run-time profiler design as a use-case of the CAESAR approach.National Science FoundationAccepted manuscrip

    Hardware data re-organization engine for real-time systems

    Full text link
    Access patterns and cache utilization play a key role in the analyzability of data-intensive applications. In this demo, we re-examine our previous research on software-hardware codesign to push data transformation closer to memory from a real-time perspective. Deployed in modern CPU+FPGA systems, our design enables efficient and cache-friendly access to large data by only moving relevant bytes from the target memory. This (1) compresses the cache footprint and (2) reorganizes complex memory access patterns into sequential and predictable patterns.National Science FoundationAccepted manuscrip

    Low-Overhead Online Assessment of Timely Progress as a System Commodity

    Get PDF

    Low-overhead online assessment of timely progress as a system commodity

    Get PDF
    The correctness of safety-critical systems depends on both their logical and temporal behavior. Control-flow integrity (CFI) is a well-established and understood technique to safeguard the logical flow of safety-critical applications. But unfortunately, no established methodologies exist for the complementary problem of detecting violations of control flow timeliness. Worse yet, the latter dimension, which we term Timely Progress Integrity (TPI), is increasingly more jeopardized as the complexity of our embedded systems continues to soar. As key resources of the memory hierarchy become shared by several CPUs and accelerators, they become hard-to-analyze performance bottlenecks. And the precise interplay between software and hardware components becomes hard to predict and reason about. How to restore control over timely progress integrity? We postulate that the first stepping stone toward TPI is to develop methodologies for Timely Progress Assessment (TPA). TPA refers to the ability of a system to live-monitor the positive/negative slack - with respect to a known reference - at key milestones throughout an application’s lifespan. In this paper, we propose one such methodology that goes under the name of Milestone-Based Timely Progress Assessment or MB-TPA, for short. Among the key design principles of MB-TPA is the ability to operate on black-box binary executables with near-zero time overhead and implementable on commercial platforms. To prove its feasibility and effectiveness, we propose and evaluate a full-stack implementation called Timely Progress Assessment with 0 Overhead (TPAw0v). We demonstrate its capability in providing live TPA for complex vision applications while introducing less than 0.6% time overhead for applications under test. Finally, we demonstrate one use case where TPA information is used to restore TPI in the presence of temporal interference over shared memory resources.National Science Foundation; CCF-2008799 - National Science Foundation; National Science Foundation; CNS-2238476 - National Science FoundationPublished versio

    Know your enemy: benchmarking and experimenting with insight as a goal

    Full text link
    Available benchmark suites are used to provide realistic workloads and to understand their run-time characteristics. However, they do not necessarily target the same platforms and often offer a diverse set of metrics, leading to the lack of a knowledge base that could be used for both systems and theoretical research. RT-Bench, a new benchmark framework environment, tries to address these issues by providing a uniform interface and metrics while maintaining portability. This demo illustrates how to leverage this framework and its recently added features to improve the understanding of the benchmarks’ interaction with its system.National Science FoundationAccepted manuscrip

    Observing the invisible: live cache inspection for high-performance embedded systems

    Full text link
    The vast majority of high-performance embedded systems implement multi-level CPU cache hierarchies. But the exact behavior of these CPU caches has historically been opaque to system designers. Absent expensive hardware debuggers, an understanding of cache makeup remains tenuous at best. This enduring opacity further obscures the complex interplay among applications and OS-level components, particularly as they compete for the allocation of cache resources. Notwithstanding the relegation of cache comprehension to proxies such as static cache analysis, performance counter-based profiling, and cache hierarchy simulations, the underpinnings of cache structure and evolution continue to elude software-centric solutions. In this article, we explore a novel method of studying cache contents and their evolution via snapshotting. Our method complements extant approaches for cache profiling to better formulate, validate, and refine hypotheses on the behavior of modern caches. We leverage cache introspection interfaces provided by vendors to perform live cache inspections without the need for external hardware. We present CacheFlow, a proof-of-concept Linux kernel module which snapshots cache contents on an NVIDIA Tegra TX1 system on chip and a Hardkernel Odroid XU4.Accepted manuscrip
    corecore