This paper presents Boomerang, a system that integrates a legacy non-real-time OS with one that is customized for timing-sensitive tasks. A relatively small RTOS benefits from the pre-existing libraries, drivers and services of the legacy system. Additionally, timing-critical tasks are isolated from less critical tasks by securely partitioning machine resources among the separate OSes. Boomerang guarantees end-to-end processing delays on input data that requires outputs to be generated within specific time bounds.
Introduction
Mixed-criticality systems require the spatial and temporal isolation of tasks to meet timing, safety and security constraints [9]. Additionally, these systems involve real-time task pipelines to implement sensing, processing and actuation. For example, an automotive system supports low-criticality infotainment services, which must be isolated from highly critical driving assistance tasks that process sensor data to avoid vehicle collisions.
Spatial isolation ensures that one software component cannot alter another component's private code or data, or interfere with the control of its devices. Temporal isolation ensures that a software component cannot affect when another component accesses a resource (e.g., a CPU). Lack of temporal and spatial isolation leads to potential timing or functional failures. Failure of a highly critical task has potentially catastrophic consequences, while failure of a low-criticality task has less significant consequences.
One way to support mixed-criticality systems is to partition tasks onto separate hardware. This ensures less critical tasks are unable to directly affect those of greater importance. Automotive systems have traditionally taken this approach, by assigning a different functional component to a separate electronic control unit (ECU) [31] . However, as the complexity of these systems increases, hardware costs, wiring and packaging become prohibitive. For this reason, new hardware platforms that integrate the functionality of multiple hardware components, including multicore processors, accelerators, GPUs, and various input/output (I/O) interfaces are now emerging. Tesla's AutoPilot 2.x, for example, already uses platforms such as the Nvidia Drive PX2 in its cars, to assist with vehicle control.
An integrated solution, combining tasks of different criticality levels on the same hardware, requires an operating system to correctly enforce temporal and spatial isolation. Partitioning operating systems such as Tresos [9] and LynxOS [25] have been developed for automotive and avionics systems, respectively, in accordance with standards such as AUTOSAR [5] and ARINC 653 [43], to isolate tasks of different criticality levels. However, these types of systems are not able to take advantage of legacy software, including libraries and device drivers written for the newest hardware. In contrast, systems such as Linux, Windows and OS X are regularly updated with features that would take an operating system developer years to reproduce in a clean-slate design. Unfortunately, general-purpose systems do not meet the necessary temporal and spatial isolation requirements, and lack the ability to perform the real-time sensing, processing and actuation required by emerging mixed-criticality systems.
In this paper, we present a system called Boomerang. Boomerang uses a partitioning hypervisor [45] , which separates the hardware of a physical machine into different guest domains that directly manage their assigned resources. This contrasts with a conventional multiplexing (or consolidating) hypervisor, which intervenes in the sharing of physical machine resources among multiple guests. Boomerang's approach removes the hypervisor from resource management, once CPU cores, physical memory and I/O devices are assigned to separate guests.
Using separate partitions, Boomerang supports the co-existence of a real-time operating system (RTOS) and a legacy system such as Linux. Linux provides a domain for less timing and safety-critical tasks, while providing a rich set of pre-existing libraries and device drivers. For example, OpenCL, CUDA, drivers for hardware accelerators, cameras, and machine learning algorithms are all available in Linux, and would be cumbersome to write for a new RTOS. At the same time, the RTOS partition in Boomerang provides the timing guarantees for real-time tasks to perform sensor data processing and actuation.
Key to this paper's contributions is the construction of a composable tuned pipe abstraction. This abstraction implements real-time task pipelines that ensure end-to-end service guarantees on sensing, processing and actuation. As stated above, many emerging mixed-criticality systems require tasks to process sensory inputs before subsequently generating outputs that affect the actuation of a device. For example, a cruise control system in an electric car may collect data from cameras and speed sensors before determining that the motors need to change speed to keep a safe distance to the vehicle ahead.
Novel to Boomerang's composable tuned pipes is the ability for an integrated RTOS to manage I/O that requires services in a legacy system such as Linux. We show how to construct composable task pipelines in Boomerang that combine tasks spanning a custom RTOS and a legacy Linux system. By assigning time-critical I/O to the RTOS, we ensure that complementary services provided by Linux are sufficiently predictable to meet end-to-end service guarantees. We compare our approach to one based solely on Linux, using specific cores to handle timing-sensitive I/O. Boomerang not only benefits from spatial isolation, it also outperforms a standalone Linux system using deadline-based CPU reservations for pipeline tasks.
The following section provides background to the problem addressed by Boomerang. Section 3 describes the Boomerang partitioning hypervisor and composable tuned pipes. An evaluation of Boomerang is described in Section 4. Related work is discussed in Section 5. Finally, conclusions and future work are described in Section 6.
Background
Boomerang supports composable task pipelines that form a round-trip path, originating from a device input and ultimately finishing with a device output. It is designed specifically for applications that require sensing, processing and actuation. The system name is based on the idea that a boomerang follows a path that returns to its starting point, which in our case is a device, although not necessarily the same device that produced input data.
Figure 1(a) shows the round-trip path in a typical OS. A device acknowledges the completion of an I/O request by generating an interrupt. Most systems handle interrupts at priorities above those of software tasks. They also incorrectly charge interrupt handling to the task that was preempted by the arrival of the interrupt. Worse still, a burst of interrupts within a short time may delay a time-critical task enough to miss its deadline [13, 56]. Figure 1(a) uses a dedicated core for I/O processing of device interrupts to avoid interference with task execution, as described above. However, the single OS approach does not ensure sufficient spatial isolation for domains that require the heightened safety and security of a partitioned system. In contrast, Figure 1(b) shows how Boomerang supports three different classes of I/O using a partitioning hypervisor [35, 48] to separate highly critical timing-sensitive operations from less critical system components.
In the first case, all I/O is contained within an RTOS. Real-time tasks and interrupt handlers for device I/O share the same processor cores, as the RTOS ensures predictable timing guarantees on task and I/O processing.
In the second case, the I/O path traverses a task pipeline that enters into a legacy OS via secure shared memory. Here, the legacy OS provides services that would require significant effort to port to the RTOS. The round-trip I/O path in case 2 is still able to meet end-to-end timing guarantees because the tasks in the legacy OS are isolated from timing unpredictability caused by interrupts. This is made possible by demoting interrupts to priorities that are distinctly lower than those of tasks. Additionally, legacy OSes such as Linux support SCHED_DEADLINE execution for tasks, thereby ensuring some degree of timing guarantees, as long as there is no interference from interrupts [17].
In the third case, it may be necessary for some I/O to be handled by a legacy system, which has drivers and libraries that are unavailable in the RTOS. For example, a series of cameras used in a driverless car need suitable device drivers and machine learning algorithms to perform object classification. The outcomes of object classification dictate whether information needs to be communicated to the RTOS to issue real-time outputs that adjust vehicle motion. As with the single OS approach, I/O originating in case 3 may handle interrupts on a dedicated core, to avoid interference with tasks that serve RTOS requests in case 2. Alternatively, I/O processing in the legacy OS is given lower priority than task execution, leaving critical I/O to the real-time OS.
VCPU Scheduling
Boomerang's partitioning hypervisor provides support for an RTOS to co-exist on the same platform as a legacy system such as Linux. Our in-house RTOS assigns threads to virtual CPUs (VCPUs), which are then scheduled on physical CPUs (PCPUs) 1 . Each VCPU is assigned a processor capacity reserve [28] consisting of a budget capacity, C, and period, T. A VCPU is guaranteed at least C units of execution time every T time units when it is runnable, as long as a schedulability test [21] is passed when creating new VCPUs. This way, the scheduler guarantees temporal isolation between threads associated with different VCPUs.
Boomerang's RTOS assigns tasks to Main VCPUs, which are implemented as Sporadic Servers [40]. Each Sporadic Server keeps track of its VCPU's budget usage, and constructs a list of timestamped future replenishments, to ensure timing guarantees. By default each Sporadic Server VCPU is scheduled using Rate-Monotonic Scheduling (RMS) [24]. The VCPU with the smallest period, T, has the highest priority. It is possible to increase the utilization bound of a set of Sporadic Servers by using earliest deadline first scheduling [24], where the deadline is set to the end of the VCPU's current period. However, this is a form of dynamic priority scheduling, which increases scheduling overheads.
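The difference between the two utilization bounds can be illustrated with a minimal admission-test sketch. It is not Boomerang's actual schedulability test [21]: it treats each VCPU as a periodic task with budget C and period T, uses the hyperbolic bound as a sufficient (not necessary) test for RMS, and uses total utilization of at most 1 for EDF with deadlines equal to periods. The names vcpu_t, rms_admissible and edf_admissible are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical VCPU descriptor: budget C and period T in the same time unit. */
typedef struct {
    double budget;  /* C */
    double period;  /* T */
} vcpu_t;

/* Returns true when the reserve itself is well formed (0 <= C <= T, T > 0). */
static bool vcpu_valid(const vcpu_t *v)
{
    return v->period > 0.0 && v->budget >= 0.0 && v->budget <= v->period;
}

/* Sufficient RMS test (hyperbolic bound): admit the set if the product of
 * (C_i / T_i + 1) over all VCPUs is at most 2. Sets that fail this test
 * may still be schedulable; the test is only conservative. */
bool rms_admissible(const vcpu_t *v, size_t n)
{
    assert(v != NULL || n == 0);
    double product = 1.0;
    for (size_t i = 0; i < n; i++) {
        if (!vcpu_valid(&v[i]))
            return false;
        product *= v[i].budget / v[i].period + 1.0;
    }
    return product <= 2.0;
}

/* EDF test with deadlines at the end of each period: admit the set if the
 * total utilization is at most 1. */
bool edf_admissible(const vcpu_t *v, size_t n)
{
    assert(v != NULL || n == 0);
    double total = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (!vcpu_valid(&v[i]))
            return false;
        total += v[i].budget / v[i].period;
    }
    return total <= 1.0;
}
```

For example, two VCPUs with (C, T) of (2, 4) and (2, 5) have a total utilization of 0.9. This sketch does not admit them under the RMS bound but does admit them under EDF, which is the trade-off against higher scheduling overhead noted above.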
To ensure that tasks are isolated from interrupts, the RTOS promotes interrupt handling to a schedulable thread context. In Boomerang, all real-time interrupts are associated with a source entity. The source entity is a task awaiting I/O completion (e.g., via a blocking read() system call), or a system thread awaiting a kernel event. All such entities have a corresponding Main VCPU.
A top half handler executes when an interrupt occurs, and determines the corresponding source entity. The top half then inserts into a system ready queue an IO VCPU with a dynamically calculated budget and period. These values are derived from the source entity associated with the interrupt. Finally, the interrupt is acknowledged, and all subsequent handling occurs in a bottom half thread context, when the corresponding IO VCPU is scheduled.
Each IO VCPU in Boomerang's RTOS is given a utilization bound, $U_{IO}$. There is one IO VCPU for each device class, with classes existing for USB, networking, ATA, and GPIO devices, among others. When an IO VCPU is added to the scheduler ready queue, its budget is set to $U_{IO} \times T_{Main}$ and its period is set to $T_{Main}$, where $T_{Main}$ is the period of the Main VCPU of the source entity associated with the interrupt. The RTOS is then able to correctly schedule bottom half interrupt handlers at the priority of the source task running on a Main VCPU. This contrasts with systems such as Linux, which schedule bottom halves (a.k.a., tasklets or softirqs) at priorities that are not tied to the source of corresponding interrupts.
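The derivation of an IO VCPU's parameters from its source entity can be sketched as follows. This is only an illustration of the rule stated above: the type vcpu_params_t, the function io_vcpu_params, the microsecond units and the representation of the utilization bound in parts per thousand are all assumptions, not the RTOS's actual interface.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical budget/period pair, in microseconds. */
typedef struct {
    uint64_t budget_us;
    uint64_t period_us;
} vcpu_params_t;

/* Derive IO VCPU parameters from the Main VCPU of the source entity.
 * u_io_permille is the IO VCPU utilization bound in parts per thousand
 * (an assumption made here to avoid floating point). */
vcpu_params_t io_vcpu_params(vcpu_params_t main_vcpu, uint32_t u_io_permille)
{
    assert(u_io_permille <= 1000);
    vcpu_params_t io;
    io.period_us = main_vcpu.period_us;                         /* T_Main        */
    io.budget_us = main_vcpu.period_us * u_io_permille / 1000;  /* U_IO x T_Main */
    return io;
}
```

With a Main VCPU period of 10 ms and a bound of 5%, the IO VCPU inherits the 10 ms period and receives a 500 microsecond budget, so the bottom half is scheduled at the same rate-monotonic priority as the task that issued the request.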
The mapping of tasks and interrupt handlers to Main and IO VCPUs, respectively, allows the RTOS to use different policies for VCPU time management. While Main VCPUs default to Sporadic Servers, IO VCPUs implement a simpler scheme. They dynamically derive a budget from the Main VCPUs they serve, as this avoids the overhead of maintaining replenishment lists for short-lived interrupt service routines (ISRs). This budget is eligible for use as long as the IO VCPU's sustained utilization does not exceed $U_{IO}$. This policy is shown to be effective for short-lived ISRs, which would otherwise fragment a Sporadic Server budget.
Boomerang's RTOS requires reprogramming of hardware timers in one-shot mode, to determine the next system event. This is similar to Linux's tickless operation. As IO VCPUs only have one budget replenishment to consider, rather than a list, this leads to reduced timer reprogramming overhead.
Note that a traditional multiplexing hypervisor manages the execution of VCPUs on behalf of all guests. In Boomerang, VCPUs exist only in the RTOS. A non-real-time (legacy) OS running in a separate guest domain is able to schedule its own tasks directly on its assigned PCPUs.
Communication Model
Control flow in Boomerang is influenced by the path of data, which originates and ends with a device. Data flow involves a pipeline of communicating tasks. Each task processes its input data to produce output, either for devices or subsequent tasks in the pipeline. This leads to a communication model characterized by: (1) the interarrival times of tasks in the pipeline, (2) inter-task buffering, and (3) each task's access pattern to communication buffers.
Periodic versus Aperiodic Tasks. Aperiodic tasks have irregular interarrival times, influenced by the arrival of data. Similarly, interrupts occur when devices complete I/O.
Boomerang assumes that devices generate interrupts with a minimum inter-arrival time necessary for pipeline processing to complete within specific throughput, delay and loss bounds. For this to occur, tasks are either configured to run periodically, using whatever data is available at their inputs, or they must run aperiodically when new data is available. The semantics of task execution depends on whether the pipeline works only with the freshest data (e.g., from a sensor), or whether it must use all input data. Either way, a task pipeline's timing requirements assume that data will propagate with a minimum inter-arrival time between tasks.
Register-based versus FIFO-based Communication. A FIFO-based shared buffer is used in scenarios where data history is an important factor. However, in sensor-data processing the most recent data is often more important. For example, a driving assistance system should always compute outputs that affect vehicle dynamics from the latest sensor data. FIFO-based communication results in loosely synchronous communication: the producer is suspended when the FIFO buffer is full and the consumer is suspended when the buffer is empty. Register-based communication achieves fully asynchronous communication between two communicating parties using Simpson's four-slot algorithm [39] .
Implicit versus Explicit Communication. Explicit communication allows access to shared data at any time point during a task's execution. This might lead to data inconsistency in the presence of task preemption. A task that reads the same shared data at the beginning and the end of its execution might see two different values, if it is preempted between the two reads by another task that changes the value of the shared data. Conversely, the implicit communication model [18] essentially follows a read-before-execute paradigm to avoid data inconsistency. It mandates a task to make a local copy of the shared data at the beginning of its execution and to work on that copy throughout its execution. Simpson's four-slot algorithm [39] ensures the read and write stages avoid data corruption without blocking.
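The difference between the two models can be seen in a small single-threaded sketch, in which a callback stands in for a preempting writer. The type sensor_t, the hook mechanism and the function names are hypothetical; a real system would also need the copy itself to be protected, for example by the four-slot scheme mentioned above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical shared data object. */
typedef struct {
    int speed;
} sensor_t;

/* Stand-in for a higher-priority task that preempts and updates the data. */
typedef void (*preempt_hook_t)(sensor_t *shared);

void writer_bumps_speed(sensor_t *shared)
{
    shared->speed += 10;
}

/* Explicit model: the task reads the shared object whenever it needs it.
 * Returns true if the two reads observed the same value. */
bool explicit_task_is_consistent(sensor_t *shared, preempt_hook_t preempt)
{
    assert(shared != NULL);
    int first = shared->speed;    /* read at the start of execution  */
    if (preempt != NULL)
        preempt(shared);          /* writer runs in between          */
    int second = shared->speed;   /* read at the end of execution    */
    return first == second;
}

/* Implicit model: read-before-execute. The task copies the shared object
 * once and only touches the local copy afterwards. */
bool implicit_task_is_consistent(sensor_t *shared, preempt_hook_t preempt)
{
    assert(shared != NULL);
    sensor_t local = *shared;     /* single copy at the start        */
    int first = local.speed;
    if (preempt != NULL)
        preempt(shared);          /* writer changes only the original */
    int second = local.speed;
    return first == second;
}
```

Calling both functions with writer_bumps_speed as the hook shows the explicit task observing two different values, while the implicit task sees one consistent snapshot even though the shared object has changed underneath it.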
Boomerang supports both periodic and aperiodic tasks. It also supports both register- and FIFO-based communication. Implicit communication is enforced for data consistency.
Boomerang
The Boomerang partitioning hypervisor divides processor cores, physical memory and I/O devices among guest domains. Each guest then manages its physical resources without involvement of the hypervisor. This has two important properties: (1) the hypervisor is only used to bootstrap the system and to establish secure communication channels between guests using hardware extended page tables (EPTs) 2 , and (2) the hypervisor is removed from runtime resource management of the physical machine, making its trusted code base extremely small.
Boomerang's partitioning hypervisor has a text segment of less than 4KB, although more space is needed for EPTs (e.g., 24KB for a 4GB guest). Given the hypervisor is not accessed under normal guest operation, the system's most privileged ring of protection is less susceptible to security attacks than a conventional OS image running directly on hardware. In the latter case, system calls must pass control to the OS kernel, whereas in Boomerang these are restricted to the local guest.
Unlike traditional hypervisors that multiplex guests onto the same shared physical machine, partitioning hypervisors offer opportunities for applications that require security and timing predictability. Hardware virtualization features isolate guests, using an additional ring of protection reserved for the hypervisor. At the same time, time-critical guests are able to run real-time resource management policies without being compromised by additional resource management policies in the hypervisor.
We see partitioning hypervisors as being suitable for mixed-criticality systems, requiring spatial and temporal isolation of application tasks and software components according to different system criticality levels. For example, automotive systems adhering to standards such as ISO 26262 [19] are required to meet specific functional safety requirements, according to several classes of automotive safety integrity levels known as ASIL A-D. Software certified to ASIL D standard operates at the most stringent safety level, where the risk of failure is potentially life threatening. In contrast, ASIL A applies to software that has a very low probability of significant human injury even during failures. Other standards such as ARINC 653 and DO-178B have similar requirements for avionics systems. For these types of systems, it is possible to assign software to machine partitions according to their safety integrity levels.
Figure 2. A tuned pipe.
Figure 2 shows a logical representation of a single tuned pipe (a.k.a., tpipe). A pipe has one pipe processor and two endpoints, with one endpoint for input and the other for output. A pipe processor is represented by a VCPU, guaranteeing at least C units of execution time every T time units when runnable. Pipe processors are associated with tasks bound to Main VCPUs, or threaded interrupt handlers bound to IO VCPUs.
Composable Tuned Pipes
A tuned pipe guarantees data flowing from an input to an output endpoint is processed according to specific service requirements. These requirements apply end-to-end, through a pipeline of one or more tuned pipes. If the pipeline is lossless, it ensures specific throughput and delay guarantees, whereas if it is lossy, it guarantees a maximum fraction of lost data while meeting delay bounds.
Boomerang maintains a local repository for each guest OS (a.k.a., sandbox or machine partition), which stores information about available endpoints. The repository records a globally unique identifier for each endpoint, in the form: hostID:sandboxID:asID:epID. This distinguishes endpoints in different host machines (by hostID) 3 , sandboxes (by sandboxID), and address spaces (by asID). Access capabilities restrict which tuned pipes are able to connect to endpoints.
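A minimal sketch of handling identifiers of this form is shown below. The source does not state the type or width of each field, so the sketch assumes that all four are unsigned decimal integers; ep_id_t, ep_id_parse and ep_id_format are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical in-memory form of hostID:sandboxID:asID:epID.
 * Field widths are an assumption; the source does not specify them. */
typedef struct {
    unsigned host;
    unsigned sandbox;
    unsigned addr_space;
    unsigned ep;
} ep_id_t;

/* Parse "host:sandbox:as:ep". Returns true only if exactly four fields are
 * present with nothing following them; *out is left untouched on failure. */
bool ep_id_parse(const char *text, ep_id_t *out)
{
    assert(text != NULL && out != NULL);
    ep_id_t id;
    char trailing;
    int fields = sscanf(text, "%u:%u:%u:%u%c",
                        &id.host, &id.sandbox, &id.addr_space, &id.ep,
                        &trailing);
    if (fields != 4)
        return false;
    *out = id;
    return true;
}

/* Format an identifier into buf. Returns the number of characters that the
 * full identifier needs, as snprintf does. */
int ep_id_format(const ep_id_t *id, char *buf, size_t len)
{
    assert(id != NULL && buf != NULL);
    return snprintf(buf, len, "%u:%u:%u:%u",
                    id->host, id->sandbox, id->addr_space, id->ep);
}
```

Under these assumptions, the string 0:1:7:42 would name endpoint 42 in address space 7 of sandbox 1 on host 0.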
The rules controlling connectivity to endpoints are a topic of ongoing research. They have implications for secure information flow analysis [8, 16, 55] , which is outside the scope of this paper. Notwithstanding, pipelines may be constructed within a single address space, between address spaces in the same machine partition, between different partitions on the same host, and across different hosts.
When creating a tuned pipe, Boomerang automatically calculates (i.e., tunes) the budget and period of the pipe VCPU to ensure end-to-end guarantees are met. Tuned pipes are created with a call to tpipe(), as follows:
tpipe_id_t tpipe(ep_t *inp[], int n_inp, ep_t *outp, qos_t spec, tpipe_task_t func, void *arg);
3 In this paper, we restrict communication within the same host machine.
The input endpoint of the new tuned pipe specifies an array of pointers, inp, to endpoint types. This array identifies the endpoint addresses of n_inp inputs to the tuned pipe, along with the buffering semantics of each input, which will be discussed in Section 3.1.1.
Data flowing into the tuned pipe is processed by a specific callback function (func), which sends its output to specific destinations connected to the output endpoint, identified by outp. The callback function takes an optional argument (arg), and runs in its own thread context. The thread context defines a task, which is bound to a VCPU having an automatically-generated budget, $C_i$, and period, $T_i$, for the tuned pipe, tpipe_i. The budget and period are derived from the quality-of-service (QoS) requirement (spec) for end-to-end throughput and delay on data processing. This requirement must also satisfy the schedulability of all VCPUs on a given physical CPU (PCPU), otherwise the tuned pipe is not created. If a tuned pipe is successfully created, it is given a unique ID within its guest OS.
tpipe_i requires its callback function to process data from one or more input endpoints and produce output in one quantum of size $C_i$, every period, $T_i$. Functions are selected from a predefined repository of callbacks. Each callback has a known worst-case execution time (WCET) based on preprofiled timing information to handle a maximum of $I_i$ inputs and produce up to $O_i$ outputs in one quantum. The actual amount of processing in a quantum depends on the availability of data in input buffers, and how many outputs need to be written.
Each function in the repository declares the allowable buffering capabilities for its inputs and outputs. Any tuned pipe connecting to another with a function that does not match the allowed buffering capabilities is rejected.
POSIX Pipes versus Tuned Pipes
Similarities exist between a pair of tuned pipes and a single POSIX pipe. The latter provides a shared memory buffer that is accessible to a group of communicating threads via file descriptors. The file descriptors describe the endpoint capabilities, including whether the pipe is readable or writable.
A tuned pipe pair in Boomerang differs from a POSIX pipe by capturing the timing requirements for data processing and communication. It also defines the buffering semantics for I/O endpoints. Figure 3(a) shows a two-stage pipeline with fully asynchronous (RT_ASYNC) communication between tpipe_1 and tpipe_2. In this case, Simpson's Four Slot buffering scheme [38, 39] is used to allow the two pipe threads to execute independently of one another. Four Slot communication guarantees freshness and integrity of data objects exchanged between a producer and consumer, without the sender or receiver ever having to block. Freshness guarantees the most recent value of a data object is made available. This is important in sensor data processing, where the latest sensor readings are more important than older, potentially stale values. Integrity ensures a data object is not partially updated before the previous object has been read in its entirety.
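For readers unfamiliar with the four-slot scheme, the following is a minimal sketch of Simpson's algorithm for one writer and one reader. It is not Boomerang's implementation: the payload type msg_t and the function names are hypothetical, and C11 sequentially consistent atomics are assumed to be an adequate stand-in for the atomic control bits that the algorithm requires.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical payload; any fixed-size message type would do. */
typedef struct {
    int value;
} msg_t;

typedef struct {
    msg_t      data[2][2];  /* two pairs of two slots                */
    atomic_int slot[2];     /* most recently written slot per pair   */
    atomic_int latest;      /* pair that holds the freshest message  */
    atomic_int reading;     /* pair the reader has announced         */
} four_slot_t;

void four_slot_init(four_slot_t *fs, msg_t initial)
{
    assert(fs != NULL);
    for (int p = 0; p < 2; p++) {
        for (int s = 0; s < 2; s++)
            fs->data[p][s] = initial;
        atomic_init(&fs->slot[p], 0);
    }
    atomic_init(&fs->latest, 0);
    atomic_init(&fs->reading, 0);
}

/* Writer side: never blocks. It avoids the pair the reader announced and,
 * within the chosen pair, avoids the slot that was written last. */
void four_slot_write(four_slot_t *fs, msg_t item)
{
    int pair  = !atomic_load(&fs->reading);
    int index = !atomic_load(&fs->slot[pair]);
    fs->data[pair][index] = item;
    atomic_store(&fs->slot[pair], index);
    atomic_store(&fs->latest, pair);
}

/* Reader side: never blocks. It announces the freshest pair, then reads the
 * slot of that pair that was written last. */
msg_t four_slot_read(four_slot_t *fs)
{
    int pair = atomic_load(&fs->latest);
    atomic_store(&fs->reading, pair);
    int index = atomic_load(&fs->slot[pair]);
    return fs->data[pair][index];
}
```

If the writer stores 2 and then 3 before the reader runs, the next read returns 3, and a further read without an intervening write returns 3 again. This is the freshness property described above: intermediate values may be lost, but neither side waits.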
In contrast, Figure 3(b) shows a two-stage pipeline with semi-asynchronous (RT_FIFO) communication. This scheme uses a classic ring buffer to pass data without loss between the sender and receiver. However, the sender must block when the buffer is full, and the receiver must block when the buffer is empty. This places a timing dependency on producers and consumers, which potentially violates end-to-end timing guarantees unless data flow rates are managed correctly.
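The two blocking conditions can be made explicit with a minimal ring buffer sketch. It is single-threaded and reports "would block" through its return value instead of suspending the caller; the names fifo_t, fifo_try_send and fifo_try_receive and the buffer size are hypothetical, and a real RT_FIFO buffer would need synchronization between sender and receiver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define FIFO_SLOTS 4   /* illustrative capacity only */

/* Hypothetical message type, one per buffer slot. */
typedef struct {
    int value;
} fifo_msg_t;

typedef struct {
    fifo_msg_t slots[FIFO_SLOTS];
    size_t     head;    /* index of the oldest unread message */
    size_t     count;   /* number of unread messages          */
} fifo_t;

void fifo_init(fifo_t *f)
{
    assert(f != NULL);
    f->head = 0;
    f->count = 0;
}

/* Returns false where a blocking sender would be suspended (buffer full). */
bool fifo_try_send(fifo_t *f, fifo_msg_t m)
{
    assert(f != NULL);
    if (f->count == FIFO_SLOTS)
        return false;
    f->slots[(f->head + f->count) % FIFO_SLOTS] = m;
    f->count++;
    return true;
}

/* Returns false where a blocking receiver would be suspended (buffer empty). */
bool fifo_try_receive(fifo_t *f, fifo_msg_t *out)
{
    assert(f != NULL && out != NULL);
    if (f->count == 0)
        return false;
    *out = f->slots[f->head];
    f->head = (f->head + 1) % FIFO_SLOTS;
    f->count--;
    return true;
}
```

Every message sent is eventually received in order, so nothing is lost, but each false return marks a point where one stage would have to wait for the other. Those waits are the timing dependency that the flow rates must be tuned to avoid.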
Device versus Task Pipes
Boomerang's RTOS provides a pre-defined set of tuned pipes for all devices involved in real-time I/O. A device pipe features an IO VCPU for interrupt handling, and an optional Main VCPU for endpoint buffer management of shared devices. Sharing requires scatter-gather functions to move data between the device endpoint buffer and pipe-specific buffers of task pipes. If a device is not shared, its handler directly accesses the buffer of a specific task pipe.
The tpipe() call, described earlier, creates a task pipe. Unlike a device pipe, there is no IO VCPU for interrupt handling. Task pipes form pipelines between device pipes that act as the sources and sinks of input and output data, respectively. 
Pipeline Construction
Pipelines of tuned pipes are constructed in the order in which data flows, from input to output. A tuned pipe is responsible for the creation of all buffers that connect to its input endpoint. It also declares its output endpoint, which includes a count of the number of outputs it handles. A pipeline is incomplete until all $I_i$ inputs and $O_i$ outputs of each tpipe_i are connected.
The output endpoint of each task pipe has a connection to a default device pipe, which could be a null device. A system call interface allows this output endpoint to be redirected to one or more different device pipes.
Once fully connected, the system activates the pipeline by allowing each tpipe task to be scheduled for execution. Those tasks that execute in the RTOS are runnable when they have available budgets on their corresponding VCPUs. Tuned pipe tasks that execute in Linux are runnable when they have available budgets in their SCHED_DEADLINE scheduling class. Linux's SCHED_DEADLINE scheduling class uses a Constant Bandwidth Server [2] to limit the maximum CPU bandwidth consumed by a task within a specific period. The end of the period is used to define the task deadline, and all tasks are scheduled earliest deadline first. However, interrupt handlers are not managed in this scheme.
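On the Linux side, a budget and period can be requested for a thread through the sched_setattr system call. The sketch below shows one way this might look and is not taken from Boomerang's code. The structure dl_sched_attr mirrors the kernel's sched_attr layout under a different name, to avoid clashing with C libraries that already declare it, and the helper names are hypothetical. The deadline is set equal to the period, following the description above.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Numeric value of SCHED_DEADLINE in the Linux user-space ABI. */
#define POLICY_SCHED_DEADLINE 6

/* Same field layout as the kernel's struct sched_attr (48 bytes). */
struct dl_sched_attr {
    uint32_t size;
    uint32_t sched_policy;
    uint64_t sched_flags;
    int32_t  sched_nice;
    uint32_t sched_priority;
    uint64_t sched_runtime;   /* budget, in nanoseconds   */
    uint64_t sched_deadline;  /* deadline, in nanoseconds */
    uint64_t sched_period;    /* period, in nanoseconds   */
};

/* Build a reservation of budget_us every period_us, with the deadline at
 * the end of the period. */
struct dl_sched_attr dl_reservation(uint64_t budget_us, uint64_t period_us)
{
    assert(budget_us > 0 && budget_us <= period_us);
    struct dl_sched_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.size           = sizeof attr;
    attr.sched_policy   = POLICY_SCHED_DEADLINE;
    attr.sched_runtime  = budget_us * 1000;
    attr.sched_deadline = period_us * 1000;
    attr.sched_period   = period_us * 1000;
    return attr;
}

/* Apply the reservation to the calling thread. Returns 0 on success or -1
 * with errno set; this normally requires elevated privileges and passes the
 * kernel's own admission control. */
int dl_apply_to_self(const struct dl_sched_attr *attr)
{
    assert(attr != NULL);
    return (int)syscall(SYS_sched_setattr, 0, attr, 0);
}
```

A tuned pipe task with a 2 ms budget every 10 ms would build its reservation with dl_reservation(2000, 10000) and then call dl_apply_to_self. As noted above, this bounds the task's CPU bandwidth but says nothing about the interrupt handlers that run on the same core.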
Boomerang runs our in-house RTOS in one sandbox, and Linux in another sandbox on the same physical machine. A Linux kernel module maps a secure shared memory region by calling into the hypervisor. The hypervisor uses EPTs to map machine physical memory into each sandbox so they are able to communicate.
Each sandbox is equipped with kernel services that manage a local repository of endpoints and tuned pipes. Communication services allow queries to a remote sandbox, to discover endpoints and to connect or disconnect from tuned pipes. Mailbox channels are established by Boomerang to enable OSes in different sandboxes to send remote OS requests. Access policies determine whether address spaces in the local or remote sandbox are able to connect to endpoints of existing tuned pipes.
Boomerang's RTOS provides a remote shell to Linux through inter-sandbox shared memory. Linux uses a kernel module to allow user-space application programs with root privilege to execute shell commands on the RTOS. A shell interface allows pipelines of tuned pipes to be constructed. The RTOS is able to query endpoints and tuned pipes that exist in Linux, and issue requests to connect to them via tpipe() calls.
After the construction of the pipeline, the RTOS runs an end-to-end throughput and delay analysis. If the end-to-end requirements are met for the pipeline, the transmission of data is allowed to begin from the RTOS. Tuned pipe functions synchronize their start and end of operation life-cycle using Start-Of-Task and End-Of-Task packets on their input endpoints.
The following example illustrates the specification of a pipeline:
The resultant pipeline is shown in Figure 5 . Boomerang's repository of tuned pipe functions requires that A and C connect to a device output endpoint for reading, while E and F connect to a device input endpoint for writing. Boomerang defaults to non-blocking tuned pipe semantics, where data freshness is more important than lossless communication. Figure 5 shows four-slot buffering of all pipeline stages. If lossless communication is required, the entire pipeline specification is preceded by an asterisk. This pipeline would then use FIFO buffers between each pair of tuned pipes.
With four-slot buffering, the entire pipeline has an optional end-to-end service specification in terms of tolerable loss_rate and e2e_delay. With FIFO buffering, the pipeline is specified with an optional end-to-end throughput, e2e_tput, and delay. The throughput is measured as the minimum number of data objects per unit time exiting a final tuned pipe, while the delay is measured in microseconds. Each data object represents a message, which is the size of one slot of either a four-slot or FIFO buffer.
If the QoS specification is omitted, then the pipeline defaults to best effort. In such case, the VCPUs of each tuned pipe revert to their default values. If the pipeline overloads the PCPUs to which it is assigned, leading to an infeasible schedule, its VCPU periods are repeatedly extended until the pipeline is schedulable.
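The fallback for an overloaded best-effort pipeline can be sketched as a simple loop. The source does not say by how much periods are extended or which schedulability bound is applied, so the sketch assumes that every period is doubled in each round and that the pipeline is acceptable once its total utilization falls to a caller-supplied bound. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical best-effort VCPU parameters, in microseconds. */
typedef struct {
    uint64_t budget_us;
    uint64_t period_us;
} be_vcpu_t;

/* Total utilization of the pipeline in parts per million. */
static uint64_t total_utilization_ppm(const be_vcpu_t *v, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += v[i].budget_us * 1000000u / v[i].period_us;
    return sum;
}

/* Double every period until the total utilization is within bound_ppm.
 * Doubling is an assumption made for this sketch. Returns the number of
 * extension rounds that were needed. */
unsigned extend_until_schedulable(be_vcpu_t *v, size_t n, uint64_t bound_ppm)
{
    assert(v != NULL && n > 0 && bound_ppm > 0);
    for (size_t i = 0; i < n; i++)
        assert(v[i].period_us > 0);

    unsigned rounds = 0;
    while (total_utilization_ppm(v, n) > bound_ppm) {
        for (size_t i = 0; i < n; i++)
            v[i].period_us *= 2;   /* same budget, half the utilization */
        rounds++;
    }
    return rounds;
}
```

Three pipes with budgets of 4, 5 and 6 ms in a 10 ms period have a combined utilization of 150%. One round of doubling brings this to 75%, at the cost of halving each pipe's rate, which is acceptable only because no QoS was requested.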
The shell interpreter allows parallel sections of a pipeline to be defined by comma-separated lists of tuned pipes. Here, the pipeline section A | B runs parallel with C. This could be representative of two separate input sensor streams coming from two different devices. Parentheses ensure correct grouping of pipeline sequences, while two tuned pipes are connected using the shell vertical bar symbol (|).
In the example, the outputs of B and C feed into the single tuned pipe, D. Similarly, the output of D is split across E and F . D might represent a sensor fusion and control task, while E and F might be specific actuator tasks that output their data to different devices. In an automotive system, for example, E and F might send their outputs to two different CAN buses, managed by device pipes.
The e2e_delay constraint applies to the longest path through the pipeline, while the e2e_tput applies to whichever final task pipe has the lowest rate of output. In the figure, whichever of E and F has the lowest output rate would dictate the end-to-end throughput.
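Finding the path to which the delay constraint applies can be sketched as a longest-path computation over the pipeline graph. The sketch takes each stage's contribution to be its VCPU period, as in the delay analysis of the QoS section, and ignores blocking delays. The node layout, the function name and the periods used in the example are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PIPES 16   /* illustrative limit only */

/* Hypothetical pipeline node: one tuned pipe and its upstream neighbours. */
typedef struct {
    uint64_t period_us;          /* T_i of this tuned pipe              */
    size_t   n_preds;            /* number of upstream tuned pipes      */
    size_t   preds[MAX_PIPES];   /* their indices, all smaller than i   */
} pipe_node_t;

/* Nodes must be listed in data-flow order, so every predecessor appears
 * before its successors. Returns the largest sum of periods along any
 * path from a source to a sink. */
uint64_t worst_case_delay_us(const pipe_node_t *p, size_t n)
{
    assert(p != NULL && n <= MAX_PIPES);
    uint64_t path[MAX_PIPES];   /* longest path ending at each node */
    uint64_t worst = 0;

    for (size_t i = 0; i < n; i++) {
        uint64_t upstream = 0;
        for (size_t k = 0; k < p[i].n_preds; k++) {
            size_t j = p[i].preds[k];
            assert(j < i);
            if (path[j] > upstream)
                upstream = path[j];
        }
        path[i] = upstream + p[i].period_us;
        if (path[i] > worst)
            worst = path[i];
    }
    return worst;
}
```

For the example pipeline, with assumed periods of 10, 20, 50, 10, 5 and 40 ms for A to F, the longest path is C, D, F with a bound of 100 ms, even though the A, B, D branch contains more stages.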
As a four-slot buffered pipeline allows each tuned pipe to read and process whatever data sits in its input buffers, it is possible that new data has overwritten old data before the consumer runs. This happens if the producer has an arrival rate, $\lambda = 1/T_p$, greater than the consumer's service rate, $\mu = 1/T_c$. Here, it is assumed that $T_p$ and $T_c$ are set to ensure one message transfer every corresponding period, regardless of whether it is a new message or not.
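Under the stated assumption of one transfer per period, the fraction of produced messages that are overwritten before being read follows directly from the two rates. The derivation below is a sketch; the symbol $\ell$ for this fraction and the numerical values are introduced here for illustration only.

```latex
\ell \;=\; \frac{\lambda - \mu}{\lambda}
     \;=\; 1 - \frac{\mu}{\lambda}
     \;=\; 1 - \frac{T_p}{T_c},
\qquad T_p \le T_c .
% Example: T_p = 10 ms and T_c = 25 ms give \ell = 1 - 10/25 = 0.6,
% i.e. the consumer sees two of every five messages produced.
```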
End-to-end QoS Guarantees
Given a pipeline of tuned pipes and buffers, Boomerang runs a constraint solver to determine $C_i$ and $T_i$ for each tpipe_i. The function executed by tpipe_i is assumed to process at least one of its $I_i$ inputs and generate one of its $O_i$ outputs every period, $T_i$. Essentially, one or more processed data messages propagate through a tuned pipe within $C_i$ execution time. Boomerang assumes that $C_i$ is derived by preprofiling the WCET of the corresponding task function. This WCET is then stored in the local repository, along with the set of inputs and outputs used by the function.
For a pipeline to successfully meet its end-to-end timing requirements, Boomerang must still determine each period, $T_i \mid T_i > C_i$, and possibly scale each service time $C_i$ to forward more than one message at a time. It follows that a FIFO buffered pipeline successfully meets its end-to-end timing requirements if:
, where m i ≥1 messages are transferred by tpipe i every C i , 3. all FIFO buffers are sized to ensure no additional blocking delays of tasks, and 4. all task scheduling constraints are satisfied on their respective PCPUs.
Similarly, a pipeline with four-slot buffering meets its end-to-end requirements if:
1. $\sum_{i \in l} T_i \le$ e2e_delay, for the longest path $l$,
2. $\max\{1 - T_p / T_c\} \le$ loss_rate, for all $T_p \le T_c$, and
3. all task scheduling constraints are satisfied on their respective PCPUs.
The end-to-end delay represents the time for a message to traverse the longest path through a pipeline. The final message output from the pipeline is a transformation of data propagated through each tuned pipe. The worst-case end-to-end delay is the sum of all the periods of the tuned pipes in the longest path, plus any blocking delays. The blocking delays are zero with asynchronous communication as each tuned pipe processes its most recent data, regardless of it being updated. Similarly, blocking delays are avoided with FIFO-based communication if each buffer is never empty or totally full.
It follows that each tuned pipe propagates a message after $C_i$ worst-case execution time. However, if data arrives at the inputs to a tuned pipe when it has just depleted its budget, it must wait $T_i - C_i$ before the budget is replenished. If the next tuned pipe in a pipeline is not synchronized to start exactly when the previous tuned pipe forwards its data, there could be an additional delay of $T_i - C_i$ on top of $C_i$ to process the data in tpipe $i$.
To see this more clearly, consider a system of $T$ tasks, each with a service time of 1 time unit every $T$. Suppose two of these tasks are associated with tpipe 1 and tpipe 2. Input data $D_{in}$ to tpipe 1 is processed and forwarded to tpipe 2, which produces $D_{out}$. These two tuned pipes form a pipeline, while all other tasks compete for execution on the same PCPU. Using either rate monotonic or earliest deadline first scheduling [24] yields the same schedule in this case: neglecting scheduling overheads, each task has the same priority. A possible schedule is shown in Figure 6. The worst-case end-to-end delay is when each of the $T - 2$ tasks other than those for tpipe 1 and tpipe 2 run immediately after the data, $D_{in}$, has arrived. Then, tpipe 2 executes and processes old input data before tpipe 1 is able to read $D_{in}$. Consequently, tpipe 1 does not process $D_{in}$ and forward the output to tpipe 2 until $T$ time after the data first arrived. Similarly, tpipe 2 is not able to run again until $2T - 2$, when it finally reads the data derived from $D_{in}$. This is because the scheduler will not provide it with a budget replenishment until one period after it last executed. The total end-to-end delay between $D_{out}$ and $D_{in}$ is therefore $2T - 1$. For large $T$ this approaches a worst-case delay of $2T$. Extending this to more than two tasks in a pipeline leads to the worst-case end-to-end delay being the sum of the corresponding tuned pipe periods.
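The arithmetic of this example can be replayed with a short sketch. The function name and timeline variables are illustrative assumptions rather than part of Boomerang. The sketch hard-codes the worst-case ordering described above, with the input data arriving at time 0 and every task having a budget of 1 time unit.

```python
def worst_case_two_stage_delay(T: int) -> int:
    """Replay the worst-case schedule for T unit-budget tasks of period T.

    Hypothetical helper: assumes integer T >= 2, a budget of 1 time unit
    per task, and that the input data arrives at time 0.
    """
    if T < 2:
        raise ValueError("T must cover at least tpipe 1 and tpipe 2")
    budget = 1
    # The T-2 unrelated tasks run first, right after the data arrives.
    others_finish = (T - 2) * budget
    # tpipe 2 runs next and processes stale input.
    tpipe2_stale_start = others_finish
    tpipe2_stale_finish = tpipe2_stale_start + budget      # T - 1
    # Only now does tpipe 1 read the new input and forward its output.
    tpipe1_finish = tpipe2_stale_finish + budget           # T
    # tpipe 2 is replenished one period after it last started.
    tpipe2_replenished = tpipe2_stale_start + T            # 2T - 2
    tpipe2_start = max(tpipe2_replenished, tpipe1_finish)
    return tpipe2_start + budget                           # 2T - 1


delay = worst_case_two_stage_delay(10)
```

For any period of at least 2 time units the result equals 2T − 1, which stays below the sum of the two periods, 2T.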
The end-to-end throughput of a path through a FIFO buffered pipeline is limited by the minimum output rate of any one tuned pipe in that path. A tuned pipe's output rate is how many messages it is able to forward in its period. As FIFO buffering allows tpipe $i$ to forward $m_i \ge 1$ messages per period, the minimum value of $m_i / T_i$ for all $i$ is a lower bound on overall throughput, and must satisfy $\min_{\forall i}\{m_i / T_i\} \ge$ e2e_tput.
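As a minimal sketch of this bound, the helper below takes the per-stage message counts and periods along one path and returns the smallest output rate. The function name and the example values are hypothetical. Periods are assumed to be integers in a common time unit so that exact rational arithmetic can be used.

```python
from fractions import Fraction


def path_throughput_bound(stages):
    """Return min(m_i / T_i) over the stages of one path.

    stages: list of (m_i, T_i) pairs, where m_i >= 1 is the number of
    messages forwarded per period and T_i is the integer period.
    """
    if not stages:
        raise ValueError("a path needs at least one tuned pipe")
    return min(Fraction(m, T) for m, T in stages)


# Hypothetical three-stage path: (messages per period, period).
example_path = [(1, 5), (2, 10), (1, 4)]
bound = path_throughput_bound(example_path)
meets_requirement = bound >= Fraction(1, 5)  # assumed e2e_tput of 1 message per 5 time units
```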
For any pair of tuned pipes connected via FIFO buffers, it is essential that blocking delays are factored into the end-to-end service guarantees. Boomerang tries to avoid blocking on message exchanges by matching the arrival and departure rates of messages passed through shared FIFO buffers.
Suppose a producing and consuming pair of tuned pipes have budgets $C_p$ and $C_c$, respectively. Given $C_p = C_{in}$ is sufficient to produce one message in $T_p$, and $C_c = C_{out}$ is sufficient to consume one message in $T_c$, Boomerang starts by setting $T_p = T_c = \Delta$, where $\Delta \cdot n =$ e2e_delay, and $n$ is the number of tuned pipes in the longest path. This ensures the producer and consumer are rate-matched, to prevent the FIFO buffer between them either filling to capacity, or completely emptying.
Rate-matching is applied to all tuned pipes in the pipeline. If the pipeline cannot feasibly be scheduled on its PCPUs, each tuned pipe period is scaled by a factor α, where α > 1. This is repeated until all tuned pipes are schedulable, but leads to a violation of the end-to-end latency requirement.
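The initial rate-matching and scaling steps can be sketched as follows. This is an illustration under stated assumptions, not Boomerang's solver: the function name is hypothetical, all tuned pipes are assumed to share one PCPU, and "schedulable" is approximated by a total utilization of at most 1 together with each period exceeding its budget.

```python
def rate_match_period(budgets, e2e_delay, alpha=2.0):
    """Pick a common period for the tuned pipes on the longest path.

    budgets:   per-stage budgets C_i (one message each), same time unit
    e2e_delay: end-to-end delay target
    alpha:     scaling factor, must be > 1
    Assumption: a single PCPU with a utilization bound of 1.
    """
    if alpha <= 1:
        raise ValueError("alpha must be greater than 1")
    if not budgets or min(budgets) <= 0:
        raise ValueError("budgets must be positive")
    n = len(budgets)
    period = e2e_delay / n  # start from delta, where delta * n = e2e_delay

    def schedulable(t):
        return all(c < t for c in budgets) and sum(c / t for c in budgets) <= 1.0

    # Stretch every period by alpha until the set becomes schedulable.
    while not schedulable(period):
        period *= alpha
    return period


feasible = rate_match_period([1, 1, 1], e2e_delay=30)   # delta = 10 is already schedulable
stretched = rate_match_period([1, 1, 1], e2e_delay=6)   # delta = 2 overloads the PCPU
```

In the second call the returned period exceeds the initial value, so the sum of periods overshoots the delay target, which is the violation the subsequent period adjustment is meant to repair.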
To reduce end-to-end latency, Boomerang adjusts tuned pipe periods, starting with the inputs to the pipeline. For each tuned pipe pair, $T_p$ is repeatedly halved and $C_c$ is similarly doubled, ensuring that $T_p > C_p$, $T_c > C_c$ and all VCPUs are schedulable when possible. The doubling of $C_c$ enables it to process multiple messages, $m_c$, in one budget cycle. $T_p$ is reduced until either the entire pipeline meets its end-to-end delay requirement or it is set as low as feasibly possible. If $\sum_{i \in l} T_i \le$ e2e_delay for longest path $l$, the algorithm stops, or else it moves on to the next stage in the path, and repeats the above procedure.
If all stages of the pipeline have been processed from input to output, the algorithm revisits each consumer whose budget is set to process multiple messages in one period. For each consumer, both $C_c$ and $T_c$ are halved, as long as $C_c$ is no smaller than the time to process one message. If the path's e2e_delay is satisfied, or tuned pipe periods and budgets cannot be reduced further, the algorithm stops. At this point each $C_p = m_p \cdot C_{in}$ and each $C_c = m_c \cdot C_{out}$, for $m_p, m_c \ge 1$.
If a feasible schedule for the pipeline is found, each FIFO buffer is set to have enough space for $2 \times \lceil \frac{m_c \cdot T_c}{m_p \cdot T_p} \rceil$ messages output from the producer. The factor of two accounts for the potential phase-shifted processing of messages by the producer and consumer, ensuring that the buffer is never empty or full, but instead operates with approximately half its maximum occupancy.
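A minimal sketch of this sizing rule follows. It reads the expression as twice the ceiling of the fraction (m_c · T_c) / (m_p · T_p), which is an interpretation of the formula as printed. The function name and example values are hypothetical, and periods are assumed to be positive integers in a common time unit.

```python
def fifo_slots(m_p, T_p, m_c, T_c):
    """Buffer size in messages: 2 * ceil((m_c * T_c) / (m_p * T_p)).

    m_p, m_c: messages handled per period by producer and consumer
    T_p, T_c: producer and consumer periods (positive integers)
    """
    if min(m_p, T_p, m_c, T_c) <= 0:
        raise ValueError("all arguments must be positive")
    numerator = m_c * T_c
    denominator = m_p * T_p
    # Integer ceiling division avoids floating-point rounding surprises.
    ceiling = -(-numerator // denominator)
    return 2 * ceiling


matched = fifo_slots(m_p=1, T_p=5, m_c=1, T_c=5)      # identical rates and periods
slower_consumer = fifo_slots(m_p=1, T_p=4, m_c=1, T_c=5)
```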
For four-slot communication, if the consumer has a smaller period than a producer at any stage in the pipeline, then the consumer will always see the most recent data. Given that four-slot communication restricts each tuned pipe to read, process and write one message every period, it is impossible for a pipeline to lose any data if all consumer periods are smaller than their corresponding producer periods. However, if a consumer has a larger period than its producer, such that $T_c > T_p$, then the producer may overwrite data before the consumer sees the previous message. It follows that the loss-rate through a four-slot pipeline is limited to the maximum value of $1 - T_p / T_c$ of any stage in the pipeline. This is an important metric for sensor data processing, where the fraction of lost data must be constrained.
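The loss bound can be computed per stage and maximized over the pipeline, as in the sketch below. The function name and the example periods are hypothetical. Stages whose consumer period does not exceed the producer period contribute no loss, as stated above.

```python
from fractions import Fraction


def four_slot_loss_bound(stages):
    """Return the maximum of 1 - T_p / T_c over all producer/consumer pairs.

    stages: list of (T_p, T_c) integer periods for each adjacent pair.
    Pairs with T_c <= T_p are treated as lossless.
    """
    worst = Fraction(0)
    for T_p, T_c in stages:
        if T_c > T_p:
            worst = max(worst, 1 - Fraction(T_p, T_c))
    return worst


# Hypothetical pipeline: only the first pair has a slower consumer.
loss = four_slot_loss_bound([(4, 5), (5, 5), (2, 1)])
within_budget = loss <= Fraction(20, 100)  # assumed loss_rate of 20%
```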
Irrespective of four-slot or FIFO-based communication, all VCPUs serving all tuned pipes in a pipeline must satisfy the system scheduling requirements. For $n$ tuned pipes scheduled using rate-monotonic scheduling, the scheduling constraint is satisfied if $\sum_{i=1}^{n} C_i / T_i \le n \cdot (2^{1/n} - 1)$. If earliest-deadline first scheduling is used, the scheduling constraint is satisfied if $\sum_{i=1}^{n} C_i / T_i \le 1$ on a single PCPU. Boomerang applies these constraints, including utilization bounds on IO VCPUs used by device pipes, to ensure pipeline schedulability. This holds for pipelines encompassing our RTOS and Linux SCHED_DEADLINE tasks.
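Both utilization tests are simple to express in code. The sketch below uses hypothetical function names and example task sets, and assumes all tasks share one PCPU. The rate-monotonic bound is a sufficient condition only, so a task set that fails it is not necessarily unschedulable.

```python
def utilization(tasks):
    """Total utilization of (C_i, T_i) pairs in the same time unit."""
    return sum(c / t for c, t in tasks)


def rms_schedulable(tasks):
    """Sufficient rate-monotonic test: U <= n * (2^(1/n) - 1)."""
    n = len(tasks)
    if n == 0:
        return True
    return utilization(tasks) <= n * (2 ** (1 / n) - 1)


def edf_schedulable(tasks):
    """Earliest-deadline-first test on a single PCPU: U <= 1."""
    return utilization(tasks) <= 1.0


light = [(1, 4), (1, 5), (2, 10)]   # utilization 0.65
heavy = [(2, 4), (2, 5)]            # utilization 0.9

light_rms = rms_schedulable(light)
heavy_rms = rms_schedulable(heavy)
heavy_edf = edf_schedulable(heavy)
```

The second task set exceeds the two-task rate-monotonic bound of roughly 0.828 but still passes the earliest-deadline-first test.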
Evaluation
We evaluated Boomerang on an Up Squared Single-board Computer (SBC), featuring an Intel Celeron N3350 processor with a speed up to 2.4 GHz. We connected a five-channel Kvaser USBCan Pro 5xHS CAN bus interface via USB 3.0. This setup emulated an automotive system that reads sensor inputs and writes actuator outputs on a CAN bus.
Traffic on channels 1-3 (CAN1-3) was produced by Woodward MotoHawk ECM5634-70 ECUs, as used in chassis and powertrain applications in a real vehicle. Each of these channels produced data at 20%, 30% and 40% of their 500kbps bandwidths, respectively. Channels 4 and 5 (CAN4-5) were replaced with Arduino UNOs [4] equipped with CAN shields, to collect performance data.
Two separate pipelines were constructed for CAN4 and CAN5, with thread budgets and periods shown in Table 1. These pipelines shared three device I/O threads: mhydra_rx and mhydra_tx for Kvaser USBCan scatter-gather functionality, and a USB xHCI bottom half handler (USB_BH). Pipeline 1 consists of three task pipes: CanRead, MLProcess and CanWrite. These read CAN data, perform machine learning, and write CAN data, respectively. Pipeline 2 consists of two task pipes: RTFusion and RTControl, for sensor data fusion and control, respectively.
Table 1. Pipeline details.
Given the above setup, we compared Boomerang to a tuned pipe implementation on a PREEMPT_RT-patched Yocto 4.9.99 SMP Linux system. The Boomerang partitioning hypervisor ran our RTOS in one sandbox, and Yocto Linux in another. In all cases xHCI device interrupts were mapped to Core 0, while all other device interrupts were redirected to Core 1. Table 2 shows the assignment of threads to cores.
Table 2. Thread assignments to cores.
Background tasks running on Core 1 generated disk and network I/O activity. These included five wget tasks that each retrieved a copy of a 1.9GB binary image over the Internet. Five additional tasks performed file copies of a local version of the binary image to different directories. A periodic task additionally consumed 20% of the CPU time to bring the total average utilization across all tasks on Core 1 to 67%.
Linux SMP: All threads were assigned budgets and periods within the SCHED_DEADLINE scheduling class except the USB_BH bottom half handler. Without modifications to Linux, bottom half handlers are not guaranteed CPU reservations.
Boomerang: The Boomerang partitioning hypervisor ran our RTOS on Core 0 and Yocto Linux on Core 1. The RTOS has its own Real-time USB stack and mhydra driver for the Kvaser USBCan interface. The MLProcess ran on Linux and represented a machine learning task using capabilities unavailable in the RTOS. Pipeline 1 extended from the RTOS into Linux via a secure shared memory channel using extended page table mappings.
All experiments were run for 30s over 10 runs each, although end-to-end delay results are displayed for the first 200 packets.
Asynchronous Communication
Asynchronous communication using four-slot buffering has the potential to suffer information loss. We constructed two experiments with expected pipeline losses of 0% and 20%. In both cases, packets for Pipelines 1 and 2 arrived and departed on CAN4 and CAN5 channels, respectively. We measured the round-trip time taken by each packet to be read from and written to each of these channels. From Table 2, the expected end-to-end delay for Pipeline 1 was 10ms, and for Pipeline 2 was 8ms.
End-to-end Delay
Figures 7 and 8 show the end-to-end delay for Pipelines 1 and 2, when there is no expected loss. The horizontal lines represent the expected latency as calculated above. The end-to-end latency for Boomerang is always less than the theoretically calculated bound. However, Linux SMP frequently fails to meet the end-to-end delay requirements. The main reason is the priority mismatch between bottom-half handlers and the task awaiting I/O operations. Our RTOS ensures that bottom-half handlers run at the correct priority with a specific CPU reservation. Therefore, Boomerang achieves temporal isolation between tasks and interrupts. As Linux is unable to achieve the same level of timing guarantees, even when tasks are guaranteed CPU reservations, there are some lost packets as observed by the missing data points in Figures 7 and 8. Table 5 summarizes the end-to-end latency results. It also shows that Linux suffers packet losses of 28% and 56% for Pipelines 1 and 2, respectively.
Figure 9 shows the cumulative interrupts received by each core with Boomerang and Linux SMP over the duration of 200 packet transfers. The cumulative numbers of interrupts are 20623 and 16693, respectively, in Boomerang and Linux SMP. Linux has fewer overall interrupts but more on Core 0. We conjecture this is caused by local APIC timer interrupts, which are influenced by the budget management of SCHED_DEADLINE tasks. However, this requires further investigation. Notwithstanding, Linux SMP fails to meet end-to-end delay guarantees because of its unpredictability in interrupt handling.
Loss
Sensor data processing is often able to tolerate a specific loss rate. We increased the periods of certain pipeline tasks, as shown in Table 3, to incur a tolerable loss rate up to 20%. The expected latency for Pipeline 1 is now changed from 10ms to 11ms due to increased periods of MLProcess and CanWrite. Similarly, the expected latency of Pipeline 2 is changed from 8ms to 8.5ms due to the increased period of RTControl.
Table 3. Difference in periods for different loss.
Figures 10 and 11 show the end-to-end delays for each pipeline. Boomerang keeps the loss-rate within 20%, as observed in Table 6. However, Linux SMP misses 55% and 50% of the 200 packets, respectively, for Pipelines 1 and 2.
Synchronous Communication
We repeated experiments with Pipelines 1 and 2 using FIFO buffering. The constraint solver described in Section 3.1.4 is used to establish correct budgets, periods and buffer sizes. The final calculated budgets and periods are presented in Table 4. Buffer sizes are 4, 2 and 4 messages, respectively, between CanRead and MLProcess, MLProcess and CanWrite, and RTFusion and RTControl.
Table 4. Synchronous pipeline (Common threads not shown).
Throughput and Delay
The expected end-to-end delay of Pipeline 1 is increased to 14ms because of the increased periods of the tpipe threads. Figures 13 and 14 show the revised end-to-end delays. Measurements are summarized in Table 7. FIFO buffering does not improve the latency for Linux SMP because of previously mentioned issues with interrupts. However, it improves the packet loss rate for Linux SMP, as a buffer holds messages even if a tpipe thread is interrupted. Table 8 shows that the throughput with Boomerang and Linux SMP is similar, although the standard deviation is smaller with Boomerang. Arrival rates (λ) from CAN4 and CAN5 are shown for each pipeline.
MIMO Pipelines
Boomerang supports the construction of pipelines with multiple inputs and outputs (MIMO). We constructed a pipeline based on Figure 5. Using the labeling in that figure, tuned pipes A−F have the following (budget, period) pairs in milliseconds: …, E (0.1, 1) and F (0.1, 1). A reads input from CAN4 while C reads input from CAN5. Similarly, E writes back to CAN4, and F writes to CAN5. Tuned pipe D operates in Linux while all others operate in the RTOS. Tables 9 and 10 summarize the throughput and latencies, while Figure 15 shows the end-to-end delay. Even with multiple device inputs and outputs, both paths through CAN4 and CAN5 transfer data within their expected bounds.
Table 10. MIMO Pipeline throughput.
Discussion
Boomerang supports the construction of QoS-constrained task pipelines that span an RTOS and a legacy system. Each tuned pipe executes on a bandwidth-preserving VCPU, to ensure end-to-end timing guarantees. Interrupt handlers are executed with time-budgeted IO VCPUs, whereas this is not the case in Linux. The SCHED_DEADLINE scheduling class ensures timely execution of tasks in Linux but requires interrupt handlers to be demoted to a lower priority.
The version of Linux used in Boomerang is the same as that used in all standalone experiments labelled Linux SMP. In all comparisons, Linux includes the PREEMPT_RT patch for improved real-time responsiveness. Additionally, SCHED_DEADLINE provides CPU reservations, which are not found in many simpler RTOSes.
Only if tuned pipes are implemented in an OS that ensures temporal isolation and scheduling of interrupts is it possible to ensure end-to-end timing guarantees on critical I/O control paths. Device I/O still needs to be conformant with a task pipeline's end-to-end requirements. That is, data must arrive at an input device and be able to depart according to pipeline throughput requirements. In the experiments, we ensure all CAN interfaces provide appropriate arrival and departure rates through device pipes to meet end-to-end requirements.
This work has thus far not considered shared caches, DRAM and memory buses as potential causes of interference and timing unpredictability. Page coloring [22] , and memory bandwidth budgets [52, 54] have been studied in past work. We consider these techniques complementary to the VCPU scheduling used with tuned pipes.
It may seem counter-intuitive to build task pipelines that communicate with a lower criticality, and potentially less secure system in a separate guest domain. However, we envision this being the case for systems that wish to leverage pre-existing functionality of a legacy OS. Automotive systems are considering the integration of mixed-criticality tasks and services [31] . Future work will investigate secure information flow rules to prevent access to data by unauthorized software components.
Related Work
Operating Systems
Several systems have attempted to provide temporal isolation between tasks. Mercer et al implemented processor capacity reserves in the Mach micro-kernel [28] . As with our RTOS VCPUs, a processor capacity reserve provides a budget and a period for each task. Once a task depletes its budget it is unable to continue execution until its budget is replenished.
Steere et al used a reservation-based scheme along with a feedback-based controller to adjust CPU allocations among tasks [41] . The idea is somewhat similar to Boomerang's tuned pipes. However, it does not integrate the scheduling of interrupt handlers with task execution.
Linux supports reservation-based scheduling using the PREEMPT_RT patch [33] and SCHED_DEADLINE [23] task execution. This allows tasks to specify CPU reservations that are managed by a Constant Bandwidth Server [2]. LITMUS RT [11] is a Linux-based system that supports configurable real-time schedulers, including those with reservations. Multiple RTOSes attempt to provide temporal isolation to tasks [7, 20, 47]. However, these systems do not properly handle events such as interrupts, which may interfere with the timing guarantees to the real-time tasks.
RT-Linux virtualizes interrupts for non-time-critical parts of the system, thereby ensuring real-time service to time-critical tasks [53]. Similar approaches have been adopted by Wind River Linux [49], the Real-time Application Interface (RTAI) for Linux [14], Xenomai [50], and NASA's CFS Linux [30]. Zhang et al integrated interrupt handling with task scheduling in Linux. A bottom half handler for a device interrupt inherited the highest priority of a blocked process waiting on the device [56]. However, interrupt handling was not limited to a CPU reservation, meaning a burst of interrupts could still interfere with tasks.
Many real-time OSes such as eCOS [15], RTEMS [12], and FreeRTOS [36] provide a single address space multi-threaded solution for multicore machines. However, this is insufficient for many safety-critical domains, requiring both temporal and spatial isolation between components of different criticality levels. The Quest RTOS [13] not only supports multiple address spaces, but also provides a Priority-Inherited Bandwidth-preserving Server approach to serve the interrupts in a timely manner along with CPU-bound tasks. While Quest provides timing isolation for both I/O- and CPU-bound tasks, it does not support the richness of services found in a legacy system such as Linux.
Hypervisors
Several hypervisors attempt to support both temporal and spatial isolation of real-time and non-real-time guests [26, 27, 44, 46] . RT-Xen [51] adds real-time scheduling support to the Xen [6] hypervisor. However, all these hypervisors multiplex their guests on a shared physical machine. They virtualize interrupts, and perform additional resource management operations that conflict with the policies within each guest.
Partitioning hypervisors allow guests to directly manage subsets of machine resources. The hypervisor is removed from I/O and resource management once each guest is granted access to its machine resources. The Quest-V [48] separation kernel [37] uses a partitioning hypervisor to support the co-existence of the Quest RTOS and one or more general purpose OSes. Each guest OS runs simultaneously on separate cores in a multicore machine, with device interrupts delivered directly to the guest that owns the device.
PikeOS [42] and Muen [10] are separation kernels that support multiple guest OSes. However, unlike Quest-V, interrupts are trapped into the hypervisor, and subsequently delivered to the guest OSes. Jailhouse [35] and ACRN [3] have similarities to Quest-V. Jailhouse uses Linux to bootstrap a system that provides cells for system inmates. These are essentially restricted hardware subsets assigned to guests. However, Jailhouse does not currently provide an integrated approach for guests to communicate in real-time via shared memory channels. ACRN's philosophy is to allow a service OS to manage machine resources on behalf of other safety-critical OSes. However, as with Jailhouse, there is currently no way to communicate between guests with end-to-end timing guarantees. Boomerang's partitioning hypervisor is modeled on the approach taken by Quest-V, but provides support for composable tuned pipes spanning multiple guests.
Predictable Communication
Boomerang's support for composable tuned pipes is greatly inspired by the Scout operating system [29] . In Scout, paths through a sequence of services are treated as first-class schedulable entities. The entire processing along a path is run in the context of a single thread that is scheduled according to the bottleneck queue. Boomerang, in contrast, schedules each component of a pipeline with a separate time-budgeted thread. This allows paths to be interleaved and executed on different PCPUs, spanning different sandboxes.
RAD-FLOWS [32] provided a design framework for predictable data communication. The work includes several designs and proofs for predictable inter-task communication. However, RAD-FLOWS does not seem to have been applied to a working system.
Golchin et al developed a system abstraction for predictable data delivery between USB devices and a real-time process [17]. Boomerang provides support for real-time I/O to span multiple tasks in a partitioning hypervisor. We envision Boomerang to be useful in automotive systems, data streaming applications [1] and systems based on publisher-subscriber models, such as ROS [34].
Conclusions and Future Work
This paper presents Boomerang, which combines time-critical tasks with legacy services. Boomerang's partitioning hypervisor connects a built-in guest RTOS with a legacy system such as Linux, via secure and predictable shared memory communication channels. The legacy OS benefits from timing predictable services that are isolated from less critical code. At the same time, the RTOS benefits from the pre-existing services, including libraries and lower criticality device drivers of a legacy non-real-time system.
Boomerang supports composable tuned pipes, for real-time task pipelines that require guaranteed end-to-end service on data transfers. The system provides real-time task pipelines with complementary legacy services that are timing predictable using CPU reservations.
Experiments show that real-time task pipelines guarantee end-to-end throughput, delay and loss requirements in Boomerang. This is the case for all pipelines contained within the RTOS and which span both the RTOS and Linux. In contrast, task pipelines in Linux are not able to ensure end-to-end service constraints, even when using CPU reservations. This is because of task interference by interrupts from I/O devices. The interrupt handlers need to be assigned suitable CPU reservations at appropriate priorities that match the pipelined tasks they serve. Alternatively, if I/O processing is assigned to a dedicated core, it reduces system utilization.
Future work will extend Boomerang's composable tuned pipes to span different physical machines. We see a programming model for real-time pipes useful in data flow machines and stream processing applications, such as those in neuromorphic computing.
NB: The source code for Boomerang is available upon request.
