Qduino: a cyber-physical programming platform for multicore Systems-on-Chip by Cheng, Zhuoqun
Boston University
OpenBU http://open.bu.edu
Theses & Dissertations Boston University Theses & Dissertations
2018
https://hdl.handle.net/2144/33253
BOSTON UNIVERSITY
GRADUATE SCHOOL OF ARTS AND SCIENCES
Dissertation
QDUINO: A CYBER-PHYSICAL PROGRAMMING PLATFORM
FOR MULTICORE SYSTEMS-ON-CHIP
by
ZHUOQUN CHENG
Zhejiang University, 2012
Submitted in partial fulfillment of the
requirements for the degree of
Doctor of Philosophy
2018
© Copyright by
ZHUOQUN CHENG
2018
Approved by
First Reader
Richard West, PhD
Professor of Computer Science
Second Reader
Renato Mancuso, PhD
Professor of Computer Science
Third Reader
Hongwei Xi, PhD
Professor of Computer Science
Acknowledgments
I would like to thank my advisor Professor Richard West for his support of my PhD re-
search, his guidance and for imparting his wisdom and expertise. The insights I gained
about operating and real-time systems will be forever useful in my future endeavors. I en-
joyed exchanging research ideas, hacking on Quest and Linux, on 3D printers and drones,
and of course chatting about all kinds of things in real life, with him. To me, he serves as
a mentor for not only research but also life.
I also want to thank my previous colleagues: Ye Li and Ying Ye. I owe them a lot for
their kind help to my research projects at different stages. They were also like mentors
to me when I just started my PhD program. I benefited a lot from their experience and
insights. I would also like to thank my colleagues and friends, Eric Missimer, Katherine
Missimer, Soham Sinha, Anam Farrukh, Craig Einstein and Ahmad Golchin. They all
helped me at various times. Along with all the other members of the BOSS group, they
made my 6 years at BU an enjoyable experience.
To my parents, thank you for all the support and encouragement. You were always
there for me and I will be forever grateful for all of your support.
This work is supported in part by the National Science Foundation (NSF) under Grant
# 1527050. Any opinions, findings, and conclusions or recommendations expressed in this
material are those of the author(s) and do not necessarily reflect the views of the NSF.
QDUINO: A CYBER-PHYSICAL PROGRAMMING PLATFORM
FOR MULTICORE SYSTEMS-ON-CHIP
ZHUOQUN CHENG
Boston University, Graduate School of Arts and Sciences, 2018
Major Professor: Richard West, Professor of Computer Science
ABSTRACT
Emerging multicore Systems-on-Chip are enabling new cyber-physical applications such
as autonomous drones, driverless cars and smart manufacturing using web-connected 3D
printers. Common to those applications is a communicating task pipeline, to acquire and
process sensor data and produce outputs that control actuators. As a result, these applica-
tions usually have timing requirements for both individual tasks and task pipelines formed
for sensor data processing and actuation. Current cyber-physical programming platforms,
such as Arduino and embedded Linux with the POSIX interface, do not allow application
developers to specify those timing requirements. Moreover, none of them provide the pro-
gramming interface to schedule tasks and map them to processor cores, while managing
I/O in a predictable manner, on multicore hardware platforms. Hence, this thesis presents
the Qduino programming platform. Qduino adopts the simplicity of the Arduino API, with
additional support for real-time multithreaded sketches on multicore architectures. Qduino
allows application developers to specify timing properties of individual tasks as well as
task pipelines at the design stage. To this end, we propose a mathematical framework to
derive each task’s budget and period from the specified end-to-end timing requirements.
The second part of the thesis is motivated by the observation that at the center of these
pipelines are tasks that typically require complex software support, such as sensor data fu-
sion or image processing algorithms. These features are usually developed by many man-
year engineering efforts and thus commonly seen on General-Purpose Operating Systems
(GPOS). Therefore, in order to support modern, intelligent cyber-physical applications,
we enhance the Qduino platform’s extensibility by taking advantage of the Quest-V vir-
tualized partitioning kernel. The platform’s usability is demonstrated by building a novel
web-connected 3D printer and a prototypical autonomous drone framework in Qduino.
Contents
1 Introduction
1.1 Thesis Statement
1.2 Research Contributions
1.3 Thesis Organization
2 Related Work
2.1 Guarantee of Task Timing Requirements
2.2 End-to-end Timing Analysis and Generation of Task Timing Requirements
2.3 Cyber-physical Programming Platforms
2.4 Platform Extensibility
2.5 3D Printers and Drones
3 Guarantee and Generation of Task Timing Properties
3.1 Task Model
3.2 Guarantee of Timing Properties: VCPU Scheduling
3.3 Generation of Timing Properties: End-to-end Timing Analysis and Design
3.3.1 Communication Model
3.4 End-to-end Timing Analysis
3.4.1 Semantics of End-to-end Time
3.4.2 Latency Contributors
3.4.3 The Composable Pipe Model
3.4.4 Reachability
3.4.5 Timing Analysis
3.4.6 Composability
3.5 End-to-end Design
3.5.1 Problem Definition
3.5.2 Solving the Constraints
3.6 Evaluation
4 The Qduino Programming Platform
4.1 The Standard Arduino API
4.2 The Qduino Extension
4.2.1 Support for User-specified Timing Requirements
4.2.2 Support for Multithreaded Sketch
4.2.3 Support for Multicore Architectures
4.3 Micro-benchmarks
4.3.1 Multithreaded Sketch
4.3.2 Timing Guarantee
4.4 A Real-world Application
5 A Web-connected 3D Printer Case Study
5.1 Web-connected 3D Printers
5.2 Timing Requirements of a 3D Printer
5.3 Implementations
5.3.1 Challenges
5.3.2 Implementation on Linux
5.3.3 Implementation on Qduino
5.4 Evaluation
6 Platform Extensibility
6.1 Design Concerns
6.2 Architecture Overview
6.2.1 Partitioning Hypervisor
6.2.2 Linux
6.3 Inter-sandbox Communication
6.3.1 Asynchronous Message Passing
6.3.2 Remote Procedure Call
6.3.3 User APIs
6.4 An Autonomous Drone Framework Prototype
6.4.1 Hardware Setup
6.4.2 Cleanflight in Qduino
6.4.3 The Autonomous Control Platform
7 Conclusion and Future Work
7.1 Future Work
Bibliography
Curriculum Vitae
List of Tables
3.1 Example Periods
3.2 Application Timing Characteristics
3.3 Simulation Settings
4.1 Arduino Standard API
4.2 New APIs
4.3 Case Descriptions
4.4 VCPU Parameters
4.5 Experimental Setup
5.1 Case Descriptions
5.2 Kernel Overheads (in nanoseconds)
6.1 Arduino Standard API
6.2 Task Execution Times
6.3 Task Periods
List of Figures
3.1 VCPU Scheduling Hierarchy
3.2 Task Chain
3.3 Illustration of a Pipe
3.4 End-to-end Reaction Time in Case 1
3.5 End-to-end Reaction Time in Case 2
3.6 End-to-end Freshness Time in Case 1
3.7 Application Task Graph
3.8 Simulation Pipe Topology
3.9 Observed vs. Predicted Freshness & Reaction Times
3.10 Our System vs. Linux End-to-end Times
4.1 Qduino Architecture Overview
4.2 Multithreaded Sketch Benchmarks
4.3 Temporal Isolation between Loops
4.4 Temporal Isolation between Loops and Interrupts
5.1 A Conceptual View of a 3D Printer
5.2 An Example Speed Ramp for a Stepper Motor
5.3 Structural Deficiency due to Unstable Pulse Train
5.4 The Structure of Marlin
5.5 Temperature Readings
5.6 Target 10kHz Pulse on Yocto Linux (Actually 7.591kHz)
5.7 Call Graph for Listing 5.1
5.8 Marlin on Qduino
5.9 Temperature Control
5.10 Target 10kHz Pulse on Qduino (Actually 9.569kHz)
5.11 10kHz Pulse on the Printrboard (without a webserver)
5.12 3D Image STL File for Test Object
5.13 Linux
5.14 Qduino
6.1 Qduino Extension Architecture
6.2 Cleanflight Data Flow
6.3 Cleanflight Times
List of Abbreviations
CPS     Cyber-Physical Systems
EPT     Extended Page Table
FDM     Fused Deposition Modeling
GPOS    General-Purpose Operating System
IMU     Inertial Measurement Unit
IoT     Internet of Things
I/O     Input/Output
KVM     Kernel-based Virtual Machine
PIBS    Priority Inheritance Bandwidth Preserving Server
PID     Proportional-Integral-Derivative
RC      Radio Control
SBC     Single Board Computers
SoC     Systems-on-Chips
SLAM    Simultaneous Localization and Mapping
TCB     Trusted Computing Base
VMM     Virtual Machine Monitor
WCET    Worst-Case Execution Time
WCD     Worst-Case Delay
Chapter 1
Introduction
Emerging multicore systems-on-chip are enabling new cyber-physical applications such
as autonomous drones, driverless cars and smart manufacturing using web-connected 3D
printers. These applications consist of tasks with individual timing requirements. Being
able to strictly meet task timing requirements is fundamental for correctness. To this end,
real-time scheduling techniques are often employed to provide temporal isolation between
the tasks. On the other hand, new timing requirements are introduced by those appli-
cations, which commonly feature communicating task pipelines, to acquire and process
sensor data and produce outputs that control actuators. End-to-end timing requirements
are usually imposed on the task pipelines to ensure that application-wide mission objec-
tives are achieved within certain time bounds. For example, in a flight controller for a
multirotor drone, the data from a gyro or inertial sensor must be gathered and processed
to determine the attitude of the aircraft. Sensor data fusion is followed by control outputs
that alter rotor speeds to adjust the drone’s flight. If the latency of the processing pipeline between sensor
input and actuation is not bounded, the drone will lose control and possibly fail to maintain
flight. Consequently, it is crucial to meet timing requirements for both individual tasks and
task pipelines formed for sensor data processing and actuation, in order to guarantee the
correct functioning of these new applications.
Current cyber-physical programming platforms, such as Arduino, have gained popularity
as a result of their relatively simple programming interface. Similarly, embedded
Linux with the POSIX interface has become a widely used standard. However, none of
the popular cyber-physical programming environments today provide adequate support
for application developers to specify task timing requirements. Apart from the lack of
timing specification, the Arduino platform is also limited by its support for a single thread
of control. The Linux/POSIX platform is also limited by its cumbersome programming
interface, such as the file-based sysfs subsystem for I/O operations.
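To illustrate why the file-based interface is considered cumbersome, the following minimal sketch drives a single output pin through the legacy Linux sysfs GPIO files. The helper name `sysfs_gpio_write`, the pin number used in any call, and the `root` parameter (added so the sketch can be pointed at a scratch directory) are assumptions made for illustration and are not code from this thesis. The equivalent Arduino operation is one `pinMode` call followed by one `digitalWrite` call.

```python
import os

def sysfs_gpio_write(pin, value, root="/sys/class/gpio"):
    """Set GPIO `pin` to `value` (0 or 1) via the legacy sysfs files."""
    pin_dir = os.path.join(root, "gpio%d" % pin)
    # Step 1: ask the kernel to expose the pin, unless it is already exported.
    if not os.path.isdir(pin_dir):
        with open(os.path.join(root, "export"), "w") as f:
            f.write(str(pin))
    # Step 2: configure the pin as an output.
    with open(os.path.join(pin_dir, "direction"), "w") as f:
        f.write("out")
    # Step 3: write the logic level.
    with open(os.path.join(pin_dir, "value"), "w") as f:
        f.write("1" if value else "0")
```

Each pin update involves several file opens and writes, each of which is a system call, which is part of what makes the timing of this path hard to predict.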
The increasing computational capabilities of systems-on-chip, which more commonly
feature multiple rather than single cores, require application tasks to be carefully scheduled
to ensure latency-constrained I/O and CPU-bound processing. Neither Arduino nor
the Linux/POSIX programming environment gives applications sufficient capabilities to
schedule tasks or map them to processor cores to meet latency constraints, while managing
I/O in a predictable and efficient manner.
Given the above observations, this thesis presents the Qduino programming platform.
Qduino adopts the simplicity of the Arduino API with additional support for real-time
guarantees to multithreaded sketches (an Arduino program is called a sketch). At the design
stage, it allows application developers to specify timing properties of individual tasks as
well as task pipelines. On multicore hardware platforms, Qduino hides the hardware and
system complexity from application developers, and exposes a set of high-level programming
interfaces for explicit CPU and I/O resource management. In order to guarantee end-to-end
timing requirements in Qduino, we propose a mathematical framework, proving that a
scheduling technique that guarantees individual tasks’ timing requirements is also able to
provide end-to-end timing guarantees for pipelines composed of those tasks. We proceed to
show that user-specified end-to-end timing requirements can be utilized to derive each
task’s budget and period. This provides a powerful design tool for application developers to
determine tasks’ timing properties according to application-wide timing specifications.
In order to verify the usability of the Qduino platform, we use it to build a novel
web-connected 3D printer. Our 3D printer in Qduino demonstrates better printing quality
and faster printing speed, compared to a reference implementation in Linux. A reference
Qduino-based 3D printer is able to print at speeds comparable to a traditional 3D printer
without web connectivity. Moreover, the submission of web requests is guaranteed not
to interfere with the timely operation of the active print job using Qduino.
While a 3D printer accomplishes its objectives by following pre-determined instruc-
tions (i.e., G-code), it differs from emerging cyber-physical applications that must reason
about, and adapt to, changes in their environment. Autonomous vehicles, be it drones
or cars, are example applications that must sense information in their environments (e.g.,
obstacles or objects of interest) and adapt their behavior as necessary to meet mission ob-
jectives. The autonomous nature of these applications is often made possible by complex
software stacks, such as OpenCV for image processing, or GPU-based machine learning
algorithms, commonly deployed on General-Purpose Operating Systems (GPOS). There-
fore, we believe it is important for a programming platform to be extensible in the sense
that third party software components such as libraries and frameworks can be easily inte-
grated into application development. In the last part of this thesis, we propose a system
design that enables Qduino to integrate with Linux applications. This allows Qduino to
exploit the abundance of pre-existing software currently available.
1.1 Thesis Statement
Thesis: This thesis presents a programming platform for modern cyber-physical applications
on emerging multicore systems-on-chip, with support for three essential features: 1)
design-time specification and run-time guarantee of timing properties of individual tasks
as well as application-wide task pipelines; 2) a simple and clean interface for CPU and
I/O resource management on multicore architectures; 3) platform extensibility that enables
easy integration with legacy software components.
1.2 Research Contributions
The contributions of this thesis include the following:
Timing analysis. We propose two timing constraints to quantify end-to-end timing re-
quirements of cyber-physical applications that consist of pipelined communicating tasks.
Different from previous research work that assumes task dependencies, our timing anal-
ysis is based on the model where tasks execute independently and communicate asyn-
chronously. This task and communication model is able to more precisely address the
cyber-physical applications we focus on. Based upon the model, a mathematical frame-
work is presented to analyze the two end-to-end times of a given system. The framework
is further utilized to derive timing properties of individual tasks from given end-to-end
timing constraints.
Programmability. We propose the Qduino programming platform to ease the develop-
ment of modern cyber-physical applications. It mitigates the shortcomings of current so-
lutions such as Arduino and Linux, by providing support for: 1) design-time specification
and run-time guarantee of timing properties of individual tasks as well as application-wide
task pipelines; 2) CPU and I/O resource management on multicore architectures; 3) plat-
form extensibility that enables integration with legacy software components.
Application building. As a case study, we investigate and analyze the timing properties
of 3D printers. 3D printers traditionally have stringent timing requirements. Adding web
connectivity poses further challenges to balance system resources between different types
of tasks. We are able to build a web-connected 3D printer in Qduino that outperforms a
Linux-based implementation in terms of printing accuracy and speed.
We also show the need to support platform extensibility by designing a flight controller
that uses Linux to implement autonomous functionality alongside traditional sensor data
fusion and control of existing flight control firmware. In this regard, we port the popular
Cleanflight firmware found on modern drones to run as a Qduino application alongside
autonomous flight management services in Linux that replace human flight commands.
1.3 Thesis Organization
The rest of this thesis is organized as follows. Chapter 2 provides the necessary background
and related work. Chapter 3 investigates the relationship between individual tasks’
timing requirements and application-wide end-to-end times. A formula for the worst-case
end-to-end time is derived under certain assumptions. The mathematical framework for
generation of timing properties from end-to-end times is introduced. Chapter 4 covers the
proposed Qduino programming platform, followed by the web-connected 3D printer case
study described in Chapter 5. Chapter 6 describes techniques used to enhance Qduino’s
extensibility in order to build an ideal programming platform for cyber-physical applications.
Finally, the thesis conclusion followed by future work is covered in Chapter 7.
Chapter 2
Related Work
2.1 Guarantee of Task Timing Requirements
There is a significant body of research on guaranteeing task timing by providing
temporal isolation between tasks. In the context of CPU scheduling for real-time systems,
the concept of CPU capacity reserves was introduced in [MST93] for real-time processes.
Server mechanisms [AB98] are another form of resource reservations. A large variety of
server mechanisms have been introduced, e.g., [LSS87, SSL89, SLS95, AB04, SB94a]. A
Constant Bandwidth Server (CBS) has a current budget, cs, and a bandwidth limited by the
ratio Qs/Ts, where Qs is the maximum server budget available in the period Ts. When
a server depletes all its budget, it is recharged to its maximum value. A corresponding
server deadline is updated by a function of Ts, depending on the state of the server when
a new job arrives for service, or when the current budget expires. CBS guarantees a total
utilization factor no greater than Qs/Ts, even in overloads, by specifying a maximum budget
in a designated window of time. This contrasts with work on the Constant Utilization
Server (CUS) [DLS97] and Total Bandwidth Server (TBS) [SB94b], which ensure band-
width limits only when actual job execution times are no more than specified worst-case
values. CBS has bandwidth preservation properties similar to that of the Dynamic Spo-
radic Server (DSS) [GB95] but with better responsiveness. CBS, CUS, TBS and DSS all
assume the existence of server deadlines. In contrast, we chose not to assume the existence
of deadlines for VCPUs, instead restricting VCPUs to fixed priorities. This avoided the
added complexity of managing dynamic priorities of VCPUs as their deadlines change.
Additionally, for cases when there are multiple tasks sharing a fixed-priority VCPU, the
execution of one task will not change the importance of the VCPU for the other tasks.
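The CBS budget and deadline rules summarized above can be sketched as follows. This is a minimal illustration of the rules as they are commonly stated for CBS [AB98], not the VCPU scheduler used in this thesis. The class and method names are hypothetical, and time and budget are plain numbers in arbitrary units.

```python
class ConstantBandwidthServer:
    """Minimal CBS bookkeeping: current budget c, deadline d, bandwidth Q/T."""

    def __init__(self, max_budget, period):
        self.Q = max_budget  # Qs: maximum budget per period
        self.T = period      # Ts: server period
        self.c = 0           # cs: current budget
        self.d = 0           # current server deadline

    def bandwidth(self):
        return self.Q / self.T

    def job_arrival(self, now):
        """A job arrives at time `now` while the server is idle."""
        # If the leftover budget could exceed the reserved bandwidth before
        # the current deadline, start over with a fresh deadline and budget.
        if self.c >= (self.d - now) * self.bandwidth():
            self.d = now + self.T
            self.c = self.Q
        # Otherwise the job is served with the current budget and deadline.

    def execute(self, amount):
        """Consume `amount` units of execution time."""
        while amount > 0:
            run = min(amount, self.c)
            self.c -= run
            amount -= run
            if self.c <= 0:
                # Budget exhausted: recharge and postpone the deadline by T.
                self.c = self.Q
                self.d += self.T
```

Postponing the deadline on every recharge is what keeps the server within its bandwidth Qs/Ts even when a job overruns, at the cost of the dynamic priorities that the fixed-priority VCPUs in this thesis avoid.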
In the context of real-time operating systems, Linux/RK [OR98] is an early RTOS that
is built around the notion of resource reserves. The concept of a reserve in Linux/RK was
derived and generalized from processor capacity reserves [MST93] in RT-Mach. In more
recent times there has been similar work on resource containers to account for resource
usage [BDM99]. Each time-multiplexed reserve in the system has a budget C, interval
T , and deadline D, assigned to it so that utilization can be specified. Linux/RK requires
a priori knowledge of application resource demands and relies on an admission control
policy to guarantee a reasonable global reserve allocation. In contrast, our VCPU schedul-
ing framework focuses on the temporal isolation between tasks and system events using a
hierarchy [SL03, Reg01] of virtual servers, acting as either Main or I/O VCPUs.
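As a rough illustration of utilization-based admission control over reserves with budget C and interval T, the sketch below admits a new reserve only if the total utilization stays within the Liu and Layland bound for rate-monotonic scheduling. The use of that particular bound is an assumption made for this example; the text above does not state which policy Linux/RK applies. The function names are hypothetical.

```python
def rms_utilization_bound(n):
    """Liu and Layland sufficient bound for n rate-monotonic tasks."""
    return n * (2 ** (1.0 / n) - 1)

def admit(reserves, candidate):
    """Decide whether `candidate` can join the existing `reserves`.

    Each reserve is a (C, T) pair: budget C available every interval T.
    """
    tasks = list(reserves) + [candidate]
    total_utilization = sum(c / t for c, t in tasks)
    return total_utilization <= rms_utilization_bound(len(tasks))
```

For example, `admit([(1, 4)], (1, 4))` accepts the second reserve because the total utilization of 0.5 is below the two-task bound of roughly 0.83. The test is sufficient but not necessary, so some rejected sets may in fact be schedulable.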
2.2 End-to-end Timing Analysis and Generation of Task Timing Re-
quirements
In their seminal work, Feiertag et al. [FRNJ08] distinguish four semantics of end-to-end
time and provide a generic framework to determine all the valid data paths for each
semantic. Yet the authors do not perform timing analysis as no scheduling model is as-
sumed. Hamann et al. [HDK+17] also discuss end-to-end reaction and age time. Their
work focuses on integrating three different communication models, including the implicit
communication model, into existing timing analysis tools such as SymTA/S [HHJ+05].
While our composable pipe model is also based on implicit communication, we perform
timing analysis for a sequence of RMS-scheduled tasks running on bandwidth-preserving
servers. Others have proposed end-to-end timing analysis at the model level in the automotive
domain [BDM+16, MSN+15]. Our approach is applicable to any application with
data pipeline processing, freshness and reaction constraints. A large portion of end-to-
end reaction time analysis is based on the synchronous data-flow graph (SDFG) [LM87],
where inter-task communication is driven by the arrival of input data. Real-time schedul-
ing techniques have been used to analyze end-to-end latencies in systems modeled by
SDFGs [SEB17, KMKKTC16].
A task’s budget can be derived by using either static analysis or an experimental ap-
proach. Static analysis tools such as aiT [Abs14] are able to accurately estimate the worst-
case execution time by analyzing a task’s intrinsic cache and pipeline behavior, if the
micro-architectural model such as formal cache and pipeline models is available. For
complex architectures, it is common industrial practice to experimentally derive the
value of WCET. However, testing by repeatedly measuring the execution time of a task
is a tedious process and typically error-prone. It is often impossible to prove that the
conditions determining maximum execution time have been taken into account.
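The experimental approach can be sketched as a simple measurement loop. This is a minimal illustration using Python's standard `time.perf_counter_ns` clock; the function names and the default run count are assumptions and do not describe any tool cited above.

```python
import time

def measure_execution_times(task, runs=1000):
    """Run `task` repeatedly and return each execution time in nanoseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter_ns()
        task()
        samples.append(time.perf_counter_ns() - start)
    return samples

def observed_worst_case(samples):
    """Largest observed execution time among the samples."""
    return max(samples)
```

The value returned by `observed_worst_case` is only a lower bound on the true WCET, since nothing guarantees that the inputs and hardware states exercised during the runs include the worst case.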
There has been limited research on determining task period. Gerber et al. [GHS95]
propose a synthesis approach that determines tasks’ periods, offsets and deadlines from
end-to-end timing constraints. Their work relies on task precedence constraints as there is
no scheduling model used for the analysis. Seto et al. [SLS98] present algorithms based
on static priority scheduling methods to determine optimal periods for each task in a task
set. Their solution to the period selection problem optimizes a system-wide performance
measure subject to meeting the maximal acceptable latency requirements of each task. In
contrast, our work assumes the VCPU scheduling model to derive task periods from two
application-wide end-to-end timing constraints: freshness and reaction times. There are
also programming languages that allow the specification and verification of end-to-end
timing properties. For example, Prelude [PFB+11] and Giotto [HHK01] are languages
designed to derive tasks’ periods based on user-specified timing constraints. Lauer et
al. [LBPE14, BLPE13] use formal methods to verify end-to-end timing properties for
avionic systems. Forget et al. [FBP17] define a language to specify formally end-to-end
constraints and propose a technique to verify those constraints.
2.3 Cyber-physical Programming Platforms
There is a wide range of programming platforms for different aspects of cyber-physical
applications. Contiki [DGV04] and RIOT [BHG+13] aim to develop applications with
wireless connectivity. Contiki is a small footprint operating system for use with Internet
of Things (IoT) devices. It supports per-process preemptive multi-threading by linking
applications with a protothread library. RIOT OS is another multi-threaded operating sys-
tem designed for IoT devices. Both of them provide native support for Internet Protocol
Suite. The AXIOM project (Agile, eXtensible, fast I/O Module) [MAB+16] aims to provide a
scalable and easy-to-program general platform. Yet AXIOM targets FPGA-based boards.
Similar to Qduino, RT-Arduino [PB14] is a software-based Arduino extension that provides
real-time multitasking support. It is built upon the
OSEK/VDX certified ERIKA Enterprise RTOS [eri], which supports time-sensitive ap-
plications on a wide range of microcontrollers and multicore platforms. Arduino loops
are mapped to OSEK-tasks that are statically configured at compile-time. By compari-
son, Qduino extends the Arduino API by allowing users to specify timing requirements.
While ARTE is designed for single-core hardware devices (e.g., Arduino Uno and Arduino
Due), Qduino also exploits multicore architectures to improve performance and temporal
isolation amongst tasks with hard real-time requirements.
Linux is also widely used in cyber-physical systems, often with the PREEMPT_RT patch
applied. This patch aims to improve scheduling latency by reducing the number and the
length of critical sections in the kernel that mask interrupts or disable preemptions [CB13].
However, its POSIX programming interface is cumbersome for performing I/O opera-
tions and does not allow users to specify timing requirements. On the multicore archi-
tecture, Linux also provides features to shield CPUs from individual interrupts, including
Adaptive-ticks [adp] to disable timer interrupts on certain cores. These features require ex-
tensive configurations and expertise in real-time application programming. Qduino adopts
the Arduino API to provide a simple, straightforward programming model to ease the
development of applications with timing requirements.
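For comparison, mapping a task to a core on Linux can be sketched with the CPU affinity call exposed by Python's `os` module. This is a minimal illustration only. The function name `pin_to_core` is hypothetical, and the call restricts where the calling process runs without shielding that core from interrupts or timer ticks, which requires separate system configuration.

```python
import os

def pin_to_core(core):
    """Restrict the calling process to a single core.

    Returns the resulting CPU set, or None on platforms that do not
    provide the affinity calls (they are Linux-specific).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    os.sched_setaffinity(0, {core})  # pid 0 refers to the caller
    return os.sched_getaffinity(0)
```

Affinity alone says nothing about how much CPU time the task receives on that core or when it receives it, which is the gap that explicit timing specifications are meant to close.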
2.4 Platform Extensibility
The extensibility enhancement of Qduino largely coincides with research work in the oper-
ating system community. Extensible operating systems research [SS96] aims at providing
applications with greater control over the management of their resources. For example, the
Exokernel [EKO95a] tries to efficiently multiplex hardware resources among applications
that utilize library operating systems. Resource management is thus delegated to library
operating systems, which can be readily modified to suit the needs of individual applica-
tions. SPIN [BSP+95] is an extensible operating system that supports extensions written
in the Modula-3 programming language. Interaction between the core kernel and SPIN
extensions is mediated by an event system, which dispatches events to handler functions
in the kernel. By providing handlers for events, extensions can implement application-
specific resource management policies with low overhead.
Some work has attempted to virtualize existing OSes, so that different OSes share
services. User-Mode Linux [Dik06] and L4Linux [TD16], for example, implement a mod-
ified Linux as a user-level address space on top of a host OS. Virtualization technologies
like KVM [KKL+07] or Xen [BDF+03] create machine abstractions for hosting guest ser-
vices with unmodified kernels, or kernels with minimal change. Embedded systems like
PikeOS [AG17] and QNX [Sys17] rely on their own in-kernel virtualization technologies
to support legacy software.
2.5 3D Printers and Drones
To the best of the authors’ knowledge, there is no published work on building intelli-
gent 3D printers. However, Replicape is an open source add-on board for BeagleBone
and BeagleBone Black, to enable 3D-printing. It hosts a standard Linux distribution
(Ångström/Debian) for running the G-code daemon, while real-time stepper timings are
handled by the two on-chip Programmable Real-time Units (PRU) present on BeagleBone.
It also supports a web interface for reading and writing configuration files. However, PRUs
have to be programmed in a special assembly language. Our 3D printer does not require
any special hardware to guarantee the real-time stepper timings, and the firmware is written
using simple Qduino APIs. In addition, our web interface is not only used for configura-
tion files but also for G-code transmission so that users can request 3D printing services
remotely.
Similar to 3D printers, there is limited published research on autonomous drones from
a system point of view. For research on general drone development, Bregu et al. [BCC+16]
conceived a notion of reactive drone control that supersedes the time-triggered approach.
Using reactive control, control decisions are taken only upon recognizing the need to,
based on observed changes in the navigation sensors. As a result, the rate of execution
dynamically adapts to the circumstances.
Chapter 3
Guarantee and Generation of Task Timing
Properties
Cyber-physical applications traditionally consist of tasks with individual timing require-
ments. A large variety of scheduling techniques have been proposed to meet those timing
requirements, in order to ensure the correct execution of the application. On the other hand,
it is notoriously hard to correctly and efficiently determine the task timing requirements. It
is common practice to derive a task’s budget by repeatedly measuring the execution time,
which is tedious and error-prone. There is even less knowledge on how to derive a task’s
period. Tasks that interact directly with the hardware devices typically run at the frequency
of sensor inputs and actuator outputs specified by the hardware or application demands.
However, in a task pipeline, only the head and tail interact with I/O devices and their tim-
ing properties can be determined. The timing properties of intermediate tasks are then
left to application designers to decide. Such pipelines are common in applications
that use communicating tasks to acquire and process sensor data and produce
outputs that control actuators. In such applications, we observe that the choice of task
period is usually constrained by end-to-end timing requirements of all the involved pipelines
within the application. This chapter seeks a mathematical representation to quantify this
relationship, in order to find an approach to derive individual task’s timing properties from
the timing specifications of the pipelines in which the task is involved, and the traditional
schedulability requirement.
3.1 Task Model
We first describe the task model we use to model cyber-physical applications, upon which
our following analysis and design is based.
We model cyber-physical applications as a set of tasks with mixed criticalities. Non-
critical tasks are considered as black boxes, and we do not enforce any constraint on them.
Instead we focus on the critical ones and model them as a set of real-time periodic tasks
$\{\tau_1, \tau_2, \cdots, \tau_n\}$. Each task $\tau_j$ is characterized by its worst-case execution time (WCET)
$C_j$, period $T_j$, and a deadline that is equal to its period. A task is composed of a setup
phase executed once for initialization, and a series of periodically released jobs. The setup
phase is used to perform I/O device and task state initialization, and resource allocation.
Its execution time is not characterized and thus does not play a role in timing analysis.
Conversely, a job of task $\tau_j$, released at time $t$, is required to finish by time $t + T_j$. A job
can perform I/O operations and communicate with active jobs belonging to other tasks,
besides conducting computations. The values of $C_j$ and $T_j$ are determined during the
design stage and are fixed at runtime. $C_j$ is usually profiled off-line under the
worst-case execution condition. $T_j$ can be determined by hardware specifications, application
logic, or even trial and error. In this work, $T_j$ will be derived from the end-to-end latency
constraints and the schedulability test, using the timing analysis and design framework
presented in Section 3.3.
All the tasks are assumed to be implemented using separate, preemptive schedulable
entities, scheduled by an underlying OS or bare-metal scheduler. On a multicore hardware
platform, we assume a partitioned scheduler is used and no load balancing is in place.
3.2 Guarantee of Timing Properties: VCPU Scheduling
The Quest kernel and its VCPU scheduling framework serve as the foundations for the
end-to-end timing analysis and design described later in this chapter, by providing both
the theoretical and systematic basis for timing guarantee and temporal isolation of tasks.
Quest is an x86-based real-time operating system and features a novel hierarchical
VCPU scheduling framework. Periodic tasks are implemented as user-level threads in
Quest, while I/O interrupts are handled by kernel-level threaded interrupt handlers. Threads
in Quest are scheduled by a two-level scheduling hierarchy, with threads mapped to vir-
tual CPUs (VCPUs) that are mapped to physical CPUs, as illustrated in Figure 3.1. Each
VCPU is specified a processor capacity reserve [MST93] consisting of a budget capacity,
C, and period, T. The values of C and T are determined by the WCET and period of the
task mapped to the VCPU. A VCPU is required to receive at least C units of execution
time every T time units when it is runnable, as long as a schedulability test [LSD89] is
passed when creating new VCPUs. This way, Quest’s scheduling subsystem guarantees
temporal isolation between threads in the runtime environment.
Figure 3.1: VCPU Scheduling Hierarchy
Conventional periodic tasks are assigned to Main VCPUs, which are implemented as
Sporadic Servers [Spr89] and scheduled using Rate-Monotonic Scheduling (RMS) [LL73].
The VCPU with the smallest period has the highest priority. Instead of using the Sporadic
Server model for both main tasks and bottom half threads, special I/O VCPUs are cre-
ated for threaded interrupt handlers. Each I/O VCPU operates as a Priority Inheritance
Bandwidth preserving Server (PIBS) [DLW11]. A PIBS uses a single replenishment to
avoid fragmentation of replenishment list budgets caused by short-lived interrupt service
routines (ISRs). By using PIBS for interrupt threads, the scheduling overheads from con-
text switching and timer reprogramming are reduced [MMW16]. A PIBS is specified by
a utilization, U , instead of a capacity and period. A PIBS always runs on behalf of a Spo-
radic Server and inherits both the priority and period of the Sporadic Server. For example,
the PIBS running in response to a device interrupt would run on behalf of the Sporadic
Server that requested the I/O action to be performed. The capacity of a PIBS is calculated
as C=U×T , where T is the period of the Sporadic Server. As shown in [DLW11], for n
Main VCPUs and m I/O VCPUs running on a single physical CPU, temporal isolation is
guaranteed amongst all VCPUs if:
\[
\sum_{i=0}^{n-1} \frac{C_i}{T_i} + \sum_{j=0}^{m-1} (2 - U_j) \cdot U_j \le n \left( \sqrt[n]{2} - 1 \right) \tag{3.1}
\]
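The admission test of Equation 3.1 can be sketched in a few lines of Python. This is only an illustration of the formula, not Quest's implementation: the function names `pibs_capacity` and `passes_vcpu_test`, the input layout, and the choice to reject an empty set of Main VCPUs are all assumptions made for the example.

```python
def pibs_capacity(utilization, server_period):
    """Capacity of a PIBS, C = U x T, where T is the period of the
    Sporadic Server on whose behalf the PIBS runs."""
    return utilization * server_period


def passes_vcpu_test(main_vcpus, io_utilizations):
    """Evaluate the bound of Equation 3.1 for one physical CPU.

    main_vcpus:      list of (C, T) pairs, one per Main VCPU
    io_utilizations: list of U values, one per I/O VCPU
    """
    n = len(main_vcpus)
    if n == 0:
        # Assumption of this sketch: the bound is undefined for n = 0,
        # so an empty set of Main VCPUs is rejected outright.
        raise ValueError("at least one Main VCPU is required")
    main_load = sum(c / t for c, t in main_vcpus)
    io_load = sum((2 - u) * u for u in io_utilizations)
    bound = n * (2 ** (1 / n) - 1)
    return main_load + io_load <= bound
```

For instance, two Main VCPUs with (C, T) of (1, 10) and (2, 20) plus one I/O VCPU with U = 0.05 give a left-hand side of about 0.30 against a bound of about 0.83, so the test passes.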
3.3 Generation of Timing Properties: End-to-end Timing Analysis and Design
The WCET of a task is traditionally profiled experimentally, but there is no common prac-
tice to determine the period. For tasks that interact with I/O devices directly, periods can
be determined by the I/O frequency. However, in modern cyber-physical applications, an increasing
number of tasks do not interact with I/O devices directly. Instead, these tasks
serve as intermediate stations in task pipelines connecting sensors and actuators.
Task pipelines are commonly seen in autonomous cyber-physical applications such as
drones, robotics, driverless cars and intelligent home automation systems. An autonomous
CPS is usually equipped with multiple sensors and actuators. For these applications, it
is essential to guarantee two types of end-to-end timing requirements: 1) the maximum
time it takes for a sample of input sensor reading to flow through the whole system to
eventually affect an actuator output, and 2) the maximum time within which a sampled
input sensor reading remains influential on output actuator commands. For example, in
a flight controller, if a gyroscope reading fails to correctly influence a change in motor
speed within a specific time bound, the drone might not be able to stabilize. For these
applications, it is important to guarantee not only the timing requirements of individual
tasks, but also the application-wide end-to-end timing specifications.
In the rest of this chapter, we propose a pipe model to analyze two end-to-end time
semantics: freshness and reaction time. Based on our timing analysis, we provide a math-
ematical framework to derive feasible task periods and budgets that satisfy both schedula-
bility and end-to-end timing requirements.
3.3.1 Communication Model
Data flow involves a pipeline of communicating tasks, leading to a communication model
characterized by: (1) the interarrival times of tasks in the pipeline, (2) inter-task buffering,
and (3) the tasks’ access pattern to communication buffers.
Periodic versus Aperiodic Tasks. Aperiodic tasks have irregular interarrival times,
influenced by the arrival of data. Periodic tasks have fixed interarrival times and operate
on whatever data is available at the time of their execution. A periodic task implements
asynchronous communication by not blocking to await the arrival of new data.
Register-based versus FIFO-based Communication. A FIFO-based shared buffer
is used in scenarios where data history is an important factor. However, autonomous
cyber-physical applications are typically sensor-based control programs and data freshness
outweighs the preservation of the full history of all sampled data. Moreover, FIFO-based
communication results in loosely synchronous communication: the producer is suspended
when the FIFO buffer is full and the consumer is suspended when the buffer is empty.
Register-based communication achieves fully asynchronous communication between two
communicating parties using Simpson’s four-slot algorithm [Sim90].
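To make the register-based mechanism concrete, the following is a minimal single-writer, single-reader sketch of the four-slot idea in Python. It shows only the slot-selection logic; the class name `FourSlotRegister` is hypothetical, and a real implementation would depend on atomic updates of the control variables and memory ordering that this sketch does not model.

```python
class FourSlotRegister:
    """Sketch of a four-slot register: two pairs of two slots each.

    The writer always targets the pair the reader is not using, and the
    slot within that pair that was not written last, so neither side
    ever waits for the other.
    """

    def __init__(self, initial):
        self.data = [[initial, initial], [initial, initial]]
        self.slot = [0, 0]   # most recently written slot within each pair
        self.latest = 0      # pair that holds the freshest value
        self.reading = 0     # pair the reader last announced it is using

    def write(self, item):
        pair = 1 - self.reading        # avoid the reader's pair
        index = 1 - self.slot[pair]    # avoid the last-written slot
        self.data[pair][index] = item
        self.slot[pair] = index
        self.latest = pair

    def read(self):
        pair = self.latest
        self.reading = pair            # announce the pair being read
        index = self.slot[pair]
        return self.data[pair][index]
```

A reader that calls `read()` repeatedly without an intervening write sees the same value again, and intermediate values that were overwritten before being read are lost. This is the behavior that favors freshness over history.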
Implicit versus Explicit Communication. Explicit communication allows access to
shared data at any time point during a task’s execution. This might lead to data incon-
sistency in the presence of task preemption. A task that reads the same shared data at the
beginning and the end of its execution might see two different values, if it is preempted be-
tween the two reads by another task that changes the value of the shared data. Conversely,
the implicit communication model [HDK+17] essentially follows a read-before-execute
paradigm to avoid data inconsistency. It mandates a task to make a local copy of the shared
data at the beginning of its execution and to work on that copy throughout its execution.
We adopt a periodic task model, as well as a register-based, implicit communication
mechanism in favor of data freshness.
3.4 End-to-end Timing Analysis
In this section, we first distinguish two different timing semantics for end-to-end com-
munication, which will be used as the basis for separate timing analyses. Secondly, we
develop a composable pipe model for communication, which is derived from separate
latencies that influence end-to-end delay. Lastly, we use the pipe model to derive the
worst-case end-to-end communication time under various situations.
3.4.1 Semantics of End-to-end Time
To understand the meaning of end-to-end time, consider the following two constraints for
a flight controller:
• Constraint 1: a change in motor speed must be within 2 ms of the gyro sensor reading
that caused the change.
• Constraint 2: an update to a gyro sensor value must be within 2 ms of the corresponding
update in motor speed.
The values before and after a change differ, whereas they may stay the same before and
after an update. These semantics lead to two different constraints. To appreciate the
difference, imagine the two cases in Table 3.1. In Case 1, the task that reads the gyro runs
every 10 ms and the one that controls the motors runs every 1 ms. Case 1 is guaranteed
to meet Constraint 1 because the motor task runs more than once within 2 ms, no matter
whether the gyro reading changes. However, it fails Constraint 2 as the gyro task is not
likely to run even once in an interval of 2 ms. Conversely, Case 2 is guaranteed to meet
Constraint 2 but fails Constraint 1 frequently.
          Gyro Period    Motor Period
Case 1    10 ms          1 ms
Case 2    1 ms           10 ms
Table 3.1: Example Periods
This example demonstrates the difference between the two semantics of end-to-end
time, defined as follows:
• Reaction time is the time it takes for a sample of input data to flow through the
system, and is affected by the period of each consumer in a pipeline. A reaction timing
constraint bounds the time interval between a sensor input and the first corresponding
actuator output.
• Freshness time is the interval within which a sampled data input has influence on
the system, and is affected by the period of each producer in a pipeline. A freshness timing
constraint bounds the interval between a sensor input and the last corresponding actuator
output.
Constraint 1, above, is a constraint on reaction time, while Constraint 2 is on freshness
time. We perform analysis of the two semantics of time in Sections 3.4.5.2 and 3.4.5.3,
respectively.
3.4.2 Latency Contributors
The end-to-end communication delay is influenced by several factors, which we will iden-
tify as part of our analysis. To begin, we first consider the end-to-end communication
pipeline illustrated as a task chain in Figure 3.2. Task $\tau_1$ reads input data $D_{in}$ over
channel $Ch_{in}$, processes it and produces data $D_1$. Task $\tau_2$ reads $D_1$ and produces $D_2$, and $\tau_3$
eventually writes output $D_{out}$ to channel $Ch_{out}$ after reading and processing $D_2$.
Figure 3.2: Task Chain
Each task handles data in three stages, i.e., read, process and write. The end-to-end
time should sum the latency of each stage in the task chain. Due to the asynchrony of
communication, however, we also need to consider one less obvious latency, which is the
waiting time it takes for an intermediate output to be read in as input, by the succeeding
task in the chain. In summary, the latency contributors are classified as follows:
• Processing latency represents the time it takes for a task to translate a raw input
to a processed output. The actual processing latency depends not only on the absolute pro-
cessing time of a task without interruption, but also on the service constraints (i.e., CPU
budget and period of the VCPU) associated with the task.
• Communication latency represents the time to transfer data over a channel. The
transfer data size, channel bandwidth, propagation delay, and communication protocol
overheads all contribute to the overall latency. Since our communication model is asyn-
chronous and register-based as described in Section 3.3.1, queuing latency is not a concern
of this work.
• Scheduling latency represents the time between the arrival of data on a channel
from a sending task and when the receiving task begins reading that data. The schedul-
ing latency depends on the order of execution of tasks in the system, and therefore has
significant influence on the end-to-end communication delay.
3.4.3 The Composable Pipe Model
In Section 3.4.2, we identified the factors that influence end-to-end communication delay.
Among them, the absolute processing time and the transfer data size are determined by the
nature of the task in question. To capture the rest of the timing characteristics, we develop
a composable pipe model, leveraging the scheduling approach described in Section 3.2. A
task and pipe have a one-to-one relationship, as illustrated in Figure 3.3.
A pipe has one pipe terminal and two pipe ends, with one end for input and the other
for output. A pipe terminal is represented by a VCPU, guaranteeing at least C units of
execution time every T time units when runnable. Pipe terminals are associated with
conventional tasks bound to Main VCPUs and kernel control paths (including interrupt
handlers and device drivers) bound to I/O VCPUs, as described in Section 3.2.
A pipe end is an interface to a communication channel, which is either an I/O bus or
Figure 3.3: Illustration of a Pipe
shared memory. The pipe end’s propagation delay is assumed negligible, while the
transmission delay is modeled by the bandwidth parameter, W, of the communication channel.
δ is used to denote the software overheads of a communication protocol. Though we are
aware that δ depends on the data transfer size, the time difference is negligible, compared
to the time of actual data transfer and processing. Therefore, for the sake of simplicity, δ
is a constant in our model.
3.4.3.1 Notation
The timing characteristics of a pipe are denoted by the 3-tuple, $\pi = ((W_i, \delta_i), (C, T), (W_o, \delta_o))$, where:
• $(W_i, \delta_i)$ and $(W_o, \delta_o)$ denote the bandwidth and software overheads of the input and
output ends, respectively.
• $(C, T)$ denotes the budget and period of the pipe terminal.
A task $\tau$ is also denoted as a 3-tuple, $\tau = (d_i, p, d_o)$, where:
• $d_i$ denotes the size of the raw data that is read in by $\tau$ in order to perform its job, and
$d_o$ denotes the size of the processed data that is produced by $\tau$.
• $p$ denotes the uninterrupted processing time it takes for $\tau$ to turn the raw data into
the processed data.
In addition, $\tau \mapsto \pi$ denotes the mapping between task $\tau$ and pipe $\pi$. A task $\tau = (d_i, p, d_o)$
is said to be mapped to a pipe $\pi = ((W_i, \delta_i), (C, T), (W_o, \delta_o))$ when
• data of size $d_i$ is read from the input end with parameters $(W_i, \delta_i)$, and data of size
$d_o$ is written to the output end with parameters $(W_o, \delta_o)$;
• the pipe terminal with parameters $(C, T)$ is used for scheduling and accounting of
the read and write operations, as well as the processing that takes time $p$.
For the composition of a chain of pipes, the operator $|$ connects a pipe’s output end to
its succeeding pipe’s input end. The scheduling latency between two pipes is denoted
by $S_{\tau \mapsto \pi | \tau' \mapsto \pi'}$. Lastly, given a task set $T = \{\tau_1, \tau_2, \cdots, \tau_n\}$ identity mapped to a pipe
set $\Pi = \{\pi_1, \pi_2, \cdots, \pi_n\}$, where pipes are connected to each other in ascending order
of subscript, $E_{\tau_1 \mapsto \pi_1 | \tau_2 \mapsto \pi_2 | \cdots | \tau_n \mapsto \pi_n}$ denotes the end-to-end reaction time of the pipe chain,
and $F_{\tau_1 \mapsto \pi_1 | \tau_2 \mapsto \pi_2 | \cdots | \tau_n \mapsto \pi_n}$ denotes the end-to-end freshness time.
3.4.4 Reachability
Before mathematically analyzing end-to-end time, we introduce the concept of reachability,
inspired by the data-path reachability conditions proposed by Feiertag et al. [FRNJ08].
The need to consider reachability is due to a subtle difference between our register-based
asynchronous communication model and the traditional FIFO-based synchronous commu-
nication. In the latter, data is guaranteed to be transferred without loss or repetition. This
way, end-to-end time is derived from the time interval between the arrival of a data input
and the departure of its corresponding data output. Unfortunately, this might result in an
infinitely large end-to-end time in the case of register-based asynchronous communica-
tion where not every input leads to an output. Instead, unprocessed input data might be
discarded (overwritten) when newer input data is available, as explained in Section 3.3.1.
An infinitely large end-to-end time, while mathematically correct, lacks practical use.
Therefore, the following timing analysis ignores all input data that fails to “reach” the exit
of the pipe chain it enters. Instead, only those data inputs that result in data outputs from
the pipe chain are considered. We define this latter class of inputs as being reachable.
3.4.5 Timing Analysis
As alluded to above, the execution of a task is divided into three stages, involving (1)
reading, (2) processing, and (3) writing data. To simplify the timing analysis, we assume
that tasks are able to finish the read and write stages within one period of the pipe termi-
nal, to which the task is mapped. This is a realistic assumption because: 1) data to be
transferred is usually small, and 2) all three stages are typically able to finish within one
period. However, to maintain generality, we do not impose any restriction on the length of
the processing stage.
3.4.5.1 Worst-case End-to-end Time of a Single Pipe
First, we consider the case where there is a single pipe. Two key observations for this case
are: 1) the absence of scheduling latency due to the lack of a succeeding pipe, and 2) the
equivalence of the two end-to-end time semantics (reaction and freshness time) due to the
lack of a preceding pipe. We therefore use $L_{\tau \mapsto \pi}$ to unify the notation of $E_{\tau \mapsto \pi}$ and $F_{\tau \mapsto \pi}$.
Given task $\tau = (d_i, p, d_o)$ mapped to pipe $\pi = ((W_i, \delta_i), (C, T), (W_o, \delta_o))$, the
worst-case end-to-end time is essentially the execution time of the three stages of $\tau$ on $\pi$. Due
to the timing property of $\pi$’s pipe terminal, $\tau$ is guaranteed $C$ units of execution time
within any window of $T$ time units. Hence, the worst-case latency $L^{wc}_{\tau \mapsto \pi}$ is bounded by
the following:
\[
L^{wc}_{\tau \mapsto \pi} = \left\lfloor \frac{\Delta_{in} + p + \Delta_{out}}{C} \right\rfloor \cdot T + (\Delta_{in} + p + \Delta_{out}) \bmod C \tag{3.2}
\]
where $\Delta_{in} = \frac{d_i}{W_i} + \delta_i$ and $\Delta_{out} = \frac{d_o}{W_o} + \delta_o$.
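Equation 3.2 translates directly into a small helper. The sketch below is illustrative only: the function and parameter names are assumptions, and all quantities are taken to share consistent units (for example, bytes, bytes per millisecond, and milliseconds).

```python
import math


def stage_time(size, bandwidth, overhead):
    """Time to move `size` units of data over one pipe end: d / W + delta."""
    return size / bandwidth + overhead


def worst_case_pipe_latency(d_i, p, d_o, w_i, delta_i, w_o, delta_o,
                            budget, period):
    """Worst-case latency of one task on one pipe, per Equation 3.2."""
    demand = stage_time(d_i, w_i, delta_i) + p + stage_time(d_o, w_o, delta_o)
    full_periods = math.floor(demand / budget)
    return full_periods * period + demand % budget
```

As a usage example, a task that reads 100 bytes at 100 bytes/ms with 0.5 ms overhead, processes for 4 ms, and writes 50 bytes at 100 bytes/ms with no overhead has a demand of 6 ms. On a pipe terminal with C = 4 ms and T = 10 ms, the bound evaluates to 1 · 10 + 2 = 12 ms.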
3.4.5.2 Worst-case End-to-end Reaction Time of a Pipe Chain
In this section, we extend the timing analysis of a single pipe to a pipe chain. For the
sake of simplicity, we start with a chain of length two. We show in Section 3.4.6 that the
mathematical framework is applicable to arbitrarily long pipe chains. To distinguish the
tasks mapped to the two pipes, we name the preceding task producer and the succeeding
consumer. The producer is denoted by $\tau_p = (d^p_i, p^p, d)$ and its pipe is denoted by
$\pi_p = ((W^p_i, \delta^p_i), (C^p, T^p), (W^p_o, \delta^p_o))$. Similarly, the consumer task and pipe are denoted
by $\tau_c = (d, p^c, d^c_o)$ and $\pi_c = ((W^c_i, \delta^c_i), (C^c, T^c), (W^c_o, \delta^c_o))$. Following the definition of the
end-to-end reaction time, $E_{\tau_p \mapsto \pi_p | \tau_c \mapsto \pi_c}$, in Section 3.4.1, we investigate the time interval
between a specific instance of input data, denoted by $D_i$, being read by $\tau_p$, and its first
corresponding output, denoted by $D_o$, being written by $\tau_c$.
It is of vital importance to recognize that end-to-end time of a pipe chain is not simply
the sum of the end-to-end time of each single pipe in the chain. We also need to account
for the scheduling latency resulting from each appended pipe. As described in
Section 3.4.2, the scheduling latency depends on the order of execution of tasks. We,
therefore, perform the timing analysis under two complementary cases: Case 1 - τc has
shorter period and thus higher priority than τp; Case 2 - τp has shorter period and thus
higher priority than τc, according to rate-monotonic ordering.
Figure 3.4: End-to-end Reaction Time in Case 1
Calculating the End-to-end Reaction Time
Case 1. The key to making use of $L^{wc}_{\tau_p \mapsto \pi_p}$
and $L^{wc}_{\tau_c \mapsto \pi_c}$ in the timing analysis of $E^{wc}_{\tau_p \mapsto \pi_p | \tau_c \mapsto \pi_c}$ is to find the worst-case scheduling
latency, $S^{wc}_{\tau_p \mapsto \pi_p | \tau_c \mapsto \pi_c}$. As illustrated in Figure 3.4, the worst-case scheduling latency occurs
when $\tau_c$ preempts $\tau_p$ (Step 1) immediately before $\tau_p$ produces the intermediate output
$D_{int}$ corresponding to $D_i$. After preemption, $\tau_c$ uses up $\pi_c$’s budget and gives the CPU
back to $\tau_p$. Upon being resumed, $\tau_p$ immediately produces $D_{int}$ (Step 2). For $\tau_c$ to
become runnable again to read $D_{int}$ in Step 3, it has to wait for its budget replenishment.
The waiting time is exactly the worst-case scheduling latency:
\[
S^{wc}_{\tau_p \mapsto \pi_p | \tau_c \mapsto \pi_c} = T^c - C^c - \left( \frac{d}{W^p_o} + \delta^p_o \right) \tag{3.3}
\]
+ δpo) (3.3)
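After replenishment, $\tau_c$ reads in $D_{int}$, processes it and eventually writes out $D_o$. As
$E_{\tau_p \mapsto \pi_p | \tau_c \mapsto \pi_c}$ is defined to be the time interval between the arrival of $D_i$ and the departure
of $D_o$, the worst case of $E_{\tau_p \mapsto \pi_p | \tau_c \mapsto \pi_c}$ is as follows:
\[
\begin{aligned}
E^{wc}_{\tau_p \mapsto \pi_p | \tau_c \mapsto \pi_c} &= L^{wc}_{\tau_p \mapsto \pi_p} + S^{wc}_{\tau_p \mapsto \pi_p | \tau_c \mapsto \pi_c} + L^{wc}_{\tau_c \mapsto \pi_c} \\
&= L^{wc}_{\tau_p \mapsto \pi_p} + L^{wc}_{\tau_c \mapsto \pi_c} + T^c - C^c - \left( \frac{d}{W^p_o} + \delta^p_o \right)
\end{aligned}
\tag{3.4}
\]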
Note that if $\tau_c$ runs out of budget before writing $D_o$, $\tau_p$ may overwrite $D_{int}$ in the pipe
with new data (Step 4). However, the implicit communication property guarantees that
$\tau_c$ only works on its local copy of the shared data, which is $D_{int}$ until $\tau_c$ initiates another
read.
Case 2. The situation is more complicated when $\tau_p$ has higher priority than $\tau_c$. The
worst-case scenario in Case 1 does not hold in Case 2 primarily because $\tau_p$ might overwrite
$D_{int}$ before $\tau_c$ has a budget replenishment. This is impossible in Case 1 because $\tau_p$ has a
larger period than $\tau_c$, which is guaranteed to have its budget replenished before $\tau_p$ is able
to initiate another write. In other words, in Figure 3.4, Step 3 is guaranteed to happen
before Step 4.
Figure 3.5: End-to-end Reaction Time in Case 2
The data-overwrite problem in Case 2 is the reason for introducing reachability in
Section 3.4.4. To find the worst-case end-to-end reaction time in this case, we have to find
the scenario that not only leads to the worst-case scheduling latency, but also originates
from a reachable input. Figure 3.5 illustrates a scenario that meets these requirements.
In the figure, $\tau_p$ preempts $\tau_c$ immediately after $\tau_c$ finishes reading $\tau_p$’s intermediate output
(Step 3), $D_{int}$, corresponding to $D_i$. It follows that the longest possible waiting time,
between $D_{int}$ becoming available (Step 2) and $\tau_c$ reading the data (Step 3), is the
period of $\tau_p$ minus both its budget and the execution time of the read stage of $\tau_c$. This
waiting time is exactly the worst-case scheduling latency:
\[
S^{wc}_{\tau_p \mapsto \pi_p | \tau_c \mapsto \pi_c} = T^p - C^p - \left( \frac{d}{W^c_i} + \delta^c_i \right) \tag{3.5}
\]
Between reading $D_{int}$ and writing $D_o$, $\tau_c$ might experience more than one preemption
from $\tau_p$, which repeatedly overwrites the shared data. This will not, however, affect $\tau_c$’s
processing on $D_{int}$ either spatially or temporally, thanks to the VCPU model and the
implicit communication semantics. Similar to Case 1, the worst-case end-to-end reaction time
is again the sum of Equation 3.2 for each pipe and Equation 3.5:
\[
\begin{aligned}
E^{wc}_{\tau_p \mapsto \pi_p | \tau_c \mapsto \pi_c} &= L^{wc}_{\tau_p \mapsto \pi_p} + S^{wc}_{\tau_p \mapsto \pi_p | \tau_c \mapsto \pi_c} + L^{wc}_{\tau_c \mapsto \pi_c} \\
&= L^{wc}_{\tau_p \mapsto \pi_p} + L^{wc}_{\tau_c \mapsto \pi_c} + T^p - C^p - \left( \frac{d}{W^c_i} + \delta^c_i \right)
\end{aligned}
\tag{3.6}
\]
Since the output end of $\tau_p$ and the input end of $\tau_c$ share the same communication
channel, it is reasonable to assume that $W^p_o = W^c_i$ and $\delta^p_o = \delta^c_i$. With that, we proceed to
unify the worst-case end-to-end reaction time as follows:
\[
E^{wc}_{\tau_p \mapsto \pi_p | \tau_c \mapsto \pi_c} =
\begin{cases}
T^c - C^c - \left( \frac{d}{W} + \delta \right) + L^{wc}_{\tau_p \mapsto \pi_p} + L^{wc}_{\tau_c \mapsto \pi_c}, & \text{if } T^c < T^p \\
T^p - C^p - \left( \frac{d}{W} + \delta \right) + L^{wc}_{\tau_p \mapsto \pi_p} + L^{wc}_{\tau_c \mapsto \pi_c}, & \text{otherwise}
\end{cases}
\tag{3.7}
\]
where $W = W^p_o = W^c_i$ and $\delta = \delta^p_o = \delta^c_i$.
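The unified bound of Equation 3.7 can be written as a single function that picks the scheduling-latency term according to the relative periods. The sketch below is a direct transcription under assumed names; `l_p` and `l_c` stand for the single-pipe bounds of Equation 3.2 and are supplied by the caller.

```python
def worst_case_reaction_time(l_p, l_c, c_p, t_p, c_c, t_c, d, w, delta):
    """Worst-case reaction time of a producer/consumer pair (Equation 3.7).

    l_p, l_c : worst-case single-pipe latencies of producer and consumer
    c_*, t_* : budget and period of each pipe terminal
    d, w, delta : size of the intermediate data, bandwidth and software
                  overhead of the channel shared by the two pipes
    """
    transfer = d / w + delta
    if t_c < t_p:
        # Case 1: the consumer has the shorter period (higher priority).
        scheduling = t_c - c_c - transfer
    else:
        # Case 2: the producer has the shorter or an equal period.
        scheduling = t_p - c_p - transfer
    return l_p + scheduling + l_c
```

With l_p = 2, l_c = 1, (C^p, T^p) = (2, 20), (C^c, T^c) = (1, 5) and a transfer cost of 0.5, Case 1 applies and the bound is 2 + (5 − 1 − 0.5) + 1 = 6.5 time units.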
Special Cases. Real-time systems are often profiled offline to obtain worst-case execution
times of their tasks. In our case, this would enable CPU resources for pipe terminals to
be provisioned so that each task completes one iteration of all three stages (read, process,
write) in one budget allocation and, hence, period. This implies that $\Delta_{in} + p + \Delta_{out} + \epsilon = C$
in Equation 3.2, where $\epsilon$ is an arbitrarily small positive number to account for surplus budget
after completing all task stages. With that, it is possible to simplify the worst-case
end-to-end reaction time derived in Section 3.4.5.2. First, Equation 3.2 is simplified as
follows:
\[
\begin{aligned}
L^{wc}_{\tau \mapsto \pi} &= \left\lfloor \frac{C - \epsilon}{C} \right\rfloor \cdot T + [(C - \epsilon) \bmod C] \\
&= 0 \cdot T + (C - \epsilon) \approx C
\end{aligned}
\tag{3.8}
\]
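Using Equation 3.8, Equation 3.6 reduces to:
\[
\begin{aligned}
E^{wc}_{\tau_p \mapsto \pi_p | \tau_c \mapsto \pi_c} &= T^p - C^p - \Delta^c_{in} + L^{wc}_{\tau_p \mapsto \pi_p} + L^{wc}_{\tau_c \mapsto \pi_c} \\
&= T^p - C^p - \Delta^c_{in} + C^p + C^c \\
&= T^p + C^c - \Delta^c_{in}
\end{aligned}
\tag{3.9}
\]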
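The same simplification applied to Equation 3.4 of Case 1 reduces Equation 3.7 to:
\[
E^{wc}_{\tau_p \mapsto \pi_p | \tau_c \mapsto \pi_c} =
\begin{cases}
T^c + C^p - \Delta, & \text{if } T^c < T^p \\
T^p + C^c - \Delta, & \text{otherwise}
\end{cases}
\tag{3.10}
\]
where $\Delta = \frac{d}{W} + \delta$.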
If we further assume that $\pi_p$ and $\pi_c$ communicate data of small size over shared memory,
it is possible to discard communication overheads, such that $\Delta = 0$. With that,
Equation 3.10 simplifies to:
\[
E^{wc}_{\tau_p \mapsto \pi_p | \tau_c \mapsto \pi_c} =
\begin{cases}
T^c + C^p, & \text{if } T^c < T^p \\
T^p + C^c, & \text{otherwise}
\end{cases}
\tag{3.11}
\]
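Finally, notice that by appending $\tau_c \mapsto \pi_c$ to $\tau_p \mapsto \pi_p$, the worst-case end-to-end reaction
time is increased by the following:
\[
\begin{aligned}
\uparrow E^{wc} &= E^{wc}_{\tau_p \mapsto \pi_p | \tau_c \mapsto \pi_c} - E^{wc}_{\tau_p \mapsto \pi_p} \\
&= E^{wc}_{\tau_p \mapsto \pi_p | \tau_c \mapsto \pi_c} - C^p \\
&=
\begin{cases}
T^c, & \text{if } T^c < T^p \\
T^p - C^p + C^c, & \text{otherwise}
\end{cases}
\end{aligned}
\tag{3.12}
\]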
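As a quick numeric check of the special-case results, the sketch below evaluates Equations 3.11 and 3.12 and confirms that the increment equals the two-pipe bound minus the producer's budget. The function names and the example budgets and periods (in microseconds) are illustrative assumptions, not values from this work.

```python
def simplified_reaction_time(c_p, t_p, c_c, t_c):
    """Equation 3.11: two-pipe reaction bound with Delta = 0 and L = C."""
    return t_c + c_p if t_c < t_p else t_p + c_c


def reaction_time_increase(c_p, t_p, c_c, t_c):
    """Equation 3.12: extra reaction time from appending the consumer pipe."""
    return t_c if t_c < t_p else t_p - c_p + c_c


# Hypothetical values: producer (C, T) = (1000, 10000), consumer (200, 1000).
case1 = simplified_reaction_time(1000, 10000, 200, 1000)   # T^c + C^p
# Hypothetical values: producer (200, 1000), consumer (1000, 10000).
case2 = simplified_reaction_time(200, 1000, 1000, 10000)   # T^p + C^c
```

In both cases the increment returned by `reaction_time_increase` equals the two-pipe bound minus C^p, which is the identity stated in the second line of Equation 3.12.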
3.4.5.3 Worst-case End-to-end Freshness Time of a Pipe Chain
Techniques similar to those in Section 3.4.5.2 will be used to analyze end-to-end freshness
time. To avoid repetition, we abbreviate the end-to-end freshness timing analysis by only
focusing on the special cases described in Section 3.4.5.2.
Recall that freshness time is defined to be the interval between the arrival of an input
and the departure of its last corresponding output. Therefore, we investigate the interval
between a specific instance of input data, Di, being read by τp and its last corresponding
output, Do, being written by τc.
Figure 3.6: End-to-end Freshness Time in Case 1
Case 1. As illustrated in Figure 3.6, $D_i$ is read by the first instance of $\tau_p$ at time 0
and the intermediate output, $D_{int}$, is written to the shared data (Step 1). After that, $\tau_c$
produces three outputs corresponding to $D_{int}$ (Steps 2, 3 and 4), or to $D_i$ indirectly.
The last output, $D_o$, is the one preceding $\tau_p$’s write of new data, $D_{new}$ (Step 5). Thus,
the worst-case end-to-end freshness time, $F^{wc}_{\tau_p \mapsto \pi_p | \tau_c \mapsto \pi_c}$, occurs when: 1) the two
consecutive writes (Steps 1 and 5) from $\tau_p$ have the longest possible time interval between
them, and 2) the write of $D_o$ happens as late as possible. The latest time to write $D_o$ is
immediately before the second write of $\tau_p$, which is preempted by higher priority $\tau_c$.
It follows from Figure 3.6 that the worst-case end-to-end freshness time is:
\[
F^{wc}_{\tau_p \mapsto \pi_p | \tau_c \mapsto \pi_c} = 2 \cdot T^p - \Delta^p_{out} \tag{3.13}
\]
Again, when communicating over shared memory, Equation 3.13 can be further simplified
to:
\[
F^{wc}_{\tau_p \mapsto \pi_p | \tau_c \mapsto \pi_c} = 2 \cdot T^p \tag{3.14}
\]
Case 2. When $\tau_p$ has a smaller period than $\tau_c$, it is impossible for $\tau_c$ to read the same
intermediate output of $\tau_p$ twice. In Figure 3.6, Step 5 is guaranteed to happen before
Step 3. Thus, the worst-case freshness time is essentially the worst-case reaction time, shown
in Equation 3.6.
In summary, the worst-case end-to-end freshness latency of two communicating pipes
is represented in the following conditional equation:
$$F_{\tau_p \mapsto \pi_p | \tau_c \mapsto \pi_c} =
\begin{cases}
2 \cdot T^p, & \text{if } T^c < T^p \\
T^p + C^c, & \text{otherwise}
\end{cases} \tag{3.15}$$
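Equation 3.15 can be transcribed directly into code. The following C function is a minimal sketch of that conditional and is not part of the thesis tooling; the function and parameter names are hypothetical, and all quantities are assumed to be expressed in the same time unit.

```c
#include <assert.h>

/* Worst-case end-to-end freshness time of a producer/consumer pipe pair,
 * following the two branches of Equation 3.15 (shared-memory channel).
 * t_p: producer period, t_c: consumer period, c_c: consumer budget.
 * Names are illustrative only. */
double worst_case_freshness(double t_p, double t_c, double c_c)
{
    assert(t_p > 0 && t_c > 0 && c_c >= 0);
    if (t_c < t_p)
        return 2.0 * t_p;   /* Case 1: consumer may re-read the same data */
    return t_p + c_c;       /* Case 2: reduces to the reaction-time bound */
}
```

For instance, a producer with period 10 feeding a consumer with period 5 gives a bound of 20, whereas equal periods of 10 with a consumer budget of 1 give a bound of 11.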
3.4.6 Composability
The timing analysis for two pipes in Section 3.4.5.2 extends to pipe chains of arbitrary
length. Every time an extra task τnew (mapped to πnew) is appended to the tail end of a
chain ($\tau_{tail} \mapsto \pi_{tail}$), the worst-case end-to-end reaction time increases by the
worst-case end-to-end time of the newly appended pipe, plus the scheduling latency between
the new pipe and the tail pipe. The actual value of the increase, depending on the relative
priority of the new pipe and the tail pipe, is shown in Equation 3.12. Similarly, the added
end-to-end freshness time can be derived from Equation 3.15.
Composability is a crucial property of our pipe model, since it significantly eases the
end-to-end time calculation for any given pipeline. This provides the basis for a design
framework that derives task periods from given end-to-end timing constraints. This is
detailed in the following section.
3.5 End-to-end Design
Given the end-to-end timing analysis, we now describe a mathematical framework that
generates task timing properties, especially task periods, from end-to-end timing
specifications and schedulability constraints. This framework is intended to solve one of
the major issues of constructing cyber-physical applications: how to determine the period
of each task. A naive approach would be to start by choosing a tentative set of periods
and use the timing analysis method in Section 3.4 to validate the timing correctness. Upon
failure, the periods are heuristically adjusted and the validation step is repeated until
end-to-end timing guarantees are met. This approach, however, is potentially time-consuming
and labor-intensive when the number of tasks or constraints increases.
Inspired by Gerber et al. [GHS95], we derive task periods from end-to-end timing
constraints, by combining the timing analysis of the pipe model with linear optimization
techniques. In this section, we generalize our method for use with a broader spectrum of
cyber-physical control applications.
3.5.1 Problem Definition
Consider a set of tasks $\Gamma = \{\tau_1, \tau_2, \cdots, \tau_n\}$ and a set of pipes
$\Pi = \{\pi_1, \pi_2, \cdots, \pi_n\}$, where $\tau_j = (d^j_i, p^j, d^j_o)$ and
$\pi_j = ((W^j_i, \delta^j_i), (C^j, T^j), (W^j_o, \delta^j_o))$. We additionally require the
following information:
• the mapping between Γ and Π. For ease of notation, we assume tasks map to the
pipe with the same subscript, hence $\forall j \in \{1, 2, \cdots, n\}, \tau_j \mapsto \pi_j$;
• the topology of Π (an example is shown in Figure 3.7);
• $\forall j \in \{1, 2, \cdots, n\}$, the value of $d^j_i$, $p^j$ and $d^j_o$;
• $\forall j \in \{1, 2, \cdots, n\}$, the value of $W^j_i$, $\delta^j_i$, $W^j_o$ and $\delta^j_o$;
• the end-to-end timing constraints, namely the value of
$E_{\tau_i \mapsto \pi_i | \tau_j \mapsto \pi_j | \cdots | \tau_k \mapsto \pi_k}$ and/or
$F_{\tau_p \mapsto \pi_p | \tau_q \mapsto \pi_q | \cdots | \tau_r \mapsto \pi_r}$,
where $i, j, k, p, q, r \in \{1, 2, \cdots, n\}$.
The aim is to find a feasible set of $\{(C^j, T^j)\}$ pairs for $j \in \{1, 2, \cdots, n\}$ that:
(1) meets all the specified end-to-end timing constraints, (2) passes the task schedulability
test, and (3) ideally but not necessarily minimizes CPU utilization. A task should not be run
faster than necessary, to free resources for additional system objectives.
Reaction:        $E_{\tau_1 \mapsto \pi_1 | \tau_4 \mapsto \pi_4} \le 10$,
                 $E_{\tau_2 \mapsto \pi_2 | \tau_4 \mapsto \pi_4} \le 15$,
                 $E_{\tau_2 \mapsto \pi_2 | \tau_5 \mapsto \pi_5 | \tau_6 \mapsto \pi_6} \le 25$,
                 $E_{\tau_3 \mapsto \pi_3 | \tau_6 \mapsto \pi_6} \le 15$
Freshness:       $F_{\tau_1 \mapsto \pi_1 | \tau_4 \mapsto \pi_4} \le 20$,
                 $F_{\tau_2 \mapsto \pi_2 | \tau_4 \mapsto \pi_4} \le 30$,
                 $F_{\tau_2 \mapsto \pi_2 | \tau_5 \mapsto \pi_5 | \tau_6 \mapsto \pi_6} \le 50$,
                 $F_{\tau_3 \mapsto \pi_3 | \tau_6 \mapsto \pi_6} \le 20$
Schedulability:  $\sum_{j=1}^{6} \frac{C^j}{T^j} \le 6(\sqrt[6]{2} - 1)$
Execution Time:  $\forall j \in \{1, 2, \cdots, 6\}$:
                 $d^j_i = d^j_o = 3$, $W^j_i = W^j_o = 20$,
                 $\delta^j_i = \delta^j_o = 0.1$, $p^j = 0.5$
Table 3.2: Application Timing Characteristics
Figure 3.7: Application Task Graph
3.5.2 Solving the Constraints
Our solution is carried out in a three-step process. To make it easier to understand, we
use a concrete example with actual numbers to elaborate the process. Consider the pipe
topology graph shown in Figure 3.7, in which there are six tasks mapped to six pipes. Tasks
1, 2 and 3 read inputs from sensors, Tasks 4 and 6 write their outputs to actuators, and Task
5 is an intermediary responsible for complicated processing such as PID control or sensor
data fusion. The timing characteristics of the tasks and pipes are shown in Table 3.2. Note
that the execution times are assumed to be identical for all tasks. In practice this would
not necessarily be the case but it does not affect the generality of the approach.
In Step 1, we use the given $d_i$, $d_o$, $p$, $W_i$, $\delta_i$, $W_o$ and $\delta_o$ to compute
the budget of each pipe terminal. The budget is set to a value that ensures the three stages
(i.e., read, process and write) finish in one period. To compute $C^1$, for example, we
aggregate the times for τ1 to read, process and write data. Thus
$C^1 = \frac{d^1_i}{W^1_i} + \delta^1_i + p^1 + \frac{d^1_o}{W^1_o} + \delta^1_o$. All budgets are
computed in a similar way.
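The budget formula of Step 1 can be evaluated mechanically. The C sketch below assumes the Table 3.2 parameters; the struct and function names are hypothetical, and reading W as a transfer rate and δ as a fixed per-transfer overhead is an interpretation made for the comments rather than a definition taken from this section.

```c
#include <assert.h>

/* Per-pipe-terminal inputs to the budget formula (names are illustrative). */
struct pipe_io {
    double d_in, w_in, delta_in;     /* d_i, W_i, delta_i */
    double p;                        /* processing time p */
    double d_out, w_out, delta_out;  /* d_o, W_o, delta_o */
};

/* C = d_i / W_i + delta_i + p + d_o / W_o + delta_o */
double pipe_budget(const struct pipe_io *io)
{
    assert(io->w_in > 0 && io->w_out > 0);
    return io->d_in / io->w_in + io->delta_in
         + io->p
         + io->d_out / io->w_out + io->delta_out;
}
```

With the values of Table 3.2 (d = 3, W = 20, δ = 0.1, p = 0.5), the expression evaluates to 3/20 + 0.1 + 0.5 + 3/20 + 0.1 = 1.0.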
When the input to a pipe terminal comes from multiple sources, the value $d_i$ is aggregated
from all input channels. For example, τ4 receives a maximum of $d^4_i = d_1 + d_2$
amount of data every transfer from both τ1 and τ2. Data from a pipe terminal is not
necessarily duplicated for all pipe terminals that are consumers. For example, τ2 generates
a maximum of $d^2_o = d_2$ data every transfer, by placing a single copy of the output in a
shared memory region accessible to both τ4 and τ5. If the communication channels did not
involve shared memory, then data would be duplicated, so that $d^2_o = 2d_2$.
In Step 2, we derive a list of inequations involving period variables from the given
end-to-end timing and scheduling constraints in Table 3.2. For simplicity, the scheduling
constraint is shown as a rate-monotonic utilization bound on the six pipe tasks. However,
for sensor inputs and some actuator outputs, our system would map those tasks to I/O
VCPUs that have a different utilization bound, as described in our earlier work [DLW11].
The derivation is based on Equations 3.10 and 3.15, and the composability property
of the pipe model. According to the conditional equations, however, every two pipes with
undetermined priority can lead to two possible inequations. This exponentially increases
the search space for feasible periods. In order to prune the search space, our strategy is
to always start with the case where $T^p > T^c$. This is based on the observation that tasks
tend to over-sample inputs for the sake of better overall responsiveness. Thus, the reaction
constraint $E_{\tau_2 \mapsto \pi_2 | \tau_5 \mapsto \pi_5 | \tau_6 \mapsto \pi_6} \le 25$, for example,
is translated to the inequation $T^5 + C^2 - \Delta + T^6 \le 25$. This is derived by combining
Equations 3.10 and 3.12. It is then possible to translate all timing constraints to
inequations with only periods as variables. In addition, periods are implicitly constrained
by $T^j > C^j, \forall j \in \{1, 2, \cdots, n\}$.
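As a worked instance of this translation, consider the three two-pipe freshness constraints of Table 3.2 under the starting assumption that the producer has the larger period. The first branch of Equation 3.15 then applies, so each constraint bounds only the producer's period:

```latex
\begin{align*}
F_{\tau_1 \mapsto \pi_1 | \tau_4 \mapsto \pi_4} \le 20
  &\;\Rightarrow\; 2 \cdot T^1 \le 20 \;\Rightarrow\; T^1 \le 10 \\
F_{\tau_2 \mapsto \pi_2 | \tau_4 \mapsto \pi_4} \le 30
  &\;\Rightarrow\; 2 \cdot T^2 \le 30 \;\Rightarrow\; T^2 \le 15 \\
F_{\tau_3 \mapsto \pi_3 | \tau_6 \mapsto \pi_6} \le 20
  &\;\Rightarrow\; 2 \cdot T^3 \le 20 \;\Rightarrow\; T^3 \le 10
\end{align*}
```

Each line involves a single period variable, which illustrates why the resulting system of inequations stays small when pipes have few fan-in and fan-out connections.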
Given all the inequations, Step 3 attempts to find the maximum value for each period
so that the total CPU utilization is minimized. We are then left with a linear programming
problem. Unfortunately, no polynomial-time algorithm is known for the integer linear
programming problem, which is NP-hard. Though solutions are possible under certain
mathematical conditions [BHM77], this is beyond the scope of this thesis.
Instead, in practice, the problem can be simplified because 1) there are usually a small
number of fan-in and fan-out pipe ends for each task, meaning that a period variable is
usually involved in a small number of inequations, and 2) a sensor task period is usually
pre-determined by a hardware sampling rate limit. For example, if we assume $T^3$
is known to be 5, a feasible set of periods for the example in Table 3.2 is easily found:
$\{T^1 = 10, T^2 = 15, T^3 = 10, T^4 = 10, T^5 = 15, T^6 = 5\}$. If we ignore the integer
requirement, it is possible to find a feasible solution in polynomial time using rational
numbers rounded to integers. Though rounding may lead to constraint violations, it is
possible to increase the time resolution to ensure system overheads exceed those of rounding
errors. If all else fails, one can resort to an exhaustive search of all possible constraint
solutions.
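A candidate period set can be checked against the schedulability row of Table 3.2 with a few lines of code. The following C sketch is illustrative only: the function names are hypothetical, the n-th root is obtained by bisection to avoid depending on the math library, and every budget is assumed to evaluate to 1 from the Table 3.2 parameters.

```c
#include <assert.h>

/* Rate-monotonic utilization bound n * (2^(1/n) - 1).
 * The n-th root of 2 is found by bisection on x^n = 2 over [1, 2]. */
double rm_bound(int n)
{
    assert(n > 0);
    double lo = 1.0, hi = 2.0;
    for (int iter = 0; iter < 60; iter++) {
        double mid = 0.5 * (lo + hi), pw = 1.0;
        for (int k = 0; k < n; k++)
            pw *= mid;
        if (pw < 2.0)
            lo = mid;
        else
            hi = mid;
    }
    return n * (lo - 1.0);
}

/* Total utilization: sum of C^j / T^j over all n pipes. */
double utilization(const double *c, const double *t, int n)
{
    double u = 0.0;
    for (int j = 0; j < n; j++) {
        assert(t[j] > c[j]);   /* implicit constraint T^j > C^j */
        u += c[j] / t[j];
    }
    return u;
}

/* Returns 1 if the period set passes the utilization-bound test, else 0. */
int is_schedulable(const double *c, const double *t, int n)
{
    return utilization(c, t, n) <= rm_bound(n);
}
```

For the period set {10, 15, 10, 10, 15, 5} with unit budgets, the utilization is about 0.633, which is below the six-task bound of about 0.735.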
3.6 Evaluation
This section describes a simulation experiment, conducted on the Intel Aero board with an
Atom x7-Z8750 1.6 GHz 4-core processor and 4GB RAM.
We developed simulations for both Linux and our RTOS, to predict the worst-case
end-to-end time using the equations in Section 3.4. The simulations consist of seven tasks,
all of which search for prime numbers within a certain range and then communicate with
one another to exchange their results. Each of the tasks is mapped to a separate pipe.
The topology of the pipes is shown in Figure 3.8. The communication channel is shared
memory with caches disabled and the data size is set to 6.7 KB to achieve a non-negligible
1 millisecond communication overhead. Each task is assigned a different search range and
the profiled execution time is shown in Table 3.3 in milliseconds. The budget of each pipe
is set to be slightly larger than the execution time of its corresponding task, to compensate
for system overheads. The settings of each pipe terminal (PT) are also shown in Table 3.3,
again in milliseconds. Apart from the seven main tasks, the system is loaded with low
priority background tasks that consume all the remaining CPU resources.
Figure 3.8: Simulation Pipe Topology
Task                   τ1        τ2      τ3       τ4       τ5        τ6        τ7
Execution time (ms)    11.5      5.5     3.5      5.5      11.5      11.5      3.5
Pipe terminal          PT 1      PT 2    PT 3     PT 4     PT 5      PT 6      PT 7
(Budget, period) (ms)  (12,100)  (6,50)  (4,150)  (6,100)  (12,150)  (12,100)  (4,50)
Table 3.3: Simulation Settings
We measure the end-to-end reaction time and freshness time separately, for pipeline
p1 → p2 → p3 → p4 → p5 (denoted by P1), and p6 → p3 → p4 → p7 (denoted by P2),
and compare them to the corresponding theoretical bounds. Figure 3.9 shows the results after
100,000 outputs are produced by τ5 and τ7, respectively, on our RTOS. As can be seen, the
observed values are always within the prediction bounds.
The difference between observed and predicted freshness times is greater than between
observed and predicted reaction times. This is because our strategy for deriving feasible
task periods starts with producers having greater periods than consumers. As stated in
Section 3.4.1, the freshness time is affected by the period of each producer, so the prediction
may not be as tight as for reaction time.
Figure 3.9: Observed vs. Predicted Freshness & Reaction Times
[Bar chart. Y-axis: Time (ms). Groups: P1 Reaction, P2 Reaction, P1 Freshness, P2 Freshness. Series: Observed, Predicted.]
We also perform the same experiment on the Yocto Linux distribution shipped with the Aero
board. The kernel is version 4.4.76, patched with the PREEMPT_RT patch. While running
the simulation, the system also uncompresses Linux source code in the background. This
places the same load on the system as the background tasks in our RTOS. Figure 3.10
summarizes the average reaction time (AVGR), worst-case reaction time (WCR), max-
imum variance of reaction time (MaxRV), average freshness time (AVGF), worst-case
freshness time (WCF) and maximum variance of freshness time (MaxFV) of pipeline P1
for our RTOS and Linux. Compared to Linux, there is less variance shown by the end-to-
end times using our system. Additionally, the freshness and reaction times are lower than
with Linux.
Figure 3.10: Our System vs. Linux End-to-end Times
[Bar chart. Y-axis: Time (ms). Groups: AVGR, WCR, MaxRV, AVGF, WCF, MaxFV. Series: Our RTOS, Linux.]
Chapter 4
The Qduino Programming Platform
In the previous chapter, we described the VCPU scheduling algorithm, which provides
runtime guarantees for each task’s timing specification. We also presented the end-to-end
design framework, which generates task timing properties from application-wide,
end-to-end timing specifications. However, it remains unclear by what means end users can
specify these timing requirements. Partly to address this problem, in this chapter we
present a cyber-physical application programming platform called Qduino. Qduino provides
a simple interface for users to specify both per-task and application-wide timing
requirements.
Moreover, Qduino also offers a straightforward way to access I/O devices and spawn
concurrent tasks. Cyber-physical applications interact frequently with the physical
environment through I/O devices involving sensors and actuators. We believe that an
easy-to-use programming interface for performing I/O operations is the key to easing CPS
software development. For example, Arduino, one of the most popular physical computing
programming platforms, is well-known for its clear and simple way to interact with I/O
devices.
However, Arduino is designed to target microcontrollers tasked primarily with indus-
trial control and thus only allows a single thread of execution. On the other hand, CPS,
nowadays, are evolving to more complex systems, such as autonomous cars and drones.
Those applications typically require more computing resources for tasks such as sen-
sor data fusion, decision making and networked communication. To meet this demand,
multicore single-board computers (SBC) such as the Intel Edison board, the Minnow-
board MAX and the Aero board, are increasingly popular as hardware platforms. At
the same time, in order to cope with the complexity of the hardware as well as to ac-
complish advanced application-specific missions, CPS software is also becoming more
complicated. Consequently, a single thread of execution fails to take advantage of the un-
derlying hardware parallelism and needlessly groups functionally independent tasks into a
single schedulable entity, disabling task interleaving.
To accommodate these demands, Qduino is designed to be backward compatible with
the Arduino API for its simple I/O interface. Yet Qduino extends the Arduino API with
support for user-specified timing requirements and multithreading, with multicore support
in mind. We evaluate its applicability and effectiveness by using Qduino to build a web-
connected 3D printer.
4.1 The Standard Arduino API
Arduino is an open source hardware and software platform that offers a clear and simple
environment for physical computing. It is now widely used in cyber-physical applications
and the Internet of Things, due in part to its low cost, ease of programming, and rapid
prototyping capabilities. Sensors and actuators can easily be connected to the analog and
digital I/O pins of an Arduino device, which features an on-board microcontroller pro-
grammed using the Arduino API. The Arduino language reference [ard18] specifies 40
functions available for all Arduino-compatible platforms. Table 4.1 lists all the functions
in different categories.
Function Name                                                 Category
loop, setup                                                   Structure
pinMode, digitalWrite, digitalRead                            Digital I/O
analogWrite, analogRead, analogReference                      Analog I/O
tone, noTone, shiftOut, shiftIn, pulseIn                      Advanced I/O
millis, micros, delay, delayMicroseconds                      Time
min, max, abs, constrain, map, pow, sqrt                      Math
sin, cos, tan                                                 Trigonometry
randomSeed, random                                            Random Numbers
lowByte, highByte, bitRead, bitWrite, bitSet, bitClear, bit   Bits and Bytes
attachInterrupt, detachInterrupt                              External Interrupts
interrupts, noInterrupts                                      Interrupts
Table 4.1: Arduino Standard API
For structuring sketches, the Arduino API offers two functions: setup and loop. The setup
function is called when a sketch starts and usually contains code for initialization. After
calling the setup function, the loop function repeatedly performs a series of jobs. While the
structure of an Arduino sketch matches our task model described in Section 3.1 and
provides an appropriate interface for developing cyber-physical applications, the standard
Arduino API only allows one loop function.
Within the loop function, a sketch typically interacts with I/O devices using functions
such as digitalWrite and digitalRead, or computes application logic written in C-like
syntax. Listing 4.1 shows a simple Arduino sketch that toggles an LED connected to pin 13
every second.
Listing 4.1: An Arduino Sketch Example
/* The LED is connected to pin 13 */
#define LED_PIN 13
/* current output level of the LED pin */
int val = 0;
void setup()
{
    /* set up pin 13 as output */
    pinMode (LED_PIN, OUTPUT);
}
void loop()
{
    /* toggle pin 13 */
    digitalWrite (LED_PIN, (val = !val));
    /* sleep for 1 second */
    delay(1000);
}
4.2 The Qduino Extension
Qduino adopts the simplicity of the Arduino APIs by maintaining backward compatibility
with the original API. An Arduino sketch can be built and run on a Qduino platform
without any modification. Specifically, Qduino uses the same functions to perform I/O
operations. The extensions focus on support for timing specification, multithreading and
multicore architectures.
Qduino is implemented as a library on top of the Quest real-time kernel, which runs on
multicore x86 platforms such as the Intel Galileo and MinnowMAX SBCs. A Qduino loop
is implemented as a Quest thread and scheduled by the VCPU scheduler. An overview of
the Qduino system architecture is shown in Figure 4.1.
Figure 4.1: Qduino Architecture Overview
Qduino introduces a set of new functions, listed in Table 4.2, which are elaborated
in the following sections.
4.2.1 Support for User-specified Timing Requirements
Apart from the loop ID, the new loop function takes two extra parameters, i.e., the budget
and period of the task. By default, they are specified in milliseconds, although Qduino can
be configured to accept their specification in different time units. As mentioned earlier,
each Qduino loop is implemented as a Quest thread associated with a separate VCPU. The
VCPU scheduling framework, as explained in Section 3.2, partitions CPU resources
precisely between tasks and thereby ensures temporal isolation between them. By making
use of this property provided by Quest, Qduino guarantees each loop’s user-specified
timing requirements by preventing the execution of one loop from interfering with the
timely execution of others.
Function Signatures                                  Category
loop(loopID, budget, period [,coreID])               Structure
interruptsVcpu(device, budget, period [,coreID]),    Interrupt
attachInterruptVcpu(pin, mode, C, T),
noInterrupts(device [,coreID]),
noTimer(coreID)
delayBusyMicroseconds(ms),                           Time
delayBusyNanoseconds(ns)
spinlockInit(lock),                                  Spinlock
spinlockLock(lock),
spinlockUnlock(lock)
channelWrite(channel, item),                         Four-slot
item channelRead(channel)
ringbufInit(buffer, size),                           Ring buffer
ringbufWrite(buffer, item),
ringbufRead(buffer, item)
Table 4.2: New APIs
As also mentioned earlier, Quest is capable of scheduling interrupt handlers as time-
budgeted threads, to avoid interference with other tasks. We exploit this feature by creating
I/O VCPUs to handle interrupt bottom halves associated with devices. Qduino currently
supports GPIO, I2C and SPI. The I/O VCPU budget prevents a high volume of interrupts
being handled indefinitely, at the cost of other loops. The function interruptsVcpu is used
to control the timing parameters of the I/O VCPU of the specified I/O device. By careful
tuning of I/O VCPU budgets, it is possible for a system designer to balance CPU time
between CPU- and I/O-intensive tasks. Setting the capacity of the I/O VCPU associated
with a certain I/O device to 0 effectively disables interrupt handling for that device.
Though not listed in Table 4.2, noInterrupts and interrupts disable and re-enable
interrupts, respectively. These two functions are currently implemented as wrappers around
the interruptsVcpu function. Finally, attachInterruptVcpu extends the standard
attachInterrupt function by requiring the specification of Main VCPU timing constraints
for a user-level ISR.
Through a separate configuration file fed into the build process, Qduino also allows
users to specify loop pipelines with end-to-end timing requirements, such as those listed in
Table 3.2. The method described in Section 3.5 is then invoked to derive periods for
those tasks whose timing properties are not specified.
Thanks to the programming interfaces introduced in this section, Qduino exposes the
features of the VCPU scheduling algorithm and the end-to-end design framework, de-
scribed in Chapter 3, to end users in a simple and clean way.
4.2.2 Support for Multithreaded Sketch
As mentioned in Section 4.1, the standard Arduino API offers two structure functions:
setup and loop. While only one loop function is allowed in the standard API, Qduino
allows up to 32 loop functions, each of which takes a loop ID as its first parameter.
Multiple loop support in Qduino makes it easier to write sketches with parallel tasks.
A simple example might be to process sensor input data from one I/O pin while another
I/O pin is used for output, perhaps to control an actuator. If the input and output processing
require separate rates for reading and writing data, a single timed loop might be inadequate.
The loop will have a certain period, which might satisfy one, but not necessarily both, of
the input and output rates. The code snippet shown in Listing 4.2 exhibits an incorrect
way to perform two I/O operations at separate rates in a single loop. A similar example
is shown for blinking LEDs in the standard Arduino API [bli18], which suggests that users
do the time accounting on their own. This places the burden of scheduling on users, making
code overly complex and vulnerable to mistakes when the number of tasks increases. With
the multi-loop feature, separate tasks with different delays between I/O operations can be
assigned to different loops, with the assurance that their delay settings will not affect other
tasks. A Qduino code snippet is shown in Listing 4.3.
Listing 4.2: Incorrect Arduino Sketch Performing Parallel Tasks
int val;
int pin = 0;
void setup()
{
    /* set up pin 2 and 3 */
    pinMode (2, INPUT);
    pinMode (3, OUTPUT);
}
void loop()
{
    /* try to toggle pin 3 every 2 milliseconds */
    digitalWrite (3, (pin = !pin));
    delay(2);
    /* try to read pin 2 every 3 milliseconds */
    val = digitalRead(2);
    delay(3);
    /*
     * consequently, both I/O operations will be performed
     * every 5 milliseconds
     */
}
Listing 4.3: Qduino Sketch Performing Parallel Tasks
int val;
int pin = 0;
void setup()
{
    /* set up pin 2 and 3 */
    pinMode (2, INPUT);
    pinMode (3, OUTPUT);
}
void loop(1, 1, 2)
{
    /* loop 1 toggles pin 3 every 2 milliseconds */
    digitalWrite (3, (pin = !pin));
    delay(2);
}
void loop(2, 1, 3)
{
    /* loop 2 reads pin 2 every 3 milliseconds */
    val = digitalRead(2);
    delay(3);
}
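The rate mismatch between the single-loop and multi-loop sketches can be made concrete with a small timeline model. The plain C sketch below is only an idealized illustration: it ignores execution time and scheduling overhead, counts how often an operation fires in a time window, and uses hypothetical function names.

```c
#include <assert.h>

/* Count how many times an operation fires in [0, window_ms) when it first
 * fires at start_ms and then repeats every period_ms. */
int count_events(int start_ms, int period_ms, int window_ms)
{
    assert(period_ms > 0);
    int count = 0;
    for (int t = start_ms; t < window_ms; t += period_ms)
        count++;
    return count;
}

/* Single loop: delay(2) and delay(3) run back to back, so the toggle
 * repeats only once per 2 + 3 = 5 ms iteration. */
int single_loop_toggles(int window_ms)
{
    return count_events(0, 2 + 3, window_ms);
}

/* Separate loops: the toggle loop is paced by its own 2 ms delay alone. */
int multi_loop_toggles(int window_ms)
{
    return count_events(0, 2, window_ms);
}
```

Over a 30 ms window, the single-loop model toggles the pin 6 times, while the model with a dedicated loop toggles it 15 times, which is the intended 2 ms rate.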
From a performance point of view, binding loops to threads also provides the pos-
sibility to interleave the execution of independent tasks. This is particularly beneficial
when blocking I/O operations frequently occur. In a single-threaded sketch, a blocking
I/O operation in one task would prevent other tasks in the same sketch from executing.
In Section 4.3, we experiment with a multithreaded sketch which has both CPU- and I/O-
intensive tasks. Significant performance improvement is observed over a single-threaded
sketch containing the same tasks.
Communication and Synchronization. In a multi-loop Qduino sketch, communica-
tion between loops can be done via global variables, as loops are internally Quest threads.
However, unrestricted use of shared variables is unreliable and unsafe due to multiple
update problems. Therefore, spinlocks are made available for use in Qduino. To hide the
complexity of explicit synchronization and to maintain the simplicity of Arduino program-
ming, we further provide two asynchronous communication facilities: a four-slot [Sim90]
channel and a ring buffer. Simpson’s four-slot fully asynchronous communication mech-
anism allows a single reader and writer to access a shared memory region in such a way
that the reader always accesses the most recent data stored by the writer, and neither en-
tity need wait for the other [Rus02]. Thus, data is always fresh, even though some may
be over-written and, hence, lost. Four-slot asynchronous communication is widely used
in real-time systems to guarantee that actuators always read the latest data from sensors.
We also provide a single-reader, single-writer ring buffer FIFO for applications that want
historical data values to be preserved.
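The four-slot mechanism described above can be sketched in C as follows. This is an illustrative rendering of the classic four-slot algorithm for one writer and one reader, using C11 atomics for the control variables; it is not Qduino's channelWrite/channelRead implementation, the type and function names are hypothetical, and the payload is reduced to a single int.

```c
#include <assert.h>
#include <stdatomic.h>

struct four_slot {
    int data[2][2];      /* two pairs of two slots each */
    atomic_int slot[2];  /* per pair: index of the slot written last */
    atomic_int latest;   /* pair written most recently */
    atomic_int reading;  /* pair the reader has announced it is using */
};

void four_slot_init(struct four_slot *ch, int initial)
{
    assert(ch != 0);
    for (int p = 0; p < 2; p++) {
        ch->data[p][0] = ch->data[p][1] = initial;
        atomic_init(&ch->slot[p], 0);
    }
    atomic_init(&ch->latest, 0);
    atomic_init(&ch->reading, 0);
}

/* Writer: never touches the slot the reader may currently be using. */
void four_slot_write(struct four_slot *ch, int item)
{
    int pair = !atomic_load(&ch->reading);      /* avoid the reader's pair */
    int index = !atomic_load(&ch->slot[pair]);  /* avoid that pair's last slot */
    ch->data[pair][index] = item;
    atomic_store(&ch->slot[pair], index);
    atomic_store(&ch->latest, pair);
}

/* Reader: picks the most recently completed write without blocking. */
int four_slot_read(struct four_slot *ch)
{
    int pair = atomic_load(&ch->latest);
    atomic_store(&ch->reading, pair);
    int index = atomic_load(&ch->slot[pair]);
    return ch->data[pair][index];
}
```

Neither function loops or waits, which reflects the property that reader and writer never block each other; a value written twice before a read is simply overwritten, matching the freshness-over-history trade-off described above.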
4.2.3 Support for Multicore Architectures
The uptrend of building cyber-physical applications on multicore architectures brings new
challenges. In Qduino, if a loop’s CPU affinity and interrupt routing are managed by the
underlying kernel, it is possible that two or more time-critical loops are assigned to the
same core that handles frequent interrupts, while a non-real-time task occupies another
core with only occasional hardware interrupts. This motivated us to provide a set of APIs
that allow developers to better specify application-level timing requirements on multicore
architectures.
The loop function takes an optional fourth parameter, which restricts the loop to a
specific core. This way, loops with high-precision timing requirements can be assigned
to dedicated cores, to avoid scheduling and context-switching overheads between separate
threads. SBCs such as the Intel Aero board have four cores, with the likelihood that this
will increase further on future SBCs. It is therefore not impractical to dedicate a core to
high-precision, time-critical tasks such as those operating at the nano- and micro-second
resolution.
Qduino also allows I/O device interrupts to be routed to a specified core. The
interruptsVcpu function accepts an optional fourth parameter to specify which core is
associated with the threaded interrupt handler. To further isolate a core, hardware interrupts
of a specified I/O device can be disabled on a specified core by calling noInterrupts, which
internally modifies the I/O APIC registers to mask that core from the delivery destination
of the interrupt line of the specified device. noTimer is used to disable the local APIC
timer interrupts on a specified core. This is useful when a real-time task must not be
preempted to let another task on the same core execute. As a consequence, all
scheduling and context-switching overheads are eliminated.
Lastly, Qduino adds two new time functions, delayBusyNanoseconds() and
delayBusyMicroseconds(). These read the time stamp counter and wait until the specified
time instant has passed. They are intended to be used for time-driven event loops.
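The busy-wait pattern behind such delay functions can be sketched in portable C. The example below assumes a POSIX system and uses clock_gettime(CLOCK_MONOTONIC) as a stand-in for the time stamp counter; the function names are hypothetical and this is not the Qduino implementation.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L
#endif
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Current monotonic time in nanoseconds (stand-in for reading the TSC). */
uint64_t now_ns(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc;
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Spin until at least ns nanoseconds have elapsed, without sleeping or
 * yielding the CPU. */
void busy_delay_ns(uint64_t ns)
{
    uint64_t deadline = now_ns() + ns;
    while (now_ns() < deadline)
        ;   /* busy wait */
}
```

Because the caller never blocks, the delay is not subject to wake-up latency, at the cost of occupying the core for the whole interval; this is why such loops pair naturally with a dedicated core.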
4.3 Micro-benchmarks
We conducted a series of experiments to investigate the performance of the Qduino
extensions. All experiments used a first generation Intel Galileo board. As a reference,
we compared Qduino to Clanton Linux 3.8.7, which is shipped with the Intel Galileo board.
The Linux sketches are created and uploaded with the Intel Arduino IDE v1.0.0. Sketches
running on Qduino are built using Quest’s toolchain and loaded through the Qduino shell.
Quest’s toolchain is based on GCC 4.7.2, with the same optimization flag (-Os) as the Intel
Arduino IDE. All clock cycle timing measurements used the Quark processor’s Time Stamp
Counter.
4.3.1 Multithreaded Sketch
We constructed a sketch with a mixture of CPU and I/O operations. For the CPU workload,
we constructed a findPrime benchmark that calculates all prime numbers smaller
than 80000. For I/O, we issued 2000 digitalWrite() requests. In the single-loop
version, both CPU and I/O operations are combined in one loop() function. In the
multithreaded version, we have two loops: one runs findPrime and the other issues the
digitalWrite() requests. Table 4.3 lists all four experimental cases.
Case #    Description
Case 1    single-loop digitalWrite()
Case 2    single-loop findPrime
Case 3    single-loop digitalWrite()+findPrime
Case 4    multi-loop digitalWrite()+findPrime
Table 4.3: Case Descriptions
We conducted this group of experiments on both Qduino and Clanton Linux. For Qduino, the
single-loop case uses a Main VCPU with C = 498 ms and T = 500 ms¹. In the multi-loop
version, we established a Main VCPU with C = 495 ms and T = 500 ms to run findPrime.
For the I/O operations, we assigned an I/O VCPU with a 3/500 fraction of CPU time. In
both cases, the leftover CPU time is reserved for the shell so that the sketch can be loaded.
When running on Clanton, the multithreaded sketch uses the Pthread library.
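For readers who want a concrete picture of the CPU workload, a prime-counting routine of the kind described can be sketched as below. The source of findPrime is not listed here, so the trial-division approach, the function name and the choice to return a count are all assumptions made for illustration.

```c
#include <assert.h>

/* Count the primes strictly below limit using trial division.
 * This is a CPU-bound stand-in for the findPrime benchmark. */
int find_prime(int limit)
{
    assert(limit >= 0);
    int count = 0;
    for (int n = 2; n < limit; n++) {
        int is_prime = 1;
        for (int d = 2; d * d <= n; d++) {
            if (n % d == 0) {
                is_prime = 0;
                break;
            }
        }
        count += is_prime;
    }
    return count;
}
```

Calling find_prime(80000) would correspond to the search range used in the experiment; the routine performs no I/O, so its running time depends only on the CPU share it receives.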
Results in Figure 4.2 show that the multithreaded sketch achieves an approximately 28%
performance increase over the single-loop version on Clanton Linux, and a 31% increase
over the single-loop version on Qduino. The multithreaded sketches are both only slightly
slower than running findPrime alone. This is because digitalWrite() spends most
of its time blocking on I/O commands from the I2C bus.
¹ Unless stated otherwise, all VCPU parameters are in milliseconds.
Figure 4.2: Multithreaded Sketch Benchmarks
[Bar chart. Y-axis: CPU Cycles (x10^9). Groups: Case 1, Case 2, Case 3, Case 4. Series: Clanton, Qduino.]
4.3.2 Timing Guarantee
User-specified timing requirements of Qduino loops are guaranteed by temporally isolating
a loop from other loops and from asynchronous system events, e.g., interrupts. We conducted
another set of experiments to verify temporal isolation.
We first wrote a sketch with 3 loops, each running findPrime with different VCPU
parameters, as shown in Table 4.4. We then split the 3-loop sketch into three single-loop
sketches, each containing one of the three loops. We ran each single-loop
sketch with the same VCPU parameters it used in the 3-loop version. We compared each
loop’s execution time in the 3-loop sketch to that in the corresponding single-loop sketch
(averaged over 5 runs). The results in Figure 4.3 show that Qduino maintains temporal
isolation between loops.
Loop #             Loop 1    Loop 2    Loop 3
VCPU parameters    40/100    20/100    10/100
Table 4.4: VCPU Parameters
Figure 4.3: Temporal Isolation between Loops
[Bar chart. Y-axis: CPU Cycles (x10^9). Groups: Loop 1, Loop 2, Loop 3. Series: single-loop, 3-loop.]
In a further experiment, we investigated the use of I/O VCPUs in Qduino. We used
a similar setup to the predictable event delivery experiment, except that Board A toggles
pin 13 repeatedly, without any delay, and interrupts generated on Board B’s pin 2 thus have
a frequency of 220Hz. On Board B, we ran findPrime using a Main VCPU with
parameters 70 ms/100 ms. The sketch also attaches an interrupt handler to pin 2, which
counts the number of interrupts received during the execution of findPrime. Five cases
were studied, all using Qduino, as shown in Table 4.5. I/O VCPU parameters are adjusted
via the interruptsVcpu() function before the interrupt handler is attached. Case 1
serves as the base case to show the execution time when external interrupts are disabled².
Case 5 is a special case, using a kernel-level interrupt handler, which is not associated
with any I/O VCPU. This case is intended to show the interrupt processing interference
when an I/O VCPU is not used. Results are presented in Figure 4.4.
Case #    I/O VCPU    External Interrupts
Case 1    10/100      OFF
Case 2    0/100       ON
Case 3    5/100       ON
Case 4    10/100      ON
Case 5    Disabled    ON
Table 4.5: Experimental Setup
For each cluster of bars, the bar on the left shows the execution time of the loop in CPU
cycles. As can be seen, when the I/O VCPU is enabled, the loop has approximately the same
execution time as in the base case where external interrupts are turned off, thereby
demonstrating the expected temporal isolation between loops and interrupts. The bar on the
right represents the number of interrupts received. Though the loop has guaranteed execution
time whatever I/O VCPU parameters are used, the number of interrupts received varies
accordingly. The larger the I/O VCPU budget is, the more interrupts the sketch receives.
This demonstrates that interrupts can be flexibly controlled by the budget of the I/O VCPU.
Users can effectively disable external interrupt delivery by setting the I/O VCPU budget to
0, as shown in Case 2.
² We disabled interrupts by removing the wire that connects the two boards.
We performed the same experiment with Clanton Linux. In this case, findPrime’s
performance degrades to 30.4% of its peak value while 2402 interrupts are received. For
comparison, we divided the CPU cycles between the Main VCPU and the I/O VCPU in
Qduino to achieve a similar performance drop for findPrime. We found that with a
40mS/100mS I/O VCPU and a 48mS/100mS Main VCPU, 5369 interrupts are
received while the performance drop is about 34%.
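The budget/period pairs quoted above can be read as fractions of a processor. The following is a minimal sketch of that arithmetic, assuming both VCPUs are served by the same core; the names `struct vcpu`, `vcpu_utilization` and `total_utilization` are hypothetical and are not part of the Qduino API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of a VCPU: budget C and period T, both in ms. */
struct vcpu {
    double budget_ms;
    double period_ms;
};

/* Fraction of one core reserved for a VCPU, i.e. C / T. */
static double vcpu_utilization(struct vcpu v)
{
    assert(v.period_ms > 0.0 && v.budget_ms >= 0.0);
    return v.budget_ms / v.period_ms;
}

/* Sum of the reservations of n VCPUs that share a core. */
static double total_utilization(const struct vcpu *v, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += vcpu_utilization(v[i]);
    return sum;
}
```

With the 40mS/100mS I/O VCPU and the 48mS/100mS Main VCPU from the comparison above, the two reservations add up to 88% of the core.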
[Plot: bars for Case1 to Case5; axes: CPU Cycles (x10^9) and Counts (x1000); series: CPU Cycles, Interrupts Handled]
Figure 4.4: Temporal Isolation between Loops and Interrupts
4.4 A Real-world Application
We further evaluate Qduino’s applicability and effectiveness by using it to build a real-
world application, a web-connected 3D printer, on an Intel MinnowMax board. The design
and implementation of the 3D printer are elaborated in the next chapter.
Chapter 5
A Web-connected 3D Printer Case Study
Continuing from the previous chapter, this chapter further evaluates the utility of Qduino
by applying it to a specific application: a web-connected 3D printer. Along the way, we also
investigate and analyze the timing requirements of 3D printers in general. We believe a
3D printer is an ideal candidate for demonstration, for the following reasons:
• Qduino is a platform that allows application developers to explicitly control CPU
and I/O resources. Building a web-connected 3D printer provides the opportunity
to exhibit these features. 3D printers have stringent timing requirements, as de-
scribed in Section 5.2. Adding web-connectivity to a 3D printer poses challenges
to application designers to balance system resources among several types of tasks:
those performing latency-sensitive (signaling I/O operations to drive stepper mo-
tors) versus high-bandwidth (network I/O operations to receive printing requests)
I/O operations, and those involved in compute-intensive G-code processing.
• A 3D printer performs intensive I/O operations and thus serves as a stress test on the
efficiency of Qduino I/O APIs.
• Many consumer-grade 3D printers are built upon Arduino-based hardware and soft-
ware. It is natural for Qduino, being compatible with the Arduino API, to refactor
an application that is originally implemented in Arduino.
5.1 Web-connected 3D Printers
Traditional 3D printers are tethered to a PC via a USB link, to receive manual configura-
tions or low-level commands (G-codes 1) for one printing job. In contrast, a web-connected
3D printer is intended to provide an interface to interact with either end users or other print-
ers on the web. It is also capable of spooling multiple job requests, which is a common
feature of 2D printers. Web-connected 3D printers form the basis for more sophisticated
farms of manufacturing devices that coordinate their operation.
Many current consumer-grade 3D printers are based on relatively simple microcon-
trollers such as the Arduino Mega. These controllers incorporate several stepper motor
drivers, general-purpose I/O pins (GPIOs) and analog-to-digital converters (ADCs), for
motion, extruder, heat-bed and fan control. Examples include the RepRap printing plat-
forms that run Arduino-based firmware such as Marlin. While these platforms provide
low-latency I/O with sensors and actuators, they are unable to run a real-time printing con-
trol program along with web services, to interoperate with other devices or remote users.
Our web-connected 3D printer is based on a Printrbot Simple Metal, with the tradi-
tional Arduino Mega controller being replaced by an Intel MinnowMax SBC running the
Marlin firmware reimplemented in Qduino.
5.2 Timing Requirements of a 3D Printer
3D printers are classified by the filament materials they use and the way layers of filament
are deposited to create parts. Most of the 3D printers in the consumer market utilize fused
deposition modeling (FDM), which produces objects by extruding small beads of material
that harden immediately to form layers on a movable bed. These printers have 3 axes of
motion, with stepper motors to control the bed’s movement along the X and Y axes, and
the extruder’s movement along the Z axis (Figure 5.1). An additional stepper motor feeds
the filament through the extruder to maintain the correct bead deposition rate.

1 G-code is a numerical control programming language for computer-aided manufacturing.
Figure 5.1: A Conceptual View of a 3D Printer
Motor Control in 3D Printers. In a 3D printer, stepper motors combine with linear
motion systems, such as pulleys and belts, to control the position and speed of the bed
and extruder. Stepper motors are able to make precise angular movements. A stepper
motor divides its full rotation into a number of equal steps. A microcontroller and driver
circuit command the motor to rotate and hold at one of these steps without a feedback
sensor. Motor steps are triggered by a series of digital pulses sent to a GPIO pin, which
feeds a driver circuit (e.g., Pololu 4988) that produces directional current within the motor
windings. Electromagnets are switched on and off to attract teeth on the motor’s shaft,
causing it to rotate. This makes the shaft rotate one step for each pulse. Higher rotational
accuracy is achieved by using driver circuits that generate microsteps. Similarly, rotational
speed is affected by the frequency at which digital pulses are generated.
Control of the stepper motor’s acceleration and deceleration is needed to ensure smooth
linear movement. The time delay between the stepper motor pulses must be calculated,
so that the motor’s operation follows the trapezoidal speed ramp (example shown in Fig-
ure 5.2) as closely as possible.
Figure 5.2: An Example Speed Ramp for a Stepper Motor
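To make the delay calculation concrete, the following is a minimal sketch of one way to derive inter-pulse delays for a symmetric trapezoidal ramp. It is a simplified illustration and not Marlin's planner: the function name `ramp_delays`, its parameters, and the per-step update dv = a · dt are assumptions made for this example. Speeds are in steps per second and acceleration in steps per second squared.

```c
#include <assert.h>

/*
 * Fill delay_s[0..total-1] with the time (in seconds) to wait before each
 * pulse. The speed starts at v0, grows by accel * dt after every pulse,
 * is capped at vmax (cruise phase), and the deceleration side mirrors the
 * acceleration side.
 */
static void ramp_delays(double *delay_s, long total,
                        double v0, double vmax, double accel)
{
    assert(delay_s && total > 0);
    assert(v0 > 0.0 && vmax >= v0 && accel > 0.0);

    double v = v0;
    for (long i = 0, j = total - 1; i <= j; i++, j--) {
        double d = 1.0 / v;   /* time until the next pulse at speed v */
        delay_s[i] = d;       /* acceleration side of the ramp */
        delay_s[j] = d;       /* mirrored deceleration side */
        v += accel * d;       /* dv = a * dt */
        if (v > vmax)
            v = vmax;         /* flat top of the trapezoid */
    }
}
```

If a move is too short to reach `vmax`, the same loop produces a triangular profile, since the two sides meet before the cap is applied.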
Timing Requirements. For a better understanding of the timing issues, we use a
Printrbot Simple Metal 3D printer as an example. The Printrbot Simple Metal uses one of
its four NEMA (National Electrical Manufacturers Association) 17 stepper motors with a
belt driven system to move the bed along the X axis. The NEMA 17 has 200 full steps
per revolution, denoted as s. Each step is divided into α = 16 microsteps, to achieve more
precise motor rotation with less vibration. Thus, s · α = 200 × 16 = 3200 digital pulses
are needed per revolution. The Printrbot uses a GT2 belt driven system, with a belt tooth
pitch, b = 2 mm, and pulley tooth count, p = 20. This means one rotation of the stepper
motor will be translated to b · p = 2 × 20 = 40 mm linear movement of the bed. For Θ
radians of rotation, the relationship between linear motion speed, v, and pulse frequency,
f , is as follows:
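\[
v = \frac{\Delta S}{\Delta t}
  = \frac{\frac{\Delta\Theta}{2\pi} \cdot b \cdot p}{\Delta t}
  = \frac{\frac{\Delta t \cdot f \cdot \frac{2\pi}{s\alpha}}{2\pi} \cdot b \cdot p}{\Delta t}
  = \frac{bp}{s\alpha} f \tag{5.1}
\]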
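Equation 5.1 can be evaluated directly for a given axis. The sketch below is only an illustration; the struct and function names are hypothetical, while the parameter values used in the note that follows are the Printrbot figures quoted above (s = 200, α = 16, b = 2 mm, p = 20).

```c
#include <assert.h>

/* Hypothetical container for the drive parameters of one axis. */
struct axis_drive {
    double full_steps_per_rev;  /* s */
    double microsteps;          /* alpha */
    double belt_pitch_mm;       /* b */
    double pulley_teeth;        /* p */
};

/* Equation 5.1: v = (b * p) / (s * alpha) * f, in mm/s for f in Hz. */
static double linear_speed_mm_s(struct axis_drive a, double pulse_hz)
{
    assert(a.full_steps_per_rev > 0.0 && a.microsteps > 0.0);
    return (a.belt_pitch_mm * a.pulley_teeth) /
           (a.full_steps_per_rev * a.microsteps) * pulse_hz;
}
```

For these parameters each pulse moves the bed 40/3200 = 0.0125 mm, so a 10 kHz pulse train corresponds to 125 mm/s.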
After the extruder is heated, we assume melted filament continuously flows out of the
extruder tip hole at speed γ when material needs to be deposited. Within a time period
of ∆t, using an extruder tip with diameter d, filament of volume
$\Delta V = \gamma \Delta t \cdot \left(\frac{d}{2}\right)^2 \pi$ will
be extruded. We assume that during this process, the pulse train frequency is steady at f.
Using Equation 5.1, the relative displacement of the bed to the extruder is
$\Delta S = v \Delta t = \frac{bp}{s\alpha} f \Delta t$. So filament of volume ∆V is
distributed on a rectangle of area ∆A = ∆S · d. This
reveals the relationship between layer thickness, H, and pulse frequency, f:
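\[
H = \frac{\Delta V}{\Delta A}
  = \frac{\gamma \Delta t \cdot \left(\frac{d}{2}\right)^2 \pi}{\frac{bp}{s\alpha} f \Delta t \cdot d}
  = K \cdot \frac{1}{f} \tag{5.2}
\]
where $K = \frac{\pi}{4} \cdot \frac{s \alpha d}{bp} \cdot \gamma$.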
Equation 5.2 indicates that, for a given printer configuration and filament material, an
unstable pulse train will cause uneven thickness of a printed layer. If the task that generates
the pulse experiences a delay δt, due to temporal interference from other tasks or delays
within the control system, the pulse frequency will drop to f ′ = 11
f
+δt
. According to
Equation 5.2, this will result in a vertical gap of height ∆H ,
∆H = K(
1
f ′ −
1
f
) = Kδt (5.3)
between adjacent layers and therefore undermine the structural soundness of the printed
object. This example is illustrated in Figure 5.3 where the delay δt happens at the (n+1)th
step.
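Equations 5.2 and 5.3 can be checked numerically. The following sketch assumes nothing beyond the two equations; the function names are hypothetical, and any tip diameter d or flow speed γ passed in is the caller's assumption rather than a value measured in this work.

```c
#include <assert.h>

#define PI 3.14159265358979323846

/* K = (pi / 4) * (s * alpha * d) / (b * p) * gamma, from Equation 5.2. */
static double layer_constant(double s, double alpha, double d_mm,
                             double b_mm, double p, double gamma_mm_s)
{
    assert(b_mm > 0.0 && p > 0.0);
    return (PI / 4.0) * (s * alpha * d_mm) / (b_mm * p) * gamma_mm_s;
}

/* Equation 5.2: H = K / f. */
static double layer_thickness(double k, double f_hz)
{
    assert(f_hz > 0.0);
    return k / f_hz;
}

/* Frequency after each pulse is delayed by delay_s: f' = 1 / (1/f + dt). */
static double delayed_frequency(double f_hz, double delay_s)
{
    assert(f_hz > 0.0 && delay_s >= 0.0);
    return 1.0 / (1.0 / f_hz + delay_s);
}

/* Equation 5.3: gap = K * (1/f' - 1/f), which simplifies to K * dt. */
static double layer_gap(double k, double f_hz, double delay_s)
{
    return layer_thickness(k, delayed_frequency(f_hz, delay_s)) -
           layer_thickness(k, f_hz);
}
```

Because 1/f' − 1/f equals δt by construction, the gap depends only on K and the delay, not on the nominal pulse frequency.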
Figure 5.3: Structural Deficiency due to Unstable Pulse Train
5.3 Implementations
5.3.1 Challenges
A 3D printer is required to translate G-codes into peripheral I/O control sequences. More
advanced features might include running a local compute-intensive slicer engine to trans-
late 3D modeling files into 2D sequences of G-codes, for each layer of a print job. At
the same time, time-critical I/O signaling operations are necessary to drive stepper mo-
tors, fans and heaters. For a high volume of remote requests, interrupts from the disk
and the network interface controller start to consume a large portion of CPU cycles. This
poses challenges to balance system resources among several types of tasks: those perform-
ing latency-sensitive versus high-bandwidth I/O operations, and those involved in G-code
processing.
5.3.2 Implementation on Linux
This section describes the web-connected 3D printer we built on Linux, including both
hardware and software setup. It then identifies the problems we observed during building
and testing this prototype.
Hardware. While we reused the mechanical parts of the Printrbot Simple Metal 3D
printer, we replaced the original Printrboard controller with our custom controller. Most
traditional 3D printers, including the Printrbot, are equipped with the AVR ATmega mi-
crocontroller, operating at speeds up to 20 MHz, and run firmware on the bare metal. In
such an environment, however, it is difficult if not impossible to run a webserver in parallel
with the printing control software. Therefore, our custom 3D printer controller is based on
the much more powerful Intel MinnowMax board. The MinnowMax is equipped with a
64-bit dual-core Atom 1.33 GHz processor, 2 GB memory and 86 GPIOs, of which some
are configurable as I2C, SPI, UART and PWM pins.
We interfaced the MinnowMax to a RepRap Arduino Mega Pololu Shield (RAMPS)
and various analog circuits to level-shift the 3.3V GPIO pins to appropriate values, to
control the Printrbot’s motors, fan and extruder heater. Our custom controller uses four
Pololu 4988 stepper motor drivers and an ADS7828 8-channel 12-bit I2C analog-to-digital
converter (ADC), to monitor extruder temperature readings.
Software. We ran Yocto Linux 4.4.13 with the PREEMPT_RT patch on the MinnowMax.
A lighttpd daemon runs in the background to receive remotely submitted printing
jobs, and a custom spooler queues and feeds jobs to the printing control program, which
is a customized version of Marlin.
Marlin is a firmware for RepRap single-processor electronics, supporting RAMPS,
RAMBo, Ultimaker, BQ, and several other Arduino-based 3D printers. It has one Arduino
loop function and two interrupt handlers for a pair of hardware timers, illustrated in Fig-
ure 5.4. In each iteration of the loop, a G-code is read from the serial bus and is processed.
Though there are hundreds of different G-codes, the most common are G0 and G1, which
specify the speed and position for the extruder to move. Positional coordinates are then
translated into each stepper motor’s direction, number of steps and angular speed (accord-
ing to the speed ramp). This is the core algorithm of Marlin and consumes the most CPU
cycles. As the last stage, the loop packages the stepper motor motion parameters into a
data block and adds it to a finite capacity queue, after which the main loop repeats.
The block queue is shared between the loop and an interrupt handler, which is bound to
a hardware timer in one-shot mode. When the timer fires, the interrupt handler is invoked
to check the remaining steps to execute on each axis, and their respective rate. It then,
accordingly, sends one pulse to each stepper motor that should be driven and programs the
timer for when the next pulse should occur. When the handler completes all the steps in a
block, it consumes a new one from the queue if it is not empty.
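The producer/consumer relationship between the loop and the stepper handler can be sketched as a fixed-size ring buffer. This is an illustrative sketch rather than Marlin's actual data structure: the capacity, the fields of `struct block`, and the function names are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define BLOCK_QUEUE_SIZE 16   /* hypothetical capacity */

/* Hypothetical motion block: per-axis step counts, directions and a rate. */
struct block {
    long steps[4];
    int dir[4];
    double rate;
};

struct block_queue {
    struct block slots[BLOCK_QUEUE_SIZE];
    size_t head;   /* next slot the loop (producer) writes */
    size_t tail;   /* next slot the handler (consumer) reads */
};

/* Called from the loop; returns false when the queue is full. */
static bool queue_push(struct block_queue *q, const struct block *b)
{
    assert(q && b);
    size_t next = (q->head + 1) % BLOCK_QUEUE_SIZE;
    if (next == q->tail)
        return false;
    q->slots[q->head] = *b;
    q->head = next;
    return true;
}

/* Called from the stepper handler; returns false when the queue is empty. */
static bool queue_pop(struct block_queue *q, struct block *out)
{
    assert(q && out);
    if (q->tail == q->head)
        return false;
    *out = q->slots[q->tail];
    q->tail = (q->tail + 1) % BLOCK_QUEUE_SIZE;
    return true;
}
```

One slot is left unused so that a full queue can be told apart from an empty one. On real hardware the indices would additionally need to be protected against concurrent access, for example by masking the timer interrupt around the update or by locking.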
Marlin has a second interrupt handler bound to a hardware timer in periodic mode.
Every 8 milliseconds it samples the extruder temperature, and adjusts the fan speed and
heater settings using PID feedback control. Typical PLA plastic filament, for example,
requires a temperature of about 208-210 degrees Celsius, and the PID controller is used to
keep the target and actual temperature within a small error margin.
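For readers unfamiliar with PID feedback control, one update step can be sketched as follows. This is a textbook form and not Marlin's implementation; the struct layout, the gains and the output limits are hypothetical, and anti-windup handling is deliberately left out.

```c
#include <assert.h>

/* Hypothetical controller state; gains and limits are chosen by the caller. */
struct pid {
    double kp, ki, kd;
    double integral;
    double prev_error;
    double out_min, out_max;   /* e.g. heater PWM range */
};

/* One control step: returns the clamped actuator output. */
static double pid_step(struct pid *c, double setpoint, double measured,
                       double dt_s)
{
    assert(c && dt_s > 0.0);

    double error = setpoint - measured;
    c->integral += error * dt_s;                        /* accumulated error */
    double derivative = (error - c->prev_error) / dt_s; /* error slope */
    c->prev_error = error;

    double out = c->kp * error + c->ki * c->integral + c->kd * derivative;
    if (out > c->out_max)
        out = c->out_max;
    if (out < c->out_min)
        out = c->out_min;
    return out;
}
```

With the 8 ms sampling period described above, `dt_s` would be 0.008 on every call.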
We ported Marlin to run as a multithreaded Linux application. The loop function
and the two interrupt handlers are each converted to run as an individual thread. Proper
locking is applied to data shared among threads. GPIO and I2C operations are performed
via the mraa library from Intel’s IoT Dev Kit, to replace AVR I/O instructions. Instead of
programming hardware timers directly, we rely on nanosleep to perform timed operations.
We also refactored the program structure by moving the PID controller to the thread that
runs the original Timer2 interrupt handler, from the thread that runs the original loop
function. In the original Marlin firmware, the PID control code is in the loop function,
because its execution takes a relatively long time and renders the system unresponsive if
invoked in an interrupt handler. The new Marlin application-level code is optimized for
the real-time preemption patch by carefully setting its scheduling priority, and locking all
pages into memory [McK08].
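The priority and memory-locking steps can be sketched with standard POSIX calls. The helper name `enter_realtime` is hypothetical, and the sketch sets the policy for the whole calling process; a multithreaded port would more likely set per-thread priorities, which is not shown here.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sched.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Switch the calling process to SCHED_FIFO at the given priority and lock
 * all current and future pages into memory.
 * Returns 0 on success, or -1 if either call fails (errno is set).
 */
static int enter_realtime(int priority)
{
    struct sched_param sp;

    memset(&sp, 0, sizeof sp);
    sp.sched_priority = priority;
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
        return -1;
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
        return -1;
    return 0;
}
```

Both calls normally require elevated privileges, so the return value should be checked before the timing-critical threads are started.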
Observations. We carefully tuned the 3D printer and it successfully printed out our
test objects, with the webserver disabled during printing. We then enabled the webserver
and started submitting G-code files while printing was active. We used a script to
automatically generate submission requests.
The submission intervals were random numbers uniformly selected from 100 to 10000
milliseconds. The file sizes were randomly selected from a log-uniform distribution,
ranging from 50 KB to 150 MB 2. Each file was then submitted to the printer via an HTTP
POST request. We were able to observe evident jitter of the extruder relative to the bed.
Backlash noise from the stepper motor gearing and belt drive was evident for an active job
when a file larger than 15 MB was being transferred.

Figure 5.4: The Structure of Marlin
Case #    Description
Case 1    Webserver/Spooler Disabled. No Job Submitted.
Case 2    Webserver/Spooler Enabled. Jobs Submitted from the Script.
Table 5.1: Case Descriptions
In order to quantify our observations, we created two micro-benchmarks to examine
the interference by the webserver and spooler on: (1) the timely execution of the tem-
perature control thread, and (2) the stepper motor control thread. In the first experiment,
we logged the temperature readings from the thermistor under the two cases in Table 5.1.
2 It is realistic to have a G-code file exceeding 100 MB, if the object to print is relatively large, requires
thin layers and dense infill.

[Plot: Temperature (Celsius), 190 to 215, versus Sample #, 0 to 250; series: Setpoint, Case 1, Case 2]
Figure 5.5: Temperature Readings
The sample period was set to 1 second. The extruder was heated to 209 degrees Celsius,
and kept at that value during printing. Results are plotted in Figure 5.5. It can be seen
that, despite the continuous reception of submitted files, the PID controller maintained the
temperature close to the setpoint in both cases. The 8 ms period of the temperature control
thread is large enough to hide delay variations caused by Linux.
In the second experiment, we measured the frequency of the pulse train that drives
the stepper motors. However, when a 3D printer is printing, the pulse train changes its
frequency in different G-code commands and phases in the speed ramp. It is hard to tell
if an observed frequency change is actually caused by interference. Therefore, we wrote a
micro-benchmark program to drive a stepper motor at a constant speed. Listing 5.1 shows
our benchmark code. Variable period was set to 100 microseconds, to generate a 10kHz
pulse. Using Equation 5.1, this yields an expected linear speed of 125 mm/s. When the
webserver is disabled, an oscilloscope showed a pulse wave of variable frequency, ranging
from 7.75 to 8.02 kHz, shown in Figure 5.6. After we started the script to submit jobs, the
frequency showed more fluctuation, even dropping to as low as 6.42 kHz.
Figure 5.6: Target 10kHz Pulse on Yocto Linux (Actually 7.591kHz)
Listing 5.1: Loop to Measure Frequency
struct timespec period =
    {.tv_sec = 0, .tv_nsec = 100000};
for (;;) {
    nanosleep(&period, NULL);
    /* write 1 to gpio6 */
    mraa_gpio_write(GPIO6, HIGH);
    /* write 0 to gpio6 */
    mraa_gpio_write(GPIO6, LOW);
}
Operating System Control. The second experiment reveals two problems we faced
when generating a precise pulse train on the Linux platform. First, the frequency range
is too low to drive the motors at their expected speed. Second, the stepper motor control
thread is subject to interference from other tasks, causing jerks in the stepper motor’s
rotation, due to backlash. In this section, we take a look into the causes of these two
problems.
We reran the benchmark shown in Listing 5.1 with the timestamp counter recorded
at various stages within the program and the kernel, to determine the root cause of these
costs. Results are shown in Table 5.2. All times are averaged over 1,000 samples, and
shown in nanoseconds (ns). The average time of one iteration is 126,422 ns, resulting in
a 7.91 kHz pulse train. Except for the times spent sleeping, and setting the GPIO value
register, the others are operating system overheads, which fall into four categories:
• nanosleep Kernel Crossing – from user space to kernel and back, which accounts for
∼0.5% of the execution time.
• The hrtimer Subsystem – for high-resolution kernel timers. The nanosleep call uses
the hrtimer subsystem to access the highest precision hardware timing mechanism, either
local APIC timers or the high precision event timer (HPET).
• Context Switch Overhead – to put a task to sleep, run the scheduler, and wake up the next
task.
• The GPIO Framework – which interfaces GPIO pins with user-space programs using
sysfs. This includes the general GPIO framework (gpiolib) and the actual GPIO con-
troller driver for the Minnow MAX (the BayTrail GPIO driver). This accounts for
roughly 12.5% of the execution time.
Source                        Cost (ns)    Percentage
nanosleep Kernel Crossing           587          0.5%
Timer Framework                    2420          1.9%
In sleep                         100000         79.1%
Context Switch                     7744          6.1%
sysfs Framework                   10592          8.4%
gpiolib Framework                  1038          0.8%
GPIO Controller Driver             4146          3.3%
Total                            126422          100%
Table 5.2: Kernel Overheads (in nanoseconds)
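The figures quoted from Table 5.2 follow from simple arithmetic, sketched below. The helper names are hypothetical; the inputs used in the note are the values from the table.

```c
#include <assert.h>

/* Pulse frequency in kHz for one pulse per loop iteration of the given
 * length: 1e9 ns/s divided by the iteration time, then Hz to kHz. */
static double pulse_frequency_khz(double iteration_ns)
{
    assert(iteration_ns > 0.0);
    return 1.0e6 / iteration_ns;
}

/* Share of the iteration time taken by one component, in percent. */
static double share_percent(double part_ns, double total_ns)
{
    assert(total_ns > 0.0);
    return 100.0 * part_ns / total_ns;
}
```

An iteration of 126,422 ns gives about 7.91 kHz, whereas the 100,000 ns sleep alone would give the 10 kHz target. The three GPIO rows (10592 + 1038 + 4146 ns) together account for roughly 12.5% of the iteration.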
GPIO Framework Overhead. As shown in Table 5.2, the biggest overhead stems
from GPIO operations. Intel development boards, such as Galileo, Edison, MinnowMax
and Up, all have PCI-based GPIO controllers with memory-mapped device registers. On
top of the GPIO controller drivers, Yocto Linux uses a uniform GPIO sysfs interface for
userspace. The sysfs interface provides a simple way to manually check the status of an
input pin or write values to output pins, from the command line or a shell script. However,
an autonomous control program suffers from this convenience, by paying the cost of going
through convoluted kernel control paths before reaching the actual GPIO driver to set the
value register, as shown in Figure 5.7. Instead, a simple userspace GPIO driver is all that
the control system really needs.

Figure 5.7: Call Graph for Listing 5.1
Context Switch Overhead. The nanosleep system call is invoked whenever a Linux
task wishes to relinquish the processor and sleep for a precise period of time. This call sets
a high-resolution timer (hrtimer) to fire at some future point in time, when the task will be
woken up and allowed to run again on the processor. As shown in Figure 5.7, nanosleep
programs the Local APIC timer and calls into the kernel scheduler. In a tickless kernel,
the next LAPIC timer interrupt will check the timer and call the scheduler to wake the
task up. The cost of going through the hrtimer subsystem, (re-)programming the LAPIC
timer, handling the timer interrupt and running the scheduler adds up to more than 10
microseconds.
Timing Unpredictability. Unlike disk and network interface controllers that use
direct-memory access (DMA) to transfer high-bandwidth (bulk) data, GPIOs are used
for latency- and jitter-sensitive control of sensors and actuators. As explained earlier, a
fluctuating pulse train results in jerky stepper motor movements in 3D printer control.
This is exacerbated by the lack of temporal isolation between Linux processes, and the
interference from interrupts on process execution.
Further Optimizations. We recognize that Linux can be further optimized for
latency-sensitive real-time applications, such as by using the cset utility to isolate a
CPU, making use of the SCHED_DEADLINE scheduling class, or enabling CONFIG_NO_HZ_FULL
to disable timer interrupts on certain cores. However, all these techniques involve system
utilities or kernel settings without a uniform programming interface. Significant kernel and
real-time expertise is required to take advantage of these features, which raises barriers
to embedded application development.
5.3.3 Implementation on Qduino
We reimplemented Marlin to take advantage of features in Qduino. The new Marlin
code maintained three loops, shown in Figure 5.8. The first loop performs the calculations
to translate G-code coordinates to the number of motor steps in each axis, as well as the
trapezoidal acceleration/deceleration profiles. The second one generates precisely timed
pulses on GPIOs to drive the stepper motors, and the third loop reads the extruder temper-
ature and adjusts the operation of the fan and heater. Apart from Marlin, a webserver and
a spooler ran natively on the Quest OS, sharing the default Main VCPU. Therefore, four
Main VCPUs existed in total in the system.
The second loop for the stepper motors had the most critical timing requirement. The
other tasks associated with different VCPUs had weaker real-time requirements. Thus, we
assigned Loop 2 to a dedicated core on which all unnecessary interrupts were disabled.
By calling noTimer(1) in the setup phase, the timer interrupt was disabled on Core 1 so
that Loop 2 was free from the interruption of the scheduler. Interrupts were generated
by the I2C controller as part of communication with the ADC, which recorded readings
from a thermistor to monitor extruder temperature. A call to noInterrupts(ALL, 1) was
used to suppress all interrupts, including I2C and NIC interrupts, on Core 1. The threaded
bottom halves of the I2C and NIC interrupt handlers, running on their respective I/O VCPUs,
were automatically pinned to Core 0 by calling noInterrupts(ALL, 1). Outside of the Marlin
sketch, the webserver and the spooler ran on the default Main VCPU, which was pinned
to Core 0. After these setup stages, Core 1 was able to run Loop 2 in isolation from all
other loops and interrupt events.

Figure 5.8: Marlin on Qduino
In addition, the GPIO controller on the MinnowMax is a PCI-based device with
memory-mapped registers. We wrote a user-mode GPIO driver by mapping the registers to a
memory region with user-level access permission. Qduino’s GPIO functions are wrap-
pers around the user-mode driver functions. This way, the kernel crossing overheads are
significantly reduced for loops that perform intensive GPIO operations.
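The core of such a user-mode driver is a plain load and store on a mapped register. The sketch below illustrates only that idea under stated assumptions: the bit position `GPIO_VALUE_BIT` and the function names are hypothetical and do not describe the actual register layout of the MinnowMax GPIO controller, and the step that maps the registers with user-level permission is not shown.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical position of the output-level bit in a pin's value register. */
#define GPIO_VALUE_BIT 0x1u

/* Drive a pin high or low by a read-modify-write of its mapped register.
 * No system call is involved once the register is mapped. */
static inline void gpio_reg_write(volatile uint32_t *value_reg, int level)
{
    assert(value_reg);
    uint32_t v = *value_reg;
    if (level)
        v |= GPIO_VALUE_BIT;
    else
        v &= ~GPIO_VALUE_BIT;
    *value_reg = v;
}

/* Read back the current level of the pin. */
static inline int gpio_reg_read(const volatile uint32_t *value_reg)
{
    assert(value_reg);
    return (*value_reg & GPIO_VALUE_BIT) != 0;
}
```

The `volatile` qualifier keeps the compiler from caching or eliding the register accesses, which matters when the same pin is toggled twice in quick succession to form a pulse.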
5.4 Evaluation
This section evaluates the implementation of the web-connected 3D printer on the Qduino
platform, in terms of its overheads, predictability and usability. Comparison tests involve
an unmodified off-the-shelf Printrbot 3D printer, with a Printrboard controller based on an
Atmel AT90USB1286, operating at 16 MHz. The Printrboard runs Marlin as firmware,
lacking the capabilities of an operating system with web connectivity. Thus, experiments
on the Printrbot are conducted without remote job submissions. Additional comparisons
are shown for Linux running on our custom MinnowMax controller.
Temperature Control. We measured the temperature during printing under the two
cases shown in Table 5.1. The results are plotted against those of the Printrboard. As
shown in Figure 5.9, the execution of the PID controller in Qduino Marlin does not suffer
interference by the webserver or the spooler. The PID controller maintains the temperature
close to the setpoint, as effectively as the Printrboard.
[Plot: Temperature (Celsius), 190 to 215, versus Sample #, 0 to 250; series: Setpoint, Case 1, Case 2, Printrboard]
Figure 5.9: Temperature Control
The 10 kHz Pulse Train. We also performed the 10 kHz pulse train experiment on
Qduino. The benchmark sketch is shown in Listing 5.2. Due to the hard real-time nature
of this single-loop sketch, we pinned the loop to Core 1 and disabled all interrupts on that
core in the setup phase. The C and T parameters were both set to 100, to indicate the
task’s dedication to the whole core. Oscilloscope readings yielded a pulse train with a
stable frequency of 9.569 kHz (Figure 5.10). We then enabled the webserver and spooler
in the default Main VCPU, which is pinned to Core 0 by default. After we started the script
described in Section 5.3.2 to submit jobs, the pulse frequency stayed at 9.569 kHz.

Figure 5.10: Target 10kHz Pulse on Qduino (Actually 9.569kHz)
Listing 5.2: The 10 kHz Pulse Train Sketch
int GPIO6 = 6;
void setup() {
    pinMode(GPIO6, OUTPUT);
    noInterrupts(ALL, 1);
}
void loop(1, 100, 100, 1) {
    delayBusyNanoseconds(100000);
    digitalWrite(GPIO6, 1);
    digitalWrite(GPIO6, 0);
}
The Printrboard has four hardware timers, which are programmable with a maximum
frequency as fast as the system clock. We modified the Marlin firmware to set the Timer1
prescaler to 8, resulting in a 2 MHz timer. We then put Timer1 in the Clear Timer on
Compare (CTC) mode and programmed the Output Compare Register to 200. Under this
setting, the timer incremented the counter by 1 every 500 nanoseconds. When the counter
reached 200, it fired an interrupt. The interrupt handler did nothing but generate a pulse on
the GPIO6 pin, and then reset Timer1. An oscilloscope showed a stable pulse frequency
of 9.96 kHz (Figure 5.11) on GPIO6, which was very close to the theoretical 10 kHz3.
However, this was without any background tasks running, and served as a reference for
the performance of all other scenarios.

Figure 5.11: 10kHz Pulse on the Printrboard (without a webserver)
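The timer settings above can be reproduced with a short calculation. The sketch follows the description in the text, in which an interrupt fires after 200 timer ticks; the function names are hypothetical.

```c
#include <assert.h>

/* Length of one timer tick in nanoseconds for a given clock and prescaler. */
static double timer_tick_ns(double cpu_hz, unsigned prescaler)
{
    assert(cpu_hz > 0.0 && prescaler > 0);
    return 1.0e9 * (double)prescaler / cpu_hz;
}

/* Interrupt rate in kHz when the handler fires every `ticks` timer ticks. */
static double compare_interrupt_khz(double cpu_hz, unsigned prescaler,
                                    unsigned ticks)
{
    assert(cpu_hz > 0.0 && prescaler > 0 && ticks > 0);
    return cpu_hz / ((double)prescaler * (double)ticks) / 1.0e3;
}
```

For a 16 MHz clock, a prescaler of 8 and 200 ticks, this gives a 500 ns tick and a 10 kHz interrupt rate, which is the theoretical value the measured 9.96 kHz is compared against.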
Experiments show that while the pulse frequency on Qduino was slightly lower than
the one on the Printrboard, it was more than 21% higher than the one on Yocto Linux.
More importantly, the pulse train on Qduino was more stable, especially when the system
load was high. While Qduino was able to maintain a pulse frequency close to that of the
Printrboard, it did so while handling background web requests.
3 The Printrboard uses 5V logic levels.

Figure 5.12: 3D Image STL File for Test Object

Example Test Object. We developed a 3D test object to work as a wheel encoder for
a mobile robot, as shown in Figure 5.12. We compared the performance of Qduino against
Linux to print the object, while our job submission script was actively communicating with
the webserver. Given the inability of Linux to maintain a high stepper pulse rate, the whole
print process ran at a slower speed to produce a successfully printed object, as shown in
Figure 5.13. The Linux object was still not printed to the same quality as with Qduino,
and took more than one hour to complete. With Linux, jitter and delays in stepper motor
speeds for both printer movement and extrusion caused jerky operation and fine strands of
unwanted filament being deposited. In contrast, Qduino printed the object (Figure 5.14)
in about thirty minutes, which was similar to the Printrboard without the ability to support
web requests. Qduino is able to empower a 3D printer with additional services without
impacting manufacturing time, which is critical in a production system.
Platform Usability. One of the goals of Qduino is to provide easy-to-use APIs for
application development. The pulse train generation sketch for Qduino in Listing 5.2
required only 10 lines, while the Yocto Linux code in Listing 5.1 needed 35 lines 4. We
also observed a 10% source code reduction in Marlin, with Linux Marlin having 3,674
lines of code, and Qduino Marlin having 3,342. The minimal system overheads incurred
by Qduino compared to those of the Printrboard firmware are compensated by its ability to
work in conjunction with a richer set of embedded application services (e.g., web services).

4 Scheduling priority settings, memory locking and error checking are not shown in Listing 5.1.

Figure 5.13: Linux    Figure 5.14: Qduino
Chapter 6
Platform Extensibility
In Chapters 3 and 4, we described the Qduino programming platform, which provides a
simple and clear interface to access I/O devices, spawn multiple tasks and specify the
timing requirements of those tasks. We have demonstrated the usability of the platform by
using it to build a web-connected 3D printer.
However, a 3D printer, even with web connectivity, is still a relatively dumb
cyber-physical device. It accomplishes its objectives by following pre-determined instructions
(i.e., G-code) and seldom responds to changes of its surroundings. Conversely, an in-
telligent application, such as an autonomous drone, is able to reason about and adapt to
changes in its surroundings, while accomplishing mission objectives without remote as-
sistance from a human being. This is often made possible by complex software stacks,
developed with many man-year engineering efforts. Examples are OpenCV for image
processing, or GPU-based machine learning algorithms.
It is infeasible for Qduino users to develop these features from scratch. Lack of sup-
port for legacy libraries, frameworks and device drivers, is becoming the biggest stum-
bling block in the adoption of Qduino. On the other hand, these features are common
on General-Purpose OSes (GPOS) such as Linux, benefiting from both open-source and
industrial support. However, a GPOS is known to be unsuitable for critical control pro-
grams due to its fairness-oriented resource management, multiple layers of indirection and
lengthy critical sections. In addition, a GPOS tends to be less reliable due to its larger
Trusted Computing Base (TCB) compared to specialized custom OSes [KEH+09]. On the
other hand, while Qduino is specifically designed for cyber-physical applications with its
For these reasons, we argue that an ideal programming platform for cyber-physical
applications should be the combination of the two worlds. On one side, a platform with
predictable timing behavior, simple and efficient programming interface, and small TCB,
hosts the critical control program. On the other side, a GPOS with rich support for libraries
and frameworks, hosts auxiliary applications that augment the application’s autonomy and
connectivity. To this end, this chapter describes the enhancements to Qduino’s extensibil-
ity so that legacy software components can be easily integrated.
6.1 Design Concerns
It is common in the embedded world to integrate a custom platform and a GPOS, by
accompanying the microcontroller running the custom OS with a more powerful com-
puting board or processor running the GPOS. The microcontrollers run either bare-metal,
firmware-like control programs, or custom OSes with control programs running as appli-
cations. And the computing boards or processors usually run Linux. For example, a recent,
WiFi-enabled 3D printer accompanies the traditional ATmega-based printer control board
with an ESP8266 module, which is a self-contained SoC with integrated TCP/IP protocol
stack. The Intel Aero Ready to Fly Drone [Int18] utilizes the STM32-based PX4 autopi-
lot for flight control, accompanied with an Intel Aero board for wireless communication,
image processing, etc.
While it is easy and straightforward to achieve extensibility using this strategy, an
extra board inevitably increases the weight of the whole system. For applications such
as drones, extra weight increases battery consumption and reduces flight time. On the
other hand, this weight problem can be alleviated by using an extra on-board processor
instead of a separate board. For instance, the Arduino Yún and Tian integrate an ATmega
processor and a more powerful ARM or MIPS processor on a single board. The added
powerful core runs Linux, which controls network communication. However, to the best
of the author’s knowledge, all current solutions connect the two computing entities with
a serial link, because the relatively weak processor or board is not able to use
high-speed buses such as USB or Ethernet. There are frequent complaints among users
about this serial bottleneck. Imagine the scenario where data is collected by a sensor on
the Linux side and fed to a control program on the microcontroller, via the serial link. The
latency and throughput of the data path are constrained by the serial link, regardless of the
potentially high-speed bus that the sensor is sitting on. In other words, the communication
overhead between the computing entities is prohibitively high due to the communication
link.
Multicore SoCs with shared main memory provide an opportunity to integrate legacy
systems and custom platforms on a single piece of hardware, with low-latency, high-
throughput communication. The shared main memory makes it possible for the two com-
puting entities to exchange data potentially close to the memory bus speed. However,
it also poses the challenge of protecting the memory region owned by one entity from the
other’s access. Specifically, misbehavior of a low-criticality service should not compromise
the behavior of one with higher criticality. Fortunately, the latest multicore SoCs
provide hardware-assisted virtualization that offers an extra layer of memory partitioning
and protection. By taking advantage of our in-house virtualized partitioning kernel called
Quest-V, we enable Qduino to be safely and securely extensible.
6.2 Architecture Overview
The system architecture is shown in Figure 6.1. Qduino takes over one or potentially
more cores, and Linux takes over the rest. The partitioning of the shared main memory
is enforced using page-table virtualization techniques such as Intel’s Extended Page Ta-
ble (EPT). Each OS has its dedicated memory partition, plus one memory region that is
mapped to both, serving as the pool for inter-OS communication channels.
Figure 6.1: Qduino Extension Architecture
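The partitioning rule can be pictured with a small software model. The sketch below is illustrative only and is not taken from Quest-V: the `region_t` and `sandbox_t` types, the `may_access` helper and all addresses are hypothetical, and in the real system the equivalent check is enforced by the hardware through EPT mappings rather than by code like this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a physical memory range [base, base + size). */
typedef struct {
    uint64_t base;
    uint64_t size;
} region_t;

/* Hypothetical sandbox view: one private partition plus the shared pool. */
typedef struct {
    region_t private_mem;
    region_t shared_pool;
} sandbox_t;

/* True if [addr, addr + len) lies entirely inside r.
 * Written so that addr + len is never computed and cannot overflow. */
static bool in_region(const region_t *r, uint64_t addr, uint64_t len)
{
    return len <= r->size && addr >= r->base && addr - r->base <= r->size - len;
}

/* An access is permitted only inside the sandbox's own partition or the
 * shared pool. Conceptually, EPT performs this check in hardware. */
bool may_access(const sandbox_t *sb, uint64_t addr, uint64_t len)
{
    assert(sb != NULL);
    return in_region(&sb->private_mem, addr, len) ||
           in_region(&sb->shared_pool, addr, len);
}
```

Under this model, two sandboxes that list the same `shared_pool` can both reach the communication pool, while neither can reach the other's private partition.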
6.2.1 Partitioning Hypervisor
The hypervisor is responsible for partitioning resources (CPUs, memory and devices) into
sandboxes and booting an OS in each sandbox. It also provides an interface to create communication
channels at runtime, between Qduino loops and extension services. The hypervisor
we used is our in-house partitioning hypervisor called Quest-V. It utilizes hardware virtu-
alization features to run multiple separate instances of operating systems. These instances
are isolated from each other using hardware virtualization extensions available on modern
processors. As opposed to a typical monitor, which multiplexes and emulates hardware
resources, the Quest-V monitor partitions and isolates resources. A subset of CPUs, mem-
ory and devices are statically assigned to each Quest-V sandbox. Therefore, there is no
virtual machine scheduling or device multiplexing that can introduce overheads or unpredictability.
During normal operation, a Quest-V sandbox does not perform VM-exits¹ and
therefore minimizes the cost of using the hardware virtualization extensions. Currently,
excluding forced exits, such as the CPUID instruction, the monitor is only entered to dynamically
set up shared communication channels. A more detailed description of Quest-V
can be found in West et al. [WLMD16].
6.2.2 Linux
An extension service OS can be any pre-existing GPOS, such as a UNIX-based system
with process address spaces, or a library OS having a single address space [EKO95b]. Our
system currently supports Linux for its rich libraries, frameworks and device drivers.
Linux is only able to see and access the physical memory regions and devices that are
assigned to it by the hypervisor. This way, failures in Linux are not able to compromise
the behavior of Qduino. The hardware Performance Monitoring Unit (PMU) is also virtualized
for Linux. Core-local performance counters are not exposed if they are being used by
Quest. Also, global performance events are made inaccessible. This hardens security iso-
lation between Qduino and Linux. For example, information leakage from side channels
based on PMU data is avoided [KJJ99].
¹ A VM-exit is the transition from a guest virtual machine to the virtual machine monitor.
6.3 Inter-sandbox Communication
Services from extensions are provided via inter-sandbox communication channels built on
shared memory, as shown in Figure 6.1. The Quest-V hypervisor updates EPT mappings
as necessary to establish message passing channels between the Qduino and Linux sandboxes.
Those channels are then selectively mapped into applications (i.e., Qduino
loops or Linux extensions) by each guest OS.
6.3.1 Asynchronous Message Passing
Data flow from Linux extensions to Qduino loops is necessary. Using an autonomous
drone as an example, camera and WiFi, given their low criticality levels and complex
software stacks, are assigned to the Linux sandbox. The camera captures pictures of the
surroundings and image processing algorithms subsequently translate pictures into control
commands for object tracking or obstacle avoidance purposes. The control commands are
then fed into the flight controller in Qduino. Similarly, remote control or configuration
commands received via WiFi need to be relayed to the Qduino sandbox. Because
Linux extensions typically function as a replacement for human operators, we believe
that control and configuration commands are the primary type of information that forms
the data flow from Linux to Qduino.
In some cases, commands should be received and executed sequentially, without skipping
any of them. For example, to lift off, control commands should gradually increase
the throttle until the drone leaves the ground. In other cases, only the latest command
should be executed and all the stale ones should be discarded (overwritten). This is true
for applications such as moving object tracking. The drone should always use the last
received position of the object as the flight destination. The desired communication pat-
tern differs between the two cases. In the first case, the Linux application should write to
the communication channel only when it is not full, in order to avoid overwriting pending
commands. Yet in the second case, it should immediately write into the channel whenever
it has new data. Nonetheless, in either case, Qduino loops on the receiving side should
never be required to synchronize with the Linux sender in order to communicate. This is
because the Linux CFS scheduler does not offer the level of predictability provided by the
VCPU scheduling framework in the Quest kernel. Otherwise, a Qduino loop might have to wait for an
indefinite period of time before data arrives, which would violate the loop’s timing
requirement. Consequently, it is also difficult to extend communication with end-to-end
timing requirements (Section 3.4) to the Linux sandbox.
For these reasons, we adopt asynchronous message passing as the communication
mechanism for data flow from Linux extensions to Qduino loops. An asynchronous communication
channel is created at the setup stage and exists until the owner application explicitly
destroys it before the application exits. The interface for data writing, only available
to Linux extensions, is designed to be configurable. Users can choose to overwrite pending
data or wait until the channel is clear.
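The two write policies can be contrasted with a minimal, single-threaded model. This is a sketch under stated assumptions, not the Qduino implementation: `demo_chan_t`, `demo_init`, `demo_try_write`, `demo_try_read` and `SLOT_SIZE` are hypothetical names, and the cross-core synchronization that a real shared-memory channel needs is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SLOT_SIZE 32 /* hypothetical fixed message size in bytes */

typedef struct {
    unsigned char data[SLOT_SIZE];
    bool pending;   /* true while a message is written but not yet read */
    bool overwrite; /* policy chosen when the channel is created */
} demo_chan_t;

void demo_init(demo_chan_t *c, bool overwrite)
{
    assert(c != NULL);
    memset(c->data, 0, sizeof c->data);
    c->pending = false;
    c->overwrite = overwrite;
}

/* Writer side. Returns true if the message was stored. Without the
 * overwrite policy, a pending message blocks the write and the caller
 * has to retry later; with it, the stale message is simply replaced. */
bool demo_try_write(demo_chan_t *c, const void *msg, size_t len)
{
    assert(len <= SLOT_SIZE);
    if (c->pending && !c->overwrite)
        return false;
    memcpy(c->data, msg, len);
    c->pending = true;
    return true;
}

/* Reader side. Never waits: returns false at once if nothing is pending. */
bool demo_try_read(demo_chan_t *c, void *out, size_t len)
{
    assert(len <= SLOT_SIZE);
    if (!c->pending)
        return false;
    memcpy(out, c->data, len);
    c->pending = false;
    return true;
}
```

With the overwrite policy disabled, a second write before a read is rejected, which suits command sequences such as a gradual throttle increase. With it enabled, the reader always sees the most recent message, which suits target tracking.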
6.3.2 Remote Procedure Call
The data flow from Qduino loops to Linux extensions is also necessary. Such data flow
is primarily driven by inquiries, issued by Linux extensions, for software or device status
in the Qduino sandbox. The returned status is most likely to be displayed through
a graphical user interface, or stored in remote or local storage for data analytics. These operations
usually do not have critical timing requirements and are triggered by sporadic
user interactions. For this reason, pre-allocating a communication channel for each status
that can potentially be queried wastes resources. It is also inefficient to allocate
communication channels on demand at runtime, especially since channel mapping involves
a costly hypervisor invocation. Consequently, the asynchronous communication mechanism
in Section 6.3.1 is not ideal in this case.
Instead, we argue that remote procedure call (RPC) is the most suitable mechanism
for data flow from Qduino loops to Linux extensions. Only a fixed number of channels
need to be created at the initialization stage, and no VM-exit is required at runtime to
create new channels. Apart from its memory and time efficiency, RPC can also be
extended with authentication [Sri95], if necessary.
At the setup stage, one channel is created to serve as the conduit for conveying RPC requests.
This channel is referred to as the request channel and is accessible to all applications.
On the Qduino side, a low-priority user-level daemon thread is spawned to periodically
poll the request channel for incoming requests. Upon receipt, it dispatches the RPC request
and later collects the result, which is written into a response channel. There
are three possible ways to create a response channel: 1) similar to the request channel, a
pre-allocated response channel shared by all the RPC entities; 2) a channel created by the
RPC sender, which also includes the channel’s identity in the RPC request; 3) one of the
channels in a pool that is pre-allocated by the Quest daemon thread. The first method re-
quires either a dedicated Linux kernel thread to periodically poll the response channel, or
an event handler to respond to the arrival of an RPC response. In either case, Linux needs
to be significantly paravirtualized. The second method requires the Quest daemon thread
to invoke the hypervisor to map the channel, which is included in the RPC request, into
its address space. This involves expensive VM-exits and thus should be avoided. The last
method is similar to the second one except that it transfers the overheads of VM-exiting
and channel mapping to the Linux sandbox. This is acceptable in our setup because Linux
extensions are less time-sensitive than Qduino loops.
6.3.3 User APIs
We provide two sets of APIs for Qduino and Linux, respectively, to communicate via
half-duplex channels. The APIs are listed in Table 6.1.
Function Name                                                      Availability
create_channel(chan_t *chan, size_t sz, int key, bool overwrite)   Both
write_channel(chan_t *chan, byte *dat, bool overwrite)             Linux Only
read_channel(chan_t *chan, byte *dat, bool overwrite)              Qduino Only
rpc_alloc(chan_t *chan)                                            Linux Only
rpc_free(chan_t *chan)                                             Linux Only
rpc_call(chan_t *chan, req_t *request, rsp_t *response)            Linux Only
rpc_poll()                                                         Qduino Only
Table 6.1: Arduino Standard API
Asynchronous Communication. create_channel invokes the hypervisor to obtain a
channel of the specified size, from the physical memory region shared by the Qduino and
Linux sandboxes. If the specified key matches that of an existing channel, the
matched channel is returned. This is used by callers to connect to an existing channel.
Otherwise, a new channel, tagged by the specified key, is created and returned through
the first parameter. If the overwrite flag is not set, data in this channel is not allowed
to be overwritten until the reader (i.e., a Qduino loop) marks it as consumed. For this
case, a single-slot buffer suffices. The current implementation polls a status bit associated
with the channel buffer until the channel becomes ready to write. Asynchronous event
notification mechanisms using Inter-processor Interrupts (IPIs) can be added to signal the
consumption of a data message. However, Linux needs to be further paravirtualized in
order to handle IPIs. On the other hand, if the overwrite flag is set, the writer and reader
are allowed to access the channel buffer without any synchronization. In order to avoid
data corruption, the channel buffer is implemented using the four-slot mechanism [Sim90]
when the overwrite flag is set. read_channel and write_channel read from and
write to the specified channel, respectively. The overwrite flag indicates whether the
four-slot mechanism is used internally.
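For readers unfamiliar with the four-slot mechanism, the following is a minimal sketch of Simpson's algorithm for one writer and one reader, expressed with C11 atomics. It is not the Qduino code: the `msg_t` payload and the `four_slot_t`, `fs_init`, `fs_write` and `fs_read` names are hypothetical, and the use of sequentially consistent C11 atomics for the control variables is an assumption about how the ordering could be obtained.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { int32_t x, y, z; } msg_t; /* hypothetical payload */

typedef struct {
    msg_t data[2][2];   /* two pairs of two slots */
    atomic_int slot[2]; /* most recently written slot in each pair */
    atomic_int latest;  /* pair that was written last */
    atomic_int reading; /* pair the reader has announced it is using */
} four_slot_t;

void fs_init(four_slot_t *c, msg_t initial)
{
    assert(c != NULL);
    for (int p = 0; p < 2; p++)
        for (int i = 0; i < 2; i++)
            c->data[p][i] = initial;
    atomic_init(&c->slot[0], 0);
    atomic_init(&c->slot[1], 0);
    atomic_init(&c->latest, 0);
    atomic_init(&c->reading, 0);
}

/* Writer: pick the pair the reader is not using, then the slot in that
 * pair that does not hold the last written value, and publish afterwards. */
void fs_write(four_slot_t *c, msg_t m)
{
    int pair = !atomic_load(&c->reading);
    int index = !atomic_load(&c->slot[pair]);
    c->data[pair][index] = m;
    atomic_store(&c->slot[pair], index);
    atomic_store(&c->latest, pair);
}

/* Reader: announce the pair being read before selecting the slot, so a
 * concurrent writer steers away from it. Neither side ever waits. */
msg_t fs_read(four_slot_t *c)
{
    int pair = atomic_load(&c->latest);
    atomic_store(&c->reading, pair);
    int index = atomic_load(&c->slot[pair]);
    return c->data[pair][index];
}
```

The design choice is that neither side blocks: the writer always finds a slot the reader is not touching, and the reader always obtains a complete, recently written message, at the cost of storing four copies of the payload.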
RPC. rpc_alloc invokes the hypervisor to find an available channel, from a pre-allocated
channel pool, to be used as the response channel. Each response channel is essentially
an asynchronous communication channel that prohibits overwriting, as described in Section 6.3.1.
The assigned response channel can then be used as a parameter of rpc_call,
which obtains exclusive access to the shared request channel and writes the specified RPC
request along with the specified response channel. The shared channel is designed to be
a circular buffer that supports a single reader and multiple writers. On the Qduino side, an
RPC daemon thread periodically polls the request channel for incoming requests, by calling
rpc_poll. Upon receipt, the daemon dispatches requested operations, gathers execution
results, and writes the results into the given response channel by calling write_channel.
Meanwhile, the RPC caller on the Linux side, while still in rpc_call, calls read_channel
on the response channel for the RPC response. rpc_call eventually returns with the received
results. The rpc_free function is used to invoke the hypervisor again to free the
response channel. Currently, applications that utilize RPC are responsible for defining the
Interface Definition Language (IDL) in order to marshal and unmarshal RPC parameters.
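The request path described above can be sketched as a single-threaded model. Everything in it is hypothetical and chosen for illustration: the `demo_rpc_t` layout, the `demo_rpc_send` and `demo_rpc_poll` functions, the ring and pool sizes, and the status values returned by `demo_status`. The locking that gives a writer exclusive access to the request ring, and the hypervisor calls behind rpc_alloc and rpc_free, are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define REQ_SLOTS 8 /* hypothetical capacity of the request ring */
#define RSP_CHANS 4 /* hypothetical size of the response channel pool */

typedef struct { uint32_t op; uint32_t rsp_chan; } demo_req_t;
typedef struct { int32_t value; bool ready; } demo_rsp_t;

typedef struct {
    demo_req_t ring[REQ_SLOTS];
    size_t head; /* next request the daemon will read */
    size_t tail; /* next free position for a caller */
    demo_rsp_t rsp[RSP_CHANS];
} demo_rpc_t;

/* Caller side: enqueue a request that names its response channel.
 * Returns false if the ring is full. */
bool demo_rpc_send(demo_rpc_t *r, uint32_t op, uint32_t rsp_chan)
{
    assert(rsp_chan < RSP_CHANS);
    if (r->tail - r->head == REQ_SLOTS)
        return false;
    r->rsp[rsp_chan].ready = false;
    r->ring[r->tail % REQ_SLOTS] = (demo_req_t){ op, rsp_chan };
    r->tail++;
    return true;
}

/* Made-up status values standing in for state owned by the Qduino side. */
static int32_t demo_status(uint32_t op)
{
    switch (op) {
    case 0:  return 1500; /* e.g. a motor command value */
    case 1:  return 87;   /* e.g. a battery percentage */
    default: return -1;   /* unknown request */
    }
}

/* Daemon side: serve at most one pending request per call and write the
 * result into the response channel named in the request. */
bool demo_rpc_poll(demo_rpc_t *r)
{
    if (r->head == r->tail)
        return false;
    demo_req_t req = r->ring[r->head % REQ_SLOTS];
    r->head++;
    r->rsp[req.rsp_chan].value = demo_status(req.op);
    r->rsp[req.rsp_chan].ready = true;
    return true;
}
```

In this model the caller would spin on `ready` after sending, which mirrors the caller reading the response channel while still inside rpc_call.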
6.4 An Autonomous Drone Framework Prototype
We verify the usability of the enhanced Qduino platform by using it to build an au-
tonomous drone framework on the Intel Aero board. The aim is to replace radio-control
with on-board tasks that perform configurable flight missions, such as aerial photogra-
phy, package delivery, and search and rescue. Apart from the system components shown
in Figure 6.1, the current framework consists of two user-level applications: 1) a re-
implemented flight controller, called Cleanflight in Qduino; and 2) a Linux application
called Autonomous Control Platform (ACP) that is intended to replace human drone oper-
ators.
6.4.1 Hardware Setup
The Intel Aero board has an Atom x7-Z8750 1.6 GHz 4-core processor and 4GB RAM.
Apart from the CPU, the Aero board also has an FPGA-based I/O coprocessor. It provides
FPGA-emulated I/O interfaces including analog-to-digital conversion (ADC), UART se-
rial, and pulse-width modulation (PWM). Our system currently uses the I/O hub to send
PWM signals to electronic speed controllers (ESCs) that alter motor speeds. There is
also an on-board Bosch BMI160 Inertial Measurement Unit (IMU). Both the I/O hub and
IMU connect to the main processor via an SPI bus. The SPI bus controller is assigned by
Quest-V to the Qduino sandbox so that only Cleanflight is able to access the IMU and
PWM. Non-critical devices such as the camera, WiFi and permanent storage are assigned to
the Linux sandbox. The Qduino sandbox currently uses Core 0 to run Qduino/Cleanflight.
The remaining three cores are reserved for extension applications involved in autonomous
flight management.
6.4.2 Cleanflight in Qduino
Cleanflight [Cle] is a popular flight control firmware for racing drones that are operated
by humans via radio-control. The core software components of Cleanflight consist of
sensor and actuator drivers, a PID controller, the Mahony Attitude and Heading Reference
System (AHRS) algorithm, various communication stacks, and a logging system. Runtime entities
of those components are called tasks. There are 31 tasks in total, of which more than half
are optional.
To minimize engineering efforts, the reimplemented Cleanflight only has the essential
tasks shown as circular tasks in Figure 6.2. The AHRS sensor fusion task takes the in-
put readings of the accelerometer and gyroscope (in the BMI160 IMU) and calculates the
current attitude of the drone. Then, the PID task compares the calculated and target atti-
tudes, and feeds the difference to the PID control logic. In the original Cleanflight code,
the target attitude is determined by radio-control signals from a human flying the drone.
In an autonomous setting, the target attitude would be calculated according to on-board
computations based on mission objectives and flight conditions. In either case, the PID output is
mixed with the desired throttle and read in by the PWM task, which translates it into motor
commands. Motor commands are sent over the SPI bus and delivered as PWM signals to
each ESC associated with a separate motor-rotor pair on the drone.
The tasks are implemented as separate Qduino loops. Their budgets are determined
by profiling and shown in Table 6.2. A flight controller is a typical cyber-physical ap-
plication where sensor data must be sampled and processed at a minimum rate to ensure
the correctness and timeliness of actuation. Specifically, a drone must operate according
to the most recent estimates of its attitude, direction, altitude, and speed. Similarly, the
rotor speeds must be updated within certain time bounds to be relevant to the current sen-
sor readings. These end-to-end timing requirements make it possible to apply Qduino’s
end-to-end design approach (Section 3.5) to determine the period of each task in the
re-implementation of Cleanflight.
Task              Gyro   AHRS   PID   PWM   Accl   Radio
Exec Times (µs)   174    10     2     970   167    12
Table 6.2: Task Execution Times
Figure 6.2: Cleanflight Data Flow
As can be seen in Figure 6.2, there are three data paths, originating from the gyro,
accelerometer, and radio receiver, respectively. Unfortunately, there is little information
available on what end-to-end timing constraints should be imposed on each path to guar-
antee a working drone. Most timing parameters in the original Cleanflight are determined
by trial and error. Therefore, we first port Cleanflight to Yocto Linux on the Aero board,
as a reference implementation. The Linux version remains single-threaded and is used to
estimate the desired end-to-end time. For example, for the gyro path, the worst-case reac-
tion and freshness times are measured to be 9769 and 22972 µs, respectively. We round
them to 10 and 23 ms, and use them as end-to-end timing constraints for our flight con-
troller implementation. Using the same approach, we determine end-to-end reaction and
freshness times for the accelerometer path, which are set to 10 and 23 ms, respectively.
Finally, for the radio path, we set the end-to-end reaction and freshness times to be 20 and
44 ms, respectively.
Using the execution times in Table 6.2 and timing constraints above, we apply the
end-to-end design approach to derive the periods. The results for each task are shown in
Table 6.3.
Task                 Gyro       AHRS       PID        PWM         Accl       Radio
Budget/Period (µs)   200/1000   100/5000   100/2000   1000/5000   200/1000   100/10000
Table 6.3: Task Periods
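As a quick sanity check on these values, the CPU share reserved by the six loops can be summed from their budget/period pairs. The sketch below is only a worked calculation: the `task_t` type and `total_utilization_bp` helper are hypothetical, and it assumes that all six loops share one core, as the hardware setup suggests for Core 0.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    const char *name;
    unsigned budget_us;
    unsigned period_us;
} task_t;

/* Budget/period pairs copied from Table 6.3. */
static const task_t tasks[] = {
    { "Gyro",   200,  1000 },
    { "AHRS",   100,  5000 },
    { "PID",    100,  2000 },
    { "PWM",   1000,  5000 },
    { "Accl",   200,  1000 },
    { "Radio",  100, 10000 },
};

/* Sum of budget/period in basis points (1/100 of a percent).
 * Integer division is exact for the table values; other inputs
 * would be rounded down per task. */
unsigned total_utilization_bp(const task_t *t, size_t n)
{
    unsigned sum = 0;
    for (size_t i = 0; i < n; i++) {
        assert(t[i].period_us > 0);
        sum += t[i].budget_us * 10000u / t[i].period_us;
    }
    return sum;
}
```

For the table values the sum is 2000 + 200 + 500 + 2000 + 2000 + 100 = 6800 basis points, i.e. 68% of one core. Each budget is also at least as large as the corresponding profiled execution time in Table 6.2.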
Evaluation. To measure the actual end-to-end time, we focus on the longest pipe
chain highlighted in Figure 6.2. We instrument the Cleanflight code to tag every gyro
reading with an incrementing ID, and also to record a timestamp before the gyro input is
read. The timestamp is then stored in an array indexed by the ID. Every task is further
instrumented to carry the ID over when translating input data to output. This way, the ID is
preserved along the pipe chain, from the input gyro reading to the output motor command.
After the PWM task sends out motor commands, it looks up the timestamp using its ID
and compares it to the current time. By doing this, we are able to log both the reaction and
freshness end-to-end time for every input gyro reading. We then compare the observed
end-to-end time with the given timing constraints, as well as the predicted worst-case
value. Results are shown in Figure 6.3. As can be seen, the observed values are always
within the predicted bounds, and always meet the timing constraints.
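The instrumentation can be sketched as follows. This is an illustrative reconstruction rather than the code used in the experiments: the `tag_sample` and `elapsed_since` names and the `LOG_SLOTS` ring size are hypothetical, and the current time is passed in explicitly so that the example does not depend on any particular clock source.

```c
#include <assert.h>
#include <stdint.h>

#define LOG_SLOTS 1024u /* hypothetical array size; IDs wrap around it */

static uint64_t stamp_us[LOG_SLOTS]; /* timestamps indexed by sample ID */
static uint32_t next_id;

/* Called just before the gyro is read: store the timestamp and return
 * the ID that travels with the sample through every task in the chain. */
uint32_t tag_sample(uint64_t now_us)
{
    uint32_t id = next_id++;
    stamp_us[id % LOG_SLOTS] = now_us;
    return id;
}

/* Called by the last task after the motor commands are sent: time that
 * has passed since the sample carrying this ID was tagged. */
uint64_t elapsed_since(uint32_t id, uint64_t now_us)
{
    assert(now_us >= stamp_us[id % LOG_SLOTS]);
    return now_us - stamp_us[id % LOG_SLOTS];
}
```

The logged differences can then be compared against the constraints, for example 10 ms and 23 ms for the gyro path.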
Figure 6.3: Cleanflight Times (reaction and freshness times in µs; series: observed best case, observed worst case, predicted, constraint)
6.4.3 The Autonomous Control Platform
The ACP is a broker program between the flight controller in Qduino and configurable
flight missions in Linux. It receives control commands from the currently active flight
mission task and forwards them to Cleanflight. Currently, one asynchronous communication
channel (Section 6.3.1) is utilized to send control commands. The ACP is also
responsible for logging and displaying flight information such as attitude, motor speeds,
or battery information. It does this by using the RPC mechanism (Section 6.3.2) to send
requests to Cleanflight. The MultiWii Serial Protocol (MSP) is currently used as the IDL.
Chapter 7
Conclusion and Future Work
This thesis focuses on designing a programming platform for modern cyber-physical appli-
cations. We identified three crucial properties such a platform should support: 1) design-
time specification and run-time guarantee of timing properties of individual tasks as well
as application-wide task pipelines; 2) a simple and clean interface for CPU and I/O re-
source management on multicore architectures; 3) platform extensibility that enables easy
integration with legacy software components. Using the three properties as guidelines, we
designed and implemented the Qduino programming platform.
For real-time guarantees, we described the VCPU scheduling algorithm that provides
temporal isolation between system events and conventional threads in the runtime environ-
ment. Based on this scheduling policy and the communication model in the Quest RTOS,
a composable pipe model was described to model the timing properties of task pipelines.
We observed that common to many modern cyber-physical applications is a communi-
cating task pipeline, to acquire and process sensor data and produce outputs that control
actuators. Therefore two end-to-end timing constraints were defined, in terms of sensor
data freshness and actuator reaction time, that constrain the problem of composing a feasible
task pipeline. The pipe model was then used to analyze the two end-to-end semantics of
time, and we were able to formulate the worst case. Finally, we provided a mathematical
framework to derive feasible task periods and budgets that satisfy both schedulability and
end-to-end timing requirements.
The proposed mathematical framework is intended to ease the design stage of cyber-physical
applications, especially those that rely on sensor fusion algorithms to perform autonomous
actuation. The scheduling algorithm, on the other hand, is intended to guarantee
specified timing requirements at runtime. In order to expose both features to end users,
a programming platform called Qduino was presented. Qduino was designed to adopt
the simplicity of the popular Arduino API, but was also extended with support for: 1)
multithreaded sketches; 2) timing specification of individual loops, interrupt handlers and
loop pipelines; 3) CPU affinity and interrupt routing management on multicore architec-
tures. Qduino’s loop structure functions allow users to specify multiple real-time tasks,
whose timing requirements are guaranteed at runtime by the underlying VCPU scheduler.
Through a separate configuration file fed into the building process, users can also option-
ally specify loop pipelines with end-to-end timing requirements, which are used to derive
loop periods. Qduino’s API also allows users to shield a certain loop from the scheduling
and interrupt overheads on a multicore architecture.
We also described a case study based on the development of a web-connected 3D
printer, which has the ability to spool print requests via a webserver without being tethered
to a remote computer. We analyzed the real-time issues in a 3D printer, and implemented
our in-house web-connected 3D printer. With Qduino, we were able to pin the extremely
time-sensitive stepper control logic on a separate core without interference from interrupts
or scheduling overheads. We compared the performance of a prototype controller running
on Linux and on Qduino. Experiments show that stepper motor control requires precise timing
at microsecond resolution, and that the underlying scheduling and context switching overheads
on Linux prohibit accurate task timing at this granularity. The printer built
on Qduino displayed lower overheads and better predictability.
Not satisfied with the fact that a 3D printer is a dumb cyber-physical device, we sought
a programming platform that can be used to build an autonomous drone. An autonomous
drone requires complex software stacks, such as OpenCV for image processing
or GPU-based machine learning algorithms, that typically exist on a GPOS. We therefore
integrated Qduino with Linux, by taking advantage of the Quest-V virtualized partitioning
kernel. Two types of communication mechanisms, i.e., asynchronous communication
and RPC, were designed so that Qduino and Linux are able to exchange data. With this
new system, we reimplemented the Cleanflight flight controller on the Intel Aero board
and also implemented a Linux application that serves as a broker program for future flight
mission applications.
7.1 Future Work
Co-scheduling of Quest and Linux Tasks. As mentioned earlier, the Linux CFS sched-
uler does not offer the level of predictability provided by the VCPU scheduling framework
in the Quest kernel. This hinders us from extending communication with end-to-end tim-
ing requirements (Section 3.4) to the Linux sandbox. In order to provide end-to-end timing
guarantees on inter-sandbox communication, the scheduling of Linux tasks should be (in-
directly) controlled by associated Qduino tasks or the VCPU scheduler. We refer to this
technique as the co-scheduling of Quest and Linux tasks.
Autonomous Drone. The current implementation of the autonomous drone framework
is a preliminary prototype. Though it requires great engineering effort, a fully
fledged autonomous drone on a single piece of hardware has significant research value.
It serves as the ultimate application that demonstrates the advantages of a virtualized par-
titioning kernel and also opens up the opportunity to potentially apply this technique to
other embedded and cyber-physical applications.
In order to achieve this goal, many objectives need to be met. To name a few:
• A complete reimplementation of Cleanflight. To take the best advantage of the
Qduino/Linux system, low-criticality Cleanflight tasks should be transferred
to the Linux sandbox. This way, the failure of those tasks cannot compromise
the correct execution of the core flight control logic. In addition, transferred tasks can
utilize existing Linux device drivers. Examples are the blackbox task that requires
access to permanent storage and the original serial task that will be replaced with a
wireless communication task.
• An implementation of flight mission tasks. One possible candidate is a task that
uses the on-board Intel RealSense camera to perform SLAM-based navigation.
• Division of power management permissions. Current experiments reveal the problem
that Linux controls the power management on the Aero board. Changes to
the power state of essential sensors or actuators might cause a failure to maintain
flight. Therefore, the high-criticality Quest sandbox should be in control of
power management. Because implementing the power management logic from scratch
can be time-consuming, it is worthwhile to investigate the possibility of calling
Linux kernel features, e.g., the power management subsystem, from the Quest
sandbox.
Bibliography
[AB98] Luca Abeni and Giorgio Buttazzo. Integrating Multimedia Applications
in Hard Real-Time Systems. In Proceedings of the 19th IEEE Real-time
Systems Symposium, pages 4–13, 1998.
[AB04] Luca Abeni and Giorgio Buttazzo. Resource reservation in dynamic real-
time systems. Real-Time Systems, 27(2):123–167, Jul 2004.
[Abs14] AbsInt. aiT worst-case execution time analyzers:
https://www.absint.com/ait/, 2014.
[adp] NO_HZ: Reducing Scheduling-Clock Ticks:
https://www.kernel.org/doc/Documentation/timers/NO_HZ.txt.
[AG17] SYSGO AG. The PikeOS Hypervisor. https://www.sysgo.com/,
2017.
[ard18] Arduino Language Reference, 2018. Arduino Language Reference:
http://arduino.cc/en/Reference/HomePage.
[BCC+16] Endri Bregu, Nicola Casamassima, Daniel Cantoni, Luca Mottola, and
Kamin Whitehouse. Reactive Control of Autonomous Drones. In Pro-
ceedings of the 14th Annual International Conference on Mobile Systems,
Applications, and Services, MobiSys ’16, pages 207–219, New York, NY,
USA, 2016. ACM.
[BDF+03] Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex
Ho, Rolf Neugebauer, Ian Pratt, and Andrew Warfield. Xen and the Art
of Virtualization. In Proceedings of the Nineteenth ACM Symposium on
Operating Systems Principles, SOSP ’03, pages 164–177, New York, NY,
USA, 2003. ACM.
[BDM99] Gaurav Banga, Peter Druschel, and Jeffrey C. Mogul. Resource contain-
ers: a new facility for resource management in server systems. In Pro-
ceedings of the 3rd USENIX Symposium on Operating Systems Design
and Implementation, 1999.
[BDM+16] M. Becker, D. Dasari, S. Mubeen, M. Behnam, and T. Nolte. Synthesiz-
ing Job-Level Dependencies for Automotive Multi-rate Effect Chains. In
the 22nd IEEE Intl. Conference on Embedded and Real-Time Computing
Systems and Applications, Aug 2016.
[BHG+13] Emmanuel Baccelli, Oliver Hahm, M. Güneş, M. Wählisch, and Thomas C.
Schmidt. RIOT OS: Towards an OS for the Internet of Things. In Com-
puter Communications Workshops (INFOCOM WKSHPS), 2013 IEEE
Conference on, pages 79–80. IEEE, 2013.
[BHM77] S.P. Bradley, A.C. Hax, and T.L. Magnanti. Applied Mathematical Pro-
gramming. Addison-Wesley Publishing Company, 1977.
[bli18] Blink Without Delay Example, 2018. Blink Without Delay:
http://arduino.cc/en/Tutorial/BlinkWithoutDelay.
[BLPE13] Frédéric Boniol, Michaël Lauer, Claire Pagetti, and Jérôme Ermont.
Freshness and Reactivity Analysis in Globally Asynchronous Locally
Time-Triggered Systems. In NASA Formal Methods. Springer Berlin Hei-
delberg, 2013.
[BSP+95] Brian N. Bershad, Stefan Savage, Przemysław Pardyak, Emin Gün Sirer,
Marc Fiuczynski, David Becker, Craig Chambers, and Susan Eggers. Extensibility, safety, and
performance in the SPIN operating system. In Proceedings of the 15th
ACM Symposium on Operating Systems Principles, pages 267–284, Cop-
per Mountain, Colorado, December 1995.
[CB13] Felipe Cerqueira and Björn B. Brandenburg. A comparison of scheduling
latency in Linux, PREEMPT-RT, and LITMUS^RT. In OSPERT 2013 9th Annual
Workshop on Operating Systems Platforms for Embedded Real-Time Ap-
plications, 2013.
[Cle] Cleanflight: http://cleanflight.com/.
[DGV04] Adam Dunkels, Björn Grönvall, and Thiemo Voigt. Contiki - A
Lightweight and Flexible Operating System for Tiny Networked Sensors.
In Local Computer Networks, 2004. 29th Annual IEEE International Con-
ference on, pages 455–462. IEEE, 2004.
[Dik06] Jeff Dike. The User Mode Linux Kernel. http://
user-mode-linux.sourceforge.net/, 2006.
[DLS97] Z. Deng, J. W. S. Liu, and J. Sun. A scheme for scheduling hard real-time
applications in open system environment, 1997.
[DLW11] Matthew Danish, Ye Li, and Richard West. Virtual-CPU Scheduling in
the Quest Operating System. In Proceedings of the 17th Real-Time and
Embedded Technology and Applications Symposium, 2011.
[EKO95a] D. R. Engler, M. F. Kaashoek, and J. O’Toole, Jr. Exokernel: An Oper-
ating System Architecture for Application-level Resource Management.
In Proceedings of the Fifteenth ACM Symposium on Operating Systems
Principles, SOSP ’95, pages 251–266, New York, NY, USA, 1995. ACM.
[EKO95b] D. R. Engler, M. F. Kaashoek, and J. O’Toole, Jr. Exokernel: An oper-
ating system architecture for application-level resource management. In
Proceedings of the Fifteenth ACM Symposium on Operating Systems Prin-
ciples, SOSP ’95, pages 251–266, New York, NY, USA, 1995. ACM.
[eri] ERIKA Enterprise: http://erika.tuxfamily.org/drupal/.
[FBP17] J. Forget, F. Boniol, and C. Pagetti. Verifying End-to-end Real-time Con-
straints on Multi-periodic Models. In the 22nd IEEE International Con-
ference on Emerging Technologies and Factory Automation (ETFA), pages
1–8, Sept 2017.
[FRNJ08] Nico Feiertag, Kai Richter, Johan Nordlander, and Jan Jonsson. A Com-
positional Framework for End-to-End Path Delay Calculation of Automo-
tive Systems under Different Path Semantics. In Proceedings of the IEEE
Real-Time System Symposium - Workshop on Compositional Theory and
Technology for Real-Time Embedded Systems, Barcelona, Spain, Novem-
ber 30, 2008.
[GB95] T.M. Ghazalie and T. Baker. Aperiodic servers in a deadline scheduling
environment. Real-Time Systems, (9), 1995.
[GHS95] Richard Gerber, Seongsoo Hong, and Manas Saksena. Guaranteeing
Real-Time Requirements With Resource-Based Calibration of Periodic
Processes. IEEE Trans. Softw. Eng., July 1995.
[HDK+17] Arne Hamann, Dakshina Dasari, Simon Kramer, Michael Pressler, and
Falk Wurst. Communication Centric Design in Complex Automotive Em-
bedded Systems. In 29th Euromicro Conference on Real-Time Systems
(ECRTS 2017), Leibniz International Proceedings in Informatics (LIPIcs),
Dagstuhl, Germany, 2017.
[HHJ+05] Rafik Henia, Arne Hamann, Marek Jersak, Razvan Racu, Kai Richter, and
Rolf Ernst. System Level Performance Analysis - the SymTA/S Approach.
In IEEE Proceedings Computers and Digital Techniques, 2005.
[HHK01] Thomas A. Henzinger, Benjamin Horowitz, and Christoph Meyer Kirsch.
Giotto: A Time-Triggered Language for Embedded Programming, pages
166–184. Springer Berlin Heidelberg, Berlin, Heidelberg, 2001.
[Int18] Intel. Intel Aero Ready to Fly Drone.
https://www.intel.com/content/www/us/en/products/drones/aero-ready-to-fly.html, July 2018.
[KEH+09] Gerwin Klein, Kevin Elphinstone, Gernot Heiser, June Andronick, David
Cock, Philip Derrin, Dhammika Elkaduwe, Kai Engelhardt, Rafal Kolanski,
Michael Norrish, Thomas Sewell, Harvey Tuch, and Simon Winwood.
seL4: Formal Verification of an OS Kernel. In Proceedings of the ACM
SIGOPS 22nd Symposium on Operating Systems Principles, SOSP ’09,
pages 207–220, New York, NY, USA, 2009. ACM.
[KJJ99] Paul C. Kocher, Joshua Jaffe, and Benjamin Jun. Differential Power Anal-
ysis. In Proceedings of the 19th Annual International Cryptology Confer-
ence on Advances in Cryptology, CRYPTO ’99, pages 388–397, London,
UK, 1999. Springer-Verlag.
[KKL+07] Avi Kivity, Yaniv Kamay, Dor Laor, Uri Lublin, and Anthony Liguori.
KVM: the Linux Virtual Machine Monitor. In Proceedings of the 2007
Ottawa Linux Symposium, OLS ’07, 2007.
[KMKKTC16] Jad Khatib, Alix Munier-Kordon, Enagnon Cedric Klikpo, and Kods
Trabelsi-Colibet. Computing Latency of a Real-time System Modeled
by Synchronous Dataflow Graph. In the 24th International Conference
on Real-Time Networks and Systems, RTNS ’16, pages 87–96, New York,
NY, USA, 2016. ACM.
[LBPE14] Michaël Lauer, Frédéric Boniol, Claire Pagetti, and Jérôme Ermont.
End-to-end Latency and Temporal Consistency Analysis in Networked
Real-time Systems. International Journal on Critical Computing-Based
Systems, 5(3/4), 2014.
[LL73] C. L. Liu and James W. Layland. Scheduling Algorithms for Multi-
programming in a Hard Real-Time Environment. Journal of the ACM,
20(1):46–61, 1973.
[LM87] E. A. Lee and D. G. Messerschmitt. Synchronous Data Flow. Proceedings
of the IEEE, 75(9):1235–1245, Sept 1987.
[LSD89] John Lehoczky, Lui Sha, and Ye Ding. The Rate Monotonic Scheduling
Algorithm: Exact Characterization and Average Case Behavior. In Pro-
ceedings of the IEEE Real-Time Systems Symposium (RTSS), 1989.
[LSS87] John P. Lehoczky, Lui Sha, and Jay K. Strosnider. Enhanced aperiodic
responsiveness in hard real-time environments. In RTSS, 1987.
[MAB+16] S. Mazumdar, E. Ayguade, N. Bettin, J. Bueno, S. Ermini, A. Filgueras,
D. Jimenez-Gonzalez, C. A. Martinez, X. Martorell, F. Montefoschi,
D. Oro, D. Pnevmatikatos, A. Rizzo, D. Theodoropoulos, and R. Giorgi.
Axiom: A hardware-software platform for cyber physical systems. In
2016 Euromicro Conference on Digital System Design (DSD), pages 539–
546, Aug 2016.
[McK08] Paul E McKenney. ’Real Time’ vs. ’Real Fast’: How to Choose? In
Ottawa Linux Symposium, volume 2, pages 57–65, July 2008.
[MMW16] E. Missimer, K. Missimer, and R. West. Mixed-Criticality Scheduling
with I/O. In 28th Euromicro Conference on Real-Time Systems (ECRTS),
pages 120–130, July 2016.
[MSN+15] S. Mubeen, M. Sjödin, T. Nolte, J. Lundbäck, M. Gålnander, and K. L.
Lundbäck. End-to-End Timing Analysis of Black-Box Models in Legacy
Vehicular Distributed Embedded Systems. In the 21st IEEE International
Conference on Embedded and Real-Time Computing Systems and Appli-
cations, pages 149–158, Aug 2015.
[MST93] Clifford W. Mercer, Stefan Savage, and Hideyuki Tokuda. Processor Ca-
pacity Reserves for Multimedia Operating Systems. Technical report,
Pittsburgh, PA, USA, 1993.
[OR98] Shuichi Oikawa and Ragunathan Rajkumar. Linux/RK: A portable re-
source kernel in Linux. In Proc. 19th IEEE Real-Time Systems Sympo-
sium, 1998.
[PB14] Pasquale Buonocunto, Alessandro Biondi, and Pietro Lorefice. Real-Time
Multitasking in Arduino. In WiP session of the 9th IEEE International
Symposium on Industrial Embedded Systems, 2014.
[PFB+11] Claire Pagetti, Julien Forget, Frédéric Boniol, Mikel Cordovilla, and
David Lesens. Multi-task Implementation of Multi-periodic Synchronous
Programs. Discrete Event Dynamic Systems, Sep 2011.
[Reg01] John Regehr. HLS: A framework for composing soft real-time schedulers.
In Proceedings of the 22nd IEEE Real-Time Systems Symposium, pages
3–14, 2001.
[Rus02] John Rushby. Model Checking Simpson’s Four-slot Fully Asynchronous
Communication Mechanism. Computer Science Laboratory–SRI Interna-
tional, Tech. Rep. Issued, 2002.
[SB94a] M. Spuri and G. Buttazzo. Efficient Aperiodic Service under Earliest
Deadline Scheduling. In IEEE Real-Time Systems Symposium, Decem-
ber 1994.
[SB94b] M. Spuri and G. Buttazzo. Efficient aperiodic service under earliest dead-
line scheduling. In IEEE Real-Time Systems Symposium, December 1994.
[SEB17] Abhishek Singh, Pontus Ekberg, and Sanjoy Baruah. Applying Real-Time
Scheduling Theory to the Synchronous Data Flow Model of Computation.
In Proceedings of the 29th Euromicro Conference on Real-Time Systems,
Dagstuhl, Germany, 2017.
[Sim90] H.R. Simpson. Four-slot Fully Asynchronous Communication Mecha-
nism. IEEE Computers and Digital Techniques, 137:17–30, January 1990.
[SL03] Insik Shin and Insup Lee. Periodic resource model for compositional real-
time guarantees. In RTSS, pages 2–13, 2003.
[SLS95] Jay K. Strosnider, John P. Lehoczky, and Lui Sha. The Deferrable Server
Algorithm for Enhanced Aperiodic Responsiveness in Hard Real-Time
Environment. IEEE Transactions on Computers, 44(1):73–91, January
1995.
[SLS98] D. Seto, J. P. Lehoczky, and Lui Sha. Task period selection and
schedulability in real-time systems. In Proceedings 19th IEEE Real-Time Systems
Symposium (Cat. No.98CB36279), pages 188–198, Dec 1998.
[Spr89] B. Sprunt. Scheduling Sporadic and Aperiodic Events in a Hard Real-
Time System. Technical Report CMU/SEI-89-TR-011, Software Engi-
neering Institute, Carnegie Mellon, 1989.
[Sri95] Raj Srinivasan. RPC: Remote Procedure Call Protocol Specification Version
2. Technical report, 1995.
[SS96] Christopher Small and Margo I. Seltzer. A Comparison of OS Extension
Technologies. In USENIX Annual Technical Conference, pages 41–54,
1996.
[SSL89] Brinkley Sprunt, Lui Sha, and John Lehoczky. Aperiodic task scheduling
for hard-real-time systems. Real-Time Systems, 1(1):27–60, Jun 1989.
[Sys17] QNX Software Systems. QNX Hypervisor. https://www.qnx.com/, 2017.
[TD16] TU-Dresden. L4linux. https://l4linux.org/, 2016.
[WLMD16] Richard West, Ye Li, Eric Missimer, and Matthew Danish. A Virtualized
Separation Kernel for Mixed-Criticality Systems. ACM Trans. Comput.
Syst., 34(3):8:1–8:41, June 2016.
Curriculum Vitae
