Washington University in St. Louis

Washington University Open Scholarship
McKelvey School of Engineering Theses &
Dissertations

McKelvey School of Engineering

Summer 8-15-2018

Concurrency Platforms for Real-Time and Cyber-Physical Systems
David Ferry
Washington University in St. Louis

Recommended Citation
Ferry, David, "Concurrency Platforms for Real-Time and Cyber-Physical Systems" (2018). McKelvey School
of Engineering Theses & Dissertations. 360.
https://openscholarship.wustl.edu/eng_etds/360

Washington University in St. Louis
School of Engineering & Applied Science
Department of Computer Science & Engineering

Dissertation Examination Committee:
Christopher D. Gill, Chair
Kunal Agrawal, Co-Chair
James H. Anderson
Roger Chamberlain
I-Ting Angelina Lee
Arun Prakash

Concurrency Platforms for Real-Time and Cyber-Physical Systems
by
David Ferry

A dissertation presented to
The Graduate School
of Washington University in
partial fulfillment of the
requirements for the degree
of Doctor of Philosophy

August 2018
St. Louis, Missouri

© 2018, David Ferry

Table of Contents

List of Figures
List of Tables
Acknowledgements
Abstract
Preface

1 Parallelism and Concurrency Platforms for Real-Time Systems

2 RT-OpenMP
    2.0.1 Overview of OpenMP
    2.0.2 Parallel Synchronous Task Model
    2.0.3 RT-OpenMP Scheduling Service Design
  2.1 RT-OpenMP Evaluation
    2.1.1 RT-OpenMP Design Space Choices
    2.1.2 System Overhead Measurements
    2.1.3 Empirical Results

3 Mixed-Criticality Federated Scheduling Service
  3.1 Implementing Mixed-Criticality Federated Scheduling (MCFS)
    3.1.1 Overrun Detection
    3.1.2 Core Reallocation
    3.1.3 State-Aware Barrier Implementation
    3.1.4 Recovering from critical-state to typical-state
  3.2 Evaluation of MCFS
    3.2.1 MCFS Benchmarks
    3.2.2 Impact of high-criticality tasks on low-criticality tasks
    3.2.3 MCFS Validation
    3.2.4 Mode Switch Stress Testing
    3.2.5 Graceful Degradation
  3.3 Discussion of RT-OpenMP vs Federated Scheduling Implementations

4 CyberMech, A Concurrency Platform for Real-Time Hybrid Simulation
  4.1 Background on RTHS
    4.1.1 Structural Simulation Methodology
    4.1.2 Shake Table Hardware
  4.2 RTHS Challenges for CyberMech
  4.3 Computational Architecture for RTHS
    4.3.1 Specifying RTHS Computations
    4.3.2 Thread Safe Hardware I/O
    4.3.3 Interaction Between Tasks
    4.3.4 RTHS Repeatability on CyberMech
  4.4 Further Challenges
    4.4.1 Application to General Cyber-Physical Systems
    4.4.2 Challenges in RTHS for CyberMech

5 Parallel Computing Tradeoffs In Statically Determined Cyber-Physical Systems
  5.1 Linearity of RTHS determines proportion of parallel/serial computation
  5.2 Parallel Real-time Computation of Static RTHS
  5.3 Further Challenges and Future Work

6 Related Work and Other Soft Real-Time Platforms on Linux
  6.1 Concurrency Platforms and Parallel Programming
  6.2 Multi-processing vs. Parallel Processing
  6.3 Soft Real-Time vs. Hard Real-Time
  6.4 Parallel Real-Time
  6.5 Real-Time Hybrid Simulation (RTHS)

7 Conclusion
  7.1 Future Parallel Real-Time Platforms
  7.2 Future of RTHS Infrastructure
  7.3 Future of Cyber-Physical Parallelism

8 Bibliography

List of Figures

2.1 An example parallel-synchronous task with four segments.
2.2 An example decomposition and scheduling of two tasks under RT-OpenMP.
2.3 Division of responsibility between the scheduler and run-time dispatcher in the RT-OpenMP system.
2.4 RT-OpenMP task set utilization vs. failure rate for the 2ms timescale.
2.5 RT-OpenMP task set utilization vs. failure rate for the 4ms timescale.
2.6 RT-OpenMP task set utilization vs. failure rate for the 8ms timescale.
2.7 RT-OpenMP task set utilization vs. failure rate for the 16ms timescale.
2.8 RT-OpenMP task set utilization vs. failure rate for the 32ms timescale.
2.9 RT-OpenMP task set utilization vs. failure rate for the 2048ms timescale.
3.1 MCFS Periodic Task Invocation Pseudocode
3.2 MCFS Mode Aware Barrier Pseudocode
3.3 High-criticality mode transition latency in MCFS
3.4 Graceful degradation of low-criticality tasks in the presence of high-criticality task overruns.
4.1 The fundamental RTHS control loop.
4.2 A generic Real-Time Hybrid Simulation (RTHS) decomposition of a two-story structure.
4.3 The electronic shake table used for experimental evaluations in this chapter.
4.4 An overview of the CyberMech system architecture as applied to RTHS.
4.5 Two-story frame validation RTHS for CyberMech and xPC.
4.6 Comparison of transfer system performance between CyberMech and xPC.
4.7 Normalized error in displacement of 1st floor resulting from modeling idealization and epistemic experimental sources of error.
4.8 Standard deviation in displacement response of the 1st floor for both sets of runs as a function of time.
4.9 Normalized difference of the average displacement response of the 1st floor between CyberMech and xPC.
4.10 Generalized decision-making loop for cyber-physical applications.
5.1 Ten degree of freedom numerical substructure.
5.2 Static RTHS per-period time by simulation size and number of cores
5.3 Static RTHS periodic rate by simulation size and number of cores
5.4 Static RTHS numerical simulation computation timings by model size and number of processor cores
5.5 Static RTHS Hardware communication time by model size and number of processor cores. Model sizes under 506 DOF are not shown as they were extremely similar to the 506 DOF data

List of Tables

2.1 Timescales used to validate the RT-OpenMP system.
2.2 RT-OpenMP scheduling latency and barrier latency micro-benchmarks.
4.1 Observed analog read/write and digital read communication overheads for the electric shake table.
4.2 CyberMech achievable state sizes as influenced by communication sizes and synchronization type.
5.1 Categories of RTHS Explored in This Work
5.2 Serial vs. Parallelizable work in an exemplar RTHS

Acknowledgements

I am grateful to Chris Gill and Kunal Agrawal, both of whom have provided incredible
guidance through my graduate career. I could not have asked for a better pair of co-advisers
given the nature of this project at the intersection of parallel computing and real-time computing,
and they are certainly a case where the whole is greater than the sum of the parts. They
met my needs as a student and have shepherded me into an academic life that is all my own,
and for that I am deeply thankful.

This work could not exist without Jing Li and Abusayeed Saifullah, both of whom
were instrumental in developing the related theory of parallel real-time execution. Jing's
theoretical work on federated scheduling in particular should be considered an essential
companion to this dissertation as it forms the underpinning for much of the systems work
that is within.

Chenyang Lu has been a constant collaborator over my graduate experience and has closely
directed the parallel real-time computing effort at Washington University alongside Chris
and Kunal. He has contributed to this work in innumerable ways both large and small.

The civil engineers at Purdue University have been fantastic collaborators, without whom
the large body of work on real-time hybrid simulation (RTHS) would not have been possible.
In particular: Gregory Bunting, Amin Maghareh, and Johnny Condori Uribe, who have been
advised by Shirley Dyke and Arun Prakash. They have been an exceptional group of people
to work with, and their willingness to engage in topics outside of their domain has been
tremendous.

James Orr has helped to flesh out ideas on the application of parallelism in cyber-physical
systems; in particular the frontier between statically allocated systems and dynamically
allocated systems.

There are many undergraduate and master's students who have worked alongside me in
various capacities, who have contributed in ways seen and unseen. Two students with
particular contributions are Kevin Kieselbach, who wrote the first prototype for the federated
scheduling runtime, and Tommy Powers, who did much of the initial work in setting up our
shake table. Thank you all.

Thanks also to the National Science Foundation, which has funded me as a graduate student
under awards CNS-1136075 and CCF-1439062.

Lastly, my deepest gratitude is to my wife Laura, who has been a constant support and
without whom none of this would have happened. My son Gideon and daughter Tabitha
certainly haven't sped me along to completion, but I love them both as well.

David Ferry

Washington University in St. Louis
August 2018

Dedicated to my inspiration, Laura
and to those I measure myself against: Al, George, Michael, and Joe.

Abstract of the Dissertation
Concurrency Platforms for Real-Time and Cyber-Physical Systems
by
David Ferry
Doctor of Philosophy in Computer Science
Washington University in St. Louis, 2018
Professor Christopher D. Gill, Chair
Associate Professor Kunal Agrawal, Co-Chair

Parallel processing is an important way to satisfy the increasingly demanding computational
needs of modern real-time and cyber-physical systems, but existing parallel computing
technologies primarily emphasize high-throughput and average-case performance metrics, which
are largely unsuitable for direct application to real-time, safety-critical contexts. This work
contrasts two concurrency platforms designed to achieve predictable worst-case parallel
performance for soft real-time workloads with millisecond periods and higher. One of these is
then the basis for the CyberMech platform, which enables parallel real-time computing for
a novel yet representative application called Real-Time Hybrid Simulation (RTHS). RTHS
combines demanding parallel real-time computation with real-time simulation and control in
an earthquake engineering laboratory environment, and results concerning RTHS characterize
a reasonably comprehensive survey of parallel real-time computing in the static context,
where the size, shape, timing constraints, and computational requirements of workloads are
fixed prior to system runtime. Collectively, these contributions constitute the first published
implementations and evaluations of general-purpose concurrency platforms for real-time and
cyber-physical systems, explore two fundamentally different design spaces for such systems,
and successfully demonstrate the utility and tradeoffs of parallel computing for statically
determined real-time and cyber-physical systems.

Preface

My graduate work has focused on two eminently practical branches of computer science:
parallel and real-time computer systems. This let me build the first (as far as I am aware)
parallel real-time execution platform, and then contribute heavily to the second.

Parallelism can fundamentally change the game in real-time systems. If a computation must
execute 1024 times per second then success is binary: either that is achievable or it isn't.
In other fields parallelism might allow a program to run twice or four times as fast, which
is great, but it's an improvement on the margins. With real-time systems, parallelism can
make the difference between achieving an inviolable timing constraint or not: the difference
between success and failure. In this way it enables new applications that were simply not
possible before.

The combination of parallel computing and real-time computing was certainly foreseen before
me. The critical theoretical work that underpins the systems described herein is not my own.
I would liken this experience to a mountain that nobody has climbed before. It's sitting there
in the distance, and everyone knows that someone will get there eventually. There is no innate
power that is needed, just the willingness and effort to get to the top. But getting there
first makes it your mountain.

I didn't do it alone. I climbed with Jing, Abu, James, Greg, Amin, Johnny, Chris, Kunal,
Chenyang, Arun, Shirley, and a host of other characters. We got there together, and it's an
experience I treasure.

David Ferry

Washington University in St. Louis
August 2018

Chapter 1: Parallelism and Concurrency
Platforms for Real-Time Systems

This work explores parallel computing on general purpose symmetric multiprocessor platforms
for real-time and cyber-physical systems. The need to manage interactions between
parallel real-time tasks drives the design of system mechanisms, so an understanding of these
interactions is essential to the engineering of parallel real-time platforms. Parallel tasks have
multiple threads executing on multiple processors, and this poses two new challenges for
real-time systems designers: these tasks must use intra-task synchronization to ensure
correct execution of threads within a task, and they must also manage the potential for
greatly increased inter-task interference. The novelty of the former is illustrated by
observing that intra-task synchronization simply is not relevant for single-threaded
computations. For the latter, we observe that parallel tasks that are co-located
on a set of cores can interfere with each other many times per period across cores, so events
in one part of a system can have nearly arbitrarily far-reaching effects. Sequential tasks, by
contrast, are usually expected to interfere only with other tasks sharing the same core.
Later in this work we will see that systems with strong task interactions are significantly
more complex than those without, as evidenced by two systems examined later, Federated
Scheduling and Mixed-Criticality Federated Scheduling.

Parallel computing coordinates multiple processing cores to collectively perform computations
either faster or in greater depth than they could be done with individual processor
cores. This represents a definite paradigm shift from traditional real-time computing, which
often assumes sequential processing, either with one core or more than one core (multiple
sequential computations, or multi-processing). This conservatism is a reasonable response:
multi-core processors and parallel computing add complexity, and traditional real-time
computing strives to be as predictable and reliable as possible. The fields of real-time computing
and cyber-physical computing encompass truly safety-critical computer systems, where
people's lives are at stake, so it is only prudent to exercise extreme caution with new and
untested technologies. This can be seen in action at the FAA and related aviation regulatory
agencies, where the question of when, where, and how to incorporate multi-core processors
(much less full-blown parallel computing) is still not a decided issue, even more than 20 years
after the mainstream availability of such hardware. Indeed, even early papers on parallel
real-time processing [1, 2, 3] did not appear until 2008-2010, while mature non-real-time
parallel processing systems such as Cilk [4] (1995), OpenMP [5] (1998), and MPI [6] (1994)
had been developed much earlier.

While moving away from established single-core and sequential processing approaches
introduces many open research challenges, it is also clear that parallel computing is now an
inevitability for real-time and cyber-physical systems. The majority of gains in processing
potential in recent years are from adding more processing cores to individual chips.
Multi-core chips with four, eight, twelve or even more processing cores are now commonplace,
and host machines with 16, 32, or more cores are affordable and represent a huge untapped
potential. Moreover, sequential processing gains have not kept pace with increases in
computational demand, especially for data-heavy and sensor-heavy cyber-physical systems that
increasingly seek to understand the physical world through on-board processing and
simulation. These technologies are probably here to stay: even if sequential speeds rebounded
dramatically, it seems unlikely at this point that hardware designers would consider giving
up on multi-core processors. Regardless of what the future may hold, parallel computing
today is a source of computational potential for cutting-edge, compute-heavy applications.

If the technical and regulatory concerns surrounding these disruptive technologies can be
addressed, the unique benefits that parallel processing affords can be leveraged. There are
many conceivable benefits, and in this work parallel real-time computing:

• Is used to enable earthquake engineers to perform laboratory evaluations at a fidelity
that would be infeasible with single-core processing.

• Allows a system to rapidly reallocate additional computational resources to prevent
imminent system failure.

• Improves the physical fidelity of an exemplar cyber-physical system.

Beyond those demonstrated benefits there are other potential uses for parallelism as well.
Parallel processing could be a way to provide scalable and energy-efficient on-demand
processing power to embedded applications with bursty computational loads. Multi-core
processors could be used to reduce latencies due to contention on shared resources, simply by
virtue of having multiple processing units capable of responding to events. Hard real-time
systems may find themselves with an extra margin of safety through replication of resources
and computations. These are all speculative directions for potential future work beyond
this dissertation, but they are plausible and demonstrate the potential for parallel real-time
computing to fundamentally change the real-time system designer's relationship with
computational supply and demand.

This work lies at the intersection of three fields of research: symmetric multiprocessor parallel
computing, real-time systems, and cyber-physical systems. The ultimate goal is the
development of an engineering methodology for incorporating parallel computing into soft-real-time
cyber-physical systems, and reconciling these fields is a nontrivial task. The classic
design criteria for each subfield are disjoint at best, and sometimes even antagonistic.
This work explores two concurrency platforms for
parallel real-time computing, RT-OpenMP and the Federated Scheduling Service, which take
dramatically different approaches to the problem of reconciliation. The former makes heavy
assumptions about computational workloads and implements an entirely novel scheduling
and runtime approach, while the latter makes very few assumptions and relies heavily on
existing parallel systems. Surprisingly and counter-intuitively, it is found that the second
approach is much more effective for the soft real-time applications explored in this work,
but only when such existing parallel systems are carefully configured to provide reasonable
behavior.

Traditionally, real-time systems prize predictability and reliability above all else. They are
an outgrowth of early avionics and spaceflight, where lives did (and still do) depend on
the correct and timely operation of such systems. The design process for these systems
is, roughly, to quantify the runtime behavior of individual computational workloads and
then to assemble them all into a single validated, analyzed, simulated, and exhaustively
tested task set on an approved set of hardware. Each computational task is classified by
its worst-case execution time, which is taken to be its largest execution timing out of many
observed tests. A pessimistic scheduling analysis is performed to provide a priori assurance
of computational success under all operating scenarios. Formal validation may be performed
in order to demonstrate that the system always responds correctly to physical stimulus and
with the correct timing response.
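
As a rough illustration of the measurement-based approach described above, the following Python sketch takes a task's worst-case execution time to be the largest of many observed timings. The names `measure_timings`, `observed_wcet`, and `example_workload`, as well as the trial count, are hypothetical and chosen only for illustration; this is a sketch of the general idea, not the tooling used in this work.

```python
import time

def measure_timings(task, trials=1000):
    """Run `task` repeatedly and return each observed execution time in seconds."""
    timings = []
    for _ in range(trials):
        start = time.perf_counter()
        task()
        timings.append(time.perf_counter() - start)
    return timings

def observed_wcet(timings):
    """Measurement-based WCET estimate: the largest timing actually observed.

    This can only report the worst timing seen during testing, so it may
    underestimate the true worst case of the task.
    """
    return max(timings)

def example_workload():
    # Hypothetical stand-in for the body of a real-time task.
    return sum(i * i for i in range(1000))

timings = measure_timings(example_workload, trials=200)
print(f"observed WCET: {observed_wcet(timings) * 1e6:.1f} microseconds")
```

Because the estimate depends entirely on which executions happened to be observed, a practical design would typically pair it with the pessimistic scheduling analysis mentioned above rather than rely on it alone.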

Parallel computing systems without real-time constraints are radically different.
High-performance systems are designed to execute large, bulk-parallel scientific or engineering
computations. The goal of improving parallel systems is to execute computations as fast as
possible, often measured in total computational throughput. Specific responses to specific
events are typically unimportant. Successes in such environments may be to reduce a
computational time from hours to minutes or from minutes to seconds. The benefit of increased
speed is either largely qualitative, or is tied to an external (e.g. economic) objective. For
example, training a machine learning model may require many computations that analyze
the same data over and over again. The training may require an hour or more on a
sequential processor, but parallelism may reduce this time interval to a bearable number of
minutes.

In contrast to both, cyber-physical systems seek to quantify and manage the interactions of
computational algorithms and physical components. These systems have existed for decades,
but generally at an ad-hoc level where systems are designed and built individually rather than
via an established methodology, and the thinking on how to design these systems continues
to evolve. As late as 2017 an NSF solicitation offered funding to conduct basic research in such
systems and claimed, "we do not yet have a mature science to support systems engineering of
high-confidence CPS" [7]. Parallel processing opens a new dimension to the design of such
systems, where interdependence within the system means that allocation of computational
resources directly impacts physical control performance and behavior uncertainty.

In leveraging parallel real-time computing for cyber-physical systems the goal of this work
is to maximize computational performance, subject to meeting soft real-time computational
constraints. The target timing performance in this work for parallel real-time execution is
roughly 1 kHz (1 millisecond periods) in order to provide high fidelity for the physical
components of Real-Time Hybrid Simulation.¹ However, unlike in traditional parallel platforms,
timing uncertainty must be managed, or at a minimum quantified and accounted for. System
overheads are relevant as they not only detract from the overall computational ability of the
platforms under consideration, but they also threaten to derail the accuracy of parallel
real-time scheduling analyses. As can be seen, the three individual topics of parallel computing,
real-time computing, and cyber-physical computing have disjoint primary and secondary
objectives, and the field of parallel real-time computing has a multitude of primary objectives
which all must be achieved simultaneously to provide correctness and good overall system
performance.

In addition to providing new techniques for parallel real-time systems more generally, this
work seeks specifically to produce soft-real-time systems that are suitable for use in a
structural engineering laboratory environment to conduct real-time processing within experiments
that are a minute or two in duration. The consequences for software failure in this particular
work are meaningful, but not safety-critical: failure in this case means wasted time, possibly
wasted materials, and potential damage to equipment. The physical apparatuses under
control are in a laboratory environment with minimal danger to operators. It is possible for
violation of timing constraints to damage physical experimental specimens or apparatus, but
the equipment at risk in this work was (relatively) inexpensive and this risk was managed
through extensive testing of software in simulation prior to hooking it up to real hardware.
Because the software could be carefully tested prior to runtime, in practice the goal of
building soft-real-time software in this work has meant that a particular experimental code can be
configured into a state where there are no timing constraints violated (no deadline misses)
over a trial execution period significantly longer than the expected experimental runtime.
For example, behavior in many cases was declared satisfactory after one hundred trial
executions with no deadline misses. However, the specific measure of robustness varies for the
different software systems presented, and this is discussed in more detail in their respective
chapters.

¹ In the application domain of structural engineering, for example, this rate is justified by
researchers who want to quantify the oscillatory/vibrational modes of a test specimen. A system
that senses at a rate of 1000Hz can accurately determine structural response between 0Hz and
500Hz, per the Nyquist frequency. Full-scale structures often have dominant vibrational modes
between 0-10Hz, but scale models and individual structural elements may have dominant modes
of several hundred hertz. A rate of 1000Hz allows the capture of higher modes for specimens,
and provides plenty of excess sampling to ensure all relevant data is captured when testing a
smaller specimen of unknown response.
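
The sampling-rate argument in footnote 1 can be checked with a few lines of arithmetic. The Python sketch below computes the period and Nyquist frequency for the 1000Hz target rate, and how far a given vibrational mode sits below the Nyquist limit. The function names and the 10Hz example mode are illustrative assumptions, not values or code taken from the systems described in this work.

```python
def period_ms(rate_hz):
    """Period in milliseconds of a task released at rate_hz."""
    return 1000.0 / rate_hz

def nyquist_frequency(sample_rate_hz):
    """Highest frequency (Hz) that can be resolved at this sampling rate."""
    return sample_rate_hz / 2.0

def oversampling_factor(sample_rate_hz, mode_hz):
    """How many times the sampling rate exceeds the minimum (2 * mode_hz)."""
    return sample_rate_hz / (2.0 * mode_hz)

rate_hz = 1000.0
print(period_ms(rate_hz))                  # 1.0 (millisecond period)
print(nyquist_frequency(rate_hz))          # 500.0 (Hz)
print(oversampling_factor(rate_hz, 10.0))  # 50.0 for a hypothetical 10Hz mode
```

Under these assumptions, a full-scale structure with a 10Hz dominant mode is sampled at fifty times the minimum rate, while a mode of several hundred hertz leaves far less margin, which is consistent with the justification for the 1000Hz target.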

More theoretically, there are many possible definitions of what it means for a system to have
soft-real-time behavior, in addition to the above definition that is used in this dissertation.
Such systems have been defined to have bounded tardiness, to provide a low probability of
per-period failure, or to minimize utility loss subject to an overall system utility function.
In fact, the parallel real-time scheduling theories that are used in this work provide strong
sufficient conditions for schedulability: if the scheduling theory assumptions are met then
the theory makes a strong guarantee of system performance suitable for implementation in a
hard-real-time system. However, neither the operating system (Linux with the RT_PREEMPT
patchset applied) nor the parallel platform (OpenMP) used in this work is real-time
software, so no claim to building a hard-real-time system is ever made in this work.
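
To make the alternative soft-real-time definitions above concrete, the following Python sketch computes three of the quantities they refer to from recorded per-period response times: the number of deadline misses, the maximum tardiness, and the per-period miss ratio. It also encodes the acceptance criterion used informally in this chapter (no deadline misses in any trial). All function names and the sample data are hypothetical; this is a sketch under those assumptions rather than the evaluation code used in this work.

```python
def deadline_misses(response_times_ms, deadline_ms):
    """Count the periods whose response time exceeded the deadline."""
    return sum(1 for r in response_times_ms if r > deadline_ms)

def max_tardiness(response_times_ms, deadline_ms):
    """Largest amount (ms) by which any period finished past its deadline."""
    return max((max(0.0, r - deadline_ms) for r in response_times_ms), default=0.0)

def miss_ratio(response_times_ms, deadline_ms):
    """Fraction of periods that missed the deadline."""
    return deadline_misses(response_times_ms, deadline_ms) / len(response_times_ms)

def acceptable(trials, deadline_ms):
    """True only if every trial completed with zero deadline misses."""
    return all(deadline_misses(t, deadline_ms) == 0 for t in trials)

# Hypothetical per-period response times (ms) for two short trials.
trial_ok = [0.71, 0.80, 0.95]
trial_bad = [0.70, 1.25, 0.90]
deadline_ms = 1.0

print(deadline_misses(trial_bad, deadline_ms))         # 1
print(max_tardiness(trial_bad, deadline_ms))           # 0.25
print(acceptable([trial_ok], deadline_ms))             # True
print(acceptable([trial_ok, trial_bad], deadline_ms))  # False
```

The zero-miss criterion is the strictest of these measures; a bounded-tardiness or miss-ratio definition would accept the second trial if its thresholds were set loosely enough.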

This dissertation continues in Chapters 2 and 3 with an examination of two parallel real-time
concurrency platforms. The first, RT-OpenMP, adopts a highly regimented design to provide
fine-grained control over executable pieces of parallel real-time workloads. The latter, the
Mixed-Criticality Federated Scheduling Service, instead makes only a few light assumptions
about how parallel real-time computations will execute, and then hands off the task of
execution to existing parallel concurrency platforms. Chapter 4 introduces CyberMech, a full
concurrency platform for Real-Time Hybrid Simulation based on the Federated Scheduling
Service. Chapter 5 draws broader conclusions about the process of engineering cyber-physical
systems in the context of parallel real-time execution. Finally, Chapter 6 provides some
background and related work in the general field of parallel computing, real-time computing,
and cyber-physical systems, and the dissertation concludes in Chapter 7.

Chapter 2: RT-OpenMP

This chapter describes the first platform ever implemented for the execution of parallel
real-time tasks that provides scheduling with respect to a theoretical schedulability bound. This
platform is a scheduling service called RT-OpenMP [8], a parallel real-time concurrency
platform that supports real-time semantics, performs scheduling of parallel tasks [9], and is
based on theoretical results [10] in parallel real-time scheduling. This system was designed
to provide a true parallel programming interface via modification of the OpenMP [5]
implementation in the GNU Compiler Collection (GCC), but for soft real-time workloads was
subsequently superseded by the Federated Scheduling Service (FSS) described in Chapter 3.

The RT-OpenMP implementation explores a radically different design space for parallel
real-time systems when compared to the Federated Scheduling Service. To contrast these
approaches at a high level, RT-OpenMP tightly controls how and when processes execute
by controlling execution of code scopes explicitly, while the Federated Scheduling Service
offloads the responsibility of thread creation and management to existing parallel concurrency
platforms such as OpenMP or Cilk Plus. As such, RT-OpenMP offers a much more tightly
orchestrated architecture and implementation, and for this reason it can be seen as a potential
model for future work in hard real-time parallel systems. Because such an implementation would
be significantly more complex and the types of parallel programs it could execute may be
limited compared to the unrestricted execution (of programs described by arbitrary directed
acyclic graphs) supported by the Federated Scheduling Service, such further investigation is
left for future work.

This chapter presents the following contributions:

• The design and implementation of a scheduler and runtime dispatcher capable of scheduling and executing a collection of parallel real-time tasks that conform to the parallel synchronous task model introduced in [10].

• System evaluation with a set of synthetic workloads to measure the performance of the platform under various partitioned scheduling strategies and utilizations. This shows that the platform provides good performance for a significant class of potential workloads.

The author is responsible for the design and implementation of the online job dispatcher and
system evaluation. The theoretical analysis of this system was performed in [10], and the
offline scheduler itself was implemented by Jing Li (second author on [8], where this system
was originally presented).

Section 2.0.1 provides background information about OpenMP. Section 2.0.2 describes the parallel synchronous task model. Section 2.0.3 describes the design of the scheduler and dispatcher. Section 2.1 presents an empirical evaluation of the scheduling service through full system tests with synthetic parallel tasks, as well as micro-benchmarks.

2.0.1 Overview of OpenMP

OpenMP is an Application Programming Interface (API) specification that defines a standardized model for parallel programming on shared-memory multiprocessors. The specification is governed by the OpenMP Architecture Review Board, which is primarily composed of representatives from companies engaged in the production of hardware and software used in parallel and high-performance computing. The OpenMP API [5] is defined for the languages C, C++, and FORTRAN, and has been implemented on many different architectures and major compilers. Importantly, for the purposes of this work, there exists an open source version within the GNU Compiler Collection (GCC).

OpenMP provides programming support through library routines and directives. Library routines include auxiliary functions that allow a program to query and modify OpenMP runtime parameters (such as the number of threads or thread scheduling policy), as well as locking and timing routines. OpenMP directives are compiler pragma statements that indicate where and how parallelization can occur within a program. For example, one such directive converts a regular for loop to a parallel-for loop, by prefacing the loop with #pragma omp parallel for.

However, the unmodified OpenMP implementation does not support real-time execution. First, the specification lacks any notion of real-time deadline and period semantics. More fundamentally, current OpenMP platforms, and particularly their schedulers, are ill-suited for real-time performance. When invoking a parallel directive in OpenMP there is no expectation of how, where, and when parallel execution will take place. These directives merely point out the available program parallelism, and the compiler and runtime system make very few guarantees about how the program actually executes. For general parallel execution this may be desirable, as it allows the system to load-balance flexibly and effectively, allowing OpenMP to run correctly on a sequential machine or on different parallel machines with varying numbers of processors. Unfortunately, such flexibility is not good for real-time computing, where correctness is also a function of execution time. Thus, it is necessary for a real-time concurrency platform to provide the stronger assurance that the feasible deadlines of a given parallel workload are met on a specific execution target. It is also necessary that the runtime system supports such assurances through robust deadline-aware scheduling and dispatching.

Figure 2.1: An example parallel-synchronous task with four segments.

2.0.2 Parallel Synchronous Task Model

The RT-OpenMP platform focuses on synchronous tasks: tasks described by a sequence of segments where each segment consists of one or more parallel strands¹ of execution, as shown in Figure 2.1. The end of each segment serves as a synchronization point: no strand from the next segment may begin executing before all strands from the current segment have completed. The deadline of a synchronous task is the time by which all strands of the last segment must finish executing.

This is not the most general model for describing parallel programs, but we use this model for two reasons. First, the work in [10] allows RT-OpenMP to provide schedulability assurance. Second, the high-level parallel for programming construct naturally maps to this particular task model. This construct is of primary importance for many parallel programs as it nicely captures the SIMD (single instruction multiple data) paradigm that is widely used.

¹ The nomenclature used here is purposely different from the preferred term in [10], threads. In the context of the RT-OpenMP scheduling service it reduces confusion to use the term strands to refer to fundamental units of executable code and reserve the term threads for the operating system's persistent threads that are responsible for executing those units.

To describe the model more formally, a task set $\tau$ consists of $n$ parallel synchronous tasks $\{\tau_1, \tau_2, \cdots, \tau_n\}$. Each task $\tau_i$ is a sequence of $k_i$ segments, and a segment may not execute until the previous segment is entirely finished. The $j$th segment in the $i$th task is denoted $\langle e_{i,j}, m_{i,j} \rangle$, where $e_{i,j}$ is the worst-case execution requirement of each strand in segment $j$ and $m_{i,j}$ is the total number of strands. Since there is a synchronization point at the end of each segment, a task $\tau_i$ can be alternately described as the sequence $(\langle e_{i,1}, m_{i,1} \rangle, \langle e_{i,2}, m_{i,2} \rangle, \cdots, \langle e_{i,k_i}, m_{i,k_i} \rangle)$, where $k_i$ is the number of segments of task $i$. We assume periodic (or sporadic) implicit deadline tasks with the deadline $D_i$ of each task equal to its arrival plus its period $T_i$ (minimum inter-arrival interval). Later, for the purpose of scheduling, we will refer to the release time $r_{i,j}$ and deadline $d_{i,j}$ of a strand, which respectively are the time from which a strand may start and the time by which it must finish execution in order to assure that the overall task deadline $D_i$ is met.

Other Definitions: Intrinsic to each task are several quantities of practical importance. The worst case execution time of a task $\tau_i$ on a single processor, called its work, is denoted by $C_i$. The task's execution time on an infinite number of cores, called its critical-path length, is denoted by $P_i$. By definition, the worst case execution time is

$$C_i = \sum_{j=1}^{k_i} m_{i,j} \cdot e_{i,j}$$

and the critical path length is

$$P_i = \sum_{j=1}^{k_i} e_{i,j}.$$

Intuitively, the work is the total amount of computation in a task (all strands from all segments), while the critical-path is the longest chain of sequential computation (the longest strand from each segment). The utilization $U_i = C_i / T_i$ of a task $\tau_i$ is the ratio of total work to the task period, while the utilization of a task set is simply the sum of the utilization of each task in the set. Note that, unlike sequential tasks, it is possible for a parallel task to have utilization greater than 1.
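The definitions of work, critical-path length, and utilization can be made concrete with a short Python sketch. The class name `SyncTask` and the example task parameters are hypothetical and chosen only for illustration; they do not come from the RT-OpenMP implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SyncTask:
    """A parallel synchronous task: a list of (e, m) segments plus a period T."""
    segments: List[Tuple[float, int]]  # (strand execution time e_ij, strand count m_ij)
    period: float                      # T_i, also the implicit relative deadline

    def work(self) -> float:
        # C_i: every strand of every segment, as if run on a single processor
        return sum(e * m for e, m in self.segments)

    def critical_path(self) -> float:
        # P_i: one strand per segment, as if run on infinitely many processors
        return sum(e for e, _ in self.segments)

    def utilization(self) -> float:
        # U_i = C_i / T_i; this may exceed 1 for a parallel task
        return self.work() / self.period


# Hypothetical task: three segments with 2, 8 and 1 strands, period 10.
task = SyncTask(segments=[(1.0, 2), (2.0, 8), (1.0, 1)], period=10.0)
print(task.work(), task.critical_path(), task.utilization())
```

In this assumed example the utilization is greater than 1 even though the critical path is well below the period, which is the situation the last sentence above describes.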

The augmentation bound is a property of a scheduling algorithm which provides a schedulability test. In our case where all processors execute at the same speed (as in many common multiprocessor systems), an augmentation bound of $b$ implies that under the given scheduling algorithm a system with $p$ processors can execute any task set with total utilization equal to or less than $p/b$. This is a sufficient but not necessary test, meaning that task sets with greater utilization may still be schedulable.

2.0.3 RT-OpenMP Scheduling Service Design

The role of our scheduling service for RT-OpenMP is to schedule parallel-synchronous applications while providing real-time assurances to application developers. There are two objectives. First, RT-OpenMP must ensure that dependences between segments are respected.
Second, it must execute tasks so that they meet their deadlines.

RT-OpenMP uses two sub-systems to enforce this behavior, a scheduler and a dispatcher. The scheduler decomposes and annotates tasks prior to execution time, and the dispatcher uses that information to dispatch strands of execution at runtime. In the current system, the scheduling phase occurs before execution begins and the dispatching phase occurs at runtime.

Scheduler: The scheduler consists of two components: a decomposition algorithm (from [10]) that decomposes a parallel task into a set of sequential strands, each with its own release time and deadline; and a priority assignment and partitioning algorithm (from [9]) that sets priorities for each sequential strand and assigns each of them to a particular core given a $p$-core processor. The theoretical result from [10] provides an augmentation bound of 5 for this method. This yields the following schedulability test: if the total utilization of a task set is less than $p/5$ (20% of the maximum utilization allowed) and the critical path length $P_i$ of each task is at most 1/5th of its deadline $T_i$, then the theory guarantees this task set as schedulable.

Figure 2.2: An example decomposition and scheduling of two tasks under RT-OpenMP. (a) Task set consists of two tasks. (b) The decomposed tasks on 3 processors. Each strand has its own release time and deadline. (c) Priority Assignment and Partitioning. (d) Execution trace of the strands.

Panel (c), Priority Assignment and Partitioning:

strand    deadline    priority    core
s11       10/3        2           1
s21       40/9        3           1
s31       40/9        3           2
s41       40/9        3           3
s51       40/9        3           1
s61       20/9        1           1
s12       8           4           2
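The schedulability test implied by the augmentation bound can be written down in a few lines. The following Python sketch is an illustration only: the function name `passes_bound` and the example task sets are assumptions, and the comparison uses "at most" for both conditions, following the general statement of the bound given earlier.

```python
from typing import List, Tuple

# A task is (segments, period); segments is a list of (e, m) pairs.
Task = Tuple[List[Tuple[float, int]], float]


def passes_bound(tasks: List[Task], p: int, b: float = 5.0) -> bool:
    """Sufficient (not necessary) test implied by an augmentation bound b."""
    total_util = 0.0
    for segments, period in tasks:
        work = sum(e * m for e, m in segments)   # C_i
        crit = sum(e for e, _ in segments)       # P_i
        if crit > period / b:                    # critical path at most 1/b of the deadline
            return False
        total_util += work / period              # U_i = C_i / T_i
    return total_util <= p / b                   # total utilization at most p/b


# Hypothetical task set with total utilization 0.6.
tasks = [([(1.0, 4), (0.5, 2)], 10.0), ([(2.0, 1)], 20.0)]
print(passes_bound(tasks, p=4), passes_bound(tasks, p=2))
```

A task set that fails this check is not necessarily unschedulable; the check only identifies task sets covered by the theoretical guarantee.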

Decomposition Algorithm: The decomposition algorithm from [10] performs the following operations. First, each task in the task set is decomposed into a set of independent strands, where each strand has its own individual release time and deadline. These strands are analyzed collectively, and the total computational time is divided in a way that provides enough capacity for each. This assignment is reflected in an intermediate release time and deadline for each individual strand. Release times are also chosen to satisfy dependences between segments (the actual mechanism used to enforce this in the system is barrier synchronization, but the dependency timing constraints are used to ensure the correctness of the decomposition).

For this to work, we must perform the following adaptation. The above decomposition provides an augmentation bound of 5: this means that if an ideal scheduler can schedule a task set on $m$ cores of speed 1, then a decomposed task set can be scheduled on $m$ cores of speed 5. Due to the derivation in [10], it is important that the parameters $T_i$, $C_i$ and $P_i$ are measured on an ideal unit-speed machine and the decomposition is done at speed 2. In this formulation both the speed-1 and speed-2 machines are hypothetical, while the task set actually runs on what are considered speed-5 processors. Therefore, we must compute the decomposition for machines that are 2.5 times slower than our machines, since decomposition occurs at speed 2. Hence, this gives rise to a constant value of 2.5 in the following equations. In practice we measure $C_i$, $T_i$, and $P_i$ on real machines and then multiply those quantities by 2.5 to simulate a 2-speed machine. The following process of decomposition is otherwise exactly the same as in [10], but is modified to reflect the inversion we have described.

The decomposition works as follows. The total slack of a task is the extra time it has to finish computation, if it was given an infinite number of processors as soon as it was released. Then the slack on the (hypothetical) speed-2 processors is denoted as

$$L_i = T_i - 2.5 P_i$$

The idea behind decomposition is to divide this total slack among all the segments equitably.

For this purpose, we classify segments into heavy and light segments. Intuitively, heavy segments are those with many strands, and therefore, a larger computational requirement. The classification is based on a threshold: a segment is classified as heavy if the number of strands in the segment is more than the total computational requirement divided by the slack (on the 2-speed processor), that is,

$$m_{i,j} > \frac{2.5 C_i}{T_i - 2.5 P_i}$$

and otherwise it is a light segment. The total slack is distributed among heavy segments, giving them more time to finish and therefore reducing the maximum workload density of the task.

If there are any heavy segments in a task, we compute their segment slack as

$$l^h_{i,j} = \frac{m_{i,j} (T_i - 2.5 P_i^\ell)}{2.5 (C_i - C_i^\ell)} - 1$$

where $P_i^\ell$ is the portion of the critical path contributed by light segments and $C_i^\ell$ is the portion of the worst case execution time contributed by light segments. In the case where heavy segments exist, no slack will be given to light segments: that is, $l^\ell_{i,j} = 0$. The relative deadline for all strands in a segment is

$$d_{i,j} = 2.5 e_{i,j} (1 + l_{i,j})$$

Note that even when light segments are not given any slack on the (hypothetical) speed-2
processors, when we run them on the (real) speed-5 processors, they do have slack.

If all segments are light, each segment will receive an equal portion of the slack and the relative deadline of all strands in a segment is

$$d_{i,j} = \frac{e_{i,j} T_i}{P_i}$$

In either case, the release time of each strand is the deadline of the preceding segment.

We now illustrate the action of the scheduler with the example in Figure 2.2. The example task set has two synchronous tasks, whose parameters are shown in Figure 2.2a, which are executed on a machine with three cores. In task 1, all segments are classified as heavy segments, since each $m_{1,j}$ is larger than the threshold of 0.643. Thus, segments 1, 2 and 3 get extra slack of 1.22, 7.88 and 1.22 respectively and hence have relative deadlines of 10/3, 40/9 and 20/9 respectively. Similarly, the only segment of task 2 is a heavy segment and gets all the slack (a factor of $1 + l_{2,1} = 3.2$). So the deadline for segment 1 is $3.2 \cdot 2.5 \cdot 1 = 8$, the same as task 2's deadline. Figure 2.2b shows the decomposed sequential strands with individual release times and deadlines.
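The decomposition rules can be checked numerically with a short Python sketch. The function name `decompose` is hypothetical. The task parameters used below are assumptions: the parameter table of Figure 2.2a is not reproduced in the text, so the values were chosen to be consistent with the threshold, slack values, and deadlines quoted in the example.

```python
from typing import List, Tuple

SPEED = 2.5  # scaling constant from the text (real cores vs. hypothetical speed-2 cores)


def decompose(segments: List[Tuple[float, int]], period: float) -> List[Tuple[float, float]]:
    """Return (relative release time, relative deadline) for each segment."""
    work = sum(e * m for e, m in segments)      # C_i
    crit = sum(e for e, _ in segments)          # P_i
    if period <= SPEED * crit:
        raise ValueError("no slack: the critical path is too long for this period")
    threshold = SPEED * work / (period - SPEED * crit)
    heavy = [m > threshold for _, m in segments]

    if any(heavy):
        # Light segments get no slack; heavy segments share all of it.
        light_work = sum(e * m for (e, m), h in zip(segments, heavy) if not h)
        light_crit = sum(e for (e, _), h in zip(segments, heavy) if not h)
        deadlines = []
        for (e, m), h in zip(segments, heavy):
            slack = 0.0
            if h:
                slack = m * (period - SPEED * light_crit) / (SPEED * (work - light_work)) - 1.0
            deadlines.append(SPEED * e * (1.0 + slack))
    else:
        # All segments are light: deadlines are proportional to execution time.
        deadlines = [e * period / crit for e, _ in segments]

    # Each segment is released at the deadline of the preceding segment.
    result, release = [], 0.0
    for d in deadlines:
        result.append((release, d))
        release += d
    return result


# Assumed parameters consistent with the example: task 1 has period 10 and
# segments with 1, 4 and 1 strands; task 2 has period 8 and a single strand.
task1 = decompose([(0.6, 1), (0.2, 4), (0.4, 1)], period=10.0)
task2 = decompose([(1.0, 1)], period=8.0)
print(task1, task2)
```

Under these assumed inputs the relative deadlines of task 1 come out at approximately 3.33, 4.44 and 2.22 (that is, 10/3, 40/9 and 20/9), and the single segment of task 2 receives a relative deadline of 8, matching the values in the example.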

Partitioning and Priority Assignment: As indicated by [10], we use FBB-FFD [9] (Fisher Baruah Baker First-Fit Decreasing bin packing) to assign strands to cores.

First, strands are sorted according to their relative deadlines. Since we are using a segment-level fixed-priority scheduler, the segments with the smallest relative deadlines have the highest priority. Note that all strands in a segment have the same priority and the same relative release time, though strands from the same segment may be placed on different cores. As shown in Figure 2.2c, the priorities are assigned according to each strand's relative deadline, where priority 1 is the highest priority.

Starting with the highest priority strands, the scheduler then tries to place each strand on a core. To do so, the FBB-FFD [9] algorithm defines a request-bound function (RBF) as:

$$RBF(\tau_i, t) = e_i + u_i t$$

The RBF is the maximum amount of computation required by task $i$ over time of length $t$ on the system. The above original RBF is tight for sequential tasks and represents the upper bound of the computational requirement. However, for decomposed parallel tasks, strands from different segments of the same task will never be released and executed simultaneously. Hence when calculating the total RBF of a task, directly summing the RBF of every strand would be pessimistic.

The offset-aware FBB-FFD algorithm replaces the original RBF with $RBF^*$, which takes release offsets into account. It calculates all possible interference from other strands of other tasks, as well as strands from the same segment on a given core. When segment $\tau_{j,l}$ is the first interfering segment, the interference of task $\tau_j$ on segment $\tau_{i,k}$ with relative deadline $d_{i,k}$ is defined as

$$RBF^*_q(\tau_{j,l}, d_{i,k}) = \sum_{(r_{j,p} + T_j - r_{j,l}) \bmod T_j \le d_{i,k}} e_{j,p} \cdot m_{j,p,q} + \sum_{j,p} u_{j,p} \cdot m_{j,p,q} \cdot d_{i,k}.$$

This offset-aware RBF is different in the first term by only summing the interference from strands that can be released within the deadline $d_{i,k}$, considering the start segment $\tau_{j,l}$ and all the offsets of subsequent segments.

Then the maximum interference of task $\tau_j$ is

$$RBF^*_q(\tau_j^{decom}, d_{i,k}) = \max \{ RBF^*_q(\tau_{j,l}, d_{i,k}) \mid 1 \le l \le k_i \}$$

A more detailed explanation can be found in [11].

If the $RBF^*$ of a strand on a given core $q$ satisfies the condition that

$$d_{i,k} - \sum_j RBF^*_q(\tau_j^{decom}, d_{i,k}) \ge e_{i,k}$$

(load condition), then the strand can be assigned to this core. In this case, the strand is guaranteed not to miss its deadline.

For each strand, there may be more than one core that satisfies the load condition. Therefore, there is a choice of assignment algorithms that can be used to place strands on cores. This choice results in different scheduling strategies and potentially different execution results. One contribution of this work is to evaluate the following two heuristics.

A first-fit heuristic will scan cores in some canonical order (from core 1 to core $p$) and place the strand on the first core that satisfies the load condition. This is the standard method for FBB-FFD bin-packing algorithms. However, it does not provide any load-balancing, and the first few cores may become heavily loaded, while the last cores may be entirely unused.

A worst-fit heuristic on the other hand, will scan all the cores and find the core with the least $RBF^*$ value (the least loaded core) and will assign the strand to that core. In principle, the worst-fit heuristic should exhibit better load balancing than the first-fit heuristic by spreading computational work across as many cores as possible. However, it takes longer to run the scheduler since each assignment step must scan all cores. An example assignment using worst-fit assignment is shown in Figure 2.2c. Note that if using first-fit, all strands in this example would have been assigned to core 1, simply because the sum of the worst case execution times of both tasks is much smaller than either task's period.
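The difference between the two heuristics can be illustrated with a simplified Python sketch. This is not the offset-aware algorithm used by RT-OpenMP: as a stated simplification it uses the original request-bound function $e + u t$, treats every strand as an independent sporadic task with utilization $e/T$, and uses hypothetical names (`partition`, `first_fit`, `worst_fit`) and example strands.

```python
from typing import Callable, List, Optional, Tuple

# A strand is (e, d, T): execution time, relative deadline, period of its task.
Strand = Tuple[float, float, float]
Cores = List[List[Strand]]


def rbf(strand: Strand, t: float) -> float:
    e, _, period = strand
    return e + (e / period) * t             # RBF(tau, t) = e + u * t


def load(core: List[Strand], t: float) -> float:
    return sum(rbf(s, t) for s in core)     # total demand already placed on this core


def fits(core: List[Strand], strand: Strand) -> bool:
    e, d, _ = strand
    return d - load(core, d) >= e           # load condition


def first_fit(cores: Cores, strand: Strand) -> Optional[int]:
    # Scan cores in canonical order and take the first one that fits.
    return next((q for q, c in enumerate(cores) if fits(c, strand)), None)


def worst_fit(cores: Cores, strand: Strand) -> Optional[int]:
    # Among the cores that fit, take the least loaded one.
    ok = [q for q, c in enumerate(cores) if fits(c, strand)]
    return min(ok, key=lambda q: load(cores[q], strand[1])) if ok else None


def partition(strands: List[Strand], p: int,
              choose: Callable[[Cores, Strand], Optional[int]]) -> Cores:
    cores: Cores = [[] for _ in range(p)]
    for s in sorted(strands, key=lambda s: s[1]):   # smallest relative deadline first
        q = choose(cores, s)
        if q is None:
            raise ValueError("strand does not fit on any core")
        cores[q].append(s)
    return cores


# Four identical hypothetical strands on two cores.
strands = [(1.0, 10.0, 20.0)] * 4
print([len(c) for c in partition(strands, 2, first_fit)])
print([len(c) for c in partition(strands, 2, worst_fit)])
```

With these lightly loaded example strands, first-fit packs everything onto the first core while worst-fit alternates between the two cores, mirroring the behavior described for the example of Figure 2.2c.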

Dispatcher: The dispatcher is responsible for enforcing previously generated schedules and providing synchronization at the end of each segment during runtime. This requires the dispatcher to support scheduling priorities, runtime preemption, and synchronization, which we accomplish through facilities provided by Linux. We use real-time priorities to enforce the schedule and enable task-level preemption. We accomplish segment (barrier) synchronization through futex (fast user-space mutex) system calls.

Recall from the previous section: prior to runtime, the scheduler decomposes a task set into individual strands and encodes this in a static schedule. Thus, when the dispatcher begins, it has the program structure of each task, and each strand is annotated with a processor assignment, priority, and relative release time. An example of such an assignment table is shown in Figure 2.2c. In order to enforce that schedule, the dispatcher must be able to run strands on cores to which they are assigned at the proper release time, and if a high-priority strand is released while a low-priority strand is running on its assigned core, then the high-priority strand must preempt the low-priority strand.

Figure 2.3: An RT-OpenMP configuration consisting of two tasks and three processors. Each task has a team of threads that are created at runtime, and consists of exactly one thread per processor. The black dashed line represents the division between compile time and runtime: the schedule is generated prior to runtime, but the dispatcher must refer to the generated schedule frequently during execution.

We describe the operation of the dispatcher in two phases: initialization and runtime operation.

Initialization: The ability to assign specific strands to processors is accomplished by creating a team of threads for each task in the system. Each team has exactly as many threads as available cores. Therefore, if there are $n$ tasks and $p$ processors, there are a total of $n \cdot p$ threads in the system: one thread from each team pinned to each core. The numbers of tasks and processors are known a priori, so all teams are created and pinned during initialization. This simplifies the dynamic operation of the system, as once these threads are pinned during initialization they never again migrate. In the system shown in Figure 2.3, there are two tasks and three processors. Each task has a dedicated team of threads, and the team is distributed so that every task has one thread per processor.

Runtime Operation: A thread from task $i$'s team, pinned on processor $j$ has the following function: it executes all the strands from task $i$ that are assigned to processor $j$ (and only those strands). Therefore, given a strand-to-core assignment computed by the scheduler, we have an automatic strand to thread assignment; once a strand is assigned to a processor, there is one corresponding thread that is responsible for executing it. This can be seen in the example in Figure 2.2d.

The execution of the dispatching system occurs in a completely distributed manner. Each
thread is individually responsible for finding the right work to do, and doing it at the right
time. This distributed approach has the advantage that all cores do useful processing, rather
than having a core for dedicated dispatching, and this avoids overhead due to centralized
coordination. The pseudo-code for how dispatching occurs is given in Algorithm 1.

Algorithm 1 realizes a team of threads synchronously stepping through a task segment by segment, and is executed by all threads at runtime. At the start of each segment each thread looks at the schedule to determine whether any strands from the current segment are assigned to it. Note that more than one strand from a segment may be assigned to the same thread. If there are any, the thread performs the work of those strands. Once it finishes this work (or if it has no strands assigned to it) it skips to the barrier (line 11) and waits for the rest of its team to finish the segment. Threads wait at the barrier until all threads in the team have reached it.
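The control flow of Algorithm 1 for a single task iteration can be sketched in Python as follows. This is a simulation under stated assumptions rather than the RT-OpenMP dispatcher: Python's `threading.Barrier` stands in for the team barrier, priority changes and release-time waits appear only as comments, the function name `run_team` is hypothetical, and the grouping of task 1's strands into segments is inferred from the deadlines in Figure 2.2c.

```python
import threading
from typing import Dict, List, Tuple

# Hypothetical static schedule for ONE task:
# schedule[segment][core] -> names of the strands assigned to that core.
Schedule = List[Dict[int, List[str]]]


def run_team(schedule: Schedule, p: int) -> List[Tuple[int, int, str]]:
    """Run one task iteration with one thread per core; return the execution log."""
    barrier = threading.Barrier(p)        # stand-in for the team barrier
    log: List[Tuple[int, int, str]] = []
    log_lock = threading.Lock()

    def team_thread(core: int) -> None:
        # Real system: raise to the maximum real-time priority here.
        for seg_index, segment in enumerate(schedule):
            # Real system: wait until the segment's relative release time.
            for strand in segment.get(core, []):
                # Real system: lower to the segment priority, run the strand,
                # then raise back to the maximum priority.
                with log_lock:
                    log.append((seg_index, core, strand))
            barrier.wait()                 # no thread starts the next segment early

    threads = [threading.Thread(target=team_thread, args=(c,)) for c in range(p)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


# Task 1 from Figure 2.2c, with cores renumbered from 0 (segment grouping assumed).
schedule = [
    {0: ["s11"]},
    {0: ["s21", "s51"], 1: ["s31"], 2: ["s41"]},
    {0: ["s61"]},
]
print(run_team(schedule, p=3))
```

Because every thread reaches the barrier at the end of each segment, whether or not it had work to do, all strands of one segment are logged before any strand of the next segment.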

Algorithm 1 Distributed Dispatching
 1: Raise priority to system maximum
 2: while new_task_iteration do
 3:     while more_segments_remain do
 4:         Wait until segment relative release
 5:         Check whether any strands from this segment are assigned
 6:         while has_assigned_strands do
 7:             Lower priority to segment priority
 8:             Perform the work of the strand
 9:             Raise priority to system maximum
10:         end while
11:         Team barrier synchronization
12:     end while
13:     Sleep until next task iteration
14: end while

There is one additional issue: each thread is responsible for dispatching itself, but each thread is running concurrently with many other threads, some of which are executing real-time workloads. Two dangers thus arise: dispatching actions done at a high priority may interfere with currently executing jobs, while dispatching actions done at a low priority could lead to a situation in which a moderately high priority job blocks the dispatch of a higher priority job (which results in priority inversion).

We address this by reserving the maximum real-time priority for dispatching. The choice to have threads spend time frequently at the highest system priority might seem counterintuitive, as this means that all dispatching actions will always block the real-time execution of any currently working thread (even if those dispatching actions are for a lower priority thread). However, each thread is only ever performing one of three actions while dispatching: checking for work, modifying its own priority, or barrier waiting. These three actions can be made brief enough that they do not significantly disrupt the operation of other threads in the system. In essence, we have traded long and unpredictable priority inversion (a long-running low priority thread blocking the dispatch of a high priority thread) for very brief and very predictable priority inversion (the brief but frequent dispatching actions of every thread).

Preemption: Note that preemption occurs correctly and automatically in this design. Each thread executes a strand assigned to it at the priority of the strand itself. Therefore, when a high-priority strand is released on processor $p$, the thread that is responsible for that strand inherits this high priority. If another thread is executing a lower-priority task on processor $p$, that thread has the lower priority. As a result, the high-priority thread will preempt the lower priority thread through the normal Linux thread scheduling mechanism. As can be seen in Figure 2.2d, strand s31 has higher priority than strand s12. Therefore, when s31 is released, the thread responsible for it immediately preempts the thread executing s12. When s31 completes, s12 resumes its execution.

Synchronization Mechanism: Finally, the dispatcher must ensure that no thread executing a parallel-synchronous task can race ahead and begin executing a future segment before its predecessors have finished. We ensure this in our system through barrier synchronization, and we now describe how this barrier is implemented efficiently.

Recall that barriers have two operations (wait and wake) that allow a team of threads to synchronize with one another. When a thread reaches a barrier it waits there until all other threads arrive. Once this happens all threads are awoken and allowed to proceed. This is the precise behavior desired for segment synchronization, and prevents any one thread from racing ahead and starting on a new segment while other threads are still working on the previous one.

In this system the wait and wake operations are performed at the system's maximum real-time priority (to address the same priority inversion problem as with dispatching). This necessitates a barrier implementation that is as non-interfering as possible. To achieve this we use the futex (fast userspace mutex) system call within Linux.

Futexes are single atomic counters in shared memory, used to support efficient mutual exclusion. Two futex operations allow the kernel to arbitrate between processes that are contending on the same futex: futex_wait and futex_wake. When a thread waits on a futex it yields the processor and is put to sleep by the kernel. Later, some other thread wakes the futex, which revives some or all of the threads that were previously waiting.

futex_wait is used to implement the barrier wait, and futex_wake to implement the barrier wake. This is especially advantageous for our system: in this design the threads spend almost all their time either working or barrier waiting at the maximum system priority. Many waiting threads at such high priority might contribute significant overhead. However, with the futex implementation the kernel is invoked to put the threads to sleep, and they consume no processor time in this wait state. This allows the system to have many threads idling at the highest real-time priority without incurring substantial overhead.
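The structure of a barrier built from a shared counter plus wait and wake operations can be sketched in Python. This is an analogy under stated assumptions, not the RT-OpenMP barrier: `threading.Condition` stands in for the futex, with `Condition.wait` playing the role of futex_wait and `Condition.notify_all` the role of futex_wake, and the class name `TeamBarrier` is hypothetical.

```python
import threading


class TeamBarrier:
    """Reusable barrier: the last thread to arrive wakes all sleeping waiters."""

    def __init__(self, team_size: int) -> None:
        self.team_size = team_size
        self.arrived = 0
        self.generation = 0                   # distinguishes successive segments
        self.cond = threading.Condition()

    def wait(self) -> None:
        with self.cond:
            gen = self.generation
            self.arrived += 1
            if self.arrived == self.team_size:
                # Last arrival: reset for the next segment and wake everyone.
                self.arrived = 0
                self.generation += 1
                self.cond.notify_all()        # plays the role of futex_wake
            else:
                # Sleep until the generation changes; the loop guards against
                # spurious wakeups.
                while gen == self.generation:
                    self.cond.wait()          # plays the role of futex_wait
```

The generation counter lets the same barrier object be reused at the end of every segment: a thread that wakes late still sees that its own generation has passed, even if faster threads have already begun arriving for the next one.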

2.1 RT-OpenMP Evaluation

This section presents an experimental evaluation of RT-OpenMP with two types of experiments: (A) full-system evaluation using synthetic parallel tasks to see if the scheduling service
meets task deadlines, and (B) micro-benchmarks in order to understand the overheads of
the mechanisms used by the system.

System Evaluation: Synthetic Parallel Tasks:
One of the goals of this evaluation is to determine whether system behavior agrees with the theoretical augmentation bound of 5. Even though that bound holds in theory, the overheads of a practical implementation might invalidate it on a real system. Theoretically schedulable task sets were generated to test this, meaning that (1) utilization by each task set is at most $m/5$ (20% of maximum allowed utilization), and (2) each task's critical path length is at most 1/5 its deadline (again, 20% of maximum). These will be called 20% utilization tests for brevity.

To evaluate the practical applicability of RT-OpenMP more broadly, several parameters were
considered, as follows.


Utilization Level: The theoretical results hold only for 20% utilization task sets. Since theoretical results are often pessimistic, RT-OpenMP was also evaluated with higher utilization task sets.

Task Frequency: Each periodic iteration of a task costs a (relatively) fixed amount of overhead. Hence, a task that executes at 1000Hz (1000 times per second) is likely to incur approximately ten times more overhead than a task that executes at 100Hz, and this additional overhead may impact performance. We explore several timescales to quantify this effect.

Bin-Packing Heuristic: As we described in Section 2.0.3, the heuristic changes how work is assigned to processors: the first-fit heuristic heavily loads as few processors as necessary and leaves the rest underutilized, while the worst-fit heuristic attempts to minimize the maximum load on any particular processor. Theoretically both heuristics should guarantee schedulability for 20% utilization task sets, but we expect their performance may differ for higher utilization task sets.

Number of Processors: Also of interest is how well this approach scales as the number of processors involved increases. The size of each thread team increases by one for every processor used in the system, which may increase the synchronization overhead at the end of every segment. In addition, each processor chip in our machine has 12 cores. When we use more than 12 cores, the teams have to synchronize across multiple chips, potentially leading to additional communication overheads.

Test Platform: We tested our runtime system with a 48-core symmetric multiprocessor, a 1U AMD Quad with four Opteron 6168 processors. We used standard Linux kernel version 3.4.4 with RT_PREEMPT patch version r14 as our underlying RTOS. We left processors 0-11 in their default configuration (to handle normal Linux activities and interrupts), and processors 12-47 were optimized for real-time performance. This was done by isolating them from the Linux scheduler and load balancer with the boot parameter isolcpus and preventing them from servicing any maskable interrupts. This gave 36 processors on which to run real-time task sets.

Task Set Generation: Task set generation is straightforward; tasks were randomly generated and included in the task set until the total utilization of the whole set was as desired. Given a certain number of processors, the goal is to generate a parallel synchronous task set that achieves within 2% of the desired utilization (e.g. if the desired utilization level is 50% then the actual utilization will be between 48% and 50%). Task periods and strand lengths are unitless and can be scaled at runtime to achieve a desired task frequency.

First, the period of the task is chosen to be $2^i$, where $i \in \{11, 12, \ldots, 16\}$. To conform to the applicable scheduling theory, the critical path length of each task was chosen to be 8%, 10%, 14% or 20% of the period, with probability of 0.4, 0.3, 0.2 and 0.1 respectively, to yield tasks that have varying levels of slack. As indicated above, the maximum allowable critical path length is 100% of the period of speed-1/5 ideal processors, or 20% of the period on our processors. This methodology gives us critical path lengths of 40%, 50%, 70% and 100% of the maximum allowable critical path length.
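The two random choices described above can be sketched as follows. This is a minimal illustration rather than the generator used in the experiments: each function takes a uniform sample `u` in [0, 1) so the mapping is easy to inspect, the function names are hypothetical, and the text does not state how the exponent $i$ is distributed, so a uniform choice over the six exponents is assumed here.

```c
#include <assert.h>

/* Map a uniform sample u in [0,1) to a period 2^i with i in {11,...,16}.
 * ASSUMPTION: the six exponents are equally likely. */
unsigned long pick_period(double u)
{
    assert(u >= 0.0 && u < 1.0);
    int i = 11 + (int)(u * 6.0);
    return 1UL << i;
}

/* Map a uniform sample u in [0,1) to the critical-path length as a
 * fraction of the period: 8%, 10%, 14% or 20% with probability
 * 0.4, 0.3, 0.2 and 0.1 respectively. */
double pick_cpl_fraction(double u)
{
    assert(u >= 0.0 && u < 1.0);
    if (u < 0.4)
        return 0.08;
    if (u < 0.7)
        return 0.10;
    if (u < 0.9)
        return 0.14;
    return 0.20;
}
```

Dividing each fraction by the 20% maximum gives the 40%, 50%, 70% and 100% figures quoted in the text.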

Given these parameters, the task is generated segment by segment to get a series of segments such that the critical path length of the task is equal to the chosen critical path length. To do so, we generate each segment in turn and randomly choose its execution time from a log-normal distribution. This allows us to control the distribution mean while still allowing for occasional large and small values. The average segment length was 400, and the minimum segment length was 100. The number of strands in each segment also was chosen from a log-normal distribution with mean 4 and minimum value 1.

Methodology: We ran experiments with m = 12 and m = 36, where m is the number of cores. For both values of m, we generated task sets with utilizations between 20% and 80%. For m = 12 we generated 100 task sets, and for m = 36 we generated 20 task sets. Each task set was then scheduled with both the first-fit and worst-fit heuristics. Each task set was run for five minutes of wall-clock time under various timescalings. The absolute values derived from each timescaling can be seen in Table 2.1.

Minimum        Maximum        Minimum           Average
Task Period    Task Period    Segment Length    Segment Length
2048ms         $2^{16}$ms     100ms             400ms
32ms           1024ms         1563µs            6250µs
16ms           512ms          781µs             3125µs
8ms            256ms          391µs             1563µs
4ms            128ms          195µs             781µs
2ms            64ms           98µs              391µs

Table 2.1: Several timescales allow us to validate the system for a variety of potential application domains. The top three timescales demonstrate the limits of the design under 36-core operation, and the bottom three timescales demonstrate the limits for 12-core operation.

For each experiment, we calculate the failure rate. A task set is said to have failed if any task misses a deadline. The failure rate is the ratio of the number of task sets that failed to the total number of task sets. Before presenting these results we first describe a series of relevant design choices and system overhead measurements.
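Note that the failure rate defined above is counted per task set, not per job or per deadline: a single miss anywhere in a five-minute run marks the whole set as failed. A minimal sketch of the calculation, assuming a hypothetical array `missed` that holds the number of deadline misses observed for each task set:

```c
#include <assert.h>
#include <stddef.h>

/* missed[i] is the number of deadline misses observed while running task
 * set i. A set counts as failed if it missed at least one deadline. */
double failure_rate(const unsigned *missed, size_t n_sets)
{
    assert(n_sets > 0);
    size_t failed = 0;
    for (size_t i = 0; i < n_sets; ++i)
        if (missed[i] > 0)
            ++failed;
    return (double)failed / (double)n_sets;
}
```

For example, four task sets with 0, 3, 0 and 1 misses give a failure rate of 0.5, regardless of how many misses the failing sets incurred.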

2.1.1 RT-OpenMP Design Space Choices

In the design of RT-OpenMP, many choices were made among many alternatives. This subsection describes some of those alternatives, and discusses the pros and cons of each. In particular, three major decisions were made: the scheduling strategy, the preemption mechanism, and the synchronization mechanism.

Scheduling: RT-OpenMP uses partitioned DM (Deadline Monotonic) scheduling. The alternative would have been to use global EDF (Earliest Deadline First), which, in fact, provides a better augmentation bound of 4 (instead of 5 provided by partitioned DM). We chose to implement partitioned DM for our first prototype for multiple reasons. First, partitioned DM is easier to implement on a multi-core system by leveraging thread priorities. Dynamic priority schedulers are more difficult to implement using OS priorities (and a user-space scheduler that implements preemption also would have been difficult, as is discussed subsequently). Second, partitioned scheduling has lower overheads for several reasons: (1) scheduling occurs statically, so there are no overheads of computing the schedule at run time; (2) strands do not migrate from one core to another during execution; and (3) preemption occurs more rarely and predictably since strands can only preempt other strands that are assigned to the same core. For these reasons, RT-OpenMP was prototyped with this strategy. Future systems may explore global dynamic priority scheduling strategies in this or other concurrency platforms.

Preemption: There are two ways to implement preemption between threads. One can either rely on the operating system mechanisms to provide preemption (as RT-OpenMP does), or implement user-space preemption by voluntary yielding. For user-space preemption, each thread must periodically check whether it has been preempted. If it has, it should save its current state and yield the core to the preempting thread. This has the advantage that it doesn't involve expensive system calls. In addition, this is often safer for programs that use mechanisms such as locks, since threads can make sure that they are at a safe point before they yield. On the other hand, this method has a few disadvantages as well. The user (or compiler) must provide mechanisms for periodic polling and checkpointing. Moreover, a high-priority thread may have to wait for a long time before a low-priority thread decides to yield its processor. Due to this priority inversion, it is difficult to provide real-time performance unless there are bounds on how long such a priority inversion can last and how often it can happen. RT-OpenMP is designed to enforce the real-time performance provided by the theory presented in [10], which assumes immediate preemption. In addition, the parallel synchronous task model assumes that tasks do not acquire and hold locks, so the runtime system does not need to consider whether or not to preempt a thread holding an exclusive lock on a shared resource. Therefore, RT-OpenMP uses OS preemption by managing thread priorities. In the future, other systems may explore user-space yielding mechanisms for real-time parallelism, which may have lower latency or be more suitable for a user-space scheduler such as those found in traditional parallel concurrency platforms such as OpenMP or Cilk Plus.

Synchronization: In RT-OpenMP, each thread must wait on a barrier for other threads of its team to finish executing the current segment. There are two ways to implement such waiting. One method is to use sleeping (which is the option we used), and the other is to use polling. Both methods involve waiting for a specific condition to change, but their implementation differs. Sleeping generally means that a thread is removed from consideration of the scheduler (through removal from the runqueue) and hence does not consume further processor time, even indirectly, until it is woken. The downside of this is that the operating system must become involved both to suspend and resume the thread. The other approach, polling, involves spinning until the condition becomes true. Unlike a sleeping thread, a polling thread continuously occupies the processor. The benefit is that polling generally offers better latency; a thread can get past the barrier faster once the condition becomes true. Hence, polling is the preferred strategy if it is known that wait times will be very short. In RT-OpenMP, the number of threads may be much larger than the number of cores, and many threads may spend a large amount of time waiting on a barrier. Therefore, the sleeping mechanism is a clear choice for barrier synchronization in this platform, and is implemented using the futex_wait system call.

2.1.2 System Overhead Measurements

As described earlier, one goal of our system design is to minimize the overhead due to contention. There are two primary sources of contention within the system, preemption and segment (barrier) synchronization. We evaluate these mechanisms with micro-benchmarks: short programs designed to expose a specific facet of system performance.

Preemption Overhead: Our first micro-benchmark is designed to measure the effect of preemption on scheduling latency, which we define to be the difference in time from when a job may start executing (its release time when it has higher priority than any other job that is currently executing) to when it actually starts. We use two jobs to accomplish this: the first is a low-priority job that executes for a long time on twelve cores simultaneously. The second is a 12-core high-priority job whose release time is a fixed interval after the start of the low-priority job. There are two salient features: the second job should always preempt the first job immediately upon its release, and we always know the precise time of that release. Hence, we can measure the difference between the second task's release time and the time it actually starts executing code. This always involves preempting task one, and thus we consider this to be a practical measure of overhead due to preemption.

Barrier Latency Delay: The second micro-benchmark addresses the segment synchronization delay. Whenever a team of threads moves through a barrier there will be many threads waiting on the barrier and only one thread to wake them. This could introduce a significant delay for whichever thread happens to be woken last. This micro-benchmark is straightforward: a team of threads goes through a barrier and they timestamp immediately before and after. The last pre-barrier timestamp is the time that the last thread entered the barrier, so it is the time that all threads become eligible to proceed. The last post-barrier timestamp is the time that the last thread left the barrier, which is the thread that suffered the most delay. Thus, we can compare these two timestamps to determine the true total delay of the barrier operation, that is, the total amount of time it took for a thread to actually leave the barrier once it was semantically allowed to do so.

              Scheduling      12-Core Barrier    36-Core Barrier
              Latency (µs)    Latency (µs)       Latency (µs)
25th Pct.     9.2             11.7               3277.5
50th Pct.     9.9             12.2               3284.9
75th Pct.     10.8            15.6               3290.1
95th Pct.     12.5            18.1               3296.6
Max           27.5            76.8               6503.5

Table 2.2: The 25th, 50th, 75th, and 95th percentiles for the scheduling and barrier latency micro-benchmarks, as well as the worst-case observed latency.

We perform the barrier latency micro-benchmark twice: once with a team of 12 threads on
12 cores, and once with a team of 36 threads on 36 cores. As previously discussed, we expect
the 36-core version to incur greater overhead both because of the greater number of threads
in the team as well as the cost of communicating across multiple processor chips.
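The barrier delay described above reduces to a difference of two maxima over the per-thread timestamps. The following sketch shows only that reduction; the array names, the function name, and the choice of signed 64-bit nanosecond timestamps are assumptions for illustration, and collecting the timestamps (for example with a monotonic clock) is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* pre[i] and post[i] are thread i's timestamps taken immediately before
 * and immediately after the barrier, in the same time unit.
 * The delay is (latest exit) - (latest entry): the time the slowest
 * thread needed to leave once every thread had arrived. */
int64_t barrier_delay(const int64_t *pre, const int64_t *post, size_t n)
{
    assert(n > 0);
    int64_t last_in = pre[0];
    int64_t last_out = post[0];
    for (size_t i = 1; i < n; ++i) {
        if (pre[i] > last_in)
            last_in = pre[i];
        if (post[i] > last_out)
            last_out = post[i];
    }
    return last_out - last_in;
}
```

With entry times {100, 140, 120} and exit times {150, 145, 190}, the last thread arrives at 140 and the last thread leaves at 190, so the measured delay is 50.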

All micro-benchmark results are presented numerically in Table 2.2. We see that the overhead is generally small for 12-core teams: less than 30 µs overhead for preemption and less than 80 µs delay for barriers. However, the overhead of barrier synchronization for large teams is roughly two orders of magnitude greater, requiring more than 3ms in all cases.

2.1.3 Empirical Results

Figures 2.4 through 2.9 present the results of our experiments. We now evaluate those results.

Our experiments show that all 12-core task sets are schedulable at 20% utilization once the minimum task period is 4ms or greater. This validates the theoretical augmentation bound results for those timescales, and demonstrates the suitability of the system to handle applications that require task frequencies of 250Hz. It is difficult to determine whether the 2ms boundary is due entirely to preemption and barrier overhead or an additional factor, because the total incurred overhead is dependent on the exact task set configuration (how many preemptions and barrier synchronizations occur).

As we run slower timescale tasks we're able to execute higher utilization task sets. Between Figures 2.5 and 2.6 the achieved utilization grows from 20% to 30% at the expense of doubling the shortest task period. For 12-core operation we are able to achieve full schedulability at reasonably high utilization (greater than 50%) when the shortest task periods are greater than or equal to 16ms.

The large 36-core task sets are much more difficult to schedule, which is to be expected given the much higher barrier synchronization overhead. We only successfully schedule all 20% utilization tasks once the minimum task period is 16ms. We can achieve high (70%-80%) utilization if we double the minimum task period to 32ms. This is not fast enough to run real-time applications that require extremely short timescales, but does demonstrate the suitability of the system for applications that have slightly longer periods but require much more processing power, for example processing video in real-time at 25 frames per second.

The barrier overhead appears to be the primary limiting factor in how fast we can run large teams of threads. Minimizing or avoiding multi-chip communication would appear to be necessary for any real-time systems that require sub-millisecond operation.

Packing Heuristic: One very clear result concerns the performance of the first-fit and worst-fit bin packing heuristics. The worst-fit heuristic dominates the first-fit heuristic, meaning that there was not a single task set where the first-fit heuristic was successful but the worst-fit heuristic was not. The worst-fit heuristic also scales better at longer timescales, which actually is not due to the timescale: at high utilizations the first-fit heuristic tends to over-utilize individual processors, resulting in a task set that is unschedulable under any scaling.

This provides strong evidence for the source of task set failures. If task failures were primarily due to contention and synchronization overhead, then worst-fit would be worse, since it potentially spreads a task across many cores, while first-fit clusters the tasks on a small number of cores. This result seems to suggest that the primary danger in our system is over-utilization of individual cores, rather than the contention overhead due to the cooperation of many cores.

This seems to be confirmed by the 2048ms timescale experiment, a timescale so large that it is extremely unlikely that any deadline misses arise due to overheads. At the 70% utilization level the first-fit heuristic begins to perform extremely poorly, while worst-fit has only a small increase in the number of unschedulable task sets. From a system design perspective the worst-fit heuristic appears to be a better choice, as it seems to offer a much larger margin of safety.

[Figure 2.4 plot: "Utilization and Taskset Failure Rate, 2ms-64ms Tasks"; x-axis: Utilization (20 to 80); y-axis: Failure Rate (0 to 1); series: First-Fit, 12 cores; Worst-Fit, 12 cores.]

Figure 2.4: Task set utilization vs. failure rate (both in percentages) for the 2ms timescale. Both the worst-fit and first-fit task sets failed at 20% utilization. This shows that at the 2ms timescale the scheduler's theoretical assurance fails due to system overheads. All 36-core task sets failed, and their results are not shown.

[Figure 2.5 plot: "Utilization and Taskset Failure Rate, 4ms-128ms Tasks"; x-axis: Utilization (20 to 80); y-axis: Failure Rate (0 to 1); series: First-Fit, 12 cores; Worst-Fit, 12 cores.]

Figure 2.5: Task set utilization vs. failure rate (both in percentages) for the 4ms timescale. The worst-fit heuristic succeeded at 20% utilization, but failed otherwise. Most 36-core task sets failed, and their results are not shown.

[Figure 2.6 plot: "Utilization and Taskset Failure Rate, 8ms-256ms Tasks"; x-axis: Utilization (20 to 80); y-axis: Failure Rate (0 to 1); series: F-F, 36; W-F, 36; F-F, 12; W-F, 12.]

Figure 2.6: Task set utilization vs. failure rate (both in percentages) for the 8ms timescale. Failure means at least one periodic deadline miss. The first-fit task sets are never completely schedulable, while the 12-core worst-fit task set is schedulable up to 30%. Worst-fit (W-F) and first-fit (F-F) are abbreviated.

[Figure 2.7 plot: "Utilization and Taskset Failure Rate, 16ms-512ms Tasks"; x-axis: Utilization (20 to 80); y-axis: Failure Rate (0 to 1); series: F-F, 36; W-F, 36; F-F, 12; W-F, 12.]

Figure 2.7: Task set utilization vs. failure rate (both in percentages) for the 16ms timescale. Worst-fit (W-F) and first-fit (F-F) are abbreviated.

[Figure 2.8 plot: "Utilization and Taskset Failure Rate, 32ms-1024ms Tasks"; x-axis: Utilization (20 to 80); y-axis: Failure Rate (0 to 1); series: First-Fit, 36 cores; First-Fit, 12 cores; Worst-Fit, 36 cores; Worst-Fit, 12 cores.]

Figure 2.8: Task set utilization vs. failure rate (both in percentages) for the 32ms timescale. This demonstrates how the worst-fit heuristic scales better than the first-fit heuristic, as the 36-core worst-fit task sets are approximately just as difficult to schedule as the 12-core first-fit task sets.

[Figure 2.9 plot: "Utilization and Taskset Failure Rate, 2048ms-$2^{16}$ms Tasks"; x-axis: Utilization (20 to 80); y-axis: Failure Rate (0 to 1); series: First-Fit, 36 cores; First-Fit, 12 cores; Worst-Fit, 36 cores; Worst-Fit, 12 cores.]

Figure 2.9: Task set utilization vs. failure rate (both in percentages) for the 2048ms timescale. Failure means at least one periodic deadline miss.

Chapter 3: Mixed-Criticality Federated Scheduling Service

When it was introduced, the Federated Scheduling Service (FSS) constituted a major revision in thought for parallel real-time scheduling theory and practice. It solves a major limitation of all prior research avenues in the field, which is the need to define a specific task model (such as the parallel-synchronous task model from section 2.0.2), and to explicitly schedule individual elements of a parallel computation (e.g. the strands from the parallel-synchronous task model). It addresses these concerns by providing for the strict separation of parallel workloads onto individual processor partitions, invoking a greedy scheduling strategy that is indifferent to the structure of the parallel programs it executes, and then justifying this approach by providing theoretical performance that surpasses the 20% utilization bound for RT-OpenMP as well as the other scheduling approaches known at the time.

The need for a specific task model in earlier work was driven by the desire to analyze exactly how parallel tasks would be executed. For both RT-OpenMP [8] and earlier work [1] the scheduling analysis relied on sequencing exactly how a parallel task runs, and then allocating enough processor cores to ensure there is enough slack in a generated schedule for each parallel task to meet its real-time constraints. Not only does this artificially restrict the freedom of real-time application developers, but it invariably demands a system where parallel tasks must be separated into their most basic runnable components and each atom of work must be executed individually. While this approach can be successful, it is responsible for the high degree of system overhead evidenced in RT-OpenMP, which was described in Chapter 2.

In contrast, the FSS analysis abstracts each parallel task into two quantities. The first is the total work, which is the sum of all computational effort in the parallel program. The second is the critical-path length, which is the longest sequential chain of work in the parallel program. These quantities reflect the structure of a parallel program (but do not define it) and are suitable for stating a bound using the greedy scheduler theorem. If we allow $T_1$ to be the work of a parallel program, and $T_\infty$ to be the critical-path length, then the greedy scheduler theorem states [12] that the actual runtime $T$ of a parallel program on $P$ processors is bounded by:

$$\max(T_1/P,\; T_\infty) \le T \le T_1/P + T_\infty \qquad (3.1)$$

Importantly, the only assumption made about a greedy scheduler is that processor cores are never left idle when there is work available to do. This assumption can be violated in practice by scheduling overheads in realistic systems, as well as poor system configuration (e.g., OpenMP's chunk size or Cilk Plus' grainsize), but most concurrency platforms can approximate this criterion with appropriate configuration. Thus, any existing parallel scheduler that is "sufficiently greedy" may be used to schedule and execute parallel real-time programs in the FSS while achieving a strong theoretical bound. A detailed analysis can be found in [13].
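As a worked instance of Equation 3.1, the sketch below computes both bounds from the work, the critical-path length, and the processor count. The type and function names are hypothetical, and the numbers in the example are illustrative values rather than measurements from this dissertation.

```c
#include <assert.h>

/* Lower and upper bounds on the runtime T from Equation 3.1. */
typedef struct {
    double lower; /* max(T1/P, Tinf) */
    double upper; /* T1/P + Tinf     */
} runtime_bounds;

/* work = T1, span = Tinf (same time unit), procs = P. */
runtime_bounds greedy_bounds(double work, double span, unsigned procs)
{
    assert(procs > 0 && span > 0.0 && work >= span);
    double per_proc = work / (double)procs;
    runtime_bounds b;
    b.lower = per_proc > span ? per_proc : span;
    b.upper = per_proc + span;
    return b;
}
```

For a program with 120 units of work and a critical path of 10 units on 4 processors, the runtime under a greedy scheduler lies between 30 and 40 units. The two bounds differ by at most a factor of two, which is why only these two quantities are needed to size a partition.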

In practice, the FSS implementation is simple, powerful, and flexible. The job of the FSS is to determine suitably sized partitions for each parallel task and then enforce isolation at runtime. The actual running of the parallel real-time tasks is handled by existing parallel concurrency platforms such as Cilk Plus and OpenMP (but not RT-OpenMP, which as we have noted assumes a more restrictive task model). These are not inherently real-time environments but are nonetheless highly efficient and provide good soft real-time performance.

This chapter examines the implementation of a (dual-criticality) mixed-criticality federated scheduling (MCFS) system. As in traditional federated scheduling, parallel tasks are partitioned onto sets of processor cores according to their processor demand. Each task may execute in two modes (high-criticality and low-criticality), each of which has its own pre-assigned partition; the two partitions may differ in size and may even be disjoint. Moreover, the transition between these modes may be triggered at any point during execution, and an effective system must allow for the reallocation of effort from low-criticality work to high-criticality work with a minimum of delay.

3.1 Implementing Mixed-Criticality Federated Scheduling (MCFS)

A mixed-criticality workload is one where certain computational tasks are considered more important than others, and the more-critical tasks must be guaranteed under all operating conditions, potentially at the expense of less-critical tasks. For example, a set of four processors may be shared among two normally-disjoint tasks. A structural engineering experiment might identify two regions of structure to simulate: a highly critical region that is tightly connected to the experimental purpose of the endeavor, and a less critical portion that is farther away from the region of interest. In the event of unexpectedly high computational demand, the MCFS system is designed to reallocate some of the computational resources provided to the low-criticality task, either one or both processors, to the highly-critical task.

While this example captures the original intent of mixed-criticality real-time systems, which is prioritization, in the most general sense the MCFS system may be viewed as implementing two operating modes defined by pre-computed schedules.

This implementation is a particularly good example of the complexities of real-time systems design for parallel computation, as it converts an existing parallel concurrency platform into a real-time parallel concurrency platform with mixed criticality semantics. The original designers of OpenMP had no reason to consider the prioritization of some threads over other threads, much less the challenge of dynamically adjusting those priorities during runtime in response to system events. Creating an effective mixed-criticality mode transition required modifying the basic mechanisms of OpenMP, essentially leaving only the thread creation and parallel work management code intact. In particular, three mechanisms were required.

The three key requirements for the MCFS runtime are: (1) the system must detect when a mode transition must occur (any high-criticality task has overrun its virtual deadline); (2) it must modify the core allocation to give more cores to high-criticality tasks in the event of a mode transition (virtual deadline miss); and (3) since the number of active threads in the system fluctuates with its criticality state, it must provide a state-aware concurrency mechanism to facilitate parallel programming, i.e., a state-aware barrier.

This reference implementation supports parallel programs written in OpenMP [5]. It uses
Linux with the RT_PREEMPT patch as the underlying RTOS and the OpenMP parallel
concurrency platform to manage threads and assign work at runtime.

Background

As this is a dual-mode mixed-criticality implementation, there are only two states: the typical-state and the critical-state. The system transitions from the typical state into the critical state when the system is in danger of overrunning any job's deadline. This moment of danger is the virtual deadline, which is calculated to be a point in time sufficiently early to detect undesirable system behavior while also leaving enough slack so that a high criticality job may have enough time to finish successfully after the mode transition.

For a more complete background, including full details of the MCFS scheduling (core partitioning) algorithm, calculation of virtual deadlines, etc. see [14]. Those details are not this
author's work and are beyond the scope of a discussion of system implementation.

3.1.1 Overrun Detection

The MCFS runtime system detects that a high-criticality task overruns its virtual deadline via Linux's timer_create and timer_settime API. These timers are set and disarmed at the start and end of each job period by each high-criticality task while in the typical-state, so expiration only occurs in the event of an overrun. Timer expirations are delivered via signals and signal handlers. To make sure that timer expiration is noticed promptly, kernel ksoftirq threads are given higher real-time priority than all other threads.

These ksoftirq threads are part of the interrupt handling system in Linux that consists of a fast uninterruptible component and a slower, deferrable component (the deferrable component being run in the ksoftirq thread). Allowing them to run at a priority above high criticality tasks constitutes criticality inversion whenever they are invoked to handle an event that does not belong to one of the system's high criticality tasks, meaning that services for low-criticality tasks may be performed in preference to handling high-criticality execution. However, these threads are necessary to the handling of timer events, and are thus vital to the process of high-criticality escalation should it need to occur. If the ksoftirq threads executed at a priority below the low-criticality tasks, then a low-criticality task could block the execution of the ksoftirq thread, and thereby block the delivery of a timer signal destined for a high-criticality task. In practice, interrupt handling is fast and constitutes minimal overhead, even for the slow portion of the interrupt handler.
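The arm-at-start, disarm-at-end pattern described above can be sketched with the POSIX timer calls named in the text. This is a minimal illustration under stated assumptions, not the MCFS source: the helper names, the flag-setting handler, the use of CLOCK_MONOTONIC, and the choice of signal are all hypothetical, and a real handler would trigger the mode transition rather than set a flag. On older glibc versions the program may need to be linked with -lrt.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <time.h>

/* Set by the handler when the virtual deadline passes (illustrative). */
static volatile sig_atomic_t overrun_flag = 0;

static void on_overrun(int sig)
{
    (void)sig;
    overrun_flag = 1;
}

/* Install the handler and create a one-shot timer that raises `signo`. */
int overrun_timer_create(timer_t *t, int signo)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_overrun;
    sigemptyset(&sa.sa_mask);
    if (sigaction(signo, &sa, NULL) != 0)
        return -1;

    struct sigevent sev;
    memset(&sev, 0, sizeof sev);
    sev.sigev_notify = SIGEV_SIGNAL;
    sev.sigev_signo = signo;
    return timer_create(CLOCK_MONOTONIC, &sev, t);
}

/* Arm at job start: expire `ns` nanoseconds from now (relative time). */
int overrun_timer_arm(timer_t t, long ns)
{
    assert(ns > 0);
    struct itimerspec its;
    memset(&its, 0, sizeof its);            /* it_interval = 0: one-shot */
    its.it_value.tv_sec = ns / 1000000000L;
    its.it_value.tv_nsec = ns % 1000000000L;
    return timer_settime(t, 0, &its, NULL);
}

/* Disarm at job end: a zero it_value stops the timer. */
int overrun_timer_disarm(timer_t t)
{
    struct itimerspec its;
    memset(&its, 0, sizeof its);
    return timer_settime(t, 0, &its, NULL);
}
```

A job that finishes on time calls `overrun_timer_disarm` before the timer fires, so the handler runs only when the virtual deadline is actually overrun.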

3.1.2 Core Reallocation

A key requirement of MCFS is to increase the allocation of cores to a high-criticality task when it exceeds its virtual deadline, by taking cores away from low-criticality tasks. This is accomplished in four parts. (1) During initialization each high-criticality task $\tau_i$ creates the maximum number of threads it would need in the critical-state ($n_i^O$). Each low criticality task creates $n_i^N$ threads. (2) When the runtime system starts (in typical-state), only $n_i^N$ threads are awake for each task and they are pinned to distinct cores¹. (3) The remaining $n_i^O - n_i^N$ threads of high-criticality tasks are put to sleep with the FUTEX_WAIT system call², while also pinned to their cores (which may be shared with a low-criticality task). These threads sleep at a priority higher than any low-criticality thread on the same core. (4) When a job of high-criticality task $\tau_i$ overruns its virtual deadline, its sleeping threads are awoken with FUTEX_WAKE and they preempt the low-criticality thread on the same core and begin executing.

Note that the set of cores assigned by the typical-state mapping to $\tau_i$ is a subset of the cores assigned by the critical-state mapping; therefore, the system needs no migration for the high-utilization tasks.
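The sleep and wake steps (3) and (4) can be sketched with the raw futex system call. This is an illustration under stated assumptions, not the MCFS implementation: all names are hypothetical, thread pinning and real-time priorities are omitted, and the cast from `atomic_int *` to `int *` assumes the two types share a representation, which holds for GCC and Clang on Linux.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <limits.h>
#include <linux/futex.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Thin wrapper: glibc exposes futex only through the generic syscall(). */
static long futex(int *addr, int op, int val)
{
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

/* Block while *gate == 0 (typical-state). FUTEX_WAIT returns at once if
 * the value has already changed, so a wake-up cannot be lost. */
void sleep_extra_thread(atomic_int *gate)
{
    assert(gate != NULL);
    while (atomic_load(gate) == 0)
        futex((int *)gate, FUTEX_WAIT, 0);
}

/* Open the gate and wake every thread sleeping on it (critical-state). */
void wake_extra_threads(atomic_int *gate)
{
    assert(gate != NULL);
    atomic_store(gate, 1);
    futex((int *)gate, FUTEX_WAKE, INT_MAX);
}

/* Stand-in for one extra high-criticality thread. */
typedef struct {
    atomic_int gate; /* 0 = keep sleeping, 1 = released        */
    atomic_int ran;  /* set once the thread has been released  */
} extra_thread;

void *extra_thread_main(void *arg)
{
    extra_thread *e = arg;
    sleep_extra_thread(&e->gate);
    atomic_store(&e->ran, 1); /* the real thread would join the parallel work here */
    return NULL;
}
```

In this sketch the overrun signal handler would call `wake_extra_threads`. Because the woken threads were created with higher priority and are already pinned in the described design, the operating system preempts the low-criticality thread on each shared core without any migration.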

In this design, the threads of each task must be activated and deactivated each period via the OpenMP directive #pragma omp parallel. Thus, this approach of maintaining a pool of unused, high-criticality threads does impose an additional overhead on the system, even if it never transitions into critical-state, due to these activations and deactivations. However, these overheads are only imposed on low-criticality tasks by high-criticality tasks, so there is no criticality inversion.

¹ In order to pin threads to cores, before task execution we use an initial #pragma omp parallel directive where individual threads make a call to Linux's sched_setaffinity and pin themselves to the assigned cores.
² Currently only accessible through the generic syscall function with the FUTEX_WAIT and FUTEX_WAKE defines.

When a job of high-criticality task $\tau_i$ overruns its virtual deadline and preempts the low-criticality tasks on the shared cores, the current jobs of these low-criticality tasks may continue to execute when the higher-priority threads from high-criticality tasks are idling. If, however, the start times of these low-criticality jobs are already later than their absolute deadlines, such jobs are dropped voluntarily by low-criticality tasks. Therefore, when the system is able to recover from critical-state to typical-state, there is little backlog of low-criticality jobs and the future arriving jobs of the same task are able to resume normal execution. Note that for systems that can tolerate tardiness for low-criticality jobs, an alternative could be not to drop these backlogged jobs, and instead to design policies to bound such tardiness.

The primary reason for allowing current low-criticality jobs to run at a lower priority instead of directly killing the threads of these jobs is to avoid the cost of creating new threads during system operation, but it also allows the low-criticality threads to make progress on a best-effort basis. Note that since we allow low-criticality threads to continue executing after a mode transition has occurred, they will continue to interfere with high-criticality threads through cache pollution, resource contention, and other effects. Even so, allowing low-criticality threads to continue progressing seems appropriate for a soft real-time system. Besides killing these threads, the other alternative would be to suspend them, but we do not investigate either of these options here.

Since high-criticality tasks do not share cores in MCFS, if a high-criticality task receives a timer signal indicating that it has overrun its virtual deadline, it does not initiate a system-wide mode switch. Instead, it simply wakes up its sleeping $n_i^O - n_i^N$ threads and in doing so acquires the necessary additional cores from a subset of low-criticality tasks. If a low-criticality task overruns its deadline, it need not do anything. This natural default implementation leads to graceful degradation since low-criticality tasks are not discarded on entering critical-state.

periodic_iteration() {
    #pragma omp parallel
    {
        if (typical_state && high_crit_task)
            sleep_extra_threads()

        // Do parallel program
        #pragma omp for schedule(dynamic) nowait
        for (j = 0; j < num_strands; ++j) {
            // Perform work
            busy_work();
        }
        mc_barrier_wait()
        wake_extra_threads()
    }
}

Figure 3.1: MCFS Periodic Task Invocation Pseudocode

3.1.3 State-Aware Barrier Implementation

One side-effect of this mixed-criticality model for parallel tasks is that counting-based thread synchronization methods such as traditional barriers will not work properly as the number of active threads fluctuates. In barrier synchronization there is (usually) a fixed number of threads that must periodically rendezvous. The obvious implementation is to have a counter that increments each time a thread reaches the barrier and have each thread wait at that barrier. Once all threads have arrived the counter is reset and all threads are released.

    // Called asynchronously by signal handler
    barrier_state_switch()
        needs_switch = true

    check_needs_updating()
        if( needs_switch )
            atomically_claim_switcher()
            if( switcher )
                verify_barrier_inactive()
                update_barrier_count()
                needs_switch = false
                release_spinwaiters()
            else
                spinwait()

    mc_barrier_wait()
        check_needs_updating()
        do_barrier_wait()

Figure 3.2: MCFS Mode Aware Barrier Pseudocode

This counter-based approach works in some use cases, but assumes that the number of threads is
constant throughout the lifetime of the barrier, which is not the case in the MCFS system. In
particular, some of the threads in a high-criticality task may be sleeping, so the implicit
barrier at the end of each #pragma omp for loop may deadlock if the sleeping threads never
arrive. Equally troubling is that if a race condition occurs on the thread counter, newly
awoken threads may race ahead and cause deadlock or could release threads from the barrier
early (thus having some threads working before the barrier and some threads working after the
barrier), violating the synchronization ordering of the program.

We address this by removing the implicit barrier with the OpenMP clause nowait, as shown
in Figure 3.1, and implementing a state-aware barrier, shown in Figure 3.2, which operates as
follows. When a task begins a transition, its signal handler sets a variable indicating that the
barrier needs updating before waking the extra high-criticality threads. The next thread
to encounter the barrier checks this variable and claims responsibility for updating with an
atomic compare-and-swap on a boolean flag. Other threads arriving after that will spin-wait.
The update thread will then verify that the barrier is not currently being modified by any
thread that arrived before the transition, spin-waiting otherwise, and finally will increment
the barrier count when it is safe to do so. It then releases any threads that are spin-waiting
so that they may proceed through the barrier.

This imposes a small, constant overhead every time a thread accesses the barrier, since
threads must check to see if the barrier needs updating. However, it allows us to use the same
barrier in both states, and the barrier can be updated even if some threads are currently
waiting on the barrier. Without such an arrangement, the transition overhead could be
unbounded, since the additional $n_i^O - n_i^N$ high-criticality threads could not be released
while any barrier was in an indeterminate state.
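The protocol of Figure 3.2 can be sketched in C11 atomics as follows. This is a minimal illustration under stated assumptions, not the MCFS implementation: the type mc_barrier_t, its field names, the pending-count argument of barrier_state_switch, and the use of a spin lock (busy) to stand in for verify_barrier_inactive are all hypothetical. The sketch also assumes the count is only raised while threads are waiting, as in the transition described above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical layout; field names are illustrative, not taken from MCFS. */
typedef struct {
    atomic_flag busy;          /* guards expected/arrived while they change */
    int         expected;      /* threads that must arrive in each phase */
    int         arrived;       /* threads that have arrived in this phase */
    atomic_uint phase;         /* advanced when all expected threads arrived */
    atomic_int  pending;       /* thread count requested by the handler */
    atomic_bool needs_switch;  /* set asynchronously by the signal handler */
    atomic_bool switcher;      /* claimed by the one thread doing the update */
} mc_barrier_t;

static void mc_barrier_init(mc_barrier_t *b, int nthreads) {
    assert(nthreads > 0);
    atomic_flag_clear(&b->busy);
    b->expected = nthreads;
    b->arrived = 0;
    atomic_init(&b->phase, 0u);
    atomic_init(&b->pending, nthreads);
    atomic_init(&b->needs_switch, false);
    atomic_init(&b->switcher, false);
}

/* Called from the signal handler: it only records the request. */
static void barrier_state_switch(mc_barrier_t *b, int new_count) {
    atomic_store(&b->pending, new_count);
    atomic_store(&b->needs_switch, true);
}

static void check_needs_updating(mc_barrier_t *b) {
    while (atomic_load(&b->needs_switch)) {
        bool unclaimed = false;
        /* One thread wins the compare-and-swap; the others spin in this loop. */
        if (atomic_compare_exchange_strong(&b->switcher, &unclaimed, true)) {
            /* Wait until no earlier arrival is mid-update, then set the count. */
            while (atomic_flag_test_and_set(&b->busy)) { }
            if (atomic_load(&b->needs_switch)) {
                b->expected = atomic_load(&b->pending);
                atomic_store(&b->needs_switch, false);
            }
            atomic_flag_clear(&b->busy);
            atomic_store(&b->switcher, false);  /* releases the spin-waiters */
        }
    }
}

static void mc_barrier_wait(mc_barrier_t *b) {
    check_needs_updating(b);
    while (atomic_flag_test_and_set(&b->busy)) { }
    unsigned my_phase = atomic_load(&b->phase);
    bool last = (++b->arrived >= b->expected);
    if (last) {
        b->arrived = 0;
        atomic_fetch_add(&b->phase, 1u);  /* release everyone in this phase */
    }
    atomic_flag_clear(&b->busy);
    while (!last && atomic_load(&b->phase) == my_phase) { }  /* spin until released */
}
```

The handler performs only atomic stores, so it never blocks on the lock that ordinary arrivals hold; the per-arrival cost is the single load of needs_switch in the common case.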

3.1.4 Recovering from critical-state to typical-state

The MCFS scheduling theory naturally supports tasks that may transition between the
typical-state and critical-state many times over the life of the system. This is desirable as it
allows low-criticality tasks to continue executing on a best-effort basis. Otherwise, a
high-criticality task transitioning into critical-state would permanently impair any
low-criticality task it happened to share a processor core with, even if the conditions that
led to the state transition were transient.

Reverting to typical-state is straightforward compared with transitioning into the
critical-state, because the MCFS theory allows this to happen at a time of our choosing and not
in response to any external event. Thus, a particularly convenient time for this to occur is
outside the execution of any job of the task, because the task's team of parallel threads is not
active during those times. Modifying the system while a parallel computation is underway
is the major source of complexity for the critical-state transition and is what requires the
complex core reallocation and state-aware barrier mechanisms that are discussed above.

Effecting the transition to typical-state requires resetting the state-aware barrier and
reducing the number of threads that will participate in future job invocations. Since this
process occurs outside the execution of any job, it is guaranteed that the barrier is not in use
and that no parallel threads are active. Thus there are no concurrency issues to resolve, and
reversion is accomplished without synchronization. First, the state-aware barrier
is reconfigured to expect the number of threads that should be active in the typical-state
(i.e. a modified version of update_barrier_count() from Figure 3.2 may be called without
protection). Second, a global flag is set that indicates to the critical-state threads that they
should sleep with FUTEX_WAIT upon activation rather than immediately participating.

Under the MCFS theory this reversion may be performed as often as the completion of each
individual job that has entered the critical-state. In effect, the critical-state transition occurs
on a per-job basis rather than a per-task basis, and all new jobs start in the typical-state
but may transition to the critical-state as needed, allowing for very fine-grained control
over the system criticality and providing the minimum interruption to low-criticality tasks.
Such low-criticality tasks operate on a best-effort basis but are not guaranteed in the face of
interference from a task in the critical-state. In Section 3.2, we construct a benchmark task
set to test and evaluate the recovery to typical-state feature of our MCFS runtime system.
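The per-job reversion described above amounts to a small amount of bookkeeping performed between jobs. The sketch below illustrates it under stated assumptions: the type task_state_t and all function and field names are hypothetical, and the real runtime would pair the sleep flag with FUTEX_WAIT rather than a plain boolean.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-task bookkeeping; names are illustrative only. */
typedef struct {
    int  typical_threads;   /* threads active in typical-state */
    int  critical_threads;  /* threads active in critical-state */
    int  barrier_count;     /* thread count the state-aware barrier expects */
    bool extras_sleep;      /* extra threads sleep on activation when true */
    bool critical;          /* state of the job currently executing */
} task_state_t;

static void task_init(task_state_t *t, int typical, int critical) {
    assert(0 < typical && typical <= critical);
    t->typical_threads = typical;
    t->critical_threads = critical;
    t->barrier_count = typical;
    t->extras_sleep = true;
    t->critical = false;
}

/* Mid-job transition; a real system needs the synchronized barrier update here. */
static void enter_critical(task_state_t *t) {
    t->barrier_count = t->critical_threads;
    t->extras_sleep = false;
    t->critical = true;
}

/* Called between jobs, when no thread of the team is running, so plain
 * assignments suffice and no synchronization is required. */
static void job_complete(task_state_t *t) {
    if (t->critical) {
        t->barrier_count = t->typical_threads;
        t->extras_sleep = true;
        t->critical = false;
    }
}
```

Because job_complete runs only at job boundaries, every new job starts in typical-state, which is the per-job behavior the text describes.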

3.2 Evaluation of MCFS

The MCFS system described previously was successfully implemented and tested. First,
two overhead benchmarks are described, whose results were then incorporated into the
scheduling theory. After incorporation, task sets behaved as expected for both high- and
low-criticality tasks, in both the typical and critical states.

3.2.1 MCFS Benchmarks

Latency due to mode transition: The most important factor to optimize for ensuring
the safe operation of high-criticality tasks is the high-criticality activation latency, that
is, the delay between when a mode transition is detected and when the additional
$n_i^O - n_i^N$ high-criticality threads that were sleeping in the typical mode wake up and are
ready to perform work. We measure this by inducing a mode transition at a fixed time and
having the extra threads record a time-stamp as soon as they wake up. The difference between
the mode switch time and the latest time-stamp gives the latency. This latency was very low
in general but increases with the number of threads, as can be seen in Figure 3.3. The number
of awoken threads varies from one to fourteen, measuring the latency 400 times for each
setting, and the maximum observed latency was 84 microseconds.

Note that this mode transition latency may occur only once for each high-criticality job
in the critical-state. To incorporate it into schedulability analysis, we subtract it from the
deadline of each high-criticality task.
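The measurement and the deadline adjustment above reduce to simple arithmetic, sketched here. The function names and the choice of nanosecond time-stamps on a common clock are assumptions of this example, not details given in the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Latency of one trial: latest wake-up time-stamp minus the mode switch time.
 * All times are nanoseconds on a common clock (an assumption of this sketch). */
static int64_t activation_latency_ns(int64_t switch_ns,
                                     const int64_t *wake_ns, size_t n) {
    assert(n > 0);
    int64_t latest = wake_ns[0];
    for (size_t i = 1; i < n; ++i)
        if (wake_ns[i] > latest)
            latest = wake_ns[i];
    return latest - switch_ns;
}

/* Deadline used by the schedulability test after charging the worst
 * observed transition latency to the high-criticality task. */
static int64_t effective_deadline_ns(int64_t deadline_ns, int64_t max_latency_ns) {
    assert(max_latency_ns >= 0 && max_latency_ns < deadline_ns);
    return deadline_ns - max_latency_ns;
}
```

For example, with the 84 microsecond maximum reported above, a task with a 100 ms deadline would be analyzed against a deadline of 99.916 ms.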

Figure 3.3: High-criticality mode transition latency in MCFS

3.2.2 Impact of high-criticality tasks on low-criticality tasks

In the MCFS system, low-criticality tasks may share cores with a high-criticality task. As
described above, this is managed by creating two threads on these cores: one for the
low-criticality task and one for the high-criticality task. The low-criticality task is subject to
interruption by high-criticality threads that must sleep and awake at the start and end of every
period, which involves two context switches, the start and end of a #pragma omp parallel
directive, and interactions with a Linux futex. One can compare the wall-clock execution
time of the low-criticality task with the Linux clock source CLOCK_THREAD_CPUTIME_ID to
infer the total amount of time the low-criticality task was preempted. The maximum observed
overhead was relatively high at 1555 microseconds per preemption. This is high enough
that it was important to incorporate this overhead into the schedulability analysis presented
in [14] to ensure that low-criticality tasks meet their deadlines. However, note that this
overhead is only incurred when a high-criticality task's sleeping thread is sharing a core with a
low-criticality task in the typical-state, so high-criticality tasks are not affected. In addition,
the preemption only occurs once per period of the high-criticality task. Therefore, we can
calculate the maximum number of preemptions and subtract the appropriate time from the
low-criticality task's deadline.
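The deadline adjustment just described can be sketched as below. The text does not give the exact bound on the number of preemptions, so this example assumes strictly periodic preemptions and uses floor(D/T) + 1 as the count within a window of length D; that bound, the function names, and the microsecond units are assumptions of the sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Upper bound on preemptions of a low-criticality job by a strictly periodic
 * high-criticality task within a window of window_us (assumed bound). */
static int64_t max_preemptions(int64_t window_us, int64_t high_period_us) {
    assert(window_us > 0 && high_period_us > 0);
    return window_us / high_period_us + 1;
}

/* Deadline left for the schedulability test, or -1 if the preemption
 * overhead alone consumes the whole deadline. */
static int64_t adjusted_deadline_us(int64_t deadline_us, int64_t high_period_us,
                                    int64_t overhead_us) {
    int64_t loss = max_preemptions(deadline_us, high_period_us) * overhead_us;
    return loss < deadline_us ? deadline_us - loss : -1;
}
```

With the 1555 microsecond worst case, a 1 s deadline shared with a 100 ms high-criticality task loses about 17 ms under this bound, while a deadline of 1.5 ms cannot absorb even a single preemption.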

These experiments on a simple prototype platform demonstrated a significant overhead, but
one that is low enough that the effect can be mitigated by incorporating the overhead into
the scheduling theory. However, using the worst-case 1.5ms, this limits low-criticality tasks
to periods of 1.5ms or longer. The overhead is mostly attributed to the cost of entering and
exiting the #pragma omp parallel directive each period, as shown in Figure 3.1. For a
reference system like the one we have described here, the choice of including the parallel
directive within the periodic invocation greatly simplifies programming and reasoning about
the system, and also allows the user to use existing parallel programs with little modification,
but the overhead may be unsuitably high for practical systems. In a traditional OpenMP
program, the parallel directive would be used once or just a few times. Calling it once every
period exposes an important limitation of this standard parallel concurrency platform when
used in a real-time system, and represents a serious mismatch between the expectations of the
OpenMP system designers and the current use case.

3.2.3 MCFS Validation

We evaluate the implementation of the MCFS runtime system using synthetic workloads
written in OpenMP. Experiments were conducted on a 16-core machine composed of two
Intel Xeon E5-2687W processors (each with 8 cores). When running the experiments, we
reserved two cores for operating system services, leaving 14 experimental cores. Linux with
RT_PREEMPT patch version 4.1.7-rt8 was the underlying RTOS. For each setting, we
randomly generated 100 task sets, each of which runs for 5 minutes (300× the maximum
period).

Now we explain how we generate task sets for the empirical evaluation. In these experiments,
the number of cores $m$ is 14. We construct a task set by repeatedly adding randomly
generated tasks until the MCFS schedulability test cannot admit any more tasks. Tasks are
either high- or low-criticality with equal probability.

Note that the synthetic tasks in these experiments are written in OpenMP. Each task has a
sequence of parallel for loops, or segments. Each iteration of a segment is called a strand.
We generate a task by first randomly choosing a desired overload critical-path length L0, and
then adding randomly generated segments until L0 is reached.

The task parameter generation process is similar to the one in [15]. To generate tasks with
large parallelism, we fix the maximum ratio $p_{max}$ of the overload critical-path length over
period: $p_{max} = \frac{1}{2(2+\sqrt{2})}$. The other parameters are as follows:

1. Criticality $z_i$: 50% high-criticality and 50% low-criticality.

2. Nominal and overload utilization ratio $r_i$ for high-criticality tasks: uniformly from
[0.025, 0.25]; ratio $r_i$ for low-criticality tasks is always 1.

3. Implicit deadline $D_i$: uniformly from 100ms to 1000ms.

4. Max overload critical-path length L0: 40%, 50%, 70% or 100% of $D_i \, p_{max}$, with
probability of 0.4, 0.3, 0.2 and 0.1.

5. Number of strands of a segment $s_{i,j}$: randomly chosen from a log normal distribution
with mean of $1 + \sqrt{m}/3$.

6. Overload length of strands of a segment $t^O_{i,j}$: randomly chosen from a log normal
distribution with mean of 5ms.

7. Nominal length of strands of a segment $t^N_{i,j} = r_i \, t^O_{i,j}$.

With these parameters, we can calculate the nominal and overload work and critical-path
length, which are used in the MCFS schedulability test.
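Parts of the parameter list above can be made concrete with a short sketch. It covers only the deterministic pieces (the constant $p_{max}$, the discrete choice in item 4, and item 7) and takes an already-drawn uniform sample u as input; the log normal draws are omitted. The function names and millisecond units are assumptions of this example.

```c
#include <assert.h>

/* sqrt(2) written as a constant so the sketch needs no math library. */
#define SQRT2 1.4142135623730951

/* p_max = 1 / (2 (2 + sqrt 2)), roughly 0.146. */
static double p_max(void) {
    return 1.0 / (2.0 * (2.0 + SQRT2));
}

/* Map a uniform sample u in [0,1) to the fraction 40%, 50%, 70% or 100%
 * with probability 0.4, 0.3, 0.2 and 0.1 respectively (item 4). */
static double critical_path_fraction(double u) {
    assert(u >= 0.0 && u < 1.0);
    if (u < 0.4) return 0.4;
    if (u < 0.7) return 0.5;
    if (u < 0.9) return 0.7;
    return 1.0;
}

/* Max overload critical-path length: fraction * D_i * p_max. */
static double overload_cp_length_ms(double deadline_ms, double u) {
    return critical_path_fraction(u) * deadline_ms * p_max();
}

/* Nominal strand length t^N = r_i * t^O (item 7). */
static double nominal_strand_ms(double r, double overload_strand_ms) {
    assert(r > 0.0 && r <= 1.0);
    return r * overload_strand_ms;
}
```

For a task with a 1000 ms deadline, the largest possible overload critical-path length under these parameters is therefore about 146 ms.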

3.2.4 Mode Switch Stress Testing

To validate the entire system we conducted experiments to stress test the performance of
the MCFS runtime system in both typical- and critical-states. In typical-state stress testing,
both high- and low-criticality tasks are generated to execute their nominal work and
critical-path length, such that no mode transition is expected. The experimental results were
consistent with the correctness condition: no mode transition occurred and all high- and
low-criticality tasks met all their deadlines. In critical-state stress testing, each task executes
exactly its worst-case overload work and critical-path length. Again, in this worst-case
behavior, the result is consistent with the correctness condition: every high-criticality task
successfully transitions to critical-state and has no deadline misses. Some low-criticality
tasks are preempted by high-criticality tasks, suspend some of their jobs and hence have
deadline misses, which is allowed in a critical-state transition.

Figure 3.4: Fraction of tasks with no deadline miss, for the sets of tasks with high- and
low-criticality, respectively, when increasing the number of high-criticality tasks that overrun
their nominal parameters.

3.2.5 Graceful Degradation

The mixed-criticality correctness condition allows us to discard all low-criticality tasks as
soon as any task misses its virtual deadline and the system transitions to critical-state.
However, our MCFS platform need not do so, as is discussed above. Figure 3.4 demonstrates
that the MCFS runtime system can continue to run many low-criticality tasks even after
some high-criticality jobs transition to critical-state. Here, we generate task sets with at
least 4 high-criticality tasks. For each set, we run 5 experiments: either 0, 1, 2, 3 or 4
high-criticality tasks execute with their overload parameters and the remaining tasks execute
with their nominal parameters. We plot the fraction of tasks with no deadline miss in Figure 3.4.
We can see that all high-criticality tasks always meet their deadlines. In contrast, the
low-criticality task performance does not drop abruptly to zero as soon as the transition occurs,
but rather degrades gracefully as more and more high-criticality tasks exceed their nominal
settings. For instance, when only 1 high-criticality task overruns, only about 33% of the
low-criticality tasks miss their deadlines.

3.3 Discussion of RT-OpenMP vs Federated Scheduling Implementations

The parallel real-time implementations discussed in this chapter and the previous chapter
are useful case studies for current and future parallel real-time systems designers, in part
because their approaches differ so significantly. RT-OpenMP is a restrictive and regimented
system that places heavy assumptions on the kinds of tasks that can execute, while Federated
Scheduling is much less so. It would be intuitive to suspect that a system targeting a specific
subset of programs (RT-OpenMP) would perform better than one that is more general, but
experience shows otherwise. Federated scheduling can execute any program expressible as a
directed acyclic graph, and it has a utilization bound of 50% (which surpasses RT-OpenMP's
bound of 20%). In practice it also executes with much lower overhead, and can achieve much
higher periodic rates.

The question is why, from a whole-system point of view, RT-OpenMP performs poorly in
relation to federated scheduling. RT-OpenMP was constructed in an effort to build a good
parallel real-time system according to the best practices of real-time computing available at
the time. When this approach proved insufficient, it was found that approaching the problem
starting with the principles of parallel computing was far more successful. First we should
rule out some potential differences.

First, greedy scheduling used in federated scheduling is guaranteed to be relatively efficient
in its utilization of available processor resources, while decomposition scheduling used in
RT-OpenMP is not. Is it possible that RT-OpenMP's scheduling method generates inferior
schedules? Probably not: it can be argued that RT-OpenMP generates a greedy-like schedule.
The worst-fit bin packing method used has the effect of heuristically minimizing processor
demand across all processors. This should also yield a schedule that is greedy-like in that
it should be unlikely that some processors will be heavily loaded while other processors are
left idle, unless that condition is an inevitable element of a particular task set.

Second, can the observed performance difference be due to the use of only parallel-synchronous
tasks within RT-OpenMP? Again, probably not. The directed acyclic graph tasks that
federated scheduling permits include all parallel-synchronous tasks as a subset, and intuition
tells us that the dependencies present in parallel-synchronous tasks are at worst no harder
than those found in general directed acyclic graph tasks (and are probably easier).

Thus, the performance difference between RT-OpenMP and federated scheduling is likely
primarily explained by the systems implementation, rather than fundamental differences in
the efficiency of the scheduling policy.

Testing showed that the overheads in the initial RT-OpenMP implementation, with its
regimentation, were extremely high, and unacceptably so for tasks running at rates higher than
500Hz. In contrast, tests of the federated scheduling system, running OpenMP or Cilk Plus code,
such as those that will be discussed in Section 5.3, could run meaningful (but small) tasks
at rates as high as 7000Hz, with substantially computationally expensive tasks running
successfully at 1000Hz. The overheads in RT-OpenMP are due to explicit synchronization, priority
setting, and thread management. All of these require the cooperation of the operating system
to achieve, as the choice to use thread priorities as the preemption mechanism drives the
requirement to have many threads, which then requires the system to create many more
threads than processors, which in turn drives the requirement to use futex sleep-waiting to
avoid priority inversion between waiting threads and running threads. In effect, RT-OpenMP
uses the operating system extensively to manage what work is being done, as well as when
and where it is being done. Federated scheduling relies on the OS only implicitly (to start
threads, etc.).

This leads us to a major contrast between the two systems. Current parallel concurrency
platforms (OpenMP and Cilk Plus) operate almost entirely in userspace, and hence federated
scheduling operates primarily in userspace. RT-OpenMP executes user code in userspace, but
its mechanisms operate mostly in kernelspace. However, the choice between userspace and
kernelspace was not the design decision, but the consequence of a much more fundamental
mechanism.

Critically important is that this dichotomy is not an artifact of chance. It is the inevitable
consequence of two early decisions made in the development of both systems, and reflects
how both systems deal with contention between tasks at a fundamental level. Federated
scheduling takes a hands-off approach: it classifies parallel tasks into high-utilization and
low-utilization and isolates high-utilization parallel tasks from one another on dedicated sets
of cores, while low-utilization tasks disable parallelism. Contention between parallel tasks
is eliminated because parallel tasks are segregated. RT-OpenMP embraces contention and
co-schedules strands of different tasks together on the same processing cores. This then
introduces the need for threads of different process spaces (tasks) to be able to preempt each
other at arbitrary points in time.

On reflection, it becomes clear that this need for arbitrary preemption of threads is in fact the
key differentiating characteristic between RT-OpenMP and federated scheduling, and may
very well be a key defining characteristic of any possible parallel real-time system. Truly
arbitrary preemption cannot be accomplished entirely in userspace with the mechanisms
currently available to systems developers. Arbitrary preemption can only happen when a
processor core is interrupted via an external source: either a software or hardware interrupt
delivered via the OS. Otherwise, the behavior of the processor is to continue executing the
fetch-decode-execute cycle until the currently executing program voluntarily yields control
of the processor.

As a purely parallel system, neither Cilk Plus nor OpenMP has a need for preemption of
threads. It is assumed that all units of work in these systems are equal and the objective is to
maximize throughput of units of work. Scheduling decisions happen to maximize throughput,
not to enforce timing requirements. As a result, these systems are implemented entirely in
userspace. They create only as many OS threads as is necessary to manage all processors.
They use internal work queues to manage a larger number of apparent user level threads.
Switching between units of work only happens when currently executing work voluntarily
yields the processor back to the concurrency platform. All of these mechanisms are readily
achievable without heavy reliance on the OS.

Suppose, as a thought experiment, that one of these systems did want to implement
priority-based scheduling between contending tasks. The only place where preemption may occur
is during voluntary yielding. Hence, such systems have no mechanism to guarantee scheduling
behavior (a misbehaving program may never yield the processor). Even if correctness is
assumed, there is no mechanism to enforce latency during preemption, and preemption can
be delayed arbitrarily. Individual units of work in parallel concurrency platforms tend to
be quite small (on the order of single loop iterations) but this is not a requirement, with
deviations from this possibly leading to unbounded priority inversion.

In contrast, RT-OpenMP achieves true arbitrary preemption via thread priorities and sleeping
via the futex mechanism. Thread sleeping is ultimately rooted in hardware timers, which
permits the CPU to be interrupted by a hardware clock at predictable times. Since future
interrupt times are always known (due to the static schedule used in this system) simple
timers are sufficient to implement the preemption needed for this system. This approach
has definite advantages for real-time computing. First, preemption of a low-priority task for
the sake of a high-priority task cannot be delayed by the low-priority task if it is
misbehaving. Second, the latency is determined by the implementation of the OS mechanisms rather
than the behavior of the low-priority task. This minimizes and potentially bounds priority
inversion if the preemption mechanism is itself bounded.

This link between systems is so fundamental that it was only apparent in hindsight (to this
author) that the mixed-criticality implementation of federated scheduling is actually a hybrid
between RT-OpenMP and traditional federated scheduling. What is the defining characteristic
between regular federated scheduling and mixed-criticality federated scheduling? It is
precisely the need to preempt a low-criticality task at arbitrary times! In traditional federated
scheduling the strict partition between parallel tasks eliminates this need. In MCFS there
are low-criticality and high-criticality tasks that share cores, and when a virtual deadline
is overrun the low-criticality task must be preempted immediately. The mechanism here is
identical to RT-OpenMP: each task has its own set of threads on all relevant processors, with
thread priorities configured appropriately. If a preemption happens it will happen at a known
time, which is the current virtual deadline for some high-criticality task, and the preemption
is induced by a hardware timer waking a set of waiting high-priority (high-criticality)
threads.

If one accepts this premise that the presence (or lack) of arbitrary preemption is the
fundamental distinguishing feature for parallel real-time systems then there are ultimately three
kinds of parallel real-time systems:

1. Systems without arbitrary preemption are more efficient and computationally powerful
due to keeping more code-paths within userspace, and can use any existing parallel
concurrency platform, but are unsuitable for hard-real-time processing due to potentially
unbounded priority inversion.

2. Systems with arbitrary preemption are less efficient due to heavy reliance on OS
mechanisms and the need for more OS threads, and must implement their own work scheduling
(cannot use existing concurrency platforms), but may be more suitable for hard-real-time
processing as they provide more control over the system which helps to bound
priority inversion.

3. There are middle-ground systems that will be variously suitable for real-time processing
depending on the application. For example, user tasks could perform periodic checks
to see if a preemption is needed, and voluntarily yield if asked to do so. This would
bound preemption latency and thus priority inversion to whatever the longest gap is
between checks. This does not protect the system from misbehaving tasks however,
and thus would not be a strong hard-real-time system.
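The middle-ground approach in item 3 can be illustrated with a small sketch of cooperative preemption checks. It is an illustration only: the flag preempt_requested, the type strand_fn, and both functions are hypothetical names, and the choice to poll once before each strand is an assumption of this example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Set by a scheduler (for example from a signal handler) to request the core. */
static atomic_bool preempt_requested;

typedef void (*strand_fn)(size_t index);

/* Runs strands in order, polling the flag before each one. Returns the number
 * of strands completed; a value below num_strands means the task yielded. */
static size_t run_until_preempted(size_t num_strands, strand_fn strand) {
    assert(strand != NULL);
    size_t done = 0;
    while (done < num_strands) {
        if (atomic_load(&preempt_requested))
            break;              /* voluntary yield point */
        strand(done);
        ++done;
    }
    return done;
}

/* Example strand for illustration: the request arrives during strand 2. */
static void demo_strand(size_t index) {
    if (index == 2)
        atomic_store(&preempt_requested, true);
}
```

The yield latency here is bounded by the longest single strand, and a strand that never returns defeats the scheme entirely, which is the weakness the text identifies.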

One question is why this dichotomy has not been explored more fully in the sequential
processing case. Real-time systems have been around for a long time and have needed
to deal with preemption before, so what is different in the parallel context? The biggest
difference is the frequency with which preemption and synchronization are needed in a parallel
context. In a parallel-synchronous task running on RT-OpenMP, for example, each segment
of a task (with many segments per task) requires an explicit synchronization, leading to
many synchronizations per period. Additionally, depending on how strands are packed onto
processors, a single period of task execution may see many preemptions by other,
higher-priority tasks. With parallelism, a single long-running task may be preempted multiple
times on each processor by multiple other higher-priority tasks over its lifetime. Effectively,
preemption and synchronization have gone from being a once-per-period event to a
many-per-period event. The quantity and interleavings of events that occur in the system become
significantly more complex when parallelism is introduced.

Similarly, the overhead of thread management in a sequential context simply may not be
significant enough to be particularly noteworthy. A sequential system (whether multi-processing
or not) would have approximately one thread per task. A parallel system must support
running multiple threads per task, so the worst-case situation would be one thread per task on
each core (which could happen regularly in practice, such as in RT-OpenMP). This presents
a serious scaling concern, since the number of threads in the system would increase
semi-quadratically with the numbers of cores and tasks. Userspace approaches to threading can
be wary of this concern and ensure that all overheads are distributed evenly, but OS-based
approaches to threading would need to be exceedingly careful that there are no hidden
mechanisms or overheads that impose a quadratic overhead burden on any one task or processor.
Such hidden overheads appear to be documented in both [8] and [16].

This tension between the need for arbitrary preemption and more capable parallel real-time
systems has just begun to be explored, and it is not at all clear whether this tension will
be made better or worse by more tightly or loosely coupling the real-time platform with the
OS itself. For example, locks and mutual exclusion are basic primitives in both concurrent
real-time systems and non-real-time parallel systems. Parallel systems have not needed to
contend with work priority in the past, whereas real-time systems have developed concepts such
as priority-ordered locks. Now, in the parallel real-time context, it becomes apparent that
there is a need for preemption mechanisms that take into account userspace lock status. True
preemption of a task due to an external interrupt is necessary for bounding priority inversion
due to scheduling. However, this could have disastrous effects in the real-time context if the
currently executing parallel real-time task is holding a lock that would prevent the incoming
high-priority task from making progress. What is the appropriate resolution here? Should the
parallel real-time systems designer simply eschew OS mechanisms, implementing everything
in userspace where they have total control? Or, does the OS need to become aware of the
locking status of the processes it seeks to preempt at a hardware level?

Similarly, how should future userspace systems handle preemption? What needs to be done
to integrate a traditional concurrency platform like OpenMP or Cilk Plus into a framework
where work prioritization and preemption are expected? Both of these platforms currently
assume that they have a set of OS threads, one per core, that has exclusive control over that
hardware resource. This model clearly does not adapt well if some of those OS threads may
be preempted unexpectedly. Managing work priorities is more achievable, but will present its
own challenges as well. For example, the basic premise of the Cilk scheduler is to decentralize
work management for the purpose of scalability. How is work prioritized and can you prevent
priority inversion in such a decentralized system while retaining a high degree of scalability?

Ultimately, the question of how to achieve good parallel performance across many co-located
tasks with varying real-time priorities may require a strong degree of coupling between the
parallel scheduling runtime and the operating system itself. To rely on the same example
again, the classic work-stealing scheduler scales very well but cannot strongly enforce
prioritization. Prioritization is a basic function of modern OS schedulers even in the non-real-time
context, for example as implemented in the Linux Completely Fair Scheduler (CFS) niceness
system. It seems entirely possible that future systems designers may have to make a choice
of one over the other: either a system may support efficient parallel scheduling of multiple
parallel processes simultaneously, or it can have a rich, strongly enforced priority
structure, but not both simultaneously. It is possible that if a high degree of parallel real-time
performance is ultimately required then this might require an entirely specialized operating
system that re-envisions some of the basic POSIX mechanisms that we take for granted.

Further Challenges

The major systems challenges in the development of RT-OpenMP and MCFS were due
to the blending of two different technological traditions: parallel computing and real-time
computing. Unlike previous systems it was critical to understand how a team of threads
could be managed under the constraint of real-time semantics. Where existing parallel
concurrency platforms reason about a single cooperative team of threads all coexisting on
the same set of processors, these new real-time concurrency platforms must worry about
potentially competing teams of threads belonging to different real-time tasks, potentially of
different priority/criticality levels, and must also be concerned about the ways these teams
interfere with each other. Importantly, due to the desire to make schedulability guarantees
these teams of threads must be managed on specific cores at runtime, unlike existing parallel
systems where threads and/or work are free to migrate.

In both RT-OpenMP and MCFS it was important to limit overheads due to interference or
to make such overheads regular enough to be incorporated into scheduling analysis. One
major question is what kinds of adaptations could be made to a system like RT-OpenMP
in order to reduce overhead and thus become more competitive with a userspace system
like federated scheduling. Is the reliance on OS mechanisms simply too great, and poor
performance therefore should be expected? A concentrated effort here, potentially involving
the cooperation of the kernel, could be useful.

Looking to the future of parallel real-time systems development, two obvious directions
appear promising. The first is the extension to dynamic scheduling. Both the RT-OpenMP
and the federated scheduling systems rely on static scheduling of parallel real-time workloads
prior to runtime. Where dynamism is tolerated (e.g., mixed-criticality federated scheduling)
it is also analyzed and arranged prior to runtime, with the overall system only allowed
to exist in one of a set of previously arranged modes. Moving beyond statically arranged
systems could be done by truly dynamic scheduling, with a dynamic scheduler making all
decisions at runtime (potentially with a parallel real-time aware scheduler at the
operating system level). An intermediate step might be a static schedule that is periodically
updated at runtime, for example in an admission control scenario. In both cases the system
constraints and objectives seen in the development of RT-OpenMP and MCFS appear to be
relevant. How threads are activated and managed, and what overheads are inherent to the
ability to call up or dismiss threads, remain important considerations.

A second area for future work is the extension to hard real-time parallelism. Existing systems
have only tenuously explored this topic, since the predominant infrastructure for parallel
real-time computing is currently soft real-time. This future effort is likely to require significant
dependence on a hard real-time kernel in order to manage sets of competing or even possibly
antagonistic teams of threads. The traditional Linux real-time OS architecture may not be
suitable in some such cases, and concepts presented here may require a greater degree of
control than Linux currently affords. The prospect of adding hard real-time performance
to a parallel computing platform also raises the question of whether it makes sense to move
a full-featured concurrency platform down into the kernel, where the concurrency platform
itself can reason about and select from all tasks on the system, or to pull more things up into
userspace, where concurrency platform and userspace tasks can be more tightly integrated.

Lastly, more work is needed in the general area of parallel real-time concurrency and
synchronization mechanisms for parallel programming, such as [17]. Programmers expect to
have a variety of parallel programming primitives at their disposal, and the MCFS
implementation, for example, only supports barrier synchronization. Unlike traditional real-time
synchronization mechanisms, where the rate of synchronization might be expected to be
roughly once per period per task, parallel synchronization methods are expected to manage
the many activities of a team of threads multiple times per period. This could mean
synchronizing multiple times per thread per period, which suggests that overheads may become
relevant quickly. At a minimum, testing is required to quantify the effects of these primitives
for more time-sensitive applications.

Chapter 4: CyberMech, A Concurrency Platform
for Real-Time Hybrid Simulation

This chapter discusses the design and implementation of the first concurrency platform for
Real-Time Hybrid Simulation (RTHS), called CyberMech, which allows for parallel code
execution in an RTHS context. The leading existing software platform for RTHS is a
proprietary real-time operating system designed to execute MATLAB and Simulink software in
real-time, called xPC Target. This product contains very limited support for parallel
execution of code, and the flexible and efficient parallel execution found in modern concurrency
platforms such as Cilk Plus or OpenMP is not possible with it. CyberMech also addresses
the needs of RTHS as a parallel real-time cyber-physical application, by managing multiple
communicating concurrent real-time processes and non-thread-safe data acquisition software.

4.1 Background on RTHS

Real-Time Hybrid Simulation (RTHS) reduces the effort and cost of structural validation and
experimentation in structural, earthquake, and mechanical engineering by replacing physical
structural elements with simulated specimens. This reduction in time and cost enables new
testing regimes which otherwise would be infeasible with full-scale structural validation.
RTHS also is advantageous in that it allows investigators to conduct more experiments and
conduct them more quickly. This is especially useful for validating modern smart structures,
which are expected to survive greater and more varied threats and thus must be validated
under more scenarios. RTHS also permits experiments that previously were too costly or
difficult to achieve. For example, full-scale destructive physical testing of entire large bridges
and skyscrapers may never be feasible, but RTHS allows a destructive physical test of select
elements of such structures while simulating the vast majority of the test structure.

Both of the traditional structural engineering validation methods, physical testing and
simulation, have significant limitations when used in isolation. Testing of physical specimens is
the gold standard for any engineering validation, but is expensive as it requires creating a test
subject, instrumenting it with sensors and actuators, designing and validating controllers,
and setting up an experimental environment. These costs are multiplied in the event that
an experimental specimen is large or is part of a larger structure. For example, the Large
Outdoor Shake Table at the University of California, San Diego (part of the NSF program for
Natural Hazards Engineering Research Infrastructure) is capable of performing full-scale
structural tests for multi-story buildings, but single tests require months to assemble and
tremendous expense to build. Such a testing environment presents unique challenges: testing a
seismic mass to collapse is dangerous to the test apparatus itself, due both to the tremendous
energies involved and to the risk of debris falling and striking the shake table. In such scenarios
structures must be supported by safety restraint towers designed to catch falling structures.

In contrast, structural simulation is far easier to run (needing only sufficient computational
resources), and with modern simulation methods is relatively easy to design and implement.
Moreover, once created, structural simulations can be designed and reconfigured much more
rapidly than physical specimens can be constructed. However, structural simulation cannot
be employed with high fidelity when an accurate model of the entire structure does not exist,
even if only a small subset of the overall structure cannot be simulated.

RTHS, which integrates physical testing and simulation at physically realistic time scales,
combines the advantages of both approaches, and in doing so largely mitigates each method's
limitations. As such it is a useful technique in many validation scenarios, but the situation
described above is particularly common in earthquake engineering laboratories that develop
novel structural safety mechanisms. Structures are particularly prone to damage as a result
of low-frequency oscillations at or near the structure's vibrational modes, so new mechanisms
are being engineered to absorb or divert energy away from these particular frequencies. For
example, dampers of different types can be used to absorb structural energy, but doing so
can change the overall structural response in unexpected ways. Rather than testing a new
damper in isolation, it is far more effective to test it in the context of an actual structure, but
neither traditional validation method is suitable for this. Building a real structure (especially
a full scale structure) is prohibitively expensive, especially if the structure is large or may
be damaged. Simulating such a damper inside a structure is not feasible, since the damper
itself is a prototype. RTHS can remedy this situation by physically testing the damper and
simulating the rest of the structure, and then joining both physical and numerical parts
together in a way that is valid and realistic.

This combination of structural simulation and physical experimentation creates a hybrid
structure. Done correctly, this hybrid structure emulates the behavior of a full physical
structure with high fidelity using only a fraction of the time and expense of a full physical
specimen. To conduct a hybrid simulation, the physical components of the test are
constructed (the physical substructure), while other components are numerically simulated (the
numerical substructure). At test time, the hybrid structure can be subjected to experimental
loading in either the physical or numerical parts, or both. The numerical simulation
calculates the effects of any numerical loading on the simulated structure as well as the
effect that the simulated structure has upon the physical specimen. These effects are then
applied to the physical specimen using a set of actuators, and the specimen's response is
recorded via a set of sensors. This physical response is then inserted back into the numerical
simulation, forming a feedback control loop. Such hybrid decomposition forms an explicit
cyber-physical boundary (structural elements that are connected to both the numerical
simulation and physical specimen), and the objective of any RTHS is to ensure that this boundary
is in equilibrium at all relevant points in time.

This is depicted in Figure 4.1. On the left, recorded earthquake ground motion acceleration
data are fed into a numerical simulation of a building. The effect on the physical components
of the building is calculated, and given to an actuator controller. The controller computes
the necessary actuation to apply the desired load and does so. Then, the result on the
structure is measured via sensors, which is then fed back into the numerical simulation. The
numerical simulation is typically highly amenable to parallelization, and thus often requires
the vast majority of processing power in any linear hybrid simulation.

The difference between hybrid simulation and real-time hybrid simulation is the timing
constraint placed upon the numerical model. In traditional hybrid simulation, it is not
uncommon to have a single simulation step take minutes or even hours of real world time
to compute. After each simulation step is computed the resultant forces are applied to the
physical structure, which is then allowed to settle into a state of equilibrium. Thus, this
technique is only able to capture the static effects of load on a structure. In RTHS, the goal
is to achieve a simulation that can be computed in real-time alongside a physical experiment,
which allows engineers to capture dynamic effects that can play a significant role in structural
performance.

[Figure 4.1 diagram: Real-Time Hybrid Simulation Execution Loop. Cyber: Input Data,
Numerical Simulation, Actuator Controller. Physical: Actuators, Sensors.]

Figure 4.1: The fundamental RTHS control loop. The results of a numerical simulation are
used to excite a physical specimen, and the measured response is fed back into the numerical
simulation. The inner and outer control loops may execute at different speeds. In the case
of structural engineering, recorded earthquake ground acceleration is used to excite a simulated
building.

Depending on the dynamics of a particular numerical simulation, it may be advantageous to
execute multiple numerical simulations at multiple rates or multiple resolutions. Rather than
using a single monolithic simulation, the simulated structure may be broken into regions of
varying interest as a way to allocate computational resources. If a section is of particular
interest or moves quickly it may be simulated at a higher periodic rate or with higher fidelity
than elsewhere. If a section moves slowly or is uninteresting, it may be simulated at a slower
rate and with less fidelity. The former technique (varying the rate) is referred to as
multi time-stepping, while the latter (varying the fidelity) is multi-scale modeling. Both
approaches add unresolved complications to the overall parallel real-time cyber-physical
system, and this work is primarily concerned with single time-step, single scale RTHS.
Moreover, the development of theoretical methods for separating models and coupling them
in these ways is currently ongoing work.

A preliminary step in conducting an RTHS is to conduct a virtual RTHS, where the physical
specimens, actuators, and sensors are also simulated. This is useful for debugging simulation
and control code in a manner that is relatively safe. A reasonably accurate model of the
physical specimen is used to provide simulated sensor response, which provides rough data
to test the RTHS system prior to using a real specimen. Virtual RTHS does not provide
high quality test data, however, as complete numerical models for simulated physical
specimens are not generally available. Consequently at that stage the system cannot be entirely
validated in principle, but can be said to be free of obvious defects.

An illustrative example of an RTHS experiment is shown in Figure 4.2. In this instance the
bottom two floors of a three story structure are simulated numerically, and the top floor
is implemented on a shake table as a physical scale model. When the structure is at rest
the boundary conditions between the cyber and physical components of the building are
satisfied. However, if we stimulate the bottom of the structure via a recorded earthquake
ground motion then the whole structure begins to move, resulting in a displacement of
each floor of the building. The shake table slides back and forth to implement the third
floor's displacement in the physical specimen, which induces swaying and acceleration in the
physical structure. Of course, each action has a reaction, and the movement in the third floor
then imparts a force back upon the second floor, so the top deck's acceleration is measured
via accelerometers, and those data are fed back into the simulation as forces acting on the
numerical structure. Experimentally this setup could test any mechanism or technique that
modifies the motion of the structure's top story, for example passive or active dampers, or
even active control techniques designed to counteract structural motion.

Figure 4.2: A generic RTHS that decomposes a two-story structure into a real-time numerical
substructure coupled with a shake table experimental substructure.

4.1.1 Structural Simulation Methodology

It is the job of domain experts to identify the numerical simulation methodology most
appropriate for providing accurate experimental results. The distinguishing feature of RTHS
compared to traditional hybrid simulation is that the numerical update for each simulation
step must be reliably computed within a fixed timestep. This numerical update computes
the equations of motion: given the current position, velocity, and acceleration of each
structural node at time $t$, the update computes the new position, velocity, and acceleration of
each node at time $t + \Delta t$. Traditional hybrid simulation is likely to employ iterative solvers
that can compute an exact solution, but must converge to that solution over time. Exact
solutions are used because there is an unbounded amount of time to compute each simulation
timestep update, and for that same reason the simulated structure can be arbitrarily large
and complex. The fixed timestep in RTHS requires a different approach. Rather than using
iterative solvers, RTHS (thus far) employs explicit solvers that compute an approximate
solution rather than an exact solution, but execute a deterministic number of calculations
and therefore take a predictable amount of time. Further, due to the fixed timestep, the
size and complexity of numerical simulation is much more limited. Both of these factors
(size/complexity of simulation and accuracy of numerical update) introduce extra concerns
over simulation accuracy. From the perspective of the concurrency platform, parallelism
mitigates both sources of inaccuracy by allowing for more frequent timestep updates as well
as allowing for larger and more complex simulations within that fixed timestep.

The computations associated with the numerical substructure are conducted by expressing
it as a first-order state-space system. This method computes the following equations at each
timestep:

$$y(k) = C\,x(k) + D\,u(k) \tag{4.1}$$

$$x(k+1) = A\,x(k) + B\,u(k), \tag{4.2}$$

where $k$ denotes the current simulation step, vector $u(k)$ is the input to the system, vector
$x(k)$ is the current state of the system, and vector $y(k)$ is the output of the system. Matrices
$A$, $B$, $C$, and $D$ describe the dynamic characteristics of the numerical substructure. For a
system with $n$ states (displacement and velocity degrees of freedom), $p$ input parameters,
and $q$ output parameters, the sizes of these matrices are: $A$ is $n \times n$, $B$ is $n \times p$,
$C$ is $q \times n$, and $D$ is $q \times p$. Typically the number of inputs and outputs of the simulation
will be small relative to the number of states in the simulation.
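
As an illustration, one timestep of Equations 4.1 and 4.2 might be written in C as in the
following sketch. The function name state_space_step and the dense row-major storage layout are
assumptions made for this example, not details of the CyberMech implementation.

```c
#include <assert.h>
#include <stddef.h>

/* One timestep of the discrete state-space update (Eqs. 4.1 and 4.2).
 * All matrices are dense and row-major (an assumed layout):
 * A is n x n, B is n x p, C is q x n, D is q x p.
 * The caller supplies separate buffers for x(k) and x(k+1). */
void state_space_step(size_t n, size_t p, size_t q,
                      const double *A, const double *B,
                      const double *C, const double *D,
                      const double *x, const double *u,
                      double *y, double *x_next)
{
    assert(x != x_next); /* x(k) is still being read while x(k+1) is written */

    /* y(k) = C x(k) + D u(k) */
    for (size_t i = 0; i < q; i++) {
        double acc = 0.0;
        for (size_t j = 0; j < n; j++)
            acc += C[i * n + j] * x[j];
        for (size_t j = 0; j < p; j++)
            acc += D[i * p + j] * u[j];
        y[i] = acc;
    }

    /* x(k+1) = A x(k) + B u(k) */
    for (size_t i = 0; i < n; i++) {
        double acc = 0.0;
        for (size_t j = 0; j < n; j++)
            acc += A[i * n + j] * x[j];
        for (size_t j = 0; j < p; j++)
            acc += B[i * p + j] * u[j];
        x_next[i] = acc;
    }
}
```

Each output row and each state row is computed independently of the others, and the loop bounds
are fixed by n, p, and q, so the number of arithmetic operations per timestep is known in advance.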

The number of states is approximately equal to the number of structural elements in the
simulation times the ways in which they can move. For example, in a two-dimensional
(cross-sectional) simulation each structural element might be capable of moving in the horizontal
and vertical directions, as well as rotating about its center. In this case, the total number of
states is roughly the number of structural elements times three. Hence, the ability to compute
additional states while maintaining an adequate computational rate allows the domain expert
to either introduce additional structural elements, or to model those elements in more detail.

This representation has the benefit, from the computational point of view, that it is
embarrassingly parallel. This makes it particularly suitable to acceleration via parallel real-time
computing and there are well developed computational packages that can be used to
implement this kind of computation.

4.1.2 Shake Table Hardware

The primary test apparatus used in this dissertation is a shake table, though the principles
may be applied to other testing scenarios as well. A shake table is capable of moving in one or
more dimensions, and some support rotation in up to three dimensions as well. Shake tables
are used in conjunction with a scale model bolted to the table. At test time, the table moves
so as to generate a desired structural input, such as recreating a recorded earthquake ground
motion. In RTHS, where structures are partitioned into numerical and physical components, it is
common to numerically simulate the bottom of a structure (and its connection to the ground)
and then use a shake table to implement the top of the structure physically. It is possible to
use multiple shake tables to test larger structures, for example the multiple support columns
of a bridge, where each table implements a separate input to the test specimen.

In this work the primary focus is on a single axis electrically driven shake table, which is
designed to test a two dimensional cross section of a structure. Control over the table itself is
achieved by sending positive and negative voltages to the shake table motor, which directly
control the speed at which the motor turns. The shaft of the motor is rigidly connected
to a worm gear, which drives the tabletop along a set of linear rails. Thus, the angular
speed of the motor controls the linear speed of the table, and the table naturally operates
via velocity control. The motor itself is also instrumented with an angular encoder capable
of measuring the angular displacement of the motor in 1000 fine-grained increments per
rotation (each less than a degree). Furthermore, one rotation of the motor corresponds to a
linear movement of one centimeter, meaning that the linear displacement of the table can be
measured accurately to 0.01 millimeters. Direct control over the table's positioning is
accomplished via PID control based on this angular displacement sensor and the table's velocity.
Data is gathered from the shake table specimen via a set of accelerometers, from which the
specimen's velocity and position can be estimated. This setup is depicted in Figure 4.3.
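
The encoder resolution follows from simple arithmetic: 10 mm of travel per rotation divided by
1000 counts per rotation gives 0.01 mm per count. A minimal sketch of that conversion is shown
here; the constant and function names are hypothetical, and only the two numeric values come
from the description above.

```c
/* Values taken from the text: 1000 encoder counts per motor rotation and
 * 1 cm (10 mm) of table travel per rotation. The names are hypothetical. */
#define COUNTS_PER_ROTATION 1000
#define MM_PER_ROTATION 10.0

/* Convert a signed encoder count into linear table displacement in mm. */
double encoder_counts_to_mm(long counts)
{
    return (double)counts * MM_PER_ROTATION / COUNTS_PER_ROTATION;
}
```

A single count therefore corresponds to 0.01 mm of table travel, and a full rotation of 1000
counts corresponds to 10 mm.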

A variety of other hardware exists that may be incorporated in different or future
experiments. Structural engineers commonly use hydraulic actuators to test larger structures
because of the larger forces required, and hydraulic actuators are also used
to drive larger multi-axis shake tables, which then require the tight cooperation of many
separate actuators to correctly recreate a single desired motion of the table. Other sensing
methodologies are used in addition to, or instead of, accelerometers as well, such as
force-sensing load cells. The specific choice of hardware for a particular experiment will be driven
by the physical requirements and the magnitude of the forces involved, as well as the
experimental setup. Where the position and velocity of the physical specimen are paramount,
encoders and accelerometers are likely to be used. Where the specific forces imparted to
and from the specimen are important, load cells will be used. All of these methods use
a variation on the PID control scheme above, where a directly controlled physical quantity
(e.g., the displacement of a piston) can be used to induce a desired physical condition in
real-time (e.g., the force imparted to a structural element).

Figure 4.3: The electronic shake table used for experimental evaluations in this chapter.

4.2 RTHS Challenges for CyberMech

RTHS, as an engineering discipline, represents a larger challenge than the simple tradeoff
between computational resources and experimental fidelity. RTHS involves validating
structural components and scale models under conditions normally considered dangerous (e.g.,
earthquakes, blasts, or other destructive events), so the
equipment must necessarily be powerful and capable of recreating potentially dangerous
conditions. Thus, there is special concern over safety. The primary safety concern during an
RTHS experiment is that actuators must not be commanded beyond their design limits, either
intentionally or accidentally. If this happens at a high velocity the machine comes to a
crashing halt, and even at low speed it has the potential to destroy the test apparatus or physical
specimen.

One potential cause of these crashes is control instability. At each timestep the RTHS
software needs to compute an actuator command update so the continually evolving physical
situation matches what is desired. In the single-axis shake table these actuator commands
are computed with the proportional difference control method: subtracting the table's
current location from its desired location, multiplying by a constant, and treating the
resultant value as the voltage supplied to the shake table motor. For example, if the table is
1 cm to the left of the desired setpoint and the control constant is 2, then this would result
in a positive two volt actuator command. If the control distance is twice as far then the
control voltage doubles, and if the distance is three times farther then the control voltage
triples. Instability occurs when the difference between the desired location and the perceived
location of the table generates excessive commanded motion, so that the table overshoots the
desired set point by a margin greater than the original difference, resulting in an even more
excessive commanded motion. This could initiate a cycle of increasingly aggressive motor
commands, each cycle overshooting slightly farther, until eventually the actuator reaches its
mechanical limit (typically at a high velocity).
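
The control law and the overshoot cycle described above can be sketched as follows. The control
law itself comes from the text; the plant model is an assumption made only for illustration (the
table is taken to move a fixed distance per volt during each control step), and all names are
hypothetical.

```c
#include <assert.h>

/* Proportional control law from the text:
 * voltage = gain * (desired position - current position). */
double proportional_command(double gain, double desired_cm, double current_cm)
{
    return gain * (desired_cm - current_cm);
}

/* Assumed plant model (not the real table): during each control step the
 * table moves cm_per_volt_step centimeters for every volt commanded.
 * Returns the absolute position error after `steps` control updates. */
double simulate_error(double gain, double cm_per_volt_step,
                      double desired_cm, double start_cm, int steps)
{
    assert(steps >= 0);
    double pos = start_cm;
    for (int k = 0; k < steps; k++) {
        double volts = proportional_command(gain, desired_cm, pos);
        pos += cm_per_volt_step * volts;
    }
    double err = desired_cm - pos;
    return err < 0.0 ? -err : err;
}
```

Under this simplified model each step multiplies the error by (1 - gain * cm_per_volt_step). When
the product gain * cm_per_volt_step exceeds 2, every step overshoots by more than the previous
error and the error grows without bound, which is the runaway cycle described in the text.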

The second cause of commanding an actuator beyond its design limit is general programming
errors. An experiment designer may unknowingly construct a scenario that causes this to
happen (e.g., replicating an earthquake that causes a too-large ground displacement), or may
accidentally (e.g., due to an uninitialized variable) send an explicit out-of-bounds command.
This is exacerbated by the control system: as described in Section 4.1.2, it is common for
the directly controlled system variable (e.g., voltage) to differ from the safety-critical system
variable (e.g., position). In the shake table setup described previously the table velocity is the
directly controlled system parameter, but the safety constraint is expressed as upper and
lower bounds on the table's displacement. Thus, it is insufficient to simply bound the control
output of the system. If the safety question could be resolved merely by excluding certain
control outputs then it would be generally safe to assume that any control system that is
functional enough to send control outputs is also functional enough to check whether
a commanded output falls into the excluded category and handle that event appropriately.
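
To make the distinction concrete, the following sketch checks a voltage command against
displacement limits rather than against a voltage range. It is illustrative only: the limit
structure, the function name, and the one-step motion model (a fixed distance per volt per
control step) are assumptions, not part of the platform described here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical travel limits; real limits depend on the specific table. */
typedef struct {
    double min_cm;
    double max_cm;
} travel_limits;

/* Returns true if applying `volts` for one control step is predicted to keep
 * the table inside its travel limits. cm_per_volt_step is an assumed model
 * of how far one volt moves the table during one step. */
bool command_is_safe(travel_limits lim, double position_cm,
                     double volts, double cm_per_volt_step)
{
    assert(lim.min_cm <= lim.max_cm);
    double predicted_cm = position_cm + volts * cm_per_volt_step;
    return predicted_cm >= lim.min_cm && predicted_cm <= lim.max_cm;
}
```

With limits of plus or minus 5 cm and an assumed 0.5 cm per volt per step, a 2 volt command is
acceptable when the table sits at 0 cm but not when it sits at 4.5 cm. The same control output is
safe in one state and unsafe in another, so a bound on voltage alone cannot express the constraint.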

The separation between the system input and safety criteria is especially problematic when
considering how a system might recognize hardware or software errors and come to a safe
halt. Traditional approaches to fault tolerance in hard real-time systems (and indeed
cyber-physical systems in general) are less suitable for RTHS, where testing is conducted in a
laboratory environment and so hazards need only be contained rather than eliminated. Moreover,
a major purpose of RTHS is to make structural validation easier and more affordable. Two
key approaches to hardware fault tolerance, replication and state estimation, require
multiple redundant sensors, which add cost and complexity and also consume additional data
acquisition resources. For similar reasons, approaches to software fault redundancy such as
N-version programming are unsuitable. One cannot make a completely general statement to
this effect, but RTHS occupies a cyber-physical design space where safety is important, but
safety features perhaps should be implemented in software rather than hardware whenever
possible. This is also true of other cyber-physical systems (perhaps more safety critical) that
simply cannot afford to add additional hardware due to system constraints (e.g., lightweight
aerial drones).

4.3 Computational Architecture for RTHS

CyberMech combines a parallel real-time concurrency platform with support for executing
RTHS experiments, which enables both RTHS and virtual RTHS experiments with inter-
and intra-task parallelism. In particular, this platform provides support for running multiple
parallel real-time numerical substructures, an inter-process communication mechanism for
multiple periodic tasks via a shared memory mechanism, and a dedicated hardware
control task so as to utilize non-thread-safe software for the purpose of sending and receiving
signals with external hardware. This section describes both the high-level details of the
computational platform, and the methodology used to integrate it with physical apparatus.

The implementation described in this work is built atop Linux with the RT-PREEMPT
patch, and is written in C. This allows for numerical simulation and control algorithms to
be written using C/C++, gives access to a wide range of Linux services, and allows parallel
programs to use general parallel environments such as OpenMP [5] or Cilk Plus [18]. As a
concurrency platform, a central component is the federated scheduling system [13], which
enables parallel real-time behavior. As described in Chapter 3, the federated scheduling
algorithm partitions tasks onto processors prior to runtime depending on each task's processor
utilization. High utilization tasks are those with utilization greater than one (which therefore
must exploit intra-task parallelism to meet their deadlines). These are given exclusive use
of a suitably large set of processors and as a result experience no contention or interference
from any other tasks on the system. Low utilization tasks have utilization less than one,
do not require parallelism to meet their deadlines, and are executed sequentially on the
remaining non-exclusive processors with rate monotonic scheduling. This work does not claim
any novelty in the theory of scheduling parallel real-time tasks, but rather we extend the
work in [13] with a platform that enables actual RTHS experiments to be run efficiently via
a clean interface.
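
The split between high and low utilization tasks can be sketched in a few lines of C. This
chapter does not restate how many processors a high utilization task receives, so the sketch
assumes the rule from the federated scheduling literature as we understand it: a task with work C,
span L, and implicit deadline equal to its period T receives ceil((C - L) / (T - L)) dedicated
cores. The function name and the use of integer microseconds are hypothetical.

```c
#include <assert.h>

/* Returns the number of dedicated cores for a high utilization task
 * (work > period), 0 if the task does not need dedicated cores in this
 * sketch (work <= period), or -1 if the span alone meets or exceeds the
 * period, in which case no number of cores can meet the deadline.
 * Assumed rule: ceil((work - span) / (period - span)). */
int dedicated_cores(long work_us, long span_us, long period_us)
{
    assert(span_us > 0 && period_us > 0 && work_us >= span_us);

    if (span_us >= period_us)
        return -1;
    if (work_us <= period_us)
        return 0; /* executed sequentially on the shared processors */

    long num = work_us - span_us;
    long den = period_us - span_us;
    return (int)((num + den - 1) / den); /* integer ceiling of num / den */
}
```

For example, a task with 10000 microseconds of work, a 1000 microsecond span, and a 4000
microsecond period has utilization 2.5 and would receive three dedicated cores under this rule.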

A system overview involving this RTHS platform is given in Figure 4.4. The computational
portion of the system consists of several tasks which are either numerical simulation models
or control tasks. The figure illustrates two important modifications that adapt a general
purpose real-time concurrency platform for conducting RTHS experiments: (1) enabling
thread-safe hardware access, and (2) inter-process communication between multiple parallel
real-time tasks.

4.3.1 Specifying RTHS Computations

This section describes the programming interface of the CyberMech system. The
computational tasks that run on CyberMech (numerical simulations and control tasks) are programs
written in C or C++. These tasks are periodic programs and must conform to a particular
pattern that supports the notion of periodic execution. A programmer must implement each
task in three sections: init, run, and finalize. The init and finalize portions are
executed in a non-real-time fashion before and after real-time execution, respectively. This
allows a program to perform costly one-time operations such as allocating and deallocating
memory without interfering with real-time performance. The run function is executed
periodically in real-time. When the platform executes a set of tasks, meaning a set of C or
C++ programs that have defined the init, run, and finalize entry points, the platform
guarantees that all tasks will complete their init block before any task proceeds to executing
its run block, and likewise will ensure that all tasks have finished all periodic executions of
the run block before any task proceeds to executing its finalize block.

[Figure 4.4 diagram: Parallel Real-Time Hybrid Simulation Overview. Computational
Infrastructure: Model 1, Model 2, and Model 3 (each a Parallel Task), Shared Memory, Hardware
I/O Thread(s) (Sequential Task), Reserved for OS, DAQ Hardware. Physical Specimen: Actuators,
Sensors.]

Figure 4.4: An overview of the CyberMech platform with three numerical models and a
dedicated hardware I/O task. Using federated scheduling, CyberMech clusters contiguous
groups of processors and devotes each group to a specific task. The hardware I/O process
interprets simulation results and drives physical actuators via data acquisition hardware,
which generates and receives analog and digital signals. Communication between
computational tasks and the I/O task is accomplished via writing to shared memory. One processor
is reserved for the host operating system.
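
The three-phase pattern can be sketched as follows. The text names the init, run, and finalize
entry points but does not give their signatures, so the prototypes, the task_t descriptor, and
the run_task_set driver below are assumptions. The driver is a sequential stand-in that only
mimics the phase ordering the platform guarantees; it performs no real-time scheduling.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical task descriptor with the three entry points named in the text. */
typedef struct {
    int  (*init)(void *state);     /* non-real-time setup, e.g. allocation */
    int  (*run)(void *state);      /* one periodic iteration */
    void (*finalize)(void *state); /* non-real-time teardown */
    void *state;
} task_t;

/* Sequential stand-in for the platform's ordering guarantee: every init
 * completes before any run, and every run completes before any finalize.
 * Returns 0 on success or -1 as soon as an init or run reports failure
 * (error handling is deliberately minimal in this sketch). */
int run_task_set(task_t *tasks, size_t ntasks, int iterations)
{
    assert(tasks != NULL);
    for (size_t i = 0; i < ntasks; i++)
        if (tasks[i].init(tasks[i].state) != 0)
            return -1;
    for (int k = 0; k < iterations; k++)
        for (size_t i = 0; i < ntasks; i++)
            if (tasks[i].run(tasks[i].state) != 0)
                return -1;
    for (size_t i = 0; i < ntasks; i++)
        tasks[i].finalize(tasks[i].state);
    return 0;
}

/* Example task that counts how many times each phase was executed. */
typedef struct { int inits, runs, finals; } counter_state;

int counter_init(void *s) { ((counter_state *)s)->inits++; return 0; }
int counter_run(void *s) { ((counter_state *)s)->runs++; return 0; }
void counter_finalize(void *s) { ((counter_state *)s)->finals++; }
```

Running two counter tasks for five iterations leaves each with one init, five runs, and one
finalize, in that phase order.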

To create a task set, a list of tasks is given to the system with each one's desired real-time
parameters: the periodic execution rate, how long they should execute, as well as the work
and the span of each task. When a set of such tasks is given to the system, the system will
schedule these tasks on a desired set of processors. The scheduling method is described in
Chapter 3, and provides a utilization bound of 50%, meaning that any task set with total
system utilization less than or equal to 50% will be guaranteed schedulable. In the event
that the scheduling algorithm cannot guarantee schedulability, the platform will still provide a best-effort schedule by assigning all high-utilization tasks to their dedicated processors and using the remainder to execute low-utilization tasks. In practice, task sets with much higher utilization still may work under this best-effort approach, as in [8], but the system cannot make a theoretical assurance as to their performance.
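As a rough illustration of the admission test described above, the following sketch checks a task set against the 50% bound. It assumes that "total system utilization" means the sum of work/period over all tasks divided by the number of processors, which is an interpretation rather than something stated here. The `dedicated_cores` rule, ceil((work - span) / (deadline - span)), is the core assignment commonly given in the federated scheduling literature; it is included as an assumption, and Chapter 3 remains the authority on the actual method. All names and task parameters are hypothetical.

```python
import math


def utilization(work, period):
    """Utilization of one task: total work per period."""
    return work / period


def within_utilization_bound(tasks, processors, bound=0.5):
    """True if the summed utilization is at most `bound` of the processor count."""
    total = sum(utilization(t["work"], t["period"]) for t in tasks)
    return total <= bound * processors


def dedicated_cores(work, span, deadline):
    """Assumed federated core assignment for a high-utilization task."""
    if span >= deadline:
        raise ValueError("the span must be shorter than the deadline")
    return math.ceil((work - span) / (deadline - span))


# Hypothetical task set; times are in milliseconds and deadlines equal periods.
tasks = [
    {"work": 6, "span": 1, "period": 4},  # high utilization (1.5)
    {"work": 1, "span": 1, "period": 4},  # low utilization (0.25)
]
fits_on_8 = within_utilization_bound(tasks, processors=8)  # 1.75 <= 4.0
fits_on_3 = within_utilization_bound(tasks, processors=3)  # 1.75 >  1.5
cores_for_first = dedicated_cores(work=6, span=1, deadline=4)
```

A task set that fails the bound, such as the three-processor case here, would fall back to the best-effort schedule rather than being rejected.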

It is hard to overstate the convenience of one feature of the federated scheduler: it makes no assumption about the dependency structure of parallel tasks, and thus it allows for the use of any existing parallel programming language. In order to apply parallelism to a real-time task, one can use existing parallel languages such as OpenMP or Cilk Plus to insert parallel statements anywhere in the real-time task code, or one can use a library that provides parallelization, such as parallel linear algebra libraries. The inclusion of such non-real-time software or runtimes necessarily places this system in the soft-real-time category, but the benefit of drawing upon the extensive development of such libraries for parallel programming is enormous. However, if inclusion of such code is not desired then the end user may use other parallel scheduling software. The federated scheduler only assumes that the underlying parallel scheduler is work conserving, so any parallel scheduler with this characteristic will comply with the theoretical analysis of the federated scheduler (modulo scheduling overheads). The evaluations described more fully in Chapter 5 show that both OpenMP and Cilk Plus are reasonable for tasks executing at periodic rates of up to several thousand Hertz.

4.3.2 Thread Safe Hardware I/O

One challenge in executing parallel computations for cyber-physical systems is interfacing
with hardware that was not designed for use with parallel or even multi-threaded programs.
The data acquisition device drivers for the shake table setup explicitly state that they are
not thread safe, and in independent evaluation we found this to be true [19]. Attempting to
access a hardware device from multiple threads (even on the same processor) can result in a
variety of problems, such as data corruption, intermingling of return values, and segmentation
faults.

As Figure 4.4 illustrates, this problem is solved with a simple mechanism (first developed in [19]) in which all hardware requests are routed through a single thread that is pinned to a reserved processor. All tasks request data from this thread, and this thread polls the data acquisition cards to get the requested data. This design also has the advantage that it reduces the overheads of reads and writes in the parallel tasks themselves and moves them to a separate process. This solution is adequate for the one-axis shake table with a small number of sensors and tasks. Undoubtedly this centralized sequential component will not scale for a large number of independent data acquisition tasks. However, it is not obvious how to resolve this tension at a large scale without sacrificing communication latency or overhead. This merits further investigation, but without data acquisition hardware/software capable of interacting with multiple threads simultaneously there is apparently little that can be done.
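The routing idea can be sketched as follows. This is an illustration only: `FakeDaqDevice` is a hypothetical stand-in for the vendor driver, the request queue replaces the platform's shared-memory interface, and processor pinning is omitted. The point is that computational threads never call the device themselves; only the dedicated I/O thread does.

```python
import queue
import threading


class FakeDaqDevice:
    """Hypothetical stand-in for a non-thread-safe data acquisition driver."""

    def __init__(self):
        self.caller_threads = set()

    def read_channel(self, channel):
        # Record which thread touched the device, then return a dummy sample.
        self.caller_threads.add(threading.get_ident())
        return channel * 10


def io_thread_main(device, requests):
    """Service requests one at a time; only this thread touches the device."""
    while True:
        item = requests.get()
        if item is None:  # sentinel: shut down
            break
        channel, reply = item
        reply.put(device.read_channel(channel))


def request_read(requests, channel):
    """Called by computational threads: enqueue a request and wait for the answer."""
    reply = queue.Queue(maxsize=1)
    requests.put((channel, reply))
    return reply.get()


device = FakeDaqDevice()
requests = queue.Queue()
io_thread = threading.Thread(target=io_thread_main, args=(device, requests))
io_thread.start()

results = {}


def compute_task(channel):
    results[channel] = request_read(requests, channel)


workers = [threading.Thread(target=compute_task, args=(c,)) for c in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()

requests.put(None)
io_thread.join()
```

Because every request passes through one queue, the device sees strictly sequential calls, which is also why this component becomes the bottleneck as the number of data acquisition tasks grows.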

It is important to note that when sampling should occur depends greatly on the control-theoretic assumptions made by the RTHS control algorithm. This mechanism allows the decoupling of periodic execution from sampling, but whether or not this approach still meets the needs of a specific simulation is application-dependent. Use of this technique in future RTHS will require careful co-design of the sampling scheme and control algorithms.

The single-axis shake table platform contains two National Instruments data acquisition cards (NI-m6259) capable of sending and receiving analog and digital signals. Accessing these signals through data acquisition devices involves a variety of overheads, including calling a proprietary driver with unknown runtime characteristics, and represents a significant blind spot for the purpose of building real-time software. These overheads can be measured by recording the time it takes to complete hardware access driver calls. In this system the relevant measure of hardware I/O overhead and latency is the time it takes for a read or write to occur in response to a software event in the system. The time it takes to perform an actual operation in hardware (i.e., to utilize the analog-to-digital or digital-to-analog converters) is negligible: the NI-m6259 is capable of 1.25 million samples per second at maximum sampling rate. However, the time it takes for the data acquisition system as a whole to react to an event, which includes the lengthier process of invoking the driver software via the OS kernel, is much longer. This is essentially the amount of time it takes for the data acquisition system to arbitrarily reconfigure itself in response to a software event.


                Min. (µs)    Avg. (µs)    Max. (µs)
Analog Write    110          113          155
Analog Read     111          115          170
Digital Read    53           55           100

Table 4.1: Observed analog read/write and digital read communication overheads for the electric shake table.

For context, the single-axis shake table uses each of these operations to execute an experiment. Analog reads are used to measure accelerometers. Analog writes are used to control
the motor, and digital reads are used to record the table's position (via the angular encoder).

At a minimum, the simplest RTHS experiment will involve one call of each type each period: a digital read to determine the current motor position, an analog write to update the motor speed, and an analog read to measure the impact on the structure. These overheads are given in Table 4.1. For our target of 1 kHz operation, a single-threaded application will automatically lose at least 28% of its compute time to hardware I/O under average-case assumptions, or 43% of its compute time under worst-case assumptions. This also motivates our decision to separate all hardware I/O onto a separate thread running on a separate processor. When not busy, this separate thread continually updates sensor information so that there is recent data always available and computational threads do not have to block (or queue) for those results. Conversely, hardware write requests can be merely registered with the I/O thread and the computation thread can then continue on its way. In this way, a larger portion of each period can be reserved for computation within the computational tasks.
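The 28% and 43% figures follow directly from Table 4.1, as this short calculation shows. It assumes one call of each type per 1 ms period, as stated above; the variable names are illustrative.

```python
PERIOD_US = 1000  # 1 kHz operation means a 1000 microsecond period

# Average and maximum overheads from Table 4.1, in microseconds.
overheads_us = {
    "analog_write": {"avg": 113, "max": 155},
    "analog_read": {"avg": 115, "max": 170},
    "digital_read": {"avg": 55, "max": 100},
}


def io_fraction(case):
    """Fraction of each period spent in hardware I/O, one call of each type."""
    return sum(op[case] for op in overheads_us.values()) / PERIOD_US


avg_fraction = io_fraction("avg")    # (113 + 115 + 55) / 1000
worst_fraction = io_fraction("max")  # (155 + 170 + 100) / 1000
```

The average case sums to 283 µs per period (about 28%), and the worst case sums to 425 µs per period (42.5%, reported above as 43%).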

4.3.3 Interaction Between Tasks

Federated scheduling theory assumes that all tasks are independent. However, tasks in an integrated RTHS experiment must communicate simulation results and sensor data. This communication between tasks is provided for via Linux's inter-process shared memory mechanism. In general, this communication happens once per period: the fundamental execution cycle is that all tasks read data from shared memory, execute their periodic iteration, and then wait for all other tasks to finish execution. At the end of each period all tasks write their new data into the pool, wait for all other tasks to finish writing, and then begin executing their next periodic iteration, which starts with reading the previously written data. This procedure is adequate for multiple simulations that all run at the same rate.
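The execution cycle above (read, compute, wait, write, wait) can be sketched with a single reusable barrier. This is an illustration under simplifying assumptions: Python threads and a dictionary stand in for processes and Linux shared memory, and the `step` function is a made-up computation.

```python
import threading


def exchange_loop(task_id, shared, barrier, periods, step):
    """One task's periodic cycle: read, compute, wait, write, wait."""
    for _ in range(periods):
        inputs = dict(shared)           # read the data written in the previous period
        output = step(task_id, inputs)  # execute this period's iteration
        barrier.wait()                  # wait until every task has finished executing
        shared[task_id] = output        # write new data into the pool
        barrier.wait()                  # wait until every task has finished writing


def step(task_id, inputs):
    """Hypothetical computation: depends on every task's previous output."""
    return sum(inputs.values()) + 1


shared = {0: 0, 1: 0}
barrier = threading.Barrier(2)
threads = [
    threading.Thread(target=exchange_loop, args=(i, shared, barrier, 3, step))
    for i in range(2)
]
for th in threads:
    th.start()
for th in threads:
    th.join()
```

The first barrier guarantees that no task overwrites data another task is still reading, and the second guarantees that every read in the next period sees a complete set of new values, so the result is deterministic regardless of thread interleaving.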

For communication between different computational models, there is a choice of various forms of synchronization: semaphores, barriers, and queues. These different methods may introduce varying degrees of overhead as they were each designed to perform slightly different tasks. Thus, the overhead of communication between concurrent tasks was measured in order to determine which mechanism was most suitable in this context. All were found to have roughly comparable performance, and the barrier is ultimately used as having the best combination of efficiency and semantics for this platform.

Measuring these overheads in a meaningful context for RTHS was done by determining
which method best enabled a computational task to execute a numerical substructure. In
this experiment there were two tasks: Task 1 (with 1 ms period) and Task 2 (with 2 ms
period) which communicate with each other every 2 ms. Task 1 is fixed at the state size of
350 and we vary the state size of Task 2. Task 2 always writes data and Task 1 always reads.
Table 4.2 shows the number of states achievable by Task 2 before inducing a deadline miss
in Task 1.

In the semaphore method, a semaphore controls access to the shared read/write space. At the start of every period, the writer(s) grabs the semaphore and writes data. It releases the semaphore when it is done, and the reader (the smaller task) grabs the semaphore to read. In the barrier method, the writer writes to shared memory while the reader(s) block on a barrier. The writer then releases the reader(s) when done. This does not require the writer to acquire a lock or to block on readers. In the queue method, the writer writes to a circular buffer and the reader(s) race to keep up. This does not require the writer to block on readers unless they fall so far behind that they saturate the buffers, and readers can fall behind somewhat if they need to. In the control method, the reader(s) and writer do not synchronize at all. This does not provide correct behavior, but places an upper bound on performance. Table 4.2 indicates that the type of method used and the total volume of communication had little effect on the performance of the experiment for tasks with periods in the 1 ms range. Current RTHS numerical substructures use state sizes that are far smaller than the largest value given in this table.

Size (bytes)    Control    Semaphore    Barrier    Queue
8               1100       1100         1100       1100
80              1075       1075         1075       1075
800             1075       1075         1075       1075
8000            1075       1075         1075       1075
80000           1075       1075         1075       1075
800000          1000       950          1000       1000

Table 4.2: Achievable state sizes as influenced by communication sizes and synchronization type. Our specific application uses double width floating point values, which on our machine are 8 bytes, so the first row relates the transfer of one data value, while the last row relates the transfer of 100,000 data values.

4.3.4 RTHS Repeatability on CyberMech

One outstanding challenge in the field of RTHS is the comparison and validation of the different test apparatus in different RTHS labs across the country. The current state of the art in RTHS is an ad-hoc system where each laboratory has constructed its own simulations, control software, hardware, and physical specimens. The result is that even though two RTHS environments may claim to implement the same test scenario, for example to implement a specific ground motion on a shake table, it is not known how to compare the differences in the results between two separate RTHS setups, as possible differences in measurement may stem from any of the four components listed previously. Thankfully, validating CyberMech as a new software platform for RTHS does not require solving this larger problem of cross-site validation, and one can rule it out as a potential source of measurement error by fixing two of the variables above (the actuation hardware and physical specimen) and comparing the results of the same experiment executed on both CyberMech and the existing standard platform for RTHS: xPC Target.

The CyberMech platform is significantly different from the current popular choice, xPC Target, which is a purpose-built operating system designed specifically for real-time hardware sensing and actuation such as that found in RTHS. In contrast, CyberMech is built upon a general-purpose operating system (Linux) with support for real-time operation added by the RT-PREEMPT patch set. These different methodologies yield fundamentally different software platforms, in everything from the real-time scheduling theory behind system operation, to the software architecture and interfaces, to the computational hardware that drives software timing. Despite significant differences in system operation, both systems are designed to achieve the same goal, and one can verify that the CyberMech platform is able to reproduce existing results from sequential RTHS on xPC.

Evaluation of sequential performance of RTHS is affected primarily by three sources of error: experimental sources, numerical integration, and model idealization. In RTHS (and more generally in seismic engineering), experimental sources of error can be subdivided into two parts: epistemic errors (due to scientific uncertainty) and aleatoric errors (due to natural randomness). Sources of epistemic errors are systematic, such as transfer system dynamics, computational delays, communication delays, sensor limitations, and sensor mis-calibration. On the other hand, sources of aleatoric errors are random, such as measurement noise and quantization errors associated with truncations in the analog-to-digital (AD) conversions of signals. Errors due to explicit or implicit numerical time integration schemes can affect the stability and accuracy of RTHS. Most commonly, explicit schemes are employed in RTHS because of their ability to advance the state of the system based only on the knowledge of its current state and the input excitation. Moreover, unlike implicit schemes that sometimes require time-consuming iterations, explicit schemes compute the solution in a single iteration, which leads to predictability in the amount of time required for computations, a necessity for RTHS. Finally, modeling error arises from any discrepancies between the response of the actual (real) substructure modeled as a numerical substructure and the response acquired from its model. These discrepancies result from the underlying assumptions of the numerical model and from errors in the measured responses of the actual structure that are used to calibrate the model.

The CyberMech platform is first validated against xPC using a simulated two-story moment-resisting frame, with the bottom story being the numerical substructure and the top story being the physical component, as is illustrated in Figure 4.5. The frame elements are constructed out of aluminum and have rectangular cross-sections of 4.25 inches × 1/16 inch oriented along the weak axis. The columns are 19.75 inches tall and the beams are 12 inches long. The frame is mounted atop a shake table that is driven with an electromagnetic motor. The numerical model for the bottom story of the frame is assumed to have a single degree of freedom, i.e., a mass, a spring, and a damping component, where each of these properties is identified experimentally. As described in Section 4.1.2, an explicit state-space integrator is used to advance the numerical model in time and PID control is used to drive the motor based on the command signals received from the numerical substructure. To minimize the effect of epistemic errors, an identical RTHS was chosen to be executed on both the CyberMech and xPC platforms using the same numerical model and time integration scheme. Further, to ensure that aleatoric errors such as measurement noise do not impact the RTHS significantly, 100 RTHS runs were conducted on each platform (xPC and CyberMech) for a total of 200 runs in all. To the author's knowledge, this is the first time that the exact physical performance of two differently implemented but otherwise identical RTHS experiments has been compared in such a way.

Figure 4.5: Two-story frame validation RTHS for CyberMech and xPC. (a) Reference system. (b) Real-time hybrid simulation.

As a preliminary check, we first examined the transfer system performance as an epistemic source of error, in order to evaluate the effectiveness of the shake table controller on each platform. Figure 4.6 indicates that the performance of both transfer systems is similar.

Furthermore, the impact of modeling idealization error and epistemic experimental error on each set of runs was studied. We compute the average of the displacement response obtained for the 1st floor across all of the 100 runs at each point in time to cancel out random variations in both sets of runs. This is plotted against a reference solution, obtained from a pure numerical simulation of the two-story frame, as shown in Figure 4.7. The normalized error in displacement for the CyberMech and xPC platforms is then calculated as

\[
\text{Normalized Error}(t_i) = \frac{|RTHS_{avg}(t_i) - REF(t_i)|}{\max(REF(t_i)) - \min(REF(t_i))} \times 100, \tag{4.3}
\]

where the max and min operations pick out the maximum positive displacement and the minimum negative displacement of the reference solution respectively. We also investigated the impact of aleatoric errors for both sets of runs by computing the standard deviation at each point in time for both platforms, as shown in Figure 4.8.

Figure 4.7: Normalized error in displacement of 1st floor resulting from modeling idealization and epistemic experimental sources of error. (a) xPC platform. (b) CyberMech platform. Each panel plots the reference response, the mean response (displacement [cm]), and the normalized response error [%] against time [s].
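The per-timestep statistics used in this comparison (the run-averaged response, the normalized error of Equation 4.3, and the standard deviation across runs) can be sketched in a few lines. The function names and the sample data are illustrative only, and the use of the sample standard deviation (rather than the population form) is an assumption.

```python
import statistics


def mean_response(runs):
    """Average displacement across runs at each point in time."""
    return [statistics.mean(samples) for samples in zip(*runs)]


def normalized_error(rths_avg, ref):
    """Equation 4.3: absolute error as a percentage of the reference range."""
    ref_range = max(ref) - min(ref)
    return [abs(a - r) / ref_range * 100 for a, r in zip(rths_avg, ref)]


def stdev_response(runs):
    """Sample standard deviation across runs at each point in time."""
    return [statistics.stdev(samples) for samples in zip(*runs)]


# Made-up data: two runs of four timesteps each, displacement in cm.
ref = [0.0, 0.5, -0.5, 0.0]
runs = [
    [0.0, 0.4, -0.5, 0.0],
    [0.0, 0.6, -0.3, 0.0],
]
avg = mean_response(runs)
err = normalized_error(avg, ref)
spread = stdev_response(runs)
```

With this made-up data the reference range is 1.0 cm, so an averaged response that is off by 0.1 cm at the third timestep yields a normalized error of about 10% there.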

Figure 4.6: Comparison of transfer system performance. (a) Transfer system performance: xPC. (b) Transfer system performance: CyberMech. Each panel plots measured displacement [cm] against command displacement [cm] for the experimental and ideal cases.

Figure 4.8: Standard deviation in displacement response of the 1st floor for both sets of runs as a function of time. (a) xPC platform. (b) CyberMech platform. Each panel plots the mean value (displacement [cm]) and the standard deviation [cm] against time [s].

Figure 4.9: Normalized difference [%] of the average displacement response of the 1st floor between CyberMech and xPC, plotted against time [s].

It is clear from Figures 4.7 and 4.8 that the difference between the xPC and CyberMech runs is very small, demonstrating the fact that the CyberMech platform introduces little to no quantitative difference into the experiment, even though the software platform driving the experiment is quite different from xPC. We show this difference explicitly in Figure 4.9 by subtracting the average displacement between CyberMech and xPC at each point in time. At all points the difference is less than 2%.

Over a course of 200 sequential RTHS trials, we used xPC and CyberMech to investigate the impact of platform choice on errors stemming from modeling idealization, epistemic, experimental, and aleatoric sources. The results demonstrate that both platforms perform comparably in the sequential case.

4.4 Further Challenges

There are various additional challenges and room for improvement. CyberMech so far only has been applied to a single cyber-physical application (RTHS), and broader experience supporting diverse cyber-physical systems seems beneficial. However, there are also specific future challenges in RTHS that need to be addressed as well. This section examines these challenges in greater detail.

4.4.1 Application to General Cyber-Physical Systems

There is little general experience or knowledge of using parallelization in cyber-physical systems, but RTHS appears to be an excellent exemplar of a large class of cyber-physical applications. Considering Figure 4.1, at an abstract level, this type of operation seems common to any application that involves physical actuation of objects in the world. The essential sequence of events in any agent trying to affect the world is to first decide what to do, then decide how to implement that decision in the real world, then send some control signal to actualize the desired outcome, and finally to measure the actual effect. This process arguably defines cyber-physical systems: they are a subset of all systems, distinguished in that they behave and make decisions in this manner, a concept explored alternately through the lens of the Observe, Orient, Decide, Act (OODA) loop that was the subject of a keynote speech at CPS Week 2018 [20]. In this light, the numerical simulation of RTHS can be thought of as a decision making process, the motor controller decides how to implement the desired action given its available actuators and then performs the actuation, and finally the sensors measure that effect and incorporate the results back into the decision making process. This generalized concept is shown in Figure 4.10.

Figure 4.10: Figure 4.1 transformed into a generalized decision making procedure for cyber-physical applications (General Cyber-Physical Execution Loop; cyber side: System Parameters, High-Level Decision Making, Decision Implementation; physical side: Actuators, Sensors). The decision implementor is not parallelizable in RTHS, but could conceivably be so in other applications. As before, the inner control loop and outer control loop may execute at different rates.

For example, it is difficult to envision a system such as the one described in [21] that would not have some variation of the process depicted in Figure 4.10. Thus, RTHS is an excellent candidate for experimentation in the realm of parallel real-time systems and cyber-physical systems design.

In this view any two cyber-physical systems differ mainly in how they make decisions and then how they decide to implement those decisions. In RTHS the general objective is to maintain equilibrium of forces between the cyber and physical portions of the structure (this in fact was the gist of the CPS Week keynote [20]). If we accurately know the dynamics of the physical and numerical structures then the decision is simply computing a control input that brings those forces closer to equilibrium, and as a practical matter this is accomplished by matching one or two physical dimensions such as displacement or acceleration. In contrast, the decision making process in other cyber-physical systems may be much more complex in some respects, e.g., consider a self-driving car trying to identify all possible obstacles. The self-driving car might be simpler in other respects: in RTHS the physical structure itself reacts to the stimulus (which in turn perturbs the desired equilibrium), leading to second-order or non-linear effects. A self-driving car does not modify the roadway around itself when it implements a decision (though it may modify the behavior of traffic around itself, so perhaps this just illustrates that sometimes the system designer has the option of whether to incorporate such second-order effects or not).

4.4.2 Challenges in RTHS for CyberMech

CyberMech has provided support for RTHS practitioners to perform basic and intermediate-complexity experiments, but much work remains to be done. Some of these challenges are simply ancillary projects not on the critical path of supporting immediate RTHS experimentation and have been passed by for now. Other challenges require a deeper integration of parallel real-time computing with advanced RTHS practice to more deeply explore the design space of truly generalizable and adaptable RTHS.

First, the CyberMech platform has been validated against the prior state of the art for a sequential RTHS experiment. Truly exploring the limitations and abilities of this architecture requires larger experiments in two senses. In the hardware context, CyberMech has so far only implemented RTHS experiments with one actuator and one to four sensors. This is suitable for many RTHS that practitioners would like to implement, but many larger scenarios require multiple actuators and many more sensors. The sequential nature of current hardware input and output, even when separated into a dedicated I/O task, is likely to be a limiting factor. Even with improved software support from DAQ device vendors, key hardware devices are inherently sequential and so hardware overhead will play a role in implementing larger sensing and actuation loops. In the software context, the CyberMech system has not yet been used to compute larger parallel computations alongside physical hardware. The vRTHS conducted with large parallel numerical simulations suggests that CyberMech behaves as expected, but a comparable full RTHS experiment has not been performed.

Second, the question of how to use computational power will only become more pertinent as parallelism greatly expands the amount of computational work that can be achieved. Existing numerical simulation techniques using bulk-parallel decomposition can easily outstrip processor performance increases simply by increasing the size or level of detail of simulations. Techniques such as using high resolution meshes in areas of greater numerical error, or increasing the simulation rate in regions of high error, are potentially useful ways to allocate computational capacity to areas that need it most, and are currently being researched in the RTHS community (referred to as multi-scale or multi-timestep simulation, respectively). These techniques are most useful when the system can dynamically react to areas of high numerical error during an experiment and reallocate computational effort on demand. From a parallel real-time systems perspective we have a few building blocks that could contribute to such a system (e.g., the mode-aware barrier from Section 3.1), but the mechanisms and trade-offs involved with accomplishing this during an active RTHS experiment are as yet open problems.

Finally, a truly capable platform for RTHS requires a rigorous specification, implementation, and evaluation of what is meant by experimental generalizability. The most meaningful goal of CyberMech is to allow experimenters to rapidly implement and iterate structural validation experiments. Currently, a substantial amount of effort goes into testing and validation of each individual experiment before integration of the experimental apparatus as a whole. For example, as yet there is no default-safe configuration or catchall safety criteria that would convince RTHS practitioners that their expensive hardware and specimens can be trusted to the platform with no further thought. Similarly, each individual experiment is constructed around a specific set of hardware that has been tested and calibrated prior to use. If the system were able to inspect itself and determine whether it can meet a certain experimental profile as an automatic process, it would be a boon to researchers.


Chapter 5: Parallel Computing Tradeoffs In
Statically Determined Cyber-Physical Systems

Cyber-physical systems (CPS) are becoming increasingly complex through the interaction of computationally demanding workloads and physical control systems. For example, as we saw in Chapter 4, the output of a Real-Time Hybrid Simulation (RTHS) controller is directly influenced by the computation of arbitrarily large numerical simulations, and there is (almost by definition) no simple model that dictates how control system inputs relate to system outputs. In addition, the fact that the control algorithm exists alongside a larger system running diverse parallel real-time workloads increases the difficulty of certifying the timing behavior of critical physical interactions such as data acquisition input and output. Then, ultimately, parallelism influences both of these individually difficult problems: it may be used to change the numerical model being computed, or it may be used to influence timing in the system. This leads to the conclusion in the most general case that parallelism is not a simple upgrade that can be applied to a system, but that effective implementation of parallelism in CPS demands a cyber-physical co-design process.

Cyber-physical systems tend to present unique constraints unlike those found in traditional parallel computing or real-time systems. Traditional parallelism research focuses on maximizing speed or throughput, but real-time systems only require computational performance that is sufficient to satisfy desired timing constraints. Once a real-time system is able to meet all deadlines, it is deemed correct and no further improvement is necessarily justified. A key goal of cyber-physical systems is similarly to provide a sufficient level of physical fidelity, after which further improvement may not be strictly necessary but may be beneficial (e.g., a faster control rate to reduce tracking error).

For this reason the process of allocating limited computational resources has a different character in cyber-physical systems than it does in either pure parallel systems or in pure real-time systems. For throughput-oriented systems (pure parallel systems) there is usually a question of how processors are allocated to a computation with complex dependencies, but there usually isn't a question of whether all processors will be allocated. For parallel real-time systems there is a question of how many processors are necessary to guarantee timing behavior, after which no improvement is possible. For a cyber-physical system, however, we have both the question of minimum sufficient physical fidelity as well as the possibility of significant marginal improvement beyond that. If there are two competing computational tasks we must first satisfy the physical characteristics of the system (i.e., through certifying the real-time behavior of the tasks), but excess capacity may then be allocated to further improve the system in some way.

A general statement of principle in how cyber-physical computational resources should be allocated is not yet possible, but reasonably broad statements may be made once the cyber-physical domain is sufficiently restricted. This work concerns only static RTHS systems, whose numerical components are static (no structural elements are created, modified, or destroyed during runtime) and whose physical components are static (physical apparatus and control algorithms do not change during runtime). In effect, the entire time-history of a computational load can be accurately predicted prior to runtime. Although these are significant restrictions, in practice many meaningful cyber-physical systems will fall into this category. This chapter examines case studies of such static RTHS computations and draws broader conclusions about the tradeoffs inherent in static real-time parallel cyber-physical systems.

Table 5.1: Categories of RTHS Explored in This Work

                    Static Numerical    Dynamic Numerical
Static Physical     This work           Future work
Dynamic Physical    Future work         Future work

5.1 Linearity of RTHS determines proportion of parallel/serial computation

First we describe more precisely in what ways the numerical and physical components of RTHS are said to be static. Generally, this means that during an individual execution of the system the structure of these computations does not change, the real-time constraints of these computations do not change, and the make-up of the physical apparatus and control computations does not change.

The RTHS numerical substructures that may be considered static are those that:



• Are constant in size and configuration, meaning that the number of degrees of freedom is constant, as well as the structural mass and structural interactions (i.e., stiffness and damping) of each node in the structure.

• Have constant real-time constraints, meaning that the RTHS timestep interval ∆t does not change over time.

• Are solved using explicit methods (rather than implicit, convergence-based methods), meaning that the quantity of computation in each timestep is known and constant.

These numerical substructures simulate and predict the physical quantities of a structure
over time (displacement, velocity, and acceleration at each node) in evenly spaced timestep
intervals ∆t. The basic equation used in this context is the second law of motion:
f = m × a. At each timestep the mass of each structural element is known, and the forces
applied can be computed (potentially with input from the physical structure), so one can
compute the acceleration of that element. Once the acceleration is known, the structural
velocity and displacement can be computed by integrating over the duration of each
timestep. The computation of each node's acceleration from m and f is the dominating
calculation in the current generation of RTHS.

\[ f = m \times a \tag{5.1} \]
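The per-timestep procedure described above can be sketched for a single degree of freedom. This is only a minimal illustration and not the integrator used in this work: the helper name `advance_one_dof` is hypothetical, and the semi-implicit Euler update is an assumed explicit scheme chosen for brevity.

```python
def advance_one_dof(mass, force, velocity, displacement, dt):
    """Advance one degree of freedom by one timestep of length dt.

    Hypothetical helper: solves f = m * a for a, then integrates with a
    semi-implicit Euler update (an assumed scheme, chosen for brevity).
    """
    acceleration = force / mass                   # Equation 5.1 solved for a
    velocity = velocity + acceleration * dt       # integrate acceleration
    displacement = displacement + velocity * dt   # integrate velocity
    return acceleration, velocity, displacement


# Example: a 2 kg mass at rest, pushed with 4 N for one 1/1024 s timestep.
a, v, x = advance_one_dof(mass=2.0, force=4.0, velocity=0.0,
                          displacement=0.0, dt=1.0 / 1024.0)
```

The amount of arithmetic per call does not depend on the input values, which is the property the explicit-method requirement is meant to guarantee.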

In the simple case of a single physical object Equation 5.1 would be sufficient, but there
are two complications. First, each individual structural element in a simulation may have
multiple degrees of freedom (DOF), which correspond to the element's ability to move
in multiple dimensions. In a two dimensional simulation it is common for each node to
have three degrees of freedom: horizontal motion, vertical motion, and rotation. Thus,
we would need to solve Equation 5.1 multiple times for each node in order to come up
with displacement, velocity, and acceleration values for each direction of motion (degree
of freedom) for each node. Second, the elements of a structure influence each other, and
this must be accounted for. These connections are modeled as springs, so between any two
degrees of freedom of a structure there may be a stiffness and damping value that determines
their relationship. In practice the state space representation technique is used to pack all of
these values (the mass of each element, as well as the stiffness and damping between degrees
of freedom) into a large matrix M, which is a square matrix that has as many rows and
columns as there are total degrees of freedom in the simulation.

\[ F = M \times A \tag{5.2} \]

The computational approach used in Equation 5.2 is analogous to the one used in Equation 5.1,
where M is a combination of the simulated structure's physical description, F is
a vector containing the force acting on each degree of freedom, and A is a vector containing
the acceleration at each degree of freedom. The procedure to advance each timestep is
conceptually the same. The values of F are derived from the previous timestep's results
(where simulated elements exert force on one another) as well as the physical input to the
system (where physical elements exert force on a simulated element). The values of A are
computed from F and M either through direct solution as a system of equations or by way
of the matrix inverse with the formula given in Equation 5.3, where the solution is obtained
through simple matrix multiplication of $M^{-1}$ with F. In either case, the result is the
acceleration of each degree of freedom, which is then integrated through ∆t time units to obtain
the displacements and velocities of each degree of freedom for the next timestep.

\[ A = M^{-1} \times F \tag{5.3} \]

Obtaining the vector A is the dominating computationally intensive part of current RTHS,
and the ability to compute this vector depends primarily on the characteristics of M. If the
matrix M is static then obtaining the vector A is embarrassingly parallel. This is because M
can be pre-inverted prior to runtime, so the computation reduces to the matrix multiplication
$M^{-1} \times F$, where only the values of F change from timestep to timestep. Matrix
multiplication is exceedingly parallelizable, meaning that static RTHS is limited only by the
ability of a machine and parallel platform to churn through computations up to the limit of
the hardware.

Table 5.2: 723-DOF RTHS Serial and Parallelizable Work

                             Serial Work (µs)    Parallel Work (µs)
Dynamic Num. Substructure    880                 1576
Static Num. Substructure     0                   2364
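The static case can be sketched as follows: M is inverted once before runtime, and each timestep then computes every entry of A as an independent dot product. This is an illustrative sketch only. The names `invert`, `row_dot`, and `accelerations` and the 2-DOF matrix are hypothetical, and the thread pool merely exposes the independent per-row tasks; it is not the concurrency platform used in this work, and in CPython it should not be expected to produce real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    # Augment each row with the matching row of the identity matrix.
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if aug[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [value / scale for value in aug[col]]
        for row in range(n):
            if row != col:
                factor = aug[row][col]
                aug[row] = [v - factor * p for v, p in zip(aug[row], aug[col])]
    return [row[n:] for row in aug]


def row_dot(row, vector):
    """One output entry of M^-1 x F; depends on no other output entry."""
    return sum(m * f for m, f in zip(row, vector))


def accelerations(m_inverse, forces, pool):
    """Compute A = M^-1 x F with one independent task per row."""
    return list(pool.map(lambda row: row_dot(row, forces), m_inverse))


# Hypothetical 2-DOF system matrix, inverted once before "runtime".
M = [[4.0, 1.0],
     [2.0, 3.0]]
M_INV = invert(M)

with ThreadPoolExecutor(max_workers=2) as pool:
    # Only F changes from timestep to timestep; M_INV is reused as-is.
    A_step1 = accelerations(M_INV, [10.0, 0.0], pool)
    A_step2 = accelerations(M_INV, [0.0, 10.0], pool)
```

Because no row's result feeds another row, the rows can be divided among cores in any way without adding serial work to the timestep.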

In contrast, the matrix M may need to change during an experiment, in which case it is
dynamic, and the ability to compute RTHS in this same manner is dramatically limited. This
happens when any of the three constituent parts of M changes during an experiment: the mass
of any element, or the stiffness and damping that connect any two elements. It also happens
if the number of elements or degrees of freedom (i.e., the size of M) were to change during
an experiment. In this case, either Equation 5.2 or Equation 5.3 can be used to compute the
vector A, but doing so is not embarrassingly parallel. In the former case the computation
proceeds as the direct solution of a system of equations (e.g., through factorization, row
reduction and back substitution); in the latter case matrix M must be inverted and then
matrix multiplication with F gives the desired vector. Neither of these techniques is easily
parallelizable, and the result is that roughly a third of the computational work each period
would be sequential rather than parallelizable, which would in turn dramatically reduce the
effect of parallelism on achievable simulation sizes.

The proportion of serial and parallelizable work in the numerical substructure was measured
in a 723 degree of freedom virtual RTHS and the results are shown in Table 5.2. In the
static case the matrix M was pre-inverted and the runtime calculation performed was matrix
multiplication as shown in Equation 5.3. In the dynamic case the matrix M was explicitly
solved through back substitution and row reduction each timestep. The remainder of the
work in both situations was solution of displacements and velocities at each node through
integration. In this representative case the cost of having a non-static numerical substructure
is the conversion of a 100% parallelizable workload into a 64% parallelizable and 36% serial
workload. This is a heavy price indeed for a parallel system: using an Amdahl's Law
argument the dynamic RTHS numerical substructure will be limited to approximately three
times speedup, while the static workload can be accelerated up to the limit of the machine
and concurrency platform.
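The Amdahl's Law bound can be worked directly from the Table 5.2 measurements. The sketch below uses the standard form of the law; the function names are hypothetical, and the model ignores scheduling and communication overheads, so it gives an upper bound rather than a prediction.

```python
def amdahl_speedup(serial_us, parallel_us, cores):
    """Speedup on `cores` processors when only `parallel_us` scales."""
    total = serial_us + parallel_us
    return total / (serial_us + parallel_us / cores)


def amdahl_limit(serial_us, parallel_us):
    """Upper bound on speedup as the core count grows without bound."""
    if serial_us == 0:
        return float("inf")
    return (serial_us + parallel_us) / serial_us


# Dynamic numerical substructure from Table 5.2: 880 us serial, 1576 us parallel.
dynamic_limit = amdahl_limit(880, 1576)            # about 2.79
dynamic_15_cores = amdahl_speedup(880, 1576, 15)   # about 2.49

# Static numerical substructure from Table 5.2: no serial work.
static_limit = amdahl_limit(0, 2364)               # unbounded under this model
```

With a serial fraction of 880 / 2456, or about 36%, the bound of roughly 2.8 is the "approximately three times" limit cited above.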

However, this is a worst-case scenario for parallelism where the numerical substructure may
change as frequently as every timestep. In reality one can envision many RTHS scenarios
where numerical substructures change at a less frequent pace, such as once per experiment,
on a fixed but long-duration schedule, or triggered by physical or experimental
mode changes where mode changes happen infrequently relative to the pace of computation.
Depending on the particulars of such a dynamic RTHS many software strategies could be
employed to mitigate the costs of dynamic numerical substructures. For example, if the
number and configuration of all numerical substructure modes are known prior to the
experiment (i.e., if there are a known number and configuration of M matrices) then all such
matrices can be pre-inverted prior to runtime and switched between for only the cost of
cache invalidation. If the future configuration of M is not known prior to runtime then the
question may be how quickly a new matrix M may be assembled and pre-inverted alongside
a running experiment, and then the system consideration is the latency with which such
mode switches may be done and the effect of that latency on the physical characteristics of
the cyber-physical interaction.
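The first mitigation strategy, pre-inverting every known mode before the experiment, might look like the following sketch. The class `ModeSwitchingSolver`, the mode names, and the matrix values are all hypothetical and serve only to illustrate the idea that a mode switch selects an existing inverse rather than computing a new one.

```python
def mat_vec(matrix, vector):
    """Plain matrix-vector product, standing in for M^-1 x F."""
    return [sum(m * f for m, f in zip(row, vector)) for row in matrix]


# Hypothetical experiment with two known modes. Each entry holds an inverse
# computed before the experiment starts (values here are made up).
PRE_INVERTED = {
    "intact":  [[0.5, 0.0], [0.0, 0.25]],
    "damaged": [[1.0, 0.0], [0.0, 0.25]],
}


class ModeSwitchingSolver:
    """Selects among pre-inverted matrices; no inversion happens at runtime."""

    def __init__(self, inverses, initial_mode):
        self._inverses = inverses
        self._current = inverses[initial_mode]

    def switch_mode(self, mode):
        # A mode switch is only a reference swap to an existing inverse.
        self._current = self._inverses[mode]

    def step(self, forces):
        return mat_vec(self._current, forces)


solver = ModeSwitchingSolver(PRE_INVERTED, "intact")
before = solver.step([2.0, 4.0])   # uses the "intact" inverse
solver.switch_mode("damaged")
after = solver.step([2.0, 4.0])    # same forces, "damaged" inverse
```

The runtime work per timestep is the same in every mode, so the workload stays fully parallelizable; the remaining cost is the cold cache after a switch, as noted above.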


5.2 Parallel Real-time Computation of Static RTHS

Computation time is a constrained resource in RTHS experimentation, but knowing how
to allocate this resource is not a settled question. More parallel computation allows a
real-time numerical simulation to execute faster or to be larger, leading for example to better
physical control or higher fidelity models respectively. Even in entirely static RTHS there
is a configuration and design problem: where and how to allocate parallel computational
effort. At the design stage we can trade computational capacity for larger models or faster
periodic rates, and if an experiment includes multiple numerical substructures we can trade
off computational capacity among them. Even after experiment design we may still wish to
tweak the periodic rates of various numerical substructures by adding parallel computation,
potentially at the expense of other substructures.

In this section we explore the configuration space of static RTHS through a single numerical
substructure connected to a single hydraulic actuator, where communication with the
physical specimen is achieved via a single analog write and digital read each period.
Although simplified, this situation allows us to construct a model of how such a single RTHS
computation behaves in isolation, and as an exemplar can be the starting point for more
complicated experimental design discussions that must be had prior to any code being written
or specimens being constructed. We will find that the "cyber-physical response" of this
simple system is not simple or predictable in the face of implementation on a real system.
This model will be of particular importance when designing RTHS experiments that must
trade off computational effort between multiple models. It also allows an analysis of the
computational benefit of parallelism in the embarrassingly parallel case presented in
Equation 5.3.


Figure 5.1: Ten degree of freedom numerical substructure.

The starting point for this experiment is a two-story structure with 10 degrees of freedom
(pictured in Figure 5.1), where the four nodes defining the first and second stories are allowed
to move horizontally and rotate (providing eight degrees of freedom), and the two nodes
attaching the structure to the ground are only allowed to rotate (providing two degrees of
freedom). In this analysis the four vertical columns of the structure are evenly subdivided into
a specified number of segments in order to provide a numerical simulation with approximately
any desired number of degrees of freedom. This is a realistic thing to do as the columns are
the primary load bearing elements of the structure. That refinement in turn allows these
columns to simulate bending under a load where the 10 degree of freedom model cannot,
which allows the simulation to capture higher vibrational modes than the original 10 degree
of freedom structure. In practice the limit to useful refinements would be dictated by the
maximum observable vibrational mode of the structure, but we may exceed this practical
limit without sacrificing the integrity of the computational workload.

For the purpose of assessing computational infrastructure (e.g., CyberMech) described
elsewhere in this work, this ability to arbitrarily refine an accurate RTHS numerical model
allows three analyses: first, what sizes of models are achievable in a given system at a
desired periodic rate; second, a rough estimate of the additional utility provided by parallelism
in the case of an embarrassingly parallel single simulation with a uniform timestep; and
third, a specific estimate of the required computational capacity for this particular model
on these particular experimental systems. The 10 degree of freedom model was refined to
approximately every multiple of 100 degrees of freedom from 106 DOF to 1602 DOF and
per-period execution times were measured when executing with only a specified number of
processor cores. The results are the longest average execution times over ten separate trials of
the RTHS experiment, each of which ran for approximately 35 seconds. This constitutes an
RTHS capability graph which allows an RTHS designer to easily trade off between simulation
size, periodic rate, and number of processors.

The data presented in Figures 5.2 and 5.3 are from a 16-core machine using two Intel
E5-2687W Xeon processors. These processors each have a 3.10 GHz clock and a 20,480 KB
L3 cache. The machine ran Linux kernel 3.0.80 with the RT_PREEMPT real-time patch
version rt108 installed. In general, core 0 was reserved for the operating system, and these
experiments were run on cores 1-15.

The timing results of this investigation are shown in Figure 5.2 and Figure 5.3. The first of
these two graphs indicates the actual per-period average execution time, while the second of
these two graphs converts that data into an achievable periodic rate. Together, these graphs
indicate what size numerical models are achievable under certain processor allocations and
time constraints. Graphs such as these show an RTHS practitioner exactly what kind of
numerical models are at their disposal under the static RTHS assumption. More importantly,
this allows one to rapidly hypothesize different experiments of varying sizes, real-time
constraints, and numbers of numerical substructures at the design stage. In particular, the unit
of simulation size, the degree of freedom, is an abstract unit of computational demand that
may describe an arbitrary node in a simulation that is moving and interacting in an arbitrary
manner.
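The conversion from per-period execution time to achievable periodic rate, and the kind of lookup a capability graph supports, can be sketched as below. The function names are hypothetical and the timings are made-up placeholders, not measurements from this work.

```python
def achievable_rate_hz(period_us):
    """Convert a per-period execution time in microseconds to a rate in Hz."""
    return 1_000_000.0 / period_us


def min_cores_for_rate(period_us_by_cores, target_hz):
    """Smallest core count whose measured period meets the target rate.

    Returns None when no measured configuration is fast enough.
    """
    feasible = [cores for cores, period_us in period_us_by_cores.items()
                if achievable_rate_hz(period_us) >= target_hz]
    return min(feasible) if feasible else None


# Made-up timings for one model size (NOT data from this work):
# core count -> average per-period execution time in microseconds.
example_timings = {1: 2000.0, 3: 1100.0, 7: 900.0, 13: 450.0}

cores_for_1024 = min_cores_for_rate(example_timings, 1024.0)
cores_for_2048 = min_cores_for_rate(example_timings, 2048.0)
cores_for_4096 = min_cores_for_rate(example_timings, 4096.0)
```

A 1024Hz target corresponds to a period budget of about 977 microseconds, so in this placeholder data the 7-core configuration is the first to meet it.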

[Figure 5.2 plot: "Per-Period Timing by Model Size". x-axis: Processor Cores (1 to 15); y-axis: Average Period Time (us), 0 to 2500; one line per model size from 106 DOF to 1602 DOF, plus a "1024Hz Limit" line.]

Figure 5.2: Overall per-period times, including both computation and hardware communication times, by model size and number of parallel cores

[Figure 5.3 plot: "Average Periodic Rate by Model Size". x-axis: Processor Cores (1 to 15); y-axis: Average Periodic Rate (Hz), 0 to 7000; one line per model size from 106 DOF to 1602 DOF, plus 1024Hz, 2048Hz, and 4096Hz target lines.]

Figure 5.3: Overall achievable per-period rates by model size and number of processor cores. Dashed red lines give three common target rates for RTHS: 1024Hz, 2048Hz, and 4096Hz

For example, in Figure 5.3 one can imagine the level curves along lines of constant model
size (which are the solid lines) or instead the level curves along constant periodic rate (which
are the dashed lines). It becomes simple to estimate the capability of the system in this way.
For example, for a fixed model size of (approximately) 1106 DOF the experiment designer
knows they can achieve 1024Hz operation with 7 cores or 2048Hz operation
with 13 cores. To the RTHS practitioner the distinction between using 7 or 13 cores may
be irrelevant, but the distinction between 1024Hz and 2048Hz may represent a substantial
improvement in physical performance via doubling the control rate. From a configuration
space view this represents a definite tradeoff to the experiment designer: they can settle for
1024Hz and have enough excess computational capacity for a second similarly sized workload,
or they can achieve a 2048Hz control rate and fully utilize the machine. Similarly, if the
experimental designer first decides on a required control rate of 2048Hz they know they can
achieve any model size which has any data points above the 2048Hz line, but they can also
trace across the 2048Hz line to see the tradeoff between numbers of computational cores and
model size.

Figure 5.3 demonstrates one particular aspect of the benefit of parallelism for this experiment.
Orange lines are those able to achieve a 1024Hz rate with a single processor, blue lines are
those able to achieve the rate only with multiple processors, and green lines are those not able
to achieve 1024Hz no matter how many processors are used. Thus, if the target operational
rate of an experiment were 1024Hz (as is common) then the blue lines represent the capability
gain due to parallelism. To be specific, parallelism allows the experimental designer to expand
their models from approximately 800 DOF to approximately 1200 DOF, a 50% increase in
simulation size. Similarly, for a 2048Hz control rate, parallelism yields an improvement from
approximately 500 DOF to approximately 1100 DOF, a 120% improvement in simulation
size. And again for 4096Hz, parallelism yields an improvement from approximately 300 DOF
to approximately 500 DOF, roughly a 67% improvement in simulation size.


The other aspect of parallel improvement is the increase in periodic rate for a given model
size, which is not shown here through color. If an experimental designer had a fixed model
of, for example, approximately 1000 DOF, then they could observe that just one processor
core is capable of running that simulation at roughly 500Hz. However, following that line
the designer can see a gradual improvement to as high as 3500Hz for the given model size,
and can quickly know approximately what range of periodic rates is available.

The coloring of Figure 5.2 represents a different aspect of experimental planning. The purple
lines exhibit unpredictable performance degradation, which is not uncommon in parallel
computing when the overheads of adding more cores or more threads to a computation outweigh
the added benefit of additional computational resources. Empirical evaluation is useful here
as this phenomenon is rather difficult to predict analytically. In a general-purpose scenario
where a computation is performed seldom this effect is minor and can be ignored: the
performance degradation pictured in the purple lines is on the order of milliseconds. However,
a real-time computation for RTHS is performed repeatedly, and a delay of fractions
of a millisecond may mean the difference between success and failure. This graph illustrates
this pitfall. In particular, the blue lines represent a single-plateau performance regime,
where there are diminishing returns for additional parallelism but more cores are not
substantially detrimental. The green lines represent a double-plateau performance regime, where
there are diminishing returns up to 7 cores (which represents a socket boundary between
the two processors on this experimental machine), and above 7 cores performance resumes
improving until a second plateau is hit. The purple lines show a performance "hockey-stick"
where parallel performance improves up to the first plateau inside a single socket, but in the
second socket additional processor cores introduce significant overhead that is related to the
problem size. Due to this size-correlation, in addition to some basic back-of-the-envelope
calculations, we suspect that the size of the L3 cache is the dominant driver of this hockey-stick
behavior.
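One way such a back-of-the-envelope estimate might go is sketched below. The calculation behind the suspicion is not given here, so everything in this sketch is an assumption: it supposes the runtime matrix is stored densely with 8-byte double-precision entries and compares its footprint to the 20,480 KB L3 cache of one processor on the test machine. The function names are hypothetical.

```python
L3_CACHE_KB = 20_480          # per-processor L3 size reported for the test machine
BYTES_PER_ENTRY = 8           # assumption: double-precision matrix entries


def dense_matrix_kb(dof):
    """Footprint in KB of a dense dof-by-dof matrix (assumed storage format)."""
    return dof * dof * BYTES_PER_ENTRY / 1024


def l3_fraction(dof):
    """Fraction of one socket's L3 that the matrix alone would occupy."""
    return dense_matrix_kb(dof) / L3_CACHE_KB


footprints = {dof: round(l3_fraction(dof), 2) for dof in (506, 1106, 1602)}
```

Under these assumptions the matrix grows quadratically with model size and approaches the capacity of a single L3 cache at the largest model sizes, which would be consistent with a size-dependent cache effect. It does not by itself confirm the cause.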


In order to verify that the behavior seen in the previous figures is sensible, Figures 5.4
and 5.5 disentangle the time spent executing the numerical substructure and the time spent
invoking the hardware driver to send and receive signals through the data acquisition system.
As can be seen, computational timing dominates and largely follows the trends that would
be expected, indicating that the concurrency platform itself is performing well. However,
the behavior of the data acquisition system was somewhat unexpected and indicates there
are unwanted interactions between the communication subsystem of CyberMech and the
primary computational concurrency platform. In Figure 5.5 the lines are color-ordered along
the light spectrum by computational size. Smaller computational models communicate with
approximately constant time no matter how many processors participate in the system.
However, as the model size increases the communication time increases, which was not
expected. The physical substructure (sensors and actuator apparatus) does not change with
model size and the communication subsystem is not doing any additional work for these larger
models. One hypothesis is that contention on a shared cache is responsible for degrading
the performance of this component: as the computational model grows in size so does the
working set size of the computation, and the cached data belonging to the data acquisition
system is evicted. More troubling is the communication behavior of large models across the
processor socket boundary for systems using between seven and nine cores. Communication
time increases with model size, but at a certain point the communication cost becomes erratic
after the jump from seven to nine cores. Fortunately, the overall maximum communication
time is relatively low and can be accounted for.

[Figure 5.4 plot: "Computation Time vs. Model Size". x-axis: Processor Cores (1 to 15); y-axis: Average Period Time (us), 0 to 2500; one line per model size from 106 DOF to 1602 DOF. Note on plot: colors match those of the "PeriodTimings" graph for comparison.]

Figure 5.4: Static RTHS numerical simulation computation timings by model size and number of processor cores

[Figure 5.5 plot: "DAQ Communication Time by Model Size and Processor Cores". x-axis: Processor Cores (1 to 15); y-axis: Average Communication Time (us), 100 to 350; one line per model size from 506 DOF to 1602 DOF. Notes on plot: 106-402 DOF models were extremely similar to the 506 DOF model; the hardware I/O driver is, to our knowledge, entirely sequential.]

Figure 5.5: Static RTHS hardware communication time by model size and number of processor cores. Model sizes under 506 DOF are not shown as they were extremely similar to the 506 DOF data

5.3 Further Challenges and Future Work

As was discussed at the beginning of this chapter, this work explores RTHS scenarios in
which the numerical and physical substructures are static. This yields a highly
parallelizable system able to take full advantage of a real-time parallel concurrency platform,
and is realistic in practice for all known RTHS currently being conducted. Thus, this work
is relevant despite its limiting assumptions. It is also relevant for the upcoming
but still static class of RTHS experiments that push beyond a single numerical substructure
and single physical substructure, such as using multiple numerical simulations or multi-rate
(multi-timestep) RTHS to target computational effort more and more exactly at areas of
high simulation error.

However, the obvious extension to dynamic numerical and physical components still remains
open, and many challenges would need to be addressed in this area. Some of the interest in
numerical substructures is driven by a desire to use implicit integration schemes, which are
convergence-based and whose timing characteristics are less well understood than the explicit
schemes assumed in the static context. These integrators are crucial for the jump from
linear numerical simulations to non-linear simulations, which are capable of simulating a much
wider range of structural elements. General dynamic RTHS numerical and physical
components are also not available and are an ongoing area of research within both the computer
science and structural engineering communities. The easy extension to dynamic numerical
substructures in Section 5.1 is achievable, but would limit parallel speedup dramatically by
introducing serial computation on the critical path. Getting the most out of a real-time
parallel concurrency platform for a dynamic numerical substructure requires improved
parallelization of real-time numerical simulation strategies.

More generally, the work in this section represents a step towards a more complete
understanding of cyber-physical systems engineering through the lens of RTHS. This understanding
is far from complete. The data presented here would allow an RTHS practitioner to greatly
accelerate their search of the possible RTHS configuration space, but it does not constitute a
full design methodology for RTHS, much less for CPS in general. Such a methodology
would allow a designer to fix a system's control rate, computational capacity allocation, and
physical parameters in a unified and principled manner. Analyzing this problem in the full
scope of cyber-physical systems engineering would require a much deeper understanding of
the cyber-physical interactions that occur in the general design space. One must wonder if
such a thing as a general cyber-physical systems methodology exists, and if it does exist in
some form, just how descriptive could it be?

Further questions arise at the intersection of cyber-physical systems design and parallel
computing. How does parallelism fit into a general framework for CPS? Can the whole
benefit of parallelism be described in terms of computation sizing and control rates, or is
there a deeper interplay between large computational capacities and well-engineered systems?
One argument is that there is: suppose we don't know at the outset of a time-constrained
computation where the critical path actually lies. Parallelism allows us to much more rapidly
and asynchronously explore the space of on-line system choices, where a sequential system
might have to assume it makes the worst choice every single time. Lastly, an important
unexplored question is parallelism in a hard real-time safety-critical context. What exactly
are the guarantees a hard real-time concurrency platform can make? Does there exist a
method to transform known-safe sequential programs into known-safe parallelized programs,
or will the shift to multi-cores and parallel processing require manual reinvention everywhere?


Chapter 6: Related Work and Other Soft Real-Time Platforms on Linux

6.1 Concurrency Platforms and Parallel Programming

Parallel programming can be significantly more difficult than sequential programming. Modern
operating systems have been explicitly designed to provide the process abstraction, which
allows programmers to write most sequential programs without concern for when, where,
or how their programs execute on the system. All of the complexity inherent in running
multiple processes concurrently is hidden from the user and handled entirely in the operating
system through the scheduling interrupt and context switching mechanisms. All of the
complexity of process blocking and synchronization necessary to access shared resources
(such as hard drives, network connections, etc.) is hidden from the programmer through
cooperation between the operating system and system standard libraries (such as the C
standard library). In effect, from the perspective of a sequential program, it can easily
appear to be the only program executing on a system, as long as it is willing to
deal with two possible complications: (1) the wall clock (real-world clock) will appear to
progress erratically when the process is swapped out or made to block, and (2) the state of
shared resources (such as the file system) may be modified unexpectedly.

Many sequential programs can get away with ignoring these complications. Dealing with
the first complication by devising a method for time-sensitive sharing of the processor is
essentially the domain of sequential real-time programs, and can be adequately addressed in
many ways. Dealing with the second complication is what is called concurrent programming.

At its heart, concurrent programming deals with the fact that timing of events in modern
systems is largely unpredictable, and therefore it is very difficult to make concrete
guarantees about how a program will execute. Even on a single core machine, external hardware
interrupts, scheduling interrupts, and blocking system calls will cause a sequential program,
in practice, to start and stop executing at unpredictable times. This may lead to all manner of
race conditions, which occur when the state of executing software depends on the specific
ordering of events in a system. There are many manifestations
of such behavior, in fact too many to list fully here. As a simple example, consider a single
variable that is initialized to zero. Two processes may race on this variable, with one process
attempting to read the value of the variable and the other process trying to set the value of
the variable to be one. Depending on which process succeeds first, the final value delivered
to the reading process will be zero or one, but saying definitively what the final value will be
is impossible without adding additional constraints. Potentially more hazardous is that race
conditions may occur when breaking up a sequence of instructions. For example, consider
two processes trying to push a node onto a linked list. The first process executes to the point
of finding the tail of the linked list, and is then interrupted. The second process finds the
same tail and pushes a new node to the list by modifying the next pointer. Then, the first
process resumes executing and overwrites the same tail node's next pointer, which results in
a loss of the node that was pushed by the second process.
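The linked-list scenario above can be replayed deterministically. The sketch below is illustrative only: the names `Node`, `find_tail`, and `link_after` are hypothetical, and the interleaving is written out by hand in a single thread so that the lost update happens on every run instead of depending on real scheduling.

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.next = None


def find_tail(head):
    """Step 1 of an append: walk to the last node."""
    node = head
    while node.next is not None:
        node = node.next
    return node


def link_after(tail, new_node):
    """Step 2 of an append: overwrite the tail's next pointer."""
    tail.next = new_node


def values(head):
    """Collect the values reachable from head, in list order."""
    out = []
    node = head
    while node is not None:
        out.append(node.value)
        node = node.next
    return out


# Unsafe interleaving, replayed by hand in one thread so it is repeatable.
head = Node("head")
tail_seen_by_first = find_tail(head)        # first process is "interrupted" here
tail_seen_by_second = find_tail(head)       # second process finds the same tail
link_after(tail_seen_by_second, Node("B"))  # second process appends B
link_after(tail_seen_by_first, Node("A"))   # first process resumes, overwrites
lost_update = values(head)                  # B is no longer reachable

# Same two appends with each find-then-link pair kept indivisible.
head = Node("head")
link_after(find_tail(head), Node("B"))
link_after(find_tail(head), Node("A"))
both_kept = values(head)
```

In a real multi-process or multi-threaded program the find-then-link pair would be made indivisible with a synchronization primitive such as a mutex; here the second half simply runs the pairs back to back to show the intended outcome.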


Of course, concurrent programming is also done on machines with more than one processor.
In this case, multiple sequential processes may run simultaneously. This can greatly increase
the likelihood of race conditions happening, but does not create an inherently more
complex programming model. A good concurrent program makes no assumption about the
rate at which it executes relative to any other program on the system, and it is conceivable
that a pernicious concurrent system could try to switch between processes at the exact
moments in time that would cause race conditions.

In contrast, parallel computing is using multiple threads in a single process to accelerate an
individual computation. This suffers from the same vulnerabilities as concurrent programs,
but now individual programs' internal threads may interfere with each other as well. Thus
the difference should be apparent: writing correct and efficient concurrent programs is an
exercise in dealing with external interference, while parallel programming must deal with
both external and internal interference. The essential difficulties have not changed from a
system correctness point of view, but writing parallel programs is dramatically more difficult.
Concurrent programs are sequential programs where the realities of execution on modern
multi-process systems must occasionally be handled where shared resources are concerned.
Threads in parallel programs inherently share: when multiple threads execute inside a process
then the entire process memory space is a shared resource.

Moreover, parallel programming is explicitly about accelerating programs and getting good
performance, where concurrent programming is generally about sharing resources. Very
simple and heavy-handed solutions to concurrent programming exist, such as Linux's
(now-removed) Big Kernel Lock that simply prevented any two sensitive areas of code from
executing concurrently anywhere on the system, regardless of whether those two code sections were
actually interfering. This was correct but often degraded performance, since it prevented
simultaneous execution even when such concurrency was allowable. A parallel program that
employed such a strategy would be a poorly designed one, since performance is an inherent
metric. Thus, good parallel performance requires employing sophisticated scheduling and
synchronization techniques to maximize the amount of simultaneous work that is being done
at any given point in time. Furthermore, good performance often requires adapting parallel
programs for specific hardware architectures. As a result, writing good parallel programs
from first principles was for a long time exclusively the domain of expert parallel
programmers who could simultaneously reason about scheduling, synchronization, and architectural
and hardware level fine-tuning.

However, it was realized that much of what was laborious and difficult about writing good
parallel programs could be systematized and automated. A properly designed concurrency
platform would alleviate much of the burden of scheduling, resource management, and thread
coordination. An appropriately designed interface would allow an application programmer
to focus on parallel algorithm design by identifying opportunities for parallelism rather than
having to think about how to implement parallelism. Expert performance-oriented
programmers could devote themselves to understanding how to execute a given parallel structure
most effectively, independent of the actual computations being performed. Many such
concurrency platforms have been developed over time: MIT Cilk [4], Intel's Cilk Plus [18],
OpenMP [5], and Intel's Threading Building Blocks [22] are major examples.

These platforms define new parallel programming languages that are implemented on top of
existing programming languages such as C and C++. For example, a common element of
almost every concurrency platform is the parallel-for loop, which operates like a traditional
for-loop except that every iteration of the loop is allowed to execute simultaneously with every
other iteration of that same loop. A middleware runtime layer (the concurrency platform
itself) is responsible for implementing the machinery of parallelism during program execution:
thread creation and management, scheduling and work allocation, and the synchronization
necessary for those functions.
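
As a rough illustration of the parallel-for idea, the sketch below uses Python and its
standard-library `concurrent.futures.ThreadPoolExecutor` as a stand-in runtime, rather than any
of the platforms named above. The helper name `parallel_for`, the loop body `square`, and the
worker count are assumptions made for illustration only. The caller states only that the
iterations are independent; the executor decides how they are mapped onto threads.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_for(body, iterations, max_workers=4):
    """Run body(i) for every i in iterations; iterations may overlap in time.

    Hypothetical helper: the executor (not the caller) owns thread creation,
    work allocation, and the synchronization needed to collect results.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in input order even though the individual
        # calls may complete in any order.
        return list(pool.map(body, iterations))


def square(i):
    """Example loop body with no dependence on other iterations."""
    return i * i


if __name__ == "__main__":
    print(parallel_for(square, range(8)))
```

In standard CPython builds, threads do not speed up CPU-bound loop bodies, so this sketch shows
the shape of the programming interface rather than the performance of a real concurrency
platform.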

These platforms are a major departure from previous parallel programming approaches,
which generally used pThreads [23] or Java threads [24] directly. These previous approaches
force the programmer to trade off between implementation difficulty and efficiency. Manual
thread management means either using a simple on-demand threading model, where new
threads are created when needed and destroyed when no longer necessary, or it requires a
more complicated persistent threading model, where a single set of threads is created at system
initialization and then is managed throughout the life of the program. Thread creation is
not free, so on-demand threading is comparatively inefficient (versus persistent threads) and
does not scale well, but conversely persistent threads do not adapt well to changes in system
architecture. For example, a program designed around four persistent threads will be unable
to take advantage of a fifth processor becoming available, while multiplexing four threads
onto three processors can be highly inefficient [25]. As a result, writing adaptable, scalable
parallel programs with persistent threads naturally motivates thread management with a
job scheduler, which itself motivates the modern notion of automatic thread management
within a concurrency platform.
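
The two manual threading models can be contrasted in a short sketch. It is written in Python
with the standard `threading` and `queue` modules rather than pThreads or Java threads, and the
function names `run_on_demand` and `run_persistent` are hypothetical. The first creates and
destroys one thread per job; the second creates a fixed set of threads once and feeds them from
a shared job queue, which is the seed of a job scheduler.

```python
import queue
import threading


def run_on_demand(jobs):
    """On-demand model: create one thread per job, discard it when done."""
    results = [None] * len(jobs)

    def worker(index, job):
        results[index] = job()

    threads = [threading.Thread(target=worker, args=(i, job))
               for i, job in enumerate(jobs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def run_persistent(jobs, num_threads=4):
    """Persistent model: a fixed set of threads pulls jobs from a shared queue."""
    results = [None] * len(jobs)
    work = queue.Queue()
    for item in enumerate(jobs):
        work.put(item)  # all jobs are queued before any thread starts

    def worker():
        while True:
            try:
                index, job = work.get_nowait()
            except queue.Empty:
                return  # no work left, so this thread retires
            results[index] = job()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Note that `num_threads` is fixed when `run_persistent` is called, which mirrors the adaptability
problem described above: the thread count does not follow changes in the number of available
processors.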

Concurrency platforms are just as important for the way they modify the parallel programming
task as they are for addressing the thread management problem. These platforms
provide a strong separation between the way that parallel programs are written and the
way that they are executed: put simply, they separate the implementation of parallelism
from the instantiation of parallelism. This is radically different from the traditional threading
approaches provided by pThreads and Java threads. Consider for example the call to
pthread_create(), which simultaneously creates a thread and provides a starting point for
the thread to begin executing. Here, the thread is created and the parallel work it is to
perform is specified in the same stroke, and separation is impossible. In contrast, a high-level
identification that a for-loop can be turned into a parallel-for loop is the identification of
parallelism opportunities by the parallel programmer, and such a parallel program can be
executed in whatever way provides sufficient performance.

The subsequent independence of implementation then naturally gives rise to the question of
what sorts of general parallel scheduling strategies are useful. The general parallel scheduling
problem is formulated as the dynamic unfolding of directed acyclic graphs (DAGs) [4, 26, 27,
28], where each node in a DAG represents a computation, and edges represent dependencies
between nodes. A node is ready to execute when all of its predecessors have been executed,
and the scheduling decision is to decide which of the available ready nodes should be executed
during each unit-time execution window.

There are two basic metrics for such jobs. The work T1 of a job is the total number of nodes
in a DAG, which is intuitively the amount of time such a parallel program would take to
execute on a single processor. The critical path or span T∞ is the longest chain of nodes in
the DAG, which is intuitively the amount of time it would take to execute such a program
on an infinite number of processors.

Both of these metrics form lower bounds for the execution time of a parallel program under
all situations. Clearly a program cannot execute faster than T∞ under any condition, since
the critical path of length T∞ takes at least that many time units to execute. A program can
execute faster than T1 if more than one processor is applied, but if a given execution machine
has P processors, then the value T1/P is a lower bound that represents perfect division of
work across all processors. The first bound will dominate for programs with many long
sequential chains and limited opportunities for parallelism, while the second bound will
dominate for embarrassingly parallel programs with very few dependencies.
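
The work, the span, and the two lower bounds can be computed directly for a small DAG. The
sketch below is in Python and assumes unit-time nodes, as in the model above; the adjacency-map
representation and the names `work_and_span`, `lower_bound`, and `FORK_JOIN` are illustrative
choices, not part of any platform discussed here.

```python
def work_and_span(dag):
    """Return (T1, T_inf) for a DAG of unit-time nodes.

    dag maps each node to a list of its successors. The recursion assumes
    the graph is acyclic and small enough for Python's recursion limit.
    """
    nodes = set(dag)
    for successors in dag.values():
        nodes.update(successors)

    longest = {}  # longest chain, counted in nodes, starting at each node

    def chain_from(node):
        if node not in longest:
            longest[node] = 1 + max(
                (chain_from(s) for s in dag.get(node, ())), default=0)
        return longest[node]

    work = len(nodes)
    span = max((chain_from(n) for n in nodes), default=0)
    return work, span


def lower_bound(work, span, processors):
    """No schedule on `processors` cores can finish before this time."""
    return max(span, work / processors)


# Hypothetical fork-join job: one source, three parallel nodes, one sink.
FORK_JOIN = {"a": ["b", "c", "d"], "b": ["e"], "c": ["e"], "d": ["e"]}

if __name__ == "__main__":
    t1, t_inf = work_and_span(FORK_JOIN)
    print(t1, t_inf, lower_bound(t1, t_inf, 2))
```

For this example the work is 5 and the span is 3, so on two processors the span bound (3)
dominates the work bound (2.5).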

Thus, the optimal execution time of a parallel program under any scheduler is at least the
maximum of T∞ and T1/P. There are two common schedulers used in concurrency platforms.
The greedy scheduler (sometimes called work-conserving) simply executes as many
DAG nodes as possible during each unit-time execution window, or more formally specifies
that there is never an execution window where some processor sits idle if there is an eligible
(ready) node to execute. The greedy scheduler approach gives a job completion time of at most
T1/P + T∞ [12, 29] (a factor of two versus optimal). The second scheduler is the random
work stealing scheduler. The randomized work stealing scheduler allows idle processors to
randomly select a candidate victim processor and attempt to steal work in order to find
something to do. This does not guarantee that processors are always busy when there is
work to do, as the greedy scheduler does, but it guarantees that idle processors will find
work within a very short amount of time. The randomized work stealing scheduler gives
job completion times of O(T1/P + T∞) [4] (within some constant factor of optimal). In
practice, most concurrency platforms use some form of randomized work stealing, including
Cilk, Cilk Plus, and Intel's Threading Building Blocks. A notable exception is OpenMP: since
OpenMP is a specification rather than a specific parallel programming language, specification
implementers are free to use whichever scheduler they desire. At least one mainline OpenMP
implementation, GNU's OpenMP, uses a centralized queue scheduler [30] that can be considered
to be a "near-greedy" scheduler.
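
The greedy rule is simple enough to simulate. The following Python sketch assumes unit-time
nodes and a hypothetical fork-join job (`EXAMPLE_DAG`, with work 8 and span 3); it is a toy
model of the greedy discipline, not the scheduler of any particular platform. In each time step
it runs as many ready nodes as there are processors, so no processor idles while a ready node
exists.

```python
def greedy_schedule_length(dag, processors):
    """Simulate a greedy scheduler on unit-time nodes; return completion time.

    dag maps each node to a list of its successors.
    """
    nodes = set(dag)
    for successors in dag.values():
        nodes.update(successors)

    # Count unfinished predecessors of every node.
    remaining_preds = {n: 0 for n in nodes}
    for successors in dag.values():
        for s in successors:
            remaining_preds[s] += 1

    ready = sorted(n for n in nodes if remaining_preds[n] == 0)
    steps = 0
    while ready:
        # Greedy rule: execute up to `processors` ready nodes in this window.
        running, ready = ready[:processors], ready[processors:]
        for node in running:
            for s in dag.get(node, ()):
                remaining_preds[s] -= 1
                if remaining_preds[s] == 0:
                    ready.append(s)  # becomes eligible in the next window
        steps += 1
    return steps


# Hypothetical job: source "s", six independent middle nodes, sink "t".
EXAMPLE_DAG = {"s": ["m0", "m1", "m2", "m3", "m4", "m5"]}
EXAMPLE_DAG.update({"m%d" % i: ["t"] for i in range(6)})

if __name__ == "__main__":
    for p in (1, 2, 3, 6):
        print(p, greedy_schedule_length(EXAMPLE_DAG, p))
```

On two processors this job completes in 5 steps, which lies between the lower bound
max(T∞, T1/P) = 4 and the greedy upper bound T1/P + T∞ = 7.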

6.2 Multi-processing vs. Parallel Processing

Multi-core real-time systems researchers have developed models, theory, and software to
support inter-task parallelism, where workloads consist of a collection of independent sequential
tasks, and multiple processors or multi-core processors allow multiple sequential tasks to
execute at once. While these systems allow many tasks to execute simultaneously, they do
not allow an individual task to run any faster on a multi-core machine than on a single-core
machine. This is called real-time multiprocessing.

The focus of this work goes further, concentrating on parallel real-time processing systems,
where real-time tasks can have intra-task parallelism in addition to inter-task parallelism.
In these systems, workloads consist of a collection of independent parallel tasks, but each
individual parallel real-time task is allowed to execute on multiple (potentially overlapping)
cores. This capability allows parallel real-time processing systems to execute a strictly larger
class of programs than real-time multiprocessing systems. In particular, when the opportunity
for parallel execution exists, it allows for the execution of individual tasks with tighter
timing constraints or higher computational loads within a given timing constraint. This can
lead to improved execution of computation-heavy real-time systems such as those for video
surveillance, computer vision, radar tracking, and real-time hybrid structural testing, whose
stringent timing constraints can be difficult to meet through traditional multiprocessing.
Many of these applications are highly parallelizable, and supporting intra-task parallelism
can allow real-time systems to run more demanding programs.
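
To make the link between intra-task parallelism and tighter timing constraints concrete, the
sketch below applies the greedy completion-time bound T1/P + T∞ discussed earlier in this
chapter to a single task with a deadline. It is a Python sketch under stated assumptions: the
function name `min_processors_for_deadline` is hypothetical, the bound is used only as a
sufficient condition, and this is not presented as the schedulability analysis used in this
dissertation.

```python
import math


def min_processors_for_deadline(work, span, deadline):
    """Smallest P such that work / P + span <= deadline.

    Returns None when the deadline does not exceed the span, because then
    no processor count satisfies the bound (assuming work > 0).
    """
    if deadline <= span:
        return None
    return max(1, math.ceil(work / (deadline - span)))


if __name__ == "__main__":
    # A task with 100 units of work and a critical path of 10 units.
    for deadline in (110, 60, 30, 10):
        print(deadline, min_processors_for_deadline(100, 10, deadline))
```

With 100 units of work and a span of 10, a deadline of 30 cannot be met by sequential execution,
but the bound is satisfied with five processors, since 100/5 + 10 = 30.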

6.3 Soft Real-Time vs. Hard Real-Time

The broader field of parallel real-time concurrency platforms is still in its infancy, and this
work restricts itself to soft real-time systems that are implemented atop the Linux operating
system. These soft real-time systems do not make absolute worst-case timing guarantees
under all circumstances, instead aiming to provide predictable real-time behavior most of
the time. To contrast more specifically, hard real-time systems are those that are validated
and certified to have correct timing behavior under all foreseeable operating conditions, based
on (often) extremely pessimistic models of system behavior and workload performance (i.e.,
worst case execution time). This requires specific design for real-time behavior at every level:
hardware, operating system, system libraries, and application programs. For the purpose of
achieving hard real-time parallel performance, we would also include a hard real-time parallel
concurrency platform in that list.

Hard-real-time systems suffer, in practice, from strict workload restrictions that can prohibit
the use of up to 50% of available processing capacity. Soft-real-time systems seek to claw back
some of this capacity in exchange for tolerating occasional deadline misses under certain
conditions, representing a tradeoff between timeliness and processor utilization. Deterministic
models of soft-real-time behavior exist: for example, bounded tardiness (or lateness), where
a job may miss its deadline by at most a specified amount, may be permissible in some applications
where a certain timing behavior is desired but a relaxed timing behavior may be acceptable.
Bounded tardiness can be provided by otherwise traditional hard-real-time scheduling
methods such as Earliest Deadline First (EDF), but can also be provided by specifically
soft-real-time scheduling algorithms such as the class of Pfair algorithms. Stochastic models
also exist: for example, a periodic task with a varying workload, called a semi-periodic task,
may be described by a probability distribution that describes the likelihood of any given
job's actual computational requirement. A soft-real-time approach may certify the behavior
of the system up to but not exceeding a maximum computational demand, which together
with a task's probability distribution describes the probability that the timing requirement
for each job from such a task will be satisfied. Other task models may allow job arrival time,
or even both job arrival and workload, to fluctuate stochastically. Lastly, time-valued tasks
may be described by a utility function that describes the utility of finishing a soft-real-time
computation over time. This allows a system to derive maximum value from finishing a task
by its deadline, and gracefully degrade the usefulness of the computation over time until it
is no longer worthwhile. A detailed survey of soft-real-time task models, as well as specific
analysis of the bounded tardiness model, may be found in [31].
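
The three soft-real-time models above can each be reduced to a small calculation. The Python
sketch below is illustrative only: the function names, the discrete demand distribution, and
the linear shape of the utility decay are all assumptions, and the survey in [31] should be
consulted for the actual model definitions.

```python
def tardiness(finish_time, deadline):
    """Amount by which a job overran its deadline (zero if it met it)."""
    return max(0, finish_time - deadline)


def probability_within_budget(demand_distribution, budget):
    """P(job demand <= budget) for a discrete demand distribution.

    demand_distribution maps an execution demand (in time units) to its
    probability; budget is the certified maximum computational demand.
    """
    return sum(p for demand, p in demand_distribution.items()
               if demand <= budget)


def linear_decay_utility(finish_time, deadline, zero_time, max_utility=1.0):
    """Hypothetical time-utility function: full value up to the deadline,
    then linear decay until the result is worthless at zero_time."""
    if finish_time <= deadline:
        return max_utility
    if finish_time >= zero_time:
        return 0.0
    return max_utility * (zero_time - finish_time) / (zero_time - deadline)


if __name__ == "__main__":
    demand = {2: 0.5, 4: 0.25, 8: 0.25}  # assumed semi-periodic workload
    print(tardiness(12, 10))
    print(probability_within_budget(demand, 4))
    print(linear_decay_utility(15, 10, 20))
```

In the assumed workload, certifying the system for a demand of 4 time units covers jobs with
probability 0.75; the remaining jobs are the ones that may miss their timing requirement.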

This dissertation uses scheduling theory that would be suitable for hard-real-time systems,
but uses an operating system (Linux with the RT_PREEMPT patchset applied) and concurrency platform (OpenMP) that are not hard-real-time software, and thus cannot reasonably
make any claim toward providing hard-real-time performance. Furthermore, no hard-real-time concurrency platforms exist, and:

1. Hard real-time parallel systems will likely need to solve most or all of the challenges
addressed in the design of soft real-time parallel systems, but also will have further
challenges beyond that. Thus, soft real-time parallel systems are a natural stepping
stone towards hard real-time parallel systems.

2. Soft real-time and hard real-time systems are both valid system models that have
particular strengths and weaknesses in different design contexts, and both deserve
thorough exploration in their own design space. Since soft real-time systems do not
make strong guarantees of system behavior, it is likely that future researchers will
see soft real-time parallel systems as performance-oriented systems that are especially
suitable for applications where the potential for injury to humans or property is minimal
(e.g., physically small systems or systems in highly controlled settings). In contrast,
hard real-time parallel systems will provide a lesser degree of performance increase but
will be suitable for safety-critical applications where the penalty for failure is large.

Moreover, achieving hard-real-time performance is a difficult task that goes beyond having
appropriate scheduling theory and a hard-real-time compatible software architecture. There
are many second-order effects that must be considered, and either accounted for or mitigated,
to make a claim to having credible hard-real-time performance. All forms of contention
that exist within computer hardware, and in particular cache effects, are possible sources
of interference when enforcing hard-real-time behavior. These second-order effects are not
generally mitigated in multiprocessing systems to a large degree, let alone parallel processing
systems: the current recommendation by the Federal Aviation Administration, the United
States government agency in charge of aviation safety, is to disable cache on multi-core
processors due to the unknown risk it poses to real-time system operation, favoring instead
degraded but predictable performance. The addition of parallel computing to the real-time
computing landscape only increases the opportunities for and likelihood of interference
occurring: since parallel tasks may have multiple threads on multiple processors, a single such
task is then capable of influencing multiple tasks on multiple other processors.

6.4 Parallel Real-Time

Real-time systems are those that must satisfy real-world timing constraints in order to be correct, and provide strong assurances of predictable system behavior under adverse conditions.
Such requirements are common where computer systems must control physical objects or
monitor physical phenomena, and the requirements themselves are usually derived from the
physical behavior of the system in question. For example, an earthquake engineer may wish
to subject a test structure to a previously recorded earthquake loading. However, moving
seismic masses against one another inevitably invokes Newton's Third Law (for every action
there is an equal and opposite reaction) and the control of such a test must incorporate a
feedback-control loop to account for this. Here, the rate at which actuation commands can
be sent to the test apparatus will determine how accurately the recorded earthquake loading
can be recreated, and the rate at which sensor data can be taken off the test specimen
determines the possible test accuracy. The physical requirements of the test in fact drive
the entire system design: the engineer first must decide what physical fidelity is sufficient
to evaluate the phenomena they're interested in, and then select a computational platform
that is capable of providing a sufficient level of computational performance.

In some domains, such as Real-Time Hybrid Simulation, performance is limited by the ability
to execute large simulations or control-loop computations in real-time and at a fast enough
rate in order to be useful. These systems can easily generate computational workloads that
far outstrip the capabilities of sequential processors and demonstrate an increasing demand
for high-performance (parallel) real-time computing. There are many systems where it is
easy to see that high-performance real-time computing is either a limiting factor or obviously
useful to further development: autonomous vehicles, mobile robotics, real-time classification
and machine learning, etc.

Unfortunately, recent history suggests that the slowing growth of sequential processor
capability is unlikely to change. Instead, processors have increasingly incorporated multiple
processing cores per chip, so much so that single-core chips are difficult to purchase, and
processors with two, four, eight, or more processing cores are abundant. This necessitates
a paradigm shift for real-time application designers who desire more computational power,
as existing approaches to real-time processing have been largely sequential in nature. To
make this shift, designers must be willing to take the plunge into parallel programming. Such
a shift requires extensions throughout real-time systems, from theoretical foundations to the
design and implementation of real-time software.

There has been much recent interest in parallel real-time computing. A variety of theoretical
results analyze scheduling algorithms and task models for both soft real-time [32, 33], and
hard real-time settings [34, 3, 35, 2, 1, 15, 21, 36, 37, 38]. There has been comparatively
little work, however, on building systems capable of executing such parallel real-time
computations. In [21] a proprietary system was used in an autonomous vehicle for near-term
route planning, and it was shown that parallelism could provide a more comfortable ride (less
sharp accelerations). Two other systems [8, 14] examined Linux-based strategies for
implementing parallel real-time execution platforms and were validated on synthetic benchmarks,
while [16] provided a platform for use with a special real-time operating system (Fork/Join
OS, or FJOS) and showed good parallel speedup in real-time for a number of important
numerical computations.

There has been significant work on multiprocessor real-time scheduling prior to (and alongside) parallel real-time scheduling [39].

6.5 Real-Time Hybrid Simulation (RTHS)

Real-Time Hybrid Simulation combines numerical simulation with physical experimentation
to simulate structures in the lab that would otherwise be infeasible or impossible to
test. This dissertation considers RTHS as an exemplary parallel real-time cyber-physical
application, as it requires meeting appropriate real-time constraints alongside large
parallel workloads and tightly-coupled physical apparatus.

Early work in what is now RTHS began as quasi-static [40] and pseudo dynamic (PSD) [41]
testing, simulating dynamic responses without aiming for real-time execution, or substructuring
of the specimen physically. As the field developed, research expanded to also investigate
how to best meet the objectives in such an experiment. Existing integration schemes were
modified to enable more complex testing [42], and error propagation was examined to
facilitate effective testing techniques [43]. Real-time hybrid testing is a natural evolution of
PSD, as the best dynamic response is obtained from real-time tests with strict timing
constraints [44]. RTHS was also recognized as a good way to demonstrate and evaluate the
capabilities of structural control systems, which add structural components that attempt to
control the dynamic response of a structure [45]. Recently, [46] has studied different control
algorithms through effective use of RTHS techniques.

Several non-parallel software packages have been designed specifically for real-time hybrid
simulation. For system support, Linux-based systems [47, 48] can provide a flexible, reusable
middleware architecture for connecting computational and physical components of an RTHS.
For simulation and modeling support, real-time packages [49] provide algorithms suitable for
real-time operation.

In practice there are a number of platforms on which RTHS is currently conducted. The
most prevalent is Matlab's xPC system. Matlab and Simulink code is written on an xPC host
machine, and then sent to an xPC target to execute the computations and interface with the
physical components of the RTHS. During execution, all of the xPC target's computational
power is devoted solely to the RTHS. xPC does not currently support parallel processing (as
defined earlier), so all computations on this system must be performed sequentially. There
is some research into using multiple xPC targets to increase computational resources [50],
but this approach only supports simultaneous execution of multiple sequential codes, which
does not achieve the goals of parallel programming as described at the top of this chapter.
The authors of [51] have also developed Mercury, a closed-source C++ platform that
allows for the use of more advanced finite element models in RTHS.

Chapter 7: Conclusion

Parallelism was a natural and foreseeable evolution for real-time systems, but it has taken
decades to apply the fruits of parallel computing research to the field of real-time computing.
As has been demonstrated, the engineering of a parallel real-time computing concurrency
platform is non-trivial, and the allocation of this capacity introduces new tradeoffs among a
task's speed, computational requirement, and fidelity.

Two approaches were tested to explore the engineering of a parallel real-time concurrency
platform. The first approach, RT-OpenMP, explicitly schedules all runnable code at a very
fine-grained level. This strategy used static partitioning to processors according to a demand
bound function, which gives a high degree of control over when and where code executes,
but ultimately suffers from high overhead, which limits the maximum periodic rate to
approximately 500 Hz. The second approach, federated scheduling, treats the parallel runtime
system as a black box and makes only the minimal assumption that the parallel runtime
must have a (nearly) greedy scheduler. This allows the use of existing parallel runtime
systems, which are efficient but offer little control over how programs are executed. Extensive
testing demonstrates that this approach is suitable for many workloads, but is a strictly soft
real-time approach. The design and efficiency of these systems are driven at a deep level
by the specific real-time assurances each wants to make: RT-OpenMP strives for arbitrary
preemptability and little to no priority inversion, while federated scheduling side-steps that
issue by simply isolating parallel tasks onto different hardware.

A novel infrastructure, CyberMech, has been used to evaluate the performance of parallel
real-time computation and the mixed-criticality federated scheduler using Real-Time Hybrid
Simulation as an exemplar for more general cyber-physical applications. This software
manages multiple communicating concurrent parallel real-time processes and gives
multiple threads and processes access to non-thread-safe data acquisition software. Taken all
together, CyberMech allows the execution of RTHS experiments using federated scheduling
in a soft real-time manner on general Linux platforms, which brings parallel computation to
RTHS experimentation for the first time and eliminates any need for proprietary computer
hardware or software beyond data acquisition devices.

Lastly, these experiences have been used to draw broader conclusions about the engineering
of parallelism in statically determined cyber-physical systems: those that fix computational
workloads and timing constraints prior to system execution. One specific RTHS experiment
was benchmarked exhaustively and it was shown how system designers can translate reasoning about cyber-physical properties (such as a target control rate) into management of a
parallel workload.

Taken as a whole, this dissertation represents a thorough investigation of the engineering of
parallel real-time systems where the computational workload and computational resources do
not change through program execution. The associated software allows a real-time systems
developer or sufficiently trained domain expert (such as a structural engineer) to implement
their own parallel real-time workloads in real cyber-physical systems.

7.1 Future Parallel Real-Time Platforms

Current parallel real-time concurrency platforms have two primary limitations from a
real-time developer's point of view: they are statically determined and cannot flexibly modify
themselves at runtime, and they are soft real-time systems. Both RT-OpenMP and the
Federated Scheduling Service rely on a static partitioning of system resources prior
to runtime in order to function correctly, which is required by the theoretical analysis of these
systems. In RT-OpenMP tasks were partitioned at the fine-grained level of strands, while
in the Federated Scheduling Service high-utilization tasks were partitioned onto processors.
As a result, these systems are computationally inflexible, and cannot naturally deal with
dynamic workloads that are quite common in the cyber-physical domain (e.g., when reacting
to unexpected physical changes). The suboptimal solution is over-provisioning. The
mixed-criticality version of the federated scheduling system provides some degree of dynamism, but
the current work only supports switching between a finite set of static operating modes and
would thus deal with arbitrary combinations of dynamic events poorly.

Moving towards more dynamic systems would seem to require either (1) getting rid of
partitions or (2) allowing partitions to be reconfigured at runtime. The former approach would
suggest an approach akin to global earliest deadline first scheduling, where a single work
queue is used to prioritize all outstanding work in the system. However, a global work queue
is known to be non-scalable in a parallel computing context due to the overhead of global
synchronization. Allowing partitions to be arbitrarily reconfigured at runtime would be a
hybrid solution, which would not require a global work queue but would require an enhanced
theory of operation so as to make assurances about satisfying timing constraints as both the
computational workload and computational resources for a workload may vary unpredictably
in time.

Taking dynamism one step further to cyber-physical systems would also require a model of
how physical output of the system varies according to workload and computational resources,
so that physical behavior can be predictably managed as the underlying computational
substrate changes dynamically.

Moving to hard real-time parallel computing poses a different challenge. The existing parallel
concurrency platforms such as Cilk and OpenMP were not built for real-time behavior,
and tend to optimize throughput for large computations rather than predictable system
execution. A popular strategy for parallel scheduling, called randomized work stealing, is
used in Cilk and in some OpenMP implementations, and is unsuitable for hard real-time
execution since the basic scheduling action involves a randomized process. Randomized work
stealing is used because it distributes overhead throughout the system and in practice scales
well regardless of what the parallel workload looks like. Thus, any hard real-time parallel
execution platform will need to be created from scratch, and the techniques it uses may or may not be
techniques that are popular in the general parallel computing domain. Basic versions of such
systems could simply implement a non-scalable, high-overhead platform with the knowledge
that this is the price to pay for high predictability. However, in the long term a more elegant
solution is likely to be needed, as the whole purpose of parallel programming is to maximize
the use of computational resources.

Future parallel real-time systems are likely to incorporate features such as dynamic
computational and timing requirements that make the strict partitioning and separation of
parallel tasks used in federated scheduling less feasible, and hybrid mixed-criticality federated
scheduling is one such example of that. However, the RT-OpenMP system demonstrates that
a large degree of OS involvement and subsequent overheads may not be feasible while also
achieving a high degree of parallel performance, and the existing high-performance parallel
systems do not support basic real-time primitives such as work prioritization or preemption.
Future parallel real-time systems that require dynamic, reconfigurable behavior are either
going to need a new parallel scheduling and execution strategy that can be implemented
in userspace but also supports these real-time primitives, or they are going to need to work
cooperatively and efficiently with the OS kernel in a manner not yet devised.

7.2 Future of RTHS Infrastructure

The current infrastructure for RTHS, CyberMech, meets the current desires of first-generation
parallel RTHS, but already there are enhancements that are required for future planned
RTHS experiments. Having been built on the Federated Scheduling Service, the CyberMech
system is designed to handle statically determined computational workloads. This rules out
a large class of experiments where computational capacity needs to be allocated in response
to changing physical situations, and there are specific examples that motivate every aspect
of the computational execution platform. A simulation that must respond to unexpected
physical damage motivates both dynamic mesh refinement, a technique where a structure
would be modeled in high fidelity around unanticipated physical damage, and dynamic timing
constraints, where a structure may be modeled at a varying periodic rate according to
local conditions. Refinement of a numerical structure would also motivate the ability to split
numerical simulations and dynamically increase the number of separate computational tasks
executing on a system, which in turn requires a computational task model that allows for
tasks to be put online and offline as needed during program execution, rather than arranging
everything statically prior to execution. Of course, all of these proposed features would
require principled strategies to hand off computational responsibility and resources in a way
that ensures the fidelity of the overall experiment.

Even within the static domain, however, there are further RTHS experiments that could be
done to stress the current system. All RTHS experiments conducted to date have been linear
systems with a fixed per-period execution time. A larger class of non-linear structural simulations exists that depend on iterative algorithms that must converge to a solution at runtime.
The static nature of the current system means that these would have to be dealt with by
over-provisioning, but there would be interesting questions of how much over-provisioning
is necessary, and this would be a fertile bed for investigating the effect of deadline misses, jitter,
and computational latency on the fidelity of the overall experimental system. Additionally,
larger structural simulations could better stress the parallel capacity of CyberMech.
This work contains structural models that can be scaled up to be arbitrarily computationally intensive through subdivision. This makes sense mathematically, but at a certain point
it becomes unrealistic from an experimentalist's point of view, so larger but still realistic
models could more properly validate the results in this work. Unfortunately, some models
are very small, designed with real-time execution in mind, and some models are very large
(pseudodynamic simulations where individual time steps can take minutes or hours of computation time), but not much is known between these extremes. One last bit of low-hanging
fruit would be to implement an RTHS on CyberMech with multiple physical subdomains,
but this has proven difficult, more on the structural engineering side of developing adequate
controllers that can manage two physical structures and the interactions between them.
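
As a rough illustration of the over-provisioning question for iterative non-linear simulations, the following Python sketch counts Newton updates for a hypothetical single-degree-of-freedom hardening spring over a range of loads. The spring model, its parameters, and the function names are assumptions made for illustration only and are not part of CyberMech. The sketch merely suggests that the worst-case iteration count, which a statically provisioned system would have to budget for in every period, can sit well above the typical count.

```python
def newton_updates(force, k=1.0, c=0.5, tol=1e-10, max_iter=50):
    """Count Newton updates needed to solve k*u + c*u**3 = force for u.

    The cubic hardening spring is a hypothetical stand-in for a
    non-linear structural element.
    """
    u = 0.0
    for updates in range(max_iter + 1):
        residual = k * u + c * u**3 - force
        if abs(residual) < tol:
            return updates
        # Newton step using the tangent stiffness k + 3*c*u^2 (always > 0).
        u -= residual / (k + 3.0 * c * u**2)
    raise RuntimeError("did not converge within max_iter updates")


def provisioning_factor(forces):
    """Return (worst, mean, worst/mean) Newton update counts over the loads."""
    counts = [newton_updates(f) for f in forces]
    worst = max(counts)
    mean = sum(counts) / len(counts)
    factor = worst / mean if mean > 0 else 1.0
    return worst, mean, factor


# Hypothetical load history: loads doubling from 0.01 up to 10.24.
loads = [0.01 * 2**i for i in range(11)]
worst, mean, factor = provisioning_factor(loads)
print(f"worst-case updates: {worst}, mean updates: {mean:.2f}, "
      f"over-provisioning factor: {factor:.2f}")
```

If the per-period budget is sized for the worst-case count, the ratio of worst to mean gives one crude measure of how much capacity sits idle in a typical period.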

Lastly, CyberMech should be used to investigate how such RTHS experiments are constructed and specified from both a computational and structural engineering point of view,
so that students in structural engineering courses can implement their own non-trivial RTHS
and investigate it as an alternative testing strategy alongside pure experiment and pure simulation. The current interface is still very much that of a prototype system, with several
operations being rather exacting and laborious: specification of structural simulations, specification of physical elements, and then specification of how these are connected. Also, there
are a wide variety of existing structural engineering tools for simulation, but these do not
often support real-time simulation, much less parallel real-time simulation. Part of constructing a good interface for CyberMech will probably also involve capturing good workflows for
structural engineers to develop RTHS experiments and then integrating CyberMech with
those tools.

7.3 Future of Cyber-Physical Parallelism

Current efforts to integrate parallelism into cyber-physical systems are entirely ad hoc. This
works for highly engineered systems that are (so far) loosely regulated, such as self-driving
car systems, but not elsewhere. On one end of the spectrum there are small developers
or researchers who may want extra computational capacity in a small system, for example
doing parallel computing with a Raspberry Pi on board a $500 drone, who do not have
the knowledge or resources to do so successfully. Such persons may be able to phrase
their cyber-physical problem perfectly adequately in a domain-specific language ("Running
OpenCV to track objects in front of my drone causes the periodic control rate to drop too
low.") but not be able to translate their concepts into the scheduling and implementation of
a parallel workload in such a way that computational tasks minimally interfere. For these
people there needs to be a more principled way to talk about the cyber-physical interactions
of their physical platform, their computational resources, and the computational workloads
they are running; these people should not need to be real-time or parallel systems engineers
to predict whether or not their desired computational workload is feasible and what elements
of that workload are elastic and inelastic. On the other end of the spectrum there are
developers (such as aircraft designers) who need to be able to design large, complex, but safe
systems and then convey that assurance in a way that is understood and trusted (such as to
regulators).

A significant need is to develop a principled approach to understanding and modeling the
impact of computational variance in cyber-physical systems. All computational tasks experience some degree of latency and jitter, for example, which will manifest itself as physical
behavior. Understanding the effects that these have on a system, especially in a complex
system where a single event can have multiple knock-on effects, will be a major step towards
a more generalized understanding of the engineering of cyber-physical systems, and would
give a much clearer indication to future systems designers of how and where to allocate
parallelism to improve computational performance. Current practice is to build systems
slowly and to test frequently, but intuition suggests that at a certain size and scale this
approach will become unworkable.
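
To make the claim that computational latency manifests as physical behavior more concrete, the following Python sketch simulates a hypothetical first-order plant under proportional feedback in which the controller acts on a stale measurement. The plant, gain, time step, and function name are all illustrative assumptions rather than a model of any system in this work, and only a constant latency is modeled (no jitter). Under these assumptions, the accumulated regulation error grows as the measurement delay grows.

```python
def integral_abs_error(delay_steps, gain=5.0, dt=0.1, steps=200):
    """Forward-Euler simulation of x' = u with u = -gain * x(t - delay).

    Returns the integral of |x| over the run, a simple measure of how far
    the physical state strays from its set point of zero.
    """
    xs = [1.0]  # initial displacement from the set point
    for n in range(steps):
        # The controller only sees a measurement that is delay_steps old;
        # before any old sample exists it sees the initial state.
        stale = xs[max(n - delay_steps, 0)]
        u = -gain * stale
        xs.append(xs[-1] + dt * u)
    return dt * sum(abs(x) for x in xs)


for d in (0, 1, 2):
    print(f"delay of {d} control periods -> integral |x| = "
          f"{integral_abs_error(d):.3f}")
```

Even this toy loop shows a purely computational quantity, the number of periods by which a measurement is stale, changing the trajectory of the physical state.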

Ultimately the goal of this field of research is to be able to confidently build powerful (and
thus potentially dangerous) systems, with confidence grounded in a robust system of analysis
that catches and prevents dangerous conditions from occurring. It is likely there will be
a strong demand for highly complex cyber-physical and autonomous systems to become
increasingly prevalent in our world as their benefits are recognized (e.g., if self-driving cars
were to cause a significant reduction in fatal accidents). Selling these systems to the public
and to regulators will require an effective and understandable way to demonstrate strong
assurances that catastrophic behavior cannot result from system operation. Failure to do
so will limit the reach of these technologies to small and isolated systems which cannot do
much harm even in the event of catastrophic behavior, and consequently limit the benefits
that we could otherwise derive.

