Worst-Case Execution Time Guarantees for Runtime-Reconfigurable Architectures by Damschen, Marvin
Worst-Case Execution Time Guarantees
for Runtime-Reconfigurable Architectures
Dissertation
approved by the Department of Informatics
of the Karlsruhe Institute of Technology (KIT)
for the attainment of the academic degree of
Doctor of Engineering (Doktor der Ingenieurwissenschaften)
by
Marvin Damschen
from Moers
Date of the oral examination: December 19, 2018
First reviewer: Prof. Dr.-Ing. Jörg Henkel
Karlsruhe Institute of Technology (KIT)
Second reviewer: Prof. Frank Mueller, Ph.D.
North Carolina State University (NCSU)

Marvin Damschen
Burgstr. 110
76356 Weingarten (Baden)
I hereby declare in lieu of an oath that I wrote the submitted thesis independently, that I have fully listed
the sources, Internet sources and aids used, and that I have marked as borrowed, in every case with an indication
of the source, those passages of the thesis (including tables, maps and figures) that are taken verbatim or in
substance from other works or from the Internet.
Marvin Damschen

Acknowledgments
I would like to express my deep gratitude to my advisor Prof. Dr.-Ing. Jörg Henkel for believing in me from the
beginning and providing an environment that was challenging and full of opportunities. I am thankful for his
support and guidance, and the invaluable experience that he shared. By asking the right questions and supporting
my ideas, he had a strong impact not only on the quality of my work, but especially on my development as a
researcher.
I want to thank Prof. Frank Mueller from the North Carolina State University for agreeing to co-advise my thesis.
I deeply appreciate our collaboration, which began when he welcomed me as a guest to his research lab, and his
continuing support ever since.
Dr.-Ing. Lars Bauer also had a big impact on my Ph.D. research and I want to express my sincere gratitude for all the
time he invests into research projects that provide a great environment for doctoral researchers. The contributions
he made during his Ph.D. are the basis for the evaluation platform that was used in this work. I also want to thank
Dr.-Ing. Artjom Grudnitsky for being a great help in technical and general aspects during the first months of my
Ph.D. and helping me to kick-start my Ph.D. research. I further want to thank Dr.-Ing. Farzad Samie for answering
all my questions about the Ph.D. defense and Martin Rapp for providing comments on my thesis.
I was fortunate to be part of the Transregional Collaborative Research Center “Invasive Computing” (SFB/TR 89),
which is funded by the German Research Foundation (DFG), and was a great source of experiences, collaborations
and inspirations for me. In this context I want to express my gratitude to Andreas Fried, Dr.-Ing. Manuel Mohr,
Alexander Pöppl, Sven Rheindt, Florian Schmaus and everyone else taking part in the integration of research ideas
into a common prototype. The outstanding collaboration eventually allowed us to demonstrate a prototype of the
full invasive computing technology stack during the review phase of Invasive Computing.
My thanks also go to Dr. Enrico Rossi, at the time a Ph.D. student of Prof. Dr. Giorgio Buttazzo from the Scuola
Superiore Sant’Anna in Pisa, for visiting us and being a great collaborator in the area of runtime-reconfigurable
real-time systems.
During my Ph.D. research I was able to supervise multiple student works and I want to thank all the students for
the work that they put in and their contributions to prototypes.
Finally, I want to thank my family. I want to express my gratitude to my parents, who supported my interests and
education however they could. My wife Katharina was not only understanding, but a continuing support during
this journey. I want to express my deepest appreciation for her love and support.
Thank you!
Karlsruhe, January 2019 Marvin Damschen

I should like to say two things, one intellectual and one moral:
The intellectual thing I should want to say to them is this: When you are studying any matter or considering any philosophy,
ask yourself only what are the facts and what is the truth that the facts bear out. Never let yourself be diverted either by what
you wish to believe or by what you think would have beneficent social effects if it were believed, but look only and solely at
what are the facts. That is the intellectual thing that I should wish to say.
The moral thing I should wish to say to them is very simple. I should say: Love is wise, hatred is foolish. In this world, which
is getting more and more closely interconnected, we have to learn to tolerate each other. We have to learn to put up with the
fact that some people say things that we don’t like. We can only live together in that way, and if we are to live together and not
die together we must learn a kind of charity and a kind of tolerance which is absolutely vital to the continuation of human life
on this planet.
Bertrand Russell, Face to Face (BBC, 1959)

Summary (Kurzfassung)
Real-time systems are ubiquitous in our everyday life, for example in safety-critical environments such as
automotive and avionics electronics or robotics. The correctness of a real-time system depends not only on the
correctness of the computations performed, but also on the non-functional requirement of meeting deadlines.
Missing a deadline can lead to serious malfunctions. Therefore, worst-case execution times (WCET) must be
guaranteed. Despite significant scientific progress, WCET guarantees can only be analyzed for microarchitectures
that lag years behind the development of current high-performance microarchitectures. To meet the growing
performance demands of real-time systems, analyzable performance features are required. To escape the scarcity
of analyzable performance features, the main contribution of this dissertation is the introduction of runtime
reconfiguration of hardware accelerators on a field-programmable gate array (FPGA) with the goal of achieving
performance under WCET guarantees. The flexibility of the system is maintained rather than restricted to a
single application domain.
First, in an extensive analysis of how (average-case) performance is achieved on fused CPU-GPU architectures,
this dissertation contributes novel scheduling approaches for distributing work among CPU and GPU. Fused CPU-GPU
architectures, which combine a CPU and a GPU on a single chip, are currently one of the main directions in the
development of high-performance microarchitectures. Being able to employ architectures of this kind for the
realization of real-time systems would be highly desirable, because they offer high performance within a limited
area and power budget. One result of the presented analysis, however, is the discovery of a bottleneck in the
cache coherency of current fused CPU-GPU architectures that share the last level cache between CPU and GPU. As a
consequence, (i) performance predictions become more difficult and thus (ii) a last level cache shared between
CPU and GPU is added to the growing list of microarchitectural features that benefit average-case execution time
but render the analysis of WCET guarantees on high-performance architectures practically impossible. This further
motivates the need for novel microarchitectural features for predictable performance that are amenable to the
analysis of WCET guarantees.
Pursuing this goal, a controller for runtime reconfiguration called “Command-based Reconfiguration Queue” (CoRQ)
is presented that offers guaranteed latencies for its operations. This holds in particular for the
reconfiguration delay, the time required to configure a hardware accelerator onto a reconfigurable fabric (FPGA).
CoRQ enables the design of timing-analyzable architectures that support WCET guarantees. Based on the now
possible guaranteed reconfiguration delay of accelerators, a WCET analysis is introduced that enables tasks to
reconfigure application-specific custom instructions (CIs) at runtime. CIs are executed by a processor pipeline
and trigger the execution of one or more accelerators. Different measures for handling reconfiguration delay are
compared with respect to their impact on WCET guarantees and overestimation. The timing anomaly of runtime
reconfiguration is identified and safely bounded: a case in which executing iterations of a computational kernel
faster than in WCET during the reconfiguration of CIs can prolong the total execution time of a task. Once tasks
that perform runtime reconfiguration of CIs can be analyzed for WCET guarantees, the question arises which CIs
should be configured on a constrained reconfigurable area to optimize the WCET. This question is addressed for
systems in which multiple CIs, each with different implementations (allowing a trade-off between latency and area
requirements), can be selected. This is usually the case, for example when high-level synthesis is used. This
so-called instruction selection problem for WCET optimization is modeled based on the Implicit Path Enumeration
Technique (IPET). IPET is the path analysis method that state-of-the-art timing analysis tools rely on. To our
knowledge, this is the first WCET optimization approach that enables the use of global program flow information
(and information about reconfiguration delays). An optimal algorithm (which resembles Branch and Bound) and a
fast heuristic algorithm (which is greedy-based and achieves the optimal solution in most cases) are presented.
Finally, an approach is presented that, for the first time, makes it possible to combine the optimization of
static WCET guarantees and the optimization of average-case execution at runtime (while adhering to WCET
guarantees) by means of runtime reconfiguration of hardware accelerators. The approach consists of an analysis of
bounds on runtime slack (the amount of execution time by which a program part executes faster than in WCET) that
safely allow accelerators to be reconfigured for optimizing average-case performance. Existing WCET guarantees
are preserved. Furthermore, a mechanism is presented that makes it possible to monitor runtime slack based on
simple performance counters. The required performance counters are commonly available in many current
microprocessors.
In summary, this dissertation shows that runtime reconfiguration is a key feature for achieving predictable
performance.
Abstract
Real-time systems are ubiquitous in our everyday life, e.g., in safety-critical domains such as automotive, avionics
or robotics. The correctness of a real-time system does not only depend on the correctness of its calculations,
but also on the non-functional requirement of adhering to deadlines. Failing to meet a deadline may lead to
severe malfunctions; therefore, worst-case execution times (WCET) need to be guaranteed. Despite significant
scientific advances, however, timing analysis for WCET guarantees lags years behind current high-performance
microarchitectures with out-of-order scheduling pipelines, several hardware threads and multiple (shared) cache
layers. To satisfy the increasing performance demands of real-time systems, analyzable performance features are
required. In order to escape the scarcity of timing-analyzable performance features, the main contribution of this
thesis is the introduction of runtime reconfiguration of hardware accelerators onto a field-programmable gate array
(FPGA) as a novel means to achieve performance that is amenable to WCET guarantees. Instead of designing an
architecture for a specific application domain, this approach preserves the flexibility of the system.
First, this thesis contributes novel co-scheduling approaches to distribute work among CPU and GPU in an
extensive analysis of how (average-case) performance is achieved on fused CPU-GPU architectures, a main trend
in current high-performance microarchitectures that combines a CPU and a GPU on a single chip. Being able to
employ such architectures in real-time systems would be highly desirable, because they provide high performance
within a limited area and power budget. As a result of this analysis, however, a cache coherency bottleneck is
uncovered in recent fused CPU-GPU architectures that share the last level cache between CPU and GPU. This
insight (i) complicates performance predictions and (ii) adds a shared last level cache between CPU and GPU to
the growing list of microarchitectural features that benefit average-case performance, but render the analysis of
WCET guarantees on high-performance architectures virtually infeasible. This further motivates the need for
novel microarchitectural features that provide predictable performance and are amenable to timing analysis.
Towards this end, a runtime reconfiguration controller called “Command-based Reconfiguration Queue” (CoRQ)
is presented that provides guaranteed latencies for its operations, especially for the reconfiguration delay, i.e., the
time it takes to reconfigure a hardware accelerator onto a reconfigurable fabric (e.g., FPGA). CoRQ enables the
design of timing-analyzable runtime-reconfigurable architectures that support WCET guarantees. Based on the
now feasible guaranteed reconfiguration delay of accelerators, a WCET analysis is introduced that enables tasks
to reconfigure application-specific custom instructions (CIs) at runtime. CIs are executed by a processor pipeline
and invoke execution of one or more accelerators. Different measures to deal with reconfiguration delays are
compared for their impact on accelerated WCET guarantees and overestimation. The timing anomaly of runtime
reconfiguration is identified and safely bounded: a case where executing iterations of a computational kernel faster
than in WCET during reconfiguration of CIs can prolong the total execution time of a task. Once tasks that perform
runtime reconfiguration of CIs can be analyzed for WCET guarantees, the question arises of which CIs to configure on
a constrained reconfigurable area to optimize the WCET. The question is addressed for systems where
multiple CIs, each with different implementations (allowing a trade-off between latency and area requirements), can be
selected. This is generally the case, e.g., when employing high-level synthesis. This so-called WCET-optimizing
instruction set selection problem is modeled based on the Implicit Path Enumeration Technique (IPET), which is the
path analysis technique that state-of-the-art timing analyzers rely on. To our knowledge, this is the first approach that
enables WCET optimization with support for making use of global program flow information (and information
about reconfiguration delay). An optimal algorithm (similar to Branch and Bound) and a fast greedy heuristic
algorithm (that achieves the optimal solution in most cases) are presented. Finally, an approach is presented
that, for the first time, combines optimized static WCET guarantees and runtime optimization of the average-case
execution (maintaining WCET guarantees) using runtime reconfiguration of hardware accelerators by leveraging
runtime slack (the amount of time by which program parts are executed faster than in WCET). It comprises an analysis
of runtime slack bounds that enable safe reconfiguration for average-case performance under WCET guarantees
and presents a mechanism to monitor runtime slack using a simple performance counter that is commonly available
in many microprocessors.
Ultimately, this thesis shows that runtime reconfiguration of accelerators is a key feature to achieve predictable
performance.
Author’s Contributions
The following list enumerates journal, conference and workshop papers published by the author of this thesis
while pursuing his doctorate at the Chair for Embedded Systems of the Karlsruhe Institute of Technology.
[1] Lars Bauer, Artjom Grudnitsky, Marvin Damschen, Srinivas Rao Kerekare, and Jörg Henkel. “Floating
point acceleration for stream processing applications in dynamically reconfigurable processors”. In: IEEE
Symp. on Embed. Syst. for Real-time Multimedia (ESTIMedia), Amsterdam, The Netherlands, October
8-9, 2015. 2015, pp. 1–2. DOI: 10.1109/ESTIMedia.2015.7351762.
[2] Marvin Damschen, Lars Bauer, and Jörg Henkel. “Extending the WCET Problem to Optimize for Runtime-
Reconfigurable Processors”. In: ACM Trans. on Archit. and Code Optim. (TACO) 13.4 (2016), 45:1–45:24.
DOI: 10.1145/3014059.
[3] Marvin Damschen, Lars Bauer, and Jörg Henkel. “CoRQ: Enabling Runtime Reconfiguration Under
WCET Guarantees for Real-Time Systems”. In: IEEE Embedded Systems Letters (ESL) 9.3 (2017), pp. 77–
80. DOI: 10.1109/LES.2017.2714844.
[4] Marvin Damschen, Lars Bauer, and Jörg Henkel. “Timing Analysis of Tasks on Runtime Reconfigurable
Processors”. In: IEEE Trans. on Very Large Scale Integration Syst. (TVLSI) 25.1 (2017), pp. 294–307. DOI:
10.1109/TVLSI.2016.2572304.
[5] Marvin Damschen, Frank Mueller, and Jörg Henkel. “Co-Scheduling on Fused CPU-GPU Architectures
with Shared Last Level Caches”. In: IEEE Trans. on Comput.-Aided Design of Integrated Circuits and Syst.
(TCAD) (2018). ESWEEK Special Issue, to appear. DOI: 10.1109/TCAD.2018.2857042.
[6] Tanja Harbaum, Christoph Schade, Marvin Damschen, Carsten Tradowsky, Lars Bauer, Jörg Henkel, and
Jürgen Becker. “Auto-SI: An adaptive reconfigurable processor with run-time loop detection and
acceleration”. In: IEEE Intl. System-on-Chip Conf. (SOCC), Munich, Germany, September 5-8, 2017. 2017,
pp. 153–158. DOI: 10.1109/SOCC.2017.8226027.
[7] Alexander Pöppl, Marvin Damschen, Florian Schmaus, Andreas Fried, Manuel Mohr, Matthias Blankertz,
Lars Bauer, Jörg Henkel, Wolfgang Schröder-Preikschat, and Michael Bader. “Shallow Water Waves on
a Deep Technology Stack: Accelerating a Finite Volume Tsunami Model Using Reconfigurable
Hardware in Invasive Computing”. In: Workshop on UnConventional High Performance Computing (UCHPC),
Santiago de Compostela, Spain, August 28-29, 2017, Revised Selected Papers. 2017, pp. 676–687. DOI:
10.1007/978-3-319-75178-8_54.
[8] Enrico Rossi, Marvin Damschen, Lars Bauer, Giorgio Buttazzo, and Jörg Henkel. “Preemption of the
Partial Reconfiguration Process to Enable Real-Time Computing with FPGAs”. In: ACM Trans. on Reconfig.
Technol. and Syst. (TRETS) 11.2 (2018). To appear. DOI: 10.1145/3182183.
[9] Stefan Wildermann, Michael Bader, Lars Bauer, Marvin Damschen, Dirk Gabriel, Michael Gerndt, Michael
Glaß, Jörg Henkel, Johny Paul, Alexander Pöppl, Sascha Roloff, Tobias Schwarzer, Gregor Snelting,
Walter Stechele, Jürgen Teich, Andreas Weichslgartner, and Andreas Zwinkau. “Invasive computing for
timing-predictable stream processing on MPSoCs”. In: it - Information Technology 58.6 (2016), pp. 267–
280. DOI: 10.1515/itit-2016-0021.
The main focus of this thesis is on references [2–5]; the contribution of Chapter 7 is currently under submission.

Selected Supervised Student Theses
The following list enumerates selected student theses that were supervised by the author of this thesis and that
contributed to prototyping and implementation of the evaluation platforms used in the following chapters.
[i] Typke, Marc. “A SystemC/TLM-based Simulator for a Reconfigurable Heterogeneous Multi-core System”,
Master Thesis, 2016.
[ii] Middelschulte, Leif. “Extending a WCET Estimation Tool for Runtime Reconfigurable Processors”, Master
Thesis, 2016.
[iii] Eckhart, Artur. “A Command-Driven Reconfiguration Controller for Hard Real-Time Systems”, Bachelor
Thesis, 2016.
[iv] Rapp, Martin. “A Mixed Criticality Architecture with Reconfigurable Accelerators”, Master Thesis, 2016.
[v] Blankertz, Matthias. “Extending the i-Core architecture for pipelined floating-point accelerators”, Diploma
Thesis, 2017.
[vi] Maier, Eduard. “Heterogene Mehrkernprozessorunterstützung für i-Core”, Diploma Thesis, 2017.
[vii] Sader, Thomas. “Leveraging BCET Analysis to Improve WCET Estimates on Runtime Reconfigurable
Processors”, Diploma Thesis, 2017.
[viii] Vutov, Petar. “A Linux Driver for the Reconfigurable Accelerator Queue Architecture”, Bachelor Thesis,
2018.
[ix] Münchbach, Florian. “Dynamic I/O configuration in a partially reconfigurable accelerator framework”,
Master Thesis, 2018.

Contents
1 Introduction
1.1 Thesis Contributions
2 Background
2.1 Real-Time Systems
2.2 Worst-Case Execution Time Analysis
2.2.1 Global Bound Analysis using IPET
2.3 Reconfigurable Computing
2.4 Evaluation Platform – i-Core
2.4.1 Microcoded Custom Instructions
2.5 Associated Research Projects
2.5.1 Invasive Computing
2.5.2 SPP 1500
3 Achieving Performance on Fused CPU-GPU Architectures with Shared Last Level Caches
3.1 Fused CPU-GPU Architectures
3.2 Related Work
3.2.1 Co-Scheduling on Fused Architectures
3.2.2 Exploiting Shared Virtual Memory
3.3 Motivational Example
3.4 Background on Heterogeneous Execution using OpenCL
3.4.1 OpenCL
3.4.2 OpenCL 2.0
3.5 Utilizing Fine-Grained SVM on Fused CPU-GPU Architectures
3.5.1 Memory Allocation
3.5.2 Kernel Launch and Synchronization
3.5.3 Overheads of Fine-Grained SVM
3.6 Our Co-Scheduling Methods
3.6.1 Atomic Counting
3.6.2 Device-Side Enqueuing
3.6.3 Host-Side Profiling
3.7 Experimental Evaluation
3.7.1 Device-Side Enqueuing
3.7.2 Co-Scheduling Results of Rodinia-SVM
3.7.3 Cache Performance Bottleneck
3.8 Conclusion and Implications for Predictable Execution
4 Runtime Reconfiguration under WCET Guarantees
4.1 Challenges for a Guaranteed Reconfiguration Delay
4.2 Enabling Runtime Reconfiguration in Real-Time Systems with CoRQ
4.2.1 Command Execution
4.2.2 Guaranteed Reconfiguration Delay
4.2.3 Analyzing Sequences of Commands
4.3 Experimental Evaluation
4.4 Conclusion
5 WCET Analysis of Tasks on Runtime-Reconfigurable Processors
5.1 Related Work
5.1.1 WCET-Optimizing Instruction Set Architectures
5.1.2 Runtime Reconfiguration in Hard Real-Time Systems
5.2 Motivational Example
5.3 Timing Analysis Background
5.3.1 Path Analysis
5.4 Timing Analysis Extensions for Runtime-Reconfigurable Processors
5.4.1 Microarchitectural Analysis
5.4.2 Path Analysis Constraints for Software Emulation
5.4.3 Stalling vs. Software Emulation
5.5 Runtime-Reconfigurable Processor Infrastructure for Timing Guarantees
5.6 Experimental Evaluation
5.6.1 Implementation and Setup
5.6.2 Results
5.7 Conclusion
6 WCET Optimization using Reconfigurable Custom Instructions
6.1 Related Work and Motivation
6.2 System Model
6.3 Problem Formulation
6.4 Optimal Solution
6.5 Heuristic Solution
6.6 Experimental Evaluation
6.6.1 Evaluation Setup
6.6.2 Impact of Reconfiguration Delay on WCET-Optimizing Selection
6.6.3 Impact of Infeasible Path Information on WCET-Optimizing Selection
6.6.4 Runtimes, Pruning and Quality of Heuristic Selection
6.7 Conclusion
7 WCET Guarantees for Opportunistic Runtime Reconfiguration
7.1 Related Work
7.2 System Model
7.3 Our Approach
7.3.1 Offline Preparation
7.3.2 Online Optimization
7.4 Experimental Evaluation
7.4.1 Results
7.5 Conclusion
8 Thesis Conclusion
8.1 Future Work
8.1.1 WCET Guarantees and Mixed-Criticality for Loosely-Coupled Reconfigurable Architectures
8.1.2 Probabilistic WCET Guarantees
A Appendix
A.1 Demonstration Prototypes
A.1.1 Concurrent Reconfigurable Fabric Utilization
A.1.2 Accelerating a Finite Volume Tsunami Model using Reconfigurable Hardware in Invasive Computing
List of Figures
List of Tables
Bibliography

1 Introduction
Real-time embedded systems are ubiquitous in our everyday life, e.g., in safety-critical domains such as automotive,
avionics or robotics. The correctness of a real-time system does not only depend on the correctness of
its calculations, but also on the non-functional requirement of adhering to deadlines by which output signals,
which may be safety-critical, have to be produced. Failing to meet a deadline may lead to severe malfunctions;
therefore, adherence to deadlines needs to be guaranteed in a process called timing validation [107]. As part of
the timing validation, a schedulability analysis is performed to guarantee that a given task set can be scheduled
at runtime under any circumstances. To perform a schedulability analysis, the worst-case execution time (WCET)
of every task from the task set needs to be known [107].
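To illustrate why a schedulability analysis needs the WCET of every task, the following minimal sketch applies the classic Liu and Layland utilization bound for rate-monotonic scheduling. It is a textbook sufficient test for periodic tasks whose deadlines equal their periods, not the analysis used in this thesis. The `Task` type, the function names and all numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    wcet: float    # guaranteed upper bound on the execution time
    period: float  # activation period; the deadline is assumed equal to the period


def utilization(tasks):
    """Total processor utilization: sum of WCET / period over all tasks."""
    return sum(t.wcet / t.period for t in tasks)


def rm_utilization_bound(n):
    """Liu and Layland bound n * (2^(1/n) - 1) for n periodic tasks."""
    return n * (2 ** (1 / n) - 1)


def passes_rm_test(tasks):
    """Sufficient (not necessary) schedulability test under rate-monotonic priorities."""
    if not tasks:
        return True
    return utilization(tasks) <= rm_utilization_bound(len(tasks))


# Hypothetical task set; time unit is arbitrary (e.g., milliseconds).
tasks = [
    Task(wcet=1.0, period=4.0),
    Task(wcet=2.0, period=8.0),
    Task(wcet=1.0, period=10.0),
]
print("utilization:", utilization(tasks))
print("bound:", rm_utilization_bound(len(tasks)))
print("passes test:", passes_rm_test(tasks))
```

The test is only as trustworthy as its inputs: if a `wcet` value underestimates the real worst case, a task set may pass the test and still miss deadlines at runtime.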
Determining an accurate upper WCET bound of a task is a complex problem, because performance-enhancing
features of modern processors like pipelining, caches and branch prediction introduce a microarchitectural state. This
microarchitectural state results in a dependency of the latency of instructions on the execution history. Assuming
the local worst-case behavior of a microarchitectural component, e.g., a cache miss, in situations where the
microarchitectural state cannot be determined statically does not necessarily result in a safe WCET bound for the whole task,
because the state of one component may influence other microarchitectural components, e.g., whether a cache access is
a hit or miss can influence whether a branch condition is calculated in time or potentially mispredicted. Such an
effect is called a timing anomaly [86], and it necessitates exhaustive exploration of every possible microarchitectural
state when determining the worst-case bound for executing a sequence of instructions.
Modern high-performance processors such as those supplied by Intel¹ feature microarchitectures with out-of-order
scheduling pipelines, several hardware threads and multiple (shared) cache layers, as detailed in Chapter 3. These
features, which enhance average-case performance, cause an explosion of possible microarchitectural states that
renders timing analysis practically infeasible [6]. However, the demand for processing power in real-time systems is
strongly increasing, e.g., automated driving requires vast amounts of sensor data to be processed under timing
constraints. Therefore, high-performance microarchitectures that are amenable to WCET analysis are in demand [6, 37, 99].
To escape the scarcity of timing-analyzable performance features, the main focus of this thesis is to introduce
runtime reconﬁguration of hardware accelerators onto a ﬁeld-programmable gate array (FPGA) as a means to
achieve performance that is amenable to the analysis of WCET guarantees. Hardware accelerators speed up the
tasks' most compute-intensive parts, so-called computational kernels (also known as hotspots), which comprise
one or more nested loops. If these accelerators were implemented as application-specific integrated circuits, the
system would lack flexibility with respect to revised standards or new algorithms. Instead, an architecture
that is reconfigurable by employing an FPGA maintains high flexibility and even allows for reconfiguring the
accelerators at runtime, thereby increasing the performance as well as the computing efﬁciency (compared to a
static set of accelerators) at the cost of a more complex timing analysis.
While runtime reconfiguration was previously investigated with respect to real-time scheduling [16, 36, 54, 93],
novel models and analyses are required to make the benefits of runtime-reconfigurable architectures accessible to
tasks that require WCET guarantees, even in uniprocessor systems. In Chapter 5 it will be shown that, in addition to a
considerable speedup, the overestimation of a task's static WCET guarantee can be reduced by providing WCET
guarantees for kernels in which compute-intensive calculations are performed by reconfigurable hardware
accelerators. Accelerators typically provide functionality that corresponds to several hundred instructions when
executed on the CPU pipeline, possibly including conditional branches and other control flow. Analyzing instructions
for worst-case latency introduces pessimism due to, e.g., pipeline hazards or instruction cache misses that need to
be accounted for when the CPU behavior cannot be determined exactly by static analysis. In contrast, the latency of
the hardware accelerators that are executed on the reconfigurable fabric is under direct control of the application
designer and often precisely known (e.g., this is the case when leveraging high-level synthesis tools [60]).
¹ Intel is currently the biggest supplier of high-performance microprocessors worldwide with a market share of over 70% in Q1 of 2017
according to the International Data Corporation (IDC) (see https://www.idc.com/getdoc.jsp?containerId=lcUS42519017)
1.1 Thesis Contributions
The main contribution of this thesis is to establish worst-case execution time guarantees for runtime-reconﬁgurable
systems as a means to achieve predictable performance. Speciﬁcally, the novel contributions of this thesis are:
• Novel co-scheduling approaches are presented in a case study on fused CPU-GPU architectures, a main trend
in current high-performance microarchitectures that combines a CPU and a GPU on a single chip. A cache
coherency bottleneck is uncovered that has implications for predictable performance on such architectures.
• A runtime reconﬁguration controller called “Command-based Reconﬁguration Queue” (CoRQ) is presented that
provides guaranteed latencies for its operations and supports timing analysis of runtime reconﬁguration for
WCET guarantees.
• WCET analysis is introduced for tasks on a runtime-reconfigurable processor. Different measures to deal with
reconfiguration delays are compared, and the timing anomaly of runtime reconfiguration is identified and
safely bounded.
• The WCET-optimizing instruction set selection problem is modeled with support for global program ﬂow in-
formation and reconﬁguration delay by extending state-of-the-art models used in timing analyzers for WCET
guarantees. An optimal algorithm and a fast heuristic algorithm (that achieves the optimal solution in most
cases) are presented.
• An approach is presented that for the ﬁrst time combines optimized static WCET guarantees and runtime opti-
mization of the average-case execution (maintaining WCET guarantees) using runtime reconﬁguration of hard-
ware accelerators. It comprises an analysis of runtime slack bounds that enable safe reconﬁguration for average-
case performance under WCET guarantees.
In the following chapter, the background on real-time systems and runtime reconﬁguration is introduced that is
beneﬁcial to understanding the contributions of this thesis and puts them into context of real-time system research.
Chapter 3 details how performance is achieved on current high-performance processors that follow one of the
main current architectural trends of integrating a CPU and a GPU on a single die. It provides evidence that
high-performance architectures, which target average-case performance, are practically impossible to analyze for execution time
guarantees and motivates the need for timing-analyzable performance features. Afterwards, Chapter 4 presents
how reconﬁguration of accelerators can be performed in real-time systems within statically-guaranteed delays.
Chapters 5 and 6 focus on the WCET analysis and optimization of tasks that utilize runtime reconﬁguration under
WCET guarantees, respectively. Chapter 7 presents an online optimization approach that monitors the runtime
slack of a task (the amount of time by which parts of its code executed faster than in the worst case) to reconfigure accelerators
that beneﬁt average-case execution instead of worst-case execution, while maintaining WCET guarantees. Finally,
Chapter 8 concludes the contributions of this thesis.
2 Background
The background on real-time systems, worst-case execution time analysis, runtime reconﬁguration and the utilized
evaluation platform is introduced in the following sections.
2.1 Real-Time Systems
In contrast to general-purpose computing systems, real-time systems must meet non-functional requirements. More
speciﬁcally, real-time systems are computing systems that must react within time constraints to events in their en-
vironment [20]. Consequently, their correctness does not only depend on the logical results of their computations,
but also on the time at which results are produced. Time constraints are given as deadlines, i.e., a maximum time
per task that is to be executed on the real-time system, within which the task needs to complete its execution.
In real-time systems, a task that fails to meet its deadline is not only late, but wrong, because failing to meet
a deadline can lead to severe malfunctions: An increasing number of safety-critical application domains that play
a crucial role in our society rely on real-time systems, e.g., chemical and nuclear plants, transportation systems
(railway, avionics, automotive), telecommunications, medical systems, industrial automation, robotics, and more
[20]. Generally, real-time systems are embedded as part of a larger system that is to be controlled, which ranges
from small portable devices (e.g., cellular phones, cardiac pacemakers) to larger systems (e.g., aircraft, industrial
robots)¹.
Software bugs are a common cause for accidents in safety-critical applications that can have catastrophic conse-
quences. A well-known example that demonstrates the importance of rigorous veriﬁcation of real-time systems is
a Patriot missile defense system that was operated in Saudi Arabia during the Gulf War. The defense system
contained a software bug in its interrupt handling routine², which resulted in an accumulation of delay in the system.
The delay influenced the system's classification process of flying objects. On February 25, 1991, the defense system
had been in operation for about 100 hours and had accumulated a total delay of 343 ms, which caused it to incorrectly
classify an incoming missile as a false alarm (its trajectory was mispredicted by 687 m). As a catastrophic result,
the missile struck an American Army barracks and led to the loss of 28 lives and numerous injuries. Extreme
events like these have shown that software testing is not sufﬁcient to verify the correctness of a real-time system.
Instead, deadlines need to be guaranteed in a process called timing validation [107].
As part of the timing validation, a schedulability analysis is performed to guarantee that a given task set can be
scheduled at runtime under any circumstances. Depending on the consequences that may result from a missed
deadline, a real-time task is assigned to one of three different categories [20]:
• Hard: A real-time task is said to be hard if producing the results after its deadline may cause catastrophic
consequences on the system under control.
• Firm: A real-time task is said to be ﬁrm if producing the results after its deadline is useless for the system, but
does not cause any damage.
• Soft: A real-time task is said to be soft if producing the results after its deadline has still some utility for the
system, although causing a performance degradation.
¹ The terms ‘real-time system’ and ‘real-time embedded system’ are therefore used interchangeably in the remainder of this thesis.
² “Patriot Missile Defense: Software Problem Led to System Failure at Dhahran, Saudi Arabia” GAO/IMTEC-92-26: Published: Feb 4, 1992.
Publicly Released: Feb 27, 1992.
This thesis focuses on hard real-time tasks, i.e., ‘real-time task’ always refers to a hard real-time task in the
remainder of this text. In order to perform a schedulability analysis of a hard real-time task set, each task of the
task set needs to be analyzed for characteristics in terms of execution time, required resources, and precedence
relations with other tasks. For guaranteeing that a given real-time task set can be scheduled at runtime under any
circumstances, the worst-case execution time (WCET) of each task of the set needs to be determined (see [20] for
details on hard real-time scheduling). The following section details how the WCET of a task is obtained.
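
To illustrate why per-task WCET bounds are the input to a schedulability analysis, the following minimal sketch applies a utilization-based test. It rests on assumptions that are not prescribed by this thesis: independent periodic tasks whose deadlines equal their periods, a single preemptive processor and earliest-deadline-first scheduling, for which a total utilization of at most 1 is necessary and sufficient. The task parameters and the function name are hypothetical.

```python
from fractions import Fraction


def edf_schedulable(tasks):
    """Utilization test for a list of (wcet_bound, period) pairs.

    Assumption (not from the thesis): independent periodic tasks,
    deadline == period, preemptive EDF on one processor. Under these
    assumptions the set is schedulable iff the utilization is <= 1.
    """
    utilization = sum(Fraction(wcet, period) for wcet, period in tasks)
    return utilization <= 1, utilization


# Hypothetical task set: (WCET bound, period), both in milliseconds.
task_set = [(2, 10), (5, 20), (10, 40)]
ok, u = edf_schedulable(task_set)
print(ok, float(u))
```

The test is only as trustworthy as the WCET bounds fed into it: an underestimated bound can make an unschedulable task set appear schedulable.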
2.2 Worst-Case Execution Time Analysis
Figure 2.1: Histogram of all execution times of a task (x-axis: execution time, y-axis: occurrences; marked: lower
bound, BCET, WCET, upper bound and the overestimation between WCET and upper bound). The WCET of a task is
upper-bounded using static timing analysis.
In general, obtaining upper bounds on the execution times of tasks
is not possible, because it would require the halting problem to
be decidable [83]. Therefore, real-time tasks are programmed re-
strictively: they are required to always terminate and recursion
depths as well as iteration counts of loops need to be statically
known. Virtually any task executed on a modern hardware plat-
form exhibits execution time variation that is inﬂuenced by the
task’s input. If the worst-case input (the input leading to the worst-
case execution) of a task were known, a worst-case execution time
(WCET) guarantee could be easily obtained [106]. Generally,
however, this is not the case and the worst-case input is hard to
derive. Therefore, an upper bound on the WCET (analogously, a
lower bound on the best-case execution time (BCET)) is estimated during timing analysis instead of the actual
WCET (and BCET) as shown in Fig. 2.1. The estimated bounds need to be safe, i.e., the WCET (BCET) bound
must never underestimate (overestimate) the actual WCET (BCET), and precise, i.e., the overestimation (underes-
timation) of the WCET (BCET) should be as small as possible to enable a successful schedulability analysis later
on.
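
The two requirements on a bound can be sketched with a few numbers. All values below are hypothetical; in particular, the complete set of execution times is normally unknown, which is exactly why a maximum obtained from measurements is only a lower bound on the WCET and may be unsafe as a guarantee.

```python
# Hypothetical execution times of one task in CPU cycles.
all_possible_times = [790, 812, 955, 1003, 1180]  # normally unknown
measured_times = [812, 790, 1003, 955]            # what testing happened to observe
static_bound = 1300                               # assumed result of a static analysis

actual_wcet = max(all_possible_times)
measured_max = max(measured_times)


def is_safe(bound, wcet):
    """A WCET bound is safe if it never underestimates the actual WCET."""
    return bound >= wcet


def overestimation(bound, wcet):
    """Precision of a safe bound: the smaller, the better."""
    return bound - wcet


print(is_safe(measured_max, actual_wcet))         # measured maximum as a "bound"
print(is_safe(static_bound, actual_wcet),
      overestimation(static_bound, actual_wcet))
```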
In this thesis, static timing analysis is employed, which produces guaranteed WCET bounds, instead of
measurement-based approaches, which produce bounds on observed execution times only (measurements alone can never
guarantee that all execution times have been observed). The bounds obtained by static timing analysis allow safe
schedulability analysis of hard real-time systems. A WCET bound is only valid for a specific hardware platform and is
the result of the worst-case path through the task under analysis, i.e., the sequence of instructions that leads to
the estimated WCET of the task. Consequently, timing analysis is performed on the finished task binary (instead of,
e.g., the source code). It generally performs three major sub-analyses on the task binary, each consisting of
several passes [107]:
(i) Control-ﬂow reconstruction and static analyses for control and data ﬂow. Reconstruct the control-ﬂow graph
(CFG) from the task binary, identify loops and bound iterations thereof (if possible), determine infeasible
paths through the CFG (paths that exist in the CFG, but can never be executed in practice). Infeasible paths
are eventually excluded during global bound analysis (in (iii)) to obtain a more precise WCET bound.
(ii) Microarchitectural analysis. Computes the execution time bounds of basic blocks. Assuming the latency
of each instruction were constant, microarchitectural analysis would be simple. However, average-case per-
formance enhancing features like deep pipelining or out-of-order scheduling of instructions, caches, branch
prediction, etc. introduce a microarchitectural state [90]. This microarchitectural state results in a depen-
dency of the latency of instructions on the execution history that can span numerous basic blocks. Therefore,
it is not safe to analyze each basic block separately, but the microarchitectural state that can result from the
execution of preceding basic blocks needs to be considered to obtain the latency bounds of a basic block. This
is done using abstract interpretation [25], a theory of program analysis that determines runtime properties
of the task under analysis without actually executing it. Abstract interpretation allows separating the
analysis of the microarchitectural state from the analysis of the worst-case path (which determines the WCET, as
explained in (iii)) [98].
The state of one microarchitectural feature may inﬂuence other features, and the worst case of one feature
does not necessarily lead to the worst-case execution of the whole task. Therefore, it is not safe to analyze
microarchitectural features separately: E.g., whether a cache access is a hit or miss can inﬂuence whether
a branch condition is calculated in time or potentially mispredicted. There are architectures where a cache
miss can lead to a correctly predicted branch condition that results in a shorter execution time than a cache
hit (that would have led to a mispredicted branch) [69]. Such a situation, where a local worst case (e.g., a
cache miss) leads to a shorter total execution time, is called a timing anomaly [86]. Consequently, it is not
safe to assume a local worst case when static analysis of a task cannot precisely determine the state of each
microarchitectural feature (e.g., cache contents). Instead, all possibilities need to be considered in further
analysis. This leads to an explosion of microarchitectural states and has a strong inﬂuence on the applicabil-
ity of methods for timing analysis to speciﬁc microarchitectures. Effectively, the microarchitectures that can
be analyzed are several generations behind microarchitectures available today [90].
(iii) Global bound analysis. Combines information obtained in the previous analyses (annotated CFG from (i)
and WCET bounds of basic blocks from (ii)) to compute the WCET bound of the whole task. State-of-
the-art timing analyzers compute the WCET by determining the worst-case path through the CFG using the
ILP-based Implicit Path Enumeration Technique (IPET) [64, 106].
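
The timing anomaly described in (ii) can be made concrete with a toy model. The latencies below are hypothetical and the interaction between cache and branch prediction is an assumed simplification; the sketch only shows why picking the local worst case of one feature may underestimate the overall latency, so that all possible states have to be explored.

```python
# Hypothetical latencies (cycles) of a load that feeds a branch condition.
CACHE_LATENCY = {"hit": 1, "miss": 8}

# Assumed interaction: after a hit the branch is resolved on a path that
# ends up mispredicted; after a miss the prediction turns out correct.
BRANCH_PENALTY = {"hit": 12, "miss": 0}


def total_latency(cache_outcome):
    """Latency of load plus branch for one possible cache outcome."""
    return CACHE_LATENCY[cache_outcome] + BRANCH_PENALTY[cache_outcome]


# Naive analysis: assume the local worst case of the cache (a miss).
local_worst_case = max(CACHE_LATENCY, key=CACHE_LATENCY.get)
naive_bound = total_latency(local_worst_case)

# Safe analysis: explore every state that cannot be excluded statically.
safe_bound = max(total_latency(outcome) for outcome in CACHE_LATENCY)

print(local_worst_case, naive_bound, safe_bound)
```

In this toy model the naive bound is smaller than the latency of the cache-hit case, i.e., it is not a safe bound.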
The contributions of this thesis do not rely on specific approaches to perform analyses (i) and (ii) (implications
to (ii) are addressed in Section 5.4). IPET, however, is a central technique in WCET analysis and also utilized by
approaches presented in the following chapters. It is introduced in the following.
2.2.1 Global Bound Analysis using IPET
The approaches presented in Chapters 5 and 6 are based on the Implicit Path Enumeration Technique (IPET) [64] for
WCET bound calculation, as it is the program path analysis technique that state-of-the-art timing analyzers rely on [4,
106]. IPET models program flow as arithmetic constraints in an ILP³-formulated problem. The objective function
determines the CPU cycles executed on a path in the task's CFG. To find the WCET path, it needs to be maximized.
Variables in the objective function represent the execution count of a single basic block (xi) in the CFG and are
weighted with the execution cycles of that basic block (ci), which are determined in the microarchitectural analysis
(see previous section). For a program with N basic blocks, the objective function is given as:
\[
\max_{x \in \mathbb{N}_0^N} \; \sum_{i=1}^{N} c_i x_i \tag{2.1}
\]
Similar to ﬂow networks, the variables are constrained by modeling the control ﬂow and capturing relative exe-
cution counts of basic blocks as ILP constraints. The more infeasible paths can be excluded by constraints, the
more precise the WCET bound will be. IPET was ﬁrst introduced in [64], which contains a detailed overview of
how constraints are generated. A brief overview is given in the following. Besides the variables xi representing the
execution counts of basic blocks, variables di for every edge in the CFG are used. Figure 2.2 (b) shows the CFG
of the simple source code excerpt shown in Fig. 2.2 (a). The loop header (represented by x2) can be entered from
outside using the edge represented by d1 or from a previous iteration using d8. The same basic block can be exited
when the loop condition is false and the kernel is exited using d2 or it can proceed to another iteration when the
loop condition is true using d3. Therefore, x2 = d1+d8 = d2+d3 (see Fig. 2.2 (c)).
³ Integer Linear Programming [43]

(a) Task source code:
    int i = 0;
    while (i < 100) {
        if (i < 5)
            ...; // true
        else
            ...; // false
        i++;
    }

(b) Task CFG:
    Basic blocks: x1 (i = 0), x2 (while (i < 100)), x3 (if (i < 5)), x4 (// true), x5 (// false), x6 (i++)
    Edges: d1 (x1 to x2), d2 (x2 to exit), d3 (x2 to x3), d4 (x3 to x4), d5 (x3 to x5),
           d6 (x4 to x6), d7 (x5 to x6), d8 (x6 to x2)

(c) IPET constraints:
    Program Structure:
        x1 = 1 = d1 (task start)
        x2 = d1+d8 = d2+d3
        x3 = d3 = d4+d5
        x4 = d4 = d6
        x5 = d5 = d7
        x6 = d6+d7 = d8
    Global Information:
        x3 ≤ 100 · d1
        x4 ≤ 5 · d1

Figure 2.2: Example of constraint generation using the Implicit Path Enumeration Technique (IPET)
A key feature of IPET is that global path information about input-dependent control ﬂow can be annotated using
additional constraints, e.g., an upper bound of 100 loop iterations can be given by the constraint x3 ≤ 100 · d1.
When static control and data ﬂow analysis (see previous section) recognizes that the true case can be executed
a maximum of 5 times, it is annotated using the constraint x4 ≤ 5 · d1. State-of-the-art timing analyzers utilize
annotation languages [59] that enable users to conveniently annotate expert knowledge about task execution. Such
annotations are automatically translated into IPET constraints and often lead to a more precise WCET bound.
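
As a worked illustration, the IPET problem of Fig. 2.2 is small enough to be solved by enumeration. The sketch below is not how a timing analyzer works (those hand the constraints to an ILP solver), and the per-block cycle bounds c_i are hypothetical. It uses the fact that, with d1 = 1, the structural constraints fix every execution count once the number of loop-body executions n = x3 and the number of true-branch executions t = x4 are chosen: x2 = d1 + d8 = 1 + n, x5 = n - t and x6 = n.

```python
# Hypothetical cycle bounds c_i for the basic blocks x1..x6 of Fig. 2.2.
COST = {"x1": 2, "x2": 3, "x3": 2, "x4": 10, "x5": 4, "x6": 1}


def ipet_bound(cost, max_iterations=100, max_true=5):
    """Maximize sum(c_i * x_i) over all counts allowed by the constraints.

    max_iterations encodes x3 <= 100 * d1 and max_true encodes
    x4 <= 5 * d1, both with d1 = 1 (the task starts exactly once).
    """
    best = None
    for n in range(max_iterations + 1):        # n = x3 = d3
        for t in range(min(n, max_true) + 1):  # t = x4 = d4, and x5 = n - t >= 0
            x = {"x1": 1, "x2": n + 1, "x3": n, "x4": t, "x5": n - t, "x6": n}
            cycles = sum(cost[block] * count for block, count in x.items())
            if best is None or cycles > best[0]:
                best = (cycles, x)
    return best


wcet_bound, counts = ipet_bound(COST)
print(wcet_bound, counts)
```

With these assumed costs the maximum is reached for 100 iterations of which 5 take the more expensive true branch. Dropping the constraint x4 ≤ 5 · d1 (max_true=100) would yield a larger, less precise bound, which illustrates how flow annotations tighten the result.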
Several extensions, e.g., for complex control flows and hardware timing effects that depend on a long history of
executed instructions, have been published [7, 38, 106]. One of these extensions, multi-context analysis, is addressed
in Chapter 5. In Chapter 6, IPET is extended from a WCET analysis problem to a WCET optimization problem for
runtime-reconﬁgurable processors. The following section introduces the background on reconﬁgurable computing
as a basis for the following chapters.
2.3 Reconﬁgurable Computing
Reconﬁgurable computing, i.e., performing computations using a reconﬁgurable fabric such as ﬁeld-programmable
gate arrays (FPGAs), was introduced in the early 1990s [97]. Today, it is an established computing paradigm in
a growing number of application domains in research and industry, not only in embedded computing (e.g., signal
processing [96], computer vision [53] or encryption [52]), but also in high-performance and scientiﬁc comput-
ing (e.g., ﬁnancial pricing [34] or DNA-sequencing [13]), data centers (e.g., searching [84] or database queries
[35]), networks (routing [68], intrusion detection [32]) and others. In these domains, applications generally com-
prise several compute-intensive loops, so-called computational kernels, that beneﬁt greatly from implementation
as application-speciﬁc hardware accelerators in terms of performance and energy efﬁciency. FPGAs enable the uti-
lization of application-speciﬁc hardware accelerators without fabricating custom chips and they provide ﬂexibility
as well as the ability for upgrades, just like software.
Several alternatives exist when designing reconﬁgurable systems that combine a general-purpose CPU with a
reconﬁgurable fabric [97]. Generally, a tighter integration reduces communication latency between CPU and re-
conﬁgurable fabric, but requires more effort in the architectural design of the system. Numerous products are
available that add an FPGA as a separate chip to an existing system by attaching it to the system’s peripheral
bus. Especially for embedded systems however, it is crucial to minimize (i) the communication latency between
CPU and reconfigurable fabric to enable acceleration of short-running kernels (e.g., in control loops) and (ii) the
system's power consumption as well as (iii) area footprint. Therefore, the advancing trend of processor integration
has resulted in reconfigurable SoCs that combine FPGAs and CPUs on a single chip (e.g., Xilinx Zynq or Intel
(formerly Altera) SoC FPGA), which have led to the wide adoption of reconfigurable systems in embedded systems,
e.g., in implementations of advanced driver assistance systems in the automotive domain. In reconfigurable
SoCs, CPU and FPGA are still separate processing devices that communicate over the (internal) system bus.

Figure 2.3: Overview of the evaluation platform – i-Core (components shown: base system with processor pipeline
(stages FE, DE, RA, EX, ME, XC, WB), I$, D$, system bus, memory arbiter, DRAM controller and further periphery;
i-Core extensions with CI Execution Controller, SPM, load/store units, reconfigurable fabric consisting of
reconfigurable containers and interconnects, and the reconfiguration controller (CoRQ) with command interface,
internal configuration memory and ICAP)

The evaluation platform that is employed to evaluate the contributions of Chapters 5 to 7 demonstrates that an
even tighter integration of CPU and FPGA than in current reconﬁgurable SoCs is beneﬁcial to target hard real-time
execution. It is presented in the following section.
2.4 Evaluation Platform – i-Core
i-Core [10, 31] is a reconfigurable processor, i.e., it is based on a general-purpose processor (GPP) pipeline and
enables the execution of runtime-reconﬁgurable Custom Instructions (CIs). Figure 2.3 gives an overview of the
i-Core architecture. CIs extend the processor’s core instruction set architecture (cISA) by application-speciﬁc
instructions that are realized using (i) microcode and (ii) reconﬁgurable accelerators. They are detailed in the
following.
2.4.1 Microcoded Custom Instructions
When the processor pipeline encounters a CI in its execute stage (EX), the pipeline stalls and initiates execution
of the respective microprogram (i.e., a program written in microcode) that implements the functionality of the
encountered CI on the CI Execution Controller. The communication between pipeline and CI Execution Controller
is performed in a protocol that is similar to other multi-cycle instructions like integer division. The microprogram
controls all resources of the reconﬁgurable fabric:
• Load/Store Units (LSUs), which enable access to the main memory (through the processor's L1 data cache (D$)) and
high-bandwidth scratchpad memory (SPM) for CIs
• Reconfigurable containers, (embedded) FPGAs that provide the reconfigurable area for runtime reconfiguration
of accelerators (one accelerator per container, each of similar complexity as, e.g., floating-point
multiply-accumulate or a dozen integer operations)
• Interconnects, which connect LSUs, reconfigurable containers and the processor's register file to a common
(four-word-wide segmented) bus

Figure 2.4: CIs define computations as DFGs that can be scheduled with different amounts of accelerators, resulting
in different latencies. Panels: (a) DFG of a CI that uses multiple accelerators, (b) schedule with one ‘Transform’
accelerator configured, (c) schedule with two ‘Transform’ accelerators configured. Node legend: L = Load,
T = Transform, A = Aggregate, S = Store.

When CIs access register operands and the non-cacheable SPM only, they can be modeled like just another multi-
cycle instruction in the microarchitectural analysis during WCET estimation (see Section 2.2) and do not inﬂuence
data cache analysis. Note that a single microprogram can utilize one or more accelerators. In other words, the
functionality deﬁned by a CI is realized using one or more accelerators. Application-speciﬁc hardware accelerators
provide an important tradeoff: the more area is utilized, the higher the resulting performance. At the same time,
multiple accelerators compete for the constrained reconﬁgurable area. This tradeoff is the result of instruction-level
parallelism that can be exploited when more hardware resources are added to the application-speciﬁc accelerator.
The main beneﬁt of allowing CIs to utilize more than one accelerator is that this tradeoff can be chosen at runtime
by providing several microprograms that implement the same CI, but utilize different amounts of accelerators that
each implement a part of the CI's functionality. Consequently, CIs define computations as data-flow graphs (DFGs)
where nodes are accelerators and load/stores. Figure 2.4 (a) shows a simpliﬁed example of a CI that loads input
data, performs transformations on the data, aggregates results and ﬁnally stores them. Depending on how many
‘Transform’ accelerators are conﬁgured in the reconﬁgurable containers at runtime, the DFG can be scheduled in
5 steps (see Fig. 2.4 (b)) or 4 steps (see Fig. 2.4 (c)). Each of these schedules corresponds to a microprogram for
the CI Execution Controller, which implements the CI⁴.
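
How the number of configured accelerators determines the latency of a CI can be sketched with a simple list scheduler. The DFG below is a simplified reading of Fig. 2.4 (one load feeding two transforms, followed by aggregate and store); the assumption that every node takes exactly one step, the resource capacities and all names are hypothetical and do not describe the actual microprogram generation for i-Core.

```python
# Simplified DFG of the CI in Fig. 2.4: node -> (kind, predecessors).
# Assumption: every node occupies its resource for exactly one step.
DFG = {
    "L": ("load", []),
    "T1": ("transform", ["L"]),
    "T2": ("transform", ["L"]),
    "A": ("aggregate", ["T1", "T2"]),
    "S": ("store", ["A"]),
}


def schedule_steps(dfg, capacity):
    """Greedy list scheduling; returns the step assigned to each node."""
    step_of = {}
    step = 0
    while len(step_of) < len(dfg):
        step += 1
        used = {}
        chosen = []
        for name, (kind, preds) in dfg.items():
            if name in step_of:
                continue
            # Predecessors must have finished in an earlier step.
            ready = all(p in step_of for p in preds)
            if ready and used.get(kind, 0) < capacity.get(kind, 0):
                used[kind] = used.get(kind, 0) + 1
                chosen.append(name)
        if not chosen:
            raise ValueError("no schedulable node: missing resource or cyclic DFG")
        for name in chosen:
            step_of[name] = step
    return step_of


def latency(dfg, transform_units):
    """Number of steps when `transform_units` Transform accelerators exist."""
    capacity = {"load": 1, "store": 1, "aggregate": 1,
                "transform": transform_units}
    return max(schedule_steps(dfg, capacity).values())


print(latency(DFG, 1), latency(DFG, 2))
```

Under these assumptions one ‘Transform’ accelerator yields a schedule of 5 steps and two yield 4 steps, matching the two schedules discussed above; each resulting step assignment would correspond to one microprogram.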
So far, it was discussed how CIs are executed, assuming all accelerators required by a certain implementation are
currently conﬁgured on the reconﬁgurable fabric. However, CIs can be unavailable, i.e., there exists no schedule
of the CI’s DFG for the accelerators that are currently conﬁgured (reconﬁguring all accelerators required by a
CI can take several milliseconds). Two alternatives exist to handle the case that the i-Core attempts to execute
an unavailable CI at runtime: stalling and software emulation. With stalling, CIs are executed only on the
reconfigurable fabric, and attempting to execute an unavailable CI is an error. Therefore, CIs need to be configured
before the kernel is entered, and the execution is stalled until the required CIs are available. Software emulation
triggers functionally-equivalent software execution of an unavailable CI on the i-Core's pipeline using the base processor's cISA. It
enables execution of the kernel while required CIs are still being reconﬁgured. Thus, progress can already be made
without any CIs and as soon as reconﬁguration of a CI ﬁnishes, the CI is utilized to speed up the following iteration
of the kernel. While software emulation is always beneﬁcial at runtime, it is more complex to analyze for WCET
guarantees than stalling, which will be detailed in a more general context in Chapters 4 and 5.
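
The difference between the two alternatives can be sketched with a deliberately simplified timing model. It is not the analysis developed in Chapters 4 and 5: it assumes a kernel with a fixed number of iterations, a single CI whose reconfiguration starts when the kernel is entered, constant per-iteration times for software emulation and for the CI, and that an iteration which is in flight when reconfiguration finishes still completes in software. All names and numbers are hypothetical.

```python
def stalling_time(iterations, reconf_delay, hw_cycles):
    """Wait for the reconfiguration, then run every iteration with the CI."""
    return reconf_delay + iterations * hw_cycles


def emulation_time(iterations, reconf_delay, sw_cycles, hw_cycles):
    """Emulate in software until the CI is available, then use the CI.

    Iterations that start before the reconfiguration has finished run in
    software; the CI is used from the following iteration on.
    """
    started_in_software = (reconf_delay + sw_cycles - 1) // sw_cycles  # ceil
    sw_iterations = min(iterations, started_in_software)
    return sw_iterations * sw_cycles + (iterations - sw_iterations) * hw_cycles


# Hypothetical kernel: 100 iterations, 1000 cycles reconfiguration delay,
# 50 cycles per emulated iteration, 10 cycles per iteration with the CI.
print(stalling_time(100, 1000, 10))
print(emulation_time(100, 1000, 50, 10))
```

For these parameters software emulation finishes earlier because useful work overlaps with the reconfiguration. The point at which the switch from software to hardware happens depends on the execution history, which hints at why this alternative is harder to bound statically.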
⁴ In the remainder of this thesis, the terms ‘CI implementation’ and ‘CI microprogram’ are used interchangeably.
i-Core exists as a constantly evolving hardware prototype. It is currently based on the Gaisler LEON3 SoC⁵ and
synthesizes to Xilinx Virtex-7 FPGAs. The LEON3 processor has a SPARC V8 in-order microarchitecture, sep-
arate data as well as instruction caches and supports several real-time operating systems. Appendix A presents
demonstration setups that were extended and realized in the context of this thesis to show the practicality of the
approach. A more detailed explanation of how the architecture is realized can be found in [10]. Additionally, a
SystemC-based cycle-accurate simulator is available for early evaluation of runtime system algorithms. The spe-
ciﬁc details and parameters that were used to obtain the presented evaluation results are discussed in the respective
chapters.
2.5 Associated Research Projects
The results of this thesis were achieved in the context of the research projects that are presented in the following.
2.5.1 Invasive Computing
The work presented in this thesis was partly supported by the German Research Foundation (DFG) as part of the
Transregional Collaborative Research Center “Invasive Computing” (SFB/TR 89), which just began its third and
last funding phase (4 years per phase). The project is a collaboration between researchers from the Friedrich-
Alexander University Erlangen-Nürnberg, Karlsruhe Institute of Technology and Technical University of Munich.
In its current phase, it consists of 16 subprojects.
The governing thought of Invasive Computing is to grant applications, running on a massively-parallel computer
that consists of 1000 and more compute cores, temporary exclusive access to resources like processor, communica-
tion channels and memory [50, 95]. This so-called resource-aware programming paradigm is of utmost importance
to obtain high utilization as well as computational and energy efﬁciency numbers (including predictable execution).
In Invasive Computing, a set of granted resources is called a claim. Applications allocate claims by invading re-
sources, and then infect them with a program to run. Finally, the application retreats from its claim, freeing the
resources. The state diagram of an invasive application is shown in Fig. 2.5.
Figure 2.5: States of an invasive application (following the description of [74]); states shown: start, invade,
infect, retreat, exit
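
The life cycle of a claim can be sketched as a small state machine. The class and method names below are hypothetical and are not the programming interface of Invasive Computing; only the linear sequence described in the text (invade, infect, retreat) is modeled, whereas the original state diagram may contain further transitions.

```python
from enum import Enum


class State(Enum):
    START = "start"
    INVADE = "invade"
    INFECT = "infect"
    RETREAT = "retreat"
    EXIT = "exit"


# Assumed linear life cycle taken from the prose description only.
ALLOWED = {
    State.START: {State.INVADE},
    State.INVADE: {State.INFECT},
    State.INFECT: {State.RETREAT},
    State.RETREAT: {State.EXIT},
    State.EXIT: set(),
}


class InvasiveApplication:
    """Hypothetical model of an application that holds one claim."""

    def __init__(self):
        self.state = State.START
        self.claim = None

    def _move(self, target):
        if target not in ALLOWED[self.state]:
            raise RuntimeError(f"illegal transition {self.state} -> {target}")
        self.state = target

    def invade(self, resources):
        self._move(State.INVADE)
        self.claim = set(resources)       # granted resources form the claim

    def infect(self, program):
        self._move(State.INFECT)
        return program(self.claim)        # run the program on the claim

    def retreat(self):
        self._move(State.RETREAT)
        self.claim = None                 # resources are freed again

    def exit(self):
        self._move(State.EXIT)


app = InvasiveApplication()
app.invade({"i-Core", "CPU1"})
used = app.infect(lambda claim: len(claim))
app.retreat()
app.exit()
print(used, app.state)
```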
Realizing this resource-aware programming model effectively requires a holistic approach. Therefore, the sub-
projects of Invasive Computing cover the full compute stack of architecture, language/compiler, operating sys-
tem/runtime system and applications. The hardware architecture targeted by invasive computing is a heteroge-
neous multiprocessor system-on-chip [50]. It consists of tiles of different types that are interconnected using a
network-on-chip. Figure 2.6 shows an instance of the invasive architecture that uses three different types of tiles:
(i) compute tiles contain several RISC CPU cores that communicate over a shared bus,
(ii) memory tiles provide DDR memory and
(iii) i-Core tiles contain RISC CPU cores and the reconﬁgurable processor ‘i-Core’ (which was presented in the
previous section and is the evaluation platform of this thesis).
⁵ https://www.gaisler.com/

Figure 2.6: Overview of an instance of the tile-based invasive manycore architecture. Details of an i-Core tile are shown
(tiles containing CPUs, memory or an i-Core are each attached via an NA to a NoC router; the i-Core tile comprises
the runtime-reconfigurable processor i-Core (CPU 0) with its CI Execution Controller, CPU1 to CPU3, each with I$ and
D$, a tile bus, an L2 cache and tile-local memory (TLM))
i-Core is investigated as an architectural subproject of Invasive Computing. Within the project, it is a resource
that can be invaded by applications for exclusive access, through which predictable execution is enabled. So far,
however, execution time guarantees for runtime-reconﬁgurable processors are unavailable. This thesis introduces
general methods for WCET analysis and optimization on runtime-reconﬁgurable processors that can be applied to
the i-Core processor and the invasive architecture to achieve predictable high performance.
2.5.2 SPP 1500
Another associated research project is the DFG Priority Program SPP 1500 “Dependable Embedded Systems”,
which focuses on the various reliability concerns in the nano-era [49]. The reliability concerns include manufac-
turing variability, aging, the impact of temperature and soft errors. These concerns are addressed from a wide
range of perspectives including operating systems, compilers, micro-architectures and applications themselves.
SPP 1500 comprises 12 projects in total from research groups of ten different universities throughout Germany.
The project OTERA (Online Test Strategies for Reliable Reconﬁgurable Architectures) targets reliability concerns
in runtime-reconﬁgurable architectures on the basis of i-Core (which is the main evaluation platform used in this
thesis).
This section concludes the background for the main focus of this thesis, which is discussed in Chapter 4 and the
following chapters, i.e., WCET analysis and optimization using runtime reconfiguration. The following chapter takes
a step back from architectures designed for real-time systems to verify the claim that performance on current
high-performance architectures is increasingly hard to predict and to motivate the need for timing-analyzable
performance features.
3 Achieving Performance on Fused CPU-GPU
Architectures with Shared Last Level Caches
This chapter presents novel co-scheduling approaches to distribute work onto CPU and GPU in an extensive case
study on how performance is achieved on heterogeneous high-performance processors that follow one of the main
current architectural trends: integrating a (multi-core) CPU and a GPU on a single die¹. Being able to employ such
architectures in real-time embedded systems would be highly desirable, because they provide high performance
within a limited area and power budget. For example, NVIDIA partnered with numerous companies from the automotive
domain (Audi, Mercedes-Benz, Toyota and Volvo, among others) in the NVIDIA DRIVE project to create a computing
platform that specifically aims at enabling autonomous driving². Their current hardware platform "NVIDIA Drive
PX Xavier" relies on integrating a CPU, a GPU and hardware accelerators on a single chip to achieve the required
computing performance within a 30 W power budget³. It remains an open question, however, how predictable
execution times can be achieved on such platforms, which is crucial for deploying them in actual products
(e.g., self-driving cars) [85].
In this chapter it is shown that even when targeting average-case performance, performance predictions are
beneficial for distributing work among heterogeneous compute devices for maximum performance. Three novel
approaches to distribute work are introduced and compared, which leverage the unique features of architectures
that share a last level cache between CPU and GPU. Furthermore, this chapter uncovers a cache coherency
bottleneck in recent architectures of this kind that has implications for predictable performance. It ultimately
provides evidence for the claim made in Chapter 1 that high-performance architectures, which were designed for
average-case performance, are virtually impossible to analyze for execution time guarantees, and it motivates the
design and analysis of a timing-analyzable architecture in the following chapters.
3.1 Fused CPU-GPU Architectures
[Diagram showing four CPU cores (each with L1-I$, L1-D$ and L2$), a GPU with GPU caches, the shared last level cache, the system bus and main memory (DDR)]
Figure 3.1: High-level overview of a fused CPU-GPU architecture with shared last level cache
With the release of AMD’s Fusion and Intel’s Ivy Bridge architectures in 2011 and 2012, respectively, the trend of
processor integration resulted in fused CPU-GPU architectures that integrate a CPU and a general-purpose GPU on a
single die. The main benefit of such an integration is that time-consuming memory transfers between main memory
and dedicated GPU memory become unnecessary. Instead, CPU and GPU access the same physical memory such that
zero-copy transfers can be employed. Zero-copy transfers ensure coherency and translate pointers to memory
buffers for the common CPU and GPU address space, but do not actually transfer data. However, such an integration
introduces a memory bottleneck, because CPU and GPU compete for the memory bandwidth of the shared physical memory.
1 The work presented in this chapter was originally published in [30]
2 https://www.nvidia.com/en-us/self-driving-cars/
3 https://en.wikipedia.org/w/index.php?title=Drive_PX-series&oldid=864203140#Drive_PX_Xavier
In more recent architectures, e.g., Intel Broadwell and beyond, CPU and GPU were further integrated so that they
access the shared last level cache (LLC) as shown in Fig. 3.1. This enables hardware-supported byte-level cache
coherency between CPU and GPU. Effectively, CPU and GPU can execute computational kernels on the same
data in parallel and solve problems collaboratively. In this case, the shared LLC has the potential to alleviate the
memory bottleneck present in earlier fused CPU-GPU architectures (without a shared LLC), because it can serve
accesses to a common working set instead of requiring frequent main memory accesses [110].
The idea of heterogeneous compute devices performing computations on a common memory is also captured in
version 2.0 of the Open Computing Language (OpenCL) standard. Most prominently, OpenCL 2.0 introduces Shared
Virtual Memory (SVM), i.e., a shared virtual address space between heterogeneous compute devices in an OpenCL
program. SVM is also supported by fused CPU-GPU architectures without a shared LLC, e.g., AMD’s Accelerated
Processing Units or systems-on-chip that feature ARM’s Mali Bifrost GPU. However, because excessive coherency
traffic is required across heterogeneous devices [73], SVM has proven inefficient on such architectures. In
contrast, on fused CPU-GPU architectures with a shared LLC, OpenCL 2.0 promises efficient support for byte-level
coherent (so-called fine-grained) SVM as well as cross-device atomics [56].
This work presents the ﬁrst investigation of collaborative execution of computational kernels on a fused CPU-GPU
architecture with a shared LLC using ﬁne-grained SVM, i.e., CPU and GPU share cache-coherent memory so that
the work of a computational kernel can be processed in parallel by both compute devices. We detail how OpenCL
programs are ported to OpenCL 2.0’s ﬁne-grained SVM. This process is applied to the entire Rodinia Benchmark
Suite [22] and overheads of ﬁne-grained SVM are evaluated. Collaborative execution of computational kernels on
ﬁne-grained SVM requires novel co-scheduling approaches that determine how much work should be performed on
CPU and GPU, respectively, for maximum performance. In previous studies on collaborative execution that used
zero-copy transfers on fused CPU-GPU architectures with a shared LLC and OpenCL 1.2, a single static
data-centric distribution of work was established for all kernels per program [113, 114]. Fine-grained SVM makes it
possible to decide the distribution of work dynamically, based on the observed progress made by CPU and GPU while
executing a kernel. Thus, a decision should be made per kernel instead of per program.
This work contributes three dynamic co-scheduling approaches that utilize different capabilities of OpenCL 2.0:
one kernel-external method based on online profiling and two kernel-internal methods that utilize cross-device
atomics (variables that can be modified atomically across multiple compute devices). Cross-device atomics
are currently supported by OpenCL 2.0 only; apart from that, our approaches could also be realized, e.g., in
NVIDIA CUDA. One of the kernel-internal methods utilizes device-side enqueuing, another feature introduced
with OpenCL 2.0 that enables enqueuing kernels to an OpenCL device from within an executing kernel.
Device-side enqueuing is similar to dynamic parallelism in NVIDIA CUDA. However, it is shown that
device-side enqueuing introduces too much overhead to be suitable for implementing co-scheduling approaches.
The other two co-scheduling approaches (one kernel-external and one kernel-internal) are further evaluated using
the Rodinia Benchmark Suite, which we ported to OpenCL 2.0. Our kernel-external method is competitive with
the optimal choice of executing each kernel within a program either on CPU or GPU (clairvoyant xor-Oracle,
some kernels on CPU, others on GPU within the same program). The method achieves 97% of the xor-Oracle’s
performance on average. We show, however, that for most benchmarks of the Rodinia Benchmark Suite it is not
beneﬁcial to split the work of a kernel between CPU and GPU compared to running a kernel either on CPU or GPU
when ﬁne-grained SVM is used. This observation is further analyzed and it is shown that it cannot be explained
by cache conﬂicts, i.e., false or true sharing, but is the result of inefﬁcient cache coherence. As of today, Intel
platforms are the only architectures that support OpenCL 2.0’s ﬁne-grained SVM using a shared LLC. Therefore,
we focus on this architecture in the remainder of this chapter.
The novel contributions of this chapter are as follows:
• We evaluate the overhead of OpenCL 2.0’s ﬁne-grained Shared Virtual Memory, and analyze the suitability of
cross-device atomics as well as device-side enqueuing for co-scheduling kernels on fused CPU-GPU architec-
tures with a shared LLC in three different co-scheduling approaches.
• We develop a co-scheduling approach that is competitive with the optimal choice of executing kernels within a
program either on CPU or GPU (on average 97% of the clairvoyant xor-Oracle’s performance and 1.43× speedup
over only using the GPU), and show via analysis that inefficient cache coherence is the major performance
bottleneck for collaborative execution of the same kernel on current fused CPU-GPU architectures with shared
LLC.
• We port the Rodinia Benchmark Suite to OpenCL 2.0 with fine-grained SVM and make Rodinia-SVM as well
as a variety of co-scheduling approaches available as open source⁴.
3.2 Related Work
3.2.1 Co-Scheduling on Fused Architectures
In state-of-the-art related work on co-scheduling for fused CPU-GPU architectures, CPU and GPU do not share
the last level cache [8, 58, 65, 75, 113, 114]. Thus, techniques like ﬁne-grained SVM are not supported and com-
munication between CPU and GPU has to rely on explicit data transfers. [58] presents an online proﬁling-based
approach that is similar to our host-side proﬁling approach, but only treats the GPU as an OpenCL 1.2 device while
CPU computations are performed in the host code. Therefore, barriers are required after every kernel run, whereas
our approach treats CPU and GPU as OpenCL 2.0 devices and utilizes OpenCL events for lightweight synchroniza-
tion (see Section 3.5.2). Data transfer overheads between different devices are mentioned as a key issue, but not
further analyzed. [8] uses the online profiling method of [58] and presents a power-aware co-scheduling method
that aims to minimize the energy-delay product of heterogeneous applications running on a fused CPU-GPU
architecture. The authors report an average improvement of 12.3% over the best performance-oriented
schedules. [114] presents an offline, machine learning-based approach to co-scheduling that determines a single
ratio that partitions the input data into separate parts processed by CPU and GPU, respectively. This saves addi-
tional transfers to maintain coherency between kernel executions, but does not allow for per-kernel decisions. [75]
presents an OpenCL runtime system that automatically schedules kernels to multiple devices that were originally
written for a single device. The runtime system takes care of buffer allocation and transfers to maintain coherency
between all devices without programmer effort. [65] and [113] speciﬁcally target irregular workloads, in which
some work items take considerably longer than others such that proﬁling information from a subset of work items
is often not representative for the performance of the whole kernel. Both approaches identify application-speciﬁc
features to model the computational kernels’ performance for scheduling decisions.
Compared to our work, state-of-the-art co-scheduling approaches did not share cache-coherent memory between
CPU and GPU, but were instead limited by explicit data transfers that were required to establish consistency.
3.2.2 Exploiting Shared Virtual Memory
In [110] the potential of fused CPU-GPU architectures with a shared LLC is explored in simulation. The authors
present an approach where compiler-generated “pre-execution code” is run on the CPU before executing a
computational kernel on the GPU. The aim of this approach is to fill the shared LLC such that the number of main
memory accesses that need to be performed by the GPU is minimized. Using this approach, the authors report a
performance improvement of up to 113%, and 21.4% on average. [103] presents an extension of the gem5-gpu
4 Source code available at: https://git.scc.kit.edu/CES/Rodinia-SVM
simulator for fused CPU-GPU architectures [82] that supports the features of OpenCL 2.0. Compared to these
works, our approach utilizes a commercial off-the-shelf architecture (Intel) instead of simulation.
In [73] a comprehensive performance evaluation of OpenCL 1.2, OpenCL 2.0 and Heterogeneous System Archi-
tecture (HSA) 1.0 is presented. In contrast to our work, the evaluated AMD Kaveri architecture does not feature a
shared LLC between CPU and GPU. As a result, the authors observe that excessive coherency trafﬁc is generated
across devices that can affect performance signiﬁcantly.
In summary, state-of-the-art related work on co-scheduling on fused CPU-GPU architectures either failed to lever-
age cache-coherent memory between CPU and GPU or only explored cache coherency between CPU and GPU in
simulation.
3.3 Motivational Example
[Plot: execution time in ms (axis ticks 10^4 and 10^5) over the % share of work executed on CPU (0 to 100). Series: program-fixed CPU/GPU ratio; best per-kernel CPU xor GPU. Annotation: using either CPU or GPU is better (> 8%) than the best program-fixed ratio.]
Figure 3.2: Particle Filter benefits from a per-kernel scheduling decision compared to a fixed ratio for the whole benchmark
when executed on OpenCL 2.0’s fine-grained SVM
Before the introduction of OpenCL 2.0’s fine-grained SVM, data needed to be explicitly transferred to compute
devices. Furthermore, consistency guarantees for memory buffers that were accessed in parallel by different
compute devices did not exist. Therefore, state-of-the-art co-scheduling approaches divided the input data into
two separate parts that were processed by CPU and GPU, respectively [113, 114]. Effectively, a single ratio that
determines the share of work to be performed on each compute device was applied to all kernels of an OpenCL
program. With fine-grained SVM, pointers can be shared and accessed consistently by multiple devices in parallel.
Fig. 3.2 shows execution time results for the Particle
Filter benchmark from the Rodinia Benchmark Suite (version 3.1 ported to ﬁne-grained SVM) on an Intel Core
i7-6700T (Skylake) fused CPU-GPU architecture with a shared LLC. The blue bars show the execution time for
statically-ﬁxed ratios of work performed on CPU and GPU, respectively, that are applied to all four kernels of the
benchmark. The red line shows the execution time for deciding per-kernel whether to execute it either on CPU or
on GPU. Only the single best overall decision (ﬁrst two kernels on GPU, remaining two on CPU) is shown. In any
case, the four kernels need to be executed in sequence. Two of the four kernels contain loops that result in extremely
poor performance when executed on the GPU only (thus, the execution time drops from x = 0% to x = 10%), while
the other two kernels benefit strongly from execution on the GPU compared to the CPU (thus, the execution time
increases from x = 10% to x = 100%). Therefore, deciding a single ratio of how to distribute work for all kernels
results in a compromise that performs worse than executing each kernel exclusively on the most suitable device.
Switching devices between kernels is possible because fine-grained SVM is shared consistently among different
compute devices without any explicit data transfers in between kernel executions.
This example shows that per-kernel decisions on how to distribute work have a performance benefit over a single
data-centric ratio that is applied to all kernels of a program. In this work, we explore co-scheduling methods that
leverage OpenCL 2.0 features to make per-kernel decisions at runtime that go beyond the binary decision of using
either the CPU or the GPU and instead utilize both compute devices in parallel.
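The effect observed for Particle Filter can be reproduced with a small calculation. The following C sketch compares a per-kernel xor decision with the best program-fixed ratio. It is only an illustration under stated assumptions: the type kernel_cost, the function names and the timings used in the usage note are hypothetical and not taken from the benchmark, and the cost model (both devices process their share in parallel without any overhead) is idealized.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical execution times of one kernel when it runs entirely on one device. */
typedef struct {
    unsigned cpu_time;
    unsigned gpu_time;
} kernel_cost;

/* Per-kernel xor decision: every kernel runs entirely on its faster device. */
unsigned xor_time(const kernel_cost *k, size_t n)
{
    unsigned total = 0;
    for (size_t i = 0; i < n; i++)
        total += k[i].cpu_time < k[i].gpu_time ? k[i].cpu_time : k[i].gpu_time;
    return total;
}

/* Program-fixed ratio: cpu_percent of every kernel's work goes to the CPU.
 * Idealized assumption: both devices work in parallel, so a kernel takes as
 * long as the slower of the two shares. The result is scaled by 100 to stay
 * in integer arithmetic. */
unsigned fixed_ratio_time_x100(const kernel_cost *k, size_t n, unsigned cpu_percent)
{
    assert(cpu_percent <= 100);
    unsigned total = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned cpu_part = cpu_percent * k[i].cpu_time;
        unsigned gpu_part = (100 - cpu_percent) * k[i].gpu_time;
        total += cpu_part > gpu_part ? cpu_part : gpu_part;
    }
    return total;
}

/* Best program-fixed ratio, searched in steps of one percent. */
unsigned best_fixed_ratio_time_x100(const kernel_cost *k, size_t n)
{
    unsigned best = fixed_ratio_time_x100(k, n, 0);
    for (unsigned p = 1; p <= 100; p++) {
        unsigned t = fixed_ratio_time_x100(k, n, p);
        if (t < best)
            best = t;
    }
    return best;
}
```

For two hypothetical kernels with opposite device preferences, e.g., {100, 10} and {10, 100} time units on CPU and GPU, the xor decision needs 20 time units in this model, whereas no program-fixed ratio gets below 100 time units.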
3.4 Background on Heterogeneous Execution using OpenCL
Figure 3.3: Hierarchy of Work Items in an OpenCL Kernel
In this section we provide an overview of OpenCL in
general and discuss features introduced in OpenCL 2.0
that we utilize for co-scheduling.
3.4.1 OpenCL
The Open Computing Language (OpenCL) is an open standard for parallel programming of heterogeneous
systems [57]. It consists of a host-side API and a C-like programming language for writing computational
kernels. The host-side API provides access to the platform, i.e., a view of the system that the OpenCL program is
executed on. The platform comprises one or more devices that are capable of executing OpenCL kernels. Within
fused CPU-GPU architectures, CPU (including all cores) and GPU are separate devices⁵ belonging to the same
platform. For communication between host and devices, the host-side API provides functions to submit commands
to command queues. Commands specify tasks that should be performed by a device, e.g., memory operations,
synchronization or kernel execution. Each command queue is associated with exactly one device. Events can be
used to formulate dependencies between commands (from the same or different command queues) as directed
acyclic graphs. A command can emit an event upon successful execution. When submitting a command to a command
queue, it can be specified that the command should only be executed after one or more events were emitted by
finishing the execution of respective commands.
Generally, when implementing an OpenCL kernel, the goal is to represent parallelism at the finest possible
granularity. Figure 3.3 shows how OpenCL divides work hierarchically as well as the OpenCL keywords used by the
host-side API⁶. The smallest unit of execution is a work item. Each work item executes an instance of the kernel
body, e.g., for a kernel that implements vector addition, a work item would compute a single element. When
submitting a kernel to a command queue, usually thousands of work items are instantiated and execute concurrently
(as many as given by global_size). Work items are divided into work groups. Work groups are equally sized (by
local_size) and each group has a unique group_id. Work items have a local_id (0, ..., local_size − 1)
that is unique within a work group only, as well as a globally unique global_id (0, ..., global_size − 1) that
is used for address calculations. The global_id specifies on which part of the input data a specific work item
executes the kernel body. Only within a work group can work items perform barrier operations and share local
memory. This way, the OpenCL compiler can perform device-specific optimizations, e.g., on CPUs a work group
is serialized to a single thread.
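The relation between the IDs described above can be sketched in plain C. The sketch assumes the one-dimensional case without a global offset; the type work_item_pos and the functions locate_work_item and vector_add_item are hypothetical names introduced for illustration and are not part of the OpenCL API.

```c
#include <assert.h>
#include <stddef.h>

/* Position of a work item within the work-group hierarchy (hypothetical type). */
typedef struct {
    size_t group_id;  /* index of the work group that contains the work item */
    size_t local_id;  /* index of the work item within its work group */
} work_item_pos;

/* One-dimensional case without global offset: work groups are equally sized,
 * so group and local IDs follow from integer division and remainder. */
work_item_pos locate_work_item(size_t global_id, size_t local_size)
{
    assert(local_size > 0);
    work_item_pos pos = { global_id / local_size, global_id % local_size };
    return pos;
}

/* Body of a vector addition as one work item would execute it:
 * the global_id selects the single element that is computed. */
void vector_add_item(const float *a, const float *b, float *c, size_t global_id)
{
    c[global_id] = a[global_id] + b[global_id];
}
```

For example, with a local_size of 4, the work item with global_id 10 is the third work item (local_id 2) of the third work group (group_id 2).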
3.4.2 OpenCL 2.0
The OpenCL speciﬁcation 2.0 introduced several features that provide opportunities for improved collaboration
between different devices as well as the host [44, 57]. The most prominent feature is Shared Virtual Memory
(SVM) that introduces a shared virtual address space between host and devices in an OpenCL program. SVM
eliminates explicit data transfers between host and device memory, and enables direct sharing of pointer-based
data structures. OpenCL 2.0 introduces coarse-grained and ﬁne-grained SVM. Coarse-grained SVM allows host
and devices to share virtual memory pointers, but still requires buffers that are explicitly mapped and unmapped
from host and devices. A coarse-grained SVM buffer can only be mapped to a single device or the host at a time;
concurrent accesses by multiple devices are not supported. Fine-grained SVM is an optional feature of OpenCL 2.0
that deﬁnes memory consistency guarantees for SVM allocations that are concurrently accessed by the host and
one or more devices. With ﬁne-grained SVM, host and devices can share memory at byte-level granularity and
5 Note that commonly used terms like ‘compute unit’ or ‘processing element’ are deﬁned as speciﬁc parts of a device in OpenCL
6 OpenCL supports up to three-dimensional index spaces. At this point, we explain the one-dimensional case for brevity
OpenCL 1.2 (left):
 1 int* ptr = (int*) malloc(...);
 2 for (int i=0; i<n; i++)
 3     ptr[i] = i;
 4 ptr_device = clCreateBuffer(...);
 5 clEnqueueWriteBuffer(ptr_device, ptr, ...);
 6 clSetKernelArg(..., ptr_device);
 7 clEnqueueNDRange(...);
 8 clEnqueueReadBuffer(ptr_device, ptr, ...);
 9 clFinish(...);
10 printf("Result: %d\n", ptr[0]);

OpenCL 2.0 with fine-grained SVM (right):
 1 int* ptr = (int*) clSVMAlloc(...);
 2 for (int i=0; i<n; i++)
 3     ptr[i] = i;
 4
 5
 6 clSetKernelArgSVMPointer(..., ptr);
 7 clEnqueueNDRange(...);
 8
 9 clFinish(...);
10 printf("Result: %d\n", ptr[0]);
read from it concurrently. Concurrent writes are supported to non-overlapping bytes. Consistency is guaranteed
before and after each command execution. When more ﬁne-grained consistency is required, atomics can be used.
Atomics are another optional feature introduced by OpenCL 2.0. In combination with ﬁne-grained SVM, atomics
can be shared between different devices. This enables cross-device atomic operations and additionally provides a
means of synchronization. This way, byte-level consistency can be guaranteed within a kernel.
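As an illustration of how a shared atomic variable can serve as a means of synchronization, the following C sketch lets workers claim disjoint index ranges from a common counter. It uses C11 atomics on the host as a stand-in for OpenCL 2.0 cross-device atomics in fine-grained SVM; the type shared_counter and the function claim_range are hypothetical names, and the sketch does not reproduce any particular implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical shared state: a counter of indices that were already handed out. */
typedef struct {
    atomic_uint next;  /* first index that has not been claimed yet */
    size_t total;      /* total number of indices to process */
} shared_counter;

/* Atomically claims up to `chunk` indices. On success, the claimed half-open
 * range [*begin, *end) is exclusive to the caller, because atomic_fetch_add
 * returns a distinct previous value to every concurrent caller. */
bool claim_range(shared_counter *c, unsigned chunk, size_t *begin, size_t *end)
{
    assert(chunk > 0);
    unsigned first = atomic_fetch_add(&c->next, chunk);
    if (first >= c->total)
        return false;                 /* nothing left to claim */
    size_t last = (size_t)first + chunk;
    *begin = first;
    *end = last < c->total ? last : c->total;
    return true;
}
```

A worker would call claim_range in a loop and process the returned range until the function returns false. No locks are required, because the only shared write is the atomic increment.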
Before OpenCL 2.0, the only way to execute commands on a device was to submit commands to a command
queue using the host-side API. This means that the number of work items that should be executed when launching
a kernel needed to be known before the kernel was executed. OpenCL 2.0 introduces device-side enqueuing, i.e.,
kernels get the ability to enqueue child kernels in a device-side command queue. Similarly to dynamic parallelism
in NVIDIA CUDA, this enables implementation of kernels that perform calculations iteratively or use recursion.
Like in host-side enqueuing, dependencies between child kernels can be speciﬁed using events, but generated
events are only visible to the parent kernel. Child kernels run asynchronously to the parent kernel. However, the
parent kernel is only registered as successfully executed (and may emit an event), when all its child kernels ﬁnished
execution.
3.5 Utilizing Fine-Grained SVM on Fused CPU-GPU Architectures
3.5.1 Memory Allocation
Until OpenCL 2.0, communication between the host program and compute devices required explicit allocation
of device-side buffers. As shown in the simpliﬁed example in Fig. 3.4 (l.4, left), memory that is allocated and
initialized by the host program needs to be transferred to the device-side buffer ﬁrst (l.5), before a kernel can be
launched using clEnqueueNDRange(. . .). After kernel execution ﬁnishes, the results are transferred back (l.8).
In Rodinia-SVM, we removed all device-side buffer allocations from the original Rodinia Benchmark Suite and
utilize ﬁne-grained SVM instead, as shown in Fig. 3.4 (right). This allows all devices and the host to access memory
using shared pointers. As a result, all explicit transfers between host and devices are eliminated. Furthermore,
while device-side buffers are owned by a single device at a time, ﬁne-grained SVM can be accessed consistently
by multiple devices and the host.
3.5.2 Kernel Launch and Synchronization
For launching kernels on a fused CPU-GPU architecture, one command queue is instantiated for each device
(CPU and GPU). Then, the same kernel is enqueued with only a share of the total work items (global_size, see
Fig. 3.3) plus offsets that are used for calculating global work item IDs. ID calculation depends on the speciﬁc
co-scheduling method, and is therefore detailed in Section 3.6.
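A static split of the work items between the two command queues could be computed as in the following C sketch. It is a simplified illustration and not one of the co-scheduling methods of Section 3.6: the type work_split, the function split_work and the parameter cpu_percent are hypothetical. The resulting sizes and offsets are the kind of values that can be passed as global_work_size and global_work_offset to clEnqueueNDRangeKernel; whether Rodinia-SVM passes offsets this way is not stated here.

```c
#include <assert.h>
#include <stddef.h>

/* Shares of the global index space assigned to each device (hypothetical type). */
typedef struct {
    size_t cpu_offset, cpu_size;  /* first global ID and number of work items on the CPU */
    size_t gpu_offset, gpu_size;  /* first global ID and number of work items on the GPU */
} work_split;

/* Splits global_size work items at work-group granularity so that both shares
 * remain multiples of local_size. The CPU share is rounded down to whole groups. */
work_split split_work(size_t global_size, size_t local_size, unsigned cpu_percent)
{
    assert(local_size > 0 && global_size % local_size == 0);
    assert(cpu_percent <= 100);
    size_t groups = global_size / local_size;
    size_t cpu_groups = groups * cpu_percent / 100;
    work_split s;
    s.cpu_offset = 0;
    s.cpu_size = cpu_groups * local_size;
    s.gpu_offset = s.cpu_size;              /* the GPU continues where the CPU share ends */
    s.gpu_size = global_size - s.cpu_size;
    return s;
}
```

For 1024 work items in groups of 64 and a CPU share of 25%, the CPU receives global IDs 0 to 255 and the GPU receives global IDs 256 to 1023.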
 1 clEnqueueNDRangeKernelFused(commandsCPU, commandsGPU, kernel, ...) {
 2     // ... (calculate work item shares and IDs)
 3     if (workItemsCPU > 0) // work assigned to CPU?
 4         clEnqueueNDRangeKernel(commandsCPU, kernel, ..., &eventGPUDone[curr-1],
               &eventCPUDone[curr]);
 5     else
 6         clSetUserEventStatus(eventCPUDone[curr], CL_COMPLETE);
 7     if (workItemsGPU > 0) // work assigned to GPU?
 8         clEnqueueNDRangeKernel(commandsGPU, kernel, ..., &eventCPUDone[curr-1],
               &eventGPUDone[curr]);
 9     else
10         clSetUserEventStatus(eventGPUDone[curr], CL_COMPLETE); }
Figure 3.5: Launching a kernel on a fused CPU-GPU architecture without host-side synchronization
Figure 3.6: Compared to the original OpenCL 1.2 implementation of the Rodinia Benchmark Suite, which executes on the GPU only and uses
device-side buffers, the use of OpenCL 2.0 incl. fine-grained SVM introduces overheads but maintains consistency
Table 3.1: Rodinia Benchmark Suite – OpenCL Benchmarks
Name                  Abbreviation  #Kernels
Back Propagation      bp            2
Breadth-First Search  bfs           2
B+Tree                b+            2
CFD Solver            cfd           4
GPUDWT                dwt           3
Gaussian Elimination  ge            2
Heart Wall            hw            1
HotSpot3D             hs3D          1
HotSpot               hs            1
Hybrid Sort           hys           7
K-Means               km            2
LavaMD                md            1
Leukocyte Tracking    lc            3
LU Decomposition      lud           4
Myocyte               mc            1
Nearest Neighbor      nn            1
Needleman-Wunsch      nw            2
Particle Filter       prtf          4
Path Finder           pthf          1
Streamcluster         sc            1
Earlier versions of OpenCL required explicit synchronization at the host-side, e.g., using clFinish(...) (see
Fig. 3.4) or clWaitForEvents(...), to achieve consistency [114]. Synchronization with the host induces a
significant overhead, however, because the devices’ command queues can no longer be processed in parallel.
Because fine-grained SVM maintains consistency, host-side synchronization is no longer required. However, we
still need to ensure that CPU and GPU execute kernels in lock step, i.e., when launching a sequence of kernels
like clEnqueueNDRange(kernelA,...); clEnqueueNDRange(kernelB,...);, the CPU should not begin executing
kernelB before the GPU has finished executing kernelA and vice versa. Otherwise, results from kernelA that
kernelB depends on might not be ready when one device races ahead, which can lead to erroneous results. As
shown in Fig. 3.5, we utilize events to express these dependencies. For each device, a ring buffer is allocated
that stores one event for each enqueued kernel (eventCPUDone and eventGPUDone, respectively). If work items
are assigned to the CPU, the kernel is enqueued on the CPU (l.4). The execution of the kernel depends on an
event that is emitted when the previous kernel that was enqueued to the GPU completes execution
(eventGPUDone[curr-1]). In case no work items were assigned to the CPU, the event that indicates completed
execution of the current kernel launch on the CPU is emitted immediately so that no deadlocks occur
(eventCPUDone[curr], l.6). Kernel launches on the GPU are performed analogously. In Rodinia-SVM, we replaced
all calls to clEnqueueNDRange(...) with our clEnqueueNDRangeFused(...) implementation. Furthermore, the
co-scheduling methods that we detail in Section 3.6 are also applied by clEnqueueNDRangeFused(...).
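The lock-step dependency between the two devices can be modeled without OpenCL, as in the following C sketch. It only illustrates the indexing and the dependency rule; RING_SIZE, lockstep_state and the functions are hypothetical, boolean flags stand in for OpenCL events, and the wrap-around of the index curr-1 is an assumption about how a ring buffer would typically be addressed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 8  /* hypothetical capacity of each event ring buffer */

/* Index of the previous kernel launch, wrapping around at the start of the ring. */
size_t ring_prev(size_t curr)
{
    return (curr + RING_SIZE - 1) % RING_SIZE;
}

/* Boolean stand-ins for the events eventCPUDone and eventGPUDone. */
typedef struct {
    bool cpu_done[RING_SIZE];
    bool gpu_done[RING_SIZE];
} lockstep_state;

/* The CPU may start launch `curr` only after the GPU finished the previous launch. */
bool cpu_may_start(const lockstep_state *s, size_t curr)
{
    assert(curr < RING_SIZE);
    return s->gpu_done[ring_prev(curr)];
}

/* The GPU may start launch `curr` only after the CPU finished the previous launch. */
bool gpu_may_start(const lockstep_state *s, size_t curr)
{
    assert(curr < RING_SIZE);
    return s->cpu_done[ring_prev(curr)];
}
```

In this model, marking a launch as done on a device that received no work items corresponds to emitting the event immediately, which is what prevents the other device from waiting forever.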
3.5.3 Overheads of Fine-Grained SVM
Figure 3.6 shows execution time results for all benchmarks of the Rodinia Benchmark Suite (version 3.1, listed in
Table 3.1) in two variants: the original OpenCL 1.2 version as well as our OpenCL 2.0 port where all device-side
buffers were replaced by ﬁne-grained SVM allocations as explained in Section 3.5.1. In both variants, kernels are
executed on the GPU only (the fused kernel launch of Section 3.5.2 is not used) on an Intel Core i7-6700T
(Skylake). Kernel compilation times are omitted⁷. The results show that the convenience of being able to pass host-side
pointers directly into kernels comes at a cost. In particular, short-running benchmarks (100ms and less) are signif-
icantly slowed down, e.g., ge takes almost 3.5× longer (112ms instead of 32ms) when executed on ﬁne-grained
SVM instead of device-side buffers. Benchmarks that run 100ms or more in the OpenCL 1.2 version only take
1.14× longer on average (geometric mean). Longer-running benchmarks that alternate between kernel execution
and host-side computations like hw and sc even beneﬁt from ﬁne-grained SVM (1.9× and 1.48× speedup, respec-
tively), because with OpenCL 1.2 they explicitly need to synchronize with the host and invoke transfers after every
kernel execution. However, the geometric mean execution time increase over all benchmarks for the OpenCL 2.0
versions compared to the OpenCL 1.2 version is 1.51×. The overheads stem from the fact that the OpenCL 1.2
device-side buffers used in the Rodinia benchmarks are already allocated as zero copy buffers on fused CPU-GPU
architectures8, i.e., instead of allocating separate host-side and device-side memory, both buffers are mapped to the
same shared physical memory. Consequently, data transfers between host-side and device-side buffers do not ac-
tually transfer data, but only translate pointers and initiate the OpenCL 1.2 runtime system to establish consistency
between CPU and GPU. OpenCL 2.0’s ﬁne-grained SVM adds overhead compared to zero copy buffers, because
consistency is not only established explicitly using transfers (e.g., at the beginning and end of a computation often
consisting of multiple enqueued kernels), but continuously when kernels are executed.
(0,0) ↦ 0    (1,0) ↦ 1    (2,0) ↦ 2    (3,0) ↦ 3
(0,1) ↦ 4    (1,1) ↦ 5    (2,1) ↦ 6    (3,1) ↦ 7
(0,2) ↦ 8    (1,2) ↦ 9    (2,2) ↦ 10   (3,2) ↦ 11
(0,3) ↦ 12   (1,3) ↦ 13   (2,3) ↦ 14   (3,3) ↦ 15
Figure 3.7: For co-scheduling, multi-dimensional IDs are mapped to one-dimensional IDs
Ultimately, these overheads have to be considered when im-
plementing OpenCL 2.0 programs to decide whether to use
ﬁne-grained SVM or not. However, ﬁne-grained SVM does
not only provide the convenience of shared pointers, but also
enables new features like cross-device atomics. In the fol-
lowing we will present co-scheduling methods that exploit
these new features and evaluate them using Rodinia-SVM.
3.6 Our Co-Scheduling Methods
Let us deﬁne two types of co-scheduling methods, namely
(1) device-side co-scheduling, where work group scheduling
is performed during execution of the respective kernel by the executing devices themselves, and (2) host-side co-
scheduling, where work groups are assigned to CPU and GPU using the host-side OpenCL API only (outside of the
kernels). In the following we will present two device-side and one host-side co-scheduling methods. All methods
leverage OpenCL 2.0’s ﬁne-grained SVM to achieve consistency while executing kernels on CPU and GPU in
parallel.
When enqueuing an OpenCL kernel using the OpenCL host API function clEnqueueNDRangeKernel(. . .), the
global_size parameter speciﬁes how many work items should be launched by the OpenCL runtime system on a
specific device (see Fig. 3.3). global_size can have up to three dimensions, in which case the OpenCL runtime system
assigns each work item a global ID per dimension. For co-scheduling, we project multi-dimensional global IDs onto
one-dimensional IDs. An example for the two-dimensional case is given in Fig. 3.7.
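The projection shown in Fig. 3.7 can be sketched in plain C as follows. This is a minimal host-side illustration and not code from the benchmark suite: the function names are hypothetical, and the row-by-row order (dimension 0 varies fastest) is read off the example in Fig. 3.7.

```c
#include <stddef.h>

/* Project a 2D global ID (x, y) onto a 1D ID, row by row.
 * size_x is the number of work items in dimension 0 (assumed > 0). */
size_t flatten_id_2d(size_t x, size_t y, size_t size_x) {
    return y * size_x + x;
}

/* Same idea for three dimensions (size_x, size_y: extents of dimensions 0 and 1). */
size_t flatten_id_3d(size_t x, size_t y, size_t z, size_t size_x, size_t size_y) {
    return (z * size_y + y) * size_x + x;
}

/* Inverse of flatten_id_2d: recover (x, y) from a 1D ID. */
void unflatten_id_2d(size_t id, size_t size_x, size_t *x, size_t *y) {
    *x = id % size_x;
    *y = id / size_x;
}
```

With a 4×4 global size as in Fig. 3.7, flatten_id_2d(1, 0, 4) yields 1 and flatten_id_2d(0, 1, 4) yields 4, matching the first two rows of the figure.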
In our co-scheduling methods, we launch a subset of the total work items on CPU and GPU and then proceed
to schedule the remaining work items based on the observed performance. The main idea behind the device-side
methods is to treat the work of a kernel as a bag-of-tasks that contains independent work groups. Initially, only
a few work groups are launched (enough to fully utilize CPU and GPU). The work items of these work groups
7 Kernel compilation can be avoided using clCreateProgramWithBinary(. . .)
8 using CL_MEM_USE_HOST_PTR or CL_MEM_ALLOC_HOST_PTR ﬂags
typedef struct global_work_state_struct {
  atomic_uint workDone;
  size_t globalWork;
} global_work_state;
Figure 3.8: A global_work_state is shared between work items using fine-grained SVM to realize device-side scheduling
__kernel void kernel(...) {
  PREAMBLE
  ... // --- original kernel code ---
  POSTAMBLE
}
Figure 3.9: The device-side methods add a preamble and postamble to each kernel that implement the co-scheduling methods
act as workers that autonomously acquire and process work from the bag-of-tasks. To implement this scheme,
the device-side methods utilize a global_work_state struct that is stored in SVM and shared between CPU and
GPU (see Fig. 3.8). globalWork is the total number of times the body of an enqueued kernel needs to be executed
(equal to the global_size parameter passed to clEnqueueNDRangeKernelFused(. . .)). In all methods, the
kernel body is executed globalWork times in total. workDone is an atomic counter that keeps track of how many
work items have been executed. It is used to calculate work item IDs and to decide whether another work group needs
to be scheduled, i.e., while workDone < globalWork. Furthermore, all device-side methods add a preamble and/or
postamble macro to each kernel as shown in Fig. 3.9. The specific preamble and postamble implementations are
presented below. Please note that modifying atomic variables, calculating work item IDs and handling corner
cases result in lengthy code, which we simplified in the presentation below for comprehensibility⁹.
3.6.1 Atomic Counting
Figure 3.10: A single work group executes in lock step (atomic counting). Multiple work groups execute in parallel
In the atomic counting method, each work item acts as a worker that loops over the original kernel code. Initially,
multiple same-sized work groups are launched (clEnqueueNDRangeKernelFused(. . .)) and execute in parallel (e.g., one
work group per CPU core and multiple ones on the GPU). No further work groups are launched during kernel execution.
Each work item sequentially executes the kernel body repeatedly for different global IDs.
Work items that belong to the same work group execute the kernel body in lock step as shown in Fig. 3.10. This
way, they can share an atomic counter to derive their global IDs and iterate through all IDs that constitute the
global work at local_size granularity. As detailed in Fig. 3.11, the atomic counter workDone (initially zero) is
used to assign group IDs to work items of the same group:
Before each execution of the original kernel code (l.7), each work group (the last work item of a work group)
fetches the value of workDone and increments the counter by the work group size (l.4 and l.9). The while loop
beginning in line 6 is executed until the total amount of work required by the respective kernel launch is done.
workDoneCpy (deﬁned in l.2) is a variable that stores the fetched value of workDone and is allocated once for
all work items that belong to the same work group (once for each work group). Independent of how many work
groups execute in parallel, workDoneCpy will take the values of 0, get_local_size(), 2·get_local_size(),
. . . , globalWork−get_local_size(), each exactly once for a single work group that enters the while loop
(globalWork is an integer multiple of local_size, see Fig. 3.3). Accessing the atomic counter only once per
iteration of a work group (instead of, e.g., once per work item) reduces contention during the atomic operations,
9 The full implementation of all approaches is available at: https://git.scc.kit.edu/CES/Rodinia-SVM
3 Achieving Performance on Fused CPU-GPU Architectures with Shared Last Level Caches
1 __kernel void kernel(...) {
2   local unsigned int workDoneCpy;  // one copy per work group
3   if (get_local_id() == get_local_size() - 1)  // last work item of the group
4     workDoneCpy = atomic_fetch_add(workDone, get_local_size());
5   barrier(CLK_LOCAL_MEM_FENCE);
6   while (workDoneCpy < globalWork) {
7     ... // --- original kernel code ---
8     if (get_local_id() == get_local_size() - 1)
9       workDoneCpy = atomic_fetch_add(workDone, get_local_size());
10    barrier(CLK_LOCAL_MEM_FENCE); }}
Figure 3.11: In atomic counting, work groups loop over the original kernel code until the total amount of work is done
1 __kernel void kernel(...) {
2   ... // --- original kernel code ---
3   if (get_local_id() == get_local_size() - 1) {  // last work item of the group
4     int workDoneCpy = atomic_fetch_add(workDone, get_local_size());
5     if (workDoneCpy < globalWork) {
6       ndrange_t child_ndrange = ndrange_1D(workDoneCpy, get_local_size(), get_local_size());
7       enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_NO_WAIT, child_ndrange, ^{ kernel(...); }); }}}
Figure 3.12: The device-side enqueuing method enqueues additional work groups using device-side queues
but work items of the same work group need to synchronize after each iteration (thus, execute in lock step).
Synchronization is achieved using a barrier. It ensures that every work item of the same work group sees the
same value of workDoneCpy at all times. This way workDoneCpy can be used to derive work item IDs, i.e.,
get_global_id() is redeﬁned as workDoneCpy + get_local_id(). Ultimately, the original kernel body is
executed exactly once for each work item ID 0, 1, 2, . . . , globalWork−1.
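To illustrate the claim that every ID is processed exactly once, the following plain C sketch mimics atomic counting on the host. It is a simplified model under stated assumptions rather than the kernel code of Fig. 3.11: work groups are simulated sequentially in round-robin order instead of running in parallel, the names work_state, claim_chunk and simulate_atomic_counting are hypothetical, and global_work is assumed to be an integer multiple of local_size.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Host-side stand-in for the shared state of Fig. 3.8 (names are hypothetical). */
typedef struct {
    atomic_uint work_done;  /* first ID that has not been claimed yet */
    size_t global_work;     /* total number of kernel body executions */
} work_state;

/* One work group claims the next local_size IDs.
 * Returns 1 and stores the first claimed ID in *base, or 0 if no work is left. */
int claim_chunk(work_state *ws, unsigned local_size, unsigned *base) {
    unsigned b = atomic_fetch_add(&ws->work_done, local_size);
    if (b >= ws->global_work)
        return 0;
    *base = b;
    return 1;
}

/* Simulate num_groups work groups that take turns claiming chunks.
 * executed[id] counts how often the "kernel body" ran for each ID.
 * Returns the number of chunks that were processed. */
unsigned simulate_atomic_counting(size_t global_work, unsigned local_size,
                                  unsigned num_groups, unsigned *executed) {
    work_state ws;
    atomic_init(&ws.work_done, 0);
    ws.global_work = global_work;

    unsigned chunks = 0;
    unsigned active = num_groups;
    while (active > 0) {
        active = 0;
        for (unsigned g = 0; g < num_groups; ++g) {
            unsigned base;
            if (!claim_chunk(&ws, local_size, &base))
                continue;  /* this group found no work in this round */
            for (unsigned lid = 0; lid < local_size; ++lid)
                executed[base + lid] += 1;  /* global ID = base + local ID */
            ++chunks;
            ++active;
        }
    }
    return chunks;
}
```

Because each chunk is handed out by a single fetch-and-add, no two simulated groups can obtain the same base value, which is the property the text relies on.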
3.6.2 Device-Side Enqueuing
The device-side enqueuing method does not deﬁne a preamble, but only a postamble as detailed in Fig. 3.12.
Similarly to the atomic counting method, it uses the atomic counter workDone to keep track globally of how
many times the kernel body was executed. Again, only as many work groups are launched initially as needed to
fully utilize CPU and GPU (using clEnqueueNDRangeKernelFused(. . .)). The main difference to the atomic
counting method is how work is processed by the work items. Instead of looping, each work item executes
the kernel body only once. After executing the kernel body, additional work groups may be launched by the work
items themselves using OpenCL 2.0's device-side enqueuing. As shown in Fig. 3.12, the work item with the highest ID
inside a work group (l.3) launches another work group by enqueuing the current kernel into the device-side queue
(l.7).
In OpenCL 2.0, kernels are enqueued to the device-side command queue using the Clang [61] block syntax, a
non-standard C extension by Apple Inc. (known as closures in other programming languages) that allows defining
functions that can access variables outside their scope (belonging to a captured environment). In our case (l.7),
the block ^{kernel(...);} defines a function that only calls the current kernel with the (captured) arguments
that were passed to the initial kernel call from the host-side API. This may seem overly complex for our use case;
however, potential alternatives like function pointers are not supported in OpenCL 2.0 and function calls are always
inlined [44].
Line 6 defines the parameters of the enqueued kernel, i.e., the global ID offset (workDoneCpy), the total number of
work items to be launched (get_local_size()) and the work group size (get_local_size()), respectively.
Effectively, work item ID calculation does not have to be redefined as in atomic counting; instead, get_global_id()
returns the correct IDs 0, 1, 2, . . . , globalWork−1 for exactly one work item each. We also evaluated variants
of this method, e.g., enqueuing more work items than a single work group per enqueue_kernel(. . .)
call. However, OpenCL 2.0's device-side enqueuing in general introduces too much overhead (caused by runtime
evaluation of the block syntax) to be suitable for co-scheduling, as we show in Section 3.7.1.
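The control flow of device-side enqueuing can be approximated on the host with a simple FIFO queue. The sketch below is only a sequential model under several assumptions that are not taken from the original implementation: the queue stands in for the OpenCL device-side queue, MAX_PENDING and simulate_device_enqueue are hypothetical names, the counter is assumed to start at the number of initially launched work items, and global_work is assumed to be an integer multiple of local_size.

```c
#include <stddef.h>

#define MAX_PENDING 64  /* hypothetical bound; initial_groups must not exceed it */

/* Each simulated work group runs the kernel body once for its local_size IDs and
 * then, in its postamble, enqueues at most one further work group.
 * Returns the number of work groups that were executed in total. */
size_t simulate_device_enqueue(size_t global_work, size_t local_size,
                               size_t initial_groups, unsigned *executed) {
    size_t queue[MAX_PENDING];
    size_t head = 0, tail = 0, launched = 0;

    /* Host-side launch: only a few work groups are enqueued initially. */
    for (size_t g = 0; g < initial_groups && g * local_size < global_work; ++g)
        queue[tail++ % MAX_PENDING] = g * local_size;

    /* Stand-in for workDone; a plain counter suffices in this sequential model. */
    size_t work_done = tail * local_size;

    while (head != tail) {
        size_t offset = queue[head++ % MAX_PENDING];
        for (size_t lid = 0; lid < local_size; ++lid)
            executed[offset + lid] += 1;  /* kernel body for ID offset + lid */
        ++launched;

        /* Postamble: fetch the counter, advance it, enqueue a child if work is left. */
        size_t next = work_done;
        work_done += local_size;
        if (next < global_work)
            queue[tail++ % MAX_PENDING] = next;
    }
    return launched;
}
```

Since every executed group enqueues at most one child, the number of pending groups never exceeds the number launched initially, which is why a small fixed-size queue is enough for this model.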
3.6.3 Host-Side Proﬁling
[Figure 3.13 consists of two panels: "First execution of kernel k" and "Following executions of k".]
Figure 3.13: At the first execution of a kernel k, host-side profiling determines a ratio rk to distribute work items
In contrast to the device-side co-scheduling methods, the host-side profiling method does not apply any modifications
to the executed kernels, and work items behave exactly the same as in standard OpenCL. Host-side profiling utilizes
only the OpenCL host-side API. Similar to the Inspector-Executor paradigm, the performance behavior of a specific
kernel is characterized in an initial phase. Afterwards, this characterization is used to schedule all
following executions of the same kernel. Upon the first execution of a kernel k, only a fraction of the total work
items (profiling_size) is executed for proﬁling as shown in Fig. 3.13. The profiling_size is split with half
of it executing on the CPU and the other half on the GPU. The execution time of the proﬁling depends on the
speciﬁc kernel. OpenCL events are used (1) to synchronize both devices with the host program once proﬁling
ﬁnishes and (2) to obtain the execution times of the work items executed on CPU (timeCPU) and GPU (timeGPU),
respectively (using the OpenCL API call clGetEventProfilingInfo(. . .)). A ratio r′k ∈ [0,1] of work items to
distribute to the CPU is then determined using these measured execution times as follows:
$$r'_k = 1 - \frac{\text{timeCPU}}{\text{timeCPU} + \text{timeGPU}}$$
This ratio is slightly adjusted to obtain the final ratio rk. Low percentages of work items on the GPU proved to be
detrimental to the performance compared to not using the GPU at all, while following executions on the GPU performed
slightly better than the initial profiling in our experiments:
$$r_k = \begin{cases} 1, & r'_k > 0.8 \quad \text{(all CPU)} \\ \max(0,\ r'_k - 0.05), & \text{else (mixed CPU/GPU)} \end{cases}$$
Finally, rk · global_size and (1 − rk) · global_size determine the number of work items executed on CPU and
GPU, respectively, for following executions of k (see Fig. 3.13; the values are rounded to multiples of the work
group size). rk is also used to distribute the remaining work items after profiling. The number of work items
used for profiling is a parameter. In our experiments, we achieved the best compromise between accuracy of the
determined ratio and overhead of the profiling run when 50% of global_size was used for profiling during the first
execution of a kernel k.
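The ratio computation can be sketched in a few lines of plain C. The function names are hypothetical, both measured times are assumed to be positive, the adjusted ratio is clamped from below at zero so that it stays within [0,1] (an assumption of this sketch), and rounding to the nearest multiple of the work group size is likewise an assumption, since the text does not state the rounding mode.

```c
#include <stddef.h>

/* Preliminary CPU share r'_k from the profiled execution times (both > 0). */
double profiled_ratio(double time_cpu, double time_gpu) {
    return 1.0 - time_cpu / (time_cpu + time_gpu);
}

/* Final CPU share r_k: all work goes to the CPU if the GPU share would be small,
 * otherwise the CPU share is reduced by 0.05 and clamped at zero (assumption). */
double final_ratio(double r_prime) {
    if (r_prime > 0.8)
        return 1.0;
    double r = r_prime - 0.05;
    return r > 0.0 ? r : 0.0;
}

/* Number of work items for the CPU, rounded to the nearest multiple of local_size.
 * The GPU receives global_size minus this value. Assumes that global_size is an
 * integer multiple of local_size. */
size_t cpu_work_items(double r_k, size_t global_size, size_t local_size) {
    size_t groups = global_size / local_size;
    size_t cpu_groups = (size_t)(r_k * (double)groups + 0.5);
    return cpu_groups * local_size;
}
```

For example, if profiling takes 1 ms on the CPU and 3 ms on the GPU, the preliminary ratio is 0.75 and the final ratio about 0.70, so for global_size = 1024 and a work group size of 64 the CPU would receive 11 of the 16 work groups.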
Figure 3.15: Speedup of the co-scheduling methods applied to Rodinia-SVM, on a fused CPU-GPU architecture with shared LLC. Results are
relative to performing the optimal choice for each kernel of either executing on CPU or GPU (xor-Oracle is 100%)
3.7 Experimental Evaluation
The following results were obtained using an Intel Core i7-6700T (Skylake) fused CPU-GPU architecture with 32
GB of main memory. The Intel Core i7-6700T features a quad-core CPU and the HD Graphics 530 GPU. CPU
and GPU share 8 MiB of last level cache (maximum for Skylake). All benchmarks were compiled using GCC
version 7.2.1 and the Intel SDK for OpenCL Applications version 2017 R1. They were executed on CentOS Linux
release 7.4.1708 with the Intel OpenCL 2.0 CPU/GPU driver package SRB5.0 (Linux kernel 4.7.0.intel.r5.0). To
minimize execution time variance, hyper-threading was disabled and CPU frequency scaling set to ‘performance’
(which sets the highest frequency to all cores and effectively disables turbo boost). The Rodinia-SVM benchmarks
were executed using the default inputs from the Rodinia Benchmark Suite for reproducible and comparable results.
Results report the average of 10 executions of the respective benchmark with a standard deviation < 2% of the
average, and do not include kernel compilation times¹⁰.
In the following, we ﬁrst show that device-side enqueuing causes too much overhead to be suitable for co-
scheduling. Then, we evaluate our co-scheduling approaches, and ﬁnally show that cache coherency is a major
performance bottleneck.
3.7.1 Device-Side Enqueuing
Figure 3.14: Device-side enqueuing adds signiﬁcant overhead, even
when no kernel is enqueued. The overheads stem from
the kernel call in the block syntax
In this section we evaluate device-side enqueuing on the subset of the Rodinia-SVM benchmarks that showed
the highest overheads when device-side enqueuing was applied. We execute the benchmarks in two versions:
First, we execute the kernels on the CPU only without applying any co-scheduling method. Then, we execute the
benchmarks again with the device-side co-scheduling method of Section 3.6.2 applied (still CPU only). However, we
immediately launch all work items (the total global_size) when the kernels are launched from the host-side API.
Effectively, co-scheduling is never actually performed: the if statement in line 5 of Fig. 3.12 always evaluates to
'false', so the enqueuing code of the postamble (lines 6 and 7) is never executed.
Figure 3.14 shows the execution time increase of the device-side enqueuing method relative to execution without
any co-scheduling method applied. Note that even though the co-scheduling code is not executed, the execution
times increase signiﬁcantly, up to almost 6× for sc. The overheads disappear, as soon as we remove the kernel
call from the block syntax in line 7 of Fig. 3.12 (e.g., by replacing kernel(. . .) with a printf). This means
that runtime processing of the block syntax (capturing the environment) is performed even when that part of the
code is not executed, and that it introduces high overheads, which render device-side enqueuing unsuitable for
implementing co-scheduling methods. These results may be surprising, but they are in line with results published by
Intel, where a naive port of an iterative implementation of the Sierpiński carpet to a recursive implementation using
device-side enqueuing resulted in a 186× execution time increase (2050ms instead of 11ms) [55]. Due to this cost, we
exclude device-side enqueuing from further experiments.
3.7.2 Co-Scheduling Results of Rodinia-SVM
Figure 3.15 shows evaluation results for the co-scheduling approaches atomic counting and host-side proﬁling, and
execution on CPU-only as well as GPU-only. The results are shown as speedups over the optimal per-kernel choice
10 Kernel compilation can be avoided using clCreateProgramWithBinary(. . .)
of whether to execute the kernel either on CPU or GPU (clairvoyant xor-Oracle, see Section 3.3 for a discussion
of how it compares to a program-fixed ratio as determined by state-of-the-art approaches designed for fused CPU-GPU
architectures without shared LLC). All speedups are relative to xor-Oracle (100%) and given in percent (of the
relative performance achieved). The geometric mean (gmean) shows that on average execution on GPU-only
performs worst (67.5%), mainly because two of the benchmarks (mc and prtf) perform very badly when their
kernels are executed only on the GPU (they contain long-running loops). With 77.6% of xor-Oracle's performance
on average, execution on CPU-only performs better than GPU-only or atomic counting. Conversely,
xor-Oracle on average achieves a 1.29× and 1.48× speedup over CPU-only and GPU-only, respectively,
by using the most suitable compute device for each kernel.
When using both compute devices in parallel using the co-scheduling methods, one would expect to achieve a
considerable speedup over the xor-Oracle that only uses one compute device at a time. As our results show,
however, this is rarely the case (which we will analyze further in the following section). At best, atomic counting
achieves 110.4% of xor-Oracle’s performance (hw). On average it achieves 74.8% and thus performs better than
GPU-only, but worse than CPU-only. One problem of atomic counting is that some kernels perform very badly on
a particular device. Even when only a few work groups are launched initially, their execution times dominate the
kernel's overall execution time (e.g., in mc and prtf). Additionally, atomic counting adds logic, and thus overhead,
to the kernels themselves.
Host-side proﬁling, on average, achieves 96.8% of xor-Oracle’s performance and a speedup of 1.43× and 1.25×
over GPU-only and CPU-only, respectively. It also performs considerably better on average than atomic counting
(1.29× speedup), mainly because it only adds overheads to the very ﬁrst kernel execution (when proﬁling) and does
not add any code to the kernels. The overhead of proﬁling is especially evident in md that only executes a single
kernel once, where host-side profiling performs worst over all benchmarks (64.9% of xor-Oracle). At maximum,
host-side profiling achieves 122.5% of xor-Oracle's performance in hw, and only one other benchmark (lc)
shows another considerable performance benefit over xor-Oracle (116.4%). Note that a host-side profiling
implementation that tries to select the best device instead of distributing the work would incur similar overheads
without any resulting speedups over xor-Oracle.
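The average speedups quoted for host-side profiling follow directly from the geometric means relative to xor-Oracle (96.8% for host-side profiling, 67.5% for GPU-only, 77.6% for CPU-only and 74.8% for atomic counting). As a quick check of the arithmetic:

```latex
\[
\frac{0.968}{0.675} \approx 1.43 \ \text{(over GPU-only)}, \qquad
\frac{0.968}{0.776} \approx 1.25 \ \text{(over CPU-only)}, \qquad
\frac{0.968}{0.748} \approx 1.29 \ \text{(over atomic counting)}
\]
```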
In summary, host-side profiling performs best of all methods and is on average competitive with the clairvoyant
and thus hypothetical xor-Oracle. However, in most benchmarks it does not benefit from executing kernels on CPU
and GPU in parallel compared to executing only on the most suitable single compute device.
3.7.3 Cache Performance Bottleneck
Figure 3.16: Cache performance metrics (all levels, measured on CPU)
when executing kernels in parallel on CPU and GPU
relative to executing the same work item distribution
sequentially (ﬁrst on CPU, then on GPU; = 1 on y-axis)
To analyze why executing kernels on both compute de-
vices in parallel on ﬁne-grained SVM does on average
not provide a considerable performance beneﬁt over
executing the kernels on the most-suitable device only,
we measured cache metrics using CPU-internal hardware performance counters. A subset of the Rodinia-SVM benchmarks
was selected, for which host-side profiling distributed work items to both CPU and GPU for all kernels
(∀k : 0 < rk < 1). These benchmarks potentially benefit most from utilizing both devices in parallel. Furthermore,
the selected benchmarks synchronize with the host after each kernel execution (the same as in their original
versions), which allows us to measure the performance counters for the kernel executions only. We use the ratios rk
from the previous section for all kernels k, without performing the profiling step of the host-side profiling method.
First, all benchmarks are executed while using the devices sequentially, i.e., for each kernel we execute the work
items assigned to the CPU ﬁrst, synchronize with the host, and then execute the work items assigned to the GPU.
For this device-sequential execution, the total cache misses and cache stalls (all levels) that are encountered by the
CPU are measured¹¹. Then, all benchmarks are executed while using both devices in parallel (as in the previous
section) and the same measurements are performed. In both measurements the CPU (and GPU) performs the same
amount of work, but in the device-sequential case the CPU has more idle time.
Fig. 3.16 shows the measured cache metrics from the device-parallel execution relative to the device-sequential
execution (= 1 on y-axis). For hys, km and sc, the cache misses do not increase (hys even beneﬁts from device-
parallel execution). This means that there are no cache conﬂicts like false or true sharing that impair the perfor-
mance. However, the cache-related stalls increase considerably by up to 1.75× and 1.64× on average. A similar
effect has previously been observed under simulation for cache-coherent fused architectures without a shared LLC
[81]. The authors demonstrated that the number of data probes sent by the highly parallel GPU to the shared cache
directory occupied the directory bandwidth, which considerably reduced the memory bandwidth that can be
sustained by the cache hierarchy. Our results demonstrate the existence of a similar cache coherency bottleneck
when fine-grained SVM is used on the Intel fused CPU-GPU architecture, even when CPU and GPU share an inclusive
LLC. Further research is required to analyze and resolve this bottleneck (in software or hardware) to fully
beneﬁt from co-processing on fused CPU-GPU architectures.
For hw and lc, a similar increase in cache-related stalls cannot be observed. These results are in line with the
speedup results shown in Fig. 3.15: hw and lc (group 1) beneﬁt considerably from co-scheduling over the xor-
Oracle, while hys, km and sc (group 2) do not. The main difference between these two groups of benchmarks is
that the kernels of group 1 are considerably longer (> 100 lines of code on average) than the kernels of group 2
(< 30 lines of code on average). Therefore, the kernels of group 1 perform considerably more operations per work
item than the kernels of group 2.
3.8 Conclusion and Implications for Predictable Execution
This work presented the ﬁrst investigation of collaborative execution of computational kernels on a fused CPU-
GPU architecture with a shared LLC using ﬁne-grained SVM. We contributed two novel device-side co-scheduling
methods that perform scheduling within the kernel code. It was shown that device-side enqueuing introduces
considerable overhead stemming from the evaluation of the block syntax that is used in device-side enqueuing of
kernels (up to 6× execution time increase), too much to be suitable for implementing co-scheduling methods.
Our host-side co-scheduling method achieved 96.8% of the clairvoyant and thus hypothetical xor-Oracle’s perfor-
mance on average (optimal per-kernel choice of exclusive CPU or GPU usage) and a speedup of 1.43× and 1.25×
over execution on GPU only and CPU only, respectively. It also provided a 1.29× speedup over 'atomic counting',
the best device-side co-scheduling method, because it does not add overhead to kernel execution once profiling is
done. Among the methods evaluated in this chapter, host-side profiling is therefore the most competitive practical
scheme. We further showed that cache coherency is a major performance bottleneck in current fused CPU-GPU
architectures with a shared LLC. It was shown that when CPU and GPU execute kernels in parallel on an Intel
architecture, cache-related stalls observed on the CPU can increase by up to 1.75× while cache misses remain the
same compared to executing the same work first on the CPU and only then on the GPU (while the CPU is idle).
However, some benchmarks beneﬁted considerably from collaborative execution on CPU and GPU (up to 1.23×
speedup) compared to using the most suitable device. It depends on the memory access patterns of the kernels
whether cache coherency becomes a performance bottleneck or not. In future work, it will be crucial to categorize
the memory access patterns of kernels and design optimizations to alleviate this performance bottleneck for even
more effective co-scheduling of kernels on fused CPU-GPU architectures. It becomes evident that the trend of
processor integration in high-performance architectures is a double-edged sword: it can eliminate data transfers to
private memories of heterogeneous compute devices and enable co-computation of kernels by, e.g., CPU and
GPU, resulting in high performance within a limited power and area budget (which is crucial, e.g., for embedded
11 There are no publicly documented interfaces to access Intel GPU performance counters when not using OpenGL
systems). At the same time, the potential for resource conﬂicts (and the complexity thereof) increases. While
these conﬂicts can most certainly be resolved for average-case performance, it will be more challenging for future
research to resolve them for predictable performance. The presented cache coherency bottleneck adds a shared last
level cache between CPU and GPU to the growing list of microarchitectural features that can beneﬁt average-case
performance, but lead to resource conﬂicts of such a complexity that they are virtually infeasible to analyze for
execution time guarantees.
This chapter presented novel co-scheduling approaches for fused CPU-GPU architectures in a case study on how
performance is achieved in an off-the-shelf platform. It provided further evidence that high-performance architec-
tures, which were designed for average-case performance, are not suitable for hard real-time systems that require
execution time guarantees. Thus, the following chapters take a different approach to obtain predictable performance
and build on a system that is already amenable to WCET analysis. As motivated in Chapter 1, such a
system lags years behind current platforms like the one discussed in this chapter in terms of its architectural
design. The focus of the following chapters will therefore be to achieve high performance and WCET guarantees by
introducing runtime-reconfigurable accelerators.

4 Runtime Reconﬁguration under WCET Guarantees
The target of this¹ and the following chapters is to achieve timing-analyzable performance by employing hardware
accelerators that speed up the tasks' most compute-intensive parts, so-called computational kernels (also known as
hotspots), which consist of one or more nested loops. If these accelerators were implemented as application-specific
integrated circuits, the system would lack flexibility with respect to revised standards or new algorithms.
Instead, using a runtime-reconfigurable architecture (which employs an FPGA, see Section 2.3) maintains high
flexibility and even allows for reconfiguring the accelerators at runtime, thereby increasing the performance as
well as the computing efficiency (compared to a static set of accelerators) at the cost of a more complex timing
analysis. The aim of this chapter is to enable guaranteed reconfiguration delays for configuring accelerators onto
the reconfigurable area (which were previously unavailable). The following chapters will build on the guaranteed
reconfiguration delays to achieve guaranteed WCETs of tasks that employ runtime reconfiguration of accelerators.
Existing work on runtime reconﬁguration in the context of real-time systems implicitly assumes that the process
of reconﬁguration itself complies with timing guarantees [16, 27, 29, 36, 93], e.g., the time it takes to conﬁg-
ure a hardware accelerator on the reconﬁgurable fabric (reconﬁguration delay) is assumed constant and free from
conﬂicts with other system components that could affect WCET guarantees. The realization of a runtime recon-
ﬁguration controller that fulﬁlls these assumptions and that is amenable to WCET guarantees is so far unavailable.
However, guaranteed reconﬁguration delays are crucial to realize runtime-reconﬁgurable real-time systems.
The novel contributions of this chapter are as follows:
• A runtime reconﬁguration controller called “Command-based Reconﬁguration Queue” (CoRQ) that provides
guaranteed latencies for its operations and supports timing analysis for WCET guarantees. It was released as an
open-source project, including examples and benchmarks².
• We show that conﬂicts while accessing a shared main memory during reconﬁguration can lead to a slowdown
of more than 21× in reconﬁguration bandwidth. In contrast, CoRQ guarantees constant reconﬁguration delays
even under heavy system bus load.
4.1 Challenges for a Guaranteed Reconﬁguration Delay
A straightforward approach to improving WCET guarantees of a kernel using runtime reconfiguration under the
constraints of timing-analyzability and reasonable implementation effort is the stalling approach (which will be
detailed in Chapter 5). Software-only execution, i.e., without any accelerators, is shown in Fig. 4.1 (top). As
shown in Fig. 4.1 (middle), a task that reconﬁgures an accelerator using stalling, stalls its execution for the whole
reconﬁguration delay. At most one reconﬁguration can be performed by the reconﬁguration port at any time.
Once all reconfigurations have completed (at the end of the reconfiguration delay, Fig. 4.1 (a)), the task resumes
execution in software and executes the reconfigured hardware accelerators in every iteration of the kernel. A task
that requests reconﬁguration of accelerators using stalling can be analyzed for WCET guarantees using established
timing analysis techniques by adding the reconﬁguration delay (see Fig. 4.1 (a)) to the WCET of the basic block that
requests the reconﬁguration. The assumption is that the reconﬁguration delay can be determined statically, which
is reasonable for the stalling approach because the task’s memory accesses and reconﬁguration cannot interfere on
1 The work presented in this chapter was originally published in [28]
2 Available at: https://git.scc.kit.edu/CES/corq
[Figure 4.1 shows three timelines over execution time: CPU (SW only) at the top, CPU (Stalling) with the reconfigurable fabric in the middle, and CPU (SW Emul.) with the reconfigurable fabric at the bottom. The markers (a), (b) and (c) are referenced in the text. Legend: iteration i, reconfiguration delay of accelerator A / B, execution of accelerator A / B.]
Figure 4.1: Timelines of executing a Kernel using Software only, Stalling and Software Emulation
main memory or a shared system bus. However, stalling is not state-of-the-art in reconﬁgurable systems, because
the CPU remains idle during reconﬁguration.
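The effect of stalling on a WCET bound can be illustrated with a deliberately simple model in C. It assumes a kernel with a fixed iteration count, a flat per-iteration WCET bound for the software and the accelerated version, and a statically known reconfiguration delay that is paid once. All names and the model itself are assumptions made for illustration; an actual timing analysis operates on basic blocks, as described above.

```c
#include <stdint.h>

/* WCET bound (in cycles) of a kernel that runs in software only. */
uint64_t wcet_sw_only(uint64_t iterations, uint64_t wcet_iter_sw) {
    return iterations * wcet_iter_sw;
}

/* WCET bound with stalling: the reconfiguration delay is added once, afterwards
 * every iteration uses the reconfigured accelerators. */
uint64_t wcet_stalling(uint64_t iterations, uint64_t wcet_iter_hw,
                       uint64_t reconf_delay) {
    return reconf_delay + iterations * wcet_iter_hw;
}

/* Smallest iteration count for which stalling yields a strictly lower bound than
 * software-only execution; returns 0 if the accelerated iteration is not faster. */
uint64_t break_even_iterations(uint64_t wcet_iter_sw, uint64_t wcet_iter_hw,
                               uint64_t reconf_delay) {
    if (wcet_iter_hw >= wcet_iter_sw)
        return 0;
    return reconf_delay / (wcet_iter_sw - wcet_iter_hw) + 1;
}
```

With hypothetical values of 100 cycles per software iteration, 20 cycles per accelerated iteration and a reconfiguration delay of 1000 cycles, stalling pays off from 13 iterations onwards in this model.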
An approach that enables the CPU to perform useful operations in parallel to reconﬁguration is software emu-
lation, i.e., (1) accelerators are conﬁgured as early as possible in the control ﬂow graph (CFG) and execution
proceeds in parallel so that a considerable amount of reconﬁguration delay has already passed at the point in time
when the accelerators are actually needed and (2) in case execution of an accelerator is requested that is not yet
configured, functionally equivalent software is executed (see Fig. 4.1 (bottom)). Software emulation is an
established technique in average-case optimizing reconfigurable systems, because it provides considerable performance
improvements. For real-time systems, however, software emulation poses new challenges:
• As memory transfers can be initiated simultaneously by the reconﬁguration of accelerators and by tasks running
on the CPU (see Fig. 4.1 (b)), it needs to be ensured that assumptions about memory access delays during static
timing analysis of the guaranteed WCET bound (see Section 2.2) capture potential conﬂicts on main memory or
a shared system bus.
• Even if the reconfiguration delay of an accelerator were a statically known constant value, the worst-case
state of the task's execution on the CPU is unclear: how far did the task proceed (in the worst case) during the
reconﬁguration delay? In other words, from what point is it safe to assume during static timing analysis that, e.g.,
Accelerator A is readily conﬁgured on the reconﬁgurable fabric and available to be invoked (see Fig. 4.1 (c), this
question will be the focus of Chapter 5)? Usually, reconﬁguration of multiple accelerators is requested at once
(but conﬁgured sequentially). The information that a speciﬁc accelerator has been reconﬁgured and is available
to speed up execution should be obtainable by the task without interrupts that would complicate timing analysis.
• If program execution is faster or takes a different path than the worst-case path, a reconﬁguration request could
become obsolete because the requested accelerator will not be executed anymore (see Fig. 4.2). In real-time
systems, the possibility of an already occupied reconﬁguration port can lead to delays that are hard to analyze
and therefore introduce pessimism in the resulting WCET bound. Therefore, it is crucial to be able to abort
reconﬁgurations such that one can guarantee for each reconﬁguration request that the reconﬁguration port is
unoccupied.
It might seem that the stalling approach is the favorable way to perform reconﬁguration in real-time systems due
to the potentially complex analysis of software emulation. However, stalling poses similar challenges for timing
analysis when scheduling multiple real-time tasks, even on a uniprocessor system: when a task that requests a
reconﬁguration is stalled, another task could be executed in parallel to the reconﬁguration delay (see Fig. 4.1 (a)
and [16]). Regarding the WCET bound obtained by analyzing either approach, Chapter 5 shows that software
emulation always provides a considerable speedup at runtime, but that there are cases where additional
WCET overestimation compared to stalling diminishes the speedup of the WCET guarantee. Which approach
is beneficial ultimately depends on several parameters of the accelerated kernel (e.g., reconfiguration delay and
speedup of the accelerators employed), and therefore both approaches should be supported by a reconfiguration
controller for real-time systems. In the following section, we introduce our reconfiguration controller CoRQ
and explain how it addresses the challenges that we observed and supports the stalling approach as well as software
emulation in a predictable way.
Table 4.1: CoRQ Commands with Cycles spent in EXE State
Command          Immediate/Queueable   latencyEXE¹ [cycles]
clearQ           Im, Qu                5
stopQ            Im, Qu                0
resumeQ          Im                    0
abortReconf      Im                    5
setBaseAddr      Qu                    1
configBitsExt²   Qu                    -
configBitsInt²   Qu                    6 + B/4
stallCPU         Qu                    1
unstallCPU       Qu                    1
sendGPIO         Qu                    1
sendIRQ          Qu                    1
¹ discussed in Section 4.2.1
² detailed in Section 4.2.2; B: size of bitstream [byte]
4.2 Enabling Runtime Reconﬁguration in Real-Time Systems with CoRQ
Figure 4.2: Control-flow graph that shows how one reconfiguration request can delay a following reconfiguration,
thus impairing timing analysis (graphic: nodes "Reconf. Acc. A", "Invoke Acc. A" and "Reconf. …"; annotation: "On this
path, the reconfiguration request of Acc. A is obsolete. Can it delay the following reconfiguration?")
The focus of our reconfiguration controller Command-based Reconfiguration Queue (CoRQ) is to enable the CPU to
issue sequences of reconfiguration requests, provide guaranteed reconfiguration delays and relieve the CPU from
managing accelerator availability. CoRQ provides commands to inform the CPU of finished reconfigurations in a
predictable way; the CPU
never has to poll or be interrupted to obtain the information that an accelerator has become available (following
a reconﬁguration). CoRQ processes 32-bit commands and can be instantiated with an internal memory to store
bitstreams (conﬁguration data for the reconﬁgurable fabric). Commands are issued by the CPU using load/stores
over the system bus (see Fig. 4.3). They are either executed immediately or enqueued in an internal FIFO queue
(denoted as immediate or queueable commands, respectively, in the following). Table 4.1 shows all 11 currently
supported commands grouped by category (immediate or queueable). The immediate commands are used to
control CoRQ itself (stop/resume processing enqueued commands, clear queue, reset) and abort a running recon-
ﬁguration. Queueable commands relieve the CPU from managing reconﬁgurations, i.e., they conﬁgure bitstreams
(from internal or external memory), provide information about available accelerators through a general-purpose
interface (or send an interrupt to the CPU), and can even stall/unstall the CPU to implement the stalling approach.
In the following we illustrate how stalling and software emulation can be realized with CoRQ.
1 stopQ          (Im: Stop processing commands from queue)
2 stallCPU       (Qu: Stall CPU)
3 setBaseAddr    (Qu: Set main memory base address)
4 configBitsExt  (Qu: Reconfigure from main memory)
5 sendGPIO       (Qu: Reset accelerators)
6 unstallCPU     (Qu: Unstall CPU)
7 resumeQ        (Im: Process enqueued commands)
Listing 4.1: CoRQ commands used to realize the stalling approach
1 clearQ         (Im: Ensure command queue is empty)
2 abortReconf    (Im: Ensure free reconfiguration port)
3 configBitsInt  (Qu: Reconfigure from internal memory)
4 sendGPIO       (Qu: Store info 'Accelerator A available')
5 configBitsInt  (Qu: Reconfigure from internal memory)
6 sendGPIO       (Qu: Store info 'Accelerator B available')
Listing 4.2: CoRQ commands used to realize the software emulation approach
Reconfiguring a single accelerator in the stalling approach (see Section 4.1, Fig. 4.1 upper timeline) can be
performed using the sequence of commands shown in Listing 4.1. First, processing commands from the queue is
stopped (immediately), otherwise the following command (Line 2) would stall the CPU before the unstallCPU
command could be enqueued. All following commands (including stallCPU) are queueable. Assuming that the
main memory is idle while stalling the CPU, one can use it to configure bitstreams even under timing guarantees.
First, the base address of the bitstream is set and then configBitsExt instructs CoRQ to configure a bitstream
relative to this base address (Lines 3 and 4). This way, the whole 32-bit address space can be addressed using
32-bit wide commands. Afterwards, sendGPIO is executed to trigger CoRQ's GPIOs, e.g., to reset the configured
accelerator and ensure it is in a sane state before using it. Once these commands are processed, the CPU is
unstalled (Line 6). Finally, resumeQ (Line 7) is used to start processing the enqueued commands.
A reconﬁguration of two accelerators while utilizing software emulation (executing software in parallel, see bottom
timeline of Fig. 4.1) can be performed using the commands shown in Listing 4.2. In this case, neither is processing
of queued commands stopped nor is the CPU stalled. Therefore, the CPU continues executing software after issuing
the commands to CoRQ. In this example we assume that a previous reconfiguration request could still occupy the
reconfiguration port and obsolete commands could be in the queue (see Fig. 4.2). To be able to guarantee the
reconfiguration delay, it needs to be ensured that no earlier reconfiguration requests are still pending. Therefore, at first
all remaining commands are cleared and the running reconfiguration (if any) is aborted (Lines 1 and 2). Afterwards, a bitstream
from internal memory is conﬁgured. This way, loading the bitstream does not conﬂict with memory accesses from
the CPU to main memory. Once reconﬁguration completes, sendGPIO is executed (Line 4) to notify the CPU that
the ﬁrst accelerator has become available. This can be done by writing into a memory-mapped register that the
CPU can read or by writing to a lookup table that is automatically queried before executing an accelerator. This
enables the CPU to use each conﬁgured accelerator immediately once it is conﬁgured (see Fig. 4.1 (bottom)), with-
out waiting for the whole set of commands to have ﬁnished processing by CoRQ. Afterwards, a second accelerator
is conﬁgured (Lines 5 and 6).
These two examples illustrate that stalling as well as software emulation can be realized by using CoRQ with
simple sequences of commands issued by the CPU.
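The availability check behind software emulation can be sketched in a few lines of Python. This is only an illustrative model: the class AvailabilityRegister, the function run_ci and the two sum_abs functions are hypothetical names chosen for this sketch and do not describe the actual CoRQ interface. The register stands in for the information that a queued sendGPIO command stores once a reconfiguration has finished.

```python
# Illustrative model only: AvailabilityRegister and run_ci are hypothetical
# names for this sketch and are not part of CoRQ or the thesis.

class AvailabilityRegister:
    """One bit per accelerator; a bit is set once that accelerator is configured."""

    def __init__(self):
        self.bits = 0

    def mark_available(self, acc_id):
        # Models the effect of a queued sendGPIO after a finished reconfiguration.
        self.bits |= 1 << acc_id

    def is_available(self, acc_id):
        return bool(self.bits & (1 << acc_id))


def run_ci(reg, acc_id, hw_impl, sw_impl, arg):
    """Use the accelerator if it is configured, else functionally equivalent software."""
    if reg.is_available(acc_id):
        return "hw", hw_impl(arg)
    return "sw", sw_impl(arg)


ACC_A = 0
reg = AvailabilityRegister()


def sum_abs_sw(values):
    # Software fallback.
    return sum(abs(v) for v in values)


def sum_abs_hw(values):
    # Stands in for the accelerator; must return the same result as the fallback.
    return sum(map(abs, values))


before = run_ci(reg, ACC_A, sum_abs_hw, sum_abs_sw, [3, -4, 5])
reg.mark_available(ACC_A)  # reconfiguration of Accelerator A has finished
after = run_ci(reg, ACC_A, sum_abs_hw, sum_abs_sw, [3, -4, 5])
print(before, after)
```

Both calls return the same value; only the execution path differs, which is the property that lets the CPU keep making progress while the reconfiguration is still running.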
Figure 4.3: Example of how CoRQ is attached to a System on Chip to enable runtime reconfiguration under timing
guarantees (components shown: CPU (LEON3), Main Memory, System Bus, CoRQ with Command-based Interface, Internal
Mem. and Stall signal, ICAP, Reconfigurable Fabric; legend: Base System, Contribution)
Figure 4.4: High-level view of how CoRQ processes commands (shown: Command Input, decision "Immediate?" (yes/no),
Queue, FSM with states FE, DEC and EXE, outputs Stall CPU, GPIO and IRQ)
4.2.1 Command Execution
CoRQ processes commands using a finite state machine (FSM) consisting of three states: Fetch from queue (FE),
Decode (DEC) and Execute (EXE) (see Fig. 4.4). Fetching a command takes a single cycle, the DEC state takes
two cycles and the latency of EXE depends on the command (see Table 4.1). Immediate commands control CoRQ
itself, and thus have priority over commands from the queue. After being identiﬁed as immediate (which takes
one cycle), these commands reset the FSM (potentially aborting a queueable command in EXE) and directly enter
DEC. In sum, executing either an immediate or a queueable command takes 3+ latencyEXE cycles.
Enqueueing a command takes 2 cycles for identifying it as queueable and writing it to the queue. Commands can
simultaneously be enqueued to and fetched from the queue. The realization of this simultaneous access (with a
double-ported FIFO) incurs an additional delay of 2 cycles for commands to become visible to the FSM if the
FIFO was empty.
4.2.2 Guaranteed Reconﬁguration Delay
Bitstreams can be loaded from arbitrary addresses; however, accessing the system bus and a shared
main memory (especially DDR) can incur memory access delays that are hard to bound for WCET guarantees.
Therefore, guaranteeing reconfiguration delays when using CoRQ-external memory (configBitsExt) is
outside the scope of this thesis. Reconfiguration delays are guaranteed when the CoRQ-internal memory is used
(configBitsInt). The CoRQ-internal memory is implemented using SRAM (so-called Block RAMs on Xilinx
FPGAs). This way, the configBitsInt command can feed one word of the bitstream in each cycle to the
reconfiguration port and utilize its full bandwidth (see Section 4.3). Additionally, the configBitsInt command
requires 5 setup cycles and a single cycle at completion. Thus, latencyEXE = 6 + B/4 cycles, with B being the
size of the bitstream in bytes (see Table 4.1). Including the latency of FE and DEC, configuring a single bitstream
from CoRQ-internal memory (configBitsInt) is guaranteed to take exactly 9 + B/4 cycles.
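As a worked instance of this bound, take the bitstream size B = 57,248 bytes used in Section 4.3. The conversion to time assumes that CoRQ is clocked at the 100 MHz of the evaluated system (10 ns per cycle), which is an assumption of this example:

```latex
\begin{align*}
t_{\mathrm{configBitsInt}} &= 3 + \bigl(6 + B/4\bigr) = 9 + \frac{57{,}248}{4} = 9 + 14{,}312 = 14{,}321 \text{ cycles} \\
14{,}321 \text{ cycles} \cdot 10\,\mathrm{ns} &= 143.21\,\mu\mathrm{s}
\end{align*}
```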
4.2.3 Analyzing Sequences of Commands
In the examples of Section 4.2, the latency of command sequences is simply the sum of the latencies of the
queueable commands: configuring a single bitstream from main memory using stalling (see Listing 4.1 and
Fig. 4.1 (middle)) results in a latency of
$t_{\mathrm{stallCPU}} + t_{\mathrm{setBaseAddr}} + t_{\mathrm{configBitsExt}} + t_{\mathrm{sendGPIO}} + t_{\mathrm{unstallCPU}} + t_{\mathrm{resumeQ}} = 4 + 4 + t_{\mathrm{configBitsExt}} + 4 + 4 + 3 = 19 + t_{\mathrm{configBitsExt}}$
cycles. This is the latency after resumeQ reaches CoRQ. At this
point the immediate command stopQ was already executed (taking 3 cycles) in parallel to ﬁlling the queue with
commands, and therefore it does not add to this latency. As mentioned before, we do not provide guarantees for
configBitsExt.
Configuring two bitstreams using software emulation (see Listing 4.2 and Fig. 4.1 (bottom)) results in a latency of
$t_{\mathrm{clearQ}} + t_{\mathrm{abortReconf}} + t_{\mathrm{configBitsInt}} + t_{\mathrm{sendGPIO}} + t_{\mathrm{configBitsInt}} + t_{\mathrm{sendGPIO}} = 8 + 8 + (9 + B_1/4) + 4 + (9 + B_2/4) + 4 = 42 + B_1/4 + B_2/4$
cycles. This latency starts once the immediate command clearQ reaches CoRQ
and is running in parallel to the CPU that sends the commands following clearQ to CoRQ. Executing previous
commands always takes at least as long as the delay for enqueueing the current command, therefore enqueueing
the commands does not add to the delay. If it can be guaranteed that there are no pending reconfigurations, clearQ
and abortReconf can be omitted. In this case, enqueueing the first command to the empty queue would incur an
additional latency of 4 cycles, resulting in a total delay of $30 + B_1/4 + B_2/4$ cycles (see Section 4.2.1).
Table 4.2: Resource Utilization
                                  LUTs      FlipFlops   BRAM
LEON3 CPU (standard config.)      8,144     3,450       14
CoRQ                              398       546         1
Internal Mem. of CoRQ (384 KB)    233       6           96
Available on VC707                303,600   607,200     1,030
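The sums derived in this section can be checked mechanically. The following Python sketch encodes the EXE latencies of Table 4.1 and the 3 cycles for FE and DEC, and recomputes the three results above. The function and variable names are hypothetical, configBitsExt is left out because no guarantee is given for it, and the requirement that B is a multiple of 4 is an assumption of the sketch.

```python
FE_DEC_CYCLES = 3               # FE (1 cycle) + DEC (2 cycles), Section 4.2.1
EMPTY_QUEUE_ENQUEUE_CYCLES = 4  # 2 cycles enqueue + 2 cycles until visible to the FSM

# EXE latencies from Table 4.1 (configBitsExt omitted: no guarantee is given).
EXE_CYCLES = {
    "clearQ": 5, "stopQ": 0, "resumeQ": 0, "abortReconf": 5,
    "setBaseAddr": 1, "stallCPU": 1, "unstallCPU": 1,
    "sendGPIO": 1, "sendIRQ": 1,
}


def command_latency(name, bitstream_bytes=None):
    """Cycles for one command: 3 + latencyEXE."""
    if name == "configBitsInt":
        # Assumption of this sketch: B is a whole number of 32-bit words.
        if bitstream_bytes is None or bitstream_bytes % 4 != 0:
            raise ValueError("configBitsInt needs a bitstream size that is a multiple of 4")
        return FE_DEC_CYCLES + 6 + bitstream_bytes // 4
    return FE_DEC_CYCLES + EXE_CYCLES[name]


def sequence_latency(commands):
    """Sum of the per-command latencies of a list of (name, bitstream_bytes) pairs."""
    return sum(command_latency(name, size) for name, size in commands)


B1 = B2 = 57_248  # bitstream size used in Section 4.3

# Listing 4.1 without stopQ (runs in parallel) and without configBitsExt.
stalling_fixed = [("stallCPU", None), ("setBaseAddr", None), ("sendGPIO", None),
                  ("unstallCPU", None), ("resumeQ", None)]

# Listing 4.2.
emulation = [("clearQ", None), ("abortReconf", None),
             ("configBitsInt", B1), ("sendGPIO", None),
             ("configBitsInt", B2), ("sendGPIO", None)]

stalling_cycles = sequence_latency(stalling_fixed)
emulation_cycles = sequence_latency(emulation)
# Without clearQ and abortReconf, the first command hits an empty queue.
lean_cycles = sequence_latency(emulation[2:]) + EMPTY_QUEUE_ENQUEUE_CYCLES

print(stalling_cycles, emulation_cycles, lean_cycles)
```

With B1 = B2 = 57,248 bytes, the two bitstreams contribute 2 · 14,312 = 28,624 cycles, so the printed values correspond to 19, 42 + 28,624 and 30 + 28,624 cycles.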
4.3 Experimental Evaluation
We implemented a synthesis flow for partial reconfiguration and evaluated CoRQ based on a Gaisler LEON3 design
(GRLIB GPL 1.4.1, also see Fig. 4.3) targeting the Xilinx VC707 board (Virtex-7 FPGA)³. We used a LEON3
design provided by Gaisler that instantiates a single LEON3, uses the DDR3 on the VC707 as main memory and
runs at 100 MHz. CoRQ was added to the AHB system bus and a signal was connected to the LEON3 to enable
stalling; no further changes were made to the SoC. For evaluation purposes, we simply reconfigure patterns of
flashing the VC707's LEDs. The resource utilization is shown in Table 4.2.
The reconfiguration port (called Internal Configuration Access Port (ICAP) in Xilinx devices) can process 4 bytes
per cycle at a maximum of 100 MHz on the VC707. Therefore, the theoretical maximum reconfiguration bandwidth
is 381.47 MiB/s⁴. We reconfigure 25 partial bitstreams of B = 57,248 bytes each, which together takes a minimum
of 357,800 cycles when assuming the theoretical maximum reconfiguration bandwidth without overheads.
Using CoRQ, these reconfigurations take 358,036 cycles⁵, which corresponds to a reconfiguration bandwidth of
381.22 MiB/s. This means that CoRQ is only 0.066 % (or 236 cycles) slower than the theoretical maximum.
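The figures in this paragraph follow directly from the bitstream size, the ICAP width and the clock frequency. The short Python sketch below recomputes them; the cycle count for CoRQ is taken term by term from footnote 5, and all names are chosen for this example only.

```python
ICAP_BYTES_PER_CYCLE = 4
CLOCK_HZ = 100_000_000
MIB = 1024 ** 2

NUM_BITSTREAMS = 25
BITSTREAM_BYTES = 57_248


def bandwidth_mib_per_s(total_bytes, cycles):
    """Bandwidth in MiB/s for moving total_bytes in the given number of cycles."""
    seconds = cycles / CLOCK_HZ
    return total_bytes / seconds / MIB


total_bytes = NUM_BITSTREAMS * BITSTREAM_BYTES

# Theoretical minimum: one 4-byte word per cycle, no overheads.
ideal_cycles = total_bytes // ICAP_BYTES_PER_CYCLE

# Sum of command latencies as given in footnote 5.
corq_cycles = 4 + NUM_BITSTREAMS * (9 + BITSTREAM_BYTES // 4) + 4 + 3

ideal_bw = bandwidth_mib_per_s(total_bytes, ideal_cycles)
corq_bw = bandwidth_mib_per_s(total_bytes, corq_cycles)
overhead_cycles = corq_cycles - ideal_cycles
overhead_percent = 100 * overhead_cycles / ideal_cycles

print(ideal_cycles, corq_cycles, overhead_cycles)
print(round(ideal_bw, 2), round(corq_bw, 2), round(overhead_percent, 3))
```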
In the following we evaluate the impact of system bus conﬂicts on the reconﬁguration bandwidth. Figure 4.5
shows the reconﬁguration bandwidth results as measured by the CPU. Note that measuring itself adds an overhead,
therefore, the measured reconﬁguration bandwidth is always lower than the analytical bandwidth of CoRQ. The
results were obtained for reconﬁgurations using the CoRQ-internal memory (Int. Mem.), as well as main memory
over the shared AHB system bus (Main. Mem.). ‘Stalling’ leaves the CPU idle during reconﬁguration, whereas
‘Polling’ means that the CPU repeatedly reads CoRQ’s status register to check whether reconﬁguration has com-
pleted (producing trafﬁc on the AHB). ‘Bus Load’ uses a simple DMA unit that repeatedly initiates maximum
length (256 words) AHB burst transactions to provoke system bus and main memory conﬂicts during reconﬁgura-
tion. The small variance in measurements when using CoRQ-internal memory (< 1%) stems from the overhead of
measuring. The CPU’s bus accesses (e.g., for fetching instructions and reading the timers) conﬂict with the DMA.
CoRQ's commands themselves always have exactly the same latency when using internal memory.
When reconﬁguring over main memory, accesses from CPU, the DMA and CoRQ are in conﬂict. We can observe
a strong variance in reconﬁguration bandwidth between the measurements. The measurement under DMA bus
load reports only 4.69% of the Stalling bandwidth. This shows that reconﬁguration controller design is crucial
in runtime-reconﬁgurable real-time systems. Simply utilizing a shared memory for reconﬁguration can lead to a
slowdown of more than 21× in reconﬁguration bandwidth.
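The two ratios can be reproduced from the main-memory bars of Figure 4.5, reading 74.02 MiB/s as the Stalling bandwidth and 3.47 MiB/s as the bandwidth under DMA bus load. This assignment of values to bars is an interpretation of the figure, chosen because it reproduces the stated 4.69 %:

```latex
\begin{align*}
\frac{3.47\ \mathrm{MiB/s}}{74.02\ \mathrm{MiB/s}} &\approx 0.0469 = 4.69\,\% \\
\frac{74.02\ \mathrm{MiB/s}}{3.47\ \mathrm{MiB/s}} &\approx 21.3
\end{align*}
```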
³ Project incl. benchmarks available at: https://git.scc.kit.edu/CES/corq
⁴ More precisely: (4 · 1024^{-2}) / 10^{-8} = 381.4697265625 MiB/s (= 400 MB/s)
⁵ Sum of latencies of the individual commands (see Section 4.2): 4 + 25 · (9 + 57,248/4) + 4 + 3 cycles
⁶ Average of 50 measures, maximum error < 1%
Figure 4.5: Reconfiguration bandwidth measured by the CPU, revealing a high variance when using main memory
Bandwidth [MiB/s]⁶   Stalling   Polling   Bus Load
Int. Mem.            379.51     379.96    376.18
Main Mem.            74.02      71.96     3.47
(annotation on Main Mem. under Bus Load: > 21× slower than Stalling due to conflicts)
4.4 Conclusion
This chapter discussed challenges for timing-analyzable runtime reconﬁguration in systems that require WCET
guarantees. It presented how these challenges can be addressed, and introduced CoRQ: a reconﬁguration controller
for real-time systems that provides guaranteed reconfiguration delays for the stalling and software emulation
approaches. In the work presented in [88], CoRQ formed the basis for the design of a reconfiguration controller that enables
preemptable runtime reconfiguration in Xilinx FPGAs, i.e., reconfigurations can be preempted (keeping reconfiguration
progress) to avoid priority inversion in the presence of multi-priority real-time tasks instead of being aborted
(and losing the progress made so far). It was used to design a Xilinx Zynq-based multi-priority real-time system,
where tasks of different priority levels can request reconfigurations.
The following chapter builds on the reconfiguration delay guarantees provided by CoRQ and introduces an analysis
that enables reconfiguration of accelerators that reduce guaranteed WCET bounds in a runtime-reconfigurable processor.
5 WCET Analysis of Tasks on Runtime-Reconfigurable Processors
To escape the scarcity of timing analyzable performance features that was motivated in Chapter 1, this chapter¹
introduces timing analysis of tasks on runtime-reconﬁgurable processor designs in which the core instruction
set architecture (cISA) of the processor core is extended by custom instructions (CIs). These CIs initiate the
execution of accelerators on the reconﬁgurable fabric. Figure 5.1 shows a system with a reconﬁgurable instruction
set processor (generalized from [12, 47, 102] and Section 2.4). It consists of an in-order RISC CPU with a
reconﬁgurable fabric, scratchpad memory (SPM) and separate data and instruction caches (D$ and I$). With
commercial platforms like the Xilinx Zynq SoC, which couples an ARM Cortex A9 multi-core with a Xilinx
reconﬁgurable fabric on a single chip, reconﬁgurable processors have become off-the-shelf devices. In contrast to
these, we speciﬁcally choose an in-order pipeline to be able to obtain precise execution time bounds using state-
of-the-art static timing analysis. Predictable performance is achieved with our novel models for analyzing tasks
that utilize reconﬁgurable CIs. These processor designs enable speedups even for applications that contain kernels
with only short execution times in the range of 10–100 cycles when running on the fabric. CIs were detailed in
the context of the reconfigurable processor i-Core in Section 2.4; their most important properties for the context
of this chapter are summarized as follows: CIs are speciﬁed by (multi-cycle) datapaths, which are implemented as
conﬁgurations on the reconﬁgurable fabric. A conﬁgured CI takes a certain area share of the fabric. The fabric can
accommodate several CIs at once, constrained by its total area (see [92] for an overview of area models). The time
required for reconﬁguring a CI at runtime depends on the size of the conﬁguration and is called reconﬁguration
delay (it can take several milliseconds); a CI ready for execution is referred to as available. A CI which is not
yet readily conﬁgured or was replaced by another CI is unavailable. In contrast to cISA instructions, a CI can be
unavailable when it is due for execution. As introduced in the previous chapter, two common approaches exist
to deal with this problem in reconﬁgurable processors, which optimize the average case on a best effort basis
(see Fig. 4.1): stalling [48, 108], i.e., halting execution until the pending reconﬁguration ﬁnishes, and software
emulation [11, 26], i.e., branching to CI-equivalent software which can be executed on the cISA (see Fig. 5.2).
¹ The work presented in this chapter was originally published in [29]
Figure 5.1: System on Chip with a reconfigurable processor (components shown: CPU with Pipeline, Reconfigurable
Fabric, SPM, D$, I$, Main Memory; legend: SPM: Scratchpad Memory, D$: Data Cache, I$: Instruction Cache)
Figure 5.2: Software Emulation entails testing whether a specific reconfigurable CI is currently configured
(available). (flowchart: "CI Available on Reconfigurable Fabric?"; No: Execute Equivalent Software Code (cISA);
Yes: Initiate Execution on Reconfigurable Fabric, annotated "Benefits WCET")
The main contribution of this chapter is a timing analysis for environments in which faster paths (e.g., containing
hardware-accelerated CIs) through a kernel body become successively available during execution of the kernel
(e.g., software emulation is utilized for unavailable CIs and reconfigurations are performed in parallel). A
reconfigurable processor design amenable to this timing analysis is presented, and it is shown how the resulting information
can be utilized to obtain precise WCET bounds of tasks.
The novel contributions of this chapter are as follows:
• Timing analysis of tasks on a reconﬁgurable instruction set with support for multiple execution contexts and
comparison of measures to deal with reconﬁguration delays: stalling the core pipeline or running equivalent
software code until conﬁgurations ﬁnish (software emulation). To our knowledge, this is the ﬁrst time that
runtime reconﬁguration is supported in an analysis for WCET guarantees.
• Identiﬁcation and analysis of a timing anomaly of runtime reconﬁguration, i.e., a situation where executing
iterations of a kernel faster than worst-case time during reconﬁguration can extend the execution time of the
whole program. The timing anomaly is safely bounded during timing analysis.
• Description of key requirements to design reconﬁgurable instruction set processors that support timing guaran-
tees.
We argue that a reconﬁgurable instruction set gives application designers more control over the microarchitecture
than a fixed instruction set, which benefits timing analysis. Our evaluation results show that with precise analysis
of a proven reconfigurable processor design, runtime instruction set reconfiguration can be an enabling feature to
provide timing-analyzable performance.
5.1 Related Work
A significant amount of work on reconfigurable instruction set processors has been performed. The demonstrated
beneﬁts are code size reduction, lower power consumption and increased average-case performance [42]. Current
research in reconﬁgurable instruction set processors is moving towards heterogeneous reconﬁgurable multi-core
architectures [24, 45, 50]. However, worst-case timing analysis of parallel tasks is still new ground even on
general-purpose multi-core architectures [6, 90].
5.1.1 WCET-Optimizing Instruction Set Architectures
Little work on instruction set adaptation has been performed with respect to WCET. In [112] an instruction set
selection for WCET optimization on an ASIP is performed. This approach performs WCET estimation using
timing schema [76]. Timing schema uses a syntax tree representation of the program under analysis. Inner nodes
represent the control ﬂow and leaf nodes are the basic blocks of the program. The WCET is estimated in a bottom-
up fashion with simple recursive rules. The proposed instruction set selection targets instruction set extension for
applications known at design time. Reconﬁguration is not considered in this approach.
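The bottom-up estimation on a syntax tree can be illustrated with a minimal Python sketch. The recursive rules used here (sum for sequences, condition plus the more expensive branch for conditionals, bound-scaled body plus one extra condition evaluation for loops) are the commonly cited form of timing schema and are an assumption of this example; the exact rules of [76] and [112] may differ. All node types and cycle values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Leaf node: a basic block with a known cycle cost."""
    cycles: int


@dataclass(frozen=True)
class Seq:
    """Inner node: parts executed one after another."""
    parts: tuple


@dataclass(frozen=True)
class If:
    """Inner node: condition followed by one of two branches."""
    cond: object
    then: object
    orelse: object


@dataclass(frozen=True)
class Loop:
    """Inner node: loop with a known upper bound on its iterations."""
    cond: object
    body: object
    bound: int


def wcet(node):
    """Estimate the WCET of a syntax tree bottom-up."""
    if isinstance(node, Block):
        return node.cycles
    if isinstance(node, Seq):
        return sum(wcet(part) for part in node.parts)
    if isinstance(node, If):
        return wcet(node.cond) + max(wcet(node.then), wcet(node.orelse))
    if isinstance(node, Loop):
        # The condition is evaluated once more than the body (final exit test).
        return (node.bound + 1) * wcet(node.cond) + node.bound * wcet(node.body)
    raise TypeError(f"unknown node type: {type(node).__name__}")


program = Seq((
    Block(10),
    Loop(cond=Block(2), body=If(Block(1), Block(20), Block(5)), bound=100),
    Block(3),
))
estimate = wcet(program)
print(estimate)
```

For this tree the estimate is 10 + (101 · 2 + 100 · 21) + 3 cycles, since the conditional always contributes its more expensive branch.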
MCGREP [104] is a two-stage pipelined, micro-programmed processor design without caches. Every instruction
has a constant delay, independent of the execution history. The two ALUs of this processor design can be re-
conﬁgured to perform an application speciﬁc instruction in every cycle using a microcode to improve instruction
throughput. MCGREP’s design allows a straight-forward timing analysis without the requirement of complex
models. However, its evaluation assumes a single-cycle load delay and compares against microprocessors at 40
MHz with their cache switched off. Requiring the absence of caches for timing predictability is too restrictive, as
LRU caches are well understood in timing analysis and allow a predictable processor design with memory hierar-
chy [107]. Additionally, the two-stage pipeline may limit the frequency of the processor. The scalability of this
design in performance-demanding scenarios remains questionable.
In [100] an integration of the real-time processor CarCore [101] and the MOLEN reconﬁgurable custom com-
puting unit [102] is presented. The integration focuses on guaranteeing hard real-time constraints for memory
accesses of processor core and reconﬁgurable hardware. Timing analysis of binaries utilizing reconﬁguration is
not addressed. As a consequence, measured execution times are evaluated while the effects on WCET bounds
remain uninvestigated.
5.1.2 Runtime Reconﬁguration in Hard Real-Time Systems
If supported, runtime reconﬁguration is the responsibility of the task scheduler in state-of-the-art hard-real-time
systems [3, 23, 36, 54, 70], but reconﬁguration within a currently executing task is not considered. Ref. [36] tackles
the problem of scheduling access to the reconﬁguration access port for ﬁxed-priority task sets with hard deadlines.
ReconOS [3] is an operating system that provides a uniﬁed programming model for threads running in software
and threads mapped to reconfigurable hardware. Previous versions were based on the eCos² real-time kernel. Ref.
[23] presents mapping and scheduling of task graphs to a uniprocessor system with reconﬁgurable units, which
minimizes utilization of fabric area. Reconﬁgurations are either performed before, or as special tasks within the
schedule. However, during execution of a task, its assigned reconﬁgurable unit is not reconﬁgured.
In [54] a per-task instruction set selection is performed using reconﬁguration to meet timing constraints. A periodic
task graph with deadlines is scheduled and the schedule is partitioned into conﬁgurations. Each conﬁguration
is assigned an instruction set to optimize the tasks’ WCET. Their approach assumes that the schedule’s WCET
reduction directly corresponds to the per-task cycle reduction due to a CI that is chosen for a subset of the tasks.
But this is only the case when there is no conditional execution of tasks. When having alternative execution paths,
adding a CI into the current WCET path may result in another path becoming the WCET path. In this case, the
gain of the CI on the overall WCET is the delta between the old and the new WCET path, which could be as small
as 1 cycle, independent of the average performance improvement due to the CI. So the overall WCET gain can be
signiﬁcantly less than the gain in the old WCET path. A similar effect was already reported by [94] when assigning
a variable to scratchpad memory for WCET reduction. Due to this effect, it is not possible to apply the inter-task
techniques in [54] on intra-task level.
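The path-switching effect described above can be made concrete with two alternative paths. The path names and cycle counts in this Python sketch are invented for illustration: a CI saves 400 cycles on the current WCET path, but the overall WCET improves by only 1 cycle because the alternative path becomes the new WCET path.

```python
def wcet(path_cycles):
    """WCET over a set of alternative paths is the cost of the longest one."""
    return max(path_cycles.values())


# Hypothetical path costs in cycles.
paths = {"path_with_ci": 1000, "alternative_path": 999}
ci_saving = 400

before = wcet(paths)

# Adding the CI only shortens the path that executes it.
accelerated = dict(paths, path_with_ci=paths["path_with_ci"] - ci_saving)
after = wcet(accelerated)

overall_gain = before - after
print(before, after, overall_gain)
```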
In sum, state-of-the-art techniques either do not consider runtime reconﬁguration [112], introduce reconﬁgurable
architectures without investigating effects on WCET bounds [100], or runtime reconﬁguration is the responsibility
of the real-time scheduler, where reconﬁguration for an already running task is prohibited [3, 23, 36, 54, 70].
5.2 Motivational Example
As discussed in Section 5.1, state-of-the-art techniques only perform reconﬁgurations when switching from one
task to another [3, 23, 36, 54, 70]. Within a single task however, the beneﬁts of having a reconﬁgurable fabric
are not exploited. To show the opportunities that are missed by such limitations, we use an H.264 video encoder
as a motivational example in Fig. 5.3. For each input frame, the encoder goes through a sequence of kernels
with different requirements of CIs to be conﬁgured onto the fabric. When conﬁguring CIs for the whole task
before it executes, the reconﬁgurable area has to be divided among the kernels (Figure 5.3 (a)). Performing
reconﬁguration before each kernel allows every kernel to use the whole fabric area at the cost of reconﬁguration
delay (Figure 5.3 (b)), i.e., the start of the kernel is delayed (stalling) until its reconﬁguration is completed. As the
total reconfiguration delay per task increases, the question arises whether the idle time of the CPU during reconfiguration
(stalling) can be used more effectively. Instead of waiting until the reconfiguration of all CIs for a kernel
ﬁnishes, the kernel can be started immediately using software emulation (see Fig. 5.2 and Fig. 4.1). As soon
as reconﬁguration for a CI ﬁnishes, it is utilized within the kernel and the reconﬁguration delay was effectively
used to make progress (Figure 5.3 (c)). Software emulation leads to considerable runtime beneﬁts at the cost of
implementing equivalent software code for a CI. E.g., the runtime of LoopFilter, the shortest running kernel in
the H.264 video encoder, is reduced by 20% using software emulation compared to stalling on a system running
the reconfigurable fabric at 100 MHz and the core pipeline at 800 MHz, with a reconfiguration bandwidth of 100 MB/s
(see Section 5.6 for a detailed discussion).
² http://ecos.sourceware.org/
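The runtime difference between the two approaches can be illustrated with a deliberately simplified model. It is a sketch under strong assumptions that are not taken from the thesis: a single CI, one kernel iteration at a time, a single clock domain, no cost for the availability check, and an iteration uses the hardware CI only if the reconfiguration has finished when the iteration starts. The cycle values are hypothetical, and the model describes one concrete run, not a WCET bound.

```python
def stalling_cycles(iterations, hw_cycles, reconf_cycles):
    """Wait for the reconfiguration, then run every iteration with the hardware CI."""
    return reconf_cycles + iterations * hw_cycles


def emulation_cycles(iterations, sw_cycles, hw_cycles, reconf_cycles):
    """Start immediately in software; switch to the CI once it is configured."""
    elapsed = 0
    for _ in range(iterations):
        ci_available = elapsed >= reconf_cycles
        elapsed += hw_cycles if ci_available else sw_cycles
    return elapsed


# Hypothetical kernel: 100 iterations, 50 cycles in software, 10 cycles with the CI,
# 2000 cycles of reconfiguration delay.
stall = stalling_cycles(100, 10, 2000)
emulate = emulation_cycles(100, 50, 10, 2000)
print(stall, emulate)
```

In this run the first 40 iterations execute in software while the reconfiguration is in progress, and the remaining 60 use the CI, so part of the reconfiguration delay is hidden behind useful work.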
Figure 5.3: Sequences of kernels, e.g., in the H.264 Encoder, are well-suited for runtime reconfiguration, but raise
new issues in timing analysis (panels, reconfiguration possible: (a) before Task Start, (b) before Kernel, (c) during
Kernel; legend: Motion Estimation (ME), Encoding Engine (EE), Loop Filter (LF), Reconfiguration (R); annotations:
"Scheduler responsible for Reconfiguration", "When are CIs available in the worst case?", "How is WCET influenced?")
Software emulation provides measurable runtime benefits which we make accessible for lower WCET bounds. A
pessimistic timing analysis approach would simply assume an infinite reconfiguration delay and thus assume that
every CI is executed in its cISA implementation for WCET estimation. While this would produce a safe WCET bound and
enable speedups using CIs over cISA at runtime, the guaranteed WCET would be worse than not using reconfiguration
at all. Therefore, precise modeling of reconfigurable CIs is required for timing analysis.
In the following section, we introduce timing analysis fundamentals which target non-reconfigurable processors or
the cISA of a reconfigurable processor, and form the basis for our novel extensions for analysis of reconfigurable
CIs presented in Section 5.4.
5.3 Timing Analysis Background
Timing analysis to derive WCET guarantees of tasks was introduced in Section 2.2. This section revisits global
bound analysis using the Implicit Path Enumeration Technique (IPET) to provide additional details on multi-
context path analysis as employed by the approach presented in this chapter. For computing guaranteed time
bounds for tasks on a reconﬁgurable instruction set processor, an additional reconﬁguration analysis pass that
generates IPET constraints after the microarchitectural analysis is introduced in this chapter. When using software
emulation and performing reconﬁgurations in parallel, timing analysis needs to determine the worst-case point in
time at which a CI becomes available. The aim of our reconﬁguration analysis is to provide information to the
global bound analysis about when it is safe to assume a CI to be available, i.e., when to use the path that uses the
hardware CI instead of equivalent software to effectively reduce the guaranteed WCET bound. In the following
section, we introduce the background on IPET-based multi-context path analysis.
5.3.1 Path Analysis
As any ILP-formulated problem, global bound analysis using IPET consists of two parts: the objective function
and its constraints (details in Section 2.2). For WCET analysis, the objective function determines the CPU cycles
executed on a path in the task’s control ﬂow graph (CFG). To ﬁnd the WCET path, it needs to be maximized.
Variables in the objective function represent the execution count of a single basic block (xi) in the CFG and are
weighted with the execution cycles of that basic block (ci), which is determined in the microarchitectural analysis.
For a program with N basic blocks, the objective function is given as [64]:
\[
\max_{x \in \mathbb{N}_0^N} \; \sum_{i=1}^{N} c_i x_i
\]
The constraints restrict the variables by modeling the control ﬂow and capturing relative execution counts of basic
blocks. The more infeasible paths can be excluded with constraints, the tighter the WCET bound will get. Program
[Figure 5.4 (a): CFG of a kernel body with basic blocks x1 to x4 and edges d1 to d4, shown before and after virtual unrolling; the contexts are ε (empty), ε◦F[l] = F[l], and ε◦O[l] = O[l].]
(a) Virtually Unrolling a Kernel
x1 = d1
x2 = d1+d4 = d2+d3
x3 = d2 = d4
x4 = d3
(b) Basic program structural constraints
(single context)
x3 ≤ 200 ·d1
(c) Program functionality constraint (single
context) for an upper bound of 200 kernel
iterations
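\begin{align*}
x_1^{\varepsilon} &= d_1^{\varepsilon} \\
x_2^{F[l]} &= d_1^{\varepsilon} = d_2^{F[l]} + d_3^{F[l]} \\
x_3^{F[l]} &= d_2^{F[l]} = d_4^{F[l]} \\
x_2^{O[l]} &= d_4^{F[l]} + d_4^{O[l]} = d_2^{O[l]} + d_3^{O[l]} \\
x_3^{O[l]} &= d_2^{O[l]} = d_4^{O[l]} \\
x_4^{\varepsilon} &= d_3^{F[l]} + d_3^{O[l]}
\end{align*}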
(d) Program structural constraints after virtual
unroll (multiple contexts)
Figure 5.4: IPET constraint generation for single contexts and multiple contexts after virtual unrolling
structural constraints can be derived automatically from the CFG, whereas program functionality constraints need to be
user-specified or provided by further analysis passes.
Basic Constraints
An overview of how to generate IPET constraints was given in Section 2.2; a brief example follows.
Besides the variables xi representing the execution counts of basic blocks, variables di for the execution
counts of edges in the CFG are used. For example, consider the loop kernel in Fig. 5.4 (a). The loop header (represented by
x2) can be entered from outside using the edge represented by d1 or from a previous iteration using d4. The same
basic block can be left when the loop condition becomes false and the kernel is exited using d3 or it can proceed
to another iteration when the loop condition is true using d2. Therefore, x2 = d1 +d4 = d2 +d3 (see Fig. 5.4 (b)).
An upper bound of 200 for the number of kernel iterations can be given by the program functionality constraint
x3 ≤ 200 ·d1 (see Fig. 5.4 (c)).
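The constraints of Fig. 5.4 (b) and (c) can be tried out on a small instance. The following is a minimal sketch, not the ILP solver used by a real timing analyzer: it enumerates the edge count d2 by brute force instead of solving an ILP, the cycle costs per basic block are hypothetical values chosen for illustration, and it assumes that the kernel is entered exactly once (d1 = 1).

```python
# Hypothetical cycle costs c_i per basic block (illustrative, not from the text).
COST = {"x1": 5, "x2": 3, "x3": 20, "x4": 4}
LOOP_BOUND = 200  # program functionality constraint: x3 <= 200 * d1


def solve_ipet(cost=COST, loop_bound=LOOP_BOUND, entries=1):
    """Maximise sum(c_i * x_i) under the constraints of Fig. 5.4 (b) and (c)."""
    best = None
    d1 = entries                      # assumption: the kernel is entered once
    for d2 in range(loop_bound * d1 + 1):
        d4 = d2                       # x3 = d2 = d4
        d3 = d1 + d4 - d2             # x2 = d1 + d4 = d2 + d3
        x = {"x1": d1, "x2": d1 + d4, "x3": d2, "x4": d3}
        if x["x3"] > loop_bound * d1:  # explicit check of x3 <= 200 * d1
            continue
        total = sum(cost[b] * x[b] for b in x)
        if best is None or total > best[0]:
            best = (total, x)
    return best


wcet_bound, counts = solve_ipet()
print(wcet_bound, counts)
```

With these costs, the maximum is reached when the loop bound is exhausted, i.e., the loop body x3 is counted 200 times and the loop header x2 201 times.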
Execution Context
The simple constraints presented so far only consider a single execution context, i.e., only one abstract microar-
chitectural state is considered at the beginning of each basic block. For a speciﬁc basic block this context needs
to include any additional delay that may occur on any path to this basic block. In a loop, for example, the first iteration
will encounter many more cache misses than the following iterations, but with a single context the cache misses of the first
iteration need to be taken into account for all following iterations as well. Consequently, the resulting WCET bound of the task will be very
pessimistic. Therefore, more ﬁne-granular analyses considering different paths to reach a basic block separately
have been developed [87, 98].
Following the deﬁnition in [98], a context is a sequence (denoted by ∗ in the formula) of ﬁrst (C) and recursive (R)
calls of functions as well as ﬁrst (F) and following/other (O) iterations of loops. The set T of all contexts for a
program P is [98]:
T := {C[c],R[c],F [l],O[l] : c ∈ calls(P), l ∈ loops(P)}∗
where calls(P) and loops(P) are the sets of all calls and all loops in P, respectively.
5 WCET Analysis of Tasks on Runtime-Reconﬁgurable Processors
Analyzing Basic Blocks in Multiple Contexts
Contexts allow separate timing analysis of basic blocks for the different paths to reach them. Analysis starts with
the empty context ε . When a function is called or a loop body is executed, then the context changes. It can be
considered as a stack: when a function is called or a loop is entered, this information is appended using the ‘◦’
operator. Upon exit, the information is removed. Recursive calls and following loop iterations replace (ﬁrst) calls
or iterations on top of the stack, respectively. For example, a basic block inside a loop l ∈ loops(P), which is
reached by calling c ∈ calls(P), entering l and executing it repeatedly would be in the context C[c]O[l]. A more
detailed explanation can be found in [98], the theoretical background in [71].
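The stack behavior of contexts can be illustrated with a few lines of Python. This is a simplified sketch of the description above, not the formal definition of [98]: the function names are hypothetical, a context is modeled as a tuple of (kind, name) pairs, and a recursive call is only recognized when the same function is already on top of the stack.

```python
def call(ctx, c):
    """First call appends C[c]; a recursive call replaces the top by R[c]."""
    if ctx and ctx[-1] in (("C", c), ("R", c)):
        return ctx[:-1] + (("R", c),)
    return ctx + (("C", c),)


def enter_loop(ctx, l):
    """Entering loop l appends its first iteration F[l]."""
    return ctx + (("F", l),)


def next_iteration(ctx, l):
    """A following iteration replaces F[l] (or O[l]) on top by O[l]."""
    if not ctx or ctx[-1] not in (("F", l), ("O", l)):
        raise ValueError("loop is not on top of the context stack")
    return ctx[:-1] + (("O", l),)


def leave(ctx):
    """Returning from a call or exiting a loop removes the top element."""
    return ctx[:-1]


def show(ctx):
    """Render a context; the empty context is written as ε."""
    return "◦".join(f"{kind}[{name}]" for kind, name in ctx) or "ε"


ctx = ()                      # analysis starts with the empty context
ctx = call(ctx, "c")          # C[c]
ctx = enter_loop(ctx, "l")    # C[c]◦F[l]
ctx = next_iteration(ctx, "l")
print(show(ctx))              # context of a block in a repeated iteration
```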
In the CFG, the context inﬂuence of a loop is represented by virtually unrolling the loop once, as shown in
Fig. 5.4 (a). Using this representation, multi-context IPET constraints can be generated as seen in Fig. 5.4 (d).
In general, each basic constraint in Section 5.3.1 is generated for each distinguished context of the designated
basic block. For example, x3 in Fig. 5.4 can be entered using d2 and exited using d4 in context F[l] as well as in context O[l].
Additionally, the constraints capture context changes performed using the ‘◦’ operator, e.g., when entering, reentering,
or exiting a loop.
5.4 Timing Analysis Extensions for Runtime-Reconﬁgurable Processors
The main contributions of this chapter are the path analysis extensions for obtaining precise WCET bounds of tasks on
runtime-reconfigurable processors, which are presented in Section 5.4.2. The following section introduces the properties
that we assume the microarchitecture to provide and that we exploit during timing analysis.
5.4.1 Microarchitectural Analysis
The input to microarchitectural analysis (see Section 2.2) is the reconstructed CFG from the binary under analysis.
Its output is a WCET bound per basic block and context (see Section 5.3.1) incorporating cache misses/hits,
memory access latencies, pipeline stalls, and other delays which can occur in the system. To determine the WCET
of a basic block including CIs, the CI latencies need to be known and possible inﬂuences on the rest of the
microarchitecture –especially on core pipeline and caches– need to be accessible to the respective analysis passes.
The delay for initiating and performing the reconfiguration of a CI needs to be statically analyzable. Additionally,
reconfigurations performed in parallel to execution must not void any timing guarantees of the microarchitecture, e.g., the
delay for the CPU to perform bus accesses. We assume that a sequence of CI reconfigurations can be initiated
using the core pipeline (e.g., by accessing a device on the bus) and that either stalling or software emulation is then used
while reconfigurations take place. The requested CIs for an upcoming kernel and the order of configurations need
to be obtainable by analyzing the binary. This information is used to generate constraints for path analysis as
described in the following section.
An implementation of a reconﬁguration controller which provides these properties is CoRQ, as presented in Sec-
tion 5.5.
5.4.2 Path Analysis Constraints for Software Emulation
Using software emulation, CI functionality can be executed on two separate paths: the CI itself or functionally-
equivalent software, depending on whether the CI is available or not (see Fig. 5.2). In the following we always
initiate a sequence of reconﬁgurations of CIs immediately before entering a kernel. Kernel execution and recon-
ﬁgurations are performed in parallel (see Figs. 4.1 and 5.6). With the reconﬁguration delay for each CI known (in
cycles of the core pipeline), we can determine the total delay for a CI to become available in the sequence. Once
the CI is available, the program path that uses the CI will be taken for all of its invocations in the remaining kernel
iterations. The main challenge is to precisely analyze at which point during kernel execution these path changes
will happen (from software emulation to hardware-accelerated CI) by CIs becoming available successively: How
far will the kernel have progressed in the worst case when reconﬁguration of a CI ﬁnishes and what exactly is the
worst case?
Assumptions
For analyzing tasks that use reconﬁgurable CIs for worst-case timing with manageable complexity, we apply the
following assumptions.
• We assume that software emulation of a CI always has a longer delay than the CI itself. In case the CI took
longer than the software emulation, it would not make sense to use a CI anyway. The result of this assumption is
that we know that software emulation is executed when the invocation of a CI lies on the worst-case path, unless
the respective CI is explicitly annotated available (which results in a lower WCET).
• CIs currently executing in software emulation are never moved to hardware and we conservatively assume the
availability of a CI not to change during a kernel iteration, even when reconﬁguration ﬁnishes early in the
iteration and the CI is the last executed instruction. This assumption eases analysis and is safe (the WCET of a
single kernel iteration is reduced when a CI becomes available), but may introduce pessimism.
Worst-Case CI Availability
In the following, the aim is to bound the worst-case number of kernel iterations uk for every CI k, in which the
CI is unavailable. During these iterations, the CI invocations need to be executed using software emulation. To
obtain a safe bound, we need to determine under which circumstances the execution time is maximized when a CI
becomes available after its constant reconﬁguration delay. For this, consider the timelines in Fig. 5.5 for a kernel
with 6 iterations in total. Iterations without the CI available are yellow for executing faster than WCET and orange
for executing them in WCET. After the marked conﬁguration delay for the speciﬁc CI (“Reconﬁguration Finish”),
every following kernel iteration (green) makes use of the CI path (instead of software emulation) for all of its
invocations (possibly multiple per iteration). Clearly, these iterations need to run in worst-case time to maximize
the execution time after the conﬁguration delay. The remaining questions to maximize the total execution time of
the kernel are:
(i) Given an upper bound on the number of kernel iterations, how many of these iterations are run after the
reconfiguration delay in the worst case?
(ii) What is a safe time bound for iterations that execute before ﬁnishing reconﬁguration?
First, let us consider question (i) and assume no iteration of the kernel can overlap the point at which the conﬁg-
uration ﬁnishes. Let WCETavail be the WCET of one kernel iteration with the CI available, r the reconﬁguration
delay for the CI and m the remaining iterations after executing with software emulation during r. Under these
assumptions, the execution time becomes:
\[
\text{Execution Time} = r + m \cdot \mathrm{WCET}_{\mathrm{avail}}
\]
As m is the only variable in this equation and every constant is positive, the equation is maximized when m is
maximized. Under the assumption that no iteration of the kernel can overlap the Reconﬁguration Finish, this is
achieved by executing every iteration during r in worst-case time as depicted in Fig. 5.5 (b). Counterintuitively, this
means that the execution time is increased when fewer iterations are executed with the CI unavailable. We call this
property minimum progress (during the reconﬁguration delay). To ﬁnally answer (i) and (ii), iterations overlapping
the Reconﬁguration Finish need to be considered. As the last iteration of the kernel with the CI unavailable can
[Figure 5.5: four execution timelines of a kernel with 6 iterations, each marking the Reconfiguration Finish (RF) of CI1]
(a) Possible execution at runtime, where no iterations overlap Reconfiguration Finish (RF)
(b) Worst case, when no iterations can overlap RF (not safe in general)
(c) Executing iterations faster than WCET before RF can prolong the execution time of the whole kernel (timing anomaly)
(d) Applied case for safe WCET bounds
Figure 5.5: Different cases for execution times of kernel iterations. Executing all iterations in WCET does not necessarily bound the total
WCET of the kernel, because the worst-case number of iterations in which CI1 is unavailable (u1) can be mispredicted (timing
anomaly in (b)). For safe bounds, an additional iteration needs to be considered that assumes CI1 unavailable (like in (d))
now reach into the part of the timeline in which the CI is already available, this iteration can prolong the execution
time. This can lead to a case, where executing iterations faster than worst-case time during reconﬁguration extends
the execution time of the whole kernel. We call this the timing anomaly of runtime reconﬁguration. An example
is depicted in Fig. 5.5 (c). For bounding this timing anomaly, the case shown in Fig. 5.5 (d) needs to be applied in
timing analysis, i.e., maximum overlap (of one kernel iteration).
Summarizing, worst-case execution during reconﬁguration is safely bounded when minimum progress and max-
imum overlap are combined: All iterations are executed in worst-case time, as this results in minimum progress
during the reconﬁguration delay. An additional full iteration starting after the Reconﬁguration Finish is assumed
to execute with the CI unavailable in worst-case time for bounding a potentially overlapping iteration. To generate
safe constraints this case is always applied to bound the timing anomaly at the potential cost of overestimation.
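The timing anomaly and the safe bound can be reproduced with a small simulation. The sketch below is only an illustration under simplifying assumptions: a single CI, a kernel of 6 iterations, hypothetical cycle counts (r = 100, 60 cycles per iteration with software emulation in the worst case, 20 cycles with the CI), and an iteration uses the CI only if it starts at or after the Reconfiguration Finish, as stated in the assumptions above.

```python
def kernel_time(sw_durations, r, wcet_hw):
    """Total kernel time; sw_durations[i] is the time iteration i takes
    if it starts before the reconfiguration delay r has elapsed."""
    t = 0
    for d in sw_durations:
        # CI availability is fixed at the start of an iteration.
        t += wcet_hw if t >= r else d
    return t


# Hypothetical values for illustration only.
R, WCET_SW, WCET_HW, N = 100, 60, 20, 6

# Every iteration runs in worst-case time: two iterations without the CI.
all_wcet = kernel_time([WCET_SW] * N, R, WCET_HW)

# Iterations 1 and 2 run faster, so iteration 3 still starts before RF.
faster = kernel_time([45, 54] + [WCET_SW] * (N - 2), R, WCET_HW)

# Minimum progress plus one additional iteration (maximum overlap).
u1 = -(-R // WCET_SW) + 1          # ceil(r / WCET) + 1
safe_bound = u1 * WCET_SW + (N - u1) * WCET_HW

print(all_wcet, faster, safe_bound)
```

In this instance, the run with faster early iterations takes longer than the run in which every iteration takes its worst-case time, and both stay below the bound that assumes one additional iteration without the CI.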
For a kernel with multiple CIs, the same argument applies to the worst case of CIi becoming available after CIi−1:
the Reconfiguration Finish of CIi−1 is taken as point zero on the timeline, and the additional iteration of CIi−1
(potential timing anomaly) is accounted for. In the following section, we formally express
these considerations for multiple CIs.
Basic Constraints
First, the analysis of the worst-case number of iterations a CI is unavailable needs to be formalized. Then, IPET
path constraints can be generated that model the information obtained from the analysis. Without loss of generality,
suppose that the CIs for the following kernel are conﬁgured in the sequence CI1, . . . , CIn. We denote ri as the delay
to conﬁgure CIi. With existing timing analysis and the properties of Section 5.4.1, we can determine WCETi, the
WCET of one kernel iteration with CI1, . . . , CIi−1 available (and CIi, . . . , CIn still unavailable) by modeling the CI
availability using IPET path constraints. Consider the conditional branch to CI-equivalent software (see Fig. 5.2)
in the case a CI should be invoked, but is unavailable. As shown in Fig. 5.6, for every CI invocation j in the binary,
the CI and its software emulation reside on separate paths in the CFG, which are joined immediately after the
CI functionality is executed. Let xSWj be the variable representing the first basic block of the software emulation
path of invocation j. A constraint of xSWj = 0 is used to annotate the CI invoked by j as available, because it
forces the path analysis to exclude this path and account for the hardware CI path using xCIj only. For determining
WCETi, we generate such a constraint for every invocation of CI1, . . . , CIi−1 to exclude software emulation. WCET0
is the special case of not generating any CI-specific constraints and effectively always executing the software
emulation for every CI to be configured.
[Figure 5.6: a basic block (xinit) initiates reconfiguration before the kernel header; inside the kernel, a conditional branch for CI invocation j leads either to the CI-equivalent software (xSWj) or to the hardware CI (xCIj).]
Figure 5.6: CFG of a Kernel invoking a CI with Software Emulation
Suppose we knew uk, the number of iterations in which CIk is unavailable (and therefore all CIs with s > k are
unavailable as well), while all previous CIt, t < k (if any) are available. The total number of iterations in which
CIi is unavailable is then ∑_{k=1}^{i} u_k. Let u0 = WCET0 = 0; then we can define the remaining reconfiguration
time needed for CIi after CIi−1 became available as:
\[
s_i := \sum_{k=1}^{i} r_k - \sum_{k=0}^{i-1} u_k \cdot \mathrm{WCET}_k \tag{5.1}
\]
In other words, this is the time to configure CI1, . . . , CIi minus the time already spent in the first
∑_{k=0}^{i−1} u_k iterations. Formally, we can define ui recursively as follows:
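\[
u_i :=
\begin{cases}
\left\lceil s_i / \mathrm{WCET}_i \right\rceil + 1, & \text{if } s_i > 0 \\
1, & \text{if } s_i + \mathrm{WCET}_{i-1} > 0 \wedge s_i \le 0 \\
0, & \text{else}
\end{cases}
\tag{5.2}
\]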
If si ≤ 0 (the second and third case of ui), the time ∑_{k=1}^{i} r_k until CIi becomes available is already covered
by the time spent in the iterations in which CI1, . . . , CIi−1 are unavailable. As discussed in Section 5.4.2, however,
an additional iteration can become necessary to cover iterations overlapping the point in time at which the reconfiguration
finishes. This is the case if CIi became available in the additional iteration of CIi−1 (the second case, e.g., iteration
3 in Fig. 5.5 (d)). If CIi became available in the previous iteration (e.g., iteration 2 in Fig. 5.5 (d)), the additional
iteration of CIi−1 already covers the additional iteration of CIi such that CIi−1 and CIi become available in the
same iteration. It would be safe but pessimistic to not differentiate between the second and third case and always
add an additional iteration. For a single CI1, the equation becomes u1 = ⌈r1 / WCET1⌉ + 1 (see Fig. 5.5 (d)).
Assuming no invocations of CIi are contained in a nested loop inside the kernel, constraints can directly be generated
that restrict the number of kernel iterations in which CIi is unavailable in hardware and the software emulation
needs to be executed. The limitation that CI invocations cannot be contained in nested loops will be removed in
Section 5.4.2. For a single invocation j of CIi inside the kernel, let xSWi,j be the variable representing the number
of executions of the first basic block in CIi’s software emulation. Let xinit be the number of executions of the basic
block which initiates reconfiguration before entering the kernel, as shown in Fig. 5.6. Finally, to include the preceding
reconfiguration analysis in the global bound computation, the following constraint is generated:
\[
x_{SW_{i,j}} \le \sum_{k=1}^{i} u_k \cdot x_{init} \tag{5.3}
\]
This constraint ensures that the software emulation of invocation j of CIi (xSWi,j) is accounted for in the worst-case
path at most as often as CIi was determined to be unavailable in iterations of the kernel under analysis, i.e., it
adds information to the analysis that aims to reduce the estimated WCET bound. Even though we can determine
the exact number of kernel iterations which start with CIi unavailable, the constraint is generated as an inequality
because it might happen that the worst-case path in the kernel does not even include invocation j of CIi. Generating
a constraint xSWi,j = ∑_{k=1}^{i} u_k · xinit with ui > 0 would force the timing analyzer to include invocation j of CIi in its
path analysis, possibly hindering it from finding the real worst-case path.
Conditional Execution
In the following, pessimism in the analysis of conditional execution of CIs is removed. As discussed previously,
the constraints generated by Eq. (5.3) correctly allow the timing analyzer to ﬁnd worst-case paths not containing
every CI invocation. However, the upper bound on how often the software emulation of a CI invocation needs
to be executed (because the hardware is not yet available) always considers the maximum possible number of iterations
making use of this invocation, even when the specific CI invocation is not part of the worst-case path
for all iterations of the kernel. This may be very pessimistic: consider a CI which is unavailable for 10 iterations
of a kernel in total. Furthermore, the CI is invoked only 5 times during the 10 iterations of its unavailability in
the worst-case path. Then, the CI needs to be invoked only 5 times using software emulation during the whole
execution of the kernel. So far however, the generated constraints force the analyzer to assume 10 invocations of
the CI need to be executed in software emulation (the total iterations in which the CI is unavailable). As only
5 iterations exist in which the CI is invoked and unavailable, the constraints will force 5 iterations of the kernel
to emulate the CI in software even though the hardware is already available. This pessimism can be removed by
performing an extended analysis of WCETi and determining for every invocation j of an unavailable CI (CIi, . . . , CIn)
whether this invocation is part of the worst-case path, i.e., whether xSWi,j > 0 in the analysis of WCETi. When generating
the constraint for the number of software emulations of invocation j of CIi, uk is only added to the sum if this
invocation is part of the worst-case path that defines WCETk. We define:
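\[
\mathrm{inp}(i, j, k) :=
\begin{cases}
1, & \text{invocation } j \text{ of } \mathrm{CI}_i \text{ is part of the worst-case path of } \mathrm{WCET}_k \\
0, & \text{else}
\end{cases}
\tag{5.4}
\]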
inp(i, j, k) is obtained from the solved ILP which was used to determine WCETk by simply testing whether xSWi,j >
0. For every invocation of a CI, the exact number of times the software emulation is executed when entering the
kernel in the worst-case path is obtained from this analysis. The following equality defines the updated constraint
that makes use of this information:
\[
x_{SW_{i,j}} = \sum_{k=1}^{i} \mathrm{inp}(i, j, k) \cdot u_k \cdot x_{init} \tag{5.5}
\]
This constraint is generated for every invocation of all CIs instead of the constraint in Eq. (5.3).
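The effect of inp(i, j, k) on the right-hand side of the constraint can be checked with the numbers of the example above. The sketch below uses a hypothetical helper name and hypothetical values: a CI2 that is unavailable for u1 + u2 = 10 iterations and whose invocation j lies on the worst-case path of WCET1 but not of WCET2.

```python
def sw_emulation_bound(u, inp=None, x_init=1):
    """Right-hand side of Eq. (5.3) (inp is None) or Eq. (5.5).

    u      : [u_1, ..., u_i] for the CI_i under consideration
    inp    : [inp(i,j,1), ..., inp(i,j,i)], each 0 or 1
    x_init : executions of the block that initiates reconfiguration
    """
    if inp is None:
        inp = [1] * len(u)       # Eq. (5.3): every u_k is counted
    return sum(flag * u_k for flag, u_k in zip(inp, u)) * x_init


u = [5, 5]                        # hypothetical: 10 iterations in total
basic = sw_emulation_bound(u)                 # Eq. (5.3)
refined = sw_emulation_bound(u, inp=[1, 0])   # Eq. (5.5)
print(basic, refined)
```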
Loop Nests
As mentioned before, the constraints generated so far do not support CIs that are contained in loop nests. This
limitation is removed in the following. When considering an invocation j of CIi which is contained in a nested
loop inside the kernel, the constraints generated by Eq. (5.3) would result in an unsafe (too low) upper bound for
the number of times the software emulation is executed, because j is assumed to be executed at most once per
iteration. In a nested loop, however, j can be executed multiple times per iteration, namely as often as the
basic block that contains its conditional branch for software emulation (see Fig. 5.6). For brevity, let nf(CIi, j)
(nesting factor) be the statically known product of upper loop bounds for every level of loop nest to reach the basic
block which contains invocation j of CIi from the kernel iteration top level and nf(CIi, j) = 1 if it is not contained
in a loop nest.
We obtain the following constraint:
\[
x_{SW_{i,j}} \le \mathrm{nf}(\mathrm{CI}_i, j) \cdot \sum_{k=1}^{i} u_k \cdot x_{init} \tag{5.6}
\]
When including the analysis of conditional execution, we obtain the equality:
\[
x_{SW_{i,j}} = \mathrm{nf}(\mathrm{CI}_i, j) \cdot \sum_{k=1}^{i} \mathrm{inp}(i, j, k) \cdot u_k \cdot x_{init} \tag{5.7}
\]
Using Eq. (5.7), constraints for all invocations of CIi are generated. After generating constraints for all invocations
of CI1, . . . , CIn, the global WCET bound analysis of a task including reconﬁgurable CIs can be performed for a
single context (e.g., not considering cache effects).
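As a brief illustration of the nesting factor, the sketch below computes nf(CIi, j) as the product of the upper loop bounds enclosing an invocation and applies it as in Eq. (5.7). The helper names and all numbers are hypothetical; an empty list of loop bounds stands for an invocation that is not contained in a loop nest.

```python
import math


def nesting_factor(loop_bounds):
    """Product of the upper loop bounds from the kernel iteration top level
    down to the block containing the invocation; 1 if not nested."""
    return math.prod(loop_bounds)


def sw_emulation_bound_nested(loop_bounds, u, inp, x_init=1):
    """Right-hand side of Eq. (5.7) for one invocation j of CI_i."""
    covered = sum(flag * u_k for flag, u_k in zip(inp, u))
    return nesting_factor(loop_bounds) * covered * x_init


# Hypothetical invocation inside two nested loops with bounds 4 and 2.
print(nesting_factor([4, 2]))
print(sw_emulation_bound_nested([4, 2], u=[3, 1], inp=[1, 1]))
```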
Multiple Contexts
For extending the constraints of the previous section for multi-context timing analysis that enables treating the ﬁrst
iteration of a loop differently from the following iterations (e.g., for precise analysis of cache effects), we need
to generate the constraint of Eq. (5.7) for every context the kernel can appear in. Let t(xinit) ⊆ T be the subset of
execution contexts xinit can appear in (see Section 5.3.1 for a deﬁnition of contexts). Again, the reconﬁguration
delay of a CI needs to be expressed as worst-case iterations of the kernel. However, in a multi-context analysis, a
loop l can be in its ﬁrst F [l] and other O[l] iterations. Therefore, the basic constraint in Eq. (5.3) becomes:
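\[
x_{SW_{i,j}}^{\vartheta \circ F[l]} + x_{SW_{i,j}}^{\vartheta \circ O[l]} \le \sum_{k=1}^{i} u_k^{\vartheta} \cdot x_{init}^{\vartheta} \quad \forall \vartheta \in t(x_{init}) \tag{5.8}
\]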
Note that the number of iterations in which CIi is unavailable, u_i^ϑ, is now also context-dependent, as denoted by the
superscript ϑ ∈ t(xinit), because iterations of the kernel now have different delays depending on the context the
kernel is entered in. Therefore, we need to redefine u_i^ϑ in the following. The reconfiguration of CI1 starts in the
first iteration of the kernel, i.e., in context ϑ◦F[l]. The worst-case delay for one iteration of the kernel is now
context-dependent and, in particular, depends on whether the iteration is the first one or one of the other iterations of the
kernel. We denote the WCET of the first iteration, in which all CIs to be reconfigured are unavailable, as WCET_1^{ϑ◦F[l]}.
The following iterations in parallel to reconfiguration are all in context ϑ◦O[l]. WCET_1^{ϑ◦O[l]} denotes the worst-case
time bound of an iteration in this context in parallel to configuring CI1. WCET_i^{ϑ◦F[l]} and WCET_i^{ϑ◦O[l]}, the WCET of
one first or other kernel iteration with CI1, . . . , CIi−1 available and CIi, . . . , CIn unavailable when the kernel is entered in
context ϑ, can be determined analogously to the single-context analysis explained in Section 5.4.2.
Now let us consider u_1^ϑ, the number of iterations in which CI1 is unavailable. To account for the first iteration
of the kernel, its delay is subtracted from the reconfiguration delay of CI1 and u_1^ϑ is accordingly increased by 1.
Together with the additional iteration to bound the timing anomaly discussed in Section 5.4.2, there are now at
least 2 iterations with CI1 unavailable. The formula directly resembles the single-context definition in Eq. (5.2)
when inserting the values for CI1; u_1^ϑ becomes:
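\[
u_1^{\vartheta} :=
\begin{cases}
\left\lceil \left( r_1 - \mathrm{WCET}_1^{\vartheta \circ F[l]} \right) / \mathrm{WCET}_1^{\vartheta \circ O[l]} \right\rceil + 2, & \text{if } r_1 - \mathrm{WCET}_1^{\vartheta \circ F[l]} > 0 \\
2, & \text{else}
\end{cases}
\tag{5.9}
\]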
As in the single-context analysis, the remaining reconfiguration delay after the time spent in the first ∑_{k=0}^{i−1} u_k^ϑ
iterations needs to be determined for determining u_i^ϑ. However, now a different context needs to be considered for
the first iteration than for the others. Therefore, the context-sensitive extension of si is defined as follows:
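\[
s_i^{\vartheta} := \sum_{k=1}^{i} r_k - \left( \mathrm{WCET}_1^{\vartheta \circ F[l]} + \left( u_1^{\vartheta} - 1 \right) \cdot \mathrm{WCET}_1^{\vartheta \circ O[l]} + \sum_{k=2}^{i-1} u_k^{\vartheta} \cdot \mathrm{WCET}_k^{\vartheta \circ O[l]} \right) \tag{5.10}
\]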
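Finally, u_i^ϑ for i > 1 becomes:
\[
u_i^{\vartheta} :=
\begin{cases}
\left\lceil s_i^{\vartheta} / \mathrm{WCET}_i^{\vartheta \circ O[l]} \right\rceil + 1, & \text{if } s_i^{\vartheta} > 0 \\
1, & \text{if } s_i^{\vartheta} + \mathrm{WCET}_{i-1}^{\vartheta \circ O[l]} > 0 \wedge s_i^{\vartheta} \le 0 \\
0, & \text{else}
\end{cases}
\tag{5.11}
\]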
As in the single-context case, u_i^ϑ for i > 1 can become 0 when CIi−1 and CIi become available in the same iteration.
For including the analysis of conditional CI execution and loop nests as explained in Section 5.4.2,
inp(i, j, k) also needs to be extended for contexts, as it depends on the WCET of one kernel iteration.
inp(i, j, k) determines whether invocation j of CIi is part of the worst-case path defining WCETk. Thus,
the context-aware inp^ϑ(i, j, k) is defined as
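\[
\mathrm{inp}^{\vartheta}(i, j, k) :=
\begin{cases}
1, & \text{invocation } j \text{ of } \mathrm{CI}_i \text{ is part of the worst-case path of } \mathrm{WCET}_k^{\vartheta \circ F[l]} \text{ or } \mathrm{WCET}_k^{\vartheta \circ O[l]} \\
0, & \text{else}
\end{cases}
\tag{5.12}
\]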
Including the analysis of conditional CI execution and loop nests analogously to the single-context constraint in
Eq. (5.7), the ﬁnal constraint for multi-context analysis is obtained:
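\[
x_{SW_{i,j}}^{\vartheta \circ F[l]} + x_{SW_{i,j}}^{\vartheta \circ O[l]} = \mathrm{nf}(\mathrm{CI}_i, j) \cdot \sum_{k=1}^{i} \mathrm{inp}^{\vartheta}(i, j, k) \cdot u_k^{\vartheta} \cdot x_{init}^{\vartheta} \tag{5.13}
\]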
Using this equation, constraints for all invocations of CI1, . . . , CIn can be generated and global WCET bound
analysis of a task can be performed including reconﬁgurable CIs with multiple contexts.
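For one entry context ϑ, the definitions in Eqs. (5.9) to (5.11) can be evaluated as follows. This is a sketch under stated assumptions, not the implementation used in the evaluation: the function names and input values are hypothetical, all delays are positive integer cycle counts, and the second term of Eq. (5.10) is read as the u_1^ϑ − 1 iterations of CI1 that run in context ϑ◦O[l], each bounded by WCET_1^{ϑ◦O[l]}.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive b."""
    return -(-a // b)


def unavailable_iterations_ctx(r, wcet_first, wcet_other):
    """Return [u_1, ..., u_n] for one entry context.

    r[k-1]          : reconfiguration delay r_k of CI_k
    wcet_first      : WCET of the first kernel iteration (context F[l]),
                      all CIs unavailable
    wcet_other[k-1] : WCET_k of a following iteration (context O[l])
    """
    n = len(r)
    u = [0] * (n + 1)
    # Eq. (5.9): first iteration plus the additional (overlap) iteration
    rest = r[0] - wcet_first
    u[1] = ceil_div(rest, wcet_other[0]) + 2 if rest > 0 else 2
    for i in range(2, n + 1):
        # Eq. (5.10): time already spent in earlier iterations
        spent = (wcet_first + (u[1] - 1) * wcet_other[0]
                 + sum(u[k] * wcet_other[k - 1] for k in range(2, i)))
        s_i = sum(r[:i]) - spent
        # Eq. (5.11)
        if s_i > 0:
            u[i] = ceil_div(s_i, wcet_other[i - 1]) + 1
        elif s_i + wcet_other[i - 2] > 0:
            u[i] = 1
        else:
            u[i] = 0
    return u[1:]


# Hypothetical values: the first iteration is slower (e.g., cold caches).
print(unavailable_iterations_ctx([100, 30, 200], 80, [60, 50, 40]))
```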
5.4.3 Stalling vs. Software Emulation
Modeling software emulation with parallel reconﬁguration needs to make pessimistic assumptions to achieve a
safe worst-case bound and an additional analysis needs to be performed to generate path analysis constraints.
Therefore, it needs to be determined in which cases the approach is competitive or superior to stalling under timing
guarantees. For clarity, we will only consider a single execution context in the following, but multiple contexts in
the evaluation.
When stalling, execution halts for the duration of reconfiguring all CIs required in the upcoming kernel. Afterwards,
every kernel iteration is executed within its worst-case execution time bound with all CIs available; let us denote
this bound as WCETn+1. Therefore, when stalling, the WCET bound for a kernel with an upper bound of I
iterations (while neglecting the time taken to exit the kernel) is:
\[
\sum_{k=1}^{n} r_k + I \cdot \mathrm{WCET}_{n+1} \tag{5.14}
\]
[Figure 5.7: block diagram of the SoC. The CPU pipeline with I$ and D$ is connected to main memory, the scratchpad memory (SPM), the reconfigurable fabric (CI Execution Controller and reconfigurable containers RC 1 to RC n), and the reconfiguration controller CoRQ (command queue, CI configuration storage, FSMs for enqueued and immediate commands, ICAP). The legend distinguishes the reconfigurable base processor from the extensions for timing-analyzable reconfiguration. Abbreviations: SPM: Scratchpad Memory; RC: Reconfigurable Container; CMD: Reconfiguration Command; ICAP: Internal Configuration Access Port.]
Figure 5.7: Overview of System on Chip used for Evaluation
From the considerations for generating path analysis constraints for the software emulation approach in Section 5.4.2,
the WCET of a kernel can be bounded. For brevity, only the basic constraints of Section 5.4.2 are considered in
the following. The upper bound on the number of kernel iterations during which reconfiguration takes place is
∑_{k=1}^{n} u_k. Furthermore, during uk iterations of the kernel, each iteration has an upper execution time bound of
WCETk. The remaining I − ∑_{k=1}^{n} u_k iterations have an upper bound of WCETn+1 each. Therefore, we obtain the
following execution time bound for all kernel iterations:
\[
\sum_{k=1}^{n} u_k \cdot \mathrm{WCET}_k + \left( I - \sum_{k=1}^{n} u_k \right) \cdot \mathrm{WCET}_{n+1} \tag{5.15}
\]
For the software emulation approach resulting in a lower worst-case time bound, the inequality of (5.15) < (5.14)
needs to be satisﬁed. When simplifying this inequality, the following test is obtained:
\[
\sum_{k=1}^{n} u_k \cdot \left( \mathrm{WCET}_k - \mathrm{WCET}_{n+1} \right) < \sum_{k=1}^{n} r_k \tag{5.16}
\]
Note that Inequality (5.16) is independent of the total number of kernel iterations: whether software emulation
is beneficial over stalling can be decided by analyzing the first ∑_{k=1}^{n} u_k iterations only. For a specific iteration
included in uk, WCETk − WCETn+1 denotes the additional time the software emulation takes because some CIs
are unavailable. The software emulation approach results in a lower time bound for a kernel if and only if the total
additional time remains lower than the total reconfiguration time. The practical implications of this analysis are
investigated in the evaluation.
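The comparison of Eqs. (5.14) to (5.16) can be sketched for a single CI as follows. The function names and all numbers are hypothetical, and Eq. (5.15) is only meaningful when the kernel has at least as many iterations as the sum of all uk.

```python
def stalling_bound(r, wcet_all, iterations):
    """Eq. (5.14): reconfigure everything first, then run with all CIs."""
    return sum(r) + iterations * wcet_all


def emulation_bound(u, wcet, wcet_all, iterations):
    """Eq. (5.15): u_k iterations bounded by WCET_k, the rest by WCET_{n+1}."""
    during = sum(u_k * w_k for u_k, w_k in zip(u, wcet))
    return during + (iterations - sum(u)) * wcet_all


def emulation_is_better(r, u, wcet, wcet_all):
    """Eq. (5.16): total additional emulation time below total reconf. time."""
    extra = sum(u_k * (w_k - wcet_all) for u_k, w_k in zip(u, wcet))
    return extra < sum(r)


# Hypothetical single-CI kernels: WCET_1 = 60, WCET_2 = 20 cycles.
short = dict(r=[100], u=[3], wcet=[60], wcet_all=20)    # u_1 = ceil(100/60) + 1
long_ = dict(r=[1000], u=[18], wcet=[60], wcet_all=20)  # u_1 = ceil(1000/60) + 1

print(emulation_is_better(**short), emulation_is_better(**long_))
print(stalling_bound([1000], 20, 100), emulation_bound([18], [60], 20, 100))
```

In the first configuration, the additional iteration outweighs the short reconfiguration delay and stalling yields the lower bound; in the second, the long reconfiguration delay is hidden behind useful work and software emulation yields the lower bound.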
5.5 Runtime-Reconﬁgurable Processor Infrastructure for Timing
Guarantees
For evaluating this work, the reconﬁgurable processor i-Core was employed, which was presented in Section 2.4.
For the context of this chapter it is important to remember that a CI is executed by the CI Execution Controller
in a protocol similar to other multi-cycle instructions like division and directly accesses register operands or the
[Figure 5.8: toolflow diagram. Stages: Reconfiguration Analysis, Timing Analyzer (aiT), Constraint Generation, Timing Analyzer (aiT). Artifacts passed between the stages: binary with CIs and dynamic reconfiguration; reconfiguration delay per CI and kernel, CI latencies; binary with CIs substituted; CI latency and auxiliary un-/availability constraints; WCETi per kernel and context; constraints for reconfigurable CIs; unaltered binary; global time bound. The legend distinguishes stages implemented for our evaluation from the external tool.]
Figure 5.8: Evaluation toolﬂow
non-cacheable SPM only (it follows the generalized architecture shown in Fig. 5.7). This way, in microarchi-
tectural analysis, a CI is just another multi-cycle instruction, which does not inﬂuence data cache analysis. The
reconﬁgurable fabric is partitioned into containers of identical size [92]. A CI can be conﬁgured into any set of
reconﬁgurable containers (possibly multiple) and potentially replaces a currently conﬁgured CI [11].
We further employ CoRQ (see Chapter 4) as the reconfiguration controller to perform timing-analyzable reconfiguration
using the internal configuration access port (ICAP). CoRQ is accessible by the core pipeline as a
memory-mapped on-chip bus device and processes reconfiguration commands that enable guaranteed reconfiguration
delays (see Fig. 5.7, details in Chapter 4). For achieving predictability, we statically select which CIs
to reconﬁgure at which program points and in what order. The resulting reconﬁguration sequences are fed as
commands to CoRQ by a sequence of stores to its address. CoRQ’s internal bitstream memory is ﬁlled with all
conﬁgurations required by the task over the bus at task load.
5.6 Experimental Evaluation
5.6.1 Implementation and Setup
The static timing analysis flow as described in Section 5.3 was split into several steps, as depicted in Fig. 5.8.
(i) Reconﬁguration analysis is performed on the compiled binary, giving absolute reconﬁguration delays per
CI.
(ii) AbsInt aiT [2] is used to determine WCETϑi (see Section 5.4.2) for all kernels. As aiT is closed-source
software, we could not directly integrate support for CIs. Therefore, every CI opcode in the binary was
substituted by an ADD opcode and a constraint in aiT’s AIS2 Language to set the delay for the new ADD
instruction to the delay of the speciﬁc CI (e.g., Fig. 5.9 (a)). aiT outputs an XML report, which we parsed
to determine every uϑi for every kernel and generate the constraints described in Section 5.4.2 in AIS2 (e.g.,
Fig. 5.9 (b)).
(iii) Our generated constraints were used to calculate the global WCET bound using aiT.
instruction 0x40001238 additionally takes: ((36*def("cISA_freq_mul"))-1) cycles;
(a) Example for Setting the CI Latency
flow sum: point(0x40001244) == (2*67) point(0x400011a0);
(b) Example for a Generated Constraint for CI Availability
Figure 5.9: Generated Constraints in aiT’s Format (AIS2)
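The generation of such annotations from parsed analysis results can be sketched as follows. This is a minimal illustration that only mirrors the two textual patterns of Fig. 5.9; the function names, parameter names and the reading of the two factors in the flow constraint are assumptions for illustration and are not taken from the toolflow used in this work or from the AIS2 specification.

```python
def ci_latency_annotation(address: int, fabric_cycles: int,
                          freq_symbol: str = "cISA_freq_mul") -> str:
    """Format a latency annotation in the pattern of Fig. 5.9 (a).

    The substituted ADD at `address` is charged the CI latency. The
    subtraction of 1 presumably accounts for the cycle the ADD itself takes
    (assumption, not stated in the text).
    """
    return (f"instruction 0x{address:08x} additionally takes: "
            f'(({fabric_cycles}*def("{freq_symbol}"))-1) cycles;')


def flow_annotation(point: int, factor_a: int, factor_b: int,
                    reference: int) -> str:
    """Format a flow constraint in the pattern of Fig. 5.9 (b).

    `factor_a` and `factor_b` are hypothetical names; this section does not
    state what the two factors of the example stand for.
    """
    return (f"flow sum: point(0x{point:08x}) == "
            f"({factor_a}*{factor_b}) point(0x{reference:08x});")


if __name__ == "__main__":
    # Reproduces the two example lines of Fig. 5.9.
    print(ci_latency_annotation(0x40001238, 36))
    print(flow_annotation(0x40001244, 2, 67, 0x400011A0))
```

Keeping the formatting in small pure functions makes it straightforward to emit one annotation per substituted CI opcode and one flow constraint per kernel and context.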
Table 5.1: Kernels and Custom Instructions (CI) in the H.264 Encoder
Kernel            CI Name and Short Description               Working Set  CLoC⁴
MotionEstimation  SATD: Sum of Abs. Transf. Differences       16×16 px     123
MotionEstimation  SAD: Sum of Abs. Differences                16×16 px     24
EncodeMacroBlock  MC_Hz: Motion Compens. Interpol. Horiz.     4 px         51
EncodeMacroBlock  IPred_HDC: Intra Prediction Horiz.          16×16 px     35
EncodeMacroBlock  IPred_VDC: Intra Prediction Vert.           16×16 px     19
EncodeMacroBlock  DCT: Discrete Cosine Transf.                4×4 px       76
EncodeMacroBlock  HT2x2: Hadamard Transform                   2×2 px       12
EncodeMacroBlock  HT4x4: Hadamard Transform                   4×4 px       111
LoopFilter        LoopFilter: In-Loop Deblock. Filter         4 px         82
The analysis is performed offline and runs on a workstation; it does not induce any runtime overhead. We evaluated
our analysis with an H.264 encoder application, which uses 9 CIs covering the most compute-intensive kernels
shown in Table 5.1. Every kernel configuration requires all CI containers. Therefore, before entering a kernel,
reconfigurations for all containers are initiated to meet the kernel’s CI requirements. The encoder contains complex
control flow with numerous decisions and nested loops. Most of the properties tested in the Mälardalen WCET
Benchmarks³ are covered, e.g., Discrete Cosine Transform is contained in both. For evaluating the overestimation
of the static analysis, we executed the same binary, obtained from BCC 4.4.2 (Gaisler’s extended GCC 4.4.2) at O1,
in our SystemC-based cycle-accurate simulator, which models the reconfigurable system shown in Fig. 5.7. Before
performing the evaluation, we calibrated aiT and our simulator by harmonizing hardware parameters and verifying
the results of test cases, e.g., load-store sequences.
5.6.2 Results
In the following, the inﬂuences of stalling and software emulation on WCET bounds of a single kernel are analyzed.
Afterwards, the overestimation and WCET reduction when using reconﬁgurable CIs on the whole H.264 encoder
is analyzed.
Software Emulation vs. Stalling on a Single Kernel
For the analysis we use results obtained from performing timing analysis on a binary that executes the LoopFilter
kernel of H.264 on 99 macroblocks (QCIF resolution). LoopFilter is the kernel of lowest complexity in our
H.264 encoder; it contains a single CI and allows detailed analysis of worst-case CI availability. The guaranteed
time bounds are compared to results obtained by executing the same binary in our simulator. Table 5.2 gives
an overview of the parameters investigated. ffabric stays constant at 100 MHz and we choose multiples of it for
fCPU which resemble realistic setups (rounded to the next power of two). For example, the LEON3 processor, which we
extended for reconfigurable CIs, is advertised as running at 400 MHz when implemented as an ASIC; its successor,
the LEON4, is advertised as running at 1500 MHz. The commercially available Xilinx Zynq-7000 SoC couples an
ARM Cortex A9 at 866 MHz with a Xilinx 7-Series reconfigurable fabric.

3 http://www.mrtc.mdh.se/projects/wcet/benchmarks.html
4 C Lines of Code that are replaced by utilizing a hardware CI (without comments or whitespace)

Table 5.2: Parameters investigated
Parameter [Unit]                     Symbol   Values
CPU frequency [MHz]                  fCPU     100, 200, 400, 800, 1600
Fabric frequency [MHz]               ffabric  100
Configuration port frequency [MHz]   fICAP    25, 50, 100

Table 5.3: CI Unavailability (uk) obtained during WCET bound estimation for LoopFilter
fCPU/ffabric                   1    2    4    8   16
u1 at fICAP = ffabric          3    4    6   11   21
u1 at fICAP = 1/2 · ffabric    4    6   11   21   41
u1 at fICAP = 1/4 · ffabric    6   11   21   41   81

[Figure 5.10: three bar charts, (a) fICAP = ffabric, (b) fICAP = 1/2 · ffabric, (c) fICAP = 1/4 · ffabric. x-axis: fCPU/ffabric ∈ {1, 2, 4, 8, 16}; y-axis: Thousand Cycles (0 to 2600); series: SW Emulation Observed, SW Emulation Overestimation, Stalling Observed, Stalling Overestimation.]
Figure 5.10: Observed Runtimes and Guaranteed WCET Bounds for LoopFilter
All results are measured in cycles of the CPU pipeline; therefore, they are determined by the relation of fCPU, ffabric
and fICAP. Figure 5.10 (a) shows the results for fCPU/ffabric ∈ {2^0 = 1, ..., 2^4 = 16} and fICAP = ffabric = 100 MHz.
This corresponds to running the Internal Configuration Access Port (ICAP) at its maximum frequency, and results
in a reconfiguration bandwidth of 400 MB/s when configuring 32 bit of data every cycle. This reconfiguration
bandwidth is possible when using a dedicated on-chip configuration storage (e.g., as available in CoRQ, see
Chapter 4); we utilize the BRAM resources of Xilinx FPGAs in our prototype. When increasing the minimum evaluated
CPU pipeline frequency of 100 MHz by a factor of c for a fixed ICAP frequency, the reconfiguration delay,
measured in CPU cycles, increases by the factor c as well. Additionally, the runtime benefit of hardware CIs compared
to software emulation decreases. According to our prediction in Section 5.4.3, software emulation results in a lower
time bound than stalling for fCPU/ffabric ∈ {8,16}, but not for fCPU/ffabric ∈ {1,2,4}. The time bounds obtained
for the kernel shown in Fig. 5.10 (a) reflect these predictions. Software emulation results in 30.28%, 16.39%
and 4.48% higher time bounds than stalling for fCPU/ffabric ∈ {1,2,4}, respectively. For fCPU/ffabric ∈ {8,16}
software emulation results in 1.38% and 8.25% lower time bounds. When considering the observed worst-case
runtime, however, software emulation always takes less time than stalling, i.e., 2.84%, 4.57%, 7.28%, 11.98% and
20.85% less for fCPU/ffabric ∈ {1,2,4,8,16}, respectively. The reason for this discrepancy is that for analyzing
software emulation for the worst-case time bound, pessimistic assumptions about CI availability need to be made
as detailed in Section 5.4.2. In contrast, we know exactly at which point in a kernel a CI is available when stalling:
directly from the beginning, after stalling for a statically known amount of cycles. Therefore, overestimation for
software emulation ranges from a maximum of 40.17% with fCPU = ffabric to 13.25% with fCPU/ffabric = 8, while the
overestimation for stalling is never above 4.54%, again maximal with fCPU = ffabric.
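The relation between the ICAP frequency, the reconfiguration bandwidth and the reconfiguration delay in CPU cycles can be illustrated with a small sketch. It assumes an idealized transfer of 32 bit per ICAP cycle without any controller or bus overhead; the function names and the bitstream size used in the example are hypothetical and not taken from the evaluation.

```python
import math


def reconfig_bandwidth_mb_per_s(f_icap_mhz: float,
                                bytes_per_cycle: int = 4) -> float:
    """Bandwidth in MB/s when `bytes_per_cycle` bytes are written per ICAP cycle."""
    return f_icap_mhz * bytes_per_cycle


def reconfig_delay_cpu_cycles(bitstream_bytes: int, f_cpu_mhz: float,
                              f_icap_mhz: float,
                              bytes_per_cycle: int = 4) -> int:
    """Idealized reconfiguration delay, expressed in CPU cycles.

    Only the raw transfer is modeled (assumption); a real controller adds
    further, statically bounded overheads.
    """
    icap_cycles = math.ceil(bitstream_bytes / bytes_per_cycle)
    return math.ceil(icap_cycles * f_cpu_mhz / f_icap_mhz)


if __name__ == "__main__":
    for f_icap in (100, 50, 25):
        print(f_icap, "MHz ->", reconfig_bandwidth_mb_per_s(f_icap), "MB/s")
    # Hypothetical 4096-byte configuration: raising f_CPU by a factor c
    # raises the delay in CPU cycles by the same factor c.
    for f_cpu in (100, 200, 400, 800, 1600):
        print(f_cpu, "MHz ->", reconfig_delay_cpu_cycles(4096, f_cpu, 100))
```

With fICAP = 100 MHz the sketch yields the 400 MB/s stated above, and the delay in CPU cycles scales linearly with fCPU for a fixed ICAP frequency.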
Figure 5.10 (b) and Fig. 5.10 (c) show the results for fICAP = 50 MHz = 1/2 · ffabric and fICAP = 25 MHz = 1/4 · ffabric,
i.e., a reconfiguration bandwidth of 200 MB/s and 100 MB/s, respectively. This corresponds, e.g., to systems making
use of cheaper but slower flash memory for the CI Storage instead of SRAM-based memory and therefore
requiring a lower fICAP. The overall trend is that overestimation is lower with slower memories. In software
emulation, the pessimism of assuming an additional iteration of CI unavailability (see Section 5.4.2) has less influence
on the guaranteed time bound as there are more iterations with the CI unavailable in total. At fCPU/ffabric = 16 and
fICAP = 1/4 · ffabric (fCPU = 64 · fICAP), overestimation rises again and reaches its overall maximum of 63.73%. As
seen in Table 5.3, timing analysis guarantees that a maximum of 19 iterations (99 total iterations minus u1 = 81
plus 1, as discussed in Section 5.4.2) of the kernel need to be executed after the reconfiguration delay at this point. In
the observed runtime, however, all iterations are finished in software during reconfiguration.
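The iteration count quoted above follows directly from Table 5.3. The sketch below restates that arithmetic; the function name is hypothetical, and applying the same relation to the other table entries is an extrapolation made here for illustration, not a result reported in the evaluation.

```python
TOTAL_ITERATIONS = 99  # macroblocks processed by LoopFilter at QCIF resolution

# CI unavailability u1 for f_ICAP = 1/4 * f_fabric (last row of Table 5.3),
# indexed by the ratio f_CPU / f_fabric.
U1_QUARTER_ICAP = {1: 6, 2: 11, 4: 21, 8: 41, 16: 81}


def max_iterations_after_reconfiguration(total: int, unavailable: int) -> int:
    """Relation used in the text: total iterations minus u1 plus 1."""
    return total - unavailable + 1


if __name__ == "__main__":
    for ratio, u1 in U1_QUARTER_ICAP.items():
        remaining = max_iterations_after_reconfiguration(TOTAL_ITERATIONS, u1)
        print(f"f_CPU/f_fabric = {ratio:2d}: u1 = {u1:2d}, "
              f"up to {remaining} iterations after reconfiguration")
```

For fCPU/ffabric = 16 the sketch reproduces the 19 iterations named in the text.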
When stalling, overestimation also decreases slightly with slower memory as the reconfiguration delay takes a
bigger share of the overall execution time and does not introduce overestimation. As more iterations in software
emulation can be executed during reconfiguration and overestimation decreases, the resulting time bound is only
1.07% higher than stalling at fCPU/ffabric = 4, fICAP = 1/2 · ffabric and already 2.02% lower at fICAP = 1/4 · ffabric. The
observed execution time for software emulation is 10.33% and 13.11% lower than for stalling, respectively.
In sum, software emulation beneﬁts from slow reconﬁguration bandwidths or high CPU frequencies. In our results,
to reduce the guaranteed time bound over stalling, the CPU needs to run at least at eight times the ICAP frequency
due to pessimism. In the observed runtime, software emulation is always beneﬁcial over stalling.
Overestimation
In this section, we analyze the inﬂuence of CIs on overestimation of WCET bounds for the H.264 encoder encoding
20 frames in QCIF resolution. Higher resolutions would increase the number of iterations per kernel and therefore
reduce the relative effects of the reconﬁguration delays on the total execution time. All results were taken with
fICAP = ffabric and several values for fCPU/ ffabric.
Figure 5.11 shows the percentage of overestimation for a general-purpose version of our H.264 encoder with-
out any CIs (cISA execution only), and several alternatives of using reconﬁgurable CIs. As the runtime in the
general-purpose case is unaffected by the fabric frequency, the amount of overestimation is constant at 38.04%.
Introducing additional control ﬂow by inserting the conditions for software emulation without actually conﬁguring
CIs –denoted as Software Emulation (always unavailable)– increases the overestimation slightly to 39.93%.
As mentioned in Section 5.6.2, the pessimism for bounding the timing anomaly when using software emulation
is highest when reconﬁguration of CIs takes only few iterations of a speciﬁc kernel. Overestimation reaches its
maximum for software emulation when fCPU = ffabric with 47.95% and its minimum at fCPU/ ffabric = 16 (maxi-
mum iterations during reconﬁguration) with 11.89%. In sum, in our results the overestimation is less for software
emulation than for the general purpose CPU when the pipeline frequency is twice the fabric frequency or higher.
When using stalling, overestimation reaches its maximum of 17.97% when fCPU = ffabric and its minimum of
6.66% when fCPU/ ffabric = 16. It is generally lower than when using software emulation (see also Section 5.4.2)
or cISA instructions only.
We can observe that increasing fCPU/ ffabric results in lower overestimation for both approaches for dealing with
reconﬁguration delay. This has two reasons:
(i) Increasing fCPU/ ffabric results in more kernel iterations which can be executed during the reconﬁguration
delay of a CI using software emulation. This means that the pessimism of assuming an additional iteration
in software to bound the timing anomaly (see Section 5.4.2) has a lesser share on the total iterations and
therefore a lesser effect on the timing bound.
(ii) Increasing fCPU/ ffabric also increases the share of execution time (measured in CPU cycles) spent on the fab-
ric. The execution time on the fabric does not introduce overestimation and therefore the total overestimation
decreases.
Software emulation is affected by (i) and (ii), while stalling is only affected by (ii). Therefore, increasing
fCPU/ ffabric has a stronger effect on the overestimation of software emulation than of stalling.
[Figure 5.11: bar chart. x-axis: fCPU/ffabric ∈ {1, 2, 4, 8, 16}; y-axis: Overestimation (0% to 50%); series: cISA only (GPP Binary), Software Emulation (always unavailable), Software Emulation (parallel Reconfiguration), Stalling, Combination.]
Figure 5.11: H.264 overall overestimation without CI Invocations (cISA only) and differ-
ent alternatives of invoking CIs. Software Emulation (always unavailable)
introduces CI Invocations, but never executes them in hardware. Combination
chooses either Software Emulation or Stalling per kernel to optimize the timing
bound (see Section 5.4.3).
Using the analysis of Section 5.4.3, we use a combination of software emulation and stalling to choose the
more beneficial approach per kernel. As a result, we apply software emulation at fCPU/ffabric = 8 for two out of
three kernels and stalling for the other one. fCPU/ffabric = 16 is equivalent to software emulation only. It turns out
that while overestimation is higher than using stalling in these cases, the resulting time bound is lower. This is
achieved by our models guaranteeing that the reconfiguration delay can be hidden effectively. All other cases are
equivalent to stalling only.
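The per-kernel choice can be pictured with a minimal sketch. It assumes that a guaranteed bound is available per kernel for both approaches and simply keeps the lower one; the kernel labels and cycle counts below are made-up placeholders, not measurements from this evaluation, and the actual decision follows the analysis of Section 5.4.3.

```python
from typing import Dict


def choose_strategy_per_kernel(
        bounds: Dict[str, Dict[str, int]]) -> Dict[str, str]:
    """Pick, per kernel, the approach with the lower guaranteed bound.

    `bounds` maps kernel -> {approach -> guaranteed cycles}. Ties are
    resolved in favor of the approach listed first.
    """
    return {kernel: min(options, key=options.get)
            for kernel, options in bounds.items()}


if __name__ == "__main__":
    # Placeholder numbers for illustration only (hypothetical).
    hypothetical_bounds = {
        "kernel_a": {"stalling": 1200, "software_emulation": 1100},
        "kernel_b": {"stalling": 900, "software_emulation": 950},
        "kernel_c": {"stalling": 500, "software_emulation": 480},
    }
    print(choose_strategy_per_kernel(hypothetical_bounds))
```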
Speedup
Figure 5.12 shows the speedup obtained by using runtime reconﬁguration compared to execution on the cISA
only. The left graph shows the speedup obtained in the guaranteed runtime, and the right graph shows the speedup
of the observed runtime. As in the previous section, all results in this section are obtained with fICAP = ffabric.
The speedup in the guaranteed runtime is higher than in the observed runtime for stalling and the combination of
stalling and software emulation from the previous section. The reason for this effect is that in addition to the actual
speedup introduced by CIs, the overestimation is reduced as discussed in Section 5.6.2. Therefore, the speedup
in the predicted runtime is on average 24.43% higher (minimum 17.01% at fCPU/ ffabric = 1, maximum 29.42%
at fCPU/ffabric = 16) than for the observed runtime when stalling. Software emulation results in an 11.5% higher
speedup in the predicted runtime than in the observed runtime on average (minimum -6.70% at fCPU/ ffabric = 1,
maximum 23.37% at fCPU/ ffabric = 16).
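The connection between reduced overestimation and the higher speedup on the guarantee can be checked numerically. The sketch below assumes that overestimation is defined as (guaranteed bound − observed runtime) / observed runtime; this definition is an assumption made here, chosen because it reproduces the minimum and maximum values quoted above from the overestimation values reported in the previous section. The function name is hypothetical.

```python
def extra_speedup_on_guarantee(over_cisa_pct: float, over_ci_pct: float) -> float:
    """Percentage by which the guaranteed speedup exceeds the observed one.

    With bound = observed * (1 + overestimation), the ratio of guaranteed to
    observed speedup is (1 + o_cISA) / (1 + o_CI).
    """
    ratio = (1 + over_cisa_pct / 100) / (1 + over_ci_pct / 100)
    return (ratio - 1) * 100


if __name__ == "__main__":
    O_CISA = 38.04  # overestimation of the cISA-only binary, in percent
    cases = {
        "stalling, f_CPU/f_fabric = 1": 17.97,
        "stalling, f_CPU/f_fabric = 16": 6.66,
        "software emulation, f_CPU/f_fabric = 1": 47.95,
        "software emulation, f_CPU/f_fabric = 16": 11.89,
    }
    for label, over_ci in cases.items():
        print(f"{label}: {extra_speedup_on_guarantee(O_CISA, over_ci):+.2f}%")
```

A negative value, as for software emulation at fCPU/ffabric = 1, means that overestimation with CIs exceeds that of the cISA-only binary, so the guaranteed speedup falls below the observed one.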
[Figure 5.12: two charts. x-axis: fCPU/ffabric ∈ {1, 2, 4, 8, 16}; y-axes: Speedup on Guarantee (left) and Speedup on Observation (right), each from 0 to 18; series: cISA / CIs (Software Emulation), cISA / CIs (Stalling), cISA / CIs (Combination).]
Figure 5.12: H.264 overall speedup on the guaranteed time bound (left) and the observed runtime (right)
Figure 5.13 takes stalling as a baseline and compares the guaranteed and observed results of software emulation
and the combination of both approaches to it. Software emulation is always beneﬁcial for the observed runtime
with an average reduction of 4.8%, a minimum of 1.58% with fCPU/ ffabric = 1 and a maximum of 10.48% with
fCPU/ ffabric = 16. For the guaranteed WCET bound, however, software emulation is beneﬁcial only when the
CPU pipeline runs faster than the fabric by a factor of fCPU/ffabric = 8 or more, because of overestimation (see
Section 5.6.2). Using software emulation for suitable kernels and stalling for others, the combination does not
increase the runtime over stalling in any case. In cases where software emulation is beneﬁcial for some kernels,
the combination achieves the same or better guaranteed runtime reduction than using software emulation only.
[Figure 5.13: chart. x-axis: fCPU/ffabric ∈ {1, 2, 4, 8, 16}; y-axis: Speedup (0.80 to 1.15); series: Guarantee: Stalling / Software Emulation, Observed: Stalling / Software Emulation, Guarantee: Stalling / Combination, Observed: Stalling / Combination.]
Figure 5.13: Speedup of Software Emulation and Combination over Stalling in H.264
5.7 Conclusion
This chapter presented a novel timing analysis approach for tasks on runtime-reconfigurable processors. It
supports static analysis of runtime reconfiguration of multiple custom instructions (CIs) and multiple execution
contexts (e.g., as used for precise worst-case analysis of cache effects). In the evaluation, the precision of the
analysis and the benefit of using CIs on WCET reduction as well as reduced overestimation were shown. We
compared the effects on safe estimated WCET bounds of executing CI-equivalent software (software emulation)
to halting execution (stalling) during the reconfiguration delay. In the observed worst case, software emulation
was always beneficial over stalling. However, in the estimated time bound, software emulation was superior only
when the CPU pipeline frequency was higher than the fabric frequency by a factor of eight or more, as stalling can
be analyzed more precisely. An analysis to choose either stalling or software emulation per kernel was introduced
and evaluated to combine the benefits of both approaches. In sum, we have shown that runtime instruction set
reconfiguration can be an enabling feature to provide timing-analyzable performance.
In this chapter, the set of CIs to conﬁgure for each kernel was assumed given. It turns out, however, that in cases
where there are more CIs to choose from than ﬁt onto the reconﬁgurable area it is an NP-hard problem to choose
the WCET-optimizing set of CIs. This problem is further complicated by the fact that computations (like the ones
performed by CIs) can be implemented in hardware using different alternatives that choose a tradeoff between area
requirements and resulting latency. The following chapter presents an optimal and a heuristic solution to selecting
WCET-optimizing sets of CI implementations.
6 WCET Optimization using Reconfigurable Custom Instructions
The previous chapter has shown instruction set extensions by reconﬁgurable custom instructions (CIs) to be an
effective means to achieve predictable performance. CIs were detailed in the context of i-Core in Section 2.4;
their most important properties for the context of this chapter are summarized as follows: CIs initiate execution of
hardware accelerators conﬁgured on a reconﬁgurable fabric that is tightly coupled to a processor core (see [97] for
an overview of reconﬁgurable architectures). An application binary in such an architecture provides directives to
a reconﬁguration controller (like CoRQ, see Chapter 4) to conﬁgure the CIs’ accelerators onto the reconﬁgurable
fabric. Reconﬁgurations are performed for the requirements of an upcoming kernel (also known as hot spot), i.e., a
compute-intensive part of the application, e.g., a loop nest. In the previous chapter it was shown that, in addition to a
considerable speedup, the overestimation of a task’s WCET can be reduced by moving calculations from software
code to hardware CIs. CIs typically implement functionality that corresponds to several hundred instructions when
executed on the CPU pipeline, possibly including conditional branches and other control ﬂow. While analyzing
instructions for worst-case latency may introduce pessimism due to, e.g., pipeline hazards or instruction cache
misses, the latency of the hardware accelerators –executed on the reconﬁgurable fabric– is precisely known.
In this chapter¹, an approach for selecting WCET-optimizing sets of CIs for computational kernels that seamlessly
integrates into state-of-the-art timing analysis is proposed. While this chapter does not target the reduction of
overestimation of a task’s WCET bound or resolving the problem of timing anomalies (like the one discovered
in Section 5.4.2), an effective approach is presented to statically select sets of reconfigurable CIs to
optimize a task’s WCET bound and advance research on timing-analyzable high-performance architectures. One
main problem in selecting WCET-optimizing CIs is the instability of the worst-case path, i.e., when reducing the
latency of the worst-case path by inserting a CI, an entirely different path can become the new worst-case path.
Therefore, WCET bound estimation is an integral part of WCET-optimizing CI selection. Figure 6.1 shows our
envisioned toolflow. CI selection, also referred to as instruction set selection, is the second of the two main steps
in the so-called instruction set extension problem [42]. The first step is the CI generation that is performed when
compiling the application source code. In this step, kernels are identified in the application and partitioned into
segments of code to execute in software and segments to execute in hardware. For the segments to execute in
hardware, several alternatives that differ in resource demands as well as latencies are generated and then synthesized
into configurations for the reconfigurable fabric. CIs provide an assembly-level interface to execute the hardware
segments. Which CIs are implemented in hardware instead of the original software code and how many resources
to allocate per CI is determined by the CI selection according to an optimization goal, e.g., average-case
performance. Several approaches to CI generation exist that can provide CIs and implementation alternatives as an input
to CI selection [42]. Different from existing CI selection approaches that target average-case performance, our
novel WCET-optimizing selection requires the application binary, as it is the only way to be able to obtain precise
WCET bound estimates (see Section 2.2 and [106]). To obtain a finished binary with generated CIs while keeping
the flexibility to execute the original software, we introduce CI super blocks, which will be detailed in Section 6.2.
Effectively, the selection step of the instruction set extension problem is moved from the compiler to the timing
analyzer in this work, i.e., post-compilation. This is achieved by extending the analysis of the conditional jump
that either jumps to the hardware CI, if configured, or the original software code, otherwise, which was introduced
in the previous chapter. The result is an effective technique that considerably reduces the guaranteed WCET bound
compared to the original task that does not use CIs.

1 The work presented in this chapter was originally published in [27]

[Figure 6.1: toolflow diagram. Design Flow: Application Source; Compilation with Custom Instruction (CI) Generation; Application Binary with CI Suggestions; CI Configuration Alternatives (Multiple per CI); WCET-Optimizing Instruction Set Selection; Timing Analysis. Target Architecture: CPU Pipeline; Reconfigurable Fabric (units 1, 2, 3, ..., A); Memory Subsystem (e.g., Caches, Scratchpad, ...). Legend: Contribution of this Work; Existing Instruction Set Extension Flow; Target Hardware.]
Figure 6.1: Toolflow performing WCET-Optimizing Instruction Set Selection integrated with timing analysis. As input to our approach we
take an application binary with suggestions where to place custom instructions as well as different implementation alternatives per
custom instruction, differing in resource requirements and latency.
The novel contributions of this chapter are:
• Modeling the WCET-optimizing instruction set selection problem with support for global program ﬂow infor-
mation and reconﬁguration delay by extending state-of-the-art models used in timing analyzers like AbsInt aiT
[2] or OTAWA [7].
• An optimal solution that effectively reduces the search space by mapping selection candidates to weak compo-
sitions of an integer, i.e., the algorithm recursively generates all distributions of reconﬁgurable fabric area to
CIs while adhering to area constraints. Recursion subtrees corresponding to distributions of area that cannot be
utilized in CI implementations are pruned early. In our evaluation we show that less than 1% of all possible
570,240 selections need to be evaluated when optimizing the EncodeMacroBlock kernel as part of the H.264
encoder with our optimal search algorithm.
• A heuristic solution that performs a maximum number of WCET estimates linear in the partitions of area
available for configuring CIs on the reconfigurable fabric. It reduces the runtime of optimization down to 11.18%
of the optimal search algorithm in the aforementioned EncodeMacroBlock kernel, the most complex kernel
evaluated. Its results produced at most 2.52% lower speedups on the WCET than optimal in our evaluations.
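The mapping of selection candidates to weak compositions mentioned in the contributions above can be sketched as follows. This is a minimal illustration and not the search algorithm evaluated in this chapter: the function names, the per-CI sets of usable area sizes and the choice to leave surplus area unused are assumptions made for the example.

```python
from math import comb
from typing import Iterator, List, Sequence, Set, Tuple


def weak_compositions(total: int, parts: int) -> Iterator[Tuple[int, ...]]:
    """Yield all tuples of `parts` non-negative integers that sum to `total`.

    Requires parts >= 1.
    """
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in weak_compositions(total - first, parts - 1):
            yield (first,) + rest


def feasible_distributions(area: int,
                           usable: Sequence[Set[int]]) -> Iterator[Tuple[int, ...]]:
    """Yield area distributions restricted to sizes a CI can actually use.

    usable[i] holds the area sizes for which CI i has an implementation
    (0 stands for the software implementation). Subtrees that would exceed
    the remaining area are pruned; unused area is allowed (assumption).
    """
    def recurse(index: int, remaining: int,
                prefix: List[int]) -> Iterator[Tuple[int, ...]]:
        if index == len(usable):
            yield tuple(prefix)
            return
        for units in sorted(usable[index]):
            if units > remaining:
                break  # this size and all larger ones do not fit
            yield from recurse(index + 1, remaining - units, prefix + [units])

    yield from recurse(0, area, [])


if __name__ == "__main__":
    # Number of weak compositions of n into k parts is C(n + k - 1, k - 1).
    print(len(list(weak_compositions(4, 3))), comb(4 + 3 - 1, 3 - 1))
    # Two hypothetical CIs on a fabric with 3 area units.
    print(list(feasible_distributions(3, [{0, 1, 2}, {0, 2}])))
```

Restricting each position to sizes with an existing implementation is what keeps the enumerated set far smaller than the full set of weak compositions.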
We show that previous work targeting optimization of the worst-case path, e.g., instruction cache locking or
scratchpad memory allocation of program code, shares similarities with the WCET-optimizing instruction set
selection problem, but cannot be adapted to obtain optimal solutions. For introducing runtime instruction set
reconfiguration as an enabling feature to provide timing-analyzable performance, novel models and solutions are
required.
6.1 Related Work and Motivation
WCET-optimizing instruction set selection bears resemblance to other static optimizations targeting the worst-
case path like instruction cache locking or scratchpad memory allocation of program code. In this section, the
differences of these problems to WCET-optimizing instruction set selection are pointed out. Additionally, state-of-
the-art solutions to instruction set selection speciﬁcally are discussed and their shortcomings explained.
Caches are used to effectively reduce the average memory access latency of a CPU. It is very difﬁcult to predict
whether a memory access can be served by the cache (cache hit) or needs to be served by main memory (cache
miss). WCET analysis always needs to consider a cache miss when it cannot guarantee a cache hit. This typically
leads to overestimation of the WCET bound. Cache locking is a software-controlled mechanism to load code
segments into the cache and prevent them from being evicted. Several works utilize instruction cache locking to
reduce overestimation resulting from cache analysis and thus lower the WCET bound [40, 66, 78]. Similarly,
the instruction cache can be replaced by allocating program code directly to predictable scratchpad memory [39].
Even though these techniques are complementary to instruction set selection, the question arises whether the
same algorithms can be applied. Similar to instruction set selection, the instruction cache locking and program
code allocation problems entail WCET estimation to determine the worst-case path and use this information to
select code segments that can be most profitably sped up for lowering the WCET bound. However, both need to
choose between two alternatives for a code segment only: utilizing the fast memory (i.e., locking it in the cache or
allocating it in scratchpad memory) or main memory. Instruction set selection has several alternatives to choose
from: the original software or different CI implementations for the same functionality with different degrees of
parallelism and therefore different delays as well as resource requirements. Even with extensions for evaluating
multiple alternatives to choose from (e.g., the different CI implementations), existing algorithms for cache locking
would remain unsuitable for our problem. In [40] and [66] the problem is modeled similarly using Execution Flow
Graphs and Execution Flow Trees, respectively. However, the execution ﬂow is modeled on the level of function
calls. As this work targets kernels, the aim of this chapter is to model the function-internal control ﬂow.
In [78] as well as [39] function-internal control ﬂow is modeled similarly to the instruction set selection presented
in [112], which in turn is an ILP formulation of a WCET estimation technique called timing schema [76]. Timing
schema is a tree-based WCET estimation technique (see [38] for an overview of estimation techniques). In current
timing analyzers, it was succeeded by the more powerful Implicit Path Enumeration Technique (IPET) [64], which
was introduced in Section 2.2.1. Timing schema is still commonly used in state-of-the-art WCET optimization
approaches however, because it is computationally cheap and it enables WCET optimization to be modeled as
a single ILP (as opposed to the combinatorial problem that we present in Section 6.3). In timing schema, the
estimation is calculated by building a representation which generally corresponds to the abstract syntax tree of
the program and traversing it bottom-up by simple recursive rules. Infeasible path information cannot efﬁciently
be applied, because the recursive rules are local to program statements [38]. This can lead to imprecise WCET
estimates as shown in the simple example in Fig. 6.2: the rules are unable to capture the global information that
the true case of the if statement can appear at most 5 times in the worst-case path. In this example, timing
schema produces an estimate based on a program path that executes the true case 100 times and therefore this
case seems to be the most proﬁtable candidate to be optimized. However, this path never appears in an actual
execution of the program. State-of-the-art timing analyzers can correctly determine that the false case dominates
the WCET in the example in Fig. 6.2 using value analysis and generating constraints for IPET. Therefore, when
utilizing a computationally cheap, but imprecise, WCET estimation technique like timing schema during WCET
optimization, the allocated resources may not even be utilized in the ﬁnal WCET bound that is obtained using a
timing analyzer. Additionally, state-of-the-art timing analyzers support powerful annotation languages to provide
global path information [59] (the impact on WCET optimization is evaluated in Section 6.6.3). Thus, we propose
to extend state-of-the-art timing analysis using IPET to support WCET optimization, as opposed to treating WCET
optimization and timing analysis as two separate processes.
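The bottom-up calculation of timing schema for the example of Fig. 6.2 can be reproduced with a short sketch. The class and function names are hypothetical; the rule for statement sequences (summing the parts) is an assumption that matches the calculation in the figure but is not part of the rule excerpt shown there.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Stmt:
    cost: int  # latency of a simple statement


@dataclass(frozen=True)
class Seq:
    parts: Tuple["Node", ...]


@dataclass(frozen=True)
class If:
    cond_cost: int
    true_branch: "Node"
    false_branch: "Node"


@dataclass(frozen=True)
class While:
    cond_cost: int
    bound: int  # loop bound n
    body: "Node"


Node = Union[Stmt, Seq, If, While]


def timing_schema(node: Node) -> int:
    """Apply the local, recursive timing schema rules bottom-up."""
    if isinstance(node, Stmt):
        return node.cost
    if isinstance(node, Seq):
        return sum(timing_schema(part) for part in node.parts)
    if isinstance(node, If):
        # T(if) = T(Exp) + max(T(T), T(F)): no knowledge of how often each case runs
        return node.cond_cost + max(timing_schema(node.true_branch),
                                    timing_schema(node.false_branch))
    if isinstance(node, While):
        # T(while) = T(Exp) + n * (T(Exp) + T(Body))
        return node.cond_cost + node.bound * (node.cond_cost
                                              + timing_schema(node.body))
    raise TypeError(f"unsupported node: {node!r}")


T_TRUE, T_FALSE, T_EXP = 8, 4, 1
EXAMPLE = While(T_EXP, 100,
                Seq((If(T_EXP, Stmt(T_TRUE), Stmt(T_FALSE)), Stmt(T_EXP))))

if __name__ == "__main__":
    print("timing schema estimate:", timing_schema(EXAMPLE))
    # Global flow fact the local rules cannot express: the true case runs
    # at most 5 times, the false case the remaining 95 times.
    print("true-case share:", 5 * T_TRUE, "false-case share:", 95 * T_FALSE)
```

The estimate charges the true case in all 100 iterations, whereas the flow fact shows that the false case accounts for the larger share of the branch cost.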
Timing Schema Rules (excerpt):
• T(if (Exp) T else F) = T(Exp) + max(T(T), T(F))
• T(while (Exp) Body) = T(Exp) + n · (T(Exp) + T(Body))

Program Code:
    int i = 0;
    while (i < 100) {
        if (i < 5)
            ..; // tT = 8
        else
            ..; // tF = 4
        i++;
    }

Syntax Tree:
    while
        i < 100
        if
            i < 5
            // true
            // false
        i++

Bottom-up Calculation (with T(Exp) = 1):
(i) tif = T(if (i < 5) T else F) = T(i < 5) + max(tT, tF) = 1 + max(8, 4) = 9
⇒ true case explicitly determined as worst-case path.
(ii) T(while (i < 100) Body) = T(i < 100) + 100 · (T(i < 100) + tif + T(i++))
= 1 + 100 · (1 + 9 + 1) = 1101
⇒ Decision: optimize true case, e.g., using cache locking or a custom instruction.

Actual WCET:
T(i < 100) + 101 · T(i < 5) + 5 · tT + 95 · tF + 100 = 622
⇒ In contrast to the result obtained by Timing Schema, the false case dominates the WCET. Optimizations
relying on timing schema would therefore allocate resources on the wrong path.

Figure 6.2: Simple example that shows how WCET optimization approaches that rely on Timing Schema perform suboptimal decisions

In [112], WCET-optimizing instruction set selection for instruction set extensible processors is performed. These
processors contain custom functional units that can be configured to implement frequently used instruction patterns
for speedups by exploiting instruction level parallelism and operator chaining [111]. According to the processor
model used in that work, the presented heuristic assumes a uniform cost per selected pattern (i.e., occupation of
one custom functional unit). The WCET-optimizing instruction set is selected per task, i.e., during task execution
the instruction set is ﬁxed. Therefore, the cost of conﬁguring a selected pattern is not taken into account in their
approach. In this chapter, dynamic reconﬁguration of custom instructions with varying area demands is targeted (1
up to A units of the reconﬁgurable fabric area). For evaluating the proﬁt of an instruction on reducing the WCET
estimate, its required area demands as well as its reconﬁguration delay need to be factored in. The impact of
reconﬁguration delay on WCET optimization is evaluated in Section 6.6.2.
In summary, state-of-the-art WCET optimization approaches model program ﬂow at the level of function calls,
rely on the imprecise timing schema, do not consider reconﬁguration delay while evaluating the proﬁts of potential
decisions or support binary decisions only (either optimize a certain path or not). In the following, all of these
shortcomings are resolved.
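The calculation of Fig. 6.2 can be retraced with a few lines of Python. The following is a minimal sketch and not part of the tool flow described in this thesis: all function and constant names are chosen for illustration, and the flow-aware count simply evaluates the expression printed in the figure under "Actual WCET".

```python
"""Timing schema vs. flow-aware count for the loop of Fig. 6.2 (illustrative sketch)."""

T_EXP = 1         # cost of evaluating one expression, T(Exp) = 1 as in the figure
T_TRUE = 8        # tT: cost of the true branch
T_FALSE = 4       # tF: cost of the false branch
T_INCR = 1        # cost of i++
LOOP_BOUND = 100  # n: number of loop iterations
TRUE_COUNT = 5    # the true branch is only feasible for i = 0..4


def schema_if(t_exp, t_true, t_false):
    """T(if (Exp) T else F) = T(Exp) + max(T(T), T(F))."""
    return t_exp + max(t_true, t_false)


def schema_while(t_exp, n, t_body):
    """T(while (Exp) Body) = T(Exp) + n * (T(Exp) + T(Body))."""
    return t_exp + n * (t_exp + t_body)


def timing_schema_bound():
    """Bottom-up calculation; every iteration is charged with the true branch."""
    t_if = schema_if(T_EXP, T_TRUE, T_FALSE)
    return schema_while(T_EXP, LOOP_BOUND, t_if + T_INCR)


def flow_aware_count():
    """Expression printed in Fig. 6.2 under 'Actual WCET'."""
    false_count = LOOP_BOUND - TRUE_COUNT
    return (T_EXP + 101 * T_EXP
            + TRUE_COUNT * T_TRUE + false_count * T_FALSE
            + LOOP_BOUND * T_INCR)


def branch_shares():
    """Cycles spent in the true and in the false branch on the real worst-case path."""
    return TRUE_COUNT * T_TRUE, (LOOP_BOUND - TRUE_COUNT) * T_FALSE


if __name__ == "__main__":
    print("timing schema bound:", timing_schema_bound())
    print("flow-aware count:   ", flow_aware_count())
    print("true/false shares:  ", branch_shares())
```

The branch shares make the point of the figure explicit: the false branch accounts for far more cycles than the true branch, although the timing schema selects the true branch as the worst case.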
6.2 System Model
Similar to the timing analysis presented in the previous chapter, the optimization presented in this chapter is
applied to the reconstructed control-ﬂow graph (CFG) of an application in binary form, as it is the only way
to obtain safe and precise WCET estimates [106]. To enable the WCET-optimizing selection of CIs, additional
compile-time information is required: potential CIs and their possible conﬁgurations to choose from (see [42] for
an overview). The granularity of a CI, i.e., the amount of software it replaces, depends on the speciﬁc target
architecture. In our evaluation, CIs replace 12 to 123 lines of C code (see Section 6.6.1). For conﬁguring the CIs
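[Figure 6.3 (image not reproduced). Labels in the figure: CI super block with edges d_s and d_t = d_SW + d_HW;
decision "CI Available?" in block x_i with outcome false (d_SW: software implementation, first basic block,
y_{ci(i),0}) and outcome true (d_HW: CI hardware implementations, assembly instruction, y_{ci(i),j}, j > 0);
successor block x_{i+1}.]
Figure 6.3: CI super block as part of a CFG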
in hardware, we assume reconfigurable fabric area to be allocatable in up to $A$ discrete units. This corresponds to
the common area model of dividing the fabric area into $A$ equally-sized partitions (see footnote 2), as in the 1D or
2D partitioned area models in [93]. As in the previous chapter, reconfigurations are requested before beginning
execution of a kernel to configure CIs that speed up the kernel's computations, as shown in Fig. 6.4 (a). Let $CI$ be
the set of all CIs. We assume a specific configuration $j$ of a CI $k \in CI$ in hardware to have a constant delay
$t_{k,j}$ (cycles spent in the pipeline's execution stage), to require area $a_{k,j} \in [1, A]$ on the
reconfigurable fabric, and to take a constant reconfiguration delay $r_{k,j}$ for configuring it on the fabric. For a
constant reconfiguration delay, a constant bandwidth for transferring configuration data to the reconfigurable
fabric's configuration memory needs to be guaranteed, e.g., by employing CoRQ (see Chapter 4 for details). Stalling
the CPU during reconfiguration is assumed in this chapter for WCET optimization (see Fig. 4.1). Note that the
resulting CI selection can directly be used in a system that employs software emulation and parallel reconfiguration
at runtime after a WCET bound is obtained using the timing analysis approach of Chapter 5.
In addition to hardware configurations, a CI can be implemented using its original software code ($j = 0$). The
software implementation does not have a constant delay $t_{k,0}$, because it is subject to, e.g., cache and pipeline
analysis in the specific context in which it is executed. It requires neither fabric area nor reconfiguration delay
(i.e., $a_{k,0} = r_{k,0} = 0$). To provide the flexibility to execute the original software for generated CIs, we
introduce CI super blocks (a timing analysis construct based on the conditional branch used in Chapter 5). As
shown in Fig. 6.3, a CI super block begins with a conditional branch before every CI (the actual instruction in the
binary), which jumps to the functionally equivalent software code when the CI is not implemented in hardware.
If a configuration for the CI is available on the reconfigurable fabric, the CI is executed instead of jumping to the
software. The CI super block ends by joining the paths of hardware CI and software. Multiple CI super blocks in the
binary can execute the same CI $k$. Let $B$ be the set of all blocks, i.e., basic blocks (not contained in super
blocks) as well as super blocks. The function $ci(i)$ determines which CI $k$ is executed by a super block $i \in B$,
i.e.:
$$ci\colon B \to CI \cup \{0\},\quad i \mapsto k,\quad \text{with } ci(i) = 0 \notin CI \text{ if } i \text{ is a basic block (not a super block)} \tag{6.1}$$
The context-dependent delay for executing implementation $j$ of CI super block $i$ is denoted as $e_{i,j}$ for hardware
as well as software implementations. While CI execution on the reconfigurable fabric itself is context independent
($t_{ci(i),j}$ is constant for $j > 0$), invoking the CI from the CPU pipeline can add cycles, e.g., because of
pipeline hazards or an instruction fetch miss of the CI. Therefore, $e_{i,j} \ge t_{ci(i),j}$ for $j > 0$. Consider the example of
Footnote 2: A partition directly maps to a reconfigurable container on the evaluation platform i-Core (see Section 2.4).
The more general term 'partition' is employed throughout this chapter for consistency with commonly used area models
such as those presented in [93].
Fig. 6.4 (a), which provides the input to the WCET-optimizing instruction set selection. In this example, two CIs
were generated, one with $m_1 = 2$ and the other with $m_2 = 3$ different hardware implementations. From
microarchitectural analysis (see Section 2.2), the worst-case bound per block is obtained, which considers, e.g.,
cache, pipeline, or branch prediction effects (see Fig. 6.4 (b)). This way, $e_{4,0}$ and $e_{6,0}$ can be unequal,
even though both blocks execute the same CI ($ci(4) = ci(6) = 1$) in the same implementation ($j = 0$). $e_{i,j}$ is
the main parameter used to calculate WCET estimates based on a specific selection of implementations in Section 6.3.
Equation (6.1) is used to concisely formulate Eqs. (6.4) to (6.8), on which our WCET estimation is based.
Effectively, a CFG is obtained that is parameterized by a CI selection using CI super blocks. In the following, the
WCET bound estimation technique IPET (introduced in Section 2.2.1) is extended to the problem formulation of this
chapter for evaluating and directing the WCET optimization.
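To make the system model concrete, the inputs of the example in Fig. 6.4 (a) and (b) can be written down as plain data. The following Python sketch is only an illustration under stated assumptions: the class and variable names are invented here, and block 7 is mapped to ci(7) = 0 because the figure lists it with a basic block delay c_7. The check function verifies the two properties stated above, namely a_{k,0} = r_{k,0} = 0 and e_{i,j} >= t_{ci(i),j} for j > 0.

```python
"""System model of Section 6.2 as plain data (illustrative sketch, names are assumptions)."""
from dataclasses import dataclass


@dataclass(frozen=True)
class CustomInstruction:
    area: tuple          # a_{k,j}; index 0 is the software implementation
    reconf_delay: tuple  # r_{k,j}
    latency: tuple       # t_{k,j}; None for software (undefined, context dependent)


# Values taken from Fig. 6.4 (a).
CIS = {
    1: CustomInstruction(area=(0, 3, 3), reconf_delay=(0, 10, 10),
                         latency=(None, 10, 12)),
    2: CustomInstruction(area=(0, 2, 4, 5), reconf_delay=(0, 7, 12, 16),
                         latency=(None, 15, 11, 9)),
}

# ci(i) from Fig. 6.4 (b); 0 marks a basic block.
# ASSUMPTION: ci(7) = 0, since block 7 appears with a basic block delay c_7.
CI_OF_BLOCK = {1: 0, 2: 0, 3: 0, 4: 1, 5: 2, 6: 1, 7: 0}

# Context-dependent worst-case delays e_{i,j} of the CI super blocks, Fig. 6.4 (b).
SUPER_BLOCK_DELAY = {4: (50, 10, 12), 5: (60, 18, 14, 12), 6: (48, 10, 12)}


def check_model():
    """Return a list of violated model properties (empty if the data is consistent)."""
    problems = []
    for k, ci in CIS.items():
        if ci.area[0] != 0 or ci.reconf_delay[0] != 0:
            problems.append(f"CI {k}: software must not need area or reconfiguration")
    for block, delays in SUPER_BLOCK_DELAY.items():
        ci = CIS[CI_OF_BLOCK[block]]
        if len(delays) != len(ci.area):
            problems.append(f"block {block}: wrong number of implementations")
            continue
        for j in range(1, len(delays)):
            if delays[j] < ci.latency[j]:
                problems.append(f"block {block}, j={j}: e < t")
    return problems


if __name__ == "__main__":
    print(check_model() or "model is consistent")
```

Blocks 4 and 6 both invoke CI 1, yet their software delays differ (50 versus 48 cycles), which is exactly the context dependence that e_{i,j} captures.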
6.3 Problem Formulation
In order to obtain precise WCET estimates that utilize global program flow information during instruction set
selection, the system model of Section 6.2 and global bound calculation using IPET (see Section 2.2.1) are integrated
in the following. Selecting an instruction set to optimize the WCET bound essentially means that the WCET is
minimized over all possible selections, i.e., the aim is to minimize the maximum execution time. In the following,
the ILP formulation of IPET is extended to capture the implementation alternatives of a CI $k \in CI$. To this end,
new variables $y_{k,j} \in \{0,1\}$ are introduced for every implementation $j$, with $y_{k,j} = 1$ if CI $k$ is
implemented using alternative $j$ and $y_{k,j} = 0$ otherwise. E.g., $y_{k,0} = 1$ means that CI $k$ is not
implemented in hardware but utilizes its original software instead (see Section 6.2 and Fig. 6.3). The following
constraint is introduced to ensure that exactly one implementation is chosen, either in software ($j = 0$) or in
hardware ($j > 0$), with $m_k$ being the number of hardware configurations of CI $k$:
$$\sum_{j=0}^{m_k} y_{k,j} = 1 \quad \forall k \in CI \tag{6.2}$$
To only allow solutions that ﬁt onto the reconﬁgurable fabric, the following area constraint is introduced:
$$\sum_{k=1}^{|CI|} \sum_{j=0}^{m_k} a_{k,j}\, y_{k,j} \le A \in \mathbb{N}_0 \tag{6.3}$$
I.e., the sum of the area $a_{k,j}$ on the reconfigurable fabric required to implement all CIs $k$ using the selected
implementation $j$ (for which $y_{k,j} = 1$) needs to be lower than or equal to the total fabric area $A$. Any
$y \in \{0,1\}^{|CI| \times M}$, with $M = \max_{k \in CI} m_k + 1$, satisfying Eq. (6.2) and Eq. (6.3) is a feasible
instruction set selection. As shown in Fig. 6.4 (c), the obtained constraints are used to extend the constraints
generated by IPET.
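The two constraints can be checked mechanically. The following Python sketch (an illustration only, with invented helper names) encodes a selection as one 0/1 row per CI, tests Eq. (6.2) and Eq. (6.3) for the area demands of Fig. 6.4 (a), and enumerates all feasible selections of the example by brute force.

```python
"""Feasibility of a selection y according to Eqs. (6.2) and (6.3) (illustrative sketch)."""
from itertools import product

AREA = {1: (0, 3, 3), 2: (0, 2, 4, 5)}  # a_{k,j} from Fig. 6.4 (a), j = 0 is software
TOTAL_AREA = 5                           # A


def one_hot(j, length):
    """Row (y_{k,0}, ..., y_{k,m_k}) that selects implementation j."""
    return tuple(1 if idx == j else 0 for idx in range(length))


def is_feasible(y):
    """y maps each CI k to its 0/1 row; checks Eq. (6.2) and Eq. (6.3)."""
    for k, row in y.items():
        if len(row) != len(AREA[k]):
            return False
        # Eq. (6.2): exactly one implementation per CI.
        if any(v not in (0, 1) for v in row) or sum(row) != 1:
            return False
    # Eq. (6.3): the selected implementations must fit onto the fabric.
    used = sum(a * v for k, row in y.items() for a, v in zip(AREA[k], row))
    return used <= TOTAL_AREA


def feasible_selections():
    """All feasible selections, written as tuples (j_1, j_2) of chosen implementations."""
    keys = sorted(AREA)
    result = []
    for choice in product(*(range(len(AREA[k])) for k in keys)):
        y = {k: one_hot(j, len(AREA[k])) for k, j in zip(keys, choice)}
        if is_feasible(y):
            result.append(choice)
    return result


if __name__ == "__main__":
    print(feasible_selections())
```

Of the 3 x 4 one-hot combinations in the example, the combinations that pair a hardware implementation of CI 1 (3 partitions) with the larger implementations of CI 2 (4 or 5 partitions) are rejected by the area constraint.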
The objective function for optimizing the WCET in the presence of CI super blocks is developed as follows. The
system model introduced in Section 6.2 enables us to capture every implementation alternative as a single super
block in the CFG (see Fig. 6.3). The total cycle contribution of CI k’s super block i to the WCET bound is given
as:
$$\sum_{j=0}^{m_k} e_{i,j}\, y_{k,j}\, x_i \tag{6.4}$$
(a) Input to WCET-Optimizing Timing Analysis:
[CFG of the kernel (image not reproduced) with blocks $x_1, \ldots, x_7$ and edges $d_1, \ldots, d_9$. Node labels in the
figure: Reconfigure y; for (i = 1..100); if (i < 5) with true and false successors; Invoke CI 1; Invoke CI 2;
Invoke CI 1. Legend: Basic Block, CI Super Block.]
Reconfigurable area: $A = 5$
Generated CIs: $CI = \{1, 2\}$, $|CI| = 2$
Hardware implementations per CI: $m_1 = 2$, $m_2 = 3$
Area demands ($a_{k,0}$: software): $a_1 = (0, 3, 3)$, $a_2 = (0, 2, 4, 5)$
Reconfiguration delays: $r_1 = (0, 10, 10)$, $r_2 = (0, 7, 12, 16)$
CI latencies on reconfigurable fabric (undefined for software: $t_{k,0} = \bot$): $t_1 = (\bot, 10, 12)$, $t_2 = (\bot, 15, 11, 9)$
(b) Obtained from Microarchitectural Analysis:
Worst-case basic block delays: $c_1, c_2, c_3, c_7 \in \mathbb{N}$
Invoked CIs: $ci(\{1,2,3\}) = 0$, $ci(\{4,6\}) = 1$, $ci(\{5\}) = 2$
Worst-case CI Super Block delays (in order of invoked CI): $e_4 = (50, 10, 12)$, $e_6 = (48, 10, 12)$, $e_5 = (60, 18, 14, 12)$
($e_{k,j} \ge t_{k,j}$ for $j > 0$, because execution history-dependent, see Section 6.2)
(c) Generated Constraints:
Program Structure (by IPET):
$1 = x_1 = d_1$ (kernel entry constraint)
$x_2 = d_1 + d_9 = d_2 + d_3$
$x_3 = d_3 = d_4 + d_5$
$x_4 = d_4 = d_6$
$x_5 = d_5 = d_7$
$x_6 = d_6 + d_7 = d_8$
$x_7 = d_8 = d_9$
Global Information:
$x_3 \le 100 \cdot d_1$ (upper loop bound)
$x_4 \le 5 \cdot d_1$ (true case max. 5 times)
CIs and Reconfigurable Fabric:
$\sum_{j=0}^{2} y_{1,j} = 1$, $\sum_{j=0}^{3} y_{2,j} = 1$ (exactly one configuration per CI)
$\sum_{k=1}^{2} \sum_{j=0}^{m_k} a_{k,j}\, y_{k,j} \le 5$ (area constraint)
(d) Generated Combinatorial Objective Function:
$$\min_{y \in \{0,1\}^{2 \times 4}} \left( \max_{x \in \mathbb{N}_0^{6}} \left( c_1 x_1 + c_2 x_2 + c_3 x_3 + c_7 x_7 + \sum_{j=0}^{2} e_{4,j}\, y_{1,j}\, x_4 + \sum_{j=0}^{3} e_{5,j}\, y_{2,j}\, x_5 + \sum_{j=0}^{2} e_{6,j}\, y_{1,j}\, x_6 \right) + \sum_{j=0}^{2} y_{1,j}\, r_{1,j} + \sum_{j=0}^{3} y_{2,j}\, r_{2,j} \right)$$
Figure 6.4: Simple example of how an instance of the problem formulated in Sections 6.2 and 6.3 is generated
E.g., when choosing the software implementation, the cycle contribution becomes $e_{i,0}\, x_i$, which directly
resembles the contribution of a basic block in IPET's objective function ($\max \sum_{i=1}^{N} c_i x_i$, see
Section 2.2.1). The WCET for a given selection $y$ without accounting for reconfiguration delay can be determined as:
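$$\mathrm{WCET}'(y) := \max_{x \in \mathbb{N}_0^{|B|}} \left( \sum_{\substack{i=1 \\ ci(i) \notin CI}}^{|B|} c_i x_i + \sum_{\substack{i=1 \\ ci(i) \in CI}}^{|B|} \sum_{j=0}^{m_{ci(i)}} e_{i,j}\, y_{ci(i),j}\, x_i \right) \tag{6.5}$$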
Additionally, the reconﬁguration delay induced by a selection y needs to be accounted for. Neglecting it could
result in suboptimal selections in which the time spent conﬁguring the selected CIs outweighs the time saved
by performing hardware-accelerated calculations (more details in Section 6.6.2). Every CI super block utilized
in a kernel is conﬁgured exactly once before entering the kernel (with zero reconﬁguration delay for software
implementation). Therefore, the WCET including reconﬁguration delay is obtained as follows:
$$\mathrm{WCET}_r(y) := \mathrm{WCET}'(y) + \sum_{k=1}^{|CI|} \sum_{j=0}^{m_k} y_{k,j}\, r_{k,j} \tag{6.6}$$
For every selection y, an ILP instance that determines the WCET of the kernel when reconﬁguring y is obtained.
E.g., when selecting the software implementation for every CI, the following objective function is obtained, which
again resembles an objective function of an IPET problem instance without any CIs:
$$\max_{x \in \mathbb{N}_0^{|B|}} \left( \sum_{\substack{i=1 \\ ci(i) \notin CI}}^{|B|} c_i x_i + \sum_{\substack{i=1 \\ ci(i) \in CI}}^{|B|} e_{i,0}\, x_i \right) \tag{6.7}$$
Note that for every choice of $y$, only the objective function $\mathrm{WCET}_r(y)$ changes, while the constraints
remain static once they have been generated.
Putting it all together, the WCET-optimizing instruction set selection problem becomes a combinatorial problem
with the following objective function:
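$$\min_{y \in \{0,1\}^{|CI| \times M}} \left( \max_{x \in \mathbb{N}_0^{|B|}} \left( \sum_{\substack{i=1 \\ ci(i) \notin CI}}^{|B|} c_i x_i + \sum_{\substack{i=1 \\ ci(i) \in CI}}^{|B|} \sum_{j=0}^{m_{ci(i)}} e_{i,j}\, y_{ci(i),j}\, x_i \right) + \sum_{k=1}^{|CI|} \sum_{j=0}^{m_k} y_{k,j}\, r_{k,j} \right) \tag{6.8}$$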
The objective function for our example in Fig. 6.4 is shown in Fig. 6.4 (d). As there are finitely many choices for
$y \in \{0,1\}^{|CI| \times M}$ ($|CI|$ and $M$ are finite), Eq. (6.8) could be transformed into a single ILP by
resolving the $\min_y(\ldots)$ of Eq. (6.5) into one constraint per choice of $y$. However, this would result in up to
$2^{|CI| \cdot M}$ constraints of high complexity, which becomes practically infeasible even for small values. Also
note that the ILPs only need to be evaluated per kernel and not for the whole application. Therefore, the ILPs are
considerably less complex (fewer variables and constraints) than the ILP for determining the WCET of the whole
application. In the following section, we show how the search space can be pruned and how feasible $y$ are generated
efficiently.
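For the small example of Fig. 6.4, Eq. (6.5) and Eq. (6.6) can be evaluated without an ILP solver by enumerating all execution counts that satisfy the constraints of Fig. 6.4 (c). The following Python sketch is a stand-in for the ILP and not the implementation used in this work. It assumes c_1 = c_2 = c_3 = c_7 = 1, because the figure leaves the basic block delays symbolic, and it identifies a selection by the chosen implementation j per CI.

```python
"""Brute-force evaluation of WCET'(y) and WCET_r(y) for Fig. 6.4 (illustrative sketch)."""

E = {4: (50, 10, 12), 5: (60, 18, 14, 12), 6: (48, 10, 12)}  # e_{i,j}, Fig. 6.4 (b)
CI_OF = {4: 1, 5: 2, 6: 1}                                    # ci(i) of the super blocks
R = {1: (0, 10, 10), 2: (0, 7, 12, 16)}                       # r_{k,j}, Fig. 6.4 (a)
C = {1: 1, 2: 1, 3: 1, 7: 1}  # ASSUMPTION: c_i = 1, the figure only states c_i in N


def block_counts():
    """Enumerate every x that satisfies the IPET constraints of Fig. 6.4 (c).

    The structural constraints reduce to x1 = 1, x2 = 1 + n, x3 = x6 = x7 = n,
    x4 = p and x5 = n - p, with the flow facts n <= 100 and p <= 5.
    """
    for n in range(101):
        for p in range(min(5, n) + 1):
            yield {1: 1, 2: 1 + n, 3: n, 4: p, 5: n - p, 6: n, 7: n}


def wcet_without_reconf(sel):
    """WCET'(y), Eq. (6.5); sel maps CI k to the selected implementation j."""
    def cycles(x):
        basic = sum(C[i] * x[i] for i in C)
        supers = sum(E[i][sel[CI_OF[i]]] * x[i] for i in E)
        return basic + supers
    return max(cycles(x) for x in block_counts())


def wcet_with_reconf(sel):
    """WCET_r(y), Eq. (6.6): add the reconfiguration delay of the selection."""
    return wcet_without_reconf(sel) + sum(R[k][j] for k, j in sel.items())


if __name__ == "__main__":
    for sel in ({1: 0, 2: 0}, {1: 0, 2: 3}, {1: 1, 2: 1}):
        print(sel, wcet_with_reconf(sel))
```

Under these assumptions, the all-software selection yields 11102 cycles, whereas implementing both CIs with their smallest hardware configuration yields 3119 cycles including 17 cycles of reconfiguration delay. The worst-case path itself depends on the selection: the true case (x_4 = 5) is only on the worst-case path when block 4 is slower than block 5.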
6.4 Optimal Solution
In theory, up to $2^{|CI| \cdot M}$ possible selections $y$ need to be evaluated. In practice, however, the search
space is considerably smaller for the following reasons:
• The number of possible hardware configurations $m_k$ per CI $k$ varies a lot, e.g., in our evaluation we had a
minimum of 1 to a maximum of $78 = M$ implementations for CIs (including software implementation) within
one kernel (more details in Section 6.6). From these $\sum_{k=1}^{|CI|} (m_k + 1) \le |CI| \cdot M$ different CI
implementations in total, again in practice only a small subset is relevant. For the CI with 78 different
implementations, many implementations had different degrees of parallelism and latencies, but required the same
amount of area and reconfiguration delay when synthesized to the reconfigurable fabric. When considering only the
minimum-latency implementation per required fabric area, our algorithm was able to prune the number of
implementations to 10 relevant ones. Therefore, in practice the relevant number of implementations per CI $k$ is much
smaller than $m_k + 1$.
• Additionally, the possible selections can be pruned considerably when applying the area constraint early (see
Eq. (6.3)). Consider the inner sum of Eq. (6.3): it models the allocation of area per CI for a specific selection as
a tuple $a = (a_1, a_2, \ldots, a_{|CI|})$. To prune the search space, we determine the number of unique tuples
fulfilling Eq. (6.3) in the following. Having a total area of $A$ on the reconfigurable fabric means the number of
selections utilizing the whole fabric is equal to the number of possibilities to distribute the area to CIs such
that $\sum_{k=1}^{|CI|} a_k = A$ (allowing $a_k = 0$ for the software implementation). I.e., the number of selections
utilizing the whole fabric is the number of so-called weak compositions of the integer $A$ into exactly $|CI|$ parts,
which is $\binom{A+|CI|-1}{|CI|-1}$ [51]. The number of all unique tuples fulfilling Eq. (6.3) (which additionally
allows less than $A$ area to be distributed, i.e., $\sum_{k=1}^{|CI|} a_k \le A$) is exactly
$\sum_{s=0}^{A} \binom{s+|CI|-1}{|CI|-1} < 2^{A+|CI|}$. Effectively, a maximum of $2^{A+|CI|}$ ILPs need to be solved to
find the WCET-optimal selection.
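The counting argument of the second observation can be checked numerically. The following Python sketch (illustrative only, all names are assumptions) enumerates weak compositions explicitly and compares their number with the closed-form binomial coefficients, using `math.comb` from the standard library.

```python
"""Counting area tuples: weak compositions of A into |CI| parts (illustrative sketch)."""
from math import comb


def weak_compositions(total, parts):
    """Yield all tuples of `parts` non-negative integers that sum to `total` (parts >= 1)."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in weak_compositions(total - first, parts - 1):
            yield (first,) + rest


def count_exact(area, num_cis):
    """Number of tuples a with sum(a) == area: binom(area + |CI| - 1, |CI| - 1)."""
    return comb(area + num_cis - 1, num_cis - 1)


def count_up_to(area, num_cis):
    """Number of tuples a with sum(a) <= area, i.e., all tuples fulfilling Eq. (6.3)."""
    return sum(count_exact(s, num_cis) for s in range(area + 1))


if __name__ == "__main__":
    area, num_cis = 5, 2  # A and |CI| of the example in Fig. 6.4
    print(len(list(weak_compositions(area, num_cis))), count_exact(area, num_cis))
    print(count_up_to(area, num_cis), "<", 2 ** (area + num_cis))
```

For the example of Fig. 6.4 (A = 5, |CI| = 2), there are 6 tuples that use the whole fabric and 21 tuples that fulfill the area constraint, compared to the bound of 2^7 = 128.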
Algorithm 1 Recursive Search for Optimal Selection
1: y_best ← (0, …, 0), WCET_best ← ∞    ▷ For brevity, global variables to obtain result
2: function OPTSEARCH(A, k, y)    ▷ Remaining area A, CI k, (partial) selection y
3:   if k < |CI| + 1 then    ▷ Function was called to select CI k
4:     for a_k ← 0, A do    ▷ For all possible values of a_k
5:       y′_k ← GETMINLATENCYIMPL(k, a_k)
6:           ▷ Minimum latency implementation for CI k allocating exactly a_k area
7:       if y′_k ≠ 0 then    ▷ Implementation allocating exactly a_k area exists
8:         y′ ← (y_1, …, y_{k−1}, y′_k, 0, …, 0)^T    ▷ Add found implementation to current selection
9:         OPTSEARCH(A − a_k, k + 1, y′)    ▷ Branch to another recursion subtree
10:       end if
11:     end for
12:   else    ▷ Implementations for all CIs selected, evaluate resulting selection y
13:     if WCET_r(y) < WCET_best then    ▷ Calculate WCET bound for y, see Eq. (6.6)
14:       y_best ← y, WCET_best ← WCET_r(y)    ▷ Save so far best evaluated selection
15:     end if
16:   end if
17: end function
Combining both observations leads to an additional opportunity for pruning, which our optimal search algorithm
shown in Algorithm 1 exploits. The algorithm recursively generates the weak compositions of $A$ into exactly $|CI|$
parts as tuples $a = (a_1, a_2, \ldots, a_{|CI|})$. In the initial call OPTSEARCH($A$, 1, $y$), the algorithm
enumerates the possible values of $a_1 \in \{0, \ldots, A\}$ (Line 4). For every value of $a_1$, it tries to find the
best implementation (minimum latency) of CI 1 requiring exactly $a_1$ area (Line 5). The recursive calls
OPTSEARCH($A - a_1$, 2, $y$) take place only if an implementation for a chosen $a_1$ is found. Otherwise, the whole
recursion subtree for the value of $a_1$ is pruned. Every leaf of the recursion tree ($k = |CI| + 1$) defines a unique
selection $y$ fulfilling Eq. (6.2) as well as Eq. (6.3) and is evaluated by solving the ILP of Eq. (6.8) (Line 13).
Figure 6.5 visualizes how pruning is applied and how generated tuples correspond to selection candidates for the
input provided by the example in Fig. 6.4. The effectiveness of our approach of pruning the search space by
recursively generating weak compositions of $A$ is evaluated in Section 6.6.4. While it proves effective in practice,
the number of candidates to be evaluated can still grow exponentially in $A$ and $|CI|$. Therefore, a heuristic
solution is presented in the following section.
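The recursive search can be sketched in Python for the example of Fig. 6.4. This is an illustration under stated assumptions rather than the OTAWA-based implementation of this work: the inner maximization is done by enumeration instead of an ILP solver, the basic block delays are assumed to be c_i = 1, a selection is represented by the chosen implementation j per CI, and the software latency is set to a placeholder value of 0 because it is the only implementation with zero area.

```python
"""Sketch of Algorithm 1 (recursive optimal search) for the example of Fig. 6.4."""

AREA = {1: (0, 3, 3), 2: (0, 2, 4, 5)}          # a_{k,j}
RECONF = {1: (0, 10, 10), 2: (0, 7, 12, 16)}    # r_{k,j}
LATENCY = {1: (0, 10, 12), 2: (0, 15, 11, 9)}   # t_{k,j}; 0 is a placeholder for software
E = {4: (50, 10, 12), 5: (60, 18, 14, 12), 6: (48, 10, 12)}  # e_{i,j}
CI_OF = {4: 1, 5: 2, 6: 1}                      # ci(i) of the super blocks
C = {1: 1, 2: 1, 3: 1, 7: 1}                    # ASSUMPTION: c_i = 1
TOTAL_AREA = 5                                  # A


def wcet_r(sel):
    """WCET_r(y), Eq. (6.6), by enumerating all x allowed by Fig. 6.4 (c)."""
    best = 0
    for n in range(101):                 # x3 = n <= 100 (loop bound)
        for p in range(min(5, n) + 1):   # x4 = p <= 5 (true case)
            x = {1: 1, 2: 1 + n, 3: n, 4: p, 5: n - p, 6: n, 7: n}
            cycles = sum(C[i] * x[i] for i in C)
            cycles += sum(E[i][sel[CI_OF[i]]] * x[i] for i in E)
            best = max(best, cycles)
    return best + sum(RECONF[k][j] for k, j in sel.items())


def get_min_latency_impl(k, area):
    """Minimum-latency implementation of CI k with exactly `area` partitions, or None."""
    candidates = [j for j, a in enumerate(AREA[k]) if a == area]
    if not candidates:
        return None
    return min(candidates, key=lambda j: (LATENCY[k][j], j))


def opt_search(remaining, k, sel, result):
    """Recursive enumeration of area tuples; subtrees without implementation are pruned."""
    if k <= len(AREA):
        for area in range(remaining + 1):
            j = get_min_latency_impl(k, area)
            if j is not None:
                opt_search(remaining - area, k + 1, {**sel, k: j}, result)
    else:
        result["evaluated"].append(dict(sel))
        bound = wcet_r(sel)
        if bound < result["wcet"]:
            result["wcet"], result["selection"] = bound, dict(sel)


def optimal_selection():
    result = {"wcet": float("inf"), "selection": None, "evaluated": []}
    opt_search(TOTAL_AREA, 1, {}, result)
    return result


if __name__ == "__main__":
    res = optimal_selection()
    print(len(res["evaluated"]), "candidates evaluated")
    print("best selection:", res["selection"], "WCET_r:", res["wcet"])
```

With these inputs the sketch evaluates six candidates, which matches the number of selections shown in Fig. 6.5. The reported bound depends on the assumed basic block delays and is therefore not a result of this thesis.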
[Recursion tree of Algorithm 1 for the example (image not reproduced); × marks pruned nodes.]
Level 1, tuples $a = (a_1, 0)$: (0, 0), (3, 0), (3, 0) ×
Level 2, tuples $a = (a_1, a_2)$: (0, 2), (0, 4), (0, 5), (3, 2), (3, 4) ×, (3, 5) ×
Generate possibilities for $a = (a_1, 0)$ only that correspond to implementations of CI 1.
Based on a given $(a_1, 0)$, generate possibilities for $a = (a_1, a_2)$ (implementations of CI 2).
The subtree $(3, 0) = (a_{1,2}, 0)$ is pruned (×), because the respective implementation $j = 2$ of CI 1 requires the
same area as $j = 1$, but does not provide a latency benefit, i.e.,
$a_{1,1} = a_{1,2} = 3 \wedge t_{1,1} < t_{1,2} \wedge r_{1,1} \le r_{1,2}$. (3, 4) and (3, 5) are pruned, because
$a_{1,j} + a_{2,j'} > A = 5$ (area constraint).
Finally, from theoretically $|\{0,1\}^{2 \times 4}| = 256$ possible inputs $y \in \{0,1\}^{2 \times 4}$ to the objective
function, only
$$6 = \left| \left\{
\underbrace{\begin{pmatrix} 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \end{pmatrix}}_{\hat{=}\, a = (0,0)},
\underbrace{\begin{pmatrix} 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \end{pmatrix}}_{\hat{=}\, a = (3,0)},
\underbrace{\begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix}}_{\hat{=}\, a = (0,2)},
\underbrace{\begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}}_{\hat{=}\, a = (0,4)},
\underbrace{\begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}}_{\hat{=}\, a = (0,5)},
\underbrace{\begin{pmatrix} 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix}}_{\hat{=}\, a = (3,2)}
\right\} \right|$$
possible selections $y$ need to be evaluated using timing analysis to find the optimal solution.
Figure 6.5: Visualization of how pruning is applied and how generated tuples correspond to selection candidates for
the input provided by the example in Fig. 6.4. For clarity, tuples that were pruned because a chosen $a_k$ did not
correspond to a possible implementation of CI $k$ are omitted.
6.5 Heuristic Solution
Algorithm 2 Greedy Heuristic for WCET-Optimizing Instruction Set Selection
1: repeat
2:   if Σ_{k=1}^{|CI|} a_{k,j(y_k)} = A then return    ▷ Return if reconfigurable area is fully occupied
3:   end if
4:   x ← result x which determines WCET′(y), see Eq. (6.5)    ▷ Get current worst-case path information
5:   y′ ← y
6:   y ← UPGRSELECTION(y′, x)    ▷ Attempt to upgrade a CI implementation
7: until y′ = y    ▷ Exit loop when unable to upgrade any CI
8:
9: function UPGRSELECTION(y, x)
10:   profit_best ← 0, y_next ← y
11:   freePartitions ← A − Σ_{k=1}^{|CI|} a_{k,j(y_k)}
12:   for k ← 1, |CI| do
13:     profit ← −1, s ← 0
14:     while profit < 0 ∧ s < freePartitions do
15:       y′_k ← GETMINLATENCYIMPL(k, a_{k,j(y_k)} + s)    ▷ Try to find upgrade for CI k
16:       if y′_k ≠ 0 then    ▷ If upgrade with exactly s additional area found
17:         y′ ← (y_1, …, y_{k−1}, y′_k, y_{k+1}, …, y_{|CI|})
18:         profit ← profit(y′_k, y_k, x)    ▷ Defined in Eq. (6.9), profit > 0 ⇒ y′_k = y^+_k
19:       end if
20:       s ← s + 1
21:     end while
22:     if profit > profit_best then
23:       y_next ← y′, profit_best ← profit    ▷ Save best y^+_k found so far
24:     end if
25:   end for
26:   return y_next
27: end function
We introduce a greedy heuristic that performs a number of WCET estimates linear in the number of partitions into
which the reconfigurable fabric area is divided, i.e., at most $A$ estimates (for $A > 0$). It is shown in Algorithm 2.
The heuristic starts with implementing all CIs in software, i.e., not allocating any area of the reconfigurable fabric
for CIs. For every CI, it assigns a profit, which is the WCET reduction on the current worst-case path when choosing
an alternative implementation. Let $j(y_k)$ be the implementation selected for CI $k$ in $y$. We define the
profit of selecting $y'_k$ over $y_k$ for a CI $k$ as:
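$$\mathrm{profit}(y'_k, y_k, x) := \underbrace{\sum_{\substack{i=1 \\ ci(i) = k}}^{|B|} \left( e_{i,j(y_k)} - e_{i,j(y'_k)} \right) x_i}_{\text{latency reduction on current worst-case path}} - \underbrace{\left( r_{k,j(y'_k)} - r_{k,j(y_k)} \right)}_{\text{additional reconfiguration delay}} \tag{6.9}$$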
Here, $x$ provides information about the current worst-case path and is obtained by solving Eq. (6.5) and keeping
the values of the variables $x_i$, i.e., $x$ determines $\mathrm{WCET}'(y)$. Note that the profit can become negative
if the latency reduction on the current worst-case path is smaller than the additional reconfiguration delay for the
additional area. The heuristic calculates the profit for selecting the next best implementation $y^+_k$ instead of
$y_k$ for every CI $k$ (Line 18). The implementation $y^+_k$ for a CI $k$ is the implementation that can be chosen
with minimum increase in the amount of area over $y_k$ resulting in a positive profit. There might be several
implementations with the same required area according to this definition. In this case, $y^+_k$ is the implementation
with minimum latency $t_{k,j}$ (and minimum $j$). If no such implementation $y^+_k$ exists (i.e., no implementation
with positive profit was found), the CI is not considered for selecting a different implementation. Among the CIs for
which $y^+_k$ exists, the algorithm greedily chooses the one with the maximum profit and upgrades $y$ to select
$y^+_k$ for the chosen CI $k$ (Line 23). This process is repeated such that in every iteration the CI $k$ with maximum
$\mathrm{profit}(y^+_k, y_k, x)$ is upgraded. The algorithm terminates when no $y^+_k$ exists for any $k$ anymore or
insufficient area is left to be allocated for selecting $y^+_k$ (Line 7).
In every iteration, either a CI upgrade is selected and the allocated area increases by a minimum of one, or the
algorithm terminates. In every iteration but the last, one WCET estimate is performed. Therefore, a maximum of
$A$ WCET estimates are performed in total.
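The greedy heuristic can be sketched in Python on the example of Fig. 6.4. The sketch is an interpretation of the textual description above and not the implementation of this work. Its assumptions are: the worst-case path is found by enumeration instead of an ILP solver, the basic block delays are c_i = 1, a selection is represented by the chosen implementation j per CI, and the additional area s is tried from 1 up to and including the number of free partitions (the prose asks for the minimum area increase with positive profit, while the listing of Algorithm 2 starts at s = 0 and excludes s = freePartitions).

```python
"""Sketch of the greedy heuristic (Algorithm 2) on the example of Fig. 6.4."""

AREA = {1: (0, 3, 3), 2: (0, 2, 4, 5)}          # a_{k,j}
RECONF = {1: (0, 10, 10), 2: (0, 7, 12, 16)}    # r_{k,j}
LATENCY = {1: (0, 10, 12), 2: (0, 15, 11, 9)}   # t_{k,j}; 0 is a placeholder for software
E = {4: (50, 10, 12), 5: (60, 18, 14, 12), 6: (48, 10, 12)}  # e_{i,j}
CI_OF = {4: 1, 5: 2, 6: 1}                      # ci(i) of the super blocks
C = {1: 1, 2: 1, 3: 1, 7: 1}                    # ASSUMPTION: c_i = 1
TOTAL_AREA = 5                                  # A


def worst_case_path(sel):
    """Return (WCET'(y), x) by enumerating all x allowed by Fig. 6.4 (c)."""
    best, best_x = -1, None
    for n in range(101):
        for p in range(min(5, n) + 1):
            x = {1: 1, 2: 1 + n, 3: n, 4: p, 5: n - p, 6: n, 7: n}
            cycles = sum(C[i] * x[i] for i in C)
            cycles += sum(E[i][sel[CI_OF[i]]] * x[i] for i in E)
            if cycles > best:
                best, best_x = cycles, x
    return best, best_x


def used_area(sel):
    return sum(AREA[k][j] for k, j in sel.items())


def get_min_latency_impl(k, area):
    """Minimum-latency implementation of CI k with exactly `area` partitions, or None."""
    candidates = [j for j, a in enumerate(AREA[k]) if a == area]
    return min(candidates, key=lambda j: (LATENCY[k][j], j)) if candidates else None


def profit(k, new_j, old_j, x):
    """Eq. (6.9): latency reduction on the worst-case path minus extra reconfiguration delay."""
    saved = sum((E[i][old_j] - E[i][new_j]) * x[i] for i in E if CI_OF[i] == k)
    return saved - (RECONF[k][new_j] - RECONF[k][old_j])


def upgrade_selection(sel, x):
    """Upgrade the CI whose next-best implementation y_k^+ has the maximum profit."""
    best_profit, best_sel = 0, sel
    free = TOTAL_AREA - used_area(sel)
    for k in sorted(sel):
        for extra in range(1, free + 1):  # ASSUMPTION: s = 1..free, see lead-in
            j = get_min_latency_impl(k, AREA[k][sel[k]] + extra)
            if j is None:
                continue
            gain = profit(k, j, sel[k], x)
            if gain > 0:  # smallest area increase with positive profit defines y_k^+
                if gain > best_profit:
                    best_profit, best_sel = gain, {**sel, k: j}
                break
    return best_sel


def greedy_selection():
    sel = {k: 0 for k in AREA}            # start with all CIs in software
    estimates = 0
    while used_area(sel) < TOTAL_AREA:
        _, x = worst_case_path(sel)       # one WCET estimate per iteration
        estimates += 1
        upgraded = upgrade_selection(sel, x)
        if upgraded == sel:
            break
        sel = upgraded
    # Final bound for reporting only; not counted as an estimate of the heuristic.
    bound = worst_case_path(sel)[0] + sum(RECONF[k][j] for k, j in sel.items())
    return sel, bound, estimates


if __name__ == "__main__":
    print(greedy_selection())
```

In this small example the heuristic first upgrades CI 2 and then CI 1, after which the fabric is fully occupied. The worst-case path changes between the two iterations (the true case enters the worst-case path after CI 2 is accelerated), which is why x is recomputed before every upgrade.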
6.6 Experimental Evaluation
6.6.1 Evaluation Setup
This work is evaluated on the reconﬁgurable processor i-Core presented in Section 2.4. A CI is executed by a
so-called CI Execution Controller. Its protocol is similar to other multi-cycle instructions like division and directly
accesses register operands or non-cacheable Scratchpad Memory. This way, in microarchitectural analysis when
determining WCETs of basic blocks, a CI is just another multi-cycle instruction that does not inﬂuence data
cache analysis. The reconfigurable fabric is divided into $A$ equally-sized partitions, complying with common models
of allocating reconfigurable fabric area as assumed in our system model (see Section 6.2). The reconfiguration
controller CoRQ (see Chapter 4) is employed with private memory to store configurations and provides predictable
reconfiguration of CIs. Initiating a specific configuration is done by a store of the CPU using the memory-mapped
interface of CoRQ (see Chapter 4 for more details).
Our timing analysis of tasks on reconﬁgurable processors has been detailed in Chapter 5 and was evaluated using
the commercial timing analyzer AbsInt aiT [2]. In this work we extend WCET analysis as an integral part of WCET
optimization. Therefore, the optimal search and heuristic selection algorithms were implemented as processors
within the open-source WCET estimation framework OTAWA [7]. We extended the existing analysis support
for the LEON3 CPU in OTAWA to support CI opcodes, CI super blocks with conﬁguration-dependent latency
and reconﬁguration delay. Our approach was evaluated with the H.264 encoder application that was also used in
the previous chapter. It uses 9 CIs covering the most compute-intensive kernels shown in Table 6.1. Multimedia
applications in general are regularly subject to hard real-time constraints in the domain of computer vision. Notable
Table 6.1: Kernels and Custom Instructions (CI) in the H.264 Application
CI Name and Short Description             | Working Set | #Confs. | Min. Part. | Max. Part. | CLoC (footnote 5)
MotionEstimation Kernel
SATD: Sum of Abs. Transf. Differences     | 16×16 px    | 77      | 1          | 9          | 123
SAD: Sum of Abs. Differences              | 16×16 px    | 3       | 1          | 4          | 24
EncodeMacroBlock Kernel
MC_Hz: Motion Compens. Interpol. Horiz.   | 4 px        | 29      | 1          | 6          | 51
IPred_HDC: Intra Prediction Horiz.        | 16×16 px    | 3       | 1          | 1          | 35
IPred_VDC: Intra Prediction Vert.         | 16×16 px    | 5       | 1          | 4          | 19
DCT: Discrete Cosine Transf.              | 4×4 px      | 21      | 1          | 5          | 76
HT2x2: Hadamard Transform                 | 2×2 px      | 1       | 1          | 1          | 12
HT4x4: Hadamard Transform                 | 4×4 px      | 17      | 1          | 4          | 111
LoopFilter Kernel
LoopFilter: In-Loop Deblock. Filter       | 4 px        | 6       | 1          | 4          | 82
examples are advanced driver assistance systems, e.g., vehicle detection and tracking [15], but also consumer
electronics, e.g., face recognition in digital cameras [109]. The H.264 encoder contains complex control ﬂow
with numerous decisions and nested loops. Most of the properties tested in the Mälardalen (footnote 3) or
TACLeBench (footnote 4) WCET benchmarks are covered, e.g., Discrete Cosine Transform is contained in all three. The
H.264 decoder, which is part of TACLeBench, performs a subset of the computations performed in the H.264 encoder that
is evaluated in the following. Especially the EncodeMacroBlock kernel stresses our selection heuristic (more details
in Section 6.6.4), as it contains separate compute-intensive paths that share some CIs. The kernel iterates over
macroblocks (MBs). Which path is executed within a kernel iteration depends on the type of MB, either I-MB
or P-MB, determined by the MotionEstimation kernel, i.e., it is input dependent. The I-MB and P-MB paths also
contain separate CIs, leading to instability of the worst-case path, i.e., adding more partitions to the current
worst-case path can result in the other path becoming the worst case. We compiled the application using BCC 4.4.2
(Gaisler's extended GCC 4.4.2) at O1 and performed our selection on the encoder for a frame size of 99 MBs
(QCIF resolution). At higher optimization levels, GCC emitted irreducible loops, i.e., complex loop structures that
cannot be extracted as well-defined loop routines by the timing analyzer. Therefore, O1 provided the lowest WCET
bound for the baseline executing all CI super blocks in software. The selection is performed offline and runs on a
workstation with an AMD FX-6300 CPU and 12 GB of RAM. The result is used to generate a single conﬁguration
for every kernel that includes CI super blocks. The conﬁgurations are supplied to the optimized application on
the target system by loading them into the private memory of CoRQ before executing the application. Before
entering a kernel that includes CI super blocks, its speciﬁc conﬁguration is triggered. The pipeline stalls for the
reconﬁguration delay and continues with entering the kernel once reconﬁguration ﬁnishes.
The evaluated parameters were different numbers of partitions $A$ (300 slices each on a Xilinx Virtex-7),
reconfiguration bandwidths, as well as ratios of CPU frequency to fabric frequency $f_{CPU}/f_{fabric}$. Similar to the
evaluation of the timing analysis in Section 5.6, $f_{fabric}$ stays constant at 100 MHz and we choose multiples of it
for $f_{CPU}$ that resemble realistic setups. E.g., running the CPU at $f_{CPU} = 400$ MHz, the frequency at which the
LEON3 CPU is advertised to run as an ASIC implementation, corresponds to the parameter $f_{CPU}/f_{fabric} = 4$. The
successor of the LEON3, the LEON4, is advertised as running at 1500 MHz, corresponding to $f_{CPU}/f_{fabric} = 15$.
The commercially available Xilinx Zynq-7000 SoC couples an ARM Cortex A9 at 866 MHz with a Xilinx 7-Series
reconfigurable fabric, corresponding to $f_{CPU}/f_{fabric} \approx 9$. Note that while the WCET in seconds
(WCET cycles/$f_{CPU}$) is anticipated to
Footnote 3: http://www.mrtc.mdh.se/projects/wcet/benchmarks.html
Footnote 4: http://www.tacle.eu/index.php/activities/taclebench
Footnote 5: C Lines of Code that are replaced by utilizing a hardware CI (without comments or whitespace)
[Figure 6.6 (plots not reproduced). Panels (a) A = 7 and (b) A = 10: WCET cycles in thousands (y-axis, 600 to 5100)
over f_CPU/f_fabric (x-axis, 1 to 16), each with the two series "Ignoring Reconf. Delay" and "Considering Reconf.
Delay". Partitions allocated for f_CPU/f_fabric = 1, ..., 16:]
(a) A = 7
Ignoring Reconf. Delay:    7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
Considering Reconf. Delay: 7 7 7 6 5 5 5 5 5 5 5 5 5 5 5 5
(b) A = 10
Ignoring Reconf. Delay:    10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
Considering Reconf. Delay: 7 7 7 6 5 5 5 5 5 5 5 5 5 5 5 5
Figure 6.6: Optimal Results for the EncodeMacroBlock Kernel of the H.264 Encoder and different Values of
fCPU/ffabric, A as well as a Reconfiguration Bandwidth of 200 MB/s. Comparing Results considering Reconfiguration
Delay during Selection and Results not considering it.
get lower (better) with higher $f_{CPU}$, the WCET cycles increase (at a constant $f_{fabric}$), because hardware CIs
perform fewer computations on the reconfigurable fabric within one CPU cycle.
In Sections 6.6.2 and 6.6.3, we focus on the effects of considering reconfiguration delay and infeasible path
information on the selection result, respectively. In Section 6.6.4, the effectiveness of pruning during optimal
search is analyzed, and the runtime as well as the quality of the selection results of our optimal search and
heuristic algorithms are compared. Note that all discussed results are upper bounds of the actual WCET. In general,
it is not possible to obtain the actual WCET [106].
6.6.2 Impact of Reconﬁguration Delay on WCET-Optimizing Selection
In this section, we evaluate the impact of reconfiguration delay on WCET-optimizing CI selection. Figure 6.6 shows
results obtained by applying our optimal search algorithm (see Section 6.4) to the EncodeMacroBlock kernel of
the H.264 encoder for $f_{CPU}/f_{fabric} \in [1 : 16]$ and a reconfiguration bandwidth of 200 MB/s (half of the
theoretical maximum in current Xilinx FPGAs). We compare the results obtained by considering the reconfiguration delay
during selection, as in Eq. (6.8), with results obtained by ignoring it (i.e., $r_{k,j} = 0\ \forall k \in CI,
j \in \{0, \ldots, m_k\}$). The final WCET bound always includes the reconfiguration delay required to configure the
selection result.
Figure 6.6 (a) shows the results for A = 7, i.e., the algorithm can allocate up to 7 partitions for the selection to
optimize the WCET of this kernel. For fCPU/ ffabric ∈ [1 : 3], the selections and the resulting WCET bound are
equal. For higher frequencies of the CPU, the WCET bound obtained by ignoring the reconﬁguration delay during
selection is higher than the WCET bound obtained by considering the reconﬁguration delay with a maximum
of 4.08 % increase at fCPU/ ffabric = 16. More importantly, the lower WCET bounds are obtained with fewer
partitions. It is not beneﬁcial to use all 7 partitions with fCPU/ ffabric ∈ [4 : 16], because the CIs having the biggest
effect on reducing the WCET bound are implemented in hardware already. Increasing the number of allocated
partitions for these CIs yields diminishing returns in their latency reduction. In total, this leads to an increase of the
WCET bound, because the additional reconﬁguration delay outweighs the latency reduction of the WCET path.
This effect becomes even more apparent with A = 10, as shown in Fig. 6.6 (b), keeping all other parameters as in
Fig. 6.6 (a). In this case, ignoring the reconﬁguration delay already yields a higher WCET bound by 4.02 % at
fCPU/ ffabric = 1 and up to 17.14 % at fCPU/ ffabric = 16 over considering the reconﬁguration delay. Furthermore,
at fCPU/ ffabric = 16 only half the partitions are required when considering the reconﬁguration delay (5 partitions,
as compared to 10 when ignoring it). The effect of obtaining lower WCET bounds with fewer partitions when
considering the reconﬁguration delay during selection compared to not considering it, becomes more severe with
higher reconﬁguration delay (measured in CPU cycles) per allocated partition. The reconﬁguration delay per
partition increases when fCPU/ ffabric is increased (e.g., when using a higher-frequency CPU) or the reconﬁguration
bandwidth is lowered (e.g., when using cheaper memory). Additionally, raising A further would again lead to
worse selections when not considering the reconﬁguration delay, as more partitions would be allocated for only
little CI latency improvement.
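
The dependence of the per-partition reconfiguration delay on fCPU/ffabric and on the reconfiguration bandwidth can be sketched as follows. This is a minimal illustration and not the model used in the evaluation: the partition bitstream size (`PARTITION_BITSTREAM_BYTES`) is a hypothetical value, the function name is invented for this sketch, and 1 MB/s is assumed to mean 10^6 bytes per second. The fabric frequency of 100 MHz corresponds to the setup described in Section 6.6.4.

```python
# Sketch: reconfiguration delay of one partition, measured in CPU cycles.
# PARTITION_BITSTREAM_BYTES is a hypothetical value chosen for illustration.
PARTITION_BITSTREAM_BYTES = 40_000
F_FABRIC_HZ = 100_000_000  # fabric clock of 100 MHz


def reconf_delay_cycles(bitstream_bytes, bandwidth_bytes_per_s, f_cpu_hz):
    """Return the delay in CPU cycles, rounded up to full cycles."""
    # time [s] = bytes / bandwidth, cycles = time * f_cpu (integer ceil division)
    numerator = bitstream_bytes * f_cpu_hz
    return (numerator + bandwidth_bytes_per_s - 1) // bandwidth_bytes_per_s


for ratio in (1, 4, 16):  # fCPU / ffabric
    for bandwidth_mb_s in (400, 200):
        cycles = reconf_delay_cycles(
            PARTITION_BITSTREAM_BYTES,
            bandwidth_mb_s * 1_000_000,
            ratio * F_FABRIC_HZ,
        )
        print(f"ratio={ratio:2d} bandwidth={bandwidth_mb_s} MB/s -> {cycles} CPU cycles")
```

Under these assumptions, the delay in CPU cycles scales linearly with fCPU/ffabric and inversely with the bandwidth, which is why selections that ignore the delay degrade at higher CPU frequencies or lower bandwidths.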
In sum, not considering the reconfiguration delay during WCET-optimizing CI selection does not merely lead to
suboptimal results: it can lead to higher WCET bounds while allocating more partitions than required in the optimal
results (which consider the reconfiguration delay). Existing approaches for selecting CIs to optimize the WCET target
application-specific instruction set processors (ASIPs) instead of reconfigurable processors and therefore do not
consider reconfiguration delay (see Section 6.1). While runtime reconfiguration provides the flexibility to utilize
the whole fabric area per kernel and was proven to provide substantial WCET reductions [29], the reconfiguration
delay needs to be considered during selection to avoid suboptimal results. As the results obtained by our approach show,
previous approaches targeting ASIPs can therefore not be applied to reconfigurable processors.
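
The effect summarized above can be reproduced with a toy model of a single CI whose latency shows diminishing returns per added partition. All numbers are hypothetical and the function name is invented for this sketch; it only illustrates why a selection that ignores the delay ends up with more partitions and a higher final bound, since the final bound always includes the reconfiguration delay.

```python
# Toy model: kernel latency (in cycles) for 0..5 allocated partitions.
# The values are hypothetical and show diminishing returns.
LATENCY_CYCLES = [1000, 600, 450, 400, 390, 385]
DELAY_PER_PARTITION = 30  # hypothetical reconfiguration delay per partition


def select_partitions(latency_cycles, delay_per_partition, consider_delay):
    """Return (partitions, final WCET bound) of the selected alternative."""
    def cost(p):
        delay = delay_per_partition * p if consider_delay else 0
        return latency_cycles[p] + delay

    p = min(range(len(latency_cycles)), key=cost)
    # The final bound always includes the delay needed to configure the result.
    return p, latency_cycles[p] + delay_per_partition * p


ignoring = select_partitions(LATENCY_CYCLES, DELAY_PER_PARTITION, False)
considering = select_partitions(LATENCY_CYCLES, DELAY_PER_PARTITION, True)
print("ignoring delay:   ", ignoring)
print("considering delay:", considering)
```

In this toy instance the delay-unaware selection uses all 5 partitions, whereas the delay-aware selection stops at 3 partitions and obtains a lower bound.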
6.6.3 Impact of Infeasible Path Information on WCET-Optimizing Selection
As motivated in Fig. 6.2, previous WCET-optimizing selection and allocation approaches relying on timing schema
cannot utilize information about the global program ﬂow. Therefore, global ﬂow information provided by anno-
tation languages in state-of-the-art timing analyzers (see [59] for an overview), which is crucial to precise WCET
bounds, cannot be utilized during optimization and therefore decisions are made on imprecise WCET estimates.
In the evaluations of our approach, the CFG was annotated with infeasible path information using the XML-based
FFX language [17] supported by OTAWA. During WCET bound estimation, these annotations are translated into
IPET constraints (see Section 2.2.1). Similar to all other IPET constraints used in our optimization approach, the
constraints need to be generated once for the whole optimization process and can be reused for all WCET bound
estimations.
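
How an infeasible path annotation acts as an additional IPET constraint can be illustrated on a tiny CFG with two alternative paths. This is a sketch under stated assumptions: the block costs are hypothetical, the function `ipet_bound` is invented for this illustration, and brute-force enumeration of execution counts stands in for the ILP solver used by a real timing analyzer.

```python
from itertools import product

# Hypothetical block costs in cycles for a CFG: entry -> (imb | pmb) -> exit.
COSTS = {"entry": 10, "imb": 500, "pmb": 300, "exit": 10}

# Structural flow constraints, generated once and reused for every estimate.
FLOW = [
    lambda x: x["entry"] == 1,
    lambda x: x["imb"] + x["pmb"] == x["entry"],
    lambda x: x["exit"] == x["entry"],
]


def ipet_bound(costs, constraints):
    """Maximize sum(cost * count) over all counts that satisfy the constraints."""
    blocks = list(costs)
    best = 0
    for counts in product((0, 1), repeat=len(blocks)):
        x = dict(zip(blocks, counts))
        if all(constraint(x) for constraint in constraints):
            best = max(best, sum(costs[b] * x[b] for b in blocks))
    return best


# Annotating the I-MB path as infeasible adds one more constraint.
INFEASIBLE_IMB = [lambda x: x["imb"] == 0]

bound_without = ipet_bound(COSTS, FLOW)
bound_with = ipet_bound(COSTS, FLOW + INFEASIBLE_IMB)
print(bound_without, bound_with)
```

With the annotation, the bound is determined by the P-MB path only, so an optimization driven by this bound has no reason to spend partitions on the I-MB path.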
Figure 6.7 shows results with and without infeasible path information obtained by applying the optimal search
algorithm (see Section 6.4) to the EncodeMacroBlock kernel of the H.264 Encoder for several parameters. For
evaluating the effects of infeasible path information, the path encoding I-MBs (see Section 6.6.1) is annotated
as infeasible, which becomes the worst-case path only at some point when adding partitions to CIs (the exact
point depends on fCPU/ ffabric and the reconﬁguration bandwidth). For most selections and especially when not
allocating any partitions (original software instead of hardware CIs only), the P-MB path is the worst-case path.
Still, we can show that annotating the I-MB path as infeasible has a considerable effect on the resulting WCET
bound. A reconfiguration bandwidth of 400 MB/s (the theoretical maximum in Xilinx Virtex-7 FPGAs) is used in
Fig. 6.7 (a) for allocating A = 5 partitions. At fCPU/ffabric = 1, the difference between the optimized WCET bounds
with and without infeasible path information is largest, with 12.71 % more WCET cycles when not utilizing the
infeasible path information. The additional WCET cycles are a result of allocating partitions to CIs that lie on the
path marked as infeasible. Our approach makes it possible to utilize this information during optimization and therefore does
not allocate any partitions to the infeasible path when this information is provided. For fCPU/ffabric ∈ [2 : 4], the
difference decreases down to 3.45 % at fCPU/ ffabric = 4, because the increased speed of the CPU relative to the
[Figure 6.7: three plots of WCET cycles (in thousands, 550 to 1550) over fCPU/ffabric ∈ {1, 2, 3, 4}, each comparing "No Infeasible Path Info" with "With Infeasible Path Info". (a) A = 5, reconfiguration bandwidth 400 MB/s; (b) A = 5, reconfiguration bandwidth 200 MB/s; (c) A = 7, reconfiguration bandwidth 200 MB/s. Partitions allocated for fCPU/ffabric = 1, 2, 3, 4: in (a) and (b) 5, 5, 5, 5 both without and with path information; in (c) 7, 7, 7, 6 without and 6, 6, 6, 6 with path information.]
Figure 6.7: Optimal results for the EncodeMacroBlock kernel of the H.264 encoder for different values of fCPU/ffabric, A, and reconfiguration bandwidth, comparing results that utilize infeasible path information (I-MB path marked infeasible) with results that do not.
fabric compensates the partition wasted on the infeasible path when not utilizing the global ﬂow information. One
might suspect that this effect is only possible at this high reconﬁguration bandwidth, because the reconﬁguration
delay of adding an additional partition to the worst-case path while utilizing infeasible path information might
be too high to reduce the WCET bound otherwise (see Section 6.6.2). However, with half the reconﬁguration
bandwidth (200 MB/s), the results remain valid with 11.95 % additional cycles at fCPU/ ffabric = 1 and 3.08 %
at fCPU/ ffabric = 4 when comparing the selection not utilizing WCET path information with the selection that
does. For values of fCPU/ffabric higher than 4, the instability of the worst-case path leads to the infeasible path not
appearing for A = 5. Keeping the reconfiguration bandwidth at 200 MB/s and increasing A leads to an additional
effect, shown in Fig. 6.7 (c). The difference between the WCET bounds obtained with infeasible path information
and without is generally lower, with at most 3.90 % additional WCET cycles at fCPU/ffabric = 3. The reason
is that, with infeasible path information, the optimal choice allocates only 6 of the A = 7 available partitions: the
reconfiguration delay of adding an additional partition to the worst-case path is now too high to effectively reduce
the WCET bound. Adding an additional partition to the infeasible path when not utilizing this information therefore
adds reconfiguration delay to the WCET bound without providing any actual benefit. Thus, similar to the impact
of reconﬁguration delay evaluated in Section 6.6.2, utilizing infeasible path information during WCET-optimizing
CI selection provides better results and can even provide better results with fewer partitions. Previous WCET-
optimizing approaches for CI selection and memory allocation relied on timing schema and were therefore unable
to utilize global ﬂow information (see Section 6.1). However, utilizing this information is crucial to obtain good
selection results.
Table 6.2: Evaluation Results LoopFilter Kernel, |CI| = 1
A                                 0        1        2        3        4
Total Possible Selections         7        7        7        7        7
Weak Comp. of A into |CI|         1        2        3        4        5
Opt. Estimates                    1        2        3        4        5
Heur. Estimates                   1        1        2        3        3
Opt. Runtime [ms]             122.0    123.2    119.2    120.0    124.4
Heur. Runtime [ms]            118.0    112.4    122.4    122.4    119.2
WCET unoptimized [cycles]     4,467,172
Opt. Speedup                      1        1  19.8516  19.8516  19.8516
Heur. Speedup                     1        1  19.8516  19.8516  19.8516
While our approach enables utilizing global ﬂow information and considering the reconﬁguration delay during
optimization, it adds complexity by utilizing IPET over simpler techniques like timing schema to obtain WCET
estimates. In the following, the quality of the results of our heuristic compared to optimal search as well as runtimes
are evaluated to demonstrate the practicality of our approach.
6.6.4 Runtimes, Pruning and Quality of Heuristic Selection
Sections 6.6.2 and 6.6.3 demonstrated the importance of considering reconﬁguration delay as well as global pro-
gram ﬂow information during WCET optimization. In contrast to previous approaches, our approach enables
considering both types of information. However, this requires evaluating several IPET instances and therefore
raises the question whether the runtimes of the optimization remain within acceptable bounds.
Tables 6.2 to 6.5 show evaluation results for all major kernels of the H.264 encoder application. We ﬁxed
fCPU/ffabric at 4 and the reconfiguration bandwidth at 400 MB/s, as this reflects the realistic setup of running the
CPU at 400 MHz (the frequency at which the LEON3 processor is advertised to run when implemented as an ASIC) and
the reconﬁgurable fabric at 100 MHz, as well as running the conﬁguration port at its maximum speed. The scal-
ability and effectiveness of pruning during the optimal search as well as the heuristic is evaluated by running the
algorithms for A ∈ [0 : 21]. Giving optimal search the freedom to allocate up to A = 21 partitions results in the
maximum number of candidates for the most complex kernel EncodeMacroBlock (see Table 6.5). The maximum
number of candidates is reached at lower A for the MotionEstimation (A = 11) and LoopFilter (A = 4) kernels;
the additional measurements for these kernels are therefore omitted. Tables 6.2 to 6.5 are listed in increasing
order of kernel complexity (number of instructions and number of CI super blocks). The first line of the tables is
the total number of possible selections calculated as ∏k∈CI(mk + 1), i.e., the number of all combinations of con-
ﬁgurations, plus the original software implementation, per CI without any restrictions. Weak compositions of A
were explained in Section 6.4 as a technique we apply for pruning selection candidates during optimal search. The
number of weak compositions of A into exactly |CI| parts is calculated as $\sum_{s=0}^{A} \binom{s+|CI|-1}{|CI|-1}$ [51]. Opt. Estimates
and Heur. Estimates are the number of WCET estimates calculated using Eq. (6.6) during optimization using op-
timal search and the heuristic, respectively. The last lines of the tables are the runtimes of the optimizations and
the speedups obtained on the WCET estimate of the kernel, comparing the selection result to the software-only
implementation.
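
The composition counts listed in the tables can be recomputed directly from the sum of binomial coefficients given above. The following sketch uses only the Python standard library; the function name `num_compositions` is chosen for this illustration. It also checks the closed form C(A + |CI|, |CI|), which follows from the hockey-stick identity and is not stated in the text.

```python
from math import comb


def num_compositions(a, num_cis):
    """Sum over s = 0..a of C(s + num_cis - 1, num_cis - 1)."""
    return sum(comb(s + num_cis - 1, num_cis - 1) for s in range(a + 1))


# Values as listed in Tables 6.2, 6.3 and 6.5.
print(num_compositions(4, 1))    # LoopFilter, A = 4
print(num_compositions(11, 2))   # MotionEstimation, A = 11
print(num_compositions(21, 6))   # EncodeMacroBlock, A = 21

# Closed form via the hockey-stick identity.
assert all(
    num_compositions(a, n) == comb(a + n, n)
    for a in range(22)
    for n in range(1, 7)
)
```

For |CI| = 6 and A = 21 this yields 296,010 compositions, which is why additional pruning of recursive subtrees is needed to keep the number of WCET estimates small.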
Effectiveness of Pruning and Scalability of Optimal and Heuristic Selection
Table 6.2 shows the results obtained for the evaluated kernel of least complexity, LoopFilter. The kernel includes
one CI super block only, with 7 implementation alternatives and therefore 7 possible selections in total. For only
Table 6.3: Evaluation Results MotionEstimation Kernel, |CI| = 2
A                                0       1       2       3       4       5       6       7       8       9      10      11
Total Possible Selections      312     312     312     312     312     312     312     312     312     312     312     312
Weak Comp. of A into |CI|        1       3       6      10      15      21      28      36      45      55      66      78
Opt. Estimates                   1       2       3       4       5       6       7       8       9      10      29      30
Heur. Estimates                  1       1       2       3       4       5       6       7       8       9      10      11
Opt. Runtime [ms]           4360.0  4350.8  4318.4  4329.2  4360.4  4331.2  4375.2  4391.2  4398.8  4456.8  4373.2  4317.2
Heur. Runtime [ms]          4291.6  4303.2  4332.0  4317.2  4337.6  4309.6  4342.4  4361.6  4323.6  4405.6  4344.0  4328.8
WCET unoptimized [cycles]   112,173,893
Opt. Speedup                     1    4.82    5.48    9.04   21.01   24.67   25.65   26.70   26.70   26.97   26.97   27.24
Heur. Speedup                    1    4.82    5.48    9.04   21.01   24.67   25.65   26.70   26.70   26.97   26.97   27.24
Table 6.4: Evaluation Results EncodeMacroBlock Kernel, |CI| = 6
A                                 0        1        2        3        4        5        6        7        8        9       10
Total Possible Selections    570240   570240   570240   570240   570240   570240   570240   570240   570240   570240   570240
Weak Comp. of A into |CI|         1        7       28       84      210      462      924     1716     3003     5005     8008
Opt. Estimates                    1        2        3        4        5      299      517      816     1192     1629     2100
Heur. Estimates                   1        1        2        3        4        5        6        7        8        9       10
Opt. Runtime [ms]            4568.4   4635.6   4757.6   5154.4   5923.2   7194.4   9568.8  12331.6  15721.2  19550.4  23131.2
Heur. Runtime [ms]           4540.8   4546.0   4556.4   4547.2   4561.6   4580.4   4605.2   4647.6   4747.2   4731.2   4784.8
WCET unoptimized [cycles]    13,348,251
Opt. Speedup                      1     2.43     3.35     4.66     9.84    10.19    10.44    10.55    10.62    10.69    10.69
Heur. Speedup                     1     2.43     3.35     4.66     9.84     9.94    10.29    10.55    10.62    10.69    10.69
Table 6.5: Evaluation Results EncodeMacroBlock Kernel, |CI| = 6 (continued)
A                                11       12       13       14       15       16       17       18       19       20       21
Total Possible Selections    570240   570240   570240   570240   570240   570240   570240   570240   570240   570240   570240
Weak Comp. of A into |CI|     12376    18564    27132    38760    54264    74613   100947   134596   177100   230230   296010
Opt. Estimates                 2571     3008     3384     3683     3901     4045     4130     4174     4193     4199     4200
Heur. Estimates                  10       10       10       10       10       10       10       10       10       10       10
Opt. Runtime [ms]           27230.4  31072.8  34457.6  37087.2  38973.6  40324.4  41037.6  41725.2  41686.8  41751.6  41695.2
Heur. Runtime [ms]           4761.6   4740.0   4753.2   4802.0   4764.4   4792.8     4780   4592.0   4608.0   4608.4   4597.6
WCET unoptimized [cycles]    13,348,251
Opt. Speedup                  10.69    10.69    10.69    10.69    10.69    10.69    10.69    10.69    10.69    10.69    10.69
Heur. Speedup                 10.69    10.69    10.69    10.69    10.69    10.69    10.69    10.69    10.69    10.69    10.69
[Figure 6.8: plot of runtime [s] (0 to 50) over #Partitions A [×1200 LUTs] (0 to 21) for Optimal Search and Greedy Heuristic.]
Figure 6.8: Visualization of the runtime results of the optimization approaches when applied to the EncodeMacroBlock kernel (|CI| = 6)
one CI, the optimal search performs as many WCET estimate calculations as the number of weak compositions
of A into exactly |CI| parts (denoted as number of compositions for the remainder of this text) until A = 4. For
higher A, the number of estimates remains constant, as no implementation of the CI requiring more than
4 partitions exists. The heuristic never performs more than 3 estimates, while the optimal search performs at most
5; however, the complexity of the kernel is too low to show any measurable runtime effect.
The kernel of next higher complexity is MotionEstimation; its evaluation results are shown in Table 6.3. It includes
two CI super blocks having 4 and 78 different implementations, for a total of 312 possible selections. This is
enough to demonstrate the effectiveness of pruning by finding selections that correspond to weak compositions
of A (see Section 6.4), as even at A = 11 the 78 possible compositions are only a quarter of the total number of possible
selections. The additional pruning in our recursive search further reduces the search space
to 30 candidates, which is 9.62% of the total selection candidates and 38.46% of the number of compositions. The
heuristic further reduces the number of WCET estimations performed to 11, i.e., 36.67% of the estimations performed
by the optimal search; the runtime benefit is barely measurable, however.
EncodeMacroBlock is the most complex kernel in our evaluation. It contains 6 CI super blocks, resulting in a
total of 570,240 possible selections. The results are shown in Table 6.4 and Table 6.5. Again, finding selections
that correspond to weak compositions of A prunes the search space effectively, but for |CI| = 6 the number of
possible compositions of A already grows rapidly, reaching 51.91% of the total number of possible selections at A = 21.
However, the number of estimates calculated during optimal search stays much lower, with a maximum of 4200 at
A = 21, which is 0.74% of the total number of possible selections. This is made possible by pruning recursive subtrees
early in our optimal search algorithm. Still, the runtime⁶ of the optimal search algorithm does not scale well with
increasing A, as visualized in Fig. 6.8. At A = 0 it takes 4.5s, doubling already at A = 6 with 9.0s, and doubling again at
A = 9 to 18.88s. For higher values of A the runtime growth stagnates, reaching its maximum of 41.43s at A = 21.
Especially because there may be numerous kernels within the application under optimization, these runtime values
may hinder design space exploration. To reduce the optimization runtime at the potential cost of quality of the
result (discussed in the following section), the heuristic can be applied. It performs a maximum of 10 estimate
calculations, leading to a runtime of at most 4.63s. Of this runtime, 4.25s are spent in CFG reconstruction and
microarchitectural analysis, which are preparation steps to WCET bound estimation before any optimization can
take place. Therefore, solving ILPs for estimate calculations during optimization takes only a fraction of the total
runtime of the heuristic. As the dominating part of the runtime (the microarchitectural analysis) only needs to be
performed once, an additional estimate requires just under 40ms. This value is often dominated by the noise
between measurements. Thus, extending IPET for precise WCET estimates during WCET optimization can be
suitable for design space exploration, as our results show.
6 All runtime results were computed as the average of 10 median values from 12 measurements
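
The runtime figures quoted in the paragraph above can be cross-checked with a few lines of arithmetic. The sketch below only restates numbers given in the text (4.63 s total heuristic runtime, 4.25 s preparation, 10 estimates, 41.43 s maximum optimal-search runtime); the variable names are chosen for this illustration, and the even split of the remaining time across estimates is a simplifying assumption.

```python
# Figures quoted in the text (seconds).
HEUR_TOTAL_S = 4.63   # maximum runtime of the heuristic
PREP_S = 4.25         # CFG reconstruction and microarchitectural analysis
HEUR_ESTIMATES = 10   # maximum number of estimates of the heuristic
OPT_TOTAL_S = 41.43   # maximum runtime of the optimal search (A = 21)

# Assumption: the remaining time is spread evenly over the estimates.
per_estimate_ms = (HEUR_TOTAL_S - PREP_S) / HEUR_ESTIMATES * 1000
heuristic_share_percent = HEUR_TOTAL_S / OPT_TOTAL_S * 100

print(f"time per additional estimate: {per_estimate_ms:.1f} ms")
print(f"heuristic runtime relative to optimal search: {heuristic_share_percent:.2f} %")
```

Under this assumption an additional estimate costs about 38 ms, consistent with the statement that the one-time preparation steps dominate the runtime of the heuristic.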
[Figure 6.9: plot of speedup [WCET Software / WCET Optimized] (1 to 11) over #Partitions A [×1200 LUTs] (0 to 10) for Optimal Search and Greedy Heuristic; annotation: "Greedy result 2.5% worse".]
Figure 6.9: Visualization of the speedup results of the optimization approaches when applied to the EncodeMacroBlock kernel (|CI| = 6)
[Figure 6.10: CFG excerpt with basic blocks and the CI super blocks dct4x4, ipredhdc and dct4x4, showing the current worst-case path (I-MB, WCET = 271,155 cycles) and the competing worst-case path (P-MB, WCET = 257,496 cycles, i.e., 13,659 cycles lower than I-MB). Estimated on the current worst-case path: profit(ipredhdc) > profit(dct4x4). Actual WCET reduction: ipredhdc 13,659 cycles, dct4x4 32,373 cycles.]
Figure 6.10: The greedy heuristic is unaware of any “competing” worst-case paths (here, the P-MB path). Thus, the profit estimated on the current worst-case path (I-MB) can be higher than the actual WCET reduction, as in this case, which appears when optimizing EncodeMacroBlock at A = 5 (simplified)
In sum, the pruning techniques for the optimal search algorithm have proven very effective, but the optimal search can still lead to
runtimes unsuitable for design space exploration. In these cases, the heuristic can reduce the runtime down to
11.18% of that of the optimal search algorithm. However, the heuristic can lead to suboptimal results in certain cases,
which we will detail in the following section.
Quality of Heuristic Selection
For the kernel of least complexity, LoopFilter, and medium complexity, MotionEstimation, our heuristic as
well as optimal search always ﬁnd the same solution for all values of A as shown in Table 6.2 and Table 6.3,
respectively. Therefore, we focus on the evaluation results of the EncodeMacroBlock kernel, which exhibit
heuristic selections that differ from optimal search, as highlighted with yellow background in Table 6.4
and visualized in Fig. 6.9. More specifically, the heuristic finds selections that produce 2.52% and 1.46% lower
speedups than the optimal solution at A = 5 and A = 6, respectively (while finding the optimal solution in all other
cases). The reason for this is the calculation of the profit function in Eq. (6.9), which tries to estimate the effect of
a CI implementation on a previously calculated WCET bound. The problem is that the profit is calculated for the
current worst-case path. Due to the instability of the worst-case path, adding a CI implementation y′_j that benefits
the WCET bound can have a smaller effect on the total bound than on the current worst-case path. However, this alone is
not sufficient for the heuristic to make a suboptimal choice. Additionally, a CI implementation is needed that was
assigned a lower profit than y′_j, but actually has a higher effect on the total WCET bound than y′_j. This can happen
when a CI configuration appears in the current worst-case path as well as in the next-longest path in the program.
This is the case for the dct4x4 CI within the EncodeMacroBlock kernel. The case is visualized in simplified form
in Fig. 6.10. For example, for A = 5 the heuristic chooses the suboptimal selection as follows: after allocating 4 partitions
for the P-MB path, the I-MB path becomes the worst-case path. The heuristic calculates a proﬁt of 257,496 cycles
for implementing ipredhdc in hardware, allocating the last partition. However, the I-MB path takes only 13,659
cycles longer than the P-MB path at this point. Therefore, implementing dct4x4 in hardware, which has a profit of 46,032 cycles
and effectively reduces the WCET bound by 32,373 cycles because it appears in the P-MB as well as
the I-MB path, would have been the better choice. For A < 5, the I-MB path never becomes the worst-case path.
For A > 6, the heuristic has enough partitions available to fully compensate for the suboptimal decision. Therefore,
the heuristic finds the optimal results in these cases.
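
The situation at A = 5 can be replayed with a small model of two competing paths. This is a simplified sketch and not the profit function of Eq. (6.9): reconfiguration delay is ignored, the I-MB saving of ipredhdc (60,000 cycles) is a hypothetical value, and the P-MB saving of dct4x4 (18,714 cycles) is an assumption chosen so that the reported effective reduction of 32,373 cycles is reproduced. The path WCETs and the dct4x4 profit of 46,032 cycles are taken from the text.

```python
# Path WCETs after 4 partitions were allocated (values from Fig. 6.10).
PATHS = {"I-MB": 271_155, "P-MB": 257_496}

# Cycles saved per path when a CI is implemented in hardware.
SAVINGS = {
    "ipredhdc": {"I-MB": 60_000},                 # hypothetical; lies only on I-MB
    "dct4x4": {"I-MB": 46_032, "P-MB": 18_714},   # P-MB saving is an assumption
}


def wcet(paths):
    """The WCET bound is determined by the longest path."""
    return max(paths.values())


def apply_ci(paths, saving):
    return {p: cycles - saving.get(p, 0) for p, cycles in paths.items()}


worst_path = max(PATHS, key=PATHS.get)
# Greedy view: profit is estimated on the current worst-case path only.
profit = {ci: s.get(worst_path, 0) for ci, s in SAVINGS.items()}
# Actual effect: recompute the bound over all competing paths.
actual = {ci: wcet(PATHS) - wcet(apply_ci(PATHS, s)) for ci, s in SAVINGS.items()}

greedy_choice = max(profit, key=profit.get)
best_choice = max(actual, key=actual.get)
print(profit, actual, greedy_choice, best_choice)
```

The estimated profit favors ipredhdc, but its actual reduction is capped at the 13,659-cycle gap to the P-MB path, whereas dct4x4 shortens both paths and reduces the bound by 32,373 cycles.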
6.7 Conclusion
This chapter presented how timing analysis using IPET can be extended to perform WCET optimization on
runtime-reconﬁgurable processors. The WCET-optimizing instruction set selection problem was formulated, i.e.,
selecting the WCET-optimal set of reconﬁgurable custom instruction implementations. Techniques for generating
and pruning potential instruction set selections were discussed and realized in an optimal search algorithm. The
effectiveness of pruning in our optimal search algorithm was demonstrated: it needed to evaluate less than 1%
of all 570,240 possible selections when optimizing the EncodeMacroBlock kernel as part of the H.264 encoder.
However, as the optimal search algorithm still did not scale well for large problem instances, a heuristic was
introduced that performs at most as many evaluations as there are partitions to allocate on the reconfigurable
fabric. The heuristic was an order of magnitude faster than the optimal search for the previously mentioned kernel.
Additionally, an analysis of the suboptimal solutions (up to 2.52% lower speedup) obtained from the heuristic
in our evaluation showed that they are a result of competing worst-case paths during optimization that share CIs.
Our problem formulation and algorithms were implemented based on the timing analyzer OTAWA, showing the
seamless integration into a state-of-the-art timing analysis tool.
The consequences of utilizing timing schema in WCET optimization were shown. Timing schema is a WCET estimation technique
that does not support global program flow information, but is still commonly used in state-of-the-art WCET
optimization approaches. Our novel problem formulation of WCET optimization enables considering global program
flow information such as reconfiguration delay during optimization, and the importance of considering this
information was demonstrated. Not considering global information can lead to higher WCET bounds that require more
resources than the optimal solution.
In sum, novel WCET optimization approaches were provided and runtime instruction set reconﬁguration was
shown to be an enabling feature for timing-predictable performance. To the best of our knowledge, our model is
the ﬁrst formulation of a WCET optimization problem with support for global program ﬂow information and we
can envision applications to problems other than instruction set extension.
In the following, the static WCET optimization presented in this chapter is complemented by an online optimization
that targets average-case performance while maintaining WCET guarantees.
7 WCET Guarantees for Opportunistic Runtime
Reconﬁguration
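[Figure 7.1: distribution of the execution times of a task (x-axis: Execution Time; y-axis: Occurrences [#]) between BCET and WCET, enclosed by the lower bound and the upper bound; the slack is the distance between the current execution and the upper bound.]
Figure 7.1: The WCET of a task is upper-bounded using static timing analysis. At runtime, slack towards this upper bound becomes a resource that can be leveraged, e.g., for runtime reconfiguration of accelerators on an FPGA.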
As presented in the previous chapters, recent works have
demonstrated that runtime reconﬁguration of hardware ac-
celerators is a viable way to achieve high performance that
is analyzable for execution time guarantees [16, 29, 77, 89].
The latency of a reconﬁgurable hardware accelerator is un-
der direct control of an application designer. It is often
precisely known, e.g., when leveraging high-level synthesis
tools. Where in average-case optimizing systems the recon-
ﬁgurable area is allocated to accelerators that result in the
best speedup on average, the main constraint in real-time
systems is to statically guarantee WCET bounds. Thus, re-
conﬁgurable area is allocated to accelerators that reduce the
execution time of the statically-determined worst-case path
of a task (like in the previous chapter). However, executing
the worst-case path and completing in WCET is usually highly improbable (see Chapter 1). In fact, WCET analysis
approximates the WCET of a task by an upper execution time bound and thus, the actual runtime of the task will
virtually always be faster than the guaranteed WCET as shown in Fig. 7.1 (also see Section 2.2). Ultimately this
means that (1) the average-case execution time (ACET) of a task is generally considerably lower than the WCET
and (2) conﬁguring accelerators to optimize the WCET of a task can waste reconﬁgurable area on program paths
that might never be executed in practice. It is, however, highly desirable to achieve a high utilization of accelerators
and to optimize average-case execution to fulfill additional non-functional constraints such as power or thermal
constraints.
The novel contributions of this chapter are as follows:
• an approach that optimizes static WCET guarantees and, at runtime, the ACET (while maintaining
WCET guarantees) using runtime reconfiguration of hardware accelerators (in the form of Custom Instructions
as presented in Section 2.4 and Chapter 6)
• analysis of runtime slack bounds that enable safe reconﬁguration for average-case performance under WCET
guarantees
• an approach for monitoring the runtime slack of a task (the amount of time it executed parts of code faster than
in worst-case) using simple performance counters
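
The idea of monitoring runtime slack named in the last contribution can be sketched as follows. This is an illustrative model only: the class and method names are invented, the per-segment WCET bounds and measured cycle counts are hypothetical, and the threshold passed to `sufficient` stands in for the slack bound whose derivation is the subject of this chapter.

```python
class SlackMonitor:
    """Accumulates how many cycles a task ran faster than its worst case."""

    def __init__(self):
        self.slack_cycles = 0

    def sample(self, wcet_bound_cycles, measured_cycles):
        # With a safe bound, the measured time never exceeds the bound,
        # so each sample adds a non-negative amount of slack.
        self.slack_cycles += wcet_bound_cycles - measured_cycles

    def sufficient(self, required_cycles):
        # required_cycles is a hypothetical threshold for this sketch.
        return self.slack_cycles >= required_cycles


monitor = SlackMonitor()
# Hypothetical (bound, measured) pairs, e.g., read from a cycle counter.
for bound, measured in [(1000, 700), (1000, 950), (500, 200)]:
    monitor.sample(bound, measured)

print(monitor.slack_cycles, monitor.sufficient(600), monitor.sufficient(700))
```

In this example 650 cycles of slack are accumulated, which would permit an action that needs 600 cycles of slack but not one that needs 700.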
7.1 Related Work
While runtime reconﬁguration of accelerators is an established concept in embedded systems in general [97], it
gained traction in recent years for achieving performance in systems that need to fulﬁll hard real-time guarantees
[72]. Several works in this direction are concerned with scheduling task sets that employ runtime reconﬁguration
[16, 54, 89]. The authors of [89] present scheduling strategies and admission tests for periodic hard real-time
tasks that occupy area on a runtime-reconﬁgurable fabric for the duration of their execution. Fully and partially
reconfigurable fabrics are considered, where partitions allocated to tasks are restricted to be equally sized. In [16],
an overview of state-of-the-art real-time scheduling for reconfigurable systems is provided, and a scheduling
framework as well as an analysis for periodic multi-priority real-time tasks is presented. The presented model
divides tasks into software and hardware subtasks, where hardware subtasks model the execution of hardware
accelerators on a reconﬁgurable fabric. Software subtasks request execution of hardware subtasks and self-suspend
until the hardware subtask has ﬁnished execution. The requests for hardware subtask execution trigger runtime
reconﬁguration of the respective hardware accelerator onto a partitioned reconﬁgurable area.
Whereas the previous scheduling approaches assume that each task provides a given set of hardware accelerators
that need to be conﬁgured, the work of [54] addresses the problem of ﬁnding a set of conﬁgurations (of a processor
with a reconﬁgurable instruction set) for periodic task graphs that enable timing constraints to be met. A periodic
task graph with deadlines is scheduled and the schedule is partitioned into conﬁgurations for the reconﬁgurable
fabric. Each configuration is assigned an instruction set to optimize the tasks’ WCET. Further work on finding
sets of hardware accelerators that minimize the WCET of a task is available for non-reconfigurable [112] and
reconfigurable processors (see previous chapter). While all of the above-mentioned approaches utilize runtime
reconﬁguration to fulﬁll real-time constraints, none of the approaches considers online optimization of average-
case execution once constraints are met. Thus, they are unable to optimize for dynamic workloads like signal
processing or computer vision applications.
When a task executes faster than in the worst case, the execution time difference between the guaranteed WCET
and the current execution, i.e., the runtime slack (see Fig. 7.1), becomes a resource that can be used to optimize
additional constraints. Several works on runtime slack exploitation target dynamic voltage scaling to reduce the
overall energy consumption of the system [62, 91]. Furthermore, opportunistic monitoring for security or reliability
issues has been presented [67]. Orthogonal to these works, runtime slack was used to optimize the ACET using a
complex (i.e., hard to analyze but high-performance) microarchitecture that provides a simple, timing-analyzable
architecture mode [5]. In case insufﬁcient runtime slack is detected at runtime, the architecture enters the simple
mode that is the basis for WCET guarantees. However, this approach does not provide optimization of WCET
guarantees.
In summary, state-of-the-art approaches either do not consider runtime slack but focus on utilizing runtime recon-
ﬁguration of accelerators for optimizing worst-case execution only, or they do consider runtime slack but not for
reconﬁguration of accelerators. In this chapter, both properties are achieved for the ﬁrst time, i.e., an approach
is presented that optimizes worst-case execution using accelerators and considers runtime slack as a resource for
online optimization of the average-case execution using reconﬁguration.
7.2 System Model
As in previous chapters, this chapter focuses on runtime-reconﬁgurable processor designs in which the core in-
struction set architecture (cISA) of the processor core is extended by custom instructions (CIs) (see Section 2.4).
CIs initiate the execution of (one or more) hardware accelerators on the reconﬁgurable fabric. To model the use of
reconfigurable CIs inside the control flow graph (CFG) of a task during WCET analysis, we build on the “stalling”
model of Chapters 4 and 5, which is summarized in Fig. 7.2 and Fig. 7.3 for the context of this chapter. Each
kernel in the stalling model is preceded by a basic block that initiates reconfiguration of the CIs that are used within
the kernel body (x_reconf in Fig. 7.3). During reconfiguration the task waits, i.e., it resumes execution only after
the reconfiguration delay has passed. The reconfiguration delay depends on the size of the configuration data
and the bandwidth of the reconfiguration port. Afterwards, the CPU can execute the kernel and utilize hardware
accelerators by invoking CIs as shown in Fig. 7.2 (bottom). The functionality of a CI (x_i^HW) is alternatively available
as a software implementation (x_i^SW), e.g., the software implementation that the CI was derived from when
Figure 7.2: Visualization of execution without utilizing acceleration (top) and with conﬁguring CIs before executing a kernel (bottom).
using high-level synthesis (we assume the software implementation will always have a higher latency than the
hardware-accelerated CI). This way, more opportunities to generate CIs can be identiﬁed at compile time than
actually ﬁt onto the reconﬁgurable area of the speciﬁc target platform (as detailed in the previous chapter). The
selection problem, i.e., choosing a subset of CIs to optimize certain criteria, was addressed for optimizing WCET
guarantees in the previous chapter. At runtime, a conditional branch tests whether a speciﬁc CI was conﬁgured
and is available in hardware. If this is the case, the functionality is executed using hardware accelerators, and in
software otherwise.
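The dependence of the reconfiguration delay on configuration size and port bandwidth described above can be illustrated with a first-order estimate. This is only a sketch: the function name and the bitstream size are hypothetical, 1 MB is assumed to be 10^6 bytes, and controller overheads are ignored (the guaranteed delays used in this thesis are obtained with CoRQ, not with this formula). The 400 MHz CPU clock and the 400 MB/s and 800 MB/s bandwidths are the values used in the evaluation of this chapter.

```python
def reconf_delay_cycles(bitstream_bytes: int, bandwidth_bytes_per_s: int, cpu_hz: int) -> int:
    """First-order stall time in CPU cycles while one configuration is loaded.

    Computes ceil(bitstream_bytes / bandwidth * cpu_hz) in integer arithmetic.
    Controller and bus overheads are deliberately ignored (assumption).
    """
    numerator = bitstream_bytes * cpu_hz
    return -(-numerator // bandwidth_bytes_per_s)  # ceiling division


CPU_HZ = 400_000_000          # CPU clock assumed in the evaluation
BITSTREAM_BYTES = 100_000     # hypothetical partial bitstream size

for bandwidth in (400_000_000, 800_000_000):  # 400 MB/s and 800 MB/s
    cycles = reconf_delay_cycles(BITSTREAM_BYTES, bandwidth, CPU_HZ)
    print(f"{bandwidth // 1_000_000} MB/s: {cycles} CPU cycles")
```

Doubling the bandwidth halves the estimated stall, which is why the relative overhead of reconfiguration shrinks on platforms with a faster reconfiguration port.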
Figure 7.3: CFG of a kernel that configures and utilizes reconfigurable CIs in the stalling model. Not all utilized CIs need to be configured (constrained area), but can be executed in a functionally-equivalent software implementation.
The following section gives a high-level summary of how selection
can be solved for WCET-optimizing conﬁgurations and a brief de-
scription of selection for performance-optimizing conﬁgurations
based on the utilized model. Afterwards, extensions to the model
are presented that enable multiple reconﬁgurations within a kernel
to switch between different conﬁgurations at runtime.
7.3 Our Approach
The main idea of this chapter is to employ two conﬁgurations for
the reconﬁgurable fabric: a safe conﬁguration (a selection of CIs
that optimizes the WCET bound like in Chapter 6) and a perfor-
mance conﬁguration (a selection of CIs that optimizes the ACET).
At the start of a kernel, the reconﬁgurable area is conﬁgured us-
ing the safe conﬁguration. During its execution the task’s slack
relative to the WCET guarantee, i.e., the amount of time by which it executed
parts of the code faster than in the worst case, is continuously sampled
and accumulated. Once sufﬁcient slack is accumulated, the per-
formance conﬁguration is conﬁgured onto the reconﬁgurable area
and the average-case execution is accelerated. The performance
conﬁguration has a higher WCET bound than the safe conﬁguration, and in the worst case it could experience a
slower execution than what was guaranteed, i.e., the accumulated slack could be reduced. Therefore, special care
needs to be taken for the case that the slack might be depleted. We might need to switch back to the safe conﬁgu-
ration to avoid this case, and we need sufﬁcient slack left not to violate the guaranteed WCET bound. With high
probability, however, the execution of the performance conﬁguration will be faster (it optimizes the average case)
and the task will ﬁnish with a lower execution time than when executing in the safe conﬁguration.
The approach comprises an ofﬂine preparation phase in which the safe and performance conﬁgurations are created
and their information is annotated to the task under optimization, as well as the actual online optimization phase
that performs reconﬁgurations within the WCET bound using runtime slack.
7.3.1 Ofﬂine Preparation
The input to the ofﬂine preparation is the reconstructed control-ﬂow graph (CFG) of the task’s binary, which is
obtained as the ﬁrst step during WCET analysis (see Section 2.2).
Safe Conﬁguration
Figure 7.4: CIs are selected to obtain the safe configuration. Each time a CI is selected, the WCET needs to be estimated again. (Flowchart: start with an empty reconfigurable area; perform a WCET estimate using IPET; calculate CI profits on the current worst-case path; greedily select the CI with the highest profit, which can change the worst-case path; if the reconfigurable area is not yet full, repeat from the WCET estimate.)
Several methods to obtain a suitable safe conﬁguration
exist as detailed in the previous chapter. In Section 6.5
a greedy algorithm was presented, which is brieﬂy
summarized as follows (see Chapter 6 for details, espe-
cially supporting multiple hardware implementations
per CI and the inﬂuence of reconﬁguration delay on the
selection result, which are ignored here for brevity).
The algorithm repeatedly performs WCET estimation
of the CFG and CI selection using the stalling model as
shown in Fig. 7.4. The ﬁrst WCET estimate will result
in a worst-case path that does not utilize any hardware-accelerated CI: because the software implementation (xSWi
in Fig. 7.3) has a longer latency than the CI, it is the worst-case choice. After each WCET estimate the CI that pro-
vides the maximum proﬁt on the current worst-case path is selected and added to the temporary safe conﬁguration,
where the proﬁt of a CI a is estimated as:
\[ \mathrm{profit}_{\mathrm{WCET}}(a) = \sum_{x_i \text{ calls functionality of } a} x_i \cdot (\mathrm{latency}_{\mathrm{sw}} - \mathrm{latency}_{\mathrm{hw}}) \]
I.e., the total profit of selecting a is estimated to be the latency difference between its software implementation
and CI multiplied by the total number of times a is executed in the current worst-case path. After annotating
the selection of CI a′ with maximum profit to the CFG such that all calls to a′ utilize the hardware-accelerated CI
(instead of functionally-equivalent software), another WCET estimate needs to be performed. The worst-case path
might have changed, because the execution time of the previous worst-case path was reduced. WCET estimation
and CI selection are repeated alternately until the whole reconﬁgurable area of the target platform is occupied.
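The alternation of WCET estimation and greedy selection described above can be sketched as follows. This is a toy model under loudly stated assumptions, not the implementation used in this thesis: the IPET-based WCET estimate is replaced by an explicit maximum over a handful of enumerated paths, each CI has a single hardware implementation, reconfiguration delay is ignored, and the loop stops early when no remaining CI yields a profit on the current worst-case path. All names and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CI:
    latency_sw: int  # cycles of the software implementation
    latency_hw: int  # cycles of the hardware-accelerated CI
    area: int        # hypothetical area units on the reconfigurable fabric


def path_time(path, cis, selected):
    """Execution time of one path; selected CIs run in hardware, others in software."""
    base_cycles, calls = path
    return base_cycles + sum(
        n * (cis[a].latency_hw if a in selected else cis[a].latency_sw)
        for a, n in calls.items()
    )


def wcet_estimate(paths, cis, selected):
    """Stand-in for the IPET estimate: returns (bound, worst-case path)."""
    worst = max(paths, key=lambda p: path_time(p, cis, selected))
    return path_time(worst, cis, selected), worst


def select_safe_configuration(paths, cis, area_budget):
    selected, free = set(), area_budget
    while True:
        # Re-estimate after every selection: the worst-case path may have changed.
        _, (_, calls) = wcet_estimate(paths, cis, selected)

        def profit(a):
            return calls.get(a, 0) * (cis[a].latency_sw - cis[a].latency_hw)

        candidates = [a for a in cis if a not in selected and cis[a].area <= free]
        if not candidates:
            break
        best = max(candidates, key=profit)
        if profit(best) <= 0:  # simplification: stop when nothing helps the worst case
            break
        selected.add(best)
        free -= cis[best].area
    return selected


cis = {"A": CI(100, 10, 1), "B": CI(80, 20, 1), "C": CI(50, 5, 1)}
paths = [
    (1000, {"A": 10, "C": 4}),  # hypothetical path 1
    (1000, {"B": 12, "C": 4}),  # hypothetical path 2
]
safe_cfg = select_safe_configuration(paths, cis, area_budget=2)
print(sorted(safe_cfg), wcet_estimate(paths, cis, safe_cfg)[0])
```

In this example the first selection accelerates path 1, after which path 2 becomes the worst-case path and determines the second selection, which mirrors why the estimate has to be repeated after each step.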
Performance Conﬁguration
The performance conﬁguration is obtained with a similar greedy algorithm. We utilize the same latency estimates
(latencysw and latencyhw) that were obtained during WCET estimation. Instead of worst-case path information,
however, proﬁling information is used to determine the proﬁt of a CI, i.e., the application is run for a typical use
case to determine the number of calls na to each CI a. Thus, the proﬁt on the ACET is estimated as:
\[ \mathrm{profit}_{\mathrm{ACET}}(a) = n_a \cdot (\mathrm{latency}_{\mathrm{sw}} - \mathrm{latency}_{\mathrm{hw}}) \]
Again, CIs are greedily added to the performance conﬁguration until the whole reconﬁgurable area is occupied.
In contrast to the worst-case path, the average-case calls to accelerators do not change when additional CIs are
selected. Therefore, only a single proﬁling run is required.
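Because the profiled call counts do not change when CIs are selected, the greedy selection for the performance configuration reduces to a single pass over the CIs sorted by profit. The following sketch illustrates this under assumptions: one hardware implementation per CI, hypothetical area units, and a policy of skipping CIs that no longer fit (the thesis does not prescribe this detail here). All names and numbers are made up for illustration.

```python
def select_performance_configuration(call_counts, latencies, areas, area_budget):
    """Greedy selection by ACET profit n_a * (latency_sw - latency_hw).

    call_counts: CI name -> number of calls observed in one profiling run
    latencies:   CI name -> (latency_sw, latency_hw) in cycles
    areas:       CI name -> hypothetical area units
    """
    profit = {
        a: n * (latencies[a][0] - latencies[a][1])
        for a, n in call_counts.items()
    }
    selected, free = [], area_budget
    # Profits are fixed, so one pass in descending order is sufficient.
    for a in sorted(profit, key=profit.get, reverse=True):
        if profit[a] > 0 and areas[a] <= free:
            selected.append(a)
            free -= areas[a]
    return selected


call_counts = {"A": 2, "B": 30, "C": 50}                      # hypothetical profile
latencies = {"A": (100, 10), "B": (80, 20), "C": (50, 5)}
areas = {"A": 1, "B": 1, "C": 1}
perf_cfg = select_performance_configuration(call_counts, latencies, areas, area_budget=2)
print(perf_cfg)
```

In contrast to the WCET-driven selection, no re-estimation is needed between selections, which is the reason a single profiling run suffices.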
WCET Bounds
The ﬁnal ofﬂine preparation step is to determine WCET bounds for certain parts and conﬁgurations of the task
that are used to perform decisions during online optimization. The task’s guaranteed WCET bound is determined
based on the safe conﬁguration. Additionally, WCET bounds are determined for single iterations of each kernel.
1 wrslck 0                 (reset runtime slack accumulator)
2 for (...) {              (kernel header)
3   mobeg WCET^safe_iter   (set counter for current iteration)
4   rdslck slack           (read accumulated runtime slack)
5   conditionally reconfigure based on slack, see Fig. 7.5
6   ...                    (original kernel body)
7   moend }                (add counter value to accumulated slack)
Listing 7.1: Operations that manage slack monitoring are added to the kernels that should be optimized at runtime
Figure 7.5: In each kernel iteration it is decided whether to reconfigure based on the current runtime slack and slack thresholds (th→perf and th→safe). (Flowchart: if in the safe configuration and slack > th→perf, reconfigure the performance configuration, otherwise stay in the safe configuration; if in the performance configuration and slack < th→safe, reconfigure the safe configuration, otherwise stay in the performance configuration.)
As shown in Fig. 7.3, a single iteration starts with the kernel header and ends with a branch from the end of the
kernel body back to the header. The bounds of an iteration are determined as $\mathrm{WCET}^{\mathrm{safe}}_{\mathrm{iter}}$ and $\mathrm{WCET}^{\mathrm{perf}}_{\mathrm{iter}}$ for the
safe and performance conﬁguration, respectively. This information is required for online slack monitoring and to
decide when to switch the conﬁguration as explained in the following section.
7.3.2 Online Optimization
Slack Monitoring
To enable online slack monitoring, we utilize a simple performance counter that counts CPU cycles similar to
the cycle count register of the “Performance Monitoring Unit” in ARM Cortex-R cores used in the Xilinx Zynq
UltraScale+ platform. Four operations are deﬁned in the following that control counting and accumulation of
CPU cycles: mobeg, moend, rdslck and wrslck. These operations are added to the kernels that should be
optimized as shown in Listing 7.1. At the beginning of each iteration of a kernel, the counter is initialized with
the kernel’s per-iteration WCET bound $\mathrm{WCET}^{\mathrm{safe}}_{\mathrm{iter}}$ using mobeg, which copies an unsigned integer value from a
register argument into the counter. Then, during the kernel iteration, the counter is decremented in every cycle.
At the end of the kernel iteration, moend stops the counter and adds the current value to the accumulator. Note
that, because the counter was initialized with $\mathrm{WCET}^{\mathrm{safe}}_{\mathrm{iter}}$, it is guaranteed to have a value ≥ 0 when the kernel
is executed in the safe configuration. In the performance configuration, however, the counter could count down
to $\mathrm{WCET}^{\mathrm{safe}}_{\mathrm{iter}} - \mathrm{WCET}^{\mathrm{perf}}_{\mathrm{iter}}$ (a negative number, because $\mathrm{WCET}^{\mathrm{perf}}_{\mathrm{iter}} > \mathrm{WCET}^{\mathrm{safe}}_{\mathrm{iter}}$). In this case, the accumulated
slack is reduced. The currently accumulated slack is read using rdslck, which copies the accumulator value to
a register argument. In case the accumulated slack is reduced below a certain threshold while in the performance
conﬁguration, the safe conﬁguration needs to be conﬁgured again. How the thresholds are obtained and enforced
using the conditional reconﬁguration (Line 5) is described in the following.
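The behavior of the four slack-monitoring operations can be mimicked by a small software model. This is a sketch for illustration only: the class and the `tick` method are hypothetical (the hardware counter decrements once per cycle, whereas the model subtracts a whole iteration's cycle count at once), and the per-iteration execution times are made up. The two per-iteration bounds (12630 and 16374 cycles) are the values reported for EncodeMacroBlock in Section 7.4.1.

```python
class SlackMonitor:
    """Software model of the cycle counter plus accumulator (illustrative only)."""

    def __init__(self):
        self.counter = 0
        self.accumulator = 0
        self.running = False

    def wrslck(self, value):
        """Write the runtime slack accumulator (used to reset it to 0)."""
        self.accumulator = value

    def mobeg(self, wcet_safe_iter):
        """Load the per-iteration WCET bound of the safe configuration and start counting."""
        self.counter = wcet_safe_iter
        self.running = True

    def tick(self, cycles=1):
        """Model the passing of CPU cycles while the counter is running."""
        if self.running:
            self.counter -= cycles

    def moend(self):
        """Stop counting and add the (possibly negative) remainder to the accumulator."""
        self.running = False
        self.accumulator += self.counter

    def rdslck(self):
        """Read the accumulated runtime slack."""
        return self.accumulator


WCET_SAFE_ITER = 12630   # per-iteration bound, safe configuration
WCET_PERF_ITER = 16374   # per-iteration bound, performance configuration

monitor = SlackMonitor()
monitor.wrslck(0)
history = []
# Two fast iterations (hypothetical 10000 cycles each), then one iteration that
# hits the worst case of the performance configuration.
for measured_cycles in (10000, 10000, WCET_PERF_ITER):
    monitor.mobeg(WCET_SAFE_ITER)
    monitor.tick(measured_cycles)
    monitor.moend()
    history.append(monitor.rdslck())
print(history)
```

The last iteration removes 3744 cycles of slack, which is the difference between the two per-iteration bounds.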
Reconﬁguration within WCET Bounds
While the stalling model (detailed in Chapter 5) enables reconﬁguration only before entering a kernel, in the fol-
lowing conditional reconﬁguration is introduced that decides whether to reconﬁgure or not in each iteration of
the kernel based on the currently accumulated slack as shown in Fig. 7.5. As shown in Listing 7.1, conditional
reconﬁguration is executed before the kernel body in every kernel iteration (Line 5). It manages the state of the
reconfigurable area and triggers a reconfiguration in case the accumulated slack has reached a certain threshold. To
determine the minimum threshold of accumulated slack that allows switching from safe to performance configuration
(while maintaining WCET guarantees), the worst-case execution immediately after reconfiguration of the
performance configuration was triggered needs to be considered. First, it takes $\mathrm{rdelay}_{\mathrm{perf}}$ cycles to configure the
performance configuration. Then, the kernel iteration takes a maximum of $\mathrm{WCET}^{\mathrm{perf}}_{\mathrm{iter}}$ cycles, which means that the
accumulated slack is reduced by $|\mathrm{WCET}^{\mathrm{safe}}_{\mathrm{iter}} - \mathrm{WCET}^{\mathrm{perf}}_{\mathrm{iter}}|$. Finally, when the conditional reconfiguration block is
executed again and we need to switch back to the safe configuration, it takes $\mathrm{rdelay}_{\mathrm{safe}}$ cycles to perform the reconfiguration.
In summary, the minimum accumulated slack to be able to switch from safe to performance configuration
and remain within the WCET guarantee in the worst case is:
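\[ \mathrm{th}_{\rightarrow\mathrm{perf}} := \mathrm{rdelay}_{\mathrm{perf}} + \left|\mathrm{WCET}^{\mathrm{safe}}_{\mathrm{iter}} - \mathrm{WCET}^{\mathrm{perf}}_{\mathrm{iter}}\right| + \mathrm{rdelay}_{\mathrm{safe}} \tag{7.1} \]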
As mentioned before, a kernel iteration reduces the accumulated slack by $|\mathrm{WCET}^{\mathrm{safe}}_{\mathrm{iter}} - \mathrm{WCET}^{\mathrm{perf}}_{\mathrm{iter}}|$ in the worst
case and configuring the safe configuration takes $\mathrm{rdelay}_{\mathrm{safe}}$. Thus, to safely switch back from performance to safe
configuration, the reconfiguration needs to be triggered once the accumulated slack is lower than:
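\[ \mathrm{th}_{\rightarrow\mathrm{safe}} := \left|\mathrm{WCET}^{\mathrm{safe}}_{\mathrm{iter}} - \mathrm{WCET}^{\mathrm{perf}}_{\mathrm{iter}}\right| + \mathrm{rdelay}_{\mathrm{safe}} \tag{7.2} \]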
For any value > th→safe, there is still enough accumulated slack to safely execute one iteration of the kernel (even
in worst case) and switch back to the safe conﬁguration afterwards.
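Equations (7.1) and (7.2) and the decision of Fig. 7.5 can be written down in a few lines. The sketch below is illustrative: the function names are hypothetical and the two reconfiguration delays of 100000 cycles are made-up placeholders (guaranteed delays would come from CoRQ). The per-iteration bounds are the EncodeMacroBlock values from Section 7.4.1, and the strict comparisons follow Fig. 7.5.

```python
def slack_thresholds(wcet_safe_iter, wcet_perf_iter, rdelay_perf, rdelay_safe):
    """Return (th_to_perf, th_to_safe) according to Eq. (7.1) and Eq. (7.2)."""
    penalty = abs(wcet_safe_iter - wcet_perf_iter)
    th_to_perf = rdelay_perf + penalty + rdelay_safe
    th_to_safe = penalty + rdelay_safe
    return th_to_perf, th_to_safe


def conditional_reconfiguration(in_safe, slack, th_to_perf, th_to_safe):
    """Decision of Fig. 7.5; returns (in_safe_afterwards, reconfiguration_triggered)."""
    if in_safe:
        if slack > th_to_perf:
            return False, True   # reconfigure the performance configuration
        return True, False       # stay in the safe configuration
    if slack < th_to_safe:
        return True, True        # reconfigure the safe configuration
    return False, False          # stay in the performance configuration


# Per-iteration bounds from Section 7.4.1; both delays are hypothetical.
th_to_perf, th_to_safe = slack_thresholds(12630, 16374, 100_000, 100_000)
print(th_to_perf, th_to_safe)
print(conditional_reconfiguration(True, 250_000, th_to_perf, th_to_safe))
print(conditional_reconfiguration(False, 90_000, th_to_perf, th_to_safe))
```

The gap between the two thresholds is exactly the delay for configuring the performance configuration, so a switch to the performance configuration does not immediately force a switch back.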
During the offline preparation phase (explained in Section 7.3.1), the slack is annotated as constant 0 such that only the
initial reconﬁguration to the safe conﬁguration is accounted for during WCET analysis (just like without applying
the approach of this chapter).
7.4 Experimental Evaluation
The presented approach is suitable for any application that can beneﬁt from hardware accelerators in different
execution paths. In the following, it is evaluated on the runtime-reconfigurable processor i-Core that was introduced in
Section 2.4. For this work, a simple performance counter (cycle counter plus accumulator) was added that
implements the operations for slack monitoring explained in Section 7.3.2. The offline preparation algorithms
(Section 7.3.1) were implemented by running AbsInt aiT [2] in batch mode to obtain WCET estimates and iteratively
generate constraints in aiT’s AIS2 constraint language to model selected CIs. As aiT is closed-source software, we
could not directly integrate support for reconﬁgurable CIs. Instead, every call to a CI in the binary was substituted
by an ADD opcode and a constraint that sets the delay for the new ADD instruction to the delay of the speciﬁc
CI during WCET estimation (see Section 5.6 for details). Guaranteed reconﬁguration delays are obtained using
CoRQ (see Chapter 4) and are annotated using additional AIS2 constraints.
The approach is evaluated with the same H.264 encoder application used in Chapters 5 and 6, which uses 9 CIs that
cover the most compute-intensive kernels (shown in Table 7.1). Each kernel reconfigures the full reconfigurable area
(modeled to be similar in size to that of the Xilinx Zynq XC7Z010 platform). The H.264 encoder covers most of the
properties tested in the TACLeBench² WCET benchmark, e.g., the H.264 decoder that is part of TACLeBench
performs a subset of the computations performed in the H.264 encoder that we evaluate. We compiled the application
using BCC 4.4.2 (Gaisler’s extended GCC 4.4.2) at O1³ for a frame size of 396 macroblocks (i.e., CIF
resolution). Note that higher resolutions would benefit our approach by reducing the relative overhead of reconfiguration
delays. Measured execution times are obtained using our cycle-accurate SystemC-based simulator of the
1 C Lines of Code that are replaced by utilizing a hardware CI (without comments or whitespace)
2 http://www.tacle.eu/index.php/activities/taclebench
3 At higher optimization levels, GCC emitted so-called irreducible loops that increase the WCET estimate compared to O1.
Table 7.1: CIs used in the H.264 application

Accelerated function and description          MB type   Working set   CLoC¹
MotionEstimation kernel
  SATD: Sum of Abs. Transf. Differences       P and I   16×16 px      123
  SAD: Sum of Abs. Differences                P and I   16×16 px      24
EncodeMacroBlock kernel
  MC_Hz: Motion Compens. Interpol. Horiz.     P         4 px          51
  IPred_HDC: Intra Prediction Horiz.          I         16×16 px      35
  IPred_VDC: Intra Prediction Vert.           I         16×16 px      19
  DCT: Discrete Cosine Transf.                P and I   4×4 px        76
  HT2x2: Hadamard Transform                   P and I   2×2 px        12
  HT4x4: Hadamard Transform                   I         4×4 px        111
LoopFilter kernel
  LoopFilter: In-Loop Deblock. Filter         P and I   4 px          82
Figure 7.6: Execution time of EncodeMacroBlock for safe configuration and different performance configurations obtained for different execution profiles. (Plot: x-axis: % of intra-coded macroblocks (I-MBs), 0 to 100 in steps of 10, plus avg; y-axis: execution time [cycles], 0 to 6·10^6; series: WCET bound (safe cfg.), Safe cfg., Perf. cfg.; annotation: safe cfg. = perf cfg.)
reconﬁgurable processor. Before performing the evaluation, we calibrated aiT and our simulator by harmonizing
hardware parameters and verifying the results of test-cases, e.g., load-store sequences. Hardware parameters like
reconﬁgurable area constraints, reconﬁguration bandwidth and partial bitstream sizes of CI implementations were
obtained from our Xilinx Virtex-7-based hardware prototype. The CPU is modeled to be clocked at 400 MHz
(at which the LEON3 operates when implemented as an ASIC), the reconﬁgurable area is clocked at 100 MHz
(frequency of accelerators used by CIs when implemented on Virtex-7).
7.4.1 Results
Ofﬂine Preparation
The H.264 encoder has two main execution paths with different CI requirements, depending on whether a mac-
roblock (MB) is encoded either by referencing an MB from a previous frame (P-MB) or using the current frame
only (I-MB). Table 7.1 shows which CIs are used in the I-MB or P-MB path. First, we will focus on the
EncodeMacroBlock kernel that performs the actual encoding. It is the most complex kernel and provides an
opportunity to create distinct safe and performance configurations. Which of the 396 MBs are encoded as
I-MBs and which as P-MBs at runtime is input-dependent: hectic video scenes have a high ratio of I-MBs (up to 100%,
e.g., video from a camera pointing sideways out of a moving car), steady scenes have a low ratio of I-MBs. There-
fore, the performance conﬁguration differs for different execution proﬁles (synthesized input data that is encoded
as I-MBs and P-MBs randomly distributed in the desired ratio). During WCET optimization, however, the current
worst-case path encodes all MBs either as P-MBs or I-MBs (the worst-case path changes while preparing the safe
conﬁguration, see Fig. 7.4). The ﬁnal safe conﬁguration (see Section 7.3.1 and Chapter 6) selects MC_Hz, DCT,
HT2x2 and HT4x4. In this conﬁguration, the worst-case path does not encode any I-MBs but only P-MBs.
Figure 7.6 shows execution time results for different execution proﬁles (and different resulting performance con-
ﬁgurations) of the EncodeMacroBlock kernel. The performance conﬁguration differs from the safe conﬁguration
when 40% or more of the total MBs in a frame are I-MBs. In these cases, IPred_HDC and IPred_VDC are selected
instead of MC_Hz and a performance increase of up to 27.4% is achieved at x = 100 (7.7% on average, without
reconfiguration delay). Unaccelerated execution takes between 20.1 · 10^6 (x = 0) and 19.4 · 10^6 (x = 100) cycles,
i.e., the measured speedup of the safe conﬁguration is between 7.1× and 5.4×.
Online Optimization
Figure 7.7: Execution time of EncodeMacroBlock for safe configuration and online optimization for different execution profiles and reconfiguration bandwidths (400 MB/s (top), 800 MB/s (bottom)). (Two plots, one per reconfiguration bandwidth. x-axis: % of intra-coded macroblocks (I-MBs), 0 to 100 in steps of 10, plus avg and avg≥40; y-axis: execution time [cycles], 0 to 6·10^6; series: WCET bound (safe cfg.), Safe cfg., Optimized; annotation: profiles targeted by perf. cfg.)
As shown in Fig. 7.6, the presented optimization approach is suitable for the H.264 encoder when video frames
with ≥ 40% I-MBs can be expected (i.e., moderate movement). We proceed using the performance configuration
obtained in this case (IPred_HDC, IPred_VDC, DCT, HT2x2, HT4x4) in the following for all execution profiles. In
this case, $|\mathrm{WCET}^{\mathrm{safe}}_{\mathrm{iter}} - \mathrm{WCET}^{\mathrm{perf}}_{\mathrm{iter}}| = |12630 - 16374| = 3744$, i.e., an iteration encoding P-MBs takes 3744 cycles
longer for the performance configuration compared to the safe configuration in the worst case. Figure 7.7 shows
execution time results of the EncodeMacroBlock kernel when our approach is applied for different reconfiguration
bandwidths and I-MB ratios. A reconfiguration bandwidth of 400 MB/s (Fig. 7.7 (top)) is supported by our
Xilinx Virtex-7-based hardware prototype, while recent Xilinx FPGAs (“UltraScale+”) support a reconfiguration
bandwidth of 800 MB/s⁴ (Fig. 7.7 (bottom)). In both cases, our approach is beneficial for frames that contain
≥ 50% of I-MBs. Even though the performance configuration results in a lower execution time at 40% I-MBs (see
Fig. 7.6), the speedup is voided by the additional reconfiguration delay (of switching from safe to performance
4 See: “Virtex UltraScale+ FPGA Data Sheet: DC and AC Switching Characteristics” (DS923)
configuration after accumulating sufficient slack). The maximum execution time reduction is 19.8% and 23.0% for
a reconfiguration bandwidth of 400 MB/s and 800 MB/s, respectively. For frames that contain < 50% of I-MBs,
the execution time can increase (as expected) by up to 11% (x = 0 for both reconf. bandwidths). When
only considering the execution profiles that the performance configuration was obtained for (≥ 40% of I-MBs),
the average execution time reduction (avg≥40) is 8.1% and 10.2% (total average 1.5% and 3.0%) for a reconfiguration
bandwidth of 400 MB/s and 800 MB/s, respectively. The statically guaranteed WCET bounds were not reached in any of
our experiments.
Finally, ca. 45% of the execution time of the H.264 encoder is spent in the EncodeMacroBlock kernel (between
38.2% and 51.7% in our measurements). Thus, our technique (when applied to the EncodeMacroBlock kernel)
reduces the total execution time on average (including atypical execution profiles like still images)
by approximately 0.7% and 1.4% for a reconfiguration bandwidth of 400 MB/s and 800 MB/s, respectively. On average in
the targeted execution profiles (avg≥40), the total execution time is reduced by 3.6% and 4.6%, and by up to
8.9% and 10.4% (again, for a reconfiguration bandwidth of 400 MB/s and 800 MB/s, respectively). Note that the
execution time reductions were achieved by only adding slack monitoring using performance counters as described
in Section 7.3.2 to the runtime-reconﬁgurable processor i-Core.
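The step from kernel-level to application-level reductions above can be retraced with a short calculation. This is a first-order sketch under the assumption that the kernel's share of the total execution time is a constant 45% (the text reports a range of 38.2% to 51.7%), so the results only approximate the reported figures; the function and variable names are hypothetical.

```python
KERNEL_SHARE = 0.45  # approximate share of EncodeMacroBlock in total execution time

# Kernel-level execution time reductions in percent for (400 MB/s, 800 MB/s),
# as reported for EncodeMacroBlock.
KERNEL_REDUCTION_PCT = {
    "total average": (1.5, 3.0),
    "avg>=40": (8.1, 10.2),
    "maximum": (19.8, 23.0),
}


def total_reduction_pct(kernel_reduction_pct, kernel_share=KERNEL_SHARE):
    """Scale a kernel-level reduction to the whole application (first-order estimate)."""
    return kernel_reduction_pct * kernel_share


totals = {
    label: tuple(total_reduction_pct(value) for value in values)
    for label, values in KERNEL_REDUCTION_PCT.items()
}
for label, (at_400, at_800) in totals.items():
    print(f"{label}: {at_400:.2f}% (400 MB/s), {at_800:.2f}% (800 MB/s)")
```

Under this assumption the scaled values agree with the reported application-level reductions to within rounding.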
7.5 Conclusion
This chapter presented the ﬁrst step towards optimized static WCET guarantees and runtime optimization of the
ACET using runtime reconfiguration of hardware accelerators. It might very well enable the use of runtime reconfiguration
for performance in safety-critical systems (where nowadays only static configurations are used), because
it allows runtime slack to be used for reconfiguring accelerators that benefit average-case execution while maintaining
WCET guarantees. Execution time reductions of up to 8.9% and 10.4% (on average 3.6% and 4.6%
for targeted execution proﬁles) at a reconﬁguration bandwidth of 400 MB/s and 800 MB/s, respectively, were
demonstrated for the H.264 encoder. To achieve these beneﬁts, only slack monitoring using a performance counter
needed to be added to a runtime-reconﬁgurable processor. In future work, more sophisticated runtime slack pre-
diction (instead of static thresholds) could beneﬁt execution proﬁles that were not targeted by the “performance
conﬁguration”, e.g., to stay in the time-wise safe conﬁguration when the worst-case path is frequently executed
even though sufﬁcient slack was accumulated.
The following chapter concludes the contributions of this thesis.

8 Thesis Conclusion
Real-time systems have a rapidly increasing demand for performance that cannot be provided by high-performance
architectures, which were designed for average-case performance. First, this thesis introduced novel co-scheduling
approaches to distribute work among CPU and GPU in an extensive analysis of how (average-case) performance
is achieved on fused CPU-GPU architectures, a main trend in current high-performance microarchitectures that
combines a CPU and a GPU on a single chip. Being able to employ such architectures in real-time systems would
be highly desirable, because they provide high performance within a limited area and power budget. During the
analysis, however, a cache coherency bottleneck was uncovered that (i) complicated performance predictions and
(ii) added a shared last level cache between CPU and GPU to the growing list of microarchitectural features that
can benefit average-case performance, but render the analysis of WCET guarantees on high-performance architectures
virtually infeasible. This motivated the need for novel microarchitectural features that provide predictable
performance and are amenable to timing analysis. Thus, the main focus of this thesis was to establish worst-case
execution time guarantees for runtime-reconﬁgurable systems as a novel means to achieve predictable performance.
Towards this end, ﬁrst a runtime reconﬁguration controller called “Command-based Reconﬁguration Queue”
(CoRQ) was presented that provides guaranteed latencies for its operations and enables timing analysis of runtime-reconfigurable
architectures for WCET guarantees. Based on the (now feasible) guaranteed reconfiguration delay
of accelerators, a WCET analysis was introduced that enables tasks to reconfigure application-specific custom instructions
(CIs, which invoke execution of one or more accelerators) during runtime. Different measures to deal
with reconfiguration delays were compared, and the timing anomaly of runtime reconfiguration was identified
and safely bounded: a case where executing iterations of a computational kernel faster than in the worst case can prolong
the total execution time of a task. Once tasks that perform runtime reconfiguration of CIs could be analyzed
for WCET guarantees, the question arose of which CIs to configure on a constrained reconfigurable area to optimize
the WCET when multiple CIs, each with different implementations (allowing latency to be traded off against
area requirements), can be selected. This so-called WCET-optimizing instruction set selection problem was modeled
based on the Implicit Path Enumeration Technique. To our knowledge, this is the first approach that enables
WCET optimization with support for global program ﬂow information (and reconﬁguration delay). An optimal
algorithm (similar to Branch and Bound) and a fast greedy heuristic algorithm (that achieves the optimal solution
in most cases) were presented. Finally, an approach was presented that for the ﬁrst time combines optimized static
WCET guarantees and runtime optimization of the average-case execution (maintaining WCET guarantees) us-
ing runtime reconﬁguration of hardware accelerators. It comprised an analysis of runtime slack bounds that enable
safe reconﬁguration for average-case performance under WCET guarantees and presented a mechanism to monitor
runtime slack using a simple performance counter that is commonly available in many microprocessors.
Ultimately, runtime reconfiguration of accelerators was shown to be a key feature for achieving predictable performance.
8.1 Future Work
This thesis opens several directions for future research. Two main directions are highlighted in the following.
8.1.1 WCET Guarantees and Mixed-Criticality for Loosely-Coupled Reconfigurable Architectures
The WCET analysis and optimization of Chapters 5 to 7 focus on reconﬁgurable processors, i.e., architectures
where the reconfigurable fabric is integrated into the pipeline of a processor (a so-called tightly-coupled architecture).
A tight coupling between processor and FPGA reduces communication latencies and enables readily-configured
accelerators to be analyzed just like existing multi-cycle instructions during WCET analysis (see Chapter 5).
However, the design of a tightly-coupled architecture requires considerable effort to integrate processor
pipeline and reconﬁgurable fabric. Thus, many commercially-available architectures are loosely-coupled, i.e.,
processor(s) and reconﬁgurable fabric are separate processing devices on the same chip that are connected via
a common system bus. Commercial loosely-coupled architectures are, e.g., the Xilinx Zynq or Intel (formerly
Altera) SoC FPGA platforms. To achieve WCET guarantees on such architectures, communication protocols be-
tween processor and reconﬁgurable accelerators as well as worst-case analyses need to be designed that consider
the common system bus, but still allow execution time guarantees. This would enable application of the approaches
presented in Chapters 5 to 7 to loosely-coupled architectures and provide the basis for further research in the direction
of mixed-criticality [19], where only a subset of tasks needs to fulfill execution time guarantees while the
other tasks are executed with best effort. E.g., the Xilinx Zynq UltraScale+ couples an ARM Cortex-A53 high-
performance processor and an ARM Cortex-R5 real-time processor to a reconﬁgurable fabric. When only real-time
tasks executing on the Cortex-R5 utilize reconfigurable accelerators, it can be expected that, as a result of requiring
schedulability guarantees, the overall utilization of the reconfigurable fabric will be quite low, as recent works on
real-time scheduling on a loosely-coupled system have shown [16]. Allowing best-effort tasks that execute on the
Cortex-A53 to also utilize reconfigurable accelerators on the shared fabric can help to raise its utilization (resulting
in fewer wasted resources of the whole system), while guarantees are maintained for real-time tasks¹.
8.1.2 Probabilistic WCET Guarantees
This thesis focused on analysis of deterministic WCET guarantees, i.e., through static analysis an execution time
bound is obtained that is guaranteed to never be exceeded. As motivated in Chapters 1 and 3, the problem with
this approach is that current high-performance architectures are virtually impossible to analyze, because they introduce
average-case performance enhancing features that lead to an explosion of possible microarchitectural states. More
precisely, for many of these features, e.g., out-of-order execution [63, 90], the effects on the execution time are
actually understood in principle, but modeling them analytically is only possible with simpliﬁcations that introduce
so much pessimism that the results cannot be used in practice [14]. Therefore, an emerging trend in real-time systems
research is the analysis of probabilistic WCET (pWCET) guarantees. Instead of obtaining an absolute WCET
guarantee, the aim of pWCET analysis is to obtain a probability distribution (an exceedance function) that, for a given execution time,
determines the probability that the execution time is exceeded. When this probability is sufficiently low (often
lower than 10^−15 is targeted), the execution time bound that achieves this probability is considered safe, because,
e.g., the probability of mechanical failures in the real-time system is several orders of magnitude higher. pWCET analysis
exists in static and measurement-based variants. Measurement-based pWCET analysis can derive execution time
guarantees from measured execution times of a task (“end-to-end” measurements), when the measurements ful-
ﬁll certain statistical properties (e.g., they need to be independent and identically distributed (i.i.d.)) that allow the
application of Extreme Value Theory [41]. Extreme Value Theory can derive probability density functions that pre-
cisely model the extremes from statistical data (like maximum execution time from execution time measurements).
This makes measurement-based pWCET an especially promising approach, because the real-time system can be
treated as a gray box (only partial knowledge of its behavior is required) such that it can potentially overcome the
1 Early results of our research in this direction can be found in Rapp, Martin. “A Mixed Criticality Architecture with Reconfigurable Accelerators”, Master Thesis, 2016.
limitation of static WCET analysis, which requires the whole system behavior to be modeled in detail. However,
obtaining i.i.d. measurements that allow measurement-based pWCET requires some form of randomization and it
is still unclear how pWCET results can be safely used in schedulability analyses [33]. Furthermore, in systems that
are analyzable for deterministic WCET guarantees, pWCET analysis does not necessarily produce better results
[1]. In our future research, we will investigate the applicability of pWCET analysis to fused CPU-GPU architectures
(see Chapter 3) as well as reconfigurable architectures (building on the approaches presented in the previous
chapters). Our aim is to enable pWCET guarantees for systems that can so far not be analyzed for execution time
guarantees at all as well as to evaluate what kind of architectures are more suitable to either deterministic WCET
or pWCET guarantees.

A Appendix
A.1 Demonstration Prototypes
In the context of this thesis, mainly within the DFG-funded project Invasive Computing (Section 2.5.1), demonstration
setups were created that showed the practicality of the presented research.
A.1.1 Concurrent Reconﬁgurable Fabric Utilization
Figure A.1 shows a demonstrator, which builds on previous i-Core prototypes to implement an extension to the
i-Core concept (see Section 2.4) called Concurrent Reconﬁgurable Fabric Utilization (COREFAB) [45] on a Xilinx
Virtex-7 VC707 evaluation board. COREFAB allows general-purpose processors (GPPs) within the same system
on chip to utilize reconﬁgurable fabric of i-Core to execute CIs. In the following, CIs issued by the i-Core are
named ‘primary CIs’ and those from the GPPs are named ‘remote CIs’. From the application developer’s view,
primary CIs and remote CIs appear identical, but the latency to issue a remote CI is higher and primary CIs are
preferred over remote CIs. Primary and remote CIs are analyzed in hardware during execution. As long as no
resource conﬂict appears, both CIs execute in parallel using accelerators that are conﬁgured on the reconﬁgurable
fabric. In case a conﬂict appears, the remote CI is stalled and the primary CI proceeds. Thus, CI execution on i-Core
is not impaired by remote CIs, which is an opportunity for future work to investigate mixed-criticality execution
on the i-Core’s reconﬁgurable fabric (guaranteeing WCET for i-Core only, but achieving a high utilization of
accelerators using remote CIs).
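The conflict rule described above can be illustrated with a small software model. The sketch below is a hypothetical simplification and not the COREFAB hardware: the `CI` class, the `arbitrate` function, and the accelerator slot numbers are assumptions made for illustration. Only the rule itself is taken from the text: without a resource conflict both CIs execute in parallel, and on a conflict the remote CI is stalled while the primary CI proceeds.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class CI:
    """A custom instruction and the accelerator slots it needs (hypothetical model)."""
    name: str
    accelerators: frozenset


def arbitrate(primary: Optional[CI], remote: Optional[CI]) -> Tuple[bool, bool]:
    """Decide for one step whether the primary and the remote CI may execute.

    Returns (primary_runs, remote_runs). The primary CI is never delayed by
    the remote CI; the remote CI is stalled if both need the same accelerator.
    """
    if primary is None:
        return False, remote is not None
    if remote is None:
        return True, False
    conflict = bool(primary.accelerators & remote.accelerators)
    return True, not conflict


if __name__ == "__main__":
    # Slot assignments are invented for this example.
    sad_primary = CI("SAD", frozenset({0, 1}))
    dct_remote = CI("DCT", frozenset({2}))
    sad_remote = CI("SAD", frozenset({0}))
    print(arbitrate(sad_primary, dct_remote))  # disjoint slots: both run
    print(arbitrate(sad_primary, sad_remote))  # shared slot 0: remote stalls
```

Because the primary CI's result never depends on the remote request in this model, the timing of primary CIs can be reasoned about as if remote CIs did not exist, which mirrors the property the text highlights for mixed-criticality use.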
The demonstrator shows the H.264 encoder that was evaluated in the previous chapters, running on a GPP that
beneﬁts from acceleration via remote CIs using COREFAB. The last encoded frame and an on-screen display
showing status information are output to a screen using HDMI. i-Core provokes conﬂicts in this setup by running
a loop that only executes primary CIs (SAD and DCT, see Table 7.1). This demonstrator shows that despite these
conﬂicts, the H.264 encoder executes 2× to 3× faster (encoded frames per second) on the GPP using COREFAB
than without.
A.1.2 Accelerating a Finite Volume Tsunami Model using Reconfigurable Hardware in Invasive Computing
Within the Invasive Computing project, pipelined ﬂoating-point accelerators were created for i-Core, to enable
efﬁcient implementation of ﬂoating-point CIs [9]. We designed ﬂoating-point CIs to accelerate the Shallow Water
Equations application in X10 (SWE-X10) [80]. SWE-X10 is a proxy application for the computation of shallow
water waves; it implements a model that can be used to predict the propagation of a tsunami wave [79].
In a collaboration between virtually all subprojects of Invasive Computing, the i-Core-accelerated SWE-X10 application
was leveraged to create a demonstration setup of the full Invasive Computing stack as shown in Figs. A.2
and A.3. For this demonstrator, the InvasIC manycore architecture [50] was implemented on a proDesign proFPGA
(https://www.profpga.com) consisting of four Xilinx XC7V2000T FPGAs. In total, the prototype consists of 80
processor cores (16 tiles that are connected via a network on chip and contain 5 cores each). Four of these cores
are i-Cores. On top of the InvasIC architecture, the parallel operating system OctoPOS [74] provides hardware
abstractions and resource management, following the resource-aware invasive programming model (see Section 2.5.1).
Invasive X10 (X10i) [18] is a compiler and runtime system that enables high-level programming of invasive
applications based on the programming language X10 [21]. Finally, the SWE-X10 application is executed on top,
demonstrating the full invasive stack. For demonstration purposes, a live visualization was implemented. It is
executed on a separate computer and receives live simulation results from the InvasIC prototype over Ethernet.
Figure A.3 shows the demonstrated scenario, where a tsunami wave propagates from the south-west of Cyprus.

Figure A.1: Prototype demonstrating that accelerators can be efficiently shared between different cores within the same system on chip

Figure A.2: High-level overview of the Invasive Computing stack. Layers, from hardware to application: InvasIC Architecture (Heterogeneous Tiled Architecture); OctoPOS (Parallel Operating System); X10i, based on libFIRM (Invasive X10 Compiler and Runtime); SWE-X10, in ActorX10 (Application utilizing i-Core Acceleration). Legend: Application, Middleware, Hardware.

Figure A.3: Demonstrator that executes i-Core-accelerated SWE-X10 on the full Invasive Computing stack

List of Figures
2.1 Histogram of all execution times of a task. The WCET of a task is upper-bounded using static timing analysis
2.2 Example of constraint generation using the Implicit Path Enumeration Technique (IPET)
2.3 Overview of the evaluation platform – i-Core
2.4 CIs define computations as DFGs that can be scheduled with different amounts of accelerators, resulting in different latencies
2.5 States of an invasive application (following the description of [74])
2.6 Overview of an instance of the tile-based invasive manycore architecture. Details of an i-Core tile are shown
3.1 High-level overview of a fused CPU-GPU architecture with shared last level cache
3.2 Particle Filter benefits from a per-kernel scheduling decision compared to a fixed ratio for the whole benchmark when executed on OpenCL 2.0’s fine-grained SVM
3.3 Hierarchy of Work Items in an OpenCL Kernel
3.4 Simplified example of memory allocation in OpenCL 1.2 (left) and OpenCL 2.0 with fine-grained SVM (right)
3.5 Launching a kernel on a fused CPU-GPU architecture without host-side synchronization
3.6 Compared to the original OpenCL 1.2 implementation of the Rodinia Benchmarks Suite that executes on the GPU only and uses device-side buffers, the use of OpenCL 2.0 incl. fine-grained SVM introduces overheads but maintains consistency
3.7 For co-scheduling, multi-dimensional IDs are mapped to one-dimensional IDs
3.8 A global_work_state is shared between work items using fine-grained SVM to realize device-side scheduling
3.9 The device-side methods add a preamble and postamble to each kernel that implement the co-scheduling methods
3.10 A single work group executes in lock step (atomic counting). Multiple work groups execute in parallel
3.11 In atomic counting, work groups loop over the original kernel code until the total amount of work is done
3.12 The device-side enqueuing method enqueues additional work groups using device-side queues
3.13 At the first execution of a kernel k, host-side profiling determines a ratio rk to distribute work items
3.15 Speedup of the co-scheduling methods applied to Rodinia-SVM, on a fused CPU-GPU architecture with shared LLC. Results are relative to performing the optimal choice for each kernel of either executing on CPU or GPU (xor-Oracle is 100%)
3.14 Device-side enqueuing adds significant overhead, even when no kernel is enqueued. The overheads stem from the kernel call in the block syntax
3.16 Cache performance metrics (all levels, measured on CPU) when executing kernels in parallel on CPU and GPU relative to executing the same work item distribution sequentially (first on CPU, then on GPU; = 1 on y-axis)
4.1 Timelines of executing a Kernel using Software only, Stalling and Software Emulation
4.2 Control-flow graph that shows how one reconfiguration request can delay a following reconfiguration, thus impairing timing analysis
4.3 Example of how CoRQ is attached to a System on Chip to enable runtime reconfiguration under timing guarantees
4.4 High-level view of how CoRQ processes commands
4.5 Reconfiguration bandwidth measured by the CPU, revealing a high variance when using main memory
5.1 System on Chip with a reconfigurable processor
5.2 Software Emulation entails testing whether a specific reconfigurable CI is currently configured (available).
5.3 Sequences of kernels, e.g., in the H.264 Encoder, are well-suited for runtime reconfiguration, but raise new issues in timing analysis
5.4 IPET constraint generation for single contexts and multiple contexts after virtual unrolling
5.5 Different cases for execution times of kernel iterations. Executing all iterations in WCET does not necessarily bound the total WCET of the kernel, because the worst-case number of iterations in which CI1 is unavailable (u1) can be mispredicted (timing anomaly in (b)). For safe bounds, an additional iteration needs to be considered that assumes CI1 unavailable (like in (d))
5.6 CFG of a Kernel invoking a CI with Software Emulation
5.7 Overview of System on Chip used for Evaluation
5.8 Evaluation toolflow
5.9 Generated Constraints in aiT’s Format (AIS2)
5.10 Observed Runtimes and Guaranteed WCET Bounds for LoopFilter
5.11 H.264 overall overestimation without CI Invocations (cISA only) and different alternatives of invoking CIs. Software Emulation (always unavailable) introduces CI Invocations, but never executes them in hardware. Combination chooses either Software Emulation or Stalling per kernel to optimize the timing bound (see Section 5.4.3).
5.12 H.264 overall speedup on the guaranteed time bound (left) and the observed runtime (right)
5.13 Speedup of Software Emulation and Combination over Stalling in H.264
6.1 Toolflow performing WCET-Optimizing Instruction Set Selection integrated with timing analysis. As input to our approach we take application binary with suggestions where to place custom instructions as well as different implementation alternatives per custom instruction, differing in resource requirements and latency.
6.2 Simple example that shows how WCET optimization approaches that rely on Timing Schema perform suboptimal decisions
6.3 CI super block as part of a CFG
6.4 Simple example of how an instance of the problem formulated in Sections 6.2 and 6.3 is generated
6.5 Visualization of how pruning is applied and how generated tuples correspond to selection candidates for the input provided by the example in Fig. 6.4. For clarity, tuples that were pruned because a chosen ak did not correspond to a possible implementation of CI k are omitted.
6.6 Optimal Results for the EncodeMacroBlock Kernel of the H.264 Encoder and different Values of fCPU/ffabric, A as well as a Reconfiguration Bandwidth of 200 MB/s. Comparing Results considering Reconfiguration Delay during Selection and Results not considering it.
6.7 Optimal Results for the EncodeMacroBlock Kernel of the H.264 Encoder and different Values of fCPU/ffabric, A as well as Reconfiguration Bandwidth. Comparing Results utilizing Infeasible Path Information (I-MB Path marked infeasible) and not utilizing it.
6.8 Visualization of the runtime results of the optimization approaches when applied to the EncodeMacroBlock kernel (|CI| = 6)
6.9 Visualization of the speedup results of the optimization approaches when applied to the EncodeMacroBlock kernel (|CI| = 6)
6.10 The greedy heuristic is unaware of any “competing” worst-case paths (P-MB path here). Thus, the estimated profit on the current worst-case path (I-MB) can be higher than the actual WCET reduction as in this case that appears when optimizing EncodeMacroBlock at A = 5 (simplified)
7.1 The WCET of a task is upper-bounded using static timing analysis. At runtime, slack towards this upper bound becomes a resource that can be leveraged, e.g., for runtime reconfiguration of accelerators on an FPGA.
7.2 Visualization of execution without utilizing acceleration (top) and with configuring CIs before executing a kernel (bottom).
7.3 CFG of a kernel that configures and utilizes reconfigurable CIs in the stalling model. Not all utilized CIs need to be configured (constrained area), but can be executed in a functionally-equivalent software implementation.
7.4 CIs are selected to obtain the safe configuration. Each time a CI is selected, the WCET needs to be estimated again
7.5 In each kernel iteration it is decided whether to reconfigure based on the current runtime slack and slack thresholds (th→perf and th→safe)
7.6 Execution time of EncodeMacroBlock for safe configuration and different performance configurations obtained for different execution profiles
7.7 Execution time of EncodeMacroBlock for safe configuration and online optimization for different execution profiles and reconfiguration bandwidths (400 MB/s (top), 800 MB/s (bottom))
A.1 Prototype demonstrating that accelerators can be efficiently shared between different cores within the same system on chip
A.2 High-level overview of the Invasive Computing stack
A.3 Demonstrator that executes i-Core-accelerated SWE-X10 on the full Invasive Computing stack

List of Tables
3.1 Rodinia Benchmark Suite – OpenCL Benchmarks
4.1 CoRQ Commands with Cycles spent in EXE State
4.2 Resource Utilization
5.1 Kernels and Custom Instructions (CI) in the H.264 Encoder
5.2 Parameters investigated
5.3 CI Unavailability (uk) obtained during WCET bound estimation for LoopFilter
6.1 Kernels and Custom Instructions (CI) in the H.264 Application
6.2 Evaluation Results LoopFilter Kernel, |CI| = 1
6.3 Evaluation Results MotionEstimation Kernel, |CI| = 2
6.4 Evaluation Results EncodeMacroBlock Kernel, |CI| = 6
6.5 Evaluation Results EncodeMacroBlock Kernel, |CI| = 6 (continued)
7.1 CIs used in the H.264 Application

Bibliography
[1] Jaume Abella, Damien Hardy, Isabelle Puaut, Eduardo Quiñones, and Francisco J Cazorla. “On the com-
parison of deterministic and probabilistic WCET estimation techniques”. In: Real-Time Systems (ECRTS),
2014 26th Euromicro Conference on. IEEE. 2014, pp. 266–275.
[2] AbsInt. aiT Worst-Case Execution Time Analyzers. Website: http://www.absint.com/ait/. [Online;
accessed 31-Aug-2018]. 2018.
[3] Andreas Agne, Markus Happe, Andreas Keller, Enno Lübbers, Bernhard Plattner, Marco Platzner, and
Christian Plessl. “ReconOS: An operating system approach for reconﬁgurable computing”. In: IEEE Micro
34.1 (2014), pp. 60–71.
[4] Sebastian Altmeyer, Björn Lisper, Claire Maiza, Jan Reineke, and Christine Rochange. “WCET and
Mixed-Criticality: What does Conﬁdence in WCET Estimations Depend Upon?” In: OASIcs-OpenAccess
Series in Informatics. Vol. 47. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik. 2015.
[5] Aravindh Anantaraman, Kiran Seth, Kaustubh Patil, Eric Rotenberg, and Frank Mueller. “Virtual Simple
Architecture (VISA): Exceeding the Complexity Limit in Safe Real-time Systems”. In: SIGARCH Comput.
Archit. News 31.2 (May 2003), pp. 350–361. ISSN: 0163-5964. DOI: 10.1145/871656.859659.
[6] Philip Axer, Rolf Ernst, Heiko Falk, Alain Girault, Daniel Grund, Nan Guan, Bengt Jonsson, Peter Mar-
wedel, Jan Reineke, Christine Rochange, Maurice Sebastian, Reinhard Von Hanxleden, Reinhard Wilhelm,
and Wang Yi. “Building timing predictable embedded systems”. In: ACM Trans. on Embed. Comput. Syst.
13.4 (2014), 82:1–82:37.
[7] Clément Ballabriga, Hugues Cassé, Christine Rochange, and Pascal Sainrat. “OTAWA: An open toolbox
for adaptive WCET analysis”. In: SEUS. Springer, 2010, pp. 35–46.
[8] Rajkishore Barik, Naila Farooqui, Brian T Lewis, Chunling Hu, and Tatiana Shpeisman. “A black-box
approach to energy-aware scheduling on integrated CPU-GPU systems”. In: IEEE/ACM Int. Symp. on
Code Gen. and Opt. IEEE. 2016, pp. 70–81.
[9] Lars Bauer, Artjom Grudnitsky, Marvin Damschen, Srinivas Rao Kerekare, and Jörg Henkel. “Floating
point acceleration for stream processing applications in dynamically reconﬁgurable processors”. In: IEEE
Symp. on Embed. Syst. For Real-time Multimedia (ESTIMedia), Amsterdam, The Netherlands, October
8-9, 2015. 2015, pp. 1–2. DOI: 10.1109/ESTIMedia.2015.7351762.
[10] Lars Bauer and Jörg Henkel. Run-time adaptation for reconﬁgurable embedded processors. Springer Sci-
ence & Business Media, 2010.
[11] Lars Bauer, Muhammad Shaﬁque, and Jörg Henkel. “A computation-and communication-infrastructure
for modular special instructions in a dynamically reconﬁgurable processor”. In: Int. Conf. on Field Pro-
grammable Logic and Applications. IEEE. 2008, pp. 203–208.
[12] Lars Bauer, Muhammad Shaﬁque, Simon Kramer, and Jörg Henkel. “RISPP: rotating instruction set pro-
cessing platform”. In: Proc. of Design Automat. Conf. ACM. 2007, pp. 791–796.
[13] Khaled Benkrid, Ying Liu, and AbdSamad Benkrid. “A highly parameterized and efﬁcient FPGA-based
skeleton for pairwise biological sequence alignment”. In: IEEE Transactions on Very Large Scale Integra-
tion (VLSI) Systems 17.4 (2009), pp. 561–570.
[14] Guillem Bernat, Antoine Colin, and Stefan M Petters. “WCET analysis of probabilistic hard real-time
systems”. In: IEEE Real-Time Syst. Symp. IEEE. 2002, pp. 279–288.
[15] Margrit Betke, Esin Haritaoglu, and Larry S Davis. “Real-time multiple vehicle detection and tracking
from a moving vehicle”. In: Machine vision and applications 12.2 (2000), pp. 69–83.
[16] Alessandro Biondi, Alessio Balsini, Marco Pagani, Enrico Rossi, Mauro Marinoni, and Giorgio Buttazzo.
“A Framework for Supporting Real-Time Applications on Dynamic Reconﬁgurable FPGAs”. In: Proc. of
Real-Time Syst. Symp. IEEE. 2016, pp. 1–12.
[17] Armelle Bonenfant, Hugues Cassé, Marianne De Michiel, Jens Knoop, Laura Kovács, and Jakob Zwirch-
mayr. “FFX: A portable WCET annotation language”. In: Proceedings of the 20th International Conference
on Real-Time and Network Systems. ACM. 2012, pp. 91–100.
[18] Matthias Braun, Sebastian Buchwald, Manuel Mohr, and Andreas Zwinkau. Dynamic X10: Resource-
Aware Programming for Higher Efﬁciency. Tech. rep. 8. X10 ’14. Karlsruhe Institute of Technology, 2014.
[19] Alan Burns and Rob Davis. “Mixed criticality systems-a review”. In: Department of Computer Science,
University of York, Tech. Rep (2013).
[20] Giorgio C Buttazzo. Hard real-time computing systems: predictable scheduling algorithms and applica-
tions. Vol. 24. Springer Science & Business Media, 2011.
[21] Philippe Charles, Christian Grothoff, Vijay Saraswat, Christopher Donawa, Allan Kielstra, Kemal Ebcioglu,
Christoph Von Praun, and Vivek Sarkar. “X10: an object-oriented approach to non-uniform cluster computing”.
In: ACM SIGPLAN Notices. Vol. 40. 10. ACM. 2005, pp. 519–538.
[22] Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W Sheaffer, Sang-Ha Lee, and Kevin
Skadron. “Rodinia: A benchmark suite for heterogeneous computing”. In: IEEE Int. Symp. on Workl. Charact.
IEEE. 2009, pp. 44–54.
[23] Juan Antonio Clemente, Javier Resano, and Daniel Mozos. “An Approach to Manage Reconﬁgurations
and Reduce Area Cost in Hard Real-time Reconﬁgurable Systems”. In: ACM Trans. Embed. Comput. Syst.
13.4 (Mar. 2014), 90:1–90:24. ISSN: 1539-9087. DOI: 10.1145/2560037.
[24] Jason Cong, Mohammad Ali Ghodrat, Michael Gill, Beayna Grigorian, Hui Huang, and Glenn Reinman.
“Composable Accelerator-rich Microprocessor Enhanced for Adaptivity and Longevity”. In: Proc. of the
Int. Symp. on Low Power Electronics and Design. ISLPED ’13. Beijing, China: IEEE Press, 2013, pp. 305–
310. ISBN: 978-1-4799-1235-3.
[25] Patrick Cousot and Radhia Cousot. “Abstract interpretation: a uniﬁed lattice model for static analysis of
programs by construction or approximation of ﬁxpoints”. In: Proc. of the Symp. on Principles of program-
ming languages. ACM. 1977, pp. 238–252.
[26] Michael Dales. “Managing a reconﬁgurable processor in a general purpose workstation environment”.
In: Proc. of the Conf. on Design, Automation and Test in Europe. IEEE Computer Society. Mar. 2003,
p. 10980.
[27] Marvin Damschen, Lars Bauer, and Jörg Henkel. “Extending the WCET Problem to Optimize for Runtime-
Reconﬁgurable Processors”. In: ACM Trans. on Archit. and Code Optim. (TACO) 13.4 (2016), 45:1–45:24.
DOI: 10.1145/3014059.
[28] Marvin Damschen, Lars Bauer, and Jörg Henkel. “CoRQ: Enabling Runtime Reconﬁguration Under
WCET Guarantees for Real-Time Systems”. In: IEEE Embedded Systems Letters (ESL) 9.3 (2017), pp. 77–
80. DOI: 10.1109/LES.2017.2714844.
[29] Marvin Damschen, Lars Bauer, and Jörg Henkel. “Timing Analysis of Tasks on Runtime Reconﬁgurable
Processors”. In: IEEE Trans. on Very Large Scale Integration Syst. (TVLSI) 25.1 (2017), pp. 294–307. DOI:
10.1109/TVLSI.2016.2572304.
[30] Marvin Damschen, Frank Mueller, and Jörg Henkel. “Co-Scheduling on Fused CPU-GPU Architectures
with Shared Last Level Caches”. In: IEEE Trans. on Comput.-Aided Design of Integrated Circuits and Syst.
(TCAD) (2018). ESWEEK Special Issue, to appear. DOI: 10.1109/TCAD.2018.2857042.
[31] Marvin Damschen, Martin Rapp, Lars Bauer, and Jörg Henkel. “i-Core: A runtime-reconﬁgurable proces-
sor platform for cyber-physical systems”. In: Embedded, Cyber-Physical, and IoT Systems: Smart Cam-
eras, Hardware/Software Co-Design, and Multimedia — Essays Dedicated to Marilyn Wolf on the Occa-
sion of Her 60th Birthday. Ed. by S. S. Bhattacharyya, M. Potkonjak, and S. Velipasalar. to appear. Springer
International Publishing, 2019.
[32] Abhishek Das, David Nguyen, Joseph Zambreno, Gokhan Memik, and Alok Choudhary. “An FPGA-based
network intrusion detection architecture”. In: IEEE Transactions on Information Forensics and Security 3.1
(2008), pp. 118–132.
[33] Robert I Davis, Alan Burns, and David Grifﬁn. “On the Meaning of pWCET Distributions and their use in
Schedulability Analysis”. In: Proceedings 2017 Real-Time Scheduling Open Problems Seminar (RTSOPS).
2017.
[34] Christian De Schryver, Ivan Shcherbakov, Frank Kienle, Norbert Wehn, Henning Marxen, Anton Kostiuk,
and Ralf Korn. “An energy efﬁcient FPGA accelerator for monte carlo option pricing with the heston
model”. In: Reconﬁgurable Computing and FPGAs (ReConFig), 2011 International Conference on. IEEE.
2011, pp. 468–474.
[35] Christopher Dennl, Daniel Ziener, and Jürgen Teich. “On-the-fly composition of FPGA-based SQL query
accelerators using a partially reconﬁgurable module library”. In: Field-Programmable Custom Computing
Machines (FCCM), 2012 IEEE 20th Annual International Symposium on. IEEE. 2012, pp. 45–52.
[36] Florian Dittmann and Stefan Frank. “Hard real-time reconﬁguration port scheduling”. In: Proc. of the Conf.
on Design, Automation and Test in Europe. IEEE. 2007, pp. 123–128.
[37] Stephen A Edwards and Edward A Lee. “The case for the precision timed (PRET) machine”. In: Proc. of
Design Automat. Conf. ACM. 2007, pp. 264–265.
[38] Andreas Ermedahl, Friedhelm Stappert, and Jakob Engblom. “Clustered worst-case execution-time calcu-
lation”. In: IEEE Trans. on Computers 54.9 (2005), pp. 1104–1122.
[39] Heiko Falk and Jan C Kleinsorge. “Optimal static WCET-aware scratchpad allocation of program code”.
In: Proc. of Design Automat. Conf. ACM. 2009, pp. 732–737.
[40] Heiko Falk, Sascha Plazar, and Henrik Theiling. “Compile-time decided instruction cache locking using
worst-case execution paths”. In: Proc. of Int. Conf. on Hardware/software codesign and syst. synthesis.
ACM. 2007, pp. 143–148.
[41] William Feller. An introduction to probability theory and its applications. Vol. 2. John Wiley & Sons,
2008.
[42] Carlo Galuzzi and Koen Bertels. “The instruction-set extension problem: A survey”. In: ACM Trans. on
Reconﬁg. Technol. and Syst. 4.2 (2011), p. 18.
[43] Robert S Garﬁnkel, George L Nemhauser, et al. Integer programming. Vol. 4. Wiley New York, 1972.
[44] Khronos OpenCL Working Group et al. “The OpenCL specification version 2.0”. In:
https://www.khronos.org/registry/OpenCL/specs/opencl-2.0.pdf (2015).
[45] Artjom Grudnitsky, Lars Bauer, and Jörg Henkel. “COREFAB: concurrent reconﬁgurable fabric utilization
in heterogeneous multi-core systems”. In: Proc. of Int. Conf. on Compilers, Architecture and Synthesis for
Embed. Syst. ACM. 2014, p. 5.
[46] Tanja Harbaum, Christoph Schade, Marvin Damschen, Carsten Tradowsky, Lars Bauer, Jörg Henkel, and
Jürgen Becker. “Auto-SI: An adaptive reconﬁgurable processor with run-time loop detection and accel-
eration”. In: IEEE Intl. System-on-Chip Conf., (SOCC), Munich, Germany, September 5-8, 2017. 2017,
pp. 153–158. DOI: 10.1109/SOCC.2017.8226027.
[47] Scott Hauck, Thomas W Fry, Matthew M Hosler, and Jeffrey P Kao. “The Chimaera reconﬁgurable func-
tional unit”. In: IEEE Trans. on Very Large Scale Integration Syst. 12.2 (2004), pp. 206–217.
[48] John R Hauser and John Wawrzynek. “Garp: A MIPS processor with a reconﬁgurable coprocessor”. In:
Proc. of Symp. on Field-Programm. Custom Comput. Machines. IEEE. 1997, pp. 12–21.
[49] Jörg Henkel, Lars Bauer, Joachim Becker, Oliver Bringmann, Uwe Brinkschulte, Samarjit Chakraborty,
Michael Engel, Rolf Ernst, Hermann Härtig, Lars Hedrich, et al. “Design and architectures for dependable
embedded systems”. In: Proceedings of the seventh IEEE/ACM/IFIP international conference on Hard-
ware/software codesign and system synthesis. ACM. 2011, pp. 69–78.
[50] Jörg Henkel, Andreas Herkersdorf, Lars Bauer, Thomas Wild, Michael Hübner, Ravi Kumar Pujari, Artjom
Grudnitsky, Jan Heisswolf, Aurang Zaib, Benjamin Vogel, et al. “Invasive Manycore Architectures”. In:
17th Asia and South Paciﬁc Design Automation Conference (ASP-DAC). 2012, pp. 193–200.
[51] Silvia Heubach and Touﬁk Mansour. “Compositions of n with parts in a set”. In: Congressus Numerantium
168 (2004), p. 127.
[52] Trang Hoang et al. “An efﬁcient FPGA implementation of the Advanced Encryption Standard algorithm”.
In: Computing and Communication Technologies, Research, Innovation, and Vision for the Future (RIVF),
2012 IEEE RIVF International Conference on. IEEE. 2012, pp. 1–4.
[53] Dominik Honegger, Helen Oleynikova, and Marc Pollefeys. “Real-time and low latency embedded com-
puter vision hardware based on a combination of FPGA and mobile CPU”. In: Intelligent Robots and
Systems (IROS 2014), 2014 IEEE/RSJ International Conference on. IEEE. 2014, pp. 4930–4935.
[54] Huynh Phung Huynh and Tulika Mitra. “Runtime reconﬁguration of custom instructions for real-time em-
bedded systems”. In: Proc. of the Conf. on Design, Automation and Test in Europe. IEEE. 2009, pp. 1536–
1541.
[55] Robert Ioffe, Sonal Sharma, and Michael Stoner. “Achieving performance with OpenCL 2.0 on Intel®
processor graphics”. In: Proc. of Int. Works. on OpenCL. ACM. 2015, p. 3.
[56] Stephen Junkins. “The Compute Architecture of Intel® Processor Graphics Gen9”. In: Intel whitepaper v1
(2015).
[57] David R Kaeli, Perhaad Mistry, Dana Schaa, and Dong Ping Zhang. Heterogeneous Computing with
OpenCL 2.0. Morgan Kaufmann, 2015.
[58] Rashid Kaleem, Rajkishore Barik, Tatiana Shpeisman, Brian T Lewis, Chunling Hu, and Keshav Pingali.
“Adaptive heterogeneous scheduling for integrated GPUs”. In: Proc. of the Int. Conf. on Par. Arch. and
Comp. ACM. 2014, pp. 151–162.
[59] Raimund Kirner, Jens Knoop, Adrian Prantl, Markus Schordan, and Albrecht Kadlec. “Beyond loop
bounds: comparing annotation languages for worst-case execution time analysis”. In: Software & Systems
Modeling 10.3 (2011), pp. 411–437.
[60] Dirk Koch, Frank Hannig, and Daniel Ziener. FPGAs for Software Programmers. 1st. Springer Publishing
Company, Incorporated, 2016. ISBN: 3319264060, 9783319264066.
[61] Chris Lattner. “LLVM and Clang: Advancing Compiler Technology”. In: Proc. of FOSDEM (2011).
[62] Young Choon Lee and Albert Y Zomaya. “On Effective Slack Reclamation in Task Scheduling for Energy
Reduction.” In: JIPS 5.4 (2009), pp. 175–186.
[63] Xianfeng Li, Abhik Roychoudhury, and Tulika Mitra. “Modeling out-of-order processors for WCET anal-
ysis”. In: Real-Time Systems 34.3 (2006), pp. 195–227.
[64] Yau-Tsun Steven Li, Sharad Malik, and Andrew Wolfe. “Efﬁcient microarchitecture modeling and path
analysis for real-time software”. In: Proc. of Real-Time Syst. Symp. IEEE. 1995, pp. 298–307.
[65] Jianqiao Liu, Nikhil Hegde, and Milind Kulkarni. “Hybrid CPU-GPU scheduling and execution of tree
traversals”. In: Proc. of the Int. Conf. on Supercomp. ACM. 2016, p. 2.
[66] Tiantian Liu, Minming Li, and Chun Jason Xue. “Minimizing WCET for real-time embedded systems via
static instruction cache locking”. In: Real-Time and Embed. Technol. and Applications Symp. IEEE. 2009,
pp. 35–44.
[67] Daniel Lo, Mohamed Ismail, Tao Chen, and G Edward Suh. “Slack-Aware Opportunistic Monitoring for
Real-Time Systems”. In: Real Time and Embed. Technol. and Applications Symp. IEEE. 2014, pp. 203–
214.
[68] John W Lockwood, Nick McKeown, Greg Watson, Glen Gibb, Paul Hartke, Jad Naous, Ramanan Raghu-
raman, and Jianying Luo. “NetFPGA–an open platform for gigabit-rate network switching and routing”.
In: Microelectronic Systems Education, 2007. MSE’07. IEEE International Conference on. IEEE. 2007,
pp. 160–161.
[69] Thomas Lundqvist and Per Stenström. “Timing anomalies in dynamically scheduled microprocessors”. In:
Proc. of Real-time Syst. Symp. IEEE. 1999, pp. 12–21.
[70] Arno Luppold, Benjamin Menhorn, Heiko Falk, and Frank Slomka. “A new concept for system-level design
of runtime reconfigurable real-time systems”. In: ACM SIGBED Rev. 10.4 (2013), pp. 57–60.
[71] Florian Martin, Martin Alt, Reinhard Wilhelm, and Christian Ferdinand. “Analysis of loops”. In: Compiler
Construction. Springer. 1998, pp. 80–94.
[72] Tulika Mitra, Jürgen Teich, and Lothar Thiele. “Time-Critical Systems Design: A Survey”. In: IEEE Design
& Test 35.2 (2018), pp. 8–26.
[73] Saoni Mukherjee, Yifan Sun, Paul Blinzer, Amir Kavyan Ziabari, and David Kaeli. “A comprehensive
performance analysis of HSA and OpenCL 2.0”. In: IEEE Int. Symp. on Perf. Analy. of Syst. and Soft.
IEEE. 2016, pp. 183–193.
[74] Benjamin Oechslein, Jens Schedel, Jürgen Kleinöder, Lars Bauer, Jörg Henkel, Daniel Lohmann, and
Wolfgang Schröder-Preikschat. “OctoPOS: A parallel operating system for invasive computing”. In: Pro-
ceedings of the International Workshop on Systems for Future Multi-Core Architectures (SFMA). EuroSys.
2011, pp. 9–14.
[75] Prasanna Pandit and R Govindarajan. “Fluidic kernels: Cooperative execution of OpenCL programs on multiple
heterogeneous devices”. In: IEEE/ACM Int. Symp. on Code Gen. and Opt. ACM. 2014, p. 273.
[76] Chang Yun Park and Alan C Shaw. “Experiments with a program timing tool based on source-level timing
schema”. In: Proc. of Real-time Syst. Symp. IEEE. 1990, pp. 72–81.
[77] Luca Pezzarossa, Martin Schoeberl, and Jens Sparsø. “A Controller for Dynamic Partial Reconfiguration
in FPGA-Based Real-Time Systems”. In: Real-Time Distributed Computing (ISORC), 2017 IEEE 20th
International Symposium on. IEEE. 2017, pp. 92–100.
[78] Sascha Plazar, Jan C. Kleinsorge, Peter Marwedel, and Heiko Falk. “WCET-aware Static Locking of In-
struction Caches”. In: Proceedings of the Tenth International Symposium on Code Generation and Opti-
mization. ACM. 2012, pp. 44–52.
[79] Alexander Pöppl, Michael Bader, Tobias Schwarzer, and Michael Glaß. “SWE-X10: Simulating shallow
water waves with lazy activation of patches using ActorX10”. In: Proceedings of the Second International
Workshop on Extreme Scale Programming Models and Middleware. IEEE Press. 2016, pp. 32–39.
[80] Alexander Pöppl, Marvin Damschen, Florian Schmaus, Andreas Fried, Manuel Mohr, Matthias Blankertz,
Lars Bauer, Jörg Henkel, Wolfgang Schröder-Preikschat, and Michael Bader. “Shallow Water Waves on
a Deep Technology Stack: Accelerating a Finite Volume Tsunami Model Using Reconfigurable Hardware
in Invasive Computing”. In: Workshop on UnConventional High Performance Computing (UCHPC),
Santiago de Compostela, Spain, August 28-29, 2017, Revised Selected Papers. 2017, pp. 676–687. DOI:
10.1007/978-3-319-75178-8_54.
[81] Jason Power, Arkaprava Basu, Junli Gu, Sooraj Puthoor, Bradford M Beckmann, Mark D Hill, Steven K
Reinhardt, and David A Wood. “Heterogeneous system coherence for integrated CPU-GPU systems”. In:
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture. ACM. 2013,
pp. 457–467.
[82] Jason Power, Joel Hestness, Marc S Orr, Mark D Hill, and David A Wood. “gem5-gpu: A Heterogeneous
CPU-GPU Simulator”. In: IEEE Comp. Arch. Letters 14.1 (2015), pp. 34–36.
[83] Peter Puschner and Ch Koza. “Calculating the maximum execution time of real-time programs”. In: Real-
Time Syst. 1.2 (1989), pp. 159–176.
[84] Andrew Putnam, Adrian M Caulfield, Eric S Chung, Derek Chiou, Kypros Constantinides, John Demme,
Hadi Esmaeilzadeh, Jeremy Fowers, Gopi Prashanth Gopal, Jan Gray, et al. “A reconfigurable fabric for
accelerating large-scale datacenter services”. In: ACM SIGARCH Computer Architecture News 42.3 (2014),
pp. 13–24.
[85] Gurulingesh Raravi. “The Journey Towards Reconciling Performance and Predictability”. In: CODES+ISSS:
Special Session - Future Automotive Systems Design: Research Challenges and Opportunities. 2018.
[86] Jan Reineke, Björn Wachter, Stephan Thesing, Reinhard Wilhelm, Ilia Polian, Jochen Eisinger, and Bernd
Becker. “A Definition and Classification of Timing Anomalies”. In: WCET 4 (2006).
[87] Christine Rochange and Pascal Sainrat. “A context-parameterized model for static analysis of execution
times”. In: Trans. on High-Performance Embed. Architect. and Compilers. Springer, 2009, pp. 222–241.
[88] Enrico Rossi, Marvin Damschen, Lars Bauer, Giorgio Buttazzo, and Jörg Henkel. “Preemption of the Partial
Reconfiguration Process to Enable Real-Time Computing with FPGAs”. In: ACM Trans. on Reconfig.
Technol. and Syst. (TRETS) 11.2 (2018). to appear. DOI: 10.1145/3182183.
[89] Sangeet Saha, Arnab Sarkar, and Amlan Chakrabarti. “Scheduling dynamic hard real-time task sets on
fully and partially reconfigurable platforms”. In: IEEE Embedded Systems Letters 7.1 (2015), pp. 23–26.
[90] Martin Schoeberl. “Time-predictable computer architecture”. In: EURASIP Journal on Embed. Syst. 2009
(2009), p. 2.
[91] Kiran Seth, Aravindh Anantaraman, Frank Mueller, and Eric Rotenberg. “FAST: Frequency-aware Static
Timing Analysis”. In: ACM Trans. Embed. Comput. Syst. 5.1 (Feb. 2006), pp. 200–224. ISSN: 1539-9087.
DOI: 10.1145/1132357.1132364.
[92] Christoph Steiger, Herbert Walder, and Marco Platzner. “Operating systems for reconfigurable embedded
platforms: Online scheduling of real-time tasks”. In: IEEE Trans. on Computers 53.11 (2004), pp. 1393–
1407.
[93] Christoph Steiger, Herbert Walder, Marco Platzner, and Lothar Thiele. “Online scheduling and placement
of real-time tasks to partially reconfigurable devices”. In: Proc. of Real-Time Syst. Symp. IEEE. 2003,
pp. 224–235.
[94] Vivy Suhendra, Tulika Mitra, Abhik Roychoudhury, and Ting Chen. “WCET Centric Data Allocation to
Scratchpad Memory”. In: Proceedings of the 26th IEEE International Real-Time Syst. Symposium. RTSS
’05. Washington, DC, USA: IEEE Computer Society, 2005, pp. 223–232. ISBN: 0-7695-2490-7. DOI:
10.1109/RTSS.2005.45.
[95] Jürgen Teich, Jörg Henkel, Andreas Herkersdorf, Doris Schmitt-Landsiedel, Wolfgang Schröder-Preikschat,
and Gregor Snelting. “Invasive computing: An overview”. In: Multiprocessor System-on-Chip. Springer,
2011, pp. 241–268.
[96] Russell Tessier and Wayne Burleson. “Reconfigurable computing for digital signal processing: A survey”.
In: Journal of VLSI signal processing systems for signal, image and video technology 28.1-2 (2001), pp. 7–
27.
[97] Russell Tessier, Kenneth Pocek, and Andre DeHon. “Reconfigurable Computing Architectures”. In:
Proceedings of the IEEE 103.3 (2015), pp. 332–354.
[98] Henrik Theiling, Christian Ferdinand, and Reinhard Wilhelm. “Fast and precise WCET prediction by sep-
arated cache and path analyses”. In: Real-Time Syst. 18.2-3 (2000), pp. 157–179.
[99] Lothar Thiele and Reinhard Wilhelm. “Design for timing predictability”. In: Real-Time Syst. 28.2-3 (2004),
pp. 157–177.
[100] Sascha Uhrig, Stefan Maier, Georgi Kuzmanov, and Theo Ungerer. “Coupling of a reconfigurable architecture
and a multithreaded processor core with integrated real-time scheduling”. In: Proc. of Int. Symp.
Parallel and Distributed Processing. IEEE. 2006, 4 pp.
[101] Sascha Uhrig, Stefan Maier, and Theo Ungerer. “Toward a processor core for real-time capable autonomic
systems”. In: Proc. of Int. Symp. Signal Processing and Information Technology. IEEE. 2005, pp. 19–22.
[102] Stamatis Vassiliadis, Stephan Wong, Georgi Gaydadjiev, Koen Bertels, Georgi Kuzmanov, and Elena
Moscu Panainte. “The MOLEN Polymorphic Processor”. In: IEEE Trans. Comput. 53.11 (Nov. 2004),
pp. 1363–1375. ISSN: 0018-9340. DOI: 10.1109/TC.2004.104.
[103] Li Wang, Ren-Wei Tsai, Shao-Chung Wang, Kun-Chih Chen, Po-Han Wang, Hsiang-Yun Cheng, Yi-Chung
Lee, Sheng-Jie Shu, Chun-Chieh Yang, Min-Yih Hsu, et al. “Analyzing OpenCL 2.0 workloads using a
heterogeneous CPU-GPU simulator”. In: IEEE Int. Symp. on Perf. Analy. of Syst. and Soft. IEEE. 2017,
pp. 127–128.
[104] Jack Whitham and Neil Audsley. “MCGREP–A Predictable Architecture for Embedded Real-Time Sys-
tems”. In: Proc. of Real-Time Syst. Symp. IEEE. 2006, pp. 13–24.
[105] Stefan Wildermann, Michael Bader, Lars Bauer, Marvin Damschen, Dirk Gabriel, Michael Gerndt, Michael
Glaß, Jörg Henkel, Johny Paul, Alexander Pöppl, Sascha Roloff, Tobias Schwarzer, Gregor Snelting,
Walter Stechele, Jürgen Teich, Andreas Weichslgartner, and Andreas Zwinkau. “Invasive computing for
timing-predictable stream processing on MPSoCs”. In: it - Information Technology 58.6 (2016), pp. 267–
280. DOI: 10.1515/itit-2016-0021.
[106] Reinhard Wilhelm, Jakob Engblom, Andreas Ermedahl, Niklas Holsti, Stephan Thesing, David Whalley,
Guillem Bernat, Christian Ferdinand, Reinhold Heckmann, Tulika Mitra, Frank Mueller, Isabelle Puaut,
Peter Puschner, Jan Staschulat, and Per Stenström. “The Worst-case Execution-time Problem—Overview
of Methods and Survey of Tools”. In: ACM Trans. Embed. Comput. Syst. 7.3 (May 2008), 36:1–36:53.
ISSN: 1539-9087. DOI: 10.1145/1347375.1347389.
[107] Reinhard Wilhelm, Daniel Grund, Jan Reineke, Marc Schlickling, Markus Pister, and Christian Ferdinand.
“Memory hierarchies, pipelines, and buses for future architectures in time-critical embedded systems”. In:
Trans. on Comput.-Aided Design of Integrated Circuits and Syst. 28.7 (2009), pp. 966–978.
[108] Ralph D Wittig and Paul Chow. “OneChip: An FPGA processor with reconfigurable logic”. In: Proc. of
Int. Symp. FPGAs for Custom Comput. Machines. IEEE. 1996, pp. 126–135.
[109] Ming Yang, Ying Wu, James Crenshaw, Bruce Augustine, and Russell Mareachen. “Face detection for
automatic exposure control in handheld camera”. In: Fourth IEEE International Conference on Computer
Vision Systems (ICVS’06). IEEE. 2006, p. 17.
[110] Yi Yang, Ping Xiang, Mike Mantor, and Huiyang Zhou. “CPU-assisted GPGPU on fused CPU-GPU archi-
tectures”. In: Int. Symp. on High Perf. Comp. Arch. IEEE. 2012, pp. 1–12.
[111] Pan Yu and Tulika Mitra. “Scalable custom instructions identification for instruction-set extensible
processors”. In: Proc. of Int. Conf. on Compilers, Architecture and Synthesis for Embed. Syst. ACM. 2004,
pp. 69–78.
[112] Pan Yu and Tulika Mitra. “Satisfying real-time constraints with custom instructions”. In: Int. Conf. on
Hardware/Software Codesign and Syst. Synthesis. IEEE. 2005, pp. 166–171.
[113] Feng Zhang, Bo Wu, Jidong Zhai, Bingsheng He, and Wenguang Chen. “FinePar: Irregularity-aware
fine-grained workload partitioning on integrated architectures”. In: IEEE/ACM Int. Symp. on Code Gen. and
Opt. IEEE. 2017, pp. 27–38.
[114] Feng Zhang, Jidong Zhai, Bingsheng He, Shuhao Zhang, and Wenguang Chen. “Understanding co-running
behaviors on integrated CPU/GPU architectures”. In: IEEE Trans. on Par. and Dist. Syst. 28.3 (2017),
pp. 905–918.
