Improving Model-Based Software Synthesis: A Focus on Mathematical Structures by Goens Jokisch, Andres Wilhelm
IMPROVING MODEL-BASED SOFTWARE SYNTHESIS
A Focus on Mathematical Structures
Dissertation
for the attainment of the academic degree





Andrés Wilhelm Goens Jokisch
born on 23 September 1989 in San Salvador
Reviewers:
Prof. Dr.-Ing. Jeronimo Castrillon
Technische Universität Dresden







Dedicated to all who are unjustly oppressed only for being born with a specific
sex, race or species.

PREAMBLE




Computer hardware keeps increasing in complexity. Software design
needs to keep up with this. The right models and abstractions empower
developers to leverage the novelties of modern hardware. This thesis
deals primarily with Models of Computation, as a basis for software de-
sign, in a family of methods called software synthesis.
We focus on Kahn Process Networks and dataflow applications as ab-
stractions, both for programming and for deriving an efficient execution
on heterogeneous multicores. The latter we accomplish by exploring the
design space of possible mappings of computation and data to hardware
resources. Mapping algorithms are not at the center of this thesis, how-
ever. Instead, we examine the mathematical structure of the mapping
space, leveraging its inherent symmetries or geometric properties to im-
prove mapping methods in general.
This thesis thoroughly explores the process of model-based design,
aiming to go beyond the more established software synthesis on dataflow
applications. We start with the problem of assessing these methods
through benchmarking, and go on to formally examine the general goals
of benchmarks. In this context, we also consider the role modern machine
learning methods play in benchmarking.
We explore different established semantics, stretching the limits of
Kahn Process Networks. We also discuss novel models, like Reactors,
which are designed to be a deterministic, adaptive model with time as
a first-class citizen. By investigating abstractions and transformations in
the Ohua language for implicit dataflow programming, we also focus on
programmability.
The focus of the thesis is on the models and methods, but we evaluate
them in diverse use-cases, generally centered around Cyber-Physical Sys-
tems. These include the 5G telecommunication standard, automotive and
signal processing domains. We even go beyond embedded systems and





Some contents of this thesis, including ideas and some figures, have been
published previously.




First and foremost, I thank my advisor Jeronimo Castrillon. I consider him
to have been both a mentor and a friend during the time I’ve spent work-
ing on this thesis. His advice shaped my research and this thesis would
not exist without his guidance and help.
I also want to thank my current and former colleagues and co-authors
at the chair for compiler construction: Justus Adam, Hasna Bouraoui,
Alexander Brauckmann, Sebastian Ertel, Fazal Hameed, Gerald Hempel,
Sven Karol, Asif Khan, Robert Khasanov, Nesrine Khouzami, Christian
Menard, Norman Rink, Julian Robledo, Lars Schütze and Felix Wittwer.
Thank you for creating a great environment to learn and work together,
for countless discussions and insights, for your patience with my
insistence on going to Zeltmensa and the great discussions that arose there,
and for offering me camaraderie and friendship.
I want to thank everyone who worked with me as a student, help-
ing me realize my research vision, from whom I’ve also learned a great
deal, and some of whom have become colleagues in the meantime. Con-
cretely, thank you, Alexander Brauckmann, Sebastian Krammer, Chris-
tian Menard, Timo Nicolai, Marcus Rossel, Alexander Thierfelder, Felix
Teweleitt and Markus Walter.
Thanks to Silexica for letting me work with their product, which started
as a spinoff of Multi-Processor System-on-Chip (MPSoC) Application Pro-
gramming Studio (MAPS). Special thanks go to Luis Murillo for the patience
of reading through all my papers related to Silexica and also being a
source of inspiration in this collaboration. Mostly however, I want to thank
Max Odendahl, for trusting in my abilities while knowing me only on a per-
sonal level, and introducing me to the field. Without him and Aufwärts
Aachen, I would not be where I am today, thank you!
During my Ph.D. I had the opportunity to visit Andy Pimentel at the
University of Amsterdam, where I was warmly received by him, Simon Polstra
and the rest of the group. Thank you for welcoming me and for a fruitful
collaboration. I’d also like to thank the HiPEAC project for funding this
visit through a collaboration grant. I also had the opportunity to visit
Edward Lee at the University of California at Berkeley. There, Matt Weber
and Gil Lederman received me in their office, where I felt very welcome,
like any other colleague. I want to thank both, as well as Marten Lohstroh,
all of whom I had great discussions with, and who made my visit at Berke-
ley extremely fruitful. A special thank you also goes to Mary Stewart for
helping sort out everything there, even to the point of making sure I had
something to eat at the group lunches. Most of all, I would like to thank
Edward Lee for accepting me to visit his group and taking the time to talk
with me regularly. This visit was a pivotal point in my Ph.D. and I really ap-
preciated everything and everyone. Outside the academic realm, I want to
thank Giulia Leggett for making this visit extremely enriching also from a
personal point of view. I also want to thank the German Academic Exchange
Service (DAAD) and specifically the FIT Weltweit project, as well as the Center
for Advancing Electronics Dresden (cfaed) cluster of excellence, for helping
me finance this visit.
I also want to thank the rest of my co-authors. Collaboration with them
made this thesis possible. Thanks to Chris Cummins and Hugh Leather for
being so open in our collaboration and for their hospitality in Edinburgh.
Thanks to everyone in the cfaed Orchestration path for sharing a vision with
me and for constructive retreats. I also thank Marcus Hähnel and Till Smejkal
for a very successful collaboration that started the TETRiS project. Thanks to
Josefine Asmus and Ivo Sbalzarini for collaboration on the work on design
centering, which was very insightful. I also thank Sergio Siccha, for taking
our friendly discussions so seriously that we ended up collaborating in
the mapping symmetries work. Thanks also to Robert Wittig for a beach-
side discussion at Samos that led to a collaboration on the model-based
approaches to 5G. I thank Arka Maity, Nishant Budhev and Tulika Mitra
for sharing their LTE traces with us.
I started this Ph.D. at the cfaed cluster of excellence, which provided
funding and a great academic environment. I want to thank everyone at
the program office for helping me throughout this time, as well as my the-
sis advisory committee, Jeronimo Castrillon, Christel Baier and Hermann
Härtig. I also want to thank Conny Okuma for her patience throughout
the years with my incomplete forms and late handing over of documents.
Thanks as well to the German Research Foundation (DFG) for funding
me after cfaed.
The final phase of my Ph.D. was mainly funded by the Studienstiftung
des deutschen Volkes. Besides financial support, it also provided me with
an excellent offer of complementary intellectual opportunities. Thank you
for this opportunity, and thanks to Maike Lieser for helping me apply for
this scholarship; I am sure I would not have received it without her help. I
also want to thank her for everything else, as she was probably the biggest
positive influence in my life during the time of my Ph.D.
I want to thank everyone at TEDxDresden and everyone from animal
rights activism for giving me meaningful projects to do with my life
besides my research. Thanks also to everyone at Bodyworks and Basketball Club
Dresden for giving me a constant outlet to find a healthy balance with
sports.
Finally, and most importantly, I want to thank my friends and family for
being there for me and constantly reminding me of all the important and
enjoyable aspects of life besides academics. To all my friends in and around
Dresden, who accompanied me through life these past six years, thank
you for making this one of the best times of my life. To my friends back
in Aachen, San Salvador and spread throughout the rest of the world,
thanks for being a constant source of love and friendship that has kept
me grounded. I won’t list everyone who has made my life better these last
six years and whom I consider a friend; I’m sure they know who they are,
and I thank each and every one.
I would certainly not be who I am, and this thesis would not be possi-
ble, without the tremendous support from my family. My cousins, uncles
and (great) aunts, my two big sisters, thank you everyone for always being
there for me. Especially my little sister Ute, who’s accompanied me a large
part of my time here in Dresden, being a constant source of support and
inspiration. My father made me be curious and think critically since I was
a kid, and coupled this inspiration with unconditional love, which I am
certain was indispensable for me to write this thesis. My mother made me
be social and empathic, and made sure I became a well-rounded person.
Her constant support and openness made me always do what interested
me, and I am certain this thesis would never have happened without her.
Thank all of you for everything!
Andrés Goens, January 2021
CONTENTS
Preamble
1 Introduction
1.1 The Multicore Era
1.2 Programming Multicores
1.3 Software Synthesis
1.3.1 Problems
1.4 Contribution
1.4.1 A Note on Originality
2 Mapping KPNs to Heterogeneous MPSoCs
2.1 Kahn Process Networks
2.2 Execution Traces
2.3 Architecture Models
2.4 The Mapping Problem
2.5 Simulating Mappings
2.5.1 Simulating the Execution of Kahn Process Networks
2.6 Software Synthesis Flows
2.6.1 The MAPS flow
2.7 The mocasin tool
3 Benchmarking
3.1 Representative Benchmarks
3.1.1 Sample use cases
3.2 KPN Benchmarks
3.2.1 CPN Benchmarks
3.2.2 The E3S Benchmarks
3.3 Random Benchmarks and Level Graphs
3.4 Machine Learning for Benchmarking
3.4.1 Generative models
3.4.2 Potential Problems
3.4.3 Models of Code
4 Mathematical Structures in Mappings
4.1 Symmetries
4.1.1 Architectures and Applications
4.1.2 Mappings
4.1.3 Calculating Symmetries
4.1.4 Partial Symmetries
4.2 Metric Spaces
4.2.1 Architectures
4.2.2 Mappings
4.2.3 Low-distortion Embeddings
4.3 Representations
5 Applications of Mathematical Structures in Mappings
5.1 Compact Mappings
5.2 Robust Mappings
5.3 Design Space Exploration
5.3.1 Heuristics and Metaheuristics
5.3.2 Leveraging Symmetries
5.3.3 Leveraging Metric Spaces
5.4 A Vision of IoT Mappings
5.5 Run-time applications: TETRiS
6 Beyond KPN: Models of Computation
6.1 An overview of Models of Computation
6.1.1 Partial Computation: Scott Domains
6.1.2 Concurrent Computation
6.1.3 Dataflow Models of Computation
6.2 The MacQueen Gap
6.2.1 The MacQueen Gap
6.2.2 Exploiting the Gap
6.3 Reactors
6.3.1 Applications in 5G
7 Programming Languages
7.1 Freedom from Choice
7.1.1 Dataflow, Actors and Discrete Events
7.1.2 Implicit Dataflow
7.1.3 Stateful Functions
7.2 Stateful Parallelism
7.3 Concise code and Efficient I/O
8 Related Work
8.1 Dataflow-based Software Synthesis
8.2 Mapping Space Structures
8.2.1 Symmetries
8.2.2 Distances
8.3 Run-time and hybrid approaches
8.4 Other model-based design tools
8.5 Random Benchmark Generation and Machine Learning
9 Conclusions
9.1 Future Work
A Mathematical Supplement
A.1 Groups
A.2 Metric Spaces and Low-Distortion Embeddings
1 INTRODUCTION
Programming computers is notoriously difficult. Indeed, people learning
to program usually struggle, paradoxically, with the fact that the com-
puter does precisely what they tell it to do. This is confusing, not because
a computer program is executed faithfully, but rather, because humans
think at a very different level of abstraction.
It is certainly true that instructions in computer architectures are at a
completely different level of abstraction than the instructions we give
each other. However, most programs are also not written at the level
of the architecture. Programming languages are designed with increas-
ingly improving abstractions, making it easier for programmers to ex-
press themselves. Complementary to these efforts are compilers, which
serve as a bridge between the levels of abstraction. Ideally, a compiler
translates the abstract human-level expressions into efficient machine-level
instructions. While we have made significant progress, this task has proven
to be dauntingly difficult.
Traditionally, we have put the research and effort into optimizing the
execution of a single core. Most of the progress of decades of research in
programming languages and compilers revolves around this single-core
model. In the last decade or two, however, with the multicore era, this
challenge has increased dramatically. Now we have to use and coordinate
multiple cores, commonly with different capabilities. The widespread pro-
gramming language abstractions and compiler analyses of today are ill-
suited to tackle this challenge.
There is probably no universal solution to these emerging problems, as
different domains have different requirements. This thesis thus focuses
mostly on a particular domain, that of Cyber-Physical Systems (CPSs) or,
more generally, embedded systems. In this domain, a family of methods called
software synthesis seeks to enable efficient programming of complex
multicore systems. Central to these methods is a focus on using models for
describing computation. We follow the idea of letting theory inform practice,
in striving to improve methods of software synthesis. We do this by identi-
fying and exploiting mathematical structures in the problems in software
synthesis.
1.1 The Multicore Era
On the hardware side, the last two decades have firmly established what
we call the multicore era. Modern computing systems are almost univer-
sally composed of multiple logical cores, and there is a clear trend of in-
creasing both the number and the degree of heterogeneity of these cores.
This increasing complexity brings about an increasing challenge in taming
it.
Both the execution frequency and the closely intertwined single-core
processing speed of computing systems increased exponentially up until
the early 2000s (cf. Figure 1.1), alongside the exponential growth in
transistor counts that Gordon Moore observed in 1965 [Moo+65]. Since the
early 2000s, however, while transis-




structing a human, we can probably say something like “look through
those pictures and sort out the ones that have cats”. A modern x86 chip,
on the other hand, would understand something closer to this:
LBB0_1:
    cmp dword ptr [rbp - 56], 10
    jge LBB0_4
    mov edi, dword ptr [rbp - 56]
    call _read_picture
    mov qword ptr [rbp - 64], rax
    mov rdi, qword ptr [rbp - 64]
    call _contains_cats
    movsxd rcx, dword ptr [rbp - 56]
    mov dword ptr [rbp + 4*rcx - 48], eax
    mov eax, dword ptr [rbp - 56]
    add eax, 1
    mov dword ptr [rbp - 56], eax
    jmp LBB0_1
LBB0_4:
    mov eax, dword ptr [rbp - 52]
    mov rcx, qword ptr [rip + ___stack_chk_guard@GOTPCREL]
    mov rcx, qword ptr [rcx]
    mov rdx, qword ptr [rbp - 8]
    cmp rcx, rdx
    mov dword ptr [rbp - 68], eax
    jne LBB0_6
This snippet is a very oversimplified version of the task, but it serves to
make the point. We tell a human abstractly to look through the pictures,
and they understand them as a whole set, working out for themselves
how to go through the set. The machine, on the other hand, we instruct
to iterate through them by a series of very fine-grained commands. We
need to set certain registers to contain the right memory addresses, be-
fore calling an instruction to operate on them. We then call external func-
tions that do the reading and cat identification. To then loop through the
pictures here, simplified, we repeat this reading and identifying by jump-
ing to a previous point in the sequence of instructions. Even this x86 as-
sembly snippet is already an abstraction, not only because it uses human-
readable mnemonics for the instructions, but more so because it also ab-
stracts away the concrete memory addresses and the microarchitecture.
In practice, however, almost no one would write this assembly code. In-
stead, they could write something closer to this (equivalent) C snippet:
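A minimal sketch of such a loop, consistent with the assembly above, could look as follows. The picture type, the bound of ten pictures (taken from the comparison with 10 in the assembly), the stub bodies and the wrapper name find_cats are assumptions made for illustration only.

```c
#define NUM_PICTURES 10  /* assumed from the constant 10 in the assembly */

/* Hypothetical picture type; the text does not specify it. */
typedef struct { int id; } picture_t;

/* Stub standing in for the external function that loads picture i. */
picture_t *read_picture(int i) {
    static picture_t pictures[NUM_PICTURES];
    pictures[i].id = i;
    return &pictures[i];
}

/* Stub classifier: pretends that every third picture shows a cat. */
int contains_cats(const picture_t *picture) {
    return picture->id % 3 == 0;
}

/* The loop itself: read each picture in order and record the verdict. */
void find_cats(int results[NUM_PICTURES]) {
    for (int i = 0; i < NUM_PICTURES; i++) {
        picture_t *picture = read_picture(i);
        results[i] = contains_cats(picture);
    }
}
```

Only the body of find_cats corresponds to the assembly listing; the two stubs exist solely so that the sketch compiles on its own.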





Notice how the register management and several other low-level de-
tails are abstracted away. The end of the loop is very clear to read, as we
know when we have reached the final picture. We can certainly say this
is at a level of abstraction between the human and machine instructions
listed above. However, the very widespread for statement we used here
also has inherently sequential semantics, as exhibited by the assembly
code it translates to. The semantics of the for loop are that the loop
body will execute completely. After each iteration of the body, the
increment expression is executed (usually incrementing the iteration
variable), and the condition is evaluated, deciding whether to continue
iterating. Indeed, in the two (equivalent) snippets above, we do not know
how the functions read_picture and contains_cats work. Do they have an
inner state, or side effects? We do not know if we can call read_picture
in a different order, or multiple times in parallel. Perhaps it is
internally keeping a single reference to the iterator of the image files
and doing so would break the logic. The for statement is very useful to
abstract away the logic of registers and instruction jumps, but not a
useful abstraction for expressing concurrency. A similar construct exists
in functional programming, map, which generally does not have these
implicit sequential semantics. The map function is what is called a
higher-order function, taking a function as an argument and applying it
to a list or, in general, any iterable object. The same cat-identifying
snippet, in Haskell, can be written as follows:
result = map (contains_cats . read_picture) pictures
While the language separates stateful and stateless computation,
allowing a great analysis of concurrency, there are reasons why Haskell is
not the most widespread language for embedded systems. For example,
garbage collection makes execution times very unpredictable. Similarly,
the laziness of the language adds a performance penalty to large complex
computations. Compiling Haskell code to an efficient single-core
execution is significantly more challenging than compiling equivalent C
code. The laziness also makes it difficult to reason about time in the
computation. This is crucial in application domains like CPS, where the
systems interact with their environment. The map abstraction, as
implemented in Haskell, is not well-suited for many tasks in the domain
of CPS. In general, we are faced
with trade-offs between abstract expressivity and translatability to an ef-
ficient execution. At its core, the challenge is about choosing the right
models and corresponding abstractions for a particular domain.
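The contrast between the sequential loop and map can be sketched in a few lines of Python (used here only for brevity; it is not one of the languages discussed in this thesis). The functions read_picture and contains_cats below are hypothetical, side-effect-free stand-ins. Because they keep no hidden state, applying them concurrently gives the same result as the sequential loop; with internal state, this would not be guaranteed.

```python
from concurrent.futures import ThreadPoolExecutor


def read_picture(index):
    """Hypothetical stand-in: 'loads' a picture as a list of tags."""
    return ["cat", "sofa"] if index % 3 == 0 else ["dog"]


def contains_cats(picture):
    """Hypothetical stand-in for the classifier."""
    return "cat" in picture


def sequential(indices):
    # The for loop fixes the order: iteration i finishes before i + 1 starts.
    results = []
    for i in indices:
        results.append(contains_cats(read_picture(i)))
    return results


def mapped_concurrently(indices, workers=4):
    # map only states *what* is applied to each element; the executor is
    # free to run the calls concurrently and still returns them in order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda i: contains_cats(read_picture(i)), indices))


indices = list(range(10))
print(sequential(indices) == mapped_concurrently(indices))
```

The comparison only holds because both stand-ins are pure functions; this is exactly the information a plain for loop does not convey.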
1.3 Software Synthesis
Models play different roles in science and engineering. E.A. Lee explains
this well in [Lee17]. He argues that scientists adapt their models to fit ex-
periments in the world, while engineers adapt designs in the world to fit
their models. Indeed, some fundamental principles of computation, like
λ-calculus, are arguably discovered instead of invented, as Wadler con-
tends [Wad15]. Those might fit in the first paradigm, giving computer sci-
ence a justification for its name. In the case of programming multicore
systems, however, the problem is clearly in the second realm: we need to
engineer good models [Lee06]. No serious argument can be made that
languages like C or Haskell, or the x86 instruction set, were discovered:
they were invented.
There are different ways of finding and exploiting the right models for
programming multicores. It is unlikely that there is a single right model for
this. Different models are differently suitable for different use-cases. For
example, applicative functors in functional programming [Mar+14] seem
to be a great model for expressing I/O concurrency in microservice-based
systems. As mentioned before, however, Haskell and its underlying model
are not a great fit for CPS.
For CPS, and embedded systems in general, there is a family of methods
called software synthesis [RPM92; Abb+93; Lin98; BLM00; Pin+95; CSL11;
BML12]. It is a family of methods devised precisely to help with the bur-
den of fully exploiting the capabilities of modern multicores. Inspired by
hardware design flows, it aims to bridge the ensuing (software) productiv-
ity gap by integrating knowledge of the application and target multicore
architecture into the compilation process. At the core of these methods
lies a shift in the programming model. Instead of the de facto sequential,
shared-memory model, programmers express the code in diverse Models
of Computation (MoCs). This makes the underlying model explicit, not im-
plicit as is the case in most programming languages.
These models expose the structure of the computation in ways that
permit a compiler to reason about its parallel execution, even in the pres-
ence of heterogeneous hardware. Aided by abstract models of the target
architecture, we can design compilers for multicore systems that devise
execution strategies specialized to the target architecture and applica-
tions. Depending on the flow, the target architecture can be implicit in
the methodology [RPM92] or be an explicit input to the flow [CLA11]. This
can be realized for example by finding efficient mappings, i.e. allocations
of computational and communication resources to the different parts of
an application.
As mentioned above, the central principle behind software synthesis
is the underlying model of computation. Some approaches [Lin98] use
general models, like Petri Nets [Pet62], while others [RPM92] use more
constrained models like Synchronous Data Flows (SDFs) [LM87]. Most allow for
multiple models [BLM00; Pin+95; BML12], generally dataflow models. Mak-
ing the model explicit just makes it easier to see the trade-off between ex-
pressivity and translatability to an efficient execution. The advantage of
models like Petri nets is that they can express virtually any computation.
On the other hand, very constrained models, like SDF, provide behavioral
guarantees that permit several optimizations, like static schedules and
channel bounds [Par95].
Several more modern flows [Thi+07; CLA11; PEP06; Kan+06] have set-
tled at the Kahn Process Network (KPN) model. Originally meant as deno-
tational semantics for parallelism [Kah74], the model has been shown to
be compatible with dataflow [LP95]. Kahn Process Networks are provably
deterministic [Kah74], which is not the case for other models, e.g. Petri
Nets. In a canonical sense, KPNs are more general than most dataflow
models, and represent the most general deterministic dataflow model of
computation [LM09]. In this thesis we will focus on a software synthesis



















Figure 1.3: A flow for MoC-based Software Synthesis. The main abstractions col-
ored in green are the ones we deal with in this thesis.

There are exactly $8^2 = 64$ mappings in the mapping space, yet only 6
distinct execution times (colors). This is because, at least a priori, map-
ping both tasks to PE6, as shown in the figure, or mapping them both to
PE7, will obviously result in the same execution time, since the two cores
are identical (Cortex-A15™). This can be generally understood as a prop-
erty of the symmetries of the architecture, and should be exploited when
exploring this mapping space.
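A minimal sketch of this counting argument, under assumptions that are not stated in this passage: two tasks, eight PEs, of which PE0 to PE3 are of one core type and PE4 to PE7 (including the Cortex-A15 cores PE6 and PE7) of another. Two mappings are treated as equivalent if one results from the other by permuting cores of the same type. The names PE_TYPE and signature are hypothetical.

```python
from itertools import product

# Assumed platform: 8 PEs, PE0-PE3 of one core type, PE4-PE7 of another.
PE_TYPE = ["little"] * 4 + ["big"] * 4
NUM_TASKS = 2


def signature(mapping):
    """Invariant of a mapping under permutations of identical cores.

    For two tasks it suffices to record the core type each task runs on
    and whether both tasks share the same PE.
    """
    types = tuple(PE_TYPE[pe] for pe in mapping)
    shared = mapping[0] == mapping[1]
    return types, shared


mappings = list(product(range(len(PE_TYPE)), repeat=NUM_TASKS))
classes = {}
for m in mappings:
    classes.setdefault(signature(m), []).append(m)

print(len(mappings))  # 64
print(len(classes))   # 6
```

Under these assumptions the 64 mappings fall into 6 classes, which is consistent with the 6 distinct execution times mentioned above; an exploration would only need to evaluate one representative per class.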
Similarly, researchers often use heuristics based on geometric
properties of the design space to explore it. Yet they often disregard the
encoding they use for the design space. If we consider the point $(4, 4)$
on Figure 2.9, there are four points adjacent to it, yet they are vastly different in
terms of their execution time. We can compare the geometry of the space
with the geometry of the architecture itself, and see why this is the case:
the distances in this space do not reflect the architecture with its hetero-
geneity and its memory subsystem. In general, the geometry of this space
does not reflect the geometry of the problem.
Mappings encode the resource allocation for the application on an
architecture. As such, they inherit structural properties of both the
application and its semantics, as well as of the underlying architecture.
Yet mappings are commonly treated as simple lists of assignments,
disregarding this structure, such as which tasks depend on which, or
whether they are mapped to
cores with a large communication latency between them. Mapping algo-
rithms commonly encode a heterogeneous architecture as a list of num-
bers of cores of different types, or perhaps use a grid system to encode
processing elements (PEs) as they assume a NoC with a regular-mesh topol-
ogy. These models break down as soon as the complexities of the architec-
ture transcend the fixed model, for example by having multiple clusters
or levels of hierarchy, or star-mesh topologies instead of regular meshes.
The problems mentioned here permeate the design of the internal
algorithms in software synthesis flows, which effectively constrains them
to a small class of models or disregards opportunities for reasoning about
the structure of the problem. While memory has been identified as a first-
class citizen for achieving efficient implementations, many methods also
consider it just as an afterthought. For example, when discussing hetero-
geneity in architectures, the heterogeneity implied by the memory sub-
system is seldom considered, nor are emerging memory technologies like
non-volatile memories (NVMs).
The issues raised above are not inherent issues with the flow, but rather
with the state of practice. However, the flow itself does have some
inherent limitations as well. The KPN model of computation falls short on
certain use-cases. For example, the blocking-read semantics common in KPN
implementations are ill-suited for certain cases of data-level parallelism.
Also, perhaps more importantly, KPNs do not model time in the physical
world, which plays a central role for the execution of CPSs. In general, a
model-based design approach needs to evolve its models according to
the use-cases.
Another inherent problem with the flow as formulated is the structure
of the flow itself. An application is described using a concrete MoC and
then this is used to reason about an implementation. However, the flow
as depicted in Figure 1.3 (and implemented in practice in many instances)
disregards transformations at the level of the application. This could
mean a feedback loop back to the application, or perhaps
semantics-preserving code transformations at the model level, as part of
the exploration.
If methods like software synthesis are to be used in practice, we should
make sure they also work in practice. Strong results on a varied
benchmark suite from real-world applications are usually a much better
indicator for practical applications than, say, a good asymptotic
worst-case behavior. In order to get such results, however, we need such a
varied, realistic, up-to-date benchmark suite. In reality, increasingly
branching subdomains and concerns of intellectual property mostly yield
a scarce landscape of outdated benchmarks instead.
Finally, there are multiple issues with these flows that depend more on
the industry itself than the methods directly. Tool support and maturity,
degree of adoption and knowledge of the models are all beyond the realm
of the academic contribution of this thesis.
1.4 Contribution
In this thesis we seek to improve the tools we use for understanding and
tackling the problems discussed with software synthesis. We work in a
model-based perspective and consider the trade-off we have introduced,
between abstract expressivity and translatability to an efficient execution.
To consider this we tackle the problem from both sides: the models and
the compilers, in a very general sense, that translate to an efficient ex-
ecution. The main idea behind this thesis is that the underlying models
endow the problem with structure. We can then identify this structure
(mathematically) and leverage it to improve our solutions. Again there
are two ways of doing this:
1. by taking a concrete flow and improving it leveraging its own struc-
ture, or
2. by changing the underlying models in a way that improves the
balance in the trade-off above.
This thesis discusses both. We first focus on software synthesis for (high-
performance) embedded systems running on Multi-Processor Systems-
on-Chip (MPSoCs). In particular, we focus on a concrete software synthesis
flow [CLA11; CL14] based on KPNs. Chapter 2 introduces this flow, as de-
picted in Figure 1.3, and the corresponding background on the mapping
problem.
To evaluate methods in software synthesis in particular and compilers
in general, we need to test them on benchmarks. Chapter 3 discusses
benchmarking in compilers, and introduces some benchmarks we use in
the thesis. It also discusses benchmark generation, with its advantages
and pitfalls, both using random processes and machine learning.
As motivated in Figure 1.4, the mapping problem in software synthesis
has a rich structure, like its symmetries or geometry. We identify and de-
scribe this structure in Chapter 4. Describing the structure is only as useful
as the applications we find from it. In Chapter 5 we discuss multiple ap-
plications, e.g. at compile time in DSE or at run-time in hybrid mappings.
We also show how this structure can be used to formulate other proper-
ties of mappings, like robustness or compactness, which can be useful for
resilient computation even in real-time scenarios in CPS.
After exploring how to improve concrete flows with their structure, we
turn our attention to the underlying models. Chapter 6 reviews Models of
Computation (MoCs) in general, and shows how to improve the methods
here. We first show how to improve existing methods, discussing what we
call the MacQueen gap in the KPN semantics. We discuss a novel model,
Reactors, where time is a first-class citizen, and discuss applications in the
telecommunications and automotive domains.
MoCs are abstract mathematical models; they need to be exposed to
programmers using a language or an API. In Chapter 7 we discuss the pro-
gramming languages used to develop MoC-based applications. We review
different existing languages, including the Ohua paradigm, which can be
used for implicitly defining dataflow applications. We discuss language-
level transformations and abstractions in the context of Ohua and how
MoC-based design can be used for optimizing I/O in microservice-based ar-
chitectures, i.e. in a collection of loosely-coupled services in a networked
setting.
The topic of this thesis is broad, and much related work exists for all
aspects covered here. While different chapters cover related work perti-
nent to the topic discussed, we review and discuss it concisely again in
Chapter 8. Finally, some conclusions from this work are summarized in
Chapter 9.
While all topics covered in this thesis are related by model-based design
of software, not every chapter depends on everything previous. Figure 1.5
shows the logical dependencies of the chapters, and in some cases, the
sections of the chapters in this thesis. Any path in this graph should yield
a consistent exposition of the topics discussed. A reader only interested
in some topics can readily skip chapters and sections that are not in the


















































Figure 1.5: Dependencies of chapters and sections of this thesis.
1.4.1 A Note on Originality
This thesis presents the fruits of over half a decade of research on the
subjects presented. Research, especially in an interdisciplinary approach
like the one presented here, is much more fruitful when collaborative. In the case
of joint work, I have made an effort to focus only on my own contribu-
tions in this thesis, whenever possible. I have also taken care to describe
the work of my colleagues as theirs, when I have included it as an indis-
pensable requirement to understand my own work. However, some of
the ideas in this thesis are the result of joint work and cannot be credited
to a single person. In those cases I have also taken care to describe the
work as joint and mention other coauthors. If in doubt, any idea or result
that I have included here which has already been published elsewhere is
also due to my coauthors.

2 Mapping KPNs to Heterogeneous MPSoCs
Software synthesis refers to a family of methods, rather than a concrete
one, which share common properties about the abstract flow for generat-
ing code for an efficient execution in (heterogeneous) multicores. It can be
seen as embedded in a spectrum of design approaches going from hard-
ware design (and classical Electronic Design Automation (EDA)) through
hardware-software co-design up to software synthesis on the other end.
While some principles apply more generally than others, to actually pro-
duce and optimize code, we need to focus on a concrete flow. In this chap-
ter we will introduce the concepts behind software synthesis and map-
pings in a concrete flow, mapping KPN applications onto heterogeneous
hardware. The flow is an instance of the general flow from Figure 1.3, and
is presented in detail in [CL14].
As is common in software synthesis, the applications to be executed are
represented abstractly, using a model of computation, Kahn Process
Networks (KPNs). Similarly, the target architecture is assumed to be known
at compile-time, and is modeled via an abstract architecture model. The
KPN model has a property that allows us to capture the abstract execution
behavior in a trace that is independent of the execution target.
Combining these application and architecture models, and using an
execution trace, a simulation can be used to estimate the performance of a
mapping, i.e. an assignment of physical execution and communication resources
on the target architecture to the logical (abstract) components of the KPN
application. In an iterative process, these estimations can be leveraged
to determine a near-optimal mapping subject to objective goals (e.g. exe-
cution time, energy consumption). Finally, a compiler can lower the KPN
application to an executable that uses the selected mapping.
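The iterative process described above can be sketched as a simple search loop. This is only an illustration under strong assumptions, not the flow of [CL14]: the function estimate is a hypothetical stand-in for the trace-driven simulation (it ignores communication and dependencies and returns the load of the busiest PE), and WORK, SPEED and explore are invented for the example.

```python
import random

# Hypothetical inputs: work per process and relative speed per PE.
WORK = {"src": 2.0, "fft": 8.0, "filter": 4.0, "sink": 1.0}
SPEED = {"little0": 1.0, "little1": 1.0, "big0": 2.0}


def estimate(mapping):
    """Stand-in for the simulation: execution time of the busiest PE."""
    load = {pe: 0.0 for pe in SPEED}
    for process, pe in mapping.items():
        load[pe] += WORK[process] / SPEED[pe]
    return max(load.values())


def explore(iterations=200, seed=0):
    """Move one process at a time and keep a move only if it improves."""
    rng = random.Random(seed)
    pes = list(SPEED)
    best = {process: pes[0] for process in WORK}  # start: all on one PE
    best_cost = estimate(best)
    for _ in range(iterations):
        candidate = dict(best)
        candidate[rng.choice(list(WORK))] = rng.choice(pes)
        cost = estimate(candidate)
        if cost < best_cost:
            best, best_cost = candidate, cost
    return best, best_cost


mapping, cost = explore()
print(mapping, cost)
```

A greedy loop like this can get stuck in a local optimum, which is one reason why the choice of exploration heuristic and of the mapping representation matters.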
The rest of this chapter will explore the various models referred to in
this flow, with precise mathematical definitions and a discussion of com-
mon design choices and goals.
2.1 Kahn Process Networks
The main flow we investigate in this thesis is based on the MoC of Kahn
Process Networks (KPNs). In this section we introduce this model, or rather,
its most common implementation with blocking-read semantics [KM76] .
In Chapter 6 we will discuss the original (denotational) semantics [Kah74]
and how they differ to those introduced here. There, we also discuss other
MoCs and how they relate to each other.
We can think of a KPN as computation distributed among different pro-
cesses (originally derived from coroutines). Each of these processes exe-
cutes sequentially and is Turing complete. However, the processes share
no memory, they have local memories accessible only to themselves.
They communicate between each other using channels. These channels
work as unbounded FIFO buffers. Processes have sets of outgoing and in-
coming channels. As an instruction, any process can write to one of its
outgoing channels or read from one of its incoming channels. They do so
in discrete tokens of data.
1 __PNkpn fft_process
2     __PNin(int cnt, short src_data[N])
3     __PNout(complex freq[N]){
4   int i, loop_cnt;
5   __PNin(cnt)
6     loop_cnt = cnt;




Listing 1: A Fast Fourier Transform (FFT) implemented as a KPN process in CPN,
based on Appendix A.1.3 of [CL14]
The original language [KM76] was proposed as an extension of POP-2,
which is dated and has fallen out of use today. Instead of this
language, we will consider a more modern incarnation, C for Process
Networks (CPN), which extends the C programming language [She+14]. We do
so by looking at the example from Listing 1. Processes in CPN are
instantiated from process templates, similar to classes and objects in
object-oriented languages. The listing shows a very simplified process
template for an FFT process. Lines 2 and 3 declare the incoming and
outgoing channels for the process. In Line 5, the cnt channel is read and
its value is stored in the local variable loop_cnt in Line 6. Then in
Lines 7-10 the process applies an FFT to the data in its incoming channel
src_data and outputs it to an outgoing channel, freq. Similar to the read
operation in Lines 5-6, the values of the input channel data are available
in the identifier src_data in the scope of the __PNin. In an analogous
fashion, the values written to the freq variable in the scope of the
__PNout are written to the corresponding output channel.
In general, the communication in KPNs is asynchronous: When a process
writes to an outgoing channel, the data is buffered in the channel until it
is read, and the process continues to execute. If a process reads from a
channel, it receives the oldest token buffered in the channel. If there are
no tokens, execution blocks until such a token is written to the channel,
hence the name blocking-read semantics. A channel can be
the outgoing channel of at most one process (this should also be so for at
least one process, otherwise the channel is useless). On the other hand,
if a channel is an incoming channel to multiple processes, all tokens are
copied for each of those processes. Hence, all processes will see the ex-
act same incoming stream of tokens from a shared channel, instead of
splitting them up.
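These channel semantics can be sketched with threads and unbounded queues. This is a minimal illustration, not the CPN runtime; the class Channel and the function run_example are hypothetical names. Writes never block, reads block until a token is available, and every reader of a shared channel receives its own copy of the complete token stream.

```python
import queue
import threading


class Channel:
    """Unbounded FIFO channel with one writer and any number of readers."""

    def __init__(self, num_readers):
        # One queue per reader, so every reader sees the full token stream.
        self._queues = [queue.Queue() for _ in range(num_readers)]

    def write(self, token):
        # Never blocks: queue.Queue() without maxsize is unbounded.
        for q in self._queues:
            q.put(token)

    def read(self, reader):
        # Blocks until a token is available for this reader.
        return self._queues[reader].get()


def run_example(num_tokens=5):
    chan = Channel(num_readers=2)
    seen = [[], []]

    def producer():
        for i in range(num_tokens):
            chan.write(i)

    def consumer(reader):
        for _ in range(num_tokens):
            seen[reader].append(chan.read(reader))

    # Consumers are started first; they simply block until tokens arrive.
    threads = [threading.Thread(target=consumer, args=(r,)) for r in range(2)]
    threads.append(threading.Thread(target=producer))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen


print(run_example())  # [[0, 1, 2, 3, 4], [0, 1, 2, 3, 4]]
```

The result does not depend on how the threads are scheduled, which is the determinism property that makes the model attractive.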
Let us consider the FFT process from Listing 1 and combine it with other
processes into a full application. Listing 2 describes a simplified algorithm
for a low-pass filter on a stereo sound file, using this FFT process. We also
omit the templates and channel declarations in this simplified listing. The
src process reads the stereo file, splits it into two channels and sends the
sound in blocks of a determined length as tokens. These blocks are then
transformed from the time domain to the frequency domain using an FFT,
filtered and transformed back to the time domain. A sink process gathers
the filtered blocks from both channels, left and right, and combines them
again into a stereo sound file that it can store.
__PNprocess src = src_process
    __PNout(cnt, src_l_out, src_r_out);
__PNprocess fft_l = fft_process
    __PNin(count, src_l_out) __PNout(fft_l_out);
__PNprocess fft_r = fft_process
    __PNin(count, src_r_out) __PNout(fft_r_out);
__PNprocess filter_l = filter_process
    __PNin(count, fft_l_out) __PNout(filter_l_out);
__PNprocess filter_r = filter_process
    __PNin(count, fft_r_out) __PNout(filter_r_out);
__PNprocess ifft_l = ifft_process
    __PNin(count, filter_l_out) __PNout(ifft_l_out);
__PNprocess ifft_r = ifft_process
    __PNin(count, filter_r_out) __PNout(ifft_r_out);
__PNprocess sink = sink_process
    __PNin(count, ifft_l_out, ifft_r_out);
Listing 2: An audio filter KPN application in CPN, based on Figure 7a in [She+14]
The data flow in the example of Listing 2 is very structured: it goes from
the source, splits into two channels, through the filter, back to the sink.
This structure can easily be visualized in a graph, like in Figure 2.1. More
generally, we can think of any KPN application as a directed graph
$K = (V_K, E_K)$, where the nodes $V_K$ represent the processes, and the
edges $E_K$, the channels. This works even when a channel is an incoming
channel for multiple processes. In that case, we can split it into
multiple edges from the process that writes to it, to each of the
processes that read from it. We can do so
without loss of generality since these are the semantics of such channels.
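The construction of the graph, including the splitting of shared channels, can be sketched as follows. The dictionary CHANNELS is a hand-written transcription of Listing 2 and rests on one assumption: that the cnt channel written by src is the count channel read by all other processes. The function kpn_graph is a hypothetical helper.

```python
# Channel name -> (writing process, list of reading processes).
# Assumption: the cnt/count channel written by src is read by all others.
OTHERS = ["fft_l", "fft_r", "filter_l", "filter_r", "ifft_l", "ifft_r", "sink"]
CHANNELS = {
    "count": ("src", OTHERS),
    "src_l_out": ("src", ["fft_l"]),
    "src_r_out": ("src", ["fft_r"]),
    "fft_l_out": ("fft_l", ["filter_l"]),
    "fft_r_out": ("fft_r", ["filter_r"]),
    "filter_l_out": ("filter_l", ["ifft_l"]),
    "filter_r_out": ("filter_r", ["ifft_r"]),
    "ifft_l_out": ("ifft_l", ["sink"]),
    "ifft_r_out": ("ifft_r", ["sink"]),
}


def kpn_graph(channels):
    """Return (nodes, edges); a channel with n readers yields n edges."""
    nodes = set()
    edges = []
    for name, (writer, readers) in channels.items():
        nodes.add(writer)
        for reader in readers:
            nodes.add(reader)
            edges.append((writer, reader, name))
    return nodes, edges


nodes, edges = kpn_graph(CHANNELS)
print(len(nodes), len(edges))  # 8 15
```

Keeping the channel name on each edge preserves the information that several edges carry copies of the same token stream.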









Figure 2.1: The audio filter application as a KPN graph
2.2 Execution Traces
Kahn Process Networks have a more abstract definition with mathemati-
cal semantics [Kah74], in the sense of Scott [Sco70]. These abstract away
the concrete implementation of individual steps in a computation. Even
so, the execution of a computation can be thought of as a series of steps
or partial computations that eventually yield the final result. This series,
which is commonly referred to as an execution trace, can be captured as a
sequence of steps, e.g. as an element of a Scott domain (this will be
discussed in more depth in Chapter 6). Abstract computations, modeled as
Scott-continuous functions, can make computations of arbitrary length.
For an alphabet $\Sigma$, this is modeled by (countably) infinite sequences in
$\Sigma^\omega := \{(a_n)_{n \in \mathbb{N}} \mid a_n \in \Sigma \text{ for all } n \in \mathbb{N}\}$.
A concrete execution, on the other hand, always has a finite length. It
always resides in $\Sigma^*$, the Kleene closure of $\Sigma$. For a
(Scott-continuous) function, this sequence can be modeled as a finite
string in the computation domain.
In a concurrent execution, multiple entities concurrently execute steps.
As modeled by Kahn, these entities all implement individual functions. As
such, there is not a unique series of steps that can be said to be the execu-
tion trace of the computation. To see this, consider the example depicted
in Figure 2.2: It shows multiple execution orders for the audio filter KPN
application. If we were to consider the values in the channels, each of
these orders would yield a different sequence of values. In this case, the
actions in the alphabet Σ should also model the actual values of the arrays
of floating-point values that can be stored in the channels, which is why
we show the processes in the figure instead. The traces corresponding to
the executions shown in Figure 2.2 are all equivalent.
src fft_l fft_r filter_l filter_r ifft_l ifft_r sink
src fft_l filter_l ifft_l fft_r filter_r ifft_r sink
src fft_l filter_l fft_r ifft_l filter_r ifft_r sink
src fft_r filter_r fft_l filter_l ifft_l ifft_r sink
src fft_l filter_l fft_r filter_r ifft_l ifft_r sink
Figure 2.2: Different possible sequential executions of the audio filter KPN.
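To make the non-uniqueness concrete, the following sketch enumerates all sequential orders of the audio filter under a simplifying assumption: every process fires exactly once, and only after the processes it reads from. The dictionary PREDECESSORS is transcribed by hand from the application graph, and linearizations is a hypothetical helper.

```python
# Process -> processes it reads from (data channels of the audio filter).
PREDECESSORS = {
    "src": [],
    "fft_l": ["src"], "fft_r": ["src"],
    "filter_l": ["fft_l"], "filter_r": ["fft_r"],
    "ifft_l": ["filter_l"], "ifft_r": ["filter_r"],
    "sink": ["ifft_l", "ifft_r"],
}


def linearizations(preds):
    """Yield every order in which each process fires once, after its inputs."""
    def extend(prefix, remaining):
        if not remaining:
            yield tuple(prefix)
            return
        for p in sorted(remaining):
            if all(q in prefix for q in preds[p]):
                yield from extend(prefix + [p], remaining - {p})

    yield from extend([], frozenset(preds))


orders = list(linearizations(PREDECESSORS))
print(len(orders))  # 20
```

Under this assumption there are 20 such orders, namely the interleavings of the left and the right chain; the five orders of Figure 2.2 are among them.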
In the case of a concurrent execution, thus, traces are in fact
equivalence classes of strings. We define this more formally, following
[Maz95], the first chapter of [DR95]. Let $\Delta$ be a symmetric, reflexive
relation on $\Sigma$, which we call a dependency. This means that if
$(a, b) \in \Delta$, we have $(b, a) \in \Delta$, and also $(a, a) \in \Delta$
for all $a \in \Sigma$. With $\Delta$ we define an additional relation over
$\Sigma$, namely $I := (\Sigma \times \Sigma) \setminus \Delta$. We call $I$
the induced independency. We define an equivalence relation $\sim_I$ on the
monoid $\Sigma^*$ (with respect to concatenation) as follows: We require
that if $(a, b) \in I$, then $ab \sim_I ba$. The relation $\sim_I$ is
defined as the least congruence that satisfies this requirement. Note that
a congruence is an equivalence relation that respects the algebraic
structure, in this case the monoid structure of the concatenation operation.
We call the equivalence classes in $\Sigma^*/{\sim_I}$ traces. By definition,
the concatenation operation on $\Sigma^*$ factors over the equivalence
relation $\sim_I$, and thus $\Sigma^*/{\sim_I}$ defines a monoid (with
identity element $[\epsilon]_{\sim_I}$, where $\epsilon \in \Sigma^*$ is the
empty string). We call this the Trace Monoid, $T(\Sigma)$. We care about
the algebraic structure of a monoid since it is central to the definition of
Scott-continuity.
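The definition can be tried out on the audio filter example. The sketch below computes the trace of a word as its closure under swaps of adjacent independent letters, which is how the least congruence with $ab \sim_I ba$ acts on a single word. One assumption is made for illustration: two processes are dependent if they are equal or connected by a data channel. The helpers dependency and trace_of are hypothetical, and the search is exponential in general, so it is only suitable for small words.

```python
from collections import deque

EDGES = [
    ("src", "fft_l"), ("src", "fft_r"),
    ("fft_l", "filter_l"), ("fft_r", "filter_r"),
    ("filter_l", "ifft_l"), ("filter_r", "ifft_r"),
    ("ifft_l", "sink"), ("ifft_r", "sink"),
]


def dependency(edges):
    """Symmetric, reflexive closure of the given pairs."""
    letters = {x for edge in edges for x in edge}
    dep = {(a, a) for a in letters}
    for a, b in edges:
        dep |= {(a, b), (b, a)}
    return dep


def trace_of(word, dep):
    """All words reachable by swapping adjacent independent letters."""
    start = tuple(word)
    seen = {start}
    todo = deque([start])
    while todo:
        w = todo.popleft()
        for k in range(len(w) - 1):
            if (w[k], w[k + 1]) not in dep:
                swapped = w[:k] + (w[k + 1], w[k]) + w[k + 2:]
                if swapped not in seen:
                    seen.add(swapped)
                    todo.append(swapped)
    return seen


DEP = dependency(EDGES)
first = "src fft_l fft_r filter_l filter_r ifft_l ifft_r sink".split()
second = "src fft_r filter_r fft_l filter_l ifft_l ifft_r sink".split()
print(tuple(second) in trace_of(first, DEP))  # True
```

With this dependency, all executions shown in Figure 2.2 lie in the same equivalence class, i.e. they are one and the same trace.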
There are two additional equivalent definitions of this monoid, as
histories and as dependence graphs. We present histories here, as they are
better for the intuition. Instead of a single alphabet $\Sigma$, we have a
finite set of alphabets $\Sigma := (\Sigma_i)_{i \in I}$, where $I$ is a
finite index set (unrelated to the independency above). We can think of
the indices as corresponding to the entities in the system (e.g. processes)
and the alphabets $\Sigma_i$ to the alphabets of actions of these
individual entities. If we think of the individual entities as computing
some function, their execution trace will be a unique string
$a_i \in \Sigma_i^*$ (recall that concrete executions are finite). Since,
in general, these entities do not compute independently, they have common
synchronization points. These synchronization points are abstractly
modeled in the computation alphabet by mutual elements in
$\Sigma_i \cap \Sigma_j$ for two entities $i, j \in I$. In the case of a
synchronous dataflow [LM87] application, for example, we could model the
alphabet as being tuples of a channel and a value, and the common
synchronization points would be reading or writing a value. In KPN,
since the communication is asynchronous, we would need to model both
the channels and the processes as entities.
We can define a monoid, the product monoid $P(\Sigma)$, by component-wise
concatenation of the strings: $(a_i)_i (b_i)_i = (a_i b_i)_i$ for all
$i \in I$. However, not every such string product can be the history of a
system. The synchronization points of different subsystems should be
consistent with each other. To avoid inconsistencies, we want to ensure
histories are consistent. For this, we define elementary histories as
follows: For any $a \in \bigcup_{i \in I} \Sigma_i$, the elementary history
of $a$ is the tuple $(a_i)_{i \in I}$, with
$$a_i = \begin{cases} a, & \text{if } a \in \Sigma_i, \\ \epsilon, & \text{otherwise.} \end{cases}$$
Here, $\epsilon$ represents the empty string. The monoid generated by all
elementary histories for elements in $\bigcup_{i \in I} \Sigma_i$ is called
the history monoid $H(\Sigma)$, and is a submonoid of $P(\Sigma)$. If we
examine the definition, it is not difficult to convince ourselves that
these are precisely the histories which avoid inconsistencies.
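As a small illustration of these definitions, consider a hypothetical system with two entities whose alphabets are {a, s} and {b, s}, where s is a shared synchronization action. The sketch below represents strings as tuples of letters and builds histories as products of elementary histories; the names elementary_history, concat and history are chosen for this example only.

```python
# Hypothetical system: two entities that synchronize on the shared action s.
ALPHABETS = [{"a", "s"}, {"b", "s"}]


def elementary_history(a, alphabets):
    """Component i is the one-letter string a if a is in Sigma_i, else empty."""
    return tuple((a,) if a in sigma else () for sigma in alphabets)


def concat(h1, h2):
    """Component-wise concatenation, i.e. the product in P(Sigma)."""
    return tuple(x + y for x, y in zip(h1, h2))


def history(word, alphabets):
    """Product of the elementary histories of the letters of word."""
    result = tuple(() for _ in alphabets)
    for a in word:
        result = concat(result, elementary_history(a, alphabets))
    return result


print(history("asb", ALPHABETS))  # (('a', 's'), ('s', 'b'))
print(history("ab", ALPHABETS) == history("ba", ALPHABETS))  # True
```

The independent actions a and b commute in the history, whereas the order of a and s is recorded in the first component. A tuple such as (('s',), ()) is an element of the product monoid but not a history, since s would have to appear in both components.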
We can go from a trace to a history by the morphism
$\pi : T(\bigcup_{i \in I} \Sigma_i) \to H(\Sigma)$,
$a \mapsto (\pi_i(a))_{i \in I}$, where $\pi_i$ is the projection
$\bigcup_{i \in I} \Sigma_i \to \Sigma_i$, which erases all letters that are
not in $\Sigma_i$. Here, for the trace monoid $T(\Sigma)$ we define the
dependencies to be $\bigcup_{i \in I} \Sigma_i \times \Sigma_i$.
This is not just a morphism, but in fact an isomorphism: see Theorem
1.5.4 of [Maz95]. Thus, the two concepts are equivalent. For the rest of
this thesis we will use the terms traces and histories interchangeably.
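The equivalence of the two concepts can be checked by brute force on a tiny example. The sketch below uses a hypothetical system with alphabets {a, s} and {b, s}, and verifies for all words of length at most four that two words are trace-equivalent exactly when their projections agree. This is an experiment on one small instance, not a proof; the proof is the cited theorem.

```python
from itertools import product

ALPHABETS = [{"a", "s"}, {"b", "s"}]  # hypothetical: s is a shared action
LETTERS = sorted(set().union(*ALPHABETS))


def dependent(x, y):
    """x and y are dependent iff some alphabet contains both."""
    return any(x in sigma and y in sigma for sigma in ALPHABETS)


def project(word):
    """pi(word): component i keeps only the letters of Sigma_i."""
    return tuple(tuple(x for x in word if x in sigma) for sigma in ALPHABETS)


def trace_of(word):
    """Equivalence class of word: closure under independent adjacent swaps."""
    seen, todo = {word}, [word]
    while todo:
        w = todo.pop()
        for k in range(len(w) - 1):
            if not dependent(w[k], w[k + 1]):
                v = w[:k] + (w[k + 1], w[k]) + w[k + 2:]
                if v not in seen:
                    seen.add(v)
                    todo.append(v)
    return frozenset(seen)


words = [w for n in range(5) for w in product(LETTERS, repeat=n)]
classes = {w: trace_of(w) for w in words}
images = {w: project(w) for w in words}
agree = all(
    (classes[u] == classes[v]) == (images[u] == images[v])
    for u in words
    for v in words
)
print(agree)  # True
```

In practice this means that a trace can be stored as one string per entity, which is a far more compact representation than an explicit equivalence class of strings.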
Traces, and equivalently histories, can be used to describe the concrete
computations in concurrent systems like those described by a KPN. They
are also well-suited to model these systems in the context of process
calculi, like Communicating Sequential Processes (CSP). However, an important
observation is the converse: a concrete execution of a KPN is determined
uniquely by its history. Moreover, any concrete implementation of the
KPN realizing the same execution will have the same history: the history is
an invariant of the abstract execution model. It captures the concurrent
essence of the concrete computation.
2.3 Architecture Models
Hardware architectures contrast with applications from the point of
view of modeling. Abstraction boundaries are arguably more clearly de-
fined in the hardware world: semiconductor components like transistors
implementing digital switches are used to form logic gates (like a NAND
gate). Logic gates are used in increasingly complex logic diagrams for
building components like an Arithmetic Logic Unit (ALU). These compo-
nents are combined into digital machines in a microarchitecture to ex-
pose a well-defined Instruction-Set Architecture (ISA) in a PE [Lee17]. PEs
can then be connected via on-chip interconnects to on-chip memory and
other peripherals to make an MPSoC. There are mostly clear boundaries
between these platforms, as A. Sangiovanni-Vincentelli calls them [San07]
(which are levels of abstraction). Designers at each level expose a small
amount of complexity through these established abstractions, in what is
commonly referred to as an hourglass design [Bec19].






Figure 2.3: Different levels of abstraction in architectures
Figure 2.3 summarizes different models used at different levels in ar-
chitectures. The “bottleneck” design shown on the left implies how there
are well-defined abstraction layers at the different levels. The layer be-
tween hardware and software is, in a sense, also just such a layer of ab-
straction. Since these layers are clearer in the hardware world, so are the
corresponding models at those levels of abstraction. If we want to rea-
son about the execution of complex applications on MPSoCs, we certainly
should not focus on modeling individual logic gates in the architecture.
The challenge is to model architectures at the right level of abstraction.
In the modeling of the computation in applications, we care about the
semantics of the model. It should be expressive enough to capture the
application while being rigid enough to allow a compiler and system to
reason about its execution and optimize it as much as possible. Hardware,
on the other hand, is fixed: in software synthesis (and in this thesis) we
are not concerned with hardware design. As such, we take a more scientific
role² in modeling hardware, as opposed to the engineering role we take
for applications: we fit the model to the hardware, not the hardware to
the model.
Architecture models for software synthesis have two main requirements:
specification and simulation. In order to derive an efficient
implementation of an application on an architecture, the model of that
architecture needs to include at least the possible decisions required for
that software implementation. Different PEs and their types in the
architecture, scratchpad memories or Direct Memory Access (DMA)
controllers, when present, are certainly necessary parts of the models.
Whether actual physical memory addresses or concrete instructions in the
ISA should also be included depends on the flow: an end-to-end compiler
that produces binaries might benefit from modeling these, whereas a
higher-level, source-to-source compiler might do without them if it only
makes abstract decisions about resource allocation and leaves code
generation to a separate compiler.
Similarly, in many cases a simulation is part of the software synthe-
sis flow. In this case, a model of the architecture needs to allow such a
simulation. Obviously, a simple analytic model requires a different level
of abstraction for the architecture model than a cycle-accurate simula-
tor. A very concrete way of considering this is the Y-chart approach pro-
posed in [Kie+01], as depicted in Figure 2.4, which is based on Figure 6
from [Kie+01].
Figure 2.4: Multiple Levels of Abstraction in the Y-Chart Approach (Inspired by
Figure 6 in [Kie+01]).
The Y-chart approach is closer to a co-design methodology: architectures
are part of the design space, albeit only as parametrized families. As
such, they model an architecture as an abstract set of parameters (e.g.
number of cores of specified core types) for specification (mapping), with
an ad-hoc model for simulation (in MATLAB/Mathematica) or well-defined
models from a lower level of abstraction (cycle-accurate models or VHDL).
Thus, the approach described in Figure 2.3 shows well how different models
of architectures at different levels of abstraction can co-exist and be
used. While accurate simulation is pivotal for effective software
synthesis, simulation methods and accuracy are beyond the scope of this
thesis. We will thus focus only on models of architecture for the sake of
DSE and the specification of decisions (concretely, here, mappings).
The general situation described in the Y-chart approach is very common in
practice: a parametrized family of hardware architectures is assumed as
part of the flow, and architectures are described in terms of this family.
With newer developments in hardware, like the proliferation of NoC-based
architectures, many modern approaches apply the same principle to these
modern architectures. For example, the models used by [Wei+14; Sin+10;
RG18] all assume a regular mesh ($N \times M$) NoC-based topology and
parametrize the architecture by the size of the mesh, $N, M$, as well as
the core types and communication and memory parameters like worst-case
latency values. In the Sesame framework [PEP06], an additional abstraction
layer called the mapping layer works as an intermediate virtual platform,
in correspondence with the KPN application, which is then mapped to the
target platform. In the DOL approach [Thi+07], architectures are modeled
in an XML specification that implicitly models the architectures as graphs
with specific annotations, e.g. for memory sizes or resource sharing
methods like first-in, first-out (FIFO). While this is an ad-hoc model,
its graph-based nature is general enough to describe arbitrary
architectures. This is common among the most general models at this level
of abstraction: they are graph-based models. In [ECP06], architectures are
modeled as bipartite graphs with cores and memories. This bipartite
structure is actually similar to the constraint graphs defined in [Wei+14;
RG18], which basically describe the subset of the architecture used by a
mapping. In MAPS [CLA11], on the other hand, for the purposes of mapping,
architectures are described by labeled graphs where only the cores are
nodes and the edges represent communication. This is similar to the model
described in [PEP06]. Some of these models³ have influenced the SHIM
standard [The15] and the IEEE 2804-2019 Standard [CDA20]. There are subtle
differences between all these models, which makes comparing approaches
difficult [Goe+16].
In practice, however, the different graph-based architecture models are
mostly equivalent. For this thesis we use a model based on the MAPS model
for defining architecture graphs. An architecture graph
$A = (V_A, E_A, l_A)$ is a labeled directed multigraph where the nodes
$V_A$ represent PEs in the architecture. These PEs are labeled with core
types. Communication in the architecture graph is represented by the edges
$E_A$. Since $A$ is a multigraph, $E_A$ is a multiset: there can be
multiple edges $e_1, \ldots, e_n \in E_A$ between two cores
$\mathrm{PE}, \mathrm{PE}' \in V_A$. These edges are distinguished by their
labels $l_A(e_i)$, $i = 1, \ldots, n$. The labels of edges identify them as
communication primitives. Communication primitives are an abstraction that
encompasses communication via multiple methods: shared memories, DMA or
even specialized hardware like hardware FIFO buffers. Communication
primitives can also be used to model different software libraries/APIs for



















Figure 2.5: The Odroid-XU4 Architecture.
Consider the architecture depicted in Figure 2.5, the Odroid-XU4 with a
Samsung Exynos 5422 chip, which has an octa-core ARM big.LITTLE (4+4)
architecture. This architecture has two types of cores, the ARM Cortex-A7
and Cortex-A15, little and big, respectively. Similarly, there are three
types of communication primitives in the architecture: communication via
the L1 and L2 caches, or over the shared DRAM memory. This architecture can
be modeled in an architecture graph by having 8 nodes, one for each core
(4 of each of the two core types), and connecting the nodes with all the
primitives that can be used to communicate between them. Figure 2.6
shows the architecture graph for this example.
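As a minimal sketch, such an architecture graph could be written down in plain Python as follows. The function name, the PE identifiers and the data layout are purely illustrative and not taken from any tool. The sketch assumes that a core reaches only itself through its private L1 cache, that the cores of one cluster share an L2 cache, and that all cores share the DRAM.

```python
from itertools import product


def build_odroid_graph(num_little=4, num_big=4):
    """Build a labeled directed multigraph for a big.LITTLE platform.

    Returns (nodes, edges), where nodes maps a PE name to its core type and
    edges is a list of (source PE, target PE, communication primitive).
    """
    nodes = {}
    for i in range(1, num_little + 1):
        nodes[f"PE{i}"] = "Cortex-A7"
    for i in range(num_little + 1, num_little + num_big + 1):
        nodes[f"PE{i}"] = "Cortex-A15"

    edges = []
    for src, dst in product(nodes, repeat=2):
        if src == dst:
            # Assumption: the private L1 cache only connects a core to itself.
            edges.append((src, dst, "L1"))
        if nodes[src] == nodes[dst]:
            # Assumption: cores of the same type form a cluster sharing an L2.
            edges.append((src, dst, "L2"))
        # The shared DRAM connects every pair of cores.
        edges.append((src, dst, "DRAM"))
    return nodes, edges


nodes, edges = build_odroid_graph()
```

Because several primitives can connect the same pair of cores, the edges are kept as a list of labeled triples rather than as a set of node pairs, which mirrors the multigraph definition above.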
Figure 2.6: An Example of an Architecture Graph for the Odroid-XU4 Architecture.
In NoC-based architectures, the communication depends on the routing over
the on-chip network. In particular, the communication latency changes
depending on the number of hops required to communicate between two PEs.
Our model of architecture graphs (among others, like the DOL architecture
model) has the advantage of having a different communication primitive for
each of these connections with different numbers of hops, thus being able
to model NoC topologies as well as others (e.g. bus-based). However, for
simplicity of reasoning, we can sometimes benefit from a related graph,
which we call the topology graph [GMC18]. A topology graph
$T = (V_T, E_T, l_T)$ is also a directed multigraph with the same vertex
set $V_T = V_A$ as that of the architecture graph $A$, namely the set of
cores. The edges are different: we only add an edge for a communication
primitive $e \in E_A$ if it allows direct communication between two cores.
Thus, $E_T \subseteq E_A$. For a bus-based architecture like the
Odroid-XU4, this topology graph corresponds to the architecture graph.
However, for a NoC-based architecture, the topology graph captures the
network topology. Figure 2.7 shows the difference between the architecture
graph $A$ and the topology graph $T$ for a $2 \times 2$ regular mesh NoC
topology. The difference between the two graphs in this case is that the
topology graph has no edges for multiple hops, whereas the architecture
graph has them. As such, the topology graph reflects the topol-




















Figure 2.7: Comparison of the Architecture and Topology Graphs for a
$4 \times 4$-Mesh NoC-based Architecture.
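The relation between the two graphs can be illustrated with a small sketch for a regular mesh. It assumes one communication primitive per hop count and that the number of hops equals the Manhattan distance between two cores (as with dimension-ordered routing); both are assumptions made for illustration, and the function name is hypothetical.

```python
def mesh_graphs(rows, cols):
    """Return (architecture edges, topology edges) for a rows x cols mesh.

    Each edge is a triple (source core, target core, number of hops).
    """
    cores = [(r, c) for r in range(rows) for c in range(cols)]
    arch_edges = []
    for src in cores:
        for dst in cores:
            if src == dst:
                continue
            # Assumption: hops equal the Manhattan distance in the mesh.
            hops = abs(src[0] - dst[0]) + abs(src[1] - dst[1])
            arch_edges.append((src, dst, hops))
    # The topology graph keeps only direct (single-hop) connections.
    topo_edges = [edge for edge in arch_edges if edge[2] == 1]
    return arch_edges, topo_edges


arch_edges, topo_edges = mesh_graphs(2, 2)
```

For the 2 x 2 mesh, every core reaches the three other cores in the architecture graph, while the topology graph only connects each core to its two direct neighbors.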
As mentioned above, the subtle differences in different models make
comparison between them difficult [Goe+16]. The main reason for this is
the two distinct roles that architecture models play in software
synthesis, as we have discussed in this section. Having a common model for
specification is beneficial for defining software synthesis approaches,
and thus desirable. Having common models of architecture, while beneficial
for comparison, is not necessarily desirable: there are good reasons for
having simulations at different levels of accuracy. Nevertheless, Pelcat
et al. [Pel+15] have made an attempt to define such common models of
architecture. Their definition is abstract: they require a unique,
reproducible cost of computation. This solves the problem of
comparability, at the cost of the simulation. In a sense, their definition
of a model of architecture is tantamount to defining a specification for a
simulation. We believe this is a great idea, but unfortunately not yet
mature enough in terms of the models that exist and their integration into
simulators. The Linear System-Level Architecture model they propose is
also a graph-based model and is similar to the graph-based models
discussed above. However, we believe that it is better to separate both
concerns conceptually, namely simulation and the specification of
mappings. As such, we will focus only on the graphs defined in this
section for mapping specification and leave the simulation level open to
the multiple levels of accuracy, as required by the use-case.
2.4 The Mapping Problem
The main problem we address in the first part of this thesis is the map-
ping problem [Mar+11]. The mapping problem is the decision problem of
assigning physical resources (hardware) to the logical tasks and data (soft-
ware) of an application. As can be seen from Figure 1.3 in the introduction,
this is a central problem in software synthesis.
One commonly thinks of assigning the tasks and communication channels (or
data) to the physical resources, and not the other way around. The reason
we do not follow this convention again has a mathematical background, as
we will explain here. Such an assignment is a correspondence and can be
interpreted as a relation $R \subseteq A \times K$ that relates the
architecture $A$ and the application $K$. By abuse of notation, we write
the graphs $A, K$ here to mean both a relation on their nodes $V_A, V_K$
and one on their edges $E_A, E_K$.
A relation is the most general description of such a correspondence.
However, in this thesis we do not consider mappings where a single task
can be assigned to multiple hardware resources. The thread affinity mech-
anism in the POSIX standard, for example, assigns a POSIX process to mul-
tiple (hardware) threads. Then, the operating system scheduler decides
in which of the specified threads to actually execute the process, possibly
migrating it multiple times during its execution. We do not consider this
kind of behavior. If we want to model it with the mathematical framework
proposed here, however, we can. For this, we describe the final mapping
as decided by the scheduler at run-time, and consider migrations as mul-
tiple spatial mappings at different time instances.
We define a mapping to have exactly one physical resource for each
logical one (i.e. for each task or data/communication channel). This kind
of mathematical relation is precisely the definition of a function, which
is why we model mappings as functions $m : K \to A$, i.e. assigning
physical resources to the logical ones. A mapping also needs to be
consistent. If it assigns two tasks $t_1, t_2 \in V_K$ to different PEs
and these tasks exchange data (i.e., $(t_1, t_2) \in E_K$), the data
communication channel needs to be mapped to a physical channel that
respects the task assignment: we require that
$m((t_1, t_2)) = (m(t_1), m(t_2)) \in E_A$. This condition, mathematically,
means precisely that a mapping respects the graph structure of $K$ and
$A$. In other words, a mapping is a morphism of graphs $m : K \to A$.
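The consistency condition can be checked mechanically. The following is a minimal sketch, assuming that the application is given as a list of task pairs, that the architecture is given as a set of labeled edges, and that a mapping consists of a task assignment plus a choice of communication primitive per channel; the function and variable names are hypothetical.

```python
def is_graph_morphism(task_edges, arch_edges, task_map, channel_map):
    """Check that a mapping respects the graph structure.

    task_edges:  iterable of (t1, t2) channels of the application
    arch_edges:  set of (pe1, pe2, primitive) edges of the architecture
    task_map:    dict assigning a PE to every task
    channel_map: dict assigning a primitive to every channel (t1, t2)
    """
    for t1, t2 in task_edges:
        image = (task_map[t1], task_map[t2], channel_map[(t1, t2)])
        # The image of a channel must be an existing edge between the
        # images of its endpoints.
        if image not in arch_edges:
            return False
    return True


# A small excerpt of an architecture graph (illustrative only).
arch_edges = {
    ("PE1", "PE1", "L1"),
    ("PE2", "PE2", "L1"),
    ("PE1", "PE2", "L2"),
    ("PE1", "PE2", "DRAM"),
}
task_edges = [("t1", "t2")]
task_map = {"t1": "PE1", "t2": "PE2"}

valid = is_graph_morphism(task_edges, arch_edges, task_map,
                          {("t1", "t2"): "L2"})
invalid = is_graph_morphism(task_edges, arch_edges, task_map,
                            {("t1", "t2"): "L1"})
```

In this sketch, mapping the channel to the L2 cache is accepted, whereas mapping it to an L1 cache is rejected because no L1-labeled edge connects the two PEs.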
Consider the example of the mapping depicted in Figure 2.8. It shows
the mapping
\[ m : t_1 \mapsto \mathrm{PE}_1, \quad t_2 \mapsto \mathrm{PE}_2, \quad (t_1, t_2) \mapsto \mathrm{L2\$}. \]
This mapping can be considered as the morphism of graphs depicted on
the right, where the image $m(K) \leq A$ is a subgraph of the architecture
graph $A$ (cf. Figure 2.6). We could not map the communication edge
$(t_1, t_2)$ to, say, the L1 cache of $\mathrm{PE}_3$, $\mathrm{L1\$}$,
since this cannot be used to communicate between $m(t_1)$ and $m(t_2)$, or
to any L1 cache for that matter, since (more precisely) there is no edge
$(m(t_1), m(t_2)) \in E_A$ with the label
$l_A((m(t_1), m(t_2))) = \mathrm{L1\$}$.

Figure 2.8: An example of a mapping as a diagram (left) and as a morphism of
graphs (right).
We define a set
$M \subseteq \{ m : K \to A \mid m \text{ is a morphism} \} =: \mathrm{Mor}(K, A)$
as the set of (valid) mappings. A morphism of graphs $m : K \to A$ that is
not in the set $M$ is an invalid mapping. There can be different reasons
for this, e.g. if a PE $p \in V_A$ is not general purpose and cannot
execute some tasks, or, when modeling the sizes of data (channels), if a
communication channel does not fit a physical resource. We model this by
letting $M$ be a proper subset of $\mathrm{Mor}(K, A)$, the set of
morphisms $K \to A$.
Having formally defined a mapping, we can also define the mapping
problem. Let $\Theta : M \to \mathbb{R}^k_{\geq 0}$ be a function on the
set of mappings. We call $\Theta$ an objective function. For example,
$\Theta : M \to \mathbb{R}_{\geq 0}$ (for $k = 1$) can be the execution
time of the application $K$ when mapped via $m$ to the architecture $A$.
This could similarly be another measure of the quality of a mapping, like
throughput or total energy consumption. It can also be a combination of
multiple metrics for $k > 1$. Additionally, depending on the use-case, the
results of the software synthesis process might need to respect some
constraints. For example, we might want to minimize the energy consumption
while maintaining the execution time under some real-time threshold. Let
$C : M \to \mathbb{B}$ be the (boolean) function that decides if a mapping
satisfies the required constraints. Thus, in the example, $\Theta$ would be
the energy consumption and $C(m)$ would be true if and
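only if the mapping’s execution respects the real-time constraint. We can
then state the mapping problem as finding
\[ \hat{m} = \operatorname*{arg\,min}_{m \in M,\; C(m)} \Theta(m). \tag{2.1} \]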
Here, the minimum of the vector $\Theta(m) \in \mathbb{R}^k_{\geq 0}$ for
$k > 1$ can be understood as an element-wise minimum. In particular, some
points are incomparable: if $\Theta(m_1)_1 > \Theta(m_2)_1$ and
$\Theta(m_1)_2 < \Theta(m_2)_2$, then $\Theta(m_1), \Theta(m_2)$ are
incomparable. This element-wise comparison of vectors gives us a partial
order on $\mathbb{R}^k_{\geq 0}$. Equation 2.1 can then be understood as
finding Pareto-minimal points, i.e. points that are not dominated by any
other point in the set. Concretely, we say that $\hat{m}$ is not dominated
by any point (is Pareto-minimal) if $m \nless \hat{m}$ for all $m \in M$,
where mappings are compared through their objective values. A variant of
this same problem can be encoded as an integer optimization problem, e.g.
as is done in [ECP06]. As a problem formulation, however, we believe the
treatment given here, defining the conditions as a morphism of graphs, is
much simpler to read and understand and just as expressive.
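To make the notion of Pareto-minimal points concrete, the following sketch filters a small set of objective vectors. It uses the usual convention that a point dominates another if it is no larger in every component and strictly smaller in at least one; the mapping names and objective values are made up for illustration.

```python
def dominates(a, b):
    """True if vector a is element-wise <= b and differs from b."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_minimal(objectives):
    """Return the names of all mappings that no other mapping dominates.

    objectives: dict from mapping name to its objective vector.
    """
    return [
        name
        for name, theta in objectives.items()
        if not any(dominates(other, theta) for other in objectives.values())
    ]


# Hypothetical objective vectors (execution time, energy).
objectives = {
    "m1": (10.0, 2.0),
    "m2": (8.0, 3.0),
    "m3": (12.0, 3.5),
}
front = pareto_minimal(objectives)
```

Here `m1` and `m2` are incomparable, since each is better in one component, while `m3` is dominated by both.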

Dynamic mappings, on the other hand, are chosen at run-time. The dy-
namic mapping problem is in essence the same as task scheduling. The
trade-offs between time available to make a (scheduling) decision and
the available information at run-time are certainly not unique to the map-
ping problem. However, dynamic mappings present an additional hurdle
in heterogeneous systems, since code has to be compiled for the different
possible targets.
Hybrid mapping approaches sit between static and dynamic ones. An
ahead-of-time decision process or mapping space pruning analyzes the
mapping space and pre-defines a set of mappings or partial mappings.
From these pre-defined mappings, a run-time system chooses a mapping
or constructs a mapping from the partial mappings, based on the avail-
able information at run-time.
Finally, we distinguish between heuristics and meta-heuristics. Map-
ping heuristics, like load-balancing, are domain-specific algorithms that
exploit the specific domain-knowledge to find a solution based on a pre-
defined model of the problem. On the other hand, meta-heuristics, like ge-
netic algorithms, rely on an iterative evaluation of the points. In the case
of mappings, this usually means a simulation or profiling of a mapping’s
execution. Again, this distinction is not unique to the mapping problem.
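The iterative evaluation that characterizes meta-heuristics can be sketched with the simplest possible instance, a random search. This is a stand-in chosen for brevity and not a genetic algorithm; the cost function below is a toy load measure that replaces the simulation or profiling step, and all names are hypothetical.

```python
import random
from collections import Counter


def max_load(mapping):
    """Toy cost: the largest number of tasks sharing one PE."""
    return max(Counter(mapping.values()).values())


def random_search(tasks, pes, cost, iterations=200, seed=0):
    """Evaluate random task-to-PE assignments and keep the cheapest one."""
    rng = random.Random(seed)
    best_mapping, best_cost = None, float("inf")
    for _ in range(iterations):
        candidate = {task: rng.choice(pes) for task in tasks}
        candidate_cost = cost(candidate)  # stands in for a simulation run
        if candidate_cost < best_cost:
            best_mapping, best_cost = candidate, candidate_cost
    return best_mapping, best_cost


tasks = ["t1", "t2", "t3", "t4"]
pes = ["PE1", "PE2"]
best_mapping, best_cost = random_search(tasks, pes, max_load)
```

The search treats the cost function as a black box, which is what distinguishes this family of methods from heuristics that exploit a predefined model of the problem.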
2.5 Simulating Mappings
Simulations are extremely important for analyzing an application’s perfor-
mance, or more generally, its behavior. As described in Section 2.3, there
are multiple levels of detail in which to model and, consequently, simulate,
an architecture and its execution. For investigating the mapping problem
in software synthesis, higher-level simulations are preferable for multi-
ple reasons. First and foremost, higher-level simulations are faster. If a
meta-heuristic iteratively evaluates dozens, hundreds or even thousands
of mappings to find a near-optimal one, it greatly benefits from the fast
evaluation time associated with a higher-level simulation.
Higher levels of abstraction come with a trade-off. The accuracy of the
simulation suffers in exchange for the simpler models and faster
simulation times. Let $\tilde{\Theta}$ be the approximation of $\Theta$
from the simulation. A loss in accuracy means that
$|\Theta(m) - \tilde{\Theta}(m)|$ becomes larger. However, depending on
the use-case and mapping objective $\Theta$, this loss in simulation
accuracy might not necessarily affect the quality of the software
synthesis results. Suppose that the objective $\Theta$ represents
execution time or energy consumption, and the goal of the software
synthesis is just a best-effort minimization of $\Theta$ (with no
additional constraints, i.e. $C \equiv \mathrm{True}$). Then the accuracy
of the simulation is not important, only its fidelity. If
$\Theta(m_1) < \Theta(m_2)$, we want the result of the simulation to
reflect this, i.e. $\tilde{\Theta}(m_1) < \tilde{\Theta}(m_2)$. As long as
this is the case, we do not care about the actual value of
$|\Theta(m_i) - \tilde{\Theta}(m_i)|$, since the exploration will still
find the minimum. The fidelity of the simulation is a measure of how often
this is true. On the other hand, if the application is a real-time
application, then the truth value of $C$ will depend on the accuracy of
the simulation. Here, the accuracy of the simulation is much more
important.
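One possible way to quantify fidelity, used here only as an illustration, is the fraction of mapping pairs whose relative order is preserved by the simulation. The text above does not fix a formula, so this pairwise definition and the numbers below are assumptions.

```python
from itertools import combinations


def sign(x):
    """Return -1, 0 or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def fidelity(true_costs, simulated_costs):
    """Fraction of mapping pairs ordered identically by both cost lists."""
    pairs = list(combinations(range(len(true_costs)), 2))
    if not pairs:
        raise ValueError("need at least two mappings to compare")
    agreeing = sum(
        sign(true_costs[i] - true_costs[j])
        == sign(simulated_costs[i] - simulated_costs[j])
        for i, j in pairs
    )
    return agreeing / len(pairs)


# Hypothetical measured and simulated execution times of four mappings.
true_costs = [10.0, 12.0, 15.0, 11.0]
simulated_costs = [5.0, 6.5, 9.0, 7.0]
score = fidelity(true_costs, simulated_costs)
```

In this example the simulated values are far from the measured ones, yet five of the six pairs keep their order and both lists agree on the best mapping, so a best-effort minimization would not be affected.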
This chapter describes the simulation aspects which pertain to the models
of computation and the practical tooling we will use. Nuanced simulation
details and advanced techniques are beyond the scope of this thesis.
2.5.1 Simulating the Execution of Kahn Process Networks
The behavior of a system plays a central role in simulation. A determinis-
tic model should yield deterministic simulation results. Non-determinism,
when present, should also be captured by the models and reflected by the
simulations.
The behavior of systems is commonly captured in execution traces,
which simply record the behavior of different entities (e.g. processes or
actors) at different timepoints. This can be formally captured in a monoid
structure of (Mazurkiewicz) traces or, equivalently, histories [DR95], as
described in Section 2.2. Traces are common in many domains, as they are
useful to understand the behavior of systems [Nag+96]. However, for
systems that are non-deterministic, (by definition) the behavior of the
system does not only depend on the input. This can make designing [Lee06]
and debugging [Mur+14] particularly difficult. In cyber-physical systems
or, more generally, reactive systems in the sense of Harel and Pnueli
[HP85], input from the physical world might come in a non-deterministic
fashion. The problem of capturing the behavior of such a system is even
more complex when the system is distributed [Sha16].
Kahn Process Networks are deterministic, as are all the dataflow models
that can be embedded as KPNs. This means that the behavior of a KPN
application depends only on the input to the network. In particular, it
does not depend on the mapping and scheduling or related execution
details. Thus, their behavior can be captured by a (Mazurkiewicz) trace.
This allows us to re-create their behavior in a fashion that is
independent of the mapping. By “replaying” the trace, i.e. simulating the
execution of a process for every input in the trace, a discrete-event
simulator can successfully simulate the execution of a KPN, since the
token sequence is guaranteed to be identical given identical inputs. In
particular, this allows us to do Design-Space Exploration (DSE).
A discrete-event simulation of a KPN application thus requires behavior
traces. It also needs to model the execution and communication times.
Modeling execution times from a trace is simple under one crucial
assumption: that the execution time of a trace event depends only on the
PE type. This assumption will not always hold, e.g. when the instruction
cache is flushed due to scheduling decisions, or due to unpredictability
from the operating system (OS). Note that data caches are modeled as part
of the communication between processes. In most cases this assumption is a
good approximation, as it is normal to expect that the same code
executing on the same data and the same ISA will usually require the same
amount of time.
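Under this assumption, annotating a trace with execution times reduces to a lookup per PE type. The following minimal sketch illustrates the idea; the trace format, the cycle counts, the clock frequencies and the function name are hypothetical.

```python
def process_execution_time(trace, pe_type, frequency_hz):
    """Sum the execution time in seconds of all segments of a process.

    trace:        list of segments, each mapping a PE type to a cycle count
    pe_type:      type of the PE the process is mapped to
    frequency_hz: dict from PE type to clock frequency in Hz
    """
    total_cycles = sum(segment[pe_type] for segment in trace)
    return total_cycles / frequency_hz[pe_type]


# Two trace segments with made-up cycle counts per PE type.
trace = [
    {"A7": 1400, "A15": 1000},
    {"A7": 2800, "A15": 2000},
]
frequency_hz = {"A7": 1.4e9, "A15": 2.0e9}

time_little = process_execution_time(trace, "A7", frequency_hz)
time_big = process_execution_time(trace, "A15", frequency_hz)
```

The same trace yields different execution times depending only on the type of the PE chosen by the mapping, which is what makes the trace reusable across mappings.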
Modeling communication is more complicated, as it depends on the
memory subsystem. In general, the communication costs of sending a KPN
token depend on multiple factors, like the size of the token, contention
in the memory subsystem (and correspondingly methods of arbitration,
routing in the case of a NoC, etc.), or the API and protocol being used.
For the simulations in this thesis, we use a model based on annotations of
the architecture graph $A$. These annotations are functions that calculate
the time cost of communicating data, as a function of its size. In this
way, we model both the latency and bandwidth of the communication. We use
a split-cost communication model to assign costs to sending and receiving
data [Ode+13]. This separation can be used to simulate based on traces, as
described above, since we can then compute the cost of communication for
both the sending and receiving processes.
When dealing with NoC-based architectures, this model is not as accu-
rate. Communication over a NoC depends also on the routers and links
along the path, including the routing algorithms. We extended the split-
cost communication model to account for these issues in [MGC16]. The
idea is to add a third term to account for the network, in addition to the
consumer and producer costs. This third term can account for the rout-
ing and the topology of the network while maintaining an analytic model
which is cheap to evaluate in a high-level simulation.
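The structure of such a cost model can be sketched as follows. The affine form of the producer and consumer costs (a fixed latency plus a size-dependent term) and the per-hop network term are assumptions made for illustration; they are not the concrete models of [Ode+13] or [MGC16], and all names and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinearCost:
    """Cost annotation: fixed latency plus a size-dependent term."""
    latency: float   # cycles paid for every transfer
    per_byte: float  # cycles per byte, i.e. inverse bandwidth

    def __call__(self, size_bytes):
        return self.latency + self.per_byte * size_bytes


def communication_cost(size_bytes, producer, consumer, hops=0, per_hop=0.0):
    """Return the (send, receive, network) costs of one token in cycles."""
    send = producer(size_bytes)
    receive = consumer(size_bytes)
    # Assumed third term: a fixed cost for every hop on the network.
    network = hops * per_hop
    return send, receive, network


producer_cost = LinearCost(latency=100.0, per_byte=0.5)
consumer_cost = LinearCost(latency=50.0, per_byte=0.25)
costs = communication_cost(64, producer_cost, consumer_cost,
                           hops=2, per_hop=10.0)
```

Keeping the three terms separate lets a trace-based simulator charge the send cost to the producing process and the receive cost to the consuming process, while the network term remains a cheap analytic expression.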
Simulation is essential for software synthesis, yet it is not the focus of
this thesis. The main contributions of [MGC16], namely a concrete model
for the Tomahawk 2 architecture [Noe+14] and the corresponding evaluation
against the SystemC-based simulator Noxim [Cat+15], are due to my
coauthors and beyond the scope of this thesis.
2.6 Software Synthesis Flows
Many flows exist that enable model-based design in a software synthesis
flow. In the introduction we discussed some of the original software
synthesis methods: the approach of [Lin98], which uses Petri Nets, that of
[RPM92], which uses SDF, and other flows that use multiple models [BLM00;
Pin+95; BML12], generally dataflow.
SystemCoDesigner [Hau+08] is based on SystemC and aimed at FPGAs,
as is the case with CAPH [SBA13], which is based on dataflow and the actor
model. Although these flows are based on MoCs, their goal is not software
but rather an FPGA implementation, and as such they are closer to HLS than
the rest. Incidentally, the term software synthesis is an allusion to the
much better-known HLS.
Also based on a more general dataflow model is the Turnus [Cas+13]
flow. It builds on top of RVC-CAL, which is in turn based on the CAL actor
language [EJ03].
More specific is the SDF For Free (SDF3) [SGB06] framework, which
does much more than generating random SDF graphs. As a software
synthesis tool [SGB10], it focuses on the more restricted SDF MoC,
allowing much more sophisticated analysis of the applications. Similarly,
PREESM [Pel+14] works with parametrized extensions of SDF [Des+13] that
provide a different trade-off between expressiveness and analyzability.
On the other side of the MoC spectrum, many related flows use KPN. The
static-mapping-based flows of Distributed Operation Layer (DOL) [Thi+07],
Sesame [Erb+07] or MAPS [CLA11] use different levels of abstraction to
derive an efficient execution from a KPN-based application description.
Going beyond static mapping, the DAARM [Wei+14] flow maps dataflow
applications using a hybrid approach. Similarly, the work of [QP15] ex-
tends the Sesame approach to hybrid mappings, and Spider [Heu+14] ex-
tends the work of PREESM to hybrid mappings.
This thesis and the contributions included in it are not aimed at
proposing (yet another) software synthesis design flow. Instead, we
propose methods to improve existing flows, with the ambitious goal of
being general enough that the improvements would benefit most of the flows
discussed. Perhaps a good way to think of this is: just as these flows
help users write more efficient applications, we aim to help the flow
designers improve their flows. For Chapters 4 and 5, and partially
Chapter 3, we focus on one flow to do this. We choose the MAPS flow
[CLA11], which we describe next. Some contributions, on the other hand, go
beyond these flows. This is particularly the case in Chapters 6 and 7, and
in part Chapter 3.
2.6.1 The MAPS flow
The MPSoC Application Programming Studio (MAPS) is a software synthesis
flow developed at RWTH Aachen University and spun off into a company,
Silexica⁴, which kindly allowed us to use the KPN mapping flow of MAPS
for our research. MAPS is very comprehensive: it does much more than
KPN-based software synthesis. It has analysis algorithms to suggest
parallelization of sequential code, both as OpenMP annotations and as
CPN annotations. We will not discuss these here. It also has detailed
platform models which are used in simulation and performance estimation




















Figure 2.10: The Software Synthesis Flow from Figure 1.3. MAPS implements all
steps in the flow, which are therefore all depicted in green.
Figure 2.10 describes the MAPS flow, as an instance of the general
software synthesis flow in Figure 1.3 from the introduction. Applications
are written as KPN applications in the CPN language. While CPN supports
SDF annotations as well, these are embedded into the KPN MoC for analysis
and code generation; there is no separate DSE and code generation for
purely-SDF applications. The architecture model is an XML-based
description which has detailed models of the communication subsystem and
its topology, including different possible communication APIs [Ode+13],
different frequency and voltage domains and even models of the ISA for the
processing elements. This model influenced the definition of the SHIM
standard [The15], which then resulted in an IEEE standard [CDA20].
Performance estimation in MAPS proceeds in multiple steps. In a first
step, using a POSIX threads (pthreads) backend, the application is
emulated on the host machine to gather functional KPN traces. Since the
KPN model is deterministic, these traces are independent of the actual
performance values of the application. Then, the processes are
instrumented and executed in isolation, dividing them into what in MAPS
are called segments. These segments are defined as the execution between
any two reads from or writes to channels⁵. Using the data from the
functional KPN trace, MAPS obtains a detailed trace of the instructions
executed during each segment in the process. These detailed instruction
traces are combined with an abstract processor model from the architecture
description to estimate the performance on the target platform [Eus+14].
This yields traces with performance annotations for every process and
every PE type. Finally, these performance-annotated traces are used in
conjunction with the mapping and communication model in a discrete-event
simulator to estimate the overall performance of the mapping. If
performance-annotated traces are available from a profiling execution on
the actual hardware, these can be used instead.
4 https://www.silexica.com/
5 A special annotation can be used additionally to divide segments manually.
The DSE step in MAPS is similar to that of all the flows described at the
beginning of this section, as well as the flow of the mocasin tool, which
we will describe shortly. In this DSE step, MAPS generates a mapping. On
some platforms, when processes share a PE, MAPS can also generate a
schedule for the processes. These mappings are then used by the
Clang-based CPN compiler to generate target-specific C code, which can be
further compiled by a C compiler for the target platform. This way, the
flow generates target-specific code from an abstract KPN description of
the application (and the appropriate platform models).
As explained in the introduction, this thesis does not focus on the
performance estimation and code generation steps of this flow (cf.
Figure 1.3). We use MAPS for performance estimation and code generation.
For evaluating our methods in DSE and application, architecture and
mapping models, we primarily use the mocasin tool, which we describe in
the next section.
2.7 The mocasin tool
In this thesis we will use mocasin, an open-source⁶ tool for the
MoC-based analysis and simulation of applications [Men+21]. This tool,
formerly known as pykpn [MGC16; GMC18], has been developed as part of
a collaborative effort between multiple researchers at the Chair for
Compiler Construction at TU Dresden. While the tool itself is a joint
contribution with the coauthors of [Men+21], many concepts introduced in
this thesis have been implemented and tested using mocasin. As such, this
section will explain the tool in depth, to enable the description of the
different contributions from this thesis that are implemented in mocasin.
Figure 2.11 depicts the basic flow of mocasin, which can be understood
as a tool for rapid prototyping of prototyping tools. Multiple dataflow
MoCs are supported by mocasin, like SDF or task graphs. These models,
among others, can be seen as specializations of KPN [LP95] and will be
discussed more in depth in Chapter 6. The mocasin architecture is composed
of multiple modules that can be combined to create a specific tool (e.g.
for mapping or simulation). In the figure we show the modules that are
relevant for this thesis.
In general, simulating a KPN requires four inputs, as explained in
Section 2.5: the KPN graph, a platform description, execution traces and a
mapping. The mocasin tool has internal data structures for these four
inputs that reflect the models as explained in Sections 2.1-2.4. The tool
provides multiple readers to generate the internal data structures from
established formats like tgff [DRW98], sdf3 [SGB06] or the MAPS formats.
Instead of a concrete trace, mocasin expects a trace generator, which
can generate the trace on the fly: this is useful e.g. for non-deterministic

































Figure 2.11: Mapping and simulating KPN Applications in mocasin.
generator, for example, simply reads the trace from a file. A mapping,
while required for simulation, does not need to be provided: it can be cal-
culated in a Design-Space Exploration. This is not surprising, since a sig-
nificant part of this thesis concerns itself with improving such mapping
algorithms.
A central part of mocasin is a discrete-event simulator [MGC16] that uses
the principles outlined above to simulate KPN applications based on their
traces (as well as other models of computation). We will not dwell on the
design of the simulate module since it goes beyond the contribution and
scope of this thesis. Many contributions of this thesis are implemented
in mocasin. This is done as modules, using the mocasin toolbox
infrastructure. Different contributions of this thesis and the
corresponding references are described in the figure and marked as such
(with light-green coloring). In the following, we will describe some other
central modules of mocasin.
platform designer
Many concepts developed in this thesis are aimed at emerging technolo-
gies and future hardware architectures. To model these increasingly com-
plex architectures, we aim at an abstract description of their topologies
(cf. Section 2.3). As part of mocasin, with the help of Felix Teweleitt, we
designed a modeling infrastructure, in essence a small embedded domain-specific
language, to describe hardware topologies. This infrastructure is the
platform_designer module of mocasin.
Listing 3 shows an example of our platform_designer. The code in this





# cluster 0 with l2 cache
pd.addPeClusterForProcessor("cluster_a7", processor_0,
                            num_little)
# Add L1/L2 caches
pd.addCacheForPEs("cluster_a7", 1, 0, 8.0, float('inf'),
                  frequencyDomain=1400000000.0, name='L1_A7')
pd.addCommunicationResource("L2_A7", ["cluster_a7"], 250, 250,
                            float('inf'), float('inf'),
                            frequencyDomain=1400000000.0)
# cluster 1, with l2 cache
pd.addPeClusterForProcessor("cluster_a15", processor_1, num_big)
# Add L1/L2 caches
pd.addCacheForPEs("cluster_a15", 1, 4, 8.0, 8.0,
                  frequencyDomain=2000000000.0, name='L1_A15')
pd.addCommunicationResource("L2_A15", ["cluster_a15"], 250, 250,
                            float('inf'), float('inf'),
                            frequencyDomain=2000000000.0)






Listing 3: The Odroid-XU4 Platform with the Platform Designer
principal innovation behind the platform_designer is that it works with a
stack of clusters. The functions newElement() and finishElement() can be nested
to describe the topology in a hierarchical fashion. Between these functions,
the API allows us to describe heterogeneous cores and different levels of
interconnects with different properties, like their frequency.
mappers
The mapping problem (cf. Section 2.4) plays an important role in this the-
sis. While we do propose some mapping heuristics for special contexts,
many methods in this thesis are orthogonal to the mapping heuristics.
As part of this thesis we have implemented multiple mapping algorithms from
literature in mocasin. These can be found in the mapper module. The heuristics
included are the Group Based Mapping (GBM) heuristic [Cas+12] and a static
variant of the Linux Completely Fair Scheduler (CFS). We also have some
meta-heuristics, which include a random walk, simulated annealing [Ors+07],
tabu search [MEP08] and genetic algorithms [ECP06].
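To give an intuition for what such a meta-heuristic does, the following is a minimal sketch of simulated annealing over mappings. It is not the mocasin mapper API: the function names, the list-based mapping representation and the toy cost function (load of the busiest processing element) are assumptions made purely for illustration, and a real mapper would obtain the cost from simulation.

```python
import math
import random


def simulated_annealing(num_tasks, num_pes, cost, steps=2000, t0=1.0, seed=0):
    """Search for a low-cost mapping, given as a list: task index -> PE index."""
    rng = random.Random(seed)
    current = [rng.randrange(num_pes) for _ in range(num_tasks)]
    current_cost = cost(current)
    best, best_cost = list(current), current_cost
    for step in range(steps):
        # Linear cooling schedule; the small offset avoids division by zero.
        temperature = t0 * (1.0 - step / steps) + 1e-9
        # Neighbor: move one randomly chosen task to a random PE.
        candidate = list(current)
        candidate[rng.randrange(num_tasks)] = rng.randrange(num_pes)
        candidate_cost = cost(candidate)
        delta = candidate_cost - current_cost
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = list(current), current_cost
    return best, best_cost


def toy_cost(mapping, num_pes=4):
    """Hypothetical cost: number of unit-time tasks on the busiest PE."""
    return max(mapping.count(pe) for pe in range(num_pes))


best, best_cost = simulated_annealing(num_tasks=8, num_pes=4, cost=toy_cost)
print(best, best_cost)
```

A random walk corresponds to the same loop with every candidate accepted, while tabu search would additionally keep a list of recently visited mappings to avoid.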
configuration
mocasin is designed to be a tool for tool development. As such, one of its main
goals is to enable building different scenarios for different contexts, like
static mapping of KPN applications or hybrid execution of dynamic Long Term
Evolution (LTE) loads [Men+21]. We use the Hydra [Yad19] framework to configure
mocasin and construct different scenarios as different tools. This
configuration philosophy allows us to work in a modular fashion, which in turn
allows us to implement different contributions of this thesis as mocasin
modules.
3 B E N C H M A R K I N G
The methods we will discuss in this thesis are ultimately about improving
the performance of software. We need benchmarks to assess the perfor-
mance of software, and consequently to assess if the performance im-
proves. Benchmarks are essential for the research and development of
compilers and programming languages [HPP09], as well as hardware ar-
chitectures or runtime systems. In this context, benchmarks are gener-
ally understood to be collections of programs with particular properties.
Mostly, they cover a range of behaviors that are typical of, and important
for programs in a particular domain. This description, however, can mean
several different things. In this chapter, we formally define different types
of benchmarks and use them to classify different use cases. We then pro-
ceed to discuss concrete KPN and task-graph benchmarks for software
synthesis, as well as benchmark generation strategies both with random
graph models and machine learning.
3.1 Representative Benchmarks
Figure 3.1: An illustration of probabilities in code space (horizontal axis: possible code fragments Ω)
To formalize our argumentation, we take a statistical view of program code.
Consider a formal language that describes the set of all possible programs. For
programs of bounded size, this is a finite set $\Omega$. For example, the set
of syntactically correct C source files smaller than 1 TiB in size is certainly
bounded by $|\Omega| \le 2^{2^{43}}$, since 1 TiB is $2^{43}$ bits. Out of the
syntactically correct programs, only a fraction successfully compiles, and an
even smaller fraction executes something that makes sense semantically. Ideally,
code written by developers falls into the subset of executable programs, as an
even smaller subset. However, in this subset of correctly written code, not
every code fragment is going to be equally common. A fragment like
for(int i = 0; i < n; i++) is probably going to be seen much more frequently
than something like (*(&main+0x134))(x). It is worth noting that there is a
more nuanced discussion behind what constitutes a unit of code. At this point,
however, we can omit this discussion and consider the whole program as a unit,
for simplicity of the argumentation. Thus, there is an implicit probability
density function (pdf) $p$ on the discrete set of possible code units $\Omega$
which models the way programmers write code. This is depicted in Figure 3.1. In
reality, this is a high-dimensional space, and many challenges would arise in
defining a proper geometry in such a space. We depict the code space $\Omega$
as one-dimensional for illustrative purposes only; the same holds for the
continuity of the pdf, which we have no reason to assume.
In this statistical view of code, we can consider some precise questions:
What does it mean for a collection of programs to be a benchmark? More
precisely, what does it mean for it to be representative, or what proper-
ties would be desirable of such a collection of programs? Consider the
examples depicted in Figure 3.2. This figure depicts histograms for three
kinds of collections of programs along the (implicit) pdf described in Fig-
ure 3.1. We think of an idealized abscissa dimension, with a proper metric,
such that programs that are semantically close are close on this dimen-
sion. Obviously, a multi-dimensional formalization would be better for
this, but we stick to a single dimension for the intuition provided by the
figures. Thus, a bin in the histogram might contain a single program but











Figure 3.2: An illustration of different types of benchmarks
The first kind of collection depicted, labeled as a “representative coverage
benchmark”, has a handful of programs, each of which corresponds to a different
category in the space of probable programs. Programs that
could be written by a human, but where it is unlikely that this will hap-
pen, are not covered by this type of benchmark. Furthermore, for every
type or category of code fragments, there is only one representative ex-
ample in the set. In particular, programs that are moderately likely will
be represented just as much as programs that are extremely likely to be
written. This, in a sense, overrepresents the former and underrepresents
the latter.
The second kind depicted, labeled as “representative benchmark”, removes this
imbalance. It is similar to the “representative coverage benchmark”, but in
this kind of collection programs appear with a relative frequency that is
roughly in line with their probability of being written. A benchmark of this
kind would probably have more programs than a “representative coverage
benchmark”, without including significantly more types of programs or
behaviors.
Finally, a “fuzzing benchmark” is a collection that does the opposite of a
“representative coverage benchmark”. It covers those programs that are
possible, but unlikely to be written by a human: the corner cases.
3.1.1 Sample use cases
We argue that what kind of benchmark is most appropriate depends on the use
case. To illustrate this, we will explain two large classes of use cases that
require benchmarks. This certainly does not constitute an exhaustive
classification, but will hopefully help clarify how nuanced the choice of
benchmark is.
Testing
A very common use case for benchmarking is testing. Assume we have developed a
compiler optimization¹ and want to see how well it works. For this, we want to
find out how we can expect it to behave when someone writes a program and
applies our optimization to it. More formally, we have a property $P$ of code,
like the speedup obtained by applying our compiler optimization. We want to
calculate the expected value $E[P]$ over the implicit pdf of writing the code
we use our compiler on².
For testing, we argue that we want a representative benchmark. Ideally, we
would get a set of programs $x_1, \ldots, x_l \sim p$ i.i.d., where $p$ is the
implicit pdf of code being written³. The expected value $E[P]$ can thus be
approximated arbitrarily well with growing sample size $l$. We do this because,
in our example, we assume that the users of our compiler will also draw from
this distribution $p$, and thus $E[\text{speedup}]$ tells us what speedup the
users can expect to get out of our optimization.
If we use a “representative coverage benchmark”, we can get a skewed
result, because of the over- and underrepresentation of program types
in this kind of benchmark. Thus, if our optimization works extremely well
for a small class of programs with a moderate chance of occurring, and
not so well with the most common types of programs, our testing would
return wrong results. It would tell us that our optimization is likely to
improve our program, by overstating the weight given to the moderately common
class where it serves well. In practice, however, our optimization would be
unlikely to bring much improvement in this case, if we expect our compiler to
be used by everyone.
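The skew described above can be made concrete with a small numerical sketch. The three program categories, their probabilities and their speedups below are invented purely for illustration; the point is only to compare the sample mean over an i.i.d. sample from $p$ (a “representative benchmark”) with the mean over one program per likely category (a “representative coverage benchmark”).

```python
import random

# Hypothetical categories: (probability of being written, speedup of our optimization).
# All numbers are assumptions chosen for illustration.
categories = {
    "very common": (0.90, 1.0),
    "moderately common": (0.09, 3.0),
    "rare": (0.01, 1.0),
}


def true_expectation(cats):
    """E[P] under the implicit distribution p."""
    return sum(prob * speedup for prob, speedup in cats.values())


def representative_estimate(cats, sample_size, seed=0):
    """Mean speedup over an i.i.d. sample x_1, ..., x_l drawn from p."""
    rng = random.Random(seed)
    names = list(cats)
    weights = [cats[name][0] for name in names]
    sample = rng.choices(names, weights=weights, k=sample_size)
    return sum(cats[name][1] for name in sample) / sample_size


def coverage_estimate(cats, min_prob=0.05):
    """Mean speedup with exactly one program per sufficiently likely category."""
    covered = [speedup for prob, speedup in cats.values() if prob >= min_prob]
    return sum(covered) / len(covered)


print(true_expectation(categories))
print(representative_estimate(categories, sample_size=10000))
print(coverage_estimate(categories))
```

With these assumed numbers the expected speedup is about 1.18 and the i.i.d. estimate approaches it as the sample grows, whereas the coverage-style estimate reports 2.0, because the moderately common category receives the same weight as the very common one.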
1 A good mapping heuristic in software synthesis can be considered a compiler optimization
in this context.
2 Technically, using the compiler is a conditional clause on the probability of a piece of code
to be written by a human.
3 A compelling case can be made that in some cases it’s the “dynamic” property of the prob-




Another common use case is tuning a heuristic. Consider again a compiler
optimization as an example. In this case, however, instead of having a finished
optimization that we want to test, we are designing the optimization by tuning
a heuristic that is part of it. We want the heuristic to be tuned such that the
optimization works best (which we would assess e.g. by testing, the other use
case). Training machine learning models also falls under this category, and it
is thus likely that this use case will continue to increase in importance in
the future.
For tuning the heuristic, an argument can be made for all three kinds of
benchmarks from Figure 3.2. It depends on the heuristic. Assume we’re dealing
with a code transformation (e.g. converting Python 2 code automatically into
Python 3), which either works or doesn’t. We want to optimize the parameters of
our heuristic so that it works on as many cases as possible. In this case we
probably want a “fuzzing benchmark”, to be sure we cover the corner cases, or
better yet, a combination of a “fuzzing benchmark” and a “representative
coverage benchmark”. On the other hand, if the heuristic is something like a
transformation expected to speed up the execution, then the argument for a
“representative benchmark” is basically the same as for testing. We want it to
maximize the expected value of this speedup. An important distinction between
heuristics pertains to the way the parameters are set. Depending on how they
are updated, repeatedly seeing similar code examples might be useless or even
counter-productive, such that a “representative coverage benchmark” might be
best suited.
More important yet is the process of designing the heuristic, before it is
tuned. Usually this process is iterative, and having to look at the corner
cases is common in it, too. Arguments for all the discussed kinds of benchmarks
can thus be made in similar fashion for the process of designing a heuristic,
depending on specific goals. For our methods improving software synthesis, we
mostly want “representative coverage benchmarks”.
In [Goe+19] we systematically classified all benchmarks and their usage
in papers in the CGO and PACT conferences between 2013 and 2016. Ta-
ble 3 in that work shows the analysis of 20 research papers from the con-
ferences and years mentioned and the benchmarks used, metrics eval-
uated and classification for benchmark type. In particular, the analysis
shows that most papers aim to characterize some improvement and re-
quire what we here call a “representative coverage benchmark”. A few
papers also used benchmarks as input for training or tuning a heuristic.
3.2 KPN Benchmarks
The mocasin framework supports three input formats at the time of this
writing: tgff, MAPS and sdf3. We will discuss the first two for benchmarking
here, while the sdf3 format will be discussed in Section 3.3.
3.2.1 CPN Benchmarks
The first input format for mocasin is the MAPS format, which uses benchmarks
written in the CPN language (cf. Section 2.1). A CPN application is compiled
using the MAPS flow, which evolved into the commercial tool

Table 3.1: Summary of applications in the E3S






The benchmark suite is rather dated, being over 20 years old at the time of
this writing. Unfortunately, benchmarks are generally scarce. The methods
investigated in this thesis have more to do with trends than with actual
numbers, which is why using such a dated benchmark suite is still adequate. We
expect the relative performance of mapping algorithms on the E3S benchmarks to
be similar to that on present and future applications, since what matters is
the interplay between communication and computation costs, not the absolute
values thereof.
A significant focus of the methods we will evaluate with these bench-
marks is on the multicore architectures. For this, we use the same method
as in [Wei+14; Sch+17]. We use the architecture topology of a modern mul-
ticore, including the frequencies, as well as the memory subsystem with
its latency and bandwidth, and scale the numbers from the E3S for each
of the cores of the modern multicore. This gives a realistic scenario, albeit
not simulating a concrete instance of the architecture. In [Sch+17] the au-
thors do this to create architectures with a regular mesh structure, with
less realistic topologies like heterogeneous meshes with randomly placed
cores. Instead, we use the topologies from concretely proposed or exist-
ing systems like the HAEC [Fet+19] or the Kalray MPPA3 Coolidge [inc20]
and map the processors in these architectures to those in the benchmark
suite.
3.3 Random Benchmarks and Level Graphs
The third category of inputs for mocasin is sdf3, which uses the SDF3 framework
[SGB06]. This framework is based on TGFF, adapted to the SDF model of
computation. We will discuss SDF in more detail in Chapter 6. However, for the
purposes of benchmarking as discussed here, both SDF and task graphs can be
considered as special cases of KPN. The random graph generation of the SDF3
framework can be configured to control the types of graphs it generates: the
number of actors (processes), the degree of connectivity in the graph, the
firing rates and execution times of the actors, and whether the graph is
acyclic.
Random benchmark generation has two main advantages over using fixed
benchmarks. The first advantage is the number of benchmarks, which is virtually
unlimited with a random generation approach. The second advantage is the
control over the properties of the benchmarks. Using SDF3 we can consider
precisely what effect the properties of the graph (e.g. its size, or
connectivity) have on the algorithms, by generating benchmarks which have the
desired parameters for the independent variable we are investigating. The main
disadvantage is obvious: random benchmarks are not as realistic as actual
benchmarks. It is not clear if we will find a graph like the one generated by
SDF3 in a real-life application.
Since we have both the CPN and the E3S benchmarks, we will focus our
evaluation on those. Instead of discussing the graph generation in SDF3,
we will discuss random benchmark generation from a different type of
graph, level graphs [Goe+18]. The main difference is that for the use case
for level graphs in [Goe+18] we do not have better, realistic benchmarks
we can use instead.
The context for the benchmark generation we will discuss here is that of
micro-service-oriented architectures. Large internet companies like Facebook or
Twitter have an infrastructure that consists of multiple micro-services that
depend on each other [Mar+14]. A crucial factor for optimal performance is the
number of I/O calls these micro-services make. We will discuss the use case in
more depth in Chapter 7. In this section we will only focus on the benchmark
generation.
The micro-service-based infrastructures from large companies like
Facebook or Twitter are the intellectual property (IP) of these companies
and not in the public domain. If, for example, we want to improve a
method for optimizing I/O in Facebook’s spam-fighting service [Mar+14],
we cannot use a large representative benchmark sample from Facebook
to test against their method. Instead, we observe the general structure
of the programs in their work and devise a methodology for generating











Figure 3.4: An example of a Level Graph. Adapted from Figure 1 of [Goe+18].
Figure 3.4 shows an example of a level graph. The graph depicted is a tree
which is organized by levels, which are indexed with integer numbers. The nodes
in the graph are labeled as different kinds of nodes, namely req→io,
subfunction and compute. The graph depicted in Figure 3.4 is designed to
benchmark I/O optimization, which is why the node labels are designed
accordingly, reflecting I/O calls and other computation, as well as an
additional subfunction node that creates nested benchmarks with additional
function calls. This is also by design, to test the use case.
The idea behind level graphs is to reflect the intuition of locality in code.
This intuition is based on the observation that long-range dependencies
in code are less common than short-range ones. While programmers do
sometimes refer back to identifiers defined much earlier, it is far more
common to define values shortly before using them. We interpret this as a statistical
feature of the distribution of code as commonly written by humans (cf.
Section 3.1). Levels in level graphs are thus designed to define the proba-
bility distribution of dependencies in graphs.
There are generally two accepted models of random graphs, the Erdős-Rényi
approach [ER59] and the Gilbert approach [Gil59]. The former defines a uniform
distribution over all graphs for a given number of nodes and edges, while the
latter defines the probabilities of the edges independently. Our definition of
Level Graphs is based on the Gilbert approach, but instead of having uniform
probabilities, the probabilities are defined through the levels. Concretely, a
level graph $L = ((V, E), l)$ is a directed graph $(V, E)$ with a level
function $l : V \to \mathbb{N}$ to the natural numbers, with the property that
for all nodes $v, w \in V$ there can only be an edge $(v, w) \in E$ if the
level of $v$ is smaller than that of $w$, i.e.
$(v, w) \in E \Rightarrow l(v) < l(w)$. To generate a probability distribution
in level graphs we define the probability of the edge $(v, w) \in E$ to be as
follows:
$$p((v, w)) = \begin{cases} 0, & \text{if } l(v) \ge l(w), \\ 2^{l(v) - l(w)}, & \text{otherwise.} \end{cases}$$
The method can be generalized by choosing a different probability for the case
where $l(v) < l(w)$. The chosen value $2^{l(v) - l(w)}$ is, to an extent,
arbitrary. This probability definition ensures that dependencies are more
common locally, between levels that are close by, discouraging but not
prohibiting long-range dependencies.
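The edge probability defined above translates directly into a sampling procedure. The following is a minimal sketch under the assumption that every ordered pair of nodes is considered independently, in the spirit of the Gilbert approach; the function names and the example level assignment are hypothetical, and the generator of [Goe+18] may impose further structure (such as node labels) that is not modeled here.

```python
import random


def edge_probability(level_v, level_w):
    """Probability of the edge (v, w): 0 unless l(v) < l(w), else 2^(l(v) - l(w))."""
    if level_v >= level_w:
        return 0.0
    return 2.0 ** (level_v - level_w)


def random_level_graph(levels, seed=0):
    """Sample the edges of a level graph; `levels` maps each node to its level."""
    rng = random.Random(seed)
    edges = []
    for v in levels:
        for w in levels:
            # rng.random() lies in [0, 1), so pairs with probability 0 are never chosen.
            if rng.random() < edge_probability(levels[v], levels[w]):
                edges.append((v, w))
    return edges


# Assumed example: 12 nodes spread over 4 levels with 3 nodes each.
levels = {f"n{i}": i // 3 for i in range(12)}
edges = random_level_graph(levels)
print(len(edges), edges[:5])
```

Edges between adjacent levels are sampled with probability 1/2, edges skipping one level with probability 1/4, and so on, which is how the levels encode the locality of dependencies.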
A level graph can be used to generate code in different languages
or back-ends, expressing the same computation. In [Goe+18] we im-
plemented three back-ends for I/O optimizing frameworks, one for
Ÿauhau [Ert+18] (see Section 7.3), one for Twitter’s Muse [Kac15] and one
for Facebook’s Haxl [Mar+14]. These back-ends are also based on differ-
ent languages, namely Clojure and Haskell. The abstract nature of level
graphs allows us to generate code in different languages.
3.4 Machine Learning for Benchmarking
In the previous two sections we have discussed multiple benchmarks in
two different classes: hand-written benchmarks and randomly generated
benchmarks. We have discussed the advantages and disadvantages of
both. Hand-written benchmarks cost many person-hours to write and
maintain, and are usually very limited due to IP. Random benchmarks can
overcome the scarcity of hand-written ones at the cost of accuracy, since
they are less realistic and, accordingly, not as useful for assessing how
well a method will perform on real use-cases. There is a third approach
that sits in-between the two above, which is to use machine learning to
generate benchmarks with realistic properties. This section discusses this
approach and its limitations.
3.4.1 Generative models
Machine learning models that could generate benchmarks fall under the
general term “generative models”. There are different classes of genera-
tive models, however:
1. A model in the Fisher-Wald setting is a machine learning model
solving the problem of density estimation [Vap13]. This means finding
a pdf $p'(t, \alpha_0)$ in a set of pdfs
$\{p'(t, \alpha) \mid \alpha \in \Lambda\}$ parametrized by elements of
the parameter set $\Lambda$, such that for the risk functional
$R(\alpha) = \int -\log(p'(t, \alpha)) \, dp(t)$, the value of
$R(\alpha_0)$ is minimal over all $\alpha \in \Lambda$.
2. A conditional estimation model can again mean a solution to a





project it into one dimension for visualization by performing a principal
component analysis using all points. The figure shows the relative frequencies
as a function of the first principal component, i.e. the one with the largest
eigenvalue (by modulus). This figure thus serves to reproduce the intuition of
representativeness as illustrated in Figure 3.5. It is very clear that the
(feature) space covered by the benchmarks is larger than that covered by the
Github kernels. These, in turn, cover more of the feature space than the
generated CLGen kernels. These results are consistent with an explanation of
the results from Figure 3.7 within the formalism introduced here, that is, when
we consider benchmarks as reproducing a particular probability density and the
task we want to learn as a random variable. We believe this probabilistic model
of benchmarks has the potential to drive research forward in this direction,
and we should focus on it in future work.
A clear first conclusion from this re-thinking of the benchmarking
model is that we should also question the objective we are measuring
in Figure 3.7. The accuracy we consider is the accuracy on the established
benchmark suites. While this seems natural, the question is whether it is the
most useful objective. In a real-world scenario we will have our own codebase
and will want to get the maximal accuracy on that codebase. Good performance on
the benchmarks is only useful to us if our code is similar to that in the
benchmarks.
To evaluate this scenario, we took all 91 kernels from a concrete project, the
Freedesktop project⁴, and removed them from the Github dataset. The choice of
the project is in principle arbitrary, the important property being that it has
a moderate number of kernels to evaluate on, without significantly reducing the
Github dataset to the point we cannot use it. Using the same methods as above,
we assessed the accuracy of training with all seven⁵ benchmark suites compared
to the Github kernels, without the Freedesktop kernels, obviously.
Surprisingly, the heuristic performed significantly better when trained with
the Github kernels, at 73%, compared to the 48% obtained with the established
benchmarks. These results support the thesis that the concept of
representativeness is central to benchmarking and models like the one proposed
here should be investigated further.
3.4.3 Models of Code
One property of the generative models in CLGen is the way they represent
code. They do so by considering the (normalized) code as a stream of
characters that the model learns to predict. So far, in this thesis, we have
strongly motivated graph-based representations of code, from dataflow
graphs even to the closely-related level graphs. It is certainly not a new
insight that graphs are well-suited to represent code in its non-linearity.
Compiler construction in general is based on multiple graphs, like syntax
trees or control- and data-flow graphs (CDFGs).
Based on this observation, we investigated graph-based representa-
tions of code for machine learning. We focused specifically on compil-
ers [Bra+20]. Graph models in machine learning are an emerging field,
with Gated Graph Sequence Neural Networks (GGNNs) [Li+15] being suc-
cessful in multiple reasoning tasks. In the context of programming lan-
guage models, GGNNs and related graph models have also been very
4 https://www.freedesktop.org/
5 the benchmark suites are: AMD SDK, NPB, NVIDIA SDK, Parboil, Polybench, Rodinia, SHOC

The results in Figure 3.9 show that graph-based models achieved bet-
ter accuracy than sequence-based ones, in general terms. This is not sur-
prising, as it has been discussed that they are better at exposing the
non-linear structure of code. Also not surprising is that all models do
worse in the grouped split, when forced to generalize across benchmark
suites. However, it is worth noting that the CDFG-based representations
performed better on the random split, and the AST-based representations
performed better on the grouped split. A CDFG is at a level of abstraction
closer to the machine than an AST, which is closer to the code itself as
written by a human. In this light, it is not surprising that a CDFG-based
representation was better at learning with a more representative bench-
mark in the random split. The problem in that case is more related to
the execution of code on a CPU or a GPU, whereas on the grouped split,
for generalizing across benchmark suites, understanding the semantics
of the code is more important for predicting how it might fare.
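The difference between the two evaluation regimes can be sketched in a few lines. This is an illustration only: it assumes that the grouped split holds out whole benchmark suites, as the phrase "forced to generalize across benchmark suites" suggests, and the sample records, field names and function names are hypothetical rather than taken from [Bra+20].

```python
import random


def random_split(samples, test_fraction=0.2, seed=0):
    """Shuffle all samples; training and test sets may share benchmark suites."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def grouped_split(samples, held_out_suite):
    """Hold out one whole suite, so the model must generalize across suites."""
    train = [s for s in samples if s["suite"] != held_out_suite]
    test = [s for s in samples if s["suite"] == held_out_suite]
    return train, test


# Hypothetical dataset: five kernels from each of three suites.
samples = [
    {"suite": suite, "kernel": f"{suite}-{i}"}
    for suite in ("Rodinia", "Parboil", "SHOC")
    for i in range(5)
]

train, test = grouped_split(samples, held_out_suite="SHOC")
print(len(train), len(test))
```

In the random split a model can see kernels of every suite during training, which is why it is the easier setting; in the grouped split the test suite is entirely unseen.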
In this section we have seen how graph-based representations and the
level of abstraction are important, as well as how we should pay closer
attention to the representativeness of a benchmark. A natural question
at this stage is whether this insight can be used to improve generative
models and generate better, more representative benchmarks. For this
we also need graph-based generative models, which have received less at-
tention than GGNNs in inference. The graph generative model of [Li+18b]
works by generating sequences that construct the graph. While this allows
us to create graphs representing code, the sequential structure
of the generative sequences still poses some problems. This model is
also very generic, which makes it easy to generate invalid code graphs,
just like CLGen can generate invalid code. Expanding upon this, Alexan-
der Brauckmann managed to generate more valid code samples than
CLGen [Bra20] (up to 88%, compared to 38% for CLGen). In a related ef-
fort, Alexander Thierfelder designed a domain-specific extension to the
model of [Li+18b], aiming to generate LLVM-based graphs that are cor-
rect by construction [Thi20]. The LLVM language is complex, and we could
unfortunately not design a generative model where the graphs are cor-
rect by construction, but we could capture most of the LLVM semantics in
the model. Graph-based models of code are a promising direction for fu-
ture work, which could allow us to generate representative benchmarks,
among others [LC20].
4 M A T H E M A T I C A L S T R U C T U R E S I N M A P P I N G S
The space of mappings in the software synthesis flow we described has
a rich mathematical structure. This chapter aims to explore and expose
that structure, at least in part. We will consider two main aspects of
the mathematical structure hidden within the simple notion of a map-
ping, namely the inherent symmetry, and the degrees of similarity be-
tween mappings. We will consider how to extract this structure in a
computationally-efficient fashion, and how it can be exposed to tools that
aim to exploit it, in different representations.
This thesis focuses primarily on a view of the mapping problem cen-
tered on computation, instead of data. In many cases, with the increasing
discrepancy between execution frequency and memory access times (cf.
Figure 1.1) this view is not ideal. The problem space of data allocation is
usually more clearly structured and can be modeled better. For exam-
ple, we worked on integer linear programming (ILP)-based methods to
describe and optimize memory allocation [Ode+14; Ode+15; GCL16]. We
omit this work from this thesis for space reasons. We also omit work
on emerging memory technologies, concretely race-track memory (RTM),
where we used similar ILP-based models and other meta-heuristics like
genetic algorithms or domain-specific heuristics to optimize data place-
ment [Kha+20].
4.1 Symmetries
In this chapter we will explore the mathematical structure of symmetry
in the software synthesis process, mostly the work published in [GC15;
GSC17; Goe+17; GMC18; GNC]. The material in this section makes use of
concepts in group theory. We assume the basic concepts as seen in any
undergraduate course on group theory, with the definitions of groups, ac-
tions and orbits. A brief introduction, to the level required by this chapter,
can be found in Appendix A.1.
4.1.1 Architectures and Applications
Intuitively, when we say an object is very symmetric we usually mean it has
parts that are similar or identical, and the object looks identical (or
similar) from multiple points of view. In a symmetric face, for example, both
the left and right sides of the face are similar. A hexagonal mosaic might look
the same when seen from six different angles. Mathematically, this is commonly
modeled through transformations. A reflection along the vertical axis in a
face, or rotations of 60° in the hexagon, both leave the object (mostly)
unchanged. We can do the same for hardware architectures, even heterogeneous
ones.
For example, the Exynos 5 in the Odroid-XU4 has four identical Cortex A7™
cores, say $PE_1, \ldots, PE_4$, and four identical Cortex A15™ cores, say
$PE_5, \ldots, PE_8$. A transformation that swaps the cores $PE_1$ and $PE_2$
leaves the architecture topology unchanged, since the cores are identical. This
is depicted in Figure 4.1. On the other hand, a transformation that swaps
$PE_1$ and $PE_5$ does change the topology, since the cores are of different



























































Figure 4.1: Examples of transformations in the Odroid-XU4 architecture.
When the interconnect subsystem is more complex, this is also reflected in the
topology. Consider the NoC-based architecture depicted in the example, with
four identical cores $PE_1, \ldots, PE_4$. An analogous transformation to the
one described before, which swaps the cores $PE_1$ and $PE_2$, is not a
symmetry of this topology, as depicted in Figure 4.2. The change in the cores
changes the communication patterns. Before the transformation, sending data
from $PE_1$ to $PE_3$ needs two hops, whereas after the transformation it can
be sent within a single hop, as shown by the red













Figure 4.2: The communication topology affects symmetries in architectures.
Generally, the transformations that preserve the structure of the
architecture topology have a clear structure. If two transformations $t_1$
and $t_2$ preserve the structure of the architecture topology, then their
composition $t_1 \circ t_2$ also preserves it. Similarly, the inverse
$t_1^{-1}$ of a structure-preserving transformation also preserves the
structure. Finally, the identity transformation on the architecture
$\mathrm{id}_A$ (which does not change anything) clearly preserves the
structure. These observations together mean that these transformations
have the structure of a group with function composition $(\circ)$ as its
operation.
More precisely, the group of symmetries of the architecture is
the group of graph isomorphisms from the architecture graph $A$ to itself.
An isomorphism from an object to itself is called an automorphism. We
denote the group of automorphisms of the architecture $A$ as $\mathrm{Aut}(A)$.
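For very small topologies, this definition can be checked directly by brute force. The following sketch enumerates all permutations of the nodes and keeps those that map the edge set onto itself. The helper name `automorphisms` and the concrete graphs are our own assumptions for illustration: a ring PE1 - PE2 - PE3 - PE4 - PE1 stands in for a four-core NoC in which PE1 and PE3 are two hops apart, and a complete graph stands in for four identical cores on a shared bus. This is not how mpsym or nauty compute automorphism groups.

```python
from itertools import permutations


def automorphisms(nodes, edges):
    """Return all node permutations (as dicts) that preserve the edge set.

    Brute force over all len(nodes)! permutations, so only usable for
    very small graphs. Edges are treated as undirected.
    """
    edge_set = {frozenset(e) for e in edges}
    result = []
    for image in permutations(nodes):
        sigma = dict(zip(nodes, image))
        mapped = {frozenset((sigma[u], sigma[v])) for u, v in edges}
        if mapped == edge_set:
            result.append(sigma)
    return result


nodes = [1, 2, 3, 4]

# Assumed NoC topology: a ring, so PE1 and PE3 are two hops apart.
ring_edges = [(1, 2), (2, 3), (3, 4), (4, 1)]
ring_auts = automorphisms(nodes, ring_edges)

# Assumed bus topology: every core is directly connected to every other core.
bus_edges = [(a, b) for a in nodes for b in nodes if a < b]
bus_auts = automorphisms(nodes, bus_edges)

# The transformation that only swaps PE1 and PE2.
swap_12 = {1: 2, 2: 1, 3: 3, 4: 4}

print(len(ring_auts), swap_12 in ring_auts)
print(len(bus_auts), swap_12 in bus_auts)
```

Under these assumptions the swap of PE1 and PE2 is a symmetry of the bus topology but not of the ring, which matches the discussion of Figures 4.1 and 4.2.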
For the case of the NoC-based architecture, the automorphism group
$\mathrm{Aut}(A_{\mathrm{NoC}}) \cong D_4$ is a dihedral group on 4 points. It consists of 3 rotations,
4 reflections and the identity transformation. The Odroid architecture, on
the other hand, has $\mathrm{Aut}(A_{\mathrm{Odroid}}) \cong S_4 \times S_4$ as symmetry group. This group
with $4! \cdot 4! = 576$ transformations consists of (independent) arbitrary permutations









Figure 4.3: The topology of the Kalray MPPA3 Coolidge.
Since the Odroid architecture is heterogeneous, both clusters are dis-
tinct and there is no symmetry between them. Many complex archi-
tectures, however, do consist of multiple identical clusters. Consider
the architecture depicted in Figure 4.3. It is the MPPA3 Coolidge from
Kalray [inc20] and consists of 5 identical clusters. Each cluster has 17 cores,
16 of which are identical general-purpose cores, and the last one is a
special-purpose secure and management core.
The MPPA3 Coolidge is a hierarchically designed architecture. The five
identical clusters are conceptually at a different level than the cores in
each cluster. Designs like the HAEC [Fet+19] topology mentioned in the
introduction (cf. Figure 1.2) have even more levels of hierarchy. The
symmetries of these hierarchical architectures are reflected in the different
levels of hierarchy of the topology [GNC]. For example, the automorphism
group of the MPPA3 Coolidge is $\mathrm{Aut}(A_{\mathrm{Coolidge}}) \cong S_{16} \wr S_5$ and has
$(16!)^5 \cdot 5! \approx 4.8 \cdot 10^{68}$ symmetries.
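The sizes of such symmetry groups follow from standard counting formulas: $|S_n| = n!$, the order of a direct product is the product of the orders, and the order of a wreath product $G \wr S_k$ is $|G|^k \cdot k!$. The following sketch evaluates these formulas for the groups named above. The function names are our own and purely illustrative.

```python
from math import factorial


def order_symmetric(n):
    """Order of the symmetric group S_n."""
    return factorial(n)


def order_direct_product(*orders):
    """Order of a direct product G_1 x ... x G_n, given the orders of the factors."""
    result = 1
    for order in orders:
        result *= order
    return result


def order_wreath(order_g, k):
    """Order of the wreath product G wr S_k: k independent copies of G, permuted by S_k."""
    return order_g ** k * factorial(k)


# Two distinct clusters of four identical cores each: S_4 x S_4.
odroid = order_direct_product(order_symmetric(4), order_symmetric(4))

# Five identical clusters with 16 interchangeable cores each: S_16 wr S_5.
coolidge = order_wreath(order_symmetric(16), 5)

print(odroid)
print(f"{coolidge:.2e}")
```

The difference between the direct product and the wreath product is that, in the latter, the cores of every cluster can be permuted independently before the clusters themselves are permuted.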
So far we have discussed the symmetries of architectures. However, we
can apply the same principle to applications and their graphs. Consider
the audio filter example application from Section 2.1 (cf. Figure 2.1). The
left and right channels perform precisely the same computation on
different data. We could not, for example, just swap the fft_l and fft_r
nodes, since that would result in a different application that also swaps
the channels of the audio file. On the other hand, if we swap the whole
subgraph consisting of fft_l, filter_l and ifft_l with the equivalent
subgraph of fft_r, filter_r and ifft_r, the application remains identical.

















Figure 4.4: A symmetry transformation of the audio filter application.
Mathematically, we need to model the semantics of the application to
reflect its symmetries. For an application $K = (V_K, E_K)$ we can label the
nodes $V_K$ with unique identifiers relating them to the KPN processes that
execute them (e.g. the __PNprocess in a CPN program). Formally, thus, the
automorphism group $\mathrm{Aut}(K)$ of the labeled graph $K$ is trivial, i.e.
$\mathrm{Aut}(K) = \{\mathrm{id}\}$. We could label $K$ differently to capture the symmetry from Figure 4.4.
For example, if we use the source code of the process as label, we would
capture this symmetry. We have to be careful, however, as this can lead
to a problematic definition of symmetries.
An application might use the same code at different points, resulting
in very different behavior. For example, consider an application that re-
ceives a list of points, which it sorts before operating on it. Before return-
ing the list, it sorts them again to ensure they are sorted. Both times it
sorts the list using the quicksort algorithm, yet the second time the list
is almost always sorted or close to being sorted. Then, the execution of
the same quicksort code in the second instance behaves very differently
from the first time.
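One way such a difference can show up is in the number of comparisons the sort performs. The sketch below assumes a textbook quicksort that always picks the first element as pivot; the thesis does not specify the pivot rule, so this is only one possible instance of the effect. The function name and the input data are ours.

```python
import random


def quicksort_comparisons(values):
    """Textbook quicksort with the first element as pivot.

    Returns the sorted list and the number of pivot comparisons,
    counting one comparison per remaining element in each partition step.
    """
    if len(values) <= 1:
        return list(values), 0
    pivot, rest = values[0], values[1:]
    smaller = [v for v in rest if v < pivot]
    larger = [v for v in rest if v >= pivot]
    left, c_left = quicksort_comparisons(smaller)
    right, c_right = quicksort_comparisons(larger)
    return left + [pivot] + right, len(rest) + c_left + c_right


random.seed(0)
points = random.sample(range(10_000), 200)  # 200 distinct values in random order

# First call: unsorted input. Second call: same code, already sorted input.
sorted_points, first_pass = quicksort_comparisons(points)
_, second_pass = quicksort_comparisons(sorted_points)

print(first_pass, second_pass)
```

With this pivot rule, the second call degenerates to the quadratic worst case, although both calls execute exactly the same code.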
A difference like the one outlined above is very difficult to capture
automatically, as it requires understanding of the application to a very
high level of abstraction. We thus consider application symmetries as
manually-defined annotations. There are some conceivable ways to au-
tomatically capture and annotate such application symmetries, for exam-
ple when dealing with known data-level parallelism (DLP). In future work,
a framework as we discuss in Chapter 6, Section 6.2 could be extended to
extract application symmetries from DLP. For the rest of this thesis, how-
ever, we focus on symmetries induced from the architecture.
4.1.2 Mappings
We have seen how the architecture and applications have symmetries
in their structure. The groups $\mathrm{Aut}(A)$ and $\mathrm{Aut}(K)$ act on the architecture
$A$ and the application $K$, respectively. These actions also induce an
action on the mapping space. Let $m : K \to A$ be a mapping. Recall that a
symmetry $\sigma \in \mathrm{Aut}(A)$ of the architecture is a transformation that leaves
the structure of the architecture unchanged. Consider then the mapping
$\sigma m := k \mapsto \sigma(m(k))$. Since the structure of $A$ is unchanged, the
structure of $m$ and $\sigma m$ is also identical. All observable properties $\Theta$ of $m$
and $\sigma m$, like the execution time or energy consumption, are the same. If
they were not, it would have to be due to a structural difference in the
(sub)architectures $m(K), (\sigma m)(K) = \sigma(m(K)) \leq A$, which are isomorphic
by assumption on $\sigma$. We say that properties like the execution time
and energy consumption are invariants of the group action.
The case for $K$ is analogous. Let $\pi \in \mathrm{Aut}(K)$ be a symmetry of the
application. Then the mapping $\pi m := k \mapsto m(\pi^{-1} k)$ is equivalent to $m$. Note
that we define it with $\pi^{-1}$ instead of $\pi$ so that this defines a left action.
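Indeed, for $\pi, \tau \in \mathrm{Aut}(K)$ and every $k \in V_K$, we have
$((\pi\tau) m)(k) = m((\pi\tau)^{-1} k) = m(\tau^{-1} \pi^{-1} k) = (\tau m)(\pi^{-1} k) = (\pi(\tau m))(k).$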



















































































Figure 4.5: Group actions on mappings.
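Figure 4.5 shows an example of the action on mappings. It depicts a
mapping $m = [\mathrm{PE}_2, \mathrm{PE}_1, \mathrm{PE}_1, \mathrm{PE}_5, \mathrm{PE}_1, \mathrm{PE}_5, \mathrm{PE}_3, \mathrm{PE}_7]$ of the audio filter
application on the Odroid XU4. On the bottom right, we depict the action of
$\sigma$ on $m$ for the architecture symmetry $\sigma = (\mathrm{PE}_1, \mathrm{PE}_2)$, in cycle notation¹,
which is the same symmetry transformation depicted in Figure 4.1. This
results in the mapping $\sigma m = [\mathrm{PE}_1, \mathrm{PE}_2, \mathrm{PE}_2, \mathrm{PE}_5, \mathrm{PE}_2, \mathrm{PE}_5, \mathrm{PE}_3, \mathrm{PE}_7]$. On the
top right, we show the action of $\pi$ on $m$ for the application symmetry
$\pi = (\mathrm{fft\_l}, \mathrm{fft\_r})(\mathrm{filter\_l}, \mathrm{filter\_r})(\mathrm{ifft\_l}, \mathrm{ifft\_r})$,
which is the application symmetry depicted in Figure 4.4. This results in
the mapping $\pi m = [\mathrm{PE}_2, \mathrm{PE}_1, \mathrm{PE}_1, \mathrm{PE}_1, \mathrm{PE}_5, \mathrm{PE}_3, \mathrm{PE}_5, \mathrm{PE}_7]$.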
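The two actions from this example can be sketched in a few lines of Python. The task order (`src`, `fft_l`, `fft_r`, `filter_l`, `filter_r`, `ifft_l`, `ifft_r`, `sink`) is an assumption on our part, chosen to be consistent with the vectors above; the names `src` and `sink` as well as the function names are hypothetical. Symmetries are stored as dictionaries, and every element not mentioned in a dictionary is left fixed.

```python
# Assumed order of the tasks in the mapping vector ("src" and "sink" are hypothetical names).
TASKS = ["src", "fft_l", "fft_r", "filter_l", "filter_r", "ifft_l", "ifft_r", "sink"]


def act_architecture(sigma, mapping):
    """Architecture symmetry: (sigma m)(k) = sigma(m(k))."""
    return [sigma.get(pe, pe) for pe in mapping]


def act_application(pi, mapping, tasks=TASKS):
    """Application symmetry: (pi m)(k) = m(pi^{-1}(k)), which makes this a left action."""
    pi_inverse = {image: task for task, image in pi.items()}
    index = {task: i for i, task in enumerate(tasks)}
    return [mapping[index[pi_inverse.get(task, task)]] for task in tasks]


m = ["PE2", "PE1", "PE1", "PE5", "PE1", "PE5", "PE3", "PE7"]

# sigma = (PE1, PE2) in cycle notation.
sigma = {"PE1": "PE2", "PE2": "PE1"}

# pi = (fft_l, fft_r)(filter_l, filter_r)(ifft_l, ifft_r) in cycle notation.
pi = {
    "fft_l": "fft_r", "fft_r": "fft_l",
    "filter_l": "filter_r", "filter_r": "filter_l",
    "ifft_l": "ifft_r", "ifft_r": "ifft_l",
}

sigma_m = act_architecture(sigma, m)
pi_m = act_application(pi, m)
print(sigma_m)
print(pi_m)
```

The architecture symmetry renames the values of the vector, whereas the application symmetry permutes its positions.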
From the underlying problem formulation, as defined in Chapter 2,
mappings under these symmetries are necessarily indistinguishable from
each other, since they rely on inherent symmetries of the models. On
the other hand, these are just models; they need not reflect reality. This
leaves open the question of how this holds up in practice: are equivalent
mappings actually equivalent? In [Goe+17] we tested this empirically,
by executing four equivalent mappings and measuring the runtime and
1 See Appendix A.1 for an explanation

here for formalizing the symmetries of applications, architectures and
mappings. An overview of the methods of computational group theory
can be found in [Hol05; Ser03], both of which go far beyond the basics
presented in this subsection. Here we will present only the methods
necessary for the calculations required for applications to software synthesis.
To leverage the symmetries explained in this thesis, we need methods
to calculate the following:
1. Given an architecture graph $A$, calculate (generators for) the group
of symmetries $\mathrm{Aut}(A)$.
2. Given a mapping $m : T \to A$ and the symmetry group $G := \mathrm{Aut}(T \to A)$,
enumerate the orbit $Gm$.
3. Given two mappings $m, m' : T \to A$ and the symmetry group
$G := \mathrm{Aut}(T \to A)$, determine whether $m = gm'$ for some $g \in G$, i.e. whether
the two mappings are in the same orbit.
Mature software exists for computational group theory that can, in
principle, solve these problems. The GAP system is a Domain-Specific
Language (DSL) for computational discrete algebra with a focus on
(computational) group theory [GAP20]. We developed algorithms for dealing with
problems 1-3 in GAP [GSC17]. We also included naive versions of most
algorithms implemented directly in Python in mocasin.
Using GAP-based algorithms in software synthesis tools like MAPS in
practice, however, comes with a series of complications. The largest problem
is that it adds a dependency on a whole ecosystem. A complete distribution
of GAP is over 200 MiB in size and takes around a second to start up
on today's standard commodity hardware. Additionally, to communicate
with a running GAP instance we need to use OS pipes, which is cumbersome
and not portable. We thus developed a standalone library [GNC;
Nic20], mpsym, which implements the required algorithms to solve
problems 1-3 and includes a domain-specific extension for efficiently dealing
with hierarchical (e.g. clustered) architectures [GNC].
Calculating the group of symmetries from an architecture graph
(Problem 1) is closely related to the graph isomorphism problem. This is a
problem in NP that is neither known nor believed to be in P or to be NP-complete.
In December 2015, László Babai published a pre-print in which he claims to
have found a quasi-polynomial algorithm [Bab16], yet at the time of this
writing (January 2021) the peer review is still not complete. Regardless of
the worst-case complexity, graph isomorphism can be solved efficiently
in practice for most instances [MP14]. Algorithms for doing so are
implemented in nauty/Traces, which mocasin and mpsym use to solve Problem 1.
Virtually all MPSoCs have topologies that follow a well-defined set of
design principles, like using NoCs, hierarchical clusters or groups of identical
PEs. This is also the idea behind the platform designer module in mocasin.
In [GNC] we showed that we can leverage this to construct the
automorphism group of the architecture. In particular, the automorphism groups
of hierarchical architectures are the wreath product of symmetries of the
clusters. We used a specialized algorithm that leverages a wreath-product
decomposition, originally applied in model checking [DM09]. Table 4.1
shows the domain-specific approach to finding architecture symmetries
in hierarchical designs, as described in [GNC].
Most algorithms in computational group theory use a special data
structure describing the group. This data structure is called a base and strong
generating set (BSGS), see [Hol05; Ser03] for more details. The standard
algorithm for calculating the BSGS for a group is the Schreier-Sims algorithm.
Multiple variants of this algorithm exist, which are more efficient under
different circumstances. Computer algebra systems (CAS) like GAP use
different variants with sophisticated heuristics for selecting which variant to
use. In mpsym we implement some variants of the Schreier-Sims algorithm
with a less sophisticated selection heuristic, which does not surpass GAP's
performance. For all groups investigated in this thesis, however, mpsym
was comparable to GAP, without the large ecosystem dependency [Nic20;
GNC].

Table 4.1: Correspondence of architecture and group-theoretic constructions.
Adapted from Table 1 in [GNC].
| Hardware Architecture | Group Theory |
|---|---|
| Bus-based connection ($n$ identical elements/clusters) | Symmetric group $S_n$ |
| Distinct elements/clusters | Direct product $G_1 \times \ldots \times G_n$ |
| NoC connection with topology graph $\Gamma$ (identical elements) | Automorphism group of $\Gamma$, $\mathrm{Aut}(\Gamma)$ |
| Hierarchical composition | Wreath product $G \wr H$ |
Problem 2 is a standard problem in computational group theory. We
solve it using the Orbit algorithm, which can easily be adapted to a lazy
variant, described in Algorithm 1. If we use a perfect hash, the algorithm
returns exactly the orbit of the mapping. If the hash can have collisions,
a smaller set might be returned, but the algorithm will never
yield elements from outside the orbit. This lazy variant is especially useful
when looking for any mapping in the orbit which fulfills some properties,
instead of being interested in the full orbit. This is the case in the
TETRiS system, which we will describe in Section 5.5.
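Algorithm 1 A lazy variant of the standard orbit algorithm
input: A generating set $X = (g_1, \ldots, g_n)$, $\langle g_1, \ldots, g_n \rangle = \mathrm{Aut}(M)$ for the
mapping space, a mapping $m_0$.
output: The orbit of $m_0$: $\mathrm{Aut}(M) m_0 = \{g m_0 \mid g \in \mathrm{Aut}(M)\}$
1: $H \leftarrow \{\mathrm{Hash}(m_0)\}$
2: $\mathrm{CurElems} \leftarrow \{g_i m_0 \mid \mathrm{Hash}(g_i m_0) \notin H,\ i = 1, \ldots, n\}$
3: $H \leftarrow H \cup \{\mathrm{Hash}(m) \mid m \in \mathrm{CurElems}\}$
4: while $\mathrm{CurElems} \neq \emptyset$ do
5:   for $m \in \mathrm{CurElems}$ do
6:     yield $m$
7:   $\mathrm{CurElems} \leftarrow \{g_i m \mid \mathrm{Hash}(g_i m) \notin H,\ m \in \mathrm{CurElems},\ i = 1, \ldots, n\}$
8:   $H \leftarrow H \cup \{\mathrm{Hash}(m) \mid m \in \mathrm{CurElems}\}$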
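A lazy orbit enumeration in the spirit of Algorithm 1 maps naturally to a Python generator. The sketch below is our own minimal version, not the code in mocasin or mpsym. It assumes that generators are given as functions on mappings and that mappings are tuples of PE indices. Unlike the pseudocode, it also yields the start element $m_0$ itself and yields every new element as soon as it is discovered; the `key` argument plays the role of Hash.

```python
def lazy_orbit(generators, m0, key=tuple):
    """Lazily yield the orbit of m0 under the group generated by `generators`.

    `generators` is a list of functions mapping -> mapping. `key` must be
    injective (a perfect hash) for the result to be the complete orbit.
    """
    seen = {key(m0)}
    yield m0  # the start element belongs to its own orbit
    frontier = [m0]
    while frontier:
        next_frontier = []
        for m in frontier:
            for g in generators:
                gm = g(m)
                h = key(gm)
                if h not in seen:
                    seen.add(h)
                    next_frontier.append(gm)
                    yield gm
        frontier = next_frontier


def perm_action(perm):
    """Turn a permutation of PE indices (perm[i] is the image of i) into an action on mappings."""
    return lambda m: tuple(perm[pe] for pe in m)


# Assumed example: four interchangeable PEs 0..3, i.e. the group S_4,
# generated by a 4-cycle and a transposition.
rot = perm_action((1, 2, 3, 0))
swap = perm_action((1, 0, 2, 3))

# Full orbit of a mapping of three tasks.
orbit = list(lazy_orbit([rot, swap], (0, 1, 1)))
print(len(orbit))

# Lazy use: stop at the first orbit element that avoids PE 0.
first_without_pe0 = next(m for m in lazy_orbit([rot, swap], (0, 1, 1)) if 0 not in m)
print(first_without_pe0)
```

The second usage shows the benefit of laziness: the search stops as soon as a suitable mapping is found, without enumerating the rest of the orbit.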
Finally, to solve Problem 3, we could simply solve Problem 2 for both
elements and see if the orbits are identical. Orbits form a partition of the
mapping space $M$, meaning that two orbits are either identical or disjoint
(and the union of all orbits yields $M$). This is a very inefficient way of
solving Problem 3, since it means we have to enumerate the whole orbit for
each element. Using the fact that the orbits partition the mapping space, we can
also just enumerate the orbit of one element and see if the other element
is in it. While this is an improvement, it is still very inefficient.
Orbits can be very large when the problem has much symmetry. By
default, mpsym uses this variant as a fall-back method to solve Problem 3
when correctness needs to be guaranteed.
Another alternative, which works without enumerating any orbits, is
based on the fact that the symmetries of the mapping space form a subgroup
$\mathrm{Aut}(M) \leq \mathrm{Sym}(M)$ of the symmetric group on $M$ (i.e. the group of all
permutations on $M$). Thus, if two mappings $m, m'$ are in the same orbit under
$\mathrm{Aut}(M)$, then they are also in the same orbit under $\mathrm{Sym}(M)$: there exists a
permutation $\sigma \in S_{|M|}$ which takes $m$ to $m'$. However, $|M|$, as we have seen,
can be very large, as it grows (at least) as $|V_A|^{|V_K|}$, and $\mathrm{Sym}(M) \cong S_{|M|}$ is
thus unimaginably large, namely $|\mathrm{Sym}(M)| = |M|! \geq (|V_A|^{|V_K|})!$. If we
consider only architecture symmetries, this all works over the much smaller
$\mathrm{Aut}(A)$. We do not have to iterate over the group to construct $\sigma$:
since we know both $m$ and $m'$, we can construct it directly. Knowing $\sigma$, we can
efficiently solve the group membership problem [Ser03] for these
permutation groups, using the BSGS data structure. Namely, $\sigma$ is in
$\mathrm{Aut}(M)$ if and only if $\mathrm{Aut}(M)m = \mathrm{Aut}(M)m'$, by the definitions of the orbit
and $\sigma$. On the other hand, if we cannot construct $\sigma$ from $m, m'$ because it
leads to contradictions, then the orbits are different.
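For architecture symmetries, the construction of $\sigma$ from two mappings can be sketched as follows. The code reads off the required images PE by PE and reports a contradiction if one PE would need two different images, or two PEs the same image. The membership test is a deliberately naive stand-in: it enumerates the whole group from its generators instead of using a BSGS, which is only feasible for tiny groups. All function names and the ring example are our own assumptions.

```python
def induced_partial_map(m, m_prime):
    """Read off sigma with sigma(m[i]) == m_prime[i] for all tasks i.

    Returns None if these requirements contradict each other.
    """
    sigma = {}
    for a, b in zip(m, m_prime):
        if sigma.setdefault(a, b) != b:
            return None  # one PE would need two different images
    if len(set(sigma.values())) != len(sigma):
        return None  # two PEs would need the same image
    return sigma


def generate_group(generators):
    """Naive closure of permutations given as tuples (g[i] is the image of i)."""
    n = len(generators[0])
    identity = tuple(range(n))
    group = {identity}
    frontier = [identity]
    while frontier:
        new_elements = []
        for g in frontier:
            for s in generators:
                h = tuple(s[g[i]] for i in range(n))  # apply g first, then s
                if h not in group:
                    group.add(h)
                    new_elements.append(h)
        frontier = new_elements
    return group


def same_orbit(m, m_prime, generators):
    """Check whether some group element extends the map induced by (m, m_prime)."""
    sigma = induced_partial_map(m, m_prime)
    if sigma is None:
        return False
    group = generate_group(generators)
    return any(all(g[a] == b for a, b in sigma.items()) for g in group)


# Assumed example: ring of four PEs 0 - 1 - 2 - 3 - 0 with symmetry group D_4,
# generated by a rotation and a reflection.
d4_generators = [(1, 2, 3, 0), (0, 3, 2, 1)]

print(same_orbit((0, 1), (1, 2), d4_generators))  # neighbours to neighbours
print(same_orbit((0, 1), (0, 2), d4_generators))  # neighbours to opposite PEs
print(same_orbit((0, 0), (0, 1), d4_generators))  # contradictory requirements
```

In a real implementation, the call to `generate_group` would be replaced by a membership test on the BSGS, which avoids enumerating the group.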
There is an alternative variant of this which also allows us to select a
mapping to work with, e.g. for DSE. It is based on canonical
representatives [GSC17; GMC18]. A canonical representative of an orbit $Gm$ is an
element $m_0 \in Gm$ such that there is a function $f : M/G \to M$ which maps $Gm$
to $m_0$. In other words, the function $f$ selects a unique element of every
orbit; this element is the canonical representative.
For constructing our canonical representatives, we order mappings
using the lexicographical ordering. For two mappings $m = (m_1, m_2, \ldots, m_k)$
and $m' = (m'_1, \ldots, m'_k)$ we say that $m \leq m'$ if and only if $m = m'$ or there
exists a $j \leq k$ such that $m_i = m'_i$ for all $i < j$ and $m_j < m'_j$. The function $f$
for the canonical element of the orbit thus maps $Gm$ to $\min Gm$. In other
words, we choose canonical elements to be the lexicographically minimal
elements of the orbits.
Algorithm 2 Local search for finding canonical representatives. Adapted
from Algorithm 1 of [GMC18].
input: A mapping $m$, a generating set $S$, with $\langle S \rangle = \mathrm{Aut}(M)$.
output: A mapping $m_{\mathrm{canonical}} = gm$ with $m_{\mathrm{canonical}} \leq m'$ for all $m' \in Gm$
1: $F \leftarrow \{m\}$
2: $F_{\mathrm{old}} \leftarrow \emptyset$
3: while $F \neq F_{\mathrm{old}}$ do
4:   $F_{\mathrm{old}} \leftarrow F$
5:   for all $s \in S$ do
6:     for all $m' \in F$ do
7:       if $s m' < m'$ then
8:         $F \leftarrow F \cup \{s m'\}$




To find the lex-minimal canonical representatives we use a local-search
algorithm based on an iteration similar to the Orbit algorithm. Algorithm 2
shows this local-search heuristic. This algorithm returns the lex-minimal
element of the orbit if the generating set has a particular property, which
we called being a strictly order-preserving generating set [GMC18]. We say
$S$ is a strictly order-preserving generating set if for two mappings $m' < m$
in the same orbit, i.e. $m \in \langle S \rangle m'$, there exists a word $s_1, \ldots, s_n$ in the
generators $s_i \in S$, such that $m' = s_1 \cdots s_n m$ with
$s_i (s_{i+1} \cdots s_n) m < (s_{i+1} \cdots s_n) m$ for all $i = 1, \ldots, n - 1$. For example, for the
symmetric group $S_n$, the set of all transpositions
$S = \{(i, j) \mid i \neq j \in \{1, \ldots, n\}\}$ is such a strictly order-preserving
generating set. Without this property, the local search could
yield an element which is not the lex-minimal element. If we remove the
optional reduction in Line 9, we significantly speed up this search, but
also increase the probability of finding only a local minimum instead of the
global one. Since all mappings in the orbit are equivalent, such a local
minimum will always have the same objective properties $\Theta$ as the real
canonical representative (cf. Section 2.4). Thus, finding a local minimum instead
of the canonical representative is tantamount to considering a smaller
group of symmetries, and thus a very acceptable risk for a considerable
speed-up. Both mpsym and mocasin implement this heuristic and use it by
default for design-space exploration, as we will see in Section 5.3. We also
integrated mpsym into mocasin, using the simple Python versions of the
algorithms in mocasin only as fall-back.
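The idea behind the local search can be illustrated with a simplified greedy descent. Note that this is not Algorithm 2 itself: instead of maintaining a set of candidates, the sketch keeps a single current mapping and replaces it whenever a generator produces a lexicographically smaller one. We assume mappings are tuples of PE indices and use all transpositions of four interchangeable PEs as generating set; the function names are ours.

```python
from itertools import combinations, permutations


def transposition(i, j):
    """Action of the transposition (i, j) of PE indices on a mapping."""
    return lambda m: tuple(j if pe == i else i if pe == j else pe for pe in m)


def local_search_canonical(m, generators):
    """Greedy descent: apply generators as long as they yield a smaller mapping.

    Python compares tuples lexicographically, which matches the ordering
    used for canonical representatives.
    """
    current = tuple(m)
    improved = True
    while improved:
        improved = False
        for s in generators:
            candidate = s(current)
            if candidate < current:
                current = candidate
                improved = True
    return current


# Assumed example: four interchangeable PEs, all transpositions as generators.
generators = [transposition(i, j) for i, j in combinations(range(4), 2)]
m = (2, 1, 1, 3)
canonical = local_search_canonical(m, generators)

# Reference value by brute force over all 4! permutations of the PEs.
brute_force = min(tuple(p[pe] for pe in m) for p in permutations(range(4)))

print(canonical, brute_force)
```

For this example the greedy descent reaches the same element as the brute-force minimum over the whole orbit. With a generating set that is not strictly order-preserving, the descent may stop in a local minimum.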
4.1.4 Partial Symmetries
The symmetries we have considered so far can be considered as “global”
symmetries: they are transformations of the complete structure (e.g. ar-
chitecture, mapping). The intuitive notion of symmetry, however, is more
general than this. What we consider as symmetry also includes the rela-
tionship of a structure to its parts. In particular, a symmetry can be local
to a part of the structure, without being global. A general discussion of
this can be found in [Law98], as well as a detailed exposition of the math-
ematical background of this section.
We can see what we mean by local structures in the example depicted in
Figure 4.7. It shows two NoC architectures both with a regular mesh topol-
ogy. The first one is a two-by-two mesh, the second one four-by-four. We
can compare now the symmetries of both architectures intuitively, and
see how these translate to the group-theoretic sense. The four-by-four
mesh is larger, and has a sort of self-similarity: it can be thought of as
composed of four copies of the two-by-two mesh arranged in a larger








































Figure 4.7: A comparison of the two different-sized meshes and the intuitive no-
tion of their symmetries.
However, if we look at the group of automorphisms of the
corresponding architecture graphs, we get a result that defies this intuition:
both architectures have the same group of symmetries! More precisely,
their groups of automorphisms are isomorphic: both are the dihedral group
on 4 points, $D_4$. More concretely, there are only 8 possible
structure-preserving transformations acting on these two topologies, which are the
rotations of 90°, 180°, 270° and 360° = 0°, and the reflections along each of
the axes (horizontal, vertical and both diagonals). We cannot, for
example, divide the four-by-four mesh into a two-by-two mesh of two-by-two
meshes, and rotate that larger two-by-two mesh by 90° or one of the
smaller ones by 90°. These two operations both work locally, if we ignore
the rest of the structure, but do not preserve the whole structure of the
mesh, as illustrated by Figure 4.8. The figure shows how a rotation on the
bottom left 2 × 2 mesh breaks the communication structure. Highlighted
is the communication between PE1 and PE3, which changes from 2 hops

































































Figure 4.8: An example of a local symmetry that is not a global symmetry of a 4 × 4
mesh.
For the mathematical formalization of this intuitive notion of local
symmetries, we follow [Law98] in this section. There are essentially two
equivalent ways of formalizing it: inverse semigroups and ordered
groupoids. We will consider the formalization
using inverse semigroups, as it is conceptually simpler for computations,
and mathematically equally powerful. In the case of global symmetries,
there are concrete transformations of architectures and mappings, which
correspond to abstract groups. For partial symmetries, we will consider
partial transformations of mappings, which we will model as partial
permutations, and these partial permutations (transformations) have a
corresponding abstract inverse semigroup.
We start by defining partial functions and partial permutations.
Definition 4.1.1. Let $X, Y$ be sets. A partial function $f : X \to Y$ is a function
from a subset of $X$ to a subset of $Y$. We denote the domain of $f$ by $\mathrm{dom}(f)$ and
the codomain of $f$ by $\mathrm{cod}(f)$. Thus, the partial function $f : X \to Y$ is a (total)
function $f : \mathrm{dom}(f) \to \mathrm{cod}(f)$.
Definition 4.1.2. Let $X$ be a set. A partial function $f : X \to X$ from $X$ to
itself is called a partial permutation if the (total) function $f : \mathrm{dom}(f) \to$




$\begin{pmatrix} 1 & 2 & 3 & 4 \\ 5 & 6 & 7 & 8 \end{pmatrix} \qquad \begin{pmatrix} 1 & 2 & 5 & 6 \\ 1 & 5 & 2 & 6 \end{pmatrix}$
For computations [Eas+19], the three notations can be interpreted as
different data structures, each of which makes different operations more
efficient: application of the partial function (as an array look-up),
handling of sparse partial permutations (as look-ups in key-value pairs), or
efficient multiplication using cycles (as concatenation). They have different
benefits and drawbacks. For readability though, the cycle notation is the most
compact one, and the one we will use for the rest of this thesis.
Just as for groups, we can define the (left) action of a semigroup:
Definition 4.1.3. Let $S$ be a semigroup and $X$ be a set. We say that $S$ acts
on $X$ (on the left) if there is a function $\cdot : S \times X \to X$ such that
$(ab) \cdot x = a \cdot (b \cdot x)$ for all $a, b \in S$ and $x \in X$. If $S$ is a monoid with identity 1
and the function $\cdot$ satisfies the condition $1 \cdot x = x$ for all $x \in X$, we say
that the action is a monoid action.
The action of a semigroup of partial permutations on an architecture
works the same as with groups, except it does not work on the whole
architecture. Let $f$ be a partial permutation on an architecture $A$, and
$m : K \to A$ be a mapping on that architecture. If the partial permutation is
defined on all cores that $m$ maps to, i.e., $\mathrm{im}(m) \subseteq \mathrm{dom}(f)$, then we can
use the action of the semigroup of partial permutations of $A$ to define
another mapping $fm$ by $(fm)(t) = f \cdot m(t)$ for all $t$ in $K$. If $f$ is not defined on
some of the cores of $m$, i.e., $\mathrm{im}(m) \not\subseteq \mathrm{dom}(f)$, then we cannot define $fm$.
In this way, $f$ also defines a partial permutation $\hat{f}$ on the set of mappings
$M \subseteq \{m : K \to A\} =: A^K$.
Consider for example the mapping of an application with three tasks
to the 4 × 4 mesh defined by $m_1(t_1) = m_1(t_3) = \mathrm{PE}_1$ and $m_1(t_2) = \mathrm{PE}_5$,
which we can also write as the vector $m_1 = (1, 5, 1)$. Then the partial
permutation $(1, 2, 6, 5)$ from Figure 4.9 above defines the mapping
$(1, 2, 6, 5) m_1 = (2, 1, 2)$. Similarly, the action of the partial permutation
$(2, 5)(1)(6)$ yields a new mapping, $(2, 5)(1)(6) m_1 = (1, 2, 1)$. However, since
the translation $\tau = [1, 5][2, 6][3, 7][4, 8]$ is not defined on $\mathrm{PE}_5 = m_1(t_2)$, we
cannot define $\tau m_1$ as a mapping. Formally we can say that
the partial permutations $\widehat{(1, 2, 6, 5)}$ and $\widehat{(2, 5)(1)(6)}$ are defined on $m_1$, but
$\widehat{[1, 5][2, 6][3, 7][4, 8]}$ is not defined on $m_1$.
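The example above can be reproduced with partial permutations stored as key-value pairs. In this sketch we read a cycle $(a, b, c)$ as a closed cycle, and a bracket $[a, b]$ as a chain in which $a$ is mapped to $b$ while $b$ itself has no image; the latter reading is our assumption, chosen to be consistent with the statement that $\tau$ is not defined on PE5. The function names are hypothetical.

```python
def from_cycles(*cycles):
    """Partial permutation from closed cycles: (a, b, c) maps a->b, b->c, c->a."""
    f = {}
    for cycle in cycles:
        for a, b in zip(cycle, cycle[1:] + cycle[:1]):
            f[a] = b
    return f


def from_chains(*chains):
    """Partial permutation from open chains: (a, b, c) maps a->b, b->c; c has no image."""
    f = {}
    for chain in chains:
        for a, b in zip(chain, chain[1:]):
            f[a] = b
    return f


def act(f, mapping):
    """Return f m, or None if f is not defined on every PE used by the mapping."""
    if not set(mapping) <= set(f):
        return None  # im(m) is not contained in dom(f)
    return tuple(f[pe] for pe in mapping)


m1 = (1, 5, 1)

rotation = from_cycles((1, 2, 6, 5))
reflection = from_cycles((2, 5), (1,), (6,))
tau = from_chains((1, 5), (2, 6), (3, 7), (4, 8))

print(act(rotation, m1))
print(act(reflection, m1))
print(act(tau, m1))
```

Returning `None` for an undefined action is one possible design choice; raising an exception would be an equally valid alternative.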
What happens with application symmetries? As defined here, the edges
of the application $K$ are the (data) dependencies of a computation process
(or task). All dependencies have to be respected, which means that
considering partial symmetries of the application can lead to non-deterministic
or faulty behavior.
We are now ready to formally define the set of partial symmetries of
architectures and mappings, as in the case of groups. Recall that a mapping
$m : K \to A$ can be seen as a morphism of graphs from $K$ to $A$. In
particular, every mapping $m$ defines a subgraph $m(K) \leq A$. This subgraph has a
node $m(t) \in V_A$ for every PE in the architecture $A$ that is used by the mapping,
and similarly an edge $(m(t_1), m(t_2)) \in E_A$ for every communication
primitive to which a channel is mapped. It is precisely the isomorphism of these
subgraphs that defines the partial symmetries of the architecture.
Definition 4.1.4 (AutSemi). Let $A$ be an architecture graph. The set of
partial symmetries of the architecture graph, $\mathrm{AutSemi}(A)$, is the set of
partial labeled-graph isomorphisms of $A$, i.e. the partial permutations $\varphi$ of
$V_A$ which induce an isomorphism of labeled graphs between $\mathrm{dom}(\varphi)$ and
$\mathrm{cod}(\varphi)$.
As motivated above, $\mathrm{AutSemi}(A)$ acts on the set of mappings $M$, just as
the group of (total) symmetries $\mathrm{Aut}(A)$ does. This action (and the action of
the group $\mathrm{Aut}(K)$) together define an embedding $\mathrm{AutSemi}(M) \leq I(M)$ into
the inverse semigroup of partial permutations on $M$, which is how we
define $\mathrm{AutSemi}(M)$.
In inverse semigroups, not every element has an inverse, only a
pseudoinverse. Consider the identity partial permutation on the lower-left 2 × 2
NoC in the 4 × 4 mesh, $i = (1)(2)(5)(6)$. This identity partial permutation is
an idempotent, which means that $i^2 = i$, which implies that $i^{-1} = i$. Groups,
in contrast, have precisely one idempotent, the identity element. The set
of idempotents of a semigroup plays an important role in describing the
structure of the semigroup [Law98]. If we then consider the translation
$\tau$ from above, we can multiply $\tau i = [1, 5][2, 6]$, which is defined only on
two cores. If we multiply it with the pseudoinverse of $i$, $i^{-1} = i$, we get
$\tau i i^{-1} = \tau i i = \tau i \neq \tau$. There is no way we can get $\tau$ back from $\tau i$, since $\tau$ is
defined on 4 cores.
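These products are easy to verify with partial permutations stored as dictionaries. The sketch below uses our own helper names and composes from right to left, i.e. `compose(f, g)` applies `g` first, matching the left action used in this section.

```python
def compose(f, g):
    """Product f g of partial permutations: x -> f(g(x)).

    Only defined on those x in dom(g) whose image g(x) lies in dom(f).
    """
    return {x: f[y] for x, y in g.items() if y in f}


def pseudoinverse(f):
    """Pseudoinverse of a partial permutation: swap the roles of domain and codomain."""
    return {y: x for x, y in f.items()}


# The translation tau = [1, 5][2, 6][3, 7][4, 8] and the idempotent i = (1)(2)(5)(6).
tau = {1: 5, 2: 6, 3: 7, 4: 8}
i = {1: 1, 2: 2, 5: 5, 6: 6}

tau_i = compose(tau, i)
tau_i_i_inverse = compose(tau_i, pseudoinverse(i))

print(tau_i)
print(tau_i_i_inverse == tau_i, tau_i_i_inverse == tau)

# Defining property of the pseudoinverse: tau tau^{-1} tau = tau.
print(compose(compose(tau, pseudoinverse(tau)), tau) == tau)
```

The restriction to two cores introduced by multiplying with $i$ cannot be undone, which is exactly the one-way behavior described above.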
Just as with groups, we can define orbits for inverse semigroups.
However, since some multiplications cannot be undone, the
orbit of a semigroup is more complicated. Let $X$ be a set and let $S$ be a
semigroup acting on $X$. Then, for an element $x \in X$, we can think of the
orbit graph of $x$ as a directed graph $O = (V, E)$ where $V = \{sx \mid s \in S\}$. The
edges $E$ are defined by the action, namely an edge $e = (v, w)$ is added for
every $v, w$ for which there exists an $s \in S$ such that $v = sw$. This directed
graph is clearly connected, but not necessarily strongly connected. The
strongly connected components (SCCs) of this orbit graph define equivalence
classes and play the role that orbits played in group actions for our application
to software synthesis.
By definition of $\mathrm{AutSemi}(A)$, for a partial symmetry $f \in \mathrm{AutSemi}(A)$ and
a mapping $m$ we know that if the mapping $fm$ is defined, then the two
subgraphs $(fm)(K) \cong m(K) \leq A$ are isomorphic. We also get an explicit
isomorphism between the two subgraphs by Lemma 4.1.5.
Lemma 4.1.5. Let $m : K \to A$ be a mapping and let $f \in \mathrm{AutSemi}(A)$ be
a partial automorphism of the architecture such that $\mathrm{im}(m) \subseteq \mathrm{dom}(f)$.
Then, the two graphs $m(K)$ and $(fm)(K)$ are isomorphic and the function
$\varphi : m(K) \to (fm)(K)$, $m(t) \mapsto f \cdot m(t)$ for all $t \in K$ is an isomorphism of
labeled graphs.
Proof. First note that $\varphi$ is well-defined. Indeed, since $\mathrm{im}(m) \subseteq \mathrm{dom}(f)$ it
means that $f \cdot m(t)$ is defined for all $t \in V_K$. Since $f \in \mathrm{AutSemi}(A)$, we
know that the type of $m(t)$ and $f \cdot m(t)$ is equal for all $t \in V_K$, as well as the
type of all edges $(m(t_1), m(t_2))$ and $(f \cdot m(t_1), f \cdot m(t_2))$. Thus, $\varphi$ is a
morphism of labeled graphs. Finally, since $f \in \mathrm{AutSemi}(A)$, we know that
$f$ is a partial permutation, and in particular, a bijection between $\mathrm{dom}(f)$
and $\mathrm{cod}(f)$. In particular, $\varphi$ is bijective, and as a bijective morphism of
labeled graphs, an isomorphism.
What about the converse: if the subgraphs generated by the mappings
are isomorphic, does this mean that there is a (partial) isomorphism of
the mappings too? Can we use this to characterize equivalent mappings?
In general, no. Consider the mappings $m_2 := (5, 5, 1)$
and $m_3 := (5, 1, 1)$. Both these mappings project into isomorphic
subgraphs $m_2(K) \cong m_3(K) \cong m_1(K)$, but obviously the mappings are not
equivalent. Even though the subgraphs are isomorphic, the crucial difference
is that the function $\varphi$ as defined in Lemma 4.1.5 is not an
isomorphism of (labeled) graphs. What if tasks $t_1$ and $t_2$ are equivalent? In
other words, what if $g = (1, 2)(3)$ is a (full) automorphism of the
application graph? Then, the mappings $m_1$ and $m_3$ are equivalent (via $g$), but
the function $\varphi$ of Lemma 4.1.5 is still not an isomorphism of the
subgraphs. However, we can generalize the function by applying $g$ first, as
$\varphi \circ g : m(K) \to (fm)(K)$, $m(t) \mapsto (fm \circ g)(t) = (fm)(g(t))$. This
generalization, in fact, yields a full characterization of equivalent mappings through
isomorphy of subgraphs.
Theorem 4.1.6. Let $A$ be an architecture with inverse semigroup of
automorphisms $S = \mathrm{AutSemi}(A)$ and let $K$ be an application graph with group
of automorphisms $G = \mathrm{Aut}(K)$. For mappings $m, m' : K \to A$, the following
statements are equivalent:
1. There exists a partial permutation $f \in S$ and a permutation $g \in G$,
such that $\varphi \circ g$ is an isomorphism of labeled graphs.
2. The two mappings are equivalent by symmetries in the orbit of
$S \times G$.
Proof. The implication $(1) \Rightarrow (2)$ follows directly from the definition of $\varphi$
and the action of $S \times G$. For the implication $(2) \Rightarrow (1)$, since $m$ and $m'$ are
in the same SCC of the orbit of $S \times G$, there exists an $x \in S \times G$ such that
$m = x \cdot m'$. We can use the direct product structure of $S \times G$ to decompose
$x = fg$ for $f \in S$, $g \in G$. This means that $m = fg \cdot m' = f \cdot (g \cdot m')$. Applying
Lemma 4.1.5 on $m$ and $g \cdot m'$ shows that $\varphi \circ g$ is an isomorphism.
How do partial symmetries with inverse semigroups compare to
(global) symmetries, in the sense of group theory? We can start with a
simple example, of a 2 × 2 mesh, which we will call $M_2$. The group of
symmetries of this architecture, as we have seen, is $D_4$ with $|D_4| = 8$
symmetries. What about the partial symmetries? It is easy to check that
$|\mathrm{AutSemi}(M_2)| = 45$, which are many more partial symmetries than global
ones! But in fact, comparing the size of the group and the semigroup is
misleading. We cannot compare them directly, as they deal with different
objects: functions and partial functions. For this case of $M_2$ there is a sense in
which we do not get any more symmetries by going to the partial
symmetry world. We can see it through the following argument: the group
$\mathrm{Aut}(M_2) \cong D_4$ acts canonically on the power set of $M_2$, $\mathrm{Pow}(M_2)$,
simply by acting element-wise: for $M \subseteq M_2$ and $g \in \mathrm{Aut}(M_2)$, the (canonical)
action is defined as $g \cdot M := \{g \cdot m \mid m \in M\}$. In this action,
the orbits $\mathrm{Pow}(M_2)/G$ are in obvious bijection to the SCCs of the orbit of
$M_2/\mathrm{AutSemi}(M_2)$.
We have seen how to describe partial symmetries; a natural question
is how to calculate them. This can be accomplished with the methods
of [Eas+19], and our applications of it in joint work with Sergio Siccha
and Jeronimo Castrillon [GSC17]. In fact, mpsym implements Algorithm 2
from [GSC17]. We worked with Sebastian Krammer in his bachelor
thesis [Kra17] on finding more efficient algorithms. Unfortunately, the
algorithms as implemented so far are not efficient enough to be useful in the
context of mappings and DSE.
In future work, we believe we should be able to find explicit generating
systems for an n ˆ n mesh for an arbitrary n, which would significantly
improve the performance of the algorithms, which is limited by finding
a good generating set. Using inverse semigroups also opens up an addi-
tional avenue for future work, where similarities can be described instead
of precise symmetries. For example, mapping an edge between two cores
in a mesh to a different edge type with a smaller number of hops is sure to
not worsen the performance of the application when running in isolation,
although we cannot say if it will improve it or not. Such a transformation
can also be described with semigroups, and the directed graph structure
of the orbits nicely encompasses such one-way transformations.
4.2 Metric Spaces
When considering the design space of mappings $M = A^K$ we usually
consider no quantitative relationship between mappings. For two mappings
we can say if they are identical or not, or perhaps, with the methods of
Section 4.1, if they are equivalent or not. However, we cannot describe any
further relationship: can we say that two mappings are very similar, or very
different? Can we quantify the distance between two mappings? Intuitively,
we can. This section requires some basic concepts from the mathematical
theory of (discrete) metric spaces and embeddings into real spaces.
Appendix A.2 gives an overview of the required concepts; a more thorough
exposition can be found in [Mat02], Chapter 15 in particular.
Normally, we encode mappings as vectors $m = (a_1, \ldots, a_{|V_K|})$, where
$a_i \in V_A$ is the PE where task $i$ is mapped. If we interpret these vectors as
being (real) vectors in $\mathbb{R}^{|V_K|}$, we can endow them with a vector distance,
like the Euclidean distance $d_{\text{Euclidean}}(v, w) = \sqrt{\sum_i (v_i - w_i)^2}$. This can be
generalized to the $L^p$ distance $d_p(v, w) = \left( \sum_i |v_i - w_i|^p \right)^{1/p}$, which is induced by
a norm for $p \geq 1$. For $p = 1$, this norm is also known as the Manhattan
distance, in allusion to the distance between buildings in a regular mesh
like the streets of Manhattan. We can endow the space of mappings with
a metric also by using the Hamming distance, which counts only the number
of differing entries in the vector. However, none of these metrics are

[Figure 4.10 annotations: $\text{dist} = \sqrt{(4 - 1)^2} = 3$; $\text{dist} = \sqrt{(4 - 5)^2} = 1$]
Figure 4.10: An intuitive example of distance between mappings.
Consider the example in Figure 4.10. It shows three mappings
$m_1 : t_1 \mapsto \mathrm{PE}_2, t_2 \mapsto \mathrm{PE}_1$; $m_2 : t_1 \mapsto \mathrm{PE}_2, t_2 \mapsto \mathrm{PE}_4$;
$m_3 : t_1 \mapsto \mathrm{PE}_2, t_2 \mapsto \mathrm{PE}_5$.
We would normally write these mappings as vectors, $m_1 = (2, 1)$, $m_2 =
(2, 4)$ and $m_3 = (2, 5)$. If we calculate the standard (Euclidean) distance of
these vectors, then $m_2$ is farther away from $m_1$ than from $m_3$. However,
we know that communication between $\mathrm{PE}_1$ and $\mathrm{PE}_4$ is much faster than
between $\mathrm{PE}_4$ and $\mathrm{PE}_5$. The Euclidean distance in the mapping space does
not reflect the structure of the communication subsystem.
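The numbers in this example can be checked with a few lines of Python. This is only a sketch of the naive vector distances on the index vectors (2, 1), (2, 4) and (2, 5); the helper names `euclidean` and `hamming` are illustrative.

```python
import math


def euclidean(v, w):
    """Euclidean distance between two mapping vectors of PE indices."""
    return math.sqrt(sum((vi - wi) ** 2 for vi, wi in zip(v, w)))


def hamming(v, w):
    """Number of tasks that are mapped to different PEs."""
    return sum(vi != wi for vi, wi in zip(v, w))


m1, m2, m3 = (2, 1), (2, 4), (2, 5)

print(euclidean(m1, m2), euclidean(m2, m3))  # prints: 3.0 1.0
print(hamming(m1, m2), hamming(m2, m3))      # prints: 1 1
```

Both distances depend only on how the PEs happen to be numbered or on whether entries differ at all, so neither can express that some PE pairs communicate faster than others.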
4.2.1 Architectures
In the example illustrated in Figure 4.10 we saw intuitively how mappings
can be more or less similar. This intuitive notion clearly depends on the
underlying architecture. It is the hardware architecture that determines
the cost of communicating data between processes. In order to endow
the space of mappings with a metric space structure, we should first do
so with the architecture.
We can use the intuition behind the example to define a metric that
takes latency into account this way [GMC18]. The fundamental observa-
tion here is that in a multicore architecture, communication between dif-
ferent PEs takes different amounts of time. There are multiple problems
with using the communication time between PEs directly as a distance
between PEs. Firstly, communication times depend on multiple factors:
the latency and bandwidth of the communication resources used, the
amount of data being sent, the (software) communication protocol, clock
synchronization between hardware resources like the PEs and buses, arbi-
tration or other contention issues, etc. Of course, we can model these to
various degrees. However, the distance between PEs needs to be a fixed
number and not a function of all these factors. As an approximation, how-
ever, we can use the expected latency for a package of a standardized size
(e.g. 8 bytes). As an expected value, this is a fixed number, but through its
statistical nature it can include as much complexity in the model as re-
quired2.
The second issue we run into when using communication times for
defining a distance is that, by definition, the distance between a point
and itself has to be 0, but usually a PE has to communicate with itself
using an L1 cache, scratchpad memory or similar, which has a small but
non-zero latency. In this sense, the expected communication latency be-
tween cores is not a metric space distance, but it approximates one well.
We propose thus to ignore this latency and set the distance to 0, to obtain
the mathematical metric space structure.
Finally, this metric space structure depends strongly on the unit used to
measure latency (e.g. cycles, milliseconds, etc.), as well as on the absolute
speed of the communication sub-architecture. Since the goal of exposing
this structure is to leverage it for algorithmic decisions like finding good
mappings, it is useful to have comparable distances between different
architectures. For this, we propose to normalize the metric distance function
such that the average distance between PEs is 1.
2 If communication in the architecture is asymmetric, this will not define a metric. We can
average the communication from p to q and from q to p to fix this, but we should probably
consider this case separately.
Put together, these principles yield the following definition:
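Definition 4.2.1 (Architecture Metric Space). Let $A = (P, E)$ be an
architecture graph and $\mathrm{lat} : P \times P \to \mathbb{R}$ be the expected latency between PEs. Then
we set
$$d_A : P \times P \to \mathbb{R}, \quad (p, q) \mapsto \begin{cases} \mathrm{lat}(p, q), & \text{if } p \neq q \\ 0, & \text{otherwise} \end{cases} \tag{4.1}$$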
Remark 4.2.2. For an architecture graph $A = (P, E)$, the tuple $(P, d_A)$ is a
metric space.
Proof. Obviously $d_A(p, p) = 0$ for all $p \in P$, by definition, and $d_A(p, q) > 0$
for $p \neq q$ since the expected latency between PEs is always greater than
0. For $p, q, r \in P$ we have $d_A(p, q) + d_A(q, r) \geq d_A(p, r)$ since the expected
latency of moving data from $p$ to $q$ and then to $r$ will always be at least as
much as moving it from $p$ to $r$ directly.
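A minimal sketch of how such an architecture metric could be computed from a table of expected latencies follows. The latency values, the function names and the choice to average over ordered pairs of distinct PEs are assumptions made for illustration; the symmetrization step is the averaging suggested for asymmetric communication.

```python
from itertools import product


def architecture_metric(lat):
    """Build a normalized distance matrix from expected latencies lat[p][q]."""
    n = len(lat)
    d = [[0.0] * n for _ in range(n)]
    for p, q in product(range(n), repeat=2):
        if p != q:
            # Average both directions so that the result is symmetric.
            d[p][q] = (lat[p][q] + lat[q][p]) / 2
    # Normalize so that the mean distance between distinct PEs is 1.
    mean = sum(map(sum, d)) / (n * (n - 1))
    return [[x / mean for x in row] for row in d]


def is_metric(d, tol=1e-9):
    """Check identity, positivity, symmetry and the triangle inequality."""
    n = len(d)
    for p, q in product(range(n), repeat=2):
        if p == q and d[p][q] != 0:
            return False
        if p != q and d[p][q] <= 0:
            return False
        if abs(d[p][q] - d[q][p]) > tol:
            return False
    return all(d[p][q] + d[q][r] >= d[p][r] - tol
               for p, q, r in product(range(n), repeat=3))


# Hypothetical latencies (in cycles) for two clusters with two PEs each.
# The diagonal holds the small non-zero self-latency that the metric ignores.
lat = [[2, 10, 40, 40],
       [10, 2, 40, 40],
       [40, 40, 2, 10],
       [40, 40, 10, 2]]
d_arch = architecture_metric(lat)
print(is_metric(d_arch))  # prints: True
```

Note that the triangle inequality is checked here rather than assumed: for measured latencies it can fail, in which case the table would not define a metric without further adjustment.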
In this way we endow M with a discrete metric space structure, with a
metric that reflects the memory subsystem of the architecture, or more
generally, its communication. While this allows for a simple and powerful
mathematical definition, a metric space structure can be inefficient for
calculations. To cope with this, we will also discuss low-distortion embed-
dings and show how we can find them for the metric spaces introduced.
Appendix A.2 reviews the basic notions of metric spaces, as well as more
advanced concepts needed to introduce and find the more computation-
ally efficient low-distortion embeddings.
Unfortunately, this metric also has some issues. In particular, it does
not distinguish between core types on heterogeneous systems. To fix this,
we propose an alternative metric space structure on $M$, by adding extra
dimensions for the communication and the computation. This is
fundamentally very similar to adding channels in the mapping vectors. We thus
define a metric on the channels, based on the metric defined by
Definition 4.2.1. The distance between two channels $c_1, c_2 \in E_A$ is defined as
$|\mathrm{lat}(c_1) - \mathrm{lat}(c_2)|$ for the communication channels between the cores. We
then apply a similar concept for the cores, and take relative values of the
expected runtime. Disregarding the ISA or micro-architecture, we can use
the frequencies as a first estimation, which is what we do here. Obviously
the frequency is not the best estimation of the expected differences in
execution times between PEs, but we restrict our consideration to this for the
scope of this thesis. Future work should focus on finding better metrics
for the mapping space.
This definition will not produce a metric, since distinct cores which are
equivalent will have a distance of 0, and similarly equivalent channels. To
deal with this, we add a minimal distance between the cores (e.g. 0.1 times
the distance between the next two core types).
Application distances
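To go from $A$ to $M$, we can use the same principle as the $L^p$ norms and
define $d_p(m, m') = \left( \sum_{i=1}^{|V_K|} d_A(m_i, m'_i)^p \right)^{1/p}$, which can immediately be checked to be
a metric on $M$. This way we can consider, as a metric space (embedding),
the structure of $A$ to be
$$\underbrace{M \perp \ldots \perp M}_{\times |V_K|}, \text{ i.e. } \underbrace{M \times \ldots \times M}_{\times |V_K|} \text{ with } d(M_i, M_j) = \{0\} \text{ for all } i \neq j. \tag{4.2}$$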
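As a sketch of this construction, the distance between two mappings can be computed by combining the per-task architecture distances in the style of an $L^p$ norm. The distance table `D_ARCH` is hypothetical (two clusters of two PEs, unnormalized), and `mapping_distance` is an illustrative name.

```python
# Hypothetical architecture distances: PEs 0, 1 share a cluster, as do 2, 3.
D_ARCH = [[0, 1, 4, 4],
          [1, 0, 4, 4],
          [4, 4, 0, 1],
          [4, 4, 1, 0]]


def mapping_distance(m, m_prime, d_arch=D_ARCH, p=2):
    """L^p-style combination of the per-task architecture distances."""
    return sum(d_arch[a][b] ** p for a, b in zip(m, m_prime)) ** (1 / p)


# Moving the second task within the cluster versus to the other cluster.
print(mapping_distance((0, 0), (0, 1)))  # prints: 1.0
print(mapping_distance((0, 0), (0, 2)))  # prints: 4.0
```

Unlike the Euclidean distance on PE indices, this distance depends only on the architecture distances between the PEs each task is mapped to, not on how the PEs are numbered.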
There are multiple issues with this as well. A crucial problem is
that it does not consider the dependencies between tasks in the
application graph $K$, nor does it consider how multiple tasks might be more or
less relevant. Many methods can be considered to account for this fact,
like having factors for the dimensions of the copies of $M$ in the orthogonal
sum. However, we omit evaluating multiple such metrics to limit the

[Figure 4.11 annotations: dist = min_dist; dist = 2∆t_PE]
Figure 4.11: An example of a problem with the orthogonal-sum construction of the
distance metric for the mapping space.
Figure 4.11 illustrates another problem with this construction. The met-
ric does not distinguish between tasks mapped on the same core or on
different cores, something that has a large impact on the performance
of the mapping. Here we are considering the variant of the metrics with
the extra dimensions, but the problem is independent of the architecture
metric we base this on.
A particularly important property of these metric space constructions
is that they give meaning to distances in the mapping space; they make
it into a landscape. This is a high-dimensional landscape, which we cannot
visualize except in the simplest examples, like the two-task mapping
we visualized previously, in Figure 1.4. There are other ways of visualizing
this space, however. The t-SNE method [MH08] aims to group points
by their distances, making points that are close by in the mapping space
also close in the visualization, and similarly for points far apart. A
disadvantage of this method is that it does not preserve the actual values: the
coordinates of the points become meaningless. A different approach is to
use random projections onto a two-dimensional space. By the Johnson-
Lindenstrauss lemma [JL84], such a random projection will have a low
distortion with high probability (see Appendix A.2 for more details). We




might have a mapping $m_0$, for which we want to find all mappings that are
within a radius $r$ of it, i.e. compute the ball $B_r(m_0)$ with radius $r$ around $m_0$.
For this we need to iterate over all $m \in M$ and check whether $d_M(m_0, m) \leq r$,
which is intractable for all but the simplest examples.
To deal with this, we use established methods from discrete geometry
to calculate low-distortion embeddings. A mapping $\iota : M \hookrightarrow \mathbb{R}^n$ such that
there exists a $D > 0$ with
$$D^{-1} d(x, y) \leq \|\iota(x) - \iota(y)\| \leq d(x, y) \tag{4.3}$$
is called an embedding with distortion $D$ (cf. Appendix A.2). In other words,
the relative error of the distances is at most $D$. Using convex optimization
[Mat02], we can calculate a low-distortion embedding for a finite metric
space. This allows us to work with vectors of real numbers, which makes
many algorithmic tasks scalable, e.g. computing random points in a ball.
Since the size of the mapping space grows exponentially with the number
of tasks and changes for every application, computing such an embedding
for a large mapping space every time we want to do DSE would also
be intractable. We can avoid this by using the orthogonal sum construction
from Equation 4.2. Given an embedding $\iota : A \hookrightarrow \mathbb{R}^k$ with distortion $D$
for the architecture with a given metric $d_A$, we can construct an embedding
$\iota^k$ of the mapping space defined as in Equation 4.2 with distortion
$D$.
Theorem 4.2.3 (Theorem III.1 of [GMC18]). Let $\iota : (M, d) \hookrightarrow (\mathbb{R}^n, \|\cdot\|_p)$ be
an embedding with distortion $D$ and define $\iota^k : (M^k, d_p) \hookrightarrow (\mathbb{R}^{nk}, \|\cdot\|_p)$ as
$\iota^k((x_1, \ldots, x_k)) = (\iota(x_1), \ldots, \iota(x_k))$. Then $\iota^k$ is an embedding with distortion
of at most $D$.
Proof. It is clear why $\iota^k$ is an embedding (well-defined and injective), since
$\iota$ is one. The distortion follows from the homogeneity of the $\|\cdot\|_p$-norm
applied to Equation 4.3.
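The homogeneity argument can be spelled out as follows. This is a sketch under two assumptions: that $d_p$ denotes the $L^p$-style combination of the component distances, $d_p(x, y) = \left( \sum_{i=1}^{k} d(x_i, y_i)^p \right)^{1/p}$, and that the $p$-norm on $\mathbb{R}^{nk}$ decomposes block-wise into the $p$-norms on the $k$ copies of $\mathbb{R}^n$.

```latex
\begin{align*}
\|\iota^k(x) - \iota^k(y)\|_p
  &= \Big( \sum_{i=1}^{k} \|\iota(x_i) - \iota(y_i)\|_p^p \Big)^{1/p}
   \le \Big( \sum_{i=1}^{k} d(x_i, y_i)^p \Big)^{1/p}
   = d_p(x, y), \\
\|\iota^k(x) - \iota^k(y)\|_p
  &= \Big( \sum_{i=1}^{k} \|\iota(x_i) - \iota(y_i)\|_p^p \Big)^{1/p}
   \ge \Big( \sum_{i=1}^{k} \big( D^{-1} d(x_i, y_i) \big)^p \Big)^{1/p}
   = D^{-1} d_p(x, y).
\end{align*}
```

Both lines apply the two bounds of Equation 4.3 to every component and use that $t \mapsto t^p$ is monotone on non-negative numbers; the factor $D^{-1}$ can be pulled out of the sum by homogeneity.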
The mapping space can still have a high dimension, a problem usually
called the curse of dimensionality. With this construction, for the metric
without the extra dimensions, the dimension of the embedding $\iota^k$
is $k|V_A| = |V_K||V_A|$. A method to improve this is to use the Johnson-
Lindenstrauss lemma to reduce the dimension with a projection. We do
this with an iterative method, described in Algorithm 3.
Algorithm 3 Iterative dimensionality reduction via the Johnson-
Lindenstrauss lemma.
input: A discrete metric space M, a low-distortion embedding ι : M ↪ R^n
and a target distortion D.
output: An embedding with dimension ≤ n and distortion at most D.
dim ← 1
while dim ≤ n do
    for _ ∈ numIterationsPerDim do
        ι̃ ← JLReduction(ι, dim)
        D̃ ← CalculateDistortion(ι̃)
        if D̃ ≤ D then return ι̃
    dim ← 2 · dim
return ι
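The loop of Algorithm 3 can be sketched in plain Python. Several choices here are assumptions rather than the implementation used in the thesis: the reduction is a random Gaussian projection, the distortion is computed as the ratio between the largest and smallest stretch over all point pairs (which is invariant under rescaling), and the names `jl_reduction`, `distortion` and `reduce_dimension` are illustrative.

```python
import math
import random


def jl_reduction(points, dim, rng):
    """Project the points onto `dim` dimensions with a random Gaussian matrix."""
    n = len(points[0])
    proj = [[rng.gauss(0, 1) / math.sqrt(dim) for _ in range(n)]
            for _ in range(dim)]
    return [[sum(row[j] * v[j] for j in range(n)) for row in proj]
            for v in points]


def distortion(points, dist):
    """Largest stretch divided by smallest stretch over all pairs of points."""
    ratios = [math.dist(points[i], points[j]) / dist[i][j]
              for i in range(len(points)) for j in range(i + 1, len(points))]
    low, high = min(ratios), max(ratios)
    return math.inf if low == 0 else high / low


def reduce_dimension(points, dist, target, iterations_per_dim=20, seed=0):
    """Double the dimension until a random projection meets the target."""
    rng = random.Random(seed)
    n = len(points[0])
    dim = 1
    while dim <= n:
        for _ in range(iterations_per_dim):
            candidate = jl_reduction(points, dim, rng)
            if distortion(candidate, dist) <= target:
                return candidate
        dim *= 2
    return points  # fall back to the original embedding


# Toy input: four points at pairwise distance 1, embedded isometrically in R^4.
dist = [[0.0 if i == j else 1.0 for j in range(4)] for i in range(4)]
points = [[1 / math.sqrt(2) if i == j else 0.0 for j in range(4)]
          for i in range(4)]
reduced = reduce_dimension(points, dist, target=2.0)
print(len(reduced[0]), distortion(reduced, dist))
```

As in the pseudocode, the fallback returns the input embedding, so the result only meets the target distortion if the input already does.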
Algorithm 3 exponentially increases the dimension, running









5 Applications of Mathematical Structures in Mappings
In Chapter 4 we have seen how the mapping space has inherent struc-
ture and how we can describe this structure explicitly using mathematical
methods. In this chapter we will see how we can leverage this structure
to improve software synthesis flows in different ways.
5.1 Compact Mappings
In Section 4.2 we show multiple ways of endowing the mapping space with
a distance metric. A common method for defining a metric in a NoC-based
system is to count the number of hops between two processors [Sin+10;
Sch+17]. Indeed, this is the same as the L1 (Manhattan) distance on the
topology graph of the architecture. A natural idea that arises from this is
to search for compact mappings, i.e. mappings that take a (geometrically)





















Figure 5.1: Equivalent mappings of two applications, one being compact and the
other one not. Adapted from Figure 1 in [GMC19].
Figure 5.1 illustrates the idea of compact mappings. It depicts two vari-
ants for mapping the two application graphs depicted in the figure. The
particular property of these two variants is that they are equivalent from
the point of view of the distances: For any two connected nodes in any
of the application graphs, the node distance in terms of number of hops
between both nodes is identical in both mapping variants. Intuitively, how-
ever, the mappings on the right are preferable to those on the left. Does
this intuition translate to actual benefits in mappings?
We first have to translate this intuition into a formal definition. We
define the support of a mapping $m$ as the set of cores that have tasks mapped
to them, i.e. $\mathrm{supp}(m) = m(V_K) \subseteq V_A$. We can look at the size of the support
in the metric interpretation of the mapping space. Let $r_m$ be the minimal
radius $r > 0$ such that there is a ball $B_r(v_0)$ with radius $r$ for a point $v_0 \in V_A$
that contains the support of the mapping, i.e. $\mathrm{supp}(m) \subseteq B_r(v_0)$. A compact
mapping is a mapping with a small $r_m$. How small $r_m$ should be to be
considered compact depends on factors like the metric space and the
number of tasks. What we can do properly is compare $r_m$ for different
mappings to see if they are more or less compact, according to this
definition. For the examples in Figure 5.1 we can see that both mappings on
the left have a radius of $r_m = 3$ according to the $L^1$ (Manhattan) distance,
whereas those on the right both have a smaller $r_m = 2$.
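For small meshes, $r_m$ can be computed by brute force over all candidate centers. The sketch below assumes closed balls, centers restricted to mesh cores, cores addressed by zero-based (x, y) coordinates, and the Manhattan distance. The two example mappings are hypothetical and are not the ones shown in Figure 5.1.

```python
from itertools import product


def manhattan(a, b):
    """Number of hops between two cores of a regular mesh."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def compactness_radius(mapping, n):
    """Smallest r such that a ball of radius r around some core of the
    n x n mesh contains every core used by the mapping."""
    support = set(mapping.values())
    return min(max(manhattan(center, core) for core in support)
               for center in product(range(n), repeat=2))


# Two hypothetical mappings of three tasks onto a 4x4 mesh.
spread = {"t1": (0, 0), "t2": (0, 3), "t3": (3, 0)}
compact = {"t1": (0, 0), "t2": (0, 1), "t3": (1, 0)}
print(compactness_radius(spread, 4), compactness_radius(compact, 4))  # prints: 3 1
```

The brute-force search costs one pass over all cores per mapping, which is acceptable for meshes of the sizes discussed here.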
To test this idea we used a SystemC-based NoC simulator,
Noxim [Cat+15], which we modified to obtain more detailed statistics
about the simulations [GMC19]. In particular, we extracted the variance of
the package delays in the simulation. We configured Noxim to simulate
a 10 ˆ 10 mesh topology with xy routing and worm-hole switching. This
choice was made to mimic the routing of commercial platforms like
the Tile-Gx series from Mellanox Technologies [Mel15a; Mel15b], Intel’s
Xeon Phi [Tam+18] Scalable Platform [Sod+16], or academic ones like
OpenPiton [Bal+16].
If we execute the example from Figure 5.1, the non-compact example
on the left actually outperforms the compact one on the right. On closer
inspection of the figure, this is because the distances within the application
are very high. In other words, the mappings depicted are simply bad
mappings. A lot of contention within the application offsets any gains from
avoiding contention against other mappings.
However, while the motivational example is not very informative in
terms of finding good mappings to combine, it does motivate the idea of
compact mappings. We used a heuristic to find such compact mappings
in a regular mesh NoC, while also ensuring they are not as bad as those
in the example. We do this by ensuring the communication costs are low
within the application as well, using a greedy heuristic.
Algorithm 4 A greedy heuristic for low-communication mapping in NoC-
based architectures. Adapted from Algorithm 1 in [GMC19].
input: A connected application graph K = (V_K, E_K), the size of the mesh
n, a set of occupied cores X ⊆ {1, …, n} × {1, …, n} =: V_A
output: A mapping m : V_K → V_A
CurNode ← RandomFrom(V_A \ X)
v_0 ← Root(K)
mapping ← (v_0 ↦ CurNode)
X ← X ∪ {CurNode}
for e = (n_1, n_2) ∈ BreadthFirstEdgeSearch(K) do
    CurNode ← mapping(n_1)
    d ← min{d ∈ {1, …, n} : {a ∈ V_A \ X | |a − CurNode| ≤ d} ≠ ∅}
    q ← RandomFrom({a ∈ V_A \ X | |a − CurNode| ≤ d})
    mapping(n_2) ← q
    X ← X ∪ {q}
return mapping
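A runnable sketch of this greedy heuristic is given below. It makes a few assumptions beyond the pseudocode: cores are zero-based (x, y) coordinates, the application graph is a dictionary of successor lists, the root is passed in explicitly, and the breadth-first edge search only yields the edge that first discovers each task, so that no task is mapped twice. The function names are illustrative and not taken from [GMC19].

```python
import random
from collections import deque


def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def bfs_edges(graph, root):
    """Yield the edges (n1, n2) that first discover each node, in BFS order."""
    seen = {root}
    queue = deque([root])
    while queue:
        n1 = queue.popleft()
        for n2 in graph.get(n1, []):
            if n2 not in seen:
                seen.add(n2)
                queue.append(n2)
                yield n1, n2


def greedy_compact_mapping(graph, root, n, occupied=(), seed=None):
    """Map every task to a free core as close as possible to its BFS parent."""
    rng = random.Random(seed)
    cores = [(x, y) for x in range(n) for y in range(n)]
    used = set(occupied)
    free = [c for c in cores if c not in used]
    if not free:
        raise ValueError("no free core left")
    start = rng.choice(free)
    mapping = {root: start}
    used.add(start)
    for n1, n2 in bfs_edges(graph, root):
        current = mapping[n1]
        free = [c for c in cores if c not in used]
        if not free:
            raise ValueError("no free core left")
        # Smallest distance at which a free core exists, then a random pick.
        d = min(manhattan(c, current) for c in free)
        q = rng.choice([c for c in free if manhattan(c, current) <= d])
        mapping[n2] = q
        used.add(q)
    return mapping


# Hypothetical diamond-shaped application on a 4x4 mesh with one busy core.
app = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
result = greedy_compact_mapping(app, "a", 4, occupied={(0, 0)}, seed=1)
print(result)
```

Each task ends up on its own core, since every chosen core is added to the occupied set before the next edge is processed.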
The heuristic is described in Algorithm 4. We assume the application
graph is (weakly) connected. The heuristic then starts with any node in
the application such that there is a path from it to every node in the
application (Root). It then randomly assigns an unused core to this node,
subsequently iterating through the application graph in a breadth-first fashion.
In this breadth-first search, it assigns cores such that the distance from




geometry, as seen by comparing the compact mapping in Figure 5.1 with
the non-compact one in Figure 5.4 (note that the application is slightly
different).
5.2 Robust Mappings
Faulty cores are an unfortunate reality of MPSoCs. After some time, at least
one core is likely to fail. However, using hardware monitors, these faults
can be reliably detected, sometimes even before the core actually starts
failing [ZK11; Zha19]. A strategy to deal with faulty cores, when detected, is
to migrate tasks executing in that core to a different core. This way, when
the core fails, the execution can continue without the application failing.
While such a remapping strategy is ideal for preserving the functional
correctness of applications, it can have negative consequences on the per-
formance of the application. Especially for real-time applications, where
the timing performance is part of the functionality, these consequences
can be as fatal as a core failing without being detected or without
remapping. Moreover, in mixed-criticality domains, a pre-determined mapping
can be varied at runtime due to priority issues or similar unforeseen
circumstances. To deal with this, we propose to search for mappings that
are robust [Hem+17]. We say a mapping is robust when its runtime prop-
erties are unchanged by minor variations in the mapping.
The robustness of a mapping and the corresponding methods pro-
posed in this section are appropriate for soft or firm1 real-time applica-
tions, especially in mixed-criticality contexts. In this context, we say a map-
ping is feasible if its execution time is below a specified real-time deadline.
To test if a (feasible) mapping is robust, we apply perturbations. A pertur-
bation consists in taking the mapping and changing it partially, to see if it
is (still) feasible. A robust mapping should be resistant to perturbations,
as motivated by the remapping scenarios described before.
To find such robust mappings we propose [Hem+17] adapting the
bio-inspired algorithm called $L_p$-adaptation [AMS17]. This algorithm uses
the metric space structure of the mapping space (cf. Section 4.2) to navigate
it and find a design center. For a fixed probability $P$, a design center is
a feasible point $m$ in the design space, such that points in a neighborhood
of $m$ are feasible with probability at least $P$. For the context of this
discussion, we consider neighborhoods of the form $B_r(m)$, a ball with radius $r$
around the point $m$. The $L_p$-adaptation algorithm seeks to find an $m$ which
maximizes the radius $r$ such that the ball $B_r(m)$ is feasible with probability at
least $P$.
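The notion of a design center can be illustrated on a toy mapping space that is small enough to enumerate. The sketch below is not the $L_p$-adaptation algorithm; it only evaluates, by brute force, the fraction of feasible mappings inside a ball. The architecture distances, the execution-time model and the deadline are all hypothetical.

```python
from itertools import product

# Hypothetical architecture distances: PEs 0, 1 share a cluster, as do 2, 3.
D_ARCH = [[0, 1, 4, 4],
          [1, 0, 4, 4],
          [4, 4, 0, 1],
          [4, 4, 1, 0]]


def distance(m, m_prime):
    """L^1 combination of the per-task architecture distances."""
    return sum(D_ARCH[a][b] for a, b in zip(m, m_prime))


def ball(center, radius):
    """All mappings within `radius` of `center` (enumerates the whole space)."""
    pes = range(len(D_ARCH))
    return [m for m in product(pes, repeat=len(center))
            if distance(center, m) <= radius]


def execution_time(m):
    """Toy model: fixed compute time plus a communication penalty."""
    return 10 + 2 * D_ARCH[m[0]][m[1]]


def feasible_fraction(center, radius, deadline):
    """Fraction of the mappings in the ball that meet the deadline."""
    points = ball(center, radius)
    return sum(execution_time(m) <= deadline for m in points) / len(points)


print(feasible_fraction((0, 0), 1, deadline=13))  # prints: 1.0
print(feasible_fraction((0, 0), 4, deadline=13))  # prints: 0.5
```

In this toy setting the mapping (0, 0) is a design center for radius 1 at any probability, but for radius 4 only if the required probability is at most 0.5.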
Figure 5.5 again uses a visualization with the method described
in [Li+18a] to illustrate the intuition behind the design centering algorithm.
We see the mountain landscape of the mapping space, where the height is
the execution time of that mapping. The figure shows three different pos-
sible thresholds (high, medium, low). All peaks that are above the thresh-
old are colored red: these depict infeasible points. A robust mapping is a
mapping such that we could “walk” in any direction from it without reach-
ing one of the red peaks. The larger the radius where this is possible, the
more robust the mapping. This metric interpretation of the design space
allows us to use the metric-based algorithm of Lp adaptation to estimate
1 A firm real-time application is one where the computation and data is useless after missing





Section 2.4. We will see many applications of the structures defined and
analyzed in Chapter 4.
5.3.1 Heuristics and Metaheuristics
Generally in DSE we distinguish between two approaches for dealing
with these kinds of intractable problems, heuristics and meta-heuristics
(cf. Section 2.4). Recall that mapping heuristics are domain-specific al-
gorithms that exploit the specific domain-knowledge to find a solution
based on a pre-defined model of the problem, whereas meta-heuristics
rely on an iterative evaluation of the points. As we outlined above, dif-
ferent heuristics and meta-heuristics come with trade-offs between the
exploration time required to find a solution and the quality of said solu-
tion. This is certainly the case for many discrete optimization problems in
general, the mapping problem being no exception [Goe+16]. Commonly,
meta-heuristics tend to find better results provided enough time, but re-
quire accordingly more time to do so.
A particular difficulty of comparing mapping approaches and algorithms
is the different models used by different algorithms [Goe+16].
With the mocasin tool we designed a common framework that allows us to
compare mapping algorithms [Men+21]. In particular, in mocasin
we have two heuristics for mapping: the GBM heuristic [Cas+12] and a static
mapping variant [Men+21] of the CFS scheduler from Linux. Additionally,
we have implemented genetic algorithms based on and inspired by those
found in Sesame [ECP06; QP14; Goe+16], a simulated annealing [Ors+07]
mapping algorithm and a tabu search [MEP08]. We also have a simple
random walk algorithm for reference. A survey of these mapping algorithms,
among others, can be found in [Sin+13]. We implemented these
algorithms for mocasin and this thesis to have a basis for comparison from
established literature.
We first compare these mapping algorithms to establish a baseline. We
execute a random walk for 500 iterations. For the genetic algorithm
we run an evolutionary µ + λ strategy for 20 generations of population
size 10, crossover rate of 1 with probability 0.35 and mutation probability
0.5, with a tournament selection with tournament size 4. For the GBM
algorithm we set the parameters as bx_m of 1, bx_p of 0.95, by_m of 0.5 and by_p of
0.75. The simulated annealing heuristic we execute with an initial
temperature of 1 and a final temperature of 0.1, with a temperature
proportionality constant of 0.5 and a random movement starting radius of 5. Finally,
for the tabu search mapper we set a maximum of 30 iterations, each of
size 5 and with a move set size of 10 and tabu tenure of 5, and a random
candidate move update radius of 2. These parameters were not chosen
systematically (e.g. using something like Bayesian optimization or general
(hyper-)parameter optimization approaches), but through manual testing
on examples to find sensible defaults. A deliberate choice in the
parameters, however, is that the exploration times should be comparable
between the meta-heuristics, i.e. such that the iterative mappers evaluate a
similar number of mappings.
Figure 5.9 shows a comparison of the different heuristics and meta-
heuristics on the E3S benchmarks. Each of the metaheuristics that require
random data we execute 10 times and show the variation as calculated
by the unbiased estimator of the standard deviation of the multiple sam-
pled times. The execution times vary obviously depending on the differ-

growing design space and its complexity, which affects the metaheuris-
tics, while the static CFS mapper can still leverage domain-specific knowl-
edge to find fairly good mappings.
5.3.2 Leveraging Symmetries
As motivated when discussing them in Section 4.1, symmetries can be
used to improve DSE in the mapping problem. There are two distinct appli-
cations of symmetries in DSE. The first application is for speeding up meta-
heuristics (without modifying them), as shown in [GSC17], by leveraging
the equivalence of symmetric mappings in a symmetry-aware cache. The
second application is by pruning the design space as seen by the meta-
heuristic, effectively changing the meta-heuristic [GMC18; GNC].
We will first discuss the idea of a symmetry-aware cache. As discussed
before, meta-heuristics work through an iterative principle, where they
evaluate mappings and drive the search based on the results of the eval-
uation. While the evaluation is fast and light-weight by design (cf. sec-
tions 2.5 or 2.3), it still usually dominates the execution time of the ex-
ploration (cf. Figure 5.9). A defining property of the symmetries is how
simulation results are invariants of the equivalence classes of orbits (cf.
Section 4.1). This means that if we know the results of a simulation for a
mapping, we know the results for all mappings in its equivalence class. We
can leverage this by designing a symmetry-aware mapping cache, which
stores simulation results by equivalence class instead of storing them for
each mapping [GSC17]. This yields a trade-off, where computations about
the symmetry have to be executed every time a mapping is going to be
looked up in the cache or evaluated. Ideally, these calculations would
require but a fraction of the time saved on simulations.
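The idea of such a cache can be sketched without any dedicated library. The version below does not use mpsym or its Python interface; instead it enumerates a small permutation group by brute force and uses the lexicographically smallest mapping in an orbit as the cache key. The class name, the toy cost function and the example architecture (two clusters with two PEs each) are assumptions for illustration.

```python
from itertools import product


def compose(p, q):
    """Return the permutation 'p after q'; permutations are tuples i -> p[i]."""
    return tuple(p[q[i]] for i in range(len(p)))


def generate_group(generators):
    """Close a set of permutations under composition (tiny groups only)."""
    identity = tuple(range(len(generators[0])))
    group = {identity}
    frontier = [identity]
    while frontier:
        g = frontier.pop()
        for s in generators:
            h = compose(s, g)
            if h not in group:
                group.add(h)
                frontier.append(h)
    return group


class SymmetryAwareCache:
    """Stores one simulation result per equivalence class of mappings."""

    def __init__(self, generators, simulate):
        self.group = generate_group(generators)
        self.simulate = simulate
        self.results = {}
        self.simulations = 0

    def canonical(self, mapping):
        # Lexicographically smallest element of the orbit of the mapping.
        return min(tuple(g[pe] for pe in mapping) for g in self.group)

    def evaluate(self, mapping):
        key = self.canonical(mapping)
        if key not in self.results:
            self.results[key] = self.simulate(key)
            self.simulations += 1
        return self.results[key]


def toy_simulation(mapping):
    """Hypothetical cost that is invariant under the architecture symmetries."""
    a, b = mapping
    if a == b:
        return 10.0
    return 12.0 if a // 2 == b // 2 else 18.0


# PEs 0, 1 form one cluster and PEs 2, 3 the other: swap the cores within a
# cluster, or swap the two clusters as a whole.
generators = [(1, 0, 2, 3), (0, 1, 3, 2), (2, 3, 0, 1)]
cache = SymmetryAwareCache(generators, toy_simulation)
for m in product(range(4), repeat=2):
    cache.evaluate(m)
print(len(cache.group), cache.simulations)  # prints: 8 3
```

Here 16 mappings need only 3 simulations. The price is one pass over the whole group per lookup, which is exactly the trade-off described above and the reason efficient canonicalization matters for larger architectures.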
We implemented a symmetry-aware cache in mocasin which uses
mpsym and its Python interface. We used this to evaluate the method
of symmetry-aware caching on the E3S benchmarks by accelerating the
various meta-heuristics discussed in Section 5.3.1. We can also evaluate
the domain-specific methods of mpsym [GNC] by applying this method
to multiple architecture topologies. In addition to the Odroid XU4 and
MPPA3 Coolidge, which we used consistently throughout this thesis, we
also test the methods on the exploration of two additional architectures:
HAEC and a simple generic cluster. The HAEC architecture (cf. Figure 1.2) is a
PCB design with low-latency optical interconnects on layers with a regular-
mesh structure (we modeled it as a 4 ˆ 4 mesh). Multiple such layers (we
model 4) are then connected, using low-latency wireless interconnects to
communicate between adjacent layers. While in the HAEC design, each
node of the layer is an MPSoC, we model the topology by placing cores
in those nodes and considering the board as a single MPSoC. This serves
to evaluate our methods on this topology. The generic cluster architecture
we evaluate is the simplest non-trivial clustered architecture topology
possible. It consists of two identical clusters, each with two
identical cores. Each cluster shares a cache, and the two clusters can
communicate over main memory.
To manage the sheer number of experiments ($\gg 10^5$) for this evaluation
and the upcoming ones in this chapter, we slightly modified the
parameters of the meta-heuristics, reducing the overall execution time. We
reduced the number of generations of the genetic algorithm to 10, the










• Do tasks A and B execute on the same PE?
The Symmetries representation normalizes mappings
that are equivalent to a single (canonical) mapping, while still using the
vector form (cf. Section 4.1). Some examples of well-suited questions for
it are:
• Is this mapping equivalent to mapping m1?
• Do tasks A and B execute on the same PE?
The MetricSpaceEmbedding representation uses the communication
topology to define meaningful distances between PEs and by extension,
between mappings (cf. Section 4.2). This representation is well-suited to
answer questions like:
• Is this mapping very similar to mapping m1? (can give false positives,
as seen in Section 4.2)
• Is the expected latency between tasks A and B under 10µs?
The SymmetryEmbedding representation combines the Symmetries and
MetricSpaceEmbedding representations. As such, it combines both
their strengths and weaknesses as a mapping ontology. Other
representations, not necessarily based on metric spaces, could readily be added
to this language. For example, we could design a hierarchy or inclusion-
based distance with a way to define a PE hierarchy with refinements (PEs
∈ clusters ∈ chips) and similarly for hierarchical applications.
The Language
The statements in the language refer to a mapping, i.e. every mapping in
the mapping space either satisfies such a statement or it does not. Thus,
the questions motivated for the different representations above can be
combined in a single statement, like: “This mapping is very similar to m1
(distance ≤ 100) and not (this mapping is equivalent to m1) and (there exists
a PE p such that tasks A, B and C are mapped to p or (the expected latency
between tasks A and B is smaller than 10 and the expected latency between
tasks B and C is smaller than 10 and the expected latency between tasks
A and C is smaller than 15µs)).”
In this language, a special solver tries to find a solution to a statement
(i.e. a mapping) or a set of such solutions by evaluating the propositions
in the statement in a specific order. For example, if we have a proposi-
tional statement in conjunctive normal form, we can solve the different
conjuncts iteratively. Since a mapping has to satisfy each of them, the fi-
nal mapping can be found by first filtering a large portion of mappings
with the strongest conjunct, and iterating from there. In his work, Felix
Teweleitt designed a solver for mocasin which utilizes a simple heuristic
with precisely this principle to solve some queries [Tew19], but there is
potential for much more sophisticated solving methods.
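The principle of filtering with the strongest conjunct first can be sketched with plain Python predicates. This is not the solver from [Tew19] or the query language of mocasin. The predicate constructors, the latency table and the way strength is estimated (by counting satisfying mappings over the full candidate list, which only works for toy spaces) are assumptions for illustration.

```python
from itertools import product

# Hypothetical expected latencies between PEs (two clusters of two PEs).
LAT = [[0, 1, 4, 4],
       [1, 0, 4, 4],
       [4, 4, 0, 1],
       [4, 4, 1, 0]]
A, B, C = 0, 1, 2  # positions of the tasks in a mapping vector


def same_pe(*tasks):
    return lambda m: len({m[t] for t in tasks}) == 1


def latency_below(t1, t2, bound):
    return lambda m: LAT[m[t1]][m[t2]] < bound


def any_of(*preds):
    return lambda m: any(p(m) for p in preds)


def all_of(*preds):
    return lambda m: all(p(m) for p in preds)


def negate(pred):
    return lambda m: not pred(m)


def solve(conjuncts, candidates):
    """Filter the candidates conjunct by conjunct, most selective first."""
    candidates = list(candidates)
    ordered = sorted(conjuncts,
                     key=lambda c: sum(1 for m in candidates if c(m)))
    for conjunct in ordered:
        candidates = [m for m in candidates if conjunct(m)]
    return candidates


# "(A, B, C share a PE or all pairwise latencies are below 2)
#  and not (A and B share a PE)"
statement = [
    any_of(same_pe(A, B, C),
           all_of(latency_below(A, B, 2),
                  latency_below(B, C, 2),
                  latency_below(A, C, 2))),
    negate(same_pe(A, B)),
]
solutions = solve(statement, product(range(4), repeat=3))
print(len(solutions))  # prints: 8
```

In this example the first conjunct is satisfied by 16 of the 64 mappings and the second by 48, so the solver applies the first one before the second.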
We choose to extend propositions about mappings to first-order logic
so that we can have quantifiers only valid for some specific domains, like
mappings, PEs, hardware communication resources, tasks (or processes
or actors), communication channels. It is clear why and how these do-
mains are the ones we can quantify over for first-order formulas de-
scribing mappings. An additional idea would be to include physical dis-
tances (over a discrete set of distances). This can be combined with differ-
ent spatial ontologies in semantic localization for the Internet of Things
(IoT) [Web19]. This way, we could define IoT-mappings that have specific
requirements specified in our logical mapping language.
A vision of such IoT mappings could be the following example: A smart
autonomous car enters a smart parking lot. The parking lot is dark and
pretty full already, and the car is low on battery, so that it needs to find
a parking space with a suitable recharge station. To navigate in this dark
environment, the smart car needs to offload its pedestrian recognition al-
gorithm to a service in the parking lot, which it does by using an ontology-
powered service discovery [WAL19] mechanism. Since the large concrete
structure of the parking lot blocks the signal, only some very close-
by servers in the smart parking lot are suitable for offloading computa-
tion with low latency and high reliability. Spatial ontologies have to be
included in the mapping query to offload the high-performance pedes-
trian recognition in a dark environment. Furthermore, for legal reasons,
the car cannot offload some decision-critical parts of the computation to
an external device. This complex set of constraints on the IoT mapping can
be formulated in a mixed-ontology sentence, which includes a successor
of our logical mapping language with multiple representations, as well as
other IoT-ontologies like semantic localization.
Clearly this vision is very far removed from today’s reality, but it explains
the motivation for a logical language and mapping ontologies based on
the representations as discussed in this thesis.
5.5 Run-time applications: TETRiS
So far, the applications of the structures we have discussed are primarily
useful at compile- or design-time. In this section we will discuss TETRiS, a
hybrid mapping approach where the structure of mapping symmetries
is useful at run-time.
In Section 4.1 we saw how the symmetries of the mapping problem de-
fine multiple mappings to be equivalent. We expect mappings that are
equivalent to have the same runtime or energy consumption. Indeed, the
simulation results are identical for equivalent mappings. When leverag-
ing this structure for DSE, we consider only one of the multiple equiva-
lent mappings, disregarding the rest, since they yield identical results in a
simulation. The Transitive Efficient Template Run-time System (TETRiS) ap-
proach [Goe+17] leverages this property in a complementary fashion, by
selecting equivalent variants at run-time according to the current system
load. While this works for a single mapping, the strength of TETRiS lies in se-
lecting from different mappings with different properties first and using
the equivalent variants to find a multi-application schedule.
We say that a design point (mapping) $m_1$ dominates another $m_2$ if for
the objective property $\Theta$, $m_1$ is at least as good as $m_2$: $\Theta(m_1) \leq \Theta(m_2)$. Recall
that as defined in Equation 2.1, $\Theta$ is a multi-objective function and the
comparison $\Theta(m_1) \leq \Theta(m_2)$ is to be understood component-wise, i.e. for
each objective $i$, $\Theta_i(m_1) \leq \Theta_i(m_2)$. A Pareto point is a design point
(mapping) which is not dominated by any other design point. The different
mappings TETRiS chooses from are, ideally, Pareto points in the space of
properties we are interested in. Figure 5.18 illustrates this for the proper-
ties of energy, performance and resource utilization. Each of the green
points in the property space depicted to the right of the figure is a Pareto
point. It is better than every other point in at least one of the properties





6 Beyond KPN: Models of Computation
In his seminal paper in 1936, Alan Turing proposed what he called a “com-
puting machine”1. While his machine was motivated by a person doing
computations, he intended to capture the very notion of computability by
it: namely what is possible to compute at all. He was modeling computa-
tion. Two additional such models of computation existed at the time, the
λ-calculus as proposed by Alonzo Church that same year [Chu36], and the
concept of general recursive functions due to Herbrand and Gödel, de-
veloped by Kleene [Kle36]. These three equivalent models [Tur37] were
the original models of computation. They are equivalent in the sense that
they define the same notion of what is computable. To an extent, these
models were not concerned with how to (efficiently) compute something,
but rather, what we can compute and what not. Since then, with the revo-
lution of digital computers, the interest increasingly shifted to care about
how we can compute. This spawned a much larger number of models of
computation at different levels of abstraction.
In 1972, Karp [Kar72] kick-started the field of computational complexity
by identifying many problems that were equivalently difficult to compute,
the class of NP-complete problems. Computational complexity relies on
the fact that the asymptotic behavior of the number of steps of an algo-
rithm, as a function of the input (size), is invariant when changing between
these models of computation. Around the same time, in 1970, Dana Scott
proposed a mathematical theory of computation [Sco70] based on what
are now called (Scott) domains2 and the Scott-topology. Two ideas are
central in Scott’s formalization. The first is a method for capturing partial
computations, i.e. computations that have advanced but not finished yet.
The second idea is that of modeling a computation as a continuous func-
tion between such domains, where a proper notion of continuity (in the
Scott topology) models causality in the computation. Scott’s semantics
made it possible to capture the process of a computation, but not the internals,
which are abstracted away by the function.
The question of how we compute can be modeled in different ways,
by complexity asymptotics or partial computations in the Scott formal-
ism, but some aspects are still left unmodeled. A significant such aspect
not taken into account by these models is where we are computing. The
theory of distributed computation was growing, with models like Petri
Nets [Pet62] or seminal work like Lamport’s on clocks and ordering of
events [Lam78]. These models deal with properties of a computing sys-
tem that has physically separate parts which split and distribute the com-
putational load. However, the focus of the models is the system doing the
computation, not the computation itself.
In this thesis we are mostly interested in concurrent models of com-
putation. Such models abstract away the (distributed) computing system
and focus on the computation itself. They consider and express concur-
rency in the computation, which can be exploited for parallel or asyn-
chronous execution.
1 Now known as a Turing machine
2 Also called ω-complete partial orders [Gun92], and closely related to algebraic lattices.
6.1 An overview of Models of Computation
This section will survey some of the most important concurrent models
of computation. Before diving into the models, we will first discuss the
mathematical semantics3 of computation by Scott.
6.1.1 Partial Computation: Scott Domains
When Scott proposed his mathematical theory of computation [Sco70],
he used the term mathematical to contrast it with operational compu-
tation. In practice, the steps of a computation are defined by the ISA of
the machine executing them. Most people don’t write programs directly
for the ISA, however. They write them in an abstract programming lan-
guage, which is translated by a compiler into machine instructions. Thus,
in practice, the implementation of a compiler is what informally dictates
the (operational) semantics of programs. Scott’s theory had the ambitious
goal of being an abstraction that sat between these operational seman-
tics and the abstract notions of computability of e.g. Church or Turing.
He intended to abstract away the arbitrary implementation choices that
were necessary but did not change the essence of the execution. While
today his model is not the single established abstract model of seman-
tics he sought out to define, it introduced several important ideas and
mathematical structures to models of computation. In particular, a cru-
cial abstraction introduced by his theory is that of partial computation.
His theory makes it possible to express a computation as a series of par-
tial results, without regarding the actual implementation of these. We will
now introduce the basics of Scott’s mathematical theory of computation.
Two related concepts can be used to model computation in Scott’s semantics,
ω-complete partial orders [Gun92] or complete semilattices [LM09]. We
will use the latter. Let $\langle A, \leq \rangle$ be a partially-ordered set (poset). For a subset
$B \subseteq A$, we say $a$ is an upper or lower bound of $B$ if $a \geq b$ (resp. $a \leq b$) for all
$b \in B$. Similarly, we say $a$ is a greatest lower bound/least upper bound of $B$ if $a$ is a
lower/upper bound of $B$ and for all other lower/upper bounds $a'$ we have
$a' \leq a$ or $a' \geq a$, respectively. A nonempty set $D \subseteq A$ is then called directed
if every finite nonempty subset of $D$ has an upper bound in $D$. If every such set $D$
has a least upper bound, we say that $A$ is directed-complete. In that case,
we denote the least upper bound of $D$ as $\bigsqcup D$. If $A$ additionally has a least
element $\bot \in A$ with $\bot \leq a$ for all $a \in A$, we say that $A$ is a complete partial
order. If, instead, $A$ is directed-complete and every non-empty subset has
a greatest lower bound, we say $A$ is a complete semilattice.
The canonical example of this are sequences, which are a generalization
of strings. Let $\Sigma$ be an alphabet (a set). We call $\Sigma^*$ the set of words
(Kleene star) over $\Sigma$, and $\Sigma^\omega = \mathbb{N} \to \Sigma$ is the set of (countably) infinite
sequences over $\Sigma$. We then define $S = \Sigma^* \cup \Sigma^\omega$ as the set of (finite or
infinite) sequences over the alphabet $\Sigma$. The set of sequences $S$ is
a poset with the prefix relation $\sqsubseteq$, where $s \sqsubseteq s'$ iff there exists a $t \in S$ with
$s.t = s'$. Here, $(.) : S \times S \to S$ denotes the concatenation operator (which
incidentally makes $S$ a monoid with neutral element $\varepsilon$, the empty string).
In fact, $S$ is a complete semilattice with regard to $\sqsubseteq$ (cf. [LM09]). In Scott’s
model, these sequences describe the partial steps of a computation process,
generating data in discrete steps (not necessarily all at once).
3 Nowadays we call these semantics denotational
A function $f : S \to S$ is called monotone if for $s \sqsubseteq s'$ it holds that
$f(s) \sqsubseteq f(s')$. Interpreting $f$ as computation, this models causality: having
more input data cannot change the data that has already been output.
In other words, the future cannot change the past. An additional, more
technical definition is that of continuity. A monotone function $f : S \to S$
is called continuous if for all directed sets $D$ in $S$, it holds that
$f(\bigsqcup D) = \bigsqcup f(D) := \bigsqcup \{ f(s) \mid s \in D \}$. This concept is distinct from that of a
monotone function only for infinite sequences. It means that a function will not
produce its output only after reading an infinite amount of input. We call
this continuous because the prefix relation defines a topology on the set
$S$, the Scott topology.
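To make the prefix order and monotonicity concrete, the following sketch checks the monotonicity condition exhaustively on short finite sequences. It is a minimal illustration, assuming finite sequences are encoded as Python tuples; the names `is_prefix`, `is_monotone_up_to`, `running_sum` and `length_marker` are hypothetical. A passing check on short sequences is evidence, not a proof, and continuity for infinite sequences cannot be tested this way.

```python
from itertools import product


def is_prefix(s, t):
    """The prefix relation on finite sequences: s is a prefix of t."""
    return len(s) <= len(t) and tuple(t[:len(s)]) == tuple(s)


def finite_sequences(alphabet, max_len):
    """All words over the alphabet of length at most max_len."""
    letters = tuple(alphabet)
    for n in range(max_len + 1):
        yield from product(letters, repeat=n)


def is_monotone_up_to(f, alphabet, max_len):
    """Check that s prefix of t implies f(s) prefix of f(t), for short s, t."""
    seqs = list(finite_sequences(alphabet, max_len))
    return all(is_prefix(f(s), f(t))
               for s in seqs for t in seqs if is_prefix(s, t))


def running_sum(s):
    """Emits one partial sum per input token: more input only extends the output."""
    out, acc = [], 0
    for x in s:
        acc += x
        out.append(acc)
    return tuple(out)


def length_marker(s):
    """Emits the total input length: later input changes earlier output."""
    return (len(s),)


print(is_monotone_up_to(running_sum, (0, 1), 4))
print(is_monotone_up_to(length_marker, (0, 1), 4))
```

The second function violates causality in the sense described above: its single output token would have to be revised whenever more input arrives.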
6.1.2 Concurrent Computation
Scott’s computation model implicitly assumed a sequential computation
process, and Scott-continuous functions are a powerful method for de-
scribing partial sequential computations. Can we also use this model to
describe parallel computation? Gilles Kahn did precisely this, four years
after Scott published his mathematical theory of computation. He used
the formalism of Scott to define a model of parallel computation, based
on what he coined as process networks, now known as Kahn Process Net-
works (KPNs) [Kah74].
The basic idea to generalize the Scott theory of computation for concurrent
execution is simple. We compose Scott-continuous functions into networks;
these are the KPNs. These composed functions yield a system
of equations. For example, we can compose a Scott-continuous function
$f$ with itself by applying it to its output. This yields an equation,
$f(s) = f(f(s))$, which is solved by a fixed point of $f$ (i.e. a sequence $s \in S$
with $f(s) = s$). A series of related results on such systems of equations
and fixed points by Tarski, Kleene and others shows that such a system
always has a least fixed point. This defines the semantics of KPN. For example,
for the case of the single function $f$ as above, if $f$ is the identity
function, this least fixed point is $\varepsilon$. This solves problems with loops in the
system by giving well-defined semantics, and even yields a procedure to
find the least fixed point, by repeatedly applying the functions starting from $\varepsilon$. In particular,
this means that KPNs are deterministic (as per their fixed-point semantics).
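The procedure of repeatedly applying the function, starting from the empty sequence, can be sketched as follows. This is an illustrative sketch only, assuming a single function on finite sequences encoded as tuples; the name `least_fixed_point` and the step bound are assumptions. When the least fixed point is an infinite sequence, the iteration never converges and the sketch returns a finite prefix of it instead.

```python
def least_fixed_point(f, max_steps=1000):
    """Iterate s := f(s) starting from the empty sequence.

    Returns (s, converged). If converged is False, s is only the
    approximation reached after max_steps applications of f.
    """
    s = ()
    for _ in range(max_steps):
        nxt = f(s)
        if nxt == s:
            return s, True
        s = nxt
    return s, False


# The identity function: the least fixed point is the empty sequence.
identity = lambda s: s

# Prepend 0 and increment everything fed back; this denotes the stream 0, 1, 2, ...
naturals = lambda s: (0,) + tuple(x + 1 for x in s)

# The same loop, cut off after four tokens so that the iteration terminates.
first_four = lambda s: naturals(s)[:4]

print(least_fixed_point(identity))
print(least_fixed_point(first_four))
print(least_fixed_point(naturals, max_steps=5))
```

Each intermediate value of the iteration is a prefix of the next one, which is exactly the chain of partial results whose least upper bound defines the semantics.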
There are other related models that stem from the same time period,
like the Hewitt-Agha actor model [HBS73; Agh86]. This was also a model
of parallel computation. In it, actors communicate with other actors via
messages in a non-deterministic fashion. Actors can also be dynamically
created and the connections between them are also dynamic. While this
yields much more flexibility, it comes with a high price: the loss of determinism.
Other models of parallel computation include Petri Nets [Pet62], in
which a bipartite graph of places and transitions models the distributed
execution of a system. Transitions in Petri nets are very flexible as well,
but they are also non-deterministic: the order in which multiple activated
transitions fire is non-deterministic in general.
A series of more abstract models are the Process Calculi, which include
the well-known π-calculus and CSP. These models are called calculi because
they define specific composition rules, like parallel composition
$A \mid B$ or $A \parallel B$ for processes, with clear semantics. They are well-known for
describing systems and specifying their behavior, e.g. in the context of
model checking [BK08]. However, these are also very abstract models of
computation.
Figure 6.1 shows an overview of the different models of computation
and their properties. The dotted nodes refer to abstract properties of
the models, whereas the filled nodes are concrete models. Concretely,
the ones colored light blue are those that we review and use in more detail in
this thesis. Timed models, like reactors, will be discussed in Section 6.3,
and dataflow models in the section below. This figure was inspired by Fig-



























Figure 6.1: Overview of different models of computation. Color-filled nodes refer
to concrete models, dotted ones are abstract properties.
6.1.3 Dataflow Models of Computation
One family of models stands out in the context of software synthesis and also
in the domain of embedded system software: dataflow models
of computation. More dataflow models have been proposed than what
we could reasonably list and describe here. The original idea however,
or at least one of the first to be published, goes back to Dennis [Den74;
Den86]. These dataflow models were also related to KPN, in so-called
dataflow process networks [LP95; LM09]. Common among most dataflow
models is the concept of actors, which encapsulate computation and
which have firing semantics. Actors communicate exclusively via explicit
input and output channels, which work as FIFO buffers. An actor fires when
certain conditions are met, consuming tokens in (some of) its input chan-
nels, and producing other tokens in its output channels.
We will describe Dennis dataflow using a formalism similar to the one
described in [Par95; LM09]. This formalism is very general and allows
many other dataflow paradigms to be described as special cases. The basis of
the formalism are the firing rules. An actor has a finite set $\mathcal{R}$ of firing rules,
and each rule $R \in \mathcal{R}$ is a finite tuple of words (one per input channel) over the alphabet of values
$\bar{\Sigma} := \Sigma \cup \{\bot\}$. Here, $\bot$ represents an absent value, which means no data
has to be present in that channel for the actor to fire. The patterns are
sometimes also interpreted to be words in an extended alphabet with
wildcards, e.g. $\Sigma \cup \{\bot, *\}$, where $*$ stands for any value in $\Sigma$. Note that,
mathematically speaking, both $\bot$ and $*$ are unnecessary, as the empty
string $\varepsilon$ has the same effect as $\bot$ and $*$ can be replaced by a series of
rules, one for each value in $\Sigma$. In most practical instances of dataflow, on
the other hand, rules only consist of values in $\{\bot, *\}$, which is why they are
very useful for descriptions.
An actor fires whenever there are enough tokens in the input channels
to satisfy a rule. Here, satisfying a rule specifically means the rule $R$ is a
prefix of the channel values $C$, i.e. $R \sqsubseteq C$. If we include the special values $\bot$
and $*$, the pattern has to be interpreted, e.g. by transforming it into the
mathematically equivalent variants explained above. When a rule is satisfied, the
tokens are consumed from the channels and the actor executes, computing
something and potentially producing some outputs, which are not part of
the specification in the firing rules.
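The matching of firing rules against channel contents can be sketched as follows. The encoding is an assumption made for illustration: a rule is a tuple with one pattern per input channel, a pattern is either the marker `BOTTOM` or a tuple of symbols, and `ANY` plays the role of the wildcard. The function names are hypothetical and do not refer to any existing dataflow framework.

```python
BOTTOM = None    # absent value: nothing is required on this channel
ANY = object()   # wildcard: matches any single token


def pattern_matches(pattern, channel):
    """True if the pattern is a prefix of the tokens waiting on the channel."""
    if pattern is BOTTOM:
        return True
    if len(pattern) > len(channel):
        return False
    return all(p is ANY or p == tok for p, tok in zip(pattern, channel))


def enabled_rules(rules, channels):
    """All rules that are satisfied by the current channel contents."""
    return [r for r in rules
            if all(pattern_matches(p, c) for p, c in zip(r, channels))]


def fire(rule, channels):
    """Consume the matched prefix of every channel.

    Returns (consumed tokens, remaining channel contents).
    """
    consumed = tuple(() if p is BOTTOM else tuple(c[:len(p)])
                     for p, c in zip(rule, channels))
    rest = tuple(tuple(c) if p is BOTTOM else tuple(c[len(p):])
                 for p, c in zip(rule, channels))
    return consumed, rest


# A two-input adder needs one token on each channel.
adder = ((ANY,), (ANY,))
print(enabled_rules([adder], ((3,), ())) == [])   # second channel is empty
print(fire(adder, ((3, 4), (5,))))                # consumes 3 and 5, leaves 4
```

What the actor computes from the consumed tokens is deliberately left out, mirroring the fact that outputs are not part of the firing rules.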
Note that there is nothing preventing multiple rules from applying simultaneously.
For example, an actor with two inputs could have the rules $(*, \bot)$
and $(\bot, *)$, firing as soon as one of the two channels has a token. If multiple
rules apply simultaneously, there is no general order in which the
actor fires and consumes the inputs. This means that this model is
non-deterministic. We denote this very general, dynamic variant as Dynamic
Data Flow (DDF) (alternatively, Dennis Data Flow).
If we add an additional condition, requiring that for any two distinct rules $R, R'$ there
is no common upper bound $S$ (i.e. with $R \sqsubseteq S$, $R' \sqsubseteq S$), then we can show that
the model is deterministic. We can even relax this condition somewhat
and keep determinism. In [LM09], the authors show this by explicitly constructing
a Scott-continuous function from actor firings and embedding
the model into KPN. They also discuss possible relaxations. This deterministic
variant of (Dennis) dataflow is sometimes referred to as Dataflow
Process Networks (DPN).
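The condition on rule pairs can be checked mechanically. The sketch below assumes the same hypothetical encoding as before (one pattern per channel, `BOTTOM` for an absent value, `ANY` for a wildcard, and a nonempty value alphabet). Under this encoding, two rules have a common upper bound exactly when, on every channel, their patterns agree wherever they overlap.

```python
from itertools import combinations

BOTTOM = None    # absent value, equivalent to the empty word
ANY = object()   # wildcard for a single token


def compatible(p, q):
    """True if some channel content has both patterns as a prefix."""
    p = () if p is BOTTOM else p
    q = () if q is BOTTOM else q
    return all(a is ANY or b is ANY or a == b for a, b in zip(p, q))


def have_common_upper_bound(r1, r2):
    """True if some tuple of channel contents satisfies both rules."""
    return all(compatible(p, q) for p, q in zip(r1, r2))


def no_pair_has_upper_bound(rules):
    """The sufficient condition for determinism discussed above."""
    return not any(have_common_upper_bound(r1, r2)
                   for r1, r2 in combinations(rules, 2))


# Non-deterministic merge: both rules are satisfied once both channels hold a token.
merge = [((ANY,), BOTTOM), (BOTTOM, (ANY,))]

# Select-like actor: a control token on channel 0 decides which data channel is read.
select = [((True,), (ANY,), BOTTOM), ((False,), BOTTOM, (ANY,))]

print(no_pair_has_upper_bound(merge))
print(no_pair_has_upper_bound(select))
```

The select-like actor passes the check because its two rules disagree on the control token, so no channel state can satisfy both of them.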
All these models are very expressive, so much so that they do
not permit very strong analysis of their behavior. In contrast, the SDF
model [LM87] has a very well-defined behavior and allows more analysis
to be done statically, like scheduling or bounding the sizes of the
channels [Par95]. The firing rates in the SDF model are fixed. In the formalism,
this means the firing rules are always of the form $(*^{n_1}, \dots, *^{n_k})$,
where $*^0 = \varepsilon \mathrel{\widehat{=}} \bot$ and the $n_i$ are called rates. Moreover, the number of
tokens produced is also fixed statically, which is not part of the formalism
of firing explained above. An apparently stricter variant of SDF is
Homogeneous SDF (HSDF), in which all the rates are 1. However, these two
are equivalently expressive: a well-behaved4 SDF graph can be unrolled
to an equivalent HSDF graph. The semantics of HSDF are basically equivalent
to the model of task graphs, which are widespread in the design of
embedded systems and HLS.
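One example of the static analysis that fixed rates make possible is the repetition vector: the smallest number of firings per actor that returns every channel to its initial fill level. It is the solution of the balance equations known from the SDF literature and is not spelled out in the text above; the sketch below is an illustration under the assumption of a connected graph, with a hypothetical edge encoding `(producer, tokens produced, consumer, tokens consumed)`. It requires Python 3.9 or later for `math.lcm`.

```python
from fractions import Fraction
from math import lcm


def repetition_vector(actors, edges):
    """Smallest positive integer solution of the balance equations.

    For every edge (src, prod, dst, cons) the solution q satisfies
    q[src] * prod == q[dst] * cons. Returns None if the rates are
    inconsistent. Assumes a connected graph.
    """
    ratio = {actors[0]: Fraction(1)}
    changed = True
    while changed:  # propagate firing ratios along the edges
        changed = False
        for src, prod, dst, cons in edges:
            if src in ratio and dst not in ratio:
                ratio[dst] = ratio[src] * prod / cons
                changed = True
            elif dst in ratio and src not in ratio:
                ratio[src] = ratio[dst] * cons / prod
                changed = True
    if len(ratio) != len(actors):
        raise ValueError("graph is not connected")
    for src, prod, dst, cons in edges:  # verify every balance equation
        if ratio[src] * prod != ratio[dst] * cons:
            return None
    scale = lcm(*(r.denominator for r in ratio.values()))
    return {a: int(ratio[a] * scale) for a in actors}


# A produces 2 tokens per firing, B consumes 3 and produces 1, C consumes 2.
chain = [("A", 2, "B", 3), ("B", 1, "C", 2)]
print(repetition_vector(["A", "B", "C"], chain))

# Two parallel edges with conflicting rates cannot be balanced.
print(repetition_vector(["A", "B"], [("A", 1, "B", 1), ("A", 2, "B", 1)]))
```

In the usual unrolling to HSDF, each actor is replicated as many times as its entry in this vector, so the first example would yield six nodes.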
We discuss two additional variants of dataflow which sit semantically
between SDF and DDF. The first is Cyclo-Static Data Flow (CSDF) [Bil+96], in
which the static values of SDF are replaced with cycles that repeat, allowing
for some controlled dynamism while retaining the analysability. Finally,
Scenario-Aware Data Flow (SADF) [The+06] is a more general model which
allows enabling and disabling certain paths in the graph, which are other-
wise static.
Figure 6.2 shows a Venn diagram of the dataflow models discussed here
and their relationship. Here we draw the distinctions as strictly as possible.
For example, we draw HSDF as a subset of SDF since, by definition, it is one,
even though they have the same semantic expressive power. In other
words, every HSDF graph is an SDF graph, but not every SDF graph is an HSDF graph:
the (unrolled) HSDF graph that exists for it is equivalent, not
identical. We also include KPN and the Kahn-MacQueen (KMQ) blocking-reads
semantics since they are commonly discussed as dataflow models
as well. Since the models are fundamentally different, we depict inclusion
in the Venn diagram as semantic embeddability. Note that we
depict DPN as being included in KMQ (which is proven in [LM09]), but we do
not know if this inclusion is strict, in other words, if there are KMQ models
which are not expressible as DPN. We will discuss the difference between









Figure 6.2: Relationships between different dataflow models of computation.
6.2 The MacQueen Gap
The KPN model was defined by Gilles Kahn in 1974 [Kah74]. While in this
paper he motivated how examples of such networks could be defined,
the semantics of a concrete language were only later postulated by Kahn
with MacQueen in 1976 [KM76]. However, there is a gap between the semantics
of formally defined networks (KPN) and the concrete networks that can
be defined by the Kahn-MacQueen blocking-reads execution semantics:
These concrete semantics are not as general as the formal model allows
4 Concretely, a graph that can be executed without deadlocks and without an indefinite accu-
mulation of tokens.
them to be. More concretely, there are networks which fall under the KPN
formalism that cannot be expressed using the Kahn-MacQueen blocking-
reads semantics. We call this gap in the semantics “the MacQueen gap”,
as the gap between the formal model by Kahn and the concrete execution
semantics by Kahn and MacQueen [LM09; KGC18].
In this section we explore the MacQueen gap by showing the difference
between the two formalisms, and see how we can exploit it. The contribution
of this thesis is limited to the theoretical advantage from this semantics
gap. The practical implementation and evaluation of the library
that we describe in [KGC18], which exploits this gap in the semantics, is
accordingly beyond the scope of this thesis.
6.2.1 The MacQueen Gap
Recall from Sections 2.1 and 6.1.2 that a KPN can be modeled as a directed
graph $K = (V, E)$ where the nodes $V$ are Scott-continuous functions $f \in V$
mapping from the set of sequences from the input channels $S_{i_1} \times \dots \times S_{i_k}$
to the set of output channels $S_{o_1} \times \dots \times S_{o_l}$, and the edges represent the
corresponding Scott domains of sequences.
The Kahn-MacQueen (KMQ) blocking reads semantics are defined in a
more operational fashion. The model of computation is defined implic-
itly by the semantics of a language [KM76], characterized mainly through
blocking reads to channels. While the original semantics by Kahn [Kah74]
do suggest a programming paradigm similar to the KMQ blocking-read se-
mantics, Kahn’s original examples in a programming language made the
waiting explicit in the program, not implicit in the read semantics. Nei-
ther paper aims to prove that the semantics emerging from the proposed
languages correspond to the mathematical semantics of the networks in
terms of Scott-continuous functions.
A central point of this distinction is the level at which these two seman-
tics are defined: While the KPN semantics are defined at a denotational
level, the KMQ blocking-read semantics are operational in nature, and thus,
more fine grained. This distinction is also crucial for understanding the se-
mantics gap, since the gap itself is operational in nature as well.
To understand the difference between the semantics we will first con-
sider both from a denotational point of view. It is obvious that the basic se-
mantics of the language describe a finite directed graph, and conversely,
that any finite directed graph can be defined this way, by sequentially list-
ing every node and all incoming and outgoing edges. Thus, we can think
of every KMQ process as a function $f$ mapping from the set of sequences
from the input channels $S_{i_1} \times \dots \times S_{i_k}$ to the set of output channels $S_{o_1} \times \dots \times S_{o_l}$.
The pertinent question for characterizing KMQ processes is whether this function is continuous.
We sketch a proof of this in Theorem 6.2.1.
Theorem 6.2.1. A KMQ process is Scott-continuous.
Proof. (Sketch)
Let $P$ be a KMQ process. Since $P$ is sequential, and the reads and writes
are blocking, there is exactly one sequence of read and write operations
that will be executed for given inputs. This means that we can divide $P$ into
segments of execution between reads and writes, resulting in a sequence
$(s_1, c_1).(s_2, c_2).\dots$ where for each $i$, $s_i \in \Sigma$ is a value and $c_i$ is the channel
from which the value is read or to which it is written. We can then construct the
corresponding (Scott-continuous) function $f$. We discuss the case for $f : S \to S$, for a
single input channel $r$ and a single output channel $w$; the others are analogous.
Let $i_1$ be such that the first $i_1$ operations $s_1.\dots.s_{i_1}$ are writes, i.e. $c_1 = \dots = c_{i_1} = w$ and $c_{i_1+1} = r$. The
index $i_1$, as well as $s_1, \dots, s_{i_1}$, have to be identical for all sequences, since they
cannot depend on any inputs, by definition. We set $f(\varepsilon) = s_1.\dots.s_{i_1} =: f_0$.
Similarly, we let $i_2, i_3$ be such that
$$c_{i_1+1} = \dots = c_{i_2} = r \neq w = c_{i_2+1} = \dots = c_{i_3} \neq r = c_{i_3+1}.$$
We define $x_1 := s_{i_1+1}.\dots.s_{i_2}$ and set $f(x_1) = f_0.s_{i_2+1}.\dots.s_{i_3}$, and continue
this process for all $(s_i, c_i)$. It is clear that such a construction will produce a
Scott-continuous function if it is well-defined. To see that it is well-defined
we need to prove with the concrete semantics of the programming language
that the same input produces the same output.
Clearly, the proof sketch in Theorem 6.2.1 is not a formal proof, since we
don’t have formal semantics for the concrete language that defines the
KMQ blocking-reads. Defining these and proving Theorem 6.2.1 properly
is beyond the scope of this thesis. We get the following corollary immedi-
ately by definition:
Corollary 6.2.2. Every Kahn-MacQueen Network is a Kahn Process Net-
work.
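The construction in the proof sketch can be illustrated operationally. The following is a minimal sketch, not the language of [KM76]: a process with a single input channel is modeled as a Python generator, where `x = yield READ` stands for a blocking read and `yield ("write", v)` for a write. The helper `run_blocking` and the example process are hypothetical. The function $f$ assigns to each input prefix the tokens written before the process blocks.

```python
READ = "read"


def run_blocking(process, inputs):
    """Run a single-input process on a finite input prefix.

    Returns the tokens written before the process blocks on the
    empty input channel (or terminates).
    """
    pending = list(inputs)
    out = []
    gen = process()
    try:
        op = next(gen)
        while True:
            if op == READ:
                if not pending:
                    break                      # blocked: no more input available
                op = gen.send(pending.pop(0))
            else:
                _, value = op
                out.append(value)
                op = next(gen)
    except StopIteration:
        pass
    return tuple(out)


def successor_with_header():
    yield ("write", 0)          # f_0: output that depends on no input
    while True:
        x = yield READ          # blocking read
        yield ("write", x + 1)


print(run_blocking(successor_with_header, ()))
print(run_blocking(successor_with_header, (5,)))
print(run_blocking(successor_with_header, (5, 7)))
```

Each longer input prefix only extends the output produced for the shorter one, which is the monotonicity that the proof sketch relies on.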
What about the converse implication? Can every KPN be realized by a
program following the KMQ blocking-reads semantics? To understand the








$f : S_{i_1} \times S_{i_2} \to S_{o_1}$
$f : (x, y) = ((x_1, \dots, x_j), (y_1, \dots, y_k)) \mapsto (x_1 + y_1, \dots, x_{\min(j,k)} + y_{\min(j,k)})$
Figure 6.3: An example of a KPN which admits non-blocking-read semantics.
By abuse of notation, we allow $j, k = \infty$, and take $j = \infty = k$ to mean that
for two streams $x : \omega \to \Sigma_{i_1}$, $y : \omega \to \Sigma_{i_2}$ we define $f((x_i, y_i)) = (x_i + y_i)$ for
all $i \in \mathbb{N}$. The process $f$ is, thus, a deterministic merge via addition of the
two input streams and obviously Scott-continuous, i.e. a Kahn process.
Now consider the following three cases:
1. $x = \varepsilon$, $y = (1)$
2. $x = (1)$, $y = \varepsilon$
3. $x = (1)$, $y = (1)$
It is clear that the first two cases are prefixes of the third. By the definition
of $f$, only this third case will generate an output $(2)$, whereas the first
two cases will result in an empty stream on the output channel $o_1$. However,
operationally, there are different ways of processing these streams.
A KMQ program has to choose to read one channel first, blocking, then
read the second channel, blocking, and then output the sum. Listing 4
shows an example of code in the original language proposed by Kahn
and MacQueen.







Listing 4: A deterministic merge (sum) in the POP-2-based language of KMQ.
This implementation will block in Case 1, leaving unread data in the channels,
while it will execute normally in Cases 2 and 3. This is because we
(arbitrarily) choose to read $i_1$ before $i_2$. If we reverse this order, the implementation
would block on Case 2 instead, leaving unread tokens in the
channel $i_1$. This is relevant if we consider the execution and communication
times, since e.g. there is a finite read time required to read every
channel. Consider the Gantt charts depicted in Figure 6.4. They show how
blocking when reading $i_1$ delays the whole execution, even if $i_2$ could be
read. This is because the blocking-read semantics forces a deterministic
ordering of reading tokens when executing, whereas the KPN semantics













Figure 6.4: Examples of Gantt Charts corresponding to implementations of the
Kahn Function f .
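The effect of the read order on the three cases above can be reproduced with a small simulation. This is an illustrative Python sketch and not the POP-2-based code of [KM76]: processes are generators that yield `("read", channel)` for a blocking read and `("write", value)` for a write, and the helper names are assumptions.

```python
def run_blocking(process, inputs):
    """Run a process on finite input prefixes, given as {channel: tokens}.

    Returns (tokens written, tokens left unread per channel) at the point
    where the process blocks on an empty channel.
    """
    pending = {ch: list(tokens) for ch, tokens in inputs.items()}
    out = []
    gen = process()
    try:
        op = next(gen)
        while True:
            kind, arg = op
            if kind == "read":
                if not pending[arg]:
                    break                          # blocked on an empty channel
                op = gen.send(pending[arg].pop(0))
            else:
                out.append(arg)
                op = next(gen)
    except StopIteration:
        pass
    unread = {ch: tuple(tokens) for ch, tokens in pending.items()}
    return tuple(out), unread


def make_sum(first, second):
    """Deterministic merge by addition, reading `first` before `second`."""
    def process():
        while True:
            a = yield ("read", first)
            b = yield ("read", second)
            yield ("write", a + b)
    return process


cases = [{"i1": (), "i2": (1,)}, {"i1": (1,), "i2": ()}, {"i1": (1,), "i2": (1,)}]
for order in (("i1", "i2"), ("i2", "i1")):
    for case in cases:
        print(order, case, run_blocking(make_sum(*order), case))
```

Both read orders produce the same output stream in every case; they differ only in which tokens remain unread when the process blocks, which is the operational difference that the Gantt charts depict.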
Having understood the nature of the semantics gap, we can thus re-
turn to the question of the other direction in Theorem 6.2.1. The gap we
have shown here exposes a difference in the operational semantics, yet
the different versions discussed all result in the same denotational Kahn
process as defined in Figure 6.3. This does not contradict the converse










$g : S_{i_1} \times S_{i_2} \to S_{o_1} \times S_{o_2}$
$g(x, y) = (y, x)$
Figure 6.5: A counterexample of the equivalence of Kahn-MacQueen and Kahn
processes.
By exploiting the problem exposed in the first example, we can come up
with a proper counterexample to the reverse direction of Theorem 6.2.1.
The example depicted in Figure 6.5 is again clearly a Kahn process (Scott
continuous), which just forwards the two incoming channels indepen-
dently. In practice, this Kahn process is not very useful, but it serves for-
mally as a simple counterexample to the equivalence of KMQ blocking-
reads processes and Kahn processes. To see this, consider again as input
streams $(i_1, i_2)$ the three cases from the first example:
1. $x = \varepsilon$, $y = (1)$
2. $x = (1)$, $y = \varepsilon$
3. $x = (1)$, $y = (1)$
Unlike $f$, the function $g$ has a different behavior for every case:
1. $g(\varepsilon, (1)) = ((1), \varepsilon)$
2. $g((1), \varepsilon) = (\varepsilon, (1))$
3. $g((1), (1)) = ((1), (1))$
This process cannot be realized by a KMQ process with blocking reads.
Assume there was such a process. Then, from the sequentiality of code,
either $i_1$ or $i_2$ will be read first. Without loss of generality let us assume
that $i_1$ is read first. Then for the input stream $(\varepsilon, (1))$ however, the process
will block and will never output the 1 from channel $i_2$, which yields the
contradiction.
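The argument can be illustrated by simulating the two simplest blocking candidates for $g$ and comparing them with the denotational definition. This sketch only exercises two candidate processes and therefore does not replace the general argument above; the generator-based encoding and all names are assumptions made for illustration.

```python
def g(x, y):
    """Denotational definition: o1 carries the tokens of i2, o2 those of i1."""
    return tuple(y), tuple(x)


TARGET = {"i1": "o2", "i2": "o1"}


def run_blocking(process, x, y):
    """Run a blocking-read process until it blocks on an empty channel."""
    pending = {"i1": list(x), "i2": list(y)}
    out = {"o1": [], "o2": []}
    gen = process()
    op = next(gen)
    while True:
        if op[0] == "read":
            if not pending[op[1]]:
                break                              # blocked
            op = gen.send(pending[op[1]].pop(0))
        else:
            out[op[1]].append(op[2])
            op = next(gen)
    return tuple(out["o1"]), tuple(out["o2"])


def forward_blocking(first):
    """Forward both channels, always reading `first` before the other one."""
    second = "i2" if first == "i1" else "i1"
    def process():
        while True:
            for ch in (first, second):
                token = yield ("read", ch)
                yield ("write", TARGET[ch], token)
    return process


def forward_polling(x, y):
    """Forward whatever is available on either channel, without blocking."""
    pending = {"i1": list(x), "i2": list(y)}
    out = {"o1": [], "o2": []}
    progress = True
    while progress:
        progress = False
        for ch in ("i1", "i2"):
            if pending[ch]:
                out[TARGET[ch]].append(pending[ch].pop(0))
                progress = True
    return tuple(out["o1"]), tuple(out["o2"])


cases = [((), (1,)), ((1,), ()), ((1,), (1,))]
for x, y in cases:
    print(g(x, y),
          run_blocking(forward_blocking("i1"), x, y),
          run_blocking(forward_blocking("i2"), x, y),
          forward_polling(x, y))
```

Each blocking variant deviates from $g$ on exactly one of the first two cases, while the polling variant agrees with $g$ on all three.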
6.2.2 Exploiting the Gap
We have seen in the previous section that there is a gap between the operational
blocking-read semantics proposed by Kahn and MacQueen and the
denotational KPN semantics. While the counterexample from Figure 6.5
does not seem very useful, the gap in the operational nature shown in Fig-
ure 6.4 readily suggests how this gap could be exploited. In general, the
Scott continuity of KPNs requires the arrival of tokens to be deterministic,
but it does not require the execution of independent nodes to follow the
same order as the tokens, as required by the Kahn-MacQueen blocking-
read semantics. Thus, as suggested by the example, the MacQueen gap
can be exploited for asynchronous computation, as long as it does not
break determinism.
This asynchronous execution can be used to execute multiple work-
ers in a data-parallel fashion. Figure 6.6 shows an example of a network
which does this. The worker processes w1, . . . , wn can exploit data paral-
lelism by dividing a workload into different parts. This allows us to asyn-
chronously execute the workloads, as long as we take care to preserve the
order at the sink node. We can achieve this by making it part of the logic
of the channels. In [KGC18] we proposed to exploit this gap and tested
an implementation of this in MAPS, which modified the FIFO libraries of
nodes labeled as data-parallel to relax the deterministic semantics of the
KMQ blocking-reads and allowed asynchronous execution of data-parallel
workers while preserving the deterministic KPN execution. The implemen-
tation of this library is beyond the contribution of this thesis, which is









Figure 6.6: An example of data-parallelism exploiting the MacQueen gap.
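The idea of preserving the token order at the sink while workers run asynchronously can be sketched with sequence numbers and a reorder buffer. This is a simplified simulation under stated assumptions and not the library from [KGC18]: the arbitrary completion order of the workers is modeled by a seeded random choice, and the function `run_data_parallel` and its parameters are hypothetical.

```python
import random


def run_data_parallel(tokens, work, n_workers, seed=0):
    """Simulate a source, n data-parallel workers and an order-preserving sink."""
    rng = random.Random(seed)
    # Source: tag every token with a sequence number, distribute round-robin.
    queues = [[] for _ in range(n_workers)]
    for seq, tok in enumerate(tokens):
        queues[seq % n_workers].append((seq, tok))

    reorder_buffer = {}   # results that arrived before their turn
    next_seq = 0          # sequence number the sink has to emit next
    output = []
    while any(queues):
        # Pick any worker that still has work: models asynchronous completion.
        w = rng.choice([i for i, q in enumerate(queues) if q])
        seq, tok = queues[w].pop(0)
        reorder_buffer[seq] = work(tok)
        # Sink: emit results strictly in the original token order.
        while next_seq in reorder_buffer:
            output.append(reorder_buffer.pop(next_seq))
            next_seq += 1
    return output


square = lambda v: v * v
print(run_data_parallel(range(10), square, n_workers=3, seed=1))
```

Whatever interleaving the seed produces, the sink emits the same stream as a sequential execution would, which is the determinism the network has to preserve.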
6.3 Reactors
So far we have discussed multiple MoCs with different extensions. Most
models we have focused on in this thesis are deterministic, which, as
explained in the introduction, is an important and useful property of a
model’s semantics. We have shown that determinism in KPNs allows us to
simulate and analyze their execution. Without it, many concepts we have seen
in Chapters 2, 4 and 5 break down.
However, the models we have discussed neglect one important aspect,
time. Computation takes time [Lee09], and this is a fundamental prop-
erty of its semantics which is usually implicit. Determinism as we have
discussed it here means that the output of a computation is a determin-
istic function of its input. This does not mean that the time it takes is de-
terministic, as we have studied in [Goe+17]. Especially in the context of
CPSs or real-time systems, the computation time is an essential part of
the functional specification of an application. In this section we discuss
the Reactor model [Loh+19], which aims to be a deterministic MoC with timed
semantics.
The Reactors model is inspired by the Hewitt-Agha actor model [Agh86],
which is a very widespread and well-known model of concurrent compu-
tation. The actor model is neither deterministic nor timed. Determinism
in Reactors comes from combining ideas from multiple paradigms [LL19],
notably, through explicit discrete-event semantics. The reactor model has
two distinct time notions, physical and logical time. Physical time refers to
the time as elapsing in the physical part of the system, and that part of
the model is thus not part of the digital logic. Logical time, on the other
hand, is the digital counterpart of physical time, and is the time that gov-
erns the computation of the reactor network. Every CPS has physical and
logical time, by their very definitions. A novelty of the reactor model is
making both time concepts and their separation explicit. Just as in any
other timed MoC for CPSs, the two times are tightly coupled and intended
to be synchronized. Making the separation explicit allows us to control
the synchronization of both time models and have better control over a
deterministic execution of the time logic.
Just as in the dataflow models discussed in Section 6.1, the actor model
divides computation into isolated actors that communicate solely over ex-
plicit messages. The main difference to models like SDF or KPN is that ac-
tors and channels are not fixed. Instead, they can be dynamically created
and destroyed. In Reactors, we aim to combine good ideas from multi-
ple established MoCs. We permit dynamic re-configuration of the network
through mutations which are well-defined (not arbitrary) transformations
of the network’s topology [Loh+20c]. This permits us to reason about de-
terminism more explicitly. At the time of this writing, mutations are only
defined abstractly. Specifying a set of well-defined mutations that allow
us to reason about determinism and time, while still providing enough
flexibility as needed by the applications, is ongoing work. We will discuss
this in an example use case for 5G in Section 6.3.1.
This thesis deals with model-based design in general. As such, Reac-
tors are part of the contribution as yet another model with distinct advan-
tages and disadvantages. Thus, apart from the design choices discussed,
we only briefly outline the concepts behind reactors and a simplified de-
notational semantics, as well as some applications leveraging particular
features of this model as opposed to other MoCs. The detailed design and
implementation of Reactors as runtime systems and the corresponding
polyglot coordination language, Lingua Franca5, are outside the scope of
this thesis [Loh+20a; Loh20].
Denotational Semantics
In [Loh+20c] we laid the groundwork for an operational formalization of
reactors. The reactor model is a moving target and has been refined since.
At the time of this writing, the most thorough and up-to-date account is
in [Loh20]. Here we will deviate from the formalization both in [Loh+20c]
and [Loh20], however, and attempt a denotational approach to seman-
tics. In ongoing work with Marcus Rossel, we are using the Lean theorem
prover [Mou+15] to formally verify reactors, proving properties like deter-
minism under certain conditions. A reason for this denotational approach
is that the original formalization has some mathematical inaccuracies and
unspecified behavior. Clarifying or correcting these inaccuracies is neces-
sary for having a well-defined model. The second reason for the deviation
is the level of detail. We want to simplify the formalization of [Loh+20c;
Loh20]. The aim of the formalization here is to isolate the abstract model’s
(denotational) semantics and leave implementation-specific details out as
much as possible. An advantage of this formalization is that it relates KPNs
and Reactors formally.
We explicitly restrict ourselves to a subset of the model, leaving out mu-
tations and any kind of exception-handling policies. A more comprehen-
sive (operational) model, including some of these concepts, is discussed
in Chapter 2 of [Loh20]. These restrictions are in part for simplicity, but
also due to this being ongoing work. At the time of this writing, we have
not finished the Lean-based formalization to include these aspects. Ex-
tending a simple model is easier than changing a complete model that is
problematic. It is important to note that as ongoing work, this alternative
formulation has not yet undergone peer-review (as opposed to [Loh+20c])
and is subject to change.
Timeless model
Reactors are a timed model, with specific semantics of how the time pro-
gresses and what can happen when. The logical (functional) semantics of
a reactor network are complex as well, however. We begin by defining
the computational semantics of the network in a timeless fashion, and
then extend the model to include time.
5 https://github.com/icyphy/lingua-franca
Computation is essentially manipulation of data. Models are thus built
and defined by how they manipulate data. We follow a model of computation
based on Scott’s semantics of computation (cf. Section 6.1). Data is
modeled as sequences $s \in S = \Sigma^* \cup \Sigma^\omega$ over a finite alphabet $\Sigma$, which
we require to include a special symbol $\bot \overset{!}{\in} \Sigma$ that represents absence
of data 6. The basic units of computation in Reactors are reactions, which
take a finite number of data tokens as input and return a finite number as
output. Thus, to define a reaction we simply consider a Scott-continuous
function $n : S^k \to S^m$. Sequences can have different lengths (both finite
or infinite), yet reactor networks execute in discrete ticks, which result in
sequences of the same length. To model this we define “padding” using
the special symbol $\bot \in \Sigma$ on a finite sequence $s \in \Sigma^* = S \setminus \Sigma^\omega$, by defining
$\hat{s} = s.(\bot^\omega)$. Finally, an important restriction is that we want to ensure
reactions in the timeless model do not take multiple inputs from the same
channel before producing an output.
Definition 6.3.1 (Reaction). Let $n : S^k \to S^m$ be a Scott-continuous function.
We call $n$ a reaction if for any two $s, s' \in S$ such that $s'$ is a proper
prefix7 of $s$, i.e. $s' \subsetneq s$, this also holds for the images under $n$, i.e.
$n(s') \subsetneq n(s)$.
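As an informal illustration of the padding and of Definition 6.3.1, the following Python sketch represents finite sequences as tuples and the absent value by a hypothetical marker `ABSENT`. The names `pad`, `is_proper_prefix` and `respects_proper_prefixes` are ours and not part of the formalization. The check only samples the finitely many prefixes of one finite input, so it can refute the property but never prove it.

```python
from typing import Callable, Tuple

ABSENT = None  # stand-in for the absent-value symbol (assumption of this sketch)

Seq = Tuple[object, ...]


def pad(s: Seq, length: int) -> Seq:
    """Finite approximation of the padding: extend s with ABSENT up to `length`."""
    return s + (ABSENT,) * max(length - len(s), 0)


def is_proper_prefix(a: Seq, b: Seq) -> bool:
    """True if a is a strict prefix of b."""
    return len(a) < len(b) and b[: len(a)] == a


def respects_proper_prefixes(n: Callable[[Seq], Seq], s: Seq) -> bool:
    """Check the condition of Definition 6.3.1 on all proper prefixes of s
    (single input, single output, finite sequences only)."""
    return all(is_proper_prefix(n(s[:i]), n(s)) for i in range(len(s)))


def double(s: Seq) -> Seq:
    """Token-wise doubling: one output per input, so prefixes grow strictly."""
    return tuple(2 * x for x in s)


def drop_odd(s: Seq) -> Seq:
    """Filter that silently drops tokens: the output may stall on longer input."""
    return tuple(x for x in s if x % 2 == 0)
```

In this sketch `double` passes the check, while `drop_odd` fails it for the input `(2, 3)`, since the output does not grow. A filter that emits the absent value for every dropped token would pass again.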
Note that our definition is a restriction on the definition of reactions.
As defined in [Loh+20c] with an informal source code “object”, they can
be interpreted in the semantics of the language of that source code. As
such, they could implement any relation on $S^k \times S^m$. In particular, we assume
reactions are deterministic (as mathematical functions) and respect
causality (being Scott-continuous). Note that this does not mean they are
stateless. State is implicit in the definition of a function on the complete
history of inputs, as opposed to a function on a single input token. In the




$(\ldots, \ast) = (\ast, \underbrace{s'}_{j}, \ast)$ with $s, s' \in \Sigma$,
the $\ast$ being other values we don’t care about here. However, we use the
denotational formalization of computation of functions by Scott as com-
plete sequences of inputs and outputs, which makes the state implicit.
Definition 6.3.1 also has strong theoretical consequences. It implies that
every monotone reaction is Scott-continuous, since it is equivalent to
$|f(s)| \geq |s|$ for all $s \in S$, which avoids the pathological cases that distinguish
monotone and Scott-continuous functions. Proving this fact is beyond the
scope of this thesis.
Modeling reactions as Scott-continuous functions, we do not specify
anything about the length of the sequences. A reaction might produce
a longer output sequence than its input sequence. At this stage this is not
important, as we model the complete computation with a single function.
We will come back to this later, in the timed model, when we relate these
sequences to concrete times.
6 Note that the exclamation mark in the notation before refers to the requirement of that
inclusion (as opposed to a statement of a fact).
7 Not to be confused with $s' \nsubseteq s$, the negation of $s' \subseteq s$.
If a reaction $f : S^k \to S^m$ has the property that
$f(s_1, \ldots, s_{i-1}, \bot, s_{i+1}, \ldots, s_k) \subseteq (\bot^\omega, \ldots, \bot^\omega)$
for all $s_j \in S$, $1 \leq j \leq k$, $j \neq i$,
we say that f has a trigger on the input i. Recall that the symbol K repre-
sents the absence of values. Intuitively, thus, a reaction that triggers on
i will not execute if there is no input on i. In other words, the values in i
trigger the reaction, hence the name. Besides triggers, the original defini-
tion also has other components as part of reactions, namely sources and
effects (or dependencies and anti-dependencies), schedulable actions
and a deadline. We include most of these concepts in other definitions,
e.g. the reactor or the network.
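To make the notion of a trigger more tangible, the following Python sketch shows a two-input reaction with a trigger on its first input. It reads the definition tick by tick on equally long (padded) finite sequences, which is our simplification. The marker `ABSENT` and the function name `add_when_triggered` are hypothetical.

```python
from typing import Tuple

ABSENT = None  # hypothetical stand-in for the absent value

Seq = Tuple[object, ...]


def add_when_triggered(trigger: Seq, other: Seq) -> Seq:
    """Tick-wise reaction with a trigger on its first input.

    At each tick the reaction only fires if the trigger input carries a value;
    otherwise the output at that tick is ABSENT. An absent second input is
    treated as 0 (an arbitrary choice for this sketch).
    """
    out = []
    for t, o in zip(trigger, other):
        if t is ABSENT:
            out.append(ABSENT)  # no trigger, no execution
        else:
            out.append(t + (0 if o is ABSENT else o))
    return tuple(out)
```

If the trigger input is absent at every tick, the output consists only of absent values, regardless of the second input.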
To communicate between reactors (or perhaps more precisely, be-
tween reactions), we need to send and receive data. We do this using
input and output ports, which we model simply as identifiers in an index
or identifier set I. A reactor has a series of reactions with input and output
ports, and reactors connect to each other through them.
Definition 6.3.2 (Reactor). Let $I$ be an index set. A reactor is a tuple
$r = (N, D, D^\vee)$ where $N$ is a finite poset of reactions $n : S^{k_n} \to S^{m_n}$ and
$D : N \to (\{1, \ldots, k_n\} \to I)$,
$D^\vee : N \to (\{1, \ldots, m_n\} \to I)$,
are called the sources and effects respectively. We define the set of input
ports as $\mathrm{Input}(r) = \bigcup_{n \in N} \mathrm{im}(D(n))$ and, similarly, the set of output
ports as $\mathrm{Output}(r) = \bigcup_{n \in N} \mathrm{im}(D^\vee(n))$. We require that
$\mathrm{Input}(r) \cap \mathrm{Output}(r) = \emptyset$ as part of the definition of a reactor.
The sources $D$ make a correspondence between the indices in the tuple
of input streams of a reaction and the (port) identifiers $I$. For example, if
a reaction $n : S^2 \to S$ takes two inputs, $D(n) : 1 \mapsto c, 2 \mapsto b$ means that the
ports $c$ and $b$ are the two input ports of $n$, in that order. The effects $D^\vee$
are analogous but for the outputs of the reaction.
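The port bookkeeping of a reactor can be sketched in Python as follows. This is only an illustration under simplifying assumptions: reactions are referred to by hypothetical string names, the partial order on reactions is omitted, and the class and method names are ours.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Tuple


@dataclass
class Reactor:
    """Ports of a reactor; reactions are referred to by (hypothetical) names.

    sources[n] lists the input ports of reaction n in argument order,
    effects[n] lists its output ports in result order.
    """
    sources: Dict[str, Tuple[str, ...]]
    effects: Dict[str, Tuple[str, ...]]

    def inputs(self) -> FrozenSet[str]:
        """Union of the images of the sources over all reactions."""
        return frozenset(p for ports in self.sources.values() for p in ports)

    def outputs(self) -> FrozenSet[str]:
        """Union of the images of the effects over all reactions."""
        return frozenset(p for ports in self.effects.values() for p in ports)

    def is_well_formed(self) -> bool:
        """Input and output ports must be disjoint."""
        return self.inputs().isdisjoint(self.outputs())


# The example from the text: reaction n reads ports c and b, in that order.
# The output port name "out" is made up for this sketch.
r = Reactor(sources={"n": ("c", "b")}, effects={"n": ("out",)})
```

Keeping the ports as ordered tuples per reaction mirrors the role of the sources and effects as maps from argument positions to port identifiers.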
We require N to be a poset for two reasons. Firstly, we want to be able to
specify an order in which reactions are always executed. However, we also
want to allow explicitly making the model non-deterministic by making
reactions incomparable. When two reactions are incomparable, they are
executed in a non-deterministic order. By the order-extension principle, it
is always possible to execute reactions while respecting the partial order.
More formally, let $n, n' : S \to S$ be two reactions. For simplicity, we
assume they have a single (shared) input and output port:
$(D(n))(1) = (D(n'))(1)$ and similarly $(D^\vee(n))(1) = (D^\vee(n'))(1)$. Recall that
$\hat{s} = s.(\bot)^\omega$ for $s \in S \setminus \Sigma^\omega$ is a “padding” of a sequence with absent
values. We say that a function $f : S \to S$ is a priority-preserving execution
if for all $s \in S$ and for all $i \in \mathbb{N}$, it holds that:
$(\hat{f}(\hat{s}))_i = (\hat{n}(\hat{s}))_i$, if $n \leq n'$, $(\hat{n}(\hat{s}))_i \neq \bot$ (6.1)
$(\hat{f}(\hat{s}))_i = (\hat{n}'(\hat{s}))_i$, if $(\hat{n}(\hat{s}))_i = \bot$ (6.2)
$(\hat{f}(\hat{s}))_i = (\hat{n}'(\hat{s}))_i$, if $n' \leq n$, $(\hat{n}'(\hat{s}))_i \neq \bot$ (6.3)
$(\hat{f}(\hat{s}))_i = (\hat{n}(\hat{s}))_i$, if $(\hat{n}'(\hat{s}))_i = \bot$ (6.4)
$(\hat{f}(\hat{s}))_i \in \{(\hat{x}(\hat{s}))_i \mid x \in \{n, n'\}\}$ otherwise (6.5)
In this case we write $f \in \biguplus_{D,D^\vee} \{n, n'\}$. Equations 6.1 and 6.3 formalize
the reaction priority when the two reactions are ordered, and Equation 6.5
the non-deterministic ordering when $n$ and $n'$ are incomparable.
If a reaction returns an absent value $\bot$, then the value of the other reaction
is written on the output sequence. Note when $n \leq n'$ or $n' \leq n$




equivalent up to padding with $\bot$.
This definition can be trivially generalized to more than one (shared)
input or output sequence (component-wise), and to non-shared input or
output sequences by requiring equations 6.1-6.5 to hold quantified over
all non-shared sequences. Finally, for a poset $N$ of reactions, we define
$\biguplus_{D,D^\vee} N$ analogously (component-wise), requiring equations 6.1-6.5 to
hold pairwise for any two $n, n' \in N$.
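The tick-wise combination of two reactions writing to a shared output port can be sketched in Python as follows. The sketch works on finite, equally long (padded) output sequences. The marker `ABSENT`, the function name `merge_outputs` and the `choice` parameter, which stands in for the non-deterministic case, are assumptions of this illustration and not part of the formalization.

```python
from typing import Sequence, Tuple

ABSENT = None  # hypothetical stand-in for the absent value


def merge_outputs(out_n: Sequence, out_n2: Sequence,
                  n_le_n2: bool, n2_le_n: bool,
                  choice: Sequence[int] = ()) -> Tuple:
    """Combine the padded outputs of two reactions n and n' tick by tick.

    n_le_n2 / n2_le_n encode the partial order between the two reactions.
    `choice` resolves ticks of incomparable reactions (0 picks n, 1 picks n');
    it models the non-deterministic case and defaults to always picking n.
    """
    result = []
    for i, (a, b) in enumerate(zip(out_n, out_n2)):
        if a is ABSENT:
            result.append(b)                 # n is absent: take the value of n'
        elif b is ABSENT:
            result.append(a)                 # n' is absent: take the value of n
        elif n_le_n2:
            result.append(a)                 # both present and ordered: n has priority
        elif n2_le_n:
            result.append(b)                 # both present and ordered: n' has priority
        else:
            pick = choice[i] if i < len(choice) else 0
            result.append(b if pick else a)  # incomparable: either value is allowed
    return tuple(result)
```

For incomparable reactions, every possible `choice` yields one admissible function, which reflects that the set of priority-preserving executions contains more than one element in that case.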
Reactors are connected in networks. We model these networks explic-
itly, separate from reactors themselves. In the original definition, this is
avoided by building reactors hierarchically. There is no semantic distinc-
tion between a hierarchical model and a flat model, where all contained
reactors are “inlined” in a network8. We prefer separating the reactors
and their networks, since the definition of reactor networks allows us to
specify the semantics of how they can be connected. Here, we distinguish
between two cases: an untimed one, which we call timeless and repre-
sents the purely logical execution of the network, and a timed one, which
is the general case and is built on top of the former.
Definition 6.3.3 (Timeless reactor network). A timeless reactor network
is a multigraph $R = (V, E, \xi)$ with a set of reactors as nodes $V$, a set
of edges $E$, which we require to be pairs of indices, $E \subseteq I \times I$, and
$\xi : E \to \{\{r_1, r_2\} \mid r_1, r_2 \in V\}$. For this multigraph we require that for
any two distinct reactors $r_1 \neq r_2 \in V$ the input and output ports are pairwise
disjoint, i.e. $\mathrm{Input}(r_i) \cap \mathrm{Output}(r_j) = \emptyset$ for all $i, j \in \{1, 2\}$, and that every
edge is a tuple consisting of an output port and an input port, i.e. for
all $(i, j) = e \in E \subseteq I \times I$ there exist $r_1, r_2 \in V$ such that $i \in \mathrm{Output}(r_1)$
and $j \in \mathrm{Input}(r_2)$. We additionally require that the multigraph has no
self-edges, i.e. $|\xi(e)| > 1$ for all $e \in E$.
Recall that a multigraph is a graph that can have multiple edges, and the
function $\xi : E \to \{\{r_1, r_2\} \mid r_1, r_2 \in V\}$ defines which vertices are connected
by each edge. Here, the edges themselves carry semantics as well. They
define which ports specifically they connect in the reactor. We define the
set $I(R) = \bigcup_{r \in V} (\mathrm{Input}(r) \cup \mathrm{Output}(r))$ as the set of ports of $R$.
We make an additional remark about Definition 6.3.3, namely that we
don’t require all ports to be connected. Indeed, some ports we explicitly
want to leave disconnected to define the general, timed model.
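The edge conditions of a timeless reactor network can be checked mechanically. The following Python sketch describes each reactor only by its input and output port sets and additionally assumes that every port belongs to exactly one reactor, which is slightly stronger than what the definition requires. The names `Ports`, `owner` and `is_valid_network` are hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple

# Each reactor is described only by its port sets: name -> (inputs, outputs).
Ports = Dict[str, Tuple[FrozenSet[str], FrozenSet[str]]]


def owner(ports: Ports, port: str, kind: int) -> List[str]:
    """Names of reactors having `port` as input (kind=0) or output (kind=1)."""
    return [name for name, io in ports.items() if port in io[kind]]


def is_valid_network(ports: Ports, edges: Iterable[Tuple[str, str]]) -> bool:
    """Check the edge conditions of a timeless reactor network.

    Every edge (i, j) must lead from an output port i to an input port j,
    and the two ports must belong to different reactors (no self-edges).
    """
    for i, j in edges:
        src, dst = owner(ports, i, 1), owner(ports, j, 0)
        if len(src) != 1 or len(dst) != 1 or src[0] == dst[0]:
            return False
    return True
```

Ports that occur in no edge are not rejected by this check, in line with the remark that some ports are deliberately left disconnected in the timeless model.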
Timed Networks
We are finally ready to introduce time into the model. Reactors are based
on a logical time model of discrete events. We formalize logical time as
a totally ordered set of discrete timestamps, which is order-isomorphic
to the naturals $\mathbb{N}$ (or a finite subset). When two events happen at the
same time, we want to keep the total-order property to distinguish them.
For this, we use superdense time [MMP91; Pto14], which adds microsteps
at every time unit. Thus, time tags $t \in \mathbb{N} \times \mathbb{N}$ are lexicographically ordered
tuples of natural numbers, where the first number represents the
timestamp as ticks (in some specific unit of time), and the second number
represents microsteps. Physical time, on the other hand, we define
8 Note that this might change if we extend the model to include mutations.
as real numbers $\mathbb{R}$ to allow continuous-time physical models (e.g. Newtonian
mechanics). However, computation can only interact with physical
time at discrete time intervals. We compose these two types of time in a
single time object, a tag.
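Definition 6.3.4 (tag). A (time) tag $t \in \mathbb{T}$ is a value in the sum (type)
$\mathbb{T} := (\mathbb{N} \times \mathbb{N}) \oplus \mathbb{R} = (\mathbb{N} \times \mathbb{N}) \mathbin{\dot{\cup}} \mathbb{R}$, which is commonly also called the disjoint
sum9. The embedding for the first component $\mathbb{N} \times \mathbb{N} \hookrightarrow \mathbb{T}$ is called logical
time, and the embedding from the second component $\mathbb{R} \hookrightarrow \mathbb{T}$ is called
physical time. We say that $t$ is a logical or physical time tag respectively.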
Note that Definition 6.3.4 differs from [Loh+20c; Loh20]. The rationale
for this is that this definition gives us a uniform way of referring to time
while still distinguishing between logical and physical time. We could also
have defined $\mathbb{T} = (\mathbb{N} \times \mathbb{N}) \oplus \mathbb{N}$ taking into account only the discrete
measurements of time that are available to the digital component of the CPS.
This definition with the real numbers $\mathbb{R}$ instead allows the model to be
combined with continuous-time models of physical time, and it adds no
restrictions to our semantics.
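One possible way to represent such tags in code is a tagged union, sketched here in Python. The class names `Logical` and `Physical` are ours, and physical time is approximated by a floating-point number, which is a simplification of the real numbers used in the definition.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True, order=True)
class Logical:
    """Superdense logical tag: (timestamp in ticks, microstep)."""
    time: int
    microstep: int


@dataclass(frozen=True, order=True)
class Physical:
    """Physical time tag, a real number approximated by a float."""
    time: float


Tag = Union[Logical, Physical]  # disjoint sum: the class records the summand


def is_logical(tag: Tag) -> bool:
    """True for tags from the logical summand, False for physical ones."""
    return isinstance(tag, Logical)
```

With `order=True`, the dataclass compares its fields in declaration order, so logical tags are ordered lexicographically. Comparing a logical with a physical tag raises a `TypeError`, which matches the idea that the two kinds of time are kept apart.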
Reactions are, in a sense, controlled functions we compute from incom-
ing data. Some data we have no control over, like incoming input (e.g.
from a sensor), or an asynchronous computation we scheduled. To model
these we use actions. Note that these actions are more a model of (tagged)
data, as opposed to reactions, which are a model of computation. The
naming thus suggests a false dichotomy, since actions are fundamentally different from
reactions. Actions are more closely related to the input and output ports,
and the naming confusion might be thus easier to resolve when thinking
that, in this way, reactions react to actions.
Actions are central to the model, since they are the mapping between
the functional world of reactions and the time semantics. Definition 6.3.5
ensures actions do not mix the two different time types, and respect
causality (i.e. an action cannot change the past).
Definition 6.3.5. Let $\mathcal{T}_{\mathrm{discrete}} := \{ T \subseteq \mathbb{T} \mid T \text{ is discrete} \}$ be the set of
discrete subsets of $\mathbb{T}$. An action is a partial10 function $A : \mathcal{T}_{\mathrm{discrete}} \to S$
such that
• For all $T \in \mathrm{dom}(A)$, the discrete set of tags $T$ and the corresponding
sequence $A(T)$ are order-isomorphic (in particular, $|T| = |A(T)|$).
• All $T \in \mathrm{dom}(A)$ are either sets of logical or physical tags, i.e.
$T \subseteq \mathbb{N} \times \mathbb{N}$ or $T \subset \mathbb{R}$. We call $A$ a logical or physical action, respectively.
For a discrete set of times $T \in \mathcal{T}_{\mathrm{discrete}}$, with $\mathcal{T}_{\mathrm{discrete}}$ as in Definition 6.3.5,
we call $d(T) = \inf_{t < t' \in T} (t' - t)$ the minimum delay or spacing
of the time set $T$. Here, subtraction is to be understood component-wise,
and only up to 0, as it is sometimes [Run89] defined on the set of natural
numbers, i.e. $(t_1, t_2) - (t_1', t_2') := (\max(t_1 - t_1', 0), \max(t_2 - t_2', 0))$. For
an action $A$ we define $d(A) = \inf_{T \in \mathrm{dom}(A)} d(T)$. Note that the formulation
in [Loh20] distinguishes between the minimum delay as specified by the
programmer and the minimum (time) spacing as acceptable for the run-
time system. We consolidate both here, since, for simplicity, we disregard
policies for when this spacing is violated and error handling in general.
Consequently, we also do not model the spacing violation policy included
9 In the language of set theory, that we use by convention in this thesis.
10 See Section 4.1.4 for the required definitions.
in [Loh20]. We define it also for the time set and not the action, since our
semantics are denotational and not operational.
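For a finite set of logical tags, the minimum spacing can be computed directly. The following Python sketch uses the truncated, component-wise subtraction described above. Comparing the resulting differences lexicographically is an assumption of this sketch, and the names `monus` and `min_spacing` are hypothetical.

```python
from itertools import combinations
from typing import Iterable, Tuple

LogicalTag = Tuple[int, int]  # (timestamp, microstep)


def monus(t_late: LogicalTag, t_early: LogicalTag) -> LogicalTag:
    """Component-wise subtraction, truncated at 0."""
    return (max(t_late[0] - t_early[0], 0), max(t_late[1] - t_early[1], 0))


def min_spacing(tags: Iterable[LogicalTag]) -> LogicalTag:
    """Smallest difference t' - t over all pairs t < t' of a finite tag set.

    Differences are compared lexicographically, which is an assumption of
    this sketch. Requires at least two distinct tags.
    """
    ordered = sorted(set(tags))
    # combinations() keeps the sorted order, so `earlier` < `later` in each pair.
    return min(monus(later, earlier) for earlier, later in combinations(ordered, 2))
```

For example, the tag set `{(0, 0), (0, 1), (5, 0)}` has a minimum spacing of one microstep, `(0, 1)`, while `{(0, 0), (10, 0), (13, 0)}` has a minimum spacing of three ticks, `(3, 0)`.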
Definition 6.3.5 allows us to associate any given subset of times to a
different sequence of values. In particular, this allows us to model the
timestamps themselves being part of the value in the sequence, e.g. for
a reaction that stores the current time to a log file.
The order-preserving bijection between a discrete set of tags and
sequences ensures a causal execution. Since the mapping is order-preserving,
going forward in timestamps can only increase the port’s
history (the sequence of values). Similarly, adding tokens to the port’s his-
tory can only move forward in time. Moreover, since it is a bijection, it
means that adding tokens to the port’s history has to move forward in
time, and vice-versa. One step in the discrete set of tags corresponds to
exactly one value in the history. In particular, time has to advance every
time reactions are executed. This is why we need microstep delays in log-
ical time, so that we can execute events with identical logical timestamps.
For physical time we cannot have two events with identical timestamps,
but the timestamps can be arbitrarily close to each other, so this is not a
very strong restriction. Note that in [Loh20] physical time gets converted
to logical time when assigned a tag. In that formalism it is thus possible
for two physical actions to have values with identical tags, but the tags
would ultimately have different microstep units when executed, which is
an unavoidable source of non-determinism. This is not different from e.g.
adding a small enough $\epsilon > 0$ in this model.
Definition 6.3.6 (Reactor network). A reactor network is a tuple $(R, \tau)$,
where $R = (V, E, \xi)$ is a timeless reactor network and $\tau : I(R) \to \mathcal{A}$ is
a partial function from the identifier set (of ports) of $R$ to a set of actions
$\mathcal{A}$, such that for every $i \in I = I(R)$, exactly one of the following is true:
$i \in \mathrm{dom}(\tau)$ or there exists an edge $e$ in $E_R$, such that $e = (i, j)$ or $e = (j, i)$ for
an $i \neq j \in I$.
We call $\mathrm{im}(\tau) \subseteq \mathcal{A}$ the set of actions of the reactor network $(R, \tau)$. Here,
the mapping τ relates actions with all dangling ports in the timeless net-
work. The last condition on τ ensures that no ports are left dangling in
the (timed) reactor network.
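The exactly-one condition on ports can be phrased as a simple set comparison. In the following Python sketch, ports are plain strings, edges are pairs of ports, and the mapping from ports to actions is a dictionary whose values are left abstract. All names are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple


def dangling_ports(ports: Iterable[str],
                   edges: Iterable[Tuple[str, str]]) -> Set[str]:
    """Ports of the timeless network that do not occur in any edge."""
    connected = {p for edge in edges for p in edge}
    return set(ports) - connected


def is_valid_assignment(ports: Iterable[str],
                        edges: Iterable[Tuple[str, str]],
                        tau: Dict[str, object]) -> bool:
    """Exactly-one condition: a port is either mapped to an action by tau
    or connected by an edge, never both and never neither."""
    return set(tau) == dangling_ports(ports, edges)
```

The check fails both when a dangling port has no action and when an action is attached to a port that is already connected by an edge.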
Both our original description in [Loh+20c] and the updated one
in [Loh20] are very explicit about reaction and event queues, scheduling
and mutexes. These are very important aspects for any implementation
of the model, yet they conflate the implementation and the semantics.
Here we are interested mostly in the general concepts behind reactors;
the implementation is outside the scope of this thesis. As a consequence,
we rather err on the side of abstraction, by preferring to abstract away
details and clarify them in future work if necessary.
Definition 6.3.7 (Execution of reactor networks). Let $T \subseteq \mathbb{T}$ be a discrete
set of time tags and let $(R, \tau)$ be a reactor network. We denote by $R_\tau(T)$
the network obtained by substituting in the timeless network $R$ for each
port $i \in \mathrm{dom}(\tau)$ the sequence $(\tau(i))(T)$ (recall that $\tau(i)$ is an action). An
execution of $R$ with discrete set of time tags $T$ is a sequence $s_i$ for every
port $i \in I$ such that:
1. For every reactor $r = (N, D, D^\vee)$ in $V_R$ there exists a function
$f \in \biguplus_{D,D^\vee} N$ with $f(s_{i_1}, \ldots, s_{i_k}) = (s_{j_1}, \ldots, s_{j_m})$ where $i_1, \ldots, i_k, j_1, \ldots, j_m$
are the corresponding ports of $r$ according to the edges $E$ of $R_\tau(T)$.
2. For every $f$ as in 1, the sequences $(s_i)_{i \in I(R)}$ are a fix-point of the set
of equations defined by the network $R_\tau(T)$.
3. For every time value $t \in T$, at least one action $A$ is non-absent, i.e.
$|\{A \in \mathrm{im}(\tau) \mid A(t) \neq \bot\}| \geq 1$.
The execution of a reactor network in this way is not modeled as an iter-
ative process. The computation itself is modeled through the sequences
S in the Scott semantics of computation. The time values of actions A
are chosen (non-deterministically) for an execution, modeling the
non-determinism from the environment. Condition 1 in Definition 6.3.7
defines the execution priority of reactions. This, together with Condition 2,
ensures that reactions have well-defined semantics. Finally, Condition 3
ensures that at least one action is present at every time tag.
A central idea behind Reactors is to split logical and physical time ex-
plicitly. However, these two time concepts are conceptually linked, since
logical time is just a digital estimation of physical time. Thus, the reactor
runtime should strive to synchronize these two time concepts whenever
possible. This is realized by the requirement of executing events in timestamp
order, ensuring logical time never goes past physical time. Nothing
guarantees that the converse does not happen, however. Physical time
could go far beyond logical time. In an implementation, and indeed in the
formalization of [Loh20], a deadline in the reactions controls how far
logical time can lag behind physical time.
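The requirement that logical time never runs ahead of physical time can be illustrated with a small event queue. The following Python sketch is not the scheduler of any existing Reactor runtime. It assumes that events are triples of timestamp, microstep and a label, that the queue is a heap as maintained by the standard `heapq` module, and that physical time is given as an integer in the same unit as the timestamps.

```python
import heapq
from typing import List, Tuple

Event = Tuple[int, int, str]  # (timestamp, microstep, label), hypothetical


def release_events(queue: List[Event], physical_now: int) -> List[Event]:
    """Pop, in tag order, all events whose timestamp is not ahead of physical time.

    `queue` must be a heap (see heapq.heapify); events with a timestamp
    greater than `physical_now` stay in the queue.
    """
    released = []
    while queue and queue[0][0] <= physical_now:
        released.append(heapq.heappop(queue))
    return released
```

Events that are released late are still processed in tag order, so in this sketch logical time may lag behind physical time by an arbitrary amount. Bounding that lag is what a deadline is for.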
Note that the definition of reactor networks does not exclude any loops.
The fix-point-based definition allows us to have well-defined semantics
with such loops (cf. [Kah74; LM09]), as ensured by Condition 2 in Defini-
tion 6.3.7. In some cases, however, the least fix-point of the network might
result in an empty sequence. This can be the case when the ordering of
reactions causes a so-called causality loop. See Section 2.6 of [Loh20] for
a more thorough discussion. Also note that Condition 2 does not require
the fix-point to be minimal, but this is given by the order-isomorphism
condition on actions.
In [Loh20], reactors are explicitly required to have two special actions,
a startup and a shutdown action. We do not require these two actions
explicitly. An empty reactor network, that does nothing, is also a well-
defined reactor network, albeit a pretty useless one.
Conjecture 6.3.8 (Reactors are deterministic). Let $(R, \tau)$ be a reactor network
such that for every reactor $r = (N, D, D^\vee)$ the set $N$ is totally ordered.
Then for every execution $R_\tau(T)$ with a discrete time set $T$ the values of $s_i$
for every port $i \in I$ are uniquely specified.
Our formalization has allowed us to specify determinism in reactors
in a mathematically precise fashion. We believe in future work we can
prove Conjecture 6.3.8 by using fix-point theorems, in a fashion similar
to [Kah74].
This also gives us the language to discuss different kinds of determinism:
can the values be independent of the set of time tags $T$? In general,
they cannot. Consider a network which prints the timestamps it sees;
this will never be independent of the timestamps. On the other hand, if
every discrete sequence of timestamps is mapped to the same sequence
of values, i.e. $A(T) = A(T')$ for all actions $A$ and (valid) sequences of
timestamps $T, T'$, then the behavior is trivially time-deterministic. Note
that in this case, Definition 6.3.5 implies that $T \cong T'$ are order-isomorphic
and, consequently, $T = T'$, which is a very strict condition on the
network, that has to have a constant number of actions. There are cer-
tainly relaxations of this that allow us to define reasonable conditions for
time-determinism.
We can even go further and distinguish between logical and physical ac-
tions for determinism. For example, we can define a reactor network to
be time-deterministic if it only depends on the image sequences of phys-
ical actions A. The non-determinism from the physical world is outside
our control, but with this definition we are also ensuring logical actions to
behave deterministically as a function of the physical ones.
A final word on distributed execution, which we have ignored so far,
is due here. Our semantics are denotational; they are meant to describe
what is computed, not how. In particular, a distributed execution should
adhere to these semantics just as a sequential one. A fundamental problem
with using our model for distributed execution, however, is that time
tags are uniform in the model. Strictly speaking, we could replace
our Newtonian model of time with a relativistic one and consider differ-
ent frames of reference and transformations. The model, as is, can thus
be considered a model for a fixed (inertial) frame of reference.
In future work we plan to consider distributed execution and its conse-
quences (or lack thereof) on our denotational semantics. We also plan to
precisely identify conditions for these different possible definitions of de-
terminism and verify them, using the Lean theorem prover [Mou+15] and
a formalization similar to the one described here. As mentioned above,
this is ongoing (unpublished) work with Marcus Rossel.
6.3.1 Applications in 5G
Having defined Reactors formally, we consider some applications for the
model. In this section we will discuss Reactors in the 5G standard.
Telecommunication standards evolve constantly, pushing the limits of
signal processing systems from almost every angle. Consumer demands
adapt to increases in capabilities. This results in a feedback loop that not
only raises the demands themselves, but also their heterogeneity. In LTE
today we already see very dynamic demands, with different users requir-
ing very different bandwidths at different times. With the increased capa-
bilities of 5G, the dynamicity of the demand will only increase.
Signal processing systems, however, are not built for dynamic work-
loads; they must tolerate the worst case. This makes sense, since a sys-
tem that is capable of processing the highest demands can also process
lower demands. However, parameters like user count, resource blocks
supported, used MIMO scheme and carrier aggregation have a nuanced
relationship in terms of resource pressure. Additionally, the sub-carrier
spacing is also flexible in 5G systems. As a direct consequence, the real-
time requirements have to adapt to the changing transmission time in-
terval. All of this yields a parameter space with a large dynamic range of
possible workloads.
Figure 6.7 shows a simplified overview of the uplink modem in a base
station for 5G. We see that the overview already resembles MoCs like
dataflow or Reactors. Details on the requirements, like the sizes and num-
bers of FFT nodes depend significantly on the workload being processed.
However, the dependencies between the resources required for the base-
band processing and the parameters of the workload are non-trivial. Fig-









[Figure 6.10 node labels: Encoder, RateMatcher, Scrambler, Modulator, Precoder, SubCarrierMapper, SCFDMAModulator, ChannelReactor, SCFDMADemodulator, SubCarrierDemapper, EqualizerReactor, TransformDecoderReactor, Demodulator, Descrambler, RxRateMatcher, TurboDecoder]
Figure 6.10: The Reactor network of the modified WiBench benchmark in Lingua Franca.
ments, we must ensure that the changing system not only respects the
deterministic semantics of the decoder, but also the timing requirements.
This is why we propose to use a formal model of computation to describe
5G (and beyond). Using the model of Reactors we can make the execution
deterministic and timed. It also can help define well-behaved dynamic be-
havior through the use of mutations in future work.
Modeling 5G with Reactors
In ongoing (unpublished) work with Robert Wittig and Christian Menard,
we adapted the WiBench benchmark [Zhe+13] to work with Lingua Franca,
an implementation of Reactors. Figure 6.10 depicts the Reactor network
implementing this benchmark. Since WiBench is single-threaded, we only
compared to a single-threaded version in Reactors. In particular, we did
not leverage data-level parallelism throughout the layer, nor the pipeline
parallelism that we get from the network’s topology for free. This is a
worst-case assumption we made to analyze the overhead. By using the
Reactor model, the benchmark is deterministic, even if it were to run using
this parallelism [Loh+20c]. More importantly though, we can use the
model’s time semantics to define the constraints that ensure each subframe
is processed on time. Our implementation is still static (cf. Figure 6.10),
since we have not yet specified well-defined mutations. This implementation
presents a great opportunity for future work to research
and develop safe mutations for 5G.
Our implementation of the Reactor-based WiBench had an overhead
of 15% (median over 100 executions), compared to the baseline imple-
mentation of WiBench. There is certainly potential to improve this, e.g. as
the scheduler of the C++ implementation of Lingua Franca, used for this
implementation, was not optimized at all. Nevertheless, this is a purely
software-based implementation, so it serves only as a very rough estima-
tion of the overhead; it is best suited to study the model’s suitability and
develop Reactor mutations for adaptability. An efficient implementation
in practice could work with reconfigurable hardware, e.g. implementing
a Precision Timed (PRET) [EL07] machine, which is well-suited to Reactors’
semantics.
In general, these preliminary results open up many avenues for research
in adaptability in 5G. We can use the Reactors model, at the semantic
level, to support the necessary adaptability in 5G. Similarly, we can design
reconfigurable hardware that implements it.
Other applications: Automotive
The reactors model has many desirable properties for designing reliable
CPSs, which can be applied in a multitude of domains. An important exam-
ple is the automotive domain, where the high-performance requirements
of autonomous driving and modern entertainment are coupled with the
timed CPS including the car and its surroundings. To keep the scope of
this thesis limited, we omit a thorough discussion of an application of Re-
actors in the automotive domain. In [Men+20], we showed how we can
use the Reactors model to achieve determinism in the AUTOSAR Adaptive
Platform (AP), a modern automotive standard.

7 PROGRAMMING LANGUAGES
In this thesis we have discussed multiple Models of Computation (MoCs),
reasoning about their semantics and how to best deploy them on a partic-
ular hardware architecture. A natural question arising from this is, “how
do we program in these MoCs?”.
In Chapter 2, Section 2.1 we saw the C for Process Networks (CPN) language.
It is a DSL designed to describe dataflow programs with the KMQ
blocking-read semantics, with special annotations for SDF actors. Other
MoC-based languages exist, like the CAL actor language [EJ03], or Lingua
Franca [Loh+20b]. These languages allow “freedom from choice” [Lee19],
by enforcing a model that limits the ways in which to make mistakes, ide-
ally without compromising the expressiveness of what can be designed
with the model.
A common trade-off when designing programming languages is also
the question of expressiveness versus performance. High-level expres-
sive abstractions are often at odds with low-level performance optimiza-
tions. However, well-designed abstractions can use semantics-preserving
compiler transformations to still derive an efficient execution. The whole
principle of software synthesis can be seen as an instance of this.
This chapter discusses programming languages for defining and enforcing
the semantics of a MoC. After a short review of existing languages, it
focuses on the Ohua [Ert19] language, which defines dataflow implicitly. It
also discusses how we can leverage the language and its semantics to define
semantics-preserving transformations at a language level. We show
this for a use-case optimizing I/O on microservice-oriented architectures.
7.1 Freedom from Choice
This section reviews some programming languages and how they provide
“freedom from choice” in the sense of A. Sangiovanni-Vincentelli [Lee19].
There is a distinct sense in which this is the central question of programming
languages in general. By replacing manual memory management with
garbage collection and disallowing pointer arithmetic, Java frees its users
from multiple families of errors that are possible in C. Rust’s ownership
types take a different approach, also removing complete families
of memory-management based errors, without introducing large performance
overheads or unpredictable behavior from a garbage collector.
Elm [Cza12], on the other hand, exposes a functional paradigm with a
strong type-system for GUI development of web applications, which elim-
inates virtually all run-time errors.
These kinds of “freedom from choice” are beyond the scope of this the-
sis, which focuses on MoCs like those described in Chapter 6. In more con-
straining MoCs, like the ones discussed here, the temptation to break away
from the semantics might be higher. An interesting observation and discussion
of this phenomenon can be found in [TDJ13]. Not only does it show
that developers commonly break away from the semantics if these are
not enforced, but it also gives multiple explanations why. This is why we










7   (let [[x y] (split s)]
8     (let [[xout yout]
9           [(ifft (filter (fft x)))
10           (ifft (filter (fft y)))]]
11      (sink xout yout))))
12  (src))))
Listing 5: The Audio Filter Example written in Ohua
dataflow-like semantics with time triggering, and is thus closer to discrete
event models. On the discrete events side, as mentioned above, hardware
description languages like Verilog or VHDL have discrete-event semantics.
These languages are very widespread and are used commercially.
7.1.2 Implicit Dataflow
The languages surveyed so far are explicit about their abstractions: Ac-
tors, Reactors or Processes are declared explicitly. Similarly, channels de-
scribing the data flow are made explicit either through channel declara-
tions or through the connection of explicit ports. A programmer writing in
e.g. CPN or Lingua Franca has to have a model of the network describing
the application in their head (or in their IDE). Implicit abstractions, on the
other hand, work by generating implicit models from linguistic constructs
that don’t exhibit their structure directly.
Implicit abstractions, as we just defined them, are ubiquitous in pro-
gramming languages. Objects in object-oriented programming (OOP), for
example, are an implicit abstraction for data encapsulation that is funda-
mentally similar to actors. A thorough classification of these implicit mod-
els is outside the scope of this thesis. Instead, we will look closely at the
Ohua programming paradigm [Ert19], which derives a dataflow execution
from functional semantics.
The Ohua programming paradigm, by S. Ertel and others, is an implicit
model of concurrency. It can be used to express concurrency at a language
level, without explicit constructions like threads and locks. This
comes from lowering an Ohua program into a dataflow-based execution.
This model is not part of the original contribution of this thesis. We will
introduce it here as background material.
Ohua itself is a general paradigm that works on multiple languages, and
the framework has evolved over the years of its development. The version
of Ohua we will discuss here is based on Clojure and Java, but the Ohua
compiler and its principles work with many languages. Runtimes also exist
for Rust, JavaScript or Go, at different levels of maturity. Ohua is best
understood by diving directly into examples.
Consider the code in Listing 5. The code in the example is written in a DSL
embedded in Clojure, a dialect of Lisp. It implements the same example
from Chapter 2 (cf. Listing 2 or Figure 2.1), a two-channel audio filter.
Internally, the compiler transforms this code into a dataflow graph (similar to
that depicted in Figure 2.1) for execution. A special function, ohua, annotates
the AST it receives as argument to be executed as implicit dataflow.
The smap function is a special variant of map that considers state in the
functions. We will discuss the semantics of smap in Section 7.2. Finally, the
algo definition in Ohua is akin to the anonymous function definition fn in
Clojure. It defines Ohua “algorithms”, which are transformed to dataflow
actors. As a MoC, this can be embedded in the Dennis dataflow models
discussed in Chapter 6.
The example in Listing 5 can be transformed into a dataflow graph for
execution. The main advantage of this transformation is that a dataflow
graph exposes concurrency, which can be exploited e.g. in a parallel execution
or for optimizing I/O (cf. Section 7.3). This duality between code
and dataflow graphs is a core concept behind Ohua. The other central pillar
of the Ohua design concept is stateful functions, an abstraction that
encapsulates functions with state and side-effects in the context of their
dataflow execution.
7.1.3 Stateful Functions
The functional programming community has made the distinction be-
tween pure and impure functions widespread. A pure function is a func-
tion in the mathematical sense of the word: it receives a certain input and,
deterministically, produces an output. This could be as simple as negat-
ing a boolean value, or as complicated as inference with a gargantuan
deep neural network. The main point is that the only observable effect of a
pure function is that it returns a value, determined entirely by its inputs.
In most imperative languages, like C or Java, functions usually also have
side-effects. Writing the output to the terminal, storing data in a global
data structure or even reading data from a sensor in a CPS are all
examples of side effects. A language that only allows pure functions is
basically useless, since even printing the result of a computation is impure.
Stateful functions are a special abstraction, where the concept of pure
functions is extended to consider the state of the computation. While this
excludes aspects like the time of the computation and side-effects like
actuation, it is general enough to cover large classes of functions used in
most software. A stateful function is a function $f : a \to b$ and an abstract
state $S$, where the execution of the function can be seen as dependent
on the state, which it also modifies. In other words, we consider $f$ as a
function:
$$f : a \times S \to b \times S \tag{7.1}$$
Pure functions can be seen as a special case of stateful functions, with a
trivial state $S = \{*\}$. Listing 6 shows an example of a stateful function,
written in Java, which is identified as such by the @defsfn annotation [EAC18].
It models a parser, which writes the parsed symbols to a symbol table.
The table, a private object of the class, is the state of this stateful function.
It is implicitly managed as the state by the Ohua runtime.
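
The shape of Equation 7.1 can be made concrete with a small Haskell sketch. It is only an illustration under our own assumptions: the names Stateful, SymbolTable, parseSymbol and pureAsStateful are hypothetical and are not part of Ohua or its runtime, and the toy parser merely mimics the idea of a symbol table as state.

```haskell
-- A stateful function in the sense of Equation 7.1: f : a x S -> b x S.
type Stateful s a b = a -> s -> (b, s)

-- Hypothetical symbol table, standing in for the state S of a parser.
type SymbolTable = [(String, Int)]

-- Toy "parser": looks up a symbol and registers it with a fresh
-- identifier if it has not been seen before.
parseSymbol :: Stateful SymbolTable String Int
parseSymbol name table =
  case lookup name table of
    Just ident -> (ident, table)
    Nothing    -> let ident = length table
                  in (ident, (name, ident) : table)

-- A pure function is the special case with the trivial state S = {*},
-- modeled here by the unit type ().
pureAsStateful :: (a -> b) -> Stateful () a b
pureAsStateful f x () = (f x, ())

main :: IO ()
main = do
  let (i1, t1) = parseSymbol "x" []
      (i2, t2) = parseSymbol "y" t1
      (i3, _)  = parseSymbol "x" t2
  print (i1, i2, i3)
  print (fst (pureAsStateful not True ()))
```

The second lookup of "x" returns the identifier assigned by the first call, which is exactly the dependency between invocations that the state introduces.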
7.2 Stateful Parallelism
Software synthesis for multicores is effective because we use MoCs that expose
concurrency. The most natural way of leveraging the exposed concurrency
is parallelism. When writing code using explicit models in the
dataflow family, like KPN or SDF, the concurrency also becomes explicit.
However, in an implicit language like Ohua, we need to be careful to consider
stateful computation when extracting concurrency.

public SymbolObject parse(ExpressionObject expr) {
Listing 6: An example of a stateful function.

Ohua uses a special operator, smap, to derive concurrency from stateful
computation. The principle behind smap is that it extends the higher-order
map function to consider the state of the function it maps. This adds a
dependency between multiple executions of the same (stateful) function.
For a single function mapped over a collection, this principle exposes no
concurrency. There is none, in general. However, when we compose multiple
functions in a map, we get a different picture.
Consider three (stateful) functions, $f : a \to b$, $g : b \to c$ and $h : c \to d$
with respective states $S_f$, $S_g$ and $S_h$, which we thus model as functions:
$$s_f : a \times S_f \to b \times S_f, \quad s_g : b \times S_g \to c \times S_g, \quad s_h : c \times S_h \to d \times S_h$$
We use the prefix $s$ to distinguish the stateful version of the function
(with the state dependencies explicit). Then, we get the dependency graph
for the execution of the (Haskell) expression map (f.g.h) inputs as depicted
in Figure 7.3.

Figure 7.3: Dependencies of (map (f . g . h) inputs). Adapted from Figure 5
in [Ert+19b]
The pattern we see in Figure 7.3 is very similar to the higher-order function
scan on the state. Intuitively, if we consider the function sf' that only
returns the state of sf, then the state $S_{f,i} := (S_f)_i$ corresponds to the
i-th value of the expression scanl sf' $S_{f,0}$ input. Threading the state
around f explicitly can be achieved with the functional pattern known as
a monad, in a fashion similar to the state monad in Haskell. Two different
concrete implementations of this principle in Haskell are discussed
in [Ert+19b]. These implementations are beyond the scope of this thesis.
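
The correspondence between the state sequence and a scan can be sketched as follows. This is a minimal illustration under our own assumptions, not one of the implementations from [Ert+19b]: the running-sum function sf and the helper smapSeq are hypothetical, and smapSeq is only a sequential stand-in for smap on a single stateful function.

```haskell
import Data.List (mapAccumL)

-- Stateful function with the state made explicit: sf : a x S_f -> b x S_f.
-- Hypothetical example: a running sum that outputs the new total.
sf :: Int -> Int -> (Int, Int)
sf x s = (s + x, s + x)

-- sf' only returns the state; its argument order matches the step
-- function expected by scanl.
sf' :: Int -> Int -> Int
sf' s x = snd (sf x s)

-- The sequence of states S_{f,0}, S_{f,1}, ... is a scan over the inputs.
states :: Int -> [Int] -> [Int]
states s0 = scanl sf' s0

-- Sequential stand-in for smap on one stateful function: thread the
-- state through the collection and collect the outputs.
smapSeq :: (a -> s -> (b, s)) -> s -> [a] -> (s, [b])
smapSeq f = mapAccumL (\s x -> let (y, s') = f x s in (s', y))

main :: IO ()
main = do
  print (states 0 [1, 2, 3])
  print (smapSeq sf 0 [1, 2, 3])
```

Each element of the scan depends on the previous one, which is why a single stateful function mapped over a collection exposes no concurrency by itself.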
The result of this monadic composition of state threads, however, is that
we can write virtually the same expression as above in a monadic composition:
smap (f >=> g >=> h) inputs
The >=> operator is the monadic equivalent of function composition
with the . (dot) operator. This yields the dependencies explicitly. Similarly,
these state threads and their composition can be formalized using category
theory [Ert+19a]. This formalization is rather technical. We will only
sketch it here.
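
To illustrate the monadic composition, the following sketch defines a minimal state monad and gives smap a purely sequential reference semantics via mapM. All of this is an assumption made for illustration: the State type is defined locally to avoid external dependencies, the functions f, g and h are hypothetical, and the sketch does not extract any parallelism as Ohua does. Note that >=> composes from left to right, whereas with the types given above ordinary composition would be written h . g . f in Haskell.

```haskell
import Control.Monad ((>=>))

-- Minimal state monad, defined locally to keep the sketch self-contained.
newtype State s a = State { runState :: s -> (a, s) }

instance Functor (State s) where
  fmap fn (State run) = State $ \s -> let (a, s') = run s in (fn a, s')

instance Applicative (State s) where
  pure a = State $ \s -> (a, s)
  State runF <*> State runA = State $ \s ->
    let (fn, s')  = runF s
        (a, s'')  = runA s'
    in (fn a, s'')

instance Monad (State s) where
  State run >>= k = State $ \s -> let (a, s') = run s in runState (k a) s'

-- Global state as a product of three fundamental states (S_f, S_g, S_h).
type Global = (Int, Int, Int)

-- Hypothetical stateful functions; each touches only its own component.
f :: Int -> State Global Int      -- counts its calls and adds the count
f x = State $ \(sf, sg, sh) -> (x + sf + 1, (sf + 1, sg, sh))

g :: Int -> State Global Int      -- accumulates a running sum in S_g
g x = State $ \(sf, sg, sh) -> (sg + x, (sf, sg + x, sh))

h :: Int -> State Global String   -- remembers the largest value seen
h x = State $ \(sf, sg, sh) -> (show (max sh x), (sf, sg, max sh x))

-- Sequential reference semantics for smap: map a Kleisli arrow over the
-- inputs while threading the state from one element to the next.
smap :: (a -> State s b) -> [a] -> State s [b]
smap = mapM

main :: IO ()
main = print (runState (smap (f >=> g >=> h) [1, 2, 3]) (0, 0, 0))
```

Since f, g and h only touch disjoint components of the state, the only dependencies are between successive calls of the same function, which is the pipeline structure of Figure 7.3.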
Let $\mathcal{C}$ be a Cartesian closed category, which is a technical condition
that in a model-theoretic interpretation of categories corresponds to the
typed $\lambda$ calculus [Hue85]. We can think of $\mathcal{C}$ as the values and functions
of the language, like the category of Haskell types Hask⁴. A Cartesian
closed category has a terminal object $\bot \in \mathrm{Obj}(\mathcal{C})$, and any two objects
$B, C \in \mathrm{Obj}(\mathcal{C})$ have a product $B \times C$ and an exponential $B^C$. These
constructions are defined via universal properties in commutative diagrams and
are rather technical. We will omit the precise definitions here for space
reasons. It suffices to say they correspond with the known constructions,
e.g. the product is the Cartesian product in the category Set of sets.
The main idea of formalizing and dealing with state threads is to index
them. We do this through a (countable) index set $N \subseteq \mathbb{N}$, which
for practical purposes we can also think of as being finite. We “split” the
state into local states which correspond to the indices in $N$. Formally, let
$S_i \in \mathrm{Obj}(\mathcal{C})$, $i \in N$ be pairwise distinct (i.e. $i \neq j \Rightarrow S_i \neq S_j$). We define for
$I \subseteq N$ the state object $S_I = \prod_{i \in I} S_i$ as the product of the $S_i$ for all $i \in I$. If
$I = N$, we call the state object $S_N$ the global state. The individual states
$S_i := S_{\{i\}}$ for $i \in N$ we call fundamental states. We thus formally define a
state thread as a morphism:
$$f : (a \times S_I) \to (b \times S_I), \quad \text{for an } I \subseteq N$$
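This definition formalizes the intuition behind Equation 7.1. It is justified
by Lemma 7.2.1.
Lemma 7.2.1 (Lemma 1.3 of [Ert+19a]). The following define the objects
and morphisms of a subcategory $\mathcal{S}$ of $\mathcal{C}$,
$$\mathrm{Obj}(\mathcal{S}) = \{ a \times S_I \mid a \in \mathrm{Obj}(\mathcal{C}), I \subseteq N \}, \tag{7.2}$$
$$\mathrm{Morph}(\mathcal{S}) = \{ f : (a \times S_I) \to (b \times S_I) \mid f \in \mathrm{Morph}(\mathcal{C}), I \subseteq N \}. \tag{7.3}$$
Proof. Since $\mathcal{C}$ is a category, and as such the composition of morphisms
behaves as required, it suffices to show that morphisms respect the
structure of the subcategory. It is clear that $\mathrm{id}_{a \times S_I} \in \mathrm{Morph}(\mathcal{S})$ for every
$a \in \mathrm{Obj}(\mathcal{C})$, $I \subseteq N$, since $\mathrm{id}_a \in \mathrm{Morph}(\mathcal{C})$. Let $f, g \in \mathrm{Morph}(\mathcal{S})$ be such that
$g \circ f$ is defined in $\mathcal{C}$. Then it has to hold that there are $a$, $b$ and $c \in \mathrm{Obj}(\mathcal{C})$
as well as $I \subseteq N$, such that $f : (a \times S_I) \to (b \times S_I)$ and $g : (b \times S_I) \to (c \times S_I)$,
since $g \circ f$ is defined in $\mathcal{C}$. But then $g \circ f : (a \times S_I) \to (c \times S_I)$ is in $\mathrm{Morph}(\mathcal{S})$
by definition of $\mathcal{S}$.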
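
The closure property stated in the lemma can be mirrored in Haskell types, thinking of Haskell functions as the morphisms of the ambient category. This is only an informal sketch under that assumption; the names StateThread, idThread, composeThread, countChars and isLong are hypothetical, and the type checker accepting composeThread plays the role of the proof.

```haskell
-- A state thread over a fixed state object s: a morphism (a x s) -> (b x s).
type StateThread s a b = (a, s) -> (b, s)

-- The identity on a x s is itself a state thread.
idThread :: StateThread s a a
idThread = id

-- Composing two state threads over the same s with ordinary function
-- composition yields a state thread again.
composeThread :: StateThread s b c -> StateThread s a b -> StateThread s a c
composeThread g f = g . f

-- Hypothetical threads over an Int counter state.
countChars :: StateThread Int String Int
countChars (str, n) = (length str, n + 1)

isLong :: StateThread Int Int Bool
isLong (len, n) = (len > 3, n + 1)

main :: IO ()
main = print (composeThread isLong countChars ("dataflow", 0))
```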




In microservices, I/O plays a crucial role in the performance. A microser-
vice will commonly send multiple requests to a different microservice,
as part of its operation. Each request comes with a significant overhead
from establishing the connection and sending the data. If the requests do
not depend on each other, however, they can instead be sent as a single,
batched request for mitigating the overhead.
Batching requests is not a novel idea; it is well-established as a technique
to optimize I/O. The trade-off comes from the code required to write
batched requests. A developer writing code for such a microservice-based
architecture needs to take both the functionality and the I/O optimizations
into account. This makes the code harder to write, read and maintain, as
the optimizations obscure the functionality. This is unacceptable
in a context where development time is a highly valuable resource,
be it for human-resource costs or because it is important to have
a working solution as quickly as possible.
The situation described is the case for Facebook’s spam-fighting ser-
vices [Mar+14]. When fighting spam, a novel filter must not only be ef-
fective and perform efficiently, it also should be implemented as fast as
possible without compromising the functionality. An ideal spam-fighting
system thus allows the developers to focus on the functionality, and op-
timizes the implementation without cluttering the code. This is precisely
what the Haxl system attempts, using the Haskell abstraction of applica-
tive functors. Consider the Haskell code snippet (taken from [Mar+14]) in
Listing 7:
let numCommonFriends =
      length (intersect (friendsOf x) (friendsOf y))
in
  if numCommonFriends < 2 && daysRegistered x < 30
  then...
  else...
Listing 7: An example of a spam-fighting request to be optimized by Haxl (from
[Mar+14]).
The listing shows a code example where, for spam fighting, the number
of common Facebook friends of two users is calculated. This is done by
calling the function friendsOf for both x and y. Code like Listing 7 is easy
to read, but unoptimized, it will send two requests to the microservice that
handles the friendsOf function. The solution of Haxl is to use applicative
functors to compose I/O calls, such as friendsOf, and automatically batch
independent requests this way. Listing 8 shows how this is achieved in
Haxl using applicative-do notation.
do a <- friendsOf x
   b <- friendsOf y
   return (length (intersect a b))

Under the hood, the applicative functor definition for friendsOf in Haxl
gathers the arguments x and y and makes a single, batched request. This
optimizes the execution with a minimal obfuscation of the code: A developer
has to switch from a pure functional style to an applicative style, but
can otherwise focus on the semantics of the program.

)) in ...
Listing 9: The request from Listing 7 in Ÿauhau.
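
The batching behavior of the applicative style described for Haxl can be sketched with a small applicative functor. This is a simplified illustration of the general idea, not Haxl's actual API: the type Fetch, its constructors, the runner runFetch and the in-memory backend are all assumptions chosen for this example.

```haskell
import Data.List (intersect)

type UserId = Int

-- A computation is either finished or blocked on a batch of friendsOf
-- requests, with a continuation that consumes the batched responses.
data Fetch a
  = Done a
  | Blocked [UserId] ((UserId -> [UserId]) -> Fetch a)

instance Functor Fetch where
  fmap fn (Done a)       = Done (fn a)
  fmap fn (Blocked rs k) = Blocked rs (fmap fn . k)

instance Applicative Fetch where
  pure = Done
  Done fn      <*> x              = fmap fn x
  Blocked rs k <*> Done a         = Blocked rs (\resp -> k resp <*> Done a)
  -- Both sides are blocked: merge their requests into a single batch.
  Blocked rs k <*> Blocked rs' k' =
    Blocked (rs ++ rs') (\resp -> k resp <*> k' resp)

friendsOf :: UserId -> Fetch [UserId]
friendsOf u = Blocked [u] (\resp -> Done (resp u))

-- Hypothetical backend: a fixed table standing in for the remote service.
backend :: UserId -> [UserId]
backend 1 = [2, 3, 4]
backend 2 = [3, 4, 5]
backend _ = []

-- Run to completion, returning the result and the number of round trips.
runFetch :: Fetch a -> (a, Int)
runFetch = go 0
  where
    go n (Done a)       = (a, n)
    go n (Blocked rs k) =
      let table  = [(r, backend r) | r <- rs]   -- one batched round trip
          resp u = maybe [] id (lookup u table)
      in go (n + 1) (k resp)

numCommonFriends :: UserId -> UserId -> Fetch Int
numCommonFriends x y =
  (\a b -> length (intersect a b)) <$> friendsOf x <*> friendsOf y

main :: IO ()
main = do
  print (runFetch (numCommonFriends 1 2))
  print (snd (runFetch (friendsOf 1)) + snd (runFetch (friendsOf 2)))
```

In this sketch the applicative composition of the two friendsOf calls needs one round trip, whereas issuing them separately needs two.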
Our central observation is that these I/O optimizations all come “for free”
from the exposed concurrency in a dataflow execution. If we define a
dataflow operator that gathers the inputs and issues a single batched request,
we get the composition from the semantics of dataflow. This is the
main idea behind Ÿauhau.
Ÿauhau is based on the iteration of Ohua embedded in Clojure. It is
a language with an explicit annotation for functions which perform I/O.
Leveraging these annotations, a semantics-preserving transformation can
minimize the number of independent I/O-performing function calls. We
map the Clojure-based Ohua to an internal expression IR, as is shown in
Figure 7.5.
x $\mapsto x$
(let [x t] t) $\mapsto \mathrm{let}\ x = t\ \mathrm{in}\ t$
(io x) $\mapsto \mathrm{io}(x)$
(fun [x] t) $\equiv$ #(t) $\mapsto \lambda x.t$
(t t) $\mapsto t\ t$
(f x1 ... xn) $\mapsto \mathrm{ff}_f(x_1 \ldots x_n)$
(if t t t) $\mapsto \mathrm{if}(t\ t\ t)$
Figure 7.5: Mapping the terms of the Clojure-based language to an expression IR.
Adapted from Figure 9 in [Ert+18].
The expression IR defined in Figure 7.5 is based on $\lambda$ calculus with a let
construction for lexical scoping and explicit conditionals (if). The central
innovations of the language are two particular terms. The first is ff, for
foreign functions, which stands for any (possibly stateful) function. The
other term is io, which is an explicit annotation that a function does I/O.
The premise is to minimize the number of annotated I/O calls leveraging the
concurrency from the dataflow semantics derived from this language. The example in
Listing 9 lists the same request as Listing 7 in Ÿauhau.
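
One possible way to represent such an expression IR is as an algebraic data type. The following Haskell sketch is an assumption made for illustration and not the data structure used by the Ÿauhau compiler: the constructor names, the example term and the helper countIo are hypothetical.

```haskell
-- Terms of an expression IR in the style of Figure 7.5.
data Expr
  = Var String             -- x
  | Let String Expr Expr   -- let x = t in t
  | Lam String Expr        -- \x. t
  | App Expr Expr          -- t t
  | If Expr Expr Expr      -- if(t t t)
  | Io Expr                -- io(x): explicit I/O annotation
  | FF String [Expr]       -- ff_f(x1 ... xn): foreign function call
  deriving (Eq, Show)

-- The common-friends request, written with two separate I/O calls.
example :: Expr
example =
  Let "a" (FF "friendsOf" [Io (Var "x")]) $
  Let "b" (FF "friendsOf" [Io (Var "y")]) $
  FF "length" [FF "intersect" [Var "a", Var "b"]]

-- Count the io-annotated calls in a term, i.e. the quantity that the
-- batching transformation tries to reduce.
countIo :: Expr -> Int
countIo expr = case expr of
  Var _     -> 0
  Let _ e b -> countIo e + countIo b
  Lam _ b   -> countIo b
  App fn a  -> countIo fn + countIo a
  If c t e  -> countIo c + countIo t + countIo e
  Io e      -> 1 + countIo e
  FF _ args -> sum (map countIo args)

main :: IO ()
main = print (countIo example)
```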
The I/O optimizations in Ÿauhau come from batching requests. Instead
of calling $\mathrm{ff}_{\mathit{friendsOf}}(\mathrm{io}(x))$ and $\mathrm{ff}_{\mathit{friendsOf}}(\mathrm{io}(y))$ as two separate I/O
requests, we can call them as a single batched request, since they are
independent. In Ÿauhau we do so by introducing a batched I/O statement,
bio. The bio statement takes a list of arguments and does a batched
call with this list. In the example, the two statements become a single
$\mathrm{ff}_{\mathit{friendsOf}}(\mathrm{bio}([x, y]))$.
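
The effect of introducing bio can be sketched on a flat sequence of let bindings. The sketch below is a deliberately simplified stand-in and not the transformation implemented in Ÿauhau: it assumes that independent I/O calls have already been made adjacent, and the types Call and Binding as well as the function batchIo are hypothetical.

```haskell
-- Right-hand sides of a flat sequence of let bindings.
data Call
  = IoCall String String   -- ff_f(io(x)): function name and argument variable
  | Bio String [String]    -- ff_f(bio([x1, ..., xn])): one batched call
  | Pure String [String]   -- any call without I/O, applied to variables
  deriving (Eq, Show)

-- A binding introduces one result variable per (batched) argument.
type Binding = ([String], Call)

-- Merge runs of adjacent I/O calls to the same function into one bio
-- call, as long as no argument depends on a result bound in that run.
batchIo :: [Binding] -> [Binding]
batchIo [] = []
batchIo (([v], IoCall fn arg) : rest) = grow [v] [arg] rest
  where
    grow vs args (([v'], IoCall fn' arg') : more)
      | fn' == fn && arg' `notElem` vs = grow (vs ++ [v']) (args ++ [arg']) more
    grow vs args more = (vs, emit args) : batchIo more
    emit [a] = IoCall fn a       -- a single call stays unbatched
    emit xs  = Bio fn xs
batchIo (b : rest) = b : batchIo rest

main :: IO ()
main = mapM_ print (batchIo
  [ (["a"], IoCall "friendsOf" "x")
  , (["b"], IoCall "friendsOf" "y")
  , (["n"], Pure "numCommon" ["a", "b"])
  ])
```

A call whose argument is the result of an earlier I/O call in the same run is left unbatched, since the two requests are then not independent.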
Through semantics-preserving transformations using let floating we
can get independent I/O calls batched this way and transform them to
dataflow. The explicit concurrency in the dataflow transformation allows us
to execute batched I/O calls when they arrive. The details of the transfor-




notably however, both Haxl and Muse significantly increase the number
of I/O requests when the program becomes more nested, while in Ÿauhau
this is not the case. This is because of the underlying dataflow MoC, which
flattens the dependency graph, effectively inlining the sub-function calls
and allowing the framework to batch across function boundaries. In particular,
this means again that the premise of freeing developers from worrying
about performance in the DSL is partially broken by Haxl and Muse, and
this is fixed by Ÿauhau.
We have seen how MoC-based design can be useful in a different context,
besides the CPS applications we focused on in the previous chapters.
In particular, the Ohua approach can be used to make the model
more implicit in the computation, which is crucial for developer adoption
in many contexts. Finally, we have also seen in this section some concrete
advantages of random benchmarks with a clearly-defined structure,
in the concrete example of the Level Graphs from Section 3.3. They allowed
us to tailor experiments to isolate specific features of our dataflow
MoC-based approach and compare them to the other state-of-the-art approaches,
which would not have been possible otherwise.

8 R E L A T E D W O R K
The scope of this thesis is too broad for us to discuss related work
from a single, general point of view. Perhaps the closest in spirit to the work
presented in this thesis is the Ptolemy II project [Pto14]. It is a tool for ex-
ploring model-based design and has a strong focus on CPSs. Ptolemy II is
very comprehensive and studies and implements several MoCs discussed
in this thesis (cf. Section 6.1). The scope of Ptolemy II is far larger and more
detailed than this thesis. It is, however, aimed at application developers.
In contrast, many methods in this thesis (chapters 4-5) are more focused
on tool developers, for improving the methods enabling model-based de-
sign.
Instead of discussing related work generally, we will go over the different
methods proposed here to improve model-based design, and discuss
related work by broader categories. In most chapters and sections
we discuss related work directly. While we will systematically go over related
work here, we refer to the corresponding chapters and sections for discussion.
8.1 Dataflow-based Software Synthesis
There are many tools for software synthesis, as we discuss in Section 2.6.
We have mentioned [Lin98] which uses Petri Nets, or [RPM92] based
on SDFs. The flows in [BLM00; Pin+95; BML12] are generally based on
dataflow.
More recently, there is SystemCoDesigner [Hau+08], which is based on
SystemC and targets Field Programmable Gate Arrays (FPGAs). The same
is the case for the dataflow-based CAPH [SBA13] framework. On the soft-
ware side, there is the Turnus [Cas+13] flow, which is based on RVC-CAL.
Also relevant is the PREESM [Pel+14] flow, which is based on parametrized
extensions of SDF models. These flows are all related to MAPS [CLA11], and
its spinoff in Silexica, which is the KPN-based software synthesis flow that
we focus on in this thesis, and describe in Chapter 2. As such, the more
closely related tools are KPN-based flows, like Sesame [Erb+07], with the
related ESPAM [SDN06] and Daedalus [Nik+08]. Similarly, the DOL [Thi+07]
is a closely related KPN-based flow.
While we did not propose a new software synthesis flow, the methods
in this thesis and in mocasin are related to methods implemented in these
diverse flows. To the best of our knowledge, there are no systematic com-
parisons of approaches and heuristics in terms of their performance, as
we argue in [Goe+16]. The survey in [Sin+13] is a systematic comparison of
mapping approaches at an abstract level, but it does not execute and compare
the different heuristics in benchmarks. The work in [Bra+01], on the other
hand, does execute and compare heuristics, but these are from
before the multicore era and not as directly related.
8.2 Mapping Space Structures
In [TP13], Thompson and Pimentel exploit the mapping space structure
explicitly, using both symmetries of the problem and a kind of metric
for the mapping space in the form of operators for genetic algorithms.
These can both be seen as special cases of the structures in Chapter 4, albeit
for a simpler case with homogeneous architectures. The work from
Richthammer and others [RG18; RFG20] is very similar in nature to the applications
discussed in Chapter 5. They also aim to improve DSE methods
in an algorithm-agnostic fashion, although the concrete structure they
exploit is different. Their methods are orthogonal to ours (and could be
combined). Less directly related are approaches for pruning the design
space in general settings, outside the mapping problem [WS04].
8.2.1 Symmetries
Symmetries have been explored in software synthesis implicitly in many
cases, e.g. in [HT01; Kre+05; Sin+10; Rol+15]. In fact, when researchers or
developers just distinguish between core types in architectures with sim-
ple memory subsystems they are implicitly considering the symmetries
of the problem. The problem becomes more difficult when the architec-
ture topologies are more complex. The authors of [Sch+17] also consider
symmetries in DSE, albeit in a more ad-hoc fashion (without the mathematical
theory of groups or semigroups). In a related idea, [Wei+16]
also introduces the concept of “shapes”, which are a special case of
the symmetries exploited in the TETRiS method, but are limited to meshes in
NoCs. For some applications, the symmetries have also been considered
explicitly [Coh88], but not systematically like we do in this thesis.
Methods from group theory have also been used to exploit symmetries
for problems in computer science and engineering before, in ways that
are very similar to the methods discussed in this thesis [Cra+96; Cla+98].
In particular, some of our methods are inspired by the usage of wreath
products in model checking [DM09].
8.2.2 Distances
Distances between mappings are commonly described in NoC-based sys-
tems. For example, the heuristic described in [Sin+10] considers mappings
in NoC systems and uses the number of hops in the topology to find them.
This strategy is common in many approaches. In [Wei+14], for example,
the authors encode a related notion of distance in the constraints of their
operating points. To the best of our knowledge, explicit low-distortion em-
beddings have not been used in this context before.
Robustness in computation is a broad subject and much work has been
done on different aspects of it, although most of it does not consider robust
mappings explicitly. The work of [ZK11; Zha19] defines mapping migration
strategies, without focusing on a DSE for robust mappings. A strategy for
finding robust mappings explicitly was proposed in [Che+16], using redun-
dancy instead of the geometry of the mapping space. This allows for more
robustness but requires more resources. Design centering methods, like
the ones we used to find robust mappings, have also been used in many
other disciplines in engineering, e.g. integrated-circuit design [Che+15].
To the best of our knowledge, the only other work defining compactness
of mappings explicitly is [Yan+10]. However, the quality of this work,
including the heuristic and its evaluation, is dubious. The idea behind
compactness of mappings is composability of applications, which, on the other
hand, has been seriously researched with other methods. CoMPSoC
[Han+09] or the work in [Kum+08] deal with composability of applications.
They do not do so using the geometry of the mappings¹, however,
but using sophisticated hardware support instead.
8.3 Run-time and hybrid approaches
There are many run-time and hybrid flows related to our TETRiS approach.
The flows proposed in [Cas+10; Kan+14; Zhu+16], for example, also target
multi-application mappings. These methods all rely on
statically knowing all applications at compile time and calculating joint
mappings, which neither scales nor works in more dynamic systems.
The approaches from DAARM [Wei+14] or Spider [Heu+14], or the methods
proposed in [MMB07; QP15] are all hybrid approaches. As such, they
all address the problem with static approaches discussed above, as TETRiS
does. These rely on different methods for finding the final mappings at
run-time and have different advantages. Most works have the architecture
model implicit in the flow (cf. Section 2.3), e.g. [MMB07] or [Wei+14], which
assumes regular NoC meshes in its problem formulation. A
distinct advantage of our symmetries-based approach to hybrid mapping
is that it uses a general architecture model, which works for arbitrarily
complex architectures.
On the side of run-time adaptivity, the MacQueen gap is generally ignored
in literature, where KPNs are equated with the KMQ blocking-read semantics.
We were not the first to recognize the gap, however. The two
models are also treated as separate in [LM09]. There are many other
related ways of adapting MoCs at run-time. Models like SADF [The+06] or
Multi-Alternative Process Networks [BJC21] do this by modeling the adap-
tivity explicitly in the graph. The AdaPNet model [Sch+14] defines more
comprehensive transformations and is very general and flexible.
On the side of applications to 5G, almost the entire field of research
in telecommunications investigates methods for adapting to the dynamic
demands of upcoming applications. Models are common for different
areas of the field, but their use is generally not proposed at a system level
as we do. An example that is proposed at a system level is the Nucleus
project [Cas+11], which is based on the MAPS software synthesis flow
and KPNs. Since our work is still in progress, it is not possible to provide a
detailed examination of advantages and disadvantages of our proposed approach
using Reactors, compared to established methods, as we have yet
to fully examine and understand these. A good overview of model-based
approaches in modem design can be found in [Gat+20].
8.4 Other model-based design tools
The tools above focus on software synthesis in a flow similar to the one
described in this thesis, using KPN or dataflow models and DSE to find profitable
mappings for executing applications in modern hardware. There
are other tools and languages which are focused more on the model’s
semantics. Besides Ptolemy II [Pto14], these include the CAL [EJ03] language
and the related RVC-CAL compiler. Synchronous languages like LUSTRE
[PHP87] or ESTEREL [BD91], which have discrete-event semantics, are
also relevant. These languages are related to the Reactors model.
Even hardware description languages like VHDL or Verilog, and even
SystemC [Mue+01], can be seen as related, in this case through HLS, which
is in fact an inspiration for the term software synthesis. Also on the commercial
side, Signal Processing Work System and Synopsys System Studio,
both from Synopsys, join the LabVIEW Communications System Design
Suite or Matlab Simulink [KKM16] as model-based design tools with well-defined
MoCs. In Section 7.1.1 we review and discuss many of these tools
and programming languages based on MoCs.
¹ Which is not a good strategy, as we saw in Section 5.1
8.5 Random Benchmark Generation and Machine Learning
A prominent example of random code generation is CSmith [Yan+11],
used to stress-test compilers. A related approach is the grammar-based
method presented in [McK97]. In this thesis we discussed random bench-
marking for evaluating optimizations, more so than for testing corner
cases. The tools TGFF [DRW98] and SDF3 [SGB06], also discussed in Section 3.1,
are more closely related to benchmarking as we investigated it.
We used the proposed Level Graphs to evaluate language-based transformations
for I/O optimization. We discussed the main related work for
this in Section 7.3, namely Haxl [Mar+14] and Muse [Kac15], as well as the
unpublished Stitch [Don14] from Twitter.
The direct connection between benchmarking and machine learning
comes from CLGen [Cum+17a]. Machine learning for code is a broad subject
and only a minor part of this thesis. A broader overview can be found
in [All+18], although we will solely discuss work closely related to the contributions
presented in this thesis. We based our graph-based methods
and the evaluation on the ideas presented in [Cum+17b; BJH18]. Subsequently,
based in part on our graph-based methods, the work in [Cum+20;
Ye+20] proposed some potential improvements to our compiler-based
representations.
9 C O N C L U S I O N S
Programming computers is notoriously difficult. This thesis certainly will
not change that. More generally, this statement will probably remain true
for a long time. This does not mean, however, that we cannot make
progress towards easing the process of programming computers.
In this thesis we have argued for model-based design of software sys-
tems. This is much more common in the hardware world, where models
are central to design, commonly used for ensuring deterministic behav-
ior. The success of this paradigm is what allows us to have programmable
digital computers in the first place.
In the software world, where the level of abstraction is higher and the
entry barrier lower, models are usually more implicit, less strict, or both.
When using well-defined Models of Computation (MoCs) for programming,
however, we can reason about software and its performance. Concretely,
software synthesis flows and the mapping problem result from doing pre-
cisely this.
In this thesis we studied such MoC-based software synthesis flows, with
a focus on Kahn Process Networks (KPNs). We surveyed multiple dataflow
MoCs, and discussed their advantages and disadvantages. The KPN
MoC allows us to express concurrency in computation in a deterministic
fashion, while remaining very expressive. Compared to most dataflow
MoCs, it allows for maximally dynamic, data-dependent behavior. We also
discussed a semantics gap between the Kahn-MacQueen (KMQ) blocking-
reads semantics and KPN, which can be exploited in applications with data-
parallelism.
When lowering KPNs down to be executed in MPSoCs, the mapping prob-
lem plays a crucial role, especially for heterogeneous systems. In this the-
sis we have discussed this intractable problem at length. A central theme
of our discussion has been the structure of the mapping space. We have
seen how the space is large and complex, yet structured.
The mapping space is very symmetrical, which concretely means that
many mappings are equivalent in terms of properties like performance or
energy efficiency. This is due to symmetries in the target hardware archi-
tectures and applications. MPSoCs usually have multiple cores with identi-
cal microarchitectures and memory subsystems with a regular structure.
Data-level parallelism in applications also yields such symmetry. We have
seen how to describe and exploit this symmetry, pruning the mapping
space for Design-Space Exploration (DSE) or for finding equivalent map-
pings when some resources are unavailable at run-time.
The mapping space also has different geometric interpretations. We
have seen how to find different embeddings of these geometric inter-
pretations and exploit them in DSE meta-heuristics. This also allowed us
to design novel heuristics and meta-heuristics, based on the geometric
structure of the space, to find mappings with low communication costs
or high robustness. In general, we have seen how the way we repre-
sent mappings can expose much of this structure. We believe much work
would benefit from explicitly considering the structure exposed to the al-
gorithms.
There is little point in exposing complex structures and engineering so-
phisticated algorithms if they don’t improve our methods. To assess if
they do, however, we need to test them, using benchmarks. Given the im-
portance of this, we argue for careful consideration as to what and how
improvements are assessed using benchmarks. In this thesis we have
argued for a statistical view of code, seeing improvements on methods
as improvements on the expected value of some property, like the pro-
gram’s execution time.
Unfortunately, benchmarks are scarce and seldom specialized. We
have discussed options for overcoming this issue, using random benchmark
generation and machine learning. In particular, we have seen how
our statistical view of code for benchmarking exposes some possible pit-
falls of machine learning for benchmark generation, both in theory and
in practice.
While KPN-based flows have many advantages, they are not well-suited
for every application domain. For example, KPNs do not have semantics to
deal with time, which is important in Cyber-Physical Systems (CPSs). Simi-
larly, the KPN graph structure is rigid, which limits the adaptability of the
model. In this thesis we discussed a novel model, Reactors, which ad-
dresses these limitations. We focused on the opportunities of this model
in the 5G telecommunications standard.
Most of this thesis has focused on the advantages of model-based de-
sign, which are plentiful. An important disadvantage, however, is the ease
of use of this design process. Exposing models through APIs is not produc-
tive, since developers can and usually do end up abandoning the model’s
constraints. We need programming languages and, especially, program-
ming models that make MoC-based design accessible to programmers,
while enforcing the model’s constraints. In this thesis we briefly discussed
the Ohua programming model, which derives a dataflow execution im-
plicitly from a conventional programming language. We saw how we can
use this to combine the advantages of MoC-based design with concise pro-
gramming, by optimizing I/O in microservice-based infrastructures.
9.1 Future Work
We believe the single most important aspect to drive MoC-based design
forward is fostering its adoption. We need tools and environments that
make it easier for programmers to design applications with a well speci-
fied MoC.
The Lingua Franca project, which implements Reactors, is a great av-
enue for fostering adoption. Its polyglot design allows programmers to
use known languages to write reactors1, while still being a coordination
language that can enforce the MoC semantics. The use of known lan-
guages has two distinct advantages, as it reduces the learning curve and
allows using legacy code. Currently, the Lingua Franca compiler neither
understands nor type-checks the target-language code, leaving that task to
the target language's compiler. In future work we could add a type system
to Lingua Franca, reinforcing the “freedom from choice” it provides.
A potential disadvantage of the Lingua Franca project, on the other
hand, could be its coordination language. Explicitly writing the networks
can be confusing for developers. This is exacerbated by the fact that Re-
actors is a complex model, difficult to grasp and learn. We believe the
1 This refers to the unit of computation, reactors, as part of the model, Reactors.
approach by Ohua of making the network implicit is a great avenue for fu-
ture work. Another example to follow is the Elm language, a functional lan-
guage for the web. Elm started as a language with an explicit Functional
Reactive Programming (FRP) paradigm, which was confusing for users.
They made this paradigm implicit2, which translated into a success for
the learning curve and the language’s adoption. This is part of a general
vision, where compilers and languages are thought of as assistants that
make development easier for programmers, instead of only focussing on
correctness and performance. We believe we should follow similar paths
for the design of software using MoCs like Reactors or KPN.
On the side of mapping there are also many open avenues for improv-
ing our work. In particular, partial symmetries expose a great deal of
the problem’s structure which we are not exploiting yet. Designing effi-
cient methods to detect and exploit them would help navigate the design
space of mappings much better. This would also open up opportunities
for using inverse semigroups of non-symmetries, like reducing the num-
ber of hops in a communication link. Our methods would also greatly ben-
efit from incorporating application symmetries when exploiting data-level
parallelism.
The geometry of the mapping space also poses many open questions whose
answers should improve its usefulness. While we discussed the trade-off between
the distortion and dimensionality of embeddings, we did not exploit it
in this thesis. More importantly, while the metrics we discussed are a
good starting point, we saw that they do not reflect the mapping struc-
ture very well yet. Finding better metrics could greatly improve mapping
meta-heuristics, especially those based on some concept of locality in the
search space.
In this thesis we have discussed and shown how provable properties of
MoCs can and do improve the design process. While traditional pen and
paper proofs are a great way of finding and proving these properties, the
advent of powerful theorem proving assistants provides an opportunity
to improve upon this. Formally verified proofs of properties of the system
give us more certainty on their correctness, and some degree of automa-
tion. Combining MoC-based design with developments in formal verifica-
tion methods can help us to write correct software more frequently in
less time.
While formal methods can verify properties of our software, they might
never completely replace testing. Even if they one day do so, that will not
be in the near future. This is why we believe the work on benchmark-
ing is an important avenue for future work. We believe advances in ma-
chine learning could enable our vision of benchmark generation flows.
Informed by statistical models and tailored for specific use cases, such





A M A T H E M A T I C A L S U P P L E M E N T
A.1 Groups
Definition A.1.1. Let $G$ be a (finite¹) set and $\circ : G \times G \to G$ be a mapping.
We say that $(G, \circ)$ is a group if the following hold:
• The mapping $\circ : G \times G \to G$ is associative, i.e. for any $f, g, h \in G$ we have
$f \circ (g \circ h) = (f \circ g) \circ h$.
• There exists a neutral element $e \in G$ with $g \circ e = e \circ g = g$ for all
$g \in G$.
• For every $g \in G$ there is an inverse element $g^{-1} \in G$ such that
$g \circ g^{-1} = g^{-1} \circ g = e$.
By abuse of notation, we normally identify the structure $(G, \circ)$ with the
set $G$ and say $G$ is a group. Similarly, when the operation $\circ$ is understood
from context we commonly abbreviate it as multiplication, without writing
it explicitly: $g \circ h =: gh$. Groups are ubiquitous in mathematics. For
example, the integers form a group with addition, $(\mathbb{Z}, +)$, as do
the reals (without 0) with multiplication, $(\mathbb{R} \setminus \{0\}, \cdot)$.
An important example of a group is the so-called symmetric group:
Example A.1.2. Let $X$ be a finite set. Then, the set of bijections $X \to X$
from $X$ to itself is a group with regard to function composition. Indeed, if
$f, g : X \to X$ are bijections, then so is $f \circ g : X \to X$. The identity function
$\mathrm{Id}_X : x \mapsto x$ is the neutral element and the inverse function $f^{-1}$ is the
group inverse of $f$, since $f \circ f^{-1} = f^{-1} \circ f = \mathrm{Id}_X$. We call the group of
bijections on $X$ the symmetric group on $X$ and write $\mathrm{Sym}(X)$. If $n \in \mathbb{N}$, $n > 0$
is a natural number and $X = \{1, \dots, n\}$, then we write $S_n$ to refer to
$\mathrm{Sym}(\{1, \dots, n\})$.
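As a small sanity check of the group axioms for the symmetric group, the following Python sketch enumerates $S_3$ and verifies closure, associativity, identity and inverses by brute force. The tuple representation of permutations and the helper names (`compose`, `inverse`, `is_group`) are assumptions made for this illustration and are not taken from the tooling used in this thesis.

```python
from itertools import permutations


def compose(f, g):
    """Return f ∘ g, the permutation x -> f(g(x)).

    A permutation of {1, ..., n} is stored as a tuple p with p[i - 1] = p(i).
    """
    return tuple(f[g[i] - 1] for i in range(len(g)))


def inverse(f):
    """Return the inverse permutation of f."""
    inv = [0] * len(f)
    for i, image in enumerate(f, start=1):
        inv[image - 1] = i
    return tuple(inv)


def is_group(elements):
    """Brute-force check of closure, associativity, identity and inverses."""
    elems = set(elements)
    n = len(next(iter(elems)))
    identity = tuple(range(1, n + 1))
    closed = all(compose(f, g) in elems for f in elems for g in elems)
    associative = all(
        compose(f, compose(g, h)) == compose(compose(f, g), h)
        for f in elems for g in elems for h in elems
    )
    has_inverses = all(inverse(f) in elems for f in elems)
    return closed and associative and identity in elems and has_inverses


# All 3! = 6 bijections of {1, 2, 3}.
S3 = list(permutations((1, 2, 3)))
print(len(S3), is_group(S3))
```

The brute-force check is cubic in the group order, so it is only meant for very small examples; a subset such as the two transpositions `(2, 1, 3)` and `(1, 3, 2)` fails it because the identity is missing.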
This is an important example because every finite group can be found
in $S_n$ for some $n$, as a subgroup, which we will define shortly. We first want
to introduce cycle notation. A permutation $\pi : \{1, \dots, n\} \to \{1, \dots, n\}$ can
be written in different ways. The simplest way to do this is to write it in a
two-row matrix:
$$\begin{pmatrix} 1 & 2 & \dots & n \\ \pi(1) & \pi(2) & \dots & \pi(n) \end{pmatrix}$$
While this is simple to understand, there is a more concise way to write
permutations that has many advantages, including some computational
advantages. It is called cycle notation. We write a permutation as a product
of cycles $(i \;\; \pi(i) \;\; \pi(\pi(i)) \;\; \dots \;\; \pi^k(i))$, maximal such that the values don’t
repeat. We can do this since $n$ is finite and thus for some $k$, $\pi^{k+1}(i) = i$,
and we choose the minimal $k$ with that property. For example, the cycle
$(1, 2, 3)$ is the permutation $1 \mapsto 2, 2 \mapsto 3, 3 \mapsto 1$. From this example
it perhaps is clearer why they are called cycles, since the last element
maps to the first one, cyclically. By convention, 1-cycles are not written
1 Groups are not finite by definition, but all groups we discuss in this thesis are.
explicitly. The identity mapping can sometimes be written as $()$, but
could equivalently be $(1)(2) \dots (n)$. For another example, the permutation
$1 \mapsto 2, 2 \mapsto 1, 3 \mapsto 3, 4 \mapsto 5, 5 \mapsto 4$ can be written as $(1, 2)(4, 5)$.
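The conversion between the mapping view of a permutation and its cycle notation can be sketched in a few lines of Python. The dictionary representation and the function names `to_cycles` and `from_cycles` are hypothetical choices for this illustration; 1-cycles are omitted, following the convention stated above.

```python
def to_cycles(perm):
    """Decompose a permutation into disjoint cycles, omitting 1-cycles.

    `perm` is a dict mapping each element of {1, ..., n} to its image.
    """
    seen = set()
    cycles = []
    for start in sorted(perm):
        if start in seen:
            continue
        # Follow start -> perm[start] -> ... until we return to start.
        cycle = [start]
        seen.add(start)
        nxt = perm[start]
        while nxt != start:
            cycle.append(nxt)
            seen.add(nxt)
            nxt = perm[nxt]
        if len(cycle) > 1:
            cycles.append(tuple(cycle))
    return cycles


def from_cycles(cycles, n):
    """Rebuild the mapping on {1, ..., n} from a list of disjoint cycles."""
    perm = {i: i for i in range(1, n + 1)}
    for cycle in cycles:
        for pos, x in enumerate(cycle):
            perm[x] = cycle[(pos + 1) % len(cycle)]
    return perm


rotation = {1: 2, 2: 3, 3: 1}
swaps = {1: 2, 2: 1, 3: 3, 4: 5, 5: 4}
print(to_cycles(rotation))  # [(1, 2, 3)]
print(to_cycles(swaps))     # [(1, 2), (4, 5)]
```

Since the element 3 is a fixed point of the second permutation, it does not appear in the output, and the identity permutation yields the empty list.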
Definition A.1.3. Let $H \subseteq G$ be a subset of a group $G$. We say that $H$ is a
subgroup if $H$ is a group with respect to the restriction of the multiplication
on $G$. Equivalently, if $e \in H$ and for every $g, h \in H$ we have $gh^{-1} \in H$.
We normally write $H \le G$ to denote that $H$ is a subgroup of $G$ (and
$H < G$ if, additionally, we know that $G \ne H$).
In group theory in particular, and in mathematics in general, mappings
that preserve the structures of the objects being studied are a very pow-
erful tool. We proceed to define these mappings for groups.
Definition A.1.4. For two groups $G, G'$, a mapping $\varphi : G \to G'$ is called a
group homomorphism (or a morphism of groups) if it respects the group’s
structure, i.e. if $\varphi(gh) = \varphi(g)\varphi(h)$ for all $g, h \in G$ and $\varphi(e_G) = e_{G'}$.
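To make the homomorphism condition concrete, the sketch below checks it by brute force for the sign map from $S_3$ to the two-element group $(\{1, -1\}, \cdot)$. The sign map is a standard example chosen here for illustration; it is not discussed in this thesis, and the helper names are hypothetical.

```python
from itertools import permutations


def compose(f, g):
    """Return f ∘ g for permutations stored as tuples with p[i - 1] = p(i)."""
    return tuple(f[g[i] - 1] for i in range(len(g)))


def sign(p):
    """Sign of a permutation: +1 for an even number of inversions, else -1."""
    inversions = sum(
        1 for i in range(len(p)) for j in range(i + 1, len(p)) if p[i] > p[j]
    )
    return -1 if inversions % 2 else 1


def is_homomorphism(phi, elements, op_source, op_target):
    """Check phi(g h) == phi(g) phi(h) for all pairs of elements."""
    return all(
        phi(op_source(g, h)) == op_target(phi(g), phi(h))
        for g in elements for h in elements
    )


def multiply(a, b):
    """Group operation of the target group ({1, -1}, *)."""
    return a * b


S3 = list(permutations((1, 2, 3)))
print(is_homomorphism(sign, S3, compose, multiply))  # True
```

The sign map is surjective but not injective, so in the terminology introduced next it is an epimorphism but not a monomorphism. A constant map to $-1$ fails the check, since it does not even send the neutral element to the neutral element.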
The definition above already implies that $\varphi(g^{-1}) = \varphi(g)^{-1}$. As mentioned
before, these structure-preserving mappings are very important
in the theory of groups, as they are used to relate different groups. A
group homomorphism $\varphi : G \to H$ is called a monomorphism (or embedding)
if it is injective, epimorphism if it is surjective and isomorphism if it
is bijective. A group isomorphism from a group $G$ to itself, $\varphi : G \to G$, is
called an automorphism. Isomorphisms play a central role in mathematics
(e.g. in their more general definition for categories). They define equivalence
classes of objects, i.e. being isomorphic is an equivalence relation.
We usually write $G \cong H$ to say that $G$ and $H$ are isomorphic, that is to say, if
there exists an isomorphism $\varphi : G \to H$. Two objects that are isomorphic
are usually considered to be “the same”, since any structural property has to
be an invariant of the isomorphism class. In fact, this indistinguishability
between isomorphic objects is at the center of the univalence axiom in
homotopy type theory as a foundation of mathematics [Uni13].
In the case of groups, there is a particular property that most other
structures in mathematics do not have. The set of isomorphisms from a group
to itself, i.e. the set of mappings that preserve the structure of the object
and define equivalences, is itself a group! Besides group automorphisms,
the structure-preserving mappings of other structures are also groups
(e.g. homeomorphisms in topological spaces, graph isomorphisms, invertible
matrices in vector spaces). In all these cases there is a direct relationship
between the structure and the structure preserving mappings,
as the mappings can transform these structures. This concept is generalized
with group actions.
Definition A.1.5. Let $G$ be a group and $X$ be a set. We say that $G$ acts on $X$
if there is a group homomorphism $G \to \mathrm{Sym}(X)$ from the group $G$ to the
symmetric group on $X$. Equivalently, if there is an $\alpha : G \times X \to X$ which is
associative and respects the group operation (in particular $\alpha(e, x) = x$ for
all $x \in X$)². We also say that $X$ is a $G$-set.
As an example, consider a regular polygon with $n$ sides, a regular $n$-gon.
Figure A.1 shows this for the example of a square (a square is
a regular 4-gon). We name the four corners of the square as 1, 2, 3, 4.
This could thus be interpreted as a graph
$G = (V = \{1, 2, 3, 4\}, E = \{\{1, 2\}, \{2, 3\}, \{3, 4\}, \{4, 1\}\})$.
2 We call this a left action, and a right action similarly $\beta : X \times G \to X$, where we write the




Figure A.1: A square.
The group of permutations $S_4$ acts on the graph $G$ by permuting the
points $V = \{1, 2, 3, 4\}$ (and the edges accordingly). We can take any permutation
$\pi \in S_4$ and apply it on the graph. For example, consider the




Figure A.2: The action of the permutation $(1, 2)$ on the square.
Figure A.2 shows the example of the action of $(1, 2)$ on the square. We
note that the square is not a square anymore, it has lost its structure. A
natural question to ask is, what are the permutations that leave a square




Figure A.3: The action of the rotation $(1, 2, 3, 4)$ on the square.
Consider a rotation by $90^\circ$, counter-clockwise. We can write this as the
permutation $\rho = (1, 2, 3, 4)$. Figure A.3 depicts the action of $\rho$ on the square,
and indeed, it preserves the structure. We can think of two additional rotations,
by $180^\circ$ and $270^\circ$, which would also preserve the structure of the
square. It is worth noting that a rotation by $180^\circ$ is the same as rotating
by $90^\circ$ twice, and similarly, $\rho^3$ is the rotation by $270^\circ$. It is also important
to note that a rotation by $360^\circ$ and $0^\circ$ are indistinguishable on the square,
they are the identity permutation $\mathrm{Id}_{\{1,2,3,4\}}$. In fact, these four rotations
form a subgroup $C_4 < S_4$, called a cyclic group. More generally, the rotations
that preserve the structure of a regular $n$-gon are a cyclic group of
order $n$, $C_n$.
Definition A.1.6. The cyclic group $C_n$ is the group formed by an $n$-cycle
$a = (1, \dots, n)$ in $S_n$, i.e. $C_n = \{a, a^2, \dots, a^n = \mathrm{Id}_n\}$.
We have seen that the rotation $\rho$ in the example above is enough to find
all elements of the group $C_4$. We say that $\rho$ generates the group $C_4$. Cyclic
groups are characterized by having a single generator³. Thus, all other
groups have multiple generators (except the trivial group $\{e\}$ which can be
considered as having 0 generators). More generally, for a set $X \subseteq G$ in a
group $G$, we define $\langle X \rangle \le G$ to be the smallest subgroup of $G$ containing $X$.
For the case of finite groups, we can characterize $\langle X \rangle$ as the set of words




Figure A.4: The action of the reflection $(1, 2)(3, 4)$ on the square.
Rotations are not all the symmetries of the square, we can also have
reflections. Figure A.4 shows the action of a reflection along the vertical
axis, namely $\sigma = (1, 2)(3, 4)$, on the square. Reflections are fundamentally
different from rotations, no rotation could achieve this transformation.
We can also have a reflection along the horizontal axis and both diagonals,
for a total of 4 reflections. It is perhaps not obvious at first, but if we
combine reflections and rotations on the square, we always get a reflection
or a rotation. In fact, these 8 transformations form another group $D_4$,
with $C_4 < D_4 < S_4$. The dihedral group on four points, $D_4$, is generated
by the rotation $\rho$ and the reflection $\sigma$, i.e. $D_4 = \langle \rho, \sigma \rangle$.
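The claims about $C_4$ and $D_4$ can be checked computationally. The following Python sketch closes a set of generators under composition and compares the result with the permutations in $S_4$ that map the edge set of the square to itself. The closure routine `generate` and the predicate `preserves_square` are illustrative helpers written for this example, assuming permutations stored as tuples with `p[i - 1]` the image of `i`.

```python
from itertools import permutations


def compose(f, g):
    """Return f ∘ g for permutations stored as tuples with p[i - 1] = p(i)."""
    return tuple(f[g[i] - 1] for i in range(len(g)))


def generate(generators):
    """Subgroup generated by `generators` (finite case).

    In a finite group, inverses are positive powers, so repeatedly
    multiplying by generators starting from the identity suffices.
    """
    n = len(generators[0])
    identity = tuple(range(1, n + 1))
    group = {identity}
    frontier = [identity]
    while frontier:
        g = frontier.pop()
        for s in generators:
            h = compose(s, g)
            if h not in group:
                group.add(h)
                frontier.append(h)
    return group


EDGES = {frozenset(e) for e in [(1, 2), (2, 3), (3, 4), (4, 1)]}


def preserves_square(p):
    """True if p maps the edge set of the square onto itself."""
    return {frozenset(p[v - 1] for v in edge) for edge in EDGES} == EDGES


rho = (2, 3, 4, 1)    # the rotation (1, 2, 3, 4)
sigma = (2, 1, 4, 3)  # the reflection (1, 2)(3, 4)

C4 = generate([rho])
D4 = generate([rho, sigma])
symmetries = {p for p in permutations((1, 2, 3, 4)) if preserves_square(p)}
print(len(C4), len(D4), D4 == symmetries)  # 4 8 True
```

The transposition $(1, 2)$, stored as `(2, 1, 3, 4)`, fails `preserves_square`, which matches the observation that its action destroys the structure of the square.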
In the action of permutations on the graph, the group elements act simultaneously
on both the nodes and edges. If we look at the action only
on the nodes, the action of $C_4$ (and $D_4$) can take any point to any other
point. But if we look at the whole graph, we see that there is no way to
take the original square to the square in Figure A.4 with the action of $C_4$.
Similarly, we cannot use any group element of $D_4$ to take the square to the
shape in Figure A.2, which is not a square anymore. This is precisely the property
that defines $D_4$ as the symmetry group of the square. These questions,
concerning which elements can be taken to which others, are common in
group theory. This is why there is a definition to describe these sets: we
call them orbits.
Definition A.1.7. Let $G$ be a group and $X$ be a $G$-set. Further, let $x \in X$ be
an element of $X$. We define the orbit of $x$ to be the set $Gx := \{gx \mid g \in G\}$ of
points that $x$ can be transformed into. We further call $X/G := \{Gx \mid x \in X\}$
the set of orbits of $X$.
3 Recall that we are only considering finite groups.
For example, all squares are precisely the elements in the orbit $D_4 s$,
where $s$ is the graph of the square from the examples above. Orbits define
a partition on the set $X$, meaning that any two orbits $Gx, Gy$ are either
equal or disjoint and $X = \bigcup_{x \in X} Gx$.
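Orbits can be computed with a simple graph search that applies generators until no new element appears. The sketch below does this both for the action on the corners of the square and for the action on labeled graphs; the function names and the `frozenset` encoding of graphs are assumptions of this illustration, not the implementation used elsewhere in this thesis.

```python
def orbit(x, generators, act):
    """Orbit of x under the (finite) group generated by `generators`."""
    seen = {x}
    frontier = [x]
    while frontier:
        y = frontier.pop()
        for g in generators:
            z = act(g, y)
            if z not in seen:
                seen.add(z)
                frontier.append(z)
    return frozenset(seen)


def orbits(points, generators, act):
    """The set of orbits X/G, which partitions `points`."""
    return {orbit(x, generators, act) for x in points}


def act_on_point(g, v):
    """Action on corners: g is a tuple with g[v - 1] the image of v."""
    return g[v - 1]


def act_on_graph(g, edges):
    """Induced action on graphs, given as frozensets of 2-element frozensets."""
    return frozenset(frozenset(g[v - 1] for v in edge) for edge in edges)


rho = (2, 3, 4, 1)    # rotation (1, 2, 3, 4)
sigma = (2, 1, 4, 3)  # reflection (1, 2)(3, 4)
tau = (2, 1, 3, 4)    # transposition (1, 2); rho and tau generate S_4
square = frozenset(frozenset(e) for e in [(1, 2), (2, 3), (3, 4), (4, 1)])

print(orbits({1, 2, 3, 4}, [sigma], act_on_point))
print(len(orbit(square, [rho, sigma], act_on_graph)))  # 1
print(len(orbit(square, [rho, tau], act_on_graph)))    # 3
```

Under $D_4 = \langle \rho, \sigma \rangle$ the labeled square is fixed, so its orbit has a single element, whereas under all of $S_4$ the orbit contains the three distinct 4-cycles on the labeled corners.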
Finally, we discuss how we can construct other groups from existing
groups. The simplest construction is called the direct product. For groups
$G, H$, we write $G \times H$, endowing the Cartesian product with a componentwise
multiplication (i.e. $(g, h)(g', h') = (gg', hh')$ for all $g, g' \in G$, $h, h' \in H$).
There is a more general construction called a semi-direct product, which
is a generalization of the direct product. Here we will only discuss a special
case of semi-direct products, namely the wreath product $G \wr H$.
Let $G$ be a group and let $H \le S_n$ for an $n \in \mathbb{N}$. We consider the direct
product of $n$ copies of $G$:
$$G^n := \underbrace{G \times \dots \times G}_{n \text{ times}}$$
Then, the group $H$ acts on these $n$ copies of $G$ by permuting their instances.
Let $(g_1, \dots, g_n) \in G^n$, $h \in H$. We define:
$$h(g_1, \dots, g_n) := (g_{h(1)}, \dots, g_{h(n)}),$$
where $h$ permutes the order of the elements in the $n$-tuple of elements of
$G^n$. This defines an action of $H$ on $G^n$. We can use this action to construct
the wreath product $G \wr H$ on the Cartesian product $G^n \times H$, by defining
the multiplication as:
$$\begin{aligned}
((g_1, \dots, g_n), h)((g'_1, \dots, g'_n), h') &= ((g_1, \dots, g_n) \, h(g'_1, \dots, g'_n), hh') \\
&= ((g_1, \dots, g_n)(g'_{h(1)}, \dots, g'_{h(n)}), hh').
\end{aligned}$$
Intuitively, the wreath product works when we have copies of a substructure
arranged in a particular larger structure. It applies transformations
both at the substructure level and at the level of the larger structure.
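The wreath product multiplication can be sketched directly from the semi-direct product rule $((g_i)_i, h)((g'_i)_i, h') = ((g_i \, g'_{h(i)})_i, hh')$. The example below builds $C_2 \wr S_2$ and $C_2 \wr S_3$ and checks the group laws by brute force. Two assumptions are made for this illustration: $C_2$ is written as $(\{0, 1\}, \text{XOR})$, and the product $hh'$ is read left to right (first $h$, then $h'$), which is the convention under which this formula is associative.

```python
from itertools import permutations, product


def compose_ltr(h1, h2):
    """Product h1 h2 read left to right: apply h1 first, then h2."""
    return tuple(h2[h1[i] - 1] for i in range(len(h1)))


def act(h, gs):
    """Action of h on a tuple: h(g_1, ..., g_n) := (g_h(1), ..., g_h(n))."""
    return tuple(gs[h[i] - 1] for i in range(len(gs)))


def wreath_mul(a, b, mul_g):
    """((g), h)((g'), h') = ((g) * h(g'), h h'), multiplying componentwise."""
    (gs, h), (gs2, h2) = a, b
    moved = act(h, gs2)
    return tuple(mul_g(x, y) for x, y in zip(gs, moved)), compose_ltr(h, h2)


def wreath_elements(G, H, n):
    """Underlying set G^n x H of the wreath product."""
    return [(gs, h) for gs in product(G, repeat=n) for h in H]


def xor(x, y):
    """Group operation of C_2, written here as ({0, 1}, XOR)."""
    return x ^ y


C2 = [0, 1]
S2 = list(permutations((1, 2)))
S3 = list(permutations((1, 2, 3)))

W = wreath_elements(C2, S2, 2)   # C_2 wr S_2 has 2^2 * 2 = 8 elements
W3 = wreath_elements(C2, S3, 3)  # C_2 wr S_3 has 2^3 * 6 = 48 elements
print(len(W), len(W3))  # 8 48
```

The smaller example has the same order as $D_4$ and is non-commutative, which fits the intuition of two copies of a two-element substructure arranged in a structure that can itself be swapped.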
A.2 Metric Spaces and Low-Distortion Embeddings
Here we discuss (discrete) metric spaces and low-distortion embeddings.
A metric space is the mathematical formalization of distances. We define
a metric to be able to measure distances in a particular space.
Definition A.2.1. Let $M$ be a set and let $d : M \times M \to \mathbb{R}_{\ge 0}$. We say that $d$
is a metric on $M$ and, equivalently, $(M, d)$ is a metric space, if the following
hold:
1. For all $m, m' \in M$, $d(m, m') = 0 \Leftrightarrow m = m'$.
2. For all $m, m' \in M$, we have $d(m, m') = d(m', m)$.
3. For all $k, l, m \in M$ we have $d(k, m) \le d(k, l) + d(l, m)$.
The motivation for these properties is intuitively clear. Property 1 says
two things: first, that there is no distance from an element to itself, and
second, that no two distinct elements are in the same place (have no distance
between them). If we don’t require the second part (i.e. replace
$\Leftrightarrow$ with $\Leftarrow$ in Property 1), we get what is called a pseudo-metric (or a
degenerate metric). Property 2 states that distance is
symmetric. Finally, Property 3 is the triangle inequality: it states that the
shortest path between two elements is always the direct path, their distance.
The canonical metric spaces are $\mathbb{R}^n$ with different norms, like the $p$-norms.
A norm is a more restrictive concept than a metric, but we will
not define norms here further.
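Example A.2.2. For $p \ge 1$, the function $(x, y) \mapsto \|x - y\|_p : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}_{\ge 0}$
is a metric, where
$$\|x\|_p := \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}$$
is the $p$-norm.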
The case for $p = 2$ is the well-known Euclidean distance in vector spaces.
Also well-known is the case of $p = 1$, which is sometimes called the Manhattan
or Taxi distance, in allusion to the distance when moving through
the streets of a neighborhood that look like a regular mesh, like in Manhattan.
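In this thesis we are particularly interested in the case where $M$ is finite,
which we will assume from here on. If $M = \{m_1, \dots, m_n\}$ is finite, we can
write the metric as a matrix of pairwise distances:
$$\begin{pmatrix}
d(m_1, m_1) & d(m_1, m_2) & \dots & d(m_1, m_n) \\
\vdots & \vdots & \ddots & \vdots \\
d(m_n, m_1) & d(m_n, m_2) & \dots & d(m_n, m_n)
\end{pmatrix}$$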
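For a finite set, the three metric properties can be checked mechanically on the distance matrix. The Python sketch below builds the matrix for a few points under the distances induced by the 1-norm and the 2-norm and verifies the properties with a small numerical tolerance. The sample points, the tolerance and the function names are assumptions of this illustration.

```python
def p_distance(x, y, p):
    """Distance induced by the p-norm: (sum_i |x_i - y_i|^p)^(1/p)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def distance_matrix(points, p):
    """Matrix of pairwise p-norm distances between the given points."""
    return [[p_distance(x, y, p) for y in points] for x in points]


def is_metric(D, tol=1e-12):
    """Check the three metric properties on a finite distance matrix."""
    idx = range(len(D))
    # Property 1: distance zero exactly on the diagonal (and no negatives).
    identity = all(
        D[i][j] >= 0 and (D[i][j] <= tol) == (i == j) for i in idx for j in idx
    )
    # Property 2: symmetry.
    symmetric = all(abs(D[i][j] - D[j][i]) <= tol for i in idx for j in idx)
    # Property 3: triangle inequality.
    triangle = all(
        D[i][k] <= D[i][j] + D[j][k] + tol for i in idx for j in idx for k in idx
    )
    return identity and symmetric and triangle


points = [(0, 0), (1, 0), (0, 1), (2, 2)]
manhattan = distance_matrix(points, 1)
euclidean = distance_matrix(points, 2)
print(is_metric(manhattan), is_metric(euclidean))  # True True
```

A matrix such as `[[0, 1, 5], [1, 0, 1], [5, 1, 0]]` is rejected because it violates the triangle inequality, and the all-zero matrix on two points is rejected because it is only a pseudo-metric.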
The structure preserving mappings (morphisms) of metric spaces are
called isometries. They have the particular property that they are always
injective, due to Property 1.
Definition A.2.3. Let $M, M'$ be metric spaces. We say that a mapping
$\varphi : M \to M'$ is an isometry if for all $m, m' \in M$, $d_M(m, m') = d_{M'}(\varphi(m), \varphi(m'))$.
An isometry is thus always an embedding (monic), since for any two
points $m, m' \in M$ with $\varphi(m) = \varphi(m')$ we have
$0 = d_{M'}(\varphi(m), \varphi(m')) = d_M(m, m')$, and hence $m = m'$ by Property 1.
In the case of groups, embeddings into a particular group $S_n$ are useful
for computing. While we did not discuss it as thoroughly, the basis
of all computation we are concerned with in this thesis are these embeddings
into permutation groups $S_n$⁴. The question is, can we find an equivalent
method for metric spaces, using isometries to (finite subsets of) $\mathbb{R}^n$?
The unfortunate answer is no: for a finite metric space $M$ there are
not always $n, p$ such that there exists an isometry from $M$ to $(\mathbb{R}^n, \|\cdot\|_p)$
(see [Mat02] for a proof). Fortunately, however, when dealing with real
numbers we can always look for approximations.
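Definition A.2.4. Let $M$ be a metric space and $\iota : M \hookrightarrow \mathbb{R}^n$ be an embedding
into $\mathbb{R}^n$. We say that $\iota$ has distortion $D > 0$ if, for all $x, y \in M$,
$$\frac{1}{D} \, d(x, y) \le \|\iota(x) - \iota(y)\| \le d(x, y).$$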
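The distortion of a concrete embedding can be computed directly from this definition. The sketch below takes the shortest-path metric on the corners of the square (the 4-cycle graph) and embeds it as the unit square in the Euclidean plane; the function name `distortion`, its return conventions and the example are assumptions made for this illustration.

```python
import math


def distortion(D, images, tol=1e-12):
    """Smallest D' with d/D' <= ||i(x) - i(y)|| <= d for all pairs.

    `D` is the distance matrix of the finite metric space and `images`
    lists the embedded points in the same order. Returns None if the map
    expands some distance (violating the upper bound) and math.inf if two
    distinct points are mapped to the same image.
    """
    worst = 1.0  # the upper bound forces any valid distortion to be >= 1
    n = len(D)
    for i in range(n):
        for j in range(i + 1, n):
            embedded = math.dist(images[i], images[j])
            if embedded > D[i][j] + tol:
                return None
            if embedded == 0.0:
                return math.inf
            worst = max(worst, D[i][j] / embedded)
    return worst


# Shortest-path distances between the corners 1, 2, 3, 4 of the 4-cycle.
cycle_metric = [
    [0, 1, 2, 1],
    [1, 0, 1, 2],
    [2, 1, 0, 1],
    [1, 2, 1, 0],
]
unit_square = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(distortion(cycle_metric, unit_square))
```

Adjacent corners keep their distance of 1, while opposite corners shrink from 2 to $\sqrt{2}$, so this embedding has distortion $\sqrt{2}$. Scaling the square by a factor of 3 expands the adjacent distances and is therefore rejected.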
4 In computational group theory there are other branches like matrix groups or black-box
groups, where embedding into an $S_n$ is infeasible, but we are not concerned with these in
this thesis.
While, in general, we cannot find an isometry, we can search for an
embedding with a low distortion. There is a particularly useful result in
this context: we can use convex optimization to find an embedding of $M$
into $\mathbb{R}^n$ with the Euclidean ($p = 2$) norm [Mat02]. This is unfortunately
only the case for this norm, e.g. for $p = 1$ finding such an embedding is
known to be NP-complete [Mat02].
A problem with the convex optimization method above is that it yields
an embedding with dimension $|M|$, which might be very high. The dimension
of the vector space strongly affects algorithmic properties of the
problem. It would be ideal to find an embedding into a space with a
lower dimension, without increasing the distortion much. In [JL84], Johnson
and Lindenstrauss describe this precise problem and its solution as
follows: “Given $n$ points in Euclidean space, what is the smallest $k = k(n)$
so that these points can be moved into $k$-dimensional Euclidean space
via a transformation that expands or contracts all pairwise distances
by a factor of at most $1 + \epsilon$? The answer, that $k \le c(\epsilon) \operatorname{Log} n$, is a simple
consequence of the isoperimetric inequality for the $n$-sphere studied
in [FLM77].” They proceed to formalize and prove this fact, which we
will not restate here more precisely. This result is known as the
Johnson-Lindenstrauss Lemma.
An intuitive albeit sometimes misleading interpretation of the proof of
this lemma is that a projection onto a random subspace will have a low
distortion with high probability. In practice, the distribution does give a
very useful “transform” for dimensionality reduction, simply by projecting
onto a random subspace. However, we should be careful when using this
and ideally check the distortion, if possible.
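The advice to check the distortion after a random projection can be followed with a few lines of code. The sketch below uses a matrix of independent Gaussian entries scaled by $1/\sqrt{k}$, which is a common stand-in for projecting onto a random subspace and is an assumption of this example rather than the exact construction in [JL84]. It then reports by how much the pairwise distances were stretched or shrunk.

```python
import math
import random


def random_projection(points, k, rng):
    """Map points from R^n to R^k with a scaled Gaussian random matrix."""
    n = len(points[0])
    matrix = [
        [rng.gauss(0.0, 1.0) / math.sqrt(k) for _ in range(n)] for _ in range(k)
    ]
    return [
        tuple(sum(row[j] * x[j] for j in range(n)) for row in matrix)
        for x in points
    ]


def distance_ratios(points, projected):
    """Ratio of projected to original distance for every pair of points."""
    ratios = []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            original = math.dist(points[i], points[j])
            ratios.append(math.dist(projected[i], projected[j]) / original)
    return ratios


rng = random.Random(0)  # fixed seed only to make the sketch reproducible
points = [tuple(rng.uniform(-1.0, 1.0) for _ in range(200)) for _ in range(20)]
projected = random_projection(points, 50, rng)
ratios = distance_ratios(points, projected)
print(min(ratios), max(ratios))
```

Ratios close to 1 for all pairs indicate a low distortion. How close they are depends on the target dimension and on the random draw, which is why inspecting `min(ratios)` and `max(ratios)` for the concrete data at hand is worthwhile.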

R E F E R E N C E S
[Chu36] Alonzo Church. “An unsolvable problem of elementary num-
ber theory.” In: American journal of mathematics 58.2 (1936),
pp. 345–363.
[Kle36] Stephen Cole Kleene. “General recursive functions of natural
numbers.” In: Mathematische Annalen 112.1 (1936), pp. 727–742.
[Tur37] Alan M Turing. “Computability and λ-definability.” In: The
Journal of Symbolic Logic 2.4 (1937), pp. 153–163.
[ER59] P Erdős and A Rényi. “On random graphs I.” In: Publicationes
Mathematicae 6 (1959), pp. 290–297.
[Gil59] Edgar N Gilbert. “Random graphs.” In: The Annals of Mathe-
matical Statistics 30.4 (1959), pp. 1141–1144.
[Pet62] Carl Adam Petri. “Kommunikation mit automaten.” In: (1962).
[Moo+65] Gordon E Moore et al. Cramming more components onto inte-
grated circuits. 1965.
[Sco70] Dana Scott. Outline of a mathematical theory of computation.
Oxford University Computing Laboratory, Programming Re-
search Group Oxford, 1970.
[Kar72] Richard M Karp. “Reducibility among combinatorial prob-
lems.” In: Complexity of computer computations. Springer,
1972, pp. 85–103.
[HBS73] Carl Hewitt, Peter Boehler Bishop, and Richard Steiger. “A
Universal Modular ACTOR Formalism for Artificial Intelligence.”
In: Proceedings of the 3rd International Joint Conference
on Artificial Intelligence. Stanford, CA, USA, August 20-23, 1973.
1973, pp. 235–245.
[Den74] Jack B Dennis. “First version of a data flow procedure lan-
guage.” In: Programming Symposium. Springer. 1974, pp. 362–
376.
[Kah74] Gilles Kahn. “The semantics of a simple language for parallel
programming.” In: Information processing 74 (1974), pp. 471–
475.
[KM76] Gilles Kahn and David MacQueen. Coroutines and Networks
of Parallel Processes. Research Report. 1976, p. 20. url:
https://hal.inria.fr/inria-00306565.
[FLM77] Tadeusz Figiel, Joram Lindenstrauss, and Vitali D Milman.
“The dimension of almost spherical sections of convex bod-
ies.” In: Acta Mathematica 139.1 (1977), pp. 53–94.
[Lam78] Leslie Lamport. Time, Clocks, and the Ordering of Events in a
Distributed System. 1978.
[JL84] William B Johnson and Joram Lindenstrauss. “Extensions of
Lipschitz mappings into a Hilbert space.” In: Contemporary
Mathematics 26 (1984), pp. 189–206.
[HP85] David Harel and Amir Pnueli. “On the development of re-
active systems.” In: Logics and models of concurrent systems.
Springer, 1985, pp. 477–498.
[Hue85] Gérard Huet. “Cartesian closed categories and lambda-
calculus.” In: LITP Spring School on Theoretical Computer Sci-
ence. Springer. 1985, pp. 123–135.
[Agh86] Gul Agha. ACTORS: A Model of Concurrent Computation in Dis-
tributed Systems. The MIT Press Series in Artificial Intelligence.
Cambridge, MA: MIT Press, 1986.
[Den86] Jack B Dennis. “Data flow computation.” In: Control Flow
and Data Flow: concepts of distributed programming. Springer,
1986, pp. 345–398.
[FW86] Philip J Fleming and John J Wallace. “How not to lie with statis-
tics: the correct way to summarize benchmark results.” In:
Communications of the ACM 29.3 (1986), pp. 218–221.
[RHW86] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams.
“Learning representations by back-propagating errors.” In:
nature 323.6088 (1986), pp. 533–536.
[LM87] Edward A Lee and David G Messerschmitt. “Synchronous
data flow.” In: Proceedings of the IEEE 75.9 (1987), pp. 1235–
1245.
[PHP87] Daniel Pilaud, N Halbwachs, and JA Plaice. “LUSTRE: A declar-
ative language for programming synchronous systems.” In:
Proceedings of the 14th Annual ACM Symposium on Principles
of Programming Languages (14th POPL 1987). ACM, New York,
NY. Vol. 178. 1987, p. 188.
[BB88] Jonathan Barzilai and Jonathan M Borwein. “Two-point step
size gradient methods.” In: IMA journal of numerical analysis
8.1 (1988), pp. 141–148.
[Coh88] Harvey A Cohen. “Symmetry considerations applied to hard-
ware convolvers for image filtering.” In: Proceedings of the
1988 IEEE International Conference on Systems, Man, and Cyber-
netics. Vol. 2. IEEE. 1988, pp. 1128–1131.
[Run89] Colin Runciman. “What about the natural numbers?” In: Com-
puter Languages 14.3 (1989), pp. 181–191.
[BD91] Frédéric Boussinot and Robert De Simone. “The ESTEREL lan-
guage.” In: Proceedings of the IEEE 79.9 (1991), pp. 1293–1304.
[MMP91] Oded Maler, Zohar Manna, and Amir Pnueli. “From timed
to hybrid systems.” In: Workshop/School/Symposium of the
REX Project (Research and Education in Concurrent Systems).
Springer. 1991, pp. 447–484.
[Gun92] Carl A Gunter. Semantics of programming languages: struc-
tures and techniques. MIT press, 1992.
[RPM92] Sebastian Ritz, Matthias Pankert, and Heinrich Meyr. “High
level software synthesis for signal processing systems.” In:
Proceedings of the international conference on application spe-
cific array processors. IEEE Computer Society. 1992, pp. 679–
680.
[Abb+93] Ben Abbott, Ted Bapty, Csaba Biegl, Gabor Karsai, and Janos
Sztipanovits. “Model-based software synthesis.” In: IEEE Soft-
ware 10.3 (1993), pp. 42–52.
[DR95] Volker Diekert and Grzegorz Rozenberg. The book of traces.
World scientific, 1995.
[LP95] Edward A Lee and Thomas M Parks. “Dataflow process net-
works.” In: Proceedings of the IEEE 83.5 (1995), pp. 773–801.
[Maz95] Antoni W Mazurkiewicz. Introduction to Trace Theory. 1995.
[Par95] Thomas M. Parks. “Bounded Scheduling of Process Networks.”
PhD thesis. EECS Department, University of California,
Berkeley, Dec. 1995. url:
http://www2.eecs.berkeley.edu/Pubs/TechRpts/1995/2926.html.
[Pin+95] José Luis Pino, Soonhoi Ha, Edward A Lee, and Joseph T Buck.
“Software synthesis for DSP using Ptolemy.” In: Journal of VLSI
signal processing systems for signal, image and video technol-
ogy 9.1-2 (1995), pp. 7–21.
[Bil+96] Greet Bilsen, Marc Engels, Rudy Lauwereins, and Jean Peper-
straete. “Cycle-static dataflow.” In: IEEE Transactions on signal
processing 44.2 (1996), pp. 397–408.
[Cra+96] James Crawford, Matthew Ginsberg, Eugene Luks, and
Amitabha Roy. “Symmetry-breaking predicates for search
problems.” In: KR 96.1996 (1996), pp. 148–159.
[Nag+96] Wolfgang E Nagel, Alfred Arnold, Michael Weber, Hans-
Christian Hoppe, and Karl Solchenbach. “VAMPIR: Visualiza-
tion and analysis of MPI resources.” In: (1996).
[HS97] Sepp Hochreiter and Jürgen Schmidhuber. “Long short-term
memory.” In: Neural computation 9.8 (1997), pp. 1735–1780.
[McK97] Bruce McKenzie. “Generating strings at random from a con-
text free grammar.” In: (1997).
[CDT98] G Calafiore, F Dabbene, and R Tempo. “Uniform sample generation
in $\ell_p$ balls for probabilistic robustness analysis.”
In: Proceedings of the 37th IEEE Conference on Decision and Control
(Cat. No. 98CH36171). Vol. 3. IEEE. 1998, pp. 3335–3340.
[Cla+98] Edmund M Clarke, E Allen Emerson, Somesh Jha, and A
Prasad Sistla. “Symmetry reductions in model checking.”
In: International Conference on Computer Aided Verification.
Springer. 1998, pp. 147–158.
[DRW98] Robert P Dick, David L Rhodes, and Wayne Wolf. “TGFF: task
graphs for free.” In: Proceedings of the Sixth International
Workshop on Hardware/Software Codesign (CODES/CASHE’98).
IEEE. 1998, pp. 97–101.
[Law98] Mark V Lawson. Inverse semigroups: the theory of partial sym-
metries. World Scientific, 1998.
[Lin98] Bill Lin. “Software synthesis of process-based concurrent
programs.” In: Proceedings of the 35th annual Design Automa-
tion Conference. 1998, pp. 502–505.
[BLM00] SS Bhattacharyya, Rainer Leupers, and Peter Marwedel.
“Software synthesis and code generation for signal process-
ing systems.” In: IEEE Transactions on Circuits and Systems II:
Analog and Digital Signal Processing 47.9 (2000), pp. 849–875.
[Koc+00] Erwin A de Kock, WJM Smits, Pieter van der Wolf, J-Y Brunel,
WM Kruijtzer, Paul Lieverse, Kees A Vissers, and Gerben Es-
sink. “YAPI: Application modeling for signal processing sys-
tems.” In: Proceedings of the 37th Annual Design Automation
Conference. 2000, pp. 402–405.
[Bra+01] Tracy D Braun, Howard Jay Siegel, Noah Beck, Ladislau L
Bölöni, Muthucumaru Maheswaran, Albert I Reuther, James
P Robertson, Mitchell D Theys, Bin Yao, Debra Hensgen, et al.
“A comparison of eleven static heuristics for mapping a class
of independent tasks onto heterogeneous distributed com-
puting systems.” In: Journal of Parallel and Distributed comput-
ing 61.6 (2001), pp. 810–837.
[HT01] Frank Hannig and Jürgen Teich. “Design space exploration
for massively parallel processor arrays.” In: International Con-
ference on Parallel Computing Technologies. Springer. 2001,
pp. 51–65.
[Kie+01] Bart Kienhuis, Ed F Deprettere, Pieter Van der Wolf, and Kees
Vissers. “A methodology to design programmable embed-
ded systems.” In: International Workshop on Embedded Com-
puter Systems. Springer. 2001, pp. 18–37.
[Mue+01] Wolfgang Mueller, Juergen Ruf, Dirk Hoffmann, Joachim Gerlach,
Thomas Kropf, and Wolfgang Rosenstiel. “The simulation
semantics of SystemC.” In: Proceedings Design, Automation
and Test in Europe. Conference and Exhibition 2001. IEEE.
2001, pp. 64–70.
[Mat02] Jiří Matoušek. Lectures on discrete geometry. Vol. 212. Springer
Science & Business Media, 2002.
[EJ03] Johan Eker and J Janneck. CAL language report: Specification of
the CAL actor language. 2003.
[Ser03] Ákos Seress. Permutation group algorithms. Vol. 152. Cam-
bridge University Press, 2003.
[SD03] Todor Stefanov and Ed Deprettere. “Deriving process net-
works from weakly dynamic applications in system-level de-
sign.” In: Proceedings of the 1st IEEE/ACM/IFIP international con-
ference on Hardware/software codesign and system synthesis.
2003, pp. 90–96.
[Sir04] Marjan Sirjani. “Formal specification and verification of con-
current and reactive systems.” In: PhD thesis (2004).
[WS04] G Gary Wang and Songqing Shan. “Design space reduction
for multi-objective optimization and robust design optimiza-
tion problems.” In: SAE transactions (2004), pp. 101–110.
[AGL05] James Ahrens, Berk Geveci, and Charles Law. “Paraview: An
end-user tool for large data visualization.” In: The visualiza-
tion handbook 717.8 (2005).
[Hol05] Derek F. Holt. Handbook of Computational Group Theory. CRC
Press, 2005.
[Kre+05] Marcio Kreutz, César A Marcon, Luigi Carro, Flavio Wagner,
and Altamiro A Susin. “Design space exploration compar-
ing homogeneous and heterogeneous network-on-chip ar-
chitectures.” In: Proceedings of the 18th annual symposium on
Integrated circuits and system design. 2005, pp. 190–195.
[ECP06] Cagkan Erbas, Selin Cerav-Erbas, and Andy D Pimentel. “Mul-
tiobjective optimization and evolutionary algorithms for the
application mapping problem in multiprocessor system-on-
chip design.” In: IEEE Transactions on Evolutionary Computa-
tion 10.3 (2006), pp. 358–374.
[Kan+06] Tero Kangas, Petri Kukkala, Heikki Orsila, Erno Salminen,
Marko Hännikäinen, Timo D Hämäläinen, Jouni Riihimäki,
and Kimmo Kuusilinna. “UML-based multiprocessor SoC de-
sign framework.” In: ACM Transactions on Embedded Comput-
ing Systems (TECS) 5.2 (2006), pp. 281–320.
[Lee06] Edward A Lee. “The problem with threads.” In: Computer 39.5
(2006), pp. 33–42.
[PEP06] Andy D Pimentel, Cagkan Erbas, and Simon Polstra. “A sys-
tematic approach to exploring embedded system architec-
tures at multiple abstraction levels.” In: IEEE Transactions on
Computers 55.2 (2006), pp. 99–112.
[SDN06] Todor Stefanov, Ed Deprettere, and Hristo Nikolov. “Multi-
processor system design with ESPAM.” In: Proceedings of the
4th International Conference on Hardware/Software Codesign
and System Synthesis (CODES+ISSS’06). IEEE. 2006, pp. 211–216.
[SGB06] S. Stuijk, M.C.W. Geilen, and T. Basten. “SDF3: SDF For Free.”
In: Application of Concurrency to System Design, 6th Interna-
tional Conference, ACSD 2006, Proceedings. Turku, Finland:
IEEE Computer Society Press, Los Alamitos, CA, USA, June
2006, pp. 276–278. doi: 10.1109/ACSD.2006.23.
url: http://www.es.ele.tue.nl/sdf3.
[The+06] Bart D Theelen, Marc CW Geilen, Twan Basten, Jeroen PM
Voeten, Stefan Valentin Gheorghita, and Sander Stuijk. “A
scenario-aware data flow model for combined long-run av-
erage and worst-case performance analysis.” In: Fourth ACM
and IEEE International Conference on Formal Methods and
Models for Co-Design, 2006. MEMOCODE’06. Proceedings. IEEE.
2006, pp. 185–194.
[DeC+07] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gu-
navardhan Kakulapati, Avinash Lakshman, Alex Pilchin,
Swaminathan Sivasubramanian, Peter Vosshall, and Werner
Vogels. “Dynamo: amazon’s highly available key-value store.”
In: ACM SIGOPS operating systems review 41.6 (2007), pp. 205–
220.
[EL07] Stephen A Edwards and Edward A Lee. “The case for the pre-
cision timed (PRET) machine.” In: Proceedings of the 44th an-
nual Design Automation Conference. 2007, pp. 264–265.
[Erb+07] Cagkan Erbas, Andy D Pimentel, Mark Thompson, and Simon
Polstra. “A framework for system-level modeling and simula-
tion of embedded systems architectures.” In: EURASIP Journal
on Embedded Systems 2007 (2007), pp. 1–11.
[MMB07] Orlando Moreira, Jacob Jan-David Mol, and Marco Bekooij.
“Online resource management in a multiprocessor with a
network-on-chip.” In: Proceedings of the 2007 ACM symposium
on Applied computing. 2007, pp. 1557–1564.
[Ors+07] Heikki Orsila, Tero Kangas, Erno Salminen, Timo D. Hämäläi-
nen, and Marko Hännikäinen. “Automated memory-aware
application distribution for multi-processor system-on-
chips.” In: J. of Sys. Arch. 53.11 (2007), pp. 795–815.
[San07] Alberto Sangiovanni-Vincentelli. “Quo vadis, SLD? Reasoning
about the trends and challenges of system level design.” In:
Proceedings of the IEEE 95.3 (2007), pp. 467–506.
[Thi+07] Lothar Thiele, Iuliana Bacivarov, Wolfgang Haid, and Kai
Huang. “Mapping applications to tiled multiprocessor em-
bedded systems.” In: Seventh International Conference on Ap-
plication of Concurrency to System Design (ACSD 2007). IEEE.
2007, pp. 29–40.
[BK08] Christel Baier and Joost-Pieter Katoen. Principles of model
checking. MIT press, 2008.
[Dic08] Robert Dick. Embedded Systems Synthesis Benchmark Suite
(e3s). 2008. url: http://ziyang.eecs.umich.edu/~dickrp/e3s/.
[Hau+08] Christian Haubelt, Thomas Schlichter, Joachim Keinert, and
Mike Meredith. “SystemCoDesigner: automatic design space
exploration and rapid prototyping from behavioral models.”
In: Proceedings of the 45th annual Design Automation Confer-
ence. 2008, pp. 580–585.
[Kum+08] Akash Kumar, Bart Mesman, Bart Theelen, Henk Corporaal,
and Yajun Ha. “Analyzing composability of applications on
MPSoC platforms.” In: Journal of Systems Architecture 54.3-4
(2008), pp. 369–383.
[MH08] Laurens van der Maaten and Geoffrey Hinton. “Visualizing
data using t-SNE.” In: Journal of machine learning research
9.Nov (2008), pp. 2579–2605.
[MEP08] Sorin Manolache, Petru Eles, and Zebo Peng. “Task mapping
and priority assignment for soft real-time applications under
deadline miss ratio constraints.” In: ACM Transactions on Em-
bedded Computing Systems (TECS) 7.2 (2008), pp. 1–35.
[Nik+08] Hristo Nikolov, Mark Thompson, Todor Stefanov, Andy Pi-
mentel, Simon Polstra, Raj Bose, Claudiu Zissulescu, and Ed
Deprettere. “Daedalus: toward composable multimedia MP-
SoC design.” In: Proceedings of the 45th annual Design Automa-
tion Conference. 2008, pp. 574–579.
[DM09] Alastair F Donaldson and Alice Miller. “On the constructive
orbit problem.” In: Annals of mathematics and artificial intelli-
gence 57.1 (2009), pp. 1–35.
[EMD09] Wolfgang Ecker, Wolfgang Müller, and Rainer Dömer.
“Hardware-dependent software.” In: Hardware-dependent
Software. Springer, 2009, pp. 1–13.
[HPP09] Mary Hall, David Padua, and Keshav Pingali. “Compiler re-
search: the next 50 years.” In: Communications of the ACM
52.2 (2009), pp. 60–67.
[Han+09] Andreas Hansson, Kees Goossens, Marco Bekooij, and Jos
Huisken. “CoMPSoC: A template for composable and pre-
dictable multi-processor system on chips.” In: ACM Transac-
tions on Design Automation of Electronic Systems (TODAES) 14.1
(2009), pp. 1–24.
[Lee09] Edward A Lee. “Computing needs time.” In: Communications
of the ACM 52.5 (2009), pp. 70–79.
[LM09] Edward A Lee and Eleftherios Matsikoudis. “The semantics
of dataflow with firing.” In: G. Huet, G. Plotkin, J.-J. Lévy, and Y.
Bertot, editors, From Semantics to Computer Science: Essays in
Honour of Gilles Kahn (2009), pp. 71–94.
[Cas+10] Jeronimo Castrillon, Ricardo Velasquez, Anastasia Stulova,
Weihua Sheng, Jianjiang Ceng, Rainer Leupers, Gerd Ascheid,
and Heinrich Meyr. “Trace-based KPN composability analy-
sis for mapping simultaneous applications to MPSoC plat-
forms.” In: 2010 Design, Automation & Test in Europe Confer-
ence & Exhibition (DATE 2010). IEEE. 2010, pp. 753–758.
[Sin+10] Amit Kumar Singh, Thambipillai Srikanthan, Akash Kumar,
and Wu Jigang. “Communication-aware heuristics for run-
time task mapping on NoC-based MPSoC platforms.” In: Jour-
nal of Systems Architecture 56.7 (2010), pp. 242–255.
[SGB10] Sander Stuijk, Marc Geilen, and Twan Basten. “A predictable
multiprocessor design flow for streaming applications with
dynamic behaviour.” In: 2010 13th Euromicro Conference on
Digital System Design: Architectures, Methods and Tools. IEEE.
2010, pp. 548–555.
[Yan+10] Bo Yang, Liang Guang, Thomas Canhao Xu, Tero Säntti,
and Juha Plosila. “Multi-application mapping algorithm for
network-on-chip platforms.” In: 2010 IEEE 26-th Convention
of Electrical and Electronics Engineers in Israel. IEEE. 2010,
pp. 000540–000544.
[Bha+11] Shuvra S Bhattacharyya, Johan Eker, Jörn W Janneck,
Christophe Lucarz, Marco Mattavelli, and Mickaël Raulet.
“Overview of the MPEG reconfigurable video coding frame-
work.” In: Journal of Signal Processing Systems 63.2 (2011),
pp. 251–263.
[CLA11] Jeronimo Castrillon, Rainer Leupers, and Gerd Ascheid.
“MAPS: Mapping concurrent dataflow applications to hetero-
geneous MPSoCs.” In: IEEE Transactions on Industrial Informat-
ics 9.1 (2011), pp. 527–545.
[Cas+11] Jeronimo Castrillon, Stefan Schürmans, Anastasia Stulova,
Weihua Sheng, Torsten Kempf, Rainer Leupers, Gerd As-
cheid, and Heinrich Meyr. “Component-based waveform de-
velopment: the Nucleus tool flow for efficient and portable
software defined radio.” In: Analog Integrated Circuits and Sig-
nal Processing 69.2 (2011), pp. 173–190.
[CSL11] Jeronimo Castrillon, Weihua Sheng, and Rainer Leupers.
“Trends in embedded software synthesis.” In: 2011 Interna-
tional Conference on Embedded Computer Systems: Architec-
tures, Modeling and Simulation. IEEE. 2011, pp. 347–354.
[Mar+11] Peter Marwedel, Iuliana Bacivarov, Chanhee Lee, Jürgen Te-
ich, Lothar Thiele, Qiang Xu, Georgia Kouveli, Soonhoi Ha,
and Lin Huang. “Mapping of applications to MPSoCs.” In:
2011 Proceedings of the Ninth IEEE/ACM/IFIP International Con-
ference on Hardware/Software Codesign and System Synthesis
(CODES+ISSS). IEEE. 2011, pp. 109–118.
[Yan+11] Xuejun Yang, Yang Chen, Eric Eide, and John Regehr. “Finding
and understanding bugs in C compilers.” In: ACM SIGPLAN Notices.
Vol. 46. 6. ACM. 2011, pp. 283–294.
[ZK11] Xiao Zhang and Hans G Kerkhoff. “A dependability solution
for homogeneous MPSoCs.” In: 2011 IEEE 17th Pacific Rim In-
ternational Symposium on Dependable Computing. IEEE. 2011,
pp. 53–62.
[BML12] Shuvra S Bhattacharyya, Praveen K Murthy, and Edward
A Lee. Software synthesis from dataflow graphs. Vol. 360.
Springer Science & Business Media, 2012.
[Cas+12] Jeronimo Castrillon, Andreas Tretter, Rainer Leupers, and
Gerd Ascheid. “Communication-aware mapping of KPN ap-
plications onto heterogeneous MPSoCs.” In: DAC Design Au-
tomation Conference 2012. IEEE. 2012, pp. 1262–1267.
[Cza12] Evan Czaplicki. “Elm: Concurrent FRP for Functional GUIs.” In:
Senior thesis, Harvard University 30 (2012).
[Gaj+12] Daniel D Gajski, Jianwen Zhu, Rainer Dömer, Andreas Gerst-
lauer, and Shuqing Zhao. SpecC: Specification language and
methodology. Springer Science & Business Media, 2012.
[Cas+13] Simone Casale-Brunet, Claudio Alberti, Marco Mattavelli,
and Jorn W Janneck. “Turnus: a unified dataflow design space
exploration framework for heterogeneous parallel systems.”
In: 2013 Conference on Design and Architectures for Signal and
Image Processing. IEEE. 2013, pp. 47–54.
[Des+13] Karol Desnos, Maxime Pelcat, Jean-François Nezan, Shuvra
S Bhattacharyya, and Slaheddine Aridhi. “Pimm: Parameter-
ized and interfaced dataflow meta-model for mpsocs run-
time reconfiguration.” In: 2013 International Conference on Em-
bedded Computer Systems: Architectures, Modeling, and Simu-
lation (SAMOS). IEEE. 2013, pp. 41–48.
[OWG13] Michael FP O’Boyle, Zheng Wang, and Dominik Grewe.
“Portable mapping of data parallel programs to OpenCL for
heterogeneous systems.” In: Proceedings of the 2013 IEEE/ACM
International Symposium on Code Generation and Optimization
(CGO). IEEE Computer Society. 2013, pp. 1–10.
[Ode+13] Maximilian Odendahl, Jeronimo Castrillon, Vitaliy Volevach,
Rainer Leupers, and Gerd Ascheid. “Split-cost communica-
tion model for improved MPSoC application mapping.” In:
2013 International Symposium on System on Chip (SoC). IEEE.
2013, pp. 1–8.
[SBA13] Jocelyn Sérot, François Berry, and Sameer Ahmed. “CAPH: a
language for implementing stream-processing applications
on FPGAs.” In: Embedded Systems Design with FPGAs. Springer,
2013, pp. 201–224.
[Sin+13] Amit Kumar Singh, Muhammad Shafique, Akash Kumar, and
Jörg Henkel. “Mapping on multi/many-core systems: survey
of current and emerging trends.” In: 2013 50th ACM/EDAC/IEEE
Design Automation Conference (DAC). IEEE. 2013, pp. 1–10.
[TDJ13] Samira Tasharofi, Peter Dinges, and Ralph E Johnson. “Why
do scala developers mix the actor model with other concur-
rency models?” In: European Conference on Object-Oriented
Programming. Springer. 2013, pp. 302–326.
[TP13] Mark Thompson and Andy D Pimentel. “Exploiting do-
main knowledge in system-level MPSoC design space explo-
ration.” In: Journal of Systems Architecture 59.7 (2013), pp. 351–
360.
[Uni13] The Univalent Foundations Program. Homotopy Type Theory:
Univalent Foundations of Mathematics. Institute for Advanced
Study: https://homotopytypetheory.org/book, 2013.
[Vap13] Vladimir Vapnik. The nature of statistical learning theory.
Springer science & business media, 2013.
[Zhe+13] Qi Zheng, Yajing Chen, Ronald Dreslinski, Chaitali
Chakrabarti, Achilleas Anastasopoulos, Scott Mahlke, and
Trevor Mudge. “WiBench: An open source kernel suite for
benchmarking wireless systems.” In: 2013 IEEE international
symposium on workload characterization (IISWC). IEEE. 2013,
pp. 123–132.
[CL14] Jerónimo Castrillón Mazo and Rainer Leupers. “Program-
ming heterogeneous mpsocs: Tool flows to close the soft-
ware productivity gap.” In: (2014).
[Don14] Jake Donham. Introducing Stitch. Tech. rep. [Online; accessed
4-May-2017]. 2014. url: https://www.youtube.com/watch?v=VVpmMfT8aYw.
[Eus+14] Juan Fernando Eusse, Christopher Williams, Luis Gabriel
Murillo, Rainer Leupers, and Gerd Ascheid. “Pre-
architectural performance estimation for ASIP design
based on abstract processor models.” In: 2014 International
Conference on Embedded Computer Systems: Architectures,
Modeling, and Simulation (SAMOS XIV). IEEE. 2014, pp. 133–140.
[Heu+14] Julien Heulot, Maxime Pelcat, Karol Desnos, Jean-Francois
Nezan, and Slaheddine Aridhi. “Spider: A synchronous pa-
rameterized and interfaced dataflow-based rtos for multi-
core dsps.” In: 2014 6th European Embedded Design in Educa-
tion and Research Conference (EDERC). IEEE. 2014, pp. 167–171.
[Kan+14] Shin-haeng Kang, Hoeseok Yang, Sungchan Kim, Iuliana Baci-
varov, Soonhoi Ha, and Lothar Thiele. “Static mapping of
mixed-critical applications for fault-tolerant MPSoCs.” In:
Proceedings of the 51st annual design automation conference.
2014, pp. 1–6.
[Mar+14] Simon Marlow, Louis Brandy, Jonathan Coens, and Jon Purdy.
“There is no fork: An abstraction for efficient, concurrent,
and concise data access.” In: Proceedings of the 19th ACM
SIGPLAN international conference on Functional programming.
2014, pp. 325–337.
[MP14] Brendan D. McKay and Adolfo Piperno. “Practical graph isomorphism,
II.” In: Journal of Symbolic Computation 60.0
(2014), pp. 94–112. issn: 0747-7171. doi: http://doi.org/10.1016/j.jsc.2013.09.003.
url: http://www.sciencedirect.com/science/article/pii/S0747717113001193.
[Mur+14] Luis Gabriel Murillo, Simon Wawroschek, Jeronimo Castril-
lon, Rainer Leupers, and Gerd Ascheid. “Automatic detection
of concurrency bugs through event ordering constraints.” In:
2014 Design, Automation & Test in Europe Conference & Exhibi-
tion (DATE). IEEE. 2014, pp. 1–6.
[Noe+14] Benedikt Noethen, Oliver Arnold, Esther Perez Adeva, Tobias
Seifert, Erik Fischer, Steffen Kunze, Emil Matúš, Gerhard Fet-
tweis, Holger Eisenreich, Georg Ellguth, et al. “10.7 A 105GOPS
36mm² heterogeneous SDR MPSoC with energy-aware dynamic
scheduling and iterative detection-decoding for 4G
in 65nm CMOS.” In: 2014 IEEE International Solid-State Cir-
cuits Conference Digest of Technical Papers (ISSCC). IEEE. 2014,
pp. 188–189.
[Pel+14] Maxime Pelcat, Karol Desnos, Julien Heulot, Clément Guy,
Jean-François Nezan, and Slaheddine Aridhi. “Preesm: A
dataflow-based rapid prototyping framework for simplifying
multicore dsp programming.” In: 2014 6th european embed-
ded design in education and research conference (EDERC). IEEE.
2014, pp. 36–40.
[Pto14] Claudius Ptolemaeus, ed. System Design, Modeling, and Simulation
using Ptolemy II. Ptolemy.org, 2014. url: http://ptolemy.org/books/Systems.
[QP14] Wei Quan and Andy D Pimentel. “Towards exploring vast mp-
soc mapping design spaces using a bias-elitist evolutionary
approach.” In: 2014 17th Euromicro Conference on Digital Sys-
tem Design. IEEE. 2014, pp. 655–658.
[Sch+14] Lars Schor, Iuliana Bacivarov, Hoeseok Yang, and Lothar
Thiele. “AdaPNet: Adapting process networks in response to
resource variations.” In: Proceedings of the 2014 International
Conference on Compilers, Architecture and Synthesis for Embed-
ded Systems. 2014, pp. 1–10.
[She+14] Weihua Sheng, Stefan Schürmans, Maximilian Odendahl,
Mark Bertsch, Vitaliy Volevach, Rainer Leupers, and Gerd Ascheid.
“A compiler infrastructure for embedded heterogeneous
MPSoCs.” English. In: Parallel Computing. Vol. 40. 2. Elsevier,
Feb. 2014, pp. 51–68. doi: http://dx.doi.org/10.1016/j.parco.2013.11.007.
[Wei+14] Andreas Weichslgartner, Deepak Gangadharan, Stefan Wil-
dermann, Michael Glaß, and Jürgen Teich. “DAARM: Design-
time application analysis and run-time mapping for pre-
dictable execution in many-core systems.” In: 2014 Interna-
tional Conference on Hardware/Software Codesign and System
Synthesis (CODES+ISSS). IEEE. 2014, pp. 1–10.
[Cat+15] Vincenzo Catania, Andrea Mineo, Salvatore Monteleone,
Maurizio Palesi, and Davide Patti. “Noxim: An open, exten-
sible and cycle-accurate network on chip simulator.” In: 2015
IEEE 26th international conference on application-specific sys-
tems, architectures and processors (ASAP). IEEE. 2015, pp. 162–
163.
[Che+15] Shuang Chen, Xue Lin, Alireza Shafaei, Yanzhi Wang, and
Massoud Pedram. “Analysis of deeply scaled multi-gate de-
vices with design centering across multiple voltage regimes.”
In: 2015 IEEE SOI-3D-Subthreshold Microelectronics Technology
Unified Conference (S3S). IEEE. 2015, pp. 1–2.
[Kac15] Alexey Kachayev. Reinventing Haxl: Efficient, Concurrent and
Concise Data Access. Tech. rep. [Online; accessed 4-May-2017].
2015. url: https://www.youtube.com/watch?v=T-oekV8Pwv8.
[Li+15] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard
Zemel. “Gated graph sequence neural networks.” In: arXiv
preprint arXiv:1511.05493 (2015).
[Mel15a] Mellanox Technologies. TILE-Gx36 Processor. [Online; ac-
cessed 2019-05-22]. Mellanox Technologies. 2015. (Visited on
05/22/2019).
[Mel15b] Mellanox Technologies. TILE-Gx72 Processor. [Online; ac-
cessed 2019-05-22]. Mellanox Technologies. 2015. (Visited on
05/22/2019).
[Mou+15] Leonardo de Moura, Soonho Kong, Jeremy Avigad, Floris Van
Doorn, and Jakob von Raumer. “The Lean theorem prover
(system description).” In: International Conference on Auto-
mated Deduction. Springer. 2015, pp. 378–388.
[Pel+15] Maxime Pelcat, Karol Desnos, Luca Maggiani, Yanzhou Liu,
Julien Heulot, Jean-François Nezan, and Shuvra S Bhat-
tacharyya. “Models of architecture.” In: (2015).
[QP15] Wei Quan and Andy D Pimentel. “A hybrid task mapping algorithm
for heterogeneous MPSoCs.” In: ACM Transactions on
Embedded Computing Systems (TECS) 14.1 (2015), pp. 1–25.
[Rol+15] Sascha Roloff, David Schafhauser, Frank Hannig, and Jürgen
Teich. “Execution-driven parallel simulation of PGAS appli-
cations on heterogeneous tiled architectures.” In: 2015 52nd
ACM/EDAC/IEEE Design Automation Conference (DAC). IEEE.
2015, pp. 1–6.
[The15] The Multicore Association, Inc. Software-Hardware Interface
for Multi-Many-Core (SHIM) Specification, V1.0. The Multicore
Association, Inc. Jan. 2015.
[Wad15] Philip Wadler. “Propositions as types.” In: Communications of
the ACM 58.12 (2015), pp. 75–84.
[Bab16] László Babai. “Graph isomorphism in quasipolynomial time.”
In: Proceedings of the forty-eighth annual ACM symposium on
Theory of Computing. 2016, pp. 684–697.
[Bal+16] Jonathan Balkind, Michael McKeown, Yaosheng Fu, Tri
Nguyen, Yanqi Zhou, Alexey Lavrov, Mohammad Shahrad,
Adi Fuchs, Samuel Payne, Xiaohua Liang, Matthew Matl, and
David Wentzlaff. “OpenPiton: An Open Source Manycore Re-
search Framework.” In: Proceedings of the Twenty-First Inter-
national Conference on Architectural Support for Programming
Languages and Operating Systems. ASPLOS ’16. Atlanta, Georgia,
USA: ACM, 2016, pp. 217–232. isbn: 978-1-4503-4091-5.
doi: 10.1145/2872362.2872414. url: https://doi.org/10.1145/2872362.2872414.
[Che+16] Kuan-Hsun Chen, Jian-Jia Chen, Florian Kriebel, Semeen
Rehman, Muhammad Shafique, and Jörg Henkel. “Task map-
ping for redundant multithreading in multi-cores with relia-
bility and performance heterogeneity.” In: IEEE Transactions
on Computers 65.11 (2016), pp. 3441–3455.
[KKM16] Enagnon Cedric Klikpo, Jad Khatib, and Alix Munier-Kordon.
“Modeling multi-periodic simulink systems by synchronous
dataflow graphs.” In: 2016 IEEE Real-Time and Embedded Tech-
nology and Applications Symposium (RTAS). IEEE. 2016, pp. 1–
10.
[Olo16] Andreas Olofsson. “Epiphany-V: A 1024 processor 64-bit risc
system-on-chip.” In: arXiv preprint arXiv:1610.01832 (2016).
[Sha16] Christopher Shaver. “On the Representation of Distributed
Behavior.” PhD thesis. EECS Department, University of California,
Berkeley, Dec. 2016. url: http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-206.html.
[Sod+16] A. Sodani, R. Gramunt, J. Corbal, H. Kim, K. Vinod, S.
Chinthamani, S. Hutsell, R. Agarwal, and Y. Liu. “Knights Landing:
Second-Generation Intel Xeon Phi Product.” In: IEEE Micro
36.2 (Mar. 2016), pp. 34–46. issn: 0272-1732. doi: 10.1109/MM.2016.25.
url: https://doi.org/10.1109/MM.2016.25.
[Wei+16] Andreas Weichslgartner, Stefan Wildermann, Johannes
Götzfried, Felix Freiling, Michael Glaß, and Jürgen Teich.
“Design-time/run-time mapping of security-critical applica-
tions in heterogeneous mpsocs.” In: Proceedings of the 19th
International Workshop on Software and Compilers for Embed-
ded Systems. 2016, pp. 153–162.
[Zhu+16] Di Zhu, Lizhong Chen, Siyu Yue, Timothy M Pinkston, and
Massoud Pedram. “Providing balanced mapping for multi-
ple applications in many-core chip multiprocessors.” In: IEEE
Transactions on Computers 65.10 (2016), pp. 3122–3135.
[ABK17] Miltiadis Allamanis, Marc Brockschmidt, and Mahmoud
Khademi. “Learning to represent programs with graphs.” In:
arXiv preprint arXiv:1711.00740 (2017).
[AMS17] Josefine Asmus, Christian L Müller, and Ivo F Sbalzarini. “Lp-
Adaptation: Simultaneous Design Centering and Robustness
Estimation of Electronic and Biological Systems.” In: Scientific
reports 7.1 (2017), pp. 1–12.
[Cum+17a] Chris Cummins, Pavlos Petoumenos, Zheng Wang, and Hugh
Leather. “Synthesizing benchmarks for predictive modeling.”
In: 2017 IEEE/ACM International Symposium on Code Generation
and Optimization (CGO). IEEE. 2017, pp. 86–99.
[Cum+17b] Christopher Cummins, Pavlos Petoumenos, Zheng Wang,
and Hugh Leather. “End-to-end Deep Learning of Optimiza-
tion Heuristics.” In: Proceedings of the International Confer-
ence on Parallel Architectures and Compilation Techniques
(PACT 2017). Portland, Oregon, US, Sept. 2017.
[Kra17] Sebastian Krammer. “Isomorphism-Classes of Subgraphs via
Semigroups.” Bachelor’s Thesis. RWTH Aachen, 2017.
[Lee17] Edward A Lee. Plato and the Nerd: The Creative Partnership of
Humans and Technology. MIT Press, 2017.
[Sch+17] Tobias Schwarzer, Andreas Weichslgartner, Michael
Glaß, Stefan Wildermann, Peter Brand, and Jürgen Te-
ich. “Symmetry-eliminating design space exploration for
hybrid application mapping on many-core architectures.”
In: IEEE Transactions on Computer-Aided Design of Integrated
Circuits and Systems 37.2 (2017), pp. 297–310.
[All+18] Miltiadis Allamanis, Earl T Barr, Premkumar Devanbu, and
Charles Sutton. “A survey of machine learning for big code
and naturalness.” In: ACM Computing Surveys (CSUR) 51.4
(2018), pp. 1–37.
[BJH18] Tal Ben-Nun, Alice Shoshana Jakobovits, and Torsten Hoefler.
“Neural code comprehension: a learnable representation of
code semantics.” In: Advances in Neural Information Process-
ing Systems. 2018, pp. 3585–3597.
[Chi18] David Chisnall. “C is not a low-level language.” In: Communi-
cations of the ACM 61.7 (2018), pp. 44–48.
[EAC18] Sebastian Ertel, Justus Adam, and Jeronimo Castrillon. “Sup-
porting Fine-grained Dataflow Parallelism in Big Data Sys-
tems.” In: Proceedings of the 9th International Workshop
on Programming Models and Applications for Multicores and
Manycores (PMAM). PMAM’18. Vienna, Austria: ACM, Feb. 2018,
pp. 41–50. isbn: 978-1-4503-5645-9. doi: 10.1145/3178442.3178447.
url: http://doi.acm.org/10.1145/3178442.3178447.
[Li+18a] Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom
Goldstein. “Visualizing the Loss Landscape of Neural Nets.”
In: Neural Information Processing Systems. 2018.
[Li+18b] Yujia Li, Oriol Vinyals, Chris Dyer, Razvan Pascanu, and Peter
Battaglia. “Learning deep generative models of graphs.” In:
arXiv preprint arXiv:1803.03324 (2018).
[RG18] Valentina Richthammer and Michael Glaß. “On Search-Space
Restriction for Design Space Exploration of Multi-/Many-
Core Systems.” In: MBMV. 2018.
[Tam+18] S. M. Tam, H. Muljono, M. Huang, S. Iyer, K. Royneogi, N.
Satti, R. Qureshi, W. Chen, T. Wang, H. Hsieh, S. Vora, and
E. Wang. “SkyLake-SP: A 14nm 28-Core xeon® processor.” In:
2018 IEEE International Solid-State Circuits Conference (ISSCC).
Feb. 2018, pp. 34–36. doi: 10.1109/ISSCC.2018.8310170.
url: https://doi.org/10.1109/ISSCC.2018.8310170.
[Bec19] Micah Beck. “On the Hourglass Model.” In: Commun. ACM
62.7 (June 2019), pp. 48–57. issn: 0001-0782. doi: 10.1145/3274770.
url: https://doi.org/10.1145/3274770.
[BCJ19] Hasna Bouraoui, Jeronimo Castrillon, and Chadlia Jerad.
“Comparing dataflow and openmp programming for
speaker recognition applications.” In: Proceedings of the
10th and 8th Workshop on Parallel Programming and Run-
Time Management Techniques for Many-core Architectures
and Design Tools and Architectures for Multicore Embedded
Computing Platforms. 2019, pp. 1–6.
[Eas+19] James East, Attila Egri-Nagy, James D Mitchell, and Yann
Péresse. “Computing finite semigroups.” In: Journal of Sym-
bolic Computation 92 (2019), pp. 110–155.
[Ert19] Sebastian Ertel. “Towards Implicit Parallel Programming for
Systems.” PhD thesis. Dresden, Germany: TU Dresden, Dec.
2019, 121pp.
[Fet+19] Gerhard Fettweis, Meik Dörpinghaus, Jeronimo Castrillon,
Akash Kumar, Christel Baier, Karlheinz Bock, Frank Ellinger,
Andreas Fery, Frank H. P. Fitzek, Hermann Härtig, Kam-
biz Jamshidi, Thomas Kissinger, Wolfgang Lehner, Michael
Mertig, Wolfgang E. Nagel, Giang T. Nguyen, Dirk Plette-
meier, Michael Schröter, and Thorsten Strufe. “Architecture
and Advanced Electronics Pathways Toward Highly Adap-
tive Energy-Efficient Computing.” In: Proceedings of the IEEE
107.1 (Jan. 2019), pp. 204–231. issn: 0018-9219. doi: 10.1109/JPROC.2018.2874895.
url: https://ieeexplore.ieee.org/document/8565890.
[Lee19] Edward A Lee. “Freedom From Choice and the Power of Mod-
els: in Honor of Alberto Sangiovanni-Vincentelli.” In: Proceed-
ings of the 2019 International Symposium on Physical Design.
2019, pp. 126–126.
[LL19] Marten Lohstroh and Edward A Lee. “Deterministic actors.”
In: 2019 Forum for Specification and Design Languages (FDL).
IEEE. 2019, pp. 1–8.
[Tew19] Felix Teweleitt. “A logic language for IoT mappings.” Studien-
arbeit. TU Dresden, 2019.
[Web19] Matthew Weber. “Context and Interaction in the Internet of
Things.” PhD thesis. EECS Department, University of California,
Berkeley, Aug. 2019. url: http://www2.eecs.berkeley.edu/Pubs/TechRpts/2019/EECS-2019-114.html.
[WAL19] Matthew Weber, Ravi Akella, and Edward A Lee. “Service
Discovery for the Connected Car with Semantic Accessors.”
In: 2019 IEEE Intelligent Vehicles Symposium (IV). IEEE. 2019,
pp. 2417–2422.
[Yad19] Omry Yadan. Hydra - A framework for elegantly configuring
complex applications. Github. 2019. url: https://github.com/facebookresearch/hydra.
[Zha19] Yong Zhao. “Health monitoring and life-time prognostics to
enable dependable many-processor SoCs.” PhD thesis. University
of Twente, 2019.
[Bra20] Alexander Brauckmann. “Investigating Input Representa-
tions and Representation Models of Source Code for Ma-
chine Learning.” MA thesis. TU Dresden, Feb. 2020.
[BCM20] Nishant Budhdev, Mun Choon Chan, and Tulika Mitra. IsoRAN:
Isolation and Scaling for 5G RAN via User-Level Data Plane
Virtualization. 2020. arXiv: 2003.01841 [cs.NI].
[CDA20] C/DA - Design Automation. “IEEE Standard for Software-Hardware
Interface for Multi-Many-Core.” In: IEEE Std 2804-2019
(Jan. 2020), pp. 1–84. doi: 10.1109/IEEESTD.2020.8985663.
url: https://standards.ieee.org/standard/2804-2019.html.
[Cum+20] Chris Cummins, Zacharias Fisches, Tal Ben-Nun, Torsten
Hoefler, Hugh Leather, and Michael O’Boyle. “Pro-
gram Graphs for Machine Learning.” In: arXiv preprint
arXiv:2003.10536 (2020).
[Gat+20] Alan Gatherer, Ashish Shrivastava, Hao Luan, Asheesh
Kashyap, Zhenguo Gu, and Miguel Dajer. “Towards a Domain
Specific Solution for a New Generation of Wireless Modems.”
In: arXiv preprint arXiv:2012.02890 (2020).
[inc20] Kalray inc. Kalray MPPA3 Coolidge Announcement. 2020. url:
https://www.kalrayinc.com/release-of-third-generation-mppa-processor-coolidge/.
[KC20] Robert Khasanov and Jeronimo Castrillon. “Energy-efficient
Runtime Resource Management for Adaptable Multi-
application Mapping.” In: Proceedings of the 2020 Design,
Automation and Test in Europe Conference (DATE). DATE
’20. Grenoble, France: IEEE, Mar. 2020, pp. 909–914. isbn:
978-3-9819263-4-7. doi: 10.23919/DATE48585.2020.9116381.
url: https://ieeexplore.ieee.org/document/9116381.
[LC20] Hugh Leather and Chris Cummins. “Machine learning in com-
pilers: Past, present and future.” In: 2020 Forum for Specifica-
tion and Design Languages (FDL). IEEE. 2020, pp. 1–8.
[Loh20] Marten Lohstroh. “Reactors: A Deterministic Model of Con-
current Computation for Reactive Systems.” PhD thesis. UC
Berkeley, 2020.
[Loh+20a] Marten Lohstroh, Christian Menard, Alexander Schulz-
Rosengarten, Matthew Weber, Jeronimo Castrillon, and Ed-
ward A Lee. “A Language for Deterministic Coordination
Across Multiple Timelines.” In: 2020 Forum for Specification
and Design Languages (FDL). IEEE. 2020, pp. 1–8.
[Loh+20b] Marten Lohstroh, Christian Menard, Alexander Schulz-
Rosengarten, Matthew Weber, Jeronimo Castrillon, and Ed-
ward A Lee. “A Language for Deterministic Coordination
Across Multiple Timelines.” In: 2020 Forum for Specification
and Design Languages (FDL). IEEE. 2020, pp. 1–8.
[Nic20] Timo Nicolai. “Faster MPSoC Task Mapping via Symmetry Re-
duction.” Studienarbeit. TU Dresden, 2020.
[Pal+20] Aditya Paliwal, Sarah M Loos, Markus N Rabe, Kshitij Bansal,
and Christian Szegedy. “Graph Representations for Higher-
Order Logic and Theorem Proving.” In: AAAI. 2020, pp. 2967–
2974.
[RFG20] Valentina Richthammer, Fabian Fassnacht, and Michael Glaß.
“Search-space Decomposition for System-level Design Space
Exploration of Embedded Systems.” In: ACM Transactions on
Design Automation of Electronic Systems (TODAES) 25.2 (2020),
pp. 1–32.
[Rup20] Karl Rupp. Microprocessor Trend Data.
https://github.com/karlrupp/microprocessor-trend-data. 2020.
[Thi20] Alexander Thierfelder. “A Domain-Specific Generative Model
of Code for LLVM.” MA thesis. TU Dresden, Feb. 2020.
[Ye+20] Guixin Ye, Zhanyong Tang, Huanting Wang, Dingyi Fang, Jian-
bin Fang, Songfang Huang, and Zheng Wang. “Deep Program
Structure Modeling Through Multi-Relational Graph-based
Learning.” In: Proceedings of the ACM International Conference
on Parallel Architectures and Compilation Techniques. 2020,
pp. 111–123.
[BJC21] Hasna Bouraoui, Chadlia Jerad, and Jeronimo Castrillon.
“Towards Adaptive multi-Alternative Process Network.” In:
Proceedings of the 12th Workshop and 10th Workshop on
Parallel Programming and RunTime Management Techniques
for Manycore Architectures and Design Tools and Architec-
tures for Multicore Embedded Computing Platforms (PARMA-
DITAM’21), co-located with 16th International Conference on
High-Performance and Embedded Architectures and Compilers
(HiPEAC). PARMA-DITAM 2021. Budapest, Hungary: Schloss
Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl Publish-
ing, Jan. 2021.
[GAP20] The GAP Group. GAP – Groups, Algorithms, and Programming,
Version 4.11.0. The GAP Group. 2020. url:
https://www.gap-system.org.
L I S T O F F I G U R E S
Figure 1.1 Chip trends as obtained from [Rup20]. The lines
present the exponential growth prediction if con-
sidering data up until the year 2000. . . . . . . . . . 2
Figure 1.2 The HAEC architecture [Fet+19] has multiple levels
of hierarchy: on-chip, intra-board (optical links) and
inter-board (wireless). . . . . . . . . . . . . . . . . . 3
Figure 1.3 A flow for MoC-based Software Synthesis. The main
abstractions colored in green are the ones we deal
with in this thesis. . . . . . . . . . . . . . . . . . . . 6
Figure 1.4 An example of the mapping space for a simple two-
task application. . . . . . . . . . . . . . . . . . . . . . 7
Figure 1.5 Dependencies of chapters and sections of this thesis. 10
Figure 2.1 The audio filter application as a KPN graph . . . . . 15
Figure 2.2 Different possible sequential executions of the au-
dio filter KPN. . . . . . . . . . . . . . . . . . . . . . . . 16
Figure 2.3 Different levels of abstraction in architectures . . . 18
Figure 2.4 Multiple Levels of Abstraction in the Y-Chart Ap-
proach (Inspired by Figure 6 in [Kie+01]). . . . . . . 19
Figure 2.5 The Odroid-XU4 Architecture. . . . . . . . . . . . . . 20
Figure 2.6 An Example of an Architecture Graph for the
Odroid-XU4 Architecture. . . . . . . . . . . . . . . . 21
Figure 2.7 Comparison of the Architecture and Topology
Graphs for a 4 × 4-Mesh NoC-based Architecture. . 21
Figure 2.8 An example of a mapping as a diagram (left) and as
a morphism of graphs (right). . . . . . . . . . . . . . 23
Figure 2.9 An example of the mapping space for a simple two-
task application. . . . . . . . . . . . . . . . . . . . . . 24
Figure 2.10 The Software Synthesis Flow from Figure 1.3. MAPS
implements all steps in the flow, which are there-
fore all depicted in green. . . . . . . . . . . . . . . . 28
Figure 2.11 Mapping and simulating KPN Applications in mocasin. 30
Figure 3.1 An illustration of probabilities in code space . . . . 33
Figure 3.2 An illustration of different types of benchmarks . . 34
Figure 3.3 The KPN graph of the speaker recognition application. 37
Figure 3.4 An example of a Level Graph. Adapted from Figure 1
of [Goe+18]. . . . . . . . . . . . . . . . . . . . . . . . . 39
Figure 3.5 An illustration of generative models in the Fischer-
Wald setting. . . . . . . . . . . . . . . . . . . . . . . . 41
Figure 3.6 The flow of CLGen and our re-evaluation. Adapted
from Figure 1 in [Goe+19]. . . . . . . . . . . . . . . . 42
Figure 3.7 Accuracy obtained by the heuristic for the differ-
ent datasets in the setup. Adapted from Figure 2
of [Goe+19]. . . . . . . . . . . . . . . . . . . . . . . . . 43
Figure 3.8 Smoothed relative frequencies of kernels as func-
tion of the first principal component. Adapted from
Figure 6 of [Goe+19]. . . . . . . . . . . . . . . . . . . 43
Figure 3.9 A comparison of the accuracy of multiple machine
learning methods for the CPU/GPU classification of
OpenCL kernels. Adapted from Figure 12 of [Bra+20]. 45
Figure 4.1 Examples of transformations in the Odroid-XU4 ar-
chitecture. . . . . . . . . . . . . . . . . . . . . . . . . 48
Figure 4.2 The communication topology affects symmetries in
architectures. . . . . . . . . . . . . . . . . . . . . . . 48
Figure 4.3 The topology of the Kalray MPPA3 Coolidge. . . . . 49
Figure 4.4 A symmetry transformation of the audio filter ap-
plication. . . . . . . . . . . . . . . . . . . . . . . . . . 50
Figure 4.5 Group actions on mappings. . . . . . . . . . . . . . . 51
Figure 4.6 Measurements of four equivalent mappings for the
audio filter application on the Odroid-XU3 architec-
ture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Figure 4.7 A comparison of the two different-sized meshes
and the intuitive notion of their symmetries. . . . . 56
Figure 4.8 An example of a local symmetry that is not a global
symmetry of a 4 × 4 mesh. . . . . . . . . . . . . . . . 57
Figure 4.9 The transformation of Figure 4.8 as a partial permu-
tation. . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Figure 4.10 An intuitive example of distance between mappings. 62
Figure 4.11 An example of a problem with the orthogonal-sum
construction of the distance metric for the mapping
space. . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Figure 4.12 A visualization of a random projection of the map-
ping space . . . . . . . . . . . . . . . . . . . . . . . . 66
Figure 4.13 Comparison of multiple distance metrics on the
Odroid XU4 platform. . . . . . . . . . . . . . . . . . . 68
Figure 4.14 Comparison of multiple distance metrics as pre-
dictors of the maximal run-time difference on the
Odroid XU4 platform. . . . . . . . . . . . . . . . . . . 69
Figure 4.15 Comparison of multiple distance metrics on the
MPPA3 Coolidge platform. . . . . . . . . . . . . . . . 69
Figure 4.16 Comparison of multiple distance metrics as pre-
dictors of the maximal run-time difference on the
MPPA3 Coolidge platform. . . . . . . . . . . . . . . . 70
Figure 4.17 Comparison of the predictive power of multiple dis-
tance metrics. . . . . . . . . . . . . . . . . . . . . . . 70
Figure 4.18 A visualization of the mapping space of Figure 2.9
in the Symmetries representation. . . . . . . . . . . 71
Figure 4.19 A visualization of the mapping space of
Figure 1.4 in the MetricSpaceEmbedding (left) and
SymmetryEmbedding (right) representations. . . . . . 72
Figure 4.20 An overview of all four representations discussed
in the example of the mapping space for a simple
two-task application from Figure 1.4. . . . . . . . . . 73
Figure 5.1 Equivalent mappings of two applications, one be-
ing compact and the other one not. Adapted from
Figure 1 in [GMC19]. . . . . . . . . . . . . . . . . . . . 75
Figure 5.2 Comparison of latencies between compact, non-
compact and random mappings. Adapted from Fig-
ure 2 in [GMC19]. . . . . . . . . . . . . . . . . . . . . 77
Figure 5.3 Comparison between compact, non-compact and
random mappings running in isolation or with
another 9 applications. Adapted from Figure 4
in [GMC19]. . . . . . . . . . . . . . . . . . . . . . . . . 78
Figure 5.4 Two equivalent mappings that yield good perfor-
mance. Adapted from Figure 5 in [GMC19]. . . . . . 78
Figure 5.5 Visualization of the design space for multiple
thresholds . . . . . . . . . . . . . . . . . . . . . . . . 80
Figure 5.6 Examples of possible neighborhoods around
design centers in two-dimensional random
projections of the design space for the audio filter
application on the Odroid XU4. . . . . . . . . . . . . . . 81
Figure 5.7 Design centering and perturbation stability for mul-
tiple threshold levels in the Odroid XU4 platform. . 81
Figure 5.8 Design centering and perturbation stability for mul-
tiple threshold levels in the MPPA3 Coolidge platform. 82
Figure 5.9 Comparison of multiple mapping heuristics and
metaheuristics on the E3S benchmarks, relative to
the results of the genetic algorithms. . . . . . . . . 84
Figure 5.10 The effect of a symmetry-aware cache on multi-
ple architecture topologies as evaluated on the E3S
benchmarks. . . . . . . . . . . . . . . . . . . . . . . . 86
Figure 5.11 The effect of symmetry-pruning of the DSE by chang-
ing the operations in algorithms to consider sym-
metry. Evaluated on multiple architecture topolo-
gies on the E3S benchmarks. . . . . . . . . . . . . . . 87
Figure 5.12 The effect of a pruning via symmetries on the
MPPA3 Coolidge as a function of the number of
tasks in the application as evaluated on the E3S
benchmarks. . . . . . . . . . . . . . . . . . . . . . . . 88
Figure 5.13 The effect of embedding-based representations in
metaheuristics that leverage the geometry on the
MPPA3 Coolidge platform. . . . . . . . . . . . . . . . 89
Figure 5.14 Visualization of the same design space of the audio
filter benchmark on the MPPA3 Coolidge platform
in two different representations. . . . . . . . . . . . 90
Figure 5.15 Comparison of the effects of multiple representa-
tions on the Odroid XU4 platform. . . . . . . . . . . 91
Figure 5.16 Comparison of the effects of multiple representa-
tions on the MPPA3 Coolidge platform. . . . . . . . 92
Figure 5.17 The actual mapping space of a GSM-based two-task
application from E3S on the Odroid XU4 that in-
spired Figure 1.4. . . . . . . . . . . . . . . . . . . . . . 93
Figure 5.18 An illustration of Pareto points in the mapping space. 96
Figure 5.19 Variant selection in TETRiS. . . . . . . . . . . . . . . . 96
Figure 5.20 The TETRiS flow. . . . . . . . . . . . . . . . . . . . . . . 97
Figure 5.21 Comparison of the TETRiS system with Linux’ CFS
executing four instances of the audio filter benchmark
simultaneously on an Odroid XU4. Adapted from
Figure 9 in [Goe+17]. . . . . . . . . . . . . . . . . . . . . 98
Figure 6.1 Overview of different models of computation.
Color-filled nodes refer to concrete models, dotted
ones are abstract properties. . . . . . . . . . . . . . 102
Figure 6.2 Relationships between different dataflow models
of computation. . . . . . . . . . . . . . . . . . . . . . 104
Figure 6.3 An example of a KPN which admits non-blocking-
read semantics. . . . . . . . . . . . . . . . . . . . . . 106
Figure 6.4 Examples of Gantt Charts corresponding to imple-
mentations of the Kahn Function f . . . . . . . . . . 107
Figure 6.5 A counterexample of the equivalence of Kahn-
MacQueen and Kahn processes. . . . . . . . . . . . 107
Figure 6.6 An example of data-parallelism exploiting the Mac-
Queen gap. . . . . . . . . . . . . . . . . . . . . . . . . 109
Figure 6.7 Simplified model of a basestation uplink modem.
Adapted from Figure 2 of [Wit+20] . . . . . . . . . . 118
Figure 6.8 Different parameter combinations and their effects
on the requirements on computation in LTE. “No.
RB” denotes the number of resource blocks. . . . . 118
Figure 6.9 Possible configurations in a resource-constrained
LTE environment. The number of UEs are depicted
with a meaningless random jitter for visibility.
Adapted from Figure 2 in [Wit+20]. . . . . . . . . . . 119
Figure 6.10 The Reactor network of the modified WiBench
benchmark in Lingua Franca. . . . . . . . . . . . . . 120
Figure 7.1 An audio filter in SDF semantics in Ptolemy II . . . . 124
Figure 7.2 The audio filter example in Lingua Franca . . . . . . 125
Figure 7.3 Dependencies of (map (f . g . h) inputs).
Adapted from Figure 5 in [Ert+19b] . . . . . . . . . . 128
Figure 7.4 Microservices at Amazon. . . . . . . . . . . . . . . . 130
Figure 7.5 Mapping the terms of the Clojure-based language
to an expression IR. Adapted from Figure 9 in [Ert+18]. 132
Figure 7.6 Batching I/O with Ÿauhau compared to Haxl and
Muse. Adapted from Figure 11 of [Ert+18]. . . . . . . 133
Figure 7.7 Concurrent I/O with Ÿauhau compared to Haxl and
Muse. Adapted from Figure 12 of [Ert+18]. . . . . . . 134
Figure 7.8 Concurrent I/O in modular programs with Ÿauhau.
Adapted from Figure 13 of [Ert+18]. . . . . . . . . . . 134
Figure A.1 A square. . . . . . . . . . . . . . . . . . . . . . . . . . 147
Figure A.2 The action of the permutation (1, 2) on the square. 147
Figure A.3 The action of the rotation (1, 2, 3, 4) on the square. . 147
Figure A.4 The action of the reflection (1, 2)(3, 4) on the square. 148




ω-complete partial order, 100
(greatest) lower bound, 100

























average network delay, 77
batched I/O, 132

























































































































L I S T O F A C R O N Y M S
ALU Arithmetic Logic Unit
ANOVA Analysis of Variance
AP Adaptive Platform
API Application Programming Interface
AST abstract syntax tree
BNF Backus-Naur Form
BSGS base and strong generating set
CDFG control- and data-flow graph
cfaed Center for Advancing Electronics Dresden
CAS computer algebra system
CFS Completely Fair Scheduler
CGO International Symposium on Code Generation and Optimization
CPN C for Process Networks
CPU Central Processing Unit
CPS Cyber-Physical System
CSDF Cyclo-Static Data Flow
CSP Communicating Sequential Processes
DDF Dynamic Data Flow
DLP data-level parallelism
DMA Direct Memory Access
DPN Dataflow Process Networks
DOL Distributed Operation Layer
DSE Design-Space Exploration
DSL Domain-Specific Language
HAEC Highly-Adaptive Energy-Efficient Computing
E3S Embedded System Synthesis Benchmarks Suite
EDA Electronic Design Automation
FFT Fast Fourier Transform
FIFO first in, first out
FPGA Field Programmable Gate Array
FRP Functional Reactive Programming
GAP Groups, Algorithms, Programming
GBM Group Based Mapping
GGNN Gated Graph Sequence Neural Network
GPU Graphics Processing Unit
GSM Global System for Mobile Communications
GUI Graphical User Interface
HLS High-Level Synthesis
HOG Histogram of Oriented Gradients
HSDF Homogeneous SDF
IDE Integrated Development Environment
IFFT inverse FFT
i.i.d. independent and identically distributed
ILP integer linear programming





KPN Kahn Process Network
KMQ Kahn-MacQueen
LSTM Long Short-Term Memory
LTE Long Term Evolution
NoC Network on Chip
NVM non-volatile memory
MAPS MPSoC Application Programming Studio




PCB printed circuit board
pdf probability density function








RNTI Radio Network Temporary Identifier
RTM race-track memory
RVC Reconfigurable Video Coding
SADF Scenario-Aware Data Flow
scc strongly connected component
SDF Synchronous Data Flow
SDF3 SDF For Free
TETRiS Transitive Efficient Template Run-time System
TGFF task graph for free
UE User Equipment
XML Extensible Markup Language
