Hardware synthesis of complex system-on-chip designs for embedded systems using a behavioural programming and multi-process model by Bosse, Stefan
PROCEEDINGS 
 
 
 
  
 
 
 
 
 
 
13 - 17 September 2010 
 
 
Crossing Borders within the ABC 
 
Automation, 
Biomedical Engineering and 
Computer Science 
 
 
 
Faculty of  
Computer Science and Automation 
 
 
 
www.tu-ilmenau.de  
 
 
 
Home / Index: 
http://www.db-thueringen.de/servlets/DocumentServlet?id=16739 
55. IWK
Internationales Wissenschaftliches Kolloquium
International Scientific Colloquium
Impressum 
Published by 
 
Publisher: Rector of the Ilmenau University of Technology 
Univ.-Prof. Dr. rer. nat. habil. Dr. h. c. Prof. h. c. Peter Scharff 
 
Editor: Marketing Department (Phone: +49 3677 69-2520) 
Andrea Schneider (conferences@tu-ilmenau.de) 
 
 Faculty of Computer Science and Automation 
(Phone: +49 3677 69-2860) 
Univ.-Prof. Dr.-Ing. habil. Jens Haueisen 
 
Editorial Deadline:  20. August 2010 
 
Implementation:  Ilmenau University of Technology 
Felix Böckelmann 
Philipp Schmidt 
 
 
USB-Flash-Version. 
 
Publishing House: Verlag ISLE, Betriebsstätte des ISLE e.V. 
Werner-von-Siemens-Str. 16 
98693 Ilmenau 
 
Production:  CDA Datenträger Albrechts GmbH, 98529 Suhl/Albrechts 
 
Order through:  Marketing Department (+49 3677 69-2520) 
Andrea Schneider (conferences@tu-ilmenau.de) 
 
ISBN: 978-3-938843-53-6 (USB-Flash Version) 
 
 
Online-Version: 
 
Publisher: Universitätsbibliothek Ilmenau 
  
Postfach 10 05 65 
 98684 Ilmenau 
 
 
© Ilmenau University of Technology (Thür.) 2010 
 
The content of the USB flash version and the online documents is protected by copyright law. 
 
 
Hardware Synthesis of Complex System-on-Chip Designs for Embedded Systems Using a
Behavioural Programming and Multi-Process Model
Stefan Bosse(1,2)
University of Bremen, Department of Computer Science, Workgroup Robotics, Germany (1); ISIS Sensorial Materials Scientific Centre, Germany (2)
ABSTRACT
Embedded systems used for control, for example in Cyber-Physical Systems (CPS), monitor and control complex physical processes using applications running on dedicated execution platforms in a resource-constrained manner. Application-specific System-on-Chip (SoC) designs providing the execution platform have advantages compared with traditionally used program-controlled multi-processor architectures. SoC designs can be modelled on the structural and on the behavioural level. The behavioural level is generally the more sophisticated modelling level. In the context of CPS, these are mainly reactive systems with dominant and complex control paths. The major contribution to concurrency appears on the control path level. A new SoC design methodology is presented using the behavioural hardware compiler ConPro, which provides an imperative programming model based on concurrently communicating sequential processes (CSP) with an extensive set of interprocess-communication primitives and guarded atomic actions. The programming language and the compiler-based synthesis process enable the design of constrained power- and resource-aware embedded systems with pure Register-Transfer Logic efficiently mapped to FPGA and ASIC technologies. Concurrency is modelled explicitly on the control path and data path level. Additionally, concurrency on the data path level can be explored and optimized automatically by different schedulers. The CSP programming model can be synthesized to different levels and is not only used for hardware circuit synthesis: software models (C, ML), intermediate μCode, RTL state level, and finally VHDL. A common source for both hardware and software implementation with identical functional behaviour is used. An extended case study of a communication protocol used in high-density sensor-actuator networks demonstrates the design of a SoC for a robot actuator. The communication protocol is suited for high-density intra- and interchip networks.
Index Terms: Cyber-Physical Systems, System-on-Chip design, Synthesis, Digital Logic, High-Level Synthesis, ASIC and FPGA technology, Communication, Network Protocols, Parallel systems, Parallel computing
1. INTRODUCTION AND OVERVIEW
Embedded systems used for control, for example in Cyber-Physical Systems (CPS), monitor and control complex physical processes using applications running on dedicated execution platforms in a resource-constrained manner. System-on-Chip designs are preferred for high miniaturization and low-power applications. Traditionally, program-controlled multi-processor architectures are used to provide the execution platform, but application-specific digital logic is gaining importance.
There are two different ways to model and implement the System-on-Chip (SoC) designs used in those embedded systems: 1. on a structural level and/or 2. on a behavioural level. The structural level decomposes a SoC into independent submodules (processor cores or data processing units in general, memories, and peripherals) interacting with each other using centralized or distributed networks and communication protocols. The behavioural level usually describes the behaviour of the full design interacting with the environment without detailed assumptions about the system architecture; it is generally the more sophisticated modelling level. In the context of CPS, these are mainly reactive systems with dominant and complex control paths. The major contribution to concurrency appears on the control path level, which can be modelled explicitly on the algorithmic level.
A new SoC design methodology is presented using the behavioural hardware compiler ConPro, which provides an imperative programming model based on concurrently communicating sequential processes (CSP) [5] and guarded atomic actions [4] with an extensive set of interprocess-communication primitives. The programming language and the compiler-based synthesis flow enable the design of application-specific, constrained, power- and resource-aware embedded systems on Register-Transfer Level (RTL), efficiently mapped to FPGA and ASIC technologies. Concurrency is modelled explicitly on the control path and data path level. Additionally, concurrency on the data path level can be explored and optimized automatically by different schedulers. Hardware blocks (including IPC and externally modelled blocks) can be accessed transparently from the programming level with a generic object-orientated approach.
The CSP programming model can be synthesized to several other levels and is not only used for hardware circuit synthesis: software models (C, ML), intermediate μCode, RT state level, and finally the hardware behaviour level, e.g. VHDL. A common source for both hardware and software implementation with identical functional behaviour matches different embedded architecture levels and enables code reuse. The metalanguage ML (OCaml) is well suited for simulation and test-pattern-based functional model checking.
Why a new language? Traditional programming languages like C are designed for sequential programming only, and concurrency is present only to some extent through the use of libraries [1]. Concurrency should be controlled by first-class language constructs [3] to enable the optimized design of massively parallel systems and hardware synthesis. There are several examples of newly designed languages for concurrent programming, like SystemJ [1] or X10 [3]. C-like languages used for hardware synthesis are widespread, but they are not fully suitable for RTL synthesis due to their strong dependency on the memory model (pointers) and the missing concurrency model.
What is novel compared with other high-level synthesis approaches? One language targets both concurrent software and hardware programming, and the hardware synthesis process can be controlled at fine granularity on the programming level using parameterized blocks. A traditional compiler approach with an intermediate μCode representation (without loss of concurrency) enables fast and optimized synthesis. Object-orientated access to hardware blocks using the External Module Interface (EMI), which is part of the programming model, provides a modern and transparent interface for both software and hardware designers, closing the gap between software and hardware models. The extended set of IPC primitives enables concurrent programming of complex control and data processing systems.
2. SYSTEM-ON-CHIP DESIGN USING A
BEHAVIOURAL MODEL APPROACH AND
HIGH-LEVEL SYNTHESIS
Concurrency has great impact on system and data processing behaviour concerning latency, data throughput, and power consumption. Streaming and functional data processing requires fine-grained concurrency (on the data path level), whereas reactive control systems (for example communication) require coarse-grained concurrency (on the control path level).
The structural level decomposes a SoC into independent submodules interacting with each other using centralized or distributed networks and communication protocols, mainly program-controlled multi-processor architectures.
The behavioural level usually describes the functional behaviour of the full design interacting with the environment. Most applications and data processing are modelled on the algorithmic behavioural level using some kind of imperative programming language.
The ConPro high-level synthesis of SoC designs uses a behavioural imperative programming language with a compiler-based synthesis approach from the algorithmic programming level to the register-transfer level, which can be mapped directly to digital logic [2].
Concurrency is modelled explicitly on the control path level with processes, each executing a set of instructions sequentially, initially independent of any other process. Interprocess communication (IPC) provides synchronization with different objects (mutex, semaphore, event, timer) and data exchange between processes using queues or channels, based on the Communicating Sequential Processes (CSP, Hoare 1985) model. There are local and global resources (storage, IPC), accessed by one process and by several processes, respectively. Concurrent access to global resources is automatically guarded by a mutex scheduler, which serializes access and provides atomic access without conflicts.
There are process instructions and top-level instructions. Top-level instructions are evaluated during synthesis (configuration). Process instructions are transformed and mapped to states of a clock-synchronous finite-state machine (FSM) controlling the RTL data path of the process temporally and spatially, as shown in Figure 1.
More fine-grained concurrency is provided on the data path level using bounded blocks, which execute several instructions (data path only, e.g. data assignments) in one time unit. Block-level parallelism can be enabled explicitly or explored implicitly by a basicblock scheduler [2].
The complete synthesis process can be parameterized at fine granularity on the programming block level, for example by selecting different expression models (allocation) or by activating specific schedulers and optimizers.
Hardware blocks modelled on the hardware level (VHDL) can be accessed from the programming level using an object-orientated programming approach with methods. All hardware blocks, including IPC, are treated like abstract data type objects (ADTO) with a defined set of methods accessible on the process level and on the top level (the latter only for configuration methods, for example setting the time interval of a timer). The bridge between the hardware and software model is provided by the External Module Interface (EMI).
The relationship between the proposed programming and execution model and the required building blocks of ConPro (programming language and synthesis) is illustrated in Figure 2. The programming language supports different types of storage objects (single registers and variables in shared RAM blocks, true bit-scaled), different aggregation types (array, structure), and abstract objects. Programming statements can modify data (expressions, assignments) or affect the control flow (conditional and counting loops, conditional branches, concurrent multi-value selection).

[Figure 1: two ConPro processes communicating through a queue, each mapped to an FSM with an RTL data path. Source code shown in the figure:]

queue q: int;

process a:
begin
  reg x: int;
  x <- 0;
  for i = 1 to 10
  do
    x <- x + q;
  done;
end;

process b:
begin
  reg y: int;
  y <- 1;
  for i = 1 to 10
  do
    q <- y+i;
    y <- y*2;
  done;
end;

Figure 1: Mapping of the proposed multi-process model to FSM-RTL architecture using high-level synthesis.
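A software model of the two processes shown in Figure 1 can be sketched with Python threads and a blocking queue. This is only an approximation under stated assumptions: `queue.Queue` stands in for the ConPro queue object, Python's unbounded integers stand in for the `int` registers (no bit-width wrap-around is modelled), and thread scheduling does not reflect the clock-synchronous hardware timing.

```python
import queue
import threading

q = queue.Queue()   # stands in for: queue q: int;
result = {}         # collects the final register value of process a


def process_a():
    """Consumer: x <- x + q, ten times; each read blocks until data arrives."""
    x = 0
    for _ in range(1, 11):
        x = x + q.get()
    result["x"] = x


def process_b():
    """Producer: q <- y + i, then y <- y * 2, for i = 1..10."""
    y = 1
    for i in range(1, 11):
        q.put(y + i)
        y = y * 2


threads = [threading.Thread(target=process_a), threading.Thread(target=process_b)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The queue is the only shared object, so the two processes synchronize purely through communication, in the spirit of the CSP model.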
Figure 3 gives an overview of the design flow through the different levels provided by the ConPro framework. After the source code is parsed and transformed into an abstract syntax tree (AST), there are different allocation, scheduling, and optimization stages. The reference stack scheduler performs symbolic analysis on the AST level and resolves constant and storage propagation, conditional assignments, and multiple assignments. This ALAP (as-late-as-possible) scheduler has an impact on the scheduling and allocation done by optimization. The intermediate μCode representation was chosen for simplified RTL synthesis and optimization (synthesis pass I).
The basicblock scheduler partitions the program code into blocks without control side entries that contain only data assignments (basicblocks). For each basicblock, a data-dependency analysis is performed. Independent data assignments can be bound to the same time unit. These optimizing schedulers can be activated or deactivated on the block level. Finally, in synthesis pass III, the RTL is synthesized and mapped to VHDL. Alternatively, after pass I (AST) or pass II (μCode), software output with the same functional and simulated/scheduled concurrency behaviour can be compiled.
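The idea of binding independent data assignments to the same time unit can be sketched as a simple as-soon-as-possible placement over one basic block. This is an illustrative assumption, not ConPro's actual scheduling algorithm: the function name `schedule_basicblock` and the representation of an assignment as a (target, sources) pair are hypothetical, and registers are assumed to read their old values within a time unit, so a write may share a unit with an earlier read of the same register.

```python
def schedule_basicblock(assignments):
    """Place each assignment in the earliest time unit allowed by its dependencies.

    assignments: list of (target, sources) pairs in program order.
    Returns a list of time units, each a list of assignment indices.
    """
    unit_of = []
    for idx, (target, sources) in enumerate(assignments):
        earliest = 0
        for prev in range(idx):
            p_target, p_sources = assignments[prev]
            if p_target in sources or p_target == target:
                # Read-after-write or write-after-write: strictly later unit.
                earliest = max(earliest, unit_of[prev] + 1)
            elif target in p_sources:
                # Write-after-read: not before the unit of the earlier reader.
                earliest = max(earliest, unit_of[prev])
        unit_of.append(earliest)
    units = [[] for _ in range(max(unit_of, default=-1) + 1)]
    for idx, unit in enumerate(unit_of):
        units[unit].append(idx)
    return units


# a <- x+1; b <- x-1; x <- a
block = [("a", {"x"}), ("b", {"x"}), ("x", {"a"})]
time_units = schedule_basicblock(block)
```

For this block the first two assignments are independent and share one time unit, while the third depends on `a` and is placed in the following unit.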
The synthesis flow

χ : CP → AST → μCode → RTL → VHDL    (1)

is defined by a set of rules χ. Each set consists of subsets which can be selected by parameter settings (for example scheduling like loop unrolling, or different allocation rules) on block level.

[Figure 2: block diagram. Recoverable labels: behavioural model (communicating sequential processes, guarded atomic actions, imperative and sequential instruction processing, multi-processing, interprocess communication, concurrency and parallelism); imperative constrained parallel programming language (process, types, objects, statement; data, control, abstract); ConPro implementation and design flow (hardware compiler, software compiler, analysis, optimization, synthesis, intermediate representation, External Module Interface EMI); hardware model, software model; data and control path, RTL, VHDL, μCode, C, ML; SoC hardware, processor software.]
Figure 2: Building blocks: from the programming model to hardware using high-level synthesis.

[Figure 3: flow chart. Recoverable labels: ConPro source, parser, analysis, AST transformation, AST optimization; synthesis pass 1 (μCode synthesis); μCode transformation and optimization; reference stack scheduler, basicblock scheduler, expression scheduler; synthesis pass 2; FSM and data path synthesis; synthesis pass 3 (VHDL synthesis, toolchain script generator); C/ML synthesis; EMI source, μCode source; rules, code templates, constraints; allocation, scheduling; SoC hardware, processor software.]
Figure 3: SoC design flow using the high-level synthesis framework ConPro providing mapping of a parallel programming model to RTL hardware and alternatively to software.
Example 1 shows a concurrent computation system performing data modification by an array of four processes sum[0..3]. They access the global register x. Though each access to x is atomic and guarded, the expression in line 9 as a whole is not, thus a mutual exclusion lock m is required. A master process someother controls the system and waits for the completion of all sum processes using a semaphore. A timer t performs group synchronization (included here only for illustration). The synthesis is controlled on the block level with different settings (loop unrolling in line 10, scheduling in line 11, object constraints in lines 2 & 3). Line 14 creates a bounded block for the data assignments to registers a and b (using a comma instead of a semicolon).

 1 open Mutex; open Timer; open Process;
 2 object m: mutex with scheduler="fifo";
 3 object t: timer; t.time(1 millisec);
 4 object s: semaphore;
 5 reg x: int[12];
 6 array sum: process[4] of begin
 7   for i = 1 to 10 do begin
 8     t.await();
 9     m.lock(); x <- x + 1; m.unlock()
10   end with unroll=true; s.up();
11 end with schedule="basicblock";
12 process someother: begin
13   reg a,b: int[10];
14   a <- x+1, b <- x-1; x <- a;
15   t.init(); t.start(); s.init(0);
16   for i = 0 to 3 do sum.[i].start();
17   for i = 1 to 4 do s.down();
18 end;

Example 1: Parts of a ConPro source code example.

Objects (like IPC) belong to a module, which has to be opened first (line 1). Each module is defined by a set of EMI implementation files providing all necessary information about the objects of this module (like method declarations, object access, and implementation on the hardware level).
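The synchronization structure of Example 1 can be approximated in software with Python threads. The sketch below is an assumption-laden model, not output of the ConPro software backend: `threading.Lock` and `threading.Semaphore` stand in for the mutex and semaphore objects, the reset value of x is assumed to be 0, the timer-based group synchronization is omitted, and the register bit widths are not modelled.

```python
import threading

x = 0                        # global register x (assumed reset value 0)
m = threading.Lock()         # stands in for: object m: mutex
s = threading.Semaphore(0)   # stands in for: object s: semaphore; s.init(0)


def sum_process():
    """One of the four sum processes: ten guarded increments, then s.up()."""
    global x
    for _ in range(10):
        with m:              # m.lock(); x <- x + 1; m.unlock()
            x = x + 1
    s.release()              # s.up()


def someother():
    """Master process: initialise x, start the workers, wait for all four."""
    global x
    a = x + 1
    b = x - 1                # computed as in line 14, not used afterwards
    x = a
    workers = [threading.Thread(target=sum_process) for _ in range(4)]
    for w in workers:
        w.start()            # sum.[i].start()
    for _ in range(4):
        s.acquire()          # s.down()
    for w in workers:
        w.join()


someother()
```

Without the lock around the read-modify-write of `x`, increments from different threads could be lost, which is the reason the mutual exclusion lock is needed in the original example as well.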
3. AN EXTENDED EXAMPLE:
IMPLEMENTATION OF A PROTOCOL STACK
FOR COMMUNICATION IN SENSOR
NETWORKS
The Simple Local Intranet Protocol (SLIP) [6] is used for communication in wired high-density sensor and actuator networks. It implements smart routing of messages with Δ-addressing of nodes arranged in an n-dimensional network space (line, mesh, cube). The network can be heterogeneous regarding node size, computation power, and memory. The communication protocol is scalable regarding network topology and size. A node is both a network service endpoint and a router. The routing information is always kept in the packet, which consists of: 1.) a header descriptor (HDT) specifying the address size class ASC and the address dimension class ADC (for example, 2 is a two-dimensional mesh grid), 2.) a packet descriptor (PDT) with routing and path information, and 3.) the data part. SLIP was designed for low-resource System-on-Chip implementations using ASIC/FPGA target technologies, but a software version was required, too. A node should handle several serial link connections and incoming packets concurrently; thus the protocol stack is a massively parallel system and was implemented with the ConPro behavioural multi-process model. Each link is serviced by two processes: a message decoder for incoming messages and an encoder for outgoing messages. A packet processor pkt_process applies a set of smart routing computation functions (route_normal, route_opposite, route_backward, applied in the given order until routing is possible), finding the best routing direction. Communication between processes is implemented with queues. There are three packet pools holding the HDT, PDT, and data parts of packets. They are implemented with arrays. The packet processor can be replicated to speed up the processing of packets.
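The ordered application of routing rules can be sketched as a rule cascade. The rule semantics below are assumptions chosen for illustration only, since the protocol itself is defined in [6] and not in this text: the normal rule is assumed to pick a link pointing towards the destination, the opposite rule a link pointing away from it (excluding the incoming link), and the backward rule the incoming link. The function names are written here with underscores, and the link naming scheme ("+x", "-y") and the `AXES` constant are hypothetical.

```python
AXES = "xyz"  # hypothetical axis names for up to three dimensions


def route_normal(delta, links, incoming):
    """Assumed rule: use a link whose direction matches the sign of delta."""
    for axis, d in enumerate(delta):
        if d != 0:
            wanted = ("+" if d > 0 else "-") + AXES[axis]
            if wanted in links:
                return links.index(wanted)
    return None


def route_opposite(delta, links, incoming):
    """Assumed rule: detour on a link pointing away, but not the incoming one."""
    for axis, d in enumerate(delta):
        if d != 0:
            detour = ("-" if d > 0 else "+") + AXES[axis]
            if detour in links and links.index(detour) != incoming:
                return links.index(detour)
    return None


def route_backward(delta, links, incoming):
    """Assumed rule: last resort, send the packet back where it came from."""
    return incoming


RULES = (route_normal, route_opposite, route_backward)


def pkt_process(delta, links, incoming):
    """Apply the rules in the given order until one yields an outgoing link."""
    for rule in RULES:
        link = rule(delta, links, incoming)
        if link is not None:
            return rule.__name__, link
    return None


# Packet with delta = (2, 3) arriving on link 1 of a node with links (-y, -x).
decision = pkt_process((2, 3), ["-y", "-x"], incoming=1)
```

Under these assumptions, a packet with Δ=(2,3) arriving on the second link of a node with links (-y,-x) is rejected by the normal rule and forwarded to link 0 by the opposite rule.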
A test setup consisting of the routing processor part of SLIP was implemented A. in hardware (RTL-SoC, gate-level synthesis with Mentor Graphics Leonardo Spectrum and the SXLIB standard cell library), and B. in software (SunOS, SunPro C compiler). A packet with ADC=2, Δ=(2,3) and a link setup of the node L=(-y,-x) is received on the second link (-x) [L01] and is processed first by the route_normal rule (which would require connected +x/+y links) [L03], and finally by the route_opposite rule [L04], forwarding the modified packet to the link 0 process [LA0]. Tables 1 to 3 show synthesis and simulation results of both the hardware (HW) and software (SW) implementation. They show low resource demands and latency. The checkpoints Lxx indicate the progress of packet processing. The required clock cycles are obtained from gate-level simulation, and the required machine operations are obtained from software simulation with a debugger. The two HW implementations differ in packet pool architecture: 1. a variable array in RAM blocks with EREW access, and 2. a register array with CREW access, resulting in lower latency. The SW implementation contains built-in multi-processing and requires up to 30 times more operations (time units) than the HW implementation.

Resources                                 | Checkpoint | Clock cycles
Registers: 767 FF                         | L01        | 104
Area: 12475 gates                         | L03        | 113 (+9)
Path delay: 18 ns                         | L04        | 187 (+74)
Source: 1109 lines CP → 9200 lines VHDL   | LA0        | 235 (+48)
Table 1: HW implementation of routing part of SLIP [packet pool: variable array, ASIC, Leonardo + SXLIB]

Resources                                 | Checkpoint | Clock cycles
Registers: 587 FF                         | L01        | 102
Area: 10758 gates                         | L03        | 107 (+5)
Path delay: 16 ns                         | L04        | 148 (+41)
Source: 1109 lines CP → 7900 lines VHDL   | LA0        | 184 (+36)
Table 2: HW implementation of routing part of SLIP [packet pool: register array, ASIC, Leonardo + SXLIB]

Resources                                 | Checkpoint | Machine operations
BSS: 40980 bytes                          | L01        | 60000
DS: 4352 bytes                            | L03        | 60019 (+19)
CS: 49288 bytes                           | L04        | 60796 (+777)
Source: 1109 lines CP → 2667 lines C      | LA0        | 62305 (+1509)
Table 3: SW implementation of routing part of SLIP [packet pool: variable array, SunPro CC, SunOS, USIII; CS: code segment, DS/BSS: data segments]
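The relation between the hardware and software figures can be checked with a short calculation on the checkpoint values of Tables 1 to 3. The sketch assumes that the comparison is based on the span between checkpoints L01 and LA0; the text does not state how the factor of up to 30 was derived, so this is one plausible reading rather than the author's method.

```python
# Checkpoint values copied from Tables 1 to 3.
checkpoints = {
    "HW variable array (Table 1)": {"L01": 104, "L03": 113, "L04": 187, "LA0": 235},
    "HW register array (Table 2)": {"L01": 102, "L03": 107, "L04": 148, "LA0": 184},
    "SW (Table 3)": {"L01": 60000, "L03": 60019, "L04": 60796, "LA0": 62305},
}


def elapsed(row):
    """Time units spent between receiving (L01) and forwarding (LA0) the packet."""
    return row["LA0"] - row["L01"]


spans = {name: elapsed(row) for name, row in checkpoints.items()}
sw_span = spans["SW (Table 3)"]

# Ratio of software machine operations to hardware clock cycles.
ratios = {
    name: sw_span / span
    for name, span in spans.items()
    if name.startswith("HW")
}
```

Under this reading, the software needs 2305 machine operations against 131 and 82 clock cycles in hardware, giving ratios of about 17.6 and 28.1, which is in line with the stated factor of up to 30.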
4. SUMMARY
The ConPro programming language uses a concurrent multi-process model with interprocess communication and guarded atomic actions, which is well suited to implementing parallel control and data processing systems. Algorithms can be reused from traditional sequential programming.
The ConPro synthesis tool is capable of implementing complex algorithms, like communication protocols requiring concurrency on the control path level, efficiently in hardware (below and beyond 1M gates) and in software with the same functional behaviour.
Hardware blocks are accessed using a method-based object-orientated programming model.
5. BIBLIOGRAPHY
[1] A. Malik, Z. Salcic, P. S. Roop, SystemJ compilation using the tandem virtual machine approach, ACM Trans. Des. Autom. Electron. Syst., Vol. 14 (2009)
[2] S. Bosse, ConPro: Rule-Based Mapping of an Imperative Programming Language to RTL for Higher-Level-Synthesis Using Communicating Sequential Processes, Technical Paper, BSSLAB, Bremen, 2009
[3] P. Charles et al., X10: an object-oriented approach to non-uniform cluster computing, OOPSLA '05: Proceedings of the 20th annual ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications (2005)
[4] D. L. Rosenband, Arvind, Modular Scheduling of Guarded Atomic Actions, Proceedings of the 41st annual conference on Design automation (2004)
[5] C. A. R. Hoare, Communicating Sequential Processes, Prentice Hall, 1985
[6] S. Bosse, D. Lehmhus, Smart Communication in a Wired Sensor- and Actuator-Network of a Modular Robot Actuator System Using a Hop-Protocol with Δ-Routing, Smart Systems Integration, Como, Italy, 23-24.3.2010
