A Language and Toolset for the Synthesis and Efficient Simulation of Clock-Cycle-True Signal-Processing Algorithms by Hofstra, Klaas L. et al.
A Language and Toolset for the Synthesis and Efficient
Simulation of Clock-Cycle-True Signal-Processing
Algorithms
Klaas L. Hofstra, Sabih H. Gerez∗ and David van Kampen∗
University of Twente, Department of Electrical Engineering,
Signals and Systems Group,
P.O. Box 217, 7500 AE Enschede, The Netherlands
Phone: +31 53 489 2773 Fax: +31 53 489 1060
E-mail: k.l.hofstra@utwente.nl
Abstract—Optimal simulation speed and synthesizabil-
ity are contradictory requirements for a hardware descrip-
tion language. This paper presents a language and toolset
that enables both synthesis and fast simulation of fixed-
point signal processing algorithms at the register-transfer
level using a single system description. This is achieved
by separate code generators for different purposes. Code-
generators have been developed for fast simulation (using
ANSI-C) and for synthesis (using VHDL). The simulation
performance of the proposed approach has been compared
with other known methods and turns out to be comparable
in speed to the fastest among them.
Keywords—hardware description languages, simulation,
synthesis
I. Introduction
In the last four decades many hardware description
languages (HDLs) have been proposed, each having
their strengths and weaknesses. HDLs serve one or
more of the following goals: specification, formal veri-
fication, simulation, and synthesis. Ideally, one would
use one and the same description language for all
goals. In practice, however, the different goals are dif-
ficult to satisfy: code written in synthesizable VHDL
(see e.g. [1]) will be much slower to simulate than code
written in C, while code written in e.g. SystemC [2]
in a style to optimize simulation speed is not likely to
be synthesizable.
In practice, multiple HDLs are used to overcome
this problem. C is often used to create a first exe-
cutable model of the system to be designed. Such a
model is untimed [3] in general. It has the advantage
of fast execution and allows extensive elaboration of
the system’s performance. Once the model has been
refined to the point that a hardware architecture can
be specified, the design is manually recoded in a
synthesizable HDL, such as VHDL. This is a cumbersome
and error-prone process.
∗The main affiliation of Sabih Gerez is with SiTel Semiconductor,
Design Center Hengelo, The Netherlands. David van
Kampen’s contributions to this work were part of his internship
at SiTel Semiconductor.
This paper addresses the problem of conflicting lan-
guage requirements and proposes a solution for the
domain of register-transfer level (RTL) descriptions of
signal processing algorithms. This solution consists of
a single specification language called Arx, in combina-
tion with tools that process designs written in this lan-
guage. The tools build an internal representation and
map this internal representation to external formats in
an optimal way. The C-generation tool expands the
compact Arx format into C code that executes efficiently:
for example, fixed-point data types are mapped onto the
int data type where possible, and shadow variables are
introduced to distinguish the input and output of a register.
In short, all the detail that a user would otherwise add
manually to a simulation model is generated automatically.
In a similar way, the VHDL-generation
tool incorporates expert knowledge about synthesis
when converting Arx into synthesizable VHDL.
The proposed approach has the advantage that a
thorough design-space exploration can be performed
due to the optimized simulation speed, while it is not
necessary to rewrite the simulation code by hand in
order to synthesize hardware.
RTL in the context of Arx should be understood in
a wide sense. It distinguishes between registers and
wires (or variables) and assumes the presence of an
implicit clock that controls the register updates (sup-
port for such a distinction could already be found in
early HDLs such as MoDL [4]). First versions of a de-
sign can be written with a minimal number of registers
and need not be clock-cycle true. The refinement of
the design leads to a careful placement of registers in
the signal flow that eventually results in a clock-cycle-true
specification.
Fig. 1. Arx workflow. (Diagram labels: functional Arx code; bit-true Arx code; bit-true, clock-cycle-true Arx code; C-based simulator; verification; VHDL; synthesis.)
This paper is organized as follows: in Section II the
tools and language are introduced. The next section
describes the problem of modeling RTL in C. Sec-
tion IV provides more details about the implementa-
tion of the tools. Benchmark results are presented in
Section V. Section VI concludes the paper.
II. The Arx Toolset
The language and toolset enable a successive refine-
ment methodology, i.e. a design can be first described
in a high-level description and iteratively refined into
a synthesizable description. At each level of the de-
sign, the tools can generate C code for a simulator.
This simulator can be used for high-speed verification
and evaluation of the design and algorithms.
Figure 1 shows the typical Arx workflow. The high-level
system description is used for algorithmic verification.
At this level all data types, including floating
point, are allowed. The generated simulator is neither
bit-true nor clock-cycle-true. Subsequently, the design
is further refined into a bit-true description that
restricts the data types to non-floating-point types. Once
the conversion to fixed-point is completed, the gen-
erated simulator is bit-true but not clock-cycle-true.
The next refinement step transforms the design into
RTL code. Given this RTL description, the tools can
generate a fast bit-true and clock-cycle-true simulator
and VHDL code for synthesis.
A. The Arx Language
The language Arx is still in development. It com-
bines elements from C and VHDL. Arx supports
all arithmetic operations of VHDL. The conditional
statements if and case and the looping statement
for are also included in the language. Arrays can be
constructed and individual bits or slices can be selected
with special operators. The design goal of the
language is to simplify the hardware design for signal
processing algorithms. All state transitions in the
design must be synchronous, which makes it impossible
to construct latches. This limitation allows a
simple data-flow style design description and enables
fast simulations. Globally asynchronous locally synchronous
(GALS) designs can still be created by embedding
the design in a C-based framework such as
SystemC.
TABLE I
Arx data types.
type      description
boolean   true or false
integer   integer values
real      floating-point numbers
signed    signed fixed-point
unsigned  unsigned fixed-point
Arx has two types of data object: registers and
variables. Assignments to registers are concurrent
while assignments to variables are sequential. When a
register object is declared, its reset value can be speci-
fied, otherwise it defaults to zero. While the simulator
only supports synchronous resets, the VHDL genera-
tor supports both synchronous and asynchronous re-
sets.
The available data types are listed in Table I. Additionally,
users can define their own enumerated types.
For both signed and unsigned data types the number
of bits is specified in the same way as for the SystemC
sc_fixed data type [5]: signed(wl,iwl) and
unsigned(wl,iwl), where wl denotes the total word
length and iwl denotes the number of integer bits.
The code below shows how to declare a register and a
variable with a signed fixed-point data type (2 integer
bits and 4 fractional bits). The declared register has
a non-zero reset value.
register signed(6,2) my_reg(reset=1.75);
variable signed(6,2) my_var;
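As an illustration of this notation, the sketch below maps a signed(wl,iwl) value onto a plain C integer that holds the value scaled by 2^(wl-iwl). The helper names fx_from_double and fx_to_double are hypothetical and are not part of the Arx tools; quantization by truncation toward minus infinity is assumed, and range checking is omitted.

```c
#include <stdint.h>

/* Hypothetical helper: convert a real value to the raw integer of a
 * signed(wl, iwl) word. The raw integer equals value * 2^(wl - iwl),
 * truncated toward minus infinity (assumption). No overflow check. */
static int32_t fx_from_double(double v, int wl, int iwl)
{
    double scaled = v * (double)(1 << (wl - iwl));
    int32_t raw = (int32_t)scaled;          /* C cast truncates toward zero */
    if ((double)raw > scaled) {
        raw -= 1;                           /* correct to floor for negatives */
    }
    return raw;
}

/* Hypothetical helper: convert a raw integer back to its real value. */
static double fx_to_double(int32_t raw, int wl, int iwl)
{
    return (double)raw / (double)(1 << (wl - iwl));
}
```

Under these assumptions, the reset value 1.75 of my_reg corresponds to the raw integer 28, i.e. the bit pattern 011100 with the binary point after the second bit.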
Arx supports a subset of the many overflow and
quantization modes of SystemC [6]. The number of
supported modes is limited for practical reasons only;
implementing new modes is almost trivial. For
unsigned data types $MIN = 0$ and
$MAX = (2^{wl} - 1)\,2^{iwl-wl}$. For signed data types
$MIN = -2^{iwl-1}$ and $MAX = (2^{wl-1} - 1)\,2^{iwl-wl}$.
The supported overflow modes, with their SystemC
equivalents in parentheses, are:
• wrap (SC_WRAP) Wrap around, redundant MSB
bits are discarded. This is the default overflow
mode.
• sat (SC_SAT) Saturate, positive overflow yields
MAX and negative overflow yields MIN.
• satsym (SC_SAT_SYM) Saturate symmetrical,
positive overflow yields MAX and negative overflow
yields −MAX.
The supported quantization modes are:
• trunc (SC_TRN) Remove redundant bits. This is
the default quantization mode.
• rnd (SC_RND) Rounds by adding the MSB of the
removed bits to the remaining bits.
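The overflow and quantization modes listed above can be sketched on raw two's-complement integers as follows. This is a minimal illustration, not the code emitted by the Arx tools: the function names are hypothetical, values are held in int64_t, and word lengths between 1 and 62 bits are assumed.

```c
#include <stdint.h>

/* Raw-integer bounds of a signed word of wl bits (1 <= wl <= 62 assumed). */
static int64_t raw_max(int wl) { return ((int64_t)1 << (wl - 1)) - 1; }
static int64_t raw_min(int wl) { return -((int64_t)1 << (wl - 1)); }

/* wrap: keep the wl least significant bits and sign-extend them. */
static int64_t ovf_wrap(int64_t x, int wl)
{
    const uint64_t sign = (uint64_t)1 << (wl - 1);
    const uint64_t low  = (uint64_t)x & ((sign << 1) - 1);
    return (int64_t)(low ^ sign) - (int64_t)sign;
}

/* sat: clamp to [MIN, MAX]. */
static int64_t ovf_sat(int64_t x, int wl)
{
    if (x > raw_max(wl)) return raw_max(wl);
    if (x < raw_min(wl)) return raw_min(wl);
    return x;
}

/* satsym: clamp to [-MAX, MAX]. */
static int64_t ovf_satsym(int64_t x, int wl)
{
    if (x > raw_max(wl))  return raw_max(wl);
    if (x < -raw_max(wl)) return -raw_max(wl);
    return x;
}

/* trunc: drop the n least significant bits (floor division by 2^n).
 * Written with / and % because >> on negative values is
 * implementation-defined in C. */
static int64_t q_trunc(int64_t x, int n)
{
    const int64_t d = (int64_t)1 << n;
    int64_t q = x / d;
    if (x % d < 0) q -= 1;
    return q;
}

/* rnd: add the MSB of the removed bits, i.e. floor((x + 2^(n-1)) / 2^n). */
static int64_t q_rnd(int64_t x, int n)
{
    if (n == 0) return x;
    return q_trunc(x + ((int64_t)1 << (n - 1)), n);
}
```

For a 6-bit word, the raw value 32 wraps to −32 and saturates to 31, while −40 saturates to −32 in sat mode and to −31 in satsym mode.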
Arx is a strongly typed language, i.e. a value of one
data type cannot be assigned to an object of another type
without an explicit conversion. An assignment does not
implicitly convert the type of the right-hand side to the type
of the left-hand side because this would hide a possible loss
of precision. Hence, the following code will produce
a type error because the type of the right-hand side
is signed(7,3) (the addition increases the number of
integer bits) while the type of the left-hand side is
signed(6,2):
variable signed(6,2) a, b, c;
a = b + c;
In these situations, casts must be used to convert
the type of an expression. The code below shows how
to cast the result of the addition to the correct type
with the overflow mode set to saturation:
a = cast signed(6,2,sat,rnd) ( b + c );
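A possible integer-level reading of this cast is sketched below. It assumes that b and c are stored as raw integers with 4 fractional bits, so that their sum is exact in signed(7,3) and the cast to signed(6,2) only has to saturate (no bits are removed, hence rnd has no effect here). The function name is hypothetical.

```c
#include <stdint.h>

/* Hypothetical sketch of: a = cast signed(6,2,sat,rnd) ( b + c );
 * b and c are raw signed(6,2) values in the range -32..31. */
static int32_t add_cast_sat_6_2(int32_t b, int32_t c)
{
    int32_t sum = b + c;          /* signed(7,3): range -64..62, exact */
    if (sum > 31)  return 31;     /* MAX =  1.9375 */
    if (sum < -32) return -32;    /* MIN = -2.0    */
    return sum;
}
```

For example, 1.5 + 1.0 (raw 24 + 16 = 40) exceeds MAX and is saturated to the raw value 31, i.e. 1.9375.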
If one of the operands of a two-operand operation
is a constant, the constant is automatically converted
to the type of the other operand. When this leads to
loss of precision, a warning is generated.
Instead of converting a bit-sequence to another bit-
sequence as is done with casts, it is sometimes neces-
sary to just interpret a bit-sequence differently. The
following code shows how to reinterpret a signed fixed-
point value as an integer:
variable signed(8,1) a;
variable integer b;
a = 0.5;
b = reinterpret integer ( a );
The bit pattern corresponding to the value of 0.5
("01000000") is reinterpreted as the integer value 64.
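The difference between reinterpreting and converting can be sketched in C as follows, assuming that a signed(8,1) value is stored as an 8-bit raw integer with 7 fractional bits. Both function names are hypothetical, and the value conversion (with truncation toward minus infinity) is shown only for comparison; the source does not specify how Arx converts fixed-point values to integer.

```c
#include <stdint.h>

enum { S8_1_FRAC_BITS = 7 };  /* signed(8,1): 8 bits total, 1 integer bit */

/* reinterpret: the stored bit pattern is read as a plain integer. */
static int reinterpret_integer(int8_t raw)
{
    return (int)raw;
}

/* Value conversion for comparison (assumed: truncation toward
 * minus infinity, i.e. floor division by 2^7). */
static int convert_integer(int8_t raw)
{
    int q = raw / (1 << S8_1_FRAC_BITS);
    if (raw % (1 << S8_1_FRAC_BITS) < 0) q -= 1;
    return q;
}
```

With raw = 64 (the value 0.5), reinterpretation yields 64 whereas the value conversion yields 0.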
A design is partitioned into modules. Each design has
one top-level module called top, which contains all other
modules of the design. Modules can be parameterized
with generics. Generics can be types and constant
values. This makes it easy to create, for example, a
single FIR implementation with a parameterized number
of taps and data types. An example of a second-order
IIR design is shown in Listing 1. This is just an
example and not a real design.
Listing 1 Arx code for a second-order IIR filter with
generic types.
// module declaration
module iir
<
  // parameters
  type T_ACC,
  type T_DATA,
  type T_COEFF
>
(
  // interface
  data_in in T_DATA,
  data_out out T_DATA
)
  // declare registers and variables
  variable T_COEFF a1, a2, b0, b1, b2;
  register T_ACC z1, z2;
  register T_DATA z3;
  variable T_DATA y;
  variable T_ACC a1p, a2p, b0p, b1p, b2p;
  a1 = 0.1;
  a2 = 0.2;
  b0 = 0.3;
  b1 = 0.4;
  b2 = 0.5;
  b0p = cast T_ACC( b0 * data_in );
  b1p = cast T_ACC( b1 * data_in );
  b2p = cast T_ACC( b2 * data_in );
  y = cast T_DATA( z2 + b0p );
  a1p = cast T_ACC( y * a1 );
  a2p = cast T_ACC( y * a2 );
  z1 = cast T_ACC( b2p + a2p );
  z2 = cast T_ACC( cast T_ACC( b1p + a1p ) + z1 );
  z3 = y;
  data_out = z3;
end
// top-level module
module top
<
  // parameter with default value
  type T_DATA = signed(16,1,sat,rnd)
>
(
  // top-level interface
  in0 in T_DATA,
  out0 out T_DATA
)
  // module instantiation
  iir iir1
  <
    // pass parameters
    T_ACC = signed(24,4,sat,rnd),
    T_DATA = T_DATA,
    T_COEFF = signed(10,1,sat,rnd)
  >
  (
    // connecting interface
    data_in = in0,
    data_out = out0
  );
end
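To make the register semantics of Listing 1 concrete, the following sketch models the same data flow in floating-point C, ignoring all casts and fixed-point effects. The type iir_state and the functions iir_reset and iir_step are hypothetical names introduced for illustration; one call to iir_step is assumed to correspond to one clock cycle, with all register reads seeing the values from before the clock edge.

```c
/* Hypothetical floating-point model of the IIR module in Listing 1. */
typedef struct {
    double z1, z2, z3;   /* the three registers of the module */
} iir_state;

/* Registers default to a reset value of zero. */
static void iir_reset(iir_state *s)
{
    s->z1 = 0.0;
    s->z2 = 0.0;
    s->z3 = 0.0;
}

/* One clock cycle: returns data_out for the given data_in. */
static double iir_step(iir_state *s, double data_in)
{
    const double a1 = 0.1, a2 = 0.2, b0 = 0.3, b1 = 0.4, b2 = 0.5;
    const iir_state old = *s;            /* register values before the edge */

    const double y = old.z2 + b0 * data_in;

    /* Concurrent register updates: all right-hand sides use old values. */
    s->z1 = b2 * data_in + a2 * y;
    s->z2 = b1 * data_in + a1 * y + old.z1;
    s->z3 = y;

    return old.z3;                       /* data_out reads register z3 */
}
```

Feeding a unit impulse after reset produces 0 in the first cycle (the output register z3 still holds its reset value) and 0.3 in the second cycle.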
III. C Modeling of RTL
A. Fixed-Point Data Types
For fast simulation of fixed-point arithmetic, the
mapping of these data-types on the host machine is
crucial. SystemC offers two different fixed-point class
implementations. The limited-precision implementa-
tion maps the fixed-point data-types on the 53 man-
tissa bits of the native C++ floating point type. This
restricts the size of fixed-point types to 53 bits. Another
disadvantage of this approach is that fixed-point types
with a small number of bits are mapped on 64-bit
floating-point numbers, which require extra memory
bandwidth. The other option offered by SystemC
is the unlimited fixed-point implementation that is
based on concatenated data containers.
Listing 2 A|RT-C code for a “toy” example.
void reg_incr(int d_in, int& d_out)
{
  #pragma OUT d_out
  static int local_reg = 0;
  int local_nxt;
  local_nxt = d_in;
  d_out = local_reg + 1;
  local_reg = local_nxt;
}
We chose to implement the fixed-point mapping in
a similar way as described in [7]. All fixed-point values
are mapped on the native machine word-size, which is
32-bit or 64-bit for most general purpose processors.
If a fixed-point type exceeds this word-size, the type
is mapped on a number of concatenated words.
SystemC uses operator overloading to implement
arithmetic operations on fixed-point data types. Be-
cause our tool generates code, it can generate opti-
mized code for each individual operation in the Arx
code. This method does not incur the run-time over-
head of operator overloading. Currently, we do not
use global optimization to reduce the number of shift
operations as is proposed in [7].
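As an illustration of such per-operation code, the sketch below shows how the statement b0p = cast T_ACC( b0 * data_in ) from Listing 1 might be expressed on 32-bit integers, using the types of the top-level instantiation: T_COEFF = signed(10,1) (9 fractional bits), T_DATA = signed(16,1) (15 fractional bits) and T_ACC = signed(24,4) (20 fractional bits). This is an assumption about what such code could look like, not the actual output of the Arx generator; the function name is hypothetical.

```c
#include <stdint.h>

/* Hypothetical sketch: multiply a signed(10,1) coefficient by a
 * signed(16,1) sample and cast the product to signed(24,4,sat,rnd).
 * All arguments and the result are raw (scaled) integers. */
static int32_t mul_coeff_data_to_acc(int32_t coeff, int32_t data)
{
    /* Exact product: 9 + 15 = 24 fractional bits, at most 26 bits wide,
     * so it fits in a 32-bit int. */
    int32_t prod = coeff * data;

    /* rnd: 24 - 20 = 4 bits are removed; add half an LSB of the target. */
    int32_t rounded = prod + (1 << 3);

    /* Floor division by 2^4 (avoids >> on negative values). */
    int32_t q = rounded / 16;
    if (rounded % 16 < 0) q -= 1;

    /* sat: the product has 2 integer bits and the target has 4,
     * so overflow cannot occur and no clamping is needed here. */
    return q;
}
```

For example, 0.5 × 0.5 (raw 256 × 16384) yields the raw value 262144, which is 0.25 with 20 fractional bits.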
B. Register Modeling
There exist many ways to model the registers of an
RTL design in C. If a hardware module corresponds to
a C function, static variables within the function
preserve their values from one function call to the next
and can therefore serve as registers. This style is used by
the commercial A|RT Builder tool, which has reached
end-of-life status [8]. Listing 2 shows a toy design
that consists of a register followed by an incrementer,
written in the A|RT style. The intended hardware
block diagram is depicted in Figure 2.
If SystemC is used to model hardware at the
register-transfer level, one has many choices for the
coding style. One could e.g. use a static or syn-
chronous data-flow style [9] where modules interact
through FIFOs (first-in first-out buffers) [2] and each
invocation of the model (read inputs, execute process,
write outputs) corresponds to a single register update.
One could also opt for a synthesizable style that is
closer to the hardware and declares the clock and reset
signals explicitly. A possible implementation of
the “toy” example coded in this style is given in Listing 3.
Fig. 2. Block diagram of the “toy” example. (Diagram labels: clock, reset, data_in, data_out, +1.)
Listing 3 SystemC code for the “toy” example.
SC_MODULE(toy)
{
  //Declarations of input/output ports
  sc_in<bool> CLOCK;
  sc_in<bool> RESET;
  sc_in<int> d_in;
  sc_out<int> d_out;
  //Declaration of a register
  sc_signal<int> local_reg;
  SC_CTOR(toy)
  {
    SC_METHOD(reg_update);
    sensitive << CLOCK.pos();
    sensitive << RESET.pos();
    SC_METHOD(increment);
    sensitive << local_reg;
  }
  void reg_update();
  void increment();
};
void toy::reg_update()
{
  if(RESET.read())
    local_reg = 0;
  else
  {
    local_reg = d_in.read();
  }
}
void toy::increment()
{
  d_out.write(local_reg+1);
}
A definite advantage of SystemC is that it provides
a simulation engine and the modeling semantics nec-
essary for flexible system-level design. From a perfor-
mance point of view, however, one pays a price for
the simulator overhead. Data on this overhead can be
found in Section V. Because of this overhead the Arx
code generation uses a style more like A|RT.
In the C code generated by the Arx tools, registers
are modeled as global variables. A separate function
is generated for the handling of a reset. This function
sets all global register variables to their respective re-
set values. At the start of the simulation function
a copy is made of the register values. We call these
copies shadow values. When a value is read from a
register in the original Arx code, code for a read from
the shadow value is generated. A write to a register
is translated to a write to the actual register value.
Listing 4 shows the code for the example in the style
of the Arx code generator.
Listing 4 Arx generated C code for the “toy” example.
int reg;
void reset(void)
{
  reg = 0;
}
void sim(const int d_in, int& d_out)
{
  int shadow_reg = reg;
  reg = d_in;
  d_out = shadow_reg + 1;
}
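The benefit of shadow values becomes visible as soon as one register feeds another. The sketch below is a hypothetical two-stage delay line written in the same style (it is not output of the Arx tools, and it passes the output through a pointer so that it compiles as plain C). Because every read uses a shadow value, the order of the two register assignments does not affect the result.

```c
/* Hypothetical two-register delay line: d_in -> r1 -> r2 -> d_out. */
static int r1;
static int r2;

/* Reset handling: set all registers to their reset values. */
void reset(void)
{
    r1 = 0;
    r2 = 0;
}

/* One call corresponds to one clock cycle. */
void sim(const int d_in, int *d_out)
{
    /* Shadow values: the register contents before the clock edge. */
    const int shadow_r1 = r1;
    const int shadow_r2 = r2;

    /* Writes go to the actual registers; reads use the shadow values.
     * Writing r1 first would corrupt r2 without the shadow copy. */
    r1 = d_in;
    r2 = shadow_r1;

    *d_out = shadow_r2;
}
```

After a reset, an input value appears at the output two calls later: with the input sequence 5, 6, 7 the outputs are 0, 0, 5.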
IV. Tool Implementation
A C-code generating backend targeting fast simula-
tion has been fully implemented and a VHDL gener-
ating backend for synthesis is in development.
The lexer and parser for Arx have been imple-
mented with the parser generator ANTLR (ANother
Tool for Language Recognition) [10]. Given a gram-
mar description, ANTLR is able to generate code for
lexers, parsers and tree-parsers in Java, C#, C++ or
Python. We chose Python [11] because it is well suited
for rapid development.
The parser generated by ANTLR creates an ab-
stract syntax tree (AST) which can then be parsed
by tree-parsers creating new ASTs. Actions can be
embedded in the grammars, i.e. specific code is exe-
cuted when a rule matches. These actions can modify
the returned AST. The translation from Arx to C and
VHDL is implemented by several stages of tree trans-
formations. The tree-parsers for the final stages emit
the target code.
The C code generator flattens the module hierarchy.
This results in a single C function that needs to be
called once per clock-cycle. Additional functions are
generated for the allocation and freeing of memory for
persistent storage. These functions should be called at
the beginning and end of the simulation respectively.
For the handling of the reset, a separate function is
generated as described in Section III-B.
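The calling sequence described above (allocate, reset, one call per clock cycle, free) can be sketched as a small test harness. All names below (sim_state, sim_alloc, sim_reset, sim_cycle, sim_free, run_cycles) are hypothetical, and for compactness the register of the “toy” example is kept in a heap-allocated structure rather than in a global variable; the actual interface of the generated code is not specified in this paper.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical persistent storage of the flattened design. */
typedef struct {
    int reg;
} sim_state;

static sim_state *sim_alloc(void)       { return malloc(sizeof(sim_state)); }
static void       sim_free(sim_state *s) { free(s); }
static void       sim_reset(sim_state *s) { s->reg = 0; }

/* Single simulation function, called once per clock cycle. */
static void sim_cycle(sim_state *s, int d_in, int *d_out)
{
    const int shadow_reg = s->reg;   /* value before the clock edge */
    s->reg = d_in;
    *d_out = shadow_reg + 1;
}

/* Test harness: runs n clock cycles; returns 0 on success, -1 on
 * allocation failure. */
static int run_cycles(const int *in, int *out, size_t n)
{
    sim_state *s = sim_alloc();      /* beginning of the simulation */
    if (s == NULL) return -1;

    sim_reset(s);
    for (size_t i = 0; i < n; ++i) {
        sim_cycle(s, in[i], &out[i]);
    }

    sim_free(s);                     /* end of the simulation */
    return 0;
}
```

With the inputs 1, 2, 3 this harness produces the outputs 1, 2, 3: the first output is the reset value plus one, and each later output is the previous input plus one.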
V. Simulations
As a proof of concept, two typical signal processing
kernels have been implemented:
• FIR filter: a 64-tap transposed-form FIR filter.
• IIR filter: a second-order IIR filter.
Implementations of these kernels written in Arx,
SystemC and A|RT style C have been benchmarked in
order to quantify the effects of the different fixed-point
implementations and register modeling approaches.
Version 2.1 of SystemC was used. The implementa-
tions are listed below:
• Floating-point: a floating-point reference implementation
of the algorithms using the A|RT-C
style of register modeling.
• A|RT C sc_fixed: an A|RT C implementation that uses
the SystemC sc_fixed data type for fixed-point
arithmetic.
• A|RT C integer: an implementation written in
A|RT style C in which the sc_fixed data types
have been replaced by the int data type by an
in-house conversion tool.
• SystemC sc_fixed: this implementation is written
in SystemC and uses the sc_fixed fixed-point data
type.
• SystemC integer: a converted version of the
previous benchmark using the int data type.
• Arx: an implementation of the algorithm written in
Arx.
The conversion of the sc_fixed benchmarks into
int equivalents was performed by an adapted version
of the fixed2int tool [12]. The original version
replaced sc_fixed by sc_int in order to obtain
synthesizable code. The adapted version aims at simulation
speed-up.
All of the implementations are bit-true and clock-cycle-true
except for the floating-point one, which is
only clock-cycle-true. The SystemC models have been
coded using a variant of the style presented in Listing 3.
In order to reduce the simulation overhead,
only the module ports are declared as signals (sc_in,
sc_out). The internal signals are not declared as
sc_signal< <type> > but directly as <type>.
The fixed-point FIR and IIR implementations have
been tested with two sets of overflow and quantiza-
tion modes. The kernels FIR 1 and IIR 1 combine
wrapping (wrap) and truncation (trunc). Saturation
(sat) and rounding (rnd) are used for kernels FIR 2 and
IIR 2. The simulation times for the benchmarks are
summarized in Table II. The results have been scaled
relative to the simulation time of the floating-point
code.
The FIR benchmark is more computationally intensive
than the IIR benchmark (64 taps vs. 2 internal
delays). The effect of replacing the fixed-point data
types by integer-equivalent code is therefore clearer
for the FIR. A general conclusion is that directly simulating
with the fixed-point data types leads to high
overhead (as high as two orders of magnitude). These results
are similar to those reported in [7]. As expected,
the “direct” C implementations in the A|RT-C style
or the Arx style are equally efficient.
Table II also shows that the SystemC simulator
has a high overhead as well. This is more visible in the
IIR benchmark as its I/O signals, which trigger the
simulator events, change relatively more often than in
the FIR benchmark.
The conclusions have been confirmed by profiling
the benchmarks. The profiling results for the IIR
2 benchmark are shown in Table III. The table
shows the percentage of the CPU time spent on fixed-point
calculations (SC fixed), SystemC simulator functions
(SC sim.) and all other computations (Rest).
TABLE II
Simulation speed results (simulation time relative to the floating-point code)
kernel  floating-point  SystemC sc_fixed  SystemC integer  A|RT sc_fixed  A|RT integer  Arx
FIR 1   1.00            73.18             1.87             122.48         0.83          0.82
FIR 2   1.00            72.53             2.75             127.83         1.61          1.38
IIR 1   1.00            25.30             4.37             41.18          0.44          0.49
IIR 2   1.00            25.52             4.29             41.66          0.56          0.51
TABLE III
Simulation-time distribution (in %) for the IIR 2 benchmark
category  SystemC sc_fixed  SystemC integer  A|RT sc_fixed
SC fixed  76.13             0.00             90.11
SC sim.   16.10             89.30            0.00
Rest      7.77              10.70            9.90
total     100.00            100.00           100.00
VI. Conclusions
In this paper we have briefly introduced the design
language Arx and our tools and workflow for simula-
tion and synthesis. A single design description written
in Arx is processed by the tools and code is generated
depending on the selected backend. The C generat-
ing backend optimized for simulation speed has been
presented in detail. As a proof of concept we have im-
plemented two typical signal processing algorithms,
and compared the simulation performance with im-
plementations written in SystemC and A|RT style C.
The benchmarks show that the right choices have been
made for the simulator generated by the Arx tools, as
far as simulation speed is concerned. The next devel-
opment step for Arx is the completion of the VHDL
backend designed for synthesis.
Acknowledgment
Sabih Gerez would like to thank Søren Rievers of
SiTel Semiconductor for his contributions to the dis-
cussions on C-based hardware design.
References
[1] A. Rushton. VHDL for Logic Synthesis, Second Edition.
John Wiley and Sons, 1998.
[2] T. Groetker, G. Martin, S. Liao, and S. Swan. System Design
with SystemC. Kluwer Academic Publishers, 2002.
[3] A. Jantsch. Modeling Embedded Systems and SoCs, Con-
currency and Time in Models of Computation. Morgan
Kaufmann, San Francisco, 2004.
[4] J. Smit, B.J.F. van Beijnum, S.H. Gerez, R.J. Mulder, and
L. Spaanenburg. The MoDL hardware design system. In
M. Barbacci and C.J. Koomen, editors, CHDL 87, Com-
puter Hardware Description Languages and Their Applica-
tions, Amsterdam, 1987.
[5] D. Black and J. Donovan. SystemC: From the Ground Up.
Kluwer Academic Publishers, Boston, 2004.
[6] SystemC Version 2.0 User’s Guide, Update for SystemC
2.0.1. http://www.systemc.org, 2002.
[7] Holger Keding, Martin Coors, Olaf Lüthje, and Heinrich
Meyr. Fast bit-true simulation. In DAC ’01: Proceedings
of the 38th conference on Design automation, pages 708–
713, New York, NY, USA, 2001. ACM Press.
[8] ARM. http://www.arm.com/.
[9] E.A. Lee and D.G. Messerschmitt. Synchronous data
flow. Proceedings of the IEEE, 75(9):1235–1245, September
1987.
[10] ANTLR. http://www.antlr.org.
[11] Python. http://www.python.org.
[12] S.H. Gerez and J. de Zoeten. A conversion tool for the syn-
thesis of SystemC fixed-point data types by the CoCentric
SystemC Compiler. In Synopsys User Group Conference,
SNUG Europe, Munich, Germany, May 2004.
