Domain specific tools and methods for application in security processor design by Schaumont, Patrick & Verbauwhede, Ingrid
Domain Specific Tools and Methods for Application in
Security Processor Design
PATRICK SCHAUMONT schaum@ee.ucla.edu
UCLA Department of Electrical Engineering, Los Angeles, CA
INGRID VERBAUWHEDE ingrid@ee.ucla.edu
UCLA Department of Electrical Engineering, Los Angeles, CA
Abstract. Security processors are used to implement cryptographic algorithms with high throughput and/
or low energy consumption constraints. The design of these processors is a balancing act between
flexibility and energy consumption. The target is to create a processor with just enough programmability
to cover a set of algorithms—an application domain. This paper proposes GEZEL, a design environment
consisting of a design language and an implementation methodology that can be used for such domain
specific processors. We use the security domain as driver, and discuss the impact of the domain on the
target architecture. We also present a methodology to create, refine and verify a security processor.
Keywords: Cryptography, domain specific, processor.
1. Domain Specific Processing
1.1. Security Processing Using a Domain Specific Processor
Security processors are used in information infrastructure to improve the implementation
of security algorithms. Typical applications include creation of public keys, authentication,
bulk data encryption, pseudorandom number generation and digital signatures.
The key in domain specific processing is to trade flexibility for power consumption
and/or speed. An example of what can be gained from this is given in Table 1. The table
shows power consumption and throughput for three different implementations of the
same AES algorithm. AES is the new encryption standard [24] that was selected by
NIST in November 2001. The implementations shown in the table use either a domain
specific architecture created with standard cells [19], an FPGA or a Pentium-III
processor. The table illustrates that the overall figure of merit (throughput per watt) for a
domain specific architecture is more than three orders of magnitude better than that of a
general purpose architecture. While general-purpose processing architectures and fine-grain
reconfigurable architectures are able to keep up in performance, they lose by a wide margin
with respect to power consumption.
In order to achieve this result, the programmability of the AES processor has been reduced
to the strict minimum. It contains a wide datapath that implements an entire AES
iteration as a single instruction. The instruction set of the entire processor has been reduced
to 12 instructions, including load key, load data, set key length, set data length,
and encrypt. This instruction set covers the basic Rijndael encryption algorithm.
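The figure of merit in Table 1 is simply throughput divided by power. A minimal sketch of that arithmetic follows; the numbers are copied from Table 1, while the dictionary layout and the function name `figure_of_merit` are illustrative assumptions.

```python
# Power (W) and throughput (Gb/s) as listed in Table 1.
# The data layout and names below are illustrative, not part of the paper.
platforms = {
    "Domain specific processor": {"power_w": 0.056, "throughput_gbps": 1.62},
    "FPGA": {"power_w": 0.490, "throughput_gbps": 1.323},
    "Pentium-III, 1.13 GHz": {"power_w": 41.4, "throughput_gbps": 0.648},
}


def figure_of_merit(entry):
    """Throughput per unit of power, in Gb/s/W."""
    return entry["throughput_gbps"] / entry["power_w"]


fom = {name: figure_of_merit(entry) for name, entry in platforms.items()}
for name, value in fom.items():
    print(f"{name}: {value:.3f} Gb/s/W")

# Ratio between the domain specific processor and the general purpose processor.
gain = fom["Domain specific processor"] / fom["Pentium-III, 1.13 GHz"]
print(f"gain over Pentium-III: {gain:.0f}x")
```

The ratio between the first and the last row is what the text refers to as more than three orders of magnitude.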
Design Automation for Embedded Systems, 7, 365-383, 2002.
© 2002 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.
1.2. The Design Problem
The dramatic improvements that can be obtained with domain specialization require careful
reduction of the programmability and/or reconfigurability of a general purpose architecture
into a domain specific architecture. As shown in Figure 1, this is a dual-ended
problem. Given a fine-grain reconfigurable architecture like an FPGA, we need to find
the micro-architecture specializations that span an application domain. These can be custom
operators, routing resources, storage architectures or combinations thereof. Alternatively,
we might start from a general purpose processor and introduce custom instruction-set
extensions that are well suited for the application domain. Thus, the design space for
domain specialization is vast and difficult to conquer. In fact, there are numerous commercial
offerings and academic platforms available that elaborate the idea of reducing programmability
in favor of domain specialization [3], [25], [11].
Table 1. AES Implementation on Three Different Platforms
Platform                         Power (W)   Throughput (Gb/s)   Figure of Merit (Gb/s/W)
Domain specific processor (1)    0.056       1.62                28.9
FPGA (2)                         0.490       1.323               2.7
Pentium-III, 1.13 GHz (3, 4)     41.4        0.648               0.015
(1) 0.18 µm CMOS standard cell design of Rijndael [19]
(2) Amphion CS5230 on Virtex-II, Power Estimator XAPP152
(3) http://www.cs.tut.fi/helger/rijndael.html
(4) Power based on Intel Datasheet (Vcc = 1.8 V, Icc = 23 A)
Figure 1. Domain specialization reduces programmability and/or reconfigurability.
1.3. Outline
In this paper we present a design methodology and a language, called GEZEL, for the
creation of security processors. We will first give an overview of electronic security and
the related design domain. We introduce the concept of reconfiguration hierarchy as a design
space in which domain-specific programming is done. In Section 3, we list the architecture
properties of typical security processors and discuss an example processor, used in
elliptic curve processing. In Section 4, we combine the ideas of Sections 2 and 3 into a design
method and a design flow. To express the domain-specific parts (the security processors),
we introduce the GEZEL design data model, based on an extension of the Finite
State Machine and Datapath (FSMD) model [6]. The FSMD model expresses behavior
as a combination of a finite state machine that controls a datapath. In addition, a language
to express designs with this data model is presented. In Section 5 we compare and contrast
our approach to related efforts, and then indicate some pending challenges in the conclusions
section.
2. Domain-Specific Processors for the Security Domain
2.1. The Security Pyramid
Figure 2 presents an engineer's view on the application domain in the form of a security
pyramid. The pyramid form represents the design space at multiple levels of abstraction
[16]. The most abstract representation of a cryptographic application is the security protocol
architecture, which defines what steps make up a secure communication. Examples are
IPSEC, SSL, WEP, etc. This covers aspects such as key management and distribution, as
well as the placement of cipher blocks within the information flow of an application. At
this level, an encryption processor looks like a single box that takes care of the implementation
of one or more steps in the overall security protocol. A security protocol itself is
usually described in plain text format, see for example [14].
The next level represents the security algorithms. An example of an encryption algorithm
is Rijndael, which was used to define the recently selected AES standard [24]. A security
algorithm is specified by a signal flow graph to express the data operations, in
combination with some overall control sequencing such as feedback modes of operation.
The operations used in these algorithms are derived from number theory and make up the
next level. Besides the operations, the number representations are also specific. For example,
in the normal basis of the Galois field GF(2^p), elements are represented as binary
coefficients of a polynomial. Below the level of number theory we run into levels that deal
with implementation issues. Contemporary embedded platforms express behavior in
terms of cycle-accurate and/or instruction-accurate code. Finally, at the bottom level we
express all aspects of a security algorithm in terms of target platform technology. It can be
seen that modeling at lower abstraction levels is more generic and thus can potentially be
shared with other application domain pyramids. For example, Reed-Solomon block
coding, used in channel coding, uses Galois field operations and thus can share all levels
of the security pyramid up to the number theory.
2.2. Programming in the Security Pyramid
Since reducing programmability is a key aspect in doing domain specialization, it is worthwhile
to define programmability and reconfigurability in terms of the security pyramid.
During the design of a domain specific processor, we find out how much programmability
and reconfigurability is needed at each level of the design abstraction. For example,
full AES encryption can be done with 128-bit, 192-bit or else 256-bit blocks, and therefore
full AES has at least three configurations at the level of a security algorithm. AES-128 on
the other hand, restricts the encryption block-length to 128 bits. Clearly an AES-128 processor
can be implemented more energy-efficiently than a full AES processor. This is so because
we can propagate these invariants down to lower abstraction levels.
When we express a domain specific processor in terms of the security domain pyramid,
we see that it spans part of the domain and that it covers a smaller pyramid within the
overall one. This is illustrated in Figure 3, which shows the design abstraction levels of an
AES processor. At the highest level of abstraction, the AES is a single node in a security
protocol architecture that reads in blocks of plaintext data and writes out blocks of ciphertext.
At this level, configurability is restricted to a few simple parameters, for instance
choice of the encryption key, or data block length. Going one level down in abstraction
level, we can see that the AES uses a GF(2^8) inversion and several different matrix
transformations. Because algorithm internals like finite field representation are fixed by
the AES standard, there is actually no configuration to do at this level. At the next level
however, where we decide on the implementation platform, the question of programmability
is back. The single function node at security algorithm level is decomposed, either spatially
or temporally. We could for instance opt for a micro-programming approach, in
which case the micro-program defines the configurability of this level. Finally, the
decomposed form of the AES is mapped onto an implementation platform that can be
reprogrammable as well, for example a bitstream-programmed FPGA.
Figure 2. Security pyramid
2.3. Reconfiguration Hierarchy
This AES processor example leads us to observe that a domain-specific processor covers
several different, hierarchical levels of operating abstraction, where potential configurability
is decreasing from bottom to top. Domain specialization is obtained by determining, at
each level of abstraction, how to reduce the programmability to the strict minimum. In
order to reason about this problem, we introduce a design space of programmability and
call this a reconfiguration hierarchy [27]. The reconfiguration hierarchy design space has
three independent axes.
• A vertical axis that expresses the level of processing abstraction.
• A horizontal axis that expresses the reconfigurable feature diversity.
• A time axis that expresses the timing relationship of configuration to processing.
The vertical axis is related to the level of computation abstraction. At the lowest level we
naturally recognize logic primitives (gates), simple storage (registers) and routing. At
higher levels, the microarchitecture, instruction-set architecture and process architecture
(or systems architecture) represent additional layers of computation abstraction. Reconfiguration
is applicable to each new abstraction layer that is introduced. While the vertical
axis describes a hierarchy, the horizontal axis describes the nature of reconfigured elements.
Each level of the hierarchy is made up of a combination of communication, storage,
computation and control. Reconfiguration can affect each of those individually. In Table 2,
an enumeration is given of different such design elements (horizontal) characterized at
different design levels. The terms coarse-grain and fine-grain reconfigurability are usually
associated with the variation of horizontal features at the architectural level. The binding
time expresses when configuration data is sent to the processing part. Each level of the
hierarchy can be bound individually. We distinguish implementation-time binding and
design-time binding. With implementation-time binding, configuration is postponed until
actual execution of the processing part is required. With design-time binding, configuration
is done at the moment that the processing part is conceived. This is equivalent to hard-coding.
These terms are preferred over the more traditional run-time and compile-time
since the latter ones are not unique for hierarchical systems. In order to have a physical
implementation, the lowest processing level of a system is always design-time bound. The
top level of programmable systems is always implementation-time bound. In between,
there is a smooth transition called the binding time continuum [3].
Figure 3. AES processor within the security pyramid.
3. Security Processors
3.1. Properties of Security Processors
Security processors have specific architectural features which are enumerated here.
• The large wordlengths found in typical finite fields (1024 bit for RSA) require wide,
bit-sliced data-paths. Bit-slicing helps to maintain hardware synthesis quality.
• Multitiered control structures naturally support the hierarchy of behaviors that is
present in security algorithms and protocols [8], [15]. They make it possible to reflect the
security pyramid in the architecture and enable precise control over which part of a
processor is programmable and which is not.
• Feedback is a fundamental mode of operation for some cipher operations. Pipelining
is not an effective option to obtain performance improvement in those
cases [31].
• Number representation is non-standard and can even take on several different
styles within the same cryptographic processor [4]. This is because the cost of operators
varies widely with the particular number representation.
• Specialized arithmetic operations such as Modular Arithmetic and Galois Field
Arithmetic require specialized operators [22].
• For block-mode ciphers, the input-output structure is block-oriented. In addition,
data blocks are typically larger than the host machine word-size.
• Integration requires special attention if security is not to be compromised [5]. This
includes the use of a well-defined and well-behaved data and control interface
(API), as well as maintaining strict isolation of internal processing from eavesdroppers
and less friendly attackers.
Table 2. The Reconfiguration Hierarchy
Level                Communication                 Storage                        Processing
System               Interconnection Network       Buffer Size                    Number and Type of asynchronous processes and tasks
Instruction Set      Address/Data Buswidth         Register Set, Memories         Custom Instr, Interrupt Levels
Micro-Architecture   Bus Topology                  Register File Implementation   EXU Type, Interpreter Levels
Circuit              Mux and Switch Interconnect   RAM Org, Latch Transparency    LUT
3.2. An Elliptic Curve Encryption Processor
Figure 4 shows the architecture of an elliptic curve encryption processor [15] that calculates
keys for the current IEEE public-key encryption standard [13]. We briefly explain the
principles of public key cryptography and next discuss the architecture in more detail.
Elliptic-curve public key cryptography is based on operations on points of a specific
curve in a finite field, the so-called underlying field. Point Multiplication is the
fundamental operation for the key agreement protocol. The Diffie-Hellman key
agreement protocol works as follows [4]: given a point P on the curve, Alice will compute
a·P from her secret integer a, and Bob will compute b·P from his secret integer b. Alice
receives b·P and computes a·b·P. Bob receives a·P and computes a·b·P. They now share a
secret key a·b·P. Due to the properties of the elliptic curve group, it is however very hard
to calculate a·b·P starting from the knowledge of a·P and b·P alone. An eavesdropper, who
has access to a·P and b·P, can therefore not obtain the common secret key, or at least has
to solve the very hard mathematical problem of the discrete logarithm in the elliptic curve
group.
This algorithm can be implemented across different abstraction levels. At the highest
level, the point multiplication k·P is executed, where k is an integer and P is a point on
the elliptic curve. The point multiplication can be decomposed into doublings, additions
and subtractions of points on the elliptic curve. These primitive operations on points of
the elliptic curve can again be decomposed into operations on elements of the underlying
field. These operations are addition, multiplication and squaring of elements of the underlying
field.
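The key agreement and the decomposition of point multiplication into doublings and additions can be illustrated with a small software model. The sketch below is only an illustration under stated assumptions: it uses a toy curve over the prime field F_17 rather than the GF(2^n) fields used by the processor, it uses plain double-and-add (no point subtractions), and all names (`point_add`, `point_mul`, the secret integers) are hypothetical. It needs Python 3.8 or later for the modular inverse via `pow`.

```python
# Toy curve y^2 = x^3 + 2x + 2 over the prime field F_17 (illustrative assumption;
# the processor described in the text works over GF(2^n) instead).
P_MOD, A, B = 17, 2, 2
G = (5, 1)  # a point on the toy curve


def on_curve(pt):
    """Check the curve equation; None stands for the point at infinity."""
    if pt is None:
        return True
    x, y = pt
    return (y * y - (x * x * x + A * x + B)) % P_MOD == 0


def point_add(p, q):
    """Add two points; doubling is the special case p == q."""
    if p is None:
        return q
    if q is None:
        return p
    x1, y1 = p
    x2, y2 = q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None  # p + (-p) is the point at infinity
    if p == q:
        lam = (3 * x1 * x1 + A) * pow(2 * y1 % P_MOD, -1, P_MOD)
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD)
    lam %= P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    y3 = (lam * (x1 - x3) - y1) % P_MOD
    return (x3, y3)


def point_mul(k, p):
    """Compute k·p by scanning the bits of k, most significant bit first."""
    result = None
    for bit in bin(k)[2:]:
        result = point_add(result, result)  # doubling
        if bit == "1":
            result = point_add(result, p)   # addition
    return result


a_secret, b_secret = 3, 7                   # hypothetical private integers
a_pub = point_mul(a_secret, G)              # Alice publishes a·P
b_pub = point_mul(b_secret, G)              # Bob publishes b·P
shared_alice = point_mul(a_secret, b_pub)   # a·(b·P)
shared_bob = point_mul(b_secret, a_pub)     # b·(a·P)
print(shared_alice, shared_bob)
```

Both parties arrive at the same point because scalar multiplication commutes: a·(b·P) = b·(a·P). Each call to `point_add` in turn expands into additions, multiplications and inversions in the underlying field, which is the layering the text describes.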
The processor that implements this algorithm is an Elliptic Curve processor (ECC),
and is shown in Figure 4. This architecture has a layered structure, with the layers corresponding
to the operations described above.
• A Galois Field datapath implements addition, squaring and multiplication of elements
of an n-bit Galois field in normal basis (the underlying finite field).
• The FSM DoubleAddSub implements the basic elliptic curve operations that are
needed for a point multiplication. DoubleAddSub will translate those operations
into Galois Field additions, squarings and multiplications. The instruction set of
DoubleAddSub is shown in Table 3.
• The FSM PointMult implements the top-level sequencing of the point multiplication,
and also presents a user API in the form of an instruction set as shown in
Table 4 (the instruction set of FSM DoubleAddSub is given in Table 3).
• The FSM Input and Output implements data-IO, and adapts the host system buswidth
to the internal ECC processor buswidth.
Both the control interface (at FSM PointMult) and the data interface (at FSM Input and
Output) are supported by two-way handshakes. This allows easy integration of the ECC
processor into a system, and even allows it to run at an unrelated clock.
The ECC architecture has several different parameters that need to be programmed
using the instructions of Table 4 before point multiplications can be performed. First, the
elliptic curve must be uniquely defined. An elliptic curve over GF(2^n) is a curve in two
variables x and y that has two parameters a and b [13].
Parameters a and b must be chosen (SETA, SETB). The points on this curve are elements
of a finite field GF(2^n). This field is defined by an irreducible polynomial p (not to
be confused with the point P on the elliptic curve) that has to be selected as well (SETP).
During operation, one presents an initial point (X, Y), sets the integer point multiplier n and
starts the point multiplication (SETX, SETY, SETN, PMLT). When this last instruction
ends, one can read out the resulting point (X, Y, Z) in projective coordinates (GETX,
GETY, GETZ). Depending on the security protocol architecture, other elements can be
required to vary. For example, increasing the finite field size reduces the encryption speed
but at the same time also increases the cipher strength. Finite field size can be made reprogrammable
by varying the number of active bitslices in the data-path.
Figure 4. Elliptic curve encryption processor.
Table 3. FSM Instruction Set for DoubleAddSub
Instr    Opcode    Description
SETP     0001      Set Irreducible Polynomial
SETA     0010      Set EC parameter a
SETB     0011      Set EC parameter b
SETX     0100      Set X into Xreg & T1
SETY     0101      Set Y into Yreg & T2
READX    0110      Read out X
READY    0111      Read out Y
READZ    1000      Read out Z
DBL      1001      Eval P = 2P
ADD      1010      Eval P = P + (X, Y)
SUB      1011      Eval P = P - (X, Y)
INV      1100      Eval P = -P
         Default   Nop
Table 4. FSM Instruction Set for PointMult
Instr    Opcode    Description
SETP     0001      Set Irreducible Polynomial
SETA     0010      Set EC parameter a
SETB     0011      Set EC parameter b
SETN     0100      Set Point Multiplier n
SET3N    0101      Set Point Multiplier 3n
SETX     0110      Set Initial Point X
SETY     0111      Set Initial Point Y
PMLT     1000      Point Multiplication
PMLN     1001      Point Multiplication and Negate
GETX     1010      Read out Result X
GETY     1011      Read out Result Y
GETZ     1100      Read out Result Z
         Default   Nop
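The programming sequence described in the text can be written down as a short host-side instruction list. In the sketch below only the mnemonics and opcodes come from Table 4; the dictionary, the function name and the idea of emitting (mnemonic, opcode) pairs are hypothetical illustrations, and operand transfer and handshaking are left out.

```python
# Opcodes copied from Table 4 (PointMult instruction set).
# Everything else is a hypothetical host-side sketch, not the processor's driver.
POINTMULT_OPCODES = {
    "SETP": 0b0001, "SETA": 0b0010, "SETB": 0b0011, "SETN": 0b0100,
    "SET3N": 0b0101, "SETX": 0b0110, "SETY": 0b0111, "PMLT": 0b1000,
    "PMLN": 0b1001, "GETX": 0b1010, "GETY": 0b1011, "GETZ": 0b1100,
}


def point_multiplication_program():
    """Instruction order as described in the text for one point multiplication."""
    order = [
        "SETA", "SETB", "SETP",          # define the curve and the field
        "SETX", "SETY", "SETN", "PMLT",  # load the point and multiplier, then run
        "GETX", "GETY", "GETZ",          # read the result in projective coordinates
    ]
    return [(name, POINTMULT_OPCODES[name]) for name in order]


program = point_multiplication_program()
for name, opcode in program:
    print(f"{name:5s} {opcode:04b}")
```

A real driver would also send the operand words for each SET instruction and wait on the two-way handshake between instructions; those details are not specified in this section.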
4. GEZEL Design Environment
We now propose an approach for design automation support of domain specific processors.
This consists of a design flow, combined with a design language.
4.1. GEZEL Design Flow
The GEZEL design flow for a domain-specific processor is shown in Figure 5. We use a
view that fits the initial (functional) partitioning of a system, and treat a system as a
composition of different design domains. When we now concentrate on the design process
within a single design domain, we make a distinction between the design of the domain-specific
part itself and the integration of this part into the system. Therefore, the design
flow in Figure 5 contains a domain-specific side and an integration side. At the domain-specific
side, our security processor is designed step-by-step, starting from a high level
algorithmic specification and gradually evolving into a detailed architecture description at
the cycle-true level. At the integration side, we deal with the problem of how our processor
talks to the rest of the system.
A domain-specific design presents a set of views to the integration environment. A
view is the ensemble of interactions of a domain-specific part with the integration en-
vironment at a particular abstraction level. In Figure 5, three views have been defined as
an example.
Figure 5. Domain specific design flow.
• A transaction view is a high level co-simulation interface. This could be a transaction
based co-simulation interface [28]. For example, when designing a DES encryption
processor, a transaction would send a block of data (a transaction) from
the system to the DES processor, which would reply with an encrypted version of
that block (a second transaction).
• An implementation view is a target-specific co-simulation interface. This could be
a clock-cycle true interface description (such as the description of an AMBA bus
interface).
• Finally, a synthesis view combines the design result of one domain with the rest of
the system. For example, when designing an acceleration unit for an embedded system
the synthesis view would be the combination of an HDL description of the
processor, together with a device driver and possibly other application software
that allows integration of the processor.
The domain specific design itself proceeds through refinement and eventually transla-
tion to the design target. Refinement varies with the application domain. In the crypto-
graphic domain for example, refinement includes design of security protocol
architectures, selection of encryption algorithms, selection of finite field sizes and arith-
metic operators, and more.
4.2. GEZEL Implementation Methodology
A rigorous implementation of the concept in Figure 5 is shown in Figure 6. We describe a
domain specific processor using a language that is tuned towards the application domain.
Such a domain specific language can be lean since it is restricted to one application
domain.
This language is parsed and converted into an object structure in a general purpose
programming language. Currently we are using C++ for this. This object structure makes use
of modeling objects provided by a predefined library. The library provides services such as
simulation and code generation, and makes those services available through an application
program interface (API). This interface is used to construct a system simulation, and
eventually to convert the domain-specific description into synthesizable code.
GEZEL clearly distinguishes between a domain specific part, written in a domain specific
language, and a general purpose part in C++. As such, it is a meet-in-the-middle
approach between general purpose approaches such as SystemC [28], and language specific
approaches such as SpecC [7]. The setup allows a designer, being expert in a particular
domain, to use descriptions that are concise and use the domain semantics. At the same time,
the descriptions are fully accessible through the C++ API, which provides access to the
simulation and code generation kernel. This way, the domain specific descriptions can be
easily linked into a system simulation, where different design domains are combined. We
do believe that domain specific processing presents an area where higher abstraction levels
can be developed more easily than for the generic system design language case.
4.3. GEZEL Design Data Model
The GEZEL domain specific language is a representation of a more fundamental structure,
a design data model. This design data model is crafted along the requirements of
security processors as discussed in Section 3.
We started from a model consisting of communicating FSMDs. In such a model, each
processor of the system is expressed as the combination of a finite state machine
with a datapath. In our model we expressed the datapath at behavioral level as a set
of signal flow graphs (SFG). The controller model decides, at each instruction cycle, which
of the datapath SFGs to execute. Currently we support three kinds of controller models: the
Hardwired Controller (that always executes the same SFG), the Sequencer (that executes a
cyclic sequence of SFGs) and the generic Finite State Machine (that supports decision
making). Part of our current research is to investigate what kind of controller models are
best suited for security processors.
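The relation between a datapath made of SFGs and the three controller kinds can be sketched in a few lines of software. This is a hedged illustration only, not the GEZEL simulation kernel: the class `Datapath`, the generator functions `hardwired`, `sequencer` and `fsm`, and the counter example are all hypothetical names chosen for this sketch.

```python
class Datapath:
    """Registers plus named signal flow graphs; each SFG is one clock cycle of work."""

    def __init__(self, registers, sfgs):
        self.regs = dict(registers)
        self.sfgs = sfgs  # name -> function(regs) returning a dict of register updates

    def step(self, sfg_names):
        # All selected SFGs read the current register values; the updates are
        # applied together afterwards, mimicking a synchronous register update.
        updates = {}
        for name in sfg_names:
            updates.update(self.sfgs[name](self.regs))
        self.regs.update(updates)


def hardwired(sfg_name):
    """Controller that selects the same SFG in every cycle."""
    while True:
        yield [sfg_name]


def sequencer(sfg_names):
    """Controller that cycles through a fixed sequence of SFGs."""
    while True:
        for name in sfg_names:
            yield [name]


def fsm(transitions, initial, regs):
    """Controller whose SFG selection and next state may depend on register values."""
    state = initial
    while True:
        sfg_names, state = transitions[state](regs)
        yield sfg_names


# Example datapath: a counter with two instructions.
COUNTER_SFGS = {
    "inc": lambda r: {"n": r["n"] + 1},
    "clear": lambda r: {"n": 0},
}

dp = Datapath({"n": 0}, COUNTER_SFGS)
controller = fsm(
    {"count": lambda r: (["inc"], "count") if r["n"] < 3 else (["clear"], "count")},
    "count",
    dp.regs,
)
trace = []
for _ in range(6):
    dp.step(next(controller))
    trace.append(dp.regs["n"])
print(trace)
```

The same datapath can be driven by `hardwired("inc")` or `sequencer(["inc", "inc", "clear"])` without any change, which is the point of keeping data processing and control separate.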
Split representation of data processing and control processing was found very convenient
for modeling at the micro-architecture level. But this approach has been used at other
abstraction levels too. Examples at higher levels of abstraction are FunState [29] and SBF
[17]. The FSMD model has also been proven to be applicable to real designs, for example in
OCAPI [30].
4.4. GEZEL Domain Specific Language
Before introducing the GEZEL language, we motivate why we are defining yet another
language. First, we want to make clear that our goal is not to define a language that can
describe a complete system. Rather, we are looking for a way to express the design data
model defined above in a convenient way. This leaves us with several different options.
Figure 6. GEZEL implementation.
• We can use a general purpose programming language (C++) such as is done in
SystemC. This is a very good approach in the initial phases of the creation of a design
data model. It allows the design data model to be modified at any time, and also
provides simple programming as a fallback for those parts of a design where the
design data model does not fit. The drawback of this method is that the design environment
(C++ compiler) cannot make any distinction between design objects
that are part of the design and objects that are part of the design data model. For
example, when designing a finite state machine with C++ objects, a designer will
still see C++ syntax errors (rather than FSM syntax errors). This requires the designer
to become expert in both the design and the design data model.
• We can use a custom graphical syntax, such as a block based model. For some application
areas like controller design or dataflow graphs, very good representations
are available. The drawback here is that there is no good graphical representation
available that captures both the data processing and control processing aspects of
our design data model, and as a result we have to use mixed graphical/textual
models. Such models are inherently more complex to handle.
• We can use a domain specific language. Such a language is a direct syntax representation
of the design data model that fits the domain specific part of the design.
Some examples of this approach from the software engineering world are Yacc for
the construction of parsers and Perl for text processing.
We will use an example to describe the GEZEL domain specific language. We focus on
one particular operation out of the ECC datapath, which is Galois field multiplication.
Figure 7 shows a bit-serial multiplication. This flow graph multiplies bit-vectors a and b,
both in GF(2^4) representation, to yield bit-vector c. Arithmetic in GF(2^4) is governed by
a field polynomial which is selected by the feedback pattern of the structure. The next listing
shows a textual representation of the same structure.
// Listing 1
dp D( in a, b : ns(4);
      out mul : ns(4);
      in mul_st : ns(1);
      out mul_done : ns(1)) {
  reg ctl, cr, br, ar : ns(4);
  sfg s1 {
    ctl = mul_st ? 1 : (ctl << 1);
    ar = a;
    br = ((ctl == 0) ? b : (br << 1));
    cr = (ctl == 0) ? 0 : (cr << 1)
         ^ (ar & (tc(1)) br[3])
         ^ (0b0011 & (tc(1)) cr[3]);
    mul = cr;
    mul_done = ctl[3];
  }
}
The datapath that is created has a set of state registers (reg variables) that are subject to
expressions within a signal flow graph sfg. An sfg represents one clock cycle of
processing, thereby making this a clock cycle true description. The sfg uses word-parallel
semantics, which allows compact descriptions. The structure also uses a local
one-hot controller (ctl), counting the 4 clock cycles the bit-serial structure needs to complete.
Listing 1 implies allocation of datapath resources since all operations execute in the
same clock cycle and thus require parallel implementation. By allowing multiple sfg
instances per datapath (instructions), and introducing a separate controller description in
the form of a sequencer or a finite state machine, we obtain a description that also supports
operator sharing. This is demonstrated in the next listing.
// Listing 2
dp D( in a, b : ns(4);
      out mul : ns(4);
      in mul_st : ns(1);
      out mul_done : ns(1)) {
  reg ctl, cr, br, ar : ns(4);
  reg mul_st_cmd : ns(1);
  sfg ini {
    ar = a; br = b;
  }
  sfg calc {
    cr = (cr << 1) ^ (ar & (tc(1)) br[3]) ^
         (0b0011 & (tc(1)) cr[3]);
  }
  sfg outactive {
    mul = cr; mul_done = 1;
  }
  sfg outidle {
    mul = 0; mul_done = 0; mul_st_cmd = mul_st;
  }
}
fsm F(D) {
  state s1, s2, s3, s4, s5;
  initial s0;
  @s0 (ini, outidle) -> s1;
  @s1 if (mul_st_cmd) then (calc, outidle) -> s2;
      else (ini, outidle) -> s1;
  @s2 (calc, outidle) -> s3;
  @s3 (calc, outidle) -> s4;
  @s4 (calc, outidle) -> s5;
  @s5 (ini, outactive) -> s1;
}
Figure 7. GF(2^4) bit-serial multiplier.
The controller is an FSM with a set of states and transitions between those states. One of
these states is the initial one. Conditional transitions are modeled using an if-then-else
structure and rely on conditional expressions that are derived from datapath registers.
The datapath is now modeled as a collection of sfg. Each of the sfg ini, calc,
outactive and outidle represents a single cycle of activity on the datapath. The sfg
names are used by the controller to define the instruction set.
Comparing Listings 1 and 2 we see that separate modeling of datapath processing and
control introduces some overhead. On the plus side, the resulting model can express
datapath-sharing by defining sfg that are executed in exclusive clock cycles.
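The arithmetic carried out by the bit-serial structure can be cross-checked with a small software model. The sketch below is an illustration, not generated from GEZEL: it assumes the feedback pattern 0b0011 corresponds to the field polynomial x^4 + x + 1, processes the bits of b most significant bit first as the listings do, and models only the arithmetic, not the control timing. The function names are hypothetical.

```python
WIDTH = 4
MASK = (1 << WIDTH) - 1
FEEDBACK = 0b0011  # feedback pattern from the listings, i.e. x^4 = x + 1


def gf16_mul_bitserial(a, b):
    """Multiply a and b in GF(2^4), consuming one bit of b per iteration (MSB first)."""
    ar, br, cr = a & MASK, b & MASK, 0
    for _ in range(WIDTH):
        top_b = (br >> (WIDTH - 1)) & 1   # corresponds to br[3]
        top_c = (cr >> (WIDTH - 1)) & 1   # corresponds to cr[3]
        cr = ((cr << 1) & MASK) ^ (ar if top_b else 0) ^ (FEEDBACK if top_c else 0)
        br = (br << 1) & MASK
    return cr


def gf16_mul_reference(a, b):
    """Reference: carry-less multiply followed by reduction modulo x^4 + x + 1."""
    product = 0
    for i in range(WIDTH):
        if (b >> i) & 1:
            product ^= (a & MASK) << i
    for i in range(2 * WIDTH - 2, WIDTH - 1, -1):
        if (product >> i) & 1:
            product ^= 0b10011 << (i - WIDTH)
    return product


# Operand values taken from the testbench of Listing 3.
print(bin(gf16_mul_bitserial(0b1101, 0b1001)))
```

Comparing the bit-serial model against the reference for all 256 operand pairs is a cheap way to validate the feedback pattern before committing it to hardware.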
4.5. Integration
The descriptions in Listings 1 and 2 can be parsed to yield an object hierarchy (in C++), as
shown in Figure 6. This object hierarchy can be analyzed by a simulation kernel or a code
generation kernel for the purpose of cycle-true simulation and HDL code generation
respectively. The kernels are presented to the user through a simple C++ API. A system
simulation then consists of writing a C++ program and calling the parsing and simulation
API as needed to execute the domain specific processor. The simplest use of the model
given in Figure 6 is to have a simple testbench in GEZEL together with the design itself.
The system integration task (in C++), in that case, simply executes the GEZEL simulation
kernel. An example testbench in GEZEL is shown in the next listing.
// Listing 3
// testbench
dp TB( out i1, i2 : ns(4); out mul_st : ns(1)) {
  reg ctl : ns(4);
  sfg s1 {
    ctl = ctl + 1;
    i1 = 0b1101;
    i2 = 0b1001;
    mul_st = (ctl == 4) ? 1 : 0;
  }
}
hardwired F2(TB) {s1;}
system S {
  D(i1, i2, mul, mul_st, mul_done);
  TB(i1, i2, mul_st);
}
A hardwired controller is used because TB always executes the same instruction s1. Finally,
a system statement is used to connect the testbench to the GF multiplier of Listing 2.
The generic system simulation model that parses the testbench and the multiplier is a
small C++ program as shown next.
// Listing 4
#include <cstdlib>   // atoi
#include <fdlsim.h>
int main(int argc, char **argv) {
  // parse GEZEL program
  symbolTable table = call_parser(argv[1]);
  // generate simulator
  rtsimgen simulator;
  table.create_simulator(simulator);
  // run simulation
  simulator.run(atoi(argv[2]));
  return 0;
}
4.6. Results for the ECC Processor
We finally compare the energy efficiency of the complete ECC processor to a
performance-optimized software implementation [9]. The figure of merit in this case is
the number of Point Multiplications per second and per Watt. The results are shown in
Table 5. The curve parameters were derived from the ECDSA standard [23], which defines
standard settings for use in Digital Signature Authentication. We used K-163, which is a
setting with an underlying field GF(2^163). The datapaths are thus 163 bits wide.
The ECC processor was mapped onto a Xilinx Virtex-II FPGA and occupies 3118 slices
with a critical path of 6.5 ns. The short critical path is due to the bit-sliced, bit-serial
architecture. The bit-serial design requires 313000 clock cycles on average per Point
Multiplication. This results in 414 Point Multiplications per second. At an estimated power
consumption of 575 mW, we thus have 720 Point Multiplications per second and
per Watt for our processor. The software design yields, despite the higher absolute
throughput, only 71 Point Multiplications per second and per Watt.
Based on the results from Table 1, we also note that the figure of merit for a standard cell
implementation of the ECC processor would improve further relative to the FPGA.
5. Related Work
Domain specific processors are well known in the domains of signal processing and
networking. Our contribution to the field is our focus on architecture design methods, design
automation and the related reconfiguration. In particular, we are seeking ways to have an
adequate representation of the processing hierarchies in a design.
We have already indicated the relationship to existing system design language research
such as SystemC [28] or SpecC [7]. There is an ongoing discussion in the system design
community on which language is the right one to use. With this work, we hope at least to
conclude that there is a strong dependency of the design language on the design domain, and to
demonstrate this with real applications.
Domain specific languages have also been used at other abstraction levels. ASIP
development environments such as LISA [12] and Chess [20] create both the processor
architecture and a software development environment for it out of an instruction-level
processor description. In other cases, a domain specific language has also been used to
describe the application at behavioral level (rather than architecture- or instruction-
level). The Stanford SHADE project [26] uses a dedicated description of shading
graphics procedures in order to compile highly optimized code for graphics accelerator
hardware.
Table 5. ECC K-163 Implementation on Two Different Platforms

Platform                    Power (W)   Throughput (Pmult/s)   Figure of Merit (Pmult/s/W)
FPGA (1)                    0.575       414                    720
Pentium-III, 1 GHz (2, 3)   36.6        2600                   71

(1) [15] on Virtex-II (Xilinx), XST, Power Estimator XAPP152
(2) [9], ECDSA with K-163 curve
(3) Power based on Intel datasheet (Vcc = 1.8 V, Icc = 23 A)
Heterogeneous co-simulation has also been extensively researched in a variety of
environments. We only note Ptolemy [21] and Coware [2] here. Related to this work, we hope to
bootstrap on the existing body of research when it comes to selection of the proper
simulation mechanisms and models of computation. We indicate the design of a control hierarchy
as one of the problems to solve. Existing efforts in this area such as Statecharts [10] and
Esterel [1] focus primarily on the modeling of control. We hope to solve the problem of
hierarchical control in combination with datapath, connectivity and storage design. And
finally, we hope to make efficient use of the software engineering techniques that enable
the development of a lightweight language, for which plenty of examples exist [18].
6. Conclusion
In this contribution we have presented GEZEL, an approach for domain specific design
based on combining a domain-specific language and a general-purpose language into one
environment. The design space for such processors was defined by means of a
reconfiguration hierarchy, which recognizes that there are different levels of programming abstraction
in a single system that match up against the different levels of design abstraction. The
security domain is used as a driver in our approach and we are currently pursuing the
development of several demonstrator designs, including a high-speed embedded router with
AES support.
References
1. Berry, G. The Foundations of Esterel. In Proof, Language and Interaction: Essays in Honour of Robin Milner. MIT Press, 2000.
2. Bolsens, I., H. De Man, B. Lin, K. Van Rompaey, S. Vercauteren, and D. Verkest. Hardware/Software Co-Design of Digital Telecommunication Systems. In Proceedings of the IEEE, vol. 85, no. 3, pp. 391-418, March 1997.
3. Dehon, A. and J. Wawrzynek. Reconfigurable Computing: What, Why, and Implications for Design Automation. In Proceedings of the Design Automation Conference 1999, June 1999.
4. De Win, E. and B. Preneel. Elliptic Curve Public-Key Cryptosystems: An Introduction. LNCS 1528, Springer-Verlag, June 1997, pp. 131-141.
5. Dyer, J., M. Lindemann, R. Perez, L. van Doorn, S. Smith, and S. Weingart. Building the IBM 4758 Secure Coprocessor. IEEE Computer, pp. 57-67, Oct. 2001.
6. Gajski, D., F. Vahid, S. Narayan, and J. Gong. Specification and Design of Embedded Systems. Prentice Hall, Englewood Cliffs, NJ, 1994.
7. Gajski, D., J. Zhu, R. Doemer, A. Gerstlauer, and S. Zhao. SpecC: Specification Language and Methodology. Kluwer Academic Publishers, Boston, 2000.
8. Goodman, J., and A. P. Chandrakasan. An Energy-Efficient Reconfigurable Public-Key Cryptography Processor. IEEE Journal of Solid-State Circuits, pp. 1808-1820, Nov. 2001.
9. Hankerson, D. Performance Comparison of Elliptic Curve Systems in Software. In Proceedings of the Fifth Workshop on Elliptic Curve Cryptography 2001, Ontario, Oct. 2001.
10. Harel, D. Statecharts: A Visual Formalism for Complex Systems. Sci. Comput. Programming, vol. 8, 1987, pp. 231-74.
11. Hartenstein, R. A Decade of Reconfigurable Computing: A Visionary Perspective. In Proceedings of the Design Automation and Test European Conference 2001, Munchen, March 2001.
12. Hoffmann, A., A. Nohl, G. Braun, O. Schliebusch, T. Kogel, and H. Meyr. A Novel Methodology for the Design of Application Specific Instruction Set Processors (ASIP) Using a Machine Description Language. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), Nov. 2001.
13. IEEE P1363/2000: Standard Specifications for Public Key Cryptography. http://www.ieee.org.
14. IETF: SSH Protocol Architecture. http://www.ietf.org/internet-drafts/draft-ietf-secsh-architecture-09.txt, July 20, 2001.
15. Janssens, S., J. Thomas, W. Borremans, P. Gijsels, I. Verbauwhede, F. Vercauteren, and B. Preneel. Hardware/Software Co-Design of an Elliptic Curve Public-Key Cryptosystem. In Proceedings of the 2001 IEEE Workshop on Signal Processing Systems, pp. 209-216, Antwerpen, 2001.
16. Kienhuis, B. Domain Space Exploration of Stream Based Architectures for Dataflow Applications. Ph.D. thesis, TU Delft, 1999.
17. Kienhuis, B., and Ed. F. Deprettere. Modeling Stream-Based Applications Using the SBF Model of Computation. In Proceedings of the 2001 IEEE Workshop on Signal Processing Systems, pp. 209-216, Antwerpen, 2001.
18. Kim, E. The MIT Lightweight Languages Workshop. Dr. Dobb’s Journal, CMP Publishers, Feb. 2002.
19. Kuo, H., I. Verbauwhede, and P. Schaumont. A 2.29 Gbits/sec, 56 mW Non-Pipelined Rijndael AES Encryption IC in a 1.8 V, 0.18 µm CMOS Technology. In Proceedings of the IEEE Custom Integrated Circuits Conference 2002, Orlando, May 2002.
20. Lanneer, D., J. Van Praet, A. Kifli, K. Schoofs, W. Geurts, F. Thoen, and G. Goossens. CHESS: Retargetable Code Generation for Embedded DSP Processors. In Code Generation for Embedded Processors, P. Marwedel, ed., Kluwer Academic Publishers, 1995.
21. Lee, E. A. Overview of the Ptolemy Project. Technical Memorandum UCB/ERL M01/11, University of California, Berkeley, March 6, 2001.
22. Menezes, A., P. van Oorschot, and S. Vanstone. Handbook of Applied Cryptography. CRC Press, 1997.
23. NIST. Federal Information Processing Standards (FIPS) PUB 186-2 Digital Signature Standard. http://www.nist.gov/aes/, Jan. 27, 2000.
24. NIST. Federal Information Processing Standards (FIPS) PUB 197 Advanced Encryption Standard. http://www.nist.gov/aes/, Nov. 26, 2001.
25. Ogrenci, S., E. Bozorgzadeh, R. Kastner, and M. Sarrafzadeh. SPS: A Strategically Programmable System. In Proceedings of the Reconfigurable Architectures Workshop 2001, San Francisco, April 2001.
26. Proudfoot, K., W. R. Mark, S. Tzvetkov, and P. Hanrahan. A Real-Time Procedural Shading System for Programmable Graphics Hardware. In Proceedings of the 28th International Conference on Computer Graphics and Interactive Techniques (SIGGRAPH 2001), Los Angeles, 2001.
27. Schaumont, P., I. Verbauwhede, K. Keutzer, and M. Sarrafzadeh. A Quick Safari Through the Reconfiguration Hierarchy. In Proceedings of the Design Automation Conference 2001, Las Vegas, June 2001.
28. Swan, S. An Introduction to System Level Modeling in SystemC 2.0, http://www.systemc.org.
29. Thiele, L., K. Strehl, D. Ziegenbein, R. Ernst, and J. Teich. FunState - An Internal Design Representation for Codesign. In Proceedings of the 1999 International Conference on Computer Aided Design, San Jose, 1999.
30. Verkest, D., W. Eberle, P. Schaumont, B. Gyselinckx, and S. Vernalde. C++ Based System Design of a 72 Mb/s OFDM Transceiver for Wireless LAN. In Proceedings of the Custom Integrated Circuits Conference 2001, San Diego, 2001.
31. Whiting, D., B. Schneier, and S. Bellovin. AES Key Agility Issues in High Speed IPsec Implementations. Public Comments on AES Candidate Algorithms Round 2, http://www.nist.gov/aes.
