Embedded Microcontrollers and FPGAs Soft-cores
Daniel Francisco Gómez Prado
Department of Electrical and Computer Engineering, University of Massachusetts, Amherst, USA
ABSTRACT: The main idea of FPGA soft-cores is to provide designers with the flexibility of creating a perfect fit in terms of processor(s)¹, peripherals and memory interfaces for embedded applications. This perfect fit usually, but not always, implies a tradeoff with performance and cost. This paper presents a comparison in speed, power, flexibility and cost between a microcontroller and its soft-core version. For this, an HDL synthesizable soft-core of an 8-bit microcontroller, capable of executing the same assembler code as a mid-range Microchip PIC such as the 16F84, is implemented. The delays introduced by the FPGA interconnect are considered after synthesis by performing post placement and routing simulations.
I. INTRODUCTION
Since 1980, when Intel designed the 8051, an 8-bit microcontroller, embedded systems have used microcontrollers as a core part of their system. The applications in which they have been used range from automotive, industrial control, office automation and communications, to name a few; in general, any application that needs fast time to market, lower total system cost and low-risk product development has been designed using a microcontroller, thus promoting the use of microcontrollers everywhere.
This market was particularly well understood by Microchip, which upon its foundation in 1989 released an 8-bit one-time programmable (OTP) and a reprogrammable (Flash) microcontroller based on a modified Harvard RISC (Reduced Instruction Set Computing) architecture. This simple architecture combined with a reprogrammable capability provided
¹ One or more microprocessors can be embedded into the same FPGA.
The research reported in this paper has been supported in part by the National Science Foundation, contract No. CCR-0204146.
ELECTRÓNICA - UNMSM
embedded designers with even faster time to market and lower cost systems, which in turn made Microchip grow in market share of 8-bit microcontrollers (based on worldwide unit shipments) from 20th place in 1990 to number one in 2002 [3].
Even though the Microchip core architecture has remained unchanged, Microchip now offers more than 180 PIC devices [6] featuring numerous on-chip peripherals to best fit the needs of the huge spectrum of embedded applications in which they are used.
The customization of the instruction set for a given application has resulted in literally hundreds of microcontrollers available today, not just from Microchip but from different companies, each one with a different set of peripherals, memories, interfaces, and performance characteristics. One of the biggest challenges faced by embedded designers is therefore the selection of the processor that best fits their application requirements; designers usually end up either buying more processor than they need to get the right mix of peripherals and interfaces, or settling for a less than ideal solution to keep costs down.
This wide variety of microcontroller peripherals and memories has lately been recognized by major FPGA companies such as Xilinx, Altera and Actel, which have developed soft-cores to embed microcontrollers into their FPGAs: MicroBlaze [10], Nios [2] and Core8051 [1] respectively. Xilinx and Altera provide different kinds of configurations, implementing soft-core microcontrollers from 4 bits to 32 bits, each with different add-ons inside the FPGA to handle a variety of peripherals such as USBs, UARTs, LCDs, Ethernet, DMA, etc.
II. PREVIOUS WORK AND OVERVIEW
Even though Microchip devices have been successfully implemented in FPGAs in [4], [5], [8] and [9], none of these designs fully implemented the same architecture. For example, the configuration bits on the
N" 18.Diciembre del 2006
III. DESIGN HIERARCHY
A. The ROM module
The design of the UMASScore is done hierarchically, as in [9], by breaking the design into five modules: the ROM module, the RAM module, the ALU module, the CPU module and the expansion module.
v6.62 from Microchip is used to write the microcontroller programs. In [8] and [9] a program to convert from HEX to VHDL has been written, so the assembler code written for the PIC16F84 can be compiled, and the hexadecimal file obtained can be translated into a VHDL format that contains the binary instructions to add to the UMASScore ROM module to simulate and test the design.
It is worth mentioning that the two extended operations EXTWR and EXTRD are not part of the MPASM code and are not compatible with the MPLAB IDE software; therefore these two instructions are tested by adding their binary code manually into the program ROM module.
[Fig. 1 contents: Program Memory, up to 8K x 14; input PC<12:0> (13 bits); output Data<13:0> (14 bits)]
This module implements the program memory of the UMASScore. It contains the binary code of the instructions to be executed. The ROM module takes as input the 13 bits of the program counter PC, and gives as output the 14 bits needed to encode a single instruction in the PIC16F84. See Fig. 1.
To test the instruction set, the conditional and unconditional jumps and the proper function of the bidirectional ports, a test program in MPASM assembler language [7] is downloaded into the UMASScore ROM module.
With the 13-bit PC this module can store up to 8K instructions, each 14 bits wide, but this does not mean that an 8K x 14 bit memory is implemented in the design. The number of registers inferred in the module is equal to the number of instructions to be executed; the previous program, for example, will infer 65 registers of 14 bits each.
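The point that only the stored instructions consume registers can be illustrated with a small behavioral model. This is a minimal Python sketch, not the VHDL of the UMASScore: the names `make_rom` and `NOP`, and the choice of returning a NOP for unprogrammed addresses, are assumptions made for illustration.

```python
NOP = 0x0000  # assumed filler word for addresses that hold no instruction


def make_rom(words):
    """Build a lookup function for a program ROM of 14-bit words.

    Only the words actually supplied are stored, mirroring the idea that
    the synthesized ROM holds as many registers as there are instructions.
    """
    table = {addr: word & 0x3FFF for addr, word in enumerate(words)}

    def rom(pc):
        # The program counter is 13 bits wide, so it addresses up to 8K words.
        return table.get(pc & 0x1FFF, NOP)

    return rom


# MOVLW 0x04 (11 00xx kkkk kkkk) followed by MOVWF 0x20 (00 0000 1fff ffff)
rom = make_rom([0x3004, 0x00A0])
```

Calling `rom(0)` returns the first stored word; any address beyond the two stored words returns the filler value.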
TABLE I
TRIS registers, which set the data direction on the ports of the microcontroller, do not work as specified in [7] in any of the designs previously mentioned.
In our microcontroller soft-core, referred to from now on as UMASScore, the TRIS registers are implemented and the bidirectional ports work as in [7]. The power save mode, implemented via the SLEEP instruction, adds a control signal that halts the phase clock of the UMASScore. This keeps all registers at the same value and disables the arithmetic unit, thus eliminating the dynamic power dissipation. The interrupts (IRQ) are implemented by a process that reads, every clock cycle, one of the bits of the input ports configured as IRQ. This process can set a flag that tells the fetch unit to save the PC on the stack and to jump to a known ROM location in which the IRQs are handled.
Additionally, to allow the UMASScore embedded microcontroller to interact with the rest of the FPGA, two instructions are added to the original instruction set: EXTWR and EXTRD. These instructions define an extension module for the UMASScore, so more functionality can be added to the design without changing its primary specification. The concept of an expansion module has been successfully used to customize the instruction set provided by the Nios II [2], in which the expanded operations are added using multiplexers as part of the ALU instruction set. In our design the expansion module uses the concept of programmable active memory as presented in [13], so the expansion module is accessed as a memory block rather than as an ALU operation.
The main features in which the UMASScore differs from the PIC 16F84 and other previous implementations are summarized in table I.
Feature | Microchip 16F84A | Previous implementations [column header illegible] | UMASScore
Oscillator | Several oscillator options | Direct clock | Direct clock
Clocking | 4 phased clock | Varies [partly illegible] | 4 phased clock [partly illegible]
Reset | Active low MRST and a power-up circuit | Low MRST and high RESET | Low MRST, no power-up circuit
Sleep | Sleep instruction and circuit | None | Sleep instruction
Ports | Bidirectional, programmed by the TRIS register | Bidirectional tri-stable [partly illegible] | Ports as in the 16F84
Watchdog timer | WDT circuit | Done in [8] | WDT circuit
Timer (TMR0) | Free running or external source | Free running | Free running
Interruption (IRQ) | Multiple and programmable IRQ, with priorities and configurable to rising or falling edge | Dedicated pin for IRQ | Multiple and programmable IRQ
Extended Inst Set | None | [illegible] | [illegible]
This project uses Xilinx ISE 6.3i WebPACK and Xilinx ModelSim II v5.8c software to synthesize, place, route and simulate the VHDL design, and a Spartan3 device as the target architecture. The MPASM programming language and its compiler MPLAB IDE
Fig. 1. The block diagram of the ROM module.
B. The RAM module
From the 14-bit instruction retrieved from the ROM module, the 7 least significant bits <6:0> are used to address the RAM memory. This memory, 8 bits wide, stores the 128 general purpose registers of the UMASScore, from addresses 00h to 7Fh. Three signals control the operation of the RAM: the clock, the general enable and the write enable; they are used to read the data at the beginning of an instruction execution and to store the result in the desired address. See Fig. 2.
In the case of bitwise operations, the 3 most significant bits of operand B are decoded into a BitMask which selects the bit of operand A on which the operation takes place.
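The BitMask decoding can be sketched as a 1-of-8 decoder followed by the masking operations listed for the ALU. This is a minimal Python sketch under the assumption that the 3-bit field has already been extracted as an integer `b`; the function names are hypothetical and not taken from the UMASScore source.

```python
def bit_mask(b):
    """Decode a 3-bit bit index into a one-hot 8-bit mask (1-of-8 decoder)."""
    return 1 << (b & 0x7)


def clear_bit(a, b):
    """(NOT BitMask) AND A: clear the selected bit of operand A."""
    return a & ~bit_mask(b) & 0xFF


def set_bit(a, b):
    """BitMask OR A: set the selected bit of operand A."""
    return (a | bit_mask(b)) & 0xFF


def bit_is_zero(a, b):
    """True when (BitMask AND A) is zero, as used by the bit test instructions."""
    return (a & bit_mask(b)) == 0
```

For a bit index of 3 the mask is `00001000`, which is the BitMask value that appears in Fig. 3.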
[Fig. 2 contents: RAM memory, 128 x 8; signals Address<6:0>, toRAM<7:0>, fromRAM<7:0>, clock, enable and ramwr]
Fig. 2. The block diagram of the RAM module.
[Fig. 3 contents: operand B decoded into an 8-bit BitMask that is applied to operand A; example BitMask 00001000]
Fig. 3. Bitmask selection.
The operation to perform in the ALU is determined in the CPU module when decoding the instruction to execute. Once the ALU operation is determined, the operands A and B for the ALU are selected.
TABLE II
C. The ALU module
The UMASScore has a very simple ALU capable of doing arithmetic, logic, shift and bitwise operations as in [7]. It takes as inputs two 8-bit operands, A and B, and a 4-bit operation selection; and it produces an 8-bit result with a carry out. There is an additional output used to indicate that the result of the ALU module is zero. See table II.
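A behavioral view of this interface (two 8-bit operands, an operation selector, an 8-bit result, a carry out and a zero output) can be sketched as follows. This is an illustrative Python model, not the VHDL of the UMASScore: operations are selected by hypothetical names rather than by the 4-bit codes, and the carry convention for subtraction (carry set when there is no borrow) is an assumption borrowed from the usual PIC behavior.

```python
def alu(op, a, b=0, cin=0):
    """Return (result, carry_out, zero) for one combinational ALU operation."""
    a &= 0xFF
    b &= 0xFF
    cout = 0
    if op == "ADD":
        total = a + b
        result, cout = total & 0xFF, total >> 8
    elif op == "SUB":
        # A - B as A + two's complement of B; carry out means "no borrow".
        total = a + ((~b) & 0xFF) + 1
        result, cout = total & 0xFF, total >> 8
    elif op == "AND":
        result = a & b
    elif op == "OR":
        result = a | b
    elif op == "XOR":
        result = a ^ b
    elif op == "NOT":
        result = (~a) & 0xFF
    elif op == "ROTATE_LEFT":   # {A[6:0], Cin}, Cout <= A[7]
        result, cout = ((a << 1) | (cin & 1)) & 0xFF, a >> 7
    elif op == "ROTATE_RIGHT":  # {Cin, A[7:1]}, Cout <= A[0]
        result, cout = ((cin & 1) << 7) | (a >> 1), a & 1
    elif op == "SWAP":          # {A[3:0], A[7:4]}
        result = ((a & 0x0F) << 4) | (a >> 4)
    else:
        raise ValueError("unknown operation: " + op)
    return result, cout, int(result == 0)
```

Because the function has no internal state, it mirrors the asynchronous nature of the ALU described later in the paper: the outputs depend only on the current inputs.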
The RAM module is written in VHDL according to the coding style suggested in [11] [12], so the synthesizer is able to infer that the block RAM available in the Spartan3 device is going to be used. Respecting the coding style is important; otherwise the synthesizer is not able to infer that the device block memory can be used for this module, and the memory would be implemented using LUTs.
D. The CPU module
TABLE III
This is the top level of the hierarchy of the UMASScore and it implements the program flow. Therefore the instruction decode, the specific purpose registers, the data and address buses, the multiplexers and the decoders are implemented here.
The instruction cycle of the UMASScore is divided into a four phased clock that is used to provide synchronization between the different execution steps of an operation. These synchronizations are scheduled in table III.
The operation of the CPU module can be divided into three functions: the program counter control, the instruction decode and the datapath control.
1) The Program Counter control
Upon reset or at start the program counter is loaded with the address 1FFFh and a NOP instruction is forced
[Table III contents: schedule of operations over the four clock phases. Recoverable entries: decode the instruction register and determine the ALU operation; read RAM, special purpose registers and ports; update W, RegF, OpA, OpB and toRAM; write RAM and special registers; retrieve the next instruction, stall if the actual instruction requires it, increment the PC if not stalled, update the instruction register. The assignment of each entry to a phase is not recoverable from the source.]
Opcode  Description
0000    A + B
0001    A - B
0010    A AND B
0011    A OR B
0100    A XOR B
0101    NOT A
0110    {Cin, A[7:1]}, Cout <= A[0]
0111    {A[6:0], Cin}, Cout <= A[7]
1000    {A[3:0], A[7:4]}
1001    (NOT BitMask) AND A
1010    BitMask OR A
1011    If (BitMask AND A) = 0 => Zout = 1
1100    If (BitMask AND A) /= 0 => Zout = 1
1101    Propagate A
1110    Propagate A
1111    Propagate A
[Fig. 4 contents: timing diagram of the four clock phases Q1 to Q4 over three instruction cycles]
Fig. 4. The instruction fetch.
into the UMASScore; thus the instruction in 1FFFh is not executed. With this address in the PC the next instruction to be fetched is in 0000h, so in the next 4 phased cycle the instruction in 0000h has been stored in the instruction register; and at the beginning of phase Q1, as shown below, the instruction is ready to be decoded and executed.
The control of the PC also determines if the instruction in execution produces a conditional or an unconditional branch.
For conditional branches, instructions BTFSC and BTFSS, the zero flag from the ALU module is checked and if the condition is satisfied the next instruction is replaced by the NOP instruction. In these two instructions the PC continues its normal count; only the next instruction to be executed is replaced with the NOP instruction if needed. For example, in the following instructions, if the RAM address 0x16 has the value 58h;
49  CALL subroutine     //Jump
50  BSF STATUS,RP0      //<0x03>=[illegible]
subroutine:
384 MOVLW 0x04          //W=04h
385 MOVWF 0x20          //<0x20>=04h
386 inner_loop:
387 DECFSZ 0x20,1       //<0x20>=03h
388 GOTO inner_loop
389 RETLW 0x77          //W=77h
TABLE IV
The instruction CALL forces a NOP instruction to be executed instead of BSF and in the meantime it loads the PC with the address 383, which is used by the fetch unit to retrieve the instruction MOVLW. Instructions 384 and 385 write the value 04h into the RAM address 0x20. The instruction DECFSZ decrements the value in RAM 0x20 by 1, and checks if it is 00h; as it is not, the next instruction GOTO is executed. The GOTO instruction loads 385 in the PC, and forces a NOP while the fetch unit retrieves the instruction from 385, DECFSZ. This process goes on until the value stored in the RAM 0x20 is 01h; a decrement here will produce 00h, which will force a NOP instruction to replace the GOTO instruction. The instruction RETLW will then be executed, loading the PC with the address 49 and forcing a NOP instruction. This will make the fetch unit retrieve the instruction BSF from the address 50. This is shown in table IV.
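The sequence of executed instructions and forced NOPs in this delay loop can be traced with a short script. This is a minimal Python sketch of the behavior described above, not a model of the fetch unit itself; the function name `trace_delay_loop` and the trace format are hypothetical.

```python
def trace_delay_loop(initial=0x04):
    """Trace the DECFSZ/GOTO loop on RAM address 0x20 until RETLW.

    Returns a list of (instruction, value of RAM 0x20) tuples in execution
    order, including the NOPs forced by GOTO and by a taken DECFSZ skip.
    """
    counter = initial & 0xFF
    trace = []
    while True:
        counter = (counter - 1) & 0xFF
        trace.append(("DECFSZ", counter))
        if counter == 0:
            # Result is 00h: the GOTO that follows is replaced by a NOP.
            trace.append(("NOP", counter))
            break
        trace.append(("GOTO", counter))
        # GOTO forces a NOP while the fetch unit retrieves the loop target.
        trace.append(("NOP", counter))
    trace.append(("RETLW", counter))
    return trace


trace = trace_delay_loop(0x04)
executed = [name for name, _ in trace]
```

With the initial value 04h the loop body runs four times, and the last DECFSZ is followed by a forced NOP and then RETLW, which matches the narrative above.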
destination address while the NOP instruction is still being executed. For example, in the following instructions:
[Figure residue: alternating fetch and execute labels (Fetch Inst PC+2, Execute Inst PC, Fetch Inst PC+1, Execute Inst PC); the remaining labels are illegible in the source]
20 DECFSZ 0x16,1    //No jump (57h)
21 ANDLW 0x6[?]     //W=[?]0h
22 BTFSS 0x16,1     //Jump
23 IORLW 0x9F       //W=20h
The instruction DECFSZ will decrement that value by one and store it; as the result, 57h, is different from 00h, the next instruction ANDLW is executed. BTFSS checks if the last bit in the address 0x16 is 1; as that value is 57h, the condition is satisfied and the next instruction IORLW is replaced by a NOP instruction. This is shown in the next table 3.
[Table IV contents, as far as recoverable; the Fetch row is illegible in the source]
PC:           49    50    384    385    386     387   388
Instruction:  CALL  NOP   MOVLW  MOVWF  DECFSZ  GOTO  NOP
RAM 0x20:                        04h    03h

PC:           386     387   388   386     387   388   386
Instruction:  DECFSZ  GOTO  NOP   DECFSZ  GOTO  NOP   DECFSZ
RAM 0x20:     02h                 01h                 00h
TABLE III
2) The Instruction decode
For unconditional branches, instructions CALL, GOTO, RETLW and RETURN, the next instruction is replaced by a NOP and the PC is loaded with the destination address. Actually the PC is loaded with the destination address - 1; so the fetch unit, which retrieves the instruction at PC + 1, retrieves the instruction at the
Fetch 21   Fetch 22   Fetch 23   Fetch 24
PC = 20    PC = 21    PC = 22    PC = 23
DECFSZ     ANDLW      BTFSS      NOP
The instruction set supported by the UMASScore is described in detail in [7], and its summary is shown below.
In table V, C and Z denote the carry and zero status bits modified by the ALU module; the file register F represents any position in the RAM; the register W is an internal working register, not directly addressable, that accumulates the last computation of the ALU; and K is any constant value passed together with the instruction.
The 14 bits of binary code retrieved from the ROM module have the following formats:
The most significant bits are compared and the instruction to be executed is determined. Each instruction corresponds to a state in a finite state machine, and depending on which state is reached the control signals of the datapath will vary.
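How the most significant bits select one of the four instruction formats can be sketched as a field extractor. This is a minimal Python illustration based on the standard PIC16F84 field layout (bits 13 to 0); the function name `decode_fields` and the returned dictionary are hypothetical. As a simplification, control instructions without operands (such as NOP or RETURN) are reported with the byte-oriented layout, and the extended instructions EXTWR and EXTRD are not distinguished.

```python
def decode_fields(word):
    """Split a 14-bit instruction word into its format-specific fields."""
    word &= 0x3FFF
    top = word >> 12  # the two most significant bits select the format
    if top == 0b00:
        # Byte oriented: opcode <13:8>, destination bit d <7>, file address <6:0>
        return {"format": "byte", "opcode": (word >> 8) & 0x3F,
                "d": (word >> 7) & 1, "f": word & 0x7F}
    if top == 0b01:
        # Bit oriented: opcode <13:10>, bit index b <9:7>, file address <6:0>
        return {"format": "bit", "opcode": (word >> 10) & 0xF,
                "b": (word >> 7) & 0x7, "f": word & 0x7F}
    if top == 0b10:
        # CALL and GOTO: opcode <13:11>, 11-bit address k <10:0>
        return {"format": "call_goto", "opcode": (word >> 11) & 0x7,
                "k": word & 0x7FF}
    # Literal oriented: opcode <13:8>, 8-bit literal k <7:0>
    return {"format": "literal", "opcode": (word >> 8) & 0x3F,
            "k": word & 0xFF}
```

For example, `decode_fields(0x0BA0)` describes DECFSZ 0x20,1: opcode `001011`, destination bit 1 and file address 20h.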
TABLE V
[Fig. 6 contents: special registers 00h to 0Ch; RAM memory with clock and ENABLE signals]
a) The Working register and the File register:
As mentioned before, the file register represents any position in the RAM and the working register is an internal register used to accumulate the output of the ALU. When we start the execution of a byte or bit oriented instruction, phase Q1 of the 4 phased clock, the working register, W, needs to be updated with the value accumulated in the previous instruction and held in Wnext. Similarly, the memory needs to be accessed to update the value of the file register.
As the first 12 bytes of the memory correspond to special registers implemented in the CPU module itself and not in the RAM, the file register will be updated with the values from the special registers when the address is less than or equal to 0Ch, and with the values from the RAM when the address is from 0Dh to 7Fh. See figure 6.
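The address-based selection between the special registers and the RAM amounts to a two-way multiplexer on the file register input. A minimal Python sketch, assuming both storages are modeled as dictionaries; the names `read_file_register` and `SPECIAL_LIMIT` are hypothetical.

```python
SPECIAL_LIMIT = 0x0C  # addresses up to 0Ch are served by the CPU module


def read_file_register(addr, special, ram):
    """Return the value that updates the file register for a 7-bit address."""
    addr &= 0x7F
    if addr <= SPECIAL_LIMIT:
        return special.get(addr, 0)  # special registers live in the CPU module
    return ram.get(addr, 0)          # general purpose registers live in the RAM


special_regs = {0x03: 0x20}  # e.g. STATUS
ram_cells = {0x20: 0x04}
```

Reading address 03h returns the STATUS value from the special registers, while reading address 20h returns the value stored in the RAM.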
The control of the datapath can be further divided into different processes, each one synchronized to a phase of the 4 phased clock.
3) The Datapath control
MNEMONIC CODE   INSTRUCTION                               BINARY CODE          FLAG
Byte oriented operations
ADDWF F,d       W + F                                     00 0111 dfff ffff    C,Z
ANDWF F,d       W AND F                                   00 0101 dfff ffff    Z
CLRF F          Clear register F                          00 0001 1fff ffff    Z
CLRW            Clear register W                          00 0001 0xxx xxxx    Z
COMF F,d        Complement F                              00 1001 dfff ffff    Z
DECF F,d        Decrease F by 1                           00 0011 dfff ffff    Z
DECFSZ F,d      Decrease F by 1, skip if 0                00 1011 dfff ffff
INCF F,d        Increase F by 1                           00 1010 dfff ffff    Z
INCFSZ F,d      Increase F by 1, skip if 0                00 1111 dfff ffff
IORWF F,d       W OR F                                    00 0100 dfff ffff    Z
MOVF F,d        Move F                                    00 1000 dfff ffff    Z
MOVWF F         Move W to F                               00 0000 1fff ffff
NOP             No operation                              00 0000 0000 0000
RLF F,d         Shift F to the left through C             00 1101 dfff ffff    C
RRF F,d         Shift F to the right through C            00 1100 dfff ffff    C
SUBWF F,d       F - W                                     00 0010 dfff ffff    C,Z
SWAPF F,d       Swap nibbles in F                         00 1110 dfff ffff
XORWF F,d       W XOR F                                   00 0110 dfff ffff    Z
Bit oriented operations
BCF F,b         Clear bit b of F                          01 00bb bfff ffff
BSF F,b         Set bit b of F                            01 01bb bfff ffff
BTFSC F,b       Check bit b of F, skip if 0               01 10bb bfff ffff
BTFSS F,b       Check bit b of F, skip if 1               01 11bb bfff ffff
Literal or Control operations
ADDLW K         K + W => W                                11 111x kkkk kkkk    C,Z
ANDLW K         K AND W                                   11 1001 kkkk kkkk    Z
CALL ExtendedK  Call to subroutine                        10 0kkk kkkk kkkk
CLRWDT          Not implemented (ignored)                 00 0000 0110 0100
GOTO ExtendedK  Go to ExtendedK                           10 1kkk kkkk kkkk
IORLW K         K OR W                                    11 1000 kkkk kkkk    Z
MOVLW K         K => W                                    11 00xx kkkk kkkk
RETFIE          Not implemented (ignored)                 00 0000 0000 1001
RETLW K         Return from subroutine with K => W        11 01xx kkkk kkkk
RETURN          Return from subroutine                    00 0000 0000 1000
SLEEP           Not implemented (ignored)                 00 0000 0110 0011
SUBLW K         K - W => W                                11 110x kkkk kkkk    C,Z
XORLW K         K XOR W                                   11 1010 kkkk kkkk    Z
Extended operations
EXTWR F         Write to address in F value W             [encoding illegible in source]
EXTRD F         Read from address in (F) and save in W    [encoding illegible in source]
[Fig. 5 contents:
Byte oriented operation: bits 13 to 8 opcode, bit 7 d, bits 6 to 0 address
Bit oriented operation: bits 13 to 10 opcode, bits 9 to 7 bit index, bits 6 to 0 address
CALL and GOTO operation: bits 13 to 11 opcode, bits 10 to 0 address
Literal oriented operation: bits 13 to 8 opcode, bits 7 to 0 literal]
Fig. 5. The instruction format.
Fig. 6. The working and file register.
At the end of the execution of a byte or bit operation, phase Q4, the file register might need to be saved. If so, the RAM is written with the value coming from the ALU module or from the file register.
These two operations of reading and writing the RAM are controlled by the signals ENABLE and
Fig. 7. The ALU unit.
E. The Expansion module
The final architecture implemented in the CPU module is shown below. See figure 8.
For both instructions the register F is used to indirectly address an expanded memory, that is, the value stored in F is used as the address; and the register W is used to connect the expanded data bus. This expanded memory can map up to 256 bytes, and computation can take place in this memory as described in the PAM architecture paper [13].
To test this idea a simple fixed point 4-bit multiplier is implemented. The following assembler lines are used to test the proper operation of the expanded instructions.
MOVWF 0x31    //0x31=C5h
EXTWR 0x31    //0xC5=W[7:4]*W[3:0]
CLRW          //W=00h
EXTRD 0x31    //W=3Ch
MOVWF 0x31    //0x31=3Ch
The two instructions EXTWR and EXTRD added to the instruction set allow an expanded memory to be addressed indirectly. This memory is conceived as an active memory, so any additional functionality can be added inside the FPGA to the UMASScore.
b) The Input/Output Ports:
c) The ALU operation:
The UMASScore interacts with external devices through the bidirectional ports A and B. These ports are addressed as part of the special registers at addresses 0x05 and 0x06 respectively, and the direction of each bit is set in the registers TRISA and TRISB, also at addresses 0x05 and 0x06. The UMASScore differentiates which register is being accessed, TRISA or PORTA, by checking the status of the bit RP0, bit 5 of the register STATUS at address 0x03. If RP0 is set to 1 then the TRIS registers are accessed; otherwise the PORT registers are accessed. This is important because modifying the bits of a TRIS register configures the respective bits of the PORT register; thus a value of 1 in the TRIS register configures the same bit in its PORT as input, and a value of 0 as output.
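The bank selection through RP0 and the meaning of the TRIS bits can be sketched in a few lines. This is a minimal Python illustration of the rule stated above; the function names `port_register` and `pin_directions` are hypothetical, and RP0 is taken as bit 5 of STATUS as in the PIC16F84.

```python
def port_register(addr, status):
    """Name the register accessed at address 0x05 or 0x06 for a STATUS value."""
    names = {0x05: ("PORTA", "TRISA"), 0x06: ("PORTB", "TRISB")}
    rp0 = (status >> 5) & 1  # RP0 selects the TRIS registers when set
    return names[addr][rp0]


def pin_directions(tris):
    """Return the direction of bits 0 to 7: 1 in TRIS means input, 0 output."""
    return ["in" if (tris >> bit) & 1 else "out" for bit in range(8)]
```

With RP0 set, an access to address 0x05 reaches TRISA; writing 0Fh there would configure bits 0 to 3 of port A as inputs and bits 4 to 7 as outputs.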
RAMWR. When ENABLE is set to 1 and there is a rising edge of the clock, the RAM outputs the value indexed by the address bus; and when ENABLE is 1, RAMWR is 1 and there is a rising edge of the clock, the data placed on the bus toRAM is stored.
If we need to read and write in the phases Q1 and Q4, then the signal ENABLE is set to 1 during the phases Q3 and Q4, and the signal RAMWR is set to 1 during Q3. For example, if we need to write a value in memory, we set ENABLE and RAMWR both to 1 during phase Q3, and on the next rising edge of the clock the RAM will be written. This is because phase Q4 starts after the rising edge of the clock, as the phases are derived from the main clock and therefore some delay is associated with them; so after RAMWR and ENABLE are set in Q3, the rising edge of the clock makes the memory store the value, and it looks as if the memory is being written at the beginning of phase Q4.
The instruction to be executed has been fetched
and decoded in the last phase Q4 of the previous
instruction. With the instruction to execute
determined, the operation to perform in the ALU
module can be determined in phase Ql meanwhile
the working register and the file register are being
updated; and in phase Q2 the operands for the ALU
are selected according to the operation to perform.
See figure 7.
Assuming the working register has a value of C5h, the instruction MOVWF loads that value into the register 0x31 of the RAM; then the execution of EXTWR addresses the extended memory with the content of the file register 0x31, and sends the value of W=C5h. As this memory is an active memory, some computation takes place before the value is stored, and in our example this computation is the multiplication of W[7:4] by W[3:0]. So the value stored in the extended memory at 0xC5 is actually Ch x 5h = 3Ch, and this is the value that is retrieved and sent to the working register with the instruction EXTRD.
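The worked example can be checked with a small model of the active memory. This is a minimal Python sketch, assuming a 256-entry memory whose write port multiplies the two nibbles of W before storing; the class name `ActiveMemory` and its method names are hypothetical and only mirror the EXTWR and EXTRD mnemonics.

```python
class ActiveMemory:
    """256 x 8 memory that computes W[7:4] * W[3:0] on every write."""

    def __init__(self):
        self.cells = [0] * 256

    def extwr(self, addr, w):
        # The product of two 4-bit values always fits in 8 bits (max 15 * 15).
        high, low = (w >> 4) & 0xF, w & 0xF
        self.cells[addr & 0xFF] = high * low

    def extrd(self, addr):
        return self.cells[addr & 0xFF]


pam = ActiveMemory()
pam.extwr(0xC5, 0xC5)      # F holds C5h, so the address is C5h; W = C5h
w_after_read = pam.extrd(0xC5)
```

Here `w_after_read` equals 3Ch (12 x 5 = 60), the value the paper reports in W after EXTRD.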
[Fig. 8 contents: schematic labels including RegInst, K, Decode, ALU opcode, B opcode, RESET and MRST]
Fig. 8. The UMASScore schematic.
Fig. 9. The block diagram of the PAM module.
² The reason for choosing this device, as explained later, is the large number of I/O pins that it provides, 173.
IV. IMPLEMENTATION AND TIMING ANALYSIS
The VHDL description of the Expansion module is given in annex V.
the code. The most important problems faced at this stage were:
• Inferring the 128 x 8-bit registers as a block RAM in the device: The coding style used produced a LUT implementation of the memory, so the XST manual [11] was used to infer the use of block memory. The final type of memory coded was a single-port read-first memory, so an active enable signal reads the memory.
• Initializing and synchronizing the program counter: Though simple in idea, making the program counter start at 1FFFh and fetch the next instruction forced a modification in the time at which the next instruction was being fetched.
• When executing conditional branches the jump was never taken: This problem arose because for conditional branches the zero flag has to be set, and we were taking the branch decision at phase Q3 by looking at the zero flag in the status register, which is written at phase Q4. This problem was solved by looking at the zero output of the ALU.
• When executing unconditional branches there was a mismatch between the program counter and the executing instruction: This problem was introduced by the modification done to the fetch unit, and it was solved by setting the PC to the address previous to the destination, that is, address - 1. With this
[Fig. 9 contents: PAM, Programmable Active Memory, 256 x 8; signals Addr<7:0>, Wnext<7:0>, W<7:0>, clock, ext_rd and ext_wr]
The interface of this module is shown below; in general this module can be used as an extended port, inside the FPGA, to communicate data with other processes in the FPGA. See figure 9.
After writing the VHDL code that implements the UMASScore, we compile it and synthesize it for the Spartan3 device xc3s200². At this point, before placement and routing, functional simulations are done to correct misbehaviors and incorrect specifications in
[Fig. 10 contents: UMASScore block with signals Clock, MRST, PORTA<7:0> and PORTB<7:0>]
Fig. 10. The block diagram of the UMASScore.
V. CAD REPORTS
From the different steps of the CAD design process, only the most important results of the generated report files are shown here.
A. From Synthesis
Synthesizing Unit <Pic[illegible]>.
    Found 256x8-bit single-port block RAM for signal <PAM>.
    | mode          | read-first                       |      |
    | aspect ratio  | 256-word x 8-bit                 |      |
    | clock         | connected to signal <clock>      | rise |
    | enable        | connected to signal <ENABLE>     | high |
    | write enable  | connected to signal <PAMWR>      | high |
    | address       | connected to signal <PAMAddr>    |      |
    | data in       | connected to signal <Store>      |      |
    | data out      | connected to signal <EXT_Dout>   |      |
    | ram_style     | Auto                             |      |
    Found 4x4-bit multiplier for signal <Store>.
and Q2 there was a middle ground where no phase was active. This was activating incompletely specified control signals, which were corrected, and some small rescheduling of non-critical-path operations was done so that the four phased clocks were as balanced as possible.
At this step it is important to notice that the memories for the RAM and PAM modules were inferred correctly.
To get accurate reports on area, power and speed, the extracted signals used in simulation are commented out. Thus the UMASScore is implemented without the additional logic and I/O pins used for verification purposes. Its interface is shown in the figure below. See figure 10.
• With all the bugs fixed, the program shown in the ROM module is tested in the UMASScore, and its simulation after placement and routing is shown in the following 6 figures. Some glitches can still be observed on the ALU unit as this unit is not synchronized with the clock; that is, the ALU is asynchronous and its outputs are affected by any change in its inputs. The glitches in the ALU unit do not produce any misbehavior, as the correct result is already computed when the synchronous part of the UMASScore requires it, in phase Q3.
After solving these problems the functional description of the UMASScore was correct, and the process of simulating the design after placement and routing began. The principal problems encountered at this point were:
• The program counter was stalled at zero and the phased clock was stuck in Q1: This problem was solved by looking at the synthesis report. The longest path found in synthesis was from MRST to an internal node, and we were holding the MRST signal for less than one clock cycle, so we increased the hold time of MRST to 4 clock cycles, which is the time taken to execute one instruction.
• I/O pin limitation on the FPGA: Simulation after placement and routing is simulation from the netlist; at this level all the signals are encapsulated in a black box and the design is observable only from its input/output pins. At this point checking the correctness of the UMASScore from its outputs only was not helpful, so we routed internal signals to the outputs to observe the execution of the instructions. The latter produced an increase of I/O pins, from 18 pins (1 bit clock, 1 bit MRST, 8 bits PORTA and 8 bits PORTB) to 220 pins for complete visibility of the control and intermediate results on the datapath. This increase in I/O pin requirements to verify the correct behavior of the UMASScore led us to change from the initial Spartan2 device xc2s50 to the current Spartan3 xc3s200. As the xc3s200 has 173 I/O pins, only the most important signals were extracted from the UMASScore as outputs; some other signals were derived from these outputs in the testbench given in annex VI.
• False triggering of control signals: Once the placed and routed signals were observable during simulation, there were many mismatches between the expected results and the observed results. Most of these mismatches were due to control signals triggered out of their scheduled phase. Looking in detail at the simulation, we found that the clock signals were unbalanced, and between phase Q1
modification the fetch unit was able to force a NOP at address - 1, while decoding the instruction stored in address.
• Resetting all the registers: As the clock unit is stopped at reset and forced to phase Q1, resetting the registers at phases Q2, Q3 and Q4 with the MRST signal was impossible. To overcome this problem an internal RESET signal was created; this signal propagates once with a NOP instruction when the MRST signal is released.
Synthesizing Unit <PicRAM>.
    Found 128x8-bit single-port block RAM for signal <RAM>.
    | mode          | read-first                       |      |
    | aspect ratio  | 128-word x 8-bit                 |      |
    | clock         | connected to signal <clock>      | rise |
    | enable        | connected to signal <Enable>     | high |
    | write enable  | connected to signal <RAMWR>      | high |
    | address       | connected to signal <RamAddr>    |      |
    | data in       | connected to signal <DataIn>     |      |
    | data out      | connected to signal <DataOut>    |      |
    | ram_style     | Auto                             |      |
Device utilization summary:
Selected Device : 3s200ft256-5
 Number of Slices:             [?]68 out of 1920   [?]9%
 Number of Slice Flip Flops:   373 out of 3840     9%
 Number of 4 input LUTs:       1015 out of 3840    [illegible]
 Number of bonded IOBs:        17 out of 173       9%
 Number of BRAMs:              [illegible] out of 12   16%
 Number of MULT18X18s:         1 out of 12         [illegible]
 Number of GCLKs:              1 out of [illegible]
The adders in the ALU module were inferred, as well as the decoder for the bitwise operations.
It is important to notice that the device utilization varies when:
Synthesizing Unit <PicALU>.
    Found 8-bit adder for signal <$n0000> created at line 61.
    Found 8-bit adder for signal <$n0006> created at line 61.
    Found 8-bit adder carry out for signal <$n0032> created at line 61.
    Found 8-bit xor2 for signal <$n0048> created at line 78.
    Found 1-of-8 decoder for signal <BitMask>.
And the finite state machines for the instruction set, the adder of the program counter and the multiplexer controls were implemented.
Synthesizing Unit <piccpu>.
    Using one-hot encoding for signal <opcode>.
    Using one-hot encoding for signal <StallOpcode>.
    Found 13-bit adder for signal <AddrPreFetch>.
    Found 74 1-bit 2-to-1 multiplexers.
    inferred 349 D-type flip-flop(s).
    inferred 4 Adder/Subtracter(s).
    inferred 3 Comparator(s).
    inferred 87 Multiplexer(s).
•
•
Not all the instruetions are implemented: \Vhen a
instmetion is not used the eorresponding state on
the FSM beeomes not reaehable, for example ifthe
CALL instruetion is never used not only the state is
drop but also the eireuit eontrolled by the GOTO
state is minimize on the synthesis proeess. This
prunning reduces the number of sliees used in the
deviee.
The number of instruetions inerease or deerease:
\Ve've seen that the UMASSeore is capable of
storing up to 8K instruetions, eaeh of 14 bits. As
the ROM memory is implemented with LUTs
redueing or ¡nereasing the number of instruetions
well result in more or less deviee utilization.
And these resourees for the Spartan3 xe3s2000
represent a deviee utilization of:
The big amount of flip flops inferred is to implement
the registers of the control and data signals in the CPU
module. After synthesis the total amount of resourees
used by the UMASSeore mieroeontroller are:
Design Statistics
# IOs                               : 18

Macro Statistics
# RAM
  128x8-bit single-port block RAM   : 1
  256x8-bit single-port block RAM   : 1
# Registers                         : 110
  1-bit register
  11-bit register
  13-bit register
  14-bit register
  3-bit register
  4-bit register
  5-bit register
  8-bit register
# Multiplexers                      : 40
  13-bit 8-to-1 multiplexer         : 1
  2-to-1 multiplexer                : 39
# Decoders
  1-of-8 decoder
# Adders/Subtractors
  11-bit subtractor
  13-bit adder
  8-bit adder
  8-bit adder carry out
# Multipliers
  4x4-bit multiplier
# Comparators                       : 3
  4-bit comparator greater
  5-bit comparator greatequal
  8-bit comparator less
# Xors
  1-bit xor3

B. From Placement and Routing

Just as synthesis gives us the device utilization, the most
important information after placement and routing is the timing
analyzer report. This is shown below:

Release 6.3.03i - Timing Analyzer G.38
Copyright (c) 1995-2004 Xilinx, Inc. All rights reserved.
Device,speed: xc3s200,-5
Physical constraint file:
C:\Xilinx\My_Designs\PiccpuPR\piccpu.pcf

Timing constraint: Default period analysis for net "Q4"
26766 items analyzed, 0 timing errors detected. (0 setup
errors, 0 hold errors)
Minimum period is 14.337ns.

Delay:              14.337ns (data path - clock path skew)
Source:             RegInst_0 (FF)
Destination:        BufferPortA_1 (FF)
Data Path Delay:    13.820ns (Levels of Logic = 6)
Clock Path Skew:    -0.517ns
Source Clock:       Q4 rising
Destination Clock:  Q4 rising
Clock Uncertainty:  0.000ns

Data Path: RegInst_0 to BufferPortA_1
Delay type         Delay(ns)  Logical Resource(s)
Tcko               0.626      RegInst_0
net (fanout=8)     2.353      RegInst<0>
Tilo               0.529      Ker28112_SW0
net (fanout=1)     0.343      Ker28112_SW0/O
Tilo               0.529      Ker28112
net (fanout=3)     0.690      N28114
Tilo               0.529      _n2630
net (fanout=16)    1.574      CHOICE4025
Tilo               0.529      RamAddr<2>1
net (fanout=11)    1.515      RamAddr<2>
Tilo               0.529      Ker296221
net (fanout=16)    1.697      N29624
Tilo               0.529      _n0669
net (fanout=1)     1.324      _n0669
Tceck              0.524      BufferPortA_1
Total              13.820ns (31.3% logic, 68.7% route)

All constraints were met.

Timing summary:
Timing errors: 0  Score: 0
Constraints cover 28727 paths, 0 nets, and 4375 connections

Design statistics:
Minimum period: 14.337ns (Maximum frequency: 69.750MHz)
Minimum input required time before clock: 15.571ns

From the timing analyzer report, the maximum frequency of the
UMASScore is 69.75 MHz, though we were only able to run the post
placement and routing simulation at a frequency of 62.5 MHz (clock
period of 16 ns). At higher frequencies some registers started
getting unknown values, though the simulation at the end of phase
Q4 was still valid.

This timing analysis is more accurate than the one obtained in
synthesis, where a maximum frequency of 106 MHz was found.

Timing Summary: (From synthesis, before P&R)
Speed Grade: -5
Minimum period: 9.346ns (Maximum Frequency: 106.998MHz)
Minimum input arrival time before clock: 13.189ns

Another important result is that the critical path is given by
phase Q4, with a minimum period of 14.337 ns, and the next critical
path comes from phase Q1, with a minimum period of 9.670 ns. This is
important because we thought at the beginning of the design that the
critical path would be given by the ALU computation in phases Q2 or
Q3. There are two reasons for this:

• The paths on phases Q4 and Q1 have to determine lots of signals,
  i.e. decode the instruction and select the control signals to
  write to memory. This means that the clocks of these two phases,
  Q4 and Q1, have more load than Q3 and Q2, as shown in the
  following report:

  Clock Signal                | Clock buffer(FF name)   | Load |
  Q3:O                        | NONE                    | 8    |
  Q1:O                        | NONE                    | 164  |
  Q2:O                        | NONE                    | 32   |
  Q4:O                        | NONE                    | 149  |
  EXT_WR:O                    | NONE                    | 8    |
  PAM_ENABLE(PAM__n00011:O)   | NONE(*)(PAM_PAMAddr_7)  | 8    |
  clock                       | BUFGP                   | 6    |

• The ALU implemented in our design is asynchronous, so it is
  always computing the selected operation when the values at its
  inputs change. This means that the computation of the right value
  can be split between phase Q2 and the moment in phase Q3 where
  its output is needed.

C. From Power Analyzer

To be able to run the power analyzer, files with extension NCD and
VCD for our design were needed. The VCD files were obtained by
checking the option to write an output VCD file on the post P&R
simulation, and the NCD files were obtained from the Map and Place &
Route property menu.

Setting the power analyzer to a confidence level named reasonable,
the average power consumption of the UMASScore is 28 mW‡. This
result is shown in the following fragment of the power report file:

Release 6.3.03i - XPower SoftwareVersion:G.38
Copyright (c) 1995-2004 Xilinx, Inc. All rights reserved.
Design: piccpu
Preferences: C:\Xilinx\My_Designs\PiccpuPR\piccpu.pcf
VCD File: C:\Xilinx\My_Designs\PiccpuPR\piccpu.vcd
Part: 3s200ft256-5
Data version: ADVANCED,v1.0,11-03-03

Power summary:                        I(mA)   P(mW)
Total estimated power consumption:            28
Vccint 1.20V:                         3       3
Vccaux 2.50V:                         10      25
Vcco25 2.50V:                         0       0
Quiescent Vccaux 2.50V:               10      25

Thermal summary:
Estimated junction temperature: 26C
Ambient temp: 25C
Case temp: 26C

‡ This power dissipation is really small and it is probably due to
the low toggle rate of signals for the tested program; it will be of
interest to see how this power varies if the I/O ports are used
frequently.

VI. COMPARISON BETWEEN THE UMASSCORE AND THE PIC16F84 MICROCHIP
MICROCONTROLLER

The UMASScore device is compared against the PIC16F84 and
PIC16F877** in frequency, power dissipation, program memory,
flexibility and price. Even though initially we had the objective of
comparing area, at this point we realize that it really does not
make much sense to compare the number of slices or gate count used
on the FPGA against the PIC device; so we take into consideration
the area, or percentage of utilization, to reduce the price of the
FPGA. It can be argued that this is an artificial price and that it
does not correspond to the market, but at least this approach gives
us an estimate of how many resources are left on the FPGA for adding
further functionality.

** The PIC16F877 is similar to the PIC16F84 and supports the same
instruction set. It has a bigger program memory and some specialized
registers for UART communication.

In the following table the values shown for the PIC16F84 and
PIC16F877 are taken from their datasheet [7], and the values shown
for the UMASScore are taken from the report analysis.
TABLE V
                            XC3S200-4VQ100C   PIC16F84          PIC16F877
                            UMASScore         Microchip         Microchip
Max. frequency              69.75 MHz         10 MHz            10 MHz
Power dissipation           28 mW             800 mW            1 W
Memory ROM                  8K instructions   1K instructions   8K instructions
Instruction customization   PAM memory        None              None
Percentage of utilization   29%               --                --
Device price                13.45$            4.39$             5.11$
Utilization price           3.9$              4.39$             5.11$
This table shows that the UMASScore device achieves a speed-up of
6.9X against the PIC16F84 for a similar price††. The UMASScore also
has a degree of flexibility that none of the PIC devices have, as
customized hardware can be added inside the FPGA to perform
specialized functions that are not provided by the instruction set
of the microcontrollers.
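The derived figures quoted around Table V can be checked with a few lines of arithmetic. The following Python sketch is not part of the original design flow; it only recomputes the maximum frequency from the minimum period reported by the timing analyzer, the speed-up against the PIC16F84, the utilization price and the power ratio. All names are hypothetical, and the PIC16F84 power dissipation is assumed here to read 800 mW in Table V.

```python
# Sketch: recompute the derived figures of Table V from the reported values.
# All names are illustrative; the numbers are the ones quoted in the text.

def max_frequency_mhz(min_period_ns: float) -> float:
    """Maximum clock frequency in MHz for a minimum period given in ns."""
    return 1000.0 / min_period_ns

MIN_PERIOD_NS = 14.337       # minimum period after placement and routing
PIC16F84_FREQ_MHZ = 10.0     # PIC16F84 maximum frequency (Table V)
FPGA_PRICE_USD = 13.45       # XC3S200 device price (Table V)
UTILIZATION = 0.29           # fraction of the device used by the UMASScore
UMASSCORE_POWER_MW = 28.0    # average power from the power report
PIC16F84_POWER_MW = 800.0    # assumed reading of the Table V entry

fmax = max_frequency_mhz(MIN_PERIOD_NS)
speedup = fmax / PIC16F84_FREQ_MHZ
utilization_price = FPGA_PRICE_USD * UTILIZATION
power_ratio = PIC16F84_POWER_MW / UMASSCORE_POWER_MW

print(f"Maximum frequency:    {fmax:.2f} MHz")
print(f"Speed-up vs PIC16F84: {speedup:.2f}")
print(f"Utilization price:    {utilization_price:.2f} $")
print(f"Power ratio:          {power_ratio:.1f}")
```

Under these assumptions the sketch gives values close to 69.75 MHz, a speed-up just under 7, a utilization price of about 3.9 $ and a power ratio between 28 and 29, which is consistent with the truncated factors of 6.9 and 28 used in the text.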
VII. CONCLUSIONS
The more advanced VLSI process of the FPGA technology plays a key
role in speed and power. This gives an advantage to the soft-core
version over the real microcontroller. The price, per percentage of
silicon used, is also lower for the soft-core version. In this case,
the FPGA soft-core version of the microcontroller outperforms the
microcontroller in speed by a factor of 6.9, and in power by a
factor of 28, for roughly the same price.
The design of the soft-core has been verified by executing the whole
instruction set with post placement and routing simulations. From
the synthesis process we have observed that the soft-core
implementation of a microcontroller saves space when there are
unused resources, as these unused resources are found to be
unreachable or never used by the synthesizer and are ripped out by
the optimization tool. Looking at the report files generated from
placement and routing we have seen that, for the UMASScore, the
critical path lies in the decoding of the instruction to execute and
not in the arithmetic operation to be performed, as we initially
thought.

†† This price can be misleading, as the 29% of device utilization
corresponds to a program of 389 lines; a fairer comparison would
synthesize the FPGA with the ROM fully utilized, 1K or 8K
instructions, and use that as the percentage of utilization. This
30% of device utilization will, however, be valid up to 2K
instructions if the remaining block memories of the device are used
to implement the ROM.
Some modifications were made to the initial UMASScore code in order
for it to work properly after placement and routing, as interconnect
delays and loads were not taken into account in the first behavioral
simulation. The changes consisted merely of retiming certain
operations or fully specifying multiplexers and decoders to avoid
glitches on control signals. When proper simulation of the post
place and route model was achieved, the CAD processes were redone
for the UMASScore, commenting out all the signals that were placed
as outputs during the verification analysis.
The verification of the UMASScore let us understand the complexity
of this process and its importance, as most of the time in the
design process was devoted to verification. This also shows the
importance of the JTAG port for testing: if the design is placed on
a printed board we will not be able to extract our internal signals
to verify its internal execution as we did here. The process of
verification gave us a practical example of Rent's rule, in which
the design without the predefined port interface exploded the I/O
pin requirements from 18 pins to 220 pins.
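Rent's rule relates the number of external terminals T of a block of logic to the number of internal blocks g as T = t * g**p, where t is the average number of terminals per block and p is the Rent exponent. The Python sketch below is only an illustration of that relation: the block count (500) and the terminals per block (4) are assumptions that do not come from this design, and only the pin counts 18 and 220 are taken from the text.

```python
import math

def rent_terminals(t: float, g: int, p: float) -> float:
    """Rent's rule: external terminals T = t * g**p."""
    return t * g ** p

def rent_exponent(terminals: float, t: float, g: int) -> float:
    """Exponent p implied by T = t * g**p for given T, t and g."""
    return math.log(terminals / t) / math.log(g)

# Hypothetical parameters, chosen only for illustration.
BLOCKS = 500               # assumed number of logic blocks
TERMINALS_PER_BLOCK = 4.0  # assumed average terminals per block

# Pin counts quoted in the text.
PINS_WITH_INTERFACE = 18
PINS_EXPOSED = 220

p_interface = rent_exponent(PINS_WITH_INTERFACE, TERMINALS_PER_BLOCK, BLOCKS)
p_exposed = rent_exponent(PINS_EXPOSED, TERMINALS_PER_BLOCK, BLOCKS)

print(f"Implied exponent with the port interface: {p_interface:.2f}")
print(f"Implied exponent with exposed signals:    {p_exposed:.2f}")
```

With any fixed choice of t and g, the implied exponent is larger for the 220-pin version than for the 18-pin version, which is one way to read the remark that removing the predefined port interface lets the pin count grow with the internal logic.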
From here we can see that the gain in speed and power of the
soft-core version comes with a more complex design process and
longer verification times. This detriment can be neglected if we use
IP soft-cores for the microcontrollers, but of course at an
additional cost. If the FPGA manufacturers release their soft-core
IPs, the market of embedded systems will drift to FPGA devices.
ACKNOWLEDGMENTS
I would like to thank professors Guy Gogniat and Russell Tessier for
their valuable comments and corrections.
REFERENCES
[1] Actel Inc, The core8051, http://www.actel.com.
[2] Altera Corporation, Nios II Device,
http://www.altera.com/products/ip/processors/nios2/
[3] Gartner Dataquest rankings, 2002 Microcontroller Market Share
and Unit Shipments, http://www3.gartner.com/
[4] I. Zafra, VHDL implementation of the microcontrollers PIC-16/17,
University of Sevilla, Jun 2000.
[5] J. Clayton, The PIC16F84 in Verilog,
http://www.opencores.org/projects.cgi/web/risc16f84/
[6] Microchip Company, Microchip Technology Jumps to Number One in
Worldwide 8-bit Microcontroller Shipments,
http://www.microchip.com/stellent/, press release, Jul 2004.
[7] Microchip Company, PIC16F8X Datasheet, http://www.microchip.com
[8] S. Morioka, VHDL implementation of the PIC16F84 in FPGA,
Transistor Gijutsu Magazine, Dec 1999,
http://www02.so-net.ne.jp/~morioka/cqpic.htm
[9] T. Coonan, The synthetic PIC, 1999,
http://www.mindspring.com/~tcoonan/synthpic.html
[10] Xilinx Inc, Processor central,
http://www.xilinx.com/products/design_resources/
[11] Xilinx Inc, XST manual, http://www.xilinx.com/
[12] Xilinx Inc, Application notes: xapp463 and xapp464 - Spartan3
FPGA family, http://www.xilinx.com/, Jul 2003.
[13] P. Bertin, D. Roncin and J. Vuillemin, Programmable Active
Memories: a Performance Assessment, Digital Equipment Corporation
Paris Research Lab, 1993
