High-Level Microprogramming: An Optimising C Compiler for a Processing Element of a CAD Accelerator by Kenyon, Paul et al.
1990
Paul Kenyon
University of Nebraska - Lincoln
Prathima Agrawal
AT&T Bell Laboratories
Sharad Seth
University of Nebraska - Lincoln, seth@cse.unl.edu
Kenyon, Paul; Agrawal, Prathima; and Seth, Sharad, "High-Level Microprogramming: An Optimising C Compiler for a Processing
Element of a CAD Accelerator" (1990). CSE Journal Articles. 64.
http://digitalcommons.unl.edu/csearticles/64
Paul Kenyon, University of Nebraska - Lincoln, Lincoln, Nebraska
Prathima Agrawal, AT&T Bell Laboratories, Murray Hill, New Jersey
Sharad Seth, University of Nebraska - Lincoln, Lincoln, Nebraska
Abstract: The development of a high-level language
compiler for a micro-programmable processing
element (PE) in the MARS multicomputer is described.
MARS, an MIMD message passing machine,
was designed to speed up VLSI CAD and other
similar non-numerical applications. The need for support
of a high-level language at the PE level of a multicomputer
is considered, and the choice of C as an
appropriate programming language is justified. Spe- 
cial features found in VLSI processors are examined 
along with compiler support for them. 
Conventional retargetable compiler techniques are 
shown to be inadequate for the highly concurrent 
micro-programmable PE. These techniques must be 
extended for microcode generation. The design of 
the MARS compiler is outlined. Performance data 
is provided to evaluate the benefit of various com- 
piler optimisations, and to compare compiler gener- 
ated microcode to hand generated microcode in terms 
of space and time performance.
Keywords: Microcode compiler, Code Generation,
Front-End DAG Compiler, Hand vs. Compiled Microcode,
Performance Data, Space/Time Overhead,
Hardware Accelerator, Programming Environment 
for CAD 
1 Introduction 
The long-term goal of our research is to demonstrate 
the effectiveness of high-level language support for 
a microprogrammable multiprocessor designed to ac- 
celerate computer aided design tools for VLSI. This 
research involves the development of a C compiler 
for a processing element (PE), a C++ compiler for a 
cluster of PEs, and source-code level simulation and
debugging tools. 
This paper describes the first stage of that research, 
the development of a C language compiler for the 
processing element of the MARS multicomputer [6]. 
The first step in supporting a high-level language 
on an MIMD (Multiple Instruction Stream, Multi- 
ple Data Stream) message passing machine such as 
MARS is to be able to generate efficient, accurate 
code for each PE of the multicomputer. Good code 
generation at the PE level is needed before any work 
in compilation for the entire multicomputer can be 
attempted. The PE compiler can be used directly 
by a programmer who partitions a problem on to the 
processors, allocates the communication usage, and 
balances the computational load. Or, it can be used 
by a higher-level multicomputer compiler that performs
these tasks and generates output code for compilation
onto the PEs.
While our work deals with MARS, only a small,
easily-identifiable part of our implementation is specific
to its architecture. Thus, our method has wider
applicability to other microprogrammable processors
and can be the basis for development of a retargetable
compiler.
The problem of high-level language support for
multicomputers is one of the major limitations in applying
the computational power of these machines to new
applications. Two major developments, which have
been driven by VLSI technology, lend urgency to this
task: (a) New parallel architectures have proliferated
and are increasingly being used by programmers who
are not intimately familiar with their low-level
architectural details. (b) The design of specialised VLSI
multiprocessors to accelerate computationally intensive
tasks is becoming more common. The problem
of software support for these machines will continue
to grow.
There is much reported work on high-level micropro- 
gramming and retargetable compilers (see Section 4). 
The focus of the past work, however, is markedly dif- 
ferent from ours. We are concerned with an envi- 
ronment in which a variety of applications are imple- 
mented by a programmer who is far removed from 
the designers of the microprogrammable PE. By contrast,
earlier work has dealt with firmware development
carried out in the context of a processor design.
0194-1895/90/0000/0097/$01.00 © IEEE
Proceedings of the 23rd Annual Workshop and Symposium., Workshop on Microprogramming and Microarchitecture. Micro 23. 
doi: 10.1109/MICRO.1990.151431 
2 MARS Background 
MARS, a microprogrammable accelerator for rapid
simulation, is a message-passing multicomputer with
micro-programmable PEs [5]. MARS was developed
to accelerate logic simulation of digital circuits, but
its generality and programmability allow it to handle
a wide variety of problems. The currently implemented
MARS applications include logic simulation
[4], fault simulation [SI, and speech recognition [11].
The architectural features in MARS lend themselves
to non-numeric applications such as graph search.
MARS is a message-passing multicomputer with parallelism
at three levels. At the highest level MARS is
proposed as a hypercube network of processor clusters.
The currently implemented hardware represents
one of these clusters, and is physically a plug-in VME
board. The cluster level of MARS is a collection of 16
PEs and a house-keeper processor. The PEs are
microprogrammable VLSI custom processors, each with
its own local data memory which is accessible only by
it and the house-keeper processor.
MARS Cluster Architecture 
The PEs within a cluster communicate with one another
via a message passing network which is connected
by a 16 x 16 full crossbar. A PE can address
the destination of its transmitted messages, transmit
a message into the crossbar, and receive messages
from the crossbar. All three of these actions are
controlled dynamically by the PE’s program. The crossbar
handles 16-bit messages from PE to PE with only
one clock cycle latency. The crossbar messages are
buffered at transmission and reception with an
eight-message deep total buffer size per PE. The channel
control logic includes a channel hold which provides a
blocking non-interruptible channel between two PEs.
The house-keeper processor performs I/O and support
functions for the cluster. For example, the
house-keeper may perform data transfers from one
PE data memory to another, or between a PE data
memory and disk storage. The house-keeper also
handles cluster-wide control such as loading PE programs,
starting and stopping PEs (individually or en
masse), and receiving and responding to PE level
interrupts. The CPU of the host Sun work-station acts
as house-keeper in the VME board implementation.
Figure 1 shows a block diagram of the MARS cluster 
architecture. 
Figure 1: System and Cluster Architecture (system level: 256-node 8-cube; cluster level: PEs with local memories, e.g. PE14 and MEM14)
MARS PE Architecture 
The MARS PE has a parallel architecture and a horizontal
(64 bit), writable microcontrol store. The parallel
micro-architecture includes several special features.
Among these is the ability to perform arithmetic
and logical operations quickly on variably offset
bit fields of size 1, 2, 4, and 8 bits using a field
operation unit (FOU) which operates independently of
the 16-bit address arithmetic unit (AAU). An example
bit field operation would be to take bit positions
0 and 1 from register number 16 and bits 7 and 8
from register number 17, add the two pairs as two-bit
integers, and place the result in bit positions 13 and
14 of register number 18. When properly configured
this can be performed by the FOU in one cycle.
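The bit-field addition described above can be modelled in portable C. This is only a software sketch of the behaviour, not the FOU itself: the function names are hypothetical, and the wrap-around of the sum to the field width is an assumption, since the text does not say how overflow is handled.

```c
#include <assert.h>
#include <stdint.h>

/* Extract a field of `width` bits (1, 2, 4 or 8) starting at bit `pos`. */
uint16_t field_get(uint16_t reg, unsigned pos, unsigned width)
{
    assert(width == 1 || width == 2 || width == 4 || width == 8);
    assert(pos + width <= 16);
    return (uint16_t)((reg >> pos) & ((1u << width) - 1u));
}

/* Return `reg` with the field at `pos` replaced by the low bits of `value`. */
uint16_t field_put(uint16_t reg, unsigned pos, unsigned width, uint16_t value)
{
    uint32_t mask = (((uint32_t)1 << width) - 1u) << pos;
    return (uint16_t)((reg & ~mask) | (((uint32_t)value << pos) & mask));
}

/* dst field = a field + b field, truncated to `width` bits (assumed). */
uint16_t fou_add(uint16_t dst, unsigned dpos,
                 uint16_t a, unsigned apos,
                 uint16_t b, unsigned bpos, unsigned width)
{
    uint16_t sum = (uint16_t)(field_get(a, apos, width) +
                              field_get(b, bpos, width));
    return field_put(dst, dpos, width, sum);
}
```

With register 16 holding 0x0002 and register 17 holding 0x0080, `fou_add(0, 13, 0x0002, 0, 0x0080, 7, 2)` adds the two-bit values 2 and 1 and places 3 in bits 13 and 14 of the result. What takes several shifts and masks here is a single FOU cycle on the PE.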
A second special feature of the PE is reading and 
writing bit fields to and from a variable aspect ratio 
memory. The hardware supports memory access as 
if the memory were 1, 2, 4, 8, or 16 bits wide. Such 
an access capability accelerates operations on tables 
of packed bit fields. 
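A software model of variable aspect ratio access is sketched below. The function names are hypothetical, and the packing order (element 0 in the low-order bits of word 0) is an assumption; the text does not specify how the hardware orders fields within a word.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static int width_ok(unsigned w)
{
    return w == 1 || w == 2 || w == 4 || w == 8 || w == 16;
}

/* Read element `index` of `mem` viewed as an array of `width`-bit cells. */
uint16_t var_read(const uint16_t *mem, unsigned width, size_t index)
{
    assert(width_ok(width));
    size_t per_word = 16u / width;                 /* cells per 16-bit word */
    unsigned shift = (unsigned)(index % per_word) * width;
    uint32_t mask = ((uint32_t)1 << width) - 1u;
    return (uint16_t)((mem[index / per_word] >> shift) & mask);
}

/* Write the low `width` bits of `value` into element `index`. */
void var_write(uint16_t *mem, unsigned width, size_t index, uint16_t value)
{
    assert(width_ok(width));
    size_t per_word = 16u / width;
    unsigned shift = (unsigned)(index % per_word) * width;
    uint32_t mask = (((uint32_t)1 << width) - 1u) << shift;
    uint16_t *word = &mem[index / per_word];
    *word = (uint16_t)((*word & ~mask) | (((uint32_t)value << shift) & mask));
}
```

The same storage can then be read at different widths: a 4-bit entry written at index 5 lands in the second word and is also visible through the 8-bit and 16-bit views of that word.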
The PE also provides special hardware to transmit 
and receive words directly over the message passing 
network. The hardware makes it possible to have 
two control flow threads active in the PE at the same 
time. This is managed by hardware which generates 
a trap when the input buffer is not empty or when the 
output buffer is not full. This trap switches control to 
a second thread which can process the data transfer 
to or from the crossbar and then return control to the 
main control flow thread. 
The writable control store was provided to make 
MARS a versatile machine instead of a dedicated 
hardware simulator. The store is 64 bits wide with
only 64 entries in the current implementation and can 
be read and written by the house-keeper. However, 
writing to the control store causes a hardware reset 
on the PE, which clears its state and register files. 
Programmability was included at the microcode level 
to achieve high-speed processing as a pipeline stage in 
simulation problems. This horizontal microcode, the
multiple buses, and the availability of parallel micro- 
architecture hardware give the PE the ability to per- 
form up to five operations per clock cycle under the 
control of the programmer [5]. 
MARS Software
The software for the MARS project has been developed
in several stages. Prior to development of the
current compiler, there existed several tools to support
micro-programming. First, a functional simulator
was developed before the production of the
hardware and is used for debugging programs. It
simulates the 16 PEs and the message passing network
in a cluster at the clock phase level. Second, a
micro-assembler for translating symbolic microprograms
into microcode files was written. Micro-assembler
programming is supported by a macro facility along
with a library of macros for simple operations. Third,
the House-keeper/PE interface has been supported by
a library of system routines for accessing and controlling
the MARS VME board as a device. The routines
provide a means of transferring data and programs
between the host work-station process and the PE
memories. Together these tools provided a basic
environment for developing several applications,
implemented in microcode, which currently run on MARS.
The compiler work for MARS is being performed in
three stages. The first stage involves the development
of a compiler which inputs the C language and
generates microcode. The remainder of this paper
describes the design and implementation of this PE
level compiler. The second stage of the research will
be the design and implementation of a cluster-level
compiler which maps an algorithm onto the set of 16
PEs. The third stage of the compiler work will be to
explore data parallel methods of mapping algorithms
onto multiple clusters at the system level. See Figure 2.
3 PE Compiler Requirements 
The VLSI processors used in multicomputers often
include specialised hardware features. These features
are included to support operations that commonly occur
in the problem domain being targeted, to enhance
the processor’s speed performance, and to support
interaction of the PEs in the multicomputer. For example,
the MARS PE provides special hardware for
bit-field operations, for accessing data memory with
a variable aspect ratio, and for interrupting the
processor flow of control to handle message traffic.
Figure 2: MARS Software Overview (labels shown: Parallel Algorithm, High Level Programming, Parallel Language Program, Microcode Compression, Horizontal Micro-Assembler, Micro-Assembler, Microcode File)
The challenge of the PE level compiler is to represent 
the features of the hardware in an efficient and easy- 
to-use manner. The MARS compiler represents the 
specialised features of the PE hardware to the PE 
level programmer and/or overlaying tools in a way 
that is independent of the machine. The PE is supported
in a machine-independent manner, in order
to reduce the attention the programmer must pay to 
low-level architecture and to abstract hardware fea- 
tures, thus reducing the complexity of cluster-level 
compiler code generation. This abstraction is also 
important for portability of programs between revi- 
sions of the hardware, and to different architectures. 
The MARS architecture presents three major chal- 
lenges to writing or generating good microcode. 
1. Object Code Compaction:
The small program memory in the current implementation
makes object code size very important.
Since reducing object code size reduces execution
time [10], object code size compression
is the primary optimisation criterion for the final
microcode.
2. Register Allocation:
MARS provides few registers. There are 8
general-purpose registers, with 24 registers in all.
This, combined with the latency of memory accesses,
makes good register allocation essential
to efficient code generation.
3. Instruction Scheduling:
MARS has a limited instruction set. It has no
complex addressing modes; in fact, the only external
memory addressing mode is register indirect
(from one of two memory address registers)
with an offset of 0 to 7. The compiler must schedule
instructions for memory access to keep pace
with data traveling through internal pipelines.
Language Choice 
The input language of the MARS PE level compiler
is the C programming language. The C language
was chosen for several reasons. Though a single PE
contains parallel hardware, it follows a single flow of
control; a sequential language is sufficient to represent
this control flow. The C language is capable of directly
expressing all of the operations of the MARS
PE architecture: C bit-fields are used to represent
field operation unit (FOU) bit manipulations on 1, 2,
4, and 8 bit entities; and special library functions are
made available to provide access to the communication
hardware and to interrupt control. C allows for
direct description of bit manipulations and low-level
operations. C is a good language for automatic program
generation by any overlaying cluster-level compiler.
Furthermore, it is already known by the community
of programmers who will program MARS.
More importantly, software written for MARS can
be tested on other C compilers with software simulation
of the MARS library calls. The extensions of
C needed for MARS take the form of special library
function calls which have low-level, high-speed action
on MARS. These function calls can be implemented
on another system to access a software simulator of
the special function, to read and to write from a file,
or to perform other debugging activities. Another
advantage of C is that compiler front-ends for the
language are readily available, which greatly reduces
implementation time. Also, using C makes it easier
to test and debug the PE compiler by comparing
program results against those from other machines.
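As an illustration of this approach, the sketch below combines a C bit-field record with a host-side stand-in for a communication library call. The names `mars_send` and `mars_recv`, their signatures, and the `struct gate` layout are all hypothetical; the paper does not list the actual library interface. The FIFO depth of eight follows the per-PE buffer size given in Section 2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Packed record using field widths the FOU handles (1, 2, 4, 8 bits). */
struct gate {
    unsigned value : 2;   /* current logic value            */
    unsigned type  : 4;   /* gate type code                 */
    unsigned fanin : 2;   /* number of inputs minus one     */
    unsigned delay : 8;   /* propagation delay              */
};

/* Host-side simulation of the message hardware (hypothetical API). */
#define SIM_FIFO_DEPTH 8
static uint16_t sim_fifo[SIM_FIFO_DEPTH];
static unsigned sim_head, sim_count;

int mars_send(uint16_t word)          /* 0 on success, -1 if buffer full */
{
    if (sim_count == SIM_FIFO_DEPTH) return -1;
    sim_fifo[(sim_head + sim_count) % SIM_FIFO_DEPTH] = word;
    sim_count++;
    return 0;
}

int mars_recv(uint16_t *word)         /* 0 on success, -1 if buffer empty */
{
    assert(word != NULL);
    if (sim_count == 0) return -1;
    *word = sim_fifo[sim_head];
    sim_head = (sim_head + 1) % SIM_FIFO_DEPTH;
    sim_count--;
    return 0;
}

/* Application code: identical source would be compiled for the PE. */
int gate_update(struct gate *g, unsigned new_value)
{
    g->value = new_value & 3u;        /* keep only the 2-bit field */
    return mars_send((uint16_t)((g->type << 2) | g->value));
}
```

On a workstation the two library functions are ordinary C; on MARS the same calls would map to the message-passing hardware, so `gate_update` can be debugged on the host without change.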
Compiler Retargeting
A secondary objective of our work is to explore
retargetable microcode generating compilers. While
our research has not attempted to produce a compiler
generator based on a machine description language,
the compiler development has isolated compiler
translation rules that depend upon the target
microarchitecture. (See Section 5 for a complete
description of compiler implementation.)
The first requirement of a retargetable compiler is
a well designed compiler parser and front-end.
Retargetable compiler front-ends generally reduce the
input language to a machine independent intermediate
code. Within the compiler back-end, intermediate
optimisations are performed to customise the
intermediate code to the target architecture. These
optimisation rules are based on the compiler architecture
and are generated by hand in the current implementation.
In a retargetable implementation of
a microcode generating compiler, the optimisations
would be driven by an explicit rule database. This
rule set could then be automatically generated from
an architecture description. Several such compilers
have been constructed for traditional architectures,
and while these compilers are not directly able to support
microcode generation (see Section 4), the methods
of supporting retargeting should be applicable to
microcode generation.
4 Previous Work 
Since the construction of a compiler for the MARS 
PE is by definition a machine-specific problem, only
solutions to slightly similar problems are found in the 
literature. Furthermore, there exist many good bibli- 
ographies of work in the field of compilers for various 
parallel machines. 
Inadequacy of Traditional Compilers 
One approach to applying existing compiler technol- 
ogy to the construction of the PE level compiler was 
to look at currently available retargetable compil- 
ers and consider the feasibility of targeting them to 
MARS. Retargetable compilers have the ability to 
generate assembly code for many different architec- 
tures. Notable examples of these systems are pcc, the 
standard AT&T UNIX portable C compiler [6], and 
gcc, the Free Software Foundation GNU project C 
compiler, which have been modified to generate code 
for many different architectures. 
More recently compilers have been developed which 
are retargetable to a new architecture simply by writ- 
ing detailed descriptions of the architecture instruc- 
tion set [14] [15]. Retargetable compilers provide a 
straightforward method of constructing in a relatively 
short time a compiler for a new architecture. How- 
ever, these compilers make certain assumptions defin- 
ing what the compiler writers consider to be a gen- 
eral purpose computer. Commonly used assumptions 
are a von-Neumann computer architecture, (with pro- 
gram and data memory and one instruction counter) 
and instructions of the form “operation, addresses”. 
The address fields in instructions can at different 
times refer to a constant, a register, an absolute mem- 
ory location, a memory location addressed by a reg- 
ister with optional constant offset, or a memory lo- 
cation addressed by another memory location. Even 
RISC architectures, which reduce complexity of oper- 
ations and addressing modes, still provide addressing 
modes such as register indirect addressing (into mem- 
ory) with an offset. 
The MARS PE, being microprogrammable, cannot be
said to have addressing modes in the traditional sense,
but the following three operations describe the range
of possibilities: loading a constant onto a bus, loading 
the contents of a register onto a bus, and storing the 
contents of a bus into a register. These operations 
are of a much lower level than the high-level address- 
ing modes that retargetable code generators require 
traditional architectures to provide. 
VLIW Architecture Compiler 
Fisher [13] and Ellis [12] describe the design and production
of a compiler for a Very Long Instruction
Word (VLIW) Architecture. Ellis describes trace
scheduling, a technique for code generation for a
VLIW machine. A trace is a path through the flow of
control graph of the program. In trace scheduling,
instructions are reordered and scheduled within a trace
rather than within a basic block because more opportunities
for reordering can be found within a trace. Reordering
the code within a trace involves moving instructions 
across basic block boundaries. This movement causes 
inconsistencies between the code trace and the basic 
blocks adjoining it in the control-flow graph. Correct- 
ing these differences requires additional code in the 
basic blocks that flow into and out of the trace being 
scheduled. 
A second recent compiler technique is described by 
Lam [19]. Lam describes a compiler for a systolic
array of microprogrammable processors which
uses a technique called software pipelining. Software
pipelining considers the code resulting from a
source language function as a directed graph, with
nodes representing operations and edges representing 
flow of control. Software pipelining first schedules 
the instructions within an inner loop of a program. 
The completely scheduled instructions are then col- 
lapsed into a single node and the scheduling contin- 
ues. The scheduling algorithm and node collapsing 
are repeated until the entire graph is one node and 
the function is scheduled. 
High-Level Microprogramming
Hopkins, Horton and Arnold [17] describe a sys- 
tem for high-level microprogramming which is sim- 
ilar to ours but with the aim of producing microcode 
firmware which will execute a higher-level machine 
language on the hardware. Their objectives are to 
support the development of firmware programs across 
evolutionary changes of the underlying hardware and 
to produce tools that can be retargeted to new archi- 
tectures. This is in contrast to the case in MARS, 
where the goal of the compiler is to target applica- 
tion programs directly to microcode. Since the users 
of our compiler are not familiar with the details of its 
microarchitecture, it is important for the compiler to 
hide the low-level details of the architecture from the 
application program. 
Hopkins et al. describe the design of a microcode
producing compiler whose input is a subset of the C
language. A good case is made for writing firmware
programs in C instead of directly in microcode.
Admittedly there is a code-size and run-time overhead in
the use of a high-level language, but it is pointed out
that recoding just 10% of a program in efficient hand
microcode often reduces a 100% run-time overhead to
just 10% (this is sometimes called the 90/10 rule).
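The arithmetic behind this claim can be made explicit with a small worked example. It assumes, purely for illustration, that the hand-coded program runs in time $T$, that the compiled program spends 90% of its run time in the 10% of the code that is recoded, and that recoding brings that portion down to hand-coded speed (a factor of two). None of these figures come from measurements in the paper.

```latex
\begin{align*}
T_{\text{compiled}} &= 2T \quad \text{(100\% overhead)} \\
T_{\text{hot}} &= 0.9 \times 2T = 1.8T, \qquad
T_{\text{cold}} = 0.1 \times 2T = 0.2T \\
T_{\text{after}} &= \tfrac{1}{2}\,T_{\text{hot}} + T_{\text{cold}}
                  = 0.9T + 0.2T = 1.1T \\
\text{overhead} &= \frac{T_{\text{after}} - T}{T} = 10\%
\end{align*}
```

Under these assumptions the remaining 90% of the code keeps its full compiler overhead, yet contributes only a tenth of the hand-coded run time in excess.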
In the case of MARS we find that the 90/10 rule is
doubly valid. First, within one PE’s microprogram,
inner loops can be identified which account for much
of the processing time. Second, across a multiple-PE
program - often organized as a pipeline - one or
two stages may be identified that act as bottlenecks
in the system. If these PE programs are rewritten
in efficient microcode, overall performance will be
improved, and further optimisation of the non-critical
PEs is not beneficial.
5 Implementation 
Figure 3 shows a block diagram of the compiler. The
PE Compiler is organised much like the compiler for
a conventional architecture with a few additions. The
first part of the compilation process (lexical analysis,
symbol table construction, parsing, parse tree generation,
and intermediate code generation) is carried
out by the compiler front-end. This front-end is from
a portable ANSI-standard C compiler [16]. The use
of a previously developed compiler front-end greatly
reduces compiler development work. It also helps to
ensure an accurate implementation of the language.
Figure 3: Compiler Block Diagram (labels shown: Source File, C Preprocessor, Token Stream, DAG Customisation, Horizontal Microcode)
After translating a section of the input program, the
front-end produces a directed acyclic graph (DAG)
representation of the code. The nodes of the DAG
represent operations, and the edges represent data
dependencies. A sequence of these DAGs is the
intermediate code which the front-end passes to the
compiler back-end for microcode generation. Each
sequence of DAGs is a data-dependency graph for a
flow-control-free basic block of code. The ordering of
a DAG sequence represents normal control flow. Labels
which are the targets of a branch may separate
DAG sequences, and a DAG sequence may end with a
conditional or unconditional branch. A function call
may appear as a node in the DAG.
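A minimal sketch of such a DAG representation follows. The structure layout, the operation names other than ADD and CNST, and the fixed-size node pool are illustrative assumptions and not the front-end's actual data structures. The point shown is that a shared subexpression is a single node with more than one reference.

```c
#include <assert.h>
#include <stddef.h>

enum op { OP_CNST, OP_VAR, OP_ADD, OP_ASGN };

struct node {
    enum op op;
    int value;             /* constant value or variable id          */
    struct node *kid[2];   /* operands: the data-dependency edges    */
    int refs;              /* number of parent nodes using this node */
    int mark;              /* visited flag for traversals            */
};

static struct node pool[32];   /* simple fixed pool for the sketch */
static size_t used;

/* Create a node and count a reference on each operand. */
struct node *mk(enum op op, int value, struct node *l, struct node *r)
{
    assert(used < sizeof pool / sizeof pool[0]);
    struct node *n = &pool[used++];
    n->op = op; n->value = value;
    n->kid[0] = l; n->kid[1] = r;
    n->refs = 0; n->mark = 0;
    if (l) l->refs++;
    if (r) r->refs++;
    return n;
}

/* Count the distinct nodes reachable from `n` (shared nodes once). */
size_t dag_size(struct node *n)
{
    if (n == NULL || n->mark) return 0;
    n->mark = 1;
    return 1 + dag_size(n->kid[0]) + dag_size(n->kid[1]);
}
```

Building `(x + 1) + (x + 1)` with a single `x + 1` node gives that node a reference count of 2 and a DAG of four nodes, where a tree would need seven.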
1: f() {
2:   int a, b, c;
3:   b = 1;
4:   c = 2;
5:   a = b + c;
6:   while (a>0)
7:     a=a-1;
8: }
Figure 4: Example C code
Figure 5: DAG from lines 3 - 5 of the example C code. Numbers by the root nodes indicate the DAG sequence.
For example, when the C code listed in Figure 4 is
parsed by the front-end, the code in lines 1 and 2
gives rise to a series of back-end calls. These
communicate a function begin, a block begin, and three
local variable definitions. Lines 3, 4 and 5 translate
into a DAG sequence. The while loop in lines 6 and 7
is translated to test and branch code containing another
two DAGs. Line 8 generates a block end and a
function end call to the back-end. Figure 5 shows the
DAG sequence resulting from lines 3, 4 and 5.
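The lowering of the example into basic blocks can be pictured in C itself. In the sketch below, `f_structured` is the example function, changed only so that it returns `a` and the result can be observed; `f_lowered` shows one plausible test-and-branch form. The exact shape of the branch code and the label names are assumptions, as the paper does not list the intermediate form.

```c
#include <assert.h>

/* The example function, returning `a` so the result is observable. */
int f_structured(void)
{
    int a, b, c;
    b = 1;
    c = 2;
    a = b + c;
    while (a > 0)
        a = a - 1;
    return a;
}

/* The same function as straight-line blocks separated by labels. */
int f_lowered(void)
{
    int a, b, c;
    /* first DAG sequence: lines 3 to 5 */
    b = 1;
    c = 2;
    a = b + c;
l1: /* test DAG, ending in a conditional branch */
    if (!(a > 0)) goto l2;
    /* body DAG, ending in an unconditional branch */
    a = a - 1;
    goto l1;
l2:
    return a;
}
```

Each label starts a new DAG sequence, and each sequence ends with at most one branch, which matches the basic-block structure described in the text.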
DAG Customisation
When an intermediate-code DAG sequence is emitted
from the front-end, the operations represented by the
nodes are from a relatively small set of 38 operations
over 8 supported data types. These machine independent
operations represent a RISC-like machine model
that has 3 address instructions (e.g. A := B op C)
and no complex addressing modes. The 8 data types
supported by the intermediate code DAG sequences
are: character, short integer, integer, unsigned integer,
float, double, structure, and pointer.
Figure 6: Customised DAG from lines 3 to 5 of the example C code. Numbers by the root nodes indicate the DAG sequence. Bold face indicates MARS-specific DAG node operations.
The first step in the compiler back-end is to customise
each DAG to the PE architecture. In general terms
this process involves searching for opportunities to
take advantage of the special hardware features available
in the PE. Figure 6 shows the DAG from Figure 5
after it has been customised for MARS. During the
customisation process, the DAG node operations are
members of the set which is the union of the front-end
generated operations and the MARS-specific DAG
node operations. MARS-specific operations represent
hardware features that exist in a MARS PE but
not in the generalised machine model for which the
front-end produced code. Example MARS-specific
DAG node operations are: GET (fetch value from
an address), PUT (store value to address), INC
(increment), DEC (decrement), and BZ (branch if zero).
The bit-field datatypes are very well supported in
MARS and must be added to the operator set produced
by the front end. A bit-field may be 1, 2, 4, or 8
bits long. MARS can perform all integer arithmetic
operations and memory load/stores on bit-fields as
easily as on 16-bit integers by using the field operation
unit and variable-aspect-ratio memory hardware.
Other alterations made during the DAG customisation
are performed not so much to improve efficiency
directly but to enable subsequent
code generation. They also reduce the improvements
required from the later code optimisation.
The DAG sequence is customised by traversing 
through each node in the graph, and attempting to 
match the current node (and the subtree of nodes im- 
mediately under it) against various templates. A tem- 
plate is a matching rule for a small section of tree. For 
example there is a template/substitution rule which 
says: "If an ADD has one argument which is a CNST 
whose symbol is 1, then dereference the CNST node, 
and change the ADD node to an INC node." 
When a section of the DAG matches a template it 
is substituted with another section of DAG using 
PE operations. The substituted section is then re- 
scanned for any other possible matches. Specifically, 
the DAG customisation algorithm is: 
• Form a list of all nodes in a DAG sequence in prefix order
• For each node, N, in this list
  - For each substitution template
    * If a template match is found, perform the corresponding substitution
  - If any match was found at node N, retest all templates
Dereferencing the CNST node means that the pointer 
for its use here is removed and its reference count re- 
duced. If a node's reference count is reduced to zero 
then it is completely removed from the graph and all 
of its children are dereferenced. The time complexity 
of this graph search is given by: 
$$|\mathit{DAGNodes}| \times |\mathit{Templates}| \times |\mathit{Transforms}|$$
In the worst case, $|\mathit{Transforms}|$ is bounded by:
$$|\mathit{RulesTemplates}|'$$
where $\mathit{DAGNodes}$ is the number of nodes in the current
DAG, $\mathit{Templates}$ is the number of templates to
be attempted, and $\mathit{Transforms}$ is the number of successful
template matches and DAG transformations.
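The customisation loop and the dereferencing rule described above can be sketched as follows. The node layout, the `OP_DEAD` marker, and the function names are assumptions made for illustration. Only two templates are shown: the ADD-with-constant-1 rule quoted in the text and an analogous SUB rule producing DEC.

```c
#include <assert.h>
#include <stddef.h>

enum op { OP_CNST, OP_VAR, OP_ADD, OP_SUB, OP_INC, OP_DEC, OP_DEAD };
struct node { enum op op; int value; struct node *kid[2]; int refs; };

/* Drop one use; at zero references remove the node and deref its kids. */
static void deref(struct node *n)
{
    if (n == NULL) return;
    assert(n->refs > 0);
    if (--n->refs == 0) {
        n->op = OP_DEAD;
        deref(n->kid[0]);
        deref(n->kid[1]);
        n->kid[0] = n->kid[1] = NULL;
    }
}

static int is_const(const struct node *n, int v)
{
    return n != NULL && n->op == OP_CNST && n->value == v;
}

/* Template: ADD with one CNST 1 argument becomes INC of the other. */
static int tmpl_add_one(struct node *n)
{
    if (n->op != OP_ADD) return 0;
    int c = is_const(n->kid[1], 1) ? 1 : is_const(n->kid[0], 1) ? 0 : -1;
    if (c < 0) return 0;
    struct node *keep = n->kid[1 - c], *cnst = n->kid[c];
    n->op = OP_INC; n->kid[0] = keep; n->kid[1] = NULL;
    deref(cnst);
    return 1;
}

/* Template: SUB of CNST 1 becomes DEC. */
static int tmpl_sub_one(struct node *n)
{
    if (n->op != OP_SUB || !is_const(n->kid[1], 1)) return 0;
    struct node *cnst = n->kid[1];
    n->op = OP_DEC; n->kid[1] = NULL;
    deref(cnst);
    return 1;
}

typedef int (*template_fn)(struct node *);
static const template_fn templates[] = { tmpl_add_one, tmpl_sub_one };

/* Append nodes in prefix order, listing shared nodes only once. */
static size_t list_prefix(struct node *n, struct node **out,
                          size_t len, size_t cap)
{
    if (n == NULL) return len;
    for (size_t i = 0; i < len; i++)
        if (out[i] == n) return len;
    assert(len < cap);
    out[len++] = n;
    len = list_prefix(n->kid[0], out, len, cap);
    return list_prefix(n->kid[1], out, len, cap);
}

/* Apply all templates at each node until none matches; return the
   number of transformations performed. */
int customise(struct node *root)
{
    struct node *list[64];
    size_t count = list_prefix(root, list, 0, 64);
    int total = 0;
    for (size_t i = 0; i < count; i++) {
        int matched;
        do {
            matched = 0;
            for (size_t t = 0; t < sizeof templates / sizeof templates[0]; t++)
                if (list[i]->op != OP_DEAD && templates[t](list[i])) {
                    matched = 1;
                    total++;
                }
        } while (matched);
    }
    return total;
}
```

For the expression `(x + 1) - 1` with a shared constant node, two transformations fire, and the constant is removed from the graph once its reference count reaches zero.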
Microcode Generation
After the DAG sequence has been customised for the
MARS architecture, microcode is generated from it.
The code generation follows a straightforward algorithm:
For each DAG in a DAG sequence, a code
generation routine gen-node is called on the root node
of the DAG. This routine takes two arguments: the
node for which to generate code, and the location at
which to place the result of the operation.
The gen-node routine performs as follows: The node
is first checked to see if it has already been processed by
gen-node, and its result value left in a register. If
the value was left in a register, code is produced to
move the value to the current target location. At this
point processing for the node is complete.
If no stored value from a previous processing step is 
found, the node is checked to see if it is referenced 
more than once. If it is multiply referenced, and the 
requested destination for its results is not a general 
DAG node temporary register, then a general DAG
node temporary register is allocated to the node to be
stored for later use. The next action varies according 
to the node’s operation. In general, for each child of 
the node code generation for the child is recursively 
requested by calling gen-node and taking the child 
node as the node argument. The result placement ar- 
gument for gen-node is the location from which this 
operation could best use the child’s result value. Af- 
ter code has been generated for all of the children, 
the code to perform the current node’s operation is 
produced. Finally, the results from this operation are 
moved from where this operation creates them to the 
node’s storage location (if any) and to the location 
requested in the code generation call. 
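The steps above can be sketched as a recursive routine. This is a simplified model and not the compiler's code: it handles only constants and additions, never frees temporaries, emits a textual pseudo-microcode of its own invention, and uses hypothetical locations `LOC_A`, `LOC_B` and `LOC_OUT` to stand for the adder's inputs and output.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum op { OP_CNST, OP_ADD };
enum { NUM_TEMPS = 8, LOC_A = 100, LOC_B = 101, LOC_OUT = 102, NO_REG = -1 };

struct node {
    enum op op;
    int value;             /* constant value (OP_CNST only)           */
    struct node *kid[2];
    int refs;              /* how many parents use this node          */
    int reg;               /* temporary holding the result, or NO_REG */
};

char code_buf[1024];       /* generated pseudo-microcode text         */
static int next_temp;      /* temporaries are never freed here        */

static void emit(const char *mnemonic, int a, int b)
{
    size_t len = strlen(code_buf);
    snprintf(code_buf + len, sizeof code_buf - len, "%s %d,%d\n",
             mnemonic, a, b);
}

static int is_temp(int loc) { return loc >= 0 && loc < NUM_TEMPS; }

static int alloc_temp(void)
{
    assert(next_temp < NUM_TEMPS);
    return next_temp++;
}

static int needs_code(const struct node *n)
{
    return n->op != OP_CNST && n->reg == NO_REG;
}

void gen_node(struct node *n, int dest)
{
    if (n->reg != NO_REG) {              /* already computed: copy it  */
        if (n->reg != dest) emit("mov", n->reg, dest);
        return;
    }
    if (n->refs > 1)                     /* shared: keep the result    */
        n->reg = is_temp(dest) ? dest : alloc_temp();

    int out = (n->reg != NO_REG) ? n->reg : dest;
    if (n->op == OP_CNST) {
        emit("const", n->value, out);
    } else {                             /* OP_ADD */
        if (needs_code(n->kid[1])) {
            /* kid[1] will use LOC_A itself, so park kid[0] in a temp */
            int t = alloc_temp();
            gen_node(n->kid[0], t);
            gen_node(n->kid[1], LOC_B);
            emit("mov", t, LOC_A);
        } else {
            gen_node(n->kid[0], LOC_A);
            gen_node(n->kid[1], LOC_B);
        }
        emit("add", LOC_A, LOC_B);       /* result appears in LOC_OUT  */
        emit("mov", LOC_OUT, out);
    }
    if (out != dest) emit("mov", out, dest);  /* deliver to the caller */
}
```

For the DAG `(1 + 2) + (1 + 2)` with one shared inner node, the inner addition is generated once, kept in a temporary, and then copied to both adder inputs. Passing the preferred destination down the recursion is what lets a child place its result where the parent can use it without an extra move.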
This code generation algorithm was designed to bal- 
ance the need to use as few registers as possible for 
DAG temporaries with the need to reduce the amount 
of code generated simply to do data transfers. Many 
of the operations represented by DAG nodes can be 
performed in several different ways by the PE hard- 
ware. The choice of method is based on where the 
result is to be placed after the operation is finished. 
Microcode Compression 
As MARS assembly code is generated in the
code-generation pass, peephole optimisation is performed
on it [6] [10]. Peephole processing performs small
localised changes on the generated code; it does not
alter the operations performed, but simply changes the
instruction sequencing. The optimisation identifies
adjacent instructions that have no data dependencies
between them and no hardware conflicts. If two
instructions fitting these properties are found, they
are merged together into one instruction. This
post-processing increases the utilisation of the horizontal
micro-instruction available in MARS.
:,f: 
:U: 
:12: 
$dei ault$ dei ault-ht 
$origin$ 0 
conrt ,f mt conrt -b(--Ep-baEO,addrOEIJ 
conrt ,f mt conrt ,a(di a-drt (B-22) ; 
conrt-fmt conrt-a(d2) a-drt(B-21) ; 
c-rrc(B-22) a-rrc(B-21) aau-add 
conrt-fmt conrt,a(:12:) ccdrt(PIB) ; 
nop i 
a-rrc(B-23) aau-doc c-drt (B-23) ; 
conat-fmt conrt-b(:li:) b-drt(B-18) ; 
b-rrc(B-23) c,8rc(ft,i8) C,dEt(PU) 
nop ; 
b,drt(SP) ; 
c-drt (B-23) ; 
c-dirable b-por ; 
:BID: HILT 
$end$ 
Figure 7: Microcode from example C code
Optimising the use of the horizontal control of the PE
microcode is done as a post-processing step to avoid
complications in the code generation algorithm. A
drawback of this approach is that artificial data
dependencies are introduced into the code during code
generation. These data dependencies result from
variables sharing registers in a sequential manner.
Since the horizontal compression optimiser works on
the generated microcode and not on the original
operation DAG, multiple uses of a register by different
variables are seen as data dependencies and thus
limit possible parallelisation. Figure 7 shows the
microcode generated by the compiler for the C code
listed in Figure 4.
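The compression step and the register-sharing limitation can be illustrated with a small C sketch. It rests on assumptions that are not taken from the MARS tools: each operation is modelled as three bit sets (micro-instruction fields occupied, registers read, registers written), the names MicroOp, can_merge and compress are hypothetical, write-after-read is conservatively treated as a dependency, and an operation is folded into the preceding word whenever that word can still accept it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical micro-operation: bit sets of occupied micro-instruction
   fields and of the registers read and written. */
typedef struct {
    unsigned fields;
    unsigned reads;
    unsigned writes;
} MicroOp;

/* Two operations may share one micro-instruction only if they occupy
   different fields and neither touches a register the other writes. */
static int can_merge(const MicroOp *a, const MicroOp *b)
{
    if (a->fields & b->fields)
        return 0;                         /* hardware conflict */
    if (a->writes & (b->reads | b->writes))
        return 0;                         /* b needs or overwrites a's result */
    if (a->reads & b->writes)
        return 0;                         /* b would clobber a's input */
    return 1;
}

/* Fold each operation into the preceding word when allowed; returns the
   number of micro-instructions that remain in code[]. */
size_t compress(MicroOp *code, size_t n)
{
    size_t last = 0;
    if (n == 0)
        return 0;
    for (size_t i = 1; i < n; i++) {
        if (can_merge(&code[last], &code[i])) {
            code[last].fields |= code[i].fields;
            code[last].reads  |= code[i].reads;
            code[last].writes |= code[i].writes;
        } else {
            code[++last] = code[i];
        }
    }
    return last + 1;
}
```

Take four operations that load a constant into a register and store it, twice in a row. When both load/store pairs reuse one register, compress leaves four micro-instructions; when the second pair uses a different register, the second load can share a word with the first store and three remain. The first case is the artificial dependency caused by sequential register sharing.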
6 Results 
The PE compiler is currently being tested by implementing
various applications under it. Table 1 gives
a list of program segments that have been compiled.
The table gives the lines of C code for each program
file, followed by the lines of microcode generated under
three different levels of optimisation. In optimisation
“00”, all optimisation is turned off. Optimisation
“01” enables DAG rewrite rules that place local
variables in data registers and optimise the accesses
to them. Optimisation “02” enables DAG rewrite
rules that substitute increments and decrements for
small additions and subtractions, and DAG rewrite
rules that substitute shifts and adds for multiplication
by a constant. “02” is the default optimisation
level; other values are given for comparison only.
Test file   Lines of C   Lines of microcode, optimisation level
name        source       “00”     “01”     “02”
count1      11           102      17
count2      16           170      33
div         19           89       62
mod         20           107      79
mult        18           96       67
sieve       39           221      81
small       8            36       16
table       22           200      44
test        9            7        7
xbar        5            9        8
squares     17           101      44
sumtbl      19           108      42

Legible “02” entries: 43, 41, 38 (row assignment not recoverable).

Table 1: Code size using different optimisations
Of the program segments listed in Table 1 several are 
of special note. “mult” and “div” show the greatest 
gains from optimisation level “01” to “02”. Each
of these programs contains several multiplication or 
division operations where one argument is a constant. 
These operations are reduced to shorter sequences of 
shifts and adds. 
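The shift-and-add substitution can be shown with a short C sketch. The function name mul_shift_add is hypothetical, and the loop only models the decomposition of the constant into its set bits; a compiler applying such a rewrite rule would perform that decomposition at compile time and emit one shift and one add per set bit. Unsigned arithmetic is used so that overflow wraps in the same way for both forms.

```c
#include <assert.h>

/* Multiply x by the constant k using only shifts and adds:
   one shifted copy of x is added for every set bit of k. */
unsigned mul_shift_add(unsigned x, unsigned k)
{
    unsigned result = 0;
    for (unsigned bit = 0; k != 0; bit++, k >>= 1) {
        if (k & 1u)
            result += x << bit;
    }
    return result;
}
```

For example, a multiplication by 10 becomes (x << 3) + (x << 1), so mul_shift_add(7, 10) returns 70.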
Many different code segments in the table, particularly
“count1”, “count2”, and “table”, show large
improvements from optimisation level “00” to “01”.
These programs contain many accesses to local variables
which can be placed in registers. Since address
calculations and memory fetches must be explicitly
coded in microcode, each fetch or store that can
be eliminated saves several instructions.
The compiler is also being tested by comparing com- 
piler generated microcode programs with assembler 
code produced by hand. Table 2 lists five programs 
that have been written in C and in micro-assembler. 
Each program was compiled using the C compiler,
run on MARS hardware and timed against a real-time
clock. The microcode resulting from each compilation
was then reorganised and optimised by hand,
and the resulting programs were run and timed.
The test results for these and other programs indicate
that the compiler generates code about twice the
size of hand generated code (100% overhead).
Furthermore, the run-time data gathered for these
programs show that run-time overhead is approximately
the same as code size overhead.
Test      Lines   Lines of microcode    Run time in milliseconds
file      of C    compiled   hand       compiled   hand
bubble
indexed   39      64         36         19940      13000
sieve     37      65         29         320        130

Table 2: Code size and run time, compiled vs. hand optimised

Further optimisations are being added to the compiler.
The peephole horizontal compressions are being
implemented. DAG optimisation templates are
added to the DAG rewrite section by identifying 
hand-improvements which can be made in the com- 
piler output. It is hoped that further tuning of the 
optimisations will reduce the microcontrol size to the
goal of one and one-half times (50% overhead) hand-
generated code. This may seem severe to anyone familiar
with the 0-25% code overhead of compilers for
traditional sequential architectures. The additional 
overhead for MARS results from targeting microcode 
instructions instead of a higher-level assembler. 
7 Summary 
Special purpose multicomputers often use custom 
VLSI processing elements. The processors incorpo- 
rate special architecture features to speed up targeted 
applications. These features represent a challenge to 
effective compiler design. The MARS PE compiler 
was designed by using retargetable compiler tech- 
niques and extending them for the special features of 
the MARS architecture. A reusable compiler front- 
end, including lexical analyser, parser, and symbol 
table is used to generate intermediate code, which is 
stored as sequences of DAGs. These DAGs are customised
to fit the MARS instruction set and then
translated into microcode using a DAG walking algorithm.
Peephole optimisation is then performed on
this microcode.
Test results indicate a 50% - 100% code size overhead 
in compiler vs. hand-generated microcode. Run-time 
overhead is also in the same range. These numbers 
are higher than commonly encountered in compilers
for traditional RISC and CISC architectures due to 
the difficulty involved in microcode generation. Fur- 
ther optimisations, currently being implemented, are 
expected to bring the average code and run-time over- 
head to the 50% range. 
Bibliography
[1] S. Abraham and K. Padmanabhan. Instruction
reorganization for a variable-length pipelined
microprocessor. In IEEE International Conference
on Computer Design, volume ICCD-88, 1988.
[2] D. P. Agrawal and J. Mauney. Structure of
a parallelising compiler for the B-HIVE
multicomputer. In Microprocessing and
Microprogramming. North-Holland, 1988.
[3] P. Agrawal, V. Agrawal, and K. T. Cheng. Fault
simulation in a pipelined multiprocessor system.
In Proceedings of the IEEE International
Test Conference, volume ITC-89, pages 727-734,
1989.
[4] P. Agrawal and W. J. Dally. A hardware
logic simulation system. IEEE Transactions
on Computer-Aided Design, 9(1):19-29, January
1990.
[5] P. Agrawal et al. MARS: A multiprocessor-
based programmable accelerator. IEEE Design
& Test of Computers, 4(6):28-36, October 1987.
[6] A. Aho, R. Sethi, and J. Ullman. Compilers,
Principles, Techniques, and Tools. Addison-
Wesley, Reading, Massachusetts, 1986.
[7] A. Aiken and A. Nicolau. A development
environment for horizontal microcode programs. In
Proceedings Micro-19, 1986.
[8] R. Allen and S. Johnson. Compiling C for
vectorisation, parallelisation, and inline expansion.
In Proceedings of the Conference on Programming
Language Design and Implementation, volume
SIGPLAN-88, pages 241-249, 1988.
[9] R. P. Atkin. Improved instruction formation in
the exhaustive local microcode compaction
algorithm. In Proceedings Micro-17, 1984.
[10] D. K. Banerji and J. Raymond. Elements of
Microprogramming. Prentice-Hall Inc.,
Englewood Cliffs, NJ, 1982.
[11] S. Chatterjee and P. Agrawal. Connected speech
recognition on a multiple processor pipeline. In
Proceedings of the IEEE International Conference
on Acoustics, Speech, and Signal Processing,
1989.
[12] J. Ellis. Bulldog: A Compiler for VLIW
Architectures. The MIT Press, Cambridge, MA, 1986.
Originally presented as the author's thesis
(doctoral) - Yale University, 1986.
[13] J. Fisher. Trace scheduling: a technique for
global microcode compaction. IEEE Transactions
on Computers, C-30(7):478-490, July 1981.
[14] C. W. Fraser. A language for writing code
generators. In Proceedings of the Conference on
Programming Language Design and Implementation,
volume SIGPLAN-89, pages 238-245, June
1989.
[15] C. W. Fraser and A. L. Wendt. Automatic
generation of fast optimizing code generators. In
Proceedings of the Conference on Programming
Language Design and Implementation, volume
SIGPLAN-88, pages 79-84, 1988.
[16] R. Gurd. Experience developing microcode
using a high level language. In Proceedings
Micro-16, 1983.
[17] W. C. Hopkins, M. J. Horton, and C. S.
Arnold. Target-independent high-level
microprogramming. In Proceedings Micro-18, 1985.
[18] M. Lam. Software pipelining: An effective
scheduling technique for VLIW machines. In
Proceedings of the Conference on Programming
Language Design and Implementation, volume
SIGPLAN-88, pages 318-328, 1988.
[19] M. Lam. A Systolic Array Optimizing Compiler.
Kluwer Academic Publishers, Boston, MA, 1989.
