ISEGEN: Generation of High-Quality Instruction Set Extensions
by Iterative Improvement  
Partha Biswas
partha@cecs.uci.edu
Sudarshan Banerjee
banerjee@cecs.uci.edu
Nikil Dutt
dutt@cecs.uci.edu
Center for Embedded Computer Systems
Donald Bren School of Information and Computer Science
University of California, Irvine
Irvine, CA 92697-3425, USA
Laura Pozzi
laura.pozzi@epfl.ch
Paolo Ienne
paolo.ienne@epfl.ch
Ecole Polytechnique Fédérale de Lausanne (EPFL)
School of Computer and Communication Sciences
CH-1015 Lausanne, Switzerland
Abstract
Customization of processor architectures through
Instruction Set Extensions (ISEs) is an effective way to meet
the growing performance demands of embedded applications.
A high-quality ISE generation approach needs to obtain
results close to those achieved by experienced designers,
particularly for complex applications that exhibit
regularity: expert designers are able to manually exploit such
regularity in the data flow graphs to generate high-quality
ISEs. In this paper, we present ISEGEN, an approach that
identifies high-quality ISEs by iterative improvement,
following the basic principles of the well-known
Kernighan-Lin (K-L) min-cut heuristic. Experimental results on a
number of MediaBench, EEMBC and cryptographic
applications show that our approach matches the quality of the
optimal solution obtained by exhaustive search. We also
show that our ISEGEN technique is on average   faster
than a genetic formulation that generates equivalent solu-
tions. Furthermore, the ISEs identified by our technique ex-
hibit more speedup than the genetic solution on a large
cryptographic application (AES) by effectively exploiting its
regular structure.
 This work was partially supported by NSF grants: CCR-0203813,
CCR-0205712 and SRC contract: 2003-HJ1111.
1 Introduction
Continuing advances in manufacturing processes have
made it possible for processor vendors to build increasingly
fast processors. However, newer applications place an in-
creasing demand on performance, at a rate faster than that
achievable by processors. These trends have necessitated
the migration of critical computations from the processor
core to an application-specific unit that is able to perform
compute-intensive tasks efficiently. We call such a unit
an Ad-hoc Functional Unit (AFU). The AFU accelerates
critical operations of application algorithms by executing
application-specific Instruction Set Extensions (ISEs).
Automatic generation of ISEs is essentially the task of
hardware-software partitioning applied at an instruction-
level granularity. The Kernighan-Lin (K-L) min-cut algo-
rithm is a well-known graph partitioning heuristic originally
designed for circuit partitioning [2]. Recently, this heuristic
has been successfully adapted for task-level partitioning of
a system into hardware and software [1]. In this paper, we
apply the K-L heuristic at the instruction-level granularity
to automatically generate ISEs. We refer to our approach
as ISEGEN. Our motivation for employing an iterative im-
provement technique like K-L is to generate solutions close
to those obtained manually by expert designers. In order
to match such a solution quality, the control parameters of
ISEGEN closely model the decisions taken by the designer.
Proceedings of the Design, Automation and Test in Europe Conference and Exhibition (DATE’05)
1530-1591/05 $ 20.00 IEEE
We show the efficacy of ISEGEN on a number of embedded
applications selected from MediaBench, EEMBC
and cryptographic suites by comparing our results with the
best known approaches of ISE generation. We demonstrate
that ISEGEN runs up to    faster than a previous genetic
formulation while yielding ISEs having speedup compara-
ble with the optimal solution [3]. On a large cryptographic
application (AES) for which the exhaustive techniques fail,
ISEGEN — by effectively exploiting its regular structure —
generates  more speedup than the genetic approach [4].
The rest of the paper is organized as follows. In Sec-
tion 2, we define our problem. In Section 3, we discuss
related research work and our motivation. We propose our
ISEGEN approach in Section 4. In Section 5, we describe
the experimental results that demonstrate the efficacy of our
approach. Finally, Section 6 concludes the paper.
2 Problem Definition
Instructions within a basic block are typically represented
as a Directed Acyclic Graph (DAG), $G(V, E)$: the nodes $V$
represent instructions and the edges $E$ capture the data
dependencies between them. We define a cut $C$ representing a
potential ISE as a subgraph of $G$, $C \subseteq G$. Let $M(C)$
be the function that measures the merit of a cut $C$ as an
estimation of the speedup achievable by implementing $C$ as an
ISE. Let $IN(C)$ and $OUT(C)$ respectively be the number of
inputs and the number of outputs of $C$. The maximum number of
operands of an ISE (or a cut) is limited by the number of
register file ports in the underlying core. Let $N_{in}$ and
$N_{out}$ be the maximum number of input and output operands
respectively. A cut $C$ is architecturally feasible if its
inputs are available at the time of issue. This is only
possible if $C$ is convex, i.e., if there exists no path from a
node $u \in C$ to another node $v \in C$ through a node
$w \notin C$ [3]. The problem of ISE generation can be broken
into the following two sub-problems:
Problem 1 Given the data flow graph (DFG) $G(V, E)$ in a basic
block, find a cut $C \subseteq G$ that maximizes $M(C)$ under
the following constraints:
• Input-Output (I/O) Constraints: $IN(C) \le N_{in}$ and
  $OUT(C) \le N_{out}$.
• Convexity Constraint: $C$ is convex.
Problem 2 Given the basic blocks in an application and the
maximum allowed number of ISEs as $N_{ISE}$, find cuts that
maximize the speedup achievable for the entire application.
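The two constraints of Problem 1 can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the DFG encoding (a dictionary from each node to its successors), the function names `count_io`, `is_convex` and `is_feasible`, and the counting rule (an input is a distinct producer outside the cut; an output is a cut node with a consumer outside the cut or marked live-out) are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Set, Tuple

Dfg = Dict[str, List[str]]  # hypothetical encoding: node -> successor nodes


def count_io(succ: Dfg, cut: Set[str], live_out: Iterable[str] = ()) -> Tuple[int, int]:
    """Return (inputs, outputs) of the cut under the assumed counting rule."""
    live = set(live_out)
    # Inputs: distinct nodes outside the cut that feed a node inside it.
    producers = {u for u, vs in succ.items()
                 if u not in cut and any(v in cut for v in vs)}
    # Outputs: cut nodes whose value is consumed outside the cut.
    outputs = {u for u in cut
               if u in live or any(v not in cut for v in succ.get(u, ()))}
    return len(producers), len(outputs)


def is_convex(succ: Dfg, cut: Set[str]) -> bool:
    """True if no path leaves the cut and re-enters it."""
    stack = [w for u in cut for w in succ.get(u, ()) if w not in cut]
    seen: Set[str] = set()
    while stack:
        w = stack.pop()
        if w in seen:
            continue
        seen.add(w)
        for x in succ.get(w, ()):
            if x in cut:          # a path through an outside node re-enters the cut
                return False
            stack.append(x)
    return True


def is_feasible(succ: Dfg, cut: Set[str], n_in: int, n_out: int) -> bool:
    """Check the I/O constraints and the convexity constraint together."""
    inputs, outputs = count_io(succ, cut)
    return inputs <= n_in and outputs <= n_out and is_convex(succ, cut)


# Toy DFG (hypothetical): a, b, c are producers outside any cut.
dfg: Dfg = {"a": ["mul"], "b": ["mul"], "c": ["add"],
            "mul": ["add", "shl"], "shl": ["sub"], "add": ["sub"], "sub": []}

print(count_io(dfg, {"mul", "add"}))        # inputs a, b, c; outputs mul, add
print(is_convex(dfg, {"mul", "sub"}))       # mul -> add -> sub passes through add
print(is_feasible(dfg, {"mul", "add"}, 4, 2))
```

In this toy graph the cut {mul, sub} is rejected because the path mul, add, sub passes through add, which lies outside the cut.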
3 State of the Art and Motivation
Some of the earlier work in ISE generation applied to
reconfigurable computing [11, 10] considers only
single-output subgraphs. Even though a few recently proposed
approaches [5, 6] handle multiple outputs, they identify only
connected subgraphs. However, including independent
subgraphs in the same ISE exposes additional speedup
potential, whereas algorithms that identify only connected
subgraphs cannot take advantage of a generous limit on the
number of ISE outputs. Therefore, we also consider
independent subgraphs in ISEGEN.
Figure 1. An example showing the advan-
tage of large scale reuse — Finding three in-
stances of the largest ISE (shown with a dot-
ted boundary) is not as effective as finding
a large ISE with six instances (shown with a
solid boundary).
When the goal of ISE generation is speedup coupled with
dynamic reuse, as in [7, 8], the resulting subgraphs are
generally small. In practice, if one wants to match the
results obtained by expert designers, clusters of 2 or 3
instructions are far too small to be of real interest:
typical results at this level generally include only peculiar
address generation patterns, pre- or post-shifting, or
well-known arithmetic patterns such as multiply-accumulators.
There is a need for algorithms that can identify large and
reusable clusters, efficiently covering the application DFG.
Figure 1 demonstrates this principle with the help of an
example. This motivates our ISEGEN approach, which not only
generates ISEs having higher potential for speedup but also
demonstrates the efficacy of the generated ISEs in terms of
their reusability.
An exact solution [3] that uses an exhaustive search with
pruning is not practical for applications having large basic-
blocks. A genetic formulation [4] presents a practical so-
lution with results showing good speedup for the generated
ISEs. However, the genetic algorithm is stochastic in nature
and therefore multiple runs may result in different solutions.
Our ISEGEN approach, on the other hand, is an iterative
improvement technique that closely mimics the decisions
taken by an expert designer; consequently we are able to
match the solution quality of expert designers.
4 The ISEGEN Approach
We reiterate that ISEGEN essentially performs
Hardware-Software partitioning at instruction-level
granularity. The instructions belonging to the hardware
partition map to an ISE to be executed on an AFU while
those belonging to the software partition individually
execute on the processor core. Our approach considers
the basic blocks in an application based on their speedup
potential (a function of each block's execution frequency
and the estimated gain from mapping all its nodes to
hardware) and performs up to $N_{ISE}$ successive
bi-partitions into hardware and software within a basic
block. After an ISE is found in a basic block, the speedup
potential of the block is updated considering the remaining
nodes.
ISEGEN()
01: last_best_C <- C
02: loop (until exit condition)
03:   best_C <- last_best_C
04:   while (there exists unmarked node in DFG)
05:     foreach (unmarked node n)
06:       Calculate Gain Function, M_toggle(n, best_C)
07:     endfor
08:     best_node <- Node with maximum Gain
09:     Toggle and Mark best_node
10:     Get impact of toggling best_node w.r.t. best_C
11:     if (toggling best_node satisfies constraints)
12:       Update best_C from toggling best_node
13:       Calculate M(best_C)
14:     endif
15:   endwhile
16:   if (M(best_C) > M(last_best_C))
17:     last_best_C <- best_C
18:     Unmark all nodes
19:   endif
20: endloop
21: C <- last_best_C
Figure 2. The ISEGEN Algorithm
We borrow from the Kernighan-Lin min-cut partitioning
heuristic the idea of steering the toggling of nodes in the
DFG between software (S) and hardware (H) with a gain
function that captures the designer’s objective. The
effectiveness of the K-L heuristic lies in its ability to
overcome many local maxima without using unnecessary moves.
4.1 Modified Kernighan-Lin Algorithm
The ISEGEN algorithm, which essentially performs a
bi-partitioning of a DFG into S and H, is depicted in Figure 2.
This is an iterative improvement algorithm that starts with
all nodes in software and tries to toggle each unmarked
node, $n$, in the graph from S to H or H to S in every
iteration. Within each iteration of ISEGEN (line 02 to line 20),
last_best_C retains the best cut found so far with the help of
best_C, which maintains the intermediate best cuts. Initially,
the cut $C$ points to a configuration where all nodes belong
to software and this configuration is passed down to best_C.
The decision to toggle $n$ with respect to best_C is based
on a gain function, $M_{toggle}(n, best\_C)$. The gain
function is evaluated for each node (line 05) and the node
with the best gain, best_node (obtained in line 08), is then
toggled and marked (line 09). Note that the chosen cut at this
point may violate the input/output constraints and the
convexity constraint. In other words, we allow a cut to be
illegal, giving it an opportunity to eventually grow into a
valid cut.
If both convexity and I/O constraints are satisfied (line
11), best_C is updated through removal of best_node from
the cut or its addition to the cut depending on whether
best_node has toggled from H to S or S to H respectively.
The speedup estimate $M(best\_C)$ determines whether best_C
should override last_best_C (line 17). This process is
carried on till no more unmarked nodes are left. In general,
we found experimentally that 5 passes are enough for
successive improvement of the solution. Therefore, the
outermost loop exits after 5 passes, or earlier when there
is no improvement in the merit of the solution across
successive iterations. The best cut (last_best_C) is stored
back in $C$, which further acts as a starting point for the
next bi-partitioning of the DFG.
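The control flow described above can be sketched in Python as follows. This is one possible reading of Figure 2, not the authors' code: the function name `isegen_bipartition` and the three callbacks (`gain`, `merit`, `satisfies_constraints`) are hypothetical. The sketch also assumes that a separate working cut is toggled freely (and may be illegal), that the gain is evaluated against this working cut, and that the best legal cut seen during a pass is recorded.

```python
from typing import Callable, FrozenSet, Iterable

Cut = FrozenSet[str]


def isegen_bipartition(
    nodes: Iterable[str],
    gain: Callable[[str, Cut], float],            # hypothetical gain callback
    merit: Callable[[Cut], float],                # hypothetical merit callback
    satisfies_constraints: Callable[[Cut], bool], # I/O + convexity check
    start: Cut = frozenset(),
    max_passes: int = 5,
) -> Cut:
    """K-L style iterative improvement over hardware/software toggles."""
    all_nodes = sorted(nodes)                     # sorted for deterministic ties
    last_best = frozenset(start)
    for _ in range(max_passes):
        best = last_best
        working = set(last_best)                  # may become temporarily illegal
        unmarked = set(all_nodes)
        while unmarked:
            snapshot = frozenset(working)
            best_node = max(sorted(unmarked), key=lambda n: gain(n, snapshot))
            unmarked.remove(best_node)            # mark the node
            working ^= {best_node}                # toggle S <-> H
            candidate = frozenset(working)
            if satisfies_constraints(candidate) and merit(candidate) > merit(best):
                best = candidate
        if merit(best) > merit(last_best):
            last_best = best                      # keep the improvement, run another pass
        else:
            break                                 # no improvement: exit early
    return last_best


# Toy run with placeholder callbacks (not the gain or merit used in this work).
result = isegen_bipartition(
    nodes=["a", "b", "c", "d"],
    gain=lambda n, cut: -1.0 if n in cut else 1.0,
    merit=lambda cut: float(len(cut)),
    satisfies_constraints=lambda cut: len(cut) <= 2,
)
print(sorted(result))
```

The toy callbacks only exercise the loop structure; with them the second pass finds no better legal cut and the loop exits early.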
4.2 Gain Function
The gain function $M_{toggle}(n, C)$ for toggling a node $n$
with respect to a cut $C$ is a linear weighted sum of the
following 5 components that act as control parameters for
the algorithm:
• Merit Function (Speedup Estimate): Consider the new cut
  obtained after addition or removal of the node $n$ from the
  cut $C$ as $n$ toggles from S to H or H to S respectively.
  This component is defined piecewise, with one value when
  the new cut obeys the convexity constraint and another
  when the new cut violates it.
• Input Output violation penalty: A heavy penalty is applied
  with the help of a large factor if input-output port
  constraints are violated.
• Convexity Constraints: Addition of a node to a cut is
  favored when its neighbors are already in the cut while a
  node already in the cut is not easily removed from the cut.
  This component is defined in terms of the number of
  neighbors of $n$ in $C$, with one expression when $n$ is in
  S and another when $n$ is in H.
• Large Cut: A cut is allowed to grow in regions where
  growth potential is higher. The external input and external
  output nodes act as barriers beyond which a cut cannot
  grow. Since we do not allow memory access from AFUs,
  memory operations are also barriers for cut growth. This
  component is defined in terms of the minimum distance of
  $n$ from the barriers in the upward direction and the
  minimum distance of $n$ from the barriers in the downward
  direction, with one expression when $n$ is in S and another
  when $n$ is in H. We employ a directional growth strategy
  where nodes closer to the barrier (that have higher
  potential for cut growth) are consistently favored for
  inclusion in hardware; this strategy implicitly favors
  reusability of the cut without losing the benefit of having
  a large cut as a solution.
• Independent Cuts: It is quite possible that the best cut is
  actually a combination of 2 or 3 large connected subgraphs
  and not necessarily the largest connected subgraph. So, ISE
  exploration needs to expand not only in the vertical
  direction favoring large cuts but also in the horizontal
  direction. This component considers the independently
  connected subgraphs in the DFG $G$, excluding the connected
  subgraph containing $n$, and is defined piecewise for $n$
  in H and $n$ in S. It uses, for each independently connected
  subgraph, the sum of the hardware latencies along its
  critical path. Using this component, the nodes already in H
  are allowed to move back into S to favor the growth of
  other potentially large subgraphs.

[Figure 3 panels: (a) 1S, 0H as child; (b) 0S, 1H as child;
(c) 1S, 1H as children; (d) >1S, 0H as children; (e) 0S, >1H
as children; (f) 1S, >1H as children; (g) >1S, 1H as
children; (h) H or S as parents; (i) >1H as siblings; (j) 1S,
>=0H as siblings; (k) >1S as siblings; (l) 1H, >=0S as
siblings.]
Figure 3. Basic Rules to project the effect of toggling a
node from S to H to its parents (h), children (a-g) and
siblings (i-l). The pair of changes to ($I_{toggle}$,
$O_{toggle}$) is shown next to each node that gets affected.

We now express $M_{toggle}(n, C)$ with respect to the
current cut $C$ as the weighted sum of the five components
above, with one weight per component. The weights have been
determined experimentally. We show in [9] that by maintaining
appropriate data structures, the worst-case running time of
ISEGEN can be restricted to O    .
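The structure of the gain function, a linear weighted sum of per-component scores, can be sketched as below. The helper `make_gain`, the two placeholder components and the weight values are hypothetical and chosen only to show the composition; they are not the components or the experimentally determined weights of this work.

```python
from typing import Callable, FrozenSet, Sequence

Component = Callable[[str, FrozenSet[str]], float]


def make_gain(components: Sequence[Component],
              weights: Sequence[float]) -> Component:
    """Build gain(node, cut) = sum_i weights[i] * components[i](node, cut)."""
    if len(components) != len(weights):
        raise ValueError("one weight is required per component")

    def gain(node: str, cut: FrozenSet[str]) -> float:
        return sum(w * f(node, cut) for w, f in zip(weights, components))

    return gain


# Placeholder components (assumptions for illustration only).
def constant_merit(node: str, cut: FrozenSet[str]) -> float:
    return 1.0                      # stands in for a speedup estimate


def size_penalty(node: str, cut: FrozenSet[str]) -> float:
    return float(len(cut))          # stands in for a constraint penalty


gain = make_gain([constant_merit, size_penalty], weights=[2.0, -0.5])
print(gain("x", frozenset({"a", "b"})))
```

A negative weight turns a component into a penalty, so a single weighted sum can express both rewards and penalties.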
4.3 Impact of Toggling a Node
The runtime complexity of $M_{toggle}$ is significantly
reduced by shifting the majority of computations to an
appropriate evaluation of the impact of toggling a node
(line 10 in Figure 2). The number of inputs and the number of
outputs of the ISE at any stage of the partitioning process
are given by $I_{ISE}$ and $O_{ISE}$ respectively. In order to
quantify the impact of toggling a node, we introduce
addendums $I_{toggle}$ and $O_{toggle}$ associated with every
node. When a node is toggled, its addendums $I_{toggle}$ and
$O_{toggle}$ are added to $I_{ISE}$ and $O_{ISE}$ respectively
to get the new values of $I_{ISE}$ and $O_{ISE}$. Initially,
all nodes are in S and therefore $I_{ISE} = O_{ISE} = 0$, and
$I_{toggle}$ and $O_{toggle}$ equal the number of inputs and
number of outputs respectively of the corresponding node.
It is easy to show that when a node is toggled (say, from
S to H), $I_{toggle}$ and $O_{toggle}$ of only its neighbors
(parents, children and siblings) get affected. After toggling
from S to H, $I_{toggle}$ and $O_{toggle}$ of the node reverse
in sign so that the changes to $I_{ISE}$ and $O_{ISE}$ will be
undone if the same node toggles back to S. The impact of
toggling a node (node 3) on itself and other nodes is
illustrated in Figure 5.
[Figure 5: a DFG shown in the "All Software" configuration
and after node 3 moves from Software to Hardware (ISE), with
$I_{toggle}$ and $O_{toggle}$ annotated on the nodes; after
the toggle, $I_{ISE} = 0 + 1$ and $O_{ISE} = 0 + 1$.]
Figure 5. Instance of an Instruction-level
Hardware-Software Partitioning
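The role of the addendums can be illustrated with a brute-force Python sketch that recomputes the input and output counts before and after a toggle. This is deliberately not the incremental rule set of Figure 3; the names `io_counts` and `addendums`, the DFG encoding and the toy graph are assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

Dfg = Dict[str, List[str]]  # hypothetical encoding: node -> successor nodes


def io_counts(succ: Dfg, cut: Set[str]) -> Tuple[int, int]:
    """Return (I_ISE, O_ISE) for the cut by direct counting."""
    inputs = {u for u, vs in succ.items()
              if u not in cut and any(v in cut for v in vs)}
    outputs = {u for u in cut
               if any(v not in cut for v in succ.get(u, ()))}
    return len(inputs), len(outputs)


def addendums(succ: Dfg, cut: Set[str], node: str) -> Tuple[int, int]:
    """Return (I_toggle, O_toggle): the change in (I_ISE, O_ISE) if node toggles."""
    toggled = set(cut) ^ {node}
    i_before, o_before = io_counts(succ, cut)
    i_after, o_after = io_counts(succ, toggled)
    return i_after - i_before, o_after - o_before


# Toy DFG (hypothetical): x and y are external inputs, out is an external consumer.
dfg: Dfg = {"x": ["n1", "n2"], "y": ["n1"], "n1": ["n3"],
            "n2": ["n3"], "n3": ["out"], "out": []}

cut: Set[str] = set()                      # all nodes start in software
i_ise, o_ise = 0, 0
di, do = addendums(dfg, cut, "n3")         # addendums of n3 while in S
i_ise, o_ise = i_ise + di, o_ise + do      # apply them on toggling n3 to H
cut ^= {"n3"}

print((i_ise, o_ise))                      # running totals after the toggle
print(addendums(dfg, cut, "n3"))           # sign-reversed addendums of n3
print(addendums(dfg, cut, "n1"))           # a parent's addendums have changed
```

Here the parent n1 has addendums (2, 1) while everything is in software and (1, 0) once n3 is in hardware, which shows how a toggle affects its neighbors.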
We developed a comprehensive set of rules to capture the
effect of toggling a node; they are presented pictorially in
Figure 3, wherein a toggle of a node from S to H is shown as
S->H. The changes in $I_{toggle}$ and $O_{toggle}$ values are
represented as $\Delta I_{toggle}$ and $\Delta O_{toggle}$
respectively such that the new values of $I_{toggle}$ and
$O_{toggle}$ for the affected nodes are computed as
$I_{toggle} + \Delta I_{toggle}$ and
$O_{toggle} + \Delta O_{toggle}$ respectively. These rules can
be empirically verified to work on any DFG (for example,
Figure 5). An additional rule is that a toggle of a node from
H to S negates the effect of its toggling from S, i.e., the
rules in Figure 3 can be applied for toggling from H to S
with the sign reversed for $\Delta I_{toggle}$ and
$\Delta O_{toggle}$. The proofs of correctness for all the
rules have been omitted for the sake of brevity and are
presented in [9]. The impact of toggling a node also involves
maintenance of appropriate data structures for fast
evaluation of $M(C)$ and convexity violation [9].

[Figure 4, first plot: Speedup for I/O Constraints (4,2) and
$N_{ISE}$ = 4. Second plot: Runtime (in microseconds, log
scale) for I/O Constraints (4,2) and $N_{ISE}$ = 4.
Benchmarks in both plots: conven00(6), fbital00(20),
viterb00(23), autcor00(25), adpcm_decoder(82),
adpcm_coder(96), fft00(104). Series: Exact, Iterative,
Genetic, ISEGEN.]
Figure 4. Comparison of Speedup and Runtime with number of
AFUs = 4 and I/O constraints: (4,2)
5 Experimental Results
We define the merit function $M(C)$ in terms of the software
latency of $C$, estimated by summing the latencies of the
nodes in $C$, and the hardware latency of $C$, estimated from
the critical path in $C$. The hardware latency for each
instruction was obtained by synthesizing the constituent arithmetic
and logic operators on a common  CMOS technol-
ogy and then normalized to the delay of a 	
-bit multiply-
accumulate (MAC).
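The two latency estimates that enter the merit function can be sketched in Python as follows. The function names, the DFG encoding and all latency values are hypothetical; the sketch computes only the two latencies and does not combine them into a merit value.

```python
from typing import Dict, List, Set

Dfg = Dict[str, List[str]]  # hypothetical encoding: node -> successor nodes


def software_latency(cut: Set[str], sw_lat: Dict[str, float]) -> float:
    """Sum of the software latencies of the nodes in the cut."""
    return sum(sw_lat[n] for n in cut)


def hardware_latency(succ: Dfg, cut: Set[str], hw_lat: Dict[str, float]) -> float:
    """Latency of the longest (critical) path that stays inside the cut."""
    memo: Dict[str, float] = {}

    def longest_from(n: str) -> float:
        # Longest path starting at n, following only edges that stay in the cut.
        if n not in memo:
            tail = max((longest_from(v) for v in succ.get(n, ()) if v in cut),
                       default=0.0)
            memo[n] = hw_lat[n] + tail
        return memo[n]

    return max((longest_from(n) for n in cut), default=0.0)


# Toy values (assumptions): software latencies in cycles, hardware latencies
# normalized to the delay of a multiply-accumulate.
dfg: Dfg = {"mul": ["add"], "shl": ["add"], "add": []}
sw = {"mul": 2.0, "shl": 1.0, "add": 1.0}
hw = {"mul": 0.8, "shl": 0.1, "add": 0.3}
cut = {"mul", "shl", "add"}

print(software_latency(cut, sw))        # 2 + 1 + 1
print(hardware_latency(dfg, cut, hw))   # critical path mul -> add
```

Because the DFG is acyclic, the memoized recursion visits each node of the cut once.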
We integrated ISEGEN in the MachSUIF framework [12] and
evaluated overall speedup for the entire application using
all the generated cuts. The evaluation uses the overall
execution latency of the application, i.e., the latency when
the application runs entirely in software, together with the
execution frequency of each cut. Note that, in this work, we
do not consider memory operations for inclusion into a cut.
To evaluate the efficacy of our ISEGEN approach,
we chose benchmarks from diverse application domains
in the EEMBC (conven00, fbital00, viterb00, autcor00 and
fft00) and MediaBench (adpcm_decoder and adpcm_coder)
suites. In addition, we chose a cryptographic application,
viz. AES. Our baseline architecture is a simple RISC machine
and we allow up to 4 AFUs (or ISEs) to be added. Keeping the
I/O constraints fixed at (4,2), we study the overall speedup
of applications obtained over execution on the core
processor and the time taken to generate ISEs (or runtime) on
a Sun Ultra-5. We compare the quality of our results with
the best known algorithms for ISE generation. The optimal
algorithms for ISE generation [3] come in two flavors: Exact
multiple-cut identification (or Exact in short) and Iterative
exact single-cut identification (or Iterative), both of which
employ exhaustive search with pruning. For applications
having large basic blocks, we chose a genetic formulation
[4] for comparing our results.
We associate with each benchmark the maximum num-
ber of nodes in its critical basic block (shown in parenthe-
ses) and arrange them in increasing order. It is evident from
the first plot of Figure 4 that ISEGEN matches the solution
quality of Exact, Iterative and Genetic algorithms. Note that
because of effective pruning, Exact is able to handle up to

 nodes and Iterative is able to handle up to  nodes in
the selected benchmarks. As shown in the second plot of
Figure 4, ISEGEN runs up to 
 faster than the genetic
approach with the generated ISEs having quality compara-
ble with the optimal solution in terms of overall speedup.
We observed that some of the ISEs identified by the optimal
algorithms are independent subgraphs and therefore an ISE
identification algorithm should not be restricted to identify
only connected subgraphs.
AES is a cryptographic benchmark with a large DFG;
its critical basic block contains 696 nodes with a symmetric
structure. Since the optimal algorithms (Exact and
Iterative) could not run on such a large application, we chose
the genetic solution (that also matches the optimal solution
in smaller benchmarks) for comparing our results. Because
of its non-exponential complexity, ISEGEN easily handles
large DFGs. We deliberately chose AES to demonstrate the
efficacy of our ISEGEN approach in matching expert design
quality. We increased the maximum number of AFUs from
1 to 4, and studied the speedup over execution on the
processor core as shown in Figure 6.
[Figure 6, first plot: Speedup Comparison in AES(696) for
$N_{ISE}$ = 1. Last plot: Speedup Comparison in AES(696) for
$N_{ISE}$ = 4. X-axis in both plots: I/O constraints (2,1),
(3,1), (4,1), (4,2), (6,3), (8,4). Series: Genetic, ISEGEN.]
Figure 6. Comparison of Speedup on AES with varying number of AFUs
On average, ISEGEN obtains   more speedup than
the genetic solution by effectively exploiting the regular-
ity in the data flow graph of AES. Figure 7 shows how the
structure yielded multiple instances of the same cut thereby
exposing the regularity in the application. Since AES has a
large number of nodes, it is intuitive to expect an increase
in speedup by increasing the allowed number of AFUs and
I/O constraints. However, it is interesting to note that,
contrary to our expectation, for a smaller number of allowed
AFUs (= 1), the speedup does not scale with relaxed I/O
constraints (as shown in the first plot of Figure 6). The
reason is clear from the plot of Figure 7. It shows that there are
 instances of the first cut for the I/O constraint of   
(or   ), while there are only  instances for the I/O con-
straint of    . As is evident from the first plot of Figure 6,
the  instances generated for    cover the DFG better
than the  instances generated for    . However, with
increase in the allowed number of AFUs, the speedup be-
gins to scale with increasing I/O constraints (as shown in
the last plot of Figure 6). Therefore, our ISEGEN not only
generates ISEs resulting in high speedup but also exploits
their reusability by producing all the instances in the DFG
(as shown in Figure 7). Thus, the solutions generated by
ISEGEN are indeed close to those generated by an expert
designer.
ReusabilityofCutsinAES
0
2
4
6
8
10
12
14
(2,1) (3,1) (4,1) (4,2) (6,3) (8,4)
I/OConstraints
N
u
m
b
e
r
o
fI
n
s
ta
n
c
e
s
CUT1
CUT2
CUT3
CUT4
Figure 7. Study of Reusability of ISEs on AES
with varying number of AFUs
6 Conclusions
The hardware-software partitioning problem, when applied
at the instruction-level granularity, constitutes the
problem of ISE generation. The contributions presented
in this paper are as follows. First, we clearly identified
the properties of ISEs that are of interest to an expert
designer. Second, we adapted the well-known Kernighan-Lin
heuristic to perform ISE generation with a low
computational complexity. Finally, we showed that our ISEGEN
approach produces high-quality ISEs, close to those sought
after by an expert designer. Furthermore, ISEGEN runs up
to 	  faster than a previous genetic approach and gener-
ates solutions comparable with the optimal ISE generation
approaches. Our future work will focus on the deployment
of ISEs in a real system and evaluating the impact of ISEs
on code size and energy reduction.
References
[1] F. Vahid and T. D. Le. Extending the Kernighan/Lin Heuristic for Hardware
and Software Functional Partitioning. In Kluwer Journal on Design Automa-
tion of Embedded Systems, 1997.
[2] C. M. Fiduccia and R. M. Mattheyses. A Linear-time Heuristic for Improving
Network Partitions. In Proc. of DAC, 1982.
[3] K. Atasu, L. Pozzi and P. Ienne. Automatic Application-Specific Instruction-
Set Extensions under Microarchitectural Constraints. In Proc. of DAC, 2003.
[4] P. Biswas, V. Choudhary, K. Atasu, L. Pozzi, P. Ienne and N. Dutt. Introduc-
tion of Local Memory Elements in Instruction Set Extensions. In Proc. of
DAC, 2004.
[5] N. Clark, H. Zhong and S. Mahlke. Processor Acceleration through Auto-
mated Instruction Set Customization. In Proc. of MICRO, 2003.
[6] P. Yu and T. Mitra. Scalable Custom Instructions Identification for
Instruction-Set Extensible Processors. In Proc. of CASES, 2004.
[7] F. Sun, S. Ravi, A. Raghunathan and N. K. Jha. Synthesis of Custom Proces-
sors based on Extensible Platforms. In Proc. of ICCAD, 2002.
[8] M. Arnold and H. Corporaal. Designing Domain-specific Processors. In Proc.
of CODES, 2001.
[9] P. Biswas, S. Banerjee, N. Dutt, L. Pozzi and P. Ienne. ISEGEN: Adapting
Kernighan-Lin Min-Cut Heuristic for Generation of Instruction Set Exten-
sions. CECS, UC Irvine, Technical Report CECS-TR-04-21.
[10] C. Alippi, W. Fornaciari, L. Pozzi and M. Sami. A DAG based Design Ap-
proach for Reconfigurable VLIW Processors. In Proc. of DATE, 1999.
[11] R. Razdan and M. D. Smith. A High-performance Microarchitecture with
Hardware-programmable Functional Units. In Proc. of MICRO, 1994.
[12] Machine SUIF. http://www.eecs.harvard.edu/hube/software/software.html.
