Hardware implementation of intelligent systems by Teodorescu, Horia-Nicolai et al.
Chapter 1 
Automated Design Synthesis and Partitioning for 
Adaptive Reconfigurable Hardware 
Ranga Vemuri, Sriram Govindarajan, Iyad Ouaiss, Meenakshi 
Kaul, Vinoo Srinivasan, Shankar Radhakrishnan, Sujatha 
Sundaraman, Satish Ganesan, Awartika Pandey, Preetham 
Lakshmikanthan 
Digital Design Environments Laboratory, ECECS Department, ML 0030, 
University of Cincinnati, Cincinnati, OH 45221-0030, USA 
The advent of reconfigurable logic arrays facilitates the development of 
adaptive architectures that have wide applicability as stand-alone 
intelligent systems. The hardware structure of such architectures can be 
rapidly altered to suit the changing computational needs of an application 
during its execution. The power of adaptive architectures has been 
demonstrated primarily in image processing, digital signal processing, and 
other areas such as neural networks and genetic algorithms. This chapter 
discusses the state-of-the-art adaptive architectures, their classification, 
and their applications. 
In order to effectively exploit adaptive architectures, efficient and 
retargetable design synthesis techniques are necessary. Further, the 
synthesis techniques must be fully integrated with design partitioning 
methods to make use of the multiplicity of reconfigurable devices provided 
by adaptive architectures. This chapter provides a description of a 
collection of synthesis and partitioning techniques and their embodiment in 
the SPARCS (Synthesis and Partitioning for Adaptive Reconfigurable 
Computing Systems) system. 
1 
Introduction 
With the advancement of the Integrated Circuit (IC) technology, the 
design of digital systems starting at the transistor-level or the gate-level is 
no longer viable. On the other hand, at higher levels of abstraction the 
design functionality and tradeoffs can be clearly stated. Therefore, use of 
H.-N. Teodorescu et al. (eds.), Hardware Implementation of Intelligent Systems
© Springer-Verlag Berlin Heidelberg 2001
4 R. Vemuri et al. 
automation from conceptualization to silicon became an integral part of the 
design cycle. This encouraged the development of Computer-Aided 
Design (CAD) automation tools that can handle the increasing complexity 
of the VLSI technology. In addition, CAD algorithms have the ability to 
perform a thorough search of the design space, i.e. contemplate several 
design possibilities, in order to generate high quality designs. 
Currently, designers follow a top-down methodology where they 
describe the intent of the design and let CAD tools add detailed physical 
structure to it. This method of synthesizing systems from a design 
description suits the design of large systems. Thus, the design process 
could handle the increasing demands on the system complexity as well as 
the time-to-market. High-Level Synthesis (HLS) [1], in particular, is the 
process of generating a structural implementation from the behavioral 
specification (functionality) of a design. VLSI technology has now reached 
a stage where high-level synthesis of VLSI chips and electronic systems is 
becoming more cost-effective and less time-consuming than manual 
handcrafting. 
The field-programmable logic arrays led to the development of 
reconfigurable devices. Reconfigurable devices such as the Field-
Programmable Gate Arrays (FPGAs) consisting of a sea of uncommitted 
logic devices offer the same performance advantages as that of the custom 
VLSI chips while retaining the flexibility of general-purpose processors. It 
is predicted that in the near future, reconfigurable devices will offer 100x 
performance improvement over contemporary microprocessors, 20x 
progress in density (gate-count), 10-100x reduction in power gate, and 
1,000,000x reduction in reconfiguration time compared to the current 
devices [2]. The advancement of the reconfigurable device technology 
facilitated the development of adaptive architectures that can dynamically 
change during the execution of an application. Such adaptive hardware 
architectures are at the heart of Reconfigurable Computers (RCs). An RC 
typically consists of one or more re-programmable processing elements, 
memory banks, interconnection network across these devices, and 
interface hardware to the external environment. The wide variety of RC 
resources enables RCs to act as stand-alone intelligent systems. 
Research in the field of CAD for RCs is still in its nascent stages. In 
order to demonstrate a performance improvement over conventional 
microprocessors, most of the applications are handcrafted for a specific RC 
architecture [3]. Typically, the handcrafted applications are small, 
requiring little hardware design and the RC architectures are simple, most 
often having a single reconfigurable device. The designers primarily rely 
on application-level parallelism to obtain speed-up, whereas many 
large-scale applications that have inherent computation-level, 
instruction-level, word-level, and bit-level parallelism have not been 
exploited. 
Design for Adaptive Reconfigurable Hardware 5 
A wide variety of RCs available in the market today offer tremendous 
reconfigurable computing power, and may be used for realizing 
implementations of computation intensive applications. In order to 
effectively exploit the RC architectures, efficient and retargetable design 
synthesis techniques are necessary. Further, the synthesis techniques must 
be fully integrated with design partitioning methods to make use of the 
multiplicity of reconfigurable devices provided by RC architectures. 
Synthesis and partitioning techniques need to mature to be able to fully 
utilize the computing power of RCs and handle challenging applications. 
This chapter is organized as follows. Section 2 provides a description of 
RC architectures, a survey of RC application domains, and a classification 
of currently available RCs. Section 3 describes a typical RC design flow 
and the fundamental problems in the design automation of RCs. Section 4 
provides an overview of the SPARCS synthesis and partitioning 
environment. Section 5 introduces the computational models used to 
capture the design specification in SPARCS. 
Sections 6 and 7 describe techniques used to solve the temporal and 
spatial partitioning problems, respectively. In Section 8, we describe the 
techniques for solving the interconnection synthesis problem. Finally, in 
Section 9, we provide insight into techniques used to solve some primary 
issues in design synthesis for RCs. 
2 
Adaptive Reconfigurable Hardware Systems 
A generic adaptive reconfigurable hardware system, or a reconfigurable 
computer, is shown in Figure 1. An RC typically consists of the following 
components: 
Reconfigurable devices: Although the capacity of a reconfigurable 
device, such as a Field Programmable Gate Array (FPGA) [4], has been 
increasing rapidly, it is nowhere close to the gate capacity of 
full-custom or semi-custom ASICs. Therefore, it is common to design 
large-scale RC systems using multiple reconfigurable devices, such as 
FPGAs or special-purpose processors, on a single printed circuit board. 
Memory banks: These are usually based on RAMs (Random-Access 
Memory) that provide data storage space for computation. They also 
provide means of data communication between the external environment 
and the reconfigurable devices. Memory banks can be viewed as either 
being shared between multiple reconfigurable devices or local to a single 
reconfigurable device. 
[Figure 1: block diagram whose recoverable labels are Local Memory (three blocks), Shared Memory, Interface Hardware, and I/O.] 
Fig. 1. A Generic Reconfigurable Computer Model 
Interconnection network: Interconnection network is a collection of 
dedicated or programmable connections between the components of the 
RC. A programmable interconnection network can be configured to 
provide a desired set of connections, whereas a dedicated network offers 
fixed hardware connections. 
Interface hardware: The interface consists of a variety of 
application-specific I/O connectors such as PCI ports, and/or extension ports to expand 
the RC hardware. The interface is used for downloading input data to the 
RC, controlling its reconfiguration, and monitoring its execution. 
2.1 
Reconfigurable Computer Applications 
The application domain suitable for RCs usually covers problems that are 
computationally intensive and exhibit parallelism. These include digital 
signal processing algorithms [5] [6], image processing [7], database search 
algorithms [7], multimedia algorithms [8], genetic algorithms [9], and 
neural network applications [10]. 
Table 1 lists a collection of design examples from various application 
domains. The designs have been primarily handcrafted for various RC 
architectures and the results published. These examples are Genetic 
Partitioning [11], Boolean Satisfiability [12], Neural Networks [13], Image 
Correlation [14], and Mean Filter [15]. An important feature is that all 
these systems have inherent data parallelism at different levels of 
abstraction. Most of the examples used little inter-device communication. 
Similarly, the examples make very little use of the RC memory, with data 
mostly being hard-wired or self-generated. Speed-up in the examples using 
Xilinx 4000 devices required multiple FPGA devices, with no host-to-device 
communication and only self-generated data. Speed-up in the 
examples using Xilinx 6200 devices required the data to be distributed 
within each device using the feature of direct addressability of logic cells. 
Also, these design examples were carefully synthesized into parallel and 
pipelined structures. 
Table 1: Comparative study of designs implemented on RCs 

Feature             Genetic             Boolean           Neural          Image              Mean
                    Partition           Satisfiability    Networks        Correlation        Filter
FPGA Type           XC6200              XC4000            XC3000          XC6200             XC6200
#FPGAs              1                   64                4               1                  1
Run-time Reconfig.  yes (partial        no                yes             yes (partial       no
                    reconfig.)                                            reconfig.)
Data Communic.      self-generated      self-generated    local memory    memory mapped +    memory mapped
                                                          (1 or 4)        hard wired         from host
Level of            block-level         algorithm-level   word-level      bit-level          word-level
Parallelism
Published Results   20 MHz, 2 parallel  projected 5x      2x speed-up     50x speed-up       10x speed-up
                    87 stage pipelines  speed-up          (more than      (10 MHz, 32 stage
                                                          23 FPGAs)       pipeline)
Goldstein and Schmit [3] have compiled a similar variety of design 
examples that demonstrate a 10x speed-up when using an RC over a 
conventional microprocessor. 
2.2 
Classification of Reconfigurable Computers 
Over the last decade, several RC architectures have emerged in order to 
meet the increasing computational demands of various application 
domains. Most RCs that are available currently can be classified based on 
the type of reconfigurable devices used: 
Field-Programmable Gate Array: The FPGA [4] [16] is usually a 
stream-reprogrammable device that can be reconfigured by serially loading 
the entire configuration bit-stream into a logical program register. For such 
devices, the reconfiguration time is high (on the order of 
milliseconds [4]) since the entire bit-stream needs to be reloaded. An RC 
architecture may be based on one or more FPGA devices. The Xilinx Demo 
Board [4] is an RC with a single FPGA device. Examples of multi-FPGA 
RCs are Wildforce [17], GigaOps [18], and Garp [19]. 
Reconfigurable Processor Array: The FPGA consists of an array of 
fine-grained programmable elements called CLBs (Configurable Logic 
Blocks) [4]; whereas, the reconfigurable processor array is a device 
consisting of an array of coarse-grained processing elements. The 
REMARC [20] is such a device consisting of an array of 64 16-bit 
reconfigurable processing elements and a global control unit. The 
REMARC device, unlike an FPGA, uses instruction words to control the 
configuration of the processing elements as well as interconnection 
network. An RC architecture based on the REMARC reconfigurable 
device is presented in [8] and another RC architecture based on a 
reconfigurable processor array is the Raw microprocessor [21]. 
Partially Reconfigurable FPGA: The recently introduced bit-
reprogrammable FPGAs, the 6200 series from Xilinx [22] make it possible 
to reconfigure selected portions of the device without having to reload the 
entire bit-stream. In these architectures, the reconfigurable logic is fully 
accessible such that specific logic cells can be addressed, or 
memory-mapped, and reconfigured as desired. Bit-reprogrammable devices 
reduce the reconfiguration time by a factor of 1,000,000, from milliseconds 
to nanoseconds. These devices are quite conducive to reconfiguration 
mid-way through an application execution. The Firefly [23] and Virtual 
Workbench [24] are examples of RCs based on a single partially 
reconfigurable FPGA. The ACE Card [25] and Wildstar [26] are examples 
of RCs based on multiple partially reconfigurable FPGAs. 
Context-Switching FPGA: This is an enhancement that allows complete 
reconfiguration of an FPGA at a rate far better than that of the standard 
FPGA. A context-switching FPGA device has the ability to store multiple 
configurations (or contexts) and switch between them on a clock-cycle 
basis. Also, a new configuration can be loaded while another configuration 
is active (or in execution). The Time-Multiplexed FPGA [27] from Xilinx 
Corporation can store up to seven contexts and switch between them 
through an internal controller. The Context-Switching FPGA [28] from 
Sanders, a Lockheed Martin Company, can store four contexts and has a 
powerful cross-context data sharing mechanism implemented within the 
device. 
Hybrid Processor: This type of hybrid architecture comprises a main 
processor that is tightly coupled with a Reconfigurable Logic Array 
(RLA). A portion of the device area is devoted to a conventional 
microprocessor architecture and the remainder to an RLA. NAPA [29] is 
such a hybrid processor consisting of an Adaptive Logic Processor 
resource combined with a general-purpose scalar processor called the 
Fixed Instruction processor. An RC architecture based on the NAPA 
processor is also presented in [29]. The MorphoSys [30] is another RC 
architecture that has a coarse-grained reconfigurable logic array and a 
MIPS-like RISC processor. 
From a different perspective, the RCs can also be classified based on 
their usage: 
Coprocessor: RC coprocessors are subservient to a host processor. The 
host processor executes the application, occasionally configuring the 
coprocessor to perform a special function and delegating the portions of 
the application that need that function to the coprocessor. The host 
maintains the set of permissible coprocessor configurations as a library of 
"hardware functions". If the host is a high-end workstation computer, it is 
possible to keep a large number of configurations on the hard disk. If the 
host is a small DSP-style motherboard, then it is possible to store only a 
small number of configurations in the RAM. This number of 
configurations that can be stored in the system impacts the synthesis 
process. Often coprocessor applications attempt to absorb the 
reconfiguration overhead through parallel execution of the host processor 
and the coprocessor. 
Embedded Processor: One can view the RC as an embedded processor 
when it is not attached to any host processor. Embedded RC architectures 
contain a finite number of alternative configurations stored in on-board 
ROMs (Read-Only Memory). A micro-operating system, also loaded in 
an on-board ROM, controls the loading of these configurations. 
Statically / Dynamically Reconfigurable Processor: During the course 
of execution of an application, the RC may be reconfigured only once to 
act as a statically reconfigurable processor, or several times to act as a 
dynamically reconfigurable processor. From an application perspective, 
statically reconfigurable RCs offer a finite set of hardware resources that 
can be configured to execute the application. On the other hand, 
dynamically reconfigurable RCs offer an infinite set of hardware 
resources, only a finite number of which can be used at any time during the 
execution of the application. 
[Figure 2: flow diagram whose recoverable block labels are Behavioral Specification, RC Architecture and Performance Constraints, High-Level Synthesis, RTL Specification, Logic Synthesis, Partitioning System, Light-Weight Estimation Algorithms, Gate Level Specification, Layout Synthesis, FPGA Specification, and Configuration Schedule.] 
Fig. 2. Design Automation for Reconfigurable Computers 
In this section, we described the generic RC architecture model, 
provided references to a number of RC architectures, and classified them 
according to the reconfigurable device type and usage. In the following 
section, we will describe design automation techniques for RCs. 
3 
Design Automation for Reconfigurable Computers 
The design automation techniques for RCs are shown in Figure 2. The 
design process generates bit-streams for configuring the hardware and a 
configuration schedule, which controls loading of the bit-streams on the 
hardware. The configuration schedule is either a software program running 
on a host-computer attached to the co-processor, or a controller program 
that is loaded on a ROM and running from within the embedded processor. 
This chapter will cover design automation techniques for the generation of 
configuration bit-streams such that the hardware resources on the RC are 
efficiently utilized. 
The design automation process involves synthesis and partitioning of a 
given design specification. The design flow can start from any of three 
levels of abstraction: behavioral level, Register-Transfer Level (RTL), and 
gate level. At each level, the RC architecture and the performance 
constraints are provided. The design is specified at the behavioral level as 
an algorithmic description, at the RT level as a structural net-list of 
components, and at the gate level as a set of Boolean equations. As we move 
top-down, from the behavioral level to the gate level, the design 
specification embodies more structural detail. 
At the RTL and gate level, since the design structure has already been 
decided, synthesis and partitioning of designs either lead to poor utilization 
of the RC resources, or have higher chances of failure. Moreover, it is 
almost impossible for the designer to consider all RC architectural 
constraints while specifying a design at the RTL or gate level. Vahid et al. 
[31] have shown the advantages of functional partitioning over structural 
partitioning approaches. Although a few special-purpose systems have a 
design entry at the gate/logic level [27], [28], the focus of 
state-of-the-art RC research is on automating the process of High-Level 
Synthesis (HLS) and behavioral partitioning of specifications, starting at a 
high level of abstraction. 
Further research is required in the area of HLS [32] [33] before these 
algorithms become practical and applicable to the RC design flow. On the 
other hand, logic and layout synthesis algorithms [32] [34] [35], introduced 
two decades ago, are well established, with sound mathematical models for 
optimization. At the expense of computing time, logic and layout synthesis 
algorithms perform a near-exhaustive search, and are capable of producing 
good quality designs. Although quite mature, logic and layout synthesis 
algorithms have not been able to efficiently control the area-speed 
tradeoffs for large-scale designs, the main drawbacks being poor device 
utilization and poor performance for high-density designs. This inability to 
handle large designs is compensated by allowing the HLS to explore 
tradeoffs at a higher level of abstraction. High-level synthesis techniques 
that perform efficient search of the design space are still emerging. 
Behavioral partitioning for RCs can be classified into two sub-problems, 
temporal partitioning and spatial partitioning, explained in the following 
sections. Behavioral partitioning algorithms need to obtain estimates about 
the design that is being partitioned. To synthesize designs that efficiently 
utilize the RC resources, the partitioning algorithms need to closely 
interact with lightweight (fast) estimation algorithms that predict the 
outcome of synthesis. 
The following sections will provide an informal overview of the 
fundamental problems involved in the design automation of RCs. For 
further details, interested readers can refer to the rich set of literature 
available in the proceedings of the FCCM [36] and the FPGA [37] 
conferences. 
3.1 
Design Specification Model 
The design specified in a Hardware Description Language (HDL), such as 
VHDL [38], is usually captured into an intermediate computational model 
and used for partitioning and synthesis. The computational model is 
typically a graph-based representation, with nodes representing elements 
of computation and edges representing the flow of data and control. Some 
well-known computational models are communicating sequential 
processes [38] [39], synchronous dataflow [40], program-state machines 
[41], and CDFGs [42] [43]. 
For the design automation process to effectively utilize the rich set of 
resources on an RC, the computational model of the specification should 
support the following features: 
(i) Explicit capture of parallelism at the fine-grain (e.g. statement-level 
in VHDL) and coarse-grain (e.g. process-level in VHDL) levels; 
(ii) Allow contiguous data storage (arrays) that can be mapped to the 
physical memory banks; 
(iii) Capture of data communication at the coarse-grain level (e.g. across 
computations among different processes in VHDL); 
(iv) Provide synchronization mechanism for computations at the coarse-
grain level of parallelism; 
(v) Capture computations at the fine-grain level of parallelism, using 
representation such as a Data Flow Graph (DFG). 
For these features, the model should provide well-defined synthesis 
semantics that are precisely interpreted by the design automation tools. 
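As an illustration only, the five features listed above could map onto a data structure along the following lines. The class and field names (Task, MemorySegment, Channel, TaskGraph) are hypothetical and are not taken from any of the cited computational models; the sketch simply shows where each feature would live.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """Coarse-grain unit of computation, e.g. one VHDL process (feature i)."""
    name: str
    # Fine-grain computation as a data flow graph (feature v):
    # operation name -> list of operations that consume its result.
    dfg: dict = field(default_factory=dict)


@dataclass
class MemorySegment:
    """Contiguous storage (an array) that can be mapped to a memory bank (feature ii)."""
    name: str
    words: int
    word_bits: int

    def bits(self):
        return self.words * self.word_bits


@dataclass
class Channel:
    """Coarse-grain data communication between two tasks (feature iii)."""
    source: str
    sink: str
    bits: int


@dataclass
class TaskGraph:
    tasks: dict = field(default_factory=dict)
    segments: dict = field(default_factory=dict)
    channels: list = field(default_factory=list)
    # Synchronization (feature iv): (before, after) ordering pairs between tasks.
    precedences: set = field(default_factory=set)

    def _reachable(self, start):
        """All tasks that must run after `start`, directly or transitively."""
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            for before, after in self.precedences:
                if before == node and after not in seen:
                    seen.add(after)
                    stack.append(after)
        return seen

    def parallel_pairs(self):
        """Pairs of tasks with no ordering between them, i.e. coarse-grain parallelism."""
        names = sorted(self.tasks)
        reach = {n: self._reachable(n) for n in names}
        return [(a, b)
                for i, a in enumerate(names)
                for b in names[i + 1:]
                if b not in reach[a] and a not in reach[b]]
```

With four tasks where "read" precedes two filters and both filters precede "write", parallel_pairs() reports only the two filters as candidates for concurrent execution.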
3.2 
Temporal Partitioning 
A behavioral specification that is too large to fit within the given RC 
hardware can be partitioned over time into a sequence of 
temporal segments. A temporal segment is a subgraph of the given 
computational model of the behavior. Every synthesized temporal segment 
is allowed to utilize all the resources in the RC. In order to execute the 
design, the RC is configured to execute each synthesized temporal 
segment, one at a time, in the sequence of temporal steps generated by 
temporal partitioning. Thus, the RC is dynamically reconfigured several 
times during the execution of an application. The key issues in temporal 
partitioning are: (1) the time taken to reconfigure the RC; (2) the memory 
space required to store the live data between temporal segments; and (3) 
the estimation of the hardware requirements of a synthesized temporal 
segment. 
Typically, a temporal partitioner attempts to minimize reconfiguration 
overhead. The partitioner also ensures that the amount of live data between 
temporal segments fits within the available memory space by assuming a 
lumped model of data communication. More importantly, the partitioner 
has to ensure that its estimates of the hardware requirements are 
close to the actual synthesized values. Otherwise, the design could fail 
later in the design process. Therefore, it is imperative that temporal 
partitioning obtains these estimates through some lightweight high-level 
synthesis process. 
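The three issues named in this section can be made concrete with a minimal sketch. It assumes the tasks are already listed in a valid execution order and that a per-task area estimate is available; the greedy packing below is a simple stand-in chosen for illustration, not the partitioning technique used later in this chapter, and the function names are hypothetical.

```python
def temporal_partition(order, area, capacity):
    """Greedily pack tasks into temporal segments.

    order:    task names in a valid (topological) execution order
    area:     task -> estimated area after synthesis
    capacity: area available on the whole RC in one configuration
    """
    segments, current, used = [], [], 0
    for task in order:
        if area[task] > capacity:
            raise ValueError(f"{task} alone exceeds the RC capacity")
        if used + area[task] > capacity:
            # Close the current segment; the RC is reconfigured before the next one.
            segments.append(current)
            current, used = [], 0
        current.append(task)
        used += area[task]
    if current:
        segments.append(current)
    return segments


def live_data(segments, edges):
    """Words that must stay in memory across each segment boundary.

    edges: (producer, consumer) -> number of data words
    Boundary i lies between segment i and segment i + 1.
    """
    index = {t: i for i, seg in enumerate(segments) for t in seg}
    live = [0] * max(len(segments) - 1, 0)
    for (src, dst), words in edges.items():
        # Data is live across every boundary between its producer and consumer.
        for boundary in range(index[src], index[dst]):
            live[boundary] += words
    return live


def fits_memory(live, memory_words):
    """Lumped check: does the live data at every boundary fit in the RC memory?"""
    return all(words <= memory_words for words in live)
```

The number of segments gives the number of configurations to load, live_data gives the storage needed between them, and the accuracy of the area dictionary determines whether a segment that appears to fit actually does once synthesized.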
3.3 
Spatial Partitioning 
Spatial partitioning involves partitioning each temporal segment into as 
many spatial segments as the number of reconfigurable devices on the RC. 
A spatial segment is a subgraph of a given temporal segment and a spatial 
partition of the temporal segment is the collection of all mutually 
exclusive spatial segments generated by spatial partitioning. The primary 
issues in spatial partitioning are: (1) estimating the area requirements of 
the spatial partition, such that each spatial segment when synthesized fits 
in the FPGA, (2) partitioning the memory requirements across the 
available memory banks, and (3) estimating the interconnect requirements 
between the spatial segments. 
The estimation of partition areas is complicated, especially in the 
presence of performance constraints such as latency of the temporal 
segment. This is because spatial partitioning of a behavior needs to make 
efficient area-speed tradeoffs in order to produce high quality designs. 
Furthermore, spatial partitioning usually performs memory partitioning 
along with the partitioning of computations. These problems can be 
efficiently solved only through a well-defined interaction with high-level 
exploration and synthesis. Finally, spatial partitioning is typically 
integrated with interconnection estimation, in order to determine the 
interconnection feasibility. 
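A rough sketch of the feasibility side of spatial partitioning follows. It checks a candidate assignment of tasks to devices against estimated area and against a simple pin budget; memory partitioning is left out, and the names (check_spatial_partition, device_pins) and the one-pin-per-bit interconnect model are assumptions made for the example.

```python
def check_spatial_partition(assignment, area, nets, device_area, device_pins):
    """Return the constraint violations of a candidate spatial partition.

    assignment:  task -> index of the device it is placed on
    area:        task -> estimated area
    nets:        (task_a, task_b) -> signal width in bits
    device_area: area capacity of each device
    device_pins: number of I/O pins available on each device
    """
    count = len(device_area)
    used_area = [0] * count
    used_pins = [0] * count
    for task, dev in assignment.items():
        used_area[dev] += area[task]
    for (a, b), width in nets.items():
        dev_a, dev_b = assignment[a], assignment[b]
        if dev_a != dev_b:
            # A net cut by the partition consumes pins on both devices.
            used_pins[dev_a] += width
            used_pins[dev_b] += width
    violations = []
    for dev in range(count):
        if used_area[dev] > device_area[dev]:
            violations.append((dev, "area", used_area[dev]))
        if used_pins[dev] > device_pins[dev]:
            violations.append((dev, "pins", used_pins[dev]))
    return violations
```

An empty result means the partition is feasible under these estimates; a search-based partitioner would call a check like this for every partition it contemplates, which is why the estimates have to be cheap to compute.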
3.4 
Interconnection Synthesis 
Many RC architectures [17] [18] [23] [24] [25] [26] provide the flexibility 
of programmable interconnection networks in order to realize different 
connectivity patterns among the RC resources. This poses a severe 
constraint on the partitioner in estimating the routability of signals across 
the contemplated partition segments. The effort required in modifying 
existing CAD tools to handle a new RC interconnect architecture is often 
comparable to that of developing a new CAD algorithm specific to that 
interconnect architecture. An ideal multi-device partitioning tool must be 
able to support a generic interconnection model. 
It is essential for any partitioning algorithm targeting multi-device 
boards to appropriately assign logic signals to the device I/O pins. This is 
due to the fact that pins of a device are not functionally identical in 
establishing the same connection pattern between the processing elements. 
The viability of routing connections between devices is contingent on the 
correct assignment of the logic signals to the pins. This problem is referred 
to as pin assignment and directly impacts the functionality of the 
programmable interconnection. The interconnection synthesis problem for 
RC architectures is the unified problem of generating a pin assignment and 
synthesizing the appropriate configuration stream for the programmable 
interconnection network, such that a given interconnection requirement is 
met. 
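To see why pins are not interchangeable, consider the simplest case of a dedicated (fixed) interconnect, sketched below. Each board wire joins one pin on one device to one pin on another, so a signal between two devices can only be assigned to a pin pair that is actually wired between them. The data layout and the function name assign_pins are hypothetical; a programmable network would additionally require generating its configuration, which is not shown.

```python
def assign_pins(signals, wires):
    """Assign inter-device signals to pins on a board with fixed wiring.

    signals: list of (signal_name, source_device, destination_device)
    wires:   list of ((device, pin), (device, pin)) fixed board connections
    Returns signal_name -> ((source_device, pin), (destination_device, pin)).
    """
    # Group the free wires by the unordered pair of devices they connect.
    free = {}
    for end_a, end_b in wires:
        free.setdefault(frozenset((end_a[0], end_b[0])), []).append((end_a, end_b))

    assignment = {}
    for name, src, dst in signals:
        candidates = free.get(frozenset((src, dst)), [])
        if not candidates:
            raise ValueError(f"no free wire for {name} between {src} and {dst}")
        end_a, end_b = candidates.pop(0)
        # Orient the wire so that the first endpoint is on the source device.
        if end_a[0] != src:
            end_a, end_b = end_b, end_a
        assignment[name] = (end_a, end_b)
    return assignment
```

Because every wire between a given pair of devices is equivalent in this fixed model, taking the first free one is sufficient; the failure case corresponds to an interconnection requirement that the board cannot meet.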
3.5 
Design Synthesis 
High-Level Synthesis (HLS) [1] [32] [33] [42] is the process of generating 
a structural implementation from a behavioral specification, so that the 
design constraints such as area, latency, clock period, power, etc. are best 
satisfied. The behavioral specification is typically algorithmic in nature, 
without any architectural details. The structural implementation is usually 
at the register-transfer level of abstraction consisting of a datapath and a 
controller. The datapath is a structural net-list of components, such as 
ALUs, registers, and multiplexers, and the controller is a finite state 
machine that sequences the execution of datapath components. 
Given a behavioral specification, there are many different structures that 
can realize the behavior. Each such structural implementation denotes a 
design point and the set of all possible design points determines the design 
space of the specification. One of the most compelling reasons for 
developing HLS systems [32] [44] is the desire to quickly explore a wide 
range of design points. The goal of design space exploration is to identify 
possible implementation alternatives from the design space of a behavioral 
specification, such that the design constraints are satisfied. In the following 
section, we will describe some basic issues that have to be addressed in 
design synthesis for RCs. 
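The notions of design point and design space can be illustrated with a small sketch. It assumes each design point is summarized by just two estimates, area and latency, and the example values in the usage note are invented for illustration; real exploration considers more constraints (clock period, power, and so on).

```python
def pareto_front(points):
    """Keep the design points that no other point beats in both area and latency.

    points: list of (name, area, latency); smaller is better for both.
    """
    front = []
    for name, area, latency in points:
        dominated = any(
            a <= area and l <= latency and (a < area or l < latency)
            for _, a, l in points
        )
        if not dominated:
            front.append((name, area, latency))
    return front


def satisfying(points, max_area, max_latency):
    """Design points that meet both the area and the latency constraint."""
    return [p for p in points if p[1] <= max_area and p[2] <= max_latency]
```

For example, given hypothetical points ("1 mult", 120, 16), ("2 mult", 200, 9), ("4 mult", 380, 6), and ("2 mult, extra regs", 230, 9), the last one is dropped from the front because "2 mult" has the same latency at a smaller area, and only "2 mult" satisfies an area bound of 250 with a latency bound of 10.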
Synthesis with Partitioning 
Design automation involving HLS and partitioning can be broadly 
classified into the vertical and the integrated design flows. There are two 
approaches to the vertical design flow, namely: (1) HLS followed by 
structural partitioning, and (2) behavioral partitioning followed by HLS. 
The primary disadvantage of the first approach [45] [46] is that structural 
partitioning is done on a pre-synthesized design and can fail, most often 
due to I/O pin shortage. The second approach [31] [47] [48] is more 
efficient since behavioral partitioning contemplates several partition 
solutions prior to HLS. However, this approach relies heavily on pre-
synthesized design points that limit the exploration process. 
On the other hand, researchers [31] [49] [50] developed the integrated 
design flow where the design space exploration is performed in 
conjunction with partitioning. Partitioning algorithms are typically based 
on global search techniques such as Genetic Algorithm (GA) or Simulated 
Annealing (SA), and hence contemplate millions of partitions during the 
search. It would be imprudent to apply exploration techniques based on 
either exact models [51] [52] [53] or simultaneous scheduling and binding 
[54] [55] [56], since it would take an impractical amount of time for 
partitioning with dynamic exploration. Hence, we require heuristic 
exploration techniques that primarily perform the scheduling phase of HLS 
along with design estimation. More importantly, the exploration technique 
should have the ability to simultaneously explore multiple spatial segments 
to effectively satisfy global constraints such as design latency. 
Arbiter Synthesis 
Another issue in design synthesis is resource sharing. After partitioning, a 
physical resource on the RC might be shared between parallel execution 
threads, thereby requiring arbitration. RCs offer a varying number of 
physical resources. For instance, an RC board can have a variable number 
of physical memory segments or a variable number of interconnection 
pins. To support architecture independence, the synthesis tool must be able 
to synthesize the same design for different boards. 
If the design makes use of L resources (e.g. logical memory segments) 
and the board only has P such resources (e.g. physical memory segments), 
then two cases arise: when L is less than or equal to P, and when L is 
greater than P. If L is less than or equal to P, then the mapping 
is straightforward: each design resource is mapped to an individual 
physical resource. On the other hand, when L is greater than P, the design 
uses more resources than there are physical resources on the 
multi-FPGA board. In this case, the mapping becomes difficult since more 
than one design resource has to be mapped to the same physical resource. 
This mapping might introduce resource access conflicts, since different 
processes might access different design resources that share a physical 
resource. 
It would be advantageous to have a mechanism that would resolve 
resource access conflicts thereby providing the flexibility of freely 
scheduling resource access during HLS. At the same time, this mechanism 
should not introduce complexity to the partitioning process. 
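The two cases, L at most P and L greater than P, can be sketched as follows. The round-robin mapping and the rule for deciding where an arbiter is needed are assumptions made for illustration (every listed process is assumed able to run in parallel with the others); the names map_resources and accessors are hypothetical.

```python
def map_resources(logical, physical_count, accessors):
    """Map L logical resources onto P physical ones and flag where arbitration is needed.

    logical:        list of logical resource names (e.g. logical memory segments)
    physical_count: number of physical resources on the board (P)
    accessors:      logical resource -> set of processes that access it
    Returns (mapping, needs_arbiter), where mapping is logical -> physical index
    and needs_arbiter lists the physical resources used by more than one process.
    """
    # Round-robin: when L <= P every logical resource gets its own physical one.
    mapping = {name: i % physical_count for i, name in enumerate(logical)}

    needs_arbiter = []
    for phys in range(physical_count):
        processes = set()
        for name, target in mapping.items():
            if target == phys:
                processes |= accessors[name]
        if len(processes) > 1:
            # Several parallel processes may request this resource at once.
            needs_arbiter.append(phys)
    return mapping, needs_arbiter
```

With three logical segments, each accessed by a different process, and two physical banks, the first and third segments land on bank 0, which then needs an arbiter; with three banks no arbitration is required.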
Integrating Logic Synthesis with HLS 
The traditional HLS process synthesizes operations in a behavior to 
components picked from a given RTL library. This library is 
pre-characterized so that the HLS can predict the area and performance of the 
design that is being synthesized. For the reconfigurable device technology, 
the pre-characterized component data is highly dependent on the specific 
layout of the component. Therefore, an HLS tool accepts a macro-library 
consisting of pre-synthesized macro components that have layout-specific 
shape information [54]. The use of such macro components enables HLS 
to make better estimates about the design. However, during logic 
synthesis, these macro components are treated as black boxes thereby 
preventing any kind of logic optimization across macros that would 
otherwise be achieved by fully flattened logic synthesis. 
The RTL design is a sequential circuit consisting of combinational 
blocks separated by registers. The HLS process can be viewed as making 
optimization decisions that result in the insertion of registers and the 
formation of the combinational blocks. Once the RTL design is generated, 
logic optimization is typically limited to within a combinational block. 
Therefore, it is necessary to come up with an efficient RTL design that 
maximizes logic optimization. This requires effective integration of logic 
synthesis with HLS. 
The macro library is usually a one-time pre-characterized set of 
components that support only basic operation types in the input 
specification. This is insufficient since it restricts HLS design decisions 
that select the combinational blocks of the RTL design. On the other hand, 
HLS would benefit greatly if the library were populated with efficient 
logic-optimized macros. 
Fig. 3. SPARCS System (figure omitted; recoverable labels: task 
specification input, Light-Weight High-Level Synthesis Estimator, Design 
Synthesis, Reconfiguration Schedule) 
4 
The SPARCS System: An Overview 
The SPARCS system [57] [58] [59] (Synthesis and Partitioning for 
Adaptive Reconfigurable Computing Systems) is an integrated design 
system for automatically partitioning and synthesizing designs for 
reconfigurable boards with multiple devices (FPGAs). The SPARCS 
system (see Figure 3) accepts behavioral design specifications in VHDL 
[38] and compiles a task graph model called the Unified Specification 
Model (USM) (explained in the following section). The SPARCS system 
contains a temporal partitioning tool to temporally divide and schedule the 
tasks on the reconfigurable architecture, a spatial partitioning tool to map 
the tasks to individual FPGAs, and a collection of design synthesis tools to 
synthesize efficient register-transfer level designs for each set of tasks 
destined to be downloaded on each FPGA. Commercial logic and layout 
synthesis tools are used to complete logic synthesis, placement, and 
routing for each FPGA design segment. A distinguishing feature of the 
SPARCS system is the tight integration of the partitioning and synthesis 
tools to accurately predict and control design performance and resource 
utilization. 
Another important feature of SPARCS is its ability to re-target a design 
to a variety of RCs, by accepting a target architecture specification. The 
target architecture for SPARCS is a co-processor environment consisting 
of a multi-FPGA board that is attached to a host computer. The host 
controls the loading of the design and monitors the execution of the board. 
The SPARCS system produces a set of FPGA bitmap files and a 
reconfiguration schedule (a software program) that specifies when these 
bitmap files should be loaded on the individual FPGAs in the RC. In the 
presence of programmable interconnect, a mask (configuration) of the 
interconnect for each temporal segment is also generated. 
5 
Design Specification Model in SPARCS 
In Section 3.1, we described a collection of features that a computational 
model should support in order to perform efficient design automation for 
RCs. We have developed the Unified Specification Model (USM) [60] that 
is highly suitable for specifications targeted to RC architectures. The USM 
embodies the Behavior Blocks Intermediate Format (BBIF) [61] that is 
well suited for high-level synthesis. In this section, we provide a short 
description of these computational models and further details may be 
obtained from [60] [61]. 
5.1 
The Unified Specification Model 
The USM is a hierarchical representation for capturing the behavior of a 
design. It is also possible to capture the behavior of the environment with 
which the design is supposed to interact. There are two levels in the 
hierarchy, the top level comprising coarse-grain design objects and the 
lower level describing the behavior of these objects. An example of the 
USM is shown in Figure 4. There are two types of design objects at the 
top level, a task (an ellipse in the figure) and a logical memory 
segment (a box in the figure). Tasks in the USM represent elements of 
computation and memory segments represent elements of data storage. In 
order to capture the computations of a task, the BBIF model is used. The 
memory segments are logical since the user has the flexibility to specify as 
many as required, irrespective of the number of physical memories 
available on the RC. The tasks are further classified into design tasks and 
environment tasks. Design tasks are those that need to be synthesized onto 
the RC and environment tasks are used to extract information about the I/O 
interface of the complete design and the protocol. 
Fig. 4. An example of the Unified Specification Model (figure omitted; 
legend: Local Memory, Dependency, Environment Task) 
The implicit execution semantics of the USM can be described as 
follows. All tasks, as well as memory segments, are said to be 
simultaneously alive during execution. This captures the parallelism at the 
coarse-grain level of abstraction. The synchronization mechanism between 
tasks is established using dependencies and the data communication is 
established using channels. Through dependencies, a task may wait for an 
initiation signal from other tasks. The execution cycle for a USM model 
finishes when all the tasks are indefinitely waiting. The model assumes an 
indefinite wait at the end of each task to denote the completion of the task. 
5.2 
The Behavior Blocks Intermediate Format 
The BBIF model is used to capture the behavior at the fine-grain level of 
parallelism. The BBIF¹ is a hierarchical CDFG [42] [32] representation 
with features well suited for formally verified HLS [61]. The BBIF model 
represents a behavioral task with a single thread of control. The BBIF is a 
graph with behavior block nodes and control flow edges where each block 
contains a DFG [42] [32]. Thus, data flow and computations are captured 
within each behavior block, while the control flow is captured between the 
blocks. The control flow starts at the root block and transfers from one 
block to another through the branch construct provided at the end of each 
block. The branch construct specifies either an unconditional transfer to a 
single successor block or a conditional transfer to one of a series of 
successor blocks. The control flow in BBIF can capture conditionals as 
well as loops specified in a behavior. 
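The block-and-branch structure described above can be mimicked with a small interpreter. This is only an illustrative sketch: the class `BehaviorBlock`, the function `run`, and the use of an ordered list of assignments in place of a real DFG are assumptions, not the actual BBIF format of [61].

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Env = Dict[str, int]

@dataclass
class BehaviorBlock:
    # Data flow: assignments evaluated in order, standing in for the block's DFG.
    assignments: List[Tuple[str, Callable[[Env], int]]]
    # Branch construct: returns the name of the successor block, or None to stop.
    branch: Callable[[Env], Optional[str]]

def run(blocks: Dict[str, BehaviorBlock], root: str, env: Env) -> Env:
    """Follow control flow from the root block until a branch returns None."""
    current: Optional[str] = root
    while current is not None:
        block = blocks[current]
        for target, expr in block.assignments:
            env[target] = expr(env)
        current = block.branch(env)
    return env

# A loop that sums 1..n: "init" and "body" branch unconditionally,
# "test" branches conditionally to "body" or terminates.
blocks = {
    "init": BehaviorBlock(
        assignments=[("i", lambda e: 1), ("acc", lambda e: 0)],
        branch=lambda e: "test",
    ),
    "test": BehaviorBlock(
        assignments=[],
        branch=lambda e: "body" if e["i"] <= e["n"] else None,
    ),
    "body": BehaviorBlock(
        assignments=[("acc", lambda e: e["acc"] + e["i"]),
                     ("i", lambda e: e["i"] + 1)],
        branch=lambda e: "test",
    ),
}
result = run(blocks, "init", {"n": 4})
print(result["acc"])
```

The single `current` variable reflects the single thread of control per task; the entry `n` in the initial environment plays the role of an input port.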
The BBIF model that represents a task interacts with the environment 
(another task in the USM) through input and output ports in the BBIF. 
6 
Temporal Partitioning 
The approaches to temporal partitioning can be classified in two 
categories. Complete reconfiguration techniques reconfigure the entire 
device and hence utilize the entire device area for each temporal segment. 
On the other hand, partial reconfiguration techniques are targeted for 
those special-purpose devices that can partially reconfigure while another 
portion is executing. In the following sections we provide a survey of 
techniques in these categories and an overview of the temporal partitioning 
techniques developed for SPARCS. 
¹ The BBIF model is used in Asserta, an HLS system developed at the University 
of Cincinnati [61]. 
6.1 
Temporal Partitioning for Complete Reconfiguration 
In the past, researchers have proposed temporal partitioning for designs at 
the logic level or circuit level. Spillane and Owen [62] perform temporal 
partitioning on gate-level designs. The algorithm first performs a 
cone-based clustering followed by a mapping of the clusters onto temporal 
segments. The entire gate-level design is transformed into a number of 
clusters that are then scheduled on an XC6000 family FPGA. Chang and 
Marek-Sadowska [63] perform an enhanced Force Directed Scheduling (FDS) 
[64] on gate-level sequential circuits for the time-multiplexed FPGA [65]. 
The enhanced FDS algorithm produces a design that reduces the amount of 
data transfer required between temporal segments. Trimberger [27] takes 
post logic synthesis and technology mapped designs and uses list-based 
scheduling and its variations to perform temporal partitioning for the 
time-multiplexed FPGA [65]. 
Researchers have also proposed temporal partitioning techniques at a 
higher level of abstraction using an operation-level data flow graph (DFG) 
as in traditional HLS. Vasilko and Ait-Boudaoud [66] have extended the 
static list-based scheduling technique of high-level synthesis for producing 
temporal segments. No functional unit sharing is performed and the area 
overhead due to registers needed to store data values between control steps 
is not taken into account. Gajjala Purna and Bhatia [67] perform a 
topological sort of the nodes in the DFG. Each node uses a distinct 
functional unit (from an RTL library), without any sharing. In their 
approach, each temporal segment is similar to a control step where 
registers are placed at the input and output of the temporal segment and 
operations are chained within each temporal segment. Gajjala Purna and 
Bhatia [68] also perform a depth-first assignment of nodes to temporal 
segments such that the memory transfer required between temporal 
segments is smaller compared to their earlier approach [67]. Cardoso and 
Neto [69] extend [67] by providing a priority function based on reducing 
the critical path length of the partitioned graph. For each node in the DFG, 
pre-synthesized macro components are present and no sharing of the 
macro components or register insertion occurs during the process of 
temporal partitioning. 
Optimal Temporal Partitioning in SPARCS 
Design space exploration: A shortcoming of current automated 
temporal partitioning techniques is the selection of an implementation of 
the components of their design prior to partitioning. Since there are 
multiple implementations of the components with varying area/delay 
values, it would be more effective to choose the design implementation 
while partitioning the design by dynamically performing design space 
exploration. Temporal partitioning in SPARCS uses the USM as the input 
model. For each task in the USM, different design points (or Pareto points 
[32]) are derived from its design space. Depending on the resource/area 
constraint for the design, different implementations of the same task, 
which represent different area-time tradeoffs, are contemplated while 
performing temporal partitioning. 
Block Processing: In many application domains such as Digital Signal 
Processing, computations are defined on very long streams of input data. 
In such applications, an approach known as block processing is used to 
increase the throughput of a system through the use of parallelism and 
pipelining (refer to parallel compilers [70] and VLSI processors [71]). 
Block processing is beneficial not only in parallelizing/pipelining 
applications, but in all cases where the net cost of processing k samples of 
data individually is higher than the net cost of processing k samples 
simultaneously. Most DSP applications, such as image processing, 
template matching and encryption algorithms, fall in this category. 
Integrated design-space exploration and block processing: When the 
reconfiguration overhead is very large compared to the execution time of 
the task, it is clear that minimizing the number of temporal partitions 
will achieve the smallest latency for the overall design. In the resultant 
solution, each task will usually be mapped to its smallest-area design 
point. However, the minimum latency design is not necessarily 
the best solution. We illustrate this idea with an example. In Figure 5(a), 
two tasks are shown. Each task has two different design points on which it 
can be mapped. Two different solutions (b) and (c) are shown. If a minimum 
latency solution is required, then solution (b) will be chosen over solution 
(c) because the latency of (b) is 500.3 μsec and the latency of (c) is 
1000.12 μsec. Now, if we use (b) and (c) in the block processing framework 
to process 5000 computations on each temporal partition, then the execution 
time for solution (b) is 2000 μsec and for solution (c) is 1600 
μsec. Therefore, if we can integrate the knowledge about block 
processing while design space exploration is being done, then it is possible 
to choose more appropriate solutions. The price paid for block processing 
is the higher memory requirements for the reconfigured design. We call 
the number of data samples or inputs to be processed in each temporal 
partition the block processing factor, k [72]. This is given by the user and 
is the minimum number of input data computations that this design will 
execute for typical runs of the application. The amount of block processing 
is limited by the amount of memory available to store the intermediate 
results. 
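The arithmetic behind the Figure 5 comparison can be checked with a few lines of code. The function name `execution_time_us` and the simple cost model (each partition is reconfigured once, then k samples pass through the summed task delays) are assumptions chosen to reproduce the numbers quoted in the text.

```python
def execution_time_us(num_partitions, delay_ns_per_sample, k, reconfig_us=500.0):
    """Total time in microseconds: reconfigure every partition once, then
    process k samples, each taking the combined task delay."""
    return num_partitions * reconfig_us + k * delay_ns_per_sample / 1000.0

# Solution (b): one partition, task delays 100 ns + 200 ns.
# Solution (c): two partitions, task delays 50 ns + 70 ns.
for k in (1, 5000):
    time_b = execution_time_us(1, 100 + 200, k)
    time_c = execution_time_us(2, 50 + 70, k)
    print(k, time_b, time_c)
```

Under this model solution (b) wins for a single computation, while solution (c) overtakes it once k exceeds roughly 2778, because the extra 500 μsec of reconfiguration is repaid at 0.18 μsec per sample.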
Fig. 5. Design space exploration with block processing (figure omitted; 
recoverable content follows) 
a) First task: Design Pt. 1: Area=100, Delay=100 ns; Design Pt. 2: 
Area=200, Delay=50 ns. Second task: Design Pt. 1: Area=150, Delay=200 ns; 
Design Pt. 2: Area=300, Delay=70 ns. Reconfiguration Time=500 μsec, 
FPGA size=300. 
b) Both tasks on Design Pt. 1 (Area=100, Delay=100 ns and Area=150, 
Delay=200 ns): Latency = 500 μsec + 100 ns + 200 ns. 
c) Both tasks on Design Pt. 2 (Area=200, Delay=50 ns and Area=300, 
Delay=70 ns): Latency = 2*500 μsec + 50 ns + 70 ns. 
Optimal Temporal Partitioning: The temporal partitioning and design 
space exploration problem can be formulated as an integer linear 
programming model. We will provide an overview of this model here, and 
interested readers may refer to [72] for the ILP equations and exact 
solution methods. This model can be informally stated as follows: 
Minimize the design execution time such that the following constraints are 
satisfied: 1. Each task is mapped to a temporal partition; 2. Each task is 
mapped to a design point; 3. The dependencies among the tasks are 
maintained; 4. The area constraint of each temporal partition is met; 5. The 
memory constraint is met. 
For a given partition bound, the ILP model corresponding to the 
temporal partitioning problem is formed and solved. The partition bound is 
the number of partitions for which the current model has been formed and 
a solution is being explored. The solution for the current partition bound is 
the best solution when the given task graph is partitioned over the given 
partitions. The optimization goal takes into account the reconfiguration 
overhead of the design and the actual design execution time. Therefore, as 
the design space is explored, each task gets mapped to an appropriate 
design point for which the overall design execution time will be least, 
while satisfying the dependency, area, and memory constraints. 
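For very small task graphs, the informal model can be explored by brute force, which may help in reading the constraints. The sketch below is not the ILP formulation of [72]: the function `best_partitioning` is hypothetical, tasks are assumed to execute sequentially so their delays add, the memory constraint is omitted, and a bound of N also admits solutions that use fewer than N partitions.

```python
from itertools import product

def best_partitioning(design_points, deps, fpga_area, reconfig_us, k, bound):
    """Exhaustively search task-to-partition and task-to-design-point mappings.

    design_points: {task: [(area, delay_ns), ...]}; deps: [(pred, succ)].
    Returns (time_us, partition_of_task, design_point_of_task) or None.
    """
    tasks = list(design_points)
    best = None
    # Constraints 1 and 2 hold by construction: every task gets exactly one
    # partition and exactly one design point.
    for parts in product(range(bound), repeat=len(tasks)):
        place = dict(zip(tasks, parts))
        # Constraint 3: a task may not run in an earlier partition than a predecessor.
        if any(place[u] > place[v] for u, v in deps):
            continue
        for choice in product(*(range(len(design_points[t])) for t in tasks)):
            pick = dict(zip(tasks, choice))
            area = [0] * bound
            for t in tasks:
                area[place[t]] += design_points[t][pick[t]][0]
            # Constraint 4: every partition must fit on the device.
            if max(area) > fpga_area:
                continue
            used = len(set(parts))
            delay_ns = sum(design_points[t][pick[t]][1] for t in tasks)
            time_us = used * reconfig_us + k * delay_ns / 1000.0
            if best is None or time_us < best[0]:
                best = (time_us, place, pick)
    return best

# The two tasks of Figure 5, with the second depending on the first (assumed).
points = {"t1": [(100, 100), (200, 50)], "t2": [(150, 200), (300, 70)]}
deps = [("t1", "t2")]
single = best_partitioning(points, deps, 300, 500.0, 1, 2)
block = best_partitioning(points, deps, 300, 500.0, 5000, 2)
print(single[0], block[0])
```

With k = 1 the search selects the single-partition solution (b); with k = 5000 it selects the two-partition solution (c), matching the discussion of Figure 5.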
The model given above will generate the best solution for a particular 
partition bound. To explore the solution space of the temporal partitioning 
problem, we need to explore more than one partition bound. Based on the 
design points for the tasks, we generate the range of partitions over which 
the solution must be explored. Solutions of ILP models with increasing 
partition bounds are explored until no further improvement in the solution 
is observed. To handle large design problems with our technique, we also 
present an iterative refinement procedure [73] that iteratively explores 
different regions of the design space and leads to a reduction in the 
execution time of the partitioned design. The ILP based integrated 
temporal partitioning and design space exploration technique forms a core 
solution method that is used in a constraint satisfying approach to explore 
different regions of the design space. 
Heuristic Temporal Partitioning in SPARCS 
In this approach, temporal partitioning and the scheduling phase of HLS 
are solved in a simultaneous step. The goal is to minimize the design 
latency defined as $k \cdot C_r + \sum_{p=1}^{k} d_p$, where $k$ is the total 
number of temporal segments, $C_r$ is the reconfiguration overhead of the 
RC, and $d_p$ is the schedule length of segment $p$. This section provides 
an overview of our approach, and more details can be obtained from [74] 
[75]. We will use the term segment to mean a temporal segment and the term 
control step to mean a clock step (or RTL state) introduced by scheduling. 
We have enhanced the Force Directed List Scheduling (FDLS) [64] 
algorithm to perform temporal partitioning and scheduling. FDLS is a 
resource constrained scheduling algorithm that finds a schedule with 
near-minimal control steps for a given resource set. The enhanced FDLS 
algorithm interacts with an estimation engine to estimate the RTL design 
cost while scheduling. If the RTL design cost of the scheduled 
specifications exceeds the device area, a new segment is formed by 
selecting a new resource set. The number of operations scheduled in a 
segment and the latency of each segment are decided by the resource set 
chosen. 
The algorithm generates a resource set for each segment. The resource 
set determines the schedule for that segment. Initially, a minimal resource 
set (one resource for each operation type) is selected to generate the initial 
solution (schedules for all segments). The cost of the initial solution is 
evaluated in terms of the overall design latency. The initial solution has 
maximum design latency, since the resource set is minimal. The algorithm 
then tries to improve the overall latency of the initial solution by exploring 
different resource sets for each segment iteratively. This can be achieved 
in two ways, either by reducing the latency of each segment, or by 
reducing the total number of segments. Both strategies are incorporated in 
the algorithm. 
For each segment, the algorithm iteratively explores various resource 
sets thereby exploring different possible schedules. For each segment, the 
resource set is incrementally enlarged, until either the resulting schedule 
does not improve the overall design latency or the segment area exceeds 
the device area constraint. At this point, the algorithm attempts to 
maximize the resource sharing, which could lead to a solution with a smaller 
number of temporal segments. The current resource set represents the best 
solution (least design latency) obtained so far. Therefore, the current 
segment is scheduled with the current resource set. The algorithm moves 
to the next temporal segment and performs the resource set exploration 
again. This process terminates when all nodes in the DFG have been 
assigned to temporal segments. 
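The resource-set exploration for one segment can be sketched as follows. This is a simplified stand-in rather than the enhanced FDLS algorithm of [74] [75]: it uses plain list scheduling with unit-delay operations, an additive area model in place of the estimation engine, and handles a single segment only. The names `list_schedule`, `explore_resource_set`, and the example data are hypothetical.

```python
def list_schedule(ops, deps, resources):
    """Resource-constrained list scheduling with unit-delay operations.

    ops: {name: type}; deps: [(pred, succ)]; resources: {type: count >= 1}.
    Assumes an acyclic graph. Returns the number of control steps.
    """
    preds = {op: set() for op in ops}
    for u, v in deps:
        preds[v].add(u)
    done, steps = set(), 0
    while len(done) < len(ops):
        free = dict(resources)
        chosen = []
        for op in ops:  # dictionary order serves as the (assumed) priority
            if op not in done and preds[op] <= done and free[ops[op]] > 0:
                free[ops[op]] -= 1
                chosen.append(op)
        done.update(chosen)  # results become visible in the next step only
        steps += 1
    return steps

def explore_resource_set(ops, deps, unit_area, device_area):
    """Start from one unit per operation type and add units one at a time
    while the schedule gets shorter and the area still fits the device."""
    resources = {t: 1 for t in set(ops.values())}
    best = list_schedule(ops, deps, resources)
    improved = True
    while improved:
        improved = False
        for t in sorted(resources):
            trial = {**resources, t: resources[t] + 1}
            area = sum(unit_area[x] * n for x, n in trial.items())
            steps = list_schedule(ops, deps, trial)
            if area <= device_area and steps < best:
                resources, best, improved = trial, steps, True
    return resources, best

ops = {"m1": "mul", "m2": "mul", "m3": "mul", "m4": "mul",
       "a1": "add", "a2": "add", "a3": "add"}
deps = [("m1", "a1"), ("m2", "a1"), ("m3", "a2"), ("m4", "a2"),
        ("a1", "a3"), ("a2", "a3")]
chosen_set, length = explore_resource_set(ops, deps, {"mul": 40, "add": 10}, 100)
print(chosen_set, length)
```

In this example the minimal resource set needs six control steps; adding a second multiplier shortens the schedule to four, and a third multiplier is rejected because it would exceed the device area.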
6.2 
Temporal Partitioning for Partial Reconfiguration 
Other automated techniques for temporal partitioning focus on identifying 
and mapping partially reconfigurable regions to reduce reconfiguration 
delay. Luk et al. [76] take advantage of the partial reconfiguration 
capability of FPGAs and automate techniques of identifying and mapping 
reconfigurable regions from pre-temporally partitioned circuits. They 
represent successive temporal segments as weighted bipartite graphs to 
which matching algorithms are applied to determine common components. 
Schwabe et al. [77] take advantage of some of the features of the Xilinx 
XC6200 family of FPGAs to reduce the reconfiguration time overhead by 
compression of the configuration bit streams. They have developed an 
algorithm that compresses the bit stream, which is then decompressed by 
the embedded hardware on the FPGA. Lechner and Guccione [78] provide a 
Java based application-programming interface into the bit stream of the 
Xilinx 6200 family of FPGAs. A similar facility is provided by Guccione 
and Levi [79] for the Xilinx XC4000 family of devices. These interfaces 
provide a capability of designing, modifying and dynamically modifying 
the circuits for the FPGAs by operating on the FPGA bit streams. 
Partial Reconfiguration Technique in SPARCS 
Another approach to efficiently utilize the reconfiguration feature of a 
device is to overlap execution of portions of the design with 
reconfiguration of other portions of the design. This can lead to partial, 
if not complete, amortization of the reconfiguration overhead posed by the 
device. We provide an overview of this approach, and more details can be 
obtained from [80], [81]. 
Figure 6 depicts the overview of our approach. The first phase involves 
partitioning the design into a sequence of temporal partitions. This is 
followed by a pipelining phase, where the execution of each temporal 
partition is pipelined with the reconfiguration of the following partition. 
Referring to Figure 6, at the $i$-th instant, $TP_i$ executes and $TP_{i+1}$ 
reconfigures on the devices. The reconfiguration time of partition 
$TP_{i+1}$ is reduced due to the overlap with execution of $TP_i$. Similarly, 
the $(i+1)$-th instant involves overlap of execution of $TP_{i+1}$ and 
reconfiguration of $TP_{i+2}$. The overlapped load and execute cycle is 
repeated until all the temporal segments have been loaded and executed. 
The proposed approach can be explained as follows. Let $R_i$ be the 
reconfiguration time and $E_i$ be the execution time of the $i$-th segment. 
The total latency of the design using partial reconfiguration is: 
$$Lat_n = R_1 + \sum_{i=1}^{n-1} \max(R_{i+1}, E_i) + E_n \qquad (1)$$
Therefore, when $R_{i+1} - E_i \le 0$ for all $i$, $1 \le i \le n-1$, there 
is complete amortization of the reconfiguration overhead using partial 
reconfiguration. Hence, it is clear that in order to obtain significant 
improvement in design performance, the reconfiguration time of $TP_{i+1}$ 
should be comparable to the execution time of $TP_i$. This allows maximal 
overlap between execution and reconfiguration and results in considerable 
reduction in reconfiguration overhead. 
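Equation (1) is easy to evaluate directly. The sketch below compares it with a purely sequential load-then-execute schedule; the function names and the sample times are illustrative assumptions.

```python
def overlapped_latency(reconfig, execute):
    """Eq. (1): R_1 + sum over i of max(R_{i+1}, E_i) + E_n."""
    if len(reconfig) != len(execute) or not reconfig:
        raise ValueError("need one reconfiguration and one execution time per segment")
    # Pair R_{i+1} with E_i for i = 1 .. n-1.
    middle = sum(max(r_next, e) for r_next, e in zip(reconfig[1:], execute[:-1]))
    return reconfig[0] + middle + execute[-1]

def sequential_latency(reconfig, execute):
    """Baseline without overlap: every segment is loaded, then executed."""
    return sum(reconfig) + sum(execute)

R = [5, 4, 6, 3]   # reconfiguration times R_1 .. R_4 (arbitrary units)
E = [6, 6, 2, 7]   # execution times E_1 .. E_4
print(overlapped_latency(R, E), sequential_latency(R, E))
```

Here only the third overlap is imperfect (R_4 = 3 exceeds E_3 = 2), so the overlapped latency is one unit above the fully amortized value R_1 plus the sum of all E_i.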
Fig. 6. Partitioning/Pipelining Methodology (figure omitted; the input 
specification passes through temporal partitioning and then pipelining, 
yielding temporal partitions TP1 to TP5) 
When device reconfiguration times are much higher than design 
execution times, it becomes essential to group computationally intensive 
structures, e.g. loops, in a single temporal segment to increase $E_i$ and 
thereby minimize $R_{i+1} - E_i$. The loop structures in the specification 
can be of two types: the loops are explicit when cycles are detected in the 
input graph; the loops are implicit when the entire application is repeated 
on several sets of input data. The latter type of loop is typically handled 
by block processing. The target architecture model consists of partially 
reconfigurable devices that are split into two parts, for execution and 
reconfiguration. 
7 
Spatial Partitioning 
Spatial partitioning of a design may be performed at various levels: 
behavioral level, RTL, or gate-level. Gate-level and RTL partitioning are 
both structural level partitioning and are conceptually similar (graph 
partitioning) problems. In RTL partitioning, the nodes are components 
from an RTL library, while the components in gate-level partitioning are 
from the target specific device library. Problem sizes for gate-level 
partitioning are an order of magnitude larger than for RTL partitioning. 
Usually gate-level partitioning is used in the context of certain placement 
algorithms that use recursive partitioning strategies to minimize the 
routing length [82]. 
The RC research community has invested considerable effort in multi-FPGA 
partitioning [83] [84] [85] [86] [87] [88] [89] [90]. However, almost all of 
these have been post-HLS partitioning approaches. Chan et al. [83] 
partition with the aim of producing routable sub-circuits using a 
pre-partition routability prediction mechanism. 
Sawkar and Thomas [90] present a set cover based approach for 
minimizing the delay of the partitioned design. Limited logic duplication is 
used to minimize the number of chip-crossings on each circuit path. 
Bi-partition orderings are studied by Hauck and Borriello [84] to minimize 
critical bottlenecks during inter-FPGA routing. Woo [87], Kuznar [88], 
and Haung [89] primarily limit their partitioners to handle device area and 
pin constraints. A library of FPGAs is available and the objective is to 
minimize device cost and interconnect complexity [89] [88]. Functional 
replication techniques have been used [88] to minimize cut size. Neogi and 
Sechen [85] present a rectilinear partitioning algorithm to handle timing 
constraints for a specific multi-FPGA system. Fang and Wu [86] present a 
hierarchical partitioning approach integrated with RTL/logic synthesis. 
Behavioral partitioning is a pre-synthesis partitioning often called 
functional partitioning. Various studies [31] [91] [92] comparing 
behavioral and RTL partitioning show the superiority of the behavioral 
partitioning for large designs. However, behavioral partitioning must be 
guided by high-level estimators that make estimates on device area, 
memory size, I/O performance, and power. These estimations are 
performed by high-speed synthesis estimators. These estimators have to be 
lightweight because several thousand partition options may be examined. 
However, being light and accurate at the same time is very difficult. 
Sophisticated HLS estimation techniques are used to alleviate this 
difficulty, as described by Vahid [31]. A behaviorally partitioned system 
may use more gates, since hardware is not shared between partitions. 
However, since RTL partitions are I/O dominated, the RTL partitions do 
not tend to under-utilize the device. Thus, this increase in gates is not 
much of a concern. Behavioral partitioning has been promoted by several 
system level synthesis groups [31] [91] [92] [93] [94] [95] [96]. 
7.1 
Issues in Spatial Partitioning 
Besides architecture independence and integration with synthesis, the 
following is a list of issues that need to be addressed. 
Utilization of Interconnection Resources: Typically, pins and routing 
resources are a primary bottleneck in RCs. However, certain behavioral 
transformations may alleviate such bottlenecks. For example, data transfer 
between partitions can be time-multiplexed through the same wires or 
through the use of shared memories. 
Utilization of Memory: Memories in RCs have not been effectively 
used and memory partitioning is an open research area. Only recently, 
distribution of variables to memories has been discussed [97], but only in a 
limited scope. The authors do not address the issue of multiple FPGAs 
accessing a shared memory and the need for arbitration or synchronization. 
There are few partitioning environments for RCs that integrate both 
memory and multi-FPGA partitioning in a unified way. 
Cost Models: Integrated cost models must be developed to evaluate 
partitioned designs. Typical cost metrics in RCs are area violation, 
inter-FPGA routability, critical wire length and clock speed estimation, 
design latency, and power. Due to the conflicting nature of various cost 
metrics, aggregate cost functions that optimize different costs are 
difficult to develop. If not carefully tuned, such functions often tend to 
work well only for a limited set of applications. 
7.2 
Spatial Partitioning in SPARCS 
This section presents SPADE [96] [98], a system for partitioning designs 
onto multi-FPGA architectures. The input to SPADE is the Unified 
Specification Model (USM) [60] that is composed of computational tasks, 
memory tasks, and the communication and synchronization between tasks. 
SPADE consists of an iterative partitioning engine, an architectural 
constraint evaluator, and a throughput optimization and RTL design space 
exploration heuristic. We show how various architectural constraints can 
be effectively handled using an iterative partitioning engine. 
Fig. 7(a). Block Diagram of SPADE (figure omitted; recoverable labels: 
Behavioral Specification, Task Graph, Macro Library, Layout-driven 
synthesis for each individual task, RTL implementation options for each 
task (Design Space Table), Iterative Partitioning Engine, Reconfigurable 
Board Architectural Model, Architectural Constraint Evaluator) 
Fig. 7(b). GA Convergence Plot (figure omitted; curves: Best Fitness, 
Average Fitness, Area Cost, Interconnect Cost, Memory Cost; horizontal 
axis: Generations, 10 to 80; vertical axis ticks: 0.2 to 0.8) 
The SPADE system, shown in Figure 7(a), provides an environment to 
perform behavioral partitioning with integrated RTL design space 
exploration for a wide class of multi-FPGA architectures. Prior to 
partitioning, each task is individually synthesized and various RTL 
implementation options (design points) are produced using an HLS 
exploration tool. During partitioning, these options are evaluated and one 
implementation is selected for each task. The Architectural Constraint 
Evaluator (ACE) evaluates the constraint satisfaction of a contemplated 
partition based on the constraints inferred from the reconfigurable board 
model. 
The partitioning engine in Figure 7(a) represents any move-based 
partitioning approach where the partition state changes by moving nodes in 
the graph from one partition to another and re-evaluating the fitness each 
time a move is made. Stochastic hill-climbing approaches like simulated 
annealing [99], genetic algorithms [100], and the Fiduccia-Mattheyses (FM) 
algorithm [101] are some well-established move-based partitioning 
approaches. A detailed analysis and comparative study of these partitioning 
techniques can be found in [98]. 
7.3 
Architectural Constraint Evaluator 
The role of ACE is to evaluate the architectural constraint satisfaction of a 
contemplated partition of the task graph. ACE returns a fitness measure in 
the range [0, 1] depending on how well the various constraints are met. A 
fitness value of 1 indicates that all constraints are met. ACE considers the 
following constraints: area, memory, and inter-FPGA routability. The goal 
of the partitioner is to satisfy multiple conflicting 
constraints. Accordingly, the cost function that determines the cost of a 
partition is a combination of the conflicting cost factors. Each cost factor is 
normalized in the range [0, 1]: 
[ 
~ ~ AA; ~ K AM; ] N L.J,=o L.J,=o un 
cost = K + K + --L i=OArea(TG;) L i=OMem(TG;) Ntot 
where K is the number of FPGAs/partitions and: 
32 R. Vemuri eta/. 
{0 if Area(TG) ~ Area(Fi) LlA. = 
' Area (TGi)- Area (Fi) otherwise 
$N_{un}$ and $N_{tot}$ are the numbers of unrouted and total inter-FPGA 
nets, respectively. The interconnect evaluator is used to compute the 
number of unrouted nets. $\text{Area}(TG_i)$ is the estimated area of the 
partition segment $TG_i$ when each task in $TG_i$ is mapped to the 
least-area implementation from the available options. An area estimation 
heuristic that accounts for sharing of resources between tasks that execute 
at exclusive times is used to estimate the area of the task segment. The 
area estimation heuristic also accounts for the interconnect resources 
introduced due to sharing. $\text{Mem}(TG_i)$ is the total memory 
requirement of the partition segment $TG_i$, and $\text{Mem}(F_i)$ is the 
size of the local memory for the $i$th FPGA; the memory excess term in the 
cost is presumably defined analogously to the area excess, with Mem in 
place of Area. Fitness of a partition is given by 
$\text{fitness} = \frac{1}{1 + \text{cost}}$. A fitness of unity implies 
that all architectural constraints will be satisfied when all tasks are 
mapped to their respective least-area implementations. Further details of 
the constraint evaluator may be obtained from [96] [98]. 
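The cost and fitness computation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Segment` record, the function names, and the zero-denominator guard are hypothetical and not part of ACE, and the memory excess is assumed to mirror the area excess.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    area: float       # Area(TG_i): estimated area of the partition segment
    mem: float        # Mem(TG_i): memory requirement of the segment
    fpga_area: float  # Area(F_i): area capacity of the i-th FPGA
    fpga_mem: float   # Mem(F_i): local memory size of the i-th FPGA


def excess(required, available):
    """Overshoot of a resource; zero when the requirement fits."""
    return max(0.0, required - available)


def ratio(numerator, denominator):
    """Normalized cost factor; the zero guard is an assumption of this sketch."""
    return numerator / denominator if denominator else 0.0


def partition_cost(segments, unrouted_nets, total_nets):
    """Sum of the normalized area, memory, and routability cost factors."""
    area_term = ratio(sum(excess(s.area, s.fpga_area) for s in segments),
                      sum(s.area for s in segments))
    mem_term = ratio(sum(excess(s.mem, s.fpga_mem) for s in segments),
                     sum(s.mem for s in segments))
    net_term = ratio(unrouted_nets, total_nets)
    return area_term + mem_term + net_term


def fitness(cost):
    """fitness = 1 / (1 + cost); equals 1 only when every constraint is met."""
    return 1.0 / (1.0 + cost)


segments = [Segment(120, 10, 100, 16), Segment(80, 20, 100, 16)]
print(fitness(partition_cost(segments, unrouted_nets=5, total_nets=50)))
```

In this made-up example the area term is 20/200, the memory term is 4/30, and the routability term is 5/50, so the cost is 1/3 and the fitness is 0.75.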
7.4 
Partitioning Engine 
The partitioning engine invokes the throughput optimizer for those 
partitions whose fitness is one. The throughput optimization process 
explores the design space for the tasks and selects a faster RTL 
implementation for critical tasks in the design. Thus, the excess area 
available in the FPGAs is efficiently utilized to improve the throughput of 
the partitioned design. The goal of exploration is to minimize the critical 
path of the design without violating the area constraints. 
GA-based Partitioning: GAs work well for multi-modal, multi-objective 
cost functions and, unlike heuristic approaches, GAs efficiently 
move out of local optima and converge towards the global optimum. 
SPADE has an integer-coded genetic algorithm that is well tuned to 
optimize the fitness function presented in Section 7.3. Figure 7(b) shows 
the GA convergence plot for a two-dimensional FFT example partitioned 
for the Wildforce [17] (four FPGAs, four local memory banks, and partial-
crossbar interconnect) architecture. The FFT design has 12 compute tasks, 
12 memory tasks and close to 100 interconnections. The plot shows the 
fitness of the best solution found so far and the average fitness of the 
population over successive generations. The best fitness increases steadily from 0.67 
and converges close to 0.85. The individual cost values (area, interconnect, 
and memory) of the best solution are also shown in the plot. Further details 
of this work can be obtained from [98]. 
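To make the idea of an integer-coded GA concrete, the following Python sketch encodes a partition as a list of integers, where gene i holds the index of the FPGA assigned to task i. It is a minimal illustration and not the SPADE implementation: the area-only fitness, the elitist selection, the one-point crossover, the mutation rate, and all names are assumptions made here.

```python
import random


def fitness(chrom, task_area, fpga_area, k):
    """Area-only fitness: 1 / (1 + normalized area excess over all FPGAs)."""
    used = [0] * k
    for task, fpga in enumerate(chrom):
        used[fpga] += task_area[task]
    over = sum(max(0, u - fpga_area) for u in used)
    return 1.0 / (1.0 + over / sum(task_area))


def evolve(task_area, fpga_area, k, pop_size=20, generations=30, seed=0):
    """Return the best chromosome and the best fitness seen in each generation."""
    rng = random.Random(seed)
    n = len(task_area)  # this sketch assumes at least two tasks

    def score(chrom):
        return fitness(chrom, task_area, fpga_area, k)

    pop = [[rng.randrange(k) for _ in range(n)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        history.append(score(pop[0]))
        elite = pop[: pop_size // 2]          # survivors are kept unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, n)         # one-point crossover
            child = a[:cut] + b[cut:]
            if rng.random() < 0.2:            # mutation: reassign one task
                child[rng.randrange(n)] = rng.randrange(k)
            children.append(child)
        pop = elite + children
    return max(pop, key=score), history


best, history = evolve([30, 40, 20, 50, 10, 25], fpga_area=60, k=4)
print(best, history[0], history[-1])
```

Because the elite half of the population survives unchanged, the best fitness recorded per generation never decreases, which is the qualitative behavior of the "best solution" curve in a convergence plot.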
FM-based Partitioning: The RC interconnections impose multiple 
cutset constraints that the partitioner has to satisfy. The drawback of using 
a standard multi-way FM algorithm is that it tries to reduce the sum of all 
the cutsets. This may not help because each of the cuts has to be 
minimized individually for pin constraint satisfaction. Therefore, the 
multi-way FM has to be modified to behave as a constraint-satisfying 
algorithm rather than an optimizing algorithm. The modified FMPAR 
algorithm [102] attempts to simultaneously satisfy the individual pin 
constraints (between each pair of devices) and the device area constraints 
(on each partition segment). The algorithm has a collection of different 
types of moves that can be made given a current partition configuration. 
The moves are then prioritized based on the respective cutset violations. 
After a move, positive gains are assigned to moves that minimize cutset 
violations, and negative gains are assigned to moves that worsen the cutset. 
A move is accepted only if it satisfies the area constraints, thereby always 
moving towards area satisfying solutions. Thus, during each iteration, the 
algorithm contemplates several moves and accepts the best constraint-
satisfying move. The algorithm completes as soon as a constraint-
satisfying solution is obtained, or an upper-limit on the number of 
iterations and moves has been reached. Further details of this work can be 
obtained from [102]. 
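The constraint-satisfying behavior can be illustrated with a much simpler greedy loop than the algorithm of [102]. The Python sketch below assumes two-pin nets, a single pin capacity shared by every device pair, and a single area capacity per device; it has no move locking or move-type prioritization, and all names are hypothetical. It accepts only area-feasible moves and stops as soon as no pairwise cutset exceeds its capacity.

```python
def pair_cuts(edges, part):
    """Count the nets crossing each unordered pair of devices."""
    cuts = {}
    for u, v in edges:
        a, b = part[u], part[v]
        if a != b:
            key = (min(a, b), max(a, b))
            cuts[key] = cuts.get(key, 0) + 1
    return cuts


def violation(edges, part, pin_cap):
    """Total number of nets exceeding the per-pair pin capacity."""
    return sum(max(0, n - pin_cap) for n in pair_cuts(edges, part).values())


def areas(part, node_area, n_dev):
    """Area used on each device under the given assignment."""
    used = [0] * n_dev
    for node, dev in part.items():
        used[dev] += node_area[node]
    return used


def satisfy(edges, part, node_area, n_dev, area_cap, pin_cap, max_iters=100):
    """Greedy constraint satisfaction: stop once all pin constraints hold."""
    part = dict(part)
    for _ in range(max_iters):
        current = violation(edges, part, pin_cap)
        if current == 0:
            break
        used = areas(part, node_area, n_dev)
        best = None  # (gain, node, target device)
        for node, src in part.items():
            for dst in range(n_dev):
                # Only moves that keep the target device within its area
                # constraint are contemplated.
                if dst == src or used[dst] + node_area[node] > area_cap:
                    continue
                trial = dict(part)
                trial[node] = dst
                gain = current - violation(edges, trial, pin_cap)
                if best is None or gain > best[0]:
                    best = (gain, node, dst)
        if best is None or best[0] <= 0:
            break  # no area-feasible move reduces the cutset violations
        part[best[1]] = best[2]
    return part, violation(edges, part, pin_cap)


edges = [("a", "b"), ("a", "c"), ("a", "d")]
start = {"a": 0, "b": 1, "c": 1, "d": 0}
area = {"a": 1, "b": 1, "c": 1, "d": 1}
print(satisfy(edges, start, area, n_dev=2, area_cap=3, pin_cap=1))
```

In this small example two nets initially cross between the two devices while only one pin is available, and a single move removes the violation without exceeding the area capacity.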
8 
Interconnection Synthesis 
Interconnection synthesis is a highly architecture-specific task in any 
partitioning environment for reconfigurable multi-FPGA architectures. In 
this section, we will begin by reviewing some related research on inter-
FPGA routing in RCs. Then, we will describe the interconnection 
synthesis in SPARCS. 
8.1 
Related Research 
The problem of pin-assignment and inter-FPGA routing, in the presence of 
interconnection networks, has been investigated in the past. Hauck and 
Borriello [103] present a force-directed pin-assignment technique for 
multi-FPGA systems with fixed routing structure. Mak and Wong [104] 
present an optimal board-level routing algorithm for emulation systems 
with multi-FPGAs. 
Selvidge et al. [105] present TIERS, a topology independent pipelined 
routing algorithm for multi-FPGA systems. The interesting feature of 
TIERS is that it time-multiplexes the interconnection resources thus 
increasing its utility. However, the limitation is that the interconnection 
topology has only direct two-terminal nets between FPGAs. 
Khalid and Rose [106] present a new interconnection architecture called 
the HCGP (Hybrid Complete-Graph Partial-Crossbar). They show that 
HCGP is better than partial crossbar architectures. They present a routing 
algorithm that is customized for HCGP. 
A unique feature of the interconnection synthesis technique [107] 
employed in SPARCS is that it works for any generic interconnection 
architecture. Any type of programmable architecture can be specified and 
the interconnection topology is not fixed prior to partitioning. The 
necessary configuration information is produced as the result of 
interconnection synthesis. 
8.2 
Interconnection Synthesis in SPARCS 
Figure 8 shows the components that constitute the interconnect synthesis 
environment [107] [108] in SPARCS. The shaded region shows the 
interconnection synthesis tool. The RC interconnection architecture is 
specified using a hierarchical specification language and the architecture 
elaborator flattens the hierarchy as and when required. The Interconnection 
Synthesis Tool (IST) has two components, a symbolic evaluator and a 
boolean satisfier. The symbolic evaluator can generate various boolean 
equations representing the set of allowable connections in the 
interconnection network. The symbolic evaluator is tightly integrated with 
the boolean satisfier and produces the necessary boolean equations when 
queried. The set of desired interconnections is presented to the boolean 
satisfier as a simple boolean expression. The desired nets are generated by 
the partitioner based on the current partition configuration. The results of 
interconnection synthesis are: 1) the bits that achieve the desired 
interconnect configuration, and 2) an interconnection penalty that is fed 
back to the partitioner. 
We have used an RC architecture modeling language, PDL+ [109], to 
specify interconnect architectures. The PDL+ model of the architecture is then 
elaborated using the ARC system [110]. We use the symbolic evaluation 
capability in the ARC system to generate a boolean model that represents 
the given interconnect architecture. The boolean satisfier tool attempts to 
generate the configuration bits and the pin assignments such that the 
desired interconnections are routed. 
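The role of the boolean satisfier can be illustrated with a toy model. The Python sketch below is not the BDD-based tool used in SPARCS: it assumes a hypothetical architecture described as a list of programmable switches, each controlled by one configuration bit and joining two pins, and it finds configuration bits by brute-force enumeration rather than symbolic evaluation. A configuration is accepted when every desired net is connected and no two nets are shorted together.

```python
from itertools import product


def connected_groups(pins, switches, bits):
    """Union-find over pins, merging the endpoints of every enabled switch."""
    parent = {p: p for p in pins}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    for (a, b), enabled in zip(switches, bits):
        if enabled:
            parent[find(a)] = find(b)
    return {p: find(p) for p in pins}


def synthesize(pins, switches, nets):
    """Return the first bit vector that routes all nets, or None."""
    for bits in product((0, 1), repeat=len(switches)):
        group = connected_groups(pins, switches, bits)
        routed = all(group[a] == group[b] for a, b in nets)
        isolated = len({group[a] for a, _ in nets}) == len(nets)
        if routed and isolated:
            return bits
    return None


# Hypothetical ring of four FPGA pins joined by four programmable switches.
pins = ["F1", "F2", "F3", "F4"]
switches = [("F1", "F2"), ("F2", "F3"), ("F3", "F4"), ("F1", "F4")]
print(synthesize(pins, switches, [("F1", "F2"), ("F3", "F4")]))
print(synthesize(pins, switches, [("F1", "F3"), ("F2", "F4")]))
```

The first request is routable by enabling the first and third switches; the second is not, because any path from F1 to F3 in this ring passes through a pin of the other net. An unroutable result is the kind of outcome that would be reported back to the partitioner as an interconnection penalty.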
[Figure 8 (block diagram), labeled elements: Partitioning Engine; Desired 
Interconnections (nets); Interconnection Cost; IST, containing the Symbolic 
Evaluator (ARC, Architecture Elaboration) and the Boolean Satisfier (BDD 
Package); Configuration control variables; Hierarchical Interconnection 
Specification Model (modeled in PDL+); Configuration stream for control 
variables.] 
Fig. 8. Interconnect Synthesis Environment 
9 
Design Synthesis 
In this section, we will discuss the three primary issues of design synthesis: 
(1) interaction between HLS and partitioning, (2) synthesis of arbiters, and 
(3) integrating logic synthesis with HLS. For each of these categories, we 
will provide a survey of existing techniques and an overview of the 
technique employed in the SPARCS environment. 
9.1 
Interaction between HLS and Partitioning 
Related Work: In order to perform hardware design space exploration, 
researchers [31], [49], [50] have integrated the HLS exploration and 
estimation phase with partitioning. This led to the traditional 
heterogeneous model, shown in Figure 9(a), where the design area and 
latency costs of each contemplated partition segment were evaluated by the 
HLS phase. Several heterogeneous systems, such as SpecSyn [111], Chop 
[112], and Vulcan I [113], focused on providing good design estimates 
while not performing complete high-level synthesis. Later, researchers 
(COBRA-ABS [114], Multipar [115]) developed a completely 
homogeneous model, wherein high-level synthesis and partitioning are 
performed in a single step. The COBRA-ABS system has a Simulated 
Annealing (SA) based model and Multipar has an ILP based model for 
synthesis and partitioning. 
[Figure 9, labeled elements: a) Traditional Heterogeneous Model: Behavioral 
Design; Partitioning; Segment; HLS Estimation; estimates; High Level 
Synthesis; RTL designs. b) Proposed Heterogeneous Model: Behavioral Design 
+ Target architecture; Behavioral Partitioner; current CDFG configuration; 
Partitioning-Based HLS Exploration And Estimation; partition estimates; 
High Level Synthesis; RTL designs.] 
Fig. 9. Integrated Synthesis and Partitioning Models 
However, unifying partitioning and synthesis into a homogeneous 
model adds to the already complex sub-problems of high-level synthesis, 
leading to a large multi-dimensional design space. Therefore, the cost 
(design automation time) of having a homogeneous model is very high, i.e. 
either the run times are quite high (COBRA-ABS [114]) or the model 
cannot handle large problem sizes (Multipar [115]). The traditional 
heterogeneous model, although less complex, has a significant drawback 
of performing exploration on a particular partition segment, which is only 
a locality of the entire design space. 
Dynamic Exploration with Partitioning in SPARCS 
We have proposed a new HLS exploration technique [116] that combines 
the best features of both models. In the proposed heterogeneous model 
shown in Figure 9(b), both the partitioner and the HLS exploration engine 
maintain an identical view of the partitioned behavior, and the partitioner 
always communicates any change in the partitioned configuration. In the 
following paragraphs, we will provide an overview of the exploration 
model and the technique. Further details may be obtained from [116]. 
Exploration model: The Control Data Flow Graph (CDFG) [117] is a 
popular representation for a behavioral specification. The CDFG that we 
use is a block call graph (or BBIF [61]) shown in Figure 10(a). It consists 
of a set of nodes ($N_{blocks}$) called blocks, and edges that represent the 
flow of data and control across blocks. Each block contains an operation 
graph, which is purely a Data Flow Graph (DFG) [117]. The control flow at 
the end of a block can conditionally branch into one of the mutually 
exclusive blocks connected to it. The control flow also permits loops in the 
block call graph. The block call graph represents a single thread of control 
where all blocks are mutually exclusive in time. We define the following 
terms with respect to our partitioned CDFG model: (1) A partition 
$P_i \subseteq N_{blocks}$ is a subset of blocks in the CDFG. (2) A 
configuration $C_{set}$ is a set of mutually exclusive partitions of all the 
blocks. (3) A design point $DP_k$ is a set of schedules, one for each block 
in partition $P_k$. (4) A design space of a partition is the set of all 
possible design points bounded by the fastest (ASAP) and the slowest 
(smallest resource bag) schedules of all blocks in that partition. 
For the partitioned CDFG shown in Figure 10(a), $C_{set} = \{P_1, P_2\}$, 
where $P_1 = \{B_1, B_2\}$ and $P_2 = \{B_3, B_4\}$. Figure 10(b) shows the 
design points $DP_1$ and $DP_2$ corresponding to the partitions. Each 
partition is synthesized into an RTL design (datapath-controller pair 
[117]) for the corresponding device in the target multi-device 
architecture. Therefore, for each design point, an RTL design estimate is 
maintained as shown in Figure 10(c). In 
addition, the RTL resource requirement for each individual block is also 
maintained. Note that blocks belonging to a partition share all the datapath 
resources and a single finite state machine controller. Interested readers 
may refer to [50] [118] [119] for RTL design estimation techniques. 
[Figure 10: a) Partitioned CDFG; b) Scheduled design points; c) RTL Design 
Estimates. In panel c), each block (B1 to B4) has an individual block RTL 
estimate listing ALUs, registers, multiplexers, and a controller, and each 
partition has an RTL design estimate listing ALUs, registers, multiplexers, 
and a controller.] 
Fig. 10. Partitioning-based Exploration Model 
Exploration Technique: Given a subset of partitions 
$P_{set} \subseteq C_{set}$, the goal of the exploration technique is to 
schedule the subset of blocks $B_{set} = \bigcup_{P_k \in P_{set}} P_k$ such 
that the constraints, design latency and individual device areas, are best 
satisfied. The algorithm performs exploration in a loop where each 
iteration relaxes or tightens the schedule of a block. 
Relaxing (incrementing) the schedule length of a block could decrease the 
area of a partition and increase the latency of the entire design, while 
tightening has the opposite effect. At the core of the exploration algorithm is a 
collection of four cost functions [116] that determine the block to be 
selected for relaxing/tightening. Each cost function captures an essential 
aspect of the partitioning-based exploration model and these functions 
collectively guide the exploration engine in finding a constraint satisfying 
design. During each iteration, the blocks are scheduled at various possible 
schedule lengths, thereby distributing the design latency over the blocks in 
various combinations. 
Also, at each iteration, after re-scheduling a block, the corresponding 
partition area is re-computed, thereby dynamically maintaining the 
estimated areas of partitions. The exploration algorithm stops when either 
all the partition areas fall within the device area constraints, or none of the 
blocks can be relaxed or tightened without violating the design latency 
constraint. At the end of the exploration, if the area constraints are not 
satisfied, the blocks are reset to the schedules corresponding to the best 
area satisfying solution obtained so far. 
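The relax-and-re-estimate loop can be sketched as follows. This Python illustration is deliberately simplified and is not the algorithm of [116]: it only relaxes schedules starting from the fastest one, replaces the four cost functions with a single greedy criterion, assumes that the blocks of a partition share one datapath so that the largest block determines the partition area, and assumes that the design latency is the sum of the block schedule lengths. All names and numbers are hypothetical.

```python
def partition_area(blocks, choice, options):
    """Assumed estimate: shared datapath, so the largest block dominates."""
    return max(options[b][choice[b]][1] for b in blocks)


def total_excess(partitions, choice, options, device_area):
    """Sum of the area overshoot over all partitions."""
    return sum(max(0, partition_area(blocks, choice, options) - device_area)
               for blocks in partitions.values())


def latency(choice, options):
    """Assumed estimate: blocks execute one after another."""
    return sum(options[b][i][0] for b, i in choice.items())


def explore(partitions, options, device_area, max_latency):
    """options[b] lists (schedule_length, area) pairs, fastest schedule first."""
    choice = {b: 0 for blocks in partitions.values() for b in blocks}
    while True:
        excess_now = total_excess(partitions, choice, options, device_area)
        if excess_now == 0:
            return choice, True            # all partition areas fit
        best = None                        # (gain, block)
        for b, i in choice.items():
            if i + 1 >= len(options[b]):
                continue                   # block cannot be relaxed further
            trial = dict(choice)
            trial[b] = i + 1
            if latency(trial, options) > max_latency:
                continue                   # would violate the latency constraint
            gain = excess_now - total_excess(partitions, trial, options, device_area)
            if best is None or gain > best[0]:
                best = (gain, b)
        if best is None:
            return choice, False           # nothing can be relaxed any more
        choice[best[1]] += 1               # relax the selected block


partitions = {"P1": ["B1", "B2"], "P2": ["B3", "B4"]}
options = {
    "B1": [(2, 90), (3, 60), (4, 40)],
    "B2": [(1, 50), (2, 30)],
    "B3": [(2, 70), (3, 45)],
    "B4": [(1, 40)],
}
print(explore(partitions, options, device_area=60, max_latency=8))
```

After each relaxation the partition areas are recomputed, mirroring the dynamic maintenance of the area estimates. In the example, relaxing B1 and then B3 brings both partitions within the device area while the latency reaches, but does not exceed, the limit of 8.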
The exploration technique has the following unique features: (1) it has 
the capability to simultaneously explore the four-dimensional design space 
of multiple partition segments. Therefore, the technique can generate 
constraint-satisfying solutions in cases where the traditional heterogeneous 
model will fail. (2) The technique, unlike in a homogeneous model, uses a 
low-complexity heuristic instead of an exhaustive search. The 
effectiveness of the heuristic is demonstrated using an illustrative example 
in [116]. (3) It is independent of the partitioning algorithm and can be 
interfaced with any partitioner. The results presented in [116] demonstrate 
the effectiveness of the integrated exploration and partitioning 
methodology in generating constraint-satisfying designs. 
9.2 
Arbiter Synthesis in SPARCS 
Related Work: Several mechanisms exist to reuse pins for 
interconnections. Virtual wires [120] offer a way of overcoming pin 
limitations in FPGAs by statically scheduling data transfers so that 
multiple transfers reuse the same set of pins. This comes at the price of 
statically scheduling accesses. On the other hand, Vahid used functional 
partitioning and the concepts of Function Bus interprocessor bus and port 
calling to reduce the I/O requirements [121]. This solution came at the 
price of intrusive modifications to the partitioning and synthesis process. 
The following paragraphs briefly discuss the arbitration mechanism in 
SPARCS. Further details may be obtained from [122]. 
[Figure 11: an arbiter block with inputs Request 1, Request 2, ..., 
Request N and Clock, and outputs Grant 1, Grant 2, ..., Grant N.] 
Fig. 11. Generic N-bit arbiter 
Generic Arbitration: An arbiter should be introduced for each resource 
that is to be shared between processes executing in parallel. The size of the 
arbiter depends on the number of processes accessing that resource; and a 
general N-bit arbiter is shown in Figure 11. Arbiters are also referred to as 
mutual-exclusion circuits or interlocks [123]. 
For each process accessing a shared resource, two wires are introduced, 
Request and Grant, between the process and the resource's arbiter. 
When a process wants to access the shared resource, it asserts its Request 
line and waits until its Grant is asserted. Thus, at any given point, the duty 
of the arbiter is to receive zero or more Requests from processes and issue 
zero or one Grant. If there are no requests, then the arbiter should not 
assert any grants. On the other hand, if there are one or more requests, the 
arbiter should then assert exactly one grant. 
In the SPARCS system, arbitration is introduced to solve both memory 
mapping and pin limitation problems. These problems can be solved with 
the same technique since they can both be viewed as resource sharing 
conflicts: multiple data segments are mapped to a single physical memory 
(the shared resource), and multiple I/O connections are mapped to a single 
set of I/O pins (the shared resource). 
The implementation and functioning of arbiters depend on the 
environment that they will be in as well as other constraints that the 
application imposes. For SPARCS, the round-robin implementation was 
selected. This implementation supported fairness, low overhead in terms of 
area and delay, and ease of insertion and synthesis in this framework. 
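The request/grant behavior of a round-robin arbiter can be modeled behaviorally in a few lines of Python. This is a software sketch under assumptions, not the synthesized SPARCS arbiter: the class name and the `step` method are hypothetical, and the arbiter is assumed to re-arbitrate on every clock step rather than hold a grant until the request is released.

```python
class RoundRobinArbiter:
    """Behavioral model of an N-input round-robin arbiter."""

    def __init__(self, n):
        self.n = n
        self.last = n - 1  # requester 0 has the highest priority initially

    def step(self, requests):
        """One clock step: return a one-hot grant list, all zeros if idle."""
        grants = [0] * self.n
        # Search starts just after the most recently granted requester,
        # so every requester is served within N steps (fairness).
        for offset in range(1, self.n + 1):
            idx = (self.last + offset) % self.n
            if requests[idx]:
                grants[idx] = 1
                self.last = idx
                break
        return grants


arbiter = RoundRobinArbiter(3)
for requests in ([1, 1, 0], [1, 1, 0], [1, 1, 0], [0, 0, 0]):
    print(requests, "->", arbiter.step(requests))
```

With processes 0 and 1 requesting continuously, the grants alternate between them, and no grant is asserted when there are no requests, matching the zero-or-one-grant rule stated above.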
Memory Arbitration: Consider the case when a task T1 reads/writes 
from data segment M1 and task T2 reads/writes from data segment M2 
(Figure 12(a)). If the two memory segments M1 and M2 are assigned to 
the same physical memory bank on the RC board, then tasks T1 and T2 
share the same address bus, data bus, and read/write mode line of the 
memory bank. This creates a conflict since tasks T1 and T2 might be 
independent of one another (i.e. executing in parallel). 
Without arbitration, mutually exclusive access cannot be ensured for the 
address/data buses or for the select mode line. If T1 is writing to the 
address bus in clock step c1, T2 must not access the memory during this 
step. Moreover, during clock step c1, T2 must tristate its access to the 
address bus. In conclusion, when two memory accesses occur through the 
same physical memory bank, an arbitration scheme has to be present to 
avoid any conflicts on the bank. For the example shown in Figure 12(a), an 
arbiter solution is shown in Figure 12(b). 
a) Before memory mapping b) After memory mapping 
Fig. 12. Memory access arbitration 
Channel Arbitration: Pin limitation between processing elements 
might cause a practical problem when a design has to be partitioned across 
several connected processing elements. In actual RC boards, a limited 
number of pins is available for interconnection. Typically, a large number 
of pins on each processing element is already dedicated to accessing a 
memory bank attached to the processing element. Another set of pins is 
hardwired to adjacent processing elements. And finally, a limited set of 
pins might be dedicated to a programmable interconnect network that can 
connect processing elements with each other or with memory banks. 
Similar to the memory sharing mechanism described earlier, when the 
number of physical channels on the board is less than the number of 
logical connections required, then physical channels can be re-used. A 
single physical channel can be used by more than one writer/reader pair, 
provided that arbitration is introduced to avoid access conflicts. 
An example of channel sharing is shown in Figure 13. Two logical 
channels (k-bit and m-bit wide, with m < k) are merged onto one k-bit 
physical channel. For each receiving end of a shared channel, a register 
will be introduced whose enable originates from the source task (whereas 
for non-shared channels, a register is introduced at the source end). The 
reason for having registers at the receiving ends of each transfer is to 
ensure that data going to one of the targets will not be overwritten by 
future transfers. In addition, the presence of the registers allows transferred 
data to be stored and subsequent transfers to take place immediately. 
[Figure 13: a) Before channel sharing (PE-1, PE-2); b) After channel sharing] 
Fig. 13. Channel arbitration 
9.3 
Integrating Logic Synthesis and HLS 
The traditional HLS process attempts to synthesize a design predictably 
and thereby fails to utilize the power of logic synthesis tools. As explained 
in Section 3.5, HLS is heavily restricted to selecting components from a 
macro library. We have proposed an approach where macros are dynamically 
optimized during HLS for area and performance and added to the library. 
This enables HLS to explore regions of a newly created design space that 
traditional synthesis tools cannot reach. We call this application-specific 
macro-based synthesis, where macros are identified for the specific design 
application. 
Related Work: Subgraph matching is used to identify the parts of the 
design that improve the overall performance when replaced with efficient 
equivalents. This technique has been used before in [124] and [125]. 
Cadambi and Goldstein [124] look for certain aspects at the logic-level 
design such that a logic optimization improves the performance of the 
design. They build a macro library based on this knowledge and the 
improvement in the design performance. However, the synthesis of an 
application is restricted to this macro library. Srinivasa Rao and Kurdahi 
[125] try to exploit the structural regularity in the DFG of a behavioral 
specification and perform operation clustering to improve the performance. 
However, this approach is restricted to graphs that have repeating patterns. 
Application Specific Macro-Based Synthesis in SPARCS 
To circumvent the problem of having a restricted macro-library, we 
dynamically populate the library with several macros generated from the 
given application graph. We characterize each macro by dynamically 
performing logic and layout synthesis. The characterized information is 
then used to replace subgraphs of the DFG with macros. This macro-
replaced DFG is then taken through HLS to produce performance-
optimized designs. The essential aspects of this methodology are explained 
below, and more details can be obtained from [126]. 
For macro generation, the size (number of nodes) of the macro is used 
as a limiting factor. The algorithm is a modified depth-first search that 
terminates when the macro reaches the specified number of nodes. This 
method exhaustively produces all possible macro subgraphs of the given 
size from the original DFG. The macro 
replacement technique uses a pattern matcher to identify the parts of the 
graph where the macros can be replaced. The technique evaluates a gain in 
performance for replacing a subgraph, by computing the difference in the 
delays on the critical path and the second longest (delay) path. Using this 
gain value, the technique replaces nodes on the critical path such that the 
delay is reduced. The matching and replacement procedure is repeated 
until either there is no gain between two consecutive iterations or the 
design area exceeds the device area constraint. 
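The macro generation step can be illustrated with a small enumeration routine. The Python sketch below is an assumption-laden illustration rather than the algorithm of [126]: it treats the DFG as an undirected graph for connectivity, grows node sets depth-first until they reach the requested size, and uses hypothetical operation names. Characterization, pattern matching, and the gain computation are not modeled.

```python
def undirected_adjacency(edges):
    """Build a neighbor map from directed DFG edges, ignoring direction."""
    adj = {}
    for src, dst in edges:
        adj.setdefault(src, set()).add(dst)
        adj.setdefault(dst, set()).add(src)
    return adj


def connected_subgraphs(adj, size):
    """Enumerate every connected node set of the given size, each set once."""
    found = set()

    def grow(current, frontier):
        if len(current) == size:       # terminating condition: macro is full
            found.add(frozenset(current))
            return
        for node in frontier:
            new_current = current | {node}
            # Nodes adjacent to the grown set that are not yet part of it.
            new_frontier = (frontier | adj[node]) - new_current
            grow(new_current, new_frontier)

    for start in adj:
        grow(frozenset([start]), frozenset(adj[start]))
    return found


# Hypothetical DFG: two multiplications feed an addition, which feeds a subtraction.
dfg_edges = [("mul1", "add1"), ("mul2", "add1"), ("add1", "sub1")]
adj = undirected_adjacency(dfg_edges)
for macro in sorted(sorted(m) for m in connected_subgraphs(adj, 3)):
    print(macro)
```

For this four-node graph there are three candidate macros of size 3, each containing the addition. The enumeration is exponential in general, which is consistent with using the macro size as a limiting factor.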
10 
Conclusions 
Adaptive reconfigurable architectures play a central role as intelligent 
systems. They provide a platform for high-speed designs that can meet the 
performance requirements of various application domains such as image 
processing and digital signal processing. 
This chapter provided a survey of adaptive architectures and their 
application domains. These architectures were classified based on 
their usage and the type of reconfigurable devices used. A typical design 
flow for adaptive architectures was presented and the fundamental 
problems in design automation were discussed. For each of these 
problems, a literature survey of CAD techniques was presented. Finally, the 
chapter provided an insight into the collection of CAD techniques 
employed in SPARCS [57]. 
References 
[1] M. C. McFarland, A. C. Parker, R. Camposano: "Tutorial on High-Level 
Synthesis". In Proceedings of the 25th ACM/IEEE Design Automation 
Conference, pages 330-336, 1988. 
[2] R. Parker: "Adaptive Computing Systems: Chameleon". In Program 
Overview Foils, DARPA/ITO, July 1996 
[3] S. Goldstein, H. Schmit: "Reconfigurable Computing Seminar". In Course 
15-828/18-847, 
http://www.cs.cmu.edu/afs/cs.cmu.edu/academic/class/15828-s98/www, 
Spring 1998. 
[4] Xilinx, Inc.: "The Programmable Logic Data Book", 1998. 
[5] S. Y. Kung, H. J. Whitehouse, T. Kailath: "VLSI and Modern Signal 
Processing". Prentice-Hall, Inc., 1985. 
[6] L. B. Jackson: "Digital Filters and Signal Processing". Kluwer Academic 
Publishers, second edition, 1989. 
[7] S. R. Park, W. Burleson: "Configuration Cloning: Regularity in Dynamic 
DSP Architectures". In Proceedings of the ACM Symposium on FPGAs, 
pp. 81-89, 1999. 
[8] T. Miyamori, K. Olukotun: "A Quantitative Analysis of Reconfigurable 
Coprocessors for Multimedia Applications". In IEEE Symposium on Field 
Programmable Custom Computing Machines, pp. 2-11, 1998. 
[9] D. E. Goldberg: Genetic Algorithms in Search, Optimization, and Machine 
Learning. Addison-Wesley, Reading, MA, 1989. 
[10] J. M. Zurada: "Introduction to Artificial Neural Systems". West Publishing 
Company, 1992. 
[11] J. Koza et al.: "Evolving Computer Using Rapidly Reconfigurable Field-
Programmable Gate Arrays and Genetic Programming". In Proceedings of 
the ACM Sixth International Symposium on Field Programmable Gate 
Arrays (FPGA). ACM Press, 1998. 
[12] P. Zhong et al.: "Accelerating Boolean Satisfiability with Configurable 
Hardware". In Proceedings of the 6th Annual IEEE Symposium on FPGAs 
for Custom Computing Machines (FCCM), pp. 186-195, Napa, California, 
April 1998. IEEE Computer Society. ISBN 0-8186-8900-5. 
[13] J. Eldredge, B. Hutchings: "Density Enhancement of a Neural Network 
Using FPGAs and Run-Time Reconfiguration". In Proceedings of the 
Second Annual IEEE Symposium on FPGAs for Custom Computing 
Machines (FCCM), pp. 180-188, Napa, California, April 1994. IEEE 
Computer Society. 
[14] T. Kean, A. Duncan: "A 800 Mpixel/sec Reconfigurable Image Correlator on 
XC6216". In Proceedings of the International Workshop on Field-
Programmable Logic and Applications (FPL), 1997. 
[15] K. Simha: "NEBULA: A Partially and Dynamically Reconfigurable 
Architecture". Master's thesis, University of Cincinnati, 1998. 
[16] Altera Corporation: Reconfigurable Interconnect Peripheral Processor 
(RIPP10). http://www.altera.com. 
[17] Wildforce multi-FPGA board by Annapolis Micro Systems, Inc. 
http://www.annapmicro.com. 
[18] GigaOps multi-FPGA. http://www.gigaops.com. 
[19] J. Hauser, J. Wawrzynek: "GARP: A MIPS processor with a Reconfigurable 
Coprocessor". In International Symposium on Field-Programmable 
Custom Computing Machines, April 1997. 
[20] T. Miyamori, K. Olukotun: "REMARC: Reconfigurable Multimedia Array 
Coprocessor". In Proceedings of the ACM/SIGDA International Symposium 
on FPGAs, 1998. 
[21] E. Waingold et al.: "Baring it All to Software: Raw Machines". In IEEE 
Computer, pp. 86-93, September 1997. 
[22] Xilinx Inc.: "XC6200 FPGAs Product Description", April 1997. 
[23] Firefly XC6200-based single-FPGA board by Annapolis Micro Systems, Inc. 
http://www.annapmicro.com. 
[24] Virtual Workbench Virtex-based Rapid Prototyping Board. 
http://www.vcc.com. 
[25] ACEcard Hardware Designer's Manual. http://www.tsi-telsys.com. 
[26] Wildstar Virtex-based multi-FPGA board by Annapolis Micro Systems, Inc. 
http://www.annapmicro.com. 
[27] S. Trimberger: "Scheduling Designs into a Time-multiplexed FPGA". In 
Proceedings of the ACM/SIGDA International Symposium on FPGAs, 
1998. 
[28] S. M. Scalera, J. R. Vazquez: "The Design and Implementation of a Context 
Switching FPGA". In Proceedings of the ACM/SIGDA IEEE Symposium 
on FPGAs for Custom Computing Machines (FCCM), pp. 78-85, 1998. 
[29] C. Rupp et al.: "The NAPA Adaptive Processing Architecture". In 
Proceedings of IEEE Symposium on FPGAs for Custom Computing 
Machines, 1998. 
[30] H. Singh, N. Bagherzadeh, F. Kurdahi, G. Lu, M. Lee, E. Filho: 
"MorphoSys: a Reconfigurable Processor Targeted to High Performance 
Image Applications". In Reconfigurable Architectures Workshop, RAW in 
IPPS/SPDP, pp. 660-669, 1999. 
[31] F. Vahid et al.: "Functional Partitioning Improvements Over Structural 
Partitioning Constraints and Synthesis: Tool Performance". In ACM 
Transactions on Design Automation of Electronic Systems, volume 3, No. 2, 
pp. 181-208, April 1998. 
[32] G. De Micheli: "Synthesis and Optimization of Digital Circuits". McGraw-
Hill, 1994. 
[33] A. C. H. Wu, Y. L. Lin: "High-Level Synthesis-A Tutorial". In IEICE 
transactions on information and systems, vol. 17, No. 3, November 1995. 
[34] R. Murgai, R. K. Brayton, A. S. Vincentelli: "Logic synthesis for field-
programmable gate arrays". Kluwer Academic Publishers, 1995. 
[35] R. K. Brayton et al.: "Logic minimization algorithms for VLSI synthesis". 
Kluwer Academic Publishers, 1984. 
[36] Proceedings of FCCM: "Proceedings of Annual IEEE Symposium on Field-
Programmable Custom Computing Machines (FCCM)". IEEE Computer 
Society, 1992-current. 
[37] Proceedings of FPGA: "Proceedings of the Annual International ACM 
Symposium on FPGAs". ACM Publications, 1992-current. 
[38] IEEE Standard 1076-1993. IEEE Standard VHDL Language Reference 
Manual. 
[39] C. Hoare: "Communicating Sequential Processes". In ACM 
Communications, vol. 21, No. 98, pp. 666-677, 1978. 
[40] E. Lee, D. Messerschmitt: "Synchronous Data Flow". In IEEE Proceedings, 
vol. 75, No. 9, pp. 1235-1245, 1987. 
[41] F. Vahid, D. Gajski: "SLIF: A specification-level intermediate format for 
system design". In Proceedings of the European Design and Test (EDTC), 
pp. 185-189, 1995. 
[42] D. D. Gajski, N. D. Dutt, A. C. Wu, S. Y. Lin: "High-Level Synthesis: 
Introduction to Chip and System Design". Kluwer Academic Publishers, 
1992. 
[43] M. J. Farland: "Value Trace". Carnegie Mellon University, Internal Report, 
Pittsburgh, PA, 1978. 
[44] D. C. Ku, G. De Micheli: "High level synthesis of ASICs under timing and 
synchronization constraints". Kluwer Academic Publishers, 1992. 
[45] Tessier et al.: "The virtual wires emulation system: A gate-efficient ASIC 
prototyping environment". In Proceedings of the 3rd International ACM 
Symposium on FPGAs, Monterey, CA, 1995. 
[46] F. Johannes: "Partitioning of VLSI Circuits and Systems". In Proceedings of 
the 33rd Design Automation conference (DAC), 1996. 
[47] S. Antoniazzi et al.: "A methodology for control-dominated systems 
codesign". In Proceedings of the International Workshop on Hardware-
Software Codesign, pp. 2-9, 1994. 
[48] R. K. Gupta, G. De Micheli: "Hardware-software co-synthesis for digital 
systems". In IEEE Design and Test, vol. 10, No. 3, pp. 29-41, September 
1993. 
[49] W. J. Fang, A. C. H. Wu: "Integrating HDL Synthesis and Partitioning for 
Multi-FPGA Designs". In IEEE Design and Test of Computers, pp. 65-72,
April-June 1998. 
[50] N. Kumar, V. Srinivasan, R. Vemuri: "Hierarchical Behavioral Partitioning 
for Multi Component Synthesis". In Proceedings of the European Design 
Automation Conference, pp. 212-219, 1996. 
[51] S. Chaudhuri, S. A. Blythe, R. A. Walker: "A Solution Methodology for 
Exact Design Exploration in a Three-Dimensional Design Space". In IEEE 
Transactions on VLSI Systems, vol. 5, No. 1, March 1997.
[52] S. A. Blythe, R. A. Walker: "Toward a Practical Methodology for 
Completely Characterizing the Optimal Design Space". In 9th IEEE 
International Symposium on System Synthesis (ISSS), 1996. 
[53] S. Chaudhuri, S. A. Blythe, R. A. Walker: "An Exact Methodology for 
Scheduling in 3D Design Space". In 8th IEEE International Symposium on 
System Synthesis (ISSS), 1995. 
[54] M. Xu, F. J. Kurdahi: "Layout-driven RTL Binding Techniques for High-
Level Synthesis Using Accurate Estimators". In ACM Transactions on 
Design Automation of Electronic Systems, 1996. 
[55] J. M. Jou, S. R. Kuang: "A Library-Adaptively Integrated High-Level 
Synthesis System". In Proceedings of the National Science Council, Rep., 
vol. 19, No. 3, May 1995.
[56] I. Ahmad, M. K. Dhodhi, C. Y. R. Chen: "Integrated scheduling, allocation 
and module selection for design-space exploration in high-level synthesis". 
In IEEE Proceedings on Computers and Digital Techniques, vol. 142, No. 
1, pp. 65-71, January 1995. 
[57] I. Ouaiss, S. Govindarajan, V. Srinivasan, M. Kaul, R. Vemuri: "An 
Integrated Partitioning and Synthesis System for Dynamically 
Reconfigurable Multi-FPGA Architectures". In Proceedings of the 5th 
Reconfigurable Architectures Workshop (RAW), Lecture Notes in 
Computer Science 1388, pp. 31-36, April 1998.
[58] S. Govindarajan, I. Ouaiss, V. Srinivasan, M. Kaul, R. Vemuri: "An 
Effective Design System for Dynamically Reconfigurable Architectures". 
In Proceedings of 6th Annual IEEE Symposium on FPGAs for Custom 
Computing Machines (FCCM), pp. 312-313, Napa, California, April 1998.
IEEE Computer Society. 
[59] M. Kaul, V. Srinivasan, S. Govindarajan, I. Ouaiss, R. Vemuri: "Partitioning 
and Synthesis for Run-Time Reconfigurable Computers Using the 
SPARCS System". In Proceedings of the 1998 Military and Aerospace 
Applications of Programmable Devices and Technologies Conference 
(MAPLD'98), 1998. 
[60] I. Ouaiss, S. Govindarajan, V. Srinivasan, M. Kaul, R. Vemuri: "A Unified 
Specification Model of Concurrency and Coordination for Synthesis from 
VHDL". In Proceedings of the 4th International Conference on Information
Systems Analysis and Synthesis (ISAS), July 1998. 
[61] N. Narasimhan: "Formal-Assertions Based Verification in a High-Level 
Synthesis System". Ph. D. Thesis, University of Cincinnati, ECECS 
Department, 1998. 
[62] J. Spillane, H. Owen: "Temporal Partitioning for Partially-Reconfigurable-
Field-Programmable Gate Arrays". In Reconfigurable Architectures Workshop,
RAW in IPPS/SPDP, pp. 37-42. Springer, 1998. 
[63] D. Chang, M. Marek-Sadowska: "Partitioning Sequential Circuits on 
Dynamically Reconfigurable FPGAs". In ACM/SIGDA International 
Symposium on Field Programmable Gate Arrays, FPGA, pp. 161-167, 
ACM Press, 1998. 
[64] P. G. Paulin, J. P. Knight: "Force Directed Scheduling for the Behavior 
Synthesis of ASICs". In IEEE Transactions on CAD, vol. 8, pp. 661-679,
June 1989. 
[65] S. Trimberger, D. Carberry, A. Johnson, J. Wong: "A Time-Multiplexed 
FPGA". In FPGAs for Custom Computing Machines, FCCM, pp. 22-28. 
IEEE Computer Society Press, 1997. 
[66] M. Vasilko, D. Ait-Boudaoud: "Architectural Synthesis for Dynamically 
Reconfigurable Logic". In International Workshop on Field-
Programmable Logic and Applications, FPL, pp. 290-296. Springer, 1996. 
[67] K. M. GajjalaPurna, D. Bhatia: "Temporal Partitioning and Scheduling for
Reconfigurable Computing". In FPGAs for Custom Computing Machines, 
FCCM, pp. 329-330. IEEE Computer Society Press, 1998. 
[68] K. M. GajjalaPurna, D. Bhatia: "Emulating Large Designs on Small
Reconfigurable Hardware". In IEEE Workshop on Rapid System 
Prototyping, RSP, pp. 58-63. IEEE Computer Society Press, 1998. 
[69] J. M. P. Cardoso, H. C. Neto: "Macro-Based Hardware Compilation of Java 
ByteCodes into a Dynamic Reconfigurable Computing System". In 
Proceedings of FPGAs for Custom Computing Machines (FCCM), Napa 
Valley, California, 1999. 
[70] M. Wolfe: "High Performance Compilers for Parallel Computing". Addison-
Wesley Publishers, 1996. 
[71] S. Y. Kung: VLSI Array Processors. Prentice Hall, 1988. 
[72] M. Kaul, R. Vemuri: "Integrated Block processing and Design-Space
Exploration in Temporal Partitioning for RTR Architectures". In Jose 
Rolim, editor, Parallel and Distributed Processing, vol. 1586, pp. 606-615. 
Springer-Verlag, 1999. 
[73] M. Kaul, R. Vemuri: "Temporal Partitioning combined with Design Space 
Exploration for Latency Minimization of Run-Time Reconfigured 
Designs". In Design, Automation and Test in Europe, DATE, pp. 202-209. 
IEEE Computer Society Press, 1999. 
[74] A. Pandey, R. Vemuri: "Combined Temporal Partitioning and Scheduling for 
Reconfigurable Architectures". In SPIE Conference on Configurable 
Computing: Technology and Applications, September 1999. 
[75] A. Pandey: "Temporal Partitioning and Scheduling for Reconfigurable 
Architectures". Master's thesis, University of Cincinnati, ECECS 
Department, 1999. 
[76] W. Luk, N. Shirazi, P. Cheung: "Automating Production of Run-Time
Reconfigurable Designs". In FPGAs for Custom Computing Machines, 
FCCM, pp. 147-156. IEEE Computer Society Press, 1998. 
[77] S. Hauck, Z. Li, E. Schwabe: "Configuration Compression for the Xilinx 
XC6200 FPGA". In FPGAs for Custom Computing Machines, FCCM, pp. 
138-146. IEEE Computer Society Press, 1998. 
[78] E. Lechner, S. A. Guccione: "A Java environment for reconfigurable 
computing". In International Workshop on Field-Programmable Logic and 
Applications, FPL, pp. 284-293. Springer, 1997. 
[79] S. A. Guccione, D. Levi: "XBI: A Java-Based Interface to FPGA Hardware". 
In SPIE Conference on Configurable Computing: Technology and 
Applications, pp. 97-102, 1998. 
[80] S. Ganesan, A. Ghosh, R. Vemuri: "High-level Synthesis of Designs for 
Partially Reconfigurable FPGAs". In Proc. of 2nd annual Military and 
Aerospace Applications of Programmable Devices and Technologies 
Conference, MAPLD 99, September 1999. 
[81] S. Ganesan: "A Temporal Partitioning and Synthesis Framework to Improve 
Latency of Design Targeted towards Partially Reconfigurable 
Architectures". Master's thesis, University of Cincinnati, ECECS 
Department, 1999. 
[82] N. A. Sherwani: Algorithms for VLSI Physical Design Automation. Kluwer 
Academic Publishers, Boston, 1993. 
[83] P. K. Chan, M. Schlag, J. Zien: "Spectral-Based Multi-Way FPGA 
Partitioning". In Proc. of 3rd Int. Symp. FPGAs, pp. 133-139, 1995. 
[84] S. Hauck, G. Borriello: "Logic Partition Ordering for Multi-FPGA
Systems". In Proc. of 3rd Int. Symp. FPGAs, pp. 32-38, 1995. 
[85] K. Roy-Neogi, C. Sechen: "Multiple FPGA Partitioning with Performance 
Optimization". In Proc. of 3rd Int. Symp. FPGAs, pp. 146-151, 1995.
[86] W.-J. Fang, A. Wu: "A Hierarchical Functional Structuring and Partitioning 
Approach for Multi-FPGA Implementations". In IEEE Trans. on CAD, vol. 
9, No. 5, pp. 500-511, Nov. 1990.
[87] N.-S. Woo, J. Kim: "An Efficient Method of Partitioning Circuits for
Multi-FPGA Implementations". In Proc. 30th ACM/IEEE Design Automation
Conference, pp. 202-207, 1993.
[88] R. Kuznar, F. Brglez, B. Zajc: "Multi-way Net-list Partitioning into 
Heterogeneous FPGA and Minimization of Total Device Cost and 
Interconnect". In Proc. 31st ACM/IEEE Design Automation Conference, pp.
228-243, 1994. 
[89] D. Huang, A. B. Kahng: "Multi-Way System Partitioning into a Single Type 
or Multiple Types of FPGAs". In Proc. of 3rd Int. Symp. FPGAs, pp. 140-
145, 1995. 
[90] P. Sawkar, D. Thomas: "Multi-way Partitioning for Minimum Delay for 
Look-Up Table Based FPGAs". In Proc. 32nd ACM/IEEE Design 
Automation Conference, pp. 201-205, 1995. 
[91] N. Kumar: High Level VLSI Synthesis for Multichip Designs. Ph. D. Thesis, 
University of Cincinnati, 1994. 
[92] N. Kumar, V. Srinivasan, R. Vemuri: "Hierarchical Behavioral Partitioning 
for Multi Component Synthesis". In Proc. European Design Automation 
Conference, pp. 212-219, 1996. 
[93] R. K. Gupta, G. De Micheli: "System-level Synthesis using Re-
programmable Components". In Proc. European Design Automation 
Conference, pp. 2-7, 1992. 
[94] K. Kucukcakar: System-Level Synthesis Techniques with Emphasis on 
Partitioning and Design Planning. Ph. D. Thesis, University of Southern 
California, CA, 1991. 
[95] F. Vahid, D. D. Gajski: "Specification Partitioning for System Design". In 
Proc. of 29th Design Automation Conference, pp. 219-224, 1992.
[96] V. Srinivasan, R. Vemuri: "Task-level Partitioning and RTL Design Space 
Exploration for Multi-FPGA Architectures". In Int. Symposium on
Field-Programmable Custom Computing Machines, April 1999.
[97] M. Gokhale, J. Stone: "Automatic Allocation of Arrays to Memories in 
FPGA Processors with Multiple Memory Banks". In Int. Symposium on
Field-Programmable Custom Computing Machines, April 1999.
[98] V. Srinivasan: Partitioning for FPGA-Based Reconfigurable Computers. Ph. 
D. Thesis, University of Cincinnati, USA, August 1999. 
[99] S. Kirkpatrick, C. D. Gelatt, M. P. Vecchi: "Optimization by Simulated
Annealing". In Science, vol. 220, No. 4598, pp. 671-680, 1983. 
[100] J. Holland: Adaptation in Natural and Artificial Systems. Ann Arbor: 
University of Michigan Press, 1997. 
[101] C. Fiduccia, R. Mattheyses: "A linear time heuristic for improving network 
partitions". In Proceedings of the 19th Design Automation Conference
(DAC), pp. 175-181, 1982. 
[102] P. Lakshmikanthan: "Partitioning of Behavioral Specifications for
Reconfigurable Multi-FPGA Architectures". Master's thesis, University of 
Cincinnati, ECECS Department, 1999. 
[103] S. Hauck, G. Borriello: "Pin Assignment for Multi-FPGA Systems". In
Proc. of FPGAs for Custom Computing Machines, pp. 11-13, 1994.
[104] W. Mak, D. F. Wong: "On Optimal Board-Level Routing for FPGA based 
Logic Emulation". In Proc. 32nd ACM/IEEE Design Automation
Conference, pp. 552-556, 1995. 
[105] C. Selvidge, A. Agarwal, M. Dahl, J. Babb: "TIERS: Topology 
Independent Pipelined Routing and Scheduling for Virtual Wire 
Compilation". In Proc. Int. Symp. FPGAs, pp. 25-31, Feb. 1995.
[106] M. Khalid, J. Rose: "A Hybrid Complete-Graph Partial-Crossbar Routing
Architecture for Multi-FPGA Systems". In Proc. Int. Symp. FPGAs, pp. 
45-54, Feb. 1998. 
[107] V. Srinivasan, S. Radhakrishnan, R. Vemuri, J. Walrath: "Interconnect 
Synthesis for Reconfigurable Multi-FPGA Architectures". In Proceedings 
of Parallel and Distributed Processing (RAW'99), pp. 597-605. Springer,
April 1999.
[108] S. Radhakrishnan: "Interconnect Synthesis for Reconfigurable Multi-FPGA 
architectures". Master's thesis, University of Cincinnati, ECECS 
Department, April 1999. 
[109] R. Vemuri, J. Walrath: "Abstract models of reconfigurable computer 
architectures". In SPIE'98, Nov. 1998. 
[110] J. Walrath, R. Vemuri: "A Performance Modeling and Analysis 
Environment for Reconfigurable Computers". In Proceedings of Parallel 
and Distributed Processing, pp. 19-24. Springer, March 1998. 
[111] D. D. Gajski, F. Vahid et al.: "Specification and Design of Embedded 
Systems". Prentice-Hall Inc., Upper Saddle River, NJ, 1994.
[112] K. Kucukcakar, A. Parker: "CHOP: A constraint-driven system-level 
partitioner". In Proceedings of the Conference on Design Automation, pp. 
514-519, 1991. 
[113] R. K. Gupta, G. De Micheli: "Partitioning of functional models of
synchronous digital systems". In Proceedings of the International 
Conference on Computer-Aided Design, pp. 216-219, 1990. 
[114] A. A. Duncan, D. C. Hendry, P. Gray: "An Overview of the Cobra-ABS 
High-Level Synthesis System for Multi-FPGA Systems". In Proceedings of 
FPGAs for Custom Computing Machines (FCCM), pp. 106-115, Napa 
Valley, California, 1998. 
[115] Y. Chen, Y. Hsu, C. King: "MULTIPAR: Behavioral partition for 
synthesizing multiprocessor architectures". In IEEE Transactions on VLSI 
Systems, vol. 2, No. 1, pp. 21-32, March 1994.
[116] S. Govindarajan, V. Srinivasan, P. Lakshmikanthan, R. Vemuri: "A 
Technique for Dynamic High-Level Exploration During Behavioral-
Partitioning for Multi-Device Architectures". In Proceedings of the 13th
International Conference on VLSI Design (VLSI 2000), 2000. 
[117] R. Walker, R. Camposano: "A Survey of High-Level Synthesis Systems". 
Kluwer Academic Publishers, 1991. 
[118] H. Mecha, M. Fernandez and K. Olcoz: "A Method for Area Estimation of 
Data-Path in High Level Synthesis". In IEEE Transactions on
Computer-Aided Design, Vol. 15, No. 2, Feb. 1998.
[119] J. Roy, N. Kumar, R. Dutta and R. Vemuri: "DSS: A Distributed High-
Level Synthesis System". In IEEE Design and Test of Computers, June 
1992. 
[120] J. Babb, R. Tessier, A. Agarwal: "Virtual Wires: Overcoming Pin 
Limitations in FPGA-based Logic Emulators". In Proceedings of FPGAs 
for Custom Computing Machines, 1993. 
[121] F. Vahid: "Techniques for Minimizing and Balancing I/O During
Functional Partitioning". In IEEE Transactions on Computer-Aided Design 
of Integrated Circuits and Systems, Vol. 18, January 1999. 
[122] I. Ouaiss, R. Vemuri: "Efficient Resource Arbitration in Reconfigurable 
Computing Environments". In Design, Automation and Test in Europe, 
DATE, pp. 560-566. IEEE Computer Society Press, 2000.
[123] J. Rabaey: "Digital Integrated Circuits: A Design Perspective". Prentice 
Hall, 1996. 
[124] S. Cadambi, S. C. Goldstein: "CPR: A Configuration Profiling Tool". In 
Proceedings of FPGAs for Custom Computing Machines (FCCM), 1999.
[125] D. S. Rao, F. Kurdahi: "On Clustering for Maximal Regularity Extraction". 
In IEEE Transactions on CAD, August 1993. 
[126] S. Sundararaman: "Application Specific Macro Based Synthesis". Master's 
thesis, University of Cincinnati, ECECS Department, 1999.
