Application of reconfigurable computing to a high performance front-end radar signal processor by David R. Martinez et al.
Journal of VLSI Signal Processing 28, 63–83, 2001
© 2001 Kluwer Academic Publishers. Manufactured in The Netherlands.
Application of Reconfigurable Computing to a High Performance
Front-End Radar Signal Processor*
DAVID R. MARTINEZ, TYLER J. MOELLER AND KEN TEITELBAUM
MIT Lincoln Laboratory, 244 Wood Street, Lexington, MA 02420-9108
Received July 1999; Revised December 1999
Abstract. Many radar sensor systems demand high performance front-end signal processing. The high processing
throughput is driven by the fast analog-to-digital conversion sampling rate, the large number of sensor channels,
and stringent requirements on the ﬁlter design leading to a large number of ﬁlter taps. The computational demands
range from tens to hundreds of billion operations per second (GOPS). Fortunately, this processing is very regular,
highly parallel, and well suited to VLSI hardware. We recently ﬁelded a system consisting of 100 GOPS designed
using custom VLSI chips. The system can adapt to different ﬁlter coefﬁcients as a function of changes in the
transmitted radar pulse. Although the computation is performed on custom VLSI chips, there are important reasons
to attempt to solve this problem using adaptive computing devices. As feature size shrinks and ﬁeld programmable
gate arrays become more capable, the same ﬁltering operation will be feasible using reconﬁgurable electronics. In
this paper we describe the hardware architecture of this high performance radar signal processor, technology trends
in reconﬁgurable computing, and present an alternate implementation using emerging reconﬁgurable technologies.
We investigate the suitability of a Xilinx Virtex chip (XCV1000) for this application. Results of simulating and
implementing the application on the Xilinx chip are also discussed.
Keywords: VLSI radar signal processor, front end high performance filtering, reconfigurable hardware, digital
filtering mapped to reconfigurable computing, commercial FPGA hardware
1. Introduction
The radar signal processing trend is to bring the analog-to-digital
converter (ADC) closer to the radar antenna elements, and process
the incoming signals digitally instead of using more conventional
analog approaches. The digital hardware offers more robust system
stability, avoiding unnecessary calibrations characteristic of
analog hardware due to component drifts with time and temperature.
The availability of digital hardware also provides very important
benefits, such as more flexibility in waveform design, an ability
to reconfigure the hardware by downloading different system
coefficients, and an easier upgrade path as digital electronics
continues to advance at an exponential rate commensurate with
Moore’s law.
*This work was sponsored by the Air Force under contract
F19628-95-C-0002. Opinions, interpretations, conclusions, and
recommendations are those of the authors and are not necessarily
endorsed by the United States Air Force.
The benefits of digital hardware in advanced radar signal
processing systems come at the expense of very demanding
front-end signal processor requirements. The computational
and data throughput requirements result from the large
amount of incoming data processed over a very short
time interval. The computational throughputs are in
the tens to hundreds of billion operations per second
(GOPS). The data throughput bandwidths are in the
hundreds of megabytes per second (MBytes/sec). In
addition to meeting stringent computational and data
throughputs, the digital hardware must be deployed with
small size, low weight, and low power. Furthermore, the
hardware must also comply with demanding environmental,
shock, vibration, humidity, and temperature speciﬁ-
cations. Examples of platforms requiring these capa-
bilities are today’s airborne early warning radars, un-
manned air vehicles (UAVs), ﬁghters, and spaceborne
surveillance and targeting radars.
For the past several years, commercially available
digital signal processors (DSPs) or RISC microproces-
sors have not met the system requirements for cases
where we need in excess of 10 GOPS/ft³ and ≥0.5
GOPS/Watt, operating on at least 16-bit data. Fortunately,
the front-end signal processing functions are very regular,
well structured, and oftentimes independent of the incoming
data. Therefore, most implementations to date have used
either commercially available,
yet dedicated, compute engines (e.g., INMOS A100 or
SHARP LH9124 FFT chip), or custom VLSI designs.
These approaches have allowed us to meet the compu-
tational throughputs and data throughputs, within the
form factor and environmental requirements.
In this paper, we describe a recently ﬁelded front-
end signal processor capable of 100 GOPS. Since this
system is channel parallel, it can be scaled to many
hundreds of GOPS as we increase the number of radar
input channels. The system can also be reconﬁgured in
a few milliseconds for a different set of coefficients.
Although the system requirements led us to choose a custom
VLSI implementation, we believe that this application
is very well matched to reconfigurable computing
hardware. At the time when the custom VLSI design
started (circa 1996–1997), the commercially available
ﬁeld programmable gate arrays were not able to meet
the system requirement by a large margin. However,
the rapid growth in computational capability of more
recent reconﬁgurable devices provides a new opportu-
nity for meeting the requirements without depending
on point solutions.
Reconﬁgurable computing will offer many of the
same benefits provided by commercial general purpose
microprocessors without sacrificing overall system
performance. These devices will be cost effective, since
the same hardware can be reused for many different
applications. The economies of scale will result in a
very cost-effective solution. Furthermore, for a given
design, the implementation cycle is shorter than cus-
tom VLSI solutions. If the ﬁrst design contains ﬂaws,
one can redo the hardware conﬁguration in software.
These changes would be very costly with custom VLSI.
Finally, because of the large number of devices,
reconfigurable hardware manufacturers can reap the benefits
of the latest reductions in lithography feature size.
We are not implying that all front-end radar signal
processing can be solved using reconfigurable computing.
There are still system specifications, for example,
in spaceborne applications, where custom VLSI solutions
are further ahead than reconfigurable hardware,
particularly when we need solutions with throughputs
per unit power ≥20 GOPS/W, data throughputs exceeding
GBytes/sec, and compliance with radiation hardening
requirements.
The example presented in this paper represents an
advanced prototype that only a few years back would
have been unacceptable to attempt to solve using ﬁeld
programmable gate arrays. This example serves as a
measure of how far reconﬁgurable technologies have
come in just three years. As discussed in Section 4,
under technology trends, this rapid advance can be
credited to several factors, such as smaller lithography
size, advancements in development tools, lower volt-
ages, and algorithms that exploit the distributed mem-
ory architecture of ﬁeld-programmable gate arrays
(FPGAs).
The paper is divided as follows. In the next section,
we describe the front-end radar digital ﬁltering and the
associated computational requirements. In Section 3,
the custom VLSI approach is presented. Section 4 ad-
dresses the technology trends and drivers fueling the
advances experienced with reconﬁgurable electronics
compared to more general purpose computing. In
Section 5, we review and contrast several architecture
approaches to solve the front-end radar signal process-
ing posed in Section 2. In Section 6, we discuss other
future candidate applications for reconﬁgurable com-
puting technologies. Finally, in Section 7 we present
our conclusions and summary.
2. Front-End Digital Filtering
A typical radar signal processing ﬂow starts at the out-
put of the ADC with real-time data processed through
a set of digital ﬁlters. The digital ﬁlters are needed to
ﬁrst convert the data from real ADC samples to com-
plex digital in-phase and quadrature (DIQ) samples.
This operation is commonly known in the radar community
as DIQ sampling [1]. The DIQ sampling is necessary
to preserve the target Doppler information. From
the Doppler information one can discern the target
velocity.
As illustrated in Fig. 1, the digital filtering is typically
the first step in the signal processing chain. The remaining
processing, Doppler filtering, adaptive nulling, and
post-nulling processing, can also be very demanding.
However, in this paper we focus only on DIQ sampling.
J. Ward [2] presents a thorough exposition of
the typical radar signal processing flow, with particular
emphasis on space-time adaptive processing. The number of
computational operations in DIQ sampling is proportional
to the data bandwidth of the system, the number of filter
taps, and the number of channels, as shown in Fig. 1.
In the next subsection we describe the filter
specifications in more detail.
Figure 1. Typical radar signal processing flow.
2.1. DIQ Sampling Architecture
The radar data after the ADC will exhibit a set of spec-
trum replicates with periodicity equal to the Nyquist
sampling (1/2 the sampling rate). Because the data are
real samples, at this stage the spectrum maintains equal
images with respect to 0 Hz. The DIQ ﬁltering will
serve the purpose of extracting one of the sidebands,
mapping the sideband spectrum to baseband, and ﬁl-
tering all remaining images including any DC offsets
introduced by the ADC. The ﬁlter coefﬁcients will
vary depending on the characteristics of the bandwidth
present in the transmitted radar pulse. The resulting
output contains a single sideband spectrum replicated
every increment of the sampling frequency. The sin-
gle sideband represents the complex signal (in-phase
and quadrature components) characteristic of a Hilbert
transform performed in a demodulation operation [3].
The analog receiver filtering, prior to the ADC,
is selected such that the information bandwidth
is preserved without aliasing. The signal instantaneous
bandwidth is typically several factors below the ADC
sampling frequency. This oversampling leads to very
simple DIQ architectures.
For example, if the ADC sampling is four times the
instantaneous frequency of the radar [4], the mapping
of the spectrum to baseband reduces to multiplication
by +1, 0, and −1 for the in-phase component, and 0,
+1, and −1 for the quadrature component. Once the
data are mapped to baseband, the complex signal can be
ﬁltered and decimated to match the signal bandwidth.
This simpliﬁcation leads to the typical DIQ sampling
architecture shown in Fig. 2. We use two ﬁlter banks
of equal length, each operating on the in-phase and
quadrature samples, respectively. The odd samples are
used to form the in-phase component, and the even
samples are used to form the quadrature components.
Figure 2. Digital in-phase and quadrature sampling architecture.
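The quarter-rate mixing described above can be illustrated with a short sketch. It assumes the signal is centered at exactly one quarter of the ADC rate, uses a placeholder boxcar low-pass filter rather than the filter designed in the paper, and picks one of the two possible sign conventions for the quadrature branch; the names `diq_demodulate`, `I_SEQ`, and `Q_SEQ` are illustrative only.

```python
import math

# Quarter-rate mixing sequences: cos(pi*n/2) and -sin(pi*n/2).
# The sign chosen for Q is an assumption; flipping it selects the other sideband.
I_SEQ = (1, 0, -1, 0)
Q_SEQ = (0, -1, 0, 1)


def diq_demodulate(samples, taps, decimation):
    """Mix real samples down by fs/4, low-pass filter, and decimate.

    The filter is applied as a sliding dot product, which equals a
    convolution when the taps are symmetric (linear phase).
    """
    mixed_i = [x * I_SEQ[n % 4] for n, x in enumerate(samples)]
    mixed_q = [x * Q_SEQ[n % 4] for n, x in enumerate(samples)]
    outputs = []
    # Only the retained output instants are computed in this sketch.
    for start in range(0, len(samples) - len(taps) + 1, decimation):
        acc_i = sum(t * mixed_i[start + k] for k, t in enumerate(taps))
        acc_q = sum(t * mixed_q[start + k] for k, t in enumerate(taps))
        outputs.append(complex(acc_i, acc_q))
    return outputs


# Toy input: a real tone at exactly fs/4 with a phase of 0.3 rad.
phase = 0.3
tone = [math.cos(math.pi * n / 2 + phase) for n in range(64)]
boxcar = [1 / 8] * 8  # placeholder low-pass filter, not the paper's design
baseband = diq_demodulate(tone, boxcar, decimation=8)
```

Because half of each mixing sequence is zero, the in-phase branch only ever touches every other sample and the quadrature branch touches the remaining ones, which is why the two filter banks can each run at half the ADC rate. For the toy tone, each baseband output is a constant complex value whose angle is the tone phase.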
2.2. Digital Filter Speciﬁcations
The DIQ ﬁlter is designed to suppress all out of band
images and spurious signals, including any DC offset
from the ADC. The ﬁlter must maintain a minimum
ripple across the passband with a sidelobe level sufficiently
low to attenuate any unwanted data outside the
signal bandwidth. The ﬁlter skirts determine the num-
ber of taps in the ﬁlter. For our application, we used
the Parks-McClellan Remez algorithm [3] available in
Matlab to design a ﬁnite impulse response ﬁlter with
linear phase. The characteristics of the ﬁlter were as
follows [5]:
• Equiripple passband = 0.18 dB
• Stopband level ≥ 80 dB
• Unattenuated passband = 0.25 MHz
• Bandwidth at 3 dB points = 0.381 MHz
The above speciﬁcations led to a ﬁlter with 208 com-
plex taps (104 taps for the real and 104 taps for the
imaginary components, respectively). In order to meet
the ﬁlter stopband level the coefﬁcients must have a
ﬁnite precision of at least 15 bits.
The ADC used was a DATEL ADS945 sampling at
10 MHz, with a precision of 14 bits. After the digital filter,
shown in Fig. 2, the radar data were decimated by a
factor of 16 or, equivalently, a factor of 8 complex samples.
Thus, the output sampling rate out of the DIQ filter
was down to 0.625 MHz complex data. Although this
output data rate was twice what was needed to main-
tain the signal information bandwidth, it simpliﬁed the
following ﬁltering stage past the DIQ sampling result-
ing in a simpler ﬁlter design with a smaller number of
taps.
On a per channel basis, as shown in Fig. 1, the com-
putational throughput is proportional to the data band-
width and the number of ﬁlter taps. For the 208 com-
plex ﬁlter taps and an input data rate of 10 MHz, the
DIQ ﬁlter must perform at least 2.08 GOPS, based
on the architecture shown in Fig. 2. The reduction
in sampling rate by a factor of 8 could have allowed
us to only process the samples needed at the out-
put. However, the hardware designer [6] concluded
that the custom VLSI implementation was more regular
and easier to lay out by processing all the incoming
samples and decimating at the end. For a 48-channel
system the total computation throughput was
100 GOPS.
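The throughput figures above follow from simple arithmetic, sketched here under the accounting the paper uses later (each of the I and Q branches runs at half the ADC rate, with one multiply and one add per tap). The function name is illustrative.

```python
def diq_ops_per_second(adc_rate_hz, taps_i, taps_q, ops_per_tap=2):
    """Operations per second for one channel of the DIQ filter.

    Each branch sees every other ADC sample, so it runs at half the ADC
    rate; every tap costs one multiply and one add.
    """
    branch_rate_hz = adc_rate_hz / 2
    return branch_rate_hz * (taps_i + taps_q) * ops_per_tap


per_channel = diq_ops_per_second(10e6, taps_i=104, taps_q=104)  # 2.08e9
system_total = 48 * per_channel                                  # 99.84e9
```

The 48-channel total comes to 99.84 GOPS, which the text rounds to 100 GOPS.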
3. Custom VLSI Implementation
The front-end signal processor design began in 1996.
The system was completed in 1998. At the start of the
project, a detailed search was done to determine if ei-
ther general purpose hardware or commercially avail-
able dedicated chips could be used to solve the DIQ
sampling requirements. We concluded that no DSP or
RISC chips would be suitable to meet the processing
requirements, particularly when one accounts for the
loss in processing efficiency caused by programmable
hardware. There were several commercially available
specialized chips designed specifically to perform FIR
filtering (e.g., Graychip GC2011, Harris HSP43168, GEC
Plessey PDSP16256). We concluded that none of these
chips met the requirements in all the critical dimensions,
such as input data precision, coefficient precision,
internal arithmetic accuracy, and computation
throughput at low power.
The decision was made to proceed with the design
of a custom VLSI chip based on a mix of standard cells
and datapath multiply-accumulate (MAC) cells. The
details of this design are discussed further in later
sections. The final front-end signal processor was
integrated into two 9U-VME chassis together with ad-
ditional control electronics, a single-board computer
(SBC), and interface boards to the follow on process-
ing and an instrumentation recorder. The airborne sig-
nal processor subsystems are shown in Fig. 3. In this
paper, we discuss only the Digital In-Phase and Quadrature
Filtering Subsystem. The following sections describe
the hardware developed for this subsystem in more
detail.
3.1. DIQ Hardware Subsystem
The DIQ hardware subsystem consisted of three unique
board designs. There were two 9U-VME chassis;
each chassis performed 50 GOPS. A 9U-VME chassis
housed 10 boards. The internal architecture of this
subsystem is shown in Fig. 4. One board slot was dedicated
to an SBC control computer used to download
radar coefficients to the processing boards and to
initialize all the board registers. Another slot was
dedicated to a distribution board used for subsystem
control. This board interfaced to the radar control processor
and provided control signals to each of the subsequent
boards in the chassis. There were 6 DIQ processing
boards. Each DIQ board handled 4 input channels. The
outputs from 2 DIQ boards were then sent to a data
output board (DOB board). Each DOB board packed
the data and provided the same interface outputs to
a back-end programmable processor and to an
instrumentation recorder. These custom boards were based
on a 14-layer PCB design of size 9U × 220 mm.
Figure 3. Airborne signal processor subsystems.
The overall characteristics of this subsystem for two
9U-VME chassis were:
• 100 GOPS
• Size = 7.3 ft³
• Weight = 210 lbs
• Power = 1.5 kW
• Chassis throughput/Power = 67 MOPS/W
• Throughput density = 14 GOPS/ft³
• Power density = 205 W/ft³
There were 24 channels input to each DIQ subsys-
tem chassis. The total data rate to the chassis was
420 MBytes/sec. These data were then distributed to
6 DIQ processing boards over a custom backplane.
Each DIQ board received 70 MBytes/sec of data. There
were 4 custom VLSI chips per DIQ board; each chip
processed one channel of data. The input data was
mapped to 18-bit data on the DIQ board prior to go-
ing into the VLSI chip. The output of the chip was
a 24-bit data word. A 16-bit word was then selected
and sent over the custom backplane to the DOB board.
Thus, the output of the DIQ board was reduced to
10 MBytes/sec (4 channels, 16-bit complex words, at
0.625 MHz). The next section describes in more detail
the custom VLSI hardware.
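The data rates quoted in this subsection can be cross-checked with a few lines of arithmetic. The sketch assumes the 420 MBytes/sec input figure corresponds to 14-bit ADC words at 10 MHz on each of the 24 channels; that packing is an inference from the numbers, not something the text states. The helper name is illustrative.

```python
def mbytes_per_second(channels, words_per_second, bits_per_word):
    """Aggregate data rate in MBytes/sec (1 MByte = 1e6 bytes)."""
    return channels * words_per_second * bits_per_word / 8 / 1e6


# Input side: 24 channels of 14-bit samples at 10 MHz (assumed packing).
chassis_in = mbytes_per_second(24, 10e6, 14)
board_in = chassis_in / 6  # six DIQ boards share the chassis input

# Output side: 4 channels of complex words (16-bit I plus 16-bit Q) at 0.625 MHz.
board_out = mbytes_per_second(4, 0.625e6, 2 * 16)
```

Under these assumptions the three values evaluate to 420, 70, and 10 MBytes/sec, matching the figures in the text.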
3.2. Custom VLSI Hardware
The VLSI custom front-end chip, designed using 1996
process technology, is shown in Fig. 5. The chip was
fabricated at the National Semiconductor facility in
Arlington, Texas, under the C050 process. The key
features of this chip were:
– 2.08 billion operations per second
– 18-bit input data and coefficients; 24-bit output data
– 585 mil × 585 mil die size
– 1.5 million transistors
– 0.65 μm feature size
– CMOS using three-layer metal
– Designed for 4 watts power dissipation; measured
3.2 watts in operation
– Throughput/Power = 0.65 GOPS/W
Figure 4. DIQ subsystem architecture.
The design was based on a combination of standard
cells and datapath blocks [6]. The standard cells were
used in the chip control interface, barrel bit selector, and
downsampler. The datapath blocks formed the multiply
and add (MAC) cells. There were a total of 64 MACs
available on the chip. These MACs could be used either
as 64 complex taps or as a maximum of 512 real taps. Since
we were interested in computing the in-phase (I) and
quadrature (Q) data from real ADC data, the chip
processed up to 128 real taps for the I samples and, in
parallel, up to another 128 taps for the Q samples, at
10 MHz. We could also have used the chip to process
at 5 MHz using 512 taps.
ADC samples arrived at 10 MHz. Alternate samples
went to the I computation and the Q computation,
respectively (see Fig. 2). Thus, two sets of
outputs were computed at a combined rate of 10 MHz
(5 MHz each for I and Q), each applying 104 real taps.
Since there were
two real operations per tap (a multiply and an add),
the total computation throughput for these parameters
was 2.08 GOPS. These operations were performed on
18-bit sign extended data (from 14-bit ADC samples)
using 18-bit coefﬁcients. The resulting signiﬁcant 16-
bit outputs (I and Q samples) were selected on the
DIQ board from a 24-bit data output from the VLSI
chip.
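The word-width handling described above (14-bit ADC samples sign-extended to 18 bits, and a 16-bit field selected from the 24-bit chip output) can be sketched as follows. The text does not say which 16 bits were selected, so the `lsb` parameter and all function names here are hypothetical.

```python
def sign_extend(raw, from_bits):
    """Interpret the low `from_bits` bits of `raw` as a two's-complement value."""
    raw &= (1 << from_bits) - 1
    sign_bit = 1 << (from_bits - 1)
    return raw - (1 << from_bits) if raw & sign_bit else raw


def to_word(value, bits):
    """Encode a signed integer as a `bits`-wide two's-complement word."""
    return value & ((1 << bits) - 1)


def select_bits(word, lsb, width=16):
    """Pick `width` contiguous bits of `word`, starting at bit `lsb`."""
    return (word >> lsb) & ((1 << width) - 1)


adc_raw = 0x3FFF                                  # 14-bit pattern for -1
widened = to_word(sign_extend(adc_raw, 14), 18)   # 18-bit pattern for -1
picked = select_bits(0xABCDEF, lsb=8)             # upper 16 bits of a 24-bit word
```

Choosing a larger `lsb` keeps more headroom at the cost of low-order precision; the appropriate choice depends on the filter gain and is not specified in the source.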
Figure 5 illustrates a die photograph, the packaged
chip using a pin grid array (PGA) with 238 pins, and
the results of processing simulated radar data. The per-
formance results contain three plots. Two of the curves
illustrate the output from the hardware superimposed
over the output of Matlab, respectively. Matlab simu-
lates the ﬁltering operation using full precision arith-
metic. The third plot shows the difference between the
hardware output and the Matlab results. The difference
between the hardware output and the Matlab full pre-
cision simulation is about 100 dB relative to the square
waveform peak. These results are very good and quite
commensurate with what one would expect using 18-bit
finite precision arithmetic.
Figure 5. Custom VLSI front-end chip.
If we compare the characteristics of the DIQ sub-
system, described in the previous section, against the
characteristics of the custom VLSI chip, several
observations can be made. The DIQ subsystem has a
computation throughput per unit power of only 67 MOPS/W.
The custom VLSI chip has a throughput/power of
0.65 GOPS/W. This decrease of one order of magnitude
in throughput/power performance is typical of hardware
integrated as part of an overall subsystem, in
contrast to single-chip performance. Additional power
is needed for the control and interface boards, back-
plane drivers, and cooling. Another important obser-
vation is the lag between the time the VLSI chip was
designed (circa 1996 fabrication process) to the time
the system was deployed (circa 1998). This two-year
lag was used to integrate the overall system. Therefore,
the system was over one generation old in feature size
process technology at deployment time. Furthermore,
for fabrication orders consisting of few wafer lots
(circa 1996) the only available process technology was
0.65 μm. This feature size was sufficient for this
one-of-a-kind prototype technology demonstration. Larger
wafer-lot fabrication runs, often found in commercial chip
designs, have access to more advanced fabrication process
lines. These are areas that we expect to improve
with the availability of more advanced reconﬁgurable
computing technologies.
As FPGA technologies become more capable, we
can now expect to perform the same ﬁltering opera-
tions as previously delegated to custom VLSI devices.
The FPGA implementations would have several major
beneﬁts:
1. Reduction in design and manufacturing costs rela-
tive to custom VLSI designs.
2. Flexibility to implement different filtering functions
using the same FPGA devices.
3. Ability to upgrade the design to higher ADC samp-
ling rates by substituting more capable FPGA
devices commensurate with the Moore’s law pro-
gression in silicon technology.
4. Wider vendor sources of FPGA technologies than
available for custom VLSI designs.
The major limiting factor in the use of FPGAs to date,
for these classes of applications, has been their inability
to reach computation throughputs on the order of sev-
eral GOPS, with 18-bit data and coefﬁcient inputs, and
a throughput per unit power greater than 0.5 GOPS/W.
Most FPGA demonstrations to date have been limited
to a few bits of precision. The custom VLSI implementation,
described in this section, serves as an existence
proof and benchmark of capabilities we would like to
see in FPGAs. In the following sections we discuss
the recently observed FPGA technology trends. We also
present implementation techniques and results showing
what is feasible to date with the most advanced FPGA
devices. We show how close this hardware comes to
meeting our FIR ﬁltering requirements.
4. Technology Trends in Field Programmable
Gate Arrays
The increasing sophistication of radar signal process-
ing techniques has paralleled improvements in digital
computing techniques. Computational requirements of
future radar systems will approach 10 TOPS (trillion
operations per second) during the next decade, placing
them in the domain occupied by the leading edge of
commercial massively parallel processors consisting of
hundreds or thousands of individual CPUs. Unfortu-
nately, the sheer bulk of these systems will preclude
their use in most military radar applications where
platform constraints place stringent requirements on
size, weight and power. Reconﬁgurable computing
has the potential to mitigate this problem by ofﬂoad-
ing computationally intensive tasks for execution onto
FPGAs.
Compared to the digital computer industry, FPGAs
are in their infancy. Xilinx, the leading manufacturer of
FPGAs, produced its first field programmable gate
array 15 years ago [7]. The initial devices were primitive
by today’s standards, and FPGAs were mainly used for
“glue logic.” The programmable nature of the device
made it possible for engineers to make circuit design
changes without costly and time-consuming modiﬁca-
tions to printed circuit boards, and the ﬂedgling indus-
try prospered. As technology progressed and the pro-
cess geometry shrank, both the capacity and speed of
FPGAs have increased. These improvements have en-
abled FPGAs to perform the arithmetic functions nec-
essary for digital signal processing (DSP). The goal
of reconﬁgurable computing is to design application-
independent hardware based on FPGA technology,
and download application-speciﬁc algorithms to han-
dle a wide variety of potential applications. Several
manufacturers currently offer such products with
many FPGAs, external memory, and industry-standard
interfaces.
4.1. Building Blocks
In simplest terms, FPGAs are nothing more than large
arrays of look-up tables (LUTs) with a ﬂexible inter-
connect which allows building complex circuits with
multiple LUTs. Once the circuit is described (either
with VHDL or with schematic capture), the chore of
mapping these circuits onto the sea of LUTs is
accomplished with automated tools for synthesis,
placement, and routing.
For the Xilinx devices, the basic building block is
called a Conﬁgurable Logic Block (CLB) that consists
of two 4-input LUTs, one 3-input LUT, and two reg-
isters. DSP functions consist principally of multiply
and add operations that must be constructed of multiple
CLBs. A $b$-bit adder needs $b$ LUTs and $b$ registers,
and thus consumes $b/2$ CLBs. A $b \times b$ bit multiplier
requires $2b$ $b$-bit adders consuming $b^2$ CLBs.
A FIR filter tap (multiply-add, 2 real operations) consumes
$(b^2 + b/2)$ CLBs. For 16-bit arithmetic, an adder
would consume only 8 CLBs, but a multiplier would
take 256 CLBs, for a total of 264 CLBs per filter tap.
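The resource estimates above are easy to tabulate. The following sketch simply evaluates the stated formulas (b/2 CLBs per adder, 2b adders per multiplier, one multiplier plus one adder per tap); the function names are illustrative.

```python
def adder_clbs(bits):
    """A b-bit adder uses b LUTs and b registers, i.e. b/2 CLBs."""
    return bits / 2


def multiplier_clbs(bits):
    """A b x b multiplier is built from 2b adders of b bits each."""
    return 2 * bits * adder_clbs(bits)


def fir_tap_clbs(bits):
    """One FIR tap is one multiplier plus one adder."""
    return multiplier_clbs(bits) + adder_clbs(bits)


costs_16_bit = (adder_clbs(16), multiplier_clbs(16), fir_tap_clbs(16))
```

For 16-bit arithmetic this gives 8 CLBs for the adder, 256 for the multiplier, and 264 for a complete tap, which shows how completely the multiplier dominates the cost.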
In order to promote a more industry-standard ap-
proach to counting FPGA capacities, Xilinx has pro-
posed the notion of a Logic Cell [8] which consists
of a 4-input LUT and a register. For comparison pur-
poses, the Xilinx XC4000 family CLB is equivalent to
2.375 logic cells.
Since the parallel multiplier has been so expensive
in terms of resources used (CLBs) compared to re-
sources available on a single FPGA, several alterna-
tive strategies for implementing DSP functions have
evolved. These strategies tend to exploit the LUT-based
architecture of the FPGA.
When multiplying a data stream by a constant that
is known a priori, a constant coefﬁcient multiplier [9]
may be used to look up the product of the data sample
and coefﬁcient. The data sample effectively becomes
the address to the look-up table that stores the prod-
uct of the coefﬁcient and data sample for each possi-
ble value of the data sample. Also assuming a priori
knowledge of multiplier constants, Serial Distributed
Arithmetic (SDA) [10] rearranges the order of com-
putations to facilitate table lookup of a 1-bit sum of
products. Distributed arithmetic has also been used to
implement fast Fourier transforms (FFTs) [11].
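A minimal software model may help make the distributed-arithmetic idea concrete. The sketch below is a generic bit-serial formulation, not the specific circuit of [10]: it precomputes a table of coefficient sums, then processes one bit-plane of the input samples per step, using that bit-plane as the table address. It assumes integer coefficients and two's-complement samples that fit in `bits` bits; all names are illustrative.

```python
def build_da_lut(coeffs):
    """LUT[a] holds the sum of coeffs[k] for every bit k that is set in a."""
    lut = [0] * (1 << len(coeffs))
    for addr in range(len(lut)):
        lut[addr] = sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
    return lut


def da_dot(lut, samples, bits):
    """Compute sum(coeffs[k] * samples[k]) one input bit-plane at a time."""
    acc = 0
    for j in range(bits):
        # Gather bit j of every sample into a table address.
        addr = 0
        for k, x in enumerate(samples):
            addr |= ((x >> j) & 1) << k
        term = lut[addr] << j
        # The most significant bit carries negative weight in two's complement.
        acc += -term if j == bits - 1 else term
    return acc


coeffs = [3, -5, 7, 2]
samples = [10, -4, -8, 1]
result = da_dot(build_da_lut(coeffs), samples, bits=8)
```

No multiplier appears anywhere: each step is a table lookup, a shift, and an add. The table grows as 2^K for K coefficients, so practical designs split long filters into small groups of taps and sum the partial results.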
These techniques work well for many digital-
ﬁltering applications where the ﬁlter coefﬁcients (im-
pulse response) are ﬁxed. For adaptive ﬁlters where
the coefﬁcients are changed in response to the data,
other approaches are necessary. It is possible to build
SDA ﬁlters with two banks of ﬁlter taps that are bank
swapped when ﬁlter coefﬁcients are updated [12]. In
the later implementation sections, this is the approach
used for handling multiple filter coefficients in
real-time. A table generator computes the LUT con-
tents in real time while the new ﬁlter coefﬁcients are
downloaded. While this is obviously less resource efﬁ-
cient than the straightforward SDA approach, it is still
an improvement over having to use parallel multipliers.
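The bank-swapping scheme can be modeled in software as a pair of coefficient sets with an "active" pointer. This is only a behavioral sketch under the assumption that loading and swapping are separate steps; it says nothing about how the table generator or the hardware of [12] is actually built, and the class and method names are hypothetical.

```python
class BankSwappedCoefficients:
    """Two coefficient banks: one drives the filter, the other is loadable."""

    def __init__(self, initial):
        self._banks = [list(initial), [0] * len(initial)]
        self._active = 0

    @property
    def active(self):
        """Coefficients currently used by the filter."""
        return self._banks[self._active]

    def load_inactive(self, coeffs):
        """Write a new set into the idle bank without disturbing the filter."""
        if len(coeffs) != len(self.active):
            raise ValueError("coefficient count must match the filter length")
        self._banks[1 - self._active] = list(coeffs)

    def swap(self):
        """Make the freshly loaded bank active in a single step."""
        self._active = 1 - self._active


banks = BankSwappedCoefficients([1, 2, 3])
banks.load_inactive([4, 5, 6])
before_swap = list(banks.active)
banks.swap()
after_swap = list(banks.active)
```

The filter keeps running on the old coefficients for the whole duration of the load, and the change takes effect only at the swap, which is what allows the new coefficients to become active without interrupting the data stream.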
4.2. Benchmarks
In assessing the computational requirements of DSP
algorithms, and the capabilities of DSP technologies,
the metric commonly used is the number of arithmetic
operations to be executed per unit time. This is typically
expressed in MOPS (million operations per second) or
GOPS (billion operations per second). For considering
the performance of FPGAs, this metric is normalized
by the “area” consumed by the circuit (the number of
CLBs) to yield MOPS/CLB. If the multiply-adder can
operate at a clock rate of $f_C$, then the number of
operations per second executed per CLB is:
$$\frac{2 f_C}{b^2 + b/2} \qquad (1)$$
FPGAs are much more effective for small word lengths
since computation rate per unit area (MOPS/CLB) falls
off as the square of the word length in bits. This is
clearly evident in the graph in Fig. 6, which is based
on the Xilinx CORE benchmarks [13].
Figure 6 also illustrates the performance advantage
achievable with SDA compared to parallel multiply-add
architectures. The product of the MOPS/CLB metric
and the device capacity (number of CLBs) gives a
notion of the device capability in aggregate. For example,
if Xilinx’s new Virtex device (XCV1000), which
has 12,288 CLBs (equivalent CLBs with respect to the
Xilinx 4000 series), could achieve 1 MOPS/CLB, it would
be able to perform about 12 GOPS per chip.
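Equation (1) and the aggregate-capability estimate can be evaluated directly. In the sketch below, the 132 MHz clock is a hypothetical value chosen only because it yields exactly 1 MOPS/CLB for a 16-bit parallel multiply-add; it is not a figure from the text. Function names are illustrative.

```python
def mops_per_clb(clock_hz, bits):
    """Equation (1): two operations per clock spread over b^2 + b/2 CLBs."""
    clbs_per_tap = bits ** 2 + bits / 2
    return 2 * clock_hz / clbs_per_tap / 1e6


def device_gops(mops_per_clb_value, clb_count):
    """Aggregate capability: per-CLB rate times the number of CLBs."""
    return mops_per_clb_value * clb_count / 1e3


density_16_bit = mops_per_clb(132e6, bits=16)  # hypothetical clock rate
density_8_bit = mops_per_clb(132e6, bits=8)
xcv1000_gops = device_gops(1.0, 12288)
```

At the same clock rate, halving the word length from 16 to 8 bits raises the density by nearly a factor of four, which is the quadratic fall-off described above. At 1 MOPS/CLB the 12,288 CLBs give 12.288 GOPS, the "about 12 GOPS" quoted in the text.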
Figure 6. Xilinx core benchmarks.
In many DSP applications, power consumption is
heavily constrained. For these applications, MOPS/Watt
is a convenient performance metric. Xilinx [14]
has suggested a simple formula for estimating the
power consumption of a device:
$$P = V_{cc} \cdot K_p \cdot f_C \cdot N_{LC} \cdot \mathrm{Tog}_{LC} \qquad (2)$$
where $V_{cc}$ is the supply voltage, $K_p$ is a device-specific
power factor, $f_C$ is the maximum clock frequency, $N_{LC}$
is the number of logic cells, and $\mathrm{Tog}_{LC}$ is the
fraction of logic cells toggling on each clock (typically
about 20%). Rearranging (1) and (2), and assuming
2.375 logic cells per CLB, yields the following estimate
for MOPS/Watt:
$$\mathrm{MOPS/Watt} = \frac{4.2 \times 10^{-6}}{V_{cc} \cdot K_p \cdot (b^2 + b/2)} \qquad (3)$$
Typical values for several generations of Xilinx parts
are shown in Table 1 and assume 16-bit arithmetic.
Power efﬁciencies for several 32-bit ﬂoating point pro-
grammable processors are shown in Table 2 for com-
parison.
Table 1. Power efficiency (MOPS/W) for several FPGA families.
Part number      Vcc (V)   Kp × 10^12   MOPS/W
XC4000E          5         72           43.0
XC4000EX         5         47           65.9
XC4000XL         3.3       28           167.5
XC4000XV         2.5       13           476.3
XCV (Virtex)     2.5       6.8          961.2
Table 2. Power efficiency (MFLOPS/W) for several floating-point programmable microprocessors.
Processor                         Peak MFLOPS   Power (W) (Typical)   MFLOPS/W
DEC Alpha 21164                   1200          46                    26
INTEL Pentium III                 1000          28                    35.7
Motorola Power PC 750             800           8                     138
Analog Devices 21062 (SHARC)      120           2                     60
Analog Devices 21160              600           2                     300
Texas Instruments TMS320C6701     1000          2.8                   357
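The derivation of Equation (3) from Equations (1) and (2) can be checked numerically. The sketch below divides the per-CLB operation rate by the per-CLB power, using the stated 2.375 logic cells per CLB and 20% toggle fraction, and applies it to the Vcc and Kp values of Table 1 for 16-bit arithmetic. Names are illustrative.

```python
LOGIC_CELLS_PER_CLB = 2.375
TOGGLE_FRACTION = 0.20


def mops_per_watt(vcc, kp, bits):
    """Operations per joule from Eq. (1) divided by Eq. (2), in MOPS/W."""
    clbs_per_tap = bits ** 2 + bits / 2
    # Power per CLB per hertz of clock, from Equation (2).
    joules_per_clock_per_clb = vcc * kp * LOGIC_CELLS_PER_CLB * TOGGLE_FRACTION
    # The clock frequency cancels: 2 ops per clock over clbs_per_tap CLBs.
    ops_per_joule = 2 / (joules_per_clock_per_clb * clbs_per_tap)
    return ops_per_joule / 1e6


# (Vcc in volts, Kp) pairs taken from Table 1.
table_1 = {
    "XC4000E": (5.0, 72e-12),
    "XC4000EX": (5.0, 47e-12),
    "XC4000XL": (3.3, 28e-12),
    "XC4000XV": (2.5, 13e-12),
    "XCV (Virtex)": (2.5, 6.8e-12),
}
estimates = {part: mops_per_watt(vcc, kp, bits=16)
             for part, (vcc, kp) in table_1.items()}
```

The constant 2 / (2.375 × 0.2) ≈ 4.21 reproduces the 4.2 in Equation (3). The estimates land within a few percent of the MOPS/W column in Table 1; the residual differences presumably reflect rounding or slightly different inputs in the original calculation.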
4.3. Technology Trends
Shrinking process geometry has fueled the continued
growth in capability for FPGAs in DSP applications
by increasing both density and clock rate. Supply volt-
age has dropped to reduce power consumption. The
net result has been enormous growth in terms of com-
putational capability (MOPS) and power efﬁciency
(MOPS/W). The growth in computational capability
versus process geometry is shown in Fig. 7. Curves are
shown for FPGAs, the Motorola Power PC [15, 16], and
custom VLSI for comparison purposes. The data for
custom VLSI is based on several development efforts
at MIT Lincoln Laboratory [17] over the past several
years.
It is clear from Fig. 7 that FPGAs offer an intermediate
capability between that offered by programmable
processors and custom VLSI. At present FPGAs are
about an order of magnitude more capable than the
programmable processors, and an order of magnitude
less capable than full custom VLSI. Furthermore, the
gap between FPGAs and microprocessors is widening.
This is easily explained. FPGAs benefit directly from
the increase in density and clock rate. Using an FIR filter
as an example, successively finer geometries result
in increases both in the number of filter taps which can
be implemented in parallel, and in the sample rate at
which they process data. The Power PC, on the other
hand, has maintained two floating-point pipelines despite
the shrinkage of process geometry, relying on increases
in clock frequency to improve peak MFLOPS.
The Power PC has taken advantage of the increases in
density to increase memory caches and shrink die sizes.
Figure 7. Growth in FPGA computation rate versus process geometry.
The growth in the computational capability of
FPGAs for DSP applications as a function of time is
shown in Fig. 8 in billions of 16-bit arithmetic operations
per second (GOPS). Within the last 5 years the
computational capability of FPGAs has increased by
an order of magnitude every two years, and we have
reached the point where it is now feasible to explore
the implementation of complex DSP systems using FPGAs
as the computational building blocks. In the next
sections we present implementation approaches for the
FIR digital filtering function shown in Fig. 1 using the
Xilinx Virtex XCV1000 as an example of state-of-the-art
FPGAs.
Figure 8. Growth of computational capability for Xilinx FPGAs
(16-bit arithmetic operations).
5. Reconﬁgurable Hardware for FIR
Digital Filtering
To demonstrate the current state of FPGA technology
and its application to front-end signal processing, an
FPGA design meeting the design requirements of MIT
Lincoln Laboratory’s custom VLSI FIR chip was created
[19]. Although the actual VLSI chip was capable of
processing, in the DIQ mode, 256 taps (I and Q components)
for input data at 10 MHz, the design could
accommodate up to a 512-tap real filter with 5 MHz input
data. Therefore, the FPGA demonstration was presented
with the requirement to also accommodate up to
512 taps at a maximum input data rate of 5 MHz. This
requirement is equivalent to 10 MHz input data with a
maximum of 256 taps. The FPGA requirements were:
• Perform 512-tap real FIR filtering.
• Accept a minimum of 16-bit data at a maximum input
  rate of 5 MHz.
• Operate with a maximum chip clock frequency of
  40 MHz (eight times the input data rate).
• Output data with the same precision as the custom
  VLSI chip (the VLSI chip outputs 24 bits of data,
  but simulations showed 18 bits of accurate precision
  was sufficient).
• Use two swappable banks of 18-bit coefficients, one
  active and one loadable.
• Fit into the largest Xilinx FPGA available, the Virtex
  XCV1000.
• Coefficient banks must be able to be switched every
  1 ms (i.e. reloading all 512 coefficients must take
  place in less than 1 ms), and the new bank of coefficients
  must become active instantaneously.
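The requirements above imply a few simple budgets, which the following sketch works out. The constant names are illustrative and not taken from the design, and the "clocks per coefficient" figure is only an average budget: the text does not state how the coefficient load interface is clocked.

```python
# Figures taken from the requirement list above.
TAPS = 512
INPUT_RATE_HZ = 5_000_000
CHIP_CLOCK_HZ = 40_000_000
BANK_SWITCH_PERIOD_US = 1_000  # banks may be switched every 1 ms

# 512 taps at 5 MHz and 256 taps at 10 MHz give the same tap-rate product.
tap_rate = TAPS * INPUT_RATE_HZ
assert tap_rate == 256 * 10_000_000

# One multiply and one add per tap per input sample.
required_gops = 2 * tap_rate / 1e9

# Chip clock cycles available for each input sample.
clocks_per_sample = CHIP_CLOCK_HZ // INPUT_RATE_HZ

# Average number of chip clocks available per coefficient during a reload
# (an illustrative budget, assuming the whole 1 ms is usable for loading).
clocks_per_reload = CHIP_CLOCK_HZ * BANK_SWITCH_PERIOD_US // 1_000_000
clocks_per_coefficient = clocks_per_reload / TAPS

print(f"required throughput : {required_gops:.2f} GOPS")
print(f"clocks per sample   : {clocks_per_sample}")
print(f"clocks per reload   : {clocks_per_reload}")
print(f"clocks/coefficient  : {clocks_per_coefficient:.1f}")
```

The eight clocks per sample are the constraint that later rules out a purely bit-serial 16-bit filter running directly from the 40 MHz chip clock.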
The 16-bit inputs were chosen for the FPGA imple-
mentation as the custom VLSI chip is currently being
used with an ADC with a 14-bit output, and it is un-
likely the VLSI chip would operate with an ADC of
more than 16-bits in the future. The only reason for
having the custom VLSI chip designed to accommo-
date 18-bit input data was to facilitate word growth if
several custom A1000 chips were used in a cascaded
mode. This is not a requirement imposed on the FPGA
implementation, because for the DIQ ﬁltering function
there was no need to cascade multiple chips.
Two banks of coefficients were used in the custom
VLSI chip. This capability is also required for the
FPGA, so that one bank stores the active filter coefficients
while the other bank is being loaded with new
coefficients. The banks may then be swapped by an external
control so that the new coefficients become
instantly active.
Although the need to change coefficients could be
viewed as an ideal application for reconfiguration, using
swappable coefficient banks is more efficient. It
has been proposed that the ability of an FPGA to be
reconfigured in-system be used to implement a single
filter with fixed coefficients that is reconfigured when a
coefficient switch is desired. However, the design requirements
specify an instantaneous switch between the active
coefficients and the new set of loaded coefficients.
This prohibits using a single bank of coefficients
that is updated via the FPGA’s reconfiguration
ability: the filter would be inactive during the
reconfiguration time, which is unacceptable. This slow
reconfiguration is one of the important limitations of
today’s reconfigurable technologies.
One solution might be to have one filter operational
while a second filter was being configured in the same
chip. This would require the FPGA to have partial
reconfiguration ability. In addition, the filter coefficients
would have to be known in advance so that configurations
using constant coefficients could be mapped,
placed, and routed by the Xilinx software to be ready
for loading into the FPGA. These configurations would
then have to be stored off-chip to be recalled and loaded
into the FPGA. In most applications, the filter responses
are not known beforehand, so creating and storing every
configuration for every set of possible coefficients
is not feasible.
5.1. Implementation Techniques
on Reconﬁgurable Hardware
Several different implementation techniques were investigated
to find a design with the best performance
(or at least satisfying the design requirements) with
the minimum amount of FPGA utilization. These techniques
included: a parallel MAC structure (much like
that used in the custom A1000 VLSI design), a bit-level
systolic structure (similar to that used in another MIT
Lincoln Laboratory custom VLSI chip), and a bit-serial
approach using Distributed Arithmetic (DA). The technique
that was found to have the best performance
for the smallest area was the bit-serial DA approach
[19].
In addition to the DA approach, we also investigated
the fast FIR algorithm (FFA) and ﬁltering in the fre-
quency domain. Although these are techniques that
could be used in a VLSI solution, they would not lend
themselves as easily to a multi-mode chip. The custom
VLSI A1000 chip was designed to be able to perform
real, complex, or DIQ digital ﬁltering with the same
hardware. The FPGA conﬁguration does not need to
support multiple modes. Instead, a speciﬁc implemen-
tation can be developed for each mode, and the correct
FPGA conﬁguration can be loaded into the reconﬁg-
urable hardware for the mode desired.
The next section describes the implementation of
DIQ filtering using the DA approach [18]. Additionally,
a custom layout tool is discussed. This tool, developed
at MIT Lincoln Laboratory, aids in the automatic placement
of regular and pipelined FPGA VHDL designs
[19]. Currently, there is no way, through the Xilinx
development flow, to describe how synthesized components
from a design written in VHDL should be laid
out within an FPGA. A designer may develop a highly
pipelined design that requires only short net lengths between
logic and pipeline stages. However, even when
the design is broken into small, regular, systolic cells, the
development tools have no knowledge of the fact that a single
cell needs to be laid out only once and that cells that
communicate with each other should be placed next
to each other. Presently, the tools place every single
synthesized component individually, resulting in inefficient
designs with long net lengths that lead to poor
clock performance. The MIT Lincoln Laboratory
custom tool reads, through the VHDL hierarchy,
the user-defined placement constraints embedded
within the VHDL for low-level, regular structures. The
tool also develops an overall placement scheme that
keeps the same layout for identical repeated systolic
cells, and places connecting cells adjacent to each
other. The DA implementation and the FPGA complexity
are described in the following section.
5.2. DIQ Filtering Implementation Based
on Distributed Arithmetic
The abundance of small distributed RAM blocks
throughout the Xilinx FPGA chip enables the user to
pre-calculate partial products, and to load these into
the distributed RAM, thereby eliminating the large
amounts of logic needed to compute multiplication
results in a non-distributed approach. A Distributed
Arithmetic (DA) architecture is a versatile approach
to using this distributed RAM [10, 20–24].
The 16 × 1 RAM units within the Xilinx CLBs are
good candidates for this DA scheme. One bit of a single
4-input LUT can fit into one of these units with
no unused logic. For FIR filters larger than 4 taps,
the filter can be broken into four-tap groups. For example,
a 16-tap FIR is shown in Fig. 9. To eliminate
overflow, each adder stage must grow by one bit, and
the scaling accumulator must also grow accordingly
in size (of course, the scaling accumulator could drop
the lower bits in its accumulation if less precision is
required).
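The distributed arithmetic scheme described above can be illustrated with a short behavioral model. This is a minimal sketch rather than the FPGA design itself: the function names are ours, the LUT is a plain Python list instead of 16 × 1 RAM, and the scaling accumulator is modeled by weighting each bit position (with a negative weight for the two's-complement sign bit) instead of shifting a register.

```python
import random


def build_lut(coeffs):
    """16-entry table for one four-tap group: entry `a` is the sum of the
    coefficients whose corresponding address bit in `a` is set."""
    assert len(coeffs) == 4
    return [sum(c for k, c in enumerate(coeffs) if (a >> k) & 1)
            for a in range(16)]


def da_fir_output(window, coeffs, bits=16):
    """Bit-serial DA evaluation of sum(c * x) for one output sample.

    `window` holds the signed input samples aligned with `coeffs`; the
    number of taps must be a multiple of four.
    """
    assert len(window) == len(coeffs) and len(coeffs) % 4 == 0
    luts = [build_lut(coeffs[g:g + 4]) for g in range(0, len(coeffs), 4)]
    mask = (1 << bits) - 1
    words = [x & mask for x in window]  # two's-complement bit patterns

    acc = 0
    for b in range(bits):  # one clock per input bit, LSB first
        partial = 0
        for g, lut in enumerate(luts):
            # The same bit position of four taps forms the LUT address.
            addr = 0
            for k in range(4):
                addr |= ((words[4 * g + k] >> b) & 1) << k
            partial += lut[addr]  # summation across the four-tap groups
        # Scaling accumulator: the sign bit carries a negative weight.
        weight = -(1 << b) if b == bits - 1 else (1 << b)
        acc += partial * weight
    return acc


rng = random.Random(0)
coeffs = [rng.randrange(-2**17, 2**17) for _ in range(16)]  # 18-bit signed
window = [rng.randrange(-2**15, 2**15) for _ in range(16)]  # 16-bit signed
direct = sum(c * x for c, x in zip(coeffs, window))
assert da_fir_output(window, coeffs) == direct
```

The check at the end compares the bit-serial result with a direct multiply-accumulate for a 16-tap filter; no multiplications by sample values occur inside `da_fir_output`, only table look-ups, additions, and power-of-two weights.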
The shift registers required by each four-tap group
can be implemented in the Xilinx Virtex distributed
16 × 1 RAM for 16-bit (or less) inputs. Each 16 × 1
RAM block may be configured to act as (up to) a
16-bit shift register. The length of the register is specified
by the value written into the RAM block’s 4-bit
address. Each clock cycle a bit is shifted into the front
of the register from the RAM’s data input, and a bit
is shifted out of the end of the register to the RAM’s
data output. With this technique, a single CLB can hold
the entire 16-bit input for two taps. This is much more
efficient than using registers to form each tap’s shift
register.
Figure 9. 16-Tap serial distributed FIR.
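The addressable shift-register behavior described above can be modeled in a few lines. This is a behavioral sketch under our own assumptions (the class name is hypothetical, and the register length is taken to be the address value plus one); it is not a description of the Virtex primitive's timing.

```python
class ShiftRegister16:
    """Behavioral model of a 16 x 1 RAM used as a shift register.

    The 4-bit address selects which cell drives the output, which sets
    the effective length (address + 1) of the register.
    """

    def __init__(self, length=16):
        if not 1 <= length <= 16:
            raise ValueError("length must be between 1 and 16")
        self.address = length - 1
        self.cells = [0] * 16

    def clock(self, bit_in):
        """Shift one bit in and return the bit leaving the register."""
        bit_out = self.cells[self.address]
        self.cells = [bit_in & 1] + self.cells[:15]
        return bit_out


# A 16-bit word shifted in serially reappears, bit for bit, 16 clocks later,
# which is what lets one RAM block hold the full input word of one tap.
word = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1]
tap = ShiftRegister16(16)
outputs = [tap.clock(b) for b in word + [0] * 16]
assert outputs[16:] == word
```

Chaining the output of one such register into the input of the next gives the serial tap delay line used by each four-tap group.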
Figure 9 could be expanded until a 512-tap filter
has been constructed, but two problems would exist.
First, the design requires coefficients that can be
changed, whereas the look-up tables in Fig. 9 hold values
precomputed from a fixed set of coefficients. Second, the
filter above requires B clock cycles to process a single
input sample, where B is the number of bits in the input
word. For a 5 MHz input, this means a filter with a 16-bit
input must run at 80 MHz, which is twice as fast as the
design specifications allow.
To solve the first problem, two banks of LUTs are
used for each four-tap group. One bank is used as the
active LUT and is addressed by the shift-register
outputs. The other bank is loadable by the user.
Therefore, the user can load all of the LUT values into
one bank of the FIR filter in the background while the
old LUT values stored in the other bank are active. By
toggling a bank select line, the two banks are switched
so that the previously loadable bank is now active, and
the previously active bank is now loadable.
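The double-banking idea can be sketched as a small behavioral model. The class and method names below are hypothetical, and the model ignores clocking and the multiplexer details; it only shows that loading the stand-by bank leaves the active output untouched until the select line is toggled.

```python
class DoubleBankedLUT:
    """Two 16-entry banks: one active (read), one stand-by (loadable)."""

    def __init__(self):
        self.banks = [[0] * 16, [0] * 16]
        self.select = 0  # index of the active bank

    def read(self, address):
        """Look up the active bank with the 4-bit shift-register address."""
        return self.banks[self.select][address & 0xF]

    def load(self, address, value):
        """Write one entry of the stand-by bank in the background."""
        self.banks[1 - self.select][address & 0xF] = value

    def toggle(self):
        """Swap the roles of the two banks."""
        self.select ^= 1


def lut_values(coeffs):
    """DA table for four coefficients: sum of those selected by the address."""
    return [sum(c for k, c in enumerate(coeffs) if (a >> k) & 1)
            for a in range(16)]


old_coeffs, new_coeffs = [3, -1, 4, 1], [2, 7, -1, 8]
lut = DoubleBankedLUT()
for addr, value in enumerate(lut_values(old_coeffs)):
    lut.load(addr, value)
lut.toggle()                       # old coefficients become active

for addr, value in enumerate(lut_values(new_coeffs)):
    lut.load(addr, value)          # background load, filter keeps running
before = lut.read(0b1111)          # still the old bank: 3 - 1 + 4 + 1
lut.toggle()
after = lut.read(0b1111)           # new bank: 2 + 7 - 1 + 8
```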
Figure 10 shows the complete block diagram for
a single four-tap group including the shift registers
and the double-banking LUTs as described above. The
four-tap group accepts a serial data input (from the last
tap’s shift register of the previous group), and produces
a serial data output for the next group’s first tap’s shift
register. A bank selection line selects (through multiplexers)
which bank of LUTs is active, and which is
used for loading new coefficients. Each bank is 20 bits
long as described above. The active bank is addressed
by the four shift register outputs. The bank’s output is
the group’s output. The load bank is addressed by an external
set of coefficient address lines that select which
of the 16 bank addresses is being written. A group coefficient
load enable line selects whether this group is to
have its coefficients updated versus another group, and
a bank write clock line writes the data into the correct
bank (the bank being loaded). A separate clock was
used for each bank’s LUT write clock to minimize the
amount of logic local to a group.
Pipelining has been inserted so that the cumulative
delay between pipeline registers is kept to a minimum
to increase performance. In addition, the bank selection
line has been pipelined between 4-tap groups.
This prevents a single bank selection line from having
to drive all the multiplexers in every 4-tap group, which
would lead to a very high fan-out and a very slow signal,
decreasing the overall system performance. One
drawback is that changing coefficient banks takes
128 clock cycles, during which the new bank selection
is propagated through its pipelining registers.
Any outputs produced during that time will combine
contributions from both coefficient banks, and will not
be correct outputs of either bank’s filter. This drawback
was resolved using a linear-network summer tree
[19]. Another implementation constraint is that coefficients
in a given 4-tap group’s stand-by registers cannot be
altered after a bank selection switch until the bank selection
signal has propagated to that group. Otherwise,
the wrong set of coefficients will be changed, since
that group still has the wrong bank selected. This constraint
was easily met since the entire bank change takes only
128 clock cycles, after which any 4-tap group’s coefficients
can be changed. A single four-tap group requires
36 CLBs.
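The 128-cycle bank change can be illustrated with a small model of the pipelined select line. The sketch assumes one pipeline register per four-tap group, which is our reading of the 128-cycle figure for 128 groups; the function name is hypothetical.

```python
def propagate_bank_select(n_groups=128, new_select=1):
    """Return the per-group select values after each clock, starting when
    the new select value is applied at the input of the first group."""
    stages = [1 - new_select] * n_groups
    history = []
    while any(s != new_select for s in stages):
        # Each clock, every group hands its select value to the next group.
        stages = [new_select] + stages[:-1]
        history.append(list(stages))
    return history


history = propagate_bank_select()
cycles_to_switch = len(history)

# Halfway through, half of the groups use the new bank and half the old
# one, so the summed filter output mixes both coefficient sets.
groups_on_new_bank_midway = sum(history[63])
```

In this model a group's stand-by bank may only be rewritten once its own entry in `stages` has flipped, which is the loading constraint described above.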
A 512-tap filter using an expanded version of Fig. 9
with the four-tap group in Fig. 10 has been built. This
filter was designed with full precision, meaning that
every adder stage grew by one bit (so no rounding
was required), and the scaling accumulator was large
enough to hold the entire 43-bit result. This 512-tap
filter required 6,203 CLBs.
Figure 10. Block diagram of four-tap serial DA group.
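One way to account for the 20-bit bank width and the 43-bit result is to add up the word growth at each stage. The decomposition below is our own reading (the text does not give this breakdown), assuming 18-bit coefficients, 16-bit inputs, and one extra bit per doubling of the number of summed terms.

```python
from math import ceil, log2

COEFF_BITS = 18   # coefficient width from the requirements
INPUT_BITS = 16   # input sample width from the requirements
TAPS = 512
GROUP_SIZE = 4    # taps per DA look-up table

# A LUT entry is the sum of up to four 18-bit coefficients.
lut_bits = COEFF_BITS + ceil(log2(GROUP_SIZE))

# Summing the outputs of all four-tap groups adds one bit per doubling.
groups = TAPS // GROUP_SIZE
sum_bits = lut_bits + ceil(log2(groups))

# The scaling accumulator weights that sum over all input bit positions.
result_bits = sum_bits + INPUT_BITS

# Sanity check: the largest-magnitude product sum fits in the signed result.
worst_case = TAPS * (-2 ** (COEFF_BITS - 1)) * (-2 ** (INPUT_BITS - 1))
fits = -(2 ** (result_bits - 1)) <= worst_case <= 2 ** (result_bits - 1) - 1

print(lut_bits, groups, sum_bits, result_bits, fits)
```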
To verify the functionality of the design, a custom
512-tap FIR filter was constructed in C++ that could
output full-precision fixed-point two’s-complement
results (see Note 1). The output of this simulation was
compared, bit-for-bit, with the output of the simulated
FPGA design. No errors were found for random sets of
input data and coefficients.
5.3. Xilinx Virtex XCV1000 Performance
and Area Results
The serial distributed arithmetic (SDA) design, as described
above, requires a clock rate 16 times faster than the
input data rate. For a 5 MHz data rate, this means the
serial filters must run at 80 MHz. The design specifications
require a clock 8 times faster than the input rate,
or 40 MHz, to be compatible with the DIQ subsystem
shown in Fig. 4. Two solutions were investigated to address
this requirement. The first approach operated the
internal SDA filter at 80 MHz, using the Virtex
Delay-Locked Loop (DLL) to multiply the external 40 MHz
clock up to an 80 MHz internal clock rate. The second
technique placed two 512-tap SDA filters on the chip
and had each filter, in parallel, operate on a single bit
of the input data at a time, so that two bits of the input
data are processed per clock (this technique is called 2-bit
parallel distributed arithmetic, or 2-bit PDA). This design
requires a clock only eight times faster than the input
rate, so the internal filters would only have to operate
at 40 MHz, yet the design would require twice as much
area as an SDA design.
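The clock-rate side of this tradeoff follows from the number of input bits processed per clock. The sketch below makes that relation explicit; the function name is ours, and the area figure is only the relative number of filter copies, not a CLB count.

```python
def required_clock_hz(sample_rate_hz, input_bits, bits_per_clock):
    """Internal clock needed when `bits_per_clock` bits of each input
    word are processed on every clock cycle."""
    cycles_per_sample = -(-input_bits // bits_per_clock)  # ceiling division
    return sample_rate_hz * cycles_per_sample


SAMPLE_RATE_HZ = 5_000_000
INPUT_BITS = 16

sda_clock = required_clock_hz(SAMPLE_RATE_HZ, INPUT_BITS, bits_per_clock=1)
pda2_clock = required_clock_hz(SAMPLE_RATE_HZ, INPUT_BITS, bits_per_clock=2)

# Each additional bit per clock needs another copy of the filter.
sda_filter_copies, pda2_filter_copies = 1, 2

print(f"SDA      : {sda_clock / 1e6:.0f} MHz, {sda_filter_copies} filter")
print(f"2-bit PDA: {pda2_clock / 1e6:.0f} MHz, {pda2_filter_copies} filters")
```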
The tradeoff between the two techniques was speed
versus area. The DLL design required the internal filter
to work twice as fast as the 2-bit PDA design, whereas
the 2-bit PDA design required twice as much area.
Unfortunately, the 2-bit PDA design required 12,756
CLBs for the 512-tap linear-network ﬁlter alone (with-
out the scaling accumulator and additional glue logic),
which was larger than what was available in the Virtex
XCV1000. For this reason, the DLL SDA design was
chosen.
A 512-tap linear-network SDA design using a Virtex
DLL to double the external clock rate required
6,446 CLB slices with the scaling accumulator and
glue logic added to the design. The Xilinx timing analyzer
software predicted a maximum worst-case clock
rate of 85.9 MHz with the -6 speed grade. In the Xilinx
4000XL series of FPGAs (the Virtex was based on
the 4000 design), real performance results were often
noted to exceed those of the timing analyzer software.
It is very likely that the filter design here would meet
the 80 MHz clock rate needed for the 5 MHz input rate
design requirement.
The Xilinx place and route routines did not optimally
place the FIR filter. The optimized design that was
chosen as the final design consisted of identical 4-tap
groups connected in a linear fashion. The only connections
to a given group were to the two adjacent groups.
Only the clock signals were routed to all the groups,
which could be handled by the Virtex’s global clock
interconnect. The design had also been highly pipelined
so that placing adjacent groups next to each other on
the FPGA would result in very short route lengths.
The place and route tools would only need to place a
4-tap group’s internal components once, then replicate
that group 128 times. However, the tools did not have
the knowledge that the design was made this way, so
all the various components of the design were placed
in a haphazard fashion, even with timing constraints
applied to the design. As a result, many of the wire
lengths were longer than necessary, decreasing the filter’s
performance. A short description of the MIT Lincoln
Laboratory placement tool is given in the next
section (for more details refer to [19]).
5.4. MIT Lincoln Laboratory Custom VHDL
Placement Tool
One of the beneﬁts of using VHDL to describe a de-
sign is the ability to create parameterized modules that
can be instantiated in higher levels of the design hi-
erarchy. For example, a delay line could be created in
VHDL with a variable number of bits and a variable
number of delay stages. When this delay line module is
instantiated, the exact sizing of a particular instance is
speciﬁed at compile time in the VHDL as part of the
instantiation code.
The problem with using VHDL for FPGA designs is
that there is no way presently (at least with the Xilinx
Foundation software) to communicate the desired layout
of a VHDL design to the placement software. In the
designs discussed above, a single small cell was often
replicated many times in a systolic fashion. For example,
in the DA approach, the 4-tap group was replicated
128 times. Each group only connected to the two adjacent
groups, creating a linear systolic network. Because
of this, care and time could be taken to place the components
that make up one group, and then this placement
could be replicated to all the instantiations of the 4-tap
group. Ensuring that consecutive groups were placed
adjacent to each other would result in the most efficient
design.
Unfortunately, VHDL has no method of describing
this process, and the placement tools do not have the
knowledge that the design was created in a systolic
fashion. As a result, the components from different 4-
tap groups and the rest of the logic were interspersed
among each other in a seemingly haphazard fashion.
Many signals that should have been short if the systolic
approach was taken ended up long, as components that
should have been near each other were far apart on the
chip. Constraining the design placement would drasti-
cally improve its performance.
Although some tools (for example, Synopsys) allow
attributes (such as placement constraints) to be passed
from VHDL to the synthesized netlist for instantiated
components, this feature is very limited. Components
created as part of a generate statement (for example,
a bank of registers x-bits long could be created by us-
ing a generate statement to duplicate a single register
x times) are created during the synthesis process, and
cannot have attributes attached to them. In addition,
writing a long string of attribute statements can be
tedious and error-prone.
A custom tool referred to as Cell Maker [19] was developed
to read in VHDL code, extract basic placement
constraints added as comments by the user, and create
a user constraints ﬁle (UCF) describing the placement
constraints for every component in the VHDL code.
This tool allows a cell that is replicated many times to
be placed once, and the replication strategy (e.g. lin-
ear systolic) to be speciﬁed in order to create the best
placement.
The only limitation is that instantiated library components
from the Xilinx FPGA library (e.g. registers,
LUTs, RAMs, adder primitives, etc.) must be used
within the VHDL code, instead of high-level synthesizable
code (for example, using the + operator for
addition). The reason for this limitation is that the tool
cannot infer the components that would be generated
by high-level code. All the components must be explicitly
instantiated so that placement constraints can
be attached to them. With better placement strategies, it
is expected that the linear systolic network DA design
performance could improve significantly. In the following
section we illustrate the layout improvements
achieved using Cell Maker.
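The idea of placing one cell once and replicating it along a linear chain can be sketched as a small constraint generator. This is not Cell Maker and does not reproduce its comment format, which is described in [19]; the instance names, the cell template, the 64 × 96 CLB grid, and the `LOC = "CLB_RrCc.Ss"` line style are all assumptions made for illustration.

```python
def serpentine_origins(n_cells, cell_rows, cell_cols, grid_rows, grid_cols):
    """Top-left (row, col) of each cell, snaking down one stripe of
    columns and back up the next so that consecutive cells are adjacent."""
    per_stripe = grid_rows // cell_rows
    if per_stripe == 0:
        raise ValueError("cell is taller than the grid")
    origins = []
    for i in range(n_cells):
        stripe, pos = divmod(i, per_stripe)
        if stripe % 2:                      # odd stripes run bottom to top
            pos = per_stripe - 1 - pos
        row = 1 + pos * cell_rows
        col = 1 + stripe * cell_cols
        if col + cell_cols - 1 > grid_cols:
            raise ValueError("chain does not fit in the grid")
        origins.append((row, col))
    return origins


def ucf_lines(template, origins, cell_name="group_{index}"):
    """Replicate one cell's relative placement at every origin."""
    lines = []
    for index, (row0, col0) in enumerate(origins):
        prefix = cell_name.format(index=index)
        for component, (d_row, d_col, slice_no) in template.items():
            loc = f"CLB_R{row0 + d_row}C{col0 + d_col}.S{slice_no}"
            lines.append(f'INST "{prefix}/{component}" LOC = "{loc}";')
    return lines


# Hypothetical cell: component name -> (row offset, column offset, slice).
template = {
    "shift_reg_0": (0, 0, 0),
    "shift_reg_1": (0, 0, 1),
    "lut_bank_a": (0, 1, 0),
    "lut_bank_b": (0, 1, 1),
}
origins = serpentine_origins(128, cell_rows=4, cell_cols=4,
                             grid_rows=64, grid_cols=96)
constraints = ucf_lines(template, origins)
print("\n".join(constraints[:4]))
```

The serpentine ordering is one simple way to keep each group next to its two neighbors in the chain; other replication strategies would only change `serpentine_origins`.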
5.4.1. Placement and Routing Using Cell Maker
Custom Tool. Using Xilinx placement and routing
tools, the linear systolic design was not placed well
within the Xilinx Virtex XCV1000, resulting in long
route lengths that decreased performance. The placed
and routed version of this systolic design layout, using
this approach, is shown in Fig. 11. This design
had the best performance out of twenty different placements
run by the Xilinx tools via their multi-pass place
and route feature. As the figure shows, the placement
strategy did not take advantage of any of the systolic,
regular, design features built into the linear DA design.
Figure 11. Placed and routed linear systolic DA design without Cell Maker.
The linear systolic DA design was then constrained
with the Cell Maker constraints so that the four-tap
groups were placed in a linear chain as shown in Fig. 12.
A parallel-to-serial shift register at the input of the first
four-tap group changes the 16-bit X input into serial
data for the DA algorithm, and a scaling accumulator
at the output of the last four-tap group produces full
43-bit outputs.
The final placed and routed design with Cell Maker
constraints is shown in Fig. 13. The performance
increased from 86 MHz to 118 MHz with the improved
systolic layout, a 37% improvement due to layout
alone. It is also evident that the overall layout is much
more regular and consistent with a systolic design with
short interconnects.
The linear systolic SDA design constrained with
Cell Maker, as shown in Fig. 13, had a maximum clock
rate of 117.5 MHz, which allowed a maximum sample
rate of 7.3 MHz. With a sample rate of 7.3 MHz, and
512 multiply and add operations (two operations each)
performed per input sample, the linear systolic DA design
could perform 7.475 GOPS. This throughput represents
about a 49% increase over the older custom VLSI design
described in Section 3 (with a maximum capability
of 5.02 GOPS).
For the linear systolic DA design, about 6,700 flip-flops
and shift registers were clocked, so the power consumption
for the design was estimated at 12.14 watts
with a 117 MHz clock [19]. With 7.475 billion operations
performed per second at 12.14 watts, the throughput/power
factor for the linear systolic DA design was
0.62 GOPS/Watt. In comparison, the custom VLSI chip
throughput/power, for its 512-tap real mode, was estimated
at 1.57 GOPS/Watt, or 2.5 times better than the
FPGA linear systolic DA design.
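The throughput and efficiency figures quoted in this section can be reproduced with a few lines of arithmetic. The sketch below uses only numbers from the text; the function names are ours, and it assumes 16 clocks per sample for the bit-serial design.

```python
BITS_PER_SAMPLE = 16   # serial DA needs one clock per input bit
TAPS = 512
OPS_PER_TAP = 2        # one multiply and one add


def max_sample_rate_hz(clock_hz):
    return clock_hz / BITS_PER_SAMPLE


def gops(sample_rate_hz):
    return sample_rate_hz * TAPS * OPS_PER_TAP / 1e9


# Timing-analyzer clock before Cell Maker, and clock after Cell Maker.
rate_unconstrained_mhz = max_sample_rate_hz(85.9e6) / 1e6   # about 5.37
rate_cell_maker_mhz = max_sample_rate_hz(117.5e6) / 1e6     # about 7.34

throughput_gops = gops(7.3e6)                  # text rounds the rate to 7.3 MHz
gain_over_vlsi = throughput_gops / 5.02 - 1.0  # relative to 5.02 GOPS

fpga_gops_per_watt = throughput_gops / 12.14
vlsi_advantage = 1.57 / fpga_gops_per_watt

print(f"{throughput_gops:.3f} GOPS, {gain_over_vlsi:.0%} over custom VLSI")
print(f"{fpga_gops_per_watt:.2f} GOPS/W, VLSI better by {vlsi_advantage:.1f}x")
```

Both clock rates leave the 5 MHz input-rate requirement satisfied, while the custom VLSI chip keeps roughly a 2.5 times advantage in throughput per watt.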
One important lesson learned from this implementation
experience is that generic tools can compromise
the mapping of the desired architecture structure to the
available FPGA hardware. For best utilization of the
FPGA resources, with the goal of achieving highest
performance over minimum area, the placement and
routing tools should exploit architecture specific
constraints.
Figure 12. Linear systolic DA Cell Maker layout.
Figure 13. Linear systolic DA final placement and routing after Cell Maker.
Figure 14. Growth of embedded radar signal processing throughput.
6. Future Candidate Applications
for Reconﬁgurable Computing Technologies
In an effort to explore possible application areas for
FPGA computing, computational requirements for various
military signal-processing applications were surveyed.
These applications included shipboard radar,
airborne radar, radars for unmanned air vehicles
(UAVs), missile seekers, and space-based radar (SBR).
For each application the processing throughput in
teraflops (TFLOPS = 10^12 operations/sec.) was determined,
along with an estimate of when (approximate
year) the technology was required, and what constraints
(volume, power) were imposed by the platform. The
results are shown graphically in Fig. 14.
The bulk of near-term requirements fall in the
50-500 GFLOPS range (sustained). There is a general
trend which shows computational requirements
increasing approximately an order of magnitude every
five years. At this pace computational requirements
will exceed 10 TFLOPS in the 2005–2010 time frame.
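The 2005–2010 estimate is consistent with a simple extrapolation of the stated trend. The sketch below assumes, for illustration only, that the 50-500 GFLOPS range applies around 1999 (the year the paper was submitted); the function name is ours.

```python
from math import log10

YEARS_PER_DECADE_OF_GROWTH = 5.0   # tenfold growth every five years
BASE_YEAR = 1999                   # assumed reference year (illustrative)
TARGET_FLOPS = 10e12               # 10 TFLOPS


def years_to_reach(target_flops, current_flops):
    """Years needed to grow from current_flops to target_flops."""
    return YEARS_PER_DECADE_OF_GROWTH * log10(target_flops / current_flops)


earliest = BASE_YEAR + years_to_reach(TARGET_FLOPS, 500e9)  # from 500 GFLOPS
latest = BASE_YEAR + years_to_reach(TARGET_FLOPS, 50e9)     # from 50 GFLOPS

print(f"10 TFLOPS reached between {earliest:.1f} and {latest:.1f}")
```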
The ratios GOPS/ft^3 and MOPS/Watt express the
platform constraints of volume and power, respectively.
For the various application areas, these requirements
span almost six orders of magnitude. These requirements
will dictate the technology required for implementation.
Several technologies were surveyed, based
on current and previous experience at MIT Lincoln
Laboratory. Programmable processors (SHARC, PowerPC,
DEC Alpha, etc.) are capable of up to approximately
50 GOPS/ft^3 (peak) and 50 MOPS/Watt, assuming
they are packaged for embedded use. Typical
efficiencies run about 25%, so these numbers should be
scaled accordingly. Reconfigurable computing based
on FPGAs should be able to reach 1000 GOPS/ft^3
and 1000 MOPS/Watt. This projection assumes embedded
packaging with MCMs (Multi-Chip Modules).
Efficiencies are likely to be substantially higher than
with programmable processors, but less than unity.
Full custom VLSI with embedded packaging is the
most capable technology, reaching out to approximately
20,000 GOPS/ft^3 and 75,000 MOPS/Watt.
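The way these figures dictate the choice of technology can be sketched as a simple lookup. The peak values and the 25% processor efficiency come from the text; treating the FPGA and custom VLSI figures at their peak values is an optimistic simplification of ours, since the text gives no efficiency numbers for them.

```python
# Peak capabilities quoted in the text: (name, GOPS/ft^3, MOPS/W).
PEAK_CAPABILITY = [
    ("programmable processors", 50.0, 50.0),
    ("FPGA-based reconfigurable computing", 1000.0, 1000.0),
    ("full custom VLSI", 20000.0, 75000.0),
]
PROCESSOR_EFFICIENCY = 0.25  # typical efficiency for programmable processors


def usable_capability(name, gops_per_ft3, mops_per_watt):
    """Scale peak figures by efficiency where the text provides one."""
    scale = PROCESSOR_EFFICIENCY if name == "programmable processors" else 1.0
    return gops_per_ft3 * scale, mops_per_watt * scale


def least_capable_technology(required_gops_per_ft3, required_mops_per_watt):
    """First technology, in order of increasing capability, that meets
    both the volume and the power requirement; None if none does."""
    for name, density, efficiency in PEAK_CAPABILITY:
        usable_density, usable_efficiency = usable_capability(
            name, density, efficiency)
        if (usable_density >= required_gops_per_ft3
                and usable_efficiency >= required_mops_per_watt):
            return name
    return None


print(least_capable_technology(10, 10))
print(least_capable_technology(100, 100))
print(least_capable_technology(5000, 5000))
```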
The computational requirements of many of the applications
surveyed fall outside of the capabilities of
programmable processors. While the capabilities of
such processors are increasing with each passing year,
so are the computational requirements. There will always
be applications for which the platform constraints
push signal processing requirements into the realm of
reconfigurable computing and custom VLSI hardware.
Furthermore, the bulk (60%–90%) of the processing for
these applications consists of the “easy ops” of FIR filtering,
FFTs, and beam forming (exclusive of weight computation),
which are well suited to hardware implementation
either in FPGAs or custom silicon.
7. Conclusion
Historically, front-end signal processors for radar systems
have demanded computational throughputs well
beyond any contemporary microprocessor. Therefore,
system designers have been required to develop custom
VLSI devices to meet these throughput demands.
Recently there has been a major evolution in reconfigurable
computing, making it viable to solve these front-end
signal processing problems with commercial-off-the-shelf
FPGAs.
As FPGA technologies become more capable, we
can now expect to perform the same digital filtering operations
previously delegated to custom VLSI designs.
The FPGA technology offers several important benefits
to these military applications, such as:
– Reduction in design and manufacturing costs relative
  to custom VLSI designs.
– Flexibility to implement different filtering functions
  using the same FPGA devices.
– Ability to upgrade the design to higher ADC sampling
  rates by substituting more capable FPGA devices
  commensurate with the Moore’s law progression
  in silicon technology.
– Wider vendor sources of FPGA technologies than
  available for custom VLSI designs.
In the past several years, FPGAs have been successfully
used for either microprocessor emulation or digital
control functions. For digital control functions, the
FPGAs have replaced many discrete integrated circuits,
reducing complexity and providing more flexibility.
People have also demonstrated the application
of FPGAs to signal processing for cases where the arithmetic
operations only required a few bits of precision.
In this paper, we presented a recently fielded high-performance
signal processor built to provide an aggregate
throughput of 100 GOPS based on a custom
VLSI design, with an input precision of 18 bits and an
output precision of 24 bits. We used this example as
an existence proof. It also provided the design goals
that needed to be demonstrated with FPGA hardware. Only
a few years ago, it would have been unacceptable to
attempt to solve this problem with FPGAs. This limitation
was due to the very limited number of transistors on
a die, the slow clock frequency, excessive downloading
times, the inability to operate on a large data word,
and poor placement and routing tools that could not
efficiently use the limited hardware available. These areas
have advanced exponentially in the past few years. In this
paper, we have demonstrated that the same front-end
digital filtering requirements could be met with the latest
Xilinx Virtex XCV1000 device, without sacrificing
system performance.
The implementation of the digital in-phase and
quadrature filtering using the custom VLSI design was
based on a parallel group of multiply-add cells. This
approach was not feasible in reconfigurable hardware.
Instead we opted to implement the digital filtering
function using a serial distributed arithmetic approach.
This technique, previously published in the literature,
offered maximum utilization of the RAM-based FPGA
while maintaining the throughput performance requirement.
However, we had to assist the Xilinx placement
tools by using the custom Cell Maker tool to create a
user constraints file (UCF) describing the placement
constraints for every component in the VHDL code.
This tool allowed a cell that is replicated many times
to be placed once, and the replication strategy (e.g.
linear systolic) to be specified in order to create the
best placement across the array of CLBs.
We concluded the paper by presenting several important
applications with the potential to be impacted
by FPGA technology. Reconfigurable computing based
on FPGAs was predicted to reach 1000 GOPS/ft^3 and
1000 MOPS/Watt within a couple of years, with the
assumption of using embedded packaging with MCMs
(Multi-Chip Modules). Throughput efficiencies were
also predicted to be substantially higher than with programmable
processors, but less than unity.
We do not imply that all front-end radar signal processing
can be solved using reconfigurable computing.
There are still system specifications, for example, in
spaceborne applications, where custom VLSI solutions
are further ahead than reconfigurable hardware, particularly
when we need solutions with throughputs per
unit power ≥ 20 GOPS/W, data throughputs exceeding
GBytes/sec, and compliance with radiation hardening
requirements.
Acknowledgments
We would like to thank several people at MIT Lincoln
Laboratory for their help with various aspects of the
technologies discussed in this paper. Bob Ford, Bill
Song, Michael Killoran, and Huy Nguyen provided
many insights into the application of FPGAs to front-
end digital ﬁltering. The architecture for the custom
front-end radar signal processor was developed by Paul
McHugh. The custom chip speciﬁcations were pro-
vided by Bob Pugh, and the design was implemented
by Joe Greco. The chip was fabricated by National
Semiconductor at the Arlington, Texas facility. Finally,
the back-end chip design and design rule veriﬁcations
were performed by Mentor Graphics, Inc. The authors
are also thankful to the anonymous reviewers for their
inputs and constructive comments.
Note
1. Courtesy of Michael Killoran, MIT Lincoln Laboratory.
References
1. G.W. Stimson, Introduction to Airborne Radar, 2nd edn., Scitech
Publishing, Inc., 1998.
2. J. Ward, “Space-Time Adaptive Processing for Airborne Radar,”
Technical Report #1015, MIT Lincoln Laboratory, Dec. 1994.
3. A.V. Oppenheim and R.W. Schafer, Discrete-Time Signal Pro-
cessing, Prentice Hall, Inc., 1989.
4. K. Teitelbaum, “A Flexible Processor for a Digital Adaptive
Array,” in Proceedings of the 1991 IEEE National Radar Con-
ference, March 1991.
5. E.D. Baranoski, “Pre-Processor Speciﬁcations,” Internal Mem-
orandum, 21 April 1995.
6. J. Greco, “A1000 Critical Design Review,” MIT Lincoln Labo-
ratory, June 1996.
7. “Xilinx Celebrates 15th Year of Continuous Innovation in
Programmable Logic,” Xilinx Press Release, Feb. 1999,
http://www.xilinx.com/company/anniversary.htm.
8. “The Future of FPGAs,” Xilinx White Paper, http://www.
xilinx.com/prs rls/5yrwhite.htm.
9. K. Chapman, “Building High Performance FIR Filters
Using KCMs,” Xilinx App Note, July 1996, http://www.xilinx.
com/appnotes/kcm ﬁr.pdf.
10. G.R. Goslin, “A Guide to Using Field Programmable Gate
Arrays (FPGAs) for Application-Specific Digital Signal Pro-
cessing Performance,” Xilinx, Dec. 1995, http://www.xilinx.
com/appnotes/dspguide.pdf.
11. L. Mintzer, “Large FFTs in a Single FPGA,” in Proceedings of
the 7th International Conference on Signal Processing Applica-
tions & Technology, Boston, MA, 7–10 Oct. 1996.
12. B. Allaire and B. Fischer, “Adaptive Filters in FPGAs,” in Pro-
ceedings of the 7th International Conference on Signal Process-
ing Applications & Technology, Boston, MA, 7–10 Oct. 1996.
13. “Xilinx Core Solutions Data Book,” Xilinx, 2/98, http://www.
xilinx.com/products/logicore/core sol.htm.
14. “A Simple Method of Estimating Power in XC4000XL/EX/E
FPGAs,” Xilinx Application Brief XBRF014 v 1.0, June 1997,
http://www.xilinx.com/xbrf/xbrf014.pdf.
15. “PowerPC 603e Microprocessors,” Motorola,
http://www.motorola.com/SPS/PowerPC/products/semiconductor/cpu/603.html.
16. “PowerPC 750 and PowerPC 740 Microprocessors,” Motorola,
http://www.motorola.com/SPS/PowerPC/products/semiconductor/cpu/750.html.
17. W. Song, “A Two Trillion Operations per Second Miniature
Mixed Signal Radar Receiver/Processor,” in Asilomar Confer-
ence on Signals, Systems, and Computers, Nov. 1998.
18. T.J. Moeller and D.R. Martinez, “Field Programmable Gate Ar-
ray Based Front-End Digital Signal Processing,” in IEEE Sym-
posium on Field-Programmable Custom Computing Machines,
FCCM’99, April 1999.
19. T.J. Moeller, “Field Programmable Gate Array for Front-
End Digital Signal Processing,” Master of Engineering Thesis,
Massachusetts Institute of Technology, May 1999.
20. G.R. Goslin, “Using Xilinx FPGAs to Design Custom Digital
Signal Processing Devices,” in DSPX 1995 Technical Proceed-
ings, 12 Jan. 1995, p. 595.
21. G.R. Goslin, “Implement DSP Functions in FPGAs to Reduce
Cost and Boost Performance,” EDN, 1996.
22. B. New, “A Distributed Arithmetic Approach to Designing
Scalable DSP Chips,” EDN, 17 Aug. 1995.
23. Xilinx Publications, “The Role of Distributed Arithmetic in
FPGA-based Signal Processing,” Technical Report.
24. S.A. White, “Application of Distributed Arithmetic to Digital
Signal Processing: A Tutorial Review,” IEEE ASSP Magazine,
July 1989.
David R. Martinez received a B.S. degree in Electrical Engineering
from New Mexico State University in 1976. He received an M.S.
and E.E. degree in Electrical Engineering from MIT, jointly with
the Woods Hole Oceanographic Institution in 1979. Mr. Martinez
also completed an MBA degree from the Southern Methodist Uni-
versity in 1986. He worked at the Atlantic Richﬁeld Co. in seismic
signal processing from 1979 to 1988. During this time, Mr. Martinez
worked on algorithm development and technology field
demonstrations. While at Atlantic Richfield Co., he received a Special
Achievement Award for the conception, management, and implementation
of a multidisciplinary project. He holds three U.S. patents relating to
seismic signal processing hardware. He has worked at MIT Lincoln
Laboratory since 1988. His areas of interest are in VLSI signal pro-
cessing and high performance parallel processing systems. He has
been responsible for managing the development of several complex
real-time signal processor systems. Mr. Martinez is Associate Divi-
sion Head in the Air Defense Technology Division. For the last three
years, he has been the chairman for a national workshop on high
performance embedded computing, held at MIT Lincoln Laboratory. He
also served as an Associate Editor for the IEEE Signal Processing
Magazine.
Tyler J. Moeller grew up in Alexandria, VA, where his interest
in electrical engineering was sparked over several summers of
internships at the Army’s Night Vision and Electro-Optics
Laboratory during high school. He then attended MIT, where he received
his Bachelor’s degree in Electrical Engineering and Computer
Science. During the summers, he was an intern at MIT Lincoln
Laboratory, where he worked on the Laboratory’s Miniaturized Digital
Receiver project, designing the digital ﬁltering multi-chip module
for the project. Tyler then received his Master’s degree in Electrical
Engineering and Computer Science from MIT while working on his
thesis, Field Programmable Gate Arrays for Radar Front-End Digital
Signal Processing, at Lincoln Laboratory with Dave Martinez. He is
now a lead developer at carOrder.com, a spin-off company of Trilogy
Software.
Kenneth Teitelbaum received the B.S.E.E. degree in 1977 from
the State University of New York at Stony Brook and the M.S.
degree in 1979 from the University of Illinois at Urbana-Champaign.
Since then he has been employed by M.I.T. Lincoln Laboratory in
Lexington, MA, and currently holds the position of Senior Staff
Member in the Embedded Digital Systems Group. His interests in-
clude radar systems design, adaptive signal processing, and parallel
computing.