Abstract. Custom computing machines, a class of computational platforms consisting of reconfigurable functional units with reconfigurable interconnection networks, provide a middle ground between special-purpose hardware, which provides high execution speed, and general-purpose computers, which offer flexibility. The Splash-2 system, one such custom computing machine, is an experimental platform for complex computations requiring the high speed of special-purpose hardware. The reconfigurability and modularity of Splash-2, along with automated synthesis tools, allow for rapid, staged development of applications. This paper demonstrates the applicability of Splash-2 to the image-processing area and gives an introduction to the programming environment used in developing applications. An example image-processing application based upon two-dimensional convolution is described, and the operating procedures of custom computing machines are presented. Also presented are the details of directly implementing two-dimensional convolution in a straightforward, systolic fashion in reconfigurable hardware.
Introduction
The availability of a high-performance, cost-effective, programmable computing platform is the first step towards high-speed signal processing devices. Such devices may be found in computer vision applications such as vehicles which guide themselves, whether this involves weaving around objects on a factory floor or automatically driving down a highway, or security systems which monitor a home or warehouse without the need of human intervention. The emergence of these types of computer vision systems and other signal processing devices is slowed by one important factor: the computational complexity of the necessary operations. Most general-purpose computing platforms lack the speed necessary to quickly process large quantities of data. Reconfigurable hardware allows these tasks to be performed quickly while still leaving room for future modifications and enhancements.
This research has been supported in part by the National Science Foundation (NSF) under grant
While signal processing can be performed at high rates on live data streams with costly custom hardware, or at slower rates with a more flexible system such as a workstation, personal computer, or mainframe, the Splash-2 attached processor offers an excellent compromise. The Splash-2 system is a custom computing machine created at the Supercomputing Research Center in Bowie, Maryland [1, 2]. It is relatively inexpensive compared to specialized signal-processing hardware [3], yet it retains the requisite speed for performing complex computations on high-rate data streams such as live video. Like a general-purpose machine, Splash-2 can also be programmed for a particular task at one moment and then reprogrammed for a new task a moment later. These qualities allow for the implementation of various computationally intensive image-processing tasks such as two-dimensional convolution.
One important aspect of the Splash-2 architecture which allows it to perform image-processing tasks is the combination of programmable functional units and programmable interconnection resources between the functional units. Programmable functional units allow for high-speed, efficient implementations of the necessary computations in which only the exact number of bits required in a calculation are computed. Programmable interconnections provide an effective means of transporting data to the desired units, such as recirculating data through a butterfly unit when calculating a fast Fourier transform [4]. Both of these capabilities are required when processing live video data.
Another aspect derived from the reconfigurability of Splash-2 involves its rapid prototyping abilities when developing applications and the inherent modularity which allows configurations to be concatenated. Since the system may be reconfigured easily, applications can be developed in stages. A preliminary test stage may examine the performance of a subset of the necessary configuration, perhaps the portion representing the core of the desired computation. Once this subset is proven to operate correctly, additional configurations may be appended to complete the application.
This paper explores the applicability of the Splash-2 system to performing image-processing tasks. After a brief introduction to the Splash-2 architecture and the various modifications made to allow for image processing of live video, an example application will be examined in detail. The application, 8-by-8 window convolution of live video data, is one that requires high rates of computation, data movement, and manipulation, thus necessitating the use of a high-speed platform. Specifically, many multiplication and addition operations are needed during each clock cycle. Given a reasonable clock rate, this requires the performance of several hundred million operations each second. To illustrate the design and programming process for custom computing machines, the construction and implementation of the convolution application will be discussed in depth.
This paper is organized as follows: Section 2 introduces the Splash-2 custom computing machine and the modifications made to obtain a system able to process live video data. In Section 3, the two-dimensional convolution procedure is defined, and the possible simplifications which facilitate an efficient implementation are presented. The strengths and weaknesses of custom computing machines for image processing are discussed in Section 4. Section 5 presents the internal details of the convolution application, and Section 6 provides the results along with comparisons between Splash-2 and several special-purpose and general-purpose architectures.
The Splash-2 System
In order to understand why the Splash-2 system is well suited to the task of image processing, an overview of the system is necessary. An examination of the modifications made to allow Splash-2 to perform image processing will further demonstrate the ease with which Splash-2 may be used. The design process through which Splash-2 programmers create a working configuration will also be explained.
The Splash-2 Architecture
As mentioned in the introduction, Splash-2 is a custom computing machine: a platform that provides high-speed computation and high-bandwidth intercommunication by the implementation of reconfigurable functional units and interconnects. The Splash-2 system consists of an array of Xilinx XC4010 FPGAs with associated RAM. Each FPGA-RAM pair is a processing element, or PE. On each Splash-2 processor board, there are seventeen PEs referred to as X0 through X16. Processing elements X1 through X16 are connected in a linear array via 36-bit data paths, and an additional 36-bit full-crossbar interconnection path is provided between all of the PEs. The RAM associated with each processing element is arranged as 256k words, each 16 bits in width. A high-level block diagram of the Splash-2 processor board is shown in figure 1.
While the linear data path is fixed, the 36-bit crossbar data path may be used to change the interconnection of the processing arrays to any desired topology. Not only can the crossbar be used to connect any processing element to any other, [...] the duration of the execution of the configuration, or it may be altered every clock cycle by the first PE, control element X0.
Up to fifteen processor boards may be concatenated through a 36-bit data path from X16 on one board to X1 on the next. This group of processor boards is controlled by an interface board which connects to a Sun Microsystems workstation by an S-Bus connector. The interface board passes data to element X0 of each processor board and to element X1 of the first processor board by a 36-bit bus referred to as the SIMD bus. It also acquires data from element X16 of the last processor board through the 36-bit RBus; these data may then be passed to the workstation.
VTSplash
The VTSplash system is a modification of the standard Splash-2 configuration with the alterations necessary to handle live video for image-processing applications. In particular, a black-and-white RS-170 video camera has been attached to the input data stream. A depiction of the VTSplash laboratory system is shown in figure 3. The video data are fed to a digitizer sequencer board which digitizes the image into a 512-by-480 non-interlaced frame with 8 bits per pixel representing intensity value. The resulting frame is then given to the Splash-2 system for processing. Data produced by Splash-2 are, in most applications, in a similar format, and are fed to a frame buffer board. This board collects the data until an entire image is received and then sends that image to a Data Translation DT-2867LC frame grabber card that displays the result on a monitor. The Sparc-2 workstation attached to Splash-2 is no longer the primary source of data input. Instead, it is used to provide programming information to configure Splash and control data to manipulate an image-processing application as it is running. While a regular black-and-white camera is currently being used, any RS-170 video source can be fed to the digitizer chip for processing, including output from a VCR.
Design process
Creating an application for Splash-2 involves defining the appropriate function of each of the processing elements. This definition may be given as a VHDL component, or it may be drawn as a schematic in tools such as XBLOX [9]. Given this high-level description, synthesis tools can create a configuration file for the associated processing element. Figure 2 shows an excerpt of the VHDL code used in the design of the convolution application. The code generates the control signals needed for a circular memory buffer used to delay incoming data. When this code is passed to the synthesis tools, the tools will create the adders, multiplexers, and other logic needed. Also, the appropriate signals will be routed to the I/O pins of the processing element in order to interface with the attached memory. The synthesis tools will even detect the presence of the wait command and apply flip-flops triggered by the XP Clk signal to the outputs of the necessary functional units.
Unfortunately, many of the steps involved in the creation of a design are performed manually, and even experienced designers may introduce errors. Continuing research has been focusing on the ability to automate most of these steps and proceed from an original behavioral model of the entire design to the final synthesized configuration without the need for human interaction [5, 6, 7]. With the creation of such design tools, the time required to program a new Splash-2 design may be reduced to a matter of hours or less, as opposed to the several months it took to create the convolution design described in Section 5.
High-Speed Convolution
Two-dimensional convolution is commonly used in image processing for filtering, edge detection, and feature extraction, among other tasks. The computational complexity involved and the fast execution speed required for convolution of live video data make this task an excellent candidate for demonstrating the applicability of Splash-2 to the image-processing community. Convolution is basically a weighted sum of data points. If n data points are considered, n products are necessary to compute the weighted values, and n - 1 additions are required to sum the products. This weighting and summing occurs for every output point. A quick overview of convolution, along with the available simplifications, will demonstrate the complexity required.
The basic formula for two-dimensional, discrete convolution of a finite template with a finite image is shown in (1), where w and h are the width and height of the coefficient template, which is assumed to be a zero-based array [8].
Two-dimensional image data is most typically stored and transferred in raster-scan format. This format orders the pixels into a one-dimensional array by using the upper left pixel first, the pixel immediately to the right next, and continuing on to the right until the end of the top row is reached. The second row is used in a similar fashion and so on, down the rows, until the entire image has been covered. This is the format of the data presented to the VTSplash system.
Raster-scan format fits the two-dimensional data into one dimension and, in doing so, also reduces the two-dimensional convolution into a one-dimensional operation. Given the original image width, $W$, an equivalence between a new iterator $z$ and the old iterators $x$ and $y$ can be defined as in (2). This new iterator allows the incoming image, $M(x, y)$, the resulting image, $N(x, y)$, and the filter template, $C(x, y)$, to be rewritten as the input raster data, $R(z)$, the output raster data, $S(z)$, and a new raster filter template, $D(z)$, as shown in (3).
$$z \equiv x + W y \qquad (2)$$
$$R(z) = M(x, y), \quad S(z) = N(x, y), \quad D(z) = C(x, y) \qquad (3)$$
Rewriting the original convolution equation (1) now yields the simple, one-dimensional convolution shown in (4). Figure 4 illustrates these simplifications. One-dimensional convolution may be implemented more readily in hardware, as it only requires one register for an iterator or an image offset, whereas the two-dimensional convolution would require two, along with an, albeit trivial, multiplication to index the data.
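The raster-scan equivalence can be checked numerically. The following is a minimal sketch rather than the Splash-2 implementation: it assumes the usual form of discrete convolution, N(x, y) = sum over i, j of M(x - i, y - j) C(i, j), and the function names `convolve_2d` and `convolve_raster` are hypothetical. Outputs whose window would extend past the top or left image edge are left as `None`.

```python
def convolve_2d(image, template):
    """Direct 2-D convolution (assumed form); image[y][x], template[j][i]."""
    H, W = len(image), len(image[0])
    h, w = len(template), len(template[0])
    out = [[None] * W for _ in range(H)]
    for y in range(h - 1, H):
        for x in range(w - 1, W):
            out[y][x] = sum(image[y - j][x - i] * template[j][i]
                            for j in range(h) for i in range(w))
    return out


def convolve_raster(image, template):
    """Same result computed as a 1-D convolution over the raster stream."""
    H, W = len(image), len(image[0])
    h, w = len(template), len(template[0])
    # R(z) = M(x, y) with z = x + W*y
    R = [pixel for row in image for pixel in row]
    # D(z) = C(x, y): a sparse 1-D template indexed by raster offset
    D = {i + W * j: template[j][i] for j in range(h) for i in range(w)}
    S = [None] * (W * H)
    for z in range(W * H):
        x, y = z % W, z // W
        if x >= w - 1 and y >= h - 1:  # skip outputs affected by edges
            S[z] = sum(R[z - k] * coeff for k, coeff in D.items())
    return S


# Small deterministic example: 6-wide, 5-high image and a 3-by-2 template.
image = [[(7 * x + 3 * y * y) % 11 for x in range(6)] for y in range(5)]
template = [[1, -2, 3], [0, 4, -1]]
N = convolve_2d(image, template)
S = convolve_raster(image, template)
```

In the raster form, a single offset `z - k` replaces the pair of indices `(x - i, y - j)`, which is why only one offset register is needed.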
Suitability of Custom Computing Machines
The VTSplash system has many properties necessary to perform image-processing tasks on live video data. The high speed of the dedicated hardware, coupled with the reconfigurability of the FPGAs and FPICs (field-programmable interconnects), makes VTSplash a feasible platform for image-processing experimentation. However, certain weaknesses in the Splash-2 system are obstacles to this goal.
High speed: Systems performing calculations on live video require high-speed computation and large-bandwidth communications in order to process the incoming frame data. A 512-by-512 frame acquired at 15 frames per second creates an incoming pixel rate of 3,932,160 pixels every second. When invalid data corresponding to horizontal and vertical retracing is considered, as is the case in the VTSplash system, calculation speeds of 5 MHz are necessary. As shown in the previous section, even the most common image-processing tasks can require more than a hundred arithmetic operations per input pixel. At frame rates as slow as 15 frames per second, this corresponds to more than half a billion arithmetic calculations each second for one process alone.
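The rates quoted above follow from simple arithmetic. The sketch below reproduces them; the constant names are illustrative, and the figure of 100 operations per pixel is taken as the lower bound stated in the text.

```python
WIDTH, HEIGHT, FRAMES_PER_SECOND = 512, 512, 15

# Valid pixels arriving each second (retrace intervals excluded).
valid_pixel_rate = WIDTH * HEIGHT * FRAMES_PER_SECOND

# Pixel clock once horizontal and vertical retrace time is included.
PIXEL_CLOCK_HZ = 5_000_000

# Lower bound from the text: "more than a hundred" operations per pixel.
OPS_PER_PIXEL = 100
ops_per_second = PIXEL_CLOCK_HZ * OPS_PER_PIXEL

print(valid_pixel_rate)  # 3932160
print(ops_per_second)    # 500000000
```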
Reconfigurability and Modularity: The Splash-2 system can be entirely reconfigured in less than a second. This reconfigurability allows it to be reused for other image-processing tasks. It also lets the Splash-2 programmer debug and reprogram a
Obstacles: Various details of the Splash-2 system contribute to obstacles which had to be overcome before convolution could be implemented. One of the foremost of these obstacles is the limited resources available on each FPGA. Although the Xilinx XC4010 chips used for the Splash-2 processing elements contain 400 combinational logic blocks [9], two eight-bit parallel multipliers [10] may take up too much of the chip to allow for the remaining control circuitry. The communication complexity involved in transferring the necessary data to the appropriate functional units is an important factor as well.
Another complication is the memory bandwidth. The Splash-2 system requires one clock cycle to turn the memory data bus around. This
Implementation
The Splash-2 application presented in this section can be used to perform two-dimensional convolution with an eight-by-eight square template. The final result was a configuration which allows two-dimensional convolution of a 512-by-512 image with many differently shaped filters, including 8-by-8, 16-by-4, 4-by-16, and 64-by-1 templates, at a rate of 15 frames each second. This section describes the many details of the convolution application in order to illustrate the design environment in which Splash-2 programmers must work.
The implementation takes advantage of the simplifications specified in Section 3. The system uses eighteen of the thirty-four processing elements spread across two concatenated Splash-2 boards. Since the design processes 15 frames each second while the video interface produces 30 frames per second, the first processing element in the design subsamples the incoming data stream by taking every other frame and passing it to the remaining elements at the rate of one pixel every two clock cycles. The next sixteen elements constitute the convolution pipeline. Configured in a linear array, these elements delay the pixel data coming from the previous element by a specified amount and perform the multiplication and summation necessary to calculate the appropriate portion of the final convolution result every two clock cycles. Each of the sixteen elements operates on a different four-by-one portion of the image. The last processing element in the system processes the output in order to make it visible on the display. This usually consists of merely shifting the result by some specified number of bits, although, if desired, this element can also take the absolute value of the result.
The Preprocessing Element
The first processing element in the design has the sole purpose of subsampling the incoming frames and generating an output of one pixel every other clock cycle. Since two pixels can fit into each 16-bit memory location, the frame subsampler stores eight incoming pixels into memory by executing four consecutive writes, performs a bus turnaround cycle, reads four outgoing pixels with two consecutive memory reads, and performs a final bus turnaround cycle. Thus, within eight clock cycles, eight incoming pixels have been stored in memory and four outgoing pixels have been read, keeping up the average of one output pixel every other clock cycle while acquiring one input pixel every clock cycle. When the end of the currently stored frame is reached, the frame subsampler ceases providing valid pixel data until a new frame arrives. A state diagram is shown in figure 5.
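The eight-cycle memory schedule described above can be tallied in a few lines. This is only a bookkeeping sketch: the schedule list and the helper `summarize` are illustrative names, and the input registers that hold pixels between writes are not modeled.

```python
# One eight-cycle pass of the frame subsampler's memory schedule.
SCHEDULE = ["write", "write", "write", "write", "turnaround",
            "read", "read", "turnaround"]
PIXELS_PER_WORD = 2  # two 8-bit pixels fit in each 16-bit memory word


def summarize(schedule):
    """Return (clock cycles, pixels stored, pixels fetched) for one pass."""
    cycles = len(schedule)
    stored = schedule.count("write") * PIXELS_PER_WORD
    fetched = schedule.count("read") * PIXELS_PER_WORD
    return cycles, stored, fetched


cycles, stored, fetched = summarize(SCHEDULE)
print(cycles, stored, fetched)  # 8 8 4
```

Eight pixels stored in eight cycles matches the input rate of one pixel per clock, and four pixels fetched matches the output rate of one pixel every other clock.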
The Main Convolution Processing Element
The next sixteen elements in the design perform the main computations in the convolution task. The main convolution elements consist of two major portions, as shown in figure 6. These two units are the delay-line unit and the partial-convolution unit. Each unit is driven by the same four-state clock counter. Since each of the main convolution elements may be in one of four states, all elements must be kept in lockstep by using synchronization signals. These signals are asserted every time a processing element is about to enter state zero, and each element forwards the synchronization signal in a systolic fashion. The first element in the linear array determines the appropriate clock cycle and forces the second element into this cycle, which then forces the third element, and so on until all elements are synchronized. This synchronization occurs during the first 16 clock cycles when the system begins to run, and the processing elements remain synchronized for the rest of the execution. In addition, the frame-subsampler element previously described must keep in synchrony with these elements. Since a frame may start some variable amount of time after the previous one, the subsampler element may delay its outgoing pixel stream in order to avoid frame jitter at the output. Although the main convolution element has a four-state cycle, a delay of at most one clock cycle is needed because the pixel rate is once every two clock periods. Thus, if the new input image arrives when the main convolution elements are in an odd state (states one or three), the first pixel will be delayed until the next clock cycle, at which time the convolution elements will be expecting a new input pixel.
The Delay-Line Unit: The delay-line unit of the main convolution element consists of control logic which stores incoming pixels into a logical circular buffer created in RAM and reads pixels from a previously used location in the buffer. Since the delayed pixels may be up to 16 scan lines above the current pixel, the circular buffer must be at least 16 × 512 / 2 = 4096 memory words in size. Given this amount of memory, the delay line accepts incoming pixels by performing a memory write to store two consecutive pixels into one 16-bit memory location at the head of the circular memory buffer. The element then allows one clock cycle in order to turn the memory bus around, as described in Section 4, and performs a memory read to acquire two consecutive delayed pixels from a memory location which is at a constant offset from the head of the buffer, modulo 4096. A final bus turnaround cycle is then performed, and the head of the circular buffer is incremented modulo 4096 to point to the next location in memory. Note that this procedure maintains the average rate of one input pixel and one output pixel every other clock cycle. This is the only portion of the element which requires the full four cycles of the state counter. All other portions effectively run on a two-cycle state counter by merging states zero and two and states one and three.
The Multiply Add Unit: Since pixel intensity values are no more accurate than eight bits, the filter coefficients need not be any larger. Therefore, each filter coefficient in the 8-by-8 template is represented in 8-bit signed-magnitude format [11]. The multiplications necessary to compute the result of the convolution are therefore 8-bit by 7-bit unsigned products, with the sign bits of the coefficients taken into account when the products are summed. In order to overcome resource limitations inside the partial-convolution portion of the processing element, multiplication units are reduced from computing the full 8-by-7 product to computing two 4-by-7 partial products which are then summed. Since the convolution result need only be calculated every two clock cycles, the same multiplier can be used to calculate both partial products.
Fig. 7. Timing comparisons between various systems performing 8-by-8 convolution (axis: frames per second). Note the logarithmic scale.
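The partial-product scheme of the multiply-add unit can be illustrated with integer arithmetic. This is a sketch under assumptions: it takes the 8-bit operand to be the pixel and the 7-bit operand to be the coefficient magnitude, and the function names `mul_8x7_split` and `signed_term` are hypothetical.

```python
def mul_8x7_split(pixel, magnitude):
    """8-bit by 7-bit unsigned product built from two 4-by-7 partial products."""
    assert 0 <= pixel < 256 and 0 <= magnitude < 128
    low_partial = (pixel & 0x0F) * magnitude   # low nibble of the pixel
    high_partial = (pixel >> 4) * magnitude    # high nibble of the pixel
    return (high_partial << 4) + low_partial   # weight the high nibble by 16


def signed_term(pixel, coeff_sm):
    """Apply an 8-bit signed-magnitude coefficient (bit 7 is the sign)."""
    sign = coeff_sm >> 7
    magnitude = coeff_sm & 0x7F
    product = mul_8x7_split(pixel, magnitude)
    return -product if sign else product


print(mul_8x7_split(200, 100))  # 20000
print(signed_term(10, 0x83))    # -30
```

Reusing one 4-by-7 multiplier for both partial products trades a clock cycle for logic area, which fits a design that only needs one result every two cycles.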
The Postprocessing Element
The final processing element in the design consists of a barrel shifter with an optional absolute-value functional unit. This element stores a shift value which determines what processing will be performed on the convolution result in order to make it available for display. The bottom four bits of the value indicate the number of bits by which to shift the convolution result, and the fifth bit indicates whether or not to take the absolute value of the resulting pixel data. The 22-bit result coming from the last convolution element is shifted to the right by the specified number of bits, and the least significant 8 bits are used as the resulting pixel value. If the absolute value function is used, an output of 0 shows on the display as the darkest black color, and an output of 255 (full scale) shows as the brightest white. Without the absolute value option, the element toggles the eighth bit of the result so that a value of 0 is displayed as medium gray, an output of -128 (full negative) is the darkest black, and an output of 127 (full positive) shows as the brightest white. This element performs no calculation related to the overflow of the result out of the eight-bit output.
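The postprocessing step can be modeled in a few lines. The sketch below is illustrative only: the function name `postprocess` and the 5-bit `control` encoding follow the description above, but applying the absolute value before the shift is an assumption, since the order of the two units is not stated. As in the hardware, overflow is not handled.

```python
def postprocess(result, control):
    """Map a signed convolution result to an 8-bit display value."""
    shift = control & 0x0F          # bottom four bits: right-shift amount
    use_abs = (control >> 4) & 1    # fifth bit: absolute-value option
    if use_abs:
        # Assumption: absolute value is taken before shifting.
        return (abs(result) >> shift) & 0xFF
    # Keep the low 8 bits, then toggle the eighth bit so 0 maps to mid-gray.
    return ((result >> shift) & 0xFF) ^ 0x80


print(postprocess(0, 0x00))      # 128 (medium gray)
print(postprocess(-128, 0x00))   # 0   (darkest black)
print(postprocess(127, 0x00))    # 255 (brightest white)
```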
The Resulting Con guration
The synthesized set of main processing elements can delay the incoming data by any even number of pixels up to 8190 and can then perform convolution on the four most recently delayed pixels. The delay-line circuitry adds a constant delay of five pixels and the convolution calculation adds another two-pixel delay, so the final result is the convolution of data delayed by some odd number of pixels between 7 and 8197, inclusive.
Rather than synthesizing sixteen different processing elements to handle the sixteen different delays, one general element was created which obtains its delay value from the host. During the vertical retrace of the video data, when no valid pixels are available, the host passes special control words to the Splash-2 system indicating either filter coefficients or delay values. When the first element receives such a control word, it updates either its coefficients or its delay and places the old value on the output, indicating to the next element the type of data to load. Thus, configuration data is shifted through the linear array, eventually programming all of the processing elements.
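The way configuration data ripples through the linear array can be sketched as a chain of exchanges. This is a deliberately simplified model: each element is reduced to a single stored value, systolic timing and the distinction between coefficient and delay words are ignored, and the function name `load_chain` is hypothetical.

```python
def load_chain(stored, incoming):
    """Shift control words through a linear array of elements.

    Each element keeps the word it receives and forwards its old value
    to the next element, so earlier words travel further down the chain.
    """
    for word in incoming:
        for k in range(len(stored)):
            stored[k], word = word, stored[k]
    return stored


# Four elements, initially holding 0; the host sends the value intended
# for the last element first.
elements = load_chain([0, 0, 0, 0], [40, 30, 20, 10])
print(elements)  # [10, 20, 30, 40]
```

Under this model the host must send values in reverse order of the elements, and loading n elements takes n control words per parameter.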
This also allows for more flexible filter configurations such as a sixteen-by-four filter or a four-by-sixteen filter. Indeed, the configuration can be any composition of sixteen four-by-one filters and, with more attached Splash-2 boards, the filter can expand to a composition of as many four-by-one filters as there are available processing elements, provided the delay conditions mentioned above are met. Modifications to the filter configuration may be performed as the system is operating, and the filter may be entirely reconfigured on the order of a tenth of a second. Figure 8 shows some of the filter configurations available as compositions of sixteen four-by-one filters. By substituting the raster-scan input with some one-dimensional input, the system may perform one-dimensional convolution as well.
Results
The final system configuration performs the desired two-dimensional convolution on a 512-by-512 image at the rate of fifteen frames every second. With a clock rate of 10 MHz and pixels arriving every other clock cycle, the number of computations is large. A result must be calculated for every pixel: a rate of 5 million results per second. Each result requires 64 multiplications, 64 additions, and 16 memory operations. This computes to a burst rate of 720 million operations per second. Due to edge effects, the top 7 lines of the resulting image as well as the first seven pixels in each line are invalid, so only a 505-by-505 portion of the resulting image corresponds to actual valid data. Also, summing 64 products only requires 63 additions. Neither of these calculations includes any additions, subtractions, increments, or shifts performed in the way of memory addressing of the circular buffer, next-state calculation, and final output adjustment. By comparison, a 90 MHz Pentium machine takes approximately 6.2 seconds to perform an identical task. This implies that the Splash-2 system provides a speedup factor of roughly 119 over a Pentium. Figure 7 shows relative speed comparisons in frames per second of six different computing platforms. The 66 MHz 486DX-2 machine and the Sparc-10 workstation take 14.8 and 6.2 seconds, respectively, to process a 512-by-512 image with an 8-by-8 template. A 50 MHz dual i860 processor board takes an estimated 353 milliseconds [12], and a custom-built VLSI chip reported in 1990 takes 16.7 milliseconds [13]. Figures 9 through 13 show a sample image and the results of processing this image with various types of filters possible with the final convolution design. Figure 9 is the original input. Figure 10 shows the results of smoothing via a low-pass filter. Figures 11 and 12 show the results from applying two popular edge-detection filters.
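The burst rate and the size of the valid output region follow directly from the figures quoted above. The short sketch below reproduces that arithmetic; the constant names are illustrative.

```python
CLOCK_HZ = 10_000_000
CYCLES_PER_PIXEL = 2
results_per_second = CLOCK_HZ // CYCLES_PER_PIXEL   # 5 million results/s

# Per result: 64 multiplications, 64 additions, 16 memory operations.
OPS_PER_RESULT = 64 + 64 + 16
burst_ops_per_second = results_per_second * OPS_PER_RESULT

# An 8-by-8 template invalidates the top 7 lines and first 7 pixels per line.
IMAGE_SIDE, TEMPLATE_SIDE = 512, 8
valid_side = IMAGE_SIDE - (TEMPLATE_SIDE - 1)

print(burst_ops_per_second)  # 720000000
print(valid_side)            # 505
```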
The image in figure 11 was calculated using a Laplacian-of-Gaussian function with a standard deviation of three, and the image in figure 12 is the result of a Sobel edge detector oriented diagonally. Figure 13 was created using a filter which attempts to simulate echoes resulting from multi-path reception of a television signal. While the previous filters were all 8-by-8 templates, this one takes advantage of the reconfigurable filter shape by using a 64-by-1 filter.
Summary
The high speed of the Splash-2 system, coupled with its reconfigurability, gives it the necessary qualities to be applied to the area of high-speed image processing. As shown, only minor modifications in the architecture are required in order to allow the Splash-2 system to acquire, process, and display live video images. The high reconfigurability of Splash-2 allows it to switch from one task to another within a second. Thus, different image-processing techniques may be tried without the expense of reconstructing special-purpose hardware. While this paper focused on only one application, convolution, many different image-processing tasks have been implemented on VTSplash. A brief summary of the performance of each of these is given in table 1.
Along with the many advantages of using Splash-2 for image processing, there are also drawbacks. The time required for synthesis of a design, the analog of "compile time" in general-purpose computers, can be several hours long. This can put a large delay in the debugging cycle unless the programmer makes extensive use of VHDL simulations. Unfortunately, some debugging issues simply cannot be resolved in a simulation. To get accurate timing values for propagation delays, the design must be synthesized. Another major disadvantage is the memory cycle. A simpler design which allows each processing element to control the read and write signals directly would have greatly increased memory bandwidth. Even a more complex design that contains a memory controller which processes read and write requests and presents results to the processing element at some later time would allow for back-to-back memory reads and writes. A third disadvantage is the limited resources available to each processing element, allowing only four 4-by-7 multipliers in each. Fortunately, FPGA technology is constantly improving, so it is just a matter of time before this problem is solved.
Despite these disadvantages, the high speed and high reconfigurability of the Splash-2 system allow it to be used for many signal processing tasks, not just in the area of image processing. As FPGA technology becomes denser and faster, custom computing machines will reach the flexibility and computational complexity required to rival typical general-purpose sequential machines. In order to accommodate those who will program such systems, significant enhancements must be made in the development tools, both in terms of execution time and ease of use.
which he received a BSEE and an MSEE. He is currently working on his PhD in the area of reconfigurable computer architectures and automated synthesis of their configurations. Dr. Athanas is his advisor.
Peter M. Athanas is an assistant professor in the Bradley Department of Electrical Engineering at Virginia Polytechnic Institute and State University. His research interests include reconfigurable computer architectures, logic synthesis, VLSI, and parallel processing.
Athanas received a BSEE from the University of Toledo, an MSEE from Rensselaer Polytechnic Institute, and an ScM in applied mathematics and a PhD in electrical engineering from Brown University. He is a member of the IEEE Computer Society.
Image created using a Sobel edge-detection filter oriented diagonally. Notice that the absolute value option on the postprocessing element is not invoked, so that a zero result is a medium gray color while positive and negative results are brighter and darker, respectively.
