Abstract-
INTRODUCTION
Manufacturers of digital signal processors (DSPs) design their products to accommodate the widest possible variety of applications. This market-driven philosophy dictates the general purpose nature of their design. With the high cost of chip real estate, designers can ill afford to implement application-specific operations that are needed only infrequently. General purpose DSPs tend to support the largest common factor in all algorithms, with no regard for specific processing needs. As a result of this tendency, DSPs have the largest required word widths, the most common memory addressing schemes and generic arithmetic operations. Yet, they have complex instruction sets and compilers, with a low possibility of using all the DSP resources efficiently for all parts of an application.
Image processing applications, for instance, do not require full 32-bit integer arithmetic, let alone floating point computations. Even for the same application, certain parts of the signal processing algorithm may require high bit resolution, while others require much lower bit resolutions, but are more demanding regarding the rate of the computations. In today's technology, the mismatch between data path bit resolution and application requirements can be made smaller with reprogrammable hardware circuits, such as field-programmable gate arrays (FPGAs). Indeed, it has been shown by Albaharna, Cheung and Clarke [1] that varying degrees of speed-up can be achieved through functional migration onto FPGAs, and that for a comparable size, such adaptive platforms can provide better performance than an additional general purpose processor. In fact, according to their review, FPGA-based platforms have been cited as providing speed-ups generally ranging between 10 and 100, while an additional processor can provide no better than twice the performance of a single general purpose processor.
Recently, a number of systems allowing on-the-fly reconfigurability of accelerators (coprocessors) have been introduced. The PLADO system developed by Athanas and Silverman [2] uses a reconfigurable platform to enhance the performance of general purpose microprocessors. The PLADO system is a prototype of a fully automated hardware-software codesign environment in which the software running on a particular microprocessor is analyzed to identify and extract time-critical algorithms. The functionality of these time-critical algorithms is then synthesized to a suitable architecture and mapped to a reconfigurable platform. Although the entire process may require a lot of time to complete, its undeniable advantage is that the user does not need to have any knowledge or experience in the field of hardware design. However, with their fully automated approach, there is no guarantee as to the size and performance of the synthesized accelerator.
A more hands-on approach to using FPGAs as reconfigurable accelerators was put forward by Lazarus and Meyer [3], to solve their real-time radar signal preprocessing needs. What these authors propose is to build a library of pre-optimized filter designs. These designs could be fine tuned once and for all to meet specific size and speed constraints, and then be recalled as often as need be as hardware subroutines. This approach allows the designer to choose the manner in which the required operations will be performed and the architecture that will best balance the need for speed and the limited resources available. However, their approach was mainly targeted towards preprocessing rather than coprocessing, i.e., the use of an FPGA alone.
Chan, Ngai and Ho [4] use FPGAs along with DSP microprocessors in a real-time image processing system. The DSP microprocessors are responsible for all high-level arithmetic operations, while the FPGAs are assigned lower bit resolution and byte-level operations, for which the DSP microprocessors are not well suited. The authors show, through the bit-level systolic implementation of a median filter, how the internal architecture of the Xilinx 3090 FPGAs is tailored to support fine-grain pipelined and systolic architectures.
Heeb and Pfister [5] also use FPGAs as reconfigurable accelerators to enhance the performance of a RISC processor in their system named Chameleon. Based on their implementation of a character segmentation algorithm, the authors suggest that algorithms with a high degree of fine-grain parallelism and locality are the best candidates for an acceleration using reconfigurable logic.
A key aspect of these systems is that the performance depends on the effect of the hardware/software partition on the utilization of the general purpose DSP and on the bandwidth of the bus between the general purpose DSP and the coprocessors [6]. This kind of trade-off is a central problem in hardware/software codesign [7].
With these criteria in mind, this work explores the performance and architectural trade-offs involved in the design of an FPGA-based 2-D convolution coprocessor for the TMS320C40 DSP microprocessor (C40) from Texas Instruments (TI). The 2-D convolution is commonly used to process images acquired by external sensors (guidance systems, surveillance systems or machine vision). Like many other applications in image processing, the 2-D convolution is extremely demanding in terms of real-time system performance. For instance, it may easily require more than 300 million multiplications and additions per second. Meeting such performance requirements exceeds the capabilities of most high speed real-time processors [8]. The TMS320C80 (C80) from TI is a notable exception, but this complex processor is not necessarily suitable for all real-time applications. Moreover, some applications have much higher throughput requirements. For instance, a 5x5 convolution with a 50 MHz data stream would require 2.5 giga multiply-accumulate operations per second.
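As a rough check of the 5x5 figure, the short sketch below assumes one convolution result per input pixel. A 5x5 kernel then needs 25 multiply-accumulates per pixel, and the 2.5 giga value is reproduced when each multiplication and each addition is counted as a separate operation. That counting convention, and the function name `mac_rate`, are assumptions made for illustration.

```python
def mac_rate(kernel_rows, kernel_cols, pixel_rate_hz):
    """Multiply-accumulates per second, assuming one result per input pixel."""
    return kernel_rows * kernel_cols * pixel_rate_hz


macs_per_second = mac_rate(5, 5, 50e6)   # 25 multiply-accumulates per pixel
ops_per_second = 2 * macs_per_second     # multiplications and additions counted separately

print(f"{macs_per_second:.3g} multiply-accumulates/s")
print(f"{ops_per_second:.3g} operations/s")
```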
From a hardware/software partitioning point of view, considering that fully automatic hardware/software system partitioning is difficult, and that this problem lacks a definitive solution [7], the use of a library of DSP-oriented hardware accelerators appears as a natural extension of the existing software signal processing libraries. Indeed, in order to make software applications simpler to write and easier to maintain, a software digital signal processing library that performs essential signal and image processing functions is an important part of every DSP developer's toolset [9]. In general, such a library provides high-level interface mechanisms, so developers only need to know how to use the algorithms, not the details of how they work. Complex signal transformations become a few function calls, e.g. C-callable functions. Considering the 2-D convolver function, which is very important in many DSP applications, this work proposes to replace the software execution of the function on the general purpose DSP by a hardware execution on an FPGA. In that sense, the C-callable function must be replaced by two software instructions that perform the following tasks: 1) initialization of the hardware accelerator by downloading its binary configuration data and 2) initialization of the communication protocol between the DSP and the coprocessor(s). Therefore, the 2-D convolver's design space exploration will provide guidelines for the development of a reconfigurable library of DSP-oriented hardware functions intended to enhance the performance of processors such as the C40. More precisely, depending on the design objectives, different elementary convolvers (basic functions of the library) are proposed, and we show how these sets can be used to compute arbitrary size 2-D convolutions.
Also, based on the specific convolver problem, we propose a series of general design trade-offs to make efficient use of the bandwidth between the general DSP and its coprocessors (i.e., operators supported in the library).
What emerges from our discussion so far is that floating point and high precision operations can be performed efficiently on a general purpose DSP, while low to medium (1-16) bit resolution operations can be implemented more efficiently on FPGA-based coprocessors. Multiple parallel operations on the FPGAs can produce significant speed-up to the execution of compute intensive signal processing algorithms.
However, good speed-up will only be achieved if the bandwidth requirement for feeding the accelerator can be satisfied. Thus, considering area, execution time and communication metrics, this work proposes a series of trade-offs for the design of 2-D convolvers.
The paper is organized as follows. In section 2, we present a typical implementation of a 2-D 3x3 convolver, designed as a coprocessor for the C40. This case study is essential to understand the trade-offs implied in a general 2-D convolver design. Then, based on this specific design, we present in section 3 three basic architectures that allow implementing 2-D convolvers of arbitrary size, which could be included in a library of DSP-oriented hardware accelerators. Section 4 proposes two major improvements of the basic architectures, while section 5 shows and compares some implementations. Finally, section 6 presents two architectures to exploit the full available bandwidth and section 7 concludes the paper.
COMPLETE 3X3 CONVOLVER
The TI C40 microprocessor dominates the DSP market with an over 50% market share [10] . This processor was designed to handle the most computation intensive applications. However, even with single- 
3x3 convolution implementation strategy
The 3x3 convolution of an image is defined by equation 1:
where $P'_{m,n}$ is the convolved pixel, $P_{m,n}$ is the image's actual pixel value, and $W_{i,j}$ is the convolution kernel weight. Equation 1 indicates that the 3x3 convolution $P'_{m,n}$ of each pixel $P_{m,n}$ requires knowledge of the values of its 8 immediate neighbors. Similar to the Cytocomputer machine [12], a strategy to extract windows of pixels from a single data stream has been adopted. Pixel values are fed line by line, from top to bottom, until 2 complete lines and the first 3 pixels of a third line are contained within a series of shift registers. At that point, all the pixels belonging to the first 3x3 convolution window are available inside the coprocessor. From that moment on, each new pixel value inserted into the chain of shift registers effectively displaces the convolution window to a new adjacent position until the whole image has been visited (see figure 1).
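The window extraction described above can be modelled in software with a single chain of 2N+3 registers for an image N pixels wide. The sketch below is a behavioural illustration only, not the coprocessor's implementation. Equation 1 itself is not reproduced here, so the code assumes the usual centered form in which weight `weights[i][j]` multiplies the pixel at row m-1+i and column n-1+j. The function name and the choice to skip windows that would wrap across line boundaries are also assumptions.

```python
from collections import deque


def stream_convolve_3x3(image, weights):
    """Yield (m, n, value) for interior pixels while reading the image in raster order."""
    height, width = len(image), len(image[0])
    chain_len = 2 * width + 3            # 2 complete lines plus 3 pixels of a third line
    chain = deque(maxlen=chain_len)      # oldest pixel falls out as a new one is shifted in
    for row in range(height):
        for col in range(width):
            chain.append(image[row][col])
            # The window is valid once the chain is full and does not wrap around a line end.
            if len(chain) < chain_len or col < 2:
                continue
            # chain[0] holds pixel (row-2, col-2); chain[-1] holds pixel (row, col).
            acc = 0
            for i in range(3):
                for j in range(3):
                    acc += weights[i][j] * chain[i * width + j]
            yield row - 1, col - 1, acc


image = [[r * 4 + c for c in range(4)] for r in range(4)]
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
print(list(stream_convolve_3x3(image, identity)))
```

With the identity kernel each result equals the pixel at the window center, which is a convenient way to check the register indexing.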
Evidently, storing 2 complete 1024 pixel lines within a chain of shift registers would be very expensive in an ASIC or FPGA based implementation. An alternative was to divide the entire 1024x1024 pixel image into several vertical bands and to treat these bands as narrow but complete images. A substantial saving in the number of shift registers required could be achieved this way. The problem with this scheme is that computing 3x3 convolutions on band borders requires access to pixel values belonging to adjacent bands. Hence, a certain amount of overlap must be allowed between bands. A number of pixel columns must then be transmitted more than once, thereby degrading the coprocessor's overall performance. However, we will show later on in section 4.2 that by dissecting a 1024x1024 pixel image into 16 vertical bands 68 pixels wide (4 pixel overlap per band), a saving of 1912 shift register stages could be attained while maintaining a 94% throughput. Although a 2 pixel overlap would have been sufficient, the choice of 4 pixels was made because the C40's communication port buffers accept data in the form of 32-bit words, which are then transmitted one byte at a time. It was considered less time consuming to simply transmit an additional 32-bit word, rather than having to break up and reassemble packed arrays of byte-size pixels, before sending them to the communication port buffers.
Furthermore, the 3x3 convolution kernel weights have been restricted to the values -4, -2, -1, 0, 1, 2 and 4. This set of values was chosen because it allowed several useful image enhancement filters to be implemented (average, Sobel, Prewitt, Laplace, etc.), especially in the area of edge detection [13], while reducing by half the multipliers' overall size and complexity (see later on Section 5). The control unit, composed of a finite state machine coupled with a counter and a comparator, keeps track of events and identifies each byte before it is read from the input FIFO according to a predetermined sequence.
3x3 Convolution coprocessor architecture
To compute a full 3x3 convolution in a single cycle, 9 multipliers were needed. With the supported kernel weights having been restricted to the aforementioned set of values, each multiplier's task could be reduced to single cycle shifts, resets to zero and two's complement conversions. A summation unit composed of carry-save adders in a Wallace-tree configuration tallies up the 9 pixel-weight products for each convolution window. The result produced by the summation unit is then fed to a saturation module which converts it back to an unsigned 8-bit value. A set of pipeline registers was inserted between the multipliers and the summation unit to preserve a 40 ns clock cycle.
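The reduction of each multiplier to a shift, a reset to zero or a two's complement negation can be illustrated with a small software model. This is a sketch under assumptions: the function names are hypothetical, and the saturation module is assumed to clamp the signed sum to the range 0 to 255, which the text does not state explicitly.

```python
ALLOWED_WEIGHTS = (-4, -2, -1, 0, 1, 2, 4)


def shift_multiply(pixel, weight):
    """Multiply an unsigned pixel by a restricted weight using only shift, zero and negate."""
    if weight not in ALLOWED_WEIGHTS:
        raise ValueError(f"unsupported weight {weight}")
    if weight == 0:
        return 0                                      # reset to zero
    shift = abs(weight).bit_length() - 1              # 1 -> 0, 2 -> 1, 4 -> 2
    magnitude = pixel << shift                        # shift replaces the multiplication
    return -magnitude if weight < 0 else magnitude    # negation models two's complement


def saturate_u8(value):
    """Clamp a signed sum to an unsigned 8-bit value (assumed saturation behaviour)."""
    return max(0, min(255, value))


def convolve_window(window, weights):
    """Sum the 9 pixel-weight products of one 3x3 window and saturate the result."""
    total = sum(
        shift_multiply(p, w)
        for p_row, w_row in zip(window, weights)
        for p, w in zip(p_row, w_row)
    )
    return saturate_u8(total)


sobel_x = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
edge = [[0, 0, 100], [0, 0, 100], [0, 0, 100]]
print(convolve_window(edge, sobel_x))
```

In this example the vertical edge gives a raw sum of 400, which the assumed saturation step clamps to 255.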
The output port stacks valid results in its FIFO and transmits them back to the C40's communication 
Implementation results
Every element in this design was obtained through synthesis of VHDL-1076 behavioral descriptions. The Mentor Graphics 8.2.5 design environment provided all the compilation, simulation and synthesis tools required for the tasks [14]. The synthesized netlist of the convolution coprocessor was initially mapped to a

With the experience gained from the design of the 3x3 convolver, we will see how an arbitrary-size 2-D convolver can be obtained. The object of this study is to find the most convenient way of including a generalized 2-D convolver in a library of reconfigurable hardware accelerators. As in the case of the previous 3x3 convolver, the convolvers presented in this section all have one point in common: the bandwidth they require is independent of the size of the convolution which is computed. Since a fixed bandwidth is a characteristic found in most systems, coprocessors destined to function within such an environment should be designed with this constraint in mind. In the following sections, image and convolution kernel dimensions will be defined according to figure 3. In section 3.1, a natural extension and generalization of the previously described 3x3 convolver's architecture is proposed,

Complete 2-D Convolver
Figure 4 illustrates a generalization of the previously described 3x3 convolver, which we call a complete RxS convolver. The architecture of this convolver includes mainly R-1 delay lines, RxS multipliers and an RxS term adder tree. What is interesting about this architecture is that a fixed bandwidth of 1 pixel/cycle is sufficient to maintain a steady state 1 cycle per convolution window processing rate, independently of the size of the convolution kernel. The price to pay for this independence of the bandwidth with respect to the size of the convolution kernel lies in the increased number of delay lines required by large convolution kernels. The cost of the delay lines is directly related to both the width of the processed image and the number of rows in the convolution kernel. We shall see later on in section 4 how these dependencies can be reduced.
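The dependence of the complete RxS convolver's cost on kernel size and image width can be made concrete with a small calculation. The sketch below assumes that the 3x3 storage rule from section 2 (2 complete lines plus the first 3 pixels of a third line) generalizes to R-1 complete lines plus S pixels. The function name and the returned field names are hypothetical.

```python
def complete_convolver_cost(kernel_rows, kernel_cols, image_width):
    """Rough resource count for a complete RxS convolver fed with 1 pixel per cycle."""
    delay_lines = kernel_rows - 1
    multipliers = kernel_rows * kernel_cols
    # Assumed generalization: (R-1) complete image lines plus S pixels of the last line.
    shift_register_stages = delay_lines * image_width + kernel_cols
    return {
        "delay_lines": delay_lines,
        "multipliers": multipliers,
        "adder_tree_terms": multipliers,
        "shift_register_stages": shift_register_stages,
    }


for size in (3, 5, 7):
    print(size, complete_convolver_cost(size, size, 1024))
```

For the 3x3 case on a 1024 pixel wide image this gives 2051 register stages, the figure from which the saving of 1912 stages quoted in section 2 follows.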

2-D grid of elementary convolvers
Although the architecture of the complete convolver is fairly regular, its overall complexity is proportional to the 2-dimensional size (RxS) of its convolution kernel. To reduce this dependency, Landeta and Malinowski [8] have shown how arbitrary-size convolutions can be computed using a grid of low complexity elementary convolvers. These elementary convolvers are similar to complete convolvers, but have one extra input and one extra output. Besides having an input for the image's pixels, an elementary convolver can also receive partial results through its input $R_i$. This partial result can then be added to the elementary convolver's own computation, and transmitted to its result output $R_o$ for further processing by another elementary convolver. The other output, $P_o$, enables the flow of pixels to pass through the delay lines and be fed to other elementary convolvers. Figure 5 shows the block diagram of a 3x3 elementary convolver. In the 3x3 elementary convolver, the pixels fed through the input $P_i$ emerge at its output $P_o$ after a delay equal to 2 complete rows of the processed image. The technique shown in figure 6 can be generalized to compute RxS convolutions using an axb grid of R'xS' elementary convolvers, where:

$$a = \begin{cases} \dfrac{R-1}{R'-1} & \text{if } (R-1) \bmod (R'-1) = 0 \\ \left\lfloor \dfrac{R-1}{R'-1} \right\rfloor + 1 & \text{otherwise} \end{cases} \quad \text{with } R' > 1,$$

$$b = \begin{cases} \dfrac{S}{S'} & \text{if } S \bmod S' = 0 \\ \left\lfloor \dfrac{S}{S'} \right\rfloor + 1 & \text{otherwise} \end{cases} \quad \text{with } S' > 0.$$

As in the case of figure 6, the elementary convolvers in the first column of the axb grid are assigned the S' first columns of the RxS convolution kernel. Every elementary convolver, except those on the first row of the axb grid, receives its pixels from the output $P_o$ of the convolver above it. However, since an R'xS' elementary convolver possesses only R'-1 delay lines, the portions of the RxS convolution kernel processed by consecutive elementary convolvers on a same column are not mutually exclusive, but overlap on one row of kernel weights. To eliminate this overlap, the first row of weights of every elementary convolver, except those on the first row of the axb grid, needs to be canceled out by making it equal to zero (see convolvers 3 and 4 in figure 6). Furthermore, if (R-1) is not a multiple of (R'-1), the last [(R'-1) - ((R-1) modulo (R'-1))] rows of weights in the elementary convolvers situated on the last row of the axb grid also need to be set to zero.
Between each elementary convolver on the first row of the axb grid, a delay of S' shift registers must be inserted to allow each column of convolvers to be assigned different parts of the RxS kernel. If S is not a multiple of S', the last [S'-(S modulo S')] columns of weights of the elementary convolvers situated on the last column of the axb grid also need to be set to zero (see convolvers 2 and 4 in figure 6 ).
A more detailed example is presented in appendix 1. It illustrates the pixel flow through a 2-D grid of elementary convolvers, more precisely the flow from an 8x8 pixel size image through a grid of 3x3 elementary convolvers to perform a 5x5 convolution.
This technique reduces the design of arbitrary sized 2-D convolvers to the assembly of an appropriate number of copies of an elementary convolver. This is attractive since one could support all convolution sizes with a single design of an acceleration module replicated as often as necessary. However, it does require more resources than a complete convolver. This is easily seen in figure 6 where four 3x3 elementary convolvers, containing a total of 36 multipliers, are used to compute a 5x5 convolution which requires only 25 multipliers. Also, the four 3x3 elementary convolvers include a total of 8 delay lines, while a complete convolver needs only 4 of them. The ease of design offered by the dissection of a large convolution kernel into smaller size kernels is thus obtained at the price of a larger overall complexity.
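The grid dimensioning and the resource comparison above can be reproduced with a few lines of arithmetic. The sketch below is an illustration under assumptions: it takes the grid height as the number of elementary convolvers needed when consecutive convolvers in a column overlap on one row of weights, and the grid width as the number of S'-wide column groups needed to cover S columns. The function name `grid_plan` and its field names are hypothetical.

```python
def grid_plan(R, S, R_elem, S_elem):
    """Size an a x b grid of R_elem x S_elem elementary convolvers for an RxS kernel."""
    if R_elem < 2 or S_elem < 1 or R < 2 or S < 1:
        raise ValueError("need R >= 2, S >= 1, R_elem >= 2 and S_elem >= 1")
    a = -(-(R - 1) // (R_elem - 1))          # ceiling division: rows of the grid
    b = -(-S // S_elem)                      # ceiling division: columns of the grid
    rows_covered = a * (R_elem - 1) + 1      # consecutive convolvers share one row
    return {
        "a": a,
        "b": b,
        "multipliers": a * b * R_elem * S_elem,
        "delay_lines": a * b * (R_elem - 1),
        "unused_bottom_rows": rows_covered - R,   # weight rows to set to zero
        "unused_right_columns": b * S_elem - S,   # weight columns to set to zero
    }


plan = grid_plan(5, 5, 3, 3)
print(plan)
print("complete 5x5 convolver:", {"multipliers": 5 * 5, "delay_lines": 5 - 1})
```

For the 5x5 case built from 3x3 elementary convolvers this gives a 2x2 grid with 36 multipliers and 8 delay lines, against 25 multipliers and 4 delay lines for the complete convolver, matching the comparison in the text.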
1-D convolution modules
Dividing up a large convolution kernel into smaller, more manageable subarrays is an attractive idea provided it remains efficient in its use of hardware resources. Therefore, another possibility to improve efficiency is to isolate each row in the convolution kernel and treat it as a 1-D convolution. The idea here is that it is easier to scale 1-D convolution modules than complete 2-D convolvers. Then, by linking together as many 1-D modules as there are rows in the desired 2-D convolution, a 2-D convolver can be formed. Figure 7 shows the internal architecture of a 1-D convolution module and the manner in which they can be connected to form an RxS convolver operating on an MxN pixel sized image. In figure 7 , R=5 and S=5, but it is obvious that the 1-D modules can be scaled to any width, and any number of these modules can be figure 7 can be designed without a delay line, since it is of no use). 
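The idea of building a 2-D convolution from chained 1-D modules can be checked with a functional model. The sketch below is not cycle accurate and ignores the delay lines. It only shows that passing a partial result from one row module to the next yields the same value as a direct 2-D sum. The function names are hypothetical, and the same unflipped weight indexing is assumed for every row.

```python
def conv1d_module(pixels, weight_row, partial_in=0):
    """One 1-D module: S pixel-weight products added to the incoming partial result."""
    return partial_in + sum(p * w for p, w in zip(pixels, weight_row))


def convolve_with_1d_chain(image, kernel):
    """Compute every full RxS window by chaining R one-dimensional modules."""
    R, S = len(kernel), len(kernel[0])
    M, N = len(image), len(image[0])
    result = []
    for m in range(M - R + 1):
        out_row = []
        for n in range(N - S + 1):
            partial = 0
            for r in range(R):   # module r handles kernel row r, then forwards its sum
                partial = conv1d_module(image[m + r][n:n + S], kernel[r], partial)
            out_row.append(partial)
        result.append(out_row)
    return result


image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
kernel = [[1, 0], [0, 1]]
print(convolve_with_1d_chain(image, kernel))
```

Changing the kernel height only changes the number of chained modules, which is the scalability argument made for this architecture.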
Comparisons between architectures and basic components of the library
For a given size convolution, the complete 2-D convolver, the 2-D grid of elementary convolvers and the chain of 1-D convolution modules can all reach a steady state of one result per cycle processing rate, provided sufficient pipelining is included.
In terms of hardware resources, the amount required to implement an RxS convolution is always greater in the case of a 2-D grid of elementary convolvers compared to the 1-D modules approach. In fact, every column in a 2-D grid contributes a number of delay lines greater than or equal to that of a complete convolver implemented using 1-D modules. However, the width (S) of the 1-D modules either needs to be adjusted to the particular required convolution's width, or designed to be able to support the widest foreseen convolution kernel. In the latter case, the extra pixel*weight products can be canceled by setting the appropriate kernel weights to zero. Compared to a complete RxS convolver, however, the use of 1-D modules requires slightly more hardware resources due to the fact that partial convolution results need to be propagated and included as an extra term in the addition scheme of each 1-D module. Indeed, the use of 1-D convolution modules involves splitting an adder tree into several smaller adder trees. For instance, a complete 5x5 convolver using a single Wallace-tree to add up 25 pixel*weight products needs 23 carry-save adders and one fast-adder. To compute a 5x5 convolution using 5 convolution modules requires five 6-term adder trees, each one composed of 4 carry-save adders and one fast-adder. The total of 20 carry-save adders and 5 fast-adders requires more hardware than do 23 carry-save adders and one fast-adder (a fast-adder is more complex than a carry-save adder). A chain of 1-D convolution modules thus requires a slightly larger amount of hardware resources than a complete 2-D convolver, but compensates for this drawback by making scalability more manageable.
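The adder counts quoted above follow from the fact that each carry-save adder reduces the number of operands by one, so reducing n terms to 2 takes n-2 carry-save adders before the final fast-adder. The sketch below reproduces the 5x5 comparison. The function names are hypothetical, and every 1-D module is assumed to include the extra partial-result term, as in the text.

```python
def adder_tree_cost(terms):
    """Return (carry_save_adders, fast_adders) for a tree summing `terms` operands."""
    if terms < 2:
        raise ValueError("an adder tree needs at least two terms")
    return terms - 2, 1


def chain_adder_cost(kernel_rows, kernel_cols):
    """Adder cost of a chain of 1-D modules, each summing S products plus one partial result."""
    csa, fast = adder_tree_cost(kernel_cols + 1)
    return kernel_rows * csa, kernel_rows * fast


print("complete 5x5:", adder_tree_cost(5 * 5))
print("five 1-D modules:", chain_adder_cost(5, 5))
```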
To support arbitrary size 2-D convolutions, the problem is now to select a minimal set of basic components to include in the library, considering the trade-offs of the three previous architectures. Suppose that a library must support any 2-D convolution within a fixed range n, from RxS up to (R+n)x(S+n). We consider three classes of applications: 1) small range, for example n ≤ 5, 2) medium range, 5 < n ≤ 15, and 3) large range, n > 15.
In 1), the basic components that compose a library can simply contain n complete convolvers or a 1-D module of width R+5. The configuration of the 1-D module can be recalled in n copies to fit any convolution of the given range. As mentioned previously, the extra pixel*weight products can be canceled by setting the appropriate weights to zero. Note that, if the dimension of a given convolution kernel is known to be frequently used, using 1-D convolution modules will not be simplified since these modules will always need to be adapted to the kernel's height (R).
The next section presents two important improvement strategies to reduce the complexity of our three basic architectures.
COMPLEXITY REDUCTION STRATEGIES
The values contained in Table 2 , which expand in more detail the previous results in Table 1 
Multiplexed 1-D convolution modules
All the strategies presented up to now can reach a 1 convolution window per cycle processing rate in steady state. Yet, they all require at least one multiplier per convolution kernel weight, which can lead to a very high complexity, especially for large convolution kernels. If the amount of hardware resources required by these strategies is a limiting factor in their implementation, alternative strategies must be found.

through the $R_i$ input. Consequently, the flow of pixels needs to be shifted only every 2 cycles. In steady state, a convolution window can therefore be processed every 2 clock cycles using this strategy. The multiplexing ratio can be increased to an arbitrary value. Each (Q:1) multiplexed 1-D convolution module must then contain Q delay lines, Q sets of S shift registers for both pixels and kernel weights, and one set of S multipliers. A processing rate of 1 convolution window per Q cycles can be reached in this manner. Basically, reducing by a factor Q the number of multipliers, required for a fully parallel processing of an RxS convolution, imposes a limit of 1 convolution window per Q cycles to the processing rate which can be achieved.

Figure 9 illustrates a variation of the multiplexing of 1-D convolution modules strategy. This variation involves interlacing the pixels of 2 consecutive lines so that they will naturally be fed to the input of the multipliers in alternation, without the need for multiplexers. With this scheme, the flow of pixels is shifted every clock cycle, but a valid result is still only produced every 2 clock cycles. Similarly to the multiplexing strategy, the ratio of interlacing can be increased to an arbitrary value, with the same effects on its processing rate.

Figure 9: Architecture of (2:1) interlaced convolution module and assembly of R/2 modules to form an RxS convolver with R=8 and S=8.
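The trade-off between multiplier count and processing rate for (Q:1) multiplexed or interlaced modules can be summarized numerically. The sketch below only encodes the counts stated in this section (one set of S multipliers and Q delay lines per module, one window every Q cycles). The function name is hypothetical, and rounding the number of modules up when R is not a multiple of Q is an assumption.

```python
def multiplexed_cost(R, S, Q):
    """Resources and rate of an RxS convolver built from (Q:1) multiplexed 1-D modules."""
    if Q < 1:
        raise ValueError("Q must be at least 1")
    modules = -(-R // Q)                  # ceiling division (assumed when Q does not divide R)
    return {
        "modules": modules,
        "multipliers": modules * S,       # one set of S multipliers per module
        "delay_lines": modules * Q,       # Q delay lines per module
        "cycles_per_window": Q,
    }


for q in (1, 2, 4):
    print(q, multiplexed_cost(8, 8, q))
```

For R=8 and S=8, a ratio of 2:1 halves the multipliers from 64 to 32 using R/2 = 4 modules, at the price of one convolution window every 2 cycles.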
Multiple vertical band processing
As mentioned earlier, besides the multipliers, the most important contribution to the cost of a convolution engine lies in its delay lines. One possible way of reducing this cost is to slice an image into vertical bands, and to process the image one band at a time. Table 3 and figure 10 show the effects of multiple vertical band processing of a 1024x1024 pixel image on the hardware requirements and performance of a complete 3x3 convolver. The computation cycles presented in table 3 are based on a 1 cycle per convolution window processing rate, and a 4 pixel overlap between adjacent bands. According to table 3, a reduction by a factor of 100 in the number of shift registers required in the design of a complete 3x3 convolver can be obtained by choosing to process a 1024x1024 pixel image as 256 vertical bands.
Moreover, this 256 vertical band processing reduces the overall performance of a 3x3 convolver by a factor of only 2. Between 1 and 256 vertical bands, a whole range of choices is available to trade off performance and complexity. Figure 10 shows that 16 bands provide an interesting balance, with only 6% performance degradation and a complexity reduction factor of 14.8 in the number of registers. This corresponds to the design parameters proposed in section 2 for a library module devoted to accelerating convolutions.
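The band trade-off can be explored with a short calculation. The sketch below assumes that each band carries its share of new columns plus a 4 pixel overlap, that the register count of a 3x3 convolver is 2 band lines plus 3 pixels, and that throughput is the ratio of useful to transmitted pixels. It reproduces the figures quoted in the text. Table 3 may use a slightly different accounting, and the function name is hypothetical.

```python
def band_tradeoff(image_width, bands, overlap=4, kernel_rows=3, kernel_cols=3):
    """Register count and throughput when an image is processed as vertical bands."""
    if image_width % bands:
        raise ValueError("image width must be a multiple of the number of bands")
    used_overlap = overlap if bands > 1 else 0       # a single band needs no overlap
    band_width = image_width // bands + used_overlap
    stages = (kernel_rows - 1) * band_width + kernel_cols
    full_stages = (kernel_rows - 1) * image_width + kernel_cols
    return {
        "band_width": band_width,
        "stages": stages,
        "stages_saved": full_stages - stages,
        "reduction_factor": full_stages / stages,
        "throughput": image_width / (bands * band_width),
    }


for bands in (1, 16, 256):
    print(bands, band_tradeoff(1024, bands))
```

With 16 bands the model gives 68 pixel wide bands, 139 register stages (1912 fewer than the 2051 of the unbanded design), a reduction factor of about 14.8 and a throughput of about 94%. With 256 bands the register count drops by a factor of roughly 100 while throughput is halved.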
COMPLEXITY ESTIMATION AND COMPARISON OF VARIOUS 2-D CONVOLVERS
It is worth noting that in the cases where multiple vertical band processing is used, growing widths of pixel overlap are necessary to satisfy increasingly large convolutions. Table 5 gives the number of vertical bands and their widths, including overlap, required to achieve a 2 cycles per convolution window processing rate. The overlap widths have been chosen as multiples of four pixels, as in the case of the 3x3 convolver described in section 2. Increasing the overlap by multiples of 4 follows from the fact that, with 32-bit microprocessors, it is more efficient to transmit byte-size pixels in packed groups of four to a convolution coprocessor rather than having to unpack and rearrange them in software.
MULTIPLE DATAFLOW CONVOLVER TO MAKE USE OF THE ENTIRE AVAILABLE BANDWIDTH
All the 2-D convolver architectures and strategies presented up to now never required more than 1 data pixel per cycle. Delay lines enabled arbitrary size convolution windows to be assembled inside the convolver from a single data flow. However, when more than 1 pixel per cycle can be transmitted to a 2-D convolver, a number of delay lines, if not all, may be eliminated. With fewer delay lines, however, the pixels can no longer be fed to the convolver in a raster scan line format; they must be transmitted as alternating groups from consecutive rows of the image to be processed. To operate at that maximum processing rate, the stacks should be implemented either as asynchronous FIFOs of a sufficient speed and depth, or as a very carefully timed shift register array. Figure 12 shows a multiple dataflow complete convolver using a 32-bit bus. The bus can completely replace the delay lines inside a 3x3 convolver. Instead of being tied to delay lines, the convolution window pixel registers (P1, P2 and P3) receive the pixels belonging to consecutive rows of the original image through 3 stacks (F1, F2 and F3), which transform the 32-bit bus data into a chain of byte-size pixels.
During 3 out of 4 cycles, groups of 4 pixels belonging to 3 consecutive rows are transmitted in rotating order to the input/output port's stacks F1, F2 and F3. The pixels contained in these stacks are then gradually shifted out to the groups of 3 shift registers P1, P2 and P3 from which a new 3x3 convolution window can be sampled at each cycle. Every fourth cycle, a group of four results can be extracted from the output stack R and placed on the input/output bus.
To compute a single-cycle 3x3 convolution, one new pixel per row is needed at every cycle. The total of 3 pixels transferred and 1 result produced means that a bandwidth of 4 bytes per cycle is needed. In the case of the 3 data flow 3x3 convolver of figure 12 , a 32-bit bus is sufficient to transfer both pixels and convolution results to and from the convolver, provided that reads and writes are single-cycle operations.
Multiple data flow convolvers require a greater bandwidth than the single data flow, fixed bandwidth convolvers presented in section 2. For a multiple data flow convolver to be efficient, one DMA channel should be reserved for each distinct data flow, namely, one for each row in the convolution kernel and another one to retrieve the results.

The pixels of the image are transferred to the 5x5 convolver of figure 13 in a raster scan line format through stack F3 to the 2 delay lines, until the first pixel reaches the last of the 5 shift registers designated P5. From that moment on, stacks F3, F2 and F1 can be fed in rotating order with the pixels from the next three rows of the image. In other words, once the 2 delay lines of the 5x5 convolver are filled, it can be fed in the same manner as the 3 data flow 3x3 convolver described previously. Once the groups of 5 shift registers P3, P2 and P1 are filled, an entire 5x5 convolution can be computed, and the 32-bit bus can thereafter maintain a 5x5 convolution window per cycle processing rate. This technique can be generalized to a multiple data flow RxS convolver supplied by an insufficient bandwidth. Indeed, any lack of bandwidth can be compensated by an appropriate number of delay lines.
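The way delay lines compensate for missing bandwidth can be expressed as a simple budget. The sketch below assumes byte-size pixels and results, one result returned per cycle, and one bus byte per cycle for each kernel row fed directly. Rows that cannot be fed directly are then served by delay lines. The function name and parameters are hypothetical.

```python
def dataflow_plan(kernel_rows, bus_bytes_per_cycle, result_bytes=1, pixel_bytes=1):
    """Split kernel rows between direct data flows and delay lines for a given bus width."""
    pixel_budget = bus_bytes_per_cycle - result_bytes   # bandwidth left after returning results
    flows = min(kernel_rows, pixel_budget // pixel_bytes)
    if flows < 1:
        raise ValueError("bus too narrow to carry one pixel and one result per cycle")
    return {"data_flows": flows, "delay_lines": kernel_rows - flows}


print("3x3 on a 32-bit bus:", dataflow_plan(3, 4))
print("5x5 on a 32-bit bus:", dataflow_plan(5, 4))
print("5x5 on a 16-bit bus:", dataflow_plan(5, 2))
```

Under these assumptions a 32-bit bus feeds all three rows of a 3x3 convolver with no delay lines, while a 5x5 convolver on the same bus needs 3 data flows and 2 delay lines, as in the two examples described in the text.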
Of course, instead of using excess bandwidth to reduce the number of delay lines in a 2-D convolver, this bandwidth could be used to increase its throughput. With sufficient hardware resources and bandwidth, more than a single 2-D convolution window can be processed at each cycle. A speed-area tradeoff can therefore be considered according to the amount of available bandwidth [17] .
CONCLUSION
With the performance and size of field programmable devices growing steadily, adding a virtual hardware platform to complement a general purpose processor offers the potential for increased performance and flexibility. In this article, we have presented several 2-D convolution designs intended to become part of a library of reconfigurable accelerators. Various performance and complexity trade-offs were illustrated to show different ways of implementing large size convolvers on a limited size reconfigurable platform. Particular attention was directed at fixed bandwidth convolvers. Since a system's available bandwidth is usually a set parameter, it made sense to concentrate on developing scalable 2-D convolvers which do not require a bandwidth proportional to their size.
In the future, functions such as pattern matching, FFT, FIR filter and IIR filter could be developed and included in the library of reconfigurable accelerators. Also, note that the proposed concept is not limited to a particular processor and it could thus be adapted to other environments as technology evolves.
ACKNOWLEDGMENTS
