Abstract-A configurable architecture for performing image transform algorithms is presented that provides a better tradeoff between low complexity and algorithm flexibility than either software-programmable processors or dedicated ASIC's. The configurable processor unit requires only 110 K transistors and can execute several image transform algorithms. By emulating the signal flow of the algorithms in hardware, rather than software, complexity is reduced by an order of magnitude compared with current software programmable video signal processors, while providing more flexibility than single function ASIC's. The processor has been fabricated in 1.2-μm CMOS and has been successfully used to execute the discrete cosine transform/inverse discrete cosine transform (DCT/IDCT), subband coding, vector quantization, and two-dimensional filtering algorithms at pixel rates up to 25 MPixels/s.
Index Terms-Digital signal processors, discrete cosine transform, image coding, vector quantization, video signal processing.
I. INTRODUCTION
EXISTING video compression processors are based on the JPEG [1], MPEG [2], or H.261 [3] standards. These standards have been developed for stored image, stored motion video, and wireline videoconferencing applications, respectively, and are the motivation for the integrated circuits described in [4]-[8]. In each case, the coding of individual pictures is done on a block-by-block basis using the 8 × 8 discrete cosine transform (DCT). While alternative algorithms to the 8 × 8 DCT, such as subband coding [9], had been considered during the standards process, coding schemes based on the 8 × 8 DCT have been judged to provide the most attractive tradeoff between image quality and implementation complexity. However, emerging standards such as MPEG-4 are less restrictive in the choice of algorithm [10] and provide motivation for the design of processors that possess both algorithm flexibility and low implementation complexity. In this paper we apply configurable architecture design techniques to the design of a processor unit for several image transform algorithms.
Manuscript received October 1, 1996; revised April 21, 1997. This work was supported by ARPA/CSTO under Contract J-FBI-93-112, Hughes Research Labs, Texas Instruments, and the State of California MICRO program.
The authors are with Integrated Circuits and Systems Laboratory, Electrical Engineering Department, University of California, Los Angeles, Los Angeles, CA 90024 USA.
Publisher Item Identifier S 0018-9200(98)00373-4.
Conventional architecture approaches fail to meet this requirement for algorithm flexibility and low complexity. Dedicated ASIC's can be highly optimized to execute a given coding algorithm in real-time with very low complexity, but they lack the flexibility to perform the range of required algorithms [11]-[13]. Software programmable video signal processors have been proposed that provide a very high flexibility, but at the expense of a lower throughput and very high complexity [14], [15]. In this paper we address the design of signal processors that provide a solution in the "middle ground" between dedicated ASIC's and software programmable video signal processors. Rather than optimizing the architecture for a single algorithm as in the case of a dedicated ASIC, or for all algorithms as in the case of a programmable video signal processor, the proposed processor is optimized for a class of algorithms. It achieves the desired tradeoff between low complexity and algorithm flexibility by emulating the signal flow of these algorithms through hardware configurability instead of software programmability.
Configurable array architectures for digital signal processors (DSP's) have been presented before [16] which balance flexibility with improved throughput. However, these have been aimed at rapid prototyping and require a large number of generic array elements for real-time video compression, resulting in too high a complexity for custom products. In contrast, the proposed configurable video signal processor achieves low complexity by using configurable processing elements, I/O ports, and memories that are optimized for image transform algorithms.
Programmable video signal processors have a very high complexity due to two related factors: 1) the complicated control logic for addressing memories and sequencing operations and 2) the large on-chip memories. For example, the processor in [15] has over 100 000 transistors dedicated to instruction storage and over 300 000 transistors dedicated to data storage. These are an overhead and are an artifact of how the processor architecture is designed, not the algorithm requirements.
The only hard lower bound on circuit complexity is the computational requirements of the algorithm. For example, if an algorithm requires 400 million multiplies per second and the maximum clock rate of a multiplier is 50 MHz, then the application requires that at least eight multipliers be placed on chip. To approach this lower bound on the complexity of the architecture, the proposed processor is partitioned to allow: 1) optimization of the computational datapath by maximizing the hardware sharing while meeting the computational hard limit, 2) addition of the minimum required flexibility in the signal flow to support each algorithm, and 3) minimization of the control logic required to configure the signal flow for each algorithm. This partitioning results in the architecture model shown in Fig. 1, which consists of three subsystems.
1) A signal flow network (defined in Section II) that configures the flow of the input and output data through the computation datapath. It consists of I/O ports, data memories, and multiplexers.
2) A parallel computational datapath (defined in Section III), consisting of four computational units (the sum-of-products, accumulator, offset subtracter, and minimizer) that are required to execute operations in many different image transform algorithms.
3) Finite state machine (FSM) controllers (defined in Section IV), which provide three types of signals: a) processor clock signals at multiples of the input pixel rate, supplied to pipeline registers in the computational datapath as well as the memory and I/O circuits, b) multiplexer select signals that configure the processor for a given algorithm, and c) addressing sequences for the on-chip memories. This is a hardwired control strategy where the control signals for each state are fixed, removing the need for the continuous fetch, decode, and execution of instructions on each processor clock cycle.
Methods to minimize the signal flow network complexity and control logic requirements are discussed in Section II. These result in a maximum on-chip memory requirement of only 4 Kb, which is at least an order of magnitude lower than required in conventional programmable processors. Minimization of the computational logic is performed by maximizing the sharing of the computational datapath between different algorithms. Methods to define a shared computational pipeline based on the commonality in image transform algorithms are discussed in Section III. Section IV describes the design of control logic to configure the signal flow network for a desired algorithm. In particular, controllers are presented for execution of the 8 × 8 DCT, subband coding, and vector quantization. It is shown that the controller for each algorithm can be implemented with a logic complexity of less than 3000 gates. The processor chip consisting of the signal flow network and computation datapath has been designed and fabricated as described in Section V. For test purposes, the configuration controller for the processor has been implemented in a field-programmable gate array (FPGA), and the test results are presented in Section VI.
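The lower-bound argument made earlier in this Introduction (400 million multiplies per second against a 50-MHz multiplier) amounts to a ceiling division. The sketch below is only an illustration of that arithmetic; the function name `min_parallel_units` is hypothetical, and it assumes each unit completes one operation per clock cycle.

```python
def min_parallel_units(ops_per_second: int, unit_clock_hz: int) -> int:
    """Fewest units that meet the required rate.

    Assumption (not from the paper): each unit completes exactly one
    operation per clock cycle.
    """
    # Integer ceiling division avoids floating-point rounding.
    return -(-ops_per_second // unit_clock_hz)


# Figures quoted in the text: 400 million multiplies/s, 50-MHz multiplier.
print(min_parallel_units(400_000_000, 50_000_000))  # 8
```

Any requirement even slightly above 400 million multiplies per second would push the bound to nine units, which is why the bound is stated as "at least eight."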
II. SIGNAL FLOW NETWORK DESIGN
Both the control logic complexity and the large on-chip memories of programmable video signal processors are a direct consequence of the fact that programmable architectures must emulate the signal flow of an algorithm using software control of a data memory. For example, conventional DSP's have many addressing modes including modulo-mode and bit-reversed addressing. The former is used to emulate the signal flow through tapped delay lines in a finite impulse response (FIR) filter, while the latter is used to emulate the signal flow in the fast Fourier transform (FFT) or DCT. To emulate a specific type of signal flow, the programmable architecture must set aside a portion of the data memory and address this section of memory by calling instructions specific to the desired addressing mode.
In contrast, dedicated ASIC architectures reduce complexity by hardwiring the signal flow, resulting in little or no control or on-chip memory. However, the dedicated signal flow allows no flexibility in the function performed by the ASIC. To implement multiple functions, multiple ASIC's must be used at the cost of a higher overall system complexity. To meet the range of flexibility required in video compression applications and reduce the processor complexity, a tradeoff must be made between hardwired signal flow datapaths and software-controlled signal flow emulation in memory. The proposed configurable architecture implements optimized I/O, memory, and computational units that are hardwired for a high throughput and low control overhead but can be reconfigured to match the signal flow of different algorithms. This reconfiguration is performed with the addition of simple multiplexers to the hardwired I/O and memory units. The absence of memory-based software-emulation of signal flow reduces the on-chip memory dramatically, requiring a total of only 4 Kb of on-chip memory versus the 60-100 Kb common in programmable video signal processors. The memory and I/O units are described below.
One class of useful image transform algorithms includes the 8 × 8 discrete cosine transform, wavelet/subband coding, and vector quantization. The signal flow for these three algorithms shares several common characteristics that are exploited to optimize the memory and I/O.
A. Two On-Chip Line Memories
Each image transform algorithm being considered here has well-defined coefficient and pixel operands. The coefficient operand (i.e., a filter coefficient for subband coding, basis vector for DCT, or source pixel block for vector quantization) is either static or changes at a lower rate than the pixel operand. This characteristic can be exploited by asymmetrically defining the memory access capabilities of the coefficient and pixel operands. Many conventional programmable DSP's have a modified Harvard architecture using two data memories, but both memories have identical access capabilities. By classifying one memory as a pixel memory and one as a coefficient memory, the memory architecture and control can be optimized to meet the individual requirements of each type of operand. Furthermore, the use of a line memory architecture allows the accessing of multiple pixel and coefficient values by the parallel computational datapath in a single cycle.
Most video compression algorithms operate on blocks (DCT and vector quantization) or windows (subband coding) of pixels. Therefore, the I/O circuits should be designed to minimize the on-chip memory capacity to a few multiples of the basic block size. With a basic block size of 8 × 8, the coefficient memory requires only two blocks of storage to allow the use and update of coefficients in a ping-pong fashion. A capacity of three blocks is sufficient for the pixel memory: two blocks to allow the use and update of pixel values in a ping-pong fashion, and a third block to store the intermediate results for algorithms such as the DCT. Together, these optimizations reduce the total memory overhead to 4 Kb assuming 8 × 8 block sizes. The I/O circuits for these memories are described in the following sections.
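The 4-Kb figure can be checked with a short calculation. The sketch below assumes that each memory word has the same width as the corresponding datapath input given in Section III (12 b for coefficients, 14 b for pixels); that mapping is an assumption made for illustration, although the resulting capacities agree with the 1.5-Kb and 2.6-Kb values quoted in Section V. The names are hypothetical.

```python
BLOCK_WORDS = 8 * 8   # words in one 8 x 8 block
COEFF_BITS = 12       # assumed memory word width (coefficient input width)
PIXEL_BITS = 14       # assumed memory word width (data/pixel input width)


def memory_bits(blocks: int, word_bits: int, block_words: int = BLOCK_WORDS) -> int:
    """Capacity in bits of a memory holding `blocks` blocks of `word_bits`-bit words."""
    return blocks * block_words * word_bits


coeff_bits = memory_bits(2, COEFF_BITS)   # two ping-pong blocks: 1536 bits
pixel_bits = memory_bits(3, PIXEL_BITS)   # ping-pong pair plus intermediate block: 2688 bits
total_bits = coeff_bits + pixel_bits      # 4224 bits

for name, bits in [("coefficient", coeff_bits), ("pixel", pixel_bits), ("total", total_bits)]:
    print(f"{name}: {bits} bits = {bits / 1024:.3f} Kb")
```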
B. On-Chip Tapped Delay Lines
To remove the need for software emulation of the processing of a sliding window of input pixels and to simplify the external input interface to the on-chip line memories, tapped delay lines are placed at the inputs of both the coefficient and pixel memories as shown in Fig. 2. The tapped delay line at the input to the pixel memory directly implements the sliding window of pixels in the FIR filters used for subband coding without requiring additional control and addressing logic. For the discrete cosine transform, this delay line allows rows of the input pixel block to be shifted on-chip in a word-serial fashion, removing the need for an additional external buffer to convert the external word-serial data stream to an internal word-parallel format suitable for storage in the on-chip line memories. Furthermore, as the contents of the coefficient memory are either loaded once or change at a slower rate than the pixel memory, the tapped delay line at the coefficient memory input also serves to reduce the external pin count required to load the coefficients. Word-serial update of both the coefficient and pixel memories is achieved by shifting a single value into the tapped delay line in one processor clock cycle. Groups of eight coefficients or pixels are loaded in parallel from the tapped delay line outputs into the on-chip memories every eight processor cycles, resulting in a reduced memory write cycle time requirement. The pixel tapped delay line and memory can be bypassed by use of the parallel input pixel bus.
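A behavioral sketch of this word-serial to word-parallel conversion is given below. It is a software model only, not the circuit: the class `TappedDelayLine`, the function `load_rows`, and the choice to store each row oldest-word-first are all assumptions made for illustration.

```python
from collections import deque


class TappedDelayLine:
    """Shift register with all taps visible in parallel (behavioral model)."""

    def __init__(self, taps: int = 8):
        # A bounded deque drops the oldest word when a new one is shifted in.
        self.regs = deque([0] * taps, maxlen=taps)

    def shift(self, word):
        """One processor clock cycle: shift a single word in."""
        self.regs.appendleft(word)

    def outputs(self):
        """Parallel tap outputs, newest word first."""
        return list(self.regs)


def load_rows(stream, taps: int = 8):
    """Write one line of `taps` words into memory every `taps` cycles."""
    line = TappedDelayLine(taps)
    memory = []
    for cycle, word in enumerate(stream, start=1):
        line.shift(word)
        if cycle % taps == 0:
            # Parallel load; reversed so each stored row is in arrival order.
            memory.append(line.outputs()[::-1])
    return memory


print(load_rows(range(16)))
```

Between parallel loads, `outputs()` also provides the sliding window that an FIR filter consumes directly, which is the subband-coding use of the same structure.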
C. Parallel Input Pixel and Serial Input Offset Busses
In vector quantization, the input rate requirements are higher than those in the discrete cosine transform and subband coding, since all the input codebook values are used for a single output computation. Therefore, a direct parallel signal flow path is required to bring external data to the parallel on-chip computational datapaths in a single processor cycle. This can increase the input bandwidth requirement as well as the on-chip routing complexity. However, for the image transform algorithms being considered, this direct parallel signal flow and the sequencing of the signal flow through the on-chip pixel memory are never simultaneously required by the same algorithm. This observation is exploited to reduce the complexity by sharing the parallel input pixel bus and the pixel memory data bus with the addition of a separate shunt path between the memory's input and output bitlines.
In addition to the parallel input pixel bus, some algorithms such as vector quantization also require an offset value to be subtracted from the result of the inner product computation. Therefore, an additional word-serial input offset bus has been added.
D. Interleaving Transformation
For long filter lengths, subband coding has higher computational requirements than the 8 × 8 DCT or tree-search vector quantization. However, the 8 × 8 DCT requires the largest on-chip memories of the three algorithms. The use of an interleaving transformation on the tapped delay line at the input to the pixel memory simultaneously reduces both the computational hardware requirements for subband coding and the memory sizes for the 8 × 8 DCT as explained below. Interleaving of the tapped delay line is achieved by doubling every register and adding an input multiplexer as shown in Fig. 3 [17].
For subband coding, the interleaved tapped delay line helps to implement the computationally efficient polyphase filter structure with separate processing of the odd and even samples of the input pixel stream. As described in [18] , this allows the filters to run at half the output and input pixel rate for the interpolation and decimation filtering, respectively. This reduces the number of parallel computational datapaths required by the processor by half for a fixed throughput requirement.
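The equivalence behind the polyphase structure can be checked numerically. The sketch below covers only the decimation case and is a generic textbook formulation, not a model of the chip: it filters the even and odd input samples separately at half the input rate and compares the sum with filtering at the full rate followed by discarding every other output. All function names are hypothetical, and inputs are assumed to be Python lists with zero initial state.

```python
def fir(h, x):
    """Direct-form FIR: y[n] = sum_k h[k] * x[n - k], with x[n] = 0 for n < 0."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def decimate_direct(h, x):
    """Filter at the full input rate, then keep every second output."""
    return fir(h, x)[::2]


def decimate_polyphase(h, x):
    """Same result, with both sub-filters running at half the input rate."""
    h_even, h_odd = h[0::2], h[1::2]          # even and odd filter taps
    x_even = x[0::2]                          # x[2n]
    x_odd = ([0] + x[1::2])[:len(x_even)]     # x[2n - 1], zero for n = 0
    y_even = fir(h_even, x_even)
    y_odd = fir(h_odd, x_odd)
    return [a + b for a, b in zip(y_even, y_odd)]


h = [1, 2, 3, 4, 5, 6, 7, 8]
x = list(range(1, 21))
print(decimate_polyphase(h, x) == decimate_direct(h, x))  # True
```

Each sub-filter has half the taps and produces one output per two input samples, so the multiply rate per output is unchanged while the required clock rate of the datapath is halved.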
For the 8 × 8 DCT, the computation of a pair of matrix multiplications is required. The interleaving transformation allows the computation of the first and second matrix multiplications to be interleaved on the processor. The two interleaved inputs to the tapped delay line are the input pixels of the next input block arriving from off-chip and the results of the current matrix multiplication being fed back from the computational pipeline for the second matrix multiply operation. Without interleaving, the processor produces no outputs for the first 64 clock cycles (while it computes the first matrix product), followed by a burst of valid outputs for the next 64 clock cycles. This would require an additional 8 × 8 block of memory to buffer the output pixel block to match the output rate of the processor to the input rate (one input sample per two processor clock cycles). The interleaving of the delay line eliminates the need for this buffering and reduces the on-chip pixel memory required by 25%. Note that interleaving is not required for the coefficient memory input since its bandwidth requirements are already very low.
E. Final Signal Flow Network Design
Fig. 4 illustrates the complete signal flow network and its interfaces to the controller and to the computational datapath. The total overhead necessary to configure the processor for execution of an algorithm consists of three multiplexers, three word-serial input busses, one word-parallel input bus, one word-serial output bus, two tapped delay-lines, and 4 Kb of RAM. The two outputs of the processor are the word-serial output of the inner product computation and the conditional flag which is set if the current offset subtraction output is less than the previously computed value. This signal flow network is sufficient to fully utilize the parallel computational datapath and achieve a 25-MPixels/s processing rate for several image transform algorithms.
The design of the computational datapath is explained in the next section and the processor's configuration and control in Section IV.

III. COMPUTATIONAL DATAPATH AND BRANCH-CONTROL MINIMIZATION
Computational hardware can be minimized by using a very high clock rate, but at the expense of large on-chip memories to match a low-rate external I/O to a very high rate on-chip arithmetic logic unit (ALU) [14], and at the expense of large control logic required to timeshare operations on a single datapath. The proposed processor utilizes multiple computational datapaths operating at a lower clock rate to reduce the on-chip memory sizes and control complexity while maintaining a constant computational throughput. This is a self-consistent strategy since the reduction in memory and control allows the integration of more computational units on the chip.
The second optimization that is performed on the processor is to maximize the sharing of computational hardware across different algorithms. The proposed configurable processor implements computations as macro-instructions consisting of a sequence of operations that are common to several image transform algorithms. These macro-instructions are hardwired in the configurable architecture, allowing the full parallelism within the macro-instruction to be exploited. Similarly, software controlled branching that is conventionally used to make data-dependent decisions is replaced with hardwired decision making units. This removes the control logic complexity associated with dynamic changes in program control flow and instruction fetching.
The macro-instructions are defined based on the common algebraic form of the desired class of image transform algorithms and expose more parallelism than the fine-grained instructions in a software programmable architecture while providing more flexibility than a dedicated ASIC architecture. This idea was first proposed by Yamazaki et al. [19]. Each of the three intraframe coding algorithms (the DCT, subband coding, and vector quantization) can be computed by the evaluation of an $N$-point inner product

$$y = \sum_{i=1}^{N} c_i x_i \tag{1}$$

where $c_i$ is the coefficient operand and $x_i$ the data/pixel operand.
This inner product represents the vector dot product of the 8 × 8 DCT, the convolution for subband coding, and the distance computation of vector quantization. As these three algorithms require block operations, the one-dimensional inner product of (1) can be extended to two dimensions by adding an accumulation

$$y = \sum_{j=1}^{M} \sum_{i=1}^{N} c_{ij} x_{ij}. \tag{2}$$

While the evaluation of (2) is sufficient to perform the DCT or subband coding, vector quantization requires two more modifications to the basic equation. First, the inner-product value computed in (1) must be subtracted from a precomputed offset in order to form the complete Euclidean distance computation as described in [20]

$$d_k = o_k - \sum_{j=1}^{M} \sum_{i=1}^{N} c_{k,ij} x_{ij} \tag{3}$$

where $k$ indexes the vectors of the codebook and

$$o_k = \frac{1}{2} \sum_{j=1}^{M} \sum_{i=1}^{N} c_{k,ij}^{2}. \tag{4}$$

Finally, as vector quantization encoding requires a search, we must find the minimum of the result formed in (3), and return the index of the vector that is closest to the input vector

$$k^{*} = \arg\min_{k} \left( o_k - \sum_{j=1}^{M} \sum_{i=1}^{N} c_{k,ij} x_{ij} \right). \tag{5}$$

Table I illustrates how this last equation provides a common representation of the DCT, vector quantization, and subband coding algorithms. By defining the computational datapath to efficiently implement (5), the complete architecture can execute any of the three image transform algorithms by appropriately configuring the datapath as described in Section IV.
The computational datapath is defined by the four register transfer operations that are required sequentially by (5):
1) sum-of-products: $p \leftarrow \sum_{i=1}^{N} c_i x_i$;
2) accumulation: $a \leftarrow a + p$;
3) offset subtraction: $d \leftarrow o - a$;
4) minimization: $m \leftarrow \min(m, d)$;
where $m$ is initialized to the maximum representable number. Since these four operations are repeatedly performed according to a static schedule, they are mapped directly to a hardwired configurable datapath rather than executing them on a software programmable ALU. This avoids the control and on-chip memory overhead associated with scheduling the operations on a time-shared datapath. Furthermore, the sequential definition of the four operations lends itself to a logical pipeline structure, with the output of each operation passed to the input of the next. The output of operation 1 is directly accumulated in operation 2 and finally used in the subtract-compare-select function implemented in operations 3 and 4. The only remaining optimization to the pipelined implementation is the balancing of the delay through each pipeline stage as discussed below.
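The four-operation sequence (sum-of-products, accumulation, offset subtraction, minimization) can be sketched in software as a full-search vector quantizer. This is an illustrative model only. The offset is assumed here to be half the squared norm of each code vector, which is the standard way to make "offset minus inner product" rank code vectors in the same order as Euclidean distance; the function names are hypothetical, and `float("inf")` stands in for the maximum representable number.

```python
def sum_of_products(coeffs, pixels):
    """Operation 1: inner product of one coefficient row and one pixel row."""
    return sum(c * x for c, x in zip(coeffs, pixels))


def half_squared_norm(code):
    """Assumed form of the precomputed offset for one code vector."""
    return sum(c * c for row in code for c in row) / 2


def vq_search(codebook, block):
    """Return the index of the code vector closest to `block`.

    codebook: list of code vectors, each a list of rows.
    block:    input pixel block, a list of rows of the same shape.
    """
    minimum = float("inf")      # stand-in for the maximum representable number
    best_index = None
    for index, code in enumerate(codebook):
        acc = 0
        for code_row, pixel_row in zip(code, block):
            acc += sum_of_products(code_row, pixel_row)   # operations 1 and 2
        distance = half_squared_norm(code) - acc          # operation 3
        if distance < minimum:                            # operation 4 (flag set)
            minimum, best_index = distance, index
    return best_index


codebook = [[[0, 0], [0, 0]], [[10, 10], [10, 10]], [[3, 4], [5, 6]]]
block = [[3, 5], [5, 5]]
print(vq_search(codebook, block))  # 2
```

In hardware the offsets would be computed once per codebook and streamed in, rather than recomputed per comparison as this sketch does for brevity.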
To reduce the delay required to compute the sum-of-products in operation 1, eight multipliers are implemented in parallel as shown in Fig. 5. Even with this added parallelism, the multiply and addition operations performed in the first step of the sequence have a longer propagation delay than the simple additions performed in the remaining steps. To equalize the delays, the first operation has been further divided into three pipeline stages: 1) the eight parallel multiplications using carry-save arithmetic, 2) the summation of the products in carry-save form using 4-2 compressors [21], and 3) the conversion of the final sum-of-products from a redundant carry-save representation to nonredundant two's complement representation. After the extra pipelining of operation 1, the pipeline delay is approximately equal to the pipeline delay of the accumulation in operation 2 and the subtraction in operation 3.
The wordlengths for the multiplier and accumulator datapaths were chosen based on the required accuracy of IEEE standard 1180-1990 for the inverse discrete cosine transform [22]. To achieve this level of accuracy, 14 × 12-b multipliers are used, producing 26-b results. The coefficient inputs to the sum-of-products unit are 12 b, and the data inputs are 14 b. The 26-b multiplication outputs are summed at full accuracy in the 4-2 carry-save adder tree, and only then rounded to a 14-b output. An additional 3 b of precision in the accumulator allows eight outputs to be summed without overflow.
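The wordlength figures follow from two simple rules for two's complement arithmetic, sketched below: a product of an a-bit and a b-bit operand fits in a + b bits, and summing T values needs ceil(log2 T) extra bits. The helper names are hypothetical and the check is an illustration, not a description of the chip's rounding logic.

```python
def product_bits(a_bits: int, b_bits: int) -> int:
    """Bits needed to hold any product of two's complement operands."""
    return a_bits + b_bits


def guard_bits(terms: int) -> int:
    """Extra bits needed to sum `terms` values without overflow: ceil(log2(terms))."""
    return (terms - 1).bit_length()


def fits(value: int, bits: int) -> bool:
    """True if `value` is representable in `bits`-bit two's complement."""
    return -(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1


print(product_bits(14, 12))   # 26
print(guard_bits(8))          # 3

# Worst-case check: eight 14-b outputs at the most negative value.
most_negative_14b = -(1 << 13)
print(fits(8 * most_negative_14b, 14 + guard_bits(8)))  # True
```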
The last operation in the computational pipeline is the search for a minimum implemented with a compare-and-select datapath. The function of this datapath is to compare the current output of the offset subtracter with its previous value and to store the minimum of the two as well as set the conditional flag indicating a new minimum has been found. In full-search vector quantization, the controller sends every block in the codebook to the processor and monitors the conditional flag to determine the closest codebook block to the current input block. In tree-search vector quantization, the block-by-block result of the minimizer is used to alter the data flow by directing the selection of the next codebook block for comparison by the processor. Fig. 6 shows the parallel computational datapath and its interface to the signal flow network. In the next section we describe the controller logic required to configure this processor and implement several compression algorithms.
IV. PROCESSOR CONFIGURATION AND CONTROL
A. Control Model
Given the signal flow network and computational datapath shown in Figs. 4 and 6, execution of the algorithms requires the definition of a set of clock signals for the registers and memories, multiplexer select signals, and memory addresses associated with each state of the controller. Each algorithm may consist of many such states. The state or states of the controller FSM for the proposed configurable processor when executing the 8 × 8 DCT, subband coding, and vector quantization are described in the next three sections. Due to the optimizations of the signal flow network and the computational datapath, the complexity of the FSM for any one algorithm is extremely low, typically a few thousand gates. In a custom product, the FSM's for the desired set of algorithms can be integrated on the processor chip. The hardwired control does not allow field programmability, but is ideal for a low-cost processor. For system prototyping purposes, the FSM's can be implemented in an FPGA. The FSM for one of the algorithms, the DCT, is described below. Controllers for the other algorithms have been similarly derived, and all have been implemented on a 4000-gate FPGA and used to test the execution on the processor chip as discussed in Section VI.
B. Control Example: DCT
The DCT controller requires two states corresponding to the two matrix multiply operations that are being interleaved on the processor. Each state shares a common set of clock signals but supplies a different set of multiplexer select signals and a different set of read and write addresses to the coefficient and pixel memories. On alternate clock cycles the control signals from one or the other of the two states are supplied to the signal flow network. Table II summarizes the control requirements for execution of the 8 × 8 DCT in each of the two control states.
The clocks, memory addresses, and multiplexer select signals for the DCT are derived as follows. The first state refers to the first matrix multiplication operation. In this state, the serial input data/pixel mux selects the serial input data/pixel bus, shifting pixels into the pixel input tapped delay line at the input pixel rate. Eight DCT basis vector coefficients are read from the coefficient memory and multiplied by eight pixels read from the pixel memory. The sum-of-products unit operates at twice the input pixel rate because it is being time-shared between two matrix multiplications. In state 2, the second matrix multiplication is performed. Eight DCT basis vector coefficients are read from the coefficient memory and multiplied by eight elements of the matrix multiplication result computed in state 1. Simultaneously, the output values of the previous matrix computation are shifted into the interleaved pixel tapped delay line by selecting the sum-of-products output with the serial input data/pixel multiplexer. The serial output data/pixel multiplexer selects the output of the sum-of-products unit, placing values on the serial output data/pixel bus at the input pixel rate. New rows of pixels shifted into the input tapped delay line are written into the pixel memory during state 2 as well.
TABLE II. DCT CONTROLLER STATES
The absence of the coefficient memory write addressing reflects the fact that the coefficient information is statically defined for the algorithm. Likewise, no control signals are required for the accumulator, offset subtracter, or minimizer, as these datapaths are inactive when computing the DCT. Similar derivations allow the processor to perform vector quantization and subband coding.
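The two matrix multiplications that the two controller states interleave can be sketched as a pair of passes built entirely from eight-point inner products. The code below is a floating-point illustration of the row-column decomposition using an orthonormal DCT-II basis; it does not model the 12-b/14-b fixed-point datapath, the interleaved schedule, or the memory addressing, and all function names are hypothetical.

```python
import math


def dct_basis(n: int = 8):
    """Orthonormal DCT-II basis; row u holds basis vector u."""
    basis = []
    for u in range(n):
        scale = math.sqrt((1 if u == 0 else 2) / n)
        basis.append([scale * math.cos((2 * i + 1) * u * math.pi / (2 * n))
                      for i in range(n)])
    return basis


def inner_product(c, x):
    return sum(ci * xi for ci, xi in zip(c, x))


def transpose(m):
    return [list(col) for col in zip(*m)]


def multiply_pass(basis, block):
    """One matrix multiplication: each block row against each basis vector."""
    return [[inner_product(vec, row) for vec in basis] for row in block]


def separable_transform(basis, block):
    """Two passes: the second operates on the transposed result of the first."""
    first = multiply_pass(basis, block)                # first matrix multiply
    second = multiply_pass(basis, transpose(first))    # second matrix multiply
    return transpose(second)


def dct2(block):
    return separable_transform(dct_basis(len(block)), block)


def idct2(coeffs):
    return separable_transform(transpose(dct_basis(len(coeffs))), coeffs)


flat = [[1.0] * 8 for _ in range(8)]
print(round(dct2(flat)[0][0], 6))  # 8.0: a flat block has only a DC term
```

The forward and inverse transforms differ only in the coefficient set, which is consistent with the coefficients being the only thing that changes in the coefficient memory.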
C. Control Implementation
To add the flexibility to explore different implementation strategies for the controller FSM shown in Fig. 1, it has been implemented externally on an FPGA. In a consumer application, the controller can be mapped to a standard-cell layout for integration on a single chip with the rest of the processor. This is easily accomplished due to the low complexity of the processor and the low gate count of the state machines (approximately 2700, 2000, and 2600 gates for the DCT, vector quantization, and subband coding controllers, respectively). Note that the complexity of the vector quantization controller is less due to its single control state versus the two control states required for the DCT and subband coding.
By way of comparison, the graphics processor presented in [23] has a similar computational datapath to the proposed processor, but has a different memory and I/O design and a more traditional interpreted control strategy. As a result of the microinstruction control scheme, the processor's complexity is increased by a two-level hierarchy of instruction decoders and 4 Kb on-chip instruction memory, which is equal to the size of the total data memory in our proposed processor. The total number of devices of the processor in [23] is twice that of the proposed processor.
V. CHIP IMPLEMENTATION
The processor chip has been fabricated [24] and was implemented using the Lager IV CAD system [25] . This allows all of the macroblocks and cells to be designed as parameterized modules. These parameters can either be specified at a low level, such as the datapath wordlengths and flags to specify the addition of feedthroughs or extra drive capability, or at a high level such as the on-chip memory size or number of multipliers in the sum-of-products unit. The individual cells can be either tiled or routed using an automatic place-and-route tool.
The on-chip memories have been implemented as dual-ported line memories, each capable of reading and writing eight words in a single cycle. These eight words are passed directly to the eight inputs of the inner product unit, allowing the inner product unit to begin a new computation every clock cycle. The basic memory cell is a three-transistor dynamic memory cell having an area of 240 µm² per bit in 1.2-µm CMOS. The simulated worst-case read and write cycle times are 19 and 9 ns, respectively. The coefficient and pixel memories have capacities of 1.5 and 2.6 Kb, respectively, for a total of only about 4 Kb of on-chip memory to sustain uninterrupted execution of any of the image transform algorithms. As mentioned earlier, the only purpose the memories serve is to sequence the input data through the common parallel computational datapath. This purpose, together with the block-based nature of the algorithms, allows the memory capacity to be only a few multiples of the block size without increasing the
The use of a configurable architecture with hardwired signal flow was successful in reducing the complexity associated with the control and on-chip memories. Of the less than 110 K total transistors in the combined processor and external controllers, approximately 11% are due to on-chip memories. This is in contrast to the greater than 50% that is common in programmable video signal processors. The configurable signal flow network requires only 10% of the overall complexity, the control logic 27%, and the computational pipeline 52%. Therefore, the reduction in control and on-chip memory complexity allowed most of the processor's complexity to be dedicated to high-speed parallel computational datapaths.
Of the 180 pins, 112 are used for data signal I/O, 39 for control signals (clocks, addresses, and multiplexer controls), 16 for power and ground, and the remaining 13 pins are unused. If the controller is included on-chip, the pin count can be reduced to 132, bringing the total chip area closer to the core area of 55 mm². The micrograph is shown in Fig. 7.
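As a quick consistency check on the figures quoted above, the following Python sketch tallies the complexity shares and the pin allocation. The 110 K total is the upper bound given in the text, so the per-block transistor counts are rough estimates rather than reported values, and all variable names are illustrative.

```python
# Upper bound quoted in the text ("less than 110 K" transistors in total).
TOTAL_TRANSISTORS = 110_000

# Approximate complexity shares quoted in the text, in percent.
share_percent = {
    "on-chip memories": 11,
    "configurable signal flow network": 10,
    "control logic": 27,
    "computational pipeline": 52,
}

# Pin allocation quoted in the text for the 180-pin package.
pins = {
    "data signal I/O": 112,
    "control signals": 39,
    "power and ground": 16,
    "unused": 13,
}

# Rough per-block transistor estimates derived from the upper bound.
approx_transistors = {
    name: TOTAL_TRANSISTORS * pct // 100 for name, pct in share_percent.items()
}

total_share = sum(share_percent.values())
total_pins = sum(pins.values())

# Pins that would no longer be needed if the controller moved on-chip
# (the text quotes a reduced pin count of 132).
pins_saved_with_on_chip_controller = total_pins - 132

for name, count in approx_transistors.items():
    print(f"{name}: about {count} transistors")
print(f"shares sum to {total_share}%, pins sum to {total_pins}")
```

The four shares account for the whole design and the four pin groups account for the whole package, so the quoted breakdowns are internally consistent.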
VI. TEST RESULTS
An evaluation board has been designed and built to allow execution of the three algorithms under the control of a workstation. As illustrated in Fig. 8, the test board consists of four main components: the configurable image transform processor chip, an FPGA controller, a host bus direct memory access (DMA) controller, and one megabyte of frame memory. From a schematic-based software environment, the choice of algorithm initiates the programming of the FPGA through an interface attached to the workstation's serial port. A test image selected from a database of stored images on the workstation is then transferred through the DMA controller to the local frame buffer on the evaluation board. The image is then processed using the image transform processor unit and stored in the local frame buffer. The operation is concluded with an uploading and display of the processed image back to the workstation.
Fig. 9 illustrates a vector-quantized test image using the image transform processor unit and evaluation board with a four-way tree search of a codebook containing 256 4 × 4 code vectors. In this configuration the FPGA controller uses the decisions produced by the processor to generate the addresses for the next vector in the codebook search. In addition, the codebook is stored along with the controller on the FPGA.
Fig. 10 illustrates a one-level subband decomposition of an input test image using the processor and evaluation board. In the three high-frequency subbands, the most negative components are mapped to black pixels, the most positive values to white pixels, and zero values to the middle gray level. The test image is processed in two passes: one pass for the horizontal high-pass and low-pass filters, and another pass for the vertical high-pass and low-pass filters. Finally, in Fig. 11, the results of a forward and inverse 8 × 8 DCT are shown, as produced by the processor.
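The four-way tree search used for the vector quantization test can be illustrated with a minimal Python sketch. It is not the controller implementation: it assumes a four-level tree (4^4 = 256 leaves), interior test vectors formed as the mean of their children, a squared Euclidean distance measure, and a simple `BRANCH * node` addressing rule. All of these, along with the function names, are hypothetical choices made only for illustration.

```python
import random

DIM = 16      # one 4 x 4 block, flattened
BRANCH = 4    # four-way decision at each tree level
DEPTH = 4     # assumed depth: BRANCH ** DEPTH = 256 leaf code vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two flattened blocks."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean, used here as the test vector of an interior node."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def build_levels(leaves):
    """Return the node vectors of every tree level, with the leaves last.

    The children of node i on one level are nodes BRANCH * i through
    BRANCH * i + BRANCH - 1 on the next level (assumed addressing rule).
    """
    levels = [leaves]
    while len(levels[0]) > BRANCH:
        below = levels[0]
        parents = [mean_vector(below[i:i + BRANCH])
                   for i in range(0, len(below), BRANCH)]
        levels.insert(0, parents)
    return levels


def tree_search(block, levels):
    """Descend the tree; return (leaf index, number of distance evaluations)."""
    node = 0
    evaluations = 0
    for level in levels:
        base = BRANCH * node  # address of the first candidate on this level
        node = min(range(base, base + BRANCH),
                   key=lambda i: sq_dist(block, level[i]))
        evaluations += BRANCH
    return node, evaluations


# Example with a random codebook standing in for a trained one.
rng = random.Random(0)
codebook = [[rng.randrange(256) for _ in range(DIM)]
            for _ in range(BRANCH ** DEPTH)]
levels = build_levels(codebook)
block = [rng.randrange(256) for _ in range(DIM)]
index, evaluations = tree_search(block, levels)
print(index, evaluations)
```

Under these assumptions each block needs 16 distance evaluations instead of the 256 of an exhaustive search, and the decision at each level determines the address range examined at the next level. A tree search is not guaranteed to return the globally nearest code vector.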
The evaluation board has been used to test the execution of the compression algorithms at the maximum host bus transfer rate of 600 Kbytes/s. Additionally, the chip has been tested to a maximum clock rate of 50 MHz at 5 V using a Tektronix LV500 tester. This clock rate supports the encoding and decoding throughputs listed in the chip summary in Table III.
VII. CONCLUSIONS
A configurable processor chip has been presented that achieves the same performance for several image transform algorithms as conventional programmable processors, with an order of magnitude lower complexity. The use of asymmetric interfaces for the two on-chip data memories and the introduction of interleaved tapped delay lines at the memory inputs significantly reduce both the required memory sizes and the address control logic complexity. A common algebraic formulation of the different compression algorithms allows the design of a shared parallel computation datapath, eliminating the need for instruction decode and microprogramming logic. Execution of different algorithms is achieved by dedicated FSM controllers which configure the processor. With these features, the total complexity of the processor with controllers to execute the 8 × 8 DCT/IDCT, subband coding, and vector quantization at 25 MPixels/s is 110 K transistors, compared with over a million transistors required by software programmable processors. The tradeoff between flexibility, low complexity, and high speed achieved by the configurable processor makes it an ideal candidate for integration into future video communication devices supporting multiple algorithms. Configurable DSP's in general provide a better tradeoff between flexibility and complexity than software programmable DSP's and dedicated ASIC's. For more general applications, future work is required in the development of software for configurable controller synthesis.
