This paper describes a novel reconfigurable architecture for digital signal processing (DSP). This architecture consists of a two-level array of cells and interconnections. On the upper level, fundamental DSP operations such as multiplication and addition are mapped onto blocks of 4-bit cells. On the lower level, each cell uses a 4 × 4 matrix of smaller "elements" to perform the necessary computations. Cells also contain pipeline latches for increased throughput. The architecture features a simple VLSI implementation that combines the flexibility of memory elements with the speed of DOMINO logic. Initial prototypes have been fabricated using a modest 0.5-µm CMOS technology. Circuit simulations of the cell in 0.25-µm technology indicate that the design achieves a clock frequency of 200 MHz.
Introduction
Many digital systems rely on digital signal processing (DSP) to achieve their functionality. For example, cellular phones use sophisticated compression and encryption algorithms to transmit data securely over a wireless link. Digital multimedia devices such as video cards and CD players translate a stream of bits into images or music. Even hearing aids may implement complex digital filters to enhance speech. Many of these applications impose critical requirements on performance, power consumption, and reliability, and thus require innovative hardware architectures to meet these specifications.
Digital systems may use several components to implement DSP, ranging from custom integrated circuits to general-purpose microprocessors [1]. Application-specific integrated circuits (ASICs) achieve very high performance, but incur high development costs and only support one operation. General-purpose microprocessors can execute any software program, but offer relatively poor performance for DSP. Specialized digital signal processors achieve better results, but still do not approach the speed of an ASIC. Finally, reconfigurable devices attempt to combine the performance of an ASIC with the flexibility of a microprocessor. This approach has recently become practical for DSP, due to the exponentially increasing capabilities of VLSI systems.
In general, reconfigurable devices consist of a programmable array of computational cells and interconnection structures. These devices offer great flexibility and adaptability for DSP since designers can reprogram the hardware at any time, even after deployment [2] [3] [4]. In addition, the implementation process can be automated with the use of appropriate software tools. Traditional reconfigurable devices such as field-programmable gate arrays (FPGAs) place little functionality in the cells. These fine-grain devices work well for implementing combinational or sequential logic. However, DSP algorithms use arithmetic operations such as multiplication extensively. Mapping a multiplier onto a fine-grain device produces a complex structure that yields poor performance, so some designs actually embed dedicated multipliers into the array [5, 6]. Other designs place lookup tables and other functional units within the cells [7] [8] [9]. These coarse-grain devices achieve good performance for binary arithmetic, but may not provide a straightforward way to implement the control logic found in DSP algorithms. The fixed number of functional units also limits flexibility.
This paper describes a novel reconfigurable architecture for DSP. In this approach, each cell consists of a 4 × 4 matrix of reconfigurable "elements". Each element is essentially a small, 16 × 2-bit random-access memory. The matrix of elements can be configured into two structures: one optimized for memory operations and the other for arithmetic and logic functions. In memory mode, the matrix of elements operates as a 64-byte random-access memory. In mathematics mode, each element acts as a lookup table that allows the cell to implement a wide variety of 4-bit functions. The resulting two-level architecture can implement the entire spectrum of operations required for DSP.
Previous work has presented a system-level overview of the reconfigurable architecture, comparing the proposed scheme to other alternatives [10]. Other research has described the interconnection fabric used in the design and illustrated how the architecture can implement the fast Fourier transform (FFT) [11]. This paper focuses on the design and performance characteristics of the two-level array of cells. Section 2 illustrates how to map basic arithmetic and control operations onto blocks of cells. Section 3 describes the design of the cell, explaining how the matrix of elements can perform the computations required by each block. Section 4 considers a simple VLSI implementation of an element and evaluates the performance of the design. Finally, Section 5 gives some concluding remarks.
Mapping DSP operations
To implement a DSP algorithm on the two-level architecture, design tools can follow the three-step process outlined in Fig. 1. First, the algorithm is divided into basic modules, such as multipliers, adders, memory units, and control logic. Next, each module is mapped onto a block of 4-bit cells. Finally, the modules are arranged on the array and routed together over the interconnection fabric. This section gives several examples of mapping common DSP operations onto blocks of cells. The motivations for this discussion are twofold: to demonstrate that these operations can be translated efficiently, and to illustrate the functionality that each cell must contain. Showing the other two steps in the process would require a detailed discussion of the interconnection scheme and is left to [11].
Multiplication
Almost all DSP algorithms use multiplication of some form. Depending on the target application, the algorithm may require unsigned or signed multiplication of 16-bit, 20-bit, 24-bit, 32-bit, or larger numbers. The use of 4-bit cells enables applications to implement a multiplier of the precise size required, while exploiting the inherent parallelism of the operation.
Suppose the reconfigurable device must multiply two unsigned 16-bit numbers A and B to generate a 32-bit output Y. The unit is to operate in parallel for maximum performance. A straightforward solution, shown in Fig. 2, implements a carry-save multiplier [12] with 4-bit cells. Note that A and B are broadcast across the entire structure. This multiplier requires 20 cells: 4 that perform multiplication, 4 that perform addition, and 12 that perform both. The critical path involves 8 cells. A typical cell multiplies two 4-bit portions of the inputs, say a and b, and may add two 4-bit terms to the result, say c and d. Denoting the result as y, each cell performs the operation

\[ y = a \times b + c + d. \]
The upper and lower halves of the result connect to the c and d inputs of neighboring cells.
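As a rough functional sketch of this arrangement (not the exact wiring of Fig. 2), the following Python model builds an unsigned 16-bit multiplier from 4-bit cells that each compute a × b + c + d. The function names and the way the upper and lower halves are routed between rows are assumptions made for illustration. In this model the longest chain of dependent cells is four rows plus four adder cells, which is consistent with the 8-cell critical path quoted above.

```python
def cell(a, b, c=0, d=0):
    """One 4-bit cell: returns (low, high) nibbles of a*b + c + d."""
    y = a * b + c + d              # at most 15*15 + 15 + 15 = 255, so 8 bits suffice
    return y & 0xF, y >> 4


def nibbles(x, count):
    """Split x into `count` 4-bit digits, least significant first."""
    return [(x >> (4 * k)) & 0xF for k in range(count)]


def carry_save_multiply(A, B):
    """Unsigned 16 x 16 -> 32-bit product; returns (product, cells_used)."""
    a, b = nibbles(A, 4), nibbles(B, 4)
    lo, hi = [0] * 5, [0] * 4      # halves produced by the previous row (lo[4] stays 0)
    digits, cells = [], 0
    for j in range(4):             # one row of cells per 4-bit digit of B
        new_lo, new_hi = [0] * 5, [0] * 4
        for i in range(4):
            # Cell (i, j) has weight 16**(i + j), as do lo[i + 1] and hi[i].
            new_lo[i], new_hi[i] = cell(a[i], b[j], lo[i + 1], hi[i])
            cells += 1
        digits.append(new_lo[0])   # digit j of the product is now final
        lo, hi = new_lo, new_hi
    carry = 0
    for k in range(4):             # final row: four add-only cells
        y = lo[k + 1] + hi[k] + carry
        digits.append(y & 0xF)
        carry = y >> 4
        cells += 1
    assert carry == 0              # a 16 x 16 product always fits in 32 bits
    return sum(d << (4 * k) for k, d in enumerate(digits)), cells
```

The first row receives no c or d terms, so its four cells only multiply; the next twelve multiply and add; the last four only add, giving the 4 + 12 + 4 = 20 cells listed in the text.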
By rearranging the interconnection structure, it is possible to reduce the hardware required. Fig. 3 shows an improved multiplier that uses only 16 cells and has a critical path of 7 cells. The structure is easily scalable to form n-bit multipliers with (n/4)^2 cells (assuming n is a multiple of 4).

Fig. 1. Implementing a DSP algorithm on the two-level architecture.
Fig. 2. Carry-save multiplier.
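The (n/4)^2 cell count can be checked with a small functional model. The sketch below is one possible arrangement, assumed here for illustration and not taken from Fig. 3: each cell passes its upper nibble to the next cell in the same row, so no separate adder row is needed. It reproduces the cell count only and makes no claim about the critical path of the actual design.

```python
def multiply_with_cells(A, B, n):
    """Unsigned n-bit multiply built from (n/4)**2 multiply-add cells.

    Returns (product, cells_used). Hypothetical helper for illustration.
    """
    assert n % 4 == 0 and 0 <= A < (1 << n) and 0 <= B < (1 << n)
    m = n // 4
    a = [(A >> (4 * i)) & 0xF for i in range(m)]
    b = [(B >> (4 * j)) & 0xF for j in range(m)]
    acc = [0] * (2 * m)            # running 4-bit digits of the product
    cells = 0
    for j in range(m):
        carry = 0
        for i in range(m):
            y = a[i] * b[j] + acc[i + j] + carry   # one cell: a*b + c + d <= 255
            acc[i + j], carry = y & 0xF, y >> 4
            cells += 1
        acc[j + m] = carry         # this digit has not been written before row j
    product = sum(digit << (4 * k) for k, digit in enumerate(acc))
    return product, cells
```

For n = 16 the model uses 16 cells, and for n = 32 it uses 64.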
One benefit of dividing large operations into 4-bit units is the performance enhancement gained by pipelining [13]. Adding pipeline latches to the output of each cell enables the component to begin one multiplication per clock cycle. The clock frequency is only constrained by the propagation delay of one cell plus associated interconnection structures. Fig. 3 indicates the number of pipeline stages that must separate each cell so that intermediate results arrive at the next cell at the proper times. With these modifications, the multiplier has a latency of 7 clock cycles, but can initiate one operation per cycle. The least-significant 4 bits of the output are generated during the first cycle, followed by the next 4 bits, and so forth.
With appropriate changes to the function implemented by each cell, the multiplier can perform two's-complement as well as unsigned multiplication [14]. Also note that two additional 16-bit terms, C and D, could be added into the top row of cells without increasing the hardware required. This modification would create a 16-bit multiply-accumulate (MAC) unit that evaluates the expression

\[ Y = A \times B + C + D. \]
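A short check, added here for illustration, confirms that the MAC result A × B + C + D still fits in a 32-bit output when A, B, C, and D are unsigned 16-bit values:

```latex
\[
A \times B + C + D \;\le\; (2^{16}-1)^2 + 2\,(2^{16}-1)
  \;=\; (2^{16}-1)(2^{16}+1) \;=\; 2^{32}-1 .
\]
```

The same argument at the cell level gives 15 × 15 + 15 + 15 = 255 = 2^8 − 1, so the 8-bit output of a multiply-add cell never overflows.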
Addition
Most DSP algorithms require addition as well as multiplication. In many cases, an addition may be combined with another multiplication and implemented with the MAC unit described previously. For example, the finite impulse-response (FIR) filter equation is amenable to this simplification:

\[ y[n] = \sum_{k=0}^{N-1} h[k] \, x[n-k], \]

where x is the input sequence, h holds the N filter coefficients, and y is the output sequence.
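To illustrate how an FIR filter reduces to repeated multiply-accumulate steps, the following sketch evaluates the standard FIR sum with one MAC per tap. The names `mac` and `fir` and the zero-padding of samples before the start of the input are assumptions for this example, not part of the architecture.

```python
def mac(a, b, c):
    """Multiply-accumulate: the operation a single MAC unit provides."""
    return a * b + c


def fir(h, x):
    """y[n] = sum_k h[k] * x[n - k], with samples before x[0] taken as zero."""
    y = []
    for n in range(len(x)):
        acc = 0
        for k, coeff in enumerate(h):
            if n - k >= 0:
                acc = mac(coeff, x[n - k], acc)   # one MAC per filter tap
        y.append(acc)
    return y
```

Feeding a unit impulse through the filter returns the coefficients themselves, which is a quick sanity check of the tap ordering.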
However, other algorithms require dedicated adders. The structure in Fig. 4 uses 4 cells to add two 16-bit numbers A and B. Each cell adds two 4-bit portions of these numbers, named a and b, along with carry in c:

\[ y = a + b + c. \]
A similar structure can be used to perform two's-complement addition or subtraction. In general, adding or subtracting two n-bit numbers requires (n/4) cells (again assuming n is a multiple of 4).
The adder is pipelined for maximum performance, and achieves a latency of 4 clock cycles. Note that the inputs should arrive in a staggered fashion: the least-significant 4 bits in the first cycle, the next 4 bits in the second cycle, and so forth. Many of the structures described here impose similar requirements.
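A minimal functional sketch of this adder, assuming each cell returns a 4-bit sum together with a carry out that feeds the next cell, is shown below. The names `add_cell` and `add16` are hypothetical, and the pipeline latches are not modeled; the loop order simply mirrors the staggered, least-significant-first arrival of the inputs.

```python
def add_cell(a, b, c):
    """One 4-bit adder cell: returns (4-bit sum, carry out) of a + b + c."""
    y = a + b + c                  # at most 15 + 15 + 1 = 31, a 5-bit result
    return y & 0xF, y >> 4


def add16(A, B):
    """Add two unsigned 16-bit numbers with a chain of four cells."""
    carry, total = 0, 0
    for k in range(4):             # least-significant nibble first
        s, carry = add_cell((A >> (4 * k)) & 0xF, (B >> (4 * k)) & 0xF, carry)
        total |= s << (4 * k)
    return total, carry            # 16-bit sum and final carry out
```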
Memory operations
To implement DSP algorithms, the reconfigurable device must provide a way to store intermediate results. For example, the fast Fourier transform (FFT) requires a working buffer approximately the size of the input data. Most adaptations of the algorithm also use a lookup table of multiplication coefficients. This example indicates that cells should be able to perform memory operations as well as basic arithmetic functions.
The reconfigurable architecture described here is unique in that each cell can implement a 64 × 8-bit random-access memory. The inputs and outputs of the cell in this configuration include a 6-bit address a, 8-bit input data i, and 8-bit output data q. Depending on the read enable re and write enable we, the cell can perform the operations listed in Table 1. The upper two bits of a are grouped with re and we into a 4-bit line.
Passing the input data to the output data on a no-op simplifies the design of larger memory units. For example, consider the 256 × 24-bit memory in Fig. 5. The 12 cells marked "M" store the actual data, whereas the 4 cells marked "D" decode the 8-bit address A and control signals. The entire memory unit operates in a pipelined fashion with a latency of 7 clock cycles but a throughput of one operation per clock cycle. As access requests travel through the pipeline, each "D" cell determines whether A falls within the address range of the corresponding row of "M" cells. If so, the re and/or we inputs of these cells are asserted. If not, a no-op occurs and the "M" cells pass the data downwards to the next row.
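The pass-through scheme can be sketched functionally as follows. This is a simplified model under stated assumptions: Table 1 is not reproduced here, so the behavior of a write (the cell forwards its input data) is assumed, the class names are hypothetical, and the 7-cycle pipeline is not modeled. Each of the 4 rows holds 64 addresses, and each of the 3 columns holds one byte of the 24-bit word.

```python
class MemoryCell:
    """One cell in memory mode: 64 words of 8 bits."""

    def __init__(self):
        self.words = [0] * 64

    def access(self, a, i, re=False, we=False):
        if we:
            self.words[a] = i & 0xFF
        if re:
            return self.words[a]
        return i                   # no-op (and, by assumption, write): pass data through


class Memory256x24:
    """Twelve "M" cells in 4 rows x 3 columns; row selection stands in for the "D" cells."""

    def __init__(self):
        self.rows = [[MemoryCell() for _ in range(3)] for _ in range(4)]

    def access(self, A, data=0, re=False, we=False):
        lanes = [(data >> (8 * c)) & 0xFF for c in range(3)]
        for r, row in enumerate(self.rows):
            hit = (A >> 6) == r    # upper two address bits select one row
            lanes = [m.access(A & 0x3F, lanes[c], re and hit, we and hit)
                     for c, m in enumerate(row)]
        return lanes[0] | (lanes[1] << 8) | (lanes[2] << 16)
```

On a read, the selected row replaces the data travelling down the columns with the stored word, and every later row forwards it unchanged to the output.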
Logic and control operations
DSP operations are not composed exclusively of mathematical functions, but also require a certain amount of control logic for proper operation. This control logic may include AND-OR expressions, decoders, multiplexers, and simple state machines. For example, the FFT requires a mechanism to load data into the computational stage at the proper time.
Implementing control logic on a reconfigurable device requires good fine-grain flexibility. An architecture that places a fixed number of functional units in each cell may not be able to evaluate arbitrary logic expressions efficiently. However, an architecture that places only fine-grain functionality in each cell cannot execute complex arithmetic operations efficiently. The next section of this paper proposes a solution to this dilemma.

For simplicity, all busses are unidirectional. The interconnection fabric also includes reprogrammable switches to control the data transfer. Local switches route data within individual modules, such as multipliers and adders. Global switches connect the inputs and outputs of modules together across a top-level network of busses. Further details about the interconnection scheme appear in [11].
Description of design
As shown in Fig. 7, each cell contains four main components. The processing core implements the 4-bit operations required for DSP. The two switches connect the inputs and outputs of the processing core to the interconnection network. Latches between the switches and the processing core pipeline the execution cycle. Finally, the control unit generates control signals for the processing core and manages the reconfiguration process. The following subsections discuss the organization of the cell in more detail.
Clock approach
Fig. 8 summarizes the clock approach used for the cell. During the first phase of the clock, the cell precharges the processing core and enables the two switches. The values in the output latches flow through the output switch onto the interconnection network. At the destination cell, the values pass through the input switch and are stored in the input latches. During the second phase of the clock, the cell precharges the two switches and enables the processing core. The processing core evaluates the desired operation, and the results are placed in the output latches.
Besides isolating the two phases of the clock, the latches in the cell allow DSP algorithms to execute in a pipelined or even superpipelined fashion. Without pipelining, the system clock rate would depend on the word length and operation type of each module in the device. With pipelining, the propagation delay through one cell is the only restriction on the clock rate. In the architecture, each outgoing data line can be configured to go through an extra set of pipeline latches if required by the module.
Input and output switches
The input and output switches allow the processing core to transfer data to and from the interconnection network in 4-bit units. For maximum flexibility, the switches are organized as the 8 × 8 crossbars depicted in Fig. 9. Each link between 4-bit busses is controlled by a 1-bit latch, so the switches each contain 64 bits of memory. The inputs and outputs of the processing core can connect to the interconnection network in any direction. Since the processing core has fewer outputs than inputs, the architecture allows cells to pass certain inputs through the pipeline latches and back to the interconnection network. This feature is useful for several of the modules presented in Section 2.
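The crossbar organization can be pictured with a small behavioral model: 64 one-bit configuration latches decide which of the 8 input busses drives each of the 8 output busses. The class below is a sketch only; its name, its methods, and the rule that each output bus has at most one driver are assumptions rather than details of the actual switch.

```python
class Crossbar8x8:
    """Behavioral model of an 8 x 8 crossbar of 4-bit busses."""

    def __init__(self):
        self.link = [[0] * 8 for _ in range(8)]   # 64 configuration bits

    def connect(self, src, dst):
        for s in range(8):
            self.link[s][dst] = 0  # assumption: at most one driver per output bus
        self.link[src][dst] = 1

    def route(self, inputs):
        outputs = [None] * 8       # None marks an undriven output bus
        for dst in range(8):
            for src in range(8):
                if self.link[src][dst]:
                    outputs[dst] = inputs[src] & 0xF
        return outputs
```

One input may fan out to several outputs, which is how a cell could forward an input back to the network while also using it locally.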
Processing core
The wide range of operations used in DSP places great demands on the capability of a cell. According to the analysis in Section 2, cells must be able to implement multiplication, addition, memory operations, and simple combinational and sequential logic. Assigning each operation to a separate functional unit would inevitably sacrifice performance or flexibility.
The processing core uses a novel design that consists of a 4 × 4 matrix of reconfigurable elements. Each element, in turn, is a 16 × 2-bit random-access memory. The matrix of elements can be configured into two structures: one optimized for memory operations, and the other optimized for arithmetic and logic functions. Both structures execute one operation per clock cycle.
In memory mode, shown in Fig. 10, the matrix of elements implements a 64 × 8-bit random-access memory. Inputs a, b, c, and d specify a 4-bit address for each row of elements. Typically, these inputs are all wired together; however, using separate inputs allows the structure to act as a four-way multiplexer. The re and we signals enable one row of elements for reading or writing. These signals are generated by the control unit from the upper two bits of the input address, a[5:4]. Lines i and q are the input and output data, respectively.
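A minimal sketch of this organization, assuming that the four elements in a row each hold two adjacent bits of the 8-bit word, shows how sixteen 16 × 2-bit elements form a 64 × 8-bit memory. The class names and the bit assignment within a row are hypothetical, and the four row address inputs are assumed to be wired together.

```python
class Element:
    """A 16 x 2-bit random-access memory."""

    def __init__(self):
        self.bits = [0] * 16

    def read(self, addr):
        return self.bits[addr]

    def write(self, addr, value):
        self.bits[addr] = value & 0b11


class CellMemory:
    """4 x 4 matrix of elements addressed as a 64 x 8-bit memory."""

    def __init__(self):
        self.matrix = [[Element() for _ in range(4)] for _ in range(4)]

    def write(self, a, i):
        row = self.matrix[a >> 4]  # upper two address bits enable one row
        for k, element in enumerate(row):
            element.write(a & 0xF, (i >> (2 * k)) & 0b11)

    def read(self, a):
        row = self.matrix[a >> 4]
        return sum(element.read(a & 0xF) << (2 * k)
                   for k, element in enumerate(row))
```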
In mathematics mode, shown in Fig. 11, the matrix of elements assumes a structure resembling a parallel multiplier. However, the elements can perform many functions besides multiplication. The 16 × 2-bit memory in each element now implements a lookup table for the desired function. Possible functions include signed and unsigned addition, subtraction, and multiplication, as well as rounding, shifting, and logic expressions.
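To make the lookup-table idea concrete, the sketch below programs every element with the 16-entry table for the 1-bit function a·b + c + d, whose result (0 to 3) fits in the element's 2-bit output, and then combines sixteen such elements into a 4-bit multiply-add. The wiring between elements is an assumption chosen for illustration and may differ from Fig. 11; only the use of a 16 × 2-bit table per element comes from the text.

```python
# Table contents for one element: address bits (a, b, c, d) -> 2-bit value a*b + c + d.
LUT = [((addr >> 3) & 1) * ((addr >> 2) & 1) + ((addr >> 1) & 1) + (addr & 1)
       for addr in range(16)]


def element(a, b, c, d):
    """Look up the 2-bit result; returns (low bit, high bit)."""
    w = LUT[(a << 3) | (b << 2) | (c << 1) | d]
    return w & 1, w >> 1


def bit(x, k):
    return (x >> k) & 1


def cell_multiply_add(a, b, c=0, d=0):
    """4-bit a*b + c + d computed by a 4 x 4 matrix of lookup-table elements."""
    acc = [bit(c, k) for k in range(4)] + [0] * 4   # c enters the first row
    for j in range(4):
        carry = bit(d, j)                            # d enters the first column
        for i in range(4):
            acc[i + j], carry = element(bit(a, i), bit(b, j), acc[i + j], carry)
        acc[j + 4] = carry
    return sum(v << k for k, v in enumerate(acc))
```

Loading a different table into the same elements would change the function without changing the structure, which is the flexibility the text attributes to mathematics mode.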
Configuration
Before using the reconfigurable device to perform DSP, each cell must be configured to implement the desired operation. The process begins when the target system asserts the Program pin of the device. In response, all switches inside the architecture revert to a default setting so that the interconnection fabric assumes the structure in Fig. 12. The control unit inside each cell also places the processing core into memory mode so that new information can be written into the matrix of elements. The system then loads 8-bit configuration commands into the global interconnection lines, one command per clock cycle. The commands propagate into the cells along the 4-bit data busses. The hierarchical nature of the interconnection fabric allows multiple cells to be programmed simultaneously. Further information about the reconfiguration process appears in [15].
In the worst case, the system takes 64 cycles to completely reprogram the 64 × 8-bit memory in the processing core, plus around 2 additional cycles to specify the mode and configure the pipeline latches. The two switches inside the cell require 8 cycles each to overwrite the 64-bit memory (assuming no optimization). Assume that local and global switches require 8 cycles and 16 cycles to reconfigure, respectively. A robust device with a 32 × 32 array of cells contains 1024 cells, 1984 local switches, and 511 global switches. The total number of reconfiguration cycles is estimated as 1024 × (64 + 2 + 8 + 8) + 1984 × 8 + 511 × 16 = 108,016, but this result drops to 27,004 cycles if four configuration commands can be issued at once. In addition, the reconfiguration process can exploit the symmetry inherent to most modules to reduce the cycle count further.
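The estimate can be reproduced with a few lines of arithmetic. The variable names below are introduced only for this check; the figures are those quoted in the paragraph above.

```python
CELLS = 32 * 32                    # 1024 cells in a 32 x 32 array
LOCAL_SWITCHES = 1984
GLOBAL_SWITCHES = 511

# Per cell: 64 cycles for the core memory, about 2 for mode and pipeline
# latches, and 8 for each of the two switches.
cycles_per_cell = 64 + 2 + 8 + 8

total_cycles = (CELLS * cycles_per_cell
                + LOCAL_SWITCHES * 8
                + GLOBAL_SWITCHES * 16)
parallel_cycles = total_cycles // 4   # four configuration commands per cycle

print(total_cycles, parallel_cycles)  # prints "108016 27004"
```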
To conclude this section, Table 2 lists some of the operations possible with suitable configurations of the cell. The design combines the flexibility of a fine-grain architecture with the processing capability of a coarse-grain architecture.
VLSI implementation
A notable feature of the two-level reconfigurable architecture is the absence of functional units such as adders. The entire design consists of a hierarchy of memory units with some simple glue logic. This strategy leads to a simple and compact VLSI implementation that achieves very high performance. For applications where reliability is also critical, error detection and correction circuitry may be added to the memory units, as described in [16]. The following discussion focuses on the implementation and performance characteristics of the design.

Table 2 (partial).
Lookup table | Read-only memory
State machine | Read-only memory with pipelined feedback

Circuit design
Fig. 13 depicts the organization of one element in the processing core. Recall that each element behaves as a 16 × 2-bit memory. This memory is arranged into a 4 × 4 array of 2-bit latches, together with additional glue logic. In memory mode, the element has a 4-bit address a, 2-bit input data i, and 2-bit output data q. In mathematics mode, the four address bits are pre-empted by inputs a, b, c, and d. The lower 2 bits control a row decoder, and the upper 2 bits control a column decoder.
The element uses a special data format to achieve high performance. As shown in Fig. 14, each bit x is represented by two signals, x_H and x_L. During the precharge phase of the clock cycle, both signals are precharged to V_DD, indicating a NULL condition. Discharging x_H specifies logic 0; discharging x_L specifies logic 1. Thus, the components in the element do not require a separate clock signal since the data itself contains all necessary timing information. This data format is especially suited to the design of the latch, illustrated in Fig. 15. Each 2-bit latch contains two static random-access memory (SRAM) cells. The circuit provides separate paths for memory mode and mathematics mode to boost performance.
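The two-signal data format can be summarized in a short behavioral sketch, with each rail modeled as 1 when charged to V_DD and 0 when discharged. The function names and the treatment of the both-discharged case as an error are assumptions for illustration.

```python
NULL = (1, 1)                      # both rails (x_H, x_L) precharged to V_DD


def encode(bit):
    """Return (x_H, x_L): discharge x_H for logic 0, x_L for logic 1."""
    return (0, 1) if bit == 0 else (1, 0)


def decode(rails):
    """Return 0 or 1 once a rail has discharged, or None while still NULL."""
    if rails == NULL:
        return None                # no data yet: still in the precharged state
    if rails == (0, 1):
        return 0
    if rails == (1, 0):
        return 1
    raise ValueError("both rails discharged: not a valid code")
```

Because a valid value appears only when exactly one rail discharges, a downstream stage can tell from the data alone when its input has arrived.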
The other components in the element are very simple. The column selector contains n-transistors that connect one column of data lines to the main data lines of the element. The precharger units use p-transistors to charge the internal lines to V_DD. The column decoder consists of CMOS NAND and NOR gates to drive the column selector and precharger. The row decoder has a similar structure. Finally, the interface module controls the read and write operations in memory mode.
For a mathematics operation, all address inputs are initially charged to V_DD, causing the decoders to turn on the precharge transistors and disable the latches. Then, inputs a and b are broadcast to all elements simultaneously. When these inputs evaluate, the row decoder turns off the column precharge transistors and enables the appropriate row of latches. The data in the latches begins to propagate to the w outputs. When the previous elements cause c and d to evaluate, the column decoder turns off the precharge transistors on the output and enables one column. The w outputs then evaluate, affecting the c and d inputs of the next elements. This process creates a domino effect that allows the matrix of elements to perform mathematical operations rapidly.

Fig. 13. Organization of reconfigurable element.
A read operation in memory mode operates in a similar fashion, except that each element receives all 4 bits of the input address a at the same time. The latch at the selected row and column discharges one of the main data lines to ground, depending on the stored value. The interface module uses these lines to set the read data q.
To perform a write operation, which can only occur in memory mode, the element first executes a read operation at the input address, storing the resulting value if necessary. After the main data signal evaluates, the element drives the value of the i input onto the same lines. The n-type pass transistors in the datapath then run in reverse, storing the new data in the selected latch.
Performance
The operation of a cell that determines the maximum clock frequency is a read operation in mathematics mode. Observe in Figs. 10 and 11 that the critical path in the matrix of elements involves one element in memory mode, but seven elements in mathematics mode. The circuitry that performs this critical operation has been optimized for speed. For example, the circuitry that decodes the c and d inputs features DOMINO logic.
Simulations of the processing core in a modest 0.25-µm technology show that it can operate at a clock frequency of 200 MHz. Fig. 16 shows the results of the worst-case simulation in mathematics mode. When Clock is high, the processing core precharges all internal data lines to 2.5 V and allows the new data to propagate to the inputs. When Clock falls low, the evaluation phase begins. The calculated result is zero in this example, so y_0H through y_7H all fall to ground. Bit 0 evaluates first, followed by bits 1, 2, 3, and so on.
An initial prototype of the processing core has also been fabricated by MOSIS in 0.5-µm technology. The prototype has been verified for functionality in both modes of operation.
Comparison to other implementations
Simulations have demonstrated that the processing core can operate at 200 MHz in 0.25-µm technology. Although more work remains before the interconnection fabric can be simulated, the switches are designed to support data transfer at the same frequency. The overall clock rate compares favorably to commodity FPGAs today. The Xilinx Spartan II-E family, for example, can achieve speeds of 200 MHz [17]. Current digital signal processors fall in this range as well. Furthermore, using a more robust technology would substantially increase the clock rate, perhaps to 500 MHz or even higher.
One key difference between the two-level architecture and FPGA platforms is that all DSP algorithms operate at maximum frequency due to the pipelined organization. With FPGAs, the achievable clock rate depends on the complexity of each module as well as the sophistication of the routing tools. Another difference is the approach used to implement multiplication. Many FPGAs now contain embedded multipliers for DSP applications. For instance, the Xilinx Virtex-II architecture offers up to 168 18-bit multipliers that can operate up to 300 MHz [18, 19]. The two-level architecture, in contrast, has a homogeneous structure that can integrate multipliers with all other modules and maintain the same clock rate. Designers can also tailor the number of multipliers and the word length to meet the needs of the application. For example, a 64 × 64 array of cells could hold 256 16-bit multipliers or 64 32-bit multipliers.
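These multiplier counts follow from the (n/4)^2 cell requirement given in Section 2, under the assumption (made here for illustration) that every cell in the array is devoted to multipliers:

```latex
\[
\frac{64 \times 64}{(16/4)^2} = \frac{4096}{16} = 256,
\qquad
\frac{64 \times 64}{(32/4)^2} = \frac{4096}{64} = 64 .
\]
```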
The reconfiguration time required by the two-level architecture is comparable to FPGAs. The most basic Virtex-II device has 338,976 bits of configuration memory that can be programmed at 50 MHz in 8-bit units. The reconfiguration time for this device is thus 847 µs. As shown in Section 3, using an 8-bit interface to completely reprogram a 32 × 32 array of cells would require 108,016 cycles, or 540 µs at a frequency of 200 MHz.
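Both times follow directly from the quoted figures; the worked arithmetic below uses symbol names introduced only for this check:

```latex
\[
t_{\text{Virtex-II}} = \frac{338{,}976 / 8}{50 \times 10^{6}\ \text{s}^{-1}}
  = \frac{42{,}372}{50 \times 10^{6}\ \text{s}^{-1}} \approx 847\ \mu\text{s},
\qquad
t_{\text{two-level}} = \frac{108{,}016}{200 \times 10^{6}\ \text{s}^{-1}} \approx 540\ \mu\text{s}.
\]
```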
Conclusion
This paper has presented a novel reconfigurable architecture for digital signal processing. The architecture features a two-level hierarchy of programmable cells and elements. Each cell contains a 4 × 4 matrix of elements that allow the cell to perform a wide variety of operations. The matrix of elements has two possible configurations optimized for memory operations and arithmetic functions, respectively. Cells can then be interconnected to implement DSP algorithms. Adding pipeline latches between cells dramatically improves performance. A prototype of the architecture has been fabricated in 0.5-µm technology. Simulations in 0.25-µm technology indicate that the architecture can operate at 200 MHz even with this modest technology.
The reconfigurable architecture has many advantages over other DSP implementations. Unlike commodity processors, system designers can customize the word length, amount of parallelism, and interconnection structure to meet the demands of the application. Unlike an ASIC, the device can be reprogrammed any number of times as the needs of the application change. The performance of the device may not approach that of an ASIC, but the reduced development costs alone make the reconfigurable architecture a viable candidate for many applications.
A primary focus of further research will be to compare the performance and flexibility of the two-level architecture to other implementations, including digital signal processors and FPGAs. This analysis will encompass work on both the hardware and software level. On the hardware level, additional prototype chips will be fabricated and tested to evaluate the performance of a small array of cells and switches. On the software level, computer-aided design (CAD) tools will be developed to automate the placement and routing of DSP algorithms. Simulations of DSP algorithms will also be conducted.
