Abstract: This paper introduces a novel bit-serial cell for reconfigurable hardware used to perform digital signal processing. The cell contains an array of 4-bit lookup tables, or "elements", that can operate in two modes. In memory mode, the elements behave as a random-access memory. In mathematics mode, the elements perform operations such as multiply-accumulate, addition, and shifting in bit-serial fashion. To calculate m-bit functions, the cell requires (m + 1) elements and (2m + 1) clock cycles. Layout simulations in 180-nm CMOS demonstrate that the serial clock frequency approaches 2 GHz. Compared to a parallel implementation with the same functionality, the cell has lower throughput but substantially smaller area.
I. INTRODUCTION
Reconfigurable hardware has become a well-accepted option for implementing complex processing stages without incurring the development costs of custom integrated circuits [1]. In particular, digital signal processing (DSP) greatly benefits from the performance acceleration over microprocessor-based implementations [2]. However, DSP is computationally intensive: many algorithms apply a number of multiplications and additions to a large data set. Taking advantage of this parallelism, researchers have proposed new coarse-grain reconfigurable architectures for DSP [3]. Such devices typically contain an array of cells that perform 16-bit or 32-bit operations. This approach contrasts with the 1-bit granularity of a field-programmable gate array (FPGA).
Designing reconfigurable hardware for DSP involves basic tradeoffs between performance, area, and flexibility. Coarse-grain cells can execute calculations quickly but offer limited functionality. Fine-grain cells can be combined into many different modules but require a complex interconnection structure. To solve this dilemma, we have proposed a two-level architecture in which each cell contains a number of smaller reconfigurable units, called "elements" [4]. The elements within a cell need only assume a few structures to implement all basic DSP operations, including multiplication, addition, control logic, and memory access. Hence, the design offers less overhead than fine-grain devices, while retaining much of the low-level flexibility.
In the original two-level architecture, the cell performed word-length operations using a small matrix of elements. However, computing binary arithmetic in a bit-serial fashion greatly reduces the circuit complexity, particularly for multiplication, which forms the bottleneck of most DSP algorithms. Bit-serial designs also require fewer data lines and hence fewer interconnection resources on reconfigurable hardware. Recognizing these benefits, researchers have proposed new reconfigurable architectures that perform bit-serial computations. Both fine-grain [5], [6] and coarse-grain [7] devices appear in the literature. The drawback of these designs is that the entire device runs off a single clock signal. Aside from the problem of clock distribution, the interconnection structures within the critical path limit the maximum clock frequency and hence the processing speed.
This paper introduces a novel bit-serial cell for reconfigurable DSP hardware. As described in Section II, the design uses the two-level scheme developed previously to minimize interconnection structures within the cell. Section III demonstrates that the VLSI implementation achieves high clock frequency and compares the simulated performance with the parallel architecture. Finally, Section IV concludes the paper.
II. DESIGN
The bit-serial cell performs n-bit operations using an array of n + 1 reconfigurable elements. Each element is simply a 16×2-bit random-access memory. The array of elements can assume only two structures: one optimized for bit-parallel memory access and the other optimized for bit-serial arithmetic. The remainder of this section describes the two modes of operation and demonstrates how the design can implement several common operations. For the purposes of this discussion, assume that n = 4.
A. Memory Mode
Besides storing intermediate data, memory mode is useful for specifying read-only lookup tables and implementing logic functions with up to five inputs. Additionally, all cells default to memory mode during reconfiguration so that the system can write new data into the elements.
B. Mathematics Mode
It is no coincidence that the parallel model of mathematics mode closely resembles a carry-save multiplier [8]. In fact, the structure can perform 4-bit multiply-accumulates (MAC) of the form
This function allows the reconfigurable device to implement 4n-bit multipliers with an n × n array of cells [9]. At first, it would seem that only four elements would be necessary to compute a 4-bit MAC. However, using an additional element allows modules to work with data in unsigned and two's-complement format. The extra bit on the inputs and outputs prevents ambiguity between the two number representations.
Fig. 3. Expanded design of cell in mathematics mode.
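The role of the extra bit can be illustrated with a short behavioral sketch. It is not taken from the paper's circuit: the helper names `extend` and `interpret` are hypothetical, and the sketch assumes that the cell treats every (n + 1)-bit pattern as a two's-complement number, with unsigned operands entering with a leading zero.

```python
def extend(value, n=4, signed=False):
    """Return the (n + 1)-bit pattern for an n-bit operand (hypothetical helper)."""
    if signed:
        # n-bit two's-complement range; masking to n + 1 bits sign-extends.
        if not -(1 << (n - 1)) <= value < (1 << (n - 1)):
            raise ValueError("value does not fit in n-bit two's complement")
        return value & ((1 << (n + 1)) - 1)
    # n-bit unsigned range; the extra (top) bit is always zero.
    if not 0 <= value < (1 << n):
        raise ValueError("value does not fit in n-bit unsigned")
    return value


def interpret(pattern, n=4):
    """Decode an (n + 1)-bit pattern as a two's-complement integer."""
    if pattern & (1 << n):            # top bit set: negative value
        return pattern - (1 << (n + 1))
    return pattern


# The raw 4-bit pattern 1010 is ambiguous: 10 unsigned or -6 signed.
unsigned_pattern = extend(10)               # 0b01010
signed_pattern = extend(-6, signed=True)    # 0b11010
```

With five bits the two readings of the raw pattern 1010 become distinct patterns, so a single arithmetic structure can serve both formats.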
The cell in mathematics mode can compute many other functions besides multiplication. Clearly, addition and subtraction are special cases of (1) 
III. IMPLEMENTATION
This section complements the discussion of the bit-serial cell by describing its transistor-level implementation. Essentially, the entire design consists of several lookup tables with some glue logic. This strategy leads to a simple and compact implementation that can be optimized for high performance and low power consumption. The following discussion focuses on the operation of each element in memory mode and mathematics mode, and then compares the simulated performance to the original parallel architecture.
A. Memory Mode
In memory mode, the elements act as a random-access memory. Each of the two banks is organized as a 4×4 array. The output data connects to an interface module controlled by the $op_{2:0}$ input. After allowing some time for the read operation to complete, the interface module copies the value read from bank 0 to the $q$ and $\overline{q}$ outputs. Next, the module drives the $i$ and $\overline{i}$ inputs back into the selected latch in bank 1. The n-transistors in the datapath now run in reverse, overwriting the value stored in $v_1$ and $\overline{v}_1$. The read and write operations are synchronized to the clock signals used in mathematics mode, described next.
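Section II notes that memory mode can implement logic functions with up to five inputs. One plausible way to obtain this from a 16×2-bit element is sketched below. The mapping is an assumption, not the paper's stated mechanism: four inputs form the address, and the fifth input selects one of the two stored bits. The names `Element`, `program_function`, and `evaluate` are hypothetical.

```python
class Element:
    """Behavioral model of a 16x2-bit memory: two banks of 16 one-bit cells."""

    def __init__(self):
        self.banks = [[0] * 16, [0] * 16]

    def write(self, address, bank, bit):
        self.banks[bank][address] = bit & 1

    def read(self, address):
        # Both banks are read at the same address, giving a 2-bit word.
        return self.banks[0][address], self.banks[1][address]


def program_function(element, func):
    """Store the truth table of a five-input Boolean function (assumed mapping)."""
    for address in range(16):
        x = [(address >> k) & 1 for k in range(4)]
        for x4 in (0, 1):
            element.write(address, x4, func(x[0], x[1], x[2], x[3], x4))


def evaluate(element, x0, x1, x2, x3, x4):
    """Look up the function: four inputs address the word, x4 picks the bit."""
    address = x0 | (x1 << 1) | (x2 << 2) | (x3 << 3)
    return element.read(address)[x4]


# Example: five-input odd parity.
parity_lut = Element()
program_function(parity_lut, lambda a, b, c, d, e: a ^ b ^ c ^ d ^ e)
```

The final selection between the two bits is a Shannon expansion on the fifth input; in hardware it would correspond to a 2:1 multiplexer on the element outputs.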
B. Mathematics Mode
To achieve maximum performance, the element uses a separate datapath for mathematics mode. During the initialization phase of mathematics mode, the current inputs and previous outputs are collected in parallel form. The reconfigurable device takes advantage of this fact by pipelining the datapath into n-bit portions. Each pipeline stage encompasses one operation of the cell, or one data transfer across the interconnection structure. This paper assumes that the overall clock rate depends only on the cell, as the interconnection structure can always transmit data in parallel. To control the pipeline stages, the device distributes a master clock signal clk to every cell. Each cell must divide the clock period into an initialization phase and 2n + 1 serial processing stages. Fig. 6 depicts the resulting control signals init and shift. Note that the initialization phase occurs at the end of the global clock cycle, so that each operation spans part of two cycles. This feature allows clk to be slowed down for testing.
The clocking scheme used in the proposed cell achieves higher performance than other bit-serial reconfigurable architectures. The frequency of the serial clock shift does not depend on the propagation delay through complex interconnection structures, but rather on the latency of the elements in mathematics mode. To increase performance further, the cell uses differential control signals that drive dual n-type and p-type transmission gates. As described in [10], the clock generator contains a dual ring oscillator with cross-coupled p-transistors on the outputs to ensure that the waveforms are symmetric. Generating the control signals locally avoids many of the clock distribution problems shared by other designs.
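The timing budget implied by this scheme can be checked with simple arithmetic. The sketch below uses the simulation figures reported in Section III-C (a serial clock of approximately 2 GHz and a 6-ns master clock period) with n = 4. Treating the remainder of the period as time available for initialization and margin is an inference, not a figure given in the paper, and the function name is hypothetical.

```python
def serial_time_ns(n, shift_ghz):
    """Time taken by the 2n + 1 serial processing stages, in nanoseconds."""
    return (2 * n + 1) / shift_ghz


n = 4                  # word length assumed throughout Section II
shift_ghz = 2.0        # approximate serial clock frequency (Section III-C)
clk_period_ns = 6.0    # master clock period (Section III-C)

serial_ns = serial_time_ns(n, shift_ghz)       # nine stages at 0.5 ns each
remaining_ns = clk_period_ns - serial_ns       # left for init and margin (inferred)
```

Under these numbers the nine serial stages occupy 4.5 ns, leaving about 1.5 ns of each master clock period for the initialization phase and any timing margin.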
C. Simulations
To evaluate the functionality and performance of the proposed implementation, we performed layout simulations in 180-nm technology with all parasitic capacitances. The results for mathematics mode appear in Fig. 7 . The cell has been configured to act as a 4-bit MAC unit. In the simulation, the elements perform the following calculation:
(01111 × 01010) + 01010 + 01010 = 010101010.
In decimal, this is (15 × 10) + 10 + 10 = 170. These values represent a worst-case situation, since the bit-serial output $y_i$ alternates between logic 0 and logic 1. As shown, the elements produce the correct output value. The serial clock shift runs at a frequency of approximately 2 GHz; the master clock clk has a period of 6 ns.
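The simulated calculation can be reproduced with a purely behavioral model. This sketch does not model the carry-save structure of the elements. It assumes a MAC of the form a × b + c + d, as suggested by the operands above, with hypothetical operand names, and it assumes that the output is shifted out least-significant bit first.

```python
def serial_mac(a, b, c, d, n=4):
    """Return the 2n + 1 output bits of a*b + c + d, LSB first (assumed order)."""
    width = 2 * n + 1
    y = (a * b + c + d) & ((1 << width) - 1)   # keep 2n + 1 result bits
    return [(y >> i) & 1 for i in range(width)]


# Operands from the simulation, written as (n + 1)-bit patterns.
bits = serial_mac(0b01111, 0b01010, 0b01010, 0b01010)

# Reassemble the word with the most-significant bit on the left.
word = "".join(str(bit) for bit in reversed(bits))

# Worst case: every output bit differs from its predecessor.
alternates = all(bits[i] != bits[i + 1] for i in range(len(bits) - 1))
```

The reassembled word matches the 9-bit result quoted above, and the bit stream toggles on every serial clock cycle, which is why the example stresses the output path.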
D. Performance Comparison
The proposed bit-serial cell offers a reasonable alternative to the parallel design we developed previously in [4] . Table I compares the performance, estimated area, and overhead of the two approaches. The original cell used a 4×4 matrix of elements to compute 4-bit functions. Like the bit-serial scheme, the elements could also implement a random-access memory. The cell could store 128 words in memory mode (although two words were accessed simultaneously). However, the system had to program the lookup table inside each element before running in mathematics mode. An optimized implementation of the parallel design in 180-nm technology ran with a period of 4 ns. Table I lists two alternatives of the bit-serial design. The first uses five elements to compute 4-bit functions, and the second uses nine elements to compute 8-bit functions. Both cells require fewer elements than the original design and thus save area. However, DSP algorithms that need large memory modules would need more cells to implement the same address space. The bit-serial designs also require fewer configuration bits in mathematics mode. Since the elements almost always specify the same lookup table, the system can write to every element simultaneously. The parallel design does offer a higher clock rate, making this alternative better for applications that cannot trade off performance. Notice, however, that a given die size could contain more bit-serial cells, offering designers more opportunities for parallelization.
IV. CONCLUSION
In this paper, we have introduced a novel bit-serial cell for coarse-grain reconfigurable architectures. The cell uses an array of (n + 1) elements to perform n-bit computations. The extra element allows the cell to operate with unsigned and two's-complement data formats. The cell can operate in only two modes: one optimized for memory access and the other for binary arithmetic. Together, the two modes allow the cell to perform the entire spectrum of operations required by DSP algorithms with low hardware complexity. We presented a transistor-level implementation of the cell as well as layout simulations that verify the performance of the scheme. The serial clock reaches a frequency of approximately 2 GHz in 180-nm technology.
