Abstract: Reconfigurable hardware offers a number of advantages over custom integrated circuits, including low development cost, high flexibility, and high adaptability to changing requirements. However, this alternative does incur some reduction in performance, especially for computationally intensive tasks such as digital signal processing. Recent [...]

[...] [1]. Application-specific integrated circuits (ASICs) achieve very high performance, but incur high development costs and only support one operation. General-purpose microprocessors can execute any software program, but offer relatively poor performance for DSP. Digital signal processors achieve better results, but still do not approach the speed of an ASIC. Finally, reconfigurable hardware attempts to find some middle ground between the performance of an ASIC and the flexibility of a microprocessor. This approach has recently become practical for DSP, due to the exponentially increasing capabilities of VLSI.

[...] embedded FPGA [5]. The MONTIUM architecture integrates a microprocessor with an FPGA and an array of coarse-grain cells, each containing several 16-bit ALUs [6]. These coarse-grain architectures achieve higher performance, but [...]

In this paper, we introduce a novel reconfigurable architecture that employs pipelining at the bit level. This architecture includes a number of features to address the issues mentioned above. As described in Section II, an array of medium-grain cells accelerates DSP computations without sacrificing flexibility. A hierarchical interconnection network, presented in Section III, equalizes the communication delay between modules on the device. Section IV presents the circuit design of the cell and gives some initial simulations that demonstrate its performance. Finally, Section V summarizes the paper.
[...] [7], [8]. Each [...]

The flexibility of the lookup tables allows cells to work with both unsigned and two's-complement data. Cells can also compute addition (or subtraction) by configuring the lookup tables to assume that b is 1 (or -1). Other configurations allow cells to perform bit shifting and combinational logic functions.

Fig. 2 gives a conceptual diagram of memory mode. Collectively, the matrix of elements implements a 128 x 4-bit random-access memory with separate read and write ports. Each pair of elements contributes a 16 x 4-bit portion to the total. For the read port, the cell passes the read address ra3:0 to each element, and uses the read enable re3:0 to select one pair of elements. (Since there are only eight choices, bit re3 functions as a global read enable.) The selected elements output data onto do3:0. The write port operates in a similar manner, except that the value of di3:0 is written to the selected elements.

In memory mode, the matrix of elements retains the same pipelining scheme as mathematics mode. In other words, elements perform the read and write operations in the order shown in Fig. 1. This design allows some of the internal lines to be reused, simplifying the datapath.

[...] to connect the inputs and outputs of the elements with the local network, as well as the global network described next. This capability is useful for creating larger multipliers [10].

The global network resembles an H-tree, as shown in Fig. 4. Each level of the tree contains four input busses and four output busses. The number of bits per bus starts at 4 bits and doubles at every level. (To save area, one could limit the bandwidth to a predetermined value.) Like the local mesh, each level of the tree contains a pipeline register. Simulations demonstrate that the latency of the global network does not limit the clock rate of the system.

This interconnection structure has several advantages. First, the switches in the upper layers of the H-tree can manipulate data in larger units than 4 bits, resulting in substantially lower configuration overhead. In addition, the outputs of a module can be collected onto a single bus and routed to the inputs of another module. All portions of the data incur the same latency over the H-tree. Finally, the local network is completely separate from the global network, simplifying the mapping process.

IV. CIRCUIT DESIGN

The most critical portion of the superpipelined reconfigurable architecture is the basic element. Even though this component is essentially a lookup table, it must perform read and write operations at the maximum clock speed. This section summarizes the circuit design of the element and gives some initial simulations that validate its performance.

Fig. 5 illustrates the datapath of one element in the cell. At the center of the diagram is a familiar SRAM latch. Elements contain 32 of these latches, organized into a 4 x 4 x 2-bit array. Observe that the latch has dedicated input and output sides. All data lines are differential to improve performance.

For a read operation, the element decodes the read address to generate the read row signal rr and the read column signal rc. The row depends on the lower two bits, and the column on the upper two bits. The signals drive a series of tri-state buffers that connect one latch to the output. These buffers improve the speed and isolate the data lines from undesired feedback. In mathematics mode, the output is passed on to the next elements in the subsequent pipeline stage. In memory mode, the output runs through some additional logic that controls which elements specify the final output.

[...] Fig. 6, a differential clock signal drives four minimum-size n-transistors for each pair of differential data lines. A combination of cross-coupled p-transistors and clocked n-transistors provides feedback to improve noise margins. To reduce the clock load further, the cell only uses one pipeline latch (rather than a master-slave combination) between successive elements, relying on propagation delays to separate adjacent cycles.

The pipelining scheme also allows the cell to optimize computations. Recall from Section II that the matrix of elements is divided into seven pipeline stages. However, the cell adds one half-cycle to the beginning to perform buffering, and one half-cycle to the end to finalize computations. The latency through the matrix of elements is then eight clock cycles. Fig. 7 depicts the execution of two successive operations.

To analyze the performance of the new architecture, we have implemented the cell in 180-nm technology. Fig. 8 contains an initial circuit simulation of one element. The clock signal ck is driven at 1.5 GHz. The solid lines in the simulation depict the data output of the element as the read address
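
The eight-cycle latency quoted in Section IV follows from simple arithmetic: seven pipeline stages plus one half-cycle at each end. The sketch below reproduces that count and converts it to time. It assumes one clock cycle per stage and that the whole cell runs at the 1.5 GHz clock used in the single-element simulation. Neither assumption is stated in the text, and the function names are hypothetical.

```python
from fractions import Fraction


def matrix_latency_cycles(stages=7,
                          lead_in=Fraction(1, 2),
                          lead_out=Fraction(1, 2)):
    """Latency in clock cycles through the matrix of elements.

    Assumes one clock cycle per pipeline stage, plus the buffering
    half-cycle at the start and the finalizing half-cycle at the end.
    """
    return stages + lead_in + lead_out


def latency_ns(cycles, clock_hz):
    """Convert a cycle count to nanoseconds at the given clock rate."""
    return float(cycles) / clock_hz * 1e9


if __name__ == "__main__":
    cycles = matrix_latency_cycles()
    # 1.5 GHz is the clock rate of the single-element simulation;
    # using it for the full cell is an assumption of this sketch.
    print(f"{cycles} cycles, about {latency_ns(cycles, 1.5e9):.2f} ns")
```

Because the cell is pipelined, this figure is the latency of one operation, not the interval between successive operations.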

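
The memory-mode organization described in Section II, together with the row and column decoding described in Section IV, can be illustrated with a small behavioral model. This is only a sketch under stated assumptions, not the authors' circuit:

- Within a pair, one element is assumed to hold the low two data bits and the other the high two.
- The lower three enable bits are assumed to pick one of the eight pairs.
- The write-side names `we` and `wa` are hypothetical. The text names only di3:0 for the write port.
- Pipelining and differential signalling are not modelled.

```python
class Element:
    """One element: 32 latches viewed as a 4 x 4 array of 2-bit words."""

    def __init__(self):
        # latches[row][col] holds a 2-bit value
        self.latches = [[0] * 4 for _ in range(4)]

    @staticmethod
    def decode(addr):
        """Split a 4-bit address into (row, column) select indices."""
        row = addr & 0b11          # row depends on the lower two bits
        col = (addr >> 2) & 0b11   # column depends on the upper two bits
        return row, col

    def read(self, addr):
        row, col = self.decode(addr)
        return self.latches[row][col]

    def write(self, addr, value):
        row, col = self.decode(addr)
        self.latches[row][col] = value & 0b11


class MemoryModeCell:
    """Eight element pairs, each 16 x 4 bits, forming a 128 x 4-bit RAM."""

    def __init__(self):
        self.pairs = [(Element(), Element()) for _ in range(8)]

    def read(self, re, ra):
        """Return do3:0, or None when the global enable bit re3 is clear."""
        if not (re >> 3) & 1:
            return None
        low, high = self.pairs[re & 0b111]   # assumed pair selection
        return (high.read(ra) << 2) | low.read(ra)

    def write(self, we, wa, di):
        """Write di3:0 to the selected pair (we, wa are hypothetical names)."""
        if not (we >> 3) & 1:
            return
        low, high = self.pairs[we & 0b111]
        low.write(wa, di & 0b11)             # assumed bit split across the pair
        high.write(wa, (di >> 2) & 0b11)


if __name__ == "__main__":
    cell = MemoryModeCell()
    cell.write(0b1101, 0b1010, 0b0111)
    print(cell.read(0b1101, 0b1010))
```

The sizes agree with the text: 16 elements of 32 latches each give 512 bits, which equals 128 x 4.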