A global tree local X-net network (GTLX) is introduced to realize high-performance data transfer in a multiple-valued fine-grain reconfigurable VLSI (MVFG-RVLSI). A global pipelined tree network is utilized to realize high-performance long-distance bit-parallel data transfer. Moreover, a logic-in-memory architecture is employed to solve the data transfer bottleneck between a block data memory and a cell. A local X-net network is utilized to realize simple interconnections and compact switch blocks for eight-near neighborhood data transfer. Moreover, multiple-valued signaling is utilized to improve the utilization of the X-net network, where two binary data can be transferred from two adjacent cells to one common adjacent cell simultaneously at each "X" intersection. To evaluate the MVFG-RVLSI, a fast Fourier transform (FFT) operation is mapped onto a previous MVFG-RVLSI using only the X-net network and the MVFG-RVLSI using the GTLX. As a result, the computation time, the power consumption and the transistor count of the MVFG-RVLSI using the GTLX are reduced by 25%, 36% and 56%, respectively, in comparison with those of the MVFG-RVLSI using only the X-net network.
key words: multiple-valued reconfigurable VLSI, fine-grain reconfigurable VLSI, global tree local X-net network, logic-in-memory architecture
Introduction
Field-programmable gate arrays (FPGAs) are cost-effective for low- to mid-volume applications because functions and interconnections of logic resources can be directly programmed by end users [1], [2]. However, the overhead of area, delay and power consumption incurred to make FPGAs both general purpose and field-programmable often limits the integration of FPGAs into real-time processing systems [3], [4].
To solve these problems, novel fine-grain reconfigurable VLSIs (FG-RVLSIs) based on a bit-serial pipeline architecture have been proposed [5]-[7]. Fine-grain pipelining and high utilization of a cell make the performance and parallelism high, respectively. It has been demonstrated that the performance of the FG-RVLSI is 9 times higher than that of the conventional FPGA in typical applications [5]. Moreover, a multiple-valued fine-grain reconfigurable VLSI (MVFG-RVLSI) using an X-net network shown in Fig. 1 has been proposed to reduce the power consumption and area without increasing delay [6], [7]. Each cell is connected to four cross points, and each cross point is connected to the other three adjacent cells. Therefore, only four nMOS pass transistors and four configuration memories need to be provided at each input/output (I/O) of the cell for the eight-near neighborhood data transfer. Moreover, to improve the utilization of the X-net network, a multiple-valued data transfer scheme is proposed, where linear summation of current signals transferred between cells can be realized at each "X" intersection. However, many cells must be used for long-distance data transfer by the X-net network, which results in low speed, large power consumption and low utilization of the cells. To overcome these problems, this paper presents a global tree local X-net network (GTLX). The global tree network is employed for high-performance bit-parallel long-distance data transfer, and the local X-net network is utilized to realize simple eight-near neighborhood data transfer for high-performance bit-serial pipeline operations.
In practical applications such as a sum-of-absolute-differences operation (SAD) [7], the local X-net network is frequently used for inter-cell neighborhood data transfer, and the global tree network is occasionally used for long-distance data transfer between a cell and a data memory. Therefore, the global tree network is connected not to each cell, but to cell blocks composed of many cells. To realize highly parallel memory access, a logic-in-memory architecture is introduced, where data transfer between a local memory and the cell block can be done in each logic-in-memory element (LME). Moreover, to solve speed problems in comparison with a multiple bus and a crossbar network, pipelined switch nodes are utilized to improve data transfer throughput.
HSPICE simulation of the MVFG-RVLSI using the GTLX is done using a 65 nm CMOS design rule. An eight-bit data transfer from one cell to a cell one hundred cells away is realized in both the MVFG-RVLSI using the GTLX and the MVFG-RVLSI using only the X-net network. The computation time, the transistor count, the configuration memory count, the power consumption and the power-delay product of the MVFG-RVLSI using the GTLX are reduced by 62%, 88%, 91%, 92% and 97%, respectively, in comparison with those of the MVFG-RVLSI using only the X-net network. The FFT butterfly operation is also mapped onto the two MVFG-RVLSIs. The computation time, the transistor count, the configuration memory count, the power consumption and the power-delay product of the MVFG-RVLSI using the GTLX are reduced by 25%, 56%, 59%, 36% and 52%, respectively, in comparison with those of the MVFG-RVLSI using only the X-net network.
Copyright © 2014 The Institute of Electronics, Information and Communication Engineers
Review of the Multiple-Valued Fine-Grain Reconfigurable VLSI Using Only the X-Net Network
Figure 1 shows the MVFG-RVLSI using only the X-net network [6], [7]. Each cell is composed of a logic block and a switch block, and can be connected to its eight adjacent cells through one-bit switches. To transfer data from cell i to its right adjacent cell i+1, cell i transmits from its northeast corner and cell i+1 reads from its northwest corner. There are two methods to realize the linear summation of the binary input currents A and B. If A and B are transferred from a common "X" intersection, they are linearly summed at that intersection. If A and B are transferred from two different "X" intersections, they are linearly summed in the switch block. In a bit-serial operation, a start signal indicating the head of a one-word data is required to initialize the D flip-flops (DFFs) used for a state memory. Superposition of the binary input current C and the start signal in a single interconnection is introduced to implement compact switch blocks, where the logic values "0" and "1" represent C and the logic value "2" represents the start signal.
As shown in Fig. 2, the multiple-valued logic block consists of a current-to-voltage (I-V) converter, a current-source-sharing AND circuit (CSSAND), a current-source-sharing NOT circuit (CSSNOT), a start signal detector, a current-source-sharing binary logic module (CSSBLM), and a current replication circuit [7]. The multiple-valued current signals from the switch block are converted to multiple-valued voltage signals by the I-V converter, and then enter the CSSAND, the CSSNOT and the start signal detector. The CSSAND is used to generate a partial product for a multiplication. The CSSNOT is used to convert a subtrahend to a 2's complement number for a subtraction. In the CSSBLM, both a bit-serial addition and an arbitrary 2-variable binary function can be realized.
The CSSAND, the CSSNOT and the CSSBLM can each be used to realize a function and store its result.
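The bit-serial addition and the 2's complement subtraction described above can be illustrated with a small behavioral model. This is only a sketch under stated assumptions, not the circuit of Fig. 2: bits are assumed to arrive least significant bit first, the start signal is modeled as a separate boolean per cycle rather than as the logic value "2" on a shared wire, and the names `bit_serial_add`, `to_bits` and `from_bits` are hypothetical.

```python
def bit_serial_add(a_bits, b_bits, start_flags, subtract=False):
    """Add (or subtract) two LSB-first bit streams, one bit per cycle.

    start_flags[i] is True on the cycle that carries the first (least
    significant) bit of a word. It re-initializes the carry state, which
    plays the role of the start signal initializing the state DFF.
    """
    carry = 0
    out = []
    for a, b, start in zip(a_bits, b_bits, start_flags):
        if start:
            # Subtraction is computed as A + NOT(B) + 1, so the carry
            # starts at 1 for a subtraction and at 0 for an addition.
            carry = 1 if subtract else 0
        if subtract:
            b ^= 1  # invert the subtrahend bit (the NOT step)
        total = a + b + carry
        out.append(total & 1)   # sum bit leaves the cell this cycle
        carry = total >> 1      # carry is stored for the next cycle
    return out


def to_bits(value, width):
    """LSB-first list of `width` bits of a non-negative integer."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Inverse of to_bits for an LSB-first bit list."""
    return sum(bit << i for i, bit in enumerate(bits))
```

For example, streaming `to_bits(100, 8)` and `to_bits(27, 8)` with a start flag on the first cycle yields the bits of 127 for an addition and of 73 for a subtraction. Results wrap modulo 2^8 because no extra carry-out cycle is modeled.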
A multiple-valued data transfer scheme has been introduced to improve the utilization of the X-net network, where multiple-valued current signals are transferred between cells. Two binary data M and N from two adjacent cells can be transferred to one common adjacent cell at each "X" intersection (two-to-one data transfer) as shown in Fig. 3(b). M and N take the values (0, 1) and (0, 2), respectively, and P becomes a quaternary data (0, 1, 2, 3) which expresses two-bit information. On the other hand, summation of M and N can be realized at each "X" intersection as shown in Fig. 3(c). In this case, M and N both take the values (0, 1), and P becomes a ternary data (0, 1, 2). One-to-one quaternary data transfer, two-to-one binary data transfer and summation can all be realized at each "X" intersection in the multiple-valued data transfer scheme as shown in Fig. 3, which leads to high utilization of the X-net network.
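The two modes of the multiple-valued data transfer scheme can be summarized at the level of logic values. The following is a minimal sketch that treats currents as small integers; the function names are hypothetical, and the way the receiving cell separates M and N is an assumption made for illustration, since the text only states that P expresses two-bit information.

```python
def two_to_one(m_bit, n_bit):
    """Two-to-one transfer: superpose two binary data on one wire.

    M contributes a current of 0 or 1 unit and N a current of 0 or 2
    units, so the wired sum P is a quaternary level in {0, 1, 2, 3}.
    """
    return m_bit * 1 + n_bit * 2


def split_quaternary(p):
    """Recover the pair (M, N) from the quaternary level P."""
    if p not in (0, 1, 2, 3):
        raise ValueError("P must be a quaternary level in 0..3")
    return p & 1, p >> 1


def wired_sum(m_bit, n_bit):
    """Summation mode: both sources drive 0 or 1 unit, so P is in {0, 1, 2}."""
    return m_bit + n_bit
```

Because the weights 1 and 2 are distinct powers of two, every pair (M, N) maps to a different level of P, which is why the two-to-one transfer loses no information. In the summation mode the two sources are indistinguishable by design, and only their sum is delivered.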
The behavioral description is given by a control/data flow graph. In the direct allocation of the control/data flow graph shown in Fig. 4, each node in the control/data flow graph corresponds to a macro-block in the FG-RVLSI and each edge corresponds to a data transfer path between the macro-blocks, where the macro-block consists of multiple cells. The complexity of logical connections between the macro-blocks becomes almost the same as that of the control/data flow graph. The architecture for the localized data transfer can be effectively employed for reducing the complexity of interconnections and delay due to data transfer between cells [5]. However, long-distance data transfer is not efficient in the MVFG-RVLSI using only the X-net network. Figure 5 shows long-distance data transfer between the cells A and B. In Cell 1, two one-bit switches S1 and S2 are turned ON to pass data, which results in low speed and low utilization of the cell. Cell 2 is programmed as a D flip-flop (DFF) to amplify a voltage data signal and improve throughput, which results in low speed, large power consumption and low utilization of the cell.
As shown in Fig. 6, a block data memory is connected to edge cells in the MVFG-RVLSI using only the X-net network. In data access between the block data memory and a non-edge cell C, many cells are used for data relay, which results in low speed, large power consumption and low utilization of the cells.
Fig. 5 Long-distance data transfer in the multiple-valued reconfigurable VLSI using only the X-net network.
Fig. 6 Data access between a data memory and a cell in the multiple-valued reconfigurable VLSI using only the X-net network.
Design of the Multiple-Valued Fine-Grain Reconfigurable VLSI Using the Global Tree Local X-Net Network
The tree network, one kind of multistage network, is effectively employed for high-performance long-distance data transfer. As shown in Fig. 7, all processing elements (PEs) are configured as "leaves" of the tree network. Data can be transferred between the PEs through one or more switch nodes. Each switch node has three I/O ports; one connected to a parent switch node and the other two connected to child nodes (or PEs, at the bottom level) [8]. One port of the switch node at the top level is connected to a block data memory to access data between the block data memory and the PE array. The tree network can be utilized to realize both the long-distance inter-PE data transfer and the data access between the block data memory and the PE array, which leads to high utilization of the tree network. However, very long interconnections and many switch nodes are required for the data access between the block data memory and the PE array, which results in low speed and large power consumption. Moreover, only one data item can be accessed at a time and many other data cannot be accessed in parallel, which causes low utilization of the PEs. The data transfer bottleneck can be greatly reduced by using a logic-in-memory architecture shown in Fig. 8, because it can make the interconnection length and the switch node count between a local memory (LM) and the PEs very short and small, respectively [9]. In the LME, an LM and eight PEs communicate with each other by a three-level subtree. The switch node at the third level has four I/O ports; one connected to the LM and the rest connected to other switch nodes.
Fig. 7 Conventional architecture using the tree network.
Fig. 8 Logic-in-memory architecture using the tree network.
Fig. 9 Control/data flow graph for a sum-of-absolute-differences operation.
Fig. 10 Allocation of the absolute difference operation and addition onto the multiple-valued fine-grain reconfigurable VLSI using the X-net network.
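To make the cost of a transfer through the tree network concrete, the number of switch nodes on a path can be counted for an idealized binary tree such as the one in Fig. 7. This is a sketch under assumptions that are not taken from the paper: the PEs are numbered 0, 1, 2, ... from left to right, the PE count is a power of two, and the function names are hypothetical.

```python
def switch_nodes_between(pe_a, pe_b):
    """Switch nodes traversed when sending data from PE pe_a to PE pe_b.

    The path climbs to the lowest switch node that has both PEs below it
    and then descends. If that node sits at level h (level 1 is directly
    above the PEs), the path visits h nodes going up and h - 1 going down.
    """
    if pe_a == pe_b:
        return 0
    level = (pe_a ^ pe_b).bit_length()  # level of the lowest common switch node
    return 2 * level - 1


def switch_nodes_to_top(num_pes):
    """Switch nodes on the path between the top of the tree and any PE."""
    if num_pes < 2 or num_pes & (num_pes - 1):
        raise ValueError("num_pes must be a power of two and at least 2")
    return num_pes.bit_length() - 1
```

With eight PEs, `switch_nodes_to_top(8)` gives three levels, which matches the three-level subtree of the LME, and the farthest pair of PEs is separated by five switch nodes. The count grows only logarithmically with the number of PEs, which is the property that makes the tree attractive for long-distance transfer.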
The logic-in-memory architecture using the tree network is applied to the MVFG-RVLSI using only the X-net network, where the ultra-fine grain cell is composed of 10 differential-pair circuits (DPCs) and a CMOS DFF [10] . If the tree network is connected to each cell, the cost becomes extremely large. Also, in practical applications, most of the cells require neighborhood data transfer for bit-serial operations. For example, let us consider the SAD widely used as a similarity measure in template matching. The SAD is expressed as
where the control/data flow graph is shown in Fig. 9. The SAD is performed by iteration of an absolute difference operation and addition (ADA). Figure 10 shows the allocation result of the eight-bit ADA for the MVFG-RVLSI using only the X-net network [7]. Only three cells (gray color) are connected to the tree network to receive or send data. Therefore, it is reasonable to connect the tree network to cell blocks composed of many cells, but not to each cell.
Figure 11 shows the MVFG-RVLSI using the GTLX. The tree network is used for bit-parallel global data transfer (eight bits as an example), and the X-net network is used for bit-serial localized data transfer between the cells for logic operations. Two kinds of switch nodes are provided to control the global data transfer. A non-pipelined switch node is composed of three one-bit switches, and a pipelined switch node employed for high throughput is composed of eight DFFs and six one-bit switches.
Figure 12 shows the interconnections between the global tree network and the local X-net network by eight eight-bit registers. Each register has a parallel voltage I/O port, a serial current I/O port and a parallel current output port. The parallel voltage I/O port is used to access the tree network for the global bit-parallel data transfer. The serial current I/O port is used to access a fixed "X" intersection for bit-serial operations. The parallel current output port is used to access eight fixed vertical "X" intersections for serial-parallel operations such as a serial-parallel multiplication shown in Fig. 13. At each "X" intersection, the current signals from the register and an adjacent cell can be linearly summed by wiring, which leads to high utilization of the X-net network.
Fig. 12 Interconnections between the global tree network and the local X-net network.
Fig. 13 Serial-parallel multiplication in the multiple-valued fine-grain reconfigurable VLSI using the X-net network.
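The serial-parallel multiplication mentioned above can be sketched behaviorally: one operand is held in parallel (as on the parallel current output port of a register) while the other arrives one bit per cycle. The following is only an illustration under assumptions, not the cell-level mapping of Fig. 13: operands are unsigned, the serial operand arrives least significant bit first, and the function name is hypothetical.

```python
def serial_parallel_multiply(parallel_operand, serial_bits):
    """Multiply an operand held in parallel by an LSB-first bit stream.

    In each cycle the incoming serial bit is ANDed with every bit of the
    parallel operand to form a partial product, which is then accumulated
    at the bit weight of the current cycle.
    """
    accumulator = 0
    for cycle, bit in enumerate(serial_bits):
        # AND of the serial bit with all bits of the parallel operand.
        partial_product = parallel_operand if bit else 0
        accumulator += partial_product << cycle
    return accumulator
```

For instance, multiplying 13 by the stream `[1, 1, 0, 1, 0, 0, 0, 0]` (the value 11, LSB first) accumulates 13 + 26 + 104 = 143. The AND step corresponds to the partial-product generation attributed to the CSSAND, and the accumulation corresponds to bit-serial additions.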
Evaluations of the Multiple-Valued Fine-Grain Reconfigurable VLSI Using the Global Tree Local X-Net Network
The evaluation of the MVFG-RVLSI using the GTLX is done based on HSPICE simulation using a 65 nm CMOS design rule. Eight-bit data are transferred from one cell to a cell one hundred cells away in the MVFG-RVLSI using only the X-net network and in the MVFG-RVLSI using the GTLX. Table 1 shows the comparison results. In comparison with the MVFG-RVLSI using only the X-net network, the computation time, the transistor count, the configuration memory count, the power consumption and the power-delay product of the MVFG-RVLSI using the GTLX are reduced by 62%, 88%, 91%, 92% and 97%, respectively.
Table 1 Comparison of the long-distance data transfer in the multiple-valued fine-grain reconfigurable VLSIs (columns: MVFG-RVLSI using the X-net; MVFG-RVLSI using the GTLX).
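The reported power-delay product figure can be cross-checked against the reported time and power figures. The following minimal sketch assumes that the power-delay product is simply power multiplied by computation time and that each percentage is a fractional reduction relative to the X-net-only design; the function name `pdp_reduction` is hypothetical.

```python
def pdp_reduction(time_reduction, power_reduction):
    """Fractional reduction of the power-delay product.

    Both arguments are fractional reductions (0.62 means "reduced by 62%").
    The remaining product is the remaining time times the remaining power.
    """
    remaining = (1.0 - time_reduction) * (1.0 - power_reduction)
    return 1.0 - remaining


# Long-distance transfer: time reduced by 62%, power reduced by 92%.
long_distance = pdp_reduction(0.62, 0.92)

# FFT butterfly: time reduced by 25%, power reduced by 36%.
fft_butterfly = pdp_reduction(0.25, 0.36)
```

Under these assumptions the remaining product is 0.38 × 0.08 ≈ 0.03 for the long-distance transfer and 0.75 × 0.64 = 0.48 for the FFT butterfly, consistent with the reported reductions of 97% and 52%.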
Let us consider the FFT operation, which is an efficient algorithm to compute the discrete Fourier transform (DFT) and its inverse. The most common FFT algorithm is the Cooley-Tukey FFT algorithm composed of many butterfly operations. Figure 14 shows the control/data flow graph of the butterfly operation. Figure 15 shows the allocation of the butterfly operation onto the MVFG-RVLSI using only the X-net network. The inputs B_r, W_r, B_i and W_i are transferred from the fixed eight-bit registers to the edge cells, which is the optimal allocation onto the MVFG-RVLSI using only the X-net network. However, 16 cells are used to implement two serial-in parallel-out registers for the serial-parallel multiplication, and 24 cells are used to access the input A_i and the outputs O_1r and O_2r between the fixed eight-bit registers and the non-edge cells, which results in low speed, large power consumption and low utilization of the cells. Figure 16 shows the allocation of the butterfly operation onto the MVFG-RVLSI using the GTLX. All of the inputs B_r, W_r, B_i, W_i and A_i and the outputs O_1r and O_2r are transferred between the eight-bit registers provided at the lowest level subtree and the adjacent cells. Moreover, the parallel current signals from the eight-bit registers and the current signals from the adjacent cells are linearly summed by wiring, which leads to high utilization of the X-net network. Table 2 shows the comparison result. The computation time, the transistor count, the configuration memory count, the power consumption and the power-delay product of the MVFG-RVLSI using the GTLX are reduced by 25%, 56%, 59%, 36% and 52%, respectively, in comparison with those of the MVFG-RVLSI using only the X-net network. The performance is greatly improved in the MVFG-RVLSI using the GTLX, even in comparison with the optimal allocation in the MVFG-RVLSI using only the X-net network.
Fig. 16 Allocation of the butterfly operation onto the multiple-valued fine-grain reconfigurable VLSI using the global tree local X-net network.
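For readers unfamiliar with the butterfly operation, its arithmetic can be written out on real and imaginary parts. This is a sketch of the standard radix-2 butterfly, not a transcription of Fig. 14: the argument names follow the signal names in the text (A, B, W with suffixes r and i), while the assignment of the sum to O_1 and the difference to O_2 is an assumption.

```python
def butterfly(a_r, a_i, b_r, b_i, w_r, w_i):
    """Radix-2 butterfly on separate real and imaginary parts.

    Returns (o1_r, o1_i, o2_r, o2_i) with O1 = A + W*B and O2 = A - W*B.
    """
    # Complex product W*B expanded into four real multiplications.
    t_r = b_r * w_r - b_i * w_i
    t_i = b_r * w_i + b_i * w_r
    # One addition and one subtraction per real and imaginary part.
    return a_r + t_r, a_i + t_i, a_r - t_r, a_i - t_i
```

The four real multiplications are the operations that map onto serial-parallel multipliers, and the additions and subtractions map onto bit-serial adders, which is why the inputs B_r, W_r, B_i and W_i dominate the data access pattern discussed above.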
In the FG-RVLSI, fine-grain pipelining and high utilization of a cell make the performance and parallelism high, respectively. Since the FFT can be performed in parallel for all the data sets, the performance increases in proportion to the number of butterfly modules available. It has been demonstrated, by realizing the FFT, that the total performance of the CMOS FG-RVLSI using only the neighborhood network is 9 times higher than that of the conventional FPGA [5]. Moreover, the multiple-valued logic has been introduced to reduce the area and the power consumption of the fine-grain cell [6], [7]. It has been demonstrated that the area and the power consumption of the MVFG-RVLSI are reduced to 60% and 67%, respectively, in comparison with the equivalent CMOS FG-RVLSI without increasing the delay. The total performance of the MVFG-RVLSI becomes 1.6 times higher than that of the CMOS FG-RVLSI.
In this paper, the speed and the density of the butterfly modules in the MVFG-RVLSI using the GTLX become 1.3 times and 2.2 times higher than those of the MVFG-RVLSI using only the neighborhood network, respectively. Therefore, the total performance of the MVFG-RVLSI using the GTLX becomes 2.8 times higher than that of the MVFG-RVLSI using only the neighborhood network. As a result, we can infer that the total performance of the MVFG-RVLSI using the GTLX becomes 41.2 times higher than that of the conventional FPGA.
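The chain of factors behind the overall estimate can be reproduced with a few multiplications. The sketch below assumes that total performance is the product of speed and module density and that the individual gains multiply; the derivation of the speed and density gains from the reported 25% and 56% reductions is an inference made here for illustration and is not stated in the text. All names are hypothetical.

```python
def gain_from_reduction(reduction):
    """Improvement factor implied by a fractional reduction (0.25 -> 1/0.75)."""
    return 1.0 / (1.0 - reduction)


# Inferred origin of the quoted gains (an assumption, see lead-in).
speed_from_time = gain_from_reduction(0.25)      # about 1.33, quoted as 1.3
density_from_count = gain_from_reduction(0.56)   # about 2.27, quoted as 2.2

# Factors as quoted in the text.
speed_gain = 1.3         # GTLX vs. X-net only, speed of butterfly modules
density_gain = 2.2       # GTLX vs. X-net only, density of butterfly modules
mv_over_cmos = 1.6       # MVFG-RVLSI vs. CMOS FG-RVLSI
cmos_over_fpga = 9.0     # CMOS FG-RVLSI vs. conventional FPGA

gtlx_over_xnet = speed_gain * density_gain
gtlx_over_fpga = cmos_over_fpga * mv_over_cmos * gtlx_over_xnet
```

The product 1.3 × 2.2 is about 2.86, which the text quotes as 2.8, and 9 × 1.6 × 2.86 is about 41.2, matching the final estimate. The final figure therefore appears to use the unrounded product 2.86 rather than 2.8.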
Conclusion
A global tree local X-net network has been introduced to realize a high-performance multiple-valued fine-grain reconfigurable VLSI. A pipelined tree network is employed for high-throughput bit-parallel global data transfer, and an X-net network is employed for simple bit-serial localized data transfer for logic operations. A logic-in-memory architecture is utilized to solve the bottleneck problem between a data memory at the highest level subtree and each cell at the lowest level subtree. A register with a parallel voltage I/O port, a serial current I/O port and a parallel current output port is introduced to realize flexible interconnections between the global tree network and the local X-net network. Moreover, linear summation of the current signals from the register and an adjacent cell can be realized at each "X" intersection, which leads to high utilization of the global tree local X-net network. It is demonstrated that the computation time, the transistor count, the configuration memory count, the power consumption and the power-delay product of the multiple-valued fine-grain reconfigurable VLSI using the global tree local X-net network are reduced by 25%, 56%, 59%, 36% and 52%, respectively, in comparison with those of the multiple-valued fine-grain reconfigurable VLSI using only the X-net network.
As future work, we will perform system-level evaluations through applications such as image processing, video compression and audio signal processing. Developing CAD tools for the multiple-valued fine-grain reconfigurable VLSI is also an interesting issue.
