In this paper, we develop novel parallel circuit designs for calculating the bit reversal. To perform bit reversal on $2^n$ data words, the designs take $2^k$ ($k < n$) words as input each cycle. The circuits consist of concatenated single-port buffers and 2-to-1 multiplexers and use the minimum number of registers for control. The designs consume the minimum number of single-port memory banks necessary for calculating continuous-flow bit reversal, as well as a near-optimal $2^n$ memory words. The proposed parallel circuits can be built for any given fixed $k$ and $n$, and achieve superior performance over the state of the art for calculating the bit reversal in parallel multi-path FFT architectures.
INTRODUCTION
Bit reversal has been widely used for calculating the output frequencies of Fast Fourier Transforms (FFTs) [13]. Given a set of data, a bit reversal reorders the indexed data by reversing the bits of each data index [16]. When all the inputs are available concurrently, bit reversal can be implemented straightforwardly by reordering hardware wires. However, a compact pipelined design is more desirable for processing a long data sequence, considering the routing complexity and the I/O bandwidth limitation.
Bit reversal has been extensively studied for decades in various research areas [21, 22, 13, 15, 10]. Many existing approaches in the literature focus on efficient algorithms for computing the bit reversal of a data sequence stored in a memory [21, 22]. These approaches are highly suitable for microprocessor architectures. Efficient bit reversal design is particularly important in FFT hardware implementations [13, 15]. Based on the type of data storage used, previous work on circuit design for bit reversal can be classified into memory-based designs and register-based designs. In [6, 17], delay feedback/commutator structures are proposed to perform bit reversal in traditional VLSI implementations of the FFT. Input samples are bit reversed using register-based delay feedback/commutator structures in these folded FFT architectures to achieve high computational performance per unit area. In this context, realizing the bit reversal becomes a different problem from reordering data stored in a memory. With the aim of minimizing resource consumption, register-based serial designs for bit reversal are developed in [14] and further adopted in the implementation of radix-$2^k$ feedforward FFT architectures [15]. To improve the throughput, parallel bit reversal circuit designs are developed in [11, 12] such that input samples can be processed by multi-path delay feedback (MDF) and multi-path delay commutator (MDC) FFT architectures.

DAC '17, June 18-22, 2017
In addition to the above designs, other research works realize a specific family of data permutations that includes stride permutation and bit reversal. In [7, 9], parallel streaming permutation structures are proposed to obtain high-performance hardware implementations of the FFT. In [18], the authors present a register-based permutation network for stride permutations on array processors, in which bit reversal is realized by a combination of different stride permutations. That network supports any power-of-two stride and achieves high resource efficiency. In [20], a memory-based permutation approach is proposed; the permutation network is mathematically proven to realize any given bit-index permutation, including bit reversal, using dual-ported memories. These previous works are more general approaches that are not optimized specifically for bit reversal.
In this paper, we address the problem of calculating bit reversal on parallel continuous data flows and propose an optimal solution to this problem. The solution consists of circuits that require only single-port buffers, 2-to-1 multiplexers, and simple control circuits. The proposed circuits are proven to be optimal in two senses. First, they achieve the lower bound on the number of memory banks with near-optimal memory efficiency, i.e., continuous-flow bit reversal cannot be realized with fewer storage banks. Second, the circuits achieve an optimal control mechanism with regard to the number of control registers. Specifically, our main contributions are the following:
• Parallel circuits for calculating bit reversal on parallel data inputs (Section 4).
• Parameterized bit reversal designs with respect to input size and data parallelism (Section 4.2).
• Demonstrated optimal designs achieving the lower bound of the number of memory banks as well as near optimal memory efficiency (Section 4.3).
• Optimal control mechanism for supporting reordering outputs of power-of-2 FFTs (Section 4.2).
The rest of the paper is organized as follows. Section 2 introduces the background. Section 3 defines the parallel bit reversal problem. Section 4 presents our proposed parallel bit reversal circuit designs. Section 5 presents experimental results. Section 6 concludes the paper.
BACKGROUND
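Bit reversal: As an important permutation pattern, bit reversal has been extensively used in FFTs [6]. For a $2^n$-element data vector, bit reversal moves the element at index $i$ to index $\sigma(i)$, where $\sigma()$ is the bit reversing operation, i.e., $\sigma(i)$ is the integer obtained by reversing the $n$-bit binary representation of $i$. The bit reversal permutation matrix is also known as the base-$t$ digital reversal permutation when $t = 2$. The permutation matrix of the base-$t$ digital reversal permutation is denoted as $E_{N,t}$: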
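where $I$ is the identity matrix, $P_{\log_t N,\,t}$ is the stride permutation introduced next, and $\otimes$ is the well-known tensor product (Kronecker product) [19]. For example, $E_{4,2}$ performs $x_0, x_1, x_2, x_3 \rightarrow x_0, x_2, x_1, x_3$, and $y = E_{4,2} \cdot x$ can be represented as

$$
\begin{bmatrix} y_0 \\ y_1 \\ y_2 \\ y_3 \end{bmatrix}
=
\begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1
\end{bmatrix}
\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix}
$$

Stride Permutation: Another class of important data permutations in FFT algorithms is stride permutations.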
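Given an $m$-element data vector $x$ and a stride $t$ ($1 \le t \le m - 1$), the data vector $y$ produced by the stride-by-$t$ permutation over $x$ is given as $y = P_{m,t}\,x$, where $P_{m,t}$ is a permutation matrix. $P_{m,t}$ is an $m \times m$ bit matrix such that

$$
P_{m,t}[i][j] =
\begin{cases}
1 & \text{if } j = (t \cdot i \bmod m) + \lfloor t \cdot i / m \rfloor \\
0 & \text{otherwise}
\end{cases}
$$

where mod is the modulus operation and $\lfloor \cdot \rfloor$ is the floor function.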
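The two permutations above can be checked numerically. The following is a minimal Python sketch, not taken from the paper: the function names are hypothetical, and the stride permutation is assumed to follow the common row-wise rule $y[i] = x[(t \cdot i \bmod m) + \lfloor t \cdot i / m \rfloor]$ with $t$ dividing $m$.

```python
def bit_reverse(i: int, n: int) -> int:
    """Reverse the n-bit binary representation of index i."""
    r = 0
    for _ in range(n):
        r = (r << 1) | (i & 1)  # shift the lowest bit of i into r
        i >>= 1
    return r


def bit_reversal(x):
    """Bit reversal permutation of a power-of-two length sequence: y[i] = x[sigma(i)]."""
    n = (len(x) - 1).bit_length()
    assert len(x) == 1 << n, "length must be a power of two"
    return [x[bit_reverse(i, n)] for i in range(len(x))]


def stride_permutation(x, t):
    """Stride-by-t permutation (assumed rule): y[i] = x[(t*i mod m) + floor(t*i / m)]."""
    m = len(x)
    assert m % t == 0, "this sketch assumes the stride divides the vector length"
    return [x[(t * i) % m + (t * i) // m] for i in range(m)]


if __name__ == "__main__":
    print(bit_reversal(list(range(4))))           # same reordering as the E_{4,2} example
    print(stride_permutation(list(range(8)), 2))  # even-indexed elements first, then odd
```

For four elements, bit reversal and the stride-by-2 permutation coincide, which matches the $E_{4,2}$ example; for longer vectors they differ.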
PROBLEM DEFINITION
Bit reversal on serial data has been well studied [6, 14, 17]. Recently, hardware designs for parallel bit reversal have been proposed for high-throughput MDC or MDF FFT architectures [11, 12]. As the inputs cannot all be available concurrently, designing a specialized hardware block for performing bit reversal on parallel data becomes challenging, because data elements need to be moved across the temporal boundary, i.e., a data element input in input cycle $i$ needs to be output in output cycle $j$, where $i \ne j$. This requires register files or storage elements such as memory banks, so that a data element can be output after a specific amount of delay.
To formulate this problem, we define the parallel-$2^k$ bit reversal problem as follows: design parallel hardware with an available data parallelism of $p$ ($p = 2^k$), where $p$ is defined as the number of parallel inputs processed per cycle. Therefore, to calculate the bit reversal of a $2^n$-element data sequence, the inputs enter the hardware design over $2^{n-k}$ ($k < n$) consecutive cycles. After a specific amount of delay $t$, the sequence is reordered as specified by the bit reversal and output over $2^{n-k}$ consecutive cycles. Such a design can be pipelined to process continuous data flows; thus the peak throughput is $p$ output elements per cycle and the latency is $t$.
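The problem statement can be illustrated with a small reference model. This is a hedged Python sketch rather than the proposed circuit: it only lists which element indices enter and leave in each cycle, and the names `stream_schedule` and `source_cycles` are hypothetical.

```python
def bit_reverse(i: int, n: int) -> int:
    """Reverse the n-bit binary representation of index i."""
    r = 0
    for _ in range(n):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def stream_schedule(n: int, k: int):
    """Element indices per cycle for a 2^n-point frame at parallelism p = 2^k.

    Returns (inputs, outputs); each is a list of 2^(n-k) cycles with p indices.
    """
    assert 0 <= k < n
    p = 1 << k
    cycles = 1 << (n - k)
    inputs = [[c * p + lane for lane in range(p)] for c in range(cycles)]
    outputs = [[bit_reverse(c * p + lane, n) for lane in range(p)]
               for c in range(cycles)]
    return inputs, outputs


def source_cycles(n: int, k: int):
    """For every output cycle, the input cycles in which its elements arrived."""
    _, outputs = stream_schedule(n, k)
    return [[idx >> k for idx in cycle] for cycle in outputs]


if __name__ == "__main__":
    ins, outs = stream_schedule(3, 1)
    print(ins)                  # elements arriving per input cycle
    print(outs)                 # elements required per output cycle
    print(source_cycles(3, 1))  # each output cycle mixes elements from different input cycles
```

With $n = 3$ and $k = 1$, every output cycle needs elements that arrived in two different input cycles, which is why storage elements are required.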
DESIGN APPROACH
4.1 Parallel-2 Bit Reversal
Parallel-$2^k$ bit reversal can be formulated as a permutation matrix problem [20] or a graph problem. In this paper, we propose a mapping methodology that translates this challenging problem into a classic switch network routing problem. Fig. 2 illustrates the key idea of the proposed mapping methodology, where the upper switch network is the classic Benes network [5], which can realize bit reversal between the input and the output by configuring its 2-to-2 switches. The first step in routing the Benes network is to decompose it into two stages of switches and two middle subnetworks. Fig. 2 shows that each $2 \times 2$ input/output switch has exactly one link to each of the upper and lower subnetworks. The input and output switch stages need to be configured such that the input data are routed to the right location at the output based on the mapping of bit reversal. The switch configuration bits can be easily computed by the classic looping algorithm [5]. The two subnetworks can be further decomposed and routed in a recursive manner such that $2\log_2 N - 1$ stages of switches are used and configured to realize a specific routing when the input size is $N$.

We first propose a parallel-2 bit reversal design by vertically folding the topology of the upper network in Fig. 2. As shown in Fig. 2, the resulting datapath of our design has a three-stage structure, including two stages of switches and one memory (buffer) stage. The $2 \times 2$ switch in the input/output stage is reconfigured at run time to realize the connections of the switches at the input/output stage of the Benes network. The upper (lower) subnetwork in the middle stage of the Benes network is replaced with a memory bank in the datapath as shown in Fig. 2, where this substitution relationship is emphasized using colored wire connections. The mapping between the input and the output of a subnetwork determines the number of cycles for which each data element is buffered in the corresponding memory bank. The proposed parallel-2 design shown in Fig. 2 is parameterizable with respect to the input size $2^n$. It requires only two 2-to-2 switches, each having two 2-to-1 multiplexers, and two single-port data buffers, each of size $2^{n-1}$. Fig. 4 shows the data buffer structure, which has one address port, one data port for read, and one data port for write, where the two data ports share the single address port. To realize the bit reversal, our design requires the data buffer to be accessed in the read-before-write mode, i.e., given a buffer address, the old data previously stored at the write address appears first on the output latches while the input data is being stored in the buffer.

Figure 3: 2-to-2 switch states (panels: Bypass, Cross).

Figure 4: Functional structure of the employed data buffer and its behavior in the read-before-write mode (cycles 1 to 8).
The access behavior of the buffer in the read-before-write mode is shown in Fig. 4. Fig. 3 shows the circuit behavior of the 2-to-2 switch in the different states specified by the value of the control bit. When the control bit value is zero, the 2-to-2 switch bypasses the inputs; otherwise it exchanges the inputs and routes them to the output. Such a switch can be implemented using two 2-to-1 multiplexers. Table 1 presents the values of the control bits, including $cl$, $A_0$, $A_1$, and $cr$ shown in Fig. 2, for computing bit reversal. Input data are fed into the circuit starting from cycle 0. For the first two cycles, the inputs bypass the 2-to-2 switch as $cl = 0$. For the next two cycles, the two inputs are swapped and routed to the output as $cl = 1$. During cycles 0, 1, 2 and 3, the inputs are first written into the data buffers; then, in the subsequent four cycles, the data stored in the buffers are read out and routed by the output switch. The complete data flow in the parallel-2 design can be obtained using Table 1. For simplicity, we only show the data flow during the input cycles and the output cycles in Fig. 5. The control bit values are repeatedly updated based on Table 1 when processing continuous data flows.
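The behavior described above can be sketched in software for one 8-element frame. This is an illustrative Python model under stated assumptions, not the authors' Verilog: the $cl$ values follow the text (0, 0, 1, 1), and the write address equals the cycle number, but the read addresses and the $cr$ values were derived by hand for this sketch and are not copied from Table 1. All names are hypothetical.

```python
def switch(a, b, ctrl):
    """2-to-2 switch: bypass when ctrl == 0, cross (exchange) when ctrl == 1."""
    return (a, b) if ctrl == 0 else (b, a)


class ReadBeforeWriteBuffer:
    """Single-port buffer: one shared address; the old word is read, then overwritten."""

    def __init__(self, depth):
        self.mem = [None] * depth

    def access(self, addr, data_in):
        old = self.mem[addr]       # the old data appears on the output first
        self.mem[addr] = data_in   # the new data is stored at the same address
        return old


def parallel2_bit_reversal_8(x):
    """Bit-reverse one 8-element frame, two words per cycle, with two 4-word buffers."""
    assert len(x) == 8
    m0, m1 = ReadBeforeWriteBuffer(4), ReadBeforeWriteBuffer(4)

    cl = [0, 0, 1, 1]              # input switch control, as described in the text
    for t in range(4):             # input cycles 0..3: write address = cycle number
        a, b = switch(x[2 * t], x[2 * t + 1], cl[t])
        m0.access(t, a)
        m1.access(t, b)

    # Assumed read schedule (hand-derived for this sketch, not from Table 1).
    a0 = [0, 1, 2, 3]
    a1 = [2, 3, 0, 1]
    cr = [0, 0, 1, 1]
    out = []
    for t in range(4):             # output cycles: read both buffers, route via output switch
        u = m0.access(a0[t], None)
        v = m1.access(a1[t], None)
        out.extend(switch(u, v, cr[t]))
    return out


if __name__ == "__main__":
    print(parallel2_bit_reversal_8(list(range(8))))
```

In a continuous flow, the `None` written during the read phase would be replaced by the next frame's data, which is what the read-before-write mode makes possible with a single address port.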
4.2 Parallel-$2^k$ Bit Reversal
Fig. 6(a) illustrates our proposed design for parallel-$2^k$ bit reversal. Similarly, the $2^n$ inputs are divided into $2^k$-element sub-vectors, which are fed into the parallel-$2^k$ bit reversal design over $2^{n-k}$ consecutive cycles. Therefore, the data parallelism is $2^k$. During the output phase, the design produces $2^k$ outputs per cycle. The notations used in Fig. 6 are as follows:
• $L_{ij}$ ($0 \le i \le s$, $0 \le j \le p - 1$): a 2-to-2 switch (see Fig. 3) at the input stage $i$ of the design.
• $W_i$, $W'_i$: $2^k$-to-$2^k$ fixed wire interconnection, $i = 0, 1, \ldots, s$.
• $M_j$: single-port memory banks (see Fig. 4) at the middle stage of the design.
The proposed design shown in Fig. 6 is obtained by extending the mapping approach introduced in Section 4.1, such that the Benes network is recursively decomposed into $2^k$ subnetworks and then vertically folded, motivated by the idea of time multiplexing. Such a design is parameterizable with respect to the input size $2^n$ and the data parallelism $2^k$. Based on this mapping methodology, the theorem below can be obtained:
Figure 6: Proposed parallel bit reversal designs; the control unit consists of an $(n-k)$-bit counter and inverters.

The proof of Theorem 4.1 can be found in [8]. Therefore, the parallel-4 bit reversal design is obtained as shown in Fig. 6(b). The control bits for the design in Fig. 6 are as follows:
• $A_j$ ($0 \le j \le p - 1$): $n - k$ bits used as the address for data buffer $M_j$.
• $cr_i$ ($0 \le i \le s$): a one-bit control determining the state of the switches $R_{i0}, R_{i1}, \ldots, R_{id}$. Its value is either 0 or 1.
Note that the control bits are not shown in Fig. 6 for simplicity. Table 2 shows the values of the control bits for calculating parallel-$2^k$ bit reversal. All the control bit values are computed using the Benes network based mapping approach (see Section 4.1). These values are further represented using functions including $f(i, t)$, $g_0(j, t)$, $g_1(t)$, and $g_2(t)$. The values of $cl_i$ or $cr_i$ are determined by the function $f(i, t)$, where $i$ is the stage index and $t$ is the cycle number. $f(i, t)$ is defined as:

where $f(i, t)$ switches between 0 and 1 periodically. Based on $f(i, t)$, the control bit values of all $cl_i$ and $cr_i$ can be obtained using an $(n-k)$-bit counter.

Footnote 2: $I_{2^i}$ is the identity matrix and $\otimes$ is the tensor (or Kronecker) product [19].
In $g_0(j, t)$, $a_0$ is the binary representation of the result of $(t/c_0) \bmod 2^k$, and $c_0$ is a power-of-two constant. $b_j$ is the binary representation of $j$. The values computed by $g_0(j, t)$ can be dynamically produced by the control unit in Fig. 6(a) using a $(k + \log_2 c_0)$-bit counter and ignoring its least significant $\log_2 c_0$ bits.

Then, $g_1(t)$ is defined by Eq. (6), where $\sigma()$ is the bit reversing operation defined in Section 2, $a_1$ is the binary representation of the result of $(t/c_1) \bmod 2^{l_1}$, and $c_1$ is a power-of-two constant. In the control unit, to update the values of $A_j[1]$ determined by $g_1(t)$, an $(l_1 + \log_2 c_1)$-bit counter with its least significant $\log_2 c_1$ bits ignored is sufficient. The bit reversing operation can be easily realized by permuted wires in hardware.

$g_2(t)$ is defined by Eq. (7), where $\{\}$ is the concatenation operation for bit representations [7], and $m = l_2/4 - 1$. $a_{2q}$ ($0 \le q \le m$) is the binary representation of the result of $(t/c_{2q}) \bmod 16$, and $c_{2q}$ is a power-of-two constant. In the control unit, the bit values of an $(l_2 + \log_2 c_{2q})$-bit counter, ignoring its least significant $\log_2 c_{2q}$ bits, can be employed for updating $A_j[2]$.
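The counter-based generation of terms such as $(t/c) \bmod 2^w$ can be checked with a short Python sketch. It assumes that $t/c$ denotes integer division and that $c$ is a power of two; the function names are hypothetical and the code illustrates only this arithmetic identity, not the full address functions.

```python
def counter_field(t: int, c: int, width: int) -> int:
    """(t // c) mod 2**width, read directly from the bits of a free-running counter.

    Dropping the log2(c) least significant bits of the counter value t and keeping
    the next `width` bits gives the same result as the division and modulus.
    """
    shift = c.bit_length() - 1
    assert c == 1 << shift, "c must be a power of two"
    return (t >> shift) & ((1 << width) - 1)


def reverse_bits(value: int, width: int) -> int:
    """Bit reversal of a `width`-bit field; in hardware this is only a wire permutation."""
    r = 0
    for _ in range(width):
        r = (r << 1) | (value & 1)
        value >>= 1
    return r


if __name__ == "__main__":
    for t in range(8):
        # With c = 2 and width = 2, the field advances every second cycle.
        print(t, counter_field(t, 2, 2), reverse_bits(counter_field(t, 2, 2), 2))
```

No divider or modulus logic is needed: selecting a bit slice of the counter is sufficient, which is consistent with the small control cost claimed for the design.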
In Table 2, $A_j = t$ when $0 \le t \le 2^{n-k} - 1$, and then $A_j$ is determined by $g_0(j, t)$, $g_1(t)$, and $g_2(t)$, where $c_0$, $c_1$ and $c_{2q}$ are pre-computed, during cycle $t$ ($2^{n-k} \le t \le 2^{n-k+1} - 1$). Therefore, we propose to employ a single $(n-k)$-bit counter to realize the run-time update of the control bit values of $A_j$, $cl_i$ and $cr_i$. In summary, the total latency of the bit-reversal circuit, denoted as $T(n, k)$, can be calculated as:

Table 3 notes: (1) 2-to-1 multiplexer; (2) $Y = 2^a - 1$, $a = (\log_2(N/2p^2))/2$ or $(\log_2(N/2p^2))/2 + 1$; (3) $C$ is a constant and $C \le 2$; (4) $X = 2p\log_2(N/p^2)/2 + 4p^2 - 4p$, and $X > 2p\log_2 N$.
Furthermore, the throughput of the design $Th(n, k)$ and the total number of control bits $Ctrl(n, k)$ are:
Although regular control bits such as buffer read or write enable bits are not considered in the calculation of $Ctrl(n, k)$, the result still covers the major portion of the control logic and indicates the optimality of the control mechanism used in our designs. With regard to resource consumption, the number of 2-to-1 multiplexers $Mu(n, k)$ and the number of data buffers $Bu(n, k)$ are:

Table 3 shows a comparison of several bit-reversal designs. $N$ and $p$ represent the input size and the data parallelism, respectively. Highly memory-efficient designs for bit reversal are proposed in [4] and [14], which consume $(\sqrt{N} - 1)^2$ and $(5N/8 - 3)$ memory words, respectively. However, these two designs only support serial input data. To support parallel samples, a parallel-8 bit reversal design is developed in [23] using 16 memory banks having a total of SN memory words; such a design offers high throughput but is not resource efficient. To realize parallel bit reversal on continuous data flows, a design using $2p$ single-port data buffers of total size $N$ and $(2p^2 - p)$ 2-to-1 multiplexers is proposed in [11]. Only parallel-2/4/8 designs are presented in that paper. To further reduce the total memory space, a parallel-$2^k$ design is developed in [12] by expanding the technique proposed in [14]. Their design achieves the lower bound on the number of memory words at the expense of more than $2p\log_2 N$ memory banks (data buffers) and $2p\log_2 N$ 2-to-1 multiplexers, which is resource and power inefficient for large $N$. Besides, the lower bound holds only when $p < \sqrt{N}$. A more generic approach for parallelizing a class of bit-index permutations including bit reversal is presented in [20]. Permutation matrix manipulation techniques are employed to achieve the lower bound on the number of memory words for realizing arbitrary parallel-$2^k$ ($k < n$) bit reversal.
Although many more recent optimizations have been proposed for reducing memory consumption to less than $N$, the approach in [20] remains the most practical one for hardware implementation. However, to support continuous data flows, the memory space requirement in their designs has to be doubled, using dual-port memory banks. Our approach translates the parallel bit reversal problem into a switch network routing problem and achieves the lower bound of memory words using $p$ single-port memory banks for arbitrary $2^k$ ($k < n$), while simultaneously supporting continuous data flows. The proposed parallel designs are realized using an optimal control mechanism that achieves the lower bound on the number of control registers, namely an $(n-k)$-bit counter. Note that all previous parallel-$2^k$ designs need at least an $(n-k)$-bit register for memory addressing.
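The figures quoted in this comparison can be evaluated for concrete $N$ and $p$ with a short Python sketch. It encodes only the expressions stated in the text above; the function names are hypothetical, and quantities whose equations are not reproduced here (such as the multiplexer count of the proposed design) are deliberately left out.

```python
import math


def serial_memory_words_ref4(N: int) -> float:
    """Memory words of the serial design in [4]: (sqrt(N) - 1)^2."""
    return (math.sqrt(N) - 1) ** 2


def serial_memory_words_ref14(N: int) -> float:
    """Memory words of the serial design in [14]: 5N/8 - 3."""
    return 5 * N / 8 - 3


def parallel_resources_ref11(N: int, p: int) -> dict:
    """Design in [11]: 2p single-port buffers of total size N, (2p^2 - p) 2-to-1 multiplexers."""
    return {"banks": 2 * p, "memory_words": N, "mux_2to1": 2 * p * p - p}


def ref12_threshold(N: int, p: int) -> float:
    """2p*log2(N): the 2-to-1 multiplexer count of [12]; its bank count exceeds this value."""
    return 2 * p * math.log2(N)


def proposed_banks(p: int) -> int:
    """Proposed design: p single-port memory banks."""
    return p


if __name__ == "__main__":
    N, p = 1024, 4
    print(serial_memory_words_ref4(N), serial_memory_words_ref14(N))
    print(parallel_resources_ref11(N, p))
    print(ref12_threshold(N, p), proposed_banks(p))
```

For $N = 1024$ and $p = 4$, the bank counts are 8 for [11], more than 80 for [12], and 4 for the proposed design.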
Resource consumption summary

EXPERIMENTAL RESULTS
To illustrate the benefits of using our proposed parallel designs for bit reversal, we performed detailed experiments on a Virtex-7 FPGA (XC7VX690T) using the Xilinx tool set Vivado 15.2. We chose 32-bit fixed-point data vectors as input. The Verilog implementations of the proposed designs and the baselines are available through [1] and [2], respectively. Table 4 shows a resource consumption comparison between our proposed design and the baseline implemented using the technique developed by the SPIRAL project in [20].

We further evaluate the energy efficiency (throughput divided by power) of radix-2 FFT for $N$ = 16, 256, 1024, 2048 and 4096 while varying $p$. The operating frequency is fixed at 250 MHz for the sake of power evaluation. The input test vectors were randomly generated with an average toggle rate of 25% (a pessimistic estimate). We used the VCD (value change dump) file as input to the Vivado Power Analyzer to obtain an accurate power dissipation estimate [3]. All the designs were pipelined to achieve this clock rate. The experimental results demonstrate the benefit of the parallel bit reversal techniques incorporated in our FFT designs from a power point of view. Fig. 8 shows that for small $N$, highly data-parallel designs can achieve extremely high energy efficiency. It also shows that as $p$ and the problem size are varied, our design achieves a 21% to 49% improvement in energy efficiency compared to [20].
CONCLUSION
This paper presents optimal circuit designs for parallel bit reversal. The designs achieve an optimal control mechanism and near-optimal resource efficiency for calculating continuous-flow bit reversal. The proposed designs are generic and support arbitrary problem sizes and data parallelism that are powers of two. Finally, the proposed designs demonstrate significant improvement in resource efficiency and energy efficiency compared with the state of the art, and are thus very suitable for reordering the samples of multi-path parallel FFT architectures.
