Abstract-This brief presents novel circuits for calculating the bit reversal on parallel data. The circuits consist of delays/memories and multiplexers, and have the advantage that they require the minimum number of multiplexers among the circuits for parallel bit reversal proposed so far, as well as a small total memory.
I. INTRODUCTION

BIT REVERSAL [1] is a type of bit-dimension permutation [2], [3] that permutes a set of indexed data according to a reversing of the bits of the index. Its main use is to sort out the outputs of the fast Fourier transform (FFT), which are generally provided in the so-called bit-reversed order.
In recent years, researchers have provided efficient circuits to calculate the bit reversal of a continuous data flow arriving in series [4] or in parallel [5]-[11]. The calculation of parallel bit reversal has become popular due to the increasing number of high-throughput FFT hardware architectures that have been proposed. These architectures process a continuous flow of data arriving in parallel and demand a bit reversal circuit with the same parallelization in order to maintain the continuous flow.
There are numerous strategies to implement the parallel bit reversal. They depend on the type of elements used to store the data, i.e., delays or memories, and on how the bit reversal permutation is split into sub-permutations. Previous works have focused on reducing either the amount of delays/memory or the number of multiplexers. However, minimizing the delays/memory results in a larger number of multiplexers, whereas minimizing the multiplexers results in larger memories.
In this brief, we present memory- and multiplexer-efficient circuits for parallel bit reversal. The brief explores how to achieve a small memory and a small number of multiplexers simultaneously. This is done by studying the permutation that the bit reversal of parallel data carries out, as well as the alternatives of using delays and memories. As a result, this brief proposes two efficient circuits for $N > P^2$ and $N \le P^2$, respectively, where $N$ is the number of data involved in the permutation and $P$ is the number of parallel paths. These circuits achieve both small memory usage and small multiplexer usage.
Manuscript received August 9, 2018.
This brief is organized as follows. In Section II, we review the concepts related to bit-dimension permutations that are needed for this brief. In Section III, we review and classify the previous approaches for parallel bit reversal. In Section IV, we develop the proposed parallel bit reversal circuits. In Section V, we compare the proposed designs to previous approaches. Finally, in Section VI, we collect the main conclusions of this brief.
II. REVIEW OF BIT-DIMENSION PERMUTATIONS
This section reviews the main ideas related to bit-dimension permutations, which are required for understanding this brief. For a detailed description of bit-dimension permutations, the reader is encouraged to read [2].
Bit-dimension permutations apply to a set of $N = 2^n$ data, $n \in \mathbb{N}$, in an $n$-dimensional space with dimensions $x_{n-1} x_{n-2} \ldots x_0$, where the only possible coordinates for each dimension are 0 or 1, i.e., $x_i \in \{0, 1\}$. In this context, a bit-dimension permutation defines a reordering of the data according to a permutation of the $n$ bits of the dimensions [12]. This allows a permutation to be defined on a set of $n$ bits instead of on $2^n$ values, which is in most cases mathematically intractable [3].
The dimensions of the space can be serial or parallel. $P = 2^p$ is the number of samples that flow in parallel. As the total amount of data is $N$, those data are provided in series over $N/P$ clock cycles.
In this context, a bit-dimension permutation, σ , is a function represented by 
A. Elementary Bit-Exchange
An elementary bit-exchange (EBE) [3] is a bit-dimension permutation that only exchanges two dimensions. For example, the permutation defined by $\sigma(u_2 u_1 u_0) = u_2 u_0 u_1$ is an elementary bit-exchange of dimensions $x_1$ and $x_0$. Alternatively, an elementary bit-exchange of dimensions $x_j$ and $x_k$ can be represented as [4] σ :
1549-7747 © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
The number of delays (D), the number of multiplexers (M) and the latency (Lat) of circuits that calculate an elementary bit-exchange are summarized in Table I. This table is taken from [2], and the description of the circuits can be found in [2] and [13]. The circuits for elementary bit-exchange are classified into circuits that exchange two serial dimensions, i.e., serial-serial (ss), two parallel dimensions, i.e., parallel-parallel (pp), or a serial and a parallel dimension, i.e., serial-parallel (sp). Note that the costs in Table I depend on this classification.
TABLE I COSTS OF ELEMENTARY BIT-EXCHANGES
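As an illustration of how a bit-dimension permutation acts on data, the following minimal Python sketch applies the example permutation $\sigma(u_2 u_1 u_0) = u_2 u_0 u_1$ to a set of 8 indexed data. The helper names (`permute_index`, `apply_permutation`) and the convention that the datum with index $u$ moves to position $\sigma(u)$ are assumptions made for this illustration, not definitions taken from [2].

```python
def permute_index(index, source_bits):
    """Rearrange the bits of `index`.

    source_bits[i] is the input bit position that supplies output bit i
    (position 0 is the least significant bit).
    """
    result = 0
    for out_pos, in_pos in enumerate(source_bits):
        result |= ((index >> in_pos) & 1) << out_pos
    return result


def apply_permutation(data, source_bits):
    """Move the datum at index u to position sigma(u) (assumed convention)."""
    out = [None] * len(data)
    for u, value in enumerate(data):
        out[permute_index(u, source_bits)] = value
    return out


# sigma(u2 u1 u0) = u2 u0 u1:
# output bit 0 takes u1, output bit 1 takes u0, output bit 2 takes u2.
EBE_X1_X0 = [1, 0, 2]

reordered = apply_permutation(list(range(8)), EBE_X1_X0)
print(reordered)  # [0, 2, 1, 3, 4, 6, 5, 7]
```

Because an elementary bit-exchange only swaps two bits, applying it twice restores the original order.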
B. Bit Reversal
Bit reversal is a specific bit-dimension permutation that flips the dimensions. For serial data, i.e., when there is no parallel dimension, the bit reversal corresponds to the permutation
For parallel data, the bit reversal is represented as
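The two permutations described in this subsection can be written compactly as follows. This is a reconstruction from the statement that bit reversal flips the order of the dimensions, and it is not necessarily the exact form of the original equations. The vertical bar, which separates the serial dimensions on its left from the $p$ parallel dimensions on its right, is a notational assumption.

```latex
% Serial data (no parallel dimension)
\sigma(u_{n-1} u_{n-2} \ldots u_1 u_0) = u_0 u_1 \ldots u_{n-2} u_{n-1}

% Parallel data, P = 2^p (serial dimensions | parallel dimensions)
\sigma(u_{n-1} \ldots u_p \,|\, u_{p-1} \ldots u_0)
    = u_0 \ldots u_{n-p-1} \,|\, u_{n-p} \ldots u_{n-1}
```

In both cases the bit in position $i$ moves to position $n-1-i$. In the parallel case the bar stays in place, so the $p$ lowest positions remain the parallel ones.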
III. PREVIOUS APPROACHES FOR PARALLEL BIT REVERSAL
Previous approaches for parallel bit reversal are characterized by the type of permutations that they carry out and by the storage elements that they use, i.e., memories or delays. Fig. 1 shows the different approaches that have been proposed depending on the type of permutations. The permutation blocks in the figure have several stages with ss, pp or sp permutations. The first approach consists of the composition of the permutations ss-pp-ss [7], the second approach is pp-ss-pp [5]-[9], the third approach is ss-sp [11] and the fourth one is ss-pp-ss-pp [10].
Regarding the storage elements, each ss permutation can be carried out either by a memory or by a circuit for serial-serial permutation [2], [4]. The latter consists of a buffer and two multiplexers. Each pp permutation admits two possibilities: either it only requires interconnecting each input to the corresponding output by a wire [2], which does not require any hardware, or it uses a set of multiplexers in parallel to route the input data [5]-[9]. The former case occurs when all the values coming from the same input follow the same path, whereas the latter case occurs when they may follow different paths. Finally, each sp permutation consists of a set of buffers and multiplexers [2]. The buffers may be implemented using delays or memory [14]. Previous ss-pp-ss and pp-ss-pp approaches use a set of buffers for the pp permutation and a memory for the ss permutation [5]-[9]. Conversely, previous ss-sp and ss-pp-ss-pp approaches use delays for ss and sp permutations [10], [11].
IV. PROPOSED BIT REVERSAL CIRCUITS
The proposed approach is based on carrying out elementary bit-exchanges of pairs of dimensions. These pairs are $x_{n-1}$ and $x_0$, $x_{n-2}$ and $x_1$, $x_{n-3}$ and $x_2$, etc. This results in flipping the order of the bits, which is the essence of bit reversal. Fig. 2 shows some cases of what the parallel bit reversal permutation may look like. In all cases the permutation consists of flipping the order of the bits. However, the types of permutations that must be carried out differ depending on the relation between $N$ and $P$.
When $N = P^2$, half of the dimensions are serial and the other half are parallel. In this case, all the pairs of dimensions to be exchanged include a serial dimension and a parallel one. For instance, $x_{n-1}$ is serial as it is to the left of the bar (|), whereas $x_0$ is parallel.
When $N < P^2$, there are more parallel dimensions than serial dimensions. Thus, all the serial dimensions are exchanged with parallel ones and, additionally, there are some parallel dimensions that are exchanged with other parallel dimensions. This results in sp and pp permutations.
When $N > P^2$, there are more serial dimensions than parallel dimensions. Therefore, the $p$ upper serial dimensions are exchanged with parallel dimensions, and there are also some serial dimensions that are exchanged with other serial dimensions. This results in sp and ss permutations.
These cases are studied in the next sections.
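The three cases above can be checked with a short Python sketch that pairs dimension $x_{n-1-i}$ with $x_i$ and labels each pair as ss, sp or pp. It assumes that the $p$ lowest dimensions are the parallel ones, in line with the remark that $x_0$ is parallel and $x_{n-1}$ is serial. The function names are hypothetical.

```python
def classify_exchanges(n, p):
    """Classify the bit-exchange pairs (x_{n-1-i}, x_i) used for bit reversal.

    Assumes dimensions x_{p-1}..x_0 are parallel and x_{n-1}..x_p are serial.
    Returns a list of (upper, lower, kind) with kind in {"ss", "sp", "pp"}.
    """
    def kind(dim):
        return "p" if dim < p else "s"

    pairs = []
    for lower in range(n // 2):
        upper = n - 1 - lower
        pairs.append((upper, lower, kind(upper) + kind(lower)))
    return pairs


def bit_reverse(index, n):
    """Reference bit reversal of an n-bit index."""
    return int(format(index, f"0{n}b")[::-1], 2)


def apply_exchanges(index, pairs):
    """Swap the two bits of `index` named by every (upper, lower) pair."""
    for upper, lower, _ in pairs:
        if ((index >> upper) & 1) != ((index >> lower) & 1):
            index ^= (1 << upper) | (1 << lower)
    return index


for n, p in [(4, 2), (4, 3), (5, 1)]:  # N = P^2, N < P^2, N > P^2
    pairs = classify_exchanges(n, p)
    kinds = sorted({k for _, _, k in pairs})
    # Carrying out all the pairwise exchanges yields the full bit reversal.
    assert all(apply_exchanges(i, pairs) == bit_reverse(i, n)
               for i in range(2 ** n))
    print(f"N={2 ** n}, P={2 ** p}: {kinds}")
```

For these three examples the sketch reports only sp exchanges when $N = P^2$, pp and sp exchanges when $N < P^2$, and sp and ss exchanges when $N > P^2$.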
A. Bit Reversal Circuits for $N \le P^2$
If $N \le P^2$, the bit reversal of parallel data is described as
If we split this permutation into the sp and the pp permutations, we obtain
where $\sigma_1$ is pp and $\sigma_2$ is sp. Note that both permutations deal with different dimensions. Therefore, we can calculate either of the permutations first and then the other one, i.e.,
The permutation $\sigma_1$ is a pp permutation with no hardware cost. How to implement it is described in [2]. The permutation $\sigma_2$ consists of $n - p$ sp elementary bit exchanges, one per serial dimension. Its total number of delays according to Table I is
Likewise, the number of multiplexers is
and the latency is
B. Implementation of the Bit Reversal Circuits for $N \le P^2$
The circuit in Fig. 3 calculates the parallel bit reversal for $N = 16$ and $P = 8$. The proposed circuit includes a pp permutation and an sp one. It requires 8 delays, 8 multiplexers and has a latency of 1 clock cycle, which matches equations (9), (10) and (11).
For other values of $N$ and $P$, once the permutations have been identified, the work in [2] describes how to implement the elementary bit exchanges.
Note also that the buffers can alternatively be grouped in memories [14]. However, as $N \le P^2$, the length of the buffers is small, so it is generally preferred to use delays.
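The costs quoted for $N = 16$ and $P = 8$ can be cross-checked with the minimal Python sketch below. The per-exchange costs are an assumption and are not copied from Table I: exchanging serial dimension $x_j$ with a parallel dimension is taken to cost $P \cdot 2^{j-p}$ delays, $P$ multiplexers and $2^{j-p}$ clock cycles of latency. This assumption was chosen because it reproduces the 8 delays, 8 multiplexers and latency of 1 stated for Fig. 3. The function name is hypothetical.

```python
def sp_costs_small_n(n, p):
    """Total cost of the sp exchanges when N <= P^2 (that is, n <= 2p).

    Assumption (not copied from Table I): exchanging serial dimension x_j
    with a parallel dimension costs P * 2**(j - p) delays, P multiplexers
    and a latency of 2**(j - p) clock cycles.
    """
    assert p <= n <= 2 * p, "this sketch only covers N <= P^2"
    P = 2 ** p
    delays = muxes = latency = 0
    for j in range(p, n):          # serial dimensions x_p .. x_{n-1}
        length = 2 ** (j - p)      # assumed buffer length of this exchange
        delays += P * length
        muxes += P
        latency += length
    return delays, muxes, latency


print(sp_costs_small_n(4, 3))  # N = 16, P = 8 -> (8, 8, 1)
```

Under the same assumption the totals have the closed forms $N - P$ delays and $P \log_2(N/P)$ multiplexers.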
C. Bit Reversal Circuits for $N > P^2$
If $N > P^2$, the bit reversal for parallel data is described as
If we split this permutation into the ss and the sp permutations, we obtain
where $\sigma_1$ is ss and $\sigma_2$ is sp. In this case, the order of the permutations is also arbitrary, i.e.,
The number of delays for the sp permutation is
The ss permutation $\sigma_1$ is calculated by a bank of memories. As $\sigma_1$ neither affects nor depends on the parallel dimensions, all the memories carry out the same permutation. This permutation flips the order of the $n - 2p$ lower serial dimensions. Therefore, the operation that each memory carries out is the serial bit reversal of $2^{n-2p}$ data. When data are written in memory in natural order, the bit reversal is calculated by reading the data in bit-reversed order. Likewise, if data are stored in bit-reversed order, then the bit reversal of the data is achieved by reading the data from memory in natural order. In this way, a memory whose address alternates between natural and bit-reversed order calculates the bit reversal of the input data. The size of each memory is equal to
As there is one memory per parallel branch, the number of memories is
and the total memory for the ss permutation is
Regarding latency, as the read and write addresses are the same, the latency of each memory is equal to its size, i.e.,
Finally, the memories do not include any multiplexer, as the permutation is done by modifying the read and write addresses of the memories. By combining the cost of the sp and the ss permutation, we obtain the total memory cost of the proposed approach, which results in
The multiplexers come only from the sp permutation and the total latency is
Fig. 4 shows the parallel bit reversal circuit for $N = 32$ and $P = 2$. It includes two memories and a circuit for sp permutation. As $N > P^2$, the buffers of the sp permutation are always large. Therefore, it is preferable to implement them by using memories instead of delays. Likewise, the addresses for the two memories in the figure are always the same. This means that they can be grouped in a single memory.
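The behavior of a memory whose address alternates between natural and bit-reversed order can be sketched in Python as follows. This is a functional model under stated assumptions, not the author's implementation: each cycle reads the old word and then writes the new input to the same address, and the address order switches every $2^{\text{bits}}$ cycles. The function names and the test data are hypothetical.

```python
def bit_reverse(index, bits):
    """Reverse the `bits` least significant bits of `index`."""
    return int(format(index, f"0{bits}b")[::-1], 2)


def alternating_address_memory(stream, bits):
    """Model of one memory whose address alternates between natural and
    bit-reversed order every 2**bits cycles.

    Each cycle the old word at the address is read out and the new input
    is written to the same address. Returns the words read out, skipping
    the first block (the memory is still empty then).
    """
    size = 2 ** bits
    memory = [None] * size
    outputs = []
    for cycle, word in enumerate(stream):
        count = cycle % size
        block = cycle // size
        addr = count if block % 2 == 0 else bit_reverse(count, bits)
        if block > 0:
            outputs.append(memory[addr])  # read before write
        memory[addr] = word
    return outputs


# Two blocks of 8 words; a third block of dummy inputs flushes the memory.
stream = list(range(8)) + list(range(100, 108)) + [None] * 8
out = alternating_address_memory(stream, bits=3)
print(out[:8])  # [0, 4, 2, 6, 1, 5, 3, 7]
print(out[8:])  # [100, 104, 102, 106, 101, 105, 103, 107]
```

Every block of 8 words leaves the memory in bit-reversed order, 8 cycles after it entered, which agrees with the latency of each memory being equal to its size.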
D. Implementation of the Bit Reversal Circuits for $N > P^2$
As a result, Fig. 5 shows the detailed implementation of the circuit in Fig. 4. The circuit only includes two memories, a control counter and two multiplexers, considering that the area of the multiplexers associated with the control counter is negligible, because they are 1-bit multiplexers. In the circuit, MEM0 groups the two memories in Fig. 4. Its address ADDR0 is used for both reading and writing and is generated from the bits of the counter. Note that it consists of the counter bits $C_2 C_1 C_0$ during 8 clock cycles and $C_0 C_1 C_2$ during the following 8 clock cycles. This allows for calculating the bit reversal of groups of 8 inputs. The second memory, MEM1, calculates the sp permutation. This memory acts as a buffer of length 8 and combines the two buffers in Fig. 4. Note also that the control of the circuit is simple, as it is easily obtained from the bits of a counter.
The timing diagram for the circuit in Fig. 5 is shown in Table II. Apart from the control signals, inputs and outputs, it includes some internal signals to keep track of the data. Note that the input is received in bit-reversed order and the output is provided in natural order.
V. COMPARISON
Table III compares the proposed and previous approaches to calculate the parallel bit reversal. Previous approaches that use memories are pp-ss-pp [5]-[9] or ss-pp-ss [7], whereas previous approaches based on delays use ss-pp-ss-pp [10] or ss-sp [11]. By contrast, the proposed approach uses memories and ss-sp for $N > P^2$, and pp-sp for $N \le P^2$.
Regarding delays/memory, previous approaches based on delays [10], [11] achieve the theoretical minimum, and some approaches based on memories [7]-[9] achieve a reasonable and slightly larger memory size of $N$. The proposed approach also requires a memory size of $N$.
Regarding multiplexers, some approaches [5], [8], [10] require a number of multiplexers proportional to $P^2$. Other approaches [6], [7], [9] reduce the complexity to $2P \log_2 P$. Finally, only [7] and the proposed approach reduce the complexity to $P \log_2 P$, whereas the complexity of [11] is in most cases larger, as it depends on $N$ and $N > P^2$.
By considering delays/memories and multiplexers, the advantage of the proposed approach for $N > P^2$ is that it is the only approach that reduces the memory to $N$ addresses and the number of multiplexers to $P \log_2 P$. Thus, it minimizes memory and multiplexers simultaneously. For $N \le P^2$ the proposed approach reduces the amount of memory and multiplexers even further.
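The totals quoted for $N > P^2$ can be reproduced with the following Python sketch, which adds the cost of the sp stage to that of the ss stage. The per-stage costs are hypothetical assumptions chosen to match the totals stated in the text. An sp exchange of serial dimension $x_j$ is taken to cost $P \cdot 2^{j-p}$ memory words, $P$ multiplexers and $2^{j-p}$ cycles. The ss stage is taken to use $P$ memories of $2^{n-2p}$ words each and no multiplexers.

```python
def costs_large_n(n, p):
    """Memory, multiplexers and latency of the ss-sp split when N > P^2.

    Assumptions (hypothetical per-stage costs, chosen to match the totals
    quoted in the text): an sp exchange of serial dimension x_j costs
    P * 2**(j - p) memory words, P multiplexers and 2**(j - p) cycles;
    the ss stage uses P memories of 2**(n - 2p) words and no multiplexers.
    """
    assert n > 2 * p, "this sketch only covers N > P^2"
    P = 2 ** p
    # The p upper serial dimensions x_{n-p} .. x_{n-1} take part in sp exchanges.
    sp_memory = sum(P * 2 ** (j - p) for j in range(n - p, n))
    sp_latency = sum(2 ** (j - p) for j in range(n - p, n))
    ss_memory = P * 2 ** (n - 2 * p)
    ss_latency = 2 ** (n - 2 * p)
    return {
        "memory": sp_memory + ss_memory,
        "multiplexers": P * p,
        "latency": sp_latency + ss_latency,
    }


print(costs_large_n(5, 1))   # N = 32, P = 2
print(costs_large_n(12, 2))  # N = 4096, P = 4
```

For $N = 32$ and $P = 2$ the sketch gives a memory of 32 words and 2 multiplexers, and for $N = 4096$ and $P = 4$ it gives 4096 words and 8 multiplexers, in agreement with the figures of $N$ addresses and $P \log_2 P$ multiplexers. The latency value depends entirely on the assumed per-stage latencies.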
Finally, the latency is similar in all the approaches with small variations, and the throughput (Th.) of all the approaches is the same, as all of them process P data in parallel in a continuous flow.
Regarding experimental results, Table IV compares implementations of bit reversal circuits on a Virtex-7 XC7VX330T-3-FFG1157 FPGA. The results are obtained for N = 4096, P = 4 and 32 bits of word length (WL). Compared to previous approaches, the proposed solution is more compact, as it makes use of BRAM memory instead of a large amount of distributed logic.
VI. CONCLUSION
This brief proposes multiplexer- and memory-efficient circuits for parallel bit reversal. They are the result of a good selection of the sub-permutations used to calculate the bit reversal and an efficient use of the delays/memories and multiplexers. For $N > P^2$, the proposed circuits use an ss-sp permutation and require a memory of $N$ addresses and $P \log_2 P$ multiplexers. For $N \le P^2$, the circuits carry out a pp-sp permutation and require delays/memories of a total size of $N - P$, and $P \log_2 (N/P)$ multiplexers. This achieves a small memory and a small number of multiplexers simultaneously.
