Abstract-In this paper, we present a systematic approach to design hardware circuits for bit-dimension permutations. The proposed approach is based on decomposing any bit-dimension permutation into elementary bit-exchanges. Such decomposition is proven to achieve the theoretical minimum number of delays required for the permutation. This offers optimum solutions for multiple well-known problems in the literature that make use of bit-dimension permutations. This includes the design of permutation circuits for the fast Fourier transform, bit reversal, matrix transposition, stride permutations, and Viterbi decoders.
I. INTRODUCTION

BIT-DIMENSION permutations [1] are permutations on $N = 2^n$ data defined by a permutation of the $n$ bits that represent the index of the data in binary. Bit-dimension permutations are a wide category that includes, among other permutations, the perfect shuffle [2], matrix transposition [3], [4], stride permutations [5]-[8], and bit reversal [9]-[15], as shown in Fig. 1. Bit-dimension permutations are used in important signal processing algorithms such as the fast Fourier transform (FFT) [15]-[21] and Viterbi decoders [22].
One way to design digital circuits that carry out bit-dimension permutations is to use lifetime analysis and register allocation [23], [24]. This approach determines the content of the registers used for the permutation at each time instant, leading to efficient use of the registers.
More recent works use memories [3], [13], [25]-[31] or delays (buffers or registers) [3], [6]-[10] to carry out the permutations. The approaches based on memories consist of a bank of memories in parallel and multiplexers at the input and output of the memories. The multiplexers decide to or from which memory each datum is written or read. The approaches based on delays include delays and multiplexers in series and in parallel. At present, there exist optimum solutions for bit-dimension permutations in terms of the number of delays [23], [24]. However, the interconnection between registers is complex and requires a large number of wires and multiplexers. There also exist optimum solutions for specific bit-dimension permutations. For bit reversal, optimum circuits have been proposed for serial [9] and parallel data [10]. Likewise, there are circuits with the minimum number of delays for matrix transposition and other stride permutations [8]. However, there is no general solution in the literature that provides the minimum number of delays as well as a reduced number of multiplexers for any bit-dimension permutation.
In this paper, we present a systematic approach to design hardware circuits for bit-dimension permutations, different from the commonly used Kronecker products [8], [25]. The proposed approach leads to circuits with an optimum number of delays for any bit-dimension permutation. This has several implications. First, the proposed approach widens the scope with respect to previous papers that only focus on specific permutations such as bit reversal or matrix transposition. Second, it provides optimum solutions in terms of delays for a wide range of permutations. Furthermore, it reduces the number of multiplexers with respect to previous approaches based on delays. As a result, the proposed approach is a systematic and optimized solution for a large group of permutations.
This paper is organized as follows. Section II briefly reviews the concept of bit-dimension permutations. Section III explains how to model a continuous data flow, which is the basis of the proposed approach. Section IV presents the circuits for elementary bit-exchange (EBE), which are the basic circuits that we use to carry out bit-dimension permutations. Section V describes how to calculate the cost of a bit-dimension permutation. Section VI presents the theoretical minimum latency and number of delays for a bit-dimension permutation. Section VII shows how to derive optimum circuits for bit-dimension permutations. Section VIII compares the proposed approach to previous approaches in the literature. Section IX summarizes the main conclusions and the Appendix shows an example of how to use the proposed approach.
II. BIT-DIMENSION PERMUTATIONS
Let us consider a set of $N = 2^n$ data, $n \in \mathbb{N}$, in an $n$-dimensional space $x_{n-1} x_{n-2} \ldots x_0$, where $x_i \in \{0, 1\}$. In this context, a bit-dimension permutation, $\sigma$, defines a reordering of the data according to a permutation of the coordinates in the space [1]. This allows the permutation to be defined on a set of $n$ elements instead of on $2^n$ values, which is, most of the time, mathematically inaccessible [32].
In general, a bit-dimension permutation is a permutation
$$\sigma(u_{n-1} u_{n-2} \ldots u_0) = u_{\sigma(n-1)} u_{\sigma(n-2)} \ldots u_{\sigma(0)} \tag{1}$$
which transforms a point in the space $u_{n-1} u_{n-2} \ldots u_0$ into a new point $u_{\sigma(n-1)} u_{\sigma(n-2)} \ldots u_{\sigma(0)}$ whose coordinates are a permutation of the coordinates of the original point. Before the permutation, $x_i = u_i$ and, after the permutation, $x_i = u_{\sigma(i)}$.
A. Types of Bit-Dimension Permutations
A bit-dimension permutation that only involves two dimensions, $x_j$ and $x_k$, and exchanges their coordinates is called an EBE [32]. This EBE can be represented as $\sigma: x_j \leftrightarrow x_k$ [9]. Throughout this paper, we also use the notation $(j\ k)$ to represent this EBE.
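A perfect shuffle [2] is a circular permutation of one bit to the left
$$\sigma(u_{n-1} u_{n-2} \ldots u_0) = u_{n-2} \ldots u_0 u_{n-1}.$$
Likewise, a perfect unshuffle is a circular permutation of one bit to the right
$$\sigma(u_{n-1} u_{n-2} \ldots u_0) = u_0 u_{n-1} \ldots u_1.$$
A stride-by-$2^s$ permutation [5], [7] is a circular permutation of $s$ bits to the left, and it can be expressed as a composition of $s$ perfect shuffles
$$\sigma(u_{n-1} u_{n-2} \ldots u_0) = u_{n-s-1} \ldots u_0 u_{n-1} \ldots u_{n-s}.$$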
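The definitions above can be tried out numerically. The following is a minimal sketch, not part of the original paper: the function names and the convention of writing $\sigma$ as a list of subindices in MSB-first order (the order in which the paper writes $u_{\sigma(n-1)} \ldots u_{\sigma(0)}$) are assumptions made for illustration.

```python
def output_position(p0, sigma_msb_first):
    """Output position P1 of the datum that enters at position p0."""
    n = len(sigma_msb_first)
    sigma = sigma_msb_first[::-1]  # sigma[i]: subindex of the u bit placed in x_i
    p1 = 0
    for i in range(n):
        p1 |= ((p0 >> sigma[i]) & 1) << i  # x_i = u_{sigma(i)}
    return p1


def output_order(sigma_msb_first):
    """List whose entry q is the input position delivered at output position q."""
    size = 1 << len(sigma_msb_first)
    order = [0] * size
    for p0 in range(size):
        order[output_position(p0, sigma_msb_first)] = p0
    return order


def stride(n, s):
    """Circular shift of s bits to the left, written MSB first."""
    return [(n - 1 - s - k) % n for k in range(n)]


def perfect_shuffle(n):
    return stride(n, 1)


def perfect_unshuffle(n):
    return stride(n, n - 1)


def bit_reversal(n):
    return list(range(n))  # u_0 u_1 ... u_{n-1}
```

For $n = 3$, `perfect_shuffle(3)` gives `[1, 0, 2]`, i.e., $u_1 u_0 u_2$, and `output_order` interleaves the two halves of the input sequence.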
III. MODELING A HARDWARE DATA FLOW
In this section, we propose a new model to describe a hardware data flow. The model considers a continuous data flow of $N = 2^n$ data in an $n$-dimensional space $x_{n-1} x_{n-2} \ldots x_0$.
A. Serial and Parallel Dimensions
In a hardware circuit, data flows in series and/or in parallel. Data flowing in series are provided to the same terminal at different clock cycles. Data flowing in parallel are provided at the same time to different terminals. Fig. 2 shows the definition of serial and parallel dimensions in a data flow. As a convention, data flow from left to right, $x_0$ to $x_{p-1}$ are parallel dimensions, and $x_p$ to $x_{n-1}$ are serial ones. This means that there are $p$ parallel dimensions and $n - p$ serial dimensions. This also means that the data flow is modeled as a rectangle of $2^p$ data in parallel times $2^{n-p}$ in series.
B. Position, Time, and Terminal
In the data flow, we can define the position occupied by each datum. As the number of data is $N = 2^n$, the positions are numbered from 0 to $2^n - 1$ according to
$$P = \sum_{i=0}^{n-1} x_i 2^i.$$
In other words, we can say that $P \equiv x_{n-1} x_{n-2} \ldots x_0$, where ($\equiv$) relates the decimal and binary representations of a number.
In the data flow, we can define the time of arrival and the terminal of the datum in any position $P$. The time of arrival is calculated as
$$t(P) = \sum_{i=p}^{n-1} x_i 2^{i-p} \tag{6}$$
and the input terminal is
$$T(P) = \sum_{i=0}^{p-1} x_i 2^i. \tag{7}$$
Note that $t(P)$ is the time of arrival relative to the arrival of the first sample at a given point of the circuit. Therefore, $t(P) = 0$ means that the sample in position $P$ is the first one to arrive at that point of the circuit. Note that the input terminal is only determined by the parallel dimensions, whereas the time of arrival only depends on the serial ones. In addition, there exists a total of $2^p$ terminals, which are numbered from $T = 0$ to $T = 2^p - 1$ according to (7), and all the data arrive in $2^{n-p}$ clock cycles, from $t = 0$ to $t = 2^{n-p} - 1$ according to (6).
Finally, the vertical bar (|) is used throughout this paper to separate the serial and parallel dimensions. According to this, we can represent the position as
$$P \equiv x_{n-1} \ldots x_p \,|\, x_{p-1} \ldots x_0.$$
Example: Fig. 3 shows a data flow with three dimensions. One of them is parallel and two are serial. The position is indicated in parentheses and numbered from 0 to 7. The time and terminal are also indicated in this figure. For instance, position $P = 5$ corresponds to $x_2 x_1 x_0 = 101$. As $p = 1$, the time of arrival is $t(P) = 2$, given by the serial bits $x_2 x_1 = 10$, and the terminal is $T(P) = x_0 = 1$.
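The split of a position into time and terminal amounts to a shift and a mask. The sketch below is illustrative only (the function names are assumptions, not taken from the paper); it relies on the statement that the terminal depends only on the $p$ parallel dimensions and the time only on the $n - p$ serial ones.

```python
def time_of_arrival(position, p):
    """Clock cycle in which the datum arrives: the serial bits x_{n-1}..x_p."""
    return position >> p


def terminal(position, p):
    """Input terminal of the datum: the parallel bits x_{p-1}..x_0."""
    return position & ((1 << p) - 1)


def data_flow(n, p):
    """Rectangle of positions: one row per terminal, one column per clock cycle."""
    rows = [[None] * (1 << (n - p)) for _ in range(1 << p)]
    for position in range(1 << n):
        rows[terminal(position, p)][time_of_arrival(position, p)] = position
    return rows
```

With `n = 3` and `p = 1`, position 5 arrives at time 2 on terminal 1, and `data_flow(3, 1)` lists the even positions on terminal 0 and the odd ones on terminal 1.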
C. Data Flow With Indexed Data
Signal processing algorithms define mathematical operations on indexed data. In our approach, we define the index of the data as $I \equiv b_{n-1} b_{n-2} \ldots b_0$ or, equivalently
$$I = \sum_{i=0}^{n-1} b_i 2^i.$$
Thus, $I$ represents the decimal value of the index and $b_i$ are the bits of its binary representation.
In a data flow with indexed data, the position is defined as a function of $b_i$. This allows each indexed datum to be assigned to a position in the data flow.
Example: In the data flow shown in 
D. Permuting a Continuous Data Flow
A shuffling circuit is represented by a function $\sigma$. The permutation that a circuit carries out is defined as $\sigma(u)$ and can be applied to a data flow with or without indexed data. If the input order is $P_0$ and the output order is $P_1$, then
$$P_1 = \sigma(P_0).$$
Example: $\sigma(u_2 u_1 | u_0) = u_2 u_0 | u_1$ defines the permutation of a shuffling circuit. When applying it to the input order $P_0 \equiv b_0 b_1 | b_2$, we obtain the output order $P_1 \equiv b_0 b_2 | b_1$.
IV. HARDWARE CIRCUITS FOR ELEMENTARY BIT-EXCHANGE
Any bit-dimension permutation can be decomposed into a series of EBEs [32]. This principle is followed in this paper to design circuits for bit-dimension permutations. This section describes the hardware circuits used to calculate EBEs. Sections V-VII explain how to use the EBEs to create circuits for bit-dimension permutations.
A. General Considerations
An EBE exchanges the coordinates of two dimensions, $x_j$ and $x_k$. Without loss of generality, let us assume that $j > k$.
If $\sigma$ is an EBE that calculates $\sigma(P_0) = P_1$, then $P_0$ and $P_1$ only differ in the coordinates that are exchanged, that is
$$P_0 \equiv x_{n-1} \ldots x_j \ldots x_k \ldots x_0, \qquad P_1 \equiv x_{n-1} \ldots x_k \ldots x_j \ldots x_0.$$
As $x_i \in \{0, 1\}$, samples for which $x_j = x_k$ will remain in the same position, as $P_1 = P_0$. Conversely, if $x_j \neq x_k$, the input position corresponds to one of these options
$$P_{0A} \equiv x_{n-1} \ldots 0 \ldots 1 \ldots x_0, \qquad P_{0B} \equiv x_{n-1} \ldots 1 \ldots 0 \ldots x_0,$$
where the two explicit bits occupy dimensions $x_j$ and $x_k$, respectively.
If the initial position is $P_0 = P_{0A}$, the EBE moves the sample to $P_1 = P_{0B}$. If $P_0 = P_{0B}$, the output position is $P_1 = P_{0A}$. Therefore, pairs of samples whose positions only differ in $x_j$ and $x_k$ are swapped. As a result, an EBE changes the position of half of the $N$ samples, i.e., those for which $x_j \neq x_k$. The rest of the samples are unaffected by the permutation and keep their positions. Note that this holds independently of the serial or parallel nature of the dimensions. However, depending on this nature, we can define three different cases: either both dimensions are parallel, or both are serial, or one of them is serial and the other one is parallel. Each of these cases leads to different shuffling circuits, as shown next.
B. Parallel-Parallel EBE
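If $x_j$ and $x_k$ are parallel, then $p > j > k$. This leads to
$$P_0 \equiv x_{n-1} \ldots x_p \,|\, x_{p-1} \ldots x_j \ldots x_k \ldots x_0$$
and the pairs of inputs whose position must be exchanged are
$$P_{0A} \equiv x_{n-1} \ldots x_p \,|\, x_{p-1} \ldots 0 \ldots 1 \ldots x_0, \qquad P_{0B} \equiv x_{n-1} \ldots x_p \,|\, x_{p-1} \ldots 1 \ldots 0 \ldots x_0.$$
As no serial dimensions are involved in the permutation, the difference in time between two inputs that must be switched is $\Delta t = t(P_{0B}) - t(P_{0A}) = 0$. Therefore, pairs of data whose position must be exchanged are received at the same time at different terminals of the shuffling structure. This means that the permutation can be carried out by rearranging the inputs at each time instant and the circuit does not need any delay element to store samples. In addition, according to (13), input data at terminal $T_0 \equiv x_{p-1} \ldots x_j \ldots x_k \ldots x_0$ are forwarded to the output terminal $T_1 \equiv x_{p-1} \ldots x_k \ldots x_j \ldots x_0$.
Consequently, a parallel-parallel EBE can be simply carried out by an interconnection between each input terminal, $T_0$, and the corresponding output one, $T_1$.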
C. Serial-Serial EBE
If both dimensions $x_j$ and $x_k$ are serial, then $j > k \geq p$. This leads to
$$P_0 \equiv x_{n-1} \ldots x_j \ldots x_k \ldots x_p \,|\, x_{p-1} \ldots x_0$$
and the pairs of inputs whose position must be exchanged are
$$P_{0A} \equiv x_{n-1} \ldots 0 \ldots 1 \ldots x_p \,|\, x_{p-1} \ldots x_0, \qquad P_{0B} \equiv x_{n-1} \ldots 1 \ldots 0 \ldots x_p \,|\, x_{p-1} \ldots x_0.$$
This means that pairs of input data that must be interchanged arrive at the same input terminal, because $T(P_{0A}) = T(P_{0B})$, and they are separated by a constant number of clock cycles
$$\Delta t = t(P_{0B}) - t(P_{0A}) = 2^{j-p} - 2^{k-p}.$$
Fig. 4 shows the circuit to carry out a serial-serial EBE. It consists of a buffer of length
$$L = 2^{j-p} - 2^{k-p}$$
and two multiplexers controlled by the same control signal, $S$. The latency of the circuit is equal to the length of the buffer, i.e., $\mathrm{Lat} = L$. This is the minimum number of delays that make the circuit causal and, therefore, implementable. The control signal of the multiplexers depends on the serial dimensions that are involved and is obtained as
$$S = \overline{x_j} \lor x_k.$$
Note that $S = 0$ only if $x_j = 1$ and $x_k = 0$, i.e., when a sample in position $P_{0B}$ is at the input of the circuit and the sample in position $P_{0A} = P_{0B} - \Delta t = P_{0B} - L$ is at the output of the buffer. As $S = 0$, both samples are interchanged. Otherwise, $S = 1$ and data are not permuted. When there exist parallel dimensions, the circuit is replicated in parallel for each input terminal. In this general case, the total number of delays is
$$D = 2^p \cdot L = 2^j - 2^k.$$
Permutations of serial data are used for bit reversal [9] and for the serial commutator FFT [20] , [21] .
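The behavior of the serial-serial circuit can be checked with a small cycle-by-cycle simulation. This is only a sketch for serial data ($p = 0$): Fig. 4 is not reproduced in the text, so the multiplexer wiring used here (on a swap, the input goes straight to the output while the buffer output is fed back into the buffer) is an assumption that matches the described behavior, and the function names are hypothetical.

```python
from collections import deque


def serial_serial_ebe(samples, n, j, k):
    """Simulate a buffer of length L = 2^j - 2^k with two multiplexers (p = 0)."""
    assert n > j > k >= 0 and len(samples) == 1 << n
    length = (1 << j) - (1 << k)
    buffer = deque([None] * length)
    outputs = []
    for t in range((1 << n) + length):  # extra cycles flush the buffer
        position = t % (1 << n)
        x_j = (position >> j) & 1
        x_k = (position >> k) & 1
        swap = x_j == 1 and x_k == 0  # corresponds to S = 0 in the text
        incoming = samples[t] if t < len(samples) else None
        delayed = buffer.popleft()
        if swap:
            outputs.append(incoming)  # input bypasses the buffer
            buffer.append(delayed)    # buffer output is stored again
        else:
            outputs.append(delayed)
            buffer.append(incoming)
    return outputs[length:]  # drop the first L cycles (latency)


def expected_ebe(samples, j, k):
    """Reference result: exchange bits j and k of every position."""
    def swapped(q):
        bit_j, bit_k = (q >> j) & 1, (q >> k) & 1
        q &= ~((1 << j) | (1 << k))
        return q | (bit_k << j) | (bit_j << k)
    return [samples[swapped(q)] for q in range(len(samples))]
```

For eight serial samples and the EBE $x_2 \leftrightarrow x_0$, the buffer has length 3 and the simulated output order is 0, 4, 2, 6, 1, 5, 3, 7.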
D. Serial-Parallel EBE
When the dimension $x_j$ is serial and $x_k$ is parallel, then $j \geq p > k$. This leads to
$$P_0 \equiv x_{n-1} \ldots x_j \ldots x_p \,|\, x_{p-1} \ldots x_k \ldots x_0.$$
Accordingly, pairs of input samples that must be interchanged arrive at different terminals because $T(P_{0A}) \neq T(P_{0B})$, and they are separated in time by
$$\Delta t = t(P_{0B}) - t(P_{0A}) = 2^{j-p}. \tag{24}$$
In order to do the swapping, the input sample at terminal $T(P_{0A})$, which arrives first, must wait $\Delta t$ clock cycles until the other sample arrives. Fig. 5 shows the circuit that permutes a parallel dimension with a serial one. It consists of two buffers and two multiplexers, where the length of each buffer is directly determined by (24) and the control signal is
If there is more than one parallel dimension, the circuit in Fig. 5 is replicated in parallel $2^{p-1}$ times. This leads to a total number of delays
$$D = 2^{p-1} \cdot 2L = 2^j.$$
Examples of serial-parallel permutations can be found in the parallel feedforward FFT architectures [16] , [17] .
E. Implementation of the Delays Using Memories
Although the circuits have been described in terms of delays, these delays can be implemented in hardware by memories that act as buffers. In fact, several delays in the permutation circuits can be grouped together to form a bigger memory [33] . This reduces the power consumption and the area of the circuit if the number of delays is large [34] .
V. COST OF A BIT-DIMENSION PERMUTATION
The costs of the circuits in Section IV are summarized in Table I . The cost is shown in terms of a total number of delays, D, buffer length and latency, L, and multiplexers, M. 
For any bit-dimension permutation, the number of delays is calculated from the cost of the EBEs that it consists of, according to
$$D(\sigma) = \sum_{i=1}^{Q} D(\sigma_i) - \sum_{r} \min\big(D(\sigma_r), D(\sigma_{r+1})\big) \tag{27}$$
where $Q$ is the total number of EBEs and $r$ corresponds to the values for which $\sigma_r$ and $\sigma_{r+1}$ are both serial-parallel EBEs that share the parallel dimension. The latency of a bit-dimension permutation is related to the number of delays by the number of parallel samples $P = 2^p$, i.e., $\mathrm{Lat} = D(\sigma)/2^p$. Finally, the number of multiplexers is equal to the sum of the number of multiplexers of the individual permutations, that is
$$M(\sigma) = \sum_{i=1}^{Q} M(\sigma_i).$$
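Example: The perfect shuffle $\sigma(u_2 u_1 | u_0) = u_1 u_0 | u_2$ can be calculated in three ways depending on how we break down the permutation, i.e., $\sigma = \sigma_1 \circ \sigma_2 = \sigma_3 \circ \sigma_1 = \sigma_2 \circ \sigma_3$, where $\sigma_1: x_1 \leftrightarrow x_0$, $\sigma_2: x_2 \leftrightarrow x_1$, and $\sigma_3: x_2 \leftrightarrow x_0$ are EBEs. The permutations $\sigma_1$ and $\sigma_3$ are serial-parallel and the permutation $\sigma_2$ is serial-serial. According to Table I and considering that $p = 1$, the number of delays and multiplexers of the EBEs is
$$D(\sigma_1) = 2,\quad D(\sigma_2) = 2,\quad D(\sigma_3) = 4,\qquad M(\sigma_1) = 2,\quad M(\sigma_2) = 4,\quad M(\sigma_3) = 2.$$
By implementing the EBEs according to Section IV, the three circuits to calculate $\sigma$ in Fig. 6 are obtained. The number of delays for the three implementations according to (27) is
$$D(\sigma_1 \circ \sigma_2) = 2 + 2 = 4,\qquad D(\sigma_3 \circ \sigma_1) = 4 + 2 - 2 = 4,\qquad D(\sigma_2 \circ \sigma_3) = 2 + 4 = 6.$$
In the case of $\sigma_3 \circ \sigma_1$, both permutations are serial-parallel, so we subtract the minimum cost among them, which is $D(\sigma_1)$. This fact can be observed in Fig. 6(b), where two delays in the parallel branches between multiplexers can be removed, thanks to pipelining, leading to a total of four delays. The latency is then $\mathrm{Lat} = 4/2^1 = 2$ clock cycles for $\sigma_1 \circ \sigma_2$ and $\sigma_3 \circ \sigma_1$, and $\mathrm{Lat} = 6/2^1 = 3$ for $\sigma_2 \circ \sigma_3$. Finally, the number of multiplexers is
$$M(\sigma_1 \circ \sigma_2) = 6,\qquad M(\sigma_3 \circ \sigma_1) = 4,\qquad M(\sigma_2 \circ \sigma_3) = 6.$$
As a result, the permutation with the lowest cost is $\sigma_3 \circ \sigma_1$.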
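The cost bookkeeping of this section can be sketched in a few lines. Table I is not reproduced in the text, so the per-EBE costs below are an assumption inferred from Sections IV and VII-D: $2^j - 2^k$ delays and $2P$ multiplexers for a serial-serial EBE, $2^j$ delays and $P$ multiplexers for a serial-parallel EBE, and no cost for a parallel-parallel EBE. The function names are hypothetical.

```python
def ebe_cost(j, k, p):
    """Return (delays, multiplexers, kind) for the EBE x_j <-> x_k."""
    j, k = max(j, k), min(j, k)
    parallel_streams = 1 << p
    if j < p:
        return 0, 0, "pp"                                   # wiring only
    if k >= p:
        return (1 << j) - (1 << k), 2 * parallel_streams, "ss"
    return 1 << j, parallel_streams, "sp"


def sequence_cost(ebes, p):
    """Delays and multiplexers of a chain of EBEs, following (27)."""
    costs = [ebe_cost(j, k, p) for j, k in ebes]
    delays = sum(c[0] for c in costs)
    multiplexers = sum(c[1] for c in costs)
    for r in range(len(ebes) - 1):
        both_sp = costs[r][2] == "sp" and costs[r + 1][2] == "sp"
        same_parallel_dim = min(ebes[r]) == min(ebes[r + 1])
        if both_sp and same_parallel_dim:
            # Consecutive serial-parallel EBEs share buffers (pipelining).
            delays -= min(costs[r][0], costs[r + 1][0])
    return delays, multiplexers
```

Under these assumptions, `sequence_cost([(2, 0), (1, 0)], 1)` returns `(4, 4)`, which agrees with the four delays stated for $\sigma_3 \circ \sigma_1$.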
VI. THEORETICAL LIMITS
A. Minimum Latency
The latency of a permutation circuit is the difference between the time when the first input arrives at the circuit and the time when the first output is provided.
The time that a certain input is inside the permutation circuit, $t_I$, is equal to the time of departure, $t(P_1)$, minus the time of arrival, $t(P_0)$, plus the circuit latency. Note that the time of arrival/departure defined in Section III is relative to the arrival/departure of the first input/output. This gives
$$t_I = t(P_1) - t(P_0) + \mathrm{Lat}$$
where the time $t_I$ needs to be greater than or equal to zero to make the circuit causal. This leads to
$$\mathrm{Lat} \geq t(P_0) - t(P_1).$$
Therefore, the minimum latency is
$$\mathrm{Lat}_{\min} = \max\big(t(P_0) - t(P_1)\big).$$
Consequently, the minimum latency is set by the datum that arrives the latest with respect to the time in which it should be provided, which will force the rest of data to wait for it.
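For small $n$, the minimum latency can be found by exhaustive search over all positions. The sketch below is illustrative and assumes that $\sigma$ is given as a list of subindices in MSB-first order, as the paper writes it; the function name is hypothetical.

```python
def min_latency_bruteforce(sigma_msb_first, p):
    """Largest t(P0) - t(P1) over all input positions P0."""
    n = len(sigma_msb_first)
    sigma = sigma_msb_first[::-1]  # x_i = u_{sigma[i]} at the output
    worst = 0
    for p0 in range(1 << n):
        p1 = 0
        for i in range(n):
            p1 |= ((p0 >> sigma[i]) & 1) << i
        # Time of arrival and departure depend only on the serial bits.
        worst = max(worst, (p0 >> p) - (p1 >> p))
    return worst
```

For the perfect shuffle $\sigma(u_2 u_1 | u_0) = u_1 u_0 | u_2$, the search returns 2 clock cycles, and for the permutation of the Appendix it returns 83.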
B. Minimum Number of Delays
The outputs of a circuit are provided $\mathrm{Lat}_{\min}$ clock cycles after the inputs are received. The minimum number of delays is then equal to the amount of data stored during this time
$$D_{\min} = P \cdot \mathrm{Lat}_{\min} = P \cdot \max\big(t(P_0) - t(P_1)\big) \tag{35}$$
where $P = 2^p$ is the number of parallel inputs. For serial data ($P = 1$), this lower bound was already derived in [23]. For the general case, we continue the analysis as follows.
By substituting (6) in (35) and taking into account that $x_i = u_i$ at the input and $x_i = u_{\sigma(i)}$ at the output according to (1), we obtain
$$D_{\min} = \max\left(\sum_{i=p}^{n-1} u_i 2^{i} - \sum_{i=p}^{n-1} u_{\sigma(i)} 2^{i}\right).$$
The permutation $\sigma$ is a bijection and the relationship between $i$ and $\sigma(i)$ is the same as the relationship between $\sigma^{-1}(i)$ and $i$. Therefore, by applying the variable change $i \rightarrow \sigma^{-1}(i)$ to the second summation, we obtain
$$D_{\min} = \max\left(\sum_{i=p}^{n-1} u_i 2^{i} - \sum_{\sigma^{-1}(i) \geq p} u_i 2^{\sigma^{-1}(i)}\right).$$
As $u_i \in \{0, 1\}$, each term of the first summation adds a positive number or zero and each term of the second summation subtracts a positive number or zero. When $2^i > 2^{\sigma^{-1}(i)}$, the sum of the terms corresponding to the same $u_i$ is positive or zero, and if $2^i < 2^{\sigma^{-1}(i)}$, the sum of the terms corresponding to the same $u_i$ is negative. Therefore, the maximum occurs when $u_i = 1$ for $i > \sigma^{-1}(i)$ and $u_i = 0$ for $i < \sigma^{-1}(i)$. This leads to
$$D_{\min} = \sum_{\substack{i \geq p \\ i > \sigma^{-1}(i)}} 2^{i} \;-\; \sum_{\substack{\sigma^{-1}(i) \geq p \\ i > \sigma^{-1}(i)}} 2^{\sigma^{-1}(i)}. \tag{38}$$
This equation results in Algorithm 1, which is used to calculate the cost of the optimum permutation. Equation (39) shows how to apply Algorithm 1 to the permutation σ (u 4 u 3 u 2 u 1 |u 0 ) = u 2 u 1 u 0 u 4 |u 3 . In this case, p = 1 and 
Finally, the upper bound of (38) for a permutation of N = 2 n elements and P = 2 p parallel streams is
Note that the number of delays is always smaller than N, i.e., D UB < N.
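Since the listing of Algorithm 1 is not reproduced in the text, the following is only a sketch of the computation that the derivation above describes: each $u_i$ contributes the weight of its input dimension minus the weight of its output dimension $\sigma^{-1}(i)$ when that difference is positive, with parallel dimensions weighing zero. The function name and the MSB-first list convention are assumptions.

```python
def min_delays(sigma_msb_first, p):
    """Theoretical minimum number of delays of a bit-dimension permutation."""
    n = len(sigma_msb_first)
    sigma = sigma_msb_first[::-1]      # x_i = u_{sigma[i]} at the output
    destination = [0] * n
    for i in range(n):
        destination[sigma[i]] = i      # destination[i] = sigma^{-1}(i)

    def weight(dimension):
        # Serial dimensions weigh 2^dimension; parallel ones do not add time.
        return (1 << dimension) if dimension >= p else 0

    return sum(max(0, weight(i) - weight(destination[i])) for i in range(n))
```

Applied to the permutation of the Appendix with $p = 1$, the function returns 166, the value reported there for $D_{\min}$.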
C. Dealing With Cycles
Permutations can be broken down into cycles. Different cycles in a bit-dimension permutation do not share any dimension. As cycles are not mixed, when calculating the minimum number of delays according to Algorithm 1, each column can only correspond to one cycle. Thus, the minimum number of delays for a cycle is obtained by adding the positive values in the columns that correspond to that cycle. Likewise, the minimum number of delays in a bit-dimension permutation is obtained as the sum of the minimum number of delays of the individual cycles. The example in the Appendix illustrates this.
Algorithm 1 Theoretical Minimum Number of Delays
VII. OPTIMUM CIRCUITS FOR BIT-DIMENSION PERMUTATION
In this section, we propose a methodology to obtain optimum circuits for bit-dimension permutations. First, the permutation is broken down into cycles. Then, for each cycle, the optimum permutation is obtained, which depends on the serial or parallel nature of the dimensions involved.
A. Decomposing the Permutation Into Cycles
For the optimization purpose, permutation cycles can be treated independently. Each cycle can be one among the following three types.
1) The cycle only includes parallel dimensions.
2) The cycle only includes serial dimensions.
3) The cycle includes serial and parallel dimensions. The case when the cycle only includes parallel dimensions is straightforward: the optimum circuit simply consists of connecting each input terminal to the corresponding output terminal, as discussed in Section IV-B. For the other two cases, Sections VII-B and VII-C show how to achieve the optimum circuit.
Also, note that the number of EBEs of each cycle is one less than the number of dimensions that the cycle involves. Therefore, if $c$ is the number of cycles in the permutation, the total number of EBEs of a permutation is
$$Q = n - c. \tag{41}$$
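The cycle decomposition and the classification of each cycle can be sketched as follows. The function names are hypothetical, $\sigma$ is assumed to be given as a list of subindices in MSB-first order, and fixed dimensions are counted as cycles of length one so that the count of EBEs equals $n - c$.

```python
def cycles(sigma_msb_first):
    """Cycles of the permutation of dimensions, as lists of dimension indices."""
    n = len(sigma_msb_first)
    sigma = sigma_msb_first[::-1]  # x_i = u_{sigma[i]} at the output
    seen = [False] * n
    found = []
    for start in range(n - 1, -1, -1):
        if seen[start]:
            continue
        cycle, i = [], start
        while not seen[i]:
            seen[i] = True
            cycle.append(i)
            i = sigma[i]
        found.append(cycle)
    return found


def cycle_type(cycle, p):
    """Classify a cycle by the nature of the dimensions it involves."""
    serial = sum(1 for d in cycle if d >= p)
    if serial == 0:
        return "parallel"
    return "serial" if serial == len(cycle) else "mixed"


def number_of_ebes(sigma_msb_first):
    """One EBE fewer than the number of dimensions in each cycle."""
    return sum(len(c) - 1 for c in cycles(sigma_msb_first))
```

For the permutation of the Appendix, this yields the two cycles over $\{x_7, x_6, x_2, x_1\}$ and $\{x_5, x_4, x_3, x_0\}$ and a total of six EBEs.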
B. Cycles With Only Serial Dimensions
1) Problem With Elevators:
The optimization for cycles with only serial dimensions is analogous to the following problem with elevators. Once the problem of elevators is understood, it is easy to apply it to our optimization problem.
Let us assume that a building has n elevators that can move between floors F = 0 and F = n − 1. Each elevator can be in any of the n floors. However, there is always one elevator at each floor.
Each elevator has a number. This number corresponds to the floor that the elevator must reach.
Elevators can move. The movement is done in pairs of elevators that change floor. This exchange is done to respect the rule of one elevator per floor. The cost of moving one elevator along several floors depends on the initial and final floors. As long as the elevator moves toward its final floor, the cost will be the same independent of the number of stops in intermediate floors. However, if an elevator gets further than its target floor, there will be an extra cost.
For this problem, we want to calculate the most efficient movements in order to make all the elevators reach their destination floors.
Solution: As pairs of elevators exchange floors, in order to move them, it is necessary that one of them moves down and the other one moves up. It is also necessary that each of them reaches the floor where the other one was. Fig. 7 shows all the cases in which two elevators moving in different directions can be. The elevator j is in floor F( j ) and aims to reach floor j . The elevator k is in floor F(k) and aims to reach floor k. Without loss of generality, we consider that j > k. This means that the elevator j moves up and the elevator k moves down. In other words, it is fulfilled that F( j ) < j and F(k) > k.
Among the cases in Fig. 7, in (a), (b), and (c), $j$ cannot reach $F(k)$ without surpassing its destination. In cases (c), (e), and (m), $k$ cannot reach $F(j)$ without surpassing its destination. In case (o), a swap of the floors would only move the elevators further from their destinations. Therefore, the cases that advance toward the destinations without incurring additional costs are the cases (d), (g), (h), and (n). These cases share the properties $j \geq F(k)$, $F(k) > F(j)$, and $F(j) \geq k$. Therefore, the minimum cost is achieved as long as the movements fulfill these properties.
Algorithm 2 Obtaining the Optimum Permutation for Cycles With Only Serial Dimensions
2) Optimizing Cycles With Only Serial Dimensions:
The optimization of cycles with only serial dimensions is done by translating it into the problem with elevators. This is possible because, for serial-serial permutations, the cost of moving up (analogously down) any $u_i$ from $x_k$ to $x_j$ is the same when it is done directly, i.e., $D = 2^j - 2^k$, and when there are intermediate stops, e.g., by stopping in an intermediate dimension $x_m$ with $j > m > k$, which gives $D = (2^m - 2^k) + (2^j - 2^m) = 2^j - 2^k$.
Once the problem has been translated into a problem with elevators, the next step is to identify allowed movements that respect the properties j ≥ F(k), F(k) > F( j ), and F( j ) ≥ k, which guarantee the minimum cost. Each of these movements leads to a new building and each of these buildings creates a branch of a tree.
Then, the process repeats for each branch of the tree and continues until all the elevators reach their destination.
In the end, each branch of the tree represents an optimum permutation. The sequence of permutations to reach one optimum is obtained by following the tree from the top to the end of any branch.
If, instead of obtaining all optimum solutions, we only need one of them, Algorithm 2 obtains such permutation. This algorithm searches for feasible movements of the elevators and, when it finds one, it collects the EBE, does the swap corresponding to that EBE and continues from that point, i.e., it does not search all the optimum cases but follows the first that it finds.
Example: Let us consider σ (u 4 u 3 u 2 u 1 u 0 ) = u 1 u 4 u 0 u 2 u 3 . First, it is translated into the building with elevators at the top of Fig. 8 . The floor in which each elevator starts is equal to the numbers on the top of the building, which corresponds to the subindex i of u i at the output of the given permutation.
An allowed movement is to swap the elevators 4 and 0, which are in floors 3 and 1, respectively. Another allowed movement is to swap the elevators 4 and 1, which are in floors 3 and 2, respectively. These two cases lead to the two buildings to the sides of the top one. Then, the process repeats for each of the resulting buildings, until the tree is finished.
In the end, any branch of the tree represents an optimum movement. For instance, by going from the top to the most left branch, we obtain the EBEs (3 1), (1 0), (2 1), and (4 3). It can be checked that this sequence of EBEs carries out the desired permutation and its cost in terms of delays is 17, which corresponds to the theoretical minimum in Section VI.
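The greedy search described for Algorithm 2 can be sketched as below. The listing of Algorithm 2 is not reproduced in the text, so this is an illustrative reading of the description rather than the authors' algorithm: elevator $k$ is taken to start at the floor given by the subindex written at output position $k$, a move is allowed when $j \geq F(k) > F(j) \geq k$, and the first allowed move found is taken. All names are hypothetical.

```python
def serial_cycle_ebes(sigma_msb_first):
    """Greedy sequence of EBEs for a permutation of serial dimensions."""
    n = len(sigma_msb_first)
    floor = list(sigma_msb_first[::-1])  # floor[k]: current floor of elevator k
    ebes = []
    while any(floor[k] != k for k in range(n)):
        move = next(((j, k) for j in range(n - 1, -1, -1) for k in range(j)
                     if j >= floor[k] > floor[j] >= k), None)
        if move is None:
            raise ValueError("no allowed movement found")
        j, k = move
        ebes.append((floor[k], floor[j]))  # the two floors that are exchanged
        floor[j], floor[k] = floor[k], floor[j]
    return ebes


def apply_ebes(n, ebes):
    """Apply the EBEs in order and return the result in MSB-first notation."""
    layout = list(range(n))  # layout[i]: subindex of the u held in x_i
    for a, b in ebes:
        layout[a], layout[b] = layout[b], layout[a]
    return layout[::-1]


def serial_delays(ebes):
    """Delays of a chain of serial-serial EBEs for serial data (p = 0)."""
    return sum((1 << a) - (1 << b) for a, b in ebes)
```

For the example permutation $u_1 u_4 u_0 u_2 u_3$, the sketch returns four EBEs with a total of 17 delays. The order in which it lists them may differ from the leftmost branch of Fig. 8, since any branch of the tree is optimum.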
3) Proof of Optimality: To prove optimality, we note that our solution only considers the movements in Fig. 7(d), (g), (h), and (n). All these movements move elevators closer to their final floor and guarantee that the final cost is optimum, since no cost apart from the minimum is introduced. Furthermore, by following any of these movements, the final cost is the same, as it is independent of the stops in the intermediate floors. What remains to be proved is that, at any step of the algorithm, we can always apply at least one of these movements. Any step of the algorithm consists of one or more cycles. If a given cycle involves only two dimensions, then it will be equal to the case in Fig. 7(g). If the cycle involves more than two dimensions, then the upper part of the cycle must look like the case in Fig. 7(d) or (e). They, together with Fig. 7(g), are the only cases that can create the upper part of the cycle. The case in Fig. 7(d) is already one of the valid movements. For Fig. 7(e), if $F(j)$ is the lowest floor in the cycle, i.e., $F(j) = F_{\min}$, the elevator $j$ will go from the lowest to the highest floor. This forces the bottom part of the cycle to be closed with Fig. 7(h), which is a valid movement. In Fig. 7(e), if $F(j) > F_{\min}$, there will exist an elevator $h < F(j)$ that comes from $F(h) > F(j)$, which allows for reaching the floors under $F(j)$. Otherwise, the cycle could not be closed. In this case, we can apply the movement in Fig. 7(n) to the elevators $j$ and $h$. Therefore, any step of the algorithm has at least one valid movement. This guarantees that the algorithm always reaches an optimum permutation.
C. Cycles With Serial and Parallel Dimensions
1) Optimizing Cycles With Serial and Parallel Dimensions:
When a cycle includes serial and parallel dimensions, the circuit is optimized by using one of the parallel dimensions as a pivot. Thus, all the EBEs are carried out between the pivot dimension and another dimension. This transforms all serial-serial EBEs into serial-parallel ones, which follows the ideas in Section V and results in fewer multiplexers and the same number of delays or fewer than using serial-serial permutations. The order of the EBEs that must be carried out is obtained easily. Starting with the pivot dimension, the value $u_i$ is moved to its corresponding place at the output. This not only allocates $u_i$ in its place but also moves another value, $u_{i'}$, to the pivot dimension. Next, $u_{i'}$ is moved to its corresponding place at the output, and a new value, $u_{i''}$, is moved to the pivot dimension. The procedure continues in the same way until all the values reach their place. Note that $i' = \sigma^{-1}(i)$ and $i'' = \sigma^{-2}(i)$ according to the definition in Section II.
The previous procedure results in Algorithm 3. An example of the application of this algorithm is given in the Appendix.
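The pivot procedure can be sketched as follows. The listing of Algorithm 3 is not reproduced in the text, so this is an illustrative reading of the description above; the function names and the MSB-first list convention for $\sigma$ are assumptions.

```python
def pivot_cycle_ebes(sigma_msb_first, pivot):
    """EBEs between the pivot and the other dimensions of its cycle."""
    n = len(sigma_msb_first)
    sigma = sigma_msb_first[::-1]  # x_i = u_{sigma[i]} at the output
    destination = [0] * n
    for i in range(n):
        destination[sigma[i]] = i  # destination[i] = sigma^{-1}(i)
    ebes = []
    target = destination[pivot]    # place of the value that starts in the pivot
    while target != pivot:
        ebes.append((target, pivot))
        # The value displaced into the pivot is u_target; move it next.
        target = destination[target]
    return ebes


def apply_ebes(n, ebes):
    """Apply the EBEs in order; entry i is the subindex of the u held in x_i."""
    layout = list(range(n))
    for a, b in ebes:
        layout[a], layout[b] = layout[b], layout[a]
    return layout
```

For the second cycle of the Appendix, with $x_0$ as the pivot, the sketch returns the EBEs (5 0), (3 0), and (4 0).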
2) Proof of Optimality: All the resulting permutations are serial-parallel or parallel-parallel. By including the costs in Table I in (27) , the cost of the resulting permutation is
As $i' = \sigma^{-1}(i)$, this corresponds to the minimum number of delays in (38).
D. Number of Multiplexers
In cycles that only include serial dimensions, all the EBEs are serial-serial. Therefore, each serial path includes two multiplexers per EBE, leading to a total of
$$M = 2P(s_C - 1) \tag{44}$$
where $s_C$ is the number of serial dimensions in the cycle and $s_C - 1$ is equal to the number of EBEs in the cycle.
In cycles with at least one parallel dimension, the optimum permutation consists of a sequence of serial-parallel permutations. These permutations require one multiplexer per parallel branch and per serial-parallel permutation, that is
$$M = s_C P. \tag{45}$$
Based on this, for any bit-dimension permutation, the upper bound for the number of multiplexers in the proposed approach is
VIII. COMPARISON
Table II
For this permutation, the proposed approach and other delay-based approaches in Table II have less complexity than memory-based approaches either in the amount of delays/memory, or the number of multiplexers, or both.
The permutation (B) is a bit reversal of $N$ data arriving in $P$ parallel streams with $N > P^2$. By using the proposed approach, the bit reversal is broken down into the EBEs $\sigma_i: x_i \leftrightarrow x_{n-1-i}$.
For either parallel data with N > P 2 or serial data, the permutation consists of p serial-parallel EBEs σ i :
which results in
and
The resulting circuits for serial and parallel bit reversal using the proposed approach are the same as those in [9] and [10] , respectively. Therefore, the proposed approach is capable of obtaining optimum circuits for bit reversal for any P, and [9] and [10] are only specific cases of the framework provided in this paper. Also, note that the cost of the bit-reversal permutation, both for serial and parallel data, corresponds to the upper bound defined in (40). This means that the bit reversal is the most costly bit-dimension permutation.
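The statement that bit reversal is the most costly bit-dimension permutation can be checked exhaustively for small sizes. The sketch below is illustrative: it recomputes the minimum number of delays as described in Section VI and compares bit reversal against every permutation of $n$ dimensions. The function names are hypothetical and the closed-form bound (40) is not used.

```python
from itertools import permutations


def min_delays(sigma_msb_first, p):
    """Minimum delays: sum of the positive weight differences of each u_i."""
    n = len(sigma_msb_first)
    sigma = sigma_msb_first[::-1]
    destination = [0] * n
    for i in range(n):
        destination[sigma[i]] = i

    def weight(dimension):
        return (1 << dimension) if dimension >= p else 0

    return sum(max(0, weight(i) - weight(destination[i])) for i in range(n))


def worst_case_delays(n, p):
    """Largest minimum number of delays among all n! bit-dimension permutations."""
    return max(min_delays(list(s), p) for s in permutations(range(n)))


def bit_reversal_delays(n, p):
    return min_delays(list(range(n)), p)  # u_0 u_1 ... u_{n-1}
```

For $n = 4$, the bit reversal needs 9 delays for serial data and 10 delays for $P = 2$, and no other permutation of four dimensions needs more. Other permutations can tie with it.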
Compared to previous approaches in Table II, the proposed approach requires fewer delays or less memory than previous memory-based approaches. As we consider $N > P^2$, the $P \log_2 P$ multiplexers in [27] are fewer than the $P \log_2(N/P)$ multiplexers of the proposed approach. Therefore, there is a tradeoff between delays/memory and multiplexers.
The permutation (C) is $\sigma(u_4 u_3 u_2 u_1 | u_0) = u_2 u_1 u_0 u_4 | u_3$, which is a stride permutation that has been used in [8]. Fig. 9(b) and (c) shows the proposed solution and the timing diagram, respectively. In this case, memory-based approaches require fewer multiplexers at the cost of noticeably more delays/memory. All delay-based approaches require the theoretical minimum amount of delays/memory, and the proposed approach requires the fewest multiplexers among them.
The permutation (D) is σ (u 4 u 3 u 2 |u 1 u 0 ) = u 3 u 0 u 1 |u 4 u 2 , which is not a stride permutation. The proposed solution is shown in Fig. 10 . In this case, the proposed approach saves 68% of the memory and uses 50% more multiplexers with respect to those in [25] and [27] and saves 37% of the memory plus 25% of the multiplexers with respect to that in [26] and [27] . Previous delay-based approaches do not consider this permutation [8] or require a large number of multiplexers [24] .
Finally, there are some general conclusions. On the one hand, the proposed approach reduces the memory requirement with respect to memory-based approaches, and in most of the cases, the reduction is significant. This is derived from Fig. 11 , which shows the maximum and mean number of delays of the proposed circuits among all the permutations with the corresponding dimensions and parallelization, normalized to N. As some memory-based approaches [26] - [29] require a total memory of N, the values of the graph correspond to the ratio between the delays/memory of the proposed approach and in those memory-based approaches.
On the other hand, the proposed approach reduces the number of multiplexers compared to previous delay-based approaches [8] , [24] while having the minimum number of delays/memory. It also widens the scope, as some previous approaches [8] restrict to strides. 
IX. CONCLUSION
This paper has presented a new approach to design optimum circuits for bit-dimension permutations. It consists in breaking down any permutation into EBEs in an optimum way and, then, implementing these EBEs with hardware circuits.
In order to achieve optimum results, this paper analyzes the cost of the bit-dimension permutations in terms of the number of delays. A methodology to calculate this minimum number of delays and obtain the corresponding circuit is proposed.
Comparison to previous approaches shows that the proposed approach reduces the delays/memory with respect to the previous memory-based approach and the number of multiplexers with respect to previous delay-based approaches.
APPENDIX PRACTICAL CASE
This section illustrates the entire procedure to design the circuits for bit-dimension permutations. For this purpose, we consider the permutation σ (u 7 u 6 u 5 u 4 u 3 u 2 u 1 |u 0 ) = u 6 u 1 u 0 u 3 u 5 u 7 u 2 |u 4 . This permutation has p = 1 parallel dimensions and, therefore, P = 2 p = 2.
The permutation $\sigma$ has two cycles that involve $\{x_7 x_6 x_2 x_1\}$ and $\{x_5 x_4 x_3 x_0\}$, respectively, as shown in Fig. 12. According to (41), the circuit consists of $n - c = 8 - 2 = 6$ EBEs, three for each cycle. Note that the first cycle only involves serial dimensions, whereas the second one involves serial and parallel dimensions.
Fig. 12. Cycles of the permutation $\sigma(u_7 u_6 u_5 u_4 u_3 u_2 u_1 | u_0) = u_6 u_1 u_0 u_3 u_5 u_7 u_2 | u_4$. The two cycles involve the dimensions $\{x_7 x_6 x_2 x_1\}$ and $\{x_5 x_4 x_3 x_0\}$, respectively.
The latency and the number of delays and multiplexers can be calculated as follows. Following Algorithm 1, the number of delays is D min = 166 according to 
From the total number of delays, the first cycle has 124 + 2 = 126 delays and the second cycle has 24 + 16 = 40. For clarity, in (52), the columns that correspond to the first cycle are highlighted. The latency of the circuit is obtained from (35) as Lat min = D min /P = 166/2 = 83 clock cycles.
As the first cycle only involves serial dimensions, the number of multiplexers is obtained from (44), leading to $M(\sigma_1) = (s_{C1} - 1) \cdot 2P = (4 - 1) \cdot 2 \cdot 2 = 12$. Likewise, the second cycle involves serial and parallel dimensions and the number of multiplexers is obtained from (45), leading to $M(\sigma_2) = s_{C2} P = 3 \cdot 2 = 6$. As a result, the total number of multiplexers is $M(\sigma) = M(\sigma_1) + M(\sigma_2) = 12 + 6 = 18$.
The next step is to calculate the EBEs of the permutation. For the first cycle, we have the exercise with elevators shown in Fig. 13(a). One solution to this problem is the sequence of EBEs (7 6), (2 1), and (6 2), which require 64, 2, and 60 delays, leading to the expected total of 126 delays.
For the second cycle, there is only one parallel dimension, which we use as the pivot dimension. According to Algorithm 3, we obtain
$$0 \rightarrow 5 \rightarrow 3 \rightarrow 4 \rightarrow 0 \tag{53}$$
which leads to the sequence of EBEs (5 0), (3 0), and (4 0). According to (27), this sequence of EBEs requires $32 + 8 + 16 - 8 - 8 = 40$ delays, which corresponds to the expected value.
Finally, the obtained EBEs are implemented with the circuits in Section IV, leading to the circuit in Fig. 13(b). Note that, as the cycles are independent, the order of the circuits that calculate the cycles can be exchanged.
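The multiplexer count of this example can be reproduced with a short sketch of the per-cycle rules of Section VII-D: $2P(s_C - 1)$ for a cycle with only serial dimensions and $s_C P$ for a cycle that contains at least one parallel dimension, where $s_C$ is the number of serial dimensions in the cycle. The function name and the representation of a cycle as a list of dimension indices are assumptions.

```python
def cycle_multiplexers(cycle, p):
    """Multiplexers needed by one cycle of a bit-dimension permutation."""
    parallel_streams = 1 << p
    serial_dims = [d for d in cycle if d >= p]
    if len(cycle) < 2 or not serial_dims:
        return 0  # fixed dimension or parallel-only cycle: wiring only
    if len(serial_dims) == len(cycle):
        # Only serial dimensions: two multiplexers per serial path and EBE.
        return 2 * parallel_streams * (len(serial_dims) - 1)
    # Serial and parallel dimensions: one multiplexer per branch and per EBE.
    return parallel_streams * len(serial_dims)


first_cycle = [7, 6, 2, 1]
second_cycle = [5, 4, 3, 0]
total = cycle_multiplexers(first_cycle, 1) + cycle_multiplexers(second_cycle, 1)
```

With $p = 1$, the two cycles need 12 and 6 multiplexers, so `total` equals the 18 multiplexers computed above.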
