Abstract. Many versions of the fast Fourier transform require a reordering of either the input or the output data that corresponds to reversing the order of the bits in the array index. There has been a surprisingly large number of papers on this subject in the recent literature. This paper collects 30 methods for bit reversing an array. Each method was recoded into a uniform style in Fortran and its performance measured on several different machines, each with a different memory system. This paper includes a description of how the memories of the machines operate to motivate two new algorithms that perform substantially better than the others.
1. Introduction. There is a wide variety of fast Fourier transform (FFT) algorithms. Some of them require extra working storage; some can be done in place. Some algorithms use vectors of varying lengths; some use long vectors throughout. Some algorithms take the input in natural order and produce output in natural order; some either require scrambled input or produce scrambled output [38]. Since they all do exactly the same number of arithmetic operations, memory access is the deciding factor in choosing one over another.
There are conflicting goals when deciding which algorithms to use. On a machine with a limited memory, an in-place algorithm is best. On a vector machine, algorithms with long vectors perform best. Unfortunately, many of these algorithms require a so-called bit reversal reordering of the data. If the bit reversal is not done properly, it can take a substantial fraction of the total time to do the FFT. In fact, it is common wisdom that bit reversal reordering is too slow to be used on a machine with a hierarchical memory [37]. Figure 1 illustrates the problem. It shows the time it takes to do the bit reversal of an array of the indicated length in machine cycles per element. Two methods are shown, a simple scatter operation [23] and one of the first published methods [7]. Also shown is the time it takes to do one FFT butterfly using an algorithm tuned to the IBM 3090 vector processor used. It is clear that the bit reversal step will take a substantial fraction of the total time for large arrays. One conclusion drawn from these measurements is that the performance will be poor if the algorithm is not designed with the memory structure in mind.
People have looked at measurements like this and concluded that autosort methods are best. The idea, which is a good one, is to do the data movement required by the bit reversal while storing the results of each butterfly. Any bit reversal algorithm that requires log2 N passes over the data can be combined with the log2 N butterfly steps to make an autosort method. However, the problem is not so much with the amount of data movement as it is with the pattern of data movement.
An autosort method is effectively limited to using a multipass bit reversal, which limits the data access patterns available for the FFT. Such methods don't perform optimally on vector processors because the vector length changes on each butterfly step. One can keep the vectors long by switching FFT algorithms partway through, but then a reordering equivalent to a matrix transpose is needed when switching algorithms. This transpose step is at least as expensive as the best bit reversal described in this paper.
The main point of this paper is to show that bit reversals are not to be feared. There are quite a few bit reversal algorithms that require only a single pass over the data. In many cases, a single pass bit reversal is quite fast, leaving more freedom in selecting an FFT algorithm. My own measurements, as yet unpublished, on a Cray Y-MP and an IBM 3090 vector processor show that the Pease algorithm with a separate bit reversal step is every bit as fast as the library routines with their less favorable data access patterns.
Another reason for this paper is the recent interest in bit reversals. There have been nearly 20 publications in the last several years on this subject. This study is the first comprehensive review of the literature in the last 15-20 years. (Rösel's [30] survey is more limited in scope than this one.)
While there is a good deal of interest in bit reversals on parallel processors [9, 15, 19, 20], this paper considers only those written for uniprocessors. I have collected 30 bit reversal algorithms from the literature and from a request made over na-net. Each has been coded into a uniform style of Fortran, tuned, and run on a number of machines: one node of an Intel iPSC/860, an IBM RS/6000 workstation, an HP-9000 workstation, and in vector and scalar modes on both a Cray Y-MP and an IBM 3090J/VF. Section 2 gives a summary of some ways of viewing the bit reversal. To help explain why some of the methods perform poorly, §3 gives a brief description of the operation of interleaved memory and hierarchical memory systems. Section 4 describes the machines used. Next, §5 describes the algorithms. A new algorithm for interleaved memory machines is given in §5.3, and a new algorithm designed for hierarchical memories is presented in §5.6. The performance of all these algorithms measured on all the machines is given next in §6. The paper finishes with an interesting historical note, a plot of the speed of the bit reversal versus year of publication.
2. Properties of bit reversal. A bit reversal reordering is a simple operation. For an array of length N = 2^n, exchange the two elements x(k) and x(k̄) for

    k = \sum_{j=0}^{n-1} a_j 2^j   and   \bar{k}_n = \sum_{j=0}^{n-1} a_j 2^{n-1-j}     (1)
where the a_j are either 0 or 1. From this equation we see that we need to know the value of the index and how many bits are involved. For example, a 4-bit reversal of 1 = 0001 is 1000 = 8, while a 5-bit reversal is 10000 = 16.
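As a concrete illustration, the n-bit reversal of a single index can be coded directly from (1) by peeling bits off one end of the index and pushing them onto the other. This sketch is mine and is not one of the thirty timed routines; the function name brev is illustrative.

      integer function brev(k, n)
      ! Reverse the low n bits of k, a direct coding of (1).
      integer k, n, j, kk
      kk = k
      brev = 0
      do j = 1, n
         brev = 2*brev + mod(kk, 2)   ! append the low bit of kk
         kk = kk/2                    ! and drop it from kk
      end do
      end

With this definition, brev(1,4) = 8 and brev(1,5) = 16, as in the example above.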
We can view the bit reversal as a permutation [23]. Let x be the vector of the x(k), and x̄ the vector of the x̄(k). Then we can write x̄ = Px, where P = P^{-1}. In other words, the bit reversal of a bit reversed array is the array in natural order.
In practice, the permutation is written as an index vector with components k̄. The index vector can be computed quite efficiently by Horner's rule,

    \bar{k}_n = a_{n-1} + 2(a_{n-2} + 2(\cdots + 2a_0)).
Our index vector computation produces k̄_n for all 0 ≤ k < N. We can save a lot of arithmetic if we use the fact that the high-order bits of the binary form of k are 0 when k is small. For example, we know that the last n - 2 bits of k̄ are zero for k < 4. This approach leads to the recursive algorithm illustrated in Table 1 [42]. At each step we multiply the current list by 2 and concatenate 1 plus the list.

Table 1
Index vector by recursion.
0:  0
1:  0 1
2:  0 2 1 3
3:  0 4 2 6 1 5 3 7
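A sketch of this recursion follows. The paper later refers to a routine reorder that uses this "double and add one" scheme; the exact coding here is mine.

      subroutine reorder(J, n)
      ! Build the n-bit reversal index vector by the recursion of
      ! Table 1: double the current list, then append 1 plus the list.
      integer n, J(0:2**n-1)
      integer len, i
      J(0) = 0
      len = 1
      do while (len .lt. 2**n)
         do i = 0, len-1
            J(i) = 2*J(i)
            J(i+len) = J(i) + 1
         end do
         len = 2*len
      end do
      end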
We are primarily interested in bit reversals because many FFT algorithms leave the result bit reversed or require input in bit reversed order. This means that we can use the data access patterns of these FFT algorithms to effect the bit reversal: the permutation can be written as a product of generalized shuffles. Here Q(m, N) is a generalized shuffle that performs a perfect shuffle taking m elements at a time.
The bit reversal reordering is also closely related to matrix transposition repeatedly applied on matrices of different shape [39]. One scheme is to reshape x into an N/2 × 2 matrix, transpose it, and copy the elements in row major order. This new vector is reshaped into an N/4 × 4 matrix. This procedure is repeated n - 1 times and is identical to the generalized perfect shuffle just described.
These permutation and multiple pass methods touch all the data. In fact, only some of the elements of x(k) need to be moved, depending on the value of k. Since we are exchanging elements, we need only consider k < k̄. The next thing to notice is that we need not look at all the indices. For example, it is obvious that no exchange is necessary for k = 0 and k = N - 1 since the former is all 0 bits and the latter all 1 bits. Rodriguez [28] showed that the actual upper bound is the number whose bits are all 1 except for a 0 near the middle. If n is odd, this number has two more 1 bits to the right of the 0 than to the left; if n is even, there is one extra 1 bit to the right of the zero. The actual value for the index of the last exchange is then N - 1 - m, where m = √N for n even and m = √(2N) for n odd.
We don't even have to examine all the indices to see which ones should be swapped [41]. Consider n even. We can write k = k_1 √N + k_2 so that k̄ = k̄_1 + k̄_2 √N.
If we think of a matrix with √N rows and columns and entries consisting of pairs (k_1, k_2), it is clear we need only exchange elements corresponding to entries below the diagonal. Those on the diagonal are their own bit reversal, and those above are the same as those below. If n is odd, we generate separate matrices for the middle bit 0 and for the middle bit 1.
Another approach is to view the bit reversal process as one of finding the set of pairs k and k̄ that require an exchange. Rutkowska [32] describes a way to find this set recursively. Divide the set of all indices 0 ≤ k < N into four sets. Elements in set A_n have n bits and both a leading and trailing 0 in their binary representation; B_n, a leading 0 and a trailing 1; C_n, a leading 1 and a trailing 0; D_n, both leading and trailing 1's. Clearly, if k ∈ A_n, k̄ ∈ A_n; if k ∈ D_n, k̄ ∈ D_n; if k ∈ B_n, k̄ ∈ C_n. These sets are simply related to the sets involving the n - 2 middle bits of k. We build up to the desired set starting with (1,2) and (0,3) for n even and (0,1) for n odd.
3. Memory architectures. As processor speed has increased, computer architects have been forced to use a number of tricks to keep the cost of their systems down. For example, they have used memory chips that are considerably slower than those used in the processor. Since many instructions refer to data stored in memory, slow memories can only be used effectively if there is some way to improve their apparent performance.
One commonly used approach that is adopted by all the machines studied in this paper is to use a small amount of high-speed memory to hold the most frequently used data. These registers may hold as few as four words on some machines and as many as a few thousand on a vector processor. Since this small amount is not enough to hide enough of the memory delay, system designers use two other approaches: interleaved and hierarchical memories. Each of these is described in the following sections.
3.1. Interleaved memory. If it takes several processor cycles to deliver one word from memory, then taking consecutive requests from independent memories can deliver the data fast enough. One way to implement this approach is to have an interleaved memory.
An interleaved memory is divided into banks, each of which holds part of the memory. Consecutive words are stored in consecutive banks in wrap mode. In other words, data is dealt to the banks as cards are in a bridge game. When a word in memory is accessed, the associated memory bank starts to deliver the word. The number of cycles needed until this bank can start to process another request is called the busy time.
The number of banks is usually a small multiple of the ratio of the processor to memory speeds. For example, the Cray X-MP/1 uses memory chips that are a quarter of the speed of the processor chips and has 16 banks [16]. Other systems use more; the Fujitsu VP-200 uses 256 memory banks. Some machines use more complicated structures. The Cray 2 has a two-level hierarchy, four sectors with four memory banks each.
A request to another bank that is not busy can start immediately. Thus, if there are enough memory banks, stepping through consecutive words in memory, stride 1 access, results in data being delivered to the processor fast enough to keep it busy. The performance from using other strides will depend on the number of banks and the relative speeds of the processor and memory. Fortunately, the relationship is easy to understand.
Let the busy time be m cycles, the number of memory banks be b, and the stride be s. If the stride is not commensurate with the number of memory banks, data will be delivered as fast as at stride 1, typically one cycle per element. If the stride is a multiple of b, then the average time it takes to deliver a word to the processor is m. If the stride is such that b/s is a small integer less than m, the average time will be ms/b. This last case is easy to understand if we think of the memory as b/s independent memories, each delivering a word every m machine cycles. For example, with b = 16 banks, a busy time of m = 4, and stride s = 8, only b/s = 2 banks are active, so each element takes ms/b = 2 cycles on average.
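These rules can be summarized in a toy model; the coding below is only illustrative, and real machines add complications it ignores.

      real function avcyc(m, b, s)
      ! Average cycles per element for stride s through b banks with
      ! bank busy time m: the access rotates through b/gcd(s,b) banks,
      ! each of which can deliver a word every m cycles.
      integer m, b, s
      integer a, t, g, nb
      a = s
      t = b
      do while (t .ne. 0)       ! Euclid's algorithm for gcd(s,b)
         g = mod(a, t)
         a = t
         t = g
      end do
      nb = b/a                  ! number of distinct banks touched
      avcyc = max(1.0, real(m)/real(nb))
      end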
It is relatively easy to discover which memory bank contains a specific address if the number of banks is a power of two: simply look at a few bits in the address. Hence, most machines use a power of two memory banks. However, power of two strides are quite common in practice, and such strides cause problems for these machines. Burroughs attempted to address this problem by producing a machine with 17 memory banks. Special algorithms were used to find which bank held a given word [44].
3.2. Hierarchical memory. Hierarchical memories are complicated because they use many tricks to shield the user from delays caused by the slower circuits lower in the hierarchy [36]. While these techniques speed up the average memory access time, they make understanding the system more difficult.
At the top of the memory hierarchy are the registers. The time it takes to move data between two registers sets the basic unit of time, usually a machine cycle. Since this is the case on all the machines considered here, the unit of time for all measurements will be machine cycles.
Next in the hierarchy is the cache memory, sometimes called high speed buffer memory. The cache is designed to deliver data to the registers as fast as it is requested, typically one word per cycle. Such high performance requires that an expensive technology be used, so cache memories are typically small, 8 KBytes to 256 KBytes.
Below the cache is the main memory. Main memory is typically several cycles from the cache. This means that a memory reference to a word that is not in the cache will cause a delay. However, it is possible to pipeline data transfers from main memory to the cache, which reduces the average access time per word. The amount of data transferred on each such cache miss is called a cache line. Cache lines typically contain between one and 32 words.
Below the main memory is the backing storage, either slow memory or disk. A reference to a word not in the main memory can take anywhere from hundreds to thousands of cycles to get the data into the cache. The average delay is reduced by moving relatively large blocks, typically 4 KBytes, at a time. The data transferred on each such page fault is called a page. Main memory is divided into frames, each of which can hold a page of data.
For this discussion, the cache is the most important part of the hierarchy because all the cases measured fit in the main memory. System designers have a great deal of freedom in designing their cache system. They can choose the amount of data brought into the cache as a block, the cache line size. They can also choose the algorithm by which they replace a cache line when the cache becomes filled. In addition, they often speed up the processor by limiting the number of places in the cache that can hold a specific word.
Choosing the cache line size involves a trade-off between two conflicting goals. We want a line as short as possible to avoid bringing data that we will never use into the cache. On the other hand, we want a cache line to be as long as possible to amortize the cost of going to memory. The machines in this study have cache lines ranging from 32 to 128 bytes.
In an ideal world, we would be able to put a cache line into any available slot in the cache. Unfortunately, each memory reference would then involve searching the entire cache for the cache line of interest. Such a search would take so long that memory references to cached data would take many machine cycles. The process can be sped up by limiting the number of places that can hold a specific address.
We can think of the cache as containing rows and columns of the places the data can sit. Set associativity is implemented by using a subset of the bits in the address to select the row, called the class. We can now find our entry on a memory reference by using these address bits to tell us which row to search. Typically, machines are direct mapped, where there is only one slot that can hold a specific address (one column), or n-way set associative (n columns), where n is in the range of 2 to 8. There are also fully associative caches that allow any cache line to go into any available slot.
When we need to bring a new cache line into a full cache, we must decide which cache line to replace. The best algorithm is one that replaces the line that will be used farthest in the future. Since we don't have full knowledge of future memory references, we need an approximation. If we have a direct mapped cache, there is no decision to be made; we put our new line into the only slot that can hold it. In a set associative or fully associative cache, we have a choice of algorithms. Most designers choose to replace the least recently used cache line (LRU replacement), an algorithm that needs a good deal of hardware to implement. Surprisingly, simply replacing a random line works quite well.
These choices impact performance. Consider the following program run on a machine with a four-way set associative, 32 KB cache with a 16-word line that uses LRU replacement. Each of the first five accesses should take a relatively long time since each results in a cache miss. However, we would expect the next 15 × 5 accesses to take only one cycle each since the data should be in cache. What we find is that every access is slow.
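The program itself did not survive in this copy. The following reconstruction is only an assumption, but it is consistent with the surrounding description and with the repaired dimension a(2**13+16,5) given below: the five references in the inner loop are 2**13 single precision words (32 KBytes, the cache size) apart, so all of them map to the same cache row.

      real a(2**13, 5)        ! hypothetical reconstruction of the example
      real s
      integer i, j
      s = 0.0
      do i = 1, 2**13
         do j = 1, 5          ! five references, each 32 KBytes apart
            s = s + a(i,j)
         end do
      end do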
In our example, we are accessing the data with a stride equal to the size of the cache. Thus, each reference falls into the same row. The first four references find free slots in the row, but the fifth flushes one of them. Since we are replacing the least recently used line, the fifth reference flushes the cache line that will be used on the sixth reference. That one flushes the line needed for the seventh, and so on.
Other power of two strides will also be bad. A stride of half the cache size gives only two rows; one quarter of the cache size, four rows; etc. Surprisingly, other strides can also cause problems [12]. A four-way set associative, 32 KByte cache with a 128 byte line size will perform poorly when the stride is 103!
We can fix our example by changing the stride by a small amount. If the array in our simple program were dimensioned a(2**13+16,5), each reference would be to a different row in the cache, and our performance would improve markedly.
Another problem occurs for even larger strides. Figure 2 compares the performance of two ways of copying data at stride 1 on the IBM 3090. The dotted line uses a(i) = b(i); the solid line, a(i) = b(J(i)), with J(i) = i. The intent is to measure how much more time a gather operation takes than a simple assignment. The time in cycles per element is constant until the array gets up to 2^16 words, at which point it increases by 36 cycles. The problem is not related to the cache, which can hold only 2^13 words.

Table 2
Parameters for machines studied.
To understand this behavior we must look at the virtual memory. Although this problem fits into the 128 MByte memory on the machine used, the virtual memory system still must do address translation, a process that relates the virtual address of the page to the real address of the frame holding the data. The system keeps a table in memory to do this translation.
Doing address translation out of a table kept in main memory is slow. The process is sped up by using a special hardware device, the translation look-aside buffer (TLB). The TLB on the 3090 is two-way set associative with 128 entries, i.e., the TLB can be pictured as having 64 rows and two columns. A reference to an address in the TLB is resolved in one machine cycle; a TLB miss costs 36 cycles.
Any stride larger than the 64 rows (256 KBytes or 64 KWords) will cause all TLB references to fall into the same class. Since the loop using a(i) = b(i) has only two memory references, and the TLB is 2-way set associative, the code runs at full speed. The other loop has three arrays falling in the same class; thus every memory reference results in a TLB miss and an increase of 36 cycles per element. As with the cache, this problem can be avoided by adding an increment to the size of the arrays.
There are many more subtleties of hierarchical memories. Some systems prefetch data; others move blocks of pages at a time; yet others have multiple level caches. Fortunately, none of these are important for the algorithms described here.

4. The machines measured. In this section, I will describe the memory subsystems of the machines measured. Their characteristics are summarized in Table 2. Each has some unique features that make its performance different from the others.

4.1. Cray Y-MP. The Cray Y-MP is the only machine in this study that does not have a memory hierarchy beyond the registers. All addresses refer to the physical addresses in memory since there is no virtual memory support. The Y-MP can do two loads and a store every machine cycle, giving it the highest memory bandwidth of any of the machines studied. The Cray can also chain operations together, such as load, multiply, add, store, to reduce the time to complete a calculation.
The Y-MP uses a two-stage network to connect each of its eight processors to each of the 256 memory banks. The first stage is an eight-way switch and the second, a four-way switch. Hence, the memory looks like a 32-way interleave to each processor. Since the bank busy time is five cycles, strides of 2 and 4 suffer no penalty. The time it takes to complete a load doubles with every doubling of the stride from 4 to 16. Stride 32 loads take five cycles each.
It takes about 17 cycles to get a single word from memory to the registers, either a scalar or a vector element. Subsequent words on a vector load or store are done at a rate of one cycle per element barring bank conflicts. Hence, stepping through an array takes 12 + 63 = 75 cycles for each vector register, an average of 1.2 cycles per element. If the compiler can generate two independent loads, the rate doubles.
4.2. IBM 3090.
The IBM 3090J has a hierarchical memory with virtual memory support that uses a two-way set associative TLB and a four-way set associative cache. Unlike some other supercomputers, vector memory operations use the cache. The 3090 uses the longest cache line of any of the machines studied. The 3090 is the only machine studied that has arithmetic operations where one of the operands comes directly from the memory; all others require the data to be moved to the registers before it is used.
The memory system is built to deliver entire cache lines. When a cache miss occurs from either a read or a write, the memory delivers an entire cache line starting with the requested double word (64 bits). After a startup of approximately 16 cycles, one double word is put into the cache every cycle; this word is immediately available for use. Since only one memory operation may be in progress at any time, the next load or store must wait for the entire 32 cycles it takes to transfer the cache line.
Once the data is in the cache, it is delivered to the registers in the cycle in which it is requested. Thus, stepping through an array of 32-bit numbers takes 16 + 31 = 47 cycles for every 32 numbers in the cache line, just under 1.5 cycles per element.
Since the 3090 has a four-way cache, we never run into a problem of set associativity, as we never reference more than four arrays in any of the loops. However, the TLB is only two-way set associative. We have seen from Fig. 2 that a gather copy can take over 45 cycles per element if the three arrays fall into the same associativity set. To avoid this problem, the measurements were made with arrays offset from each other by 1024 words.

4.3. IBM RS/6000-520. The memory subsystem of the RS/6000 model 520 is similar to that of the 3090. Its cache is smaller, 32 KBytes, and its cache lines are shorter (64 bytes) [3]. It also has a smaller TLB.
When a reference is made to a word not in cache, a line is transferred from main memory starting with the quadword (16 bytes) containing the requested byte. The first word is delivered to the register after a delay of eight cycles; the rest of the quadwords arrive at one cycle intervals. This data can be moved from the cache to the registers while the rest of the cache line is being moved from memory. The memory subsystem is busy for 12 cycles on every cache miss. However, up to two memory requests to independent memory banks can be processed simultaneously.
Once the data is in the cache, data is delivered to the registers in one cycle. Thus, sequentially stepping through a single precision array takes 8 + 15 = 23 cycles for every 16 words accessed, an average of 1.5 cycles per element. As with the 3090, we avoid the set associativity problem in the TLB by offsetting the arrays from each other by 1024 words. A TLB miss costs about 35 cycles.

4.4. HP/9000-730. The HP/9000 is the only machine studied with a direct mapped cache, which means that there is only one place in the cache for each piece of data. To compensate, this machine has the largest cache of the machines studied, 256 KB, and a cycle time shorter than the other RISC processors. We can avoid problems of having a cache line from one array replace a line from another array by separating the arrays from each other by the length of a cache line, eight words.
When a reference is made to a word not in cache, a line is transferred starting with the word requested. Subsequent double words are delivered on successive cycles. It takes about 30 cycles to get the first word from the memory to the cache. Unlike the other machines studied, there is a one-cycle delay between loading a word from the cache to a register and when that word can be used. Fortunately, this delay slot can usually be filled with other work. Hence, stepping through a single precision array takes 30 + 7 = 37 cycles for every eight words, almost five cycles per element.
The TLB is fully associative with LRU replacement and contains 96 entries, making this the only machine that has a cache or TLB which is not a power of two. A TLB miss takes about 30 cycles.

4.5. Intel i860. The Intel i860 processor is the basic compute node of the Touchstone Gamma machine. One node was used for the measurements reported here. Unlike the other machines studied, the i860 will not load a cache line on a write. If the word is in the cache, it is stored; if not, the data is sent directly to the memory. This means that stores never suffer a miss penalty.
When a load misses in the cache, the word referenced is delivered to the cache first and forwarded to the register. Each remaining double word (64 bits) is delivered in subsequent cycles [22]. It takes about 40 cycles to get the first word into the cache. Hence, loading a vector at stride 1 takes about 6 cycles per element. The i860 has a two-way set associative cache that chooses which of the two lines to replace using a pseudorandom number generator. This replacement rule makes the i860 less susceptible to problems related to associativity, but can result in higher miss rates in some circumstances.
The TLB is four-way set associative with 64 entries. Hence, our tests should show no effects of conflict in the TLB.
5. The algorithms. Even though bit reversal is an inherently simple thing to do, there is a surprisingly large number of ways to do it. Even more surprising is how much work is still being done; over half of the algorithms reported here were published or developed in the last 5 years. Each of the sections that follows describes algorithms that use similar techniques to perform the reversal.
5.1. Examine the index. These algorithms step through the array of indices and decide whether the corresponding data element should be exchanged with another. One disadvantage shared by all of these approaches is that they are inherently sequential and do not vectorize. Also, there is no locality in the data reference pattern, so these approaches do poorly on machines with hierarchical memories. On the other hand, they are efficient in their data movement, touching each element to be moved once and not touching elements that remain in place.
The prototype of this class [7] works with two indices. One increases while the other is modified until it contains the bit reversal of the first. At this point, the elements are exchanged if they have not already been exchanged in an earlier iteration of the loop.
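The pattern is the familiar counting loop in which j is kept equal to the bit reversal of i by propagating a carry from the high end. The coding below conveys the flavor of this class; it is a generic sketch of mine, not the tuned routine that was timed.

      subroutine brsort(x, n)
      ! n is the array length, a power of two.
      integer n
      real x(0:n-1)
      integer i, j, k
      real t
      j = 0
      do i = 0, n-2
         if (i .lt. j) then       ! swap unless already done
            t = x(i)
            x(i) = x(j)
            x(j) = t
         end if
         k = n/2                  ! add 1 to j in bit-reversed order
         do while (k .le. j)
            j = j - k
            k = k/2
         end do
         j = j + k
      end do
      end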
Buneman [6] made three improvements. First, he noted that the last two elements will never be exchanged with each other, so the loop could be shortened. Secondly, he replaced the divisions by two that Cooley used with an array of powers of two. This latter change had no effect on the results reported here because I replaced the divisions with shifts. Finally, he moved the exchange of elements to the end of the loop, which results in one fewer index computation. Buneman's code runs slower than Cooley's does because the way Buneman builds the bit reversed index takes one more step than Cooley's does. Had I been clever enough to optimize Buneman's routine to use Cooley's version, its performance would have been identical to that of Cooley.
Rodriguez [28] implements Cooley's algorithm by using the actual value of the index of the last exchanged element as the upper bound on the loop. He showed that the loop was made shorter by about √N. Unfortunately, the savings are small for arrays of the length considered in this study.
Rutkowska [31] improved Rodriguez's idea by using some simple identities of an integer and its bit reversal to do four swaps of elements for each index computation. A similar modification of Cooley's algorithm results in a similar performance improvement. Yong [43] takes a similar approach.
Although we normally think of arithmetic operations as taking all the time, Duhamel [8] noted that the test in the inner loops of many bit reversals takes a substantial amount of time. He used the close relation between bit reversal and matrix transpose to completely eliminate the test of whether the two elements had been previously exchanged.
Brigham [5] took a conceptually simpler approach that, unfortunately, performs much worse. For each index in the loop, he explicitly computes the bit reversed value in log2 N steps. He then tests the values to see if the data elements should be exchanged. Not only does this approach use many integer operations, it makes the computer time increase as N log2 N instead of as N as with all the other algorithms in this section.
5.2. Build an index vector. Since we frequently do many FFTs of the same length, recomputing the bit reversal indices can be avoided. The methods reported in this section compute the bit reversal indices once with the intent of using them many times. There are many ways to generate this list. I timed one that uses an optimized version of Horner's rule for evaluating polynomials [42], as described in §2.
Once we have the index vector J(i), we can do the bit reversal using a gather, a(i) = b(J(i)), or a scatter, a(J(i)) = b(i). Middleditch [24] builds an index vector one-eighth the length of the array. He determines which segment of the elements to work on from the short index vector, then does 3 bits worth of reversal explicitly. This approach saves storage, since a smaller index vector is needed, but it does not vectorize.
One of Van Loan's algorithms from an early draft of his book [39] computes the index vector inside the loop that exchanges the elements by using the algorithm described in this section. Odd/even pairs of elements are exchanged on each pass, and the inner loop does vectorize.
The problem with using these methods is that the memory references are not localized. Hence, most of the loads on a gather or stores on a scatter do not hit in the cache. On most machines, the performance of these two is the same. On the Intel i860, however, the scatter operation runs considerably faster because it does not bring in a cache line when a store operation misses.
5.3. New gather/scatter method. A different problem occurs on the Cray. The bit reversal index vector contains all possible power of two strides, which leads to many bank conflicts. All is not lost, though. We can modify the data access patterns by scrambling the index vector [1]. Of course, since the index vector is scrambled, we will have to use both a scatter and a gather to do the bit reversal. We will gain by avoiding memory bank conflicts, but we lose because gathers and scatters take almost twice as long as straight loads and stores.
Recall that our gather method uses a(i) = b(J(i)). The strategy used is to replace J(i) with J(mod(k*i,n)) for an array of length n. We can use any value for k that will break up the bad reference patterns. The choice used for the runs reported in this paper is k = n/4 - 1. This choice guarantees that the largest power of two difference on any vector access is four, a value that does not degrade performance on the Cray Y-MP.
Simply using n/4-1 will cause problems on machines with 32-bit integers since k*i can overflow even for arrays of modest size. We are safe if we use the smaller of n/4-1 and 2**31/n - 1. We prevent the program from crashing if it is fed a very small or very large value of n. The code used to build the gather vector is

      k = max(n/4 - 1, 3)
      k = min(k, 2**31/n - 1)
      do i = 0, n - 1
         G(i) = mod(k*i, n)    ! assumed loop body; the original was cut off here
      end do
Since we have scrambled the data on the gather, we must reorder the data we store. Fortunately, this part is easy; we simply use the standard index vector for bit reversal for the scatter. The bit reversal is then done with both a gather and a scatter, a(J(G(i))) = b(G(i)). While this scheme runs a bit slower in scalar mode due to the extra memory traffic, it reduces the time in vector mode by about 1/3.

5.4. Make log2 N passes. The bit reversal step is needed because many of the FFT algorithms bit reverse the data in log2 N passes. This observation means that we can also do a bit reversal in log2 N passes, as was done by Singleton [34], whose algorithm was designed to minimize the number of words transferred between memory and backing storage. His code uses the perfect shuffle on successively smaller segments on each pass. The loops are arranged to keep a block of data in main memory as long as possible. Not surprisingly, this approach works well on hierarchical memory machines.
Swarztrauber [37] presented two algorithms that are similar to Singleton's but are coded to look like matrix transposes. Routine ctsort is an inverse perfect shuffle; successive elements are stored in separate halves of the output array. Routine ptsort implements a perfect shuffle similar to Singleton's, but the inner loop is coded in such a way that the data is always accessed at large stride. No special blocking is done in either implementation.
Van Loan [39] has implemented the same algorithms as Swarztrauber has by using explicit computation of the indices instead of transposing elements in an array with three indices. There is a saving in not needing a subroutine call, but on many machines explicit address computation is not as efficient as array indexing is.
A bit reversal can be done by a sequence of shuffles, as shown by Korn and Lambiotte [23]. First do a perfect shuffle of the two halves of the array. Next do a perfect shuffle taking pairs of elements. Repeat this process, doubling the size of the groups shuffled, for log2 N steps. Each step is implementable using the merge operation on the STAR 100 they used.
Polge's usbin algorithm [27] does multiple passes over the data but does not need any auxiliary storage. However, some elements may be moved more than once. At step k, elements are moved when bits k and log2 N - k of the index differ. Since blocks of elements share these bits, the blocks can be exchanged as units, making the algorithm vectorizable. While Polge describes higher radix versions of this algorithm, his published code is for radix-2 exchanges only.
Rutkowska [32] makes the recursive nature of these multistep bit reversals explicit. The recursions are based on looking at the elements with odd and even indices separately and splitting these two sets into a top and a bottom half. The bit reversal of an element in one set puts it into another or the same set. For example, the bit reversal of an even index in the lower half of the array results in another element in the same set. Rutkowska presents two implementations that apply this scheme recursively and use only a few shift and add operations. The first recursively calls itself until it has identified a pair of elements to exchange, a version that is done in place. Since some machines do not execute recursive algorithms efficiently, she also presents an implementation that uses a work array but handles the recursion implicitly. This latter version vectorizes.

5.5. Use a short index vector. It is possible to do a bit reversal with an index vector that is much shorter than we might think. The methods described here are based on the observation that the bit reversal of an integer k = k_2 + k_1 √N is k̄ = k̄_2 √N + k̄_1 when n is even. When n is odd, we do the same thing for the two halves with differing middle bits. This approach is used in Polge's usone algorithm [27].
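The skeleton shared by these methods, for n even, looks roughly as follows. It uses a seed table of the √N reversals of the half-length indices (built, say, by the reorder sketch of §2). This is an illustrative sketch of mine; the routines timed in this group (usone, Evans's codes, walker) differ from it in detail.

      subroutine brshort(x, n)
      ! Bit reverse x using a seed table of length sqrt(N); n even.
      integer n
      real x(0:2**n-1)
      integer t(0:2**(n/2)-1)
      integer R, k1, k2, k, kb
      real tmp
      R = 2**(n/2)
      call reorder(t, n/2)           ! (n/2)-bit reversals of 0..R-1
      do k1 = 0, R-1
         do k2 = 0, R-1
            k  = k1*R + k2
            kb = t(k2)*R + t(k1)     ! reversal of k from the table
            if (kb .gt. k) then      ! swap each pair once
               tmp = x(k)
               x(k) = x(kb)
               x(kb) = tmp
            end if
         end do
      end do
      end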
Evans [10] independently discovered Polge's approach and produced a slightly simpler implementation. Evans later improved the algorithm very slightly by reducing the number of multiplications needed [11].
Evans's papers generated a flurry of improvements. Walker [41] observed that the array indices and their bit reversals can be written as a matrix. Since the bit reversal is its own inverse, it is easy to show that one need only consider the subdiagonal part of this matrix. The observation allowed Walker to reduce the size of the index vector by a factor of two when log2 N is odd and reduce the run time marginally. Biswas [4] presents an algorithm virtually identical to that of Walker, even including a matrix showing the symmetry of the transformation.
Vesely [40] studied the properties of mixed radix bit reversals and concluded that the best performance comes from using three index vectors of length about ∛N. The fully symmetric version used here is similar to Evans's algorithm for odd n except that the middle section is wider than one bit and an additional reversal stage is needed to process these bits. Khan [21] formulates the bit reversal in a manner similar to that of Rutkowska [32], splitting the indices into a top and a bottom half of the even and odd values. He breaks up the bit reversal into groups of elements that can be processed together, using an index vector of order √N in size.
The algorithm Heller [14] uses was designed for the CM-2, so some of the details are a bit obscure. However, the structure is similar to that of Khan's in the way groups of elements are handled. Heller starts with a base version that computes a bit reversal index vector of length 32. Then groups of 1024 elements are bit reversed in groups of 32. Heller tried a number of variants: explicit unrolling of the inner loop, reordering the statements to improve locality, using temporaries to keep data in the registers longer. The measurements show that the simplest code produced the best results on all machines.

5.6. New hybrid method. It is clear from Fig. 1 and the results in §6 that the standard approaches to bit reversal reordering don't perform well for arrays larger than the cache. With only one exception [34], none of the methods previously published take explicit account of the structure of the various parts of the memory hierarchy. In this section, I describe a two-step approach that runs close to the in-cache rate. In the first step, the data is loaded with fixed stride and stored at stride 1. The second step does a bit reversal that works on data in the cache.
In what follows I will use the following definitions. The cache can hold somewhat more than C data elements, and the cache line size is L data elements. I will assume it takes one machine cycle to load an element from the cache to a register. If the data is being gathered, one additional machine cycle is needed per element. Thus, it takes 2L cycles to gather L elements from the cache. If the data is not in the cache, I assume it takes B cycles for the data to reach the cache and D additional cycles before the memory can start the next request, a total of D + B cycles. If a gather or scatter is used, each load of an element takes one additional cycle.

Table 3
Cycles for a bit reversal that fits entirely in the cache and one that does not.
The number of cycles it takes to complete the bit reversal is shown in Table 3. The first iteration is done separately to account for the difference in access time between the first reference to data on a cache line and subsequent references. The average time to do the bit reversal when all the data fits in the cache can be read off the table. We should expect the model to underestimate the time because all it includes are memory accesses; other work that must be done is ignored. The limits of this simple model can also be seen by comparing the time for a gather and that for a scatter. According to the model, the performance should be the same. Instead, they differ because the CPU must wait for the data on a load but not on a store. Note that the Intel i860 does not bring data into the cache on a store. Hence, the scatter operation is considerably faster than the gather when the data does not fit into the cache.
We can now see why a straightforward scatter operation does not perform well on a machine with a cache. Consider a large bit reversal of length N ≫ C. The first few elements of J are 0, N/2, N/4, and N/2 + N/4. Thus, when storing a(J(0)), data elements a(0) through a(L-1) are brought into the cache; a(J(1)) brings elements a(N/2) through a(N/2+L-1), etc. Since each element of a is used only once, the average access time per element is D + B cycles. The indirect addressing used costs an additional cycle per element.
The elements of b and J must also be brought into the cache before being stored, but they are accessed contiguously. The sum of these times gives the total for the bit reversal. The best performance from the simple gather operation will be about D + B + 3 + 2(B + D - 1)/L cycles per element, as shown in Table 3. For the RS/6000-520 the model predicts 16.4 cycles per element (with B + D = 12 and L = 16, D + B + 3 + 2(B + D - 1)/L = 15 + 22/16 ≈ 16.4), compared to the 24 cycles measured.
To improve the cache performance of the bit reversal reordering, we must make better use of the data when it is brought into the cache. The approach used here is to do one or more large-radix bit reversals followed by a series of radix-2 bit reversals. For example, Table 4 shows how to do a radix-4 digit reversal of 16 elements.

Table 4
Radix-4 digit reversal by recursive transposes.
Row 1:  0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Row 2:   0  1  2  3
         4  5  6  7
         8  9 10 11
        12 13 14 15
Row 3:   0  4  8 12
         1  5  9 13
         2  6 10 14
         3  7 11 15
Row 4:   0  4  8 12
         2  6 10 14
         1  5  9 13
         3  7 11 15
Row 5:  0 4 8 12 2 6 10 14 1 5 9 13 3 7 11 15

Row 1 is the data in natural order, row 2 is the data written as a 4 × 4 array, and row 3 is the transpose of row 2. Row 4 represents a step needed when the radix is not prime: the indices of the rows must be bit reversed. (In the code, steps 3 and 4 are combined.) Finally, row 5 represents the radix-4 bit reversal reordering of the data. If C = 4, we can finish the radix-2 bit reversal by bit reversing each segment in turn.
The indices in row 5 can be looked at as four sequences, the elements of each sequence being parts of the original array accessed at stride 4. The scheme proposed here uses the data patterns present in the indices of the large-radix bit reversal. These segments are exactly the same as the groups described by Khan [21].
I will now describe the algorithm for a sequence of length N = LC. Since the cache line is L elements long, I will use a radix-L bit reversal.
First, load Z = C/L data elements at stride L starting at element 0. (If my vector registers can hold Z elements, this sectioning loop is done automatically.) These elements get stored into another array at stride 1 starting at element 0. Since each number used results in L numbers being transferred into the cache, I will have loaded C elements into the cache.
Next, I load another set of Z elements from the cache to the registers at stride L starting at element 1. This data is stored in the other array at stride 1 starting at element ZL/2, where L/2 is obtained by reversing the bits in 1. The next set of Z elements goes into the output array starting at element ZL/4, L/4 coming from the bit reversal of 2. This process is repeated in chunks of Z elements until the entire array has been processed, i.e., a total of L times.
At the end, I will have produced a radix-L bit reversal reordering of the original data. Now, I can take each of the L segments of the output array and do a radix-2 bit reversal on the N/L = C elements using any method that performs well when the data fits into the cache.
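Stripped of the register blocking, the fixed-stride stage amounts to the following loop. This simplified sketch is mine; the timed code also sections the inner loop into chunks of Z elements for the vector registers.

      subroutine radixL(a, b, n, L, J)
      ! Radix-L stage: read b at stride L, write a at stride 1.
      ! J(j) holds the bit reversal of j in log2(L) bits, so output
      ! segment J(j) receives the stride-L sequence starting at j.
      integer n, L, J(0:L-1)
      real a(0:n-1), b(0:n-1)
      integer i, j, m
      m = n/L                        ! segment length N/L
      do j = 0, L-1
         do i = 0, m-1
            a(J(j)*m + i) = b(i*L + j)
         end do
      end do
      end

For the example of Table 4 (n = 16, L = 4, J = 0,2,1,3) this loop produces exactly row 5.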
Even though I make two passes over the data instead of one for the conventional algorithm, the new approach is much faster. Cycle counting explains the improvement. The first load takes D + B cycles per element since each number is on a different cache line. The next C - 1 loads take only one cycle per element since the data has already been put into the cache. Each store is done at stride 1 and, therefore, takes 1 + (B + D - 1)/L cycles per element. The average time for the radix-L part is only a little more than 2 + (B + D - 1)/L cycles per element, as shown in Table 5.

Table 5
Cycles for large radix bit reversal.
   ...
   Load  b(L*i+j+k)    1
   Store a(i+j+k)      1
   End Do
   End Do
   End Do

The predicted value of 2.7 cycles per element is a bit smaller than the measured value of 4 on the RS/6000-520.
The length C radix-2 bit reversals are also efficient since all the data fits in the cache. All measurements were made using the algorithm that performs best on data that fits in the cache. On the RS/6000, HP-9000, and Intel i860 I used walker [41]; on the 3090 scalar run I used usone [27]; and on the 3090 vector run I used gather.
If N/L is too large to fit in the cache, we must repeat the process as many times as necessary to allow an efficient reordering. Each fixed stride pass over the data will take an additional T_f = 2 + (B + D - 1)/L cycles per element. However, since we are decreasing the length of the sequence by a constant factor, the time grows only as log_L N. In fact, we will rarely have to worry about when to use another method. The full algorithm is presented in Table 6. Here N is the length of the array, C is somewhat less than the number of elements that fit in the cache, and L is the size of the cache line. In addition, we take m = min(L, nr/C) instead of simply L to ensure that we always do length C bit reversals in the next phase. J(j) is the m-bit bit reversal of j.
Table 6
Two stage bit reversal reordering.

In routine trans the shapes of the arrays are changed on each entry. Thus, on the first pass we deal with one array of length N; on the second pass, L arrays of length N/L; etc. Note too that the order of the loops is important. One might think that because i,k appears on both sides of the assignment these two loops could be coalesced. Doing so would spoil the optimal use of the cache. The additional dimension that does not appear in Swarztrauber's [37] code accounts for the performance difference.
The performance of the algorithm can be improved slightly by violating one of the constraints, namely, that we load the array with a stride equal to the cache line size. We can reduce the number of passes made over the data by increasing the stride. Figure 3 shows that we can increase the stride much more than might be expected, up to 256 on a machine with 16 word cache lines, before the performance degrades dramatically. To understand this phenomenon, we must look at the load and store operations separately.
Consider using a stride of S = pL. The loads proceed as before; each load brings in a full cache line, all of which gets used. However, we will be forced to store fewer than L consecutive words: L/p words instead of L.
The optimum stride can now be predicted. If a larger stride means we can do the fixed stride work in one pass instead of two, we save time if the large stride version takes less than half the time of the stride L version. From our discussion in §3.2 we reach this point on the RS/6000 when we store an eighth of a line, a stride of 256, just the value measured. A similar argument holds for the HP-9000, which has higher memory latency and a shorter cache line. On this machine, the cut-off stride is also 256. In this case, the fixed stride part is so slow that we can afford to have the gather step run partly outside the cache if it saves an extra pass over the data.
This approach can be extended to machines with more memory levels: either second-level caches, electronic backing storage, or even distributed memory parallel machines. All we need do is hierarchically nest the bit reversal with successively smaller values for C and L. For example, we might do a radix-1024 bit reversal to optimize the use of the second-level cache. Next, we would call a bit reversal routine to work on segments that fit entirely in the second-level cache. Since these segments would not fit in the first-level cache, we could do a radix-32 bit reversal on them. Finally, we could use a gather on the subsegments that fit into the first-level cache.

Fig. 3. The time to copy an array at the indicated stride using the segmented copy of the new bit reversal method (cache size = 2^13 words). Note that the time per element does not increase dramatically until the stride exceeds 256 words.
6. Measurements. The reports in the literature were coded in different languages and were run on a variety of machines over more than 20 years. To be fair, I coded each of them into more or less modern Fortran 77. Thus, any differences due to coding style and quality of compilers have been removed. Second, I made a reasonable attempt to tune the codes. For example, using shift operations instead of multiplications and divisions by powers of two sped up some codes more than three times. Reordering loops to improve the locality of reference sped up others by an even larger factor. However, if a code was presented with certain computations in the inner loop, those were left there. I also did not unroll loops, a procedure which is known to improve performance on some machines. I did check the results for correctness for every case run.
Different approaches were taken to time the runs on different machines. Since I could not get the 3090 standalone, I measured CPU times, which fluctuated somewhat due to the heavy load on the machine. (We shared the machine with chemists.) In addition, to get higher clock resolution, I used an experimental version of the timer. When the system was heavily loaded, this timer sometimes forgot to tick. To account for this problem, I took the largest CPU time of several runs.
On the RS/6000 Model 520 and HP-9000 Model 730 I ran when I was the only user. Unfortunately, Unix daemons kept rearing their ugly heads, so an occasional measurement was clearly spurious. These were the short runs; there is no way I can be sure that the longer runs were not affected. In all cases I report the shortest measured elapsed time of several runs. The iPSC/860 returns the elapsed time on the single node I used. Even here, there were fluctuations, albeit small ones, from one run to the next. Again, I report the shortest elapsed time of several runs. The times on the Cray were very repeatable.

Table 7
Baseline measurements for log2 N = 20 (log2 N = 17 on i860).
6.1. Base measurements. First, I made some measurements to discover the base performance of the machines. For each machine, I measured the largest bit reversal that would fit entirely in the cache up to the largest that would not cause excessive paging. On the Cray the largest run was limited by the 10 MWord allowed for interactive use.
The table entries are divided into groups separated by horizontal lines. The first group of algorithms are those that step through the index values and exchange the appropriate elements. None of these methods vectorize, so this group is denoted by a double horizontal line in Table 10. Group 2 methods use an index vector of order N. Multiple passes over the data are used by the methods in Group 3, while Group 4 consists of methods that use a small seed table. The new method described in §5.6 is presented last.
Double vertical lines are used in the tables of machines with hierarchical memories to delineate the various levels in the memory hierarchy. The left section is for arrays that fit in cache, the middle section is for those that fit in the TLB, and the right section is for larger arrays. In general, the methods do worse on larger arrays. Those that use less memory do better longer, while those that need extra memory degrade sooner.
Some comments on the tables are in order. The Cray times are so stable and vary so little with problem size that only the four largest cases are shown. The code perm1 uses recursive calls. Due to a bug in the RS/6000 Fortran compiler, this code did not run correctly. Also, even following the recommended procedures for recursive calls on the Intel i860, the code would not run. I did not use the C versions of this method on these machines because the performance was poor on the machines where it did run, and the differences between the C and Fortran compilers would have been an issue.

7. Conclusions. The idea that the memory structure is an important factor in determining the performance of the bit reversal algorithm is not new [34]. However, this fact seems to have been forgotten in the 25 years since its first publication; few of the methods described in this paper consider the memory structure of the machines. Some authors even publish performance numbers without actually moving the data [4, 21]! Further evidence of the problem is illustrated in Fig. 4, which shows the time in cycles per element versus year of publication. I have chosen to show the performance on the RS/6000 for an array with log2 N = 20.
I am being somewhat unfair since many of the algorithms were written for PCs and I am running them on a high performance, scientific workstation. On the other hand, the average number of citations to previous work in these papers is disappointingly small. Of course, bit reversals are not the most important thing in the world, so the lack of rigor in searching out citations may well be justified.
The two new methods presented show the importance of considering the memory structure. The improvement is particularly dramatic for the machines having long cache lines.
While this paper has concentrated on radix-2 bit reversals, most of the algorithms used have mixed-radix analogs, some of which have been reported in the literature [26, 35]. However, it is difficult to predict their performance because their memory access patterns are quite different from those of the radix-2 methods. Perhaps a followup study is needed.
Acknowledgments. I would like to thank Rad Olson for helping me understand the operation of the cache and TLB on the IBM machines. Eric Barszcz helped me understand the Intel i860. David Bailey taught me the gatscat algorithm. I also thank the people who sent me algorithms or allowed me to use their work in advance of publication: Alan Edelman, Steve Heller, Fuad Khan, and John Middleditch. Numerous others sent comments and pointers to the literature. Thanks to you all. I would also like to thank David Gustavson, Chuck Dickens, and Charles Boeheim of SLAC for helping me get machine time to verify the performance numbers for the 3090 in scalar mode.

Update. As the reader may have inferred from the machines used for the measurements, this work was completed a number of years ago. A combination of corporate legal departments and a slow refereeing process delayed publication. Since my search for the relevant literature was completed in the middle of 1989, more bit reversal papers have been published. This section contains those new algorithms found during a less than thorough literature search. In addition, I have added a citation to a version published in an archival publication in addition to the conference proceedings originally cited [29]. Unfortunately, I no longer have access to the machines used for the extensive measurements. Instead, I'll report performance numbers on an HP-755 workstation.
I stumbled on a paper [13] describing a scheme that is very similar to the hybrid method described in §5.6. It was designed for out-of-core problems, which is not unlike the out-of-cache problem described in this paper. However, the older work is done in place, leaving blocks of data out of order. This ordering problem presents no serious difficulty when reading from a disk, but would be awkward to incorporate in a program working on memory resident data.
Table 14
Index vector by recursion.
0:  0
1:  0 4
2:  0 4 2 6
3:  0 4 2 6 1 5 3 7

Elster [17] presented an interesting derivation of an algorithm that is identical to cvl143 [39]. The bit reversed index is written as r_n(k) = c_k 2^{t-q}, where n = 2^r and 1 ≤ q < t. The c_k can be computed recursively. An interesting offshoot of this algorithm is a procedure for computing the index vector to be used for the gather or scatter operations. Recall that the routine reorder used in the body of this paper used the "double and add one" algorithm. Elster instead uses the "add half" procedure, which involves halving an integer instead of doubling an array. Table 14 can be compared with Table 1 to see the difference in the way the index vector is built. Since Elster's approach never changes an entry once it has been computed, it saves arithmetic and time, as shown in Table 15.
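A sketch of the "add half" construction, in the same illustrative style as the reorder sketch in §2 (the coding is mine):

      subroutine reorder2(J, n)
      ! Elster-style "add half": each new block is the old list plus
      ! half the previous increment; entries are never changed once set.
      integer n, J(0:2**n-1)
      integer len, half, i
      J(0) = 0
      len = 1
      half = 2**n
      do while (len .lt. 2**n)
         half = half/2
         do i = 0, len-1
            J(i+len) = J(i) + half
         end do
         len = 2*len
      end do
      end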
Elster's algorithm and that in Table 1 differ only in the order in which the entries of the index vector are generated; the "double and add one" scheme rewrites every entry at each step, while the "add half" scheme writes each entry once. Other orderings can now be seen quite easily. For example, a tree-like reduction could be used on a parallel machine, or a different ordering used for machines with direct mapped caches.
Jeong's [18] approach combines the data movement with the computation of the index. While it seems unfair to compare it with a routine like gather, which doesn't count the time to set up the index array, the algorithm is actually a good deal faster. A clever trick helps us avoid computing all the elements of the index array and moving data that doesn't need to be moved.
The idea is based on splitting the binary representation of a number with an even number of bits into two parts, H and L. Suppose we have computed r(i) for i < 2^{k-1}, which is always possible since r(0) = 0. We can bit reverse the elements with index values 2^{k-1} ≤ i < 2^k by noting that r(i) = 2^{n-k} + r(i - 2^{k-1}), which is identical to Elster's algorithm [17]. The elements to be swapped are just x(r(i) + j) and x(i + r(j)) for 0 ≤ j < i. Only n/2 stages are needed to complete the bit reversal for an array with 2^n elements.
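A sketch of the resulting procedure for even n follows; the coding and names are mine, not Jeong's published routine.

      subroutine brpair(x, n)
      ! n = log2(length), assumed even.  r(i) is grown one stage at a
      ! time; as soon as r(i) is known, all pairs (x(r(i)+j), x(i+r(j)))
      ! with j < i are swapped.
      integer n
      real x(0:2**n-1)
      integer r(0:2**(n/2)-1)
      integer i, j, k
      real t
      r(0) = 0
      do k = 1, n/2
         do i = 2**(k-1), 2**k - 1
            r(i) = 2**(n-k) + r(i - 2**(k-1))
            do j = 0, i-1
               t = x(r(i)+j)
               x(r(i)+j) = x(i+r(j))
               x(i+r(j)) = t
            end do
         end do
      end do
      end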
The bit reversal proposed by Orchard [25] is similar to the methods that step through the index values. However, instead of stepping through the integers by counting, these methods use the properties of Galois fields to step through the integers using shift and exclusive OR operations. The advantage of this approach is that the same algorithm generates both the integer and its bit-reversed version. The proposed approach takes successive powers of a root of the primitive polynomial modulo the number of elements. Because of the properties of the roots, only shift and exclusive OR operations are needed. Because of the modulo arithmetic inherent in these operations, several tests for end conditions are saved. It should be noted that complexity comparisons between this and other approaches will be strongly influenced by the relative costs of shifts and exclusive OR operations versus addition and increment operations on a specific architecture. Table 15 shows how this procedure performs when incorporated in a scheme like that of Cooley [7], nbrv1, and in one like that of Evans [10], nbrv2.
One other, rather obvious idea occurred to me while looking at these algorithms. The gather algorithm moves all the elements from one array to another. However, the data movement can be done in place if we simply use the index vector to control which elements get swapped. We may not see much improvement on a vector processor, but, as the numbers in Table 15 show, the saving from this new in-place algorithm is considerable on a cache based machine.
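The change to the gather loop is one line; a minimal sketch:

      subroutine brswap(x, J, n)
      ! In-place version of the gather: swap instead of copy.
      integer n, J(0:n-1)
      real x(0:n-1)
      integer i
      real t
      do i = 0, n-1
         if (J(i) .gt. i) then    ! each pair appears twice; swap once
            t = x(i)
            x(i) = x(J(i))
            x(J(i)) = t
         end if
      end do
      end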
