Although Fast Hartley Transform (FHT) 
Introduction
Digital Signal Processing (DSP) of real-time signals has gained importance with recent advances in digital computer technology. Digital signal processors, that is, digital computers specialized for signal processing, are both under development and available on the market. This growth is driven by the massive amount of computation required in various DSP applications. The performance requirements of DSP applications can be met by choosing clever algorithms, by increasing processor performance, or both. DSP applications are characterized by computations that are massive but fairly straightforward and simple. Furthermore, these computations exhibit orderly structures. DSP algorithms are also very efficient, having been optimized and improved many times. However, this is still not enough for most DSP applications [...] Tukey's more efficient algorithm [3], named the Fast Fourier Transform (FFT), made realizable many applications involving the computation of the DFT that had previously been impractical because of performance problems.
Despite its wide acceptance, the FFT is a complex transformation. That is, both the DFT and the FFT involve complex arithmetic even if the input signal consists of real numbers only. Hence, the FFT contains redundancy if the signals in the time domain are real. The Discrete Hartley Transform (DHT) was developed as a more efficient and faster transformation [4]. The DHT of an input sequence $\{h(i) : i = 0, 1, \ldots, N-1\}$ [...]
for $k = 0, 1, \ldots, N-1$, where the input sequence $h(\cdot)$ is constrained to real numbers only. The Hartley transform does not require any complex arithmetic. This important feature increases the performance of the DHT by a factor of two while, at the same time, decreasing the memory requirements by a factor of two. The computational complexities of both schemes are $O(N^2)$. The FFT reduces this time to $O(N \log_2 N)$ [3]. Like the DFT, the DHT also has a fast formulation, called the Fast Hartley Transform (FHT) [1, 2], with computational complexity $O(N \log_2 N)$. The FHT provides efficient spectral analysis of real discrete signals.
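As a point of reference for the $O(N^2)$ cost mentioned above, the following is a minimal sketch of a direct DHT. It assumes the common unnormalised definition $H(k) = \sum_{i=0}^{N-1} h(i)\,[\cos(2\pi ik/N) + \sin(2\pi ik/N)]$; the normalisation used by the authors may differ, and the names `cas` and `dht_direct` are illustrative only.

```python
import math

def cas(theta):
    """Hartley kernel: cas(theta) = cos(theta) + sin(theta)."""
    return math.cos(theta) + math.sin(theta)

def dht_direct(h):
    """Direct O(N^2) DHT of a real sequence (assumed unnormalised form)."""
    n = len(h)
    return [sum(h[i] * cas(2.0 * math.pi * i * k / n) for i in range(n))
            for k in range(n)]

h = [1.0, 2.0, 3.0, 4.0]
H = dht_direct(h)
# With this normalisation, applying the transform twice gives N times the input,
# so dividing by N recovers the original sequence.
back = [v / len(h) for v in dht_direct(H)]
```

Only real arithmetic is used throughout, which is the property the text contrasts with the DFT.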
60% more floating point operations than the simplified butterfly scheme.
In this work, we propose an efficient restructuring of the sequential FHT algorithm which brings regularity and symmetry to the computational structure of the FHT. The restructured algorithm does not involve any computational overhead compared to the original algorithm. We then propose an efficient parallel FHT algorithm for medium-to-coarse grain hypercube multicomputers by introducing a dynamic mapping scheme for the restructured FHT. For the implementation of an $N$-point FHT on a $d$-dimensional hypercube with $P = 2^d \le N/4$ processors, the proposed parallel algorithm has the following features:
(i) it achieves perfect load balance for the simplified butterfly scheme, (ii) it requires only nearest-neighbor communications, (iii) it minimizes the number of concurrent communications to $d$ by eliminating fragmentary message passing, (iv) it minimizes the total concurrent communication volume to $dM/2$ by reducing the volume of communication in each concurrent exchange step to $M/2 = N/(2P)$ FHT points, (v) it achieves in-place computation and communication.
The sequential FHT is presented in Section 2. In Section 3, parallelization of the presented FHT scheme is discussed. Section 3.1 presents the proposed restructuring of the FHT algorithm for an efficient parallelization. The dynamic mapping scheme proposed for the restructured FHT algorithm is presented in Section 3.2. Section 4 presents the experimental results on Intel's iPSC/2 hypercube multicomputer. 
Sequential FHT Algorithm
Different strategies exist for the computation of the FHT, including Radix-2 Decimation-in-Time FHT, Radix-2 Decimation-in-Frequency FHT, Radix-4 FHT, Split Radix FHT, Recursive FHT and Vector FHT [5, 7, 9, 10]. The computational steps of a 32-point, radix-2, decimation-in-time FHT algorithm [7] are illustrated in Fig. 1. This tabular representation was proposed in [10]. The input in this scheme is $N$ real numbers in bit-reversed order, and the output is $N$ real numbers in normal order. The $C_i$ and $S_i$ factors in Fig. 1 represent $\cos(2\pi i/N)$ and $\sin(2\pi i/N)$, respectively. As seen in Fig. 1, each level of the FHT algorithm takes a set of $N$ real numbers and transforms them into another set of $N$ real numbers. This process is repeated $n = \log_2 N$ times, resulting in the in-place computation of the desired Hartley transform in normal order. However, the tabular representation is not sufficient for a detailed analysis of the computational interdependencies, which is crucial for the design of an efficient parallel algorithm. In this work, a computational flow graph for the FHT algorithm is derived in order to explore these interdependencies.
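The level-by-level structure described above can be sketched as follows. This is an illustrative radix-2 decimation-in-time FHT built from the standard recurrence $H'(k) = E(k) + \cos(\pi k/L)\,O(k) + \sin(\pi k/L)\,O((L-k) \bmod L)$ on blocks of $2L$ points; it is not a transcription of the tabular scheme in Fig. 1, and the function names are assumptions.

```python
import math

def bit_reverse(i, nbits):
    """Reverse the lowest nbits bits of i."""
    r = 0
    for _ in range(nbits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def fht_dit(x):
    """Radix-2 DIT FHT: bit-reversed input order, normal output order."""
    N = len(x)
    n = N.bit_length() - 1
    assert N == 1 << n, "length must be a power of two"
    H = [x[bit_reverse(i, n)] for i in range(N)]
    for level in range(n):                 # n = log2(N) levels
        L = 1 << level                     # half-size of each block
        for b in range(0, N, 2 * L):       # blocks of 2L points
            odd = H[b + L:b + 2 * L]       # copy of the second half
            for k in range(L):
                theta = math.pi * k / L
                t = math.cos(theta) * odd[k] + math.sin(theta) * odd[(L - k) % L]
                even = H[b + k]
                H[b + k] = even + t        # point in the first half
                H[b + L + k] = even - t    # point 2^level positions away
    return H
```

Each level reads $N$ real numbers and writes $N$ real numbers back into the same array, matching the in-place behaviour described in the text.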
A close examination of Fig. 1 [...] Fig. 2(a)) can be simplified as shown in Fig. 3(a). This simplification reduces the total number of floating point operations in the first stages of type-1 butterflies to 6 (from 8 multiplications and 4 additions to 4 multiplications and 2 additions) as follows: [...] The resulting FHT butterfly will be referred to here as the type-1 simplified FHT butterfly. A similar analysis can be applied to the type-2 basic FHT butterfly to reduce the number of multiplications in its first stage from 4 to 2. Furthermore, a detailed analysis shows that the Cos+Sin factors multiplied by the $q$ and $s$ points are always 1. Hence, the remaining two multiplications can also be omitted. Fig. 3(b) illustrates the computational flow graph for a type-2 simplified FHT butterfly. Note that multiplications with Cos+Sin factors are shown in Fig. 3(b) [...] In the rest of the paper, simplified FHT butterflies will be referred to simply as butterflies, unless otherwise stated.
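The operation counts quoted above can be illustrated with a small sketch. It assumes that the first-stage expressions have the form qtemp $= C_k q + S_k s$ and stemp $= C_j s + S_j q$ with $j = L - k$, which follows from the standard radix-2 decimation-in-time recurrence; the exact form of the paper's equations is an assumption here, as are all function names.

```python
import math

def type1_basic(p, r, q, s, k, L):
    """Basic type-1 butterfly: every output uses its own Cos/Sin factors
    (8 multiplications and 4 additions in the first stage)."""
    c = lambda m: math.cos(math.pi * m / L)
    sn = lambda m: math.sin(math.pi * m / L)
    j = L - k
    new_p = p + (c(k) * q + sn(k) * s)
    new_q = p + (c(k + L) * q + sn(k + L) * s)
    new_r = r + (c(j) * s + sn(j) * q)
    new_s = r + (c(j + L) * s + sn(j + L) * q)
    return new_p, new_r, new_q, new_s

def type1_simplified(p, r, q, s, k, L):
    """Simplified type-1 butterfly: 4 multiplications and 2 additions in
    stage 1, then 4 additions/subtractions in stage 2."""
    j = L - k
    qtemp = math.cos(math.pi * k / L) * q + math.sin(math.pi * k / L) * s
    stemp = math.cos(math.pi * j / L) * s + math.sin(math.pi * j / L) * q
    return p + qtemp, r + stemp, p - qtemp, r - stemp

def type2_simplified(p, r, q, s):
    """Simplified type-2 butterfly: the first stage involves no arithmetic."""
    return p + q, r + s, p - q, r - s
```

The saving comes from the identities $\cos(\theta + \pi) = -\cos\theta$ and $\sin(\theta + \pi) = -\sin\theta$, which let the $q$ and $s$ outputs reuse the products already formed for the $p$ and $r$ outputs.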
Each FHT point in an $N$-point FHT is assumed to have an $n$-bit binary representation, where $n = \log_2 N$. For example, a binary string of length $n$ denotes the binary representation of an FHT point $q$, where $q$ denotes its decimal index in the bit-reversed ordering. In both types of butterflies, the FHT points in both the $(p, q)$ and $(r, s)$ pairs differ only in the $\ell$th bit of their $n$-bit binary representations at level $\ell$, such that $q = p + 2^\ell$ and $s = r + 2^\ell$. That is, the $\ell$th bits of the binary representations of both the $q$ and $s$ indices are "1", whereas the $\ell$th bits of both the $p$ and $r$ indices are "0". Note that the least significant bit of a binary number is referred to here as its 0th bit. Hence, the FHT points in the $(p, q)$ and $(r, s)$ pairs are separated by $2^\ell$ at level $\ell$.
In a type-1 butterfly at level $\ell$, the two FHT points of each $(q, s)$ pair differ only in the least significant bits of their $n$-bit binary representations. This difference is such that the least significant $\ell$ bits of the binary representations of the $q$ and $s$ indices are the 2's complement of each other. Hence, the separation between the $q$ and $s$ indices of a type-1 butterfly varies between $2$ and $2^\ell - 2$ at level $\ell$ for $\ell \ge 2$. In a type-2 butterfly at level $\ell$, the $q$ and $s$ points differ only in the $(\ell-1)$th bit of their binary representations, such that $q$ is a multiple of $2^\ell$ and $s = q + 2^{\ell-1}$. Fig. 4 illustrates the proposed computational flow graph for the $(N = 32)$-point FHT algorithm using the simplified butterfly scheme. As seen in Fig. 4, [...] 1 butterflies confined to that block. The first points of successive quarters of each block constitute the $p$, $r$, $q$, $s$ points of the only type-2 butterfly involved in that block. As seen in Fig. 4, $\{16, 20, 24, 28\}$ is the only type-2 butterfly involved in block $B$ [...]
respectively. Note that $NT_1 + NT_2 = N/4$ FHT butterflies exist at each level for $\ell = 1, 2, \ldots, n-1$. Also note that level $\ell = 1$ consists of only $N/4$ type-2 butterflies, and that the number of type-2 butterflies is halved at each of the following $n-2$ levels, reducing to 1 at the last level ($\ell = n-1$).
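The index relations stated above can be checked with a short enumeration. The sketch below lists the $(p, r, q, s)$ indices of the butterflies at a given level of the original (non-restructured) algorithm; the function name and the tuple layout are illustrative choices, not taken from the paper.

```python
def butterflies(N, level):
    """Return (type1, type2): lists of (p, r, q, s) index tuples at a level >= 1."""
    L = 1 << level                     # separation between p and q, and r and s
    type1, type2 = [], []
    for b in range(0, N, 2 * L):       # one block of 2L points at a time
        # First points of the four quarters of the block form the type-2 butterfly.
        type2.append((b, b + L // 2, b + L, b + L + L // 2))
        for k in range(1, L // 2):
            p, r = b + k, b + L - k    # low bits of p and r are k and L - k
            type1.append((p, r, p + L, r + L))
    return type1, type2

t1, t2 = butterflies(32, 3)
```

For $N = 32$ and $\ell = 3$ this reproduces the type-2 butterfly $\{16, 20, 24, 28\}$ mentioned in the text, and at every level the two lists together contain $N/4$ butterflies.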
Parallel FHT Algorithm
There are strong computational dependencies in the FHT algorithm. These dependencies exist between successive levels and are confined within the butterflies. As seen in Fig. 3(a) and Fig. 4, the stage-2 computations in type-1 butterflies depend on the results of the stage-1 computations. The computation of the qtemp and stemp values ((3a) and (3b), respectively) in the first stage necessitates a bidirectional interdependency between the $q$ and $s$ points, which will be referred to here as $q \leftrightarrow s$ interactions. Note that the first stages of type-2 butterflies involve no computations and no interactions; type-2 butterflies are modeled as two-stage computations only for the sake of completeness. The update of the $p$, $r$, $q$ and $s$ points in the second stages of all butterflies (for $\ell = 1, 2, \ldots, n-1$) necessitates bidirectional interdependencies between the $p$ and $q$ points and between the $r$ and $s$ points, which will be referred to here as $p \leftrightarrow q$ and $r \leftrightarrow s$ interactions. The $p \leftrightarrow q$ and $r \leftrightarrow s$ interactions are very regular in nature, since the $p$ and $q$ points, and the $r$ and $s$ points, are separated by $2^\ell$ at level $\ell$ for $\ell \ge 1$. In fact, this regularity makes the hypercube topology very suitable for the parallelization of the FHT. However, the $q \leftrightarrow s$ interactions complicate the parallelization because of the irregular spacing between the $q$ and $s$ points of type-1 butterflies.
This paper investigates the parallelization of an $(N = 2^n)$-point FHT on a $d$-dimensional hypercube with $P = 2^d$ processors, where the number of 4-point FHT butterflies is an integer (power of 2) multiple of the number of processors (i.e., $N \ge 4P$). A straightforward parallelization can be achieved by adopting a static tiled mapping. The first processor in the decimal ordering is assigned the first $M = N/P$ FHT points, the second processor is assigned the next $M$ points, and so on, so that successive processors are assigned consecutive slices of $M$ consecutive FHT points each. This mapping prevents the fragmentation of FHT butterflies and of $(q, s)$ pairs during the first $n-d$ and $n-d+1$ levels, respectively. Both the $(p, q)$ and $(r, s)$ pairs of butterflies are fragmented across processor pairs that are neighbors over channel $c = \ell - n + d$ at level $\ell$ for $\ell = n-d, \ldots, n-1$. Here, channel $c$ denotes the set of $P/2$ communication links between processor pairs whose $d$-bit binary representations differ only in their $c$th bit. Hence, the pairwise exchanges due to the $p \leftrightarrow q$ and $r \leftrightarrow s$ interactions can be accomplished by performing a concurrent single-hop exchange communication over channel $c = \ell - n + d$ at level $\ell$ for $\ell = n-d, \ldots, n-1$. Unfortunately, the fragmentation of the $(q, s)$ pairs, and hence the communications due to the $q \leftrightarrow s$ interactions, are very irregular and complicated. A careful analysis reveals that the $q \leftrightarrow s$ interactions necessitate concurrent exchange communications, each with a volume of $M-1$ FHT points, at each of the last $d-1$ levels, plus concurrent exchange communications, each with a volume of a single FHT point, at each of the last $d-2$ levels. All exchange communications of the former type are single-hop communications at level $\ell = n-d+1$ and multi-hop communications with distances $2, \ldots, d-1$ during the last $d-2$ levels $\ell = n-d+2, \ldots, n-1$, respectively.
All exchange communications of the latter type are single-hop communications at level $\ell = n-d+2$ and mostly multi-hop communications with maximum distances $2, \ldots, d-2$ during the last $d-3$ levels $\ell = n-d+3, \ldots, n-1$, respectively. Multi-hop exchange communications during the last $d-2$ levels will cause drastic performance degradation due to congestion.
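The behaviour of the static tiled mapping can be explored with a small experiment. The sketch below assumes the tiled rule "point $i$ belongs to processor $\lfloor i/M \rfloor$" and the butterfly index relations of the original algorithm; the function names and the returned report format are illustrative only.

```python
def owner(i, M):
    """Processor that holds FHT point i under the tiled mapping."""
    return i // M

def hops(a, b):
    """Hypercube distance between processors a and b (Hamming distance)."""
    return bin(a ^ b).count("1")

def tiled_comm(n, d):
    """For each level, return (set of XORs of the owners of points 2^level apart,
    maximum hop distance between the owners of q and s)."""
    N, M = 1 << n, 1 << (n - d)
    report = {}
    for level in range(1, n):
        L = 1 << level
        pq_diffs, qs_hops = set(), 0
        for b in range(0, N, 2 * L):
            for k in range(L):                       # all (p,q) and (r,s) pairs
                pq_diffs.add(owner(b + k, M) ^ owner(b + k + L, M))
            for k in range(1, L // 2):               # (q,s) pairs of type-1 butterflies
                q, s = b + L + k, b + 2 * L - k
                qs_hops = max(qs_hops, hops(owner(q, M), owner(s, M)))
        report[level] = (pq_diffs, qs_hops)
    return report

report = tiled_comm(6, 3)   # N = 64 points on a 3-dimensional hypercube, M = 8
```

In this example the XOR of the owners is $0$ while butterflies are local and $2^c$ with $c = \ell - n + d$ afterwards, whereas the $q \leftrightarrow s$ distance grows to 2 hops at the last level.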
The fine-grain algorithm proposed by Hou [6] considers the parallelization of an $N$-point FHT on a hypercube with $P = N$ processors, where each processor is assigned a single FHT point. Here, we briefly describe an extension of Hou's fine-grain algorithm to medium-to-coarse grain parallelism. A tiled decomposition scheme is adopted for the initial mapping, which is maintained during the first $n-d+2$ levels $\ell = 0, 1, \ldots, n-d+1$. The tiled mapping already confines the FHT butterflies to 1-dimensional and 2-dimensional subcubes over channels $c = 0$ and $c = 0, 1$ at levels $\ell = n-d$ and $\ell = n-d+1$, respectively. Hence, the second stages of levels $\ell = n-d$ and $\ell = n-d+1$, and the first stage of level $\ell = n-d+1$, necessitate concurrent single-hop exchange communications over channels $c = 0, 1$ and $c = 0$ due to the $p \leftrightarrow q$, $r \leftrightarrow s$ and $q \leftrightarrow s$ interactions, respectively. Then, at the end of each level $\ell = n-d+1, \ldots, n-2$, those processor pairs that exchanged their local $M-1$ or $M$ $q$ or $s$ points during the first stage of that level exchange the further responsibilities for these local FHT points. These mapping exchange operations, performed at the end of each level $\ell$ for $\ell = n-d+1, \ldots, n-2$, confine the FHT butterflies to 2-dimensional subcubes over successive channels $\ell-n+d$ and $\ell-n+d+1$ at the following level $\ell+1$. The $d$-bit binary representations of the 4 processors in each subcube differ only in their $c$th and $(c-1)$th bits, such that these two successive bits are "00", "01", "10" and "11" in the first, second, third and fourth processors, respectively. The fragmentation of FHT butterflies across these subcubes during the last $d-1$ levels is such that the first, second, third and fourth processors in each subcube hold the $M$ $p$, $r$, $q$ and $s$ points, respectively, of the $M$ butterflies confined to that subcube. Hence, each level $\ell$ of the last $d-1$ levels requires 3 concurrent single-hop exchange communications, each with a volume of $M$ (or $M-1$) FHT points, over channels $c-1$, $c$ and $c-1$, respectively, where $c = \ell-n+d$.
The first and second exchange communications are information exchange operations due to the $q \leftrightarrow s$ interactions and the $p \leftrightarrow q$, $r \leftrightarrow s$ interactions in the first and second stage computations, respectively. The third exchange communication is a mapping exchange operation due to the nonlocal $q \leftrightarrow s$ swaps. Note that level $\ell = n-d$ necessitates only one concurrent single-hop exchange communication over channel $c = 0$, and that the mapping exchange communication at the last level may not be necessary. Thus, the number and volume of concurrent communications required by this scheme are $3d-3$ and $(3d-3)M$ FHT points, respectively.
The dynamic mapping scheme proposed by Lin [8] reduces the number of concurrent communications to $d$. The initial mapping avoids the fragmentation of two-point butterflies at level $\ell = 0$ by assigning consecutive FHT-point pairs to successive processors in a cyclic manner. This initial mapping can be considered a scattered mapping of consecutive FHT-point pairs, where the FHT-point pair $(2i, 2i+1)$ is assigned to processor $i \bmod P$. The dynamic mapping during the following $d$ levels confines the FHT butterflies to processor pairs that are neighbors on the Hartley graph during levels $\ell = 1, 2, \ldots, d$, and prevents the fragmentation of butterflies during the last $n-d-1$ levels. At level $\ell = 1, 2, \ldots, d$, processor pairs whose least significant $\ell-1$ bits are all 0's hold $M/2$ type-2 butterflies, whereas all other processor pairs hold $M/2$ type-1 butterflies. The former and latter types of processor pairs will be referred to here as type-2 and type-1 processor pairs, respectively. The fragmentation of level-$\ell$ butterflies (for $\ell = 1, 2, \ldots, d$) across each processor pair is such that the $i$th local FHT-point pairs in the first and second processors, whose $(\ell-1)$th bits are 0 and 1, correspond to the $(p, r)$ and $(s, q)$ pairs, respectively, of the butterflies confined to that processor pair, for $i = 0, 1, \ldots, M/2-1$. The first and second processors of type-1 pairs are responsible for updating the $(p, s)$ and $(r, q)$ pairs, respectively, or vice versa, depending on their $\ell$th bits. The first and second processors of type-2 pairs are responsible for updating the $(p, q)$ and $(r, s)$ pairs, respectively.
Hence, type-1 processor pairs need to exchange all of their local FHT points at the beginning of each level $\ell = 2, \ldots, d$. However, type-2 processor pairs need to exchange only half of their local FHT points at the beginning of each level $\ell = 1, 2, \ldots, d$. These exchanges will be referred to here as type-1 and type-2 exchanges, respectively. One half of the $M$ local FHT points involved in each type-1 exchange is a mapping exchange, whereas the other half is exchanged because of the computational interdependencies.
Type-2 exchanges are both mapping and information exchanges.
All $P/2$ processor pairs are type-2 pairs at level $\ell = 1$, and the number of type-2 processor pairs is halved at each of the following $d-1$ levels, reducing to 1 at level $\ell = d$. Thus, the communication volume of the type-1 exchanges determines the concurrent communication volume during levels $\ell = 2, 3, \ldots, d$. Hence, the concurrent communication volume overhead of Lin's algorithm is $Md - M/2$ FHT points on the Hartley graph. Unfortunately, the Hartley graph cannot be embedded with dilation one onto the hypercube graph, as is also indicated in [8]. In a hypercube implementation of Lin's algorithm, type-2 exchanges are single-hop communications over channel $c = \ell-1$ at level $\ell$ for $\ell = 1, 2, \ldots, d$. Type-1 exchanges at level $\ell = 2$ are single-hop communications over channel $c = 1$. Hence, all exchanges can be performed concurrently over channels $c = 0$ and $c = 1$ at levels $\ell = 1$ and $\ell = 2$, respectively. However, type-1 exchanges at levels $\ell = 3, \ldots, d$ are mostly multi-hop communications with maximum distances of $\ell - 1 = 2, \ldots, d-1$. Hence, the concurrent communication volume overhead of Lin's algorithm will be much higher on the hypercube topology due to congestion during these $d-2$ levels.
Although these two algorithms are successful attempts to reduce the communication overhead, neither of them achieves perfect load balance for the simplified butterfly scheme. Consider the coarse-grain extension of Hou's algorithm. The tiled mapping scheme, which is maintained during the first $n-d+2$ levels, achieves perfect load balance during the first $n-d$ levels, since it assigns an equal number of unfragmented butterflies to each processor during these levels. However, load balance is disturbed during the first stage computations of the last $d$ levels. At levels $\ell = n-d$ and $\ell = n-d+1, \ldots, n-1$, the processors can be considered as divided into 2 and 4 groups, containing $P/2$ and $P/4$ processors each, respectively. At level $\ell = n-d$, each processor in the first and second halves of the hypercube holds and updates the $M/2-1$ $(p, r)$ and $(q, s)$ pairs of type-1 butterflies, respectively. Hence, at level $\ell = n-d$, the half of the processors holding $q$ and $s$ points concurrently performs $3M-6$ floating point operations, while the processors in the other half wait idle to receive the qtemp and stemp results of the first stage computations of type-1 butterflies. At levels $\ell = n-d+1, \ldots, n-1$, each processor in the first, second, third and fourth quarters of the hypercube holds and updates either $M-1$ or $M$ $p$, $r$, $q$ and $s$ points of type-1 butterflies, respectively. Hence, at levels $\ell = n-d+1, \ldots, n-1$, the half of the processors holding $q$ or $s$ points concurrently performs $3M$ or $3M-3$ floating point operations, while the processors in the other half wait idle to receive the qtemp or stemp results of the first stage computations of type-1 butterflies. Note that this algorithm cannot achieve perfect load balance even for the basic butterfly scheme during the first stage computations of the last $d-1$ levels because of the 4-way computational fragmentation of FHT butterflies during these levels.
Here, 4-way computational fragmentation refers to the situation in which 4 different processors compute the 4 different points of the same FHT butterfly.
Lin's algorithm, which was originally proposed for the basic butterfly scheme, achieves perfect load balance only for this scheme. It achieves perfect load balance during levels $\ell = 0$ and $\ell = d+1, \ldots, n-1$, for both the basic and simplified butterfly schemes, by assigning an equal number of unfragmented butterflies to each processor during these $n-d$ levels. The 2-way fragmentation during levels $\ell = 1, 2, \ldots, d$ achieves perfect load balance for the basic butterfly scheme during these $d$ levels. Consider the performance of this algorithm for the simplified butterfly scheme during the $d-1$ levels $\ell = 2, \ldots, d$. In the following section, we propose and describe a restructuring which brings regularity to the $q \leftrightarrow s$ interactions without disturbing the regularity of the $p \leftrightarrow q$ and $r \leftrightarrow s$ interactions. We then propose a dynamic mapping scheme for the restructured algorithm which totally avoids the computational fragmentation of FHT butterflies, as illustrated in Fig. 6(c).
Restructuring
The computational interdependencies between successive levels of the FHT algorithm should be closely examined in order to achieve a restructuring suitable for efficient parallelization. [...] Two consecutive blocks at level $\ell$ such that the indices of the $p$, $r$, $q$ and $s$ points of the two butterflies in each pair differ only in their $(\ell+1)$th bits. For example, in a 32-point FHT (see Fig. 4 [...]
where $k = n-\ell-2$. The proof follows from Definitions 2 and 1, since $0^{\ell-1}$ [...] and the $\ell$-bit 2's complement of $(1\,0^{\ell-1})$ is equal to itself.
Fig. 7 illustrates the combination structures of type-1 and type-2 butterfly pairs. As seen in Fig. 4, in a 32-point FHT, the type-1 butterfly pair ($\{1,7,9,15\} \in B_0^3$, $\{17,23,25,31\} \in B$ [...] In the discussions so far, the $p$, $r$, $q$ and $s$ labels were used to identify both the different points of FHT butterflies and the decimal indices of the corresponding FHT points in the H-array. However, for the sake of clarity in the further discussions, the $p$, $r$, $q$ and $s$ labels will be used only to identify the different points of FHT butterflies, whereas the $i$ and $j$ labels will be used to identify their decimal indices in the H-array. Note that the $i$ and $j$ indices satisfy the same relations previously defined for the $p$, $r$, $q$ and $s$ points. That is, $i_3 = i_1 + 2^\ell$, $i_4 = i_2 + 2^\ell$, $j_3 = j_1 + 2^\ell$, $j_4 = j_2 + 2^\ell$, $j_1 - i_1 = j_2 - i_2 = j_3 - i_3 = j_4 - i_4 = 2^{\ell+1}$, etc. In this notation, Theorems 1 and 2 can be restated as follows: the level-$(\ell+1)$ pairs $(FT1^{\ell+1}, ST1^{\ell+1})$ and $(FT2^{\ell+1}, ST1^{\ell+1})$ generated by the type-1 pair $(T1_0, T1_1)$ and the type-2 pair $(T2_0, T2_1)$ will have the following structure in the H-array:

$$FT1^{\ell+1} = \{i_1, i_4, j_1, j_4\}, \qquad ST1^{\ell+1} = \{i_2, i_3, j_2, j_3\},$$

$$FT2^{\ell+1} = \{i_1, i_3, j_1, j_3\}, \qquad ST1^{\ell+1} = \{i_2, i_4, j_2, j_4\},$$

respectively.
Theorems 1 and 2 reveal that butterfly pairs that are regularly separated (by powers of 2) at a particular level constitute scrambled butterfly pairs at the following level. This scrambled combination of the butterfly pairs is the main reason for the irregular spacing between the $q$ and $s$ points of type-1 butterflies in the following levels. However, the scrambling between butterfly pairs can be avoided by a clever reordering while storing the computational results of each butterfly into the H-array. This internal reordering is different for type-1 and type-2 butterflies, since the combination structures of these two types of butterfly pairs differ from each other. The combination structure of type-2 FHT butterfly pairs is also investigated, since they generate a single type-1 butterfly at the following level.
The scrambled combination of type-1 butterfly pairs is avoided by swapping the $r$ and $s$ points of type-1 butterflies while storing their updated values into the H-array. The scrambled combination of type-2 butterfly pairs is avoided by swapping the $r$ and $q$ points of type-2 butterflies while storing their updated values into the H-array. In this scheme, the results of type-1 ($T1$ [...] Comparison of (8) with (3), and (9) with (4) [...] $T_1$) is a type-1 butterfly pair. Note that the proposed restructuring avoids the scrambled combination structure between butterfly pairs at successive levels. Furthermore, in the proposed scheme, the $p, r$ points and the $q, s$ points of both $FT^{\ell+1}$ and $ST^{\ell+1}$ will be allocated to consecutive locations of the H-array if the $p, r$ points and the $q, s$ points of both $T_0$ and $T_1$ are initially allocated to consecutive locations of the H-array. This structure is valid for both types of butterfly pairs in the proposed restructuring scheme, since the $(p, r)$ and $(q, s)$ pairs of $FT^{\ell+1}$ [...] This important feature of the proposed restructuring scheme will be exploited to avoid the fragmentation of the $(q, s)$ pairs of type-1 butterflies during the parallelization.
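The effect of the two swap rules can be checked by tracking where each point of the original algorithm is stored. The sketch below is an assumption-laden model, not the paper's implementation: it applies the stated rules literally (swap $r$ and $s$ for type-1, swap $r$ and $q$ for type-2, starting at level 1) and records the H-array locations of the $(p, r, q, s)$ points of every butterfly at every level. The function name and the bookkeeping array `loc` are hypothetical.

```python
def restructured_layout(n):
    """Return {level: [(loc_p, loc_r, loc_q, loc_s), ...]} for a 2^n-point FHT."""
    N = 1 << n
    loc = list(range(N))              # loc[original index] = H-array location
    layout = {}
    for level in range(1, n):
        L, half = 1 << level, 1 << (level - 1)
        new_loc = loc[:]
        entries = []
        for b in range(0, N, 2 * L):
            for k in range(half):
                # k == 0 gives the type-2 butterfly, k >= 1 the type-1 butterflies.
                p, r = (b, b + half) if k == 0 else (b + k, b + L - k)
                q, s = p + L, r + L
                lp, lr, lq, ls = loc[p], loc[r], loc[q], loc[s]
                entries.append((lp, lr, lq, ls))
                if k == 0:
                    new_loc[q], new_loc[r] = lr, lq   # type-2: swap r and q
                else:
                    new_loc[s], new_loc[r] = lr, ls   # type-1: swap r and s
        layout[level] = entries
        loc = new_loc
    return layout

layout = restructured_layout(5)       # 32-point example
```

In this model every $(p, r)$ pair and every $(q, s)$ pair occupies two consecutive H-array locations at all levels, which is the property the restructuring is meant to provide.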
In the original FHT algorithm, 4-point butterfly computations start at level $\ell = 1$, which contains only type-2 butterflies. Note that the $p, r$ points and the $q, s$ points of all type-2 butterflies at level $\ell = 1$ are already allocated to consecutive locations of the H-array. Hence, if the proposed restructuring is applied starting from level $\ell = 1$, then the $p, r$ points and the $q, s$ points of all butterflies at the following levels will be allocated to consecutive locations of the H-array. Fig. 9 illustrates the computational flow graph for the restructured 32-point FHT algorithm. As seen in Fig. 9, the type-1 butterfly pair ($\{18,19,22,23\}$, $\{26,27,30,31\}$) at level $\ell = 2$ constitutes the type-1 butterfly pair ($\{18,19,26,27\}$, $\{23,22,31,30\}$) at the following level $\ell = 3$. Similarly, the type-2 butterfly pair ($\{16,17,20,21\}$, $\{24,25,28,29\}$) at level $\ell = 2$ constitutes the (type-2, type-1) butterfly pair ($\{16,17,24,25\}$, $\{20,21,28,29\}$) at the following level $\ell = 3$. As is also seen in Fig. 9, the proposed restructuring does not disturb the block structure of the original FHT algorithm. Furthermore, it brings regularity and symmetry to the in-block allocation structure of the FHT butterflies. The following paragraph explains the regular allocation structure of the $2^{\ell-1} = 2^{\ell+1}/4$ butterflies in each block at level $\ell$ for $\ell = 1, 2, \ldots, n-1$.
In each block, the $2^{\ell-1}$ consecutive FHT-point pairs in the first and second halves constitute the $(p, r)$ and $(q, s)$ pairs, respectively, of the butterflies involved in that block. Consecutive FHT-point pairs in each half are ordered regularly, such that the $i$th pairs in the first and second halves constitute the $(p, r)$ and $(q, s)$ pairs of the same butterfly, respectively, for $i = 0, 1, \ldots, 2^{\ell-1}-1$. [...] in each half hold the FHT points of the $(p, r)$ and $(q, s)$ pairs in reverse order (i.e., as $\{r, p\}$ and $\{s, q\}$). These reverse-ordered $(p, r)$ and $(q, s)$ pairs belong to the second type-1 butterflies generated from type-1 butterfly pairs in the previous level.
For example, in a 32-point restructured FHT algorithm (see Fig. 9 [...] $-1 = 1$. As seen in Fig. 9, this type-1 butterfly is the second butterfly generated by the type-1 butterfly pair ($\{2,3,6,7\}$, $\{10,11,14,15\}$) in the previous level ($\ell = 2$). [...] $-1$) $(p, r)$ and $(q, s)$ pairs of each block are held in reverse order in the H-array during these levels. A careful analysis of (3) reveals the symmetry between the computations of the $p$ and $r$ points, and of the $q$ and $s$ points, of type-1 butterflies. That is, correct values for the type-1 butterflies will also be computed if we interchange $p$ with $r$, $q$ with $s$, and $i$ with $j$ in (3). In this case, qtemp will hold the correct value of stemp and vice versa. This symmetry in type-1 butterfly computations is exploited in the restructured FHT algorithm as follows. The first two lines in the innermost for-loop compute the indices of the $p$, $r$, $q$, $s$ points of the type-1 butterflies involved in a particular block, assuming a proper ordering of the FHT points in the $(p, r)$ and $(q, s)$ pairs. Hence, during the first $2^{\ell-2}$ iterations, the p1, r1, q1, s1 variables refer to the correct FHT points $p$, $r$, $q$, $s$, respectively, in the H-array. However, during the last $2^{\ell-2}-1$ iterations, the p1, r1, q1, s1 indices refer to the $r$, $p$, $s$, $q$ points, respectively, in the H-array. Thus, this scheme implicitly achieves the interchange of $p$ with $r$, and of $q$ with $s$. The interchange of the Cos/Sin factors (i.e., the interchange of $i$ and $j$) is also achieved implicitly during the construction of the Cos/Sin factor index tables prior to the execution of the program. As seen in Fig. 9, at level $\ell = 4$, the $i/j$ indices of the last $2^{4-2}-1 = 3$ Cos/Sin factor pairs appear in reverse order (as $j/i$: 9/7, 10/6, 13/3).

Figure 9: Computational flow graph for a 32-point restructured FHT and its tiled mapping on a 2-dimensional hypercube. The nonlocal alignment operations during the last two levels $\ell = 2$ and $\ell = 3$ correspond to mapping exchanges of the respective FHT points.
Hence, the last four statements of the innermost for-loop effectively compute the correct values for the $s$, $p$, $r$, $q$ points of type-1 butterflies and store them into H[q1], H[s1], H[p1], H[r1], respectively. Thus, the updated values of the $s$, $p$, $r$, $q$ points of type-1 butterflies are effectively stored into their $s$, $q$, $r$, $p$ locations, respectively. Hence, the $p$ and $q$ points of type-1 butterflies are effectively swapped, instead of the $r$ and $s$ points, during these iterations.
The implementation scheme proposed in Fig. 10 [...]), at levels $\ell \ge 3$. We need to examine the combination structure of these reverse butterfly pairs in order to show that the implementation scheme in [...] in the H-array. For example, the type-1 butterfly pair ($\{7,6,15,14\}$, $\{23,22,31,30\}$) at level $\ell = 3$ constitutes the type-1 butterfly pair ($\{15,14,31,30\}$, $\{6,7,22,23\}$) at the next level $\ell = 4$. It is clear that the $(ST1^{\ell+1}, FT1^{\ell+1})$ butterfly pairs generated during the last $2^{\ell-2}-1$ iterations will have the same spatial structure as the $(FT1^{\ell+1}, ST1^{\ell+1})$ butterfly pairs generated during the first $2^{\ell-2}$ iterations of the innermost for-loop.
Hence, the scheme proposed in Fig. 10 maintains the regular and symmetrical features of the restructured FHT algorithm without disturbing the simplicity and regularity of programming.
As seen in Fig. 9, the order of the output results is scrambled in the proposed restructured FHT algorithm. However, in most DSP applications a sequence of DSP blocks is applied consecutively to a set of input data. A proper output/input interface between successive DSP blocks can always be maintained if the output or input data order of a particular DSP block is disturbed for the sake of efficiency.
Hence, the order of input and output data of individual DSP blocks does not bring any inefficiency to the overall application.
Dynamic Mapping
Consider the performance of the tiled mapping scheme for the parallelization of the restructured FHT algorithm. The internal alignment operations for the restructured butterflies correspond to simple local swap operations during the first $n-d$ levels, since the tiled mapping prevents the fragmentation of butterflies during these levels. However, these alignment operations necessitate mapping exchange communications after the second stage computations of the last $d$ levels because of the fragmentation of butterflies during these levels. The nonlocal alignment operations performed at the end of each level $\ell$, for $\ell = n-d, \ldots, n-2$, confine the FHT butterflies of the next level $(\ell+1)$ to 1-dimensional subcubes over channel $c = \ell-n+d+1$. The $d$-bit binary representations of the two processors in each subcube differ only in their $c$th bit, such that this bit is "0" and "1" in the first and second processors of the subcube, respectively. The fragmentation of FHT butterflies across these subcubes is such that the first and second processors in each subcube hold, and are responsible for computing, the $M/2$ $(p, r)$ and $(q, s)$ pairs, respectively, of the $M/2$ butterflies confined to that subcube. Hence, each level $\ell$ of the last $d$ levels requires two concurrent single-hop exchange communications, both over channel $c = \ell-n+d$. The first concurrent exchange communication, of volume $M$ FHT points, is due to the $p \leftrightarrow q$ and $r \leftrightarrow s$ interactions. The second concurrent exchange communication, of volume $M/2$ FHT points, is a mapping exchange operation due to the nonlocal alignment operations. Thus, the proposed restructuring reduces the number and volume of concurrent communications to $2d$ and $3dM/2$ FHT points, respectively. Although this scheme achieves perfect load balance for the basic butterfly scheme, it does not achieve perfect load balance for the simplified butterfly scheme because of the fragmentation of butterflies during the last $d$ levels.
In this section, we propose a dynamic mapping scheme for the restructured FHT algorithm which prevents the fragmentation of FHT butterflies. Starting with the initial tiled mapping, the alignment operations in the restructured FHT algorithm do not fragment the butterflies during the first $n-d$ levels, and confine the butterflies to 1-dimensional subcubes during the last $d$ levels. The first and second processors in each subcube hold the $(p, r)$ and $(q, s)$ pairs, respectively, of the butterflies confined to that subcube. In the proposed scheme, at the beginning of each level $\ell$ during the last $d$ levels, the first and second processors in each subcube exchange the appropriate halves of their local $(p, r)$ and $(q, s)$ pairs, respectively, such that each processor gathers $M/4$ unfragmented butterflies. This exchange communication is a mapping exchange operation which effectively exchanges the responsibility for the further computations associated with the exchanged FHT points. The $M/2$ butterflies fragmented across the two processors of each subcube are evenly divided between these two processors after the mapping exchange communication. Hence, this scheme achieves perfect load balance for both the basic and simplified butterfly schemes, since it gathers and assigns an equal number of unfragmented butterflies to each processor at each level. These mapping exchange operations are the only communication requirement of the proposed scheme, since they gather and assign unfragmented butterflies to all processors at each level of the last $d$ levels. Hence, in this scheme, each level of the last $d$ levels requires only one concurrent single-hop exchange communication, of volume $M/2$ FHT points, over channel $c = \ell-n+d$. Thus, the proposed scheme reduces the number and volume of concurrent communications to $d$ and $dM/2$ FHT points, respectively.
In this scheme, the alignment ... (Fig. 10) for the first $(n-d)$ levels. As is seen in Fig. 11, the computational flow graphs for the local FHT computations performed by the processors during the first $(n-d)$ levels are exactly the same as the computational flow graph for the $M$-point FHT algorithm. That is, the $P$ processors can be considered as concurrently computing $P$ independent $M$-point FHTs (using the proper Cos/Sin factors for the $N$-point FHT) during the first $n-d$ levels. Hence, the pseudo-code of the node program for the first $n-d$ levels of the parallel algorithm can easily be obtained by replacing the variables $N$ and $n$ in Fig. 10 with $M$ and $m = \lg_2 M$, respectively.
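The locality of the first $n-d$ levels can be illustrated with a small sketch. It rests on two assumptions that are not taken from the text: the tiled mapping is modeled as assigning contiguous blocks of $M = N/P$ points to processors, and only the $p \leftrightarrow q$ pairing is modeled, as index $i$ paired with $i \oplus 2^{\ell}$ at level $\ell$ (levels numbered from 0). The retrograde $(r, s)$ indices and the alignment operations are not modeled, and the names `owner` and `channel_per_level` are hypothetical.

```python
def owner(i, n, d):
    """Processor holding point i when N = 2**n points are split into
    contiguous blocks of M = 2**(n-d) points (assumed block/tiled layout)."""
    return i >> (n - d)


def channel_per_level(n, d):
    """For each level, return None if every modeled pair (i, i XOR 2**level)
    is local to one processor, otherwise the hypercube channel (bit position)
    in which the two owners differ."""
    N = 1 << n
    channels = []
    for level in range(n):
        # XOR of the two owner indices; identical for every i at a given level
        diffs = {owner(i, n, d) ^ owner(i ^ (1 << level), n, d) for i in range(N)}
        assert len(diffs) == 1
        diff = diffs.pop()
        channels.append(None if diff == 0 else diff.bit_length() - 1)
    return channels
```

Under these assumptions, `channel_per_level(6, 2)` reports no communication for levels 0 to 3 and channels 0 and 1 for levels 4 and 5, which matches $c = \ell-n+d$ for the last $d$ levels.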
In the first inner if-then-else statement of Fig. 12, each processor identifies itself as either the first or the second processor in the respective 1-dimensional subcube by simply checking the $c$th bit of its processor index. Here, mynode is assumed to be a $d$-bit binary number representing the index of the respective processor. The variable $c$ denotes the channel over which the mapping exchange operation is to be performed at that level. Then, each first processor exchanges the second half of its local H-array with the first half of the local H-array of the respective second processor, and vice versa. Hence, the first processors effectively exchange their local $M/4$ $(p, r)$ pairs with the local $M/4$ $(q, s)$ pairs of the respective second processors, and vice versa. The pr and qs indices used inside the first if-then-else statement identify the nature of the FHT points being sent and received. The proposed parallel FHT algorithm does not necessitate any extra send or receive buffers. All communications are initiated from/into contiguous locations of the local H-arrays, thus avoiding any scatter/gather type of local operations for communications. Note that the first and second processors at a particular level use the second and first halves of their local H-arrays, respectively, as contiguous send and receive buffers for the exchange communication operations. Hence, the proposed scheme has a very regular in-place communication structure. In Fig. 12, send and recv denote synchronous (blocking) send and receive primitives. Synchronous send/receive operations are used to prevent the contamination of the message to be sent with the incoming message, since the same half of the local H-array is used as both the send and the receive buffer at a particular level. Thus, as is also seen in Fig. 12, each processor performs simplified FHT butterfly computations on local $(p, r)$ and $(q, s)$ pairs separated by $M/2 = N/(2P)$.
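The mapping exchange described above can be mimicked sequentially with a minimal sketch. This is not the node program of Fig. 12: the list of lists `H` stands in for the local H-arrays of all processors, the simultaneous slice swap stands in for the paired blocking send and recv, and the names `label_arrays`, `mapping_exchange` and `dynamic_totals` as well as the "pr"/"qs" tags are hypothetical. The ordering of points inside each half is not modeled.

```python
def label_arrays(P, M, c):
    """Build P local arrays of M tagged points. Processors whose bit c is 0
    (first processors) are assumed to hold only (p, r) points, the others
    only (q, s) points. Each point is tagged (kind, origin_node, position)."""
    return [
        [("pr" if (node >> c) & 1 == 0 else "qs", node, i) for i in range(M)]
        for node in range(P)
    ]


def mapping_exchange(H, c):
    """Swap, in place, the second half of each first processor's array with
    the first half of its neighbor over channel c. Returns the volume (in
    FHT points) moved by each processor."""
    half = len(H[0]) // 2
    for mynode in range(len(H)):
        if (mynode >> c) & 1 == 0:         # first processor of its subcube
            other = mynode ^ (1 << c)      # second processor over channel c
            # right-hand side is copied before assignment, so nothing is clobbered
            H[mynode][half:], H[other][:half] = H[other][:half], H[mynode][half:]
    return half


def dynamic_totals(n, d):
    """Number and total volume of exchanges over the last d levels when one
    exchange of M/2 points is performed per level."""
    M = 2 ** (n - d)
    return d, d * (M // 2)
```

After one call to `mapping_exchange`, every local array in this model holds (p, r) points in its first half and (q, s) points in its second half, so the points of a butterfly are separated by $M/2$; the same halves serve as send and receive buffers, and no extra buffer is allocated apart from the temporary copies made by the Python swap itself.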
The proposed parallel FHT algorithm has a very regular in-place computational structure and hence can also be implemented on SIMD type hypercubes efficiently.
Although butterflies are partitioned evenly among processors throughout the algorithm, the type of butterflies assigned to the processors may differ. After the mapping exchange operation at each level $\ell$ of the last $d$ levels, the first butterfly of the $M/4$ butterflies in each processor is a type-2 butterfly if the least significant $c+1$ bits of the processor index are all 0's, where $c = \ell-n+d$. Otherwise, it is a type-1 butterfly, as are the remaining $M/4-1$ butterflies. So, at each level $\ell$ of the last $d$ levels, $P/2^c$ processors compute one type-2 and $M/4-1$ type-1 butterflies, while the others compute $M/4$ type-1 butterflies, where $c = \ell-n+d$. As is seen in Fig. 12, this difference in local computations is resolved simply by the second if-then-else statement before the inner for-loop.
The parallel execution time of the proposed FHT algorithm can be modeled as 
Comparing the first term of (12) with the expression given for the overall sequential execution time $T_{seq}$ in (7), we can rewrite (12) as (13). The first two terms in (13) represent the parallel execution time under perfect load balance conditions. The last term in (13) represents the slight deviation from perfect load balance as a parallel computational overhead term. This overhead, which is always smaller than the machine-specific constant $6t_{calc}$, can be neglected for sufficiently large $N/P$ or $N$ values.
Experimental Results
Table 1 illustrates the parallel performance comparison of Lin's algorithm and the proposed algorithm. As described earlier, the parallel computational performance of Lin's algorithm reduces to that of the basic butterfly scheme. Recall that $t_{basic}/t_{simp} = 1.6$, where $t_{basic}$ and $t_{simp}$ denote the computational complexity of type-1 basic and simplified butterflies, respectively. As is seen in Table 1, the experimental performance ratio of the proposed algorithm to Lin's algorithm approaches this ratio with increasing FHT size. The larger communication volume overhead of Lin's algorithm does not introduce a significant decrease in its relative performance on the iPSC/2 compared to the proposed algorithm because of the small $t_{tr}/t_{calc} \approx 0.25$ value. Furthermore, the index computation overhead of Lin's algorithm is less than that of the proposed algorithm (2 versus 4 per butterfly). For these reasons, the experimental performance ratio values do not exceed 1.6. However, the relative performance of the proposed algorithm compared to Lin's algorithm is expected to be much higher on hypercubes with larger $t_{tr}/t_{calc}$ values. The relative performance is also expected to increase with increasing hypercube dimension, since Lin's algorithm introduces congestion during the last $d-2$ levels of the $d$ concurrent exchange communication phase due to the multi-hop messages during these levels. The relatively small efficiency values for small problems on large-dimensional hypercubes are due to the high communication latency ($t_{su} \gg t_{calc}$) of the iPSC/2 architecture.
Conclusion
The 
