We have developed a parallel, pipelined architecture for calculating the Fast Hartley Transform. Hardware implementation of the FHT introduces two challenges: retrograde indexing and data scaling. We propose a novel addressing scheme that permits the fast computation of FHT butterflies, and describe a hardware implementation of conditional block floating point scaling that reduces error due to data growth with little extra cost. Simulations reveal a processor capable of transforming a 1K-point sequence in 170 microseconds using a 15.4 MHz clock.
Introduction
The fast Hartley transform has attracted considerable research interest as an alternative to the FFT. A complete analogy exists between the two transforms. All the properties applicable to the FFT, such as the convolution and shifting theorems, apply to the FHT. In fact, the FHT is easily extracted from the real and imaginary components of the FFT (and vice versa). These properties have been well-documented in [SORE85] , [BRAC86] , [HOU87] , and [ONEI88] .
The computational advantage of the FHT lies in its use of real-valued data. As the vast majority of applications involve only real-valued data, significant performance gains are realized without much loss of generality. If one performs the complex FFT on an N-point realvalued sequence, the product is a complex sequence of 2N real points, N of which are redundant. 1 The FHT, on the other hand, will yield a real sequence of the same length with no redundant terms, using only half the time and half the memory resources.
Real-valued FFT algorithms have been derived [DUHA86, SORE87, MOU88] which compare in speed and memory requirements of the FHT, but they require the implementation of a different algorithm to compute the inverse real-valued FFT. This added complexity is undesirable, as algorithms for both the forward and inverse transform are preferred.
The FHT is obtained from the Discrete Hartley Transform (DHT) pair of a real function x(t), 
1 The complex FFT can be used to simultaneously compute the transform of 2 real-valued sequences. However, the two resulting transforms are scrambled, requiring a fair amount of post-processing to separate the two transforms.
The Challenges of Implementing the FHT
The two principal difficulties with implementing the FHT are retrograde indexing and data scaling. Retrograde indexing refers to the peculiar addressing pattern of the FHT, while scaling arises from our use of fixed-point arithmetic to calculate the FHT.
Retrograde Indexing
Addressing for the FHT exhibits a unique asymmetry. Figure 1 depicts the flow diagram for a decimation-in-time (DIT), radix-2, 16-point FHT [KWON86] . Note the regular structure of the FHT. It is identical to a radix-2 FFT except for the 'T' blocks, which implement retrograde indexing. Visualizing larger transforms is straightforward: to construct a 32-point transform flow graph, we duplicate the 16-point transform diagram below for stages 1-4 and add a fifth stage with T5 in its lower half 4 .
A. Erickson and B. Fagin X(0) X(1) X(2) X(3) X(4) X(5) X(6) X(7) X(8) X(9) X(10) X(11) X(12) X(13) X(14) X (15) x(0) x(8) x(4) x(12) x(2) x(10) x(6) x(14) x(1) x(9) x(5) x(13) x(3) x(11) x(7) For the first two stages, FHT butterflies are similar to FFT butterflies, involving only two terms and no multiplications. For stage 3 and after, however, an increasing number of butterflies involve non-trivial multiplications. As T3 and T4 show, these butterflies require three input points to produce two output points (heavy lines, stage 3), introducing computational asymmetry and an I/O ratio of 3:2. The I/O efficiency can be improved by noting that two of the three terms used for any one butterfly are shared by a second butterfly in the same iteration. If the two associated butterflies are processed simultaneously, only four terms (not six) would be required (heavy lines, stage 4). Thus if we calculate two FHT butterflies at once, we may use four input points to calculate four output points, giving an I/O ratio of 1:1 [SORE85] . The concurrent calculation of two related butterflies will be referred to as a dual radix-2 butterfly. The equations of a dual radix-2 butterfly are shown below:
A. Erickson and B. Fagin
where i = 1,...,log 2 N (stage counter), N s = 2 i (group size), f = 0,...,N s /4 -1 (butterflies within each group), and k = 0,N/N s ,...,(N/N s -1)*N/N s (base index for each group). Whenever f evaluates to zero, Equations (3) -(6) are reduced to one addition or subtraction. This is the case for all of stages 1 and 2, half the butterflies in stage 3, and so on, decreasing to only 1 butterfly in the last stage. All the other butterflies involve general multiplication with the twiddle factors and two additions or subtractions. The next section discusses the consequences these operations can have on data integrity.
Preventing Overflow
Additions and subtractions using fixed-point arithmetic can result in data overflow, a situation which yields erroneous results and must be avoided during FHT calculation. In butterflies requiring three terms, two addition operations are involved. Studying Equations (3) -(6), we observe that the condition which results in the largest data growth is when f/N s = 1/8 and the data are equal and at their maximum. Normalizing the maximum to be 1, the result of a worst-case computation is described by:
Because the result can grow by a factor of 2.4142, the data can experience up to log 2 2.4142 = 2 bits of overflow. To prevent this, the data can be shifted right by two bits before performing the butterfly calculations. In addition to reducing the data's dynamic range, this data scaling introduces truncation error if any of the discarded bits are non-zero. As the FHT calculations The technique used to scale the data can likewise degrade the signal-to-noise ratio. The easiest solution is to unconditionally shift the data right two bits at each stage. This would eliminate the possibility of overflow, at the cost of unnecessary shifting of data and a large loss of accuracy. Clearly other solutions are preferable.
Our processor implements a scaling technique known as conditional "block-floating-point"
(BFP), whereby the data are shifted only when an overflow occurs [WELC69] . Instead of providing an exponent for each data word, as in floating-point, the entire sequence is associated with a single exponent. Should an overflow occur during an iteration, the entire array is shifted right by one and the exponent incremented.
How and when the sequence is shifted can effect system performance. [WELC69] suggests that the sequence be shifted immediately after an overflow. In this method, the sequence may be traversed twice during an iteration for the sole purpose of overflow adjustment. [ANDE88] limits the size of the incoming data to be two bits less than full-precision. The data are then allowed to grow into these extra bits during an iteration, after which an explicit subroutine is called to shift the sequence by the maximum amount of overflow that occurred. These schemes are not optimal because a significant portion of the FHT calculation is dedicated to adjusting the sequence. Our implementation improves upon existing methods by shifting the data as it arrives in the computational element during the next iteration, thereby precluding any time penalties. We discuss this in detail in a later section.
The situation resulting in the maximum possible bit growth arises when the input data are equal and at full-scale; that is, x(n) = [2 B-1 -1,2 B-1 -1,. 
This result is intuitive, as a constant time-domain signal concentrates all its power into the DC component of its transform. Equation 7 shows that the data can grow by a factor of N, or log 2 N bits, so when x is at full-scale, log 2 N shifts are necessary to prevent overflow.
We note that the inverse transform would likewise produce log 2 N bits of overflow given the same maximum DC input sequence. However, the inverse transform requires that we divide by N, equivalent to subtracting log 2 N from the data's common exponent. From equation 8, we see that the data, overall, does not experience any growth, although it experiences maximum data scaling. For a detailed analysis of this and other examples showing the effects of data scaling, the reader is referred to [ERIC90] .
Knowing that a sequence can grow by one bit during the first two stages and by two bits in all subsequent stages, unconditional fixed-point scaling techniques would shift the data a total of 2log 2 N -2 bits, almost twice the maximum possible bit-growth. For this reason, blockfloating-point schemes can generally be expected to perform better than unconditional schemes.
Detailed error analyses for the fixed-point scaling techniques discussed here can be found in 
Architecture of the FHT Processor
This section describes an architecture capable of computing the FHT very efficiently. For 1024-point transforms, 98 percent of the execution time is spent performing FHT calculations.
We first discuss the system architecture, and then present solutions to the aforementioned implementation challenges.
System Architecture
The processor architecture embodies a number of techniques designed to optimize system performance and functionality. The butterfly computational unit (BCU) utilizes a 5-stage pipeline, keeping the critical path as short as possible. Additionally, the processor incorporates tightly-coupled, high speed buffer memories. Organized to hold up to 1024 16-bit words, each memory is arranged to allow simultaneous, single-cycle access to four 16-bit data points.
Employing several of these memories allows for many such concurrent data transfers, boosting the data I/O rate significantly.
The processor can be programmed to transform sequence lengths of any power of two from 4 to 1024 points, thereby overcoming an inflexibility found in some dedicated hardware processors. Larger transforms are a simple extension of the processor's address generation unit and an increase in the depth of the memories.
A block diagram of the system is given in Figure 2 We have mentioned previously the two primary challenges to implementing the FHT: retrograde indexing and scaling. We now present our solutions to these problems.
Algorithm Control and Butterfly Computation
Because parallel access to all four elements of a dual radix-2 butterfly is desired, the processor employs an algorithm different from the one originally proposed in [BRAC84] . Given a sequence of length 2 v , v ∈ [2,3,...,10], the sequence is broken into 4 equal segments and arranged as depicted in Figure 3 . Each column in the array is a separate memory bank, where the first 1/4th of the sequence is loaded in the first bank, the second 1/4th in the second, and so on. Thus, a 16-point sequence would be stored as shown in As Figure 4 shows, butterfly calculation involves multipliers, adders and subtracters.
Disposing of most of the subscripts and letting m denote the current stage number, Equations (3) -(6) simplify to:
These equations are similar to those used to calculate just one radix-2 complex FFT butterfly .
A. Erickson and B. Fagin 
Figure 4 BCU Block Diagram
Although a sequence is stored in four separate memory banks, writing the results to the destination memory does not require the generation of four separate addresses. The BCU result data are written to the destination memory in pairs, requiring only two addresses, A and B (see Figure 4 ). Data must be written in this way to insure simultaneous access to four elements using a single source address during the next iteration.
Address Generation and Control
The correct computation of the FHT using the datapath of Table 2 shows the butterfly results at the outputs of the second tier of adders is given, followed by the data after output reordering. Symbols used are similar to those in Table 1 . We now turn our discussion to fixed-point versus floating-point arithmetic, pointing out the consequences of choosing one over the other.
An Example

Fixed-Point versus Floating-Point
Implementing a 16-bit fixed-point processor has important advantages over floating-point implementations. Standard, single-precision floating-point arithmetic operates on 32-bit data.
This increased data width would have an enormous impact on the amount of hardware required to implement our processor. Memory width would double, as would all hardware A. Erickson and B. Fagin used to compute the double radix-2 butterfly. Furthermore, to preclude any performance hits, additional pipeline hardware would be required to offset the higher latency of floating-point operations.
Floating-point hardware itself is relatively expensive. Not only is floating-point hardware functionally more complex than integer hardware, but, because of the larger data paths, the hardware must be housed in costly pin-grid-array packages. For these reasons, we felt that a fixed-point processor was better suited for our purposes.
In addition to aliasing, quantization, leakage and other sources of error inherent to discrete transform calculations, the fixed-point data will experience both roundoff and truncation error.
Roundoff error is introduced at the multipliers, where only the 16 most-significant bits of the 32-bit product are used. In [ZAKH87] , a theoretical analysis of fixed-point and floating point roundoff error is developed. The theoretical results are supported by subsequent experiments on white-noise input sequences. Given a zero-mean, white-noise, fixed-point input sequence, Zakhor found that the noise-to-signal ratio increases by about 1.1 bits per stage. This relatively high error can be partially attributed to her limiting the dynamic range of the input sequence in order to insure against overflow.
Truncation error occurs after an adder overflow, where the least-significant bit is shifted out. 5 We attempt to minimize these effects by implementing block floating point scaling.
Block-Floating-Point Scaling
Among the components needed to implement BFP scaling are overflow-detection logic, a maximum-overflow register (MAXOVL) reflecting the largest growth (0, 1, or 2 bits) of data that occurred during the current stage, a scale register (SCALE), reflecting the largest growth of data that occurred during the previous stage, and the common exponent register (EXPO). Figure 6 illustrates their interconnection.
A. Erickson and B. Fagin shift is made to ensure that two guard bits are present for the third stage. A special case arises when none of the data occupies more than 14 bits of precision after the second stage. In this case, the two required guard bits are already present, and the unconditional shift does not take place.
Scale control at the end of the third and subsequent stages ensures that the data starts with two guard bits. This proceeds with no special cases until the last stage, where the data need not A. Erickson and B. Fagin be adjusted after an overflow. Here, the data are allowed to grow to the full 16 bits of precision and are written to the output buffer memory unmodified. 6 The exponent register will reflect the total amount of shifts the data underwent.
Everything said about the forward FHT applies equally as well to the inverse FHT; the only difference is that the exponent will be log 2 N greater than the true value. A simple subtraction takes care of this. We outline here a procedure similar to that used in [WELC69] and [OPPE71] for empirically determining the output signal-to-noise ratio for an FHT using conditional BFP scaling. We begin with an input sequence, x(n), and compute its FHT using two programs. The first program uses double-precision floating point in its FHT computations, which we take as exact. The second program emulates our FHT processor scale control hardware. Using fixed-point arithmetic, it will produce an output sequence equal to that of the first program plus an error sequence. The signal-to-noise ratio is determined by taking the ratio of the root-mean-square value of the floating-point sequence to that of the error sequence, where the RMS values are given by
Equations (13) and (14). Note that the sequences' mean values must be subtracted from each of the terms before being squared.
The results can be used to predict the SNR for several classes of input sequences and be compared to theoretical and experimental results for unconditional fixed-point implementations in [OPPE88] and [WELC69] . An example calculation of the SNR for a maximum-amplitude, truncated cosine wave is provided in the Appendix.
Simulation of the Architecture and Performance Estimates
The two primary tools used to develop the FHT processor were the Workview ® design system, for schematic capture and simulation, and the XACT CAD package for gate array layout. Workview is an integrated schematic capture and simulation package; our version ran on a Sun 3/60 platform. XACT is a field programmable gate array die editor which gives the designer complete control over logic placement and routing resources. We have shown in
A. Erickson and B. Fagin [FAGI90] that user control at this level was essential to meeting our design spec of a 65ns clock period, corresponding to a 15.4 MHz system clock. For a more thorough description of the processor hardware, the reader is referred to [FAGI90] and [ERIC90] .
Performance of the FHT Processor
Simulation has verified that we meet our system spec of a 65ns clock period. We may therefore calculate the performance of the FHT processor in the following way. There are µS, allowing one to perform real-time signal processing using a 6.04 MHz input sampling rate. We caution against an unduly enthusiastic interpretation of Table 4 . The reported time for the FHT processor is based on simulation, while other results are from functioning chips.
Additionally, the FHT processor is a special purpose device, and should be expected to perform well on the chosen application when compared with more general purpose devices.
Nonetheless, even when these factors are taken into account, we believe the performance difference to be significant.
Conclusions and Future Work
We have described a high-performance FHT architecture, discussing our solutions to the principal computational challenges of the FHT. Simulations reveal a processor capable of computing a real-valued, 1K-point transform in 170µS. The processor is among the fastest transform processors to date and is the only dedicated FHT hardware processor known to the authors.
The architecture has been fully simulated, and has up to this point met system specifications. The next step will include PCB fabrication and prototyping. While problems inevitably arise when constructing an experimental system, our previous experiences with the 
Appendix: Error Analysis
The following pages give simulated error analysis results for a 32-point maximum amplitude truncated cosine wave. We employ two programs: program A, a double-precision FHT algorithm, and program B, the FHT processor hardware emulator. Table A .2 lists the output of Program B using the same input sequence. Note the order in which the output sequence must be read and the value of the exponent register.
Assuming that the double-precision results of Program A are exact, we can calculate the signal-to-noise ratio of the processor's fixed-point output for a maximum-amplitude, truncated cosine wave. Following the procedure outlined in previous sections, we find that the SNR of our output sequence is 71 dB. Given that the data began with 14 bits of precision (excluding the sign bit), this corresponds to an average decrease in SNR of 0.44 bits per stage.
