is used for input data distribution, and one tile is used for output data collection. We also report on the performance of the SDR based upon the FFT experiments.
INTRODUCTION

A. Problem Statement
Software defined radios (SDR) are radio receivers or transmitters that achieve their function through a program running on a microprocessor, digital signal processor, or a field-programmable gate array (FPGA) that is connected to an antenna via an analog-to-digital converter (receive) or a digital-to-analog converter (transmit). SDRs are ideal for space applications because of the ability to reprogram them remotely. In either the transmit or the receive mode, digital signal processing is vital to the development of the requisite functionality.
In the project reported in this paper, we investigate a receiver-type SDR application of a newly-developed high performance processor, the Maestro 49-tile Radiation-Hard by-Design chip that was developed to demonstrate the application of space-qualified, multicore hardware. We have investigated the implementation of a single-precision floating-point pipeline FFT to be used as part of a SDR receiver application. The details of the software architecture that can adapt to the use of different numbers of tiles and the performance of the N-point FFTs for N=128, 512, 1024, and 2048 are described. The maximum throughput achieved for a 2048-point FFT is 27 million samples per second when 20 of the 49 available tiles are used for separate FFT blocks, one tile is used for input data distribution, and one tile is used for output data collection. We also report on the expected performance of the example SDR based on the FFT experiment results.
U.S. Government work not protected by U.S. copyright 978-1-4799-5380-6/15/$31.00
B. Background
Radiation in space poses a considerable threat to modern microelectronic devices, in particular to the high-performance, low-cost computing capability we enjoy on Earth. These threats apply as well to the processors one would employ implementing SDRs in space. These effects can be categorized as long-term permanent faults called total dose effects and transient temporary effects called single event upsets (SEU) [1].
Total dose effects must be mitigated by semiconductor manufacturing process modification, or by selecting and testing parts to meet the total dose requirements of the planned mission. Single event upsets are more difficult to prevent in modern, high-speed, small-feature-size devices. So, while total-dose radiation-tolerant modern processors and FPGAs are available, most modern current-generation processors are very susceptible to SEUs [1].
Two notable exceptions to this generalization have emerged recently. The U.S. Government has been sponsoring the OPERA project through which the Boeing Corp. has produced a Radiation-Hard by Design (RHBD) 49-core (tile) multiprocessor chip (MAESTRO) based on the architecture of the Tilera Corp. 64-tile chip [2] . This chip is being made available to U.S. Government space computing applications. It has been demonstrated at a U.S. Government sponsored Industry Day 9/29-9/30/2010 [3] . The other is a project sponsored by the Air Force Research Lab for Xilinx Corp. to develop a RHBD version of its Virtex-5 Field Programmable Gate Array (FPGA) chip [4] .
It has long been understood that replication of logic with voting circuitry can be used to improve the reliability of digital systems in the presence of transient errors in the logic, such as SEUs [1] [5] . We at the Naval Postgraduate School (NPS) have been engaged in a project to build an evaluation board for a Triple Modular Redundant (TMR) implementation of a RISC processor to validate the TMR architecture for employment in a high-SEU environment. This evaluation board has evolved to a dual-FPGA processor called the Configurable Fault-Tolerant Processor (CFTP). The research has led us to the conclusion that the TMR architecture is an effective one to enhance the resistance of a processor to SEUs so that the computer could operate reliably in the hostile environment of low earth orbit [5] .
The NPS is conducting research and education programs in SDR, including thesis research in SDR design of transceivers for IEEE 802.11 wireless LANs, IEEE 802.16 wireless MANs, and IS-95B and cdma2000 mobile telephony, and the course EC4530 Soft Radio. This work includes software defined radios consistent with the Software Communications Architecture (SCA), microprocessor-based SDRs, and most recently FPGA-based SDR design. The Naval Postgraduate School's Communications Research Laboratory is equipped for SDR design with eight software defined radio design stations including programming design environments, RFFEs, and microprocessor and FPGA modules.
SDRs are a natural fit for satellite applications because they can be changed via reprogramming after launch, thereby allowing new functionality and/or design improvement at any time in the spacecraft's lifecycle. It is expected this will make the satellite more useful over its lifespan, including more operationally responsive. Furthermore, a single SDR can receive multiple dissimilar communications signals simultaneously and be reconfigured to receive different signals at different times, for example, different signals over different areas of the world.
The Naval Postgraduate School is currently at work on a project to design the software for a fault-tolerant SDR suitable for hardware (FPGAs) already on orbit. The proposed SDR will process pre-demodulated signals in order to compress the signals for potential passing to the downlink. It is presumed that the downlink does not have sufficient bandwidth to pass the entire pre-demodulated signal. The compression algorithm will be configurable by ground operators, who will set signal power thresholds for frequency ranges and time durations of interest. The compression will be accomplished by passing only those frequency ranges and time durations of the signal that exceed the relevant power threshold. The basic SDR design has been proven by Wright [6] and further refined by Livingston [7]. The FPGA configuration is being made fault tolerant by applying the methods learned in this research program and will be tested on the Algorithmic WorkStation (AWS) prior to being tested on an on-orbit FPGA.
A key component of this SDR is a high-speed pipeline Fast Fourier Transform (FFT) unit. We have had an earlier research effort on the realization of high-speed, pipelined FFTs. It developed the architecture for a high-speed pipelined signal processor for the computation of the Cyclic Spectrum [8], of which the principal component is an FFT processor.
A recent thesis has developed the realization of a Radix-4 64-point real-time FFT implemented and simulated in a Virtex-II FPGA [9] . This design was implemented with both TMR and RPR fault-amelioration techniques and showed a modest improvement in resource utilization of the RPR technique over TMR. Unfortunately, the fault-tolerant FFTs were not available in time for testing in the UC Davis cyclotron.
The NPS investigators also have experience with the multi-core processor architecture that is exploited in the Boeing-developed MAESTRO chip. We have investigated ways to enhance the designed-in RHBD technology of the chip [10]. We have looked into ways to utilize the multi-core architecture for the implementation of SDRs and some particular SIGINT algorithms.
The NPS investigators have also had some hands-on experience with the Tilera processor on which the MAESTRO is based. At the time we investigated how we might implement highly reliable, high-speed implementations of encryption and hashing algorithms utilizing the pipelined architecture available on the Tilera. We investigated how to take advantage of the allocations of specific portions of the chip to specific functions, i.e., how the chip design supports physical redundancy and where there might be potential single points of failure. We compared this architecture to the Cell Broadband Engine, a different multi-core approach. Although we did not do a complete implementation of the encryption algorithms on the Tilera, our analysis indicated that its speed for hashing and encryption would be roughly comparable to that of the Cell, with possibly greater resistance to hardware failure.
Based on our experience with the implementation of high-speed pipelined processors and the design of high-performance reliable processors for the space environment, we have studied the use of the Maestro RHBD multi-core processor for the implementation of a SDR to perform data compression on broad-band pre-demodulated signals. The multi-core FFT implementation would also be applicable to SDRs which transmit or receive OFDM or OFDMA signals.
PIPELINED SDR ARCHITECTURE
A. Specific Pre-Detection Data-Compression SDR
The basic SDR that was the motivation and implementation target for the research was developed in master's theses by Livingston [7], Wright [6], and Humberd [11]. The radio would monitor a band of the RF spectrum of bandwidth B. It would convert that band-limited portion of the RF to digital samples at a sample rate, $f_s$, such that $f_s > 2B$, the Nyquist rate. Then, the SDR computes N-point FFTs of each successive N-point block of input data samples. For each N-point block of complex frequency data, the magnitudes of all the positive frequency components are computed. Then, the signal energy is calculated in each band of interest using the FFT amplitude data. For each band that has signal energy exceeding the threshold, the corresponding complex-frequency values are reported.
Figure 2 shows the block diagram of the computational processes that are required to implement the SDR. It is desired that these processes be implemented in real time with sample rates in the tens of megahertz. Two basic ways to accomplish this goal would be:
1. By use of a FPGA or Application-Specific Integrated Circuit (ASIC) implementing pipeline versions of the major processor sub-blocks of Figure 2.
2. By use of a multi-processor to compute separate N-point blocks of selected frequency components in parallel.
The use of parallel processors to do the computation relies on the fact that each N-point FFT is independent of the others. So, as long as the blocks are time stamped, the computation of Figure 2 can be carried out in a different processor for each block, with the selected frequency-component blocks and their time stamps reassembled at the output. This latter approach is the one that would be suitable for implementation of the SDR on a multi-tile processor such as Maestro. If this SDR were placed in a satellite, then the selected frequency components would be downlinked with a block identifier. Ground processing can reconstruct the significant components of the signal by performing the inverse FFT on the selected frequency components, block-by-block.
The frequency resolution of the SDR is determined by N, the block size of the FFT, so the potential compression ratio will be N/k for blocks with k FFT indices with power greater than the threshold. For blocks with less signal power, no frequency components are downlinked, achieving an infinite compression ratio, although null blocks should probably have their time stamp downlinked.
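The selection step and the N/k compression ratio described above can be illustrated with a small sketch. This is not the thesis code of [6], [7], or [11]; it is a minimal Python illustration under several assumptions: a direct DFT stands in for the FFT to keep the example dependency-free, the threshold is applied per frequency index rather than per band, and the names `select_components` and `compression_ratio` are hypothetical.

```python
import cmath
import math

def dft(block):
    """Direct O(N^2) DFT; a stand-in for the FFT, used only to avoid dependencies."""
    n = len(block)
    return [sum(block[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def select_components(block, threshold):
    """Return (index, complex value) pairs for bins 0..N/2-1 whose power exceeds threshold."""
    spectrum = dft(block)
    half = len(block) // 2
    return [(k, spectrum[k]) for k in range(half) if abs(spectrum[k]) ** 2 > threshold]

def compression_ratio(n, k):
    """N/k as in the text; None stands in for the 'infinite' ratio of a null block."""
    return None if k == 0 else n / k

# Assumed test signal: one real tone centered exactly on bin 5 of a 64-point block.
N = 64
tone_bin = 5
block = [math.cos(2 * math.pi * tone_bin * t / N) for t in range(N)]
selected = select_components(block, threshold=1.0)
ratio = compression_ratio(N, len(selected))
```

With a single tone above the threshold, only one of the 64 indices is kept, so the block compresses by a factor of 64. A block with no index above the threshold yields no components at all, which is the null-block case for which only the time stamp would be sent.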
B. Maestro-Based FFT Experiments
The basic Maestro multi-tile architecture is shown in Figure 3. The architecture of Maestro uses the intellectual property of the Tilera Corporation for its 64-core commercial architecture. This architecture was purchased by the U.S. Government for royalty-free use by the Government in space applications. The Boeing Corporation was contracted by the Government to produce a 49-core RHBD chip, incorporating the basic Tilera architecture and adding an IEEE-standard floating-point co-processor to each core. Tilera refers to its cores as tiles, so we will use the term tile, which corresponds to the usage in Tilera documentation.
1. Maestro SDR Architecture-Figure 4 shows the basic architecture of the planned multi-core architecture of the Maestro program to compute the real-time spectrum of the incoming sampled data stream, select the components whose power exceeds a given threshold, and then output the time index of the N-sample block and the frequency indices and complex magnitudes of the spectrum.
The first tile in the process, the source tile, converts each 12-bit sample into a 32-bit IEEE standard floating-point number and places the samples into p successive N-word buffers. As each N-word buffer is filled, the source signals the associated FFT-select tile, "Ready to Send a Block." Each of those p tiles performs the following operations:
• It waits for that "ready-to-send" signal and, when received, initiates a direct-memory-access (DMA) transfer of the block of data with time stamp into the empty half of the input ping-pong buffer.
• It checks if the output ping-pong buffer is available and, if it is, it
o Computes the FFT of the full half of the input ping-pong buffer;
o Tests the magnitude of the power in each positive frequency component of the spectrum;
o Loads the time stamp, number of components exceeding threshold, frequency indices and their complex amplitudes into the empty half of the output ping-pong buffer;
o Signals the output tile that the output is available.
The sink tile then performs the following operations:
• It waits for a "ready to send" signal from any of the p FFT-compute tiles;
• It sends that data off chip.
Programming of the Maestro chip to exploit the parallelism displayed in Figure 4 is a very difficult process. The programmer must explicitly manage the data transfer (DMA operations) between tiles, manage the ping-pong data buffering, and provide computationally efficient processes to compute the FFTs and test for the significant frequency components. Because of this complexity, it was decided to first implement a simplified parallel algorithm to develop the techniques for distributing the data and to exploit powerful FFT algorithms developed by others.
FFT Program Used for FFT Tiles-The FFT program used in the tests reported here is a version of the FFTW ("Fastest Fourier Transform in the West") algorithm reported by Singh, et al., of University of Southern California's Information Sciences Institute (ISI) [13]. In that paper, the authors describe their adaptation of the FFTW algorithm to the Maestro chip and their simulation studies of the performance with various sizes of FFT on Maestro; see Table 1.
We analyzed the results from [13] to obtain a digital sample rate or throughput for a multi-tile FFTW implementation on Maestro. Based upon a number of real FLOPs per FFT of $N\log_2 N$, we estimate a single-tile sample rate achievable by their FFTW, also shown in Table 1.
Ignoring data distribution overhead in the operation of p FFT tiles in parallel operating on different blocks, the sample rate should scale linearly with p, as shown in the final row of Table 1. This final estimate gives us an upper bound on the sample rate achievable from a p-tile parallel pipeline implementation of the ISI FFTW in accordance with the structure shown in Figure 5.
This upper bound suggests that a 20-tile pipelined FFT could achieve real-time FFT operating at a sample rate of 52 Mega-samples per second or less.
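The linear-scaling argument can be checked with a few lines of arithmetic. The sketch below is illustrative only: the per-tile rate of 2.6 million samples per second is simply the 52 Mega-sample bound divided by 20 tiles, the 27 million samples per second figure is the measured 20-tile throughput quoted earlier in this paper, and the function names are hypothetical.

```python
import math

def pipeline_upper_bound(per_tile_rate, p):
    """Ideal throughput of p independent FFT tiles, ignoring data distribution overhead."""
    return per_tile_rate * p

def tiles_needed(target_rate, per_tile_rate):
    """Smallest tile count whose ideal throughput meets the target sample rate."""
    return math.ceil(target_rate / per_tile_rate)

def block_time_us(n, per_tile_rate):
    """Time one tile may spend on an N-sample block at its sustained rate, in microseconds."""
    return 1e6 * n / per_tile_rate

PER_TILE_RATE = 52e6 / 20          # samples/s per tile implied by the 20-tile bound
bound_20 = pipeline_upper_bound(PER_TILE_RATE, 20)
efficiency = 27e6 / bound_20       # measured 20-tile rate relative to the ideal bound
min_tiles_for_27 = tiles_needed(27e6, PER_TILE_RATE)
t_block = block_time_us(2048, PER_TILE_RATE)
```

Under these assumptions the measured throughput is a little over half of the ideal bound, and an overhead-free pipeline would need only 11 tiles to reach the same rate. The gap is a rough measure of the cost of data distribution that the bound ignores.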
In the next section, we discuss the details of our experiment to obtain a real-time pipeline FFT and to verify the operation of our pipeline SDR architecture, and then present the results of the performance experiments.
[Figure: source distribution tile with p x N-word ping-pong data buffers; input at $f_s$ samples/sec]
MAESTRO-BASED FFT EXPERIMENTS
A. Pipeline FFT Implementations
The process illustrated in Figure 4 that the FFT-select tile performs has two basic components, the calculation of an N-point floating-point FFT and the selection of the frequency components to downlink. The FFT has a computational requirement of $C_{FFT} = 5N\log_2 N$ real floating-point operations, whereas the selection portion of the algorithm has $C_s = 3\frac{N}{2}$ flops plus $\frac{N}{2}$ integer comparisons. (See Section D for further discussion of these complexity figures.) The dominant computational requirement comes from the FFT, and hence it was decided that the most important process to implement would be the multi-tile pipeline FFT. The structure of that pipeline real-time FFT process is shown in Figure 5. This is very similar to the pipeline SDR architecture shown in Figure 4; the selection portion of each tile's process has been removed.
One of the reviewers of this paper suggested that the work of Mighell, "Benchmarking the CRBLASTER Computational Framework on the 350-MHz 49-core Maestro Development Board" [14], might have applicability to our work. A recent review of that paper shows that the authors are using a "higher level" interface (MPI) to program the multi-core architecture of the Maestro than we did. They seem not to be as interested in the location of the processes on the tiles, nor are they concerned with optimizing the inter-process communications; they are using MPI to handle that. Using the MPI multi-processor software layer allows them to do easy "ports" to other hardware. In our work, fixing processes to tiles and directly managing the inter-tile communications was very important to reducing data transmission overhead.
The operation of the pipeline FFT architecture is very similar to that of the SDR architecture; only the selection process has been eliminated. The first tile in the process, the source tile, converts each 12-bit sample into a 32-bit IEEE standard floating-point number and places the samples into p successive N-word buffers. As each N-word buffer is filled, the source signals the associated FFT-select tile, "Ready to Send a Block." We used the system interfaces to the underlying tile-to-tile communications functions provided in the MAESTRO "ilib" library and the "tmc" library functions to allocate memory shared among processes. Documentation for these libraries is distributed as part of the MAESTRO development environment [2].
Note that each FFT tile is executing a separate Unix process with its own "memory address space." Along with a significant amount of bookkeeping and error detection, each of these processes does the following:
• Receive parameters from the source process. This includes the address of the shared memory buffers used to transmit blocks from the sender to the receiver. We use the ilib_msg_broadcast library call to receive this message.
• Allocate "ping-pong" buffers using tmc_cmem_memalign to share with the "data collection" process.
• Send a message to the data collection process via the ilib_send_msg call with the address of the shared memory.
• Receive a message from the "source" via the ilib_receive_msg call that a block is ready to be collected.
• Copy via Direct Memory Access (DMA) the first source block from the source using the ilib_mem_start_dma call and wait via the ilib_wait call for the DMA to complete the copy. The ilib_mem_start_dma call sets up internal structures on the two associated tiles and uses separate mechanisms for the copy to take place. The CPU is not directly involved in the copy process and can do other computations while the copying is going on. When the copy is complete, internal registers are set, and the CPU waits for that to happen via the ilib_wait call.
• Start loop p - 1 times (p, the number of parallel FFT tiles, is passed from the source process).
o Receive a message from the source that another source buffer is ready and then start a DMA copy into the second buffer, but do not wait.
o Process the FFT on the first buffer while the DMA copy is taking place, using the fftwf_execute_dft_r2c call.
o Send a message to the data collection sink that this buffer is ready using ilib_send_msg.
o Using ilib_wait, wait for the DMA started above to complete.
• Process the last FFT block and let the data collection sink know.
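The overlap of transfer and computation in the steps above can be sketched in a hardware-independent way. The following Python model is not the Maestro code and does not use the ilib or FFTW interfaces: a helper thread stands in for the asynchronous DMA copy, `join()` plays the role of the wait call, and `start_dma`, `run_fft_tile`, and the `compute` callback are hypothetical names chosen for illustration.

```python
import threading

def start_dma(src, dst):
    """Stand-in for an asynchronous DMA: copy src into dst on a helper thread."""
    def copy():
        dst[:] = src
    worker = threading.Thread(target=copy)
    worker.start()
    return worker  # worker.join() plays the role of waiting for DMA completion

def run_fft_tile(blocks, compute):
    """Process blocks with two alternating buffers, overlapping copy and compute."""
    n = len(blocks[0])
    buffers = [[0.0] * n, [0.0] * n]           # the two halves of the ping-pong buffer
    results = []
    start_dma(blocks[0], buffers[0]).join()    # first block: start the copy and wait
    for i in range(1, len(blocks)):
        full = buffers[(i - 1) % 2]            # half holding the block already received
        empty = buffers[i % 2]                 # half free to receive the next block
        pending = start_dma(blocks[i], empty)  # start the next copy, but do not wait
        results.append(compute(full))          # compute while the copy is in flight
        pending.join()                         # wait for the copy before swapping halves
    results.append(compute(buffers[(len(blocks) - 1) % 2]))  # process the last block
    return results

# Usage with a trivial stand-in for the FFT: sum each block.
demo_blocks = [[1.0] * 4, [2.0] * 4, [3.0] * 4]
demo_results = run_fft_tile(demo_blocks, sum)
```

The design point the sketch tries to capture is that the computation never reads the half that a copy is writing, so no locking is needed beyond the single wait at the end of each iteration.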
The source and sink processes operate in a similar fashion using the same calls. There is a fourth process that starts each of the source, FFTW, and sink tile processes and establishes which specific tiles each runs on. This process "spawns" these by filling in a set of parameters that describe the process to be run (via its file name), the number of instances of the process, and the location of the instances, and passing this parameter set to the ilib_proc_spawn library call.
Note that the ordering of the N-sample complex-frequency amplitude blocks is maintained by the inclusion of the time stamp. This will permit re-creation of the band-limited sampled signal by simply taking the inverse FFT of each block.
Next, we implemented the structure of Figure 5 to experimentally measure the sample rate of the parallel FFTW tiles. The C code for the FFTW was obtained from the Maestro source code distribution web site [14]. The single-precision FFTW could only be compiled without optimization. A version of a single tile's code was compiled for each value of block size (N) tested. The code used had a separately compiled "wisdom" file, used for FFTW internal optimization, for each block size tested, so that the FFTW code would not spend time setting up its configuration. The binary code for each tile's FFTW, including the ping-pong buffers, is approximately eight MBytes. In the object code generated, each floating-point instruction appears to be padded by 4-5 no-ops. The reason for this apparently has to do with the communications between the floating-point co-processor and the main CPU. Each floating-point instruction takes more time than the completion of the message between the two entities.
1. Verification of the Correctness of the FFTW-The single-tile FFTW compiled code was tested for functional correctness for values of N that would be used in the pipeline multi-tile performance tests, namely for $N \in \{128, 256, 512, 1024, 2048\}$. For each value of N, a number of random data blocks were generated and submitted to the compiled FFTW code. Those results were compared to the results of Matlab® FFTs computed on the same random data blocks but using double precision. The results agreed within the least significant mantissa bit of our single-precision output. As a result, we had confidence that the compiled FFTW code was functionally correct and that the performance data would be for a functionally correct FFTW.
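A comparison of this kind can be sketched as follows. The sketch is only an analogue of the procedure described above, not the test harness we used: a textbook radix-2 FFT stands in for the compiled FFTW, a direct double-precision DFT stands in for the Matlab reference, single precision is imitated by rounding only the final outputs, and the tolerance of one part in $2^{23}$ of the largest spectral magnitude is an assumed stand-in for "the least significant mantissa bit." All function names are hypothetical.

```python
import cmath
import math
import random
import struct

def to_single(x):
    """Round a Python float (double) to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def reference_dft(block):
    """Double-precision direct DFT used as the trusted reference."""
    n = len(block)
    return [sum(block[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def fft_radix2(block):
    """Recursive radix-2 decimation-in-time FFT; len(block) must be a power of two."""
    n = len(block)
    if n == 1:
        return [complex(block[0])]
    even = fft_radix2(block[0::2])
    odd = fft_radix2(block[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddled = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddled
        out[k + n // 2] = even[k] - twiddled
    return out

def block_agrees(n, rng):
    """Compare single-rounded FFT output with the reference on one random block."""
    block = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ref = reference_dft(block)
    got = [complex(to_single(z.real), to_single(z.imag)) for z in fft_radix2(block)]
    scale = max(abs(z) for z in ref)
    worst = max(abs(g - r) for g, r in zip(got, ref))
    return worst <= scale * 2.0 ** -23

rng = random.Random(0)  # fixed seed so the random blocks are reproducible
agreement = {n: all(block_agrees(n, rng) for _ in range(3)) for n in (128, 256)}
```

The larger block sizes are omitted here only because the direct reference DFT grows as N squared; the same check applies to them unchanged.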
B. FFT Performance Measuring Experiments
The experiments were conducted on the Maestro Development Board (MDB), loaned to the Naval Postgraduate School by the U.S. Government. The board was operating at a clock frequency of 350 MHz.
The software to implement the architecture of Figure 5 was created, compiled, and loaded on the MDB, and 100,000 N-word blocks were submitted to the various programs. The software structure is explained in Appendix A. The average
[Figure: FFT Rates versus Number of Tiles, NPS Experimental Results; horizontal axis: Number of Tiles. Table 2.]
The results from the experiment are shown in Figure 6. That figure plots the curves of pipeline sample rate, or throughput, in millions of samples per second versus number of FFT tiles for each of the five values of N.
Discussion:
• At lower block sizes, 256 and 512, it appears that at some point the cost of setting up the DMA (ilib_mem_start_dma) and processing the wait for termination (ilib_wait) dominates the processing time, so that even though the number of tiles increases, the cost of the DMA overwhelms the potential benefit of the additional FFT tiles. DMA setup and wait cost is most likely independent of the size of the data being moved.
• At larger block sizes we see some increase in performance with increasing number of tiles beyond the 7 or 8 seen with 256 and 512 size blocks, but in these cases it appears that eventually we are seeing contention on the various internal buses. Part of the reason we think so is the variability that we are seeing. This may also be influenced by the arrangements we used for the tiles. With Figure 3 as a reference, the tiles are labeled (x,y) where $0 \le x, y \le 6$. The source tile was placed at (1,3) and the sink tile at (1,4). The up to 20 FFT tiles were placed in the columns starting at (2,0), with the tiles used in a block of height 7 and width 3, through tile (4,6). We let the library (OS) determine which tiles within a block were used for a particular count of FFT tiles. We did not experiment by setting up different arrangements of tiles.
Earlier, we discussed the results from the ISI paper [13] that appear to set an upper bound on the performance of the pipelined multi-tile implementation of the FFTW. The ISI projected performance for the two FFT sizes that coincide with sizes used in the NPS experiments is compared to the NPS results for the same FFT sizes in Figure 7.
For N = 256, the NPS experimental results significantly underperform the ISI projections. We believe this to be caused by the DMA overhead. For N = 2048 and for low tile numbers, the NPS performance is close to the upper bound until it reaches the knee of the NPS curve, and then internal bus contention appears to take over.
D. Application of results to SDR performance
Because of the difficulty of programming the Maestro Development Board (MDB), experimental verification of the SDR performance was not obtained. Nevertheless, we can make reasonable predictions of the SDR performance from our FFT experiments.
Referring back to the basic pipeline architecture of Figure 4
[Figure: Sample Rates versus Number of Tiles for 2 FFT Sizes]
Consequently, when the number of tiles in Figure 6 exceeds ten (the knee of those curves) and the performance is limited by DMA performance and bus contention, it is expected that the throughput for the full SDR implementation will be nearly the same as for the FFT alone. Furthermore, since the SDR is basically a data compression process, the output data rate should be much less than N 32-bit words per block, allowing a further modest increase in throughput.
CONCLUSIONS
A. Summary
• Pipeline throughputs for these FFTs were achieved of up to 25 million 32-bit samples per second.
• Addition of the rest of the SDR code to each FFT tile should not decrease performance for number of tiles p > 10 and for N > 512.
• Larger block sizes operate more efficiently.
• The pipeline architecture was successfully demonstrated.
• Programming a single application to exploit the parallelism of a multi-tile processor like Maestro is very difficult.
o Because the caches are relatively small, main memory access is relatively expensive, and inter-tile communication is not especially fast, one has to take care in explicitly managing memory and inter-tile communications. The tools to do this are available, to some extent, but take some understanding. In our case, we are unsure whether the "main loop" of the FFT algorithm fit into the cache. We would need to do more evaluation to determine this.
o We depended on the compiler to take advantage of the potential built-in instruction parallelism. Except in a couple of cases, we were unable to exploit the very long instruction word parallelism directly ourselves.
o The system provides a set of development tools and libraries. The current compiler has some problems, especially handling the optimization of single-precision floating-point arithmetic. Although the libraries are documented and there are tools for evaluating and optimizing code, understanding when to use which features of the system takes some experience. In addition, it is unclear how the use of the various features might interact with each other.
B. Recommendations for Future Study
The ISI paper [13] and Table 1 suggest that an upper bound on throughput of $2.6 \times 10^6\,p$ samples per second might be achieved with p tiles each computing a 2048-point FFTW running under the pipeline architecture demonstrated in this study. This would mean that a 31-tile realization of the full SDR would potentially support a throughput of 80 million samples per second, enabling in-space processing of 32-MHz bandwidth signals. The following should be tried in seeking to realize that potential:
• Verify the effect on throughput of adding the SDR functionality directly to the FFT tiles. Compare the performance of the alternative of adding an SDR tile in tandem to each FFT tile via the run_sink.c code.
• Experiment with different mappings of functionality to physical tiles.
• Work with ISI to optimize the NPS-developed code.
We have considered several approaches we could take to possibly optimize the FFT computations on the MAESTRO board. These include, listed in order of difficulty to address, but not necessarily the order of expected improvement:
1. Compiling/Coding Optimizations; 2. Dedicated Tiles; 3. Geometric Optimizations; 4. Refactoring the FFT Algorithms.
Below we describe these in more detail.
We think that the most improvement would come from some combination of Dedicated Tiles and Refactoring the FFT Algorithm.
Compiling/Coding Optimizations. The only change we made to the delivered FFTW package was to recompile it for single-precision floating-point operation. This provided a small performance improvement. We have not spent any significant time or effort applying the techniques outlined in the "Optimization Guide, UG105" [2] document to the FFTW package or our integration code. We would like to apply the various monitoring tools to analyze the performance of the FFTW package to see where the bottlenecks are. It would be interesting to apply the analysis tools to the package, apply the compiler feedback-based optimizations, and add appropriate compiler features to the code. We are currently using the "ilib" interfaces for inter-tile communications and synchronization. There are "intrinsic" level compiler macros that directly access hardware-level instructions to accomplish the communications and synchronization functions. These would remove the ilib function call overhead. It is not clear how much this would save, but it is worth looking at and might provide tighter (lower latency) inter-tile communications.
Dedicated Tiles. We are currently using the delivered version of the Linux operating system. This version of the OS does not include any configuration of "dedicated tiles." We would like to configure and compile a new version of the operating system that includes dedicated tiles to do the FFT computations. We think this may provide significant performance improvements since the dedicated tiles operate with much less OS overhead than normal tiles; hence we may see more effective use of both the processor and cache.
Geometric Optimizations. We have only done a limited number of experiments on the allocation of our processes to tiles. Our current efforts do not show a significant increase in overall throughput, in samples per second, once the number of tiles doing FFT processing increases beyond around 13 or so. We are not sure why this is the case, since our tile-to-tile communications speed measurements indicate that we should be seeing better performance than that. We think part of the problem is how the assignment of function to specific tiles is done.
The length of the path between two communicating tiles increases latency and there is a potential for collisions on a network path where several tiles attempt to communicate over the same path simultaneously. Since we potentially know all the communications among processes, we should be able to find optimal, or at least better, arrangements of processes to tiles. If we were to build a new OS version, this could tie into where we allocate the dedicated tiles.
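A first step toward comparing arrangements is simply to count the hops each FFT tile is from the source and the sink. The sketch below is a rough illustration, not a model of the actual Maestro network: it assumes that the path length between two tiles is the Manhattan distance on the 7 x 7 grid, it ignores link contention entirely, and the function names are hypothetical. The source and sink positions and the 3-column block of FFT tiles are those used in our experiments.

```python
SOURCE = (1, 3)   # source tile position used in the experiments
SINK = (1, 4)     # sink tile position used in the experiments

def hops(a, b):
    """Assumed path length between tiles a and b: Manhattan distance on the mesh."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def round_trip(tile):
    """Hops from the source to an FFT tile plus hops from that tile to the sink."""
    return hops(SOURCE, tile) + hops(tile, SINK)

def placement_cost(fft_tiles):
    """Total hop count over all FFT tiles; a crude figure of merit for a placement."""
    return sum(round_trip(t) for t in fft_tiles)

# The block of candidate FFT tiles: columns x = 2..4, rows y = 0..6.
candidates = [(x, y) for x in range(2, 5) for y in range(7)]
ranked = sorted(candidates, key=lambda t: (round_trip(t), t))
worst_tile = max(candidates, key=round_trip)
```

Under this assumption the tiles adjacent to the source and sink cost 3 hops per block, while the far corner of the block costs 13, so choosing the first p entries of `ranked` rather than leaving the choice to the library is one arrangement worth testing.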
Refactoring the FFT Algorithm. The FFTW implementation of the FFT algorithm is configured to compute an FFT block on a single tile. We achieve our "throughput" by providing multiple tiles, each computing a full FFT block. Our current measurements indicate that we could come close to doubling the throughput if we could speed up the FFT computations. In that case the limiting factor would be the inter-tile communications. One way of achieving an increase in FFT processing speed is to refactor the algorithm to take advantage of the MIPS/MAESTRO capabilities. One approach is to "roll our own" FFT implementation that still uses one tile per block but does not include any of the overhead needed to support the generality of the FFTW implementation. It can be tailored with appropriate assembly code to take advantage of the single-precision floating-point operations needed. This code could be tailored for exactly the block size we use and integrated into the inter-tile communications infrastructure. We could apply the various analysis and optimization techniques directly to this code. It is unclear, without further analysis, whether the implementation for, say, a 1024 block size would fit comfortably on a single tile and cache. A second approach might be to attempt to factor the algorithm to run on multiple tiles to take advantage of the "butterfly" computations. This might increase the latency slightly but might make it easier to ensure that the computations take place entirely within the cache.
APPENDIX
This appendix contains a description of the source code used in the implementation of the FFT on the Maestro board. The archive "maestro-fft.tar.gz" contains the source files.
2. A number of copies of a process, 'run_receiver', that actually processes the FFTs. The actual number of FFT processes and their arrangement on the Maestro board are established by an initialization program.
3. A sink process, 'run_sink', to collect the processed results.
4. A program, 'startdma' (or for testing, 'start3dma') that "spawns" each of the 3 component processes, each process on a separate "tile." The "start" program also establishes which tiles are used for which functions.
The development environment used consisted of the standard Linux development environment. There is no 'configure' script since this code was developed to run only on the Maestro boards. We use a 'Makefile' and standard Linux '*.c' and '*.h' files. Each of the files contains comments that document the process described in the file and how it is used.
The files available in the archive are:
1. ./startdma blocksize iterations numReceivers
4. run_sender: the program that generates data to be processed; it sends a DMA message to the run_receiver processes to tell them that they have a buffer to process.
5. run_receiver: Using DMA, it transfers FFT blocks from run_sender, computes the FFTs on the blocks, and signals run_sink that it has a block to process.
6. run_sink: Using DMA, it transfers blocks from run_receiver for future processing (not done here).
7. start3dma: a simplified version of startdma that has only one run_receiver tile. It places run_sender on tile (1,3), run_receiver on tile (2,3), and run_sink on tile (3,3). This was constructed for testing.
