Image-based wavefront sensing provides significant advantages over interferometric-based wavefront sensors such as optical design simplicity and stability. However, the image-based approach is computationally intensive, and therefore, applications utilizing the image-based approach gain substantial benefits using specialized high-performance computing architectures. The development and testing of these computing architectures are essential to missions such as James Webb Space Telescope (JWST), Terrestrial Planet Finder-Coronagraph (TPF-C and CorSpec), and the Spherical Primary Optical Telescope (SPOT). The algorithms implemented on these specialized computing architectures make use of numerous two-dimensional Fast Fourier Transforms (FFTs) which necessitate an all-to-all communication when applied on a distributed computational architecture. Several solutions for distributed computing are presented with an emphasis on a 64 Node cluster of digital signal processors (DSPs) and multiple DSP field programmable gate arrays (FPGAs), offering a novel application of low-diameter graph theory. Timing results and performance analysis are presented. The solutions offered could be applied to other computationally complex all-to-all communication problems.
INTRODUCTION
Future optical systems based on light-weighted and segmented primary mirror systems require active optical control to maintain mirror positioning and figure to within nanometer tolerances. Conventional interferometer-based wavefront sensors require a wavefront reference and are generally hardware intensive and introduce non-common path errors. An image-based wavefront sensor on the other hand, is self-referencing by design, and trades hardware for a software solution. As a result, the image-based approach, while more demanding computationally, is generally less complex to implement in hardware and more attractive from a payload reliability and systems engineering perspective. Although the image-based approach is less complex in hardware than interferometer based systems, the increased computational demand requires significant floating-point performance, increased power demands, and high speed data transfer and communication rates.
Phase retrieval is an image-based wavefront sensing method that utilizes point-source images (or other known objects) to recover optical phase information. A general purpose phase retrieval algorithm architecture, or Hybrid Diversity Algorithm (HDA), has been developed by NASA to meet project Technology Readiness Level (TRL) 1 requirements for various ground and space-based systems. Details of this architecture are discussed elsewhere in these proceedings. 2 The core building blocks of the HDA are the iterative transform based methods (or Misell-Gerchberg-Saxton (MGS)) 3, 4 and parametric phase retrieval methods. 5, 6 With certain data configurations and large array sizes, particularly for under-sampled data sets, these core algorithm blocks can take tens of minutes or even hours to reach full convergence. In addition, laboratory testing is generally susceptible to opto-mechanical instabilities induced by environmental factors such as jitter and air turbulence. These factors not only limit fine-phasing performance but also stability of the phasing results. Motivated by these factors, a number of special purpose supercomputing architectures have been developed to reduce algorithm convergence times while not sacrificing solution accuracy.
In this paper, we document several supercomputing architectures that have been developed for the Government implementation of MGS which forms a core algorithm block of the HDA. A detailed discussion of the MGS algorithm and its limitations are given elsewhere in these proceedings.
The basic MGS algorithm is illustrated in Fig. 1 . MGS requires numerous iterations from the image plane to the pupil plane via the Fourier transform pair. For each "inner loop" of MGS and for each image, a single iteration is completed as constraints in each Fourier domain are applied. Thus, a single iteration results in a Fourier and Inverse Fourier Transform pair; hence, the number of 2-D FFTs is the product of the number of diversity-defocus images, the number of outer loops, the number of inner loops, and a factor of 2 for the Fourier and Inverse Fourier Transforms in each iteration.
Current desktop and server line general-purpose central processing units (CPUs) have been optimized for performing multiple tasks and use principles of data locality to increase performance, yet performance dramatically decreases 7, 8 when using large 2-D FFTs. For example, a current general-purpose CPU can take several seconds for a double precision 2-D FFT of size 2048×2048. This is mainly due to the fact that the memory architecture is not optimized for such large data sets. To perform the numerous large 2-D FFTs efficiently, several application-specific architectures have been developed and are discussed further below. Many 1-D and 2-D FFT algorithms exist, yet for each element of the output, the algorithms require access to every element of the input. 11 Thus, as one naive approach to parallelization, an image cannot be divided into sub-components that are processed completely independent. Typically, the 2-D FFT is computed as a series of 1-D FFTs. 12 Thus, a 1-D FFT is performed on each row; then, the 1-D FFT is performed on each column of the result. The total number of 1-D FFTs is then product of the number of single inner loop iterations, the diversity-defocus image size, and a factor of 2 for the rows and for the columns. For example, 4 diversity-defocus images of size 512×512, with 50 outer loops and 10 inner loops results in more than 4,000,000 1-D FFTs. The computational complexity increases with the adoption of the full HDA algorithm, 2 e.g., the HDA will perform the MGS algorithm in Fig. 1 multiple times, while modifying an adaptive diversity array to incorporate feedback.
DISTRIBUTED ARCHITECUTRES
The computational requirements of image-based wavefront sensing are proportional to the accuracy and resolution of the wavefront measurement. E.g., the James Webb Space Telescope (JWST) coarse alignment and coarse phasing 15 use less complicated models to determine the necessary corrections; similarly, fine phasing requires more complex algorithms to accomplish nanometer level tolerances. Without sacrificing accuracy, it is optimal to minimize the time between image acquisition and correction. 16 Additionally, an increased demand on wavefront resolution requires an increase in the data set size, and thus in computational complexity.
The process of performing wavefront sensing on N diversity-defocused images is highly parallel for each image; as such, past approaches to increase performance have used 1 to N general-purpose CPUs as a cluster. 17 Consequently, this provides a maximum factor of N improvement, while having the negative effect of increasing power requirements, footprint, and cooling requirements by N. The primary solution is to provide N application specific, highly optimized computational cores. All further discussion assumes a single computational core, and thus, the MGS is being performed on a single diversity defocus image. An earlier MGS algorithm implementation for the Analog Devices Hammerhead DSP has been previously developed. 18, 28 P1 P2 P3 P4 P1
P2 P3 P4
To address these bottlenecks, supercomputing architectures are utilized to handle the computational demands of the 2-D FFT. As described, the 2-D FFT can be performed as a multi-step process of performing the 1-D FFT on each row and then on the resulting column. The bottleneck for performing this on a distributed system becomes the resulting transpose of the entire image. As seen in Fig. 2 , the distributed transpose necessitates an all-to-all communication, and thus is non-trivial as one increases the number of processing elements. For Fig. 2 , the vertical columns represent the internal memory on P1, P2, P3, and P4 processing units. For the transpose to occur, the memory must change from one memory location on one processing unit, to another. This compares to the traditionally naive approach of modifying the index methodology for a 2-D image, i.e
Fig. 2. Characterization of a distributed transpose for 4 processing units
The distributed transpose of Fig. 2 takes advantage of the transpose property in (Eq. 1). For (Eq. 1), A, B, C, and D are all square matrices. This allows the transpose to be performed on sub-blocks, and then to perform the transpose on each sub-block. This iterative algorithm allows for the further development of architectures, by adaptively supporting the growth of the number of processing units.
The bottleneck of this process is not the transpose of the sub-blocks, but the data transfer of a sub-block to another processing unit. For example, the data transfer of (Eq. 1) would occur by moving matrix B and C, and not the transpose of A, B, C, and D.
The computational cores used on each image can be further divided to perform a distributed 2-D FFT. To perform the distributed 2-D FFT, the computational requirements scale linearly with the number of sub-components up to the size of the diversity-defocus image. Thus, the computational requirements of the MGS on a 512×512 diversity-defocus image size scales linearly until one reaches 512 sub-components of a computational core, or 512 processing units. For implementations of distributed 1-D FFTs, and thus, the scalability for an increase in the number of sub-components larger than the diversity-defocus image size, various algorithms have been explored externally. 19 All discussion in this paper will assume the 1-D FFT is the smallest component of division.
The transpose within the 2-D FFT is the main component of data transfer between sub-components; the transpose necessitates a transfer of sub-blocks of data from each processing unit to every other processing unit, as shown in Fig. 2 . The all-to-all communication does not scale without care given to the underling architecture of a distributed system. To achieve linear scalability in performance for an all-to-all communication, one must have an architecture that provides a direct connection of all processing units to all other processing units, such as the crossbar switch architecture 20 or the complete-graph of type K n (i.e. n branches for n nodes). 21 In practice, this is not ideal because the number of interconnects per node grows as N-1, and thus, total number of interconnects I for the complete-graph of type K n grows as O(n Similarly, the crossbar switch architecture grows as O(n 2 ) with the total number of switches.
To address this problem, several solutions such as the n-dimensional hypercube, ring-torus, or grid architecture are used in practice. These architectures are not the optimal choice under the constraint of number of nodes and the number of branches per node. The optimal homogenous architecture for the all-to-all communication would minimize the graph diameter, under the constraint of degree of each node. Thus, the optimal graph would restrict the number of branches per node, while minimizing the maximum of all of the "shortest paths". An example of a solution to this problem is the Peterson Graph, as seen in Fig. 3(a) . Furthermore, the Moore's bound for a p-node undirected graph provides a methodology for placing a bound on the diameter, as seen in (Eq. 3) and Fig. 3 (b) -(d).
For Fig. 3 (b-d) , the leaf nodes need additional branches added to be a complete d-degree graph, and it is proposed that such a graph does not exist. 22 In practice, leaf nodes of the base architectures are removed, and thus, for significantly large graphs, only optimum solutions exist. For example the Levi Graph, with 30 nodes and diameter 4, is an optimum solution 23 for the 46 node 3-degree diameter 4 graph in Fig. 3 (d) . In practice, it is impractical to develop such architectures because of scalability, feasibility, or cost. In many cases, simply minimizing the graph diameter does not provide an optimal solution because of load balancing, irregular network flow, or other problems. Yet, for the 2-D FFT transpose, the all-to-all communication is perfectly balanced and each node communicates at a pre-determined and predictable time. Thus, the graph diameter is the primary variable that is optimized for a given architecture to solve the transpose problem.
For this research, several low-diameter architectures were evaluated and implemented, as shown schematically in Fig. 6 and Fig. 7 . The majority of the research was conducted on three independent systems: a 64 node cluster of Analog Devices TS101 DSPs, a 24 node cluster of Analog Devices TS201 DSPs, and a 4 node cluster of Virtex 4 SX55 Xilinx FPGAs. The TS101 and TS201 both provide methodologies for enhancing the performance of the HDA by including ground-based wavefront sensing, telescope image processing, laboratory optical processing, system design and tolerancing, sensitivity analysis, Monte-Carlo simulations, and finite element modeling. The FPGA system provides a path to radiation tolerant environments. 25 As mentioned earlier, the strides in the number of computational cores for each diversity-defocus image has been previously explored. Thus, the number of DSPs per image is the main focus, and all graphs and architectures will assume a single image as input.
The TS101 processor has four bi-directional link ports that allow the direct connection between two chips. Other processors offer comparable communication ports. Similarly, the TS201 DSPs have four dual channel uni-directional ------link ports, and thus minimize the turn-around time in changing directions for data flow. This is advantageous for the 2-D FFT because every node is sending and receiving data.
The TS101 system is rated at theoretical maximum of 96 Gflops (or 1.5 Gflops per DSP for 64 DSPs) and the TS201 system is rated at 72 Gflops (or 3 Gflops per DSP for 24 DSPs). The TS101 and TS201 systems are both arranged in clusters of 4 DSPs per board; with a shared SDRAM, as seen in Fig. 5 . The TS101 has two dedicated link ports to other nodes in the cluster, the bold black line of Fig. 5 : the remaining two link ports are routed to an external I/O port. For the TS201, all link ports are routed to a Virtex 2 Pro, from there, the user has complete control over the destination as either back to the cluster, or to an external port. The system layout TS101 can be in Fig. 4 , where each DSP board is a cluster of 16 DSPs connected by a cPCI bus to a single board computer. As a first example of the effects of a minimum diameter graph for the 2-D FFT, refer to Fig. 6 . In this example, the node's unique identifier and shortest path from 0 are listed. For graph (a), a cube, the shortest-path between node 0 and all other nodes is 3 or less, but for graph (b), a 1-Möbius cube, the shortest-path between any two nodes is 2 or less. 26, 27 For graph (b), the 2-D FFT is 17% faster than graph (a). To date, this graph (b) is the largest architecture that has been implemented for the TS201 system.
To further explore the capabilities of the 64 node TS101 DSP system, the number of DSPs per image was developed and tested for the cases 16, Fig. 7 (a,b) and 32 nodes, Fig. 7 (c) . Results of this trade are listed in section 3. Currently, a division of the 64 DSP cluster is being used at the James Webb Space Telescope test bed. Furthermore, for most applications of the MGS, multiple diversity-defocus images are used but for some applications only a single image is applicable. For this case, an architecture for 64 DSPs per image was constructed, Fig. 8 . This builds upon the proven methodology from the 16 and 32 DSPs per image. As mentioned, distributed FPGA architectures are also being explored as a method of executing the computationally complex MGS and HDA algorithms. In the distributed FPGA architecture, a combination of five FPGA boards will be used to study the effectiveness of distributed FPGA architectures. During this study, we hope to compare the timing, accuracy, and power consumption of the distributed FPGA system to other distributed systems. The goal is to determine if distributed FPGA architectures are a viable solution to executing the MGS and HDA algorithms in space.
The study is yet in its infancy, and single system architecture has been thus far proposed, which builds upon the successful the DSP low-diameter architectures. Defined and constrained by budget, the system includes a total of five FPGA boards: four ADM-XRC-4 boards by Alpha Data, each utilizing a Xilinx Virtex-4 SX55 FPGA; and one ADP-XPI board by Alpha Data, with a Xilinx Virtex-II Pro FPGA.
These boards were selected for their general digital signal processing capabilities. The Virtex-4 SX55 FPGA has approximately 55,000 logic cells and 512 XtremeDSP Slices. The XtremeDSP Slice is Xilinx's specialized multiplyaccumulator logic cell. The availability of these specialized logic cells on the FPGA enhances the FPGA's efficiency for digital signal processing applications. This particular FPGA was chosen over others in the Virtex-4 family because of its high quantity of XtremeDSP Slices. The ADP-XPI board with the Virtex-II Pro was chosen as a master controller for the other four boards. The four Virtex 4SX chips have been prototyped as a K 4 complete-graph architecture. Each SX chip is connected to the master FPGA using the same I/O configuration. Fig. 9 : Cluster of four Virtex 4SX chips and one Virtex 4FX chip.
PERFORMANCE AND TESTING
The transition from 8 to 16 DSPs per image reduced the sub-component block size for the all-to-all communication by 4, yet, still requires the same total data transfer. For example, a 512×512 image on 8 DSPs results in each DSP performing 1-D FFTs on a 64×512 block. Each block is then divided into 8 sub-blocks of 64×64, where each sub-block is transmitted to the corresponding DSP. The 64×64 sub-block has 4096 elements, but for the 16 DSP case, the subblock is 32×32 and has 1024 elements. The first architecture investigated based on the 16 DSPs is shown in Fig. 7 (a) which maintains an overall low-diameter. With the necessary packet overhead for routing, the reduced sub-block size, and the turn-around time for data flow on the physical layer of the link ports, the result was less efficient than desired, and thus, an alternative 16 DSP architecture was explored, Fig. 7 (b) . This architecture utilized edge routing between the clusters of DSPs, thus creating larger packets. This improvement resulted in an overall system improvement for the 2-D FFT by 24% for the 512×512 case. For larger images, the improvement was less significant because the sub-block size increases. An additional advantage, and method for understanding this performance increase, is to relate the structure of the transpose to (Eq. 1). This equation illustrates the recursive nature of the transpose, and thus, why the hierarchy of a low diameter architecture is efficient. The hierarchy can be visualized if each cluster of four DSP, rather than the individual DSP, is seen as sub-component computational core. Thus, the transpose must occur within the cluster, and then between the clusters of DSPs. This approach maximizes the bandwidth between the clusters by minimizing the administrative overhead incurred for the first 16 DSP cluster explored, (a).
For the 16 to 32 DSPs per image, only the sub-clustered routing architecture was explored which is shown in Fig. 7 (c) .
For processing, i.e. 1-D FFTs and other mathematical routines, the improvement between 16 and 32 DSPs is nearly linear. Thus, the computational routines provided a factor of ×2 improvement. In addition to the scalability of the computational routines, one must explore the scalability of the data transfer. For a bus architecture, increasing the number of nodes will result in a decrease in the bandwidth per node. For the TS101 and TS201 systems, increasing the number of nodes also increases the total amount of network bandwidth. For example, a single TS101 has 4 link ports at 250 MB/sec, and thus for the 16 node cluster, this network has a theoretical maximum bandwidth of 8 GB/sec (16 nodes × 250 MB/sec × 4 link ports / 2 for interconnection = 8 GB/sec). Similarly, the 32 TS101 system has an over network bandwidth of 16 GB/sec. Fig. 7 (c) , was ×1.2 times faster than the 16 DSPs architecture, Fig. 7 (b) , for an overall improvement for the 2-D FFT of ×1.7. Timing details for comparison to a single desktop processor is presented elsewhere in these proceedings. 28 The implemented architectures of Fig. 7 (b,c) for the 16 and 32 node architectures provide linear scalability with respect to image size. Intuitively, an N×N diversity defocus image would take 4 times longer than one half it size in each direction, i.e. ½N × ½N = ¼(N×N). Yet, for general purpose CPUs, this is not the case, as it has been shown, 10 that the computational performance, measured in Gflops, decreases with image size. The optimal image size for the FFT on a general purpose CPU is between 16×16 and 64×64, depending on the chipset. 10 For the TS101 and TS201 architectures, this is not the case. The computational 1-D FFT scales linearly with array size, and the distributed transpose scales near linear. Thus, the difference in performance between various image sizes for the TS101 architecture can be seen in Table  1 , and averages 2.5 between images of size less than 512×512 (smaller numbers signify an independence between image size and performance time). For larger images, there is a performance hit between the 512 and the 1024 case because the external SDRAM must be used more heavily. 
CONCLUSIONS
In summary, the HDA and MGS algorithms entail numerous 2-D FFTs for a single iteration of closed loop control. We have shown that a distributed computational architecture, processing on each diversity defocus image, is the optimal architecture in terms of performance, with further details specified by various requirements depending on the desired footprint, power consumption, operating environment. These details imply the specification of the sub-components. For ground-based wavefront sensing, a cluster of DSPs is optimal for performance, while a radiation environment, just as L2 for JWST and TPF required radiation toleration sub-components, such as FPGAs.
Given the infrastructure differences between an FPGA and a DSP it is difficult to compare the performance of the TS101 and TS201 DSP systems to the Virtex 4 FPGAs. Currently, the baseline for the FPGA is a fixed point processing model, while the DSPs are performing 32 bit floating point. Additional details on the FPGA work will be presented in a later report.
Additionally, for the TS101 and TS201 systems, many subtle performance enhancements were discovered, such as the minimum image size for a fixed number of DSPs. For 16 DSPs, the run-time for a 32×32 diversity defocus image is the same as a 16×16 image, and similar results were seen for the 8 and 32 DSP architectures. In a later analysis we will document the performance of a 1024 node cluster for the 2048×2048 image size and determine the hierarchy of an optimal architecture to perform the transpose of the 2-D FFT.
