We demonstrate a novel architecture for Adaptive Optics (AO) control based on FPGAs (Field Programmable Gate Arrays), making active use of their configurable parallel processing capability. SPARC's unique capabilities are demonstrated through an implementation on an inexpensive off-the-shelf Xilinx VC-709 development board. The architecture makes SPARC a generic and powerful Real-time Control (RTC) kernel for a broad spectrum of AO scenarios. SPARC is scalable across different numbers of subapertures and pixels per subaperture. The overall concept, objectives, architecture, validation and results from simulation as well as hardware tests are presented here. For Shack-Hartmann wavefront sensors, the median total AO reconstruction time ranges from 39.4 µs (11 × 11 subapertures) to 1.283 ms (50 × 50 subapertures) on the development board. For large wavefront sensors, the latency is dominated by the access time (∼1 ms) of the standard DDR memory available on the board. This paper is divided into two parts. Part 1 is targeted at astronomers interested in the capability of the current hardware. Part 2 explains the FPGA implementation of the wavefront processing unit, the reconstruction algorithm and the hardware interfaces of the platform, and mainly targets embedded developers interested in the hardware implementation of SPARC.
1 Introduction
Adaptive Optics (AO) is an indispensable tool at the frontier of astrophysics in the era of large telescopes of aperture diameter greater than 30 m, which can provide high-resolution spectroscopy and direct imaging of exoplanets. For example, the Narrow Field Infrared Adaptive Optics System (NFIRAOS) 1 will be the first adaptive optics system deployed on the Thirty Meter Telescope (TMT) and will provide 12 times the spatial resolution (at near-infrared wavelengths) of the Hubble Space Telescope (HST). The Multi-Conjugate Adaptive Optics (MCAO) system on the European Extremely Large Telescope (E-ELT) 2 will provide 16 times the spatial resolution of HST. AO is moving on from an optional back-end instrument to a full-time first light instrument on the next generation of large telescopes. Advances in the field of ground-layer AO, [...] (DM) fitting to create the control matrix, which is only required to be created at a slower rate compared to the real-time reconstruction of actuator values from WFS inputs through MVM. 20, 21 The majority of the deployed and planned AO systems use a limited variety of WFS techniques, and MVM is predominantly used for extraction of the correction vector from the sensed variables.
We describe SPARC in two publications. Part 1 (this paper) explains the primary motive of the work, the hardware requirements, the architecture, the method of validation and the validation results. This part is intended for astronomers and the user community who are primarily interested in understanding the capabilities and performance of SPARC. Part 2 22 describes in depth the implementation of SPARC on an off-the-shelf Field Programmable Gate Array (FPGA) development board. It details how scalability and adaptability have been built into the memory interface and the reconstruction matrix, and explains the different techniques of AO reconstruction, each with its own computational cost per degree of freedom. Part 2 22 is aimed at embedded system developers interested in the design of a scalable AO real-time controller on an FPGA.

Section 2 of this paper explains the objectives and the levels of scalability that are implemented in SPARC. Section 3 describes the state machine implementation of the SPARC system and the off-the-shelf hardware used to test the platform. The atmospheric turbulence simulation used to check the limits of scalability and the interface with the iRobo-AO laboratory platform are explained in Section 4. Section 5 gives the results obtained from the validation tests explained in Section 4. A summary and future prospects of the SPARC approach are presented in Section 6.
2 Objectives and Scalability
Gavel 23 estimated the computational requirements for achieving a Strehl ratio of 0.5 at a wavelength of 1 µm for a telescope with a primary aperture diameter of 30 m. The fitting error allowed by this error budget can be achieved by an AO system with 10,000 degrees of freedom (DOF).
A more recent study of the AO requirements for the E-ELT 24 confirms a similar DOF target.
The broader interface and computational requirements, which are based on this error budget, are given in Surendran et al. 25 The size of a control matrix for a single-conjugate AO system with 10,000 DOF is close to 400 MB (assuming 2 bytes for every element in the matrix). For an AO reconstruction time of 1 ms, a memory bandwidth of 400 GB/s and 400 GFlops of computing performance would be required. While the computing performance can easily be achieved by off-the-shelf GPUs and FPGAs, the memory bandwidth is still a challenge with current technology.
The Nvidia Tesla K10 26 is one of the most powerful GPUs available today and can provide a memory bandwidth of 320 GB/s.
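The figures quoted above can be reproduced with a few lines of arithmetic. The following Python sketch is only an illustration: the 10,000 DOF, 2 bytes per element and 1 ms frame time come from the text, while the slope count of 20,000 (two slopes per degree of freedom) and the function name `rtc_requirements` are assumptions made here so that the matrix size comes out near 400 MB.

```python
def rtc_requirements(n_actuators, n_slopes, bytes_per_element, frame_time_s):
    """Return (matrix_bytes, bytes_per_second, flops_per_second) for one dense
    matrix-vector multiplication (MVM) per frame."""
    elements = n_actuators * n_slopes
    matrix_bytes = elements * bytes_per_element
    # The whole control matrix has to be streamed from memory once per frame.
    bandwidth = matrix_bytes / frame_time_s
    # Each matrix element costs one multiply and one add.
    flops = 2 * elements / frame_time_s
    return matrix_bytes, bandwidth, flops


# 10,000 DOF, 2 bytes per element and 1 ms are taken from the text;
# 20,000 slopes (x and y per degree of freedom) is an assumption.
matrix_bytes, bandwidth, flops = rtc_requirements(10_000, 20_000, 2, 1e-3)

print(f"control matrix : {matrix_bytes / 1e6:.0f} MB")
print(f"bandwidth      : {bandwidth / 1e9:.0f} GB/s")
print(f"compute        : {flops / 1e9:.0f} GFLOP/s")
```

Under these assumptions the matrix must be read in full once per frame, which is why the required bandwidth and the required arithmetic rate scale together.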
A generic, scalable platform should be able to adapt to the requirements of large-scale AO and to the faster memory modules of the future. The primary motive of SPARC is to create a plug-and-play implementation of a scalable adaptive optics real-time controller. FPGAs are ideally suited for such a system because of their inherent flexibility in adapting the level of parallelization to the requirement. This translates to a standalone AO real-time controller that has flexible WFS and DM interfacing options and is scalable and adaptable to a range of AO scenarios as follows:
1. Number of subapertures and pixels per subaperture: One of the largest contributors to the error budget of an AO system is the fitting error caused by the limitations in the spatial sampling of the wavefront. A unique feature of the SPARC implementation is that the logic resources of the FPGA needed for AO computation (including pixel acquisition, slope computation, AO reconstruction and actuator output) are independent of the number of subapertures. The slope computation takes a relatively small fraction of the logic resources and hence does not contribute much to the FPGA resource usage. The platform is also compatible with rectangular or cropped subaperture configurations.
2. Memory bandwidth: SPARC uses an adaptable first-in-first-out (FIFO) interface, which provides the flexibility to interface with double data rate (DDR) memory modules over a wide range of frequencies and data widths.
3. Portability across FPGA families: SPARC is designed with compatibility in mind. The core algorithm is written in native VHDL, which is compatible with the FPGAs of most manufacturers. About 25% of the program is FPGA specific, but this is restricted to the operation of external interfaces. This enables the design to be implemented on slower/faster FPGAs having less/more internal memory and logic resources, which can interface with slower/faster external memory modules.
SPARC is designed to function without the help of a host-PC for the AO interfaces, which will help in the miniaturization of the computational hardware. This is especially advantageous for small AO systems on a budget, if the control system is able to plug-and-play into the optical hardware. The plug-and-play implementation can be expanded to other applications of AO like vision science and microscopy. An FPGA is used as the computational hardware for the following reasons:
1. Flexibility in parallel processing: The number of parallel processes in a GPU or CPU is limited by the number of processor cores available on the hardware and by the single clock driving the chip, whereas the FPGA is only limited by the raw logic available inside the chip. An FPGA allows the creation of a custom number of parallel processing units (limited only by the total logic resources available) and of different digital clocks whose periods can be individually customized. This capability allows matrix multiplications to be optimized for the available memory bandwidth.
2. Predictable latency: The delay between the time at which the input signals from the wavefront sensor arrive and the time at which the deformable mirror correction signals are updated directly affects the performance of an AO system. Unlike in CPUs and GPUs, the computational delay of an FPGA design (excluding the delay contributed by external interfaces) can be predicted to within a few nanoseconds.
3 Control Scheme
Hardware
The SPARC platform is implemented on a commercial off-the-shelf solution which can demonstrate the wide range of scalability and is powerful enough to execute the AO control loop for a large telescope. The Xilinx VC-709 development board has a Virtex-7 XC7VX690T chip which hosts 3,600 Digital Signal Processing (DSP) slices and facilitates a large number of parallel multiply-and-accumulate (MAC) operations for a fast matrix multiplication. The two modules of DDR3 (DDR version 3) memory on the board can provide a combined memory bandwidth of 25.6 GB/s, which is important for the fast retrieval of the reconstruction matrix for AO reconstruction. The board also features a Peripheral Component Interconnect Express (PCIe) version 3 interface, which facilitates high-speed communication with a host PC.
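To give a feeling for how these two figures relate, the short Python sketch below estimates how many MAC units would have to run in parallel to consume matrix elements as fast as the DDR3 memory can deliver them. It is a back-of-the-envelope illustration, not a description of the SPARC design: the 2 bytes per element is carried over from the estimate in Section 2, and the 200 MHz MAC clock and the name `macs_to_match_memory` are hypothetical.

```python
import math


def macs_to_match_memory(bandwidth_bytes_per_s, bytes_per_element, mac_clock_hz):
    """Number of parallel MAC units needed to consume matrix elements
    at the rate the external memory can supply them."""
    elements_per_s = bandwidth_bytes_per_s / bytes_per_element
    return math.ceil(elements_per_s / mac_clock_hz)


DDR3_BANDWIDTH = 25.6e9   # bytes/s, combined figure quoted for the board
DSP_SLICES = 3600         # DSP count quoted for the XC7VX690T
BYTES_PER_ELEMENT = 2     # assumption, as in the Section 2 estimate
MAC_CLOCK_HZ = 200e6      # hypothetical clock, chosen only for illustration

needed = macs_to_match_memory(DDR3_BANDWIDTH, BYTES_PER_ELEMENT, MAC_CLOCK_HZ)
print(f"parallel MACs needed: {needed} of {DSP_SLICES} DSP slices "
      f"({100 * needed / DSP_SLICES:.1f}%)")
```

With these assumed numbers only a small fraction of the DSP slices is needed to keep up with the memory, which is in line with the observation that memory bandwidth, rather than arithmetic capacity, is the limiting resource.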
State machine implementation of the control loop
The FPGA design is divided into state machines which are a combination of digital logic gates for computation (combinatorial logic) and decision making algorithms (sequential logic). The sequential logic in each state machine is driven by periodic clocks whose frequency is determined by the maximum time taken by the logic gates to complete a computation within the state machine.
The design consists of three state machines, which are driven by different clock frequencies because the logic gates within each state machine take different amounts of time to complete their computation.
SPARC functions like any other AO control loop: it accepts WFS pixels as the input and generates phase (or actuator) values as the output. The difference is the scalability which is incorporated at each level. Fig. 1 shows SPARC represented as a system of interconnected state machines. The figure shows the flow of data for a single row of subapertures coming from the WFS. The three state machines are described below:
1. Wavefront Processing Unit (WPU): The WPU consists of a flexible interface for acquiring pixels from different CCD interfaces, and a scalable slope computer. This particular WPU is designed to implement a Center of Gravity (CoG) slope computation for a Shack-Hartmann (SH) sensor, but the algorithm for slope computation is modular and can be modified if required. Every CCD interface has its own set of rules on how the pixels are sent to the processing hardware. A flexible interface is therefore provided, which can be connected to custom WFS hardware, or to a host PC which acquires the pixels before sending them to the FPGA. Each pixel value is 16 bits in the current hardware, but this can be changed if required. The WPU is pipelined to allow slopes to be computed without interrupting the pixel acquisition into the FPGA. When the pixels corresponding to a single row of subapertures are available, [...] More details on the block RAM (BRAM) addressing are given in Surendran et al. 25 and Part 2 22 of the paper. The speed of slope computation is determined by the amount of logic resources on the FPGA. A combination of dynamic internal memory allocation and the use of different clocks for pixel acquisition and slope computation has resulted in a WPU which is fully scalable with the number of subapertures and pixels per subaperture. The logic resource usage in the FPGA depends only on the pixels per subaperture and the speed of slope computation (which can be set by the user), and is independent of the number of subapertures. As of now, the implementation of SPARC on the VC-709 development board requires a host PC with a PCIe interface to send and receive data to and from the outside world (including the WFS and the DM).
2. AO Reconstructor: The AO Reconstructor analyzes the reconstruction matrix, decomposes it into sub-matrices depending on the number of subapertures in the WFS and performs MVM for one row of subapertures at a time. We currently use Fried geometry in our platform, but any geometry can be used to create the reconstruction matrix. If the number of subapertures along a row is $n$ (and the total number of subapertures is $n^2$), the size of the reconstruction matrix is $(n+1)^2 \times 2n^2$ (for Fried geometry). The x-slopes and y-slopes from each row of subapertures correspond to $2n$ slopes, which need to be multiplied with two sub-matrices of size $(n+1)^2 \times n$ taken from different parts of the original reconstruction matrix. As shown in Fig. 1, the AO reconstructor communicates with the memory state machine to extract the correct sub-matrix (corresponding to the available slopes) from the external memory, and performs the required multiplication. The speed of this operation is limited only by how fast the sub-matrix can be read out from the external memory.
Details of how memory-speed adaptability and scalability are implemented in the AO reconstructor are discussed in Part 2 22 of the paper.
3. Memory state machine: As mentioned in Section 1, the speed of the external memory modules currently available on the market is not sufficient for the requirements of AO for large telescopes. Hence, the SPARC memory interface has been designed to be flexible enough to accommodate different frequencies and data widths of the external memory being interfaced. A detailed description of the scalability of the memory interface is provided in Part 2 22 of the paper.
The memory state machine is also responsible for acquiring the reconstruction matrix from a host PC and writing it into the external memory connected to the FPGA.
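The data flow through the three state machines can be mimicked in software to make the row-by-row decomposition concrete. The Python sketch below is a simplified model under stated assumptions, not the VHDL design: the function names are hypothetical, the CoG slopes are measured relative to the subaperture centre, and the reconstruction matrix columns are assumed to hold all x-slopes first and all y-slopes second (the text only says that the two sub-matrices come from different parts of the matrix). Pipelining, clock domains and the external memory are not modelled.

```python
def cog_slopes(subap):
    """Center-of-gravity (x, y) of one subaperture, relative to its centre."""
    rows, cols = len(subap), len(subap[0])
    total = sum(sum(r) for r in subap)
    if total == 0:
        return 0.0, 0.0  # no light in this subaperture
    sx = sum(x * subap[y][x] for y in range(rows) for x in range(cols)) / total
    sy = sum(y * subap[y][x] for y in range(rows) for x in range(cols)) / total
    return sx - (cols - 1) / 2, sy - (rows - 1) / 2


def row_slopes(pixel_rows, p):
    """Slopes for one row of subapertures; pixel_rows holds p lines of n*p pixels."""
    n = len(pixel_rows[0]) // p
    xs, ys = [], []
    for k in range(n):
        subap = [line[k * p:(k + 1) * p] for line in pixel_rows]
        sx, sy = cog_slopes(subap)
        xs.append(sx)
        ys.append(sy)
    return xs, ys


def accumulate_row(phase, R, row_index, xs, ys, n):
    """Add one subaperture row's contribution to the phase vector, using the
    two (n+1)^2 x n sub-matrices that belong to this row's x- and y-slopes."""
    x0 = row_index * n            # first x-slope column of this row (assumed layout)
    y0 = n * n + row_index * n    # first y-slope column of this row (assumed layout)
    for a in range(len(phase)):
        for k in range(n):
            phase[a] += R[a][x0 + k] * xs[k] + R[a][y0 + k] * ys[k]


# Toy example: n = 2 subapertures per row, p = 2 pixels per subaperture side.
n, p = 2, 2
frame = [
    [1, 3, 0, 2],
    [1, 3, 4, 2],
    [5, 1, 2, 2],
    [1, 1, 2, 6],
]
# Arbitrary (n+1)^2 x 2n^2 matrix standing in for a real reconstructor.
R = [[((a + 1) * (c + 2)) % 7 - 3 for c in range(2 * n * n)]
     for a in range((n + 1) ** 2)]

phase = [0.0] * (n + 1) ** 2
all_x, all_y = [], []
for r in range(n):                       # one row of subapertures at a time
    xs, ys = row_slopes(frame[r * p:(r + 1) * p], p)
    accumulate_row(phase, R, r, xs, ys, n)
    all_x += xs
    all_y += ys

# Reference: a single full matrix-vector multiplication on the complete slope vector.
full = [sum(R[a][c] * s for c, s in enumerate(all_x + all_y)) for a in range(len(R))]
print(max(abs(u - v) for u, v in zip(phase, full)))
```

The printed difference between the row-wise accumulation and the full MVM is at rounding level, which is the property that lets the reconstruction start as soon as the first row of slopes is available instead of waiting for the whole frame.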
4 Validation
Validation of SPARC is performed through two methods:
1. Hardware-in-the-loop simulation: In this method, atmospheric turbulence is simulated for different telescope aperture sizes and the performance of SPARC is tested for a range of subaperture sizes. This simulation measures the AO reconstruction time and verifies the phase outputs generated by SPARC.
2. iRobo-AO interface: In this test, SPARC is interfaced with actual AO hardware. Reliability testing over a large number of frames is conducted by using SPARC as the real-time AO kernel in a real AO system called iRobo-AO (see below). This was not possible with the hardware-in-the-loop test because of the long time that the host PC took to process each frame.
Hardware-in-the-loop simulation
The AO simulator is based on the fast Fourier transform (FFT) based phase screens for large telescopes designed by Sedmak, 27 and is implemented in MATLAB. A master phase screen is created initially, from which the individual phase screens that travel continuously across the primary aperture (frozen flow) are generated. The frozen flow is generated based on the number of frames, the wind speed and the number of frames per second. The simulated input wavefront has not been corrected for tip-tilt. The frames generated from the phase screen are used to create the WFS pixel inputs, assuming Fried geometry, as described by Herrmann. 28 The PCIe interface of the VC-709 development board is used to send simulated pixel inputs from the host PC to SPARC and to retrieve the phase outputs generated by SPARC. At present, the simulator accounts for fitting error and time delay error. Other error sources can easily be incorporated in future revisions. The benchmarking carefully excludes the performance of the PC interfaces involved, and only [...]
iRobo-AO interface
The iRobo-AO system is a robotic AO system due to be installed at the Inter-University Centre for Astronomy and Astrophysics (IUCAA) Girawali Observatory shortly. The first Robo-AO was installed at the Palomar P60 telescope 9 as well as the Kitt Peak 2m telescope, 10 and has been used for several years now to take thousands of observations at high resolution. The Robo-AO WFS consists of 97 illuminated subapertures and 120 DM actuators. The iRobo-AO system is virtually 
5 Results

Results from hardware-in-the-loop simulation
The AO correction is validated by comparing the theoretical RMS wavefront error (WFE) 29 due to a combination of fitting error and time delay error with the actual RMS WFE produced by the hardware-in-the-loop simulation. The AO system is assumed to adhere to a strict Fried geometry, so that $(n+1)^2$ phase values are generated for $n^2$ subapertures.
The results reported here are obtained with a Fried parameter ($r_0$) of 15 cm, a mean wind velocity [...]
where $D$ is the aperture diameter, $r_0$ is the Fried parameter and $n^2$ is the number of independently controlled actuators in the AO system. Under the Fried geometry assumption, the number of actuators along a row is one more than the number of subapertures along that row. The number of subapertures for each aperture diameter is chosen such that the ratio $D/n^2$ ranges between 10 and 20 cm for all observations. The time delay error is constant because the modelled loop frequency is constant, which keeps the combination of fitting and time delay error similar across the observations in Table 1.
The mean RMS value of the uncorrected wavefront (the sum of the RMS values of all the wavefronts divided by the total number of WFS frames) is about 20-30% of the theoretically estimated value, 29 because of undersampling of the Von Karman spectrum at low frequencies. Since the number of samples in the frequency domain is the same as that of the phase screen in the spatial domain (equal to the number of actuators), the undersampling error is inversely proportional to the number of subapertures for which the phase screen is simulated. In the simulation, we have used uniform sampling without any sub-sampling below the lowest frequencies. A lower RMS value of the residual wavefront is obtained for smaller numbers of subapertures, because the input wavefront becomes more undersampled as the number of subapertures decreases. With this taken into account, the mean RMS values of the residual wavefronts at the different subaperture sizes are well within the theoretical estimates of the error budget due to fitting error and time delay error. Table 2 shows the comparison between the standard deviations of the RMS values of the uncorrected wavefront and the residual wavefront.
AO Reconstruction time
The time taken for AO reconstruction of a WFS frame is measured from when the first pixel arrives at the input of SPARC to when the last actuator value is sent out to the host PC. As explained in Section 4, care has been taken to exclude the time taken for the WFS pixels and the DM actuator values to pass through the host PC interfaces. Fig. 7 shows the histogram of the variability in the AO reconstruction time. The results in Fig. 7 are obtained at 4×4 pixels per subaperture. The variation between the slowest and fastest frame is within a few tens of microseconds, and is caused solely by the unpredictable latency of the external DDR3 memory. All processes other than fetching the reconstruction matrix from the external memory are deterministic and repeatable across frames.
For the largest number of subapertures (50 × 50) that we simulated SPARC with, 1 ms out of a total AO reconstruction time of 1.283 ms was spent retrieving the reconstruction matrix from the DDR3 memory. For all the subaperture values that we tested SPARC with, the time taken for reconstruction matrix retrieval formed a very significant percentage of the total reconstruction time. A comparison between the mean reconstruction time and the time taken for fetching the reconstruction matrix from an ideal DDR3 memory without any unpredictable latencies is shown in Part 2 22 of this paper.
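The order of magnitude of the retrieval time can be checked against the peak memory bandwidth of the board. The Python sketch below is a rough estimate under stated assumptions: the matrix shape follows the Fried geometry expression $(n+1)^2 \times 2n^2$ from Section 3, the 25.6 GB/s figure is the combined DDR3 bandwidth quoted for the board, and the 2 bytes per element is an assumption carried over from the estimate in Section 2 rather than a documented property of the implementation. The function names are hypothetical.

```python
def fried_matrix_shape(n):
    """Reconstruction matrix shape for n x n subapertures in Fried geometry."""
    return (n + 1) ** 2, 2 * n * n


def fetch_time_s(n, bytes_per_element, bandwidth_bytes_per_s):
    """Ideal time to stream the whole reconstruction matrix once,
    ignoring refresh, row activation and other memory latencies."""
    rows, cols = fried_matrix_shape(n)
    return rows * cols * bytes_per_element / bandwidth_bytes_per_s


DDR3_PEAK = 25.6e9  # bytes/s, combined figure quoted for the two DDR3 modules

for n in (11, 50):
    t = fetch_time_s(n, 2, DDR3_PEAK)  # 2 bytes per element is an assumption
    print(f"{n} x {n} subapertures: {t * 1e6:.1f} us")
```

For 50 × 50 subapertures this ideal estimate comes to roughly 1.02 ms, close to the measured 1 ms, which supports the statement that the matrix fetch dominates the reconstruction time. Passing a different bandwidth value to `fetch_time_s` gives the corresponding estimate for other memory technologies.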
Results from iRobo-AO interface
The multithreaded operation of the pixel thread and the phase thread at the host PC minimized the latency between the WFS and the DM of iRobo-AO. [...] for the data to pass through the interfaces comes to less than 40 ns per frame. The AO loop time is much higher because of the low WFS frame rate of around 550-600 frames/second. The interfaces for the PCIe, the WFS and the DM are outside the scope of the SPARC implementation. They are part of the testing platform, and their performance was not benchmarked or improved upon. The same latencies would be present in a GPU implementation, along with additional internal memory and processor-to-processor latencies. 30 We were using off-the-shelf PCIe software, where the data sent through the interface had to be checked for integrity and resent if it was erroneous. This resulted in the three groups of AO loop times (in Fig. 8b), according to the number of times the data had to be resent through the interface to preserve data integrity. [...] Fig. 9b) is about 2.3 nm (1.2% of the mean of the RMS value). When the resistor was switched off after 60,000 frames, or about 110 seconds into the observation (in Fig. 9c), the standard deviation in the RMS WFE is about 4.4 nm (2.8% of the mean of the RMS value).
6 Summary
SPARC is a pathfinder for FPGA-based real-time kernels for the realistic AO systems needed by the next generation of extremely large telescopes, which will require MCAO and Extreme-AO. The proof-of-concept implementation uses an inexpensive commercial FPGA development board to perform a full AO reconstruction for a 2601 × 2500 matrix (for 50 × 50 subapertures) in a median time of 1.283 ms (of which 1 ms is taken by the retrieval of the reconstruction matrix from the external DDR3 memory). Ported to already available hardware with faster FPGAs and memories, the platform can meet the requirements of challenging AO implementations. FPGAs can compete with GPUs in terms of memory bandwidth and computational performance.
The system is designed to be modular so that the slope computation algorithm can be replaced by alternative methods (matched filtering, weighted CoG etc.) while preserving the parallel computational capabilities of the WPU. We have tested the system up to 50×50 subapertures for a single-channel WFS, but the system is scalable to any subaperture size and any DDR type of memory.
The system is limited only by the memory bandwidth and the logic resources available on the FPGA. The advent of serial memories like the high bandwidth memory or HBM (explained in Part 2 22 ) increases the memory bandwidth limits of today's FPGAs by a factor of 16-18 compared to our development board. The Xilinx VU35P or the VU37P (commercially available today) will not need an external memory to store the reconstruction matrix, and will be able to provide a theoretical memory bandwidth of 460 GB/s. 31 With the proportional increase in logic resources in these Xilinx chips, they can be expected to perform the AO reconstruction for a 50×50 subaperture frame in less than 100 µs. GPUs and CPUs will still be better for conventional AO implementations on large telescopes in the near future, but for applications which demand a predictable latency (like Extreme-AO), the HBM-FPGA implementation will provide a better solution. Future versions of SPARC will be refined to address the computational needs of a thirty-meter-class telescope with a shorter development cycle than that of conventional AO kernels. The future version of SPARC is also planned to have the interface to connect directly to different WFS camera interfaces (like IEEE1394b/Firewire, CameraLink, USB 3.0, 10GbE or even a PCIe backplane).

2 Schematic of phase screen generation, based on the method outlined by Sedmak. 27 The inverse discrete Fourier transform (DFT) of the product of a random skew-Hermitian matrix (with a mean of 0 and variance of 1) with the Von Karman power spectrum produces the required phase screen for any aperture diameter.
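The spectral method summarized in this caption can be sketched in a few lines of Python. The block below is an illustration only and is not the MATLAB simulator used in the paper: it uses the widely used simplification of filtering unconstrained complex Gaussian noise and keeping the real part of the result, instead of building a symmetry-constrained random matrix; it uses a naive inverse DFT so that it needs nothing beyond the standard library; and the outer scale `L0`, the grid size and the function names are assumptions. Frozen-flow translation of the screen is not included.

```python
import cmath
import math
import random


def von_karman_psd(f, r0, L0):
    """Von Karman phase power spectral density at spatial frequency f (1/m)."""
    return 0.023 * r0 ** (-5.0 / 3.0) * (f * f + 1.0 / (L0 * L0)) ** (-11.0 / 6.0)


def inverse_dft_sum(values):
    """Unnormalized inverse DFT of a 1-D sequence (naive O(N^2) sum)."""
    n = len(values)
    return [sum(v * cmath.exp(2j * math.pi * k * m / n) for k, v in enumerate(values))
            for m in range(n)]


def phase_screen(n, width_m, r0, L0, seed=0):
    """Return an n x n phase screen (radians) covering width_m metres."""
    rng = random.Random(seed)
    df = 1.0 / width_m                    # frequency grid spacing (1/m)
    spectrum = []
    for i in range(n):
        fy = (i if i < n // 2 else i - n) * df
        line = []
        for j in range(n):
            fx = (j if j < n // 2 else j - n) * df
            amp = math.sqrt(von_karman_psd(math.hypot(fx, fy), r0, L0)) * df
            # Complex Gaussian noise (mean 0, unit variance per component),
            # shaped by the square root of the power spectrum.
            line.append(complex(rng.gauss(0, 1), rng.gauss(0, 1)) * amp)
        spectrum.append(line)
    spectrum[0][0] = 0j                   # drop the piston (zero-frequency) term
    # Separable 2-D inverse transform: along rows first, then along columns.
    rows_done = [inverse_dft_sum(line) for line in spectrum]
    cols_done = [inverse_dft_sum(list(col)) for col in zip(*rows_done)]
    return [[cols_done[j][i].real for j in range(n)] for i in range(n)]


screen = phase_screen(n=8, width_m=2.0, r0=0.15, L0=30.0, seed=1)
mean = sum(sum(line) for line in screen) / 64
print(f"screen size: {len(screen)} x {len(screen[0])}, mean phase: {mean:.2e} rad")
```

Because the lowest non-zero frequency on the grid is 1/`width_m`, power at larger scales is not represented, which is the low-frequency undersampling discussed in the results. For realistic grid sizes an FFT routine would replace the naive transform used here.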
3 Hardware-in-the-loop simulation for SPARC, which shows the PC generating the simulated pixels of a WFS. After SPARC generates the phase outputs on the FPGA side and sends them to the PC, the same simulator generates the residual wavefront and the next frame of WFS pixels to be sent to SPARC.
Time series of observations in the lab: a) when no turbulence was induced, b) when the power supply to the resistance was switched off at 45,000 frames, c) when the power supply to the resistance was switched off at 60,000 frames.
Tables
1 AO correction
