The SKA (Square Kilometre Array) radio telescope will become the most sensitive telescope by correlating a large number of antenna nodes to form a giant antenna array. The data generated by such a large number of antenna nodes will pose a huge storage problem and require real-time processing to make the best use of the data, which makes SKA scientific data processing the bottleneck of the whole processing flow. However, existing high-performance CPU- and GPU (Graphics Processing Unit)-based solutions cannot satisfy the performance and power budget requirements well [1]. Considering the high energy efficiency of hardware accelerators and the flexibility and cost of prototype design, in this paper we explore an FPGA (Field Programmable Gate Array)-based prototype of one of the most computationally demanding procedures in SKA scientific data processing: degridding. Through an analysis of the algorithm behavior and bottlenecks, we design and optimize the memory architecture and computing logic of the FPGA-based prototype. In addition, by considering the relations between the data required for processing multiple spectral channels, we reuse the shared data when processing neighboring spectral channels, which further improves performance. The functionality and performance of our design have been verified on the target FPGA board, and software-based benchmarks were also measured on comparable CPU and GPU platforms. The results indicate that the FPGA-based prototype achieves speedups of 2.74 times and 2.03 times, and 7.64 times and 7.42 times the energy efficiency, compared with the MPI (Message Passing Interface)-based CPU benchmark and the CUDA (Compute Unified Device Architecture)-based GPU benchmark, respectively.
INDEX TERMS FPGA, gridding/degridding, scientific data processing, square kilometre array.
I. INTRODUCTION
Since the 1960s, astronomy has produced many amazing results. The most brilliant and ground-breaking astronomical discoveries increasingly depend on the collaboration of large astronomical science facilities and on the analysis and mining of huge volumes of scientific data. The SKA (Square Kilometre Array) is a multinational project to design and build a new-generation radio telescope operating at metre to centimetre wavelengths. The construction of the SKA telescopes and sites started in 2018, and it will be divided into two phases: Phase 1 (SKA1) and Phase 2 (SKA2). At present, we mainly focus on Phase 1 (SKA1). SKA1 consists of two components, namely a low-frequency aperture array (SKA1 low) and a mid-frequency dish array (SKA1 mid). SKA1 low will have a collecting area of 0.4 km², and it will consist of 130,000 dipoles grouped into approximately 512 stations, while SKA1 mid will consist of a 150-km array with a collecting area of 33,000 m², the result of 133 15-m diameter SKA1 mid dishes and 64 13.5-m diameter dishes from the MeerKAT telescope [2], [3]. The total collecting area of the SKA will be well over one square kilometre, and it will be the world's largest radio telescope [4].
The associate editor coordinating the review of this manuscript and approving it for publication was Songwen Pei.
The amount of sensor data collected will pose a huge storage problem and require real-time signal processing to make the best use of the data. It was estimated that the array could generate an exabyte of raw data a day [5], which makes SKA scientific data processing (SDP) a bottleneck of the whole processing flow. On top of the performance challenge, SKA1-SDP (SKA phase 1 SDP) solutions are further bounded by a tight power budget [1], which is much lower than that of existing multi-core CPU- and GPU-based supercomputers. This situation requires a high-performance design with acceptable power dissipation.
Hardware-based accelerators can provide much higher energy efficiency than software-based generic computing architectures. Considering the flexibility and cost of prototype design [6], [7], we decided to use an FPGA (Field Programmable Gate Array) to implement the prototype of the scientific data processing algorithm before moving to an ASIC (Application Specific Integrated Circuit)-based prototype design. In this paper, we explore an FPGA-based prototype of the key algorithm, degridding, which is one of the most computationally demanding procedures in SKA1-SDP [8]. The main contributions of this work are as follows:
- Through the analysis of the memory size requirements and the read/write behaviour of degridding, we customize the memory architecture and access strategy in the FPGA-based prototype design, so that it gives a better balance between memory size requirements and access performance.
- On the basis of the access bandwidth of the memory architecture, we customize the parallel and pipelined computing logic of the kernel convolution operation. We use a tree structure to decrease the latency of the complex multiply-accumulate operations while taking low-power design into account.
- Through the analysis of the relations between the data required for processing multiple spectral channels, we reuse the shared data to reduce the data size of memory access and further improve the overall performance.

The rest of this paper is organized as follows. Section II introduces the background and related works. Section III analyzes the algorithm behavior and bottlenecks of degridding. Section IV illustrates the hardware architectures and the optimizations of memory and computing logic. Section V presents the results of the FPGA-based prototype and compares it with existing implementations on other platforms. We draw a conclusion and present future work in Section VI.
II. BACKGROUND AND RELATED WORKS
A. IMAGING PHASE IN RADIO TELESCOPE DATA PROCESSING
UV imaging is the key processing phase in the SKA-SDP flow, and the steps of imaging are presented in Figure 1. Telescopes often combine data from multiple antennas to increase the sensitivity and resolution of images [9], and a Measured Visibility is the integration of the samples from an antenna pair. The Measured Visibilities are processed independently for different spectral frequency ranges (so-called image channels). The imaging phase typically starts from an empty sky model, and after the "imaging" (gridding and iFFT) processing of the visibilities, a residual image is formed. However, the residual image masks many interesting weak sources. The Clean algorithm [10], [11] then extracts one or more bright sources and adds them to the sky model. For these sources, the visibilities are "predicted" with the inverse process (FFT and degridding) and subtracted from the input to reveal fainter sources. This process is repeated until the sky model converges [12], [13].
Among the steps of imaging, gridding and degridding are two of the most computationally demanding, and their roles are described here. The Measured Visibilities are not sampled on a grid, so in order to reconstruct the target image with the iFFT, we have to interpolate the Measured Visibilities onto a grid. The step of mapping irregular visibilities onto a grid is called gridding. The degridding step is the inverse process of gridding, and it predicts the Model Visibilities from a grid that is the result of the FFT.
B. RELATED WORKS
Convolution is widely used in image processing, for example in edge detection, image smoothing, and image blurring. In image processing, the convolution operation slides a fixed convolution kernel consecutively over an image, and each pixel of the output image is the accumulation of the weighted pixels of the input image overlapped by the convolution kernel. There are many studies on the high-performance implementation of convolution, such as [14]-[16]. However, the convolution in degridding is different from the convolution of an image. Image convolution scans the image consecutively with a single convolution kernel, whereas the convolution in degridding relies on a large set of samples. The convolution operations of different samples require different convolution kernels, and the area of the matrix overlapped by the convolution kernel is less predictable because it depends on the coordinates of each sample. These features of degridding convolution cause a totally different access pattern and computational inefficiency compared with image convolution.
At present, the widely used gridding/degridding approaches for creating wide-field images include W-projection, W-stacking, A-projection, and AW-projection [12]. W-projection [17] corrects for the W-term through a convolution in Fourier space, and it uses different convolution matrices for different values of w. W-stacking [18] uses a w-plane dependent grid, which differs from the w-plane dependent convolution function in W-projection, and the samples are gridded onto the grid corresponding to the nearest w-plane. A-projection [19] corrects for the A-term in a similar way to the W-term correction. AW-projection [20] corrects both the A-term and the W-term. Most state-of-the-art astronomy software packages use one or more of the above gridding/degridding approaches. For example, LOFAR's AWImager [19] and CASA [21] use AW-projection and W-projection, respectively. WSClean [22] uses W-stacking only. A-term correction is supported in WSClean, but the performance is much lower when this feature is enabled.
In order to enhance the performance of gridding/degridding processing, several works have studied high-performance implementations of gridding/degridding approaches. Humphreys and Cornwell [23] studied CPU and GPU (Graphics Processing Unit) acceleration of W-projection gridding/degridding for radio-astronomy imaging in the ASKAP (Australian Square Kilometre Array Pathfinder), and they made detailed comparisons on CPU and GPU platforms to estimate the performance and compute resource requirements for real-time imaging. Romein [9] proposed a GPU-based work-distribution strategy for W-projection gridding, which accumulates data in on-chip registers rather than in off-chip memory and keeps the number of expensive off-chip memory accesses very low. This work was further improved by Merry [24], who applied thread coarsening to improve the efficiency of gridding and observed performance gains for single-polarization gridding and quad-polarization gridding on the target GPU. Muscat [25] observed that in some situations, especially for short baselines, the positions of two neighboring visibilities are the same on the higher-resolution grid, so each visibility does not need to be gridded independently. This characteristic can be used to reduce the number of visibilities to grid while still producing consistent results. Veenboer et al. [12] presented the first implementations of the image-domain gridding (IDG) algorithm for CPU and GPU, and their roofline analysis shows that their parallelization optimizations and approaches achieve nearly optimal performance of the IDG algorithm on CPU and GPU.
These works all concentrate on CPU/GPU-based accelerators for gridding/degridding approaches. Veenboer and Romein [26] studied an FPGA-based implementation of IDG gridding/degridding, and their work compares the performance and energy efficiency of FPGA-, GPU- and CPU-based implementations of IDG.
The results show that, with identical theoretical peak performance, the FPGA and GPU perform much better than the CPU and consume significantly less power. In absolute terms, the GPU is the fastest and most energy-efficient device, mainly due to its support for the sine/cosine operations of IDG using dedicated hardware. IDG is a new method for gridding/degridding and has a totally different algorithm architecture from the widely used W-projection gridding/degridding. In the IDG algorithm, neighbouring visibilities are first gridded onto so-called subgrids, after which the subgrids are Fourier transformed and added to the full grid. One shortcoming of faceting images is that the generated images often suffer from the edge effect induced by misalignment of image brightness scales in adjacent facets. The edge effect and noise interference reduce the imaging quality, and the approach has not been widely used for actual radio-astronomical imaging. The image created by W-projection has higher quality and a lower noise level than that obtained by the faceting technique. To the best of our knowledge, there are no studies of FPGA-based W-projection gridding/degridding accelerators for creating wide-field images, except for our previous work on a small FPGA-based prototype of W-projection gridding, which was published in FCCM 2017 [27]. Our previous work only focuses on the computing logic optimization of gridding, and offers no insights into hierarchical storage, memory access strategy, or computing logic optimization when scaling out the model and sample size. Besides, few degridding accelerators have been presented, even for GPUs, and in this work we focus on accelerating the data processing of degridding. We study the memory architecture, computing logic, and data reuse in an FPGA-based W-projection degridding design with a scaled-out model and sample size.
III. ALGORITHM BEHAVIOUR AND BOTTLENECKS ANALYSIS
In this paper, we refer to the degridding benchmark [28], [29] used in radio-astronomy imaging for ASKAP, which is one of the precursor telescopes of the SKA. The algorithm description of degridding is depicted in Figure 2. The input data are a two-dimensional grid array and the coordinate parameters of the samples. The weights for convolution, in other words the convolution coefficient array, are generated by the W-projection algorithm, whose details can be found in [17]. First, the subgrid and the convolution kernel for a sample are extracted from the grid array and the convolution coefficient array using the coordinate parameters of the sample. Then the subgrid and the convolution kernel are complex multiply-accumulated into a visibility. Finally, after iterating over all samples, all the visibilities have been predicted and can be compared with the measured visibilities.
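The three steps described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the benchmark code: the names `Sample`, `iu`, `iv`, `c_offset`, and `degrid` are hypothetical, the grid is assumed to be a list of rows of complex numbers, and the coefficient array is assumed to be a flat list in which the rows of one kernel are stored back to back.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    iu: int        # column of the first required grid point (hypothetical name)
    iv: int        # row of the first required grid point (hypothetical name)
    c_offset: int  # index of the first required coefficient (hypothetical name)


def degrid(grid, coeffs, samples, s_size):
    """Predict one visibility per sample by multiply-accumulating an
    s_size x s_size subgrid with the matching convolution kernel."""
    visibilities = []
    for s in samples:
        acc = 0j
        for row in range(s_size):
            # Assumed layout: kernel rows are contiguous in the flat array.
            k_base = s.c_offset + row * s_size
            grid_row = grid[s.iv + row]
            for col in range(s_size):
                acc += grid_row[s.iu + col] * coeffs[k_base + col]
        visibilities.append(acc)
    return visibilities


if __name__ == "__main__":
    # Tiny example: a 4x4 grid and a 2x2 kernel of equal weights.
    grid = [[1 + 1j] * 4 for _ in range(4)]
    coeffs = [0.25] * 4
    print(degrid(grid, coeffs, [Sample(iu=1, iv=1, c_offset=0)], s_size=2))
```

The inner double loop is the sSize*sSize complex multiply-accumulate that the rest of the paper sets out to parallelize.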
A floating-point multiply-accumulate operation needs 3 memory reads and 1 memory write. With a pipelined implementation of the multiply-accumulate computation, the number of memory accesses can be decreased to approximately 2*N, where N is the number of multiply-accumulate operations. There are 2*N floating-point additions/multiplications in N multiply-accumulate operations, so the number of floating-point operations is of the same order of magnitude as the number of memory reads and writes. In terms of memory storage, the data size of the grid array and the coefficient array is much larger than the available on-chip memory, so we have to store them in off-chip DDR memory and read the data required by each sample in real time. However, the theoretical bandwidth of DDR4 memory is about 20 GB/s, which means that 16 32-bit floating-point numbers are transferred per transfer cycle at a user frequency of 300 MHz. The computation of one sample involves sSize*sSize (sSize is 128 in the benchmark) complex multiply-accumulate operations, and even with a small degree of computing parallelism, the required bandwidth exceeds the theoretical bandwidth. For each sample, we have to read 4 data blocks, each of which is an sSize*sSize floating-point array, and reading these data becomes a bottleneck in degridding processing. Optimizations that improve bandwidth utilization and decrease bandwidth requirements will therefore significantly increase the performance of degridding. Besides, the sSize*sSize complex multiply-accumulate operations of a sample offer massive parallelism, and in the design of the parallel computing logic we should also consider the limitation of memory bandwidth to achieve a better balance between performance and resource usage.
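The bandwidth figures above can be checked with a short back-of-the-envelope calculation. The sketch below is illustrative only: the function names are hypothetical, and the assumption that each complex multiply-accumulate consumes four 32-bit operands per cycle (one from each of the 4 data blocks) is an interpretation of the text rather than a number the source states.

```python
BYTES_PER_FLOAT = 4  # 32-bit floating-point numbers


def interface_bandwidth_gb_s(floats_per_cycle, freq_hz):
    """Bandwidth delivered when floats_per_cycle 32-bit values arrive each cycle."""
    return floats_per_cycle * BYTES_PER_FLOAT * freq_hz / 1e9


def max_parallel_macs(floats_per_cycle, operands_per_mac):
    """How many multiply-accumulates per cycle the interface can keep fed."""
    return floats_per_cycle // operands_per_mac


if __name__ == "__main__":
    freq_hz = 300e6          # user frequency quoted in the text
    floats_per_cycle = 16    # 16 x 32-bit values per transfer cycle
    print(interface_bandwidth_gb_s(floats_per_cycle, freq_hz), "GB/s")
    # Assumption: 4 operands per complex multiply-accumulate.
    print(max_parallel_macs(floats_per_cycle, operands_per_mac=4), "MACs per cycle")
```

Under these assumptions a single DDR interface delivers 19.2 GB/s, consistent with the "about 20 GB/s" quoted above, and can feed only a handful of multiply-accumulate units per cycle, which is why memory access rather than arithmetic limits the design.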
IV. HARDWARE ARCHITECTURES & OPTIMIZATIONS
A. KEY ARITHMETIC UNIT
The key arithmetic operation in the degridding algorithm is the floating-point complex multiply-accumulate. There are no IP (Intellectual Property) cores for operations on floating-point complex numbers in FPGA design (Xilinx only provides a fixed-point complex multiplier IP [30]). Besides, because the imaginary parts of the points in the coefficient array are zero, using a complex multiplier IP would waste resources in our application. We therefore implement the floating-point multiply-accumulate operation on complex numbers as operations on the corresponding real and imaginary parts, using the floating-point IP cores provided by Xilinx [31]. The complex multiply-accumulate is equivalent to the combination of floating-point operations depicted in Equation 1. In degridding, only the sample values and the grid points are complex numbers, while the imaginary parts of the points in the coefficient array are zero. Therefore, Equation 1 can be simplified to Equation 2. The arithmetic unit for the floating-point complex multiply-accumulate in degridding is shown in Figure 3, and it only uses 2 floating-point multipliers and 2 floating-point adders.
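For reference, the standard identity behind a complex multiply-accumulate can be written out as follows. The symbols are assumptions introduced here and may differ from the notation of Equations 1 and 2: $V$ is the accumulated visibility, $G$ a grid point, and $C$ a coefficient, with subscripts $\mathrm{re}$ and $\mathrm{im}$ for real and imaginary parts. The general form needs 4 multiplications and 4 additions/subtractions:

```latex
\begin{aligned}
V_{\mathrm{re}} &\leftarrow V_{\mathrm{re}} + G_{\mathrm{re}} C_{\mathrm{re}} - G_{\mathrm{im}} C_{\mathrm{im}}, \\
V_{\mathrm{im}} &\leftarrow V_{\mathrm{im}} + G_{\mathrm{re}} C_{\mathrm{im}} + G_{\mathrm{im}} C_{\mathrm{re}}.
\end{aligned}
```

Setting $C_{\mathrm{im}} = 0$, as stated for the coefficient array, removes two products and two additions:

```latex
\begin{aligned}
V_{\mathrm{re}} &\leftarrow V_{\mathrm{re}} + G_{\mathrm{re}} C_{\mathrm{re}}, \\
V_{\mathrm{im}} &\leftarrow V_{\mathrm{im}} + G_{\mathrm{im}} C_{\mathrm{re}}.
\end{aligned}
```

The simplified form uses exactly 2 multiplications and 2 additions, which matches the resource count given for the arithmetic unit.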
B. PRELIMINARY HARDWARE ARCHITECTURE
The scale of the key parameters in the degridding benchmark is shown in Table 1. The memory sizes required to store the grid array and the coefficient array are 128 MB and 134 MB, respectively. However, the on-chip memory size of our target FPGA chip (XCVU9P-L2FSGD2104E) is only about 43.25 MB, so we have to store the grid array and the coefficient array in off-chip DDR memory.
The computation of each sample only needs a subset of the points (sSize * sSize) in the grid array and the coefficient array. First, the index positions of the first required point in the grid array and in the coefficient array are computed in real time from the coordinate parameters of the sample, and the remaining points are determined from the index position of the first point. Then, in our preliminary design, the multiply-accumulate unit reads a grid point, a coefficient array point, and the previously accumulated value from DDR. After the multiply-accumulate operation, it writes the latest accumulated value back to DDR. This process iterates sSize * sSize times for each sample. The architecture of the preliminary design is shown in Figure 4. The grid array and the coefficient array are stored in DDR memory, the index position computation module provides the addresses for memory access, and only a single multiply-accumulate unit is instantiated.
We test our design and compare it with a CPU-based benchmark, and the performance results are presented in Table 2. The performance of the CPU-based benchmark with one CPU core is about 11.74 times better than that of our preliminary design. The reasons for this sub-optimal result are as follows.
(1) The memory accesses between iterations within a sample are random, so the design does not make full use of the bandwidth of the DDR memory.
(2) Only a single multiply-accumulate unit is instantiated and no parallelism is applied. The multiply-accumulate operations of successive iterations are performed independently, which causes extra memory accesses for the accumulated value.
On the basis of preliminary design, in the following subsections, we will optimize the design of memory access to make the best use of bandwidth, and optimize the computation logic to achieve the best performance with the consideration of the provided bandwidth.
C. MEMORY ARCHITECTURE AND ACCESS OPTIMIZATION
When computing each sample, the addresses of the required points in the grid array and in the coefficient array have different degrees of contiguity, as described in Figure 5. The areas colored red and orange are the required points of the grid array and the coefficient array for different samples, respectively. In terms of memory addresses, the points in the colored area of the grid array are partially contiguous: the 128 (sSize) points in each row are contiguous. The points in the colored area of the coefficient array are totally contiguous. Table 3 presents the DDR bus efficiency under different workloads (taken from the Xilinx Product Guide [32]). To enhance the performance of memory access, we should read/write data from/to contiguous addresses as far as possible to increase the bandwidth utilization ratio. The multiply-accumulate operations require both grid and coefficient points, and the computation starts when the required points of the grid array and the coefficient array have both arrived. Considering the partial contiguity of the required points in the grid array, we read a batch of 128 points from both the grid array and the coefficient array.
When computing each sample, there are 3 main data blocks (the subarray of the coefficient array, the real part of the grid array, and the imaginary part of the grid array) that need to be transferred between the DDR memory and the FPGA chip. The data size of the coordinate parameters of the sample is small and can be ignored in the performance analysis of memory access. Our target FPGA board has 4 independent DDRs. To maximize the bandwidth of data transfer, we use 3 independent DDRs to store these datasets and read the required points simultaneously. Besides, the DDR memory controller on our target board supports a 512-bit user interface. In the preliminary design, we only use a 32-bit user interface to transfer the data, so the bandwidth is severely underutilized. To maximize the performance of the DDR memory, we widen the user interface to make better use of the bandwidth and read 16 points (16 * 32-bit = 512-bit) at a time. The optimized memory architecture of degridding processing is shown in Figure 6, and these optimizations dramatically improve the bandwidth utilization of the off-chip memory.
D. COMPUTING LOGIC OPTIMIZATION
The kernel computation of each sample in degridding is sSize * sSize (128 * 128) complex multiply-accumulate operations. There are two ways of parallel computing logic design. The first approach is that with the consideration of the partial continuity discussed in the last section, we customize the computing logic with 128 complex multiply-accumulates, and it will be iterated 128 times in a sample computing. A pipeline implementation of multiply-accumulates in computing a line (128 data points) is shown in Figure 7 (The figure only presents the computing of the real part; the computing of the imaginary part is similar), and it requires 128 stages of floating-point addition.
To decrease the latency of computing a line, we use a tree structure to implement the complex multiply-accumulate operations, as presented in Figure 8 (the figure only presents the computation of the real part). The tree structure has 1 stage of multiplication and 8 stages of addition, which is much shorter than the 128 stages of addition presented in Figure 7. The resource usage of floating-point multipliers and adders in the tree structure is nearly the same as in the linear structure. Besides, only the front part of the tree structure is iterated, sSize times per sample, while the remaining part operates only once per sample. Compared with iterating the whole tree structure, this reduces the amount of logic that toggles during the computation of a sample and decreases the overall power dissipation.
The other approach takes the provided bandwidth of the DDR memory into account: each time, we process the data transferred from the DDR memory in one cycle. The user interface of the MIG (Memory Interface Generator) is 512 bits wide, so we can read 16 points (16 * 32-bit = 512-bit) at a time. We therefore only customize the computing logic with 16 complex multiply-accumulates, and the implementation of the tree structure is similar to the one discussed in the first approach.
When all read/write operations on the different data blocks are performed through a single memory controller, the first approach has a higher DDR efficiency than the second one, because the different data blocks for computing a line are accessed consecutively, but it consumes much more computing logic. If we used the strategy of the second approach with a single controller, the memory controller would frequently switch between short read operations on different data blocks, and the DDR efficiency would decrease significantly. However, in our design, we use independent DDRs to store the different datasets and read the required points simultaneously. In this way, the independent memory controllers do not need to switch between read operations on different data blocks, and the efficiency of the DDRs does not decrease. Besides, the second approach consumes much less computing logic than the first approach, so we use the second approach to design the parallel computing logic.
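The latency argument for the tree structure in this subsection can be illustrated with a small Python model of a pairwise adder tree. It is a sketch under assumptions, not the hardware design: `tree_reduce` and `line_mac` are hypothetical names, and the model counts only the levels of the pairwise reduction.

```python
def tree_reduce(values):
    """Sum values with a pairwise adder tree.

    Returns (total, levels), where levels is the number of adder stages
    on the critical path of the tree.
    """
    level = list(values)
    if not level:
        return 0.0, 0
    levels = 0
    while len(level) > 1:
        if len(level) % 2:
            level.append(0.0)  # pad so every adder has two inputs
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        levels += 1
    return level[0], levels


def line_mac(grid_line, coeff_line):
    """One multiplier stage followed by the adder tree, for one batch of points."""
    products = [g * c for g, c in zip(grid_line, coeff_line)]
    return tree_reduce(products)


if __name__ == "__main__":
    for width in (128, 16):
        total, levels = line_mac([1 + 2j] * width, [0.5] * width)
        print(width, "inputs:", levels, "adder levels, sum =", total)
```

For 128 inputs the pairwise tree has 7 levels and for 16 inputs it has 4, compared with a chain whose length grows linearly with the number of inputs. The eighth addition stage reported for the 128-input design is presumably the accumulation across lines; that reading is an interpretation, not something the sketch reproduces.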
E. DATA REUSE IN PROCESSING DATA OF MULTIPLE SPECTRAL CHANNELS
Through the above optimizations of the memory architecture and computing logic, we can use the provided bandwidth efficiently and achieve good performance. If the bandwidth requirement of degridding can also be reduced, the performance will improve further. In practice, there are many spectral channels, and the samples of different spectral channels are related: corresponding samples in neighboring channels have the same or approximately the same mapping area on the grid array. Therefore, we only need to transfer the required sSize * sSize grid points once when computing the corresponding samples of neighboring channels, and the amount of data read from DDR memory is reduced accordingly.
Our target board has 4 independent DDRs. After the above optimization of the memory architecture, only 3 DDRs are used for data transfer. If 2 neighboring spectral channels are processed together, all 4 DDRs can be fully used. When processing 2 neighboring spectral channels together, besides reading the data required for the samples of the 1st channel, we only need to read the coefficient points for the corresponding sample of the 2nd channel, which shares the same grid points with the sample of the 1st channel. We place the real part of the grid, the imaginary part of the grid, the coefficient array for the 1st channel, and the coefficient array for the 2nd channel in the 4 DDRs, as presented in Figure 9. In this way, the provided bandwidth of the 4 DDRs is fully used, and we can read the data required by 2 samples in one transfer period (a transfer period transfers sSize * sSize 32-bit floats from each DDR), which is the same as the theoretical transfer time of the data required for processing a single sample. We duplicate the computing logic and local memory of one channel so that the 2 neighboring channels can be processed in parallel. The theoretical performance of processing 2 channels together is about 2 times that of processing 1 channel at a time.
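The effect of sharing grid data between two neighboring channels can be quantified with a short calculation. The sketch below is illustrative: the function names are hypothetical, and it assumes that each DDR delivers exactly one sSize * sSize block of 32-bit floats per transfer period, as described above.

```python
BYTES_PER_FLOAT = 4  # 32-bit floats


def block_bytes(s_size):
    """Size of one s_size x s_size block of 32-bit floats."""
    return s_size * s_size * BYTES_PER_FLOAT


def bytes_for_channel_pair(s_size, reuse_grid):
    """Bytes read from DDR for one sample in each of two neighboring channels."""
    # Grid real + imaginary parts: fetched once if shared, otherwise per channel.
    grid_blocks = 2 if reuse_grid else 4
    coeff_blocks = 2  # one convolution kernel per channel
    return (grid_blocks + coeff_blocks) * block_bytes(s_size)


def transfer_periods(blocks, ddrs):
    """Transfer periods needed when each DDR delivers one block per period."""
    return -(-blocks // ddrs)  # ceiling division


if __name__ == "__main__":
    s_size = 128
    print("without reuse:", bytes_for_channel_pair(s_size, False), "bytes")
    print("with reuse:   ", bytes_for_channel_pair(s_size, True), "bytes")
    # One channel: 3 blocks on 3 DDRs. Two channels with reuse: 4 blocks on 4 DDRs.
    print("periods per sample (1 channel):", transfer_periods(3, 3))
    print("periods per sample pair (2 channels):", transfer_periods(4, 4))
```

With sSize = 128, reuse cuts the data read for a pair of corresponding samples from six blocks to four, and both configurations take one transfer period, so two samples complete in the time previously needed for one.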
V. RESULTS

A. EXPERIMENTAL ENVIRONMENT
We implement degridding and measure its performance on the target FPGA board described in Table 4. For comparison, we also measure the performance of the MPI (Message Passing Interface)-based CPU benchmark and the CUDA (Compute Unified Device Architecture)-based GPU benchmark, and the details of the CPU and GPU platforms are also listed.
B. RESULTS AND COMPARISONS WITH OTHER HARDWARE
We measure the performance of the MPI benchmark on the CPU while varying the number of cores. The configurations range from one degridder on a single CPU core to multiple degridders occupying all CPU cores, with one degridder per core. In the MPI-based degridding implementation, multiple processes execute parallel degridders to process samples. The processes do not need to communicate with each other when computing samples of different spectral channels, so no communication overhead is produced. OpenMP is widely used on shared-memory systems, so we also implemented an OpenMP-based version of degridding in which each thread is responsible for computing one sample. The results of the MPI- and OpenMP-based implementations are shown in Figure 10. As the number of processes (MPI) or threads (OpenMP) increases, the two implementations produce almost identical results. Besides, the performance of OpenMP with 6 threads is almost the same as with 12 threads (the target CPU has 6 cores and supports 12 threads). The MPI-based benchmark has been applied in radio-astronomy imaging for ASKAP (Australian Square Kilometre Array Pathfinder) and is part of the ASKAP software distribution. To compare the performance of the existing CPU benchmark with our FPGA design, we use the MPI-based benchmark as the reference. Degridding is a typical read/write-intensive application, and to compare fairly across hardware architectures, the target platforms have approximately the same theoretical off-chip memory bandwidth. The theoretical DDR bandwidth of the VCU1525 FPGA board is about 76.8 GB/s, and the theoretical GDDR bandwidth of the GeForce GTX650 GPU is about 80 GB/s. The performance of the different platforms is presented in Figure 11. For performance evaluation, we use GPPS (Grid Points Per Second), which is more intuitive than FLOPS for degridding (one grid point corresponds to 8 floating-point operations, i.e., 1 GPPS = 8 FLOPS).
As the number of participating CPU cores increases, the performance of the MPI benchmark increases. However, the performance no longer increases once the number of participating cores reaches 5. This is caused by the memory bandwidth limit rather than by the serial code sections described by Amdahl's law. In the MPI-based benchmark, multiple processes execute parallel degridders to process samples of different spectral channels, instead of parallelizing the procedures within a sample. The GPU result of the CUDA benchmark is about 1.35x better than the full performance of the CPU benchmark. The performance of the FPGA-based implementation is slightly better than that of the GPU benchmark. We then apply the data reuse optimization when processing 2 neighboring spectral channels and measure the performance of the FPGA, which is shown in Table 5. The performance of the FPGA-based implementation increases significantly and is 2.74 times and 2.03 times better than the full performance of the target CPU and GPU, respectively. The numerical results of processing 2 neighboring channels together with data reuse are almost the same as the results of processing 1 channel at a time: only 1064/(16000*64) = 0.104% of the numerical results are not fully consistent, and these still show a strong Pearson correlation. The inconsistencies are caused by tiny deviations of the mapping area on the grid that occur in a very small proportion of samples. Because the coordinates of a visibility (u, v, w) are floating-point numbers with non-zero fractional parts, the convolved visibility cannot be mapped exactly onto a grid with integer U and V coordinates. Gridding/degridding therefore has inherent mapping errors, and the tiny deviations in a very small proportion of samples mentioned above are acceptable compared with these inherent errors.
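A consistency check of the kind described above can be sketched as follows. This is a hypothetical illustration, not the evaluation script used for the paper: the function names are assumptions, the comparison is an exact equality test, and the Pearson correlation is computed over plain real-valued sequences (how the complex visibilities were reduced to real values for the correlation is not stated in the text).

```python
import math


def mismatch_fraction(reference, candidate):
    """Fraction of positions where the two result lists differ."""
    mismatches = sum(1 for r, c in zip(reference, candidate) if r != c)
    return mismatches / len(reference)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length real sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Made-up toy values, only to exercise the functions.
    single = [1.0, 2.0, 3.0, 4.0]
    paired = [1.0, 2.0, 3.0, 4.5]
    print(mismatch_fraction(single, paired))
    print(pearson(single, paired))
    # Proportion quoted in the text: 1064 of 16000 * 64 results.
    print(round(1064 / (16000 * 64) * 100, 3), "%")
```

The last line reproduces the arithmetic quoted above: 1064 out of 16000 * 64 results is about 0.104%.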
The resource usage of the FPGA-based implementation is shown in Table 6. Processing 2 channels together duplicates the computing logic of 1 channel, so the usage of DSPs, the kernel computing resource, is 2 times that of processing 1 channel at a time. The target FPGA has abundant logic resources, yet our design uses less than 6.3% of any resource. The theoretical peak performance of the target GPU (GTX650) is 812.5 GFLOPS. The theoretical peak performance of the target FPGA (VU9P Virtex UltraScale+ FPGA) is about 1515 GFLOPS (the calculation is based on [33], which is co-authored by Xilinx). We use the largest resource usage proportion (6.3%) to estimate the theoretical peak performance of the used resources: they can provide no more than 6.3% * 1515 GFLOPS = 95.445 GFLOPS, which is much lower than the theoretical peak performance of the target GPU. This suggests that a low-end FPGA chip providing only the resources listed in Table 6 could achieve the same performance. Although the theoretical peak performance of the used FPGA resources is much lower than that of the target GPU, we still achieve a 2.03 times speedup over the target GPU. This illustrates that, for degridding, the FPGA-based prototype exhibits much better architectural efficiency than the GPU-based implementation.
We also consider the energy efficiency of degridding on the different hardware architectures. The FPGA-based prototype also serves as a performance evaluation for a future ASIC design. In the power measurement, to eliminate the effect of the power of non-processing units on the different platforms, we measure the running power of each platform (the difference in the power of the whole platform with and without the application running) to reveal the power dissipation of each architecture. The power of the platform is measured with a dynamometer (HY-001 type, HYELEC Company, Ltd., Hangzhou, China), and the test environment is shown in Figure 12. First, we measure the static power of each platform with no target application running, and then we measure the dynamic power of each platform with the target application (MPI, CUDA, FPGA) running. The running power is the difference between the dynamic and static power. The power of each platform is measured 10 times separately and the average is computed. EyE (Energy Efficiency) is calculated from the throughput (Top) and the power, as shown in Equation 3. The running power and energy efficiency of degridding on the different hardware are shown in Table 7. The FPGA-based prototype achieves 7.64 times and 7.42 times the energy efficiency of the MPI-based benchmark and the CUDA-based benchmark on the target CPU and GPU, respectively.
EyE = Top / Power    (3)
VI. CONCLUSION
In this work, we studied an FPGA-based scale-out prototype of the widely used W-projection degridding algorithm. To the best of our knowledge, this is the first attempt to accelerate W-projection degridding on an FPGA for creating wide-field images. Through the analysis of the algorithm behavior and bottlenecks, we customize and optimize the memory architecture and computing logic to maximize bandwidth efficiency and computing performance. Besides, through the analysis of the relations between the data required for processing multiple channels, we reuse the shared data to reduce the data size of memory access and further improve the overall performance. The FPGA-based prototype achieves speedups of 2.74 times and 2.03 times, and 7.64 times and 7.42 times the energy efficiency, compared with the MPI-based benchmark and the CUDA-based benchmark on the target CPU and GPU, respectively. This work demonstrates a performance optimization strategy for W-projection degridding on the target FPGA board. The optimizations described in this manuscript can also be applied to future hardware designs with higher bandwidth, such as the Xilinx Alveo U280/U50, which are equipped with HBM memory and can achieve a much larger performance improvement. Besides, the performance of W-projection degridding on the FPGA can also provide guidance and evaluation for a future ASIC-based design, e.g., the approximate logic scale of the algorithm model; the size of the on-chip memory needed to store the coefficient array, so that the full bandwidth of the off-chip memory can be used for the grid array; and the performance of the ASIC-based implementation, which can be estimated by analyzing the improvement in bandwidth and frequency compared with the FPGA-based prototype.
