In the era of the IoT (Internet of Things) and Edge computing, SoC (System on Chip) with an FPGA (Field Programmable Gate Array) is a suitable solution for embedded systems because it supports running rich operating systems on general-purpose CPUs, as well as the FPGA's acceleration for specific computing. One problem of designing an accelerator on an FPGA is that optimization of the logic for the accelerator is not automatic and much trial and error is needed before attaining peak performance from the SoC. In this paper we propose a method to reduce the development time of the accelerator using N-body simulation as a target application. Based on the hardware resources needed for several pipelines of the accelerator and their performance estimation model, we can estimate how many pipelines can be implemented on an SoC. In addition, the amount of memory each pipeline requires for attaining maximum performance is suggested. Our model agreed with the actual calculation speed for different constraining conditions.
INTRODUCTION
Accelerator architecture, which uses GPUs or specialized-hardware, is a promising approach to overcome the increasing demand of huge data processing from the Internet. GPUs were initially used for accelerating scientific simulations, such as Nbody simulations and fluid dynamics (Luebke et al., 2006) . Then, they become a common tool for processing Deep Learning operations, such as training and inference, because highly optimized libraries are provided from the GPU vendor (Chetlur et al., 2014) . Field Programmable Gate Array (FPGA) is another solution for data processing with low power consumption, because it can fully optimize specific operations by reducing the bit length of arithmetic operations. Even though recent GPUs or CPUs support lower bitlength arithmetic operations, such as 16-bit floatingpoint or 4-bit integer operations, they cannot be used for arbitrary bit length for each operation.
Another reason that there is a focus on FPGAs is that they are shipped as an SoC (System on Chip). In the era of the IoT (Internet of Things), Edge computing is a promising direction because the workloads on the client side are becoming heavier and the computing power on the server side is not sufficient (Shi and Dustdar, 2016) . Moreover, a reduction of data from the client side is indispensable to prevent an explosion of network traffic. An SoC with an FPGA is a suitable solution because it has a general-purpose CPU that can run a rich operating system, as well as an FPGA that can be used as an accelerator for specific computing (Gomes et al., 2015) . The difference between a pure accelerator with GPUs or FPGAs and an SoC system is that an SoC system is a compact and low power system that can be integrated as an Edge component. We previously proposed an FPGA tablet that runs Android on the CPU with a specific accelerator for applications configured for the FPGA while the OS is running (Sato and Narumi, 2015) .
One problem of developing an SoC system is designing and optimizing the FPGA. FPGAs are considered hardware, and designing them with an HDL (Hardware Description Language) takes far more time to develop compared with developing such software for a normal CPU. Recently, High-Level Synthesis (HLS) has become a useful tool for developing an FPGA system (Gajski et al., 2012) . For example, Sys-temC (Black et al., 2009) and Vivado HLS (Xilinx, 2019b) are used for such purposes. The last difficulty is in shortening the design time for establishing the communication between the CPU and FPGA, which is not supported by such HLS tools.
SDSoC (Xilinx, 2019a; Kathail et al., 2016 ) is a tool to shorten the development time when using SoC with Xilinx FPGAs. It automatically generates communication hardware to bridge the CPU and FPGA, as well as generating hardware logic on the FPGA, by just writing with a C-code. By specifying a subroutine to be ported to the hardware, it generates all of the files needed to operate, such as the OS files and the bit file to configure the FPGA.
The last difficulty when an accelerator is implemented with the SDSoC is that how to attain maximum performance among the various configuration patterns of the hardware. The effective performance of the accelerator depends on not only the pure calculation speed of the hardware pipeline but also the data transfer speed. Because the SDSoC hides a detailed mechanism of how the calculation is parallelized as well as how the CPU and FPGA communicate, much trial and error is needed when searching for the combination of parameters that achieves the best performance. The compilation time of the design of the FPGA is often very long, and reducing the number of trials will greatly shorten the development time.
In this paper, we propose a strategy to optimize the accelerator performance by estimating it beforehand using N-body simulation as an example. In section 2, related works on designing accelerators with FPGAs are described. In section 3, the system architecture of this work is shown. The performance result and resource usage are summarized in section 4, and its estimation is explained in section 5. Finally, section 6 summarizes the paper. 
RELATED WORK
FPGAs are used to accelerate many applications, such as encoding for wireless communication (El Adawy et al., 2017) , error correction with a low density parity check (Roh et al., 2016) , image processing for 4K video streams (Kowalczyk et al., 2018) , convolutional and recurrent neural networks (Zeng et al., 2018) , the Finite Difference Time Domain (FDTD) method (Waidyasooriya et al., 2017) , and N-body simulations (Peng et al., 2016; Del Sozzo et al., 2017; Ukawa and Narumi, 2015; K&F, 2015) .
N-body simulation is an applications that allows the FPGA to successfully compete with other architectures, such as GPU. Peng et al. developed a system with a Zynq SoC to accelerate an N-body Modified Newtonian Dynamics (MOND) simulation, and achieved 10 times better performance per watt compared with an Nvidia K80 GPU (Peng et al., 2016) . GRAPE-9 (K&F, 2015) achieved 16 Tflops of performance with only 300W in 2015, while a K20 GPU could reach only 1/3 of that with similar power consumption at that time. In these implementations, the pipeline is carefully optimized with much effort, such as using a lower bit-length of arithmetic units. However, in this paper, we use only a simple C-code to develop the pipeline, and concentrate on the proposed strategy to parallelize the pipeline without much effort.
The key technology is SDSoC (Xilinx, 2019a) from Xilinx, which is a kind of High Level Synthesis (HLS) tool. Unlike SystemC or Vivado HLS, SDSoC automatically generates communication hardware as well as the accelerator itself. Rettkowski et al. achieved 10 times the acceleration of the Histogram of Oriented Gradients (HOG) algorithm compared with the ARM processor in the SoC with a Zynq device (Rettkowski et al., 2017) . A Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN) were accelerated on Zynq MP-SoC device, and achieved several times of acceleration compared with previous implementation with FPGAs (Zeng et al., 2018) . Several filters for 4K video streaming were also accelerated by SDSoC on a Zynq MPSoC device (Kowalczyk et al., 2018) . They discussed the merits and drawbacks of using SDSoC compared with Vivado HLS or the xfOpenCV library.
There are other methods for writing an accelerator code, including the data transfer portion from a high level language, such as OpenCL (Khronos, 2019) . For example, Neural network and FDTD calculations are implemented with FPGAs (Luo et al., 2018; Waidyasooriya et al., 2017) . Though OpenCL can support many platforms including CPUs and GPUs, we need to modify the code using specific APIs. However, SD-SoC requires no modification of the software to use FPGAs.
Making a performance model is a reasonable approach to optimizing the hardware accelerator. Mousouliotis et al. accelerated convolutional neural networks using SDSoC on Zynq SoC (Mousouliotis and Petrou, 2019) . They modeled such elements as pipeline depth and function/loop overheads, and they were consistent in resourcing usage from the vendor tool. However, their model did not combine hardware resources and performance to attain a simple answer for optimized parameters. Zeng et al. also showed the performance model for neural network accelerator with SDSoC (Zeng et al., 2018) , but they just described the calculation cost. However, our strategy directly describes which parameters should be used to achieve the maximum performance for a specified condition.
SYSTEM ARCHITECTURE
In this section the hardware and software of the system is described. 
Hardware Platform
For the SoC platform with an FPGA, we used an Ul-tra96 board (Avnet, 2019) , which houses a Zynq MP-SoC device. Table 1 shows the specifications. The SoC contains a quad-core ARM Coretex-A53 processor operated at 1.5 GHz, a dual-core Cortex-R5 processor, a Mali-400 MP2 GPU as well as FPGA functions. However, the most attractive point is its small size, comparable to a credit card. It is suitable for Edge IoT devices because it can run the latest OS and many IO ports are supported. The size of the logic that fits into the FPGA is not so large compared with other MPSoC devices, and devices that are 7 times larger are available on the market. However the optimization strategy proposed in this paper would be more useful for a larger device; larger devices need a longer compilation time and good estimation methods are more useful than small devices. Figure 1 shows the subroutine to calculate gravity between particles. Only this routine is converted to the hardware by SDSoC tool because other calculations are not so compute intensive. Note that there are iand j-loops (see lines 11 and 13 in Figure 1 ). Variables (posi, posj) to store particle positions and calculated forces (forcef) are allocated with a fixed size because SDSoC requests it when a simple communication method is used. Figure 2 shows a non-parallelized parent routine. The calculation cost of calculate force is O(N 2 ), where N is the number of particles. Note that conversion from double to float is performed to call force pipeline, and the results are also converted back (see lines 8, 9 and 13 in Figure 2 ). Such conversion takes some time with a low power CPU in the SoC.
Software for Parallel Processing of Pipelines
To attain the highest performance with SDSoC, three techniques are used. 
Create a Pipeline
Adding a "#pragma HLS pipeline" line (see line 14 in Figure 1 ) will automatically make a pipeline. The number of clocks required for getting one result (initiation interval) is not specified, and it is estimated as 5 in this system based on section 5.1.
Parallel Processing in a Pipeline
Manual loop-unrolling was performed to calculate a different i of the for loop in line 11 of Figure 1 . We used a manual one because automatic loop-unrolling with "#pragma HLS unroll" was not successful.
Parallelizing the Whole Pipeline
By duplicating the pipeline and executing both pipelines simultaneously, further acceleration can be done. Figure 3 shows a dual pipeline case. First, the dimensions of posi and forcef are increased without increasing the total size of the array. The particle position (posi) is divided into two groups, and the result (forcef) is divided into two arrays. force pipeline2 is the same program as force pipeline. However, different name is needed for conversions to different instances in the FPGA by SDSoC. By using #pragma SDS async and wait, two pipelines are executed simultaneously.
Further Optimization
Further optimization might be possible, but we did not try other ones because the main object was not optimization of a pipeline itself. For example the following methods would be possible: reduction of initiation interval, loop-unrolling for the j-loop to reduce the loop count, reduction of the bit-length of arithmetic operation, or interpolation for calculating division or square root (K&F, 2015) .
PERFORMANCE RESULT AND RESOURCE USAGE
In this section, several results of calculation speed are shown as well as how much logic and RAM are used for the pipeline. Table 2 shows how many resources in the FPGA are used for the system, which is described in section 3.2.2. The number of pipelines generated by unrolling the subroutine is called p unroll in this paper. The size of the local memory, MAXN in Figure 1 , is fixed to 4096, which is the maximum when compiled with SDSoC. BRAM means "Block RAM", which is used as local memory. DSP is a specialized arithmetic logic for such functions as addition and multiplication. FF means Flip-Flops, and LUT means Look-Up Table for logical operations. "(fail)" means that the compilation failed to generate 26 pipelines in this case. The last row shows the maximum number of resources of the target device. As shown in the last column of Table 2 , the compilation time of SDSoC increases as the number of pipelines increases. Especially, when it reaches its limit, the compilation time increases dramatically. This is the reason that estimating configuration parameters is important for FPGA development. Among BRAM, DSP, FF and LUT, LUT is the most restrictive resource in this case. Figure 4 shows the calculation speed in Gflops by changing the number, N, of particles for different numbers of pipelines, p unroll . Here we assumed 38 floating point operations per pairwise interaction. Parallel efficiency is 0.92 when p unroll = 25 for N = 8192. This number means the speed of 25 pipelines is 25 × 0.92 times faster than that of a single pipeline, which is sufficient. Table 3 shows how many resources are used for the system described in section 3.2.3. The number of pipelines attained by parallelizing the whole pipeline is called p dma because a DMA engine is generated for each pipeline. The SDSoC automatically generates a data motion network, which communicates between the CPU and FPGA. In this paper we do not specify which data motion network should be used, and the SDSoC automatically chooses the best one.
Parallelization in a Pipeline

Parallelization of the Whole Pipeline
The required resources in this section is much higher than that presented in the previous section. Only seven pipelines can be implemented instead of 25 pipelines. The bottleneck of the resources is BRAM, as shown in Table 3 . Because the number of BRAMs exceeds the limit when seven pipelines are used, the N local value is reduced from 4096 to 2048. Here, N local is the same as MAXN in Figure 1 . The sum of the data-transfer size of posi and forcef is fixed to MAXN×4 and that of posj is proportional to p dma , as shown in Figure 3 . Note that though the size of posj seems to be fixed in the C-code level, the actual hardware needs a different buffer for each pipeline to receive data in posj. Figure 5 shows the calculation speed. Compared with Figure 4 , the highest speed is much lower. In addition, the line of p dma = 7 becomes horizontal in a larger number of particles. This is because N local is smaller than other cases, which causes low efficiency because of communication overhead. The parallel efficiency is 0.90 when p dma = 7 for N = 8192, which is already lower than the 0.92 of the p unroll = 25 case. total number of pipelines p: p = p dma p unroll .
Combining Both Parallelizations
(1)
As seen in Table 4 , using p unroll first is always better than using a p dma parallelism. For example, when p = 16, the lowest BRAM usage comes from the combination {p unroll =16, p dam =1}. DSP, FF, LUT and compilation time are also the lowest for the same combination. Figure 6 compares the calculation speed. The peak speed has no difference when p is the same, but a larger p dma causes more overhead around N = 512. Therefore, using lower p dma is also better from the point of view of performance.
For obtaining maximum performance from this SoC, we have no reason to use the p dma parallelism. However, to use the p unroll parallelism, one needs manual unrolling, as described in section 3.2.2, which increases the development cost for the software side, especially when we have to try with different numbers of parallelism. In addition, the communication performance might be better in different platforms because using independent DMA engines might increase the data transfer. Therefore, in the rest of this paper, we concentrate on using the p dma parallelism because the optimization strategy is not straight forward. Table 5 and Figure 7 show the case when the maximum resources of BRAM are limited to 300 instead of 432. This situation is when BRAM is used for another purpose in the design and we can use only 300 of the 432 BRAMs for the pipeline. Only the p dma parallelism is used, as in Table 3 and Figure 5 .
Lower Resources of BRAM
The difference caused by the new limitation is that N local needs to be decreased to reduce the consumption of the BRAM. The number of pipelines is the same because LUT becomes the bottleneck subsequent to BRAM. The large difference in calculation speed between Figures 5 and 7 is because p dma = 7 causes a lower speed compared with the smaller p dma = 6. This did not happen in Figure 5 . This is because a lower N local causes more overhead in the communication between the CPU and FPGA. In the following sections we analyze this further.
PERFORMANCE ESTIMATION
In this section, we first show the performance model. Then, we make a model for resource estimation. Fi- nally, we show graphs to choose the best parameter among many possibilities.
Performance Model
The time needed for one step for an N-body simulation cab be expressed as:
where T fpga is the calculation time in the hardware pipeline in the FPGA, and T comm is the communication time between the CPU and the pipeline. T fpga is expressed as:
where N is the number of particles, p is the number of pipelines, and t fpga is the time for a pairwise force calculation between particles with a single pipeline. T comm can be expressed as:
(4) where t band is the communication time needed to transfer data for a particle, which is 16 bytes because four float variables are used. t lat is the setup time needed to initiate the communication. ⌈x⌉ represents an integer larger or equal to x. When N > N local , we need to replace the contents of particle memory, which causes the overhead of communication. Table 6 summarizes the parameters for performance estimation. To determine t fpga , the difference of T between {p=1, N=4096, N local =4096} and {p=1, N=2048, N local =4096} is used because only the T fpga part is different. For t lat , the difference of T between {p dma =1, p unroll =1, N=256, N local =4096} and {p dma =4, p unroll =1, N=512, N local =2048} is used because only the latency part is different. For t band , T of {p dma =7, p unroll =1, N=1024, N local =1024} is used. t fpga = 5.0 × 10 −8 means the calculation of the force on a particle is executed every 50 ns, which correspond to 20 MHz. Because the clock frequency of the pipeline is fixed to 100 MHz in SDSoC configuration in this paper, the initiation interval is 5 instead of 1. Further optimization of the interval is out of the scope of this paper as described in section 3.2.4. t band = 1.4 × 10 −7 corresponds to 114 Mbyte/s of data transfer speed, which is far lower than the peak band width, 4 Gbyte/s, of the DDR4 DRAM on the Ultra96 board. This is because t band includes operations other than pure data transfer, such as conversion from double to float and copy to a temporary buffer for communication with the FPGA. Figures 8 and 9 show the estimated performance in lines, while measured results are still shown as points. As can be seen, the estimation is roughly consistent to the measurement except for middle range of the number of particles (128 ≤ N ≤ 512). Even though the quantitative value for the middle range of the number of particles is different from the actual data, the order of estimated times are consistent with the data, i.e., a larger p dma causes lower performance.
Resource Estimation
In this section, resource parameters for BRAM and LUT are considered because DSP and FF do not become the bottleneck in our case. In this estimate, only the parallelization of the whole pipeline (Section 4.2) is assumed because parallelization in the pipeline (Section 4.1) is too simple for estimation.
The total number of BRAMs, B total , can be estimated as:
where B dma is the number of BRAM blocks for data transfer. B mem is the number of BRAMs for storing particle positions and forces, and B other is the number of BRAMs required for other than the pipeline itself.
Total number of LUT units, L total , is estimated as: where L pipe is the number of LUT units for a pipeline as well as the DMA controller, and L other is LUT for other type of logic. Table 7 summarizes the parameters for hardware resources. To determine B mem , the difference of BRAM between {p dma =4, p unroll =1, N local =4096} and {p dma =4, p unroll =1, N local =2048} are used in Tables  3 and 5 . Similarly, to determine B dma , the difference between {p dma =4, p unroll =1, N local =4096} and {p dma =5, p unroll =1, N local =4096} are used in Table 3 . Then B other is calculated from the data of {p dma =4, p unroll =1, N local =4096}.
To determine L pipe , the difference of LUTs between {p dma =4, p unroll =1, N local =2048} and {p dma =5, p unroll =1, N local =2048} are used in Table 5 . Then L other is calculated from the case of {p dma =4, p unroll =1, N local =2048}. Figures 10 and 11 show the estimated values in lines against actual values in points. As for BRAM, two limits are shown as horizontal lines. Estimated lines fit perfectly for large p dma , while the lines estimate much lower values than the actual usage for small p dma . The reason is that the optimization of resource usage is not performed when resources are not so severely restricted (sparse layout could be done).
As for LUT, the two horizontal lines indicate the maximum and 90% of the maximum LUTs. Using 100% of the LUT is difficult with FPGA because the layout of the logic becomes too difficult. In this estimate, we choose 90% as the practical limit. The difference, depending on N local , is very small and we did not include such a parameter for LUT, unlike what we did for BRAM. For small p dma , the actual number of LUTs is larger than the estimated number, which arises from the same reason that optimization of logics is not needed so much at the compile and layout stage. 
Estimation of the Best Configuration Parameters
In this section, we finally estimate the configuration parameters, i.e., p dma and N local in our case. We assume three cases: A) N=8192 and B max =432, B) N=256 and B max =432, and C) N=8192 and B max =300.
Here B max is the maximum number of BRAMs allowed for the accelerator. Figure 12 shows the estimated parameters for case A). The two dotted lines at the base indicate the boundary of limitations calculated by Eqs. (5) and (6). The filled-circle indicates that the values are within the range of the limitations, while the open circle indicates values that are out of range. p dma =8 is always out of range, and only {p dma =7, N local =4096} is also out of range for p dma =7. The highest performance is estimated to be achieved when {p dma =7, N local =2048}, which is consistent with Figure 5 . Figure 13 is for case B), in which the number of particles are small. Small N needs a small N local for high efficiency, as well as a small number of pipelines. The small N local is required for reducing the data size for transfer, and the small p dma is for low latency to start the DMAs. The highest performance is estimated to be achieved when {p dma =4, N local =256} ({p dma =3, N local =256} is very similar). 
CONCLUSION
In this paper, we proposed an optimization strategy to determine the configuration parameters of pipelines to accelerate N-body simulations. The method is summarized into following three steps.
First, we measure the consumed resources for implementing a middle range of a number of pipelines. A low number of pipelines is not good for estimation because the logics are not sufficiently optimized. However, a full number of pipelines is not suitable either because of the long compilation time.
Second, both performance and resource models are generated based on the measurement. By changing only one parameter among several, we can know the coefficient depending on the parameter. For better fitting, data from the middle range of the number of pipelines should be used, as pointed out above.
Third, make a graph for searching the best combination of parameters by integrating all the models and constraints into a single view. Once the model is generated, we can easily change the constraints: the maximum number of BRAM units (B max ), maximum number of LUT units, and the number of particles (N), in our case. Then, the best combination of the num- ber of pipelines (p dma ) and the size of the local memory (N local ) are estimated. The estimation reasonably agreed with the measurements.
With our strategy, we can reduce the development time for optimizing the accelerator great deal because the effective calculation speed is well estimated even for unknown combinations of parameters. As shown in Tables 3 and 5, the compilation time for a middle range of the number of pipelines is roughly half the compilation time needed for a full number of pipelines. Moreover, when the compilation fails because of a bad estimate, we need much more time to search the best one. A good estimate can avoid such a waste of time. Improvements should be made to reduce the disagreement for the middle range of the number of particles.
The most important part of our strategy is the performance model. Even with a simple N-body simulation, there are several methods to parallelize the calculation. The overhead greatly depends on how it is parallelized, especially when the resources are limited. Therefore, a good performance model should be carefully investigated for Edge devices. When the strategy is used for Deep Neural Network (DNN) applications, it would be more challenging to make performance models because they have more parameters, such as the depth of layers, the size of each layer, and the size of a convolution kernels. However, a similar strategy to that proposed in this paper would work for a quick estimation of the configuration parameters.
Future research should include applying this strategy for larger devices as well as different architectures. Several kinds of accelerators need to be investigated for further application of our method. Also, fully optimizing the pipeline for N-body simulation to compare previous research is another direction of the next study.
