The simulation of light propagation through tissues is important for medical applications, such as photodynamic therapy (PDT) for cancer treatment. To optimize PDT an inverse problem, which works backwards from a desired distribution of light to the parameters that caused it, must be solved. These problems have no closedform solution and therefore must be solved numerically using an iterative method. This involves running many forward light propagation simulations which is time-consuming and computationally intensive.
INTRODUCTION
The ability to predict the propagation of light in tissues allows medical professionals to control therapeutic processes, such as tissue destruction in photodynamic therapy (PDT). To determine the light distribution, it is necessary to simulate the propagation of light through complex media. These simulations can be timeconsuming and computationally intensive, which motivates a fast simulator and hardware acceleration.
These simulators have other applications within biophotonics, like diffuse optical tomography (DOT) and bioluminescence imaging (BLI) [22] , as well as potential applications outside of biophotonics where light is scattered in turbid media [5, 17] .
The description of our FPGA-accelerated biophotonic cancer treatment simulator will begin with a brief introduction to biophotonics, light propagation simulations and photodynamic therapy (PDT). Section 2 presents a high-level overview of the Monte Carlo algorithm we used for light propagation. Section 3 presents some prior work on CPU, GPU and FPGA accelerated simulators. Section 4 outlines our design decisions including the use of an FPGA and the Intel FPGA SDK for OpenCL. Section 5 discusses how our design was implemented on an Intel Stratix 10 FPGA using OpenCL and various optimizations and design techniques that led to its high performance. Section 6 discusses the validation of the results, the performance, area and power consumption of the design and the development productivity advantages of OpenCL. The paper concludes with directions for reducing FPGA resource utilization, removing constraints, scaling up the design and a summary.
Light propagation in turbid media
Materials, including bodily tissues, have several attributes which determine how light interacts with them. The absorption coefficient (µ a ) of a material measures the fraction of light (with a particular wavelength) that is absorbed per unit length. Tissues are also turbid materials meaning they scatter light. This is represented by a scattering coefficient (µ s ) and an anisotropy factor (д). The scattering coefficient of a material represents the typical number of times a photon will scatter per unit length. The anisotropy factor of a material is defined as д = cosθ , where θ is the average deflection angle during a scattering event. In other words, д represents the typical amount of forward direction retained after a photon is scattered inside a material. Materials with a д value of 0 are isotropic, meaning they scatter equally in all directions.
To solve for the absorption and scattering of light, the radiative transfer equation (RTE) must be solved. Solving this equation analytically is possible in homogeneous materials, but the methods become difficult at the interfaces of materials and at the sources and sinks of light [18, 19] . Instead, numerical approximations, like the Monte Carlo (MC) method, are regarded as the gold standard for the field [16] . For light propagation simulations, the MC method involves launching a large number of photons and statistically determining their propagation and absorption locations. Given a sufficiently high number of random samples, the output will converge to a statistically accurate result. While this method is computationally intensive, it presents significant parallelism which makes it attractive for hardware acceleration.
The light energy absorbed in a volume region is proportional to the amount of photons absorbed in that region. In many biophotonics applications, the fluence of a region is a quantity of interest. Given the absorption coefficient (µ a ), volume (V R ) and absorbed photon energy (E R ) of a region (R), the fluence (Φ R ) can be calculated using Equation 1 .
The choice of mesh geometry for representing the volumetric models is crucial to the simulator accuracy. Previous simulators have used layered [1, 2, 20, 27] , voxel [3, 6, 11] and tetrahedral [8] [9] [10] 26] geometries. Layered geometries are sufficient for some applications, like the skin [1] , but are too restrictive for general clinical models. Compared to voxel geometries, tetrahedral geometries require more complex intersection calculations. However, they more accurately represent region boundaries and curved surface normals in general clinical models, as depicted in Figure 1 . This is crucial for accurately simulating the reflection and refraction of light at region boundaries.
Photodynamic therapy (PDT)
Photodynamic therapy is a light-based therapy used to kill diseased tissues, such as cancer cells and bacterial infections. In PDT, the patient receives a photosensitizer (PS), either topically, orally or by injection. The PS accumulates preferentially in tissues with rapidly dividing cells, like tumors, and absorbs light of a specific wavelength. Light is provided by the physician using either external (e.g. light irradiating the skin) or internal (e.g. needle-sized fiber optic probe) sources. Both the inactive PS and light are harmless on their own, but when enough photons interact with the PS, it activates which causes surrounding oxygen to become reactive. The reactive oxygen species damage the cells and, given sufficient damage, cause tissue death. The goal of PDT is to inflict enough damage to kill the target tissue while minimizing the damage to healthy tissue [29] . [32] . The arrow represents the estimated surface normal.
There are many factors which affect the outcome of PDT. The most controllable factors are the light source parameters (i.e. number, type, position, orientation and intensity).
The forward problem, solved by the our simulator, determines the fluence distribution within the model given the tissue optical properties of the regions and the parameters of the light emitters. To optimize the treatment plan for a patient, it is necessary to solve an inverse problem. The inverse problem involves working backwards from a desired fluence distribution to the parameters that caused it. For PDT, this would involve determining the light source parameters that create a desired fluence distribution within the tissues in order to kill the target tissue with minimal damage to the healthy tissue. The number, type, position, orientation and intensity of the light sources are some of the varying degrees of freedom for which there is no known general case closed-form solution. Instead, iterative approaches are used that perform many forward simulations to approximate the optimal solution [30] . This approach leads to high compute needs: recent work in [30] , for example, produces higher quality treatment plans than prior approaches but still leverages only a subset of the available degrees of freedom in light delivery. The treatment plan optimization takes several hours per (patient-specific) plan due to the large number of forward light simulations required. Reducing the time per light simulation would allow for a more complete exploration of the planning space. This could enable the discovering of treatment plans which destroy the target tumor with less damage to healthy tissues.
ALGORITHM
We use a method called hop-drop-spin to model the propagation of light through turbid media. This method was initially proposed by Wilson and Adam in 1983 [28] and was later adapted for layered geometries by Wang et al. in MCML [27] , which served as a base for many more recent MC simulators. The method is visualized in Figure 2 .
The launch stage launches photon packets from the light sources with an initial position p, directiond and weight (energy) of w. Launching and tracking the path and weight of photon packets as they are gradually absorbed by the tissue is more computationally efficient than simulating individual photons, which would be completely absorbed by the tissue at their first MC absorption event. The packet then enters the drawstep stage which generates a random step length s based on the attenuation coefficient (µ t = µ a +µ s ) of the material it is currently in. The packet then enters the hop stage where it is moved by this step length along its current direction vector to point p ′ = p + sd. If p ′ remains in the same tetrahedron as p, then the hop stage is complete. If not, then the packet crosses a tetrahedral boundary during the step and is moved to the interface stage where the intersection point on the shared face of the tetrahedrons is calculated. If the refractive indices of the tetrahedrons differ, then the packet crosses a material boundary and calculations are made to account for reflection and refraction. If the attenuation coefficient of the tetrahedrons differ, then the step length is updated and the packet moves back to the hop stage with a new step length.
Once the step has finished, the packet enters the complex dropspin stage which contains three sub-stages: drop, roulette and spin. In the drop sub-stage, the packet loses a fraction of its weight as energy absorption into the tetrahedron. This is recorded by accumulating the dropped weight into an array with one entry per tetrahedral element. The packet then enters the roulette sub-stage. If the packet's weight is greater than a threshold (wmin), then the roulette sub-stage does nothing. If the weight is below wmin, then the packet is given a 1-in-p chance of surviving with an increased weight of p * w (due to energy conservation). If the packet loses roulette it is moved to the exit stage and its execution is finished. As a packet propagates and looses more weight, its contribution to the overall energy accumulation declines rapidly. Thus, the purpose of the roulette phase is to avoid wasting computation on increasingly insignificant packets [7] . If the packet's weight was greater than wmin or if it survived roulette, it enters the spin sub-stage where its direction vectord is changed based on a random sample of the Henyey-Greenstein (HG) scattering function with the д value set by the material of the current tetrahedron. Once the packet finishes the spin sub-stage, it returns to the drawstep stage where the entire process repeats.
PRIOR WORK 3.1 CPU
Other CPU-based, MC light propagation simulators use various algorithmic and model simplifications that limit the scope of the implementation in exchange for improved performance. Only tetrahedralmesh MC simulators are flexible enough to accurately represent the curved surfaces of general clinical models. Of the existing tetrahedral-mesh MC simulators, FullMonteSW [9] is the fastest. It outperforms other high-performing CPU implementations, TIM-OS [26] and MMCM [10] , while remaining full featured.
CPU implementations exhibit various levels of optimization, which complicates the interpretation of relative performance numbers for hardware accelerated implementations. Multi-threaded simulators show linear performance scaling with the number of cores as well as some performance increase from simultaneous multi-threading (or hyperthreading) [25] . Therefore, single-threaded baseline implementations are inherently~9x slower than multithreaded implementations on modern processors with six cores. FullMonteSW's use of optimized data structures and hand-coded vector instructions (AVX2) provided an additional speedup of~8x [25] . Therefore, unoptimized single-threaded CPU implementations may be~72x slower than an optimized multi-threaded CPU implementation. In this work, we compare performance against Full-MonteSW, as it currently represents the best performance achieved by a CPU implementation.
GPU
Various GPU-accelerated versions of the MCML algorithm for layered geometries have been implemented with different constraints [1, 2, 20] . The best performance increase was 100x over the unoptimized single-threaded baseline software code. A conservative adjustment of the performance places it at~2x faster than an optimized and multi-threaded CPU baseline. However, all MCML based implementations suffer from the same limitations of the layered geometry.
Fang and Boas [11] implemented a GPU-accelerated voxel-mesh simulator, MCX. A speedup of 75-300x was reported over its unoptimized single-threaded benchmark tMCimg [6] . Adjusted performance numbers suggest that MCX would be 2-6x faster than an optimized and multi-threaded baseline. However, as discussed in Section 1.1 and by the authors of MCX [11] , voxels have significant drawbacks when fitting to curved surfaces and accurately representing their normal vectors [4] .
Powell & Leung [23] implemented a GPU-accelerated tetrahedralmesh simulator that achieved a 2x speedup over the multi-threaded MMCM code, implying it is slightly slower than FullMonteSW.
Young-Schultz et al. [32] created FullMonteCUDA, a GPU-accelerated version of FullMonteSW using an NVIDIA GPU. FullMonteCUDA achieved a 4-13x speedup over FullMonteSW across various benchmark models.
FPGA
Lo et al. [20] created FBM, an implementation of MCML on an Altera Stratix III FPGA. They constrained the model to 10 layers and a limited number of accumulation elements. With these constraints, an advantage of 45x in speed and 700x in energy-efficiency over a CPU of the same process generation was reported. Adjusting for the comparison to the unoptimized single-threaded baseline code shows that an optimized and multi-threaded CPU implementation could perform better than FBM and the energy-efficiency would drop to~10x.
Hung et al. [13] described the MCML algorithm using OpenCL and targeted an NVIDIA GTS 450 GPU and Intel Stratix V FPGA.
The GPU and FPGA provided a speedup of 64x and 21x, respectively, over the unoptimized and single-threaded MCML code for layered models. Adjusting for the unoptimized baseline suggests that neither the GPU or the FPGA would provide a speedup over a multi-threaded and optimized CPU implementation. The authors find that the FPGA uses 7.5x and 1.5x less power than the GPU and single-threaded CPU, respectively.
Cassidy et al. [8] implemented FullMonteSW on an Altera Stratix V FPGA using the Bluespec SystemVerilog (BSV) language. The simulator only supports homogeneous meshes which are limited to 48k tetrahedrons. The authors do not provide raw performance numbers as the combined hardware-software system was not functionally complete. The performance estimates were based on simulations and hardware reports. They estimate a 4x speed improvement and 67x energy-efficiency improvement over the optimized and multithreaded FullMonteSW.
All prior FPGA implementations simplified the geometry and material model, yet most still fail to outperform optimized software. This highlights the complexity of a general MC light simulator and the resulting difficulties in hardware acceleration.
DESIGN CHOICES 4.1 FPGA as a hardware accelerator
FullMonteSW is currently the fastest tetrahedral-mesh MC light propagation simulator of its kind [9] . Its speed was achieved through compact data structures, intelligent custom caching, multi-threading and hand-coded vector (AVX2) instructions. Further software optimizations show diminishing returns on performance which motivates hardware acceleration. The decision to accelerate FullMonteSW using an FPGA was influenced by various attributes of the algorithm, some of which are described in the following sections.
Data parallelism. There is ideal data parallelism between photon packets. While propagating through the mesh, the path of a packet has no effect on the path of any other. Therefore, packets may be launched completely in parallel, even on different devices, so long as the energy accumulation arrays are summed together once the simulation finishes. In addition, the packets exhibit significant pipeline parallelism within each stage. This allows for a deep pipeline with no stalls and a high operating frequency.
Distributed random number generation. MC methods rely extensively on the use of random numbers. Many pipeline steps require the generation of random numbers of varying precision and ranges. FPGAs excel at this by using highly efficient bitwise operations distributed across the chip to supply the various pipeline steps.
Irregular memory accesses. The memory access pattern for MC light propagation simulations is irregular, which can be handled well by FPGAs [12] . The simulator contains a large read-only memory for tetrahedral geometry data and a moderate amount of readaccumulate-write memory to track the energy absorbed in each tetrahedron. Both are accessed irregularly due to the stochastic light sources and the random nature of MC simulations. Being able to exploit the irregular access patterns and the unique sizing and operations of the memory gives the FPGA an advantage over CPUs and GPUs.
Language choice
FPGA designs are typically described using Register Transfer Level (RTL) hardware description languages (HDL) like Verilog and VHDL. Various domain-specific tools have been created that abstract the hardware, allowing developers to more easily describe their circuits within a specific application domain (e.g. stream processing, DSP, networking, etc) but these tools are too restrictive for the algorithm we seek to accelerate. High-level synthesis (HLS) provides a generalized abstraction to the hardware by allowing developers to use sequential languages, like C/C++, or explicitly-parallel languages, like OpenCL. One of the most attractive features of OpenCL HLS, which does not exist in RTL or general C++ HLS, is the standardized high-level interface functions between the FPGA and CPU. To provide this, the board manufacturer creates a board support package (BSP) that is used by the OpenCL compiler (for Intel, this is called AOC) to abstract the host-accelerator interactions [15] . This means that the user does not have to use system integration tools to manually connect IP components nor write tedious low-level code to transfer data between the CPU and FPGA. This makes developing a full FPGA-accelerated system generally much simpler and quicker than both RTL and C++ HLS, as the CPU-FPGA communication is specified at a higher and standardized level.
Based on experiences and difficulties with implementing the complex FullMonteSW pipeline in lower-level RTL languages, we decided to use the Intel FPGA SDK for OpenCL. The challenge with using OpenCL is describing the algorithm in a way that the compiler is able to generate a functioning pipeline without sacrificing substantial performance and area.
DESIGN DETAILS
This section will describe some of the OpenCL programming techniques that were used to create an efficient hardware implementation of FullMonteSW on an FPGA.
The algorithm, described in Section 2 and visualized in Figure  2 , was implemented on a Terasic DE10-Pro board with a Stratix 10 (1SG280LU2F50E1VG) FPGA. To create random numbers, we use a version of the TinyMT RNG of Saito and Matsumoto [24] allowing us to generate sixty-four 32-bit random numbers every clock cycle, which is enough to supply up to 12 instances of the pipeline. Leveraging the OpenCL standard, we were able to quickly create a full system for interacting between the host CPU and FPGA accelerator. This system is visualized in Figure 3 , where the arrows represent the following:
(1) Transferring simulation data from the CPU to the FPGA.
(2) The CPU signalling each of the kernels in the FPGA to start.
(3) The launch kernel reading the pre-launched packets from the global memory of the board (this is described later in this section). (4) The dropspin kernel writing the on-chip tetrahedral energy accumulation arrays to global memory for the CPU to read. code to support many sources is complex, and new sources are regularly added based on medical requirements and device advancements. As an emitted photon packet undergoes many interactions with the tissue before being terminated, most of the computation is in the other stages of the pipeline in Figure 2 , rather than in the launch stage. Therefore, we implement the launching of packets from sources in the CPU host rather than the FPGA. This was done by creating an array of pre-launched packets in the CPU that is passed to the FPGA before the simulation starts. The OpenCL standard made this easy by allowing us to pass the list of pre-launched packets as an argument to the launch kernel (arrow 1 in Figure 3 ). To launch packets in the FPGA pipeline, the launch kernel simply reads this data from global memory (arrow 3 in Figure 3 ) and moves the packet to the drawstep kernel. The complex conditional code required to implement various sources in the FPGA would result in significant area usage and potentially reduced performance. Additionally, keeping the launching logic in the CPU allowed our FPGA design to instantly support the same light sources as the softwareremoving a major limitation of the previously most full-featured FPGA implementation [8] .
Both Intel and Xilinx support OpenCL pipes which allow data to be transferred efficiently from one kernel to another. Pipes bypass the slow global memory traditionally used for inter-kernel communication by using FIFO buffers implemented in the RAMs of the FPGA, as illustrated in Figure 4 . A major limitation of pipes is that they do not support the OpenCL struct datatype, which can make passing more complex data between kernels difficult. To address this, we use the Intel FPGA SDK for OpenCL channels extension [15] . Channels are specific to the Intel FPGA implementation of the OpenCL standard (i.e. they are not part of the official OpenCL standard [21] ). They are similar to pipes and support the struct datatype. This allows developers to define individual blocks of hardware using separate OpenCL kernels and connect them to form a larger pipeline. For us, each of the states in Figure 2 is a separate OpenCL kernel and the arrows represent a unidirectional channel that moves data from one kernel to the next. The reading and writing of channels is illustrated in Code 1. As discussed in [7] [8] [9] , the FullMonte algorithm contains many trigonometric, vector, matrix and random number distribution functions. The CPU version [9] uses 32-bit floating point precision for these calculations, while the previous FPGA version [8] uses fixedpoint of various precision. In this work, we decided to use 32-bit floating-point precision in the OpenCL kernels. This allowed portions of complicated sequential logic to be directly translated from CPU C++ code to FPGA OpenCL code, which greatly improved development productivity and code readability. We found that we achieved good performance and area results using 32-bit floatingpoint representations, as shown later in Section 6. In Section 7.1, we discuss the potential benefits of using fixed-point representations similar to [8] in the future.
Aside from the launch kernel, where the number of loop iterations is known, the code of each shaded kernel in Figure 3 is structured similarly to Code 1. Looking at Figure 3 , the number of times a packet will go through the circular pipeline path consisting of the drawstep, hop, interface and dropspin kernels is random. This is the reason for the infinite loop in the kernel skeleton from Code 1. This kernel structure generates a block of hardware that: reads from an input channel in a non-blocking fashion, performs the logic of the kernel if the read data is valid, writes the data for the next kernel into the output channel and repeats indefinitely. Our current design focuses on creating a pipeline that produces valid results with good performance. The only restriction placed on the model is a maximum of 65k tetrahedral elements in order to fit the entire mesh in on-chip memory. A limitation of using OpenCL for FPGAs is that separate kernels cannot share local memory. Both the drawstep and interface kernel require access to the same readonly tetrahedral mesh data. Ideally we would store a single copy of the mesh in on-chip memory with two read ports -one for the drawstep kernel and the other for the interface kernel. However, this is not possible due to the restrictions of the OpenCL language. We decided to duplicate the mesh memory in both kernels and plan to create a custom on-chip caching structure in the future to allow the mesh to be stored in the large global memory of the board. This would remove the mesh element constraint and is discussed further in Section 7.2.
The initiation interval (II) of a loop is the number of clock cycles between the start times of consecutive loop iterations. An II value of anything more than 1 creates back pressure that stalls the entire pipeline. In our initial implementation, the AOC was able to schedule our design with an II of 1 for all kernels except dropspin, which was scheduled with an II of 29. The double precision floating-point read-accumulate-write operation, illustrated in Code 2, was the reason for the high II value. We determined that the 64-bit floating-point number (i.e. double) could be replaced by a 64-bit fixed-point number (i.e. a long) for the energy accumulations without overflow or loss of meaningful precision. This lowered the II from 29 to 2. Since the read-accumulate-write operation to the tetraEnergy array of one loop iteration must finish completely before the next iteration reads from the same memory, the AOC was unable to schedule the loop with an II of 1. An II of 2 guarantees that read-accumulate-write operations from consecutive loop iterations to the same tetraEnergy entry will not be missed. To remove this dependency and achieve an II of 1, we created two instances of the accumulation array. This allowed us to write to alternating instances on consecutive loop iterations. We used the ivdep pragma to tell the AOC how many consecutive loop iterations are guaranteed to not access the same memory locations. This optimization is illustrated in Code 3. With this optimization, we were able to lower the II of the dropspin kernel to 1. However, we found that this also reduced the maximum operating frequency (Fmax) by~2x. Using the Quartus Timing Analyzer, we verified that the critical path was from the output register of the tetraEnergy memory, through a 64-bit adder and to the input register of the same memory. This showed that the AOC was scheduling the read-accumulate-write into a single cycle which required a lower Fmax. To address this, we duplicated the accumulation arrays again to four, increased the loop safe length from two to four using the ivdep pragma and manually inserted pipeline registers between the read, accumulate and write operations using the built-in AOC function __fpga_reg() [15] . This technique, illustrated in Code 4, allowed us to achieve an II of 1 and an Fmax of 312 MHz. To duplicate the pipeline we developed the code summarized in Code 5 to instantiate multiple copies of the original kernel from Code 1. The channels used to pass data between the kernels were duplicated, creating an array of channels accessed by the pipeline index p. In the launch kernel, each instance of the pipeline processes a fraction of the total photon packets requested. If Npkts is the total number of packets, then each pipeline processes N = Npkts/PIPELI N E_N packets with an offset of p * N into global memory (arrow 3 in Figure 3 ). For simplicity, we duplicated the on-chip tetrahedral geometry memory and energy accumulation arrays for each pipeline. Instantiating PIPELINE_N copies of these memories reduces their maximum size (and therefore the maximum number of tetrahedrons that can be stored on chip) by a factor of PIPELINE_N. Therefore, we only instantiated two instances of the pipeline which allowed us to store 65k tetrahedral elements on-chip; this enabled the use of existing models for validation and benchmarking. Each pipeline simulates half of the total packets requested and the energy accumulation arrays are summed together in the FPGA after both pipelines finish but before being written back to global memory (arrow 4 in Figure 3 ). The two pipeline design achieved an II of 1 and an Fmax of 285 MHz. The lower Fmax compared to the single pipeline is likely attributable to the increased routing complexity of the larger two pipeline design. Section 7 discusses more sophisticated methods for scaling up the design with even more pipeline instances. 
RESULTS
This section is split into four subsections: the validation of Full-MonteFPGACL against FullMonteSW, the resource utilization of the FPGA design, the performance and energy-efficiency comparison against FullMonteSW and FullMonteCUDA and a discussion of the development productivity using OpenCL. For FullMonteSW, we used an Intel Core i7-6850 3.8GHz CPU (14nm) with 6 physical cores (12 virtual cores with hyperthreading) and 32GB of RAM. For FulMonteCUDA, we used an NVIDIA Titan Xp GPU (16nm) and for FullMonteFPGACL, we used a Terasic DE10-Pro board with an Intel Stratix 10 (1SG280LU2F50E1VG) 14nm FPGA.
Validation
To validate FullMonteFPGACL we used the cube_5med and FourLayer models from the TIM-OS design suite [26] as well as a model of a real tumour. For each model, we ran the simulator for 10 6 -10 7 packets. We used FullMonteSW as a reference, which was itself validated and benchmarked against two highly-optimized software simulators: TIM-OS and MMCM [9] . To accurately compare the results of the simulations, we calculated the normalized L1-norm (|x | 1 ) between two simulations (A and B) using Equation 2. We found that for each test case, the normalized L1-norm for the tetrahedral fluences (Φ) of a FullMonteFPGACL and FullMonteSW simulation was comparable to that from two FullMonteSW simulations initialized with different RNG seeds, as shown in Table 1 .
cube_5med FourLayer Tumor CPU-CPU 0.0322 0.0012 0.0027 FPGA-CPU 0.0342 0.0020 0.0026 Table 1 : Normalized L1-norm values across the benchmark models using 10 6 packets for two differently seeded CPU simulations and an FPGA and CPU simulation.
FPGA utilization
The single pipeline design achieves an Fmax of 312MHz. The hardware resources required for each kernel and the total design are summarized in Table 2 . The kernel total row shows the total utilization for the partition of the design dedicated to the OpenCL kernels. The design total row shows the resource summary reported by the Quartus compilation report. The extra resources make up the global interconnect, global memory caches and OpenCL board interface logic. The RAM utilization is high due to the on-chip caching of tetrahedrons and programming language limitations discussed in Section 5. However, as pointed out in the footnote of Table 2 , only 10% of the total available RAMs are used when the tetrahedron data is not cached on-chip. Reducing memory usage, removing the mesh element constraint and duplicating the pipeline to scale up performance are discussed further in Section 7. 
Performance & energy-efficiency
To benchmark performance, we simulated three models with packet counts ranging from 10 6 -10 7 using FullMonteSW (CPU), FullMonte-CUDA (GPU) and two versions of FullMonteFPGACL: a single pipeline and two instances of the pipeline (FPGA). FullMonteSW was configured to use 12 threads with AVX2 instructions enabled.
The performance results are summarized in Table 3 and show that FullMonteFPGACL achieves a speedup of 2.1-3.9x over Full-MonteSW. The results depend on the mesh complexity, with the highest speedup occurring on the most complex and realistic mesh, Session: Applications I FPGA '20, February 23-25, 2020, Seaside, CA, USA the tumor. FullMonteFPGACL is between 1-3x slower than Full-MonteCUDA. However, as discussed later in Section 7.3, we believe that duplicating the pipeline to fill the FPGA should achieve an improvement of 8-16x and 1-4x over FullMonteSW and FullMonte-CUDA, respectively. To investigate the energy-efficiency of FullMonteFPGACL, we compare the energy-efficiency (performance per Watt) against Full-MonteSW (CPU) and FullMonteCUDA (GPU). We estimated the power of a single FPGA pipeline to be 30W using the Quartus Pow-erPlay Power Analysis post-synthesis vectorless power estimation with a 12.5% input toggle rate. Using the same method, we estimated a 45W power usage for two instances of the FPGA pipeline. The Intel Core i7 CPU has a thermal design power (TDP) of 140W and the NVIDIA Titan Xp GPU has a TDP of 250W. To compare energyefficiency, we conservatively assumed 85% of the CPU and GPU TDP (119W and 213W, respectively). Thus, a single FullMonteFPGACL pipeline is up to 10x and 4x more energy-efficient than the CPU and GPU, respectively. The duplicated pipeline version achieves an 11x and 5x energy-efficiency improvement compared to the CPU and GPU, respectively. If the single pipeline is duplicated to fill the FPGA, we expect to achieve an energy-efficiency improvement of up to 17x and 7x over the CPU and GPU versions, respectively. The improvement is due to amortizing the device static power overhead over more productive pipeline units. The duplication of the pipeline is discussed further in Section 7.3. Table 3 : Runtimes for various benchmarks on a CPU, GPU and a single and duplicated FPGA pipeline.
Packets Model
The lower energy-efficiency improvement, relative to [8] , can be explained by several factors. First, our OpenCL design uses 32bit floating-point precision which results in more area usage than the custom-width fixed-point precision used in [8] . Second, the problem size is bigger since we use more FPGA RAM blocks to store tetrahedrons on-chip. Lastly, FPGA power has increased with newer process generations, especially from 28 to 14nm. In addition, the power gap between CPUs and FPGAs has been shrinking with newer process generations.
Development productivity
Some of the main advantages of using OpenCL over lower-level languages are: improvements in development productivity, code maintenance, code extendability and the ability to quickly create a full FPGA-accelerated system. This section will provide an estimated comparison of the development time for various hardware accelerations of the FullMonteSW algorithm. Table 4 provides a rough estimate of the single-developer time required for the different FullMonteSW hardware acceleration implementations to reach the prototype stage, as well as to achieve a performance improvement over the baseline software implementation. It also shows the size of the designs in number of files and lines of code. Currently, there have been three different hardware acceleration implementations of FullMonteSW.
The first method [8] targeted an Intel Stratix V FPGA using Bluespec SystemVerilog (BSV). The intent of using BSV was to abstract from pure SystemVerilog RTL and improve development productivity without substantially sacrificing area and performance. However, as shown in Table 4 , this was not the case. The simulator only supports homogeneous models and, while the highly optimized pipeline stages can be verified in simulation individually, when the entire pipeline is connected and tested, the results are inaccurate in both simulation and hardware.
Recently, FullMonteSW was accelerated using an NVIDIA GPU [32] and the CUDA SDK. Table 4 shows that the prototype took a single developer roughly a week to create. This prototype achieved a 2x speedup over FullMonteSW and the final version achieved between a 4-13x speedup depending on the model being simulated [32] . The comprehensive and easy-to-follow NVIDIA documentation and example projects, the wide spread usage of the the CUDA SDK and the short compile times were all major factors in the fast development of an accurate and efficient GPU-acceleration of FullMonteSW.
The topic of this paper is our acceleration of FullMonteSW using OpenCL for FPGAs. As shown in Table 4 , it took a single developer around 2 weeks to create a functioning prototype -similar to the time required for the GPU. This included creating the entire system that is visualized in Figure 3 and verifying its functionality with the Intel FPGA OpenCL Emulator. Roughly another 2 months were required to investigate the inefficiencies of the prototype and test various techniques that led to the optimizations discussed in Section 5 and achieve a performance improvement over FullMonteSW. The main development productivity bottleneck was the~8 hour compile times for the FPGA, of which the AOC front-end took roughly an hour and Quartus the remaining seven. The OpenCL emulator was useful for verifying the functionality of small pieces of code. However, the validation of the entire pipeline requires running in the range of 10 5 -10 7 packets, which takes up to 8 hours in the emulator. Therefore, to validate the entire pipeline, we found it more productive to compile the entire design for the FPGA (~8 hours) and run the simulation on the actual hardware (~seconds), as it performed a more rigorous validation of the pipeline and benchmarked the performance in a similar amount of time as using the emulator.
Using OpenCL significantly reduced the amount of FPGA code and CPU-FPGA interface code that had to be written compared to HDL languages, as shown in Table 4 . In addition, we found that the use of OpenCL drastically decreased the effort and time involved in creating a complete heterogeneous CPU-FPGA system. The OpenCL compiler was capable of optimizing and pipelining our complex sequential code within each kernel. This made converting CPU code to target an FPGA much less cumbersome by allowing us, in many cases, to directly copy blocks of sequential C++ CPU code to OpenCL for the FPGA.
The use of OpenCL was beneficial for development productivity; however it did not remove the necessity for the developer to think like a hardware designer. Simply copying all the CPU code for the FPGA using the conventional OpenCL paradigm resulted in poor performance (an II of over 500 and an Fmax of 120MHz). We found it most effective for the developer to begin by considering the hardware circuit they wish to create, then explore alternative, sometimes counter-intuitive, methods to describe that circuit in OpenCL. Some optimizations involved: the investigation of hardware area and timing reports, identifying issues in the generated HDL and creating a high-level solution in the OpenCL code. Specifically, we found the most difficult parts of the design were: (1) structuring the channel logic to communicate accurately and efficiently between kernels (2) correctly unrolling OpenCL loops to duplicate the pipeline, as illustrated in Code 5 (3) optimizing the performance for 64-bit read-accumulate-write operations to the FPGA RAMs, as discussed in Section 5 and (4) dealing with OpenCL's inability to share local memory across kernels, as discussed in Section 5.
Overall, we feel that the trade-off is a good one as it resulted in increased productivity with little effect on the performance and area of the design.
FUTURE WORK
This section discusses potential methods for decreasing resource utilization, removing constraints and scaling up the design to increase performance.
Fixed-point representation
The current design uses 32-bit single precision floating-point numbers (i.e. float) for all data (positions, vectors, material properties, lengths, etc) that require decimal point precision, with the exception of the 64-bit fixed-point accumulation that was discussed in Section 5. Based on software profiling and previous experience with implementing the algorithm on an FPGA, we believe these could be reduced to 18-bit fixed-point numbers. The Intel FPGA SDK for OpenCL arbitrary integer precision extension [15] could be used to implement custom fixed-point widths with varying precision. This could provide up to a 2x reduction in RAM, DSP and ALM utilization for the OpenCL kernels, thereby enabling more pipeline instances and higher throughput per device.
Custom memory controller
The current design limits the size of the mesh to 65k tetrahedral elements but medical models can contain more than 1M elements.
The storage is broken down into two components: read-only data for the tetrahedral geometry and read-write data for the energy accumulation arrays. Memory trace profiling, obtained using the software simulator, indicates that there is little temporal reuse of data. A 4-8 entry LRU cache for both the geometry and accumulation data could provide a modest 60% hit-rate. Larger LRU caches show poor trade-offs in terms of area and hit-rate. As hypothesized in Section 4, memory profiling results showed highly irregular access patterns, which does not fit CPU or GPU-like caching structures well. However, the irregular access pattern is static across the entire simulation.
An on-chip memory controller could be created that removes the mesh element constraint without decimating the performance. The geometry and energy accumulation array data would be moved to the high-capacity (up to 32GB on the Terasic DE10-Pro) external memory of the board and the custom caching controller could be implemented using the FPGA RAMs. For both the geometry and energy accumulation data, we would create a 4-8 entry LRU L1 cache backed by a static L2 cache. To populate the static L2 cache, we could run a small subset of the packets preemptively in the CPU, sort the tetrahedrons by access rate and store the highest accessed tetrahedrons in the static L2 cache. The time overhead involved with this flow would decrease the performance of a single simulation. However, this time could be amortized when running many simulations with the same mesh, which is the typical case when solving PDT inverse problems [31] .
Multiple pipeline instances
The end of Section 5 discussed how we instantiated two instances of the pipeline in the FPGA. Now, we will discuss how a more sophisticated approach could improve RAM utilization when scaling up the design further.
Based on the resource utilization from Table 2 , the limiting factor for duplicating the pipeline is RAM (50%). However, without the on-chip caching of tetrahedrons, the RAM usage of the kernels drops to~10% per pipeline and the limiting factor becomes ALMs (12%). Thus, we believe that 8 copies of the pipeline can fit on-chip if the RAMs are managed properly.
A naive approach would be to duplicate the on-chip memory for each instance of the pipeline, as we did for the two-pipeline version in Section 5. However, as the number of pipelines is increased, the duplicated memory will severely limit the amount of tetrahedrons that can be stored on chip. Assuming the custom on-chip memory controller from the previous section were implemented, the L1 and L2 caches for both the geometry and energy accumulation data would be duplicated. This would decrease the maximum size of the static L2 caches for the memory controllers and therefore reduce their hit-rate. However, by utilizing the two ports and multipumping support of the FPGA RAMs [14, 15] , the RAM utilization from pipeline duplication can be decreased.
The energy accumulation arrays are read from and written to, meaning that the RAMs used to implement their cache must have one port for reading and the other for writing. In addition, the Intel FPGA OpenCL documents [14, 15] only mention multi-pumping support for memory reads. This means that the on-chip memory controller cache for the energy accumulation arrays must be duplicated for each pipeline instance. This eliminates the necessity for synchronization logic and avoids multiple pipeline instances updating the same memory location in the same cycle.
On the other hand, the geometry data is read-only, meaning that multiple pipeline instances can share the two read ports of the RAMs. In addition, each read port can be double or triple-pumped, allowing a single geometry data cache to supply two, four or six instances of the pipeline if configured using no multi-pumping, double-pumping or triple-pumping, respectively. To supply 8 instances of the pipeline, the most logical configuration would use two copies of the static L2 cache and double-pump the two read ports, as depicted in Figure 5 . This configuration would allow the static L2 geometry caches to store~4x more tetrahedrons than the naive configuration which duplicates the caches for each pipeline instance. To estimate the performance of this design, we scale up the single-pipeline performance by a factor of 8. This is similar to the performance scaling we experienced from the duplicated pipelines in Section 6.3 and the use of more CPU cores [25] . To estimate the energy-efficiency of this design, we multiply the core dynamic power (excluding I/O) by a factor of 8. This estimate is conservative since the global board control logic and the shared on-chip RAM caches (and therefore their dynamic power) will not scale linearly with the number of pipeline instances. This leads to an estimated 16x performance and 17x energy-efficiency improvement over FullMonteSW.
CONCLUSIONS
We have demonstrated FullMonteFPGACL, the first complete and verified FPGA-accelerated implementation of a tetrahedral mesh MC biophotonic simulator. We leverage OpenCL for FPGAs to improve development productivity without significantly sacrificing area or performance. FullMonteFPGACL can accommodate meshes with up to 65k tetrahedral elements and has been validated and benchmarked against highly-optimized and multi-threaded software, over which it shows both performance (4x) and energyefficiency (11x) advantages. We have presented techniques for writing OpenCL code for FPGAs that generates efficient hardware, as well as potential methods for reducing the FPGA resource utilization, removing memory constraints and scaling up the design to achieve up to a 16x performance and 17x energy-efficiency improvement over the highly-optimized and multi-threaded software baseline. This could allow inverse solvers to finish in minutes rather than hours or improve the result quality by running more simulations in an allocated time. FullMonteFPGACL is open-source and can be accessed from: www.fullmonte.org.
