ABSTRACT The major requirements of a good tomographic reconstruction algorithm are a reduction in radiation dosage, accurate reconstruction, detail enhancement, and rapid reconstruction time. Some of these factors are covered by many algorithms but are not collectively addressed in one. While the Maximum Likelihood Expectation Maximization (MLEM) algorithm fares well on many of these factors, it is difficult to apply this algorithm in real-time due to its long execution time. Our predetermined goal is to reduce the execution time to a large extent so that the MLEM's advantages can be leveraged by using hardware accelerators such as Field Programmable Gate Arrays (FPGA). The FPGAs are becoming especially popular as hardware accelerators and are well known for their programmability, configurability, and massive parallelism through a large number of Configurable Logic Blocks (CLBs). Although FPGAs are extremely versatile, they require complex languages like Verilog or VHDL to program them, incorporating changes in the design level at a later stage in FPGAs demands increased effort. Here, in this paper, for the first time, we present a parallel structure for hardware acceleration of the MLEM on the mammoth Virtex 7 VC709 FPGA. Using available tools, we also present a programming flow to design the algorithmic acceleration hardware architecture. The proposed flow does not require prior knowledge of the traditionally cumbersome Hardware Description Languages (HDLs) and this makes post design changes very easy to incorporate and validate. The parallel architecture is implemented on an FPGA operating at 220 MHz and we have achieved a 288× performance compared to an optimized software execution on an Intel Xeon workstation with 12-cores 3.1 GHz 32 GB RAM and 12 MB Cache architecture.
I. INTRODUCTION
Tomographic image reconstruction techniques such as computerized tomography (CT) and single photon emission computed tomography (SPECT) require high-performance computing solutions that augur the need for rapid access to medical imaging during critical healthcare procedures [1] . The highly parallel nature of frequently used Radon transform and other tomographic algorithms makes it possible to
The associate editor coordinating the review of this manuscript and approving it for publication was Lei Ding.
use embedded computing with parallel processing architectures as solutions to achieve a substantial gain in computational strength. Hardware accelerators are used to accelerate reconstruction algorithms that are computationally intensive and time consuming.
Although algorithms such as Maximum Likelihood Expectation Maximization (MLEM) and Ordered Subset Expectation Maximization (OSEM) produce images of better quality, [2] , [3] , higher spatial resolution [4] and accurate geometric reconstruction [5] , The Radon Transform-based Filtered Back Projection (FBP) algorithm is widely used in practice VOLUME 7, 2019 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/ FIGURE 1. SPECT Imaging The injected isotope travels in the bloodstream and reaches the organ of interest. The disintegrations from the organ are picked up by the detectors and produce images which give details of the anomaly under suspicion.
as it requires less computing power and less execution time in tomographic imaging [6] . Bruyant et al. provides a comparative analysis of all available tomographic algorithms [7] . One of our earlier works reported acceleration of postprocessing of medical images using DSPs. [8] While DSPs have their advantages in the processing of ''pipelined'' tasks, they do not perform very well for ''parallel'' tasks. Parallelism can be especially useful while processing large data as in tomographic imaging. Various groups have tried performing these computations in the cloud [9] , [10] . However there would be the risk of loss of privacy of patient data. Apart from DSPs and cloud computing, another option would be the use of Graphics Processing Units (GPU). In terms of their features, GPUs and FPGAs are very different. GPUs can be equipped with hundreds or even thousands of parallel computing cores, but FPGAs are more suitable for supporting the high-performance computations due to their lower latency time even though they run on lower clock rates. [11] . The programs running on a FPGA run on ''bare fabric''. In case of FPGAs, the data can be directly delivered to the pins of the fabric using optional peripherals thus reducing the latency time. Whereas, in case of GPUs, the data would have to be delivered using the standard buses like USB and PCIe, therefore, inducing the dependencies on a OS and thereby, increasing the latency. The success of software-based accelerators, such as GPUs, is mainly due to their widespread use in consumer-driven markets like gaming industry and mobile markets. But they are not completely suitable for the problem under discussion.
An Ultra Scale FPGA can deliver a massive raw computation power of 38.3 INT8 Terra operations per second (TOP/s) while running at its base frequency. A State of Art GPUNVidia Tesla P40 card also produces the same computation power of 40 TOP/s but, at the cost of consumption of twice the power requirements of FPGA. In our application, we also needed the flexibility to operate around multiple data types like floating point, INT8 and binary. FPGAs have the flexibility to support this framework (along with the availability of on board DSP slices) by delivering 500 TOP/s of binary computation, which is 25 times more than that offered by a GPU.
One drawback for industrial use of GPUs is that the accelerators available have a limited re-configurability. On the contrary, FPGA has a performance advantage over GPUs as they are known primarily for their re-configurability and massive hardware parallelism with the availability of a large number of Configurable Logic Blocks (CLBs). They also possess customized architecture [12] , [13] . FPGAs are largely used for industry applications, providing a very long product life cycle. The FPGAs can be programmed at the Register Transfer Level (RTL), providing tremendous programming flexibility for both FPGA's internal and peripheral architectures to accelerate computationally intense problems.
For implementing designs in FPGAs, the programming language category known as Hardware Description Languages (HDLs) is used. The two popular industry standard HDLs are: Verilog and VHDL, which provide both sequential and combination logic design features. The main deficiencies of these HDLs, however, are that these languages are extremely verbose and involve rigid syntax. The hardware design with these HDLs are prone to error, cumbersome to debug, difficult to incorporate design changes, and requires full knowledge of these languages.
Over the past few years, a novel technique known as High Level Synthesis (HLS) has been developed to mitigate this tedious HDL-based designing. Synthesis refers to the translation process from behavioral description to structural description or to register transfer level. HLS is a powerful tool that translates the algorithmic design based on high-level languages into the corresponding HDL design with the same functionality.
HLS plays a central role, enabling high-level, untimed or partially timed C, C++ or SystemC specifications to be automatically synthesized to a low-level cycle-accurate registertransfer (RTL) specification for efficient implementation in ASICs or FPGAs. [14] It supports designing RTL, simulation, generating HDL code and continuous feature testing. Using HLS, a typical FPGA design flow involves converting C / C++ code to Verilog / VHDL code, simulating RTL for functional correctness and synthesizing down to gates. Implementation at the board level is then verified and monitored for performance and functionality. Various other Xilinx-developed software tools are used to debug and trace problems back to the RTL source where fixes are made and the cycle starts again.
To accelerate the Maximum Likelihood Expectation Maximization (MLEM) algorithm, we present the full acceleration design flow for embedded systems and explore the benefits of embedded computing using Virtex 7 VC709 VX690T FPGA hardware. One of the main challenges in efficiently implementing FPGA designs is the lack of a direct link between algorithmic design and target hardware. HLS addresses these challenges by providing an algorithm design, simulation, validation, debugging, and verification platform using unified design environment. [15] We used the Xilinx Virtex-7 VC709 FPGA in this work to target MLEM acceleration with HLS. By performing minimal changes to the code with handwritten manipulations [16] , we could achieve the required design parameters. We also discuss in detail the acceleration achieved in FPGAs hardware in comparison with the Intel Xeon workstation (3.1 GHz) (32 GB RAM), and various challenges faced during the design process. We have achieved 288X acceleration with latency as an evaluation criterion compared to the Xeon workstation.
II. BACKGROUND
The literature reports that the acceleration of computation time of MLEM algorithm for 64 projections of 128×128 pixels and 500 projections of 736 × 64 pixels of projection data in the CPU is 14 minutes, 1.52 hours by [17] and [18] respectively. One group reported an acceleration of 85× compared to Intel XEON 5138 CPU [12] , with the implementation of an EM algorithm with a customized convey HC-lex, which included four Virtex 6 LX 760 FPGAs.
Another group working with multi-GPU cluster did similar work to speed up a distributed MLEM algorithm using 160 × 160 × 20 and 240 × 240 × 15 projections, and the algorithm was executed at 139 seconds for each projection [19] . Several groups have tried accelerating the MLEM by making modifications from the algorithmic perspective. Some groups have tried accelerating by efficiently distributing the work amongst the many parallel cores [20] - [22] . Groups like [23] , [24] have implemented thread divergence strategies, prior fetch of data [18] , reduction of data movement [25] , using half precision floating point to reduce data transfer rate [26] .
These methods have shown slight improvement in the acceleration levels, in spite of many of these modified algorithms being implemented on GPUs. The reported values are in the range of just over 1.5× to a maximum of 5×.
Reported research has proved that the minimum number of iterations needed for MLEM algorithm convergence range from 50-60 iterations [27] , [28] . Given the fact that our desktop implementations have taken about 3 hours for completion of 50 iteration, algorithmic changes alone are insufficient to realize real time imaging,
III. MAXIMUM LIKELIHOOD EXPECTATION MAXIMIZATION (MLEM) ALGORITHM
Complete 3-dimensional medical imaging and measurement requires enormous computational power and large data storage. As a result, this reconstruction of the tomographic image becomes more intensive and time consuming. The solution is to use iterative methods of reconstruction such as Maximum Likelihood Expectation Maximization [29] . Fig. 3 shows the MLEM algorithm pictorially. The MLEM algorithm's purpose is to find a solution that best fits the estimate for f, that is, the mean number of radioactive FIGURE 2. MLEM Image Reconstruction The organ is voxelized and probabilities of it's incidence on the detector plane is assigned.
FIGURE 3. Maximum Likelihood Expectation Maximization (MLEM)
The process starts with a initial guess called the system matrix. The iteration introduces the new feed of data from the detector and it continues until the system has reached convergence. disintegration in the image that can produce the sinogram with the highest probability. This algorithm therefore involves two steps that are repeated at each iteration, namely Maximum-Likelihood and Expectation-Maximization.
Maximum-Likelihood: seeks a converged solution of the final reconstruction by iteratively maximizing the log likelihood function created from the projection data and imaging models.
Expectation-Maximization: the emission process is usually modeled with the Poisson probability distribution.
Expectation step (E-step): the formula expressing the likelihood of any reconstructed image given the measured data is formed.
Maximization step (M-step): the image with the greatest likelihood of providing the measured data is found. Based on the estimation of the distribution of activity from the previous iteration, the expected projections are calculated using the forward projection of system matrix. The iterations are carried out until convergence is achieved between successive runs, typically 50-60 iterations are done [27] .
A. MATHEMATICAL MODELLING OF MLEM ALGORITHM
Consider f j to be the mean number of events detected by the disintegrations occurring in the j th element in the voxelized body. Matrix A consists of elements a ij which are the probabilities that represent bin i detecting a photon emitted from voxel j. VOLUME 7, 2019 The mean number of photons g i detected by bin i is the sum of the mean numbers of photons emitted from each pixel.
Because the number of photons emitted from m pixels and detected by bin i is a Poisson variable, the probability of detecting g i photons is:
Since i is independent, the probability (P) of acquiring the measured projection count distribution, given an estimated distribution of activity in the emission object f is the product of probabilities for individual projection pixels. The
where P(g i ) represents the Poisson probability of detecting g i photons in bin i. Mathematically, we can state that the maximization of L(f ) provides the most likely distribution of emissions that represent the original activity distribution according to the measured projections. Therefore, the logarithm of likelihood is given by
Secondly, the maximization step finds the image that has the greatest likelihood to give the measured data by finding the maximum of equation
Equation (5) can be written as
The algorithm thus takes the form:
where: EST -Estimate; BP -Back Projection; ProjProjection
IV. HARDWARE AND DESIGN TOOLS A. FPGA AS HARDWARE ACCELERATOR
The primary aim of hardware acceleration is to increase the computational speed by using custom hardware specially designed to implement a particular routine or algorithm. The hardware accelerator architecture is shown in Fig. 4 . The advantage of this is that the CPU is free to process other data, while the necessary computation for the routine to be accelerated is offloaded to the hardware co-processor. The only time the processor is involved is to set up the co-processor to begin its calculation and the time it takes to receive the results. To realize the speedup, the necessary overheads to communicate with co-processor must be less costly than performing the actual computation.
Another advantage is that, as far as the application allows, FPGAs can produce deterministic performance and latency. Software implementations like CPU and GPU need to rely on complex protocol stacks and are frequently subject to resource scheduling by an underlying operating system. Calculation of latency in FPGAs can be accurately determined because a scheduler of the operating system does not obstruct a FPGA. In addition, implementation of FPGA is considered to be extremely reliable and needs much less time than ASICs for device development. This makes FPGA an excellent choice for the hardware-based acceleration task.
B. VIVADO HIGH LEVEL SYNTHESIS (HLS)
High Level Synthesis (HLS) is the eclipse based platform that interprets the C / C++-based algorithmic description of a desired functionality and translates it into an IP block with the same functionality without changing the semanticization of the algorithm. HLS relieves hardware designers from the time-consuming and tedious hardware description on standard HDLs such as Verilog and VHDL. The main advantage of High level synthesis technology is that the designer can alter the application functionality by simply adding a couple of lines of C code that are considerably easier and more efficient than remodeling HDL code [30] . The High-Level Synthesis tool of Xilinx helps to convert a C-level description to a Register Transfer Level (RTL) implementation that would be implemented in the Xilinx's Field Programmable Gate Array. C / C++, SystemC or Open Computing Language (OpenCL) C kernel can be used to write the Algorithmic Specifications. The Xilinx Software Development Kit (SDK) is closely integrated with other design platforms, such as the Vivado Design Suite (HLx Editions) [31] . This offers extensive support for the language and features for designing the algorithms both for their use of resources and design time in an optimal and efficient way. The results of the Vivado HLS synthesis depend entirely on the quality of the C / C++ code design and the effective use of directives in the code.
HLS acts as a bridge between hardware and software designing domains where hardware designers have the flexibility to design abstractions such as C and C++ at a higher level and software designers can accelerate computational design. The various steps involved in HLS are as follows:
1) Functional Verification and Debugging of the C/C++ algorithm. 2) Synthesis of the C/C++ algorithm into RTL description using various directives 3) Generation of detailed report and Analysis of the design. 
C. OPTIMIZATIONS IN HLS
HLS provides various techniques of optimization that can be implemented in the code using the available directives. Three techniques of optimization used in this work are discussed briefly here.
1) PIPELINING
Pipelining is a technique of optimization where different operations take place concurrently. Before starting the next operation, the pipelined computational task does not have to complete all operations. In HLS, both loops and functions are pipelined. The tasks are pipelined using the PIPELINE directive. In case of function pipeline, the pipeline runs until the execution is finished. But in the case of loop pipeline, the pipeline is executed only until all loop iterations are completed. Fig. 6.(a) shows the execution of instructions in the case of a pipeline loop. 
2) ARRAY PARTITIONING
Arrays can be the limiting factor for throughput in a read/ write intensive algorithms. This is because of the fact that the arrays are implemented as BRAMs having maximum of only two data ports. This limitation can be resolved by splitting the array into multiple smaller arrays. The ARRAY_PARTITION directive is used to partition the Arrays. HLS allows the partitioning of arrays in three different manner. Block: Original large array gets partitioned into equal sized smaller blocks of successive data element. Cyclic: Equal sized cyclic blocks are created interleaving the elements of larger array. Complete: Array is split into individual elements resolving memory into registers. The partitioning of arrays is shown in Fig. 7 .
3) LOOP UNROLLING
Vivado HLS allows the designers to unroll the for-loops either fully or partially. The loop unrolling is implemented using the UNROLL directive in source file. In the full unrolled version of loop, all operations are performed with single clock cycle. The Loop unrolling mechanism is shown in Fig. 6.(b) .
For realizing efficient design implementation in FPGA hardware and to achieve maximum speedup, the floating-point operations have to be replaced with fixedpoint operations [18] . HLS provides arbitrary precision fixedpoint data types which gives efficient utilization of FPGA resources. We have taken care to tackle the quantization error that arises from the use of fixed-point data types by carefully balancing the design without compromising the requirements of the application. Otherwise, the error due to truncation on the precision of data will propagate and accumulate, especially in iterative algorithms like MLEM.
Vivado HLS guides the designer in identifying and replacing various non-synthesizable constructs in the C-code. HLS does not support System calls in the core code that is to be synthesized. It does not allow Dynamic Memory Allocation and usage of malloc(), calloc(), and free(),Xilinx:2016. HLS also puts a limitations in the usage of pointers in the design and does not support tail recursions .
D. HARDWARE PLATFORM:Virtex-7 VC709 BOARD
Virtex-7 FPGA VC709 evaluation board provides a hardware simulation environment for evaluating and developing designs targeting the Virtex-7 XC7VX690T-2FFG1761C FPGA [32] . With about half a million logic gates, 2800 DSP slices, and large on board memory, VC709 is a mammoth device for studying the potential of MLEM acceleration. The board provides the dual DDR3 small outline dual-inline memory module (SODIMM) along with plenty of practical interfaces, such as GPIO, FMC, JTAG, USB, RS232, Ethernet etc. as shown in Fig. 8 . To the best of our knowledge, this is the first reported work of MLEM acceleration which uses the Virtex 7 VC709.
JTAG is used as a communication interface to program the core FPGA sub-blocks and read/write directly at the register level. UART cable is used for sequential transmission of bytes of data from host PC to target FPGA.
V. METHODOLOGY AND IMPLEMENTATION

A. IP CORE DEVELOPMENT IN Vivado HLS
In Xilinx Vivado HLS, both core C++ code and testbench for MLEM algorithm was developed and functionality check was carried out. The C++ code for MLEM was verified to be synthesizable and made free of any unsupported language constructs. The RTL code was synthesized and the functionality verification was done for synthesized RTL description by RTL co-simulation. After the RTL test passed with zero errors, the synthesized RTL was exported as IP block as shown in Fig. 9 . This IP core carries out the functionality of MLEM algorithm. The designed MLEM-IP was exported to Vivado Design Suite IP-Integrator, for integration with processor, memory and other necessary IP-subsystems.
B. IP-INTEGRATION AND BITSTREAM GENERATION IN DESIGN SUITE
The Vivado Design Suite (HLx edition) provides platform for RTL-to-Bitstream FPGA design flow [31] . The custom designed IP, exported from Vivado HLS was added in the Vivado IP-Integrator catalogue by specifying the path for location of the designed IP. The FPGA board on which the execution of algorithm is desired, was specified before carrying out the IP integration. The IP Integrator is a GUI-based interface that allows to establish connections between complex IP subsystems. The IP block of the MicroBlaze soft processor core was selected from the IP-catalogue which contains the Board Specific Packages for the Virtex7 VC709 FPGA board used. The Memory Interface Generator (MIG) IP-core was added in the design to create the interface between the design in FPGA fabric and the DDR3 Memory of the Virtex7 board. The AXI-Direct Memory Access (DMA) IP was used for direct transfer of data from Memory to the MLEM-IP core thereby relieving the main processor core to carry out other computations during the data transfer process. Integration of IPs was carried out in the IP-Integrator and the design was validated. The overall IP-Integration in Vivado Design Suite is shown in Fig. 10 . Flow Navigator provides control over the major design tasks, such as project block design, synthesis, implementation, and bitstream generation. After validation, the integrated design was synthesized to create the gate-level netlist. The netlist was then mapped to the FPGA hardware and configuration bitstream was generated to program the FPGA. The Generate Bitstream option was chosen in flow navigator to generate the programming bitstream file, describing the hardware which was finally implemented in the FPGA.
C. SOFTWARE APPLICATION IN Xilinx's SDK TO PROGRAM THE FPGA
Application development for the FPGA platform dedicated for the MLEM algorithm was done separately in Xilinx's Software Development Kit (SDK) after exporting the design. The address mapped Hardware Platform Specifications for the design (system.xml) was exported from Vivado Design Suite to SDK. The C++ language based software application for the MLEM algorithm was developed in SDK to send the input projection data file to IP core and to retrieve the processed data from the FPGA hardware.
The programming of the FPGA was done through the JTAG interface. The data transactions between the Virtex7 VC709 FPGA board and the Workstation was done through UART. The executable link-able file (.elf) for the MLEM algorithm design was run on the VC709 hardware through the SDK software.
The processed data, that is the reconstructed image data was received back in SDK and stored. This data was exported to the MATLAB environment to view the reconstructed Image. The corresponding execution time for the MLEM algorithm in FPGA hardware was noted for comparison.
VI. RESULTS AND DISCUSSION
The complexity in MLEM algorithm arises primarily due to the calculation of number of projections of the raw data from the detector and then estimating the maximized likelihood function repeatedly (50 iterations) to arrive at the converged value which involves a lot of Multiplication-Accumulation (MAC) operations.
As explained earlier, the FPGA used in our work -Virtex 7 has a large number of DSP slices that are customized and can handle these operations efficiently. The RTL synthesizer recognizes these patterns and accordingly allocates the corresponding hardware logic cells.
In this work, We have evaluated the comparative performance and resource utilization for the MLEM algorithm in Intel Xeon Workstation and Virtex7 VC709 FPGA. The Shepp-Logan Head phantom was used as a standard to evaluate the algorithm in both of the platforms.
A phantom of dimension 128 × 100 × 360 and an initial guess image of dimension 64×64×64 were given as an input to check the algorithm's functionality. The in-built system time calls were used to estimate the CPU time and AXITimer IP was integrated with other IP subsystems to estimate the execution time in FPGA hardware.
A. HARDWARE ACCELERATION
We have verified the hardware acceleration achieved by comparing the execution time of MLEM algorithm in CPU and the FPGA hardware.
The average execution time of one MLEM algorithm iteration in CPU is 21.564 seconds and 0.8508 seconds in FPGA. Table 1 shows the comparison of the total execution time received from appropriate system calls in Intel Xeon workstation and FPGA. Thus, acceleration of ≈ 25X was achieved with FPGA hardware without implementing any code optimization.
2) WITH CODE OPTIMIZATION
The average execution time in FPGA for one iteration of MLEM algorithm in optimized case is 0.0757 seconds. The estimate of resource utilization in the FPGA board by the MLEM algorithm is provided in HLS as a synthesis report. The maximum latency required for one non-optimized MLEM algorithm iteration (clock cycles) was 1.6×10 9 clock cycles. The maximum latency was reduced to 5.3 × 10 7 cycles after implementing the loop-pipeline, array partition, and loop-unrolling optimization in the code. Fig. 12 shows the number of pipeline loops in the design. We observe that the design does not pipeline a few of the loops. This is due to the inter-iteration dependencies in the design of two or more loops. When the loop computation depends on the results of the previous loops then these loops are excluded from HLS pipeline. We can conclude that the maximum latency in algorithm execution arises from the running of the 360-degrees outer loop. 
VII. CONCLUSION
In this work, we have presented the complete High Level Language-to-FGPA design flow and implementation of MLEM Tomographic Image reconstruction Algorithm, using Vivado HLS and Vivado Design Suite as the major design tools. We have shown the reconstruction of Shepp Logan head phantom image to validate the implementation of the MLEM algorithm in Virtex7 VC709 FPGA. We conducted a comparative study of execution time for MLEM algorithm in Intel Xeon Processor vs a Virtex 7 FPGA and found that FPGA provides 288X speed-up with the implementation of various code optimization techniques. This will be useful for rapid access to medical images in various time critical situations.
