field programmable gate array (FPGA) implementation of this optimized algorithm is able to process events in excess of 250 K events per second, which is greater than the maximum expected coincidence rate for an individual detector. In contrast with all detectors being processed at a centralized host, as in the current system, a separate FPGA is available at each detector, thus dividing the computational load. These methods allow SBP results to be calculated in real-time and to be presented to the image generation components in real-time. A hardware implementation has been developed using a commercially available prototype board.
presented a continuous miniature crystal element (cMiCE) detector as a lower cost alternative, with a statistics-based positioning (SBP) algorithm used for event positioning from list mode data [1] - [3] . We are currently developing a full preclinical scanner based on the cMiCE detector module. However, the processing and memory requirements of performing the event positioning from the list mode data results in an undesirable delay between the end of data collection and the positioning of the event data. Processor arrays are able to reduce this time; however, a goal of this work is to provide a real-time implementation in which processing occurs at the detector level. In this paper, several methods of algorithm optimizations, along with a pipelined FPGA implementation, results in local processing of our SBP algorithm at the detector level of over 250 K events per second.
A field programmable gate array (FPGA) is a special purpose integrated circuit (IC) designed to perform complex interface and logic processing in a small footprint [4] . The FPGA is a reconfigurable hardware platform consisting of a large number of simple logic elements with a matrix of interconnects that can be selectively enabled or disabled under the control of a downloadable configuration file. This allows the FPGA a flexibility that is not available with microprocessors or custom logic chips.
The flexibility of the FPGA allows many options to be considered in implementing the SBP method. The proposed system requires a method of generating solutions at a much higher rate than available with the postprocessing method currently used. Some of the hardware options to be explored include parallel processing and hardware acceleration of the individual computations. Software options include optimizing the algorithm performing the computations and conversion from floating-point to an integer or fixed-point implementation to reduce the computational complexity. These methods will be explored in the following sections.
II. MATERIALS AND METHODS

A. cMiCE Detector Module
A cMiCE detector module is pictured in Fig. 1 . It is composed of a 50 mm by 50 mm by 8 mm-thick slab of LYSO (Saint Gobain, Newbury, OH) coupled to a 64-channel multianode PMT (H8500, Hamamatsu Photonics K.K., Japan). One of the 50 mm by 50 mm surfaces was polished. All other surfaces were roughened. The polished side was coupled to the PMT using Bicron BC-630 optical grease. The face of the crystal opposite the PMT was painted white. The side surfaces of the 0018-9499/$26.00 © 2010 IEEE crystal were painted black to reduce light reflections off the sides.
B. SBP
Suppose the distributions of the 64 PMT outputs, for scintillation position x are independent normal distributions with mean and standard deviation . The likelihood function for making any single observation from distribution given x is
The maximum likelihood estimator of the event position x is given by (2) The SBP method requires that the light response function versus interaction location be characterized for the detector. Two SBP look-up tables (LUTs) corresponding to the mean and variance of the light probability density function (PDF) versus (x,y) position are created during the characterization process. The LUTs are 127 127, corresponding to 0.4 mm by 0.4 mm pixels for binning. When a coincidence event is detected signals from all 64 channels are collected. A photopeak energy window (i.e., 511 keV 20%) is applied to the data and if the energy of the event falls within the window it is accepted as a good event and will be passed along for SBP positioning. For more detailed explanations about the SBP implementation refer to Joung and Ling [1] , [2] . Please note that the Joung implementation was for a very thin 4 mm thick crystal and in that work it was determined that the term did not effect the positioning performance in the algorithm. However, as we have moved to a thicker crystal (i.e., 8 mm thick), it has been determined that the should be included in the calculation.
A sample of the positioning performance of a cMiCE detector module using our statistics-based positioning (SBP) method is active area) is 1.6 mm FWHM. In comparison, using standard Anger positioning the useful imaging area is only 34 mm by 34 mm. Further, the average spatial resolution for the Anger method is 2.2 mm and the positioning is biased [2] .
C. Equation Simplification
The proposed method to implement SBP within an FPGA relies upon multiple levels of optimization. The first is reducing the number of computations required for each iteration of the likelihood equation, the central component of the SBP method. The log likelihood function for each signal channel can be represented as (3)
The objective is to manipulate this function to minimize those numerical operations which are expensive to perform. With the computational resources available in this project the following guidelines are employed:
• Adds and subtracts are efficient.
• Multiply is relatively efficient, but should be kept to a minimum.
• Squares are special cases of multiply, but should also be minimized.
• Division is very inefficient because it requires an iterative approach.
• Transcendental functions should be avoided. Each iteration of (3) consists of seven operations: 1) Subtract . 2) Square the result. 3) Square . 4) Multiply the result by 2. 5) Divide the result of step 2 by the result of step 4. 6) Find the natural log of . 7) Add the result of step 6 to the result of step 5. The transcendental function of step 6 increases the complexity of this method well beyond the seven operations, and needs to be eliminated. Also, the division should be replaced with a multiplication by a precomputed inverse if possible. Following the guidelines given above leads to the simplification shown in (4). If we let , and , (3) can be rewritten as (4). 
D. 21 Channel Solution
The original SBP solution summed the log likelihood equation for each of the 64 channels of the 8 8 sensor. However, it was found that using channels far from the peak reduced positioning accuracy. Using the 21 channels surrounding the peak energy, as shown in Fig. 3 , led to better positioning performance. In addition to improving positioning, this results in a 3:1 reduction in the number of operations needed for each solution.
E. Conversion to Fixed-Point
Floating-point calculations are computationally expensive in an FPGA implementation. Converting to integer representations, while retaining the required resolution, necessitates multiplying by a constant before the conversion to move fractional parts of the value into the resulting integer. Testing has shown that the values used in the algorithm can be reduced to a 2-byte fixed point number while maintaining the range of values and accuracy of the original algorithm [5] . The fixed point numbers can be precomputed and stored on the hardware for use with the FPGA, allowing all calculations to be made using fixed point numbers. This conversion significantly reduces memory requirements and computational effort.
F. Hierarchal Search
An exhaustive search method has the advantage of being easy to implement and is guaranteed to always find the absolute minimum solution. It has the disadvantage of being very inefficient, both in terms of the number of iterations, and the number of memory accesses to find a solution. The solution set, Fig. 4 , illustrates the algorithm's output for all possible positions given an event. The vertex of the solution set is the location at which the event is most likely to have occurred. Observation of the properties of the solution set made it apparent that a structured search could be used to converge to a solution with many fewer iterations.
Of the structured searches explored [5] , it was found that a hierarchal search provided the best combination of reduced iterations and memory usage, allowing memory to be partitioned in a way that was conducive to a pipelined implementation. This search functions by finding the solutions at a number of evenly This algorithm yields a large reduction in the number of points that need to be sampled. Variations in the number of sampled points at each iteration were simulated, ranging from a 2 2 array of samples at each iteration of the search, resulting in an 8 iteration search in a 127 127 solution space to 5 5 samples resulting in a three iteration search. These require a total of sampled points for the 2 2 search to points for a 5 5 search. The optimal solution was found to be a hierarchal search with 3 3 samples at each iteration, dividing the solution space to approximately one-quarter of its previous size. An example of this search on a 63 63 solution space is shown in Fig. 5 . This example shows the five iterations necessary to fully search the 63 63 space. The solution space of each successive stage is shown by the increasingly smaller box. Within each solution space, the nine X-Y points to be tested are shown, centered on the best solution from the preceding stage. The search concludes with a three by three search at a spacing of one pixel.
The regularity of this search allows each level of the search to consist of an almost identical process, where each iteration starts with a center point for the search and the spacing between points to be tested at this iteration. For the 3 3 search shown, 9 sets of characterization data table values are needed for the second stage, one for each position to be tested (see Fig. 6 ). The next iteration will consist of potential sample positions at one-half the spacing, with sets of points centered on each of the previous stage's points, with the first stage defaulting to being centered on the center of the entire data Structuring the search in this way allows points sampled during each iteration to be independent of calculations currently underway in other iterations. This allows calculations for different events to be pipelined, allowing multiple stages to be processed simultaneously.
A second advantage of this method is that the storage requirement for each stage's data table is reduced for each iteration prior to the last. Ideally, for samples, the new solution space would contain times as many possible solutions as the previous iteration. For our example with nine samples, we would hope for nine times as many potential sample points at the next iteration. However, experiments determined that when the final solution was located approximately midway between two sampled points, this solution was not always nearest to the best sampled point. The cause of this was found to be that the solution set is not completely symmetric and is subject to some solution noise. To solve this problem, each successive iteration must operate on a subset that is significantly larger than times the size of the previous iteration.
Our approach is to have the solution space dimension at each stage to be an integer power of 2, minus 1 . Constraining each level to have an odd number of values allows each solution space to have a pixel in the exact center, with other X-Y locations spaced at a power of two. A representation of a simple four stage system is shown in Fig. 6 . The top row of Fig. 6 shows the spacing of the locations that can be tested at each stage. The bottom row of Fig. 6 shows how these values are stored using a compressed addressing scheme where the unused values are not included in the table for that stage. This allows the data table at all stages except for the final stage to be stored in a much smaller memory space. The storage of the odd number of values is padded with blank values to allow separate X and Y addresses to access the desired coordinate without requiring an address calculation. The proper value is addressed by simply appending the binary representation of the X location to the Y location. For example, if the desired X coordinate is and the Y coordinate is , the address of the desired value is at . As the solution moves from one stage to the next, address translations to match the next stage are accomplished by multiplying the X and Y coordinates by two. This is accomplished by appending a 1 to the binary address values. For the example above, and , so the resulting address of the same coordinate in the next stages addressing scheme would be . This allows the hardware system to do in-stage addressing and next-stage address translations in zero time.
As the solution progresses, the only information passed from one stage to the next is the center point on which to base that stage's search. Each stage has access to data table information at twice the resolution as the previous stage, but will only be accessing values directly adjacent to the center value, as shown in Fig. 7 . This is done by sequentially adding then subtracting one from the values of the center X and Y positions passed from the previous stage.
In this example, it can be seen that the spacing between tested locations reduces from four in the 15 15 space, to two in the 7 7 space, to one pixel as the last 3 3 stage is reached. The one pixel spacing of the last stage results in an exhaustive search of this reduced space.
G. Pipelined Architecture
To attain the desired processing rates for this project, multiple events must be simultaneously processed. Although parallel processing paths could easily be formed in the FPGA, multiple copies of the complete data tables would be required to support this architecture. Using the search described above, a pipelined architecture is possible. In this design, each stage of the hierarchal search is implemented with dedicated hardware that has exclusive access to a data table appropriately sized for the stage. As each event's processing at one stage is completed, the intermediate results are passed as a starting point to the next stage, allowing the next event to immediately begin processing.
This results in multiple events simultaneously being at various stages of their processing without requiring multiple complete copies of the data tables.
H. Memory Division
The characterization tables for each sensor must be available for solving the likelihood function. The most direct approach would be to store the tables in the FPGA's on-chip SRAM memory. Processing the events would be simplified compared to accessing off-chip synchronous dynamic (SD) or double data rate (DDR) memory. The difficulty with this method is the large size of the tables. Each of the three tables requires 16-bit values for each of the 8 8 channels of sensed data, for 127 possible positions in each of the X and Y axes. This equates to or 6 Mbytes. This is well above the approximately 300 kB to 1.1 MB available in members of the family of FPGA devices chosen for this project (i.e., Altera Stratix II or III devices, Altera, San Jose, CA).
The prototype design employs a mix of FPGA internal memory and high-speed external memory. This allows all the small data tables required by the early stages of the process to have individual data and address busses, allowing parallel access by the different stages. The largest data tables are kept in external memories where adequate storage is available. Because most of the processing is done in the multiple earlier stages, memory bandwidth requirements are significantly reduced to the external memories.
I. Input Buffering
Although the average event rate is expected to be 250 K events per second or less, because of the random nature of positron decay instantaneous peak rates as high as 2 times this value are expected. An input FIFO buffer allows data collection at up to the peak rate, and delivery to the SBP algorithm at no more than the designed rate.
J. Prototype Implementation
A block-diagram of the system design is shown in Fig. 8 . The numbered components of this system are:
1) An interface to the incoming data.
2) A FIFO buffer. This allows the randomly occurring samples to be collected at up to the peak rate, but processed at their average rate. This is necessary in the target system because the events occur randomly and may momentarily occur at faster rates than the SBP algorithm can process them. This buffer allows them to be stored and dispensed to the algorithm at a lower average rate. 3) This block determines the row and column of the sensor with the peak energy. This is made available to the SBP algorithm to allow the 21-cell version of the SBP algorithm to calculate the proper channels to process for the corresponding event. 4) The values of an event are stored locally at each stage so that they can be accessed for the calculations required at that stage. As processing of a particular stage completes, the values are passed along to the next stage. Double-buffering is used to allow each stage to transfer the values to the next stage before either stage has completed its current calculations. This allows the transfer of data to overlap the calculations. 
III. RESULTS
To test the accuracy of our design, the results of a number of sampled locations were compared to the results of the exhaustive search on an event by event basis. These comparison tests were run on experimental data. The results show that over 90% of the events show no error and over 99% of the positioned events are within 1.4 pixels (0.54 mm), representing all the pixels immediately adjacent to the reference pixel. Note that in this comparison, diagonal pixels are represented by the Euclidean distance.
The accuracy of the simplified algorithm can be confirmed by running tests with both versions of the algorithm using the same data. A sample test point for both the integer and floating point methods, Fig. 9 , illustrates no noticeable difference between the two methods. The average FWHM intrinsic spatial resolution for a range of point positions starting in the corner of the crystal and moving toward the center of the crystal along a diagonal (Table I) is almost identical for the two methods. The results in Table I are consistent with our measurement of an average intrinsic spatial resolution of 1.6 mm FWHM across the useful imaging area of the detector as shown in Fig. 2 . The largest difference between the two methods was for the data point near the corner of the crystal. We know that positioning near the edges and corners of a monolithic crystal detector are the most challenging. On the other hand, we have shown that using LUTs with some depth of interaction information can improve the decoding performance of a monolithic crystal detector, especially near the corners of the crystal [3] . Therefore, we are currently working on an FPGA implementation that will allow for depth of interaction processing. This implementation is beyond the scope of this paper.
The throughput of the system is defined as the rate at which new solutions are completed. In this pipelined design, it is equal to the time for one stage to complete. For the current design, each stage requires nine summations of the 21-channel solution, each being able to complete in one clock cycle. Additionally, three clock cycles are required for transferring this value to the next stage. This results in a new event starting, and one completing, every 192 clock cycles. At the designed clock rate of 70 MHz, this results in a throughput of 360 k events per second.
The latency is defined as the time from the acceptance of the data to the presentation of the processed SBP position to the host. Disregarding the time required to pass the detector data through the FIFO and the time to pass the results to the host, this is equivalent to the combined time of all the stages. For a complete seven stage system, this would be seven times 192 clock cycles at 70 MHz, or approximately 20 .
IV. CONCLUSION
This study showed that optimizations to the SBP algorithm to allow real-time implementation on an FPGA can be made without adversely affecting the accuracy of the results. These optimizations included algebraic manipulation of the SBP equation to allow higher speed computations, reduction of the number of cells required for each solution, conversion to fixed-point math, and the use of a structured search algorithm.
The hierarchal search method developed as part of this study greatly increases the search speed over current implementations and allows the memory requirements of the SBP method to be distributed across multiple stages and both internal and external memories. This allows a multistage pipeline solution to be used. This was a significant factor in increasing the speed of the SBP solution.
Collectively, these optimizations allow an accelerated SBP solution to be found at a rate that exceeds the maximum expected average event rate, to an accuracy that compares satisfactorily with existing methods. This project has shown that it is possible to use an FPGA implementation of SBP to facilitate real-time processing of event data for a PET system. This is a major step forward in the development of a research preclinical PET system being built at the University of Washington.
