Abstract-The PET image reconstruction based on the EM the final image based on the result of a statistical hypothalgorithm has several attractive advantages over the conventional convolution backprojection algorithms. However, the PET image reconstruction based on the EM algorithm is computationally burdensome for today's single processor systems. In addition, a large memory is required for the storage of the image, projection data, and the probability matrix. Since the computations are easily divided into tasks executable in parallel, multiprocessor configurations are the ideal choice for fast execution of the EM algorithms. In this study, we attempt to overcome these two problems by paralleliziing the EM algorithm on a multiprocessor system. The parallel EM algorithm on a linear array topology using the commercially available fast floating point digital signal processor (DSP) chips as the processing elements (PE's) has been implemented. The performance of the EM algorithm on a 386/387 machine, IBM 6OOO FUSC workstation, and on the linear array system is discussed and compared. The results show that the computational speed performance of a linear array using 8 DSP chips as PE's executing the EM image reconstruction algorithm is about 15.5 times better than that of the IBM 6000 RISC workstation. The novelty of the scheme is its simplicity. The linear array topology is expandable with a larger number of PE's. The architecture is not dependent on the DSP chip chosen, and the substitution of the latest DSP chip is straightforward and could yield better speed performance.
I. INTRODUCTION OSITRON Emission Tomography (PET) is an imaging
P technique to visualize the spatial and temporal distribution of the radio-nucleids inside the human body by measuring the event counts of positron-electron annihilation. There are two main approaches for PET image reconstruction: analytic methods such as the Convolution Back Projection (CBP) algorithm [ 131 which was originally devised for computer aided tomography (CAT), and the iterative algorithms such as expectation maximization (EM) algorithms. An analytic algorithm usually consists of two main computations. One is filtering and the other is back projection. An iterative algorithm, on the other hand, starts with an initial guess of the solution and iteratively updates (corrects) the object according to the computed pseudo-projection and the measured projection data, till convergence is reached. The stopping rule proposed in 1181 is based on a statistical approach, where after each iteration, the estimate is accepted or rejected as _ _ I esis test. The major computations in an iterative method are forward (pseudo-projection) and backward (correction) projections. The EM algorithm requires longer computation time than the CBP method. However, the image reconstructed using the EM algorithm is less noisy than the CBP image and the EM algorithm does not require the projection data to be equally spaced.
Various efforts have been made to speed up the image reconstruction tasks. These efforts essentially fall into three categories, Le., algorithmic improvement, dedicated hardware, and parallel processing. In the first category, most of the attempts for iterative reconstruction methods have been concerned with reduction of the number of iterations, i.e., to make convergence faster [7] , [lo] , [13] , [15] . In the second category, dedicated hardware techniques have been employed to speed up the computation [8] , [9] , [17] . To overcome the two major problems that impede the routine use of the EM algorithm for clinical use, i.e., the long computation time and very large memory requirement, it is imperative to rely on parallel processing techniques which have a potential to speed up the reconstruction by several orders of magnitude. Several attempts at improving the speedup using multiprocessor approach have been reported [2] , [3] , [51, [6] . Chen et al. have studied the parallelization of the EM algorithm on a message passing system (Intel iPSCl2) and on a shared memory system (BBN Butterfly GP 1000) [2] . A data and task partitioning scheme called partition-by-box is proposed in this study. The partition-by-box scheme proposed by Chen uses the broadcast and partial result integration algorithms. The binary tree architecture is more efficient to perform the broadcast and integration algorithms. Though Chen et al. have used the iPSCl2 hypercube system, the pseudo-binary tree embedded in the hypercube has been used for the EM algorithm. In [3] 
THE EM ALGORITKM
The EM algorithm is the basic approach used to maximize the log likelihood objective function for the PET image reconstruction problem. PET images are used to study the human physiology and organ functions. The patient is given a tagged substance which emits positrons. Each positron annihilates with an electron and emits two photons in opposite directions. The patient is surrounded by a ring of detectors and the two photons are detected in time coincidence by a pair of detector elements defining a detector unit or detector tube. The reconstruction problem in PET is to determine the memory map of the annihilations from which information about the regional physiology can be obtained.
In the recent literature, much attention is given to maximum likelihood reconstruction based on expectation maximization (EM). These algorithms are appealing because, unlike other methods such as CBP, they take into account the statistical nature of the measurements. Dempster et al. [41 presented a general algorithm to produce maximum likelihood estimates from incomplete data. Shepp and Vardi [ 161, and Lange [ 111 applied this technique to image reconstruction from PET measurements. The measurement system is shown in Fig. 1 . 
Equation (2) has been implemented on the linear array system. The EM algorithm converges toward a possible unique minimum, and the image obtained after convergence is independent of the initial estimate X[z]O. However, if the procedure is stopped before the maximum likelihood is reached, the initial estimate can strongly influence the result. All the reconstructions carried out in this study were started with identical X [ i ] values.
COMPUTATIONAL COMPLEXITY
The complexity of the EM algorithm is given in [ 161. For a 128 x 128 square object, there are 16384 object boxes. The probability p ( i , j ) that an emission in box i is detected in a tube j depends on a number of physical factors such as the geometry of the measurement system, the decomposition of the object space, the physical properties of the medium and the response of the detector system. In this study, it is assumed that the probability of an emission in box i and its detection in tube j depend only on the geometry of the measurement system. In such a case an annihilation event in box i is detected in a tube j with the probability p ( i , j ) proportional to the angle of view from the center of the box i in to the detector tube j. Shepp et al. [ 161 have shown that the choice of p ( i , j ) based only on the geometry of the measurement system is reasonable, and that the results of the reconstruction do not depend critically on the choice of p ( i , j). Since there are Nt number of detector tubes and N object boxes, the dimension of probability matrix p ( i , j) is N x Nt. For a circular ring measurement geometry with Nd detector elements equally spaced around the circle of radius fi circumscribing the display boxes, the total number of detector tubes Nt is given by Nt = (Nd/2) * (Nd/2 + 1) since there are (Nd/2 + 1) detector intervals opposite each one [16] . For a system with 128 detectors and an object space of 128 x 128, the dimension of the probability matrix p ( i , j ) is 4160 x 16384. Since only a very small percentage of the total detector tubes passes through a box, the probability matrix p ( i , j ) is highly sparse. To minimize the storage requirement, only the nonzero p ( i , j ) ' s are stored.
The EM algorithm represents a family of important problems that are rich in data parallelism. The data parallelism in the EM algorithm may be described by two spaces namely, box and tube spaces. There are three possible data and task partitioning schemes [2] to solve the image reconstruction problem based on the EM algorithm. The first approach, the partition-by-box scheme, is based on partitioning the EM algorithm based on the box space. In this approach, for both the forward and backward steps, a box and all the task and data associated with that box are assigned to a PE.
The second approach, partition-by-tube scheme, uses the detector tube as a subpartition. In this scheme, for both forward and backward steps, a detector tube is assigned to a PE. So all the task and data associated with that tube are assigned to a PE. The partition-by-box and partition-by-tube schemes give almost similar performance. The potential problem of the partition-by-box scheme and partition-by-tube is that the computational load is not well balanced. For the partition-bybox scheme, the computational load associated with each box in the partition region is different. For the partition-by-tube scheme, the number of pixels in each of the detector tubes is different.
In this study, we use the partition-by-tube scheme for implementation on a linear array. The same logic can be used for the implementation of the partition-by-box scheme on a linear array.
The third approach, the partition-by-tube-and-box scheme uses the partition-by-tube scheme for the forward step, and the partition-by-box scheme for the backward step. So the storage overhead for the partition-by-tube-and-box scheme is roughly twice that of the other two schemes, because we have to store the task and data corresponding to the tube space for the forward step, and the box space data and task for the backward step.
In order to implement the partition-by-tube scheme, two approaches are possible. First, the indices of the pixels in each of the Nt tubes are precomputed and stored. Second, the indices of the pixels in each of the tubes are computed in each step. The first approach is faster, but it requires more memory. Most of the memory is required to store the indices of the pixels in each of the Nt tubes, the corresponding pixel values, and the pixel-tubeprobabilities. The parallel algorithm formulated is based on the first approach. We introduce three 1D arrays: integer pixelindices[] to hold the indices of the pixels in each of the Nt tubes, the corresponding real pixel-tube-probability [ 1, and integer pixel-count[ ] to hold the number of pixels in each tube. The array pixelindices[] is stored in numerically increasing order to use binary search to check if a particular pixel lies in the tube processed by that PE.
The partition-by-tube scheme can be easily divided into tasks executable in parallel. In this study, we are investigating the implementation of the partition-by-tube scheme on the linear array multiprocessor system. Each PEt in the linear array executes the following algorithm: linear array implementation of the EM algorithm based on partition-by-tube for each detector tube t = 1 to Nt begin for i = 1 to N /* A P computation The image reconstruction algorithm based on the EM technique is iterative in nature. The major computations in an iterative scheme are forward (pseudo-projection) and backward (correction) operations. From an estimate of the object, the forward algorithm finds the pseudo-projection on each detector tube j. The ratio of the difference between the projection data (actual measurement data) and the pseudoprojection data (computed value) to the pseudo-projection data, i.e., A P ( t ) , gives the extent of resemblance between projection of the object and the projection of the reconstructed object. Based on AP, corrections are made to the initial guess of the object. In the correction operation, AP in the projection domain is back-projected onto the object domain.
IV. MULTIPROCESSOR ARCHITECTURE
The multiprocessor architecture used for the implementation of the parallel EM algorithm is shown in Fig. 2 . It consists of linear connection of Nt identical processors. PEj is connected to PEj-1 and PEj+l through a set of high speed first-infirst-out (FIFO) buffers. PEj can send a 32-b data item to PEj+l and receive a 32-b data item from PEj-1 through the FIFO's. In computing the forward step (pseudo-projection computation), once the pipeline is filled, all the Nt identical processors operate in parallel with each processor computing the pseudo-projection on a tube assigned to that PE. There are Nt number of detector tubes; the task and data corresponding to tube j are preloaded onto PEj. Each PE holds four 1D arrays: integer pixelindices[] that hold the indices of the boxes that lie in tube j, real pixel-values of the pixel indices stored in the array pixelindices[], the corresponding real box-tube-probability [ 1, and integer pixel-count[ ] that holds the number of boxes in tube j. It also gets preloaded with the projection data corresponding to the tube it processes. For maximum processing speed, there should be one processor per detector tube; otherwise each processor will hold data corresponding to multiple detector tubes. In a configuration with Np processors and Nt detector tubes, each processor will be assigned with data and task corresponding to N t / N p tubes. In order to minimize the potential load imbalance problem, data and task corresponding to Nt tubes in Np PE's were distributed cyclically.
The host processor sends the pixel intensities to the processor array. PEj receives the box intensity, and sends the intensity data to the neighboring node, PEj+l. Each PEj keeps track of the index of the pixel it currently processes using a processor register. PEj computes the contribution to the pseudoprojection on tube j from the current box intensity. The integer array pixelindices[] are stored in numerically increasing order. So each PEj uses binary search to find out whether the current pixel lies in tube j. If the current pixel lies in tube j, the corresponding probability value is used to compute the contribution to the pseudo-projection on tube j. Once PEj completes the processing of a pixel, it receives the next pixel intensity data from PEj-l. This process is continued till all the N image pixels are processed. When the complete image has been processed, the pseudoprojection of the image is distributed in the local memories of the N p processors in the linear array with N t / N p pseudoprojection in PEj. The complexity of the binary search for the average case and the worst case is the same and is given by O(log, n) where n is the number of entries in the table. The binary search is quite simple to implement and is quite efficient.
Each PEj in the array computes A P ( j ) , the ratio of the difference between the projection data (actual measurement data) and the pseudo-projection data (computed value) on tube j to the projection data on tube j. A P ( j ) gives the extent of resemblance between the projection of the object and the projection of the reconstructed object.
For handling the backprojection, the architecture employs the same Np processors. In the optimal case, with one PE per detector tube, each processor holds AP for one detector tube. The host sends the box intensities to the processor array through the FIFO. PEj receives the box intensity from PEj-1 and keeps track of the index of the pixel it currently processes. PEj uses binary search to find out whether the current pixel lies in tube j. If the current pixel lies in tube j , the corresponding probability value and AP are used to compute the correction to the pixel intensity. The partially corrected pixel intensity is passed down the pipeline through the FIFO. PEj receives the next box intensity data from PEj-1. This process is continued till all the N box intensities have traversed the pipeline. The last PE in the linear array gives out the reconstructed image.
v. IMPLEMENTATION OF THE LINEAR ARRAY
The linear array has been implemented using ADSP 21020 [l] DSP chips. Fig. 3 shows the schematic of a PE. 256 Kwords of program memory and 256 Kwords of data memory are attached to the DSP device through the program memory space and data memory space respectively. Eight PE's are linearly linked using high-speed first-in-first-out (FIFO) buffers (Am 7200 High Density FIFO 256 x 9 CMOS Memory). Fig. 4 shows the connection between the PE's. Four Am 7200 devices are connected in width expansion mode to form a 256 x 36 FIFO buffer. The FIFO buffers are mapped in the data memory address space of the DSP's. PEj can send data to PEj+l by writing into the FIFOj+l. PEj can receive data from PEj-1 by reading the FIFOj.
The Am 7200 CMOS FIFO is a 256 x 9 dual-port static RAM array and it stores the data written into it in sequential order. The dual-port RAM array has dedicated read and write pointers. The built-in flag logic allows the FIFO to accept and output data asynchronously and simultaneously. The PEj is connected to one of the ports of the FIFO and is controlled by the FIFO control signal full, which gives the status of the FIFO. It is asserted when the FIFO is full. Similarly, PEj+l is connected to the other port of the FIFO and is controlled by the control signal empty. The bit empty is asserted when the FIFO is empty, which indicates that no more reads should be made until PEj writes into the FIFO. The full and empty control signals of the FIFO are in turn connected to the FLAG-1 and FLAG-0 of the DSP device respectively. These two DSP flag bits are programmed as inputs and are tested through program instructions. Thus the FIFO allows the linear array to operate asynchronously. During the forward step, PEj receives the pixel-value from PEj-1 and transmits the same to PEj+l. PEj starts processing the pixel. The computational load at each of the PE's is not the same. So PEj may take a little longer to process a pixel. But as and when it completes, it picks up the next pixel-value and starts processing. During the backward step, a pixel-value arrives at a PEj. The PEj processes the pixel, updates the pixel-value depending upon the contribution from the tube j and then transmits the result to PEj+l. The PEj+l waits till the empty signal changes state, which indicates that new pixel-value has arrived.
A typical DSP such as ADSP 21020 has three independent computation units: an arithmetic and logic unit (ALU), a multiplier, and a shifter. The computation units perform single cycle operations. The three units are connected in parallel and they operate in parallel. In a multi-function instruction, multiple functional units operate in parallel. A 10-port register file is used for transferring data among computation units and data buses, and for storing immediate results.
ADSP 21020 has two independent memories-ne for data and the other for program instructions and data. Two independent address generators (DAG) and a program sequencer supply address for memory access. A program sequencer with a 32-word instruction cache allows the ADSP 21020 to access data from both data memory and program memory and fetch an instruction. In addition, the DAG'S are updated to point to Table I gives the execution time for one iteration of the EM algorithm on a linear array having a single PE, on a linear array with 8 PE's, on a 386/387 running at 33 MHz under Unix operating system and on an IBM 6000 RISC workstation for the partition-by-tube scheme.
The results show that the computational speed of a multiprocessor system for the EM image reconstruction algorithm is about 15.5 times better than that of an IBM 6000 RISC workstation. The speed-up of the linear array with 8 nodes is approximately 5. The EM algorithm was executed on a linear array with 4, 8, 12, and 16 nodes for an image size of 64*64. Table I1 shows the performance of the EM algorithm on multiple processors. The execution time decreases almost linearly with the increase in number of nodes. The total number of detector tubes processed is constant for a fixed image size. The number of detector tubes processed in each of the nodes decreases linearly with an increase in number of nodes. In addition, the memory required to store the data at the nodes reduces linearly with an increase in number of PE's.
The algorithm can also be implemented on a linear array of fixed point DSP devices which are available for as low as $10 per chip.
VI. CONCLUSIONS
In this paper, we have described the parallelization of the EM algorithm on an linear array topology. The linear array topology is expandable with more number of PE's. In this current study, a FIFO buffer between the stages has been used. But with more advanced devices such as an ADSP 21060 containing ADSP 21020 core, 4 Mb of static memory with cross bar switch buses, two serial ports, six 4-b link ports and a powerful DMA processor, it is possible to have glueless interconnection between the processors to build a large network. The architecture is not dependent on the DSP chip chosen, and the substitution of the latest DSP chip is straightforward and could yield better speed performance. The EM algorithm breaks down to a sequence of multiply and multiply/accumulate type of instructions. Since DSP chips are optimized processors for executing multiply and multiply/accumulate instructions, they give high performance. It has been found that the computational speed performance of the 8-node linear array executing the EM image reconstruction algorithm is comparable to that of IBM 6000 RISC workstation.
