ABSTRACT The emerging need for medical devices to provide immediate visualization and real-time diagnostic imaging drives the demand for high computational power in the associated electronic circuitry. This demand is often met by software methods that accelerate the computation, which is possible only to a certain extent and impairs the feasibility of real-time imaging and diagnosis. In this paper, a new method that uses digital signal processors (DSPs) with a specialized pipelined vision processor (PVP) embedded at the hardware level to accelerate routinely time-consuming imaging computation is proposed and validated. A lab prototype is built for the feasibility study and clinical validation of the proposed technique. The unique architecture of the PVP in a dual-core DSP offers a high-performance accelerated framework which, along with its large on-chip memory resources and reduced bandwidth requirement, provides an ideal architecture for reliable medical computational needs. We have taken two sets of sample studies from SPECT for validation: 27 cases of thyroid medical history and 20 cases of glomerular filtration rate of kidneys. The results were compared with definitive post-scan SIEMENS image analysis software. The statistical results show that this method achieved high accuracy and a 250% acceleration of computational speed.
I. INTRODUCTION
The efficiency of imaging algorithms requires innovative architecture solutions. Solutions that depend exclusively on Instruction Level Processors (ILPs) consume a lot of energy. In contrast, the ASIC type of Instruction Programming approach fares very well with respect to performance and power efficiency, but does not allow the kind of flexibility and software environment the ILPs offer. Hence, there is a need for a combination of both of the above advantages [1].
Implementing most image processing algorithms on the host PC has several disadvantages due to the computational load it induces on the system. Many serial operations have to be implemented to configure a single step of computation. This makes real-time imaging a challenging task, as the algorithm competes with the programs running on the host PC for its share of resources such as memory, bandwidth and pipelined tasks [2]. Computationally demanding functions like masked spatial filtering, image segmentation and Integral Image Computation place an even greater demand on the resources of the host PC.
Further complications arise in systems where medical imaging is performed. Huge amounts of data stream from the detector of a medical imaging device and need rapid processing. The Gamma Camera is one such medical imaging device, where a lot of data is generated when gamma radiation, emitted by a radiopharmaceutical containing an isotope, hits the detector. For a regular dosage of 2 millicurie (mCi) for a thyroid scan, 74 million radioactive decays per second are produced. A considerable percentage of this radiation reaches the detector, gets scintillated and showers a beam of visible photons on the anodes. The illuminated anodes contain the information about the location of scintillation. The data from the anodes run into an algorithm which maps and finds the location of the source of scintillation in the human body. This step is called Image Reconstruction and is carried out for each and every scintillation occurring on the detector surface. Typically for a thyroid scan, this acquisition goes on for about 12 minutes, which generates a large amount of data. Our research group is working towards minimizing the time taken to run the Image Reconstruction and Processing algorithms. The idea is to use the power of hardware acceleration to execute the algorithms which are considered computationally costly due to the load they induce on the host PC. In this paper, the work completed in hardware acceleration of Image Processing is presented. We also report the comparative performance of the same algorithm on three different platforms for Thyroid Uptake Ratio [3]. The speed and accuracy of the analysis have also been compared.
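The decay rate quoted above follows directly from the definition of the curie (1 Ci = 3.7 × 10^10 decays per second). The short C sketch below reproduces that arithmetic and gives a rough upper bound on the number of decays during a 12 minute acquisition. The function names are hypothetical, and the bound ignores the decay of the isotope during the scan, detector efficiency and geometry.

```c
/* 1 curie is defined as 3.7e10 decays per second, so 1 mCi = 3.7e7 Bq. */
#define BQ_PER_MCI 37000000.0

/* Convert an administered activity in millicurie to decays per second (Bq). */
double mci_to_decays_per_second(double activity_mci)
{
    return activity_mci * BQ_PER_MCI;
}

/* Rough upper bound on the decays occurring during an acquisition.
 * Assumption: constant activity, i.e. isotope decay over the scan,
 * detector efficiency and geometry are all ignored. */
double decays_during_scan(double activity_mci, double scan_minutes)
{
    return mci_to_decays_per_second(activity_mci) * scan_minutes * 60.0;
}
```

Under these assumptions, a 2 mCi dose gives 7.4 × 10^7 decays per second and about 5.3 × 10^10 decays over 12 minutes, only a fraction of which reach the detector.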
II. ANALOG DEVICES DSP BF609
A Digital Signal Processor (DSP) is a microprocessor optimized for the operational needs of digital signal processing. A DSP can process data in real time, making it suitable for applications that are intolerant of delays. Digital signal processors take digital signals and process them to produce clearer sound, faster data or sharper images.
A unique processor from Analog Devices with Blackfin architecture called BF609 is being used to achieve hardware acceleration of Image processing [3] .
The ADSP-BF609 processor is a member of the Blackfin family of products. Blackfin processors combine the advantages of a RISC-like microprocessor instruction set, single-instruction, multiple-data (SIMD) multimedia capabilities and a dual-MAC signal processing engine into a single instruction-set architecture. ADSP-BF609 Blackfin processors are a new type of embedded processor designed to meet the computational demands of today's imaging systems where performance and power constraints exist [4], [5].
As shown in Figure 1, the ADSP-BF609 is a dual-core fixed-point processor used for embedded vision and real-time imaging. The specialty of this processor is its unique hardware engine called the Pipelined Vision Processor (PVP). The PVP comprises a set of blocks which operate independently of the two cores and is used to accelerate image processing algorithms and reduce the overall bandwidth requirements. In addition, the ADSP-BF609 processor contains features for safety-critical applications, including a Cyclic Redundancy Check for memory protection, and Parity and ECC protection in the internal memory blocks [6].
Each processor core consists of two 16-bit multipliers, a 40-bit shifter, two 40-bit accumulators, two 40-bit general-purpose ALUs and four 8-bit video ALUs. A single MAC can perform a 16-by-16 multiplication in one cycle, and the ALUs operate on 16-bit or 32-bit data.
Blackfin processors support a modified Harvard architecture with a hierarchical memory structure. All resources, including I/O control registers and memory (internal and external), occupy separate sections of a unified address space. These memory portions are arranged hierarchically to balance fast, low-latency on-chip memories against lower-cost, larger off-chip memories. For example, L1 memory is a fast, low-latency memory which typically operates at the full processor speed, while off-chip memory systems such as SDRAM, flash memory and SRAM can be accessed using the External Bus Interface Unit (EBIU).
The memory DMA controller handles block transfers of code or data between internal and external memory spaces, providing high-bandwidth data movement capability [3].
III. ADVANTAGES OF PVP
A Pipelined Vision Processor provides a set of 12 high performance signal processing blocks that can be combined to form data processing pipes. These blocks are optimized for tasks typical of video and image processing, robotics, and 2-dimensional vector applications. The PVP works in tandem with processor cores. It is optimized for convolution and wavelet based object detection, classification, tracking and verification algorithms. PVP offers more than 25 giga operations per second (GOPs) with very low memory bandwidth and configurable data-paths. PVP operates on frame rates up to 1280×960×30Hz (16bits).
The PVP bundles a set of processing blocks required for high-speed 2-dimensional digital signal processing. These blocks are highly configurable and provide a broad set of pixel processing features. The block diagram of the PVP is shown in Figure 2. The definition of each block that the PVP houses is given below [3], [7]:
1) Cartesian to Polar Magnitude and Polar Angle conversion (PMA):
This is used when conversion of two 16-bit signed inputs in Cartesian format (x,y) into Polar form (Magnitude, Angle) is required. Identification of non-zero pixel crossing in edge detection algorithms can be done with PMA.
2) Convolution (CNV):
This block supports 2D convolution over pixel ranges of 1×1, 3×3 and 5×5 with up to 16-bit coefficients. The PVP has four such convolution blocks. The first and second derivatives of pixel ranges and Gaussian image smoothing can be calculated by choosing the CNV block coefficients accordingly.
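A software reference for the 3×3 case helps make the role of the coefficients concrete. The sketch below is a plain C illustration, not the PVP implementation: the function name is hypothetical, the kernel is applied without flipping (which is identical for symmetric kernels), and border pixels are simply set to zero because the border handling of the hardware is not covered here.

```c
#include <stddef.h>
#include <stdint.h>

/* Apply a 3x3 kernel to the interior of a row-major 16-bit image.
 * Assumptions: kernel applied without flipping; border outputs set to 0.
 * A 64-bit accumulator is used so nine 16x16-bit products cannot overflow. */
void convolve3x3(const int16_t *src, int64_t *dst,
                 size_t width, size_t height, const int16_t kernel[9])
{
    for (size_t i = 0; i < width * height; ++i)
        dst[i] = 0;
    if (width < 3 || height < 3)
        return;

    for (size_t y = 1; y + 1 < height; ++y) {
        for (size_t x = 1; x + 1 < width; ++x) {
            int64_t acc = 0;
            for (size_t ky = 0; ky < 3; ++ky) {
                for (size_t kx = 0; kx < 3; ++kx) {
                    acc += (int64_t)kernel[ky * 3 + kx] *
                           (int64_t)src[(y + ky - 1) * width + (x + kx - 1)];
                }
            }
            dst[y * width + x] = acc;
        }
    }
}
```

With the coefficients {-1, 0, 1, -2, 0, 2, -1, 0, 1} this computes a horizontal first derivative (a Sobel operator), while coefficients such as {1, 2, 1, 2, 4, 2, 1, 2, 1} followed by a division by 16 approximate Gaussian smoothing.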
3) Integral Image Computation:
Integral image blocks calculate a 2-dimensional integral over the input frame and output the summed area table (SAT). Alternatively, these blocks can generate a rotated SAT (RSAT) or can integrate in the horizontal dimension only (integral row mode).
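The summed area table can be sketched in a few lines of C. This is a software reference under stated assumptions (row-major 16-bit input, 64-bit sums, hypothetical function name) and does not model the RSAT or integral row modes.

```c
#include <stddef.h>
#include <stdint.h>

/* Build a summed area table: sat[y * width + x] holds the sum of all
 * source pixels in rows 0..y and columns 0..x (both inclusive). */
void build_sat(const uint16_t *src, uint64_t *sat, size_t width, size_t height)
{
    for (size_t y = 0; y < height; ++y) {
        uint64_t row_sum = 0; /* running sum of the current row */
        for (size_t x = 0; x < width; ++x) {
            row_sum += src[y * width + x];
            sat[y * width + x] =
                row_sum + (y > 0 ? sat[(y - 1) * width + x] : 0);
        }
    }
}
```

For the 2×2 input {1, 2, 3, 4} the resulting table is {1, 3, 4, 10}: the last entry is the sum of the whole frame.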
4) Pixel Edge Classification Block:
PEC supports edge detection including: non-linear edge enhancement filtering in a pixel neighborhood, edge classification based on orientation, subpixel position interpolation, vertical/horizontal sub-pixel position into one byte per pixel. PEC operates either in first derivative mode (PEC-1) or second derivative mode (PEC-2).
5) Threshold-Histogram-Compression Block:
The threshold-histogram-compression block implements the statistical and range-reduction signal processing functions. Many other PVP blocks can be connected to this block using the PVP's pipeline interconnection options.
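As a rough software analogue of two of these functions, the sketch below builds a histogram of 16-bit pixels and applies a binary threshold. The bin count, the function names and the mask format are hypothetical choices for illustration and do not reflect the register-level behavior of the block.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_BINS 16u /* hypothetical bin count chosen for illustration */

/* Count 16-bit pixels into NUM_BINS equal-width bins covering 0..65535. */
void histogram16(const uint16_t *pixels, size_t count, uint32_t bins[NUM_BINS])
{
    for (size_t b = 0; b < NUM_BINS; ++b)
        bins[b] = 0;
    for (size_t i = 0; i < count; ++i)
        bins[pixels[i] / (65536u / NUM_BINS)]++;
}

/* Binary threshold: mask[i] is 1 where pixels[i] >= level, otherwise 0.
 * Returns the number of pixels at or above the level. */
size_t threshold16(const uint16_t *pixels, uint8_t *mask,
                   size_t count, uint16_t level)
{
    size_t above = 0;
    for (size_t i = 0; i < count; ++i) {
        mask[i] = (uint8_t)(pixels[i] >= level);
        above += mask[i];
    }
    return above;
}
```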
6) Up-Down Scaling Block:
The up-down scaling block expects 16-bit or 32-bit unsigned input data and drives 32-bit output data.
When an anti-aliasing or an averaging filter is enabled, the input must be 16 bits wide; correspondingly, the output is 16 bits wide and is presented in the lower 16 bits of the 32-bit output.
7) Output Formatters (OPFn):
OPFs collect the data results of PVP blocks, apply final formatting, and forward the results to the DMA channels. OPF0, OPF1 and OPF2 serve the camera pipes. OPF3 serves the memory pipe. Each OPF is associated with a specific DMA channel.
8) Input Formatters (IPFn):
IPFs can receive data directly from the video input interface (the Enhanced Parallel Peripheral Interface (EPPI)) and from memory through DMA. IPFs incorporate pre-processing techniques including color or luminance components extraction, pixel windowing, frame counting, and control of frame processing.
IV. MATERIALS USED
This section lists the software tools, hardware tools and processor boards used to accomplish hardware-accelerated image processing.
A. ADSP BF609 EZ KIT BOARD
The EZ kit provided by Analog Devices has the BF609 processor interfaced with a wide range of connectivity options. The block diagram of the EZ Kit is shown in Figure 3. It has an on-board 25 MHz oscillator, and the processor can run at up to 500 MHz [7].
B. STAND ALONE DEBUGGING AGENT (SADA)
An interface between the processor and the PC is required to execute DSP algorithms as well as to visualize the results. An In-Circuit Emulator (ICE) or a stand-alone debugging agent (SADA) is used to interface the PC to the Joint Test Action Group (JTAG) header on the processor. The Standalone Debug Agent provides a modular, low-cost emulation solution for EZ-Boards.
C. CROSS CORE EMBEDDED STUDIO (CCES)
Cross Core Embedded Studio is an integrated development environment (IDE) for the Analog Devices Blackfin and SHARC processor families. It is an Eclipse-based IDE and employs the latest generation of mature code generation tools, which provide seamless, intuitive C/C++ and assembly language editing, code generation and debug support. The IDE includes driver support for on-chip and off-chip peripherals, stacks for Ethernet and USB, a popular real-time operating system and a file system. It provides an easy-to-use development framework for working with the BF609.
V. METHODS

A. CONFIGURING THE BF609
This section details the methods deployed for medical image segmentation using the BF609 processor. The application's entry point is at Core0, where commands are issued from CCES. Figure 4 shows how the application is distributed inside the processor. Core0 takes in image frames from the host PC in the form of a text file, and the image segmented from the Region of Interest is handed over to the PVP. The multi-core library manages all the signaling and buffer exchanges between cores [3]. Transfers of data between a peripheral and memory, or between two memories, are done by DMA channels. These channels can transfer data between off-chip and on-chip memories. DMA manages all memory-related access.
The DMAxx CFG register configures the specifically allotted DMA channel, which can be done using the function Pvp_mempipe_InputDMAConfig().
FIGURE 4. Developed control flow in BF609.
VOLUME 5, 2017
The following code snippet shows the configuration of DMA channel 43, which serves as the PVP0 Memory Pipe Data Input.
void Pvp_mempipe_InputDMAConfig()
{
    *pREG_DMA43_ADDRSTART = nBuffer;
    ssync();
    *pREG_DMA43_XCNT = 128u * 128u;
    ssync();
    *pREG_DMA43_XMOD = 4u;
    ssync();
    *pREG_DMA43_CFG = NUM_DMA_CFG_ADDR1D
In the input section, the IPF register PVP_IPFn_CTL controls the IPFn pipeline features, which include color format, unpacking of incoming data, output port selection and format conversion. PVP_IPFn_HCNT and PVP_IPFn_VCNT contain the horizontal and vertical counts (in pixels) for the region of interest at PVP_IPFO_VPOS and HPOS respectively. Each block of the PVP can be configured to specify the block from which its input is received and the block to which its output is sent. The registers have to be carefully manipulated so that the right setting is configured. The following code snippet shows the configuration of the IIM block inside the PVP, whose configuration structure is shown in Figure 5.
*pREG_PVP0_IIM1_CFG = (ENUM_PVP_IPF1 << BITP_PVP_CNV_CFG_IBLOCK0) | BITM_PVP_IIM_CFG_START;
*pREG_PVP0_IIM1_CTL = (ENUM_PVP_IIM_CTL_RECTMODE);
Each configuration register can be manually configured by looking into its structure and locating the specific bits. The above code configures the IIM whose structure is as given in Figure 6 :
Similarly, all the blocks of the PVP have to be configured so that the signal flow shown in Figure 4 is established. This flow defines the path the image takes in the processor while the segmentation and the data extraction are performed. Once the coordinates of the portion of the image to be segmented are set by the user, the PVP is configured to count the pixel intensities in the ROI (Region of Interest), as the pixel intensities reveal the spatial distribution of the isotope in the patient's body. The functioning of the IIM in the PVP is similar to a set of ray sums along the region specified by the user. Figure 6 summarizes the function of the IIM.
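Counting the pixel intensities inside a rectangular ROI is a constant-time lookup once a summed area table is available. The sketch below illustrates that lookup in plain C. It assumes a rectangular ROI with inclusive corner coordinates and a row-major table of 64-bit sums; the function name is hypothetical and the code does not reproduce the register-level operation of the IIM.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum of the pixels inside the rectangle with inclusive corners
 * (x0, y0) and (x1, y1), read from a summed area table in which
 * sat[y * width + x] is the sum over rows 0..y and columns 0..x.
 * Assumption: x0 <= x1 < width and y0 <= y1 < height. */
uint64_t roi_sum_from_sat(const uint64_t *sat, size_t width,
                          size_t x0, size_t y0, size_t x1, size_t y1)
{
    uint64_t total = sat[y1 * width + x1];
    if (y0 > 0)
        total -= sat[(y0 - 1) * width + x1];       /* strip above the ROI   */
    if (x0 > 0)
        total -= sat[y1 * width + (x0 - 1)];       /* strip left of the ROI */
    if (x0 > 0 && y0 > 0)
        total += sat[(y0 - 1) * width + (x0 - 1)]; /* corner removed twice  */
    return total;
}
```

Only four table reads are needed regardless of the ROI size, which is why an integral image suits repeated count extraction from user-selected regions.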
B. THYROID UPTAKE RATIO ESTIMATION USING BF609
Thyroid gland function and structure can be evaluated using uptake and scintigraphy studies. Thyroid uptake and scintigraphy play an important role in various clinical situations, such as detecting ectopic thyroid tissue in neck masses, functionally assessing single or multiple nodules, and detecting hyperthyroidism. In a typical thyroid study, the Technetium 99m radiopharmaceutical is introduced into the patient intravenously. The isotope is allowed to distribute around the body. The tracer present in other parts of the body contributes to the background of the region, which has to be manually subtracted. The thyroid study receives background counts from the accumulation of the tracer in the nearby salivary glands and mediastinal organs, thus requiring manual background subtraction.
The thyroid imaging follows a four-step imaging process. Initially, the full dosage in the syringe is imaged to estimate the total number of counts. The syringe is imaged again after the tracer is injected into the body, as a part of it gets left behind in the syringe. A part of the tracer also accumulates near the site of injection, which is referred to as the antecubital site. The difference between the total count and the above two gives an estimate of the amount of tracer which entered the body. The following formula gives the standard counts used in calculating the Thyroid Uptake Ratio [3]:
$$\text{Standard CPM} = (\text{Full syringe counts}) - (\text{Empty syringe counts} + \text{Antecubital counts})$$
where CPM denotes counts per minute.
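The standard counts can be computed as in the sketch below. The first function follows the expression given above. The second function is an assumption that is not stated in this paper: it uses the common definition of percentage uptake as background-corrected thyroid counts divided by the standard counts. All names are hypothetical.

```c
/* Standard counts per minute, as given in the text:
 * full syringe minus (empty syringe plus antecubital) counts. */
double standard_cpm(double full_syringe, double empty_syringe,
                    double antecubital)
{
    return full_syringe - (empty_syringe + antecubital);
}

/* ASSUMPTION (not from the paper): percentage uptake taken as
 * (thyroid ROI counts - background counts) / standard counts * 100.
 * Returns -1.0 if the standard counts are not positive. */
double thyroid_uptake_percent(double thyroid_cpm, double background_cpm,
                              double standard)
{
    if (standard <= 0.0)
        return -1.0;
    return (thyroid_cpm - background_cpm) / standard * 100.0;
}
```

The thyroid and background counts here would come from the ROI pixel sums extracted by the PVP.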
We compared three methods to study performance in terms of time and accuracy of prediction. The SIEMENS proprietary tool is taken as the control, the ROI selection is done by the doctor, and the test is performed on 27 thyroid and 20 renal patients. Its performance was compared to 1) host PC segmentation and 2) the ADSP SoC.
C. GLOMERULAR FILTRATION RATE ESTIMATION USING BF609
A nuclear medicine Technetium 99m DTPA (Diethylene triamine pentaacetic acid) renal scan is performed to look at the blood supply, function and excretion of urine from the kidneys. The test can find out what percentage each kidney contributes to the total kidney function.
The renal imaging also follows a four-step imaging process similar to the Thyroid Uptake Ratio study. The kidney ROI is drawn to acquire data from the regions needed to calculate the Renal Uptake (RU) percentage and the Glomerular Filtration Rate (GFR). These two parameters define the functioning of the patient's kidneys. They are calculated using the Gates formula, which is given by the equation below.
Total Renal Uptake percentage (RU%) 
VI. DEVISED APPROACHES TO MEASURE ACCELERATION
The medical RAW data was acquired by SIEMENS E-CAM and processed into a DICOM image format. To facilitate the time study and parameter estimation by proposed methods, the following two approaches were devised:
A. APPROACH 1: HOST PC BASED SEGMENTATION
To measure the time taken to run the algorithm on the host PC, an ROI-based segmentation algorithm was written in MATLAB. The program was run on a 2.2 GHz, 64-bit Intel i7 machine with 8 GB RAM and 6 MB cache. Functions like imfreehand() were used to draw the ROI and extract the pixel intensities. The time taken for the computation was measured using MATLAB's internal timer and displayed using the profiler.
B. APPROACH 2: ADSP BF609 SoC BASED SEGMENTATION
The EZ kit containing the BF609 was connected to a 64-bit Intel i5 computer with 8 GB RAM using the SADA daughter card. The debug agent is used as a bridge to interface the BF609's JTAG over USB. The code was loaded onto the chip using CCES. The relevant inputs and outputs were accessed by performing file I/O operations. The run time of the algorithm was measured with a digital timer.
The normality assumption of the dependent-samples t-test was considered satisfied, as the skew and kurtosis levels were estimated at 0.61 and 1.79 for the TUR study and 0.1855 and 1.88 for the GFR study respectively, which are less than the maximum allowable values for a t-test (that is, |skew| < 2.0 and |kurtosis| < 9.0) [9]. The correlation between the two conditions was estimated at r = 0.9989 and p < 0.0001 at the 95% confidence level for both studies, suggesting that the dependent-samples t-test was appropriate in this case.
VII. RESULTS AND DISCUSSION
The null hypothesis of equal measurement means could not be rejected, with t(26) = 0.498 and p < 0.6230 for the TUR study and t(19) = 0.3888 and p < 0.7023 for the GFR study. Thus the readings of the SoC-based architecture were statistically similar to the SIEMENS software readings. Cohen's d was calculated at 0.013 for TUR and -0.00413 for GFR, which is a small effect based on Cohen's guidelines [10], [11].
In terms of time, the chip achieved about a 250% acceleration in segmentation, pixel extraction and computation of the index parameters compared to running the same algorithm on the host PC. The computation times are shown in Figure 9. This shows that the SoC design reduces the time taken to complete the algorithm.
Thus the study shows that the SoC-based method, in addition to accelerating the computation by about 250%, provides excellent correlation with the established methods currently in use, thus facilitating real-time diagnostic imaging.
