Computer-based image registration techniques are increasingly being used in medical imaging because they offer significant benefits for aligning different images and for visualizing the combined result. However, these techniques require an enormous amount of computation time due to the high resolution and complex nature of medical images. We propose to alleviate this problem by using a dedicated Network-on-Chip (NoC) based hardware platform for image registration. This paper describes a novel technique for the FPGA implementation of the B-Spline based Free Form Deformation (FFD) algorithm, a widely used algorithm for modeling geometric shapes in a computerized environment. For performance enhancement, we have utilized a lightweight circuit-switched NoC architecture, which is adaptable to most FPGAs. The design description is captured in the Verilog language and implemented on the Xilinx XC2v6000 device at 37 MHz. The proposed design is parameterizable at compile time and supports a wide range of image resolutions and computational precisions. The experimental results show a significant improvement in performance when compared with the existing hardware implementations of the B-Spline based FFD algorithm.
Introduction
Traditionally, in the medical domain, multiple radiological images of a patient are acquired, printed, and then analyzed by viewing them on a light box. Computer-based image registration techniques offer significant benefits for aligning different images and for visualizing the combined result, and thus are increasingly being used in medicine. For example, it is quite difficult to localize tumors in CT and MR images because the contrast between the tumor and its surrounding tissues is very low 1. However, image registration has been shown to enhance tumor detection significantly 2. Medical image registration methods are primarily based on iterative algorithms that minimize some cost or energy function, which is usually defined in terms of the difference in geometry or intensity between images. Thus, the efficiency of image registration depends directly on the performance of the underlying algorithms. A significant amount of research has been conducted to explore efficient algorithms for medical image registration. Free Form Deformation (FFD) 1 based algorithms are the most commonly used in medical imaging, mainly because they support the modeling of geometrical shapes in a computerized environment. For example, the use in medical imaging of a non-rigid registration algorithm based on the FFD and modeled by B-splines is explained in 3. Similarly, another algorithm for the non-rigid registration of 3-D breast MRI has been investigated 4. One of the common problems associated with image registration algorithms is their enormous computational complexity, due to the high resolution and complex nature of medical images. For example, a 3-D image of 256x256x64 voxels is processed in 15-30 minutes using the FFD algorithm on a Sun Ultra 10 workstation 5.
In order to improve the performance of FFD algorithms, dedicated hardware platforms have been proposed for executing non-rigid image registration algorithms 2, 6. One of the most recent works in this direction is the reconfigurable-hardware (FPGA) based implementation that computes the B-spline based FFDs for medical imaging 5. This paper contributes towards further performance enhancement of the B-spline based FFD algorithm by using a Network-on-Chip (NoC) 7, 8, 9 based hardware implementation. A NoC is essentially a computing system consisting of several interconnected and concurrently running processors. Scalability, design-flow parallelization, and reusability are the main benefits of NoC based implementations. Thus, because NoC architectures are inherently parallel, the proposed approach is expected to improve upon the performance of the B-Spline based FFD algorithm.
We propose to use a simple circuit-switched architecture called programmable NoC (PNoC) 10. PNoC is a very flexible and lightweight architecture for FPGA based systems. PNoC uses a modular design that facilitates the usage of standard interfaces and IPs 11. Higher communication bandwidth and better scalability are the foremost merits of PNoC. The main contribution of this paper is the implementation of the B-Spline based FFD algorithm using PNoC. The results of our implementation are compared with the reconfigurable FPGA based approach to the same algorithm 5, and it has been observed experimentally that the proposed approach leads to a noticeable performance increase and cost reduction. To the best of our knowledge, this is the first time that a NoC based architecture has been proposed for medical image registration applications.
The rest of the paper is organized as follows: Section 2 provides some preliminary information regarding the NoC architectures and the B-Spline FFD algorithm. In Section 3, we describe the proposed NoC based implementation of the B-Spline based FFD algorithm. The results and comparisons with the reconfigurable FPGA based approach are presented in Section 4. Finally, Section 5 concludes the paper.
Preliminaries
To facilitate the understanding of the rest of the paper, this section provides some fundamentals regarding NoC, the PNoC architecture, and the B-Spline FFD algorithm.
Network-on-Chips
The basic ingredients of a NoC architecture, depicted in Figure 1, include the processing elements, connection topology, routing technique, switches, and programming model. From the communication perspective, there are various connection topologies, such as torus, octagon, mesh, ring, and irregular connection networks 13. However, it has been shown that the 2-D mesh architecture is both easy to implement and provides the lowest latency 14. In the circuit-switched architecture 16, there is a dedicated channel for the data flow, so no buffering or queuing is required.
PNoC Architecture
In this paper, our focus is on a NoC based implementation of an FFD algorithm, and for this purpose we require a flexible and lightweight NoC architecture. The lightweight circuit-switched NoC for FPGA based systems described by Hilton et al. 10 fulfills our needs. In PNoC, the network consists of subnets, such that each subnet has a router and a set of network nodes, as shown in Figure 2. Each node is connected to a router, and the circuit switching between the nodes is performed by the router. A dedicated connection is established, used for the data exchange, and removed by means of a light handshaking mechanism. The connection is established when master node A sends the request signal, along with the address of the target node, to the router. The second router then sends the grant signal to the first router, indicating that port B is available, and the connection is established.
A dedicated connection path is used for data transfer, so no acknowledge signal is required. Data transaction can occur on successive clock cycles if master receiver is low. The read and write requests can be pipelined. A CPU is connected to the PNoC like any other module. The interfacing circuit consists of FIFOs and an FSM that communicate with the router. The router is the main component of the PNoC and includes the routing table, queue, and switch box. Another part of the PNoC is the buffer, which is a parameterizable feature. A buffer is necessary in two cases: first, if the nodes and routers are running at different clock rates, and second, when there is a difference between the transmitting and receiving rates.
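The request, grant, and release handshake described above can be illustrated with a small behavioral model. This is only a sketch under stated assumptions, not the PNoC RTL: the class name `CircuitSwitchedRouter`, its methods, and the single-router simplification are hypothetical, and requests that cannot be granted are simply held in a queue until the ports become free.

```python
from collections import deque


class CircuitSwitchedRouter:
    """Toy behavioral model of one router that grants dedicated connections.

    Hypothetical names; a single router is modeled, whereas the text
    describes a handshake that may span two routers.
    """

    def __init__(self, ports):
        self.ports = set(ports)
        self.peer = {}          # port -> port it is currently connected to
        self.pending = deque()  # queued (master, target) requests

    def _connect(self, master, target):
        self.peer[master] = target
        self.peer[target] = master

    def request(self, master, target):
        """Master asks for a circuit to target; True means 'grant'."""
        if master not in self.ports or target not in self.ports:
            raise ValueError("unknown port")
        if master in self.peer or target in self.peer:
            # A port is busy: hold the request until a circuit is released.
            self.pending.append((master, target))
            return False
        self._connect(master, target)
        return True

    def release(self, master):
        """Tear down the circuit owned by master, then retry queued requests.

        Raises KeyError if master has no open circuit.
        """
        target = self.peer.pop(master)
        self.peer.pop(target)
        granted = []
        for _ in range(len(self.pending)):
            m, t = self.pending.popleft()
            if m in self.peer or t in self.peer:
                self.pending.append((m, t))  # still blocked, keep waiting
            else:
                self._connect(m, t)
                granted.append((m, t))
        return granted
```

In this model, two requests between disjoint port pairs are both granted, which mirrors the point that several dedicated connections can be active at the same time, while a request to a busy port waits until `release` frees it.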
The PNoC has been implemented using JHDL 10 , which is not a commonly used HDL. In order to facilitate broad usage of PNoC, we implemented its design in Verilog 17 . We use this Verilog implementation to develop the NoC based implementation of the B-Spline FFD algorithm.
B-Spline based FFD Algorithm
The B-Spline based FFD algorithm is considered to be one of the most powerful techniques for modeling 3-D deformable objects in the domain of non-rigid image registration 4 , which is a special kind of image registration used specifically for images with nonlinear geometric differences. The main motivation behind the optimization of this algorithm in this paper is the increasing utilization of non-rigid registration for the analysis of huge and complex brain images.
For non-rigid image registration, a combined transformation (T) using both global and local transformation is used 5 .
In the case of the B-Spline FFD, the image volume is defined as
Thus, the FFD can be described as the product of 3 1-D cubic B-Splines 5 :
where φ denotes the n_x × n_y × n_z mesh of control points and
B_i represents the i-th basis function of the B-Spline, as follows:
where u ∈ [0, 1]. The image intensities might change between the pre-contrast and the post-contrast image, so direct image intensity comparisons such as the sum of squared differences (SSD) or correlation cannot be used. Normalized Mutual Information (NMI) has been recommended for image alignment because it avoids any dependence on the amount of image overlap 5.
where H(A) and H(B) denote the marginal entropies of A and B, and H(A, B) denotes their joint entropy. It has been proposed 5 that, in order to find the optimal transformation, we have to minimize the cost function associated with the local and global transformation parameters. The term C_similarity, given in Equation (4), corresponds to the image similarity, while the term C_smooth, given in Equation (5), corresponds to the image smoothness 5.
In the above equation, the weighting parameter λ controls the tradeoff between the smoothness of the transformation and the alignment of the image volumes, θ represents the global transformation, and φ represents the local transformation.
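For reference, the expressions referred to in this subsection can be written as follows. This is a sketch that assumes the standard B-spline FFD formulation from the non-rigid registration literature. The control-point spacings δ_x, δ_y, δ_z and the index symbols i, j, k, l, m, n are notation assumed here, and the detailed forms of C_similarity and C_smooth used in 5 are not reproduced.

```latex
% Combined global and local transformation
\[ T(x, y, z) = T_{\mathrm{global}}(x, y, z) + T_{\mathrm{local}}(x, y, z) \]

% Image volume
\[ \Omega = \{ (x, y, z) \mid 0 \le x < X,\; 0 \le y < Y,\; 0 \le z < Z \} \]

% FFD as a tensor product of three 1-D cubic B-splines
\[ T_{\mathrm{local}}(x, y, z) =
   \sum_{l=0}^{3} \sum_{m=0}^{3} \sum_{n=0}^{3}
   B_l(u)\, B_m(v)\, B_n(w)\, \phi_{i+l,\, j+m,\, k+n} \]
\[ i = \lfloor x / \delta_x \rfloor - 1, \qquad
   u = x / \delta_x - \lfloor x / \delta_x \rfloor
   \quad \text{(and similarly for } j, v \text{ and } k, w\text{)} \]

% Uniform cubic B-spline basis functions
\[ B_0(u) = \frac{(1 - u)^3}{6}, \qquad
   B_1(u) = \frac{3u^3 - 6u^2 + 4}{6} \]
\[ B_2(u) = \frac{-3u^3 + 3u^2 + 3u + 1}{6}, \qquad
   B_3(u) = \frac{u^3}{6} \]

% Normalized mutual information
\[ \mathrm{NMI}(A, B) = \frac{H(A) + H(B)}{H(A, B)} \]

% Cost function to be minimized
\[ C(\theta, \phi) = -\,C_{\mathrm{similarity}} + \lambda\, C_{\mathrm{smooth}} \]
```

The four basis functions sum to one for every u in [0, 1], so a mesh in which all control points carry the same displacement produces that same displacement everywhere.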
B-Spline FFD Algorithm Implementation
We propose to implement the FFD on the PNoC using a pipelined architecture, which was initially proposed in 5 and is illustrated in Figure 3. The first stage of this pipeline, Stage 1, processes the input data in fixed-point format. The second to fourth stages use three pipelined multipliers and a pipelined adder to produce the end result. The integer part of the input data points to the grid of control points stored in the external memory, while the fractional part indexes the LUT of B-Spline values. MULT P denotes a pipelined multiplier and ACC P denotes the pipelined adder. The control points (CP) from φ(0, 0) to φ(3, 3) manipulate the central 16 grey points (gp), as illustrated in Figure 4. Seven memory banks are needed for the processing of 2-D images: two for the input data, two for the control points, one for the B-Spline data, and two for the output data.
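The split of a fixed-point coordinate into a control-grid index and a LUT index can be sketched in software as follows. This is a minimal illustration under stated assumptions, not the Verilog design: the number of fractional bits (`FRAC_BITS = 8`), the function names, the use of floating-point LUT entries, and the convention that the integer part directly addresses the first of the four control points in each dimension are all assumed here.

```python
FRAC_BITS = 8                 # assumed fractional width of the fixed-point input
LUT_SIZE = 1 << FRAC_BITS     # one LUT entry per representable fraction


def bspline_basis(u):
    """Return the four uniform cubic B-spline basis values at u in [0, 1)."""
    return (
        (1 - u) ** 3 / 6,
        (3 * u ** 3 - 6 * u ** 2 + 4) / 6,
        (-3 * u ** 3 + 3 * u ** 2 + 3 * u + 1) / 6,
        u ** 3 / 6,
    )


# Precomputed table, playing the role of the B-Spline LUT memory bank.
BSPLINE_LUT = [bspline_basis(k / LUT_SIZE) for k in range(LUT_SIZE)]


def split_fixed(coord):
    """Split a fixed-point coordinate into (integer part, fractional part)."""
    return coord >> FRAC_BITS, coord & (LUT_SIZE - 1)


def ffd_2d(x_fx, y_fx, control):
    """Evaluate one 2-D FFD value from a 4x4 neighborhood of control points.

    x_fx and y_fx are fixed-point coordinates in control-grid units;
    control is a 2-D list indexed as control[i][j].
    """
    i, fu = split_fixed(x_fx)   # integer part selects the control points
    j, fv = split_fixed(y_fx)
    bu = BSPLINE_LUT[fu]        # fractional part selects the LUT row
    bv = BSPLINE_LUT[fv]
    acc = 0.0
    for l in range(4):          # 16 multiply-accumulate steps in total
        for m in range(4):
            acc += bu[l] * bv[m] * control[i + l][j + m]
    return acc
```

The double loop performs the 16 multiply-accumulate operations that the pipelined multipliers and adder would carry out in hardware. A larger `FRAC_BITS` gives a finer LUT at the cost of more memory, which is one way the computational precision could be traded against resources.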
Our Verilog code of the FFD mainly consists of three modules, including the B-Spline Look-Up Table (LUT). A simulation diagram for our FFD implementation is given in Fig. 6. The design shown in Fig. 7 uses 16 control points for each grid of image pixels. The central 8 pixels are fed into block 1A, and the effect of control point 1 (CP1) on these pixels is calculated. The other 8 central pixels are fed into block 1B, and the effect of CP1 on them is calculated. The first 8 central pixels are then passed from block 1A to block 2A, where the effect of CP2 on them is calculated. Similarly, the pixels from 1B are fed into 2B for the calculation of the effect of CP2. After 4 cycles, the effect of all 16 control points on these pixels has been calculated. These steps are repeated until all the pixels in the image are processed. The coordination of the image data transfer between different nodes is the main design challenge of the system. There are two major kinds of communication: the first is between the CPU and the block processors, and the second is between the different block modules.
PNoC is well suited to this system because more than one connection can be active at any instant, so inter-block data transfers can occur simultaneously. The choice of window is made by the router, so no extra hardware is required for it, and if no module is available, the connection request can be queued until a block becomes available. Since the system is quite flexible, additional blocks can also be added to the present design.

Experimental Results
In order to illustrate the effectiveness of the proposed NoC based implementation and to have a fair comparison, we developed a Verilog model of Jiang et al.'s implementation 5 of the B-Spline FFD algorithm and synthesized it for the same FPGA target device. The data width, clock speed, and area in terms of slices of our model were found to be almost the same as those reported previously 5. Next, we implemented our final design for the FFD on PNoC using an XC2v6000 device and recorded its execution time and throughput. Table 1 summarizes the results and compares the proposed PNoC based implementation of the FFD with the other hardware implementations that have been reported previously 5. It can be clearly seen that the proposed design has the highest throughput, reported in pixels/sec, of all the existing FFD implementations for a 2-D image. The throughput of the proposed implementation is even better than that of the mainstream microprocessors, which clearly indicates the potential of the proposed approach and its usefulness in the area of medical imaging. It is worth mentioning that rows 2-4 of Table 1 provide the figures for implementations of the same algorithm 5 that we have implemented (Figure 3). Our performance is better than all of them, including the two-pipeline architecture. Jiang et al. also implemented an alternate 2-channel architecture for processing 2-D images and achieved a performance of 4187500 pixels/sec 5. However, this is a different design, and thus its performance cannot be directly compared with the performance achieved by implementing the architecture shown in Figure 3. It may be an interesting future direction to utilize NoC for this alternate architecture and investigate the performance increase. The device utilization report, given in Table 2, indicates that the overall area is increased compared to the reconfigurable FPGA based implementation of Jiang et al. 5.
This is one of the inherent drawbacks of NoC based implementations, due to the additional logic that supports network communication. However, given the availability of sub-micron transistor sizes, such an area increase is not a major concern. The increase in area makes the clock speed slightly slower than that of other implementations, but due to parallel processing the throughput is still higher.
Conclusions
This paper describes a NoC based implementation of the B-Spline FFD algorithm. For this purpose, we used PNoC, which is a very flexible architecture that suits FPGA-based systems. Our design description is captured in the Verilog language and implemented on a Xilinx XC2v6000 device at 37 MHz. The proposed design is parameterizable at compile time for the computational precision and supports a wide range of image resolutions. The experimental results show a significant improvement in performance when compared with the other hardware implementations of the B-Spline FFD algorithm.
To the best of our knowledge, this is the first implementation of a medical registration algorithm that is based on NoC. Our promising results illustrate the usefulness of NoC for medical registration algorithms, and thus other algorithms, such as non-rigid registration for breast MRI images and PET-CT image registration in the chest, can also be implemented using our generic PNoC model. An interesting future extension is to partition the images into 3 by 3 sub-images and then use 9 pipelined processors for a single block in parallel, one for each of the sub-images, which contain 12-bit fixed-point data. The slice resources for this system would be around 31410 for each block, and it would allow us to further enhance the performance. Another interesting area of future work is to experiment with the Torus architecture instead of the 2-D mesh architecture of NoC chosen for this work. Both architectures have their own advantages, as mentioned in 18. The main motivation for selecting the 2-D mesh for our work was its rather straightforward implementation and its better scalability in terms of area and power consumption compared to the Torus architecture. However, the performance is expected to increase further if the Torus architecture is used.
