The semivariogram is a statistical measure of the spatial distribution of data and is based on Markov Random Fields (MRFs). Semivariogram analysis is a computationally intensive algorithm that has typically seen applications in the geosciences and remote sensing areas. Recently, applications in the area of medical imaging have been investigated, resulting in the need for efficient real time implementation of the algorithm. The semivariogram is a plot of semivariances for different lag distances between pixels. A semi-variance, γ(h), is defined as the half of the expected squared differences of pixel values between any two data locations with a lag distance of h. Due to the need to examine each pair of pixels in the image or sub-image being processed, the base algorithm complexity for an image window with n pixels is ( 2 ). Field Programmable Gate Arrays (FPGAs) are an attractive solution for such demanding applications due to their parallel processing capability. FPGAs also tend to operate at relatively modest clock rates measured in a few hundreds of megahertz, but they can perform tens of thousands of calculations per clock cycle while operating in the low range of power. This paper presents a technique for the fast computation of the semivariogram using two custom FPGA architectures. The design consists of several modules dedicated to the constituent computational tasks. A modular architecture approach is chosen to allow for replication of processing units. This allows for high throughput due to concurrent processing of pixel pairs. The current implementation is focused on isotropic semivariogram computations only. Anisotropic semivariogram implementation is anticipated to be an extension of the current architecture, ostensibly based on refinements to the current modules. The algorithm is benchmarked using VHDL on a Xilinx XUPV5-LX110T development Kit, which utilizes the Virtex5 FPGA. Medical image data from MRI scans are utilized for the experiments. Computational speedup is measured with respect to Matlab implementation on a personal computer with an Intel i7 multi-core processor. Preliminary simulation results indicate that a significant advantage in speed can be attained by the architectures, making the algorithm viable for implementation in medical devices
INTRODUCTION
Semivariogram analysis is a computationally intensive algorithm that has typically seen applications in the geosciences and remote sensing areas 1 . Recently, applications in the area of medical imaging have been investigated, resulting in the need for efficient real time implementation 2 . The semivariogram is a plot of semi variances for different lag distances between pixels. It is a property used to express the degree of relationship between pixels of image. The semivariance value typically increases with the lag distance converging to a constant limit called the "sill". The value increases rapidly at low lags and then progresses linearly. The semivariogram displays the average change of a property and relation between a pair of pixels with changing lags. It can be used as a descriptor of second-order statistics within the image and hence provide a quantitative measure of texture.
A brief literature survey shows very limited progress in implementing the variogram on FPGA hardware to increase its speed. Marcotte proposed the computation of variograms using the fast Fourier transform by defining two indicator matrices which have ones at each data point and zeros elsewhere 3 . The computation is performed at lower cost by shifting the data matrix and the data pairs are formed by overlapping the original matrix and shifted matrix. Hence the FFT approach is shown to be faster than the spatial approach, especially when repetitive variogram computations are required.
Most other work appears to focus on the fast computation of gray-level co-occurrence matrices (GLCM). It is worth examining these approaches since the GLCM also involves comprehensive pixel-pair data extraction. Roumi proposed an FPGA-based architecture for parallel computation of symmetric co-occurrence matrices 4 . The approach improves by a factor of 2 to 4 the processing time for simultaneous computation of sixteen co-occurrence matrices. The feature calculation operation has two steps. In the first step, mean, contrast, entropy, and dissimilarity are calculated by four different Processing Elements (PEs). Processing elements contain multipliers and adders that execute in parallel. Furthermore, for increasing the computation speed RAM is used for the calculation of the log function in the entropy measure. In the second step, the angular second moment, variance, and correlation are calculated. Harshavardhan et al implemented a novel FPGA-based architecture for real-time extraction of four GLCM features by dividing the architecture in two stages, a pre-processing stage and the feature extraction block 5 .
Akoushideh et al proposed a parallel FPGA architecture to calculate co-occurrence matrices and thirteen texture features 6 . In order to improve performance, all texture features were computed in parallel. In addition, the clock divider technique was applied with parallel implementations on a cell processor. Experimental results show that using 16 processing elements in parallel provided speedups of up to 10 times the non-parallelized implementation. Geisha et al presented an FPGA architecture for Gray Level Co-occurrence Matrix (GLCM) to increase the speed of computation 7 . This paper does not include the Haralick feature extraction of the image; it calculates the GLCM for four different angles 0°, 45°, 90° and 135°for a given intensity of an image. Wielgosz has proposed FPGA Architecture for Kriging Image Interpolation in three steps: finding a basis for interpolation, constructing variogram matrix and computing coefficients and interpolation 8 . Constructing the variogram matrix involves implementing it in a pipelined fashion on the FPGA's but the architecture is not optimal in terms of image data distribution.
This work intends to implement the semivariogram calculation with an FPGA module taking advantage of its reconfigurability characteristics. The aim is to have faster calculation times using design techniques to implement parallelism and pipelining, which is not possible with dedicated DSP designs. The FPGA Kit used in this project is the Xilinx XUPV5-LX110T Development Kit, which utilizes the Virtex5 FPGA.
TECHNICAL BACKGROUND
The theoretical variogram is a function describing the degree of spatial dependence of a spatial random field. It is defined as the variance of the difference between field values at two locations across realizations of the field. The physical distance between the two locations is known as the lag distance or lag, and is denoted as . The semivariogram, denoted as γ(h), displays the average change of a property with changing lags. The property under study can be a function of the gray levels ( ) at two locations in the image and is usually related to the application domain under consideration. A property such as the difference between the gray-level values of the two locations can serve as a measure of textural parameters or second-order statistics of the random field. The experimental variogram is calculated by averaging one half the differences squared of the z-values over all pairs of observations with the specified lag distance and orientation. The relation between a pair of pixels that are lag distance h apart can be given by the average variance of the difference between all such pairs and is expressed as follows:
where m is the total number of pixel pairs used in the computation of the sum. The value of m is governed by the geometrical properties of the region of interest in the image used for the statistical analysis and the value of h. Given the same region larger values of h will result in lower values of m since the dipole connecting the pixels in the pair may not lie within the region of interest for certain pixels.
In order to understand the computation of the semivariogram for an image, it is helpful to look at an example with some image data. Table 1 shows the pixel arrangement in a one dimensional array, where denotes the position of the pixel values ( ). The difference of the pixel values is calculated in the third row and the obtained values are squared in the fourth row for a lag distance h=1.It should be noted that a two-dimensional image can be arranged as a vector for computation of the isotropic variogram as long as the position information is retained for each location. The following steps have to be followed to calculate the isotropic semivariogram value,:
1. Calculate the Euclidean distance and sort the pixel pairs according to the lag distances.
2. Calculate square of the difference between all pixel values with distance h as computed in Table (1). 3. Determine the number of pixel pairs m, for the example shown in table 1 the number of pairs of pixels with distance 1 is m=14.
4. Add all the square of differences for a particular lag distance h. Summing all the squares of differences computed in Table ( As can be noted from the example, the semivariogram values tend to rise rapidly and then converge to a value known as the sill. The semivariogram plot can be fitted with various models described by exponential or hole-effect parametric equations to obtain a description of the texture in the region of interest.
It is a good idea to study the computational complexity of the semivariogram algorithm for the isotropic computation case. For = × pixels in the image there are ( − 1)/2 unique pixel pairs. Calculating the variogram value includes the computation of the distance and difference between all the pixel pairs, and summation of squares of differences. The distance calculation has 3 × ( ( − 1)/2) additions and multiplications each, whereas the difference calculation between the pixel pairs and the square of the differences has ( − 1) additions. ( 2 ) where n is the number of pixels in the image region. For example, for P = (M=10, N=10), the total number of pixel pairs is 4950 assuming distinct pixels, and the total number of summations would be 29,700 and the total number of multiplications would be 14,850. The value is particularly large even for a small image of 10 x 10 pixels.
As seen by the computational breakdown, a large part of the burden exists due to the comprehensive analysis of pixel pair data. Practical applications in the medical domain involve regions of interest with around 50 x 50 pixels, resulting in an exponentially larger number of computations: approximately 187 million sums and 95 million multiplications. This large number of computations make the semivariogram calculation extremely time intensive. It is worthwhile investigating methods to speed up the computation in order to make the semivariogram a practical 
ALGORITHM IMPLEMENTATION AND DIGITAL LOGIC DESIGN SPECIFICATION
Two architectures were designed and implemented in succession on the FPGA. Architecture version 1 (nonpipelined) was initially implemented and tested. Architecture version 2 (pipelined) was subsequently introduced to improve upon the performance of the former. Both architectures were programmed in VHDL and consist of four different modules with each module performing different tasks as described below:
1. Image data loading is one of the challenging tasks for any architecture with comprehensive pixel-based computations. One of the major bottlenecks is that data has to be presented to the modules in pairs. Popular approaches like the crossbar switch and ring buffers can be used to load the data into the processing modules. The crossbar switch has a matrix with M x N cross-points or places where the "bars" cross. At each cross point is a switch, which when closed, connects one of M inputs to one of N outputs. It has the disadvantage of large implementation size due to the number of connections. Ring buffer is a data structure that uses a single and fixed buffer connected end-to-end 12 . Ring buffers can be used to present the serialized data in a tight sequence to the modules. The linear ring buffer faces a disadvantage due to large memory size and access time 13 . Thus the architecture has been implemented with two single array image buffers to load the data into the all the sub-modules where data is preloaded to allow for testing access by an algorithm on hardware. multiplexers present these values to the sub-modules in sequence. The first multiplexer presents pixel data to output buffer 1 and the second one multiplexes pixel data registers into output buffer 2, based on current values of input buffer counter 1 and buffer counter 2.
Since the pixel data loading process can be time-consuming, the second architecture addresses the first part of this process in a brute force fashion. Image buffer for architecture version 2 (pipelined) has just one multiplexer which presents the data to 99 sub-modules from image buffer 2. The buffer 1 pixel data is replicated in the modules depending on the corresponding counter value. In other words, all pixels in the vector are presented to each of the computational modules, representing the first pixel value in the pixel pairs utilized for correlated distance and difference calculations. Figure 2 shows the image buffer for the architecture version 2 (pipelined). 
Architecture Version 1 (Not Pipelined):
The block diagram and description of the basic architecture is shown in Figure 3 . The distance module calculates the Euclidian distance between pixel pairs using the square root sub-module. Depending on the counter values, the corresponding x and y position for the calculation of the distance are determined. The difference module handles computation of the absolute difference between two pixels. It has input registers to store pixel data internally when the enable signal is high. The module then checks if pixel 2 is higher than pixel 1. If it is then, pixel 1 is subtracted from pixel 2, while if it is lower, pixel 2 is subtracted from pixel 1. After calculating the lag distances and the difference between the pixel pairs, the outputs from both the modules are connected to the sorting module which sorts the differences depending on the lag distances and calculates the number of pixel pairs for each distance. These sorted values are passed to the final module which computes the semivariogram value for different lag distances.
The output of the distance and difference modules are then given as an input to the sorting module, which calculates the number of pixel pairs for each lag distance and computes the summation of the square of differences for each lag distance starting from 1 to the maximum distance between the pixel pairs. The registers act as accumulators for the pixel pair count and the sum of squares of difference respectively. Whenever the buffer counter changes logic, the sorting module checks the distance value and uses it as an offset. Then the register number corresponding to the offset is incremented and the square of the input difference value is added to the corresponding sum register.
The final stage calculates the semivariogram value for all the lag distances, using the division function. First, all the data from the sorting module is multiplied by 2 by shifting bits to the left by one. The divider circuits for each input, from lag distance 1 to the maximum lag distance between pixel pairs are used to divide the sums with the number of pixel pairs for each lag distance. After the division process, the result is then rounded up or down to the nearest integer value. The output multiplexer is used to display results on the LED display one at a time. Figure 4 shows the block diagram for the architecture version 2 which is pipelined to improve performance. This architecture has the distance and difference modules replicated multiple times (99 times for the tested implementation). Each of the replicated distance sub-modules computes the Euclidean distance and the difference sub-modules calculate difference for a fixed pixel with all the pixel values presented to it. For example, the first distance and the difference modules calculate distances and differences for pixel 1 to pixel N. The sorting module has the same functionality as the sorting module in the first architecture version, except that it has 33 sub modules which are pipelined. Each sub-module handles 3 distance or difference values. The output m and sum of the 33 submodules are then added to find the number of pixel values and sum values for each distance, which is then passed to variogram module for final calculations.
Architecture Version 2 (Pipelined):

RESULTS AND DISCUSSION
Initially a Matlab program was used to implement the algorithm and find the baseline variogram values. The design was then implemented on the FPGA board for a 10 × 10 image. The results show that the architecture version 1 speed is measured to be 400 microseconds for calculating the semivariogram values, as seen in Figure 5 . Each counter combination to introduce pixel pairs to the combinational modules, lasts for 4 clock cycles because data inputs for differences and the distance module will be changed every 4 clock cycles. This clock period is used so that the large combinational logic circuits will have stable inputs to perform the calculations. Each clock cycle is 10ns for 100MHz hence a counter increment takes 4 x 10 ns = 40ns. There are two counters, each counting up to 100, when counter 1 is 1, counter 2 goes from 1 to 100, hence there are 100x100 counts. So approximately the total time of computation is 40ns x 100 x 100 = 400,000ns or 400uS. The results are shown in Figure 5 and are very close to these theoretical model values.
The ArchitectureVersion 2 is 100 times faster in speed than the first one due to the pipelining. Figure 6 shows the implementation time from data enable to completion. As discussed before, each count of buffer lasts for 4 clock cycles (40ns), but there is only one counter counting from 1 to 100, as the buffer 1 data is directly connected to inputs from difference module for the parallel calculations so approximately the total time of computation is 100x40ns = 4000ns or 4us, but due to the small logic delay the computational time is 4.27us. The maximum Euclidian distance between pixels in a 10×10 image is 12.72. Hence there will be 13 output semivariogram values for the algorithm being tested, corresponding to each value of 'h'. Table 3 below shows the computation times for the Matlab implementation, architecture version 1 (non-pipelined) and architecture version 2 (pipelined). The head-to-head results indicate significant performance improvement using the non-pipelined version, which is greatly improved by pipelining. The speedup obtained in the first case is 5,000 and in the second pipe-lined case is 500,000. 4.27 microseconds A summary of the synthesis report, device requirement and utilization shown in Table 4 . As indicated by a study of the architectures, a significant price in extra hardware is required to improve the speed using pipelining. This can be attributed to the replicated modules and data supply arrangements to these extra replicated modules in the pipelined architecture. The maximum frequency and power consumption are also shown in the table and indicate that both versions are utilizing the FPGA board at almost the maximum frequency allowed by the timing constraints for the designs. Pixel pair data presentation appears to be a significant factor in the speed of the algorithm. 
CONCLUSION AND FUTURE WORK
This paper conceptualizes and implements two architectures Architecture version 1 (non-pipelined) and Architecture version 2 (pipelined) for the computation of the semivariogram using a custom FPGA architecture to reduce the computational time required. It can be observed in the first architecture that the hardware implementations of semivariogram calculation, reduce the time of computation when compared to the software implementation. The second architecture implemented massive parallelism in the form of replicated distance, difference and sorting modules that are further pipelined. The basic difference between the two architectures is the massively pipelined sub-modules in the second version with 99 sub modules which run in parallel fashion. The output of these modules is connected to 33 sub modules of the sorting module. Hence, the massively parallel pipelined architecture implemented results in a speedup of 100 over the non-pipelined architecture. It is envisaged that the speed is proportional to the number of parallel sub-modules and further experiments are planned in the future.
The pipelined architecture can be utilized for small matrices, but the implementation for larger matrices imposes a challenge due to the large design size and FPGA limitations. The current design rounds off data to the nearest integer. Future work can involve a design that can process floating point numbers and can provide precision up to two decimal values. Other challenges to be addressed in future work also include increasing the image size, and further parallelization and pipelining to mitigate the effects of massive orderd pixel pair data access.
