Abstract-The three dimensional structure tensor algorithm (3D-STA) is often used in image processing applications to compute the optical flow or to detect local 3D structures and their directions. This algorithm is computationally expensive due to many computations that are required to calculate the gradient, the tensor, and to smooth every pixel of the image frames. Therefore, it is important to parallelize the implementation to achieve high performance. In this paper we present two parallel implementations of 3D-STA; namely moderately parallelized and highly parallelized implementation, on a massively parallel reconfigurable array. Finally, we evaluate the performance of the generated code and results are compared with another optical flow implementation. The throughput achieved by the moderately parallelized implementation is approximately half of the throughput of the Optical flow implementation, whereas the highly parallelized implementation results in a 2x gain in throughput as compared to the optical flow implementation.
INTRODUCTION AND MOTIVATION
The three-dimensional structure tensor algorithm (3D-STA) is an effective calculation of the local motion information of video objects [1] . Many image processing and computer vision applications use the 3D-STA, for example to compute the optical flow or to detect local 3D structures and their directions. Applications include autonomous systems, robotics, and CT scans. Practically, this algorithm is difficult to apply in robotic vision because of its tremendously high computational demands due to higher frame rate requirements in real time applications. In most cases, the general purpose processor and sequentially executed software cannot compute 3D structure tensor in real time. Therefore, the 3D-STA algorithm and many other image processing algorithms demand the use of parallelization techniques to meet real-time requirements. The massively parallel architectures, preferably programmed in dataflow languages, provide the means to meet the computational needs of such applications. Different blocks of this algorithm are suitable to be implemented on many-core architectures. Each block can be divided into smaller parts that can then be mapped on an individual processor to be executed in a pipelined manner. Finally, the complete 3D-STA can be implemented by composing these sub-blocks executing in parallel.
This paper provides 3D-STA implementation on the Ambric AM2045 architecture, which is a massively parallel processor array (MPPA) containing 336 programmable processors combined together with a dynamically reconfigurable interconnect on a single die. Two designs, one moderately parallelized and one more highly parallelized, are developed for this algorithm and implemented on the Ambric architecture. We have evaluated our implementations by comparing the performance results of the two implementations of 3D-STA against another optical flow implementation [2] . This paper is organized as follows: Section II gives an explanation of the 3D Sobel Operator (3D-SO), 3D-STA, and its three computational stages (gradient calculation, outer products (tensor) calculation, and smoothing the tensor components). Section III describes optical flow calculation and some related work, and Section IV provides an overview of the Ambric architecture and its programming model. Section V presents the implementation details of the 3D-STA algorithm and the different strategies for parallelization. Performance results obtained from the hardware implementation are provided in Section VI and compared with another optical flow implementation and Matlab version. Finally, the paper is concluded with some remarks and future work in Section VII.
II.
3D STRUCTURE TENSOR The 3D structure tensor can be defined for a function f of three variables p=(x,y,t), where x and y are the spatial components, and t is the temporal component. 3D-STA takes a sequence of frames f(x) as an input that acts as a volume of data where x, y and t are the 3D coordinates [2] [3] [4] . The algorithm has several computational stages as shown in Figure 3 . The steps of the algorithm are:
1. Taking a sequence of frames as an input 2. Calculating the gradient in 3D direction x, y, and t using 3D sobel operator (3D-SO)(mask size 3x3x3) 3. Calculating the tensor for every pixel in the images using gradient 4. Finally, smoothing each component of the tensor, to get the 3D structure tensor
A. 3D Sobel Operator
The Sobel operator is an edge detection operator. The 3D-SO kernel size is 3x3x3. The kernel is convolved over the image to calculate the derivative of the image [5] . The 3D-SO is operating on three adjacent slices (a subset of the image solid). The results of gradient values are put in the central slice. The operator is shown in Figure 1 . Figure 2 [6] . Advantages of the 3D-SO are:
The 3D filtering properties of the kernel are very effective in reducing noise in the pictures and effectively highlight the edges rather than the 2D filter.
It is easy to implement in hardware by using a pipeline approach than the other operators. It produces small range of output values than the other operators such as the scharr operator.
No need to require multiplication when we calculate gradient values. Figure 3 depicts the dataflow of this algorithm [2] [3] where the video input is represented as a sequence of image frames. For instance f(x) is a 3D image. It is a three-stage computation as shown in Figure 3 .
B. Data flow diagram of 3D-STA
The first stage is called gradient (derivative) calculation. The derivative calculation is done using 3D-SO (mask size 3x3x3) and generates image sequence called gx, gy and gt for 3D images in three directions x, y and t.
The second stage (Tensor/Outer Product) produces six new matrices (xx, xy, xt, yy, ty, tt) by performing point-bypoint multiplications between elements of the previously computed gx, gy, gt.
The third stage (smoothing) implements an 8x8 square filter for smoothing in x, y, and t directions to produces xys, xts, yys, tys and tts images. 3D-STA is suitable for "data-level parallelism" (DLP) because it has independent operations that are to be calculated in each step of 3D-STA [8] . These operations can be applied simultaneously to multiple segments of data. For example, to calculate the derivative, the 3D image will be convolved with 3D-SO and the center position will get new value in three directions (x, y, and t). Then previous six pixels will be stored from each image and three new values will be added as well as convolved with 3D-SO. Then again the centre position will get new value in three directions. Each gradient value computation is independent and the same operation is applied simultaneously. To implement this algorithm on Ambric, we split the image into multiple parts that can then be computed independently in parallel to speed-up the calculation. Nevertheless, we have to deal with so called apron pixels when we split the image. Apron pixels are the extra pixels (one row and one column) that need to be shared with other parts of the splitted image to calculate gradient of their boundary pixels. Extra pixels are also needed to be added when we compute the derivative in the border of the image.
III.
RELATED WORK Over the past two decades, tensor based optical flow has been calculated using different parallel processing hardware such as FPGA, GPU etc. [3] [4] [7] . As far as optical flow implementation on the Ambric architecture is concerned, only one example has been found [2] .
The aim of the optical flow calculation is to compute a velocity vector for each pixel. That vector indicates the pixel's apparent motion in the x and y directions over the past few image frames. The optical flow implementation by Hutchings et al. was implemented as a multi-stage pipeline [2] . The first stage (derivative) calculation is done using five element masks (mask is: [1 8 0 8 1]) 1D convolution in the x, y and t. In our work, we used 3D-SO instead of 1D convolution to compute derivative of the images. The second stage (smoothing), smoothes the gradient values using 3D convolution. However, we did not smooth the image after gradient calculation because we used 3D-SO that gives better gradients than the 1D and 2D convolution. The third stage, tensor calculation (outer product) is point by point multiplication between previously computed and smoothed images. Our implementation is similar in this stage. In the fourth step, then again a 2D smoothing is performed in each of the six succeeding smooth_blocks. In this stage, we also smooth the tensor values but we instead used 8x8 box filter. Finally, it calculates the velocity vectors for each pixel for optical flow calculation. However, in our implementation we implemented up to structure tensor.
There are other FPGA implementations [3] [4] [7] , where the algorithm is implemented as a five-stage computation. The first stage performs 1D convolution in the x, y, and t dimensions using five-element masks. In our case, to calculate gradient we used 3D-SO instead of 1D convolution for 3D images. Gradient weighting is the second processing stage; it performs seven-element row and column convolutions. Nevertheless, after gradient calculation we do not calculate gradient weighting because the 3D-SO gives better gradient then the 1D convolution. The third stage (outer product) produces five new matrices by multiplying various combinations of the weighted gradient matrices together element-wise. We also did outer product calculation but after gradient calculation. Tensor calculation is the fourth processing stage, and performs a three-element row convolution on each of these matrices to produce five tensor matrices. However, instead of threeelement row convolution we smoothed tensor values using 8x8 square filter. The final stage is optical flow calculation, which computes the velocity flow fields. Nonetheless, in our calculation, we calculated up to structure tensor.
IV. AMBRIC ARCHITECTURE & PROGRAMMING MODEL
The Ambric Am2045 chip consists of an array of 336 32-bit RISC processors, which can be executed at up to 300 MHz. The architecture is suitable for embedded systems such as medical imaging, video and signal processing [9] .
In the Am2045 chip, there are two types of processors, one is SRD (Streaming RISC with DSP extensions) processor and other one is SR (Streaming RISC) processor. These processors have local RAM memories and they get the instructions through channels. Ambric has reconfigurable circuit-switched interconnect that joins processors and memories. Processors are general purpose and are programmed using aJava or assembly language. We write sequential programs for each of the processors. A programmed processor is called a primitive object and the smallest programmed object is called a leaf object (Figure. 4). The leaf objects can be connected to each other by channels. Each channel is word-wide unidirectional, pointto-point from one object to another, and acts like a FIFObuffer. Objects run concurrently on an array of Ambric processors and memories. The objects run independently with their own speeds without having any side effects on each other because they have no shared memory [10] . 
V.
3D-STA ON AMBRIC HARDWARE In this section, we describe implementation strategies for the 3D-STA algorithm. Two design approaches are implemented on the Ambric hardware. In the first design technique, each image is split into four parts. In the second design, each image is divided into sixteen parts. Since we are using 3D-SO we have to deal with the apron problem when splitting the image. Apron pixels are the extra pixels that need to be added in order to calculate the gradient in the border of the image. To prevent apron problem after the image is split, the splitting parts has to share one extra column and row to calculate gradient of their boundary pixels.
A. Design approaches for implementation
Due to the fact that 3D-SO kernel size is 3x3x3 pixels, we need to add apron pixels (usually by zero padding) around the image in order to convolve pixels along the boundary of the image. The implementation of this algorithm on Ambric architecture is done by splitting the image into smaller parts, and then assigns these smaller parts to different processors. The splitting of the image will produce additional edges which require additional apron pixels [5] . Otherwise the computation will fail to calculate boundary pixels of the small part of the image, as shown in Figure 5 . Figure 5 shows apron pixels along the additional edges, where the image is divided into four parts. The gray color represents the apron of the image, the blue color represents shared border of the image and the white color is the original image. The total image size is 256x256 pixels and with apron its size is 258x258 pixels. After dividing the image into four parts, each part becomes of a 129x129 size. For calculating the border image part, we have added one more pixel from the other part of image portion. Accordingly, the size of each separated part becomes 130x130 pixels.
To prepare the input data we rearrange the image into a set of blocks of size 3x3. Then, three blocks from three consecutive frames will be mixed to create a block of size 3x3x3, which is equal to the kernel size of 3D-SO. The reason for mixing the data of three consecutive frames is because the Ambric PerfHarness library used to measure the performance permits only one input file and one output file. The two designs are coarse-grained and fine-grained implementations. In both the implementations, we have introduced a processor named separator to separate the mixed data of the three frames. Let us now describe the two different implementations of 3D-STA based on the dataflow diagram shown in Figure 6 .
1) Moderately parallelized implementation
In the first design, the input data is initially separated into three different images, and then each image is split into four parts using three processors called splitter. After splitting, four pre-STA processors collect the same part from the three images to prepared data for pipelined 3D-STA implementation. In total, this approach uses 17 SR and 32 SRD processors. The SR processors are used to distribute and collect data, while the SRD processors are used to compute the three stages of 3D-STA. Half of the SRD processors are used for smoothing, 12 for gradient calculation, and 4 for tensor calculation. Figure 6 shows the mapping of this approach on the Ambric platform. The circles represent processors and the arrows represent the flow of data. Now, in this section we describe implementation of the three stages in 3D-STA, i.e., gradient calculation, outer product calculation, and smoothing operations. The gradient block accepts three sequential images (g) and performs three dimensional gradient calculations such as horizontal (gx), vertical (gy) and over the time components (gt), using 3D-SO. To calculate the gradient, pre_STA processors create data streams for gx, gy and gt starting from the top-left corner of the image in 3x3 format and then compute the actual gradient. That means, nine pixels are consumed from each image to calculate the gradient for the first iteration. So, pre_STA consumes twenty seven values from three images, because of the 3D-SO kernel size 3x3x3, and sends all of them to the gradient_calculator processor. The gradient_calculator processor multiplies sobel operator's coefficient with pixels values in three directions and produces three gradient values in three directions (gx, gy and gt).
The 
2) Highly parallelized implementation
The dataflow diagram of the highly parallelized implementation is similar to the moderately parallelized one, while it uses a greater number of processors. Here, the image is split into sixteen parts by two splitting stages (splitter_1 and splitter_2). These sixteen parts of the image will be arranged for the three stages of 3D-STA by using 16 pre_STA processors. Moreover, to implement the smoothing part, we classify the tensor values into two parts. This approach uses 69 SR and 96 SRD processors. In this design all of the SR processors are also used to distribute and to collect data while SRD processors are mainly used for calculating all parts of the algorithm. Fortyeight SRD processors are used for gradient, sixteen for tensor calculation and thirty-two for the smoothing operation. The algorithm takes sequential images as an input like video sequence and calculates all parts of the algorithm in a pipelined manner.
VI. RESULT ANALYSIS
The performance results are achieved by realizing both the above mentioned designs on Ambric architecture and executing them on a GT board containing one Am2045 chip being operated at 300 MHz clock. We have used the performance harness library provided by Nethra Imaging Inc. to obtain cycle accurate performance measurements. The performance results of the two implementations are summarized in Table 1 . In addition, the implementation results of the 3D-STA are compared to the optical flow implementation by Hutchings et al. [2] . The performance result is measured using an input of 20 images of size 256x256 for the two implemented designs. We can notice in Table 1 that the moderately parallelized implementation of 3D-STA achieves 25 frames per second (fps) and that the highly parallelized achieves 100 fps. The frame rate ratio of the highly parallelized design shows a factor of 4 improvement compared to the moderately parallelized one, but we have also used 165 processors for the highly parallelized compared to only 49 processors for the moderately parallelized implementation.
For the execution of moderately parallelized implementation, the gradient, tensor and smoothing blocks on Ambric have taken approximately 70%, 16%, and 14% of time respectively. The gradient block takes more time to execute than the other parts of the algorithm because each pixel of gradient calculation requires 54 multiplications (we have used bit shift operator instead of multiply operation to increase the performance) in three directions. On the other hand, highly parallelized implementation on Ambric chip has taken same percentage of time (0.20 seconds) to execute three stages of the algorithm because of the similarity in calculation procedures and design techniques in both of the designs.
We have also simulated the algorithm in Matlab and compared the results with Ambric implementation for validation of accuracy. The moderately parallelized implementation is 58 times faster than the sequential Matlab implementation, and that the highly parallelized implementation is 233 times. Optical flow implementation on FPGA by Wei et al. [3] can process 640x480 images at 64 frames per second, while consuming 10W. On the other hand, our moderately parallelized implementation is 2.56x slower, while highly parallelized is 1.56x faster with lower power consumption (estimated power on Ambric using 165 processors is approx. 5.4W) as compared to their optical flow implementation on FPGA. We can observe from Table 1 that the optical flow calculation on Ambric by Hutchings et al. is 1.7 times faster than the moderately parallelized implementation and 2.2 times slower than our highly parallelized one. We can notice that the highly parallelized implementation on the Ambric chip uses 165 processors and it can process 100 fps. On the other hand, the optical flow implementation on same chip uses 190 processors and it can only process up to 37fps on an image stream consisting of 320x240 frames. However, we normalized the image size according to our image size, which gives 43 fps. If we applied our design technique to implement remaining parts (smoothing t direction and to calculate velocity vector) of optical flow according to the above result, we expect that our implementation will still be faster than their implementation. The optical flow implementation by Hutchings et al. has multiple computational stages and due to the following reasons our implementation will be faster. The first step is derivative block, this stage calculation is done using 1D convolution in the x, y and t direction using the five elements mask to produce gradient. Actually, to calculate gradient they did not use 3D gradient operator for 3D images but in our implementation we used 3D-SO to get proper derivative of the image. We achieved this result because of using the 3D-SO coefficients to calculate gradients of images. To calculate one gradient value in three directions, it requires 18x3 multiplications. However in our calculation, since 3D-SO coefficients are only 1, 2 and 4, we use bit shift operators instead of multiplication. The second stage is the smoothing operation. It smoothed the gradient values to get better derivative using 3D convolution using 3x1 convolutions in t direction, 5x1 in x direction and 1x5 in y direction. In our implementation, we do not need to smooth the gradient values because the 3D-SO gives the better gradient then their gradient calculation, which results in improved performance. The third step is outer product calculation. The calculation procedure of this step is exactly the same in their and our calculation. The fourth step is smoothing tensor values. In their implementation they used 2D filter to smooth the tensor values. However, in our implementation we used a very simple method called 8x8 box filters for smoothing the tensor values, which avoids using multiplication. Finally, they smoothed the velocity vector using kernel size 7x1 and 1x7. If we compute up to optical flow, we only need to smooth in t direction and then finally also need to calculate the velocity vector. After calculating the velocity vector we can ignore smoothing. Thus, we have demonstrated several useful optimizations. For example, in gradient calculation, instead of multiplication we used the faster bit-shift operator; we eliminated the need for smoothing operation in the second stage, and in the fourth stage we used 8x8 square filters instead of 2D filter to avoid multiplication. Also, if we compute optical flow (final step) we can avoid smoothing after velocity vectors calculation. All of these optimizations contribute to an efficient implementation.
VII. CONCLUSIONS AND FUTURE WORK
The implementation fulfills the real time performance requirement for 3D-STA that is used in optical flow 5. Eliminating the smoothing operation after gradient calculation. 6. For smoothing operation we used simple 8x8 box filters to ignore multiplication The future work that can be performed is as follows:
The entire program of the algorithm is written in aJava currently. In future instead of aJava, assembly language can be used to get better performance.
The Gradient calculation part of the algorithm can be partitioned (to exploit more parallelism) and map it onto more processors in order to avoid the slow-down effect on the rest of the parts of the 3D-STA. This work can also be used in the future in many image analysis applications, such as to detect local 3D structures and their directions as well as in autonomous systems, robotics, and CT scans etc.
