Video up-scaling is a hot topic in the TV display area. As an important branch of video up-scaling, the Texture-Based Video Up-scaling (TBVU) method shows great potential for hardware implementation. Coarse-grained Reconfigurable Architecture (CGRA) is a promising parallel computing platform that combines the high performance of hardware, the high flexibility of software, and dynamic reconfiguration ability. In this paper we propose an implementation of TBVU on CGRA. We exploit the characteristics of TBVU and apply several techniques to reduce memory I/O operations and total execution time. Experimental results show that our work greatly reduces the I/O operations and the execution time compared with the non-optimized implementation. We also compare our work with other platforms and find a clear advantage in execution time and resource utilization rate.
Introduction
TV display technology has advanced rapidly in recent years; televisions that support Ultra High-Definition (UHD) video with 4k (3840x2160) resolution will be the next hot spot in this area. However, video at such a high resolution is not easily accessible, and most video sources are still in Standard Definition (SD). This has become a bottleneck in promoting high-resolution display.
Video up-scaling up-samples low-resolution images to high-resolution ones; it is both computing-intensive and time-consuming. In the TV display area, video up-scaling is so important and so frequently used that it is usually realized in hardware and embedded in the television. ASIC is a basic option: it achieves an obvious speed-up, but one ASIC supports only one up-scaling method and thus has no flexibility. FPGA is another option: it can switch among different up-scaling methods, but it has no dynamic reconfiguration ability and its resource utilization is relatively inefficient. These features indicate that ASIC and FPGA are not well suited to the video up-scaling application.
In this paper we propose a method of implementing Texture-Based Video Up-scaling (TBVU) on Coarse-grained Reconfigurable Architecture (CGRA). TBVU is an important branch of video up-scaling. Considering both its quality and its complexity, the high performance-price ratio of TBVU makes it suitable for hardware implementation. CGRA is a promising parallel computing platform with high flexibility and performance. Moreover, this platform has dynamic reconfiguration ability, which makes it switchable during execution. Note that the configuration time of CGRA is in the nanosecond range [1]. Our work focuses on relieving the computing intensity and memory access frequency of TBVU with limited resources. By fully exploring the TBVU method, we have identified its operator composition and data dependency. Based on this analysis, we propose a method for TBVU implementation. Experiments show that our work accelerates TBVU effectively. As we only focus on implementing TBVU in hardware, some complicated TBVU methods are ignored. Note that our work relies only on general features of CGRA, so it can be adopted by different CGRA platforms. The remainder of this paper is organized as follows: Sect. 2 introduces some background, including the details of TBVU, the CGRA architecture and our experimental platform REMUS. Section 3 reviews the related papers we investigated before our study. Section 4 presents the details of the implementation. Section 5 describes experiments that evaluate our idea, and Sect. 6 concludes our work.
Backgrounds

Texture-Direction Based Video Up-Scaling
Video up-scaling has been widely studied recently. Some methods obtain good results by carrying out complex computation, while others are relatively simple but perform badly. Most video up-scaling methods are designed for a General Purpose Processor (GPP), which has the highest flexibility but lacks speed. Hardware acceleration is essential for up-scaling, but a hardware implementation requires a method with a high performance-complexity ratio. TBVU is an important branch of video up-scaling; it uses texture as the feature that guides interpolation. TBVU is well studied and widely used; several methods of this type have been discussed in [2]-[7]. TBVU is a non-iterative algorithm with relatively low complexity but good results, so it is very suitable for hardware implementation.
Generally, TBVU includes three steps: 
A. TD (Texture Direction)
TD consists of two parts: Sobel Detection and filtering with angle computation.
Sobel Detection computes the texture direction for each existing pixel of the input image. The input is usually the grey value of a greyscale image or the luminance component of a color image. The texture direction is classified by the angle it forms with a designated reference direction. The number of classes differs between methods. A popular choice is to divide the full 360° angle into 8 classes of 45° each and to assign the direction angle to a class according to its value in degrees.
There are several ways to detect the direction angle; for example, Canny [8] and Sobel [9] are two well-known methods. Taking Sobel as an example, the method includes a convolution step and a filtering step. The convolution step computes the edge direction of the target pixel with the two template matrices below:
The templates $f_x$ and $f_y$ compute the gradient in the x and y directions respectively. This calculation involves the 8 reference pixels of the target. Eq. (1) gives another description of this step:
A is the template matrix, $a_i$ and $p_i$ denote an entry of A and its corresponding pixel respectively, and n is the total number of entries in A. In the Sobel case n = 9, so $f_x$ and $f_y$ each need 9 multiplications ("×") and 8 additions ("+"). The filtering part rejects pixels with an inconspicuous texture feature. This step usually needs a threshold, which can be given arbitrarily or computed adaptively with the following template:
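$$\begin{bmatrix} \tfrac{1}{9} & \tfrac{1}{9} & \tfrac{1}{9} \\ \tfrac{1}{9} & \tfrac{1}{9} & \tfrac{1}{9} \\ \tfrac{1}{9} & \tfrac{1}{9} & \tfrac{1}{9} \end{bmatrix}$$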
This matrix computes the average of the target and its reference pixels. The modulus of the texture direction is usually compared with the threshold; it is computed with Eq. (2):
As a square root cannot be computed directly in hardware, we use the approximate computation "+" shown in (2) to obtain A. The intersection angle of the target pixel is computed with Eq. (3):
As (3) cannot be computed directly in hardware either, a Taylor expansion is substituted, as shown in Eq. (4):
The output of TD is a direction image that records the direction of each pixel of the original input image.
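The TD step described above can be sketched in software as follows. This is a minimal illustration under stated assumptions, not the hardware datapath: the two templates are the standard Sobel kernels (sign conventions vary between texts), the modulus is approximated as |f_x| + |f_y|, a pixel is rejected when its modulus falls below the 3 × 3 mean, and `math.atan2` stands in for the Taylor-expansion approximation used in hardware. The function names are hypothetical.

```python
import math

# Standard 3x3 Sobel templates (assumed here; sign conventions vary).
FX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
FY = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def convolve3x3(img, r, c, template):
    """Weighted sum over the 3x3 neighbourhood of (r, c): sum of a_i * p_i."""
    return sum(template[i][j] * img[r - 1 + i][c - 1 + j]
               for i in range(3) for j in range(3))


def texture_direction(img, r, c, num_classes=8):
    """Return the angle class of pixel (r, c), or None if it is filtered out."""
    fx = convolve3x3(img, r, c, FX)
    fy = convolve3x3(img, r, c, FY)
    modulus = abs(fx) + abs(fy)  # square-root-free approximation
    # Self-adaptive threshold: mean of the target and its 8 neighbours.
    threshold = sum(img[r - 1 + i][c - 1 + j]
                    for i in range(3) for j in range(3)) / 9
    if modulus < threshold:      # assumed direction of the comparison
        return None
    angle = math.degrees(math.atan2(fy, fx)) % 360.0
    return int(angle // (360.0 / num_classes)) % num_classes
```

For a 3 × 3 patch with a bright right-hand column the sketch reports class 0, and a flat patch is rejected because its modulus is zero.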
B. PDS (Pixel Direction Speculation)
PDS is composed of a Texture Speculation part and a filtering part.
Texture Speculation determines the direction of the target by processing the direction image that TD outputs. A common approach is to detect the direction distribution around the target pixel, which requires counting, for each angle class, the number of reference pixels of the target that belong to it. The reference pixels are selected from a square window around the target spanning n image rows and n image columns. Once the distribution is obtained, the angle class containing the largest number of pixels is used as the angle class of the target. Figure 1 shows the pseudo code for detecting the distribution.
In the pseudo code, the input is the set of angle classes of the reference pixels of the target, and the output is a vector whose length equals the number of classes in TD. Each item of the output represents one angle class and records the number of pixels that belong to that class. The number "CLASS" in Fig. 1 is the number of classes, and m is the number of reference pixels for each target. The output of PDS is the speculated angle class for each new interpolated pixel.
In some TBVU algorithms a filter part is used to reject targets whose direction distribution is too confused or inconspicuous. This procedure takes the direction distribution of Fig. 1 as input and compares the pixel count of the angle class containing the most pixels with a designated threshold; if the count is less than the threshold, the target is treated as having an intersection angle of 0.
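The counting and filtering steps of PDS can be sketched as below. This is an illustrative sketch rather than the pseudo code of Fig. 1: the function names are hypothetical, every reference pixel is assumed to carry a valid angle class, and the "intersection angle 0" fallback is assumed to map to class 0.

```python
def direction_distribution(ref_classes, num_classes=8):
    """Count how many reference pixels fall into each angle class."""
    counts = [0] * num_classes
    for cls in ref_classes:
        counts[cls] += 1
    return counts


def speculate_class(ref_classes, num_classes=8, threshold=0):
    """Pick the most frequent angle class; fall back to class 0 if it is too weak."""
    counts = direction_distribution(ref_classes, num_classes)
    best = max(range(num_classes), key=lambda k: counts[k])
    if counts[best] < threshold:
        return 0  # distribution too confused: treat the angle as 0 (assumed class 0)
    return best
```

With a 4 × 4 window (16 reference pixels) in which ten pixels belong to class 1, `speculate_class` returns 1 unless the threshold exceeds ten.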
C. InterP (Interpolation)
The InterP step computes the pixel value of the target; its inputs are the output of PDS and the original input image. For each newly input pixel, InterP chooses an interpolation type according to its angle class. Usually the reference pixels are chosen along the direction of the target. The most popular interpolation types are bilinear interpolation and bicubic interpolation.
Bilinear interpolation uses 4 reference pixels in two dimensions, as shown in Fig. 2 (c). In the first step it performs linear interpolation in each column, and then it performs another linear interpolation on the two results. Equation (5) shows an example of linear interpolation.
Here x is the new interpolated pixel, and a and b are the reference pixels of x. Pixels a, b and x lie on the same line, with x between a and b. p(x), p(a) and p(b) are the pixel values of x, a and b respectively. Assuming the distance between a and b is 1, d is the distance between x and b. Bicubic interpolation uses 16 reference pixels lying in 4 image rows and 4 image columns, as shown in Fig. 2 (a). It first performs cubic interpolation in each of the 4 columns and then executes one more cubic interpolation to obtain the target value. Equation (6) shows an example of cubic interpolation.
Here k is the interpolated pixel and a, b, c, d are the reference pixels. They lie in the same column, in the order a, b, k, c, d from top to bottom. The coefficients $h_a$, $h_b$, $h_c$ and $h_d$ are computed by Eq. (7):
Suppose the distance between two adjacent pixels among a, b, c, d is 1; then x is the distance between k and the pixel corresponding to $h_i(x)$.
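The two-step bilinear procedure can be illustrated with a short sketch. It assumes the usual linear weighting that follows from the definitions above (with d the distance from x to b and |ab| = 1, pixel a receives weight d); the function and parameter names are hypothetical, and the bicubic coefficients of Eq. (7) are not reproduced here.

```python
def linear(p_a, p_b, d):
    """Linear interpolation between a and b; d is the distance from x to b (|ab| = 1)."""
    return d * p_a + (1.0 - d) * p_b


def bilinear(top_left, top_right, bottom_left, bottom_right, d_v, d_h):
    """Interpolate each column first, then interpolate between the two column results.

    d_v: distance from the target to the bottom row (row spacing = 1).
    d_h: distance from the target to the right column (column spacing = 1).
    """
    left = linear(top_left, bottom_left, d_v)
    right = linear(top_right, bottom_right, d_v)
    return linear(left, right, d_h)
```

A target at the centre of the four reference pixels (d_v = d_h = 0.5) receives their plain average.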
The interpolation type is chosen by comparing the angle class of the input pixel with designated parameters. For example, if the angle of the target is 45°, bicubic is a good choice and the reference pixels are chosen like the blue ones shown in Fig. 2 (a). If the angle is 0°, bicubic is also used and the reference pixels are selected as in Fig. 2 (b). If the angle is 67.5°, bilinear may be a better choice and the reference pixels are chosen as shown in Fig. 2 (c).
CGRA Architecture
A. Typical Architecture of CGRA
A typical CGRA [10]-[12] is composed of one host controller and one 2-D Programmable Element Array (PEA). Each PE includes an ALU and a register file. Each PE can be configured to perform different word-level operations. The configuration words for the PE array are stored in the configuration memory, and the data memory stores intermediate results. Figure 3 shows a typical CGRA with a 4 × 4 PE array.
In this paper we call the PEs in the same horizontal line a PE row and the PEs in the same vertical line a PE column; PE rows and PE columns are both spatial notions.
The PEA is the computing part of the CGRA. It can be a square or a rectangular array, and its size can differ between designs. The PEs can work in collaboration with others or individually to perform different computing functions, which makes CGRA a very suitable platform for parallelism. CGRA also has dynamic reconfiguration ability, which allows it to change its function during execution. Note that the reconfiguration time is on the order of nanoseconds [1].
B. REMUS Architecture
REMUS [13] is an architecture designed by our group specifically for image processing and media processing. In this paper, we use REMUS as our experimental platform. Figure 4 (a) shows the architecture of REMUS. It includes one host processor and two RPUs; each RPU contains 4 Processing Element Arrays (PEAs) and some as-
The flexibility of REMUS can be observed in 3 aspects. First, the function of a PEA can be changed dynamically by switching contexts. Second, the PEs can work in collaboration with others or individually to perform different computing functions. Third, the PEAs can be combined dynamically to meet computing needs and realize algorithm-level parallelism.
Related Works
Image upscaling is an important topic and has been widely used in many image processing applications. Hardware implementation is essential for speeding up image upscaling, but not all upscaling methods can be implemented in hardware. [14] presents a super-resolution image reconstruction method that uses the related information of adjacent images for upscaling and edge enhancement. [15] works on back-projection and exploits local self-similarity to reconstruct the high-frequency data of high-resolution images. [16] presents an upscaling method based on two-step grid filling and an iterative correction of the interpolated pixels. Although these works obtain very good results, they all rely on complex computation or iteration to optimize the result, which makes them very hard to implement in hardware.
TBVU is a widely used upscaling method; it is easy to compute and its results can be very good. [5] describes an improved edge-adaptive upscaling method that detects 4 kinds of edge block and interpolates along the edge. [6] studies an upscaling method using edge-curve scaling and cubic-spline interpolation. [3] presents an image interpolation method that applies different interpolations to texture and non-texture areas. [17] proposes an up-sampling method that uses a spatial filter for interpolation along texture directions. [18] presents a method that uses texture to eliminate interpolation error. These works all employ some kind of edge-based interpolation, which is relatively easy to compute, and they report good results. These properties make TBVU a good candidate for hardware implementation.
Much implementation work has been done on TBVU and on interpolation. A hardware implementation of a bicubic enlargement interpolation algorithm is proposed in [19]; it uses a look-up table technique to avoid massive cubic and floating-point multiplication operations on FPGA. [20] presents a VLSI implementation of image upscaling in which the authors use an area-pixel scaling technique and an approximation technique to reduce computing complexity. [21] proposes a new architecture for bicubic interpolation and implements it on FPGA. [22] proposes linear algebra implementations on FPGA in which the authors overlap I/O and execution time to increase computing speed. [23] shows the procedure of mapping a Jacobi iterative solver onto FPGA. These works are based on FPGA or VLSI. VLSI is highly specialized and lacks flexibility, while FPGA achieves both flexibility and performance but cannot be reconfigured during computation and has a relatively low resource usage rate.
In this paper we propose an implementation of TBVU on CGRA. Like FPGA, CGRA has the flexibility to change its function for different applications. Several implementations on CGRA have been proposed in [24], [25] and [26]. Moreover, CGRA can dynamically reconfigure its computing resources during execution, so computing resources can be reused and relatively more resources can be provided to each step of TBVU, which increases parallelism and reduces total computing time.
Implementation
This part shows the details of implementing TBVU on CGRA. The data flow is shown in Fig. 5. First the video frames are transferred from external memory to the CGRA for the TD step, and TD outputs the direction image to Memory 1. Then PDS reads data directly from Memory 1 and puts its result in Memory 2. InterP loads data from Memory 2 and the original pixel values from the input frame buffer, and finally outputs the interpolated pixel values as the final result.
The pipeline technique is used inside each part of TD and InterP, and the parts of TD are also combined into a pipeline. The PEA can be divided into PE rows. With the buffer memory, each PE row can act as one stage of a pipeline. In our design, we first partition the DFG into stages according to the priority defined by the data dependency of its nodes, as in Fig. 6 (a), and then map the DFG onto the PEA. After mapping, each PE represents a node of the DFG and each PE row represents a stage of the DFG; in this way, a pipelined implementation of the DFG is constructed. Figure 6 illustrates an example of such a pipeline. Besides the pipeline inside each part, the parts of TD can be combined to form a pipeline. Figure 6 (c) shows the combination: the result of Sobel Detection is given to Filtering and Angle Computation in parallel, then Determine Angle receives the outputs of the previous parts and outputs the final result. In this way, a pipeline containing all the parts of TD is formed.
The 3 steps of TBVU cannot be connected to work as one pipeline. PDS needs data from 4 image rows of the same columns for each computation, while TD outputs data one by one in row-major order, so a buffer is needed between TD and PDS. Moreover, PDS is a counting procedure whose datapaths are not pipelines. As a result, the 3 steps cannot be combined to form a pipeline.
As the volume of input data can be large, the CGRA may not be able to cache enough data for the pipeline in on-chip memory and may need to load data from external memory. Section 4.4 discusses the external memory.
In our design several techniques are used to optimize efficiency. In Sobel Detection, data dependency only occurs between a target pixel and its reference pixels; moreover, two adjacent target pixels share 6 reference pixels. The input image can be so large that the CGRA has to keep it in external memory and load it through the I/O pins piece by piece, so the number of pins affects $T_{I/O}$.
Considering the observations above, several techniques are used to optimize the Sobel part.
1. Parallelization by exploiting data dependency.
As the data dependency of TBVU only occurs between reference pixels and targets, different targets can be processed in parallel inside TD.
2. Sharing contiguous data (pixels) to reduce the number of I/O operations.
Each pair of adjacent targets shares some reference pixels, so we arrange the execution order to move from one target pixel to its adjacent pixel and keep the shared data on chip. In this way, I/O operations are reduced. The I/O time $T_{I/O}$ is restricted by the number of I/O pins of the CGRA; if data cannot be loaded in time, merely increasing parallelism is meaningless. In our design, we treat the I/O time and the execution time as a joint problem and try to overlap them as much as possible.
Suppose the CGRA has bw (b*w) I/O pins, where w is the width of a word, so the CGRA can load bw bits of data in one cycle. Let k be the number of datapaths working in parallel. When handling n targets, we need to load 3n data (with contiguous data sharing taken into account) and output n results, so the number of I/O operations is 3n + n = 4n and the I/O clock cycles are $T_{I/O} = \frac{4n}{b}$. Ignoring the clock cycles used to fill the pipelines, the execution clock cycles are $T_{comp} = \frac{n}{k}$. With I/O and execution overlapped, the total delay will be

$$T = \max\left(T_{I/O},\, T_{comp}\right) = \max\left(\frac{4n}{b},\, \frac{n}{k}\right)$$

To achieve the lower bound, let

$$T_{I/O} = T_{comp}, \quad \text{i.e.} \quad \frac{4n}{b} = \frac{n}{k}, \quad \text{so} \quad k = \frac{b}{4}$$

This result shows that if the pin number is bw, the best number of datapaths for parallelism is b/4.
The computing resources of a CGRA are grouped in a square or rectangular shape, while the DFGs mapped onto it can have any size and form, so they have to be modified before implementation to increase the utilization of computing resources. The connection ability of the PEs is a limitation when modifying the DFG, and it must be satisfied in practical applications. Next we use an example to show the implementation of Sobel Detection.
Suppose the width of the input is 8 bits and the pin number is 64 (b = 8); according to the analysis above, 2 datapaths can work in parallel. Figure 7 (a) shows the DFG of Sobel. Figure 7 (b) shows an intuitive way of mapping the Sobel DFG onto the PEA. Two datapaths are mapped in parallel on the left and right halves respectively. In Fig. 7 (b) the PE rows with the same number form a pipeline, which requires the CGRA to connect PEs separated by two PE rows. Not all CGRAs have this ability, so the DFG has to be modified to satisfy the connection limitation. As adjacent targets share reference data, we give priority to adjacent targets when choosing the targets to execute in parallel. This modification reduces the I/O operations with external memory.
The filtering part takes $f_x$ and $f_y$ from Sobel Detection as input and outputs the direction judgment number used to check whether the target is directional. The intersection angle computation part takes $f_x$ and $f_y$ from the Sobel part as input and outputs the angle class of the target. The data dependency between filtering and angle computation lies only in their results: the result of the filtering step determines whether the output of the angle computation step is valid. Figure 8 (c) shows the datapath of angle computation, which is an implementation of Eq. (4).
Considering these features, several techniques are used for this part.
As discussed above, filtering and angle computation can be implemented in parallel. As this parallelism only occurs inside the datapath of TD, it does not increase data throughput but shortens the length of the datapath.
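The balance between I/O cycles and execution cycles for the Sobel part can be checked numerically. The sketch below only restates the cycle counts given in the text (4n I/O words per n targets, b words per cycle, k parallel datapaths); the function names are hypothetical and exact fractions are used to avoid rounding.

```python
from fractions import Fraction


def sobel_cycles(n, b, k):
    """Return (T_io, T_comp) for n targets, b words per cycle and k datapaths."""
    t_io = Fraction(4 * n, b)   # 3n loads plus n stores
    t_comp = Fraction(n, k)     # pipeline fill time ignored
    return t_io, t_comp


def best_sobel_parallelism(pin_count, word_width):
    """Number of datapaths that makes T_comp equal to T_io."""
    b = pin_count // word_width
    return Fraction(b, 4)
```

For 64 pins and 8-bit words this gives b = 8 and two datapaths, at which point the I/O and execution cycle counts coincide.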
Modify DFG to increase the utilization of computing resource while satisfying the connection limitation.
Figure 8 (a) shows the DFG for the self-adaptive threshold; it is an adder tree with a divider. In this tree the "+" operators are not evenly distributed, which is unfavorable for utilization. Figure 8 (b) shows a modified version of the DFG in (a); it is more uniform and helps increase the utilization of computing resources.
Equation deformation for computational efficiency.
The DFG of Fig. 8 (c) carries out the arctangent computation. Its datapath is long because the power calculation contains much redundancy. In our design, we deform (4) to shorten the datapath as follows:
Figure 8 (d) shows the DFG of (8). This deformation reuses the intermediate result $x^2$ and removes several unused power terms to shorten the critical path. Figure 9 shows the implementation of filtering and angle computation; both parts are modified to increase computational efficiency. As discussed above, 2 datapaths are implemented in parallel for maximum overlap of $T_{I/O}$ and $T_{comp}$. In each datapath, filtering and angle computation are mapped in parallel. The part to the left of Line 1 in Fig. 9 is filtering, and the part between Line 1 and the red line is the angle computation.
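The idea of reusing the intermediate result x² can be illustrated with a truncated Taylor series of the arctangent evaluated in nested (Horner) form. This is only a sketch of the general technique: the number of retained terms (`terms=4`) and the exact form of Eq. (8) are assumptions, and the series is only accurate for |x| ≤ 1.

```python
import math


def arctan_series(x, terms=4):
    """Truncated Taylor series x - x^3/3 + x^5/5 - ... in Horner form on x^2."""
    x2 = x * x                      # intermediate result shared by every term
    acc = 0.0
    for i in reversed(range(terms)):
        coeff = (-1.0) ** i / (2 * i + 1)
        acc = coeff + x2 * acc      # one multiply and one add per term
    return x * acc
```

With four terms, each step costs one multiplication and one addition, so no power of x higher than x² is ever formed explicitly.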
PDS (Pixel Direction Speculation)
Counting the angle class distribution is the key part of PDS. For each target, a square of reference pixels located in n image rows and n image columns of the input direction image around it is loaded. Figure 10 shows an example: suppose the reference data lie in 4 image rows and 4 image columns around the target, and the red dots inside the square interval are the new interpolated target pixels; then the red dashed box encloses the reference data of these target pixels. From Fig. 10 we can see that the target dots in one interval share the same reference data, so they share the same angle class distribution and have the same speculated angle class. The blue dot and the red dots in Fig. 10 lie in adjacent intervals, and the blue dashed box encloses the reference data of the blue dot. It is easy to see that data are shared between the adjacent red and blue dots.
According to the discussion above, several techniques can be used for this part.
1. Optimized storage of similar data to reduce memory space and computation.
As discussed above, targets in one interval have the same reference data and the same angle class, so only one calculation is needed and only one result needs to be stored for each interval. This optimization removes needless computation and storage.
2. Parallelization by exploiting data dependency.
Data dependency only exists between the counters and the reference data, so different targets can be processed in parallel.
3. Sharing contiguous data (pixels) to reduce the number of I/O operations.
As seen in Fig. 10, data are shared between adjacent intervals. In our design, we arrange the execution order to move from one interval to its adjacent one and keep the shared data on chip to reduce I/O operations with external memory.
Suppose the number of angle classes is m; then a datapath of PDS is formed by m counters. Each counter c consists of an "=" and a "+", as shown in Fig. 11 (a). When the $n^2$ reference data lying in n image rows and n image columns of the direction image are input, the datapath judges them one by one and adds each to the matching counter, so the computing time is $T_{comp} = n^2$ for each target. Suppose k is the number of datapaths mapped in parallel; for r new interpolated target pixels, the execution clock cycles are $T_{comp} = \frac{n^2 r}{k}$. With the techniques above, only one computation is needed for each interval, and only n new reference data need to be loaded for it. Suppose the CGRA has bw (b*w) I/O pins; the clock cycles of I/O operation are then $T_{I/O} = \frac{nr}{b}$. Overlapping $T_{I/O}$ and $T_{comp}$, the total delay is:

$$T = \max\left(T_{I/O},\, T_{comp}\right) = \max\left(\frac{nr}{b},\, \frac{n^2 r}{k}\right)$$

The lower bound is reached when $T_{I/O} = T_{comp}$, i.e. $k = nb$. This result shows that if the pin number is bw, the best number of datapaths for parallelism is nb.
Here is an example. Let n = 4, the counter number m = 8 and bw = 4 * 8 = 32; then the best degree of parallelism is 4 * 4 = 16. Figure 11 (b) shows the implementation of this case on the PEA. There are 16*16 PEs in (b), and the "=" and "+" in the box at the top left form a datapath constructed from 8 counters. 16 datapaths are mapped onto the PEA in parallel to overlap $T_{I/O}$ and $T_{comp}$.
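The effect of keeping shared reference data on chip can be sketched with a sliding window over the direction image. The code below is a software illustration, not the PEA mapping: it assumes the window advances by one image column per interval, models the on-chip buffer as a `deque` of columns, and counts the words fetched; the function names are hypothetical.

```python
from collections import deque


def scan_row(direction_rows, n, num_classes=8):
    """Slide an n x n window along n image rows; return (histograms, words_loaded)."""
    width = len(direction_rows[0])
    window = deque(maxlen=n)            # on-chip buffer holding the last n columns
    histograms, loaded = [], 0
    for col in range(width):
        window.append([direction_rows[row][col] for row in range(n)])
        loaded += n                     # only one new column of n words is fetched
        if len(window) == n:
            counts = [0] * num_classes
            for column in window:       # n * n compare-and-add steps per interval
                for cls in column:
                    counts[cls] += 1
            histograms.append(counts)
    return histograms, loaded


def best_pds_parallelism(n, b):
    """k that makes T_comp = n^2 r / k equal to T_io = n r / b."""
    return n * b
```

Each interval still costs n² counting steps but only n newly loaded words, which is why the balanced number of datapaths grows to nb (16 for n = 4, b = 4).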
InterP (Interpolation)
In InterP, each target is assigned an interpolation type according to its angle class. According to PDS, the targets in one interval share the same angle class. Different interpolation types need different reference data selections and computing procedures, so our hardware has to support all kinds of interpolation in some way.
Considering the discussion above, several techniques will be used for this part.
1. Execution order rearrangements for efficiency optimization.
The new interpolated target pixels in one interval call for the same interpolation type, so in our design the processing proceeds from interval to interval. In this way, the reference data and the mapped interpolation type are fully reused. Data dependency only exists between the reference data and the target, and targets in the same interval share the same reference data, so different targets can be processed in parallel. Suppose k intervals are mapped in parallel. As the pixels inside each interval share the same reference pixels, only one load is needed for all pixels in one interval. Assume each interval has t target pixels and the reference data to be loaded for one interval is n. When handling q input intervals, the I/O operations are qn inputs and qt outputs. The I/O clock cycles are $T_{I/O} = \frac{q(n+t)}{b}$, and the execution time is $T_{comp} = \frac{q}{k}$. Assume the configuration cycles are $T_{config} = c$; the total delay considering time overlap is:

$$T = \max\left(T_{I/O},\, T_{comp} + T_{config}\right) = \max\left(\frac{q(n+t)}{b},\, \frac{q}{k} + c\right)$$

To get the lower bound, let

$$\frac{q(n+t)}{b} = \frac{q}{k} + c, \quad \text{i.e.} \quad \frac{n+t}{b} = \frac{1}{k} + \frac{c}{q}$$

As q can be very big, $\frac{c}{q} \approx 0$, so $k = \frac{b}{n+t}$, and k should be an integer. If k is less than 1, there are not enough pins to output the results of all pixels in one interval; in this case the number of datapaths in parallel is b-n. Otherwise, if k is more than 1, all the pixels in k intervals work in parallel. Note that this is an ideal result; the degree of parallelism in practice should also take the hardware resources into account.
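The choice of parallelism for InterP can be written as a small helper. This sketch assumes the large-q limit in which the configuration cycles are negligible, rounds k down to an integer, and returns the two cases described in the text; the function name and the returned labels are hypothetical.

```python
from fractions import Fraction
import math


def interp_parallelism(b, n, t):
    """Return ('intervals', k) or ('pixels', b - n) for b words/cycle, n loads, t outputs."""
    k = Fraction(b, n + t)
    if k < 1:
        # Not enough pins to output a whole interval per cycle.
        return ("pixels", b - n)
    return ("intervals", math.floor(k))
```

For b = 16, n = 5 and t = 4 the ratio is 16/9, so one interval is mapped at a time.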
5. Dynamical reconfiguration ability of CGRA to reuse computing resources and decrease execution time.
The execution time of processing an image is determined by the following factors: hardware performance, the design of the datapath, and the data throughput during processing. The former two factors are determined by the hardware and the design method, while the third is determined by the parallelism of the datapath, which is constrained by the computing resources of the hardware. InterP has to support two or more types of interpolation to meet the calculation needs. The target pixels in one interval use the same kind of interpolation. As the number of such pixels is large and the number of intervals that can be processed in parallel is usually 1, the pixels mapped onto the CGRA in parallel come from one interval most of the time, and only one kind of interpolation is used. Clearly, the more resources we give to this interpolation, the higher the parallelism we get. CGRA has dynamic reconfiguration ability to support dynamic switching between different interpolations. With CGRA, we can map only the needed interpolation onto the whole computing resource and switch dynamically when a different interpolation is called for. With statically configured hardware, by contrast, all interpolations must be mapped simultaneously, and the resources available to each interpolation are fewer than on CGRA. This technique increases the parallelism and data throughput during execution while still meeting the need for multiple interpolations.
The rest of this subsection will demonstrate an implementation example of InterP.
The angle class of an interval is used to select the interpolation type of that interval. As the reference data of bilinear and bicubic differ in number and location, in our design the CGRA loads 25 reference data from 5 image rows and 5 image columns of the original image around the target; in this way, both bilinear and bicubic can find their reference data correctly. In our case, n = 5, c = 5, t = 4 and bw = 16 * 8, so k = 1, which means the datapaths of one interval are mapped each time. Figure 12 shows the DFG and the implementation of bicubic interpolation on the PEA. (a) and (b) show the computation of Eq. (7), and the DFGs in (a) and (b) are mapped twice in (c) to compute the 4 coefficients of (7). In Fig. 12 (c) the red box and the blue box enclose the implementations of the DFGs in (b) and (a) respectively. Bicubic selects 16 data from the 25 input and cached data for calculation. The execution order is from one interval to its adjacent one. Figure 13 shows the DFG and the implementation of bilinear on the PEA. This DFG computes the value of the interpolated pixel, and each red box in Fig. 13 (b) encloses an implementation of the DFG in (a).
Memory Architecture and Scheduling
In TBVU, we need to deal with data from no more than 4 different rows but the same column of the image, so in our work we combine 4 DDR-400 devices as the external memory, each storing 8-bit data. Figure 14 shows the architecture. Each DDR is assigned a serial number from 1 to 4 as shown in Fig. 14 (b). Assume the number of rows in the input image is RM; each row in the image is assigned a row number from 1 to RM as shown in Fig. 14 (c). In our design, we use a kind of redundant storage to hold an input image: DDR 1 stores image data from row 1 to row RM, DDR 2 stores image data from row 2 to row RM, DDR 3 stores data from row 3 to row RM and DDR 4 stores data from row 4 to row RM. Fetching data from external memory means reading the 4 DDRs with the same address; with this redundant storage we can access data from 4 rows of the same column with one address, as shown in Fig. 14 (b). The data loaded onto the CGRA are first stored in a buffer as shown in Fig. 14 (a), and then the datapaths fetch data directly from the buffer.
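The redundant storage scheme can be modelled in a few lines. The sketch below uses 0-based indices and Python lists in place of DDR devices; the function names are hypothetical and the model ignores bursts, banks and timing.

```python
def build_ddr_banks(image, num_banks=4):
    """Bank i (0-based) stores the image starting from row i."""
    return [image[i:] for i in range(num_banks)]


def fetch(banks, row, col):
    """Read every bank at the same address (row, col).

    Returns the pixels of rows row, row+1, ... of one image column;
    banks that have run out of rows near the bottom edge are skipped.
    """
    return [bank[row][col] for bank in banks if row < len(bank)]
```

Reading address (1, 2) from all four banks returns the pixels of image rows 1 to 4 in column 2 in a single access.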
This organization of external memory can support our computing needs. Suppose the upscaling of our method is from 720p to 4k, in TD, I/O operation is 1280 × 720 × 3 = 2764800 byte; the I/O operation in PDS is 1280 × 720 × 4 + 3840 × 2160 = 11980800 byte, and the I/O operation of InterP is 1280 × 720 × 5 + 3840 × 2160 = 12902400 byte, the total I/O is 27648000, that is less than 28MB, assume we process 25 fps for real-time application, the total I/O is less than 700MB/s, while the data rate of DDR 400 is 3200MB/s, which much higher than our need.
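The bandwidth estimate can be reproduced with a few lines of arithmetic. The sketch only restates the per-step factors given in the text (3, 4 and 5 passes over the source frame, plus one destination frame for PDS and InterP); the function name is hypothetical.

```python
def frame_io_bytes(src=(1280, 720), dst=(3840, 2160)):
    """Return per-step and total I/O volume in bytes for one frame (8-bit samples)."""
    src_pixels = src[0] * src[1]
    dst_pixels = dst[0] * dst[1]
    td = src_pixels * 3
    pds = src_pixels * 4 + dst_pixels
    interp = src_pixels * 5 + dst_pixels
    return {"TD": td, "PDS": pds, "InterP": interp, "total": td + pds + interp}


io = frame_io_bytes()
bytes_per_second = io["total"] * 25          # 25 frames per second
ddr400_bytes_per_second = 3200 * 10**6       # peak rate quoted in the text
```

At 25 fps the required rate is 691.2 MB/s, a little over a fifth of the 3200 MB/s quoted for DDR-400.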
As the 3 steps of TBVU cannot be connected as a pipeline, the result of each step needs to be stored in external memory. In our design, all of the CGRA's PEAs are combined to perform the implemented function. A configuration is needed before the execution of each step; for each configuration, all PEAs are configured to support the computation of that step. Figure 15 shows the timing diagram of processing one image frame. As the function performed by the CGRA does not change within TD or within PDS, only one configuration is needed for each of them. Our design overlaps I/O and execution time, so data can be loaded while the PEAs are executing. Each green block represents a data block; the three kinds of data block load data from different sources. In the InterP step, the CGRA has to check the angle of the input data and configure the PEAs for the proper interpolation, so an angle check time is needed. If the CGRA finds that the input data requires the same kind of interpolation as the previous one, the configuration time can be omitted.
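The two scheduling ideas in this paragraph, overlapping I/O with execution and skipping reconfiguration when the interpolation type does not change, can be sketched with a simple timing model. This is an illustrative model only, not the REMUS scheduler: the function names, the assumption that block i + 1 is loaded while block i executes, and the example numbers are all hypothetical.

```python
def serial_time(io, comp):
    """Total time when every block is loaded, then executed, one after another."""
    return sum(io) + sum(comp)


def overlapped_time(io, comp):
    """Total time when block i + 1 is loaded while block i is being executed."""
    if not io:
        return 0
    total = io[0]  # the first load cannot be hidden
    for i in range(len(io) - 1):
        total += max(comp[i], io[i + 1])
    return total + comp[-1]  # the last block only has execution left


def count_reconfigurations(interp_types):
    """Count configurations when one is issued only if the type changes."""
    count = 0
    current = None
    for kind in interp_types:
        if kind != current:
            count += 1
            current = kind
    return count


io_cycles = [2, 2, 2]
comp_cycles = [3, 3, 3]
print(serial_time(io_cycles, comp_cycles), overlapped_time(io_cycles, comp_cycles))

intervals = ["bilinear", "bilinear", "bicubic", "bicubic", "bilinear"]
print(count_reconfigurations(intervals))
```

In this toy case three of the five intervals trigger a configuration, and overlapping hides all loads except the first one.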
The advantage of a reconfigurable architecture lies in its flexibility and dynamical reconfiguration ability. The implementation on the CGRA can be changed to satisfy different applications and perform different algorithms. Dynamical reconfiguration provides more resources for each step by reusing the resources of the previous step. More resources mean more parallelism, so with dynamical reconfiguration we obtain higher data throughput and a higher resource utilization rate.
Experiment Results
To evaluate the efficiency of our work, we carry out several experiments. Our test platform is REMUS; Table 1 shows its key parameters.
Our TBVU uses Sobel and a self-adapted threshold in TD, 8 counters in PDS, and bilinear and bicubic interpolation in InterP. The input image is in YUV format with 1280 × 720 resolution, and the output has 3840 × 2160 resolution. As the number of I/O pins is 64, the parameters of each step can be determined as shown in Tables 2, 3 and 4.
Note the last parameter in Table 4: according to Sect. 4.3, the number of intervals processed in parallel is k = b/(n + t) < 1, which means that executing all pixels of one interval calls for too much output, so only part of them can be executed in parallel. In this case, b − n = 3, so 3 datapaths work in parallel. To show the efficiency of our method, we compare it with a non-optimized TBVU implementation (NOP). The non-optimized method uses the pipeline technique but no parallelism, no data reuse and no overlap of T_I/O and T_comp. Table 5 compares the memory access and execution time of the two methods.
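The parallelism figures quoted in the text can be checked numerically. The sketch below rests on assumptions that are not confirmed by this section: that k is the integer part of b/(n + t), that b is the I/O width in bytes (pins divided by 8), and that b − n datapaths are used when k falls below 1. The function names are hypothetical.

```python
def intervals_in_parallel(b, n, t):
    """Assumed reading of Sect. 4.3: k is the integer part of b / (n + t)."""
    return b // (n + t)


def datapaths_when_limited(b, n):
    """Datapaths that can run when a whole interval does not fit (k < 1)."""
    return max(b - n, 0)


N_INPUT, T_OUTPUT = 5, 4  # n and t as quoted in the text

# bw = 16 * 8 bits gives b = 16 bytes, the case discussed for the mapping.
k_wide = intervals_in_parallel(16, N_INPUT, T_OUTPUT)

# 64 I/O pins give b = 8 bytes, the case of the experiment.
b_pins = 64 // 8
k_narrow = intervals_in_parallel(b_pins, N_INPUT, T_OUTPUT)
fallback = datapaths_when_limited(b_pins, N_INPUT)

print(k_wide, k_narrow, fallback)
```

Under these assumptions the model gives k = 1 for the 128-bit case and 3 datapaths for the 64-pin case, matching the values stated in the text.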
In TD, our work shares data between adjacent pixels and overlaps T_I/O and T_comp, while NOP has to load 9 data for each target, uses no parallelism, and carries out I/O and execution one after another. The difference between NOP and our work is large. In PDS, we use 32 datapaths working in parallel, while NOP does not and has to wait 16 cycles for each target. In InterP, NOP performs no reconfiguration and loads 25 reference data for each target, so the difference between NOP and our work is again clear.
We also ran experiments on different platforms. Table 6 shows the platform parameters and results. We use the same test case in this experiment as in the previous one. The execution time on REMUS is the total time of the three steps in Table 5.
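The effect of sharing data between adjacent pixels in TD can be estimated with a small counting model. This is a sketch under assumptions: it takes a 3 × 3 Sobel window sliding along one image row and assumes that the window of the previous target stays cached, so that only one new column is loaded per target. The exact reuse scheme of the design is not specified here, and `count_loads` is a hypothetical name.

```python
def count_loads(num_targets, window=3, reuse=True):
    """Count reference data loaded for a window sliding along one row.

    With reuse, the samples of the previous window are kept and only the
    missing ones are loaded; without reuse, every target reloads its window.
    """
    cached = set()
    loads = 0
    for x in range(num_targets):
        needed = {(x + dx, dy) for dx in range(window) for dy in range(window)}
        if not reuse:
            cached = set()
        loads += len(needed - cached)
        cached = needed
    return loads


ROW_TARGETS = 1280  # one row of the 720p input
loads_nop = count_loads(ROW_TARGETS, reuse=False)
loads_shared = count_loads(ROW_TARGETS, reuse=True)
print(loads_nop, loads_shared)
```

In this model the loads per row drop from 9 per target to roughly 3 per target.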
We use the Xilinx Virtex-6 XC6VLX760, a large-capacity FPGA, as our target device. An FPGA is fine-grained reconfigurable hardware with no dynamical reconfiguration ability. In this experiment we try to show that, with the same datapath optimization and equivalent computing resources, the CGRA achieves better performance because of its reconfiguration ability.
In the FPGA experiment, we use exactly the same datapath in each step as on REMUS; the difference between the implementations on the two platforms lies in the parallelism of each step. For fairness, we restrict the number of logic LUTs on the FPGA so that the FPGA and REMUS have equivalent resources. The maximum computing resource we use on REMUS is 6 PEAs, because the I/O pins restrict the parallelism of each TBVU step. To obtain resources on the FPGA equivalent to those used on REMUS, we remove the memories, extract the logic part of 6 PEAs from the RTL code, and implement it on this FPGA. Table 7 shows the synthesis result. We use the total LUT count of this result as the FPGA resource equivalent to that of REMUS.
As we use exactly the same datapath as on REMUS when implementing TBVU on the FPGA, the only adjustable parameters are the parallelism of each TBVU step. With the resource constraint obtained above, we adjust the parallelism of each step iteratively so that the number of LUTs used as logic in the synthesis result approximates 43254. The parallelism of each step and the total LUTs used on the FPGA are shown in Table 8. Table 6 shows the execution time of the implementation on the FPGA. As the FPGA has no dynamical reconfiguration ability, it has to map all three steps simultaneously, while REMUS can reconfigure its computing resources to perform each step with all of them. Thus, with equivalent resources, the average resource available to each step on the FPGA is less than on REMUS. As we implement exactly the same datapath on the FPGA and REMUS, the parallelism on the FPGA is lower, which reduces the data throughput. When dealing with massive input, data throughput becomes a key issue and the cost of reconfiguring the CGRA becomes insignificant. The experimental results show that the execution time on the FPGA is longer than on REMUS.
Table 7 Synthesis LUT result of implementing RTL code of RPU's logic part on FPGA.
On the GPP, we use Visual Studio 2010 to compile the TBVU algorithm implemented in C++. As the GPP has no parallelism, its execution time is the longest. Note that the execution time on REMUS is 38.79 ms per frame, which indicates that REMUS can process more than 25 frames per second. Our implementation can therefore satisfy real-time applications.
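The real-time claim follows directly from the measured frame time, as the short check below shows. Only the 38.79 ms figure and the 25 fps target come from the text; the variable names are illustrative.

```python
FRAME_TIME_MS = 38.79  # execution time on REMUS reported in the text
TARGET_FPS = 25        # real-time target used in the text

frame_budget_ms = 1000 / TARGET_FPS   # time available per frame
achieved_fps = 1000 / FRAME_TIME_MS   # frames per second at the measured time
slack_ms = frame_budget_ms - FRAME_TIME_MS

print(frame_budget_ms, round(achieved_fps, 2), round(slack_ms, 2))
```

The measured time fits within the 40 ms per-frame budget with a little over 1 ms to spare.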
Conclusion
Texture-Based Video Up-scaling is a well-studied method for up-scaling video resolution and has good potential for hardware implementation. This paper exploits the characteristics of TBVU and the details of the CGRA and proposes an implementation method. Experimental results show that our work greatly reduces the I/O operations and the execution time compared with a non-optimized implementation. With equivalent resources, our implementation on the CGRA achieves better performance than the FPGA, and it is faster than the GPP.
