Abstract-Color correction, which nonlinearly converts the color coordinates of an input device such as a scanner or digital camera into those of an output device such as a color laser printer, is important for multimedia applications. In this brief, we present a novel dynamic pipelined VLSI architecture for the fuzzy color correction algorithm (FCC) proposed by Jou et al. to meet the speed requirement of time-critical applications. To improve performance, the presented architecture is dynamically pipelined with unfixed, run-time determined latencies (or data initiation intervals), and speculation is also applied. Together these efficiently solve the two problems of FCC: it is hard to pipeline because the execution time of each iteration varies, and it executes slowly. The data path is designed with a systematic high-level synthesis methodology. As a result, the dynamic pipelined architecture achieves a significant (about 2 times) speedup over the sequential one with a slight hardware overhead.
I. INTRODUCTION
The reproduction of color documents between different devices has become one of the important problems in multimedia applications, owing to the rapidly expanding use of color input/output devices such as color scanners, digital cameras, and color printers. The major issue of a color correction system is how to faithfully maintain the original color quality of documents or images when they are transferred between different input/output devices. The most common example, and the focus of this brief, is a color document that is scanned in by a scanner and then printed out by a printer. Since the tints and scales of colors differ between devices, the coordinates obtained by the color scanner cannot be used directly by the color printer; otherwise, the color of the reproduced document may be badly distorted and unacceptable. In addition, the nonlinear color-mapping problem makes the color correction process difficult and slow. Therefore, a mechanism that converts the scanner's color coordinates into those of the printer both smoothly and quickly is necessary for a time-critical color office automation environment.
In the past, several methods [2]-[5] were proposed to deal with the color correction (or conversion) problem between the scanner and the printer. Some of them have high computation costs and/or need a large storage area [2], [3]; the others are slow and unsuitable for time-critical applications [4], [5]. Recently, a good fuzzy color correction algorithm and a corresponding FPGA implementation were presented in [6]; however, they are slow and perform poorly because of unreduced computations and the maximum-criterion defuzzification used. In [1], an efficient fuzzy-tree correction algorithm, denoted as FCC, is proposed. Using fuzzy inference trees like [6], it smoothly converts the red, green, and blue (RGB) color coordinates generated by the scanner into the cyan, magenta, and yellow (CMY) coordinates of the color printer. The advantages of FCC lie in its simplicity, adaptability, and good correction effect. A sequential hardware implementation was also designed in [1], but its processing speed is still too slow for time-critical applications. A fast hardware realization of FCC that raises the color correction speed is therefore the subject of this brief. We present here a dynamic pipelined architecture for FCC that increases the correction speed significantly. Pipelining [7]-[10] is a powerful and popular technique for designing high-throughput digital circuits. In a functional pipeline design, two consecutive iterations of the same loop are initiated at a time interval called the latency. In general, the latency of a pipeline is fixed [7]-[9] or takes one of a few fixed values [10]. However, in the main FCC processing loop, the varying execution time of each iteration and the time-relative data dependencies between iterations make the pipeline latencies unfixed and pipelining hard to do. To construct a high-throughput pipelined architecture for FCC, a new dynamic pipelined architecture with run-time determined latencies is designed.
We first identify and modify some complicated data and/or control dependencies in FCC so that pipelining is easy to carry out, and then we partition FCC into two sections: an inner section and a main section. In the main section, the whole inner section is viewed as an unbound delay operation whose delay is data-dependent and run-time determined. The inner and main sections are then optimized for performance using speculation-based pipelining [11] to form a single dynamic pipeline. An integrated controller is designed to control the dynamic pipelined datapath with variable latencies. After dynamic pipelining, the pipeline latencies of the FCC loop depend on the execution length of the inner section; they are naturally unfixed and vary between 5 and 12, with an average of about 9.5 (measured by hardware simulation). Thus, the processing speed of FCC is raised significantly, to about 2 times that of the sequential version [1], while the hardware overhead is slight.
II. OVERVIEW OF FCC AND ITS SEQUENTIAL ARCHITECTURE
In [6], a fuzzy algorithm with good color correction performance was proposed. However, it has high computational complexity and tends to be slow. Based on [6], we developed a more efficient FCC algorithm [1], which uses a new, efficient approach for fuzzy inference and a different, center-average method for defuzzification, and achieves the same color correction performance as the algorithm of [6]. During the fuzzy color correction process of [1], each fuzzy subtree performs one color conversion by finding a decision path in it. At the beginning, a subtree with one level and eight leaves is employed to determine the mapping of the red color according to the matching degree of the input: each input color value $X_i$ is processed with eight fuzzy sets ($s_1, s_2, \ldots, s_8$) corresponding to the eight leaves for the color. The eight fuzzy sets represent the different intensities of each color. After the decision path in the tree is determined for the red color, the subtrees and decision paths of the green and blue colors are found sequentially, by analogy with the red color fuzzy inference process. The FCC algorithm of [1] is presented in Fig. 1. In it, $L$ denotes the current level of the three-level (color) fuzzy tree and ... otherwise, $X_i$ is blue. In addition, $Path_L$ denotes the decision path of the current subtree, with $0 \le Path_L \le 7$, and a 146-byte ROM is ... if $L = 2$: $Path_{L-2} \ast 16 + Path \ast$ ...
The center-average based inference result $X_o$ is calculated by
where $|D|$ is the distance between $X_i$ and the fuzzy set $s_k$, and $w_1, w_2, \ldots, w_8$ are the supported values of the fuzzy sets, which represent the various intensities of the color space CMY. For more details of the fuzzy color correction process and its performance, please refer to [1]. Fig. 2 shows the corresponding sequential architecture of FCC in [1]. In Fig. 2, the ALU performs the multiplication and division operations in (2). The notation $\ll n$ ($\gg n$) denotes that the value is shifted left (right) by $n$ bits. The storage for $k$ is a 4-bit up-counter, which performs its increment operation. The condition E8 (E0) is set to 1 if $k = 8$ ($k = 0$). Another condition, L0, is set to 1 if $D > 0$. Moreover, $w_1, w_2, \ldots, w_8$ are stored in an 8-byte ROM, denoted as ROM2, and $k$ is used as its address.
Note that the execution time of each iteration in the FCC main loop section is variable, since the number of iterations of its inner loop (see the second while loop of Fig. 1) changes with the values of $k$ and $D$. In addition, FCC contains data-dependent branches, which cannot be sped up without speculative computation, that is, executing a task before it is known to be necessary whenever no other task is ready for execution [11]. Thus, pipelining FCC with traditional methods that use fixed latencies is impossible (that is why only a sequential circuit for FCC is presented in [1]), and a new dynamic pipelined architecture with variable latencies, combined with speculation, is required to design a higher performance FCC circuit.
III. DYNAMIC PIPELINED ARCHITECTURE DESIGN OF FCC
To design the dynamic pipelined architecture efficiently, FCC is first modeled as a hardware behavioral description and then partitioned into the inner section and the main section. Next, the inner and main sections are optimized by using speculative computation and pipelining to enhance the performance. Finally, the performance-optimized inner and main sections are combined to form a dynamic pipelined datapath with variable latencies, and the dynamic pipelined controller is generated. The following subsections explain these steps.
A. Graph Model and Partition
To clearly explain the dynamic pipelined architecture, the FCC description in Fig. 1 is transformed into a hardware behavioral description and then represented as a simplified control/data flow graph (CDFG), shown in Figs. 3 and 4, respectively. In the simplified CDFG of Fig. 4, only the important operations and dependencies are depicted. In it, the operations addition, subtraction, multiplication, division, comparison, and up-count are denoted as +, −, ∗, /, }, and ++, respectively, and assignment is denoted by its own symbol. Moreover, an operation of Fig. 4 that appears in line i of Fig. 3 is denoted by its symbol followed by i. For example, the addition operation in line 4 of Fig. 3 is denoted as +4, as shown in Fig. 4. Based on the 0.8-µm cell library, the multiplication and division operations need two clock cycles to compute their values, and the other operations, such as addition, subtraction, or comparison, need one cycle. The control condition produced by the comparison operation in line i of Fig. 3 is denoted as ci. In Fig. 3, if condition c2 (c9) is true, the main (inner) while loop continues to execute; otherwise, the loop terminates. The main loop of the simplified CDFG contains an inner loop with a data-dependent number of iterations. Pipelining this CDFG with a fixed latency is impossible because of this inner loop, whose iteration count, and hence execution time, varies.
To design a high performance dynamic pipelined FCC architecture with minimum hardware resources, all operations of Fig. 3 are first partitioned into two parts: a main section and an inner section. The operations of the second while loop in lines 9 and 10, as well as the operations in line 11 of Fig. 3, which make the execution length of the FCC main loop variable and unpredictable in advance, are grouped into the inner section, and the remaining operations are grouped into the main section. The operations in line 11 are put into the inner section as well, since line 10 and line 11 share the common operation $Path_L = k$. The aim of the partition is to separate the unpredictable inner section from the main section so that they can be pipelined to form a high performance dynamic pipeline. The operations in lines 12 to 17 of Fig. 3 also make the execution time of each iteration of the main loop variable, but their influence is predictable: if k is equal to 0 or 8, only the operation in line 12 is executed; otherwise, the operations in lines 13 to 17 are executed. That is, we know in advance that their execution needs either 1 or 7 clock cycles. Therefore, the operations in lines 12 to 17 are not put into the inner section.
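The predictability argument for lines 12 to 17 can be illustrated with a minimal Python sketch. It assumes only what the paragraph states, namely that these lines cost 1 clock when k is 0 or 8 and 7 clocks otherwise; the function name `tail_cycles` is hypothetical and does not come from [1].

```python
def tail_cycles(k: int) -> int:
    """Clock cycles spent in lines 12 to 17 of Fig. 3 for a given k.

    Hypothetical helper: the 1-or-7 figures are taken from the text;
    nothing else about the datapath is modeled here.
    """
    if not 0 <= k <= 8:
        raise ValueError("k is expected to be a counter value in 0..8")
    # k == 0 or k == 8: only the operation in line 12 executes.
    # Otherwise the operations in lines 13 to 17 execute.
    return 1 if k in (0, 8) else 7
```

Because this cost is fixed as soon as k is known, it can be handled inside the main section, unlike the inner loop, whose length is only known when its comparison finally fails.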
B. Speculative Computation of Inner Section
After the CDFG for FCC is partitioned, the inner section is scheduled and pipelined with performance optimization techniques. For the moment, the inner section is scheduled for best performance without regard to the interaction and precedence constraints between it and the main section. With conventional pipeline scheduling [12], the operations of the inner section may be scheduled into 4 states: the operations in line 9 of Fig. 3 at the first state, the operations in line 10 at the second state, and the operations in line 11 at the third and fourth states. However, this schedule is slow and cannot be pipelined because of the data dependency between operations }9 (i.e., k < 8) and ++10 (i.e., k++) (see Fig. 3). To get a high performance circuit before pipeline scheduling, control speculation [11], which means executing an operation before the preceding instruction on which it is control dependent, must be applied.
We first schedule the comparison operation }9 in line 9 and the conditional operations in line 10 into a single state called S, and then determine at the end of state S whether the results generated by those operations are kept. That is, all operations in the inner section are scheduled into the one state S. However, the conditional operations in line 10 depend on the result of the logical operation }9, so preventing them from overrunning in state S (when c9 is false) becomes a pressing problem. Some new control signals must be planned, and the associated circuits designed, to solve it. Let $L_D$, $L_{Path_1}$, $L_{Path_2}$, and $C_k$ be the original loading or up-counting control signals of registers $D$, $Path_1$, $Path_2$, and counter $k$, respectively. Signals 
where subscript $L$ is equal to 1 or 2. The circuits that implement (3) and (5) are shown in Fig. 5(a) and (b), respectively. The circuit that implements (4) is similar to the circuit in Fig. 5(a). With this design, all operations of the inner section can be scheduled speculatively and efficiently into the state S, which further enhances the performance of the FCC architecture at the cost of only slight additional hardware.
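The effect of scheduling the comparison and the conditional operations into one state can be sketched in Python. This is a behavioral illustration under loud assumptions, not the circuit of Fig. 5: the loop condition built by `make_c9`, the threshold list `S`, and the gating rule "enable AND c9" are all hypothetical stand-ins, since the actual expressions (3) to (5) are not reproduced here.

```python
from typing import Callable, Optional, Tuple

# Hypothetical fuzzy-set thresholds; the real values are not given in the text.
S = [10, 20, 30, 40, 50, 60, 70, 80]


def make_c9(x_i: int) -> Callable[[int], bool]:
    """Assumed loop condition c9: continue while k < 8 and x_i exceeds S[k]."""
    return lambda k: k < 8 and x_i > S[k]


def inner_two_state(k: int, c9: Callable[[int], bool]) -> Tuple[int, Optional[int], int]:
    """Non-speculative model: compare in one state, update in the next."""
    path, states = None, 0
    while True:
        states += 1                 # state A: evaluate the comparison
        if not c9(k):
            break
        states += 1                 # state B: conditional operations
        path, k = k, k + 1
    return k, path, states


def inner_speculative(k: int, c9: Callable[[int], bool]) -> Tuple[int, Optional[int], int]:
    """Speculative model: compare and update share the single state S."""
    path, states = None, 0
    while True:
        states += 1
        spec_path, spec_k = k, k + 1   # results computed before c9 is known
        cond = c9(k)                   # comparison result, available at end of S
        load_path = cond               # assumed gating: original enable AND c9
        count_k = cond
        if load_path:
            path = spec_path           # commit only when c9 is true
        if count_k:
            k = spec_k
        if not cond:
            break
    return k, path, states
```

With the assumed input `make_c9(35)`, both models end with the same `k` and `path`, but the speculative one visits one state per inner iteration instead of two, which is the gain the gated control signals are meant to make safe.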
C. Pipelining Main Section
Subsequently, the main section is pipelined and then integrated with the performance-optimized inner section to form a dynamic pipelined design. Performance optimization of the main section is more important and more difficult, because the more complex precedence relations between the two sections must be considered and the execution length of the inner section is data dependent and unpredictable. We incrementally unwind the main section loop to pipeline the main section. After a polynomial (and in practice small) number of iterations have been unwound and parallelized, a repeating pipeline body will provably emerge.
While the main section is pipelined, the operations of the inner section are replaced, at their original location in the main section, by an unbound delay operation whose delay depends on the result of comparison operation }9. Fig. 6 shows the main loop body unwound four times. The operations that calculate the ROM address (e.g., operations +4 and +5) in each iteration must be scheduled after the operation ++18 of the previous iteration because of the data dependency caused by $L$. Therefore, the performance of the pipeline schedule shown in Fig. 6 is limited by this data dependency and is the highest achievable under the constraint. Note that operations +4 and +5 are mutually exclusive, so they can share the same adder even though they are scheduled at the same clock cycle. Consequently, the schedule is area-efficient and uses minimal hardware resources. In addition, the assignment operation in line 12 and operations ∗13, −14, ∗15, +16, and /17 are also mutually exclusive. Thus, the condition c12 must be appropriately preserved during pipelining to decide whether these conditional operations are executed.
After the repeating pipeline body of the main section is found, all operations of different iterations or different sections that execute at the same time form a state, such as $PS_1$ to $PS_4$ in Fig. 6. The latency $\ell$ of the pipeline schedule is equal to four clock cycles plus the delay of the inner section (i.e., the unbound delay operation) and is variable. Since the number of iterations of the inner section in Fig. 3, denoted as $I$, is between 1 and 8, and all operations in the inner section are executed within the state S, the execution length of the unbound delay operation is also between 1 and 8. As a result, the latency is $\ell = 4 + I$, and its values are between 5 and 12.
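The latency relation can be checked with a short Python sketch. The constant 4 and the range 1 to 8 for the inner iteration count come from the text; the helper names and the sample trace are hypothetical, and the prelude and postlude of the pipeline are ignored.

```python
from statistics import mean
from typing import Iterable, List

MAIN_CYCLES = 4  # clock cycles contributed by the main section, per the text


def latency(inner_iterations: int) -> int:
    """Run-time latency of one main-loop iteration: 4 + I, with 1 <= I <= 8."""
    if not 1 <= inner_iterations <= 8:
        raise ValueError("the inner section runs between 1 and 8 iterations")
    return MAIN_CYCLES + inner_iterations


def latencies(inner_counts: Iterable[int]) -> List[int]:
    """Latency of each iteration for a sequence of inner iteration counts."""
    return [latency(i) for i in inner_counts]


# Hypothetical trace of inner iteration counts, not measured data.
trace = [1, 8, 5, 6]
print(latencies(trace))        # per-iteration latencies
print(mean(latencies(trace)))  # average latency over the trace
```

In this model an average latency of 9.5, as reported from simulation, would correspond to an average of 5.5 inner iterations per main-loop iteration.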
In the pipeline schedule of Fig. 6, no operation of the main section is scheduled at the same state as the unbound delay operation. In fact, operations +16 and /17 could be scheduled one clock cycle earlier to save one clock cycle in each execution of the main section, without adding extra functional units. However, we do not adopt this alternative schedule, for two reasons. First, its performance improvement is very small. Second, it increases the design complexity. Therefore, we keep the original pipeline schedule.
D. Dynamic Pipelined Datapath
Finally, the state transition graphs of the inner and main sections are generated and combined into a final state transition graph (STG), shown in Fig. 7. The STG consists of the prelude, the repeating pipeline body, and the postlude. The operations activated in each state of Fig. 7 are not shown; please refer to Fig. 6. Datapath allocation is then performed to construct the dynamic pipelined datapath. The datapath allocation technique in [12] is applied: each operation is assigned to a functional unit, each data value is assigned to a storage element, and interconnections among functional units and storage elements are provided using multiplexers and/or buses.
The activation times of the operations in the repeating loop body can be observed from Fig. 6. Accordingly, the ALU of Fig. 2 is split into a multiplier and a divider, because their activation times conflict. In addition, operations +4 and +5 are mutually exclusive and can share the same adder, as mentioned above. The functional units allocated are one adder, one subtractor, one multiplier, one divider, one comparator, and one two's complementor. The lifetimes of the data values in iterations 1, 2, and 3 of the repeating pipeline body are shown in Fig. 8, where data value $x$ of the $i$th iteration is denoted as $x(i)$. Based on Fig. 8, the results of register allocation are shown in Table I.
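Register allocation from value lifetimes is commonly done with the left-edge algorithm, sketched below in Python. Whether the technique of [12] is exactly this procedure is an assumption, and the lifetimes in the example are hypothetical, not those of Fig. 8; the sketch only shows how non-overlapping lifetimes can share one register.

```python
from typing import Dict, List, Tuple


def left_edge(lifetimes: Dict[str, Tuple[int, int]]) -> Dict[str, int]:
    """Assign each data value to a register index.

    lifetimes maps a value name to (start, end); the value occupies its
    register over the half-open cycle interval [start, end).
    """
    free_at: List[int] = []          # cycle at which each register becomes free
    assignment: Dict[str, int] = {}
    # Visit values in order of increasing start time (the "left edge").
    for name, (start, end) in sorted(lifetimes.items(), key=lambda item: item[1]):
        for reg, free in enumerate(free_at):
            if free <= start:        # register is free again: reuse it
                free_at[reg] = end
                assignment[name] = reg
                break
        else:                        # every register is busy: allocate a new one
            free_at.append(end)
            assignment[name] = len(free_at) - 1
    return assignment


# Hypothetical lifetimes for values of two overlapped iterations.
example = {"x(1)": (0, 2), "y(1)": (1, 3), "x(2)": (2, 4), "y(2)": (3, 5)}
print(left_edge(example))
```

In this example the values of iteration 2 reuse the registers released by iteration 1, so four values fit in two registers.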
After the functional units and registers have been allocated, the interconnection among them is constructed by tracing the execution of the operations in Figs. 6 and 7. The final dynamic pipelined datapath is depicted in Fig. 9. A controller is designed to sequence the datapath hardware units with run-time determined latencies according to the dynamic pipelined schedule, which is represented in simplified form by the state transition graph shown in Fig. 7.
IV. EXPERIMENTAL RESULTS
The dynamic pipelined FCC architecture described above has been designed and simulated in Verilog based on the 0.8-µm cell library. Four pictures were used to test the processing speed of the dynamic pipelined design and the two sequential designs [1], [6]. The comparison with the two sequential designs is reported in Table II. In it, the column "file size" shows the file sizes of these pictures. The columns "sequential [6]," "sequential [1]," and "dynamic pipelining [*]" show the total number of states required by the two sequential designs and the proposed dynamic pipelined design, respectively, to process these pictures. The results show average speedups of about 2.7 and 2 times over the two sequential designs, respectively.
Besides the state numbers shown in Table II, the simulation results show that the latency values of the dynamic pipelined design vary from 5 to 12. The average latencies L of the dynamic pipelined design for the four pictures are also listed in Table II. From a theoretical point of view, Fig. 6 shows that each iteration of the sequential FCC architecture [1] needs 19 cycles, since it uses the longest case, 12 cycles, to process the inner section. The proposed dynamic pipelined FCC architecture, in contrast, needs only 9.5 cycles on average to run each iteration of FCC, which is the average of the L values of the four cases shown in Table II. Therefore, we get the speedup of about 19/9.5 = 2 shown in Table II.
However, hardware overhead is needed to achieve the 2 times speedup. Comparing the sequential and dynamic pipelined datapaths shown in Figs. 2 and 9, the penalty paid for the speedup is extra area, including more pipeline registers and a somewhat larger controller. In addition, the total area of the multiplier and the divider used in the dynamic pipelined architecture is larger than the area of the single ALU that performs both multiplication and division in the sequential architecture. Reducing the area of the datapath, for example by using table look-up to replace some of the larger functional units, is our next goal.
V. CONCLUSION
This brief has presented a dynamic pipelined architecture for the fuzzy tree color correction algorithm FCC. It describes the design process of the architecture, which executes a nested loop with a data-dependent number of iterations and data-dependent branches by using variable latencies and speculative computation. The pipeline latencies of the designed architecture depend on the execution length of the inner section and are naturally unfixed. Experimental results show that the dynamic pipelined FCC architecture obtains about 2 times speedup with a slight area overhead.
