Abstract
Introduction
Pipelining is an essential method for designing high performance systems. It has been applied at different levels of design.
For a system level pipeline, it is possible to have the supply voltage of the hardware modules scaled. The scaled voltages, in turn, can be used in changing speed, reducing power consumption and even in balancing pipeline stages (as will be demonstrated in this paper).
However, unlike lower level pipelines (such as at the RTL or the instruction level), system level pipelines have some specific design issues. One of them is the coarse grain of components (sub-systems) which a design is based on. These subsystems usually exhibit different execution times for different tasks. As such, traditional static pipelining with a fixed stage execution time is not practical.
Figure 1(a) shows an example of three components, where each component performs a specific function or behaviour. Four tasks are processed in this system, with the execution time in each component shown in brackets. If we pipeline this system into three stages (each component as a stage) and take the worst case into account, the stage execution time is 20 ms. The three stages operate in parallel on different tasks, and tasks flow through the pipeline at the same pace. As shown in Figure 1(b), the four tasks take 120 ms to complete. This pipeline is very inefficient because the worst case rarely occurs, so the pipeline idles most of the time.
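The 120 ms figure follows from the usual fill-and-drain count for a synchronous pipeline: with every stage clocked at the worst-case time, the last of N tasks leaves after N + S - 1 stage periods. A minimal sketch of that arithmetic (the function name is ours, chosen for illustration):

```python
def static_pipeline_time(num_tasks, num_stages, stage_time_ms):
    """Completion time of a synchronous pipeline with a fixed stage time.

    The first task needs num_stages periods to drain; each further
    task adds one more period.
    """
    return (num_tasks + num_stages - 1) * stage_time_ms


# Values from the example in the text: 4 tasks, 3 stages, 20 ms worst case.
print(static_pipeline_time(4, 3, 20))  # 120
```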
If we allow each stage to work at its own pace by providing enough buffering to store tasks transferred between stages, then the four tasks take only about 50 ms to finish, as illustrated in Figure 1(c). However, stages in this system are likely to be unbalanced. Fast stages will demand more memory to buffer the completed tasks, and some tasks may wait longer. If we can change the stage supply voltages to dynamically adjust the task execution time (which is inversely related to the voltage), a more balanced pipeline can be achieved, as shown in Figure 1(d). We first increase the voltage of stage 2 (hence reducing its execution time) and reduce the voltage of stage 3 (hence increasing its execution time). We then increase the voltage of stage 3 when task 3 arrives, to obtain a more balanced pipeline in which the four tasks take less time to complete (< 40 ms).
[...] [9]. The issues addressed so far have mainly been partitioning and scheduling. All existing work is based on the assumption that the execution time of each sub-system is known and fixed. We tackle, for the first time, the problem where the execution time of each sub-system can vary.
Our approach is explained in the next section with the simulation results shown in section 3. The conclusion is given in section 4.
Dynamic pipeline with stage voltage scaling
For a system with varied execution time and different stage data size, traditional pipelining will result in an inefficient and sometimes impossible design. Here we provide a new type of pipeline, called the dynamic pipeline. Compared to the traditional model, the new model has two features. First, the buffering devices between stages are changed from fixed-size registers to FIFOs (memory with a First In First Out data operation scheme) to accommodate data from a varied number of tasks. Second, the flow of tasks within the pipeline is asynchronous: a stage, after it completes a task, can output to the next stage and start a new one without having to wait for other stages. Therefore, the stage resource can be fully utilised and the pipeline execution can be sped up.
With the dynamic pipeline model, tasks may queue in a FIFO if the next stage is jammed with a big task and this queue can become longer if such a situation persists. Therefore, the time a task spends in a pipeline is not determined by the number of stages and the stage execution time; instead, it is decided by the execution times of this task in each stage and the waiting times in each FIFO.
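To make the timing behaviour concrete, the following is a minimal sketch of the dynamic pipeline under simplifying assumptions that are ours, not the paper's: FIFOs are unbounded, all tasks are available at time 0, and transfers between stages take no time. The execution times in the usage example are hypothetical and are not the values of Figure 1.

```python
def simulate_dynamic_pipeline(exec_times):
    """Simulate an asynchronous pipeline with unbounded FIFOs.

    exec_times[j][i] is the execution time of task j in stage i.
    Returns (finish_times, waits): when each task leaves the last stage,
    and how long it spent queueing in total.
    """
    num_stages = len(exec_times[0])
    stage_free = [0.0] * num_stages  # time at which each stage becomes idle
    finish_times, waits = [], []
    for task in exec_times:
        ready = 0.0   # time at which the task is available to the next stage
        waited = 0.0  # accumulated queueing time of this task
        for i, t in enumerate(task):
            start = max(ready, stage_free[i])  # wait if the stage is busy
            waited += start - ready
            ready = start + t
            stage_free[i] = ready
        finish_times.append(ready)
        waits.append(waited)
    return finish_times, waits


# Hypothetical execution times (ms) for four tasks in three stages.
tasks = [[5, 10, 5], [5, 20, 5], [5, 5, 10], [5, 5, 20]]
finish, waits = simulate_dynamic_pipeline(tasks)
print(finish, waits)
```

In this sketch the time a task spends in the pipeline is the sum of its own stage execution times plus its queueing time, which is the decomposition described in the text.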
To reduce the waiting time, minimise the average response time and increase the throughput, it is desirable that all stages have equal execution times at any time. To maximise the chance of achieving equal stage execution times and to make stage voltage scaling more feasible, we take the average execution time of each implementation as our initial design input. The system is first partitioned and pipelined into the minimum number of stages by using the approach proposed in [9]. All stages have equal or near equal execution times and are balanced based on the average execution times. FIFOs are used in the pipeline to allow stages to work asynchronously.
This balance is then maintained and reinforced by a stage voltage controller, which dynamically adapts to execution time changes between stages while the pipeline is in operation. The scaling requirements from all detectors are processed by the regulator to determine which stages actually need to be voltage scaled. The output $sc$ to each stage controls the supply voltage of that stage. Figure 2(b) shows the structure of one detector. It contains a low pass filter, a comparator and an inverter. The low pass filter is used to obtain the trend of FIFO usage, denoted by $\bar{\delta}$. When $\bar{\delta} > 0$, $sr_f = 1$; otherwise, $sr_f = 0$. The inverter provides the opposite value of $sr_f$.
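A detector of this kind can be sketched in software as follows. The paper does not specify the filter, so this sketch assumes a first-order exponential moving average over the change in FIFO occupancy. The class name, the parameter `alpha` and the reading of the inverter output as `sr_b` are our assumptions.

```python
class Detector:
    """Sketch of one FIFO detector: low pass filter, comparator, inverter."""

    def __init__(self, alpha=0.5):
        self.alpha = alpha   # smoothing factor of the assumed filter
        self.trend = 0.0     # filtered trend of FIFO usage
        self.prev_level = 0  # FIFO occupancy at the previous sample

    def update(self, fifo_level):
        # Low pass filter: smooth the change in FIFO occupancy.
        delta = fifo_level - self.prev_level
        self.prev_level = fifo_level
        self.trend = self.alpha * delta + (1 - self.alpha) * self.trend
        # Comparator: sr_f = 1 when the filtered trend is positive.
        sr_f = 1 if self.trend > 0 else 0
        # Inverter: the opposite value of sr_f (assumed here to be sr_b).
        sr_b = 1 - sr_f
        return sr_f, sr_b


d = Detector(alpha=0.5)
for level in (2, 2, 0):  # hypothetical FIFO occupancy samples
    print(d.update(level))
```

The filtering keeps a single burst of tasks from toggling the request on every sample; only a sustained rise or fall in FIFO usage changes the output.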
The rules of the regulator are as follows.
• For a stage, there are two scaling requirements from the detectors on either side of the stage. If one requirement is to scale down the voltage and the other is to scale up the voltage, then no scale change is undertaken.
• $sr_b$ has higher priority than $sr_f$. In a case where two stages i and i+1 are out of balance, either scaling down stage i or scaling up stage i+1 is allowed. Scaling down stage i should be the first choice to achieve a possible power reduction.
• If more than two contiguous stages are unbalanced, only the first of these stages has its voltage scaled down and only the last has its voltage scaled up. The voltages of the middle stages remain unchanged.
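One possible reading of these rules is sketched below. The encoding is our assumption and is not taken from the paper: FIFO k sits between stage k and stage k+1, a growing FIFO means stage k outpaces stage k+1, a FIFO that is not growing issues no request, and the output per stage is -1 (scale down), 0 (no change) or +1 (scale up). The function name `regulate` is hypothetical. The fallback of scaling up stage i+1 when stage i cannot be scaled down further is omitted.

```python
def regulate(fifo_growing):
    """Derive per-stage scaling decisions from per-FIFO imbalance flags.

    fifo_growing[k] is True when the FIFO between stage k and stage k+1
    shows a positive usage trend. Returns one value per stage:
    -1 = scale voltage down, 0 = unchanged, +1 = scale voltage up.
    """
    num_fifos = len(fifo_growing)
    sc = [0] * (num_fifos + 1)
    k = 0
    while k < num_fifos:
        if not fifo_growing[k]:
            k += 1
            continue
        # Find the end of this run of contiguous growing FIFOs.
        m = k
        while m + 1 < num_fifos and fifo_growing[m + 1]:
            m += 1
        # The first stage of the run is scaled down (preferred for power).
        sc[k] = -1
        # More than two unbalanced stages: the last one is also scaled up.
        # Middle stages receive opposing requests and stay unchanged.
        if m > k:
            sc[m + 1] = +1
        k = m + 1
    return sc


print(regulate([True, False]))       # two unbalanced stages
print(regulate([True, True, True]))  # four contiguous unbalanced stages
```

Under this encoding the third rule follows from the first: each middle stage of a run is asked to scale up by the FIFO before it and to scale down by the FIFO after it, so the two requests cancel.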
Simulation results
Five systems are simulated with three types of designs: the conventional pipeline, the dynamic pipeline without stage voltage scaling and the dynamic pipeline with stage voltage scaling. Figure 3 shows the average response time, throughput time, FIFO usage and power consumption for different numbers of tasks. As can be seen, with stage voltage scaling, the average task response time and FIFO usage are kept well under control and are independent of the number of tasks processed. Without voltage scaling, the response time and FIFO usage grow as more tasks are processed. It can also be seen that the average throughput time and power consumption are quite stable in both working conditions as the number of tasks increases, but the higher throughput results in higher power consumption.
Table 1 lists the simulation results. Based on these values, we can see that the dynamic pipeline always provides high throughput. Without stage voltage scaling, however, this improvement comes at the cost of more FIFOs and a much higher response time due to the imbalance of the pipeline stages, and if the imbalance persists, the cost may grow. This is reflected in Systems II and V, where the average response times and FIFO usages are much higher than those in the conventional pipelined systems. With voltage scaling, on the other hand, the pipelines are dynamically balanced, which provides both high throughput and quick response time. In addition, the memory for FIFOs is limited, and the power consumption is more likely to be reduced because of the low power oriented scheme used by the controller in the pipeline.
Take System IV as an example. The response time is only 39.3% of that of the normal pipeline, and the throughput is increased by nearly 30%, while FIFO usage is only 35.7% and power consumption 67.6%. Power consumption can sometimes increase when the stage voltages are scaled up more than they are scaled down, as is the case in System V, where 3% more power is consumed. However, this higher power consumption brings about a greater improvement in throughput, with the throughput time reduced to 48.6% of the conventional pipeline, as compared to 58.6% in the case without voltage scaling. The measurements of the five systems show that the average improvements in throughput, response time and power consumption are 42%, 55% and 11%, respectively, with only a little additional cost in memory for FIFOs.
Conclusion
The coarse grain of sub-systems in the design makes it difficult to design a pipeline with well balanced stages. To tackle this problem, for the first time, we proposed a new pipeline model, the dynamic pipeline, where stages in a pipeline perform concurrently but output asynchronously. A stage can perform more tasks than others in a certain period of time and the tasks between stages are buffered in FIFOs to maintain the integrity and the order of processing. The dynamic balance of stages in the pipeline is achieved by using stage voltage scaling. The voltages are changed dynamically, during the pipeline execution, in response to the accumulation of tasks buffered in the FIFOs.
Table 1. Simulation results
We have simulated a number of systems with three designs: the conventional pipeline, the dynamic pipeline without stage voltage scaling and the dynamic pipeline with stage voltage scaling. Simulation results show that the dynamic pipelines, both with and without stage voltage scaling, can provide higher throughput than the conventional pipelines. With stage voltage scaling, a dynamically balanced pipeline with high throughput, limited memory, quick response time and potentially lower power consumption can be obtained. The improvement in throughput and response time is about 50%, and the power consumption improvement is about 10%, with little additional cost in memory, compared to the system pipelined with the conventional model. It must be pointed out that the simulations were performed with tasks whose execution times lie within bounded ranges. Tasks with unbounded execution times are not considered in this paper.
