Abstract-This paper presents the BPD (Balanced Power Dissipation) heuristic scheduling algorithm for VLSI CMOS digital circuits/systems. Applied at the task assignment stage, it reduces the global computational demand and balances the power dissipated by the computational units of the designed system, which lowers the average and peak temperatures of the circuit. The algorithm is based on balanced power dissipation of local computational (processing) units and does not deteriorate the throughput of the whole VLSI CMOS digital system.
I. INTRODUCTION
This paper concerns the application of a heuristic scheduling algorithm that distributes the task load uniformly over the computation units (cu i ) while reducing the total cost (co) of the digital VLSI CMOS system, without deteriorating its global computational efficiency, measured e.g. by its throughput (Th). In this paper the universal measure named the cost (co) represents the power consumption of the VLSI CMOS digital system.
In the literature we can find a variety of methods for assigning computational tasks to different computation units (cu i ) [1] , [2] , [4] , [5] , [11] . Such assignment is especially important for reducing the supply power demand [6] , [7] , [8] , [9] . The design objective function taken into account in this paper is the cost of the system (co). This measure has a direct influence on the average and peak temperature of the IC.
The presented BPD (Balanced Power Dissipation) algorithm can be applied to reduce the global computational demand and to provide balanced power dissipation of the digital VLSI CMOS system at the task assignment stage.
II. PROBLEM FORMULATION
The task to be computed is described by the task graph G T (V,E) presented in Fig. 1. An individual computational task (ct) in graph G T is represented by a vertex v i in the set of vertices (e.g. ☼, ▲, • in Fig. 1). Each ct has to be assigned to one of the computational units of a given resource type cu_t j ∈{cu_t j } in the proper discrete time unit (dtu).
The value |{cu_t j }| is equal to the number of computational units of j-type available during the design process, e.g. for the graph in Fig.1 Each type of processing unit is capable of processing a specified type of computational tasks. The problem of assigning tasks to processors as specified here is NP-complete [3] .
While assigning ct i to cu_t j in BPD algorithm, different
The main aim of the elaborated BPD algorithm is to provide balanced power dissipation of the computational units ({cu_t 1 }∪...∪{cu_t j }∪...), leading to a reduction of the average and peak temperatures of the digital VLSI CMOS system without deteriorating its throughput. In this process the total cost of the system is also decreased. The assumption is that less effective computational units are cheaper, so replacing a chosen cu_t i with another, less effective one decreases the total system cost.
III. ALGORITHM DESCRIPTION
The elaborated BPD heuristic algorithm is partially based on the research results presented in papers [9] and [10] , which are concerned with reducing the power consumption of digital CMOS circuits. The algorithm consists of the two stages described below. In the first stage the computational tasks represented by graph G T are assigned to the resource elements so that the lowest possible number of discrete time units is used. During this stage, either the ASAP or the ALAP [8] algorithm is executed.
The scheduled computational tasks for the example in Fig. 1 after the first stage of the BPD algorithm are shown in Fig. 3.
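The first stage can be illustrated with a minimal ASAP sketch. It is not the authors' implementation: it assumes an unlimited number of computational units (no resource binding), and the names asap_schedule, durations and edges, as well as the four-task graph, are hypothetical and introduced only for illustration.

```python
from collections import defaultdict, deque


def asap_schedule(durations, edges):
    """Return {task: start_dtu}; each task starts as soon as all predecessors end.

    durations: {task: number of dtu needed}; edges: (predecessor, successor) pairs.
    Assumption: unlimited computational units, so only data dependencies matter.
    """
    preds = defaultdict(list)
    succs = defaultdict(list)
    indegree = {task: 0 for task in durations}
    for u, v in edges:
        preds[v].append(u)
        succs[u].append(v)
        indegree[v] += 1

    ready = deque(task for task in durations if indegree[task] == 0)
    start = {}
    while ready:
        task = ready.popleft()
        # Earliest start is the latest end among the predecessors (0 for sources).
        start[task] = max((start[p] + durations[p] for p in preds[task]), default=0)
        for s in succs[task]:
            indegree[s] -= 1
            if indegree[s] == 0:
                ready.append(s)

    if len(start) != len(durations):
        raise ValueError("task graph contains a cycle")
    return start


# Hypothetical task graph: a and b feed c, c feeds d.
durations = {"a": 1, "b": 2, "c": 1, "d": 1}
edges = [("a", "c"), ("b", "c"), ("c", "d")]
schedule = asap_schedule(durations, edges)
```

An ALAP variant would run the same traversal from the sinks backwards, starting from the maximal admissible number of dtu.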
The second stage of BPD algorithm (Tab.I) consists of delaying the initial SG created in the first stage. The C su set includes all the rows suitable for delaying, and C ms indicates the most suitable row selected from this set. The chosen C ms row actually undergoes the process of delaying. 
This function performs the check for the free space condition (fsc), defined by the following formula:
where n i is the number of dtu needed to perform the task, l k is the number of tasks assigned to C k computational unit row, r o is the number of free task slots after the first occurrence of a task in C k computational unit row.
Formula (1) checks whether the number of free dtu slots is sufficient for the idle tasks that introduce a longer processing time of a cu i . Every C k selected for an increased number of dtu has to fulfil condition (1) . Although cascading tasks from the other C k rows are not taken into consideration while calculating fsc, condition (1) is sufficient to quickly pre-reject some C k from the C su set before starting the time-consuming delaying process.
there_is_same_cu(C k ) (line 10)
This function simply indicates whether there is another cu i of exactly the same type as the one assigned to C k , i.e. one capable of performing the same type of task in the same time.
©EDA Publishing/THERMINIC 2007 ISBN: 978-2-35500-002-7
This function returns a row containing the tasks of exactly the same type as C k .
fvi( v i ) (line 16)
This function calculates fvi factor for the v i task according to the following formula:
where i vi is the number of independent inputs of task v i , o vi is the number of system outputs of task v i , f avi = dtu M -(cs e + n i ), dtu M is the maximal admissible number of dtu, cs e is the dtu to which v i is assigned, s vi is the number of tasks of the same type as task v i in the path of G T below task v i , n i is the number of dtu needed to perform the task, and p i is the p-label of task v i . The minimal p-label of a task equals 0, hence the addition of 1 in the denominator is necessary to avoid division by 0.
The f vi value of a task indicates its suitability for being slowed down. When there is more than one row of the same type it is used to create the best interconnect rows by interchanging tasks.
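Of the quantities listed above, only f avi is given explicitly in the text, so the following sketch is limited to that term and does not attempt the full f vi factor. The function name free_dtu_after and the numeric values are hypothetical.

```python
def free_dtu_after(dtu_max, start_dtu, n_dtu):
    """f_avi = dtu_M - (cs_e + n_i): dtu left between the end of a task and dtu_M."""
    return dtu_max - (start_dtu + n_dtu)


# Hypothetical values: dtu_M = 8, task assigned to dtu 2, needing 3 dtu.
slack = free_dtu_after(dtu_max=8, start_dtu=2, n_dtu=3)
```

Under this reading, a task with f avi equal to 0 already ends at dtu M and leaves no room for slowing it down.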
interchange(v i , v j ) (line 14)
This function swaps the v i and v j tasks, so that v i is located in task slots earlier occupied by v j and vice versa.
fck( C k ) (line 15)
The value of the function is given by:
where n dCk P is the normalised computational load of the C k computational unit row (cu assigned to the C k row; n dCk P is normalised to the cu i having the lowest value of P di ), and l Ck is the number of tasks in the C k computational unit row. The fck function is responsible for selecting, from the C su set, the most suitable row (C ms ) for inserting idle tasks. It chooses the row assigned to the cu_t i that has the highest throughput demand, hence it gives the highest throughput demand reduction when the processing element x y p assigned to row C k is slowed down.
insert_idle_tasks(v i ) (line 20)
This function simply adds new task slots with idle tasks after the v i task. If there is an empty task slot after the last dtu occupied by v i , then an idle task is added there. However, when there is no empty room for a new idle task, then the next task in the row of v i is delayed. Next the data interconnections between v i and its successor tasks must be checked. This is done by the delay_all_successors function described below.
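The row manipulation described above can be sketched as follows. This is an illustrative model only: a row is assumed to be a fixed-length list with one entry per dtu, None marks an empty task slot, and the names insert_idle_task and IDLE are hypothetical. The check of successor tasks in other rows is deliberately left out.

```python
IDLE = "idle"


def insert_idle_task(row, task):
    """Return a new row with one idle slot right after the last dtu of `task`.

    Assumed model: row[i] is the task occupying dtu i, or None for an empty slot.
    If the slot after `task` is empty it becomes idle; otherwise the following
    tasks are shifted by one dtu into the nearest empty slot.
    """
    last = len(row) - 1 - row[::-1].index(task)  # last dtu occupied by task
    nxt = last + 1
    try:
        free = row.index(None, nxt)  # nearest empty slot at or after nxt
    except ValueError:
        raise ValueError("no free slot left in the row") from None
    new_row = list(row)
    del new_row[free]          # consume the empty slot ...
    new_row.insert(nxt, IDLE)  # ... and place the idle task right after `task`
    return new_row


# Empty slot directly after v1: it simply becomes idle.
row_a = insert_idle_task(["v1", None, "v2", None], "v1")
# No empty slot directly after v1: v2 is delayed by one dtu.
row_b = insert_idle_task(["v1", "v2", None, "v3"], "v1")
```

Keeping the row length fixed models the dtu M limit in a simple way: when no empty slot remains, the insertion fails instead of lengthening the schedule.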
delay_all_successors(v i ) (line 21)
This function checks if all the data needed to perform successor task (s i ) of v i is available on time, by checking the condition:
If it is not fulfilled, then the successor is delayed as many dtu as needed, so that start_step(s i )=end_dtu(v i ).
Such a delay implies the need for checking all the sets of data and interconnections between the successors of s i . If the delay is not possible due to the dtu M constraint, increasing the number of dtu of v i (and the computational unit it is assigned to) fails.
In such a case the row containing v i , i.e. C ms is removed from the C su set F, and the process starts from the beginning with the decreased |C su |.
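The propagation of delays can be sketched as below. The sketch assumes that the availability condition is start_step(s i ) ≥ end_dtu(v i ), with end_dtu taken as start plus duration; this reading is inferred from the equality stated above and is an assumption. Conflicts between tasks sharing a row are ignored, and all names and values are hypothetical.

```python
def delay_all_successors(task, start, duration, succs, dtu_max):
    """Delay successors until none starts before its predecessor ends.

    Returns the updated {task: start_dtu} mapping, or None when some task
    would end after dtu_max, i.e. when the delaying attempt fails.
    """
    new_start = dict(start)
    stack = [task]
    while stack:
        v = stack.pop()
        end_v = new_start[v] + duration[v]
        if end_v > dtu_max:
            return None  # dtu_M constraint violated
        for s in succs.get(v, ()):
            if new_start[s] < end_v:
                new_start[s] = end_v  # start_step(s) = end_dtu(v)
                stack.append(s)       # successors of s must be re-checked
    return new_start


# Hypothetical chain v -> s -> t, after v has been lengthened from 2 to 3 dtu.
start = {"v": 0, "s": 2, "t": 3}
duration = {"v": 3, "s": 1, "t": 1}
succs = {"v": ["s"], "s": ["t"]}
ok = delay_all_successors("v", start, duration, succs, dtu_max=5)
failed = delay_all_successors("v", start, duration, succs, dtu_max=4)
```

A None result corresponds to the failure case described above, in which the selected row is removed from the C su set and the selection starts again.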
For the simple benchmark shown in Fig. 1, the results are presented in detail. The obtained results are shown in 
algorithm, while the second stage is given in Fig. 4. In Fig. 4, lowering the cost of the appropriate computational units is represented by inserting the symbol ◊. It means that the throughput of such a unit can be halved without deteriorating the efficiency of the whole computational system. In our example, the throughput of the computational units cu_a_2 and cu_c can therefore be halved, which reduces the cost of the designed computational system. Moreover, the throughput obtained earlier does not deteriorate.
IV. EXPERIMENTAL RESULTS
This section presents the results obtained by applying the BPD algorithm to selected benchmarks [12] .
Cost reduction is calculated for each computational task based on the number of computational units of each type before and after application of the BPD algorithm. The cost of each computational unit type assumed for cost calculation is directly proportional to its throughput. To simplify the comparison of computational efficiency we assume that T hi can be lowered by the factor 0.5.
[Table: two rows, each listing the values 8, 4, 2, 1 five times; the second row is labelled P di .]
Tables III and IV and Fig. 5 show the number of computational units of each type before and after applying the BPD algorithm, respectively. The resulting cost reduction, together with the number of G T graph vertices of each benchmark computational task, is reported in Table VI .
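The cost comparison can be illustrated with a short calculation. It assumes a proportionality constant of 1 between cost and throughput; the unit sets below are hypothetical and are not taken from the reported tables.

```python
def total_cost(unit_throughputs, cost_per_throughput=1.0):
    """Cost of a set of units, each cost being proportional to its throughput."""
    return cost_per_throughput * sum(unit_throughputs)


def cost_reduction_percent(before, after):
    """Relative cost reduction, in per cent, between two unit sets."""
    c_before = total_cost(before)
    c_after = total_cost(after)
    return 100.0 * (c_before - c_after) / c_before


# Hypothetical example: two of three units are replaced by half-throughput ones.
before = [8, 8, 4]  # throughputs of the units after the first stage
after = [8, 4, 2]   # throughputs after slowing down two units by the factor 0.5
reduction = cost_reduction_percent(before, after)
```

In this hypothetical case the cost drops from 20 to 14 units, i.e. by 30 per cent.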
V. CONCLUSIONS
In this paper the BPD heuristic scheduling algorithm for balanced power dissipation, resulting in a reduction of the average and peak temperatures of VLSI CMOS digital circuits/systems, was presented. The objective function introduced is the cost of the VLSI CMOS digital circuit, which directly depends on the power dissipated in the IC. The main idea of the BPD algorithm is to decrease the cost of chosen computational units by adjusting their efficiency to real needs without deteriorating the computational efficiency of the whole system.
The BPD algorithm has been verified on the chosen set of benchmarks. Experimental results showed a 13 to 43 per cent cost reduction of the computing system, achieved without deterioration of the system throughput under the assumed cost to throughput dependency. This reduction has a direct influence on decreasing the average and peak temperatures of the VLSI CMOS system and results in increased reliability.
ACKNOWLEDGMENT
The work was partially supported by the State Committee for Scientific Research (MNiI, Poland) through the grant 3 T11B 015 27.
