Designing energy-efficient Digital Signal Processor (DSP) cores has become a key concern in embedded systems development. This paper proposes an energy-proportional computing scheme for Very Long Instruction Word (VLIW) architectures. To make the processor power scale with adapted parallelism, we propose incorporating distributed Power-Gated Register Files (PGRF) into VLIW to achieve a PGRF-VLIW architecture. For energy efficiency, we also propose an instruction scheduling algorithm called the Deadline-Constrained Clustered Scheduling (DCCS) algorithm. The algorithm clusters the data dependence graph to reduce data transfer energy and makes optimal use of low-powered local registers for tree-structured data dependence graphs. The results of evaluations conducted using the MiBench and DSPstone benchmark suites substantiate the expected power saving and scaling effects.
INTRODUCTION
Digital Signal Processors (DSPs) with Very Long Instruction Word (VLIW) architectures are widely used in embedded systems and multimedia systems-on-chip (SoCs) [Philips 2011; Texas Instruments 2011]. Assisted by the compiler to exploit instruction-level parallelism, a VLIW DSP has multiple execution slots to provide high-performance arithmetic computing. Many semiconductor manufacturers provide VLIW DSP cores for embedded SoCs [Freescale Semiconductor 2008; Texas Instruments 2005]. However, a wide-issue VLIW architecture increases processor power and has a significant impact on battery life, system density, cooling cost, and system reliability. On the other hand, in the era of deep submicron semiconductor fabrication, designers face new challenges such as reducing the exponentially increasing leakage power [Industry Technology Roadmap for Semiconductors 2010]. This article investigates energy-efficient VLIW architectures for deep submicron processes using emerging technologies.
This work is supported by the High-Speed Intelligent Communication Research Center of Chang-Gung University and the CGU research funding under grant BMRPB88. Authors' addresses: Z. Liang and W. Zhang, School of Electronic Information Engineering, Tianjin University, Tianjin 300072, People's Republic of China; email: {liangzb; tjuzhangwei}@tju.edu.cn; Y.-C. Ma (corresponding author), Department of Computer Science and Information Engineering, Chang-Gung University, Kwei-Shan, Taoyuan, Taiwan; email: ycma@mail.cgu.edu.tw. © 2014 ACM 1544-3566/2014/06-ART20. DOI: http://dx.doi.org/10.1145/2632218
Mainstream VLIW DSPs adopt some form of distributed register files architecture. Researchers report that the register file power accounts for a large proportion of the total processor power. The power dissipated in the global register file of an N-issue VLIW processor is on the order of $O(N^3)$ [Rixner et al. 2000]. Major VLIW DSPs therefore employ some form of distributed register files architecture to reduce the register file power [Freescale Semiconductor 2008; Lin et al. 2008; Texas Instruments 2005]. Various researchers have also proposed a variety of compiler optimization technologies for distributed register files architectures [Akturan and Jacome 2001; Aleta et al. 2007; Qian et al. 2002a, 2002b; Zalamea et al. 2001].
In recent years, energy-proportional computing [Barroso and Holzle 2007; Cameron 2010] has been proposed as a means of improving energy efficiency. An energy-proportional processor has various speed modes to adapt to applications with diverse performance requirements. Power consumption is reduced while working in a lower speed mode. The philosophy is to run the processor just fast enough to meet an application's need. A Dynamic Voltage Scaling (DVS) [Corporation 2000; Devices 2001; Fleischmann 2001] processor is a typical energy-proportional processor. However, although the DVS scheme successfully reduces the dynamic power, designers are facing new challenges arising from the advancements in semiconductor manufacturing processes.
One of the challenges mentioned earlier is that the leakage power increases exponentially and becomes comparable to the switching power [Industry Technology Roadmap for Semiconductors 2010]. To cope with the increasing leakage power, Multiple Threshold-Voltage CMOS (MTCMOS) power-gating technologies have been proposed [Shin et al. 2010]. Various researchers have proposed compiler-assisted power-gating control to reduce the power dissipation of instruction-level parallel processors [Wang et al. 2010; You et al. 2006, 2007]. Although power reduction on functional units is widely discussed, we focus on reducing the power dissipated in register files by means of power gating.
In this article, we propose a novel approach: energy-proportional computing on VLIW architectures. Utilizing MTCMOS power-gating technologies, the design objective is to make the processor power scale with adapted parallelism. We propose incorporating distributed Power-Gated Register Files (PGRF) with VLIW to achieve a PGRF-VLIW architecture that reduces the power dissipation on functional units as well as register files. The power efficiency of the architecture relies on compiler instruction scheduling to make efficient use of local registers and reduce power-hungry data transfers across execution slots. Our major contribution is the Deadline-Constrained Clustered Scheduling (DCCS) algorithm for energy-aware instruction scheduling. The algorithm is motivated by the observation that a DSP application is formed primarily by equation evaluations and has tree-like data dependence structures. The DCCS algorithm clusters a tree-like data dependence graph to reduce cross-slot data dependencies. Moreover, for a tree-structured data dependence graph, we prove that the DCCS algorithm generates optimal use of local registers. The results of evaluations conducted using the MiBench [Guthaus et al. 2001] and DSPstone [Zivojnovic et al. 1994] benchmark suites verify that our approach is effective.
This article is organized as follows. Section 2 overviews background information on VLIW processors and power-gating technologies. The main concept underlying the proposed approach is given in Section 3. Sections 4 and 5 are devoted to the DCCS algorithm. Section 6 discusses practical issues on applying the proposed approach. Finally, evaluation results are discussed in Section 7, and conclusions are given in Section 8.
BACKGROUND
This section overviews the two diverse branches underpinning our research about power gating on VLIW-DSP cores with distributed register files architectures.
Trends in Energy-Efficient VLIW Architectures Design
Most commercial VLIW DSPs employ distributed register files architectures to reduce power dissipation. The trend comes from the energy complexity of a register file [Rixner et al. 2000]. In a distributed register files architecture, the register set is partitioned into pieces, each of which is connected to a limited number of functional units. The Texas Instruments C6000-series DSP [Texas Instruments 2005] is an eight-issue VLIW processor with two homogeneous clusters. Functional units are organized into two identical clusters, and each cluster has a register file connecting to all functional units in the cluster. The Freescale Starcore DSP [Freescale Semiconductor 2008] is a six-issue VLIW processor with heterogeneous cluster architecture. Functional units are grouped into two clusters: one cluster for two-issue memory operations and the other for four-issue arithmetic operations. The PAC-DSP core provided by the Industry Technology Research Institute (ITRI) [Lin et al. 2008] is also a clustered architecture using a ping-pong register file design (a register file with reconfigurable connection to functional units) for intercluster data exchange with a limited number of access ports.
The key issue in developing a VLIW architecture is to design the instruction scheduling algorithm for the compiler. Since a distributed register files architecture results in data-transfer limitations across execution slots and may degrade performance, researchers have proposed a variety of instruction scheduling schemes aimed at high-performance execution under the data-transfer limitations. Terechko et al. [2007] established a comprehensive theory of code optimization for clustered VLIW architectures. The theory defines various intercluster communication models and proposes scheduling algorithms for each of them. A DSP core is mainly used to process streamed data with a kernel loop. Various researchers have proposed software pipelining algorithms to improve the performance of clustered VLIW architectures [Akturan and Jacome 2001; Qian et al. 2002a, 2002b; Zalamea et al. 2001]. This article redefines the instruction scheduling problem from an alternative viewpoint: deadline-constrained energy optimization for energy-proportional computing.
In the drive to design compilers for architectures with distributed register files, previous researchers [Cai et al. 2008] proposed graph clustering schemes for instruction scheduling. A clustered architecture induces additional latency in intercluster data transfers. To reduce program execution time, the Data Dependence Graph (DDG) is partitioned into virtual clusters such that the communication cost is reduced without affecting parallelism exploitation. The virtual clusters are formed by repeatedly examining critical paths of the DDG. The virtual clusters are then mapped to physical clusters with an as soon as possible (ASAP) scheduling policy. This article revisits the graph clustering problem from an alternative viewpoint: saving the data transfer energy on a shared register file.
Compiler-Assisted MTCMOS Power Gating
2.2.1. MTCMOS Circuitry for Leakage Power Reduction. Progress in semiconductor manufacturing has given rise to a new challenge: how to reduce leakage power [Industry Technology Roadmap for Semiconductors 2010]. MTCMOS circuitry [Shin et al. 2010] with compiler-assisted power gating has been proposed to overcome this challenge. However, MTCMOS power gating induces overhead and results in constraints for architecture-level design. The first constraint comes from the time and energy overhead for transitions between active and sleep modes. To prevent the circuit from being destroyed by a sudden rush of current, a gradual wake-up scheme has to be applied. To save energy, power gating is only enabled when the idle period exceeds a certain threshold. The second constraint comes from the area overhead. An isolation cell has to be inserted on each net across power domains to protect the circuit. As a result, power domain partitioning has become a key issue in architecture-level design.
2.2.2. Compiler-Directed Power-Gating Control. A power-gated architecture requires compiler support to exploit its energy efficiency. The compiler analyzes the parallelism in the program and inserts power-gating instructions to turn on/off hardware for program execution. Various researchers have been working on this issue. One proposed approach is a global scheduling algorithm that shuts down idle hardware without affecting program execution time. The algorithm traverses the control flow graph to estimate the required parallelism: the ratio of the number of operations to the schedule makespan. Hardware exceeding the estimated parallelism is shut down via power gating. You et al. [2006, 2007] proposed an instruction scheduling algorithm to reduce power-gating overhead. Their proposed algorithm, called "sink-and-hoist," traverses the control flow graph to estimate the lifetime of functional units. Instruction scheduling is then performed to assemble idle periods of functional units and pack multiple power-gating instructions into one. The algorithm not only reduces the number of power-gating instructions but also improves the energy efficiency. Wang et al. [2010] studied loop scheduling for a VLIW processor with power-gating control. Rotation scheduling over a kernel loop is performed to build an ASAP schedule with minimum parallelism. As a result, power consumption is reduced while the processor performance is retained.
All the works just cited consider power gating on functional units only. However, researchers report that the register file accounts for a larger portion of a processor's power. Therefore, we propose a power-gated architecture and compiler support aimed at reducing the power dissipated in register files.
DESIGN OVERVIEW: THE PGRF-VLIW ARCHITECTURE WITH COMPILER SUPPORT
Our objective is to minimize the power dissipated in both functional units and register files. The main theme is energy-proportional parallel computing: adapting processor power by scaling the instruction-level parallelism. Power gating is deployed on both functional units and register files. The compiler has directives for a user to specify program execution deadline over a code fragment. To insert power-gating instructions, the compiler analyzes the parallelism required and optimizes hardware resource usage subject to the execution deadline constraint. The remaining sections of this article are devoted to the joint effort of hardware and software design.
The PGRF-VLIW architecture
The PGRF-VLIW architecture is shown in Figure 1(a). An instruction packet flows through the fetch and decode units, and operations are dispatched to multiple execution slots. Like the traditional architecture, a Shared Register File (SRF) is connected to all execution slots for data transfer across execution slots. Each execution slot has a Local Register File (LRF) that is accessible only by functional units in the execution slot. Power gating is applied to execution slots as well as register files. Each execution slot is an individual power domain. Moreover, as shown in Figure 1(b), each of the local register files and the shared register file is partitioned into banks with individual power gating. Register banks not allocated with operands are shut down during execution.
The architecture is designed to extend the overall power-scaling range of the processor. Each execution slot is an individual power domain, and hence power dissipation on the functional units scales with parallelism. The energy complexity of register files [Rixner et al. 2000] leads to the use of LRFs: an LRF has fewer access ports and hence consumes less power. The architecture relies on the compiler to direct most of the operand accesses to LRFs. Moreover, it is believed that data transfer through SRF increases as parallelism increases. Consequently, we expect power dissipation on both functional units and register files to scale with changing parallelism.
Compiler Support for the PGRF-VLIW Processor
The architecture relies on the compiler to exploit its energy efficiency. Power reduction is achieved through instruction scheduling and register allocation to allocate operations and operands onto the architecture. The optimization problem is to minimize the processor power subject to the user-specified program execution deadline. Instruction scheduling imposes constraints on register allocation and hence plays a key role in optimizing the power of register files. We focus on the local instruction scheduling that is the theoretical foundation underpinning the success of the architecture. A spill stage inserts load/store operations into the IR such that the shared register file is able to store all register operands. Instruction scheduling is performed to assign when (cycle) and where (execution slot) each operation is to be executed. The instruction scheduling sets lifetime and positioning constraints to allocate temporaries. Constrained by the instruction scheduling outcome, the register allocation maps temporaries to registers distributed on LRFs and SRF. Finally, the code-generation stage generates the parallel assembly code. In addition, power-gating instructions are inserted to turn off unused execution slots and register banks during code generation. In practical use, the proposed scheduling algorithm cooperates with global optimization technologies (such as trace scheduling [Aho et al. 2007] ) to deal with real applications. A global optimizer generates segments of straight programs (such as traces) to feed into the proposed instruction scheduler for energy optimization.
Energy-Aware Instruction Scheduling
Guidelines. In the instruction scheduling process, the program IR is modeled as a Data Dependence Graph (DDG). Figure 2(b) shows an example. A series of operations $I_i$ (operating on temporaries $t_j$) on the left-hand side is modeled as the DDG on the right-hand side. A node of the DDG represents an operation, and a directed edge indicates the producer-consumer relation of a temporary.
The outcome of the instruction scheduling process is a schedule chart that is viewed as a layout of the DDG. An example schedule chart for the DDG in Figure 3(a) is shown in Figure 3(b). In a schedule chart, a row corresponds to an execution slot, and a column corresponds to a cycle. An operation $I_i$ at position $(Slot_k, \tau)$ indicates that $I_i$ is to be executed by execution slot $Slot_k$ at cycle $\tau$. A cross-slot DDG edge (unbroken lines in Figure 3(b)) indicates that a temporary is transferred across execution slots. This kind of cross-slot dependence implies that the underlying temporary has to be allocated to the power-hungry SRF. Conversely, an intraslot DDG edge (the broken lines in Figure 3(b)) indicates that there is a chance to transfer operands through low-powered LRFs.
The guidelines for deadline-constrained energy-aware scheduling are (1) to minimize the hardware parallelism to meet the deadline constraint and (2) to reduce the number of cross-slot DDG edges to direct more temporary accesses to the LRFs. We use the DDG in Figure 3 to illustrate. Compared with the schedule in Figure 3(b), the schedule in Figure 3(c) meets the deadline with only two execution slots and hence consumes less computational energy. Moreover, the schedule in Figure 3(c) has fewer cross-slot DDG edges and directs more accessed temporaries to low-powered LRFs.
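The two guidelines can be made concrete with a small sketch that scores a schedule by the number of execution slots it uses and by its number of cross-slot DDG edges. The DDG, the operation names, and the helper `schedule_cost` below are hypothetical illustrations and are not taken from Figure 3.

```python
from typing import Dict, List, Tuple

def schedule_cost(edges: List[Tuple[str, str]],
                  slot_of: Dict[str, int]) -> Tuple[int, int]:
    """Return (slots used, cross-slot edges) for a slot assignment.

    edges   : producer -> consumer pairs of the DDG
    slot_of : execution slot chosen for each operation
    """
    # An edge is cross-slot when producer and consumer sit in different slots,
    # which forces the temporary through the shared register file (SRF).
    cross = sum(1 for u, v in edges if slot_of[u] != slot_of[v])
    slots = len(set(slot_of.values()))
    return slots, cross

# Hypothetical tree-shaped DDG: I1, I2 -> I3; I3, I4 -> I5.
edges = [("I1", "I3"), ("I2", "I3"), ("I3", "I5"), ("I4", "I5")]

wide = {"I1": 0, "I2": 1, "I3": 0, "I4": 2, "I5": 0}     # three slots
narrow = {"I1": 0, "I2": 0, "I3": 0, "I4": 1, "I5": 0}   # two slots

wide_cost = schedule_cost(edges, wide)
narrow_cost = schedule_cost(edges, narrow)
```

Under both guidelines the second assignment is preferable, provided it still meets the deadline: it keeps fewer slots powered and routes fewer temporaries through the SRF.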
3.2.3. Motivating Idea Underlying the Proposed DCCS Algorithm. We propose our DCCS algorithm along the guidelines just noted. The algorithm is motivated by the observation that most DSP applications consist of equation evaluations, and the DDG can be viewed as an inverted tree. The DCCS algorithm partitions a tree-like DDG into clusters: each cluster is a tree, and all operations in a cluster are scheduled in the same execution slot. Such a clustering sets most of the edges in the DDG as intraslot edges. In addition, we propose an algorithm to obtain a sequential schedule aimed at minimizing the register pressure within an execution slot. The proposed sequential scheduling algorithm is proved to minimize the number of local registers required when the DDG is a tree. The final schedule for all parallel execution slots is established by controlling the clustering granularity with the deadline constraint.
Table I. Notations Associated with Sequential Scheduling
$root(T)$: The unique sink node of $T$
$SS_Q(T')$: The subsequence of $Q$ containing only vertices in subtree $T'$
$R_Q(T')$: The time range over which vertices in subtree $T'$ are spread
$RP(Q)$: Register pressure of a feasible sequence $Q$
$RP^*(T)$: Register pressure of the optimal sequencing for DDG $T$
$Q_1 Q_2$: The sequence constructed from appending $Q_2$ to the tail of $Q_1$
SEQUENTIAL SCHEDULING TO MINIMIZE REGISTER REQUIREMENTS

Modeling the Sequential Scheduling
We first propose the High Register Pressure First (HRPF) algorithm for sequential scheduling within an execution slot. Figure 4(a) illustrates a tree-structured DDG $T$, with the accessed temporaries $t_i$ marked on the edges, and shows our terminology. In this example, each operation (vertex) has a one-cycle latency. For a path from vertex $I_i$ to $I_j$, we say $I_i$ is an ancestor of $I_j$ and $I_j$ is a descendant of $I_i$. A vertex having no ancestors in $T$ is called a source node, and a vertex having no descendants is called a sink node. A tree $T$ has at least one source node and one unique sink node, called the root of $T$. Table I gives our notations associated with sequential scheduling.
Figure 4(a) also illustrates our decomposition scheme $DQ(T) = (\{T_1, T_2, \ldots\}, T_R)$ on a tree $T$. Suppose that $T$ is not a path graph. (In a path graph, all vertices are in a simple path [Gross and Yellen 2006].) Backtracking from the root, we obtain the first vertex $u$ with at least two immediate ancestors as the split point. Decomposing from the split point, we obtain a set of parallel branches $\{T_1, T_2, \ldots\}$ and the subtree $T_R$ covering the path from $u$ to the root. No two branches have vertices in common.
A feasible sequential schedule is a time mapping from a topological order of the subgraph. Any topological order of vertices in $T$ is a feasible sequence for the scheduling. To construct a sequential schedule, the sequence is mapped to a linear timeline with the first operation assigned to cycle 1. We build ASAP schedules, and each operation is mapped to the earliest available time satisfying dependence constraints. Thus, the operation sequencing uniquely determines the sequential schedule. In the remaining discussion, we use the two terms "feasible sequence" and "feasible sequential schedule" interchangeably. Let $Q$ be a feasible sequence for $T$, let $T'$ be a subtree of $T$, and let $Q'$ be the subsequence of $Q$ consisting of only vertices in $T'$. The sequence $Q'$ is also a feasible sequence for subtree $T'$.
Notations $SS_Q(T')$ and $R_Q(T')$ denote the subsequence and time range for a subtree $T'$ in $Q$, respectively. For the example in Figure 4, we have $SS_Q(T_1) = \{I_2, I_1, I_5, I_6, I_8\}$.
The scheduling determines the register requirement of a program. The live range of a temporary $t_i$ is the time range in which some register has to be occupied by the value of $t_i$. A schedule defines live ranges for all temporaries. The live range of $t_i$ starts from the time the value is generated by the producer operation and ends at the time the last operation reading $t_i$ is scheduled. In this example, all operations have one-cycle latency to produce the outcome. The register pressure of a schedule is the maximum number of simultaneously alive temporaries. With perfect register allocation, the register pressure is the required number of registers. The lower half of Figure 4(b) shows the live ranges, for which we have $RP(Q) = 3$.
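The definition of register pressure can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: every operation has one-cycle latency, the operation at position p of the sequence runs at cycle p, a temporary is alive from the cycle after its producer runs through the cycle its consumer runs, and the root's result is counted as alive for one cycle after the root. The tree and the function name `register_pressure` are hypothetical and do not reproduce Figure 4.

```python
from typing import Dict, List, Optional

def register_pressure(sequence: List[str],
                      consumer: Dict[str, Optional[str]]) -> int:
    """Maximum number of simultaneously alive temporaries of a sequence.

    sequence : a topological order of the tree (position p runs at cycle p)
    consumer : the single reader of each operation's result (None for the root)
    """
    cycle = {op: i + 1 for i, op in enumerate(sequence)}
    pressure = 0
    # Scan one cycle past the end so the root's result is counted too.
    for tau in range(1, len(sequence) + 2):
        alive = 0
        for op in sequence:
            reader = consumer.get(op)
            # Assumption: a result with no reader lives for one extra cycle.
            end = cycle[reader] if reader is not None else cycle[op] + 1
            if cycle[op] < tau <= end:
                alive += 1
        pressure = max(pressure, alive)
    return pressure

# Hypothetical tree: I1, I2 -> I3; I4, I5 -> I6; I3, I6 -> I7 (root).
consumer = {"I1": "I3", "I2": "I3", "I4": "I6", "I5": "I6",
            "I3": "I7", "I6": "I7", "I7": None}

branch_by_branch = ["I1", "I2", "I3", "I4", "I5", "I6", "I7"]
interleaved = ["I1", "I4", "I2", "I5", "I3", "I6", "I7"]

rp_branch = register_pressure(branch_by_branch, consumer)
rp_interleaved = register_pressure(interleaved, consumer)
```

Both orders are feasible, but finishing one branch before starting the next keeps fewer temporaries alive at once than interleaving the two branches.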
We can now define the optimization problem. Given a tree-structured DDG T , we build a topological order Q of T to find a feasible sequence with minimum register pressure. In general, finding an optimal schedule is NP-complete [Garey and Johnson 1979] . Nevertheless, we devise a polynomial-time algorithm that is guaranteed to find an optimal solution when the DDG is a tree.
The HRPF Sequencing Algorithm
Algorithm 1 presents the sequential scheduling algorithm. The algorithm builds a feasible sequence recursively by decomposing the tree $T$ into subgraphs. The HRPF policy is applied to cascade sequential schedules for each subgraph. The recursion terminates when a path graph is reached. Figure 5 shows an example. By applying the HRPF policy, the temporary $t_4$ carrying the result of subtree $T_1$ does not increase the register pressure in the final result.
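The recursion described above might be sketched as follows. This is an illustrative reading of the HRPF policy rather than the authors' Algorithm 1: the `preds` map, the function name `hrpf`, and the example tree are assumptions, operations have one-cycle latency, and the pressure of a path graph is taken to be 1 (its single result-carrying temporary). The pressure of the combined sequence is tracked with the expression $\max_j \{RP^*(T_j) + (j-1)\}$ used in the exactness proof.

```python
from typing import Dict, List, Tuple

def hrpf(preds: Dict[str, List[str]], root: str) -> Tuple[List[str], int]:
    """Return (sequence, register pressure) for the tree rooted at `root`.

    preds maps an operation to its immediate ancestors (its operand producers).
    """
    # Backtrack from the root to the split point u; the path u..root is T_R.
    tail = [root]
    u = root
    while len(preds.get(u, [])) == 1:
        u = preds[u][0]
        tail.append(u)
    tail.reverse()  # execution order: u first, root last

    branches = preds.get(u, [])
    if not branches:
        # Path graph: the topological order is unique.
        return tail, 1

    # Sequence each branch recursively, then apply High Register Pressure
    # First: branches are emitted in nonincreasing order of their pressure.
    results = [hrpf(preds, b) for b in branches]
    results.sort(key=lambda r: r[1], reverse=True)

    sequence: List[str] = []
    pressure = 0
    for j, (branch_seq, branch_rp) in enumerate(results):
        sequence.extend(branch_seq)
        # Each earlier branch keeps one result-carrying temporary alive.
        pressure = max(pressure, branch_rp + j)
    return sequence + tail, pressure

# Hypothetical unbalanced tree: A and B feed R; C and D feed B.
preds = {"R": ["A", "B"], "B": ["C", "D"]}
order, rp = hrpf(preds, "R")
```

In this example the heavier branch rooted at B is scheduled first, so the result of A never overlaps with the two operands of B. Scheduling A first would raise the pressure from 2 to 3.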
Exactness of the HRPF Algorithm
We now prove that our proposed algorithm generates an optimal sequence for any tree T when each operation has a latency of one cycle.
LEMMA 4.1. Let $Q$ be any feasible sequence for a tree-structured DDG $T$ and $T'$ be a subtree of $T$. For any time point $\tau \in R_Q(T')$, except for the first cycle of $R_Q(T')$, there exists a temporary $t_j$ defined in $T'$ that is alive at time point $\tau$.
PROOF. Let $I_1$ be the first operation of $T'$ appearing in $Q$ and $I_n$ be the root of $T'$. Vertex $I_1$ is a source node of $T'$, and $I_n$ is the last operation in $SS_Q(T')$. There exists a path $\gamma$ from $I_1$ to $I_n$ in $T'$ that consists of, say, the sequence of vertices $\{I_1, I_2, \ldots, I_n\}$. For any time point $\tau \in R_Q(T')$, except for the first cycle of $R_Q(T')$, there exists a vertex [...]
ALGORITHM 1: HRPF Sequencing
[...]
Let Q be the unique topological order of T; Return Q;
end if
Backtracking from root(T) to find the first vertex u with multiple immediate ancestors;
[...]
Construct sequence Q_j for each branch T_j using Algorithm HRPF Sequencing;
end for
Let QS = {Q_1, Q_2, ..., Q_n}; // exclude Q_R for T_R that contains root(T)
while QS is not empty do
[...]
PROOF. Pick an arbitrary branch $T_j$ and the time point $\tau \in R_Q(T_j)$ with the maximum number of alive temporaries in $T_j$. We assert that each branch $T_i$ with $i < j$ contributes an additional temporary $t$ that is alive at time $\tau$. Consider all possible cases on the intersection of $R_Q(T_i)$ and $R_Q(T_j)$ with respect to $\tau$. (Figure 6 depicts all the possible cases.) In the first case (Figure 6(a)), $\tau$ falls before the end of $T_i$. Lemma 4.1 states that there is at least one additional temporary $t_a$ defined by $T_i$ that is alive at time $\tau$. In the second case (Figure 6(b)), $\tau$ falls after the end of $T_i$, and the temporary $t_b$ carrying $T_i$'s result to $T_R$ is alive at time $\tau$. Considering all branches $T_i$ appearing before $T_j$, we have $RP(Q) \ge RP(SS_Q(T_j)) + (j - 1)$, and this lemma follows.
We draw the relation of optimal register pressures between a tree and its subtree. PROOF. Let $Q^*$ be an optimal sequence for $T$ and consider an arbitrary branch $T_j$. In the case where the appearance order of $T_j$ is not less than $j$, Lemma 4.2 implies that $RP(Q^*) \ge RP^*(T_j) + (j - 1)$. Consider the case in which the appearance order of $T_j$ is less than $j$. There exists a branch $T_i$, with $i < j$ (and hence $RP^*(T_i) \ge RP^*(T_j)$), such that the appearance order of $T_i$ is not less than $j$. Lemma 4.2 implies that $RP(Q^*) \ge RP^*(T_i) + (j - 1) \ge RP^*(T_j) + (j - 1)$. Hence, $RP^*(T) = RP(Q^*) \ge RP^*(T_j) + (j - 1)$ for each branch $T_j$, and this lemma follows.
Finally, we prove that the HRPF sequencing algorithm is exact. PROOF. An optimal sequence is obtained for a path graph because there is only one choice. For a tree $T$ with decomposition $DQ(T) = (\{T_1, T_2, \ldots, T_n\}, T_R)$, we prove the theorem by induction on the number of decompositions needed to reduce $T$ to path graphs. Suppose that the algorithm obtains an optimal sequence $Q_j^*$ for each branch $T_j$. Sort the parallel branches in nonincreasing order of the optimal register pressure and consider the sequence $Q$ generated by the HRPF algorithm. Note that the sorted order of the branches $T_j$ is also the appearance order of all branches in $Q$. In sequence $Q$, each branch preceding $T_j$ contributes one result-carrying temporary that is alive throughout $R_Q(T_j)$. Hence, we have $RP(Q) = \max_{1 \le j \le n} \{RP^*(T_j) + (j - 1)\}$, and Lemma 4.3 states the optimality.
THE DCCS ALGORITHM FOR SCHEDULING PARALLEL OPERATIONS
On the basis of the HRPF sequencing algorithm for a single execution slot, we propose energy-aware instruction scheduling for parallel execution slots. The top-level framework is shown in Algorithm 2, which executes the DCCS algorithm multiple times to select a schedule that meets the deadline with minimum parallelism. The core of the scheduling is the DCCS algorithm for fixed parallelism. Algorithm 3 gives the framework: a clustering-and-allocation approach. A cluster $S_i$ is a tree-structured subgraph of the DDG to be scheduled in a single execution slot using the HRPF sequencing algorithm. Data dependencies among clusters are represented by a Subgraph Dependence Graph (SDG). The first stage, the clustering stage, gives an initial clustering on the DDG. Intended to reduce accesses to the SRF, this stage tries to maximize the cluster size. The second stage, the partitioning-to-fit stage, adjusts the clustering granularity by imposing the critical path deadline CPD over the SDG. We estimate CPD = (total execution time of all operations)/M, where M is the number of active execution slots, that is, the hardware parallelism. As the parallelism M is increased in the multipass algorithm, CPD is reduced, which forces the parallel portions in the SDG to increase with finer clustering granularity. Finally, the allocation stage allocates nodes of the SDG onto parallel execution slots.
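The multipass framework can be sketched as a loop over increasing hardware parallelism M that stops at the first schedule meeting the deadline. The sketch below is an assumption-laden outline, not Algorithm 2 itself: `dccs_fixed` is a hypothetical callback standing in for the fixed-parallelism DCCS pass and is assumed to return the makespan of the schedule it builds, and `toy_scheduler` is a made-up stand-in used only to exercise the loop.

```python
import math
from typing import Callable, Optional, Tuple

def select_min_parallelism(
    total_exec_time: float,
    deadline: float,
    max_slots: int,
    dccs_fixed: Callable[[int, float], float],
) -> Optional[Tuple[int, float]]:
    """Return (M, makespan) for the smallest M that meets the deadline."""
    for m in range(1, max_slots + 1):
        # Critical path deadline shrinks as more slots are activated.
        cpd = total_exec_time / m
        makespan = dccs_fixed(m, cpd)
        if makespan <= deadline:
            return m, makespan
    return None  # the deadline cannot be met with the available slots

def toy_scheduler(m: int, cpd: float) -> float:
    # Hypothetical stand-in: pretend the schedule overshoots CPD by one cycle.
    return math.ceil(cpd) + 1

feasible = select_min_parallelism(12, 5, 4, toy_scheduler)
infeasible = select_min_parallelism(12, 2, 4, toy_scheduler)
```

Trying the smallest M first reflects the first scheduling guideline: the fewest execution slots that still meet the deadline are kept powered.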
Clustering Stage
The clustering stage groups operations and builds the SDG. Following the clustering stage, each cluster $S_i$ becomes a tree-structured subgraph in the DDG. Moreover, the clustering stage also identifies clusters that can be executed in parallel to utilize multiple execution slots. Data dependencies among clusters are represented by the SDG: a vertex represents a cluster $S_i$, and an edge $(S_i, S_j)$ indicates that the output temporary of $S_i$ is read by some operation in $S_j$. [...] and $I_{17}$, respectively, after finding they are not merged with their descendants. As a result, the clustering stage establishes sequential execution clusters and cluster sets ($\{S_1, S_2, S_3\}$ and $\{S_4, S_5\}$) that may be executed in parallel.
The Partitioning-to-Fit Stage
This stage refines the clustering at Stage 1 to meet the deadline constraint. Recall that a cluster is allocated in one execution slot. The clustering reduces the software parallelism, and oversized clusters may cause the execution deadline to be missed. This stage partitions oversized clusters to increase software parallelism according to the CPD determined from the hardware parallelism.
We examine the critical path of the SDG to refine the clustering. Each node and each edge in the SDG is associated with a weight to indicate the execution time. The weight of an edge is the latency for the producer operation to generate the underlying temporary. The weight of a node $S_i$, denoted $ET(S_i)$, is the makespan obtained from applying the HRPF algorithm to schedule the cluster $S_i$. The length of a path $P$, denoted $ET_P$, is the total weight of all nodes and edges in $P$. The quantity $ET_P$ indicates the execution time lower bound to execute operations in $P$ for any feasible schedule of the SDG. The critical path of the SDG, denoted CP, is the longest path in the SDG. The partitioning criteria are set to keep the length of the critical path within the required deadline CPD.
ALGORITHM 6: Allocation
[...]
Establish sequential schedule IS for the cluster S_j using HRPF Sequencing algorithm;
Find the earliest available cycle τ satisfying all dependencies for S_j;
[...]
Compute affinity AF_i: the number of DDG edges connecting operations in Slot_i to operations in S_j;
end for
Select Slot_k: the slot with maximum affinity that is empty after cycle τ;
Update SC by scheduling operations in S_j onto Slot_k in the order of IS and starting from cycle τ;
end for
Return schedule chart SC;
Algorithm 5 outlines the partitioning algorithm. The algorithm iteratively refines the SDG to meet the deadline CPD. At each iteration, the node $S_k$ with the highest sequential execution time in CP is decomposed to reduce the length of the critical path. The selected node $S_k$ represents a cluster of tree-structured DDGs, and we use the same decomposition scheme presented in Section 4.1. The execution time is reduced by overlapping the execution of parallel branches.
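The critical path test that drives the partitioning loop can be sketched as a longest-path computation over the weighted SDG. The sketch assumes the SDG is acyclic and uses the standard-library `graphlib.TopologicalSorter` (Python 3.9 or later); the function name `critical_path`, the cluster names, and all weights are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Optional, Tuple

def critical_path(node_weight: Dict[str, int],
                  edge_weight: Dict[Tuple[str, str], int]
                  ) -> Tuple[int, List[str]]:
    """Return (length, nodes) of the longest path in a weighted DAG.

    node_weight : ET(S_i) for every SDG node
    edge_weight : latency attached to every SDG edge (producer, consumer)
    """
    preds: Dict[str, List[str]] = {n: [] for n in node_weight}
    for (u, v) in edge_weight:
        preds[v].append(u)

    best: Dict[str, int] = {}
    back: Dict[str, Optional[str]] = {}
    # static_order() yields every node after all of its predecessors.
    for n in TopologicalSorter(preds).static_order():
        best[n], back[n] = node_weight[n], None
        for u in preds[n]:
            candidate = best[u] + edge_weight[(u, n)] + node_weight[n]
            if candidate > best[n]:
                best[n], back[n] = candidate, u

    end: Optional[str] = max(best, key=best.get)
    length = best[end]
    path: List[str] = []
    while end is not None:
        path.append(end)
        end = back[end]
    return length, path[::-1]

# Hypothetical SDG: S1 and S2 feed S3, which feeds S4.
node_weight = {"S1": 4, "S2": 2, "S3": 3, "S4": 1}
edge_weight = {("S1", "S3"): 1, ("S2", "S3"): 1, ("S3", "S4"): 1}

length, cp = critical_path(node_weight, edge_weight)
# Candidate for decomposition: the heaviest node on the critical path.
split_candidate = max(cp, key=node_weight.get)
```

If `length` exceeds CPD, the partitioning stage would decompose `split_candidate` and recompute the critical path, repeating until the deadline is met.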
Allocation Stage
Finally, we complete the scheduling with the allocation stage. This stage allocates nodes in the SDG to a schedule chart SC, where a node is a tree-structured subgraph that is allocated in one execution slot. Allocation guidelines are (1) to build an ASAP schedule that tries to shorten the makespan of the whole SDG, (2) to reduce the number of cross-slot dependencies that result in operand accesses through the high-powered SRF, and (3) to reduce the register pressure resulting from the schedule SC.
Algorithm 6 gives an outline of the allocation algorithm. A topological order SS over nodes in SDG is established by depth-first traversal from sink nodes by treating each edge in reverse direction [Cormen et al. 2009]. To select the next node to append to SS, we break the tie using the HRPF policy. Each cluster $S_j$ is then allocated to a row of empty positions in the schedule chart SC following the ASAP guideline: a cluster starts from the earliest available cycle τ that satisfies all dependence constraints. In cases where there are multiple empty slots at cycle τ, we break the tie using the affinity measure $AF_i$, which counts DDG edges connecting the cluster to an execution slot. Note that the depth-first-style topological order also helps to reduce cross-slot dependencies. With restriction on hardware parallelism, some tree-structured portion of the SDG will be allocated in a single execution slot. Figure 8 illustrates how the partition-to-fit and allocation stages work. After the clustering in Figure 7, instruction scheduling is now reduced to scheduling the SDG in Figure 8.
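The allocation loop described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes that clusters are already given in topological order, that each cluster occupies its slot for a contiguous run of cycles, and that a consumer cluster may start only after every producer cluster has finished. The function name `allocate` and the data layout are hypothetical.

```python
def allocate(order, length, ddg_edges, num_slots):
    """Place clusters on execution slots in ASAP fashion.

    order:      clusters in topological order (SS)
    length:     dict cluster -> cycles of its sequential schedule (IS)
    ddg_edges:  dict (producer, consumer) -> number of DDG edges between them
    num_slots:  available hardware parallelism
    Returns dict cluster -> (slot, start_cycle).
    """
    slot_free = [0] * num_slots  # first free cycle of each slot
    placed = {}
    for c in order:
        preds = [(p, n) for (p, q), n in ddg_edges.items() if q == c]
        # Earliest cycle at which all producers have finished (assumption:
        # whole-cluster granularity, not per-operation readiness).
        dep_ready = max((placed[p][1] + length[p] for p, _ in preds), default=0)
        # tau must also be a cycle at which at least one slot is empty.
        tau = max(dep_ready, min(slot_free))
        # Affinity AF_i: DDG edges from operations already in slot i.
        affinity = [0] * num_slots
        for p, n in preds:
            affinity[placed[p][0]] += n
        candidates = [i for i in range(num_slots) if slot_free[i] <= tau]
        # Highest affinity wins; ties go to the lowest slot index.
        k = max(candidates, key=lambda i: (affinity[i], -i))
        placed[c] = (k, tau)
        slot_free[k] = tau + length[c]
    return placed


# Hypothetical SDG: A and B feed C, and C feeds D.
order = ["A", "B", "C", "D"]
length = {"A": 3, "B": 2, "C": 2, "D": 1}
ddg_edges = {("A", "C"): 2, ("B", "C"): 1, ("C", "D"): 1}

chart = allocate(order, length, ddg_edges, num_slots=2)
```

Here C follows A onto slot 0 because two of its three incoming DDG edges originate there, so only the single edge from B crosses slots.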
PRACTICAL USE OF THE DCCS ALGORITHM
The DCCS algorithm serves as the core theory for energy optimization and can cooperate with various existing technologies. To implement power-gating instructions, we may deploy Sleep Control Registers (SCR) to store power configurations of all power domains [Roy et al. 2009]. Energy optimization on register files is realized through a compiler platform with two-pass scheduling [Chang et al. 1995]. The DCCS algorithm serves as the prescheduling stage to set allocation constraints for register allocation. After register allocation, power-gating instructions are inserted in the postscheduling stage to ensure correct execution. In cooperation with window scheduling [Muchnick 1997], the DCCS algorithm can be utilized for software pipelining. In window scheduling, the DDG is partitioned into virtual pipeline stages for overlapped execution of loop iterations. The DCCS algorithm is applied to schedule each virtual stage for energy optimization. The required initiation interval for software pipelining now serves as the deadline constraint to the DCCS algorithm.
The DCCS algorithm can also be adapted to various architecture features. For operations with variable latencies, the allocation stage calculates the earliest available time to schedule a cluster considering the operation latencies. The earliest available time of an operation is the time that all source operands are ready, considering the latencies from all dependencies. The HRPF sequential scheduling also takes the operation latencies to build the timed schedule. Most of the VLIW-DSP cores have heterogeneous execution slots. To support heterogeneous slots, we modify the closing rules of the clustering stage such that a cluster of operations fits into some execution slot types. The allocation stage then affirms the feasibility of assigning a cluster to an execution slot.
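The earliest available time under variable latencies can be illustrated with a short sketch. The function name, the dictionary encoding of producers, and the example latencies are assumptions for illustration only.

```python
def earliest_available(op, producers, issue_cycle, latency):
    """Earliest cycle at which all source operands of `op` are ready.

    producers:   dict op -> list of operations producing its source operands
    issue_cycle: dict op -> cycle in which the producer was issued
    latency:     dict op -> cycles until the producer's result is available
    """
    return max(
        (issue_cycle[p] + latency[p] for p in producers.get(op, [])),
        default=0,  # an operation without producers can start at cycle 0
    )


# Hypothetical example: a 3-cycle load and a 1-cycle multiply feed an add.
producers = {"add": ["ld", "mul"]}
issue_cycle = {"ld": 0, "mul": 1}
latency = {"ld": 3, "mul": 1}

ready = earliest_available("add", producers, issue_cycle, latency)
```

The add becomes available at cycle 3, determined by the long-latency load rather than by the later-issued multiply.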
Fig. 9. Flow of the evaluation process.

Special care should be taken with long-latency operations, especially memory operations. The DCCS algorithm is designed to optimize for short-latency operations, typically operations in integer arithmetic units. In Section 4.3, the assumption that each operation has one-cycle latency is necessary for the HRPF algorithm to obtain an optimal solution. With long-latency operations, one can build counterexamples to Lemmas 4.2 and 4.3 and show that a better schedule is obtained by interleaving operations from two different branches. The optimization quality of the DCCS algorithm degrades when there are long-latency operations. Most DSP processors have only integer units, and real-number arithmetic is realized with fixed-point encoding [Freescale Semiconductor 2008; Analog Devices 2010; Texas Instruments 2005]. Long-latency operations are primarily memory operations executed on dedicated load/store slots with internal ALUs for address calculation. Similar to the adaptation for heterogeneous execution slots, we suggest modifying the closing rule of Algorithm 4 such that a cluster is closed immediately whenever a memory operation is included. This makes the memory operation the last operation in the cluster, and the HRPF algorithm still works on a sequence of short-latency integer operations. Moreover, the modified closing rule enables the allocation stage to utilize the delay slots between the underlying memory operation and the operations consuming the memory operand.
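The effect of the modified closing rule can be illustrated on a linear operation sequence. This sketch shows only the additional condition (close the cluster as soon as a memory operation is included); the other closing rules of Algorithm 4 and the tree-structured DDG are deliberately left out, and the operation names and the `is_memory` predicate are hypothetical.

```python
def close_on_memory_ops(ops, is_memory):
    """Split a sequence of operations into clusters.

    A cluster is closed immediately after a memory operation is added,
    so every memory operation ends up last in its cluster.
    """
    clusters, current = [], []
    for op in ops:
        current.append(op)
        if is_memory(op):
            clusters.append(current)
            current = []
    if current:  # trailing short-latency operations form the final cluster
        clusters.append(current)
    return clusters


def is_memory(op):
    # Assumed naming convention for this example only.
    return op.startswith(("load", "store"))


ops = ["add1", "mul1", "load1", "add2", "store1", "sub1"]
clusters = close_on_memory_ops(ops, is_memory)
```

Each resulting cluster is a run of short-latency operations terminated by at most one memory operation, which is the shape the HRPF sequencing assumes.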
The proposed approach can also be applied to clustered hardware architectures [Terechko and Corporaal 2007], with the addition of LRFs in each execution slot and the deployment of power gating. Instruction scheduling to support a clustered architecture has two phases [Trimaran 2007]. The first phase partitions the DDG into subgraphs, where each subgraph is assigned to a hardware cluster. The second phase schedules each subgraph onto parallel execution slots connected to a shared register file in a cluster. The DCCS algorithm serves as the scheduling algorithm for the second phase and can cooperate with various cluster assignment algorithms [Terechko and Corporaal 2007] in the first phase.
EVALUATION OF THE DCCS ALGORITHM
This section presents the evaluation of our approach in terms of power efficiency. That the static power consumed by the functional units scales in proportion to the parallelism is trivial, so our evaluation focuses on the power dissipated on register files.
We also compare the performance of the DCCS algorithm to Cai's algorithm [Cai et al. 2008]. Whereas our algorithm adjusts the parallelism and clustering granularity according to the intended performance, Cai's algorithm clusters the DDG to reduce the communication cost without affecting parallelism exploitation. The comparison indicates how we can balance parallelism exploitation against power dissipation.

Figure 9 shows the flow of the evaluation process. We utilized benchmark programs from the MiBench [Guthaus et al. 2001] and DSPstone [Zivojnovic et al. 1994] suites. The LLVM compiler [Lattner and Adve 2004] was used to generate the DDG and the temporaries set. Code optimization worked on the LLVM IR, and loop unrolling was performed to exploit parallelism. The DCCS algorithm was implemented to build optimized schedule charts. The schedules were fed to the register allocator using the Chaitin graph coloring algorithm [Chaitin 1982]. The register allocator allocated as many local temporaries as possible on the LRFs. In cases where an LRF did not have sufficient space, local temporaries with the lowest access counts were sacrificed. Temporaries not allocated to LRFs were allocated to SRF, also using the Chaitin algorithm. We estimated register file power from the compiler outcome using the CACTI power model [Thoziyoor et al. 2008] under a 45nm process.
Evaluation Scheme
Although the DCCS algorithm can cooperate with various global optimization technologies, the evaluation isolates the effect of the DCCS algorithm itself, without any other global optimization. Each benchmark program is compiled multiple times, each time with a different hardware parallelism. In each compilation, a fixed parallelism is imposed on the whole program, and the DCCS algorithm schedules all basic blocks with the same hardware parallelism. This scheme is the simplest way to use the DCCS algorithm: compile the application program multiple times and then select an appropriate parallelism to execute the whole program.
We conducted the evaluation using an eight-issue VLIW architecture. The SRF had 24 ports (16 write ports and eight read ports to support eight-issue parallelism) and contained 32 registers with a 32-bit width. Each LRF had three ports (two write ports and one read port) and contained 16 registers. We examined several bank-size selections for power gating on SRFs as well as on LRFs and considered power-gating overhead from isolation cells.
7.1.1. Modeling the Power-Gating Overhead. Figure 10 shows the overhead associated with a power-gated register file. In Figure 10(a), storage elements are partitioned into banks in separate power domains, and access buses are routed to each bank. An isolation cell, which goes from a power-gated domain to an always-on domain, is inserted on each net of the read data buses. Each bank contains address decoders and an array of storage cells surrounded by word-lines and bit-lines, as shown in Figure 10(b). We estimated the chip area and power dissipation overhead coming from the isolation cells.
The overhead is approximated by cell count. In Won et al. [2003], a data-holding circuit is used as an isolation cell. We approximate the power dissipated on an isolation cell as the power of a 1-bit storage element. Assume that the architecture in Figure 10 is realized by the following types of cells: (1) NOT-gate, (2) 4-input AND-gate, (3) 4-input OR-gate, (4) isolation cell, (5) transmission gate (pass transistors), and (6) storage cell of 1-bit data. We estimate the ratio of isolation cells in the total cell count as the measure of chip area overhead.

Table II shows the evaluated bank size configurations and the chip area overhead. Configuration code (m, n) signifies an architecture in which a bank of LRFs has m registers and a bank of SRFs has n registers. A special configuration, coded "(1,1)-Ideal," indicates the ideal power saving effect achievable if VLSI researchers keep reducing the power-gating overhead. This configuration represents an architecture in which each register is power gated individually without any overhead.

Table III (excerpt).
$SA(m)$: Ratio of register accesses directed to SRF
Calculated power dissipation
$P_{st}(m)$: Static power dissipated on all register files (when IPC is m)
$P_{dyn}(m)$: Dynamic power dissipated on all register files (when IPC is m)
$P_{st,iso}$: Static power dissipated on all isolation cells
$P_{dyn,iso}(m)$: Dynamic power dissipated on all isolation cells (when IPC is m)
$P_{st,SRF}$: Static power of the register file for the SRF-only architecture without power gating
$P_{dyn,SRF}$: Dynamic power of the register file for the SRF-only architecture without power gating

7.1.2. Power Model Used in the Evaluation. We computed power dissipation using the compiler outcome and the CACTI power model. Table III shows all the related parameters. CACTI is an SRAM model and provides the (1) static power, (2) dynamic energy per access, and (3) access delay of an SRAM block.
The SRAM configuration includes the (1) manufacturing process, (2) number of access ports, (3) number of storage cells, and (4) width per access (equivalent to register width). A bank of PGRF is modeled as an SRAM block, and power parameters $e_{rw}$, $e_{st}$, $\hat{e}_{rw}$, and $\hat{e}_{st}$ are obtained. For our proposed architecture, the power formulas with the isolation cells overhead considered are:
where $P_{st,iso}$ is approximated using the static power of a 1-port SRAM block with an equal number of storage cells and $P_{dyn,iso}(m)$ is approximated using a 1-port single-register SRAM block as the energy-per-transfer of an isolation cell. We also computed the power dissipation of the SRF-only architecture to compare with the traditional design. Detailed formulations of $P_{st,SRF}$ and $P_{dyn,SRF}$ are omitted for space constraints. Note that the evaluation overestimates the energy overhead on isolation cells: energy from decoding logic, which is not consumed by isolation cells, is included in our approximation scheme. Even with the overestimated overhead, our result still shows significant power savings from the power-gated architecture. Experiment reports on validating the power model can be found on our website [DCCS 2013].
Code Optimization Effect from the Compiler
To understand the effect the compiler has on power efficiency, we investigated how register file usage scales with parallelism and cross-referenced it with the speedup.

7.2.1. The Speedup. Figure 11 shows the speedup scaling with parallelism m. The benchmark programs are classified into three classes: (1) the high-parallelism class (Figure 11(a)), which has almost linear speedup up to the maximum parallelism; (2) the middle-parallelism class (Figure 11(b)), which gets saturated above m = 7; and (3) the low-parallelism class for remaining benchmark programs. Figure 12 shows the speedup achieved by Cai's algorithm. Both algorithms achieve comparable speedup on most of the benchmark programs. Cai's algorithm achieves a better speedup for a set of benchmark programs in the low-parallelism class.
7.2.2. SRF Access Ratio. Figure 13 shows, for the DCCS algorithm, the ratio of total register accesses sent to the SRF. The DCCS algorithm directs more than 40% of register accesses to low-powered LRFs in order to conserve power. High- and middle-parallelism benchmark programs can be divided into two sets: the first set has an $SA(m)$ that scales up with parallelism, and the second set has a low and saturated $SA(m)$. The benchmark program "13_fir2dim" exemplifies the first set, and the behavior matched our expectations: the dynamic power scaled up with increased parallelism. The second set of benchmark programs behaves differently: the SRF access ratio is low and virtually fixed. The benchmark "03_susan_smoothing" exemplifies this set. Note that each benchmark program in the high- and middle-parallelism classes obtains speedup increments with increased parallelism. The saturated $SA(m)$ indicates that the DCCS algorithm has a good clustering effect, parallelizing the application without inducing many cross-slot dependencies. For the second set, we pay only the additional energy cost on functional units to gain better performance.

Figure 14 shows the SRF access ratio for Cai's algorithm. Compared to Cai's algorithm, the DCCS algorithm lowers the SRF access ratio by at least 5%, except for "23_pbmsrch," where both algorithms achieve very low SRF access ratios. For 18 out of the 28 benchmark programs, the additional saving by the DCCS algorithm is more than 15%. Cross-referenced with the speedup, the DCCS algorithm provides a better balance between parallelism exploitation and reduction of cross-slot dependencies.
7.2.3. Register File Usage. Figure 15 illustrates how the number of active SRF registers scales with parallelism. The result shows the need for power gating of the register files: benchmark programs have diverse behaviors on register requirements. Although some benchmark programs (such as "11_convolution") use all 32 registers, other benchmark programs (such as "22_sha") use only a small portion of SRF. Power gating on register files is required in order for a single DSP core to keep good energy efficiency over diverse applications. The second observation is on how the static power will scale with parallelism. Similar to the observation of Figure 13, high- and middle-parallelism benchmarks are further classified into two sets: the set with SRF usage that scales up with parallelism and the set with saturated scaling curves. The classification indicates whether the static power of SRF will scale with increased parallelism or not. Compared to Figure 13, Figure 15 has fewer saturated curves. Scaling of static power is more sensitive than that of dynamic power.

Figure 16 shows the results for Cai's algorithm. Compared to Cai's algorithm, the DCCS algorithm reduces the register pressure by at least 20% for 19 out of the 28 benchmark programs. Furthermore, the DCCS algorithm is more effective on the high-parallelism class.

Figure 17 shows the scaling of LRF pressure. The LRF pressure refers to the maximum number of active LRF registers among all the activated execution slots. As expected, the required number of LRF registers per execution slot scales down as parallelism increases. With high-parallelism execution, an execution slot needs only four to eight local registers. Figure 18 shows the results for Cai's algorithm. For most of the benchmark programs, the DCCS algorithm has comparable LRF pressure. The curves for SRF access and usage imply that the DCCS algorithm allocates more temporaries on LRFs. The comparable LRF pressure comes from the effect of HRPF sequencing to reduce the register pressure.
Power Efficiency of Register Files
We also evaluated the power-scaling functions on static and dynamic power,

$$PSL_{st}(m) = \frac{P_{st}(m)}{P_{st,SRF}}, \qquad PSL_{dyn}(m) = \frac{P_{dyn}(m)}{P_{dyn,SRF}},$$

to justify the expected effects. The power-scaling functions present the power dissipation normalized to the traditional SRF-only architecture. We evaluated the configurations in Table II by considering overhead coming from isolation cells. The power scaling resulting from Cai's algorithm is also presented to compare the clustering effect.
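As a small illustration of this normalization, the sketch below divides per-parallelism power values by the SRF-only baseline and converts the result to a saving. The function names are hypothetical and the numbers are placeholders chosen for easy arithmetic; they are not measurements from the evaluation.

```python
def power_scaling(power_by_parallelism, srf_only_power):
    """Normalize register file power at each parallelism m to the SRF-only baseline."""
    if srf_only_power <= 0:
        raise ValueError("baseline power must be positive")
    return {m: p / srf_only_power for m, p in sorted(power_by_parallelism.items())}


def saving(normalized_power):
    """Fraction of power saved relative to the SRF-only architecture."""
    return 1.0 - normalized_power


# Placeholder values in arbitrary units, for illustration only.
p_st = {1: 2.0, 4: 3.0, 8: 5.0}  # static power at parallelism m
p_st_srf = 10.0                  # static power of the SRF-only baseline

psl_st = power_scaling(p_st, p_st_srf)
saving_at_max_parallelism = saving(psl_st[8])
```

A normalized value below 0.6 corresponds to a saving of more than 40%, which is how the scaling curves are read in the remainder of this section.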
Figures 19 through 21 present the static power scaling $PSL_{st}(m)$, classified into high-parallelism, middle-parallelism, and low-parallelism classes. The results show that our approach saves more than 40% of the static power compared to the SRF-only architecture. The results also suggest finer-grained power-domain partitioning on the register files: the most power-efficient configuration is (4, 4). The normalized static power of a benchmark is either saturated under 0.2 or scales up with parallelism. For most of the high- and middle-parallelism benchmarks, the overall trend is for the static power to scale up with increased parallelism. However, this trend is not monotonic: there are several local minima and local maxima in each power-scaling curve. The imperfect scaling implies that the compiler needs a precise power model considering both functional units and register files to discover proper parallelism.
In Figures 19 through 21, we also compared the static power dissipation to that of Cai's algorithm on the configuration (4, 4) (the curve titled "(4, 4)_Cai"), which also represents the best power saving for Cai's algorithm. For 7 out of the 28 benchmark programs, both the DCCS algorithm and Cai's algorithm had PSL values that were saturated under 0.2. The most obvious advantage of the DCCS algorithm appears with the benchmark programs "01_fft_float," "02_fht," "03_susan_smoothing," and "10_encrypt." For these benchmark programs, the DCCS algorithm causes the PSL value to become saturated under 0.2, whereas Cai's algorithm results in high and scaled-up power dissipation. For the remaining benchmark programs, the DCCS algorithm has a visibly lower PSL curve, approximately 5% to 20% below that of Cai's algorithm.
Figures 22 through 24 present the scaling curves of the normalized dynamic power. The overall trend is for the normalized dynamic power to either saturate under 0.6 or scale up with increased parallelism. The only exception occurs for "28_startup," which has saturated dynamic power over 0.8. The results show that dynamic power is not sensitive to power-domain (bank) partitioning. From the perspective of conservation of dynamic power, the design benefits from LRFs with reduced access ports. Compared to Cai's algorithm, the DCCS algorithm consumes less dynamic power. For 18 out of the 28 programs, the DCCS algorithm saves at least 10% of the dynamic power. 
CONCLUSION
This article investigated energy-proportional computing on power-gated VLIW architectures. Although previous researchers focused on power gating the functional units, our contribution saves the power dissipated on register files. The success relies on the DCCS algorithm for instruction scheduling to reduce the register pressure and cross-slot data transfer. The evaluation results confirm the expected effects as follows: (1) Compared to the traditional SRF-only architecture without power gating, the PGRF-VLIW architecture saves approximately 20% to 50% of register file power even with maximum parallelism. (2) As an overall trend, the register file power scales up with increased parallelism. The scaling range occupies approximately 20% to 35% of the dynamic power and 35% to 65% of the static power.
The most important finding is that the power dissipated on register files can be saved and scaled through instruction scheduling with parallelism adaptation. The compiler outcome shows the need to apply power gating on register files: application programs have diverse storage and parallelism requirements. Although this work establishes the architecture with theory on local code optimization, future work will enhance the compiler for practical use. Loop and global code optimizations, such as software pipelining, are planned for the next stage in our drive toward energy-proportional computing.
