The clock-frequency assignment problem has not been addressed in the design automation literature, to the best of our knowledge. [...] optimal algorithm to solve the problem, based on dynamic programming. We apply the algorithm to the particular context of
Introduction
Modern system-on-a-chip platforms support multiple clock domains on a single chip. A clock domain is a block of circuitry that operates at a single clock frequency that may differ from the frequency of other blocks on the same chip. In addition to reducing clock skew [...] be clocked at its maximum frequency rather than all modules being clocked at the maximum frequency of [...] shows a system with four modules having maximum frequencies of 100, 500, 1000, and 200 MHz. Communication across clock domains is a challenge, but has been aggressively researched recently, with established solutions (e.g., [4] [15]) and with pre-designed bridge blocks present in many system libraries.
Because circuitry to generate a unique clock frequency is not free, platforms impose a limit on the number of unique clock domains. For example, the Xilinx Virtex-II Pro FPGA (field-programmable gate array) has eight clock frequency synthesizers, able to generate frequencies between 24 MHz and 420 MHz via different clock multiplication and division factors [22]. If the number of modules having distinct maximum frequencies exceeds the limit on unique clock frequencies, then we define the clock-frequency assignment problem as assigning a frequency to each module such that a performance metric is optimized, subject to a limit on the number of unique frequencies. The example in Figure 1 shows a frequency assignment for four modules driven by only two available frequencies, chosen to be 100 and 500 MHz.
[...] those interfacing directly with external circuitry, may have hard real-time constraints that dictate a specific operating frequency. Other modules may have softer performance requirements, for which the clock-frequency assignment problem seeks to optimize a performance metric.
[...] [14] sought to create a set of heterogeneous general-purpose processor core microarchitectures with the goal to minimize the sum of the execution times of a set of benchmarks on each core.
A summation metric could be used to minimize the critical path of a task-level dataflow graph [12]. Such tasks may be mapped to two or more processing modules. When those processors are not system-level pipelined (mapping such tasks to multiple processors without pipelining is often done for hardware modularity purposes), the goal of minimizing the task graph's critical path becomes the goal of minimizing the sum of the critical tasks' execution times.
Another use of a summation metric is in accelerator-based hardware/software partitioning of sequential programs, illustrated in Figure 2. When microprocessor execution reaches a critical function, control switches from the microprocessor to a hardware accelerator (such as a hardware floating-point unit or graphics accelerator); after the accelerator completes, control switches back to the microprocessor. The goal of minimizing program execution time in this case is the same as minimizing the sum of the
microprocessor and accelerator execution times for the program.
Figure 4 suggests the idea of first sorting the accelerators by their maximum frequency, resulting in a1, a4, a2, and a3, and frequencies of 100, 200, 500, and 1000. Consider the case of F=1. In that case, the problem has only one solution: 100, 100, 100, 100. Consider instead the case of F=2. In that case, a1 [...] instead assign 100, then we again have a new sub-problem consisting of the remaining two accelerators, F=1, and the option of assigning 100. Noting the sub-problem structure in the problem, we investigated a dynamic programming solution.
$$\sum_{1 \le i \le N} \frac{\text{total cycles of group } i}{\text{min frequency of group } i} = \sum_{1 \le i \le N} \frac{\sum_{j} a_{ij}.\text{cycles}}{\min_{j}\left(a_{ij}.\text{freq}\right)}$$
where a_ij denotes the jth component in the ith group of a solution.
We note that if the maximum number of available clock frequencies F is greater than or equal to the number of accelerators M, the solution is trivial: we just assign each accelerator a frequency equal to that accelerator's maximum frequency. If F is less than M, some accelerators must be grouped to share a single clock frequency. We make three observations that allow us to formulate the dynamic programming algorithm.
[...] can assume that the frequencies of the given accelerators are all distinct. If not, we can simply combine those accelerators having identical frequencies into one single accelerator. The new accelerator will have the same frequency as that of the original accelerators, and its total cycles will equal the sum of the total [...]
[...] Otherwise, we could split any group consisting of more than two accelerators into two groups and the new solution will have a smaller total execution time, which again contradicts our assumption of the optimality of the original solution.
[...] have been pre-sorted in decreasing order of maximum frequency and that each maximum frequency is unique. Let X(A, C) equal the total execution time of the first A accelerators using the first C clock frequencies. We define the following recurrence relation as a function: [...]
If A=0, there are no accelerators, and thus the execution time is 0. If C=0, there are no clock frequencies available, so execution time is infinite. We intentionally define X to return 0 for X(0,0).
[...] moving down the column to cell X(2,1). That sub-problem has two accelerators a1 and a2 with maximum frequencies of 1000 and 500, but has only one available frequency. Thus, the only solution is to assign both accelerators a frequency of 500. Since the accelerators require 5 and 10 cycles, respectively, the total execution time of [...] reasonable solution assigns the maximum frequency to the accelerator, yielding 5/1000=0.005. Cell X(2,2) has two accelerators and two clock frequencies available, so the only reasonable solution assigns the maximum frequency to each accelerator, yielding 5/1000+10/500=0.025. The solution is the same as X(1,1)+10/500=0.025; we had an available frequency, so we assigned the present accelerator a2 its maximum frequency, and then used the best solution for X(1,1) for the previous accelerator a1. This cell hints how we can reuse prior sub-problem solutions in computing the present sub-problem's solution.
Cell X(3,2) reveals such reuse more fully. That sub-problem has three accelerators a1, a2, and a3 having maximum frequencies of 1000, 500, and 200, respectively, and has two available clock frequencies. Accelerator a3 must be assigned a frequency of 200, because it has the lowest maximum frequency of the three accelerators (recall that the accelerators were initially presorted according to their maximum frequency). To complete the sub-problem solution, we have two choices. We can assume that a3 is the only accelerator assigned 200, in which case the remainder of [...]
[...] accelerators). The algorithm ran in about 0.2 seconds.
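The table X(A, C) and the summation metric discussed above can be illustrated with a short Python sketch. The recurrence itself did not survive in this text, so the one used here is an assumption: a standard contiguous-grouping formulation that agrees with the stated base cases (X(0, C) = 0, X(A, 0) infinite for A > 0) and with the worked cells X(1,1) and X(2,2). It is not necessarily the authors' exact recurrence, and its cost (cubic in the worst case) may differ from theirs. The names `total_time` and `assign_clock_groups` and the `(max_freq, cycles)` tuple layout are hypothetical.

```python
from math import inf


def total_time(groups):
    """Summation metric: each group's total cycles divided by its lowest
    maximum frequency, summed over all groups."""
    return sum(
        sum(cycles for _, cycles in group) / min(freq for freq, _ in group)
        for group in groups
    )


def assign_clock_groups(accels, num_freqs):
    """accels: list of (max_freq, cycles); num_freqs: limit F on unique clocks.

    Returns (best_total_time, groups). Each group shares one clock, namely
    the lowest maximum frequency in that group.
    """
    # Pre-sort in decreasing order of maximum frequency, as the text assumes.
    accels = sorted(accels, key=lambda a: a[0], reverse=True)
    n = len(accels)

    # prefix[k] = total cycles of the first k accelerators.
    prefix = [0]
    for _, cycles in accels:
        prefix.append(prefix[-1] + cycles)

    # X[a][c]: best time for the first a accelerators with at most c clocks.
    X = [[inf] * (num_freqs + 1) for _ in range(n + 1)]
    split = [[0] * (num_freqs + 1) for _ in range(n + 1)]
    for c in range(num_freqs + 1):
        X[0][c] = 0.0  # no accelerators: execution time 0, including X(0,0)

    for a in range(1, n + 1):
        freq_a = accels[a - 1][0]  # lowest maximum frequency among the first a
        for c in range(1, num_freqs + 1):
            # Assumed recurrence: accelerators k+1..a share freq_a, and the
            # first k accelerators reuse the sub-problem X(k, c-1).
            for k in range(a):
                t = X[k][c - 1] + (prefix[a] - prefix[k]) / freq_a
                if t < X[a][c]:
                    X[a][c] = t
                    split[a][c] = k

    if X[n][num_freqs] == inf:
        return inf, []  # accelerators present but no clock available

    # Walk the recorded split points backwards to recover the groups.
    groups = []
    a, c = n, num_freqs
    while a > 0:
        k = split[a][c]
        groups.append(accels[k:a])
        a, c = k, c - 1
    groups.reverse()
    return X[n][num_freqs], groups
```

Calling `assign_clock_groups([(1000, 5), (500, 10)], 2)` corresponds to cell X(2,2) of the worked example (each accelerator at its own maximum frequency), and passing `1` as the limit corresponds to cell X(2,1), where both accelerators share 500. Evaluating `total_time` on the returned groups should give the same value as the table entry.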
We exercised the algorithm using synthetically-generated [...] Figure 7 provides results of applying the clock assignment on 10 synthetic benchmarks with varying numbers of accelerators ranging from 5 to 50. For each, we report speedups relative to execution time of the set of accelerators using only one clock frequency, which must necessarily be the lowest maximum frequency of the set. We considered available numbers of frequencies F of 3, 6, and 9. Figure 7 shows that partitioning an application among the available clock frequencies improves overall execution time by 1.5x-3x. Figure 8 shows algorithm runtimes for Figure 7.
[...] higher-level exploration tool that repeatedly applied our algorithm to determine the point of diminishing returns for number of available clock frequencies; such determination would be important for a system-level tool that allocates available clock frequencies to different sub-groups of modules. Figure 9 shows [...]
Related Work
[...] frequencies to reduce power of a single microprocessor. The clock domains included the front end (including L1 instruction cache), the integer units, the floating-point units, and the load-store units (including L1 data cache and L2 cache). Kumar [14] considered multiple frequencies for multiple heterogeneous microprocessors, where each microprocessor might be optimized for different application sets, each conceivably having its own clock frequency.
Some work has considered the similar topic of voltage islands. A voltage island is a sub-circuit operating at a different voltage, [...] concept of multiple voltage levels (and typically therefore multiple clock frequencies), namely the concept of voltage scaling of a single microprocessor. In such work, a microprocessor's voltage and clock may be dynamically adjusted to reduce power while still satisfying an application's performance constraints.
We are not aware of prior work on clock domains, voltage
