Abstract
Introduction
Manufacturing process variations have emerged as a major design concern for aggressively scaled technologies. These variations manifest themselves across a single die (WID or within-die) or across several dies (D2D or die-to-die). Furthermore, the source, and therefore the statistical nature, of these variations can be random or systematic, static or dynamic [4].
There is a significant body of work that analyzes the impact of process variations on static timing analysis [1, 3] at the circuit level. Such techniques are extremely useful in helping designers predict timing yield and in optimizing their designs at the circuit/gate level. There is, however, very little work that makes variability models available to system and microarchitecture level designers. In [11], the authors take a first step in that direction and introduce a framework that uses variability information from low level circuit analysis to build variability models for system level performance parameters, such as end-to-end system latency. Using this framework, they show that systems comprised of multiple voltage-frequency islands (VFI) are more likely to meet a specified latency constraint than their fully synchronous, single clock, single voltage (SSV) counterparts.

* This research was supported in part by Semiconductor Research Corporation contract no. 2005-HJ-1314.
VFI systems tend to outperform fully synchronous designs because they provide more flexibility in dealing with WID process variability. While the clock frequency of a fully synchronous design is limited by the slowest critical path on the entire chip, the clock frequency of each clock domain in a multiple VFI system is only limited by the slowest path in that particular domain. VFI systems are thereby able to isolate the impact of critical paths that have been negatively affected by process variations to the frequency domains in which these speed constraining paths lie.
In this work, we complement the analysis framework proposed in [11] and outline an algorithm to determine the throughput distribution of a multiple VFI system, given frequency distributions for each clock domain. Combined with results from [11], our proposed technique would allow designers to specify throughput and latency constraints for their designs and determine the percentage of manufactured chips that will meet these constraints.
Related Work
While statistical timing analysis has become a hot area of research over the past few years, the problem has, with the exception of [11], been addressed only at the gate/circuit level. Techniques have been proposed to deal with correlated gate delay distributions [3] and to provide stochastic bounds on worst-case circuit delay [4]. The work in [11] provides stochastic bounds on the end-to-end latency of a directed acyclic task graph implemented using multiple VFIs. Our work extends the scope of [11] by analyzing graphs that have cyclic paths and consequently determining the throughput distribution of the design.
In [9], the authors propose a globally asynchronous locally synchronous (GALS) architecture consisting of a number of processing units (PU), each implemented as a separate VFI. The authors note that intra-die process variations can cause the maximum clock frequency of each PU to shift by different margins. They propose a software based self-test scheme to run each PU at its optimal clock frequency. The work in [5] proposes a hardware based technique that uses shadow flip-flops to detect timing violations, thereby allowing the synchronous logic to operate close to or at its maximum clock speed. Both works focus on implementation issues, while our focus is on evaluating the benefits of such implementations. We note that while [5] uses a fully synchronous design, the technique described could as easily be used in a multiple VFI system to adapt the clock speed of each frequency island.
Paper Contributions
This work makes the following contributions:
• We consider the case of applications specified as cyclic task (or component) graphs implemented on multiple VFIs and derive distributions for the best case throughput (alternatively rate) under manufacturing process variations. Our technique offers significant speed-up over Monte Carlo based simulation at the expense of marginal loss in accuracy.
• In the process of determining system throughput, we describe, to the best of our knowledge, the first algorithm that solves the Maximum Cycle Mean (MCM) problem in a probabilistic setting.
• Using a case study, we demonstrate how our framework can be used to evaluate the trade-off between performance and clock domain granularity, and compare the performance of a multiple VFI design versus that of a fully synchronous design.
Before proceeding further, we discuss the assumptions we make about the hardware implementation of single and multiple VFI systems and introduce the mathematical notation that will be used in the rest of the paper.
Preliminaries and Assumptions
We consider the case of systems comprised of a number of synchronous cores, IPs or processing elements (PE). Henceforth, we will refer to all three generically as PEs. We now consider two cases: (1) A fully synchronous (or an SSV) system that has a single global clock that drives all the PEs. Communication between PEs is assumed to be point-to-point and synchronous. (2) A multiple VFI system in which each voltage-frequency island is controlled by an independent clock source. Each clock domain (or VFI) can have multiple PEs within it, all running locally synchronous to each other. PEs in different clock domains communicate via mixed-clock token ring FIFOs modified to support voltage level conversion if required [6].
For both VFI and SSV systems, we assume that the implementation supports fine grained frequency scaling, and that there exists either hardware or software based support, as discussed in [9] or [5], to allow each clock domain to run at or near its optimal clock frequency under the impact of process variations.
We model a system comprising a number of communicating PEs using a component graph, represented as a directed graph G(V, E). Vertices in a component graph represent PEs and edges represent control or data dependencies between vertices. Figure 1 shows an example of a component graph with five PEs. The graph on the left is clocked globally with a single clock and represents an SSV system, while the one on the right has two independent clocks that operate two separate VFIs. As in Figure 1, each clock domain can contain more than one PE.
Theoretical Formulation
Figure 1: (a) An SSV system with a single global clock source. (b) A multiple VFI system with two VFIs and more than one PE in each VFI.

Without any loss of generality, for a given component graph G(V, E), we make the following assumptions:
• The component graph G(V, E) of a system with n PEs comprises the set of nodes $V = \{1, 2, \ldots, n\}$ and the set of directed edges $E \subseteq V \times V$.
• Each node i, (1 ≤ i ≤ n), is characterized by the number of cycles Ci it takes to produce an output data token after all its input data dependencies are satisfied. For an IP implemented as a simple linear pipeline, for example, the number of cycles will be equal to the number of stages in that pipeline. We assume, as in [11] that the communication latency is lumped into the number of execution cycles, and that the architecture is partitioned to minimize inter-domain communication.
Note that, in general, the number of execution cycles for a PE can vary dynamically depending on the workload. In this work, we are not interested in modeling workload or application driven variability and therefore we restrict $C_i$ to be a fixed number.
• Each PE is characterized in terms of the probability density function (pdf) of its cycle time $T_i$, where the cycle time is defined as the inverse of the clock frequency for that PE. If the PE is an external IP, the pdf of cycle time could be provided by the IP vendor or it could be obtained using detailed circuit level statistical timing analysis (SSTA) assuming probability distributions for the underlying process parameters. We note that from an implementation perspective, the clock frequency of a PE is likely to be controlled in discrete steps. If that is the case, the pdf of $T_i$ will actually be a discrete distribution. While our approach is general and can handle both discrete and continuous distributions, we note that continuous distributions serve well to answer the what-if questions that frequently arise in system level design. The pdf of $T_i$ is represented as $f_{T_i}(t)$ and its corresponding cumulative distribution function (cdf) as $F_{T_i}(t)$, where

$$F_{T_i}(t) = \int_{-\infty}^{t} f_{T_i}(\tau)\, d\tau$$

• Given $C_i$ and $T_i$ for a PE, its execution latency $L_i = C_i T_i$ will also be a random variable. Since we have assumed $C_i$ to be a fixed number, the pdf for $L_i$ can be computed directly from the pdf of the cycle time. We will refer to the pdf of $L_i$ as $f_{L_i}(t)$ and its cdf as $F_{L_i}(t)$.
• We point out that the notation and assumptions described above hold if each PE lies in a separate VFI. In general, assume that there are p VFIs, where p ≤ n. If p < n, there will be at least one domain with more than one PE. The cycle time of VFI j is given as

$$T^{VFI}_j = \max_{i \in \mathrm{VFI}_j} T_i$$

where the maximum is taken over all PEs i that lie in VFI j. Though our proposed algorithm is presented for the case in which each PE lies in a separate VFI (to avoid notational complexity), these modifications make it equally valid for the case when there is more than one PE in a clock domain.
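As a worked step, the statement that the distribution of the latency follows directly from that of the cycle time can be made explicit with a change of variables. The derivation below only assumes that the cycle count $C_i$ is a positive constant, as stated above.

```latex
\begin{align*}
F_{L_i}(t) &= \Pr(C_i T_i \le t) = \Pr\!\left(T_i \le \frac{t}{C_i}\right) = F_{T_i}\!\left(\frac{t}{C_i}\right), \\
f_{L_i}(t) &= \frac{d}{dt} F_{L_i}(t) = \frac{1}{C_i}\, f_{T_i}\!\left(\frac{t}{C_i}\right).
\end{align*}
```

For example, if the cycle time is normally distributed with mean $\mu$ and standard deviation $\sigma$, the latency is normally distributed with mean $C_i \mu$ and standard deviation $C_i \sigma$.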
Having introduced the mathematical notations and assumptions, we will now outline our algorithm to compute the throughput of the component graph efficiently.
Throughput Analysis for VFI Systems
The throughput (or rate) of a component graph is restricted by the presence of cycles in the graph [7]. Cycles in component graphs can only be found within strongly connected components (SCC). An SCC is a maximal set of nodes in which it is possible to traverse from every node to every other node. While a graph can have more than one SCC, no two distinct SCCs can have a node in common. Furthermore, all SCCs in a graph can be found in time linear in the size of the graph [7]. It is, therefore, sufficient to individually compute the throughput constraining cycles of each SCC in a component graph to determine the system throughput, which will just be the minimum throughput across all SCCs. In the following discussion, therefore, we only discuss throughput analysis on a component graph that is strongly connected, and later show how graphs with multiple SCCs can be analyzed.
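The decomposition into SCCs described above can be sketched in a few lines. The following is a minimal illustration using Tarjan's algorithm, which is one standard linear-time choice and not necessarily the method used in [7]; the function name and the five-PE example graph are hypothetical.

```python
from typing import Dict, List


def strongly_connected_components(graph: Dict[int, List[int]]) -> List[List[int]]:
    """Tarjan's algorithm; `graph` maps each node to its list of successors."""
    index: Dict[int, int] = {}    # discovery order of each node
    lowlink: Dict[int, int] = {}  # smallest index reachable from the node
    stack: List[int] = []
    on_stack = set()
    sccs: List[List[int]] = []
    counter = 0

    def visit(v: int) -> None:
        nonlocal counter
        index[v] = lowlink[v] = counter
        counter += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, []):
            if w not in index:
                visit(w)
                lowlink[v] = min(lowlink[v], lowlink[w])
            elif w in on_stack:
                lowlink[v] = min(lowlink[v], index[w])
        if lowlink[v] == index[v]:
            # v is the root of an SCC: pop the whole component off the stack
            component = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.append(w)
                if w == v:
                    break
            sccs.append(sorted(component))

    for v in graph:
        if v not in index:
            visit(v)
    return sccs


# Hypothetical component graph: PEs 1-3 form one cycle, PEs 4-5 another.
example = {1: [2], 2: [3], 3: [1, 4], 4: [5], 5: [4]}
components = strongly_connected_components(example)
```

The recursive form is adequate for graphs of the size considered here (tens of PEs); very deep graphs would need an iterative variant.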
We start with a component graph G(V, E) with n nodes, as described in the previous Section. We make an additional assumption that G(V, E) is strongly connected. We note that if the graph is not strongly connected, we can run the proposed algorithm on each of its SCCs individually and take the statistical minimum of the resulting distributions from each SCC, as we will demonstrate in the final step of the proposed algorithm. Finally, we associate a weight w(u, v) with every edge e ∈ E that connects nodes u and v. The weight assigned to edge e is equal to the latency (as defined in the previous Section) of the source node of that edge. Specifically,

$$w(u, v) = L_u = C_u T_u$$
Since the latencies are random variables, the edge weights are also random variables. We can now compute the throughput for the graph by computing its maximum cycle mean (MCM) [7]. The cycle mean (CM) of a cycle C in G(V, E) is defined as the sum of the weights of the edges in the cycle divided by the number of edges in the cycle. The MCM can then be computed by determining the maximum value of the cycle mean over all cycles in the graph. The throughput for the graph is then the inverse of the MCM. Formally, if $\lambda^*$ is the throughput for G(V, E), then:

$$\frac{1}{\lambda^*} = \max_{C \in G} \frac{\sum_{(u,v) \in C} w(u, v)}{|C|}$$
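where |C| represents the number of edges in cycle C. In [7], the authors use Karp's algorithm [10] to compute the MCM. According to Karp's algorithm, the MCM ($\Delta^* = 1/\lambda^*$) is given as:

$$\Delta^* = \max_{v \in V} \; \min_{0 \le k \le n-1} \frac{D^n_v - D^k_v}{n - k} \quad (1)$$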
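where $D^k_v$ is defined as the maximum k-step distance between node v and an arbitrarily picked node s ∈ V in the graph. This can be computed by enumerating all paths between s and v that contain exactly k edges and picking the path that has the maximum sum of edge weights. The algorithm begins by setting $D^0_s = 0$ and $D^0_v = -\infty$ for all $v \ne s$, and then computes the remaining distances using the recurrence

$$D^k_v = \max_{(u,v) \in E} \left( D^{k-1}_u + w(u, v) \right), \quad 1 \le k \le n$$

To ensure that we keep track of correlations at all stages in the algorithm, we use a recently proposed SSTA technique that models correlations by representing all intermediate random variables as linear (or quadratic) functions of the input random variables, and uses a moment matching based propagation scheme [3]. For each random variable $T_i$, we introduce a new random variable $T'_i$ that is a normalized version of $T_i$. If $\mu_{T_i}$ is the mean of $T_i$ and $\sigma_{T_i}$ is its standard deviation, we can write $T'_i$ as:

$$T'_i = \frac{T_i - \mu_{T_i}}{\sigma_{T_i}}$$

The cdf of $T'_i$ can now be written in terms of $F_{T_i}(t)$ as:

$$F_{T'_i}(t) = F_{T_i}(\sigma_{T_i} t + \mu_{T_i})$$

Since the only random variables we take as input to our algorithm are the cycle times $T_i$, or equivalently $T'_i$, we would like to express the intermediate variables $D^k_v$ as linear functions of the $T'_i$. Substituting $w(u, v) = C_u(\mu_{T_u} + \sigma_{T_u} T'_u)$, we can now expand the recurrence relationship to get

$$D^k_v = \max_{(u,v) \in E} \left( a^{k-1}_{u,0} + \sum_{i=1}^{n} a^{k-1}_{u,i} T'_i + C_u \mu_{T_u} + C_u \sigma_{T_u} T'_u \right)$$

but we also know that:

$$D^k_v = a^k_{v,0} + \sum_{i=1}^{n} a^k_{v,i} T'_i$$

We now need to determine the coefficients $a^k_{v,i}$. Every operand of the max above is itself a linear combination of the $T'_i$, so for the max of two such operands A and B we want to write:

$$\max(A, B) \approx a_0 + \sum_{i=1}^{n} a_i T'_i$$

This can be accomplished by noting that, because the $T'_i$ have zero mean and unit variance and are assumed to be independent,

$$a_0 = E(\max(A, B))$$

and

$$a_i = E(T'_i \max(A, B)), \quad 1 \le i \le n$$

where E(X) represents the expectation of random variable X. Exact algebraic expressions for the terms $E(\max(A, B))$ and $E(T'_i \max(A, B))$ are provided in [3] and can be evaluated numerically. Having computed the coefficients for each $D^k_v$ for 1 ≤ k ≤ n and v ∈ V, we can now rewrite equation (1) as:

$$\Delta^* = \max_{v \in V} \; \min_{0 \le k \le n-1} \frac{1}{n - k} \left( (a^n_{v,0} - a^k_{v,0}) + \sum_{i=1}^{n} (a^n_{v,i} - a^k_{v,i}) T'_i \right)$$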
This equation again requires a series of max and min operations over inputs that are linear combinations of random variables, where, at each stage, we express the output as another linear combination over the same random variables. Even though we have only described in detail how this can be done using moment matching for the max operation, the expressions for the min operation can be derived in exactly the same fashion. Algorithm 1 is the formal description of our proposed technique and yields the desired coefficients $\delta_i$, where $0 \le i \le n$, that allow us to write:

$$\Delta^* = \delta_0 + \sum_{i=1}^{n} \delta_i T'_i$$
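We can now write the cdf of the random variable $\Delta^*$ as:

$$F_{\Delta^*}(t) = \left( F_{\delta_1 T'_1} * f_{\delta_2 T'_2} * \cdots * f_{\delta_n T'_n} \right)(t - \delta_0) \quad (5)$$

Therefore the cdf for the throughput of G, represented as $F_{\lambda^*}(\lambda)$, is given by:

$$F_{\lambda^*}(\lambda) = 1 - \left( F_{\delta_1 T'_1} * f_{\delta_2 T'_2} * \cdots * f_{\delta_n T'_n} \right)\!\left( \frac{1}{\lambda} - \delta_0 \right) \quad (6)$$

In equations (5) and (6), * represents the convolution operation, and $F_{\delta_i T'_i}$ and $f_{\delta_i T'_i}$ denote the cdf and pdf of the scaled random variable $\delta_i T'_i$.

Finally, the description above applies to a component graph that is an SCC. If there is more than one such SCC in the graph, we run the steps described above on each SCC individually and obtain the cdfs of the throughput for each SCC. These cdfs can then be combined using a simple statistical min operation to yield the final result. If $F_{\lambda^*_i}(\lambda)$ represents the cdf for the i-th SCC in the graph, we can write the cdf of throughput for the entire graph by taking the statistical minimum across the throughput distributions from each SCC. If X, Y and Z are some arbitrary random variables with X and Y independent, and Z = min(X, Y), the cdf of Z can be written in terms of the cdfs of X and Y as:

$$F_Z(z) = F_X(z) + F_Y(z) - F_X(z) F_Y(z)$$

We can therefore write, for a graph with m SCCs:

$$F_{\lambda^*}(\lambda) = 1 - \prod_{i=1}^{m} \left( 1 - F_{\lambda^*_i}(\lambda) \right) \quad (7)$$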
Equations (2), (3), (5), (6) and (7) can be computed efficiently using the techniques outlined in [1] . This completes the description of the proposed algorithm. We note that the time complexity of the algorithm is O(p|V ||E|), since Karp's MCM is itself O(|V ||E|) [7] , and we replace the max function in Karp's MCM with the computation of p coefficients (the description in this section assumes p = n, but in the general case p ≤ n).
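For reference, the deterministic Karp recurrence that the statistical algorithm builds on can be sketched as follows. This is a minimal illustration for fixed (non-random) latencies, using the maximum-cycle-mean form of Karp's algorithm; the function name, the zero-based node numbering and the three-PE example are assumptions made for the sketch.

```python
from typing import List, Tuple

NEG_INF = float("-inf")


def max_cycle_mean(n: int, edges: List[Tuple[int, int, float]]) -> float:
    """Karp's algorithm on a strongly connected graph with nodes 0..n-1.

    `edges` holds (source, target, weight) triples.
    """
    # D[k][v]: maximum weight of a walk with exactly k edges from node 0 to v
    D = [[NEG_INF] * n for _ in range(n + 1)]
    D[0][0] = 0.0
    for k in range(1, n + 1):
        for u, v, w in edges:
            if D[k - 1][u] > NEG_INF:
                D[k][v] = max(D[k][v], D[k - 1][u] + w)

    best = NEG_INF
    for v in range(n):
        if D[n][v] == NEG_INF:
            continue  # no n-edge walk ends at v
        # Walk lengths k with no k-edge walk to v cannot attain the min
        worst = min((D[n][v] - D[k][v]) / (n - k)
                    for k in range(n) if D[k][v] > NEG_INF)
        best = max(best, worst)
    return best


# Hypothetical three-PE graph; each edge weight is the latency of its source.
latency = [4.0, 2.0, 6.0]
arcs = [(0, 1), (1, 2), (2, 0), (1, 0)]
edges = [(u, v, latency[u]) for u, v in arcs]
mcm = max_cycle_mean(3, edges)   # cycles: 0-1-0 (mean 3) and 0-1-2-0 (mean 4)
throughput = 1.0 / mcm
```

The two nested loops over k and over the edges give the O(|V||E|) cost quoted above; the statistical version replaces each scalar max or min with an update of p coefficients.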
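Algorithm 1 Statistical MCM
Inputs: Number of cycles for each PE ($C_i$), cdfs of cycle time random variables ($T_i$, $T'_i$).
Output: Coefficients $\delta_i$, $0 \le i \le n$, of $\Delta^*$.
Pick a source node s; set $D^0_s = 0$ and $D^0_v = -\infty$ for all $v \ne s$
for k = 1 to n do
  for each v ∈ V do
    $D^k_v$ ← max over edges (u, v) ∈ E of $D^{k-1}_u + w(u, v)$, taken two operands at a time
    (coefficients of each max(A, B) obtained from $E(\max(A, B))$ and $E(T'_i \max(A, B))$)
  end for
end for
for each v ∈ V do
  $M_v$ ← min over 0 ≤ k ≤ n − 1 of $(D^n_v - D^k_v)/(n - k)$, taken two operands at a time
  (coefficients of each min(C, D) obtained from $E(\min(C, D))$ and $E(T'_i \min(C, D))$)
end for
$\Delta^*$ ← max over v ∈ V of $M_v$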
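The moment matching step for a single max operation can be illustrated under one simplifying assumption that the paper does not require: that the normalized cycle times are independent standard normal variables. In that case the two expectations have the well-known closed forms due to Clark, which are used below in place of the expressions from [3]. The function name and the coefficient-list layout (constant term first) are hypothetical.

```python
import math
from typing import List


def stat_max(a: List[float], b: List[float]) -> List[float]:
    """Linear approximation of max(A, B) by moment matching.

    A = a[0] + sum(a[i] * T'_i) and B likewise, with the T'_i assumed to be
    independent standard normal variables (an assumption of this sketch).
    Returns coefficients c with c[0] = E(max(A, B)) and
    c[i] = E(T'_i * max(A, B)).
    """
    var_a = sum(x * x for x in a[1:])
    var_b = sum(y * y for y in b[1:])
    cov = sum(x * y for x, y in zip(a[1:], b[1:]))
    theta_sq = var_a + var_b - 2.0 * cov
    if theta_sq <= 1e-18:
        # A - B is (numerically) a constant, so the max is simply one of them
        return list(a) if a[0] >= b[0] else list(b)
    theta = math.sqrt(theta_sq)
    alpha = (a[0] - b[0]) / theta
    tight = 0.5 * (1.0 + math.erf(alpha / math.sqrt(2.0)))  # Pr(A >= B)
    density = math.exp(-0.5 * alpha * alpha) / math.sqrt(2.0 * math.pi)
    mean = a[0] * tight + b[0] * (1.0 - tight) + theta * density
    sens = [x * tight + y * (1.0 - tight) for x, y in zip(a[1:], b[1:])]
    return [mean] + sens


# Two hypothetical operands over two normalized cycle times T'_1 and T'_2
coeffs = stat_max([0.0, 1.0, 0.0], [0.0, 0.0, 1.0])
```

The min operation follows from min(A, B) = -max(-A, -B), so the same routine can be reused by negating the coefficient lists on the way in and on the way out.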
Throughput Analysis for SSV Systems
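An SSV system has only one global clock frequency. Let the cycle time of the global clock be $T_G$. Since $T_G$ is constrained by the cycle times of each of the individual PEs, we can write

$$T_G = \max_{1 \le i \le n} T_i \quad (8)$$

If we assume that the individual cycle times vary independently due to random variations, we can write the cdf of $T_G$ as

$$F_{T_G}(t) = \prod_{i=1}^{n} F_{T_i}(t) \quad (9)$$

In the previous Section, we outlined an algorithm to compute the distribution of $\lambda^*$ given the input latencies $L_i = C_i T_i$ for all nodes in V. Formally:

$$\lambda^* = Q(L_1, L_2, \ldots, L_n) = Q(C_1 T_1, C_2 T_2, \ldots, C_n T_n)$$

where the function Q(.) represents the proposed probabilistic version of Karp's MCM algorithm. We note that $Q(ax, ay) = \frac{1}{a} Q(x, y)$ for any a > 0, since scaling the latency of each node by a fixed factor scales every cycle mean by the same factor and therefore scales the throughput by its inverse. Now, for an SSV system

$$\lambda^*_{SSV} = Q(C_1 T_G, C_2 T_G, \ldots, C_n T_G) = \frac{1}{T_G} Q(C_1, C_2, \ldots, C_n)$$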
Since the cycle counts are single values, and we have already computed the cdf (and therefore pdf ) of TG in equation (9), we just need a single run of the classic Karp's MCM algorithm over input values that are fixed numbers.
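The SSV computation can be sketched numerically as follows, assuming independent, normally distributed cycle times as in the experiments reported later. The function names are hypothetical, and `cycles_mcm` stands for the maximum cycle mean obtained from a single deterministic Karp run on the cycle counts alone, so that the SSV throughput is 1 / (T_G * cycles_mcm).

```python
import math
from typing import Sequence


def normal_cdf(x: float, mu: float, sigma: float) -> float:
    """Cdf of a normal distribution with mean mu and standard deviation sigma."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def global_cycle_time_cdf(t: float, mus: Sequence[float],
                          sigmas: Sequence[float]) -> float:
    """Pr(T_G <= t) for T_G = max_i T_i with independent T_i."""
    prob = 1.0
    for mu, sigma in zip(mus, sigmas):
        prob *= normal_cdf(t, mu, sigma)
    return prob


def ssv_throughput_cdf(lam: float, cycles_mcm: float,
                       mus: Sequence[float], sigmas: Sequence[float]) -> float:
    """Pr(throughput <= lam), i.e. Pr(T_G >= 1 / (lam * cycles_mcm))."""
    return 1.0 - global_cycle_time_cdf(1.0 / (lam * cycles_mcm), mus, sigmas)


# Three hypothetical PEs with nominal cycle time 1 and 3-sigma equal to 20%
mus = [1.0, 1.0, 1.0]
sigmas = [0.2 / 3.0] * 3
cycles_mcm = 75.0                       # assumed MCM of the cycle counts
lam_c = 0.95 / cycles_mcm               # constraint: 95% of nominal throughput
ssv_yield = 1.0 - ssv_throughput_cdf(lam_c, cycles_mcm, mus, sigmas)
```

Here `ssv_yield` is the fraction of chips whose throughput meets the constraint `lam_c`, which is the quantity compared against the VFI designs in the case study.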
SSV vs. VFI
Though it seems intuitive to assume that multiple VFIs will always perform better than SSV systems under variability, we would like to prove formally that this is indeed the case.

Lemma: The probability that a multiple VFI system meets a given throughput constraint, $\lambda_c$, is always greater than or equal to the probability that its SSV counterpart will meet the same constraint.

Proof: Assume that there exists a probability space Ω, from which samples of the random vector of cycle times $T = (T_1, T_2, \ldots, T_n)$ are drawn. Now we define

$$Q_{VFI}(T) = Q(C_1 T_1, C_2 T_2, \ldots, C_n T_n)$$

and correspondingly

$$Q_{SSV}(T) = Q(C_1 T_G, C_2 T_G, \ldots, C_n T_G)$$

Since the Q(.) function is a monotonically decreasing function of its inputs [7], and from equation (8) we know that $T_G \ge T_i$ ($\forall i$ s.t. 1 ≤ i ≤ n), we can say that:

$$Q_{SSV}(T) \le Q_{VFI}(T) \quad (10)$$

Now consider the probability of an SSV system meeting the throughput constraint $\lambda_c$,

$$Pr(Q_{SSV}(T) \ge \lambda_c) = Pr(\Omega_{SSV})$$

and the corresponding probability for VFI systems

$$Pr(Q_{VFI}(T) \ge \lambda_c) = Pr(\Omega_{VFI})$$

where $\Omega_{SSV}$ and $\Omega_{VFI}$ are the regions of Ω where the corresponding throughputs of SSV and VFI systems respectively are larger than or equal to $\lambda_c$. From equation (10), we know that

$$\Omega_{SSV} \subseteq \Omega_{VFI}$$

and therefore, $Pr(\Omega_{VFI}) \ge Pr(\Omega_{SSV})$, or equivalently $Pr(\lambda^*_{VFI} \ge \lambda_c) \ge Pr(\lambda^*_{SSV} \ge \lambda_c)$.
In other words, a VFI system is more likely to meet a given throughput constraint than an SSV system. We note that we have not made any assumptions about the nature (discrete or continuous) or statistics of the cycle time distributions. The proof is therefore applicable for any arbitrary distribution of cycle times. 
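The lemma can be checked empirically with a small Monte Carlo experiment. The sketch below assumes a deliberately simple component graph, a single ring through all PEs, so that the only cycle mean is the average latency and no MCM routine is needed. The cycle counts, the normal cycle time distribution and the function names are illustrative assumptions.

```python
import random
from typing import List, Tuple


def ring_throughput(latencies: List[float]) -> float:
    """Throughput of a single ring: the inverse of its only cycle mean."""
    return len(latencies) / sum(latencies)


def simulate(cycles: List[int], lam_c: float, runs: int = 2000,
             seed: int = 1) -> Tuple[float, float, bool]:
    """Return (VFI yield, SSV yield, whether VFI >= SSV held on every sample)."""
    rng = random.Random(seed)
    vfi_pass = ssv_pass = 0
    dominated = True
    for _ in range(runs):
        # One manufactured chip: nominal cycle time 1, 3-sigma of 20%
        t = [rng.gauss(1.0, 0.2 / 3.0) for _ in cycles]
        t_g = max(t)  # the global clock is limited by the slowest PE
        vfi = ring_throughput([c * ti for c, ti in zip(cycles, t)])
        ssv = ring_throughput([c * t_g for c in cycles])
        dominated = dominated and (vfi >= ssv)
        vfi_pass += vfi >= lam_c
        ssv_pass += ssv >= lam_c
    return vfi_pass / runs, ssv_pass / runs, dominated


cycles = [50, 75, 100]
nominal = ring_throughput([float(c) for c in cycles])
yield_vfi, yield_ssv, dominated = simulate(cycles, lam_c=0.97 * nominal)
```

Because the inequality holds sample by sample, the estimated VFI yield can never fall below the estimated SSV yield, whichever constraint or distribution is chosen.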
Results
We implemented the proposed algorithm for determining the throughput distribution of single and multiple voltage-frequency island systems in C. The implementation takes as input the component graph for the given application, the number of execution cycles and the cycle time distribution for each PE in the graph, the number of clock domains and the allocation of PEs to clock domains and provides the cdf of the throughput for the application. There is, unfortunately, an acknowledged lack of embedded system benchmarks that have cyclic component graphs [12] . We therefore validate the accuracy and efficiency of our proposed techniques on a set of synthetic benchmarks that are generated using the algorithm presented in [12] . We then demonstrate the impact of our proposed analysis framework on the design of multiple VFI systems with a case study on a real embedded benchmark (MPEG-2 encoder). All results are compared to an exhaustive simulation that consists of 10,000 runs of Monte Carlo simulation [3, 1, 2] . 
Synthetic Benchmarks
In [12], the authors outline an algorithm to generate cyclic task (component) graphs with specified properties. Using this approach, we generate a set of four synthetic benchmarks that we label synth-1, synth-2, synth-3 and synth-4. We vary the number of PEs (n) in the graphs from 15 to 60 and the number of clock domains (p) from 4 to 20. The number of execution cycles for each PE is chosen randomly from a uniform distribution between 50 and 100. Finally, we assume that the cycle times of the PEs are normally distributed with a 3σ of 20% of the mean. Table 1 shows the error in the mean and in the 99% yield point of the throughput distribution for each of the benchmarks. We note that the average error in the mean of the throughput distribution, compared to the Monte Carlo results, is 1.2% (maximum 1.81%) and the average error in the 99% yield point is 2.14% (maximum 3.08%). This comes at a speed-up ranging from 78X to 260X (average 145X). We note that the speed-up is greater for systems with fewer clock domains, as predicted by the time complexity analysis.
Case Study : MPEG-2 Encoder
The results from the previous section demonstrate that the proposed technique is able to accurately estimate the throughput distribution of multiple VFI systems with an appreciable speed-up in run time. Using an example of an MPEG-2 encoder from [8], we now demonstrate how such a framework can be used by system level designers to evaluate multiple VFI systems. Figure 2 shows the component graph of the MPEG-2 encoder. To determine the execution cycles for each of the components, we simulated a software version of the MPEG-2 encoder on an ARM7TDMI core and obtained cycle counts for each module. We note that in the software implementation we used, the DCT and Quantizer modules were implemented together, as were the IDCT and IQ modules. Instead of rewriting the software to separate the two modules, we divided the cycle counts equally between the modules that were implemented together. As in the previous section, we assumed the cycle time of each PE to be normally distributed with a 3σ of 20% of the mean.
Using these simulation parameters, we considered three possible implementations of the MPEG-2 encoder: MCV-9, MCV-3 and SSV. MCV-9 is a nine clock domain architecture in which each PE lies in its own VFI. MCV-3 has three clock domains, with three PEs in each clock domain, as represented by the shaded regions in Figure 2. Finally, SSV is a fully synchronous design with a single clock domain. Figure 3 shows the cdf of throughput (normalized to the nominal cycle time of a PE) obtained using our approach and using Monte Carlo simulations for each of the three architectures (for SSV the proposed method always yields exact results and therefore we only show the Monte Carlo curve for SSV). The results allow us to quantify the yield of any of the three designs for a given throughput constraint. We can see that for a throughput constraint that gives 50% yield for a fully synchronous system, a nine clock domain architecture (MCV-9) achieves 100% yield (proposed scheme also predicts 100%) while a three clock domain architecture (MCV-3) achieves 98% yield (proposed scheme predicts 92%). The improvements are more dramatic when we consider the 25% yield point of the SSV architecture: MCV-9 again achieves 100% yield (predicted 99.8%) while MCV-3 is able to achieve 77% yield (predicted 71%). Such information could be used by designers, in conjunction with the throughput constraints that the design is expected to meet, to decide on the number of clock domains for their design or even choose between a fully synchronous and a multiple VFI design style.
Conclusions and Future Work
We provide an efficient and accurate algorithm to compute the system throughput of an embedded application, implemented as a VFI design, under manufacturing process variations. The proposed framework allows system level designers to make variability aware architectural decisions for their VFI designs. Results on synthetic benchmarks demonstrate the accuracy of the proposed technique (on average 1.2% error in the mean and 2.14% error in the 99% yield point) for a speed-up ranging from 78X to 260X. Furthermore, using an MPEG-2 benchmark application, we show that multiple VFI designs are more likely to meet throughput constraints than their fully synchronous counterparts and that the yield advantage diminishes gradually as the number of clock domains is reduced (PEs are clustered together). Our future research directions involve modeling spatial and systematic sources of WID variability.
