We propose an algorithm for assessing probabilistic performance constraints for systems including components with uncertain delays. We make a case for designing systems based on a probabilistic relaxation of performance constraints, as this has the potential for resulting in lower silicon area and/or power consumption. We consider a concrete example, an MPEG decoder, for which we discuss modeling and assessment of probabilistic throughput constraints.
Introduction
This paper discusses models and algorithms to support statistical relaxations of worst case constraints on system performance. Consider, for example, a system which is designed to meet a delay constraint $d$ and suppose the critical path's delay $D_p$ is in fact random. A design based on a worst case analysis would ensure that $P(D_p > d) = 0$. In our view, for a number of application domains, such designs may be unnecessarily conservative. For example, suppose the design constraint $d$ can be relaxed in the sense that it can be violated but only rarely, say $P(D_p \ge d) \le 10^{-6}$. Such a relaxation of design constraints will in turn allow for a larger set of acceptable design solutions with hopefully less demanding performance requirements and/or power consumption. Note that even when performance constraints are truly worst case, in the sense that the system malfunctions if they are not met, it is reasonable, and possibly beneficial, to still relax the performance constraints, say, to the same level of certainty as the probability of failure of the system's components. The examples in §4 suggest that one might expect to benefit significantly from a probabilistic relaxation of worst case constraints for systems comprising a large number of non-deterministic components.
Uncertainty in the performance of a system's components may have a variety of origins. For example, for high-level system representations, such as those used in system-level and hardware/software codesign, uncertainty may be due to a looping index whose exact initial value is unknown, e.g., data dependent [1, 3, 6, 11]. A number of current system-level models use hierarchy and aggregation as a means of controlling complexity [3, 11]. If such approaches are to succeed, and one is to reason efficiently about the required performance of such systems during design space exploration, it will become increasingly critical to capture the performance variability of aggregated system elements.

*This work was supported by NSF Career Grants NCR-9624230 and MIP-9624321 and a Texas Higher Education Coordinating Board Grant ATP-088.

Permission to make digital/hard copy of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. DAC 98, San Francisco, California. © 1998 ACM 0-89791-964-5/98/06 $5.00
We propose to use probability distributions to model uncertainty. The distributions may either be derived from statistical models for the underlying source of variability, estimated based on experimental data, or gathered through simulation/profiling. Realizing that characterizing distributions is in itself a challenging and expensive task, in §3.1 we propose a crude model based on knowing the mean and upper and lower bounds on delay. This simple characterization of non-determinism is shown to be conservative for assessing system performance and eliminates (in some cases) the need for obtaining detailed statistical information on component delays.
In §2 we formulate a probabilistic critical path problem, and propose an approximate algorithm for assessing probabilistic constraints on systems represented by directed acyclic graphs (DAGs) with random edge weights. Therein we discuss related work. In §3 we discuss modeling distributions as well as a set of transformations (reductions) of typical high-level system models to obtain DAGs which are in turn amenable to analysis with the proposed algorithm. Synthetic examples exhibiting significant differences between worst case and probabilistic requirements are presented in §4. In §5 we discuss a concrete example, showing how our approach might be applied to modeling and assessing throughput constraints for an MPEG decoder. We conclude with a discussion of research/implementation directions we are currently pursuing.

In general the probabilistic critical path problem (Problem 1) is difficult to solve, primarily because the edge weights of a path are not additive (due to the convolution) and thus cannot be decomposed along the path as is usual when using dynamic programming approaches. In fact, by adapting Lemma 2 in [5], one can show that Problem 1 is NP-hard, suggesting that one should seek good heuristics.
A word of warning is in order. There are two reasons why the probabilistic critical path problem should not be interpreted as the equivalent of the standard critical path problem when the weights are random. First, the problem is predicated on specifying a constraint $d$ with respect to which a probabilistic critical path is identified. Second, and more subtly, we compare the performance of individual paths with each other, rather than assessing the maximum of the delays across all paths. Whereas in the case with deterministic weights these two problems are equivalent, in a graph with random weights they certainly are not. The discussions on modeling in §3 and examples in §4 further elucidate this point.
Previous Work
To our knowledge there is no previous work addressing the above problem. Our work was inspired by recent work in network routing considering the uncertainty in the delay or bandwidth availability at remote links [5]. The flavor of the approach, in particular our use of the Chernoff bound to perform constraint analysis for a simple example, can be found in [6, page 112]. However, both our formulation of the probabilistic critical path problem and the proposed systematic algorithm are new. We also note that there are efficient algorithms for determining the most reliable path through a network with unreliable links, a problem which arises in voice recognition and Viterbi decoding applications. However, such problems are significantly easier (they can be reduced to the traditional shortest path problem) than the problem we address here.
Approximate Algorithm
We propose an algorithm to solve the probabilistic critical path problem based on an approximate formulation as a convex optimization problem. The distribution of each edge delay $D_e$ on the DAG is represented via a parametric weight $\Lambda_e(\theta) = \log E[\exp(\theta D_e)]$ for $\theta \ge 0$, i.e., the logarithm of the moment generating function of the delay distribution on the edge. This results in a collection of DAGs with deterministic weights $\Lambda_e(\theta)$ parameterized by $\theta$. Our algorithm solves an optimization problem over this set of parameterized DAGs by using the standard critical path algorithm. The derivation of the algorithm can be found in [2].

Initialization: check that the problem is "well posed" by verifying that:
1. the constraint $d$ exceeds the critical path delay for the graph where the weights are given by the mean edge delays;
2. the constraint $d$ is bounded by the critical path delay for the graph with weights given by the maximum (possibly infinite) delay on each edge.
Optimization: determine the maximum $f^*(d)$ and the optimizers $\theta$ and $p$ for the following optimization problem [...]

Note that evaluating $\max_{p \in P} \sum_{e \in p} \Lambda_e(\theta)$ [...]
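The initialization and optimization steps can be illustrated with a minimal Python sketch. It assumes the objective takes the standard Chernoff form $\theta d - \max_{p \in P} \sum_{e \in p} \Lambda_e(\theta)$, that edge delays along a path are independent, and that the DAG has a single source and a single sink; the names `EdgeDelay`, `critical_path`, `is_well_posed`, `chernoff_exponent` and `build_chain_example` are hypothetical and not taken from the paper. Because the objective is concave in $\theta$ under these assumptions, a ternary search over a bounded interval is used here in place of whatever solver the authors employ.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class EdgeDelay:
    mean: float                        # expected delay of the edge
    worst: float                       # largest possible delay of the edge
    log_mgf: Callable[[float], float]  # theta -> log E[exp(theta * D_e)]


def bernoulli_edge(p_one: float) -> EdgeDelay:
    """Delay equal to 1 with probability p_one and 0 otherwise."""
    def log_mgf(theta: float) -> float:
        # Factor out exp(theta) so that large theta cannot overflow.
        return theta + math.log(p_one + (1.0 - p_one) * math.exp(-theta))
    return EdgeDelay(mean=p_one, worst=1.0, log_mgf=log_mgf)


def constant_edge(delay: float) -> EdgeDelay:
    """Deterministic delay; its log-MGF is linear in theta."""
    return EdgeDelay(mean=delay, worst=delay, log_mgf=lambda theta: theta * delay)


Graph = Dict[Tuple[str, str], EdgeDelay]


def critical_path(order: List[str], graph: Graph,
                  weight: Callable[[EdgeDelay], float]) -> Tuple[float, List[str]]:
    """Standard longest path from order[0] to order[-1]; order is topological."""
    best = {node: -math.inf for node in order}
    best[order[0]] = 0.0
    pred: Dict[str, str] = {}
    for u in order:
        for (a, b), edge in graph.items():
            if a == u and best[u] + weight(edge) > best[b]:
                best[b], pred[b] = best[u] + weight(edge), u
    path = [order[-1]]
    while path[-1] != order[0]:
        path.append(pred[path[-1]])
    return best[order[-1]], path[::-1]


def is_well_posed(order: List[str], graph: Graph, d: float) -> bool:
    """Initialization: mean critical path < d <= worst case critical path."""
    mean_delay, _ = critical_path(order, graph, lambda e: e.mean)
    worst_delay, _ = critical_path(order, graph, lambda e: e.worst)
    return mean_delay < d <= worst_delay


def chernoff_exponent(order: List[str], graph: Graph, d: float,
                      theta_max: float = 50.0, steps: int = 200):
    """Maximize theta*d - (critical path under weights Lambda_e(theta))."""
    def objective(theta: float) -> float:
        return theta * d - critical_path(order, graph, lambda e: e.log_mgf(theta))[0]

    lo, hi = 0.0, theta_max
    for _ in range(steps):  # ternary search; the objective is concave in theta
        left, right = lo + (hi - lo) / 3.0, hi - (hi - lo) / 3.0
        if objective(left) < objective(right):
            lo = left
        else:
            hi = right
    theta = 0.5 * (lo + hi)
    _, path = critical_path(order, graph, lambda e: e.log_mgf(theta))
    return objective(theta), theta, path


def build_chain_example(n: int = 20, p_one: float = 0.5, shortcut: float = 5.0):
    """Hypothetical DAG: a chain of n Bernoulli edges plus one deterministic shortcut."""
    order = [f"n{i}" for i in range(n + 1)]
    graph = {(order[i], order[i + 1]): bernoulli_edge(p_one) for i in range(n)}
    graph[(order[0], order[n])] = constant_edge(shortcut)
    return order, graph


order, graph = build_chain_example()
if is_well_posed(order, graph, d=18.0):
    f_star, theta_star, path = chernoff_exponent(order, graph, d=18.0)
    print("exponent:", f_star, "bound on violation probability:", math.exp(-f_star))
```

Under the stated assumptions, `exp(-f_star)` upper-bounds the violation probability of every source-to-sink path, and each evaluation of the objective costs one ordinary critical path computation on a deterministic DAG.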
Let $D_e$ be a Bernoulli random variable on $\{l_e, u_e\}$ with mean $m_e$, i.e., $P(D_e = u_e) = \frac{m_e - l_e}{u_e - l_e}$ and $P(D_e = l_e) = \frac{u_e - m_e}{u_e - l_e}$.
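As a rough numerical illustration of why a two-point model on the bounds is conservative, the sketch below compares its log moment generating function with that of a uniform delay on the same interval and with the same mean. The comparison distribution (uniform) and the function names are assumptions chosen only for illustration; the underlying fact is that $\exp(\theta x)$ is convex in $x$, so placing all mass at the endpoints can only increase $E[\exp(\theta D_e)]$.

```python
import math


def two_point_log_mgf(theta: float, low: float, high: float, mean: float) -> float:
    """Log-MGF of a delay taking only the values low and high, with the given mean."""
    p_high = (mean - low) / (high - low)
    return math.log((1.0 - p_high) * math.exp(theta * low)
                    + p_high * math.exp(theta * high))


def uniform_log_mgf(theta: float, low: float, high: float) -> float:
    """Log-MGF of a delay uniformly distributed on [low, high]."""
    if theta == 0.0:
        return 0.0
    return math.log((math.exp(theta * high) - math.exp(theta * low))
                    / (theta * (high - low)))


# Hypothetical bounds: delay between 1 and 3 cycles, mean 2 in both models.
for theta in (0.1, 0.5, 2.0):
    crude = two_point_log_mgf(theta, 1.0, 3.0, 2.0)
    detailed = uniform_log_mgf(theta, 1.0, 3.0)
    print(theta, crude, detailed, crude >= detailed)
```

A larger log-MGF yields a larger Chernoff bound, so using the two-point weight in place of the true one errs on the safe side.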
Hierarchical representations and reductions to directed acyclic graphs
In general, hierarchical high-level system models are represented by control/data flow graphs (CDFGs) [3]. Such graphs allow the representation of flow of control constructs, including branching, looping, and possible synchronization requirements, see e.g., [3, 1, 11]. Rather than formally defining such a modeling framework we will exhibit some cases that arise and the manner in which they are reduced to a corresponding DAG. Below we show how the delay weights associated with traversing the nodes in Fig. 1 can be reduced to equivalent path weights. Generalizations of these cases to more than two non-intersecting (independent) subpaths should be clear from the discussions below.
Reducing nodes with probabilistic branches.
Consider the left node in Fig. 1. Within the node, a branch is modeled probabilistically, in the sense that one of the two sub-paths, $p_1$ or $p_2$, is selected at random. Suppose the branching probability is $\gamma$; then the weight for the sub-path [...]
Note that if a branch is not modeled probabilistically (due to lack of information) then both paths would be kept in the eventual DAG.
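One way to picture the branch reduction is as a mixture: if sub-path $p_1$ is taken with probability $\gamma$ and $p_2$ otherwise (which sub-path receives $\gamma$ is an assumption here), the moment generating function of the node delay is the corresponding mixture of the two sub-path moment generating functions. The sketch below computes the resulting weight $\log[\gamma e^{\Lambda_{p_1}(\theta)} + (1-\gamma) e^{\Lambda_{p_2}(\theta)}]$; the function name `branch_log_mgf` is hypothetical.

```python
import math
from typing import Callable

LogMgf = Callable[[float], float]  # theta -> log E[exp(theta * delay)]


def branch_log_mgf(gamma: float, log_mgf_p1: LogMgf, log_mgf_p2: LogMgf) -> LogMgf:
    """Weight of a node that takes sub-path p1 w.p. gamma and p2 w.p. 1 - gamma.

    Requires 0 < gamma < 1.
    """
    def log_mgf(theta: float) -> float:
        a = math.log(gamma) + log_mgf_p1(theta)
        b = math.log(1.0 - gamma) + log_mgf_p2(theta)
        top = max(a, b)  # log-sum-exp, to stay stable for large theta
        return top + math.log(math.exp(a - top) + math.exp(b - top))
    return log_mgf


# Hypothetical node: a 2-cycle sub-path taken 30% of the time, a 5-cycle one otherwise.
node = branch_log_mgf(0.3, lambda t: 2.0 * t, lambda t: 5.0 * t)
print(node(0.1), "versus worst case weight", 5.0 * 0.1)
```

For $\theta \ge 0$ the mixture weight never exceeds the larger of the two sub-path weights, which is where the gain over keeping only the worst case branch comes from.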
Reducing nodes with iterations or feedback loops.
Consider the middle node in Fig. 1 [...] geometric distribution with parameter $\gamma$, i.e., after completion of any iteration the probability of looping back is $1 - \gamma$. In this case $M_N(z) =$ [...]. Of course if the loop index is deterministic, i.e., $P(N = n) = 1$, then the corresponding DAG would unravel the loop, i.e., the weight for a path through this node would be $\Lambda_p(\theta) = n \times \Lambda_{p_1}(\theta)$.
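The loop reduction can be sketched as follows, under assumptions that go beyond the text: the loop body runs at least once, successive iterations have independent and identically distributed delays with weight $\Lambda_{p_1}(\theta)$, and $M_N(z) = E[z^N]$ denotes the probability generating function of the iteration count $N$. With $P(N = n) = \gamma (1-\gamma)^{n-1}$ this gives $M_N(z) = \gamma z / (1 - (1-\gamma) z)$ for $(1-\gamma) z < 1$, and the node weight is $\log M_N(e^{\Lambda_{p_1}(\theta)})$. Function names are hypothetical.

```python
import math
from typing import Callable

LogMgf = Callable[[float], float]  # theta -> log E[exp(theta * delay)]


def geometric_loop_log_mgf(gamma: float, body_log_mgf: LogMgf) -> LogMgf:
    """Loop that exits with probability gamma after each iteration (N >= 1)."""
    def log_mgf(theta: float) -> float:
        z = math.exp(body_log_mgf(theta))
        ratio = (1.0 - gamma) * z
        if ratio >= 1.0:
            return math.inf  # the generating function diverges for this theta
        return math.log(gamma * z / (1.0 - ratio))
    return log_mgf


def deterministic_loop_log_mgf(n: int, body_log_mgf: LogMgf) -> LogMgf:
    """Loop with a known iteration count n: the loop is simply unravelled."""
    return lambda theta: n * body_log_mgf(theta)


# Hypothetical loop body with a deterministic delay of one cycle.
geo = geometric_loop_log_mgf(0.5, lambda t: 1.0 * t)
det = deterministic_loop_log_mgf(3, lambda t: 2.0 * t)
print(geo(0.1), det(0.5))
```

Note that the geometric weight is finite only for sufficiently small $\theta$, which restricts the range over which the optimization over $\theta$ is meaningful.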
Reducing nodes with synchronization.
In general synchronization is the most difficult abstraction to handle, particularly in a setup with random delays. Consider the rightmost example in Fig. 1, where two paths $p_1$ and $p_2$ must synchronize prior to leaving the node. The delay incurred in this node, $D_p$, is given by $D_p = \max[D_{p_1}, D_{p_2}]$, the maximum of the delay along the two paths. The weight for this node would be $\Lambda_p(\theta) = \log E[\exp(\theta \max[D_{p_1}, D_{p_2}])]$.
Unfortunately there is no general way to compute this metric, without explicitly computing the distributions for the maximum of the delays for the two paths.
Notice that whereas for graphs with deterministic delays we need only consider the worst case path to deal with synchronization, in the case of random path delays, both paths contribute to the characteristics of the synchronization time.

By using the delay metric $\Lambda_p(\theta)$ for this node we can proceed safely knowing we will still obtain an upper bound on performance. Note that if $d_{p_1} + d_{p_2} > u_p$ then this reduces to using the worst case upper bound on synchronization. However when $d_{p_1} + d_{p_2} < u_p$ we can still glean some information on the probabilistic behavior of that node.
Last resort conservative bound. If the paths in the node are "short" relative to the critical path of the graph then a simple upper bound can be devised by noting that $D_p \le D_{p_1} + D_{p_2}$, so it follows that $\Lambda_p(\theta) \le \Lambda_{p_1}(\theta) + \Lambda_{p_2}(\theta)$. This is likely to be conservative in the probabilistic sense, yet may still be reasonable when compared to the results obtained using the worst case edge delay.
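For small discrete delay distributions the synchronization weight can be computed exactly by enumerating the maximum, which makes it easy to see how loose the last resort bound is. The sketch below assumes independent, non-negative delays on the two paths, represented as hypothetical dictionaries mapping delay values to probabilities.

```python
import math
from itertools import product
from typing import Dict

Dist = Dict[float, float]  # delay value -> probability


def log_mgf(dist: Dist, theta: float) -> float:
    """log E[exp(theta * D)] for a discrete delay distribution."""
    return math.log(sum(p * math.exp(theta * x) for x, p in dist.items()))


def max_distribution(d1: Dist, d2: Dist) -> Dist:
    """Distribution of max(D1, D2) for independent D1 and D2."""
    out: Dist = {}
    for (x1, q1), (x2, q2) in product(d1.items(), d2.items()):
        key = max(x1, x2)
        out[key] = out.get(key, 0.0) + q1 * q2
    return out


# Hypothetical sub-path delays (in cycles).
path_a = {0.0: 0.5, 1.0: 0.5}
path_b = {0.0: 0.75, 2.0: 0.25}
sync = max_distribution(path_a, path_b)

for theta in (0.1, 1.0, 3.0):
    exact = log_mgf(sync, theta)
    bound = log_mgf(path_a, theta) + log_mgf(path_b, theta)
    print(theta, exact, bound)
```

In this example the exact weight lies between the larger of the two sub-path weights and their sum, consistent with the inequality above.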
Synthetic Examples: Why use probabilistic vs. worst case critical paths?
For simplicity let us assume that all edge delays are independent and identically distributed Bernoulli random variables with mean $m$, i.e., $P(D_e = 1) = m$ and $P(D_e = 0) = 1 - m$. Suppose there is a single path through the DAG representing a system and it has $n$ edges, so $D_p = \sum_{e=1}^{n} D_e$. Clearly the worst case critical path would have a length of $n$. A probabilistic analysis might consider the likelihood that the delay exceeds 90% of the worst case delay, i.e., $P(D_p \ge 0.9n)$. Table 1 exhibits some results for this setup, where both the length $n$ of the path and the mean $m$ of the edge delays are varied.
When $m = 1/2$ and the path is relatively long, say $n = 20, 80$, the probabilities of failure are $O(10^{-4})$ and $O(10^{-14})$ respectively, possibly small enough to be neglected. Thus a delay constraint which is 90% of the worst case is very likely to be met. For $m = 1/4$ even a path with a moderate number of elements, $n = 10$, has a small probability of failure, $O(10^{-5})$, again showing that a probabilistic relaxation of the constraint is likely to be advantageous.
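The orders of magnitude quoted above can be reproduced with a short calculation. The sketch below computes the exact binomial tail $P(D_p \ge 0.9n)$ and the Chernoff bound for a sum of i.i.d. Bernoulli delays, using the closed-form exponent $n[a \log(a/m) + (1-a)\log((1-a)/(1-m))]$ with $a = 0.9$; the function names are hypothetical.

```python
import math


def exact_tail(n: int, m: float, threshold: float) -> float:
    """Exact P(D_p >= threshold) for a sum of n independent Bernoulli(m) delays."""
    k_min = math.ceil(threshold - 1e-9)  # guard against floating point round-off
    return sum(math.comb(n, k) * m ** k * (1.0 - m) ** (n - k)
               for k in range(k_min, n + 1))


def chernoff_tail(n: int, m: float, threshold: float) -> float:
    """Chernoff upper bound on the same probability; requires m < threshold/n < 1."""
    a = threshold / n
    exponent = n * (a * math.log(a / m) + (1.0 - a) * math.log((1.0 - a) / (1.0 - m)))
    return math.exp(-exponent)


for n, m in ((20, 0.5), (80, 0.5), (10, 0.25)):
    d = 0.9 * n
    print(n, m, exact_tail(n, m, d), chernoff_tail(n, m, d))
```

For $n = 20$ and $m = 1/2$ the exact value is $211/2^{20} \approx 2.0 \times 10^{-4}$, and for $n = 10$ and $m = 1/4$ it is $31/4^{10} \approx 3.0 \times 10^{-5}$; in each case the Chernoff value is larger, as expected of an upper bound.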
Based on this simple example it should be clear that as we consider increasingly large systems with many uncertain elements, the gains of a probabilistic relaxation of constraints will accrue. Moreover if the delay distributions are such that the average performance is significantly smaller than the worst case bounds, e.g., 75% smaller when $m = 1/4$, then probabilistic constraints are likely to allow a significant relaxation over the worst case critical path. The failure probabilities in Table 1 were computed exactly based on Bernoulli distributions and via the Chernoff bound used by our algorithm. Clearly the results compare favorably, and as expected the Chernoff bound gives an upper bound on the failure probability. In summary these examples show that if indeed there is sufficient uncertainty in the performance of elements on a reasonably large graph, the proposed method is likely to pay off handsomely if one can allow for a probabilistic relaxation of constraints.

In general the probabilistic and conventional critical paths need not coincide. Indeed, consider the graph in Fig. 2, where three edges have independent Bernoulli distributions on $\{0, 1\}$ with means 1/2, 1/2, 1/8 and the fourth is deterministic with delay 1.2. The worst case critical path is obviously $p_2$ with a delay of 2.2. Now, given a delay constraint $d = 1.5$ one can easily show that the probability of violation is largest on $p_1$, i.e., $P(D_{p_1} > 1.5) = 1/4 > 1/8 = P(D_{p_2} \ge 1.5)$. This suggests that a designer optimizing a system based on worst case information may be addressing the wrong path, at least when a probabilistic relaxation of system constraints is possible.
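The two-path comparison can be checked by brute-force enumeration. The sketch below assumes, consistently with the quoted worst case of 2.2 and the two probabilities, that $p_1$ consists of the two edges with mean 1/2 and that $p_2$ consists of the edge with mean 1/8 followed by the deterministic edge; this assignment is inferred, since the figure itself is not reproduced here, and the helper names are hypothetical.

```python
import math
from itertools import product
from typing import Dict, List

Dist = Dict[float, float]  # delay value -> probability


def violation_probability(path: List[Dist], d: float) -> float:
    """Exact P(sum of independent edge delays >= d), by enumeration."""
    total = 0.0
    for combo in product(*(edge.items() for edge in path)):
        delay = sum(value for value, _ in combo)
        if delay >= d:
            total += math.prod(prob for _, prob in combo)
    return total


def worst_case(path: List[Dist]) -> float:
    """Worst case delay of a path: sum of the largest delay on each edge."""
    return sum(max(edge) for edge in path)


path_1 = [{0.0: 0.5, 1.0: 0.5}, {0.0: 0.5, 1.0: 0.5}]
path_2 = [{0.0: 7 / 8, 1.0: 1 / 8}, {1.2: 1.0}]

print("worst case:", worst_case(path_1), worst_case(path_2))
print("P(violation):", violation_probability(path_1, 1.5),
      violation_probability(path_2, 1.5))
```

The path with the larger worst case delay is not the one with the larger violation probability at $d = 1.5$, which is the point of the example.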
Probabilistic Constraints and MPEG Video Decoders
In this section we illustrate the practical interest of the proposed algorithm for probabilistic constraint analysis by considering MPEG video decoders [7, 4, 8]. The MPEG decoder was chosen due to the presence of non-deterministic (data dependent) delays in some of the key decoding sub-tasks. This example illustrates how the inherently variable nature of these tasks makes it interesting to assess probabilistic throughput constraints.
Background on MPEG-2
A video stream consists of a sequence of pictures or frames sampled at a given rate. Three basic types of pictures are defined: intra-coded pictures, which are coded without reference to other pictures; forward/backward predictively coded pictures, which can use motion prediction from a past/future picture; and bidirectionally-predictively coded pictures, which can use motion prediction from both past and future pictures. These are referred to as I, P and B pictures respectively.
Pictures are in turn subdivided into a number of macroblocks, each a 16 by 16 pixel region. Depending on the picture type, a good match might be sought between its macroblocks and other pictures in the sequence, based on computing motion vectors. Thus a macroblock can be:
causal (forward coded): defined from a previous picture, allowed for macroblocks within P and B-pictures;
non-causal (backward coded): defined from a future picture, allowed for macroblocks within B-pictures only;
interpolative (bidirectionally coded): defined from a past and a future picture, allowed for macroblocks within B-pictures only.
Non-motion compensated macroblocks are allowed for all types of pictures, and are said to be intra-coded.
As the MPEG-2 decoder reads the bitstream, it identifies the start and type of a coded picture, and then decodes each macroblock in the picture, as shown in Fig. 3.

Table 2: Estimates of branching probabilities for MPEG macroblock decoding.

In Fig. 3, $N$ represents the number of macroblocks in a picture; for the streams being considered a picture comprises 330 macroblocks. The shaded ellipses in Fig. 3 represent basic flow of control decisions taken during the decoding of each macroblock within a picture. Table 2 shows estimates of branching probabilities for these decision points. These estimates were generated by running a software decoder on a collection of MPEG video traces. The first two columns in Table 2 identify the type of picture (I, P, or B) and the percentage of occurrence of that particular type of picture in the fixed sequence of pictures considered for our MPEG-2 decoder. The third column in the table gives the branching probability for the first decision point in Fig. 3, i.e., the probability that a macroblock will be skipped within a P or a B picture (note that all macroblocks within I pictures are intra-coded). The last three columns in Table 2 give the probability that a given macroblock will be intra-coded, forward/backward coded, or bidirectionally coded, for I, P and B pictures.
The performance of an MPEG-2 decoder is determined by the individual performance of five key modules: Variable Length Decoding (VLD), Inverse Quantization (IQ), Inverse DCT (IDCT), Pixel Interpolation (PI), and Pixel Add (PA) [7]. However not all the modules are executed for every macroblock. In particular, as shown in Fig. 3, none of the modules is executed for non-coded (or skipped) macroblocks, and the PI and PA Modules are not executed for intra-coded macroblocks. Moreover, the processing done by the PI and PA Modules for bidirectionally coded macroblocks is twice that required by forward or backward coded macroblocks, since one additional reference macroblock needs to be considered in the first case. (This extra processing is the reason for the separation of bidirectionally coded macroblocks from the two other types of motion compensated macroblocks in the control flow graph shown in Fig. 3.)

The algorithm-level descriptions of the MPEG-2 modules referred to above have been the focus of extensive studies on optimizations/transformations geared towards performance enhancement [7, 4]. In our example, we have adopted the set of highly optimized algorithmic descriptions discussed in [7]. In these behavioral descriptions, the VLD and IQ modules are merged in order to save on write/read cycles to memory.
Using Probabilistic Constraint Analysis to Guide the Design of an MPEG-2 Video Decoder
The objective in this example is to define/specify the RTL architecture (functional units and registers/memory) for the key MPEG-2 decoder modules referred to above, so as to derive a decoder supporting a throughput of 30 frames/sec (which translates into a 33.3 ms decoding time per picture).
Design Option 1. The modules' RTL descriptions of our initial design, referred to as Option 1, were directly derived from the modules' algorithmic descriptions given in [7]. The scheduling of operations within each module was strictly performed based on data dependencies, i.e., the performance of such modules is never compromised by resource sharing. Memory blocks were assumed to be implemented by RAMs with a single read port (with two cycle read operations) and a single write port. Table 3 shows the resulting execution delays (in # cycles) for the various MPEG-2 decoding modules. As mentioned previously, the execution delays of the PI and PA modules are given separately for bidirectionally coded and for forward/backward coded macroblocks. A crucial observation needs to be made with respect to the numbers shown in Table 3 for the VLD+IQ Module. The execution delay of that module for each macroblock depends on the number of non-zero DCTs per macroblock, and is thus data dependent. In [7], the average size of VLCs in a typical MPEG-2 bitstream was reported to be about 4.5 bits, which in turn translates to an average of 30 non-zero DCTs per macroblock, i.e., an average of 484 cycles per macroblock for Option 1. We have used a crude model for the delay of the VLD+IQ module given by a Gaussian distribution with this mean (see Table 3).

Table 3: Time estimates (# cycles per macroblock) for MPEG modules.
In the upper part of Fig. 4, we show the decoding time distributions for I, P, and B pictures for design Option 1, derived using the execution delays per macroblock (in # cycles) given in Table 3, the branching probabilities given in Table 2, and the previously mentioned model for the VLD+IQ block. Table 4 shows the corresponding average and worst case decoding times (in # cycles) for the three types of pictures, and also the worst case and average decoding time considering all picture types (given in the last row of the table).

Fig. 4 (upper part): Option 1 PDF of decoding cycles for I, P, and B pictures.

Table 4: Average and worst case decoding times, in # cycles, for I, P and B frames.
The maximum combinatorial delay for our modules' RTL descriptions was determined to be 43 ns (for a 0.7 μm standard-cell library). So, for a 43 ns clock, our Option 1 design led to an average delay per picture of 52.6 ms (i.e., the decoder would only sustain a throughput of 19 pictures per second), which falls short of the target of 33.3 ms per picture (i.e., the desired throughput of 30 pictures per second). Moreover, the resulting design exhibited a worst case picture decoding delay of 58.8 ms. Option 1 was thus clearly insufficient in terms of performance, and was dropped.
Design Option 2. A second implementation, which we will call Option 2, was then developed, taking advantage of the fact that the computations performed by the IDCT, the PI, and the PA Modules can be easily parallelized. A new design was developed that: (1) has two parallel IDCT units (i.e., can compute simultaneously two 8x8 2-D IDCTs, each of which is done as a loop whose body computes an 8-point IDCT); (2) has two parallel pixel interpolation and pixel add units; and (3) uses RAMs with two parallel read ports (still with a two cycle read operation). Table 3 shows the resulting execution delays (in # cycles) for the various decoder modules for Option 2. The bottom part of Fig. 4 shows the decoding time distributions for I, P, and B pictures for the new design. Table 4 shows the resulting average and worst case picture decoding delays for the three types of pictures, and also the worst case and average decoding delays for all picture types.
The maximum combinatorial delay was determined to be 46 ns, using the same standard-cell library, thus leading to an average decoding delay per picture of 31.5 ms, now below the target delay of 33.3 ms per picture, and to a worst case delay of 36.5 ms. Note, however, that the (relative) gap between the average and the worst case delays for Option 2 has increased significantly with respect to that for Option 1 (see last row of Table 4 ). Indeed, in the Option 2 design we have increased the decoder performance by introducing some parallelism in the IDCT, PI, and PA Modules. As a result, the relative percentage of time spent on the heavily data dependent VLD+IQ Module, with respect to the total decoding time, has increased significantly, leading to more significant delay variations across pictures.
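The delay and throughput figures quoted for the two design options are related by simple arithmetic, sketched below. The picture delays and clock periods are the ones stated in the text; the cycle counts printed by `cycles_per_picture` are back-of-the-envelope values derived from them, not entries taken from Table 4, and the function names are hypothetical.

```python
def pictures_per_second(delay_ms: float) -> float:
    """Sustained throughput when every picture takes delay_ms to decode."""
    return 1000.0 / delay_ms


def picture_budget_ms(target_fps: float) -> float:
    """Decoding time available per picture for a target throughput."""
    return 1000.0 / target_fps


def cycles_per_picture(delay_ms: float, clock_ns: float) -> float:
    """Number of clock cycles corresponding to a decoding delay."""
    return delay_ms * 1e6 / clock_ns


budget = picture_budget_ms(30.0)  # about 33.3 ms per picture
for name, clock_ns, avg_ms, worst_ms in (("Option 1", 43.0, 52.6, 58.8),
                                         ("Option 2", 46.0, 31.5, 36.5)):
    print(name,
          "avg fps:", round(pictures_per_second(avg_ms), 1),
          "worst fps:", round(pictures_per_second(worst_ms), 1),
          "avg cycles:", round(cycles_per_picture(avg_ms, clock_ns)),
          "relative gap:", round((worst_ms - avg_ms) / avg_ms, 3),
          "average within budget:", avg_ms <= budget,
          "worst case within budget:", worst_ms <= budget)
```

For Option 2 the average delay meets the budget while the worst case does not, which is precisely the situation in which the probability of exceeding the budget becomes the relevant figure of merit.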
It is in cases such as the above that the interest of the systematic algorithm for assessing probabilistic constraints proposed in this paper becomes obvious. Indeed, in order to adequately evaluate the suitability of the decoder design under discussion, a key piece of information (to be given to the designer) is the probability that the target delay of 33.3 ms will be exceeded by the particular decoder design. Note that, based on such a probability, and depending on the specific timing requirements of the application for which the MPEG-2 decoder is being developed, the outcome of the evaluation might be radically different. Specifically, the Option 2 design could be considered an adequate solution, could be an unnecessarily expensive solution (in terms of area and/or average power consumption), or could still require further performance improvements.

Table 5 shows the probability of violating the decoding time constraint, together with the Chernoff bound obtained by our algorithm, for I, P and B frames and overall. The exact numbers are exhibited to show the quality of the approximations (upper bound) provided by the algorithm. Based on these numbers, the designer would proceed either by performing yet another iteration at the RTL level, or by starting the physical design of the decoder.
Conclusions
In this paper we have formulated a probabilistic critical path problem on a DAG with random weights and proposed a novel approximate algorithm for determining the likelihood that a constraint is satisfied. Through a discussion using synthetic and real examples, we have made a case for the importance/relevance of assessing probabilistic constraints on system performance, whenever the application domain is amenable to some level of constraint relaxation. Specifically, the ability to analyze the system model so as to derive less aggressive performance requirements on its various components has the potential to reduce the final cost and power consumption of the system. Our algorithm is currently being implemented in an environment for assisting algorithm and architecture-level design space exploration during system-level design [11].
