Parametric Design Synthesis of Distributed Embedded Systems by Kang, Dong-In et al.
Parametric Design Synthesis of Distributed Embedded Systems Dong-In Kang, Richard GerberInstitute for Advanced Computer StudiesDepartment of Computer ScienceUniversity of MarylandCollege Park, MD 20742fdikang, richg@cs.umd.edu Manas SaksenaDept. of Computer ScienceConcordia UniversityMontreal Quebec H3G 1M8Canadamanas@cs.concordia.caAbstractThis paper presents a design synthesis method for distributed embedded systems. In such systems,computations can ow through long pipelines of interacting software components, hosted on a varietyof resources, each of which is managed by a local scheduler. Our method automatically calibratesthe local resource schedulers to achieve the system's global end-to-end performance requirements.A system is modeled as a set of distributed task chains (or pipelines), where each task representsan activity requiring nonzero load from some CPU or network resource. Task load requirementscan vary stochastically, due to second-order eects like cache memory behavior, DMA interference,pipeline stalls, bus arbitration delays, transient head-of-line blocking, etc. We aggregate these eects{ along with a task's per-service load demand { and model them via a single random variable, rangingover an arbitrary discrete probability distribution. Load models can be obtained via proling tasksin isolation, or simply by using an engineer's hypothesis about the system's projected behavior.The end-to-end performance requirements are posited in terms of throughput and delay con-straints. Specically, a pipeline's delay constraint is an upper bound on the total latency a compu-tatation can accumulate, from input to output. The corresponding throughput constraint mandatesthe pipeline's minimum acceptable output rate { counting only outputs which meet their delay con-straints. 
Since per-component loads can be generally distributed, and since resources host stages from multiple pipelines, meeting all of the system's end-to-end constraints is a nontrivial problem. Our approach involves solving two sub-problems in tandem: (A) finding an optimal proportion of load to allocate each task and channel; and (B) deriving the best combination of service intervals over which all load proportions can be guaranteed. The design algorithms use analytic approximations to quickly estimate output rates and propagation delays for candidate solutions.

When all parameters are synthesized, the estimated end-to-end performance metrics are re-checked by simulation. The per-component load reservations can then be increased, with the synthesis algorithms re-run to improve performance. At that point the system can be configured according to the synthesized scheduling parameters, and then re-validated via on-line profiling.

In this paper we demonstrate our technique on an example system, and compare the estimated performance to its simulated on-line behavior.

This research is supported in part by ONR grant N00014-94-10228, NSF Young Investigator Award CCR-9357850, ARL Cooperative Agreement DAAL01-96-2-0002 and NSERC Operating Grant OGPO170345. A preliminary version of this work was reported in "Performance-Based Design of Distributed Real-Time Systems," in the Proceedings of the IEEE Real-Time Technology and Applications Symposium (June 1997).
1 Introduction

An embedded system's intrinsic real-time constraints are imposed on its external inputs and outputs, from the perspective of its environment. At the same time, the computation paths between these end-points may flow through a large set of interacting components, hosted on a variety of resources and managed by local scheduling and queuing policies. A crucial step in the design process involves calibrating and tuning the local resource management policies, so that the original real-time objectives are achieved.

Real-time scheduling analysis is often used to help make this problem more tractable. Using this approach, upper bounds are derived for processing times and communication delays. Then, using these worst-case assumptions, tasks and messages can be deterministically scheduled to guarantee that all timing constraints will be met. Such constraints might include an individual thread's processing frequency, a packet's deadline, or perhaps the rate at which a network driver is run. In this type of system, hard real-time analysis can be used to help predict (and then ensure) that the performance objectives will be attained.

This approach is becoming increasingly difficult to carry out. First, achieving near-tight execution-time bounds is virtually impossible due to architectural features like superscalar pipelining, hierarchies of cache memory, etc., not to mention the nondeterminism inherent in almost any network. Given this, as well as the fact that a program's actual execution time is data-dependent, a worst-case timing estimate may be several orders of magnitude greater than the average case. If one incorporates worst-case costs in a design, the result will often be an extremely under-utilized system.

Moreover, parameters like processing periods and deadlines are used to help achieve acceptable end-to-end performance, i.e., as a means to an end, and not an end in itself.
In reality, missing a deadline will rarely lead to failure; in fact, such an occurrence should be expected, unless the system is radically over-engineered. While hard real-time scheduling theory provides a sufficient way to build an embedded system, it is not strictly necessary, and it may not yield the most efficient design.

In this paper we explore an alternative approach, by using statistical guarantees to generate cost-effective system designs. We model a real-time system as a set of task chains, with each task representing some activity requiring a specific CPU or network link. For example, a chain may correspond to the data path from a camera to a display in a video conferencing system, or reflect a servo-loop in a distributed real-time control system. The chain's real-time performance requirements are specified in terms of (i) a maximum acceptable propagation delay from input to output, and (ii) a minimum acceptable average throughput. In designing the system, we treat the first requirement as a hard constraint; that is, any end-to-end computation that takes longer than the maximum delay is treated as a failure, and does not contribute to the overall throughput. (Some applications may be able to use late outputs, yet within the system model we currently do not count them.) In contrast, the second requirement is viewed in a statistical sense, and we design a system to exceed, on average, its minimal acceptable rate. We assume that a task's cost (i.e.,
execution time for a program, or delay for a network link) can be specified with any arbitrary discrete probability distribution function.

Problem and Solution Strategy. Given a set of task chains, each with its own real-time performance requirements, our objective is to design a system which will meet the requirements for all of the chains. Our system model includes the following assumptions, which help make the solution tractable:

- We model a task's load requirements stochastically, in terms of a discrete probability distribution function (or PDF), whose random variable characterizes the resource time needed for one execution instance of the task. Successive task instances are modeled to be independent of each other.
- We assume a static partitioning of the system resources; in other words, a task has already been allocated to a specific resource.

It is true that embedded systems design often involves making some task-placement decisions. However, we note that tuning the resource schedulers is, by definition, subservient to the allocation phase, which often involves accounting for device-specific localities (e.g., IO ports, DMA channels, etc.), as well as system-level issues (e.g., services provided by each node). While this paper's focus is narrowed to the scheduling synthesis problem at hand, we note that a "holistic" design tool could integrate the two problems, and use our system-tuning algorithms as "subroutines."

Also, while our objective is to achieve an overall, statistical level of real-time performance, we can still use the tools provided by hard real-time scheduling to help solve this problem. Our method involves the following steps: (1) assigning to each task a fixed proportion of its resource's load; and (2) determining the reasonable service interval (or frame) over which the proportion can be guaranteed.
Then, using some techniques provided by real-time CPU and network scheduling, we can guarantee that during any such frame, a task will get at least its designated share of the resource's time. When that share fails to be sufficient for a currently running task to finish executing, it runs over into the next frame and gets that frame's share, etc. Given this model, the design problem may be viewed as the following interrelated sub-problems:

(1) How should the CPU and network load be partitioned among the tasks, so that every chain's performance requirements are met?

(2) Given a load-assignment, how should the frame-sizes be set to maximize the effective output rate?

As we discuss in the sequel, load proportions cannot be quantized over infinitesimal time-frames. Hence, when a task's frame gets progressively smaller, it starts paying a large price for its guarantees, in the form of wasted overhead.

In this paper we present algorithms to solve these problems. The algorithm for problem (1) uses a heuristic to compare the relative needs of tasks from different chains competing for the same resources. The algorithm for problem (2) makes use of connecting Markov chains, to estimate the
effective throughput of a given chain. Since the analysis is approximate, we validate the generated solution using a simulation model.

2 Related Work

Like much of the work in real-time systems, our results extend from preemptive, uniprocessor scheduling analysis. There are many old and new solutions to this problem (e.g., [1, 15, 20, 21]); moreover, many of these methods come equipped with offline analysis tests, which determine a priori whether the underlying system is schedulable. Some of these tests are load-oriented sufficiency conditions: they predict that the tasks will always meet their deadlines, provided that the system utilization does not exceed a certain pre-defined threshold.

The classical model has been generalized to a large degree, and there now exist analogous results for distributed systems, network protocols, etc. For example, the model has been applied to distributed hard real-time systems in the following straightforward manner (e.g., see [26, 31]): each network connection is abstracted as a real-time task (sharing a network resource), and the scheduling analysis incorporates worst-case blocking times potentially suffered when high-priority packets have to wait for transmission of lower-priority packets. Then, to some extent, the resulting global scheduling problem can be solved as a set of interrelated local resource-scheduling problems.

In [30], the classical model was extended to consider probabilistic execution times on uniprocessor systems. This is done by giving a nominal "hard" amount of execution time to each task instance, under the assumption that the task will usually complete within this time. But if the nominal time is exceeded, the excess requirement is treated like a sporadic arrival (via a method similar to that used in [19]).

In our previous work [8, 9] we relaxed the precondition that period and deadline parameters are always known ahead of time.
Rather, we used the system's end-to-end delay and jitter requirements to automatically derive each task's constraints; these, in turn, ensure that the end-to-end requirements will be met on a uniprocessor system. A similar approach for uniprocessor systems was explored in [2], where execution time budgets were automatically derived from the end-to-end delay requirements; the method used an imprecise computation technique as a metric to help gauge the "goodness" of candidate solutions.

These concepts were later modified for use in various application contexts. Recent results adapted the end-to-end theory to both discrete and continuous control problems (e.g., [18, 27]), where real-time constraints were derived from a set of control laws, and where the objective was to optimize the system's performance index while satisfying schedulability. Our original approach (from [8, 9]) was also used to produce schedules for real-time traffic over fieldbus networks [6, 7], where the switch priorities are synthesized to ensure end-to-end rate and latency guarantees. A related idea was pursued for radar processing domains [11], where an optimization method produces per-component processing rates and deadlines, based on the system's input pulse rate and its prescribed allowed latency.

End-to-end design becomes significantly more difficult in distributed contexts. Solving this
problem usually involves finding an answer to the following question: "Given an end-to-end latency budget, what is the optimal way to spend this budget on each pipeline hop?" Aside from the complexity of the basic decision problem, a solution also involves the practical issue of getting the local runtime schedulers to guarantee their piece-wise latencies. Results presented in [25] address this problem in a deterministic context: they extend our original uniprocessor method from [8] to distributed systems, by statically partitioning the end-to-end delays via heuristic optimization metrics. Similar approaches have been proposed for "soft" transactions in distributed systems [17], where each transaction's deadline is partitioned between the system's resources.

To our knowledge, this paper presents the first technique that achieves statistical real-time performance in a distributed system, by using end-to-end requirements to assign both periods and the execution time budgets. In this light, our method should be viewed less as a scheduling tool (it is not one), and more as an approach to the problem of real-time systems design.

To accomplish this goal, we assume an underlying runtime system that is capable of the following: (1) decreasing a task's completion time by increasing its resource share; (2) enforcing, for each resource, the proportional shares allocated for every task, up to some minimum quantization; and (3) within these constraints, isolating a task's real-time behavior from the other activities sharing its resource. In this regard, we build on many results that have been developed for providing OS-level reservation guarantees, and for rate-based, proportional-share queuing in networks. Since these concepts are integral to understanding the work in this paper, we treat them at some length.

In rate-based methods, tasks get allocated percentages of the available bandwidth.
Obviously these percentages cannot be maintained over infinitesimal time-intervals; rather, the proportional shares are serviced in an approximate sense, i.e., within some margin of error. The magnitude of the error is usually due to the following factors: (1) quantization (i.e., the degree to which the underlying system can multiplex traffic); and (2) priority-selection (i.e., the order in which tasks are selected for service). At higher levels of quantization, and as multiple streams share the same FIFO queues, service orders depart further from true proportional sharing.

Our analytical results rely on perhaps the oldest known variant of rate-based scheduling: time-division multiplexing, or TDM. In our TDM abstraction, a task is guaranteed a fixed number of "time-slots" over pre-defined periodic intervals (which we call frames). Our analytical techniques assume tasks have their time-slots reserved; i.e., if a task does not claim its load, the load gets wasted. We appeal to TDM for a basic reason: we need to handle an inherently stochastic workload model, in which tasks internally "decide" how much load they will need for specific instances. These load demands can be quite high for arbitrary instances; they may be minuscule for other instances. Moreover, once a task is started, we assume that its semantics mandate that it also needs to be finished. So, since unregulated workloads cannot simply be "re-shaped," and since end-to-end latency guarantees must still be honored, TDM ensures a reasonable level of fairness between different tasks on a resource, and between successive instances of the same task.

The downside of this scheme is that TDM ends up wasting unused load. Other rate-based disciplines solve this problem by re-distributing service over longer intervals, at a cost of occasionally postponing the projected completion times of certain tasks. Most of these disciplines, however,
were conceived for inherently regulated workload models, e.g., linear bounded arrival processes [3]. In such settings, transient unfairness is often smoothed out by simply "re-shaping" the departure process, i.e., by inserting delay stages.

Many algorithms have been developed to provide proportional-share service in high-speed networks, including the "Virtual Clock Method" [38], "Fair-Share Queuing" [4], "Generalized Processor Sharing" (or GPS) [23], and "Rate Controlled Static Priority Queuing" (or RCSP) [36]. These models have also been used to derive statistical delay guarantees; in particular, within the framework of RCSP (in [37]) and GPS (in [39]). Related results can be found in [5] (using a policy like "Virtual Clock" [38]), and in [34] (for FCFS, with a variety of traffic distributions). In [14], statistical service quality objectives are achieved via proportional-share queuing, in conjunction with server-guided backoff, where servers dynamically adjust their rates to help utilize the available bandwidth.

Recently, many of these rate-based disciplines have sprouted analogues for CPU scheduling. For example, Waldspurger et al. [32] proposed "Lottery Scheduling," which multiplexes available CPU load based on the relative throughput rates for the system's constituent tasks. The same authors also presented a deterministic variant of this, called "Stride Scheduling" [33]; this method provides an OS-level server for a method similar to the Weighted-Fair-Queuing (WFQ) discipline used in switches. WFQ, also known as "Packetized GPS" (or PGPS), is a discrete, quantized version of the fluid-flow abstraction used in GPS. Scheduling decisions in WFQ are made via "simulating" proportional-sharing for the tasks on the ready-queue, under an idealized model of continuous-time multiplexing. The task which would hypothetically finish first under GPS gets the highest priority, and is put on the run queue (until the next scheduling round). Stoica et al.
[29] proposed a related technique, which similarly uses a "virtual time-line" to determine the runtime dispatching order. This concept was also applied for hierarchical scheduling in [13], where multiple classes of tasks (e.g., hard and soft real-time applications) can coexist in the same system.

Several schemes have been proposed to guarantee processor capacity shares for the system's real-time tasks, and to simultaneously isolate them from overruns caused by other tasks in the system. For example, Mercer et al. [22] proposed a processor capacity reservation mechanism to achieve this, a method which enforces each task's reserved share within its reservation period, under micro-kernel control. Also, Shin et al. [28] proposed a reservation-based algorithm to guarantee the performance of periodic real-time tasks, and also to improve the schedulability of aperiodic tasks.

As noted above, many proportional-share methods have been subjected to response-time studies, for different types of arrival processes. This has been done for switches, CPUs, and for entire networks. Note that the problem of determining aggregate delay in a network is dual to the problem of assigning per-hop delays to achieve some end-to-end deadline. The latter is a "top-down" approach: the designer "tells" the network what its per-hop latencies should be, and then the network needs to guarantee those latencies. Delay analysis works in a "bottom-up" fashion: the network basically "tells" the user what the end-to-end delay will be, given the proportional-share allocations for the chain under observation.

While seemingly different, these problems are inextricably related. "Top-down" deadline-partitioning could not function without some way of getting "bottom-up" feedback. Similarly,
the "bottom-up" method assumes a pre-allocated load for the chain, which, in reality, is negotiated to meet the chain's end-to-end latency and throughput requirements. In real-time domains, solving one problem requires solving the other.

Deriving the end-to-end latency involves answering the following question: "If my chain flows through N nodes, each of which is managed by a rate-based discipline, what will the end-to-end response time be?" This issue is quite simple when all arrival processes are Poisson streams, service times are exponentially distributed, and all nodes use a FIFO service discipline. For a simple Jackson queuing network like this, many straightforward product-form techniques can be applied.

The question gets trickier for linearly regulated traffic, where each stream has a different arrival rate, with deviations bounded over different interval-sizes, and where each stream has different proportional service guarantees. Fortunately, compositional results do exist, and have been presented for various rate-based disciplines, for both deterministic [12, 24, 35] and statistical [39, 37] workloads. Deterministic, end-to-end per-connection delays were considered in [24] for leaky-bucket regulated traffic, using the PGPS scheduling technique. In [35] a similar study was performed using a non-work-conserving service discipline. Also, as noted above, statistical treatments have been provided for PGPS [39] and for RCSP [37].

In Section 4 we present an analytical approximation for our TDM abstraction, perhaps the extreme case of a non-work-conserving discipline. The method is used to estimate end-to-end delays over products of TDM queues, where a chain's load demand at a node can be generally distributed; where all tasks in a chain can have different PDFs; and where queue sizes are constrained to a single slot.

This problem is innately complex. Moreover, our design algorithm needs to test huge numbers of solution candidates before achieving the system objectives.
Hence, the delay analysis should be fast, and will consequently be coarse. At the same time, it cannot be too coarse; after all, it must be sufficiently accurate to expose key performance trends over the entire solution space. As shown in Section 4, we approach this problem in a compositional, top-down fashion. Our algorithm starts by analyzing a chain's head task in isolation. The resulting statistics are then used to help analyze the second task, etc., down the line, until delay and throughput metrics are obtained for the chain's output task, and hence for the chain as a whole.

3 Model and Solution Overview

As stated above, we model a system as a set of independent, pipelined task chains, with every task mapped to a designated CPU or network resource. The chain abstraction can, for example, capture the essential data path structure of a video conferencing system, or a distributed process control loop. Formally, a system possesses the following structure and constraints.

Bounded-Capacity Resources: There is a set of resources $r_1, r_2, \ldots, r_m$, where a given resource $r_i$ corresponds to one of the system's CPUs or network links. Associated with $r_i$ is a maximum allowable capacity, or $\rho^m_i$, which is the maximum load the resource can multiplex effectively. The parameter $\rho^m_i$ will typically be a function of its scheduling policy (as in the case of a workstation),
or its switching and arbitration policies (in the case of a LAN).

Figure 1: Example System Topology.

Task Chains: A system has n task chains, denoted $\Gamma_1, \Gamma_2, \ldots, \Gamma_n$, where the jth task in a chain $\Gamma_i$ is denoted $\tau_{i,j}$. Each computation on $\Gamma_i$ carries out an end-to-end transformation from its external input $X_i$ to its output $Y_i$. Also, a producer/consumer relationship exists between each connected pair of tasks $\tau_{i,j-1}$ and $\tau_{i,j}$, and we assume a one-slot buffer between each such pair, since the queuing policy chooses only the newest data in the buffer. Hence a producer may overwrite buffered data which is not consumed.

Stochastic Processing Costs: A task's cost is modeled via a discrete probability distribution function, whose random variable characterizes the time it needs for one execution instance on its resource.

Maximum Delay Bounds: $\Gamma_i$'s delay constraint, $MD_i$, is an upper bound on the time it should take for a computation to flow through the system and still produce a useful result. For example, if $MD_i = 500$ ms, then an output produced by $\Gamma_i$ at time $t$ will only be used if the corresponding input was sampled no earlier than $t - 500$ ms. Computed results that exceed this propagation delay are dropped.

Minimum Output Rates: $\Gamma_i$'s rate constraint, $MOR_i$, specifies a minimum acceptable average rate for outputs which meet their delay constraints. For example, if $MOR_i = 10$ Hz, then the chain $\Gamma_i$ must, on average, produce 10 outputs per second. Moreover, $MOR_i$ implicitly specifies the maximum possible frame-size for the tasks in $\Gamma_i$; e.g., if $MOR_i = 10$ Hz, then $\Gamma_i$'s maximum frame-size is 0.1 s, which would suffice only if an output were produced during every frame.

An Example. Consider the example shown in Figure 1, which possesses six chains, labeled $\Gamma_1$ through $\Gamma_6$. Here, rectangles denote shared resources, black circles denote tasks, and the shaded boxes are
external inputs and outputs. The system's resource requirements and end-to-end constraints are shown in Figure 2.

Resource Usage PDFs

For Tasks                                                             Derived From   E[t] (ms)   Var[t]   [Min, Max]   NumSteps
$\tau_{1,1}, \tau_{3,2}, \tau_{3,5}, \tau_{4,1}, \tau_{6,1}$          Normal         10          64       [4, 35]      10
$\tau_{1,3}, \tau_{2,3}, \tau_{3,6}, \tau_{3,8}, \tau_{4,2}$          Normal         20          100      [10, 50]     20
$\tau_{2,1}, \tau_{3,4}, \tau_{3,7}, \tau_{4,3}, \tau_{5,1}$          Exp            10          -        [0, 100]     30
$\tau_{2,4}, \tau_{2,6}, \tau_{3,1}, \tau_{4,5}, \tau_{6,2}$          Exp            20          -        [0, 200]     50
$\tau_{1,2}, \tau_{2,2}, \tau_{2,5}, \tau_{3,3}, \tau_{4,4}$          Normal         8           144      [2, 48]      20

End-To-End Constraints

Chain         $MD_i$ (ms)   $MOR_i$ (Hz)
$\Gamma_1$    240           10
$\Gamma_2$    1000          5
$\Gamma_3$    1000          5
$\Gamma_4$    700           5
$\Gamma_5$    100           5
$\Gamma_6$    300           5

Maximum Resource Utilizations

$\rho^m_1$   $\rho^m_2$   $\rho^m_3$   $\rho^m_4$   $\rho^m_5$   $\rho^m_6$   $\rho^m_7$   $\rho^m_8$   $\rho^m_9$   $\rho^m_{10}$
0.9          0.7          0.9          0.6          0.9          0.7          0.7          0.8          0.9          0.9

Figure 2: Constraints in Example.

In any system, a task's load demand varies stochastically, due to second-order effects like cache memory behavior, DMA interference, pipeline stalls, bus arbitration delays, transient head-of-line blocking, etc. By using one random variable to model a task's load, we essentially collapse all these residual effects into a single PDF, which also accounts for the task's idealized "best-case" execution time. In our model, any discrete probability distribution can be used for this purpose.

Two points should be articulated here, the first of which is fairly obvious: If the per-task load models are ill-founded, then the synthesis results will be of little use. Indeed, in some situations one cannot just convolve all the abovementioned effects into a single PDF. While it may be tempting to stochastically bundle "driver processing interference" with a task's load model (and while often one can do just that), in other situations this sort of factor needs to be represented explicitly, as a task. (Our method can easily accommodate this alternative.)
The process of obtaining a good model abstraction is non-trivial; it requires accounting for matters like causality (i.e., charging load deviations to the tasks which cause them), scale (i.e., comparing the task's loading statistics, mean and variance, to those of the residual effects charged to it), and sensitivity (i.e., statistically gauging the effect of load quantization on end-process results). These problems are well outside the scope of this paper; interested readers should consult [16] for a decent introduction to statistical performance modeling. However, while none of these problems is trivial, given sufficient time, patience and statistical competence, one can employ some standard techniques for handling all of them.

The second point is a bit more subtle, though equally true: For the purposes of design, a coarse load model, represented with a single stationary distribution, is better than no load model. In

















Figure 3: Design Overview

per second. This, in turn, means $\tau_{3,1}$'s frame can also be no greater than 200 ms. But if the frame is exactly 200 ms, the task induces a utilization of 1.0 on resource $r_1$, exceeding the resource's intrinsic 0.9 limit and disallowing any capacity for other tasks hosted on it.

3.1 Run-Time Model

Within the system model, all tasks in chain $\Gamma_i$ are considered to be scheduled in a quasi-cyclic fashion, using a time-division multiplexing abstraction for resource-sharing, over $F_i$-sized frames. That is, all of $\Gamma_i$'s load-shares are guaranteed for $F_i$ intervals on all constituent resources. Hence, the synthesis algorithm's job is to (1) assign each task $\tau_{i,j}$ a proportion of its resource's capacity (which we denote as $u_{i,j}$) and (2) assign a global $F_i$ frame for $\Gamma_i$. Given this, $\tau_{i,j}$'s runtime behavior can be described as follows:

(1) Within every $F_i$ frame, $\tau_{i,j}$ can use up to $u_{i,j}$ of its resource's capacity. This is policed by assigning $\tau_{i,j}$ an execution-time budget $E_{i,j} = \lfloor u_{i,j} \cdot F_i \rfloor$; that is, $E_{i,j}$ is an upper bound on the amount of resource time provided within each $F_i$ frame, truncated to discrete units. (We assume that the system cannot keep track of arbitrarily fine granularities of time.)

(2) A particular execution instance of $\tau_{i,j}$ may require multiple frames to complete, with $E_{i,j}$ of its running time expended in each frame.

(3) A new instance of $\tau_{i,j}$ will be started within a frame if no previous instance of $\tau_{i,j}$ is still running, and if $\tau_{i,j}$'s input buffer contains new data that has not already exceeded $MD_i$.
This is policed by a time-stamp mechanism, put on the computation when its input is sampled.

Due to a chain's pipeline structure, note that if there are $n_i$ tasks in $\Gamma_i$, then we must have $MD_i \ge n_i \cdot F_i$, since data has to flow through all $n_i$ elements to produce a computation.

3.2 Solution Overview

A schematic of the design process is illustrated in Figure 3, where the main steps are as follows: (1) partitioning the CPU and network capacity between the tasks; (2) selecting each chain's frame
to optimize its output rate; and (3) checking the solution via simulation, to verify the integrity of the approximations used, and to ensure that each chain's output profile is sufficiently smooth (i.e., not bursty).

The partitioning algorithm processes each chain $\Gamma_i$ and finds a candidate load-assignment vector for it, denoted $u_i$. (An element $u_{i,j}$ in $u_i$ contains the load allocated to $\tau_{i,j}$ on its resource.) Given a load assignment for $\Gamma_i$, the synthesis algorithm attempts to find a frame $F_i$ at which $\Gamma_i$ achieves its optimal output rate. This computation is done approximately: For a given $F_i$, a rate estimate is derived by (1) treating all of $\Gamma_i$'s outputs uniformly, and deriving an i.i.d. per-frame "success probability" $\lambda_i$; and (2) then simply multiplying $\lambda_i \cdot \frac{1}{F_i}$ to approximate the chain's output rate, $OR_i$. If $OR_i$ is lower than the $MOR_i$ requirement, the load assignment vector is increased, and so on. Finally, if sufficient load is found for all chains, the resulting system is simulated to ensure that the approximations were sound, after which excess capacity can be given to any chain, with the hope of improving its overall rate.

4 Throughput Analysis

In this section we describe how we approximate $\Gamma_i$'s output rate, $OR_i$, given candidate load and frame parameters ($F_i$ and $u_i$) for the chain. Then in Section 5, we show how we make use of this technique to derive the $F_i$ and $u_i$ parameters for every chain in the system.

Assume we are currently processing $\Gamma_i$, which has some frame-size $F_i$ and load vector $u_i$. How do we estimate its output probability $\lambda_i$?
Recall that outputs exceeding the maximum allowed delay $MD_i$ are not counted; hence, we need some way of determining latency through the system. One benefit of proportional-share queuing is as follows: since each chain is effectively isolated from others over $F_i$ intervals of observation, we can analyze the behavior of $\Gamma_i$ independently, without worrying about head-of-line blocking effects from other components.

We use a discrete-time model, in which the time units are in terms of a chain's frame-size; i.e., our discrete domain $\{0, 1, 2, \ldots\}$ corresponds to the real times $\{0, F_i, 2F_i, \ldots\}$. Not only does this reduction make the analysis more tractable, it also corresponds to "worst-case" conditions: since the underlying system may schedule a task execution at any time within an $F_i$ frame, we assume that input may be read as early as the beginning of a frame, and output may be produced as late as the end of a frame. And with one exception, a chain's states of interest do occur at its frame boundaries. That exception is in modeling aggregate delay, which in our discrete time domain we treat as $\lfloor MD_i / F_i \rfloor$. Hence the fractional part of the last frame is ignored, leading to a tighter notion of success (and consequently erring on the side of conservatism).

Theoretically, we could model a computation's delay by constructing a stochastic process for the chain as a whole, and solving it for all possible delay probabilities. But this would probably be a futile venture for even smaller chains; after all, such a model would have to simultaneously keep track of each task's local behavior. And since a chain may hold as few as 0 ongoing computations, and as many as one computation per task, it is easy to see how the state-space would quickly explode.
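The frame-based treatment of the delay bound can be made concrete with a small sketch. The helper names below are illustrative only and are not part of the design tool described in this paper; integer millisecond units are assumed. The only relations used are $d_i = \lfloor MD_i / F_i \rfloor$ and the pipeline condition $MD_i \ge n_i \times F_i$.

```python
def delay_bound_frames(max_delay_ms: int, frame_ms: int) -> int:
    """Delay bound in whole frames: floor(MD_i / F_i)."""
    return max_delay_ms // frame_ms


def truncated_delay_ms(max_delay_ms: int, frame_ms: int) -> int:
    """Portion of MD_i that the discrete-time model ignores."""
    return max_delay_ms - delay_bound_frames(max_delay_ms, frame_ms) * frame_ms


def pipeline_feasible(max_delay_ms: int, frame_ms: int, n_tasks: int) -> bool:
    """Pipeline condition MD_i >= n_i * F_i."""
    return max_delay_ms >= n_tasks * frame_ms
```

With a bound of 300 ms, a 60 ms frame divides the bound exactly (5 frames, nothing truncated), whereas a 70 ms frame yields 4 frames and leaves 20 ms of the bound unused, which is the conservatism noted above.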
Instead, we construct a model in a compositional (albeit inexact) manner, by processing each task locally, and using the results for its successors. Consider the following diagram, which portrays the flow of a computation at a single task:

[Diagram: an input produced by $\tau_{i,j-1}$ (with data-age $age_{i,j-1}$ and inter-output time $Out_{i,j-1}$) incurs blocking time $B_{i,j}$ and processing time $\Psi_{i,j}$ at $\tau_{i,j}$, which then emits an output with data-age $age_{i,j}$ and inter-output time $Out_{i,j}$.]

These random variables are defined as follows:

1. Data-age ($age_{i,j}$): This variable charts a computation's total accumulated time, from entering $\Gamma_i$'s head, to leaving $\tau_{i,j}$.
2. Blocking time ($B_{i,j}$): The duration of time an input is buffered, waiting for $\tau_{i,j}$ to complete its current computation.
3. Processing time ($\Psi_{i,j}$): If $t_{i,j}$ is a random variable ranging over $\tau_{i,j}$'s PDF, then $\Psi_{i,j} \stackrel{\mathrm{def}}{=} \lceil t_{i,j} / E_{i,j} \rceil$ is the corresponding variable in units of frames.
4. Inter-output time ($Out_{i,j}$): An approximation of $\tau_{i,j}$'s inter-output distribution, in terms of frames; it measures the time between two successive outputs.

We assume data is always ready at the chain's head; hence $age_{i,j}$ can be approximated via the following recurrence relation:

\[ age_{i,1} = \Psi_{i,1}, \qquad age_{i,j} = age_{i,j-1} + B_{i,j} + \Psi_{i,j} \]

And for $j > 1$, we approximate the entire $age_{i,j}$ distribution by assuming the three variables to be independent, i.e.,

\[ \Pr[age_{i,j} = k] = \sum_{k_1 + k_2 + k_3 = k} \Pr[age_{i,j-1} = k_1] \, \Pr[B_{i,j} = k_2] \, \Pr[\Psi_{i,j} = k_3] \]

Note that $\tau_{i,j}$'s success probability, $\lambda_{i,j}$, will then be $1 / E[Out_{i,j}]$, i.e., the probability that a (non-stale) output is produced during a random frame. After processing the final task in the chain, $\tau_{i,n}$, we can approximate the end-to-end success probability, which is just $\tau_{i,n}$'s output probability, appropriately scaled by the probability of excessive delay injected during $\tau_{i,n}$'s execution:

\[ \lambda_i = \lambda_{i,n} \times \Pr[age_{i,n} \le d_i] \]

where $d_i = \lfloor MD_i / F_i \rfloor$ is the delay bound in frames. At this point the end-to-end success rate is estimated as $OR_i = \lambda_i \times \frac{1}{F_i}$.

Note that our method is "top-down," i.e., statistics are derived for $\tau_{i,1}$, then $\tau_{i,2}$ (using the synthesized metrics from $\tau_{i,1}$), then $\tau_{i,3}$, etc. Also note that when processing $\tau_{i,1}$, we already have
all the information we require, namely its PDF. In other words, we trivially have $\Pr[Out_{i,1} = t] = \Pr[\Psi_{i,1} = t]$, and thus $\lambda_{i,1}$ is $1 / E[\Psi_{i,1}]$. Since $\tau_{i,1}$ can retrieve fresh input whenever it is ready (and therefore incurs no blocking time), $\tau_{i,1}$ can execute a new phase whenever it finishes with the previous phase.

Blocking-Time. Obtaining reasonable blocking-time metrics at each stage is a non-trivial affair, especially when longer-tailed distributions are involved. In carrying out the analysis, we synthesize a stochastic process whose states describe $\tau_{i,j}$'s remaining busy-time on a random input arrival. We describe the structure via a simple Markov chain, $X_{i,j}(t)$, which describes random states of $\tau_{i,j}$. Specifically, we are interested in capturing (1) $\tau_{i,j}$'s remaining execution time until the next task instance; and (2) whether an input is to be processed or dropped.

$X_{i,j}(t)$'s transitions are shown in Figure 4 (where the maximum execution time is $6F_i$). The transitions are event-based, i.e., they are triggered by new inputs from $\tau_{i,j-1}$. On the other hand, states measure the remaining time left in a current execution, if there is any. In essence, moving from state $k$ to $l$ denotes that (1) a new input was just received; and (2) it will induce a blocking time of $l$ frames. For the sake of our analysis, we distinguish between three different outcomes on moving from state $k$ to state $l$. In the transition descriptions, we use the term $d_i$ to denote the end-to-end delay bound, in terms of frames, i.e., $d_i = \lfloor MD_i / F_i \rfloor$.

Figure 4: Chain $X_{i,j}(t)$, $\max(\Psi_{i,j}) = 6$ (states 0 through 5).

(1) Dropping $[k \to l]$: The task is currently executing, and there is already another input queued up in its buffer, which was calculated to induce a blocking-time of $k$. The new input will overwrite it, and induce $l$ blocking-time frames.

\[ P^d[k \to l] = \Pr[Out_{i,j-1} = k - l] \qquad (k > l \ge 0) \]

(2) Failure $[k \to 0]$: A new input arrives, but it will be too stale to get processed by the finish-time of the current execution.

\[ P^f[k \to 0] = \Pr[Out_{i,j-1} > k] \, \Pr[age_{i,j-1} > d_i - k] \]

(3) Success $[k \to l]$: A new input arrives, and will get processed with blocking time $l$. Figure 5 illustrates a case of a successful transition.

Figure 5: Transition $k \to l$, with $0 < k$, $0 < l$. [Timeline of $\tau_{i,j-1}$ and $\tau_{i,j}$ marking $k$, $t_1$, $t_2$, $l$ and $Out_{i,j-1}$.]

Case A: Destination state is 0.

\[ P^s[k \to 0] = \sum_{t > 0} \Pr[\Psi_{i,j} = t] \, \Pr[Out_{i,j-1} \ge t + k] \, \Pr[age_{i,j-1} \le d_i - k] \]

Case B: Destination state $l > 0$.

\[ P^s[k \to l] = \sum_{t > l} \Pr[\Psi_{i,j} = t] \, \Pr[Out_{i,j-1} = t + k - l] \, \Pr[age_{i,j-1} \le d_i - k] \]

We distinguish between outcomes (1)-(3) via partitioning the state-transition matrix, i.e., $\mathbf{P}_d$, $\mathbf{P}_f$ and $\mathbf{P}_s$ denote the transition matrices for dropping, failure and success, respectively. Each is calculated in terms of parameters we discussed above, i.e., $age_{i,j-1}$, $Out_{i,j-1}$ and $\Psi_{i,j}$.

After getting the complete transition matrix, $\mathbf{P} = \mathbf{P}_d + \mathbf{P}_f + \mathbf{P}_s$, we solve for steady-state probabilities in the usual fashion, i.e., for $x_{i,j} = x_{i,j} \cdot \mathbf{P}$. In turn, the steady-state probabilities are used to derive $\tau_{i,j}$'s per-frame success probability

\[ \lambda_{i,j} = \lambda_{i,j-1} \cdot \left( \sum_{k \ge 0} (x_{i,j} \cdot \mathbf{P}_s)[k] \right) \]

where the $k$ index denotes the $k$th element in the resulting vector. In essence, the calculation just computes the probability of (1) having an input to read at a random frame, and (2) successfully processing it, which is obtained by summing up the successful out-flow probabilities. The same simple Bayesian method is used to obtain a stationary blocking-time PDF:

\[ \Pr[B_{i,j} = k] = \frac{\sum_{l \ge 0} \left( x_{i,j}[k] \cdot \mathbf{P}_s[k, l] \right)}{\sum_{k' \ge 0} (x_{i,j} \cdot \mathbf{P}_s)[k']} \]

The final ingredient is to derive task $\tau_{i,j}$'s inter-output distribution, $Out_{i,j}$. To do this we use a coarse mean-value analysis: after $\tau_{i,j}$ produces an output, we know that it goes through an idle phase (waiting for fresh input from its producer), followed by a busy phase, culminating in another
output. Let $Idle_{i,j}$ be a random variable which counts the number of idle cycles before the busy phase. Then we have:

\[ E[\Psi_{i,j} + Idle_{i,j}] = \frac{1}{\lambda_{i,j}} \implies E[Idle_{i,j}] = \frac{1}{\lambda_{i,j}} - E[\Psi_{i,j}] \]

Using this information, we model the event denoting "compute-start" as a pure Bernoulli decision in probability $ST_{i,j}$, where $ST_{i,j} = \frac{1}{E[Idle_{i,j}] + 1}$, i.e., after an output has been delivered, $ST_{i,j}$ is the probability that a random cycle starts the next busy phase. We then approximate the idle durations via a modified geometric distribution:

\[ \Pr[Idle_{i,j} = l] = (1 - ST_{i,j})^l \cdot ST_{i,j} \]

Then we derive the distribution for $Out_{i,j}$ as:

\[ \Pr[Out_{i,j} = k] = \sum_{0 \le k_1 < k} \Pr[Idle_{i,j} = k_1] \, \Pr[\Psi_{i,j} = k - k_1] \]

Finally, we approximate the end-to-end success probability, which is just $\tau_{i,n}$'s output probability, appropriately scaled to account for excessive delay injected during $\tau_{i,n}$'s execution:

\[ \lambda_i \stackrel{\mathrm{def}}{=} \lambda_{i,n} \times \Pr[age_{i,n} \le d_i] \]

Hence, by definition, the end-to-end output rate is given as follows:

\[ OR_i = \lambda_i \times \frac{1}{F_i} \]

Example. As an example, we perform throughput estimation on $\Gamma_6$, assuming system parameters of $F_6 = 60$ ms, $E_{6,1} = 6$ ms, and $E_{6,2} = 30$ ms. Recall that the delay bound for the chain is $MD_6 = 300$ ms; thus $d_6 = \lfloor MD_6 / F_6 \rfloor = 5$. Within our head-to-tail approach, we first have to consider task $\tau_{6,1}$. Recall, however, that the distributions for both $age_{6,1}$ and $Out_{6,1}$ are identical to $\Psi_{6,1}$, the quantized load distribution. Moreover, we also have that $\lambda_{6,1} = 1 / E[\Psi_{6,1}] = 0.3291$.

Next we consider the second (and last) task $\tau_{6,2}$. The PDFs for $\Psi_{6,2}$, for the Markov chain's steady states, for the blocking times, and for $age_{6,2}$ are tabulated for this task.
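The per-task quantities used in this analysis (quantized load, data-age convolution, steady-state solution, and inter-output distribution) can be computed mechanically. The following Python fragment is a minimal sketch under stated assumptions, not the authors' tool: all function names are hypothetical, PMFs are plain lists indexed by frame count, and the steady-state vector is found by power iteration, which assumes the transition matrix is row-stochastic with a unique stationary distribution. Construction of the dropping, failure and success matrices is omitted.

```python
from math import ceil


def convolve(p, q):
    """PMF of the sum of two independent variables, with p[k] = Pr[X = k]."""
    out = [0.0] * (len(p) + len(q) - 1)
    for a, pa in enumerate(p):
        for b, qb in enumerate(q):
            out[a + b] += pa * qb
    return out


def mean(p):
    return sum(k * pk for k, pk in enumerate(p))


def quantize(exec_time_pmf, budget):
    """Psi = ceil(t / E): map an execution-time PMF {t: prob} to whole frames."""
    frames = {}
    for t, prob in exec_time_pmf.items():
        k = ceil(t / budget)
        frames[k] = frames.get(k, 0.0) + prob
    return [frames.get(k, 0.0) for k in range(max(frames) + 1)]


def data_age(age_prev, blocking, psi):
    """age_j = age_{j-1} + B_j + Psi_j, treating the three terms as independent."""
    return convolve(convolve(age_prev, blocking), psi)


def steady_state(P, iterations=1000):
    """Approximate x = x P by power iteration (assumes P is row-stochastic, ergodic)."""
    n = len(P)
    x = [1.0 / n] * n
    for _ in range(iterations):
        x = [sum(x[k] * P[k][l] for k in range(n)) for l in range(n)]
        total = sum(x)
        x = [v / total for v in x]
    return x


def success_probability(success_prev, x, Ps):
    """lambda_j = lambda_{j-1} * sum_k (x Ps)[k]."""
    n = len(x)
    return success_prev * sum(x[k] * Ps[k][l] for k in range(n) for l in range(n))


def inter_output(psi, success, horizon):
    """Pr[Out = k] for k < horizon, using the geometric idle-time approximation.

    Equals the sum over 0 <= k1 < k in the text whenever psi[0] == 0.
    """
    idle_mean = max(0.0, 1.0 / success - mean(psi))  # E[Idle], clamped at 0
    start = 1.0 / (idle_mean + 1.0)                  # ST
    idle = [(1.0 - start) ** l * start for l in range(horizon)]
    return convolve(idle, psi)[:horizon]
```

For a chain's first task, `inter_output(psi, 1 / mean(psi), h)` reduces to `psi` itself (padded with zeros), which matches the observation that $Out_{i,1}$ and $\Psi_{i,1}$ share a distribution.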
$k$    $\Pr[\Psi_{6,2}]$    $x_{6,2}[k]$    $\Pr[B_{6,2}]$    $\Pr[age_{6,2}]$
0      0.0                  0.975           0.980             0.0
1      0.7543               0.019           0.017             0.0
2      0.1968               0.0045          0.002             0.0
3      0.03751              0.0009          0.00009           0.2658
4      0.0098               0.0002          0.0               0.3400
5      0.0019               0.00003         0.0               0.2446
6      0.0005               0.000001        0.0               0.1153
7      0.00008              0.0             0.0               0.0269
8      0.0                  0.0             0.0               $5.7 \times 10^{-3}$
9      0.0                  0.0             0.0               $1.3 \times 10^{-3}$
10     0.0                  0.0             0.0               $2.7 \times 10^{-4}$
11     0.0                  0.0             0.0               $5.3 \times 10^{-5}$
12     0.0                  0.0             0.0               $5.8 \times 10^{-6}$

Now, summing up the successful out-flow probabilities, we have

\[ \sum_{k \ge 0} (x_{6,2} \cdot \mathbf{P}_s)[k] = 0.9804 \]

and hence the chain's i.i.d. success probability is obtained as follows:

\[ \lambda_{6,2} = \lambda_{6,1} \times 0.9804 = 0.3228 \]
\[ \Pr[age_{6,2} \le d_6] = 0.8506 \]
\[ \lambda_6 = \lambda_{6,2} \times \Pr[age_{6,2} \le d_6] = 0.2745 \]

Multiplying by the frame-rate, we get $OR_6 = 0.2745 \times \frac{1000}{60} \approx 4.574$.

5 System Design Process

We now revisit the "high-level" problem of determining the system's parameters, with the objective of satisfying each chain's performance requirements. As stated in the introduction, the design problem may be viewed as two inter-related sub-problems:

1. Load Assignment. Given a set of chains, how should the CPU and network load be partitioned among the set of tasks, so that the performance requirements are met?
2. Frame Assignment. Given a load-assignment to the tasks in the chain, what is the optimal frame for the chain, such that the effective throughput is maximized?

Note that load-allocation is the main "inter-chain" problem here, while frame-assignment can be viewed strictly as an "intra-chain" issue. With our time-division abstraction, altering a chain's frame-size will not affect the (average) rates of other chains in the system.

Consider the synthesis algorithm in Figure 6, and note that the $F_i$'s (expressed in milliseconds) are initialized to the largest frame-sizes that could achieve the desired output rates. Here $F$
denotes the global time-scale; in our example it was chosen as 1000, since all units were in milliseconds. Also note that for any task $\tau_{i,j}$, its resource share $u_{i,j}$ is initialized to accommodate the corresponding mean response time $E[t_{i,j}]$. (The system could only be solved with these initial parameters if all execution times were constant.)

    Synthesize(): returns {(u_i, F_i, OR_i) : 1 <= i <= n}
    (1)  F_i <- ceil(F / MOR_i)                          (for all 1 <= i <= n)
    (2)  u_{i,j} <- E[t_{i,j}] / F_i                     (for all tau_{i,j})
    (3)  rho_k <- sum of u_{i,j} over resource(tau_{i,j}) = k   (for all 1 <= k <= m)
    (4)  S <- {Gamma_i : 1 <= i <= n}
    (5)  while (S != {}) {
    (6)      Find the tau_{i,j} in Gamma_i in S, s.t. r_k = resource(tau_{i,j}), which maximizes
    (7)          w_{i,j} <- H(MOR_i, OR_i, rho_k^m, rho_k, u_{i,j})
    (8)      if (rho_k > rho_k^m - Delta)
    (9)          return Failure
    (10)     u_{i,j} <- u_{i,j} + Delta
    (11)     rho_k <- rho_k + Delta
    (12)     (OR_i, F_i) <- Get_Frame(Gamma_i, F_i, u_i)
    (13)     if (OR_i >= MOR_i)
    (14)         S <- S - {Gamma_i}
         }
    (15) return ({(u_i, F_i, OR_i) : 1 <= i <= n})

    Get_Frame(Gamma_i, F_i, u_i): returns (OR_i, F_i)
    (1)  OR_i <- get_rate(Gamma_i, F_i, u_i)
    (2)  for (t <- F_i - 1; t > 0; t <- t - 1) {
    (3)      if ((MD_i mod t) < (epsilon * MD_i)) {
    (4)          OR'_i <- get_rate(Gamma_i, t, u_i)
    (5)          if (OR'_i > OR_i) {
    (6)              OR_i <- OR'_i
    (7)              F_i <- t
             }
         }
         }
    (8)  return (OR_i, F_i)

Figure 6: Synthesis Algorithm

Load Assignment. Load-assignment works by iteratively refining the load vectors (the $u_i$'s), until a feasible solution is found. The entire algorithm terminates when the output rates for all chains meet their performance requirements, or when it discovers that no solution is possible. We do not employ backtracking, and a task's load is never reduced. This means the solution space is not searched exhaustively, and in some tightly-constrained systems, potential feasible solutions may not be found.

Load-assignment is task-based, i.e., it is driven by assigning additional load to the task estimated to need it the most. The heart of the algorithm can be found on lines (6)-(7), where all of the remaining unsolved chains are considered, with the objective of assigning additional load to the "most deserving" task in one of those chains. This selection is made using a heuristic weight $w_{i,j}$, reflecting the potential benefit of increasing $\tau_{i,j}$'s utilization, in the quest of increasing the chain's end-to-end performance.
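The greedy structure of the Synthesize procedure in Figure 6 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the implementation used for the results in this paper: the data layout, the function name, and the callables `weight` and `get_frame` are hypothetical stand-ins, and two details left open by the figure are filled in by assumption (an initial rate estimate is taken before the loop, and chains that already meet their requirement are skipped).

```python
from math import ceil


def synthesize(chains, cap_max, weight, get_frame, delta=0.05, time_scale=1000):
    """Greedy load assignment in the spirit of Figure 6.

    chains:    {name: {"mor": rate, "tasks": [(resource, mean_exec_time), ...]}}
    cap_max:   {resource: maximum usable capacity}
    weight:    callable(mor, rate, cap_max, cap_used, load) -> float
    get_frame: callable(name, frame, loads) -> (rate, frame)
    Returns {name: (loads, frame, rate)}, or None if no solution is found.
    """
    frames = {c: ceil(time_scale / s["mor"]) for c, s in chains.items()}
    loads = {c: [e / frames[c] for _, e in s["tasks"]] for c, s in chains.items()}
    used = {r: 0.0 for r in cap_max}
    for c, s in chains.items():
        for (r, _), u in zip(s["tasks"], loads[c]):
            used[r] += u

    # Assumption: an initial rate estimate is computed before the main loop.
    rates = {}
    for c in chains:
        rates[c], frames[c] = get_frame(c, frames[c], loads[c])
    # Assumption: chains that already meet their requirement are not revisited.
    unsolved = {c for c in chains if rates[c] < chains[c]["mor"]}

    def resource_of(c, j):
        return chains[c]["tasks"][j][0]

    while unsolved:
        candidates = [(c, j) for c in sorted(unsolved) for j in range(len(loads[c]))]
        c, j = max(candidates, key=lambda cj: weight(
            chains[cj[0]]["mor"], rates[cj[0]],
            cap_max[resource_of(*cj)], used[resource_of(*cj)],
            loads[cj[0]][cj[1]]))
        r = resource_of(c, j)
        if used[r] > cap_max[r] - delta:
            return None  # no room left on the selected task's resource
        loads[c][j] += delta  # loads only grow; there is no backtracking
        used[r] += delta
        rates[c], frames[c] = get_frame(c, frames[c], loads[c])
        if rates[c] >= chains[c]["mor"]:
            unsolved.discard(c)
    return {c: (loads[c], frames[c], rates[c]) for c in chains}
```

Because a load is never reduced, each iteration consumes `delta` of some resource, so the loop ends either with every chain satisfied or with a failure once the chosen resource is exhausted.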
The weight actually combines three factors, each of which plays a part in achieving feasibility: (1) the additional output rate required, normalized via range-scaling to the interval [0,1]; (2) the current/maximum capacities for the task's resource (where current capacity is denoted as $\rho_k$ for resource $r_k$); and (3) the task's current load assignment. The idea is that a high load assignment indicates diminishing returns are setting in, and working on a chain's other tasks would probably be more beneficial. For the results in this paper, the heuristic we used was:

\[ H(MOR_i, OR_i, \rho_k^m, \rho_k, u_{i,j}) = \frac{MOR_i - OR_i}{MOR_i} \cdot (\rho_k^m - \rho_k) \cdot \frac{1}{u_{i,j}} \]

Then the selected task gets its utilization increased by some tunable increment $\Delta$. Smaller increments will obviously lead to a greater likelihood of finding feasible solutions; however, they also incur a higher cost. (For the results presented in this paper, we set $\Delta = 0.05$.)

After additional load is given to the selected task, the chain's new frame-size and rate parameters are determined; if it meets its minimum output requirements, it can be removed from further consideration.

Frame Assignment. "Get_Frame" derives a feasible frame (if one exists). While the problem of frame-assignment seems straightforward enough, there are a few non-linearities to surmount. First, the true, usable load for a task is given by $\lfloor u_{i,j} \cdot F_i \rfloor / F_i$, due to the fact that the system cannot multiplex load at arbitrarily fine granularities of time. Second, in our analysis, we assume that a computation's delay is rounded up to a whole number of frames (equivalently, the bound $MD_i$ is truncated to $\lfloor MD_i / F_i \rfloor$ frames), which errs on the side of conservatism.

The negative effect of the second factor is likely to be higher at larger frames, since it results in truncating the fractional part of a computation's final frame. On the other hand, the first factor becomes critical at smaller frames. Hence, the approximation utilizes a few simple rules. Since loads are monotonically increased, we restrict the search to frames which are lower than the current one.
Further, we restrict the search to situations where our frame-based delay estimate truncates no more than $\epsilon \times 100$ percent of the continuous-time deadline, where $\epsilon \ll 1$. Subject to these guidelines, frames are evaluated via the throughput analysis presented in Section 4, which determines the current $OR_i$ metric.

Design Process - Solution of the Example. We ran the algorithm in Figure 6 to find a feasible solution, and the result is presented in Figure 7. On a SPARC Ultra, the algorithm synthesized parameters for the example in approximately 30 minutes of wall-clock time. Note that the resources have capacity to spare; $r_1$ has the highest load in this configuration (at 0.72), $r_7$ has the lowest (at 0.35); most others are around 50% loaded. The spare load can be used to increase any chain's output rate, if desired, or held for other chains to be designed in later.
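The load-assignment heuristic and the frame search described above can be sketched as follows. This is a hedged illustration: the function names are hypothetical, `get_rate` is a caller-supplied stand-in for the throughput analysis of Section 4, and the default truncation threshold `eps=0.05` is an assumed value chosen only for the example (the paper states only that the threshold is much smaller than 1).

```python
from math import floor


def heuristic_weight(mor, rate, cap_max, cap_used, load):
    """H = (MOR - OR) / MOR * (rho_max - rho) * 1 / u."""
    return (mor - rate) / mor * (cap_max - cap_used) / load


def usable_load(load, frame):
    """floor(u * F) / F: load deliverable when per-frame budgets are integral."""
    return floor(load * frame + 1e-9) / frame  # small guard against float round-off


def candidate_frames(max_delay, frame, eps):
    """Frames below `frame` that truncate less than eps * MD of the deadline."""
    return [t for t in range(frame - 1, 0, -1) if max_delay % t < eps * max_delay]


def search_frame(max_delay, frame, loads, get_rate, eps=0.05):
    """Try smaller frames and keep the one with the best estimated rate."""
    best_rate, best_frame = get_rate(frame, loads), frame
    for t in candidate_frames(max_delay, frame, eps):
        rate = get_rate(t, loads)
        if rate > best_rate:
            best_rate, best_frame = rate, t
    return best_rate, best_frame
```

With a toy rate function such as `lambda f, loads: min(usable_load(u, f) for u in loads) * 1000.0 / f`, the search shows both non-linearities at work: very small frames lose their entire budget to truncation, while large frames cap the achievable rate.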
A. Synthesized Solutions for Chains.

Chain         $F_i$   $OR_i$   $u_i$
$\Gamma_1$    3       11.33    [$u_{1,1} = 0.33$; $u_{1,2} = 0.33$; $u_{1,3} = 0.33$]
$\Gamma_2$    16      5.50     [$u_{2,1} = 0.188$; $u_{2,2} = 0.188$; $u_{2,3} = 0.25$; $u_{2,4} = 0.188$; $u_{2,5} = 0.188$; $u_{2,6} = 0.25$]
$\Gamma_3$    5       5.26     [$u_{3,1} = 0.2$; $u_{3,2} = 0.2$; $u_{3,3} = 0.2$; $u_{3,4} = 0.2$; $u_{3,5} = 0.2$; $u_{3,6} = 0.2$; $u_{3,7} = 0.2$; $u_{3,8} = 0.2$]
$\Gamma_4$    20      5.47     [$u_{4,1} = 0.25$; $u_{4,2} = 0.2$; $u_{4,3} = 0.2$; $u_{4,4} = 0.15$; $u_{4,5} = 0.25$]
$\Gamma_5$    10      5.39     [$u_{5,1} = 0.10$]
$\Gamma_6$    5       6.91     [$u_{6,1} = 0.20$; $u_{6,2} = 0.20$]

B. Resource Capacity Used by System.

Resource $r_k$            1      2      3      4      5      6      7      8      9      10
Capacity used $\rho_k$    0.72   0.39   0.45   0.4    0.65   0.52   0.35   0.62   0.65   0.65

Figure 7: Synthesized Solution of the Example System.

6 Simulation

Since the throughput analysis uses some key simplifying approximations, we validate the resulting solution via a simulation model.

Recall that the analysis injects imprecision for the following reasons. First, it tightens all (end-point) output delays by rounding up the fractional part of the final frame. Analogously, it assumes that a chain's state-changes always occur at its frame boundaries; hence, even intermediate output times are assumed to take place at the frame's end. A further approximation is inherent in our compositional data-age calculation, i.e., we assume the per-frame output ratios from predecessor tasks are i.i.d., allowing us to solve the resulting Markov chains in a quasi-independent fashion.

The simulation model discards these approximations, and keeps track of all tagged computations through the chain, as well as the "true" states they induce in their participating tasks. Also, the clock progresses along the real-time domain; hence, if a task ends in the middle of a frame, its output gets placed in the successor's input buffer at that time. Also, the simulation model schedules resources using a modified deadline-monotonic dispatcher (where a deadline is considered the end of a frame), so higher-priority tasks will get to run earlier than the analytical method assumes.
Recall that the analysis implicitly assumes that computations may take place as late as possible, within a given frame.

On the other hand, the simulator does inherit some other simplifications used in our design. For example, input-reading is assumed to happen when a task gets released, i.e., at the start of a frame. As in the analysis, context switch overheads are not considered; rather, they are charged to the load distributions. Figure 8 summarizes the differences between the two models.

                      Simulation                        Analysis
Maximum delay         $MD_i$                            $\lfloor MD_i / F_i \rfloor \times F_i$
State change          Measured                          Frame boundary
Output rate           Measured                          i.i.d.
Scheduling            Frame-based deadline monotonic    As late as possible
Data reading time     Start of a frame                  Start of a frame

Figure 8: Comparison between Analysis model and Simulation model

Validation of the Design. The following table compares the analytical throughput estimates with those derived via simulation. Simulated rates are displayed with 95% confidence intervals over 100 trials, where each trial ran over 100,000 frames (for the largest frame in the system). The last two columns show the standard deviation for output-rates calculated over $t$-time moving averages w.r.t. the simulated $OR_i$, where $t \in \{0.5, 1.0\}$. This means that for 1-second intervals our tool does the following: (1) $\Gamma_i$'s output rate is charted over all 1-second intervals; (2) each interval's deviation from $OR_i$ is calculated; (3) the sum of squared deviations is obtained, and then divided by the degrees of freedom in the sample; (4) the square root of the result is produced.

Chain         $F_i$ (ms)   $MOR_i$ (Hz)   $OR_i$ Analysis (Hz)   $OR_i$ Simulated (Hz)   $\sigma^t_{OR_i}$, $t = 1$ sec.   $\sigma^t_{OR_i}$, $t = 0.5$ sec.
$\Gamma_1$    3            10             11.33                  11.44 ± 0.023           2.06                              3.06
$\Gamma_2$    16           5              5.50                   5.47 ± 0.015            1.59                              2.38
$\Gamma_3$    5            5              5.26                   5.35 ± 0.014            1.47                              2.11
$\Gamma_4$    20           5              5.47                   5.76 ± 0.015            1.60                              2.36
$\Gamma_5$    10           5              5.39                   5.39 ± 0.027            2.76                              3.85
$\Gamma_6$    5            5              6.91                   6.90 ± 0.028            2.63                              3.66

Note that the resulting (simulated) system satisfies the minimum throughput requirements of all chains; hence we have a satisfactory solution. If desired, we can improve it by doling out the excess resource capacity, i.e., by simply iterating through the design algorithm again.

Figure 9 compares the simulated and analytic results for multiple frames, assigned to three selected chains. The frame-sizes are changed for one chain at a time, and system utilization remains fixed at the synthesized values. The simulation runs displayed here ran for 10,000 frames, for the largest frame under observation (e.g., on the graph, if the largest frame tested is denoted as 200 ms, then that run lasted for 2000 seconds).
The graphs show average output-rates for the chains over an entire simulation trial, along with the corresponding standard deviation, computed over all one-second interval samples vs. the mean.

The combinatorial comparison allows us to make the following observations. First, output rates generally increase as frame-sizes decrease, up to the point where the system starts injecting a significant amount of truncation overhead (due to multiplexing). Recall that the runtime system cannot multiplex infinitesimal granularities of execution time; rather, a task's utilization is allocated in integral units over each frame.

Second, the relationship between throughput and burstiness is not direct. Note that for some chains (e.g., $\Gamma_2$), deviations tend to increase at both higher and lower frames. This reflects two separate facts, the first of which is fairly obvious, and is an artifact of our measurement process.
[Figure 9 panels: Chain 2, Chain 3 and Chain 6 success rate (analysis vs. simulation); Chain 2, Chain 3 and Chain 6 standard deviation.]
Figure 9: Analysis vs. simulation: $OR_i$ for $\Gamma_2$, $\Gamma_3$, $\Gamma_6$; standard deviations for $OR_i$ computed over 1-second moving averages.

We calculate variance statistics over 1-second intervals, which yields deviations computed on the basis of a single reference time-scale. This method inserts some bias at low frames, since these processes get sampled more frequently. In turn, this can lead to artificially higher deviations.

At very large frames, another effect comes into play. Recall that tasks are only dispatched at frame boundaries, and only when they have input waiting from their predecessors. Hence, if a producer overruns its frame by even a slight amount, its consumer will have to wait for the next frame to use the data, which consequently adds to the data's age.

On the whole, however, we note that the simulated deviations are not particularly high, especially considering the fact that they include the one-second intervals where measured output rates actually exceed the mean.

Also, while output rates decrease as frame-sizes increase, the curves are not smooth, and small spikes can be observed in both the simulated and analytical results. Again, this effect is due to the multiplexing overhead discussed above, injected since execution-time budgets are integral. For example, if a task's assigned utilization is 0.11, and if its frame-size is 10, then its execution-time budget will be 1, resulting in an actual, usable utilization of 0.1. However, the output rates do monotonically decrease for larger frames when we only consider candidates that result in integral execution-time budgets.

Finally, we note that the difference between analysis and simulation is larger when a frame-size does not divide the maximum permissible delay. Again, this is the result of rounding up the fractional part of a computation's final frame, which will cause us to overestimate the output's age.
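The deviation statistic used in these comparisons can be sketched in a few lines. The fragment below is an illustration, not the measurement tool itself: function names are hypothetical, output timestamps are assumed to be in seconds, the intervals are taken as consecutive non-overlapping windows, and "degrees of freedom" is read as the number of windows minus one.

```python
from math import sqrt


def interval_rates(output_times, interval, horizon):
    """Output rate (Hz) in each consecutive window of `interval` seconds."""
    n = int(horizon / interval)
    counts = [0] * n
    for t in output_times:
        idx = int(t / interval)
        if 0 <= idx < n:
            counts[idx] += 1
    return [c / interval for c in counts]


def deviation_from(rates, reference):
    """sqrt(sum((r - reference)^2) / (n - 1)), measured against a reference rate."""
    squared = sum((r - reference) ** 2 for r in rates)
    return sqrt(squared / (len(rates) - 1))
```

For instance, outputs at 0.1, 0.2, 1.5, 2.1, 2.2 and 2.3 seconds give per-second rates of 2, 1 and 3 Hz over a 3-second run; measured against a reference rate of 2 Hz, the deviation is 1 Hz.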
Remarks. Coarse analytical estimations are essential at the synthesis stage. As we showed in Section 3, a deterministic real-time approach would fail to work for our small example, and the problems associated with stochastic timing deviations would only increase in larger systems. Note also that we could not rely exclusively on simulation during the synthesis phase, i.e., as a substitute for analysis. Based on our timing information for single-run simulations, such an approach would require over three months to synthesize our small example.

Hence, since we require coarse analytical estimates at the design stage, validating the solution is essential. As a first pass, the obvious choice is discrete-event simulation. A set of simulation runs, requiring perhaps several minutes, will often be significantly cheaper than going directly to integration. Indeed, discovering a severe design flaw after implementation can be a nasty proposition for a development team, particularly when the hardware is found to be insufficient for the application hosted on it. Also, the simulation model's underlying assumptions are fundamentally different from those used in the analysis; thus the results can provide a margin of confidence in the design's robustness. Simply put, for the sake of validity, two performance models are better than one.

However, simulation is not the end of the story. After all, our objective is to build the application, by calibrating the kernels and drivers to use our analytically-derived parameters. At that point, one subjects the system to the most important validity test of all: on-line profiling. Even with a careful synthesis strategy, testing usually leads to some additional system tuning, to help compensate for the imprecise modeling abstractions used during static design.

7 Conclusion

We presented a design scheme to achieve stochastic real-time performance guarantees, by adjusting system utilizations and processing rates.
The solution strategy uses several approximations to avoid modeling the entire system; for example, in estimating end-to-end delay we use a combination of queuing analysis, real-time scheduling theory, and simple probability theory. Our search algorithm makes use of two heuristics, which help to significantly reduce the number of feasible solutions checked.

In spite of the approximations, our simulation results are promising: they show that the approximated solutions are sufficiently close to be dependable; also, the resulting second-moment statistics show that output-rates are relatively smooth.

Much work remains to be done. First, we plan on extending the model to include a more rigid characterization of system overhead, due to varying degrees of multiplexing. Currently we assume that execution-time can be allocated in integral units; other than that, no specific penalty functions are included for context-switching, or for network-switch overhead. We also plan on getting better approximations for the "handover-time" between tasks in a chain, which will result in tighter analytical results.

We are also investigating new ways to achieve faster synthesis results. To speed up convergence, one needs a metric which approximates "direction of improvement" over the solution space as a
whole. This would let the synthesis algorithm shoot over large numbers of incremental improve-ments { and hopefully attain a quicker solution. However, we note that the problem is not trivial;the solution space contains many non-linearities, with no ready-made global metric to uncondi-tionally predict monotonic improvement. Yet, we can potentially take advantage of some relativelyeasy optimization strategies, such as hill-climbing and simulated annealing. The key to success liesin nding a reasonable \energy" or \attraction" function { and not necessarily one that is exact.Finally, we are currently deploying our design technique in several large-scale eld tests, on dis-tributed applications hosted on SP-2 and Myrinet systems. Part of this project requires extendingthe scheme to handle dynamic system changes { where online arrivals and departures are permitted,and where the component-wise PDFs may vary over time. Hence, we are working on self-tuningmechanisms which get invoked when a chain's throughput degrades below a certain threshold {which can trigger an on-line adjustment in a chain's allocated load, as well as its associated framesize. Note that this is similar a common problem studied in the context of computer networks:handling on-the-y QoS renegotiations, to help smooth out uctuating service demands. Hence, weare investigating various strategies proposed for that problem, to determine if they can be modiedfor the more arcane (but equally challenging) domain of embedded real-time systems.References[1] Alan Burns. Preemptive Priority Based Scheduling: An Appropriate Engineering Approach.In Sang Son, editor, Principles of Real-Time Systems. Prentice Hall, 1994.[2] Wu chun Feng and Jane W.-S. Liu. Algorithms for scheduling real-time tasks with inputerror and end-to-end deadlines. IEEE Transactions on Software Engineering, 23(2):93{106,February 1997.[3] R.L. Cruz. A calculus for network delay, part i : Network elements in isloation. 
IEEETransactions on Information Theory, 37(1):114{131, 1991.[4] Alan Demers. Analysis and Simulation of a Fair Queueing Algorithm. In Proceedings of ACMSIGCOMM, pages 1{12. ACM Press, September 1989.[5] Norival R. Figueira and Joseph Pasquale. Leave-in-Time: A New Service Discipline for Real-Time Communications in a Packet-Switching Network. In Proceedings of ACM SIGCOMM,pages 207{218. ACM Press, October 1995.[6] Lucia Franco. Communication congurator for eldbus: An algorithm to schedule tran smissionof data and messages. In Proceedings of IFAC/IFIP Workshop on Real Time Programming.IFIP, November 1996.[7] Lucia Franco. Transmission scheduling for eldbus: A strategy to schedule data and messageson eldbus with end-to-end constraints. In Proceedings of IEEE International Symposium on24
Intelligent Systems /Automation and Robotics (IAR). IEEE Computer Society Press, Decem-ber 1996.[8] R. Gerber, S. Hong, and M. Saksena. Guaranteeing Real-Time Requirements with Resource-Based Calibration of Periodic Processes. IEEE Transactions on Software Engineering, 21, July1995.[9] R. Gerber, Dong-In Kang, Seongsoo Hong, and Manas Saksena. End-to-End Design of Real-Time Systems, chapter 10, pages 237{265. Wiley, 1996. In Formal Methods for Real-TimeComputing, edited by Constance Heitmeyer and Dino Mandrioli.[10] Ladan Gharai and Richard Gerber. Multi-platform simulation of video playout performance.In Proceedings of SPIE/IS&T Multimedia Computing and Networking (MCMN98), 1998.[11] S. Goddard and Kevin Jeay. Analyzing the real-time properties of a dataow executionparadigm using a synthetic aperture radar application. In Proceedings of IEEE Real-TimeTechnology and Applications Symposium. IEEE Computer Society Press, June 1997.[12] S. J. Golestani. Congestion-free communication in high-speed packet networks. IEEE Trans-actions on Commmunication, 39(12):1802{1812, 1991.[13] Pawan Goyal, Xingang Guo, and Harrick M. Vin. A Hierarchical CPU Scheduler for Multi-media Operating Systems. In Proceedings of Symposium on Operating Systems Design andImplementation (OSDI '96), pages 107{121, October 1996.[14] Pawan Goyal and Harrick M. Vin. Network Algorithms and Protocol for Multimedia Servers.In Proceedings of IEEE INFOCOM. IEEE Computer Society Press, March 1993.[15] M. Harbour, M. Klein, and J. Lehoczky. Fixed Priority Scheduling of Periodic Tasks withVarying Execution Priority. In Proceedings, IEEE Real-Time Systems Symposium, pages 116{128, December 1991.[16] Raj Jain. The Art of Computer Systems Performance Analysis Techniques for ExperimentalDesign, Measurement, Simulation, and Modeling. John Wiley & Sons, 1991.[17] Ben Kao and Hector Garcia-Molina. Deadline Assignment in a Distributed Soft Real-TimeSystem. 
In Proceedings of International Conference on Distributed Computing systems, pages428{437. IEEE Computer Society Press, May 1993.[18] Namyun Kim, Minsoo Ryu, Seongsoo Hong, Manas Saksena, Chong-Ho Choi, and HeonshikShin. Visual assessment of a real-time system design : A case study on a cnc controller. InProceedings of IEEE Real-Time Systems Symposium, pages 300{310. IEEE Computer SocietyPress, December 1996. 25
[19] J. P. Lehoczky and S. Ramos-Thuel. An Optimal Algorithm for Scheduling Soft-Aperiodic Tasks in Fixed-Priority Preemptive Systems. In Proceedings of IEEE Real-Time Systems Symposium, pages 110-123. IEEE Computer Society Press, December 1992.
[20] J. Leung and M. Merrill. A Note on the Preemptive Scheduling of Periodic, Real-Time Tasks. Information Processing Letters, 11(3):115-118, November 1980.
[21] C. Liu and J. Layland. Scheduling Algorithms for Multiprogramming in a Hard Real-Time Environment. Journal of the ACM, 20(1):46-61, January 1973.
[22] Clifford W. Mercer, Stefan Savage, and Hideyuki Tokuda. Processor Capacity Reserves: Operating System Support for Multimedia Applications. In Proceedings of IEEE International Conference on Multimedia Computing and Systems. IEEE Computer Society Press, May 1994.
[23] A. K. Parekh and G. Gallager. A Generalized Processor Sharing Approach to Flow Control in Integrated Services Networks - The Single Node Case. In Proceedings of IEEE INFOCOM, pages 915-924. IEEE Computer Society Press, March 1992.
[24] A. K. Parekh and G. Gallager. A Generalized Processor Sharing Approach to Flow Control in Integrated Services Networks - The Multiple Node Case. In Proceedings of IEEE INFOCOM, pages 521-530. IEEE Computer Society Press, March 1993.
[25] M. Saksena and S. Hong. Resource Conscious Design of Real-Time Systems: An End-to-End Approach. In IEEE International Conference of Engineering Complex Computer Systems. IEEE Computer Society Press, October 1996.
[26] S. Sathaye and J. Strosnider. A Real-Time Scheduling Framework for Packet-Switched Networks. In Proceedings of IEEE Real-Time Systems Symposium, pages 182-191. IEEE Computer Society Press, December 1994.
[27] D. Seto, J. P. Lehoczky, L. Sha, and K. G. Shin. On Task Schedulability in Real-Time Control System. In Proceedings of IEEE Real-Time Systems Symposium. IEEE Computer Society Press, December 1996.
[28] Kang G. Shin and Yi-Chieh Chang. A Reservation-Based Algorithm for Scheduling Both Periodic and Aperiodic Real-Time Tasks. IEEE Transactions on Computers, 44:1405-1419, December 1995.
[29] Ion Stoica, Hussein Abdel-Wahab, Kevin Jeffay, Sanjoy K. Baruah, Johannes E. Gehrke, and C. Greg Plaxton. A Proportional Resource Allocation Algorithm for Real-Time, Time-Shared Systems. In Proceedings of IEEE Real-Time Systems Symposium, pages 288-299. IEEE Computer Society Press, December 1996.
[30] T. S. Tia, Z. Deng, M. Shankar, M. Storch, J. Sun, L.-C. Wu, and J. W.-S. Liu. Probabilistic Performance Guarantee for Real-Time Tasks with Varying Computation Times. In Proceedings of IEEE Real-Time Technology and Applications Symposium, pages 164-173. IEEE Computer Society Press, May 1995.
[31] K. Tindell, A. Burns, and A. Wellings. Analysis of Hard Real-Time Communication. The Journal of Real-Time Systems, 9:147-171, September 1995.
[32] Carl A. Waldspurger and William E. Weihl. Lottery Scheduling: Flexible Proportional-Share Management. In Proceedings of Symposium on Operating Systems Design and Implementation (OSDI '94), November 1994.
[33] Carl A. Waldspurger and William E. Weihl. Stride Scheduling: Deterministic Proportional-Share Resource Management. Technical Report MIT/LCS/TM-528, MIT Laboratory for Computer Science, June 1995.
[34] David Yates, James Kurose, Don Towsley, and Michael G. Hluchyj. On Per-session End-to-end Delay Distributions and the Call Admission Problem for Real-time Applications with QOS Requirements. In Proceedings of ACM SIGCOMM. ACM Press, September 1993.
[35] H. Zhang. Providing End-to-End Performance Guarantees Using Non-Work-Conserving Disciplines. Computer Communications: Special Issue on System Support for Multimedia Computing, 18, October 1995.
[36] Hui Zhang and D. Ferrari. Rate-Controlled Static-Priority Queueing. In Proceedings of IEEE INFOCOM, pages 227-236. IEEE Computer Society Press, September 1993.
[37] Hui Zhang and Edward W. Knightly. Providing End-to-End Statistical Performance Guarantees with Bounding Interval Dependent Stochastic Models. In ACM SIGMETRICS. ACM Press, May 1994.
[38] Lixia Zhang. VirtualClock: A New Traffic Control Algorithm for Packet Switching Networks. In Proceedings of ACM SIGCOMM, pages 19-29. ACM Press, September 1990.
[39] Zhi-Li Zhang, Don Towsley, and Jim Kurose. Statistical Analysis of Generalized Processor Sharing Scheduling Discipline. In Proceedings of ACM SIGCOMM, pages 68-77. ACM Press, August 1994.
