Automated Techniques for Designing Embedded Signal Processors on 
Distributed Platforms by Kang, Dong-In et al.
Dong-In Kang, Richard Gerber, Leana Golubchik
Department of Computer Science
University of Maryland
College Park, MD 20742
{dikang, rich, leana}@cs.umd.edu

Abstract

In this paper, we present a performance-based technique to help synthesize high-bandwidth radar processors on commodity platforms. This problem is innately complex, for a number of reasons. Contemporary radars are very compute-intensive: they have high pulse rates, and they sample a large number of range readings at each pulse. Indeed, modern radar processors can require CPU loads in the high-gigaop to tera-op range, performance which is only achieved by exploiting the radar's inherent data parallelism. Next-generation radars are slated to operate on scalable clusters of commodity systems.

Throughput is only one problem. Since radars are usually embedded within larger real-time applications, they also must adhere to latency (or deadline) constraints. Building an embedded radar processor on a network of workstations (or a NOW) involves partitioning load in a balanced fashion, accounting for stochastic effects injected on all software-based systems, synthesizing runtime parameters for the on-line schedulers and drivers, and meeting the latency and throughput constraints.

In this paper, we show how performance analysis can be used as an effective tool in the design loop; specifically, our method uses analytic approximation techniques to help synthesize efficient designs for radar processing systems. In our method, the signal-processor's topology is represented via a simple flow-graph abstraction, and the per-unit load requirements are modeled stochastically, to account for second-order effects like cache memory behavior, DMA interference, pipeline stalls, etc.
Our design algorithm accepts the following inputs: (a) the system topology, including the thread-to-CPU mapping, where multi-threading is assumed to be used; (b) the per-task load models; and (c) the required pulse rate and latency constraints. As output, it produces the proportion of load to allocate to each task, set at manageable time resolutions for the local schedulers; an optimal service interval over which all load proportions should be guaranteed; an optimal sampling frequency; and some reconfiguration schemes to accommodate single-node failures. Internally, the design algorithms use analytic approximations to quickly estimate output rates and propagation delays for candidate solutions. When the system is synthesized, its results are checked via a simulation model, which removes many of the analytic approximations.

We show how our system synthesizes a real-time synthetic aperture radar, under a variety of loading conditions.

This research is supported in part by National Science Foundation Grants CCR9804679 and CCR9619808, ARL Cooperative Agreement DAAL01-96-2-0002 and National Science Foundation CAREER grant CCR-96-25013.

1 Introduction

Designing a contemporary high-resolution radar processor is a complex art. These applications are extremely compute-intensive, due to their high pulse rates, and the huge amount of range data sampled at each pulse - not to mention the complexity of the operations carried out on the data, e.g., digital filtering, fast Fourier transforms (FFTs), inverse FFTs, matrix inversions, convolutions, and the like. Also, in some radars (e.g., with phased-array antennas or synthetic apertures), multiple signal-processing engines are run in parallel, and the outputs are composed to form single images. Given these factors, high-end radar processors often demand CPU loads of 10^8 to 10^13 FLOPS, and next-generation phased-array radars are slated to require up to 10^16 FLOPS.

This throughput can only be achieved by relying on some degree of parallel processing. Fortunately, most radar processors do contain a large amount of data parallelism within the functional units for both temporal and spatial computations. To some degree, this parallelism is currently exploited in custom-logic layouts; however, as throughput demands rise, designers are slowly replacing VLSI solutions with pipelines of commodity systems, hooked together by fast interconnects. Indeed, with a sufficiently large number of workstations, any throughput is theoretically attainable.

But handling throughput is not the only problem.
Most radars are embedded within larger real-time applications; hence they must adhere to latency (or deadline) requirements. In fact, for a real-time radar processor, meeting the end-to-end latency constraints is as important as processing the throughput, since overly-late images may be worthless to a recognition engine. Hence, building an embedded radar processor on a network of workstations (or a NOW) involves striking a balance between several key factors: (1) configuring a NOW to handle the throughput, at a reasonable price; (2) partitioning the workload to fit the capacity of the system's nodes; and (3) doing this in a manner which meets the end-to-end latency requirements. In this respect, high-performance radar design is similar to many other problems in computer engineering - it only happens to be significantly larger.

The exercise is additionally complicated by the problem of synchronization. It is true that some radar computation stages can be parallelized; however it is not true that these stages are completely independent. To the contrary, all radars possess two large "macro-stages," which are usually performed in sequence: range compression (i.e., spatial processing) and pulse compression
(i.e., temporal processing). Between these two stages lies an all-to-all data exchange - a fork/join synchronization which implements a real-time matrix transpose, where incoming rows arrive at staggered intervals, and where each outgoing column gets broadcast in its entirety. If the system is not tuned properly, then a slice of the image may be unduly delayed - thereby making the entire image exceed its allowed latency. Moreover, since a delayed slice of an old image can induce head-of-line blocking effects on newer images, these can end up being worthless too.

Another issue deals with runtime behavior, e.g., accounting for factors like cache and pipeline effects, interrupts, queuing delays, context-switches - not to mention the resolution of the real-time clock. Without reasonable ways of estimating (and controlling) these factors, the chances of obtaining a well-balanced design are fairly low.

Finally there is the problem of fault-tolerance. Traditionally, hardware-level redundancy has been the rule, and it has often been used at great expense. Specifically, the system is deployed with a "twin" - another supercomputer, which sits idle, waiting to repair single node/board failures. Often the probability of failure is quite low. If the baseline processor is not overly utilized, single-node failures could theoretically be repaired via repartitioning slack, and multi-threading some of the functions from the failed node. In practice, this is not done; in fact, multi-threading is rarely used on any real-time radar processor, in any context. The main reason for this is a lack of adequate performance metrics. That is, radar developers do not have sufficient models to help characterize the interaction effects between the software, the hardware, and the runtimes - and then chart their influence on the end-process statistics. Without this technology, the process of designing a single-threaded system is sufficiently difficult - and multi-threading makes things more complicated.
Hence it is avoided.

The role of performance modeling. In this paper, we show how performance analysis can be used as an effective tool in the design loop; specifically, our method uses analytic approximation techniques to help synthesize efficient designs for radar processing systems. Using this method, we show how some of the abovementioned problems can be treated at design time, before the system is built. Moreover, we illustrate how we can use some simple statistics to account for many runtime effects, as well as nondeterministic execution times, multi-threading overhead, and the like.

In our method, the signal-processor's topology is represented via a simple flow-graph abstraction, where a vertex represents an activity requiring nonzero load from some CPU or network resource. As noted, these load requirements can vary stochastically, due to second-order effects like cache memory behavior, DMA interference, pipeline stalls, bus arbitration delays, transient head-of-line blocking, etc. Instead of explicitly modeling each of these effects individually, we aggregate them with a task's inherent load demand, and use random variables to represent each task's per-service load demand. These random variables range over arbitrary discrete probability distributions, and can be obtained via profiling tasks in isolation, or simply by using an engineer's hypothesis about the system's projected behavior. Successive instances of a single task are modeled to be independent of each other. Our results show that this kind of a simple model, while admittedly coarse, tends to be highly robust for the purposes of this application.

We also use some simple queueing-theoretic techniques as "subroutines" within our load-allocation heuristic. Using these abstractions, our algorithms can efficiently process millions of candidate designs in a few seconds, toward the goal of finding one which meets the end-to-end requirements.

Aside from the technical contributions in this paper, we also hope to provide a "social" contribution: We illustrate how a straightforward performance technique is quite useful when used as a tool, functioning within the context of a larger problem - that of design synthesis. We also show how stochastic models can be harnessed to produce more efficient, scalable systems than are currently deployed via deterministic models. Our aim is to help introduce this area to the community of performance analysts.

Overview of the Design Method. The design algorithm accepts the following inputs: (a) the system topology, including the thread-to-CPU mapping, where multi-threading is assumed to be used; (b) the per-task load models; and (c) the required pulse rate and latency constraints.

Using these constraints, the algorithm tries to synthesize ideal scheduling parameters for each task, via solving two sub-problems in tandem: (i) it finds the proportion of load to allocate each task, set at manageable time resolutions for the local schedulers; and (ii) it derives an optimal service interval over which all load proportions should be guaranteed. Internally, the design algorithms use analytic approximations to quickly estimate output rates and propagation delays for candidate solutions.
In addition to the scheduling parameters, the algorithm also produces statistics on the number of projected late images, if any (i.e., images which exceed their latency constraints).

When all parameters are synthesized, the estimated end-to-end performance metrics are re-checked by simulation. The per-component load reservations can then be increased, with the synthesis algorithms re-run to improve performance.

A consequence of the multi-threading model is that it can be used for the purposes of fault-tolerance, since unused slack can be redistributed to off-load processes from failed nodes. In practice, this means that backup configurations are pre-stored on every node. When a fault is detected, the system switches its mode to one of these backups. Spare slack is redistributed to the relocated tasks, and other tasks may have their load adjusted accordingly, perhaps retaining the original average performance, albeit with a higher output jitter. Or, if there is not sufficient slack to sustain the original performance, our method degrades to a lower output quality, and calibrates the system as such.

Narrowing the Focus. We do not treat the entire design problem here - i.e., parallelization, task partitioning, graphical CAD tools, etc. In particular, as mentioned above, the design process involves exploiting the degrees of parallelism in a problem, and assigning tasks to processors. While this is an important problem, we do not handle it here. We note that tuning the resource schedulers is, by definition, subservient to the allocation phase - which often involves accounting for device-specific localities (e.g., I/O ports, DMA channels, etc.), as well as system-level issues (e.g., services provided by each node).

However, we narrow our focus to solving several rather important sub-problems, which have not been treated in the literature on radar processing. Yet while we concentrate on the scheduling synthesis problem at hand, we note that a "holistic" design tool could integrate the two problems, and use our system-tuning algorithms as "subroutines."

Running Example. Throughout this paper, we use the RASSP SAR benchmark as a running example for our design scheme, and we show how it operates on two different layouts of the radar system. We also show their reconfigurations under single-node failures, and compare the estimated performance to a simulation model. The RASSP SAR (Synthetic Aperture Radar) was posed as a "challenge" signal-processing problem for COTS-based development. In the realm of advanced radars, the SAR's throughput is quite modest - 1.1 GFLOPS for processing three polarizations, at the highest input pulse frequency, using tuned software for the benchmark system [27]. However, the point of the SAR exercise was not to build the most advanced radar. Rather, it was to find scalable methods to perform pulse-compression and range-compression in software, on commodity systems. We note that in terms of general-purpose computers, 1.1 GFLOPS - with bounded latency constraints - is still considered quite high.

Related Work. Hundreds of books have been written on radar systems; however the field of deploying high-performance radars on clusters of general-purpose computers is a relatively new one. Its emergence is due to a rapid decline in price-performance ratios, and to the introduction of cheap, fast interconnects - but also, to a paradigm shift in the area of "supercomputing."
Now, a network of workstations is the rule, not the exception.

Several papers document experiments with radar processors on this sort of system, and many of these are based on the RASSP SAR benchmark. Specifically, the SAR's requirements were presented in [27]. In our opinion, the most successful implementation to date was performed by researchers at the Mitre Corporation, using an Intel Paragon [4]. The system design was not particularly sophisticated, but one SAR channel was implemented entirely in software. As for the design phase, it was guided via coarse, deterministic load models - indeed simple flop-counts were used to determine where to place the functional units. We discuss some of the problems with this approach in the sequel.

Additional work has been done on streamlining the radar code itself, to run on various types of general-purpose computers. Some of the SAR's units were implemented and optimized by a group at Carnegie Mellon [8]; also, its memory allocation issues were studied in [15]. However, to our knowledge, no work has been done on using stochastic performance models for the purpose of system synthesis.¹

¹ Note that some results in this area are classified as military secrets - and hence not published.

On the other hand, much work has been done in the area of real-time systems synthesis, for other contexts. In all this work, the prevailing idea is to design a real-time system by solving scheduling problems from the "inside out." That is, rather than use a theory to determine whether a fully-designed system is schedulable, one uses the theory as a metric - to help instantiate parameters like frequencies, deadlines and the like. Specifically, in our previous work [1, 2] we relaxed the precondition that period and deadline parameters are always known ahead of time. Rather, we used the system's end-to-end delay and jitter requirements to automatically derive each task's constraints; these, in turn, ensure that the end-to-end requirements will be met on a uniprocessor system. A similar approach for uniprocessor systems was explored in [5], where execution time budgets were automatically derived from the end-to-end delay requirements; the method used an imprecise computation technique as a metric to help gauge the "goodness" of candidate solutions.

These concepts were later modified for use in various application contexts, e.g., for discrete and continuous control problems [16, 19], and for scheduling real-time traffic over fieldbus networks [10, 11]. A modification of our theory was even used to help solve some basic parameters of this SAR problem [12] - however, load requirements were not taken into account; rather the method was used to derive per-component frequencies and deadlines. Other results presented in [18] addressed distributed real-time synthesis in a deterministic context: they extended our original theory by statically partitioning the end-to-end delays, via heuristic metrics.

The analytic results in this paper rely on a form of proportional-share scheduling, perhaps the oldest variant of PS, namely time-division multiplexing (TDM).
We use a TDM abstraction to ensure that a task is guaranteed a fixed number of "time-slots" over pre-defined periodic intervals. Since CPU workloads in real-time systems cannot simply be "re-shaped," and since end-to-end latency guarantees must still be met, we have found that TDM ensures a reasonable level of fairness between different tasks on a resource - and between successive instances of the same task. Also, TDM-based schedulers and drivers are fairly easy to implement, using credit/debit token-bucket schemes. Perhaps more importantly, the TDM abstraction is understood by radar domain experts: the underlying formalism is not radically different from techniques used in VLSI layouts; it is just a bit more nondeterministic.

Many other service disciplines re-distribute slack over longer intervals - at a cost of occasionally postponing the projected completion times of certain tasks. Most of these techniques were conceived for regulated workload models, e.g., linear bounded arrival processes [6]. Most of the queueing techniques developed help provide proportional-share service in high-speed networks, e.g., the VirtualClock [25], Fair-Share Queueing [7], GPS [17], and RCSP [23]. These models have also been used to derive statistical delay guarantees; in particular, within the framework of RCSP (in [24]) and GPS (in [26]). Related results can be found in [9] (for VirtualClock) and in [22] (for FCFS, with a variety of traffic distributions). In [14], statistical service quality objectives are achieved via proportional-share queueing, in conjunction with server-guided backoff, where servers dynamically
adjust their rates to help utilize the available bandwidth. Also, many of these disciplines have analogues in CPU scheduling, e.g., Lottery Scheduling [20], Stride Scheduling [21], and hierarchical scheduling [13], among others.

Figure 1: SAR Channel Flowgraph, and Functional Unit Complexity

2 The Synthetic Aperture Radar

Figure 1 shows the dataflow for one SAR channel, with some sample computation requirements for each component. These numbers were taken from a Mitre Corporation report [4], documenting their experiences in implementing a single-channel layout on a 16-node Paragon. While the Paragon is about 2 generations behind current supercomputers, the flop-counts of operations like FFTs remain constant on any machine. Perhaps the way these numbers get used is a more crucial issue. In a first pass, they can potentially help a designer compare the relative complexity of the various functional units in the pipeline. However, flop-counts do not convey much information about a function's true response time model. This is particularly true in software layouts, which contain a mixture of floating-point and integer operations. Moreover, in COTS environments, a unit's arithmetic complexity is only one factor contributing to its response time - as noted above, there are many others, which are best quantified in a statistical sense.

Requirements. The SAR's requirements depend on several factors, enumerated as follows:

Number of channels. The RASSP SAR is composed of three channels, all of which are run in parallel as independent radar processing systems. The outputs of these channels are then composed to build the full image; hence the term "synthetic aperture." In this paper, we present our results for synthesizing a single channel. While our algorithms easily synthesized the full 3-channel system (by replicating the performance results for each channel), our reduction here amply illustrates our approach, while making the figures and graphs significantly more readable.

Pulse rate.
The SAR requirements stipulate a pulse rate lying between 200 Hz and 556 Hz, where higher rates lead to better temporal resolution - and hence are preferred. We chose the highest possible resolution as our design goal, i.e., 556 Hz was selected as our input arrival rate.

Ranges per pulse, and their precision. In the RASSP SAR, 2048 ranges are sampled per pulse, and each is represented as a single-precision floating point number.

Number of pulses per image. A single channel's image is formed by concatenating two frames, each of which corresponds to 1/2 of the image's temporal resolution. In this SAR, one frame is formed out of 512 consecutive pulses; hence 1024 pulses are required to produce a full image. However, every frame is used in two images - first as the "new temporal part," and then shifted into the next image's "old part." Hence, images are produced at the frame processing rate, or 556/512 = 1.09 Hz.

Required latency. The end-to-end latency is bounded by 3 seconds, where latency is measured from the arrival of the last pulse in a frame, to the time the frame is produced.

Functional Units. In producing images, all radars go through two major stages and some minor stages. The major stages are range compression (where the range samples are processed into frame slices), and pulse compression (where the slices are processed to form a single 2D frame). Minor phases include filter/transform operations, e.g., to normalize the antenna's digital output to some internal form, or to perform basic shifting and transposition functions.

In this SAR, channels are organized into four principal stages - input filtering, range compression, a "corner turn," and pulse compression. The phases are as follows:

- Video to Baseband I/Q Conversion (the IQ stage): A pulse's samples are converted and filtered from video format to in-phase and quadrature bands, represented internally by complex numbers.

- Range Compression: Range compression consists of three steps. First, an equalization filter (EQ) normalizes the data for range processing. Then a discrete Fourier transform (the RDFT phase) converts the data to the frequency scale. The result is run through another filter (the RCS phase), to compensate for cross-section variations produced by the DFT. Basically, the RCS just normalizes the synthesized frequency coefficients, via applying amplitude weights to the radar's cross-section frequencies.

- Corner Turn (CT): The corner-turn phase is a real-time matrix transpose. As such, it forms the crucial bottleneck in most adaptive radars. The problem is as follows. The RCS phase outputs 2048 range coefficients for each pulse - and 512 pulses are required to form a frame. However, to work on a range gate, the pulse compression engine requires all 512 readings for that range - which were sampled at different pulses. Hence, the corner-turn's job is to accumulate the 512 × 2048 matrix (with rows corresponding to pulses), and then feed the columns to the pulse-compression engine.

- Pulse Compression: The SAR uses an azimuth processor to handle the pulse data. In this radar, two sequential frames form a processing array of size 2048 × 1024, where columns correspond to pulses (after the corner turn), and rows correspond to range gates. When a new frame is shifted into the array, the oldest frame is shifted out. The actual pulse compression phase consists of three steps: (1) a discrete Fourier transform (denoted ADFT); (2) a convolution (denoted KM, for "Kernel Multiplication"); and (3) an inverse Fourier transform (denoted AIDFT, for Inverse Discrete Fourier Transform). Most readers are familiar with Fourier transforms, and the azimuth processing does not affect the transform's basic arithmetic complexity. Rather, it involves the selection of the matrix coefficients used for this application.

In the preceding sections, we outlined some of the challenges involved in building this sort of a system. However, perhaps the largest challenge lies in sorting out the incredible number of design choices available. This radar possesses almost infinite degrees of freedom in decomposing the problem along the spatial/temporal domains. Moreover, software layouts increase the degrees of freedom considerably. VLSI solutions are much more rigid; when an FFT chip for 256-sized vectors is deployed, it imposes a constraint which gets reflected all the way down the radar's datapath. In software, these constraints do not exist, which is both a blessing and a curse. After all, making any layout choice can radically affect the system's latency. Hence, there is a prevailing need for performance models which can help account for system-level effects at the initial design phase. Further, while we do not handle the placement problem in this paper, we do offer methods to model the runtime effects due to interacting units. These methods can potentially help other designers make their placement decisions.

The typical approach to signal-processor design still bears a heavy influence from the field's traditional "computing environment," that is, synchronous logic.
For example, the Mitre project - the most successful implementation of the SAR to date - used each Paragon node for a single function, be it an FFT or a simple convolution. In parallelizing the system, a simple worst-case flop-count metric was used to help slice the datapath. No multi-threading was used, and it is hardly ever used in these systems.

We note that radar designers rarely use queuing-theoretic techniques in carrying out performance estimation. Rather, if a priori estimates are used at all, simulation models are the rule, with actual measurements then carried out on the real system. While neither of these methods can be replaced in the design loop - and testing can never be replaced - we still advocate using analytic techniques at layout time. For obvious reasons, neither simulation nor measurement can be used within the context of automatic synthesis.

As for flop-counts and other deterministic metrics, they obscure the functional units with long-tailed response-time distributions, or those with high variability. They also cannot account for effects due to caching, DMA interference, traps, etc. The result is usually a very imbalanced, underutilized system.
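
To make the point about deterministic metrics concrete, the following minimal Python sketch compares two per-invocation execution-time distributions that share the same mean. All numbers, and the names `steady`, `bursty` and `budget_ms`, are hypothetical and chosen only for illustration; they are not measurements from the SAR benchmark.

```python
from fractions import Fraction


def mean(pmf):
    """Expected value of a discrete distribution given as {time_ms: probability}."""
    return sum(t * p for t, p in pmf.items())


def tail_prob(pmf, budget):
    """Probability that a single execution takes strictly longer than `budget`."""
    return sum(p for t, p in pmf.items() if t > budget)


# Hypothetical execution-time profiles in milliseconds (illustrative only).
steady = {10: Fraction(1)}                           # always 10 ms
bursty = {8: Fraction(9, 10), 28: Fraction(1, 10)}   # usually 8 ms, sometimes 28 ms

budget_ms = 12  # hypothetical per-invocation time budget

for name, pmf in (("steady", steady), ("bursty", bursty)):
    print(name, "mean:", float(mean(pmf)),
          "P(time > budget):", float(tail_prob(pmf, budget_ms)))
```

Both profiles have a mean of 10 ms, so a mean-based count would treat them as identical, yet the second one overruns the 12 ms budget in one invocation out of ten. This is the kind of difference a stochastic load model can expose.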

3 Model and Solution Overview

We model a system as a flowgraph, where tasks are mapped to some CPU or network resource. Formally, a system possesses the following structure and constraints.

Bounded-Capacity Resources: There is a set of resources R_1, R_2, ..., R_m, where a given resource R_i corresponds to one of the system's CPUs or network links. Associated with R_i is a maximum allowable capacity, or Max_i, which is the maximum load the resource can multiplex effectively. The parameter Max_i will typically be a function of its scheduling policy (as in the case of a workstation), or its switching and arbitration policies (in the case of a LAN). In the examples in this paper, we set Max_i of all resources at 0.95.

Acyclic Flow-graph: A system is represented as an acyclic flow-graph, where vertices represent tasks (denoted by the letter τ), and edges represent a producer/consumer relationship between a pair of tasks. We assume unlimited buffer space available between each such pair, before the system is designed. As we will see, the dynamics of the application do serve to put an upper bound on the queue-depth.

Channels: When a flow-graph includes disjoint subgraphs, we say it has multiple channels. Since there are no explicit data or control dependencies between channels, we treat their latency analysis independently. They may be indirectly dependent, via resource sharing and input sharing. For example, the three channels (or polarizations) in the SAR may share resources, and do process different parts of an image. As for the resource-sharing, the TDM scheduling method allows isolating the analysis (and behavior) of multiplexed tasks. As for image collation, here we consider it external to the design problem, as is done in the requirements specification for this radar. If a designer inserts an explicit join point, correlation can easily be incorporated into our scheme.

Task Chains: A task chain is a feed-forward pipeline of tasks, where each task has only one predecessor, and one successor.
We denote chains with the Greek letters Γ_1, Γ_2, ..., Γ_n, where the jth task in a chain Γ_i is denoted τ_{i,j}. Each computation on Γ_i carries out a transformation from an input (or a split) to an output (or a join point). When multiple chains feed into a synchronization point, we often use the following technique: we abstract a chain as an independent queueing network, and generate its latency accordingly. Then, we use simple probability theory to combine the latencies of all chains flowing into a join point - thereby approximating the latency for the join point as a whole.

Stochastic Processing Costs: A task's cost is modeled via a discrete probability distribution function, whose random variable characterizes the time it needs for one execution instance on its resource.

Latency Bound (ML_l): The delay constraint, ML_l, is an upper bound on the average time it should take a computation to flow through a channel. Unlike the original specification of the SAR benchmark, we measure the end-to-end latency from the arrival time of the last pulse in a frame,
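
One plausible reading of the join-point technique described under Task Chains is sketched below in Python. It assumes that the chain latencies are independent and that the join fires when the slowest chain finishes, so the join latency is the maximum of the chain latencies. Both assumptions, the function name `join_latency`, and the sample distributions are illustrative and are not taken from the paper.

```python
from fractions import Fraction
from itertools import product


def join_latency(chain_pmfs):
    """PMF of the maximum of independent discrete latencies.

    Each element of `chain_pmfs` maps a latency value to its probability.
    The enumeration is exponential in the number of chains, which is
    acceptable for a small illustration.
    """
    result = {}
    for combo in product(*(pmf.items() for pmf in chain_pmfs)):
        latency = max(t for t, _ in combo)   # the join waits for the slowest chain
        prob = Fraction(1)
        for _, p in combo:                   # independence: probabilities multiply
            prob *= p
        result[latency] = result.get(latency, 0) + prob
    return dict(sorted(result.items()))


# Hypothetical latency distributions (time units are arbitrary).
chain_a = {2: Fraction(1, 2), 4: Fraction(1, 2)}
chain_b = {3: Fraction(3, 4), 5: Fraction(1, 4)}

joined = join_latency([chain_a, chain_b])
print(joined)                                        # latencies 3, 4, 5
print(float(sum(t * p for t, p in joined.items())))  # mean join latency: 3.875
```

With these inputs the join latency is 3 with probability 3/8, 4 with probability 3/8, and 5 with probability 1/4, so its mean (3.875) exceeds the mean of either chain taken alone (3.0 and 3.5).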

Figure 3: SAR channel Design II (top), and reconfiguration for "r3" failure (bottom).

integrated system. When this is done for all tasks - with the PDFs quantized at some acceptable level - the results can be fed into the synthesis tool. If the system topology can handle a number of hypothetical per-task load profiles, a designer can gain a margin of confidence in the system's robustness to subtle changes in loading conditions.

We take the latter approach in our running example: we discretize two different continuous distributions. In Figure 4, "Derived From" denotes a base continuous distribution generated and quantized using the parameters Min, Max, E[t] (mean), Var[t] (variance) and "NumSteps" (number of intervals). In the case of an exponential distribution, the CDF curve is shifted up to "Min," and the probabilities given to values past "Max" are distributed to each interval proportionally. The granularity of discretization is controlled by "NumSteps," where we assume that the execution time associated with an interval is the maximum possible value within that interval.

The execution times were synthesized using the Mitre numbers as a baseline, and assuming we have CPUs capable of handling 70 MFLOPS on average, and network links of 120 MBytes/sec on average. However, the stochastic variation in the PDFs also accounts for response-times that deviate from the average, sometimes to a large extent. The more a designer carries out this sort of exercise, the more confident he/she can be that the synthesized design is robust. We believe this treatment of stochastic effects is a crucial element that previous efforts have overlooked.

Fault Tolerance. While multi-threading achieves a more flexible, efficient design, it also buys the ability to do on-the-fly reconfiguration.
Many radar designers want to have this capability for the sake of fault tolerance, though no in-process radar currently uses it, due to the reasons itemized above. In this scheme, each resource has a copy of the software components that will be placed on it when it gets reconfigured. When a resource fails, an executive process activates the redundant software components, whose design was, after all, predetermined before the system was deployed.
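The discretization of a shifted exponential distribution described earlier on this page can be sketched as follows. This is only an illustration under stated assumptions, not the tool's actual procedure: the function name, the use of equal-width intervals, and the choice of the exponential scale as E[t] - Min are assumptions made here, and the Var[t] parameter is not used.

```python
import math

def discretize_shifted_exponential(t_min, t_max, mean, num_steps):
    """Quantize an exponential shifted to t_min into num_steps intervals.

    Returns a list of (execution_time, probability) pairs. Each interval is
    represented by its upper edge (the maximum value within the interval),
    and the tail mass beyond t_max is spread over the intervals in
    proportion to their own mass.
    """
    if not (t_min < mean < t_max) or num_steps < 1:
        raise ValueError("need t_min < mean < t_max and num_steps >= 1")
    scale = mean - t_min                      # assumed parameterization

    def cdf(x):
        return 1.0 - math.exp(-(x - t_min) / scale)

    width = (t_max - t_min) / num_steps       # assumed: equal-width intervals
    kept_mass = cdf(t_max)                    # mass that falls inside [t_min, t_max]
    pmf = []
    for k in range(num_steps):
        lo = t_min + k * width
        hi = t_min + (k + 1) * width
        # Dividing by kept_mass redistributes the tail proportionally.
        pmf.append((hi, (cdf(hi) - cdf(lo)) / kept_mass))
    return pmf
```

Because the tail is truncated and each interval is rounded up to its upper edge, the mean of the discretized distribution will generally differ from E[t]; a larger NumSteps reduces the rounding effect.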












































Figure 5: Design Overview

considered; hence layouts like Figures 2 and 3 would not be possible. Moreover, several nodes on the Paragon were left empty, because there were no "leftover" functional units, whereas several were fully utilized. Second, the stochastic nature of a task's execution time is not exploited. The approach was based on MFLOP requirements for tasks, given certain data paths. This approach may result in a highly underutilized system. However, without using more sophisticated performance models, it would be hard to do much better.

Our method attempts to handle some of these problems automatically.

A schematic of the design process is illustrated in Figure 5, where the algorithms carry out the following functions: (1) partitioning the CPU and network capacity between the tasks; (2) selecting each service interval to minimize latency; and (3) checking the solution via simulation, to verify the integrity of the approximations used, and to ensure that each channel's profile is sufficiently smooth (i.e., not bursty). The main boxes are outlined as follows:

Load Share Estimation. The local schedulers are calibrated to satisfy the end-to-end constraints, via (1) partitioning the CPU and network capacity between the tasks; (2) selecting the service intervals to minimize latency; and (3) validating the solution via simulation, to verify the integrity of the approximations used.

Slack Distribution. Here slack is used either to enhance output quality, or to produce alternative layouts for fault-tolerance. It includes (1) calibrating loads for migrated tasks, and adjusting the load of the other tasks; and (2) estimating the performance of the reconfiguration.

The main partitioning algorithm processes each chain $\Gamma_i$, and finds a candidate load-assignment vector for it, denoted $u_i$.
(An element $u_{i,j}$ in $u_i$ contains the load allocated to $\tau_{i,j}$ on its resource.) Given a load assignment for $\Gamma_i$, the synthesis algorithm attempts to find a service interval $I_i$ at which $\Gamma_i$ achieves its nominal latency constraint. This computation is done approximately: for a given $I_i$, a latency estimate is derived by treating all of $\Gamma_i$'s outputs uniformly, and deriving an i.i.d. latency distribution for it. If the latency exceeds the requirement, then the load assignment
vector is increased, and so on. Finally, if sufficient load is found for all the system tasks, the entire setup is simulated to ensure that the approximations were sound, after which excess capacity can be re-allocated for the sake of fault-tolerance.

[Figure 6: Overview of the channel analysis. The diagram shows chains 1 through 4 over the tasks IQ1, EQ1, RDFT1, CT1, ADFT1, KM1 and AIDFT1 (slices (1) and (2)), a fork/join point, data sync. points 1 and 2, and the latency variables $\mathrm{age}_0$, $\mathrm{age}_1$, $\mathrm{age}_2$, $\mathrm{age}_{1,2}$, $\mathrm{age}_3$, $\mathrm{age}_4$ and $\mathrm{age}_{3,4}$.]

4 Latency Estimation

In this section we describe how we approximate the system's latency, given candidate load and service-interval parameters. Then in Section 5, we show how we make use of this technique to derive these parameters. In determining latency, we make heavy use of the TDM abstraction. Since each chain is effectively isolated from others over $I_l$-sized intervals of observation, we can analyze the behavior of each chain independently, without worrying about head-of-line blocking effects from other components.

We use a discrete-time model, in which the time units are in terms of a chain's frame-size; i.e., our discrete domain $\{0, 1, 2, \ldots\}$ corresponds to the real times $\{0, I_l, 2I_l, \ldots\}$. Not only does this reduction make the analysis more tractable, it also corresponds to "worst-case" conditions: since the underlying system may schedule a task execution at any time within an $I_l$-sized interval, we assume that input may be read as early as the beginning of an interval, and output may be produced as late as the end of an interval.

We go about constructing a model in a compositional (albeit approximate) manner, using the following techniques.

Decomposition into Chains: We first decompose a channel into its constituent chains, by simply traversing the flow-graph between all fork/join points. In analyzing each chain, we abstract it as being independent of all others, while in fact this may not be the case in the real system.
In the case of SAR, joins occur at the Corner Turn task, and at the output of the entire system. As an example, the graph in Figure 3 can be decomposed into 4 chains, `chain 1', `chain 2', `chain 3', and `chain 4', which are shown in Figure 6.
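The traversal behind this decomposition can be sketched as follows. This is a minimal illustration, assuming the flow-graph is an acyclic graph given as a list of (producer, consumer) task pairs; the function name and the node names in the usage example are hypothetical and do not reproduce the actual SAR layout. A chain is cut wherever a task has several successors (a fork) or several predecessors (a join).

```python
from collections import defaultdict

def decompose_into_chains(edges):
    """Split an acyclic flow-graph into maximal fork/join-free chains."""
    succ, pred = defaultdict(list), defaultdict(list)
    order = {}                       # remembers first-seen order of tasks
    for src, dst in edges:
        succ[src].append(dst)
        pred[dst].append(src)
        order.setdefault(src, None)
        order.setdefault(dst, None)

    def continues_chain(a, b):
        # The edge a -> b stays inside one chain only if a does not fork
        # and b is not a join.
        return len(succ[a]) == 1 and len(pred[b]) == 1

    chains = []
    for node in order:
        # Skip tasks that are in the interior (or at the tail) of a chain.
        if len(pred[node]) == 1 and continues_chain(pred[node][0], node):
            continue
        chain = [node]
        while (len(succ[chain[-1]]) == 1
               and continues_chain(chain[-1], succ[chain[-1]][0])):
            chain.append(succ[chain[-1]][0])
        chains.append(chain)
    return chains
```

For a toy graph in which a source `S` forks into two two-task pipelines that join at `J`, followed by one more task, the call `decompose_into_chains([("S","a1"),("a1","a2"),("S","b1"),("b1","b2"),("a2","J"),("b2","J"),("J","c1")])` yields four chains: the source alone, the two parallel pipelines, and the chain that starts at the join.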
Per-Chain Analysis: Each chain is analyzed in a top-down fashion, with results from producers used to gauge the metrics for consumers. In this fashion, we generate an approximate latency distribution for each chain. Figure 6 shows synthesized random variables $\mathrm{age}_1$ and $\mathrm{age}_2$, which correspond to the latencies of `chain 1' and `chain 2', respectively. Likewise, $\mathrm{age}_3$ and $\mathrm{age}_4$ denote the end-to-end latency from external input to the end of `chain 3' and `chain 4', respectively. The arrival times of pulses in a subframe are synchronized; hence, the variable $\mathrm{age}_0$ denotes the latency required to buffer up these pulses. If multiple subframes in a frame go through a chain, the chain's latency is considered to be that of the last subframe exiting. As an example, in Figure 6 our processing granularity is divided into 8 subframes, with 2 parallel pipelines handling range-compression. Hence 4 subframes go through chain 1, and $\mathrm{age}_1$ characterizes the latency of the 4th subframe to pass through chain 1.

Synchronization: At a synchronization point, a frame is composed from the subframes of the joining chains. The latency of a whole frame is estimated from those of the joining chains. For example, in Figure 6, the frame latency $\mathrm{age}_{1,2}$ up to `data sync. point 1' is estimated from $\mathrm{age}_1$ and $\mathrm{age}_2$ using the following equation:

$$\Pr[\mathrm{age}_{1,2} = t] = \Pr[\mathrm{age}_1 = t]\,\Pr[\mathrm{age}_2 < t] + \Pr[\mathrm{age}_1 < t]\,\Pr[\mathrm{age}_2 = t] + \Pr[\mathrm{age}_1 = t]\,\Pr[\mathrm{age}_2 = t] \tag{Eq 1}$$

This equation sets the per-frame latency distribution to reflect that of the slowest chain feeding into the sync point, i.e., the maximum of the two latencies, under the assumption that they are independent. For three or more chains, we can get the frame latency distribution by applying the above equation repeatedly. Moreover, we use the accumulated latency at the corner-turn as the base-line latency for successive pulse-compression chains, etc.

4.1 Intra-Chain Analysis

In this section, we describe how to approximate the latency of the output of a chain when the latency of the input to the chain is given.
For each chain $\Gamma_i$, we go about constructing its latency model in a compositional (albeit inexact) manner, by processing each task locally, and using the results for its successors. Consider the following diagram, which portrays the flow of a computation at a single task:
[Diagram: task $\tau_{i,j-1}$ feeds task $\tau_{i,j}$. The input to $\tau_{i,j}$ arrives with age $\mathrm{age}_{i,j-1}$ and inter-output time $\mathrm{Out}_{i,j-1}$, waits for $W_{i,j}$, is processed for $\Psi_{i,j}$, and leaves with age $\mathrm{age}_{i,j}$ and inter-output time $\mathrm{Out}_{i,j}$.]

These random variables are defined as follows:

1. Data-age ($\mathrm{age}_{i,j}$): This variable charts the total accumulated time of a subframe in a frame, from entering $\Gamma_i$'s head to leaving $\tau_{i,j}$. When the age of the $k$-th subframe in a frame needs to be presented explicitly, the random variable $\mathrm{age}_{i,j}(k)$ is used.

2. Waiting time ($W_{i,j}$): The duration of time an input is buffered, waiting for $\tau_{i,j}$ to process all the preceding subframes. Like $\mathrm{age}_{i,j}(k)$, the random variable $W_{i,j}(k)$ denotes the waiting time of the $k$-th subframe in a frame.

3. Processing time ($\Psi_{i,j}$): If $t_{i,j}$ is a random variable ranging over $\tau_{i,j}$'s PDF, then $\Psi_{i,j} \stackrel{\mathrm{def}}{=} \lceil t_{i,j} / E_{i,j} \rceil$ is the corresponding variable in units of intervals.

4. Inter-output time ($\mathrm{Out}_{i,j}$): An approximation of $\tau_{i,j}$'s inter-output time, in terms of intervals; it measures the time between two successive outputs.

5. Number of subframes per chain ($s_i$): The number of subframes in a frame that chain $\Gamma_i$ processes. If $K$ subframes are processed by a functional unit, and the unit is split into $D$ parallel slices, of which $\Gamma_i$ is one, then $s_i = K/D$.

Again, we assume subframes arrive synchronously, which simplifies our analysis considerably. Then, we adjust the age distribution to account for the initial buffering required. Finally, the validation of this approximation is done via simulation.

We compute the distribution of $\mathrm{age}_{i,j}$ via the following recurrence relation:

$$\mathrm{age}_{i,j} = \mathrm{age}_{i,j-1} + W_{i,j} + \Psi_{i,j}$$

where $\mathrm{age}_{i,0}$ depends on the chain's position in the pipeline. We assume that input arrivals are synchronized at head chains; hence, at the IQ phase the age is set to 0 or 1, depending on whether arrivals are read in before or after the first service period.
As for the corner-turn phase, we set the incoming age distribution to be that of the synthesized compression chains.

Within a chain, we approximate the entire $\mathrm{age}_{i,j}$ distribution by assuming the three constituent variables to be independent, i.e.,

$$\Pr[\mathrm{age}_{i,j} = k] = \sum_{k_1 + k_2 + k_3 = k} \Pr[\mathrm{age}_{i,j-1} = k_1]\,\Pr[W_{i,j} = k_2]\,\Pr[\Psi_{i,j} = k_3]$$
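The two combination rules introduced so far, the maximum at a synchronization point (Eq 1) and the sum of independent variables in the age recurrence, can be sketched numerically as follows. This is a minimal sketch, assuming each distribution is stored as a list whose index is a number of service intervals; the function names are illustrative and not taken from the synthesis tool.

```python
def convolve(p, q):
    """PMF of X + Y for independent X ~ p and Y ~ q."""
    out = [0.0] * (len(p) + len(q) - 1)
    for a, pa in enumerate(p):
        for b, qb in enumerate(q):
            out[a + b] += pa * qb
    return out

def age_pmf(prev_age, waiting, processing):
    """age_{i,j} = age_{i,j-1} + W_{i,j} + Psi_{i,j}, all assumed independent."""
    return convolve(convolve(prev_age, waiting), processing)

def join_max(p, q):
    """Eq 1: PMF of max(X, Y) for independent X ~ p and Y ~ q."""
    n = max(len(p), len(q))
    p = p + [0.0] * (n - len(p))
    q = q + [0.0] * (n - len(q))
    out, below_p, below_q = [], 0.0, 0.0   # running Pr[X < t], Pr[Y < t]
    for t in range(n):
        out.append(p[t] * below_q + below_p * q[t] + p[t] * q[t])
        below_p += p[t]
        below_q += q[t]
    return out
```

For instance, with an incoming age of exactly 1 interval, no waiting, and a processing time of 1 or 2 intervals with equal likelihood, `age_pmf([0.0, 1.0], [1.0], [0.0, 0.5, 0.5])` places probability 0.5 on ages 2 and 3. Applying `join_max` repeatedly handles three or more joining chains.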
As for a join-point, the latency is synthesized from all incoming chains, and indeed is extracted by analyzing the behavior of the subframe arriving last in each chain. Assume $\Gamma_i$ processes $s_i$ subframes per frame, and the latency of the $s_i$-th subframe is characterized by $\mathrm{age}_i(s_i)$. Now assume $\Gamma_{i+1}$ flows into the same sync point as $\Gamma_i$, and it also processes $s_i$ subframes. Then $\mathrm{age}_{mF}$, the joint incoming latency of the entire frame, is extracted by forming their joint probability distribution:

$$\Pr[\mathrm{age}_{mF} = t] = \Pr[\mathrm{age}_i(s_i) = t] \cdot \Pr[\mathrm{age}_{i+1}(s_i) < t] + \Pr[\mathrm{age}_i(s_i) < t] \cdot \Pr[\mathrm{age}_{i+1}(s_i) = t] + \Pr[\mathrm{age}_i(s_i) = t] \cdot \Pr[\mathrm{age}_{i+1}(s_i) = t]$$

Likewise, the end-to-end latency of a frame is derived from the latency distributions of the subframes at the end of the channel; when multiple chains join at the end of the channel, frame latency is approximated as above.

Now we need to derive the latency distributions for connected chains, from top to bottom.

Case 1: Head Chain

In head chains, an input subframe arrives periodically, and the latency distribution is set as explained above. From the perspective of successor chains, subframes are batched, and considered as single inputs.

Case 2: Non-Head Chain

We approximate the latency of the first subframe using the same technique as outlined above. That is, we first estimate the latency for the first subframe in a frame, $\mathrm{age}_{i,j}(1)$. Then $\mathrm{age}_{i,n}(k)$ is approximated by adding the extra waiting time that the $k$-th subframe suffers due to the preceding subframes queued up. The extra waiting time $W'_i(k)$ of the $k$-th subframe is approximated as the sum of the $k-1$ independent execution times of the head task $\tau_{i,1}$:

$$\Pr[W'_i(k) = t] = \sum_{t_1 + \cdots + t_{k-1} = t} \; \prod_{l=1}^{k-1} \Pr[\Psi_{i,1} = t_l]$$

The age of the $k$-th subframe is approximated as the sum of the age of the first subframe and the extra waiting time of the $k$-th subframe:

$$\Pr[\mathrm{age}_{i,n}(k) = t] = \sum_{t'} \Pr[\mathrm{age}_{i,n}(1) = t'] \cdot \Pr[W'_i(k) = t - t']$$

Waiting Time.
Obtaining reasonable waiting-time metrics at each stage of a chain is a non-trivial affair, due to the arbitrary execution time PDFs and the scale of the problem. In carrying out the analysis, we use a stochastic process to characterize $\tau_{i,j}$'s total remaining work.

Let $X_{i,j}$ be an embedded Markov chain corresponding to input arrival events for $\tau_{i,j}$. This chain is fully connected, and in general it is infinite, as is shown in Figure 7. The state-transitions
occur at the new input arrival instants, and the states represent the total remaining work for the task, excluding that of the new arrival.

[Figure 7: Infinite Markov chain $X_{i,j}$, with states 0, 1, 2, ...]

By solving the following set of equations, where $P$ denotes the one-step transition matrix corresponding to $X_{i,j}$, we obtain the steady state probabilities. Here $\pi_t$ is the steady state probability of being in state $t$.

$$\vec{\pi} \cdot P = \vec{\pi}, \qquad \sum_{t \ge 0} \pi_t = 1 \tag{Eq 2}$$

Moreover, the waiting time distribution is exactly the steady state probability distribution:

$$\Pr[W_{i,j} = t] = \pi_t$$

As for the Markov chain's transition probabilities, note that a transition from state $k$ to state $l$ represents two facts: (1) a new input was just received; and (2) it will experience a waiting time of $l$ intervals. Depending on the destination state, there are two cases:

Case A: Destination state is 0.

$$P[t, 0] = \Pr[\Psi_{i,j} + t \le \mathrm{Out}_{i,j-1}] = \sum_{t_1 > 0} \Pr[\Psi_{i,j} = t_1]\,\Pr[t + t_1 \le \mathrm{Out}_{i,j-1}]$$

Case B: Destination state $t_1 > 0$.

$$P[t, t_1] = \Pr[t + \Psi_{i,j} - t_1 = \mathrm{Out}_{i,j-1}] = \sum_{t_2 > 0} \Pr[\Psi_{i,j} = t_2]\,\Pr[\mathrm{Out}_{i,j-1} = t + t_2 - t_1]$$

Now we show how to derive task $\tau_{i,j}$'s inter-output distribution, $\mathrm{Out}_{i,j}$. First, note that for all tasks in a chain $\Gamma_i$, the average inter-arrival time of subframes, $SP_i$, depends on (1) the average
inter-arrival time of a frame to $\Gamma_i$ ($FP_i$), and (2) the number of subframes $\Gamma_i$ should process, or $s_i$:

$$SP_i = \frac{FP_i}{s_i}$$

For example, consider Figure 2. Denote the set of range-compression chains as $\{\Gamma_1, \Gamma_2\}$, and denote the set of pulse-compression chains as $\{\Gamma_3, \Gamma_4, \Gamma_5, \Gamma_6\}$. Since our datapath is assumed to be partitioned into 8 subframes, $s_1 = s_2 = 4$, and $s_3$ through $s_6$ are all set to 2. If pulses arrive at 556 Hz, the period of the subframe arrival for the head chains ($SP_1$ and $SP_2$) is $\frac{512}{556} \cdot \frac{1}{4} = 0.230216$ s.

We calculate the inter-output times for each task as follows. First, for head-tasks, we quantize the inter-arrival period with a service interval of size $I_l$, as follows:

$$\Pr\!\left[\mathrm{Out}_{i,0} = \left\lfloor \frac{SP_i}{I_l} \right\rfloor\right] = 1 - \frac{SP_i \bmod I_l}{I_l}$$

$$\Pr\!\left[\mathrm{Out}_{i,0} = \left\lceil \frac{SP_i}{I_l} \right\rceil\right] = \frac{SP_i \bmod I_l}{I_l}$$

In Figure 2, when the service interval is 0.01 seconds, $\Pr[\mathrm{Out}_{1,0} = 23] = 0.9784$ and $\Pr[\mathrm{Out}_{1,0} = 24] = 0.0216$ for the head chain $\Gamma_1$.

This is an approximation; when the period of input arrival is not divisible by the service interval, the traffic denoted by $\mathrm{Out}_{i,0}$ fails to model the periodic arrival. This statistical approximation may lead to overestimating the latency.

For the other tasks, inter-output times $\mathrm{Out}_{i,j}$ are calculated using the execution time $\Psi_{i,j}$ and accounting for the bubble between two outputs from $\tau_{i,j}$. This bubble $B_{i,j}$ is inserted when $\tau_{i,j}$ is idle.

$$\Pr[B_{i,j} = t] = \sum_{t_1, t_2} \Pr[W_{i,j} = t_1]\,\Pr[\Psi_{i,j} = t_2]\,\Pr[\mathrm{Out}_{i,j-1} = t_1 + t_2 + t] \tag{Eq 3}$$

$$\Pr[\mathrm{Out}_{i,j} = t] = \sum_{t_1} \Pr[B_{i,j} = t_1]\,\Pr[\Psi_{i,j} = t - t_1] \tag{Eq 4}$$

Remarks: Several design thresholds were selected for the SAR. For example, in solving (Eq 2), we restricted the number of states to 500, and we aggregated the (small) probabilities for higher-order transitions into those for state 499:

$$P[t, 499] = \Pr[t + \Psi_{i,j} - \mathrm{Out}_{i,j-1} \ge 499] = \sum_{k > 0} \Pr[\Psi_{i,j} = k]\,\Pr[\mathrm{Out}_{i,j-1} \le t + k - 499]$$

Also, we found that waiting times could be over-estimated, due to the approximation of the
$\mathrm{Out}_{i,j}$. While inputs are bursty, they do have a regularity, which $\mathrm{Out}_{i,j}$ does not account for. This can be shown with a one-task chain as an example, where the inputs arrive periodically, every two quanta. The task's execution time can be either 1 or 2 intervals, with a 50-50 likelihood of each. By solving the equations shown above, we have the following:

  Pr[W = 0]  Pr[B = 0]  Pr[B = 1]  Pr[Out = 1]  Pr[Out = 2]  Pr[Out = 3]
  1.0        0.5        0.5        0.25         0.5          0.25

In this approximation, $n$ outputs can be produced in $n$ consecutive intervals, with probability $0.25^n$. However, we know this is impossible: this sort of batched burstiness cannot occur, due to the regular nature of the arrival process. This conservatism leads to overestimating the waiting time for a task; for longer chains, the approximation gets worse.

To avoid this sort of amplification, we also tried another technique: simply assuming a re-regulated traffic flow, all the way through the pipeline. That is, in this alternative, the following is assumed to hold:

$$\Pr\!\left[\mathrm{Out}_{i,0} = \left\lfloor \frac{SP_i}{I_l} \right\rfloor\right] = 1 - \frac{SP_i \bmod I_l}{I_l}$$

$$\Pr\!\left[\mathrm{Out}_{i,0} = \left\lceil \frac{SP_i}{I_l} \right\rceil\right] = \frac{SP_i \bmod I_l}{I_l}$$

$$\forall \tau_{i,j} \text{ in } \Gamma_i, \; \forall k: \quad \Pr[\mathrm{Out}_{i,j} = k] = \Pr[\mathrm{Out}_{i,0} = k]$$

This alternative fails to consider bursty outputs, and it should be used with care. It could, in essence, lead to underestimating the waiting time of each task, and result in an optimistic approximation of the end-to-end latency. However, we argue that as the end-to-end delay constraints get tighter, the simplification makes more sense. As the constraints get tighter, a task is given more utilization of a resource, and the inter-output traffic gets less bursty.

In the sequel, this alternative is used to generate our results. Then the numbers are checked against a simulation model, as shown in Section 6.

5 System Design

We now revisit the "high-level" problem of determining the system's parameters, with the objective of satisfying each chain's performance requirements. As stated in the introduction, the design problem may be viewed as inter-related sub-problems:

1. Load Assignment. How should the CPU and network load be partitioned among the set of tasks, so that the latency requirements are met?

2. Interval Assignment. Given a load-assignment to the tasks, what is the optimal service interval to maximize effective throughput?
Synthesize(): returns {(U_l, I_l, L_l, F_l) : 1 <= l <= n}
(1)  foreach Channel C_l (1 <= l <= n) {
(2)      I_l <- min(SP_i)   (for all Γ_i in C_l)
(3)      F_l <- LF_l
(4)      foreach Chain Γ_i in the Channel C_l {
(5)          u_{i,j} <- E[t_{i,j}] / SP_i   (for all τ_{i,j} in Γ_i in C_l)
         }
     }
(6)  ρ_k <- Σ_{resource(τ_{i,j}) = k} u_{i,j}   (for all 1 <= k <= m)
(7)  S <- {C_l : 1 <= l <= n}
(8)  while (S ≠ ∅) {
(9)      Find the τ_{i,j} in Γ_i in C_l ∈ S s.t. r_k = resource(τ_{i,j}) which maximizes
(10)         w_{i,j} <- H(ML_l, L_l, E[W_{i,j}])
(11)     if (ρ_k > ρ^m_k - Δ)
(12)         return Failure
(13)     u_{i,j} <- u_{i,j} + Δ
(14)     ρ_k <- ρ_k + Δ
(15)     (L_l, I_l) <- Get_Interval(C_l, I_l, U_l, F_l)
(16)     while (L_l <= ML_l and F_l <= UF_l) {
(17)         L'_l <- L_l ; U'_l <- U_l ; I'_l <- I_l ; F'_l <- F_l
(18)         F_l <- F_l + 1
(19)         (L_l, I_l) <- Get_Interval(C_l, I_l, U_l, F_l)
         }
         if (F_l > UF_l)
(20)         S <- S - {C_l}
     }
(21) return ({(U'_l, I'_l, L'_l, F'_l) : 1 <= l <= n})

Get_Interval(C_l, I_l, U_l): returns (L_l, I_l)
(1)  L_l <- get_latency(C_l, I_l, U_l)
(2)  for (t <- I_l - 1; t > 0; t <- t - 1) {
(3)      if ((L_l mod t) < (ε · L_l)) {
(4)          L'_l <- get_latency(C_l, t, U_l)
(5)          if (L'_l < L_l) {
(6)              L_l <- L'_l
(7)              I_l <- t
             }
         }
     }
(8)  return (L_l, I_l)
Figure 8: Synthesis Algorithm

3. Input Frequency Setting. If the highest pulse rate cannot be met, what is the ideal pulse rate for this problem?

Note that load-allocation is the main "inter-channel" problem here, and interval-assignment is an intra-channel problem. With our time-division abstraction, altering a task's service interval will not affect the (average) rates of other chains in the system.

The design procedure in Figure 8 shows the basic idea of how to solve these problems.

In general, input frequency is the only parameter that is to be synchronized among those 3 channels while designing. The other parameters, such as the service interval of each chain, may be set differently. Here we show the design of only 1 channel of the SAR.
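The interval search of Figure 8 can be sketched in executable form as follows. This is an illustrative sketch under stated assumptions, not the tool's implementation: `get_latency` is a hypothetical caller-supplied function standing in for the latency analysis of Section 4 (the channel and load vector are assumed to be captured inside it), intervals and latencies are assumed to be integers in the same time unit, and the tolerance parameter `eps` corresponds to the small truncation factor of line (3).

```python
def get_interval(initial_interval, get_latency, eps=0.01):
    """Search downward from initial_interval for a lower-latency interval.

    get_latency(t) -> estimated end-to-end latency when the service
    interval is t (hypothetical stand-in for the Section 4 analysis).
    Returns (best_latency, best_interval).
    """
    best_interval = initial_interval
    best_latency = get_latency(initial_interval)
    for t in range(initial_interval - 1, 0, -1):
        # Only evaluate candidates for which the current latency estimate is
        # nearly a multiple of t, so that little is lost to truncation.
        if best_latency % t < eps * best_latency:
            candidate = get_latency(t)
            if candidate < best_latency:
                best_latency, best_interval = candidate, t
    return best_latency, best_interval
```

The filter on line (3) of Figure 8 keeps the number of calls to the (comparatively expensive) latency analysis small; with a latency function that does not depend on the interval, the search simply returns the initial interval.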
Input Frequency Setting. If the system fails to find a feasible solution at the lowest input frequency ($LF_l$), the design is infeasible. When a feasible load allocation is found at that frequency, the frequency is increased, as is shown in line (18) of algorithm Synthesize() in Figure 8. At the new input frequency, the same load assignment procedure is repeated. This procedure ends either when a feasible load allocation is found at the highest input frequency ($UF_l$) or when no more improvement in the load allocation can be made. As for SAR, $LF_l$ = 200 Hz, and $UF_l$ = 556 Hz. The largest input frequency with a feasible load allocation is returned, which is the best achievable performance that the tool can find.

Load Assignment. Load-assignment works by iteratively refining the load vectors (the $u_i$'s), until a feasible solution is found. The entire² algorithm terminates when the latencies for all channels meet their performance requirements, or when it discovers that no solution is possible. We do not employ backtracking, and a task's load is never reduced. This means the solution space is not searched exhaustively, and in some tightly constrained systems, potential feasible solutions may not be found.

Load-assignment is task-based, i.e., it is driven by assigning additional load to the task estimated to need it the most. The heart of the algorithm can be found on lines (9)-(10), where all of the remaining unsolved channels are considered, with the objective of assigning additional load to the "most deserving" task in one of those channels. This selection is made using a heuristic weight $w_{i,j}$, reflecting the potential benefit of increasing $\tau_{i,j}$'s utilization, in the quest of increasing the channel's end-to-end performance.

The weight actually combines two factors, each of which plays a part in achieving feasibility: (1) the additional latency improvement required, normalized via range-scaling to the interval [0,1]; (2) the average waiting time $E[W_{i,j}]$ of the task with the current load assignment.
For the results in this paper, the heuristic we used was:

$$w_{i,j} \leftarrow H(ML_l, L_l, E[W_{i,j}]) = \frac{L_l - ML_l}{ML_l} \cdot E[W_{i,j}]$$

Then the selected task gets its utilization increased by some tunable increment $\Delta$. Smaller increments will obviously lead to a greater likelihood of finding feasible solutions; however, they also incur a higher cost. (For the results presented in this paper, we set $\Delta = 0.05$.)

² Clearly, in this example the degree of parallelism is two, and the computation with more than two chains is similar.

After additional load is given to the selected task, the channel's new interval-size and latency are determined; if it meets its maximum latency requirements at the highest input frequency, it
can be removed from further consideration.

Interval Assignment. "Get_Interval" derives a feasible service interval (if one exists). While the problem of interval-assignment seems straightforward enough, there are a few non-linearities to surmount. First, the true, usable load for a task $\tau_{i,j}$ in $\Gamma_i$ in $C_l$ is given by $\lfloor u_{i,j} \cdot I_l \rfloor / I_l$, due to the fact that the system cannot multiplex load at arbitrarily fine granularities of time. Second, in our analysis, we assume that a task finishes only at the end of the service interval, which errs on the side of conservatism. Third, the utilization factor of task $\tau_{i,j}$, which is $E[\Psi_{i,j}] / E[\mathrm{Out}_{i,j-1}]$, may vary with the service interval.

The negative effect of the second factor is likely to be higher at larger intervals, since it results in adding the fractional part of a computation's final interval. On the other hand, the first factor becomes critical at smaller intervals. The third factor depends on the PDF of the task, and the service interval. The approximation utilizes a few simple rules. Moreover, to speed up the computation in the algorithm in Figure 8, we restrict the search to situations where our interval-based delay estimate truncates no more than $100\varepsilon$ percent of the continuous-time deadline, where $\varepsilon \ll 1$. Subject to these guidelines, intervals are evaluated via the latency analysis presented in Section 4, which determines the current $L_l$ metric.

Slack Distribution. Slack can be used either for fault-tolerance, or for increasing performance. If the latter is desired, one need only re-run the "Constraint Satisfaction" algorithm, with a higher target performance. Here we focus on slack distribution for fault-tolerance.

Fault-tolerance is achieved by (1) distributing resource slack to the tasks which are relocated due to resource failures, and (2) adjusting the load allocation of the tasks which are in the same channel as the relocated tasks.
When a resource has slack larger than the sum of the load thresholds of the tasks relocated to it (or when the system would not be overloaded even after activating those relocated tasks), all the tasks are given their load thresholds. There may not be sufficient slack to maintain the original performance; hence some decisions need to be made. In general, we use the following rules to adjust the load of existing tasks:

rule 1: The loads of the tasks in channels unaffected by the fault are preserved. This is necessary to prevent the effect of the fault from propagating to other channels.

rule 2: In an overloaded resource (due to the relocated tasks), the loads of these tasks are reduced. This "shares" the effect of the fault throughout the channel, evenly in all multi-threaded resources. In the implementation, we gradually reduce the load shares of the old tasks, and redistribute them to the relocated tasks. This is done evenly, and ensures that thresholds in the resources are not overloaded.

rule 3: Some resources may be highly loaded at this point, the effect of which could produce bubbles in the pipeline. However, often there is spare slack on other resources hosting successor
tasks; if these tasks are given more load share, they can potentially compensate for the bubble. Hence, we re-run the "Constraint Satisfaction" process, to re-adjust the loads in these successor tasks. Often we can achieve a reconfiguration with the original performance, or only a slight degradation.

A. Synthesized Solutions for Channels.

  I_i (ms)  L_i(A) (s)  L_i(S) (s)  u_iq   u_n1   u_rt   u_n2   u_ct   u_n3
  36        2.6557      2.887       0.423  0.140  0.541  0.140  0.113  0.086

  u_n4   u_at   u_n5   u_n6   u_km/aidft  u_n7
  0.086  0.437  0.086  0.086  0.539       0.086

B. Resource Capacity Used by System.

  r1     r2     r3     r4     r5     r6     r7     r8     r9     r10    r11    r12
  0.846  0.541  0.541  0.226  0.437  0.437  0.437  0.437  0.539  0.539  0.539  0.539

  n1     n2     n3     n4     n5     n6     n7
  0.281  0.281  0.171  0.171  0.171  0.171  0.343

Figure 9: Synthesized Solution of the design I.

5.1 Design Examples Revisited

Figure 9 and Figure 10 show the load allocations for each task in our two designs, which were produced by the synthesis algorithm. In those tables, $L_i(A)$ denotes the latency estimated analytically, and $L_i(S)$ denotes the latency measured in the simulation model. Note that we found feasible load allocations and service intervals at 556 Hz for both layouts.

As for the reconfiguration of Design I: note that resource r2 could accommodate the load thresholds of the relocated tasks CT1(1) and CT1(2). The load of processor "r2" becomes 0.917, which is high, but less than its maximum allowable load of 0.95.

The reconfiguration of Design II shows a different situation. Resource "r4" did not have enough slack (0.359) to accommodate the load threshold (0.591) of the relocated task RDFT1(1). Hence, by load adjustment rule 2, we "steal" some load from RDFT1(2), and now reach our threshold of 0.95, the peak allowed capacity. Now both tasks are given half of the load; however, no service interval can be found that satisfies the latency constraints at the 556 Hz input frequency.

However, now rule 3 kicks in. The loads of other tasks are increased, to help compensate for delays at the r4 bottleneck.
The result is a sampling frequency of 506 Hz, which is not the peak frequency, but still falls within the SAR guidelines. Figure 12 shows the results of this reconfiguration.
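The slack arithmetic in the Design II reconfiguration can be checked with a small sketch. It covers only the two-task situation described above (one resident task, one relocated task) and assumes the even split that the text reports; the function name and signature are hypothetical, and the gradual redistribution used in the actual implementation is not modeled.

```python
def place_relocated_task(capacity, resident_load, relocated_threshold):
    """Return (resident_load, relocated_load) after relocating one task.

    capacity            -- peak allowed load of the resource (0.95 in the text)
    resident_load       -- load of the task already hosted on the resource
    relocated_threshold -- load threshold of the task moved onto the resource
    """
    slack = capacity - resident_load
    if relocated_threshold <= slack:
        # Enough slack: the relocated task simply receives its threshold.
        return resident_load, relocated_threshold
    # Rule 2, two-task special case: share the full capacity evenly.
    share = capacity / 2
    return share, share
```

With the Design II numbers, `place_relocated_task(0.95, 0.591, 0.591)` finds a slack of about 0.359, which is below the 0.591 threshold, and therefore gives each of RDFT1(1) and RDFT1(2) a load of 0.475.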
A. Synthesized Solutions for Channels.

  I_i (ms)  L_i(A) (s)  L_i(S) (s)  u_iq   u_n1   u_eq   u_n2   u_rt   u_n3   u_ct
  25        2.65        2.9047      0.372  0.090  0.159  0.090  0.591  0.090  0.163

  u_n4   u_at   u_n5   u_km   u_n6   u_aidft  u_n7
  0.121  0.924  0.121  0.201  0.121  0.924    0.171

B. Resource Capacity Used by System.

  r1     r2     r3     r4     r5     r6     r7     r8     r9     r10
  0.743  0.317  0.591  0.591  0.326  0.924  0.924  0.403  0.924  0.924

  n1     n2     n3     n4     n5     n6     n7
  0.181  0.181  0.181  0.243  0.243  0.243  0.342

Figure 10: Synthesized Solution of the design II.

6 Simulation

Since the latency analysis uses some key simplifying approximations, we validate the resulting solution via a simulation model. This is a part of our synthesis tool, as shown in Figure 5.

We first review the main sources of the approximations. First, we assume periodic arrivals of inputs to all tasks; in reality, inputs may be bursty. Second, we use an approximate joint probability calculation to determine latency at synchronization points. In reality, there may be a lot of data-dependent correlation between the response times, and this is ignored in the approximation. Third, statistical modeling of periodic input arrivals is highly inaccurate, due to quantization by the service interval. Fourth, our analysis assumes that a task's state-changes always occur at its interval boundaries; hence, even intermediate output times are assumed to take place at the interval's end. A further approximation is inherent in our compositional data-age calculation.

Some of these approximations are conservative; others lead to optimistic results. However, the simulation model discards these approximations, and keeps track of all subframes and frames throughout the channel, as well as the "true" states they induce in their participating tasks. Also, the clock progresses along the real-time domain; hence, if a task ends in the middle of an interval, its output gets placed in the successor's input buffer at that time.
Also, the simulation model schedules resources using a modified deadline-monotonic dispatcher (where a deadline is considered the end of an interval), so higher-priority tasks will get to run earlier than the analytical method assumes. Recall that the analysis implicitly assumes that computations may take place as late as possible, within a given interval.

On the other hand, the simulator does inherit some other simplifications used in our analysis model. For example, input reading is assumed to happen at the start of an interval. As in
[Figure 11 panels: "Latency of a Whole Frame in design 1 (8 subframes/frame)" and "Latency of a Whole Frame in design 2 (8 subframes/frame)", each with Analysis and Simulation curves; "Standard Deviation of Average Latency in design 1" and "Standard Deviation of Average Latency in design 2", each with a Simulation curve.]
Figure 11: Average latency and its standard deviation for design I and design II at different service intervals

the analysis, context switch overheads are not considered; rather, they are charged to the load distributions. In the simulator, we did not implement infinite-sized buffers; rather, they were restricted to 2000 slots. However, given the 3 second end-to-end delay, there is little chance that such a huge buffer would fill up, and indeed it did not in the simulations.

Each simulation trial runs for 20000 frame inputs, which corresponds approximately to 20000 seconds of running time in real radar processing. The 95 percent confidence interval of each run is ±2 milliseconds.

Figure 11 shows the latencies of Design I and Design II estimated by analysis, and measured by simulation, at different service intervals. The load allocation of each task is shown in Figure 9 and Figure 10. Figure 12 shows the latencies of Design II at different input frequencies before and after reconfiguration at a fault. Design II is feasible at the highest input frequency, 556 Hz, before a fault occurs. After the reconfiguration, it is feasible at input frequencies of 505 Hz or lower.
[Figure 12 panels: "Latency of a Whole Frame in design 2 before/after a fault occurs", with curves Analysis w/ fault, Simulation w/ fault, and Analysis w/o fault; "Standard Deviation of the Latency of a Whole Frame in design 2 with a fault", with curve Simulation w/ fault.]
Figure 12: Average latency of the reconfigured and the original design II at a 25 ms service interval as the input frequency changes, and standard deviation of the simulation of the reconfigured design.

From Figure 11 and Figure 12, we note that the analysis crosses from conservative estimation to optimistic estimation. The main source of the conservatism is the high utilization factors of the tasks in the chain. The utilization of a task may vary with its service interval size. Consider Design I, and RDFT1(1). At service intervals of 12 ms and 84 ms, its utilizations are 0.9026 and 0.9946 respectively. At such high utilizations, the system gets less stable, and the statistical approximations for inter-output times start deviating from the true times. We conjecture that this is why the service-interval graphs possess some conservative spikes. In the experiments we ran, however, more than 85 percent of the analytical estimations were within 10 percent of the simulated results.

7 Conclusion

We presented a semi-automated design synthesis technique for calibrating resources in an embedded signal processing system. We showed how analytical techniques can help automate design, and also account for stochastic effects common in COTS systems. We showed how a large SAR could exploit a simple software fault-tolerance scheme, while still giving designers a priori confidence in the stability of their system. Our synthesis tool uses a variety of simple analytic techniques to estimate latency, in tandem with a heuristic search algorithm to find a feasible load partitioning.

Though approximation is used, the result is promising. Our four example layouts consist of more than 25 tasks and 15 shared resources, with compute-times modeled in a variety of different ways. Nonetheless, our methods found results which achieved the SAR requirements, and which could also be validated via an independent simulation model.
However, simulation is not the end of the story. Ultimately, one needs to build the application, and calibrate the kernels and drivers via our analytically-derived parameters. At that point, the system gets subjected to the most important validity test of all: on-line profiling. Even with a careful synthesis strategy, testing usually leads to some additional system tuning, to help compensate for the imprecise modeling abstractions used during static design.

In this regard, we plan on implementing a full-scale version of the RASSP SAR benchmark on a network of workstations; specifically, we plan to use off-the-shelf Pentium II processors, connected via gigabit Ethernet, and with standardized runtimes. In designing a signal processor on this sort of system, with all the stochastic effects it contains, we believe that a statistical technique like ours is not just one option: it could be the only option.
