Introduction
Image-processing applications often must not only provide accurate results but also meet real-time exigencies. This suggests a sensible division of labour, since in practice algorithmic designers are firmly wedded to workstations or PCs. Real-time acceleration in machine vision [1] can be provided either by specialist hardware such as field-programmable gate arrays (FPGAs) or by parallel processing, neither of which is convenient for algorithmic development. Radar, vision, and some varieties of speech processing all commonly have a strong pipelined structure. The Analysis Prediction Template Toolkit (APTT) provides a seamless way to model an image-processing pipeline graphically before purchase of the target hardware, and subsequently to construct a parallel application from the developer's code, without significantly compromising algorithms. Testing on a handwritten postcode recognition application has confirmed agreement to within 10% between simulation and target system for pipeline traversal latency and throughput, while at the same time allowing the designer to gain an intuitive feel for the behaviour of the application. Analytic results are available [2] to refine the prediction, though the simulation is already suitable for cross-architectural comparisons of asynchronous pipelines.
Design cycle
APTT supports a design cycle, Fig. 1, aimed at producing pipelines with a single primary data flow constructed from a set of parallel data farms. Sequential bottlenecks are masked by single-processor stages, possibly accelerated. Profiling the development code enables a tentative allocation of functions to different stages of the pipeline based on mean timing ratios. Normally, the potential for loop parallelism will determine the per-stage weighting of processors. A set of inner-loop timings enables distribution fitting by statistical or other means. Identifying a job as one iteration of the inner loop, jobs can then be grouped into tasks. Note that the grouping results in a change in distribution, as the distribution of a sum of independent job times is the convolution of the individual distributions. Two classes of problem are catered for: vision applications, where the semantic content pre-determines the extent to which an optimal grouping can be made, and low-level image processing, where there is generally more latitude in the way image segments can be grouped. Arriving at an optimal arrangement also requires consideration of dynamic effects such as communication bandwidth, both across the pipeline backplane and in subsidiary farms. Buffering can be added, at a cost, to enhance the bandwidth and smooth work flow. Though synchronous pipelines are best dealt with analytically, asynchronous pipelines are difficult in the general case, requiring the statistics of large deviations, which makes a discrete-event simulation an attractive and malleable tool.
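The effect of grouping jobs into tasks can be illustrated with a minimal sketch. It assumes independent, identically distributed job times on a discrete time grid; the probability values and the function names are hypothetical and are not taken from APTT.

```python
def convolve(p, q):
    """Discrete convolution of two probability mass functions (lists)."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out


def task_time_pmf(job_pmf, jobs_per_task):
    """Pmf of the total time of a task made of `jobs_per_task` independent jobs."""
    task_pmf = [1.0]  # an empty task takes zero time with probability 1
    for _ in range(jobs_per_task):
        task_pmf = convolve(task_pmf, job_pmf)
    return task_pmf


def mean(pmf):
    """Mean of a pmf whose index is the time in grid units."""
    return sum(t * p for t, p in enumerate(pmf))


# Hypothetical job-time pmf: P(0 units), P(1 unit), P(2 units).
job_pmf = [0.2, 0.5, 0.3]
task_pmf = task_time_pmf(job_pmf, jobs_per_task=4)
```

The task-time distribution is wider and differently shaped than the job-time distribution, while its mean is simply the job mean multiplied by the number of jobs per task.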
Integrated with APTT is a parallel-application generator, taking as input high-level descriptions of shared data structures. The other input is the inner-loop sequential code sections, after which boilerplated parallel code can be output. Built-in trace instrumentation and communication calls are intended for two generic targets: a network or cluster of workstations, or a modestly parallel machine. The former is suitable for verification of correct working of the application and the latter is suitable for performance testing. By means of the trace, the simulation prediction can be compared to actuality. As with the configurer tool, an intermediate form is employed for the application code so that communication primitives can be generated for a variety of machines.
Both simulation (prediction) and trace (analysis) are presented in a similar graphical format, though the former is intended as an abstract model while the latter is a physical model. A key difference is that feedback is explicitly represented in the analysis tool while the predictor reduces feedback to a flat representation.
Worked example
To check our work, simulated results were correlated with an application [3] previously implemented on an eight-module message-passing parallel machine, the Transtech Paramid. Recognition of handwritten postcodes with appropriate weightings can be split into three stages: identification of features within each character of a postcode, classification of those features to form a ranked list of candidate characters, and a search to match candidate postcodes against a dictionary of available postcodes. This moderately sized system, 4.5k lines of code, employs multiple algorithms with data dependency introduced in the final stage (UK postcodes tested could have 6 or 7 characters in the ratio 145:155), achieving 80% recognition accuracy. If the throughput constraint of 10 postcodes/s or the maximum latency constraint of 8 s were to be breached, then the computer processing would not keep up with the mechanical conveyor belt which transports the mail items.
The static timings are set out in Table 1. Timing a set of 300 postcodes (1945 characters) and then separately applying Kolmogorov-Smirnov and chi-squared tests established that the distributions of processing times were approximately deterministic (and not Gaussian, as had been supposed before the tests), while the final stage, assuming random ordering in the input file set up to test recognition accuracy, was matched by a Bernoulli distribution. In the original implementation, interstage buffering had been set by trial and error at 20 slots, while the local input buffer sizes were 10 slots. The aim had been to find the best throughput if jobs were instantaneously available in the worst-case scenario. Each application has some special features. In the postcode application, differing postcode (task) sizes occur in the final stage, which we bracketed by worst (all size seven) and best (all size six) cases. In Table 2, the worst-case estimates are compared with the implemented result, with favourable accuracy.
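The bracketing of the final stage can be sketched as follows. The 145:155 ratio of six- to seven-character postcodes comes from the text; the per-character time and the assumption that final-stage time grows linearly with postcode length are hypothetical placeholders, not the values of Table 1.

```python
# Ratio of 6- to 7-character postcodes in the test set, as stated in the text.
P_SIX = 145 / 300
P_SEVEN = 155 / 300

# Hypothetical final-stage cost per character, in seconds (not from Table 1).
PER_CHAR_TIME_S = 0.01


def final_stage_time(n_chars):
    """Final-stage time for one postcode, assuming linear cost in its length."""
    return n_chars * PER_CHAR_TIME_S


best_case = final_stage_time(6)    # all postcodes of size six
worst_case = final_stage_time(7)   # all postcodes of size seven

# Expected time under the two-outcome (Bernoulli) mix of sizes.
expected = P_SIX * best_case + P_SEVEN * worst_case

# Corresponding single-processor throughput bounds, in postcodes per second.
throughput_bounds = (1.0 / worst_case, 1.0 / best_case)
```

Any mix of the two sizes gives an expected time between the best and worst cases, which is why simulating only the two extremes is enough to bound the stage.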
The 3:3:1 pipeline simulation is optimal, as had been suggested by preliminary static analysis. Note that one of the eight processor modules is reserved for feeding the test file, to remove I/O dependency. Throughput is critical, while latency is well below the 8 s requirement on the Paramid, though on earlier transputer-based machines latency was an issue. Though one might seek to apply a simulation capturing more of the computer-system detail, experience has shown [4] that no greater accuracy necessarily results. In the APTT simulation, varying the number of processors in the pipeline above and below the number in the Paramid, Table 3, established possible cost/performance tradeoffs and highlights the advantages of the simulation tool in allowing rapid and complete exploration of the design space of possible parallel solutions. The original buffer slot sizes were probably set too high, as internal buffering was not found to be critical, while interstage buffering could be reduced throughout to the postcode character size (seven slots). When testing a real-time application it may be difficult to remove the effect of system-dependent I/O except by pre-loading a file. However, latency also arises if jobs are blocked on requesting entry to the 
Cross-architectural comparison
To allow the performance within APTT on one machine to be extrapolated to another, we sought a simple but widely recognised characterisation. A two-parameter model of performance has now been applied to a variety of parallel architectures [5], though apparently not previously in a predictor tool. For example, in Fig. 2, which is a log-log plot, the Paramid reaches half its maximum bandwidth with messages of about 60 bytes (first parameter, established by linear regression) before reaching steady state (second parameter). In this case, the user need only know the message length and the target processor to project results.
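A two-parameter communication model of this kind can be sketched as below. The sketch assumes the common linear-time form, in which message time is a fixed start-up cost plus length divided by asymptotic bandwidth; this form is an assumption suggested by the half-bandwidth description rather than a statement of the exact model in [5]. The half-performance length of 60 bytes is from the text, while the asymptotic bandwidth value is a hypothetical placeholder.

```python
N_HALF_BYTES = 60.0          # message length giving half the peak bandwidth (from the text)
R_INF_BYTES_PER_S = 1.0e6    # hypothetical asymptotic (steady-state) bandwidth


def message_time(n_bytes, r_inf=R_INF_BYTES_PER_S, n_half=N_HALF_BYTES):
    """Time to send a message: start-up cost n_half / r_inf plus n_bytes / r_inf."""
    return (n_bytes + n_half) / r_inf


def achieved_bandwidth(n_bytes, r_inf=R_INF_BYTES_PER_S, n_half=N_HALF_BYTES):
    """Effective bandwidth for a message of n_bytes under the two-parameter model."""
    return n_bytes / message_time(n_bytes, r_inf, n_half)


half_point = achieved_bandwidth(60.0)       # half of the asymptotic bandwidth
long_message = achieved_bandwidth(1.0e6)    # close to the asymptotic bandwidth
```

Given the two parameters for a target machine, a predictor needs only the message length to estimate a communication time, which is the property the text relies on.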
Measurements on an individual Paramid processor, an i860, showed that a two-parameter characterisation might be insufficient for computation, as there was dependency on the computation kernel being performed, with additional cache effects evident. Fig. 3, showing results for four out of seventeen test kernels at full compiler optimisation, indicates two linear phases for some kernels, where the vector length being computed first stays within and then steps outside the cache. However, it is not a difficult matter to store in a look-up table the results for each machine and for each kernel. The user then selects a kernel, vector size, and processor to enable the performance tool to give a first-order approximation by means of scaling the computation times. This is likely to be more helpful for regular computations such as orthogonal transforms. An alternative characterisation is to use the computational intensity of the code, f, in units of flops/memory reference. Table 4 records steady-state performance (which is well below theoretically optimal performance), r_ic and r_oc being respectively in-cache and out-of-cache performance in units of Mflop/s. (The out-of-cache measurements arise by using vector lengths designed to exceed the cache size and by causing a cache flush between tests.) Applying the out-of-cache computational intensity test, a DEC Alpha (21064 at 175 MHz) server was found to scale over the i860 by a factor of 3.0 for f = 5 with a -fast compiler setting. As this is a load-dependent measurement, the arithmetic mean of five selected results was taken. Table 3 records the projected timings were 21064s to be substituted for i860s, otherwise keeping the system the same. The longer out-of-cache test figures were chosen because the lower-resolution clock would otherwise affect the accuracy, though in-cache timings indicated a larger scaling, particularly for higher values of f, indicating an efficient memory hierarchy. Low-resolution software clocks may be a deterrent to the use of a processor in some hard real-time systems, though perhaps not for soft real-time systems as herein.
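The look-up-table scaling can be sketched as follows. The table layout, the kernel label, and the Mflop/s rates are hypothetical; the rates are chosen only so that their ratio reproduces the factor of 3.0 reported in the text for f = 5, and they are not the measured values of Table 4.

```python
# Hypothetical look-up table: (machine, kernel) -> out-of-cache rate in Mflop/s.
# Values are placeholders whose ratio matches the reported scaling of 3.0.
RATE_MFLOPS = {
    ("i860", "f5"): 10.0,
    ("21064", "f5"): 30.0,
}


def scale_factor(source, target, kernel):
    """Speed-up of `target` over `source` for the given kernel."""
    return RATE_MFLOPS[(target, kernel)] / RATE_MFLOPS[(source, kernel)]


def project_time(measured_s, source, target, kernel):
    """First-order projection of a computation time from one machine to another."""
    return measured_s / scale_factor(source, target, kernel)


# A stage timing of 0.09 s measured on the i860 (hypothetical value),
# projected onto the 21064.
projected = project_time(0.09, "i860", "21064", "f5")
```

Only computation times are scaled here; communication times would be projected separately from the message-length characterisation of the target machine.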
Spatial filtering is an example of an image-processing operation commonly performed with integer operations, whereas benchmarking kernels, being derived from the numerical analysis community, usually employ floating-point operations. There would appear to be a need for a set of agreed kernels specifically for image-processing tasks.
opment. Our design aims to exploit familiar user-interface paradigms in terms of navigating data-entry screens and utilising simulation and trace tools, thus reducing the user's learning time. The interface was written in the Java programming language, which has enabled a trivial port between Windows NT and Unix operating systems, which would not have been possible with X-window software.
A problem with previous analysis visualisation tools, such as ParaGraph [6], written with X-lib calls, is that an animated display occurs, rather like a cartoon film. The user may find it difficult to establish a pattern. Moreover, in seeking generality, with twenty-four ways of presenting data, no structure to the tool's use was provided. An over-animated display also reinforces the sequentiality of the simulation, whereas the pipeline represented has local and general parallelism.
Fig. 4, showing a summary of statistics entered, is taken from the APTT data-entry 'wizard', which has a familiar look-and-feel to ease user adaptation. Fig. 5 shows a snapshot of the predictor running the postcode simulation. The pipeline backplane occupies the main window, with details of the stage activity such as buffer and processor usage available from subsidiary windows. Processor activity is shown using colour by analogy with stop/go displays. Again using the linguistic associations of colour, the communication arrows change colour from black, through red, to white to highlight 'hotspots'. The arrows also widen and contract. However, the cumulative mean bandwidth, not the instantaneous bandwidth, is displayed. The colour scaling is adjustable to centre on critical data rates, as otherwise the variation across the whole bandwidth range is too low to show up. Latency is also indicated in a persistent display. Jobs are marked off at task boundaries, with the task latency determined by the slowest job. Though persistent displays convey more information, they need to be balanced with features marking progress, which is why the processor activity diagram and message motion arrows are included.
suitable for large-grained applications. Using the Java Vector class, which can be made to transparently grow and shrink as work requests are serviced, could remove the need to provide concurrent access management.
Though a peer-to-peer communication mode is possible, the underlying semantics of RMI are client-server, utilising remote procedure call (RPC). Our implemented design requires the data farmer to poll the remote worker processes acting as servers. Conveniently, this fits a polling queueing model [9] of performance estimation. Java's pre-emptive priority-based thread scheduling can be adapted to provide a responsive structure, Fig. 7. However, RPC always comes with an overhead by reason of stub processes acting as intermediary communication interfaces, and additionally it is necessary to arrange that the farmer is not blocked until the remote invocation completes. RMI is also suitable for linking JavaBean software components within a framework, and as such our template is consistent with a trend within industry-standard distributed software [10] towards standardised middleware and high-level object-oriented software architectures.
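The polling structure, in which the farmer hands out work without blocking on any single invocation, can be sketched as below. This is an illustration only: local Python threads stand in for remote RMI workers, and all names (`worker_job`, `farm`, the slot bookkeeping) are hypothetical rather than part of the authors' Java template.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def worker_job(x):
    """Stand-in for the remote computation performed by a worker."""
    return x * x


def farm(jobs, n_workers=3, poll_interval_s=0.001):
    """Farmer loop: poll each worker slot in turn, collect results, issue new work."""
    pending = list(jobs)
    results = {}
    in_flight = {}  # worker slot -> (job, future for the outstanding invocation)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        while pending or in_flight:
            for slot in range(n_workers):
                if slot in in_flight:
                    job, future = in_flight[slot]
                    if future.done():          # non-blocking check of the worker
                        results[job] = future.result()
                        del in_flight[slot]
                if slot not in in_flight and pending:
                    job = pending.pop(0)
                    in_flight[slot] = (job, pool.submit(worker_job, job))
            time.sleep(poll_interval_s)        # yield between polling cycles
    return results


squares = farm(range(10))
```

Because the farmer only ever tests whether an invocation has finished, a slow worker delays its own slot but not the servicing of the others, which is the behaviour a polling queueing model describes.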
Data-farm multicast is present in our original model for applications such as the H.263 hybrid video encoder [11], where per image-row distribution of quantisation levels takes place. Broadcast is also a synchronisation mechanism for pipeline reconfiguration if workloads vary over time between the parallel stages. Reliable broadcast can be employed as an efficient method of implementing shared objects [12], which would allow co-operative forms of parallelism to take place within a pipeline stage, extending the range of applications suitable for parallel pipelines.
Conclusion
Data flow rather than control flow is the key issue in the design of real-time image-processing applications. Understandably, algorithmic designers do not wish to be concerned with the added problem of understanding how the details of a number of different implementation platforms affect the flow, but do require the specification to be met. APTT is an integrated environment which has the potential to close this loop. An extended machine-vision example has demonstrated that accurate results are predicted for a constrained parallel-pipeline construction system. Running a simulation for sufficient time will establish whether maximal events, exceeding the real-time constraints, will occur, in a way that a prototype may not because of physical limitations on test file size. Cross-architectural comparisons requiring relatively simple machine/algorithm characterisations are possible by adapting computer benchmark methodology. A graphical display, provided it gives meaningful information, is a vital element in giving a feel for the application, building confidence in eventual implementation on a range of target systems.
