Introduction
Sophisticated vision algorithms are usually disappointingly slow when they are put together into a complete system. This is especially true of systems created for research purposes, since speed is usually not a primary concern and much effort must be spent on working out ideas in a user-friendly environment, which is not conducive to high-speed algorithms. However, even in a research environment, reasonable speed is desirable. One reason for this is that it is difficult to debug an algorithm that takes minutes to compute one meaningful result; not many debugging runs can be made in a day. Another reason is the difficulty of using the algorithm in a non-simulated environment, for example to control an actual robot vehicle; if the algorithm takes minutes to run for each step, it is not practical to debug given the time constraints imposed by the physical environment. Finally, it is not possible to integrate an algorithm into a working vision system, which may use multiple sources of input, if the speed of one algorithm is much lower than the others.
One sophisticated vision system that has received much attention over the years, and which continues to be developed, is the FIDO (Find Instead of Destroy Objects) vision system. FIDO is a stereo vision navigation system used for the control of robot vehicles; it includes a stereo vision module, a path planner, and a motion generator.
This system is descended from work done by Moravec at Stanford [Moravec 80]. After Moravec came to Carnegie Mellon in 1980, work was done by Thorpe and Matthies [Thorpe 84; Matthies 84], who gave the system its name. More recently, work has been continued by the authors of the present paper, as well as others. This vision system is unusual in its longevity and in the range of speed over its span of development: Moravec's original algorithm, which was heavily optimized (though different in many important ways from the FIDO algorithm), took 15 minutes to make a single step while running on an unloaded DEC KL10; the Vax 780 implementation ran at 35 seconds per step; the Sun 3 implementation presently at Carnegie Mellon takes 8.5 seconds per step; and the current Warp implementation takes 4.8 seconds per step. This is an improvement by a factor of about 190.
We will begin by describing the Warp system, on which the current implementation of FIDO runs, and how it is programmed. Then we will review the history of the system that became FIDO, and how the different parts of FIDO were implemented on Warp. Finally, we evaluate the Warp/FIDO system and briefly describe future implementations.
The Warp System
Warp Hardware
A discussion of the Warp hardware is necessary in order to fully understand the Warp implementation of FIDO. This discussion is greatly abbreviated; more detail is available elsewhere [Kung and Menzilcioglu 84]. The Warp machine has three components: the Warp processor array, or simply Warp, the interface unit, and the host. All of the work on FIDO on Warp so far has been done using the demonstration and prototype Warp machines, which are wire-wrapped machines built according to Carnegie Mellon's design (the demonstration machine was built by Carnegie Mellon; prototypes were built by General Electric and Honeywell). The wire-wrapped machine has been superseded by an improved production Warp machine [Annaratone et al. 87], built using printed circuit boards by General Electric. The machine described here is the wire-wrapped machine.
The Warp processor array is a programmable, one-dimensional systolic array, in which all the cells, called Warp cells, are identical. Each cell is a complete computer, with computational units and local data and program memory, except that address generation is normally done on the interface unit, so that addresses, along with data, flow through the array. Each cell contains two floating-point units: one multiplier and one ALU, each of which can deliver up to 5 MFLOPS. The peak processing rate is 10 MFLOPS per cell. A 4 K-word memory is provided for resident and temporary data storage.

Figure 2-1: Warp host
As address patterns are typically data-independent and common to all the cells in low-level vision algorithms, full address generation capability is factored out of the cell architecture and provided in the interface unit. Addresses are generated by the interface unit and propagated from cell to cell, together with the control signals. In addition to generating addresses, the interface unit passes data and results between the host and the Warp array, possibly performing some data conversions in the process.
The host consists of several processors. The "master" controlling the Warp array is a Sun workstation, which runs Unix to provide a convenient programming environment and compatibility with other Carnegie Mellon vision research programs. Data transfer to and from the Warp array, and control of vision peripherals such as cameras and frame buffers, are provided by an "external host" (physically external to the Sun), which communicates with the Sun through a VME bus repeater. The external host consists of three "standalone processors": two of them, called "cluster processors," are used for sending and receiving data during a computation (they can exchange roles as needed), and the third one, called the "support processor," is used for controlling peripherals and executing a runtime library. The runtime library allows event sequencing and processing of interrupts from Warp.

To clarify the algorithm, the following terminology will be used. An obstacle is an identified location in the three-dimensional world where the vehicle is unable to pass. A feature is the two-dimensional appearance of the obstacle in the image. FIDO considers only point features. This assumes that physical objects will have enough features in the image to be correctly bounded in three-dimensional space.
Below is a more detailed description of the FIDO algorithm.
* Catalog: The catalog is a list of obstacles known to FIDO. This forms a simple map describing the environment surrounding the vehicle. The new objects that are discovered by the reacquire and matching steps are added to the map. Previously found obstacles that should be seen but are not are discarded, to eliminate spurious obstacles.
* Plan a path: Plan a path from the vehicle's current position to its destination that avoids all of the obstacles.
Implementation of FIDO on Warp
In the summer of 1984, Dew, Chang, Matthies and Thorpe designed a new version of the FIDO system to run on Warp, which was then in its initial design phase. They identified the most time-consuming parts of FIDO to be the major vision algorithms (correlation, interest operator, and reduction), which were considered to be suitable for implementation on a systolic array such as Warp. They redesigned FIDO to run on a Warp multiprocessor system, based on an early design that used the Aptec bus [McApline et al. 82] as the external host [Dew and Chang 84]. The external host was later changed to be the one described in section 2.1, largely for reasons of programmability.
Once the initial design was done, implementation of FIDO on Warp proceeded in three steps:
* Step 1: Implement the three vision modules of FIDO for the Warp array, providing a simple demonstration program to show that the implementation works. This was done by Klinker using Warp microcode on a demonstration Warp system.
* Step 2: Reimplement the vision modules on the prototype Warp system. This was done by Clune using W2. At the same time, exploit the external host to overlap computation on the Warp array with preparation and post-processing of the data in the cluster processors, achieving more parallelism.
The steps above use the most powerful part of Warp, the Warp processor array, to speed up the FIDO system. However, it is also possible to use Warp's external host to exploit algorithm-level parallelism and get further speedup. This approach can sometimes give a large speedup in the system, since in some cases significant portions of the code can be run in parallel with the rest of the system, effectively executing that code for free. Therefore, we formulated Step 3: Make efficient use of the multiprocessor host system. To some extent, this step has already been accomplished, but there is still room for exploiting algorithm-level parallelism in FIDO, as we shall discuss later.
Section 5 describes the initial implementation of the vision modules on Warp using microcode (Step 1). Section 6 describes the current implementation of FIDO on Warp in W2 (Steps 2 and 3).
FIDO on the Demonstration System
FIDO was first implemented on the Warp demonstration system, which included a Sun 2, an interface unit, and two Warp cells. In addition to being the first work on FIDO on Warp, this step showed on the demonstration system that Warp can be used for vision applications. Moreover, it allowed us to test algorithm decomposition techniques and to test the Warp hardware with realistic algorithms. Once the initial implementation was done, it was ported to the ten-cell prototype system and modified to use the cluster processors for input and output. This implementation was then superseded by new software, as discussed in the next section. Here we describe the vision modules in FIDO in detail and give their timings on the demonstration system and first prototype implementation.
Image Pyramid Generation
The image pyramid used in FIDO consists of 7 levels, starting with a 512 x 512 image at level x1 and ending with an 8 x 8 image at level x7, as shown in Figure 5-1. Areas of 2 x 2 pixels are replaced by one pixel in the next higher level of the pyramid. The new pixel value is computed by averaging over a window in the higher resolution image to produce a one-pixel result in the lower resolution image. The simplest averaging is to take a 2 x 2 pixel area and average it to one pixel. The initial implementation on the Warp array used overlapping 4 x 4 windows, which gave slightly better results than 2 x 2 windows.
Mapping the Algorithm onto Warp
The pyramid generation algorithm was implemented in Warp microcode in a systolic scheme, as suggested by Kung for convolution-type algorithms [Kung 84].
The pyramid generation algorithm was planned to fit the ten-cell Warp array. The algorithm required the processors to accumulate and normalize 16 pixels in a 4 x 4 window to produce one reduced pixel value. This was mapped onto the Warp array as nine modules, with the first eight each adding two new pixel values to the accumulated partial sum, and the ninth module normalizing the result. The second, fourth and sixth modules also stored the partial results until the necessary pixels from the next row underlying the 4 x 4 window had arrived at the module. The new data and the partial results were then sent together to the next module.
Because the demonstration Warp system consisted of only two cells when the micro-programmed version of the pyramid generation algorithm was implemented and tested, we had to face the problem of mapping nine conceptual, algorithmic modules onto two physical cells. This was achieved by having the first cell run the first four modules and the second cell run the remaining five modules. Each cell switched between consecutive modules whenever eight new partial results were produced. Thus, the pixels were processed in batches of sixteen at a time. Two extra pixels had to be sent with each batch, due to the overlapping kernel. Since the entire processing in each module consisted of adding two numbers and passing them on, each module needed two processing cycles to perform its task on a given pair of input pixels. The modules could thus start to operate on a new window position every two cycles. Accordingly, the modules could perform their tasks as fast as the data could be sent to the cells. On an image with m x n pixels, the nine modules thus needed essentially m x n cycles to reduce an m x n image into an m/2 x n/2 image, plus a few initial cycles for startup. To generate the described image pyramid, consisting of seven levels, this amounted to roughly $\sum_{i=1}^{6} (512 / 2^{i-1})^2$ cycles.

Nine Warp cells thus provided a speed-up of 14, which is relatively small. The Warp implementation of the pyramid generation algorithm was communication intensive: it used the adder effectively only half of the time (in every other row). It did not use the multipliers at all (except for a normalization). Each Warp cell was used as a 2.5 MFLOPS machine, for 25 MFLOPS from the array. This explains the relatively small speed-up of the pyramid generation algorithm. Note that adding more cells would not increase the speed, since this would not reduce the communication requirement.
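As an illustration of the reduction described in this section, the following Python sketch builds a seven-level pyramid using the simple non-overlapping 2 x 2 averaging and counts cycles under the stated cost of about m x n cycles per reduction. It is not the Warp microcode: the function names are hypothetical, startup cycles are ignored, and the overlapping 4 x 4 window of the initial Warp implementation is not modeled.

```python
def reduce_2x2(image):
    """Halve each dimension by averaging non-overlapping 2 x 2 blocks."""
    rows, cols = len(image), len(image[0])
    return [
        [
            (image[r][c] + image[r][c + 1]
             + image[r + 1][c] + image[r + 1][c + 1]) / 4.0
            for c in range(0, cols, 2)
        ]
        for r in range(0, rows, 2)
    ]


def build_pyramid(image, levels=7):
    """Return [x1, x2, ..., x_levels]; each level is half the size of the last."""
    pyramid = [image]
    for _ in range(levels - 1):
        pyramid.append(reduce_2x2(pyramid[-1]))
    return pyramid


def pyramid_cycles(size=512, levels=7):
    """Total cycles, assuming one reduction of an m x n image costs m * n cycles."""
    total = 0
    side = size
    for _ in range(levels - 1):  # six reductions produce seven levels
        total += side * side
        side //= 2
    return total


if __name__ == "__main__":
    # Synthetic test image; the pixel pattern is arbitrary.
    image = [[(r * 31 + c * 17) % 256 for c in range(512)] for r in range(512)]
    print([len(level) for level in build_pyramid(image)])
    print(pyramid_cycles())
```

Because each level has one quarter of the pixels of the level below it, the first reduction dominates the total: the six reductions together cost only about one third more than reducing the 512 x 512 image once.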
Interest Operator
"Interesting points" are those points which can be localized well in different pictures (for example corners). The image intensities change rapidly in all directions for a point that is "interesting." Interesting points are found using an interest operator, which is largely unchanged from Moravec's Stanford Cart work. The interest operator takes squared pixel differences around a pixel in the vertical, horizontal and both diagonal directions and accumulates them (separately for each direction) in the 3 x 3 neighborhood of the current pixel [Thorpe 84; Dew and Chang 84], as shown in Figure 5-2. For the current pixel to be an interesting point, all four accumulated differences must be large. Their minimum gives the interest value of the current pixel. The interest values are locally maximized in one hundred subimages that are arranged in a 10 x 10 grid. The maxima of all subimages are stored in a list of decreasing interest values. This gives a set of points, distributed across the image, which can be localized in the image for matching.
Only the first part of this algorithm, computing the accumulated squared pixel differences in all four directions for every pixel, was implemented in microcode on the Warp array. In the demonstration system, processing stopped here. Later, we implemented the minimization, maximization and list processing on the cluster output processor.
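The interest operator described above can be sketched in Python as follows. This is an illustrative reading of the description, not the FIDO code: the function names are hypothetical, the differences are taken between adjacent pixels inside the 3 x 3 window (an assumption about the exact window convention), and border pixels are simply given an interest value of zero.

```python
# Horizontal, vertical, and the two diagonal directions as (row, col) steps.
DIRECTIONS = ((0, 1), (1, 0), (1, 1), (1, -1))


def interest_values(image):
    """Minimum over four directions of the summed squared differences."""
    rows, cols = len(image), len(image[0])
    values = [[0.0] * cols for _ in range(rows)]
    for r in range(2, rows - 2):
        for c in range(2, cols - 2):
            sums = []
            for dr, dc in DIRECTIONS:
                total = 0.0
                # Accumulate over the 3 x 3 neighborhood of (r, c).
                for wr in range(r - 1, r + 2):
                    for wc in range(c - 1, c + 2):
                        diff = image[wr][wc] - image[wr + dr][wc + dc]
                        total += diff * diff
                sums.append(total)
            values[r][c] = min(sums)
    return values


def select_features(values, grid=10):
    """Keep the best pixel of each subimage, sorted by decreasing interest."""
    rows, cols = len(values), len(values[0])
    best = []
    for gr in range(grid):
        for gc in range(grid):
            r0, r1 = gr * rows // grid, (gr + 1) * rows // grid
            c0, c1 = gc * cols // grid, (gc + 1) * cols // grid
            candidates = [(values[r][c], r, c)
                          for r in range(r0, r1) for c in range(c0, c1)]
            if candidates:
                best.append(max(candidates))
    return sorted(best, reverse=True)


if __name__ == "__main__":
    # A bright quadrant whose corner sits at row 10, column 10.
    image = [[100 if r >= 10 and c >= 10 else 0 for c in range(20)]
             for r in range(20)]
    print(select_features(interest_values(image), grid=2)[0])
```

In this split, `interest_values` corresponds to the part that ran on the Warp array, and `select_features` to the minimization, maximization and list processing that were later placed on the cluster processor. Taking the minimum over directions is what rejects straight edges: along an edge, the differences in the edge direction are zero.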
Mapping the Algorithm on Warp
The interest operator does not have a good partitioning into modules with similar timings. We thus did not try to implement it in a systolic scheme, as was described for the pyramid generation algorithm in the previous section. Instead, we used the input partitioning model [Kung and Webb 86], in which not the algorithm but the data is divided into equally sized parts. In this scheme, each cell performs the complete algorithm on a portion of the data. An m x n image is divided into c vertical stripes to be processed on c different cells. For the interest operator, the stripes had to overlap by 4 pixels, due to the width of the operator window. Thus, every cell ran on $m \cdot (n/c + 4)$ pixels. The systolic communication facilities were then used like a "bus": each cell received data from the previous cell and sent it to the next cell. The host sent the data interleaved such that each cell could use every c-th pixel for itself. At the beginning of every new iteration, c new pixels were sent over the "bus." The offset between programs that ran on neighboring cells was two cycles, such that each cell started a new iteration exactly when a new pixel arrived.
Note that, since the programming scheme of the interest operator was organized around partitioning the data and not the algorithm, this algorithm could be easily adapted to run on any number of cells. In the demonstration system, the algorithm ran on two Warp cells, computing the interest value for every pixel of a 256 x 256 image (level x2 of the right image pyramid) in two parallel vertical stripes, each consisting of 256 x 132 pixels. It was a matter of changing a few constants that indicated the width of the vertical stripes to provide a version that ran on ten cells, when the ten-cell Warp became available. The ten-cell version divided the image into ten vertical stripes, each consisting of 256 x 30 pixels.
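The stripe widths quoted above follow from dividing the image width by the number of cells and adding the 4-pixel overlap. A minimal Python sketch is given below; the function names are hypothetical, rounding the per-cell share up to a whole pixel is an assumption that happens to reproduce both quoted widths, and the order of cells within each interleaved group is likewise assumed.

```python
import math


def stripe_width(image_cols, cells, overlap=4):
    """Columns handled by each cell: its share of the row plus the overlap."""
    return math.ceil(image_cols / cells) + overlap


def interleave(stripes):
    """Merge one row from each stripe so that cell k can keep every c-th pixel.

    `stripes` holds one equally long row segment per cell. In the returned
    stream, cell k uses the pixels at positions k, k + c, k + 2c, ...
    """
    return [pixel for group in zip(*stripes) for pixel in group]


if __name__ == "__main__":
    print(stripe_width(256, 2))    # two-cell demonstration system
    print(stripe_width(256, 10))   # ten-cell prototype
    stream = interleave([[1, 2, 3], [4, 5, 6]])
    print(stream, stream[0::2], stream[1::2])
```

Only the constants change between the two-cell and ten-cell versions, which is the property the text attributes to input partitioning.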
The algorithm iterated on a sequence of two steps:
1. Get the next pixel, compute the difference between the new pixel and its neighbors in four directions, and square the differences.
2. Add the squared differences to the partially computed interest values of the nine pixels whose 3 x 3 windows overlap at the current pixel.

Within each of the two steps, the code was optimized using software pipelining to use all facilities of the cell, such as the multiplier and adder, simultaneously. For ease of programming, no optimization was organized between the steps. This made the innermost loop 65 cycles long, whereas the ideal algorithm would have needed only 40 cycles.
The algorithm must store the most recent two rows of pixels and the most recent three rows of partial interest values, with four values (one per direction) for each pixel. Thus, $(2 + 3 \cdot 4) \cdot n + 4$ pixels were stored in the local memory of each Warp cell, where n is the row width per cell. For the given memory size of 4K words, a maximal row width of n = 256 columns per cell could be allowed.
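The storage requirement can be checked with a few lines of Python. This sketch assumes the count is (2 + 3 x 4) words per column plus 4, that is, two pixel rows and three rows of four directional sums, and a cell memory of 4K = 4096 words as given in the hardware description; the names are hypothetical.

```python
CELL_MEMORY_WORDS = 4 * 1024  # 4 K-word local memory per Warp cell


def words_needed(row_width, pixel_rows=2, interest_rows=3, directions=4):
    """Words of local memory used for a stripe of the given row width."""
    return (pixel_rows + interest_rows * directions) * row_width + 4


def fits(row_width, memory_words=CELL_MEMORY_WORDS):
    """True if the stripe's working set fits into one cell's memory."""
    return words_needed(row_width) <= memory_words


if __name__ == "__main__":
    for width in (132, 256, 512):
        print(width, words_needed(width), fits(width))
```

Under these assumptions a 256-column stripe needs 3588 words and fits, while a 512-column stripe does not.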
Microcode Timing of the Interest Operator
The interest operator ran in ( 
Image Pyramid Correlation
For a given pair of images and a given list of interesting points in one image, the correlation algorithm is used to find the corresponding points in the other image. The search for the most likely corresponding points of the interesting points is performed on the image pyramids, starting at the highest level (x7: 8 x 8 image). At each level, a 4 x 4 template around the current interesting point is taken and correlated with an 8 x 8 search area in the other image pyramid at the same level. The best matching position of the template in the search area determines the position of the search area in the next lower, more detailed, pyramid level [Moravec 80], as shown in Figure 5-

In the micro-programmed version of the algorithm, Warp was used to find the optimal correlations of all features for one given pyramid level at a time. First, all templates of a pyramid level were sent. The cells stored the templates and computed their means and variances. Then the search areas of that level were given to the Warp array in the same sequence as the templates. The cells correlated the current template with the current search area and sent the correlation results for every template position to the output cluster. The cluster processor then found the optimal position of each template within its search window and determined the search areas for the next lower level. The process was then repeated for the next lower pyramid level [Dew and Chang 84].
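One level of this search can be sketched in Python as follows. The sketch scores every position of a 4 x 4 template inside an 8 x 8 search area and returns the best one. The text does not give the exact correlation formula, so a standard normalized cross-correlation is used here as an assumption, and all function names are hypothetical.

```python
import math


def window(image, top, left, size):
    """Cut a size x size patch whose top-left corner is (top, left)."""
    return [row[left:left + size] for row in image[top:top + size]]


def normalized_correlation(template, patch):
    """Normalized cross-correlation of two equally sized patches."""
    t = [v for row in template for v in row]
    p = [v for row in patch for v in row]
    n = len(t)
    mean_t, mean_p = sum(t) / n, sum(p) / n
    cov = sum((a - mean_t) * (b - mean_p) for a, b in zip(t, p))
    var_t = sum((a - mean_t) ** 2 for a in t)
    var_p = sum((b - mean_p) ** 2 for b in p)
    if var_t == 0 or var_p == 0:
        return 0.0  # a flat patch carries no information to match
    return cov / math.sqrt(var_t * var_p)


def best_match(template, search_area):
    """Return (row, col, score) of the best template position."""
    size = len(template)
    span = len(search_area) - size + 1  # 5 positions per axis for 4x4 in 8x8
    scored = [
        (normalized_correlation(template, window(search_area, r, c, size)), r, c)
        for r in range(span) for c in range(span)
    ]
    score, r, c = max(scored)
    return r, c, score


def next_level_position(row, col):
    """A position at one level maps to twice its coordinates one level down."""
    return 2 * row, 2 * col


if __name__ == "__main__":
    # Arbitrary synthetic texture; the template is cut out of it at (2, 3).
    area = [[(7 * r * r + 13 * c + 3 * r * c) % 17 for c in range(8)]
            for r in range(8)]
    print(best_match(window(area, 2, 3, 4), area))
```

Repeating `best_match` from level x7 down to the finest level, with `next_level_position` placing each new search area, gives the coarse-to-fine search. In the implementation described here, the scoring ran on the Warp array while the selection of the best position ran on the cluster processor.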
Mapping the Algorithm onto Warp
The correlation algorithm was implemented in a systolic programming scheme, as for the pyramid generation algorithm. It was designed as nine modules. Each of the first eight modules covered two template elements. The algorithm was designed so that initially, each module received the template elements and stored the respective template elements of each template. The mean and the variance of all templates were computed and stored in the ninth module. Then, in the correlation phase, each module got the pixels of the search areas and the three partial sums from its left neighbor and updated the partial sums before it sent them to its right neighbor with the next pair of pixels. As in the case of pyramid generation, the second, fourth and sixth modules stored the derived partial results until the pixels of the next row, underlying the current window position, arrived. The ninth module combined the partial sums and the mean and variance of the current template into a correlation value that denoted how well the template fitted the data in the search area at the current position.
In the demonstration system, the correlation algorithm ran on two cells. The first four modules ran on the first cell, the other five modules ran on the second cell. For every pair of pixels, six additions and four multiplications had to be performed in the first eight modules. In the ninth module, one addition, two subtractions, three multiplications, and two divisions had to be executed. Thus, the adder was the most used resource for the first eight modules (used six times per module run), whereas the multiplier was used most in the last module (five times, so that it was not the bottleneck). It was possible to achieve the optimal speed for the first eight modules, i.e. start a new module run every six cycles, keeping the adder busy all the time. When the modules were run on two cells, however, the ninth module had to share the resources of the second cell with the fifth through eighth modules. Since the microcode of the ninth module was very different from the microcode of the other modules (heterogeneous modules), it was impossible to schedule the facilities of the second cell such that one resource was used in every cycle: the ninth module required that the innermost correlation loop be augmented by twelve cycles (i.e. two extra module runs).

The sequential algorithm took about 2.3 seconds on a Vax/780. Nine cells thus provided a speed-up factor of 78. This was a much higher speed-up than that achieved by the pyramid generation algorithm and the interest operator, because the multiplier was used in every cycle and the adder was used in every other cycle. Each cell thus ran here as a 7.5 MFLOPS machine. The communication facilities were also used in every other cycle. Therefore, the correlation algorithm was a fairly well balanced algorithm. The maximum speed-up would have been reached if 18 cells had been used (due to communication requirements).
Performance of the Vision Modules
Times for the three vision modules are shown in Table 5-1. The speedup (optimal and actual) of a Sun 2 with Warp over a Sun 2 without Warp is also shown for each module. As mentioned above, only the two-cell system was available at this time. In addition, the system software was not fully tuned, so that it took approximately 0.3 seconds to call each Warp module.
Most interesting is the speedup of the pyramid generation module on Warp. It actually takes longer to run on the Warp than on the Sun alone. This is because the data flow between the clusters and Warp is unbalanced. Very time-consuming manipulations were required to order the data correctly for Warp in this implementation, but the actual pyramid generation on the Warp array is not computationally intensive. The array is virtually starved for data. This is a case where the ordering of data is too complex for Warp (specifically the clusters). A more efficient implementation is for the cluster processors to send the pixels in the order in which they are stored in memory, so that data can flow rapidly into the array, and to have the Warp array handle the data reordering itself.
The interest operator and correlation functions did not perform at the optimal speeds either, although they are faster than the comparable Sun functions. If the startup times for the Warp implementations are subtracted (the overhead for startup is much less in the prototype and production Warp machines), the actual times are close to the optimal times, as shown in Table 5-1.
FIDO on the Prototype Warp System
The hardware completely changed from the demonstration system to the prototype system described in Section 2. The number of cells in the Warp array increased from two to ten. Just as important for system performance, the master was upgraded from a Sun 2 to a Sun 3 with the MC68881 floating point hardware, and the external host became available. These changes markedly improved the performance of the system. In addition, the software environment changed radically. Previously, all of the vision modules for the demonstration system were coded directly in microcode, a tedious task. With the prototype system, the W2 compiler became available, greatly simplifying the programmer's task.
We completely reimplemented the vision modules as a result of these changes:
* Pyramid generation: This module was reimplemented as a C program to be run on the clusters, since very little computation is done here. This made it possible to do the two pyramid generations in parallel using the two cluster processors. In this implementation non-overlapping 2 x 2 windows were used instead of the overlapping 4 x 4 windows in the Warp implementation, to simplify the computation.
* Interest operator: This module was reimplemented in W2, without significant change in the algorithm.
* Correlation: This module was originally written as a systolic program, but could not be reimplemented in this way because the prototype W2 compiler allowed only homogeneous code. Instead, it was implemented using input partitioning, as in the interest operator.
In addition, preprocessing and postprocessing of the data in modules that send data to the Warp array were implemented as C programs to be run on the clusters, which send data directly to and receive data directly from the Warp array. These C programs replaced the standard compiler-generated modules, which transferred data to the Warp array from memory. Use of the clusters in this way exploited some of the parallelism available in the Warp system.
Analysis of FIDO Performance on the Prototype
The reimplementation of FIDO led to a total system time for one step of 4.8 seconds, which is a large speedup over the original time, but still relatively small compared to the time on a Sun 3 alone (8.5 seconds). Here we analyze how we got this speedup and identify the parts of the system that allow a further increase in performance.

The actual time for this function to execute is about 0.1 seconds, the same as estimated in Section 4 for the ten-cell implementation. Some of the rest of the time is spent starting the Warp array (about 25 milliseconds). However, most of the time is spent in post-processing. After the interest operator has been run, some sorting and selecting is done from the resulting data. This is done on one of the cluster processors, which is about 28% slower than the Sun 3 processor, because of a slower clock rate. The effect of this post-processing is shown in Table 6-2.

TOTAL TIME 1.6 0.5
The interest operator is sped up by a factor of 10, from one second to 0.1 seconds. This leaves only the sorting and selecting, which was about one-quarter of the time of the original function, but which is 80% of the time in the Warp implementation. The total speedup is approximately three.
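The relation between the factor of 10 on the interest operator and the overall factor of about three is the usual bound imposed by the part that is not accelerated. A small Python sketch, using the rounded proportions quoted above (about three-quarters of the original time accelerated tenfold) and a hypothetical function name:

```python
def overall_speedup(accelerated_fraction, factor):
    """Speedup of a whole task when only part of it is accelerated.

    accelerated_fraction: share of the original time that is sped up.
    factor: speedup applied to that share.
    """
    remaining = (1.0 - accelerated_fraction) + accelerated_fraction / factor
    return 1.0 / remaining


if __name__ == "__main__":
    # Sorting and selecting (about one quarter) stay on a sequential processor.
    print(round(overall_speedup(0.75, 10.0), 2))
```

With these proportions the result is close to three, and no speedup of the interest operator alone could push it beyond four, which is why the post-processing on the cluster processor dominates the Warp timing.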
Correlation modules
All of the modules that use the correlation function (e.g. 'Match Features') have less than a factor of three speedup, compared to a Sun 3 alone. A breakdown of the times for Warp is shown in Table 6-3.
Algorithm-level Parallelism in FIDO
It is possible to exploit algorithm-level parallelism in FIDO, as shown in Figure 7-

These predictions are shown in Table 7-1, compared with the times on the current FIDO system. We see that pyramid generation, cataloging, and path planning and execution are done for free. This reduces the total time for FIDO from 4.8 seconds to 3.2 seconds, which is a speedup of 2.6 over the Sun 3 version.
Using the Production Machine
General Electric has now manufactured a production version of the Warp machine, as mentioned earlier. This machine design was influenced by our experiences with the prototype system, including our experiences with FIDO. We have made a number of modifications to this system, which will improve the performance on FIDO:
* System overheads reduced. Overhead for startup of a Warp program is now 5 milliseconds, down from 25 milliseconds on the prototype. This will substantially reduce the overhead of calling a Warp program with a fast execution time. For example, this should reduce the startup overhead in correlation from approximately 0.2 seconds to 35 ms.
* Heterogeneous code. In the course of implementing FIDO, we originally implemented the pyramid generation and correlation functions systolically, then reimplemented them using input partitioning when we reprogrammed them using W2. This was due to a restriction in the prototype machine that made it impossible for W2 to support heterogeneous code in a general way. This restriction has now been removed, and we are free to reimplement these modules systolically.
* ... constructed that allows direct transfer of data from the framebuffer to the Warp array, bypassing the cluster processors. This will allow reimplementation of the pyramid generation step on the Warp array, with each pyramid generation taking approximately 60 ms.
* Faster cluster processors. The clusters are being upgraded to use microprocessors with a faster clock. The new processors also support DMA from the host to the Warp array, eliminating a significant bottleneck in feeding data from the host to the Warp array.
Summary
We have discussed the history of the FIDO algorithm, and its gradual increase in speed over a period of some 13 years, starting from Moravec's work on its predecessor at Stanford through the current implementation on Warp. Over this period of time, three things have influenced its speed:
* Placing constraints on calculations to reduce processing. For example, Thorpe and Matthies were able to increase the reliability of FIDO, while reducing the number of images needed, by adding more constraints in the stereo matching. This has accounted for about one order of magnitude improvement in speed. Note that the usefulness of constraints depends on the reliability of the underlying hardware (e.g. sensors) and that any improvement in computational speed of a program can be used to perform more experiments or incorporate other functions into the program.
Warp's potential in the implementation of FIDO is due to several factors, which reflect not only on the design of Warp but also on other special-purpose machines:
1. The Warp array works well for the majority of the computation in FIDO, namely low-level vision computations. Working either in microcode or W2, we never had problems with the Warp array not having enough effective computation power. However, while low-level vision computations form the majority of the computation of FIDO, simply speeding them up is not enough for good speedup of the FIDO system as a whole.
2. The external host is the weakest part of the Warp system. This was known when the host was designed; it was determined once we decided to use industry-standard processors and buses, instead of building our own. In our early versions of FIDO on Warp, this kept us from realizing full use of the Warp array, because of the constraints in rearranging data on the external host.
3. The programmability of the Warp array allowed us to modify our algorithms and programming models to accommodate a regular data pattern from the host. This is important even in the latest versions of the host, which have faster processors and a higher data rate, but which can use DMA, which requires a regular address pattern. This flexibility is the main reason we have been able to observe the predicted performance of algorithms in actual Warp runs.
4. W2 makes it possible to experiment with different algorithms, in the context of a research system such as FIDO, while getting good use of the powerful Warp array. As we program more and more of FIDO on the production Warp machine, programmability is essential, especially as it allows us to make use of more complex programming models that use the powerful Warp array more and require less intervention by the relatively weak host.
5. Although the computing power of the external host is small compared to the Warp array, its programmability, and its close integration with the master and the Warp array, make it an important part of the Warp system. Irregular operations can be mapped onto it as part of pre- and postprocessing of data from Warp. Also, it can sometimes perform memory access-intensive but not compute-intensive computations as well as or better than the Warp array, which can also allow the Warp array to be used for something else in the meantime.
