Developing embedded parallel image processing applications is usually a very hardware-dependent process, requiring deep knowledge of the processors used. Furthermore, if the chosen hardware does not meet the requirements, the application must be rewritten for a new platform. We wish to avoid these problems by encapsulating the parallelism.
Introduction
As processors are becoming faster, smaller, cheaper, and more efficient, new opportunities arise to integrate them into a wide range of devices. However, since there are so many different applications, there is no single processor that meets all the requirements of every one. The SMARTCAM [6] project investigates how an application-specific processor can be generated for the specific field of intelligent cameras, using design space exploration.
The processing done on an intelligent camera has very specific characteristics. On the one hand, low-level image processing operations such as interpolation, segmentation and edge enhancement are local, regular, and require vast amounts of bandwidth. On the other hand, high-level operations like classification, path planning, and control may be irregular while typically consuming less bandwidth. The architecture template on which the design space exploration is based therefore contains data-parallel (SIMD) as well as instruction-parallel (ILP) processors.
One of the main goals of the project is keeping the system easy to program. This means that one single program should map to a wide range of configurations of a wide range of processors. It also means that the application developer shouldn't have to learn a parallel programming language. The solution presented below is based on using algorithmic skeletons to exploit data parallelism within each operation, while a form of asynchronous RPC allows the operations to run concurrently.
The structure of this paper is as follows: section 2 reviews some related work. Sections 3 and 4 describe our programming environment and some optimizations. Section 5 presents some results from our prototype, and finally section 6 draws conclusions and points to future work.
Related work
Programming environments for image and signal processing applications vary widely. Tightly coupled systems usually have parallel extensions to a sequential language, like Celoxica's Handel-C [2] for FPGA programming, or NEC's 1DC [7] for their IMAP SIMD arrays. More loosely coupled systems usually work with the concept of a task or kernel, and differ in how these tasks are programmed and composed.
Process networks such as used by YAPI [4] allow much freedom in specifying the tasks, but require a static connection network between them. StreamC/KernelC [8], developed for Imagine, reduces the allowed syntax within a kernel, but makes the interconnections dynamic by using streams. Their current implementation doesn't support task parallelism, however. EASY-PIPE [9] does, but requires a batch of tasks to be explicitly compiled and dispatched by the user. Their main contribution is the use of algorithmic skeletons to make programming the tasks easier. Finally, Seinstra [11] allows no user specification of the tasks, instead relying on an existing image processing library. It is also limited to data parallelism, but these restrictions allow it to be more transparent to the user, presenting a purely sequential model.
Futures were introduced in the MultiLisp [5] language for shared-memory multiprocessors. Requesting a future spawns a thread to calculate the value, while immediately returning to the caller, which only blocks when it tries to access it. Once the calculation is complete, the future is overwritten by the calculated value. Batched futures [10] apply this concept to RPC, but with the intent to reduce the number of RPC calls by sending them in batches that may reference each other's results.
Programming
Our programming environment is based on C, to provide an easy migration path. In principle, it is possible (although slow) to write a plain C program and run it on our system. In order to exploit concurrency, though, it is necessary to divide the program into a number of image processing operations, and to apply these using function calls. Parts of the program which cannot easily be converted can be left alone unless the speedup is absolutely necessary.
The main program, which calls the operations and includes the unconverted code, is run on a control processor, while the image processing operations themselves are run on the coprocessors that are available in the system (the control processor itself may also act as a coprocessor). Only this main program can make use of global variables; because of the distributed nature of the coprocessor memory, all data to and from the image processing operations must be passed using parameters.
Within-operation parallelism
The main source of parallelism in image processing is the locality of pixel-based operations. These low-level operations reference only a small neighborhood, and as such can be computed mostly in parallel. Another example is object-based parallelism, where a certain number of possible objects or regions-of-interest must be processed. Both cases refer to data parallelism, where the same operation is executed on different data (all pixels in one case, object pixels or objects in the other).
Data parallel image processing operations map particularly well on linear SIMD arrays (LPAs). However, since we don't want the application developer to write a parallel program, we need another way to allow him to specify the amount of parallelism present in his operations. For this purpose, we use algorithmic skeletons. These are templates of a certain computational flow that do not specify the actual operation, and can be thought of as higher-order functions, repeatedly calling an instantiation function for every computation. Take a very simple binarization:
for (y=0; y<HEIGHT; y++)
  for (x=0; x<WIDTH; x++)
    out[y][x] = (in[y][x] > 128);
Using a higher-order function, PixelToPixelOp, we can separate the structure from the computation. PixelToPixelOp will implement the loops, calling binarize every iteration:
int binarize(int value) { return (value > 128); }
PixelToPixelOp(binarize, in, out);
Note that implementing PixelToPixelOp column-wise instead of row-wise, by interchanging the loops, does not change the result, because there is no way for op to reference earlier results (side effects are not allowed). It can be said that by specifying the inputs and outputs of the instantiation function, the skeleton characterizes the available parallelism. So, by choosing a skeleton, the programmer makes a statement about the parallelism in his operation, while not specifying how this should be exploited. This freedom will allow us to optimally map the operation to different architectures.
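To make the loop-interchange argument concrete, the following is a minimal sketch of how such a skeleton could look in plain C. It is not the implementation used in our system: the fixed image size, the pixel_op typedef, and the name PixelToPixelOpColumnwise are assumptions made purely for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative image size (assumption; real images are larger). */
#define WIDTH  4
#define HEIGHT 3

/* Instantiation function: maps one input pixel to one output pixel.
   It sees nothing but its argument, so it cannot depend on loop order. */
typedef int (*pixel_op)(int value);

/* Row-wise skeleton: the loops live here, not in the caller. */
void PixelToPixelOp(pixel_op op, int in[HEIGHT][WIDTH], int out[HEIGHT][WIDTH])
{
    assert(op != NULL);
    for (int y = 0; y < HEIGHT; y++)
        for (int x = 0; x < WIDTH; x++)
            out[y][x] = op(in[y][x]);
}

/* Column-wise variant (hypothetical name): same result, loops interchanged. */
void PixelToPixelOpColumnwise(pixel_op op, int in[HEIGHT][WIDTH], int out[HEIGHT][WIDTH])
{
    assert(op != NULL);
    for (int x = 0; x < WIDTH; x++)
        for (int y = 0; y < HEIGHT; y++)
            out[y][x] = op(in[y][x]);
}

/* The instantiation function from the text. */
int binarize(int value) { return (value > 128); }
```

Calling either variant with binarize on the same input fills the output identically, which is exactly the freedom a mapping step can use to pick the traversal that suits the target processor.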
Another benefit is that the image processing library normally shipped with DSPs and other image processors is replaced by a skeleton library, which is more general and thus less in need of constant updates.
Of course, not all operations can be data-parallelized as easily as pixel operations. More irregular operations place increasing demands on the autonomy and interconnection of the processing elements. For example, for efficient implementation, local neighborhood operations are straightforward, recursive neighborhood operations require indirect addressing, run-length encoding requires non-local communication, and edge following is mostly sequential. However, even in the sequential case the skeleton approach can still be used, if only to facilitate programming instead of parallelization. If specialized hardware then becomes available, it is easier to make use of it.
Between-operation parallelism
An image processing application consists of a number of operations described above, surrounded by control flow constructs. Because our hardware platform is heterogeneous, it is important that several of these operations run concurrently, as not all processors can be working on the same computation. We are therefore using asynchronous RPC calls as a method to exploit this task-level parallelism.
In RPC, the client program calls stubs which signal a server to perform the actual computation. In our case, the application is the client program running on the control processor, while the skeleton instantiations are run on the coprocessors. This alone does not imply parallelism, because the stub waits for the results of the server before returning. In asynchronous RPC, therefore, the stub returns immediately, and the client has to block on a certain operation before accessing the result. This allows the client program to run concurrently with the server program, as well as multiple server programs to run in parallel.
However, this still has the disadvantage of requiring the client program to wait on the completion of an operation before passing its result to another one, even though it never uses the results itself. To address this problem, we are using MultiLisp's concept of futures, placeholder objects which are only blocked upon when the value is needed for a computation. Since simple assignment is not a computation, passing the value to a stub doesn't require blocking; once the called function needs the information, it will block itself until the data is available, without blocking the client program.
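The interplay between asynchronous stubs and futures can be sketched with POSIX threads standing in for coprocessors. This is an illustration under stated assumptions, not our RPC layer: future_t, call_t, async_call, add_one and twice are hypothetical names, and a single int stands in for the data an operation produces.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Placeholder for a value that a server has yet to produce. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  filled;
    int             ready;
    int             value;
} future_t;

void future_init(future_t *f)
{
    pthread_mutex_init(&f->lock, NULL);
    pthread_cond_init(&f->filled, NULL);
    f->ready = 0;
    f->value = 0;
}

/* Called by the producer once the result is known. */
void future_set(future_t *f, int value)
{
    pthread_mutex_lock(&f->lock);
    f->value = value;
    f->ready = 1;
    pthread_cond_broadcast(&f->filled);
    pthread_mutex_unlock(&f->lock);
}

/* Blocks only the caller that actually needs the value. */
int future_get(future_t *f)
{
    pthread_mutex_lock(&f->lock);
    while (!f->ready)
        pthread_cond_wait(&f->filled, &f->lock);
    int value = f->value;
    pthread_mutex_unlock(&f->lock);
    return value;
}

/* One outstanding call: operation, input future, output future. */
typedef struct {
    int      (*op)(int);
    future_t  *in;
    future_t  *out;
    pthread_t  thread;
} call_t;

/* Server side: blocks on its own input, never on behalf of the client. */
static void *serve(void *arg)
{
    call_t *c = arg;
    future_set(c->out, c->op(future_get(c->in)));
    return NULL;
}

/* Stub: dispatches the operation and returns immediately. */
void async_call(call_t *c, int (*op)(int), future_t *in, future_t *out)
{
    c->op  = op;
    c->in  = in;
    c->out = out;
    int rc = pthread_create(&c->thread, NULL, serve, c);
    assert(rc == 0);
    (void)rc;
}

/* Two toy operations used to chain calls. */
int add_one(int v) { return v + 1; }
int twice(int v)   { return 2 * v; }
```

In this sketch the client can pass the output future of add_one straight to twice as its input without waiting; only a final future_get on the last result blocks the client, while each server thread waits for its own input.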
Optimizations
While our futures-like implementation is much less elaborate than MultiLisp's (requiring, for example, explicit blocks on results, although these could be inserted by the compiler), it does tackle two other problems: data distribution and memory usage. Both originate from our architecture template, which features distributed-memory processors with a relatively low amount of on-chip memory.
Furthermore, although the skeletons are called as higher-order functions in order to provide an easy migration path, we avoid the function call overhead by using source-to-source transformations. Using transformations also allows us to translate between different target processor languages, and to provide an efficient way to specify skeletons that are polymorphic in the number and types of their arguments.
Data distribution
The data generated by most image processing operations is not accessed by the client program, but only by other operations. This data should therefore not be transported to the control processor. In order to achieve this, we make a distinction between images (which are streams of values) and other variables.
Images are never sent to the control processor unless the user explicitly asks for them, and as such no memory is allocated and no bandwidth is wasted. Rather, they are transported between coprocessors directly, thus avoiding the scatter-gather bottleneck present in some earlier work [9]. All other variables (thresholds, reduction results, etc.) are gathered to the control processor and distributed as necessary. These can be used by the programmer without an explicit request.
The knowledge about which data to send where comes simply from the inputs and outputs of the skeleton operations, which are derived from the skeleton specification and are available at run time. Coprocessors are instructed to send the output of an operation to all peers that use it as an input.
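A minimal sketch of this routing decision follows, assuming each dispatched operation is recorded with the identifiers of its input and output images and the coprocessor it is mapped to. The operation_t layout, the integer image identifiers, and find_destinations are hypothetical and only illustrate the idea.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_INPUTS 2
#define NO_IMAGE   (-1)   /* marks an unused input or output slot */

/* Run-time record of one dispatched operation (hypothetical layout). */
typedef struct {
    const char *name;
    int         coprocessor;          /* where the operation is mapped */
    int         inputs[MAX_INPUTS];   /* image identifiers it consumes */
    int         output;               /* image identifier it produces  */
} operation_t;

/* Collect the distinct coprocessors that use the producer's output as an
   input. Returns the number of destinations written to dest. */
size_t find_destinations(const operation_t *ops, size_t n_ops, size_t producer,
                         int *dest, size_t max_dest)
{
    size_t n = 0;
    assert(producer < n_ops);
    if (ops[producer].output == NO_IMAGE)
        return 0;
    for (size_t i = 0; i < n_ops; i++) {
        if (i == producer)
            continue;
        for (int k = 0; k < MAX_INPUTS; k++) {
            if (ops[i].inputs[k] != ops[producer].output)
                continue;
            /* Add the consumer's coprocessor unless it is already listed. */
            size_t j = 0;
            while (j < n && dest[j] != ops[i].coprocessor)
                j++;
            if (j == n && n < max_dest)
                dest[n++] = ops[i].coprocessor;
        }
    }
    return n;
}
```

Since the table is built from the stub calls themselves, no extra annotation is needed from the programmer; an output that no operation consumes simply yields no destinations.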
Memory usage
Our concern about memory usage stems from the fact that especially SIMD LPAs for low-level image processing may not have enough memory to hold an entire frame, let alone multiple frames if independent tasks are mapped to it. These processors are usually programmed in a pipelined way, where each line of an image is successively led through a number of operations. We would like our system to conserve memory in the same way, and have therefore specified all our skeletons to read from and write to FIFO buffers.
The distribution mechanism allocates these buffers, and sets up transports as described above. The operations themselves read the needed information from the buffer, process it, and write the results to another buffer. A method is provided to signal that no more data will be forthcoming. This conserves memory, because even a series of buffers is generally much smaller than a frame. Simultaneously, it hides the origin of the data, making the operations independent of the producers of their input and the consumers of their output.
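The buffer interface described here might look roughly as follows. This is a single-threaded sketch under assumptions: the line width, the buffer depth, and all names (line_fifo, fifo_write, fifo_read, fifo_close, fifo_finished, binarize_step) are invented for illustration, and a real implementation would block or yield instead of returning false.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define LINE_WIDTH 8   /* pixels per line (illustrative)  */
#define FIFO_LINES 4   /* buffer depth, far below a frame */

typedef struct {
    int  lines[FIFO_LINES][LINE_WIDTH];
    int  head;     /* index of the next line to read */
    int  count;    /* number of lines currently held */
    bool closed;   /* producer signalled end of data */
} line_fifo;

void fifo_init(line_fifo *f) { f->head = 0; f->count = 0; f->closed = false; }

/* Returns false when the buffer is full. */
bool fifo_write(line_fifo *f, const int *line)
{
    assert(!f->closed);
    if (f->count == FIFO_LINES)
        return false;
    int tail = (f->head + f->count) % FIFO_LINES;
    memcpy(f->lines[tail], line, sizeof f->lines[tail]);
    f->count++;
    return true;
}

/* Returns false when the buffer is empty. */
bool fifo_read(line_fifo *f, int *line)
{
    if (f->count == 0)
        return false;
    memcpy(line, f->lines[f->head], sizeof f->lines[f->head]);
    f->head = (f->head + 1) % FIFO_LINES;
    f->count--;
    return true;
}

/* Signal that no more data will be forthcoming. */
void fifo_close(line_fifo *f) { f->closed = true; }

/* True once the producer has closed the buffer and it has been drained. */
bool fifo_finished(const line_fifo *f) { return f->closed && f->count == 0; }

/* One step of a line-based operation: it knows only its two buffers,
   not who produces its input or consumes its output. */
bool binarize_step(line_fifo *in, line_fifo *out)
{
    int line[LINE_WIDTH];
    if (out->count == FIFO_LINES)
        return false;                      /* no room downstream yet */
    if (!fifo_read(in, line)) {
        if (in->closed && !out->closed)
            fifo_close(out);               /* propagate end of data */
        return false;
    }
    for (int x = 0; x < LINE_WIDTH; x++)
        line[x] = (line[x] > 128);
    fifo_write(out, line);
    return true;
}
```

With these sizes a buffer holds 4 lines of 8 pixels regardless of the frame height, which is the memory saving the FIFO discipline is meant to provide.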
The price of all this is that operations must consume data in a certain order, and if the source operation doesn't generate it in the correct sequence, a reordering operation must be inserted, typically requiring a frame memory. Fortunately, many low-level operations can tolerate different orderings, while more irregular operations are generally run on processors with enough memory.
Results
We have implemented a double thresholding edge detection algorithm on a prototype architecture consisting of a XETAL [1] 16 MHz 320-PE SIMD processor and a TriMedia [12] 180 MHz 5-issue VLIW processor (figure 1). In this algorithm, the Bayer pattern sensor output is first interpolated, then the Sobel X and Sobel Y edge detection filters are run and combined, the output is binarized at two levels, and finally the high threshold is propagated using the low threshold as a mask image. This final propagation cannot be run on the XETAL, because it requires a frame memory.
Three situations were compared. In the first, the entire algorithm was implemented in a single operation on the TriMedia, as a baseline for how a sequential application would be written. In the second, the operation was split into tasks as described above, and all tasks were mapped to the TriMedia; this shows the overhead caused by the task switching and buffer interaction. In the third, all low-level operations were mapped to the XETAL, while the propagation and display were mapped to the TriMedia; this resembles the final situation as it would run on our system. See table 1. Because XETAL has only 16 line memories, the buffers between the filters were 1 line. On the TriMedia, they were 16 lines, to avoid too much context switching. An allocate-and-release scheme was used on the TriMedia, so that no extra state memory was needed in the filters, and no unnecessary copies were made.
As can be seen, the overhead of running the RPC system is around 8% (with 16-line buffers; the overhead approaches zero if full-frame buffers are used, but that is unrealistic). This seems quite a reasonable tradeoff if we consider that it can now run transparently on the parallel platform, achieving a 42% processing time decrease. Actually, because the filtering and propagation are done concurrently in the parallel case, the processing time is bounded by the slowest operation, which is the propagation.
Conclusions and future work
We have presented a system in which an application developer can construct a parallel image processing application with minimal effort. Data parallelism is captured by specifying the way to process a single pixel or object, with the system handling distribution, border exchange, etc. Task parallelism of these data parallel operations is achieved through an RPC system, preserving the semantics of normal function calls as much as possible. Results from an actual prototype architecture have shown that the system works, and can achieve a significant speedup by using an SIMD processor for low-level vision processing.
The automatic skeleton instantiation is currently limited to ILP processors, and we wish to include XETAL and IMAP skeletons as well. Furthermore, we want to investigate dynamic image sizes and data types. Finally, an automatic mapping step should combine CPU-, memory-, and bandwidth usages to best determine buffer sizes and assign operations to processors.
This work is supported by the Dutch government in their PROGRESS research program under project EES.5411.
