Imaging applications such as filtering, image transforms and compression/decompression require vast amounts of computing power when applied to large data sets. These applications would potentially benefit from the use of parallel processing. However, dedicated parallel computers are expensive and their processing power per node lags behind that of the most recent commodity components. Furthermore, developing parallel applications remains a difficult task: writing and debugging the application is difficult (deadlocks), programs may not be portable from one parallel architecture to another, and performance often falls short of expectations.
INTRODUCTION
Imaging computations such as filtering, image transforms, compression/decompression and image content indexing [3, 4] require vast amounts of computing power when applied to large data sets (such as 3-D medical images, satellite images and aerial photographs). Such applications would potentially benefit from the use of parallel processing. However, dedicated parallel computers are expensive and their processing power per node lags behind that of the most recent commodity components. Moreover, developing parallel applications remains a difficult task: writing and debugging the application is difficult (deadlocks), programs are not portable, and performance often falls short of expectations.
In order to facilitate the development of parallel applications, we propose the CAP computer-aided parallelization tool, which enables application programmers to specify at a high level of abstraction the flow of data between pipelined-parallel operations. The CAP environment supports the programmer in developing parallel imaging applications. It features (1) support for the parallel storage of large data sets; (2) an image library supporting 1-bit, 8-bit, 16-bit and 24-bit images, as well as the division of images into tiles of user-defined size; (3) the CAP language extension to C++, which allows programmers to write deadlock-free, portable pipelined-parallel applications and to combine parallel storage access routines with image processing operations.
This paper shows how processing and I/O-intensive imaging applications can be implemented to take advantage of parallelism and pipelining between data access and processing operations. This paper's contribution is (1) to show how such implementations can be compactly specified using the CAP set of flow control instructions, and (2) to demonstrate that CAP-specified applications achieve the performance of custom code. The paper analyzes theoretically the performance of CAP-specified applications and demonstrates the accuracy of the theoretical analysis through experimental measurements. To implement I/O-intensive applications, large 2D (resp. 3D) images are divided into square (resp. cubic) subsets with good locality, called tiles. Two kinds of applications are considered in this paper: neighborhood-independent operations and neighborhood-dependent operations. Neighborhood-independent operations are operations where no data must be exchanged between tiles to compute the final image; neighborhood-dependent operations require tiles to exchange their borders before the result can be computed.
Section 2 shows how large images are divided into tiles for storage and processing purposes. Section 3 describes in general terms the process-and-gather operation, i.e. the problem of applying a neighborhood-independent operation to selected tiles stored on disk(s) and gathering the processed tiles in a single address space. It shows the ideal execution schedule for performing a process-and-gather operation and analyzes theoretically the performance of such a schedule. Section 4 shows how the process-and-gather operation is specified in CAP. The process-and-gather operation is limited to linear filters. Section 5 describes the more general exchange-process-and-store operation, i.e. the problem of applying a neighborhood-dependent operation to tiles stored on disk(s) and storing the result back to disk(s). Section 6 lists performance results for the exchange-process-and-store parallel operations.
SYSTEM SUPPORT FOR MANAGING LARGE IMAGES

Hardware architecture
The hardware we consider consists of multiple PentiumPro PCs connected through a commodity 100Mb/s network such as FDDI or Fast Ethernet (Figure 1). The PCs run the WindowsNT operating system and communicate using the TCP/IP-based MPS message-passing system developed by the authors. Each PC represents a storage/processing node (S/P node) consisting of one or two processors connected through its internal PCI bus to one or more disks. The client requesting an image processing operation is also located on a PC. This platform scales from a single-PC architecture with one processor and one disk to a multiple-PC multiple-disk architecture. For the purpose of comparing modeled and measured performance figures, we will assume single-processor PCs. We assume that both the disk and the network can access memory through DMA (Direct Memory Access). While this hypothesis is accurate for disks1, network interfaces based on the TCP/IP protocol consume a lot of processing power2.
Software architecture
Large images are divided into square tiles which are stored independently, possibly on multiple disks. Pixmap image tiling is routinely used for the internal representation of images in software packages such as PhotoShop. Square tiles enable accessing image windows efficiently, with good data locality. The CAP imaging library provides data types and functions for splitting images into tiles and allocating tiles to disks. Figure 2 shows an image divided into tiles, as well as a visualization window covering part of the image. Figure 2 also shows a possible allocation of tiles to disks, assuming an image striped over 8 disks. The allocation index consists of the disk index and the local tile index on the disk. For example, the bottom right tile in Figure 2 is allocated on disk 3, and is the 6th tile on that disk. The distribution of tiles to disks is made so as to ensure that direct tile neighbors reside on different disks. We achieve such a distribution by introducing, between two successive rows of tiles (and between two successive planes of tiles in the case of 3-D images), offsets which are prime to the number of disks.
FIGURE 2. Image tiling and visualization window
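The allocation rule described above can be sketched in a few lines of C++. This is only an illustration under stated assumptions: the formula `(col + row * rowOffset) mod numDisks`, the function names and the offset value 3 are hypothetical and are not taken from the CAP imaging library, whose actual allocation (Figure 2) may differ.

```cpp
#include <cassert>

// Greatest common divisor, used to check that the offset is prime to the disk count.
int Gcd(int a, int b) {
    while (b != 0) { int t = a % b; a = b; b = t; }
    return a;
}

// Hypothetical allocation rule (an assumption, not the CAP library's formula):
// consecutive tiles of a row go to consecutive disks, and each new row of
// tiles is shifted by rowOffset disks.
int DiskOfTile(int row, int col, int numDisks, int rowOffset) {
    assert(numDisks > 0 && row >= 0 && col >= 0);
    assert(Gcd(rowOffset, numDisks) == 1);  // offset prime to the number of disks
    return (col + row * rowOffset) % numDisks;
}

// Returns true if no tile of a rows x cols grid shares a disk with any of its
// (up to 8) direct neighbours.
bool NeighboursOnDistinctDisks(int rows, int cols, int numDisks, int rowOffset) {
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            for (int dr = -1; dr <= 1; ++dr)
                for (int dc = -1; dc <= 1; ++dc) {
                    if (dr == 0 && dc == 0) continue;
                    const int nr = r + dr, nc = c + dc;
                    if (nr < 0 || nr >= rows || nc < 0 || nc >= cols) continue;
                    if (DiskOfTile(r, c, numDisks, rowOffset) ==
                        DiskOfTile(nr, nc, numDisks, rowOffset))
                        return false;
                }
    return true;
}
```

With 8 disks, an offset of 3 keeps all eight neighbours of a tile on other disks under this rule, whereas an offset of 1 (also prime to 8) places a diagonal neighbour on the same disk. In this sketch, being prime to the number of disks is therefore necessary but the offset must still be chosen with the diagonals in mind.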
The data types required in this example are the WindowT and the TileT classes, provided in the CAP imaging library (Program 1). The WindowT class fields are a file name, the window position within an image, the window size, the window data, and a pixel descriptor (pixel size in bits, color scheme (gray level, RGB, ...)). The WindowT class is used both to specify the parameters of a window request (in which case the data field is empty) and to hold the window itself. The TileT class fields give the tile position, its size, the tile data, a pixel descriptor, as well as the index of the disk where the tile is stored and the local index of the tile on the disk.
Program 2 lists a simple sequential program which performs a process-and-gather operation. It assumes that the whole window to be processed is in memory. The while loop (lines 12 to 16) repeatedly calls the GetNextTileInWindow routine until it returns 0. At each iteration, the GetNextTileInWindow routine returns the next window tile (nextP parameter), based on the window description (windowP parameter) and the previous window tile (prevP parameter). The first time GetNextTileInWindow is called, the prevP parameter is 0. In the body of the while loop (lines 13 to 15), the new tile is processed using the user-defined ProcessTile routine. In this simple program, all tiles are processed independently. When the tile is processed, the result is merged into the final window using the MergeAndAddTile routine. This simple routine correctly handles neighborhood-independent and linear filtering operations (see section 3.1). The GetNextTileInWindow and MergeAndAddTile routines are provided by the CAP imaging library. The ProcessTile routine is user-defined. The next sections show more sophisticated imaging programs which access tiles stored on disks and process them in a pipelined-parallel manner.
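The iteration pattern of the sequential program can be illustrated with a small, self-contained C++ sketch. It is not the CAP library code: the `WindowT` and `TileT` structures below are heavily simplified, the extra `tileSize` argument and the helper `CountTilesInWindow` are hypothetical, and the processing and merging steps are reduced to a comment.

```cpp
#include <cassert>

// Simplified stand-ins for the library classes (assumption: only the fields
// needed to enumerate tiles are kept).
struct WindowT { int x, y, width, height; };  // window position and size in pixels
struct TileT   { int col, row; };             // tile position in the tile grid

// Writes into *nextP the tile that follows *prevP inside the window (the first
// tile when prevP is null) and returns 1, or returns 0 once all tiles are visited.
int GetNextTileInWindow(const WindowT* windowP, const TileT* prevP,
                        TileT* nextP, int tileSize) {
    assert(windowP && nextP && tileSize > 0);
    assert(windowP->width > 0 && windowP->height > 0);
    const int firstCol = windowP->x / tileSize;
    const int lastCol  = (windowP->x + windowP->width - 1) / tileSize;
    const int firstRow = windowP->y / tileSize;
    const int lastRow  = (windowP->y + windowP->height - 1) / tileSize;
    int col = firstCol, row = firstRow;
    if (prevP) {                       // advance in row-major order
        col = prevP->col + 1;
        row = prevP->row;
        if (col > lastCol) { col = firstCol; ++row; }
    }
    if (row > lastRow) return 0;
    nextP->col = col;
    nextP->row = row;
    return 1;
}

// Sequential driver loop: visits every tile intersecting the window.
int CountTilesInWindow(const WindowT& window, int tileSize) {
    TileT tile{};
    const TileT* prevP = nullptr;
    int count = 0;
    while (GetNextTileInWindow(&window, prevP, &tile, tileSize)) {
        ++count;        // a real program would process and merge the tile here
        prevP = &tile;
    }
    return count;
}
```

For instance, a 600-by-300 window at position (100, 100) over 256-by-256 tiles intersects three tile columns and two tile rows, so the loop body runs six times.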
THE PARALLEL PROCESS-AND-GATHER OPERATION
Problem description
A process-and-gather operation consists of reading tiles from the disk(s), performing a neighborhood-independent operation on the tiles, gathering the processed tiles in a single address space, and merging the tiles to form the final visualization window. The S/P nodes perform the neighborhood-independent operation and send the processed tiles to the client PC, where they are merged to form the final image. Linear filtering fits the process-and-gather scheme (Figure 3). The linear operation is performed on each tile, assuming that pixels beyond the tile border are set to 0. The linear filtering operation generates enlarged tiles, so that after filtering, tiles overlap. When merging the tiles to form the final image, the overlapping parts of the tiles are added together, leading to the correct final result. As an example, we filter a 1-D gray-level image by averaging pixels with a 3-by-1 convolution kernel (Figure 3). In Figure 3, a 3-by-1 convolution kernel is applied to an 8-pixel vector, with and without tiling. Both situations assume that pixels outside the range of the vector are 0. With tiling, the filter is applied to two 4-pixel vector slices, and both vector slices grow to 6 pixels after filtering. The overlapping parts of the vector slices are then added at tile-merging time to recover the correct 8-pixel vector.
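The 1-D example can be checked with a short C++ sketch of this overlap-and-add scheme. The function names are illustrative, and an integer kernel of ones (a sum rather than an average) is assumed so that the comparison is exact; dividing the result by 3 would give the averaging filter.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Full linear convolution, treating samples outside the input as 0.
// The output has in.size() + kernel.size() - 1 samples.
std::vector<int> Convolve(const std::vector<int>& in, const std::vector<int>& kernel) {
    assert(!in.empty() && !kernel.empty());
    std::vector<int> out(in.size() + kernel.size() - 1, 0);
    for (std::size_t i = 0; i < in.size(); ++i)
        for (std::size_t k = 0; k < kernel.size(); ++k)
            out[i + k] += in[i] * kernel[k];
    return out;
}

// Filters the signal slice by slice; each filtered slice is larger than its
// input, and the enlarged slices are added into the result at their offsets.
std::vector<int> ConvolveTiled(const std::vector<int>& in,
                               const std::vector<int>& kernel,
                               std::size_t tileSize) {
    assert(tileSize > 0);
    std::vector<int> out(in.size() + kernel.size() - 1, 0);
    for (std::size_t start = 0; start < in.size(); start += tileSize) {
        const std::size_t end = std::min(start + tileSize, in.size());
        const std::vector<int> tile(in.begin() + start, in.begin() + end);
        const std::vector<int> filtered = Convolve(tile, kernel);
        for (std::size_t j = 0; j < filtered.size(); ++j)
            out[start + j] += filtered[j];   // overlapping samples are summed
    }
    return out;
}
```

With an 8-sample input, a 3-tap kernel and 4-sample slices, each filtered slice has 6 samples and the merged result equals the untiled convolution. The full result has 10 samples; the 8 central ones correspond to the original pixel positions.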
Modelled single-PC execution schedule
This hardware configuration consists of a single PC reading data from the disks, performing the neighborhood-independent operations on all tiles, and merging the processed tiles. We assume that the disks read tiles faster than the processor can handle them ( Figure 4 ).
The schedule described in Figure 4 guarantees that the PC processor is busy at all times after the first tile has been fetched. In this model, all disk accesses but the first one are performed while the processor is busy. Figure 5 shows the ideal execution schedule for a configuration with multiple S/P nodes. We assume that the disks read tiles faster than the processors can handle them, that the network transfers processed tiles faster than the processors can produce them, and that the client PC merges processed tiles faster than the network can transfer them.
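The single-PC schedule of Figure 4 can be illustrated by a small timing simulation in C++. This is a sketch under simplifying assumptions that are not stated in the paper: reads are issued back to back with unlimited buffering, the merge step is ignored, and all names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Completion time for numTiles tiles on one PC when the disk reads the next
// tile while the processor works on the current one. Times are in seconds.
double PipelinedTime(int numTiles, double tRead, double tProcess) {
    assert(numTiles > 0 && tRead >= 0.0 && tProcess >= 0.0);
    double readDone = 0.0;  // time at which the current tile is in memory
    double procDone = 0.0;  // time at which the processor becomes free
    for (int i = 0; i < numTiles; ++i) {
        readDone += tRead;                                   // reads are back to back
        procDone = std::max(procDone, readDone) + tProcess;  // needs tile and free processor
    }
    return procDone;
}

// Completion time without any overlap between disk access and processing.
double SequentialTime(int numTiles, double tRead, double tProcess) {
    assert(numTiles > 0);
    return numTiles * (tRead + tProcess);
}
```

When reading is faster than processing, the pipelined time reduces to one read followed by all processing steps, i.e. only the first disk access is exposed. When reading is slower, the disk becomes the bottleneck and the assumption behind the schedule no longer holds.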
Modelled multiple-PC execution schedule
In Figure 5, horizontal arrows represent disk access and processing operations; vertical arrows represent ordering between operations; gray boxes represent data transfers between PCs. The critical path is represented as a smooth light-gray line. As in section 3.2, this schedule ensures that all disk transfers but the first one, all network transfers but the last P (where P is the number of S/P nodes) and all merging operations but the last one are performed while the S/P node processors perform the neighborhood-independent tile computations.
Theoretical performance analysis
As described in Figure 5, the critical path in the pipelined-parallel process-and-gather operation consists of one disk access, $\lceil T/P \rceil$ tile processing steps, $P$ network transfers and one MergeAndAddTile operation, where $T$ is the number of tiles in the window and $P$ the number of S/P nodes in the architecture. We assume that the tile size does not change significantly during the neighborhood-independent computation. The time required to read a tile from a disk is written as $t_d = l_d + \tau_d \cdot TileSize^2$, where $l_d$ is the disk latency and $1/\tau_d$ is the disk throughput. The time required to transfer data over the network is written as $t_n = l_n + \tau_n \cdot TileSize^2$, where $l_n$ is the network latency and $1/\tau_n$ is the network throughput. The time required to process a tile is written as $t_p = \tau_p \cdot f(TileSize)$, where $\tau_p$ is the unitary computation time and $f$ gives the complexity of the algorithm as a function of the tile size. The time required to merge a tile into the visualization window is $t_m = \tau_m \cdot TileSize^2$. The duration of the process-and-gather operation, written $T_{pg}$ here to distinguish it from the number of tiles $T$, is (Equation 1):
$T_{pg} = t_d + \lceil T/P \rceil \, t_p + P \, t_n + t_m$ (1)
The assumptions behind the execution schedule of Figure 5 are that tile accesses are faster than tile processing steps ($t_d < D \cdot t_p$, where $D$ is the number of disks per S/P node), that $P$ network transfers are faster than a single tile processing step ($P \cdot t_n < t_p$), and that merging a tile into a window is faster than a network transfer step ($t_m < t_n$).
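As a worked illustration, Equation 1 and its three validity conditions can be evaluated with the following C++ sketch. The structure and function names are hypothetical, and the per-tile times are assumed to have already been computed from the latency and throughput terms.

```cpp
#include <cassert>

// Per-tile times of the model, in seconds: disk read, processing,
// network transfer and merge.
struct ModelParams {
    double tDisk;
    double tProc;
    double tNet;
    double tMerge;
};

// Equation 1: one disk access, ceil(T/P) processing steps,
// P network transfers and one merge operation.
double ProcessAndGatherTime(int numTiles, int numNodes, const ModelParams& m) {
    assert(numTiles > 0 && numNodes > 0);
    const int steps = (numTiles + numNodes - 1) / numNodes;  // ceil(T / P)
    return m.tDisk + steps * m.tProc + numNodes * m.tNet + m.tMerge;
}

// The three assumptions under which the schedule of Figure 5 applies:
// disk access hidden by processing, P transfers hidden by one processing
// step, and merging faster than a network transfer.
bool ScheduleAssumptionsHold(int numNodes, int disksPerNode, const ModelParams& m) {
    assert(numNodes > 0 && disksPerNode > 0);
    return m.tDisk < disksPerNode * m.tProc &&
           numNodes * m.tNet < m.tProc &&
           m.tMerge < m.tNet;
}
```

With illustrative values of 0.25 s per read, 1 s per processing step, 0.125 s per transfer and 0.0625 s per merge, processing 10 tiles on 4 nodes takes 3 processing steps and a modelled total of 3.8125 s. Raising the node count until $P \cdot t_n$ reaches $t_p$ (8 nodes for these values) violates the second condition, at which point the network becomes the bottleneck and Equation 1 no longer applies.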
CAP SPECIFICATION OF THE PROCESS-AND-GATHER OPERATION
The computer-aided parallelization framework
In order to speed up the development of parallel applications and to specify parallel I/O and processing operations at a high level of abstraction, we use the Computer-Aided Parallelization (CAP) tool. This tool enables application programmers to hierarchically specify the macro dataflow between operations performed on tiles (file stripe parts). Operations are segments of sequential code performed by a single execution thread and characterized by input and output values. The input and output values of an operation are called tokens. In the context of this paper, tokens consist of tile data and additional application-dependent parameters. The macro dataflow specifies how tokens are routed between the operations of the parallel program. In addition, synchronization points (also used for merging intermediate results) specify which tokens must be available before the next operation can start (Figure 6).
In a graphical CAP specification, parallel operations are displayed as parallel horizontal branches, and pipelined operations are operations located in the same horizontal branch. Figure 6 assumes a parallel program consisting of 4 threads T1, T2, T3 and T4. In the macro data flow graph of Figure 6, the input token enters the graph from the left. It is divided into two parts, in_P1 and in_P2, which undergo operations P1 and P2. Operation P1 is performed by thread T1. Operation P2 is performed by thread T2. The result of operation P1 is out_P1. out_P1 is divided into three tokens in_P3, in_P4 and in_P5, which undergo operations P3, P4 and P5 in parallel (threads T1, T2 and T3). The results of operations P3, P4 and P5 are merged into a single token, out_M1, which is in turn merged with out_P2 to form out_M2. out_M2 is fed to operation P6. If several tokens enter the macro data flow graph of Figure 6, they are processed in a pipelined fashion.
The semantics of CAP is based on directed acyclic graphs [6]. The CAP specification of a parallel program is described in a simple formal language, an extension of C++. This specification is translated automatically into a C++ source program. At program startup time, the CAP runtime allocates the program threads to the available processors, using the information stored in a configuration file [1]. The macro data flow model which underlies the CAP approach has also been used successfully by the creators of the MENTAT parallel programming language [2]. Thanks to the automatic compilation of the parallel application, the application programmer does not need to explicitly program the protocols to exchange data between parallel processes and to ensure their synchronization. Furthermore, predefined library operations are available, for example for parallel file storage and access operations. Combining parallel disk access and processing operations enables the customization of the imaging application according to the user's requirements.
CAP threads are grouped hierarchically. In the context of this paper, the CapServerT thread hierarchy (Program 3) consists of a client thread running on the client node (line 3) and two sets of threads running on the S/P nodes (lines 4 and 5). The TileServer threads perform I/O operations (ReadTile and WriteTile, lines 16 to 19) and the ComputeServer threads perform computations on the tiles extracted from the disks (e.g. filtering, lines 25 and 26). Each S/P node comprises one ComputeServer thread and as many TileServer threads as disks. The CapServerT thread hierarchy can perform two parallel operations: the process-and-gather and the exchange-process-and-store operations. Sections 4.2 and 5.3 specify the behavior of these two operations.
CAP specification of the process-and-gather operation
Program 4 is the CAP specification of the process-and-gather operation declared in Program 3, lines 7 and 8. This program applies, in a pipelined-parallel manner, a linear filter to all tiles within a window specified by the WindowT Input class instance. The WindowT class consists of a window position and a file name (see section 2.2). The CAP pipeline expression semantics (Program 4, lines 4 to 7) is to perform in parallel the body of the pipeline (lines 5 and 6). The pipeline expression iteratively calls the GetNextTileInWindow routine (line 4, first pipeline parameter) until it returns 0. Each token generated by the call is immediately (before the next token is generated) sent to the appropriate TileServer thread, which reads a tile (line 5). The tile is then processed by the ComputeServer thread (line 6). Tiles sent to different S/P nodes are processed in parallel. Tiles sent to the same S/P node are processed in pipeline: the TileServer thread fetches the next tile while the ComputeServer is processing the previous tile. When a tile has been processed, it is returned to the client PC (line 4, third initialization parameter) where it is merged using the MergeTile routine (line 4, second initialization parameter) into the final window (line 4, fourth initialization parameter).
THE EXCHANGE-PROCESS-AND-STORE OPERATION
Problem description
We consider the situation where the source image resides on disks, and the target image is written to disk(s). Before filtering can be performed on a tile, tile sides must be exchanged : a tile must receive pixels from its 8 neighboring tiles. The width of the border exchanged between tiles is defined as w and depends on the filtering operation (Figure 7 ).
We assume that there are enough disks to ensure that the disk throughput exceeds the processor throughput, i.e. our algorithm is always compute-bound. The CAP runtime library features a tile cache keeping loaded tiles in memory. The tile cache works according to an LRU (least-recently used) scheme. Provided that the cache can store at least 3 tile rows, most of the required tiles will be in memory during a given computation step. Considering a tile size of 512-by-512 pixels, or 256KB at one byte per pixel, in a 4096-by-4096 image, 3 rows represent 24 tiles, or 6MB, well below the typical memory size of current PCs. In the theoretical analysis and experimental measurement sections, we consider two situations: tile cache disabled and tile cache enabled.
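The least-recently-used policy mentioned above can be sketched as follows. This is a generic LRU cache written for illustration, not the CAP runtime's tile cache; the class name, the integer tile key and the byte-vector payload are all assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <unordered_map>
#include <utility>
#include <vector>

// Minimal LRU cache keyed by a tile index; the value is the tile's pixel data.
class TileCache {
public:
    explicit TileCache(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    // Returns the cached pixels, or nullptr on a miss.
    // A hit makes the tile the most recently used one.
    const std::vector<unsigned char>* Get(int tileIndex) {
        auto it = index_.find(tileIndex);
        if (it == index_.end()) return nullptr;
        order_.splice(order_.begin(), order_, it->second);
        return &it->second->second;
    }

    // Inserts or refreshes a tile, evicting the least recently used tile
    // when the cache is full.
    void Put(int tileIndex, std::vector<unsigned char> pixels) {
        auto it = index_.find(tileIndex);
        if (it != index_.end()) {
            it->second->second = std::move(pixels);
            order_.splice(order_.begin(), order_, it->second);
            return;
        }
        if (order_.size() == capacity_) {
            index_.erase(order_.back().first);
            order_.pop_back();
        }
        order_.emplace_front(tileIndex, std::move(pixels));
        index_[tileIndex] = order_.begin();
    }

    std::size_t Size() const { return order_.size(); }

private:
    using Entry = std::pair<int, std::vector<unsigned char>>;
    std::size_t capacity_;
    std::list<Entry> order_;  // front = most recently used
    std::unordered_map<int, std::list<Entry>::iterator> index_;
};
```

A cache sized for three tile rows would be created with a capacity of three times the number of tiles per row; tiles of the row being left behind by the scan are then the first to be evicted.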
We select an execution schedule where the processors are always busy. We must ensure that the data required to compute a given tile (or part of it) is in memory when the computation starts, i.e. the required data is read from the disks and exchanged between the various S/P nodes before the computation is started.
To achieve this result, all S/P nodes work in parallel, and each S/P node runs a four-step pipeline. The first step consists of reading tiles from disk (or from the tile cache). During the second pipeline step, the S/P node computes the central part of the tile and in parallel reads the neighboring tiles' borders from the other S/P nodes. During the third pipeline step, the S/P node computes the border of the tile after having received the neighboring tile borders. During the fourth pipeline step, the S/P node writes the computed tile back to disk. The tile central part is defined as the part of the tile that is not affected by the neighboring tile sides. The tile border is defined as the part of the tile that is affected by the neighboring tile sides. As opposed to hardware pipelining, the pipeline steps are not performed synchronously: the only guarantee is that a given tile will undergo the four pipeline steps in the specified order.
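The split between central part and border can be made concrete with a short C++ sketch. It assumes square tiles and a border of width w on every side; the function names are hypothetical, and taking w = 2 for a 5-by-5 filter (half the kernel width, rounded down) is an assumption made for the example.

```cpp
#include <cassert>

// True if pixel (x, y) of a tileSize x tileSize tile lies within w pixels of
// a tile edge, i.e. its filtered value depends on pixels of neighbouring tiles.
bool IsBorderPixel(int x, int y, int tileSize, int w) {
    assert(w >= 0 && 2 * w <= tileSize);
    assert(x >= 0 && x < tileSize && y >= 0 && y < tileSize);
    return x < w || y < w || x >= tileSize - w || y >= tileSize - w;
}

// Number of pixels in the central part, which can be computed before any
// neighbouring tile border has been received.
long CountCentralPixels(int tileSize, int w) {
    long count = 0;
    for (int y = 0; y < tileSize; ++y)
        for (int x = 0; x < tileSize; ++x)
            if (!IsBorderPixel(x, y, tileSize, w)) ++count;
    return count;
}
```

For a 512-by-512 tile and w = 2, the central part holds 508 x 508 = 258064 pixels and the border only 4080, which is why the border exchange can overlap with most of the computation.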
Tiles are allocated to S/P nodes to ensure proper load balancing. Assuming P S/P nodes, a tile with disk index n is processed by the S/P node n mod P. For example, in the figure, tile 4-6 is stored as local tile 12 on disk 2. On a machine with 2 S/P nodes, tile 4-6 will be processed by S/P node 0. In the present allocation of tiles to disks, adjacent tiles on the same row are processed by different S/P nodes. S/P node 0 (resp. 1) reads tile 3-9 (resp. 3-8) from disk, processes the central part of tile 2-6 (resp. 2-5), computes the border of tile 2-4 (resp. 2-3) after having exchanged the neighboring tile sides, and writes tile 2-2 (resp. 2-1) to disk. For the next computation step, the activity pattern is shifted two tiles along the arrow. The tiles are scanned in serpentine order so as to benefit from the tile cache.
1. The basic activity pattern can be adapted so that each S/P node processes more than one tile during each computation step. This reduces the number of synchronizations during the course of the algorithm, but increases the pipeline startup cost.
2. For the same reasons, an increase in tile size reduces the number of synchronizations during the execution of the algorithm, but increases the pipeline startup cost.
3. As explained in section 2.2, the tile allocation scheme selected in this paper ensures that neighboring tiles are allocated on different S/P nodes, to improve load balancing. This in turn increases the number of communications required during each computation step. An alternative tile allocation scheme could optimize communications over load balancing, and would be easy to specify in CAP.
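The serpentine scan and the mapping of disks to S/P nodes described above can be sketched in C++ as follows. The function names are hypothetical; the only rules taken from the text are the alternating scan direction and the assignment of disk index n to S/P node n mod P.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Visits a rows x cols grid of tiles left to right on even rows and right to
// left on odd rows, so that consecutive tiles are always adjacent.
std::vector<std::pair<int, int>> SerpentineOrder(int rows, int cols) {
    assert(rows > 0 && cols > 0);
    std::vector<std::pair<int, int>> order;
    order.reserve(static_cast<std::size_t>(rows) * static_cast<std::size_t>(cols));
    for (int r = 0; r < rows; ++r) {
        for (int i = 0; i < cols; ++i) {
            const int c = (r % 2 == 0) ? i : cols - 1 - i;
            order.emplace_back(r, c);
        }
    }
    return order;
}

// S/P node in charge of the tiles stored on a given disk: n mod P.
int NodeOfDisk(int diskIndex, int numNodes) {
    assert(diskIndex >= 0 && numNodes > 0);
    return diskIndex % numNodes;
}
```

Because the scan reverses direction at the end of each row, the tiles needed at the start of a new row are those most recently loaded, which is what lets an LRU cache of a few tile rows satisfy most requests.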
Theoretical performance analysis
The theoretical performance analysis is done under two separate assumptions : disabled tile cache and enabled tile cache.
We first assume that the tile cache is disabled. During each computation step, four activities are carried out simultaneously. Each S/P node reads at most nine tiles from disk; communicates with the other S/P nodes to get at most 4 tile sides and 4 tile corners; computes one tile; and writes one tile to disk. Assuming that the tile processing operation is computation-intensive (as opposed to data-intensive), we try to keep the S/P node processors busy at all times by reading data from the disks in advance.
The time required to read a tile from the disks is written as $t_{dr} = l_{dr} + \tau_{dr} \cdot TileSize^2$, where $l_{dr}$ is the read disk latency and $1/\tau_{dr}$ is the read disk throughput. The time required to write a tile to the disks is written as $t_{dw} = l_{dw} + \tau_{dw} \cdot TileSize^2$, where $l_{dw}$ is the write disk latency and $1/\tau_{dw}$ is the write disk throughput. The time required to transfer data over the network is written as $t_n = l_n + \tau_n \cdot DataSize$, where $l_n$ is the network latency and $1/\tau_n$ is the network throughput. The time required to process a tile is written as $t_p = \tau_p \cdot f(TileSize)$, where $\tau_p$ is the unitary computation time and $f$ gives the complexity of the algorithm as a function of the tile size. The width of a tile border is defined as $w$. The number of S/P nodes is defined as $P$. The number of disks per S/P node is $D$. The number of tiles to be processed is $N$. Considering this, when the pipeline reaches the steady state, the disk access, the network transfer and the processing times during each step are formulated as follows (equations 2, 3 and 4):
9x read and 1x write disk access time: $T_d = 9\,(l_{dr} + \tau_{dr} \cdot TileSize^2) + (l_{dw} + \tau_{dw} \cdot TileSize^2)$ (2)
network transfer time: $T_n = 4\,(l_n + \tau_n \cdot TileSize \cdot w) + 4\,(l_n + \tau_n \cdot w^2)$ (3)
processing time: $T_p = \tau_p \cdot f(TileSize)$ (4)
The pipeline startup cost is the cost of preloading P tiles. The pipeline termination time is the cost of writing back P tiles. The total computation time with P S/P nodes consists of the pipeline startup cost, the computation time, and the pipeline termination time (Equation 5).
T-
The last equality is true only if the algorithm is compute-bound, i.e. processing-intensive enough to hide the disk access and the network transfer time. Provided the number of tiles is large, the relative startup cost can become very small. The trade-off in the tile size is that (1) the larger the tile size, the smaller the overhead due to synchronization and communications ; (2) the smaller the tile size, the smaller the overhead due to pipeline startup cost.
We now assume that the tile cache is enabled. When the pipeline reaches the steady state, the number of tiles read per S/P node is reduced to one (vs. nine). Equations 3 and 4 remain unchanged. Equation 2 becomes:
1x read and 1x write disk access time: $T_d = (l_{dr} + \tau_{dr} \cdot TileSize^2) + (l_{dw} + \tau_{dw} \cdot TileSize^2)$ (6)
Assuming that the time until the pipeline reaches the steady state (pipeline startup and first tile row computation) is not significant and that the algorithm is compute-bound, the total computation time with or without tile cache is similar. The only difference between the two situations is the number of disks required to keep the algorithm compute-bound.
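The effect of the cache on the number of disks can be illustrated numerically with the C++ sketch below, which evaluates Equations 2 and 6. Several points are assumptions rather than statements from the paper: tiles hold one byte per pixel, read and write parameters are equal, accesses are spread evenly over the D disks of a node, and the algorithm is taken to stay compute-bound when the per-step disk time divided by D does not exceed the processing time. The function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Time to read or write one tile: latency plus tile size over throughput.
double TileAccessTime(double latencySec, double bytesPerSec, double tileBytes) {
    assert(latencySec >= 0.0 && bytesPerSec > 0.0 && tileBytes >= 0.0);
    return latencySec + tileBytes / bytesPerSec;
}

// Per-step disk time: Equation 2 corresponds to tilesRead = 9 (cache
// disabled), Equation 6 to tilesRead = 1 (cache enabled); one tile is
// written in both cases.
double DiskTimePerStep(int tilesRead, double readTime, double writeTime) {
    assert(tilesRead >= 0);
    return tilesRead * readTime + writeTime;
}

// Smallest number of disks per node for which diskTime / D <= processTime
// (assumption: accesses are evenly distributed over the D disks).
int DisksNeeded(double diskTime, double processTime) {
    assert(diskTime >= 0.0 && processTime > 0.0);
    const int d = static_cast<int>(std::ceil(diskTime / processTime));
    return d < 1 ? 1 : d;
}
```

Using a latency of 18.5 ms, a throughput of 3.3 MBytes/s, 512x512 tiles and 0.69 s of processing per tile, this sketch yields a per-step disk time of roughly 0.98 s without the cache (two disks needed) and roughly 0.20 s with it (one disk suffices).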
CAP specification
Program 5 is the CAP specification of the exchange-process-and-store operation (lines 20 to 31). The input token to the exchange-process-and-store operation is a window description (image name, size, position, but no data) and the output token is void because the filtered image is directly stored on disk without producing any output. The indexed parallel construct (lines 25 to 28) consists of: one or more iteration expressions (lines 26 and 27), a split-function name (line 28, first initialization parameter), a merge-function name (line 28, second initialization parameter), an output token (line 28, fourth initialization parameter) initialized in the specified address space (line 28, third initialization parameter) and a CAP expression as body of the loop (lines 29 and 30). The split-function indicates how to create the parallel-construct body-input token from the parallel-construct input token. The merge-function indicates how to merge the parallel-construct body-output token into the parallel-construct output token. The program consists of a double parallel iteration. The first loop (line 26) iterates on successive window tile rows and the second loop (line 27) iterates on all tiles of one input window tile row. As explained in Figure 8, the second loop iterates from left to right on even rows and from right to left on odd rows. At line 28, the split-function DuplicateWindow duplicates the window parameters; the merge-function is void, indicating that the parallel-construct body-output is used for synchronization purposes only. The indexed parallel expression executes instances of its body (lines 29 and 30) in parallel, as many times as expressed in the index specification (lines 26 and 27). The body consists of two parts, performed in parallel: filter the tile by calling the TileFiltering operation (line 29) and then save the filtered tile (line 30).
The flow_control specification (line 24) indicates that only 2*P instances of the body should be executed simultaneously, i.e. that each processor receives two tile processing requests. This ensures pipelining while avoiding memory and stack overloading.
The tile filtering operation consists of applying the filter to the tile central part (lines 14 and 15) while, in parallel (indexed parallel construct, lines 8 to 16), fetching the tile sides from the neighboring S/P nodes (line 16). Once the tile borders have been fetched, the filter is applied to them (line 17).
PERFORMANCE RESULTS
In this section we present and discuss performance results, and compare them to the theoretical models of section 5.2.
We run the performance measurements on a network of Bi-PentiumPro PCs (200 MHz), connected through Fast Ethernet (100 Mbits/sec). We filter a 4096x4096 pixel gray-level image (16 MBytes) split into 256 (resp. 64) tiles of size 256x256 pixels (resp. 512x512 pixels). The tiles are stored on several disks (from one to nine per computer). The filtering operation consists of applying a 5x5 median filter mask [5] to the whole image.
FIGURE 9. Single processor computation time according to the filtered data size (MBytes)
Figure 9 displays the computation time as a function of the filtered data size (MBytes). These results are computed locally on a Mono-PentiumPro, with image tiles stored on two local disks. The cache is disabled. The diagram shows that a smaller tile size reduces the pipeline startup cost but increases the synchronization overhead, since more image tiles need to be processed. For reference, performing the same tile-based algorithm locally and sequentially with the whole image preloaded in memory (16 MBytes) takes about 44s. This shows that disk accesses are almost completely hidden during the computation. This illustrates the ability of CAP to handle pipelining properly: programs fetching data from disk have the same performance as programs working directly from main memory. Performance results are similar with and without the cache. In both situations the algorithm is compute-bound, and the total execution time is mostly computation time. In contradiction with the theoretical predictions, we achieve better results for larger tiles. The reason is that the theoretical analysis assumes the same unitary processing throughput for both 256x256 pixel and 512x512 pixel tiles, without taking into account the increased overhead due to the larger number of processed tiles (exchanging the borders, initiating accesses to the disks, ...).
The theoretical predictions for four processors are close to the experimental results. With $l_{dw} = l_{dr} = 18.5$ ms (disk latency), $1/\tau_{dw} = 1/\tau_{dr} = 3.3$ MBytes/s (disk throughput), TileSize = 512x512 pixels, N = 64 (number of tiles), P = 4 (number of S/P nodes) and $t_p = 0.69$ s (time needed to filter one tile), we obtain a theoretical total time of T = 11.24 s (see section 5.2) against 11.9 s measured. The difference is explained by two factors: (1) the theoretical pipeline startup cost underestimates the actual pipeline startup cost; (2) the load is not perfectly balanced between the processors. In fact, with 64 tiles and 4 processors (N = 64, P = 4), equation 5 (see section 5.2) predicts an ideally balanced load. In practice, the processors (PCs under WindowsNT) work at a slightly different pace and terminate with a time difference of at most 10% of the total processing time. At this time, solutions where part of a tile computation is off-loaded to a remote processor have not been investigated.
With ten processors, the efficiency falls to 75% against 88% theoretically (equation 5, section 5.2). Nevertheless, the processors are kept busy during the whole program execution. The reason is that with current PC hardware, part of the processing time is spent handling the TCP/IP protocol stack. Therefore the computation part of the algorithm does not benefit from the full processor capacity as it does in a single-processor execution (i.e. without communication). We measured the TCP/IP protocol overhead by running the program without any disk accesses and computations, and found that it is about 1 s. Subtracting it from the experimental measurements, the resulting time is in accordance with the theoretical predictions.
CONCLUSIONS
This paper shows that CAP enables the compact specification of pipelined-parallel imaging applications. The CAP environment is not restricted to the process-and-gather and exchange-process-and-store operations described in this paper. It can be applied to any imaging algorithm, including non-oblivious algorithms3. Once the imaging library is available, the implementation and test effort for the two applications described in this paper is of the order of days. The generated programs run on PCs under WindowsNT and on Sun workstations under Solaris. With a limited effort, reusable and customizable parallel code can be produced. The programs generated use as runtime systems either the MPS communication library developed by the authors, or the familiar PVM communication package.
The CAP imaging library supports the subdivision of images in tiles. The CAP language handles the communication and synchronization of messages in a parallel program. These two features of the CAP environment free the programmer to concentrate on the algorithm(s) to be applied to the image. Once the algorithm has been designed, the programmer can either reuse existing CAP programs and modify the processing operation performed on each tile, or create new CAP programs to handle new parallel execution schedules.
Performance measurements on a 1-processor 2-disk configuration show that we obtain similar results whether the image is read directly from disks or from memory. This result shows that the CAP-specified algorithm achieves excellent pipelining between disk accesses and the filtering operation. In the multi-processor configuration, the speed-ups achieved with the 4-processor and 10-processor configurations are 3.71 and 7.5 respectively. Disk accesses and network transmissions are hidden during the computation. Except at the beginning and the end of the algorithm, the processor utilization is 100%.
The CAP language has also been applied to the parallelization of linear algebra algorithms (matrix multiplication and LU decomposition), and to the visualization of 3-D tomographic images. In the field of linear algebra, speed-ups of 18 with 20 processing nodes, and 8.4 with 10 processing nodes have been demonstrated for the matrix multiplication and the LU factorization respectively. The experimental analysis of the parallel 3-D tomographic imaging application (plane extraction) has shown that the application achieves near linear speed-ups for a hardware architecture consisting of up to 4 PCs (3 S/P nodes and a client) and up to 27 disks working in pipelined-parallel fashion. Both applications have demonstrated that the overhead of CAP is very low.
3. Oblivious algorithms are algorithms whose execution flow is independent of the content of the data being processed.
