Abstract. This paper presents algorithms developed for the pixel merging phase of object-space parallel polygon rendering on hypercube-connected multicomputers. These algorithms reduce the volume of communication in the pixel merging phase by exchanging only the local foremost pixels. In order to avoid message fragmentation, the local foremost pixels should be stored in consecutive memory locations. An algorithm, called modified scanline z-buffer, is proposed to store the local foremost pixels efficiently. This algorithm also avoids the initialization of the scanline z-buffer for each scanline on the screen. Good processor utilization is achieved by subdividing the image space among the processors in the pixel merging phase. Efficient algorithms for load balancing in the pixel merging phase are also proposed. Experimental results obtained on a 16-processor Intel iPSC/2 hypercube multicomputer are presented.
Introduction
There are two approaches to parallel polygon rendering on multicomputers: image-space parallelism [1, 2, 3] and object-space parallelism [4, 5, 6]. In object-space parallel rendering, the input polygons are partitioned among the processors. Each processor then runs a sequential rendering algorithm on its local polygons. Each generated pixel is locally z-buffered to eliminate local hidden pixels. After local z-buffering, the pixels generated in each processor must be globally merged, because more than one processor may produce a pixel for the same screen coordinate. The global z-buffering operations during the pixel merging phase can be considered an overhead with respect to sequential rendering. Furthermore, each global z-buffering operation necessitates interprocessor communication. Efficient implementation of the pixel merging phase is thus a crucial factor in the performance of object-space parallel rendering.

In its simplest form, the pixel merging phase can be performed by exchanging pixel information for all pixel locations between processors. We will call this scheme full z-buffer merging. This scheme may introduce a large communication overhead in the pixel merging phase because pixel information for inactive pixel locations is also exchanged. This overhead can be reduced by exchanging only the local foremost pixels in each processor. This scheme is referred to here as active pixel merging.

The approaches in [5, 6] use architectures whose processors are interconnected in a tree structure for the pixel merging phase. Both approaches result in low processor utilization in the pixel merging phase due to the tree topology: the processors in the lower levels of the tree (e.g., processors at the leaves) may have substantially less work than those in the upper levels. Another approach, presented in [4], utilizes the network broadcast capability for the pixel merging phase. Each processor, starting from the first processor and continuing in increasing processor id, broadcasts its "active" pixels to a global frame buffer. The other processors capture the broadcast pixels and delete their local pixels which are hidden by the broadcast pixels. In this way, the number of pixels broadcast by the next processor is expected to decrease. This approach introduces a large communication overhead due to the broadcast operation on medium-to-coarse grain distributed-memory architectures. In addition, it suffers from low processor utilization because a processor remains idle from the time it broadcasts its pixels until the end of the pixel merging phase.

This paper investigates object-space parallelism on hypercube-connected distributed-memory multicomputers. In our approach, the hypercube interconnection topology and the message passing characteristics of the hypercube multicomputer are exploited. The algorithms proposed in this work achieve good processor utilization by implicitly subdividing the image space among the processors in the pixel merging phase. The volume of communication is decreased by exchanging only the local foremost pixels for active pixel locations, as in [4]. However, storing only the local foremost pixels for efficient pixel merging introduces some overhead to the conventional scanline z-buffer algorithm. An algorithm, called modified scanline z-buffer, is proposed to reduce this overhead. The proposed algorithm also avoids the initialization of the scanline z-buffer for each scanline during local z-buffering. The load balancing issue in the pixel merging phase is discussed, and algorithms for achieving better load balance are proposed.
Modified Scanline Z-buffer Algorithm
In order to prevent message fragmentation in active pixel merging, the local foremost pixels should be stored in consecutive memory locations. In this section, a modified scanline z-buffer algorithm is presented. This algorithm utilizes a modified scanline scheme to store the foremost pixels in consecutive memory locations efficiently. In addition, it avoids the initialization of the scanline z-buffer for each scanline by sorting the polygon spans of each scanline in increasing order of their minimum x-intersections.
When the polygons are projected onto the screen (of resolution N×N), some of the scanlines intersect the edges of the projected polygons. Each pair of such intersections is called a span. In the first step of the algorithm, the spans are generated and put into the scanline span lists. The scanline span lists consist of a linked list for each scanline, containing the respective polygon spans. Each span is represented by a record which contains the intersection pair (minimum x-intersection x_min and maximum x-intersection x_max) and the information needed for z-buffering and shading. The scanline span lists are constructed by inserting the spans of the projected polygons into the appropriate scanline lists in sorted (increasing) order of their x_min values. This sorting allows local z-buffering to be performed without initializing the scanline array for each scanline on the screen.
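The span-list construction can be summarized with a short sketch. The following C fragment is a minimal illustration, assuming a simplified span record; the field names, the scalar shading value, and the insert_span helper are illustrative and not taken from the paper.

```c
#include <stddef.h>

#define N 400                    /* screen height (a 400x400 screen assumed) */

/* Simplified span record; the actual record also carries full shading data. */
typedef struct Span {
    int   x_min, x_max;          /* x-intersection pair of the span           */
    float z, dz;                 /* depth at x_min and per-pixel increment    */
    float shade;                 /* shading information, reduced to a scalar  */
    struct Span *next;
} Span;

Span *scanline_list[N];          /* one linked list of spans per scanline     */

/* Insert a span into the list of scanline y, keeping the list sorted in
   increasing order of x_min. */
void insert_span(int y, Span *s)
{
    Span **p = &scanline_list[y];
    while (*p != NULL && (*p)->x_min <= s->x_min)
        p = &(*p)->next;
    s->next = *p;
    *p = s;
}
```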
In the second step, the spans in the scanline lists are processed, in scanline order (y order), for local z-buffering and shading. Two local arrays are used to store only the local foremost pixels. The first array, called the Winning Pixel Array (WPA), is used to store the foremost (winning) pixels. Each entry in this array contains location information, the z value, and shading information for the respective local foremost pixel. Since z-buffering is done in scanline order, the pixels in the WPA are in scanline order and the pixels of a scanline are stored in consecutive locations. Hence, for location information, only the x value of the pixel generated for location (x, y) needs to be stored in the WPA. The second array, called the Modified Scanline Array (MSA), of size N, is a modified scanline z-buffer. MSA[x] gives the index in the WPA of the pixel generated at location x. Initially, each entry of the MSA is set to zero. Moreover, a "range" value is associated with each scanline. The "range" value of the current scanline is set to one plus the index in the WPA of the last pixel generated by the previous scanline. The "range" value for the first scanline is set to 1. Since the spans are sorted in increasing x_min values, if a location x in the MSA has a value less than the "range" value of the current scanline, then location x was last generated by a span belonging to a previous scanline. For such locations, the generated pixels are stored directly into the WPA without any comparison. Otherwise, the generated pixel is compared with the pixel pointed to by the index value. This indexing scheme and the sorting of the spans in the scanline span lists avoid re-initialization of the MSA at each scanline. However, the comparison with the "range" value introduces an extra comparison for each generated pixel. These extra comparison operations are reduced as follows. The sorted order of the spans in the scanline span lists ensures that when a span s in scanline y is rasterized, it will not generate a pixel location x which is less than the x_min values of the previous spans. The current span s is therefore divided into two segments such that one segment covers the pixels generated by previous spans in the current scanline and the other segment covers the pixels generated by spans of previous scanlines. Distance comparisons are made for the pixels in the first segment. The pixels generated for the second segment are stored into the WPA without any distance comparisons.
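The WPA/MSA bookkeeping can be sketched as follows, assuming the Span record and scanline_list from the previous sketch and taking a smaller z value to mean a closer pixel. For brevity, the sketch applies the "range" comparison to every generated pixel rather than splitting each span into two segments; the Pixel layout and the MAX_PIXELS bound are illustrative (the paper stores only x for location; y is kept here for later use in the merging sketches).

```c
#define MAX_PIXELS (N * N)            /* upper bound on local foremost pixels */

typedef struct { int x, y; float z; float shade; } Pixel;

Pixel WPA[MAX_PIXELS];                /* winning pixels, stored consecutively */
int   MSA[N];                         /* MSA[x] = 1-based WPA index of the last
                                         pixel generated at column x, 0 if none */
int   wpa_count = 0;                  /* number of pixels stored in WPA so far */

/* Call once per scanline, in increasing y order. */
void rasterize_scanline(int y)
{
    int range = wpa_count + 1;        /* 1-based index of the first WPA entry
                                         that the current scanline may own     */
    for (Span *s = scanline_list[y]; s != NULL; s = s->next) {
        float z = s->z;
        for (int x = s->x_min; x <= s->x_max; x++, z += s->dz) {
            int idx = MSA[x];
            if (idx >= range) {
                /* column already generated in this scanline: distance compare */
                if (z < WPA[idx - 1].z) {
                    WPA[idx - 1].z     = z;
                    WPA[idx - 1].shade = s->shade;
                }
            } else {
                /* column last touched by a previous scanline (or never):
                   store directly, no comparison, no MSA re-initialization     */
                WPA[wpa_count].x = x;  WPA[wpa_count].y = y;
                WPA[wpa_count].z = z;  WPA[wpa_count].shade = s->shade;
                MSA[x] = ++wpa_count;
            }
        }
    }
}
```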
Pairwise Exchange Scheme
This scheme exploits the recursive halving idea widely used in hypercube-specific global operations. The operation requires d concurrent divide-and-exchange stages. In each stage i (for i = 0, 1, 2, ..., d−1), each processor horizontally divides its current active region of size N×n into two equal-sized subregions (each of size N×n/2), referred to here as the top and bottom subregions, where n = N during the initial halving stage. Meanwhile, each processor divides its current local foremost pixels into two subsets belonging to these two subregions, referred to here as the top and bottom pixel subsets. Then, processor pairs which are neighbors over channel i exchange their top and bottom pixel subsets. After the exchange, the processors concurrently perform z-buffering operations between the retained and received pixel subsets to finish the stage.
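One possible organization of the d exchange stages is sketched below, using MPI point-to-point calls for illustration (the original implementation used the iPSC/2 message passing primitives). The Pixel type, N, and MAX_PIXELS are taken from the earlier sketches; split_pixels, merge_pixels, the region-assignment rule based on bit i of the processor id, and the buffer sizes are all assumptions.

```c
#include <mpi.h>

/* Assumed helpers (not from the paper):
   split_pixels(): partition pixels at scanline `mid` into top/bottom subsets.
   merge_pixels(): z-buffer the retained subset against the received one,
                   write the survivors into `out`, and return their count.    */
void split_pixels(const Pixel *pix, int n, int mid,
                  Pixel *top, int *ntop, Pixel *bot, int *nbot);
int  merge_pixels(const Pixel *kept, int nkept,
                  const Pixel *recv, int nrecv, Pixel *out);

void pairwise_exchange_merge(Pixel *pix, int *npix, int d, int my_id)
{
    static Pixel top[MAX_PIXELS], bot[MAX_PIXELS], recv[MAX_PIXELS];
    int y_lo = 0, y_hi = N;                         /* current active region  */

    for (int i = 0; i < d; i++) {
        int mid  = (y_lo + y_hi) / 2;
        int peer = my_id ^ (1 << i);                /* neighbor over channel i */
        int ntop, nbot, nrcv;
        split_pixels(pix, *npix, mid, top, &ntop, bot, &nbot);

        /* assumed assignment rule: bit i clear -> keep top, set -> keep bottom */
        int keep_top = (my_id & (1 << i)) == 0;
        Pixel *send  = keep_top ? bot : top;
        int    nsnd  = keep_top ? nbot : ntop;

        MPI_Status st;
        MPI_Sendrecv(send, nsnd * (int)sizeof(Pixel), MPI_BYTE, peer, 0,
                     recv, MAX_PIXELS * (int)sizeof(Pixel), MPI_BYTE, peer, 0,
                     MPI_COMM_WORLD, &st);
        MPI_Get_count(&st, MPI_BYTE, &nrcv);
        nrcv /= (int)sizeof(Pixel);

        /* z-buffer retained versus received pixels, shrink the active region */
        *npix = merge_pixels(keep_top ? top : bot, keep_top ? ntop : nbot,
                             recv, nrcv, pix);
        if (keep_top) y_hi = mid; else y_lo = mid;
    }
}
```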
All-to-All Personalized Communication Scheme
The pairwise exchange scheme can also be considered a store-and-forward scheme. At each stage, the received pixels are stored into the local memory of the processor. These pixels are compared and merged with the retained pixels. After this merge operation, some of the pixels are sent at the next exchange stage, i.e., they are forwarded towards their destination processors through other processors at each concurrent communication step. Note that during these store-compare-and-forward stages, pixels may be copied from the memory of one processor to the memory of other processors more than once. These memory-to-memory copy operations can be reduced by sending the pixels directly to their destination processors.
On the iPSC/2 hypercube multicomputer, communication between processors is handled by the Direct Connect Modules (DCMs). Communication between two non-neighboring processors is almost as fast as neighbor communication if none of the links between the two processors are currently used by other messages. The communication hardware uses the e-cube routing algorithm [7]. Using the DCMs, we can exchange messages between non-neighbor processors by the algorithm presented in [8]. This algorithm totally avoids message congestion by ensuring that, at each exchange stage, the pixel data is directed to the destination processors along disjoint paths.
In the all-to-all personalized communication scheme, the screen is implicitly divided into P horizontal subregions. Each subregion is implicitly assigned to a processor. Then, each processor sends the pixels belonging to the subregion of processor k directly to processor k. After the P−1 exchange steps, each processor z-buffers its local pixels with the received pixels. Each processor holds a local z-buffer of size N×N/P. Local pixels are scattered onto the z-buffer without any distance comparisons. Then, the z value of each received pixel is compared with the z value at the pixel's location in the z-buffer. After all pixels are processed, the z-buffer contains the pixels of the final picture.
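A minimal sketch of this merging step follows, again using MPI for illustration and reusing the Pixel type, N, MAX_PIXELS, and <mpi.h> from the earlier sketches. The exchange order my_id XOR step is the usual congestion-free pattern under e-cube routing; the bucket_for helper, the assumption that N is divisible by P, and the pre-initialization of zbuf to the far-plane depth are assumptions, not details from the paper. The sketch compares received pixels as they arrive, which yields the same final subregion contents as comparing them after all exchange steps.

```c
/* Assumed helper (not from the paper): collect the local foremost pixels that
   fall into processor `dest`'s horizontal subregion into `out`, return count. */
int bucket_for(const Pixel *pix, int npix, int dest, int rows, Pixel *out);

void aapc_merge(Pixel *pix, int npix, int P, int my_id, float *zbuf, Pixel *fb)
{
    static Pixel send[MAX_PIXELS], recv[MAX_PIXELS];
    int rows = N / P;                       /* scanlines per subregion        */
    int y0   = my_id * rows;                /* first scanline owned locally   */

    /* local pixels of our own subregion are scattered without comparison
       (zbuf is assumed pre-initialized to the far-plane depth)               */
    for (int i = 0; i < npix; i++)
        if (pix[i].y / rows == my_id) {
            int loc = (pix[i].y - y0) * N + pix[i].x;
            zbuf[loc] = pix[i].z;
            fb[loc]   = pix[i];
        }

    for (int step = 1; step < P; step++) {
        int peer = my_id ^ step;            /* exchange partner at this step  */
        int nsnd = bucket_for(pix, npix, peer, rows, send);
        int nrcv;
        MPI_Status st;
        MPI_Sendrecv(send, nsnd * (int)sizeof(Pixel), MPI_BYTE, peer, 0,
                     recv, MAX_PIXELS * (int)sizeof(Pixel), MPI_BYTE, peer, 0,
                     MPI_COMM_WORLD, &st);
        MPI_Get_count(&st, MPI_BYTE, &nrcv);
        nrcv /= (int)sizeof(Pixel);

        /* compare each received pixel against the local subregion z-buffer   */
        for (int i = 0; i < nrcv; i++) {
            int loc = (recv[i].y - y0) * N + recv[i].x;
            if (recv[i].z < zbuf[loc]) {
                zbuf[loc] = recv[i].z;
                fb[loc]   = recv[i];
            }
        }
    }
}
```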
Recursive Adaptive Subdivision
This scheme recursively divides the screen into two subregions such that the number of pixels in one subregion is almost equal to the number of pixels in the other subregion. This scheme is well suited to the recursive structure of the hypercube.
Each processor counts the number of local foremost pixels at each scanline and stores the counts in an array; each entry of the array stores the number of local foremost pixels at the corresponding scanline. An element-by-element global sum operation is performed on this array to obtain the distribution of the foremost pixels over all processors. Then, using this array, each processor divides the screen into two horizontal bands of consecutive scanlines so that each band contains an (almost) equal number of active pixel locations. Along with the division of the screen, the hypercube is also divided into two equal subcubes of dimension d−1. The top subregion is assigned to one subcube while the bottom subregion is assigned to the other subcube. The subcubes perform the subdivision of their local subregions concurrently and independently. Since the screen is divided into horizontal bands, the global array obtained by the global sum operation is reused for the further divisions of the screen.
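The division point can be located from the globally summed per-scanline counts with a simple running sum, as in the sketch below. MPI_Allreduce stands in for the element-wise global operation, and the half-open band boundaries are an assumption.

```c
/* total[y] = number of foremost pixels on scanline y over all processors,
   obtained for example with
       MPI_Allreduce(count, total, N, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
   find_split() returns the first scanline of the bottom band so that the two
   bands [y_lo, split) and [split, y_hi) carry nearly equal pixel loads.      */
int find_split(const int *total, int y_lo, int y_hi)
{
    long sum = 0, half, acc = 0;
    for (int y = y_lo; y < y_hi; y++) sum += total[y];
    half = (sum + 1) / 2;

    for (int y = y_lo; y < y_hi; y++) {
        acc += total[y];
        if (acc >= half) return y + 1;
    }
    return y_hi;
}
```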
Heuristic Bin Packing
In the recursive adaptive subdivision scheme, the subdivision of the screen is done on a scanline basis, i.e., scanlines are not divided. For this reason, it is difficult to achieve exactly equal loads in the subregions. In addition, when a division point is found and the screen is divided into two subregions, each subregion is subdivided independently of the other. As a result, at each recursive subdivision, the load imbalance between the subregions may propagate and increase. Therefore, at the end of the recursive subdivision, some processors may still have a substantially larger work load than others. A better distribution of the work load among the processors can be achieved by using a different partitioning scheme, called heuristic bin packing. In this scheme, the goal is to minimize the difference between the loads of the maximum loaded and minimum loaded processors. In order to realize this goal, each scanline is assigned to the processor with the minimum work load. In addition, scanlines are assigned in decreasing order of the number of pixels they contain, i.e., scanlines that have a large number of pixels are assigned at the beginning. In this way, large variations in the processor loads due to new assignments are minimized towards the end.
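A minimal sketch of the assignment loop follows, reusing the globally summed per-scanline counts and N from the earlier sketches. The owner array, the qsort-based ordering, and the simple linear scan for the least-loaded processor are illustrative choices.

```c
#include <stdlib.h>

static const int *g_total;                 /* counts used by the comparator   */
static int cmp_desc(const void *a, const void *b)
{
    return g_total[*(const int *)b] - g_total[*(const int *)a];
}

/* Assign each scanline to the currently least-loaded processor, processing
   scanlines in decreasing order of their global pixel counts.               */
void bin_pack(const int *total, int P, int *owner)
{
    int  order[N];
    long load[16] = {0};                   /* at most 16 processors, as in the
                                              experiments reported below      */

    for (int y = 0; y < N; y++) order[y] = y;
    g_total = total;
    qsort(order, N, sizeof(int), cmp_desc);   /* heaviest scanlines first     */

    for (int i = 0; i < N; i++) {
        int y = order[i], min_p = 0;
        for (int p = 1; p < P; p++)           /* find least-loaded processor  */
            if (load[p] < load[min_p]) min_p = p;
        owner[y] = min_p;
        load[min_p] += total[y];
    }
}
```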
Experimental Results
The algorithms proposed in this work were implemented in the C language on a 16-node Intel iPSC/2 hypercube multicomputer. The algorithms were tested on scenes composed of 1, 2, 4, and 8 teapots for screens of size 400×400 and 640×640. The characteristics of the scenes are given in Table 1. The abbreviations used in the figures and tables are AAPC: all-to-all personalized communication, PAIR: pairwise exchange, RS: recursive adaptive subdivision, HBP: heuristic bin packing, and ZBUF-EXC: full z-buffer merging. All timing results in the tables are in milliseconds.

Table 1. Scene characteristics in terms of the total number of pixels generated (TPG), the number of polygons, and the total number of winning pixels in the final picture (TPF) for different screen sizes.

Table 2 illustrates the performance comparison of the PAIR-RS scheme with full z-buffer merging. The timings for some scene instances of the ZBUF-EXC scheme could not be obtained due to insufficient local memory; those cases are indicated by a "*" in the table. As seen in Table 2, PAIR-RS gives much better results than ZBUF-EXC in the pixel merging phase. Since pixel information for inactive pixel locations is also exchanged, the volume of communication in ZBUF-EXC is larger than that of PAIR-RS. As is also seen from the table, PAIR-RS performs better than ZBUF-EXC in the local z-buffering phase as well, since it avoids the initialization of the z-buffer.

The total volume of concurrent communication (in bytes) for the various pixel merging schemes is illustrated in Fig. 1. The total volume of concurrent communication is calculated as the sum, over the communication steps, of the maximum volume of communication at each step. As seen from the figure, the AAPC scheme results in a smaller volume of communication than the PAIR scheme, as expected. Note that the volume of communication in active pixel merging is proportional to the number of active pixel locations in each processor. As the number of processors increases, the number of active pixel locations per processor is expected to decrease. Hence, the volume of communication is expected to decrease as the number of processors increases, as is also seen in Fig. 1(a). The increase in the volume of communication of the PAIR-RS scheme on 4 processors is due to store-and-forward overheads. It is also experimentally observed that better load balance in pixel merging indirectly affects the volume of communication as well: as illustrated in Fig. 1(b), the HBP scheme results in a smaller volume of communication than the RS scheme.

The performance comparison of the load balancing heuristics is illustrated in Fig. 2. The load imbalance is the ratio of the difference between the work loads of the maximum and minimum loaded processors to the average work load. The work load of a processor is taken to be the number of pixel merging operations it performs in the pixel merging phase. As seen from the figure, HBP achieves much better load balance than RS, as expected. Load balance improves with increasing screen resolution due to the better accuracy in dividing the screen. As is also seen from Fig. 2(a), HBP scales better than RS for larger numbers of processors. A speedup of 11.47 was obtained on 16 processors with the AAPC-HBP scheme for the 2 POT scene and a 640×640 screen.
Conclusions
In this work, efficient algorithms were proposed for active pixel merging on hypercube multicomputers. These algorithms reduce the volume of communication by exchanging pixels only for active pixel locations in the pixel merging phase. Message fragmentation in active pixel merging is avoided by storing the local foremost pixels in consecutive memory locations during the local z-buffering phase. An algorithm, called modified scanline z-buffer, is proposed to store the local foremost pixels in consecutive memory locations efficiently. This algorithm also avoids the initialization of the scanline z-buffer for each scanline on the screen. It is experimentally observed that active pixel merging with the modified scanline z-buffer algorithm performs better than full z-buffer merging. It is also experimentally observed that the all-to-all personalized communication scheme achieves lower communication overhead than the pairwise exchange scheme due to smaller store-and-forward overheads in active pixel merging. Two load balancing heuristics were proposed to distribute the load evenly in pixel merging. Heuristic bin packing achieves better load balance and scales better than recursive adaptive subdivision in active pixel merging. Therefore, it is recommended that the all-to-all personalized communication scheme with heuristic bin packing be used for active pixel merging on hypercube multicomputers.
