This paper presents a graphics renderer which incorporates new methods for partitioning memory and work for efficient execution on a parallel computer. The task adaptive domain decomposition scheme is an image space method involving dynamic partitioning of rectangular pixel area tasks. We show that this method requires little overhead, allows coherence within a parallel context, handles worst case scenarios with reasonable speedup, executes efficiently, and requires minimal processor synchronization. The implementation analysis indicates that load imbalance is the major cause of performance degradation at the higher processor counts. Even so, on a variety of test scenes, an average rendering speedup of 79 was achieved using 96 processors on the BBN TC2000 multiprocessor, with processor efficiency ranging from 66% to 94%.
Introduction
Using parallel processing for visualization allows quick turnaround for computer graphics rendering of complex datasets from disciplines such as global climate modeling, molecular dynamics, and finite element modeling. Parallel processing can also be used to accelerate the making of high quality graphics animations so that the scientist can debug and test scientific simulation codes. This paper reports on a polygon scan conversion renderer which has been designed and implemented on a distributed memory parallel computer. The algorithm is intended to support fast rendering of highly complex datasets using advanced lighting models, rather than real-time update of moderately complex scenes. We show that good speedup and parallel efficiency can be obtained with only a small overhead when compared to an optimized serial renderer. Advantages of a software based approach include the feasibility of adding special rendering features to the program and the capability to integrate a parallel scientific application with the graphics renderer.
This paper presents a new work decomposition strategy called task adaptive, which is based on dynamically partitioning the amount of computational work left at a given time. The algorithm uses a heuristic for dynamic task decomposition in which image space tasks are partitioned without requiring the processor being partitioned to be interrupted. We employ a sophisticated memory referencing strategy which is integrated into the task adaptive algorithm to allow local access to graphics data during the rendering process. The exact data needed for a particular graphics rendering task is copied to a processor, so that no remote referencing is needed during rendering. This approach on a MIMD parallel computer allows one to obtain reasonable speedup as more processors attack the problem. The algorithm is also amenable to either a shared memory or a message passing programming paradigm.
An in-depth analysis of the overheads associated with parallel processing is presented to determine where performance is adequate and where it could be improved. The analysis takes both a theoretical and a practical point of view in order to understand the degradation factors as a function of the number of processors (P).
The structure of the paper is as follows. Section 2 outlines previous work in the area of parallel graphics rendering. Section 3 describes the three main phases of the parallel display algorithm and section 4 presents the strategies for storage of graphics data. Section 5 discusses the complexity of this algorithm and section 6 gives performance results for several test datasets.
*Current address: David Sarnoff Research Center, CN 5300, Princeton, NJ 08543. Email: slim@sarnoff.com. 0-8186-4920-8/93 $3.00 © 1993 IEEE
Historical Perspective
In the past 15 years, there have been numerous approaches to using parallelism in the tiling operation of polygon display algorithms. Much of this research has focused on the domain (or work) decomposition of the rendering process. While there has been some work on parallel object space methods ([1, 8] among others), the bulk of the algorithms have been of the image space variety. A taxonomy of these approaches appears in [17]; they are briefly summarized here. Polygon decompositions can involve independent tasks, as in [7], where span segments of different polygons on a scan line are processed in parallel. Allison [2] uses a shared Z-buffer where the objects are processed in parallel and rendered to a common frame buffer. Crockett and Orloff [5] alternate each processor on an Intel iPSC between rendering image space tasks and communicating polygonal data in order to balance the load effectively. The above algorithms seem either to be too restrictive in the amount of parallelism provided or to suffer inherent limitations which prevent reasonable scalability.
In using pixels as a basic building block for domain decomposition, researchers have devised the following tasks for parallel processing: horizontal strips (screen wide) of scan lines [11, 15], vertical strips (screen height) of pixels [15], and rectangular areas of pixels [5, 11, 14, 15]. The solutions can be divided into two categories: data non-adaptive and data adaptive. The data non-adaptive methodology relies on an initial decomposition of image space that is not related to the input data. The idea is that if many (simply determined) tasks of varying work loads are assigned to the processors, the overall load will become balanced. In the data adaptive case, the size of the tasks (that is, the area of the pixel regions) is adjusted according to the input data in an attempt to obtain better load balancing. Data adaptive schemes have been implemented as described in [15, 14, 16] in the context of scan conversion methods, and in [6] for a ray tracing algorithm. A method to reduce communication during rendering has been discussed in the PixelFlow design [12]. This solution involves image compositing after each processor renders the entire image space using only its local data. That design trades the communication problem for the expenses of synchronization, later communication (albeit potentially smaller than discussed above), and a non-adaptive load balancing scheme based primarily on the data distribution to processors. Cox [4] has adopted some of the ideas in PixelFlow into his software algorithm described elsewhere in these proceedings.
We have implemented a number of the aforementioned algorithms to determine their relative strengths and weaknesses. For the data adaptive algorithms, it was found [16] that creating tasks of nearly equal work requires too much pre-processing time. Of the data non-adaptive partitioning schemes (horizontal scan lines or groups of scan lines, vertical strips, or rectangular areas of pixels), empirical test results indicate that the rectangular method works best, since this layout minimizes the perimeter while maximizing coherence. The algorithm presented here is based on the rectangular method but improves upon it by limiting overhead in pre-processing, handling worst case scenarios, and adaptively partitioning the work load.
Algorithm
In this section, we discuss the work decomposition strategy for the following three distinct phases of computer image synthesis. The average percentage of total time spent in each phase by a sequential version of this program is noted in parentheses (averaged over the input datasets referenced in the remainder of the paper).
1. Pre-processing (13%): this consists of data read-in, transformation of points, normals calculation, back-face rejection, clipping, and perspective projection.
2. Rendering (86%): this includes hidden surface removal, shading, anti-aliasing, and any other visual effects.
3. Post-processing (1%): this involves displaying the image on a frame buffer or storing it in a file.
The task adaptive domain decomposition is a variation on using rectangular areas of pixels as individual tasks. However, this algorithm incurs less overhead than the rectangular approach in terms of pre-processing and later communication. Our primary focus here will be on the rendering portion since this phase takes the bulk of the computation time. Several methods for parallelizing the first and third phases are given, but implementations of these will generally be specific to the environment where the overall program is to run. The algorithm given here was designed for a computer with physically distributed memory (programmed with either shared memory or message passing); the implementation discussed in this paper uses shared memory.
Pre-processing Phase
The implementation of the pre-processing phase depends to a large degree on the amount of parallelism available. If there are a large number of objects (i.e., > P, the number of processors), then each object can be read into a different processor's memory (see figure 1). Since each object may be a different size, the time to process a given object may vary, as indicated by the size of the rectangles in the figure. Each processor can then perform the transformations, clipping, etc. on the data in local memory. If the number of objects is instead < P, we then have two possible scenarios. In the first, if the user is animating these objects over a long period of time, then each object can be split into multiple sub-objects which can then be processed independently. This provides enough parallelism to keep all of the processors busy. In the second, if the user is only creating a single image, then the cost of splitting the objects cannot be amortized. A possible parallelization in this case would be to independently process iterations of the individual loops used for transforming or clipping the objects. This would work on a shared memory machine but not on a message passing architecture. Alternatively, a reader process could assign portions of the data to individual processors. Another part of the pre-processing phase involves setting up the data for the domain decomposition so that each processor will have access to the data it needs for a particular task. Each initial task corresponds to a small area of the image space, so the polygons are placed into bins (see figure 2), where a bin is associated with a particular task. The bounding box of the polygon is used to decide which bin(s) that polygon is placed into. The data structures for these bins are discussed in section 4. The average speedup for the pre-processing portion of the algorithm as implemented on the BBN TC2000 using this partitioning strategy was 9.4 on 96 processors.
The main limitation to further speedup is the sequential nature of the disk access for reading in data. If the data is coming directly from a simulation program running simultaneously on the same computer, the disk bottleneck would not exist.
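The bounding-box binning step described above can be illustrated with a short sketch. This is not the paper's implementation: the function names, the uniform grid of equally sized bins, and the inclusive pixel-coordinate bounding boxes are all assumptions made for illustration.

```python
def bins_for_bbox(bbox, tile_w, tile_h, cols, rows):
    """Return the (row, col) index of every bin that a bounding box overlaps.

    bbox is (xmin, ymin, xmax, ymax) in pixel coordinates, inclusive.
    The uniform grid of cols x rows bins is an assumption of this sketch.
    """
    xmin, ymin, xmax, ymax = bbox
    # Clamp to the grid so partially off-screen polygons still land in a bin.
    c0 = max(0, int(xmin) // tile_w)
    c1 = min(cols - 1, int(xmax) // tile_w)
    r0 = max(0, int(ymin) // tile_h)
    r1 = min(rows - 1, int(ymax) // tile_h)
    return [(r, c) for r in range(r0, r1 + 1) for c in range(c0, c1 + 1)]


def bin_polygons(bboxes, tile_w, tile_h, cols, rows):
    """Place each polygon id into every bin its bounding box touches."""
    bins = {(r, c): [] for r in range(rows) for c in range(cols)}
    for poly_id, bbox in enumerate(bboxes):
        for key in bins_for_bbox(bbox, tile_w, tile_h, cols, rows):
            bins[key].append(poly_id)
    return bins
```

A polygon whose bounding box straddles a bin boundary is recorded in every bin it touches, which is the polygon duplication cost that grows as the bins become smaller.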
Rendering Phase
As discussed previously, the task adaptive algorithm uses rectangular regions of pixels as the basic task for solving the rendering problem in parallel. Each task consists of a region of pixels for which the tiling problem is solved serially for those polygons which are present in the region. Here, we employ a modified scan line Z-buffer algorithm [13] which uses stochastic sampling (16 samples/pixel) for anti-aliasing. One limitation of the original (Whelan) rectangular assignment method is that a specific load balancing mechanism must be chosen so as to assign equal work to the processors. An example load balancing scheme is to divide the image space into R · P tasks which are dynamically assigned to processors (see figure 9 in the color plate for a simple rectangular decomposition). R is the granularity ratio and must be chosen properly so as to minimize overhead and maximize load balance. The larger the value of R chosen, the more work is involved in pre-processing, communication, and polygon duplication, but the better the load balancing. Conversely, smaller values of R result in less overhead but inadequate load balancing.
The task adaptive approach attempts to bridge this gap by using a small granularity ratio (in this case, R = 2 is used) to minimize overheads, along with an adaptive load balancing scheme. A ratio of R = 1 could be used, but that would require more communication due to the load balancing mechanism utilized. After a processor has finished its first task, it dynamically retrieves additional tasks off the queue until there are no tasks left. Pseudo code similar to what each processor executes is shown below. The code is shown for a shared memory implementation where the shared variable j is atomically accessed.
    j = P;
    for_all (i = 0; i < P; i++)          /* static scheduling  */
        work_on_task(i);                 /* of first P tasks   */
    while (j < (2 * P))                  /* dynamic scheduling */
        work_on_task(atomic_add(j));
    while (work_available > threshold)
        partition();                     /* start partitioning */
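As a hedged illustration of the static-then-dynamic scheduling above, the following sketch uses Python threads in place of processors and a lock-protected counter in place of the atomic add. The names `run_tasks` and `fetch_and_add` are hypothetical, and the partitioning loop is omitted. Unlike the pseudo code, the sketch tests the task bound after fetching an index, so no index beyond R · P is ever executed.

```python
import threading


def run_tasks(num_procs, work_on_task, ratio=2):
    """Run ratio * num_procs tasks: the first num_procs statically, the rest dynamically."""
    total = ratio * num_procs
    next_task = num_procs            # shared counter j, starts past the static tasks
    counter_lock = threading.Lock()  # stands in for the atomic add

    def fetch_and_add():
        nonlocal next_task
        with counter_lock:
            task = next_task
            next_task += 1
        return task

    def worker(i):
        work_on_task(i)              # static scheduling: worker i takes task i
        while True:                  # dynamic scheduling of the remaining tasks
            task = fetch_and_add()
            if task >= total:
                break
            work_on_task(task)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_procs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Calling `run_tasks(4, render_tile)` with some hypothetical `render_tile(task_index)` function would execute task indices 0 through 7 exactly once each, in an order that depends on thread timing.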
The partition routine is executed so as to dynamically balance the load among the processors. Since each image space task differs in its amount of work, steps are taken to steal part of another processor's work when there are no initial tasks left. The adaptive nature of this work decomposition is outlined in the steps given below, which are essentially part of the partition routine. The term $P_{max}$ refers to the processor index of the maximally loaded processor as determined by a particular splitting processor $P_s$ at a given time.
1. When a processor needs work (call this processor $P_s$), it searches among the other processors for the one which has the most work left to do (call this processor $P_{max}$). If there is sufficient work to do, proceed through the remaining steps; otherwise return.
2. $P_s$ then sets a lock preventing any other processor from splitting $P_{max}$, in addition to setting its own lock to prevent unwanted blocking of processors.
3. $P_s$ partitions the work $P_{max}$ has left into two segments; the first goes to $P_{max}$ and the second to $P_s$.
4. $P_s$ then copies from $P_{max}$ the data necessary for it to work on the second segment.
5. $P_s$ unsets both its own lock and that of $P_{max}$ and starts doing work. After completion of its work, $P_s$ repeats these steps.
The lock prevents more than one processor from splitting a given $P_{max}$ at a time. The lock is implemented as part of the operating system and guarantees that one processor proceeds through it at a time. If it is not open, the processor spin waits until it opens. After $P_s$ completes its new work, it calls partition. This routine is called repeatedly until there is no work available for splitting above a certain threshold. Based on empirical studies, splitting is worthwhile even down to two scan lines, but splitting a single scan line in two did not prove to be beneficial. $P_s$ splits the remaining work of $P_{max}$ horizontally into two tasks: $P_{max}$ continues to work on the upper task while $P_s$ takes the lower one. This allows coherence to be maintained in the region of $P_{max}$ without any additional overhead, and $P_{max}$ can continue working on its own task uninterrupted. Other splitting mechanisms were investigated (such as creating two side-by-side regions or even a combination of side-by-side and top-down) but their performance was found to be inferior to the method outlined here. Figure 3 illustrates the splitting process.
Figure 3: Dynamic splitting of regions for task adaptive scheme (labels: $P_{max}$ prior to splitting; current scan line).
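The horizontal split can be sketched as follows. The text does not say where the cut is placed, so dividing the remaining scan lines roughly in half is an assumption of this sketch, as are the function name and the half-open scan line ranges; the two scan line minimum follows the threshold reported above. The sketch also ignores the fact that the loaded processor keeps advancing while it is being split.

```python
def split_remaining(current_line, end_line, min_lines=2):
    """Split the scan lines [current_line, end_line) still owned by the loaded processor.

    Returns (new_end, (start, end)): the loaded processor keeps the upper part
    [current_line, new_end) and the splitting processor takes the lower part
    [start, end). Returns None when fewer than min_lines remain.
    Halving the remaining work is an assumption of this sketch.
    """
    remaining = end_line - current_line
    if remaining < min_lines:
        return None
    # The upper part keeps the extra line when the count is odd.
    mid = current_line + (remaining + 1) // 2
    return mid, (mid, end_line)
```

For instance, with scan lines 10 through 19 remaining, the loaded processor would keep lines 10 through 14 and the splitting processor would take lines 15 through 19.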
Because the splitting relies on work proceeding through a task from top to bottom, only scan line oriented hidden surface removal algorithms may be utilized as tasks. On the other hand, any scan line algorithm could be implemented since the particular attributes of the algorithm do not matter. Since the initial task regions are split horizontally, the regions which have been split deviate further from square. In order to find out the effect of the region size on splitting, we tested aspect ratios for the initial tasks other than square (1:1). A horizontally oriented region, such as that produced with a 2:1 ratio (2 pixels across for every 1 down), resulted in poor parallel performance overall. The following average percentage improvements over a 2:1 ratio were noted: 2.5% for 1:1, 6.4% for 1:2, 6.7% for 1:3, and 1.7% for 1:4, indicating that the ratio of 1:3 produced the best results while 1:4 resulted in too much loss of coherence.
Heuristic for Splitting
In order to find $P_{max}$, it is necessary to come up with a method for determining the amount of work a given processor has left to do at any given time. A heuristic which can be used is the number of scan lines a processor has left to work on, since this is indicative of the amount of work left. Other heuristics were investigated but did not perform as well. During the tiling portion of the computation, each processor updates its own shared variable corresponding to the number of scan lines it has left to compute. $P_s$ quickly checks the number of scan lines left on each of the other processors in order to find the largest one, which is then denoted $P_{max}$. Figure 10 in the color plate shows a final illustration of the splitting process for an iso-surface dataset, where the larger areas are the size of the initial tasks and the smaller areas are initial areas that have been split (here, only 20 processors were used for clarity of illustration). The bottom and right side of each area are color-coded according to the processor which worked on it.
Anomalous Situations
Additional synchronization code is required to combat any possible race conditions or deadlocks which could occur during splitting. For instance, it is possible that more than one processor might try to split a given processor at nearly the same time. If a semaphore lock is used to prevent simultaneous splitting, the processors could be backed up for some time trying to partition the same $P_{max}$, not doing any useful work. This is solved by using a test and lock methodology in which $P_s$ (as it is searching for $P_{max}$) checks each processor's split lock to see if the lock is already set, as described below.
Assume that $P_s$ has determined so far that processor 6 meets the heuristic for the most work left. If processor 6's lock has not been set (i.e., this processor is not being split at the moment), then the number 6 is stored in the local variable pmax of $P_s$ and processor 6's work left is stored in pmax_work_left. If processor 6's lock has been set, then the number 6 is stored only as a potential $P_{max}$. Of course, even with this scheme, it is possible that some processors will have to wait at the lock before they can proceed. It is also possible that after a processor has proceeded through the lock, there is no work left since it was all completed in the meantime. If that is the case, $P_s$ will recursively call partition to obtain additional work. Based on our measurements, the time spent in a lock by a processor is no more than 0.1% of the execution time and in general is much less.
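The search for the most heavily loaded processor, combined with the test of each split lock, might look like the sketch below. The text does not fully specify how a "potential" candidate is used, so this sketch assumes that a locked processor is chosen only when no unlocked processor has enough work left; the function name and the plain lists standing in for shared variables are also assumptions.

```python
def choose_pmax(lines_left, split_locked, me, threshold=2):
    """Pick the processor to split, preferring processors whose split lock is open.

    lines_left[p]   : scan lines processor p still has to compute (the heuristic)
    split_locked[p] : True while some other processor is splitting p
    Returns a processor index, or None when nobody has at least `threshold` lines.
    Using a locked processor only as a fallback is an assumption of this sketch.
    """
    pmax, pmax_work_left = None, threshold - 1
    potential, potential_work = None, threshold - 1
    for p, work in enumerate(lines_left):
        if p == me:
            continue
        if not split_locked[p]:
            if work > pmax_work_left:
                pmax, pmax_work_left = p, work
        elif work > potential_work:
            # Lock already set: remember this processor only as a potential choice.
            potential, potential_work = p, work
    return pmax if pmax is not None else potential
```

Preferring unlocked processors keeps idle processors from queueing behind a single lock, which is consistent with the small lock time reported above.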
Post-processing Phase
This phase consists of outputting the image to the frame buffer or disk, depending on the needs of the user. The most straightforward method is to have each processor keep a local buffer to store the pixel colors for its screen area. When the rendering of a given screen area has been completed, the buffer is sent as a message on a scan line basis to the frame buffer for display. If the image is being converted to a file for disk storage, the easiest output method is to send it from top to bottom for later display. This is done by storing a virtual copy of the frame buffer in the memory of the multiprocessor. Each scan line of the buffer is stored on a separate processor. As each partial scan line is rendered, it is sent to the memory module which contains the corresponding line in the virtual frame buffer. After the rendering phase is completed, the virtual frame buffer can be copied to a file for storage. This can be accomplished by having each processor run-length encode (in parallel) a scan line for final output. The run-length buffers are then sent to disk in a sequential manner. The average speedup for this section of the code was 1.6 on 96 processors, again limited by the sequential disk access.
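Run-length encoding of one scan line can be sketched as below. The output format (a list of count and value pairs) is an assumption, since the paper does not describe the file format. Because each scan line is encoded independently of the others, the encoding itself can run in parallel, leaving only the final write to disk sequential.

```python
def rle_encode(scanline):
    """Run-length encode one scan line into (count, value) pairs.

    The (count, value) representation is an assumption of this sketch.
    """
    runs = []
    for pixel in scanline:
        if runs and runs[-1][1] == pixel:
            runs[-1][0] += 1          # extend the current run
        else:
            runs.append([1, pixel])   # start a new run
    return [(count, value) for count, value in runs]
```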
Graphics Data Decomposition
There are several possible data decomposition methods for storing the graphics data in a multiprocessor. One scheme involves storing data in globally shared memory which all processors can access remotely. This is known as the Uniformly Distributed (UD) scheme. Previously [16], it was demonstrated that this method incurred a large overhead and did not scale well. A second scheme, described below, capitalizes on distributing the data among the memories and moving the data around for local access during the rendering phase.
Locally Cached (LC) Scheme
The LC memory referencing strategy involves initial storage of data scattered throughout the memories of the machine while using a software caching technique to bring data into local memory during processing. A similar type of mechanism has been used before for parallel graphics rendering [3, 9]. In those cases, though, the application was ray tracing and involved an implementation of a technique known as "shared virtual memory" which emulates shared memory on a message passing system. Instead of using shared memory, we employ explicit copying of the exact data which is needed for a given task so that no unnecessary communication is required, the minimum amount of memory is used, and a cache replacement policy is unnecessary. This explicit copying is not amenable to parallel ray tracing implementations since the exact data needed for a given portion of the image space is not known a priori in that type of algorithm.
During the pre-processing phase, polygons are put into bins in each processor corresponding to the screen space areas to be assigned as tasks in the rendering phase. The bins are implemented as a two-dimensional array of structures. The structures contain the following four arrays: a points list, a normals list, a polygon connectivity list (indices into the points list), and a polygon information list (such as bounding box, color, etc.). Each structure contains data for those polygons which cross into that particular bin, as illustrated in figure 2. Before storing the polygons in these bins, processors look at all the polygons in their local memory to see how many belong in each bin. This first pass is used to determine how much array memory must be allocated for a particular bin. The reason that arrays (as opposed to linked lists) are constructed is that each array can later be sent out as a contiguous block to a processor which will tile the area during the rendering phase.
After the memory has been allocated, the local polygon data is placed into the separate arrays for each bin. The data in a given object's polygon topology array contains index pointers to that object's points array. These indices must be modified to point to the correct place in each bin's points array. It is desirable to store a point only once in the new points list and record its reference index, rather than place a new copy of the point into the points list for each polygon which references it. If the latter were done for, say, quadrilateral polygons, we might end up using four times as much memory as is really needed. After the data is processed into messages for each bin, it is "cached" into each processor's memory during the rendering phase.
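The index remapping described above can be illustrated as follows. This is a sketch under assumptions: the function name is hypothetical, only the points and connectivity arrays are handled (normals and polygon information would follow the same pattern), and a dictionary is used to record which object-level points have already been copied into the bin.

```python
def build_bin_arrays(object_points, polygons):
    """Copy the points used by a bin's polygons once each and rewrite the indices.

    object_points : the object's full points list
    polygons      : tuples of indices into object_points, for polygons in this bin
    Returns (bin_points, bin_polygons), where bin_polygons index into bin_points.
    """
    remap = {}          # object-level index -> index in the bin's points list
    bin_points = []
    bin_polygons = []
    for poly in polygons:
        new_poly = []
        for idx in poly:
            if idx not in remap:
                remap[idx] = len(bin_points)
                bin_points.append(object_points[idx])
            new_poly.append(remap[idx])
        bin_polygons.append(tuple(new_poly))
    return bin_points, bin_polygons
```

For two quadrilaterals sharing an edge, this stores six points rather than the eight that per-polygon copying would produce.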
In order to tile an area during the rendering phase, a processor must obtain the transformed polygon data relevant to its assigned area ([i][j]) from the other processors (if they contain any), as shown in figure 4. This is done by querying each processor individually and retrieving that processor's data (if it has any) in contiguous messages. Note that this happens in parallel for all of the processors simultaneously, so the network might become clogged for a short time. After the burst of communication necessary for each processor to obtain its initial task data, no more communication is necessary for this task since
Data Movement During Partitioning
When a processor is going to split another processor's pixel area, it must retrieve the four arrays in the bin structure of that pixel area. The pointers to these arrays are stored in the memory of the remote processor and can be readily retrieved. Ideally, only the data for the polygons which are relevant to the lower task after the split would be obtained. Since that involves stopping the execution of the remote processor, it was not deemed worthwhile in the shared memory implementation. However, a simple method can be used to reduce the amount of data to be copied over time: a quick clip test is performed after the arrays are received, and polygons which are no longer relevant to the newly partitioned task are deleted. This reduces the amount of communication for further splits of an area, which are especially likely near the end of the computation when most of the areas to be split are small and have already been split at least once. Because of the block transfer of data and local memory access, this scheme can also be implemented on a message passing computer with some modification.
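A quick clip test of this kind could be as simple as a bounding box overlap check against the new region. The paper does not give the details of its test, so the following is only a sketch under that assumption, with hypothetical names and inclusive pixel coordinates.

```python
def keep_relevant(bboxes, region):
    """Return the ids of polygons whose bounding box overlaps the new task region.

    bboxes : list of (xmin, ymin, xmax, ymax) per polygon, inclusive
    region : (xmin, ymin, xmax, ymax) of the lower task after the split
    A bounding box overlap test is an assumption of this sketch.
    """
    rx0, ry0, rx1, ry1 = region
    return [
        i
        for i, (x0, y0, x1, y1) in enumerate(bboxes)
        if x0 <= rx1 and x1 >= rx0 and y0 <= ry1 and y1 >= ry0
    ]
```

Polygons lying entirely above the cut are dropped, so a later split of the same area copies a smaller set.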
Complexity
The overall complexity of the rendering phase is briefly analyzed here using the CREW PRAM (Concurrent Read Exclusive Write, Parallel Random Access Machine) model of computation. We assume that there are N polygons used for input with P processors applied to the rendering computation. Running on a parallel computer, the time complexity is:
$$T = T_{rend} + T_{split} + T_{comm}$$
where $T_{rend}$ is the actual rendering time, $T_{split}$ is the time to perform the task adaptive splitting, and $T_{comm}$ is the time for communication of data to the processors. The analysis given in [13] for the serial scan line Z-buffer algorithm shows that $T_{rend}$ is $O(N)$ in time complexity. The algorithm described in this paper is a parallel version of the scan line Z-buffer where each task computed by a processor can be considered to be executing a serial version of this same algorithm. For a base analysis, we assume that no task splitting takes place during the program, which results in $2 \cdot P$ tasks with $N/P$ polygons independently and identically distributed (i.i.d.) to each processor. Assuming each processor $i$ receives exactly the average number of polygons, the number of polygons per task is half the average (since R = 2), so the time complexity of a single task is:
$$T_{rend,i} = O\left(\frac{N}{2P}\right)$$
The total amount of work for all tasks is then:
$$T_{rend} = 2 \cdot P \cdot T_{rend,i}$$
or just $O(N)$. If this work is done in parallel by P processors, then $T_{rend}$ can be reduced to $O(N/P)$.
We now show that task adaptive rendering results in nearly the same complexity as the ideal distribution of polygons used in the base analysis described above.
Contrary to the assumption of i.i.d. assignment of polygons above, most datasets will result in some processors obtaining more than $N/P$ polygons and some fewer. The task splitting mechanism compensates for this deviation in the following manner. Task splitting occurs when a given processor does not have enough work (i.e., its original polygon distribution is $< N/P$). When this is the case, a lightly loaded processor steals some work (polygons) from a more heavily loaded one in order to bring its total closer to the ideal distribution. The processor it steals from ($P_{max}$) is the most heavily loaded processor according to a greedy selection criterion. Thus, $P_{max}$ must have more than $N/P$ polygons and needs to give some up to come closer to the ideal distribution. The task adaptive splitting mechanism essentially ensures that processors come closer to the ideal polygon distribution. Since the result of task adaptive splitting is a situation where each processor ends up with a total workload nearly equal to $N/P$ polygons, the previous analysis for rendering is therefore accurate for the time complexity of the parallel task adaptive approach (i.e., $O(N/P)$).
It can be shown that for splitting, the time to find $P_{max}$ is $O(P)$ and the number of splits is proportional to P, but the splits occur in parallel, which results in a time complexity of $O(P)$. This analysis assumes perfect load balancing. If the load imbalance, $li$, is measured as a fraction (0: fully balanced, 1.0: unbalanced), then rendering is proportional to $(1 + li \cdot (P - 1)) \frac{N}{P}$.
As load imbalance is reduced, net performance increases proportionally. Communication is proportional to $N/P$, so the entire program is of complexity $T = O\left((1 + li \cdot (P - 1)) \frac{N}{P} + P\right)$. The splitting takes a very small amount of time, which should nevertheless be taken into account in the overall time complexity. Of course, the PRAM model has the limitation that it assumes a constant access time to shared memory regardless of how large P becomes. Since this is unrealistic, performance degradation is bound to occur as large processor configurations are used.
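To make the effect of the load imbalance term concrete, the model can be evaluated numerically. This sketch drops all constant factors, so the returned value is in arbitrary units and is useful only for comparing configurations; the function name and the unit constants are assumptions.

```python
def model_time(n_polygons, procs, load_imbalance):
    """Evaluate (1 + li * (P - 1)) * N / P + P with all constants set to 1.

    load_imbalance is a fraction: 0.0 is fully balanced, 1.0 is fully unbalanced.
    The result is in arbitrary units; unit constants are an assumption of this sketch.
    """
    render = (1 + load_imbalance * (procs - 1)) * n_polygons / procs
    split = procs
    return render + split
```

With `load_imbalance` set to 0 the rendering term is N/P, and with it set to 1 the rendering term grows back to N, the serial cost.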
Results
This algorithm was tested on a BBN Butterfly TC2000 multiprocessor using the task adaptive domain decomposition and the LC memory referencing scheme. Three input datasets of varying complexity were used, two of which (rings and tree) are from Eric Haines' SPD database [10]. The other image, layers, consists of several transparent iso-surface layers from a fusion plasma turbulence simulation generated by Tim Williams at Lawrence Livermore Lab. These test images are shown in figures 10, 11, and 12 (the layers image shows the area borders color-coded according to which processor worked on a particular screen area). All rendering utilized 16 samples/pixel anti-aliasing with Phong smooth shading at standard video (640 x 484) resolution.
Performance
In table 1, we compare average time for rendering, including algorithm-specific overheads, for the test images using the data adaptive algorithm (described in section 2), the base rectangular region method, and the task adaptive algorithm using the LC memory referencing scheme and static versus dynamic scheduling of tasks. A full detailed comparison is in [17]. The timing, speedup, and efficiency results for the tiling section of the program are given in table 2, with a graph of the speedup in figure 5. The speedup measured is self speedup, and the program utilized remote memory transfers even on single processor runs. This was required since the input datasets would not fit into the memory of a single processor. To compensate, the extra costs of communication and code modification were measured and subtracted from the single processor time in order to come up with a realistic estimate of the sequential time (noted in the graph) for speedup measurements. An average rendering rate of 55,400 polygons per second was achieved using stochastic sampling anti-aliasing, while without anti-aliasing the average was 93,600 polygons per second. According to our measurements, there is only a 15% overhead cost difference between an optimized sequential renderer and the parallel program described here when run on a single processor. The average efficiency of the parallel algorithm is 82%; this means that the parallel version provides high performance relative to itself as well as to a serial renderer, which is important in measuring the success of a parallel implementation.
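The relationship between the reported speedup and efficiency figures can be checked with the standard definitions, speedup as sequential time over parallel time and efficiency as speedup over processor count. The function names here are illustrative only.

```python
def speedup(t_sequential, t_parallel):
    """Speedup: estimated sequential time divided by parallel time."""
    return t_sequential / t_parallel


def efficiency(speedup_value, procs):
    """Parallel efficiency: speedup divided by the number of processors."""
    return speedup_value / procs
```

Applied to the average speedup of 79 on 96 processors, `efficiency(79, 96)` is roughly 0.82, which matches the 82% average efficiency stated above.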
In order to find the actual causes of performance degradation from linear speedup, various overhead effects were measured for the implemented program on the input datasets as a function of the number of processors (P). They include load imbalance, network contention, code modification to allow parallel execution, and communication.
Other measured factors such as memory latency, synchronization, and scheduling represent such a small fraction (typically < 0.1%) of the work that they are not presented in the graphs. These overheads are not taken into account in the PRAM model of computation. The latter factors are small because the splitting mechanism involves very little synchronization of processes. The graphs which depict the measured degradation factors as a function of P for the layers, rings, and tree images are given in figures 6-8. The overheads shown at one processor indicate the effects of code modifications necessary to run the code on a parallel machine and the communication required to transfer the data from shared memory to local memory. Table 3 indicates the computed overheads at 96 processors.
Analysis
From the graphs, we can see that several of the overhead effects use a higher percentage of runtime as more processors are utilized, meaning that different factors come into play at higher processor counts. Load balancing, in particular, is directly affected. As P increases, more processors perform dynamic partitioning simultaneously. Consequently, at the end of a run, some processors have work to do while others are trying to obtain work. By the time the latter obtain the lock to split the work, there may not be any work left. As was stated previously, the heuristic for splitting is not particularly accurate, which explains the high load imbalance, particularly for the tree input dataset. The degradation due to code modification arises because additional coherence is lost as the number of processors, and consequently the total number of tasks, increases. Communication overhead increases with P for two reasons. First, as P increases, the size of the task regions becomes smaller since #tasks = 2·P, so polygons cross over into more areas (which means more copying of this polygon data). Secondly, as P is increased, more processors become available to partition others' work, which requires communication of data to obtain the new tasks. As a result of this increased communication, network contention increases as well. If the number of polygons (N) increases and the number of processors (P) stays the same, we observe that the overall communication will increase as well. But, according to the algorithm's time complexity, rendering time also increases with N, so communication is not likely to become a higher percentage cost. However, if the size and distribution of polygons is changed significantly, this will affect the overall rendering time. For instance, the polygons in the tree image are smaller and more densely packed than in the rings image.
Since N is larger for the tree image, communication is higher but it is also a higher percentage of overall rendering time since the rendering time per polygon is smaller (comparing the two serial execution times). As one can see, depth complexity, polygon area, number of polygons, as well as microprocessor speed, communication speed, and data transfer methodology all contribute to the efficiency of a parallel renderer. It is obvious, for instance, that doubling the output image resolution increases the work without increasing the overheads, so the program appears to be more efficient. With all of these variables, though, it is impossible to generalize a solution to meet all users' needs. One should take into account the main requirements of the system when designing the data decomposition mechanism and use optimizations for data locality and load balancing appropriately. The results here indicate that the LC scheme exploits data locality at a minimal expense of communication overhead and can be used in both shared memory and message passing environments.
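The first communication effect discussed above, in which smaller task regions cause more polygon data to be copied, can be illustrated with a small sketch. It assumes a uniform grid of rectangular regions and axis-aligned polygon bounding boxes; the grid layout, the function `copies_needed`, and the sample boxes are illustrative assumptions and do not reproduce the renderer's actual partitioning.

```python
def copies_needed(bboxes, width, height, cols, rows):
    """Count (polygon, region) overlaps on a cols x rows grid of regions.

    Each overlap stands for one copy of a polygon's data to the processor
    that owns the region. Boxes are (xmin, ymin, xmax, ymax) in pixels,
    with inclusive coordinates inside the image.
    """
    cell_w = width / cols
    cell_h = height / rows
    total = 0
    for xmin, ymin, xmax, ymax in bboxes:
        # Range of grid columns and rows touched by this bounding box.
        c0 = int(xmin // cell_w)
        c1 = min(int(xmax // cell_w), cols - 1)
        r0 = int(ymin // cell_h)
        r1 = min(int(ymax // cell_h), rows - 1)
        total += (c1 - c0 + 1) * (r1 - r0 + 1)
    return total


# Hypothetical bounding boxes in a 512 x 512 image (illustrative only).
boxes = [(10, 10, 40, 40), (100, 100, 300, 180), (450, 20, 470, 400)]
for cols, rows in [(2, 1), (4, 2), (8, 4)]:
    tasks = cols * rows
    print(tasks, "tasks:", copies_needed(boxes, 512, 512, cols, rows), "copies")
```

In this toy case the same three polygons need 4, 6, and 13 copies as the image is divided into 2, 8, and 32 regions, which mirrors the trend of communication growing with P.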
This algorithm does have deficiencies in dealing with image space tasks which have a high degree of local complexity, since these types of scenes may not be amenable to the splitting mechanism. Flight simulation is an example application where data can become concentrated at the horizon. But reasonable speedup can still be maintained even under these adverse conditions, as exemplified by the tree input dataset. Unfortunately, it is not possible to test this algorithm under a large number of test scenes in order to adequately measure worst case scenario performance. In addition, it would be ludicrous to assume that a single algorithm can be designed to perform equally well under all types of scenarios. The algorithm presented here is a compromise which can handle a moderate amount of local image complexity with reasonable performance. Assigning smaller tasks approaching pixel size or even sub-pixel size to individual processors in an attempt to handle local image complexity may prove to be a workable solution in extreme cases. For most input datasets, however, the additional work to do so does not seem necessary. In fact, when we tested recursive splitting of tasks, alternating splits in horizontal and vertical directions even down to the pixel level, it was found that no additional benefit in performance resulted.
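The recursive splitting experiment mentioned above can be sketched as follows. This is a minimal illustration which assumes that a task is an axis-aligned pixel rectangle and that each split simply halves it; the helpers `split_task` and `split_recursively` and the `depth` parameter are hypothetical and do not reproduce the renderer's work-estimation heuristic or its locking.

```python
def split_task(rect, vertical):
    """Split rect = (x, y, w, h) into two halves.

    vertical=True cuts along a vertical line (left and right halves);
    otherwise the cut is horizontal (top and bottom halves). A rectangle
    that is only one pixel wide or tall in that direction is returned
    unchanged.
    """
    x, y, w, h = rect
    if vertical:
        if w < 2:
            return [rect]
        half = w // 2
        return [(x, y, half, h), (x + half, y, w - half, h)]
    if h < 2:
        return [rect]
    half = h // 2
    return [(x, y, w, half), (x, y + half, w, h - half)]


def split_recursively(rect, depth, vertical=True):
    """Split up to `depth` times, alternating directions, stopping at pixels."""
    _, _, w, h = rect
    if depth == 0 or (w == 1 and h == 1):
        return [rect]
    parts = split_task(rect, vertical)
    if len(parts) == 1:
        # Cannot be cut in this direction, so cut in the other one.
        parts = split_task(rect, not vertical)
    tasks = []
    for part in parts:
        tasks.extend(split_recursively(part, depth - 1, not vertical))
    return tasks


tasks = split_recursively((0, 0, 8, 4), depth=3)
pixels = split_recursively((0, 0, 4, 4), depth=10)
print(len(tasks), "tasks;", len(pixels), "single-pixel tasks")
```

Three levels of splitting turn an 8 x 4 region into eight 2 x 2 tasks, and a sufficiently large depth reduces a 4 x 4 region to its 16 individual pixels.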
Conclusion
The main goal of this project was to achieve good speedup and efficiency using a parallel algorithm for rendering of complex geometric scenes. The locally cached memory strategy allows the algorithm to be implemented on either shared memory or message passing MIMD computers. Although the absolute performance of this software algorithm already ranges up to 100,000 anti-aliased Phong shaded polygons per second, the clear metric for success is that we have achieved an average efficiency of 82% utilizing 96 processors. This indicates that with even faster microprocessors and/or larger parallel computers, rendering rates can be increased even further. In addition, using this algorithm as part of a scientific simulation system on a parallel computer allows direct memory transfer of data and supports fast creation of "movies in minutes."
The algorithm outlined here can also be used for a number of other problems in graphics. Certainly, the decomposition and memory referencing strategy can be applied to handle additional rendering features such as shadows, textures, and wide pixel filtering. Modifications to the basic approach are necessary to handle these effects. In addition, 2-D polygon overlaying for geographical planning is a computationally intensive problem that can benefit from parallel processing. The task adaptive approach to work decomposition can also be useful in a cluster (or distributed) computing environment to harness cycles from idle machines for graphics rendering.
