Ray tracing programs are widely used to generate photo-realistic images, but the high computation time may discourage their implementation on single-processor machines; moreover, cost reduction of multi-processor general purpose architectures makes parallel rendering an attractive field of research. We propose a new algorithm which addresses the main issues of a parallel implementation of ray tracing on a message-passing-based machine. We adopt efficient strategies for dynamic workload distribution among processors, task synchronization and communication delay reduction. The resulting implementation is highly flexible, since any number of processors can be employed without introducing synchronization problems. We show an implementation of our algorithm on an nCUBE 2 supercomputer which is a general purpose parallel architecture with distributed memory. A theoretical evaluation of our algorithm allows us to identify a decreasing function for rendering times; the considered examples confirm the theoretical expectations showing that the efficiency of our system may reach up to the 91% of the value achievable by dividing the sequential rendering time by the number of processors employed.
INTRODUCTION
Ray tracing is a very powerful technique in the production of photo-realistic images; its main drawback is the high computation time required to perform the ray-object intersections. Several techniques have been proposed to speed up the basic ray tracing algorithm ( [1, 2, 3] etc.) but often requirements for the rendering time may not be fulfilled on single-processor systems.
Several multiprocessor applications have been developed in the last few years to solve a large spectrum of problems such as scientific visualization [4, 5] , rendering [6, 7, 8] , and many other topics; an introduction to parallel rendering can be found in [9] .
A ray is traced through each pixel of the screen in order to compute pixel values in ray tracing algorithms; each ray is independent of the others and hence they can be processed concurrently. Several parallel ray tracing algorithms have been proposed which are mainly based on two strategies: (i) image space subdivision-based algorithms [10, 11, 12] ; (ii) object space subdivision-based algorithms [13, 14, 15, 16, 17] .
The power of one method is the main drawback of the other one. With the first approach uniform load balancing may be achieved, but this is often paid for by inefficient memory usage. On the other hand, with object space subdivisionbased techniques, a uniform division of the database may be easily obtained, but very complex load balancing strategies have to be adopted in order to achieve good performances.
In this paper we use some useful properties of a sequential ray tracer proposed in [18] . With the technique used in [18] the screen is split into several independent cells and each cell can be computed by considering a reduced number of objects belonging to the scene.
The algorithm proposed in [18] is up to seven times faster than the ray tracing algorithms based on bounding volume hierarchies or recursive space subdivisions. Moreover, the properties of [18] can be employed to design a parallel ray tracer based on a hybrid approach which performs dynamic load balancing by distributing the cells to be computed among processors and allows each processor to test rayobject intersections only on a small subset of the whole database. Unfortunately, the parallelization of this algorithm is not as straightforward as it could be for traditional ray tracing algorithms because of the necessity of building several data structures which have to be used to speed up the rendering. In order to build these structures a large data exchange among the processors is required and this can strongly affect the performance of the parallel algorithm. In this paper we show how the parallelization can be achieved by taking into account three main issues involved in designing an efficient algorithm for message-passing-based machines:
• load balancing;
• task synchronization;
• communication delays.
We consider these issues and present an implementation of our algorithm on an nCUBE 2 supercomputer. Our 504 A. SANNA, P. MONTUSCHI AND M. ROSSI methodology organizes the processors in two different ways. When there are no problems with synchronization and the management of the processor communication buffers, a master processor can directly communicate with the other slave processors (direct master/slave). Otherwise, the interconnection topology is a balanced binary tree where the root is the master and each 'slave' processor has a double function, since it has to execute the job assigned from the master and it has to manage the communication for its sub-tree of the hierarchy (binary tree master/slave). In this case, a processor can only send messages to either its father or its children and, in order to synchronize the transmissions, the processors need to use a rigid protocol which allows them to send a message only after receipt of an acknowledgement from the receiver. In this way, it will be shown that our structure may be expanded with any number of processors without deadlock and loss of data problems, provided that the message buffer of a processor is sufficiently wide to accept one message from each of its children. The proposed technique is scalable and well applies to problems with sufficient parallelization properties, i.e. where a large amount of 'heavy' computations can be done in parallel.
We also present a theoretical analysis of the performances of our technique, which is not related to the study of any particular example, but which is aimed to obtain general results.
The paper is organized as follows: in Section 2 we review the basic parallel ray tracing methodologies and describe, for the sake of reference, the sequential algorithm of [18] . In Section 3 we present the goals of our work, while in Section 4 the steps of the proposed technique can be found. Finally, we show a theoretical evaluation of the proposed algorithm and the experimental results in Section 5.
PREVIOUS WORKS

Approaches to parallel ray tracing
Two main approaches are known in the literature for a division of the computation on multi-processor systems:
(i) image space subdivision-based algorithms [10, 11, 12] ; (ii) object space subdivision-based algorithms [13, 14, 15, 16, 17] .
It is very difficult to identify the most convenient approach to parallel ray tracing since the choice is strongly related to both hardware and specific application environments. Image space subdivision-based algorithms split the plane of view into several regions and each processor has to compute one or more of them. With this approach the load distribution can be either static or dynamic; in the first case, each processor has a priori knowledge of both the number and the co-ordinates of pixels to be traced, while, in the second case, the load is assigned at run time. For instance, a vectorization of a ray tracing algorithm by using an array processor is proposed in [11] and an efficient image plane subdivision for SIMD architectures can be found in [10] .
Image space subdivision-based algorithms may suffer from two main drawbacks:
(i) the implementation may be critical on shared-memory machines, since a large number of accesses to memory may become a bottleneck for the system when a large number of processors is used; (ii) the whole scene database may have to be duplicated at each processor when distributed-memory machines are employed.
The object space subdivision-based methods were designed to address the problem of the implementation of parallel ray tracing algorithms on distributed memory machines. In such computers each processor has a small amount of local memory, so that each processor has to load into its memory a small sub-set of the entire database. Each processor checks for intersections only the objects stored in its local memory and the results of the intersection tests are sent to the other processors by messages. The strategy of the database division is very important because it affects the workload distribution among processors; the object distribution can be either static or dynamic. In the former case, each processor has a priori knowledge of the elements to be loaded into its memory, while, in the latter case, the objects are assigned at run time. A statistic analysis of the image is performed to compute a database distribution; lowresolution images are often computed to achieve good tradeoffs when static data-distribution techniques are employed [17] , while run time statistics on ray-object intersections are used to move the elements among processors. In [14] and [15] the space is divided into a regular three-dimensional (3D) grid of subspaces and each subspace is related to the objects partially or completely contained within it. Each processor receives a set of subspaces and when a ray enters into a subspace the processor which 'controls' that zone tests the intersection of the ray with the objects related to the zone itself. A similar approach is employed in [19] but a hierarchical organization of the objects is used instead of a uniform subdivision. The main drawback of object space subdivision-based methods is the high communication cost which may become a bottleneck for the entire system when a large number of processors is employed. The current generation of distributed memory machines may have 64 Mbytes or more of local memory for each unit; hence, database subdivision cannot be considered a critical topic as in the past.
Architectural implications and issues concerning the main visualization algorithms (radiosity, volume rendering and ray tracing) are discussed in [20] ; in particular, implementations of radiosity, ray casting and ray tracing algorithms on a new class of shared-address-space multiprocessor machines are shown.
Several hybrid approaches which combine both image space and object space subdivision have been proposed [19, 21, 22, 23, 24, 25] .
In [21, 22, 23] the task allocation mechanism is demand driven so that on completion of a task, a slave requests another task and in this way, the load is balanced. Moreover, A FLEXIBLE ALGORITHM FOR MULTIPROCESSOR RAY TRACING 505 by taking advantage of the coherence in references to the model database (the authors call it data coherence), the system can be designed so that memory requirements of each processor are modest compared to the whole database. The authors propose to use an arbitrary tree structure since it imposes a natural hierarchy among the processors by defining for each processor parent and child connections. The authors show that a hierarchical structure provides considerable benefits for exploiting coherence in data references. The drawback of this technique is that the performance of the algorithm is strongly related to the cache design which is a difficult task; in fact, the authors proposed a profiling step where a low-resolution image is generated in order to obtain some statistics which allow the user to efficiently split the database between the caches of the processors. Database partition schemes based on tree structures are also proposed in [26] and [27] . Only the top levels of the object hierarchy are duplicated on all processors while the lower levels are evenly distributed. In [27] the load balancing is statically achieved by scattering the image regions among all processors and this may lead to an inefficient workload distribution. On the other hand, in [26] the image coherence property is used in order to distribute efficiently the lower level of the object hierarchy. Moreover, the processors are grouped in clusters and a data exchange may be done only between processors belonging to the same cluster. The performance of this algorithm is strongly related to an experimentally determined set of parameters. In particular, the efficiency depends on the scene topology as the distribution of the database is based on the image coherence property.
In [19] a hierarchy of bounding volumes is built and it is split into two parts across some level of the tree. The upper part is duplicated in each local memory while the sub-trees below the threshold level are randomly assigned to each processor. Each processor can perform two kinds of task: a task to compute the intersections between the rays and the bounding volumes of the upper part of the hierarchy and a task to compute the intersections between the rays and both the bounding volumes and the objects assigned to the processor itself. The division of all tasks into two kinds is the key to self-balancing the workload. In [24] an octree structure is used instead of a hierarchy of bounding volumes and the ray coherence property is used in order to enhance the performance of the algorithm.
In [24] the ray coherence property is used to distribute the database instead of the image coherence. Because of this, primary and shadow rays are computed very efficiently, but the computation of the secondary rays is still done in a rather inefficient way, above all when a large number of processors is employed.
In [25] the authors adopt static load balancing using a pixel scattered decomposition and they propose a data prefetching scheme to utilize the ray coherence for supplementing the defects of pixel scattered decomposition. The performance of the algorithm is strongly related to two assumptions: adjacent pixels are computed by the processors at similar times and adjacent pixels require the same primitives to be used. Moreover, the efficiency of the database distribution scheme is based on the hardware which must provide efficient multicasting facilities to support data prefetching.
A slightly different approach has to be used to design ray tracing algorithms for distributed environments where a set of heterogeneous computers connected by a network is seen as a parallel virtual machine. In particular, in distributed computing environments the communication overhead is much larger than for multiprocessor machines, hence dynamic load balancing strategies may introduce an excessive delay due to message passing. Therefore, static load balancing heuristics are often preferred. In [28] the plane of the image is split into several square regions of (8 × 8) pixels; the 'weight' of each region is evaluated by tracing a ray through a random pixel of that region. The number of intersections for each area and the computational power of each computer allow a host machine to decide the regions to be computed by each 'server'. Each 'server' receives from the host computer only the co-ordinates of the top-left corners of the regions that it has to compute. The values of the computed pixels are sent to the host with only a packet. In this way, the number of messages is vastly reduced.
A sequential ray tracing algorithm
Several useful ideas can be inferred from the sequential ray tracer presented in [18] which splits the plane of view into cells and, by employing object coherence properties, identifies a subset of objects to be considered for each cell. This approach indirectly suggests a strategy of load balancing and reduction of the database for a parallel ray tracer. In this section we review the main steps of the algorithm [18] , while in Section 3 we address the issues concerning load balancing, task synchronization and system flexibility that are not required in a sequential approach, but which contribute to define the basis for a parallel algorithm.
It is well known that the main reason for the high computational complexity of ray tracing algorithms is the large number of intersection tests needed for rendering an entire scene. This is due to three factors:
(i) the number of primary rays (rays traced from the observer through the plane of view into the scene); (ii) the number of secondary rays reflected and refracted from the objects; (iii) the number of rays traced for shadow computation, in order to decide if a given point on a surface is illuminated or not (and how much).
Several techniques have been developed to address these issues. The algorithm presented in [18] takes advantage of spatial coherence of objects in the scene to speed up the ray tracing process. The ray tracing process is organized in two steps:
• an analysis phase;
• a rendering phase. In the first step, only a subset of all rays needed by a standard ray tracer is traced (denoted 'sampling rays') and some lists are built by observing the behaviour of the sampling rays. These lists are used in the second phase to reduce the number of ray-object intersection tests. Different kinds of lists are used for primary rays, secondary rays and shadow computation:
• Primary rays. The plane of view is split into square cells, where the size depends on the image resolution; two rays are traced for each cell, one through the top-left corner and one through the centre point (see Figure 1 ). Two sampling rays for each cell tend to provide a good trade-off between speed of the analysis phase and accuracy of the final image. Anyway, the user can manually set the size of the cells in order to control the number of sampling rays. By default, the algorithm sets a size of (5 × 5) pixels at a resolution of (320 × 200) pixels (i.e. 2560 cells are produced), (10 × 10) pixels at a resolution of (640 × 480) pixels and so on. In order to avoid an incorrect rendering of an object detected in one cell, but not in the neighbouring ones, the objects hit by sampling rays are also assigned to the neighbouring cells. In the rendering phase, only the objects associated to each cell under analysis are tested for intersections with primary rays, while
the other ones are ignored. In this way, the number of intersection tests for primary rays may be limited, especially for complex scenes.
• Reflected and refracted rays. During the analysis phase, each time a ray is reflected or refracted from object A to object B, object B is stated as a 'neighbour' of A, and it is included into the neighbouring 'cluster' of A (see Figure 2 ). In the rendering phase, intersection tests of reflected and refracted rays from A are limited to the objects belonging to its cluster, rather than to the whole set of objects of the scene.
• Shadow computation. In the analysis step, each time the contribution of a light L to a point P belonging to an object A has to be evaluated, the objects occluding the light are considered and added to a list associated to the object-light pair (A-L i ); this list is called the occluding object list (see Figure 3 ). In the rendering phase, only the intersections with occluding objects of the list A-L i are checked when the contribution of the light L on object A has to be evaluated. Moreover, if during the analysis phase a light source is always occluded, its contribution will not be considered in the rendering step.
In order to speed up the performance during analysis, the objects are surrounded by bounding boxes and then the boxes are arranged into a hierarchy. In the same way, bounding volume hierarchies are used during the rendering phase if the number of the objects belonging to a list is greater than ten (when just a few objects have to be considered the delay of visiting a list is equal to visiting a hierarchy). The algorithm proposed in [18] can be up to seven times faster than a ray tracer with standard optimizations with a minimum loss of image quality mainly due to the possibility that small details are lost in the analysis phase. Moreover, it has been proven that the time spent in the analysis phase is negligible compared to the rendering time for high-resolution image computations.
GOALS OF THE DESIGN OF A PARALLEL ALGORITHM
During the design of parallel algorithms for ray tracing, as for other tasks, three main issues have to be considered:
(i) Load distribution among processors. If a sequential version of a program needs a computation time T , a parallel implementation of the same algorithm should require, as an 'ideal' reference bound, a computation time decreasing linearly with the number of processors used. Therefore, if N processors are employed, a computation time of T /N should be aimed for. In real cases, T /N represents an ideal lower bound, since the communication delays and task synchronization among processors have also to be taken into account resulting in a delay. (ii) Task synchronization. The problem of task synchronization depends on the hardware features; for instance, on shared memory machines a programmer has to manage the accesses of the processors to the global memory, in order to avoid deadlock and conflicts.
On the other hand, a programmer has to implement a message passing protocol in order to synchronize processor tasks in distributed memory machines. (iii) System flexibility. The communication overhead grows when the number of processors is increased; on the other hand, we attempt to design a flexible system where the number of processors can be increased without noticeably affecting either the communication delays or the topology of interconnections.
An ideal solution for one issue might counteract the others; for instance, the communication overhead could be strongly limited by allowing each processor to communicate with all other processors without any constraint. On the other hand, this solution could be critical for task synchronization and system flexibility since a loss of data can occur if a processor receives a message when its input buffer is full. Moreover, constrained organization could solve the synchronization and flexibility problems, but it might not provide an efficient solution for the load distribution and the communication overhead.
THE ALGORITHM
Interconnection topology
When difficulties with synchronization and management of the processor communication buffers are not present, a master processor can directly communicate with the other slave processors (direct master/slave). Otherwise, for the sake of symmetry, the interconnection topology is organized by a complete binary tree. If this is not possible, either a perfect balanced tree topology or a height balanced binary tree can be employed. In the case of hierarchical organization, the root processor is the master, while the other processors have a double function since they both have to compute the load assigned by the master and have to propagate the messages up and down. An example with 15 processors is shown in Figure 4 . When a processor is free, it requests the master for a cell to be processed. In this way, the load can be distributed among processors in the most uniform way possible. We adopt a hybrid approach, since a subdivision of the image space is performed by splitting it into disjoint cells and a division of the object space is obtained by considering just a small sub-set of the whole database for each cell. If the amount of memory were a critical factor, each processor might load into its memory only the small set of lists needed to compute the cell. In our implementation each processor, which has 4 Mbytes of local memory, keeps in memory the whole database. In order to solve synchronization and flexibility problems, each slave can propagate a partial result, by only sending packets either to its father or to its children. A packet is recursively propagated up in the tree until it is received by the master or down until it is received by each leaf processor. When a processor receives a packet, it first reads the message and empties its input buffer and then it sends an acknowledgement message allowing the sender another transmission. In this way, the buffer of a processor can never be full, thus avoiding loss of data. Of course, the buffer of a processor has to be sufficiently large to store one message coming from each of its children processors.
We split our description into three steps where we explain the analysis phase, the rendering phase and the packet formats.
Analysis phase
Pseudo C-code for master and slave/communicator processes are shown in Figures 5 and 6 . Each processor receives a cell to be analysed (Read Cell) by the master (Send To Slave Cell); the master suspends itself, by waiting for an acknowledgement from a slave (Read Ack). Each slave analyses the assigned cell and builds the lists of visible, neighbouring and occluding objects (Build Lists). When a cell has been analysed an acknowledgement is sent to In the second step the lists have to be joined; the master performs this task by receiving partial lists from the other processors and by linking them together. When a processor has to send its lists to the master, it splits them into packets (Build Packet) and propagates them up to its father (Send Packet To Father). Observe that a processor can send a packet only when it receives an acknowledgement from its father. If a processor is at an intermediate level, it has to check if it has in its input buffer a packet received from a child (Test From Buffer). If the buffer is not empty the packet has to be propagated to the father (i.e. to an upper level of the hierarchy). This second step ends when each processor has completely sent its lists to its father.
A processor denotes the end of its lists with a special packet (end list); each processor has to receive a number of end lists depending on its level in the hierarchy. instance, if we consider a hierarchy with 15 processors, processors 2 and 3 receive 12 end list packets, while processors 4, 5, 6 and 7 receive 2 end list packets. Each processor checks (in polling mode) its input buffer until the right number of end list packets is received. A packet of a type different from an end list is sent to the father. The master receives the packets (Receive Packet) and builds the final lists; it has to read as many end list's as the slave number. The final lists have to be sent from the master to the other processors in the last step of the analysis phase. The master builds packets and sends them to its children (Send Packet To Sons). A new packet can only be transmitted when an acknowledgement has been received. When a processor receives a packet, it updates its lists, checks for the acknowledgements from its children and sends to them the packet just received from its father. Finally, it sends an acknowledgement to its father. If a processor receives a visible object packet, in order to reduce the probability of partial rendering of the objects, it assigns the element listed into the packet also to the neighbouring cells (Flood Vis Obj). 
Rendering phase
Pseudo C-code for master and slave/communicator processes is given in Figures 7 and 8 . Each processor receives from the master a cell to be processed (Read Cell) and it renders all pixels of the cell and builds a packet to be sent to its father (Build Packet Cell). A processor can propagate a computed cell up in the tree after it has received an acknowledgement from its father. Then, a new cell is requested from the master.
If a processor is at an intermediate level of the hierarchy, it has to check its input buffer for a packet coming from one of its children; if a packet is detected, it has to be propagated to the father. The master processor can receive (Test From Buffer) either an acknowledgement or a computed cell. In the first case, a new cell is sent to the processor which has transmitted the acknowledgement while in the second case the partial result is stored in the memory to form the final image (Store Cell Into Memory).
When the master has received all the cells, it sends a message of end cell to all the other processors (Send All Slave). On the other hand, each processor checks (in polling mode) its input buffer until an end cell packet is received. If a slave is an intermediate processor, it tests its buffer for packets coming from its children to propagate them up in the hierarchy. 
Packet formats
The packet formats are shown in Figure 9 .
(i) The format of a visible object packet concerning a cell consists of 11 bytes. The first two bytes denote the number of the cell (x and y co-ordinates), while the third byte indicates the number of elements hit in the analysis phase by tracing the two sampling rays through the corner and the central pixels. The field neighbour can be 0, 1 or 2. If it is 0 the next fields should not be considered. The following two fields of four bytes represent the numbers of the objects. We consider scenes with at most 2 32 elements. If scenes have more objects the fields could be extended to five or more bytes. (ii) We use 45 bytes to build a neighbouring object packet. The first field (four bytes) denotes the object considered, while the following byte (neighbour field) indicates the number of neighbours of the object to be considered for each packet. The next fields are of four bytes and indicate the number of each neighbour of the object. The number of these fields to be considered depends on the neighbour field value. (iii) The format of the occluding object packet is more complex, since the first two fields denote the object- light pair (four bytes for the object and one byte for the light). We assume that at most 256 lights can be placed into a scene. The field neighbour (one byte) denotes the number of occluding objects listed in the following fields of the packet (at most ten occluding objects can be stored in a packet), while the field flag indicates if the light has been detected visible or not during the analysis phase. (iv) The end list, end cell and the acknowledgement packets are just one byte long. (v) The rendered cell packet uses the first two bytes to indicate the co-ordinates of a cell and the following bytes to store the R-G-B components of each pixel of the cell itself. The size of this packet depends on the image resolution; the cell step can be either automatically computed according to scene resolution (our program uses steps of 5 pixels for (320 × 200) images, 10 pixels for (640 × 480), 12 pixels for (800 × 600) and so on) or set by the user. This type of packet is the largest one used, therefore, the input buffer of a processor has to be at least twice as large as this packet (i.e. for our examples about 900 bytes), since a processor at an intermediate level can simultaneously receive two computed cells from both of its children.
If any processor has to send a packet greater than the preset format, it splits the data into two or more packets. For instance, in the event of 15 neighbouring objects, a processor builds two packets: in the first one the neighbour field is set to 10 and all the next fields have to be considered and in the second one the neighbour field is set to five and only the following five fields have to be taken into account from the processor receiving the message. With our strategy we get a highly flexible structure, since the packets can be received in any order both from master and slave/communicator processors.
RESULTS
Evaluation
Premise
The proposed algorithm can be used, with only minor modifications, with a wide range of different architectures of Master/Slave processors, i.e. based on tree topologies. The two ends of a tree topology are the binary tree and the tree with one master and (N − 1) 'direct' slaves (in the following denoted as 'direct slaves' processor) and both have advantages and drawbacks.
It can be easily observed that the binary tree has the largest number of levels, while the 'direct slaves' has the smallest. Hence, the worst case communication delay between one slave and the master is larger for the complete binary tree. This is, in general, also true in the case of routing due to the lack of direct physical connections between the processors.
On the other hand, message-passing parallel machines have limited-size buffers for messages and may have very simple message transfer protocols in order to achieve high transfer speeds. In particular, some protocols do not allow a check on the buffer availability at the destination processor to be performed, before sending a message. This implies that, if the number of 'child processors' which could send a message to their 'father' is too large, the buffer of the 'father' may not be able to store all the messages. Therefore, some messages could be lost, resulting in errors. The binary tree topology provides the simplest organization with the smallest number of 'children' and hence, given the maximum message length, provides the minimum required buffer size. Conversely, the 'direct slaves' architecture has the largest buffer size requirements.
In the following, we will evaluate the delay of a binary tree topology, which corresponds to the worst case delay but with smallest buffer size requirements. We will show that, under some reasonable assumptions, the performances A FLEXIBLE ALGORITHM FOR MULTIPROCESSOR RAY TRACING 511 of the binary tree interconnection topologies are very close to the theoretical maximum performances of a parallel architecture using (N − 1) processors. This implies that, if some conditions apply (as we have also verified for our practical examples), then the binary tree topology represents a solution providing interesting performances (i.e. close to the 'optimal') at the smallest hardware cost.
Clearly, either in the absence of restrictions on buffer sizes, or in the presence of high communication delays, other topologies between the 'direct slaves' and binary tree units can be chosen and our algorithm can be applied to any of the topologies used.
Theoretical evaluation
For the sake of simplicity and with no loss of generality, let us assume that all the processors are organized in a complete binary tree topology with h = log 2 (N +1) levels, where N is the number of processors employed, the root being at level 1 and the slave 'leaves' at level h. Let us denote the whole time to render an image employing N processors as T pt . It is the result of two times: T ps , related to the analysis phase, and T pr due to the rendering step. Moreover, for the sequential approach, let us denote with T ss and T sr the computation times for analysis and rendering phases, respectively. T st will denote the whole sequential time (T ss + T sr ). Finally, we use D N to denote communication delay of a configuration with N processors. An evaluation of the algorithm described in Section 4 (and more specifically in Sections 4.2 and 4.3) suggests the following behaviours for T ps , T pr and T pt .
The analysis time (T ps ) is the combination of two times: T c related to the computation of the lists and T t due to the packet transmissions from slaves to master and vice versa (T ps = T c +T t ). Only the (N −1) slave processors contribute to the computations of the lists, while the master organizes and distributes the work. For the sake of symmetry, the highest performances (for the viewpoint of T c ) are obtained when the load is uniformly distributed among all the (N −1) slave processors. This implies that T c decreases linearly with 1/(N − 1). Therefore, T c = T ss /(N − 1). An exact evaluation of T t is very difficult, since communication delays depend both on the depth of the hierarchy and the scene characteristics. Let us denote with K (χ, N) a function which takes into account type and 'complexity' of the scene (through the 'parameter' χ) and the number of processors employed.
A lower bound for (1) can be found by assuming that the communication delays are negligible and hence T ps is only related to T c . The evaluation of an upper bound is slightly more complex: the transmission delay grows by increasing the number of levels of the tree (i.e. the number of processors), since a packet has to traverse all levels of the hierarchy to go from a leaf processor to the master and vice versa. Therefore, the transmission delay increases linearly with h − 1 = log 2 (N + 1) − 1. On the other hand, under the assumption of a parallel co-operation of all the processors all the slaves have to manage (on average) the same number of packets. Therefore, as the number of processors employed increases, each processor has to manage a number of packets which decreases by 1/(N − 1). Moreover, at level 2, where all the packets have to pass through, the largest transmission delay is introduced, while a processor at the leaf level introduces the shortest communication delay, since it has to transmit only its own packets. If we denote with D 3 the transmission delay at level 2 obtained using the configuration with three processors (master and two slaves) and if we assume that all levels introduce the delay D 3 , we get:
Concerning the rendering time T pr , let us assume that, on average, all the processors of each level 'handle' the same number of cells and that the time for rendering one cell is constant. In order to simplify the notation and the following derivations of the results, let us denote with M i the number of cells which are rendered by one processor at level (h +1−i ) (e.g. at the leaf level, each processor handles M 1 cells).
In order to determine a lower bound of the rendering time, it is reasonable to assume that all the processors of the complete tree work in parallel. This implies that they all do the same amount of work, i.e. rendering computations and transmission of the results from the lower to the upper level. In other words, since all the processors work in parallel, the working time T pr has a constant value T , independent of the level, which is the result of two contributions:
• t (C,i) , which is the computation time of one processor, at level (h + 1 − i ), when rendering 'its' M i cells. This includes the delays for receiving the data from the master, and for sending back the rendering results to the upper processor.
• t (T,i) , which is the transmission delay of the data from the lower to the upper level (i.e. the transmission delay of the data not related to the level (h + 1 − i ), but to the lower levels).
Let us normalize all the times to the time required to render one cell (which is assumed to be constant), and let k be the normalized global transmission delay of the data related to one cell from the lower to the upper level. In order to have an efficient parallel architecture, it is necessary that k < 1. Otherwise, the upper level processors will not do any computation, but only transmissions and therefore the inclusion of these processors would lead to a very small performance increase of the whole architecture. On the other hand, when k 1, the impact of the transmissions implies only a small loss of performance, and the task of rendering is almost equally distributed among all the processors of the tree. Actually, the lower level processors will do more 'rendering' than the upper level processors, but also they will spend less time on transmission. 512 A. SANNA, P. MONTUSCHI AND M. ROSSI At level h, a processor would have to transmit only its M 1 cells, whose contributions are, however, already included in the definition of t (C,1) previously given. Therefore, a processor at level h has a working time
At level (h − 1), each processor should be able to transmit the data related to the 2 * M 1 cells of its children, and to 'handle' M 2 cells. Therefore, at level (h − 1) the working time is
At level (h + 1 − i ), the working time is
From the (h − 1) equations of all the 'slave' levels (4), we can derive the number of cells handled from each processor at each level. In particular, we obtain:
and so on. Since we have assumed that k 1, we approximate our computations by neglecting the contribution of the terms where k has a degree higher than one. It can be shown that, in this case,
At this point, we know that the summation of all the cells 'handled' by all the processors must yield the total number M of cells to be rendered. Since the tree is perfectly balanced there are 2 h−1 processors at level h, each one 'handling' M 1 cells, 2 h−2 processors at level (h − 1) each 'handling' M 2 cells and so on. Therefore, from (5) we get
Therefore, since h = log 2 (N + 1), and since the working time T is (from (3)) T = M 1 , we get that the rendering time obtained when all the processors work in parallel, and with a small transmission delay k, is
where M is a constant value. If k 1, meaning that transmission delay of the data related to a cell is negligible, the rendering time should decrease by 1/(N − 1). Moreover, this curve represents the best lower bound for rendering delays achievable with our hierarchical organization.
Remarks
With no prior knowledge of the value of the terms T ss , D 3 , M and k, it is difficult to observe the mutual relationships between T ps and T pr . However, some conclusions can be drawn:
• The best lower bounds for T ps and T pr both decrease with 1/(N − 1). Therefore, independently of the values of T ss and M, the best lower bound of T pt = T ps + T pr should decrease (as is reasonable to expect) with 1/(N − 1).
• In our algorithm the pre-processing due to the analysis phase allows us to speed up the rendering, which has already been shown in the literature to be the most intensive computation task. Therefore, it is also reasonable to expect that for our algorithm, the rendering phase still requires a relatively large computation time, especially with respect to the analysis phase. This implies the behaviour of the whole computation time T pt is expected to be very similar to the function T pr .
Hardware architecture
We have implemented our algorithm on an nCUBE 2 supercomputer which is a general purpose parallel MIMD (multiple instruction stream multiple data stream) architecture. An nCUBE uses custom processors that work at 20 MHz; each processor can have up to 64 Mbytes of RAM. For integer operations, the peak rate is around 7.5 MIPS per processor, while for floating point operations this is around 2.5 MFLOPS [29] . The performance of these processors is comparable with Intel 80386 processors. The nCUBE is a message-passing hypercube architecture and its memory is distributed over processors and there is no global memory. All the communication takes place through the transmission of messages between the processors. The rate of a communication channel is around 2.2 Mbytes s −1 and there is a message latency of about 140 µs and a read overhead for a message of about 55 µs. The manufacturer states that even though direct hardware channels exist only between nearest neighbours in the network, the speed of the routing hardware makes communication between distant processors almost as fast as that between nearest neighbours. An nCUBE system can have up to 1024 processors; our configuration has 16 processors, each with four Mbytes of RAM. A user can select a number of processors equal to a power of two (1, 2, 4, 8, etc.) and a copy of a task is concurrently launched for each processor. A user can specify the parts of a program to be carried out by a processor by using its identification number.
Practical examples
We have considered three examples based on two models proposed by Haines [30] . the spheres; three point light sources illuminate the scene (see Figure 10 ). (2) Example 2. 820 highly reflective spheres plus a background polygon are used to model a superflake; three point light sources illuminate the scene (see Figure 12 ). (3) Example 3.
1024 polygons model a tetrahedral pyramid; one point light source illuminates the scene (see Figure 14) .
All pictures have been rendered in true colour with a resolution of (800 × 600) pixels and for each primary ray the reflections/refractions of third level have been considered; all images have been rendered using the Phong light model. Both superflake and tetrahedral pyramids are commonly used as benchmark scenes for ray tracing algorithms. In particular, Example 2 allows us to evaluate the performance of the algorithm when the objects are not evenly placed but they are clustered in a restricted spatial area. We rendered the images with 1 (sequential approach), 3, 7 and 15 processors which correspond to complete binary trees over 1, 2, 3 and 4 levels, respectively. The computation times are shown in Table 1 . 
Remarks
Speed-up graphs for the three examples are shown in Figures 11, 13 and 15. Three curves are presented for each graph. The curve 'ideal' represents the theoretic speed-up while the curves 'all proc' and 'only slave' plot the speedup factors obtained by dividing the computation time by all N processors and (N − 1) processors, respectively. It is worth noting that the master processor is not involved in the computation (neither in the direct master/slave configuration nor in the binary tree master/slave topology) but it simply distributes the workload to the slaves; because of this, the master processor is absolutely not overloaded. This can be seen in the previously mentioned figures: the curves which represent the speed-up factor obtained by only considering the slaves are closer to the ideal ones than the curves obtained by considering all processors. When the number of processors grows the impact of the master on the speedup factor decreases and the two broken curves converge. With our scheme the master processor does not require a large bandwidth for communication nor a high computation power since its job is limited to assign work to the slave processors and to link the partial results. Moreover, each processor, independently of its level within the hierarchy, can only communicate directly with its father and with to the slaves. In order to do this, a large amount of communication among processors is, in general, necessary and this can strongly affect the performance of the algorithm, especially when the analysis and rendering delays become comparable.
We have replicated the entire database on every processor in order to both reduce the communication overhead and enhance the performance and the reduction in memory prices justifies this choice. In the event that the whole database was not loadable in the private memory of each processor we could assign to a processor only the objects related to each cell. In this way, the entire database should be loaded only in the memory of the master processor. Of course, this strategy would reduce the performance by introducing communication overhead due to the data communication.
CONCLUSIONS
In this paper we have presented a flexible multiprocessor ray tracer. We have proposed an architecture where the processors are organized in two different ways according to the task to be performed. The processors are arranged in a 'direct' master-slave structure when the master has to assign the workload to the slave processors; otherwise, the processors are arranged in a hierarchical structure. This architecture allows us to obtain both a uniform load distribution and a decreasing rule of rendering times close to the ideal one. We have designed our software by considering three main issues concerning parallel implementations, that is, load balancing, communication delays and system flexibility. Moreover, we have addressed the problem of synchronization for message passing-based machines, where data loss occurs when the input buffer of a processor is full and a new message is coming in.
Our strategy splits the ray tracing process into two steps, one of analysis and one of rendering. We implemented our strategy on an nCUBE 2 supercomputer with up to 15 processors and considered three examples. The examples confirm the theoretical expectation, that is, the computation times decrease linearly with the number of slave processors employed. The third example does not satisfy this law when 15 processors are involved in the computation because the workload is not sufficient to keep all the slave processors engaged.
Future work will consider splitting the database among the processors by taking into account the fact that only the data structures related to a cell are necessary for a processor to render the cell itself.
