The rendering of iso-surfaces in a scalar 3D dataset can be performed with a new algorithm, called iso surface volume rendering. This algorithm does not introduce sampling artifacts or artifacts due to triangularization. The risk to skip very small details by insufficient re-sampling is also eliminated. Another advantage is its speed compared to conventional volume rendering. So far we achieved speeds in the order of ten frames per second on advanced CPU's. The multiprocessor implementation of this new algorithm uses a division of the voxel data into multiple cubes. These cubes are the basis for distributing the workload onto several processors. A scheduler process is running to perform the distribution of the workload. During the distribution of the workload the scheduler also eliminates the need to render invisible parts of the dataset. This reduces the part of the dataset which must be processed to one third of the original dataset for typical applications. Another major advantage of the scheduling algorithm is that the communication overhead is reduced by a factor of ten to twenty, which allows for the efficient use of many processors.
INTRODUCTION
As the capabilities of medical scanning equipment, like CT and MRI scanners, increase it is becoming ever more difficult to interpret the results by just looking at the gray-scale slices. For this reason it is desirable to be able to view the data in a three or even four dimensional manner. Unfortunately the most commonly used methods, like those given by Lorensen 2 and Levoy, 1 of three dimensional visualization are too complex for normal computers. This results in very low speed rendering or extremely expensive computers. The problem with low speed rendering is that in a single image only a subset of the original data is presented, in many cases it is also very difficult to interpret a single three dimensional image correctly. For these reasons we want to be able to render at a high speed to give the user a way to interact with the dataset, such as rotating the object or even moving inside the object.
CHOOSING THE HARDWARE
The render algorithm chosen is not yet implemented in any hardware accelerator. Most 3D hardware is made for surface rendering and is almost completely useless for volume visualization. There are some hardware accelerators for volume rendering available like the Volume Pro 500. 3 However, these hardware accelerators tend to be very inflexible in their use, and even worse they use a rendering algorithm that gives an inferior image quality when compared to the Iso Surface Volume Rendering algorithm.
We could have decided to make dedicated hardware for the Iso Volume Rendering algorithm as well. Unfortunately the algorithm is, due to its computational efficiency, very complex to implement in hardware. Another problem with creating dedicated hardware is that the development cycle tends to be much longer, by the time the design is finished the hardware is usually obsolete. Also minor modification in the algorithm may lead to large modifications in the hardware design. The biggest problem is however the lack of flexibility. This inflexibility makes it impossible to use the hardware to implement additional features, unless of course the whole design is over engineered and has become a general purpose computer.
For these reasons the decision has been made to run the Iso Surface Volume Rendering algorithm on general purpose microprocessors.
MULTI PROCESSOR COMPUTER ARCHITECTURE
When using general purpose micro processors we can choose many different designs from many different manufacturers. Another design decision to make is how many processors are going to be used. And in the case of more than one processor, we also have to choose a method to connect the processors with each other. As can be seen in the title of this paper we have chosen to use several processors. The reason if of course that, if done correctly, two fast processors will be faster than one fast processor.
There are many ways to build a multi processor computer. It can be done as simple as connecting multiple single processor computers with Ethernet, to creating a custom build multiprocessor machine with a high bandwidth interconnect.
Using an Ethernet connection to connect multiple computers is the cheapest and easiest solution. Unfortunately the bandwidth is too low and the communication overhead is large to be useful in a fine grain communication application like the multi processor volume render engine. A more optimal solution would be to provide a cheap standard PC that can be upgraded when needed.
Symmetrical Multi Processor boards
The most commonly used technique to create a multi processor computer with very high speed and low latency communication is the Symmetrical Multi Processor (SMP) method. In these system several processors share a single memory bus as is shown in figure 1 . One reason these systems tend to work very well is that most software only use a relatively small amount of data. This allows for efficient use of the cache memories. In the case of volume rendering however this will not work. For every frame a lot of data needs to be fetched from main memory. As soon as the amount of data that a processor needs to fetch for a single frame is larger than the cache size of that processor, the cache will need to be refilled on the next frame. As cache memory is expensive the caches are unlikely to be larger than four Mega bytes, which will not be enough for any useful volume rendering application. The result is that all data needs to be fetched from main memory for each frame. As the main memory bus is shared between several processors this will lead to a severe bottleneck. An analysis of how such a system would perform showed that a SMP system with more than four processors would be inefficient for Iso Volume Rendering algorithm.
As long as no more than four processors are used they are an attractive option however, as there a two way SMP systems available in almost every computer shop. It is also possible to buy a fourfold SMP system, these systems tend to be much more expensive however (more than four times as expensive as a twofold SMP system). For the development of the Multi Processor Iso Surface Volume Rendering algorithm a twofold SMP system was used, it contained two Intel Pentium 3 processors. 
The StrongARM card
To overcome the main memory bandwidth problem it was decided to create a computer which had much more main memory bandwidth, which lead to the design of a computer where each processor had its own main memory. This computer was based on a PC with additional processors on PCI boards. These PCI boards contained eight processors of the SA110 type. The SA110 is a member of the StrongARM processor family and has exceptionally low power consumption for its performance. This combined with an already available PCI interface chip allowed for the easy creation of a board as is shown in figure When the project was nearing its completion major performance problems became apparent. The problems were twofold, the first being the system interface of the SA110. This system interface lacked a cache coherency protocol which made communication inefficient. Due to the large datasets involved when volume rendering, no single processor in the system could have the entire dataset in its main memory. For this reason the processors had to be able to down-load parts of the dataset to their main memory from other processors. There was also a second and even more problematic shared memory area called the ready buffer, which will be explained later. The lack of a cache coherency protocol on the bus interface made it necessary to flush the cache of the SA110 whenever the shared memory segments were to be read from another processor. Unfortunately the SA110 was not very efficient in doing this and the overhead was enormous.
The lack of a floating point unit was a second problem. This was already known when the project was started, but a new generation of the StrongARM family was already announced. This new generation was supposed to have a vector floating point unit which would have been perfect for the application. Unfortunately we were unable to get sufficient information on the availability of these devices, and these StrongARMs with a vector floating point unit are to the authors knowledge still not available.
This led us to abandon the StrongARM based project.
The G4 boards
After it became apparent that the SA110 was unsuitable for our application, we started to look for another processor. When building a multi processor computer, the power consumption is of a major concern. Our next choice was the SH family of Hitachi. Unfortunately it suffered from the same problem as the StrongARM, the problem being an announced next generation which would be almost perfect but without any indication of when it will become available. Having learned from the StrongARM project we decided it was not a very good choice. As we searched on we looked at the Motorola G4 processor.
The G4 is not an extremely low power processor, it consumes 5 Watt compared to 450 mW for the SA110. It is still possible to build a relatively low power computer with it. The G4 is faster than the SA110, so fewer processors can be used to achieve the same performance. If one only looks at the power consumption of the processors, the SA110 would still outperform the G4 however. A more important factor is the power consumption of the complete system. When these figures are compared to each other the G4 wins. This is mainly due to the power consumption of the bridge and the memory. A complete cluster of an SA110 with an 21285 and 64MB of SDRAM consumes ≈3W. If we create a cluster of four G4 processors we would consume 20W in the processors and ≈10W in the memory and bridge. If we then compare the 30W of four G4 processors with ten SA110 processors, the G4 processors win with a very large margin, as our benchmark have shown the G4 to be more than eight times as fast as the SA110. This comparison is not even including the tremendous performance win that comes from the much more efficient bus interface of the G4.
The bus-interface of the G4
Unlike the SA110, the bus-interface of the G4 is capable of providing cache coherency. This makes it possible to create an SMP system with G4 processors. Unfortunately we already came to the conclusion that an SMP system with more than 4 processors would be inefficient. Obviously we do not want to limit our system to only four processors. The G4 bus interface, is flexible enough to create larger efficient computers. We can create four way SMP clusters and put multiple clusters into a large shared cache coherent memory system as is shown in figure 3 . 
Altivec
The G4 processor has a vector instruction set called Altivec. With this instruction set it is possible to speed up the render algorithm. This work has already been done on a single processor machine by M.K. Bosma at the university of Twente and can be used in the multi processor G4 system. A similar instruction set is available on the Pentium 3, where it is called SSE. During the development of the multi processor rendering algorithm on a dual Pentium 3 a speedup of a factor four was achieved by using SSE. The availability of Altivec also lead to the choice of the G4.
THE RENDER ALGORITHM
In this section a more detailed overview is given of the Iso Surface Volume Rendering(ISVR) algorithm. This knowledge is necessary to determine a good method of making a multi processor version of it.
There are mainly two known approaches for volume rendering. These two are back to front and front to back rendering. The ISVR algorithm used the front to back method. Another feature that can be exploited for multi processor rendering is that each ray/pixel is computed only once. For this purpose a ready buffer is used. When a ray is about to be computed the ready buffer is checked to see if the computations are really necessary, if the ray was already terminated the computations are skipped. This speeds up the rendering process by a large amount.
Volume rendering is by its very nature very data intensive. For this reason the ISVR algorithm uses a binary shell. This binary shell is also checked to avoid unnecessary computations. When the iso surface is reconstructed an interpolator is used to find the exact location of the surface. If based on the values in the dataset we can determine that there is no iso surface between eight neighboring voxels, we can skip that particular space. These small spaces between eight neighboring voxels will be called a cell for the rest of this paper. Generating the binary shell can be done very fast compared to the rest of the computations and therefore using the binary shell is very efficient. Another advantage of the binary shell is that it only needs to be generated when the iso value is changed. This allows the algorithm to skip many cells whose voxels will no longer be fetched. This reduces the amount of data that must be fetched from the main memory, which also speeds up the rendering process.
MULTI PROCESSOR RENDERING
Volume rendering with multiple processors on a general purpose computer is not only a matter of making the software run on multiple processors simultaneously. The properties of the hardware should be taken into account as well. When considering the hardware we assume an architecture as is given for the StrongARM of the G4 boards. This is very representative for most larger computers. The rendering software is also expected to run efficiently on an SMP system. Since the SMP systems have a much simpler architecture they are far more easy to program efficiently and the software developed for a large machine is also close to optimal for an SMP machine.
The software architecture
The complete software system will exist of several processes. These processes will communicate with standard Inter Process Communication (IPC) methods. The first method of communication is through the use of network sockets. These sockets will enable the various processes to send each other packets of information. Another feature of sockets is that they can be multiplexed with the select 5 function call. This function will also put the process to sleep when nothing is to be done, which allows the computer to be used for other tasks.
When large amounts of data need to be shared, shared memory segments will be used. These large amounts of data are the dataset, the ready buffers and the output buffer. The whole software structure is shown in figure 4 , where k is the number of clusters and n is the number of processors in each cluster. In this software architecture the render jobs are responsible for the actual rendering. The scheduler will assign cubes which must be rendered to each of the render jobs. The data manager process will manage the data-segment.
RenderJob

Distributing the workload
There are two basic ways to distribute the rendering workload on several processors. It can be done either in the input space or the output space, where the input space is the voxel dataset and the output space is the image. On a machine like the G4 board it is much more attractive to take the input space as a basis for the workload distribution, this will be explained discussing how to store the dataset.
To take the input space as a basis for workload distribution makes it necessary to make a division of the input space into several smaller pieces. These pieces will be cubical subsets of the original dataset and will be called cubes. These cubes will be the smallest unit that can be rendered on a single processor.
When scheduling the cubes to be rendered on the render processors we can take several properties of the ISVR algorithm into account. These properties are:
• Back to front rendering.
• Each ray is terminated only once.
• A binary shell gives the parts of the dataset which need to be rendered.
• A ready buffer indicates which parts of the image are already finished.
To ensure back to front rendering and still use multiple processors we will schedule only cubes that are completely visible. The sequence in parallel view is shown in figure 5 . As can be seen we can render only one cube in the first step. This cube will thus be rendered on a single processor. After this cube is done we get three completely visible cubes which will be rendered on three different processors. The amount of available cubes will grow rapidly and will in most cases soon exceed the number of processors in the system. The amount of cubes available for rendering is given in equation 1. Where k is the rendering step and s is the number of cubes along every axis, which is a simplification as the s may be different for each dimension. Obviously a slight modification is needed when rendering in perspective, but the principle stays the same. 
Storing the dataset
When storing the dataset on a computer which has an architecture as is shown in figure 3 we must try to keep data access inside a SMP cluster. If possible we should store the entire dataset in the main memory of each cluster to minimize communication over the secondary bus. Unfortunately we will need much more RAM than is absolutely necessary if we do that. The currently available dataset can already be over one hundred mega bytes and we expect the datasets of the future to be much larger. Therefore we should not store the entire dataset in every cluster's main memory but we should try to give each cluster a region in which it should work. The scheduler should take the available data in a cluster into account when selecting a processor for rendering a specific cube. If for some, later to be explained, reason it is more attractive to render a cube on a cluster which does not have the data available, the scheduler should instruct the render processor to down-load that cube into the cluster's main memory from another cluster.
This way most accesses to the dataset can be done locally and the memory bandwidth available for dataset access increases with each cluster added to the system. It is very important to minimize communication over the secondary bus, as it is much slower than accessing local memory. It is also the only non scalable part of the computer.
The data-segment of each cluster is managed by the data manager of that cluster. When the scheduler instructs the data-manager to down-load a cube it will copy the cube from another cluster's data-segment to its own. If necessary it will also delete cubes from its data-segment to make room for the new cube when instructed to do so. The scheduler has the responsibility to make sure every cube is present in at least one of the data segments, otherwise it may have to be fetched from hard-disk which is very slow.
Storing the ready buffer
Minimizing the use of the secondary bus for accessing the ready buffer presents another challenge to the scheduler. Each cluster will have its own copy of the ready buffer in local memory. The scheduler should instruct the render-jobs to fetch the parts of the ready buffer, that are necessary to render a specific cube, from the local memory of other clusters. This way only local memory access is used when repeatedly checking the ready buffer. Communication can be lowered further by making sure that a cube is scheduled on a cluster which also rendered the cubes that obscured the cube that is to be rendered. If a cube is rendered on a cluster which also rendered the obscuring cubes the ready buffer is already up to date and no communication over the secondary bus is necessary. It also makes sense to prefer rendering a cube on a processor which rendered the three obscuring cubes as it makes it more likely the relevant parts of the ready buffer are still in the processor's cache, which reduces communication on the cluster's main memory interface.
Skipping cubes
In the single processor version of the ISVR algorithm we used a ready buffer and a binary shell to reduce the amount of work. This can also be taken to a coarse grain for the scheduler. Using the binary shell on a cube instead of a cell basis is relatively easy. If no bit is set in the entire binary shell of a cube, then the cube in its enterity will not be able to terminate any rays.
Using the ready buffer on a coarse grain is not so obvious. If we let the scheduler check the ready buffer each time it is about to render a cube, we will use an enormous amount of secondary bus band width. A better solution is to let the render-job check the ready buffer and make it report the unterminated ray-counts of its surfaces to the scheduler.
Unfortunately this will force the scheduler to schedule cubes which are empty, since it needs the ray-counts. To remove this unnecessary communication the ray propagation algorithm was developed.
RAY PROPAGATION
Ray propagation allows us to skip entire cubes if there is no pixel to be processed for this cube even though the cube may contain an iso surface. In figure 6 a single slice of the binary shell of the MRbrain dataset is shown. In this figure the gray space indicates cells which may contain the iso surface. The dataset is split up in multiple cubes, the borders of these cubes are drawn with th dotted lines. When the dataset is rendered from a certain viewing point, which is indicated by the arrows pointing to the upper left corner of the image, only a small portion of the cubes will contribute to the image, these cubes are given a black solid outline. As can be seen only a small portion of the cubes which might contribute to the image will actually do so. In most typical application it will be around 30% of the binary shell. For special application like virtual endoscopy it may even be a much smaller percentage.
We can check the ready buffer to see if a cube has to be processed. A check of the ready buffer for each cube before it is processed on a multiprocessor rendering engine implies however a substantial overhead in terms of communication. As the ready buffer local to the processor has to be updated for every cube, even if it had no rays entering it.
A far more attractive approach is to count the rays after the cube is rendered. This would then give three ray counts on the three backside surfaces of the cube. The three backside surfaces of a cube are front-side surfaces of three other cubes. When a cube is unobscured by another cube it also receives the ray count from that cube. When all three obscuring cubes are rendered, the cube itself is ready to be rendered. Since all un-obscured cubes are told how many rays are entering this cube, it can be determined if there are in fact no rays entering this cube. If there are no rays entering this cube, this cube can be skipped. Now we are left with the problem, what to do with the cubes we can skip. Of course we could determine their ray count as well by checking the ready buffer. This would, again, mean a lot of overhead however. It should not be difficult to see that a least half the object could be skipped since the backside of the object would not be visible. The cubes which form this backside should not take much time to process. In most cases there also are a lot of cubes which are entirely empty. If we were not to count any rays in the ready buffer, these cubes would not cost much processing time at all. We therefore use ray-propagation. This is based on the simple fact that there can be no rays on the backside surfaces if there are no rays on the front side surfaces.
First an explanation is given on how ray propagation works in parallel view. In perspective view the problem becomes more complex and the difference will be explained later.
Parallel view
Recording if any rays are coming into a cube is a straight forward process. Unfortunately it still leaves us with many cubes we have to check, as each cube with a non-zero incoming ray-count make three other cubes have a non-zero incoming ray-count. We can also be more precise than this by using a matrix. The matrix is called the propagation matrix and it gives the relative overlap between the front side surfaces and the backside surfaces. It can also be seen as a matrix which gives the probability that a ray exits on a surface, given that it enters on another surface.
The propagation matrix is derived from the visualization matrix M (given in equation 2). This visualization matrix M transforms a coordinate from the dataset space d to the screen space s. With the screen space being the coordinate of the pixel and the distance from the screen of the point.
If we have visualization matrix M, we can find the propagation matrix P. To do this we first find the front most point of a cell. This can be done by looking a the bottom row of M. We call this front most point c 0 , and this vector is 0 by definition. We work in a three dimensional system, which means we will have eight points on a single cell. We find c 1 by taking c 0 and change dimension 0. In a similar way we find c 2 and c 4 . All the other dimensions are linear combinations of these vectors. This automatically leads to c 7 being the back most point. This is shown in figure 7. 
if c1× c7 c1× c4 > 0 and c4× c7 c1× c4 > 0 we use equation 10 to 16.
in all other cases we use equation 17 to 23.
The propagation matrix can then always be found with equation 24.
We can now use P to estimate the ray count on the backside surfaces as given in equation 25. In this equation i is the incoming ray count and o is the outgoing ray count. The three components of o are then added to the incoming ray count of the three obscured cubes. These cubes may then be skipped in the absence of rays entering these cubes.
An important feature of the propagation matrix is that it will contain many zero components. There can be only one non-zero component on the diagonal for example. This will the algorithm to skip many more cubes than is possible without the propagation matrix.
Perspective view
In the preceding section we derived a matrix to be used for ray propagation. We saw that is was a 3 × 3 matrix with constant components. If we try to derive a similar matrix for the perspective viewing method we will find out that we get a 6 × 6 matrix with place dependent components. To compute this matrix for every cube is far too complex and will lead to a very slow implementation of the algorithm. The ray propagation in the perspective viewing method is therefore reduced to a binary 6 × 6 matrix which is place dependent. Finding the components of this matrix can be done with an if statement.
CONCLUSIONS
The multi processor version of the Iso Surface Volume Rendering algorithm has been developed and tested. The results on the dual Pentium 3 800 MHz test machine look very promising for the future implementation on larger machines. In figure 8 a rendering is shown of the CThead dataset. This dataset can be rendered with 14 frames per second on the test machine. Changing the iso value to be displayed hardly influences the rendering speed. In figure  9 an example of virtual endoscopy is given, the viewer is located inside the head and is looking to the outside via the canal where the spinal cord is supposed to be. Work on the new multiprocessor G4 machine is in progress and the algorithm has been tested on a single G4 machine. The G4 compares very well to the Pentium 3, one major disadvantage is its low clock speed of 400 Mhz. New versions of the G4 running at 733 MHz are already available however. Once the machine is implemented with 733 MHz G4's, a single cluster is expected to outperform the dual 800MHz Pentium 3 by a factor of 2 and a normal PC can contain at least four of these clusters.
The render engine also needs more work. Currently only a single iso surface can be shown at any time. With blending several iso surface should be shown in a single image. Cutting planes are already implemented, but these cutplanes can only cut along the axis of the dataset. It is our intention to implement these cut planes so that they can be placed in an arbitrary direction. Combining several dataset should be possible as well. This can be done by either showing the objects of different datasets in a single image or by color mapping a dataset on the surface generated from another dataset. 
