On the design of a real-time volume rendering engine by Smit, J. et al.
Pergamon 
Comput. & Graphics, Vol. 19, No. 2, pp. 297-300, 1995 
Copyright © 1995 Elsevier Science Ltd
Printed in Great Britain 
0097-8493/95 $9.50 + .00
Graphics Hardware 
ON THE DESIGN OF A REAL-TIME 
VOLUME RENDERING ENGINE 
J. SMIT, H. J. WESSELS, A. VAN DER HORST, and M. J. BENTUM
University of Twente, Department of Electrical Engineering, 
Laboratory for Network Theory and VLSI Signal Processing, P.O. Box 217, 7500 AE Enschede, 
The Netherlands, e-mail: jaap@nt.el.utwente.nl 
Abstract: An architecture for a Real-Time Volume Rendering Engine (RT-VRE) is given, capable of
computing 750 x 750 x 512 samples from a 3D dataset at a rate of 25 images per second. For this
purpose the RT-VRE uses 64 dedicated rendering chips, cooperating with 16 RISC processors. A plane
interpolator circuit and a composition circuit, both capable of operating at very high speeds, have
been designed for a 1.6 micron VLSI process. Both the interpolator and the composition circuit are
back from production. They have been tested, and both complied with our specifications.
1. INTRODUCTION
The visualization of high resolution images acquired from 3D
datasets in real-time is extremely important in everyday
applications. This is especially true for medical applications,
where large 3D datasets are acquired in a relatively short time
using scanners based on X-ray imaging (CT equipment), Magnetic
Resonance Imaging (MRI equipment), ultrasound imaging, etc.
The diagnostic value of the various scanners can be greatly
increased if the 3D data can be interpreted and viewed in
real-time. This assumes the classification of the dataset, as well
as its final visualization. It should be observed, however, that
the classification needs to be performed only once. The
visualization, on the contrary, should be performed at a rate of at
least 25 images per second, at resolutions as high as
750 x 750 x 512 points in display space. The real-time speed of 25
images per second is highly useful, as only a fraction of the
relevant data may be visible at an arbitrary initial setting of the
visualization parameters, due to the hidden surface removal in the
composition algorithm. The subsequent selection of optimal
observation parameters, such as tissue opacity, visualization
angle, light source position, etc., takes an intolerable amount of
time on current workstation implementations of 3D visualization
software, with single-frame interaction rates in the order of
minutes to seconds.
High speed, full resolution imaging gives the additional advantage
that measurement data which normally vanish in noise, like 1D
structures from small blood vessels, can be observed by slowly
rotating a 3D scene with optimal visualization parameters, as the
eye is very sensitive in the recognition of correlated "noisy"
paths in 3D scenes.
The real-time visualization engine described in this paper is based
on volume rendering, i.e., it computes light intensity values along
rays, using the composition formula:
\[
\begin{aligned}
\mathit{alpha} &:= \alpha_{in}(I, J, K) \cdot \bigl(1 - \alpha(I, J)\bigr) \\
\alpha(I, J) &:= \alpha(I, J) + \mathit{alpha} \\
C(I, J) &:= C(I, J) + C_{in}(I, J, K) \cdot \mathit{alpha}
\end{aligned}
\tag{1}
\]
With \alpha(I, J) the opacity along a ray, \alpha_{in}(I, J, K) the
sampled opacity at position (I, J, K), alpha an intermediate
opacity value, C(I, J) the color along a ray and C_{in}(I, J, K)
the sampled color at position (I, J, K).
Volume visualization has the advantage of being the most complete
visualization method [4, 6, 7]. The fact that there is no need to
extract a specific surface has the advantage that no true
interpretation of the 3D data is needed before actual
visualization, thereby avoiding the problems with the partial
volume effect. The reader is referred to [3] for a survey of the
various techniques for 3D visualization.
2. VOLUME VISUALIZATION HARDWARE
Many volume rendering implementations have been realized on general
purpose computers, or on general purpose computers combined with
computer graphics hardware. Levoy [4] discusses the performance of
the volume rendering algorithm. Most of the systems designed so far
degrade the image quality in order to obtain real-time performance.
For instance, the implementations discussed in [3] are not designed
for the volume rendering algorithm. These machines are not capable
of rendering semi-transparent surfaces and produce images of
inferior quality in real-time. The "Voxel Processor Prototype" [2],
for instance, is capable of rendering approximately 20 images per
second. However, this machine too does not perform volume rendering
with the composition formula (1). Instead, an alternative approach
is used in which no subsampling to true display coordinates takes
place. The image is generated just by addressing the voxels in a
back-to-front order, overwriting the hidden voxels, without a
composition step. This way of rendering is more like binary voxel
rendering, resulting in images with an arbitrary discreteness, in
which individual voxels become visible like "sugar cubes." A sub
real-time, true volume visualization running on the fast general
purpose graphics engine Pixel-Planes 5 is described in [5].
Fig. 1. The VRE Processing Element.
The hardware of the Pixel-Planes 5 implementation [1], which
includes a 640 MByte per second ring network and dedicated RAM with
built-in graphics primitives, is so considerable that its
realization can be expected to be expensive. The use of low
resolution images during image rotations was needed in this
approach to bring real-time volume visualization within reach.
3. THE DESIGN OF A REAL-TIME VOLUME
RENDERING ENGINE (RT-VRE)
Extensive studies of the volume visualization algorithm reveal that
the time spent in a straightforward implementation of the volume
rendering algorithm is divided as follows:
Interpolation : 91.34%
Composition   :  5.37%
Geometry      :  3.29%    (2)
This indicates that a speed-up of the interpolation algorithm is
most needed. It can be shown, however, that the task of generating
750 x 750 images, sampled at 512 depth positions, at a rate of 25
images per second requires a real-time volume rendering engine
capable of executing 600 Giga operations per second, using the
straightforward algorithm. A study of the power required to execute
this straightforward implementation of the algorithm shows that
between 10 and 20 kilowatts are required at the given performance
level, provided that it could be realized with chips of the current
generation, even if dedicated ASICs are used at strategic places.
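The size of this task can be made concrete with a short calculation. The sketch below only combines the figures quoted in the text (image size, depth samples, frame rate, and the 600 Giga operations per second); the resulting "operations per sample" is a derived ratio, not a number stated by the authors, and the function names are illustrative.

```python
# Sample throughput implied by the figures quoted in the text.
WIDTH, HEIGHT, DEPTH = 750, 750, 512   # samples in display space
FRAMES_PER_SECOND = 25
TOTAL_OPS_PER_SECOND = 600e9           # "600 Giga operations per second"


def samples_per_second(width=WIDTH, height=HEIGHT, depth=DEPTH,
                       fps=FRAMES_PER_SECOND):
    """Number of display-space samples that must be produced each second."""
    return width * height * depth * fps


def ops_per_sample(total_ops=TOTAL_OPS_PER_SECOND):
    """Derived ratio: operations available per sample at the quoted rate."""
    return total_ops / samples_per_second()


if __name__ == "__main__":
    print(f"{samples_per_second():.2e} samples per second")
    print(f"{ops_per_sample():.1f} operations per sample")
```

Under these figures the engine has to produce 7.2 billion samples per second, which corresponds to roughly 83 operations per sample for the straightforward algorithm.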
The outcome of this study motivated us to start a chip design, with
minimal power dissipation and maximal performance as main
objectives, resulting in various novel VLSI building blocks for the
visualization task, combining compact layouts, extremely low power
dissipation, and unprecedented computational capabilities. Fig. 1
shows a block diagram of the typical embodiment of an RT-VRE chip.
Fig. 2. The local VRE interconnect.
The engine processes complete bundles of 3D image data, resulting
in 2D patches of display data. The starting point for such a bundle
is sent to the VRE Processing Element through a broadcast interface
(B) in the form of an origin and a set of increments, used to
sequence through the 3D space along the rays. The bundle reference
point sequencer computes a single point in 3D space, which is the
reference point for a cut-plane of the bundle. This plane is
subdivided into smaller units, called tiles, which are sequenced by
the tile corner sequencer. This sequencer puts the calculated
addresses in a dedicated tile corner buffer.
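The idea of sequencing through 3D space from an origin and a set of increments can be sketched in software. The following is a minimal illustration, not the chip's actual sequencer logic: the function names, the use of floating-point tuples, and the regular grid of tile corners are all assumptions made for the example.

```python
def ray_positions(origin, increment, num_samples):
    """Yield num_samples points along a ray by repeated addition of increment."""
    x, y, z = origin
    dx, dy, dz = increment
    for _ in range(num_samples):
        yield (x, y, z)
        x, y, z = x + dx, y + dy, z + dz


def tile_corners(reference, step_u, step_v, tiles_u, tiles_v):
    """Corner points of a tiles_u x tiles_v grid of tiles on a cut-plane.

    reference is the plane's reference point; step_u and step_v are the
    (assumed) 3D increments between neighbouring corners in the two
    in-plane directions.
    """
    corners = []
    for j in range(tiles_v + 1):
        for i in range(tiles_u + 1):
            corners.append(tuple(r + i * u + j * v
                                 for r, u, v in zip(reference, step_u, step_v)))
    return corners


if __name__ == "__main__":
    print(list(ray_positions((0, 0, 0), (1, 0.5, 0), 3)))
    print(len(tile_corners((0, 0, 0), (1, 0, 0), (0, 1, 0), 2, 2)))
```

Stepping by addition only, rather than recomputing each position with multiplications, is what makes an origin-plus-increments description attractive for a hardware sequencer.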
The sub-tile selection unit selects four tile corner points from
the tile corner buffer and loads their X, Y, Z components in
parallel into the plane interpolator for the calculation of local
addresses within the bundle plane. The calculated values are
shifted out of the output registers of the interpolator, while
triggering a cache memory connected to the main memory, which
contains the voxel values in the form of opacity and colors as well
as the tissue types (i.e., the dataset). Any value missing in the
cache is loaded from the dataset. Cache memory 1 is of a very
special construction. It caches 8 tuples of voxel data elements,
which can be loaded in parallel into interpolator 2. Output shift
register 2 is used to allow some time for interpolator 2 to
calculate all the values of opacity and color within a given plane
using a plane interpolator. Repetition of this process gives full
trilinear interpolation within the region of interest. The results
of this step are stored in cache memory 2. The addresses produced
by output shift register 2 are used to select the desired values of
opacity and color from cache memory 2. These values are fed into
the composition unit, which operates on all points within the
bundle.
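The remark that repeated plane interpolation yields trilinear interpolation can be illustrated with the standard decomposition of a trilinear interpolation into two in-plane (bilinear) interpolations followed by one linear blend between the planes. This is a sketch of that textbook decomposition under assumed names and data layout; it is not claimed to mirror the interpolator's actual datapath.

```python
def lerp(a, b, t):
    """Linear interpolation between a and b for t in [0, 1]."""
    return a + (b - a) * t


def plane_interpolate(v00, v10, v01, v11, fx, fy):
    """Bilinear interpolation inside one plane from its four corner values."""
    return lerp(lerp(v00, v10, fx), lerp(v01, v11, fx), fy)


def trilinear(corners, fx, fy, fz):
    """Trilinear interpolation from 8 voxel values.

    corners[z][y][x] holds the value at each corner of the cell
    (an assumed layout); fx, fy, fz are fractional offsets in [0, 1].
    """
    near = plane_interpolate(corners[0][0][0], corners[0][0][1],
                             corners[0][1][0], corners[0][1][1], fx, fy)
    far = plane_interpolate(corners[1][0][0], corners[1][0][1],
                            corners[1][1][0], corners[1][1][1], fx, fy)
    return lerp(near, far, fz)


if __name__ == "__main__":
    # Corner values follow v = x + 2*y + 4*z, which trilinear
    # interpolation reproduces exactly.
    cell = [[[0, 1], [2, 3]], [[4, 5], [6, 7]]]
    print(trilinear(cell, 0.5, 0.5, 0.5))
```

The same routine applies to opacity and to each color component, since each is interpolated independently from the 8 surrounding voxels.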
A good feeling for the performance level of the ASIC can be
obtained if one takes into consideration that a bit addition takes
2 ns in the 1.6 micron process used. The ASIC nevertheless executes
at a 100 MHz composition rate. The DRAM bandwidth is fully
saturated, using 40 ns cycles whenever possible and 150 ns cycles
when new rows have to be selected.
The RT-VRE architecture is capable of calculating composition
operations at 100 Mega operations per second, using 4 composition
units each measuring 1 x 2 mm in the 1.6 micron VLSI process. A
total of 64 VRE-ASICs are needed to calculate 750 x 750 images at a
rate of 25 images per second, using 512 samples in the depth of a
256 x 256 x 256 dataset. Prototypes of the plane interpolator and
the 4-way composition unit are currently being processed. A 1
micron process will reduce the power dissipated by each ASIC to
about 1 Watt.
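Each composition operation mentioned above is one application of the composition formula (1) to one sample on a ray. A minimal software sketch of that recurrence follows. It assumes the samples arrive in front-to-back order, as the factor (1 - opacity along the ray) suggests, and it treats color as a single scalar channel; both choices, and the function name, are assumptions of the example.

```python
def composite_ray(samples):
    """Accumulate opacity and color along one ray using formula (1).

    samples is an iterable of (sampled_opacity, sampled_color) pairs,
    assumed to be ordered front to back.
    """
    ray_opacity = 0.0
    ray_color = 0.0
    for sampled_opacity, sampled_color in samples:
        # Intermediate opacity: the sample's contribution, attenuated
        # by what is already accumulated in front of it.
        alpha = sampled_opacity * (1.0 - ray_opacity)
        ray_opacity += alpha
        ray_color += sampled_color * alpha
    return ray_opacity, ray_color


if __name__ == "__main__":
    # A fully opaque first sample hides everything behind it.
    print(composite_ray([(1.0, 0.8), (1.0, 0.2)]))
    # Two half-transparent samples.
    print(composite_ray([(0.5, 1.0), (0.5, 0.0)]))
```

The first call shows the hidden surface removal built into the formula: once the accumulated opacity reaches 1, later samples contribute nothing.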
The 64 RT-VRE processing elements (PEs) are part of an
inhomogeneous multiprocessing network, as shown in Fig. 2. A total
of 4 PEs are connected, together with a general purpose RISC
processor, using bi-directional bus-couplers to a local
interprocessor bus. Sixteen such units build up the complete
RT-VRE, using a regular 2D interconnect pattern. VRAM memories are
used at this level to provide high bandwidth, unattended
interprocessor communication.
Fig. 3. Board level realization of the VRE (dimensions marked: 13 cm, 6 cm).
The use of general purpose RISC processors makes 
the overall RT-VRE design very flexible, as compute 
intensive operations, like MRI classifications, can be 
performed on the same hardware with excellent speed. 
Fig. 3 shows one of the 16 boards to be used in the final RT-VRE
prototype. The overall design will dissipate as little as 370 Watt:
160 Watt for the RAMs, 128 Watt for the 64 ASICs, 32 Watt for the
RISC processors and 50 Watt for the service processors and
peripherals. One ASIC co-processing unit comprises:
1. One VRE-ASIC
2. A set of bidirectional bus-buffers
3. Either 4 pieces of 256K x 4 DRAM, or one 256K x 16 DRAM
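The power budget quoted above can be checked with a few lines of arithmetic. The sketch uses only the figures given in the text (the four power contributions, 64 ASICs, 16 RISC processors); the per-device values are derived here and are not stated by the authors.

```python
# Power contributions quoted in the text, in watts.
POWER_WATT = {
    "RAMs": 160,
    "ASICs": 128,
    "RISC processors": 32,
    "service processors and peripherals": 50,
}
NUM_ASICS = 64
NUM_RISC = 16


def total_power():
    """Sum of all quoted contributions."""
    return sum(POWER_WATT.values())


def power_per_asic():
    """Derived average dissipation of one VRE-ASIC."""
    return POWER_WATT["ASICs"] / NUM_ASICS


def power_per_risc():
    """Derived average dissipation of one RISC processor."""
    return POWER_WATT["RISC processors"] / NUM_RISC


if __name__ == "__main__":
    print(total_power(), power_per_asic(), power_per_risc())
```

The contributions add up to the stated 370 Watt, and the ASIC share corresponds to an average of 2 Watt per ASIC.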
4. ADDITIONAL FUNCTIONALITY
The ASICs provide additional functionality to the 
RT-VRE through the inclusion of support for other 
visualization tasks, like: 
• Enhanced composition, giving a mix of normal compositions and
  maximum (minimum) intensity projection.
• Cast-shadow calculation.
• Fast classification.
The inclusion of additional DRAM makes it possible to scroll
through larger datasets than those indicated. The system can be
software reconfigured from a 256 x 256 x 256 dataset resolution to
a 512 x 512 x 64 dataset resolution.
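A quick calculation shows why these two resolutions are natural alternatives: both contain the same number of voxels. The sketch below also relates that count to the DRAM listed for each ASIC co-processing unit. That second step rests on assumptions the text does not state: that "256K" means 256 x 1024 words, and that the dataset is stored exactly once across the 64 ASIC memories without replication.

```python
from math import prod

CONFIGURATIONS = [(256, 256, 256), (512, 512, 64)]

# Per-ASIC DRAM from the component list: 256K words of 16 bits
# (256K is taken to mean 256 * 1024, an assumption).
NUM_ASICS = 64
DRAM_WORDS = 256 * 1024
DRAM_WORD_BITS = 16


def voxel_count(shape):
    """Number of voxels in a dataset of the given (x, y, z) resolution."""
    return prod(shape)


def total_dram_bytes():
    """Combined DRAM of all ASIC co-processing units, in bytes."""
    return NUM_ASICS * DRAM_WORDS * DRAM_WORD_BITS // 8


def bytes_per_voxel(shape):
    """Storage per voxel if the dataset were held once, unreplicated."""
    return total_dram_bytes() / voxel_count(shape)


if __name__ == "__main__":
    for shape in CONFIGURATIONS:
        print(shape, voxel_count(shape), bytes_per_voxel(shape))
```

Both configurations hold 16,777,216 voxels, so switching between them changes only the addressing, not the amount of memory required.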
5. CONCLUSIONS 
Visualization of 3D (medical) datasets sampled at a resolution of
750 x 750 x 512 is not economical with current CPU technologies and
general purpose techniques, due to excessive power dissipation. We
have shown a solution for the 3D visualization problem using
dedicated ASICs. The new architecture described in this article
requires as little as 370 Watt. This makes a Real-Time 3D
Visualization Workstation a feasible unit which can be used at
arbitrary (clinical) locations.
Acknowledgements: The final realization of the Plane Interpolator
has been done by Maarten de Münnink and René Oogink. The plane
interpolator chip was tested by Hans Snijders. The Composition Unit
was realized by Ronald Peer, Waldo Hazeleger and Edwin aan de
Stegge. The chips were processed by Eurochip and the Dutch Pica
organization. The realization of the chips was a cooperative effort
between the Hogeschool Enschede and the University of Twente.
REFERENCES 
1. H. Fuchs, J. Poulton, J. Eyles, T. Greer, J. Goldfeather,
D. Ellsworth, S. Molnar, G. Turk, B. Tebbs, and L. Israel.
Pixel-Planes 5: A heterogeneous multiprocessor graphics
system using processor-enhanced memories. Computer
Graphics, 23(3), 79-88 (1989).
2. S. M. Goldwasser and R. A. Reynolds. Real-time display
and manipulation of 3-D medical objects: The voxel
processor architecture. Computer Vision, Graphics and
Image Processing, 39, 1-27 (1987).
3. A. Kaufman, R. Bakalash, D. Cohen, and R. Yagel. A
survey of architectures for volume rendering. IEEE
Engineering in Medicine and Biology, 18-23 (1990).
4. M. S. Levoy. Display of surfaces from volume data.
IEEE Computer Graphics and Applications, 29-37
(1988).
5. M. S. Levoy. Design for a real-time high-quality volume
rendering workstation. In Proc. of the Chapel Hill
Workshop on Volume Visualization, 85-92 (1989).
6. M. S. Levoy. Efficient ray tracing of volume data. ACM
Transactions on Graphics, 9(3), 245-261 (1990).
7. M. S. Levoy. Volume rendering by adaptive refinement.
The Visual Computer, 6(1), 2-7 (1990).
