EM-Cube: An Architecture for Low-Cost Real-Time Volume Rendering by Osborne, Rändy et al.
 
Citation Osborne, Rändy, Hanspeter Pfister, Hugh Lauer, TakaHide
Ohkami, Neil McKenzie, Sarah Gibson, and Wally Hiatt. 1997.
EM-Cube: An architecture for low-cost real-time volume
rendering. In Proceedings of the ACM
SIGGRAPH/EUROGRAPHICS workshop on Graphics hardware:
August 3-4, 1997, Los Angeles, California, ed. A. Kaufman, W.
Strasser, S. Molnar, B. Schneider, S. N. Spencer, 131-138. New
York, N.Y.: Association for Computing Machinery.
Published Version doi:10.1145/258694.258731
Accessed February 18, 2015 4:04:55 PM EST
Citable Link http://nrs.harvard.edu/urn-3:HUL.InstRepos:4141476
Terms of Use This article was downloaded from Harvard University's DASH
repository, and is made available under the terms and conditions
applicable to Other Posted Material, as set forth at
http://nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of-use#LAA

EM-Cube: An Architecture for Low-Cost Real-Time Volume Rendering
Rändy Osborne* Hanspeter Pfister Hugh Lauer Neil McKenzie Sarah Gibson Wally Hiatt TakaHide Ohkami
MERL – A Mitsubishi Electric Research Lab
Abstract
EM-Cube is a VLSI architecture for low-cost, high quality volume
rendering at full video frame rates. Derived from the Cube-4
architecture developed at SUNY at Stony Brook, EM-Cube computes
sample points and gradients on-the-fly to project 3-dimensional
volume data onto 2-dimensional images with realistic lighting and
shading. A modest rendering system based on EM-Cube consists
of a PCI card with four rendering chips (ASICs), four 64Mbit
SDRAMs to hold the volume data, and four SRAMs to capture the
rendered image. The performance target for this configuration is
to render images from a $256^3 \times 16$ bit data set at 30 frames/sec.
The EM-Cube architecture can be scaled to larger volume data sets
and/or higher frame rates by adding ASICs, SDRAMs, and SRAMs.
This paper addresses three major challenges encountered in developing
EM-Cube into a practical product: exploiting the bandwidth
inherent in the SDRAMs containing the volume data, keeping the
pin-count between adjacent ASICs at a tractable level, and reducing
the on-chip storage required to hold the intermediate results of
rendering.
CR Categories: I.3.7 [Computer Graphics]: Three-Dimensional
Graphics and Realism—Raytracing I.3.1 [Computer Graphics]:
Hardware Architecture—Graphics Processors B.3.2 [Memory
Structures]: Design Styles—Interleaved Memories
1 Introduction
Real-time volume rendering is an enabling technology for medical
applications including diagnosis, surgical training, and surgical
simulation [6]. The large computational and memory requirements of
real-time volume rendering place it beyond the capabilities of single
processor PCs and workstations without dedicated hardware. While
high performance graphics systems can perform volume rendering
in real-time (e.g. the SGI InfiniteReality Engine), such systems are
very expensive.
Our goal is to develop a family of products that provide real-time
volume rendering at affordable prices, i.e., within reach of personal
computer budgets. This family is intended to address medical
applications where volume rendering is an obvious requirement, but
also to provide a foundation for the development of interactive volume
graphics, that is, the graphics of 3-D sampled images and
their manipulation at interactive speeds. We expect that as systems
for real-time volume rendering become cheaper and more commonplace,
a broader class of applications (e.g. scientific visualization,
industrial design and analysis, virtual sculpture, and games) will
begin to use volume graphical methods. Eventually, we envision
that the mechanisms of volume graphics and conventional polygon-based
graphics will converge, so that both kinds of rendering will be
supported by the same kind of hardware.

*osborne@merl.com, 201 Broadway, Cambridge, MA 02139, Phone: (617) 621-7524, FAX: (617) 621-7550
This paper describes the architecture of the first member of this
family, a volume rendering chip currently under development. The
architecture is a scalable systolic array based on Cube-4, developed
at SUNY at Stony Brook [16]. The performance target is a chipset
that fits onto a single PCI card and renders volume data sets of
$256^3 \times 16$ bit voxels at 30 frames/sec. The cost of such an
accelerator will be on the order of a low-cost PC. In subsequent generations
the cost will decrease as the underlying implementation technology
improves.
Cube-4, though scalable to larger volumes by adding more ASICs
and memory modules, is impractical for low-cost ASIC implementation.
The key challenges are delivering the required bandwidth
with as few chips as possible, reducing the inter-chip communication
to keep the pin count reasonable, and reducing the on-chip
storage required for intermediate results. Our EM-Cube (Enhanced
Memory Cube-4) architecture meets the first two challenges by using
a block skewed memory, which exploits inherent SDRAM burst
bandwidth, and meets the third challenge by subdividing the volume
in a technique we call sectioning.
The organization of this paper is as follows. Section 2 describes
related work. Sections 3 and 4 describe Cube-4 and introduce the
three implementation challenges. Sections 5 and 6 introduce block
skewed memory and show how it meets the first and second challenges
respectively. Section 7 discusses the on-chip storage problem
and our solution via sectioning. Section 8 presents the overall
architecture. Finally, Sections 9 to 11 discuss features needed for a
commercial product, such as support for multiple voxel formats.
2 Related Work
Several approaches have been taken to achieve interactive volume
rendering rates. Software implementations use acceleration techniques
which require pre-computation, additional data storage, or
trade off image quality for speed. Shear-warp rendering, currently
the fastest software algorithm, achieves one projection in a few
seconds on a regular workstation [11]. Many researchers have implemented
volume rendering algorithms on large general-purpose
multiprocessors [2, 5, 14, 15]. However, this approach requires expensive,
typically network-shared machines to achieve acceptable
frame rates, and the lack of direct frame-buffer access prohibits
real-time output rates. Another approach is to use existing polygon
graphics hardware for volume rendering [18, 8, 13]. Interactive rendering
rates have been achieved on the SGI Reality Engine using 3D
texturing hardware [3, 1]. However, current 3D texturing hardware
is expensive and does not support the estimation of gradients that is
required for high-quality shading and classification. Furthermore, the
best volume rendering performance on large general-purpose supercomputers
or special-purpose texture mapping hardware is still below
15 frames/sec for $256^3$ volumes.

Figure 1: Rendering pipeline (stages: interpolation, gradient estimation, shading, classification, compositing)
In view of these limitations, it is not surprising that a number
of researchers have undertaken the development of special-purpose
hardware for volume rendering. VOGUE, one of the most concrete
proposals, is a compact ray-casting unit which provides interactive
rendering speeds at moderate hardware costs [10]. A single board
consisting of eight-way interleaved volume memory and four VLSI
chips provides 2.5 frames/sec for $256^3$ volumes. Near real-time
rates of 20 frames/sec can be achieved by connecting several modules
over a ring-connected cubic network [9]. VIRIM, an object-order
volume rendering engine, is one of the few research proposals
that has been built and tested [7]. The machine consists of four
VME boards with special-purpose geometry processors for data resampling
and programmable ray-casting processors for the final image
generation. VIRIM achieves 2.5 frames/sec for $256^3$ datasets.
3 Cube-4 Architecture
Cube-4, developed at SUNY Stony Brook, is a scalable systolic array
of rendering pipelines, each connected to its own memory module
[16]. Figure 1 shows the major functions in each rendering
pipeline. Cube-4 uses a modified ray casting algorithm. Instead of
processing along each ray in depth-first fashion, Cube-4 processes
rays in parallel in a breadth-first fashion. In particular, all the sample
points contained in an entire plane of voxels are processed in parallel,
thereby avoiding the need to re-read neighboring voxels from
memory. Such a voxel plane, called a slice, is always perpendicular
to one of the three axes of the volume data cube. Cube-4 chooses the
direction for the slice such that the slice normal subtends the smallest
angle with the actual viewing direction.^1
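
As an illustration of the slice direction choice described above, the following Python sketch picks the volume axis whose normal makes the smallest angle with a view direction. The function name `choose_slice_axis` and the tuple representation of the view direction are assumptions made for this example, and breaking ties by the lowest axis index is just one arbitrary choice.

```python
def choose_slice_axis(view_dir):
    """Return 0, 1 or 2 (x, y or z): the axis closest to view_dir.

    Hypothetical helper, not taken from the Cube-4 design itself.
    """
    if not any(view_dir):
        raise ValueError("view direction must be non-zero")
    # The axis with the smallest angle to the view direction is the one
    # whose component has the largest magnitude. max() returns the first
    # maximal index, which is one arbitrary way to break ties.
    return max(range(3), key=lambda axis: abs(view_dir[axis]))
```

For instance, a view direction of `(0.2, -0.9, 0.3)` selects axis 1, so slices would be taken perpendicular to the y axis.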
^1 The algorithm chooses arbitrarily amongst view normals having equally small subtended angles.

Figure 2: Skewed voxel memory (axes x, y, z; shading indicates $(x+y+z) \bmod C = 0, 1, 2, 3$)

Since a slice has too many voxels to be processed at once, Cube-4
scans each slice a beam (i.e. a row) at a time. Beams are further
divided into partial beams of $p$ voxels. Each voxel of a partial beam
is processed by a separate rendering pipeline capable of fetching a
new voxel from an associated memory module every clock cycle.
Thus a Cube-4 system with $p$ pipelines can process a beam in $N/p$
cycles, a slice in $N^2/p$ cycles, and a volume in $N^3/p$ cycles, where
$N$ is the size of a cubic dataset in any dimension.
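
The cycle counts above can be checked with a short calculation. This is only a sketch: the function names are hypothetical, and the 125MHz clock used in the usage note is borrowed from the SDRAM clock rate quoted later in the paper.

```python
def cube4_cycles(n, p):
    """Cycles to process one beam, one slice and the whole n^3 volume
    with p pipelines, each fetching one voxel per clock."""
    if n % p:
        raise ValueError("n must be a multiple of p")
    beam = n // p                      # N / p
    return beam, n * beam, n * n * beam  # N/p, N^2/p, N^3/p


def frames_per_second(n, p, clock_hz):
    """Upper bound on frame rate if one volume pass yields one frame."""
    return clock_hz / cube4_cycles(n, p)[2]
```

For example, `frames_per_second(256, 4, 125_000_000)` evaluates to roughly 29.8, in line with the 30 frames/sec target for a $256^3$ volume.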
A key feature of the Cube-4 architecture is that rendering
pipelines communicate only locally with associated memories and
neighboring pipelines up to three away. Thus the Cube-4 architec-
ture is highly scalable.
3.1 Cube-4 Skewed Memory
A fundamental challenge in Cube-4 is arranging data amongst memory
modules so that the processing chips can concurrently fetch all
$p$ voxels in a partial beam regardless of the viewing direction. To
meet this challenge, Cube-4 uses 3D skewed memory. A voxel at
position $(x, y, z)$ in unskewed voxel space is mapped to position
$(i, r, s)$ in skewed voxel space where $i = (x + y + z) \bmod N$,
$r = y$, and $s = z$. Given $C$ memory modules, where $N$ is a multiple
of $C$, a voxel $(i, r, s)$ in skewed voxel space is mapped to module
number $i \bmod C$ and to an address within that memory module of
$\lfloor i/C \rfloor + r \cdot N/C + s \cdot N^2/C$.
The layout of voxels in the volume memory is illustrated in Figure 2,
which shows a set of voxels near the origin in each of the three
dimensions for $C = 4$. Voxels are represented by small cubes, with
the shading illustrating their assignment to memory modules. The
ordering of the assignments of colors to voxels is identical for each
of the three visible faces. Throughout the volume, adjacent voxels
within a beam are stored in adjacent memory modules, and thus regardless
of the view direction, a partial beam of $p = C$ voxels can
be fetched concurrently from the $C$ separate memory modules.
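
The skewed mapping of this section is small enough to write out directly. The sketch below follows the formulas given above for $(i, r, s)$, the module number, and the address; the function names are hypothetical and `partial_beam_modules` is only a checking aid, not part of Cube-4.

```python
def skew(x, y, z, n):
    """Map unskewed (x, y, z) to skewed (i, r, s)."""
    return (x + y + z) % n, y, z


def module_and_address(x, y, z, n, c):
    """Return (memory module, address within that module) for a voxel."""
    if n % c:
        raise ValueError("N must be a multiple of C")
    i, r, s = skew(x, y, z, n)
    return i % c, i // c + r * (n // c) + s * (n * n // c)


def partial_beam_modules(start, axis, n, c):
    """Modules touched by C consecutive voxels from `start` along `axis`.

    The caller must keep all C positions inside the volume.
    """
    pos = list(start)
    modules = []
    for _ in range(c):
        modules.append(module_and_address(*pos, n, c)[0])
        pos[axis] += 1
    return modules
```

With $N = 8$ and $C = 4$, `partial_beam_modules((0, 1, 2), 1, 8, 4)` returns `[3, 0, 1, 2]`: four consecutive voxels along y land in four different modules, as the text requires for any axis.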
The 3D skewing introduces a lateral shifting in voxels between
adjacent beams within a slice and also between adjacent beams in a
row plane perpendicular to a slice. As discussed in Section 6, this
shifting must be undone in order to process each voxel (e.g. see Figure 6),
and it leads to significant communication between adjacent
rendering pipelines.
4 Implementation Issues
To achieve a low-cost system, the number of rendering chips and associated
memory chips must be as small as possible. The rendering
chips must have a reasonable die size and must be compatible with
current packaging technology. The Cube-4 architecture described
in Section 3 does not meet these goals. It requires too many memory
modules (about 20), too many pins per rendering chip (on the
order of 512 signal pins), and too much on-chip storage, resulting
in an excessively large die (in excess of $100\,\mathrm{mm}^2$ for storage alone).
Subsequent sections describe each of these points in more detail and
describe our modifications to Cube-4 to attain a feasible design for
VLSI implementation.

Figure 3: Memory location assignments of YZ face (entries of the form $(jN + kN^2)/C$ along the y and z axes)
5 Voxel Bandwidth
To meet our performance targets, the voxel memory must have a
capacity of 32Mbytes and must deliver a sustained bandwidth of
1Gbyte/sec independent of view direction.
5.1 Cube-4 memory access patterns
The Cube-4 skewed memory organization has view-dependent
memory access strides which exceed common DRAM page sizes
for some view directions. This precludes the use of fast page (i.e.
column) mode access in DRAMs in such view directions, reducing
achievable memory performance to random (i.e. row) access levels.
View dependence forces the entire memory system design to handle
this worst case.

In particular, for an $N^3$ dataset with $C$ memory modules, the memory
access stride is 1, $N/C$, or $N^2/C$ if the view normal direction
is parallel to the Z, X, or Y axes respectively. Figure 3 shows the assignment
of memory locations of voxels on the YZ face for a view
direction parallel to the X axis. A stride of $N/C$ is required to access
successive voxels in successive partial beams parallel to the Y
axis. Moreover, there is an anomaly in this stride at the beginning
of each beam. Therefore, except for small $N$ and/or large $C$, only a
few successive accesses will fall on the same DRAM page, yielding
little benefit from fast page mode access. Likewise, on the ZX face (not
shown), a stride of $N^2/C$ is required to access successive voxels of
successive partial beams, also with an anomaly at the beginning of
each beam. For small $C$ and reasonable values of $N$, this $N^2/C$
stride is larger than typical DRAM pages, completely precluding the
use of fast page mode.
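
To make the page-locality argument concrete, the following sketch estimates how many successive accesses fall in one DRAM page for each stride. It is a simplification under stated assumptions: the 512 byte page size is the figure quoted for 64Mbit SDRAMs in Section 5.3, the stride anomaly at the start of each beam is ignored, and the function names are hypothetical.

```python
def access_strides(n, c):
    """Address stride (in voxels) by view normal axis, as listed in the text."""
    return {"Z": 1, "X": n // c, "Y": n * n // c}


def accesses_per_page(stride, page_bytes, voxel_bytes):
    """Successive strided accesses that stay within one DRAM page."""
    page_voxels = page_bytes // voxel_bytes
    return max(1, page_voxels // stride)
```

For $N = 256$, $C = 4$, and 16 bit voxels, the strides are 1, 64, and 16384 voxels, giving 256, 4, and 1 accesses per 512 byte page respectively.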
5.2 Memory Technology
64Mbit synchronous DRAMs (SDRAMs) will be the mainstream
DRAM in the next 1-2 year period. Such SDRAMs meet our
32Mbyte capacity requirement, and 4Mx16 versions at 125MHz deliver
1Gbyte/sec with just 4 chips. 64Mbit Rambus(TM) will ramp
up during the same period but its higher clock speed requires a more
complicated interface.

Figure 4: Blocked skewed memory ($b = 4$)

Unfortunately, Cube-4's large memory strides prevent getting
anywhere near the maximum 1Gbyte/sec bandwidth with 4 memory
chips. For Mitsubishi Electric's 64Mbit 125MHz SDRAM, the
cycle time for a row access is $t_{RC} = 80$ nsec. In practical operation,
at most two banks can be overlapped in $t_{RC}$, thus limiting the
maximum performance to 2 accesses per 80nsec, or 50Mbytes/sec
per SDRAM (at 16 bits/voxel). Thus 20 SDRAMs are needed to obtain
1Gbyte/sec. The situation is similar for Rambus since it is also
block oriented. This number is unreasonable for a low cost design.
To significantly reduce the row access time, the DRAM banks
must be smaller, and as a side effect usually less dense. Examples
are 16Mbit Enhanced SDRAM (30nsec row access time) and
MoSys's 1Mbyte multibank MDRAM (20nsec row access). However,
these devices are too slow (a 20nsec row access time implies
10 chips) or not dense enough. The performance of various
cache+DRAM combinations, such as 16Mbit cached DRAM
(CDRAM) and Enhanced SDRAM, degrades to the row access time
for strides greater than a DRAM page.
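
The chip counts quoted in this section follow from simple bandwidth arithmetic, sketched below with integer math to avoid rounding surprises. The function names are hypothetical, and treating the 20nsec device as one access per row cycle is an assumption made here to reproduce the quoted figure of 10 chips.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def row_access_bandwidth(t_rc_ns, accesses_per_t_rc, voxel_bytes):
    """Bytes/sec from one DRAM when every access pays a row cycle."""
    return accesses_per_t_rc * voxel_bytes * 10**9 // t_rc_ns


def burst_bandwidth(clock_hz, voxel_bytes):
    """Bytes/sec from one DRAM delivering one voxel per clock."""
    return clock_hz * voxel_bytes


TARGET = 10**9  # 1 Gbyte/sec sustained voxel bandwidth

sdram_random = row_access_bandwidth(80, 2, 2)   # two banks overlapped in tRC
chips_random = ceil_div(TARGET, sdram_random)   # SDRAMs needed at row-access rates
chips_20ns = ceil_div(TARGET, row_access_bandwidth(20, 1, 2))
chips_burst = ceil_div(TARGET, burst_bandwidth(125_000_000, 2))
```

Under these assumptions `chips_random` is 20, `chips_20ns` is 10, and `chips_burst` is 4, matching the numbers in the text.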
5.3 Block Skewed Memory
To take advantage of the high bandwidth of SDRAM in fast page
mode, we organize the volume memory into subcubes or blocks of
$b \times b \times b$ voxels in such a way that all of the voxels of a block
are stored linearly in the same DRAM page. The memory is still
skewed to support rendering independent of view direction, but it is
now skewed at the block granularity rather than voxel granularity
as in Cube-4. Each rendering chip processes a block and maintains
a block-sized reordering buffer so that the voxels in a block can be
read out in the order appropriate for the view direction. Figure 4 illustrates
the block skewed memory for $b = 4$.

In this new organization, a row of blocks comprises a block-beam
and a two-dimensional array comprises a block-slice. At the block
granularity the processing algorithm is the same as the Cube-4 algorithm,
except that partial block-beams replace partial beams. Each
block is processed internally on a voxel granularity using the Cube-4
algorithm.

There are several design points for $b$.
PageBlock: $b$ can be as large as possible while still allowing the $b^3$
block to fit into a single DRAM page. Thus the burst transfer size
can be as large as a page size, which easily permits sustaining full
bandwidth from the SDRAMs. One disadvantage of this scheme
is that the block size depends on the voxel size. The 512 byte pages in
64Mbit SDRAMs support $b = 8$ for 8 bit voxels and $b = 4$ for 16 or
32 bit voxels. Another disadvantage is that it requires a page-sized
buffer on-chip.
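
The PageBlock sizes quoted above can be reproduced with the sketch below. Restricting $b$ to powers of two is an assumption made here (it is consistent with the quoted values of 8 and 4, but the text does not state it), and `page_block_size` is a hypothetical name.

```python
def page_block_size(page_bytes, voxel_bytes):
    """Largest power-of-two b such that b^3 voxels fit in one DRAM page."""
    if voxel_bytes > page_bytes:
        raise ValueError("a single voxel does not fit in a page")
    b = 1
    # Keep doubling while the next larger cube still fits in the page.
    while (2 * b) ** 3 * voxel_bytes <= page_bytes:
        b *= 2
    return b
```

With 512 byte pages this gives `page_block_size(512, 1) == 8` for 8 bit voxels and a block size of 4 for both 16 and 32 bit voxels.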
MiniBlock: Alternatively, $b$ can be as small as possible. This
eliminates the sensitivity to voxel size. Blocks with $b = 2$ are
large enough to completely overlap the row access overhead of the
SDRAM module with data transfer. Assuming 16 bit voxels and
Mitsubishi Electric's 4Mx16 SDRAM at 125MHz, the single burst
access time for a 2x2x2 block is 112nsec, i.e. 8 accesses in 14 clocks.
Two of the four banks in the SDRAM can be interleaved to achieve
8 accesses in 8 clocks, i.e. full bandwidth.^2 A disadvantage of $b = 2$
is the large inter-chip communication.
Hierarchical Blocks: A compromise yielding the advantages of
both large and small block sizes can be achieved by tiling blocks
of size $b$ with miniblocks. The blocks themselves are skewed across
memory modules, but the miniblocks within them are not. This hierarchical
blocking permits efficient implementation of larger blocks,
e.g. PageBlocks. Instead of fetching the entire block at once, which
requires a $b^3$ voxel buffer, miniblocks can be fetched on a row by
row basis on demand. This capability ensures minimal overhead for
the sectioning described in Section 7.1.
The maximum block size is $b \le N/C$ since blocks must be
skewed over $C$ chips so that a block-beam can be fetched without
conflict for any view direction.
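
One way to picture skewing at block granularity is to apply the Cube-4 skew to block coordinates instead of voxel coordinates. The sketch below does exactly that; the paper does not spell out the block-level formula or the linear order of voxels within a block, so both `block_module` and `offset_in_block` are assumptions made by analogy with Section 3.1, and all names are hypothetical.

```python
def max_block_size(n, c):
    """Largest b allowed by the constraint b <= N/C."""
    return n // c


def block_module(x, y, z, b, c):
    """Memory module holding the b^3 block that contains voxel (x, y, z).

    Assumed rule: skew the block coordinates the way Cube-4 skews voxels.
    """
    return (x // b + y // b + z // b) % c


def offset_in_block(x, y, z, b):
    """Assumed x-fastest linear position of a voxel inside its block."""
    return (x % b) + b * (y % b) + b * b * (z % b)
```

Under this assumed rule, with $N = 32$, $b = 8$, and $C = 4$, the four blocks of any block-beam fall in four different modules, and all 512 voxels of a block share one module.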
A hierarchical blocking scheme is also described in [12]. The
data volume is divided into subcubes and subcubes are divided into
2x2x2 "supervoxels". However, while the hierarchical division is
the same as above, the actual memory blocking is different. In [12]
the eight voxels in a supervoxel are distributed across eight memory
modules, i.e. supervoxels are the unit of interleaved memory access.
In our blocking, all the voxels comprising a block are located in the
same memory and miniblocks are the unit of pipelined burst access.
In addition, all the blocks are skewed.
6 Inter-chip Communication
Figure 5 shows the EM-Cube architecture in a generic way independent
of $b$. Voxel blocks are distributed across the set of SDRAM
volume memories at the top. Each rendering chip connects to an
SDRAM memory module, a pixel memory chip (SRAM or DRAM)
for output, and neighboring rendering chips for transfer of intermediate
values.^3 Such inter-chip communication is required for resampling
(intermediate trilinear interpolation results and possibly voxels),
gradient estimation (intermediate results and trilinear interpolation
results), and compositing (partial pixels).
Each voxel block is processed by a single rendering chip. Within
a block, intermediate values are communicated on-chip. The only
inter-chip communication results from processing voxels near the
faces of each block. Since the area of a block face is $b^2$, the inter-chip
communication grows as $b^2$. On the other hand, the number of
voxels processed per block grows as $b^3$. Therefore, on a per voxel
basis, the inter-chip communication scales as $1/b$. Thus a design
with $b = 4$ requires up to 4 times less inter-chip communication
bandwidth^4 than Cube-4. Table 1 summarizes the inter-chip communication
requirements for several architectural variations. The
compositing communication depends on the view direction.

^2 Provided that every row is accessed at least once within every 64msec, no additional overhead is necessary for refresh. Rendering the entire $256^3$ dataset of 16 bit voxels accesses every row of four 64Mbit SDRAMs every 32msec. For a smaller volume or smaller voxel size, rendering might not access every row every 32msec. However, we do not need the full 250Mbytes/sec bandwidth in such cases and thus we can slip in auto-refresh cycles without degrading the bandwidth.
^3 Because of the one-to-one correspondence of memory modules and rendering chips, we use $C$ interchangeably for either.
^4 Exactly 4 times less, except for compositing, where the worst-case factor is $37/64$. See Table 1.

            Trilin   Grad est   Compos
Unskewed    1        2          1
Cube-4      3        3          1
EM-Cube     $3/b$    $3/b$      $1/b$ to $(3b^2 - 3b + 1)/b^3$

Table 1: Summary of inter-chip communication bandwidth (in "values"/clock)
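
The EM-Cube row of Table 1 is easy to evaluate for a given block size. The sketch below uses exact fractions; the dictionary keys and the function name are hypothetical labels chosen for this example.

```python
from fractions import Fraction


def em_cube_communication(b):
    """Inter-chip communication per clock (in values) for block size b,
    following the EM-Cube row of Table 1."""
    return {
        "trilin": Fraction(3, b),
        "grad_est": Fraction(3, b),
        "compos_best": Fraction(1, b),
        # Note that 3b^2 - 3b + 1 equals b^3 - (b - 1)^3.
        "compos_worst": Fraction(3 * b * b - 3 * b + 1, b ** 3),
    }
```

For $b = 4$ the worst-case compositing entry is $37/64$, the factor mentioned in the footnote, and for $b = 1$ every entry reduces to the Cube-4 row (3, 3, 1).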
The inter-chip communication for resampling has an interesting
geometric interpretation. The left side of Figure 6 shows, in
unskewed voxel space, the eight voxel neighborhood for trilinear interpolation.
Here we assume $b = 1$ to simplify the picture, and thus
there is one memory module and one rendering chip for each column
$i$. It suffices to communicate the bilinear interpolation of the
four side face voxels (e.g. 2, 4, 6, and 8) to the left neighbor. Skewing
the volume transforms the eight voxel neighborhood cube into
the slanted parallelepiped in the right of Figure 6. The transformation
is the same as pulling vertices 4 and 5 of the unskewed voxel
cube laterally to the right and left, respectively. Such pulling spreads
the eight voxel cube over four columns. To perform the trilinear interpolation,
we first undo the skewing by shifting voxels 5 and 6 to
the right by 1 and likewise shifting voxels 3 and 4 to the left by 1.
This lateral communication can be pipelined, with all front bottom
voxels moving one to the right and all top rear voxels moving one
to the left on each clock. The four side face voxels are then bilinearly
interpolated and the result sent laterally to the left neighbor to
compute the final trilinear interpolation result. The total communication
is thus 3 values per clock. For $b > 1$ each vertex becomes a
$b^3$ block of voxels and $b^2$ face voxels move to the left and another
$b^2$ move to the right each time step.
For compositing, the inter-chip communication is equal to the
number of rays exiting a block. The best case shown in Table 1 occurs
for a viewing direction parallel to an axis and the worst case
occurs for a ray direction 45 degrees from two axes. The worst case
communication scales as $1/b$ in all three dimensions. Thus $b$ must
be fairly big, e.g. 8, before there is a significant reduction in total
compositing communication from the $b = 1$ case.
Comparing the entries in Table 1 for Cube-4 (skewed volume)
and the unskewed volume reveals that skewing significantly increases
the inter-chip communication. However, the unskewed volume
is not practical because either the view direction must be restricted
or there must be a copy of the entire dataset for each axis
direction.
The blocked architecture permits a tradeoff between signal frequency
and the pin count for inter-chip communication. The inter-chip
bandwidth decreases by a factor of $b$, allowing fewer pins and/or
lower frequency. For example, if the resampling stage uses 16 bit voxels, the
inter-chip communication can be any combination of $(16/w)$ bits
wide every $(b/w) \times 8$ nsec where $w = 1, 2, 4, 8$, and 16 and $w < b$.
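
The width and frequency combinations described above can be enumerated as follows. This is a sketch of the stated rule only; `link_options` is a hypothetical name, and the 8nsec base period corresponds to the 125MHz clock.

```python
from fractions import Fraction


def link_options(b, value_bits=16, clock_ns=8):
    """List (w, path width in bits, transfer period in nsec) choices
    for an inter-chip link, for all w in {1, 2, 4, 8, 16} with w < b."""
    options = []
    for w in (1, 2, 4, 8, 16):
        if w < b:
            width_bits = value_bits // w          # (16 / w) bits wide
            period_ns = Fraction(b * clock_ns, w)  # every (b / w) * 8 nsec
            options.append((w, width_bits, period_ns))
    return options
```

For $b = 8$ the options are 16 bits every 64nsec, 8 bits every 32nsec, and 4 bits every 16nsec. Each carries the same 0.25 bits/nsec, and the last one corresponds to a quarter-width path clocked at about 62MHz.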
For $b = 8$ we estimate a rendering chip will have 267 signal pins.
This is feasible for today's packaging technology. Only 20 of these
pins need to run at 125MHz, the remainder at 62MHz or less. All
the inter-chip signals use quarter-width paths, i.e. the pins are multiplexed
over four 62MHz clocks. The unskewed volume variation
has 72 fewer pins. Thus skewing costs 72 pins for $b = 8$ (the cost
increases for smaller $b$).
7 On-chip Storage
As depicted in Figure 5, each rendering chip needs buffer storage for
buffering blocks, voxels for interpolation, values on the slice ahead
and slice behind for gradient estimation, and partially composited
pixels. Each chip also needs lookup tables for opacity values, color
values, and shading (not shown in Figure 5).

Figure 5: EM-Cube architecture (4 rendering chips shown; component labels per chip: voxel memory module, block buffer, bilin, linear, MUX, trilin sample, slice buffer, grad est, shader, compos, compositing buffer, pixel memory chip; chips are linked by bidirectional buses, with signals wrapping around)

Figure 6: Unskewed (left) and skewed (right) voxel cubes (vertices 1 to 8 of the eight voxel neighborhood, labeled v(x,y,z) in unskewed space and v(i,r,s) in skewed space, where they spread over columns i-2 to i+1)

                6 bytes/pixel   3 bytes/pixel
Block buffer    3.1             3.1
Interpolation   1053/C          1053/C
Grad est        2097/C          2097/C
Compos          3146/C          1573/C
Lookup          36.9            36.9
Total           6296/C + 40     4723/C + 40

# chips         6 bytes/pixel   3 bytes/pixel
4               1614            1221
8               827             630
16              433             335
32              237             187

Table 2: On-chip buffer storage for $b = 8$ (Kbits/chip, where C is the number of chips)
The blocked architectures require a reordering buffer of $b^3$ voxels.
For uninterrupted supply of voxels, the block buffer must be
double buffered with $2b^3$ voxel storage per rendering chip. However,
for hierarchical blocking the storage drops to $3b^2$ voxels ($b > 2$).
Trilinear interpolation requires voxels in two adjacent slices.
Thus voxels must be buffered from one slice to the next. This storage
is independent of the architecture (e.g. Cube-4 or EM-Cube) and
depends solely on the number of rendering chips, $C$. The slice storage
required per rendering chip is $N^2/C$ voxels. However, interpolation
also requires voxels in the previous row, thus the total interpolation
storage per rendering chip is $(N^2 + N)/C$ voxels.
Computing a central difference for gradient estimation requires
samples from a slice ahead and a slice behind. This requires two
slice buffers and thus the gradient estimate storage per rendering
chip is $2N^2/C$ samples.
Shading produces partial pixels. As these partial pixels are generated
slice by slice, they are composited into a "running" pixel buffer.
All the partial pixels along the same ray (i.e. sharing the same screen
pixel location) are composited into the same location in the running
pixel buffer. Final pixels corresponding to a ray emerging on an
exit face are immediately written to pixel memory off-chip. Consequently,
only the $N^2$ running pixels of the slice cross-section of
the volume need to be stored. Thus the compositing storage per rendering
chip is $N^2/C$ running pixels. We allow 3 to 6 bytes per pixel
to cover a number of possible pixel formats, e.g. containing an alpha
value (for front-to-back compositing).
For lookup tables, we assume a two-tiered table opacity lookup
with two 512 byte tables and one 512 entry table per color component
(3x512 bytes total). Shading is not yet finalized. One possibility
is the lookup table method of [17] which uses a reflectance map
(one 512 byte table per axis direction, for 3x512 bytes total) and an
arctangent table (one 512 byte table).^5 The total for all lookup tables
is 9x512 bytes.
Table 2 lists the total on-chip storage required for $N = 256$, $b = 8$
with hierarchical blocks, and 16 bit voxels. With present embedded
SRAM densities, the buffer storage per chip must be less
than roughly 200Kbits to ensure a cost-effective core area of about
$100\,\mathrm{mm}^2$, reserving half the core for logic. Thus 32 chips are required.
This is far too many chips for a cost effective solution.
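
The entries of Table 2 can be recomputed from the per-chip storage formulas given in this section. The sketch below assumes 16 bit samples for gradient estimation and takes 1 Kbit as 1000 bits, both of which are inferred from the table values rather than stated in the text; the function and key names are hypothetical.

```python
def storage_kbits(n, b, c, pixel_bytes, voxel_bits=16, sample_bits=16,
                  lookup_bytes=9 * 512):
    """Per-chip on-chip storage in Kbits (1 Kbit = 1000 bits assumed)."""
    bits = {
        "block_buffer": 3 * b * b * voxel_bits,         # hierarchical blocks
        "interpolation": (n * n + n) * voxel_bits / c,  # (N^2 + N)/C voxels
        "grad_est": 2 * n * n * sample_bits / c,        # 2N^2/C samples
        "compos": n * n * pixel_bytes * 8 / c,          # N^2/C running pixels
        "lookup": lookup_bytes * 8,                     # 9 x 512 byte tables
    }
    bits["total"] = sum(bits.values())
    return {name: value / 1000 for name, value in bits.items()}
```

For $N = 256$, $b = 8$, and $C = 4$ with 6 bytes/pixel the total comes to about 1614 Kbits, and with $C = 32$ it is about 237 Kbits, in agreement with Table 2.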
^5 This produces grey level shading; full color shading requires one reflectance map per color component.

Figure 7: Sectioning of volume memory (labels: section, section face, two voxel plane overlap)
7.1 Sectioning – A Solution for the On-Chip Buffer
Size Problem
To reduce the on-chip buffer area to a feasible amount, we use the
same approach as in [4]: we divide the volume into $L$ horizontal sections
as shown in Figure 7. We process each section in turn using
the EM-Cube algorithm and then combine the results. This sectioning
reduces the slice face area and hence the size of slice buffers: $L$
sections reduce the on-chip slice buffers to $1/L$ of their unsectioned
size. For $C = 4$ chips, $L = 8$ is a feasible design.
Sectioning does not come for free. We are performing a space-
time tradeoff: we re-read voxels from volume memory and move
some intermediate results back and forth from external pixel mem-
ory.
7.1.1 Voxel bandwidth
Interpolation requires the voxels in the previous row while gradient
interpolation requires the voxels in the two previous rows. Consequently,
after the first section all subsequent sections require re-reading
the bottom two rows of the previous voxel plane as depicted
in Figure 7. If there are $L$ sections, this means re-reading
$2(L-1)N^2$ voxels per frame, and thus the total bandwidth overhead
is $2(L-1)N^2/N^3 = 2(L-1)/N$. This is at most about 5% of the total
bandwidth if $L \le 8$. For blocks with $b > 2$, tiling with miniblocks
eliminates any excess overhead in re-reading the two voxel plane.
However, one consequence is that the SDRAM clock and rendering
chip pipelines must run slightly faster to deliver the additional
bandwidth. For $L = 8$, the SDRAM clock and rendering chip
pipelines must run 5% faster, i.e. at 132MHz, or at a 5% slower frame
rate, i.e. 28 frames/sec.
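
The overhead formula and its effect on clock rate or frame rate can be evaluated directly. The sketch below uses hypothetical function names; the 125MHz and 30 frames/sec defaults are the nominal figures from the text.

```python
def reread_overhead(n, sections):
    """Fraction of extra voxel reads: 2(L-1)N^2 / N^3 = 2(L-1)/N."""
    return 2 * (sections - 1) / n


def adjusted_rates(n, sections, clock_mhz=125.0, frames_per_sec=30.0):
    """Return (clock needed to keep the frame rate,
    frame rate obtained if the clock is kept)."""
    overhead = reread_overhead(n, sections)
    return clock_mhz * (1 + overhead), frames_per_sec / (1 + overhead)
```

For $N = 256$ and $L = 8$ the overhead evaluates to $14/256$, about 5.5%, which gives a clock of roughly 131.8MHz or a frame rate of roughly 28.4 frames/sec, close to the rounded figures quoted above.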
7.1.2 Pixel memory re-read
While processing a section, we only need on-chip storage for the
compositing buffer proportional to the size $N^2/L$ of the slice face
area. All running pixels for rays emerging on a section face can be
written to off-chip pixel memory as "interim" pixels.

However, interim pixels written to off-chip pixel memory for rays
exiting a section face must be combined/composited with values
for rays continuing into the adjoining section. We deal with this
problem by reading interim pixels from off-chip pixel memory into
the on-chip compositing buffer before processing the next section.
There are up to $N^2$ interim pixels to read per section (the number is
as few as 0 for rays parallel to a voxel row). The worst case can be
handled by reading one beam of interim pixels from off-chip pixel
memory per slice. In fact, the latency for reading these interim pixels
can be hidden by the time to reload the additional two voxels per
slice from voxel memory.

Figure 8: Rendering chip pipelines (labels: four pipelines of trilin, grad est, shader, and compos stages; block buffer; smoothing buffer; voxels; pixels; 8nsec and 32nsec clocks)
8 Rendering Chip Structure
Figure 5 shows the overall architecture. Each rendering chip has
buffers and datapaths built-in for a nominal design such that 4 rendering
chips, 4 SDRAMs, and 4 pixel memories achieve 28-30
frames/sec with $256^3 \times 16$ bit voxels. To reduce inter-chip communication
cost, and hence the pin count, to manageable levels, we plan
to use a block size of $b = 8$ hierarchically tiled with miniblocks.
Each rendering chip processes 16 bit voxels at 125MHz^6 and has
slice buffers of size $256 \times 256 \times 16$ bit$/32$ (4Kbytes). Currently we
plan to have four pipelines on-chip, as shown in Figure 8, each 16
bits wide clocked at 32nsec. Larger voxels are treated as a sequence
of 16 bit values with proportional reduction in frame rate.
9 Voxel Formats
Flexibility in voxel formats is important. Accordingly, the EM-Cube
architecture allows the user to fashion the voxel format appropriately.
Voxels are either 8 bits or a sequence of one or more 16 bit
fields. We distinguish the format of voxels in memory ("memory
voxels") and the format of voxels in EM-Cube pipelines ("pipeline
voxels"). In the simplest case, pipeline voxels are the same as memory
voxels. In general, a pipeline voxel can be a simple transformation,
e.g. a table lookup, on some or all fields of memory voxels. A
memory voxel has the following conceptual components:
1. Intensity field: 8, 12, or 16 bits to indicate intensity or to index
an RGB table.
2. Index field: 4, 8, (maybe 12), or 16 bits for color lookup and
material type indicator.
3. Gradient coefficient: 8 bits (may increase later).
4. Opacity field: 8 bit value or index to opacity table.
5. Arbitrary user fields (size unrestricted as long as user pads
overall voxel size out to a multiple of 16 bits).
Not all fields need be present; some fields may not exist and some
may overlap with other fields. Table 3 shows examples of some of
the voxel formats.
Footnote 6: Or slightly more due to sectioning overhead.
[Table 3 was a diagram of voxel bit layouts. Recoverable example formats: intensity only (8 bits); intensity + index; intensity + index + grad. coeff; intensity + index followed by grad. coeff + opacity; RGB table index: rgb index + grad. + opacity; direct RGB: R, G followed by B + opacity/intensity.]
Table 3: Example voxel formats
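To make the distinction between memory voxels and pipeline voxels concrete, here is a minimal sketch of one possible 16 bit layout: a 12 bit intensity field and a 4 bit index field. The bit positions, function names, and the color table are assumptions made for illustration; the text does not fix where each field sits within the word.

```python
INTENSITY_BITS = 12   # one of the intensity widths listed above
INDEX_BITS = 4        # one of the index widths listed above


def pack_voxel(intensity, index):
    """Build a 16 bit memory voxel (assumed layout: intensity high, index low)."""
    if not 0 <= intensity < (1 << INTENSITY_BITS):
        raise ValueError("intensity does not fit in 12 bits")
    if not 0 <= index < (1 << INDEX_BITS):
        raise ValueError("index does not fit in 4 bits")
    return (intensity << INDEX_BITS) | index


def unpack_voxel(word):
    """Split a 16 bit memory voxel back into (intensity, index)."""
    return word >> INDEX_BITS, word & ((1 << INDEX_BITS) - 1)


def to_pipeline_voxel(word, color_table):
    """Derive a pipeline voxel by a table lookup on the index field."""
    intensity, index = unpack_voxel(word)
    return intensity, color_table[index]


if __name__ == "__main__":
    # Hypothetical 16-entry color table: index i maps to a gray level.
    table = [(i * 17, i * 17, i * 17) for i in range(16)]
    word = pack_voxel(0xABC, 0x5)
    print(hex(word), to_pipeline_voxel(word, table))
```

The lookup in `to_pipeline_voxel` stands in for the "simple transformation" mentioned above; in the simplest case it would be the identity.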
10 Scaling
It is important that EM-Cube scale to accommodate larger volumes
and larger voxel sizes. Given $C$ rendering chips each having the
nominal design described in Section 8 and a volume dataset of $N$
columns, $M$ rows, $S$ slices and $16v$ bits/voxel ($v = .5, 1, 2, 4$), we
have the following constraints:
Memory capacity: $2vNMS/C \le 8m$ Mbytes where there are $m$
64Mbit SDRAMs per rendering chip.
Frame rate: $\le C/(2vNMS) \times 250$M f/sec, determined by the
rendering chip processing rate (footnote 7).
Slice buffer: $2vNM/(LC) \le 4096$ bytes
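The three constraints can be evaluated directly for a candidate configuration. The sketch below is an illustration under stated assumptions: Mbytes is taken as 2^20 bytes, $L$ is the number of sections, and the function names are hypothetical rather than part of the architecture.

```python
import math

MBYTE = 2 ** 20
SDRAM_BYTES = 8 * MBYTE       # one 64Mbit SDRAM holds 8 Mbytes
SLICE_BUFFER_BYTES = 4096     # on-chip slice buffer limit
RATE = 250e6                  # constant in the frame rate bound


def memory_ok(v, N, M, S, C, m):
    """Memory capacity: 2vNMS / C <= 8m Mbytes."""
    return 2 * v * N * M * S / C <= m * SDRAM_BYTES


def max_frame_rate(v, N, M, S, C):
    """Frame rate bound: C / (2vNMS) * 250M frames/sec."""
    return C / (2 * v * N * M * S) * RATE


def min_sections(v, N, M, C):
    """Smallest L satisfying the slice buffer bound 2vNM / (LC) <= 4096 bytes."""
    return math.ceil(2 * v * N * M / (C * SLICE_BUFFER_BYTES))


if __name__ == "__main__":
    # Nominal design: 256^3 volume, 16 bit voxels (v = 1), 4 chips, 1 SDRAM each.
    print(memory_ok(1, 256, 256, 256, C=4, m=1))
    print(round(max_frame_rate(1, 256, 256, 256, C=4), 1))
    print(min_sections(1, 256, 256, C=4))
```

For the nominal design the memory bound is met with equality, and the slice buffer bound gives $LC = 32$, which matches the divisor in the slice buffer size quoted in Section 8.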
10.1 Voxel Scaling
The above constraints define the options if the voxel size $v$ changes.
For example, if $v$ doubles and if $NMS = 256^3$ and
$NM = 64$ Kbytes, then we can halve the volume size by halving $N$ or $M$
(halving $S$ does not help because of the slice buffer constraint); or
we can double the number of rendering chips $C$, SDRAMs, and
pixel memories; or we can double the number of sections $L$, double
the amount of voxel memory per rendering chip, and halve the frame
rate.
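As a worked check of these three options, take the nominal design ($N = M = S = 256$, $C = 4$, $m = 1$) with $L = 8$ sections, a value inferred from the slice buffer size in Section 8 rather than stated explicitly, and double the voxel size from $v = 1$ to $v = 2$:

```latex
\begin{align*}
\text{After doubling } v:\quad
  & \frac{2vNM}{LC} = \frac{4 \cdot 256^2}{8 \cdot 4} = 8192 > 4096 \text{ bytes},
  && \frac{2vNMS}{C} = \frac{4 \cdot 256^3}{4} = 16 \text{ Mbytes} > 8m \\
\text{Halve } N \ (N = 128):\quad
  & \frac{4 \cdot 128 \cdot 256}{32} = 4096 \text{ bytes},
  && \frac{4 \cdot 128 \cdot 256^2}{4} = 8 \text{ Mbytes} \\
\text{Double } C \ (C = 8):\quad
  & \frac{4 \cdot 256^2}{8 \cdot 8} = 4096 \text{ bytes},
  && \frac{4 \cdot 256^3}{8} = 8 \text{ Mbytes} \\
\text{Double } L, m \ (L = 16,\ m = 2):\quad
  & \frac{4 \cdot 256^2}{16 \cdot 4} = 4096 \text{ bytes},
  && \frac{4 \cdot 256^3}{4} = 16 \text{ Mbytes} = 8m \\
\text{Frame rate in the last case:}\quad
  & \frac{C}{2vNMS} \times 250\text{M} = \frac{4}{4 \cdot 256^3} \times 250\text{M} \approx 14.9 \text{ f/sec}
\end{align*}
```

Halving $S$ alone would repair the memory bound but leave the slice buffer term $2vNM/(LC)$ at 8192 bytes, which is why it does not help.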
10.2 Volume Scaling
To handle a data set of size $NMS$ larger than the
$N \times M \times S = 256 \times 256 \times 256$ supported in the four chip
nominal design, we extend sectioning to three dimensions to divide
the volume into smaller volumes. Thus we virtualize the voxel and
pixel memories by paging them to the host memory system. As in
Section 7.1, volume sections must overlap by two voxel planes, requiring
re-reading part of a section.
This 3D sectioning also allows us to handle reasonable volume
sizes with just a single rendering chip, albeit with proportional
reduction in performance.
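One way to picture 3D sectioning is as a split of each axis into ranges that share two voxel planes with their neighbors. The sketch below is only an illustration: it assumes adjacent sections share exactly two planes and that each section is capped at a fixed extent per axis, and the function names are hypothetical. The precise overlap rule is the one given in Section 7.1.

```python
from itertools import product


def section_ranges(extent, max_len, overlap=2):
    """Split [0, extent) into ranges of at most max_len planes.

    Adjacent ranges share `overlap` planes, so those planes are read twice.
    """
    if max_len <= overlap:
        raise ValueError("max_len must exceed the overlap")
    ranges = []
    start = 0
    while True:
        stop = min(start + max_len, extent)
        ranges.append((start, stop))
        if stop == extent:
            break
        start = stop - overlap  # step back to re-read the shared planes
    return ranges


def subvolumes(shape, max_shape, overlap=2):
    """Enumerate sub-volumes as one (start, stop) range per axis."""
    per_axis = [section_ranges(e, m, overlap) for e, m in zip(shape, max_shape)]
    return list(product(*per_axis))


if __name__ == "__main__":
    # A 512 x 256 x 256 volume paged through a 256^3 nominal design.
    for sub in subvolumes((512, 256, 256), (256, 256, 256)):
        print(sub)
```

Because of the overlap, a 512-plane axis needs three sections rather than two, which is one source of the re-reading cost noted above.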
Footnote 7: Frame rate degradation due to sectioning is ignored (typically only 5%,
depending on $L$).
11 Other Issues
Several important issues such as supersampling, subvolumes, and
perspective projections are unaddressed in this paper. We are investigating
these issues as we refine our architecture. We anticipate that
supersampling will be easy to work into the pipelines while subvolumes
will be moderately more difficult.
12 Summary
We presented the outline of a feasible architecture for a low-cost,
real-time volume rendering system suitable for PCI cards in PCs.
Processing $256^3 \times 16$ bit voxels at 30 frames/sec requires four sets
of rendering chips and associated voxel and pixel memories.
A major innovation of the architecture is block-skewed memory.
Blocking achieves maximum bandwidth from a small number of
SDRAMs. While skewing eliminates memory access conflicts to
provide view independence without duplicating voxel data, it
increases inter-chip bandwidth. Blocking counteracts this problem,
reducing the inter-chip bandwidth and thus the pin count. The block
size $b$ parameterizes the architecture. The larger $b$, the lower the
communication overhead paid for skewing, and the more the data
access pattern resembles that for an unskewed voxel memory.
A second key aspect of the architecture is sectioning. This
reduces the on-chip storage requirements to achieve a feasible chip
area for implementation.
Other features of the architecture are flexible voxel formats and
scalability. As in Cube-4, one can always add more chips and memories
for scalability. Alternatively, given a fixed amount of hardware,
one can use sectioning in multiple dimensions to scale to
larger volumes. We are investigating adding additional features
such as supersampling, subvolumes, and perspective projection.
Architectural simulations of EM-Cube are underway. We plan to
freeze the architecture in early summer and expect chips and a PCI
reference board in the second half of 1998.
References
[1] B. Cabral, N. Cam, and J. Foran. Accelerated volume rendering and tomographic reconstruction using texture mapping hardware. In Workshop on Volume Visualization, pages 91–98, 1994.
[2] B. Corrie and P. Mackerras. Parallel volume rendering and data coherence. In Proc. Parallel Rendering Symposium, pages 23–26, 1993.
[3] T. J. Cullip and U. Neumann. Accelerating volume reconstruction with 3D texture mapping hardware. Technical Report TR93-027, Dept. of Computer Science, Univ. of North Carolina, Chapel Hill, 1993.
[4] M. de Boer, A. Gropl, J. Hesser, and R. Manner. Latency- and hazard-free volume memory architecture for direct volume rendering. In Proc. 11th Eurographics Hardware Workshop, pages 109–118, 1996.
[5] K. Ma et al. A data distributed parallel algorithm for ray-traced volume rendering. In Proc. Parallel Rendering Symposium, pages 15–22. ACM Press, 1993.
[6] S. Gibson et al. Simulating arthroscopic knee surgery using volumetric object representations, real-time volume rendering and haptic feedback. In First Joint Conference on Computer Vision, Virtual Reality, and Robotics in Medicine and Medical Robotics and Computer Assisted Surgery, pages 369–378. Springer-Verlag, 1997.
[7] T. Guenther et al. VIRIM: A massively parallel processor for real-time volume visualization in medicine. In Proc. 9th Eurographics Hardware Workshop, pages 103–108, 1994.
[8] H. Fuchs and J. Poulton. Pixel-planes: A VLSI-oriented design for a graphics engine. VLSI Design, 2(3):20–28, 1981.
[9] G. Knittel. A scalable architecture for volume rendering. In Proc. 9th Eurographics Hardware Workshop, pages 58–69, 1994.
[10] G. Knittel and W. Strasser. A compact volume rendering accelerator. In Proc. Volume Visualization Symposium, pages 67–74. ACM Press, 1994.
[11] P. Lacroute and M. Levoy. Fast volume rendering using a shear-warp factorization of the viewing transform. In Proc. SIGGRAPH, pages 451–457, 1994.
[12] J. Lichtermann. Design of a fast voxel processor for parallel volume visualization. In Proc. 10th Eurographics Hardware Workshop, pages 83–92, 1995.
[13] S. Molnar, J. Eyles, and J. Poulton. Pixelflow: High-speed rendering using image composition. Computer Graphics, 26(2):231–240, July 1992.
[14] C. Montani, R. Perego, and R. Scopigno. Parallel volume visualization on a hypercube architecture. Workshop on Volume Visualization, pages 9–16, October 1992.
[15] U. Neumann. Parallel volume-rendering algorithm performance on mesh-connected multicomputers. In Proc. Parallel Rendering Symposium Proceedings, pages 97–104, 1993.
[16] H. Pfister and A. Kaufman. Cube-4 – A scalable architecture for real-time volume rendering. In ACM/IEEE Sympos. on Volume Visualization, pages 47–54, 1996.
[17] J. Scheltinga, J. Smit, and M. Bosma. Design of an on-chip reflectance map. In Proc. 10th Eurographics Hardware Workshop, pages 51–55, 1995.
[18] P. Schröder and G. Stoll. Data parallel volume rendering as line drawing. In Workshop on Volume Visualization, pages 25–31, 1992.