SUMMARY This paper proposes a convolution with systolic array sturucture for perspective projection in real-time volume graphics based on the shear-warp method. In the original method, the further the ray proceeds, the more voxels are required for the calculation of convolution. The increase of required voxels makes dicult to implement the method in VLSI oriented architecture. 1) We use several sets of resolution of voxels associated with depth, in order that convolution can be done with constant number of voxels independent of depth. 2) We implement 3D convolution by three serial 1D convolutions along X, Y and Z axes, which reduces the calculation steps from M 3 to 3M, where the convolution is calculated for M 3 area. For V 3 voxels dataset, the number of pipelines for rays is V 2 and their pipeline stage is 3M. If the hardware of a single pipeline has the capability of calculating V rays, each of the implemented pipelines is assigned to V theoretical pipelines (for V 2 rays). In actual implementation, a number of hardware pipelines should be much smaller than the V theoretical pipelines. We fold the theoretical pipelines and reduce them to the certain number of hardware pipelines. Regarding this folding, we show the relation between folding process and its necessary time delay. The architecture can generate image of 256 3 voxles dataset( V = 256 ) at 30Hz with 4 pipelines. In addition, the architecture can be extended easily for 512 3 (V = 512) and 1024 3 (V = 1024) dataset with 32 pipelines and 256 pipelines respectively.
Introduction
Fast direct volume rendering systems are in high demands. This is due to an increasing number of scientic data generated by a variety of computer simulations, medical data obtained by MRI and CT scanners, and geological, oceanographic, and meteorological data collected from various sensors. One of the notable characteristics shared by these volume data is the sheer amount of data elements to be processed in rendering. This requires a huge amount of computing resources for animated visualization, which is essential to observe some physical phenomena. Another characteristic of the data is that they can not be represented by surfaces Although there are many algorithms for volume rendering, the ray-casting algorithm is the precisest algorithm based on a physical model. It casts rays from the center of projection into a volume to calculate each pixel value on a screen. Let suppose I(a; b) be a intensity from a ray through the volume between points a and b, s(r) be a light added per unit length at distant r along the ray, and (r) be an absorption coecient corresponding to the attenuation of the light per unit length. The following Equation (1) calculates the eects of the light, and it has been used as volume rendering equation [1] [5] [10] .
I(a; b) = R b a s(r)e 0 R r a a(t)dt dr (1) As a simplied implementation of Equation (1) , each sample is computed from voxels surrounding the sample point by interpolation, then being accumulated along the ray to calculate the intensity of the pixel. Each resampling operation is relatively simple, but the total number of resampling operations is very large, and the time spent for the operations is dominant in the rendering time. Because of this, a ray-casting-based volume rendering system could be considered a resampling machine. Therefore in real-time implementation y , how to organize a parallel processing for the resampling is one of major issues.
In this paper, we propose a new VLSI oriented architecture of resampling for both parallel and perspective projections in real-time volume rendering based on the shear-warp method [6] [5] . In particular, we show implementation of 3D convolution for resampling with systolic array structure.
Previous works
From architectural stand point, the ray-casting algorithm is categorized into two schemes for implementation: sample-order and voxel-order scheme. As the term implied, the sample-order scheme uses sample point as a starting point of processing then get memory address of voxel. On the other hand, the voxel-order y generating more than 30 images in a second scheme uses memory address of voxel directly as starting point for processing. Each scheme has advantages and disadvantages for structuring real-time volume rendering architectures.
Sample-order scheme
The sample-order scheme is straightforward implementation of ray-casting algorithm. The ray-parallel method in the scheme parallelizes rendering operations on a ray basis [4] . The number of rays to cast is determined by the screen size and resolution. It can use some optimization techniques available for this scheme. Early ray termination and coherence encoding are two examples to reduce the number of memory accesses [7] . The major disadvantage of this scheme is that one voxel is simultaneously accessed by multiple rays for resampling, which increases the total number of memory accesses time. In addition, it requests complicated memory address calculation from the sample point. These are disadvantages for hardware implementations. VIRIM [2] is one of typical real-time parallel rendering system in this scheme. It is a parallel rendering system for both parallel and perspective projections, but not organized in a complete parallel-pipeline structure for VLSI implementation. In addition, it can not generate images for large data more than 64 3 grids in real-time.
Voxel-order scheme
The voxel-order scheme on the other hand, uses voxel address directly so that memory address calculation is simple and suitable for VLSI implementation. The major advantages of this scheme are the reduction of the number of memory accesses, and the xed resampling structure. The voxel-parallel method [11] [10] in the scheme reads voxeles once and retains them until all the samples requiring the voxels. The disadvantage is that there are no major optimization techniques available for this scheme. The optimization techniques for the ray-parallel method can not be used for this voxelorder scheme, because they break the xed resampling structure. Cube-4 [11] and its VLSI implementation: EM-Cube [9] (product name is VolumePro [12] ) are rendering system in this scheme. The architectures are organized in systolic array structure. However, they support only parallel projections. The memory access for parallel projections are very regular and amenable to parallel pipelined processing; a rendering pipeline is directly corresponded to a ray. The memory access for perspective projections, however, are not regular due to diverging perspective rays. This processing variability adversely aects the parallel-pipeline structure for perspective projections and it has been a major obstacle for hardware implementation in this scheme [12] . The shear-warp [6] [5] is another method in voxelorder scheme. It has been used as software oriented method for both parallel and perspective projections. Although it has the signicant advantage, i.e. both projections can be treated with unied way, hardware implementations of convolution for variavle size is hard due to diversing rays so that it has been used for software implementations.
Issues of Perspective Projections
The shear-warp method produces a distorted base plane y image with a shearing matrix H [6] [5] as an intermediate image, then the image is warped to produce a correct image on a screen. Fig. 1 shows the perspective rays parallelized at the base plane by the shearing matrix H. They are parallel to the principal viewing axis and perpendicular to the base plane. The shearing transformation shears and progressively scales the voxel grid as shown in Fig. 1 , where the voxel grid becomes ner as the distance from the base plane increases.
Starting from a position in the rst slice, i.e. base plane, a parallelized perspective ray proceeds in the progressively scaled grid to compute a sample at each slice. The computed samples are accumulated to produce the nal pixel value in the base plane image. A resampling operation is a convolution over the voxels in a resampling area. There are two important issues for a parallel pipelined implementation, i.e. systolic array, of the method: (1) how to bound a number of voxles to be convoluted and (2) how to organize a convolution. A n umber of voxels to be convoluted in one dimension, M, is computed by: M = 1 + k=k 0 ; (2) where k yy is the slice number or the distance from the base plane and k 0 the distance between the eye position (the center of projection) and the base plane as y A plane the most perpendicular to the viewing vector, which includes the front face of the volume. yy We use notation (x; y; z) for the dataset description and (i; j; k) for that of transformed as shown in Fig. 1 shown in the Fig. 1 . It can be arbitrarily large with large values of k and/or small values of k 0 . Since the hardware implementation can only use a xed amount of resources for convolution, it has to ignore the voxels outside the convolution area, producing a low-quality or aliased base plane image.
The convolution structure is another issue. The underlying architecture pairs a memory module and a rendering pipeline so that each pipeline can take one voxel along with neighboring voxels through sideway communications and produces one sample computed from the voxels.In this systolic array structure, the number of inputs (voxels) is equal to the number of outputs (computed samples). This holds for parallel projections, but not for perspective projections.
These issues make hard to implement the parallelpipeline structure for perspective projections and has been a major obstacle for hardware implementation in this method.
Basic Ideas
The parallel projection is amenable to parallel processing by assigning a set of rendering pipelines to a set of rays with one to one corresponding as illustrated in Fig. 2(a) . In order to organaize rarallel pipelined convolution, the number of voxels to be convoluted has to be limited. It is reasonable to use low resolution data for distant samples so that selecting appropriate resolution data from multiple resolution dataset, prepared in advance, limits the number of voxels to be convoluted within a limit as shown in Fig. 2(b) . 
Sample-Parallel Architecture
We shift a focus from voxels to samples and reorganize the rendering architecture as a sample-parallel architecture to provide a unied parallel-pipeline structure for both parallel and perspective projections.
As shown in Fig. 3 , the proposed architecture places a resampling module between the voxel memory and rendering pipelines. This module is specialized for resampling with a variable number of voxels in the resampling area. All the complicated communication and its control between other pipelines are moved from the rendering pipelines to this resampling module in order to organize the rendering pipelines in a xed parallelpipeline structure. 
Transformation
To estimate a sample, we use four coordinate systems and transformations between them. The transformed coordinates are used to calculate weights for convolution. The four coordinate systems are the normalized, shear-shrink, scale-up, and compositing coordinate systems, as illustrated in Fig. 4 . The normalized coordinate system denes the original voxel grid at slice 0. It is equivalent to the base plane coordinate system. The shear-shrink coordinate system denes the voxel grid sheared and scaled by a shear-shrink matrix. The scaleup coordinate system denes a grid scaled up by a scaling factor D. This is the eect of using multi-resolution datasets; a dataset covering a larger area eectively scales up the grid. The compositing coordinate system denes a pixel grid which coincides with the original voxel grid in the normalized coordinate system. The sequence of transformations from the normalized coordinate system to the compositing coordinate system entails the sequence of grid changes as follows; 1) transform a grid point at (i; j; k) in the normalized coordinate system into a grid point at (i 3 ; j 3 ; k 3 )
by the shear-shrink matrix;
2) scale up the grid point at (i 3 ; j 3 ; k 3 ) using the scaling factor D to a grid point at (i y ; j y ; k y );
3) apply the oor operation to the grid position If we assume separable weights w lmn , i.e. w lmn = w n w m w l , then Equation (3) can be transformed into Equation (4). This assumption is reasonable and useful for the implementation of 3D convolution. For example, 3D Lagrange interpolation formula and a 3D Sinc function are separable and belongs to this category. By using separable weight, the 3D convolution can be implemented using series of 1D convolution. This implementation uses 3M calculation units, i.e. multiplier and adder, (4) and (5). The induction of the structure is described in Appendix A. The gure shows a special case: the dataset size in one dimension is equal to the number of processing pipelines. In the gure (a), the convolution has two types of data path; the solid line shows data path of voxels and dotted line shows that of sheared position. In each data path, operations are divided into three groups: one group for the i, the next for the j, and the last for the k-direction. In j-direction 1D convolution, it has j-delay to adjust timing of next scanline voxels. In k-direction 1D convolution, it has kdelay to adjust timing of next slice of voxles. In Fig. 5 (b), how to access all the voxels with a set of pipelines in sequential manner is shown. Table 1 shows several snapshots of the pipeline operations of the 1D convolution for the i-direction in in the table. Snapshots for whole pipeline operations can be induced from the snapshots in Table 1 . Because the 1D convolution structure for the j and k-direction is similar to that of i-direction, but having dierent time delays due to the dierent timing of neighboring voxels in the j and k-directions.
Selecting Multi-Resolution Datasets
The use of multi-resolution datasets can reduce the number of voxels required for convolution. By selecting an appropriate multi-resolution dataset depending on the resolution of the scaled voxel grid, the architecture can always use a bounded numb e r o f v o xels for resampling regardless of the depth. It is a 3D version of the mip-mapping scheme for texture mapping [14] . The memory overhead to store multi-resolution datasets for a volume of size V 
where M is the convolution area size in Equation (2).
The scaling factor D is dened by D = 2 L .The multi-resolution data, v i;j;k , for convolution are specied by the logical address (L; bi=Dc; bj=Dc; bk=Dc).
Note that their geometrical positions are given by (i y ; j y ; k y ).
Multi-Resolution Skewed Memory Organization
The skewed memory organization is a technique to store voxels in separate memory modules so that voxels in a slice can be accessed in parallel without any memory conict regardless of the viewing direction [3] . It does not require multiple volume copies. The proposed architecture uses it to store multi-resolution datasets.
Consider a system with N p rendering pipelines for a volume of size V 3 . A logical memory address for the multi-resolution skewed memory is specied by (L; i; j; k), where L is the resolution level. Each multiresolution dataset can be stored in the skewed memory as if it were an original volume dataset.
Let n p be a memory module number, L be a resolution level in the module, and i p be an index in the module, the multi-resolution data are accessed with the physical address (n p ; L ; i p ) given by the following addressing scheme: n p = m mod N p ; (7) i p = bm=N p c + j 0 V 0 =N p + k 0 V 02 =N p ; The folding scheme requests another delay elements in the architecture. Those are: folding-delays, left-folding-delays and selectors. Fig. 7 illustrates derivation of these elements and their control. In the gure, derivation of time delay for the case of V = 8 and N P = 4 is shown. Fig. 7 (a) shows a naive implementation, i.e. no folding. Fig. 7 (b) illustrates dummy time-slots generated for folding. The input timing of the left half voxels is one unit time early to the right one. To compensate this timing mismatch, delay units are used . Fig 7 (c) shows that the dummy time slots are lled by folding. By this folding, there is no wasted time slots. As a result of the derivation, we can see that the time delay for folding-delays is always 1 unit time regardless of V , on the contrary that of left-foldingdelays is V= N P . Using these additional delay elements and controls, Fig. 5 can be extended to 3D convolution for a general case. =4 for a next slice of voxels. In addition, the structure has folding-delays and left-folding-delays to compensate time delay caused by folding as described in Fig. 7 . In summary, the number of delay time for general case with N P pipelines is shown in Table 2 . It is actually congured as a variable delay element for dierent resolution levels in adaptive processing. For a given scaling factor D, representing a resolution level, the delay element causes a delay of (V = D )=4 for j-delay and (V = D ) 2 =4 for k-delay respectively as shown in Table 2. The variable delay element can be implemented using a FIFO memory addressed in a circular manner by a single pointer for both read and write operations. The cycle time from one location to the same location determines the delay time, which can easily be changed by changing the maximum address value.
In the gure, the shearings (i; j; k) ! (i y ; j y ; k y ) and (i; j; k) ! (î;ĵ;k) are carried out on the y at shear block in the gure. This shearing is carried out using DDA instead of matrix multiplication. Table 2 
Rendering Timings
The rendering time is directly related to the number of resampling operations to perform, which can be re- duced by the use of multi-resolution data. It is upperbounded by the total number of resampling operations without multi-resolution datasets, that is, only with original voxels.
Since the resampling and other rendering operations can fully be pipelined, the pipeline cycle time can be equal to the memory access time T m . This implies that the processing bottleneck is the memory access time. For a given set of rendering parameters, accesses to the voxel memory are regular and deterministic. A double buering technique can be used to provide continuous data streams from the voxel memory to the resampling module and rendering pipelines. The pixel memory does not require much bandwidth on average, because the pixel write operations are bursty but intermittent. A simple pixel buering technique with a FIFO memory will be enough for pixel write operations. 
For a given set of parameters T m , N f , and N p , the maximum dimension of volume that can be rendered is given by:
Assuming that T m = 8 ns as in a 125-MHz SDRAM chip and N f = 30 frames/second, the volume dimensions computed for several values of N f and N p are shown in Fig.9 . The volume dimensions computed for several typical values of N f and N p are also shown in Table 3 . These values verify that the proposed ar- chitecture can render volumes of practical sizes in realtime.
Multi-resolution datasets are generated before a volume dataset is loaded into the voxel memory. They can be generated o-line by software. The set of operations to compute a single data from eight data at a lower level includes seven additions, one division (shift), and eight memory reads, and one memory write; . Although the number of operations can be reduced by using some optimization techniques, such as buering and pipelining, it seems still dicult to perform these operations on the y with the current technologies. It may be reasonable, however, to perform these operations using several frame periods for practical applications.
Experiments
We built a software simulator to simulate the pipeline data ow of the proposed architecture and verify the concept for both parallel and perspective projections. To compare images, we also built a sample-order raycasting renderer that computes slices of samples perpendicular to the viewing vector and accumulates them to produce the nal image. We conducted several rendering experiments with these simulators. Fig. 6 (a) shows a perspective image rendered from an opaque cube of size 64 3 to verify the perspective projection. The lter kernel based on the 22222 Lagrange formula is used in resampling.
Figs. 6 (b) and (c) show two perspective images rendered from an opaque checker-board cube of size 128 3 (spatial frequency of 64 Hz) to explore the aliasing problem; a fully opaque dataset gives the worst case for aliasing. The image in Fig. 6 (b) is generated by using the nearest neighbor voxel values in resampling, showing the aliasing problem clearly. The image in Fig. 6 (c) is generated by using a 3 2 3 2 3 box lter kernel in resampling, showing the antialiasing eect by convolution using multi-resolution datasets.
Figs. 6 (d), (e), and (f) show the images rendered from the engine block of size 256 3 used in Lacroute's rendering experiments [6] with a manually adjusted opacity table. Fig. 6 (d) shows a perspective projection image, and Fig. 6 (e) shows a parallel projection image for a comparison purpose. These two images are generated by using a kernel based on the 3 2 3 2 3 Lagrange formula. Fig. 6 (f) is a perspective projection image generated by the sample-order ray-casting renderer with interpolations using multi-resolution datasets. The number of the slices taken for this image is 258, about the same number of slices (256) used in Fig. 6 (d) . The two images in Figs. 6 (d) and (f) look comparable in quality.
Future Work
The analysis for xed-point arithmetic is an essential work for hardware implementation. And the other future work is the theoretical analysis for convolution on re-sampled data from irregular volumes.
Conclusion
We h a ve proposed a new VLSI architecture of real-time volume graphics using 3D convolution for both parallel and perspective projections based on the shear-warp algorithm. In the original algorithm, the further the ray proceeds, the more voxels are required for the calculation of convolution. The increase of required voxels makes dicult to implement the algorithm in hardware. 1) We prepare several sets of resolution of voxels associated with depth, in order that convolution can be done with constant number of voxels independent of depth. In this induction, the real sheared positions indicated in the Equation (4) are not necessary so that we do not used it for simplicity. 
