Multiresolution volume processing and visualization on graphics hardware by van der Laan, Wladimir
  
 University of Groningen
Document Version
Publisher's PDF, also known as Version of record
Publication date:
2011
Citation for published version (APA):
van der Laan, W. (2011). Multiresolution volume processing and visualization on graphics hardware.
Groningen: s.n.




8.1 Summary and Conclusions
In this thesis we investigated several advanced techniques in visualization of large data sets,
multidimensional signal processing, deformable models, and data reduction. As we aimed to
develop fast algorithms, all of these can be used in an interactive pipeline.
In chapter 2 we have investigated a number of algorithms based on morphological pyramids
for multiresolution MIP volume rendering on graphics hardware. We found that our highly
optimized streaming MIP GPU method outperforms both its software implementation and
existing ray-casting and 3-D texture-based methods.
In chapter 3 we presented a novel, fast wavelet lifting implementation on graphics hardware
using CUDA, which extends to any number of dimensions. We compared our method to an op-
timized CPU implementation of the lifting scheme, to another (non-CUDA based) GPU wavelet
lifting method, and also to an implementation of the wavelet transform in CUDA via convolu-
tion. We implemented our method both for 2D and 3D data. The method is scalable and was
shown to be the fastest GPU implementation among the methods considered. Our theoretical
performance estimates turned out to be in fairly close agreement with the experimental observations.
The complexity analysis revealed that our CUDA kernels are cost- and work-efficient. Our
proposed GPU algorithm can be applied in all cases where the Discrete Wavelet Transform is part
of a pipeline for processing large amounts of data. Examples are the encoding of static images,
such as the wavelet-based successor to JPEG, JPEG2000 [124], or video coding schemes [9].
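To make the term concrete: the lifting scheme factors a wavelet transform into split, predict, and update steps that operate in place. The following is only a minimal CPU sketch in Python, assuming the integer Haar wavelet for brevity; it is not the CUDA implementation of this thesis, and the function names are hypothetical.

```python
def haar_lift_forward(signal):
    """One level of the integer Haar transform via lifting (illustrative only)."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    even = list(signal[0::2])  # split: even-indexed samples
    odd = list(signal[1::2])   # split: odd-indexed samples
    # Predict: each odd sample is predicted by its even neighbour;
    # the prediction error becomes the detail coefficient.
    detail = [o - e for o, e in zip(odd, even)]
    # Update: correct the even samples so they hold the floored pairwise mean.
    approx = [e + (d >> 1) for e, d in zip(even, detail)]
    return approx, detail


def haar_lift_inverse(approx, detail):
    """Undo the lifting steps in reverse order with opposite signs."""
    even = [a - (d >> 1) for a, d in zip(approx, detail)]
    odd = [d + e for d, e in zip(detail, even)]
    out = [0] * (2 * len(even))
    out[0::2] = even
    out[1::2] = odd
    return out
```

Because every lifting step is undone by the same step with the opposite sign, reconstruction is exact even in integer arithmetic. Within one step each output sample depends only on a few neighbouring inputs, which is what makes the scheme attractive for data-parallel hardware.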
In chapter 4, we showed how to accelerate the Dirac Video Codec by our fast wavelet lift-
ing implementation on graphics hardware using CUDA. We also accelerated the motion com-
pensation and frame arithmetic stages of this codec. The experiments on high definition video
sequences have demonstrated that one can achieve a speedup factor of more than 7 for the en-
tire decoding process including the CPU steps, and a factor of 15 for just the GPU part. In our
benchmark we could play back a 1080p resolution Dirac video sequence at roughly 50 frames
per second on basic consumer hardware.
In chapter 5 we presented a new method for rendering fluids in real time directly from particle-based
representations without the need for intermediate triangulation, while still producing
a high-quality fluid surface. We also introduced new ideas to add thickness-based shading and
small-scale surface detail to fluids.
In chapter 6 we proposed an efficient data structure, the Sorted Tile List, with associated
operations for the level set representation, and compared the resulting method with the current
method of choice, the DT-Grid method. With regard to performance, given the same numerical
simulation code, our method turned out to be faster by a significant factor. After fine-grain
parallelization using SIMD instructions our method was shown to be roughly 8 times faster.
In chapter 7 we adapted our highly-efficient, sparse, tile-based level set method to leverage
highly parallel architectures such as GPUs. We compared our method to other state-of-the-art,
sparse approaches, and showed that our method is about 20 times faster than the optimized
CPU version of the Sorted Tile List method, and two orders of magnitude faster than the DT-
Grid method. Many level-set applications can benefit from our level-set GPU infrastructure. To
demonstrate its efficiency, we discussed two graphics applications: surface reconstruction from
point clouds and level-set surface editing. Our novel multi-resolution method for surface re-
construction compares favorably with recent, existing techniques and parallel implementations.

8.2 Future outlook

In the land of GPUs, things change fast, very fast. This makes it very difficult to say something
about the future that is not already outdated. What is certain is that, slowly but steadily, the typical
GPU restrictions are being removed. Several limitations existed when this thesis project was
started: 3D floating-point textures were not available, rendering directly into a texture was not
fully implemented, writing to multiple (output) buffers was not yet allowed, and instruction sets
were limited. These obstacles were removed in the course of perhaps not more than a year.
With the advent of CUDA, which was introduced after the work on MIP rendering with mor-
phological pyramids was completed, even more limitations of traditional GPGPU programming
disappeared.
Other limitations have been alleviated but remain a limiting factor in performance. For exam-
ple, the NVidia Tesla (G80) architecture did not support any sort of caching for global memory, so
the programmer had to rely on optimized memory access patterns to reach a significant through-
put. The Fermi (GF100) [95] generation added a cache hierarchy, and relaxed the constraints on
memory access patterns. The cache comes at a price, as part of shared memory is traded for it.
Even with caching, the maximum throughput is achieved by using an optimized access pattern,
so the results described in this thesis remain important.
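Why the access pattern matters can be illustrated with a deliberately simplified model, sketched here in Python. The assumptions are ours and are not measurements from this thesis: a group of 32 threads accesses 4-byte elements, and memory is fetched in aligned 128-byte segments. The function name is hypothetical.

```python
def segments_touched(stride, threads=32, elem_bytes=4, segment_bytes=128):
    """Count the distinct memory segments touched by one group of threads.

    Thread t is assumed to read the element at index t * stride.
    Fewer segments means fewer memory transactions for the same useful data.
    """
    addresses = (t * stride * elem_bytes for t in range(threads))
    return len({addr // segment_bytes for addr in addresses})


# Consecutive threads reading consecutive elements share a single segment,
# whereas a large stride forces one segment per thread.
for stride in (1, 2, 32):
    print(stride, segments_touched(stride))
```

In this model a unit stride needs one segment for all 32 threads, while a stride of 32 elements needs 32 segments to deliver the same amount of useful data. A cache can hide part of this cost, but it cannot reduce the number of segments that must be fetched.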
Lifting such architectural restrictions makes GPU stream processors more complex
and more like CPU cores. On the other hand, CPUs increasingly include GPU-like
features (e.g., instruction sets for exploiting inherent parallelism, or an increasing number of
cores). It is probable that convergence will occur, although it is not yet clear what the end result
should be. GPU vendors should be careful not to generalize too much, as the strength of GPGPU
computing lies in massive parallelism with simple execution units.
8.2.2 Computer Graphics APIs
As graphics cards grow toward full programmability and generality, APIs such as OpenGL/DirectX
will, for graphics programming, probably be overtaken by more convenient higher level graphics
APIs (such as OGRE [63]), which allow full programmability and extensibility under the hood,
by making use of GPGPU APIs such as OpenCL [138]. As these libraries are designed with
the user in mind, they might rely on abstractions such as RenderMan [5] that are used in 3D
rendering for motion pictures.
If programmable GPUs mature like CPUs, which they will probably do, there will be multiple
programming languages to choose from, but the interface to the hardware will be standardized
by the operating system. As this will provide the low-level interface, the graphics APIs will lose their
status as the low-level interface to the hardware and become intermediate-level interfaces. Will there
still be a place for them?
OpenGL started out as a complete set of rendering commands for professional graphics
hardware. DirectX started as a light programming interface for consumer graphics hardware.
Both have evolved enormously since, and have long ago converged with respect to capabilities.
At their heart, they both have the triangle rasterization-oriented graphics pipeline. With programmable
hardware, the full graphics pipeline has been implemented in software [74], which is
much more flexible. In many cases, the focus on a rigid rendering pipeline only gets in the way.
For example, when implementing advanced rendering techniques (such as ray-tracing [122], voxels [26],
or irregular volume rendering [163]) using complex data structures, one does not want
to worry about the triangle rasterization state.
Also, the graphics APIs have their own specific problems that might make them fade away
eventually. OpenGL suffers from a very slow political decision process, which results
in vendors bolting on their own hardware-specific extensions. This makes it very complex for
users, who have to cope with all the combinations of extensions and hardware details. DirectX
has the opposite problem: its decision process is fast and pragmatic, and thus its programming
interface changes drastically with every major release. It is inefficient for programmers to
have to learn a new API every two years, and an application needs to be able to make use of two
or three versions to support older hardware. Also, DirectX is restricted to the MS Windows
operating system, which has a large installed base but is no longer the whole story,
especially in the growing mobile realm.
8.2.3 CUDA
CUDA still has a few user (developer) friendliness issues that hamper its adoption. Although
it is currently used by companies in oil and gas and finance, and at many universities, far outside the
scope of computer graphics and games where it began, adoption of CUDA would be helped by
finding ways to integrate GPGPU into day-to-day experimentation. Some of these issues are the
following.
Ease of programming. Even though GPUs can be programmed in C nowadays, many hardware
pitfalls exist that cause code which is not specifically optimized to run slower than the
CPU implementation. Specific techniques that take knowledge of the hardware into account are
needed, especially since the hardware continuously changes. The NVidia Fermi [95] architec-
ture takes a step in the right direction by adding cache and relaxing the constraints on memory
access patterns. This makes it easier for a programmer to write moderately efficient code. The
authors of CUDA-lite [142] have developed a tool to simplify CUDA programming by help-
ing the programmer to deal with the complex memory hierarchy. Another programming issue
is the difficulty of writing shared memory parallel code. Programmers tend to think in blocks
and objects communicating with each other, for example using MPI [126]. The GRAMPS [130]
programming model is an interesting development in this direction, as it considers the GPU as a
general set of stages connected by queues.
Debugging. It has always been very difficult to debug GPU code. There used to be no support
for single-stepping through code or using breakpoints, and also no way to log simple tracing
messages. The only possibility of debugging was to write to some memory area or texture, and in-
spect this result from the CPU. This has been addressed with the introduction of the CUDA GPU
debugger in CUDA 3.0, which does offer those features. Also, there has been research to make
debugging GPU code more convenient, see Hou et al. [54]. However, there is a general problem
in debugging multi-threaded code: conventional debugger paradigms were invented with single-
threaded usage in mind, and hardly consider thread communication and synchronization, which
are exactly the areas that tend to introduce most bugs.
Scalability. With so-called cloud computing (general-purpose utility computing clusters) on the
rise, scalability is more important than ever. Recently, Amazon EC2, a popular cloud-computing
platform, introduced a new node type with two GPUs. Anyone can now rent a full-blown GPU
cluster for a few hours at a relatively low cost.
Utilizing this cluster is another story, though. Many of the challenges in GPU cluster building,
management and programming are addressed in [66]. However, one issue remains: the use of
clusters adds yet another level to the already complex GPU execution and memory hierarchy.
It would be most efficient if the programmer could write code that runs on clusters as well as
single and multiple GPUs of various sizes, without having to worry about the different levels of
parallelism and data distribution between nodes, except when fine-tuning.
Library support. There is always a demand for more readily usable general-purpose and
application-specific software libraries. Libraries currently provided by NVidia are “cublas” for
linear algebra, “cufft” for fast Fourier transforms, and “cudpp” for generic parallel primitives.
Third-party libraries have also been developed, such as OpenVIDIA [37] for computer vision.
8.2.4 GPGPU for embedded systems
In my opinion a great, mostly open opportunity is the use of GPGPUs in embedded systems,
such as intelligent vehicles that process large amounts of incoming sensor data, cameras that
recognize faces, and other upcoming “smart devices”. GPUs excel at fast signal processing, for
adaptive controller systems, artificial intelligence, pattern matching, and so on. However, there
are still a few issues with this.
Noise and heat. GPU boards have a reputation for producing a lot of heat, which has to be
removed by fans that produce a large amount of noise. Also, even though the performance-to-watt
ratio of GPUs compares very favorably to that of CPUs, they use a lot of power, even when
(partially) inactive. The latter is being addressed with on-chip power-saving techniques.
Platform openness. GPUs are hard to embed due to closed platforms and drivers; there are no
drivers for platforms generally used in embedded systems such as MIPS or ARM. Also, only a
few operating systems are supported.
I/O capability. There is currently no direct I/O possibility except to and from the host system
(and the graphics output). Direct input from sensors and output to peripherals would be useful.
It is an interesting open question how to best integrate the mostly serial nature of external I/O
with the parallel nature of GPU programming.
