Ray Casting Architectures for Volume Visualization
Real-time visualization of large volume datasets demands high-performance computation, pushing the storage, processing, and data communication requirements to the limits of current technology. General-purpose parallel processors have been used to visualize moderate-size datasets at interactive frame rates; however, the cost and size of these supercomputers inhibit their widespread use for real-time visualization. This paper surveys several special-purpose architectures that seek to render volumes at interactive rates. These specialized visualization accelerators have cost, performance, and size advantages over parallel processors. All architectures implement ray casting using parallel and pipelined hardware. We introduce a new metric that normalizes performance to compare these architectures. The architectures included in this survey are VOGUE, VIRIM, Array Based Ray Casting, EM-Cube, and VIZARD II. We also discuss future applications of special-purpose accelerators.
Engineering and Applied Science
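As a rough illustration of the ray casting that the surveyed accelerators implement in hardware, the sketch below composites samples along a single ray front to back. It is a minimal software sketch, not the method of any listed architecture; the function name `composite_ray`, the scalar (grayscale) colour model, and the `opacity_cutoff` parameter are assumptions made for this example.

```python
def composite_ray(samples, opacity_cutoff=0.99):
    """Front-to-back compositing of (color, alpha) samples along one ray.

    `samples` is ordered from the viewer into the volume. Names and the
    scalar colour model are illustrative assumptions, not taken from the
    surveyed architectures.
    """
    color_acc = 0.0
    alpha_acc = 0.0
    for color, alpha in samples:
        # Each sample contributes only through what is still transparent.
        weight = (1.0 - alpha_acc) * alpha
        color_acc += weight * color
        alpha_acc += weight
        # Early ray termination: stop once the ray is effectively opaque.
        if alpha_acc >= opacity_cutoff:
            break
    return color_acc, alpha_acc


if __name__ == "__main__":
    print(composite_ray([(1.0, 0.5), (0.0, 0.5)]))
```

Every ray is independent of every other ray, which is one reason the technique maps well onto parallel and pipelined hardware.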
Visual object-oriented development of parallel applications
PhD Thesis
Developing software for parallel architectures is a notoriously difficult task, compounded further by the range of available parallel architectures. There has been little research effort invested in how to engineer parallel applications for more general problem domains than the traditional numerically intensive domain. This thesis addresses these issues. An object-oriented paradigm for the development of general-purpose parallel applications, with full lifecycle support, is proposed and investigated, and a visual programming language to support that paradigm is developed. This thesis presents experiences and results from experiments with this new model for parallel application development.
Engineering and Physical Sciences Research Council
Analysing Astronomy Algorithms for GPUs and Beyond
Astronomy depends on ever increasing computing power. Processor clock-rates
have plateaued, and increased performance is now appearing in the form of
additional processor cores on a single chip. This poses significant challenges
to the astronomy software community. Graphics Processing Units (GPUs), now
capable of general-purpose computation, exemplify both the difficult
learning-curve and the significant speedups exhibited by massively-parallel
hardware architectures. We present a generalised approach to tackling this
paradigm shift, based on the analysis of algorithms. We describe a small
collection of foundation algorithms relevant to astronomy and explain how they
may be used to ease the transition to massively-parallel computing
architectures. We demonstrate the effectiveness of our approach by applying it
to four well-known astronomy problems: Hogbom CLEAN, inverse ray-shooting for
gravitational lensing, pulsar dedispersion and volume rendering. Algorithms
with well-defined memory access patterns and high arithmetic intensity stand to
receive the greatest performance boost from massively-parallel architectures,
while those that involve a significant amount of decision-making may struggle
to take advantage of the available processing power.
Comment: 10 pages, 3 figures, accepted for publication in MNRA
Performance analysis of direct N-body algorithms for astrophysical simulations on distributed systems
We discuss the performance of direct summation codes used in the simulation
of astrophysical stellar systems on highly distributed architectures. These
codes compute the gravitational interaction among stars in an exact way and
have an O(N^2) scaling with the number of particles. They can be applied to a
variety of astrophysical problems, like the evolution of star clusters, the
dynamics of black holes, the formation of planetary systems, and cosmological
simulations. The simulation of realistic star clusters with sufficiently high
accuracy cannot be performed on a single workstation but may be possible on
parallel computers or grids. We have implemented two parallel schemes for a
direct N-body code and we study their performance on general purpose parallel
computers and large computational grids. We present the results of timing
analyses conducted on the different architectures and compare them with the
predictions from theoretical models. We conclude that the simulation of star
clusters with up to a million particles will be possible on large distributed
computers in the next decade. Simulating entire galaxies, however, will in
addition require new hybrid methods to speed up the calculation.
Comment: 22 pages, 8 figures, accepted for publication in Parallel Computin
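To make the O(N^2) scaling in the abstract above concrete, here is a minimal serial sketch of direct summation: every particle accumulates the gravitational pull of every other particle. It is not the authors' code or either of their parallel schemes; the function name `direct_accelerations`, the unit gravitational constant, and the optional softening length `eps` are assumptions made for this example.

```python
import math


def direct_accelerations(pos, mass, G=1.0, eps=0.0):
    """Accelerations from direct pairwise summation (illustrative sketch).

    pos  : list of [x, y, z] positions
    mass : list of masses
    G    : gravitational constant (assumed 1.0, i.e. N-body units)
    eps  : softening length (an assumption; 0.0 gives the exact force)
    """
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):          # N targets ...
        for j in range(n):      # ... times N sources gives O(N^2) work
            if i == j:
                continue
            dx = [pos[j][k] - pos[i][k] for k in range(3)]
            r2 = dx[0] ** 2 + dx[1] ** 2 + dx[2] ** 2 + eps ** 2
            inv_r3 = 1.0 / (r2 * math.sqrt(r2))
            for k in range(3):
                acc[i][k] += G * mass[j] * dx[k] * inv_r3
    return acc


if __name__ == "__main__":
    print(direct_accelerations([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]], [1.0, 2.0]))
```

The outer loop over `i` has no dependencies between iterations, so it can be split across processors, at the cost of every processor needing the positions of all particles.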
TTC: A Tensor Transposition Compiler for Multiple Architectures
We consider the problem of transposing tensors of arbitrary dimension and
describe TTC, an open source domain-specific parallel compiler. TTC generates
optimized parallel C++/CUDA C code that achieves a significant fraction of the
system's peak memory bandwidth. TTC exhibits high performance across multiple
architectures, including modern AVX-based systems (e.g., Intel Haswell, AMD
Steamroller), Intel's Knights Corner as well as different CUDA-based GPUs such
as NVIDIA's Kepler and Maxwell architectures. We report speedups of TTC over a
meaningful baseline implementation generated by external C++ compilers; the
results suggest that a domain-specific compiler can outperform its general
purpose counterpart significantly: For instance, comparing with Intel's latest
C++ compiler on the Haswell and Knights Corner architecture, TTC yields
speedups of up to and , respectively. We also showcase
TTC's support for multiple leading dimensions, making it a suitable candidate
for the generation of performance-critical packing functions that are at the
core of the ubiquitous BLAS 3 routines.
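For readers unfamiliar with the operation, the sketch below is a naive reference for out-of-place tensor transposition on a flat, row-major array. It only shows what is being computed; it is not TTC's generated code and makes no attempt at the blocking or vectorization a tuned kernel would use. The helper names `row_major_strides` and `transpose_tensor`, and the convention that output axis `k` takes input axis `perm[k]`, are assumptions made for this example.

```python
from itertools import product


def row_major_strides(shape):
    """Element strides for a dense row-major layout."""
    strides = [1] * len(shape)
    for k in range(len(shape) - 2, -1, -1):
        strides[k] = strides[k + 1] * shape[k + 1]
    return strides


def transpose_tensor(data, shape, perm):
    """Naive reference transposition (illustrative, not TTC output).

    Output axis k takes input axis perm[k]. Returns the permuted flat
    data together with the output shape.
    """
    out_shape = [shape[p] for p in perm]
    in_strides = row_major_strides(shape)
    out_strides = row_major_strides(out_shape)
    out = [None] * len(data)
    for idx in product(*(range(s) for s in out_shape)):
        src = sum(idx[k] * in_strides[perm[k]] for k in range(len(perm)))
        dst = sum(idx[k] * out_strides[k] for k in range(len(perm)))
        out[dst] = data[src]
    return out, out_shape


if __name__ == "__main__":
    # A 2x3 matrix transposed to 3x2.
    print(transpose_tensor([0, 1, 2, 3, 4, 5], [2, 3], [1, 0]))
```

The operation performs no arithmetic at all, only loads and stores with different strides on each side, which is why the abstract measures performance against peak memory bandwidth.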
An intelligent processing environment for real-time simulation
The development of a highly efficient and thus truly intelligent processing environment for real-time general purpose simulation of continuous systems is described. Such an environment can be created by mapping the simulation process directly onto the University of Alabama's OPERA architecture. To facilitate this effort, the field of continuous simulation is explored, highlighting areas in which efficiency can be improved. Areas in which parallel processing can be applied are also identified, and several general OPERA-type hardware configurations that support improved simulation are investigated. Three direct execution parallel processing environments are introduced, each of which greatly improves efficiency by exploiting distinct areas of the simulation process. These suggested environments are candidate architectures around which a highly intelligent real-time simulation configuration can be developed.
Active data structures on GPGPUs
Active data structures support operations that may affect a large number of elements of an aggregate data structure. They are well suited for extremely fine grain parallel systems, including circuit parallelism. General purpose GPUs were designed to support regular graphics algorithms, but their intermediate level of granularity makes them potentially viable also for active data structures. We consider the characteristics of active data structures and discuss the feasibility of implementing them on GPGPUs. We describe the GPU implementations of two such data structures (ESF arrays and index intervals), assess their performance, and discuss the potential of active data structures as an unconventional programming model that can exploit the capabilities of emerging fine grain architectures such as GPUs.
The Parallel Algorithm for the 2-D Discrete Wavelet Transform
The discrete wavelet transform can be found at the heart of many
image-processing algorithms. Until now, the transform on general-purpose
processors (CPUs) was mostly computed using a separable lifting scheme. As the
lifting scheme consists of a small number of operations, it is preferred for
processing using single-core CPUs. However, for parallel processing on
multi-core processors, this scheme is inappropriate because of its large number
of steps. On such architectures, the number of steps corresponds to the number
of points that represent the exchange of data. Consequently, these points often
form a performance bottleneck. Our approach appropriately rearranges
calculations inside the transform, and thereby reduces the number of steps. In
other words, we propose a new scheme that is friendly to parallel environments.
When evaluated on multi-core CPUs, our scheme consistently outperforms the original
lifting scheme. The evaluation was performed on 61-core Intel Xeon Phi and
8-core Intel Xeon processors.
Comment: accepted for publication at ICGIP 201
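The "steps" this abstract refers to can be seen in a minimal 1-D lifting sketch. The abstract does not name a wavelet, so the Haar wavelet is assumed here purely for brevity, and the function names are illustrative; this shows the conventional lifting structure, not the rearranged scheme the authors propose. Input length is assumed to be even.

```python
def haar_lift_forward(x):
    """One level of a 1-D Haar transform via lifting (illustrative sketch)."""
    even = list(x[0::2])
    odd = list(x[1::2])
    # Step 1 (predict): detail = odd sample minus its even neighbour.
    detail = [o - e for o, e in zip(odd, even)]
    # Step 2 (update): needs every detail value from step 1, so in a
    # parallel setting the two steps are separated by a data exchange.
    approx = [e + d / 2.0 for e, d in zip(even, detail)]
    return approx, detail


def haar_lift_inverse(approx, detail):
    """Undo the two lifting steps in reverse order."""
    even = [s - d / 2.0 for s, d in zip(approx, detail)]
    odd = [d + e for d, e in zip(detail, even)]
    x = [0.0] * (2 * len(even))
    x[0::2] = even
    x[1::2] = odd
    return x


if __name__ == "__main__":
    a, d = haar_lift_forward([1.0, 3.0, 5.0, 11.0])
    print(a, d, haar_lift_inverse(a, d))
```

A separable 2-D transform applies such steps along rows and then along columns, so the number of dependent steps grows, and with it the number of points at which data must be exchanged.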