Sparse array representations and some selected array operations on GPUs
A dissertation submitted to the Faculty of Science, University of the Witwatersrand, Johannesburg, in fulfilment of the requirements for the degree of Master of Science. Johannesburg, 2014.
A multi-dimensional data model provides a good conceptual view of the data in data warehousing and On-Line
Analytical Processing (OLAP). A typical representation of such a data model is as a multi-dimensional array
which is well suited when the array is dense. If the array is sparse, i.e., has few non-zero elements
relative to the product of the cardinalities of the dimensions, using a multi-dimensional array to represent the
data set requires extremely large memory space while the actual data elements occupy a relatively small fraction
of the space. Existing storage schemes for Multi-Dimensional Sparse Arrays (MDSAs) of higher dimensions
k (k > 2) focus on optimizing the storage utilization, and offer little flexibility in data access efficiency.
Most efficient storage schemes for sparse arrays are limited to matrices, that is, arrays in 2 dimensions. In
this dissertation, we introduce four storage schemes for MDSAs that handle the sparsity of the array with two
primary goals: reducing the storage overhead and maintaining efficient data element access. These schemes,
including a well known method referred to as the Bit Encoded Sparse Storage (BESS), were evaluated and
compared on four basic array operations, namely construction of a scheme, large scale random element access,
sub-array retrieval and multi-dimensional aggregation. The four storage schemes being proposed, together
with the evaluation results, are: i.) The extended compressed row storage (xCRS) which extends the CRS method
for sparse matrix storage to sparse arrays of higher dimensions and achieves the best data element access
efficiency among the methods compared; ii.) The bit encoded xCRS (BxCRS) which optimizes the storage
utilization of xCRS by applying data compression methods with run length encoding, while maintaining its
data access efficiency; iii.) A hybrid approach (Hybrid) that provides the best control of the balance between
the storage utilization and data manipulation efficiency by combining xCRS and BESS; iv.) The PATRICIA
trie compressed storage (PTCS) which uses a PATRICIA trie to store the valid non-zero array elements. PTCS
supports efficient data access, and has a unique property of supporting update operations conveniently; v.)
BESS performs the best for the multi-dimensional aggregation, closely followed by the other schemes.
We also addressed the problem of accelerating some selected array operations using General Purpose Computing
on Graphics Processing Units (GPGPU). The experimental results showed different levels of speed-up,
ranging from 2 to over 20 times, on large scale random element access and sub-array retrieval. In particular, we
utilized GPUs on the computation of the cube operator, a special case of multi-dimensional aggregation, using
BESS. This resulted in a speed-up of 5 to 8 times compared with our CPU-only implementation. The main
contributions of this dissertation include the development, implementation and evaluation of four efficient
schemes to store multi-dimensional sparse arrays, as well as utilizing the massive parallelism of GPUs for some
data warehousing operations.
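The general idea behind bit-encoded coordinate storage can be illustrated with a minimal sketch. It assumes that each dimension's index is packed into a fixed-width bit field of a single integer key, and that the keys are kept sorted for binary-search lookup. The names `bits_needed`, `encode`, `decode` and `BitEncodedSparseArray` are hypothetical, and this is not the dissertation's BESS implementation.

```python
from bisect import bisect_left


def bits_needed(cardinality):
    # Bits required to represent indices 0 .. cardinality-1 (at least 1).
    return max(1, (cardinality - 1).bit_length())


def encode(coords, shape):
    # Pack one index per dimension into a single integer key.
    key = 0
    for index, cardinality in zip(coords, shape):
        if not 0 <= index < cardinality:
            raise IndexError("coordinate out of range")
        key = (key << bits_needed(cardinality)) | index
    return key


def decode(key, shape):
    # Inverse of encode: unpack the bit fields, last dimension first.
    coords = []
    for cardinality in reversed(shape):
        width = bits_needed(cardinality)
        coords.append(key & ((1 << width) - 1))
        key >>= width
    return tuple(reversed(coords))


class BitEncodedSparseArray:
    """Hypothetical container: sorted encoded keys plus a parallel value list."""

    def __init__(self, shape, entries):
        self.shape = tuple(shape)
        pairs = sorted(
            (encode(c, self.shape), v) for c, v in entries.items() if v != 0
        )
        self.keys = [k for k, _ in pairs]
        self.values = [v for _, v in pairs]

    def get(self, coords):
        # Random element access by binary search; absent cells read as 0.
        key = encode(coords, self.shape)
        pos = bisect_left(self.keys, key)
        if pos < len(self.keys) and self.keys[pos] == key:
            return self.values[pos]
        return 0

    def aggregate(self, axis):
        # Sum the stored values grouped by their index along one dimension.
        totals = {}
        for key, value in zip(self.keys, self.values):
            idx = decode(key, self.shape)[axis]
            totals[idx] = totals.get(idx, 0) + value
        return totals
```

In this sketch only the non-zero cells consume memory, and aggregation is a single pass over the stored keys, which is one plausible reason a key-based layout suits aggregation workloads.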
A Unified Optimization Approach for Sparse Tensor Operations on GPUs
Sparse tensors appear in many large-scale applications with multidimensional
and sparse data. While multidimensional sparse data often need to be processed
on manycore processors, attempts to develop highly-optimized GPU-based
implementations of sparse tensor operations are rare. The irregular computation
patterns and sparsity structures as well as the large memory footprints of
sparse tensor operations make such implementations challenging. We leverage the
fact that sparse tensor operations share similar computation patterns to
propose a unified tensor representation called F-COO. Combined with
GPU-specific optimizations, F-COO provides highly-optimized implementations of
sparse tensor computations on GPUs. The performance of the proposed unified
approach is demonstrated for tensor-based kernels such as the Sparse Matricized
Tensor-Times-Khatri-Rao Product (SpMTTKRP) and the Sparse Tensor-Times-Matrix
Multiply (SpTTM) and is used in tensor decomposition algorithms. Compared to
state-of-the-art work, we improve the performance of SpTTM and SpMTTKRP by up to
3.7 and 30.6 times respectively on NVIDIA Titan-X GPUs. We implement a
CANDECOMP/PARAFAC (CP) decomposition and achieve up to 14.9 times speedup using
the unified method over state-of-the-art libraries on NVIDIA Titan-X GPUs.
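For orientation, the SpMTTKRP kernel named above can be sketched sequentially for a third-order tensor. This sketch assumes the plain coordinate (COO) format, not F-COO, whose layout the abstract does not describe. The function name `mttkrp_mode0` and its arguments are hypothetical. For each non-zero X[i, j, k], row i of the output accumulates the element-wise product of row j of B and row k of C, scaled by the value.

```python
def mttkrp_mode0(nonzeros, num_rows, B, C):
    """Mode-0 MTTKRP for a 3rd-order sparse tensor in plain COO form.

    nonzeros: iterable of ((i, j, k), value) pairs.
    B, C: dense factor matrices (lists of rows) with the same number of columns.
    Returns M with M[i][r] = sum over non-zeros of value * B[j][r] * C[k][r].
    """
    rank = len(B[0])
    M = [[0.0] * rank for _ in range(num_rows)]
    for (i, j, k), value in nonzeros:
        row_b, row_c, out = B[j], C[k], M[i]
        for r in range(rank):
            out[r] += value * row_b[r] * row_c[r]
    return M
```

The loop touches each non-zero once, so the work is proportional to the number of non-zeros times the rank. On a GPU the per-non-zero updates to shared output rows are where irregular access and write conflicts arise.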
Flexible Multi-layer Sparse Approximations of Matrices and Applications
The computational cost of many signal processing and machine learning
techniques is often dominated by the cost of applying certain linear operators
to high-dimensional vectors. This paper introduces an algorithm aimed at
reducing the complexity of applying linear operators in high dimension by
approximately factorizing the corresponding matrix into few sparse factors. The
approach relies on recent advances in non-convex optimization. It is first
explained and analyzed in detail and then demonstrated experimentally on
various problems including dictionary learning for image denoising, and the
approximation of large matrices arising in inverse problems.
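The benefit of a product of few sparse factors can be seen on a well-known exact case, used here only as an illustration and not as the learned approximate factorization the paper proposes. The n x n Sylvester Hadamard matrix (n a power of two) equals a product of log2(n) "butterfly" factors with two non-zeros per row. The helper names below are hypothetical.

```python
def butterfly_factor(n, stride):
    # Sparse factor stored row by row as lists of (column, value) pairs.
    rows = []
    for i in range(n):
        partner = i ^ stride
        if (i & stride) == 0:
            rows.append([(i, 1), (partner, 1)])
        else:
            rows.append([(partner, 1), (i, -1)])
    return rows


def hadamard_factors(n):
    # One factor per bit of the index; n is assumed to be a power of two.
    factors = []
    stride = 1
    while stride < n:
        factors.append(butterfly_factor(n, stride))
        stride *= 2
    return factors


def apply_sparse(rows, x):
    # Sparse matrix-vector product: cost is the number of stored non-zeros.
    return [sum(value * x[col] for col, value in row) for row in rows]


def apply_factors(factors, x):
    # Apply the factors one after another instead of one dense matrix.
    for rows in factors:
        x = apply_sparse(rows, x)
    return x
```

For n = 8 the three factors hold 48 non-zeros in total against 64 entries in the dense matrix, and the gap widens as 2 n log2(n) versus n^2 for larger n.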
Tensor Computation: A New Framework for High-Dimensional Problems in EDA
Many critical EDA problems suffer from the curse of dimensionality, i.e. the
very fast-scaling computational burden produced by a large number of parameters
and/or unknown variables. This phenomenon may be caused by multiple spatial or
temporal factors (e.g. 3-D field solvers discretizations and multi-rate circuit
simulation), nonlinearity of devices and circuits, a large number of design or
optimization parameters (e.g. full-chip routing/placement and circuit sizing),
or extensive process variations (e.g. variability/reliability analysis and
design for manufacturability). The computational challenges generated by such
high dimensional problems are generally hard to handle efficiently with
traditional EDA core algorithms that are based on matrix and vector
computation. This paper presents "tensor computation" as an alternative general
framework for the development of efficient EDA algorithms and tools. A tensor
is a high-dimensional generalization of a matrix and a vector, and is a natural
choice for efficiently storing and solving high-dimensional EDA problems.
This paper gives a basic tutorial on tensors, demonstrates some recent examples
of EDA applications (e.g., nonlinear circuit modeling and high-dimensional
uncertainty quantification), and suggests further open EDA problems where the
use of tensor computation could be of advantage.
Comment: 14 figures. Accepted by IEEE Trans. CAD of Integrated Circuits and Systems
Matrix-free GPU implementation of a preconditioned conjugate gradient solver for anisotropic elliptic PDEs
Many problems in geophysical and atmospheric modelling require the fast
solution of elliptic partial differential equations (PDEs) in "flat" three
dimensional geometries. In particular, an anisotropic elliptic PDE for the
pressure correction has to be solved at every time step in the dynamical core
of many numerical weather prediction models, and equations of a very similar
structure arise in global ocean models, subsurface flow simulations and gas and
oil reservoir modelling. The elliptic solve is often the bottleneck of the
forecast, and an algorithmically optimal method has to be used and implemented
efficiently. Graphics Processing Units have been shown to be highly efficient
for a wide range of applications in scientific computing, and recently
iterative solvers have been parallelised on these architectures. We describe
the GPU implementation and optimisation of a Preconditioned Conjugate Gradient
(PCG) algorithm for the solution of a three dimensional anisotropic elliptic
PDE for the pressure correction in NWP. Our implementation exploits the strong
vertical anisotropy of the elliptic operator in the construction of a suitable
preconditioner. As the algorithm is memory bound, performance can be improved
significantly by reducing the amount of global memory access. We achieve this
by using a matrix-free implementation which does not require explicit storage
of the matrix and instead recalculates the local stencil. Global memory access
can also be reduced by rewriting the algorithm using loop fusion and we show
that this further reduces the runtime on the GPU. We demonstrate the
performance of our matrix-free GPU code by comparing it to a sequential CPU
implementation and to a matrix-explicit GPU code which uses existing libraries.
The absolute performance of the algorithm for different problem sizes is
quantified in terms of floating point throughput and global memory bandwidth.
Comment: 18 pages, 7 figures
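The matrix-free idea can be sketched on a small model problem. The sketch below is a sequential, unpreconditioned conjugate gradient on a two-dimensional grid with a 5-point stencil whose vertical coupling `az` is much stronger than its horizontal coupling `ax`. The operator, the coefficients, the zero boundary values and the function names are all assumptions made for illustration. It is not the paper's three-dimensional operator, its preconditioner, or its GPU code.

```python
def apply_operator(u, nx, nz, ax=1.0, az=100.0):
    """Return A u for an anisotropic 5-point stencil without storing A.

    Unknowns are ordered column by column (vertical index k runs fastest);
    values outside the grid are taken as zero.
    """
    out = [0.0] * (nx * nz)
    for i in range(nx):
        for k in range(nz):
            p = i * nz + k
            west = u[p - nz] if i > 0 else 0.0
            east = u[p + nz] if i < nx - 1 else 0.0
            down = u[p - 1] if k > 0 else 0.0
            up = u[p + 1] if k < nz - 1 else 0.0
            centre = (1.0 + 2.0 * ax + 2.0 * az) * u[p]
            out[p] = centre - ax * (west + east) - az * (down + up)
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def conjugate_gradient(apply_a, b, tol=1e-10, max_iter=1000):
    # Standard CG for a symmetric positive definite operator given as a function.
    x = [0.0] * len(b)
    r = list(b)
    p = list(r)
    rr = dot(r, r)
    for _ in range(max_iter):
        if rr ** 0.5 <= tol:
            break
        ap = apply_a(p)
        alpha = rr / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rr_new = dot(r, r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x
```

The solver only ever calls `apply_a`, so the stencil coefficients are recomputed on the fly and no matrix entries are read from memory, which is the property a memory-bound implementation exploits.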