On Optimizing Distributed Tucker Decomposition for Dense Tensors
The Tucker decomposition expresses a given tensor as the product of a small
core tensor and a set of factor matrices. Apart from providing data
compression, the construction is useful in performing analysis such as
principal component analysis (PCA) and finds applications in diverse domains
such as signal processing, computer vision and text analytics. Our objective is
to develop an efficient distributed implementation for the case of dense
tensors. The implementation is based on the HOOI (Higher-Order Orthogonal
Iteration) procedure, wherein the tensor-times-matrix product forms the core
routine. Prior work has proposed heuristics for reducing the computational
load and communication volume incurred by the routine. We study the two metrics
in a formal and systematic manner, and design strategies that are optimal under
the two fundamental metrics. Our experimental evaluation on a large benchmark
of tensors shows that the optimal strategies provide significant reduction in
load and volume compared to prior heuristics, and provide up to 7x speed-up in
the overall running time.
Comment: Preliminary version of the paper appears in the proceedings of IPDPS'1
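To make the role of the tensor-times-matrix (TTM) product concrete, the following is a minimal single-node NumPy sketch of HOOI. It is not the distributed implementation the abstract describes and includes none of its load or communication optimizations. The function names (`unfold`, `ttm`, `hooi`, `reconstruct`), the truncated-HOSVD initialization, and the fixed iteration count are assumptions made for illustration.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n unfolding: bring `mode` to the front and flatten the other axes."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def ttm(tensor, matrix, mode):
    """Tensor-times-matrix product along `mode`; `matrix` has shape (J, I_mode)."""
    out = np.tensordot(matrix, tensor, axes=(1, mode))  # contracted axis is replaced by J, placed first
    return np.moveaxis(out, 0, mode)

def hooi(tensor, ranks, n_iter=10):
    """Illustrative HOOI: returns (core, factors) for the requested multilinear ranks."""
    # Assumed initialization: leading left singular vectors of each unfolding (truncated HOSVD).
    factors = []
    for mode, r in enumerate(ranks):
        u, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(u[:, :r])
    for _ in range(n_iter):
        for mode, r in enumerate(ranks):
            # Chain of TTMs: project onto every factor except the one being updated.
            y = tensor
            for other, f in enumerate(factors):
                if other != mode:
                    y = ttm(y, f.T, other)
            u, _, _ = np.linalg.svd(unfold(y, mode), full_matrices=False)
            factors[mode] = u[:, :r]
    # Core tensor: project the input onto all factors.
    core = tensor
    for mode, f in enumerate(factors):
        core = ttm(core, f.T, mode)
    return core, factors

def reconstruct(core, factors):
    """Multiply the core by each factor matrix to rebuild the approximation."""
    out = core
    for mode, f in enumerate(factors):
        out = ttm(out, f, mode)
    return out
```

Each sweep performs one TTM chain per mode, which is why this product dominates the cost and is the natural target for optimization.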
GraphBLAST: A High-Performance Linear Algebra-based Graph Framework on the GPU
High-performance graph algorithms are challenging to implement on new
parallel hardware such as GPUs for three reasons:
(1) the difficulty of coming up with graph building blocks, (2) load imbalance
on parallel hardware, and (3) graph problems having low arithmetic intensity.
To address some of these challenges, GraphBLAS is an innovative, on-going
effort by the graph analytics community to propose building blocks based on
sparse linear algebra, which will allow graph algorithms to be expressed in a
performant, succinct, composable and portable manner. In this paper, we examine
the performance challenges of a linear-algebra-based approach to building graph
frameworks and describe new design principles for overcoming these bottlenecks.
Among the new design principles is exploiting input sparsity, which allows
users to write graph algorithms without specifying push and pull direction.
Exploiting output sparsity allows users to tell the backend which values of the
output in a single vectorized computation they do not want computed.
Load balancing distributes work evenly amongst parallel workers. We describe
the key load-balancing features for handling graphs
with different characteristics. The design principles described in this paper
have been implemented in "GraphBLAST", the first high-performance linear
algebra-based graph framework on NVIDIA GPUs that is open-source. The results
show that on a single GPU, GraphBLAST has on average at least an order of
magnitude speedup over previous GraphBLAS implementations SuiteSparse and GBTL,
comparable performance to the fastest GPU hardwired primitives and
shared-memory graph frameworks Ligra and Gunrock, and better performance than
any other GPU graph framework, while offering a simpler and more concise
programming model.
Comment: 50 pages, 14 figures, 14 tables
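As a rough illustration of the two kinds of sparsity mentioned above, the pure-Python sketch below runs breadth-first search with two interchangeable traversal steps. It does not use the GraphBLAST or GraphBLAS API. All names (`to_csr`, `push_step`, `pull_step`, `bfs_levels`) and the `pull_threshold` switching rule are hypothetical, and mapping "input sparsity" to the push step and "output sparsity" to the pull step is an interpretive assumption.

```python
def to_csr(n, edges):
    """Build CSR arrays (row_ptr, col_idx) from a list of (src, dst) edges."""
    counts = [0] * n
    for s, _ in edges:
        counts[s] += 1
    row_ptr = [0] * (n + 1)
    for i in range(n):
        row_ptr[i + 1] = row_ptr[i] + counts[i]
    col_idx = [0] * len(edges)
    fill = list(row_ptr[:-1])  # next free slot in each row
    for s, d in edges:
        col_idx[fill[s]] = d
        fill[s] += 1
    return row_ptr, col_idx

def push_step(out_csr, frontier, visited):
    """Input sparsity: touch only the out-edges of vertices in the frontier."""
    row_ptr, col_idx = out_csr
    nxt = set()
    for u in frontier:
        for k in range(row_ptr[u], row_ptr[u + 1]):
            v = col_idx[k]
            if not visited[v]:
                nxt.add(v)
    return nxt

def pull_step(in_csr, frontier, visited):
    """Output sparsity: compute only unvisited outputs by scanning their in-edges."""
    row_ptr, col_idx = in_csr
    nxt = set()
    for v in range(len(visited)):
        if visited[v]:
            continue  # masked out: this output value is not wanted
        for k in range(row_ptr[v], row_ptr[v + 1]):
            if col_idx[k] in frontier:
                nxt.add(v)
                break  # one in-neighbour in the frontier is enough
    return nxt

def bfs_levels(n, edges, source, pull_threshold=0.5):
    """Return the BFS level of every vertex (-1 if unreachable from `source`)."""
    out_csr = to_csr(n, edges)
    in_csr = to_csr(n, [(d, s) for s, d in edges])  # transposed graph
    levels = [-1] * n
    visited = [False] * n
    frontier = {source}
    depth = 0
    while frontier:
        for v in frontier:
            levels[v] = depth
            visited[v] = True
        # Hypothetical rule: pull once the frontier is a large fraction of the graph.
        if len(frontier) > pull_threshold * n:
            frontier = pull_step(in_csr, frontier, visited)
        else:
            frontier = push_step(out_csr, frontier, visited)
        depth += 1
    return levels
```

Both steps produce the same next frontier, so the choice between them affects only the amount of work done. That is what lets a backend pick the direction without the user specifying it.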
Polynomial Chaos Expansion of random coefficients and the solution of stochastic partial differential equations in the Tensor Train format
We apply the Tensor Train (TT) decomposition to construct the tensor product
Polynomial Chaos Expansion (PCE) of a random field, to solve the stochastic
elliptic diffusion PDE with the stochastic Galerkin discretization, and to
compute some quantities of interest (mean, variance, exceedance probabilities).
We assume that the random diffusion coefficient is given as a smooth
transformation of a Gaussian random field. In this case, the PCE is delivered
by a complicated formula, which lacks an analytic TT representation. To
construct its TT approximation numerically, we develop the new block TT cross
algorithm, a method that computes the whole TT decomposition from a few
evaluations of the PCE formula. The new method is conceptually similar to the
adaptive cross approximation in the TT format, but is more efficient when
several tensors must be stored in the same TT representation, which is the case
for the PCE. Besides, we demonstrate how to assemble the stochastic Galerkin
matrix and to compute the solution of the elliptic equation and its
post-processing, staying in the TT format.
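For readers unfamiliar with the TT format itself, the sketch below builds a TT representation of a small dense tensor with the standard TT-SVD procedure. This is not the block TT cross algorithm described above: it needs every entry of the tensor, whereas a cross method samples only a few. The function names and the relative truncation tolerance `tol` are assumptions for illustration.

```python
import numpy as np

def tt_svd(tensor, tol=1e-12):
    """Decompose a dense tensor into TT cores of shape (r_k, n_k, r_{k+1}) by sequential SVDs."""
    shape = tensor.shape
    cores = []
    rank = 1
    mat = tensor.reshape(rank * shape[0], -1)
    for k in range(len(shape) - 1):
        u, s, vt = np.linalg.svd(mat, full_matrices=False)
        # Keep singular values above a relative threshold (assumed truncation rule).
        new_rank = max(1, int(np.sum(s > tol * s[0])))
        cores.append(u[:, :new_rank].reshape(rank, shape[k], new_rank))
        # Carry the remainder S * V^T forward and expose the next mode as rows.
        mat = s[:new_rank, None] * vt[:new_rank]
        rank = new_rank
        mat = mat.reshape(rank * shape[k + 1], -1)
    cores.append(mat.reshape(rank, shape[-1], 1))
    return cores

def tt_to_full(cores):
    """Contract the TT cores back into a dense tensor (for checking small cases only)."""
    out = cores[0]
    for core in cores[1:]:
        out = np.tensordot(out, core, axes=(out.ndim - 1, 0))
    return out[0, ..., 0]  # drop the boundary ranks, which are both 1
```

The cores store on the order of d·n·r² numbers instead of n^d, which is what makes staying in the TT format attractive when the ranks r remain small.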
We compare our technique with the traditional sparse polynomial chaos and the
Monte Carlo approaches. In the tensor product polynomial chaos, the polynomial
degree is bounded for each random variable independently. This provides higher
accuracy than the sparse polynomial set or the Monte Carlo method, but the
cardinality of the tensor product set grows exponentially with the number of
random variables. However, when the PCE coefficients are implicitly
approximated in the TT format, the computations with the full tensor product
polynomial set become possible. In the numerical experiments, we confirm that
the new methodology is competitive in a wide range of parameters, especially
where high accuracy and high polynomial degrees are required.
Comment: This is a major revision of the manuscript arXiv:1406.2816 with
significantly extended numerical experiments. Some unused material is removed
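The exponential growth of the tensor product polynomial set can be checked with a few lines of arithmetic. The sketch below assumes that the "sparse" set is the total-degree set (all multi-indices whose degrees sum to at most p), which is one common choice and may not be the exact set used in the paper. The TT figure is a simple upper bound for a uniform rank r, and all function names are hypothetical.

```python
from math import comb

def tensor_product_size(num_vars, degree):
    """Number of basis polynomials when each variable's degree is bounded independently."""
    return (degree + 1) ** num_vars

def total_degree_size(num_vars, degree):
    """Number of basis polynomials with total degree at most `degree` (assumed sparse set)."""
    return comb(num_vars + degree, degree)

def tt_storage_bound(num_vars, degree, rank):
    """Upper bound on stored numbers for a TT tensor with all ranks at most `rank`."""
    return num_vars * (degree + 1) * rank ** 2

# Example: 10 random variables, polynomial degree 3, TT rank 5.
sizes = (
    tensor_product_size(10, 3),   # 4**10 = 1048576
    total_degree_size(10, 3),     # C(13, 3) = 286
    tt_storage_bound(10, 3, 5),   # 10 * 4 * 25 = 1000
)
```

Under these assumptions the full tensor product set has over a million coefficients, while a rank-5 TT representation of it stores at most a thousand numbers.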