Generalized Filtering Decomposition
This paper introduces a new preconditioning technique that is suitable for
matrices arising from the discretization of a system of PDEs on unstructured
grids. The preconditioner satisfies a so-called filtering property, which
ensures that the preconditioner and the input matrix agree when applied to a
given filtering vector. This vector is chosen to alleviate the effect of low
frequency modes on convergence and so decrease or eliminate the plateau which
is often observed in the convergence of iterative methods. In particular, the
paper presents a general approach that ensures that the filtering condition is
satisfied in a matrix decomposition. The input matrix can have an arbitrary
sparse structure; hence, it can be reordered using nested dissection to allow
parallel computation of both the preconditioner and the iterative process.
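The filtering condition itself is simple to state: the preconditioner M must agree with the input matrix A on the filtering vector t, i.e. M t = A t. The sketch below (plain Python/NumPy, with names chosen here for illustration) enforces the condition for the simplest possible M, a diagonal matrix; the paper's decomposition is of course far richer than this. Taking t as the constant vector targets the low-frequency modes responsible for the convergence plateau mentioned above.

```python
import numpy as np

def filtered_diagonal_preconditioner(A, t):
    """Toy preconditioner M satisfying the filtering condition M @ t == A @ t.

    A diagonal M is the simplest matrix that can be forced to agree with A
    on a single vector t; choosing t = ones targets the constant (lowest
    frequency) mode. The paper's decomposition is far richer; this only
    demonstrates the condition itself.
    """
    t = np.asarray(t, dtype=float)
    assert np.all(t != 0), "filtering vector must have no zero entries"
    d = (A @ t) / t                # componentwise, so that diag(d) @ t == A @ t
    return np.diag(d)

# usage on a small symmetric positive definite example
A = np.array([[ 4.0, -1.0,  0.0],
              [-1.0,  4.0, -1.0],
              [ 0.0, -1.0,  4.0]])
t = np.ones(3)
M = filtered_diagonal_preconditioner(A, t)
assert np.allclose(M @ t, A @ t)   # filtering condition holds
```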
Accelerating Cosmic Microwave Background map-making procedure through preconditioning
Estimation of the sky signal from sequences of time-ordered data is one of
the key steps in Cosmic Microwave Background (CMB) data analysis, commonly
referred to as the map-making problem. Some of the most popular and general
methods proposed for this problem involve solving generalised least squares
(GLS) equations with non-diagonal noise weights given by a block-diagonal
matrix with Toeplitz blocks. In this work we study new map-making solvers
potentially suitable for applications to the largest anticipated data sets.
They are based on iterative conjugate gradient (CG) approaches enhanced with
novel, parallel, two-level preconditioners. We apply the proposed solvers to
examples of simulated non-polarised and polarised CMB observations, and a set
of idealised scanning strategies with sky coverage ranging from nearly a full
sky down to small sky patches. We discuss in detail their implementation for
massively parallel computational platforms and their performance for a broad
range of parameters characterising the simulated data sets. We find that our
best new solver can outperform carefully optimised standard solvers used today
by a factor of as much as 5 in terms of the convergence rate and a factor of up
to in terms of the time to solution, and to do so without significantly
increasing the memory consumption and the volume of inter-processor
communication. The performance of the new algorithms is also found to be more
stable and robust, and less dependent on specific characteristics of the
analysed data set. We therefore conclude that the proposed approaches are well
suited to address successfully challenges posed by new and forthcoming CMB data
sets.
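As a concrete, hedged illustration of the system being solved, the following NumPy/SciPy sketch sets up a toy version of the GLS map-making equations (P^T N^{-1} P) m = P^T N^{-1} d and solves them with preconditioned CG. The sizes, the noise spectrum, and the one-level Jacobi preconditioner are placeholders: real noise weights are block-diagonal with Toeplitz blocks rather than circulant, and the paper's two-level preconditioners additionally deflate the slowly converging modes.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(0)
nt, npix = 2000, 50                       # toy sizes, illustrative only

# Pointing matrix P: each time sample observes one sky pixel.
pix = rng.integers(0, npix, size=nt)
P = csr_matrix((np.ones(nt), (np.arange(nt), pix)), shape=(nt, npix))

# Stationary noise weights applied in Fourier space: a circulant stand-in
# for the block-diagonal Toeplitz weights of real map-making.
f = np.fft.rfftfreq(nt) + 1.0 / nt
inv_spec = f / (f + 0.01)                 # crude 1/f-like down-weighting
def apply_Ninv(v):
    return np.fft.irfft(np.fft.rfft(v) * inv_spec, n=nt)

d = P @ rng.standard_normal(npix) + 0.1 * rng.standard_normal(nt)  # toy data

# GLS normal equations (P^T N^{-1} P) m = P^T N^{-1} d, solved with CG.
A = LinearOperator((npix, npix), matvec=lambda m: P.T @ apply_Ninv(P @ m))
b = P.T @ apply_Ninv(d)

# One-level Jacobi preconditioner from the hit counts; the paper's two-level
# variant additionally deflates the slowly converging low-frequency modes.
hits = np.maximum(np.bincount(pix, minlength=npix), 1).astype(float)
M = LinearOperator((npix, npix), matvec=lambda v: v / hits)

m_est, info = cg(A, b, M=M)
print(info)                               # 0 means CG converged
```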
LU factorization with panel rank revealing pivoting and its communication avoiding version
We present the LU decomposition with panel rank revealing pivoting (LU_PRRP),
an LU factorization algorithm based on strong rank revealing QR panel
factorization. LU_PRRP is more stable than Gaussian elimination with partial
pivoting (GEPP). Our extensive numerical experiments show that the new
factorization scheme is as numerically stable as GEPP in practice, but it is
more resistant to pathological cases and easily solves the Wilkinson matrix and
the Foster matrix. We also present CALU_PRRP, a communication avoiding version
of LU_PRRP that minimizes communication. CALU_PRRP is based on tournament
pivoting, with the selection of the pivots at each step of the tournament being
performed via strong rank revealing QR factorization. CALU_PRRP is more stable
than CALU, the communication avoiding version of GEPP. CALU_PRRP is also more
stable in practice and is resistant to pathological cases on which GEPP and
CALU fail.
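To make the panel strategy concrete, here is a hedged NumPy/SciPy sketch of one LU_PRRP-style block step. SciPy's column-pivoted QR stands in for the strong rank revealing QR the paper actually uses, and all names are illustrative rather than taken from the paper's code.

```python
import numpy as np
from scipy.linalg import qr

def panel_pivot_rows(panel, b):
    """Pick b pivot rows of an m-by-b panel via a rank revealing QR of its
    transpose; SciPy's column-pivoted QR stands in for strong RRQR here."""
    _, _, piv = qr(panel.T, pivoting=True)  # columns of panel.T = rows of panel
    return piv[:b]

def lu_prrp_block_step(A, b):
    """One block step: move the selected rows to the top, then form the Schur
    complement A22 - A21 A11^{-1} A12 with no further pivoting in the block."""
    rows = panel_pivot_rows(A[:, :b], b)
    rest = np.setdiff1d(np.arange(A.shape[0]), rows)
    A = A[np.concatenate([rows, rest]), :]  # pivot rows first
    A11, A12 = A[:b, :b], A[:b, b:]
    A21, A22 = A[b:, :b], A[b:, b:]
    L21 = A21 @ np.linalg.inv(A11)          # multipliers kept small by the RRQR choice
    return L21, A22 - L21 @ A12             # L block and Schur complement

A = np.random.default_rng(0).standard_normal((64, 64))
L21, S = lu_prrp_block_step(A, b=8)
```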
Hybrid static/dynamic scheduling for already optimized dense matrix factorization
We present a hybrid static/dynamic scheduling strategy for the task
dependency graphs of direct methods used in dense numerical linear algebra.
This strategy provides a balance of data locality, load balance, and low
dequeue overhead. We show that using this scheduling in communication
avoiding dense factorization leads to significant performance gains. On a
48-core AMD Opteron NUMA machine, our experiments show that we can achieve up
to a 64% improvement over a version of CALU that uses fully dynamic scheduling,
and up to a 30% improvement over the version of CALU that uses fully static
scheduling. On a 16-core Intel Xeon machine, our hybrid static/dynamic
scheduling approach is up to 8% faster than the version of CALU that uses a
fully static scheduling or fully dynamic scheduling. Our algorithm leads to
speedups over the corresponding routines for computing the LU factorization in
well-known libraries. On the 48-core AMD NUMA machine, our best implementation
is up to 110% faster than MKL, while on the 16-core Intel Xeon machine, it is
up to 82% faster than MKL. Our approach also shows significant speedups
compared with PLASMA on both of these systems.
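A toy version of the hybrid idea can be written in a few lines: each worker first runs through its statically assigned tasks (good locality, no dequeue overhead), then drains a shared queue holding the dynamically scheduled remainder (load balance). The sketch below ignores task dependencies, which the real scheduler must respect, and the 70/30 static/dynamic split is an arbitrary illustrative choice.

```python
import queue
import threading

def hybrid_schedule(tasks, nworkers, static_fraction=0.7):
    """Toy hybrid static/dynamic scheduler (split ratio is illustrative).

    The first static_fraction of the tasks is dealt round-robin to workers
    (locality, zero dequeue overhead); the rest sit in a shared queue that
    idle workers drain (load balance). Task dependencies are ignored here.
    """
    cut = int(len(tasks) * static_fraction)
    static = [tasks[:cut][w::nworkers] for w in range(nworkers)]
    shared = queue.Queue()
    for t in tasks[cut:]:
        shared.put(t)

    def worker(w):
        for t in static[w]:            # statically owned work first
            t()
        while True:                    # then pull from the dynamic pool
            try:
                shared.get_nowait()()
            except queue.Empty:
                return

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(nworkers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()

results = []
hybrid_schedule([lambda i=i: results.append(i * i) for i in range(100)], nworkers=4)
```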
Kronecker Product Approximation Preconditioners for Convection-diffusion Model Problems
We consider the iterative solution of the linear systems arising from four convection-diffusion model problems: the scalar convection-diffusion problem, the Stokes problem, the Oseen problem, and the Navier-Stokes problem. We give the explicit Kronecker product structure of the coefficient matrices, and in particular the Kronecker product structure of the convection term. For the latter three model problems, the coefficient matrices have a 2-by-2 block structure, where each block is a Kronecker product or a sum of several Kronecker products. We use the Kronecker product and block structures to design the diagonal block preconditioner, the tridiagonal block preconditioner, and the constraint preconditioner. The constraint preconditioner can be regarded as a modification of the tridiagonal block and diagonal block preconditioners based on the cell Reynolds number, which explains why it usually performs better. We also give numerical examples to show the efficiency of this kind of Kronecker product approximation preconditioner.
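The computational payoff of a Kronecker product approximation is that applying the preconditioner reduces to small dense solves. The sketch below (illustrative names, NumPy only) uses the identity (B kron C) vec(X) = vec(C X B^T), with vec stacking columns, so a solve with B kron C costs two small dense solves instead of one large one.

```python
import numpy as np

def kron_solve(B, C, y):
    """Solve (B kron C) x = y using only the small factors B and C.

    With vec stacking columns (Fortran order), (B kron C) vec(X) equals
    vec(C @ X @ B.T), so the big solve collapses to X = C^{-1} Y B^{-T}:
    two small dense solves, which is what makes such preconditioners cheap.
    """
    p, q = B.shape[0], C.shape[0]
    Y = y.reshape((q, p), order="F")
    X = np.linalg.solve(C, Y)            # left solve with C
    X = np.linalg.solve(B, X.T).T        # right solve with B
    return X.reshape(-1, order="F")

rng = np.random.default_rng(1)
B = rng.standard_normal((4, 4)) + 4 * np.eye(4)
C = rng.standard_normal((3, 3)) + 4 * np.eye(3)
y = rng.standard_normal(12)
assert np.allclose(np.kron(B, C) @ kron_solve(B, C, y), y)
```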
Randomized block Gram-Schmidt process for solution of linear systems and eigenvalue problems
We propose a block version of the randomized Gram-Schmidt process for
computing a QR factorization of a matrix. Our algorithm inherits the major
properties of its single-vector analogue from [Balabanov and Grigori, 2020]
such as higher efficiency than the classical Gram-Schmidt algorithm and the
stability of the modified Gram-Schmidt algorithm, which can be refined even
further by using multi-precision arithmetic. As in [Balabanov and Grigori,
2020], our algorithm has the advantage of performing standard high-dimensional
operations, which define the overall computational cost, with a unit roundoff
independent of the dominant dimension of the matrix. This unique feature makes
the methodology especially useful for large-scale problems computed on
low-precision arithmetic architectures. Block algorithms are advantageous in
terms of performance as they are mainly based on cache-friendly matrix-wise
operations, and can reduce communication cost in high-performance computing.
The block Gram-Schmidt orthogonalization is the key element in the block
Arnoldi procedure for the construction of a Krylov basis, which in turn is
used in GMRES and Rayleigh-Ritz methods for the solution of linear systems and
clustered eigenvalue problems. In this article, we develop randomized versions
of these methods, based on the proposed randomized Gram-Schmidt algorithm, and
validate them on nontrivial numerical examples.
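The following sketch shows the core of one randomized block Gram-Schmidt pass, with a Gaussian sketching matrix standing in for whatever subspace embedding is used in practice; the block size, sketch size, and all names are illustrative. The point to notice is that every operation in the large dimension n is a matrix product, while the QR factorizations act only on small sketched blocks.

```python
import numpy as np

def rand_block_gs(W, block=4, k=None, seed=0):
    """Sketch of randomized block Gram-Schmidt (Gaussian sketching,
    illustrative sizes). Returns Q, R with W ~= Q @ R, where the sketch
    Theta @ Q has orthonormal columns, so Q is well conditioned whenever
    Theta is a subspace embedding.
    """
    n, m = W.shape
    k = k or 4 * m
    Theta = np.random.default_rng(seed).standard_normal((k, n)) / np.sqrt(k)
    Q, S = np.zeros((n, 0)), np.zeros((k, 0))   # basis and its exact sketch
    R = np.zeros((m, m))
    for j in range(0, m, block):
        Wj = W[:, j:j + block]
        X = S.T @ (Theta @ Wj)                  # projection coefficients, sketched
        V = Wj - Q @ X                          # remove previous directions
        Qs, Rj = np.linalg.qr(Theta @ V)        # small k-by-block QR
        Qj = V @ np.linalg.inv(Rj)              # makes Theta @ Qj == Qs
        R[:j, j:j + block] = X
        R[j:j + block, j:j + block] = Rj
        Q, S = np.hstack([Q, Qj]), np.hstack([S, Qs])
    return Q, R

W = np.random.default_rng(1).standard_normal((10000, 16))
Q, R = rand_block_gs(W)
assert np.allclose(Q @ R, W)
```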
Spherical harmonic transform with GPUs
We describe an algorithm for computing an inverse spherical harmonic
transform suitable for graphics processing units (GPUs). We use CUDA and base our
implementation on a Fortran90 routine included in a publicly available parallel
package, S2HAT. We focus our attention on the two major sequential steps
involved in the computation of the transform, retaining the efficient parallel
framework of the original code. We detail optimization techniques used to
enhance the performance of the CUDA-based code and contrast them with those
implemented in the Fortran90 version. We also present performance comparisons
of a single CPU plus GPU unit with the S2HAT code running on either one or
four processors. In particular, we find that use of the latest generation of GPUs,
such as NVIDIA GF100 (Fermi), can accelerate the spherical harmonic transforms
by as much as 18 times with respect to S2HAT executed on one core, and by as
much as 5.5 times with respect to S2HAT on 4 cores, with the overall performance
being limited by the Fast Fourier transforms. The work presented here has been
performed in the context of the Cosmic Microwave Background simulations and
analysis. However, we expect that the developed software will be of more
general interest and applicability.
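For readers unfamiliar with the two sequential steps mentioned above, the sketch below spells out a minimal and deliberately naive inverse spherical harmonic transform in NumPy/SciPy: associated Legendre sums over l for each iso-latitude ring, followed by an FFT over the azimuthal index m. Production codes such as S2HAT use stable recurrences and careful data layouts instead of the explicit factorials used here; all sizes and the coefficient layout are illustrative.

```python
import numpy as np
from math import factorial
from scipy.special import lpmv

def inverse_sht(alm, lmax, theta, nphi):
    """Naive inverse spherical harmonic transform on iso-latitude rings.

    Step 1: per-ring associated Legendre sums over l (first sequential step);
    step 2: an FFT over the azimuthal index m (second step). alm[l, m] holds
    the m >= 0 coefficients of a real field; layout and sizes are illustrative.
    """
    x = np.cos(theta)
    fm = np.zeros((len(theta), nphi), dtype=complex)
    for m in range(lmax + 1):
        s = np.zeros(len(theta), dtype=complex)
        for l in range(m, lmax + 1):               # step 1: Legendre sums
            norm = np.sqrt((2 * l + 1) / (4 * np.pi)
                           * factorial(l - m) / factorial(l + m))
            s += alm[l, m] * norm * lpmv(m, l, x)
        fm[:, m] = s
        if m:                                      # real field: f_{-m} = conj(f_m)
            fm[:, nphi - m] = np.conj(s)
    return nphi * np.fft.ifft(fm, axis=1).real     # step 2: FFT over m

lmax, ntheta, nphi = 8, 16, 32                     # nphi > 2*lmax avoids aliasing
theta = (np.arange(ntheta) + 0.5) * np.pi / ntheta
alm = np.zeros((lmax + 1, lmax + 1), dtype=complex)
alm[2, 1] = 1.0 + 0.5j
sky = inverse_sht(alm, lmax, theta, nphi)
```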
Randomized Householder QR
This paper introduces a randomized Householder QR factorization (RHQR). This
factorization can be used to obtain a well-conditioned basis of a vector space
and thus can be employed in a variety of applications. The RHQR factorization
of an input matrix A is equivalent to the standard Householder QR
factorization of the matrix ΘA, where Θ is a sketching matrix that can
be obtained from any subspace embedding technique. For this reason, the RHQR
factorization can also be reconstructed from the Householder QR factorization
of the sketched problem, yielding a single-synchronization randomized QR
factorization (recRHQR). In most contexts, left-looking RHQR requires a single
synchronization per iteration, with half the computational cost of Householder
QR, and a similar cost to Randomized Gram-Schmidt (RGS) overall. We discuss the
usage of RHQR factorization in the Arnoldi process and then in GMRES, showing
thus how it can be used in Krylov subspace methods to solve systems of linear
equations. Based on Charles Sheffield's connection between Householder QR and
Modified Gram-Schmidt (MGS), a BLAS2-RGS is also derived. A finite precision
analysis shows that, under mild probabilistic assumptions, the RHQR
factorization of the input matrix inherits the stability of the Householder
QR factorization, producing a well-conditioned basis and a columnwise backward
stable factorization, all independently of the condition number of the input
matrix A, and with the accuracy of the sketching step. We study the subsampled
randomized Hadamard transform (SRHT) as a very stable sketching technique.
Numerical experiments show that RHQR produces a well-conditioned basis whose
sketch is numerically orthogonal and an accurate factorization, even for the
most difficult inputs, and with the high-dimensional operations performed in
half precision.
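The sketch below is not the paper's RHQR, but it illustrates the principle RHQR builds on: perform the Householder QR on the small sketched matrix and let the resulting triangular factor condition the full-size basis. All sizes and names are illustrative, and the one-shot construction shown here lacks the refinements that give RHQR its stability guarantees.

```python
import numpy as np

def sketched_qr_basis(A, k=None, seed=0):
    """One-shot sketched QR (not the paper's RHQR; an illustration only).

    Factor the small sketched matrix Theta @ A instead of A. Its triangular
    factor R already makes Q = A @ R^{-1} a well conditioned basis whose
    sketch Theta @ Q is exactly orthonormal, up to roundoff.
    """
    n, m = A.shape
    k = k or 4 * m
    Theta = np.random.default_rng(seed).standard_normal((k, n)) / np.sqrt(k)
    Qs, R = np.linalg.qr(Theta @ A)        # small k-by-m Householder QR
    Q = np.linalg.solve(R.T, A.T).T        # Q = A @ inv(R), without forming inv(R)
    return Q, R                            # A ~= Q @ R and Theta @ Q == Qs

rng = np.random.default_rng(2)
A = rng.standard_normal((1000, 20)) @ np.diag(np.logspace(0, 6, 20))  # cond ~ 1e6
Q, R = sketched_qr_basis(A)
print(np.linalg.cond(Q))                   # modest, despite cond(A) ~ 1e6
```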
A 3D Parallel Algorithm for QR Decomposition
Interprocessor communication often dominates the runtime of large matrix
computations. We present a parallel algorithm for computing QR decompositions
whose bandwidth cost (communication volume) can be decreased at the cost of
increasing its latency cost (number of messages). By varying a parameter to
navigate the bandwidth/latency tradeoff, we can tune this algorithm for
machines with different communication costs.
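The bandwidth/latency tradeoff is easiest to see in the tall-skinny QR (TSQR) reduction tree that underlies such algorithms. The sketch below is a serial NumPy stand-in: the fan-in of the tree plays the role of the tunable knob, as an analogue (not a reproduction) of the parameter in the paper's 3D algorithm.

```python
import numpy as np

def tsqr_R(A, nblocks=8, fanin=2):
    """Tall-skinny QR via a reduction tree over local R factors.

    Leaves factor their own row block; each tree level stacks `fanin` R
    factors and refactors them. Larger fan-in means fewer levels (fewer
    messages, lower latency cost) but larger stacked blocks per message
    (higher bandwidth cost) -- the tradeoff discussed above, in 1D form.
    """
    Rs = [np.linalg.qr(blk)[1] for blk in np.array_split(A, nblocks)]
    while len(Rs) > 1:
        Rs = [np.linalg.qr(np.vstack(Rs[i:i + fanin]))[1]
              for i in range(0, len(Rs), fanin)]
    return Rs[0]

A = np.random.default_rng(0).standard_normal((4096, 8))
# agrees with LAPACK's R up to row signs
assert np.allclose(np.abs(tsqr_R(A)), np.abs(np.linalg.qr(A)[1]))
```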
Communication Avoiding Gaussian Elimination
This paper presents CALU, a Communication Avoiding algorithm for the LU factorization of dense matrices distributed in a two-dimensional (2D) cyclic layout. The algorithm is based on a new pivoting strategy, referred to as ca-pivoting, that is shown to be stable in practice. The ca-pivoting strategy leads to a significant decrease in the number of messages exchanged during the factorization of a block-column relative to conventional algorithms, and thus CALU overcomes the latency bottleneck of the LU factorization in current implementations such as ScaLAPACK and HPL. The experimental part of this paper focuses on the evaluation of the performance of CALU on two computational systems, an IBM POWER 5 system with 888 compute processors distributed among 111 compute nodes, and a Cray XT4 system with 9660 dual-core AMD Opteron processors. We compare CALU with the ScaLAPACK routine PDGETRF, which computes the LU factorization. Our experiments show that CALU leads to a reduction in the parallel time of the LU factorization. The gain depends on the size of the matrices and on the characteristics of the computer architecture. In particular, the effect is found to be significant when the latency time is an important factor of the overall time, for example when a small matrix is factored on a large number of processors. The factorization of a block-column, referred to as TSLU, reaches a performance of 215 GFLOPs/s on 64 processors of the IBM POWER 5 system, and a performance of 240 GFLOPs/s on 64 processors of the Cray XT4 system. This represents 44% and 36% of the theoretical peak performance on these systems. TSLU outperforms the corresponding ScaLAPACK routine PDGETF2 by up to a factor of 4.37 on the IBM POWER 5 system and up to a factor of 5.58 on the Cray XT4 system. On square matrices of order 10000, CALU outperforms PDGETRF by a factor of 1.24 on IBM POWER 5 and by a factor of 1.31 on Cray XT4, representing 40% and 23% of the peak performance on these systems. The best improvement obtained by CALU is a speedup of 2.29 on IBM POWER 5 and a speedup of 1.81 on Cray XT4.
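A hedged sketch of the tournament idea behind ca-pivoting: each leaf of a reduction tree nominates b candidate pivot rows with ordinary GEPP, and candidate sets are repeatedly merged and re-reduced until b winners remain, so a panel needs one message per tree level rather than one per column. SciPy's lu_factor stands in for the local GEPP, and all names and sizes are illustrative.

```python
import numpy as np
from scipy.linalg import lu_factor

def gepp_candidates(rows, block, b):
    """Indices (into the full panel) of the b pivot rows GEPP picks on `block`."""
    piv = lu_factor(block)[1]              # sequence of row swaps from getrf
    perm = np.arange(block.shape[0])
    for i, p in enumerate(piv):
        perm[i], perm[p] = perm[p], perm[i]
    return rows[perm[:b]]

def tournament_pivots(panel, b, nleaves=4):
    """Toy tournament pivoting for a tall m-by-b panel (assumes m >> b)."""
    idx = np.arange(panel.shape[0])
    sets = [gepp_candidates(r, panel[r], b) for r in np.array_split(idx, nleaves)]
    while len(sets) > 1:                   # one reduction per tree level
        merged = []
        for i in range(0, len(sets), 2):
            r = np.concatenate(sets[i:i + 2])
            merged.append(gepp_candidates(r, panel[r], b))
        sets = merged
    return sets[0]

panel = np.random.default_rng(3).standard_normal((64, 4))
print(tournament_pivots(panel, b=4))       # global indices of the 4 winning rows
```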