Efficient long division via Montgomery multiply
We present a novel right-to-left long division algorithm based on the
Montgomery modular multiply, consisting of separate highly efficient loops with
simple carry structure for computing first the remainder (x mod q) and then the
quotient floor(x/q). These loops are ideally suited for the case where x
occupies many more machine words than the divide modulus q, and are strictly
linear time in the "bitsize ratio" lg(x)/lg(q). For the paradigmatic
performance test of multiword dividend and single 64-bit-word divisor,
exploitation of the inherent data-parallelism of the algorithm effectively
mitigates the long latency of hardware integer MUL operations, as a result of
which we are able to achieve respective costs for remainder-only and full-DIV
(remainder and quotient) of 6 and 12.5 cycles per dividend word on the Intel
Core 2 implementation of the x86_64 architecture, in single-threaded execution
mode. We further describe a simple "bit-doubling modular inversion" scheme,
which allows the entire iterative computation of the mod-inverse required by
the Montgomery multiply at arbitrarily large precision to be performed with
cost less than that of a single Newtonian iteration performed at the full
precision of the final result. We also show how the Montgomery-multiply-based
powering can be efficiently used in Mersenne and Fermat-number trial
factorization via direct computation of a modular inverse power of 2, without
any need for explicit radix-mod scalings.
Comment: 23 pages; 8 tables. v2: Tweak formatting, pagecount -= 2. v3: Fix
incorrect powers of R in formulae [7] and [11]. v4: Add Eldridge & Walter ref.
v5: Clarify relation between Algos A/A',D and Hensel-div; clarify
true-quotient mechanics; add Haswell timings, refs to Agner Fog timings pdf
and GMP asm-timings ref-page. v6: Remove stray +bw in MULL line of Algo D
listing; add note re byte-LUT for qinv_
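The bit-doubling idea mentioned in this abstract can be illustrated with the standard Newton/Hensel iteration for an inverse modulo a power of two. This is a minimal sketch, not the paper's own listing: the function name `mod_inverse_pow2` and the choice of starting value are assumptions made here for illustration. Each pass doubles the number of correct low-order bits, and the arithmetic is truncated to the precision just reached.

```python
def mod_inverse_pow2(q: int, bits: int = 64) -> int:
    """Return qinv with q * qinv == 1 (mod 2**bits), for odd q.

    Hypothetical helper: standard Newton/Hensel iteration, assumed here
    as an illustration of bit-doubling modular inversion.
    """
    if q % 2 == 0:
        raise ValueError("q must be odd to be invertible modulo a power of 2")
    # For any odd q, q*q == 1 (mod 8), so q is its own inverse to 3 bits.
    qinv = q & 7
    good_bits = 3
    while good_bits < bits:
        # Each iteration doubles the number of correct low-order bits.
        good_bits = min(2 * good_bits, bits)
        mask = (1 << good_bits) - 1
        # Newton step, truncated to the precision just reached.
        qinv = (qinv * (2 - q * qinv)) & mask
    return qinv
```

For a 64-bit modulus the loop runs five times (3, 6, 12, 24, 48, 64 bits); only the last step works at full width.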
Scalable Task-Based Algorithm for Multiplication of Block-Rank-Sparse Matrices
A task-based formulation of Scalable Universal Matrix Multiplication
Algorithm (SUMMA), a popular algorithm for matrix multiplication (MM), is
applied to the multiplication of hierarchy-free, rank-structured matrices that
appear in the domain of quantum chemistry (QC). The novel features of our
formulation are: (1) concurrent scheduling of multiple SUMMA iterations, and
(2) fine-grained task-based composition. These features make it tolerant of the
load imbalance due to the irregular matrix structure and eliminate all
artifactual sources of global synchronization. Scalability of iterative
computation of square-root inverse of block-rank-sparse QC matrices is
demonstrated; for full-rank (dense) matrices the performance of our SUMMA
formulation usually exceeds that of the state-of-the-art dense MM
implementations (ScaLAPACK and Cyclops Tensor Framework).
Comment: 8 pages, 6 figures, accepted to IA3 2015. arXiv admin note: text
overlap with arXiv:1504.0504
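For readers unfamiliar with SUMMA, its iteration order can be sketched in a few lines: the product is accumulated as a sum over panels of the inner dimension, each panel contributing one rank-`panel` update. The sketch below is serial and dense, using plain Python lists; it does not model the task-based scheduling, the process grid, or the block-rank-sparse structure the abstract describes. The name `matmul_summa` and the `panel` parameter are assumptions for illustration.

```python
def matmul_summa(a, b, panel=2):
    """Compute C = A * B by accumulating panel-wise updates (SUMMA order).

    Hypothetical serial sketch: in a distributed SUMMA, each iteration would
    broadcast a column panel of A and a row panel of B across the process grid.
    """
    n, kdim, m = len(a), len(b), len(b[0])
    if any(len(row) != kdim for row in a):
        raise ValueError("inner dimensions of A and B do not match")
    c = [[0] * m for _ in range(n)]
    for k0 in range(0, kdim, panel):
        k1 = min(k0 + panel, kdim)
        # One SUMMA iteration: C += A[:, k0:k1] * B[k0:k1, :]
        for i in range(n):
            for j in range(m):
                c[i][j] += sum(a[i][k] * b[k][j] for k in range(k0, k1))
    return c
```

Because each panel update is independent apart from the accumulation into C, several iterations can in principle be in flight at once, which is the property the abstract's concurrent scheduling relies on.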
Fast matrix computations for functional additive models
It is common in functional data analysis to look at a set of related
functions: a set of learning curves, a set of brain signals, a set of spatial
maps, etc. One way to express relatedness is through an additive model, whereby
each individual function is assumed to be a variation
around some shared mean. Gaussian processes provide an elegant way of
constructing such additive models, but suffer from computational difficulties
arising from the matrix operations that need to be performed. Recently Heersink
& Furrer have shown that functional additive models give rise to covariance
matrices that have a specific form they called quasi-Kronecker (QK), whose
inverses are relatively tractable. We show that under additional assumptions
the two-level additive model leads to a class of matrices we call restricted
quasi-Kronecker, which enjoy many interesting properties. In particular, we
formulate matrix factorisations whose complexity scales only linearly in the
number of functions in the latent field, an enormous improvement over the cubic
scaling of naïve approaches. We describe how to leverage the properties of
rQK matrices for inference in Latent Gaussian Models.
Quantum Monte Carlo with very large multideterminant wavefunctions
An algorithm to compute efficiently the first two derivatives of (very) large
multideterminant wavefunctions for quantum Monte Carlo calculations is
presented. The calculation of determinants and their derivatives is performed
using the Sherman-Morrison formula for updating the inverse Slater matrix. An
improved implementation based on the reduction of the number of column
substitutions and on a very efficient implementation of the calculation of the
scalar products involved is presented. It is emphasized that multideterminant
expansions contain in general a large number of identical spin-specific
determinants: for typical configuration interaction-type wavefunctions the
number of unique spin-specific determinants
$N_{\rm dets}^{\sigma}$ ($\sigma=\uparrow,\downarrow$) with a non-negligible weight in the expansion is
of order ${\cal O}(\sqrt{N_{\rm dets}})$. We show that a careful implementation
of the calculation of the $N_{\rm dets}$-dependent contributions can make this
step negligible enough so that in practice the algorithm scales as the total
number of unique spin-specific determinants, $N_{\rm dets}^{\uparrow} + N_{\rm dets}^{\downarrow}$,
over a wide range of total number of determinants (here, $N_{\rm dets}$
up to about one million), thus greatly reducing the total
computational cost. Finally, a new truncation scheme for the multideterminant
expansion is proposed so that larger expansions can be considered without
increasing the computational time. The algorithm is illustrated with
all-electron Fixed-Node Diffusion Monte Carlo calculations of the total energy
of the chlorine atom. Calculations using a trial wavefunction including about
750 000 determinants with a computational increase of 400 compared to a
single-determinant calculation are shown to be feasible.
Comment: 9 pages, 3 figures
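The Sherman-Morrison update mentioned in this abstract can be sketched for a single column substitution. Given the inverse of a matrix A and a new column u for position j, the updated inverse and the determinant ratio det(A')/det(A) follow from one matrix-vector product. This is a textbook sketch under stated assumptions, not the authors' optimized implementation: the name `replace_column` is hypothetical, and exact `Fraction` arithmetic is used only to keep the example easy to verify.

```python
from fractions import Fraction


def replace_column(ainv, j, u):
    """Sherman-Morrison update after replacing column j of A by u.

    ainv is the inverse of the original matrix A (list of rows).
    Returns (new_inverse, ratio) with ratio = det(A') / det(A).
    Hypothetical helper for illustration only.
    """
    n = len(ainv)
    # w = A^{-1} u; its j-th entry is the determinant ratio.
    w = [sum(ainv[i][k] * u[k] for k in range(n)) for i in range(n)]
    ratio = w[j]
    if ratio == 0:
        raise ZeroDivisionError("substituted matrix is singular")
    # Row j of the new inverse is the old row j scaled by 1/ratio.
    row_j = [x / ratio for x in ainv[j]]
    new_inv = []
    for i in range(n):
        if i == j:
            new_inv.append(row_j)
        else:
            # Other rows: subtract w[i] times the new row j.
            new_inv.append([ainv[i][k] - w[i] * row_j[k] for k in range(n)])
    return new_inv, ratio
```

For A = [[2, 1], [1, 1]] with inverse [[1, -1], [-1, 2]], replacing column 0 by (1, 3) gives a determinant ratio of -2 without recomputing any determinant from scratch.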
Exploiting Multiple Levels of Parallelism in Sparse Matrix-Matrix Multiplication
Sparse matrix-matrix multiplication (or SpGEMM) is a key primitive for many
high-performance graph algorithms as well as for some linear solvers, such as
algebraic multigrid. The scaling of existing parallel implementations of SpGEMM
is heavily bound by communication. Even though 3D (or 2.5D) algorithms have
been proposed and theoretically analyzed in the flat MPI model on Erdős-Rényi
matrices, those algorithms had not been implemented in practice and their
complexities had not been analyzed for the general case. In this work, we
present the first ever implementation of the 3D SpGEMM formulation that also
exploits multiple (intra-node and inter-node) levels of parallelism, achieving
significant speedups over the state-of-the-art publicly available codes at all
levels of concurrency. We extensively evaluate our implementation and
identify bottlenecks that should be subject to further research.
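As background for the SpGEMM primitive itself, a serial row-by-row product (in the style usually attributed to Gustavson) can be sketched as follows. This is only the local kernel in its simplest form, assuming a dict-of-dicts sparse format chosen here for brevity; it does not represent the 3D distributed formulation or the multi-level parallelism described in the abstract. The name `spgemm` is an assumption for illustration.

```python
def spgemm(a, b):
    """Sparse C = A * B with matrices stored as {row: {col: value}}.

    Hypothetical serial sketch of row-wise SpGEMM: each row of C is built
    by scaling and merging the rows of B selected by the nonzeros of A.
    """
    c = {}
    for i, a_row in a.items():
        acc = {}  # sparse accumulator for row i of C
        for k, a_ik in a_row.items():
            for j, b_kj in b.get(k, {}).items():
                acc[j] = acc.get(j, 0) + a_ik * b_kj
        # Drop entries that cancelled to zero so the result stays sparse.
        acc = {j: v for j, v in acc.items() if v != 0}
        if acc:
            c[i] = acc
    return c
```

The work done is proportional to the number of nonzero scalar multiplications rather than to the dense dimensions, which is why communication, not arithmetic, tends to dominate at scale.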