30,838 research outputs found
Faster Inversion and Other Black Box Matrix Computations Using Efficient Block Projections
Block projections have been used, in [Eberly et al. 2006], to obtain an
efficient algorithm to find solutions for sparse systems of linear equations. A
bound of softO(n^(2.5)) machine operations is obtained assuming that the input
matrix can be multiplied by a vector with constant-sized entries in softO(n)
machine operations. Unfortunately, the correctness of this algorithm depends on
the existence of efficient block projections, and this has been conjectured. In
this paper we establish the correctness of the algorithm from [Eberly et al.
2006] by proving the existence of efficient block projections over sufficiently
large fields. We demonstrate the usefulness of these projections by deriving
improved bounds for the cost of several matrix problems, considering, in
particular, ``sparse'' matrices that can be be multiplied by a vector using
softO(n) field operations. We show how to compute the inverse of a sparse
matrix over a field F using an expected number of softO(n^(2.27)) operations in
F. A basis for the null space of a sparse matrix, and a certification of its
rank, are obtained at the same cost. An application to Kaltofen and Villard's
Baby-Steps/Giant-Steps algorithms for the determinant and Smith Form of an
integer matrix yields algorithms requiring softO(n^(2.66)) machine operations.
The derived algorithms are all probabilistic of the Las Vegas type
Algorithmic patterns for -matrices on many-core processors
In this work, we consider the reformulation of hierarchical ()
matrix algorithms for many-core processors with a model implementation on
graphics processing units (GPUs). matrices approximate specific
dense matrices, e.g., from discretized integral equations or kernel ridge
regression, leading to log-linear time complexity in dense matrix-vector
products. The parallelization of matrix operations on many-core
processors is difficult due to the complex nature of the underlying algorithms.
While previous algorithmic advances for many-core hardware focused on
accelerating existing matrix CPU implementations by many-core
processors, we here aim at totally relying on that processor type. As main
contribution, we introduce the necessary parallel algorithmic patterns allowing
to map the full matrix construction and the fast matrix-vector
product to many-core hardware. Here, crucial ingredients are space filling
curves, parallel tree traversal and batching of linear algebra operations. The
resulting model GPU implementation hmglib is the, to the best of the authors
knowledge, first entirely GPU-based Open Source matrix library of
this kind. We conclude this work by an in-depth performance analysis and a
comparative performance study against a standard matrix library,
highlighting profound speedups of our many-core parallel approach
Highly parallel sparse Cholesky factorization
Several fine grained parallel algorithms were developed and compared to compute the Cholesky factorization of a sparse matrix. The experimental implementations are on the Connection Machine, a distributed memory SIMD machine whose programming model conceptually supplies one processor per data element. In contrast to special purpose algorithms in which the matrix structure conforms to the connection structure of the machine, the focus is on matrices with arbitrary sparsity structure. The most promising algorithm is one whose inner loop performs several dense factorizations simultaneously on a 2-D grid of processors. Virtually any massively parallel dense factorization algorithm can be used as the key subroutine. The sparse code attains execution rates comparable to those of the dense subroutine. Although at present architectural limitations prevent the dense factorization from realizing its potential efficiency, it is concluded that a regular data parallel architecture can be used efficiently to solve arbitrarily structured sparse problems. A performance model is also presented and it is used to analyze the algorithms
Computational linear algebra over finite fields
We present here algorithms for efficient computation of linear algebra
problems over finite fields
Computing the Rank Profile Matrix
The row (resp. column) rank profile of a matrix describes the staircase shape
of its row (resp. column) echelon form. In an ISSAC'13 paper, we proposed a
recursive Gaussian elimination that can compute simultaneously the row and
column rank profiles of a matrix as well as those of all of its leading
sub-matrices, in the same time as state of the art Gaussian elimination
algorithms. Here we first study the conditions making a Gaus-sian elimination
algorithm reveal this information. Therefore, we propose the definition of a
new matrix invariant, the rank profile matrix, summarizing all information on
the row and column rank profiles of all the leading sub-matrices. We also
explore the conditions for a Gaussian elimination algorithm to compute all or
part of this invariant, through the corresponding PLUQ decomposition. As a
consequence, we show that the classical iterative CUP decomposition algorithm
can actually be adapted to compute the rank profile matrix. Used, in a Crout
variant, as a base-case to our ISSAC'13 implementation, it delivers a
significant improvement in efficiency. Second, the row (resp. column) echelon
form of a matrix are usually computed via different dedicated triangular
decompositions. We show here that, from some PLUQ decompositions, it is
possible to recover the row and column echelon forms of a matrix and of any of
its leading sub-matrices thanks to an elementary post-processing algorithm
Minimisation of Multiplicity Tree Automata
We consider the problem of minimising the number of states in a multiplicity
tree automaton over the field of rational numbers. We give a minimisation
algorithm that runs in polynomial time assuming unit-cost arithmetic. We also
show that a polynomial bound in the standard Turing model would require a
breakthrough in the complexity of polynomial identity testing by proving that
the latter problem is logspace equivalent to the decision version of
minimisation. The developed techniques also improve the state of the art in
multiplicity word automata: we give an NC algorithm for minimising multiplicity
word automata. Finally, we consider the minimal consistency problem: does there
exist an automaton with states that is consistent with a given finite
sample of weight-labelled words or trees? We show that this decision problem is
complete for the existential theory of the rationals, both for words and for
trees of a fixed alphabet rank.Comment: Paper to be published in Logical Methods in Computer Science. Minor
editing changes from previous versio
A scalable H-matrix approach for the solution of boundary integral equations on multi-GPU clusters
In this work, we consider the solution of boundary integral equations by
means of a scalable hierarchical matrix approach on clusters equipped with
graphics hardware, i.e. graphics processing units (GPUs). To this end, we
extend our existing single-GPU hierarchical matrix library hmglib such that it
is able to scale on many GPUs and such that it can be coupled to arbitrary
application codes. Using a model GPU implementation of a boundary element
method (BEM) solver, we are able to achieve more than 67 percent relative
parallel speed-up going from 128 to 1024 GPUs for a model geometry test case
with 1.5 million unknowns and a real-world geometry test case with almost 1.2
million unknowns. On 1024 GPUs of the cluster Titan, it takes less than 6
minutes to solve the 1.5 million unknowns problem, with 5.7 minutes for the
setup phase and 20 seconds for the iterative solver. To the best of the
authors' knowledge, we here discuss the first fully GPU-based
distributed-memory parallel hierarchical matrix Open Source library using the
traditional H-matrix format and adaptive cross approximation with an
application to BEM problems
Faster all-pairs shortest paths via circuit complexity
We present a new randomized method for computing the min-plus product
(a.k.a., tropical product) of two matrices, yielding a faster
algorithm for solving the all-pairs shortest path problem (APSP) in dense
-node directed graphs with arbitrary edge weights. On the real RAM, where
additions and comparisons of reals are unit cost (but all other operations have
typical logarithmic cost), the algorithm runs in time
and is correct with high probability.
On the word RAM, the algorithm runs in time for edge weights in . Prior algorithms used either time for
various , or time for various
and .
The new algorithm applies a tool from circuit complexity, namely the
Razborov-Smolensky polynomials for approximately representing
circuits, to efficiently reduce a matrix product over the algebra to
a relatively small number of rectangular matrix products over ,
each of which are computable using a particularly efficient method due to
Coppersmith. We also give a deterministic version of the algorithm running in
time for some , which utilizes the
Yao-Beigel-Tarui translation of circuits into "nice" depth-two
circuits.Comment: 24 pages. Updated version now has slightly faster running time. To
appear in ACM Symposium on Theory of Computing (STOC), 201
- …