A Multi-level Blocking Distinct Degree Factorization Algorithm
We give a new algorithm for performing the distinct-degree factorization of a
polynomial P(x) over GF(2), using a multi-level blocking strategy. The coarsest
level of blocking replaces GCD computations by multiplications, as suggested by
Pollard (1975), von zur Gathen and Shoup (1992), and others. The novelty of our
approach is that a finer level of blocking replaces multiplications by
squarings, which speeds up the computation in GF(2)[x]/P(x) of certain interval
polynomials when P(x) is sparse. As an application we give a fast algorithm to
search for all irreducible trinomials x^r + x^s + 1 of degree r over GF(2),
while producing a certificate that can be checked in less time than the full
search. Naive algorithms cost O(r^2) per trinomial, thus O(r^3) to search over
all trinomials of given degree r. Under a plausible assumption about the
distribution of factors of trinomials, the new algorithm has complexity O(r^2
(log r)^{3/2}(log log r)^{1/2}) for the search over all trinomials of degree r.
Our implementation achieves a speedup of greater than a factor of 560 over the
naive algorithm in the case r = 24036583 (a Mersenne exponent). Using our
program, we have found two new primitive trinomials of degree 24036583 over
GF(2) (the previous record degree was 6972593).
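As an illustration of the squaring-based arithmetic the abstract refers to, here is a minimal sketch (not the paper's blocked algorithm, and far slower than the authors' implementation) of the standard irreducibility test for a trinomial x^r + x^s + 1 over GF(2), with polynomials packed into Python integers (bit i holds the coefficient of x^i): P is irreducible iff x^(2^r) ≡ x (mod P) and, for every prime q dividing r, gcd(x^(2^(r/q)) + x, P) = 1. Note how each exponentiation by 2^k is just k squarings mod P.

```python
def clmul(a, b):
    """Carry-less (GF(2)[x]) multiplication of two bit-packed polynomials."""
    res = 0
    while b:
        if b & 1:
            res ^= a
        a <<= 1
        b >>= 1
    return res

def mod_reduce(a, p):
    """Remainder of a modulo p in GF(2)[x] (schoolbook division)."""
    dp = p.bit_length() - 1
    while a and a.bit_length() - 1 >= dp:
        a ^= p << (a.bit_length() - 1 - dp)
    return a

def mulmod(a, b, p):
    return mod_reduce(clmul(a, b), p)

def poly_gcd(a, b):
    """Euclidean GCD in GF(2)[x]."""
    while b:
        a, b = b, mod_reduce(a, b)
    return a

def prime_factors(n):
    fs, d = set(), 2
    while d * d <= n:
        while n % d == 0:
            fs.add(d)
            n //= d
        d += 1
    if n > 1:
        fs.add(n)
    return fs

def is_irreducible_trinomial(r, s):
    """Test whether x^r + x^s + 1 is irreducible over GF(2)."""
    p = (1 << r) | (1 << s) | 1
    x = 2  # the polynomial x
    # x^(2^r) mod p via r squarings (in GF(2), addition costs nothing,
    # so repeated squaring dominates -- the operation the paper optimizes)
    t = x
    for _ in range(r):
        t = mulmod(t, t, p)
    if t != x:
        return False
    # no irreducible factor of proper degree r/q
    for q in prime_factors(r):
        t = x
        for _ in range(r // q):
            t = mulmod(t, t, p)
        if poly_gcd(t ^ x, p) != 1:
            return False
    return True
```

For example, `is_irreducible_trinomial(7, 1)` is `True` (x^7 + x + 1 is a well-known irreducible, indeed primitive, trinomial), while `is_irreducible_trinomial(4, 2)` is `False` since x^4 + x^2 + 1 = (x^2 + x + 1)^2.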
Hybrid static/dynamic scheduling for already optimized dense matrix factorization
We present the use of a hybrid static/dynamic scheduling strategy of the task
dependency graph for direct methods used in dense numerical linear algebra.
This strategy provides a balance of data locality, load balance, and low
dequeue overhead. We show that the usage of this scheduling in communication
avoiding dense factorization leads to significant performance gains. On a 48
core AMD Opteron NUMA machine, our experiments show that we can achieve up to
64% improvement over a version of CALU that uses fully dynamic scheduling, and
up to 30% improvement over the version of CALU that uses fully static
scheduling. On a 16-core Intel Xeon machine, our hybrid static/dynamic
scheduling approach is up to 8% faster than the version of CALU that uses a
fully static scheduling or fully dynamic scheduling. Our algorithm leads to
speedups over the corresponding routines for computing LU factorization in well
known libraries. On the 48 core AMD NUMA machine, our best implementation is up
to 110% faster than MKL, while on the 16 core Intel Xeon machine, it is up to
82% faster than MKL. Our approach also shows significant speedups compared with
PLASMA on both of these systems.
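The core idea of a hybrid static/dynamic schedule can be sketched in a few lines. The toy Python below (an illustration only, not CALU or the authors' task-graph scheduler, and with independent tasks rather than a dependency DAG) pre-assigns a tunable fraction of the tasks statically round-robin, which preserves locality and avoids dequeue overhead, and leaves the remaining tail in a shared queue from which workers pull dynamically to absorb load imbalance:

```python
import queue
import threading

def run_hybrid(tasks, n_workers, static_fraction=0.8):
    """Execute independent callables with a hybrid static/dynamic schedule.

    The first static_fraction of tasks is pre-assigned round-robin
    (no contention, good locality); the rest sits in a shared queue
    that idle workers drain (load balance for the stragglers).
    """
    n_static = int(len(tasks) * static_fraction)
    static_parts = [tasks[i:n_static:n_workers] for i in range(n_workers)]
    dynamic_q = queue.Queue()
    for t in tasks[n_static:]:
        dynamic_q.put(t)

    results, lock = [], threading.Lock()

    def worker(my_static):
        local = [t() for t in my_static]   # static phase: fixed assignment
        while True:                        # dynamic phase: pull until empty
            try:
                t = dynamic_q.get_nowait()
            except queue.Empty:
                break
            local.append(t())
        with lock:
            results.extend(local)

    threads = [threading.Thread(target=worker, args=(static_parts[i],))
               for i in range(n_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results
```

The `static_fraction` knob mirrors the trade-off the abstract describes: at 1.0 the schedule is fully static (lowest overhead, worst balance), at 0.0 fully dynamic (best balance, highest dequeue overhead), with the hybrid sitting in between.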
Preparing sparse solvers for exascale computing
Sparse solvers provide essential functionality for a wide variety of scientific applications. Highly parallel sparse solvers are essential for continuing advances in high-fidelity, multi-physics and multi-scale simulations, especially as we target exascale platforms. This paper describes the challenges, strategies and progress of the US Department of Energy Exascale Computing Project towards providing sparse solvers for exascale computing platforms. We address the demands of systems with thousands of high-performance node devices, where exposing concurrency, hiding latency and creating alternative algorithms become essential. The efforts described here are works in progress, highlighting current successes and upcoming challenges. This article is part of a discussion meeting issue 'Numerical algorithms for high-performance computational science'.