3,019 research outputs found

    A Multi-level Blocking Distinct Degree Factorization Algorithm

    Get PDF
    We give a new algorithm for performing the distinct-degree factorization of a polynomial P(x) over GF(2), using a multi-level blocking strategy. The coarsest level of blocking replaces GCD computations by multiplications, as suggested by Pollard (1975), von zur Gathen and Shoup (1992), and others. The novelty of our approach is that a finer level of blocking replaces multiplications by squarings, which speeds up the computation in GF(2)[x]/P(x) of certain interval polynomials when P(x) is sparse. As an application we give a fast algorithm to search for all irreducible trinomials x^r + x^s + 1 of degree r over GF(2), while producing a certificate that can be checked in less time than the full search. Naive algorithms cost O(r^2) per trinomial, thus O(r^3) to search over all trinomials of given degree r. Under a plausible assumption about the distribution of factors of trinomials, the new algorithm has complexity O(r^2 (log r)^{3/2}(log log r)^{1/2}) for the search over all trinomials of degree r. Our implementation achieves a speedup of greater than a factor of 560 over the naive algorithm in the case r = 24036583 (a Mersenne exponent). Using our program, we have found two new primitive trinomials of degree 24036583 over GF(2) (the previous record degree was 6972593)

    Hybrid static/dynamic scheduling for already optimized dense matrix factorization

    Get PDF
    We present the use of a hybrid static/dynamic scheduling strategy of the task dependency graph for direct methods used in dense numerical linear algebra. This strategy provides a balance of data locality, load balance, and low dequeue overhead. We show that the usage of this scheduling in communication avoiding dense factorization leads to significant performance gains. On a 48 core AMD Opteron NUMA machine, our experiments show that we can achieve up to 64% improvement over a version of CALU that uses fully dynamic scheduling, and up to 30% improvement over the version of CALU that uses fully static scheduling. On a 16-core Intel Xeon machine, our hybrid static/dynamic scheduling approach is up to 8% faster than the version of CALU that uses a fully static scheduling or fully dynamic scheduling. Our algorithm leads to speedups over the corresponding routines for computing LU factorization in well known libraries. On the 48 core AMD NUMA machine, our best implementation is up to 110% faster than MKL, while on the 16 core Intel Xeon machine, it is up to 82% faster than MKL. Our approach also shows significant speedups compared with PLASMA on both of these systems
    corecore