
    Cache-Friendly, Modular and Parallel Schemes For Computing Subresultant Chains

    The RegularChains library in Maple offers a collection of commands for solving polynomial systems symbolically while taking advantage of the theory of regular chains. The primary goal of this thesis is to make algorithmic contributions, in particular high-performance computational schemes for subresultant chains and their underlying routines, extending those of RegularChains in an open-source C/C++ library. Subresultants are one of the most fundamental tools in computer algebra. They are at the core of numerous algorithms including, but not limited to, polynomial GCD computations, polynomial system solving, and symbolic integration. When the subresultant chain of two polynomials is involved in a client procedure, not all polynomials of the chain, or not all coefficients of a given subresultant, may be needed. Based on that observation, we design so-called speculative and caching strategies which yield great performance improvements within our polynomial system solver. Our implementation of these techniques has been highly optimized. We have implemented optimized core arithmetic routines and multithreaded subresultant algorithms for univariate, bivariate and multivariate polynomials. We further examine memory access patterns and data locality for computing subresultants of multivariate polynomials, and study different optimization techniques for the fraction-free LU decomposition algorithm to compute subresultants based on determinants of Bézout matrices. Our code is publicly available at www.bpaslib.org as part of the Basic Polynomial Algebra Subprograms (BPAS) library, which is mainly written in C, with concurrency support and user interfaces written in C++.
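
    As a rough illustration of what computing a subresultant chain involves, here is a minimal C++ sketch of the classical subresultant pseudo-remainder sequence (Brown-Traub) for dense univariate integer polynomials. This is not BPAS code: the names are invented, coefficients are plain long long (so large inputs overflow), and none of the speculative, caching or multithreaded strategies of the thesis appear.

        #include <iostream>
        #include <utility>
        #include <vector>

        // Dense univariate polynomial over Z; p[i] is the coefficient of x^i.
        using Poly = std::vector<long long>;

        int deg(const Poly& p) {
            for (int i = (int)p.size() - 1; i >= 0; --i)
                if (p[i] != 0) return i;
            return -1;                                  // the zero polynomial
        }

        long long lc(const Poly& p) { return p[deg(p)]; }

        // Pseudo-remainder: prem(a, b) = lc(b)^(deg a - deg b + 1) * a  rem  b.
        Poly prem(Poly a, const Poly& b) {
            const int db = deg(b);
            const long long c = lc(b);
            int e = deg(a) - db + 1;
            while (deg(a) >= db) {
                const int da = deg(a);
                const long long q = a[da];
                for (auto& x : a) x *= c;               // a <- c*a
                for (int i = 0; i <= db; ++i)           // a <- a - q*x^(da-db)*b
                    a[da - db + i] -= q * b[i];
                --e;
            }
            long long f = 1;
            while (e-- > 0) f *= c;                     // pad leftover powers of c
            for (auto& x : a) x *= f;
            return a;
        }

        // Subresultant PRS (Knuth, TAOCP 4.6.1, Algorithm C); the entries agree
        // with the subresultants of a and b up to sign.
        std::vector<Poly> subresultantPRS(Poly a, Poly b) {
            if (deg(a) < deg(b)) std::swap(a, b);
            std::vector<Poly> prs{a, b};
            long long g = 1, h = 1;
            while (deg(b) > 0) {
                const int d = deg(a) - deg(b);
                Poly r = prem(a, b);
                if (deg(r) < 0) break;                  // remainder vanished
                long long div = g;                      // divide by g*h^d (exact)
                for (int i = 0; i < d; ++i) div *= h;
                for (auto& x : r) x /= div;
                a = b;
                b = r;
                g = lc(a);
                if (d > 0) {                            // h <- g^d / h^(d-1)
                    long long num = 1, den = 1;
                    for (int i = 0; i < d; ++i) num *= g;
                    for (int i = 0; i < d - 1; ++i) den *= h;
                    h = num / den;
                }
                prs.push_back(b);
            }
            return prs;
        }

        int main() {
            // The classic example pair from Knuth, TAOCP 4.6.1.
            Poly a = {-5, 2, 8, -3, -3, 0, 1, 0, 1};
            Poly b = {21, -9, -4, 0, 5, 0, 3};
            for (const Poly& p : subresultantPRS(a, b)) {
                for (int i = deg(p); i >= 0; --i) std::cout << p[i] << ' ';
                std::cout << '\n';
            }
        }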

    Generic design of Chinese remaindering schemes

    We propose a generic design for Chinese remainder algorithms. A Chinese remainder computation consists in reconstructing an integer value from its residues modulo non-coprime integers. We also propose an efficient linear data structure, a radix ladder, for the intermediate storage and computations. Our design is structured into three main modules: a black-box residue computation in charge of computing each residue; a Chinese remaindering controller in charge of launching the computation and of the termination decision; and an integer builder in charge of the reconstruction computation. We then show that this design enables many different forms of Chinese remaindering (e.g. deterministic, early terminated, distributed, etc.), easy comparisons between these forms, and user-transparent parallelism at different parallel grains.
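
    As a hedged illustration of the three-module design sketched above, the following self-contained C++ toy wires together a black-box residue computation, an integer builder and a controller with a heuristic early termination. All names are hypothetical, the moduli are assumed to be pairwise coprime primes (the paper also covers non-coprime moduli), and a plain incremental combination stands in for the paper's radix ladder.

        #include <iostream>
        #include <vector>

        // Extended gcd: returns g = gcd(a, b) and sets x, y with a*x + b*y = g.
        long long egcd(long long a, long long b, long long& x, long long& y) {
            if (b == 0) { x = 1; y = 0; return a; }
            long long x1, y1, g = egcd(b, a % b, x1, y1);
            x = y1;
            y = x1 - (a / b) * y1;
            return g;
        }

        // Module 1: the black-box residue computation.  Here it merely reduces
        // a hidden integer; in practice it would run the whole target
        // computation (say, an integer determinant) modulo m.
        struct ResidueBlackBox {
            long long secret;
            long long operator()(long long m) const { return ((secret % m) + m) % m; }
        };

        // Module 2: the integer builder, which incrementally reconstructs the
        // value.  A plain pairwise combination; the paper's radix ladder
        // organizes these combinations more cleverly.
        struct IntegerBuilder {
            long long value = 0, modulus = 1;
            void add(long long r, long long m) {        // assumes gcd(modulus, m) = 1
                long long x, y;
                egcd(modulus % m, m, x, y);             // modulus * x = 1 (mod m)
                long long diff = ((r - value) % m + m) % m;
                long long t = (long long)((__int128)diff * ((x % m + m) % m) % m);
                value += modulus * t;                   // now value = r (mod m)
                modulus *= m;
            }
        };

        // Module 3: the controller, which feeds moduli to the builder and makes
        // the termination decision: here, heuristic early termination that stops
        // once one extra residue leaves the reconstruction unchanged.
        long long chineseRemainder(const ResidueBlackBox& box,
                                   const std::vector<long long>& primes) {
            IntegerBuilder builder;
            long long last = -1;
            for (long long p : primes) {
                builder.add(box(p), p);
                if (builder.value == last) return last; // early termination
                last = builder.value;
            }
            return builder.value;                       // deterministic fallback
        }

        int main() {
            ResidueBlackBox box{123456789};
            std::vector<long long> primes = {10007, 10009, 10037, 10039, 10061};
            std::cout << chineseRemainder(box, primes) << '\n';   // 123456789
        }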

    Programming matrix algorithms-by-blocks for thread-level parallelism

    With the emergence of thread-level parallelism as the primary means for continued improvement of performance, the programmability issue has reemerged as an obstacle to the use of architectural advances. We argue that evolving legacy libraries for dense and banded linear algebra is not a viable solution due to constraints imposed by early design decisions. We propose a philosophy of abstraction and separation of concerns that provides a promising solution in this problem domain. The first abstraction, FLASH, allows algorithms to express computation with matrices consisting of blocks, facilitating algorithms-by-blocks. Transparent to the library implementor, operand descriptions are registered for a particular operation a priori. A runtime system, SuperMatrix, uses this information to identify data dependencies between suboperations, allowing them to be scheduled to threads out-of-order and executed in parallel. But not all classical algorithms in linear algebra lend themselves to conversion to algorithms-by-blocks. We show how our recently proposed LU factorization with incremental pivoting and the closely related algorithm-by-blocks for the QR factorization, both originally designed for out-of-core computation, overcome this difficulty. Anecdotal evidence regarding the development of routines with core functionality demonstrates how the methodology supports high productivity, while experimental results suggest that high performance is abundantly achievable.
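
    The following toy C++ sketch, not the actual FLASH/SuperMatrix API, illustrates the scheduling idea only: each block records the completion token of its last writer, and every submitted block operation waits on the blocks it touches, so independent suboperations run out-of-order in parallel while dependent ones serialize. Blocks are shrunk to scalars, tasks are plain std::async calls, and write-after-read hazards are ignored (harmless here because the input blocks are written before any task is submitted).

        #include <future>
        #include <initializer_list>
        #include <iostream>
        #include <vector>

        // A "block" of the matrix, shrunk to a scalar to keep the sketch short.
        struct Block {
            double data = 0.0;
            std::shared_future<void> last_write;    // completion of last writer
        };

        struct Runtime {
            std::vector<std::shared_future<void>> submitted;

            // Enqueue the block operation c += a*b; it runs once the previous
            // writers of a, b and c have finished.
            void gemm(const Block& a, const Block& b, Block& c) {
                std::vector<std::shared_future<void>> deps;
                const std::initializer_list<const Block*> ops = {&a, &b, &c};
                for (const Block* blk : ops)
                    if (blk->last_write.valid()) deps.push_back(blk->last_write);
                std::shared_future<void> fut =
                    std::async(std::launch::async, [deps, &a, &b, &c] {
                        for (const auto& d : deps) d.wait();  // honor dependencies
                        c.data += a.data * b.data;
                    }).share();
                c.last_write = fut;          // later tasks touching c wait for us
                submitted.push_back(fut);
            }

            void wait_all() {
                for (const auto& f : submitted) f.wait();
            }
        };

        int main() {
            const int n = 2;                            // 2x2 grid of blocks
            Block A[n][n], B[n][n], C[n][n];
            for (int i = 0; i < n; ++i)
                for (int j = 0; j < n; ++j) {
                    A[i][j].data = i + 1;
                    B[i][j].data = j + 1;
                }
            Runtime rt;                                 // submit in program order;
            for (int i = 0; i < n; ++i)                 // tasks on distinct C(i,j)
                for (int j = 0; j < n; ++j)             // may execute out-of-order
                    for (int k = 0; k < n; ++k)
                        rt.gemm(A[i][k], B[k][j], C[i][j]);
            rt.wait_all();
            for (int i = 0; i < n; ++i, std::cout << '\n')
                for (int j = 0; j < n; ++j) std::cout << C[i][j].data << ' ';
        }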

    Parallel and distributed Gröbner bases computation in JAS

    This paper considers parallel Gröbner bases algorithms on distributed memory parallel computers with multi-core compute nodes. We summarize three different Gröbner bases implementations: shared memory parallel, pure distributed memory parallel, and distributed memory combined with shared memory parallelism. The last algorithm, called distributed hybrid, uses only one control communication channel between the master node and the worker nodes and keeps polynomials in shared memory on a node. The polynomials are transported asynchronously to the control flow of the algorithm in a separate distributed data structure. The implementation is generic and works for all implemented (exact) fields. We present new performance measurements and discuss the performance of the algorithms.
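
    JAS itself is written in Java; the following C++ toy (with invented names, and a print standing in for the actual Gröbner reduction) models only the communication pattern of the distributed hybrid scheme: a single control channel carries small messages, here indices, while the polynomial payloads live once per node in shared memory where every worker thread of that node can read them.

        #include <condition_variable>
        #include <iostream>
        #include <mutex>
        #include <queue>
        #include <string>
        #include <thread>
        #include <vector>

        // The single control channel: carries only small integer messages.
        struct ControlChannel {
            std::queue<int> q;                      // -1 is the shutdown sentinel
            std::mutex m;
            std::condition_variable cv;
            void send(int msg) {
                { std::lock_guard<std::mutex> l(m); q.push(msg); }
                cv.notify_one();
            }
            int receive() {
                std::unique_lock<std::mutex> l(m);
                cv.wait(l, [&] { return !q.empty(); });
                int msg = q.front();
                q.pop();
                return msg;
            }
        };

        int main() {
            // Node-local shared store: in the real system, the polynomials of
            // the intermediate basis, kept once per node and shared by threads.
            std::vector<std::string> shared_polys = {"x^2+y", "x*y-1",
                                                     "y^3-x", "x+y+1"};
            ControlChannel chan;
            std::mutex out;

            auto worker = [&](int id) {
                for (;;) {
                    int i = chan.receive();          // small control message only
                    if (i < 0) { chan.send(-1); break; }  // re-post for siblings
                    std::lock_guard<std::mutex> l(out);   // stand-in for reduction
                    std::cout << "worker " << id << " reduces "
                              << shared_polys[i] << '\n';
                }
            };
            std::thread w1(worker, 1), w2(worker, 2);  // threads of one node
            for (int i = 0; i < (int)shared_polys.size(); ++i)
                chan.send(i);                          // the master's messages
            chan.send(-1);                             // shutdown
            w1.join();
            w2.join();
        }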

    On the Factor Refinement Principle and its Implementation on Multicore Architectures

    The factor refinement principle turns a partial factorization of integers (or polynomials) into a more complete factorization represented by basis elements and exponents, with basis elements that are pairwise coprime. There are many applications of this refinement technique, such as simplifying systems of polynomial inequations and, more generally, speeding up certain algebraic algorithms by eliminating redundant expressions that may occur during intermediate computations. Successive GCD computations and divisions are used to accomplish this task until all the basis elements are pairwise coprime. Moreover, square-free factorization (which is the first step of many factorization algorithms) is used to remove the repeated patterns from each input element. Differentiation, division and GCD calculation operations are required to complete this pre-processing step. Both factor refinement and square-free factorization often rely on plain (quadratic) algorithms for multiplication but can be substantially improved with asymptotically fast multiplication on sufficiently large input. In this work, we review the working principles and complexity estimates of factor refinement, in the case of plain arithmetic as well as asymptotically fast arithmetic. Following this review, we design, analyze and implement parallel adaptations of these factor refinement algorithms. We consider several algorithm optimization techniques, such as data locality analysis and balancing of subproblems, to fully exploit modern multicore architectures. The Cilk++ implementation of our parallel algorithm, based on the augment refinement principle of Bach, Driscoll and Shallit, achieves linear speedup for input data of sufficiently large size.
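
    As a concrete illustration of the refinement step itself, here is a minimal C++ sketch of naive factor refinement by repeated gcd splitting. It is not the thesis's Cilk++ code: names are invented, there is no parallelism, no square-free factorization, and only plain arithmetic. Termination holds because each nontrivial split divides the product of the bases by the gcd.

        #include <algorithm>
        #include <iostream>
        #include <numeric>
        #include <utility>
        #include <vector>

        using Factor = std::pair<long long, int>;       // (base, exponent)

        // Naive factor refinement: rewrite a partial factorization {(n_i, e_i)}
        // as an equal product over a pairwise coprime basis.
        std::vector<Factor> refine(std::vector<Factor> fs) {
            for (bool changed = true; changed; ) {
                changed = false;
                for (size_t i = 0; i < fs.size() && !changed; ++i)
                    for (size_t j = i + 1; j < fs.size() && !changed; ++j) {
                        long long g = std::gcd(fs[i].first, fs[j].first);
                        if (g == 1) continue;
                        // Split both bases by g; g carries the summed exponents.
                        fs.push_back({g, fs[i].second + fs[j].second});
                        fs[i].first /= g;
                        fs[j].first /= g;
                        changed = true;
                    }
                // Drop trivial basis elements equal to 1.
                fs.erase(std::remove_if(fs.begin(), fs.end(),
                                        [](const Factor& f) { return f.first == 1; }),
                         fs.end());
            }
            return fs;
        }

        int main() {
            for (auto [b, e] : refine({{12, 1}, {10, 1}}))   // 12*10 = 3*5*2^3
                std::cout << b << '^' << e << ' ';
            std::cout << '\n';
        }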

    Toward high-performance polynomial system solvers based on triangular decompositions

    The Design and Implementation of a High-Performance Polynomial System Solver

    This thesis examines the algorithmic and practical challenges of solving systems of polynomial equations. We discuss the design and implementation of triangular decomposition to solve polynomial systems exactly by means of symbolic computation. Incremental triangular decomposition solves one equation from the input list of polynomials at a time. Each step may produce several different components (points, curves, surfaces, etc.) of the solution set. Independent components imply that the solving process may proceed on each component concurrently. This so-called component-level parallelism is a theoretical and practical challenge characterized by irregular parallelism. Parallelism is not an algorithmic property but rather a geometrical property of the particular input system's solution set. Despite these challenges, we have effectively applied parallel computing to triangular decomposition through the layering and cooperation of many parallel code regions. This parallel computing is supported by our generic object-oriented framework based on the dynamic multithreading paradigm. Meanwhile, the required polynomial algebra is supported by an object-oriented framework for algebraic types which allows type safety and mathematical correctness to be determined at compile-time. Our software is implemented in C/C++, and we have extensively tested the implementation for correctness and performance on over 3000 polynomial systems that have arisen in practice. The parallel framework has been re-used in the implementation of Hensel factorization as a parallel pipeline to compute roots of a polynomial with multivariate power series coefficients. Hensel factorization is one step toward computing the non-trivial limit points of quasi-components.
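
    The following C++ sketch shows only the shape of component-level parallelism under the fork-join (dynamic multithreading) paradigm: solving one more equation may split a component into several, and sibling components are then solved in concurrently spawned tasks. The algebra is faked with strings; in the real solver a component would be a regular chain, and the hypothetical intersect below would be the operation that intersects an equation with a component.

        #include <cstddef>
        #include <functional>
        #include <future>
        #include <iostream>
        #include <string>
        #include <vector>

        // Stand-in for a component (in the solver: a regular chain).
        struct Component { std::string label; };

        // Fake algebra: pretend every equation splits a component in two.
        std::vector<Component> intersect(const std::string& eq, const Component& c) {
            return {{c.label + "/" + eq + ":a"}, {c.label + "/" + eq + ":b"}};
        }

        // Incremental solving: process equation k on component c; sibling
        // components are independent, so they are solved in parallel tasks.
        std::vector<Component> solve(const std::vector<std::string>& eqs,
                                     std::size_t k, Component c) {
            if (k == eqs.size()) return {c};
            auto children = intersect(eqs[k], c);
            std::vector<std::future<std::vector<Component>>> futs;
            for (std::size_t i = 1; i < children.size(); ++i)  // fork siblings
                futs.push_back(std::async(std::launch::async, solve,
                                          std::cref(eqs), k + 1, children[i]));
            auto result = solve(eqs, k + 1, children[0]);      // recurse here too
            for (auto& f : futs) {
                auto sub = f.get();                            // join
                result.insert(result.end(), sub.begin(), sub.end());
            }
            return result;
        }

        int main() {
            std::vector<std::string> eqs = {"f1", "f2"};
            for (const Component& c : solve(eqs, 0, Component{"C0"}))
                std::cout << c.label << '\n';                  // four leaves
        }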

    Towards Comprehensive Parametric Code Generation Targeting Graphics Processing Units in Support of Scientific Computation

    The most popular multithreaded languages based on the fork-join concurrency model (CilkPlus, OpenMP) are currently being extended to support other forms of parallelism (vectorization, pipelining and single-instruction-multiple-data (SIMD)). In the SIMD case, the objective is to execute the corresponding code on a many-core device, like a GPGPU, for which the CUDA language is a natural choice. Since the programming concepts of CilkPlus and OpenMP are very different from those of CUDA, it is desirable to automatically generate optimized CUDA-like code from CilkPlus or OpenMP. In this thesis, we propose an accelerator model for annotated C/C++ code, together with an implementation that allows the automatic generation of CUDA code. One of the key features of this CUDA code generator is that it supports the generation of CUDA kernel code where program parameters (like the number of threads per block) and machine parameters (like the shared memory size) are treated as unknown symbols. Hence, these parameters need not be known at code-generation time: machine parameters and program parameters can be determined when the generated code is installed on the target machine. In addition, we show how these parametric CUDA programs can be optimized at compile-time in the form of a case discussion, where cases depend on the values of machine parameters (e.g. hardware resource limits) and program parameters (e.g. dimension sizes of thread-blocks). This generation of parametric CUDA kernels requires dealing with non-linear polynomial expressions during the dependence analysis and tiling phase. To achieve these algebraic calculations, we take advantage of techniques from computer algebra, in particular those in the RegularChains library of Maple. Various illustrative examples are provided, together with a performance evaluation.
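
    As a plain C++ stand-in for what a parametric kernel looks like (no GPU required, and not the thesis's generated code), the sketch below keeps the program parameter B, the number of threads per block, symbolic at generation time as a template parameter; it is fixed only at install time by a small case discussion on a machine parameter, here a hypothetical shared-memory threshold. In generated CUDA, the two inner loops would be the block and thread indices.

        #include <cstddef>
        #include <iostream>
        #include <utility>
        #include <vector>

        // Parametric "kernel": the program parameter B (threads per block) is a
        // symbol at code-generation time; the loops below emulate the CUDA grid,
        // where bx/tx would be blockIdx.x and threadIdx.x.
        template <std::size_t B>
        void reverse_kernel(std::vector<int>& a) {
            const std::size_t n = a.size();
            const std::size_t blocks = (n / 2 + B - 1) / B;
            for (std::size_t bx = 0; bx < blocks; ++bx)        // blockIdx.x
                for (std::size_t tx = 0; tx < B; ++tx) {       // threadIdx.x
                    std::size_t i = bx * B + tx;
                    if (i < n / 2) std::swap(a[i], a[n - 1 - i]);
                }
        }

        // Install-time "case discussion": pick B from a machine parameter, e.g.
        // the device's shared-memory budget (the threshold is hypothetical).
        void reverse_installed(std::vector<int>& a, std::size_t shared_mem_bytes) {
            if (shared_mem_bytes >= 48 * 1024) reverse_kernel<256>(a);
            else                               reverse_kernel<64>(a);
        }

        int main() {
            std::vector<int> a = {1, 2, 3, 4, 5};
            reverse_installed(a, 48 * 1024);
            for (int x : a) std::cout << x << ' ';             // 5 4 3 2 1
            std::cout << '\n';
        }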