Parallel multiplication and powering of polynomials
This paper examines the most efficient known serial and parallel algorithms for multiplying and powering polynomials. For sparse polynomials, the Simp algorithm multiplies using a simple divide-and-conquer approach, and the NOMC algorithm computes powers using a multinomial expansion. For dense polynomials, the FFT multiplies and powers by evaluating polynomials at a set of points, performing pointwise multiplication or powering, and interpolating a polynomial through the results. Practical issues of applying these algorithms in algebraic manipulation systems are discussed.
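The dense case the abstract describes is the classic evaluate, multiply pointwise, interpolate pattern. A minimal sketch of that FFT scheme in Python/NumPy (the function name fft_poly_mul and the use of NumPy are our own illustration, not the paper's implementation):

```python
import numpy as np

def fft_poly_mul(a, b):
    """Multiply dense polynomials a and b (coefficient lists, lowest
    degree first) by evaluating at roots of unity, multiplying
    pointwise, and interpolating back, as described in the abstract."""
    n = len(a) + len(b) - 1             # degree bound of the product
    size = 1 << (n - 1).bit_length()    # next power of two for the FFT
    fa = np.fft.rfft(a, size)           # evaluate a at the sample points
    fb = np.fft.rfft(b, size)           # evaluate b at the sample points
    prod = np.fft.irfft(fa * fb, size)  # pointwise multiply, interpolate
    return np.round(prod[:n]).astype(int)  # recover integer coefficients

# (x + 1) * (x + 1) = x^2 + 2x + 1
print(fft_poly_mul([1, 1], [1, 1]))     # -> [1 2 1]
```

Powering fits the same pattern: evaluate once, raise the point values to the k-th power, and interpolate, with the FFT size enlarged to cover degree k times deg(a).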
Gate-Level Simulation of Quantum Circuits
While thousands of experimental physicists and chemists are currently trying
to build scalable quantum computers, it appears that simulation of quantum
computation will be at least as critical as circuit simulation in classical
VLSI design. However, since the work of Richard Feynman in the early 1980s, little progress has been made in practical quantum simulation. Most researchers
focused on polynomial-time simulation of restricted types of quantum circuits
that fall short of the full power of quantum computation. Simulating quantum
computing devices and useful quantum algorithms on classical hardware now
requires excessive computational resources, making many important simulation
tasks infeasible. In this work we propose a new technique for gate-level
simulation of quantum circuits which greatly reduces the difficulty and cost of
such simulations. The proposed technique, termed the Quantum Information Decision Diagram (QuIDD), is implemented in a simulation tool and evaluated by simulating Grover's quantum search algorithm. The back-end of our package,
QuIDD Pro, is based on Binary Decision Diagrams, well-known for their ability
to efficiently represent many seemingly intractable combinatorial structures.
This reliance on a well-established area of research allows us to take
advantage of existing software for BDD manipulation and achieve unparalleled
empirical results for quantum simulation.
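QuIDDs themselves compress the state and operators with BDD techniques; what follows is a deliberately naive dense state-vector simulator in Python/NumPy that shows the gate-level objects being manipulated. The 2^n-amplitude vectors below are exactly what makes naive simulation infeasible and what QuIDDs represent compactly (all names are illustrative; none of this is QuIDDPro code):

```python
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)   # Hadamard gate
X = np.array([[0, 1], [1, 0]])                  # NOT gate

def apply_gate(state, gate, target, n):
    """Apply a one-qubit gate to qubit `target` of an n-qubit dense
    state vector of 2**n amplitudes."""
    # Reshape so the target qubit is its own axis, contract, restore.
    psi = state.reshape([2] * n)
    psi = np.tensordot(gate, psi, axes=([1], [target]))
    psi = np.moveaxis(psi, 0, target)
    return psi.reshape(-1)

n = 3
state = np.zeros(2 ** n)
state[0] = 1.0                          # start in |000>
for q in range(n):                      # uniform superposition, the
    state = apply_gate(state, H, q, n)  # first step of Grover's search
print(np.round(state, 3))               # eight amplitudes of 1/sqrt(8)
```

A highly repetitive vector like this one (all amplitudes equal) is the favorable case for decision-diagram compression.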
FORM version 4.0
We present version 4.0 of the symbolic manipulation system FORM. The most
important new features are manipulation of rational polynomials and the
factorization of expressions. Many other new functions and commands are also
added; some of them are very general, while others are designed for building
specific high-level packages, such as one for Groebner bases. Also new is the checkpoint facility, which allows for periodic backups during long calculations. Lastly, FORM 4.0 has become available as open source under the GNU General Public License version 3.

Comment: 26 pages. Uses axodraw
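FORM programs are written in FORM's own language, which is not reproduced here; as a loose Python/SymPy analogue of the two headline features, rational polynomial manipulation and factorization of expressions:

```python
from sympy import symbols, cancel, factor

x, y = symbols('x y')

# Rational polynomial arithmetic: normalize a ratio of polynomials.
ratio = (x**2 - y**2) / (x**2 + 2*x*y + y**2)
print(cancel(ratio))    # (x - y)/(x + y)

# Factorization of a polynomial expression.
expr = x**4 - y**4
print(factor(expr))     # (x - y)*(x + y)*(x**2 + y**2)
```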
High Performance Sparse Multivariate Polynomials: Fundamental Data Structures and Algorithms
Polynomials may be represented sparsely in an effort to conserve memory and to provide a succinct and natural representation. Moreover, polynomials which are themselves sparse – having very few non-zero terms – waste memory and computation time if represented, and operated on, densely. This waste is exacerbated as the number of variables increases. We provide practical implementations of sparse multivariate data structures focused on data locality and cache complexity. Using these sparse data structures, we develop high-performance algorithms and implementations of fundamental polynomial operations, such as arithmetic (addition, subtraction, multiplication, and division) and interpolation. We revisit a sparse arithmetic scheme introduced by Johnson in 1974, adapting and optimizing these algorithms for modern computer architectures, with our implementations over the integers and rational numbers vastly outperforming the current widespread implementations. We develop a new algorithm for sparse pseudo-division based on the sparse polynomial division algorithm, with very encouraging results. Polynomial interpolation is explored through univariate, dense multivariate, and sparse multivariate methods. Arithmetic and interpolation together form a solid high-performance foundation from which many higher-level and more interesting algorithms can be built.
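The Johnson (1974) scheme mentioned above merges the |f| partial-product streams f_i * g with a heap ordered on monomials, so the product emerges in sorted order and like terms combine on the fly, never materializing all |f|*|g| terms at once. A minimal Python sketch, with integer exponents standing in for packed multivariate monomials (the representation and names are ours, not the thesis's optimized C code):

```python
import heapq

def sparse_mul(f, g):
    """Multiply sparse polynomials f and g, each a list of
    (exponent, coefficient) pairs sorted by decreasing exponent,
    in the style of Johnson (1974)."""
    if not f or not g:
        return []
    # One heap entry per stream i, holding the product f[i] * g[j].
    # heapq is a min-heap, so exponents are negated for max-first order.
    heap = [(-(fe + g[0][0]), i, 0) for i, (fe, _) in enumerate(f)]
    heapq.heapify(heap)
    result = []
    while heap:
        neg_e, i, j = heapq.heappop(heap)
        e, c = -neg_e, f[i][1] * g[j][1]
        if result and result[-1][0] == e:
            result[-1] = (e, result[-1][1] + c)     # combine like terms
        else:
            result.append((e, c))
        if j + 1 < len(g):                          # advance stream i
            heapq.heappush(heap, (-(f[i][0] + g[j + 1][0]), i, j + 1))
    return [(e, c) for e, c in result if c != 0]    # drop cancellations

# (x^2 + 1) * (x^2 - 1) = x^4 - 1
print(sparse_mul([(2, 1), (0, 1)], [(2, 1), (0, -1)]))  # [(4, 1), (0, -1)]
```

The heap holds at most |f| entries, which is the source of the scheme's good cache behavior relative to materializing and sorting every partial product.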
Towards Comprehensive Parametric Code Generation Targeting Graphics Processing Units in Support of Scientific Computation
The most popular multithreaded languages based on the fork-join concurrency model (CilkPlus, OpenMP) are currently being extended to support other forms of parallelism (vectorization, pipelining, and single-instruction-multiple-data (SIMD)). In the SIMD case, the objective is to execute the corresponding code on a many-core device, like a GPGPU, for which the CUDA language is a natural choice. Since the programming concepts of CilkPlus and OpenMP are very different from those of CUDA, it is desirable to automatically generate optimized CUDA-like code from CilkPlus or OpenMP.
In this thesis, we propose an accelerator model for annotated C/C++ code, together with an implementation that allows the automatic generation of CUDA code. One of the key features of this CUDA code generator is that it supports the generation of CUDA kernel code in which program parameters (like the number of threads per block) and machine parameters (like the shared memory size) are treated as unknown symbols. Hence, these parameters need not be known at code-generation time: machine parameters and program parameters can be determined when the generated code is installed on the target machine. In addition, we show how these parametric CUDA programs can be optimized at compile time in the form of a case discussion, where the cases depend on the values of machine parameters (e.g. hardware resource limits) and program parameters (e.g. the dimension sizes of thread blocks). Generating parametric CUDA kernels requires dealing with non-linear polynomial expressions during the dependence analysis and tiling phases. To carry out these algebraic calculations, we take advantage of techniques from computer algebra, in particular the RegularChains library of Maple. Various illustrative examples are provided, together with a performance evaluation.
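As a toy illustration of what "parametric" means here, the sketch below leaves the threads-per-block count symbolic in generated CUDA source and binds it from a machine parameter at install time (the template, names, and heuristic are ours; the thesis's actual generator operates on annotated C/C++ with full dependence analysis, not on strings):

```python
# The program parameter B (threads per block) stays symbolic in the
# template and is only bound at install time, from machine parameters
# such as the device's thread-per-block limit.
KERNEL_TEMPLATE = """\
__global__ void vec_add(const float *x, const float *y,
                        float *z, int n)
{{
    int i = blockIdx.x * {B} + threadIdx.x;   /* B bound at install time */
    if (i < n)
        z[i] = x[i] + y[i];
}}
"""

def bind_parameters(max_threads_per_block):
    """Install-time step: resolve the symbolic B against a machine
    parameter (an illustrative one-case version of the compile-time
    case discussion described above)."""
    b = min(256, max_threads_per_block)
    return KERNEL_TEMPLATE.format(B=b)

print(bind_parameters(max_threads_per_block=1024))
```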
Enabling Massive Deep Neural Networks with the GraphBLAS
Deep Neural Networks (DNNs) have emerged as a core tool for machine learning.
The computations performed during DNN training and inference are dominated by
operations on the weight matrices describing the DNN. As DNNs incorporate more
stages and more nodes per stage, these weight matrices may be required to be
sparse because of memory limitations. The GraphBLAS.org math library standard
was developed to provide high-performance manipulation of sparse weight
matrices and input/output vectors. For sufficiently sparse matrices, a sparse
matrix library requires significantly less memory than the corresponding dense
matrix implementation. This paper provides a brief description of the
mathematics underlying the GraphBLAS. In addition, the equations of a typical
DNN are rewritten in a form designed to use the GraphBLAS. An implementation of
the DNN is given using a preliminary GraphBLAS C library. The performance of
the GraphBLAS implementation is measured relative to a standard dense linear
algebra library implementation. For various sizes of DNN weight matrices, it is
shown that the GraphBLAS sparse implementation outperforms a BLAS dense
implementation as the weight matrix becomes sparser.

Comment: 10 pages, 7 figures, to appear in the 2017 IEEE High Performance Extreme Computing (HPEC) conference
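The rewritten layer equation has the shape y_next = max(0, W y + b) with the weight matrix W stored sparsely. A Python/SciPy stand-in for the GraphBLAS formulation (names and sizes are illustrative; the paper uses a GraphBLAS C library, not SciPy):

```python
import numpy as np
import scipy.sparse as sp

def sparse_dnn_layer(W, y, b):
    """One DNN layer, y_next = max(0, W @ y + b), with W sparse:
    a sparse matrix-vector product followed by a ReLU, the pattern
    the GraphBLAS implementation accelerates for sparse W."""
    return np.maximum(W @ y + b, 0.0)

rng = np.random.default_rng(0)
n = 1024
W = sp.random(n, n, density=0.01, random_state=rng, format="csr")
y = rng.random(n)
b = -0.3 * np.ones(n)                # bias; negative sums get clipped to 0
y_next = sparse_dnn_layer(W, y, b)
print(f"{(y_next > 0).mean():.0%} of outputs active")
```

At 1% density the CSR matrix stores roughly 10^4 nonzeros instead of the 10^6 entries a dense matrix would hold, which is the memory saving the paper quantifies.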