Faster truncated integer multiplication
We present new algorithms for computing the low n bits or the high n bits of
the product of two n-bit integers. We show that these problems may be solved in
asymptotically 75% of the time required to compute the full 2n-bit product,
assuming that the underlying integer multiplication algorithm relies on
computing cyclic convolutions of real sequences.
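For reference, a minimal Python sketch of the two operations in question, written here via the full 2n-bit product that the truncated algorithms are designed to avoid (function names are ours):

```python
def low_product(a: int, b: int, n: int) -> int:
    """Low n bits of the product of two n-bit integers.

    Reference definition only: it forms the full 2n-bit product first,
    which is exactly the work a truncated multiplication avoids.
    """
    return (a * b) & ((1 << n) - 1)


def high_product(a: int, b: int, n: int) -> int:
    """High n bits of the product of two n-bit integers."""
    return (a * b) >> n


assert low_product(0b1011, 0b1101, 4) == (11 * 13) % 16    # 15
assert high_product(0b1011, 0b1101, 4) == (11 * 13) // 16  # 8
```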
Even faster integer multiplication
We give a new proof of F\"urer's bound for the cost of multiplying n-bit
integers in the bit complexity model. Unlike F\"urer, our method does not
require constructing special coefficient rings with "fast" roots of unity.
Moreover, we prove the more explicit bound O(n log n K^(log^* n)) with K = 8.
We show that an optimised variant of F\"urer's algorithm achieves only K = 16,
suggesting that the new algorithm is faster than F\"urer's by a factor of
2^(log^* n). Assuming standard conjectures about the distribution of Mersenne
primes, we give yet another algorithm that achieves K = 4.
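Spelling out the comparison made above (with log^* n, the iterated logarithm, counting how many times log must be applied to n before the result drops to at most 1), the claimed factor between the optimised F\"urer variant and the new algorithm is simply the ratio of the two bounds:

```latex
\[
  \frac{n \log n \cdot 16^{\log^* n}}{n \log n \cdot 8^{\log^* n}}
  = \left(\frac{16}{8}\right)^{\log^* n}
  = 2^{\log^* n}.
\]
```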
Division and conquer
Integer division is an important arithmetic operation on microprocessors. To derive integer division algorithms we present an unconventional approach: a derivation technique in a calculational style that guarantees that the derived algorithms are correct. Four different algorithms are derived using this method: restoring division, non-restoring division, radix-4 division, and division by multiplication. We translate these into descriptions of combinatorial circuits, expressed in Verilog code. Then the circuits are compiled for a Spartan-3 generation FPGA. Finally, we compare the propagation delays and area requirements of these circuits. We show that division by multiplication is much faster than the other methods; however, it only works for 18-bit integers.
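As a point of reference for the first of the four algorithms, here is a minimal Python sketch of restoring (shift-subtract-restore) division; the paper derives it calculationally and implements it as a circuit in Verilog, so the code below only illustrates the underlying recurrence and is not taken from the paper:

```python
def restoring_division(dividend: int, divisor: int, n: int) -> tuple[int, int]:
    """Restoring division of an n-bit unsigned dividend by a nonzero divisor.

    One quotient bit per iteration: shift the partial remainder left,
    bring in the next dividend bit, tentatively subtract the divisor,
    and restore the remainder if the subtraction went negative.
    Returns (quotient, remainder).
    """
    if divisor == 0:
        raise ZeroDivisionError("division by zero")
    quotient, remainder = 0, 0
    for i in reversed(range(n)):                  # most significant bit first
        remainder = (remainder << 1) | ((dividend >> i) & 1)
        remainder -= divisor                      # tentative subtraction
        if remainder < 0:
            remainder += divisor                  # restore
            quotient = quotient << 1              # quotient bit 0
        else:
            quotient = (quotient << 1) | 1        # quotient bit 1
    return quotient, remainder


assert restoring_division(143, 11, 8) == (13, 0)
assert restoring_division(200, 7, 8) == (200 // 7, 200 % 7)
```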
Easy scalar decompositions for efficient scalar multiplication on elliptic curves and genus 2 Jacobians
The first step in elliptic curve scalar multiplication algorithms based on
scalar decompositions using efficient endomorphisms, including
Gallant-Lambert-Vanstone (GLV) and Galbraith-Lin-Scott (GLS) multiplication, as
well as higher-dimensional and higher-genus constructions, is to produce a short
basis of a certain integer lattice involving the eigenvalues of the
endomorphisms. The shorter the basis vectors, the shorter the decomposed scalar
coefficients, and the faster the resulting scalar multiplication. Typically,
knowledge of the eigenvalues allows us to write down a long basis, which we
then reduce using the Euclidean algorithm, Gauss reduction, LLL, or even a more
specialized algorithm. In this work, we use elementary facts about quadratic
rings to immediately write down a short basis of the lattice for the GLV, GLS,
GLV+GLS, and Q-curve constructions on elliptic curves, and for genus 2 real
multiplication constructions. We do not pretend that this represents a
significant optimization in scalar multiplication, since the lattice reduction
step is always an offline precomputation, but it does give a better insight
into the structure of scalar decompositions. In any case, it is always more
convenient to use a ready-made short basis than it is to compute a new one.
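As a concrete illustration of what such a short basis is used for, here is a hedged Python sketch of the standard two-dimensional decomposition step (Babai rounding against a given basis). The names are ours, and the basis vectors are assumed to lie in the usual GLV lattice {(x, y) : x + y*lam ≡ 0 (mod n)}, where lam is the eigenvalue of the endomorphism and n the group order:

```python
from fractions import Fraction


def decompose(k, n, lam, b1, b2):
    """Split a scalar k into (k1, k2) with k ≡ k1 + k2*lam (mod n).

    b1 and b2 must be a basis of the lattice {(x, y) : x + y*lam ≡ 0 (mod n)};
    the shorter the basis vectors, the smaller |k1| and |k2|.  This is plain
    Babai rounding: solve (k, 0) = c1*b1 + c2*b2 over the rationals, round
    the coefficients, and take the difference.
    """
    det = b1[0] * b2[1] - b1[1] * b2[0]
    c1 = round(Fraction(k * b2[1], det))
    c2 = round(Fraction(-k * b1[1], det))
    k1 = k - c1 * b1[0] - c2 * b2[0]
    k2 = -c1 * b1[1] - c2 * b2[1]
    assert (k1 + k2 * lam - k) % n == 0
    return k1, k2


# Toy check with the obvious long basis (n, 0), (-lam, 1); a short basis
# (which the paper writes down directly) yields much smaller k1 and k2.
n, lam, k = 1009, 123, 777
print(decompose(k, n, lam, (n, 0), (-lam, 1)))
```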
DGEMM on Integer Matrix Multiplication Unit
Deep learning hardware achieves high throughput and low power consumption by
reducing computing precision and specializing in matrix multiplication. For
machine learning inference, fixed-point value computation is commonplace, where
the input and output values and the model parameters are quantized. Thus, many
processors are now equipped with fast integer matrix multiplication units
(IMMU). It is of significant interest to find a way to harness these IMMUs to
improve the performance of HPC applications while maintaining accuracy. We
focus on the Ozaki scheme, which computes a high-precision matrix
multiplication by using lower-precision computing units, and show the
advantages and disadvantages of using IMMU. The experiment using integer Tensor
Cores shows that we can compute double-precision matrix multiplication faster
than cuBLAS and an existing Ozaki scheme implementation on FP16 Tensor Cores on
NVIDIA consumer GPUs. Furthermore, we demonstrate accelerating a quantum
circuit simulation by a factor of up to 4.33 while maintaining FP64 accuracy.
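As a rough illustration of the structure being exploited, the following Python sketch splits each FP64 operand into a few narrow integer slices, multiplies the slices with exact integer matrix products (np.int64 matmuls standing in for the Tensor Core IMMU), and accumulates the scaled partial results. This is only a toy version of the slice-and-accumulate idea, not the error-free splitting of the actual Ozaki scheme, and all names and parameters are illustrative:

```python
import numpy as np


def split_to_int_slices(A, num_slices=4, bits=12):
    """Split an FP64 matrix as A ≈ sum_i slices[i] * scales[i] with integer slices.

    Each slice keeps roughly `bits` bits of the remaining magnitude, so the
    slice-by-slice products below stay exact in 64-bit integer arithmetic.
    """
    slices, scales = [], []
    residual = np.array(A, dtype=np.float64, copy=True)
    for _ in range(num_slices):
        amax = np.max(np.abs(residual))
        if amax == 0.0:
            break
        scale = 2.0 ** (bits - 1 - np.floor(np.log2(amax)))   # power-of-two scale
        s = np.round(residual * scale).astype(np.int64)
        slices.append(s)
        scales.append(1.0 / scale)
        residual = residual - s / scale                       # what is still missing
    return slices, scales


def int_slice_matmul(A, B):
    """Approximate A @ B by accumulating exact integer slice products."""
    A_s, A_c = split_to_int_slices(A)
    B_s, B_c = split_to_int_slices(B)
    C = np.zeros((A.shape[0], B.shape[1]))
    for Ai, ai in zip(A_s, A_c):
        for Bj, bj in zip(B_s, B_c):
            C += (Ai @ Bj).astype(np.float64) * (ai * bj)     # integer matmul = IMMU call
    return C


rng = np.random.default_rng(0)
A = rng.standard_normal((64, 64))
B = rng.standard_normal((64, 64))
print(np.max(np.abs(int_slice_matmul(A, B) - A @ B)))  # far smaller than an FP32 product's error
```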
On Nondeterministic Derandomization of Freivalds' Algorithm: Consequences, Avenues and Algorithmic Progress
Motivated by studying the power of randomness, certifying algorithms and barriers for fine-grained reductions, we investigate the question whether the multiplication of two n x n matrices can be performed in near-optimal nondeterministic time O~(n^2). Since a classic algorithm due to Freivalds verifies correctness of matrix products probabilistically in time O(n^2), our question is a relaxation of the open problem of derandomizing Freivalds' algorithm.
We discuss consequences of a positive or negative resolution of this problem and provide potential avenues towards resolving it. In particular, we show that sufficiently fast deterministic verifiers for 3SUM or univariate polynomial identity testing yield faster deterministic verifiers for matrix multiplication. Furthermore, as partial algorithmic progress, we show that distinguishing whether an integer matrix product is correct or contains between 1 and n erroneous entries can be done in time O~(n^2); interestingly, the difficult case of deterministic matrix product verification is not a problem of "finding a needle in the haystack", but rather one of cancellation effects in the presence of many errors.
Our main technical contribution is a deterministic algorithm that corrects an integer matrix product containing at most t errors in time O~(sqrt{t} n^2 + t^2). To obtain this result, we show how to compute an integer matrix product with at most t nonzeroes in the same running time. This improves upon known deterministic output-sensitive integer matrix multiplication algorithms for t = Omega(n^{2/3}) nonzeroes, which is of independent interest.
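For context, the Freivalds verifier referred to above takes only a few lines; the open question is whether its guarantee can be matched deterministically in the same near-optimal time. A minimal Python sketch (names and parameters are ours):

```python
import numpy as np


def freivalds_verify(A, B, C, rounds=20, rng=None):
    """Probabilistically check whether A @ B == C in O(rounds * n^2) time.

    Each round draws a random 0/1 vector x and compares A @ (B @ x) with C @ x.
    If A @ B != C, a single round exposes the error with probability at least
    1/2, so the false-accept probability is at most 2**(-rounds).
    """
    rng = rng or np.random.default_rng()
    for _ in range(rounds):
        x = rng.integers(0, 2, size=C.shape[1])   # random vector in {0, 1}^n
        if not np.array_equal(A @ (B @ x), C @ x):
            return False                          # definite certificate of an error
    return True                                   # correct with high probability


# Small integer example (exact arithmetic, so no rounding issues).
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
print(freivalds_verify(A, B, A @ B))                                # True
print(freivalds_verify(A, B, A @ B + np.array([[0, 1], [0, 0]])))   # False (w.h.p.)
```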