1,481 research outputs found
Efficient implementation of the Hardy-Ramanujan-Rademacher formula
We describe how the Hardy-Ramanujan-Rademacher formula can be implemented to
allow the partition function to be computed with softly optimal
complexity and very little overhead. A new implementation
based on these techniques achieves speedups in excess of a factor 500 over
previously published software and has been used by the author to calculate
, an exponent twice as large as in previously reported
computations.
We also investigate performance for multi-evaluation of , where our
implementation of the Hardy-Ramanujan-Rademacher formula becomes superior to
power series methods on far denser sets of indices than previous
implementations. As an application, we determine over 22 billion new
congruences for the partition function, extending Weaver's tabulation of 76,065
congruences.Comment: updated version containing an unconditional complexity proof;
accepted for publication in LMS Journal of Computation and Mathematic
Generating and Searching Families of FFT Algorithms
A fundamental question of longstanding theoretical interest is to prove the
lowest exact count of real additions and multiplications required to compute a
power-of-two discrete Fourier transform (DFT). For 35 years the split-radix
algorithm held the record by requiring just 4n log n - 6n + 8 arithmetic
operations on real numbers for a size-n DFT, and was widely believed to be the
best possible. Recent work by Van Buskirk et al. demonstrated improvements to
the split-radix operation count by using multiplier coefficients or "twiddle
factors" that are not n-th roots of unity for a size-n DFT. This paper presents
a Boolean Satisfiability-based proof of the lowest operation count for certain
classes of DFT algorithms. First, we present a novel way to choose new yet
valid twiddle factors for the nodes in flowgraphs generated by common
power-of-two fast Fourier transform algorithms, FFTs. With this new technique,
we can generate a large family of FFTs realizable by a fixed flowgraph. This
solution space of FFTs is cast as a Boolean Satisfiability problem, and a
modern Satisfiability Modulo Theory solver is applied to search for FFTs
requiring the fewest arithmetic operations. Surprisingly, we find that there
are FFTs requiring fewer operations than the split-radix even when all twiddle
factors are n-th roots of unity.Comment: Preprint submitted on March 28, 2011, to the Journal on
Satisfiability, Boolean Modeling and Computatio
Pruned Bit-Reversal Permutations: Mathematical Characterization, Fast Algorithms and Architectures
A mathematical characterization of serially-pruned permutations (SPPs)
employed in variable-length permuters and their associated fast pruning
algorithms and architectures are proposed. Permuters are used in many signal
processing systems for shuffling data and in communication systems as an
adjunct to coding for error correction. Typically only a small set of discrete
permuter lengths are supported. Serial pruning is a simple technique to alter
the length of a permutation to support a wider range of lengths, but results in
a serial processing bottleneck. In this paper, parallelizing SPPs is formulated
in terms of recursively computing sums involving integer floor and related
functions using integer operations, in a fashion analogous to evaluating
Dedekind sums. A mathematical treatment for bit-reversal permutations (BRPs) is
presented, and closed-form expressions for BRP statistics are derived. It is
shown that BRP sequences have weak correlation properties. A new statistic
called permutation inliers that characterizes the pruning gap of pruned
interleavers is proposed. Using this statistic, a recursive algorithm that
computes the minimum inliers count of a pruned BR interleaver (PBRI) in
logarithmic time complexity is presented. This algorithm enables parallelizing
a serial PBRI algorithm by any desired parallelism factor by computing the
pruning gap in lookahead rather than a serial fashion, resulting in significant
reduction in interleaving latency and memory overhead. Extensions to 2-D block
and stream interleavers, as well as applications to pruned fast Fourier
transforms and LTE turbo interleavers, are also presented. Moreover,
hardware-efficient architectures for the proposed algorithms are developed.
Simulation results demonstrate 3 to 4 orders of magnitude improvement in
interleaving time compared to existing approaches.Comment: 31 page
Implementation and Evaluation of Algorithmic Skeletons: Parallelisation of Computer Algebra Algorithms
This thesis presents design and implementation approaches for the parallel algorithms of computer algebra. We use algorithmic skeletons and also further approaches, like data parallel arithmetic and actors. We have implemented skeletons for divide and conquer algorithms and some special parallel loops, that we call ‘repeated computation with a possibility of premature termination’. We introduce in this thesis a rational data parallel arithmetic. We focus on parallel symbolic computation algorithms, for these algorithms our arithmetic provides a generic parallelisation approach.
The implementation is carried out in Eden, a parallel functional programming language based on Haskell. This choice enables us to encode both the skeletons and the programs in the same language. Moreover, it allows us to refrain from using two different languages—one for the implementation and one for the interface—for our implementation of computer algebra algorithms.
Further, this thesis presents methods for evaluation and estimation of parallel execution times. We partition the parallel execution time into two components. One of them accounts for the quality of the parallelisation, we call it the ‘parallel penalty’. The other is the sequential execution time. For the estimation, we predict both components separately, using statistical methods. This enables very confident estimations, although using drastically less measurement points than other methods. We have applied both our evaluation and estimation approaches to the parallel programs presented in this thesis. We haven also used existing estimation methods.
We developed divide and conquer skeletons for the implementation of fast parallel multiplication. We have implemented the Karatsuba algorithm, Strassen’s matrix multiplication algorithm and the fast Fourier transform. The latter was used to implement polynomial convolution that leads to a further fast multiplication algorithm. Specially for our implementation of Strassen algorithm we have designed and implemented a divide and conquer skeleton basing on actors. We have implemented the parallel fast Fourier transform, and not only did we use new divide and conquer skeletons, but also developed a map-and-transpose skeleton. It enables good parallelisation of the Fourier transform. The parallelisation of Karatsuba multiplication shows a very good performance. We have analysed the parallel penalty of our programs and compared it to the serial fraction—an approach, known from literature. We also performed execution time estimations of our divide and conquer programs.
This thesis presents a parallel map+reduce skeleton scheme. It allows us to combine the usual parallel map skeletons, like parMap, farm, workpool, with a premature termination property. We use this to implement the so-called ‘parallel repeated computation’, a special form of a speculative parallel loop. We have implemented two probabilistic primality tests: the Rabin–Miller test and the Jacobi sum test. We parallelised both with our approach. We analysed the task distribution and stated the fitting configurations of the Jacobi sum test. We have shown formally that the Jacobi sum test can be implemented in parallel. Subsequently, we parallelised it, analysed the load balancing issues, and produced an optimisation. The latter enabled a good implementation, as verified using the parallel penalty. We have also estimated the performance of the tests for further input sizes and numbers of processing elements. Parallelisation of the Jacobi sum test and our generic parallelisation scheme for the repeated computation is our original contribution.
The data parallel arithmetic was defined not only for integers, which is already known, but also for rationals. We handled the common factors of the numerator or denominator of the fraction with the modulus in a novel manner. This is required to obtain a true multiple-residue arithmetic, a novel result of our research. Using these mathematical advances, we have parallelised the determinant computation using the Gauß elimination. As always, we have performed task distribution analysis and estimation of the parallel execution time of our implementation. A similar computation in Maple emphasised the potential of our approach. Data parallel arithmetic enables parallelisation of entire classes of computer algebra algorithms.
Summarising, this thesis presents and thoroughly evaluates new and existing design decisions for high-level parallelisations of computer algebra algorithms
Decoding Generalized Reed-Solomon Codes and Its Application to RLCE Encryption Schemes
This paper compares the efficiency of various algorithms for implementing
quantum resistant public key encryption scheme RLCE on 64-bit CPUs. By
optimizing various algorithms for polynomial and matrix operations over finite
fields, we obtained several interesting (or even surprising) results. For
example, it is well known (e.g., Moenck 1976 \cite{moenck1976practical}) that
Karatsuba's algorithm outperforms classical polynomial multiplication algorithm
from the degree 15 and above (practically, Karatsuba's algorithm only
outperforms classical polynomial multiplication algorithm from the degree 35
and above ). Our experiments show that 64-bit optimized Karatsuba's algorithm
will only outperform 64-bit optimized classical polynomial multiplication
algorithm for polynomials of degree 115 and above over finite field
. The second interesting (surprising) result shows that 64-bit
optimized Chien's search algorithm ourperforms all other 64-bit optimized
polynomial root finding algorithms such as BTA and FFT for polynomials of all
degrees over finite field . The third interesting (surprising)
result shows that 64-bit optimized Strassen matrix multiplication algorithm
only outperforms 64-bit optimized classical matrix multiplication algorithm for
matrices of dimension 750 and above over finite field . It should
be noted that existing literatures and practices recommend Strassen matrix
multiplication algorithm for matrices of dimension 40 and above. All our
experiments are done on a 64-bit MacBook Pro with i7 CPU and single thread C
codes. It should be noted that the reported results should be appliable to 64
or larger bits CPU architectures. For 32 or smaller bits CPUs, these results
may not be applicable. The source code and library for the algorithms covered
in this paper are available at http://quantumca.org/
Hierarchical Orthogonal Matrix Generation and Matrix-Vector Multiplications in Rigid Body Simulations
In this paper, we apply the hierarchical modeling technique and study some
numerical linear algebra problems arising from the Brownian dynamics
simulations of biomolecular systems where molecules are modeled as ensembles of
rigid bodies. Given a rigid body consisting of beads, the
transformation matrix that maps the force on each bead to 's
translational and rotational forces (a vector), and the row
space of , we show how to explicitly construct the matrix
consisting of orthonormal basis vectors of
(orthogonal complement of ) using only operations
and storage. For applications where only the matrix-vector multiplications
and are needed, we introduce
asymptotically optimal hierarchical algorithms without
explicitly forming . Preliminary numerical results are presented to
demonstrate the performance and accuracy of the numerical algorithms
Frequency Domain Finite Field Arithmetic for Elliptic Curve Cryptography
Efficient implementation of the number theoretic transform(NTT), also known as the discrete Fourier transform(DFT) over a finite field, has been studied actively for decades and found many applications in digital signal processing. In 1971 Schonhage and Strassen proposed an NTT based asymptotically fast multiplication method with the asymptotic complexity O(m log m log log m) for multiplication of -bit integers or (m-1)st degree polynomials. Schonhage and Strassen\u27s algorithm was known to be the asymptotically fastest multiplication algorithm until Furer improved upon it in 2007. However, unfortunately, both algorithms bear significant overhead due to the conversions between the time and frequency domains which makes them impractical for small operands, e.g. less than 1000 bits in length as used in many applications. With this work we investigate for the first time the practical application of the NTT, which found applications in digital signal processing, to finite field multiplication with an emphasis on elliptic curve cryptography(ECC). We present efficient parameters for practical application of NTT based finite field multiplication to ECC which requires key and operand sizes as short as 160 bits in length. With this work, for the first time, the use of NTT based finite field arithmetic is proposed for ECC and shown to be efficient. We introduce an efficient algorithm, named DFT modular multiplication, for computing Montgomery products of polynomials in the frequency domain which facilitates efficient multiplication in GF(p^m). Our algorithm performs the entire modular multiplication, including modular reduction, in the frequency domain, and thus eliminates costly back and forth conversions between the frequency and time domains. We show that, especially in computationally constrained platforms, multiplication of finite field elements may be achieved more efficiently in the frequency domain than in the time domain for operand sizes relevant to ECC. This work presents the first hardware implementation of a frequency domain multiplier suitable for ECC and the first hardware implementation of ECC in the frequency domain. We introduce a novel area/time efficient ECC processor architecture which performs all finite field arithmetic operations in the frequency domain utilizing DFT modular multiplication over a class of Optimal Extension Fields(OEF). The proposed architecture achieves extension field modular multiplication in the frequency domain with only a linear number of base field GF(p) multiplications in addition to a quadratic number of simpler operations such as addition and bitwise rotation. With its low area and high speed, the proposed architecture is well suited for ECC in small device environments such as smart cards and wireless sensor networks nodes. Finally, we propose an adaptation of the Itoh-Tsujii algorithm to the frequency domain which can achieve efficient inversion in a class of OEFs relevant to ECC. This is the first time a frequency domain finite field inversion algorithm is proposed for ECC and we believe our algorithm will be well suited for efficient constrained hardware implementations of ECC in affine coordinates
Generic design of Chinese remaindering schemes
We propose a generic design for Chinese remainder algorithms. A Chinese
remainder computation consists in reconstructing an integer value from its
residues modulo non coprime integers. We also propose an efficient linear data
structure, a radix ladder, for the intermediate storage and computations. Our
design is structured into three main modules: a black box residue computation
in charge of computing each residue; a Chinese remaindering controller in
charge of launching the computation and of the termination decision; an integer
builder in charge of the reconstruction computation. We then show that this
design enables many different forms of Chinese remaindering (e.g.
deterministic, early terminated, distributed, etc.), easy comparisons between
these forms and e.g. user-transparent parallelism at different parallel grains
Sparse approaches for the exact distribution of patterns in long state sequences generated by a Markov source
We present two novel approaches for the computation of the exact distribution
of a pattern in a long sequence. Both approaches take into account the sparse
structure of the problem and are two-part algorithms. The first approach relies
on a partial recursion after a fast computation of the second largest
eigenvalue of the transition matrix of a Markov chain embedding. The second
approach uses fast Taylor expansions of an exact bivariate rational
reconstruction of the distribution. We illustrate the interest of both
approaches on a simple toy-example and two biological applications: the
transcription factors of the Human Chromosome 5 and the PROSITE signatures of
functional motifs in proteins. On these example our methods demonstrate their
complementarity and their hability to extend the domain of feasibility for
exact computations in pattern problems to a new level
- …