19,632 research outputs found

    An algorithmic and architectural study on Montgomery exponentiation in RNS

    Get PDF
    The modular exponentiation on large numbers is computationally intensive. An effective way for performing this operation consists in using Montgomery exponentiation in the Residue Number System (RNS). This paper presents an algorithmic and architectural study of such exponentiation approach. From the algorithmic point of view, new and state-of-the-art opportunities that come from the reorganization of operations and precomputations are considered. From the architectural perspective, the design opportunities offered by well-known computer arithmetic techniques are studied, with the aim of developing an efficient arithmetic cell architecture. Furthermore, since the use of efficient RNS bases with a low Hamming weight are being considered with ever more interest, four additional cell architectures specifically tailored to these bases are developed and the tradeoff between benefits and drawbacks is carefully explored. An overall comparison among all the considered algorithmic approaches and cell architectures is presented, with the aim of providing the reader with an extensive overview of the Montgomery exponentiation opportunities in RNS

    On the Effect of Quantum Interaction Distance on Quantum Addition Circuits

    Full text link
    We investigate the theoretical limits of the effect of the quantum interaction distance on the speed of exact quantum addition circuits. For this study, we exploit graph embedding for quantum circuit analysis. We study a logical mapping of qubits and gates of any Ī©(logā”n)\Omega(\log n)-depth quantum adder circuit for two nn-qubit registers onto a practical architecture, which limits interaction distance to the nearest neighbors only and supports only one- and two-qubit logical gates. Unfortunately, on the chosen kk-dimensional practical architecture, we prove that the depth lower bound of any exact quantum addition circuits is no longer Ī©(logā”n)\Omega(\log {n}), but Ī©(nk)\Omega(\sqrt[k]{n}). This result, the first application of graph embedding to quantum circuits and devices, provides a new tool for compiler development, emphasizes the impact of quantum computer architecture on performance, and acts as a cautionary note when evaluating the time performance of quantum algorithms.Comment: accepted for ACM Journal on Emerging Technologies in Computing System

    A Fast Potential and Self-Gravity Solver for Non-Axisymmetric Disks

    Full text link
    Disk self-gravity could play an important role in the dynamic evolution of interaction between disks and embedded protoplanets. We have developed a fast and accurate solver to calculate the disk potential and disk self-gravity forces for disk systems on a uniform polar grid. Our method follows closely the method given by Chan et al. (2006), in which an FFT in the azimuthal direction is performed and a direct integral approach in the frequency domain in the radial direction is implemented on a uniform polar grid. This method can be very effective for disks with vertical structures that depend only on the disk radius, achieving the same computational efficiency as for zero-thickness disks. We describe how to parallelize the solver efficiently on distributed parallel computers. We propose a mode-cutoff procedure to reduce the parallel communication cost and achieve nearly linear scalability for a large number of processors. For comparison, we have also developed a particle-based fast tree-code to calculate the self-gravity of the disk system with vertical structure. The numerical results show that our direct integral method is at least two order of magnitudes faster than our optimized tree-code approach.Comment: 8 figures, accepted to ApJ

    GRAPE-6: The massively-parallel special-purpose computer for astrophysical particle simulation

    Full text link
    In this paper, we describe the architecture and performance of the GRAPE-6 system, a massively-parallel special-purpose computer for astrophysical NN-body simulations. GRAPE-6 is the successor of GRAPE-4, which was completed in 1995 and achieved the theoretical peak speed of 1.08 Tflops. As was the case with GRAPE-4, the primary application of GRAPE-6 is simulation of collisional systems, though it can be used for collisionless systems. The main differences between GRAPE-4 and GRAPE-6 are (a) The processor chip of GRAPE-6 integrates 6 force-calculation pipelines, compared to one pipeline of GRAPE-4 (which needed 3 clock cycles to calculate one interaction), (b) the clock speed is increased from 32 to 90 MHz, and (c) the total number of processor chips is increased from 1728 to 2048. These improvements resulted in the peak speed of 64 Tflops. We also discuss the design of the successor of GRAPE-6.Comment: Accepted for publication in PASJ, scheduled to appear in Vol. 55, No.

    Optimistic Parallelization of Floating-Point Accumulation

    Get PDF
    Floating-point arithmetic is notoriously non-associative due to the limited precision representation which demands intermediate values be rounded to fit in the available precision. The resulting cyclic dependency in floating-point accumulation inhibits parallelization of the computation, including efficient use of pipelining. In practice, however, we observe that floating-point operations are "mostly" associative. This observation can be exploited to parallelize floating-point accumulation using a form of optimistic concurrency. In this scheme, we first compute an optimistic associative approximation to the sum and then relax the computation by iteratively propagating errors until the correct sum is obtained. We map this computation to a network of 16 statically-scheduled, pipelined, double-precision floating-point adders on the Virtex-4 LX160 (-12) device where each floating-point adder runs at 296 MHz and has a pipeline depth of 10. On this 16 PE design, we demonstrate an average speedup of 6Ɨ with randomly generated data and 3-7Ɨ with summations extracted from Conjugate Gradient benchmarks

    Construction and Application of an AMR Algorithm for Distributed Memory Computers

    Get PDF
    While the parallelization of blockstructured adaptive mesh refinement techniques is relatively straight-forward on shared memory architectures, appropriate distribution strategies for the emerging generation of distributed memory machines are a topic of on-going research. In this paper, a locality-preserving domain decomposition is proposed that partitions the entire AMR hierarchy from the base level on. It is shown that the approach reduces the communication costs and simplifies the implementation. Emphasis is put on the effective parallelization of the flux correction procedure at coarse-fine boundaries, which is indispensable for conservative finite volume schemes. An easily reproducible standard benchmark and a highly resolved parallel AMR simulation of a diffracting hydrogen-oxygen detonation demonstrate the proposed strategy in practice

    GreeM : Massively Parallel TreePM Code for Large Cosmological N-body Simulations

    Full text link
    In this paper, we describe the implementation and performance of GreeM, a massively parallel TreePM code for large-scale cosmological N-body simulations. GreeM uses a recursive multi-section algorithm for domain decomposition. The size of the domains are adjusted so that the total calculation time of the force becomes the same for all processes. The loss of performance due to non-optimal load balancing is around 4%, even for more than 10^3 CPU cores. GreeM runs efficiently on PC clusters and massively-parallel computers such as a Cray XT4. The measured calculation speed on Cray XT4 is 5 \times 10^4 particles per second per CPU core, for the case of an opening angle of \theta=0.5, if the number of particles per CPU core is larger than 10^6.Comment: 13 pages, 11 figures, accepted by PAS
    • ā€¦
    corecore