19,632 research outputs found
An algorithmic and architectural study on Montgomery exponentiation in RNS
Modular exponentiation on large numbers is computationally intensive. An effective way to perform this operation is to use Montgomery exponentiation in the Residue Number System (RNS). This paper presents an algorithmic and architectural study of this exponentiation approach. From the algorithmic point of view, new and state-of-the-art opportunities that come from the reorganization of operations and precomputations are considered. From the architectural perspective, the design opportunities offered by well-known computer arithmetic techniques are studied, with the aim of developing an efficient arithmetic cell architecture. Furthermore, since the use of efficient RNS bases with a low Hamming weight is being considered with ever more interest, four additional cell architectures specifically tailored to these bases are developed, and the tradeoff between benefits and drawbacks is carefully explored. An overall comparison among all the considered algorithmic approaches and cell architectures is presented, with the aim of providing the reader with an extensive overview of the Montgomery exponentiation opportunities in RNS.
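As a point of reference for the abstract above, here is a minimal sketch of Montgomery exponentiation on ordinary Python integers. It is not the RNS variant the paper studies: the radix choice `R = 2**k`, the function names, and the left-to-right square-and-multiply loop are all assumptions made for illustration.

```python
def montgomery_setup(n):
    """Precompute constants for an odd modulus n, using R = 2**k > n (assumed radix)."""
    if n < 3 or n % 2 == 0:
        raise ValueError("modulus must be odd and at least 3")
    k = n.bit_length()
    r = 1 << k
    n_prime = (-pow(n, -1, r)) % r  # satisfies n * n_prime = -1 (mod R); needs Python 3.8+
    r2 = (r * r) % n                # used to convert values into Montgomery form
    return k, r, n_prime, r2


def mont_mul(a, b, n, k, n_prime):
    """Montgomery product: returns a * b * R^(-1) mod n for 0 <= a, b < n."""
    mask = (1 << k) - 1
    t = a * b
    m = ((t & mask) * n_prime) & mask  # chosen so that t + m*n is divisible by R
    u = (t + m * n) >> k               # exact division by R; result lies in [0, 2n)
    return u - n if u >= n else u


def mont_pow(base, exp, n):
    """Compute base**exp mod n with Montgomery multiplications only."""
    if exp < 0:
        raise ValueError("negative exponents are not handled in this sketch")
    k, r, n_prime, r2 = montgomery_setup(n)
    x = mont_mul(base % n, r2, n, k, n_prime)  # base in Montgomery form (base * R mod n)
    acc = r % n                                # Montgomery form of 1
    for bit in bin(exp)[2:]:                   # scan exponent bits, most significant first
        acc = mont_mul(acc, acc, n, k, n_prime)
        if bit == "1":
            acc = mont_mul(acc, x, n, k, n_prime)
    return mont_mul(acc, 1, n, k, n_prime)     # convert back out of Montgomery form
```

For example, `mont_pow(7, 560, 561)` should agree with Python's built-in `pow(7, 560, 561)`. An RNS implementation would replace the single wide multiplication and the division by `R` with per-channel operations and base extensions, which is where the cell architectures discussed in the paper come in.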
On the Effect of Quantum Interaction Distance on Quantum Addition Circuits
We investigate the theoretical limits of the effect of the quantum
interaction distance on the speed of exact quantum addition circuits. For this
study, we exploit graph embedding for quantum circuit analysis. We study a
logical mapping of qubits and gates of any Ω(log n)-depth quantum adder
circuit for two n-qubit registers onto a practical architecture, which limits
interaction distance to the nearest neighbors only and supports only one- and
two-qubit logical gates. Unfortunately, on the chosen k-dimensional practical
architecture, we prove that the depth lower bound of any exact quantum addition
circuit is no longer Ω(log n), but Ω(n^{1/k}). This
result, the first application of graph embedding to quantum circuits and
devices, provides a new tool for compiler development, emphasizes the impact of
quantum computer architecture on performance, and acts as a cautionary note
when evaluating the time performance of quantum algorithms.
Comment: accepted for ACM Journal on Emerging Technologies in Computing System
A Fast Potential and Self-Gravity Solver for Non-Axisymmetric Disks
Disk self-gravity could play an important role in the dynamic evolution of
interaction between disks and embedded protoplanets. We have developed a fast
and accurate solver to calculate the disk potential and disk self-gravity
forces for disk systems on a uniform polar grid. Our method follows closely the
method given by Chan et al. (2006), in which an FFT in the azimuthal direction
is performed and a direct integral approach in the frequency domain in the
radial direction is implemented on a uniform polar grid. This method can be
very effective for disks with vertical structures that depend only on the disk
radius, achieving the same computational efficiency as for zero-thickness
disks. We describe how to parallelize the solver efficiently on distributed
parallel computers. We propose a mode-cutoff procedure to reduce the parallel
communication cost and achieve nearly linear scalability for a large number of
processors. For comparison, we have also developed a particle-based fast
tree-code to calculate the self-gravity of the disk system with vertical
structure. The numerical results show that our direct integral method is at
least two orders of magnitude faster than our optimized tree-code approach.
Comment: 8 figures, accepted to ApJ
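The azimuthal step described in the solver abstract above relies on the fact that, on a uniform grid in angle, the potential on a ring is a circular convolution of a kernel with the surface density, so it becomes a pointwise product in the frequency domain. The sketch below checks that identity on a single ring. It is an illustration under assumptions, not the authors' solver: a naive O(n²) DFT stands in for an FFT, the softened kernel and the density profile are hypothetical, and the radial integral is omitted entirely.

```python
import cmath
import math


def dft(values, inverse=False):
    """Naive discrete Fourier transform (stand-in for an FFT in this sketch)."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        total = 0j
        for m in range(n):
            total += values[m] * cmath.exp(sign * 2j * cmath.pi * m * k / n)
        out.append(total / n if inverse else total)
    return out


def circular_convolve_direct(kernel, density):
    """Direct O(n^2) sum over azimuthal cells."""
    n = len(density)
    return [sum(kernel[(i - j) % n] * density[j] for j in range(n))
            for i in range(n)]


def circular_convolve_dft(kernel, density):
    """Same convolution via a pointwise product of the transforms."""
    k_hat = dft(kernel)
    d_hat = dft(density)
    product = [a * b for a, b in zip(k_hat, d_hat)]
    return [v.real for v in dft(product, inverse=True)]


# Hypothetical single-ring setup: unit radius, softening length eps (assumed values).
n_phi = 32
eps = 0.1
kernel = [-1.0 / math.sqrt(2.0 - 2.0 * math.cos(2.0 * math.pi * m / n_phi) + eps ** 2)
          for m in range(n_phi)]
density = [1.0 + 0.3 * math.cos(2.0 * (2.0 * math.pi * m / n_phi))
           for m in range(n_phi)]

direct = circular_convolve_direct(kernel, density)
via_dft = circular_convolve_dft(kernel, density)
max_diff = max(abs(a - b) for a, b in zip(direct, via_dft))
```

With a real FFT the transform cost drops from O(n²) to O(n log n) per ring, which is the source of the efficiency the abstract refers to in the azimuthal direction.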
GRAPE-6: The massively-parallel special-purpose computer for astrophysical particle simulation
In this paper, we describe the architecture and performance of the GRAPE-6
system, a massively-parallel special-purpose computer for astrophysical
N-body simulations. GRAPE-6 is the successor of GRAPE-4, which was completed
in 1995 and achieved the theoretical peak speed of 1.08 Tflops. As was the case
with GRAPE-4, the primary application of GRAPE-6 is simulation of collisional
systems, though it can be used for collisionless systems. The main differences
between GRAPE-4 and GRAPE-6 are (a) the processor chip of GRAPE-6 integrates 6
force-calculation pipelines, compared to one pipeline of GRAPE-4 (which needed
3 clock cycles to calculate one interaction), (b) the clock speed is increased
from 32 to 90 MHz, and (c) the total number of processor chips is increased
from 1728 to 2048. These improvements resulted in the peak speed of 64 Tflops.
We also discuss the design of the successor of GRAPE-6.
Comment: Accepted for publication in PASJ, scheduled to appear in Vol. 55, No.
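The three differences listed in the GRAPE-6 abstract above can be combined into a rough consistency check on the quoted peak speeds. The calculation assumes that each GRAPE-6 pipeline completes one interaction per clock cycle and that both machines are credited with the same number of floating-point operations per interaction. Neither assumption is stated in the abstract.

```latex
\frac{\text{GRAPE-6 interaction rate}}{\text{GRAPE-4 interaction rate}}
  \approx
  \underbrace{\frac{6}{1/3}}_{\text{interactions per chip per cycle}}
  \times \underbrace{\frac{90}{32}}_{\text{clock}}
  \times \underbrace{\frac{2048}{1728}}_{\text{chips}}
  = 18 \times 2.8125 \times 1.185\ldots = 60,
\qquad
\frac{64\ \text{Tflops}}{1.08\ \text{Tflops}} \approx 59.3 .
```

Under those assumptions the estimated factor of 60 is close to the ratio of the two quoted peak speeds.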
Optimistic Parallelization of Floating-Point Accumulation
Floating-point arithmetic is notoriously non-associative due to the limited-precision representation, which demands that intermediate values be rounded to fit in the available precision. The resulting cyclic dependency in floating-point accumulation inhibits parallelization of the computation, including efficient use of pipelining. In practice, however, we observe that floating-point operations are "mostly" associative. This observation can be exploited to parallelize floating-point accumulation using a form of optimistic concurrency. In this scheme, we first compute an optimistic associative approximation to the sum and then relax the computation by iteratively propagating errors until the correct sum is obtained. We map this computation to a network of 16 statically-scheduled, pipelined, double-precision floating-point adders on the Virtex-4 LX160 (-12) device, where each floating-point adder runs at 296 MHz and has a pipeline depth of 10. On this 16 PE design, we demonstrate an average speedup of 6× with randomly generated data and 3-7× with summations extracted from Conjugate Gradient benchmarks.
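The non-associativity that motivates the accumulation abstract above is easy to reproduce in software. The sketch below compares a sequential left-to-right sum with a tree-shaped (pairwise) reduction, the order a parallel adder network would naturally use. It only illustrates why reordering changes the result. It does not implement the paper's optimistic error-propagation scheme, and the function names and sample data are chosen here for illustration.

```python
import math


def sequential_sum(values):
    """Left-to-right accumulation, one rounding per addition."""
    total = 0.0
    for v in values:
        total += v
    return total


def pairwise_sum(values):
    """Tree-shaped reduction: add neighbours level by level."""
    level = list(values)
    if not level:
        return 0.0
    while len(level) > 1:
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:          # carry an unpaired last element to the next level
            nxt.append(level[-1])
        level = nxt
    return level[0]


# Grouping matters even for three small numbers.
left_grouped = (0.1 + 0.2) + 0.3
right_grouped = 0.1 + (0.2 + 0.3)

# Sample data with large cancelling terms (illustrative values).
data = [1e16, 1.0, -1e16, 1.0]
seq = sequential_sum(data)
tree = pairwise_sum(data)
reference = math.fsum(data)  # correctly rounded sum from the standard library
```

On this data the sequential order, the tree order, and `math.fsum` give three different answers. A parallel accumulator therefore has to detect and repair any difference if it is to return the same value as the sequential sum.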
Construction and Application of an AMR Algorithm for Distributed Memory Computers
While the parallelization of block-structured adaptive mesh refinement techniques is relatively straightforward on shared memory architectures, appropriate distribution strategies for the emerging generation of distributed memory machines are a topic of ongoing research. In this paper, a locality-preserving domain decomposition is proposed that partitions the entire AMR hierarchy from the base level on. It is shown that the approach reduces the communication costs and simplifies the implementation. Emphasis is put on the effective parallelization of the flux correction procedure at coarse-fine boundaries, which is indispensable for conservative finite volume schemes. An easily reproducible standard benchmark and a highly resolved parallel AMR simulation of a diffracting hydrogen-oxygen detonation demonstrate the proposed strategy in practice.
GreeM: Massively Parallel TreePM Code for Large Cosmological N-body Simulations
In this paper, we describe the implementation and performance of GreeM, a
massively parallel TreePM code for large-scale cosmological N-body simulations.
GreeM uses a recursive multi-section algorithm for domain decomposition. The
sizes of the domains are adjusted so that the total calculation time of the
force becomes the same for all processes. The loss of performance due to
non-optimal load balancing is around 4%, even for more than 10^3 CPU cores.
GreeM runs efficiently on PC clusters and massively-parallel computers such as
a Cray XT4. The measured calculation speed on Cray XT4 is 5 \times 10^4
particles per second per CPU core, for the case of an opening angle of
\theta=0.5, if the number of particles per CPU core is larger than 10^6.
Comment: 13 pages, 11 figures, accepted by PAS
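The recursive multi-section idea mentioned in the GreeM abstract above can be sketched as follows: split the particles into slabs along the first axis, then split each slab along the next axis, and so on. This is a simplified illustration rather than GreeM's implementation. It balances particle counts, whereas the abstract says GreeM balances the measured force calculation time, and the function names and the `divisions` parameter are hypothetical.

```python
import random


def split_equal(items, parts, key):
    """Sort by key and cut into `parts` contiguous chunks of near-equal size."""
    ordered = sorted(items, key=key)
    n = len(ordered)
    bounds = [(n * i) // parts for i in range(parts + 1)]
    return [ordered[bounds[i]:bounds[i + 1]] for i in range(parts)]


def multisection(particles, divisions, axis=0):
    """Recursively split along each axis; divisions[axis] is the number of cuts per axis."""
    if axis == len(divisions):
        return [particles]
    domains = []
    for chunk in split_equal(particles, divisions[axis], key=lambda p: p[axis]):
        domains.extend(multisection(chunk, divisions, axis + 1))
    return domains


# Clustered 2-D test distribution (illustrative data only).
rng = random.Random(0)
particles = [(rng.gauss(0.5, 0.1), rng.gauss(0.5, 0.2)) for _ in range(1000)]
domains = multisection(particles, divisions=(4, 2))
sizes = [len(d) for d in domains]
```

Because the cuts follow the sorted particle positions, dense regions get geometrically smaller domains. Replacing the particle count with a per-particle cost estimate would move the sketch closer to the time-based balancing the abstract describes.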