Composite Iterative Algorithm and Architecture for q-th Root Calculation
This paper presents an algorithm for q-th root extraction, where q is any integer. The algorithm is based on an optimized implementation of X^{1/q} as a sequence of parallel and/or overlapped operations: (1) reciprocal, (2) digit-recurrence logarithm, (3) left-to-right carry-free multiplication and (4) on-line exponential. A detailed error analysis is given and two architectures are proposed, one for low-precision q and one for higher-precision q. The execution time and hardware requirements are estimated for single and double precision floating-point computations for several radices; this helps to determine which radices result in the most efficient implementations. The proposed architectures improve on the features of other architectures for q-th root extraction.
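The four-step decomposition named in the abstract can be sketched in software as follows. This is only a floating-point illustration of the identity X^{1/q} = exp((1/q) · ln X), not the digit-recurrence or on-line hardware units the paper describes. The function name `qth_root` and the restriction to X > 0 and q ≠ 0 are assumptions made for this sketch.

```python
import math


def qth_root(x: float, q: int) -> float:
    """Approximate x**(1/q) as exp((1/q) * ln(x)).

    Assumptions of this sketch: x > 0 and q != 0.
    """
    if x <= 0.0:
        raise ValueError("x must be positive")
    if q == 0:
        raise ValueError("q must be nonzero")
    reciprocal = 1.0 / q              # step (1): reciprocal of q
    logarithm = math.log(x)           # step (2): logarithm of x
    product = reciprocal * logarithm  # step (3): multiplication
    return math.exp(product)          # step (4): exponential
```

For example, `qth_root(27.0, 3)` returns a value within rounding error of 3. In the hardware setting the four steps are overlapped rather than run one after another as they are here.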
Floating Point Calculation of the Cube Function on FPGAs
© 2023 IEEE. This version of the paper has been accepted for publication. Personal use of
this material is permitted. Permission from IEEE must be obtained for all other uses, in
any current or future media, including reprinting/republishing this material for
advertising or promotional purposes, creating new collective works, for resale or
redistribution to servers or lists, or reuse of any copyrighted component of this work in
other works.
The final published paper is available online at:
https://doi.org/10.1109/TPDS.2022.3220039
[Abstract]: Specialized arithmetic units allow fast and efficient computation of less frequently used mathematical functions. The overall impact of those units would be negligible in a general purpose processor, as the added circuitry makes chips more complex even though most software would seldom make use of it. Conversely, custom computing machines are built for a specific task, and they can always benefit from specialized units if they are available. In this work, floating point architectures are proposed for computing the cube on Intel and Xilinx FPGAs. Those implementations reduce the cost and latency compared to using simple floating point multipliers and squarers.
This work was supported by the Ministry of Science and Innovation of Spain (PID2019-104184RB-I00 / AEI / 10.13039/501100011033), and by Xunta de Galicia and FEDER funds of the EU (Centro de Investigación de Galicia accreditation 2019–2022, ref. ED431G 2019/01; Consolidation Program of Competitive Reference Groups, ref. ED431C 2021/30).
GRAPE-5: A Special-Purpose Computer for N-body Simulation
We have developed a special-purpose computer for gravitational many-body
simulations, GRAPE-5. GRAPE-5 is the successor of GRAPE-3. Both consist of
eight custom pipeline chips (G5 chip and GRAPE chip). The differences between
GRAPE-5 and GRAPE-3 are: (1) The G5 chip contains two pipelines operating at 80
MHz, while the GRAPE chip had one at 20 MHz. Thus, the calculation speed of the
G5 chip and that of GRAPE-5 board are 8 times faster than that of GRAPE chip
and GRAPE-3 board. (2) The GRAPE-5 board adopted PCI bus as the interface to
the host computer instead of the VME bus of GRAPE-3, resulting in a communication
speed one order of magnitude faster. (3) In addition to the pure 1/r potential,
the G5 chip can calculate forces with arbitrary cutoff functions, so that it
can be applied to Ewald or P^3M methods. (4) The pairwise force calculated on
GRAPE-5 is about 10 times more accurate than that on GRAPE-3. On one GRAPE-5
board, one timestep of 128k-body simulation with direct summation algorithm
takes 14 seconds. With Barnes-Hut tree algorithm (theta = 0.75), one timestep
of 10^6-body simulation can be done in 16 seconds.
Comment: 19 pages, 24 Postscript figures, 3 tables, Latex, submitted to
Publications of the Astronomical Society of Japan
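The direct summation mentioned in the abstract evaluates every pairwise force, which is the O(N^2) loop that a pipeline chip accelerates. A minimal software sketch for the pure 1/r potential follows. The function name `direct_sum_accelerations`, the softening parameter `eps`, and the unit choice G = 1 are assumptions of this illustration and are not taken from the paper.

```python
import math


def direct_sum_accelerations(positions, masses, eps=0.0, G=1.0):
    """Return the gravitational acceleration on each body by direct summation.

    positions: list of [x, y, z] coordinates; masses: list of floats.
    eps is an assumed softening length (0.0 gives the pure 1/r potential).
    """
    n = len(positions)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # a body exerts no force on itself
            # Separation vector pointing from body i to body j.
            d = [positions[j][k] - positions[i][k] for k in range(3)]
            r2 = d[0] * d[0] + d[1] * d[1] + d[2] * d[2] + eps * eps
            inv_r3 = 1.0 / (r2 * math.sqrt(r2))
            for k in range(3):
                acc[i][k] += G * masses[j] * d[k] * inv_r3
    return acc
```

The inner loop body runs N(N-1) times per timestep, which is why the cost of direct summation grows quadratically with the number of bodies while a tree algorithm such as Barnes-Hut grows more slowly.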
Design for Implementation of Image Processing Algorithms
Color image processing algorithms are first developed using a high-level mathematical modeling language. Current integrated development environments offer libraries of intrinsic functions, which on one hand enable faster development, but on the other hand hide the use of fundamental operations. The latter have to be detailed for an efficient hardware and/or software physical implementation. Based on the experience accumulated in the process of implementing a segmentation algorithm, this thesis outlines a design for implementation methodology comprised of a development flow and associated guidelines.
The methodology enables algorithm developers to iteratively optimize their algorithms while maintaining the level of image integrity required by their application. Furthermore, it does not require algorithm developers to change their current development process. Rather, the design for implementation methodology is best suited for optimizing a functionally correct algorithm, thus appending to an algorithm developer's design process of choice.
The application of this methodology to four segmentation algorithm steps produced measured results with 2-D correlation coefficients (CORR2) better than 0.99, peak-signal-to-noise-ratio (PSNR) better than 70 dB, and structural-similarity-index (SSIM) better than 0.98, for a majority of test cases.
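Of the three quality metrics reported, PSNR is the simplest to state: PSNR = 10 · log10(MAX^2 / MSE), where MSE is the mean squared error between a reference image and the image produced by the optimized implementation. The sketch below uses the standard definition. The function name `psnr`, the nested-list image representation, and the default peak value of 255 are assumptions of this illustration, since the thesis abstract does not state its image format.

```python
import math


def psnr(reference, test, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images.

    Images are given as lists of rows of numeric pixel values.
    max_value is the assumed peak pixel value (255 for 8-bit data).
    """
    total = 0.0
    count = 0
    for ref_row, test_row in zip(reference, test):
        for r, t in zip(ref_row, test_row):
            total += (r - t) ** 2
            count += 1
    if count == 0:
        raise ValueError("images must contain at least one pixel")
    mse = total / count
    if mse == 0.0:
        return math.inf  # identical images: no noise at all
    return 10.0 * math.log10(max_value ** 2 / mse)
```

A higher PSNR means the optimized output is closer to the reference.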
Fast Quantum Modular Exponentiation
We present a detailed analysis of the impact on modular exponentiation of
architectural features and possible concurrent gate execution. Various
arithmetic algorithms are evaluated for execution time, potential concurrency,
and space tradeoffs. We find that, to exponentiate an n-bit number, for storage
space 100n (twenty times the minimum 5n), we can execute modular exponentiation
two hundred to seven hundred times faster than optimized versions of the basic
algorithms, depending on architecture, for n=128. Addition on a neighbor-only
architecture is limited to O(n) time when non-neighbor architectures can reach
O(log n), demonstrating that physical characteristics of a computing device
have an important impact on both real-world running time and asymptotic
behavior. Our results will help guide experimental implementations of quantum
algorithms and devices.
Comment: to appear in PRA 71(5); RevTeX, 12 pages, 12 figures; v2 revision is
substantial, with new algorithmic variants, much shorter and clearer text,
and revised equation formatting
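For orientation, the classical square-and-multiply routine below shows the operation being accelerated: computing base^exponent mod modulus with a number of modular multiplications proportional to the bit length of the exponent. It is a classical reference sketch only, not one of the quantum circuits analysed in the paper, and the function name `mod_exp` is an assumption of this illustration.

```python
def mod_exp(base: int, exponent: int, modulus: int) -> int:
    """Compute (base ** exponent) % modulus by binary square-and-multiply.

    Assumes exponent >= 0 and modulus > 0.
    """
    if modulus <= 0:
        raise ValueError("modulus must be positive")
    if exponent < 0:
        raise ValueError("exponent must be non-negative")
    result = 1 % modulus  # equals 0 when modulus == 1
    base %= modulus
    while exponent > 0:
        if exponent & 1:
            # The current bit is set: fold this power of base into the result.
            result = (result * base) % modulus
        base = (base * base) % modulus  # square for the next bit
        exponent >>= 1
    return result
```

Each loop iteration consumes one exponent bit, so an n-bit exponent needs at most 2n modular multiplications, each of which is built from additions.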