Search CORE

3,631 research outputs found

Efficient long division via Montgomery multiply

Author: Mayer Ernst W.
Publication venue
Publication date: 20/08/2016
Field of study

We present a novel right-to-left long division algorithm based on the Montgomery modular multiply, consisting of separate highly efficient loops with simply carry structure for computing first the remainder (x mod q) and then the quotient floor(x/q). These loops are ideally suited for the case where x occupies many more machine words than the divide modulus q, and are strictly linear time in the "bitsize ratio" lg(x)/lg(q). For the paradigmatic performance test of multiword dividend and single 64-bit-word divisor, exploitation of the inherent data-parallelism of the algorithm effectively mitigates the long latency of hardware integer MUL operations, as a result of which we are able to achieve respective costs for remainder-only and full-DIV (remainder and quotient) of 6 and 12.5 cycles per dividend word on the Intel Core 2 implementation of the x86_64 architecture, in single-threaded execution mode. We further describe a simple "bit-doubling modular inversion" scheme, which allows the entire iterative computation of the mod-inverse required by the Montgomery multiply at arbitrarily large precision to be performed with cost less than that of a single Newtonian iteration performed at the full precision of the final result. We also show how the Montgomery-multiply-based powering can be efficiently used in Mersenne and Fermat-number trial factorization via direct computation of a modular inverse power of 2, without any need for explicit radix-mod scalings.Comment: 23 pages; 8 tables v2: Tweak formatting, pagecount -= 2. v3: Fix incorrect powers of R in formulae [7] and [11] v4: Add Eldridge & Walter ref. v5: Clarify relation between Algos A/A',D and Hensel-div; clarify true-quotient mechanics; Add Haswell timings, refs to Agner Fog timings pdf and GMP asm-timings ref-page. v6: Remove stray +bw in MULL line of Algo D listing; add note re byte-LUT for qinv_

arXiv.org e-Print Archive

CiteSeerX

Division with speculation of quotient digits

Author: Cortadella Jordi
Lang Tomás
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/1993
Field of study

The speed of SRT-type dividers is mainly determined by the complexity of the quotient-digit selection, so that implementations are limited to low-radix stages. A scheme is presented in which the quotient-digit is speculated and, when this speculation is incorrect, a rollback or a partial advance is performed. This results in a division operation with a shorter cycle time and a variable number of cycles. Several designs have been realized, and a radix-64 implementation that is 30% faster than the fastest conventional implementation (radix-8) at an increase of about 45% in area per quotient bit has been obtained. A radix-16 implementation that is about 10% faster than the radix-8 conventional one, with the additional advantage of requiring about 25% less area per quotient bit, is also shownPeer ReviewedPostprint (published version

UPCommons. Portal del coneixement obert de la UPC

Implementation of a Combined OFDM-Demodulation and WCDMA-Equalization Module

Author: Gerez Sabih H.
Hofstra Klaas L.
Kampen David van
Potman Jordy
Publication venue
Publication date: 01/01/2006
Field of study

For a dual-mode baseband receiver for the OFDMWireless LAN andWCDMA standards, integration of the demodulation and equalization tasks on a dedicated hardware module has been investigated. For OFDM demodulation, an FFT algorithm based on cascaded twiddle factor decomposition has been selected. This type of algorithm combines high spatial and temporal regularity in the FFT data-flow graphs with a minimal number of computations. A frequency-domain algorithm based on a circulant channel approximation has been selected for WCDMA equalization. It has good performance, low hardware complexity and a low number of computations. Its main advantage is the reuse of the FFT kernel, which contributes to the integration of both tasks. The demodulation and equalization module has been described at the register transfer level with the in-house developed Arx language. The core of the module is a pipelined radix-23 butterfly combined with a complex multiplier and complex divider. The module has an area of 0.447 mm2 in 0.18 ¿m technology and a power consumption of 10.6 mW. The proposed module compares favorably with solutions reported in literature

CiteSeerX

University of Twente Research Information

Efficient Unified Arithmetic for Hardware Cryptography

Author: Koc Cetin Kaya
Koç Çetin Kaya
Savas Erkay
Savaş Erkay
Publication venue: 'Springer Fachmedien Wiesbaden GmbH'
Publication date: 09/03/2008
Field of study

The basic arithmetic operations (i.e. addition, multiplication, and inversion) in finite fields, GF(q), where q = pk and p is a prime integer, have several applications in cryptography, such as RSA algorithm, Diffie-Hellman key exchange algorithm [1], the US federal Digital Signature Standard [2], elliptic curve cryptography [3, 4], and also recently identity based cryptography [5, 6]. Most popular finite fields that are heavily used in cryptographic applications due to elliptic curve based schemes are prime fields GF(p) and binary extension fields GF(2n). Recently, identity based cryptography based on pairing operations defined over elliptic curve points has stimulated a significant level of interest in the arithmetic of ternary extension fields, GF(3^n)

Sabanci University Research Database

Efficient FPGA implementation of high-throughput mixed radix multipath delay commutator FFT processor for MIMO-OFDM

Author: A. AMIRA
A. GUESSOUM
Ayinala
Bingham
Boopal
Chen
Fu
Garrido
Garrido
Gesbert
Li
Lin
Lin
M. DALI
N. RAMZAN
R. M. GIBSON
Sampath
Shousheng He
Shousheng He
Song-Nien Tang
Swartzlander
Tang
Tsai
Uzun
Wang
Wang
Yang
Yu-Wei Lin
Publication venue: 'Universitatea Stefan cel Mare din Suceava'
Publication date: 01/01/2017
Field of study

This article presents and evaluates pipelined architecture designs for an improved high-frequency Fast Fourier Transform (FFT) processor implemented on Field Programmable Gate Arrays (FPGA) for Multiple Input Multiple Output Orthogonal Frequency Division Multiplexing (MIMO-OFDM). The architecture presented is a Mixed-Radix Multipath Delay Commutator. The presented parallel architecture utilizes fewer hardware resources compared to Radix-2 architecture, while maintaining simple control and butterfly structures inherent to Radix-2 implementations. The high-frequency design presented allows enhancing system throughput without requiring additional parallel data paths common in other current approaches, the presented design can process two and four independent data streams in parallel and is suitable for scaling to any power of two FFT size N. FPGA implementation of the architecture demonstrated significant resource efficiency and high-throughput in comparison to relevant current approaches within literature. The proposed architecture designs were realized with Xilinx System Generator (XSG) and evaluated on both Virtex-5 and Virtex-7 FPGA devices. Post place and route results demonstrated maximum frequency values over 400 MHz and 470 MHz for Virtex-5 and Virtex-7 FPGA devices respectively

Crossref

Directory of Open Access Journals

Research Repository and Portal - University of the West of Scotland

ResearchOnline@GCU

Generic design of Chinese remaindering schemes

Author: Dumas Jean-Guillaume
Gautier Thierry
Roch Jean-Louis
Publication venue
Publication date: 01/01/2010
Field of study

We propose a generic design for Chinese remainder algorithms. A Chinese remainder computation consists in reconstructing an integer value from its residues modulo non coprime integers. We also propose an efficient linear data structure, a radix ladder, for the intermediate storage and computations. Our design is structured into three main modules: a black box residue computation in charge of computing each residue; a Chinese remaindering controller in charge of launching the computation and of the termination decision; an integer builder in charge of the reconstruction computation. We then show that this design enables many different forms of Chinese remaindering (e.g. deterministic, early terminated, distributed, etc.), easy comparisons between these forms and e.g. user-transparent parallelism at different parallel grains

arXiv.org e-Print Archive

Crossref

Hal - Université Grenoble Alpes

INRIA a CCSD electronic archive server

Low Power Implementation of Non Power-of-Two FFTs on Coarse-Grain Reconfigurable Architectures

Author: Quevremont Jérôme
Rivaton Arnaud
Smit Gerard
Wolkotte Pascal
Zhang Qiwei
Publication venue: Technology Foundation, STW
Publication date: 01/01/2005
Field of study

The DRM standard for digital radio broadcast in the AM band requires integrated devices for radio receivers at very low power. A System on Chip (SoC) call DiMITRI was developed based on a dual ARM9 RISC core architecture. Analyses showed that most computation power is used in the Coded Orthogonal Frequency Division Multiplexing (COFDM) demodulation to compute Fast Fourier Transforms (FFT) and inverse transforms (IFFT) on complex samples. These FFTs have to be computed on non power-of-two numbers of samples, which is very uncommon in the signal processing world. The results obtained with this chip, lead to the objective to decrease the power dissipated by the COFDM demodulation part using a coarse-grain reconfigurable structure as a coprocessor. This paper introduces two different coarse-grain architectures: PACT XPP technology and the Montium, developed by the University of Twente, and presents the implementation of a\ud Fast Fourier Transform on 1920 complex samples. The implementation result on the Montium shows a saving of a factor 35 in terms of processing time, and 14 in terms of power consumption compared to the RISC implementation, and a\ud smaller area. Then, as a conclusion, the paper presents the next steps of the development and some development issues

University of Twente Research Information