774 research outputs found
Comparison of Scalable Montgomery Modular Multiplication Implementations Embedded in Reconfigurable Hardware
International audienceThis paper presents a comparison of possible approaches for an efficient implementation of Multiple-word radix-2 Montgomery Modular Multiplication (MM) on modern Field Programmable Gate Arrays (FPGAs). The hardware implementation of MM coprocessor is fully scalable what means that it can be reused in order to generate long-precision results independently on the word length of the originally proposed coprocessor. The first of analyzed implementations uses a data path based on traditionally used redundant carry-save adders, the second one exploits, in scalable designs not yet applied, standard carry-propagate adders with fast carry chain logic. As a control unit and a platform for purely software implementation an embedded soft-core processor Altera NIOS is employed. All implementations use large embedded memory blocks available in recent FPGAs. Speed and logic requirements comparisons are performed on the optimized software and combined hardware-software designs in Altera FPGAs. The issues of targeting a design specifically for a FPGA are considered taking into account the underlying architecture imposed by the target FPGA technology. It is shown that the coprocessors based on carry-save adders and carry-propagate adders provide comparable results in constrained FPGA implementations but in case of carry-propagate logic, the solution requires less embedded memory and provides some additional implementation advantages presented in the paper
Uso eficiente de aritmética redundante en FPGAs
Hasta hace pocos años, la utilización de aritmética redundante en FPGAs había
sido descartada por dos razones principalmente. En primer lugar, por el buen
rendimiento que ofrecían los sumadores de acarreo propagado, gracias a la lógica de
de acarreo que poseían de fábrica y al pequeño tamaño de los operandos en las
aplicaciones típicas para FPGAs. En segundo lugar, el excesivo consumo de área que
las herramientas de síntesis obtenían cuando mapeaban unidades que trabajan en carrysave.
En este trabajo, se muestra que es posible la utilización de aritmética redundante
carry-save en FPGAs de manera eficiente, consiguiendo un aumento en la velocidad de
operación con un consumo de recursos razonable. Se ha introducido un nuevo formato
redundante doble carry-save y se ha demostrado que la manera óptima para la
realización de multiplicadores de elevado ancho de palabra es la combinación de
multiplicadores empotrados con sumadores carry-save.Till a few years ago, redundant arithmetic had been discarded to be use in FPGA
mainly for two reasons. First, the efficient results obtained using carry-propagate adders
thanks to the carry-logic embedded in FPGAs and the small sizes of operands in typical
FPGA applications. Second, the high number of resources that the synthesis tools
utilizes to implement carry-save circuits.
In this work, it is demonstrated that carry-save arithmetic can be efficiently used
in FPGA, obtaining an important speed improvement with a reasonable area cost. A
new redundant format, double carry-save, has been introduced, and the optimal
implementation of large size multipliers has been shown based on embedded multipliers
and carry-save adders
Multi-operand Decimal Adder Trees for FPGAs
The research and development of hardware designs for decimal arithmetic is currently going under an intense activity. For most part, the methods proposed to implement fixed and floating point operations are intended for ASIC designs. Thus, a direct mapping or adaptation of these techniques into a FPGA could be far from an optimal solution. Only a few studies have considered new methods more suitable for FPGA implementations. A basic operation that has not received enough attention in this context is multi-operand BCD addition. For example, it is of interest for low latency implementations of decimal fixed and floating point multipliers and decimal fused multiply-add units. We have explored the most representative proposals for multi-operand BCD addition and found that the resultant implementations in FPGAs are still very inefficient in terms of both area and latency when compared to their binary counterparts. In this paper we present a new method for fast and efficient implementation of multi-operand BCD addition in current FPGA devices. In particular, our proposal maps quite well into the slice structure of the Xilinx Virtex-5/Virtex-6 families and it is highly pipelineable. The synthesis results for a Virtex-6 device indicate that our implementations halve the area and latency of previous proposals, presenting area and delay figures close to those of optimal binary adder trees.La recherche sur l'implantation en matériel de l'arithmétique décimale est actuellement très active, la plupart des travaux portant sur des opérateurs pour les processeurs, en virgule fixe ou flottante. Mais les techniques développées pour un circuit intégré n'aboutissent pas forcément à une implémentation optimale dans un FPGA. Il n'y a que peu d'études ciblant explicitement les FPGA. Cet article s'intéresse dans ce contexte, à l'addition BCD multi-opérande, au cœur de multiplieurs et de multiplieurs-accumulateurs à faible latence. Nous étudions les architectures proposées pour cette opération décimale, et nous observons que, sur FPGA, leur performance (surface et latence) est très inférieure à celle des opérations binaire à précision comparable. Nous présentons donc dans cet article une nouvelle technique d'addition BCD multi-opérandes qui s'avère plus efficace que les propositions précédentes sur les FPGA actuels. Elle s'adapte particulièrement bien à la structure fine des FPGA Xilinx Virtex-5/Virtex-6, et se prête bien au pipeline. Les résultats de synthèse montrent que notre implémentation divise par deux la surface et la latence par rapport aux propositions précédentes, les ramenant à des valeurs comparables à celles des meilleurs additionneurs multi-opérandes binaires
Parametric, Secure and Compact Implementation of RSA on FPGA
We present a fast, efficient, and parameterized modular multiplier and a secure exponentiation circuit especially intended for FPGAs on the low end of the price range. The design utilizes dedicated block multipliers as the main functional unit and Block-RAM as storage unit for the operands. The adopted design methodology allows adjusting the number of multipliers, the radix used in the multipliers, and number of words to meet the system requirements such as
available resources, precision and timing constraints. The architecture, based on the Montgomery modular multiplication algorithm, utilizes a pipelining technique that allows concurrent operation of hardwired multipliers. Our
design completes 1020-bit and 2040-bit modular multiplications in 7.62 μs and 27.0 μs, respectively. The multiplier uses a moderate amount of system resources while achieving the best area-time product in literature. 2040-bit modular exponentiation engine can easily fit into Xilinx Spartan-3E 500; moreover the exponentiation circuit withstands known side channel attacks
A Novel VLSI Design On CSKA Of Binary Tree Adder With Compaq Area And High Throughput
Addition is one of the most basic operations performed in all computing units, including microprocessors and digital signal processors. It is also a basic unit utilized in various complicated algorithms of multiplication and division. Efficient implementation of an adder circuit usually revolves around reducing the cost to propagate the carry between successive bit positions. Multi-operand adders are important arithmetic design blocks especially in the addition of partial products of hardware multipliers. The multi-operand adders (MOAs) are widely used in the modern low-power and high-speed portable very-large-scale integration systems for image and signal processing applications such as digital filters, transforms, convolution neural network architecture. Hence, a new high-speed and area efficient adder architecture is proposed using pre-compute bitwise addition followed by carry prefix computation logic to perform the three-operand binary addition that consumes substantially less area, low power and drastically reduces the adder delay. Further, this project is enhanced by using Modified carry bypass adder to further reduce more density and latency constraints. Modified carry skip adder introduces simple and low complex carry skip logic to reduce parameters constraints. In this proposal work, designed binary tree adder (BTA) is analyzed to find the possibilities for area minimization. Based on the analysis, critical path of carry is taken into the new logic implementation and the corresponding design of CSKP are proposed for the BTA with AOI, OAI
HIGH THROUGHPUT IMPLEMENTATION OF 64 BIT MODIFIED WALLANCE MAC USING MULTIOPERAND ADDERS
Although redundant addition is widely used to design parallel multioperand adders for ASIC implementations, the use of redundant adders on Field Programmable Gate Arrays (FPGAs) has generally been avoided. The main reasons are the efficient implementation of carry propagate adders (CPAs) on these devices (due to their specialized carry-chain resources) as well as the area overhead of the redundant adders when they are implemented on FPGAs. This project presents different approaches to the efficient implementation of generic carry-save compressor trees. In computing, especially digital signal processing, the multiply–accumulate operation is a common step that computes the product of two numbers and adds that product to an accumulator. The hardware unit that performs the operation is known as a multiplier–accumulator (MAC, or MAC unit); the operation itself is also often called a MAC or a MAC operation. Power dissipation is one of the most important design objectives in integrated circuit, after speed. Digital signal processing (DSP) circuits whose main building block is a Multiplier-Accumulator (MAC) unit. High speed and low power MAC unit is desirable for any DSP processor. This is because speed and throughput rate are always the concerns of DSP system. MAC unit consists of adder, multiplier, and an accumulator it preserves a unique mapping between input and output vector of the particular circuit. In this MAC operation is performed in two parts Partial Product Generation (PPG) circuit and Multi-Operand Addition (MOA) circui
LUXOR: An FPGA Logic Cell Architecture for Efficient Compressor Tree Implementations
We propose two tiers of modifications to FPGA logic cell architecture to
deliver a variety of performance and utilization benefits with only minor area
overheads. In the irst tier, we augment existing commercial logic cell
datapaths with a 6-input XOR gate in order to improve the expressiveness of
each element, while maintaining backward compatibility. This new architecture
is vendor-agnostic, and we refer to it as LUXOR. We also consider a secondary
tier of vendor-speciic modifications to both Xilinx and Intel FPGAs, which we
refer to as X-LUXOR+ and I-LUXOR+ respectively. We demonstrate that compressor
tree synthesis using generalized parallel counters (GPCs) is further improved
with the proposed modifications. Using both the Intel adaptive logic module and
the Xilinx slice at the 65nm technology node for a comparative study, it is
shown that the silicon area overhead is less than 0.5% for LUXOR and 5-6% for
LUXOR+, while the delay increments are 1-6% and 3-9% respectively. We
demonstrate that LUXOR can deliver an average reduction of 13-19% in logic
utilization on micro-benchmarks from a variety of domains.BNN benchmarks
benefit the most with an average reduction of 37-47% in logic utilization,
which is due to the highly-efficient mapping of the XnorPopcount operation on
our proposed LUXOR+ logic cells.Comment: In Proceedings of the 2020 ACM/SIGDA International Symposium on
Field-Programmable Gate Arrays (FPGA'20), February 23-25, 2020, Seaside, CA,
US
An Alternative Carry-save Arithmetic for New Generation Field Programmable Gate Arrays
In this work, a double carry-save addition operation is proposed, which is efficiently synthesized for 6-input
LUT-based eld programmable gate arrays (FPGAs). The proposed arithmetic operation is based on redundant number
representation and provides carry propagation-free addition. Using the proposed arithmetic operation, a compact and
fast multiply and accumulate unit is designed. To our knowledge, the proposed design provides the fastest multiply-add
operation for 6-input LUT-based FPGA systems. A nite impulse response lter implementation is given to show the
performance of the proposed structure. The proposed implementation provides a dramatic performance increase, which
is at least 2 times faster than conventional binary multiply-add implementations
- …