369 research outputs found
A versatile Montgomery multiplier architecture with characteristic three support
We present a novel unified core design which is extended to realize Montgomery multiplication in the fields GF(2^n), GF(3^m), and GF(p). Our unified design supports RSA and elliptic curve schemes, as well as the identity-based encryption which requires a pairing computation on an elliptic curve. The architecture is pipelined and is highly scalable. The unified core utilizes the redundant signed digit representation to reduce the critical path delay. While the carry-save representation used in classical unified architectures is only good for addition and multiplication operations, the redundant signed digit representation also facilitates efficient computation of comparison and subtraction operations besides addition and multiplication. Thus, there is no need for a transformation between the redundant and the non-redundant representations of field elements, which would be required in the classical unified architectures to realize the subtraction and comparison operations. We also quantify the benefits of the unified architectures in terms of area and critical path delay. We provide detailed implementation results. The metric shows that the new unified architecture provides an improvement over a hypothetical non-unified architecture of at least 24.88%, while the improvement over a classical unified architecture is at least 32.07%.
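For readers unfamiliar with the operation named in this abstract, the following is a minimal software sketch of Montgomery multiplication in GF(p) only. It is not the paper's hardware architecture: the function names, the choice R = 2^k, and the use of plain Python integers instead of a redundant signed digit representation are all assumptions made for illustration. The final conditional subtraction is the kind of comparison and subtraction step the abstract refers to.

```python
# Illustrative sketch only: textbook Montgomery multiplication over GF(p).
# All names here are hypothetical; requires Python 3.8+ for pow(p, -1, r).

def montgomery_setup(p, k):
    """Return p' with p * p' == -1 (mod R), where R = 2**k > p and p is odd."""
    r = 1 << k
    return (-pow(p, -1, r)) % r

def montgomery_multiply(a_bar, b_bar, p, k, p_neg_inv):
    """Return a_bar * b_bar * R^(-1) mod p for inputs already below p."""
    mask = (1 << k) - 1
    t = a_bar * b_bar
    m = ((t & mask) * p_neg_inv) & mask   # m = t * p' mod R
    u = (t + m * p) >> k                  # exact division by R
    # Final comparison and conditional subtraction keep the result below p.
    return u - p if u >= p else u

def to_montgomery(a, p, k):
    """Map a to its Montgomery form a * R mod p."""
    return (a << k) % p

def from_montgomery(a_bar, p, k, p_neg_inv):
    """Map a Montgomery-form value back to an ordinary residue."""
    return montgomery_multiply(a_bar, 1, p, k, p_neg_inv)

def modmul(a, b, p, k):
    """Multiply a and b modulo p by passing through the Montgomery domain."""
    p_neg_inv = montgomery_setup(p, k)
    a_bar = to_montgomery(a, p, k)
    b_bar = to_montgomery(b, p, k)
    c_bar = montgomery_multiply(a_bar, b_bar, p, k, p_neg_inv)
    return from_montgomery(c_bar, p, k, p_neg_inv)
```

For a single product the conversions in and out of Montgomery form cost more than they save; the method pays off in long chains of multiplications such as modular exponentiation, where operands stay in Montgomery form throughout.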
Multi-Valued Majority Logic Circuits Using Spin Waves
With increasing data sets for processing, there is a requirement to build faster and smaller arithmetic circuits. One of the ways to improve the performance of higher order arithmetic units is to reduce the carry propagation levels. Multi-valued logic enables this by reducing the number of digits required to represent a range of numbers. Area reduction is also obtained through fewer operations and signals required to realise a function.
Though multi-valued logic has these advantages in theory, its implementation in CMOS has not been efficient. The main reason is that multi-valued logic is emulated in CMOS using binary switches. Two main approaches are followed in implementing multi-valued logic in CMOS. Voltage mode logic, where the logic states are encoded using the node voltages, suffers from low noise margins and a radix limited by the power supply. Current mode logic, where the branch currents are used to represent the logic levels, suffers from high power consumption due to static current flow and from the need for restoration devices. The mindset of the post-CMOS approaches explored so far for multi-valued logic circuit design has been to replace the CMOS switches with their novel nano switches. Hence they too suffer from the same issues as the CMOS implementation.
Our value proposition is the use of a truly multi-state device based on electron spin. Spin waves, which are a collection of electron spins of an atom, enable multi-valued logic by allowing information to be encoded in the amplitude and phase of the wave. Another advantage of the spin wave fabric (SPWF) is that computation takes place through wave propagation and interference, which does not involve any movement of charge. This enables building low-energy, smaller, and faster multi-valued circuits. In this thesis, implementation of the basic building blocks of multi-valued logic using these novel spin wave based devices is shown. Building of arithmetic circuits such as adders from these building blocks has also been demonstrated. To quantify the benefits of spin wave based multi-valued circuits, they are benchmarked against CMOS. For 32 bits, our projected comparisons show a 5X increase in performance, a 125X area improvement, and a 1717X power reduction for hexadecimal spin wave based adders compared to binary CMOS. Similarly, there is a 4X increase in performance of the hexadecimal SPWF multiplier compared to CMOS for 16 bits. Finally, we have implemented the I/O circuits for a smooth interface between binary CMOS and multi-valued SPWF logic.
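The claim that a higher radix shortens carry propagation can be made concrete with a small calculation. The sketch below only counts digit positions, which bounds the number of stages a ripple carry must cross; it says nothing about per-digit delay, area, or power, and the function name is hypothetical.

```python
# Illustrative sketch only: digit positions needed to hold a value of a
# given bit width in a given radix. This is not a circuit or timing model.

def digits_needed(bits, radix):
    """Return the number of radix-`radix` digits covering all `bits`-bit values."""
    max_value = (1 << bits) - 1   # largest unsigned value of that width
    count = 0
    while max_value > 0:
        max_value //= radix
        count += 1
    return max(count, 1)          # zero still occupies one digit

for radix in (2, 4, 16):
    print(radix, digits_needed(32, radix))
```

A 32-bit operand occupies 32 binary digits but only 8 hexadecimal digits, so a worst-case ripple carry crosses a quarter as many digit positions in radix 16.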
Arithmetic core generation using bit heaps
A bit heap is a data structure that holds the unevaluated sum of an arbitrary number of bits, each weighted by some power of two. Most advanced arithmetic cores can be viewed as involving one or several bit heaps. We claim here that this point of view leads to better global optimization at the algebraic level, at the circuit level, and in terms of software engineering. To demonstrate it, a generic software framework is introduced for the definition and optimization of bit heaps. This framework, targeting DSP-enabled FPGAs, is developed within the open-source FloPoCo arithmetic core generator. Its versatility is demonstrated on several examples: multipliers, complex multipliers, polynomials, and discrete cosine transform.
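The definition in this abstract can be sketched in a few lines of software. The class below is a hypothetical model, not FloPoCo's API or its compression heuristics: it stores bits by weight, fills the heap with the partial products of an unsigned multiplication, and reduces column heights with plain 3:2 compressors (full adders) until a single two-operand addition could finish the sum.

```python
# Illustrative sketch only: a software model of a bit heap.
# Class and method names are hypothetical and do not mirror FloPoCo.
from collections import defaultdict

class BitHeap:
    def __init__(self):
        # Maps a weight exponent w to the bits that each count for 2**w.
        self.columns = defaultdict(list)

    def add_bit(self, weight, bit):
        self.columns[weight].append(bit & 1)

    def add_product(self, x, y, width):
        """Throw all partial-product bits of unsigned x * y onto the heap."""
        for i in range(width):
            for j in range(width):
                self.add_bit(i + j, (x >> i) & (y >> j) & 1)

    def value(self):
        """Evaluate the sum that the heap represents."""
        return sum(b << w for w, bits in self.columns.items() for b in bits)

    def compress(self):
        """Apply full adders until every column holds at most two bits."""
        while any(len(bits) > 2 for bits in self.columns.values()):
            nxt = defaultdict(list)
            for w, bits in list(self.columns.items()):
                while len(bits) >= 3:
                    a, b, c = bits.pop(), bits.pop(), bits.pop()
                    nxt[w].append(a ^ b ^ c)                          # sum bit
                    nxt[w + 1].append((a & b) | (a & c) | (b & c))    # carry bit
                nxt[w].extend(bits)   # leftover bits pass through unchanged
            self.columns = nxt

heap = BitHeap()
heap.add_product(13, 11, 4)
heap.compress()
print(heap.value())
```

Because every full adder replaces three bits with two of the same total weight, the represented value never changes during compression, and the loop always terminates.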
High-Performance Accurate and Approximate Multipliers for FPGA-Based Hardware Accelerators
Multiplication is one of the most widely used arithmetic operations in a variety of applications, such as image/video processing and machine learning. FPGA vendors provide high-performance multipliers in the form of DSP blocks. These multipliers are not only limited in number and fixed in location on FPGAs but can also create additional routing delays and may prove inefficient for smaller bit-width multiplications. Therefore, FPGA vendors additionally provide optimized soft IP cores for multiplication. However, in this work, we advocate that these soft multiplier IP cores for FPGAs still need better designs to provide high performance and resource efficiency. Toward this, we present generic area-optimized, low-latency accurate and approximate softcore multiplier architectures, which exploit the underlying architectural features of FPGAs, i.e., lookup table (LUT) structures and fast-carry chains, to reduce the overall critical path delay (CPD) and resource utilization of multipliers. Compared to the Xilinx multiplier LogiCORE IP, our proposed unsigned and signed accurate architectures provide up to 25% and 53% reduction in LUT utilization, respectively, for different sizes of multipliers. Moreover, with our unsigned approximate multiplier architectures, a reduction of up to 51% in the CPD can be achieved with an insignificant loss in output accuracy when compared with the LogiCORE IP. For illustration, we have deployed the proposed multiplier architectures in accelerators used in image and video applications, and evaluated them for area and performance gains. Our library of accurate and approximate multipliers is open source and available online at https://cfaed.tu-dresden.de/pd-downloads to fuel further research and development in this area, facilitate reproducible research, and thereby enable a new research direction for the FPGA community.