56 research outputs found

    Square-rich fixed point polynomial evaluation on FPGAs

    Polynomial evaluation is important across a wide range of application domains, so significant work has been done on accelerating its computation. The conventional algorithm, referred to as Horner's rule, uses the fewest steps but can lead to increased latency due to serial computation. Parallel evaluation algorithms such as Estrin's method have shorter latency than Horner's rule, but achieve this at the expense of large hardware overhead. This paper presents an efficient polynomial evaluation algorithm that restructures the evaluation process to include an increased number of squaring steps. By using a squarer design that is more efficient than general multiplication, this can result in polynomial evaluation with a 57.9% latency reduction over Horner's rule and 14.6% over Estrin's method, while consuming less area than Horner's rule, when implemented on a Xilinx Virtex 6 FPGA. When applied in fixed point function evaluation, where precision requirements limit the rounding of operands, it still achieves a 52.4% performance gain compared to Horner's rule with only a 4% area overhead in evaluating 5th degree polynomials.
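
    The contrast between the two baseline schemes named in this abstract can be sketched in software. The snippet below is only an illustration of Horner's rule and Estrin's method on exact integers; it is not the paper's square-rich algorithm, and the function names and coefficient values are assumptions made for the example.

```python
def horner(coeffs, x):
    """Evaluate sum(coeffs[i] * x**i) with a serial chain of multiply-adds."""
    acc = 0
    for c in reversed(coeffs):
        acc = acc * x + c  # each step depends on the previous one
    return acc


def estrin(coeffs, x):
    """Evaluate the same polynomial by pairing terms and squaring x per level."""
    terms = list(coeffs)
    power = x
    while len(terms) > 1:
        if len(terms) % 2:
            terms.append(0)  # pad so that the terms pair up
        # Each pair (lo, hi) collapses to lo + hi * power; pairs are independent,
        # which is what lets hardware compute them in parallel.
        terms = [terms[i] + terms[i + 1] * power for i in range(0, len(terms), 2)]
        power = power * power  # squaring step: x, x^2, x^4, ...
    return terms[0] if terms else 0


# Illustrative 5th degree polynomial: 7 - 3x + 5x^3 + 2x^4 + x^5
coeffs = [7, -3, 0, 5, 2, 1]
assert horner(coeffs, 3) == estrin(coeffs, 3)
```

    Horner's loop carries a dependency from one multiply-add to the next, while every level of Estrin's scheme needs one squaring of the running power of x, which is where a dedicated squarer could pay off.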

    A Digital Integrated Inertial Navigation System For Aerial Vehicles


    Efficient algorithm and architecture for implementation of multiplier circuits in modern FPGAs

    High speed multiplication in Field Programmable Gate Arrays is often performed either using logic cells or with built-in DSP blocks. The latter provide the highest performance for arithmetic operations while also being optimized in terms of power and area utilization. Scalability of input operands is limited to that of a single DSP block, and the current CAD tools provide little help when the designer needs to build larger arithmetic blocks. Large word lengths are required in applications such as cryptography and video processing. The present thesis proposes an effective approach to the problem of building large integer multipliers out of smaller ones by giving the system designer two algorithms for a given FPGA technology. The first proposed algorithm partitions large input multipliers into an architecture-aware design. The second algorithm then places the generated design in an optimal layout, minimizing interconnect delay. The thesis concludes with simulation and hardware-generated data to support the proposed algorithms.
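
    The idea of building a large multiplier out of smaller ones can be sketched as follows. This is a generic schoolbook tiling, not the thesis's partitioning or placement algorithm; the 24x17-bit tile size and the function names are assumptions chosen purely for illustration.

```python
def split_chunks(value, chunk_bits):
    """Split a non-negative integer into little-endian chunks of chunk_bits bits."""
    mask = (1 << chunk_bits) - 1
    chunks = []
    while True:
        chunks.append(value & mask)
        value >>= chunk_bits
        if value == 0:
            return chunks


def tiled_multiply(a, b, a_bits=24, b_bits=17):
    """Multiply unsigned integers using only a_bits x b_bits partial products.

    Returns the product and the number of small multiplications (tiles) used.
    The tile size is a hypothetical stand-in for one DSP block's capacity.
    """
    total = 0
    tiles = 0
    for i, a_chunk in enumerate(split_chunks(a, a_bits)):
        for j, b_chunk in enumerate(split_chunks(b, b_bits)):
            # One small multiplier per tile; the shift restores the tile's weight.
            total += (a_chunk * b_chunk) << (i * a_bits + j * b_bits)
            tiles += 1
    return total, tiles


a = 0xFEDCBA9876543210
b = 0x0123456789ABCDEF
product, tiles = tiled_multiply(a, b)
assert product == a * b
```

    In hardware, each tile would map to one small multiplier and the shifted partial products would feed an adder tree, so the tile count gives a rough feel for how resource usage grows with operand width.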

    Hardware Implementations of the WG-16 Stream Cipher with Composite Field Arithmetic

    The WG stream cipher family consists of stream ciphers based on the Welch-Gong (WG) transformations that are used as a nonlinear filter applied to the output of a linear feedback shift register (LFSR). The aim of this thesis is an exploration of the design space of the WG-16 stream cipher. Five different representations of the field elements were analyzed, namely the polynomial basis representation, the normal basis representation and three isomorphic tower field constructions of $\mathbb{F}_{2^{16}}$: $\mathbb{F}_{(((2^2)^2)^2)^2}$, $\mathbb{F}_{(2^4)^4}$ and $\mathbb{F}_{(2^8)^2}$. Each design option begins with an in-depth description of different field constructions and their impact on the top-level WG transformation circuit. Normal basis representation of elements for each level of the tower was chosen for field constructions $\mathbb{F}_{(((2^2)^2)^2)^2}$ and $\mathbb{F}_{(2^4)^4}$, and a mixed basis, with polynomial basis for the lower and normal basis for the higher level of the tower, for $\mathbb{F}_{(2^8)^2}$. Representation of field elements affects the field arithmetic, which in turn affects the entire design. Targeting high throughput, pipelined architectures were developed, and pipelining was based on the particular field construction: each extension over the prime field offers a new pipelining possibility. Pipelining at a lower level of the tower field reduces the clock period. The most flexible pipelining options are possible for $\mathbb{F}_{(((2^2)^2)^2)^2}$, a highly regular construction, which permits an algebraic optimization of the WG transformation that removes two multiplications. High speed, achieved by adequate pipelining granularity, and smaller area due to the removed multipliers make $\mathbb{F}_{(((2^2)^2)^2)^2}$ the most suitable field construction for the implementation of WG-16.
    The best WG-16 modules achieve a throughput of 222 Mbit/s with 476 slices used on the Xilinx Spartan-6 FPGA device xc6slx9 (using Xilinx Synthesis Tool (XST) for synthesis and ISE for implementation [47]) and a throughput of 529 Mbit/s with an area cost of 12215 GEs for ASIC implementation in 65 nm CMOS technology (using Synopsys Design Compiler for synthesis [45] and Cadence SoC Encounter to complete the Place-and-Route phase).
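
    How a tower field reduces multiplication in the big field to multiplications in a smaller one can be sketched on the smallest case, $\mathbb{F}_{(2^2)^2}$. The snippet is an assumption-laden toy: it uses polynomial bases at both levels (the thesis uses normal and mixed bases), it stops at 16 elements rather than $2^{16}$, and the reduction polynomials and names are choices made only for this example.

```python
def gf4_mul(a, b):
    """Multiply in GF(2^2) = GF(2)[x]/(x^2 + x + 1); elements are 2-bit ints."""
    a1, a0 = a >> 1, a & 1
    b1, b0 = b >> 1, b & 1
    # (a1*x + a0)(b1*x + b0) with x^2 replaced by x + 1
    hi = (a1 & b1) ^ (a1 & b0) ^ (a0 & b1)
    lo = (a1 & b1) ^ (a0 & b0)
    return (hi << 1) | lo


LAMBDA = 0b10  # the element x of GF(2^2); y^2 + y + x has no root in GF(2^2)


def gf16_mul(a, b):
    """Multiply in GF((2^2)^2) = GF(2^2)[y]/(y^2 + y + LAMBDA); 4-bit ints."""
    ah, al = a >> 2, a & 3
    bh, bl = b >> 2, b & 3
    hh = gf4_mul(ah, bh)
    # (ah*y + al)(bh*y + bl) with y^2 replaced by y + LAMBDA
    hi = hh ^ gf4_mul(ah, bl) ^ gf4_mul(al, bh)
    lo = gf4_mul(hh, LAMBDA) ^ gf4_mul(al, bl)
    return (hi << 2) | lo


# Every nonzero element has a multiplicative inverse, as required of a field.
assert all(any(gf16_mul(a, b) == 1 for b in range(1, 16)) for a in range(1, 16))
```

    Each level of such a tower repeats the same pattern on wider operands, which is one way to see why a regular construction offers natural pipeline cut points between levels.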

    Harder, Better, Faster, Stronger - Elliptic Curve Discrete Logarithm Computations on FPGAs

    Computing discrete logarithms takes time. It takes time to develop new algorithms, choose the best algorithms, implement these algorithms correctly and efficiently, keep the system running for several months, and, finally, publish the results. In this paper, we present a high-performance architecture that can be used to compute discrete logarithms of Weierstrass curves defined over binary fields and Koblitz curves using FPGAs. We used the architecture to compute, for the first time, a discrete logarithm of the elliptic curve sect113r1, a previously standardized binary curve, using 10 Kintex-7 FPGAs. To achieve this result, we investigated different iteration functions, used a negation map, dealt with the fruitless cycle problem, built an efficient FPGA design that processes 900 million iterations per second, and tended the optimized implementations running on the FPGAs for several months.
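
    Iteration functions, negation maps and fruitless cycles are notions from Pollard's rho method for discrete logarithms. As a rough illustration of what an iteration function does, the sketch below runs textbook Pollard rho with Floyd cycle detection in a small multiplicative subgroup modulo a prime. It is a swapped-in toy setting, not the paper's elliptic-curve design: there is no negation map and no fruitless cycle handling, and the parameters p = 1019, q = 509, g = 4 and the function name are assumptions for the example.

```python
def pollard_rho_dlog(g, h, p, q):
    """Return k with pow(g, k, p) == h, where g has prime order q modulo p."""

    def step(x, a, b):
        # Three-way pseudo-random walk that keeps x == g^a * h^b (mod p).
        branch = x % 3
        if branch == 0:
            return x * x % p, 2 * a % q, 2 * b % q
        if branch == 1:
            return x * g % p, (a + 1) % q, b
        return x * h % p, a, (b + 1) % q

    for start in range(1, q):  # retry from a new start if a collision is useless
        slow = fast = (pow(g, start, p), start, 0)
        while True:  # Floyd cycle detection: fast takes two steps per slow step
            slow = step(*slow)
            fast = step(*step(*fast))
            if slow[0] == fast[0]:
                break
        db = (fast[2] - slow[2]) % q
        if db:
            # g^a1 * h^b1 == g^a2 * h^b2  =>  k == (a1 - a2) / (b2 - b1) (mod q)
            k = (slow[1] - fast[1]) * pow(db, -1, q) % q
            if pow(g, k, p) == h:
                return k
    raise ValueError("no usable collision found")


p, q, g = 1019, 509, 4  # toy parameters: 4 generates the order-509 subgroup
h = pow(g, 123, p)
k = pollard_rho_dlog(g, h, p, q)
assert pow(g, k, p) == h
```

    The walk needs on the order of the square root of the group size in steps, which is why a real attack on a 113-bit curve calls for hardware running hundreds of millions of iterations per second for months.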

    Adaptive and hybrid schemes for efficient parallel squaring and cubing units

    Squaring ($X^2$) and cubing ($X^3$) are special cases of multiplication used in many applications, such as image compression, equalization, decoding and demodulation, 3D graphics, scientific computing, artificial neural networks, logarithmic number systems, and multimedia applications. They can also be an efficient way to compute other basic functions. Therefore, improving their performance is a goal for many researchers. This dissertation discusses modifications to algorithms for parallel squaring and cubing units in both signed and unsigned representations. A truncation technique is then applied to improve their performance. Each unit is modeled with a linear evaluation model to estimate its area and delay. A C program was written to generate Hardware Description Language files for each unit. These units are verified in simulation. Moreover, area, delay, and power consumption are calculated for each unit and compared with those of previous approaches for both Xilinx Virtex 5 FPGA and IBM 65 nm ASIC technologies.
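
    Why a squarer can be cheaper than a general multiplier follows from the symmetry of its partial products. The sketch below shows the standard folding identity for unsigned operands; it is not the dissertation's adaptive or hybrid scheme, and the function name is an assumption for the example.

```python
def square_folded(x, width):
    """Square an unsigned width-bit integer from folded partial products.

    Returns the square and the number of two-input AND terms used.
    """
    bits = [(x >> i) & 1 for i in range(width)]
    total = 0
    and_terms = 0
    for i in range(width):
        total += bits[i] << (2 * i)  # diagonal: x_i * x_i == x_i, no AND needed
        for j in range(i + 1, width):
            # x_i*x_j and x_j*x_i are equal, so add one copy at doubled weight.
            total += (bits[i] & bits[j]) << (i + j + 1)
            and_terms += 1
    return total, and_terms


value, and_terms = square_folded(13, 8)
assert value == 169
```

    For an 8-bit operand the folded form needs 28 AND terms, compared with the 64 partial-product bits of a general 8x8 array multiplier.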