70 research outputs found

    Square-rich fixed point polynomial evaluation on FPGAs

    Get PDF
    Polynomial evaluation is important across a wide range of application domains, so significant work has been done on accelerating its computation. The conventional algorithm, referred to as Horner's rule, involves the least number of steps but can lead to increased latency due to serial computation. Parallel evaluation algorithms such as Estrin's method have shorter latency than Horner's rule, but achieve this at the expense of large hardware overhead. This paper presents an efficient polynomial evaluation algorithm, which reforms the evaluation process to include an increased number of squaring steps. By using a squarer design that is more efficient than general multiplication, this can result in polynomial evaluation with a 57.9% latency reduction over Horner's rule and 14.6% over Estrin's method, while consuming less area than Horner's rule, when implemented on a Xilinx Virtex 6 FPGA. When applied in fixed point function evaluation, where precision requirements limit the rounding of operands, it still achieves a 52.4% performance gain compared to Horner's rule with only a 4% area overhead in evaluating 5th degree polynomials

    Optimized Efficient Soft-Decision Viterbi Architecture for Application-Specific Processors

    Get PDF
    This thesis provides an efficient implementation of a soft-decision Viterbi decoder implemented in Global Foundries cmos32soi 32nm technology. This architecture utilizes an efficient branch metric (BM) and normalization architecture by using application specific squaring and comparator units. Results indicate a good trade off between area, delay, and power. Compared to a previous implementation, results indicate a significant decrease in area and delay while running in excess of 1 GHz. In addition, when compared with a soft-decision implementation using a traditional multiplier and comparator, significant reductions in area, delay, and power were observed. Results are given based on using ARM-based standard-cells, and energy/power results are based on Hardware-Descriptive Language implementation. Although this architecture uses more area than hard-decision branch metric implementations, soft-decision implementations can decrease the bit error rate two orders of magnitude thus reducing possible retransmit rates, resulting in both increased throughput and transmission rates.Electrical Engineerin

    Simulated Performance of a Reduction-Based Multiprocessing System

    Get PDF
    Multiprocessing systems have the potential for increasing system speed over what is now offered by device technology. They must provide the means of generating work for the processors, getting the work to processors, and coherently collecting the results from the processors. For most applications, they should also ensure the repeatability of behavior, i.e., determinacy, speed-independence, or elimination of critical races. Determinacy can be destroyed, for example, by permitting-in separate, concurrent processes statements such as x: = x + 1 and if x = 0 then
 else
 , which share a common variable. Here, there may be a critical race, in that more than one global outcome is possible, depending on execution order. But by basing a multiprocessing system on functional languages, we can avoid such dangers. Our concern is the construction of multiprocessors that can be programmed in a logically transparent fashion. In other words, the programmer should not be aware of programming a multiprocessor versus a uniprocessor, except for optimizing performance for a specific configuration. This means that the programmer should not have to set up processes explicitly to achieve concurrent processing, nor be concerned with synchronizing such processes. Multiprocessor systems present unique concurrency problems. Rediflow combines disciplined von Neumann processes with a hybrid reduction and dataflow model in an effective packet-switching network

    Resource Optimal Squarers for FPGAs

    Get PDF
    International audienceSquaring is an essential operation in computer arithmetic that can be considered as a special case of multiplication where several simplifications can be applied to reduce the complexity of the resulting circuit. However, the design of a squarer is not straightforward for modern FPGAs that provide embedded DSP blocks and look-up-tables (LUTs). This work proposes a flexible method to design resource optimal squarers, i.e., a squarer that uses a minimum number of LUTs for a userdefined number of DSP blocks. The method uses an integer linear programming (ILP) formulation based on a generalization of multiplier tiling. It is shown that the proposed squarer design method significantly improves the LUT utilization for a given number of DSPs over previous methods, while maintaining a similar critical path delay and latency

    Hardware/software optimizations for elliptic curve scalar multiplication on hybrid FPGAs

    Get PDF
    Elliptic curve cryptography (ECC) offers a viable alternative to Rivest-Shamir-Adleman (RSA) by delivering equivalent security with a smaller key size. This has several advantages, including smaller bandwidth demands, faster key exchange, and lower latency encryption and decryption. The fundamental operation for ECC is scalar point multiplication, wherein a point P on an elliptic curve defined over a finite field is multiplied by a scalar k. The complexity of this operation requires a hardware implementation to achieve high performance. The algorithms involved in scalar point multiplication are constantly evolving, incorporating the latest developments in number theory to improve computation time. These competing needs, high performance and flexibility, have caused previous implementations to either limit their adaptability or to incur performance losses. This thesis explores the use of a hybrid-FPGA for scalar point multiplication. A hybrid- FPGA contains a general purpose processor (GPP) in addition to reconfigurable fabric. This allows for a software/hardware co-design with low latency communication between the GPP and custom hardware. The elliptic curve operations and finite field inversion are programmed in C code. All other finite field arithmetic is implemented in the FPGA hardware, providing higher performance while retaining flexibility. The resulting implementation achieves speedups ranging from 24 times to 55 times faster than an optimized software implementation executing on a Pentium II workstation. The scalability of the design is investigated in two directions: faster finite field multiplication and increased instruction level parallelism exploitation. Increasing the number of parallel arithmetic units beyond two is shown to be less efficient than increasing the speed of the finite field multiplier

    Efficient Implementation on Low-Cost SoC-FPGAs of TLSv1.2 Protocol with ECC_AES Support for Secure IoT Coordinators

    Get PDF
    Security management for IoT applications is a critical research field, especially when taking into account the performance variation over the very different IoT devices. In this paper, we present high-performance client/server coordinators on low-cost SoC-FPGA devices for secure IoT data collection. Security is ensured by using the Transport Layer Security (TLS) protocol based on the TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256 cipher suite. The hardware architecture of the proposed coordinators is based on SW/HW co-design, implementing within the hardware accelerator core Elliptic Curve Scalar Multiplication (ECSM), which is the core operation of Elliptic Curve Cryptosystems (ECC). Meanwhile, the control of the overall TLS scheme is performed in software by an ARM Cortex-A9 microprocessor. In fact, the implementation of the ECC accelerator core around an ARM microprocessor allows not only the improvement of ECSM execution but also the performance enhancement of the overall cryptosystem. The integration of the ARM processor enables to exploit the possibility of embedded Linux features for high system flexibility. As a result, the proposed ECC accelerator requires limited area, with only 3395 LUTs on the Zynq device used to perform high-speed, 233-bit ECSMs in 413 ”s, with a 50 MHz clock. Moreover, the generation of a 384-bit TLS handshake secret key between client and server coordinators requires 67.5 ms on a low cost Zynq 7Z007S device

    Towards an algorithmic skeleton framework for programming the Intel R Xeon PhiTM processor

    Get PDF
    The Intel R Xeon PhiTM is the first processor based on Intel’s MIC (Many Integrated Cores) architecture. It is a co-processor specially tailored for data-parallel computations, whose basic architectural design is similar to the ones of GPUs (Graphics Processing Units), leveraging the use of many integrated low computational cores to perform parallel computations. The main novelty of the MIC architecture, relatively to GPUs, is its compatibility with the Intel x86 architecture. This enables the use of many of the tools commonly available for the parallel programming of x86-based architectures, which may lead to a smaller learning curve. However, programming the Xeon Phi still entails aspects intrinsic to accelerator-based computing, in general, and to the MIC architecture, in particular. In this thesis we advocate the use of algorithmic skeletons for programming the Xeon Phi. Algorithmic skeletons abstract the complexity inherent to parallel programming, hiding details such as resource management, parallel decomposition, inter-execution flow communication, thus removing these concerns from the programmer’s mind. In this context, the goal of the thesis is to lay the foundations for the development of a simple but powerful and efficient skeleton framework for the programming of the Xeon Phi processor. For this purpose we build upon Marrow, an existing framework for the orchestration of OpenCLTM computations in multi-GPU and CPU environments. We extend Marrow to execute both OpenCL and C++ parallel computations on the Xeon Phi. We evaluate the newly developed framework, several well-known benchmarks, like Saxpy and N-Body, will be used to compare, not only its performance to the existing framework when executing on the co-processor, but also to assess the performance on the Xeon Phi versus a multi-GPU environment.projects PTDC/EIA- EIA/113613/2009 (Synergy-VM) and PTDC/EEI-CTP/1837/2012 (SwiftComp) for financing the purchase of the Intel R Xeon PhiT

    Cryptography for Ultra-Low Power Devices

    Get PDF
    Ubiquitous computing describes the notion that computing devices will be everywhere: clothing, walls and floors of buildings, cars, forests, deserts, etc. Ubiquitous computing is becoming a reality: RFIDs are currently being introduced into the supply chain. Wireless distributed sensor networks (WSN) are already being used to monitor wildlife and to track military targets. Many more applications are being envisioned. For most of these applications some level of security is of utmost importance. Common to WSN and RFIDs are their severely limited power resources, which classify them as ultra-low power devices. Early sensor nodes used simple 8-bit microprocessors to implement basic communication, sensing and computing services. Security was an afterthought. The main power consumer is the RF-transceiver, or radio for short. In the past years specialized hardware for low-data rate and low-power radios has been developed. The new bottleneck are security services which employ computationally intensive cryptographic operations. Customized hardware implementations hold the promise of enabling security for severely power constrained devices. Most research groups are concerned with developing secure wireless communication protocols, others with designing efficient software implementations of cryptographic algorithms. There has not been a comprehensive study on hardware implementations of cryptographic algorithms tailored for ultra-low power applications. The goal of this dissertation is to develop a suite of cryptographic functions for authentication, encryption and integrity that is specifically fashioned to the needs of ultra-low power devices. This dissertation gives an introduction to the specific problems that security engineers face when they try to solve the seemingly contradictory challenge of providing lightweight cryptographic services that can perform on ultra-low power devices and shows an overview of our current work and its future direction

    Towards Efficient Hardware Implementation of Elliptic and Hyperelliptic Curve Cryptography

    Get PDF
    Implementation of elliptic and hyperelliptic curve cryptographic algorithms has been the focus of a great deal of recent research directed at increasing efficiency. Elliptic curve cryptography (ECC) was introduced independently by Koblitz and Miller in the 1980s. Hyperelliptic curve cryptography (HECC), a generalization of the elliptic curve case, allows a decreasing field size as the genus increases. The work presented in this thesis examines the problems created by limited area, power, and computation time when elliptic and hyperelliptic curves are integrated into constrained devices such as wireless sensor network (WSN) and smart cards. The lack of a battery in wireless sensor network limits the processing power of these devices, but they still require security. It was widely believed that devices with such constrained resources cannot incorporate a strong HECC processor for performing cryptographic operations such as elliptic curve scalar multiplication (ECSM) or hyperelliptic curve divisor multiplication (HCDM). However, the work presented in this thesis has demonstrated the feasibility of integrating an HECC processor into such devices through the use of the proposed architecture synthesis and optimization techniques for several inversion-free algorithms. The goal of this work is to develop a hardware implementation of binary elliptic and hyperelliptic curves. The focus is on the modeling of three factors: register allocation, operation scheduling, and storage binding. These factors were then integrated into architecture synthesis and optimization techniques in order to determine the best overall implementation suitable for constrained devices. The main purpose of the optimization is to reduce the area and power. Through analysis of the architecture optimization techniques for both datapath and control unit synthesis, the number of registers was reduced by an average of 30%. The use of the proposed efficient explicit formula for the different algorithms also enabled a reduction in the number of read/write operations from/to the register file, which reduces the processing power consumption. As a result, an overall HECC processor requires from 1843 to 3595 slices for a Xilinix XC4VLX200 and the total computation time is limited to between 10.08 ms to 15.82 ms at a maximum frequency of 50 MHz for a varity of inversion-free coordinate systems in hyperelliptic curves. The value of the new model has been demonstrated with respect to its implementation in elliptic and hyperelliptic curve crypogrpahic algorithms, through both synthesis and simulations. In summary, a framework has been provided for consideration of interactions with synthesis and optimization through architecture modeling for constrained enviroments. Insights have also been presented with respect to improving the design process for cryptogrpahic algorithms through datapath and control unit analysis
    • 

    corecore