354 research outputs found

    Design of Special Function Units in Modern Microprocessors

    Get PDF
    Today’s computing systems demand high performance for applications such as cloud computing, web-based search engines, network applications, and social media tasks. Such software applications involve an extensive use of hashing and arithmetic operations in their computation. In this thesis, we explore the use of new special function units (SFUs) for modern microprocessors, to accelerate such workloads. First, we design an SFU for hashing. Hashing can reduce the complexity of search and lookup from O(p) to O(p/n), where n bins are used and p items are being processed. In modern microprocessors, hashing is done in software. In our work, we propose a novel hardware hash unit design for use in modern microprocessors. Since the hash unit is designed at the hardware level, several advantages are obtained by our approach. First, a hardware-based hash unit executes a single hash instruction to perform a hash operation. In a software-based hashing in modern microprocessors, a hash operation is compiled into multiple instructions, thereby degrading performance. Second, software-based hashing stores hash data in a DRAM (also, hash operation entries can be stored in one of the cache levels). In a hardware-based hash unit, hash data is stored in a dedicated memory module (a hardware hash table), which improves performance. Third, today’s operating systems execute multiple applications (processes) in parallel, which entail high memory utilization. Hence the operating systems require many context switching between different processes, which results in many cache misses. In a hardware-based hash unit, the cache misses is reduced significantly using the dedicated memory module (hash table). These advantages all reduce the power consumption and increase the overall system performance significantly with a minimal increase in the microprocessor’s die area. We evaluate our hardware-based hash unit and compare its performance with software-based hashing. We start by evaluating our design approach at the micro-architecture level in terms of system performance. After that, we design our approach at the circuit level design to obtain the area overhead. Also, we analyze our design’s power and delay for each hash operation. These results are compared with a traditional hashing implementation. Then, we present an FPGA-based coprocessor for hash unit acceleration, applied to a virus checking application. Second, we present an SFU to speed up arithmetic operations. We call this arithmetic SFU a programmable arithmetic unit (PAU). In modern microprocessors, applications that require heavy arithmetic computations are done in software. To improve the performance for such computations, we present a programmable arithmetic unit (PAU), a partially reconfigurable methodology for arithmetic applications. The PAU consists of a set of IP blocks connected to a reconfigurable FPGA controller via a fast mesh-based interconnect. The IP blocks in the PAU can be any IP block such as adders, subtractors, multipliers, comparators and sign extension units. The PAU can have one or more copies of the same IP block (for example, 5 adders and 7 multipliers). The FPGA controller is an on-chip FPGA-based reconfigurable control fabric. The FPGA controller enables different arithmetic applications to be embedded on the PAU. The FPGA controller is programmed for different applications. The reconfigurable logic is based on a LUT-based design like a traditional FPGA. The FPGA controller and the IP blocks in the PAU communicate via a high speed ring data fabric. In our work, we use the PAU as an SFU in modern microprocessors. We compare the performance of different hardware-based arithmetic applications in the PAU with software-based implementations in modern microprocessors

    Document Classification Systems in Heterogeneous Computing Environments

    Get PDF
    Datacenter workloads demand high throughput, low cost and power efficient solutions. In most data centers the operating costs dominates the infrastructure cost. The ever growing amounts of data and the critical need for higher throughput, more energy efficient document classification solutions motivated us to investigate alternatives to the traditional homogeneous CPU based implementations of document classification systems. Several heterogeneous systems were investigated in the past where CPUs were combined with GPUs and FPGAs as system accelerators. The increasing complexity of FPGAs made them an interesting device in the heterogeneous computing environments and on the other hand difficult to program using Hardware Description languages. We explore the trade-offs when using high level synthesis and low level synthesis when programming FPGAs. Using low level synthesis results in less hardware resource usage on FPGAs and also offers the higher throughput compared to using HLS tool. While using HLS tool different heterogeneous computing devices such as multicore CPU and GPU targeted. Through our implementation experience and empirical results for data centric applications, we conclude that we can achieve power efficient results for these set of applications by either using low level synthesis or high level synthesis for programming FPGAs

    A Standalone FPGA-based Miner for Lyra2REv2 Cryptocurrencies

    Full text link
    Lyra2REv2 is a hashing algorithm that consists of a chain of individual hashing algorithms, and it is used as a proof-of-work function in several cryptocurrencies. The most crucial and exotic hashing algorithm in the Lyra2REv2 chain is a specific instance of the general Lyra2 algorithm. This work presents the first hardware implementation of the specific instance of Lyra2 that is used in Lyra2REv2. Several properties of the aforementioned algorithm are exploited in order to optimize the design. In addition, an FPGA-based hardware implementation of a standalone miner for Lyra2REv2 on a Xilinx Multi-Processor System on Chip is presented. The proposed Lyra2REv2 miner is shown to be significantly more energy efficient than both a GPU and a commercially available FPGA-based miner. Finally, we also explain how the simplified Lyra2 and Lyra2REv2 architectures can be modified with minimal effort to also support the recent Lyra2REv3 chained hashing algorithm.Comment: 13 pages, accepted for publication in IEEE Trans. Circuits Syst. I. arXiv admin note: substantial text overlap with arXiv:1807.0576

    A Multi-layer Fpga Framework Supporting Autonomous Runtime Partial Reconfiguration

    Get PDF
    Partial reconfiguration is a unique capability provided by several Field Programmable Gate Array (FPGA) vendors recently, which involves altering part of the programmed design within an SRAM-based FPGA at run-time. In this dissertation, a Multilayer Runtime Reconfiguration Architecture (MRRA) is developed, evaluated, and refined for Autonomous Runtime Partial Reconfiguration of FPGA devices. Under the proposed MRRA paradigm, FPGA configurations can be manipulated at runtime using on-chip resources. Operations are partitioned into Logic, Translation, and Reconfiguration layers along with a standardized set of Application Programming Interfaces (APIs). At each level, resource details are encapsulated and managed for efficiency and portability during operation. An MRRA mapping theory is developed to link the general logic function and area allocation information to the device related physical configuration level data by using mathematical data structure and physical constraints. In certain scenarios, configuration bit stream data can be read and modified directly for fast operations, relying on the use of similar logic functions and common interconnection resources for communication. A corresponding logic control flow is also developed to make the entire process autonomous. Several prototype MRRA systems are developed on a Xilinx Virtex II Pro platform. The Virtex II Pro on-chip PowerPC core and block RAM are employed to manage control operations while multiple physical interfaces establish and supplement autonomous reconfiguration capabilities. Area, speed and power optimization techniques are developed based on the developed Xilinx prototype. Evaluations and analysis of these prototype and techniques are performed on a number of benchmark and hashing algorithm case studies. The results indicate that based on a variety of test benches, up to 70% reduction in the resource utilization, up to 50% improvement in power consumption, and up to 10 times increase in run-time performance are achieved using the developed architecture and approaches compared with Xilinx baseline reconfiguration flow. Finally, a Genetic Algorithm (GA) for a FPGA fault tolerance case study is evaluated as a ultimate high-level application running on this architecture. It demonstrated that this is a hardware and software infrastructure that enables an FPGA to dynamically reconfigure itself efficiently under the control of a soft microprocessor core that is instantiated within the FPGA fabric. Such a system contributes to the observed benefits of intelligent control, fast reconfiguration, and low overhead

    A pattern matching coprocessor for network security

    Full text link

    Secure Remote Control and Configuration of FPX Platform in Gigabit Ethernet Environment

    Get PDF
    Because of its flexibility and high performance, reconfigurable logic functions implemented on the Field-programmable Port Extender (FPX ) are well suited for implementing network processing such as packet classification, filtering and intrusion detection functions. This project focuses on two key aspects of the FPX system. One is providing a Gigabit Ethernet interface by designing logic for a FPGA which is located on a line card. Address Resolution Protocol (ARP) packets are handled in hardware and Ethernet frames are processed and transformed into cells suitable for standard FPX application. The other effort is to provide a secure channel to enable remote control and configuration of the FPX system through public internet. A suite of security hardware cores were implemented that include the Advanced Encryption Standard (AES), Triple Data Encryption Standard (3DES), Hashed Message Authentication Code (HMAC), Message Digest Version 5 (MD5) and Secure Hash Algorithm (SHA-1). An architecture and an associated protocol have been developed which provide a secure communication channel between a control console and a hardware-based reconfigurable network node. This solution is unique in that it does not require a software process to run on the network stack, so that it has both higher performance and prevents the node from being hacked using traditional vulnerabilities found in common operating systems. The mechanism can be applied to the design and implementation of re-motely managed FPX systems. A hardware module called the Secure Control Packet Processor (SCPP) has been designed for a FPX based firewall. It utilizes AES or 3DES in Error Propagation Block Chaining (EPBC) mode to ensure data confidentiality and data integrity. There is also an authenticated engine that uses HMAC. to generate the acknowledgments. The system can protect the FPX system against attacks that may be sent over the control and configuration channel. Based on this infrastructure, an enhanced protocol is addressed that provides higher efficiency and can defend against replay attack. To support that, a control cell encryption module was designed and tested in the FPX system

    Prime Field ECDSA Signature Processing for Reconfigurable Embedded Systems

    Get PDF
    Growing ubiquity and safety relevance of embedded systems strengthen the need to protect their functionality against malicious attacks. Communication and system authentication by digital signature schemes is a major issue in securing such systems. This contribution presents a complete ECDSA signature processing system over prime fields for bit lengths of up to 256 on reconfigurable hardware. By using dedicated hardware implementation, the performance can be improved by up to two orders of magnitude compared to microcontroller implementations. The flexible system is tailored to serve as an autonomous subsystem providing authentication transparent for any application. Integration into a vehicle-to-vehicle communication system is shown as an application example
    corecore