76 research outputs found

    FPGA structures for high speed and low overhead dynamic circuit specialization

    Get PDF
    A Field Programmable Gate Array (FPGA) is a programmable digital electronic chip. The FPGA does not come with a predefined function from the manufacturer; instead, the developer has to define its function through implementing a digital circuit on the FPGA resources. The functionality of the FPGA can be reprogrammed as desired and hence the name “field programmable”. FPGAs are useful in small volume digital electronic products as the design of a digital custom chip is expensive. Changing the FPGA (also called configuring it) is done by changing the configuration data (in the form of bitstreams) that defines the FPGA functionality. These bitstreams are stored in a memory of the FPGA called configuration memory. The SRAM cells of LookUp Tables (LUTs), Block Random Access Memories (BRAMs) and DSP blocks together form the configuration memory of an FPGA. The configuration data can be modified according to the user’s needs to implement the user-defined hardware. The simplest way to program the configuration memory is to download the bitstreams using a JTAG interface. However, modern techniques such as Partial Reconfiguration (PR) enable us to configure a part in the configuration memory with partial bitstreams during run-time. The reconfiguration is achieved by swapping in partial bitstreams into the configuration memory via a configuration interface called Internal Configuration Access Port (ICAP). The ICAP is a hardware primitive (macro) present in the FPGA used to access the configuration memory internally by an embedded processor. The reconfiguration technique adds flexibility to use specialized ci rcuits that are more compact and more efficient t han t heir b ulky c ounterparts. An example of such an implementation is the use of specialized multipliers instead of big generic multipliers in an FIR implementation with constant coefficients. To specialize these circuits and reconfigure during the run-time, researchers at the HES group proposed the novel technique called parameterized reconfiguration that can be used to efficiently and automatically implement Dynamic Circuit Specialization (DCS) that is built on top of the Partial Reconfiguration method. It uses the run-time reconfiguration technique that is tailored to implement a parameterized design. An application is said to be parameterized if some of its input values change much less frequently than the rest. These inputs are called parameters. Instead of implementing these parameters as regular inputs, in DCS these inputs are implemented as constants, and the application is optimized for the constants. For every change in parameter values, the design is re-optimized (specialized) during run-time and implemented by reconfiguring the optimized design for a new set of parameters. In DCS, the bitstreams of the parameterized design are expressed as Boolean functions of the parameters. For every infrequent change in parameters, a specialized FPGA configuration is generated by evaluating the corresponding Boolean functions, and the FPGA is reconfigured with the specialized configuration. A detailed study of overheads of DCS and providing suitable solutions with appropriate custom FPGA structures is the primary goal of the dissertation. I also suggest different improvements to the FPGA configuration memory architecture. After offering the custom FPGA structures, I investigated the role of DCS on FPGA overlays and the use of custom FPGA structures that help to reduce the overheads of DCS on FPGA overlays. By doing so, I hope I can convince the developer to use DCS (which now comes with minimal costs) in real-world applications. I start the investigations of overheads of DCS by implementing an adaptive FIR filter (using the DCS technique) on three different Xilinx FPGA platforms: Virtex-II Pro, Virtex-5, and Zynq-SoC. The study of how DCS behaves and what is its overhead in the evolution of the three FPGA platforms is the non-trivial basis to discover the costs of DCS. After that, I propose custom FPGA structures (reconfiguration controllers and reconfiguration drivers) to reduce the main overhead (reconfiguration time) of DCS. These structures not only reduce the reconfiguration time but also help curbing the power hungry part of the DCS system. After these chapters, I study the role of DCS on FPGA overlays. I investigate the effect of the proposed FPGA structures on Virtual-Coarse-Grained Reconfigurable Arrays (VCGRAs). I classify the VCGRA implementations into three types: the conventional VCGRA, partially parameterized VCGRA and fully parameterized VCGRA depending upon the level of parameterization. I have designed two variants of VCGRA grids for HPC image processing applications, namely, the MAC grid and Pixie. Finally, I try to tackle the reconfiguration time overhead at the hardware level of the FPGA by customizing the FPGA configuration memory architecture. In this part of my research, I propose to use a parallel memory structure to improve the reconfiguration time of DCS drastically. However, this improvement comes with a significant overhead of hardware resources which will need to be solved in future research on commercial FPGA configuration memory architectures

    High-performance AES-128 algorithm implementation by FPGA-based SoC for 5G communications

    Get PDF
    none4siIn this research work, a fast and lightweight AES-128 cypher based on the Xilinx ZCU102 FPGA board is presented, suitable for 5G communications. In particular, both encryption and decryption algorithms have been developed using a pipelined approach, so enabling the simultaneous processing of the rounds on multiple data packets at each clock cycle. Both the encryption and decryption systems support an operative frequency up to 220 MHz, reaching 28.16 Gbit/s maximum data throughput; besides, the encryption and decryption phases last both only ten clock periods. To guarantee the interoperability of the developed encryption/decryption system with the other sections of the 5G communication apparatus, synchronization and control signals have been integrated. The encryption system uses only 1631 CLBs, whereas the decryption one only 3464 CLBs, ascribable, mainly, to the Inverse Mix Columns step. The developed cypher shows higher efficiency (8.63 Mbps/slice) than similar solutions present in literature.openP.Visconti, R. Velazquez, S. Capoccia, R. de FazioVisconti, P.; Velazquez, R.; Capoccia, S.; de Fazio, R

    Emerging Run-Time Reconfigurable FPGA and CAD Tools

    Get PDF
    Field-programmable gate array (FPGA) is a post fabrication reconfigurable device to accelerate domain specific computing systems. It offers offer high operation speed and low power consumption. However, the design flexibility and performance of FPGAs are severely constrained by the costly on-chip memories, e.g. static random access memory (SRAM) and FLASH memory. The objective of my dissertation is to explore the opportunity and enable the use of the emerging resistance random access memory (ReRAM) in FPGA design. The emerging ReRAM technology features high storage density, low access power consumption, and CMOS compatibility, making it a promising candidate for FPGA implementation. In particular, ReRAM has advantages of the fast access and nonvolatility, enabling the on-chip storage and access of configuration data. In this dissertation, I first propose a novel three-dimensional stacking scheme, namely, high-density interleaved memory (HIM). The structure improves the density of ReRAM meanwhile effectively reducing the signal interference induced by sneak paths in crossbar arrays. To further enhance the access speed and design reliability, a fast sensing circuit is also presented which includes a new sense amplifier scheme and reference cell configuration. The proposed ReRAM FPGA leverages a similar architecture as conventional SRAM based FPGAs but utilizes ReRAM technology in all component designs. First, HIM is used to implement look-up table (LUT) and block random access memories (BRAMs) for func- tionality process. Second, a 2R1T, two ReRAM cells and one transistor, nonvolatile switch design is applied to construct connection blocks (CBs) and switch blocks (SBs) for signal transition. Furthermore, unified BRAM (uBRAM) based on the current BRAM architecture iv is introduced, offering both configuration and temporary data storage. The uBRAMs provides extremely high density effectively and enlarges the FPGA capacity, potentially saving multiple contexts of configuration. The fast configuration scheme from uBRAM to logic and routing components also makes fast run-time partial reconfiguration (PR) much easier, improving the flexibility and performance of the entire FPGA system. Finally, modern place and route tools are designed for homogeneous fabric of FPGA. The PR feature, however, requires the support of heterogeneous logic modules in order to differentiate PR modules from static ones and therefore maintain the signal integration. The existing approaches still reply on designers’ manual effort, which significantly prolongs design time and lowers design efficiency. In this dissertation, I integrate PR support into VPR – an academic place and route tool by introducing a B*-tree modular placer (BMP) and PR-aware router. As such, users are able to explore new architectures or map PR applications to a variety of FPGAs. More importantly, this enhanced feature can also support fast design automation, e.g. mapping IP core, loading pre-synthesizing logic modules, etc

    High-performance AES-128 algorithm implementation by FPGA-based SoC for 5G communications

    Get PDF
    In this research work, a fast and lightweight AES-128 cypher based on the Xilinx ZCU102 FPGA board is presented, suitable for 5G communications. In particular, both encryption and decryption algorithms have been developed using a pipelined approach, so enabling the simultaneous processing of the rounds on multiple data packets at each clock cycle. Both the encryption and decryption systems support an operative frequency up to 220 MHz, reaching 28.16 Gbit/s maximum data throughput; besides, the encryption and decryption phases last both only ten clock periods. To guarantee the interoperability of the developed encryption/decryption system with the other sections of the 5G communication apparatus, synchronization and control signals have been integrated. The encryption system uses only 1631 CLBs, whereas the decryption one only 3464 CLBs, ascribable, mainly, to the Inverse Mix Columns step. The developed cypher shows higher efficiency (8.63 Mbps/slice) than similar solutions present in literature

    A multiple-SIMD architecture for image and tracking analysis

    Get PDF
    The computational requirements for real-time image based applications are such as to warrant the use of a parallel architecture. Commonly used parallel architectures conform to the classifications of Single Instruction Multiple Data (SIMD), or Multiple Instruction Multiple Data (MIMD). Each class of architecture has its advantages and dis-advantages. For example, SIMD architectures can be used on data-parallel problems, such as the processing of an image. Whereas MIMD architectures are more flexible and better suited to general purpose computing. Both types of processing are typically required for the analysis of the contents of an image. This thesis describes a novel massively parallel heterogeneous architecture, implemented as the Warwick Pyramid Machine. Both SIMD and MIMD processor types are combined within this architecture. Furthermore, the SIMD array is partitioned, into smaller SIMD sub-arrays, forming a Multiple-SIMD array. Thus, local data parallel, global data parallel, and control parallel processing are supported. After describing the present options available in the design of massively parallel machines and the nature of the image analysis problem, the architecture of the Warwick Pyramid Machine is described in some detail. The performance of this architecture is then analysed, both in terms of peak available computational power and in terms of representative applications in image analysis and numerical computation. Two tracking applications are also analysed to show the performance of this architecture. In addition, they illustrate the possible partitioning of applications between the SIMD and MIMD processor arrays. Load-balancing techniques are then described which have the potential to increase the utilisation of the Warwick Pyramid Machine at run-time. These include mapping techniques for image regions across the Multiple-SIMD arrays, and for the compression of sparse data. It is envisaged that these techniques may be found useful in other parallel systems

    Quantum and spin-based tunneling devices for memory systems

    Get PDF
    Rapid developments in information technology, such as internet, portable computing, and wireless communication, create a huge demand for fast and reliable ways to store and process information. Thus far, this need has been paralleled with the revolution in solid-state memory technologies. Memory devices, such as SRAM, DRAM, and flash, have been widely used in most electronic products. The primary strategy to keep up the trend is miniaturization. CMOS devices have been scaled down beyond sub-45 nm, the size of only a few atomic layers. Scaling, however, will soon reach the physical limitation of the material and cease to yield the desired enhancement in device performance. In this thesis, an alternative method to scaling is proposed and successfully realized. The proposed scheme integrates quantum devices, Si/SiGe resonant interband tunnel diodes (RITD), with classical CMOS devices forming a microsystem of disparate devices to achieve higher performance as well as higher density. The device/circuit designs, layouts and masks involving 12 levels were fabricated utilizing a process that incorporates nearly a hundred processing steps. Utilizing unique characteristics of each component, a low-power tunneling-based static random access memory (TSRAM) has been demonstrated. The TSRAM cells exhibit bistability operation with a power supply voltage as low as 0.37 V. Various TSRAM cells were also constructed and their latching mechanisms have been extensively investigated. In addition, the operation margins of TSRAM cells are evaluated based on different device structures and temperature variation from room temperature up to 200oC. The versatility of TSRAM is extended beyond the binary system. Using multi-peak Si/SiGe RITD, various multi-valued TSRAM (MV-TSRAM) configurations that can store more than two logic levels per cell are demonstrated. By this virtue, memory density can be substantially increased. Using two novel methods via ambipolar operation and utilization of enable/disable transistors, a six-valued MV-TSRAM cell are demonstrated. A revolutionary novel concept of integrating of Si/SiGe RITD with spin tunnel devices, magnetic tunnel junctions (MTJ), has been developed. This hybrid approach adds non-volatility and multi-valued memory potential as demonstrated by theoretical predictions and simulations. The challenges of physically fabricating these devices have been identified. These include process compatibility and device design. A test bed approach of fabricating RITD-MTJ structures has been developed. In conclusion, this body of work has created a sound foundation for new research frontiers in four different major areas: integrated TSRAM system, MV-TSRAM system, MTJ/RITD-based nonvolatile MRAM, and RITD/CMOS logic circuits
    • …
    corecore