11,849 research outputs found

    Enable++ : a second generation FPGA processor

    In the computing community, field-programmable processors are poised to fill the niche for special-purpose computing devices. A typical example is ultra-fast pattern recognition in experimental particle physics, a task for which we constructed Enable-1 two years ago: an FPGA processor specialized for pattern recognition algorithms in the μs domain, but also provided with modest features for coping with more general applications. This paper presents the follow-up model Enable++, a second-generation FPGA processor that offers several substantial enhancements over the previous system for a wider range of applications. Enable++ is structured into three state-of-the-art modules providing computing power, flexible high-speed I/O communication, and powerful inter-module communication with a raw bandwidth of 3.2 GByte/s over an active backplane. The technical realization of all three modules is guided by maximum use of field-programmable logic, and the actual demand for computing and I/O power can be satisfied by the number of modules plugged into the crate. Enhanced features of Enable++ include a configurable processor topology provided by programmable crossbar switches; in combination with the 4 x 4 FPGA array and 12 MByte of distributed RAM, the Enable++ computing core offers strongly increased and scalable computing power. For building new applications, the system offers a comfortable programming and debugging environment consisting of a compiler for the C-like hardware description language spC, a simulator, and a source-level debugger for hardware design. The goal in planning the hardware design environment for Enable++ from scratch is to transfer established software design methodologies to the design of digital logic. For pattern recognition tasks, we estimate that Enable++ surpasses modern RISC processors by a factor of 100 to 1000.
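
    As a rough picture of the crossbar-configured topology described above, the C sketch below models a 4 x 4 array of processing elements whose inputs are routed through a programmable crossbar table. Everything here (names, data type, the per-element operation) is hypothetical and illustrative only; it does not reproduce Enable++'s hardware or its spC toolchain.

    /* Purely illustrative model of a crossbar-configured 4 x 4 processing
     * element array; names and the per-element operation are hypothetical. */
    #include <stdint.h>
    #include <stdio.h>

    #define DIM 4
    #define NPE (DIM * DIM)

    /* crossbar[i] = index of the element whose output drives element i */
    static int crossbar[NPE];

    /* One "cycle": every element consumes its routed input and produces an output. */
    static void step(const uint32_t in[NPE], uint32_t out[NPE]) {
        for (int i = 0; i < NPE; i++)
            out[i] = in[crossbar[i]] + 1;   /* stand-in for a per-element operation */
    }

    int main(void) {
        uint32_t a[NPE], b[NPE];
        for (int i = 0; i < NPE; i++) {
            a[i] = (uint32_t)i;
            crossbar[i] = (i + 1) % NPE;    /* configure a simple ring topology */
        }
        step(a, b);
        printf("element 0 received data from element %d -> %u\n", crossbar[0], b[0]);
        return 0;
    }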

    Genomic co-processor for long read assembly

    Genomics data is transforming medicine and our understanding of life in fundamental ways; however, its growth is far outpacing Moore's Law. Third-generation sequencing technologies produce 100X longer reads than second-generation technologies and reveal a much broader mutation spectrum of disease and evolution. However, these technologies incur prohibitively high computational costs. To enable the vast potential of exponentially growing genomics data, domain-specific acceleration provides one of the few remaining approaches to continue scaling compute performance and efficiency, since general-purpose architectures are struggling to handle the huge amount of data needed for genome alignment. The aim of this project is to implement a genomic co-processor targeting HPC FPGAs, starting from the Darwin FPGA co-processor. In this scenario, the final objective is the simulation and implementation of the algorithms described by Darwin using Alveo boards, exploiting High Bandwidth Memory (HBM) to increase performance.
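
    For context, the C sketch below shows a plain Smith-Waterman scoring kernel, the family of dynamic-programming workload that alignment co-processors such as Darwin accelerate. It is a minimal illustration with assumed match/mismatch/gap scores; Darwin's actual tiled formulation and the HBM-aware design targeted in this project are not reproduced.

    /* Minimal Smith-Waterman local-alignment scoring kernel (linear gap
     * penalty); scoring values are assumed for illustration. */
    #include <stdio.h>
    #include <string.h>

    #define MAXLEN 256

    static int max4(int a, int b, int c, int d) {
        int m = a > b ? a : b;
        if (c > m) m = c;
        return d > m ? d : m;
    }

    static int sw_score(const char *q, const char *r) {
        static int H[MAXLEN + 1][MAXLEN + 1];   /* DP matrix */
        int lq = (int)strlen(q), lr = (int)strlen(r), best = 0;
        memset(H, 0, sizeof H);
        for (int i = 1; i <= lq; i++) {
            for (int j = 1; j <= lr; j++) {
                int sub = (q[i - 1] == r[j - 1]) ? 2 : -1;   /* match / mismatch */
                H[i][j] = max4(0,
                               H[i - 1][j - 1] + sub,   /* diagonal step */
                               H[i - 1][j] - 1,         /* gap in reference */
                               H[i][j - 1] - 1);        /* gap in query */
                if (H[i][j] > best) best = H[i][j];
            }
        }
        return best;
    }

    int main(void) {
        printf("score = %d\n", sw_score("GATTACA", "GCATGCA"));
        return 0;
    }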

    From FPGA to ASIC: A RISC-V processor experience

    This work documents a correct design flow using these tools on the Lagarto RISC-V processor and the RTL design considerations that must be taken into account when moving from an FPGA-oriented design to an ASIC-oriented design.

    GCC-Plugin for Automated Accelerator Generation and Integration on Hybrid FPGA-SoCs

    In recent years, architectures combining a reconfigurable fabric and a general-purpose processor on a single chip have become increasingly popular. Such hybrid architectures allow extending embedded software with application-specific hardware accelerators to improve performance and/or energy efficiency. Aiding system designers and programmers in handling the complexity of the required hardware/software (HW/SW) partitioning process is an important issue. Current methods are often restricted, either to bare-metal systems, to subsets of mainstream programming languages, or require special coding guidelines, e.g., via annotations. These restrictions still represent a high entry barrier for the wider community of programmers that new hybrid architectures are intended for. In this paper we revisit HW/SW partitioning and present a seamless programming flow for unrestricted, legacy C code. It consists of a retargetable GCC plugin that automatically identifies code sections for hardware acceleration and generates code accordingly. The proposed workflow was evaluated on the Xilinx Zynq platform using unmodified code from an embedded benchmark suite. Comment: Presented at the Second International Workshop on FPGAs for Software Programmers (FSP 2015) (arXiv:1508.06320)
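
    As a minimal sketch of how a compiler-integrated flow of this kind hooks into GCC, the code below shows the standard plugin entry points (plugin_init, register_callback). It only registers an empty callback; the partitioning analysis and accelerator code generation described in the paper are not shown, and the callback name is an assumption.

    /* Skeleton of the standard GCC plugin entry points; built against the
     * target compiler's plugin headers. The callback is an empty placeholder. */
    #include "gcc-plugin.h"
    #include "plugin-version.h"

    int plugin_is_GPL_compatible;   /* required license marker */

    /* Invoked when compilation of a translation unit finishes; a real
     * partitioning plugin would instead register passes over the IR. */
    static void on_finish(void *gcc_data, void *user_data) {
        (void)gcc_data;
        (void)user_data;
    }

    int plugin_init(struct plugin_name_args *plugin_info,
                    struct plugin_gcc_version *version) {
        if (!plugin_default_version_check(version, &gcc_version))
            return 1;   /* plugin built against a different GCC version */
        register_callback(plugin_info->base_name, PLUGIN_FINISH, on_finish, NULL);
        return 0;
    }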

    High performance FPGA implementation of the mersenne twister

    Efficient generation of random and pseudorandom sequences is of great importance to a number of applications [4]. In this paper, an efficient implementation of the Mersenne Twister is presented. The proposed architecture has the smallest footprint of all published architectures to date, occupying only 330 FPGA slices. Partial pipelining and sub-expression simplification have been used to improve throughput per clock cycle. The proposed architecture is implemented on an RC1000 FPGA development platform equipped with a Xilinx XCV2000E FPGA and can generate 20 million 32-bit random numbers per second at a clock rate of 24.234 MHz. A thorough performance analysis has been performed, and it shows that the proposed architecture clearly outperforms other existing implementations in key comparable performance metrics.
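
    For reference, the C sketch below implements the standard MT19937 recurrence and tempering steps (the well-known Matsumoto-Nishimura parameters) in software. It is included only to illustrate the computation the FPGA architecture pipelines; the paper's hardware mapping and partial-pipelining scheme are not reproduced here.

    /* Software MT19937 with standard parameters, shown to illustrate the
     * recurrence and tempering that the FPGA design pipelines. */
    #include <stdint.h>
    #include <stdio.h>

    #define N 624
    #define M 397
    #define MATRIX_A   0x9908B0DFu
    #define UPPER_MASK 0x80000000u
    #define LOWER_MASK 0x7FFFFFFFu

    static uint32_t mt[N];
    static int mti = N + 1;

    static void mt_seed(uint32_t s) {
        mt[0] = s;
        for (mti = 1; mti < N; mti++)   /* standard initialization recurrence */
            mt[mti] = 1812433253u * (mt[mti - 1] ^ (mt[mti - 1] >> 30)) + (uint32_t)mti;
    }

    static uint32_t mt_next(void) {
        uint32_t y;
        if (mti >= N) {                 /* regenerate the whole 624-word state */
            for (int k = 0; k < N; k++) {
                y = (mt[k] & UPPER_MASK) | (mt[(k + 1) % N] & LOWER_MASK);
                mt[k] = mt[(k + M) % N] ^ (y >> 1) ^ ((y & 1u) ? MATRIX_A : 0u);
            }
            mti = 0;
        }
        y = mt[mti++];                  /* tempering */
        y ^= y >> 11;
        y ^= (y << 7)  & 0x9D2C5680u;
        y ^= (y << 15) & 0xEFC60000u;
        y ^= y >> 18;
        return y;
    }

    int main(void) {
        mt_seed(5489u);                 /* reference default seed */
        for (int i = 0; i < 4; i++)
            printf("%u\n", (unsigned)mt_next());
        return 0;
    }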

    Enabling MPSoC design space exploration on FPGAs

    Future applications for embedded systems demand chip multiprocessor designs to meet real-time deadlines. These multiprocessors are increasingly heterogeneous for reasons of cost and power. Design space exploration (DSE) of application mapping becomes a major design decision in such systems, and the time spent in DSE grows even larger when multiple applications execute concurrently. Methods have been proposed to automate the generation of multiprocessor designs and prototype them on FPGAs, but only a few are able to support heterogeneous platforms. This is because heterogeneous processors require different types of inter-processor communication interfaces, so when we choose a different processor for a particular task, the communication infrastructure of the processor also has to change. In this paper, we present a module that integrates into a multiprocessor design generation flow and allows heterogeneous platform generation. This module is area efficient and fast. The DSE shows that up to 31% of FPGA area can be saved when a heterogeneous design is used compared to a homogeneous platform. Moreover, the performance of the application also improves significantly.
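
    To make the DSE idea concrete, the toy C sketch below exhaustively maps three tasks onto two processor types and keeps the smallest design that meets a deadline. All numbers (areas, cycle counts, the deadline) are invented for illustration and do not come from the paper.

    /* Toy exhaustive mapping of three tasks onto two processor types; keeps
     * the smallest design that meets a deadline. All numbers are invented. */
    #include <stdio.h>

    #define NTASKS 3
    #define NTYPES 2

    static const int area[NTYPES]           = { 300, 800 };          /* slices per core */
    static const int cycles[NTYPES][NTASKS] = { { 90, 120, 60 },     /* small core */
                                                { 40,  50, 30 } };   /* big core   */
    static const int deadline = 220;                                 /* cycles */

    int main(void) {
        int best_area = -1, best_map = -1;
        for (int map = 0; map < NTYPES * NTYPES * NTYPES; map++) {
            int m = map, a = 0, t = 0;
            for (int task = 0; task < NTASKS; task++) {
                int type = m % NTYPES; m /= NTYPES;
                a += area[type];              /* one core instantiated per task */
                t += cycles[type][task];      /* assume tasks run back to back  */
            }
            if (t <= deadline && (best_area < 0 || a < best_area)) {
                best_area = a;
                best_map = map;
            }
        }
        printf("best mapping id %d, area %d slices\n", best_map, best_area);
        return 0;
    }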

    A Scalable Correlator Architecture Based on Modular FPGA Hardware, Reuseable Gateware, and Data Packetization

    A new generation of radio telescopes is achieving unprecedented levels of sensitivity and resolution, as well as increased agility and field of view, by employing high-performance digital signal processing hardware to phase and correlate large numbers of antennas. The computational demands of these imaging systems scale in proportion to BMN^2, where B is the signal bandwidth, M is the number of independent beams, and N is the number of antennas. The specifications of many new arrays lead to demands in excess of tens of PetaOps per second. To meet this challenge, we have developed a general-purpose correlator architecture using standard 10-Gbit Ethernet switches to pass data between flexible hardware modules containing Field Programmable Gate Array (FPGA) chips. These chips are programmed using open-source signal processing libraries we have developed to be flexible, scalable, and chip-independent. This work reduces the time and cost of implementing a wide range of signal processing systems, with correlators foremost among them, and facilitates upgrading to new generations of processing technology. We present several correlator deployments, including a 16-antenna, 200-MHz bandwidth, 4-bit, full Stokes parameter application deployed on the Precision Array for Probing the Epoch of Reionization. Comment: Accepted to Publications of the Astronomical Society of the Pacific. 31 pages. v2: corrected typo, v3: corrected Fig. 1
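
    The BMN^2 scaling can be made concrete with a back-of-the-envelope calculation for the 16-antenna, 200-MHz deployment mentioned above. In the C sketch below, the single-beam assumption and the factor converting complex multiply-accumulates to real operations are illustrative assumptions, not figures from the paper.

    /* Back-of-the-envelope B*M*N^2 estimate for a 16-antenna, 200-MHz,
     * single-beam system; the 8 real ops per complex MAC is an assumption. */
    #include <stdio.h>

    int main(void) {
        double B = 200e6;                 /* bandwidth in Hz (from the abstract) */
        double M = 1.0;                   /* independent beams (assumed) */
        double N = 16.0;                  /* antennas (from the abstract) */
        double cmacs = B * M * N * N;     /* complex multiply-accumulates per second */
        double ops   = cmacs * 8.0;       /* ~8 real operations per complex MAC */
        printf("%.2e complex MAC/s, roughly %.2e real ops/s\n", cmacs, ops);
        return 0;
    }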

    Low Power Processor Architectures and Contemporary Techniques for Power Optimization – A Review

    Technological evolution has significantly increased the number of transistors for a given die area and raised switching speeds from a few MHz to the GHz range. This decline in feature size and boost in performance consequently demand lower supply voltages and effective management of power dissipation in chips with millions of transistors. This has triggered a substantial amount of research into power reduction techniques in almost every aspect of the chip, and particularly in the processor cores it contains. This paper presents an overview of techniques for achieving power efficiency, mainly at the processor core level, but also visits related domains such as buses and memories. Various processor parameters and features, such as supply voltage, clock frequency, caches, and pipelining, can be optimized to reduce the power consumption of the processor, and this paper discusses various ways in which they can be optimized. Emerging power-efficient processor architectures are also surveyed and ongoing research activities are discussed, which should help the reader identify how these factors in a processor contribute to power consumption. Some of these concepts are already established, whereas others are still active research areas. © 2009 ACADEMY PUBLISHER
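
    The link between supply voltage, clock frequency, and power rests on the standard dynamic-power relation, where P_dyn is roughly proportional to alpha * C * V^2 * f. The C sketch below evaluates it at two operating points; the activity factor, capacitance, and voltage/frequency values are assumed for illustration.

    /* Dynamic power P = alpha * C * V^2 * f evaluated at two operating
     * points; all parameter values are illustrative. */
    #include <stdio.h>

    static double dyn_power(double alpha, double cap, double vdd, double freq) {
        return alpha * cap * vdd * vdd * freq;
    }

    int main(void) {
        double alpha = 0.2;     /* switching activity factor (assumed) */
        double cap   = 1e-9;    /* switched capacitance in farads (assumed) */
        double base  = dyn_power(alpha, cap, 1.2, 1.0e9);   /* 1.2 V at 1.0 GHz */
        double dvfs  = dyn_power(alpha, cap, 0.9, 0.6e9);   /* 0.9 V at 0.6 GHz */
        printf("baseline %.3f W, scaled %.3f W (%.0f%% lower)\n",
               base, dvfs, 100.0 * (1.0 - dvfs / base));
        return 0;
    }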