742 research outputs found

    From FPGA to ASIC: A RISC-V processor experience

    Get PDF
    This work document a correct design flow using these tools in the Lagarto RISC- V Processor and the RTL design considerations that must be taken into account, to move from a design for FPGA to design for ASIC

    Parallel H.264/AVC Fast Rate-Distortion Optimized Motion Estimation using Graphics Processing Unit and Dedicated Hardware

    Get PDF
    Heterogeneous systems on a single chip composed of CPU, Graphical Processing Unit (GPU), and Field Programmable Gate Array (FPGA) are expected to emerge in near future. In this context, the System on Chip (SoC) can be dynamically adapted to employ different architectures for execution of data-intensive applications. Motion estimation is one such task that can be accelerated using FPGA and GPU for high performance H.264/AVC encoder implementation. In most of works on parallel implementation of motion estimation, the bit rate cost of motion vectors is generally ignored. On the contrary, this paper presents a fast rate-distortion optimized parallel motion estimation algorithm implemented on GPU using OpenCL and FPGA/ASIC using VHDL. The predicted motion vectors are estimated from temporally preceding motion vectors and used for evaluating the bit rate cost of the motion vectors simultaneously. The experimental results show that the proposed scheme achieves significant speedup on GPU and FPGA, and has comparable ratedistortion performance with respect to sequential fast motion estimation algorithm

    An integrated soft- and hard-programmable multithreaded architecture

    Get PDF

    Predictive control using an FPGA with application to aircraft control

    Get PDF
    Alternative and more efficient computational methods can extend the applicability of MPC to systems with tight real-time requirements. This paper presents a “system-on-a-chip” MPC system, implemented on a field programmable gate array (FPGA), consisting of a sparse structure-exploiting primal dual interior point (PDIP) QP solver for MPC reference tracking and a fast gradient QP solver for steady-state target calculation. A parallel reduced precision iterative solver is used to accelerate the solution of the set of linear equations forming the computational bottleneck of the PDIP algorithm. A numerical study of the effect of reducing the number of iterations highlights the effectiveness of the approach. The system is demonstrated with an FPGA-inthe-loop testbench controlling a nonlinear simulation of a large airliner. This study considers many more manipulated inputs than any previous FPGA-based MPC implementation to date, yet the implementation comfortably fits into a mid-range FPGA, and the controller compares well in terms of solution quality and latency to state-of-the-art QP solvers running on a standard PC

    Tiny Classifier Circuits: Evolving Accelerators for Tabular Data

    Full text link
    A typical machine learning (ML) development cycle for edge computing is to maximise the performance during model training and then minimise the memory/area footprint of the trained model for deployment on edge devices targeting CPUs, GPUs, microcontrollers, or custom hardware accelerators. This paper proposes a methodology for automatically generating predictor circuits for classification of tabular data with comparable prediction performance to conventional ML techniques while using substantially fewer hardware resources and power. The proposed methodology uses an evolutionary algorithm to search over the space of logic gates and automatically generates a classifier circuit with maximised training prediction accuracy. Classifier circuits are so tiny (i.e., consisting of no more than 300 logic gates) that they are called "Tiny Classifier" circuits, and can efficiently be implemented in ASIC or on an FPGA. We empirically evaluate the automatic Tiny Classifier circuit generation methodology or "Auto Tiny Classifiers" on a wide range of tabular datasets, and compare it against conventional ML techniques such as Amazon's AutoGluon, Google's TabNet and a neural search over Multi-Layer Perceptrons. Despite Tiny Classifiers being constrained to a few hundred logic gates, we observe no statistically significant difference in prediction performance in comparison to the best-performing ML baseline. When synthesised as a Silicon chip, Tiny Classifiers use 8-18x less area and 4-8x less power. When implemented as an ultra-low cost chip on a flexible substrate (i.e., FlexIC), they occupy 10-75x less area and consume 13-75x less power compared to the most hardware-efficient ML baseline. On an FPGA, Tiny Classifiers consume 3-11x fewer resources.Comment: 14 pages, 16 figure

    Digital FPGA Circuits Design for Real-Time Video Processing with Reference to Two Application Scenarios

    Get PDF
    In the present days of digital revolution, image and/or video processing has become a ubiquitous task: from mobile devices to special environments, the need for a real-time approach is everyday more and more evident. Whatever the reason, either for user experience in recreational or internet-based applications or for safety related timeliness in hard-real-time scenarios, the exploration of technologies and techniques which allow for this requirement to be satisfied is a crucial point. General purpose CPU or GPU software implementations of these applications are quite simple and widespread, but commonly do not allow high performance because of the high layering that separates high level languages and libraries, which enforce complicated procedures and algorithms, from the base architecture of the CPUs that offers only limited and basic (although rapidly executed) arithmetic operations. The most practised approach nowadays is based on the use of Very-Large-Scale Integrated (VLSI) digital electronic circuits. Field Programmable Gate Arrays (FPGAs) are integrated digital circuits designed to be configured after manufacturing, "on the field". They typically provide lower performance levels when compared to Application Specific Integrated Circuits (ASICs), but at a lower cost, especially when dealing with limited production volumes. Of course, on-the-field programmability itself (and re-programmability, in the vast majority of cases) is also a characteristic feature that makes FPGA more suitable for applications with changing specifications where an update of capabilities may be a desirable benefit. Moreover, the time needed to fulfill the design cycle for FPGA-based circuits (including of course testing and debug speed) is much reduced when compared to the design flow and time-to-market of ASICs. In this thesis work, we will see (Chapter 1) some common problems and strategies involved with the use of FPGAs and FPGA-based systems for Real Time Image Processing and Real Time Video Processing (in the following alsoindicated interchangeably with the acronym RTVP); we will then focus, in particular, on two applications. Firstly, Chapter 2 will cover the implementation of a novel algorithm for Visual Search, known as CDVS, which has been recently standardised as part of the MPEG-7 standard. Visual search is an emerging field in mobile applications which is rapidly becoming ubiquitous. However, typically, algorithms for this kind of applications are connected with a high leverage on computational power and complex elaborations: as a consequence, implementation efficiency is a crucial point, and this generally results in the need for custom designed hardware. Chapter 3 will cover the implementation of an algorithm for the compression of hyperspectral images which is bit-true compatible with the CCSDS-123.0 standard algorithm. Hyperspectral images are three dimensional matrices in which each 2D plane represents the image, as captured by the sensor, in a given spectral band: their size may range from several millions of pixels up to billions of pixels. Typical scenarios of use of hyperspectral images include airborne and satellite-borne remote sensing. As a consequence, major concerns are the limitedness of both processing power and communication links bandwidth: thus, a proper compression algorithm, as well as the efficiency of its implementation, is crucial. In both cases we will first of all examine the scope of the work with reference to current state-of-the-art. We will then see the proposed implementations in their main characteristics and, to conclude, we will consider the primary experimental results

    High-level verification flow for a high-level synthesis-based digital logic design

    Get PDF
    Abstract. High-level synthesis (HLS) is a method for generating register-transfer level (RTL) hardware description of digital logic designs from high-level languages, such as C/C++/SystemC or MATLAB. The performance and productivity benefits of HLS stem from the untimed, high abstraction level input languages. Another advantage is that the design and verification can focus on the features and high-level architecture, instead of the low-level implementation details. The goal of this thesis was to define and implement a high-level verification (HLV) flow for an HLS design written in C++. The HLV flow takes advantage of the performance and productivity of C++ as opposed to hardware description languages (HDL) and minimises the required RTL verification work. The HLV flow was implemented in the case study of the thesis. The HLS design was verified in a C++ verification environment, and Catapult Coverage was used for pre-HLS coverage closure. Post-HLS verification and coverage closure were done in Universal Verification Methodology (UVM) environment. C++ tests used in the pre-HLS coverage closure were reimplemented in UVM, to get a high initial RTL coverage without manual RTL code analysis. The pre-HLS C++ design was implemented as a predictor into the UVM testbench to verify the equivalence of C++ versus RTL and to speed up post-HLS coverage closure. Results of the case study show that the HLV flow is feasible to implement in practice. The flow shows significant performance and productivity gains of verification in the C++ domain when compared to UVM. The UVM implementation of a somewhat incomplete set of pre-HLS tests and formal exclusions resulted in an initial post-HLS coverage of 96.90%. The C++ predictor implementation was a valuable tool in post-HLS coverage closure. A total of four weeks of coverage work in pre- and post-HLS phases was required to reach 99% RTL coverage. The total time does not include the time required to build both C++ and UVM verification environments.Korkean tason verifiointivuo korkean tason synteesiin perustuvalle digitaalilogiikkasuunnitelmalle. Tiivistelmä. Korkean tason synteesi (HLS) on menetelmä, jolla generoidaan rekisterisiirtotason (RTL) laitteistokuvausta digitaalisille logiikkasuunnitelmille käyttäen korkean tason ohjelmointikieliä, kuten C-pohjaisia kieliä tai MATLAB:ia. HLS:n suorituskykyyn ja tuottavuuteen liittyvät hyödyt perustuvat ohjelmointikielien tarjoamaan korkeampaan abstraktiotasoon. HLS:ää käyttäen suunnittelu- ja varmennustyö voi keskittyä ominaisuuksiin ja korkean tason arkkitehtuuriin matalan tason yksityiskohtien sijaan. Tämän diplomityön tavoite oli määritellä ja implementoida korkean tason verifiointivuo (HLV-vuo) C++:lla kirjoitetulle HLS-suunnitelmalle. HLV-vuo hyödyntää ohjelmointikielien tarjoamaa suorituskykyä ja korkeampaa abstraktion tasoa kovonkuvauskielien sijaan ja siten minimoi RTL:n varmennukseen vaadittavaa työtä. HLV vuo implementoitiin tapaustutkimuksessa. HLS-suunnitelma varmennettiin C++ -verifiointiympäristössä, ja Catapult Coveragea käytettiin kattavuuden analysointiin. RTL-kattavuutta mitattiin universaalilla verifiointimetodologialla (UVM) tehdyssä ympäristössä. C++ varmennuksessa käytetyt testivektorit implementoitiin uudelleen UVM-ympäristössä, jotta RTL-kattavuuden lähtötaso olisi korkea ilman manuaalista RTL-analyysiä. C++-suunnitelma implementoitiin prediktorina (referenssimallina) UVM-testipenkkiin koodikattavuuden parantamiseksi. Tapaustutkimuksen tulokset osoittavat, että määritelty HLV-vuo on toteutettavissa käytännössä. Vuota käyttämällä saavutetaan merkittäviä suorituskyky- ja tuottavuusetuja C++ -testiympäristössä verrattuna UVM-ympäristöön. 90.60% koodikattavuuden saavuttavien C++ testivektoreiden uudelleenimplementoiti UVM-ympäristössä tuotti 96.90% RTL-kattavuuden. C++-predictorin implementointi oli merkittävä työkalu RTL-kattavuustavoitteen saavuttamisessa

    Adaptive High-Bandwidth Digitally Controlled Buck Converter with Improved Line and Load Transient Response

    Get PDF
    Digitally controlled switching converter suffers from bandwidth limitation because of the additional phase delay in the digital feedback control loop. In order to overcome the bandwidth limitation without using a high sampling rate, this paper presents an adaptive third-order digital controller for regulating a voltage-mode buck converter with a modest 2x oversampling ratio. The phase lag due to the ADC conversion time delay is virtually compensated by providing an early estimation of the error voltage for the next sampling time instant, enabling a higher unity-gain bandwidth without compromising stability. An additional pair of low-frequency pole and zero in the third-order controller increases the low-frequency gain, resulting in faster settling time and smaller output voltage deviation during line transient. Both simulation and experimental results demonstrate that the proposed adaptive third-order controller reduces the settling time by 50% in response to a 1 V line transient and 30% in response to a 600 mA load transient, compared to the baseline static second-order controller. The fastest settling time is measured to be around 11.70 s, surpassing the transient performance of conventional digital controllers and approaching that of the state-of-the-art analog-based controllers.postprin

    High-Level Synthesis Based VLSI Architectures for Video Coding

    Get PDF
    High Efficiency Video Coding (HEVC) is state-of-the-art video coding standard. Emerging applications like free-viewpoint video, 360degree video, augmented reality, 3D movies etc. require standardized extensions of HEVC. The standardized extensions of HEVC include HEVC Scalable Video Coding (SHVC), HEVC Multiview Video Coding (MV-HEVC), MV-HEVC+ Depth (3D-HEVC) and HEVC Screen Content Coding. 3D-HEVC is used for applications like view synthesis generation, free-viewpoint video. Coding and transmission of depth maps in 3D-HEVC is used for the virtual view synthesis by the algorithms like Depth Image Based Rendering (DIBR). As first step, we performed the profiling of the 3D-HEVC standard. Computational intensive parts of the standard are identified for the efficient hardware implementation. One of the computational intensive part of the 3D-HEVC, HEVC and H.264/AVC is the Interpolation Filtering used for Fractional Motion Estimation (FME). The hardware implementation of the interpolation filtering is carried out using High-Level Synthesis (HLS) tools. Xilinx Vivado Design Suite is used for the HLS implementation of the interpolation filters of HEVC and H.264/AVC. The complexity of the digital systems is greatly increased. High-Level Synthesis is the methodology which offers great benefits such as late architectural or functional changes without time consuming in rewriting of RTL-code, algorithms can be tested and evaluated early in the design cycle and development of accurate models against which the final hardware can be verified
    corecore