489 research outputs found

    Dynamically reconfigurable asynchronous processor

    Get PDF
    The main design requirements for today's mobile applications are: · high throughput performance. · high energy efficiency. · high programmability. Until now, the choice of platform has often been limited to Application-Specific Integrated Circuits (ASICs), due to their best-of-breed performance and power consumption. The economies of scale possible with these high-volume markets have traditionally been able to hide the high Non-Recurring Engineering (NRE) costs required for designing and fabricating new ASICs. However, with the NREs and design time escalating with each generation of mobile applications, this practice may be reaching its limit. Designers today are looking at programmable solutions, so that they can respond more rapidly to changes in the market and spread costs over several generations of mobile applications. However, there have been few feasible alternatives to ASICs: Digital Signals Processors (DSPs) and microprocessors cannot meet the throughput requirements, whereas Field-Programmable Gate Arrays (FPGAs) require too much area and power. Coarse-grained dynamically reconfigurable architectures offer better solutions for high throughput applications, when power and area considerations are taken into account. One promising example is the Reconfigurable Instruction Cell Array (RICA). RICA consists of an array of cells with an interconnect that can be dynamically reconfigured on every cycle. This allows quite complex datapaths to be rendered onto the fabric and executed in a single configuration - making these architectures particularly suitable to stream processing. Furthermore, RICA can be programmed from C, making it a good fit with existing design methodologies. However the RICA architecture has a drawback: poor scalability in terms of area and power. As the core gets bigger, the number of sequential elements in the array must be increased significantly to maintain the ability to achieve high throughputs through pipelining. As a result, a larger clock tree is required to synchronise the increased number of sequential elements. The clock tree therefore takes up a larger percentage of the area and power consumption of the core. This thesis presents a novel Dynamically Reconfigurable Asynchronous Processor (DRAP), aimed at high-throughput mobile applications. DRAP is based on the RICA architecture, but uses asynchronous design techniques - methods of designing digital systems without clocks. The absence of a global clock signal makes DRAP more scalable in terms of power and area overhead than its synchronous counterpart. The DRAP architecture maintains most of the benefits of custom asynchronous design, whilst also providing programmability via conventional high-level languages. Results show that the DRAP processor delivers considerably lower power consumption when compared to a market-leading Very Long Instruction Word (VLIW) processor and a low-power ARM processor. For example, DRAP resulted in a reduction in power consumption of 20 times compared to the ARM7 processor, and 29 times compared to the TIC64x VLIW, when running the same benchmark capped to the same throughput and for the same process technology (0.13μm). When compared to an equivalent RICA design, DRAP was up to 22% larger than RICA but resulted in a power reduction of up to 1.9 times. It was also capable of achieving up to 2.8 times higher throughputs than RICA for the same benchmarks

    The Potential for a GPU-Like Overlay Architecture for FPGAs

    Get PDF
    We propose a soft processor programming model and architecture inspired by graphics processing units (GPUs) that are well-matched to the strengths of FPGAs, namely, highly parallel and pipelinable computation. In particular, our soft processor architecture exploits multithreading, vector operations, and predication to supply a floating-point pipeline of 64 stages via hardware support for up to 256 concurrent thread contexts. The key new contributions of our architecture are mechanisms for managing threads and register files that maximize data-level and instruction-level parallelism while overcoming the challenges of port limitations of FPGA block memories as well as memory and pipeline latency. Through simulation of a system that (i) is programmable via NVIDIA's high-level Cg language, (ii) supports AMD's CTM r5xx GPU ISA, and (iii) is realizable on an XtremeData XD1000 FPGA-based accelerator system, we demonstrate the potential for such a system to achieve 100% utilization of a deeply pipelined floating-point datapath

    Flexible Baseband Modulator Architecture for Multi-Waveform 5G Communications

    Get PDF
    The fifth-generation (5G) revolution represents more than a mere performance enhancement of previous generations: it will deeply transform the way humans and/or machines interact, enabling a heterogeneous expansion in the number of use cases and services. Crucial to the realization of this revolution is the design of hardware components characterized by high degrees of flexibility, versatility and resource/power efficiency. This chapter proposes a field-programmable gate array (FPGA)-oriented baseband processing architecture suitable for fast-changing communication environments such as 4G/5G waveform coexistence, noncontiguous carrier aggregation (CA) or centralized cloud radio access network (C-RAN) processing. The proposed architecture supports three 5G waveform candidates and is shown to be upgradable, resource-efficient and cost-effective. Through hardware virtualization, enabled by dynamic partial reconfiguration (DPR), the design space exploration of our architecture exceeds the hardware resources available on the Zynq xc7z020 device. Moreover, dynamic frequency scaling (DFS) enables the runtime adjustment of processing throughput and power reductions by up to 88%. The combined resource overhead for DPR and DFS is very low, and the reconfiguration latency stays two orders of magnitude below the control plane latency requirements proposed for 5G communications

    Onboard processing of synthetic aperture radar backprojection algorithm in FPGA

    Get PDF
    Synthetic aperture radar is a microwave technique to extracting image information of the target. Electromagnetic waves that are reflected from the target are acquired by the aircraft or satellite receivers and sent to a ground station to be processed by applying computational demanding algorithms. Radar data streams are acquired by an aircraft or satellite and sent to a ground station to be processed in order to extract images from the data since these processing algorithms are computationally demanding. However, novel applications require real-time processing for real-time analysis and decisions and so onboard processing is necessary. Running computationally demanding algorithms on onboard embedded systems with limited energy and computational capacity is a challenge. This article proposes a configurable hardware core for the execution of the backprojection algorithm with high performance and energy efficiency. The original backprojection algorithm is restructured to expose computational parallelism and then optimized by replacing floating-point with fixed-point arithmetic. The backprojection core was integrated into a system-onchip architecture and implemented in a field-programmable gate array. The proposed solution runs the optimized backprojection algorithm over images of sizes 512 x 512 and 1024 x 1024 in 0.14 s (0.41 J) and 1.11 s (3.24 J), respectively. The architecture is 2.6x faster and consumes 13x less energy than an embedded Jetson TX2 GPU. The solution is scalable and, therefore, a tradeoff exists between performance and utilization of resources.info:eu-repo/semantics/publishedVersio

    Design of a Hybrid Modular Switch

    Full text link
    Network Function Virtualization (NFV) shed new light for the design, deployment, and management of cloud networks. Many network functions such as firewalls, load balancers, and intrusion detection systems can be virtualized by servers. However, network operators often have to sacrifice programmability in order to achieve high throughput, especially at networks' edge where complex network functions are required. Here, we design, implement, and evaluate Hybrid Modular Switch (HyMoS). The hybrid hardware/software switch is designed to meet requirements for modern-day NFV applications in providing high-throughput, with a high degree of programmability. HyMoS utilizes P4-compatible Network Interface Cards (NICs), PCI Express interface and CPU to act as line cards, switch fabric, and fabric controller respectively. In our implementation of HyMos, PCI Express interface is turned into a non-blocking switch fabric with a throughput of hundreds of Gigabits per second. Compared to existing NFV infrastructure, HyMoS offers modularity in hardware and software as well as a higher degree of programmability by supporting a superset of P4 language

    A pipelined configurable gate array for embedded processors

    Get PDF
    In recent years the challenge of high performance, low power retargettable embedded system has been faced with different technological and architectural solutions. In this paper we present a new configurable unit explicitly designed to imple-ment additional reconfigurable pipelined datapaths, suitable for the design of reconfigurable processors. A VLIW recon-figurable processor has been implemented on silicon in a standard 0.18 µm CMOS technology to prove the effective-ness of the proposed unit. Testing on a signal processing algorithms benchmark showed speedups from 4.3x to 13.5x and energy consumption reduction up to 92%

    Low-power Programmable Processor for Fast Fourier Transform Based on Transport Triggered Architecture

    Get PDF
    This paper describes a low-power processor tailored for fast Fourier transform computations where transport triggering template is exploited. The processor is software-programmable while retaining an energy-efficiency comparable to existing fixed-function implementations. The power savings are achieved by compressing the computation kernel into one instruction word. The word is stored in an instruction loop buffer, which is more power-efficient than regular instruction memory storage. The processor supports all power-of-two FFT sizes from 64 to 16384 and given 1 mJ of energy, it can compute 20916 transforms of size 1024.Comment: 5 pages, 4 figures, 1 table, ICASSP 2019 conferenc

    On the Feasibility and Limitations of Just-in-Time Instruction Set Extension for FPGA-Based Reconfigurable Processors

    Get PDF
    Reconfigurable instruction set processors provide the possibility of tailor the instruction set of a CPU to a particular application. While this customization process could be performed during runtime in order to adapt the CPU to the currently executed workload, this use case has been hardly investigated. In this paper, we study the feasibility of moving the customization process to runtime and evaluate the relation of the expected speedups and the associated overheads. To this end, we present a tool flow that is tailored to the requirements of this just-in-time ASIP specialization scenario. We evaluate our methods by targeting our previously introduced Woolcano reconfigurable ASIP architecture for a set of applications from the SPEC2006, SPEC2000, MiBench, and SciMark2 benchmark suites. Our results show that just-in-time ASIP specialization is promising for embedded computing applications, where average speedups of 5x can be achieved by spending 50 minutes for custom instruction identification and hardware generation. These overheads will be compensated if the applications execute for more than 2 hours. For the scientific computing benchmarks, the achievable speedup is only 1.2x, which requires significant execution times in the order of days to amortize the overheads

    Real-time architecture for robust motion estimation under varying illumination conditions

    Get PDF
    Motion estimation from image sequences is a complex problem which requires high computing resources and is highly affected by changes in the illumination conditions in most of the existing approaches. In this contribution we present a high performance system that deals with this limitation. Robustness to varying illumination conditions is achieved by a novel technique that combines a gradient-based optical flow method with a non-parametric image transformation based on the Rank transform. The paper describes this method and quantitatively evaluates its robustness to different illumination changing patterns. This technique has been successfully implemented in a real-time system using reconfigurable hardware. Our contribution presents the computing architecture, including the resources consumption and the obtained performance. The final system is a real-time device capable to computing motion sequences in real-time even in conditions with significant illumination changes. The robustness of the proposed system facilitates its use in multiple potential application fields.This work has been supported by the grants DEPROVI (DPI2004-07032), DRIVSCO (IST-016276-2) and TIC2007:”Plataforma Sw-Hw para sistemas de visión 3D en tiempo real”
    corecore