964 research outputs found

    An embedded system supporting dynamic partial reconfiguration of hardware resources for morphological image processing

    Get PDF
    Processors for high-performance computing applications are generally designed with a focus on high clock rates, parallelism of operations and high communication bandwidth, often at the expense of large power consumption. However, the emphasis of many embedded systems and untethered devices is on minimal hardware requirements and reduced power consumption. With the incessant growth of computational needs for embedded applications, which contradict chip power and area needs, the burden is put on the hardware designers to come up with designs that optimize power and area requirements. This thesis investigates the efficient design of an embedded system for morphological image processing applications on Xilinx FPGAs (Field Programmable Gate Array) by optimizing both area and power usage while delivering high performance. The design leverages a unique capability of FPGAs called dynamic partial reconfiguration (DPR) which allows changing the hardware configuration of silicon pieces at runtime. DPR allows regions of the FPGA to be reprogrammed with new functionality while applications are still running in the remainder of the device. The main aim of this thesis is to design an embedded system for morphological image processing by accounting for real time and area constraints as compared to a statically configured FPGA. IP (Intellectual Property) cores are synthesized for both static and dynamic time. DPR enables instantiation of more hardware logic over a period of time on an existing device by time-multiplexing the hardware realization of functions. A comparison of power consumption is presented for the statically and dynamically reconfigured designs. Finally, a performance comparison is included for the implementation of the respective algorithms on a hardwired ARM processor as well as on another general-purpose processor. The results prove the viability of DPR for morphological image processing applications

    Empowering parallel computing with field programmable gate arrays

    Get PDF
    After more than 30 years, reconfigurable computing has grown from a concept to a mature field of science and technology. The cornerstone of this evolution is the field programmable gate array, a building block enabling the configuration of a custom hardware architecture. The departure from static von Neumannlike architectures opens the way to eliminate the instruction overhead and to optimize the execution speed and power consumption. FPGAs now live in a growing ecosystem of development tools, enabling software programmers to map algorithms directly onto hardware. Applications abound in many directions, including data centers, IoT, AI, image processing and space exploration. The increasing success of FPGAs is largely due to an improved toolchain with solid high-level synthesis support as well as a better integration with processor and memory systems. On the other hand, long compile times and complex design exploration remain areas for improvement. In this paper we address the evolution of FPGAs towards advanced multi-functional accelerators, discuss different programming models and their HLS language implementations, as well as high-performance tuning of FPGAs integrated into a heterogeneous platform. We pinpoint fallacies and pitfalls, and identify opportunities for language enhancements and architectural refinements

    Transformations of High-Level Synthesis Codes for High-Performance Computing

    Full text link
    Specialized hardware architectures promise a major step in performance and energy efficiency over the traditional load/store devices currently employed in large scale computing systems. The adoption of high-level synthesis (HLS) from languages such as C/C++ and OpenCL has greatly increased programmer productivity when designing for such platforms. While this has enabled a wider audience to target specialized hardware, the optimization principles known from traditional software design are no longer sufficient to implement high-performance codes. Fast and efficient codes for reconfigurable platforms are thus still challenging to design. To alleviate this, we present a set of optimizing transformations for HLS, targeting scalable and efficient architectures for high-performance computing (HPC) applications. Our work provides a toolbox for developers, where we systematically identify classes of transformations, the characteristics of their effect on the HLS code and the resulting hardware (e.g., increases data reuse or resource consumption), and the objectives that each transformation can target (e.g., resolve interface contention, or increase parallelism). We show how these can be used to efficiently exploit pipelining, on-chip distributed fast memory, and on-chip streaming dataflow, allowing for massively parallel architectures. To quantify the effect of our transformations, we use them to optimize a set of throughput-oriented FPGA kernels, demonstrating that our enhancements are sufficient to scale up parallelism within the hardware constraints. With the transformations covered, we hope to establish a common framework for performance engineers, compiler developers, and hardware developers, to tap into the performance potential offered by specialized hardware architectures using HLS

    Efficiency analysis methodology of FPGAs based on lost frequencies, area and cycles

    Get PDF
    We propose a methodology to study and to quantify efficiency and the impact of overheads on runtime performance. Most work on High-Performance Computing (HPC) for FPGAs only studies runtime performance or cost, while we are interested in how far we are from peak performance and, more importantly, why. The efficiency of runtime performance is defined with respect to the ideal computational runtime in absence of inefficiencies. The analysis of the difference between actual and ideal runtime reveals the overheads and bottlenecks. A formal approach is proposed to decompose the efficiency into three components: frequency, area and cycles. After quantification of the efficiencies, a detailed analysis has to reveal the reasons for the lost frequencies, lost area and lost cycles. We propose a taxonomy of possible causes and practical methods to identify and quantify the overheads. The proposed methodology is applied on a number of use cases to illustrate the methodology. We show the interaction between the three components of efficiency and show how bottlenecks are revealed

    Performance comparison of single-precision SPICE Model-Evaluation on FPGA, GPU, Cell, and multi-core processors

    Get PDF
    Automated code generation and performance tuning techniques for concurrent architectures such as GPUs, Cell and FPGAs can provide integer factor speedups over multi-core processor organizations for data-parallel, floating-point computation in SPICE model-evaluation. Our Verilog AMS compiler produces code for parallel evaluation of non-linear circuit models suitable for use in SPICE simulations where the same model is evaluated several times for all the devices in the circuit. Our compiler uses architecture specific parallelization strategies (OpenMP for multi-core, PThreads for Cell, CUDA for GPU, statically scheduled VLIW for FPGA) when producing code for these different architectures. We automatically explore different implementation configurations (e.g. unroll factor, vector length) using our performance-tuner to identify the best possible configuration for each architecture. We demonstrate speedups of 3- 182times for a Xilinx Virtex5 LX 330T, 1.3-33times for an IBM Cell, and 3-131times for an NVIDIA 9600 GT GPU over a 3 GHz Intel Xeon 5160 implementation for a variety of single-precision device models

    Pipelining Of Double Precision Floating Point Division And Square Root Operations On Field-programmable Gate Arrays

    Get PDF
    Many space applications, such as vision-based systems, synthetic aperture radar, and radar altimetry rely increasingly on high data rate DSP algorithms. These algorithms use double precision floating point arithmetic operations. While most DSP applications can be executed on DSP processors, the DSP numerical requirements of these new space applications surpass by far the numerical capabilities of many current DSP processors. Since the tradition in DSP processing has been to use fixed point number representation, only recently have DSP processors begun to incorporate floating point arithmetic units, even though most of these units handle only single precision floating point addition/subtraction, multiplication, and occasionally division. While DSP processors are slowly evolving to meet the numerical requirements of newer space applications, FPGA densities have rapidly increased to parallel and surpass even the gate densities of many DSP processors and commodity CPUs. This makes them attractive platforms to implement compute-intensive DSP computations. Even in the presence of this clear advantage on the side of FPGAs, few attempts have been made to examine how wide precision floating point arithmetic, particularly division and square root operations, can perform on FPGAs to support these compute-intensive DSP applications. In this context, this thesis presents the sequential and pipelined designs of IEEE-754 compliant double floating point division and square root operations based on low radix digit recurrence algorithms. FPGA implementations of these algorithms have the advantage of being easily testable. In particular, the pipelined designs are synthesized based on careful partial and full unrolling of the iterations in the digit recurrence algorithms. In the overall, the implementations of the sequential and pipelined designs are common-denominator implementations which do not use any performance-enhancing embedded components such as multipliers and block memory. As these implementations exploit exclusively the fine-grain reconfigurable resources of Virtex FPGAs, they are easily portable to other FPGAs with similar reconfigurable fabrics without any major modifications. The pipelined designs of these two operations are evaluated in terms of area, throughput, and dynamic power consumption as a function of pipeline depth. Pipelining experiments reveal that the area overhead tends to remain constant regardless of the degree of pipelining to which the design is submitted, while the throughput increases with pipeline depth. In addition, these experiments reveal that pipelining reduces power considerably in shallow pipelines. Pipelining further these designs does not necessarily lead to significant power reduction. By partitioning these designs into deeper pipelines, these designs can reach throughputs close to the 100 MFLOPS mark by consuming a modest 1% to 8% of the reconfigurable fabric within a Virtex-II XC2VX000 (e.g., XC2V1000 or XC2V6000) FPGA

    High Performance Computing via High Level Synthesis

    Get PDF
    As more and more powerful integrated circuits are appearing on the market, more and more applications, with very different requirements and workloads, are making use of the available computing power. This thesis is in particular devoted to High Performance Computing applications, where those trends are carried to the extreme. In this domain, the primary aspects to be taken into consideration are (1) performance (by definition) and (2) energy consumption (since operational costs dominate over procurement costs). These requirements can be satisfied more easily by deploying heterogeneous platforms, which include CPUs, GPUs and FPGAs to provide a broad range of performance and energy-per-operation choices. In particular, as we will see, FPGAs clearly dominate both CPUs and GPUs in terms of energy, and can provide comparable performance. An important aspect of this trend is of course design technology, because these applications were traditionally programmed in high-level languages, while FPGAs required low-level RTL design. The OpenCL (Open Computing Language) developed by the Khronos group enables developers to program CPU, GPU and recently FPGAs using functionally portable (but sadly not performance portable) source code which creates new possibilities and challenges both for research and industry. FPGAs have been always used for mid-size designs and ASIC prototyping thanks to their energy efficient and flexible hardware architecture, but their usage requires hardware design knowledge and laborious design cycles. Several approaches are developed and deployed to address this issue and shorten the gap between software and hardware in FPGA design flow, in order to enable FPGAs to capture a larger portion of the hardware acceleration market in data centers. Moreover, FPGAs usage in data centers is growing already, regardless of and in addition to their use as computational accelerators, because they can be used as high performance, low power and secure switches inside data-centers. High-Level Synthesis (HLS) is the methodology that enables designers to map their applications on FPGAs (and ASICs). It synthesizes parallel hardware from a model originally written C-based programming languages .e.g. C/C++, SystemC and OpenCL. Design space exploration of the variety of implementations that can be obtained from this C model is possible through wide range of optimization techniques and directives, e.g. to pipeline loops and partition memories into multiple banks, which guide RTL generation toward application dependent hardware and benefit designers from flexible parallel architecture of FPGAs. Model Based Design (MBD) is a high-level and visual process used to generate implementations that solve mathematical problems through a varied set of IP-blocks. MBD enables developers with different expertise, e.g. control theory, embedded software development, and hardware design to share a common design framework and contribute to a shared design using the same tool. Simulink, developed by MATLAB, is a model based design tool for simulation and development of complex dynamical systems. Moreover, Simulink embedded code generators can produce verified C/C++ and HDL code from the graphical model. This code can be used to program micro-controllers and FPGAs. This PhD thesis work presents a study using automatic code generator of Simulink to target Xilinx FPGAs using both HDL and C/C++ code to demonstrate capabilities and challenges of high-level synthesis process. To do so, firstly, digital signal processing unit of a real-time radar application is developed using Simulink blocks. Secondly, generated C based model was used for high level synthesis process and finally the implementation cost of HLS is compared to traditional HDL synthesis using Xilinx tool chain. Alternative to model based design approach, this work also presents an analysis on FPGA programming via high-level synthesis techniques for computationally intensive algorithms and demonstrates the importance of HLS by comparing performance-per-watt of GPUs(NVIDIA) and FPGAs(Xilinx) manufactured in the same node running standard OpenCL benchmarks. We conclude that generation of high quality RTL from OpenCL model requires stronger hardware background with respect to the MBD approach, however, the availability of a fast and broad design space exploration ability and portability of the OpenCL code, e.g. to CPUs and GPUs, motivates FPGA industry leaders to provide users with OpenCL software development environment which promises FPGA programming in CPU/GPU-like fashion. Our experiments, through extensive design space exploration(DSE), suggest that FPGAs have higher performance-per-watt with respect to two high-end GPUs manufactured in the same technology(28 nm). Moreover, FPGAs with more available resources and using a more modern process (20 nm) can outperform the tested GPUs while consuming much less power at the cost of more expensive devices

    A Survey and Evaluation of FPGA High-Level Synthesis Tools

    Get PDF
    High-level synthesis (HLS) is increasingly popular for the design of high-performance and energy-efficient heterogeneous systems, shortening time-to-market and addressing today's system complexity. HLS allows designers to work at a higher-level of abstraction by using a software program to specify the hardware functionality. Additionally, HLS is particularly interesting for designing field-programmable gate array circuits, where hardware implementations can be easily refined and replaced in the target device. Recent years have seen much activity in the HLS research community, with a plethora of HLS tool offerings, from both industry and academia. All these tools may have different input languages, perform different internal optimizations, and produce results of different quality, even for the very same input description. Hence, it is challenging to compare their performance and understand which is the best for the hardware to be implemented. We present a comprehensive analysis of recent HLS tools, as well as overview the areas of active interest in the HLS research community. We also present a first-published methodology to evaluate different HLS tools. We use our methodology to compare one commercial and three academic tools on a common set of C benchmarks, aiming at performing an in-depth evaluation in terms of performance and the use of resources
    corecore