137 research outputs found

    Performance and area evaluations of processor-based benchmarks on FPGA devices

    Get PDF
    The computing system on SoCs is being long-term research since the FPGA technology has emerged due to its personality of re-programmable fabric, reconfigurable computing, and fast development time to market. During the last decade, uni-processor in a SoC is no longer to deal with the high growing market for complex applications such as Mobile Phones audio and video encoding, image and network processing. Due to the number of transistors on a silicon wafer is increasing, the recent FPGAs or embedded systems are advancing toward multi-processor-based design to meet tremendous performance and benefit this kind of systems are possible. Therefore, is an upcoming age of the MPSoC. In addition, most of the embedded processors are soft-cores, because they are flexible and reconfigurable for specific software functions and easy to build homogenous multi-processor systems for parallel programming. Moreover, behavioural synthesis tools are becoming a lot more powerful and enable to create datapath of logic units from high-level algorithms such as C to HDL and available for partitioning a HW/SW concurrent methodology. A range of embedded processors is able to implement on a FPGA-based prototyping to integrate the CPUs on a programmable device. This research is, firstly represent different types of computer architectures in modern embedded processors that are followed in different type of software applications (eg. Multi-threading Operations or Complex Functions) on FPGA-based SoCs; and secondly investigate their capability by executing a wide-range of multimedia software codes (Integer-algometric only) in different models of the processor-systems (uni-processor or multi-processor or Co-design), and finally compare those results in terms of the benchmarks and resource utilizations within FPGAs. All the examined programs were written in standard C and executed in a variety numbers of soft-core processors or hardware units to obtain the execution times. However, the number of processors and their customizable configuration or hardware datapath being generated are limited by a target FPGA resource, and designers need to understand the FPGA-based tradeoffs that have been considered - Speed versus Area. For this experimental purpose, I defined benchmarks into DLP / HLS catalogues, which are "data" and "function" intensive respectively. The programs of DLP will be executed in LEON3 MP and LE1 CMP multi-processor systems and the programs of HLS in the LegUp Co-design system on target FPGAs. In preliminary, the performance of the soft-core processors will be examined by executing all the benchmarks. The whole story of this thesis work centres on the issue of the execute times or the speed-up and area breakdown on FPGA devices in terms of different programs

    An OpenCL software compilation framework targeting an SoC-FPGA VLIW chip multiprocessor

    Get PDF
    Modern systems-on-chip augment their baseline CPU with coprocessors and accelerators to increase overall computational capability and power efficiency, and thus have evolved into heterogeneous multi-core systems. Several languages have been developed to enable this paradigm shift, including CUDA and OpenCL. This paper discusses a unified compilation environment to enable heterogeneous system design through the use of OpenCL and a highly configurable VLIW Chip Multiprocessor architecture known as the LE1. An LLVM compilation framework was researched and a prototype developed to enable the execution of OpenCL applications on a number of hardware configurations of the LE1 CMP. The presented OpenCL framework fully automates the compilation flow and supports work-item coalescing which better maps onto the ILP processor cores of the LE1 architecture. This paper discusses in detail both the software stack and target hardware architecture and evaluates the scalability of the proposed framework by running 12 industry-standard OpenCL benchmarks drawn from the AMD SDK and the Rodinia suites. The benchmarks are executed on 40 LE1 configurations with 10 implemented on an SoC-FPGA and the remaining on a cycle-accurate simulator. Across 12 OpenCL benchmarks results demonstrate near-linear wall-clock performance improvement of 1.8x (using 2 dual-issue cores), up to 5.2x (using 8 dual-issue cores) and on one case, super-linear improvement of 8.4x (FixOffset kernel, 8 dual-issue cores). The number of OpenCL benchmarks evaluated makes this study one of the most complete in the literature

    Exploiting Hardware Abstraction for Parallel Programming Framework: Platform and Multitasking

    Get PDF
    With the help of the parallelism provided by the fine-grained architecture, hardware accelerators on Field Programmable Gate Arrays (FPGAs) can significantly improve the performance of many applications. However, designers are required to have excellent hardware programming skills and unique optimization techniques to explore the potential of FPGA resources fully. Intermediate frameworks above hardware circuits are proposed to improve either performance or productivity by leveraging parallel programming models beyond the multi-core era. In this work, we propose the PolyPC (Polymorphic Parallel Computing) framework, which targets enhancing productivity without losing performance. It helps designers develop parallelized applications and implement them on FPGAs. The PolyPC framework implements a custom hardware platform, on which programs written in an OpenCL-like programming model can launch. Additionally, the PolyPC framework extends vendor-provided tools to provide a complete development environment including intermediate software framework, and automatic system builders. Designers\u27 programs can be either synthesized as hardware processing elements (PEs) or compiled to executable files running on software PEs. Benefiting from nontrivial features of re-loadable PEs, and independent group-level schedulers, the multitasking is enabled for both software and hardware PEs to improve the efficiency of utilizing hardware resources. The PolyPC framework is evaluated regarding performance, area efficiency, and multitasking. The results show a maximum 66 times speedup over a dual-core ARM processor and 1043 times speedup over a high-performance MicroBlaze with 125 times of area efficiency. It delivers a significant improvement in response time to high-priority tasks with the priority-aware scheduling. Overheads of multitasking are evaluated to analyze trade-offs. With the help of the design flow, the OpenCL application programs are converted into executables through the front-end source-to-source transformation and back-end synthesis/compilation to run on PEs, and the framework is generated from users\u27 specifications

    High-level synthesis for FPGAs: From prototyping to deployment

    Get PDF
    Abstract-Escalating System-on-Chip design complexity is pushing the design community to raise the level of abstraction beyond RTL. Despite the unsuccessful adoptions of early generations of commercial high-level synthesis (HLS) systems, we believe that the tipping point for transitioning to HLS methodology is happening now, especially for FPGA designs. The latest generation of HLS tools has made significant progress in providing wide language coverage and robust compilation technology, platform-based modeling, advancement in core HLS algorithms, and a domain-specific approach. In this paper we use AutoESL's AutoPilot HLS tool coupled with domain-specific system-level implementation platforms developed by Xilinx as an example to demonstrate the effectiveness of state-of-art C-to-FPGA synthesis solutions targeting multiple application domains. Complex industrial designs targeting Xilinx FPGAs are also presented as case studies, including comparison of HLS solutions versus optimized manual designs. Index Terms-Domain-specific design, field-programmable gate array (FPGA), high-level synthesis (HLS), quality of results (QoR)

    Accelerating Halide on an FPGA by using CIRCT and Calyx as an intermediate step to go from a high-level and software-centric IRs down to RTL

    Get PDF
    Image processing and, more generally, array processing play an essential role in modern life: from applying filters to the images that we upload to social media to running object detection algorithms on self-driving cars. Optimizing these algorithms can be complex and often results in non-portable code. The Halide language provides a simple way to write image and array processing algorithms by separating the algorithm definition (what needs to be executed) from its execution schedule (how it is executed), delivering state-of-the-art performance that exceeds hand-tuned parallel and vectorized code. Due to the inherent parallel nature of these algorithms, FPGAs present an attractive acceleration platform. While previous work has added an RTL code generator to Halide, and utilized other heterogeneous computing languages as an intermediate step, these projects are no longer maintained. MLIR is an attractive solution, allowing the generation of code that can target multiple devices, such as parallelized and vectorized CPU code, OpenMP, and CUDA. CIRCT builds on top of MLIR to convert generic MLIR code to register transfer level (RTL) languages by using Calyx, a new intermediate language (IL) for compiling high-level programs into hardware designs. This thesis presents a novel flow that implements an MLIR code generator for Halide that generates RTL code, adding the necessary wrappers to execute that code on Xilinx FPGA devices. Additionally, it implements a Halide runtime using the Xilinx Runtime (XRT), enabling seamless execution of the generated Halide RTL kernels. While this thesis provides initial support for running Halide kernels and not all features and optimizations are supported, it also details the future work needed to improve the performance of the generated RTL kernels. The proposed flow serves as a foundation for further research and development in the field of hardware acceleration for image and array processing applications using Halide

    VThreads: A novel VLIW chip multiprocessor with hardware-assisted PThreads

    Get PDF
    We discuss VThreads, a novel VLIW CMP with hardware-assisted shared-memory Thread support. VThreads supports Instruction Level Parallelism via static multiple-issue and Thread Level Parallelism via hardware-assisted POSIX Threads along with extensive customization. It allows the instantiation of tightlycoupled streaming accelerators and supports up to 7-address Multiple-Input, Multiple-Output instruction extensions. VThreads is designed in technology-independent Register-Transfer-Level VHDL and prototyped on 40 nm and 28 nm Field-Programmable gate arrays. It was evaluated against a PThreads-based multiprocessor based on the Sparc-V8 ISA. On a 65 nm ASIC implementation VThreads achieves up to x7.2 performance increase on synthetic benchmarks, x5 on a parallel Mandelbrot implementation, 66% better on a threaded JPEG implementation, 79% better on an edge-detection benchmark and ~13% improvement on DES compared to the Leon3MP CMP. In the range of 2 to 8 cores VThreads demonstrates a post-route (statistical) power reduction between 65% to 57% at an area increase of 1.2%-10% for 1-8 cores, compared to a similarly-configured Leon3MP CMP. This combination of micro-architectural features, scalability, extensibility, hardware support for low-latency PThreads, power efficiency and area make the processor an attractive proposition for low-power, deeply-embedded applications requiring minimum OS support

    Software acceleration on Xilinx FPGAs using OmpSs@FPGA ecosystem

    Get PDF
    The OmpSs@FPGA programming model allows offloading application functionality to Xilinx Field Programmable Gate Arrays (FPGAs). The OmpSs compiler splits the code (written in C/C++ high level language) in two parts, targeting the host and the FPGA. The first is usually compiled by the GNU Compiler Collection (GCC), while the latter is given to the Xilinx Vivado HLS tool (hereafter HLS) for high level synthesis to VHDL and bitstream used to program the FPGA. OmpSs@FPGA is based on compiler directives, which allow the programmer to annotate the part of the code to automatically exploit all Symmetric MultiProcessor system (smp) and FPGA resource available in the execution platform. This technical report provides both descriptive and hands-on introductions to build application-specific FPGA systems using the high-level OmpSs@FPGA tool. The goal is to give the reader a baseline view of the process of creating an optimized hardware design annotating C-based code with HLS directives. We assume the reader has a working knowledge of C/C++, and familiarity with basic computer architecture concepts (e.g. speedup, parallelism, pipelining)

    Automated Debugging Methodology for FPGA-based Systems

    Get PDF
    Electronic devices make up a vital part of our lives. These are seen from mobiles, laptops, computers, home automation, etc. to name a few. The modern designs constitute billions of transistors. However, with this evolution, ensuring that the devices fulfill the designer’s expectation under variable conditions has also become a great challenge. This requires a lot of design time and effort. Whenever an error is encountered, the process is re-started. Hence, it is desired to minimize the number of spins required to achieve an error-free product, as each spin results in loss of time and effort. Software-based simulation systems present the main technique to ensure the verification of the design before fabrication. However, few design errors (bugs) are likely to escape the simulation process. Such bugs subsequently appear during the post-silicon phase. Finding such bugs is time-consuming due to inherent invisibility of the hardware. Instead of software simulation of the design in the pre-silicon phase, post-silicon techniques permit the designers to verify the functionality through the physical implementations of the design. The main benefit of the methodology is that the implemented design in the post-silicon phase runs many order-of-magnitude faster than its counterpart in pre-silicon. This allows the designers to validate their design more exhaustively. This thesis presents five main contributions to enable a fast and automated debugging solution for reconfigurable hardware. During the research work, we used an obstacle avoidance system for robotic vehicles as a use case to illustrate how to apply the proposed debugging solution in practical environments. The first contribution presents a debugging system capable of providing a lossless trace of debugging data which permits a cycle-accurate replay. This methodology ensures capturing permanent as well as intermittent errors in the implemented design. The contribution also describes a solution to enhance hardware observability. It is proposed to utilize processor-configurable concentration networks, employ debug data compression to transmit the data more efficiently, and partially reconfiguring the debugging system at run-time to save the time required for design re-compilation as well as preserve the timing closure. The second contribution presents a solution for communication-centric designs. Furthermore, solutions for designs with multi-clock domains are also discussed. The third contribution presents a priority-based signal selection methodology to identify the signals which can be more helpful during the debugging process. A connectivity generation tool is also presented which can map the identified signals to the debugging system. The fourth contribution presents an automated error detection solution which can help in capturing the permanent as well as intermittent errors without continuous monitoring of debugging data. The proposed solution works for designs even in the absence of golden reference. The fifth contribution proposes to use artificial intelligence for post-silicon debugging. We presented a novel idea of using a recurrent neural network for debugging when a golden reference is present for training the network. Furthermore, the idea was also extended to designs where golden reference is not present
    • …
    corecore