Search CORE

137 research outputs found

Performance and area evaluations of processor-based benchmarks on FPGA devices

Author: Jiunn-Tyng Kao (7215932)
Publication venue
Publication date: 01/01/2014
Field of study

The computing system on SoCs is being long-term research since the FPGA technology has emerged due to its personality of re-programmable fabric, reconfigurable computing, and fast development time to market. During the last decade, uni-processor in a SoC is no longer to deal with the high growing market for complex applications such as Mobile Phones audio and video encoding, image and network processing. Due to the number of transistors on a silicon wafer is increasing, the recent FPGAs or embedded systems are advancing toward multi-processor-based design to meet tremendous performance and benefit this kind of systems are possible. Therefore, is an upcoming age of the MPSoC. In addition, most of the embedded processors are soft-cores, because they are flexible and reconfigurable for specific software functions and easy to build homogenous multi-processor systems for parallel programming. Moreover, behavioural synthesis tools are becoming a lot more powerful and enable to create datapath of logic units from high-level algorithms such as C to HDL and available for partitioning a HW/SW concurrent methodology. A range of embedded processors is able to implement on a FPGA-based prototyping to integrate the CPUs on a programmable device. This research is, firstly represent different types of computer architectures in modern embedded processors that are followed in different type of software applications (eg. Multi-threading Operations or Complex Functions) on FPGA-based SoCs; and secondly investigate their capability by executing a wide-range of multimedia software codes (Integer-algometric only) in different models of the processor-systems (uni-processor or multi-processor or Co-design), and finally compare those results in terms of the benchmarks and resource utilizations within FPGAs. All the examined programs were written in standard C and executed in a variety numbers of soft-core processors or hardware units to obtain the execution times. However, the number of processors and their customizable configuration or hardware datapath being generated are limited by a target FPGA resource, and designers need to understand the FPGA-based tradeoffs that have been considered - Speed versus Area. For this experimental purpose, I defined benchmarks into DLP / HLS catalogues, which are "data" and "function" intensive respectively. The programs of DLP will be executed in LEON3 MP and LE1 CMP multi-processor systems and the programs of HLS in the LegUp Co-design system on target FPGAs. In preliminary, the performance of the soft-core processors will be examined by executing all the benchmarks. The whole story of this thesis work centres on the issue of the execute times or the speed-up and area breakdown on FPGA devices in terms of different programs

Loughborough University Institutional Repository

An OpenCL software compilation framework targeting an SoC-FPGA VLIW chip multiprocessor

Author: Samuel J. Parker (7203041)
Vassilios Chouliaras (1251600)
Publication venue
Publication date: 01/01/2016
Field of study

Modern systems-on-chip augment their baseline CPU with coprocessors and accelerators to increase overall computational capability and power efficiency, and thus have evolved into heterogeneous multi-core systems. Several languages have been developed to enable this paradigm shift, including CUDA and OpenCL. This paper discusses a unified compilation environment to enable heterogeneous system design through the use of OpenCL and a highly configurable VLIW Chip Multiprocessor architecture known as the LE1. An LLVM compilation framework was researched and a prototype developed to enable the execution of OpenCL applications on a number of hardware configurations of the LE1 CMP. The presented OpenCL framework fully automates the compilation flow and supports work-item coalescing which better maps onto the ILP processor cores of the LE1 architecture. This paper discusses in detail both the software stack and target hardware architecture and evaluates the scalability of the proposed framework by running 12 industry-standard OpenCL benchmarks drawn from the AMD SDK and the Rodinia suites. The benchmarks are executed on 40 LE1 configurations with 10 implemented on an SoC-FPGA and the remaining on a cycle-accurate simulator. Across 12 OpenCL benchmarks results demonstrate near-linear wall-clock performance improvement of 1.8x (using 2 dual-issue cores), up to 5.2x (using 8 dual-issue cores) and on one case, super-linear improvement of 8.4x (FixOffset kernel, 8 dual-issue cores). The number of OpenCL benchmarks evaluated makes this study one of the most complete in the literature

Loughborough University Institutional Repository

Exploiting Hardware Abstraction for Parallel Programming Framework: Platform and Multitasking

Author: Ding Hongyuan
Publication venue: ScholarWorks@UARK
Publication date: 01/01/2017
Field of study

With the help of the parallelism provided by the fine-grained architecture, hardware accelerators on Field Programmable Gate Arrays (FPGAs) can significantly improve the performance of many applications. However, designers are required to have excellent hardware programming skills and unique optimization techniques to explore the potential of FPGA resources fully. Intermediate frameworks above hardware circuits are proposed to improve either performance or productivity by leveraging parallel programming models beyond the multi-core era. In this work, we propose the PolyPC (Polymorphic Parallel Computing) framework, which targets enhancing productivity without losing performance. It helps designers develop parallelized applications and implement them on FPGAs. The PolyPC framework implements a custom hardware platform, on which programs written in an OpenCL-like programming model can launch. Additionally, the PolyPC framework extends vendor-provided tools to provide a complete development environment including intermediate software framework, and automatic system builders. Designers\u27 programs can be either synthesized as hardware processing elements (PEs) or compiled to executable files running on software PEs. Benefiting from nontrivial features of re-loadable PEs, and independent group-level schedulers, the multitasking is enabled for both software and hardware PEs to improve the efficiency of utilizing hardware resources. The PolyPC framework is evaluated regarding performance, area efficiency, and multitasking. The results show a maximum 66 times speedup over a dual-core ARM processor and 1043 times speedup over a high-performance MicroBlaze with 125 times of area efficiency. It delivers a significant improvement in response time to high-priority tasks with the priority-aware scheduling. Overheads of multitasking are evaluated to analyze trade-offs. With the help of the design flow, the OpenCL application programs are converted into executables through the front-end source-to-source transformation and back-end synthesis/compilation to run on PEs, and the framework is generated from users\u27 specifications

ScholarWorks@UARK

UARK (University of Arkansas )

High-level synthesis for FPGAs: From prototyping to deployment

Author: Fellow IEEE Bin Liu
Jason Cong
Kees Vissers
Member IEEE Juanjo Noguera
Member IEEE Zhiru Zhang
Stephen Neuendorffer
Publication venue
Publication date: 06/03/2020
Field of study

Abstract-Escalating System-on-Chip design complexity is pushing the design community to raise the level of abstraction beyond RTL. Despite the unsuccessful adoptions of early generations of commercial high-level synthesis (HLS) systems, we believe that the tipping point for transitioning to HLS methodology is happening now, especially for FPGA designs. The latest generation of HLS tools has made significant progress in providing wide language coverage and robust compilation technology, platform-based modeling, advancement in core HLS algorithms, and a domain-specific approach. In this paper we use AutoESL's AutoPilot HLS tool coupled with domain-specific system-level implementation platforms developed by Xilinx as an example to demonstrate the effectiveness of state-of-art C-to-FPGA synthesis solutions targeting multiple application domains. Complex industrial designs targeting Xilinx FPGAs are also presented as case studies, including comparison of HLS solutions versus optimized manual designs. Index Terms-Domain-specific design, field-programmable gate array (FPGA), high-level synthesis (HLS), quality of results (QoR)

CiteSeerX

Accelerating Halide on an FPGA by using CIRCT and Calyx as an intermediate step to go from a high-level and software-centric IRs down to RTL

Author: Granell Escalfet Sergi
Publication venue: Universitat Politècnica de Catalunya
Publication date: 15/05/2023
Field of study

Image processing and, more generally, array processing play an essential role in modern life: from applying filters to the images that we upload to social media to running object detection algorithms on self-driving cars. Optimizing these algorithms can be complex and often results in non-portable code. The Halide language provides a simple way to write image and array processing algorithms by separating the algorithm definition (what needs to be executed) from its execution schedule (how it is executed), delivering state-of-the-art performance that exceeds hand-tuned parallel and vectorized code. Due to the inherent parallel nature of these algorithms, FPGAs present an attractive acceleration platform. While previous work has added an RTL code generator to Halide, and utilized other heterogeneous computing languages as an intermediate step, these projects are no longer maintained. MLIR is an attractive solution, allowing the generation of code that can target multiple devices, such as parallelized and vectorized CPU code, OpenMP, and CUDA. CIRCT builds on top of MLIR to convert generic MLIR code to register transfer level (RTL) languages by using Calyx, a new intermediate language (IL) for compiling high-level programs into hardware designs. This thesis presents a novel flow that implements an MLIR code generator for Halide that generates RTL code, adding the necessary wrappers to execute that code on Xilinx FPGA devices. Additionally, it implements a Halide runtime using the Xilinx Runtime (XRT), enabling seamless execution of the generated Halide RTL kernels. While this thesis provides initial support for running Halide kernels and not all features and optimizations are supported, it also details the future work needed to improve the performance of the generated RTL kernels. The proposed flow serves as a foundation for further research and development in the field of hardware acceleration for image and array processing applications using Halide

UPCommons. Portal del coneixement obert de la UPC

VThreads: A novel VLIW chip multiprocessor with hardware-assisted PThreads

Author: David Stevens (4254274)
Vassilios Chouliaras (1251600)
Vincent Dwyer (1251447)
Publication venue
Publication date: 01/01/2016
Field of study

We discuss VThreads, a novel VLIW CMP with hardware-assisted shared-memory Thread support. VThreads supports Instruction Level Parallelism via static multiple-issue and Thread Level Parallelism via hardware-assisted POSIX Threads along with extensive customization. It allows the instantiation of tightlycoupled streaming accelerators and supports up to 7-address Multiple-Input, Multiple-Output instruction extensions. VThreads is designed in technology-independent Register-Transfer-Level VHDL and prototyped on 40 nm and 28 nm Field-Programmable gate arrays. It was evaluated against a PThreads-based multiprocessor based on the Sparc-V8 ISA. On a 65 nm ASIC implementation VThreads achieves up to x7.2 performance increase on synthetic benchmarks, x5 on a parallel Mandelbrot implementation, 66% better on a threaded JPEG implementation, 79% better on an edge-detection benchmark and ~13% improvement on DES compared to the Leon3MP CMP. In the range of 2 to 8 cores VThreads demonstrates a post-route (statistical) power reduction between 65% to 57% at an area increase of 1.2%-10% for 1-8 cores, compared to a similarly-configured Leon3MP CMP. This combination of micro-architectural features, scalability, extensibility, hardware support for low-latency PThreads, power efficiency and area make the processor an attractive proposition for low-power, deeply-embedded applications requiring minimum OS support

Loughborough University Institutional Repository

Software acceleration on Xilinx FPGAs using OmpSs@FPGA ecosystem

Author: BERTOCCO SARA
CORETTI Igor
Goz David
TAFFONI Giuliano
Publication venue
Publication date: 01/01/2021
Field of study

The OmpSs@FPGA programming model allows offloading application functionality to Xilinx Field Programmable Gate Arrays (FPGAs). The OmpSs compiler splits the code (written in C/C++ high level language) in two parts, targeting the host and the FPGA. The first is usually compiled by the GNU Compiler Collection (GCC), while the latter is given to the Xilinx Vivado HLS tool (hereafter HLS) for high level synthesis to VHDL and bitstream used to program the FPGA. OmpSs@FPGA is based on compiler directives, which allow the programmer to annotate the part of the code to automatically exploit all Symmetric MultiProcessor system (smp) and FPGA resource available in the execution platform. This technical report provides both descriptive and hands-on introductions to build application-specific FPGA systems using the high-level OmpSs@FPGA tool. The goal is to give the reader a baseline view of the process of creating an optimized hardware design annotating C-based code with HLS directives. We assume the reader has a working knowledge of C/C++, and familiarity with basic computer architecture concepts (e.g. speedup, parallelism, pipelining)

OA@INAF - Istituto Nazionale di Astrofisica

Recommended from our members

Scalable System-on-Chip Design

Author: Mantovani Paolo
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2017
Field of study

The crisis of technology scaling led the industry of semiconductors towards the adoption of disruptive technologies and innovations to sustain the evolution of microprocessors and keep under control the timing of the design cycle. Multi-core and many-core architectures sought more energy-efficient computation by replacing a power-hungry processor with multiple simpler cores exploiting parallelism. Multi-core processors alone, however, turned out to be insufficient to sustain the ever growing demand for energy and power-efficient computation without compromising performance. Therefore, designers were pushed to drift from homogeneous architectures towards more complex heterogeneous systems that employ the large number of available transistors to incorporate a combination of customized energy-efficient accelerators, along with the general-purpose processor cores. Meanwhile, enhancements in manufacturing processes allowed designers to move a variety of peripheral components and analog devices into the chip. This paradigm shift defined the concept of {\em system-on-chip} (SoC) as a single-chip design that integrates several heterogeneous components. The rise of SoCs corresponds to a rapid decrease of the opportunity cost for integrating accelerators. In fact, on one hand, employing more transistors for powerful cores is not feasible anymore, because transistors cannot be active all at once within reasonable power budgets. On the other hand, increasing the number of homogeneous cores incurs more and more diminishing returns. The availability of cost effective silicon area for specialized hardware creates an opportunity to enter the market of semiconductors for new small players: engineers from several different scientific areas can develop competitive algorithms suitable for acceleration for domain-specific applications, such as multimedia systems, self-driving vehicles, robotics, and more. However, turning these algorithms into SoC components, referred to as {\em intellectual property}, still requires expert hardware designers who are typically not familiar with the specific domain of the target application. Furthermore, heterogeneity makes SoC design and programming much more difficult, especially because of the challenges of the integration process. This is a fine art in the hands of few expert engineers who understand system-level trade-offs, know how to design good hardware, how to handle memory and power management, how to shape and balance the traffic over an interconnect, and are able to deal with many different hardware-software interfaces. Designers need solutions enabling them to build scalable and heterogeneous SoCs. My thesis is that {\em the key to scalable SoC designs is a regular and flexible architecture that hides the complexity of heterogeneous integration from designers, while helping them focus on the important aspects of domain-specific applications through a companion system-level design methodology.} I open a path towards this goal by proposing an architecture that mitigates heterogeneity with regularity and addresses the challenges of heterogeneous component integration by implementing a set of {\em platform services}. These are hardware and software interfaces that from a system-level viewpoint give the illusion of working with a homogeneous SoC, thus making it easier to reuse accelerators and port applications across different designs, each with its own target workload and cost-performance trade-off point. A companion system-level design methodology exploits the regularity of the architecture to guide designers in implementing their intellectual property and enables an extensive design-space exploration across multiple levels of abstraction. Throughout the dissertation, I present a fully automated flow to deploy heterogeneous SoCs on single or multiple field-programmable-gate-array devices. The flow provides non-expert designers with a set of knobs for tuning system-level features based on the given mix of accelerators that they have integrated. Many contributions of my dissertation have already influenced other research projects as well as the content of an advanced course for graduate and senior undergraduate students, which aims to form a new generation of system-level designers. These new professionals need not to be circuit or register-transfer level design experts, and not even gurus of operating systems. Instead, they are trained to design efficient intellectual property by considering system-level trade-offs, while the architecture and the methodology that I describe in this dissertation empower them to integrate their components into an SoC. Finally, with the open-source release of the entire infrastructure, including the SoC-deployment flow and the software stack, I hope I will be able to inspire other research groups and help them implement ideas that further reduce the cost and design-time of future heterogeneous systems

Columbia University Academic Commons

Automated Debugging Methodology for FPGA-based Systems

Author: Khan Habib ul Hasan
Publication venue
Publication date: 30/12/2019
Field of study

Electronic devices make up a vital part of our lives. These are seen from mobiles, laptops, computers, home automation, etc. to name a few. The modern designs constitute billions of transistors. However, with this evolution, ensuring that the devices fulfill the designer’s expectation under variable conditions has also become a great challenge. This requires a lot of design time and effort. Whenever an error is encountered, the process is re-started. Hence, it is desired to minimize the number of spins required to achieve an error-free product, as each spin results in loss of time and effort. Software-based simulation systems present the main technique to ensure the verification of the design before fabrication. However, few design errors (bugs) are likely to escape the simulation process. Such bugs subsequently appear during the post-silicon phase. Finding such bugs is time-consuming due to inherent invisibility of the hardware. Instead of software simulation of the design in the pre-silicon phase, post-silicon techniques permit the designers to verify the functionality through the physical implementations of the design. The main benefit of the methodology is that the implemented design in the post-silicon phase runs many order-of-magnitude faster than its counterpart in pre-silicon. This allows the designers to validate their design more exhaustively. This thesis presents five main contributions to enable a fast and automated debugging solution for reconfigurable hardware. During the research work, we used an obstacle avoidance system for robotic vehicles as a use case to illustrate how to apply the proposed debugging solution in practical environments. The first contribution presents a debugging system capable of providing a lossless trace of debugging data which permits a cycle-accurate replay. This methodology ensures capturing permanent as well as intermittent errors in the implemented design. The contribution also describes a solution to enhance hardware observability. It is proposed to utilize processor-configurable concentration networks, employ debug data compression to transmit the data more efficiently, and partially reconfiguring the debugging system at run-time to save the time required for design re-compilation as well as preserve the timing closure. The second contribution presents a solution for communication-centric designs. Furthermore, solutions for designs with multi-clock domains are also discussed. The third contribution presents a priority-based signal selection methodology to identify the signals which can be more helpful during the debugging process. A connectivity generation tool is also presented which can map the identified signals to the debugging system. The fourth contribution presents an automated error detection solution which can help in capturing the permanent as well as intermittent errors without continuous monitoring of debugging data. The proposed solution works for designs even in the absence of golden reference. The fifth contribution proposes to use artificial intelligence for post-silicon debugging. We presented a novel idea of using a recurrent neural network for debugging when a golden reference is present for training the network. Furthermore, the idea was also extended to designs where golden reference is not present

Qucosa

HSSS - Hochschulschriftenserver der SLUB

Technische Universität Dresden: Qucosa