
    Evaluating Rapid Application Development with Python for Heterogeneous Processor-based FPGAs

    As modern FPGAs evolve to include more heterogeneous processing elements, such as ARM cores, it makes sense to consider these devices as processors first and FPGA accelerators second. As such, the conventional FPGA development environment must also adapt to support more software-like programming functionality. While high-level synthesis tools can help reduce FPGA development time, a large expertise gap remains in order to realize high-performing implementations. At the system level, the skill set necessary to integrate multiple custom IP hardware cores, interconnects, memory interfaces, and now heterogeneous processing elements is complex. Rather than drive FPGA development from the hardware up, we consider the impact of leveraging Python to accelerate application development. Python offers highly optimized libraries from an incredibly large developer community, yet is limited to the performance of the hardware system. In this work we evaluate the impact of using PYNQ, a Python development environment for application development on Xilinx Zynq devices, its performance implications, and the bottlenecks associated with it. We compare our results against existing C-based and hand-coded implementations to better understand whether Python can be the glue that binds together software and hardware developers.

    Comment: To appear in 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM'17)
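
    The PYNQ flow evaluated here boils down to loading a bitstream ("overlay") from Python and streaming data to programmable-logic accelerators over DMA. Below is a minimal sketch of that pattern using the PYNQ API; the bitstream file name and the DMA instance name (axi_dma_0) are placeholders, not details taken from the paper.

        import numpy as np
        from pynq import Overlay, allocate

        # Load a bitstream ("overlay") onto the Zynq programmable logic.
        overlay = Overlay("my_accel.bit")  # placeholder design name

        # DMA engine exposed by the overlay; the instance name depends
        # on the block design (axi_dma_0 is a common default).
        dma = overlay.axi_dma_0

        # Buffers must live in contiguous memory the PL can reach.
        in_buf = allocate(shape=(1024,), dtype=np.uint32)
        out_buf = allocate(shape=(1024,), dtype=np.uint32)
        in_buf[:] = np.arange(1024, dtype=np.uint32)

        # Stream input through the accelerator and read the result back.
        dma.sendchannel.transfer(in_buf)
        dma.recvchannel.transfer(out_buf)
        dma.sendchannel.wait()
        dma.recvchannel.wait()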

    A Review of Heterogeneous Design Frameworks/Platforms for Digital Systems Embedded in FPGAs and SoCs

    Systems-on-a-chip integrate specialized modules to provide well-defined functionality. In order to guarantee efficiency, designers are careful to choose high-level electronic components. In particular, FPGAs (field-programmable gate arrays) have demonstrated their ability to meet the requirements of emerging technology. However, traditional design methods cannot keep up with the speed and efficiency demanded by the embedded systems industry, so several frameworks have been developed to simplify the design process of an electronic system, from its modeling to its physical implementation. This paper illustrates some of them and presents a comparative study between them. Specifically, we have selected SoC design methods (ESP4ML and HLS4ML, OpenESP, LiteX, RubyRTL, PyMTL, SysPy, PyRTL, DSSoC), network-on-chip (NoC/OCN) methods (PyOCN), and, more generally, FPGA methods (PRGA, OpenFPGA, AnyHLS, PYNQ, and PyLog). The objective of this article is to analyze each tool at several levels and to discuss the benefit of each in the scientific community. We analyze several aspects of the architecture and structure of the platforms in order to make a comparative study of the hardware and software design flows of digital systems.
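
    To give a flavor of the Python-based flows under comparison, here is a minimal design in PyRTL, one of the surveyed frameworks: an 8-bit adder described and simulated entirely in Python. This is a generic illustration of the tool's style, not an example drawn from the paper.

        import pyrtl

        # Declare two 8-bit inputs and a 9-bit output (extra carry bit).
        a = pyrtl.Input(8, 'a')
        b = pyrtl.Input(8, 'b')
        total = pyrtl.Output(9, 'total')
        total <<= a + b  # combinational adder; widths checked by PyRTL

        # Simulate the design directly in Python.
        sim = pyrtl.Simulation()
        sim.step({'a': 100, 'b': 27})
        print(sim.inspect('total'))  # -> 127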

    Direct N-body code on low-power embedded ARM GPUs

    This work arises in the context of the ExaNeSt project, which aims at the design and development of an exascale-ready supercomputer with a low energy consumption profile that is nevertheless able to support the most demanding scientific and technical applications. The ExaNeSt compute unit consists of densely packed low-power 64-bit ARM processors embedded within Xilinx FPGA SoCs. SoC boards are heterogeneous architectures where computing power is supplied by both CPUs and GPUs, and they are emerging as a possible low-power, low-cost alternative to clusters based on traditional CPUs. A state-of-the-art direct N-body code suitable for astrophysical simulations has been re-engineered to exploit SoC heterogeneous platforms based on ARM CPUs and embedded GPUs. Performance tests show that embedded GPUs can be effectively used to accelerate real-life scientific calculations, and that they are promising also because of their energy efficiency, which is a crucial design constraint for future exascale platforms.

    Comment: 16 pages, 7 figures, 1 table, accepted for publication in the Computing Conference 2019 proceedings
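
    The kernel being accelerated is the classic O(N^2) direct-summation force calculation. A NumPy sketch of that algorithm (with Plummer-style softening and G = 1) is given below as a reference implementation; the paper's production code targets the embedded GPU directly, so this is only an illustration of the underlying computation.

        import numpy as np

        def direct_nbody_acc(pos, mass, softening=1e-4):
            """O(N^2) direct-summation gravitational accelerations (G = 1)."""
            # Pairwise displacements: d[i, j] = pos[j] - pos[i]
            d = pos[np.newaxis, :, :] - pos[:, np.newaxis, :]
            # Softened inverse-cube distances avoid the r -> 0 singularity.
            r2 = np.sum(d * d, axis=-1) + softening**2
            inv_r3 = r2 ** -1.5
            np.fill_diagonal(inv_r3, 0.0)  # no self-interaction
            # a_i = sum_j m_j * d_ij / (|d_ij|^2 + eps^2)^(3/2)
            return np.einsum('ij,ijk->ik', mass * inv_r3, d)

        pos = np.random.rand(256, 3)
        mass = np.full(256, 1.0 / 256)
        acc = direct_nbody_acc(pos, mass)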

    PRODUCTIVELY SCALING HARDWARE DESIGNS OVER INCREASING RESOURCES USING A SYSTEMATIC DESIGN ANALYSIS APPROACH

    As processor development shifts from strict single-core frequency scaling to heterogeneous resource scaling, two important considerations require evaluation: first, how to design systems with an increasing amount of heterogeneous resources, and second, how to maintain a designer's productivity as the number of possible configurations grows. It is therefore necessary to determine what useful information can be gathered from existing designs to help predict or identify a design's potential scalability, as well as to identify which routine tasks can be automated to improve a designer's productivity. Moreover, once this information is collected, how can it be conveyed to the designer such that it increases overall productivity when implementing the design over increasing amounts of resources? This research looks at various approaches to analyzing designs and attempts to distribute an application efficiently across a heterogeneous cluster of computing resources through the use of a Systematic Design Analysis flow and an assortment of productivity tools. These tools provide the designer with projections of the amount of resources needed to scale an existing design to a specified performance, as well as projections of the performance attainable with a specified amount of resources. This is accomplished through the combination of static HDL profiling, component synthesis resource utilization, and runtime performance monitoring. For evaluation, four case studies are presented to demonstrate the proposed flow's scalability on a small-scale cluster of FPGAs. The results are highly favorable, providing orders-of-magnitude speedup with minimal intervention from the designer.
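
    The flow's central projection can be illustrated with a toy model: if one instance of a component achieves a measured throughput and its synthesis report gives a known resource cost, a linear-scaling estimate yields the replication count needed for a performance target. This is a deliberately simplified, hypothetical sketch of the idea (the dissertation combines static HDL profiling, synthesis utilization, and runtime monitoring); all names and numbers below are invented.

        import math

        def project_replication(perf_per_unit, luts_per_unit,
                                luts_available, perf_target):
            """Linear-scaling estimate: units needed, LUT cost, feasibility."""
            units = math.ceil(perf_target / perf_per_unit)
            luts = units * luts_per_unit
            return units, luts, luts <= luts_available

        # One core does 2.5 GB/s and costs 8,200 LUTs; the target is 30 GB/s
        # on a device with 110,000 LUTs (all figures hypothetical).
        print(project_replication(2.5, 8200, 110000, 30.0))
        # -> (12, 98400, True): 12 replicas fit and meet the target.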

    Toward Fault-Tolerant Applications on Reconfigurable Systems-on-Chip

    The abstract is in the attachment.

    Achieving High-Speed CFD Simulations: Optimization, Parallelization, and FPGA Acceleration for the Unstructured DLR TAU Code

    Today, large-scale parallel simulations are fundamental tools for handling complex problems. The number of processors in current computing platforms has recently increased, and it is therefore necessary to optimize application performance and to enhance the scalability of massively parallel systems. In addition, new heterogeneous architectures that combine conventional processors with specific hardware, like FPGAs, to accelerate the most time-consuming functions are considered a strong alternative for boosting performance. In this paper, the performance of the DLR TAU code is analyzed and optimized. The improvement of the code's efficiency is addressed through three key activities: optimization, parallelization, and hardware acceleration. First, a profiling analysis of the most time-consuming processes of the Reynolds-averaged Navier-Stokes flow solver on a three-dimensional unstructured mesh is performed. Then, the code's scalability is studied and new partitioning algorithms are tested to identify the most suitable ones for the selected applications. Finally, a feasibility study on the application of FPGAs and GPUs for the hardware acceleration of CFD simulations is presented.
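
    The profile-then-parallelize reasoning follows the usual Amdahl argument: the serial remainder of the solver bounds the achievable speedup no matter how many partitions are added, which is why profiling the most time-consuming routines comes first. A small worked illustration (the parallel fraction here is hypothetical, not a measurement from the paper):

        def amdahl_speedup(parallel_fraction, n_procs):
            """Upper bound on speedup when only part of the solver scales."""
            return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_procs)

        # If profiling showed 95% of solver time were parallelizable:
        for n in (16, 64, 256):
            print(n, round(amdahl_speedup(0.95, n), 1))
        # -> 16 procs: 9.1x, 64: 15.4x, 256: 18.6x (never beyond 20x)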

    PYDAC: A DISTRIBUTED RUNTIME SYSTEM AND PROGRAMMING MODEL FOR A HETEROGENEOUS MANY-CORE ARCHITECTURE

    Heterogeneous many-core architectures that consist of big, fast cores and small, energy-efficient cores are very promising for future high-performance computing (HPC) systems. These architectures offer a good balance between single-threaded performance and multithreaded throughput. Such systems impose challenges on the design of the programming model and runtime system. Specifically, these challenges include (a) how to fully utilize the chip's performance, (b) how to manage heterogeneous, unreliable hardware resources, and (c) how to generate and manage a large number of parallel tasks. This dissertation proposes and evaluates a Python-based programming framework called PyDac. PyDac supports a two-level programming model: at the high level, a programmer creates a very large number of tasks using the divide-and-conquer strategy; at the low level, tasks are written in an imperative programming style. The runtime system seamlessly manages the parallel tasks, system resilience, and inter-task communication with architectural support. PyDac has been implemented both on a field-programmable gate array (FPGA) emulation of an unconventional heterogeneous architecture and on a conventional multicore microprocessor. To evaluate the performance, resilience, and programmability of the proposed system, several micro-benchmarks were developed. We found that (a) PyDac abstracts away task communication and achieves programmability, (b) the micro-benchmarks are scalable on the hardware prototype, but (predictably) serial operation limits some micro-benchmarks, and (c) the degree of protection versus speed can be varied in redundant threading that is transparent to programmers.
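
    PyDac's own API is not shown in the abstract, so as a stand-in the two-level model can be sketched with Python's standard concurrent.futures: the high level recursively divides the problem into many small tasks, and the low level runs each leaf as an ordinary serial kernel. The function names and cutoff are illustrative only; PyDac additionally handles resilience and inter-task communication, which this sketch omits.

        from concurrent.futures import ThreadPoolExecutor

        def split(data, cutoff):
            """Divide phase: recursively split until chunks are small."""
            if len(data) <= cutoff:
                return [data]
            mid = len(data) // 2
            return split(data[:mid], cutoff) + split(data[mid:], cutoff)

        def dac_sum(data, cutoff=1000):
            chunks = split(data, cutoff)           # high level: many tasks
            with ThreadPoolExecutor() as pool:
                partials = pool.map(sum, chunks)   # low level: serial kernels
            return sum(partials)                   # conquer: combine results

        print(dac_sum(list(range(100_000))))       # -> 4999950000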

    Radiation-induced Effects on DMA Data Transfer in Reconfigurable Devices

    As the adoption of SRAM-based FPGAs and reconfigurable SoCs for high-performance computing has increased in recent years, the use of Direct Memory Access (DMA) for data transfer has become a key feature of many reconfigurable applications, including in the space industry. For such applications, radiation-induced effects are a serious issue that undermines the correctness and success of mission-critical tasks. In this paper, we evaluate the effects of proton-induced errors on a DMA-based application implemented on a Xilinx Zynq-7020 FPGA in order to quantify the robustness of this module in a typical hardware-accelerated configuration. The obtained results confirm the high criticality of the DMA module on programmable logic. Moreover, the multiple-bit upset (MBU) effect has been evaluated, and the most recurring patterns are reported in order to provide further tools to better characterize the behavior of these systems in future fault injection campaigns, as demonstrated in the experimental results.
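
    The paper's results come from proton irradiation, but the reported upset patterns can be mimicked in software when pre-screening fault-injection campaigns. The sketch below flips bits in a DMA-style buffer, occasionally flipping an adjacent bit to emulate a multiple-bit upset; the event count and MBU probability are invented parameters, not measurements from the paper.

        import random

        def inject_upsets(buf, n_events, mbu_prob=0.2):
            """Flip random bits (SEUs); sometimes also flip the adjacent
            bit to mimic a common multiple-bit upset (MBU) pattern."""
            for _ in range(n_events):
                bit = random.randrange(len(buf) * 8)
                buf[bit // 8] ^= 1 << (bit % 8)        # single-event upset
                if random.random() < mbu_prob and bit + 1 < len(buf) * 8:
                    nxt = bit + 1                      # adjacent-bit MBU
                    buf[nxt // 8] ^= 1 << (nxt % 8)

        data = bytearray(range(64))
        golden = bytes(data)
        inject_upsets(data, n_events=4)
        flipped = sum(bin(a ^ b).count('1') for a, b in zip(golden, data))
        print(f"{flipped} flipped bits")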