176 research outputs found
Cooperative CPU, GPU, and FPGA heterogeneous execution with EngineCL
Heterogeneous systems are the core architecture of most of the high-performance computing nodes, due to their excellent performance and energy efficiency. However, a key challenge that remains is programmability, specifically, releasing the programmer from the burden of managing data and devices with different architectures. To this end, we extend EngineCL to support FPGA devices. Based on OpenCL, EngineCL is a high-level framework providing load balancing among devices. Our proposal fully integrates FPGAs into the framework, enabling effective cooperation between CPU, GPU, and FPGA. With command overlapping and judicious data management, our work improves performance by up to 96% compared with single-device execution and delivers energy-delay gains of up to 37%. In addition, adopting FPGAs does not require programmers to make big changes in their applications because the extensions do not modify the user-facing interface of EngineCL
Lightweight asynchronous scheduling in heterogeneous reconfigurable systems
The trend for heterogeneous embedded systems is the integration of accelerators and general-purpose CPU cores on the same die. In these integrated architectures, like the Zynq UltraScale+ board (CPU+FPGA) that we target in this work, hardware support for shared memory and low-overhead synchronization between the accelerator and the CPU cores make the case for exploring strategies that exploit a tight collaboration between the CPUs and the accelerator. In this paper we propose a novel lightweight scheduling strategy, FastFit, targeted to FPGA accelerators, and a new scheduler based on it, named MultiFastFit, which asynchronously tackles heterogeneous systems comprised of a variety of CPU cores and FPGA IPs. Our strategy significantly reduces the overhead to automatically compute the near-optimal chunksizes when compared to a previous state-of-the-art auto-tuned approach, which makes our approach more suitable for fine-grained applications. Additionally, our scheduler MultiFastFit has been designed to enable the efficient co-execution of work among compute devices in such a way that all the devices are busy while minimizing the load unbalance. Our approaches have been evaluated using four benchmarks carefully tuned for the low-power UltraScale+ platform. Our experiments demonstrate that the FastFit strategy always finds the near-optimal FPGA chunksize for any device configuration at a reasonable cost, even for fine-grained and irregular applications, and that heterogeneous CPU+FPGA co-executions that exploit all the compute devices are usually faster and more energy efficient than the CPU-only and FPGA-only executions. We have also compared MultiFastFit with other state-of-the-art scheduling strategies, finding that it outperforms other auto-tuned approach up to 2x and it achieves similar results to manually-tuned schedulers without requiring an offline search of the ideal CPU-FPGA partition or FPGA chunk granularity. © 2022 The Author
Hardware implementations of computer-generated holography: a review
Computer-generated holography (CGH) is a technique to generate holographic interference patterns. One of the major issues related to computer hologram generation is the massive computational power required. Hardware accelerators are used to accelerate this process. Previous publications targeting hardware platforms lack performance comparisons between different architectures and do not provide enough information for the evaluation of the suitability of recent hardware platforms for CGH algorithms. We aim to address these limitations and present a comprehensive review of CGH-related hardware implementations
Enabling Python to execute efficiently in heterogeneous distributed infrastructures with PyCOMPSs
Python has been adopted as programming language by a large number of scientific communities. Additionally to the easy programming interface, the large number of libraries and modules that have been made available by a large number of contributors, have taken this language to the top of the list of the most popular programming languages in scientific applications. However, one main drawback of Python is the lack of support for concurrency or parallelism. PyCOMPSs is a proved approach to support task-based parallelism in Python that enables applications to be executed in parallel in distributed computing platforms.
This paper presents PyCOMPSs and how it has been tailored to execute tasks in heterogeneous and multi-threaded environments. We present an approach to combine the task-level parallelism provided by PyCOMPSs with the thread-level parallelism provided by MKL. Performance and behavioral results in distributed computing heterogeneous clusters show the benefits and capabilities of PyCOMPSs in both HPC and Big Data infrastructures.Thiswork has been supported by the Spanish Government (SEV2015-0493), by the Spanish Ministry of Science and Innovation (contract
TIN2015-65316-P), by Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272). Javier Conejero postdoctoral contract is co-financed by the Ministry of Economy and Competitiveness
under Juan de la Cierva Formación postdoctoral fellowship number FJCI- 2015-24651. Cristian Ramon-Cortes predoctoral contract is financed by the Ministry of Economy and Competitiveness under the contract BES-2016-076791. This work is supported by the Intel-BSC Exascale Lab.
This work has been supported by the European Commission through the Horizon 2020 Research and Innovation program under
contract 687584 (TANGO project).Peer ReviewedPostprint (author's final draft
Executing linear algebra kernels in heterogeneous distributed infrastructures with PyCOMPSs
Python is a popular programming language due to the simplicity of its syntax, while still achieving a good performance even being an interpreted language. The adoption from multiple scientific communities has evolved in the emergence of a large number of libraries and modules, which has helped to put Python on the top of the list of the programming languages [1]. Task-based programming has been proposed in the recent years as an alternative parallel programming model. PyCOMPSs follows such approach for Python, and this paper presents its extensions to combine task-based parallelism and thread-level parallelism. Also, we present how PyCOMPSs has been adapted to support heterogeneous architectures, including Xeon Phi and GPUs. Results obtained with linear algebra benchmarks demonstrate that significant performance can be obtained with a few lines of Python.This work has been supported by the
Spanish Government (SEV2015-0493), by the Spanish Ministry of Science and Innovation (contract TIN2015-65316-P), by Generalitat de Catalunya (contracts 2014-SGR-1051 and
2014-SGR-1272). Javier Conejero postdoctoral contract is co-financed by the Ministry of Economy and Competitiveness
under Juan de la Cierva Formación postdoctoral fellowship number FJCI-2015-24651. Cristian Ramon-Cortes predoctoral contract is financed by the Ministry of Economy and Competitiveness under the contract BES-2016-076791. This work is
supported by the Intel-BSC Exascale Lab.
This work has been supported by the European Commission through the Horizon 2020 Research and Innovation program
under contract 687584 (TANGO project).Peer ReviewedPostprint (published version
Lightweight asynchronous scheduling in heterogeneous reconfigurable systems
The trend for heterogeneous embedded systems is the integration of accelerators and general-purpose CPU cores on the same die. In these integrated architectures, like the Zynq UltraScale+ board (CPU+FPGA) that we target in this work, hardware support for shared memory and low-overhead synchronization between the accelerator and the CPU cores make the case for exploring strategies that exploit a tight collaboration between the CPUs and the accelerator. In this paper we propose a novel lightweight scheduling strategy, FastFit, targeted to FPGA accelerators, and a new scheduler based on it, named MultiFastFit, which asynchronously tackles heterogeneous systems comprised of a variety of CPU cores and FPGA IPs. Our strategy significantly reduces the overhead to automatically compute the near-optimal chunksizes when compared to a previous state-of-the-art auto-tuned approach, which makes our approach more suitable for fine-grained applications. Additionally, our scheduler MultiFastFit has been designed to enable the efficient co-execution of work among compute devices in such a way that all the devices are busy while minimizing the load unbalance.
Our approaches have been evaluated using four benchmarks carefully tuned for the low-power UltraScale+ platform. Our experiments demonstrate that the FastFit strategy always finds the near-optimal FPGA chunksize for any device configuration at a reasonable cost, even for fine-grained and irregular applications, and that heterogeneous CPU+FPGA co-executions that exploit all the compute devices are usually faster and more energy efficient than the CPU-only and FPGA-only executions. We have also compared MultiFastFit with other state-of-the-art scheduling strategies, finding that it outperforms other auto-tuned approach up to 2x and it achieves similar results to manually-tuned schedulers without requiring an offline search of the ideal CPU-FPGA partition or FPGA chunk granularity.This work was partially supported by the Spanish projects PID2019-105396RB-I00, UMA18-FEDERJA-108, and UK EPSRC projects ENEAC (EP/N002539/1), HOPWARE (EP/V040863/1) and RS MINET (INF\R2\192044). Funding for open access charge: Universidad de Málaga / CBUA
Multi-core architectures with coarse-grained dynamically reconfigurable processors for broadband wireless access technologies
Broadband Wireless Access technologies have significant market potential, especially the
WiMAX protocol which can deliver data rates of tens of Mbps. Strong demand for high
performance WiMAX solutions is forcing designers to seek help from multi-core processors
that offer competitive advantages in terms of all performance metrics, such as speed, power
and area. Through the provision of a degree of flexibility similar to that of a DSP and
performance and power consumption advantages approaching that of an ASIC,
coarse-grained dynamically reconfigurable processors are proving to be strong candidates
for processing cores used in future high performance multi-core processor systems.
This thesis investigates multi-core architectures with a newly emerging dynamically
reconfigurable processor – RICA, targeting WiMAX physical layer applications. A novel
master-slave multi-core architecture is proposed, using RICA processing cores. A SystemC
based simulator, called MRPSIM, is devised to model this multi-core architecture. This
simulator provides fast simulation speed and timing accuracy, offers flexible architectural
options to configure the multi-core architecture, and enables the analysis and investigation
of multi-core architectures. Meanwhile a profiling-driven mapping methodology is
developed to partition the WiMAX application into multiple tasks as well as schedule and
map these tasks onto the multi-core architecture, aiming to reduce the overall system
execution time. Both the MRPSIM simulator and the mapping methodology are seamlessly
integrated with the existing RICA tool flow.
Based on the proposed master-slave multi-core architecture, a series of diverse
homogeneous and heterogeneous multi-core solutions are designed for different fixed
WiMAX physical layer profiles. Implemented in ANSI C and executed on the MRPSIM
simulator, these multi-core solutions contain different numbers of cores, combine various memory architectures and task partitioning schemes, and deliver high throughputs at
relatively low area costs. Meanwhile a design space exploration methodology is developed
to search the design space for multi-core systems to find suitable solutions under certain
system constraints. Finally, laying a foundation for future multithreading exploration on the
proposed multi-core architecture, this thesis investigates the porting of a real-time operating
system – Micro C/OS-II to a single RICA processor. A multitasking version of WiMAX is
implemented on a single RICA processor with the operating system support
- …