4 research outputs found

    Considerations in using OpenCL on GPUs and FPGAs for throughput-oriented genomics workloads

    The recent upsurge in the available amount of health data and the advances in next-generation sequencing are setting the ground for the long-awaited precision medicine. To process this deluge of data, bioinformatics workloads are becoming more complex and more computationally demanding. For this reason they have been extended to support different computing architectures, such as GPUs and FPGAs, to leverage the form of parallelism typical of each architecture. The paper describes how a genomic workload such as k-mer frequency counting that takes advantage of a GPU can be offloaded to one or more FPGAs. Moreover, it performs a comprehensive analysis of the FPGA acceleration, comparing its performance to that of a non-accelerated configuration and of a GPU. Lastly, the paper focuses on how, when using accelerators with a throughput-oriented workload, one should take into consideration both kernel execution time and how well each accelerator board overlaps kernels and PCIe transfers. Results show that acceleration with two FPGAs can improve both time- and energy-to-solution for the entire accelerated part by a factor of 1.32x. By contrast, acceleration with one GPU delivers an improvement of 1.77x in time-to-solution but a lower 1.49x in energy-to-solution due to persistently higher power consumption. The paper also evaluates how future FPGA boards with components (i.e., off-chip memory and PCIe) on par with those of the GPU board could provide an energy-efficient alternative to GPUs. Peer Reviewed. Postprint (published version).

    Empowering parallel computing with field programmable gate arrays

    After more than 30 years, reconfigurable computing has grown from a concept to a mature field of science and technology. The cornerstone of this evolution is the field programmable gate array, a building block enabling the configuration of a custom hardware architecture. The departure from static von Neumann-like architectures opens the way to eliminate the instruction overhead and to optimize the execution speed and power consumption. FPGAs now live in a growing ecosystem of development tools, enabling software programmers to map algorithms directly onto hardware. Applications abound in many directions, including data centers, IoT, AI, image processing and space exploration. The increasing success of FPGAs is largely due to an improved toolchain with solid high-level synthesis support as well as a better integration with processor and memory systems. On the other hand, long compile times and complex design exploration remain areas for improvement. In this paper we address the evolution of FPGAs towards advanced multi-functional accelerators, discuss different programming models and their HLS language implementations, as well as high-performance tuning of FPGAs integrated into a heterogeneous platform. We pinpoint fallacies and pitfalls, and identify opportunities for language enhancements and architectural refinements.

    A highly parameterizable framework for Conditional Restricted Boltzmann Machine based workloads accelerated with FPGAs and OpenCL

    © 2020 Elsevier. This manuscript version is made available under the CC-BY-NC-ND 4.0 license http://creativecommons.org/licenses/by-nc-nd/4.0/ Conditional Restricted Boltzmann Machine (CRBM) is a promising candidate for multidimensional system modeling that can learn a probability distribution over a set of data. It is a specific type of artificial neural network with one input (visible) and one output (hidden) layer. Recently published works demonstrate that CRBM is a suitable mechanism for modeling multidimensional time series such as human motion, workload characterization, and city traffic analysis. The process of learning and inference of these systems relies on linear algebra functions like matrix–matrix multiplication, and for larger data sets, they are very compute-intensive. In this paper, we present a configurable framework for CRBM-based workloads for arbitrarily large models. We show how to accelerate the learning process of CRBM with FPGAs and OpenCL, and we conduct an extensive scalability study for different model sizes and system configurations. We show significant improvement in performance/Watt for large models and batch sizes (from 1.51x up to 5.71x depending on the host configuration) when we use FPGA and OpenCL for the acceleration, and limited benefits for small models compared to the state-of-the-art CPU solution. This work was supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 639595); the Ministry of Economy of Spain under contract TIN2015-65316-P and Generalitat de Catalunya, Spain under contract 2014SGR1051; the ICREA, Spain Academia program; the BSC-CNS Severo Ochoa program, Spain (SEV-2015-0493) and Intel Corporation, United States. Peer Reviewed. Postprint (published version).

    Considerations in using OpenCL on GPUs and FPGAs for throughput-oriented genomics workloads

    The recent upsurge in the available amount of health data and the advances in next-generation sequencing are setting the ground for the long-awaited precision medicine. To process this deluge of data, bioinformatics workloads are becoming more complex and more computationally demanding. For this reason they have been extended to support different computing architectures, such as GPUs and FPGAs, to leverage the form of parallelism typical of each architecture. The paper describes how a genomic workload such as k-mer frequency counting that takes advantage of a GPU can be offloaded to one or more FPGAs. Moreover, it performs a comprehensive analysis of the FPGA acceleration, comparing its performance to that of a non-accelerated configuration and of a GPU. Lastly, the paper focuses on how, when using accelerators with a throughput-oriented workload, one should take into consideration both kernel execution time and how well each accelerator board overlaps kernels and PCIe transfers. Results show that acceleration with two FPGAs can improve both time- and energy-to-solution for the entire accelerated part by a factor of 1.32x. By contrast, acceleration with one GPU delivers an improvement of 1.77x in time-to-solution but a lower 1.49x in energy-to-solution due to persistently higher power consumption. The paper also evaluates how future FPGA boards with components (i.e., off-chip memory and PCIe) on par with those of the GPU board could provide an energy-efficient alternative to GPUs. This work was supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 639595); the Ministry of Economy of Spain under contract TIN2015-65316-P and Generalitat de Catalunya, Spain under contract 2014SGR1051; the ICREA, Spain Academia program; and the BSC-CNS Severo Ochoa, Spain program (SEV-2015-0493). Peer Reviewed.