16 research outputs found

    Performance and energy footprint assessment of FPGAs and GPUs on HPC systems using Astrophysics application

    Full text link
    New challenges in Astronomy and Astrophysics (AA) are urging the need for a large number of exceptionally computationally intensive simulations. "Exascale" (and beyond) computational facilities are mandatory to address the size of theoretical problems and data coming from the new generation of observational facilities in AA. Currently, the High Performance Computing (HPC) sector is undergoing a profound phase of innovation, in which the primary challenge to the achievement of the "Exascale" is the power-consumption. The goal of this work is to give some insights about performance and energy footprint of contemporary architectures for a real astrophysical application in an HPC context. We use a state-of-the-art N-body application that we re-engineered and optimized to exploit the heterogeneous underlying hardware fully. We quantitatively evaluate the impact of computation on energy consumption when running on four different platforms. Two of them represent the current HPC systems (Intel-based and equipped with NVIDIA GPUs), one is a micro-cluster based on ARM-MPSoC, and one is a "prototype towards Exascale" equipped with ARM-MPSoCs tightly coupled with FPGAs. We investigate the behavior of the different devices where the high-end GPUs excel in terms of time-to-solution while MPSoC-FPGA systems outperform GPUs in power consumption. Our experience reveals that considering FPGAs for computationally intensive application seems very promising, as their performance is improving to meet the requirements of scientific applications. This work can be a reference for future platforms development for astrophysics applications where computationally intensive calculations are required.Comment: 15 pages, 4 figures, 3 tables; Preprint (V2) submitted to MDPI (Special Issue: Energy-Efficient Computing on Parallel Architectures

    Real-time cortical simulations: energy and interconnect scaling on distributed systems

    Full text link
    We profile the impact of computation and inter-processor communication on the energy consumption and on the scaling of cortical simulations approaching the real-time regime on distributed computing platforms. Also, the speed and energy consumption of processor architectures typical of standard HPC and embedded platforms are compared. We demonstrate the importance of the design of low-latency interconnect for speed and energy consumption. The cost of cortical simulations is quantified using the Joule per synaptic event metric on both architectures. Reaching efficient real-time on large scale cortical simulations is of increasing relevance for both future bio-inspired artificial intelligence applications and for understanding the cognitive functions of the brain, a scientific quest that will require to embed large scale simulations into highly complex virtual or real worlds. This work stands at the crossroads between the WaveScalES experiment in the Human Brain Project (HBP), which includes the objective of large scale thalamo-cortical simulations of brain states and their transitions, and the ExaNeSt and EuroExa projects, that investigate the design of an ARM-based, low-power High Performance Computing (HPC) architecture with a dedicated interconnect scalable to million of cores; simulation of deep sleep Slow Wave Activity (SWA) and Asynchronous aWake (AW) regimes expressed by thalamo-cortical models are among their benchmarks.Comment: 8 pages, 8 figures, 4 tables, submitted after final publication on PDP2019 proceedings, corrected final DOI. arXiv admin note: text overlap with arXiv:1812.04974, arXiv:1804.0344

    Tasking in accelerators: performance evaluation

    Get PDF
    © 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes,creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.In this work, we analyze the implications and results of implementing dynamic parallelism, concurrent kernels and CUDA Graphs to solve task-oriented problems. As a benchmark we propose three different methods for solving DGEMM operation on tiled-matrices; which might be the most popular benchmark for performance analysis. For the algorithms that we study, we present significant differences in terms of data dependencies, synchronization and granularity. The main contribution of this work is determining which of the previous approaches work better for having multiple task running concurrently in a single GPU, as well as stating the main limitations and benefits of every technique. Using dynamic parallelism and CUDA Streams we were able to achieve up to 30% speedups and for CUDA Graph API up to 25x acceleration outperforming state of the art results.This project has received funding from the EPEEC project from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 801051, from the Spanish Ministry of Economy and Competitiveness under the project Computación de Altas Prestaciones VII ( TIN2015-65316-P ) and the Departament d’Innovació, Universitats i Empresa de la Generalitat de Catalunya, under project MPEXPAR: Models de Pro-gramació i Entorns d’Execució Paral·lels (2014-SGR-1051 ). Finally, this project also received funding from the Spanish Ministry of Economy and Competitiveness under the Juan de la Cierva Grant Agreement No IJCI-2017-33511 , and from the European Union’s Horizon 2020 research and innovation program under the Marie Sklodowska Curie grant agreement No. 749516 .Peer ReviewedPostprint (author's final draft

    Towards an auto-tuned and task-based SpMV (LASs Library)

    Get PDF
    We present a novel approach to parallelize the SpMV kernel included in LASs (Linear Algebra routines on OmpSs) library, after a deep review and analysis of several well-known approaches. LASs is based on OmpSs, a task-based runtime that extends OpenMP directives, providing more flexibility to apply new strategies. Based on tasking and nesting, with the aim of improving the workload imbalance inherent to the SpMV operation, we present a strategy especially useful for highly imbalanced input matrices. In this approach, the number of created tasks is dynamically decided in order to maximize the use of the resources of the platform. Throughout this paper, SpMV behavior depending on the selected strategy (state of the art and proposed strategies) is deeply analyzed, setting in this way the base for a future auto-tunable code that is able to select the most suitable approach depending on the input matrix. The experiments of this work were carried out for a set of 12 matrices from the Suite Sparse Matrix Collection, all of them with different characteristics regarding their sparsity. The experiments of this work were performed on a node of Marenostrum 4 supercomputer (with two sockets Intel Xeon, 24 cores each) and on a node of Dibona cluster (using one ARM ThunderX2 socket with 32 cores). Our tests show that, for Intel Xeon, the best parallelization strategy reduces the execution time of the reference MKL multi-threaded version up to 67%. On ARM ThunderX2, the reduction is up to 56% with respect to the OmpSs parallel reference.This project has received funding from the Spanish Ministry of Economy and Competitiveness under the project Computación de Altas Prestaciones VII (TIN2015- 65316-P), the Departament d’Innovació, Universitats i Empresa de la Generalitat de Catalunya, under project MPEXPAR: Models de Programació i Entorns d’Execució Parallels (2014-SGR-1051), and the Juan de la Cierva Grant Agreement No IJCI-2017- 33511, and the Spanish Ministry of Science and Innovation under the project Heterogeneidad y especialización en la era post-Moore (RTI2018-093684-BI00). We also acknowledge the funding provided by Fujitsu under the BSC-Fujitsu joint project: Math Libraries Migration and OptimizationPeer ReviewedPostprint (author's final draft

    Deep Learning at Scale with Nearest Neighbours Communications

    Get PDF
    As deep learning techniques become more and more popular, there is the need to move these applications from the data scientist’s Jupyter notebook to efficient and reliable enterprise solutions. Moreover, distributed training of deep learning models will happen more and more outside the well-known borders of cloud and HPC infrastructure and will move to edge and mobile platforms. Current techniques for distributed deep learning have drawbacks in both these scenarios, limiting their long-term applicability. After a critical review of the established techniques for Data Parallel training from both a distributed computing and deep learning perspective, a novel approach based on nearest-neighbour communications is presented in order to overcome some of the issues related to mainstream approaches, such as global communication patterns. Moreover, in order to validate the proposed strategy, the Flexible Asynchronous Scalable Training (FAST) framework is introduced, which allows to apply the nearest-neighbours communications approach to a deep learning framework of choice. Finally, a relevant use-case is deployed on a medium-scale infrastructure to demonstrate both the framework and the methodology presented. Training convergence and scalability results are presented and discussed in comparison to a baseline defined by using state-of-the-art distributed training tools provided by a well-known deep learning framework

    Translating Timing into an Architecture: The Synergy of COTSon and HLS (Domain Expertise: Designing a Computer Architecture via HLS)

    Get PDF
    Translating a system requirement into a low-level representation (e.g., register transfer level or RTL) is the typical goal of the design of FPGA-based systems. However, the Design Space Exploration (DSE) needed to identify the final architecture may be time consuming, even when using high-level synthesis (HLS) tools. In this article, we illustrate our hybrid methodology, which uses a frontend for HLS so that the DSE is performed more rapidly by using a higher level abstraction, but without losing accuracy, thanks to the HP-Labs COTSon simulation infrastructure in combination with our DSE tools (MYDSE tools). In particular, this proposed methodology proved useful to achieve an appropriate design of a whole system in a shorter time than trying to design everything directly in HLS. Our motivating problem was to deploy a novel execution model called data-flow threads (DF-Threads) running on yet-to-be-designed hardware. For that goal, directly using the HLS was too premature in the design cycle. Therefore, a key point of our methodology consists in defining the first prototype in our simulation framework and gradually migrating the design into the Xilinx HLS after validating the key performance metrics of our novel system in the simulator. To explain this workflow, we first use a simple driving example consisting in the modelling of a two-way associative cache. Then, we explain how we generalized this methodology and describe the types of results that we were able to analyze in the AXIOM project, which helped us reduce the development time from months/weeks to days/hours

    Workflow models for heterogeneous distributed systems

    Get PDF
    The role of data in modern scientific workflows becomes more and more crucial. The unprecedented amount of data available in the digital era, combined with the recent advancements in Machine Learning and High-Performance Computing (HPC), let computers surpass human performances in a wide range of fields, such as Computer Vision, Natural Language Processing and Bioinformatics. However, a solid data management strategy becomes crucial for key aspects like performance optimisation, privacy preservation and security. Most modern programming paradigms for Big Data analysis adhere to the principle of data locality: moving computation closer to the data to remove transfer-related overheads and risks. Still, there are scenarios in which it is worth, or even unavoidable, to transfer data between different steps of a complex workflow. The contribution of this dissertation is twofold. First, it defines a novel methodology for distributed modular applications, allowing topology-aware scheduling and data management while separating business logic, data dependencies, parallel patterns and execution environments. In addition, it introduces computational notebooks as a high-level and user-friendly interface to this new kind of workflow, aiming to flatten the learning curve and improve the adoption of such methodology. Each of these contributions is accompanied by a full-fledged, Open Source implementation, which has been used for evaluation purposes and allows the interested reader to experience the related methodology first-hand. The validity of the proposed approaches has been demonstrated on a total of five real scientific applications in the domains of Deep Learning, Bioinformatics and Molecular Dynamics Simulation, executing them on large-scale mixed cloud-High-Performance Computing (HPC) infrastructures

    Vitis OneGenE: a causality-based approach to generate gene networks in Vitis vinifera sheds light on the laccase and dirigent gene families

    Get PDF
    9openInternationalBothThe abundance of transcriptomic data and the development of causal inference methods have paved the way for gene network analyses in grapevine. Vitis OneGenE is a transcriptomic data mining tool that finds direct correlations between genes, thus producing association networks. As a proof of concept, the stilbene synthase gene regulatory network obtained with OneGenE has been compared with published co-expression analysis and experimental data, including cistrome data for MYB stilbenoid regulators. As a case study, the two secondary metabolism pathways of stilbenoids and lignin synthesis were explored. Several isoforms of laccase, peroxidase, and dirigent protein genes, putatively involved in the final oxidative oligomerization steps, were identified as specifically belonging to either one of these pathways. Manual curation of the predicted sequences exploiting the last available genome assembly, and the integration of phylogenetic and OneGenE analyses, identified a group of laccases exclusively present in grapevine and related to stilbenoids. Here we show how network analysis by OneGenE can accelerate knowledge discovery by suggesting new candidates for functional characterization and application in breeding programs.openPilati, Stefania; Malacarne, Giulia; Navarro-Payá, David; Tomè, Gabriele; Riscica, Laura; Cavecchia, Valter; Matus, José Tomás; Moser, Claudio; Blanzieri, EnricoPilati, S.; Malacarne, G.; Navarro-Payá, D.; Tomè, G.; Riscica, L.; Cavecchia, V.; Matus, J.T.; Moser, C.; Blanzieri, E

    High-Performance Modelling and Simulation for Big Data Applications

    Get PDF
    This open access book was prepared as a Final Publication of the COST Action IC1406 “High-Performance Modelling and Simulation for Big Data Applications (cHiPSet)“ project. Long considered important pillars of the scientific method, Modelling and Simulation have evolved from traditional discrete numerical methods to complex data-intensive continuous analytical optimisations. Resolution, scale, and accuracy have become essential to predict and analyse natural and complex systems in science and engineering. When their level of abstraction raises to have a better discernment of the domain at hand, their representation gets increasingly demanding for computational and data resources. On the other hand, High Performance Computing typically entails the effective use of parallel and distributed processing units coupled with efficient storage, communication and visualisation systems to underpin complex data-intensive applications in distinct scientific and technical domains. It is then arguably required to have a seamless interaction of High Performance Computing with Modelling and Simulation in order to store, compute, analyse, and visualise large data sets in science and engineering. Funded by the European Commission, cHiPSet has provided a dynamic trans-European forum for their members and distinguished guests to openly discuss novel perspectives and topics of interests for these two communities. This cHiPSet compendium presents a set of selected case studies related to healthcare, biological data, computational advertising, multimedia, finance, bioinformatics, and telecommunications
    corecore