
    HDOT — An approach towards productive programming of hybrid applications

    The bulk synchronous parallel (BSP) communication model can hinder performance increases. This is due to the complexity of handling load imbalances, reducing the serialisation imposed by blocking communication patterns, overlapping communication with computation and, finally, dealing with increasing memory overheads. The MPI specification provides advanced features, such as non-blocking calls or shared memory, to mitigate some of these factors. However, applying these features efficiently usually requires significant changes to the application structure. Task parallel programming models are being developed as a means of mitigating these issues without requiring extensive changes to the application code. In this work, we present a methodology to develop hybrid applications based on tasks, called hierarchical domain over-decomposition with tasking (HDOT). This methodology overcomes most of the issues found in MPI-only and traditional hybrid MPI+OpenMP applications. By emphasising the reuse of data partition schemes from the process level and applying them at the task level, it enables a natural coexistence between MPI and shared-memory programming models. The proposed methodology shows promising results in terms of programmability and performance, measured on a set of applications.

    This work has been developed with the support of the European Union H2020 programme through the INTERTWinE project (agreement number 671602); the Severo Ochoa Program awarded by the Spanish Government (SEV-2015-0493); the Generalitat de Catalunya (contract 2017-SGR-1414); and the Spanish Ministry of Science and Innovation (TIN2015-65316-P, ComputaciĂłn de Altas Prestaciones VII). The authors gratefully acknowledge Dr. Arnaud Mura, CNRS researcher at Institut PPRIME in France, for the numerical tool CREAMS. Finally, the manuscript has greatly benefited from the precise comments of the reviewers.
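
    To make the HDOT idea more concrete, the following sketch shows the kind of structure it advocates: the per-rank subdomain is over-decomposed into blocks, and each block update becomes a task with data dependences, so the same partitioning logic serves both the MPI level and the task level. This is a minimal, hypothetical C/MPI/OpenMP fragment written for illustration; the block layout and the routines exchange_halo() and compute_block() are assumptions, not the paper's actual code.

        /* A minimal, hypothetical sketch (not the paper's code): each MPI rank
         * over-decomposes its subdomain into NB blocks and turns every block
         * update into an OpenMP task, reusing the partitioning logic that
         * already drives the MPI-level decomposition. */
        #include <mpi.h>

        #define NB    8                     /* blocks per rank (assumed)    */
        #define BLOCK 4096                  /* elements per block (assumed) */

        extern void exchange_halo(double *block, MPI_Comm comm);  /* hypothetical */
        extern void compute_block(double *block);                 /* hypothetical */

        void step(double blocks[NB][BLOCK], MPI_Comm comm)
        {
            #pragma omp parallel
            #pragma omp single
            for (int b = 0; b < NB; ++b) {
                /* One task per block; the dependence (on the block's first
                   element, used as a sentinel) keeps successive updates of the
                   same block ordered. Calling MPI inside tasks assumes a
                   task-aware setup (e.g. TAMPI) or MPI_THREAD_MULTIPLE.      */
                #pragma omp task depend(inout: blocks[b][0]) firstprivate(b)
                {
                    exchange_halo(blocks[b], comm);
                    compute_block(blocks[b]);
                }
            }
        }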

    Evaluating worksharing tasks on distributed environments

    © 2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

    Hybrid programming is a promising approach to exploit clusters of multicore systems. Our focus is on the combination of MPI and tasking. This hybrid approach combines the low latency and high throughput of MPI with the flexibility of tasking models and their inherent ability to handle load imbalance. However, combining tasking with standard MPI implementations can be a challenge. The Task-Aware MPI library (TAMPI) eases the development of applications combining tasking with MPI. TAMPI enables developers to overlap computation and communication phases by relying on the tasking data-flow execution model. Using this approach, the original computation that was distributed across many different MPI ranks is grouped together in fewer MPI ranks and split into several tasks per rank. Nevertheless, programmers must be careful with task granularity. Too fine-grained tasks introduce too much overhead, while too coarse-grained tasks lead to a lack of parallelism. An adequate granularity may not always exist, especially in distributed environments where the same amount of work is distributed among many more cores. Worksharing tasks are a recently proposed kind of task that internally leverages worksharing techniques. By doing so, a single worksharing task may run on several cores concurrently, while its task-management cost remains the same as that of a regular task. In this work, we study the combination of worksharing tasks and TAMPI on distributed environments using two well-known mini-apps: HPCCG and LULESH. Our results show significant improvements using worksharing tasks compared to regular tasks and to other state-of-the-art alternatives such as OpenMP worksharing.

    This project is supported by the European Union's Horizon 2020 Research and Innovation programme under grant agreements No. 754304 (DEEP-EST) and No. 823767 (PRACE), the Ministry of Economy of Spain through the Severo Ochoa Center of Excellence Program (SEV-2015-0493), the Spanish Ministry of Science and Innovation (contract PID2019-107255GB) and the Generalitat de Catalunya (2017-SGR1481). The work has been performed under the Project HPCEUROPA3 (INFRAIA-2016-1-730897), with the support of the EC Research Innovation Action under the H2020 Programme; in particular, the author gratefully acknowledges the support of Dr Mark Bull (EPCC) and the computer resources and technical support provided by EPCC.
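
    As an illustration of the worksharing-task idea, the sketch below contrasts regular tasks (one management cost per chunk) with a single worksharing task whose iteration space can be executed by several cores. It assumes OmpSs-2-style pragmas ("task for", region dependences and a chunksize clause); the exact spellings, the axpy kernel and all sizes are assumptions for illustration, not code from the paper.

        /* Illustrative only: contrasts regular tasks with a single worksharing
         * task, assuming OmpSs-2-style pragmas ("task for", region dependences,
         * chunksize). The axpy kernel and all sizes are arbitrary choices. */
        #define N 1000000L

        double x[N], y[N];
        double a = 2.0;

        void axpy_regular_tasks(void)
        {
            /* Regular tasks: every chunk pays its own task-management cost. */
            for (long i = 0; i < N; i += 10000) {
                long len = (i + 10000 < N) ? 10000 : N - i;
                #pragma oss task in(x[i;len]) inout(y[i;len]) firstprivate(i, len)
                for (long j = i; j < i + len; ++j)
                    y[j] += a * x[j];
            }
            #pragma oss taskwait
        }

        void axpy_worksharing_task(void)
        {
            /* One worksharing task: created and managed once, yet its iteration
               space may be executed collaboratively by several cores. */
            #pragma oss task for in(x[0;N]) inout(y[0;N]) chunksize(10000)
            for (long j = 0; j < N; ++j)
                y[j] += a * x[j];
            #pragma oss taskwait
        }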

    Integrating blocking and non-blocking MPI primitives with task-based programming models

    In this paper we present the Task-Aware MPI library (TAMPI), which integrates both blocking and non-blocking MPI primitives with task-based programming models. The TAMPI library leverages two new runtime APIs to improve both the programmability and the performance of hybrid applications. The first API allows pausing and resuming the execution of a task depending on external events. This API is used to improve the interoperability between blocking MPI communication primitives and tasks: when an MPI operation executed inside a task blocks, the running task is paused so that the runtime system can schedule a new task on the core that became idle. Once the blocked MPI operation completes, the paused task is placed back on the runtime system's ready queue, so it will eventually be scheduled and resume its execution. The second API defers the release of the dependencies associated with a task's completion until some external events are fulfilled. This API is composed of only two functions: one to bind external events to a running task, and another to notify the completion of previously bound external events. TAMPI leverages this API to bind non-blocking MPI operations to tasks, deferring the release of their task dependencies until both the task execution and all its bound MPI operations have completed. Our experiments reveal that the enhanced features of TAMPI not only simplify the development of hybrid MPI+OpenMP applications that use blocking or non-blocking MPI primitives, but also naturally overlap computation and communication phases, which improves application performance and scalability by removing artificial dependencies across communication tasks.

    This work has been developed with the support of the European Union H2020 Programme through both the INTERTWinE project (agreement no. 671602) and the Marie SkƂodowska-Curie grant (agreement no. 749516); the Spanish Ministry of Economy and Competitiveness through the Severo Ochoa Program (SEV-2015-0493); the Spanish Ministry of Science and Innovation (TIN2015-65316-P); and the Generalitat de Catalunya (2017-SGR1414).
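
    A minimal sketch of the usage pattern TAMPI targets is shown below: blocking MPI calls are placed inside tasks whose dependences order communication and computation, and the application requests an extended threading level at initialization so that a blocked task is paused instead of occupying its core. The TAMPI.h header and the MPI_TASK_MULTIPLE level follow the TAMPI distribution as we understand it and should be treated as assumptions, as should the halo-exchange example itself.

        /* Sketch of the blocking-mode pattern TAMPI supports: blocking MPI calls
         * live inside tasks, and dependences order communication and computation.
         * The TAMPI.h header and the MPI_TASK_MULTIPLE threading level are taken
         * from the TAMPI distribution as assumptions; the halo exchange itself is
         * a made-up example. */
        #include <mpi.h>
        #include <TAMPI.h>

        void halo_step(double *field, int left, int right, MPI_Comm comm)
        {
            #pragma omp parallel
            #pragma omp single
            {
                #pragma omp task depend(out: field[0])
                {
                    /* With TAMPI, a blocked receive pauses this task instead of
                       stalling the core, which is free to run other tasks.     */
                    MPI_Recv(&field[0], 1, MPI_DOUBLE, left, 0, comm, MPI_STATUS_IGNORE);
                }

                #pragma omp task depend(in: field[1])
                MPI_Send(&field[1], 1, MPI_DOUBLE, right, 0, comm);

                #pragma omp task depend(in: field[0])
                {
                    /* Consumer: released only after the receive task completes. */
                    /* ... compute with field[0] ... */
                }
            }
        }

        int main(int argc, char **argv)
        {
            int provided;
            /* TAMPI's blocking support is requested as an extended threading level. */
            MPI_Init_thread(&argc, &argv, MPI_TASK_MULTIPLE, &provided);
            /* ... allocate field, determine neighbours, call halo_step() ... */
            MPI_Finalize();
            return 0;
        }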

    Role-shifting threads: Increasing OpenMP malleability to address load imbalance at MPI and OpenMP

    This paper presents the evolution of the free agent threads for OpenMP into the new role-shifting threads model and their integration with the Dynamic Load Balancing (DLB) library. We demonstrate how free agent threads can improve resource utilization in OpenMP applications with load imbalance in their nested parallel regions. We also demonstrate how DLB efficiently manages the malleability exposed by the role-shifting threads to address load imbalance issues. We use three real-world scientific applications: one of them demonstrates that free agents alone can improve the OpenMP model without external tools, and the other two, MPI+OpenMP applications (one of them a coupled case), illustrate the potential of the free agent threads' malleability combined with an external resource manager to increase the efficiency of the system. In addition, we demonstrate that the new implementation is more usable than the former one, letting the runtime system automatically make decisions that previously had to be made by the programmer. All software is released open source.

    This work has received funding from the DEEP Projects, at the European Commission's FP7, H2020, and EuroHPC Programmes, under Grant Agreements 287530, 610476, 754304, and 955606, and from grant PCI2021-121958, financed by the Spanish State Research Agency - Ministry of Science and Innovation. It also has the support of the Spanish Ministry of Science and Innovation (ComputaciĂłn de Altas Prestaciones VIII: PID2019-107255GB).

    Challenges and opportunities for RISC-V architectures towards genomics-based workloads

    The use of large-scale supercomputing architectures is a hard requirement for Big-Data scientific computing applications. An example is genomics analytics, where millions of data transformations and tests per patient need to be performed to find relevant clinical indicators. Therefore, to ensure open and broad access to high-performance technologies, governments and academia are pushing toward the introduction of novel computing architectures in large-scale scientific environments. This is the case of RISC-V, an open-source and royalty-free instruction-set architecture. To evaluate such technologies, here we present the Variant-Interaction Analytics use case benchmarking suite and datasets. Through this use case, we search for possible genetic interactions using computational and statistical methods, providing a representative case of heavy ETL (Extract, Transform, Load) data processing. Current implementations run on x86-based supercomputers (e.g. MareNostrum-IV at the Barcelona Supercomputing Center (BSC)), and future steps propose RISC-V as part of the next MareNostrum generations. Here we describe the Variant Interaction use case, highlighting the characteristics that leverage high-performance computing and indicating the caveats and challenges for upcoming RISC-V developments and designs, based on a first comparison between x86 and RISC-V architectures running real Variant Interaction executions on real hardware implementations.

    This work has been partially financed by the European Commission (EU-HORIZON NEARDATA GA.101092644, VITAMIN-V GA.101093062) and the MEEP Project, which received funding from the European High-Performance Computing Joint Undertaking (JU) under grant agreement No 946002. The JU receives support from the European Union's Horizon 2020 research and innovation programme and Spain, Croatia and Turkey. It has also been supported by the Spanish Ministry of Science (MICINN) under scholarship BES-2017-081635, the Research State Agency (AEI) and European Regional Development Funds (ERDF/FEDER) under the DALEST grant agreement PID2021-126248OBI00, MCIN/AEI/10.13039/501100011033/FEDER and PID grant agreement PID2019-107255GB-C21, and the Generalitat de Catalunya (AGAUR) under grant agreements 2021-SGR-00478, 2021-SGR-01626 and "FSE Invertint en el teu futur".
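
    To give a flavour of the computational core of such a variant-interaction scan, the sketch below shows a hypothetical, heavily simplified C kernel: for one pair of variants it fills a 3x3 genotype contingency table and scores it with a chi-square statistic. In a real run this is repeated for millions of variant pairs per dataset, which is what stresses the memory and compute subsystems being compared. This is illustrative only and is not the VIA benchmark's actual implementation.

        /* Hypothetical, heavily simplified kernel of a pairwise variant-
         * interaction scan (not the VIA benchmark's code): genotypes are coded
         * 0/1/2, and one pair of variants is scored with a chi-square statistic
         * over its 3x3 contingency table. */
        #include <stddef.h>

        double pair_score(const unsigned char *gA, const unsigned char *gB, size_t n)
        {
            double obs[3][3] = {{0.0}};
            double rows[3] = {0.0}, cols[3] = {0.0};

            if (n == 0)
                return 0.0;

            for (size_t s = 0; s < n; ++s)          /* build the contingency table */
                obs[gA[s]][gB[s]] += 1.0;

            for (int r = 0; r < 3; ++r)
                for (int c = 0; c < 3; ++c) {
                    rows[r] += obs[r][c];
                    cols[c] += obs[r][c];
                }

            double chi2 = 0.0;
            for (int r = 0; r < 3; ++r)
                for (int c = 0; c < 3; ++c) {
                    double expected = rows[r] * cols[c] / (double)n;
                    if (expected > 0.0)
                        chi2 += (obs[r][c] - expected) * (obs[r][c] - expected) / expected;
                }
            return chi2;    /* higher score = stronger interaction candidate */
        }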

    J-PLUS: The Javalambre Photometric Local Universe Survey

    ABSTRACT: The Javalambre Photometric Local Universe Survey (J-PLUS) is an ongoing 12-band photometric optical survey, observing thousands of square degrees of the Northern Hemisphere from the dedicated JAST/T80 telescope at the Observatorio AstrofĂ­sico de Javalambre (OAJ). The T80Cam is a camera with a field of view of 2 deg2 mounted on a telescope with a diameter of 83 cm, and is equipped with a unique system of filters spanning the entire optical range (3500–10 000 Å). This filter system is a combination of broad-, medium-, and narrow-band filters, optimally designed to extract the rest-frame spectral features (the 3700–4000 Å Balmer break region, HÎŽ, Ca H+K, the G band, and the Mg b and Ca triplets) that are key to characterizing stellar types and delivering a low-resolution photospectrum for each pixel of the observed sky. With a typical depth of AB ∌ 21.25 mag per band, this filter set thus allows for an unbiased and accurate characterization of the stellar population in our Galaxy, provides unprecedented 2D photospectral information for all resolved galaxies in the local Universe, and yields accurate photo-z estimates (at the ÎŽz/(1 + z) ∌ 0.005–0.03 precision level) for moderately bright (up to r ∌ 20 mag) extragalactic sources. While some narrow-band filters are designed for the study of particular emission features ([O II] λ3727, Hα λ6563) up to z < 0.017, they also provide well-defined windows for the analysis of other emission lines at higher redshifts. As a result, J-PLUS has the potential to contribute to a wide range of fields in Astrophysics, both in the nearby Universe (Milky Way structure, globular clusters, 2D IFU-like studies, stellar populations of nearby and moderate-redshift galaxies, clusters of galaxies) and at high redshifts (emission-line galaxies at z ≈ 0.77, 2.2, and 4.4, quasi-stellar objects, etc.). With this paper, we release the first ∌1000 deg2 of J-PLUS data, containing about 4.3 million stars and 3.0 million galaxies at r < 21 mag. With a goal of 8500 deg2 for the total J-PLUS footprint, these numbers are expected to rise to about 35 million stars and 24 million galaxies by the end of the survey.

    Funding for the J-PLUS Project has been provided by the Governments of Spain and AragĂłn through the Fondo de Inversiones de Teruel, the Spanish Ministry of Economy and Competitiveness (MINECO; under grants AYA2017-86274-P, AYA2016-77846-P, AYA2016-77237-C3-1-P, AYA2015-66211-C2-1-P, AYA2015-66211-C2-2, AYA2012-30789, AGAUR grant SGR-661/2017, and ICTS-2009-14), and European FEDER funding (FCDD10-4E-867, FCDD13-4E-2685).

    OpenMP taskloop dependences

    Exascale systems will contain multicore/manycore processors with a high core count in each node. Therefore, using a model that relaxes synchronization, such as data-flow, is crucial to adequately exploit the potential of the hardware. The flexibility of the data-flow execution model relies on the dynamic management of data dependences among tasks. The OpenMP standard already provides a construct, known as taskloop, that distributes a loop's iteration space into several tasks, but this construct does not yet support the depend clause. In this paper we propose using the induction variable to define data dependences for the tasks created by the taskloop construct. By using the induction variable, each task carries its own dependences based on the partition of work it receives. We also aim to demonstrate that using taskloop with dependences improves programmability with respect to using stand-alone tasks to parallelize a loop. Our implementation does not introduce any significant overhead on the taskloop implementation and, in certain cases, it outperforms the stand-alone task version.

    This research has received funding from the European Union's Horizon 2020/EuroHPC research and innovation programme under grant agreement No 955606 (DEEP-SEA), and has the support of the Spanish Ministry of Science and Innovation (ComputaciĂłn de Altas Prestaciones VIII: PID2019-107255GB).
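
    The proposal can be pictured against the pattern it replaces. Today, per-chunk dependences for a blocked loop must be written with stand-alone tasks whose depend clauses are driven by the induction variable; the paper's extension would let the same dependences be attached directly to a taskloop. The sketch below shows the stand-alone-task version and, as a comment, a possible taskloop form; the proposed syntax shown is an assumption based on the description above, not the paper's exact specification.

        /* Today's workaround: stand-alone tasks whose dependences are driven by
         * the induction variable i. The kernel and grain size are illustrative. */
        void scale(const double *a, double *b, long n, long grain)
        {
            #pragma omp parallel
            #pragma omp single
            for (long i = 0; i < n; i += grain) {
                long len = (i + grain < n) ? grain : n - i;
                #pragma omp task depend(in: a[i:len]) depend(out: b[i:len]) firstprivate(i, len)
                for (long j = i; j < i + len; ++j)
                    b[j] = 2.0 * a[j];
            }
        }

        /* Proposed form (a sketch of the syntax, inferred from the description
         * above): the same per-chunk dependences attached directly to taskloop,
         * with the induction variable naming each task's region.
         *
         *     #pragma omp taskloop grainsize(grain) depend(in: a[i]) depend(out: b[i])
         *     for (long i = 0; i < n; ++i)
         *         b[i] = 2.0 * a[i];
         */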

    An OpenMP free agent threads implementation

    In this paper, we introduce a design and implementation of free agent threads for OpenMP. These threads increase the malleability of the OpenMP programming model, offering resource managers and runtime systems the flexibility to manage threads and resources efficiently. We demonstrate how free agent threads can address load imbalance problems at the OpenMP level and at the MPI level or above. To demonstrate this, we use two mini-apps extracted from two real HPC applications and representative of real-world codes. We conclude that more malleability in thread management is necessary, and that free agents can be regarded as a practical starting point for increasing it.

    This work has been done as part of the European Processor Initiative project. The European Processor Initiative (EPI) (FPA: 800928) has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement EPI-SGA1: 826647. It has also received funding from the European Union's Horizon 2020/EuroHPC research and innovation programme under grant agreement No 955606 (DEEP-SEA), and has the support of the Spanish Ministry of Science and Innovation (ComputaciĂłn de Altas Prestaciones VIII: PID2019-107255GB).
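
    The sketch below illustrates the kind of imbalance free agent threads target, using an assumed workload shape rather than one of the paper's mini-apps: a statically partitioned loop whose body creates tasks of very uneven cost. Threads of the team executing the region can help with the pending tasks once they hit the barrier, but cores outside the team (for example, cores handed over by an external resource manager) normally cannot; free agent threads are precisely the mechanism that lets such extra cores execute those tasks as well.

        /* Illustrative only (assumed workload shape, not one of the paper's
         * mini-apps): a team statically splits an imbalanced loop over domains,
         * and the loop body creates one task per cell. Free agent threads allow
         * cores that are not part of this team to execute the pending tasks. */
        extern void cell_update(int d, int i);      /* hypothetical kernel */

        void imbalanced_step(int ndom, const int *cells_per_dom)
        {
            #pragma omp parallel for schedule(static)
            for (int d = 0; d < ndom; ++d) {
                /* Domains differ widely in size, so the static split of the
                   outer loop leaves some threads idle long before others.   */
                for (int i = 0; i < cells_per_dom[d]; ++i) {
                    #pragma omp task firstprivate(d, i)
                    cell_update(d, i);
                }
            }
        }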

    Auto-tuned OpenCL kernel co-execution in OmpSs for heterogeneous systems

    This manuscript version is made available under the CC BY-NC-ND 4.0 license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

    The emergence of heterogeneous systems has been very notable recently. The nodes of the most powerful computers integrate several compute accelerators, such as GPUs. Profiting from such node configurations is not a trivial endeavour. OmpSs is a framework for task-based parallel applications that allows the execution of OpenCL kernels on different compute devices. However, it does not support the co-execution of a single kernel on several devices. This paper presents an extension of OmpSs that rises to this challenge, and presents Auto-Tune, a load balancing algorithm that automatically adjusts its internal parameters to suit the hardware capabilities and application behavior. The extension allows programmers to take full advantage of the computing devices with negligible impact on the code. It takes care of two main issues: first, the automatic distribution of datasets and the management of device memory address spaces; second, the implementation of a set of load balancing algorithms to adapt to the particularities of applications and systems. Experimental results reveal that the co-execution of single kernels on all the devices in the node is beneficial in terms of performance and energy consumption, and that Auto-Tune gives the best overall results.
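
    The heart of such co-execution support is deciding how much of a single kernel's index space each device receives. The sketch below shows one simple way to derive those shares from the throughput each device has achieved so far, which is the general idea behind adaptive schemes of this kind; it is an illustrative C fragment under assumed data structures, not the actual Auto-Tune algorithm or the OmpSs implementation.

        /* Illustrative proportional-split scheme (not the actual Auto-Tune
         * algorithm): each device's next share of the kernel's iteration space
         * is proportional to the throughput it has achieved so far. */
        #include <stddef.h>

        #define MAX_DEV 16                  /* device cap assumed for this sketch */

        typedef struct {
            size_t done_items;              /* work items completed so far        */
            double busy_seconds;            /* time spent computing them          */
        } device_stats_t;

        /* Split `total` work items among `ndev` (<= MAX_DEV) devices, writing
           each device's share to share[]. Devices with no history yet receive a
           neutral weight of 1. */
        void split_work(const device_stats_t *dev, int ndev, size_t total, size_t *share)
        {
            double rate[MAX_DEV];
            double sum = 0.0;

            for (int d = 0; d < ndev; ++d) {
                rate[d] = (dev[d].busy_seconds > 0.0)
                              ? (double)dev[d].done_items / dev[d].busy_seconds
                              : 1.0;
                sum += rate[d];
            }

            size_t assigned = 0;
            for (int d = 0; d < ndev; ++d) {
                share[d] = (size_t)((double)total * rate[d] / sum);
                assigned += share[d];
            }
            share[0] += total - assigned;   /* hand rounding leftovers to device 0 */
        }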