
    Enabling dynamic and intelligent workflows for HPC, data analytics, and AI convergence

    The evolution of High-Performance Computing (HPC) platforms enables the design and execution of progressively larger and more complex workflow applications in these systems. The complexity comes not only from the number of elements that compose the workflows but also from the type of computations they perform. While traditional HPC workflows target simulations and modelling of physical phenomena, current needs additionally require data analytics (DA) and artificial intelligence (AI) tasks. However, the development of these workflows is hampered by the lack of proper programming models and environments that support the integration of HPC, DA, and AI, as well as the lack of tools to easily deploy and execute the workflows in HPC systems. To progress in this direction, this paper presents use cases where complex workflows are required and investigates the main issues to be addressed for the HPC/DA/AI convergence. Based on this study, the paper identifies the challenges of a new workflow platform to manage complex workflows. Finally, it proposes a development approach for such a workflow platform addressing these challenges in two directions: first, by defining a software stack that provides the functionalities to manage these complex workflows; and second, by proposing the HPC Workflow as a Service (HPCWaaS) paradigm, which leverages the software stack to facilitate the reusability of complex workflows in federated HPC infrastructures. Proposals presented in this work are subject to study and development as part of the EuroHPC eFlows4HPC project. This work has received funding from the European High-Performance Computing Joint Undertaking (JU) under grant agreement No 955558. The JU receives support from the European Union’s Horizon 2020 research and innovation programme and Spain, Germany, France, Italy, Poland, Switzerland and Norway.
In Spain, it has received complementary funding from MCIN/AEI/10.13039/501100011033, Spain and the European Union NextGenerationEU/PRTR (contracts PCI2021-121957, PCI2021-121931, PCI2021-121944, and PCI2021-121927). In Germany, it has received complementary funding from the German Federal Ministry of Education and Research (contracts 16HPC016K, 6GPC016K, 16HPC017 and 16HPC018). In France, it has received financial support from Caisse des dépôts et consignations (CDC) under the action PIA ADEIP (project Calculateurs). In Italy, it has been preliminarily approved for complementary funding by Ministero dello Sviluppo Economico (MiSE) (ref. project prop. 2659). In Norway, it has received complementary funding from the Norwegian Research Council, Norway under project number 323825. In Switzerland, it has been preliminarily approved for complementary funding by the State Secretariat for Education, Research, and Innovation (SERI), Switzerland. In Poland, it is partially supported by the National Centre for Research and Development under decision DWM/EuroHPCJU/4/2021. The authors also acknowledge financial support by MCIN/AEI/10.13039/501100011033, Spain through the “Severo Ochoa Programme for Centres of Excellence in R&D” under Grant CEX2018-000797-S, the Spanish Government, Spain (contract PID2019-107255 GB) and by Generalitat de Catalunya, Spain (contract 2017-SGR-01414). Anna Queralt is a Serra Húnter Fellow. With funding from the Spanish government through the ‘Severo Ochoa Centre of Excellence’ accreditation (CEX2018-000797-S).

    Performance Observability and Monitoring of High Performance Computing with Microservices

    Traditionally, High Performance Computing (HPC) software has been built and deployed as bulk-synchronous, parallel executables based on the message-passing interface (MPI) programming model. The rise of data-oriented computing paradigms and an explosion in the variety of applications that need to be supported on HPC platforms have forced a rethink of the appropriate programming and execution models to integrate this new functionality. In situ workflows mark a paradigm shift in HPC software development methodologies, enabling a range of new applications, from user-level data services to machine learning (ML) workflows that run alongside traditional scientific simulations. By tracing the evolution of HPC software development over the past 30 years, this dissertation identifies the key elements and trends responsible for the emergence of coupled, distributed, in situ workflows. This dissertation's focus is on coupled in situ workflows involving composable, high-performance microservices. After outlining the motivation to enable performance observability of these services and why existing HPC performance tools and techniques cannot be applied in this context, this dissertation proposes a solution wherein a set of techniques gathers, analyzes, and orients performance data from different sources to generate observability. By leveraging microservice components initially designed to build high-performance data services, this dissertation demonstrates their broader applicability for building and deploying performance monitoring and visualization as services within an in situ workflow.
The results from this dissertation suggest that: (1) integration of performance data from different sources is vital to understanding the performance of service components, (2) the in situ (online) analysis of this performance data is needed to enable the adaptivity of distributed components and manage monitoring data volume, (3) statistical modeling combined with performance observations can help generate better service configurations, and (4) services are a promising architecture choice for deploying in situ performance monitoring and visualization functionality. This dissertation includes previously published and co-authored material and unpublished co-authored material.

    End-to-end eScience: integrating workflow, query, visualization, and provenance at an ocean observatory

    Journal Article. Data analysis tasks at an ocean observatory require integrative and domain-specialized use of database, workflow, and visualization systems. We describe a platform to support these tasks, developed as part of the cyberinfrastructure at the NSF Science and Technology Center for Coastal Margin Observation and Prediction, integrating a provenance-aware workflow system, 3D visualization, and a remote query engine for large-scale ocean circulation models. We show how these disparate tools complement each other and give examples of real scientific insights delivered by the integrated system. We conclude that data management solutions for eScience require this kind of holistic, integrative approach, explain how our approach may be generalized, and recommend a broader, application-oriented research agenda to explore relevant architectures.

    Towards optimising distributed data streaming graphs using parallel streams

    Modern scientific collaborations have opened up the opportunity of solving complex problems that involve multi-disciplinary expertise and large-scale computational experiments. These experiments usually involve large amounts of data that are located in distributed data repositories running various software systems, and managed by different organisations. A common strategy to make the experiments more manageable is executing the processing steps as a workflow. In this paper, we look into the implementation of fine-grained data-flow between computational elements in a scientific workflow as streams. We model the distributed computation as a directed acyclic graph where the nodes represent the processing elements that incrementally implement specific subtasks. The processing elements are connected in a pipelined streaming manner, which allows task executions to overlap. We further optimise the execution by splitting pipelines across processes and by introducing extra parallel streams. We identify performance metrics and design a measurement tool to evaluate each enactment. We conducted experiments to evaluate our optimisation strategies with a real-world problem in the Life Sciences, EURExpress-II. The paper presents our distributed data-handling model, the optimisation and instrumentation strategies and the evaluation experiments. We demonstrate linear speed-up and argue that this use of data-streaming to enable both overlapped pipeline and parallelised enactment is a generally applicable optimisation strategy.

    Advanced Simulation and Computing FY12-13 Implementation Plan, Volume 2, Revision 0.5

    High Energy Physics Forum for Computational Excellence: Working Group Reports (I. Applications Software II. Software Libraries and Tools III. Systems)

    Computing plays an essential role in all aspects of high energy physics. As computational technology evolves rapidly in new directions, and data throughput and volume continue to follow a steep trend-line, it is important for the HEP community to develop an effective response to a series of expected challenges. In order to help shape the desired response, the HEP Forum for Computational Excellence (HEP-FCE) initiated a roadmap planning activity with two key overlapping drivers: 1) software effectiveness, and 2) infrastructure and expertise advancement. The HEP-FCE formed three working groups, 1) Applications Software, 2) Software Libraries and Tools, and 3) Systems (including systems software), to provide an overview of the current status of HEP computing and to present findings and opportunities for the desired HEP computational roadmap. The final versions of the reports are combined in this document, and are presented along with introductory material. Comment: 72 pages.

    Coupling streaming AI and HPC ensembles to achieve 100-1000x faster biomolecular simulations

    Machine learning (ML)-based steering can improve the performance of ensemble-based simulations by allowing for online selection of more scientifically meaningful computations. We present DeepDriveMD, a framework for ML-driven steering of scientific simulations that we have used to achieve orders-of-magnitude improvements in molecular dynamics (MD) performance via effective coupling of ML and HPC on large parallel computers. We discuss the design of DeepDriveMD and characterize its performance. We demonstrate that DeepDriveMD can achieve 100-1000x acceleration for protein folding simulations relative to other methods, as measured by the amount of simulated time performed, while covering the same conformational landscape as quantified by the states sampled during a simulation. Experiments are performed on leadership-class platforms on up to 1020 nodes. The results establish DeepDriveMD as a high-performance framework for ML-driven HPC simulation scenarios that supports diverse MD simulation and ML back-ends and enables new scientific insights by improving the length and time scales accessible with current computing capacity.

    Purdue Contribution of Fusion Simulation Program

    The overall science goal of the FSP is to develop predictive simulation capability for magnetically confined fusion plasmas at an unprecedented level of integration and fidelity. This will directly support and enable effective U.S. participation in research related to the International Thermonuclear Experimental Reactor (ITER) and the overall mission of delivering practical fusion energy. The FSP will address a rich set of scientific issues together with experimental programs, producing validated integrated physics results. This is very well aligned with the mission of the ITER Organization to coordinate with its members the integrated modeling and control of fusion plasmas, including benchmarking and validation activities [1]. Initial FSP research will focus on two critical areas: 1) the plasma edge and 2) whole device modeling including disruption avoidance. The first of these problems involves the narrow plasma boundary layer and its complex interactions with the plasma core and the surrounding material wall. The second requires development of a computationally tractable but comprehensive model that describes all equilibrium and dynamic processes at a sufficient level of detail to provide useful prediction of the temporal evolution of fusion plasma experiments. The initial driver for the whole device model (WDM) will be prediction and avoidance of discharge-terminating disruptions, especially at high performance, which are a critical impediment to successful operation of machines like ITER. If disruptions prove unavoidable, their associated dynamics and effects will be addressed in the next phase of the FSP. The FSP plan targets the needed modeling capabilities by developing Integrated Science Applications (ISAs) specific to their needs.
The Pedestal-Boundary model will include boundary magnetic topology, cross-field transport of multi-species plasmas, parallel plasma transport, neutral transport, atomic physics and interactions with the plasma wall. It will address the origins and structure of the plasma electric field, rotation, the L-H transition, and the wide variety of pedestal relaxation mechanisms. The Whole Device Model will predict the entire discharge evolution given external actuators (i.e., magnets, power supplies, heating, current drive and fueling systems) and control strategies. Based on components operating over a range of physics fidelity, the WDM will model the plasma equilibrium, plasma sources, profile evolution, linear stability and nonlinear evolution toward a disruption (but not the full disruption dynamics). The plan assumes that, as the FSP matures and demonstrates success, the program will evolve and grow, enabling additional science problems to be addressed. The next set of integration opportunities could include: 1) simulation of disruption dynamics and their effects; 2) prediction of core profile including 3D effects, mesoscale dynamics and integration with the edge plasma; 3) computation of non-thermal particle distributions, self-consistent with fusion, radio frequency (RF) and neutral beam injection (NBI) sources, magnetohydrodynamics (MHD) and short-wavelength turbulence.

    A Fortran Kernel Generation Framework for Scientific Legacy Code

    Quality assurance procedures are very important for software development. The complexity of modules and structure in software impedes testing and further development. For complex and poorly designed scientific software, module developers and software testers must expend considerable extra effort to monitor the impacts of unrelated modules and to test the whole system's constraints. In addition, widely used benchmarks cannot give programmers accurate, program-specific evaluations of system performance. In this situation, generated kernels can provide considerable insight into better performance tuning. Therefore, in order to greatly improve the productivity of various scientific software engineering tasks such as performance tuning, debugging, and verification of simulation results, we developed an automatic compute kernel extraction prototype platform for complex legacy scientific code. In addition, considering that scientific research and experiments require long-running simulations and very large data transfers, we apply message-passing-based parallelization and I/O behavior optimization to substantially improve the performance of the kernel extractor framework, and then use profiling tools to guide parallel distribution. Abnormal event detection is another important aspect of scientific research; when dealing with huge observational datasets combined with simulation results, it becomes not only essential but also extremely difficult. In this dissertation, in order to detect both high-frequency and low-frequency events, we reconfigured this framework and equipped it with in-situ data transfer infrastructure. By combining signal-processing data preprocessing (decimation) with a machine learning detection model trained on the stream data, our framework can significantly decrease the amount of data that must be transferred for concurrent data analysis (between distributed computing CPU/GPU nodes).
Finally, the dissertation presents the implementation of the framework and a case study of the ACME Land Model (ALM) for demonstration. The results show that the generated compute kernel can be used at lower cost in performance tuning experiments and quality assurance, which include debugging legacy code, verification of simulation results through single-point and multiple-point tracking of variables, collaborating with compiler vendors, and generating custom benchmark tests.

    Modeling High-throughput Applications for in situ Analytics

    International audience. With the goal of performing exascale computing, the importance of I/O management becomes more and more critical to maintain system performance. While the computing capacities of machines are getting higher, the I/O capabilities of systems do not increase as fast. We are able to generate more data but unable to manage them efficiently due to variability of I/O performance. Limiting the requests to the Parallel File System (PFS) becomes necessary. To address this issue, new strategies are being developed such as online in situ analysis. The idea is to overcome the limitations of basic post-mortem data analysis where the data have to be stored on PFS first and processed later. There are several software solutions that allow users to specifically dedicate nodes for analysis of data and distribute the computation tasks over different sets of nodes. Thus far, they rely on manual resource partitioning and allocation of tasks (simulations, analysis) by the user. In this work, we propose a memory-constrained model for in situ analysis. We use this model to provide different scheduling policies that determine the number of resources that should be dedicated to analysis functions and schedule these functions efficiently. We evaluate them and show the importance of considering memory constraints in the model. Finally, we discuss the different challenges that have to be addressed in order to build automatic tools for in situ analytics.