Search CORE

496 research outputs found

Towards Lightweight Data Integration using Multi-workflow Provenance and Data Observability

Author: da Silva Rafael Ferreira
Skluzacek Tyler J.
Souza Renan
Wilkinson Sean R.
Ziatdinov Maxim
Publication venue
Publication date: 17/08/2023
Field of study

Modern large-scale scientific discovery requires multidisciplinary collaboration across diverse computing facilities, including High Performance Computing (HPC) machines and the Edge-to-Cloud continuum. Integrated data analysis plays a crucial role in scientific discovery, especially in the current AI era, by enabling Responsible AI development, FAIR, Reproducibility, and User Steering. However, the heterogeneous nature of science poses challenges such as dealing with multiple supporting tools, cross-facility environments, and efficient HPC execution. Building on data observability, adapter system design, and provenance, we propose MIDA: an approach for lightweight runtime Multi-workflow Integrated Data Analysis. MIDA defines data observability strategies and adaptability methods for various parallel systems and machine learning tools. With observability, it intercepts the dataflows in the background without requiring instrumentation while integrating domain, provenance, and telemetry data at runtime into a unified database ready for user steering queries. We conduct experiments showing end-to-end multi-workflow analysis integrating data from Dask and MLFlow in a real distributed deep learning use case for materials science that runs on multiple environments with up to 276 GPUs in parallel. We show near-zero overhead running up to 100,000 tasks on 1,680 CPU cores on the Summit supercomputer.Comment: 10 pages, 5 figures, 2 Listings, 42 references, Paper accepted at IEEE eScience'2

arXiv.org e-Print Archive

From Facility to Application Sensor Data: Modular, Continuous and Holistic Monitoring with DCDB

Author: Auweter Axel
Guillen Carla
Mueller Micha
Netti Alessio
Ott Michael
Schulz Martin
Tafani Daniele
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 14/08/2019
Field of study

Today's HPC installations are highly-complex systems, and their complexity will only increase as we move to exascale and beyond. At each layer, from facilities to systems, from runtimes to applications, a wide range of tuning decisions must be made in order to achieve efficient operation. This, however, requires systematic and continuous monitoring of system and user data. While many insular solutions exist, a system for holistic and facility-wide monitoring is still lacking in the current HPC ecosystem. In this paper we introduce DCDB, a comprehensive monitoring system capable of integrating data from all system levels. It is designed as a modular and highly-scalable framework based on a plugin infrastructure. All monitored data is aggregated at a distributed noSQL data store for analysis and cross-system correlation. We demonstrate the performance and scalability of DCDB, and describe two use cases in the area of energy management and characterization.Comment: Accepted at the The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC) 201

arXiv.org e-Print Archive

Crossref

Seastar: A Comprehensive Framework for Telemetry Data in HPC Environments

Author: Atkinson Malcolm
Barker Adam
Weidner Ole
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 05/06/2017
Field of study

A large number of 2nd generation high-performance computing applications and services rely on adaptive and dynamic architectures and execution strategies to run efficiently,resiliently, and at scale on today’s HPC infrastructures. They require information about applications and their environment to steer and optimize execution. We define this information as telemetry data. Current HPC platforms do not provide the infrastructure,interfaces and conceptual models to collect, store, analyze,and access such data. Today, applications depend on application and platform specific techniques for collecting telemetry data; introducing significant development overheads that inhibit portability and mobility. The development and adoption of adaptive, context-aware strategies is thereby impaired. To facilitate 2nd generation applications,more efficient application development, and swift adoption of adaptive applications in production, a comprehensive framework for telemetry data management must be provided by future HPC systems and services. We introduce Seastar, a conceptual model and a software framework to collect, store, analyze, and exploit streams of telemetry data generated by HPC systems and their applications. We show how Seastar can be integrated with HPC platform architectures and how it enables common application execution strategies.Postprin

arXiv.org e-Print Archive

Crossref

Edinburgh Research Explorer

University of St. Andrews - Pure

St Andrews Research Repository

ICARUS Training and Support System

Author: Baptista Ricardo
Będkowski Janusz
Coelho Antonio
Goncalves Ricardo
Majek Karol
Masłowski Andrzej
Pełka Michal
Sanchez Jose Manuel
Publication venue: 'IntechOpen'
Publication date: 23/08/2017
Field of study

The ICARUS unmanned tools act as gatherers, which acquire enormous amount of information. The management of all these data requires the careful consideration of an intelligent support system. This chapter discusses the High-Performance Computing (HPC) support tools, which were developed for rapid 3D data extraction, combination, fusion, segmentation, classification and rendering. These support tools were seamlessly connected to a training framework. Indeed, training is a key in the world of search and rescue. Search and rescue workers will never use tools on the field for which they have not been extensively trained beforehand. For this reason, a comprehensive serious gaming training framework was developed, supporting all ICARUS unmanned vehicles in realistic 3D-simulated (based on inputs from the support system) and real environments

IntechOpen

OAPEN Library

Crossref

Design and implementation of a telemetry platform for high-performance computing environments

Author: Weidner Ole
Publication venue: The University of Edinburgh
Publication date: 30/11/2021
Field of study

A new generation of high-performance and distributed computing applications and services rely on adaptive and dynamic architectures and execution strategies to run efficiently, resiliently, and at scale in today’s HPC environments. These architectures require insights into their execution behaviour and the state of their execution environment at various levels of detail, in order to make context-aware decisions. HPC telemetry provides this information. It describes the continuous stream of time series and event data that is generated on HPC systems by the hardware, operating systems, services, runtime systems, and applications. Current HPC ecosystems do not provide the conceptual models, infrastructure, and interfaces to collect, store, analyse, and integrate telemetry in a structured and efficient way. Consequently, applications and services largely depend on one-off solutions and custom-built technologies to achieve these goals; introducing significant development overheads that inhibit portability and mobility. To facilitate a broader mix of applications, more efficient application development, and swift adoption of adaptive architectures in production, a comprehensive framework for telemetry management and analysis must be provided as part of future HPC ecosystem designs. This thesis provides the blueprint for such a framework: it proposes a new approach to telemetry management in HPC: the Telemetry Platform concept. Departing from the observation that telemetry data and the corresponding analysis, and integration pat- terns on modern multi-tenant HPC systems have a lot of similarities to the patterns observed in large-scale data analytics or “Big Data” platforms, the telemetry platform concept takes the data platform paradigm and architectural approach and applies them to HPC telemetry. The result is the blueprint for a system that provides services for storing, searching, analysing, and integrating telemetry data in HPC applications and other HPC system services. It allows users to create and share telemetry data-driven insights using everything from simple time-series analysis to complex statistical and machine learning models while at the same time hiding many of the inherent complexities of data management such as data transport, clean-up, storage, cataloguing, access management, and providing appropriate and scalable analytics and integration capabilities. The main contributions of this research are (1) the application of the data platform concept to HPC telemetry data management and usage; (2) a graph-based, time-variant telemetry data model that captures structures and properties of platform and applications and in which telemetry data can be organized; (3) an architecture blueprint and prototype of a concrete implementation and integration architecture of the telemetry platform; and (4) a proposal for decoupled HPC application architectures, separating telemetry data management, and feedback-control-loop logic from the core application code. First experimental results with the prototype implementation suggest that the telemetry platform paradigm can reduce overhead and redundancy in the development of telemetry-based application architectures, and lower the barrier for HPC systems research and the provisioning of new, innovative HPC system services

Edinburgh Research Archive

Heterogeneity, High Performance Computing, Self-Organization and the Cloud

Author
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/04/2020
Field of study

application; blueprints; self-management; self-organisation; resource management; supply chain; big data; PaaS; Saas; HPCaa

Directory of Open Access Books (DOAB)

Heterogeneity, High Performance Computing, Self-Organization and the Cloud

Author
Publication venue: 'Springer Science and Business Media LLC'
Publication date
Field of study

application; blueprints; self-management; self-organisation; resource management; supply chain; big data; PaaS; Saas; HPCaa

OAPEN Library

Ubiquitous supercomputing : design and development of enabling technologies for multi-robot systems rethinking supercomputing

Author: Camargo Forero Leonardo
Publication venue: Universitat Politècnica de Catalunya
Publication date: 25/10/2019
Field of study

Supercomputing, also known as High Performance Computing (HPC), is almost everywhere (ubiquitous), from the small widget in your phone telling you that today will be a sunny day, up to the next great contribution to the understanding of the origins of the universe.However, there is a field where supercomputing has been only slightly explored - robotics. Other than attempts to optimize complex robotics tasks, the two forces lack an effective alignment and a purposeful long-term contract. With advancements in miniaturization, communications and the appearance of powerful, energy and weight optimized embedded computing boards, a next logical transition corresponds to the creation of clusters of robots, a set of robotic entities that behave similarly as a supercomputer does. Yet, there is key aspect regarding our current understanding of what supercomputing means, or is useful for, that this work aims to redefine. For decades, supercomputing has been solely intended as a computing efficiency mechanism i.e. decreasing the computing time for complex tasks. While such train of thought have led to countless findings, supercomputing is more than that, because in order to provide the capacity of solving most problems quickly, another complete set of features must be provided, a set of features that can also be exploited in contexts such as robotics and that ultimately transform a set of independent entities into a cohesive unit.This thesis aims at rethinking what supercomputing means and to devise strategies to effectively set its inclusion within the robotics realm, contributing therefore to the ubiquity of supercomputing, the first main ideal of this work. With this in mind, a state of the art concerning previous attempts to mix robotics and HPC will be outlined, followed by the proposal of High Performance Robotic Computing (HPRC), a new concept mapping supercomputing to the nuances of multi-robot systems. HPRC can be thought as supercomputing in the edge and while this approach will provide all kind of advantages, in certain applications it might not be enough since interaction with external infrastructures will be required or desired. To facilitate such interaction, this thesis proposes the concept of ubiquitous supercomputing as the union of HPC, HPRC and two more type of entities, computing-less devices (e.g. sensor networks, etc.) and humans.The results of this thesis include the ubiquitous supercomputing ontology and an enabling technology depicted as The ARCHADE. The technology serves as a middleware between a mission and a supercomputing infrastructure and as a framework to facilitate the execution of any type of mission, i.e. precision agriculture, entertainment, inspection and monitoring, etc. Furthermore, the results of the execution of a set of missions are discussed.By integrating supercomputing and robotics, a second ideal is targeted, ubiquitous robotics, i.e. the use of robots in all kind of applications. Correspondingly, a review of existing ubiquitous robotics frameworks is presented and based upon its conclusions, The ARCHADE's design and development have followed the guidelines for current and future solutions. Furthermore, The ARCHADE is based on a rethought supercomputing where performance is not the only feature to be provided by ubiquitous supercomputing systems. However, performance indicators will be discussed, along with those related to other supercomputing features.Supercomputing has been an excellent ally for scientific exploration and not so long ago for commercial activities, leading to all kind of improvements in our lives, in our society and in our future. With the results of this thesis, the joining of two fields, two forces previously disconnected because of their philosophical approaches and their divergent backgrounds, holds enormous potential to open up our imagination for all kind of new applications and for a world where robotics and supercomputing are everywhere.La supercomputación, también conocida como Computación de Alto Rendimiento (HPC por sus siglas en inglés) puede encontrarse en casi cualquier lugar (ubicua), desde el widget en tu teléfono diciéndote que hoy será un día soleado, hasta la siguiente gran contribución al entendimiento de los orígenes del universo. Sin embargo, hay un campo en el que ha sido poco explorada - la robótica. Más allá de intentos de optimizar tareas robóticas complejas, las dos fuerzas carecen de un contrato a largo plazo. Dado los avances en miniaturización, comunicaciones y la aparición de potentes computadores embebidos, optimizados en peso y energía, la siguiente transición corresponde a la creación de un cluster de robots, un conjunto de robots que se comportan de manera similar a un supercomputador. No obstante, hay un aspecto clave, con respecto a la comprensión de la supercomputación, que esta tesis pretende redefinir. Durante décadas, la supercomputación ha sido entendida como un mecanismo de eficiencia computacional, es decir para reducir el tiempo de computación de ciertos problemas extremadamente complejos. Si bien este enfoque ha conducido a innumerables hallazgos, la supercomputación es más que eso, porque para proporcionar la capacidad de resolver todo tipo de problemas rápidamente, se debe proporcionar otro conjunto de características que también pueden ser explotadas en la robótica y que transforman un conjunto de robots en una unidad cohesiva. Esta tesis pretende repensar lo que significa la supercomputación y diseñar estrategias para establecer su inclusión dentro del mundo de la robótica, contribuyendo así a su ubicuidad, el principal ideal de este trabajo. Con esto en mente, se presentará un estado del arte relacionado con intentos anteriores de mezclar robótica y HPC, seguido de la propuesta de Computación Robótica de Alto Rendimiento (HPRC, por sus siglas en inglés), un nuevo concepto, que mapea la supercomputación a los matices específicos de los sistemas multi-robot. HPRC puede pensarse como supercomputación en el borde y si bien este enfoque proporcionará todo tipo de ventajas, ciertas aplicaciones requerirán una interacción con infraestructuras externas. Para facilitar dicha interacción, esta tesis propone el concepto de supercomputación ubicua como la unión de HPC, HPRC y dos tipos más de entidades, dispositivos sin computación embebida y seres humanos. Los resultados de esta tesis incluyen la ontología de la supercomputación ubicua y una tecnología llamada The ARCHADE. La tecnología actúa como middleware entre una misión y una infraestructura de supercomputación y como framework para facilitar la ejecución de cualquier tipo de misión, por ejemplo, agricultura de precisión, inspección y monitoreo, etc. Al integrar la supercomputación y la robótica, se busca un segundo ideal, robótica ubicua, es decir el uso de robots en todo tipo de aplicaciones. Correspondientemente, una revisión de frameworks existentes relacionados serán discutidos. El diseño y desarrollo de The ARCHADE ha seguido las pautas y sugerencias encontradas en dicha revisión. Además, The ARCHADE se basa en una supercomputación repensada donde la eficiencia computacional no es la única característica proporcionada a sistemas basados en la tecnología. Sin embargo, se analizarán indicadores de eficiencia computacional, junto con otros indicadores relacionados con otras características de la supercomputación. La supercomputación ha sido un excelente aliado para la exploración científica, conduciendo a todo tipo de mejoras en nuestras vidas, nuestra sociedad y nuestro futuro. Con los resultados de esta tesis, la unión de dos campos, dos fuerzas previamente desconectadas debido a sus enfoques filosóficos y sus antecedentes divergentes, tiene un enorme potencial para abrir nuestra imaginación hacia todo tipo de aplicaciones nuevas y para un mundo donde la robótica y la supercomputación estén en todos ladosPostprint (published version

UPCommons. Portal del coneixement obert de la UPC

Ubiquitous supercomputing : design and development of enabling technologies for multi-robot systems rethinking supercomputing

Author: Camargo Forero Leonardo
Publication venue: Universitat Politècnica de Catalunya
Publication date: 01/01/2019
Field of study

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Tesis Doctorals en Xarxa

Improving efficiency and resilience in large-scale computing systems through analytics and data-driven management

Author: Tuncer Ozan
Publication venue
Publication date: 03/07/2018
Field of study

Applications running in large-scale computing systems such as high performance computing (HPC) or cloud data centers are essential to many aspects of modern society, from weather forecasting to financial services. As the number and size of data centers increase with the growing computing demand, scalable and efficient management becomes crucial. However, data center management is a challenging task due to the complex interactions between applications, middleware, and hardware layers such as processors, network, and cooling units. This thesis claims that to improve robustness and efficiency of large-scale computing systems, significantly higher levels of automated support than what is available in today's systems are needed, and this automation should leverage the data continuously collected from various system layers. Towards this claim, we propose novel methodologies to automatically diagnose the root causes of performance and configuration problems and to improve efficiency through data-driven system management. We first propose a framework to diagnose software and hardware anomalies that cause undesired performance variations in large-scale computing systems. We show that by training machine learning models on resource usage and performance data collected from servers, our approach successfully diagnoses 98% of the injected anomalies at runtime in real-world HPC clusters with negligible computational overhead. We then introduce an analytics framework to address another major source of performance anomalies in cloud data centers: software misconfigurations. Our framework discovers and extracts configuration information from cloud instances such as containers or virtual machines. This is the first framework to provide comprehensive visibility into software configurations in multi-tenant cloud platforms, enabling systematic analysis for validating the correctness of software configurations. This thesis also contributes to the design of robust and efficient system management methods that leverage continuously monitored resource usage data. To improve performance under power constraints, we propose a workload- and cooling-aware power budgeting algorithm that distributes the available power among servers and cooling units in a data center, achieving up to 21% improvement in throughput per Watt compared to the state-of-the-art. Additionally, we design a network- and communication-aware HPC workload placement policy that reduces communication overhead by up to 30% in terms of hop-bytes compared to existing policies.2019-07-02T00:00:00

Boston University Institutional Repository (OpenBU)