
    Deployment and Operation of Complex Software in Heterogeneous Execution Environments

    This open access book provides an overview of the work developed within the SODALITE project, which aims at facilitating the deployment and operation of distributed software on top of heterogeneous infrastructures, including cloud, HPC, and edge resources. The experts participating in the project describe how SODALITE works and how it can be exploited by end users. Although multiple languages and tools are available to support DevOps teams in automating deployment and operation steps, these activities still require specific know-how and skills that average teams often lack. The SODALITE framework tackles this problem by offering modelling and smart editing features that allow those we call Application Ops Experts to work without knowing the low-level details of the adopted, potentially heterogeneous, infrastructures. The framework also offers mechanisms to verify the quality of the defined models, generate the corresponding executable infrastructural code, automatically wrap application components within proper execution containers, orchestrate all activities concerned with the deployment and operation of all system components, and support on-the-fly self-adaptation and refactoring.
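
    As a rough illustration of the model-to-code idea described above, the following Python sketch turns a small abstract deployment model into a TOSCA-like descriptor. The model fields, node types, and helper names are assumptions chosen for illustration; this is not SODALITE's actual modelling language or code generator.

```python
# Minimal sketch (not SODALITE's generator): render an abstract deployment
# model as a TOSCA-style descriptor. All field names are hypothetical.
from typing import Dict


def render_node_template(name: str, spec: Dict[str, str]) -> str:
    """Render one application component as a TOSCA-style node template."""
    lines = [
        f"    {name}:",
        f"      type: {spec.get('type', 'tosca.nodes.Container.Application')}",
        "      properties:",
        f"        image: {spec.get('image', 'unknown')}",
        "      requirements:",
        f"        - host: {spec.get('target', 'cloud_vm')}",
    ]
    return "\n".join(lines)


def render_deployment(model: Dict[str, Dict[str, str]]) -> str:
    """Render the whole abstract model as a single deployment descriptor."""
    header = "topology_template:\n  node_templates:"
    body = "\n".join(render_node_template(n, s) for n, s in model.items())
    return header + "\n" + body


if __name__ == "__main__":
    # Hypothetical model: one component per target infrastructure.
    model = {
        "preprocessor": {"image": "myorg/preprocess:1.0", "target": "edge_node"},
        "solver": {"image": "myorg/solver:2.3", "target": "hpc_partition"},
    }
    print(render_deployment(model))
```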

    Analyzing Resource Utilization in an HPC System: A Case Study of NERSC Perlmutter

    Resource demands of HPC applications vary significantly. However, it is common for HPC systems to assign resources primarily on a per-node basis to prevent interference from co-located workloads. This gap between coarse-grained resource allocation and varying resource demands can leave HPC resources underutilized. In this study, we analyze the resource usage and application behavior of NERSC's Perlmutter, a state-of-the-art open-science HPC system with both CPU-only and GPU-accelerated nodes. Our one-month usage analysis reveals that CPUs are commonly not fully utilized, especially for GPU-enabled jobs. Around 64% of both CPU-only and GPU-enabled jobs used 50% or less of the available host memory capacity. Additionally, about 50% of GPU-enabled jobs used at most 25% of the GPU memory, and memory capacity was, in one form or another, underutilized across all job types. While our study comes early in Perlmutter's lifetime, so policies and application workloads may change, it provides valuable insights into performance characterization and application behavior, and it motivates systems with more fine-grained resource allocation.
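
    A back-of-the-envelope sketch of the kind of per-job utilization analysis described above, written against hypothetical job records; the field names and thresholds are assumptions for illustration and not the study's actual pipeline or schema.

```python
# Hypothetical per-job records; field names are assumptions for illustration.
from dataclasses import dataclass
from typing import List


@dataclass
class JobRecord:
    used_host_mem_gb: float
    host_mem_gb: float
    gpu_enabled: bool


def fraction_at_or_below(jobs: List[JobRecord], threshold: float, gpu: bool) -> float:
    """Fraction of (CPU-only or GPU-enabled) jobs whose host-memory
    utilization is at or below `threshold` (e.g. 0.5 for 50%)."""
    selected = [j for j in jobs if j.gpu_enabled == gpu]
    if not selected:
        return 0.0
    low = [j for j in selected if j.used_host_mem_gb / j.host_mem_gb <= threshold]
    return len(low) / len(selected)


if __name__ == "__main__":
    jobs = [
        JobRecord(100, 512, True),
        JobRecord(300, 512, True),
        JobRecord(60, 256, False),
    ]
    print(f"GPU jobs at <=50% host memory: {fraction_at_or_below(jobs, 0.5, True):.0%}")
    print(f"CPU jobs at <=50% host memory: {fraction_at_or_below(jobs, 0.5, False):.0%}")
```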

    SODALITE@RT: Orchestrating Applications on Cloud-Edge Infrastructures

    IoT-based applications need to be dynamically orchestrated on cloud-edge infrastructures for reasons such as performance, regulations, or cost. In this context, a crucial problem is facilitating the work of DevOps teams in deploying, monitoring, and managing such applications by providing the necessary tools and platforms. The SODALITE@RT open-source framework aims at addressing this scenario. In this paper, we present the main features of SODALITE@RT: modeling of cloud-edge resources and applications using open standards and infrastructural code, and automated deployment, monitoring, and management of the applications on the target infrastructures based on such models. The capabilities of SODALITE@RT are demonstrated through a relevant case study.
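
    To make the "deploy, monitor, and manage from a model" loop above concrete, here is a deliberately simplified reconciliation sketch. The model fields and the `probe` and `redeploy` helpers are hypothetical placeholders, not SODALITE@RT's actual API.

```python
# Simplified reconciliation pass (hypothetical, not the SODALITE@RT API):
# compare the desired model against observed state and trigger redeployments.
from typing import Callable, Dict

DesiredModel = Dict[str, str]  # component name -> target node


def reconcile(model: DesiredModel,
              probe: Callable[[str], str],
              redeploy: Callable[[str, str], None]) -> None:
    """Redeploy every component whose observed status is not 'running'."""
    for component, target in model.items():
        status = probe(component)        # e.g. query a monitoring endpoint
        if status != "running":
            redeploy(component, target)  # e.g. re-run the deployment step


if __name__ == "__main__":
    model = {"video-ingest": "edge-gw-1", "analytics": "cloud-vm-3"}
    fake_status = {"video-ingest": "failed", "analytics": "running"}

    reconcile(model,
              probe=lambda c: fake_status.get(c, "unknown"),
              redeploy=lambda c, t: print(f"redeploying {c} on {t}"))
```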

    Energy challenges for ICT

    The energy consumption from the expanding use of information and communications technology (ICT) is unsustainable with present drivers, and it will impact heavily on future climate change. However, ICT devices have the potential to contribute significantly to the reduction of CO2 emissions and to enhance resource efficiency in other sectors, e.g., transportation (through intelligent transportation systems, advanced driver assistance systems, and self-driving vehicles), heating (through smart building control), and manufacturing (through digital automation based on smart autonomous sensors). To address the energy sustainability of ICT and capture the full potential of ICT in resource efficiency, a multidisciplinary ICT-energy community needs to be brought together, covering devices, microarchitectures, ultra large-scale integration (ULSI), high-performance computing (HPC), energy harvesting, energy storage, system design, embedded systems, efficient electronics, static analysis, and computation. In this chapter, we introduce challenges and opportunities in this emerging field and a common framework to strive towards energy-sustainable ICT.

    Intelligent Resource Prediction for HPC and Scientific Workflows

    Scientific workflows and high-performance computing (HPC) platforms are critically important to modern scientific research. In order to perform scientific experiments at scale, domain scientists must have knowledge and expertise in software and hardware systems that are highly complex and rapidly evolving. While computational expertise will remain essential for domain scientists, any tools or practices that reduce this burden will greatly increase the rate of scientific discovery. One challenge domain scientists face today is knowing the resource usage patterns of an application for the purpose of resource provisioning. A tool that accurately estimates these resource requirements would benefit HPC users in many ways, by reducing job failures and queue times on traditional HPC platforms and reducing costs on cloud computing platforms. To that end, we present Tesseract, a semi-automated tool that predicts resource usage for any application on any computing platform from historical data, with minimal input from the user. We employ Tesseract to predict runtime, memory usage, and disk usage for a diverse set of scientific workflows, and in particular we show how these resource estimates can prevent under-provisioning. Finally, we leverage this core prediction capability to develop solutions for the related challenges of anomaly detection, cross-platform runtime prediction, and cost prediction.
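
    As a rough sketch of the core idea (predicting a job's peak resource usage from historical runs and padding the estimate to avoid under-provisioning), the snippet below averages the most similar past runs and adds headroom. The feature, the neighbour count, and the padding factor are assumptions; this is not Tesseract's actual model.

```python
# Toy resource estimator (not Tesseract itself): predict peak memory for a
# new run from historical runs with similar input size, then add headroom so
# that under-provisioning (and thus job failure) becomes less likely.
from typing import List, Tuple

History = List[Tuple[float, float]]  # (input_size_gb, peak_mem_gb) per past run


def predict_peak_mem(history: History, input_size_gb: float,
                     k: int = 3, headroom: float = 1.2) -> float:
    """Average the k historical runs closest in input size, then pad by `headroom`."""
    if not history:
        raise ValueError("need at least one historical run")
    nearest = sorted(history, key=lambda r: abs(r[0] - input_size_gb))[:k]
    estimate = sum(mem for _, mem in nearest) / len(nearest)
    return estimate * headroom


if __name__ == "__main__":
    past_runs = [(1.0, 8.0), (2.0, 15.0), (4.0, 29.0), (8.0, 61.0)]
    print(f"request ~{predict_peak_mem(past_runs, 3.0):.1f} GB of memory")
```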

    Towards Lightweight Data Integration using Multi-workflow Provenance and Data Observability

    Modern large-scale scientific discovery requires multidisciplinary collaboration across diverse computing facilities, including High Performance Computing (HPC) machines and the Edge-to-Cloud continuum. Integrated data analysis plays a crucial role in scientific discovery, especially in the current AI era, by enabling Responsible AI development, FAIR principles, reproducibility, and user steering. However, the heterogeneous nature of science poses challenges such as dealing with multiple supporting tools, cross-facility environments, and efficient HPC execution. Building on data observability, adapter system design, and provenance, we propose MIDA: an approach for lightweight runtime Multi-workflow Integrated Data Analysis. MIDA defines data observability strategies and adaptability methods for various parallel systems and machine learning tools. With observability, it intercepts the dataflows in the background, without requiring instrumentation, while integrating domain, provenance, and telemetry data at runtime into a unified database ready for user steering queries. We conduct experiments showing end-to-end multi-workflow analysis integrating data from Dask and MLflow in a real distributed deep learning use case for materials science that runs on multiple environments with up to 276 GPUs in parallel. We show near-zero overhead when running up to 100,000 tasks on 1,680 CPU cores on the Summit supercomputer.
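
    A very small sketch of the data-observability idea above: an adapter watches the files a workflow tool already writes in the background and merges what it finds into one shared database, without instrumenting the application. The directory layout, table schema, and polling approach are assumptions for illustration, not MIDA's implementation.

```python
# Toy observability adapter (illustrative only, not MIDA): watch a directory
# that a workflow tool writes JSON records into, and merge those records into
# a single SQLite table that analysis/steering queries can read at runtime.
import json
import sqlite3
from pathlib import Path


def init_db(db_path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS observations ("
        "  source TEXT, task_id TEXT, key TEXT, value TEXT)"
    )
    return conn


def ingest_once(conn: sqlite3.Connection, source: str, watch_dir: Path) -> int:
    """One polling pass: load every JSON file found and record its fields."""
    rows = 0
    for path in watch_dir.glob("*.json"):
        record = json.loads(path.read_text())
        task_id = str(record.get("task_id", path.stem))
        for key, value in record.items():
            conn.execute(
                "INSERT INTO observations VALUES (?, ?, ?, ?)",
                (source, task_id, key, json.dumps(value)),
            )
            rows += 1
    conn.commit()
    return rows


if __name__ == "__main__":
    conn = init_db("unified.db")
    # A real deployment would poll periodically with one adapter per tool
    # (e.g. workflow telemetry, ML experiment metadata), leaving the
    # application code untouched.
    n = ingest_once(conn, source="toy-tool", watch_dir=Path("./tool_output"))
    print(f"ingested {n} key/value observations")
```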

    Contribution to the Convergence of Infrastructures between High-Performance Computing and Large-Scale Data Processing

    The amount of produced data, either in the scientific community or the commercial world, is constantly growing. The field of Big Data has emerged to handle large amounts of data on distributed computing infrastructures. High-Performance Computing (HPC) infrastructures are traditionally used for the execution of compute-intensive workloads. However, the HPC community is also facing an increasing need to process large amounts of data derived from high-definition sensors and large physics apparatuses. The convergence of the two fields, HPC and Big Data, is currently taking place. In fact, the HPC community already uses Big Data tools, which are not always integrated correctly, especially at the level of the file system and the Resource and Job Management System (RJMS). In order to understand how we can leverage HPC clusters for Big Data usage, and what the challenges are for HPC infrastructures, we have studied multiple aspects of the convergence. We first provide a survey of software provisioning methods, with a focus on data-intensive applications. We contribute a new RJMS collaboration technique called BeBiDa, which is based on 50 lines of code, whereas similar solutions use at least 1000 times more. We evaluate this mechanism under real conditions and in a simulated environment with our simulator Batsim. Furthermore, we provide extensions to Batsim to support I/O, and showcase the development of a generic file system model along with a Big Data application model. This allows us to complement the BeBiDa real-conditions experiments with simulations while enabling us to study file system dimensioning and trade-offs. All the experiments and analysis in this work have been done with reproducibility in mind. Based on this experience, we propose to integrate the development workflow and data analysis into the reproducibility mindset, and we give feedback on our experience with a list of best practices.
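
    To illustrate the flavour of a lightweight RJMS collaboration like BeBiDa, which the abstract credits with roughly 50 lines of code, here is a hypothetical prolog/epilog-style sketch: when the HPC scheduler starts a job on some nodes, those nodes are drained from the Big Data cluster, and they are handed back when the job ends. The pool representation and commands are placeholders, not BeBiDa's actual scripts.

```python
# Hypothetical prolog/epilog hooks (illustration only, not BeBiDa's code):
# the HPC resource manager calls `prolog` right before an HPC job starts and
# `epilog` right after it ends, so the Big Data cluster temporarily loses and
# then regains the nodes the HPC job occupies.
from typing import Iterable, Set

big_data_pool: Set[str] = {"node1", "node2", "node3", "node4"}


def prolog(hpc_job_nodes: Iterable[str]) -> None:
    """Remove the HPC job's nodes from the Big Data worker pool."""
    for node in hpc_job_nodes:
        big_data_pool.discard(node)      # e.g. decommission a worker here
        print(f"prolog: drained {node} from the Big Data cluster")


def epilog(hpc_job_nodes: Iterable[str]) -> None:
    """Give the nodes back to the Big Data cluster once the HPC job is done."""
    for node in hpc_job_nodes:
        big_data_pool.add(node)          # e.g. recommission the worker here
        print(f"epilog: returned {node} to the Big Data cluster")


if __name__ == "__main__":
    job_nodes = ["node2", "node3"]
    prolog(job_nodes)
    print("Big Data pool during HPC job:", sorted(big_data_pool))
    epilog(job_nodes)
    print("Big Data pool after HPC job: ", sorted(big_data_pool))
```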
