2,247 research outputs found
Performance metrics and auditing framework using application kernels for high-performance computer systems
Peer Reviewed. http://deepblue.lib.umich.edu/bitstream/2027.42/97468/1/cpe2871.pd
Deployment and Operation of Complex Software in Heterogeneous Execution Environments
This open access book provides an overview of the work developed within the SODALITE project, which aims at facilitating the deployment and operation of distributed software on top of heterogeneous infrastructures, including cloud, HPC and edge resources. The experts participating in the project describe how SODALITE works and how it can be exploited by end users. While multiple languages and tools are available in the literature to support DevOps teams in automating deployment and operation steps, these activities still require specific know-how and skills that cannot be found in average teams. The SODALITE framework tackles this problem by offering modelling and smart editing features that allow those we call Application Ops Experts to work without knowing low-level details about the adopted, potentially heterogeneous, infrastructures. The framework also offers mechanisms to verify the quality of the defined models, generate the corresponding executable infrastructural code, automatically wrap application components within proper execution containers, orchestrate all activities concerned with the deployment and operation of all system components, and support on-the-fly self-adaptation and refactoring
Analyzing Resource Utilization in an HPC System: A Case Study of NERSC Perlmutter
Resource demands of HPC applications vary significantly. However, it is common for HPC systems to primarily assign resources on a per-node basis to prevent interference from co-located workloads. This gap between the coarse-grained resource allocation and the varying resource demands can leave HPC resources not fully utilized. In this study, we analyze the resource usage and application behavior of NERSC's Perlmutter, a state-of-the-art open-science HPC system with both CPU-only and GPU-accelerated nodes. Our one-month usage analysis reveals that CPUs are commonly not fully utilized, especially for GPU-enabled jobs. Also, around 64% of both CPU-only and GPU-enabled jobs used 50% or less of the available host memory capacity. Additionally, about 50% of GPU-enabled jobs used at most 25% of the GPU memory, and, in one way or another, memory capacity was not fully utilized across all jobs. While our study comes early in Perlmutter's lifetime, so policies and application workloads may change, it provides valuable insights on performance characterization and application behavior, and motivates systems with more fine-grained resource allocation
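The per-job utilization figures above can be reproduced from accounting records. A minimal sketch of that kind of analysis, over hypothetical job records (the field layout and values are assumptions for illustration, not NERSC's actual accounting schema):

```python
# Each record: (job_id, mem_used_gb, mem_avail_gb, gpu_mem_used_gb, gpu_mem_avail_gb)
jobs = [
    ("a1", 100, 512, 10, 40),
    ("a2", 300, 512, 35, 40),
    ("a3", 120, 512, 5, 40),
    ("a4", 400, 512, 8, 40),
]

def fraction_below(jobs, used_idx, avail_idx, threshold):
    """Fraction of jobs whose utilization of a resource is <= threshold."""
    hits = sum(1 for j in jobs if j[used_idx] / j[avail_idx] <= threshold)
    return hits / len(jobs)

host_mem_half = fraction_below(jobs, 1, 2, 0.50)    # jobs using <=50% host memory
gpu_mem_quarter = fraction_below(jobs, 3, 4, 0.25)  # jobs using <=25% GPU memory
print(f"{host_mem_half:.0%} of jobs used <=50% of host memory")
print(f"{gpu_mem_quarter:.0%} of jobs used <=25% of GPU memory")
```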
SODALITE@RT: Orchestrating Applications on Cloud-Edge Infrastructures
IoT-based applications need to be dynamically orchestrated on cloud-edge infrastructures for reasons such as performance, regulations, or cost. In this context, a crucial problem is facilitating the work of DevOps teams in deploying, monitoring, and managing such applications by providing the necessary tools and platforms. The SODALITE@RT open-source framework aims at addressing this scenario. In this paper, we present the main features of SODALITE@RT: modeling of cloud-edge resources and applications using open standards and infrastructural code, and automated deployment, monitoring, and management of the applications in the target infrastructures based on such models. The capabilities of SODALITE@RT are demonstrated through a relevant case study
Energy challenges for ICT
The energy consumption from the expanding use of information and communications technology (ICT) is unsustainable with present drivers, and it will impact heavily on future climate change. However, ICT devices have the potential to contribute significantly to the reduction of CO2 emissions and enhance resource efficiency in other sectors, e.g., transportation (through intelligent transportation, advanced driver assistance systems, and self-driving vehicles), heating (through smart building control), and manufacturing (through digital automation based on smart autonomous sensors). To address the energy sustainability of ICT and capture the full potential of ICT in resource efficiency, a multidisciplinary ICT-energy community needs to be brought together, covering devices, microarchitectures, ultra-large-scale integration (ULSI), high-performance computing (HPC), energy harvesting, energy storage, system design, embedded systems, efficient electronics, static analysis, and computation. In this chapter, we introduce challenges and opportunities in this emerging field and a common framework to strive towards energy-sustainable ICT
Intelligent Resource Prediction for HPC and Scientific Workflows
Scientific workflows and high-performance computing (HPC) platforms are critically important to modern scientific research. In order to perform scientific experiments at scale, domain scientists must have knowledge and expertise in software and hardware systems that are highly complex and rapidly evolving. While computational expertise will be essential for domain scientists going forward, any tools or practices that reduce this burden for domain scientists will greatly increase the rate of scientific discoveries. One challenge that exists for domain scientists today is knowing the resource usage patterns of an application for the purpose of resource provisioning. A tool that accurately estimates these resource requirements would benefit HPC users in many ways, by reducing job failures and queue times on traditional HPC platforms and reducing costs on cloud computing platforms. To that end, we present Tesseract, a semi-automated tool that predicts resource usage for any application on any computing platform, from historical data, with minimal input from the user. We employ Tesseract to predict runtime, memory usage, and disk usage for a diverse set of scientific workflows, and in particular we show how these resource estimates can prevent under-provisioning. Finally, we leverage this core prediction capability to develop solutions for the related challenges of anomaly detection, cross-platform runtime prediction, and cost prediction
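Predicting resource usage from historical runs, as described above, can be sketched with a simple heuristic: request a high percentile of past usage plus a safety margin, which trades some over-provisioning for fewer job failures. This is an illustrative assumption, not Tesseract's actual algorithm:

```python
import math

def estimate_request(history, percentile=0.95, margin=1.2):
    """Return a resource request: a high percentile of past usage,
    padded by a safety margin to reduce under-provisioning failures."""
    ordered = sorted(history)
    idx = min(len(ordered) - 1, math.ceil(percentile * len(ordered)) - 1)
    return ordered[idx] * margin

# Hypothetical peak memory (GB) observed in past runs of the same workflow.
past_mem_gb = [10.2, 11.0, 9.8, 12.5, 10.9, 11.3]
request = estimate_request(past_mem_gb)
print(f"request {request:.1f} GB")  # padded above the observed 95th percentile
```

A real predictor would also condition on input size and platform, but the padding step is what prevents the under-provisioning failures the abstract mentions.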
Towards Lightweight Data Integration using Multi-workflow Provenance and Data Observability
Modern large-scale scientific discovery requires multidisciplinary collaboration across diverse computing facilities, including High Performance Computing (HPC) machines and the Edge-to-Cloud continuum. Integrated data analysis plays a crucial role in scientific discovery, especially in the current AI era, by enabling Responsible AI development, FAIR, Reproducibility, and User Steering. However, the heterogeneous nature of science poses challenges such as dealing with multiple supporting tools, cross-facility environments, and efficient HPC execution. Building on data observability, adapter system design, and provenance, we propose MIDA: an approach for lightweight runtime Multi-workflow Integrated Data Analysis. MIDA defines data observability strategies and adaptability methods for various parallel systems and machine learning tools. With observability, it intercepts the dataflows in the background without requiring instrumentation while integrating domain, provenance, and telemetry data at runtime into a unified database ready for user steering queries. We conduct experiments showing end-to-end multi-workflow analysis integrating data from Dask and MLFlow in a real distributed deep learning use case for materials science that runs on multiple environments with up to 276 GPUs in parallel. We show near-zero overhead running up to 100,000 tasks on 1,680 CPU cores on the Summit supercomputer.
Comment: 10 pages, 5 figures, 2 Listings, 42 references. Paper accepted at IEEE eScience'2
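The core idea above, adapters that normalize events from heterogeneous tools into one database ready for runtime queries, can be sketched as follows. The adapter interface and table schema here are illustrative assumptions, not MIDA's actual interfaces:

```python
import json
import sqlite3
import time

# Unified store: one table for events from every tool.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE events (
    workflow TEXT, source TEXT, ts REAL, payload TEXT)""")

def capture(workflow, source, payload):
    """Adapter entry point: normalize an event from any tool into one table."""
    db.execute("INSERT INTO events VALUES (?, ?, ?, ?)",
               (workflow, source, time.time(), json.dumps(payload)))

# Events as they might arrive in the background from different tools.
capture("train-model", "dask", {"task": "featurize", "state": "done"})
capture("train-model", "mlflow", {"metric": "loss", "value": 0.12})

# A user-steering query across sources, at runtime.
rows = db.execute(
    "SELECT source, payload FROM events WHERE workflow = ?",
    ("train-model",)).fetchall()
for source, payload in rows:
    print(source, json.loads(payload))
```

Keeping the payload as JSON lets each tool contribute its own fields while the workflow/source/timestamp columns stay queryable across all of them.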
Contribution to infrastructure convergence between high-performance computing and large-scale data processing
The amount of produced data, either in the scientific community or the commercial world, is constantly growing. The field of Big Data has emerged to handle large amounts of data on distributed computing infrastructures. High-Performance Computing (HPC) infrastructures are traditionally used for the execution of compute-intensive workloads. However, the HPC community is also facing an increasing need to process large amounts of data derived from high-definition sensors and large physics apparatus. The convergence of the two fields, HPC and Big Data, is currently taking place. In fact, the HPC community already uses Big Data tools, which are not always integrated correctly, especially at the level of the file system and the Resource and Job Management System (RJMS). In order to understand how we can leverage HPC clusters for Big Data usage, and what the challenges are for HPC infrastructures, we have studied multiple aspects of the convergence: we initially provide a survey of software provisioning methods, with a focus on data-intensive applications. We contribute a new RJMS collaboration technique called BeBiDa, which is based on 50 lines of code whereas similar solutions use at least 1000 times more. We evaluate this mechanism under real conditions and in a simulated environment with our simulator Batsim. Furthermore, we provide extensions to Batsim to support I/O, and showcase the development of a generic file system model along with a Big Data application model. This allows us to complement the BeBiDa real-conditions experiments with simulations while enabling us to study file system dimensioning and trade-offs. All the experiments and analysis of this work have been done with reproducibility in mind.
Based on this experience, we propose to integrate the development workflow and data analysis into the reproducibility mindset, and give feedback on our experiences with a list of best practices
- …