
    The Changing Role of RSEs over the Lifetime of Parsl

    This position paper describes the Parsl open source research software project and its various phases over seven years. It defines four types of research software engineers (RSEs) who have been important to the project in those phases; we believe this categorization is also applicable to other research software projects.

    Leveraging Large Language Models to Build and Execute Computational Workflows

    The recent development of large language models (LLMs) with multi-billion parameters, coupled with the creation of user-friendly application programming interfaces (APIs), has paved the way for automatically generating and executing code in response to straightforward human queries. This paper explores how these emerging capabilities can be harnessed to facilitate complex scientific workflows, eliminating the need for traditional coding methods. We present initial findings from our attempt to integrate Phyloflow with OpenAI's function-calling API, and outline a strategy for developing a comprehensive workflow management system based on these concepts.
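
    The abstract does not show the integration itself, so the following is only a minimal sketch of how a workflow task might be exposed to an LLM through OpenAI's function-calling (tools) interface. The task name `run_alignment`, its parameter schema, and the local dispatcher are invented for illustration and are not part of Phyloflow or the paper's system.

```python
# Hypothetical sketch: exposing one workflow step as a tool the model may invoke.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Describe a single (invented) workflow task as a callable tool.
tools = [{
    "type": "function",
    "function": {
        "name": "run_alignment",  # hypothetical workflow task, not a Phyloflow API
        "description": "Align a set of sequences and return a job id.",
        "parameters": {
            "type": "object",
            "properties": {
                "input_path": {"type": "string"},
                "threads": {"type": "integer"},
            },
            "required": ["input_path"],
        },
    },
}]

def run_alignment(input_path: str, threads: int = 4) -> str:
    """Placeholder for submitting the real workflow task."""
    return f"submitted job for {input_path} with {threads} threads"

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Align the sequences in data/seqs.fasta"}],
    tools=tools,
)

# If the model chose to call the tool, decode its arguments and dispatch locally.
for call in response.choices[0].message.tool_calls or []:
    if call.function.name == "run_alignment":
        args = json.loads(call.function.arguments)
        print(run_alignment(**args))
```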

    Fixed-target serial crystallography at the Structural Biology Center

    Serial synchrotron crystallography enables the study of protein structures under physiological temperature and with reduced radiation damage by collecting data from thousands of crystals. The Structural Biology Center at Sector 19 of the Advanced Photon Source has implemented a fixed-target approach with a new 3D-printed mesh-holder optimized for sample handling. The holder immobilizes a crystal suspension or droplet emulsion on a nylon mesh, trapping and sealing a near-monolayer of crystals in its mother liquor between two thin Mylar films. Data can be rapidly collected in scan mode using piezoelectric linear stages assembled in an XYZ arrangement and controlled with a graphical user interface, and analyzed in near real time using a high-performance computing pipeline. Here, the system was applied to two β-lactamases: a class D serine β-lactamase from Chitinophaga pinensis DSM 2588 and the L1 metallo-β-lactamase from Stenotrophomonas maltophilia K279a.
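
    As an illustration of the scan-mode collection described above, the sketch below raster-scans a grid of mesh positions and triggers an exposure at each point. The `Stage` class, its `move()` method, and the `shoot` callback are hypothetical stand-ins, not the beamline's actual control software.

```python
# Illustrative fixed-target raster scan over an XY grid of mesh positions.
from dataclasses import dataclass

@dataclass
class Stage:
    """Hypothetical piezoelectric linear stage for one axis."""
    axis: str

    def move(self, position_um: float) -> None:
        print(f"move {self.axis} -> {position_um:.1f} um")

def raster_scan(x: Stage, y: Stage, nx: int, ny: int, pitch_um: float, shoot) -> None:
    """Serpentine scan over an nx-by-ny grid with the given pitch."""
    for j in range(ny):
        y.move(j * pitch_um)
        # Alternate column order on each row to minimize stage travel.
        cols = range(nx) if j % 2 == 0 else reversed(range(nx))
        for i in cols:
            x.move(i * pitch_um)
            shoot()  # e.g. trigger the detector / open the shutter

raster_scan(Stage("x"), Stage("y"), nx=5, ny=3, pitch_um=50.0,
            shoot=lambda: print("  expose"))
```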

    Xar-Trek: Run-Time Execution Migration among FPGAs and Heterogeneous-ISA CPUs

    Datacenter servers are increasingly heterogeneous: from x86 host CPUs, to ARM or RISC-V CPUs in NICs/SSDs, to FPGAs. Previous works have demonstrated that migrating application execution at run-time across heterogeneous-ISA CPUs can yield significant performance and energy gains, with relatively little programmer effort. However, FPGAs have often been overlooked in that context: hardware acceleration using FPGAs involves statically implementing select application functions, which prohibits dynamic and transparent migration. We present Xar-Trek, a new compiler and run-time software framework that overcomes this limitation. Xar-Trek compiles an application for several CPU ISAs and select application functions for acceleration on an FPGA, allowing execution migration between heterogeneous-ISA CPUs and FPGAs at run-time. Xar-Trek's run-time system monitors server workloads and migrates application functions to an FPGA or to heterogeneous-ISA CPUs based on a scheduling policy. We develop a heuristic policy that uses application workload profiles to make scheduling decisions. Our evaluations, conducted on a system with x86-64 server CPUs, ARM64 server CPUs, and an Alveo accelerator card, reveal performance gains ranging from 1% to 88% over no-migration baselines.
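
    To make the idea of a profile-driven placement heuristic concrete, here is a minimal sketch in that spirit; the profile fields, thresholds, and target names are invented for illustration and do not reflect Xar-Trek's actual scheduling policy.

```python
# Hedged sketch of a workload-profile-driven placement decision for one hot function.
from dataclasses import dataclass

@dataclass
class Profile:
    cpu_time_s: float   # measured function runtime on the x86 host
    fpga_time_s: float  # profiled runtime of the FPGA implementation
    host_load: float    # current host CPU utilization, 0.0-1.0

def place_function(p: Profile, fpga_busy: bool) -> str:
    """Pick an execution target based on profiled runtimes and current load."""
    if not fpga_busy and p.fpga_time_s < p.cpu_time_s:
        return "fpga"      # acceleration pays off and the card is free
    if p.host_load > 0.8:  # illustrative threshold
        return "arm64"     # migrate to the less loaded heterogeneous-ISA server
    return "x86_64"        # otherwise stay put

print(place_function(Profile(cpu_time_s=2.0, fpga_time_s=0.4, host_load=0.9),
                     fpga_busy=False))
```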

    GekkoFS: A temporary burst buffer file system for HPC applications

    Many scientific fields increasingly use high-performance computing (HPC) to process and analyze massive amounts of experimental data, and storage systems in today's HPC environments have to cope with new access patterns. These patterns include many metadata operations, small I/O requests, and randomized file I/O, while general-purpose parallel file systems have been optimized for sequential shared access to large files. Burst buffer file systems create a separate file system that applications can use to store temporary data. They aggregate node-local storage available within the compute nodes or use dedicated SSD clusters, and they offer a peak bandwidth higher than that of the backend parallel file system without interfering with it. However, burst buffer file systems typically offer many features that a scientific application, running in isolation for a limited amount of time, does not require. We present GekkoFS, a temporary, highly scalable file system which has been specifically optimized for the aforementioned use cases. GekkoFS provides relaxed POSIX semantics, offering only those features that most (though not all) applications actually require. GekkoFS is therefore able to provide scalable I/O performance and reaches millions of metadata operations even at a small number of nodes, significantly outperforming the capabilities of common parallel file systems.
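
    A common usage pattern for a burst buffer file system is to keep temporary intermediate data under the burst buffer mount and copy only final results back to the parallel file system. The sketch below illustrates that pattern; the mount point and project directory paths are assumptions (GekkoFS is mounted at a user-chosen path), not fixed locations.

```python
# Illustrative staging pattern: scratch data on the burst buffer, results on the PFS.
import shutil
from pathlib import Path

BURST_BUFFER = Path("/tmp/gkfs_mnt")          # hypothetical GekkoFS mount point
PARALLEL_FS = Path("/lustre/project/run42")   # hypothetical backend PFS directory

def write_intermediate(step: int, payload: bytes) -> Path:
    """Keep per-step scratch data on node-local burst buffer storage."""
    BURST_BUFFER.mkdir(parents=True, exist_ok=True)
    out = BURST_BUFFER / f"intermediate_{step:04d}.bin"
    out.write_bytes(payload)
    return out

def publish_result(tmp_file: Path) -> None:
    """Copy only the final artifact to the parallel file system."""
    PARALLEL_FS.mkdir(parents=True, exist_ok=True)
    shutil.copy2(tmp_file, PARALLEL_FS / tmp_file.name)

publish_result(write_intermediate(0, b"example data"))
```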

    Container Resource Allocation versus Performance of Data-intensive Applications on Different Cloud Servers

    In recent years, data-intensive applications have been increasingly deployed on cloud systems. Such applications utilize significant compute, memory, and I/O resources to process large volumes of data. Optimizing the performance and cost-efficiency of such applications is a non-trivial problem. The problem becomes even more challenging with the increasing use of containers, which are popular due to their lower operational overheads and faster boot speed at the cost of weaker resource assurances for the hosted applications. In this paper, two containerized data-intensive applications with very different performance objectives and resource needs were studied on cloud servers with Docker containers running on Intel Xeon E5 and AMD EPYC Rome multi-core processors with a range of CPU, memory, and I/O configurations. Primary findings from our experiments include: 1) Allocating multiple cores to a compute-intensive application can improve performance, but only if the cores do not contend for the same caches, and the optimal core counts depend on the specific workload; 2) allocating more memory to a memory-intensive application than its deterministic data workload does not further improve performance; however, 3) having multiple such memory-intensive containers on the same server can lead to cache and memory bus contention, causing significant and volatile performance degradation. The comparative observations on Intel and AMD servers provided insights into trade-offs between larger numbers of distributed chiplets interconnected with higher-speed buses (AMD) and larger numbers of centrally integrated cores and caches with slower buses (Intel). For the two types of applications studied, the more distributed caches and faster data buses have benefited the deployment of larger numbers of containers.
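
    The cache-contention finding above suggests pinning co-located containers to disjoint core sets and capping their memory. The sketch below shows one way to do this with standard Docker CLI flags (`--cpuset-cpus`, `--memory`) driven from Python; the image name, container names, and core ranges are placeholders, and the core-to-cache mapping depends on the specific processor.

```python
# Sketch: launch two containerized workloads pinned to disjoint core sets.
import subprocess

def run_pinned(image: str, cpus: str, mem: str, name: str) -> None:
    """Start a detached container with explicit CPU-set and memory limits."""
    subprocess.run([
        "docker", "run", "--rm", "-d",
        "--name", name,
        "--cpuset-cpus", cpus,  # e.g. "0-3": keep the container on specific cores
        "--memory", mem,        # e.g. "8g": hard memory cap for the container
        image,
    ], check=True)

# Pin the two containers to separate core sets to limit cache and bus contention.
run_pinned("my-analytics:latest", cpus="0-3", mem="8g", name="job-a")
run_pinned("my-analytics:latest", cpus="4-7", mem="8g", name="job-b")
```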

    Optimizing checkpointing techniques for machine learning frameworks

    While most deep learning frameworks provide mechanisms to checkpoint models, their implementation is naive and based on the assumption of single-machine training. While this implementation can be adequate at a small scale, it becomes progressively inefficient when further scaling out the training. Since large models are often trained at large scale in HPC environments, we need checkpoint procedures which make efficient use of the available resources. Several checkpoint techniques and libraries have been designed for HPC environments, in which local storage is leveraged in order to alleviate the I/O bottleneck from writing to a parallel file system. However, deep learning training has very different data requirements than most HPC applications. In order to solve this problem, we develop DeepPart, a Python module that provides optimizations to distribute shared checkpoint data across processes, effectively transforming it into an HPC-like checkpoint procedure. DeepPart sends the resulting distributed data to FTI, a multi-level HPC checkpoint library which leverages local storage. We implement a heuristic algorithm to distribute the elements of a collection across processes while minimizing the computational cost. We devise a method for automatically choosing the best sub-collection to partition, referred to as the partition candidate, independently of the specific structure of the main collection passed to checkpoint. Additionally, we allow individual elements to be partitioned between two or more processes if our algorithm detects size imbalance. We allow sub-collections to be recursively partitioned, proportionally to their size, in order to efficiently partition non-trivial collection structures. We show that the computational cost of our approach behaves similarly to an embarrassingly parallel workload, and achieves close to ideal speed-ups with up to 16 nodes and 4 processes per node. With a model size of 20GB, we observe overall gains of 5.6x compared to a standard PyTorch checkpoint implementation. Using the BERT-LARGE model, we obtain checkpoint speed-ups of 2.1x and 2.7x, without compression and with distributed lossless compression respectively, compared to a standard PyTorch checkpoint approach. In order to understand model serialization cost, we perform an analysis using several model data sizes with different numbers of tensors. Our findings show that the serialization cost of a model is dependent on the relationship between the total model size and the number of tensors, and is optimal when this relationship lies above a lower threshold and below an upper threshold. As such, when distributing model data across processes, it is important to reduce both the total size and the number of tensors on each process proportionally in order to minimize serialization cost. Using a simulator we designed, we show how our data distribution approach scales with very large model and optimizer structures based on the large variant of the BERT model in different configurations.
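
    To illustrate the general idea of size-balanced distribution of checkpoint data, here is a minimal sketch that greedily assigns the tensors of a PyTorch state dict to ranks by byte size. The greedy bin-packing heuristic and function names are illustrative only and are not DeepPart's actual algorithm, which also handles partition-candidate selection, element splitting, and recursive partitioning.

```python
# Hedged sketch: greedy, size-balanced assignment of state-dict tensors to ranks.
import torch

def partition_state_dict(state_dict: dict, n_ranks: int) -> list[dict]:
    """Assign each tensor to the currently lightest rank (largest tensors first)."""
    shards = [dict() for _ in range(n_ranks)]
    loads = [0] * n_ranks
    items = sorted(state_dict.items(),
                   key=lambda kv: kv[1].numel() * kv[1].element_size(),
                   reverse=True)
    for name, tensor in items:
        rank = loads.index(min(loads))          # lightest rank so far
        shards[rank][name] = tensor
        loads[rank] += tensor.numel() * tensor.element_size()
    return shards

# Example: split a toy "model" across 4 ranks; each rank would then serialize its shard.
toy = {f"layer{i}.weight": torch.zeros(1024, 1024) for i in range(8)}
for r, shard in enumerate(partition_state_dict(toy, 4)):
    size_mb = sum(t.numel() * t.element_size() for t in shard.values()) / 2**20
    print(f"rank {r}: {len(shard)} tensors, {size_mb:.0f} MiB")
```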