
    The Changing Role of RSEs over the Lifetime of Parsl

    This position paper describes the Parsl open source research software project and its various phases over seven years. It defines four types of research software engineers (RSEs) who have been important to the project in those phases; we believe this categorization is also applicable to other research software projects.

    Leveraging Large Language Models to Build and Execute Computational Workflows

    The recent development of large language models (LLMs) with multi-billion parameters, coupled with the creation of user-friendly application programming interfaces (APIs), has paved the way for automatically generating and executing code in response to straightforward human queries. This paper explores how these emerging capabilities can be harnessed to facilitate complex scientific workflows, eliminating the need for traditional coding methods. We present initial findings from our attempt to integrate Phyloflow with OpenAI's function-calling API, and outline a strategy for developing a comprehensive workflow management system based on these concepts.
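
    The abstract does not show the integration itself, so the following is only a minimal sketch of how a workflow task might be exposed to an LLM through OpenAI's function-calling (tools) interface. The task name `run_alignment`, its parameter schema, and the local dispatcher are invented for illustration and are not part of Phyloflow or the paper's system.

```python
# Hypothetical sketch: exposing one workflow step as a tool the model may invoke.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Describe a single (invented) workflow task as a callable tool.
tools = [{
    "type": "function",
    "function": {
        "name": "run_alignment",  # hypothetical workflow task, not a Phyloflow API
        "description": "Align a set of sequences and return a job id.",
        "parameters": {
            "type": "object",
            "properties": {
                "input_path": {"type": "string"},
                "threads": {"type": "integer"},
            },
            "required": ["input_path"],
        },
    },
}]

def run_alignment(input_path: str, threads: int = 4) -> str:
    """Placeholder for submitting the real workflow task."""
    return f"submitted job for {input_path} with {threads} threads"

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Align the sequences in data/seqs.fasta"}],
    tools=tools,
)

# If the model chose to call the tool, decode its arguments and dispatch locally.
for call in response.choices[0].message.tool_calls or []:
    if call.function.name == "run_alignment":
        args = json.loads(call.function.arguments)
        print(run_alignment(**args))
```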

    Fixed-target serial crystallography at the Structural Biology Center

    Serial synchrotron crystallography enables the study of protein structures under physiological temperature and with reduced radiation damage by collecting data from thousands of crystals. The Structural Biology Center at Sector 19 of the Advanced Photon Source has implemented a fixed-target approach with a new 3D-printed mesh-holder optimized for sample handling. The holder immobilizes a crystal suspension or droplet emulsion on a nylon mesh, trapping and sealing a near-monolayer of crystals in its mother liquor between two thin Mylar films. Data can be rapidly collected in scan mode using piezoelectric linear stages assembled in an XYZ arrangement and controlled with a graphical user interface, and analyzed in near real time using a high-performance computing pipeline. Here, the system was applied to two β-lactamases: a class D serine β-lactamase from Chitinophaga pinensis DSM 2588 and the L1 metallo-β-lactamase from Stenotrophomonas maltophilia K279a.
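
    As an illustration of the scan-mode collection described above, the sketch below raster-scans a grid of mesh positions and triggers an exposure at each point. The `Stage` class, its `move()` method, and the `shoot` callback are hypothetical stand-ins, not the beamline's actual control software.

```python
# Illustrative fixed-target raster scan over an XY grid of mesh positions.
from dataclasses import dataclass

@dataclass
class Stage:
    """Hypothetical piezoelectric linear stage for one axis."""
    axis: str

    def move(self, position_um: float) -> None:
        print(f"move {self.axis} -> {position_um:.1f} um")

def raster_scan(x: Stage, y: Stage, nx: int, ny: int, pitch_um: float, shoot) -> None:
    """Serpentine scan over an nx-by-ny grid with the given pitch."""
    for j in range(ny):
        y.move(j * pitch_um)
        # Alternate column order on each row to minimize stage travel.
        cols = range(nx) if j % 2 == 0 else reversed(range(nx))
        for i in cols:
            x.move(i * pitch_um)
            shoot()  # e.g. trigger the detector / open the shutter

raster_scan(Stage("x"), Stage("y"), nx=5, ny=3, pitch_um=50.0,
            shoot=lambda: print("  expose"))
```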

    Xar-Trek: Run-Time Execution Migration among FPGAs and Heterogeneous-ISA CPUs

    Datacenter servers are increasingly heterogeneous: from x86 host CPUs, to ARM or RISC-V CPUs in NICs/SSDs, to FPGAs. Previous works have demonstrated that migrating application execution at run-time across heterogeneous-ISA CPUs can yield significant performance and energy gains, with relatively little programmer effort. However, FPGAs have often been overlooked in that context: hardware acceleration using FPGAs involves statically implementing select application functions, which prohibits dynamic and transparent migration. We present Xar-Trek, a new compiler and run-time software framework that overcomes this limitation. Xar-Trek compiles an application for several CPU ISAs and select application functions for acceleration on an FPGA, allowing execution migration between heterogeneous-ISA CPUs and FPGAs at run-time. Xar-Trek's run-time system monitors server workloads and migrates application functions to an FPGA or to heterogeneous-ISA CPUs based on a scheduling policy. We develop a heuristic policy that uses application workload profiles to make scheduling decisions. Our evaluations, conducted on a system with x86-64 server CPUs, ARM64 server CPUs, and an Alveo accelerator card, reveal performance gains ranging from 1% to 88% over no-migration baselines.
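
    To make the idea of a profile-driven placement heuristic concrete, here is a minimal sketch in that spirit; the profile fields, thresholds, and target names are invented for illustration and do not reflect Xar-Trek's actual scheduling policy.

```python
# Hedged sketch of a workload-profile-driven placement decision for one hot function.
from dataclasses import dataclass

@dataclass
class Profile:
    cpu_time_s: float   # measured function runtime on the x86 host
    fpga_time_s: float  # profiled runtime of the FPGA implementation
    host_load: float    # current host CPU utilization, 0.0-1.0

def place_function(p: Profile, fpga_busy: bool) -> str:
    """Pick an execution target based on profiled runtimes and current load."""
    if not fpga_busy and p.fpga_time_s < p.cpu_time_s:
        return "fpga"      # acceleration pays off and the card is free
    if p.host_load > 0.8:  # illustrative threshold
        return "arm64"     # migrate to the less loaded heterogeneous-ISA server
    return "x86_64"        # otherwise stay put

print(place_function(Profile(cpu_time_s=2.0, fpga_time_s=0.4, host_load=0.9),
                     fpga_busy=False))
```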

    GekkoFS: A temporary burst buffer file system for HPC applications

    Many scientific fields increasingly use high-performance computing (HPC) to process and analyze massive amounts of experimental data, and storage systems in today's HPC environments have to cope with new access patterns. These patterns include many metadata operations, small I/O requests, and randomized file I/O, while general-purpose parallel file systems have been optimized for sequential shared access to large files. Burst buffer file systems create a separate file system that applications can use to store temporary data. They aggregate node-local storage available within the compute nodes or use dedicated SSD clusters, and they offer a peak bandwidth higher than that of the backend parallel file system without interfering with it. However, burst buffer file systems typically offer many features that a scientific application, running in isolation for a limited amount of time, does not require. We present GekkoFS, a temporary, highly scalable file system which has been specifically optimized for the aforementioned use cases. GekkoFS provides relaxed POSIX semantics, offering only those features that most (though not all) applications actually require. GekkoFS is therefore able to provide scalable I/O performance and reaches millions of metadata operations even at a small number of nodes, significantly outperforming the capabilities of common parallel file systems.
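
    A common usage pattern for a burst buffer file system is to keep temporary intermediate data under the burst buffer mount and copy only final results back to the parallel file system. The sketch below illustrates that pattern; the mount point and project directory paths are assumptions (GekkoFS is mounted at a user-chosen path), not fixed locations.

```python
# Illustrative staging pattern: scratch data on the burst buffer, results on the PFS.
import shutil
from pathlib import Path

BURST_BUFFER = Path("/tmp/gkfs_mnt")          # hypothetical GekkoFS mount point
PARALLEL_FS = Path("/lustre/project/run42")   # hypothetical backend PFS directory

def write_intermediate(step: int, payload: bytes) -> Path:
    """Keep per-step scratch data on node-local burst buffer storage."""
    BURST_BUFFER.mkdir(parents=True, exist_ok=True)
    out = BURST_BUFFER / f"intermediate_{step:04d}.bin"
    out.write_bytes(payload)
    return out

def publish_result(tmp_file: Path) -> None:
    """Copy only the final artifact to the parallel file system."""
    PARALLEL_FS.mkdir(parents=True, exist_ok=True)
    shutil.copy2(tmp_file, PARALLEL_FS / tmp_file.name)

publish_result(write_intermediate(0, b"example data"))
```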

    Container Resource Allocation versus Performance of Data-intensive Applications on Different Cloud Servers

    In recent years, data-intensive applications have been increasingly deployed on cloud systems. Such applications utilize significant compute, memory, and I/O resources to process large volumes of data. Optimizing the performance and cost-efficiency of such applications is a non-trivial problem. The problem becomes even more challenging with the increasing use of containers, which are popular due to their lower operational overheads and faster boot speed at the cost of weaker resource assurances for the hosted applications. In this paper, two containerized data-intensive applications with very different performance objectives and resource needs were studied on cloud servers with Docker containers running on Intel Xeon E5 and AMD EPYC Rome multi-core processors with a range of CPU, memory, and I/O configurations. Primary findings from our experiments include: 1) Allocating multiple cores to a compute-intensive application can improve performance, but only if the cores do not contend for the same caches, and the optimal core counts depend on the specific workload; 2) allocating more memory to a memory-intensive application than its deterministic data workload does not further improve performance; however, 3) having multiple such memory-intensive containers on the same server can lead to cache and memory bus contention, causing significant and volatile performance degradation. The comparative observations on Intel and AMD servers provided insights into trade-offs between larger numbers of distributed chiplets interconnected with higher-speed buses (AMD) and larger numbers of centrally integrated cores and caches with slower buses (Intel). For the two types of applications studied, the more distributed caches and faster data buses have benefited the deployment of larger numbers of containers.
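
    The cache-contention finding above suggests pinning co-located containers to disjoint core sets and capping their memory. The sketch below shows one way to do this with standard Docker CLI flags (`--cpuset-cpus`, `--memory`) driven from Python; the image name, container names, and core ranges are placeholders, and the core-to-cache mapping depends on the specific processor.

```python
# Sketch: launch two containerized workloads pinned to disjoint core sets.
import subprocess

def run_pinned(image: str, cpus: str, mem: str, name: str) -> None:
    """Start a detached container with explicit CPU-set and memory limits."""
    subprocess.run([
        "docker", "run", "--rm", "-d",
        "--name", name,
        "--cpuset-cpus", cpus,  # e.g. "0-3": keep the container on specific cores
        "--memory", mem,        # e.g. "8g": hard memory cap for the container
        image,
    ], check=True)

# Pin the two containers to separate core sets to limit cache and bus contention.
run_pinned("my-analytics:latest", cpus="0-3", mem="8g", name="job-a")
run_pinned("my-analytics:latest", cpus="4-7", mem="8g", name="job-b")
```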

    Optimizing checkpointing techniques for machine learning frameworks

    While most deep learning frameworks provide mechanisms to checkpoint models, their implementation is naive and based on the assumption of single-machine training. While this implementation can be adequate at a small scale, it becomes progressively inefficient when further scaling out the training. Since large models are often trained at large scale in HPC environments, we need checkpoint procedures which make efficient use of the available resources. Several checkpoint techniques and libraries have been designed for HPC environments, in which local storage is leveraged in order to alleviate the I/O bottleneck from writing to a parallel file system. However, deep learning training has very different data requirements than most HPC applications. In order to solve this problem, we develop DeepPart, a Python module that provides optimizations to distribute shared checkpoint data across processes, effectively transforming it into an HPC-like checkpoint procedure. DeepPart sends the resulting distributed data to FTI, a multi-level HPC checkpoint library which leverages local storage. We implement a heuristic algorithm to distribute the elements of a collection across processes while minimizing the computational cost. We devise a method for automatically choosing the best sub-collection to partition, referred to as the partition candidate, independently of the specific structure of the main collection passed to checkpoint. Additionally, we allow individual elements to be partitioned between two or more processes if our algorithm detects size imbalance. We allow sub-collections to be recursively partitioned, proportionally to their size, in order to efficiently partition non-trivial collection structures. We show that the computational cost of our approach behaves similarly to an embarrassingly parallel workload, and achieves close to ideal speed-ups with up to 16 nodes and 4 processes per node. With a model size of 20GB, we observe overall gains of 5.6x compared to a standard PyTorch checkpoint implementation. Using the BERT-LARGE model, we obtain checkpoint speed-ups of 2.1x and 2.7x, without compression and with distributed lossless compression respectively, compared to a standard PyTorch checkpoint approach. In order to understand model serialization cost, we perform an analysis using several model data sizes with different numbers of tensors. Our findings show that the serialization cost of a model is dependent on the relationship between the total model size and the number of tensors, and is optimal when this relationship lies above a lower threshold and below an upper threshold. As such, when distributing model data across processes, it is important to reduce both the total size and the number of tensors on each process proportionally in order to minimize serialization cost. Using a simulator we designed, we show how our data distribution approach scales with very large model and optimizer structures based on the large variant of the BERT model in different configurations.
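
    To illustrate the general idea of size-balanced distribution of checkpoint data, here is a minimal sketch that greedily assigns the tensors of a PyTorch state dict to ranks by byte size. The greedy bin-packing heuristic and function names are illustrative only and are not DeepPart's actual algorithm, which also handles partition-candidate selection, element splitting, and recursive partitioning.

```python
# Hedged sketch: greedy, size-balanced assignment of state-dict tensors to ranks.
import torch

def partition_state_dict(state_dict: dict, n_ranks: int) -> list[dict]:
    """Assign each tensor to the currently lightest rank (largest tensors first)."""
    shards = [dict() for _ in range(n_ranks)]
    loads = [0] * n_ranks
    items = sorted(state_dict.items(),
                   key=lambda kv: kv[1].numel() * kv[1].element_size(),
                   reverse=True)
    for name, tensor in items:
        rank = loads.index(min(loads))          # lightest rank so far
        shards[rank][name] = tensor
        loads[rank] += tensor.numel() * tensor.element_size()
    return shards

# Example: split a toy "model" across 4 ranks; each rank would then serialize its shard.
toy = {f"layer{i}.weight": torch.zeros(1024, 1024) for i in range(8)}
for r, shard in enumerate(partition_state_dict(toy, 4)):
    size_mb = sum(t.numel() * t.element_size() for t in shard.values()) / 2**20
    print(f"rank {r}: {len(shard)} tensors, {size_mb:.0f} MiB")
```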