
    Optimizing Collective Communication for Scalable Scientific Computing and Deep Learning

    In distributed computing, collective operations involve coordinated communication and synchronization among multiple processing units, enabling efficient data exchange and collaboration. Scientific applications such as simulations, computational fluid dynamics, and scalable deep learning require complex computations that can be parallelized across the nodes of a distributed system. These applications often involve data-dependent communication patterns, in which collective operations are critical to achieving high performance. Optimizing collective operations for scientific applications and deep learning means improving the algorithms, communication patterns, and data distribution strategies to minimize communication overhead and maximize computational efficiency. This dissertation focuses on optimizing the alltoall operation in 3D Fast Fourier Transform (FFT) applications and the allreduce operation in parallel deep learning, particularly on High-Performance Computing (HPC) systems.

    Advanced communication algorithms and methods are explored and implemented to improve communication efficiency, thereby enhancing the overall performance of 3D FFT applications. The dissertation also identifies performance bottlenecks in collective communication with Horovod on distributed systems and addresses them with an optimized parallel communication pattern tailored to the training phase of distributed deep learning, with the objective of achieving faster convergence and improving overall training efficiency.

    Finally, the dissertation proposes fault tolerance and elastic scaling features for distributed deep learning by leveraging User-Level Failure Mitigation (ULFM) from the Message Passing Interface (MPI). Incorporating ULFM MPI enhances the elastic capabilities of distributed deep learning systems, enabling graceful, lightweight handling of failures while facilitating seamless scaling in dynamic computing environments.
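    The two collectives at the center of the dissertation can be illustrated with standard MPI calls. The sketch below is a minimal, self-contained example, not the dissertation's implementation: MPI_Alltoall performs the global transpose step of a distributed 3D FFT, and MPI_Allreduce averages gradients across ranks as in data-parallel training. The buffer sizes (chunk, nparams) are illustrative placeholders.

        #include <mpi.h>
        #include <stdlib.h>

        int main(int argc, char **argv)
        {
            MPI_Init(&argc, &argv);

            int rank, size;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);

            /* (1) Transpose step of a distributed 3D FFT: every rank exchanges
               an equal-sized slab of its locally transformed data with every
               other rank. */
            int chunk = 1024;                  /* elements per peer (illustrative) */
            double *sendbuf = calloc((size_t)size * chunk, sizeof *sendbuf);
            double *recvbuf = calloc((size_t)size * chunk, sizeof *recvbuf);
            /* ... fill sendbuf with locally transformed pencils/slabs ... */
            MPI_Alltoall(sendbuf, chunk, MPI_DOUBLE,
                         recvbuf, chunk, MPI_DOUBLE, MPI_COMM_WORLD);

            /* (2) Gradient averaging in data-parallel training: sum each rank's
               local gradients, then divide by the number of ranks. */
            int nparams = 1 << 20;             /* model size (illustrative) */
            float *grads = calloc((size_t)nparams, sizeof *grads);
            /* ... fill grads from the local backward pass ... */
            MPI_Allreduce(MPI_IN_PLACE, grads, nparams, MPI_FLOAT,
                          MPI_SUM, MPI_COMM_WORLD);
            for (int i = 0; i < nparams; i++)
                grads[i] /= (float)size;

            free(sendbuf); free(recvbuf); free(grads);
            MPI_Finalize();
            return 0;
        }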
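    The ULFM-based recovery pattern mentioned above can be sketched as follows, assuming an MPI library built with the ULFM extensions (the MPIX_ calls and error classes come from <mpi-ext.h>). On a detected process failure, the communicator is revoked and shrunk to the survivors so training can continue at reduced scale; redistributing work afterwards is left to the caller. This is a hedged illustration of the general recovery pattern, not the dissertation's actual code.

        #include <mpi.h>
        #include <mpi-ext.h>  /* ULFM extensions: MPIX_Comm_revoke/shrink, MPIX_ERR_* */

        /* Run an allreduce; if a peer has failed, shrink the communicator to
           the survivors and tell the caller to redistribute work and retry. */
        static int resilient_allreduce(float *buf, int n, MPI_Comm *comm)
        {
            int rc = MPI_Allreduce(MPI_IN_PLACE, buf, n, MPI_FLOAT, MPI_SUM, *comm);
            int ec;
            MPI_Error_class(rc, &ec);
            if (ec == MPIX_ERR_PROC_FAILED || ec == MPIX_ERR_REVOKED) {
                MPI_Comm survivors;
                MPIX_Comm_revoke(*comm);              /* propagate failure to all ranks */
                MPIX_Comm_shrink(*comm, &survivors);  /* communicator of survivors */
                MPI_Comm_set_errhandler(survivors, MPI_ERRORS_RETURN);
                MPI_Comm_free(comm);
                *comm = survivors;
                return 1;                             /* caller rebalances, then retries */
            }
            return 0;
        }

        int main(int argc, char **argv)
        {
            MPI_Init(&argc, &argv);
            MPI_Comm comm;
            MPI_Comm_dup(MPI_COMM_WORLD, &comm);
            MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN); /* report, don't abort */

            float grads[8] = {0};
            while (resilient_allreduce(grads, 8, &comm))
                ;  /* in real training: repartition data over the smaller world first */

            MPI_Comm_free(&comm);
            MPI_Finalize();
            return 0;
        }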

    Supporting complex workflows for data-intensive discovery reliably and efficiently

    Scientific workflows have emerged as well-established pillars of large-scale computational science, serving to formalize and structure massive amounts of complex, heterogeneous data and to accelerate scientific progress. Scientists of diverse domains can analyze their data by constructing scientific workflows, a useful paradigm for managing complex scientific computations. A workflow can analyze terabyte-scale datasets, contain numerous individual tasks, and coordinate heterogeneous tasks with the help of scientific workflow management systems (SWfMSs). However, even for expert users, workflow creation is a complex task due to the dramatic growth of tools and data heterogeneity. Scientists are now more willing to publicly share datasets and analysis pipelines in the interest of open science, and as sharing of research data and resources increases, scientists can reuse existing workflows from several workflow repositories. Unfortunately, several challenges can prevent scientists from reusing those workflows, which undermines the purpose of a community-oriented knowledge base.

    In this thesis, we first identify the repositories that scientists use to share and reuse scientific workflows. Among them, we find that Galaxy repositories host numerous workflows and that Galaxy is the most widely used SWfMS. After selecting the Galaxy repositories, we attempt to reuse their workflows and encounter several challenges. We classify each workflow's reusability status (reusable or non-reusable) and, based on the effort required, further categorize the reusable ones (reusable without modification, easily reusable, moderately difficult to reuse, and difficult to reuse). When reuse fails, we record the challenges responsible; when it succeeds, we list the actions that were required. The challenges preventing reusability include tool upgrades, unavailable tool support, design flaws, incomplete workflows, and failure to load a workflow. Overcoming them requires actions such as identifying proper input datasets, updating or upgrading tools, finding alternative tool support for obsolete tools, debugging to find and fix problematic tools and connections, and modifying tool connections. These challenges and our list of actions offer guidelines to future workflow composers for creating better workflows with enhanced reusability.

    A SWfMS stores provenance data at different phases of a workflow's life cycle, which can assist workflow construction and enables reproducibility and knowledge reuse in the scientific community. However, provenance information is usually many times larger than the workflow and its input data, and managing it grows in complexity with large-scale applications. In our second study, we document the challenges of provenance management and reuse in e-science, focusing primarily on scientific workflow approaches by examining different SWfMSs and provenance management systems, and we investigate ways to overcome those challenges.

    Because creating a workflow is difficult yet essential for data-intensive, complex analysis, and because existing workflows present several reuse challenges, in our third study we build a recommendation system that suggests tools using machine learning approaches, helping scientists create optimal, error-free, and efficient workflows from the reusable workflows in Galaxy repositories. The findings from our studies and the proposed techniques have the potential to simplify data-intensive analysis while ensuring reliability and efficiency.
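    The abstract does not specify the recommendation model beyond "machine learning approaches." As a stand-in, the sketch below implements the simplest possible baseline for next-tool recommendation: a bigram-frequency model that counts how often one tool directly follows another across the tool sequences of existing workflows and suggests the most frequent successor. The tool vocabulary (NTOOLS) and the example sequences are hypothetical.

        #include <stdio.h>

        #define NTOOLS 4  /* size of the tool vocabulary (illustrative) */

        /* next[a][b] = how often tool b directly follows tool a in mined workflows */
        static int next[NTOOLS][NTOOLS];

        /* Record the tool-to-tool transitions of one workflow's tool sequence. */
        static void train(const int *seq, int len)
        {
            for (int i = 0; i + 1 < len; i++)
                next[seq[i]][seq[i + 1]]++;
        }

        /* Suggest the most frequent successor of a tool, or -1 if unseen. */
        static int recommend(int tool)
        {
            int best = -1, best_count = 0;
            for (int b = 0; b < NTOOLS; b++)
                if (next[tool][b] > best_count) {
                    best_count = next[tool][b];
                    best = b;
                }
            return best;
        }

        int main(void)
        {
            /* Tool IDs stand in for, e.g., Galaxy tools (hypothetical mapping):
               0 = quality check, 1 = trimming, 2 = alignment, 3 = postprocessing */
            int wf1[] = {0, 1, 2, 3};
            int wf2[] = {0, 1, 2};
            int wf3[] = {0, 2, 3};
            train(wf1, 4); train(wf2, 3); train(wf3, 3);

            printf("after tool 1, recommend tool %d\n", recommend(1)); /* prints 2 */
            return 0;
        }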