16 research outputs found

    Accelerating Large-Scale Data Analysis by Offloading to High-Performance Computing Libraries using Alchemist

    Full text link
    Apache Spark is a popular system aimed at the analysis of large data sets, but recent studies have shown that certain computations---in particular, many linear algebra computations that are the basis for solving common machine learning problems---are significantly slower in Spark than when done using libraries written in a high-performance computing framework such as the Message-Passing Interface (MPI). To remedy this, we introduce Alchemist, a system designed to call MPI-based libraries from Apache Spark. Using Alchemist with Spark helps accelerate linear algebra, machine learning, and related computations, while still retaining the benefits of working within the Spark environment. We discuss the motivation behind the development of Alchemist, and we provide a brief overview of its design and implementation. We also compare the performances of pure Spark implementations with those of Spark implementations that leverage MPI-based codes via Alchemist. To do so, we use data science case studies: a large-scale application of the conjugate gradient method to solve very large linear systems arising in a speech classification problem, where we see an improvement of an order of magnitude; and the truncated singular value decomposition (SVD) of a 400GB three-dimensional ocean temperature data set, where we see a speedup of up to 7.9x. We also illustrate that the truncated SVD computation is easily scalable to terabyte-sized data by applying it to data sets of sizes up to 17.6TB.Comment: Accepted for publication in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, London, UK, 201

    Accelerating Large-Scale Data Analysis by Offloading to High-Performance Computing Libraries using Alchemist

    Get PDF
    Apache Spark is a popular system aimed at the analysis of large data sets, but recent studies have shown that certain computations---in particular, many linear algebra computations that are the basis for solving common machine learning problems---are significantly slower in Spark than when done using libraries written in a high-performance computing framework such as the Message-Passing Interface (MPI). To remedy this, we introduce Alchemist, a system designed to call MPI-based libraries from Apache Spark. Using Alchemist with Spark helps accelerate linear algebra, machine learning, and related computations, while still retaining the benefits of working within the Spark environment. We discuss the motivation behind the development of Alchemist, and we provide a brief overview of its design and implementation. We also compare the performances of pure Spark implementations with those of Spark implementations that leverage MPI-based codes via Alchemist. To do so, we use data science case studies: a large-scale application of the conjugate gradient method to solve very large linear systems arising in a speech classification problem, where we see an improvement of an order of magnitude; and the truncated singular value decomposition (SVD) of a 400GB three-dimensional ocean temperature data set, where we see a speedup of up to 7.9x. We also illustrate that the truncated SVD computation is easily scalable to terabyte-sized data by applying it to data sets of sizes up to 17.6TB.Comment: Accepted for publication in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, London, UK, 201

    Spark-DIY: A framework for interoperable Spark Operations with high performance Block-Based Data Models

    Get PDF
    This work was partially funded by the Spanish Ministry of Economy, Industry and Competitiveness under the grant TIN2016-79637-P ”Towards Unification of HPC and Big Data Paradigms”; the Spanish Ministry of Education under the FPU15/00422 Training Program for Academic and Teaching Staff Grant; the Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy, under Contract DE-AC02-06CH11357; and by DOE with agreement No. DE-DC000122495, program manager Laura Biven

    Toward High-Performance Computing and Big Data Analytics Convergence: The Case of Spark-DIY

    Get PDF
    Convergence between high-performance computing (HPC) and big data analytics (BDA) is currently an established research area that has spawned new opportunities for unifying the platform layer and data abstractions in these ecosystems. This work presents an architectural model that enables the interoperability of established BDA and HPC execution models, reflecting the key design features that interest both the HPC and BDA communities, and including an abstract data collection and operational model that generates a unified interface for hybrid applications. This architecture can be implemented in different ways depending on the process- and data-centric platforms of choice and the mechanisms put in place to effectively meet the requirements of the architecture. The Spark-DIY platform is introduced in the paper as a prototype implementation of the architecture proposed. It preserves the interfaces and execution environment of the popular BDA platform Apache Spark, making it compatible with any Spark-based application and tool, while providing efficient communication and kernel execution via DIY, a powerful communication pattern library built on top of MPI. Later, Spark-DIY is analyzed in terms of performance by building a representative use case from the hydrogeology domain, EnKF-HGS. This application is a clear example of how current HPC simulations are evolving toward hybrid HPC-BDA applications, integrating HPC simulations within a BDA environment.This work was supported in part by the Spanish Ministry of Economy, Industry and Competitiveness under Grant TIN2016-79637-P(toward Unification of HPC and Big Data Paradigms), in part by the Spanish Ministry of Education under Grant FPU15/00422 TrainingProgram for Academic and Teaching Staff Grant, in part by the Advanced Scientific Computing Research, Office of Science, U.S.Department of Energy, under Contract DE-AC02-06CH11357, and in part by the DOE with under Agreement DE-DC000122495,Program Manager Laura Biven

    A unified framework to improve the interoperability between HPC and Big Data languages and programming models

    Get PDF
    One of the most important issues in the path to the convergence of HPC and Big Data is caused by the differences in their software stacks. Despite some research efforts, the interoperability between their programming models and languages is still limited. To deal with this problem we introduce a new computing framework called IgnisHPC, whose main objective is to unify the execution of Big Data and HPC workloads in the same framework. IgnisHPC has native support for multi-language applications using JVM and non-JVM-based languages. Since MPI was used as its backbone technology, IgnisHPC takes advantage of many communication models and network architectures. Moreover, MPI applications can be directly executed in an efficient way in the framework. The main consequence is that users could combine in the same multi-language code HPC tasks (using MPI) with Big Data tasks (using MapReduce operations). The experimental evaluation demonstrates the benefits of our proposal in terms of performance and productivity with respect to other frameworks. IgnisHPC is publicly available for the Big Data and HPC research communityThis work has been supported by MICINN, Spain (RTI2018-093336-B-C21, PLEC2021-007662), Xunta de Galicia, Spain (ED431G/08, ED431G-2019/04 and ED431C-2018/19) and the European Regional Development Fund (ERDF)S

    New technologies to bridge the gap between High Performance Computing (HPC) and Big Data

    Get PDF
    The unification of HPC and Big Data has received increasing attention in the last years. It is a common belief that exascale computing and Big Data are closely associated since HPC requires processing large-scale data from scientific instruments and simulations. But, at the same time, it was observed that tools and cultures of HPC and Big Data communities differ significantly. One of the most important issues in the path to the convergence is caused by the differences in their software stacks. This thesis will address the research challenge of bridging the gap between Big Data and HPC worlds. With this goal in mind, a set of tools and technologies will be developed and integrated into a new unified Big Data-HPC framework that will allow the execution of scientific multi-language applications on both environments using containers

    Effective techniques for understanding and improving data structure usage

    Get PDF
    Turing Award winner Niklaus Wirth famously noted, `Algorithms + Data Structures = Programs', and it follows that data structures should be carefully considered for effective application development. In fact, data structures are the main focus of program understanding, performance engineering, bug detection, and security enhancement, etc. Our research is aimed at providing effective techniques for analyzing and improving data structure usage in fundamentally new approaches: First, detecting data structures; identifying what data structures are used within an application is a critical step toward application understanding and performance engineering. Second, selecting efficient data structures; analyzing data structures' behavior can recognize improper use of data structures and suggest alternative data structures better suited for the current situation where the application runs. Third, detecting memory leaks for data structures; tracking data accesses with little overhead and their careful analysis can enable practical and accurate memory leak detection. Finally, offloading time-consuming data structure operations; By leveraging a dedicated helper thread that executes the operations on the behalf of the application thread, we can improve the overall performance of the application.Ph.D
    corecore