5 research outputs found

    Toward High-Performance Computing and Big Data Analytics Convergence: The Case of Spark-DIY

    Get PDF
    Convergence between high-performance computing (HPC) and big data analytics (BDA) is currently an established research area that has spawned new opportunities for unifying the platform layer and data abstractions in these ecosystems. This work presents an architectural model that enables the interoperability of established BDA and HPC execution models, reflecting the key design features that interest both the HPC and BDA communities, and including an abstract data collection and operational model that generates a unified interface for hybrid applications. This architecture can be implemented in different ways depending on the process- and data-centric platforms of choice and the mechanisms put in place to effectively meet the requirements of the architecture. The Spark-DIY platform is introduced in the paper as a prototype implementation of the architecture proposed. It preserves the interfaces and execution environment of the popular BDA platform Apache Spark, making it compatible with any Spark-based application and tool, while providing efficient communication and kernel execution via DIY, a powerful communication pattern library built on top of MPI. Later, Spark-DIY is analyzed in terms of performance by building a representative use case from the hydrogeology domain, EnKF-HGS. This application is a clear example of how current HPC simulations are evolving toward hybrid HPC-BDA applications, integrating HPC simulations within a BDA environment.This work was supported in part by the Spanish Ministry of Economy, Industry and Competitiveness under Grant TIN2016-79637-P(toward Unification of HPC and Big Data Paradigms), in part by the Spanish Ministry of Education under Grant FPU15/00422 TrainingProgram for Academic and Teaching Staff Grant, in part by the Advanced Scientific Computing Research, Office of Science, U.S.Department of Energy, under Contract DE-AC02-06CH11357, and in part by the DOE with under Agreement DE-DC000122495,Program Manager Laura Biven

    Learning from Multi-Class Imbalanced Big Data with Apache Spark

    Get PDF
    With data becoming a new form of currency, its analysis has become a top priority in both academia and industry, furthering advancements in high-performance computing and machine learning. However, these large, real-world datasets come with additional complications such as noise and class overlap. Problems are magnified when with multi-class data is presented, especially since many of the popular algorithms were originally designed for binary data. Another challenge arises when the number of examples are not evenly distributed across all classes in a dataset. This often causes classifiers to favor the majority class over the minority classes, leading to undesirable results as learning from the rare cases may be the primary goal. Many of the classic machine learning algorithms were not designed for multi-class, imbalanced data or parallelism, and so their effectiveness has been hindered. This dissertation addresses some of these challenges with in-depth experimentation using novel implementations of machine learning algorithms using Apache Spark, a distributed computing framework based on the MapReduce model designed to handle very large datasets. Experimentation showed that many of the traditional classifier algorithms do not translate well to a distributed computing environment, indicating the need for a new generation of algorithms targeting modern high-performance computing. A collection of popular oversampling methods, originally designed for small binary class datasets, have been implemented using Apache Spark for the first time to improve parallelism and add multi-class support. An extensive study on how instance level difficulty affects the learning from large datasets was also performed

    Accelerating Apache Spark with FPGAs

    No full text
    corecore