8 research outputs found

    Accelerating Large-Scale Data Analysis by Offloading to High-Performance Computing Libraries using Alchemist

    Full text link
    Apache Spark is a popular system aimed at the analysis of large data sets, but recent studies have shown that certain computations---in particular, many linear algebra computations that are the basis for solving common machine learning problems---are significantly slower in Spark than when done using libraries written in a high-performance computing framework such as the Message-Passing Interface (MPI). To remedy this, we introduce Alchemist, a system designed to call MPI-based libraries from Apache Spark. Using Alchemist with Spark helps accelerate linear algebra, machine learning, and related computations, while still retaining the benefits of working within the Spark environment. We discuss the motivation behind the development of Alchemist, and we provide a brief overview of its design and implementation. We also compare the performances of pure Spark implementations with those of Spark implementations that leverage MPI-based codes via Alchemist. To do so, we use data science case studies: a large-scale application of the conjugate gradient method to solve very large linear systems arising in a speech classification problem, where we see an improvement of an order of magnitude; and the truncated singular value decomposition (SVD) of a 400GB three-dimensional ocean temperature data set, where we see a speedup of up to 7.9x. We also illustrate that the truncated SVD computation is easily scalable to terabyte-sized data by applying it to data sets of sizes up to 17.6TB.Comment: Accepted for publication in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, London, UK, 201

    Accelerating Large-Scale Data Analysis by Offloading to High-Performance Computing Libraries using Alchemist

    Get PDF
    Apache Spark is a popular system aimed at the analysis of large data sets, but recent studies have shown that certain computations---in particular, many linear algebra computations that are the basis for solving common machine learning problems---are significantly slower in Spark than when done using libraries written in a high-performance computing framework such as the Message-Passing Interface (MPI). To remedy this, we introduce Alchemist, a system designed to call MPI-based libraries from Apache Spark. Using Alchemist with Spark helps accelerate linear algebra, machine learning, and related computations, while still retaining the benefits of working within the Spark environment. We discuss the motivation behind the development of Alchemist, and we provide a brief overview of its design and implementation. We also compare the performances of pure Spark implementations with those of Spark implementations that leverage MPI-based codes via Alchemist. To do so, we use data science case studies: a large-scale application of the conjugate gradient method to solve very large linear systems arising in a speech classification problem, where we see an improvement of an order of magnitude; and the truncated singular value decomposition (SVD) of a 400GB three-dimensional ocean temperature data set, where we see a speedup of up to 7.9x. We also illustrate that the truncated SVD computation is easily scalable to terabyte-sized data by applying it to data sets of sizes up to 17.6TB.Comment: Accepted for publication in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, London, UK, 201

    Parallel Flow-Based Hypergraph Partitioning

    Get PDF
    We present a shared-memory parallelization of flow-based refinement, which is considered the most powerful iterative improvement technique for hypergraph partitioning at the moment. Flow-based refinement works on bipartitions, so current sequential partitioners schedule it on different block pairs to improve k-way partitions. We investigate two different sources of parallelism: a parallel scheduling scheme and a parallel maximum flow algorithm based on the well-known push-relabel algorithm. In addition to thoroughly engineered implementations, we propose several optimizations that substantially accelerate the algorithm in practice, enabling the use on extremely large hypergraphs (up to 1 billion pins). We integrate our approach in the state-of-the-art parallel multilevel framework Mt-KaHyPar and conduct extensive experiments on a benchmark set of more than 500 real-world hypergraphs, to show that the partition quality of our code is on par with the highest quality sequential code (KaHyPar), while being an order of magnitude faster with 10 threads

    Understanding and Adapting Tree Ensembles: A Training Data Perspective

    Get PDF
    Despite the impressive success of deep-learning models on unstructured data (e.g., images, audio, text), tree-based ensembles such as random forests and gradient-boosted trees are hugely popular and remain the preferred choice for tabular or structured data, and are regularly used to win challenges on data-competition websites such as Kaggle and DrivenData. Despite their impressive predictive performance, tree-based ensembles lack certain characteristics which may limit their further adoption, especially for safety-critical or privacy-sensitive domains such as weather forecasting or predictive medical modeling. This dissertation investigates the shortcomings currently facing tree-based ensembles---lack of explainable predictions, limited uncertainty estimation, and inefficient adaptability to changes in the training data---and posits that numerous improvements to tree-based ensembles can be made by analyzing the relationships between the training data and the resulting learned model. By studying the effects of one or many training examples on tree-based ensembles, we develop solutions for these models which (1) increase their predictive explainability, (2) provide accurate uncertainty estimates for individual predictions, and (3) efficiently adapt learned models to accurately reflect updated training data. This dissertation includes previously published coauthored material

    Forecasting Large-scale Time Series Data

    Get PDF
    The forecasting of time series data is an integral component for management, planning, and decision making in many domains. The prediction of electricity demand and supply in the energy domain or sales figures in market research are just two of the many application scenarios that require thorough predictions. Many of these domains have in common that they are influenced by the Big Data trend which also affects the time series forecasting. Data sets consist of thousands of temporal fine grained time series and have to be predicted in reasonable time. The time series may suffer from noisy behavior and missing values which makes modeling these time series especially hard, nonetheless accurate predictions are required. Furthermore, data sets from different domains exhibit various characteristics. Therefore, forecast techniques have to be flexible and adaptable to these characteristics. Long-established forecast techniques like ARIMA and Exponential Smoothing do not fulfill these new requirements. Most of the traditional models only represent one individual time series. This makes the prediction of thousands of time series very time consuming, as an equally large number of models has to be created. Furthermore, these models do not incorporate additional data sources and are, therefore, not capable of compensating missing measurements or noisy behavior of individual time series. In this thesis, we introduce CSAR (Cross-Sectional AutoRegression Model), a new forecast technique which is designed to address the new requirements on forecasting large-scale time series data. It is based on the novel concept of cross-sectional forecasting that assumes that time series from the same domain follow a similar behavior and represents many time series with one common model. CSAR combines this new approach with the modeling concept of ARIMA to make the model adaptable to the various properties of data sets from different domains. Furthermore, we introduce auto.CSAR, that helps to configure the model and to choose the right model components for a specific data set and forecast task. With CSAR, we present a new forecast technique that is suited for the prediction of large-scale time series data. By representing many time series with one model, large data sets can be predicted in short time. Furthermore, using data from many time series in one model helps to compensate missing values and noisy behavior of individual series. The evaluation on three real world data sets shows that CSAR outperforms long-established forecast techniques in accuracy and execution time. Finally, with auto.CSAR, we create a way to apply CSAR to new data sets without requiring the user to have extensive knowledge about our new forecast technique and its configuration
    corecore