409 research outputs found
Building Near-Real-Time Processing Pipelines with the Spark-MPI Platform
Advances in detectors and computational technologies provide new
opportunities for applied research and the fundamental sciences. Concurrently,
dramatic increases in the three Vs (Volume, Velocity, and Variety) of
experimental data and the scale of computational tasks produced the demand for
new real-time processing systems at experimental facilities. Recently, this
demand was addressed by the Spark-MPI approach connecting the Spark
data-intensive platform with the MPI high-performance framework. In contrast
with existing data management and analytics systems, Spark introduced a new
middleware based on resilient distributed datasets (RDDs), which decoupled
various data sources from high-level processing algorithms. The RDD middleware
significantly advanced the scope of data-intensive applications, spreading from
SQL queries to machine learning to graph processing. Spark-MPI further extended
the Spark ecosystem with the MPI applications using the Process Management
Interface. The paper explores this integrated platform within the context of
online ptychographic and tomographic reconstruction pipelines.Comment: New York Scientific Data Summit, August 6-9, 201
Scientific Computing Meets Big Data Technology: An Astronomy Use Case
Scientific analyses commonly compose multiple single-process programs into a
dataflow. An end-to-end dataflow of single-process programs is known as a
many-task application. Typically, tools from the HPC software stack are used to
parallelize these analyses. In this work, we investigate an alternate approach
that uses Apache Spark -- a modern big data platform -- to parallelize
many-task applications. We present Kira, a flexible and distributed astronomy
image processing toolkit using Apache Spark. We then use the Kira toolkit to
implement a Source Extractor application for astronomy images, called Kira SE.
With Kira SE as the use case, we study the programming flexibility, dataflow
richness, scheduling capacity and performance of Apache Spark running on the
EC2 cloud. By exploiting data locality, Kira SE achieves a 2.5x speedup over an
equivalent C program when analyzing a 1TB dataset using 512 cores on the Amazon
EC2 cloud. Furthermore, we show that by leveraging software originally designed
for big data infrastructure, Kira SE achieves competitive performance to the C
implementation running on the NERSC Edison supercomputer. Our experience with
Kira indicates that emerging Big Data platforms such as Apache Spark are a
performant alternative for many-task scientific applications
Accelerating Large-Scale Data Analysis by Offloading to High-Performance Computing Libraries using Alchemist
Apache Spark is a popular system aimed at the analysis of large data sets,
but recent studies have shown that certain computations---in particular, many
linear algebra computations that are the basis for solving common machine
learning problems---are significantly slower in Spark than when done using
libraries written in a high-performance computing framework such as the
Message-Passing Interface (MPI).
To remedy this, we introduce Alchemist, a system designed to call MPI-based
libraries from Apache Spark. Using Alchemist with Spark helps accelerate linear
algebra, machine learning, and related computations, while still retaining the
benefits of working within the Spark environment. We discuss the motivation
behind the development of Alchemist, and we provide a brief overview of its
design and implementation.
We also compare the performances of pure Spark implementations with those of
Spark implementations that leverage MPI-based codes via Alchemist. To do so, we
use data science case studies: a large-scale application of the conjugate
gradient method to solve very large linear systems arising in a speech
classification problem, where we see an improvement of an order of magnitude;
and the truncated singular value decomposition (SVD) of a 400GB
three-dimensional ocean temperature data set, where we see a speedup of up to
7.9x. We also illustrate that the truncated SVD computation is easily scalable
to terabyte-sized data by applying it to data sets of sizes up to 17.6TB.Comment: Accepted for publication in Proceedings of the 24th ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, London, UK,
201
PiCo: A Domain-Specific Language for Data Analytics Pipelines
In the world of Big Data analytics, there is a series of tools aiming at simplifying programming applications to be executed on clusters. Although each tool claims to provide better programming, data and execution models—for which only informal (and often confusing) semantics is generally provided—all share a common under- lying model, namely, the Dataflow model. Using this model as a starting point, it is possible to categorize and analyze almost all aspects about Big Data analytics tools from a high level perspective. This analysis can be considered as a first step toward a formal model to be exploited in the design of a (new) framework for Big Data analytics. By putting clear separations between all levels of abstraction (i.e., from the runtime to the user API), it is easier for a programmer or software designer to avoid mixing low level with high level aspects, as we are often used to see in state-of-the-art Big Data analytics frameworks.
From the user-level perspective, we think that a clearer and simple semantics is preferable, together with a strong separation of concerns. For this reason, we use the Dataflow model as a starting point to build a programming environment with a simplified programming model implemented as a Domain-Specific Language, that is on top of a stack of layers that build a prototypical framework for Big Data analytics.
The contribution of this thesis is twofold: first, we show that the proposed model is (at least) as general as existing batch and streaming frameworks (e.g., Spark, Flink, Storm, Google Dataflow), thus making it easier to understand high-level data-processing applications written in such frameworks. As result of this analysis, we provide a layered model that can represent tools and applications following the Dataflow paradigm and we show how the analyzed tools fit in each level.
Second, we propose a programming environment based on such layered model in the form of a Domain-Specific Language (DSL) for processing data collections, called PiCo (Pipeline Composition). The main entity of this programming model is the Pipeline, basically a DAG-composition of processing elements. This model is intended to give the user an unique interface for both stream and batch processing, hiding completely data management and focusing only on operations, which are represented by Pipeline stages. Our DSL will be built on top of the FastFlow library, exploiting both shared and distributed parallelism, and implemented in C++11/14 with the aim of porting C++ into the Big Data world
SparkFlow : towards high-performance data analytics for Spark-based genome analysis
The recent advances in DNA sequencing technology triggered next-generation sequencing (NGS) research in full scale. Big Data (BD) is becoming the main driver in analyzing these large-scale bioinformatic data. However, this complicated process has become the system bottleneck, requiring an amalgamation of scalable approaches to deliver the needed performance and hide the deployment complexity. Utilizing cutting-edge scientific workflows can robustly address these challenges. This paper presents a Spark-based alignment workflow called SparkFlow for massive NGS analysis over singularity containers. SparkFlow is highly scalable, reproducible, and capable of parallelizing computation by utilizing data-level parallelism and load balancing techniques in HPC and Cloud environments. The proposed workflow capitalizes on benchmarking two state-of-art NGS workflows, i.e., BaseRecalibrator and ApplyBQSR. SparkFlow realizes the ability to accelerate large-scale cancer genomic analysis by scaling vertically (HyperThreading) and horizontally (provisions on-demand). Our result demonstrates a trade-off inevitably between the targeted applications and processor architecture. SparkFlow achieves a decisive improvement in NGS computation performance, throughput, and scalability while maintaining deployment complexity. The paper’s findings aim to pave the way for a wide range of revolutionary enhancements and future trends within the High-performance Data Analytics (HPDA) genome analysis realm.Postprin
Recommended from our members
Ray: A Distributed Execution Engine for the Machine Learning Ecosystem
In recent years, growing data volumes and more sophisticated computational procedures have greatly increased the demand for computational power. Machine learning and artificial intelligence applications, for example, are notorious for their computational requirements. At the same time, Moores law is ending and processor speeds are stalling. As a result, distributed computing has become ubiquitous. While the cloud makes distributed hardware infrastructure widely accessible and therefore offers the potential of horizontal scale, developing these distributed algorithms and applications remains surprisingly hard. This is due to the inherent complexity of concurrent algorithms, the engineering challenges that arise when communicating between many machines, the requirements like fault tolerance and straggler mitigation that arise at large scale and the lack of a general-purpose distributed execution engine that can support a wide variety of applications.In this thesis, we study the requirements for a general-purpose distributed computation model and present a solution that is easy to use yet expressive and resilient to faults. At its core our model takes familiar concepts from serial programming, namely functions and classes, and generalizes them to the distributed world, therefore unifying stateless and stateful distributed computation. This model not only supports many machine learning workloads like training or serving, but is also a good t for cross-cutting machine learning applications like reinforcement learning and data processing applications like streaming or graph processing. We implement this computational model as an open-source system called Ray, which matches or exceeds the performance of specialized systems in many application domains, while also offering horizontally scalability and strong fault tolerance properties
Spark-DIY: A framework for interoperable Spark Operations with high performance Block-Based Data Models
This work was partially funded by the Spanish Ministry of Economy, Industry and Competitiveness under the grant TIN2016-79637-P ”Towards Unification of HPC and Big Data Paradigms”; the Spanish Ministry of Education under the FPU15/00422 Training Program for Academic and Teaching Staff Grant; the Advanced Scientific Computing
Research, Office of Science, U.S. Department of Energy, under Contract DE-AC02-06CH11357; and by DOE with agreement No. DE-DC000122495, program manager Laura Biven
- …