
    A Tale of Two Data-Intensive Paradigms: Applications, Abstractions, and Architectures

    Scientific problems that depend on processing large amounts of data require overcoming challenges in multiple areas: managing large-scale data distribution, co-placement and scheduling of data with compute resources, and storing and transferring large volumes of data. We analyze the ecosystems of the two prominent paradigms for data-intensive applications, hereafter referred to as the high-performance computing and the Apache-Hadoop paradigms. We propose a basis of common terminology and functional factors upon which to analyze the two paradigms. We discuss the concept of "Big Data Ogres" and their facets as a means of understanding and characterizing the most common application workloads found across the two paradigms. We then discuss the salient features of the two paradigms, and compare and contrast the two approaches. Specifically, we examine common implementations and approaches of these paradigms, shed light upon the reasons for their current "architecture", and discuss some typical workloads that utilize them. In spite of the significant software distinctions, we believe there is architectural similarity. We discuss the potential integration of different implementations, across the different levels and components. Our comparison progresses from a fully qualitative examination of the two paradigms to a semi-quantitative methodology. We use a simple and broadly used Ogre (K-means clustering) and characterize its performance on a range of representative platforms, covering several implementations from both paradigms. Our experiments provide insight into the relative strengths of the two paradigms. We propose that the set of Ogres will serve as a benchmark to evaluate the two paradigms along different dimensions.
    Comment: 8 pages, 2 figures
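
    As a concrete illustration of the Ogre used in the semi-quantitative comparison, below is a minimal sketch of the K-means clustering kernel in plain NumPy. It is not the paper's benchmark harness or either paradigm's implementation, just Lloyd's algorithm as either ecosystem would have to express it.

```python
# Minimal K-means (Lloyd's algorithm) sketch; illustrative only, not the
# paper's benchmark code.
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Alternate nearest-centroid assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids, labels

if __name__ == "__main__":
    data = np.random.default_rng(1).normal(size=(10_000, 3))
    centers, assignment = kmeans(data, k=4)
    print(centers)
```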

    Building Near-Real-Time Processing Pipelines with the Spark-MPI Platform

    Advances in detectors and computational technologies provide new opportunities for applied research and the fundamental sciences. Concurrently, dramatic increases in the three Vs (Volume, Velocity, and Variety) of experimental data and the scale of computational tasks have produced a demand for new real-time processing systems at experimental facilities. Recently, this demand was addressed by the Spark-MPI approach, which connects the Spark data-intensive platform with the MPI high-performance framework. In contrast with existing data management and analytics systems, Spark introduced a new middleware based on resilient distributed datasets (RDDs), which decoupled various data sources from high-level processing algorithms. The RDD middleware significantly advanced the scope of data-intensive applications, spreading from SQL queries to machine learning to graph processing. Spark-MPI further extended the Spark ecosystem with MPI applications using the Process Management Interface. The paper explores this integrated platform within the context of online ptychographic and tomographic reconstruction pipelines.
    Comment: New York Scientific Data Summit, August 6-9, 201
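
    To make the decoupling concrete, here is a hedged PySpark sketch of the kind of pipeline the abstract describes: detector frames wrapped in an RDD and a per-partition processing step standing in for the MPI-side computation. The load_frames and reconstruct helpers are hypothetical placeholders; this is not Spark-MPI's actual API.

```python
# Sketch of an RDD-based pipeline step; load_frames/reconstruct are
# hypothetical stand-ins, not Spark-MPI interfaces.
from pyspark import SparkContext

def load_frames(partition_index):
    # Placeholder: each partition would normally stream frames from the
    # experimental data source (detector files, message queues, etc.).
    return [(partition_index, i) for i in range(4)]

def reconstruct(frames):
    # Placeholder for the compute-heavy reconstruction step; with Spark-MPI
    # this is where MPI ranks launched via the Process Management Interface
    # would do the work.
    return [sum(frame) for frame in frames]

if __name__ == "__main__":
    sc = SparkContext(appName="near-real-time-pipeline-sketch")
    # The RDD middleware decouples the data source ...
    frames = sc.parallelize(range(8), numSlices=8).flatMap(load_frames)
    # ... from the high-level processing algorithm applied per partition.
    results = frames.mapPartitions(lambda it: reconstruct(list(it))).collect()
    print(results)
    sc.stop()
```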

    DALiuGE: A Graph Execution Framework for Harnessing the Astronomical Data Deluge

    The Data Activated Liu Graph Engine (DALiuGE) is an execution framework for processing large astronomical datasets at the scale required by the Square Kilometre Array Phase 1 (SKA1). It includes an interface for expressing complex data reduction pipelines consisting of both data sets and algorithmic components, and an implementation run-time to execute such pipelines on distributed resources. By mapping the logical view of a pipeline to its physical realisation, DALiuGE separates the concerns of multiple stakeholders, allowing them to collectively optimise large-scale data processing solutions in a coherent manner. Execution in DALiuGE is data-activated: each individual data item autonomously triggers the processing on itself. This decentralisation also makes the execution framework very scalable and flexible, supporting pipeline sizes ranging from fewer than ten tasks running on a laptop to tens of millions of concurrent tasks on the second fastest supercomputer in the world. DALiuGE has been used in production for reducing interferometry data sets from the Karl G. Jansky Very Large Array and the Mingantu Ultrawide Spectral Radioheliograph, and is being developed as the execution framework prototype for the Science Data Processor (SDP) consortium of the Square Kilometre Array (SKA) telescope. This paper presents a technical overview of DALiuGE and discusses case studies from the CHILES and MUSER projects that use DALiuGE to execute production pipelines. In a companion paper, we provide an in-depth analysis of DALiuGE's scalability to very large numbers of tasks on two supercomputing facilities.
    Comment: 31 pages, 12 figures, currently under review by Astronomy and Computing
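
    The data-activated execution model can be illustrated with a small toy example: each data item, once complete, notifies the application components that consume it, so the graph drives itself without a central scheduler. This is a conceptual sketch only; it does not use DALiuGE's actual API or its notion of Drops.

```python
# Toy data-activated graph: data items trigger their consumers when complete.
# Conceptual only; not DALiuGE's API.
class DataItem:
    """Holds a value and notifies consumer tasks when it becomes complete."""
    def __init__(self, name):
        self.name = name
        self.value = None
        self.consumers = []

    def add_consumer(self, task):
        self.consumers.append(task)

    def complete(self, value):
        self.value = value
        for task in self.consumers:      # the data item activates its consumers
            task.on_input_ready(self)

class AppTask:
    """Runs a function once all of its input data items are complete."""
    def __init__(self, func, inputs, output):
        self.func, self.output = func, output
        self.inputs = inputs
        self.pending = {d.name for d in inputs}
        for d in inputs:
            d.add_consumer(self)

    def on_input_ready(self, item):
        self.pending.discard(item.name)
        if not self.pending:             # all inputs present: run and emit output
            self.output.complete(self.func(*(d.value for d in self.inputs)))

if __name__ == "__main__":
    raw, calibrated, image = DataItem("raw"), DataItem("calibrated"), DataItem("image")
    AppTask(lambda v: v * 2, [raw], calibrated)    # "calibrate" stage
    AppTask(lambda v: v + 1, [calibrated], image)  # "imaging" stage
    raw.complete(21)                     # data arrival drives the whole pipeline
    print(image.value)                   # -> 43
```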

    Performance Evaluation of Job Scheduling and Resource Allocation in Apache Spark

    Advancements in data acquisition techniques and devices are revolutionizing the way image data are collected, managed and processed. Devices such as time-lapse cameras and multispectral cameras generate large amounts of image data daily. Therefore, there is a clear need for many organizations and researchers to deal with large volumes of image data efficiently. On the other hand, Big Data processing on distributed systems such as Apache Spark has been gaining popularity in recent years. Apache Spark is a widely used in-memory framework for distributed processing of large datasets on a cluster of inexpensive computers. This thesis proposes using Spark for distributed processing of large amounts of image data in a time-efficient manner. However, to share cluster resources efficiently, multiple image processing applications submitted to the cluster must be appropriately scheduled by Spark cluster managers to take advantage of all the compute power and storage capacity of the cluster. Spark can run on three cluster managers (Standalone, Mesos and YARN) and provides several configuration parameters that control how resources are allocated and scheduled. Using default settings for these parameters is not enough to share cluster resources efficiently between multiple applications running concurrently. This leads to performance issues and resource underutilization because cluster administrators and users do not know which Spark cluster manager is the right fit for their applications, or how the scheduling behaviour and parameter settings of these cluster managers affect the performance of their applications in terms of resource utilization and response times. This thesis parallelizes a set of heterogeneous image processing applications, including Image Registration, Flower Counter and Image Clustering, and presents extensive comparisons and analyses of running these applications on a large server and on a Spark cluster using three different cluster managers for resource allocation: Standalone, Apache Mesos and Hadoop YARN. In addition, the thesis examines the two job scheduling and resource allocation modes available in Spark: static and dynamic allocation. Furthermore, the thesis explores the various configurations available in both modes that control speculative execution of tasks, resource size and the number of parallel tasks per job, and explains their impact on image processing applications. The thesis aims to show that using optimal values for these parameters reduces job makespan, maximizes cluster utilization, and ensures each application is allocated a fair share of cluster resources in a timely manner.
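
    For context on the kind of knobs involved, below is a hedged PySpark configuration sketch touching the parameters the thesis studies: executor sizing, dynamic allocation, speculative execution, and default parallelism. The values are illustrative assumptions, not the thesis's recommended settings.

```python
# Illustrative Spark settings for the scheduling/allocation parameters discussed;
# the values are arbitrary examples, not tuned recommendations.
from pyspark import SparkConf, SparkContext

conf = (
    SparkConf()
    .setAppName("image-processing-job")
    # Per-executor resource sizing (relevant under both allocation modes).
    .set("spark.executor.memory", "4g")
    .set("spark.executor.cores", "2")
    # Dynamic allocation: let the cluster manager grow/shrink the executor pool.
    .set("spark.dynamicAllocation.enabled", "true")
    .set("spark.dynamicAllocation.minExecutors", "2")
    .set("spark.dynamicAllocation.maxExecutors", "16")
    # An external shuffle service is typically required with dynamic allocation.
    .set("spark.shuffle.service.enabled", "true")
    # Speculative execution relaunches slow tasks on other nodes.
    .set("spark.speculation", "true")
    .set("spark.speculation.quantile", "0.75")
    # Default number of parallel tasks for wide (shuffle) operations.
    .set("spark.default.parallelism", "64")
)

sc = SparkContext(conf=conf)
# ... submit image-processing jobs (registration, flower counting, clustering) here ...
sc.stop()
```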