548 research outputs found

    A Workload-Specific Memory Capacity Configuration Approach for In-Memory Data Analytic Platforms

    Full text link
We propose WSMC, a workload-specific memory capacity configuration approach for Spark workloads, which guides users on memory capacity configuration by accurately predicting a workload's memory requirement under various input data sizes and parameter settings. First, WSMC classifies in-memory computing workloads into four categories according to the workloads' Data Expansion Ratio. Second, WSMC establishes a memory requirement prediction model that considers the input data size, the shuffle data size, the parallelism of the workload, and the data block size. Finally, for each workload category, WSMC calculates the shuffle data size in the prediction model in a workload-specific way. For an ad-hoc workload, WSMC can profile its Data Expansion Ratio with small-sized input data and decide which category the workload falls into. Users can then determine an accurate configuration in accordance with the corresponding memory requirement prediction. Through comprehensive evaluations with SparkBench workloads, we found that, compared with the default configuration, a configuration guided by WSMC can save over 40% of memory capacity with only slight workload performance degradation (only 5%), and, compared to a proper configuration found manually, the WSMC-guided configuration leads to only a 7% increase in memory waste with a slight improvement in workload performance (about 1%).
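The abstract does not give WSMC's actual prediction formula, so the following is only a minimal sketch of how a memory-requirement estimate combining the named inputs (input data size, shuffle data size, parallelism, block size) might look; the function names, weights, and the linear form are hypothetical placeholders, not the paper's model.

```python
# Illustrative sketch only: the formula and coefficients are assumptions that
# mirror the factors the abstract names, not WSMC's published model.

def estimate_shuffle_size_gb(input_size_gb: float, expansion_ratio: float) -> float:
    """Hypothetical category-specific shuffle estimate derived from the
    Data Expansion Ratio profiled on a small input sample."""
    return input_size_gb * expansion_ratio

def estimate_memory_gb(input_size_gb: float,
                       expansion_ratio: float,
                       parallelism: int,
                       block_size_mb: int = 128) -> float:
    """Hypothetical memory-requirement prediction combining the inputs the
    abstract lists; the weighting is made up for illustration."""
    shuffle_gb = estimate_shuffle_size_gb(input_size_gb, expansion_ratio)
    # Per-task working set scales with block size; concurrent tasks multiply it.
    per_task_gb = block_size_mb / 1024.0
    return input_size_gb + shuffle_gb + parallelism * per_task_gb

if __name__ == "__main__":
    # Example: a 50 GB input, expansion ratio 1.4, 32 concurrent tasks.
    print(f"Predicted memory: {estimate_memory_gb(50, 1.4, 32):.1f} GB")
```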

    Spark versus Flink: Understanding Performance in Big Data Analytics Frameworks

    Get PDF
Big Data analytics has recently gained increasing popularity as a tool to process large amounts of data on demand. Spark and Flink are two Apache-hosted data analytics frameworks that facilitate the development of multi-step data pipelines using directed acyclic graph patterns. Making the most out of these frameworks is challenging because efficient execution strongly relies on complex parameter configurations and on an in-depth understanding of the underlying architectural choices. Although extensive research has been devoted to improving and evaluating the performance of such analytics frameworks, most studies benchmark the platforms against Hadoop as a baseline, a rather unfair comparison considering the fundamentally different design principles. This paper aims to redress the balance by directly evaluating the performance of Spark and Flink. Our goal is to identify and explain the impact of the different architectural choices and parameter configurations on the perceived end-to-end performance. To this end, we develop a methodology for correlating the parameter settings and the operator execution plan with the resource usage. We use this methodology to dissect the performance of Spark and Flink with several representative batch and iterative workloads on up to 100 nodes. Our key finding is that neither framework outperforms the other for all data types, sizes, and job patterns. This paper provides a fine-grained characterization of the cases in which each framework is superior, and we highlight how this performance correlates with operators, resource usage, and the specifics of the internal framework design.
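The paper does not list its tuned parameters in this abstract, so the snippet below is only a hedged sketch of the kind of Spark parameter configuration such studies vary (executor sizing and parallelism); the values and the HDFS path are arbitrary examples, not settings taken from the paper.

```python
# Minimal PySpark sketch of performance-sensitive parameter configuration.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("tuning-example")
    # Executor sizing: memory and cores per executor.
    .config("spark.executor.memory", "4g")
    .config("spark.executor.cores", "2")
    # Shuffle parallelism for DataFrame/SQL stages.
    .config("spark.sql.shuffle.partitions", "200")
    # Default parallelism for RDD operations.
    .config("spark.default.parallelism", "64")
    .getOrCreate()
)

# A toy batch job whose end-to-end time is sensitive to the settings above.
lines = spark.read.text("hdfs:///data/corpus.txt")  # hypothetical input path
counts = (lines.rdd
          .flatMap(lambda row: row.value.split())
          .map(lambda w: (w, 1))
          .reduceByKey(lambda a, b: a + b))
print(counts.take(5))
spark.stop()
```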

    Understanding Spark System Performance for Image Processing in a Heterogeneous Commodity Cluster

    Get PDF
In recent years, Apache Spark has seen widespread adoption in industry and institutions due to its cache mechanism for faster Big Data analytics. However, the speed advantage Spark provides, especially in a heterogeneous cluster environment, is not obtainable out of the box; it requires the right combination of configuration parameters from the myriad parameters provided by Spark developers. Recognizing this challenge, this thesis undertakes a study to provide insight into Spark performance, particularly the impact of the choice of parameter settings. These are parameters that are critical to fast job completion and effective utilization of resources. To this end, the study focuses on two specific example applications, flowerCounter and imageClustering, for processing still image datasets of Canola plants collected during the summer of 2016 from selected plot fields using time-lapse cameras in a heterogeneous Spark cluster environment. These applications were of initial interest to the Plant Phenotyping and Imaging Research Centre (P2IRC) at the University of Saskatchewan. The P2IRC is responsible for developing systems that will aid fast analysis of large-scale seed breeding to ensure global food security. The flowerCounter application estimates the count of flowers in the images, while the imageClustering application clusters images based on physical plant attributes. Two clusters are used for the experiments: a 12-node and a 3-node cluster (including a master node), with the Hadoop Distributed File System (HDFS) as the storage medium for the image datasets. Experiments with the two case-study applications demonstrate that increasing the number of tasks does not always speed up job processing, due to increased communication overheads. Findings from other experiments show that numerous tasks with one core per executor and a small allocated memory limit parallelism within an executor and result in inefficient use of cluster resources. Executors with large CPU and memory allocations, on the other hand, do not speed up analytics, due to processing delays and thread concurrency. Further experimental results indicate that application processing time depends on input data storage in conjunction with locality levels, and that executor run time is largely dominated by disk I/O time, especially the read-time cost. With respect to horizontal node scaling, Spark scales with increasing homogeneous computing nodes, but the speed-up degrades with heterogeneous nodes. Finally, this study shows that the effectiveness of speculative task execution in mitigating the impact of slow nodes varies across the applications.
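The flowerCounter and imageClustering applications themselves are not public, so the following is only a hedged sketch of the mechanics the findings refer to: the task count chosen when reading image files from HDFS and speculative execution for slow nodes. The path, partition count, and per-image work are hypothetical stand-ins.

```python
# Hedged PySpark sketch; values and paths are illustrative assumptions.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("image-pipeline-sketch")
        # Speculative execution re-launches slow tasks on other nodes, the
        # mechanism the thesis evaluates for mitigating slow machines.
        .set("spark.speculation", "true"))
sc = SparkContext(conf=conf)

# binaryFiles yields (path, bytes) pairs; minPartitions sets the task count.
# More tasks is not always faster: each adds scheduling and communication overhead.
images = sc.binaryFiles("hdfs:///p2irc/canola/2016/*.png", minPartitions=96)

# Placeholder per-image work standing in for flower counting or feature extraction.
sizes = images.mapValues(len)
print(sizes.take(3))
sc.stop()
```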

    Storage Solutions for Big Data Systems: A Qualitative Study and Comparison

    Full text link
Big data systems development is full of challenges in view of the variety of application areas and domains that this technology promises to serve. Typically, the fundamental design decisions involved in big data systems design include choosing appropriate storage and computing infrastructures. In this age of heterogeneous systems that integrate different technologies into an optimized solution for a specific real-world problem, big data systems are no exception. As far as the storage aspect of any big data system is concerned, the primary facet is the storage infrastructure, and NoSQL appears to be the right technology to fulfill its requirements. However, every big data application has different data characteristics, and thus its data fits a different data model. This paper presents a feature and use-case analysis and comparison of the four main data models, namely document-oriented, key-value, graph, and wide-column. Moreover, a feature analysis of 80 NoSQL solutions is provided, elaborating on the criteria and points that a developer must consider while making a choice. Typically, big data storage needs to communicate with the execution engine and other processing and visualization technologies to create a comprehensive solution. This brings the second facet of big data storage, big data file formats, into the picture. The second half of the paper compares the advantages, shortcomings, and possible use cases of the available big data file formats for Hadoop, which is the foundation for most big data computing technologies. Decentralized storage and blockchain are seen as the next generation of big data storage, and their challenges and future prospects are also discussed.
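The abstract does not name the Hadoop file formats it compares; as an assumption, Parquet is used below purely as one familiar columnar example, written and read with pyarrow, to make the "big data file format" facet concrete.

```python
# Small sketch of one widely used Hadoop-ecosystem file format (Parquet).
import pyarrow as pa
import pyarrow.parquet as pq

# A toy table standing in for analytics data.
table = pa.table({
    "user_id": [1, 2, 3],
    "event":   ["view", "click", "view"],
    "ts":      [1700000000, 1700000050, 1700000110],
})

# Columnar, compressed on-disk layout readable by Spark, Hive, Flink, etc.
pq.write_table(table, "events.parquet", compression="snappy")

# Column pruning: read back only the columns a query needs.
events = pq.read_table("events.parquet", columns=["user_id", "event"])
print(events.to_pydict())
```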
