3 research outputs found

    Modeling and Simulation of Spark Streaming

    As more and more devices connect to the Internet of Things, unbounded streams of data will be generated, which have to be processed "on the fly" in order to trigger automated actions and deliver real-time services. Spark Streaming is a popular real-time stream processing framework. Using Spark Streaming efficiently and achieving stable stream processing requires a careful interplay between different parameter configurations. Mistakes may lead to significant resource overprovisioning and poor performance. To alleviate such issues, this paper develops an executable and configurable model named SSP (short for Spark Streaming Processing) to model and simulate Spark Streaming. SSP is written in ABS, a formal, executable, object-oriented language for modeling distributed systems by means of concurrent object groups. SSP allows users to rapidly evaluate and compare different parameter configurations without deploying their applications on a cluster or cloud. The simulation results show that SSP is able to mimic Spark Streaming in different scenarios. Comment: 7 pages and 13 figures. This paper is published in the IEEE 32nd International Conference on Advanced Information Networking and Applications (AINA 2018).

    An innovative parameter optimization of Spark Streaming based on D3QN with Gaussian process regression

    Nowadays, Spark Streaming, a computing framework based on Spark, is widely used to process streaming data such as social media data, IoT sensor data or web logs. Due to the extensive use of streaming media data analysis, performance optimization for Spark Streaming has gradually developed into a popular research topic. Methods for enhancing Spark Streaming's performance include task scheduling, resource allocation and data skew optimization, which primarily focus on manually tuning the parameter configuration. However, adjusting more than 200 parameters through continuous debugging is very challenging and inefficient. In this paper, we propose an improved dueling double deep Q-network (DQN) technique for parameter tuning, which can significantly improve the performance of Spark Streaming. This approach fuses reinforcement learning and Gaussian process regression to reduce the number of iterations and dramatically speed up convergence. The experimental results demonstrate that the dueling double DQN method with Gaussian process regression can improve performance by up to 30.24%.

    Scale-Out Algorithm For Apache Storm In SaaS Environment

    The main appeal of the Cloud is its cost-effective and flexible access to computing power. Apache Storm is a data processing framework used to process streaming data. In our work we explore the possibility of offering Apache Storm as a software service. Further, we take advantage of the cgroups feature in Storm to divide the computing power of a worker machine into smaller units to be offered to users. We predict that the compute bounds placed on the cgroups could be used to approximate the state of the workflow. We discuss the limitations of the current schedulers in facilitating this type of approximation, as they distribute resources in arbitrary ways. We implement a new custom scheduler that gives the user more explicit control over the way resources are distributed to components in the workflow. We further build a simple model to approximate the current state and to predict the future state of the workflow after changes in resource allocation. We propose a scale-out algorithm to increase the throughput of the workflow. We use the predictive model to measure the effects of many candidate allocations before choosing one. Our approach analyzes the strengths and drawbacks of the Stela algorithm and designs a complementary algorithm. We show that the two algorithms offset each other's drawbacks and that the combined algorithm provides allocations that maximize throughput for a much larger set of scenarios. We implement the algorithm as a standalone scheduler and evaluate the strategy through physical simulation on the Apache Storm cluster and through software simulations for a set of workflows. Adviser: Ying L