
    Modeling and Simulation of Spark Streaming

    As more and more devices connect to the Internet of Things, unbounded streams of data will be generated, which have to be processed "on the fly" in order to trigger automated actions and deliver real-time services. Spark Streaming is a popular real-time stream processing framework. Making efficient use of Spark Streaming and achieving stable stream processing require a careful interplay between different parameter configurations; mistakes may lead to significant resource over-provisioning and poor performance. To alleviate such issues, this paper develops an executable and configurable model named SSP (short for Spark Streaming Processing) to model and simulate Spark Streaming. SSP is written in ABS, a formal, executable, and object-oriented language for modeling distributed systems by means of concurrent object groups. SSP allows users to rapidly evaluate and compare different parameter configurations without deploying their applications on a cluster/cloud. The simulation results show that SSP is able to mimic Spark Streaming in different scenarios. Comment: 7 pages and 13 figures. This paper is published in the IEEE 32nd International Conference on Advanced Information Networking and Applications (AINA 2018).
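
    The parameter configurations mentioned above include, for instance, the batch interval, receiver rate limits, and back-pressure settings. As a rough illustration of the kind of configuration SSP lets users explore without a cluster, the sketch below shows a minimal Spark Streaming word count whose stability hinges on exactly these knobs; the concrete values and the socket source are illustrative assumptions, not taken from the paper.

        import org.apache.spark.SparkConf
        import org.apache.spark.streaming.{Seconds, StreamingContext}

        object StreamingConfigSketch {
          def main(args: Array[String]): Unit = {
            // Parameters whose interplay determines stability and resource usage:
            val conf = new SparkConf()
              .setAppName("streaming-config-sketch")
              .set("spark.executor.cores", "2")                    // parallelism per executor
              .set("spark.streaming.backpressure.enabled", "true") // adapt ingestion to processing rate
              .set("spark.streaming.receiver.maxRate", "10000")    // records/sec per receiver

            // The batch interval defines how micro-batches are formed; if processing time
            // regularly exceeds it, batches queue up and the application becomes unstable.
            val ssc = new StreamingContext(conf, Seconds(2))

            val lines = ssc.socketTextStream("localhost", 9999)    // illustrative source
            lines.flatMap(_.split("\\s+")).map(word => (word, 1)).reduceByKey(_ + _).print()

            ssc.start()
            ssc.awaitTermination()
          }
        }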

    A Survey on Automatic Parameter Tuning for Big Data Processing Systems

    Big data processing systems (e.g., Hadoop, Spark, Storm) contain a vast number of configuration parameters controlling parallelism, I/O behavior, memory settings, and compression. Improper parameter settings can cause significant performance degradation and stability issues. However, regular users and even expert administrators struggle to understand and tune these parameters to achieve good performance. We investigate existing approaches to parameter tuning for both batch and stream data processing systems and classify them into six categories: rule-based, cost modeling, simulation-based, experiment-driven, machine learning, and adaptive tuning. We summarize the pros and cons of each approach and raise some open research problems for automatic parameter tuning.
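
    To make the parameter space concrete, the sketch below sets a handful of widely documented Spark parameters from the categories named above (parallelism, memory, I/O compression); the values are illustrative only and are not recommendations from the survey.

        import org.apache.spark.SparkConf
        import org.apache.spark.sql.SparkSession

        object TuningSketch {
          def main(args: Array[String]): Unit = {
            val conf = new SparkConf()
              .setAppName("tuning-sketch")
              .set("spark.default.parallelism", "64")        // parallelism
              .set("spark.sql.shuffle.partitions", "64")     // shuffle parallelism (Spark SQL)
              .set("spark.executor.memory", "4g")            // memory settings
              .set("spark.memory.fraction", "0.6")           // heap share for execution/storage
              .set("spark.io.compression.codec", "lz4")      // I/O compression codec
              .set("spark.shuffle.compress", "true")         // compress shuffle output

            val spark = SparkSession.builder().config(conf).getOrCreate()
            // An experiment-driven or ML-based tuner would now run a representative job
            // with this configuration, record its runtime, and iterate.
            spark.stop()
          }
        }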

    Genomic repeats detection using Boyer-Moore algorithm on Apache Spark Streaming

    Detecting genomic repeats, i.e., searching for repeated base-pair patterns in Deoxyribonucleic Acid (DNA) sequences, requires a long processing time. This research builds a big-data computational model that looks for patterns in strings by modifying and implementing the Boyer-Moore algorithm on Apache Spark Streaming for human DNA sequences from the Ensemble site. Moreover, we perform experiments on cloud computing by varying the specifications of the computer clusters involved, using datasets of human DNA sequences. The results obtained show that the proposed computational model on Apache Spark Streaming is faster than standalone computing and multicore parallel computing. It can therefore be stated that the main contribution of this research, developing a computational model that reduces computational costs, has been achieved.
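
    For readers unfamiliar with the algorithm, the sketch below shows the bad-character (Horspool) variant of Boyer-Moore that such a pipeline can apply to each incoming chunk of a DNA sequence; the function names and the way chunks are fed in are assumptions for illustration, not the authors' implementation.

        // Boyer-Moore-Horspool: the bad-character shift rule of Boyer-Moore,
        // which lets the search window skip ahead on mismatches.
        object RepeatSearchSketch {

          /** Start indices of all occurrences of `pattern` in `text`. */
          def search(text: String, pattern: String): Seq[Int] = {
            val m = pattern.length
            val n = text.length
            if (m == 0 || m > n) return Seq.empty

            // Shift table: how far the window may jump when the last character mismatches.
            val shift = Array.fill(256)(m)
            for (i <- 0 until m - 1) shift(pattern(i).toInt) = m - 1 - i

            val hits = scala.collection.mutable.ArrayBuffer.empty[Int]
            var i = 0
            while (i <= n - m) {
              var j = m - 1
              while (j >= 0 && text(i + j) == pattern(j)) j -= 1
              if (j < 0) hits += i
              i += shift(text(i + m - 1).toInt)
            }
            hits.toList
          }

          def main(args: Array[String]): Unit = {
            // In a streaming job the same function could be applied per micro-batch,
            // e.g. dnaLines.flatMap(line => search(line, "GATTACA")); here we run it locally.
            println(search("ACGATTACAGGATTACA", "GATTACA")) // List(2, 10)
          }
        }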

    Data Stream Operations as First-Class Entities in Palladio

    Assessing Apache Spark Streaming with Scientific Data

    Processing real-world data requires the ability to analyze data in real time. Data processing engines like Hadoop fall short when results are needed on the fly. Apache Spark's streaming library is increasingly becoming a popular choice as it can stream and analyze a significant amount of data. To showcase and assess the abilities of Spark, various metrics were designed and evaluated using data collected from the USGODAE data catalog. The latency of streaming in Apache Spark was measured and analyzed for varying numbers of nodes in the cluster. Scalability was monitored by adding and removing nodes in the middle of a streaming job. Fault tolerance was verified by stopping nodes in the middle of a job and making sure that the job was rescheduled and completed on the remaining node(s). A full-stack application was designed to automate data collection, data processing, and visualization of the results. The Google Maps API was used to visualize results by color-coding the world map with values from various analytics.
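
    One way to collect per-batch latency figures of the kind analyzed here is to register a StreamingListener with the StreamingContext and read the delays Spark reports for each completed batch; the sketch below shows that mechanism as one possible instrumentation, not necessarily the exact one used in this work.

        import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

        // Logs scheduling, processing, and total (end-to-end) delay for every batch.
        class LatencyListener extends StreamingListener {
          override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
            val info = batch.batchInfo
            val scheduling = info.schedulingDelay.getOrElse(-1L) // ms spent waiting to be scheduled
            val processing = info.processingDelay.getOrElse(-1L) // ms spent processing the batch
            val total      = info.totalDelay.getOrElse(-1L)      // ms from arrival to completion
            println(s"records=${info.numRecords} scheduling=${scheduling}ms " +
                    s"processing=${processing}ms total=${total}ms")
          }
        }

        // Registration against an existing StreamingContext `ssc`:
        //   ssc.addStreamingListener(new LatencyListener)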

    Model-driven Scheduling for Distributed Stream Processing Systems

    Distributed Stream Processing frameworks are commonly used with the evolution of the Internet of Things (IoT). These frameworks are designed to adapt to dynamic input message rates by scaling in/out. Apache Storm, originally developed by Twitter, is a widely used stream processing engine, while others include Flink and Spark Streaming. To run streaming applications successfully, the optimal resource requirement needs to be known, since over-estimation of resources adds extra cost; hence, a strategy is needed to arrive at the optimal resource requirement for a given streaming application. In this article, we propose a model-driven approach for scheduling streaming applications that effectively utilizes a priori knowledge of the applications to provide predictable scheduling behavior. Specifically, we use application performance models to offer reliable estimates of the resource allocation required. Further, this intuition also drives resource mapping and helps narrow the gap between estimated and actual dataflow performance and resource utilization. Together, this model-driven scheduling approach gives predictable application performance and resource utilization behavior for executing a given DSPS application at a target input stream rate on distributed resources. Comment: 54 pages.
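
    The core idea of such a performance model can be illustrated with a back-of-the-envelope estimate: given a benchmarked per-task throughput for each operator in the dataflow, the number of tasks (and hence resource slots) needed to sustain a target input rate follows directly. The sketch below is in this spirit only; the operator names, rates, and selectivities are illustrative and are not the paper's model.

        object ResourceEstimateSketch {

          final case class Operator(name: String, msgsPerSecPerTask: Double, selectivity: Double)

          /** Required task count per operator for a given source input rate (messages/sec). */
          def requiredTasks(dataflow: Seq[Operator], inputRate: Double): Seq[(String, Int)] = {
            var rate = inputRate
            dataflow.map { op =>
              val tasks = math.ceil(rate / op.msgsPerSecPerTask).toInt
              rate *= op.selectivity // the operator's output rate feeds the next operator
              op.name -> tasks
            }
          }

          def main(args: Array[String]): Unit = {
            val dataflow = Seq(
              Operator("parse",     msgsPerSecPerTask = 20000, selectivity = 1.0),
              Operator("filter",    msgsPerSecPerTask = 50000, selectivity = 0.4),
              Operator("aggregate", msgsPerSecPerTask = 10000, selectivity = 0.1)
            )
            // Plan for a target input rate of 100,000 messages/sec:
            requiredTasks(dataflow, 100000).foreach { case (op, n) => println(s"$op -> $n tasks") }
            // parse -> 5 tasks, filter -> 2 tasks, aggregate -> 4 tasks
          }
        }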

    Why High-Performance Modelling and Simulation for Big Data Applications Matters

    Modelling and Simulation (M&S) offer adequate abstractions to manage the complexity of analysing big data in scientific and engineering domains. Unfortunately, big data problems are often not easily amenable to efficient and effective use of High Performance Computing (HPC) facilities and technologies. Furthermore, M&S communities typically lack the detailed expertise required to exploit the full potential of HPC solutions, while HPC specialists may not be fully aware of specific modelling and simulation requirements and applications. The COST Action IC1406 High-Performance Modelling and Simulation for Big Data Applications has created a strategic framework to foster interaction between M&S experts from various application domains on the one hand and HPC experts on the other, in order to develop effective solutions for big data applications. One of the tangible outcomes of the COST Action is a collection of case studies from various computing domains. Each case study brought together HPC and M&S experts, bearing witness to the effective cross-pollination facilitated by the COST Action. In this introductory article we argue why joining forces between the M&S and HPC communities is both timely in the big data era and crucial for success in many application domains. Moreover, we provide an overview of the state of the art in the various research areas concerned.