
    Modeling and Simulation of Spark Streaming

    As more and more devices connect to the Internet of Things, unbounded streams of data will be generated, which have to be processed "on the fly" in order to trigger automated actions and deliver real-time services. Spark Streaming is a popular real-time stream processing framework. Making efficient use of Spark Streaming and achieving stable stream processing require a careful interplay between different parameter configurations; mistakes may lead to significant resource over-provisioning and poor performance. To alleviate such issues, this paper develops an executable and configurable model named SSP (short for Spark Streaming Processing) to model and simulate Spark Streaming. SSP is written in ABS, a formal, executable, and object-oriented language for modeling distributed systems by means of concurrent object groups. SSP allows users to rapidly evaluate and compare different parameter configurations without deploying their applications on a cluster/cloud. The simulation results show that SSP is able to mimic Spark Streaming in different scenarios. Comment: 7 pages and 13 figures. This paper is published in the IEEE 32nd International Conference on Advanced Information Networking and Applications (AINA 2018).
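
    The parameter configurations mentioned above include, for instance, the batch interval, receiver rate limits, and back-pressure settings. As a rough illustration of the kind of configuration SSP lets users explore without a cluster, the sketch below shows a minimal Spark Streaming word count whose stability hinges on exactly these knobs; the concrete values and the socket source are illustrative assumptions, not taken from the paper.

        import org.apache.spark.SparkConf
        import org.apache.spark.streaming.{Seconds, StreamingContext}

        object StreamingConfigSketch {
          def main(args: Array[String]): Unit = {
            // Parameters whose interplay determines stability and resource usage:
            val conf = new SparkConf()
              .setAppName("streaming-config-sketch")
              .set("spark.executor.cores", "2")                    // parallelism per executor
              .set("spark.streaming.backpressure.enabled", "true") // adapt ingestion to processing rate
              .set("spark.streaming.receiver.maxRate", "10000")    // records/sec per receiver

            // The batch interval defines how micro-batches are formed; if processing time
            // regularly exceeds it, batches queue up and the application becomes unstable.
            val ssc = new StreamingContext(conf, Seconds(2))

            val lines = ssc.socketTextStream("localhost", 9999)    // illustrative source
            lines.flatMap(_.split("\\s+")).map(word => (word, 1)).reduceByKey(_ + _).print()

            ssc.start()
            ssc.awaitTermination()
          }
        }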

    A Survey on Automatic Parameter Tuning for Big Data Processing Systems

    Big data processing systems (e.g., Hadoop, Spark, Storm) contain a vast number of configuration parameters controlling parallelism, I/O behavior, memory settings, and compression. Improper parameter settings can cause significant performance degradation and stability issues. However, regular users and even expert administrators struggle to understand and tune these parameters to achieve good performance. We investigate existing approaches to parameter tuning for both batch and stream data processing systems and classify them into six categories: rule-based, cost modeling, simulation-based, experiment-driven, machine learning, and adaptive tuning. We summarize the pros and cons of each approach and raise some open research problems for automatic parameter tuning.
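
    To make the parameter space concrete, the sketch below sets a handful of widely documented Spark parameters from the categories named above (parallelism, memory, I/O compression); the values are illustrative only and are not recommendations from the survey.

        import org.apache.spark.SparkConf
        import org.apache.spark.sql.SparkSession

        object TuningSketch {
          def main(args: Array[String]): Unit = {
            val conf = new SparkConf()
              .setAppName("tuning-sketch")
              .set("spark.default.parallelism", "64")        // parallelism
              .set("spark.sql.shuffle.partitions", "64")     // shuffle parallelism (Spark SQL)
              .set("spark.executor.memory", "4g")            // memory settings
              .set("spark.memory.fraction", "0.6")           // heap share for execution/storage
              .set("spark.io.compression.codec", "lz4")      // I/O compression codec
              .set("spark.shuffle.compress", "true")         // compress shuffle output

            val spark = SparkSession.builder().config(conf).getOrCreate()
            // An experiment-driven or ML-based tuner would now run a representative job
            // with this configuration, record its runtime, and iterate.
            spark.stop()
          }
        }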

    Genomic repeats detection using Boyer-Moore algorithm on Apache Spark Streaming

    Detecting genomic repeats, i.e., searching for repeated base-pair patterns in Deoxyribonucleic Acid (DNA) sequences, requires a long processing time. This research builds a big-data computational model that looks for patterns in strings by modifying and implementing the Boyer-Moore algorithm on Apache Spark Streaming for human DNA sequences from the Ensemble site. Moreover, we perform experiments on cloud computing by varying the specifications of the computer clusters involved, using datasets of human DNA sequences. The results obtained show that the proposed computational model on Apache Spark Streaming is faster than standalone computing and multicore parallel computing. It can therefore be stated that the main contribution of this research, developing a computational model that reduces computational costs, has been achieved.
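
    For readers unfamiliar with the algorithm, the sketch below shows the bad-character (Horspool) variant of Boyer-Moore that such a pipeline can apply to each incoming chunk of a DNA sequence; the function names and the way chunks are fed in are assumptions for illustration, not the authors' implementation.

        // Boyer-Moore-Horspool: the bad-character shift rule of Boyer-Moore,
        // which lets the search window skip ahead on mismatches.
        object RepeatSearchSketch {

          /** Start indices of all occurrences of `pattern` in `text`. */
          def search(text: String, pattern: String): Seq[Int] = {
            val m = pattern.length
            val n = text.length
            if (m == 0 || m > n) return Seq.empty

            // Shift table: how far the window may jump when the last character mismatches.
            val shift = Array.fill(256)(m)
            for (i <- 0 until m - 1) shift(pattern(i).toInt) = m - 1 - i

            val hits = scala.collection.mutable.ArrayBuffer.empty[Int]
            var i = 0
            while (i <= n - m) {
              var j = m - 1
              while (j >= 0 && text(i + j) == pattern(j)) j -= 1
              if (j < 0) hits += i
              i += shift(text(i + m - 1).toInt)
            }
            hits.toList
          }

          def main(args: Array[String]): Unit = {
            // In a streaming job the same function could be applied per micro-batch,
            // e.g. dnaLines.flatMap(line => search(line, "GATTACA")); here we run it locally.
            println(search("ACGATTACAGGATTACA", "GATTACA")) // List(2, 10)
          }
        }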

    Data Stream Operations as First-Class Entities in Palladio

    Assessing Apache Spark Streaming with Scientific Data

    Processing real-world data requires the ability to analyze data in real time. Data processing engines like Hadoop fall short when results are needed on the fly. Apache Spark's streaming library is increasingly becoming a popular choice as it can stream and analyze a significant amount of data. To showcase and assess the abilities of Spark, various metrics were designed and evaluated using data collected from the USGODAE data catalog. The latency of streaming in Apache Spark was measured and analyzed for varying numbers of nodes in the cluster. Scalability was monitored by adding and removing nodes in the middle of a streaming job. Fault tolerance was verified by stopping nodes in the middle of a job and making sure that the job was rescheduled and completed on the remaining node(s). A full-stack application was designed to automate data collection, data processing, and visualization of the results. The Google Maps API was used to visualize results by color-coding the world map with values from various analytics.
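
    One way to collect per-batch latency figures of the kind analyzed here is to register a StreamingListener with the StreamingContext and read the delays Spark reports for each completed batch; the sketch below shows that mechanism as one possible instrumentation, not necessarily the exact one used in this work.

        import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

        // Logs scheduling, processing, and total (end-to-end) delay for every batch.
        class LatencyListener extends StreamingListener {
          override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
            val info = batch.batchInfo
            val scheduling = info.schedulingDelay.getOrElse(-1L) // ms spent waiting to be scheduled
            val processing = info.processingDelay.getOrElse(-1L) // ms spent processing the batch
            val total      = info.totalDelay.getOrElse(-1L)      // ms from arrival to completion
            println(s"records=${info.numRecords} scheduling=${scheduling}ms " +
                    s"processing=${processing}ms total=${total}ms")
          }
        }

        // Registration against an existing StreamingContext `ssc`:
        //   ssc.addStreamingListener(new LatencyListener)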

    Model-driven Scheduling for Distributed Stream Processing Systems

    Distributed Stream Processing frameworks are commonly used with the evolution of the Internet of Things (IoT). These frameworks are designed to adapt to dynamic input message rates by scaling in/out. Apache Storm, originally developed by Twitter, is a widely used stream processing engine, while others include Flink and Spark Streaming. To run streaming applications successfully, the optimal resource requirement needs to be known, since over-estimation of resources adds extra cost; hence, a strategy is needed to arrive at the optimal resource requirement for a given streaming application. In this article, we propose a model-driven approach for scheduling streaming applications that effectively utilizes a priori knowledge of the applications to provide predictable scheduling behavior. Specifically, we use application performance models to offer reliable estimates of the resource allocation required. Further, this intuition also drives resource mapping and helps narrow the gap between estimated and actual dataflow performance and resource utilization. Together, this model-driven scheduling approach gives predictable application performance and resource utilization behavior for executing a given DSPS application at a target input stream rate on distributed resources. Comment: 54 pages.
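
    The core idea of such a performance model can be illustrated with a back-of-the-envelope estimate: given a benchmarked per-task throughput for each operator in the dataflow, the number of tasks (and hence resource slots) needed to sustain a target input rate follows directly. The sketch below is in this spirit only; the operator names, rates, and selectivities are illustrative and are not the paper's model.

        object ResourceEstimateSketch {

          final case class Operator(name: String, msgsPerSecPerTask: Double, selectivity: Double)

          /** Required task count per operator for a given source input rate (messages/sec). */
          def requiredTasks(dataflow: Seq[Operator], inputRate: Double): Seq[(String, Int)] = {
            var rate = inputRate
            dataflow.map { op =>
              val tasks = math.ceil(rate / op.msgsPerSecPerTask).toInt
              rate *= op.selectivity // the operator's output rate feeds the next operator
              op.name -> tasks
            }
          }

          def main(args: Array[String]): Unit = {
            val dataflow = Seq(
              Operator("parse",     msgsPerSecPerTask = 20000, selectivity = 1.0),
              Operator("filter",    msgsPerSecPerTask = 50000, selectivity = 0.4),
              Operator("aggregate", msgsPerSecPerTask = 10000, selectivity = 0.1)
            )
            // Plan for a target input rate of 100,000 messages/sec:
            requiredTasks(dataflow, 100000).foreach { case (op, n) => println(s"$op -> $n tasks") }
            // parse -> 5 tasks, filter -> 2 tasks, aggregate -> 4 tasks
          }
        }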

    Why High-Performance Modelling and Simulation for Big Data Applications Matters

    Modelling and Simulation (M&S) offer adequate abstractions to manage the complexity of analysing big data in scientific and engineering domains. Unfortunately, big data problems are often not easily amenable to efficient and effective use of High Performance Computing (HPC) facilities and technologies. Furthermore, M&S communities typically lack the detailed expertise required to exploit the full potential of HPC solutions, while HPC specialists may not be fully aware of specific modelling and simulation requirements and applications. The COST Action IC1406 High-Performance Modelling and Simulation for Big Data Applications has created a strategic framework to foster interaction between M&S experts from various application domains on the one hand and HPC experts on the other, in order to develop effective solutions for big data applications. One of the tangible outcomes of the COST Action is a collection of case studies from various computing domains. Each case study brought together HPC and M&S experts, bearing witness to the effective cross-pollination facilitated by the COST Action. In this introductory article we argue why joining forces between the M&S and HPC communities is both timely in the big data era and crucial for success in many application domains. Moreover, we provide an overview of the state of the art in the various research areas concerned.