5,269 research outputs found

Development of HU Cloud-based Spark Applications for Streaming Data Analytics

Author: Cha Sangwhan
Publication venue: Digital Commons at Harrisburg University
Publication date: 01/07/2019

Streaming data now pours in from many sources and technologies, such as the Internet of Things (IoT), and conventional data analytics methods cannot keep processing latency in line with the growing demand for high processing speed and algorithmic scalability [1]. Real-time streaming data analytics, which processes data while it is in motion, is required to allow organizations to analyze streaming data effectively and efficiently and to be more active in their strategies. To analyze real-time “Big” streaming data, parallel and distributed computing over a cloud of computers has become a mainstream solution that provides scalability, resilience to failure, and fast processing of massive data sets. Several open source data analytics frameworks have been proposed and successfully developed for streaming data analytics. Apache Spark is one such framework, developed at the University of California, Berkeley, and it has gained much attention because it reduces I/O by storing data in memory and uses a unique data execution model. In Computer & Information Sciences (CISC) at Harrisburg University (HU), we have been building a private cloud computing environment for future research, and we plan to involve industry collaborators, using high volumes of real-time streaming data to develop solutions to practical problems in industry. By developing an HU Cloud-based environment for Apache Spark applications for streaming data analytics, with batch processing on the Hadoop Distributed File System (HDFS), we can prepare for a future big data era in which big data is turned into beneficial actions for industry needs. This research aims to develop Spark applications supporting an entire streaming data analytics workflow, which consists of data ingestion, data analytics, data visualization, and data storage.
In particular, we will focus on a real-time stock recommender system based on state-of-the-art Machine Learning (ML)/Deep Learning (DL) frameworks such as MLlib, TensorFlow, Apache MXNet, and PyTorch. The plan is to gather real-time stock market data from Google/Yahoo finance data streams and build a model that predicts future stock market trends. The proposed Spark applications on the HU Cloud-based architecture will emphasize finding a time-series forecasting module for a specific period, typically based on selected attributes. In addition, we will test the scale-out architecture, efficient parallel processing, and fault tolerance of Spark applications on the HU Cloud-based HDFS. We believe that this research will bring the CISC program at HU significant competitive advantages globally.

Digital Commons @ Harrisburg University of Science and Technology

Performance Characterization of In-Memory Data Analytics on a Modern Cloud Server

Author: Awan Ahsan Javed; Ayguade Eduard; Brorsson Mats; Vlassov Vladimir
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2015

In the last decade, data analytics has rapidly progressed from traditional disk-based processing to modern in-memory processing. However, little effort has been devoted to enhancing performance at the micro-architecture level. This paper characterizes the performance of in-memory data analytics using the Apache Spark framework. We use a single-node NUMA machine and identify the bottlenecks hampering the scalability of workloads. We also quantify the inefficiencies at the micro-architecture level for various data analysis workloads. Through empirical evaluation, we show that Spark workloads do not scale linearly beyond twelve threads, due to work time inflation and thread-level load imbalance. Further, at the micro-architecture level, we observe memory-bound latency to be the major cause of work time inflation.
Comment: Accepted to The 5th IEEE International Conference on Big Data and Cloud Computing (BDCloud 2015)

arXiv.org e-Print Archive

Crossref

UPCommons. Portal del coneixement obert de la UPC

Hive on spark and MapReduce : a methodology for parameter tuning

Author: Forster Rodrigo Richard
Publication date: 29/10/2018

Project Work presented as the partial requirement for obtaining a Master's degree in Information Management, specialization in Information Systems and Technologies Management.
As the era of “big data” has arrived, more and more companies are starting to use distributed file systems, such as the Hadoop Distributed File System (HDFS), to manage and process their data streams. This software library offers a way to store large files across multiple machines. Large data sets are processed using its inherent programming model, MapReduce. Apache Spark is a relatively new alternative to Hadoop MapReduce and claims to offer a performance boost of up to 10 times for certain applications, while maintaining its automatic fault tolerance. To leverage the data warehouse capabilities of Hadoop, Apache Hive was introduced. It is a concept for Big Data analytics that works on top of Hadoop, provides data analysis tools, and, most importantly, translates queries to MapReduce and Spark jobs. It therefore exploits the scalability of Hadoop and offers data exploration and mining capabilities to non-developers. However, it is difficult for users to utilize the full potential of the Apache Spark execution engine, which results in very long execution times. This project work therefore gives researchers and companies a tuning methodology that can significantly improve the execution time of queries. As a result, this tuning methodology could optimize a real-world batch-processing query by a factor of 5. Moreover, it gives insights into the underlying reasons for this large improvement by using Apache Spark monitoring tools. The result can be helpful for many practitioners and researchers who would like to optimise the performance of Spark and MapReduce queries executed in Hive on top of an Apache Hadoop cluster.

Repositório da Universidade Nova de Lisboa

Performance Analysis of Hadoop MapReduce And Apache Spark for Big Data

Author: Adesokan Adeyemi
Publication date: 10/09/2020

In the recent era, information has grown at an exponential rate. To obtain new insights, this information must be carefully interpreted and analyzed. There is, therefore, a need for a system that can process data efficiently at all times. Distributed cloud computing data processing platforms are important tools for data analytics on a large scale. In this area, Apache Hadoop (High-Availability Distributed Object-Oriented Platform) MapReduce has evolved as the standard. A MapReduce job reads and processes its input data and then writes the result back to the Hadoop Distributed File System (HDFS). However, its programming interface is limited, which has led to the development of modern data flow-oriented frameworks such as Apache Spark, which uses Resilient Distributed Datasets (RDDs) to work on data structures held in memory. Since RDDs can be stored in memory, algorithms can iterate over their data many times very efficiently. Cluster computing is a major investment for any organization that chooses to perform Big Data analysis. MapReduce and Spark are two well-known open-source cluster-computing frameworks for big data analysis. Cluster computing hides task complexity and latency behind simple, user-friendly programming. It improves throughput and provides backup uptime should the main system fail. Its features include flexibility, task scheduling, higher availability, and faster processing speed. Big Data analytics has become more compute-intensive as data management becomes a big issue for scientific computation. High-Performance Computing is undoubtedly of great importance for big data processing. The main application of this research work is towards the realization of High-Performance Computing (HPC) for Big Data analysis. This thesis work investigates the processing capability and efficiency of Hadoop MapReduce and Apache Spark using Cloudera Manager (CM).
Cloudera Manager provides end-to-end cluster management for the Cloudera Distribution for Apache Hadoop (CDH). The implementation was carried out with Amazon Web Services (AWS). AWS is used to configure a Windows Virtual Machine (VM). Four free-tier-eligible t2.micro Linux instances were launched using Amazon Elastic Compute Cloud (EC2). The Linux instances were configured into four cluster nodes using Secure Shell (SSH). A Big Data application is generated and injected, and both MapReduce and Spark jobs are run with different queries such as scan, aggregation, two-way join, and three-way join. The time taken for each task to complete is recorded, observed, and thoroughly analyzed. It was observed that Spark executes jobs faster than MapReduce.
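
As an illustration of the MapReduce model referred to in the abstract above, the following is a minimal single-process sketch of an aggregation query expressed as map, shuffle, and reduce phases. It is not taken from the thesis: the function names, the sample records, and the choice of a per-key sum are assumptions made for illustration, and a real Hadoop or Spark job would distribute these phases across cluster nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Tuple[str, float]  # hypothetical (category, amount) record


def map_phase(records: Iterable[Record]) -> Iterator[Record]:
    """Emit one (key, value) pair per input record."""
    for category, amount in records:
        yield category, amount


def shuffle_phase(pairs: Iterable[Record]) -> Dict[str, List[float]]:
    """Group all values that share the same key."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(groups: Dict[str, List[float]]) -> Dict[str, float]:
    """Aggregate each group; here the aggregation is a sum."""
    return {key: sum(values) for key, values in groups.items()}


# Hypothetical input data, standing in for rows read from HDFS.
records = [("a", 1.0), ("b", 2.0), ("a", 3.0)]
totals = reduce_phase(shuffle_phase(map_phase(records)))
print(totals)
```

In classic MapReduce the intermediate results of each job are written back to disk, whereas Spark can keep them in memory between steps, which is one commonly cited reason for the runtime difference such comparisons report.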

Osuva

Integration of Skyline Queries into Spark SQL

Author: Grasmann Lukas; Pichler Reinhard; Selzer Alexander
Publication date: 07/10/2022

Skyline queries are frequently used in data analytics and multi-criteria decision support applications to filter relevant information from large amounts of data. Apache Spark is a popular framework for processing big, distributed data. The framework also provides a convenient SQL-like interface via the Spark SQL module. However, skyline queries are not natively supported and require tedious rewriting to fit the SQL standard or Spark's SQL-like language. The goal of our work is to fill this gap. We thus provide a full-fledged integration of the skyline operator into Spark SQL. This allows for a simple and easy-to-use syntax to input skyline queries. Moreover, our empirical results show that this integrated solution of skyline queries by far outperforms a solution based on rewriting into standard SQL.
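
For readers unfamiliar with the term used in the abstract above, a skyline query returns the points that no other point dominates, that is, the Pareto-optimal set. The following is a minimal sketch in plain Python, assuming every dimension is to be minimised; the naive quadratic algorithm, the function names, and the sample data are illustrative assumptions and do not reflect the paper's Spark SQL implementation or syntax.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b in every dimension and strictly
    better in at least one (lower values are better)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def skyline(points: List[Point]) -> List[Point]:
    """Naive O(n^2) skyline: keep every point that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical hotels as (price, distance to beach); both are minimised.
hotels = [(50.0, 8.0), (80.0, 2.0), (60.0, 9.0), (90.0, 1.0), (70.0, 3.0)]
print(skyline(hotels))
```

Here `(60.0, 9.0)` is dropped because `(50.0, 8.0)` is both cheaper and closer; the remaining four hotels each represent a different trade-off between price and distance.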

arXiv.org e-Print Archive

Benchmarking Apache Spark and Hadoop MapReduce on Big Data Classification

Author: Cakmak Ali; Tekdogan Taha
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 21/09/2022

Most of the popular Big Data analytics tools evolved to adapt their working environment to extract valuable information from a vast amount of unstructured data. The ability of data mining techniques to filter this helpful information from Big Data led to the term Big Data Mining. Shifting the scope of data from small-size, structured, and stable data to huge-volume, unstructured, and quickly changing data brings many data management challenges. Different tools cope with these challenges in their own way due to their architectural limitations. There are numerous parameters to take into consideration when choosing the right data management framework based on the task at hand. In this paper, we present a comprehensive benchmark for two widely used Big Data analytics tools, namely Apache Spark and Hadoop MapReduce, on a common data mining task, i.e., classification. We employ several evaluation metrics to compare the performance of the benchmarked frameworks, such as execution time, accuracy, and scalability. These metrics are specialized to measure the performance for the classification task. To the best of our knowledge, there is no previous study in the literature that employs all these metrics while taking into consideration task-specific concerns. We show that Spark is 5 times faster than MapReduce on training the model. Nevertheless, the performance of Spark degrades when the input workload gets larger. Scaling the environment by additional clusters significantly improves the performance of Spark. However, similar enhancement is not observed in Hadoop. The machine learning utilities of MapReduce tend to have better accuracy scores than those of Spark, by around 3%, even on small data sets.
Comment: 2021 5th International Conference on Cloud and Big Data Computing (ICCBDC 2021)

arXiv.org e-Print Archive

Seer: Empowering Software Defined Networking with Data Analytics

Author: Nejabati Reza; Sideris Kyriakos; Simeonidou Dimitra
Publication date: 04/10/2016

Network complexity is increasing, making network control and orchestration a challenging task. The proliferation of network information and tools for data analytics can provide important insight into resource provisioning and optimisation. The network knowledge incorporated in software defined networking can facilitate knowledge-driven control, leveraging network programmability. We present Seer: a flexible, highly configurable data analytics platform for network intelligence based on software defined networking and big data principles. Seer combines a computational engine with a distributed messaging system to provide a scalable, fault-tolerant, and real-time platform for knowledge extraction. Our first prototype uses Apache Spark for streaming analytics and the Open Network Operating System (ONOS) controller to program a network in real time. The first application we developed aims to predict the mobility pattern of mobile devices inside a smart city environment.
Comment: 8 pages, 6 figures, Big data, data analytics, data mining, knowledge centric networking (KCN), software defined networking (SDN), Seer, 2016 15th International Conference on Ubiquitous Computing and Communications and 2016 International Symposium on Cyberspace and Security (IUCC-CSS 2016)

arXiv.org e-Print Archive

Crossref

Explore Bristol Research