5,269 research outputs found
Development of HU Cloud-based Spark Applications for Streaming Data Analytics
Nowadays, streaming data overflows from various sources and technologies such as Internet of Things (IoT), making conventional data analytics methods unsuitable to manage the latency of data processing relative to the growing demand for high processing speed and algorithmically scalability [1]. Real-time streaming data analytics, which processes data while it is in motion, is required to allow many organizations to analyze streaming data effectively and efficiently for being more active in their strategies. To analyze real time “Big” streaming data, parallel and distributed computing over a cloud of computers has become a mainstream solution to allow scalability, resiliency to failure, and fast processing of massive data sets. Several open source data analytics frameworks have been proposed and developed for streaming data analytics successfully. Apache Spark is one such framework being developed at the University of California, Berkley and gains lots of attentions due to reducing IO by storing data in a memory and a unique data executing model. In Computer & Information Sciences (CISC) at Harrisburg University (HU), we have been working on building a private Cloud Computing for future research and planning to involve industry collaboration where high volumes of real time streaming data are used to develop solutions to practical problems in industry. By developing a HU Cloud based environment for Apache Spark applications for streaming data analytics with batch processing on Hadoop Distributed File System (HDFS), we can prepare future big data era that can turn big data into beneficial actions for industry needs. This research aims to develop Spark applications supporting an entire streaming data analytics workflow, which consists of data ingestion, data analytics, data visualization and data storing. In particular, we will focus on a real time stock recommender system based on state-of-the-art Machine Learning (ML)/Deep Learning (DL) frameworks such as mllib, TensorFlow, Apache mxnet and pytorch. The plan is to gather real time stock market data from Google/Yahoo finance data streams to build a model to predict a future stock market trend. The proposed Spark applications on the HU cloud-based architecture will give emphasis to finding time-series forcating module for a specific period, typically based on selected attributes. In addition, we will test scale-out architecture, efficient parallel processing and fault tolerance of Spark applications on the HU Cloud based HDFS. We believe that this research will bring the CISC program at HU significant competitive advantages globally
Performance Characterization of In-Memory Data Analytics on a Modern Cloud Server
In last decade, data analytics have rapidly progressed from traditional
disk-based processing to modern in-memory processing. However, little effort
has been devoted at enhancing performance at micro-architecture level. This
paper characterizes the performance of in-memory data analytics using Apache
Spark framework. We use a single node NUMA machine and identify the bottlenecks
hampering the scalability of workloads. We also quantify the inefficiencies at
micro-architecture level for various data analysis workloads. Through empirical
evaluation, we show that spark workloads do not scale linearly beyond twelve
threads, due to work time inflation and thread level load imbalance. Further,
at the micro-architecture level, we observe memory bound latency to be the
major cause of work time inflation.Comment: Accepted to The 5th IEEE International Conference on Big Data and
Cloud Computing (BDCloud 2015
Hive on spark and MapReduce : a methodology for parameter tuning
Project Work presented as the partial requirement for obtaining a Master's degree in Information Management, specialization in Information Systems and Technologies ManagementAs the era of “big data” has arrived, more and more companies start using distributed file systems to manage and process their data streams like the Hadoop distributed file system framework (HDFS). This software library offers a way to store large files across multiple machines. Large data sets are processed by using its inherent programming model MapReduce. Apache Spark is a relatively new alternative to Hadoop MapReduce and claims to offer a performance boost up to 10 times for certain applications, while maintaining its automatic fault tolerance. To leverage the Data Warehouse capabilities of Hadoop Apache Hive was introduced. It is a concept for Big Data analytics that works on top of Hadoop and provides data analysis tools and most importantly translates queries to MapReduce and Spark jobs. Therefore, it exploits the scalability of Hadoop and offers data exploration and mining capabilities to non-developers. However, it is difficult for users to utilize the full potential of the Apache Spark execution engine. This results in very long execution times. Therefore, this project work gives researches and companies a tuning methodology that significantly can improve the execution time of queries. As a result, this tuning methodology could optimize a real-world batch-processing query by 5 times. Moreover, it gives insides in the underlying reasons of this big improvement by using Apache Spark Monitoring tools. The result can be helpful for many practitioners and researchers that would like to optimise the performance of Spark and MapReduce queries executed in Hive on top of an Apache Hadoop cluster
Performance Analysis of Hadoop MapReduce And Apache Spark for Big Data
In the recent era, information has evolved at an exponential rate. In order to obtain new insights, this information must be carefully interpreted and analyzed. There is, therefore, a need for a system that can process data efficiently all the time. Distributed cloud computing data processing platforms are important tools for data analytics on a large scale. In this area, Apache Hadoop (High-Availability Distributed Object-Oriented Platform) MapReduce has evolved as the standard. The MapReduce job reads, processes its input data and then returns it to Hadoop Distributed Files Systems (HDFS). Although there is limitation to its programming interface, this has led to the development of modern data flow-oriented frameworks known as Apache Spark, which uses Resilient Distributed Datasets (RDDs) to execute data structures in memory. Since RDDs can be stored in the memory, algorithms can iterate very efficiently over its data many times.
Cluster computing is a major investment for any organization that chooses to perform Big Data Analysis. The MapReduce and Spark were indeed two famous open-source cluster-computing frameworks for big data analysis. Cluster computing hides the task complexity and low latency with simple user-friendly programming. It improves performance throughput, and backup uptime should the main system fail. Its features include flexibility, task scheduling, higher availability, and faster processing speed. Big Data analytics has become more computer-intensive as data management becomes a big issue for scientific computation. High-Performance Computing is undoubtedly of great importance for big data processing. The main application of this research work is towards the realization of High-Performance Computing (HPC) for Big Data Analysis.
This thesis work investigates the processing capability and efficiency of Hadoop MapReduce and Apache Spark using Cloudera Manager (CM). The Cloudera Manager provides end-to-end cluster management for Cloudera Distribution for Apache Hadoop (CDH). The implementation was carried out with Amazon Web Services (AWS). Amazon Web Service is used to configure window Virtual Machine (VM). Four Linux In-stances of free tier eligible t2.micro were launched using Amazon Elastic Compute Cloud (EC2). The Linux Instances were configured into four cluster nodes using Secure Socket Shell (SSH).
A Big Data application is generated and injected while both MapReduce and Spark job are run with different queries such as scan, aggregation, two way and three-way join. The time taken for each task to be completed are recorded, observed, and thoroughly analyzed. It was observed that Spark executes job faster than MapReduce
Integration of Skyline Queries into Spark SQL
Skyline queries are frequently used in data analytics and multi-criteria
decision support applications to filter relevant information from big amounts
of data. Apache Spark is a popular framework for processing big, distributed
data. The framework even provides a convenient SQL-like interface via the Spark
SQL module. However, skyline queries are not natively supported and require
tedious rewriting to fit the SQL standard or Spark's SQL-like language. The
goal of our work is to fill this gap. We thus provide a full-fledged
integration of the skyline operator into Spark SQL. This allows for a simple
and easy to use syntax to input skyline queries. Moreover, our empirical
results show that this integrated solution of skyline queries by far
outperforms a solution based on rewriting into standard SQL
Benchmarking Apache Spark and Hadoop MapReduce on Big Data Classification
Most of the popular Big Data analytics tools evolved to adapt their working
environment to extract valuable information from a vast amount of unstructured
data. The ability of data mining techniques to filter this helpful information
from Big Data led to the term Big Data Mining. Shifting the scope of data from
small-size, structured, and stable data to huge volume, unstructured, and
quickly changing data brings many data management challenges. Different tools
cope with these challenges in their own way due to their architectural
limitations. There are numerous parameters to take into consideration when
choosing the right data management framework based on the task at hand. In this
paper, we present a comprehensive benchmark for two widely used Big Data
analytics tools, namely Apache Spark and Hadoop MapReduce, on a common data
mining task, i.e., classification. We employ several evaluation metrics to
compare the performance of the benchmarked frameworks, such as execution time,
accuracy, and scalability. These metrics are specialized to measure the
performance for classification task. To the best of our knowledge, there is no
previous study in the literature that employs all these metrics while taking
into consideration task-specific concerns. We show that Spark is 5 times faster
than MapReduce on training the model. Nevertheless, the performance of Spark
degrades when the input workload gets larger. Scaling the environment by
additional clusters significantly improves the performance of Spark. However,
similar enhancement is not observed in Hadoop. Machine learning utility of
MapReduce tend to have better accuracy scores than that of Spark, like around
3%, even in small size data sets.Comment: 2021 5th International Conference on Cloud and Big Data Computing
(ICCBDC 2021
Seer: Empowering Software Defined Networking with Data Analytics
Network complexity is increasing, making network control and orchestration a
challenging task. The proliferation of network information and tools for data
analytics can provide an important insight into resource provisioning and
optimisation. The network knowledge incorporated in software defined networking
can facilitate the knowledge driven control, leveraging the network
programmability. We present Seer: a flexible, highly configurable data
analytics platform for network intelligence based on software defined
networking and big data principles. Seer combines a computational engine with a
distributed messaging system to provide a scalable, fault tolerant and
real-time platform for knowledge extraction. Our first prototype uses Apache
Spark for streaming analytics and open network operating system (ONOS)
controller to program a network in real-time. The first application we
developed aims to predict the mobility pattern of mobile devices inside a smart
city environment.Comment: 8 pages, 6 figures, Big data, data analytics, data mining, knowledge
centric networking (KCN), software defined networking (SDN), Seer, 2016 15th
International Conference on Ubiquitous Computing and Communications and 2016
International Symposium on Cyberspace and Security (IUCC-CSS 2016
- …