1,935 research outputs found

Open Source Big Data Platforms and Tools: An Analysis

Authors: Benlachmi Yassine; Hasnaoui Moulay Lahcen
Publication venue: IAES Indonesia Section
Publication date: 29/09/2021

Big data is attracting a great deal of interest in the IT and academic sectors. Computer and digital industries regularly generate more data than they have space to store. Currently, five billion people have their own mobile phone, and over two billion people are connected globally to exchange various types of data. By 2020, it is estimated that about fifty billion people will be connected to the internet. During 2020, data generation, use, and sharing are expected to be forty-four times higher than in previous years. A variety of sectors and organizations are using big data to manage various operations. As a result, a thorough examination of big data's benefits, drawbacks, meaning, and characteristics is needed. The primary goal of this research is to gather information on the open-source big data tools and platforms used by various organizations. In this paper we use a three-perspective methodology to identify the strengths and weaknesses of the workflow in the open-source big data arena. This helps to establish a pipeline of workflow events to support decision making by both researchers and entrepreneurs.

Indonesian Journal of Electrical Engineering and Informatics (IJEEI)

Scientific Computing Meets Big Data Technology: An Astronomy Use Case

Authors: Barbary Kyle; Franklin Michael J.; Nothaft Frank Austin; Patterson David A.; Perlmutter Saul; Sparks Evan; Zahn Oliver; Zhang Zhao
Publication date: 22/12/2015

Scientific analyses commonly compose multiple single-process programs into a dataflow. An end-to-end dataflow of single-process programs is known as a many-task application. Typically, tools from the HPC software stack are used to parallelize these analyses. In this work, we investigate an alternate approach that uses Apache Spark -- a modern big data platform -- to parallelize many-task applications. We present Kira, a flexible and distributed astronomy image processing toolkit using Apache Spark. We then use the Kira toolkit to implement a Source Extractor application for astronomy images, called Kira SE. With Kira SE as the use case, we study the programming flexibility, dataflow richness, scheduling capacity and performance of Apache Spark running on the EC2 cloud. By exploiting data locality, Kira SE achieves a 2.5x speedup over an equivalent C program when analyzing a 1TB dataset using 512 cores on the Amazon EC2 cloud. Furthermore, we show that by leveraging software originally designed for big data infrastructure, Kira SE achieves performance competitive with the C implementation running on the NERSC Edison supercomputer. Our experience with Kira indicates that emerging Big Data platforms such as Apache Spark are a performant alternative for many-task scientific applications.

arXiv.org e-Print Archive

Crossref

eScholarship - University of California

Performance Analysis of Hadoop MapReduce And Apache Spark for Big Data

Author: Adesokan Adeyemi
Publication date: 10/09/2020

In recent years, the volume of information has grown at an exponential rate. To yield new insights, this information must be carefully interpreted and analyzed, so there is a need for a system that can process data efficiently at all times. Distributed cloud data processing platforms are important tools for large-scale data analytics. In this area, Apache Hadoop MapReduce has become the standard. A MapReduce job reads its input data, processes it, and writes the result back to the Hadoop Distributed File System (HDFS). Limitations of its programming interface, however, have led to the development of modern dataflow-oriented frameworks such as Apache Spark, which uses Resilient Distributed Datasets (RDDs) to keep data structures in memory. Because RDDs can be stored in memory, algorithms can iterate over their data many times very efficiently. Cluster computing is a major investment for any organization that chooses to perform big data analysis. MapReduce and Spark are two well-known open-source cluster-computing frameworks for big data analysis. Cluster computing hides task complexity and offers low latency behind a simple, user-friendly programming model. It improves throughput and provides backup capacity should the main system fail. Its features include flexibility, task scheduling, higher availability, and faster processing speed. Big data analytics has become more compute-intensive as data management becomes a major issue for scientific computation. High-Performance Computing (HPC) is therefore of great importance for big data processing, and the main application of this research is the realization of HPC for big data analysis. This thesis investigates the processing capability and efficiency of Hadoop MapReduce and Apache Spark using Cloudera Manager (CM).
Cloudera Manager provides end-to-end cluster management for Cloudera Distribution for Apache Hadoop (CDH). The implementation was carried out on Amazon Web Services (AWS). AWS was used to configure a Windows virtual machine (VM). Four free-tier-eligible t2.micro Linux instances were launched using Amazon Elastic Compute Cloud (EC2) and configured into four cluster nodes over Secure Shell (SSH). A big data application was generated and injected, and both MapReduce and Spark jobs were run with different queries: scan, aggregation, two-way join, and three-way join. The time taken for each task to complete was recorded, observed, and thoroughly analyzed. Spark was observed to execute jobs faster than MapReduce.

Osuva
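
As an illustration of the aggregation query type named in the thesis abstract above, the following is a minimal pure-Python sketch of the map, shuffle, and reduce phases. It does not use Hadoop or Spark and is not the thesis's benchmark code; the sample records and the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `run_aggregation`) are hypothetical and chosen only for this example.

```python
from collections import defaultdict

# Hypothetical sample records: (page, visit_count). Not taken from the thesis.
RECORDS = [
    ("a.example", 3),
    ("b.example", 5),
    ("a.example", 4),
    ("c.example", 1),
    ("b.example", 2),
]


def map_phase(records):
    """Emit one (key, value) pair per input record."""
    for page, visits in records:
        yield page, visits


def shuffle_phase(pairs):
    """Group values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(groups):
    """Collapse each key's list of values into one aggregate (a sum here)."""
    return {key: sum(values) for key, values in groups.items()}


def run_aggregation(records):
    """Run the three phases in sequence on an in-memory list of records."""
    return reduce_phase(shuffle_phase(map_phase(records)))


totals = run_aggregation(RECORDS)
print(totals)
```

In a real MapReduce job the intermediate pairs are written out between phases and the final result goes to HDFS, whereas this sketch keeps everything in local memory for readability.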