Hadoop performance modeling and job optimization for big data analytics
This thesis was submitted for the award of Doctor of Philosophy and was awarded by Brunel University London. Big data has gained momentum in both academia and industry. The MapReduce model has emerged as a major computing model in support of big data analytics, and Hadoop, an open source implementation of the MapReduce model, has been widely taken up by the community. Cloud service providers such as Amazon EC2 now support Hadoop user applications. However, a key challenge is that cloud service providers do not have a resource provisioning mechanism to satisfy user jobs with deadline requirements; currently it is solely the user's responsibility to estimate the amount of resources required to run their jobs in a public cloud. This thesis presents a Hadoop performance model that accurately estimates the execution duration of a job and further provisions the amount of resources required for a job to be completed within a deadline. The proposed model employs Locally Weighted Linear Regression (LWLR) to estimate the execution time of a job and the Lagrange Multiplier technique to provision the resources needed to satisfy a user job with a given deadline. The performance of the proposed model is extensively evaluated both on an in-house Hadoop cluster and on the Amazon EC2 cloud. Experimental results show that the proposed model is highly accurate in estimating job execution time, and that jobs are completed within the required deadlines when the model's resource provisioning scheme is followed. In addition, the Hadoop framework has over 190 configuration parameters, some of which have significant effects on the performance of a Hadoop job. Manually setting optimum values for these parameters is a challenging and time-consuming task. This thesis presents optimization work that enhances the performance of Hadoop by automatically tuning its parameter values. It employs the Gene Expression Programming (GEP) technique to build an objective function that represents the performance of a job and the correlations among the configuration parameters. For the purpose of optimization, Particle Swarm Optimization (PSO) is employed to automatically find optimal or near-optimal configuration settings. The performance of the proposed work is extensively evaluated on a Hadoop cluster, and the experimental results show that it enhances the performance of Hadoop significantly compared with the default settings.
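To make the estimation step concrete, the following is a minimal sketch of locally weighted linear regression applied to job runtime prediction, assuming only a small history of (input size, runtime) pairs. The feature choice, bandwidth value, and sample numbers are illustrative and not taken from the thesis; the Lagrange Multiplier provisioning step described in the abstract is not shown.

```python
import numpy as np

def lwlr_predict(x_query, x_hist, y_hist, tau=0.5):
    """Predict job runtime at x_query from historical (feature, runtime) pairs
    using locally weighted linear regression with a Gaussian kernel."""
    X = np.column_stack([np.ones(len(x_hist)), x_hist])   # add intercept column
    x = np.array([1.0, x_query])
    # Weight each historical job by its closeness to the queried input size.
    w = np.exp(-((x_hist - x_query) ** 2) / (2 * tau ** 2))
    W = np.diag(w)
    # Solve the weighted normal equations: theta = (X^T W X)^-1 X^T W y
    theta = np.linalg.pinv(X.T @ W @ X) @ X.T @ W @ y_hist
    return float(x @ theta)

# Hypothetical history: input sizes (GB) and observed job durations (s).
sizes = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
times = np.array([35.0, 60.0, 118.0, 240.0, 470.0])
print(lwlr_predict(10.0, sizes, times, tau=4.0))   # runtime estimate for a 10 GB job
```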
A Platform for Scalable Low-Latency Analytics using MapReduce
Today, the ability to process big data has become crucial to the information needs of many enterprise businesses, scientific applications, and governments. Recently, there has also been an increasing need to process data that is not only big but also fast. Here, fast data refers to high-speed real-time and near-real-time data streams, such as Twitter feeds, search query streams, click streams, impressions, and system logs. To handle both historical data and real-time data, many companies have to maintain multiple systems. However, recent real-world case studies show that maintaining multiple systems causes not only code duplication, but also intensive manual work to partition the analytics workloads and determine which data is processed by which system. These issues point to the need for a general, unified data processing framework to support analytical queries with different latency requirements.
This thesis takes a further step towards building a general, unified system for big and fast data analytics. In order to build such a system, I propose to build on existing data-parallelism solutions and extend them with two new features: incremental processing and stream processing with latency constraints. This thesis starts with Hadoop, the most popular open-source MapReduce implementation, which provides proven scalability based on data parallelism. I answer the following questions: (1) Is Hadoop able to support incremental processing? (2) What are the necessary architecture changes in order to support incremental processing? (3) What are the additional design features needed to support stream processing with latency constraints? The thesis comprises three parts, each answering one of these questions.
The first part of the thesis validates whether the existing MapReduce implementations can support incremental processing. Incremental processing means that computation is performed as soon as the relevant data becomes available. My extensive benchmark study of Hadoop-based MapReduce systems shows that the widely-used sort-merge implementation for partitioning and parallel processing poses a fundamental barrier to incremental computation. I further propose a cost model, and optimize the Hadoop system configuration based on the model. The benchmark results over the optimized system verify that the barrier to incremental computation is intrinsic, and cannot be removed by tuning system parameters.
In the second part of the thesis, I employ various purely hash-based techniques to enable fast in-memory incremental processing in MapReduce, and frequent-key-based techniques to extend such processing to workloads that require more memory than is available. I evaluate my Hadoop-based prototype equipped with all the proposed techniques. The results show that the hash techniques allow the reduce progress to keep up with the map progress, with up to three orders of magnitude fewer internal disk spills, and enable results to be returned early.
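As a rough illustration of why hashing removes the sort-merge barrier discussed above, the sketch below keeps per-key partial aggregates in an in-memory hash table so that reduce work proceeds as map output arrives. It is a toy example rather than the thesis prototype, and the word-count combiner is an assumed workload.

```python
from collections import defaultdict

class IncrementalHashReducer:
    """Toy hash-based incremental aggregation: reduce state is updated as each
    map output record arrives, instead of after a global sort-merge pass."""

    def __init__(self, combine):
        self.state = defaultdict(int)
        self.combine = combine          # e.g. addition for word count

    def consume(self, key, value):
        # In-memory hash aggregation; no need to buffer and sort all records.
        self.state[key] = self.combine(self.state[key], value)

    def snapshot(self):
        # Early results can be emitted at any point during the job.
        return dict(self.state)

reducer = IncrementalHashReducer(lambda a, b: a + b)
for word in "to be or not to be".split():
    reducer.consume(word, 1)            # map output streamed straight in
print(reducer.snapshot())               # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```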
The third part of the thesis aims to support stream processing with latency constraints on top of the incremental processing platform resulting from the second part. I perform a benchmark study to understand the sources of latency. I then propose a number of necessary architecture changes to support stream processing, and augment the platform with new latency-aware, model-driven resource planning and latency-aware runtime scheduling techniques to meet user-specified latency constraints while maximizing throughput. Experiments using real-world workloads show that the techniques reduce latency from tens or hundreds of seconds to sub-second, with a 2x-5x increase in throughput. The new platform offers 1-2 orders of magnitude improvements over Storm, a commercial-grade distributed stream system, and Spark Streaming, a state-of-the-art academic prototype, when considering both latency and throughput.
DALiuGE: A Graph Execution Framework for Harnessing the Astronomical Data Deluge
The Data Activated Liu Graph Engine - DALiuGE - is an execution framework for
processing large astronomical datasets at a scale required by the Square
Kilometre Array Phase 1 (SKA1). It includes an interface for expressing complex
data reduction pipelines consisting of both data sets and algorithmic
components and an implementation run-time to execute such pipelines on
distributed resources. By mapping the logical view of a pipeline to its
physical realisation, DALiuGE separates the concerns of multiple stakeholders,
allowing them to collectively optimise large-scale data processing solutions in
a coherent manner. The execution in DALiuGE is data-activated, where each
individual data item autonomously triggers the processing on itself. Such
decentralisation also makes the execution framework very scalable and flexible,
supporting pipeline sizes ranging from less than ten tasks running on a laptop
to tens of millions of concurrent tasks on the second fastest supercomputer in
the world. DALiuGE has been used in production for reducing interferometry data
sets from the Karl E. Jansky Very Large Array and the Mingantu Ultrawide
Spectral Radioheliograph; and is being developed as the execution framework
prototype for the Science Data Processor (SDP) consortium of the Square
Kilometre Array (SKA) telescope. This paper presents a technical overview of
DALiuGE and discusses case studies from the CHILES and MUSER projects that use
DALiuGE to execute production pipelines. In a companion paper, we provide
in-depth analysis of DALiuGE's scalability to very large numbers of tasks on
two supercomputing facilities.
Comment: 31 pages, 12 figures, currently under review by Astronomy and Computing
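The data-activated execution model can be illustrated with a small sketch in which each data item notifies its consumers once it is complete, so that the arrival of data, rather than a central scheduler, drives the pipeline. The class and stage names below are purely illustrative and do not reflect DALiuGE's actual API.

```python
class DataDrop:
    """Toy sketch of a data-activated node: when the data item is marked
    complete, it notifies its consumers, which then run on it."""

    def __init__(self, name):
        self.name = name
        self.consumers = []

    def add_consumer(self, task):
        self.consumers.append(task)

    def completed(self, payload):
        # The data item itself triggers downstream processing.
        for task in self.consumers:
            task(self.name, payload)

def make_task(label, downstream):
    def task(source, payload):
        print(f"{label} triggered by {source}")
        downstream.completed(f"{label}({payload})")   # output drop fires the next stage
    return task

raw = DataDrop("visibilities")
calibrated = DataDrop("calibrated")
imaged = DataDrop("image")

raw.add_consumer(make_task("calibrate", calibrated))
calibrated.add_consumer(make_task("image", imaged))

raw.completed("chunk-0")                 # arrival of data drives the pipeline
```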
Hive on Spark and MapReduce: a methodology for parameter tuning
Project Work presented as the partial requirement for obtaining a Master's degree in Information Management, specialization in Information Systems and Technologies Management. As the era of “big data” has arrived, more and more companies have started using distributed file systems such as the Hadoop Distributed File System (HDFS) framework to manage and process their data streams. This software library offers a way to store large files across multiple machines. Large data sets are processed using its accompanying programming model, MapReduce. Apache Spark is a relatively new alternative to Hadoop MapReduce that claims to offer a performance boost of up to 10 times for certain applications while maintaining automatic fault tolerance. To leverage the data warehouse capabilities of Hadoop, Apache Hive was introduced. It works on top of Hadoop, provides data analysis tools and, most importantly, translates queries into MapReduce and Spark jobs. It thereby exploits the scalability of Hadoop and offers data exploration and mining capabilities to non-developers. However, it is difficult for users to exploit the full potential of the Apache Spark execution engine, which results in very long execution times. This project work therefore gives researchers and companies a tuning methodology that can significantly improve query execution times. Applying this methodology improved the execution time of a real-world batch-processing query by a factor of five. Moreover, it provides insight into the underlying reasons for this improvement using Apache Spark monitoring tools. The result can be helpful for many practitioners and researchers who would like to optimise the performance of Spark and MapReduce queries executed in Hive on top of an Apache Hadoop cluster.
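A simple way to picture such a tuning exercise is a parameter sweep that times the same query under different Hive-on-Spark settings and keeps the fastest combination. The candidate values below are illustrative, and `run_query` is a placeholder for executing the query through a real Hive connection; an actual methodology would typically narrow the search using monitoring data rather than sweeping exhaustively.

```python
import itertools
import time

# Candidate values for a few Hive-on-Spark settings (illustrative choices only).
GRID = {
    "hive.execution.engine": ["spark"],
    "spark.executor.instances": ["2", "4", "8"],
    "spark.executor.memory": ["2g", "4g"],
    "spark.executor.cores": ["2", "4"],
}

def run_query(settings, sql):
    """Placeholder for running `sql` against Hive with the given SET overrides
    (e.g. via a Beeline or PyHive cursor); here it only simulates work."""
    time.sleep(0.01)

def sweep(sql):
    results = []
    keys = list(GRID)
    for combo in itertools.product(*(GRID[k] for k in keys)):
        settings = dict(zip(keys, combo))
        start = time.perf_counter()
        run_query(settings, sql)
        results.append((time.perf_counter() - start, settings))
    return sorted(results, key=lambda r: r[0])   # fastest configuration first

best_time, best_settings = sweep("SELECT count(*) FROM clicks")[0]
print(best_time, best_settings)
```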
Private Summation in the Multi-Message Shuffle Model
The shuffle model of differential privacy (Erlingsson et al. SODA 2019; Cheu
et al. EUROCRYPT 2019) and its close relative encode-shuffle-analyze (Bittau et
al. SOSP 2017) provide a fertile middle ground between the well-known local and
central models. Similarly to the local model, the shuffle model assumes an
untrusted data collector who receives privatized messages from users, but in
this case a secure shuffler is used to transmit messages from users to the
collector in a way that hides which messages came from which user. An
interesting feature of the shuffle model is that increasing the amount of
messages sent by each user can lead to protocols with accuracies comparable to
the ones achievable in the central model. In particular, for the problem of
privately computing the sum of $n$ bounded real values held by $n$ different
users, Cheu et al. showed that $O(\sqrt{n})$ messages per user suffice to
achieve $O(1)$ error (the optimal rate in the central model), while Balle et
al. (CRYPTO 2019) recently showed that a single message per user leads to
$\Theta(n^{1/3})$ MSE (mean squared error), a rate strictly in-between what is
achievable in the local and central models.
This paper introduces two new protocols for summation in the shuffle model
with improved accuracy and communication trade-offs. Our first contribution is
a recursive construction based on the protocol from Balle et al. mentioned
above, providing error with
messages per user. The second contribution is a protocol with error and
messages per user based on a novel analysis of the reduction from secure
summation to shuffling introduced by Ishai et al. (FOCS 2006) (the original
reduction required messages per user).
Comment: Published at CCS'20
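The reduction from secure summation to shuffling mentioned above can be sketched informally: each user splits its (discretized) value into several random additive shares modulo a large q, all shares from all users are shuffled together, and the analyzer sums what it receives. The modulus and share count below are illustrative, and the differential-privacy noise used in the actual protocols is omitted.

```python
import random

Q = 2 ** 32                      # modulus for additive secret sharing
M = 3                            # messages (shares) per user, illustrative

def encode(value, m=M, q=Q):
    """Split an integer value into m random additive shares modulo q."""
    shares = [random.randrange(q) for _ in range(m - 1)]
    shares.append((value - sum(shares)) % q)
    return shares

def shuffle_and_sum(all_shares, q=Q):
    """The shuffler hides which share came from which user; the analyzer
    only learns the modular sum of everything it receives."""
    random.shuffle(all_shares)
    return sum(all_shares) % q

users = [3, 7, 1, 4]                         # private integer inputs
messages = [s for v in users for s in encode(v)]
print(shuffle_and_sum(messages))             # 15 == sum(users)
```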
InTune: Reinforcement Learning-based Data Pipeline Optimization for Deep Recommendation Models
Deep learning-based recommender models (DLRMs) have become an essential
component of many modern recommender systems. Several companies are now
building large compute clusters reserved only for DLRM training, driving new
interest in cost- and time-saving optimizations. The systems challenges faced
in this setting are unique; while typical deep learning training jobs are
dominated by model execution, the most important factor in DLRM training
performance is often online data ingestion.
In this paper, we explore the unique characteristics of this data ingestion
problem and provide insights into DLRM training pipeline bottlenecks and
challenges. We study real-world DLRM data processing pipelines taken from our
compute cluster at Netflix to observe the performance impacts of online
ingestion and to identify shortfalls in existing pipeline optimizers. We find
that current tooling either yields sub-optimal performance, frequent crashes,
or else requires impractical cluster re-organization to adopt. Our studies lead
us to design and build a new solution for data pipeline optimization, InTune.
InTune employs a reinforcement learning (RL) agent to learn how to distribute
the CPU resources of a trainer machine across a DLRM data pipeline to more
effectively parallelize data loading and improve throughput. Our experiments
show that InTune can build an optimized data pipeline configuration within only
a few minutes, and can easily be integrated into existing training workflows.
By exploiting the responsiveness and adaptability of RL, InTune achieves higher
online data ingestion rates than existing optimizers, thus reducing idle times
in model execution and increasing efficiency. We apply InTune to our real-world
cluster, and find that it increases data ingestion throughput by as much as
2.29X versus state-of-the-art data pipeline optimizers while also improving
both CPU & GPU utilization.
Comment: Accepted at RecSys 2023. 11 pages, 2 pages of references. 8 figures with 2 tables
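The core idea of allocating a trainer's CPUs across data pipeline stages to maximize ingestion throughput can be sketched with a toy search loop. The stage names, cost numbers, and simple greedy/epsilon-style policy below are stand-ins for InTune's actual RL agent and reward signal, which the paper derives from measured ingestion rates.

```python
import random

STAGES = ["read", "decode", "transform", "batch"]   # illustrative pipeline stages
TOTAL_CORES = 16

def throughput(alloc):
    """Placeholder reward: in a real system this would be the measured
    ingestion rate with alloc[stage] CPUs assigned to each stage."""
    work = {"read": 2.0, "decode": 6.0, "transform": 5.0, "batch": 3.0}
    return min(alloc[s] / work[s] for s in STAGES)   # slowest stage bounds the pipeline

def step(alloc, epsilon=0.1):
    """One search move: shift a core between stages, keeping the change if it
    improves throughput (or occasionally at random, to keep exploring)."""
    src, dst = random.sample(STAGES, 2)
    if alloc[src] <= 1:
        return alloc
    new = dict(alloc)
    new[src] -= 1
    new[dst] += 1
    if throughput(new) > throughput(alloc) or random.random() < epsilon:
        return new
    return alloc

alloc = {s: TOTAL_CORES // len(STAGES) for s in STAGES}
for _ in range(200):
    alloc = step(alloc)
print(alloc, round(throughput(alloc), 3))
```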