Hadoop performance modeling and job optimization for big data analytics
This thesis was submitted for the award of Doctor of Philosophy and was awarded by Brunel University London. Big data has gained momentum in both academia and industry. The MapReduce model has emerged as a major computing model in support of big data analytics. Hadoop, an open source implementation of MapReduce, has been widely taken up by the community, and cloud service providers such as Amazon EC2 now support Hadoop user applications. However, a key challenge is that cloud service providers do not have a resource provisioning mechanism to satisfy user jobs with deadline requirements. Currently, it is solely the user's responsibility to estimate the amount of resources required for a job running in a public cloud. This thesis presents a Hadoop performance model that accurately estimates the execution duration of a job and further provisions the amount of resources required for the job to complete within a deadline. The proposed model employs Locally Weighted Linear Regression (LWLR) to estimate the execution time of a job, and the Lagrange Multiplier technique to provision resources that satisfy a user job's deadline. The performance of the proposed model is extensively evaluated both on an in-house Hadoop cluster and on the Amazon EC2 cloud. Experimental results show that the model is highly accurate in estimating job execution times, and that jobs complete within their required deadlines when the model's resource provisioning scheme is followed. In addition, the Hadoop framework has over 190 configuration parameters, some of which have a significant effect on the performance of a Hadoop job. Manually setting optimal values for these parameters is both challenging and time consuming. This thesis also presents optimization work that enhances the performance of Hadoop by automatically tuning its parameter values.
It employs the Gene Expression Programming (GEP) technique to build an objective function that represents the performance of a job and the correlations among the configuration parameters. For the optimization itself, Particle Swarm Optimization (PSO) is employed to automatically find an optimal or near-optimal configuration. The proposed work is intensively evaluated on a Hadoop cluster, and the experimental results show that it enhances the performance of Hadoop significantly compared with the default settings.
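The abstract does not show the estimator itself, but the LWLR technique it names can be illustrated in a few lines: fit a separate weighted least-squares line around each query point, with nearby training samples weighted more heavily. This is a minimal sketch under assumptions of my own; the function name, the single job feature (e.g. input size), and the Gaussian bandwidth `tau` are illustrative, not the thesis's actual implementation.

```python
import numpy as np

def lwlr_predict(x_query, X, y, tau=1.0):
    """Locally Weighted Linear Regression (sketch): fit a weighted
    least-squares line around x_query and return its prediction.
    X: (n, d) training features (e.g. job input sizes),
    y: (n,) targets (e.g. observed execution times)."""
    # Gaussian kernel weights: training points near x_query count more.
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2.0 * tau ** 2))
    # Prepend an intercept column to the design matrix.
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    W = np.diag(w)
    # Weighted normal equations: theta = (X^T W X)^{-1} X^T W y.
    theta = np.linalg.solve(Xb.T @ W @ Xb, Xb.T @ W @ y)
    return np.array([1.0, *x_query]) @ theta
```

Because the fit is local, the model can track a nonlinear size-to-runtime relationship without choosing a global functional form, which is the usual motivation for LWLR over plain linear regression.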
Rethinking Storage Management for Data Processing Pipelines in Cloud Data Centers
Data processing frameworks such as Apache Beam and Apache Spark are used for
a wide range of applications, from logs analysis to data preparation for DNN
training. It is thus unsurprising that there has been a large amount of work on
optimizing these frameworks, including their storage management. The shift to
cloud computing requires optimization across all pipelines concurrently running
across a cluster. In this paper, we look at one specific instance of this
problem: placement of I/O-intensive temporary intermediate data on SSD and HDD.
Efficient data placement is challenging since I/O density is usually unknown at
the time data needs to be placed. Additionally, external factors such as load
variability, job preemption, or job priorities can impact job completion times,
which ultimately affect the I/O density of the temporary files in the workload.
In this paper, we envision that machine learning can be used to solve this
problem. We analyze production logs from Google's data centers for a range of
data processing pipelines. Our analysis shows that I/O density may be
predictable. This suggests that learning-based strategies, if crafted
carefully, could extract predictive features for I/O density of temporary files
involved in various transformations, which could be used to improve the
efficiency of storage management in data processing pipelines.
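The paper only envisions learned I/O-density prediction; it does not specify a placement policy. As a toy illustration of how such a prediction could drive placement, here is a threshold rule of my own devising. The function name, the density units (I/O bytes per stored byte per second), and the threshold value are all hypothetical assumptions, not the authors' design.

```python
def place_temp_file(predicted_io_density, ssd_free_bytes, file_size,
                    density_threshold=0.5):
    """Toy policy: place a temporary file on SSD when its predicted
    I/O density (hypothetical units) clears a threshold and the SSD
    has capacity; otherwise fall back to HDD."""
    if predicted_io_density >= density_threshold and file_size <= ssd_free_bytes:
        return "SSD"
    return "HDD"
```

In practice the interesting part is upstream of this rule: producing `predicted_io_density` from features of the pipeline stage, which is exactly where the paper argues learned models could help.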