Optimisation of computing and networking resources of a Hadoop cluster based on software defined network
In this paper, we discuss several challenges regarding the Hadoop framework. One of the main ones is the computing performance of Hadoop MapReduce jobs in terms of CPU, memory and hard disk I/O. The networking side of a Hadoop cluster is another challenge, especially for large-scale clusters with many switch devices and computing nodes, such as a data centre network. The configuration of Hadoop MapReduce parameters can have a significant impact on the computing performance of a Hadoop cluster, and all issues relating to Hadoop MapReduce parameter settings are addressed. Some significant Hadoop MapReduce parameters are tuned using a novel intelligent technique based on both genetic programming and a genetic algorithm, with the aim of optimising the performance of a Hadoop MapReduce job. The Hadoop framework has more than 150 configuration parameters and hence setting them manually is not only difficult, but also time consuming. Consequently, the above-mentioned algorithms are used to search for the optimum values of the parameter settings.
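To make this kind of search concrete, the following is a minimal sketch of a genetic-algorithm loop over a handful of Hadoop configuration parameters. The parameter keys are real Hadoop configuration names, but the candidate value ranges, the GA settings and the fitness function (here a stub standing in for actually running a benchmark job) are illustrative assumptions, not the authors' implementation, which additionally uses genetic programming to construct the fitness function.

```python
import random

# Illustrative search space: real Hadoop configuration keys, but the
# candidate values are assumptions for this sketch, not from the paper.
SEARCH_SPACE = {
    "mapreduce.task.io.sort.mb": [100, 200, 400, 800],
    "mapreduce.task.io.sort.factor": [10, 50, 100],
    "mapreduce.job.reduces": [2, 4, 8, 16],
    "mapreduce.map.output.compress": [True, False],
}

def run_benchmark(config):
    """Stub fitness: in a real setup this would launch a MapReduce job
    (e.g. Word Count) with `config` and return its execution time (s)."""
    return random.uniform(100, 1000)  # placeholder for a real job run

def random_individual():
    return {k: random.choice(v) for k, v in SEARCH_SPACE.items()}

def crossover(a, b):
    # Uniform crossover: each parameter is taken from either parent.
    return {k: random.choice([a[k], b[k]]) for k in SEARCH_SPACE}

def mutate(ind, rate=0.1):
    # With small probability, re-sample a parameter from its candidates.
    return {k: (random.choice(SEARCH_SPACE[k]) if random.random() < rate else v)
            for k, v in ind.items()}

def genetic_search(pop_size=20, generations=10):
    population = [random_individual() for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(population, key=run_benchmark)  # lower time is better
        parents = scored[: pop_size // 2]               # truncation selection
        children = [mutate(crossover(random.choice(parents),
                                     random.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    return min(population, key=run_benchmark)

print(genetic_search())
```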
Software Defined Network (SDN) is also employed to improve the networking performance of a Hadoop cluster, thus accelerating Hadoop jobs. Experiments have been carried out on two typical Hadoop applications, namely a Word Count application and a TeraSort application, using 14 virtual machines in both a traditional network and an SDN. The results for the traditional network show that our proposed technique improves the performance of MapReduce jobs on 20 GB of data with the Word Count application by 69.63% and 30.31% when compared to the default settings and the Gunther work, respectively, whilst for the TeraSort application the performance of Hadoop MapReduce is improved by 73.39% and 55.93%, compared with the default settings and the Gunther work, respectively. Moreover, the experimental results in an SDN environment show that the performance of a Hadoop MapReduce job is further improved, owing to the intelligent and centralised management that SDN provides. Another experiment has been conducted to evaluate the performance of Hadoop jobs using a large-scale cluster in a data centre network, also based on SDN, with the results revealing that this exceeded the performance of a conventional network.

Funder: Iraqi Ministry of Higher Education and Scientific Research and University of Diyala
Hadoop performance modeling and job optimization for big data analytics
This thesis was submitted for the award of Doctor of Philosophy and was awarded by Brunel University London.

Big data has gained momentum from both academia and industry. The MapReduce model has emerged as a major computing model in support of big data analytics, and Hadoop, an open-source implementation of the MapReduce model, has been widely taken up by the community. Cloud service providers such as the Amazon EC2 cloud now support Hadoop user applications. However, a key challenge is that the cloud service providers do not have a resource provisioning mechanism to satisfy user jobs with deadline requirements. Currently, it is solely the user's responsibility to estimate the required amount of resources for a job running in a public cloud. This thesis presents a Hadoop performance model that accurately estimates the execution duration of a job and further provisions the required amount of resources for a job to be completed within a deadline. The proposed model employs a Locally Weighted Linear Regression (LWLR) model to estimate the execution time of a job and a Lagrange Multiplier technique for resource provisioning to satisfy user jobs with a given deadline. The performance of the proposed model is extensively evaluated on both an in-house Hadoop cluster and the Amazon EC2 cloud. Experimental results show that the proposed model is highly accurate in estimating job execution times, and that jobs are completed within their required deadlines when following the resource provisioning scheme of the proposed model.

In addition, the Hadoop framework has over 190 configuration parameters, some of which have significant effects on the performance of a Hadoop job. Manually setting the optimum values for these parameters is a challenging and time-consuming task. This thesis therefore presents optimization work that enhances the performance of Hadoop by automatically tuning its parameter values. It employs the Gene Expression Programming (GEP) technique to build an objective function that represents the performance of a job and the correlations among the configuration parameters, and then employs Particle Swarm Optimization (PSO) to automatically find optimal or near-optimal configuration settings. The performance of the proposed work is intensively evaluated on a Hadoop cluster, and the experimental results show that it enhances the performance of Hadoop significantly compared with the default settings.

Funder: Abdul Wali Khan University Mardan
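To illustrate the estimation step, here is a minimal sketch of locally weighted linear regression predicting a job's execution time from a single feature (input data size). The Gaussian kernel, the bandwidth value, the single-feature setup and the toy measurements are all assumptions for illustration; the thesis's actual LWLR model and its features are not reproduced here.

```python
import numpy as np

def lwlr_predict(x_query, X, y, tau=5.0):
    """Locally Weighted Linear Regression: fit a weighted least-squares
    line around `x_query` and return its prediction.

    X : (n,) input sizes (e.g. GB); y : (n,) measured execution times (s).
    tau : Gaussian kernel bandwidth (an illustrative assumption)."""
    A = np.column_stack([np.ones_like(X), X])   # design matrix with bias term
    q = np.array([1.0, x_query])
    # Gaussian kernel: training points near the query get higher weight.
    w = np.exp(-((X - x_query) ** 2) / (2 * tau ** 2))
    W = np.diag(w)
    # Solve the weighted normal equations: (A^T W A) theta = A^T W y.
    theta = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)
    return q @ theta

# Toy runtime profile for increasing input sizes (fabricated numbers,
# not measurements from the thesis).
sizes = np.array([5.0, 10.0, 15.0, 20.0, 25.0])
times = np.array([120.0, 230.0, 335.0, 450.0, 560.0])
print(f"predicted runtime for 18 GB: {lwlr_predict(18.0, sizes, times):.1f} s")
```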
Optimisation of a Hadoop cluster based on SDN in cloud computing for big data applications
This thesis was submitted for the award of Doctor of Philosophy and was awarded by Brunel University London.

Big data has received a great deal of attention from many sectors, including academia, industry and government. The Hadoop framework has emerged to support its storage and analysis using the MapReduce programming model. However, this framework is a complex system with more than 150 parameters, some of which can exert a considerable effect on the performance of a Hadoop job, and tuning the Hadoop parameters optimally is a difficult as well as time-consuming task. In this thesis, an optimisation approach is presented that improves the performance of the Hadoop framework by setting the values of the Hadoop parameters automatically. Specifically, genetic programming is used to construct a fitness function that represents the interrelations among the Hadoop parameters, and a genetic algorithm is then employed to search for the optimum or near-optimum values of those parameters. A Hadoop cluster is configured on two servers at Brunel University London to evaluate the performance of the proposed optimisation approach. The experimental results show that the performance of a Hadoop MapReduce job on 20 GB of data with the Word Count application is improved by 69.63% and 30.31% when compared to the default settings and the state of the art, respectively, whilst for the TeraSort application it is improved by 73.39% and 55.93%. For further optimisation, SDN is also employed to improve the performance of a Hadoop job. The experimental results show that the performance of a Hadoop job in an SDN for 50 GB is improved by 32.8% when compared to a traditional network, whilst for the TeraSort application the improvement for 50 GB is on average 38.7%.

An effective computing platform is also presented in this thesis to support solar irradiation data analytics. It is built on RHIPE to provide fast analysis and calculation for solar irradiation datasets. The performance of RHIPE is compared with the R language in terms of accuracy, scalability and speedup. The speedup of RHIPE is evaluated by Gustafson's Law, which is revised to enhance the performance of parallel computation on intensive irradiation datasets in a cluster computing environment such as Hadoop. The performance of the proposed work is evaluated using a Hadoop cluster based on the Microsoft Azure cloud, and the experimental results show that RHIPE provides considerable improvements over the R language.

Finally, an effective routing algorithm based on SDN to improve the performance of a Hadoop job in a large-scale cluster in a data centre network is presented. The proposed algorithm improves the performance of a Hadoop job during the shuffle phase by allocating efficient paths for each shuffling flow, according to the network resource demand of each flow as well as their size and number. Furthermore, it is also employed to allocate alternative paths for each shuffling flow in the case of any link crash or failure. This algorithm is evaluated on two network topologies, namely fat-tree and leaf-spine, built with the EstiNet emulator. The experimental results show that the proposed approach improves the performance of a Hadoop job in a data centre network.

Funder: Ministry of Higher Education and Scientific Research
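For reference, the standard form of Gustafson's Law, which the speedup evaluation takes as its starting point (the thesis's revised form is not reproduced here), expresses the scaled speedup on N processors in terms of the serial fraction s of the workload:

```latex
% Gustafson's Law: scaled speedup on N processors,
% where s is the serial fraction and 1 - s the parallel fraction.
S(N) = s + (1 - s)\,N = N - s\,(N - 1)
```

As a worked example, with a serial fraction s = 0.1 on N = 16 nodes, S(16) = 16 - 0.1 × 15 = 14.5: the problem size is scaled with the cluster so that the parallel portion keeps all nodes busy, which is why the law suits data-intensive Hadoop workloads better than Amdahl's fixed-size view.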