8 research outputs found

    Improving Performance in Hadoop MapReduce

    Hadoop MapReduce is a parallel, distributed programming model for processing large data sets, or so-called Big Data, on a cluster. The basic idea of MapReduce is to split the large input data set into many small pieces and assign these pieces to different devices for processing [5]. In this thesis, we look at performance evaluation of the MapReduce framework. MapReduce can be improved to perform speculative execution with maximum performance; thus, optimizing the cost of computation and the cost of communication will help achieve better performance. These optimizations are done in two steps: first, by measuring the processing power of each machine and distributing tasks based on the capacity of each machine; second, by measuring the communication overheads and distributing tasks in the system for a given job or workload. To this end, we represent the Hadoop MapReduce execution with a functional model and develop an optimization model for performance improvement in the system. Our experiments show that the proposed optimized functional model outperforms the regular functional model of the Hadoop MapReduce system by a factor of 2.
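The two-step optimization described in the abstract (capacity-aware, then communication-aware task distribution) can be illustrated with a short sketch. This is not the thesis's actual model; the function name, the effective-rate formula, and the example numbers are assumptions made only for illustration.

def distribute_tasks(num_tasks, capacity, comm_overhead):
    """Split num_tasks across machines in proportion to an effective processing rate.

    capacity[i]: tasks/second machine i can process (measured beforehand);
    comm_overhead[i]: communication cost in seconds per task on machine i (measured).
    """
    # Discount each machine's compute rate by its per-task communication cost.
    effective = [c / (1.0 + c * o) for c, o in zip(capacity, comm_overhead)]
    total = sum(effective)
    shares = [int(round(num_tasks * e / total)) for e in effective]
    # Fix rounding so the shares sum exactly to num_tasks.
    shares[shares.index(max(shares))] += num_tasks - sum(shares)
    return shares

if __name__ == "__main__":
    # Two fast machines with cheap links, one slow machine with an expensive link.
    print(distribute_tasks(100, capacity=[4.0, 4.0, 1.0], comm_overhead=[0.01, 0.01, 0.2]))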

    On data skewness, stragglers, and MapReduce progress indicators

    We tackle the problem of predicting the performance of MapReduce applications, designing accurate progress indicators that keep programmers informed on the percentage of completed computation time during the execution of a job. Through extensive experiments, we show that state-of-the-art progress indicators (including the one provided by Hadoop) can be seriously harmed by data skewness, load unbalancing, and straggling tasks. This is mainly due to their implicit assumption that the running time depends linearly on the input size. We thus design a novel profile-guided progress indicator, called NearestFit, that operates without the linear hypothesis assumption and exploits a careful combination of nearest neighbor regression and statistical curve fitting techniques. Our theoretical progress model requires fine-grained profile data, which can be very difficult to manage in practice. To overcome this issue, we resort to computing accurate approximations for some of the quantities used in our model through space- and time-efficient data streaming algorithms. We implemented NearestFit on top of Hadoop 2.6.0. An extensive empirical assessment over the Amazon EC2 platform on a variety of real-world benchmarks shows that NearestFit is practical w.r.t. space and time overheads and that its accuracy is generally very good, even in scenarios where competitors incur non-negligible errors and wide prediction fluctuations. Overall, NearestFit significantly improves the current state of the art in progress analysis for MapReduce.
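The two ingredients the abstract names, nearest neighbor regression and statistical curve fitting over profiled (input size, running time) pairs, can be sketched as follows. This is only an illustration of the general idea, not the NearestFit implementation; the profile values, the choice of k, and the power-law fit are assumptions.

import numpy as np

def knn_predict(sizes, times, query, k=3):
    """Average the running times of the k profiled tasks closest in input size."""
    idx = np.argsort(np.abs(np.asarray(sizes) - query))[:k]
    return float(np.mean(np.asarray(times)[idx]))

def powerlaw_predict(sizes, times, query):
    """Fit t = a * s**b in log space and extrapolate when no close neighbors exist."""
    b, log_a = np.polyfit(np.log(sizes), np.log(times), 1)
    return float(np.exp(log_a) * query ** b)

if __name__ == "__main__":
    sizes = [1e6, 2e6, 4e6, 8e6]       # profiled input sizes (records)
    times = [12.0, 26.0, 55.0, 120.0]  # measured running times (s): super-linear growth
    print(knn_predict(sizes, times, 3e6))        # interpolate via nearest neighbors
    print(powerlaw_predict(sizes, times, 16e6))  # extrapolate via curve fitting

The point, as in the abstract, is that neither predictor assumes the running time grows linearly with the input size.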

    Analytical composite performance models for Big Data applications

    In the era of Big Data, in which the digital industry faces massive growth in data size and the development of data-intensive software, more and more companies are moving to new frameworks and paradigms capable of handling data at scale. The prominent MapReduce (MR) paradigm and its implementation framework, Hadoop, are among the most frequently referenced, and they are the basis for later and more advanced frameworks like Tez and Spark. Accurate prediction of the execution time of a Big Data application helps improve design-time decisions, reduces over-allocation charges, and assists budget management. In this regard, we propose analytical models based on Stochastic Activity Networks (SANs) to accurately model the execution of MR, Tez and Spark applications in Hadoop environments governed by the YARN Capacity scheduler. We evaluate the accuracy of the proposed models over the TPC-DS industry benchmark across different configurations. Results obtained by numerically solving the analytical SAN models show an average error of 6% in estimating the execution time of an application compared to the data gathered from experiments; moreover, the model evaluation time is lower than the simulation time of state-of-the-art solutions.
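For context, a reported figure such as a 6% average error is typically computed as a mean relative error between model estimates and measured execution times, as in the small helper below. The helper and the example numbers are illustrative and not taken from the paper.

def mean_relative_error(estimated, measured):
    """Mean of |estimate - measurement| / measurement over a set of runs."""
    return sum(abs(e - m) / m for e, m in zip(estimated, measured)) / len(measured)

# Hypothetical estimated vs. measured execution times (s) for three queries.
print(mean_relative_error([102.0, 240.0, 65.0], [98.0, 255.0, 61.0]))  # ~0.055, i.e. ~5.5%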

    Hadoop performance modeling for job estimation and resource provisioning

    MapReduce has become a major computing model for data-intensive applications. Hadoop, an open source implementation of MapReduce, has been adopted by a rapidly growing user community. Cloud computing service providers such as Amazon EC2 offer Hadoop users the opportunity to lease a certain amount of resources and pay for their use. However, a key challenge is that cloud service providers do not have a resource provisioning mechanism to satisfy user jobs with deadline requirements. Currently, it is solely the user's responsibility to estimate the required amount of resources for running a job in the cloud. This paper presents a Hadoop job performance model that accurately estimates job completion time and further provisions the required amount of resources for a job to be completed within a deadline. The proposed model builds on historical job execution records and employs the Locally Weighted Linear Regression (LWLR) technique to estimate the execution time of a job. Furthermore, it employs the Lagrange Multipliers technique for resource provisioning to satisfy jobs with deadline requirements. The proposed model is initially evaluated on an in-house Hadoop cluster and subsequently evaluated in the Amazon EC2 Cloud. Experimental results show that the accuracy of the proposed model in job execution estimation is in the range of 94.97% to 95.51%, and jobs are completed within the required deadlines following the resource provisioning scheme of the proposed model. This research is partially supported by the 973 Project on Network Big Data Analytics funded by the Ministry of Science and Technology, China (No. 2014CB340404).
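The core estimation step, locally weighted linear regression over historical job records, can be sketched as below. This is not the paper's model: the single input-size feature, the Gaussian kernel bandwidth, and the example history are assumptions made only for illustration.

import numpy as np

def lwlr_estimate(hist_sizes, hist_times, query_size, tau=10.0):
    """Estimate completion time for query_size from an (input size, time) history."""
    X = np.column_stack([np.ones(len(hist_sizes)), hist_sizes])  # bias + input size
    y = np.asarray(hist_times, dtype=float)
    # Gaussian kernel weights: historical jobs close to the new input size dominate.
    w = np.exp(-((np.asarray(hist_sizes) - query_size) ** 2) / (2.0 * tau ** 2))
    W = np.diag(w)
    theta = np.linalg.pinv(X.T @ W @ X) @ X.T @ W @ y  # weighted least squares
    return float(np.array([1.0, query_size]) @ theta)

if __name__ == "__main__":
    sizes = [5, 10, 20, 40, 80]       # input sizes (GB) of past executions
    times = [60, 110, 220, 430, 900]  # measured completion times (s)
    print(lwlr_estimate(sizes, times, 30))  # falls roughly between the 20 GB and 40 GB runs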

    Automatic physical layer tuning of MapReduce-based query processing engines

    Advisor: Eduardo Cunha de Almeida. Doctoral thesis, Universidade Federal do Paraná, Setor de Ciências Exatas, Programa de Pós-Graduação em Informática. Defended in Curitiba, 29/06/2020. Includes references: p. 98-109. Concentration area: Computer Science.
Abstract: The increasing need to process large amounts of semi-structured and non-structured data has led to the development of specialized processing engines like MapReduce. MapReduce is a programming model designed to process large-scale semi-structured data in a distributed and parallel fashion. SQL-on-Hadoop systems are SQL-like interfaces built on top of MapReduce processing engines to query semi-structured data at large scale. However, the number of computing nodes, the number of systems in the software stack, and the controlling mechanisms provided by MapReduce engines increase the complexity and the operational costs of maintaining a large SQL-on-Hadoop cluster. Increasing the performance of such engines is a key factor that can be achieved by delegating the right amount of physical resources to their jobs. Yet, regular users and even expert administrators struggle to understand and tune MapReduce jobs to achieve good performance. This skill gap has given rise to a successful line of research on automatically tuning MapReduce parameters, producing several tuning advisors.
Yet, the problem of automatically tuning SQL-on-Hadoop queries remains largely unexplored today, as the current approach of applying MapReduce tuning advisors directly to SQL-on-Hadoop queries entails a number of problems. For instance, the Hive SQL-on-Hadoop engine compiles HiveQL queries into a workflow of MapReduce jobs, and it would be straightforward to assume that by tuning the underlying Hadoop processing engine, HiveQL queries would benefit as well. However, this assumption does not hold when existing tuning advisors are naively applied to HiveQL queries, due to the design choices of Hive, Hadoop, and the tuning advisors. This thesis addresses the question of how to properly tune SQL-on-Hadoop queries. By "properly" we mean that, when tuning SQL-on-Hadoop queries, the generation of the tuning setups has to consider several characteristics that are only present in jobs generated by SQL-on-Hadoop systems. These characteristics include: (i) at the level of individual queries, all MapReduce jobs that constitute a query plan are executed with identical configuration settings; (ii) although profiling and search heuristics are performed on a per-job basis to generate tuning setups, only one tuning setup is applied to the query and the remaining tuning setups are simply discarded; (iii) Hadoop tuning advisors treat the MapReduce functions as black boxes and make simplifying modeling assumptions that may hold for classical MapReduce jobs (Sort, Grep) but are not true for SQL-like queries such as HiveQL, whose jobs contain multiple relational algebra operators like joins and aggregators. We extended the Hive query processor to tune SQL-on-Hadoop queries. This extension comprises an approach called non-uniform tuning that enables SQL-on-Hadoop systems to exercise fine-grained control when tuning queries, where each job receives a specialized tuning setup. We present a conceptual model, called code signature, that uses static information available before execution to match jobs with similar resource consumption patterns. We also present a tuning cache that stores tuning setups, generated by third-party tuning advisors, and recycles them between jobs that have similar resource consumption. The extension works as a single solution for the automatic tuning of SQL-on-Hadoop queries. In order to validate our solution, we conducted an experimental study focused on Hive over Hadoop because (i) Hive is a good representative of native SQL-on-Hadoop systems (much as System-R was for relational database systems); (ii) both Hive and Hadoop are highly popular for analytical processing; and (iii) Hadoop parameter tuning has been studied extensively in recent years. To populate the tuning cache, we employ Starfish, the first cost-based optimizer for finding (near-)optimal configuration parameter settings and the only tuning advisor publicly available for academic research purposes. In our experiments, we show that queries optimized with our tuning approach always achieved positive speedups of up to 25%, in contrast to the current approach, which degraded performance on several occasions. Specifically, the current tuning approach can cause variations in execution time between -171% and 27% relative to the default configuration. Most importantly, our tuning method leads to considerably better resource utilization, decreasing CPU usage and memory paging by over 40%, and it also reduces the total amount of data written to disk by 5×. Our tuning approach includes a tuning cache used to avoid reprofiling similar jobs.
Our tuning cache reduced profiling by 50% for the TPC-H queries, enabling upfront tuning of ad-hoc queries. Keywords: Physical-layer tuning. MapReduce query processing. SQL-on-Hadoop.
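The non-uniform tuning idea, a cache of tuning setups keyed by a code signature built from static, pre-execution job information, can be sketched as follows. This is only an illustration under assumptions: the signature fields, the setup format, and the profiling callback are hypothetical, and in the thesis the setups themselves come from an external tuning advisor (Starfish in its experiments).

from hashlib import sha1

def code_signature(operators, input_format, num_columns):
    """Build a stable key from static properties of a MapReduce job in a query plan."""
    raw = "|".join([",".join(sorted(operators)), input_format, str(num_columns)])
    return sha1(raw.encode()).hexdigest()

class TuningCache:
    def __init__(self):
        self._setups = {}

    def get_or_profile(self, signature, profile_fn):
        """Reuse a cached tuning setup; otherwise profile the job and cache the result."""
        if signature not in self._setups:
            self._setups[signature] = profile_fn()  # expensive: run the tuning advisor
        return self._setups[signature]

if __name__ == "__main__":
    cache = TuningCache()
    sig = code_signature(["TableScan", "Join", "GroupBy"], "ORC", 12)
    setup = cache.get_or_profile(sig, lambda: {"mapreduce.map.memory.mb": 2048})
    # A later job with the same signature reuses the setup, skipping re-profiling.
    assert cache.get_or_profile(sig, lambda: None) == setup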

    Benchmarking Approach for Designing a MapReduce Performance Model

    In MapReduce environments, many programs are reused for processing regularly incoming new data. A typical user question is how to estimate the completion time of these programs as a function of a new dataset and the cluster resources. In this work, we offer a novel performance evaluation framework for answering this question. We observe that the execution of each map (reduce) task consists of specific, well-defined data processing phases. Only the map and reduce functions are custom, and their executions are user-defined for different MapReduce jobs. The executions of the remaining phases are generic and depend on the amount of data processed by the phase and the performance of the underlying Hadoop cluster. First, we design a set of parameterizable microbenchmarks to measure the generic phases and to derive a platform performance model of a given Hadoop cluster. Then, using the job's past executions, we summarize the job's properties and the performance of its custom map/reduce functions in a compact job profile. Finally, by combining the knowledge of the job profile and the derived platform performance model, we offer a MapReduce performance model that estimates the program completion time for processing a new dataset. The evaluation study justifies our approach and the proposed framework: we are able to accurately predict the performance of a diverse set of twelve MapReduce applications. The predicted completion times for most experiments are within 10% of the measured ones (with a worst case of 17% error) on our 66-node Hadoop cluster.
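The combination the abstract describes, generic per-phase costs from microbenchmarks (platform model) plus measured rates of the custom map/reduce functions (job profile), can be sketched as a simple wave-based estimate. The phase names, the additive formula, and the example values are assumptions for illustration, not the paper's model.

import math

def estimate_completion_time(platform, profile, input_mb, map_slots, reduce_slots):
    """Combine generic per-phase costs (platform) with job-specific rates (profile)."""
    map_tasks = input_mb / profile["split_mb"]
    shuffle_mb = input_mb * profile["map_selectivity"]  # map output sent to reducers
    per_reduce_mb = shuffle_mb / profile["reduce_tasks"]

    # Per-task times: generic phases scale with data volume (platform model),
    # custom map/reduce processing scales with the job's measured rates (profile).
    map_task_s = (platform["read_s_per_mb"] * profile["split_mb"]
                  + profile["split_mb"] / profile["map_mb_per_s"])
    reduce_task_s = (platform["shuffle_s_per_mb"] * per_reduce_mb
                     + per_reduce_mb / profile["reduce_mb_per_s"]
                     + platform["write_s_per_mb"] * per_reduce_mb)

    # Tasks execute in waves over the available map and reduce slots.
    map_waves = math.ceil(map_tasks / map_slots)
    reduce_waves = math.ceil(profile["reduce_tasks"] / reduce_slots)
    return map_waves * map_task_s + reduce_waves * reduce_task_s

if __name__ == "__main__":
    platform = {"read_s_per_mb": 0.01, "shuffle_s_per_mb": 0.02, "write_s_per_mb": 0.015}
    profile = {"split_mb": 128, "map_mb_per_s": 40.0, "map_selectivity": 0.3,
               "reduce_tasks": 16, "reduce_mb_per_s": 25.0}
    print(estimate_completion_time(platform, profile, input_mb=64 * 1024,
                                   map_slots=32, reduce_slots=16))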