Importance of data distribution on hive-based systems for query performance: An experimental study
SQL-on-Hadoop systems have been gaining popularity in recent years. One popular example is Apache Hive, the pioneer of SQL-on-Hadoop systems. Hive sits at the top of the big data stack as the application layer. Besides the application layer, the Hadoop ecosystem is composed of three main layers: storage, the resource manager, and the processing engine. Demand from industry has led to the development of new, efficient components for each layer. As the ecosystem has evolved, Hive has adopted different execution engines as well. Understanding the strengths of each component is important for exploiting the full performance of the Hadoop ecosystem. Recent works in the literature therefore study the importance of each layer separately. To the best of our knowledge, the present work is the first to focus on the performance of the combination of the storage layer and the execution engine. We compare Hive's query performance using three execution engines (MR, Tez, and Spark) on skewed and well-balanced data distributions with the full TPC-H benchmark. Our results show the importance of data distribution in the storage layer for the overall job performance of SQL-on-Hadoop systems: an even distribution improves performance by up to 48% compared to a skewed distribution. Moreover, the study identifies particular SQL query cases that a given processing engine handles exceptionally well.
On the performance of SQL scalable systems on Kubernetes: a comparative study
The popularization of Hadoop as the de facto standard platform for data analytics in the context of Big Data applications
has led to the upsurge of SQL-on-Hadoop systems, which provide scalable query execution engines allowing the use of
SQL queries on data stored in HDFS. In this context, Kubernetes appears as the leading choice to simplify the deployment
and scaling of containerized applications; however, there is a lack of studies about the performance of SQL-on-Hadoop
systems deployed on Kubernetes, and this is the gap we intend to fill in this paper. We present an experimental study
involving four representative SQL scalable platforms: Apache Drill, Apache Hive, Apache Spark SQL and Trino. Concretely, we analyze the performance of these systems when they are deployed on a Hadoop cluster with Kubernetes by
using the TPC-H benchmark. The results of our study can help practitioners and users understand what to expect in terms
of performance if they plan to use the advantages of Kubernetes to deploy applications using the analyzed SQL scalable
platforms.
Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature. Funding for open access charge: Universidad de Málaga / CBUA. This work has been partially funded by the Spanish Ministry of Science and Innovation via Grant PID2020-112540RB-C41 (AEI/FEDER, UE), by the Andalusian PAIDI program with grant P18-RT-2799, and by project “Evolución y desarrollo de la plataforma DOP de Big Data” (702C2000044) under the Andalusian “Programa de Apoyo a la I+D+i Empresarial”.