
    Rumble: Data Independence for Large Messy Data Sets

    This paper introduces Rumble, an engine that executes JSONiq queries on large, heterogeneous, and nested collections of JSON objects, leveraging the parallel capabilities of Spark to provide a high degree of data independence. The design is based on two key insights: (i) how to map JSONiq expressions to Spark transformations on RDDs and (ii) how to map JSONiq FLWOR clauses to Spark SQL on DataFrames. We have developed a working implementation of these mappings, showing that JSONiq can efficiently run on Spark to query billions of objects, with data sizes reaching at least into the TB range. The JSONiq code is concise in comparison to Spark's host languages while seamlessly supporting the nested, heterogeneous data sets that Spark SQL does not. The ability to process this kind of input, which is commonly encountered, is paramount for data cleaning and curation. The experimental analysis indicates that there is no excessive performance loss, and occasionally even a gain, over Spark SQL for structured data, and a performance gain over PySpark. This demonstrates that a language such as JSONiq is a simple and viable approach to large-scale querying of denormalized, heterogeneous, arborescent data sets, in the same way that SQL can be leveraged for structured data sets. The results also illustrate that Codd's concept of data independence makes as much sense for heterogeneous, nested data sets as it does for highly structured tables. Comment: Preprint, 9 pages
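
    The abstract contrasts JSONiq's conciseness with Spark's host languages. As an illustration only (not the paper's own code), the following minimal PySpark sketch shows the kind of RDD-level handling of heterogeneous, nested JSON lines that a JSONiq FLWOR expression would express declaratively in Rumble; the input path and field names ("review", "rating") are hypothetical.

```python
# Hedged sketch: querying heterogeneous, nested JSON lines with plain PySpark RDDs.
# The file path and field names are hypothetical; Rumble would express the same
# query as a short JSONiq FLWOR expression instead of host-language code.
import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("heterogeneous-json").getOrCreate()

# Each line is a JSON object; objects may carry different or nested fields,
# which spark.read.json would coerce into a single (partly null) schema.
lines = spark.sparkContext.textFile("hdfs:///data/objects.jsonl")
objects = lines.map(json.loads)

# Keep only objects that actually carry a nested "review" with a numeric rating,
# then project a small result -- the kind of filtering JSONiq states declaratively.
good = (objects
        .filter(lambda o: isinstance(o.get("review"), dict))
        .filter(lambda o: o["review"].get("rating", 0) >= 4)
        .map(lambda o: (o.get("id"), o["review"]["rating"])))

print(good.take(5))
```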

    Using SQL-based Scripting Languages in Hadoop Ecosystem for Data Analytics

    The goal of this thesis is to compare different SQL-based scripting languages in the Hadoop ecosystem by implementing data analytics algorithms. The thesis compares the efficiency of the frameworks and the ease of implementing algorithms for a user with no previous experience in distributed computing. To this end, three algorithms were implemented: Pearson's correlation, simple linear regression, and a naive Bayes classifier. The algorithms were implemented in two SQL-based frameworks in the Hadoop ecosystem, Spark SQL and HiveQL, and the corresponding Spark MLlib implementations were used for comparison; SQLContext and HiveContext were also compared within Spark SQL. The algorithms were tested in a cluster with different dataset sizes and different numbers of executors, and the scaling of Spark SQL and of the Spark MLlib algorithms was measured. The results show that for Pearson's correlation HiveQL was slightly faster than the other two frameworks. For linear regression, Spark SQL and Spark MLlib had similar run times, both about 30% faster than HiveQL, and both scaled well on these two algorithms. For the naive Bayes classifier, Spark SQL did not scale well but was still faster than HiveQL; the Spark MLlib results for multinomial naive Bayes were inconclusive. For correlation and regression, no difference between SQLContext and HiveContext was found. The thesis found the SQL-based frameworks easy to use: HiveQL was the easiest, while Spark SQL required some additional study of distributed computing. Implementing algorithms with Spark MLlib was more difficult than expected, as it required understanding the internal workings of the algorithms as well as knowledge of distributed computing.
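
    As an illustration of one of the compared implementations, the following minimal sketch expresses Pearson's correlation in Spark SQL aggregates and cross-checks it against the built-in DataFrame method; the toy data and column names are invented for the example and are not taken from the thesis.

```python
# Hedged sketch: Pearson's r computed with Spark SQL aggregates.
# r = (E[xy] - E[x]E[y]) / (stddev_pop(x) * stddev_pop(y))
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pearson-sparksql").getOrCreate()

df = spark.createDataFrame(
    [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.0)], ["x", "y"])
df.createOrReplaceTempView("points")

r = spark.sql("""
    SELECT (AVG(x * y) - AVG(x) * AVG(y)) / (STDDEV_POP(x) * STDDEV_POP(y)) AS pearson_r
    FROM points
""").first()["pearson_r"]

# Cross-check against the built-in DataFrame implementation.
print(r, df.stat.corr("x", "y"))
```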

    Integration of Skyline Queries into Spark SQL

    Skyline queries are frequently used in data analytics and multi-criteria decision support applications to filter relevant information from large amounts of data. Apache Spark is a popular framework for processing big, distributed data, and it even provides a convenient SQL-like interface via the Spark SQL module. However, skyline queries are not natively supported and require tedious rewriting to fit the SQL standard or Spark's SQL-like language. The goal of our work is to fill this gap. We thus provide a full-fledged integration of the skyline operator into Spark SQL, which allows for a simple and easy-to-use syntax for expressing skyline queries. Moreover, our empirical results show that this integrated solution far outperforms a solution based on rewriting into standard SQL.
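
    To illustrate the rewriting baseline that the integrated operator avoids, the following hedged sketch computes a two-dimensional skyline (minimize price and distance) in standard Spark SQL via a non-equi self-join; the table, columns, and data are hypothetical, and the paper's native skyline syntax would replace this whole pattern.

```python
# Hedged sketch of the rewriting baseline: a point is in the skyline if no other
# point is at least as good in both dimensions and strictly better in at least one.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("skyline-rewrite").getOrCreate()

hotels = spark.createDataFrame(
    [("A", 80, 2.0), ("B", 120, 0.5), ("C", 150, 3.0), ("D", 60, 4.0)],
    ["name", "price", "distance"])
hotels.createOrReplaceTempView("hotels")

# Left self-join on the dominance condition; rows with no dominating partner survive.
skyline = spark.sql("""
    SELECT h.name, h.price, h.distance
    FROM hotels h LEFT JOIN hotels o
      ON o.price <= h.price AND o.distance <= h.distance
     AND (o.price < h.price OR o.distance < h.distance)
    WHERE o.name IS NULL
""")
skyline.show()
```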

    Building Near-Real-Time Processing Pipelines with the Spark-MPI Platform

    Advances in detectors and computational technologies provide new opportunities for applied research and the fundamental sciences. Concurrently, dramatic increases in the three Vs (Volume, Velocity, and Variety) of experimental data and in the scale of computational tasks have produced a demand for new real-time processing systems at experimental facilities. Recently, this demand was addressed by the Spark-MPI approach, which connects the Spark data-intensive platform with the MPI high-performance framework. In contrast with existing data management and analytics systems, Spark introduced a new middleware based on resilient distributed datasets (RDDs), which decouples various data sources from high-level processing algorithms. The RDD middleware significantly advanced the scope of data-intensive applications, ranging from SQL queries to machine learning and graph processing. Spark-MPI further extends the Spark ecosystem to MPI applications using the Process Management Interface. The paper explores this integrated platform within the context of online ptychographic and tomographic reconstruction pipelines. Comment: New York Scientific Data Summit, August 6-9, 201
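
    The following minimal sketch illustrates only the general pattern of coupling Spark's data-parallel stages with gang-scheduled, synchronized workers, using Spark's built-in barrier execution mode; it is not the Spark-MPI platform itself, which launches actual MPI processes through the Process Management Interface, and the per-partition work here is a placeholder.

```python
# Hedged sketch: barrier execution mode as a stand-in for the general pattern of
# synchronized parallel stages inside a Spark pipeline (illustration only).
from pyspark import BarrierTaskContext
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("barrier-sketch").getOrCreate()
frames = spark.sparkContext.parallelize(range(8), numSlices=4)  # e.g. detector frames

def synchronized_stage(iterator):
    ctx = BarrierTaskContext.get()
    peers = [info.address for info in ctx.getTaskInfos()]  # all tasks start together
    local = [x * x for x in iterator]                       # placeholder per-partition work
    ctx.barrier()                                           # global synchronization point
    yield (ctx.partitionId(), len(peers), sum(local))

print(frames.barrier().mapPartitions(synchronized_stage).collect())
```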

    Rover: An online Spark SQL tuning service via generalized transfer learning

    Distributed data analytics engines like Spark are common choices for processing massive data in industry. However, the performance of Spark SQL depends heavily on the choice of configurations, and the optimal ones vary with the executed workloads. Among the various alternatives for Spark SQL tuning, Bayesian optimization (BO) is a popular framework that finds near-optimal configurations given a sufficient budget, but it suffers from the re-optimization issue and is not practical in real production. When applying transfer learning to accelerate the tuning process, we notice two domain-specific challenges: 1) most previous work focuses on transferring tuning history, while expert knowledge from Spark engineers has great potential to improve tuning performance but has not been well studied so far; 2) history tasks should be carefully utilized, since using dissimilar ones leads to deteriorated performance in production. In this paper, we present Rover, a deployed online Spark SQL tuning service for efficient and safe search on industrial workloads. To address these challenges, we propose generalized transfer learning to boost tuning performance based on external knowledge, including expert-assisted Bayesian optimization and controlled history transfer. Experiments on public benchmarks and real-world tasks show the superiority of Rover over competitive baselines. Notably, Rover saves an average of 50.1% of the memory cost on 12k real-world Spark SQL tasks in 20 iterations, among which 76.2% of the tasks achieve a significant memory reduction of over 60%. Comment: Accepted by KDD 202
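
    As a rough illustration of the vanilla Bayesian-optimization loop that Rover generalizes, the sketch below tunes two common Spark SQL knobs with scikit-optimize; run_workload is a hypothetical, synthetic stand-in for executing the target Spark SQL job under a configuration and returning its runtime, and Rover additionally layers expert knowledge and controlled history transfer on top of such a loop.

```python
# Hedged sketch: plain BO over two Spark SQL knobs (not Rover's algorithm).
from skopt import gp_minimize
from skopt.space import Integer

space = [
    Integer(50, 1000, name="spark.sql.shuffle.partitions"),
    Integer(1, 16, name="spark.executor.cores"),
]

def run_workload(config):
    # Hypothetical placeholder: a synthetic cost standing in for a real Spark SQL run.
    p = int(config["spark.sql.shuffle.partitions"])
    c = int(config["spark.executor.cores"])
    return abs(p - 400) / 100.0 + abs(c - 8)

def objective(params):
    shuffle_partitions, executor_cores = params
    config = {
        "spark.sql.shuffle.partitions": str(shuffle_partitions),
        "spark.executor.cores": str(executor_cores),
    }
    return run_workload(config)  # runtime (lower is better)

result = gp_minimize(objective, space, n_calls=20, random_state=0)
print("best configuration:", result.x, "cost:", result.fun)
```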
    • 
