Rumble: Data Independence for Large Messy Data Sets
This paper introduces Rumble, an engine that executes JSONiq queries on
large, heterogeneous and nested collections of JSON objects, leveraging the
parallel capabilities of Spark so as to provide a high degree of data
independence. The design is based on two key insights: (i) how to map JSONiq
expressions to Spark transformations on RDDs and (ii) how to map JSONiq FLWOR
clauses to Spark SQL on DataFrames. We have developed a working implementation
of these mappings, showing that JSONiq can run efficiently on Spark to query
billions of objects, scaling at least into the TB range. The JSONiq code is
concise compared to Spark's host languages, while seamlessly supporting the
nested, heterogeneous data sets that Spark SQL does not. The ability to process
this commonly encountered kind of input is paramount for data cleaning and
curation. The
experimental analysis indicates that there is no excessive performance loss,
occasionally even a gain, over Spark SQL for structured data, and a performance
gain over PySpark. This demonstrates that a language such as JSONiq is a simple
and viable approach to large-scale querying of denormalized, heterogeneous,
arborescent data sets, in the same way as SQL can be leveraged for structured
data sets. The results also illustrate that Codd's concept of data independence
makes as much sense for heterogeneous, nested data sets as it does for highly
structured tables.
Comment: Preprint, 9 pages
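The FLWOR-to-transformation mapping the abstract describes can be pictured, outside of Rumble itself, with a plain-Python sketch: a declarative "for/where/return" query over heterogeneous JSON objects expressed as filter and map steps, the same shape such clauses would take as Spark RDD transformations. The collection and query below are invented for illustration; this is not Rumble's implementation.

```python
# Illustrative sketch (not Rumble's actual code): the FLWOR query
#   for $o in collection where $o.age gt 30 return $o.name
# expressed as filter/map transformations over heterogeneous JSON objects,
# mirroring how such clauses could map onto Spark RDD operations.

# Heterogeneous, nested JSON-like records: fields may be absent.
collection = [
    {"name": "alice", "age": 42, "tags": ["a", "b"]},
    {"name": "bob"},                                  # no "age" field
    {"name": "carol", "age": 35, "addr": {"city": "Zurich"}},
]

def flwor_query(objects):
    # "where" clause -> filter; objects missing the field are simply
    # skipped, echoing JSONiq's tolerance for absent data.
    matching = filter(lambda o: o.get("age") is not None and o["age"] > 30, objects)
    # "return" clause -> map
    return [o["name"] for o in matching]

print(flwor_query(collection))  # ['alice', 'carol']
```

The point of the sketch is data independence: the query names only logical fields, and records that lack them flow through without schema declarations or error handling.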
Using SQL-based Scripting Languages in Hadoop Ecosystem for Data Analytics
The goal of this thesis is to compare different SQL-based scripting languages in the Hadoop ecosystem by implementing data analytics algorithms. The thesis compares the efficiency of the frameworks and the ease of implementing algorithms for a user with no previous experience in distributed computing. To fulfill this goal, three algorithms were implemented: Pearson's correlation, simple linear regression, and a naive Bayes classifier. The algorithms were implemented in two SQL-based frameworks in the Hadoop ecosystem, Spark SQL and HiveQL; the same algorithms were also implemented with Spark MLlib. SQLContext and HiveContext were also compared in Spark SQL. The algorithms were tested in a cluster with different dataset sizes and different numbers of executors. The scaling of the Spark SQL and Spark MLlib algorithms was also measured. The results show that for Pearson's correlation, HiveQL was slightly faster than the other two frameworks. The linear regression results show that Spark SQL and Spark MLlib had similar run times, both about 30% faster than HiveQL. The Spark SQL and Spark MLlib implementations scaled well for these two algorithms. For the naive Bayes classifier, Spark SQL did not scale well but was still faster than HiveQL; the Spark MLlib results for multinomial naive Bayes proved inconclusive. For correlation and regression, no difference between SQLContext and HiveContext was found. The thesis found the SQL-based frameworks easy to use: HiveQL was the easiest, while Spark SQL required some additional investigation into distributed computing. Implementing algorithms with Spark MLlib was more difficult, as it required understanding the internal workings of each algorithm as well as knowledge of distributed computing.
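To make the "algorithms expressed in SQL" approach concrete, here is a minimal sketch of Pearson's correlation computed via SQL aggregates. It uses Python's built-in sqlite3 rather than HiveQL or Spark SQL, and the toy table is invented; the same single aggregate query carries over to any SQL dialect.

```python
import math
import sqlite3

# Toy data: y = 2x + 1 exactly, so Pearson's r should come out as 1.0.
pairs = [(float(x), 2.0 * x + 1.0) for x in range(10)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (x REAL, y REAL)")
conn.executemany("INSERT INTO data VALUES (?, ?)", pairs)

# One aggregate query gathers every sum the formula needs.
n, sx, sy, sxx, syy, sxy = conn.execute(
    "SELECT COUNT(*), SUM(x), SUM(y), SUM(x*x), SUM(y*y), SUM(x*y) FROM data"
).fetchone()

# r = (n*Sxy - Sx*Sy) / sqrt((n*Sxx - Sx^2) * (n*Syy - Sy^2))
r = (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
print(round(r, 6))  # 1.0
```

The final square root is done in Python only because stock SQLite may lack a SQRT function; in HiveQL or Spark SQL the whole expression fits in one SELECT.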
Integration of Skyline Queries into Spark SQL
Skyline queries are frequently used in data analytics and multi-criteria
decision support applications to filter relevant information from big amounts
of data. Apache Spark is a popular framework for processing big, distributed
data. The framework even provides a convenient SQL-like interface via the Spark
SQL module. However, skyline queries are not natively supported and require
tedious rewriting to fit the SQL standard or Spark's SQL-like language. The
goal of our work is to fill this gap. We thus provide a full-fledged
integration of the skyline operator into Spark SQL. This provides a simple,
easy-to-use syntax for expressing skyline queries. Moreover, our empirical
results show that this integrated skyline solution far outperforms a solution
based on rewriting into standard SQL.
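For readers unfamiliar with the operator itself, here is a plain-Python sketch of the Pareto-dominance filter a skyline query computes — a naive nested-loop check, not the paper's Spark SQL operator, and the hotel data is invented. The convention assumed here is that smaller is better in every dimension.

```python
# Minimal sketch of the skyline (Pareto) filter that a skyline query
# computes; a naive nested-loop check, not the paper's Spark SQL operator.
# Convention assumed here: smaller is better in every dimension.

def dominates(a, b):
    """a dominates b: a is <= b in every dimension and < in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def skyline(points):
    # Keep exactly the points that no other point dominates.
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

# Hotels as (price, distance_to_beach): cheaper and closer is better.
hotels = [(50, 8), (60, 5), (70, 2), (80, 6), (55, 9)]
print(skyline(hotels))  # [(50, 8), (60, 5), (70, 2)]
```

Expressing this filter in plain SQL takes a correlated NOT EXISTS subquery over the whole table, which is exactly the tedious rewriting the abstract says a native operator avoids.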
Building Near-Real-Time Processing Pipelines with the Spark-MPI Platform
Advances in detectors and computational technologies provide new
opportunities for applied research and the fundamental sciences. Concurrently,
dramatic increases in the three Vs (Volume, Velocity, and Variety) of
experimental data and the scale of computational tasks produced the demand for
new real-time processing systems at experimental facilities. Recently, this
demand was addressed by the Spark-MPI approach connecting the Spark
data-intensive platform with the MPI high-performance framework. In contrast
with existing data management and analytics systems, Spark introduced a new
middleware based on resilient distributed datasets (RDDs), which decoupled
various data sources from high-level processing algorithms. The RDD middleware
significantly advanced the scope of data-intensive applications, spreading from
SQL queries to machine learning to graph processing. Spark-MPI further extended
the Spark ecosystem with MPI applications using the Process Management
Interface. The paper explores this integrated platform in the context of
online ptychographic and tomographic reconstruction pipelines.
Comment: New York Scientific Data Summit, August 6-9, 201
Rover: An online Spark SQL tuning service via generalized transfer learning
Distributed data analytic engines like Spark are common choices to process
massive data in industry. However, the performance of Spark SQL highly depends
on the choice of configurations, where the optimal ones vary with the executed
workloads. Among various alternatives for Spark SQL tuning, Bayesian
optimization (BO) is a popular framework that finds near-optimal configurations
given sufficient budget, but it suffers from the re-optimization issue and is
not practical in real production. When applying transfer learning to accelerate
the tuning process, we notice two domain-specific challenges: 1) most previous
work focuses on transferring tuning history, while expert knowledge from Spark
engineers has great potential to improve tuning performance but has not been
well studied so far; 2) history tasks should be carefully utilized, since using
dissimilar ones leads to deteriorated performance in production. In this
paper, we present Rover, a deployed online Spark SQL tuning service for
efficient and safe search on industrial workloads. To address the challenges,
we propose generalized transfer learning to boost the tuning performance based
on external knowledge, including expert-assisted Bayesian optimization and
controlled history transfer. Experiments on public benchmarks and real-world
tasks show the superiority of Rover over competitive baselines. Notably, Rover
saves an average of 50.1% of the memory cost on 12k real-world Spark SQL tasks
in 20 iterations, among which 76.2% of the tasks achieve a significant memory
reduction of over 60%.
Comment: Accepted by KDD 202
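Rover's actual method is Bayesian optimization with generalized transfer learning, which is far richer than anything that fits here. Still, the "expert knowledge as a starting point" idea can be loosely illustrated with a toy search: hypothetical expert-suggested Spark configurations seed a best-of-N search against a synthetic cost function standing in for a measured memory footprint. All names and numbers below are invented.

```python
import random

# Toy illustration only: Rover really uses Bayesian optimization with
# expert-assisted priors and controlled history transfer. Here, invented
# expert-suggested configs (executor memory in GB, shuffle partitions)
# seed a simple best-of-N search over a synthetic cost function.

random.seed(0)

def workload_cost(executor_mem_gb, shuffle_partitions):
    # Synthetic stand-in for the measured memory cost of a Spark SQL run.
    return (executor_mem_gb - 6) ** 2 + 0.001 * (shuffle_partitions - 400) ** 2

# "Expert knowledge": configurations an engineer would try first.
expert_configs = [(4, 200), (8, 400), (6, 300)]

def tune(n_random=20):
    # Start from the expert suggestions, then explore randomly around them.
    candidates = list(expert_configs)
    for _ in range(n_random):
        candidates.append((random.randint(1, 16),
                           random.choice(range(100, 1001, 100))))
    return min(candidates, key=lambda c: workload_cost(*c))

best = tune()
print(best, workload_cost(*best))
```

Because the expert configurations are always among the candidates, the result can never be worse than the best expert guess — a crude analogue of the "safe search" property the abstract emphasizes.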