
    Rumble: Data Independence for Large Messy Data Sets

    This paper introduces Rumble, an engine that executes JSONiq queries on large, heterogeneous, and nested collections of JSON objects, leveraging the parallel capabilities of Spark to provide a high degree of data independence. The design is based on two key insights: (i) how to map JSONiq expressions to Spark transformations on RDDs and (ii) how to map JSONiq FLWOR clauses to Spark SQL on DataFrames. We have developed a working implementation of these mappings, showing that JSONiq can efficiently run on Spark to query billions of objects, with data sizes reaching at least into the TB range. The JSONiq code is concise in comparison to Spark's host languages while seamlessly supporting the nested, heterogeneous data sets that Spark SQL does not. The ability to process this kind of input, which is commonly encountered, is paramount for data cleaning and curation. The experimental analysis indicates that there is no excessive performance loss, and occasionally even a gain, over Spark SQL for structured data, and a performance gain over PySpark. This demonstrates that a language such as JSONiq is a simple and viable approach to large-scale querying of denormalized, heterogeneous, arborescent data sets, in the same way that SQL can be leveraged for structured data sets. The results also illustrate that Codd's concept of data independence makes as much sense for heterogeneous, nested data sets as it does for highly structured tables. Comment: Preprint, 9 pages
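
    The abstract contrasts JSONiq's conciseness with Spark's host languages. As an illustration only (not the paper's own code), the following minimal PySpark sketch shows the kind of RDD-level handling of heterogeneous, nested JSON lines that a JSONiq FLWOR expression would express declaratively in Rumble; the input path and field names ("review", "rating") are hypothetical.

```python
# Hedged sketch: querying heterogeneous, nested JSON lines with plain PySpark RDDs.
# The file path and field names are hypothetical; Rumble would express the same
# query as a short JSONiq FLWOR expression instead of host-language code.
import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("heterogeneous-json").getOrCreate()

# Each line is a JSON object; objects may carry different or nested fields,
# which spark.read.json would coerce into a single (partly null) schema.
lines = spark.sparkContext.textFile("hdfs:///data/objects.jsonl")
objects = lines.map(json.loads)

# Keep only objects that actually carry a nested "review" with a numeric rating,
# then project a small result -- the kind of filtering JSONiq states declaratively.
good = (objects
        .filter(lambda o: isinstance(o.get("review"), dict))
        .filter(lambda o: o["review"].get("rating", 0) >= 4)
        .map(lambda o: (o.get("id"), o["review"]["rating"])))

print(good.take(5))
```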

    Using SQL-based Scripting Languages in Hadoop Ecosystem for Data Analytics

    The goal of this thesis is to compare different SQL-based scripting languages in the Hadoop ecosystem by implementing data analytics algorithms. The thesis compares the efficiency of the frameworks and the ease of implementing algorithms for a user with no previous experience in distributed computing. To this end, three algorithms were implemented: Pearson's correlation, simple linear regression, and a naive Bayes classifier. The algorithms were implemented in two SQL-based frameworks in the Hadoop ecosystem, Spark SQL and HiveQL, and the corresponding Spark MLlib implementations were used for comparison; SQLContext and HiveContext were also compared within Spark SQL. The algorithms were tested in a cluster with different dataset sizes and different numbers of executors, and the scaling of Spark SQL and of the Spark MLlib algorithms was measured. The results show that for Pearson's correlation HiveQL was slightly faster than the other two frameworks. For linear regression, Spark SQL and Spark MLlib had similar run times, both about 30% faster than HiveQL, and both scaled well on these two algorithms. For the naive Bayes classifier, Spark SQL did not scale well but was still faster than HiveQL; the Spark MLlib results for multinomial naive Bayes were inconclusive. For correlation and regression, no difference between SQLContext and HiveContext was found. The thesis found the SQL-based frameworks easy to use: HiveQL was the easiest, while Spark SQL required some additional study of distributed computing. Implementing algorithms with Spark MLlib was more difficult than expected, as it required understanding the internal workings of the algorithms as well as knowledge of distributed computing.
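
    As an illustration of one of the compared implementations, the following minimal sketch expresses Pearson's correlation in Spark SQL aggregates and cross-checks it against the built-in DataFrame method; the toy data and column names are invented for the example and are not taken from the thesis.

```python
# Hedged sketch: Pearson's r computed with Spark SQL aggregates.
# r = (E[xy] - E[x]E[y]) / (stddev_pop(x) * stddev_pop(y))
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pearson-sparksql").getOrCreate()

df = spark.createDataFrame(
    [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.0)], ["x", "y"])
df.createOrReplaceTempView("points")

r = spark.sql("""
    SELECT (AVG(x * y) - AVG(x) * AVG(y)) / (STDDEV_POP(x) * STDDEV_POP(y)) AS pearson_r
    FROM points
""").first()["pearson_r"]

# Cross-check against the built-in DataFrame implementation.
print(r, df.stat.corr("x", "y"))
```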

    Integration of Skyline Queries into Spark SQL

    Skyline queries are frequently used in data analytics and multi-criteria decision support applications to filter relevant information from large amounts of data. Apache Spark is a popular framework for processing big, distributed data, and it even provides a convenient SQL-like interface via the Spark SQL module. However, skyline queries are not natively supported and require tedious rewriting to fit the SQL standard or Spark's SQL-like language. The goal of our work is to fill this gap. We thus provide a full-fledged integration of the skyline operator into Spark SQL, which allows for a simple and easy-to-use syntax for expressing skyline queries. Moreover, our empirical results show that this integrated solution far outperforms a solution based on rewriting into standard SQL.
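
    To illustrate the rewriting baseline that the integrated operator avoids, the following hedged sketch computes a two-dimensional skyline (minimize price and distance) in standard Spark SQL via a non-equi self-join; the table, columns, and data are hypothetical, and the paper's native skyline syntax would replace this whole pattern.

```python
# Hedged sketch of the rewriting baseline: a point is in the skyline if no other
# point is at least as good in both dimensions and strictly better in at least one.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("skyline-rewrite").getOrCreate()

hotels = spark.createDataFrame(
    [("A", 80, 2.0), ("B", 120, 0.5), ("C", 150, 3.0), ("D", 60, 4.0)],
    ["name", "price", "distance"])
hotels.createOrReplaceTempView("hotels")

# Left self-join on the dominance condition; rows with no dominating partner survive.
skyline = spark.sql("""
    SELECT h.name, h.price, h.distance
    FROM hotels h LEFT JOIN hotels o
      ON o.price <= h.price AND o.distance <= h.distance
     AND (o.price < h.price OR o.distance < h.distance)
    WHERE o.name IS NULL
""")
skyline.show()
```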

    Building Near-Real-Time Processing Pipelines with the Spark-MPI Platform

    Advances in detectors and computational technologies provide new opportunities for applied research and the fundamental sciences. Concurrently, dramatic increases in the three Vs (Volume, Velocity, and Variety) of experimental data and in the scale of computational tasks have produced a demand for new real-time processing systems at experimental facilities. Recently, this demand was addressed by the Spark-MPI approach, which connects the Spark data-intensive platform with the MPI high-performance framework. In contrast with existing data management and analytics systems, Spark introduced a new middleware based on resilient distributed datasets (RDDs), which decouples various data sources from high-level processing algorithms. The RDD middleware significantly advanced the scope of data-intensive applications, ranging from SQL queries to machine learning and graph processing. Spark-MPI further extends the Spark ecosystem to MPI applications using the Process Management Interface. The paper explores this integrated platform within the context of online ptychographic and tomographic reconstruction pipelines. Comment: New York Scientific Data Summit, August 6-9, 201
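
    The following minimal sketch illustrates only the general pattern of coupling Spark's data-parallel stages with gang-scheduled, synchronized workers, using Spark's built-in barrier execution mode; it is not the Spark-MPI platform itself, which launches actual MPI processes through the Process Management Interface, and the per-partition work here is a placeholder.

```python
# Hedged sketch: barrier execution mode as a stand-in for the general pattern of
# synchronized parallel stages inside a Spark pipeline (illustration only).
from pyspark import BarrierTaskContext
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("barrier-sketch").getOrCreate()
frames = spark.sparkContext.parallelize(range(8), numSlices=4)  # e.g. detector frames

def synchronized_stage(iterator):
    ctx = BarrierTaskContext.get()
    peers = [info.address for info in ctx.getTaskInfos()]  # all tasks start together
    local = [x * x for x in iterator]                       # placeholder per-partition work
    ctx.barrier()                                           # global synchronization point
    yield (ctx.partitionId(), len(peers), sum(local))

print(frames.barrier().mapPartitions(synchronized_stage).collect())
```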

    Rover: An online Spark SQL tuning service via generalized transfer learning

    Distributed data analytics engines like Spark are common choices for processing massive data in industry. However, the performance of Spark SQL depends heavily on the choice of configurations, and the optimal ones vary with the executed workloads. Among the various alternatives for Spark SQL tuning, Bayesian optimization (BO) is a popular framework that finds near-optimal configurations given a sufficient budget, but it suffers from the re-optimization issue and is not practical in real production. When applying transfer learning to accelerate the tuning process, we notice two domain-specific challenges: 1) most previous work focuses on transferring tuning history, while expert knowledge from Spark engineers has great potential to improve tuning performance but has not been well studied so far; 2) history tasks should be carefully utilized, since using dissimilar ones leads to deteriorated performance in production. In this paper, we present Rover, a deployed online Spark SQL tuning service for efficient and safe search on industrial workloads. To address these challenges, we propose generalized transfer learning to boost tuning performance based on external knowledge, including expert-assisted Bayesian optimization and controlled history transfer. Experiments on public benchmarks and real-world tasks show the superiority of Rover over competitive baselines. Notably, Rover saves an average of 50.1% of the memory cost on 12k real-world Spark SQL tasks in 20 iterations, among which 76.2% of the tasks achieve a significant memory reduction of over 60%. Comment: Accepted by KDD 202
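
    As a rough illustration of the vanilla Bayesian-optimization loop that Rover generalizes, the sketch below tunes two common Spark SQL knobs with scikit-optimize; run_workload is a hypothetical, synthetic stand-in for executing the target Spark SQL job under a configuration and returning its runtime, and Rover additionally layers expert knowledge and controlled history transfer on top of such a loop.

```python
# Hedged sketch: plain BO over two Spark SQL knobs (not Rover's algorithm).
from skopt import gp_minimize
from skopt.space import Integer

space = [
    Integer(50, 1000, name="spark.sql.shuffle.partitions"),
    Integer(1, 16, name="spark.executor.cores"),
]

def run_workload(config):
    # Hypothetical placeholder: a synthetic cost standing in for a real Spark SQL run.
    p = int(config["spark.sql.shuffle.partitions"])
    c = int(config["spark.executor.cores"])
    return abs(p - 400) / 100.0 + abs(c - 8)

def objective(params):
    shuffle_partitions, executor_cores = params
    config = {
        "spark.sql.shuffle.partitions": str(shuffle_partitions),
        "spark.executor.cores": str(executor_cores),
    }
    return run_workload(config)  # runtime (lower is better)

result = gp_minimize(objective, space, n_calls=20, random_state=0)
print("best configuration:", result.x, "cost:", result.fun)
```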
    • 
