Is a Dataframe Just a Table?

Author: Wu Yifan
Publication venue: OASIcs - OpenAccess Series in Informatics. 10th Workshop on Evaluation and Usability of Programming Languages and Tools (PLATEAU 2019)
Publication date: 01/01/2020

Querying data is core to databases and data science. However, the two communities have seemingly different concepts and use cases. As a result, both designers and users of the query languages disagree on whether the core abstractions, dataframes (data science) and tables (databases), and the operations are the same. To investigate the difference from a PL-HCI perspective, we identify the basic affordances provided by tables and dataframes and how programming experiences over tables and dataframes differ. We show that the data structures nudge programmers to query and store their data in different ways. We hope the case study can clear up confusion, dispel misinformation, increase cross-pollination between the two communities, and identify open PL-HCI questions.

Dagstuhl Research Online Publication Server

Rumble: Data Independence for Large Messy Data Sets

Authors: Alonso Gustavo; Cikis Can Berker; Fourny Ghislain; Irimescu Stefan; Müller Ingo
Publication date: 06/05/2020

This paper introduces Rumble, an engine that executes JSONiq queries on large, heterogeneous, and nested collections of JSON objects, leveraging the parallel capabilities of Spark to provide a high degree of data independence. The design is based on two key insights: (i) how to map JSONiq expressions to Spark transformations on RDDs and (ii) how to map JSONiq FLWOR clauses to Spark SQL on DataFrames. We have developed a working implementation of these mappings, showing that JSONiq can run efficiently on Spark to query billions of objects, up to at least the TB range. The JSONiq code is concise in comparison to Spark's host languages while seamlessly supporting the nested, heterogeneous data sets that Spark SQL does not. The ability to process this commonly found kind of input is paramount for data cleaning and curation. The experimental analysis indicates that there is no excessive performance loss, and occasionally even a gain, over Spark SQL for structured data, and a performance gain over PySpark. This demonstrates that a language such as JSONiq is a simple and viable approach to large-scale querying of denormalized, heterogeneous, arborescent data sets, in the same way as SQL can be leveraged for structured data sets. The results also illustrate that Codd's concept of data independence makes as much sense for heterogeneous, nested data sets as it does for highly structured tables.

arXiv.org e-Print Archive

Repository for Publications and Research Data

Relational Graph Models at Work

Authors: Breuvart Flavien; Manzonetto Giulio; Ruoppolo Domenico
Publication date: 19/07/2018

We study the relational graph models that constitute a natural subclass of relational models of lambda-calculus. We prove that among the lambda-theories induced by such models there exists a minimal one, and that the corresponding relational graph model is very natural and easy to construct. We then study relational graph models that are fully abstract, in the sense that they capture some observational equivalence between lambda-terms. We focus on the two main observational equivalences in the lambda-calculus: the theory H+, generated by taking as observables the beta-normal forms, and H*, generated by considering as observables the head normal forms. On the one hand, we introduce a notion of lambda-König model and prove that a relational graph model is fully abstract for H+ if and only if it is extensional and lambda-König. On the other hand, we show that the dual notion of hyperimmune model, together with extensionality, captures the full abstraction for H*.

arXiv.org e-Print Archive

Episciences.org