4,309 research outputs found
Is a Dataframe Just a Table?
Querying data is core to databases and data science. However, the two communities have seemingly different concepts and use cases. As a result, both designers and users of the query languages disagree on whether the core abstractions - dataframes (data science) and tables (databases) - and the operations are the same. To investigate the difference from a PL-HCI perspective, we identify the basic affordances provided by tables and dataframes and how programming experiences over tables and dataframes differ. We show that the data structures nudge programmers to query and store their data in different ways. We hope the case study could clarify confusions, dispel misinformation, increase cross-pollination between the two communities, and identify open PL-HCI questions
Rumble: Data Independence for Large Messy Data Sets
This paper introduces Rumble, an engine that executes JSONiq queries on
large, heterogeneous and nested collections of JSON objects, leveraging the
parallel capabilities of Spark so as to provide a high degree of data
independence. The design is based on two key insights: (i) how to map JSONiq
expressions to Spark transformations on RDDs and (ii) how to map JSONiq FLWOR
clauses to Spark SQL on DataFrames. We have developed a working implementation
of these mappings showing that JSONiq can efficiently run on Spark to query
billions of objects into, at least, the TB range. The JSONiq code is concise in
comparison to Spark's host languages while seamlessly supporting the nested,
heterogeneous data sets that Spark SQL does not. The ability to process this
kind of input, commonly found, is paramount for data cleaning and curation. The
experimental analysis indicates that there is no excessive performance loss,
occasionally even a gain, over Spark SQL for structured data, and a performance
gain over PySpark. This demonstrates that a language such as JSONiq is a simple
and viable approach to large-scale querying of denormalized, heterogeneous,
arborescent data sets, in the same way as SQL can be leveraged for structured
data sets. The results also illustrate that Codd's concept of data independence
makes as much sense for heterogeneous, nested data sets as it does on highly
structured tables.Comment: Preprint, 9 page
Relational Graph Models at Work
We study the relational graph models that constitute a natural subclass of
relational models of lambda-calculus. We prove that among the lambda-theories
induced by such models there exists a minimal one, and that the corresponding
relational graph model is very natural and easy to construct. We then study
relational graph models that are fully abstract, in the sense that they capture
some observational equivalence between lambda-terms. We focus on the two main
observational equivalences in the lambda-calculus, the theory H+ generated by
taking as observables the beta-normal forms, and H* generated by considering as
observables the head normal forms. On the one hand we introduce a notion of
lambda-K\"onig model and prove that a relational graph model is fully abstract
for H+ if and only if it is extensional and lambda-K\"onig. On the other hand
we show that the dual notion of hyperimmune model, together with
extensionality, captures the full abstraction for H*
- …