4,309 research outputs found

    Is a Dataframe Just a Table?

    Get PDF
    Querying data is core to databases and data science. However, the two communities have seemingly different concepts and use cases. As a result, both designers and users of the query languages disagree on whether the core abstractions - dataframes (data science) and tables (databases) - and the operations are the same. To investigate the difference from a PL-HCI perspective, we identify the basic affordances provided by tables and dataframes and how programming experiences over tables and dataframes differ. We show that the data structures nudge programmers to query and store their data in different ways. We hope the case study could clarify confusions, dispel misinformation, increase cross-pollination between the two communities, and identify open PL-HCI questions

    Rumble: Data Independence for Large Messy Data Sets

    Full text link
    This paper introduces Rumble, an engine that executes JSONiq queries on large, heterogeneous and nested collections of JSON objects, leveraging the parallel capabilities of Spark so as to provide a high degree of data independence. The design is based on two key insights: (i) how to map JSONiq expressions to Spark transformations on RDDs and (ii) how to map JSONiq FLWOR clauses to Spark SQL on DataFrames. We have developed a working implementation of these mappings showing that JSONiq can efficiently run on Spark to query billions of objects into, at least, the TB range. The JSONiq code is concise in comparison to Spark's host languages while seamlessly supporting the nested, heterogeneous data sets that Spark SQL does not. The ability to process this kind of input, commonly found, is paramount for data cleaning and curation. The experimental analysis indicates that there is no excessive performance loss, occasionally even a gain, over Spark SQL for structured data, and a performance gain over PySpark. This demonstrates that a language such as JSONiq is a simple and viable approach to large-scale querying of denormalized, heterogeneous, arborescent data sets, in the same way as SQL can be leveraged for structured data sets. The results also illustrate that Codd's concept of data independence makes as much sense for heterogeneous, nested data sets as it does on highly structured tables.Comment: Preprint, 9 page

    Relational Graph Models at Work

    Full text link
    We study the relational graph models that constitute a natural subclass of relational models of lambda-calculus. We prove that among the lambda-theories induced by such models there exists a minimal one, and that the corresponding relational graph model is very natural and easy to construct. We then study relational graph models that are fully abstract, in the sense that they capture some observational equivalence between lambda-terms. We focus on the two main observational equivalences in the lambda-calculus, the theory H+ generated by taking as observables the beta-normal forms, and H* generated by considering as observables the head normal forms. On the one hand we introduce a notion of lambda-K\"onig model and prove that a relational graph model is fully abstract for H+ if and only if it is extensional and lambda-K\"onig. On the other hand we show that the dual notion of hyperimmune model, together with extensionality, captures the full abstraction for H*
    • …
    corecore