Is a Dataframe Just a Table?

Author: Wu Yifan
Publication venue: OASIcs - OpenAccess Series in Informatics. 10th Workshop on Evaluation and Usability of Programming Languages and Tools (PLATEAU 2019)
Publication date: 01/01/2020

Querying data is core to databases and data science. However, the two communities have seemingly different concepts and use cases. As a result, both designers and users of the query languages disagree on whether the core abstractions - dataframes (data science) and tables (databases) - and their operations are the same. To investigate the difference from a PL-HCI perspective, we identify the basic affordances provided by tables and dataframes and how programming experiences over tables and dataframes differ. We show that the data structures nudge programmers to query and store their data in different ways. We hope the case study can clarify confusion, dispel misinformation, increase cross-pollination between the two communities, and identify open PL-HCI questions.

Dagstuhl Research Online Publication Server

Rumble: Data Independence for Large Messy Data Sets

Authors: Alonso Gustavo; Cikis Can Berker; Fourny Ghislain; Irimescu Stefan; Müller Ingo
Publication date: 06/05/2020

This paper introduces Rumble, an engine that executes JSONiq queries on large, heterogeneous and nested collections of JSON objects, leveraging the parallel capabilities of Spark so as to provide a high degree of data independence. The design is based on two key insights: (i) how to map JSONiq expressions to Spark transformations on RDDs and (ii) how to map JSONiq FLWOR clauses to Spark SQL on DataFrames. We have developed a working implementation of these mappings showing that JSONiq can efficiently run on Spark to query billions of objects into at least the TB range. The JSONiq code is concise in comparison to Spark's host languages while seamlessly supporting the nested, heterogeneous data sets that Spark SQL does not. The ability to process this kind of commonly found input is paramount for data cleaning and curation. The experimental analysis indicates that there is no excessive performance loss, and occasionally even a gain, over Spark SQL for structured data, and a performance gain over PySpark. This demonstrates that a language such as JSONiq is a simple and viable approach to large-scale querying of denormalized, heterogeneous, arborescent data sets, in the same way as SQL can be leveraged for structured data sets. The results also illustrate that Codd's concept of data independence makes as much sense for heterogeneous, nested data sets as it does for highly structured tables.

Comment: Preprint, 9 pages
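
To make "heterogeneous, nested" input concrete, here is a minimal single-machine sketch in plain Python. It is not Rumble's implementation and involves neither Spark nor JSONiq; it only loosely mirrors the for, where and return steps of a FLWOR-style query over JSON objects whose fields may be missing or differently typed. The sample records, the helper `as_sequence`, and the function `query` are all hypothetical names chosen for illustration.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List

# Hypothetical input: one JSON object per line, with missing fields
# and fields whose type varies from object to object.
RAW_LINES = [
    '{"name": "a", "tags": ["x", "y"], "size": 3}',
    '{"name": "b", "tags": "x"}',
    '{"name": "c", "size": "unknown"}',
    '{"tags": ["y"], "size": 10}',
]


def as_sequence(value: Any) -> List[Any]:
    """Normalise an absent value, a single value, or a list to a list."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]


def query(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    for line in lines:                      # "for": iterate over the objects
        obj = json.loads(line)
        size = obj.get("size")
        is_number = isinstance(size, (int, float)) and not isinstance(size, bool)
        if is_number and size > 2:          # "where": tolerate absent or non-numeric sizes
            yield {                         # "return": build a new object per match
                "name": obj.get("name"),
                "tags": as_sequence(obj.get("tags")),
            }


results = list(query(RAW_LINES))
print(results)
```

Objects that lack `size`, or carry it as a string, are skipped rather than raising an error, which is the kind of tolerance a fixed relational schema does not offer out of the box.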

arXiv.org e-Print Archive

Repository for Publications and Research Data

Interactive Multivariate Data Analysis in R with the ade4 and ade4TkGUI Packages

Authors: Jean Thioulouse; Stéphane Dray

ade4 is a multivariate data analysis package for the R statistical environment, and ade4TkGUI is a Tcl/Tk graphical user interface for the most essential methods of ade4. Both packages are available on CRAN. An overview of ade4TkGUI is presented, and the pros and cons of this approach are discussed. We conclude that command line interfaces (CLI) and graphical user interfaces (GUI) are complementary. ade4TkGUI can be valuable for biologists, and particularly for ecologists, who are often occasional users of R. It can spare them having to acquire an in-depth knowledge of R, and it can help first-time users get started.

Research Papers in Economics

D4M 3.0: Extended Database and Language Capabilities

Authors: Chen Alexander; Gadepally Vijay; Hutchison Dylan; Kepner Jeremy; Milechin Lauren; Samsi Siddharth
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 09/08/2017

The D4M tool was developed to address many of today's data needs. This tool is used by hundreds of researchers to perform complex analytics on unstructured data. Over the past few years, the D4M toolbox has evolved to support connectivity with a variety of new database engines, including SciDB. D4M-Graphulo provides the ability to do graph analytics in the Apache Accumulo database. Finally, an implementation using the Julia programming language is also now available. In this article, we describe some of our latest additions to the D4M toolbox and our upcoming D4M 3.0 release. We show through benchmarking and scaling results that we can achieve fast SciDB ingest using the D4M-SciDB connector, that using Graphulo can enable graph algorithms on scales that can be memory limited, and that the Julia implementation of D4M achieves performance comparable to or exceeding that of the existing MATLAB(R) implementation.

Comment: IEEE HPEC 201

arXiv.org e-Print Archive

Crossref

Vaex: Big Data exploration in the era of Gaia

Authors: Breddels Maarten A.; Veljanoski Jovan
Publication venue: EDP Sciences
Publication date: 08/01/2018

We present a new Python library called vaex for handling extremely large tabular datasets, such as astronomical catalogues like the Gaia catalogue, N-body simulations, or any other regular datasets that can be structured in rows and columns. Fast computation of statistics on regular N-dimensional grids allows analysis and visualization on the order of a billion rows per second. We use streaming algorithms, memory-mapped files and a zero-memory-copy policy to allow exploration of datasets larger than memory, e.g. out-of-core algorithms. Vaex allows arbitrary (mathematical) transformations using normal Python expressions and (a subset of) numpy functions, which are lazily evaluated and computed when needed in small chunks, which avoids wasting RAM. Boolean expressions (which are also lazily evaluated) can be used to explore subsets of the data, which we call selections. Vaex uses a DataFrame API similar to that of Pandas, a very popular library, which helps migration from Pandas. Visualization is one of the key points of vaex, and is done using binned statistics in 1d (e.g. histograms), in 2d (e.g. 2d histograms with colormapping) and in 3d (using volume rendering). Vaex is split into several packages: vaex-core for the computational part, vaex-viz for visualization mostly based on matplotlib, vaex-jupyter for visualization in the Jupyter notebook/lab based on IPyWidgets, vaex-server for the (optional) client-server communication, vaex-ui for the Qt-based interface, vaex-hdf5 for hdf5-based memory-mapped storage, and vaex-astro for astronomy-related selections, transformations and memory-mapped (column-based) fits storage. Vaex is open source and available under the MIT license on GitHub; documentation and other information can be found on the main website: https://vaex.io, https://docs.vaex.io or https://github.com/maartenbreddels/vaex

Comment: 14 pages, 8 figures, Submitted to A&A, interactive version of Fig 4: https://vaex.io/paper/fig
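
The combination of chunked evaluation, lazily applied expressions, selections and binned statistics can be illustrated with a small pure-Python sketch. It does not use vaex or its API, and it makes no claim about how vaex is implemented; the functions `chunks` and `binned_count`, their parameters, and the sample data are hypothetical. The point is only that a 1d histogram can be accumulated one chunk at a time, evaluating the expression and the selection per value, without materialising a transformed copy of the whole column.

```python
from typing import Callable, Iterator, List, Optional, Sequence


def chunks(values: Sequence[float], size: int) -> Iterator[Sequence[float]]:
    """Yield successive slices of at most `size` values."""
    for start in range(0, len(values), size):
        yield values[start:start + size]


def binned_count(
    values: Sequence[float],
    expression: Callable[[float], float],
    lo: float,
    hi: float,
    bins: int,
    selection: Optional[Callable[[float], bool]] = None,
    chunk_size: int = 4,
) -> List[int]:
    """Count expression(value) on a regular grid of `bins` cells over [lo, hi)."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for chunk in chunks(values, chunk_size):
        for v in chunk:
            # The selection is a boolean predicate evaluated only when needed.
            if selection is not None and not selection(v):
                continue
            y = expression(v)  # the expression is evaluated per value, per chunk
            if lo <= y < hi:
                counts[int((y - lo) / width)] += 1
    return counts


data = [0.5, 1.5, 2.5, 3.5, 1.2, 2.8, 0.1, 3.9]

# Histogram of the expression 2 * x over [0, 8) with 4 bins.
hist = binned_count(data, lambda x: 2 * x, lo=0.0, hi=8.0, bins=4)

# Same histogram restricted to the selection x > 1.
selected = binned_count(
    data, lambda x: 2 * x, lo=0.0, hi=8.0, bins=4, selection=lambda x: x > 1
)

print(hist)      # [2, 2, 2, 2]
print(selected)  # [0, 2, 2, 2]
```

Only the small `counts` grid persists between chunks, so memory use depends on the number of bins rather than the number of rows.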

arXiv.org e-Print Archive

Proceedings - University of Groningen

University of Groningen

EDP Sciences OAI-PMH repository (1.2.0)

ARTS repository - University of Groningen

Dissertations of the University of Groningen