
3,488 research outputs found

Hillview: A trillion-cell spreadsheet for big data

Authors: Aguilera Marcos K.; Budiu Mihai; Gopalan Parikshit; Kruiger Han; Suresh Lalith; Wieder Udi
Publication venue: VLDB Endowment
Publication date: 01/07/2019

Hillview is a distributed spreadsheet for browsing very large datasets that cannot be handled by a single machine. As a spreadsheet, Hillview provides a high degree of interactivity that permits data analysts to explore information quickly along many dimensions while switching visualizations on a whim. To provide the required responsiveness, Hillview introduces visualization sketches, or vizketches, as a simple idea to produce compact data visualizations. Vizketches combine algorithmic techniques for data summarization with computer graphics principles for efficient rendering. While simple, vizketches are effective at scaling the spreadsheet by parallelizing computation, reducing communication, providing progressive visualizations, and offering precise accuracy guarantees. Using Hillview running on eight servers, we can navigate and visualize datasets of tens of billions of rows and trillions of cells, much beyond the published capabilities of competing systems.
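
The abstract's point about parallelizing computation and reducing communication can be illustrated with a mergeable histogram summary. The following is a minimal sketch, not Hillview's actual code or API: the function names (`histogram_sketch`, `merge`, `distributed_histogram`) are hypothetical, the partitions are plain Python lists standing in for data held on separate servers, and the counts are exact rather than sampled. The idea shown is only that each partition is reduced to a fixed-size array of bucket counts, so the amount of data sent to the coordinator depends on the number of buckets (which would plausibly be tied to the display resolution) and not on the number of rows.

```python
from concurrent.futures import ThreadPoolExecutor


def histogram_sketch(values, lo, hi, buckets):
    """Summarize one partition as a fixed-size list of bucket counts."""
    counts = [0] * buckets
    width = (hi - lo) / buckets
    for v in values:
        if lo <= v <= hi:
            # Clamp so that v == hi (or rounding) lands in the last bucket.
            idx = min(int((v - lo) / width), buckets - 1)
            counts[idx] += 1
    return counts


def merge(a, b):
    """Combine two summaries; addition is associative and commutative."""
    return [x + y for x, y in zip(a, b)]


def distributed_histogram(partitions, lo, hi, buckets):
    """Summarize partitions in parallel, then merge the small summaries."""
    with ThreadPoolExecutor() as pool:
        partials = pool.map(
            lambda part: histogram_sketch(part, lo, hi, buckets), partitions
        )
        total = [0] * buckets
        for partial in partials:
            # In a progressive setting, `total` could be rendered after each merge.
            total = merge(total, partial)
    return total
```

Because `merge` is associative, partial results can be combined in any order, and a partial total can be displayed before all partitions have reported.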

arXiv.org e-Print Archive

Proceedings - University of Groningen

University of Groningen

ARTS repository - University of Groningen

Dissertations of the University of Groningen

Recommended from our members

Multi-resolution Storage and Search in Sensor Networks

Author: Ganesan Deepak
Publication venue: ScholarWorks@UMass Amherst
Publication date: 01/01/2003

ScholarWorks@UMass Amherst

PolyFit: Polynomial-based indexing approach for fast approximate range aggregate queries

Authors: Chan Tsz Nam; Jensen Christian S.; Li Zhe; Yiu Man Lung
Publication venue: OpenProceedings.org
Publication date: 01/01/2021

VBN

Entity Resolution On-Demand

Authors: Bergamaschi Sonia; Naumann Felix; Simonini Giovanni; Zecchini Luca
Publication date: 01/01/2022

Entity Resolution (ER) aims to identify and merge records that refer to the same real-world entity. ER is typically employed as an expensive cleaning step on the entire data before consuming it. Yet, determining which entities are useful once cleaned depends solely on the user's application, which may need only a fraction of them. For instance, when dealing with Web data, we would like to be able to filter the entities of interest gathered from multiple sources without cleaning the entire, continuously growing data. Similarly, when querying data lakes, we want to transform data on-demand and return the results in a timely manner, a fundamental requirement of ELT (Extract-Load-Transform) pipelines. We propose BrewER, a framework to evaluate SQL SP queries on dirty data while progressively returning results as if they were issued on cleaned data. BrewER tries to focus the cleaning effort on one entity at a time, following an ORDER BY predicate. Thus, it inherently supports top-k and stop-and-resume execution. For a wide range of applications, a significant amount of resources can be saved. We exhaustively evaluate and show the efficacy of BrewER on four real-world datasets.
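
The idea of cleaning one entity at a time in ORDER BY order can be illustrated with a priority queue. What follows is a simplified sketch under stated assumptions, not BrewER's actual algorithm or API: it assumes candidate blocks of possibly matching records are already available, that a hypothetical `same_entity` function decides matches, and that the query is an ascending ORDER BY whose value for a merged entity is the minimum over its records. Under those assumptions, a block's smallest record value is a lower bound for every entity in it, so a block only needs to be resolved when it reaches the head of the queue.

```python
import heapq
from itertools import count


def progressive_er(blocks, same_entity, order_key):
    """Yield (value, entity_records) in ascending order, resolving lazily."""
    tie = count()  # unique tie-breaker so heap entries never compare records
    heap = []
    for block in blocks:
        # Optimistic bound: no entity in the block can sort before this value.
        bound = min(order_key(r) for r in block)
        heapq.heappush(heap, (bound, next(tie), False, block))

    while heap:
        value, _, resolved, records = heapq.heappop(heap)
        if resolved:
            # Everything left in the heap sorts at or after this entity.
            yield value, records
            continue
        # Clean this block only now: greedily cluster matching records.
        clusters = []
        for rec in records:
            for cluster in clusters:
                if same_entity(cluster[0], rec):
                    cluster.append(rec)
                    break
            else:
                clusters.append([rec])
        for cluster in clusters:
            exact = min(order_key(r) for r in cluster)
            heapq.heappush(heap, (exact, next(tie), True, cluster))
```

Since `progressive_er` is a generator, a top-k query can be expressed by taking the first k results (for example with `itertools.islice`), and iteration can be paused and resumed; blocks that never reach the head of the queue are never resolved.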

Archivio istituzionale della ricerca - Università di Modena e Reggio Emilia

How to evaluate multiple range-sum queries progressively

Authors: Cyrus Shahabi; Rolfe R. Schmidt
Publication venue: Association for Computing Machinery (ACM)
Publication date: 01/01/2002

Decision support system users typically submit batches of range-sum queries simultaneously rather than issuing individual, unrelated queries. We propose a wavelet based technique that exploits I/O sharing across a query batch to evaluate the set of queries progressively and efficiently. The challenge is that now controlling the structure of errors across query results becomes more critical than minimizing error per individual query. Consequently, we define a class of structural error penalty functions and show how they are controlled by our technique. Experiments demonstrate that our technique is efficient as an exact algorithm, and the progressive estimates are accurate, even after less than one I/O per query.
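
The general mechanism behind progressive, wavelet-based range sums can be sketched in a few lines. This is an illustrative toy, not the authors' implementation: it uses a one-dimensional orthonormal Haar transform, requires the data length to be a power of two, and ranks coefficients by the summed squared query coefficients across the batch, which is only one simple stand-in for the structural error penalty functions the abstract mentions. The function names are hypothetical. It relies on the fact that an orthonormal transform preserves inner products, so a range sum equals the dot product of the transformed query indicator and the transformed data.

```python
import math


def haar(vec):
    """Orthonormal Haar transform; len(vec) must be a power of two."""
    out = [float(v) for v in vec]
    n = len(out)
    while n > 1:
        half = n // 2
        sums = [(out[2 * i] + out[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        diffs = [(out[2 * i] - out[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        out[:n] = sums + diffs
        n = half
    return out


def progressive_batch(data, ranges):
    """Yield successive estimates for all inclusive (lo, hi) range sums."""
    n = len(data)
    data_w = haar(data)  # stands in for wavelet coefficients stored on disk
    query_w = [
        haar([1.0 if lo <= i <= hi else 0.0 for i in range(n)])
        for lo, hi in ranges
    ]
    # Assumed importance measure: total squared query weight per coefficient.
    importance = [sum(q[i] ** 2 for q in query_w) for i in range(n)]
    order = sorted(
        (i for i in range(n) if importance[i] > 0),
        key=lambda i: -importance[i],
    )
    estimates = [0.0] * len(ranges)
    for i in order:
        coeff = data_w[i]  # one simulated retrieval, shared by every query
        estimates = [e + q[i] * coeff for e, q in zip(estimates, query_w)]
        yield list(estimates)
```

Each retrieved data coefficient updates every query in the batch at once, which is the sharing the abstract refers to; after the last step the estimates equal the exact range sums up to floating-point rounding.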

CiteSeerX

Crossref