573 research outputs found
Parallel Construction of Wavelet Trees on Multicore Architectures
The wavelet tree has become a very useful data structure to efficiently
represent and query large volumes of data in many different domains, from
bioinformatics to geographic information systems. One problem with wavelet
trees is their construction time. In this paper, we introduce two algorithms
that reduce the time complexity of a wavelet tree's construction by taking
advantage of nowadays ubiquitous multicore machines.
Our first algorithm constructs all the levels of the wavelet in parallel in
time and bits of working space, where
is the size of the input sequence and is the size of the alphabet. Our
second algorithm constructs the wavelet tree in a domain-decomposition fashion,
using our first algorithm in each segment, reaching time and
bits of extra space, where is the
number of available cores. Both algorithms are practical and report good
speedup for large real datasets.Comment: This research has received funding from the European Union's Horizon
2020 research and innovation programme under the Marie Sk{\l}odowska-Curie
Actions H2020-MSCA-RISE-2015 BIRDS GA No. 69094
Space-Efficient Data-Analysis Queries on Grids
We consider various data-analysis queries on two-dimensional points. We give
new space/time tradeoffs over previous work on geometric queries such as
dominance and rectangle visibility, and on semigroup and group queries such as
sum, average, variance, minimum and maximum. We also introduce new solutions to
queries less frequently considered in the literature such as two-dimensional
quantiles, majorities, successor/predecessor, mode, and various top-
queries, considering static and dynamic scenarios.Comment: 20 pages, 2 figures, submittin
SciQL, Bridging the Gap between Science and Relational DBMS
Scientific discoveries increasingly rely on the ability to efficiently grind massive amounts of experimental data using database technologies. To bridge the gap between the needs of the Data-Intensive Research fields and the current DBMS technologies, we propose SciQL (pronounced as âcycleâ), the first SQL-based query language for scientific applications with both tables and arrays as first class citizens. It provides a seamless symbiosis of array-, set- and sequence- interpretations. A key innovation is the extension of value-based grouping of SQL:2003 with structural grouping, i.e., fixed-sized and unbounded groups based on explicit relationships between elements positions. This leads to a generalisation of window-based query processing with wide applicability in science domains. This paper describes the main language features of SciQL and illustrates
it using time-series concepts
Towards unifying spreadsheets with databases for ad-hoc interactive data management at scale
We are witnessing the increasing availability of data across a spectrum of domains, necessitating the interactive ad-hoc management and analysis of this data, in order to put it to use. Unfortunately, interactive ad-hoc management of very large datasets presents a host of challenges, ranging from performance to interface usability. This thesis introduces a new research direction of manipulation of large datasets using an interactive interface and makes several steps towards this direction. In particular, we develop DataSpread, a tool that enables users to work with arbitrary large datasets via a direct manipulation interface. DataSpread holistically unifies spreadsheets and relational databases to leverage the benefits of both. However, this holistic integration is not trivial due to the differences in the architecture and ideologies of the two paradigms: spreadsheets and databases. We have built a prototype of DataSpread, which, in addition to motivating the underlying challenges, demonstrates the feasibility and usefulness of this holistic integration. We focus on the following challenges encountered while developing DataSpread. (i) Representationâhere, we address the challenges of flexibly representing ad-hoc spreadsheet data within a relational database; (ii) Indexingâhere, we develop indexing data structures for supporting and maintaining access by position; (iii) Formula Computationâhere, we introduce an asynchronous formula computation framework that addresses the challenge of ensuring consistency and interactivity at the same time; and (iv) Organizationâhere, we develop a framework to best organize data based on a workload, e.g., queries specified on the spreadsheet interface
- âŠ