11 research outputs found
A Survey on Array Storage, Query Languages, and Systems
Since scientific investigation is one of the most important providers of
massive amounts of ordered data, there is a renewed interest in array data
processing in the context of Big Data. To the best of our knowledge, a unified
resource that summarizes and analyzes array processing research over its long
existence is currently missing. In this survey, we provide a guide for past,
present, and future research in array processing. The survey is organized along
three main topics. Array storage discusses all the aspects related to array
partitioning into chunks. The identification of a reduced set of array
operators to form the foundation for an array query language is analyzed across
multiple such proposals. Lastly, we survey real systems for array processing.
The result is a thorough survey on array data storage and processing that
should be consulted by anyone interested in this research topic, independent of
experience level. The survey is not complete though. We greatly appreciate
pointers towards any work we might have forgotten to mention. Comment: 44 pages
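As a small illustration of the chunk partitioning that the survey's storage topic covers, the sketch below maps an array cell to the chunk that holds it. It assumes regular (fixed-shape, axis-aligned) chunks, which is only one of the possible partitioning strategies; the function names `chunk_of` and `chunk_grid` are hypothetical and do not come from the survey.

```python
from math import ceil

def chunk_of(cell, chunk_shape):
    """Return (chunk coordinates, offset inside that chunk) for one cell.

    Assumes regular chunking: every chunk has the same shape and
    chunks are aligned to the array origin.
    """
    chunk = tuple(c // s for c, s in zip(cell, chunk_shape))
    offset = tuple(c % s for c, s in zip(cell, chunk_shape))
    return chunk, offset

def chunk_grid(array_shape, chunk_shape):
    """Number of chunks along each dimension (edge chunks may be partial)."""
    return tuple(ceil(n / s) for n, s in zip(array_shape, chunk_shape))

if __name__ == "__main__":
    print(chunk_of((5, 7), (4, 4)))
    print(chunk_grid((10, 10), (4, 4)))
```

With regular chunks the lookup is pure arithmetic, so no index structure is needed to locate a cell; irregular chunking schemes would trade that simplicity for better fit to skewed data.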
Granite: A scientific database model and implementation
The principal goal of this research was to develop a formal comprehensive model for representing highly complex scientific data. An effective model should provide a conceptually uniform way to represent data and it should serve as a framework for the implementation of an efficient and easy-to-use software environment that implements the model. The dissertation work presented here describes such a model and its contributions to the field of scientific databases. In particular, the Granite model encompasses a wide variety of datatypes used across many disciplines of science and engineering today. It is unique in that it defines dataset geometry and topology as separate conceptual components of a scientific dataset. We provide a novel classification of geometries and topologies that has important practical implications for a scientific database implementation. The Granite model also offers integrated support for multiresolution and adaptive resolution data. Many of these ideas have been addressed by others, but no one has tried to bring them all together in a single comprehensive model.
The datasource portion of the Granite model offers several further contributions. In addition to providing a convenient conceptual view of rectilinear data, it also supports multisource data. Data can be taken from various sources and combined into a unified view.
The rod storage model is an abstraction for file storage that has proven an effective platform upon which to develop efficient access to storage. Our spatial prefetching technique is built upon the rod storage model, and demonstrates very significant improvement in access to scientific datasets, and also allows machines to access data that is far too large to fit in main memory. These improvements bring the extremely large datasets now being generated in many scientific fields into the realm of tractability for the ordinary researcher.
We validated the feasibility and viability of the model by implementing a significant portion of it in the Granite system. Extensive performance evaluations of the implementation indicate that the features of the model can be provided in a user-friendly manner with an efficiency that is competitive with more ad hoc systems and more specialized application-specific solutions.
New data structures and algorithms for the efficient management of large spatial datasets
In this thesis we study the efficient representation of multidimensional grids,
presenting new compact data structures to store and query grids in different
application domains. We propose several static and dynamic data structures for the
representation of binary grids and grids of integers, and study applications to the
representation of raster data in Geographic Information Systems, RDF databases,
etc.
We first propose a collection of static data structures for the representation of
binary grids and grids of integers: 1) a new representation of bi-dimensional binary
grids with large clusters of uniform values, with applications to the representation
of binary raster data; 2) a new data structure to represent multidimensional binary
grids; 3) a new data structure to represent grids of integers with support for top-k
range queries. We also propose a new dynamic representation of binary grids, a new
data structure that provides the same functionalities as our static representations
of binary grids but also supports changes in the grid.
Our data structures can be used in several application domains. We propose
specific variants and combinations of our generic proposals to represent temporal
graphs, RDF databases, OLAP databases, binary or general raster data, and
temporal raster data. We also propose a new algorithm to jointly query a raster
dataset (stored using our representations) and a vectorial dataset stored in a classic
data structure, showing that our proposal can be faster and require less space than
the usual alternatives. Our representations provide interesting trade-offs and are
competitive in terms of space and query times with usual representations in the
different domains.
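To make concrete why binary grids with large clusters of uniform values compress well, here is a minimal sketch of a plain pointer-based region quadtree. This is a classical textbook structure chosen for illustration, not one of the compact data structures proposed in the thesis; the names `build` and `get` are hypothetical, and the grid is assumed to be square with a power-of-two side.

```python
def build(grid, r=0, c=0, size=None):
    """Build a region quadtree over a square binary grid.

    A uniform block becomes a leaf (0 or 1). A mixed block becomes a
    list of four children in the order NW, NE, SW, SE.
    Assumes the grid side is a power of two.
    """
    if size is None:
        size = len(grid)
    first = grid[r][c]
    if all(grid[i][j] == first
           for i in range(r, r + size)
           for j in range(c, c + size)):
        return first
    h = size // 2
    return [build(grid, r, c, h), build(grid, r, c + h, h),
            build(grid, r + h, c, h), build(grid, r + h, c + h, h)]

def get(node, r, c, size):
    """Return the value of cell (r, c) by descending from the root."""
    while isinstance(node, list):
        size //= 2
        idx = (2 if r >= size else 0) + (1 if c >= size else 0)
        r %= size
        c %= size
        node = node[idx]
    return node

if __name__ == "__main__":
    grid = [[1, 1, 0, 0],
            [1, 1, 0, 0],
            [0, 0, 0, 1],
            [0, 0, 0, 0]]
    tree = build(grid)
    print(tree)
    print(get(tree, 2, 3, 4))
```

Three of the four top-level quadrants are uniform and collapse to a single leaf each, so only the mixed quadrant is subdivided further.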
Extending General Compact Querieable Representations to GIS Applications
The raster model is commonly used for the representation of images in many
domains, and is especially useful in Geographic Information Systems (GIS) to
store information about continuous variables of the space (elevation,
temperature, etc.). Current representations of raster data are usually designed
for external memory or, when stored in main memory, lack efficient query
capabilities. In this paper we propose compact representations to efficiently
store and query raster datasets in main memory. We present different
representations for binary raster data, general raster data and time-evolving
raster data. We experimentally compare our proposals with traditional storage
mechanisms such as linear quadtrees or compressed GeoTIFF files. Results show
that our structures are up to 10 times smaller than classical linear quadtrees,
and even comparable in space to non-querieable representations of raster data,
while efficiently answering a number of typical queries. Comment: This research has received funding from the European Union's Horizon
2020 research and innovation programme under the Marie Sklodowska-Curie
Actions H2020-MSCA-RISE-2015 BIRDS GA No. 690941
Soundtrack recommendation for images
The drastic increase in production of multimedia content has emphasized the research concerning its organization and retrieval. In this thesis, we address the problem of music retrieval when a set of images is given as input query, i.e., the problem of soundtrack recommendation for images. The task at hand is to recommend appropriate music to be played during the presentation of a given set of query images. To tackle this problem, we formulate a hypothesis that the knowledge appropriate for the task is contained in publicly available contemporary movies. Our approach, Picasso, employs similarity search techniques inside the image and music domains, harvesting movies to form a link between the domains. To achieve a fair and unbiased comparison between different soundtrack recommendation approaches, we propose an evaluation benchmark. The evaluation results are reported for Picasso and the baseline approach, using the proposed benchmark. We further address two efficiency aspects that arise from the Picasso approach. First, we investigate the problem of processing top-K queries with set-defined selections and propose an index structure that aims at minimizing the query answering latency. Second, we address the problem of similarity search in high-dimensional spaces and propose two enhancements to the Locality Sensitive Hashing (LSH) scheme. We also investigate the prospects of a distributed similarity search algorithm based on LSH using the MapReduce framework. Finally, we give an overview of PicasSound, a smartphone application based on the Picasso approach.
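For readers unfamiliar with Locality Sensitive Hashing, the sketch below shows the basic bucketing idea with one hash table. It assumes the random-hyperplane (sign of random projection) family for cosine similarity; the abstract does not say which LSH family or which enhancements the thesis uses, so this is a generic illustration only, and all function names are hypothetical.

```python
import random
from collections import defaultdict

def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplane normals with Gaussian components."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def signature(vec, planes):
    """One bit per hyperplane: which side of the plane the vector lies on."""
    return tuple(
        1 if sum(p * x for p, x in zip(plane, vec)) >= 0 else 0
        for plane in planes
    )

def build_index(vectors, planes):
    """Group vector keys into buckets by their signature."""
    buckets = defaultdict(list)
    for key, vec in vectors.items():
        buckets[signature(vec, planes)].append(key)
    return buckets

def candidates(query, planes, buckets):
    """Keys sharing the query's bucket; these still need exact re-ranking."""
    return buckets.get(signature(query, planes), [])

if __name__ == "__main__":
    planes = make_hyperplanes(dim=3, n_bits=8)
    data = {"a": [1.0, 2.0, 3.0], "b": [-3.0, 0.5, -1.0]}
    index = build_index(data, planes)
    print(candidates([2.0, 4.0, 6.0], planes, index))
```

Vectors pointing in similar directions are likely to fall on the same side of most hyperplanes and so to share a bucket; practical systems use several tables to raise recall.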
LIPIcs, Volume 248, ISAAC 2022, Complete Volume