15 research outputs found
Efficient Management of Short-Lived Data
Motivated by the increasing prominence of loosely-coupled systems, such as
mobile and sensor networks, which are characterised by intermittent
connectivity and volatile data, we study the tagging of data with so-called
expiration times. More specifically, when data are inserted into a database,
they may be tagged with time values indicating when they expire, i.e., when
they are regarded as stale or invalid and thus are no longer considered part of
the database. In a number of applications, expiration times are known and can
be assigned at insertion time. We present data structures and algorithms for
online management of data tagged with expiration times. The algorithms are
based on fully functional, persistent treaps, which are a combination of binary
search trees with respect to a primary attribute and heaps with respect to a
secondary attribute. The primary attribute implements primary keys, and the
secondary attribute stores expiration times in a minimum heap, thus keeping a
priority queue of tuples to expire. A detailed and comprehensive experimental
study demonstrates the well-behavedness and scalability of the approach as well
as its efficiency with respect to a number of competitors.
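The treap described above can be illustrated with a small ephemeral (non-persistent) sketch in Python: a binary search tree on the primary key whose heap order on expiration times keeps the next tuple to expire at the root. The node fields, function names, and the merge-based expiration step are illustrative assumptions, not the paper's fully functional, persistent implementation:

```python
class Node:
    """Treap node: BST on `key` (the primary key), min-heap on `expiry`."""
    def __init__(self, key, value, expiry):
        self.key, self.value, self.expiry = key, value, expiry
        self.left = self.right = None

def _rotate_right(t):
    l = t.left
    t.left, l.right = l.right, t
    return l

def _rotate_left(t):
    r = t.right
    t.right, r.left = r.left, t
    return r

def insert(t, key, value, expiry):
    """Standard BST insert, then rotate to restore the min-heap on expiry.
    Assumes keys are distinct (no in-place expiry updates)."""
    if t is None:
        return Node(key, value, expiry)
    if key < t.key:
        t.left = insert(t.left, key, value, expiry)
        if t.left.expiry < t.expiry:
            t = _rotate_right(t)
    else:
        t.right = insert(t.right, key, value, expiry)
        if t.right.expiry < t.expiry:
            t = _rotate_left(t)
    return t

def _merge(a, b):
    """Merge two treaps where every key in `a` precedes every key in `b`."""
    if a is None:
        return b
    if b is None:
        return a
    if a.expiry <= b.expiry:
        a.right = _merge(a.right, b)
        return a
    b.left = _merge(a, b.left)
    return b

def expire(t, now):
    """The root always holds the minimum expiry, so expiration is simply
    repeated deletion of the root until it is no longer stale."""
    while t is not None and t.expiry <= now:
        t = _merge(t.left, t.right)
    return t

def keys(t):
    """In-order traversal: live primary keys in sorted order."""
    return [] if t is None else keys(t.left) + [t.key] + keys(t.right)
```

Because the minimum expiration time is always at the root, expiring all stale tuples costs one root deletion per expired tuple, with no scan of the live data.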
Efficient Processing of Raster and Vector Data
[Abstract] In this work, we propose a framework to store and manage spatial data, which includes new efficient algorithms to perform operations that accept as input a raster dataset and a vector dataset. More concretely, we present algorithms for solving a spatial join between a raster and a vector dataset, imposing a restriction on the values of the cells of the raster, and an algorithm for retrieving the K objects of a vector dataset that overlap cells of a raster dataset, such that the K objects are those overlapping the highest (or lowest) cell values among all objects. The raster data is stored using a compact data structure that can directly manipulate compressed data without prior decompression. This leads to better running times and lower memory consumption. In our experimental evaluation comparing our solution to other baselines, we obtain the best space/time trade-offs.
This work has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 690941; from the Ministerio de Ciencia, Innovación y Universidades (PGE and ERDF), grant numbers TIN2016-78011-C4-1-R, TIN2016-77158-C4-3-R, and RTC-2017-5908-7; from Xunta de Galicia (co-funded with ERDF), grant numbers ED431C 2017/58, ED431G/01, and IN852A 2018/14; and from Universidad del Bío-Bío (Chile), grant numbers 192119 2/R and 195119 GI/V.
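As a point of reference for the join described above, a naive uncompressed baseline can be sketched in Python. The grid and object encodings and the function name are assumptions; the contribution of the work is precisely to avoid this cell-by-cell scan by pruning on the compressed representation:

```python
def raster_vector_join(raster, objects, lo, hi):
    """Naive raster-vector spatial join: report (object_id, row, col) for
    every raster cell inside an object's bounding box whose value lies in
    [lo, hi].  `raster` is a list of rows; `objects` maps an id to an
    inclusive, cell-aligned bounding box (r0, c0, r1, c1) -- a simplifying
    assumption standing in for arbitrary vector geometries."""
    out = []
    for oid, (r0, c0, r1, c1) in objects.items():
        for r in range(r0, r1 + 1):
            for c in range(c0, c1 + 1):
                if lo <= raster[r][c] <= hi:
                    out.append((oid, r, c))
    return out
```

A compressed representation with per-subtree minimum/maximum cell values can skip whole regions whose value range falls outside [lo, hi], which is where the better running times come from.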
Learning-Augmented B-Trees
We study learning-augmented binary search trees (BSTs) and B-Trees via Treaps
with composite priorities. The result is a simple search tree where the depth
of each item x is determined by its predicted weight w_x. To achieve the
result, each item x is assigned the composite priority
-⌊log log(1/w_x)⌋ + U(0,1), where U(0,1) is a uniform random variable. This
generalizes the recent learning-augmented BSTs [Lin-Luo-Woodruff ICML'22],
which only work for Zipfian distributions, to arbitrary inputs and
predictions. It also gives the first B-Tree data structure that can provably
take advantage of localities in the access sequence via online
self-reorganization. The data structure is robust to prediction errors and
handles insertions, deletions, as well as prediction updates.
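The composite priority can be written down directly. A minimal sketch, assuming base-2 logarithms and predicted weights w in (0, 1); the paper's exact constants and log base may differ:

```python
import math
import random

def composite_priority(w, u=None):
    """Composite priority for an item with predicted access weight w in (0, 1).
    The deterministic term buckets items by weight scale; the uniform jitter
    U(0, 1) randomizes the order within a bucket.  Base-2 logarithms are an
    assumption here, not necessarily the paper's exact choice."""
    if u is None:
        u = random.random()  # U(0, 1)
    return -math.floor(math.log2(math.log2(1.0 / w))) + u
```

In a max-treap ordered by this priority, items with larger predicted weight receive larger priorities and therefore sit closer to the root, so frequently accessed items are cheap to reach while the jitter preserves treap-style randomization within each weight bucket.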
Compact data structures for large and complex datasets
Programa Oficial de Doutoramento en Computación. 5009V01
[Abstract]
In this thesis, we study the problem of processing large and complex collections of
data, presenting new data structures and algorithms that allow us to efficiently store
and analyze them. We focus on three main domains: processing of multidimensional
data, representation of spatial information, and analysis of scientific data.
The common nexus is the use of compact data structures, which combine, in a
single data structure, a compressed representation of the data and the
structures needed to access it. The goal is to manage data directly in
compressed form, and in this way to keep data always compressed, even in main
memory. This yields two benefits: we can manage larger datasets in main
memory, and we make better use of the memory hierarchy.
In the first part, we propose a compact data structure for multidimensional
databases where the domains of each dimension are hierarchical. It allows efficient
queries of aggregate information at different levels of each dimension. A typical
application environment for our solution would be an OLAP system.
Second, we focus on the representation of spatial information, specifically on
raster data, which are commonly used in geographic information systems (GIS) to
represent spatial attributes (such as the altitude of a terrain, the average temperature,
etc.). The new method supports several typical spatial queries with better
response times than the state of the art, while also saving space in both
main memory and on disk. In addition, we present a framework to run a spatial
join between raster and vector datasets, which uses the compact data
structure previously presented in this part of the thesis.
Finally, we present a solution for the computation of empirical moments from
a set of trajectories of a continuous-time stochastic process observed over a
given period of time. The empirical autocovariance function is an example of
such operations. We propose a method that compresses sequences of
floating-point numbers representing Brownian motion trajectories, although it
can be used in other similar areas. We also introduce a new algorithm for the
calculation of the autocovariance that uses a single trajectory at a time,
instead of loading the whole dataset, reducing memory consumption during the
calculation process.
This work was supported by Xunta de Galicia (ED431G/01, GRC2013/05); Ministerio de Economía y Competitividad (TIN2016-78011-C4-1-R, TIN2016-77158-C4-3-R, TIN2013-46801-C4-3-R); and Centro para el Desarrollo Tecnológico e Industrial (IDI-20141259, ITC-20151247).
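The one-trajectory-at-a-time autocovariance computation reduces to accumulating per-time sums and cross-products. A plain-Python sketch, assuming all trajectories share the same time grid; the names are illustrative, and the thesis additionally operates on the compressed sequences:

```python
def empirical_autocovariance(trajectories):
    """Empirical autocovariance C(s, t) over a common time grid, computed by
    streaming one trajectory at a time: only the current trajectory plus
    O(T^2) accumulators (sums and cross-products) are held in memory.
    Assumes at least one trajectory, all of equal length T."""
    n, sums, cross = 0, None, None
    for x in trajectories:
        if sums is None:
            t = len(x)
            sums = [0.0] * t
            cross = [[0.0] * t for _ in range(t)]
        n += 1
        for i in range(t):
            sums[i] += x[i]
            for j in range(t):
                cross[i][j] += x[i] * x[j]
    means = [v / n for v in sums]
    return [[cross[i][j] / n - means[i] * means[j] for j in range(t)]
            for i in range(t)]
```

Because the accumulators depend only on running sums, `trajectories` can be a generator that decompresses one trajectory at a time, which is what keeps the peak memory footprint small.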
Sampling Algorithms for Evolving Datasets
Perhaps the most flexible synopsis of a database is a uniform random sample of the data; such samples are widely used to speed up the processing of analytic queries and data-mining tasks, to enhance query optimization, and to facilitate information integration. Most of the existing work on database sampling focuses on how to create or exploit a random sample of a static database, that is, a database that does not change over time. The assumption of a static database, however, severely limits the applicability of these techniques in practice, where data is often not static but continuously evolving. In order to maintain the statistical validity of the sample, any changes to the database have to be appropriately reflected in the sample. In this thesis, we study efficient methods for incrementally maintaining a uniform random sample of the items in a dataset in the presence of an arbitrary sequence of insertions, updates, and deletions. We consider instances of the maintenance problem that arise when sampling from an evolving set, from an evolving multiset, from the distinct items in an evolving multiset, or from a sliding window over a data stream. Our algorithms completely avoid any accesses to the base data and can be several orders of magnitude faster than algorithms that do rely on such expensive accesses. The improved efficiency of our algorithms comes at virtually no cost: the resulting samples are provably uniform and only a small amount of auxiliary information is associated with the sample. We show that the auxiliary information not only facilitates efficient maintenance, but it can also be exploited to derive unbiased, low-variance estimators for counts, sums, averages, and the number of distinct items in the underlying dataset. In addition to sample maintenance, we discuss methods that greatly improve the flexibility of random sampling from a system's point of view. 
More specifically, we initiate the study of algorithms that resize a random sample upwards or downwards. Our resizing algorithms can be exploited to dynamically control the size of the sample when the dataset grows or shrinks; they facilitate resource management and help to avoid under- or oversized samples. Furthermore, in large-scale databases with data being distributed across several remote locations, it is usually infeasible to reconstruct the entire dataset for the purpose of sampling. To address this problem, we provide efficient algorithms that directly combine the local samples maintained at each location into a sample of the global dataset. We also consider a more general problem, where the global dataset is defined as an arbitrary set or multiset expression involving the local datasets, and provide efficient solutions based on hashing.
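For the insertion-only case, the classic reservoir-sampling scheme (Algorithm R) already maintains a uniform sample in one pass; a short sketch for orientation. The thesis's contribution is handling the harder cases (deletions, updates, resizing, and distributed merging) that this sketch does not:

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Classic reservoir sampling (Algorithm R): one pass over an
    insertion-only stream, keeping a uniform random sample of size k.
    The n-th item (0-based) replaces a random slot with probability
    k / (n + 1), which keeps every prefix item equally likely to survive."""
    rng = rng or random.Random()
    sample = []
    for n, item in enumerate(stream):
        if n < k:
            sample.append(item)
        else:
            j = rng.randrange(n + 1)  # uniform over positions 0..n
            if j < k:
                sample[j] = item
    return sample
```

Note that the scheme never revisits earlier items, which is exactly the "no accesses to the base data" property the thesis extends to deletions and updates.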
Big Data Approaches to Improving the Identification of Drug or Disease Mechanisms for Drug Innovation
Advances in science and technology have substantially changed drug research and development (R&D) processes. However, the efficiency of drug R&D, measured as the number of new drugs approved per billion US dollars spent, declined dramatically between 1950 and 2010. Some of the main causes of this attrition include cautious regulators, the risks inherent in the chemical screening methods used for early drug discovery, and the limited understanding
of disease mechanisms. To improve the efficiency and productivity of drug R&D, more powerful tools are needed to assist in prediction, forecasting, and decision making in drug development. This dissertation describes my work in developing computational approaches that provide
a better understanding of drug and disease mechanisms at the systems level. The first project involves a collaboration with the RIKEN institute in Japan on innovation of influenza vaccine adjuvants. We performed comparative analysis of RNA-Seq data from mice treated with different adjuvants to identify mechanisms supporting adjuvant activity. In the second project, we predicted immune cell dynamics with linear regression-based algorithms and statistical tools, and suggested a new approach that can improve the discovery of key disease-associated genes. In the third project, we found that network topological features, especially network betweenness, predominantly define the accuracy of a major drug target inference algorithm. We proposed a novel algorithm, TREAP, which integrates betweenness and differential gene expression and can accurately predict drug targets in a time-efficient manner. Through these projects, we have demonstrated how computational algorithms can assist in mining big biological data to improve understanding of drug and disease mechanisms for drug innovation and development.
New data structures and algorithms for the efficient management of large spatial datasets
[Abstract] In this thesis we study the efficient representation of multidimensional grids,
presenting new compact data structures to store and query grids in different
application domains. We propose several static and dynamic data structures for the
representation of binary grids and grids of integers, and study applications to the
representation of raster data in Geographic Information Systems, RDF databases,
etc.
We first propose a collection of static data structures for the representation of
binary grids and grids of integers: 1) a new representation of bi-dimensional binary
grids with large clusters of uniform values, with applications to the representation
of binary raster data; 2) a new data structure to represent multidimensional binary
grids; 3) a new data structure to represent grids of integers with support for top-k
range queries. We also propose a new dynamic representation of binary grids, a
new data structure that provides the same functionality as our static
representations of binary grids but also supports changes in the grid.
Our data structures can be used in several application domains. We propose
specific variants and combinations of our generic proposals to represent temporal
graphs, RDF databases, OLAP databases, binary or general raster data, and
temporal raster data. We also propose a new algorithm to jointly query a raster
dataset (stored using our representations) and a vector dataset stored in a
classical data structure, showing that our proposal can be faster and require
less space than the usual alternatives. Our representations provide
interesting trade-offs and are competitive in space and query time with the
usual representations in each domain.
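The flavour of these compact binary-grid representations can be conveyed with a tiny k²-tree-style bit emitter (k = 2). This is a conceptual sketch only: real k²-trees store the per-level bits contiguously with rank/select support for navigation, which this recursive function omits:

```python
def k2_bits(grid, r0, c0, size, bits):
    """Emit a k2-tree-style bit sequence (k = 2) for a 2^n x 2^n binary grid:
    a 1 means the quadrant contains at least one set cell (and, when larger
    than a single cell, is subdivided into four children); a 0 means an
    all-zero quadrant, which is pruned and stores no children at all."""
    any_one = any(grid[r][c]
                  for r in range(r0, r0 + size)
                  for c in range(c0, c0 + size))
    bits.append(1 if any_one else 0)
    if any_one and size > 1:
        h = size // 2
        for dr in (0, h):
            for dc in (0, h):
                k2_bits(grid, r0 + dr, c0 + dc, h, bits)
    return bits
```

Large uniform (all-zero) regions collapse to a single 0 bit, which is why such structures are so effective on clustered raster data and sparse adjacency matrices.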