10 research outputs found
Compact data structures for large and complex datasets
Programa Oficial de Doutoramento en Computación. 5009V01
[Abstract]
In this thesis, we study the problem of processing large and complex collections of
data, presenting new data structures and algorithms that allow us to efficiently store
and analyze them. We focus on three main domains: processing of multidimensional
data, representation of spatial information, and analysis of scientific data.
The common nexus is the use of compact data structures, which combine in a
single data structure a compressed representation of the data and the structures
needed to access it. The goal is to manage data directly in compressed form and
thus keep it compressed at all times, even in main memory. This yields two
benefits: we can manage larger datasets in main memory, and we make better use
of the memory hierarchy.
In the first part, we propose a compact data structure for multidimensional
databases where the domains of each dimension are hierarchical. It allows efficient
queries of aggregate information at different levels of each dimension. A typical
application environment for our solution would be an OLAP system.
Second, we focus on the representation of spatial information, specifically on
raster data, which are commonly used in geographic information systems (GIS) to
represent spatial attributes (such as the altitude of a terrain, the average temperature,
etc.). The new method supports several typical spatial queries with better response
times than the state of the art, while also saving space both in main memory
and on disk. We also present a framework to run a spatial join between raster
and vector datasets that uses the compact data structure presented earlier in this
part of the thesis.
Finally, we present a solution for the computation of empirical moments from a
set of trajectories of a continuous-time stochastic process observed over a given
period of time. The empirical autocovariance function is an example of such
operations. We propose a method that compresses sequences of floating-point
numbers representing Brownian motion trajectories, although it can be used in
other similar areas. We also introduce a new algorithm for the calculation of the
autocovariance that processes a single trajectory at a time, instead of loading the
whole dataset, which reduces memory consumption during the calculation.
[Resumo]
In this thesis we study the problem of processing large collections of data,
presenting new compact data structures and algorithms that allow us to store and
analyze them efficiently. We focus on three main domains: processing of
multidimensional data, representation of spatial information, and analysis of
scientific data.
The common nexus is the use of compact data structures, which combine in a
single data structure a compressed representation of the data and the structures
to access those data. The goal is to be able to manipulate the data directly in
compressed form and, in this way, keep the data always compressed, even in main
memory. With this we obtain two benefits: we can manage larger datasets in main
memory and make better use of the memory hierarchy.
In the first part we propose a compact data structure for multidimensional
databases where the domains of each dimension are hierarchical. It allows us to
efficiently query aggregate information (sum, maximum value, etc.) at different
levels of each dimension. A typical application environment for our solution
would be an OLAP system.
Second, we focus on the representation of spatial information, specifically on
raster data, which are commonly used in geographic information systems (GIS) to
represent spatial attributes (such as the altitude of a terrain, the average
temperature, etc.). The new method efficiently supports several typical spatial
queries with better response times than the state of the art, while reducing the
space used both in main memory and on disk. In addition, we present a framework
to perform a spatial join between vector and raster datasets, which uses the
compact data structure presented earlier in this part of the thesis.
Finally, we present a solution for the computation of empirical moments from a
set of trajectories of a continuous-time stochastic process observed over a given
period of time. The empirical autocovariance function is an example of such
operations. In this thesis we propose a method that compresses sequences of
floating-point numbers representing Brownian motion trajectories, although it can
be used in other similar areas. In addition, we introduce a new algorithm for
computing the autocovariance that uses a single trajectory at a time, instead of
loading the whole dataset, reducing memory consumption during the computation.
[Resumen]
In this thesis we study the problem of processing large collections of data,
presenting new compact data structures and algorithms that allow us to store and
analyze them efficiently. We focus mainly on three domains: processing of
multidimensional data, representation of spatial information, and analysis of
scientific data.
The common nexus is the use of compact data structures, which combine in a
single data structure a compressed representation of the data and the structures
to access those data. The goal is to be able to manipulate the data directly in
compressed form and, in this way, keep the data always compressed, even in main
memory. With this we obtain two benefits: we can manage larger datasets in main
memory and make better use of the memory hierarchy.
In the first part we propose a compact data structure for multidimensional
databases where the domains of each dimension are hierarchical. It allows us to
efficiently query aggregate information (sum, maximum value, etc.) at different
levels of each dimension. A typical application environment for our solution
would be an OLAP system.
Second, we focus on the representation of spatial information, specifically on
raster data, which are commonly used in geographic information systems (GIS) to
represent spatial attributes (such as the altitude of a terrain, the average
temperature, etc.). The new method efficiently supports several typical spatial
queries with better response times than the state of the art, while reducing the
space used both in main memory and on disk. In addition, we present a framework
to perform a spatial join between vector and raster datasets, which uses the
compact data structure presented earlier in this part of the thesis.
Finally, we present a solution for the computation of empirical moments from a
set of trajectories of a continuous-time stochastic process observed over a given
period of time. The empirical autocovariance function is an example of such
operations. In this thesis we propose a method that compresses sequences of
floating-point numbers representing Brownian motion trajectories, although it can
be used in other similar areas. In this part, we also introduce a new algorithm
for computing the autocovariance that uses a single trajectory at a time, instead
of loading the whole dataset, reducing memory consumption during the computation.
Xunta de Galicia; ED431G/01
Ministerio de Economía y Competitividad; TIN2016-78011-C4-1-R
Ministerio de Economía y Competitividad; TIN2016-77158-C4-3-R
Ministerio de Economía y Competitividad; TIN2013-46801-C4-3-R
Centro para el desarrollo Tecnológico e Industrial; IDI-20141259
Centro para el desarrollo Tecnológico e Industrial; ITC-20151247
Xunta de Galicia; GRC2013/05
INTELLIGENT VIRTUAL ASSISTANT FOR GAMIFIED ENVIRONMENTS
As the body of Information Systems (IS) research on social media grows, it faces increasing challenges in staying relevant to real-world contexts. In this research-in-progress paper, we analyze and contrast research on social media in the e-government field and in IS research, by reviewing and categorizing 63 studies published in key journal outlets, in order to identify and complement research foci and gaps. We find that, in comparison with e-government social media research, IS studies tend to adopt an abstract view of the individual user, focus on a monetary view of the value added by social media, and overlook the role of contextual factors. We thus propose an extended framework for mapping social media research that includes a focus on the role of context and environment, and identify a research agenda for future studies on social media-related phenomena relevant to real-world contexts.
Map algebra on raster datasets represented by compact data structures
Funded for open-access publication: Universidade da Coruña/CISUG
[Abstract]: The increase in the size of data repositories has forced the design of new computing paradigms able to process large volumes of data in a reasonable amount of time. One of them is in-memory computing, which advocates storing all the data in main memory to avoid the disk I/O bottleneck. Compression is one of the key technologies for this approach. For raster data, a compact data structure called k²-raster has recently been proposed. It compresses raster maps while still supporting fast retrieval of a given datum or a portion of the data directly from the compressed data. The original k²-raster work introduced several queries in which it was superior to competitors. However, for it to serve as the basis of an in-memory system for raster data, it must be shown to be efficient when performing more complex operations such as the map algebra operators. In this work, we present algorithms to run a set of these operators directly on k²-raster without a decompression procedure.
This work was supported by the National Natural Science Foundation of China (Grant Nos. 31171944, 31640068), Anhui Provincial Natural Science Foundation (Grant No. 2019B319), Earmarked Fund for Anhui Science and Technology Major Project (202003b06020016).
Funding information: CITIC, Ministerio de Ciencia e Innovación, Grant/Award Numbers: PID2020-114635RB-I00; PDC2021-120917-C21; PDC2021-121239-C31; PID2019-105221RB-C41; TED2021-129245-C21; Xunta de Galicia, Grant/Award Numbers: ED431C 2021/53; IN852D 2021/3 (CO3)
This work was partially supported by CITIC. CITIC is funded by the Xunta de Galicia through the collaboration agreement between the Department of Culture, Education, Vocational Training and Universities and the Galician universities for the reinforcement of the research centers of the Galician University System (CIGUS). IN852D 2021/3(CO3): partially funded by the EU (ERDF), GAIN, Conecta COVID call. GRC: ED431C 2021/53: partially funded by GAIN/Xunta de Galicia. TED2021-129245B-C21; PDC2021-121239-C31; PDC2021-120917-C21: partially funded by MCIN/AEI/10.13039/501100011033 and “NextGenerationEU”/PRTR. PID2020-114635RB-I00; PID2019-105221RB-C41: partially funded by MCIN/AEI/10.13039/501100011033. Funding for open access charge: Universidade da Coruña/CISUG.
Xunta de Galicia; ED431C 2021/53
Xunta de Galicia; IN852D 2021/3 (CO3)
National Natural Science Foundation of China; 31171944
National Natural Science Foundation of China; 31640068
Anhui Provincial Natural Science Foundation; 2019B31
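To illustrate how a raster structure can answer queries without scanning every cell, as the abstract of this entry describes, here is a minimal sketch of a quadtree that keeps the minimum and maximum value of each quadrant. It is not the published structure: it uses ordinary pointers instead of a compressed encoding, fixes the subdivision at 2 per axis, and assumes a square raster whose side is a power of two. The names `Node`, `build`, and `cells_in_range` are hypothetical.

```python
from typing import List, Optional, Tuple

Raster = List[List[int]]


class Node:
    """Quadrant summary: min value, max value, and up to four children."""

    def __init__(self, lo: int, hi: int, children: Optional[List["Node"]]) -> None:
        self.lo = lo
        self.hi = hi
        self.children = children  # None for single cells and uniform quadrants


def build(raster: Raster, r0: int, c0: int, size: int) -> Node:
    """Build the tree for the size x size square whose top-left cell is (r0, c0)."""
    if size == 1:
        v = raster[r0][c0]
        return Node(v, v, None)
    half = size // 2
    kids = [build(raster, r0 + dr, c0 + dc, half)
            for dr in (0, half) for dc in (0, half)]
    lo = min(k.lo for k in kids)
    hi = max(k.hi for k in kids)
    if lo == hi:
        # Uniform quadrant: drop the children, the summary is enough.
        return Node(lo, hi, None)
    return Node(lo, hi, kids)


def cells_in_range(node: Node, r0: int, c0: int, size: int,
                   vmin: int, vmax: int, out: List[Tuple[int, int]]) -> None:
    """Append to `out` every cell whose value lies in [vmin, vmax]."""
    if node.hi < vmin or node.lo > vmax:
        return  # no cell of this quadrant can qualify, skip it entirely
    if node.children is None:
        # Single cell or uniform quadrant, and its value is inside the range.
        out.extend((r, c) for r in range(r0, r0 + size)
                   for c in range(c0, c0 + size))
        return
    half = size // 2
    offsets = [(0, 0), (0, half), (half, 0), (half, half)]
    for kid, (dr, dc) in zip(node.children, offsets):
        cells_in_range(kid, r0 + dr, c0 + dc, half, vmin, vmax, out)
```

The min/max summaries are what allow whole quadrants to be skipped or reported at once; a compact version would additionally encode the tree shape and values in compressed form.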
Scalable processing and autocovariance computation of big functional data
This is the peer reviewed version of the following article: Brisaboa NR, Cao R, Paramá JR, Silva-Coira F. Scalable processing and autocovariance computation of big functional data. Softw Pract Exper. 2018; 48: 123–140, which has been published in final form at https://doi.org/10.1002/spe.2524. This article may be used for non-commercial purposes in accordance with Wiley Terms and Conditions for Use of Self-Archived Versions. This article may not be enhanced, enriched or otherwise transformed into a derivative work, without express permission from Wiley or by statutory rights under applicable legislation. Copyright notices must not be removed, obscured or modified. The article must be linked to Wiley’s version of record on Wiley Online Library and any embedding, framing or otherwise making available the article or pages thereof by third parties from platforms, services and websites other than Wiley Online Library must be prohibited.
[Abstract]: This paper presents two main contributions. The first is a compact representation of huge sets of functional data or trajectories of continuous-time stochastic processes, which allows keeping the data always compressed, even during processing in main memory. It is designed to facilitate the efficient computation of the sample autocovariance function without prior decompression of the data set, using only partial local decoding. The second contribution is a new memory-efficient algorithm to compute the sample autocovariance function. In our experiments, the combination of the compact representation and the new memory-efficient algorithm yielded the following benefits: the compressed data occupy on disk 75% of the space needed by the original data, and the computation of the autocovariance function used up to 13 times less main memory and ran 65% faster than the classical method implemented, for example, in the R package.
This work was supported by the Ministerio de Economía y Competitividad (PGE and FEDER) under grants [TIN2016-78011-C4-1-R; MTM2014-52876-R; TIN2013-46238-C4-3-R], Centro para el desarrollo Tecnológico e Industrial MINECO [IDI-20141259; ITC-20151247; ITC-20151305; ITC-20161074]; Xunta de Galicia (co-funded with FEDER) under Grupos de Referencia Competitiva grant ED431C-2016-015; Xunta de Galicia-Consellería de Cultura, Educación e Ordenación Universitaria (co-funded with FEDER) under Redes grants R2014/041, ED341D R2016/045; Xunta de Galicia-Consellería de Cultura, Educación e Ordenación Universitaria (co-funded with FEDER) under Centro Singular de Investigación de Galicia grant ED431G/01.
Xunta de Galicia; D431C-2016-015
Xunta de Galicia; R2014/041
Xunta de Galicia; ED341D R2016/045
Xunta de Galicia; ED431G/0
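The idea of computing the sample autocovariance while holding only one trajectory in memory at a time, mentioned in the abstract of this entry, can be sketched as follows. This is not the authors' algorithm, only a plain accumulation of sums and cross-products. It assumes every trajectory is sampled at the same m time points and that the estimator divides by n (the number of trajectories); the function name `streaming_autocovariance` is hypothetical.

```python
from typing import Iterable, List, Sequence


def streaming_autocovariance(
    trajectories: Iterable[Sequence[float]],
) -> List[List[float]]:
    """Return the m x m sample autocovariance matrix, reading one trajectory at a time.

    Entry [s][t] estimates Cov(X(s), X(t)) as
    (1/n) * sum_i X_i(s) * X_i(t) - mean(s) * mean(t).
    """
    n = 0
    sums: List[float] = []          # running sum of X_i(s) for each time point s
    cross: List[List[float]] = []   # running sum of X_i(s) * X_i(t)
    for traj in trajectories:
        if n == 0:
            m = len(traj)
            sums = [0.0] * m
            cross = [[0.0] * m for _ in range(m)]
        elif len(traj) != len(sums):
            raise ValueError("all trajectories must have the same length")
        n += 1
        for s, xs in enumerate(traj):
            sums[s] += xs
            row = cross[s]
            for t, xt in enumerate(traj):
                row[t] += xs * xt
    if n == 0:
        raise ValueError("at least one trajectory is required")
    means = [v / n for v in sums]
    size = len(sums)
    return [
        [cross[s][t] / n - means[s] * means[t] for t in range(size)]
        for s in range(size)
    ]
```

Because the input is consumed as an iterable, the trajectories can come from a generator that decodes them one by one, so memory use is governed by the m x m accumulator rather than by the number of trajectories. The sum-of-products form can lose precision when the values are large relative to their spread, so a production version would likely use a more stable update.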
Compact and indexed representation for LiDAR point clouds
[Abstract]: LiDAR devices are capable of acquiring clouds of 3D points reflecting any object around them, and of adding attributes to each point such as color, position, time, etc. LiDAR datasets are usually large, and compressed data formats (e.g. LAZ) have been proposed over the years. These formats are capable of transparently decompressing portions of the data, but they are not focused on solving general queries over the data. In contrast to that traditional approach, a recent research line focuses on designing data structures that combine compression and indexing, allowing the compressed data to be queried directly. Compression is used to fit the data structure in main memory at all times, thus avoiding disk accesses, and indexing is used to query the compressed data as fast as the uncompressed data. In this paper, we present the first data structure capable of losslessly compressing point clouds that have attributes and jointly indexing all three dimensions of space and attribute values. Our method is able to run range queries and attribute queries up to 100 times faster than previous methods.
Secretaría Xeral de Universidades; [ED431G 2019/01]
Ministerio de Ciencia e Innovación; [PID2020-114635RB-I00]
Ministerio de Ciencia e Innovación; [PDC2021-120917C21]
Ministerio de Ciencia e Innovación; [PDC2021-121239-C31]
Ministerio de Ciencia e Innovación; [PID2019-105221RB-C41]
Xunta de Galicia; [ED431C 2021/53]
Xunta de Galicia; [IG240.2020.1.185
Space-Efficient Representations of Raster Time Series
Funded for open-access publication: Universidade da Coruña/CISUG
[Abstract] Raster time series, a.k.a. temporal rasters, are collections of rasters covering the same region at consecutive timestamps. These data have been used in many different applications, ranging from weather forecast systems to monitoring of forest degradation or soil contamination. Many different sensors generate this type of data, which makes such analyses possible but also challenges the technological capacity to store and retrieve the data. In this work, we propose a space-efficient representation of raster time series based on Compact Data Structures (CDS). Our method uses a strategy of snapshots and logs to represent the data, in which both components are represented using CDS. We study two variants of this strategy, one with regular sampling and another based on a heuristic that determines at which timestamps the snapshots should be created to reduce space redundancy. We perform a comprehensive experimental evaluation using real datasets. The results show that the proposed strategy is competitive in space with alternatives based on pure data compression, while providing much more efficient query times for different types of queries.
The data used in this study were acquired as part of the mission of NASA’s Earth Science Division and archived and distributed by the Goddard Earth Sciences (GES) Data and Information Services Center (DISC).
Funding: CITIC, as a Research Center accredited by the Galician University System, is funded by “Consellería de Cultura, Educación e Universidade from Xunta de Galicia”, 80% supported through ERDF Funds, ERDF Operational Programme Galicia 2014-2020, and the remaining 20% by “Secretaría Xeral de Universidades” (Grant ED431G 2019/01). This work was also supported by Xunta de Galicia/FEDER-UE under Grants [IG240.2020.1.185; IN852A 2018/14]; Ministerio de Ciencia, Innovación y Universidades under Grants [TIN2016-78011-C4-1-R; RTC-2017-5908-7; PID2019-105221RB-C41/AEI/10.13039/501100011033]; ANID - Millennium Science Initiative Program - Code ICN17_002; Programa Iberoamericano de Ciencia y Tecnología para el Desarrollo (CYTED) [Grant No. 519RT0579]
Xunta de Galicia; ED431G 2019/01
Xunta de Galicia; IG240.2020.1.185
Xunta de Galicia; IN852A 2018/14
Chile. Agencia Nacional de Investigación y Desarrollo; ICN17_00
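The snapshots-and-logs strategy with regular sampling, described in the abstract of this entry, can be sketched as follows. This is only an illustration under stated assumptions: snapshots and logs are stored as plain Python lists and dictionaries rather than compact data structures, each log is assumed to record the cells that differ from the most recent snapshot, and the class name `SnapshotLogSeries` and its `period` parameter are hypothetical.

```python
from typing import Dict, List, Tuple

Raster = List[List[int]]
Cell = Tuple[int, int]


class SnapshotLogSeries:
    """Raster time series stored as periodic full snapshots plus difference logs."""

    def __init__(self, rasters: List[Raster], period: int) -> None:
        if period < 1:
            raise ValueError("period must be at least 1")
        self.period = period
        self.snapshots: List[Raster] = []
        # One log per timestamp; logs at snapshot timestamps stay empty.
        self.logs: List[Dict[Cell, int]] = []
        for t, raster in enumerate(rasters):
            if t % period == 0:
                # Regular sampling: a full copy every `period` timestamps.
                self.snapshots.append([row[:] for row in raster])
                self.logs.append({})
            else:
                base = self.snapshots[-1]
                # Keep only the cells whose value differs from the snapshot.
                self.logs.append({
                    (r, c): v
                    for r, row in enumerate(raster)
                    for c, v in enumerate(row)
                    if v != base[r][c]
                })

    def get_cell(self, t: int, r: int, c: int) -> int:
        """Value of cell (r, c) at timestamp t."""
        base = self.snapshots[t // self.period]
        return self.logs[t].get((r, c), base[r][c])
```

A smaller `period` makes logs shorter but stores more full snapshots, which is the space trade-off that a heuristic choice of snapshot timestamps aims to improve.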
Efficient Processing of Raster and Vector Data
[Abstract] In this work, we propose a framework to store and manage spatial data, which includes new efficient algorithms to perform operations that take as input a raster dataset and a vector dataset. More concretely, we present algorithms for solving a spatial join between a raster and a vector dataset that imposes a restriction on the values of the cells of the raster, and an algorithm for retrieving the K objects of a vector dataset that overlap the cells of a raster dataset with the highest (or lowest) values among all objects. The raster data are stored using a compact data structure, which can directly manipulate compressed data without the need for prior decompression. This leads to better running times and lower memory consumption. In our experimental evaluation comparing our solution to other baselines, we obtain the best space/time trade-offs.
This work has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 690941; from the Ministerio de Ciencia, Innovación y Universidades (PGE and ERDF) grant numbers TIN2016-78011-C4-1-R; TIN2016-77158 C4-3-R; RTC-2017-5908-7; from Xunta de Galicia (co-funded with ERDF) grant numbers ED431C 2017/58; ED431G/01; IN852A 2018/14; and University of Bío-Bío grant numbers 192119 2/R; 195119 GI/VC
Xunta de Galicia; ED431C 2017/58
Xunta de Galicia; ED431G/01
Xunta de Galicia; IN852A 2018/14
Universidad del Bío-Bío (Chile); 192119 2/R
Universidad del Bío-Bío (Chile); 195119 GI/V
Indexing and Retrieval of Scores by Humming based on Extracted Features
Cursos e Congresos, C-155
[Abstract] In order to conduct searches over large collections of music scores with
queries provided in audio format, this article considers recent literature in the field and
proposes an implementation to extract specific features from music pieces. We then index
those features using modern Lempel-Ziv (LZ)-based data structures. These data structures
take advantage of the intrinsic repetitiveness of music to reduce space consumption and, at
the same time, index the information to optimize the search time per query. Furthermore,
because this property-based representation framework does not depend on the way the music
is portrayed, it becomes possible to perform melodic searches by simply providing a query
audio. This research branch is known as “query by humming” and has commonly been applied
to audio sources. This research presents a preliminary study of its application to other
forms of music representation.
Xunta de Galicia; ED431G 2019/01
Xunta de Galicia; ED431C 2021/53
Work funded by: CITIC is funded by the Xunta de Galicia through the collaboration
agreement between the Consellería de Cultura, Educación, Formación Profesional e Universidades and the Galician universities for the reinforcement of the research centres of the Galician University System (CIGUS), 80% through FEDER funds, Galicia Operational Programme FEDER 2014-2020, and the remaining 20% by the “Secretaría Xeral de Universidades” (Grant ED431G 2019/01), Xunta de Galicia/FEDER-UE [ED431C 2021/53]; Ministry of Science and Innovation [PID2020-114635RBI00; PDC2021-120917-C21; PDC2021-121239-C31; PID2019-105221RB-C41; TED2021-129245-C21