Efficient geographic information systems: Data structures, Boolean operations and concurrency control
Geographic Information Systems (GIS) are crucial to the ability of governmental agencies and businesses to record, manage, and analyze geographic data efficiently. They provide methods of analysis and simulation on geographic data that were previously infeasible using traditional hardcopy maps. Creation of realistic 3-D sceneries by overlaying satellite imagery on digital elevation models (DEMs) was not possible using paper maps. Determination of suitable construction areas with the fewest environmental impacts once required manual tracing of different map sets on mylar sheets; now it can be done in real time by a GIS.
Geographic information processing has significant space and time requirements. This thesis concentrates on techniques that can make existing GIS more efficient by considering three issues: data structures, Boolean operations on geographic data, and concurrency control.
Geographic data span multiple dimensions and consist of geometric shapes such as points, lines, and areas, which cannot be handled efficiently by a traditional one-dimensional data structure. We therefore first survey spatial data structures for geographic data and then show how a spatial data structure called the R-tree can be used to improve the performance of many existing GIS.
Boolean operations on geographic data are fundamental to the spatial analysis common in geographic data processing. They allow the user to analyze geographic data by applying operators such as AND, OR, and NOT to geographic objects. An example of a Boolean operation query would be: find all regions that have low elevation AND soil type clay. Boolean operations require significant time to process. We present a generalized solution that can significantly improve the time performance of evaluating complex Boolean operation queries.
Concurrency control on spatial data structures for geographic data processing is becoming more critical as the size and resolution of geographic databases increase. We present algorithms that enable concurrent access to R-tree spatial data structures, so that geographic data can be shared efficiently in a multi-user GIS environment
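The example query above (low elevation AND soil type clay) can be sketched with elementwise boolean masks over two co-registered raster layers. The layers and values below are illustrative assumptions, not data from the thesis:

```python
import numpy as np

# Two hypothetical co-registered raster layers (values are illustrative).
elevation = np.array([[120, 80], [300, 60]])              # metres
soil = np.array([["clay", "sand"], ["clay", "clay"]])

# Boolean operation: low elevation AND soil type clay.
low = elevation < 100          # per-cell predicate mask
is_clay = soil == "clay"       # per-cell predicate mask
suitable = low & is_clay       # AND combines the masks cell by cell

print(suitable)                # cells satisfying both predicates
```

OR and NOT queries combine the same masks with `|` and `~`, which is why Boolean operations compose so naturally on overlaid layers.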
New data structures and algorithms for the efficient management of large spatial datasets
[Abstract] In this thesis we study the efficient representation of multidimensional grids,
presenting new compact data structures to store and query grids in different
application domains. We propose several static and dynamic data structures for the
representation of binary grids and grids of integers, and study applications to the
representation of raster data in Geographic Information Systems, RDF databases,
etc.
We first propose a collection of static data structures for the representation of
binary grids and grids of integers: 1) a new representation of bi-dimensional binary
grids with large clusters of uniform values, with applications to the representation
of binary raster data; 2) a new data structure to represent multidimensional binary
grids; 3) a new data structure to represent grids of integers with support for top-k
range queries. We also propose a new dynamic representation of binary grids: a new
data structure that provides the same functionality as our static representations
of binary grids but also supports changes in the grid.
Our data structures can be used in several application domains. We propose
specific variants and combinations of our generic proposals to represent temporal
graphs, RDF databases, OLAP databases, binary or general raster data, and
temporal raster data. We also propose a new algorithm to jointly query a raster
dataset (stored using our representations) and a vectorial dataset stored in a classic
data structure, showing that our proposal can be faster and require less space than
the usual alternatives. Our representations provide interesting trade-offs and are
competitive in terms of space and query times with usual representations in the
different domains
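The static binary-grid representations above exploit large clusters of uniform values. A minimal quadtree-style sketch (an illustration of the general idea, not the thesis's actual structures) shows how uniform subgrids collapse to single leaves while individual cells remain queryable without decompression:

```python
# Minimal quadtree sketch for a 2^n x 2^n binary grid: uniform subgrids
# collapse to a single leaf, so large clusters of equal values (as in
# binary raster data) take constant space.

def build(grid, r, c, size):
    """Return 0/1 for a uniform subgrid, or a 4-tuple of children."""
    if size == 1:
        return grid[r][c]
    h = size // 2
    kids = (build(grid, r, c, h), build(grid, r, c + h, h),
            build(grid, r + h, c, h), build(grid, r + h, c + h, h))
    if kids in ((0, 0, 0, 0), (1, 1, 1, 1)):
        return kids[0]            # prune: whole subgrid is uniform
    return kids

def query(node, r, c, size):
    """Value of cell (r, c), descending only the relevant branch."""
    while not isinstance(node, int):
        size //= 2
        node = node[(r >= size) * 2 + (c >= size)]
        r %= size
        c %= size
    return node

grid = [[1, 1, 0, 0],
        [1, 1, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 1]]
tree = build(grid, 0, 0, 4)
print(query(tree, 0, 1, 4), query(tree, 3, 3, 4))
```

Here the all-ones and all-zeros quadrants become single leaves; only the mixed bottom-right quadrant is expanded.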
Management of digital map data using a relational database model
Special issue (CISRG - Cartographic Information Systems Research Group)
Compressing and Performing Algorithms on Massively Large Networks
Networks are represented as a set of nodes (vertices) and the arcs (links) connecting them. Such networks can model various real-world structures such as social networks (e.g., Facebook), information networks (e.g., citation networks), technological networks (e.g., the Internet), and biological networks (e.g., gene-phenotype network). Analysis of such structures is a heavily studied area with many applications. However, in this era of big data, we find ourselves with networks so massive that the space requirements inhibit network analysis.
Since many of these networks have nodes and arcs on the order of billions to trillions, even basic data structures such as adjacency lists could cost petabytes to zettabytes of storage. Storing these networks in secondary memory would require I/O access (i.e., disk access) during analysis, thus drastically slowing analysis time. To perform analysis efficiently on such extensive data, we either need enough main memory for the data structures and algorithms, or we need to develop compressions that require much less space while still being able to answer queries efficiently.
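A back-of-the-envelope calculation makes the space problem concrete (the sizes below are illustrative assumptions, not measurements from the dissertation):

```python
# Adjacency-list footprint estimate: each arc contributes one 64-bit
# neighbour id to each endpoint's list (two entries for an undirected arc).
def adjacency_bytes(arcs, bytes_per_entry=8, directions=2):
    return arcs * bytes_per_entry * directions

tb = adjacency_bytes(10**12) / 10**12      # a trillion arcs -> terabytes
print(tb, "TB for the ids alone")
# Add per-arc labels, weights, and timestamps -- and one snapshot per time
# frame in a time-evolving network -- and the footprint climbs toward
# petabytes, which is exactly the regime the compression targets.
```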
In this dissertation, we develop several compression techniques that succinctly represent these real-world networks while still being able to efficiently query the network (e.g., check if an arc exists between two nodes). Furthermore, since many of these networks continue to grow over time, our compression techniques also support the ability to add and remove nodes and edges directly on the compressed structure. We also provide a way to compress the data quickly without any intermediate structure, thus giving minimal memory overhead. We provide detailed analysis and prove that our compression is indeed succinct (i.e., achieves the information-theoretic lower bound). Also, we empirically show that our compression rates outperform or are equal to existing compression algorithms on many benchmark datasets.
We also extend our technique to time-evolving networks. That is, we store the entire state of the network at each time frame. Studying time-evolving networks allows us to find patterns over time that would not be visible in regular, static network analysis. A succinct representation is arguably more important for time-evolving networks than for static graphs, because the extra dimension inflates the space requirements of basic data structures even more. Again, we manage to achieve succinctness while also providing fast encoding, minimal memory overhead during encoding, fast queries, and fast, direct modification. We also compare against several benchmarks and empirically show that we achieve compression rates better than or equal to the best-performing benchmark for each dataset.
Finally, we also develop both static and time-evolving algorithms that run directly on our compressed structures. Using our static graph compression combined with our differential technique, we find that we can speed up matrix-vector multiplication by reusing previously computed products. We compare our results against a similar technique using the Webgraph Framework, and we see that not only are our base query speeds faster, but we also gain a more significant speed-up from reusing products. Then, we use our time-evolving compression to solve the earliest arrival paths problem and time-evolving transitive closure. We found that not only were we the first to run such algorithms directly on compressed data, but that our technique was particularly efficient at doing so
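The product-reuse idea can be illustrated in miniature. When many rows of the matrix are identical, as happens after compressing web-like graphs, each distinct row's product with the vector needs to be computed only once. The sketch below uses plain memoisation rather than the dissertation's compressed structures:

```python
def spmv_with_reuse(rows, x):
    """y = A @ x, computing each distinct row's dot product only once."""
    cache = {}
    y = []
    for row in rows:
        key = tuple(row)
        if key not in cache:                           # first occurrence
            cache[key] = sum(a * b for a, b in zip(row, x))
        y.append(cache[key])                           # repeats: reuse
    return y, len(cache)

rows = [[1, 0, 1], [0, 1, 0], [1, 0, 1], [1, 0, 1]]    # rows 0, 2, 3 identical
y, distinct = spmv_with_reuse(rows, [3, 5, 7])
print(y, "- computed", distinct, "of", len(rows), "row products")
```

The dissertation's differential technique goes further by also reusing products of rows that are merely similar, but the principle is the same: redundancy exposed by compression translates directly into skipped arithmetic.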
On the Practice and Application of Context-Free Language Reachability
The Context-Free Language Reachability (CFL-R) formalism relates to some of the most important computational problems facing researchers and industry practitioners. CFL-R is a generalisation of graph reachability and language recognition, such that pairs in a labelled graph are reachable if and only if there is a path between them whose labels, joined together in the order they were encountered, spell a word in a given context-free language. The formalism finds particular use as a vehicle for phrasing and reasoning about program analysis, since complex relationships within the data, logic or structure of computer programs are easily expressed and discovered in CFL-R. Unfortunately, the potential of CFL-R cannot be met by state-of-the-art solvers. Current algorithms have scalability and expressibility issues that prevent them from being used on large graph instances or complex grammars. This work outlines our efforts in understanding the practical concerns surrounding CFL-R, and applying this knowledge to improve the performance of CFL-R applications. We examine the major difficulties with solving CFL-R-based analyses at scale, via a case study of points-to analysis as a CFL-R problem. Points-to analysis is fundamentally important to many modern research and industry efforts, and is relevant to optimisation, bug-checking and security technologies. Our understanding of the scalability challenge motivates work in developing practical CFL-R techniques. We present improved evaluation algorithms and declarative optimisation techniques for CFL-R, capitalising on the simplicity of CFL-R to create fully automatic methodologies. The culmination of our work is a general-purpose, high-performance tool called Cauliflower, a solver-generator for CFL-R problems. We describe Cauliflower and evaluate its performance experimentally, showing significant improvement over alternative general techniques
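The CFL-R closure itself can be sketched with a naive worklist solver (a toy illustration, not Cauliflower's algorithm), assuming the grammar has been split into unary and binary productions:

```python
from collections import deque

def cfl_reach(edges, unary, binary):
    """Close a set of (label, u, v) facts under a context-free grammar.

    edges : set of (terminal, u, v) facts from the labelled graph
    unary : productions A -> B   given as {(A, B), ...}
    binary: productions A -> B C given as {(A, B, C), ...}
    """
    facts = set(edges)
    work = deque(facts)
    def add(f):
        if f not in facts:
            facts.add(f)
            work.append(f)
    while work:
        b, u, v = work.popleft()
        for a, bb in unary:
            if bb == b:
                add((a, u, v))
        for a, bb, c in binary:          # popped fact as the left operand
            if bb == b:
                for cc, v2, w in list(facts):
                    if cc == c and v2 == v:
                        add((a, u, w))
        for a, bb, c in binary:          # popped fact as the right operand
            if c == b:
                for bb2, w, u2 in list(facts):
                    if bb2 == bb and u2 == u:
                        add((a, w, v))
    return facts

# Toy language S -> a S b | a b (balanced a's then b's), binarised with
# helpers: S -> A B | A T, T -> S B, A -> 'a', B -> 'b'.
edges  = {("a", 0, 1), ("a", 1, 2), ("b", 2, 3), ("b", 3, 4)}
unary  = {("A", "a"), ("B", "b")}
binary = {("S", "A", "B"), ("S", "A", "T"), ("T", "S", "B")}
facts = cfl_reach(edges, unary, binary)
print(("S", 0, 4) in facts, ("S", 1, 3) in facts)  # True True
```

The pair (0, 4) is S-reachable because the path 0-1-2-3-4 spells "aabb", a word of the language; this brute-force closure is cubic or worse, which is precisely the scalability problem the thesis addresses.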
Estimating the Selectivity of Spatial Queries Using the `Correlation' Fractal Dimension
We examine the estimation of selectivities for range and spatial join queries
in real spatial databases. As we have shown earlier,
real point sets:
(a) consistently violate the "uniformity" and "independence" assumptions,
(b) can often be described as "fractals", with non-integer (fractal)
dimension.
In this paper we show that, among the infinite family of fractal dimensions,
the so-called "Correlation Dimension" D2 is the one that we need to predict
the selectivity of spatial joins.
The main contribution is that, for all the real and synthetic point-sets we
tried, the average number of neighbors for a given point of the point-set
follows a power law, with D2 as the exponent. This immediately solves the
selectivity estimation for spatial joins, as well as for "biased" range
queries (i.e., queries whose centers prefer areas of high point density).
We present the formulas to estimate the selectivity for the biased queries,
including an integration constant (Kshape) for each query shape.
Finally, we show results on real and synthetic point sets, where our formulas
achieve very low relative errors (typically about 10%, versus 40%-100%
of the uniform assumption)
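The power law above suggests a direct way to estimate D2: fit the slope of the average neighbour count against the radius on a log-log scale. A small sketch on a synthetic line-like point set (illustrative code, not the paper's implementation):

```python
import itertools
import math
import random

def avg_neighbors(points, r):
    """Average number of other points within distance r of a point."""
    n = len(points)
    pairs = sum(1 for p, q in itertools.combinations(points, 2)
                if math.dist(p, q) <= r)
    return 2 * pairs / n            # each close pair counts for both ends

# Synthetic set: points on a line segment in the plane (true D2 = 1).
random.seed(1)
pts = [(t, t) for t in sorted(random.random() for _ in range(400))]

# Power law: avg_neighbors(r) ~ K * r**D2, so D2 is the log-log slope.
r1, r2 = 0.05, 0.2
d2 = (math.log(avg_neighbors(pts, r2)) - math.log(avg_neighbors(pts, r1))) \
     / (math.log(r2) - math.log(r1))
print(round(d2, 2))   # close to 1 for a line-like set
```

A uniform 2-D cloud would give a slope near 2 instead; the gap between the two is exactly why the uniformity assumption misestimates join selectivities.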
Weiterentwicklung analytischer Datenbanksysteme
This thesis contributes to the state of the art in analytical database systems. First, we identify and explore extensions to better support analytics on event streams. Second, we propose a novel polygon index to enable efficient geospatial data processing in main memory. Third, we contribute a new deep learning approach to cardinality estimation, which is the core problem in cost-based query optimization.
Compact data structures for large and complex datasets
Programa Oficial de Doutoramento en Computación. 5009V01 [Abstract]
In this thesis, we study the problem of processing large and complex collections of
data, presenting new data structures and algorithms that allow us to efficiently store
and analyze them. We focus on three main domains: processing of multidimensional
data, representation of spatial information, and analysis of scientific data.
The common nexus is the use of compact data structures, which combine in a
single structure a compressed representation of the data and the mechanisms to
access such data. The goal is to be able to manage data directly in compressed
form, and in this way to keep data always compressed, even in main memory. With
this, we obtain two benefits: we can manage larger datasets in main memory, and
we take better advantage of the memory hierarchy.
In the first part, we propose a compact data structure for multidimensional
databases where the domains of each dimension are hierarchical. It allows efficient
queries of aggregate information at different levels of each dimension. A typical
application environment for our solution would be an OLAP system.
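The kind of query such a structure answers can be sketched with an explicit, uncompressed rollup (the names and figures below are hypothetical; the thesis's compact structure answers the same queries without materializing the fact table):

```python
from collections import defaultdict

# Fact table: (city, product) -> measure, plus a geography hierarchy.
parent = {"Vigo": "Galicia", "Coruna": "Galicia", "Leon": "Castilla"}
sales = {("Vigo", "tea"): 3, ("Coruna", "tea"): 4, ("Leon", "tea"): 5,
         ("Vigo", "ink"): 2}

def rollup(level_of, facts):
    """Sum measures at the parent level of the first dimension."""
    out = defaultdict(int)
    for (city, prod), value in facts.items():
        out[(level_of[city], prod)] += value
    return dict(out)

region_sales = rollup(parent, sales)
print(region_sales)   # e.g. ('Galicia', 'tea') -> 7, summed over its cities
```

Swapping `sum` for `max` gives the other aggregate mentioned; the point of the compact structure is answering these at any hierarchy level without a full scan.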
Second, we focus on the representation of spatial information, specifically on
raster data, which are commonly used in geographic information systems (GIS) to
represent spatial attributes (such as the altitude of a terrain, the average temperature,
etc.). The new method answers several typical spatial queries with better response
times than the state of the art, while also saving space in both main
memory and disk. Besides, we also present a framework to run a spatial join between
raster and vector datasets, which uses the compact data structure previously presented
in this part of the thesis.
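A minimal sketch of the query semantics of such a raster-vector join (an illustration only, not the thesis's framework): report each vector object whose bounding box contains a raster cell whose value exceeds a threshold.

```python
# Raster cells are (row, col) -> value; vector objects are approximated
# here by their bounding boxes (r0, c0, r1, c1). All data are hypothetical.

def spatial_join(raster, boxes, threshold):
    """Names of vector objects covering a raster cell above threshold."""
    hits = set()
    for (r, c), value in raster.items():
        if value <= threshold:
            continue
        for name, (r0, c0, r1, c1) in boxes.items():
            if r0 <= r <= r1 and c0 <= c <= c1:
                hits.add(name)
    return hits

raster = {(0, 0): 10, (0, 1): 90, (2, 2): 50}
boxes = {"parcel_a": (0, 0, 1, 1), "parcel_b": (2, 2, 3, 3)}
print(spatial_join(raster, boxes, threshold=60))  # {'parcel_a'}
```

The thesis's contribution is evaluating this kind of predicate directly on the compressed raster, pruning whole uniform regions at once instead of testing cells one by one as above.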
Finally, we present a solution for the computation of empirical moments from a
set of trajectories of a continuous time stochastic process observed in a given period
of time. The empirical autocovariance function is an example of such operations.
In this thesis, we propose a method that compresses sequences of floating-point numbers
representing Brownian motion trajectories, although it can be used in other similar
areas. In addition, we also introduce a new algorithm for the calculation of the
autocovariance that uses a single trajectory at a time, instead of loading the whole
dataset, reducing the memory consumption during the calculation process.
Funding: Xunta de Galicia (ED431G/01, GRC2013/05); Ministerio de Economía y Competitividad (TIN2016-78011-C4-1-R, TIN2016-77158-C4-3-R, TIN2013-46801-C4-3-R); Centro para el Desarrollo Tecnológico e Industrial (IDI-20141259, ITC-20151247)
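The single-trajectory processing described in the abstract can be sketched as streaming accumulation: sums and cross-products are updated one trajectory at a time, so only one trajectory is ever held in memory (illustrative code, not the thesis's compressed representation):

```python
# Empirical autocovariance C(s, t) of a process observed on a fixed grid of
# m time points, computed from an iterable of trajectories one at a time.

def autocovariance(trajectories, m):
    """C[s][t] = E[X(s)X(t)] - E[X(s)]E[X(t)], estimated empirically."""
    n = 0
    sums = [0.0] * m
    cross = [[0.0] * m for _ in range(m)]
    for x in trajectories:            # one trajectory in memory at a time
        n += 1
        for s in range(m):
            sums[s] += x[s]
            for t in range(m):
                cross[s][t] += x[s] * x[t]
    mean = [v / n for v in sums]
    return [[cross[s][t] / n - mean[s] * mean[t] for t in range(m)]
            for s in range(m)]

trajs = [[0.0, 1.0], [0.0, -1.0]]     # two toy trajectories, two time points
C = autocovariance(iter(trajs), 2)
print(C)
```

Passing a generator instead of a list keeps peak memory at one trajectory plus the m-by-m accumulators, which is the reduction the abstract claims.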