35 research outputs found

    Trends and concerns in digital cartography

    Get PDF
    CISRG discussion paper ;

    Efficient geographic information systems: Data structures, Boolean operations and concurrency control

    Get PDF
    Geographic Information Systems (GIS) are crucial to the ability of govern mental agencies and business to record, manage and analyze geographic data efficiently. They provide methods of analysis and simulation on geographic data that were previously infeasible using traditional hardcopy maps. Creation of realistic 3-D sceneries by overlaying satellite imagery over digital elevation models (DEM) was not possible using paper maps. Determination of suitable areas for construction that would have the fewest environmental impacts once required manual tracing of different map sets on mylar sheets; now it can be done in real time by GIS. Geographic information processing has significant space and time require ments. This thesis concentrates on techniques which can make existing GIS more efficient by considering these issues: Data Structure, Boolean Operations on Geographic Data, Concurrency Control. Geographic data span multiple dimensions and consist of geometric shapes such as points, lines, and areas, which cannot be efficiently handled using a traditional one-dimensional data structure. We therefore first survey spatial data structures for geographic data and then show how a spatial data structure called an R-tree can be used to augment the performance of many existing GIS. Boolean operations on geographic data are fundamental to the spatial anal ysis common in geographic data processing. They allow the user to analyze geographic data by using operators such as AND, OR, NOT on geographic ob jects. An example of a boolean operation query would be, Find all regions that have low elevation AND soil type clay. Boolean operations require signif icant time to process. We present a generalized solution that could significantly improve the time performance of evaluating complex boolean operation queries. Concurrency control on spatial data structures for geographic data processing is becoming more critical as the size and resolution of geographic databases increase. We present algorithms to enable concurrent access to R-tree spatial data structures so that efficient sharing of geographic data can occur in a multi user GIS environment

    New data structures and algorithms for the efficient management of large spatial datasets

    Get PDF
    [Resumen] En esta tesis estudiamos la representación eficiente de matrices multidimensionales, presentando nuevas estructuras de datos compactas para almacenar y procesar grids en distintos ámbitos de aplicación. Proponemos varias estructuras de datos estáticas y dinámicas para la representación de matrices binarias o de enteros y estudiamos aplicaciones a la representación de datos raster en Sistemas de Información Geográfica, bases de datos RDF, etc. En primer lugar proponemos una colección de estructuras de datos estáticas para la representación de matrices binarias y de enteros: 1) una nueva representación de matrices binarias con grandes grupos de valores uniformes, con aplicaciones a la representación de datos raster binarios; 2) una nueva estructura de datos para representar matrices multidimensionales; 3) una nueva estructura de datos para representar matrices de enteros con soporte para consultas top-k de rango. También proponemos una nueva representación dinámica de matrices binarias, una nueva estructura de datos que proporciona las mismas funcionalidades que nuestras propuestas estáticas pero también soporta cambios en la matriz. Nuestras estructuras de datos pueden utilizarse en distintos dominios. Proponemos variantes específicas y combinaciones de nuestras propuestas para representar grafos temporales, bases de datos RDF, datos raster binarios o generales y datos raster temporales. También proponemos un nuevo algoritmo para consultar conjuntamente un conjuto de datos raster (almacenado usando nuestras propuestas) y un conjunto de datos vectorial almacenado en una estructura de datos clásica, mostrando que nuestra propuesta puede ser más rápida y usar menos espacio que otras alternativas. Nuestras representaciones proporcionan interesantes trade-offs y son competitivas en espacio y tiempos de consulta con representaciones habituales en los diferentes dominios.[Resumo] Nesta tese estudiamos a representación eficiente de matrices multidimensionais, presentando novas estruturas de datos compactas para almacenar e procesar grids en distintos ámbitos de aplicación. Propoñemos varias estruturas de datos estáticas e dinámicas para a representación de matrices binarias ou de enteiros e estudiamos aplicacións á representación de datos raster en Sistemas de Información Xeográfica, bases de datos RDF, etc. En primeiro lugar propoñemos unha colección de estruturas de datos estáticas para a representación de matrices binarias e de enteiros: 1) unha nova representación de matrices binarias con grandes grupos de valores uniformes, con aplicacións á representación de datos raster binarios; 2) unha nova estrutura de datos para representar matrices multidimensionais; 3) unha nova estrutura de datos para representar matrices de enteiros con soporte para consultas top-k. Tamén propoñemos unha nova representación dinámica de matrices binarias, unha nova estrutura de datos que proporciona as mesmas funcionalidades que as nosas propostas estáticas pero tamén soporta cambios na matriz. As nosas estruturas de datos poden utilizarse en distintos dominios. Propoñemos variantes específicas e combinacións das nosas propostas para representar grafos temporais, bases de datos RDF, datos raster binarios ou xerais e datos raster temporais. Tamén propoñemos un novo algoritmo para consultar conxuntamente datos raster (almacenados usando as nosas propostas) con datos vectoriais almacenados nunha estrutura de datos clásica, amosando que a nosa proposta pode ser máis rápida e usar menos espazo que outras alternativas. As nosas representacións proporcionan interesantes trade-offs e son competitivas en espazo e tempos de consulta con representacións habituais nos diferentes dominios.[Abstract] In this thesis we study the efficient representation of multidimensional grids, presenting new compact data structures to store and query grids in different application domains. We propose several static and dynamic data structures for the representation of binary grids and grids of integers, and study applications to the representation of raster data in Geographic Information Systems, RDF databases, etc. We first propose a collection of static data structures for the representation of binary grids and grids of integers: 1) a new representation of bi-dimensional binary grids with large clusters of uniform values, with applications to the representation of binary raster data; 2) a new data structure to represent multidimensional binary grids; 3) a new data structure to represent grids of integers with support for top-k range queries. We also propose a new dynamic representation of binary grids, a new data structure that provides the same functionalities that our static representations of binary grids but also supports changes in the grid. Our data structures can be used in several application domains. We propose specific variants and combinations of our generic proposals to represent temporal graphs, RDF databases, OLAP databases, binary or general raster data, and temporal raster data. We also propose a new algorithm to jointly query a raster dataset (stored using our representations) and a vectorial dataset stored in a classic data structure, showing that our proposal can be faster and require less space than the usual alternatives. Our representations provide interesting trade-offs and are competitive in terms of space and query times with usual representations in the different domains

    Management of digital map data using a relational database model

    Get PDF
    Special issue (CISRG - Cartographic Information Systems Research Group) ;

    Compressing and Performing Algorithms on Massively Large Networks

    Get PDF
    Networks are represented as a set of nodes (vertices) and the arcs (links) connecting them. Such networks can model various real-world structures such as social networks (e.g., Facebook), information networks (e.g., citation networks), technological networks (e.g., the Internet), and biological networks (e.g., gene-phenotype network). Analysis of such structures is a heavily studied area with many applications. However, in this era of big data, we find ourselves with networks so massive that the space requirements inhibit network analysis. Since many of these networks have nodes and arcs on the order of billions to trillions, even basic data structures such as adjacency lists could cost petabytes to zettabytes of storage. Storing these networks in secondary memory would require I/O access (i.e., disk access) during analysis, thus drastically slowing analysis time. To perform analysis efficiently on such extensive data, we either need enough main memory for the data structures and algorithms, or we need to develop compressions that require much less space while still being able to answer queries efficiently. In this dissertation, we develop several compression techniques that succinctly represent these real-world networks while still being able to efficiently query the network (e.g., check if an arc exists between two nodes). Furthermore, since many of these networks continue to grow over time, our compression techniques also support the ability to add and remove nodes and edges directly on the compressed structure. We also provide a way to compress the data quickly without any intermediate structure, thus giving minimal memory overhead. We provide detailed analysis and prove that our compression is indeed succinct (i.e., achieves the information-theoretic lower bound). Also, we empirically show that our compression rates outperform or are equal to existing compression algorithms on many benchmark datasets. We also extend our technique to time-evolving networks. That is, we store the entire state of the network at each time frame. Studying time-evolving networks allows us to find patterns throughout the time that would not be available in regular, static network analysis. A succinct representation for time-evolving networks is arguably more important than static graphs, due to the extra dimension inflating the space requirements of basic data structures even more. Again, we manage to achieve succinctness while also providing fast encoding, minimal memory overhead during encoding, fast queries, and fast, direct modification. We also compare against several benchmarks and empirically show that we achieve compression rates better than or equal to the best performing benchmark for each dataset. Finally, we also develop both static and time-evolving algorithms that run directly on our compressed structures. Using our static graph compression combined with our differential technique, we find that we can speed up matrix-vector multiplication by reusing previously computed products. We compare our results against a similar technique using the Webgraph Framework, and we see that not only are our base query speeds faster, but we also gain a more significant speed-up from reusing products. Then, we use our time-evolving compression to solve the earliest arrival paths problem and time-evolving transitive closure. We found that not only were we the first to run such algorithms directly on compressed data, but that our technique was particularly efficient at doing so

    On the Practice and Application of Context-Free Language Reachability

    Get PDF
    The Context-Free Language Reachability (CFL-R) formalism relates to some of the most important computational problems facing researchers and industry practitioners. CFL-R is a generalisation of graph reachability and language recognition, such that pairs in a labelled graph are reachable if and only if there is a path between them whose labels, joined together in the order they were encountered, spell a word in a given context-free language. The formalism finds particular use as a vehicle for phrasing and reasoning about program analysis, since complex relationships within the data, logic or structure of computer programs are easily expressed and discovered in CFL-R. Unfortunately, The potential of CFL-R can not be met by state of the art solvers. Current algorithms have scalability and expressibility issues that prevent them from being used on large graph instances or complex grammars. This work outlines our efforts in understanding the practical concerns surrounding CFL-R, and applying this knowledge to improve the performance of CFL-R applications. We examine the major difficulties with solving CFL-R-based analyses at-scale, via a case-study of points-to analysis as a CFL-R problem. Points-to analysis is fundamentally important to many modern research and industry efforts, and is relevant to optimisation, bug-checking and security technologies. Our understanding of the scalability challenge motivates work in developing practical CFL-R techniques. We present improved evaluation algorithms and declarative optimisation techniques for CFL-R, capitalising on the simplicity of CFL-R to creating fully automatic methodologies. The culmination of our work is a general-purpose and high-performance tool called Cauliflower, a solver-generator for CFL-R problems. We describe Cauliflower and evaluate its performance experimentally, showing significant improvement over alternative general techniques

    Estimating the Selectivity of Spatial Queries Using the `Correlation' Fractal Dimension

    Get PDF
    We examine the estimation of selectivities for range and spatial join queries in real spatial databases. As we have shown earlier, real point sets: (a) violate consistently the "uniformity" and "independence" assumptions, (b) can often be described as "fractals", with non-integer (fractal) dimension. In this paper we show that, among the infinite family of fractal dimensions, the so called "Correlation Dimension" D2 is the one that we need to predict the selectivity of spatial join. The main contribution is that, for all the real and synthetic point-sets we tried, the average number of neighbors for a given point of the point-set follows a power law, with D2 as the exponent. This immediately solves the selectivity estimation for spatial joins, as well as for "biased" range queries (i.e., queries whose centers prefer areas of high point density). We present the formulas to estimate the selectivity for the biased queries, including an integration constant (Kshape) for each query shape. Finally, we show results on real and synthetic point sets, where our formulas achieve very low relative errors (typically about 10%, versus 40%-100% of the uniform assumption)

    Weiterentwicklung analytischer Datenbanksysteme

    Get PDF
    This thesis contributes to the state of the art in analytical database systems. First, we identify and explore extensions to better support analytics on event streams. Second, we propose a novel polygon index to enable efficient geospatial data processing in main memory. Third, we contribute a new deep learning approach to cardinality estimation, which is the core problem in cost-based query optimization.Diese Arbeit trägt zum aktuellen Forschungsstand von analytischen Datenbanksystemen bei. Wir identifizieren und explorieren Erweiterungen um Analysen auf Eventströmen besser zu unterstützen. Wir stellen eine neue Indexstruktur für Polygone vor, die eine effiziente Verarbeitung von Geodaten im Hauptspeicher ermöglicht. Zudem präsentieren wir einen neuen Ansatz für Kardinalitätsschätzungen mittels maschinellen Lernens

    Compact data structures for large and complex datasets

    Get PDF
    Programa Oficial de Doutoramento en Computación . 5009V01[Abstract] In this thesis, we study the problem of processing large and complex collections of data, presenting new data structures and algorithms that allow us to efficiently store and analyze them. We focus on three main domains: processing of multidimensional data, representation of spatial information, and analysis of scientific data. The common nexus is the use of compact data structures, which combine in a unique data structure a compressed representation of the data and the structures to access such data. The target is to be able to manage data directly in compressed form, and in this way, to keep data always compressed, even in main memory. With this, we obtain two benefits: we can manage larger datasets in main memory and we take advantage of a better usage of the memory hierarchy. In the first part, we propose a compact data structure for multidimensional databases where the domains of each dimension are hierarchical. It allows efficient queries of aggregate information at different levels of each dimension. A typical application environment for our solution would be an OLAP system. Second, we focus on the representation of spatial information, specifically on raster data, which are commonly used in geographic information systems (GIS) to represent spatial attributes (such as the altitude of a terrain, the average temperature, etc.). The new method enables several typical spatial queries with better response times than the state of the art, at the same time that saves space in both main memory and disk. Besides, we also present a framework to run a spatial join between raster and vector datasets, that uses the compact data structure previously presented in this part of the thesis. Finally, we present a solution for the computation of empirical moments from a set of trajectories of a continuous time stochastic process observed in a given period of time. The empirical autocovariance function is an example of such operations. In this thesis, we propose a method that compresses sequences of floating numbers representing Brownian motion trajectories, although it can be used in other similar areas. In addition, we also introduce a new algorithm for the calculation of the autocovariance that uses a single trajectory at a time, instead of loading the whole dataset, reducing the memory consumption during the calculation process.[Resumo] Nesta tese estudamos o problema de procesar grandes coleccións de datos, presentando novas estruturas de datos compactas e algoritmos que nos permiten almacenalas e analizalas de forma eficiente. Centrámonos en tres dominios principais: procesamento de datos multidimensionais, representación de información espacial e análise de datos científicos. O nexo común é o uso de estruturas de datos compactas, que combinan nunha única estrutura de datos unha representación comprimida dos datos e as estruturas para acceder a tales datos. O obxectivo é poder manipular os datos directamente en forma comprimida, e desta maneira, manter os datos sempre comprimidos, incluso na memoria principal. Con esto obtemos dous beneficios: podemos xestionar conxuntos de datos máis grandes na memoria principal e aproveitar un mellor uso da xerarquía da memoria. Na primera parte propoñemos unha estructura de datos compacta para bases de datos multidimensionais onde os dominios de cada dimensión están xerarquizados. Permítenos consultar eficientemente a información agregada (sumar valor máximo, etc) a diferentes niveis de cada dimensión. Un entorno de aplicación típico para a nosa solución sería un sistema OLAP. En segundo lugar, centrámonos na representación de información espacial, especificamente en datos ráster, que se utilizan comunmente en sistemas de información xeográfica (SIX) para representar atributos espaciais (como a altitude dun terreo, a temperatura media, etc.). O novo método permite realizar eficientemente varias consultas espaciais típicas con tempos de resposta mellores que o estado da arte, ao mesmo tempo que reduce o espazo utilizado tanto na memoria principal como no disco. Ademais, tamén presentamos un marco de traballo para realizar un join espacial entre conxuntos de datos vectoriais e ráster, que usa a estructura de datos compacta previamente presentada nesta parte da tese. Por último, presentamos unha solución para o cálculo de momentos empíricos a partir dun conxunto de traxectorias dun proceso estocástico de tempo continuo observadas nun período de tempo dado. A función de autocovarianza empírica é un exemplo de tales operacións. Nesta tese propoñemos un método que comprime secuencias de números flotantes que representan traxectorias de movemento Browniano, aínda que pode ser empregado noutras áreas similares. Ademais, tamén introducimos un novo algoritmo para o cálculo da autocovarianza que emprega unha única traxectoria á vez, en lugar de cargar todo o conxunto de datos, reducindo o consumo de memoria durante o proceso de cálculo.[Resumen] En esta tesis estudiamos el problema de procesar grandes colecciones de datos, presentando nuevas estructuras de datos compactas y algoritmos que nos permiten almacenarlas y analizarlas de forma eficiente. Nos centramos principalmente en tres dominios: procesamiento de datos multidimensionales, representación de información espacial y análisis de datos científicos. El nexo común es el uso de estructuras de datos compactas, que combinan en una única estructura de datos una representación comprimida de los datos y las estructuras para acceder a dichos datos. El objetivo es poder manipular los datos directamente en forma comprimida, y de esta manera, mantener los datos siempre comprimidos, incluso en la memoria principal. Con esto obtenemos dos beneficios: podemos gestionar conjuntos de datos más grandes en la memoria principal y aprovechar un mejor uso de la jerarquía de la memoria. En la primera parte proponemos una estructura de datos compacta para bases de datos multidimensionales donde los dominios de cada dimensión están jerarquizados. Nos permite consultar eficientemente la información agregada (suma, valor máximo, etc.) a diferentes niveles de cada dimensión. Un entorno de aplicación típico para nuestra solución sería un sistema OLAP. En segundo lugar, nos centramos en la representación de la información espacial, específicamente en datos ráster, que se utilizan comúnmente en sistemas de información geográfica (SIG) para representar atributos espaciales (como la altitud de un terreno, la temperatura media, etc.). El nuevo método permite realizar eficientemente varias consultas espaciales típicas con tiempos de respuesta mejores que el estado del arte, al mismo tiempo que reduce el espacio utilizado tanto en la memoria principal como en el disco. Además, también presentamos un marco de trabajo para realizar un join espacial entre conjuntos de datos vectoriales y ráster, que usa la estructura de datos compacta previamente presentada en esta parte de la tesis. Por último, presentamos una solución para el cálculo de momentos empíricos a partir de un conjunto de trayectorias de un proceso estocástico de tiempo continuo observadas en un período de tiempo dado. La función de autocovariancia empírica es un ejemplo de tales operaciones. En esta tesis proponemos un método que comprime secuencias de números flotantes que representan trayectorias de movimiento Browniano, aunque puede ser utilizado en otras áreas similares. En esta parte, también introducimos un nuevo algoritmo para el cálculo de la autocovariancia que utiliza una única trayectoria a la vez, en lugar de cargar todo el conjunto de datos, reduciendo el consumo de memoria durante el proceso de cálculoXunta de Galicia; ED431G/01Ministerio de Economía y Competitividad ;TIN2016-78011-C4-1-RMinisterio de Economía y Competitividad; TIN2016-77158-C4-3-RMinisterio de Economía y Competitividad; TIN2013-46801-C4-3-RCentro para el desarrollo Tecnológico e Industrial; IDI-20141259Centro para el desarrollo Tecnológico e Industrial; ITC-20151247Xunta de Galicia; GRC2013/05
    corecore