15 research outputs found

    Bulk Insertions into xBR+-trees

    Get PDF
    Bulk insertion refers to the process of updating an existing index by inserting a large batch of new data, treating the items of this batch as a whole rather than inserting them one-by-one. Bulk insertion is related to bulk loading, which refers to the process of creating a non-existing index from scratch, when the dataset to be indexed is available beforehand. The xBR+-tree is a balanced, disk-resident, quadtree-based index for point data, which is very efficient for processing spatial queries. In this paper, we present the first algorithm for bulk insertion into xBR+-trees. This algorithm incorporates extensions of techniques that we have recently developed for bulk loading xBR+-trees. Moreover, using real and artificial datasets of various cardinalities, we present an experimental comparison of this algorithm vs. inserting items one-by-one for updating xBR+-trees, regarding performance (I/O and execution time) and the characteristics of the resulting trees. We also present experimental results regarding the query-processing efficiency of xBR+-trees built by bulk insertions vs. xBR+-trees built by inserting items one-by-one.
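
    To make the batch-as-a-whole idea concrete, here is a minimal sketch (not the paper's xBR+-tree algorithm; all names are illustrative) of bulk insertion into a simplified quadtree-style point index: each node partitions the incoming batch once into its four quadrants and recurses with whole groups, so the tree is traversed once per group instead of once per point.

    # Minimal bulk-insertion sketch for a quadtree-style point index.
    # Illustrative only: the real xBR+-tree is disk-resident and balanced,
    # which this in-memory sketch does not model.

    CAPACITY = 8  # max points per leaf before splitting

    class Node:
        def __init__(self, cx, cy, half):
            self.cx, self.cy, self.half = cx, cy, half  # square region
            self.points = []      # used while this node is a leaf
            self.children = None  # four children once split

        def quadrant(self, p):
            x, y = p
            return (1 if x >= self.cx else 0) | (2 if y >= self.cy else 0)

        def split(self):
            # assumes the stored points are not all coincident
            h = self.half / 2
            offs = [(-h, -h), (h, -h), (-h, h), (h, h)]
            self.children = [Node(self.cx + dx, self.cy + dy, h)
                             for dx, dy in offs]
            for p in self.points:
                self.children[self.quadrant(p)].points.append(p)
            self.points = []

    def bulk_insert(node, batch):
        # batch: list of (x, y) points lying inside node's region
        if node.children is None:
            if len(node.points) + len(batch) <= CAPACITY:
                node.points.extend(batch)  # the whole group fits here
                return
            node.split()
        groups = [[], [], [], []]  # one pass splits the batch by quadrant
        for p in batch:
            groups[node.quadrant(p)].append(p)
        for q, g in enumerate(groups):
            if g:
                bulk_insert(node.children[q], g)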

    An Efficient Algorithm for Bulk-Loading xBR+-trees

    Get PDF
    A major part of the interface to a database is made up of the queries that can be addressed to this database and answered (processed) in an efficient way, contributing to the quality of the developed software. Efficiently processed spatial queries constitute a fundamental part of the interface to spatial databases, due to the wide range of applications that may address such queries, like geographical information systems (GIS), location-based services, computer visualization, automated mapping, facilities management, etc. Another important capability of the interface to a spatial database is to offer the creation of efficient index structures to speed up spatial query processing. The xBR+-tree is a balanced, disk-resident, quadtree-based index structure for point data, which is very efficient for processing such queries. Bulk-loading refers to the process of creating an index from scratch, when the dataset to be indexed is available beforehand, instead of creating the index gradually (and more slowly), as the dataset elements are inserted one-by-one. In this paper, we present an algorithm for bulk-loading xBR+-trees for big datasets residing on disk, using a limited amount of main memory. The resulting tree is not only built fast, but also exhibits high performance in processing a broad range of spatial queries, where one or two datasets are involved. To justify these characteristics, using real and artificial datasets of various cardinalities, first, we present an experimental comparison of this algorithm vs. a previous version of the same algorithm and STR, a popular algorithm for bulk-loading R-trees, regarding tree creation time and the characteristics of the trees created, and second, we experimentally compare the query efficiency of bulk-loaded xBR+-trees vs. bulk-loaded R-trees, regarding I/O and execution time. Thus, this paper contributes to the implementation of spatial database interfaces and to the efficient storage organization for big spatial data management.
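
    For reference, the following is a minimal sketch of the leaf-packing step of STR (Sort-Tile-Recursive), the R-tree bulk-loading algorithm used as the comparison baseline above. The upper tree levels, which pack the leaf MBRs in the same way, are omitted, and the names are illustrative.

    import math

    def str_pack_leaves(points, c):
        """points: list of (x, y); c: leaf capacity. Returns a list of leaves."""
        n = len(points)
        p = math.ceil(n / c)           # number of leaves needed
        s = math.ceil(math.sqrt(p))    # number of vertical slices
        pts = sorted(points)           # sort by x (then y) for slicing
        slice_size = s * c             # points per vertical slice
        leaves = []
        for i in range(0, n, slice_size):
            # within a slice, sort by y and cut into capacity-c runs
            strip = sorted(pts[i:i + slice_size], key=lambda pt: pt[1])
            for j in range(0, len(strip), c):
                leaves.append(strip[j:j + c])
        return leaves

    # Example: 10 points with capacity 3 yield 4 leaves tiled roughly 2 x 2
    leaves = str_pack_leaves([(i % 5, i // 5) for i in range(10)], 3)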

    A QUADTREE SPATIAL INDEX METHOD WITH INCLUSION RELATIONS FOR THE INCREMENTAL UPDATING OF VECTOR LANDCOVER DATABASE

    Get PDF
    In a vector landcover database, there are many complex polygons with holes, even nested holes. In incremental updating (i.e., using change-only information to update the landcover database), a new changed parcel usually has 2-dimensional intersections (e.g., overlap, cover, equal, inside, etc.) with several existing regions, so automatic updating operations first need to identify the objects affected by the new changes. If the existing parcels include complex polygons (i.e., polygons with holes), it is also necessary to determine whether there are 2-dimensional intersections between the new changed polygon and each hole of the involved complex polygons. The relation between a complex polygon and its holes is not represented in current spatial indexing methods: only the MBB (Minimum Bounding Box) of the exterior ring of the complex polygon is stored, so non-involved holes cannot be filtered out in the first step of spatial access methods. As the refinement geometric operation is costly, the updating process for complex polygons is complicated and inefficient with current spatial indexing methods. To solve this problem, an improved quadtree spatial index method is presented in this paper. In this method, polygons are divided into two categories according to their relations with the quadrant axes: disjoint from the axes, or intersecting the axes. The intersecting polygons are further divided into 5 cases according to the intersection position between the polygons and the quadrant axes of the different levels. The intersecting polygons are stored in the root nodes of the corresponding levels of our index tree, with five buckets, denoted XpB, XnB, YpB, YnB, and XYB, used to store the polygons intersecting the different quadrant axes. Polygons disjoint from all quadrant axes are stored in the leaf nodes. The authors developed the spatial index structure with inclusion relations and the algorithms for the corresponding index operations (e.g., insert, delete, and query) on complex polygons. The effectiveness of the improved index is verified by an experiment on incremental updating of landcover data. Experimental results show that the proposed index method is significantly more efficient than the traditional quadtree index in terms of spatial query efficiency, and that incremental updating is about 3 times faster using the proposed index method than using the traditional quadtree index.
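
    The classification step of this scheme can be sketched as follows. The bucket semantics (e.g., XpB holding polygons that cross the horizontal axis on the positive-x side) are our reading of the abstract, not the paper's exact definitions.

    # Route a polygon's MBB at one node: axis-crossing polygons go to one of
    # the five buckets (XpB, XnB, YpB, YnB, XYB); disjoint ones descend.
    def classify(mbb, cx, cy):
        """mbb: (xmin, ymin, xmax, ymax); (cx, cy): origin of this node's
        quadrant axes. Returns a bucket name, or a child index 0..3."""
        xmin, ymin, xmax, ymax = mbb
        crosses_h = ymin < cy < ymax  # crosses the horizontal axis y = cy
        crosses_v = xmin < cx < xmax  # crosses the vertical axis x = cx
        if crosses_h and crosses_v:
            return "XYB"              # crosses both axes
        if crosses_h:
            return "XpB" if xmin >= cx else "XnB"  # positive/negative x side
        if crosses_v:
            return "YpB" if ymin >= cy else "YnB"  # positive/negative y side
        # disjoint from both axes: route to one child quadrant
        return (1 if xmin >= cx else 0) | (2 if ymin >= cy else 0)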

    OPTIMIZING CLIENT-SERVER COMMUNICATION FOR REMOTE SPATIAL DATABASE ACCESS

    Get PDF
    Technological advances in recent years have opened ways for easier creation of spatial data. Every day, vast amounts of data are collected by both governmental institutions (e.g., USGS, NASA) and commercial entities (e.g., IKONOS). This process is driven by increased popularity and affordability across the whole spectrum of collection methods, ranging from personal GPS units to satellite systems. Many collection methods, such as satellite systems, produce data in raster format. Often, such raster data is analyzed by researchers directly, while at other times it is used to produce the final dataset in vector format. With the rapidly increasing supply of data, more applications for this data are being developed that are of interest to a wider consumer base. The increasing popularity of spatial data viewers and query tools with end users introduces a requirement for methods that allow these basic users to access this data for viewing and querying instantly and without much effort. In our work, we focus on providing remote access to vector-based spatial data, rather than raster data. We explore new ways of allowing visualization of both spatial and non-spatial data stored in a central server database on a simple client connected to this server by a possibly slow and unreliable connection. We considered usage scenarios where transferring the whole database for processing on the client was not feasible, due to the large volume of data stored on the server, a lack of computing power on the client, and a slow link between the two. We focus on finding an optimal way of distributing work between the server, the clients, and possibly other entities introduced into the model for query evaluation and data management. We address issues of scalability for clients that have only limited access to system resources (e.g., a Java applet). Methods that allow these clients to provide an interactive user interface, even for databases of arbitrary size, are also examined.

    Large-Scale Spatial Data Management on Modern Parallel and Distributed Platforms

    Full text link
    The rapidly growing volume of spatial data has made it desirable to develop efficient techniques for managing large-scale spatial data. Traditional spatial data management techniques cannot meet the efficiency and scalability requirements of large-scale spatial data processing. In this dissertation, we have developed new data-parallel designs for large-scale spatial data management that can better utilize modern inexpensive commodity parallel and distributed platforms, including multi-core CPUs, many-core GPUs and computer clusters, to achieve both efficiency and scalability. After introducing background on spatial data management and modern parallel and distributed systems, we present our parallel designs for spatial indexing and spatial join query processing on both multi-core CPUs and GPUs for high efficiency, as well as their integration with Big Data systems for better scalability. Experimental results using real-world datasets demonstrate the effectiveness and efficiency of the proposed techniques in managing large-scale spatial data.
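
    As a small illustration of the data-parallel flavor of such designs (a generic multi-core sketch, not the dissertation's GPU implementations), the filter step of a spatial join can hash both inputs to grid cells and generate candidate MBR-overlap pairs per cell in parallel; all names here are illustrative.

    from concurrent.futures import ProcessPoolExecutor
    from collections import defaultdict

    def cells(mbb, size):
        xmin, ymin, xmax, ymax = mbb
        for gx in range(int(xmin // size), int(xmax // size) + 1):
            for gy in range(int(ymin // size), int(ymax // size) + 1):
                yield (gx, gy)

    def build_grid(rects, size):
        grid = defaultdict(list)
        for i, r in enumerate(rects):
            for c in cells(r, size):
                grid[c].append(i)
        return grid

    def overlaps(a, b):
        return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

    def join_cell(args):
        # passing the full rect lists per task keeps the sketch simple,
        # at the cost of redundant serialization
        ra, rb, ids_a, ids_b = args
        return [(i, j) for i in ids_a for j in ids_b if overlaps(ra[i], rb[j])]

    def parallel_join(ra, rb, size=1.0, workers=4):
        ga, gb = build_grid(ra, size), build_grid(rb, size)
        tasks = [(ra, rb, ga[c], gb[c]) for c in ga.keys() & gb.keys()]
        pairs = set()  # dedupe pairs discovered in multiple shared cells
        with ProcessPoolExecutor(max_workers=workers) as ex:
            for part in ex.map(join_cell, tasks):
                pairs.update(part)
        return pairs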

    Efficient Index-based Methods for Processing Large Biological Databases.

    Full text link
    Over the last few decades, advances in the life sciences have generated a vast amount of biological data. To cope with the rapid increase in data volume, there is a pressing need for efficient computational methods to query large biological datasets. This thesis develops efficient and scalable querying methods for biological data. For efficient sequence database search, we developed two q-gram index-based algorithms, miBLAST and ProbeMatch. miBLAST is designed to expedite batch identification of statistically significant sequence alignments. ProbeMatch is designed for identifying sequence alignments based on a k-mismatch model. For efficient protein structure database search, we also developed a multi-dimensional index-based method called proCC, an automatic and efficient classification framework. All these algorithms result in substantial performance improvements over existing methods. When designing index-based methods, the right choice of indexing method is essential. In addition to developing index-based methods for biological applications, we also investigated an essential database problem that reexamines the state-of-the-art indexing methods by experimental evaluation. Our experimental study provides valuable insight for choosing the right indexing method and also motivates a careful consideration of index structures when designing index-based methods. In the long run, index-based methods can lead to new and more efficient algorithms for querying and mining biological datasets. The examples above, which include query processing on biological sequence and geometric structure datasets, employ index-based methods very effectively. While the database research community has long recognized the need for index-based query processing algorithms, the bioinformatics community has been slow to adopt such algorithms. However, since many biological datasets are growing very rapidly, database-style index-based algorithms are likely to play a crucial role in modern bioinformatics methods. The work proposed in this thesis lays the foundation for such methods.
    Ph.D. Computer Science & Engineering
    University of Michigan, Horace H. Rackham School of Graduate Studies
    http://deepblue.lib.umich.edu/bitstream/2027.42/61570/1/youjkim_1.pd
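
    The q-gram filtering idea underlying such indexes can be sketched as follows (a generic illustration, not the actual miBLAST or ProbeMatch code): by the q-gram lemma, a length-m match with at most k mismatches shares at least m - q + 1 - k*q of its q-grams with the target, so q-gram counting can filter candidate positions before expensive verification.

    from collections import defaultdict

    def build_qgram_index(text, q):
        index = defaultdict(list)
        for i in range(len(text) - q + 1):
            index[text[i:i + q]].append(i)
        return index

    def candidates(query, index, q, k):
        m = len(query)
        threshold = m - q + 1 - k * q  # q-gram lemma lower bound
        votes = defaultdict(int)
        for j in range(m - q + 1):
            for pos in index.get(query[j:j + q], ()):
                votes[pos - j] += 1    # vote for the implied match start
        return [s for s, v in votes.items() if v >= threshold and s >= 0]

    # Verification (counting mismatches at each candidate start) would follow.
    text = "ACGTACGTTACG"
    idx = build_qgram_index(text, q=3)
    print(candidates("ACGTTACG", idx, q=3, k=1))  # -> [4]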

    Scalable Query Processing on Spatial Networks

    Get PDF
    Spatial networks (e.g., road networks) are general graphs with spatial information (e.g., latitude/longitude) associated with the vertices and/or the edges of the graph. Techniques are presented for query processing on spatial networks that are based on the observed coherence between the spatial positions of the vertices and the shortest paths between them. This facilitates aggregation of the vertices into coherent regions that share vertices on the shortest paths between them. Using this observation, a framework, termed SILC, is introduced that precomputes and compactly encodes the N^2 shortest paths and network distances between every pair of vertices on a spatial network containing N vertices. The compactness of the shortest paths from a source vertex V is achieved by partitioning the destination vertices into subsets based on the identity of the first edge on the path to them from V. The spatial coherence of these subsets is captured by using a quadtree representation whose dimension-reducing property enables the storage requirement of each subset to be reduced to be proportional to the perimeter of the spatially coherent regions, instead of to the number of vertices in the spatial network. In particular, experiments on a number of large road networks, as well as a theoretical analysis, have shown that the total storage for the shortest paths is reduced from O(N^3) to O(N^1.5). In addition to SILC, another framework, termed PCP, is proposed that also takes advantage of the spatial coherence of the source vertices and makes use of the Well-Separated Pair decomposition to further reduce the storage, under suitably defined conditions, to O(N). Using these frameworks, scalable algorithms are presented to implement a wide variety of operations, such as nearest neighbor finding and distance joins, on large datasets of locations residing on a spatial network. These frameworks essentially decouple the process of computing shortest paths from that of spatial query processing, and also decouple the domain of the participating objects from the domain of the vertices of the spatial network. This means that as long as the spatial network is unchanged, the algorithm and underlying representation of the shortest paths in the spatial network can be used with different sets of objects.
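
    The first-edge partitioning at the heart of SILC can be sketched as follows (an illustration of the idea, not the paper's implementation): a single Dijkstra pass from source V labels every destination with the first edge taken out of V on its shortest path; grouping destinations by label yields the spatially coherent regions that the quadtree then encodes.

    import heapq

    def first_edge_labels(graph, v):
        """graph: {u: [(w, weight), ...]}. Returns {dest: first hop from v}."""
        dist = {v: 0.0}
        label = {}                      # destination -> first hop out of v
        heap = []
        for w, wt in graph[v]:          # seed: each neighbor labels itself
            if wt < dist.get(w, float("inf")):
                dist[w] = wt
                label[w] = w
                heapq.heappush(heap, (wt, w))
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist[u]:
                continue                # stale heap entry
            for w, wt in graph.get(u, ()):
                nd = d + wt
                if nd < dist.get(w, float("inf")):
                    dist[w] = nd
                    label[w] = label[u]  # inherit first hop along the path
                    heapq.heappush(heap, (nd, w))
        return label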

    Indexing and Querying Moving Objects Databases

    Get PDF
    Ph.D., Doctor of Philosophy

    Efficient Algorithms for Similarity and Skyline Summary on Multidimensional Datasets.

    Full text link
    Efficient management of large multidimensional datasets has attracted much attention in the database research community. Such large multidimensional datasets are common, and efficient algorithms are needed for analyzing them for a variety of applications. In this thesis, we focus our study on two very common classes of analysis: similarity and skyline summarization. We first focus on similarity when one of the dimensions in the multidimensional dataset is temporal. We then develop algorithms for evaluating skyline summaries effectively for both temporal and low-cardinality attribute domain datasets and propose different methods for improving the effectiveness of the skyline summary operation. This thesis begins by studying similarity measures for time-series datasets and efficient algorithms for time-series similarity evaluation. The first contribution of this thesis is a new algorithm which can be used to evaluate similarity methods whose matching criterion is bounded by a specified threshold value. The second contribution of this thesis is the development of a new time-interval skyline operator, which continuously computes the current skyline over a data stream. We present a new algorithm called LookOut for evaluating such queries efficiently, and empirically demonstrate the scalability of this algorithm. Current skyline evaluation techniques follow a common paradigm that eliminates data elements from skyline consideration by finding other elements in the dataset that dominate them. The performance of such techniques is heavily influenced by the underlying data distribution. The third contribution of this thesis is a novel technique called the Lattice Skyline Algorithm (LS) that is built around a new paradigm for skyline evaluation on datasets with attributes that are drawn from low-cardinality domains. The utility of the skyline as a data summarization technique is often diminished by the volume of points in the skyline. The final contribution of this thesis is a novel scheme which remedies the skyline volume problem by ranking the elements of the skyline based on their importance to the skyline summary. Collectively, the techniques described in this thesis present efficient methods for two common and computationally intensive analysis operations on large multidimensional datasets.
    Ph.D. Computer Science & Engineering
    University of Michigan, Horace H. Rackham School of Graduate Studies
    http://deepblue.lib.umich.edu/bitstream/2027.42/57643/2/mmorse_1.pd
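
    For context, the dominance-based paradigm that LS departs from can be sketched in a few lines (this is the generic block-nested-loops approach, not the LookOut or LS algorithms): a point dominates another if it is at least as good in every dimension and strictly better in at least one, and the skyline is the set of undominated points.

    def dominates(a, b):
        """True if a dominates b (minimization in every dimension)."""
        return (all(x <= y for x, y in zip(a, b))
                and any(x < y for x, y in zip(a, b)))

    def skyline(points):
        result = []
        for p in points:
            if any(dominates(q, p) for q in result):
                continue                # p is dominated: discard it
            result = [q for q in result if not dominates(p, q)]
            result.append(p)            # p survives; drop what it dominates
        return result

    # Example: (price, distance), lower is better in both dimensions
    print(skyline([(3, 4), (1, 9), (2, 5), (4, 1), (5, 5)]))
    # -> [(3, 4), (1, 9), (2, 5), (4, 1)]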