473 research outputs found

    Optimal Joins Using Compact Data Structures

    Worst-case optimal join algorithms have gained a lot of attention in the database literature. We now have several algorithms that are optimal in the worst case, and many of them have been implemented and validated in practice. However, the implementation of these algorithms often requires an enhanced indexing structure: to achieve optimality we either need to build completely new indexes, or we must populate the database with several instantiations of indexes such as B+-trees. Either way, this means spending an extra amount of storage space that may be non-negligible. We show that optimal algorithms can be obtained directly from a representation that regards the relations as point sets in variable-dimensional grids, without the need for extra storage. Our representation is a compact quadtree for the static indexes, and a dynamic quadtree sharing subtrees (which we dub a qdag) for intermediate results. We develop a compositional algorithm to process full join queries under this representation, and show that the running time of this algorithm is worst-case optimal in data complexity. Remarkably, we can extend our framework to evaluate more expressive queries from relational algebra by introducing a lazy version of qdags (lqdags). Once again, we can show that the running time of our algorithms is worst-case optimal.
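
To make the "relations as point sets in a grid" idea concrete, here is a minimal Python sketch. It is not the paper's qdag algorithm: it covers only the simplest case, two binary relations over the same two attributes, where the join reduces to an intersection. The node encoding and the names `build`, `intersect`, and `points_of` are assumptions made for illustration.

```python
from typing import Iterator, Set, Tuple

Point = Tuple[int, int]

# Hypothetical node encoding (an assumption, not the paper's compact layout):
#   None        -> the region contains no points
#   True        -> a single occupied cell (region of size 1)
#   4-tuple     -> children in the order (NW, NE, SW, SE)


def build(points: Set[Point], x0: int, y0: int, size: int):
    """Build a region quadtree over the size x size square whose corner is (x0, y0).

    `size` is assumed to be a power of two.
    """
    inside = {(x, y) for (x, y) in points
              if x0 <= x < x0 + size and y0 <= y < y0 + size}
    if not inside:
        return None
    if size == 1:
        return True
    half = size // 2
    return tuple(build(inside, x0 + dx, y0 + dy, half)
                 for dy in (0, half) for dx in (0, half))


def intersect(a, b):
    """Intersect two quadtrees built over the same grid.

    A quadrant that is empty in either input is pruned without being visited.
    """
    if a is None or b is None:
        return None
    if a is True and b is True:
        return True
    children = tuple(intersect(ca, cb) for ca, cb in zip(a, b))
    return children if any(c is not None for c in children) else None


def points_of(node, x0: int, y0: int, size: int) -> Iterator[Point]:
    """Enumerate the points stored in a quadtree."""
    if node is None:
        return
    if node is True:
        yield (x0, y0)
        return
    half = size // 2
    offsets = [(dx, dy) for dy in (0, half) for dx in (0, half)]
    for child, (dx, dy) in zip(node, offsets):
        yield from points_of(child, x0 + dx, y0 + dy, half)


# Two toy relations R(A, B) and S(A, B) over a 4 x 4 grid.
R = {(1, 2), (3, 0), (2, 2)}
S = {(1, 2), (0, 0), (2, 2)}
joined = set(points_of(intersect(build(R, 0, 0, 4), build(S, 0, 0, 4)), 0, 0, 4))
```

The pruning in `intersect` is the point of the sketch: whole quadrants are skipped as soon as one operand is empty there. Joins over different attribute sets, which the abstract handles through qdags, are outside what this example shows.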

    Constellation Queries over Big Data

    A geometrical pattern is a set of points with all pairwise distances (or, more generally, relative distances) specified. Finding matches to such patterns has applications to spatial data in seismic, astronomical, and transportation contexts. For example, a particularly interesting geometric pattern in astronomy is the Einstein cross, which is an astronomical phenomenon in which a single quasar is observed as four distinct sky objects (due to gravitational lensing) when captured by earth telescopes. Finding such crosses, as well as other geometric patterns, is a challenging problem as the potential number of sets of elements that compose shapes is exponentially large in the size of the dataset and the pattern. In this paper, we denote geometric patterns as constellation queries and propose algorithms to find them in large data applications. Our methods combine quadtrees, matrix multiplication, and unindexed join processing to discover sets of points that match a geometric pattern within some additive factor on the pairwise distances. Our distributed experiments show that the choice of composition algorithm (matrix multiplication or nested loops) depends on the freedom introduced in the query geometry through the distance additive factor. Three clearly identified blocks of threshold values guide the choice of the best composition algorithm. Finally, solving the problem for relative distances requires a novel continuous-to-discrete transformation. To the best of our knowledge, this paper is the first to investigate constellation queries at scale.
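
The matching criterion described above (pairwise distances equal to the pattern's, up to an additive factor) can be sketched with a plain nested-loops search in Python. This is an unindexed brute-force illustration only; it uses neither quadtrees nor matrix multiplication, and the function name `match_constellation`, the distance-matrix input format, and the sample data are assumptions.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def match_constellation(points: Sequence[Point],
                        pattern: Sequence[Sequence[float]],
                        eps: float) -> List[Tuple[int, ...]]:
    """Return index tuples of `points` that match `pattern`.

    pattern[i][j] is the required distance between pattern elements i and j.
    A tuple matches when every pairwise distance is within +/- eps of the
    required one (the additive factor).
    """
    k = len(pattern)
    results: List[Tuple[int, ...]] = []

    def extend(chosen: List[int]) -> None:
        if len(chosen) == k:
            results.append(tuple(chosen))
            return
        i = len(chosen)  # pattern element being placed next
        for cand in range(len(points)):
            if cand in chosen:
                continue
            # Check the candidate against every element placed so far.
            if all(abs(math.dist(points[cand], points[chosen[j]]) - pattern[i][j]) <= eps
                   for j in range(i)):
                chosen.append(cand)
                extend(chosen)
                chosen.pop()

    extend([])
    return results


# Hypothetical data: look for a 3-4-5 triangle among four points.
sky = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0), (10.0, 10.0)]
triangle = [[0, 3, 4],
            [3, 0, 5],
            [4, 5, 0]]
matches = match_constellation(sky, triangle, eps=0.1)
```

A larger `eps` admits more partial matches at each level of the search, which is one way to see why the additive factor affects the cost of composing candidates.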

    Optimizing Spatial Databases

    This paper describes the best way to improve the optimization of spatial databases: through spatial indexes. The most common and widely used spatial indexes are the R-tree and the Quadtree, and they are presented, analyzed and compared in this paper. A few examples of queries that run in Oracle Spatial and are supported by an R-tree spatial index are also given. Spatial databases offer special features that can be very helpful when such data needs to be represented. But in terms of storage and time costs, spatial data can require a lot of resources. This is why optimizing the database is one of the most important aspects when working with large volumes of data. Keywords: Spatial Database, Spatial Index, R-tree, Quadtree, Optimization

    bdbms -- A Database Management System for Biological Data

    Biologists are increasingly using databases for storing and managing their data. Biological databases typically consist of a mixture of raw data, metadata, sequences, annotations, and related data obtained from various sources. Current database technology lacks several functionalities that are needed by biological databases. In this paper, we introduce bdbms, an extensible prototype database management system for supporting biological data. bdbms extends the functionalities of current DBMSs to include: (1) Annotation and provenance management including storage, indexing, manipulation, and querying of annotation and provenance as first class objects in bdbms, (2) Local dependency tracking to track the dependencies and derivations among data items, (3) Update authorization to support data curation via content-based authorization, in contrast to identity-based authorization, and (4) New access methods and their supporting operators that support pattern matching on various types of compressed biological data types. This paper presents the design of bdbms along with the techniques proposed to support these functionalities including an extension to SQL. We also outline some open issues in building bdbms. Comment: This article is published under a Creative Commons License Agreement (http://creativecommons.org/licenses/by/2.5/). You may copy, distribute, display, and perform the work, make derivative works and make commercial use of the work, but you must attribute the work to the author and CIDR 2007. 3rd Biennial Conference on Innovative Data Systems Research (CIDR) January 7-10, 2007, Asilomar, California, US

    Efficient geographic information systems: Data structures, Boolean operations and concurrency control

    Geographic Information Systems (GIS) are crucial to the ability of governmental agencies and businesses to record, manage and analyze geographic data efficiently. They provide methods of analysis and simulation on geographic data that were previously infeasible using traditional hardcopy maps. Creation of realistic 3-D sceneries by overlaying satellite imagery over digital elevation models (DEM) was not possible using paper maps. Determination of suitable areas for construction that would have the fewest environmental impacts once required manual tracing of different map sets on mylar sheets; now it can be done in real time by GIS. Geographic information processing has significant space and time requirements. This thesis concentrates on techniques which can make existing GIS more efficient by considering these issues: Data Structure, Boolean Operations on Geographic Data, Concurrency Control. Geographic data span multiple dimensions and consist of geometric shapes such as points, lines, and areas, which cannot be efficiently handled using a traditional one-dimensional data structure. We therefore first survey spatial data structures for geographic data and then show how a spatial data structure called an R-tree can be used to augment the performance of many existing GIS. Boolean operations on geographic data are fundamental to the spatial analysis common in geographic data processing. They allow the user to analyze geographic data by using operators such as AND, OR, NOT on geographic objects. An example of a boolean operation query would be, "Find all regions that have low elevation AND soil type clay." Boolean operations require significant time to process. We present a generalized solution that could significantly improve the time performance of evaluating complex boolean operation queries. Concurrency control on spatial data structures for geographic data processing is becoming more critical as the size and resolution of geographic databases increase. We present algorithms to enable concurrent access to R-tree spatial data structures so that efficient sharing of geographic data can occur in a multi-user GIS environment.
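
The example query in this abstract ("low elevation AND soil type clay") can be illustrated with a small Python sketch over gridded layers. This is a naive cell-by-cell evaluation, not the generalized solution the thesis proposes. The raster representation, the 100-unit elevation threshold, the sample values, and the helper names are all assumptions.

```python
from typing import List

Mask = List[List[bool]]


def layer_and(a: Mask, b: Mask) -> Mask:
    """Cell-wise AND of two boolean layers with the same shape."""
    return [[x and y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def layer_or(a: Mask, b: Mask) -> Mask:
    """Cell-wise OR of two boolean layers with the same shape."""
    return [[x or y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def layer_not(a: Mask) -> Mask:
    """Cell-wise NOT of a boolean layer."""
    return [[not x for x in row] for row in a]


# Hypothetical 2 x 3 map layers.
elevation = [[120, 80, 40],
             [200, 60, 30]]
soil = [["clay", "sand", "clay"],
        ["clay", "clay", "loam"]]

# Turn each attribute layer into a boolean mask, then combine the masks.
low_elevation = [[h < 100 for h in row] for row in elevation]  # threshold is an assumption
is_clay = [[s == "clay" for s in row] for row in soil]

suitable = layer_and(low_elevation, is_clay)
```

Every operator here touches every cell, so cost grows with map resolution and with the number of operators in the query, which is consistent with the abstract's remark that Boolean operations require significant time to process.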

    Geographic Information Systems: The Developer's Perspective

    Geographic information systems, which manage data describing the surface of the earth, are becoming increasingly popular. This research details the current state of the art of geographic data processing in terms of the needs of the geographic information system developer. The research focuses chiefly on the geographic data model, the basic building block of the geographic information system. The two most popular models, tessellation and vector, are studied in detail, as well as a number of hybrid data models. In addition, geographic database management is discussed in terms of geographic data access and query processing. Finally, a pragmatic discussion of geographic information system design is presented covering such topics as distributed database considerations and artificial intelligence considerations.

    A spatial data handling system for retrieval of images by unrestricted regions of user interest

    The Intelligent Data Management (IDM) project at NASA/Goddard Space Flight Center has prototyped an Intelligent Information Fusion System (IIFS), which automatically ingests metadata from remote sensor observations into a large catalog which is directly queryable by end-users. The greatest challenge in the implementation of this catalog was supporting spatially-driven searches, where the user has a possibly complex region of interest and wishes to recover those images that overlap all or simply a part of that region. A spatial data management system is described, which is capable of storing and retrieving records of image data regardless of their source. This system was designed and implemented as part of the IIFS catalog. A new data structure, called a hypercylinder, is central to the design. The hypercylinder is specifically tailored for data distributed over the surface of a sphere, such as satellite observations of the Earth or space. Operations on the hypercylinder are regulated by two expert systems. The first governs the ingest of new metadata records, and maintains the efficiency of the data structure as it grows. The second translates, plans, and executes users' spatial queries, performing incremental optimization as partial query results are returned.

    Incremental elasticity for array databases

    Relational databases benefit significantly from elasticity, whereby they execute on a set of changing hardware resources provisioned to match their storage and processing requirements. Such flexibility is especially attractive for scientific databases because their users often have a no-overwrite storage model, in which they delete data only when their available space is exhausted. This results in a database that is regularly growing and expanding its hardware proportionally. Also, scientific databases frequently store their data as multidimensional arrays optimized for spatial querying. This brings about several novel challenges in clustered, skew-aware data placement on an elastic shared-nothing database. In this work, we design and implement elasticity for an array database. We address this challenge on two fronts: determining when to expand a database cluster and how to partition the data within it. In both steps we propose incremental approaches, affecting a minimum set of data and nodes, while maintaining high performance. We introduce an algorithm for gradually augmenting an array database's hardware using a closed-loop control system. After the cluster adds nodes, we optimize data placement for n-dimensional arrays. Many of our elastic partitioners incrementally reorganize an array, redistributing data only to new nodes. By combining these two tools, the scientific database efficiently and seamlessly manages its monotonically increasing hardware resources. Intel Corporation (Science and Technology Center for Big Data
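
The idea of "redistributing data only to new nodes" can be sketched in Python as follows. This is a simple greedy illustration, not one of the paper's partitioners: it measures load by chunk count alone (ignoring skew and array dimensionality), and the function name `add_node_incrementally` and the chunk and node labels are assumptions.

```python
from typing import Dict, List, Tuple

Move = Tuple[str, str, str]  # (chunk, source node, destination node)


def add_node_incrementally(placement: Dict[str, List[str]], new_node: str) -> List[Move]:
    """Add `new_node` and fill it up to an equal share of the chunks.

    Chunks are taken from whichever existing node currently holds the most,
    and they only ever move to the new node; existing nodes never exchange
    data with each other. `placement` is updated in place.
    """
    total = sum(len(chunks) for chunks in placement.values())
    target = total // (len(placement) + 1)  # equal share after the expansion
    old_nodes = list(placement)
    placement[new_node] = []
    moves: List[Move] = []
    while len(placement[new_node]) < target:
        donor = max(old_nodes, key=lambda n: len(placement[n]))
        chunk = placement[donor].pop()
        placement[new_node].append(chunk)
        moves.append((chunk, donor, new_node))
    return moves


# Hypothetical cluster: two nodes holding six array chunks unevenly.
cluster = {"n1": ["c1", "c2", "c3", "c4"], "n2": ["c5", "c6"]}
moved = add_node_incrementally(cluster, "n3")
```

In this toy run only two of the six chunks move, both onto the new node, whereas a full repartitioning could relocate most of the data. That reduced movement is the property the abstract attributes to incremental approaches.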