8 research outputs found

    Evaluation of Main Memory : Join Algorithms for Joins with Set Comparison Predicates

    Full text link
    Current data models like the NF2 model and object-oriented models support set-valued attributes. Hence, it becomes possible to have join predicates based on set comparison. This paper introduces and evaluates several main memory algorithms to evaluate efficiently this kind of join. More specifically, we concentrate on the set equality and the subset predicates

    Cost estimation of spatial join in spatialhadoop

    Get PDF
    Spatial join is an important operation in geo-spatial applications, since it is frequently used for performing data analysis involving geographical information. Many efforts have been done in the past decades in order to provide efficient algorithms for spatial join and this becomes particularly important as the amount of spatial data to be processed increases. In recent years, the MapReduce approach has become a de-facto standard for processing large amount of data (big-data) and some attempts have been made for extending existing frameworks for the processing of spatial data. In this context, several different MapReduce implementations of spatial join have been defined which mainly differ in the use of a spatial index and in the way this index is built and used. In general, none of these algorithms can be considered better than the others, but the choice might depend on the characteristics of the involved datasets. The aim of this work is to deeply analyse them and define a cost model for ranking them based on the characteristics of the dataset at hand (i.e., selectivity or spatial properties). This cost model has been extensively tested w.r.t. a set of synthetic datasets in order to prove its effectiveness

    Use of a weighted matching algorithm to sequence clusters in spatial join processing

    Get PDF
    One of the most expensive operations in a spatial database is spatial join processing. This study focuses on how to improve the performance of such processing. The main objective is to reduce the Input/Output (I/O) cost of the spatial join process by using a technique called cluster-scheduling. Generally, the spatial join is processed in two steps, namely filtering and refinement. The cluster-scheduling technique is performed after the filtering step and before the refinement step and is part of the housekeeping phase. The key point of this technique is to realise order wherein two consecutive clusters in the sequence have maximal overlapping objects. However, finding the maximal overlapping order has been shown to be Nondeterministic Polynomial-time (NP)-complete. This study proposes an algorithm to provide approximate maximal overlapping (AMO) order in a Cluster Overlapping (CO) graph. The study proposes the use of an efficient maximum weighted matching algorithm to solve the problem of finding AMO order. As a result, the I/O cost in spatial join processing can be minimised

    Efficient Index-based Methods for Processing Large Biological Databases.

    Full text link
    Over the last few decades, advances in life sciences have generated a vast amount of biological data. To cope with the rapid increase in data volume, there is a pressing need for efficient computational methods to query large biological datasets. This thesis develops efficient and scalable querying methods for biological data. For an efficient sequence database search, we developed two q-gram index based algorithms, miBLAST and ProbeMatch. miBLAST is designed to expedite batch identification of statistically significant sequence alignments. ProbeMatch is designed for identifying sequence alignments based on a k-mismatch model. For an efficient protein structure database search, we also developed a multi-dimensional index based algorithm method called proCC, an automatic and efficient classification framework. All these algorithms result in substantial performance improvements over existing methods. When designing index-based methods, the right choice of indexing methods is essential. In addition to developing index-based methods for biological applications, we also investigated an essential database problem that reexamines the state-of-the-art indexing methods by experimental evaluation. Our experimental study provides a valuable insight for choosing the right indexing method and also motivates a careful consideration of index structures when designing index-based methods. In the long run, index-based methods can lead to new and more efficient algorithms for querying and mining biological datasets. The examples above, which include query processing on biological sequence and geometrical structure datasets, employ index-based methods very effectively. While the database research community has long recognized the need for index-based query processing algorithms, the bioinformatics community has been slow to adopt such algorithms. However, since many biological datasets are growing very rapidly, database-style index-based algorithms are likely to play a crucial role in modern bioinformatics methods. The work proposed in this thesis lays the foundation for such methods.Ph.D.Computer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/61570/1/youjkim_1.pd

    Benchmarking Spatial Join Operations with Spatial Output

    No full text
    The spatial join operation is benchmarked using variants of well-known spatial data structures such as the R-tree, R-tree, R +-tree, and the PMR quadtree. The focus is on a spatial join with spatial output because the result of the spatial join frequently serves as input to subsequent spatial operations (i.e., a cascaded spatial join as would be common in a spatial spreadsheet). Thus, in addition to the time required to perform the spatial join itself (whose output is not always required to be spatial), the time to build the spatial data structure also plays an important role in the benchmark. The studied quantities are the time to build the data structure and the time to do the spatial join in an application domain consisting of planar line segment data. Experiments reveal that spatial data structures based on a disjoint decomposition of space and bounding boxes (i.e., the R +-tree and the PMR quadtree with bounding boxes) outperform the other structures that are based upon a non-disjoint decomposition (i.e., the R-tree and R-tree). As the size of the output of the spatial join increases with respect to the larger of the two inputs, the advantage of the bounding boxes used in methods based on a disjoint non-regular decomposition is no longer a factor an
    corecore