389 research outputs found

    Optimising Sargable Conjunctive Predicate Queries in the Context of Big Data

    Get PDF
    With the continued increase in the volume of data, the volume dimension of big data has become a significant factor in estimating query time. When all other factors are held constant, query time increases as the volume of data increases and vice versa. To enhance query time, several techniques have come out of research efforts in this direction. One of such techniques is factorisation of query predicates. Factorisation has been used as a query optimization technique for the general class of predicates but has been found inapplicable to the subclass of sargable conjunctive equality predicates. Experiments performed exposed a peculiar nature of sargable conjunctive equality predicates based on which insight, the concatenated predicate model was formulated as capable of optimising sargable conjunctive equality predicates. Equations from research results were combined in a way that theorems describing the application and optimality of the concatenated predicate model were derived and proved

    Adaptive content mapping for internet navigation

    Get PDF
    The Internet as the biggest human library ever assembled keeps on growing. Although all kinds of information carriers (e.g. audio/video/hybrid file formats) are available, text based documents dominate. It is estimated that about 80% of all information worldwide stored electronically exists in (or can be converted into) text form. More and more, all kinds of documents are generated by means of a text processing system and are therefore available electronically. Nowadays, many printed journals are also published online and may even discontinue to appear in print form tomorrow. This development has many convincing advantages: the documents are both available faster (cf. prepress services) and cheaper, they can be searched more easily, the physical storage only needs a fraction of the space previously necessary and the medium will not age. For most people, fast and easy access is the most interesting feature of the new age; computer-aided search for specific documents or Web pages becomes the basic tool for information-oriented work. But this tool has problems. The current keyword based search machines available on the Internet are not really appropriate for such a task; either there are (way) too many documents matching the specified keywords are presented or none at all. The problem lies in the fact that it is often very difficult to choose appropriate terms describing the desired topic in the first place. This contribution discusses the current state-of-the-art techniques in content-based searching (along with common visualization/browsing approaches) and proposes a particular adaptive solution for intuitive Internet document navigation, which not only enables the user to provide full texts instead of manually selected keywords (if available), but also allows him/her to explore the whole database

    Distribution Policies for Datalog

    Get PDF
    Modern data management systems extensively use parallelism to speed up query processing over massive volumes of data. This trend has inspired a rich line of research on how to formally reason about the parallel complexity of join computation. In this paper, we go beyond joins and study the parallel evaluation of recursive queries. We introduce a novel framework to reason about multi-round evaluation of Datalog programs, which combines implicit predicate restriction with distribution policies to allow expressing a combination of data-parallel and query-parallel evaluation strategies. Using our framework, we reason about key properties of distributed Datalog evaluation, including parallel-correctness of the evaluation strategy, disjointness of the computation effort, and bounds on the number of communication rounds

    An experimental study of learned cardinality estimation

    Get PDF
    Cardinality estimation is a fundamental but long unresolved problem in query optimization. Recently, multiple papers from different research groups consistently report that learned models have the potential to replace existing cardinality estimators. In this thesis, we ask a forward-thinking question: Are we ready to deploy these learned cardinality models in production? Our study consists of three main parts. Firstly, we focus on the static environment (i.e., no data updates) and compare five new learned methods with eight traditional methods on four real-world datasets under a unified workload setting. The results show that learned models are indeed more accurate than traditional methods, but they often suffer from high training and inference costs. Secondly, we explore whether these learned models are ready for dynamic environments (i.e., frequent data updates). We find that they can- not catch up with fast data updates and return large errors for different reasons. For less frequent updates, they can perform better but there is no clear winner among themselves. Thirdly, we take a deeper look into learned models and explore when they may go wrong. Our results show that the performance of learned methods can be greatly affected by the changes in correlation, skewness, or domain size. More importantly, their behaviors are much harder to interpret and often unpredictable. Based on these findings, we identify two promising research directions (control the cost of learned models and make learned models trustworthy) and suggest a number of research opportunities. We hope that our study can guide researchers and practitioners to work together to eventually push learned cardinality estimators into real database systems

    APRIL: Approximating Polygons as Raster Interval Lists

    Full text link
    The spatial intersection join an important spatial query operation, due to its popularity and high complexity. The spatial join pipeline takes as input two collections of spatial objects (e.g., polygons). In the filter step, pairs of object MBRs that intersect are identified and passed to the refinement step for verification of the join predicate on the exact object geometries. The bottleneck of spatial join evaluation is in the refinement step. We introduce APRIL, a powerful intermediate step in the pipeline, which is based on raster interval approximations of object geometries. Our technique applies a sequence of interval joins on 'intervalized' object approximations to determine whether the objects intersect or not. Compared to previous work, APRIL approximations are simpler, occupy much less space, and achieve similar pruning effectiveness at a much higher speed. Besides intersection joins between polygons, APRIL can directly be applied and has high effectiveness for polygonal range queries, within joins, and polygon-linestring joins. By applying a lightweight compression technique, APRIL approximations may occupy even less space than object MBRs. Furthermore, APRIL can be customized to apply on partitioned data and on polygons of varying sizes, rasterized at different granularities. Our last contribution is a novel algorithm that computes the APRIL approximation of a polygon without having to rasterize it in full, which is orders of magnitude faster than the computation of other raster approximations. Experiments on real data demonstrate the effectiveness and efficiency of APRIL; compared to the state-of-the-art intermediate filter, APRIL occupies 2x-8x less space, is 3.5x-8.5x more time-efficient, and reduces the end-to-end join cost up to 3 times.Comment: 12 page

    Intelligent Data Storage and Retrieval for Design Optimisation – an Overview

    Get PDF
    This paper documents the findings of a literature review conducted by the Sir Lawrence Wackett Centre for Aerospace Design Technology at RMIT University. The review investigates aspects of a proposed system for intelligent design optimisation. Such a system would be capable of efficiently storing (and compressing if required) a range of types of design data into an intelligent database. This database would be accessed by the system during subsequent design processes, allowing for search of relevant design data for re-use in later designs, allowing it to become very efficient in reducing the time for later designs as the database grows in size. Extensive research has been performed, in both theoretical aspects of the project, and practical examples of current similar systems. This research covers the areas of database systems, database queries, representation and compression of design data, geometric representation and heuristic methods for design applications.

    FRIOD: a deeply integrated feature-rich interactive system for effective and efficient outlier detection

    Get PDF
    In this paper, we propose an novel interactive outlier detection system called feature-rich interactive outlier detection (FRIOD), which features a deep integration of human interaction to improve detection performance and greatly streamline the detection process. A user-friendly interactive mechanism is developed to allow easy and intuitive user interaction in all the major stages of the underlying outlier detection algorithm which includes dense cell selection, location-aware distance thresholding, and final top outlier validation. By doing so, we can mitigate the major difficulty of the competitive outlier detection methods in specifying the key parameter values, such as the density and distance thresholds. An innovative optimization approach is also proposed to optimize the grid-based space partitioning, which is a critical step of FRIOD. Such optimization fully considers the high-quality outliers it detects with the aid of human interaction. The experimental evaluation demonstrates that FRIOD can improve the quality of the detected outliers and make the detection process more intuitive, effective, and efficient
    • …
    corecore