4,365 research outputs found

    Early Grouping Gets the Skew

    We propose a new algorithm for external grouping with large results. Our approach handles skewed data gracefully and considerably lowers the amount of random IO on disk. In contrast to existing grouping algorithms, our new algorithm does not require the optimizer to employ complicated or error-prone procedures for adjusting parameters prior to query plan execution. We implemented several variants of our algorithm, as well as the most commonly used grouping algorithms, and carried out extensive experiments on both synthetic and real data. The results of these experiments reveal the dominance of our approach. For heavily skewed data, we outperform the other algorithms by a factor of two.

    Srql: Sorted relational query language

    A relation is an unordered collection of records. Often, however, there is an underlying order (e.g., a sequence of stock prices), and users want to pose queries that reflect this order (e.g., find a weekly moving average). SQL provides no support for posing such queries. In this paper, we show how a rich class of queries reflecting sort order can be naturally expressed and efficiently executed with simple extensions to SQL.
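
    To make the moving-average example concrete, the following is a minimal Python sketch of an order-dependent query over a relation of (key, value) records. It is not the SQL extension the abstract proposes; the function name `moving_average`, the tuple layout, and the row-count window are assumptions made for illustration. With one record per day, a window of 7 rows would approximate a weekly average.

```python
from collections import deque
from typing import Iterable, Iterator, Tuple


def moving_average(
    rows: Iterable[Tuple[int, float]], window: int
) -> Iterator[Tuple[int, float]]:
    """Yield (key, mean of the last `window` values) in key order.

    `rows` is an unordered collection of (key, value) records; the sort
    on the key imposes the order that the query depends on.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    buf = deque()  # values currently inside the sliding window
    total = 0.0    # running sum of the values in `buf`
    for key, value in sorted(rows, key=lambda r: r[0]):
        buf.append(value)
        total += value
        if len(buf) > window:
            total -= buf.popleft()  # drop the oldest value
        yield key, total / len(buf)


if __name__ == "__main__":
    # Hypothetical (day, price) records, deliberately out of order.
    prices = [(3, 30.0), (1, 10.0), (2, 20.0), (4, 40.0)]
    for day, avg in moving_average(prices, window=2):
        print(day, avg)
```

    The first few outputs average fewer than `window` values because the window is not yet full; a stricter variant could skip those rows instead.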

    Leveraging range joins for the computation of overlap joins

    Joins are essential and potentially expensive operations in database management systems. When data is associated with time periods, joins commonly include predicates that require pairs of argument tuples to overlap in order to qualify for the result. Our goal is to enable built-in system support for such joins. In particular, we present an approach where overlap joins are formulated as unions of range joins, which are more general-purpose than overlap joins (i.e., useful in their own right) and are well supported by B+-trees. The approach is flexible enough to also support joins with additional equality predicates, as well as open, closed, and half-open time periods over discrete and continuous domains, thus offering both generality and simplicity, which is important in a system setting. We provide both a stand-alone solution that performs on par with the state of the art and a DBMS-embedded solution that exploits standard indexing and clearly outperforms existing DBMS solutions that depend on specialized indexing techniques. We offer both analytical and empirical evaluations of the proposals. The empirical study includes comparisons with pertinent existing proposals and offers detailed insight into the performance characteristics of the proposals.
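
    The idea of expressing an overlap join as a union of range joins can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the paper's algorithm: periods are closed intervals `(start, end)` with `start <= end`, a sorted list probed with `bisect` stands in for a B+-tree on the start point, and the names `range_join` and `overlap_join` are hypothetical. Two periods overlap exactly when either the second starts inside the first, or the first starts strictly after the second's start and no later than its end; the two cases are disjoint, so their union contains no duplicates.

```python
from bisect import bisect_left, bisect_right
from typing import Iterator, List, Tuple

Interval = Tuple[int, int]  # closed period (start, end), start <= end


def range_join(
    outer: List[Interval],
    inner_sorted: List[Interval],
    inner_starts: List[int],
    lower_inclusive: bool,
) -> Iterator[Tuple[Interval, Interval]]:
    """Pair each outer period with inner periods whose start lies in
    [outer.start, outer.end] (or (outer.start, outer.end] when
    lower_inclusive is False). `inner_sorted` must be sorted by start,
    with `inner_starts` holding the matching start points."""
    for o in outer:
        start, end = o
        if lower_inclusive:
            lo = bisect_left(inner_starts, start)
        else:
            lo = bisect_right(inner_starts, start)
        hi = bisect_right(inner_starts, end)
        for i in range(lo, hi):
            yield o, inner_sorted[i]


def overlap_join(
    r: List[Interval], s: List[Interval]
) -> List[Tuple[Interval, Interval]]:
    """Return all (r_period, s_period) pairs that overlap."""
    r_sorted, s_sorted = sorted(r), sorted(s)
    r_starts = [p[0] for p in r_sorted]
    s_starts = [p[0] for p in s_sorted]
    # Range join 1: s.start falls within [r.start, r.end].
    part1 = list(range_join(r_sorted, s_sorted, s_starts, True))
    # Range join 2: r.start falls within (s.start, s.end].
    part2 = [(a, b) for b, a in range_join(s_sorted, r_sorted, r_starts, False)]
    return part1 + part2


if __name__ == "__main__":
    r = [(1, 5), (10, 12)]
    s = [(5, 8), (0, 1), (6, 9), (11, 11)]
    for pair in overlap_join(r, s):
        print(pair)
```

    Open or half-open periods would change which of the `bisect_left` / `bisect_right` probes is used at each boundary; the sketch covers only the closed case.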

    BigDansing

    Data cleansing approaches have usually focused on detecting and fixing errors, with little attention to scaling to big datasets. This presents a serious impediment, since data cleansing often involves costly computations such as enumerating pairs of tuples, handling inequality joins, and dealing with user-defined functions. In this paper, we present BigDansing, a Big Data Cleansing system that tackles efficiency, scalability, and ease-of-use issues in data cleansing. The system can run on top of most common general-purpose data processing platforms, ranging from DBMSs to MapReduce-like frameworks. A user-friendly programming interface allows users to express data quality rules both declaratively and procedurally, without needing to be aware of the underlying distributed platform. BigDansing translates these rules into a series of transformations that enable distributed computations and several optimizations, such as shared scans and specialized join operators. Experimental results on both synthetic and real datasets show that BigDansing outperforms existing baseline systems by up to more than two orders of magnitude without sacrificing the quality provided by the repair algorithms.