Early Grouping Gets the Skew

Author: Helmer Sven
Moerkotte Guido
Neumann Thomas
Publication date: 01/01/2002

We propose a new algorithm for external grouping with large results. Our approach handles skewed data gracefully and considerably lowers the amount of random I/O on disk. Contrary to existing grouping algorithms, our new algorithm does not require the optimizer to employ complicated or error-prone procedures for adjusting parameters prior to query plan execution. We implemented several variants of our algorithm as well as the most commonly used algorithms for grouping and carried out extensive experiments on both synthetic and real data. The results of these experiments reveal the dominance of our approach. In the case of heavily skewed data, we outperform the other algorithms by a factor of two.

MAnnheim DOCument Server

SRQL: Sorted relational query language

Author: Arvind Ranganathan
Donko Donjerkovic
Kevin S. Beyer
Muralidhar Krishnaprasad
Raghu Ramakrishnan
Publication date: 01/01/1998

A relation is an unordered collection of records. Often, however, there is an underlying order (e.g., a sequence of stock prices), and users want to pose queries that reflect this order (e.g., find a weekly moving average). SQL provides no support for posing such queries. In this paper, we show how a rich class of queries reflecting sort order can be naturally expressed and efficiently executed with simple extensions to SQL.
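As a rough illustration of the kind of order-dependent query the abstract mentions, the following Python sketch computes a moving average over a sequence that is first put into its underlying order. It is not SRQL syntax and not the authors' method; the function name `moving_average`, the `(key, value)` row layout, and the sample data are assumptions made only for this example.

```python
from collections import deque


def moving_average(rows, window):
    """Yield (key, average of the most recent `window` values) in key order.

    `rows` is an iterable of (key, value) pairs in arbitrary order, mirroring
    an unordered relation; the sort makes the underlying order explicit.
    """
    ordered = sorted(rows, key=lambda r: r[0])
    buf = deque()   # values currently inside the window
    total = 0.0     # running sum of the values in `buf`
    for key, value in ordered:
        buf.append(value)
        total += value
        if len(buf) > window:
            total -= buf.popleft()  # drop the value that left the window
        yield key, total / len(buf)


# Hypothetical sample data: (day, price) pairs stored out of order.
prices = [(3, 12.0), (1, 10.0), (2, 11.0), (4, 13.0)]
result = list(moving_average(prices, window=2))
```

The first output row averages over fewer than `window` values because no earlier rows exist; a real query language would need to define that boundary behavior explicitly.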

CiteSeerX

Leveraging range joins for the computation of overlap joins

Author: Böhlen Michael H.
Dignös Anton
Gamper Johann
Jensen Christian S.
Moser Peter
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2022

Joins are essential and potentially expensive operations in database management systems. When data is associated with time periods, joins commonly include predicates that require pairs of argument tuples to overlap in order to qualify for the result. Our goal is to enable built-in systems support for such joins. In particular, we present an approach where overlap joins are formulated as unions of range joins, which are more general-purpose than overlap joins (i.e., are useful in their own right) and are well supported by B+-trees. The approach is sufficiently flexible that it also supports joins with additional equality predicates, as well as open, closed, and half-open time periods over discrete and continuous domains, thus offering both generality and simplicity, which is important in a system setting. We provide both a stand-alone solution that performs on par with the state of the art and a DBMS-embedded solution that is able to exploit standard indexing and clearly outperforms existing DBMS solutions that depend on specialized indexing techniques. We offer both analytical and empirical evaluations of the proposals. The empirical study includes comparisons with pertinent existing proposals and offers detailed insight into the performance characteristics of the proposals.
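The idea of expressing an overlap join as a union of range joins can be sketched in Python as follows. This is a minimal illustration, not the paper's algorithm: it assumes closed intervals `(start, end)` with `start <= end`, and it uses sorted lists with `bisect` as a stand-in for a B+-tree index. The names `overlap_join`, `R`, and `S` are hypothetical. Two intervals overlap exactly when either `s.start` lies in `[r.start, r.end]` or `r.start` lies in `(s.start, s.end]`; the two cases are disjoint, so the union produces no duplicates.

```python
from bisect import bisect_left, bisect_right


def overlap_join(R, S):
    """Return all pairs (r, s) of closed intervals from R and S that overlap."""
    R_sorted = sorted(R)
    S_sorted = sorted(S)
    r_starts = [r[0] for r in R_sorted]  # sorted start points, index stand-in
    s_starts = [s[0] for s in S_sorted]
    out = []
    # Range join A: s.start falls within [r.start, r.end].
    for r in R_sorted:
        lo = bisect_left(s_starts, r[0])
        hi = bisect_right(s_starts, r[1])
        out.extend((r, s) for s in S_sorted[lo:hi])
    # Range join B: r.start falls within (s.start, s.end].
    # The left bound is exclusive so that pairs found by A are not repeated.
    for s in S_sorted:
        lo = bisect_right(r_starts, s[0])
        hi = bisect_right(r_starts, s[1])
        out.extend((r, s) for r in R_sorted[lo:hi])
    return out


# Hypothetical sample data.
R = [(1, 5), (6, 8)]
S = [(4, 7), (9, 10), (0, 1)]
pairs = overlap_join(R, S)
```

Each range join only needs an ordered lookup on a single start attribute, which is why a standard ordered index suffices in this sketch.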

VBN

BigDansing

Author: Khayyat Zuhair
Ilyas Ihab F.
Ouzzani Mourad
Papotti Paolo
Quiané-Ruiz Jorge-Arnulfo
Tang Nan
Yin Si
Madden Samuel R
Jindal Alekh
Publication venue: Association for Computing Machinery
Publication date: 15/01/2003

Data cleansing approaches have usually focused on detecting and fixing errors with little attention to scaling to big datasets. This presents a serious impediment since data cleansing often involves costly computations such as enumerating pairs of tuples, handling inequality joins, and dealing with user-defined functions. In this paper, we present BigDansing, a Big Data Cleansing system to tackle efficiency, scalability, and ease-of-use issues in data cleansing. The system can run on top of most common general-purpose data processing platforms, ranging from DBMSs to MapReduce-like frameworks. A user-friendly programming interface allows users to express data quality rules both declaratively and procedurally, with no requirement of being aware of the underlying distributed platform. BigDansing translates these rules into a series of transformations that enable distributed computations and several optimizations, such as shared scans and specialized join operators. Experimental results on both synthetic and real datasets show that BigDansing outperforms existing baseline systems by up to more than two orders of magnitude without sacrificing the quality provided by the repair algorithms.

Electronic Archive of Poltava University of Economics and Trade