PF-OLA: A High-Performance Framework for Parallel On-Line Aggregation

Authors: Qin Chengjie, Rusu Florin
Publication date: 20/02/2013

Online aggregation provides estimates to the final result of a computation during the actual processing. The user can stop the computation as soon as the estimate is accurate enough, typically early in the execution. This allows for the interactive data exploration of the largest datasets. In this paper we introduce the first framework for parallel online aggregation in which the estimation virtually does not incur any overhead on top of the actual execution. We define a generic interface to express any estimation model that abstracts completely the execution details. We design a novel estimator specifically targeted at parallel online aggregation. When executed by the framework over a massive 8TB TPC-H instance, the estimator provides accurate confidence bounds early in the execution even when the cardinality of the final result is seven orders of magnitude smaller than the dataset size and without incurring overhead.

Comment: 36 pages
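The idea of reporting an estimate with confidence bounds while the scan is still running can be illustrated with a minimal single-node sketch. This is a textbook sampling estimator for SUM, not the parallel estimator or the framework interface described in the abstract; the function name `online_sum_estimates`, the reporting interval, and the normal-approximation bounds are all assumptions made for illustration.

```python
import math
import random


def online_sum_estimates(values, report_every=1000, z=1.96, seed=0):
    """Yield (rows_seen, estimate, lower, upper) while scanning `values`.

    Hypothetical helper: rows are visited in random order, so every prefix
    of the scan is a uniform sample drawn without replacement.
    """
    n_total = len(values)
    order = list(range(n_total))
    random.Random(seed).shuffle(order)

    count, mean, m2 = 0, 0.0, 0.0  # Welford running mean and squared deviations
    for idx in order:
        x = values[idx]
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)

        if count % report_every == 0 or count == n_total:
            var = m2 / (count - 1) if count > 1 else 0.0
            # Finite population correction: the interval collapses to zero
            # once every row has been seen.
            fpc = (n_total - count) / (n_total - 1) if n_total > 1 else 0.0
            half_width = z * n_total * math.sqrt(var / count * fpc)
            estimate = n_total * mean
            yield count, estimate, estimate - half_width, estimate + half_width


if __name__ == "__main__":
    data = list(range(10_000))
    for seen, est, lo, hi in online_sum_estimates(data):
        print(f"{seen:>6} rows: {est:,.0f} in [{lo:,.0f}, {hi:,.0f}]")
```

A caller can stop iterating as soon as the interval is narrow enough, which mirrors the early-termination behaviour the abstract describes.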

arXiv.org e-Print Archive

Crossref

eScholarship - University of California

Parallelizing Windowed Stream Joins in a Shared-Nothing Cluster

Authors: Chakraborty Abhirup, Singh Ajit
Publication date: 24/07/2013

The availability of a large number of processing nodes in a parallel and distributed computing environment enables sophisticated real-time processing over high-speed data streams, as required by many emerging applications. Sliding window stream joins are among the most important operators in a stream processing system. In this paper, we consider the issue of parallelizing a sliding window stream join operator over a shared-nothing cluster. We propose a framework, based on a fixed or predefined communication pattern, to distribute the join processing loads over the shared-nothing cluster. We consider various overheads while scaling over a large number of nodes, and propose solution methodologies to cope with the issues. We implement the algorithm over a cluster using a message passing system, and present the experimental results showing the effectiveness of the join processing algorithm.

Comment: 11 pages
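For readers unfamiliar with the operator being parallelized, the following is a minimal single-node sketch of a time-based sliding window equi-join. It does not reproduce the paper's distribution framework or communication pattern; the class name `WindowedJoin`, the tuple layout, and the assumption that timestamps arrive in non-decreasing order are all hypothetical.

```python
from collections import defaultdict, deque


class WindowedJoin:
    """Symmetric hash join over two streams with a time-based window.

    Assumes timestamps are non-decreasing across both streams.
    """

    def __init__(self, window):
        self.window = window
        # Per stream: arrival-ordered queue plus a hash index on the join key.
        self.queues = (deque(), deque())
        self.index = (defaultdict(deque), defaultdict(deque))

    def _expire(self, side, now):
        queue, index = self.queues[side], self.index[side]
        while queue and queue[0][0] <= now - self.window:
            _, key, _ = queue.popleft()
            # Arrival order is global, so the oldest tuple overall is also
            # the oldest tuple stored under its key.
            index[key].popleft()
            if not index[key]:
                del index[key]

    def insert(self, side, ts, key, payload):
        """Add a tuple to stream `side` (0 or 1) and return new join results."""
        other = 1 - side
        self._expire(0, ts)
        self._expire(1, ts)
        partners = self.index[other].get(key, ())
        if side == 0:
            matches = [(payload, p) for _, p in partners]
        else:
            matches = [(p, payload) for _, p in partners]
        self.queues[side].append((ts, key, payload))
        self.index[side][key].append((ts, payload))
        return matches
```

Each arriving tuple probes the opposite stream's window and is then stored in its own, so every matching pair is emitted exactly once, when its later tuple arrives.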

arXiv.org e-Print Archive

Crossref

A Generalization of Band Joins and the Merge-Purge Problem

Author: Hernandez Mauricio A.
Publication venue: Columbia University Libraries/Information Services
Publication date: 01/01/1995

The problem of merging multiple databases of information about common entities is frequently encountered in large commercial and government organizations. The problem we study is often called the Merge/Purge problem and is difficult to solve both in scale and accuracy. Large repositories of data always have numerous duplicate information entries about the same entities that are difficult to cull together without an intelligent "equational theory" that identifies equivalent items by a complex, domain-dependent matching process. We have developed a system for accomplishing this task for lists of names of potential customers in a direct marketing-type application. Our results for statistically generated data are shown to be accurate and effective when processing the data multiple times using different keys for sorting. The system provides a rule programming module that is easy to program and quite good at finding duplicates, especially in an environment with massive amounts of data.
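The abstract mentions processing the data several times with different sort keys and a rule-based matching step. One way such a multi-pass scheme could look is sketched below: each pass sorts by one key, compares only records that fall within a small window of each other, and merges matches transitively. The function name `multipass_dedupe`, the window size, and the toy matching rule are assumptions for illustration and are not taken from the system described.

```python
def multipass_dedupe(records, key_funcs, is_match, window=3):
    """Group duplicate records; returns clusters of record indices.

    Hypothetical sketch: one sort-and-scan pass per key function, with
    matches from all passes combined through a union-find structure.
    """
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for key in key_funcs:
        order = sorted(range(len(records)), key=lambda i: key(records[i]))
        for pos, i in enumerate(order):
            # Compare only with the previous (window - 1) records in sort order.
            for j in order[max(0, pos - window + 1):pos]:
                if is_match(records[i], records[j]):
                    parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


if __name__ == "__main__":
    people = [
        ("John", "Smith", "10 Main St"),
        ("Jon", "Smith", "10 Main St"),
        ("Mary", "Jones", "5 Oak Ave"),
        ("John", "Smyth", "10 Main St"),
    ]

    def same_person(a, b):
        # Toy rule standing in for a domain-dependent "equational theory".
        return a[2] == b[2] and (a[0] == b[0] or a[1] == b[1])

    keys = [lambda r: r[1], lambda r: r[2]]
    print(multipass_dedupe(people, keys, same_person))
```

Running several passes with different keys lets a small window catch duplicates that a single sort order would place far apart.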

Columbia University Academic Commons

Engineering Aggregation Operators for Relational In-Memory Database Systems

Author: Müller Ingo
Publication venue: KIT-Bibliothek, Karlsruhe
Publication date: 01/01/2016

In this thesis we study the design and implementation of aggregation operators in the context of relational in-memory database systems. In particular, we identify and address the following challenges: cache-efficiency, CPU-friendliness, parallelism within and across processors, robust handling of skewed data, adaptive processing, processing with constrained memory, and integration with modern database architectures. Our resulting algorithm outperforms the state of the art by up to 3.7x.

KITopen

Cloud-Scale Entity Resolution: Current State and Open Challenges

Authors: Eike Schallehn, Gunter Saake, Xiao Chen
Publication venue: RonPub
Publication date: 01/01/2018

Entity resolution (ER) is the process of identifying records in information systems that refer to the same real-world entity. Because data volumes have grown so large over the past two decades, parallel techniques are called upon to satisfy the ER requirements of high performance and scalability. The development of parallel ER has reached a relatively prosperous stage and has found its way into several applications. In this work, we first comprehensively survey the state of the art of parallel ER approaches. From this overview, we then extract classification criteria for parallel ER, and classify and compare the approaches based on these criteria. Finally, we identify open research questions and challenges, and discuss potential solutions and directions for further research in this field.

RonPub -- Research Online Publishing

10381 Summary and Abstracts Collection -- Robust Query Processing

Authors: Kuno Harumi Anne, Markl Volker, Sattler Kai-Uwe
Publication venue: Dagstuhl Seminar Proceedings. 10381 - Robust Query Processing
Publication date: 01/01/2011

Dagstuhl seminar 10381 on robust query processing (held 19.09.10 - 24.09.10) brought together a diverse set of researchers and practitioners with a broad range of expertise for the purpose of fostering discussion and collaboration regarding causes, opportunities, and solutions for achieving robust query processing. The seminar strove to build a unified view across the loosely-coupled system components responsible for the various stages of database query processing. Participants were chosen for their experience with database query processing and, where possible, their prior work in academic research or in product development towards robustness in database query processing. In order to pave the way to motivate, measure, and protect future advances in robust query processing, seminar 10381 focused on developing tests for measuring the robustness of query processing. In these proceedings, we first review the seminar topics, goals, and results, then present abstracts or notes of some of the seminar break-out sessions. We also include, as an appendix, the robust query processing reading list that was collected and distributed to participants before the seminar began, as well as summaries of a few of those papers that were contributed by some participants.

Dagstuhl Research Online Publication Server

Enhanced Merge Sort- A New Approach to the Merging Process

Authors: Haradome Hiroki, Ichikawa Tomoaki, Kondo Fukuo, Kondo Hiroshi, Kwee Thomas C., Morisaka Hiroyuki, Sano Keiji, Sugitani Masahiko, Takayama Tadatoshi, Toda Yusuke, Unno Toshiyuki
Publication venue: The Author(s). Published by Elsevier B.V.
Publication date: 01/01/2016

One of the major fundamental issues of Computer Science is the arrangement of elements in the database. The efficiency of the sorting algorithms is to optimize the importance of other sorting algorithms [11]. The optimality of these sorting algorithms is judged while calculating their time and space complexities [12]. The idea behind this paper is to modify the conventional Merge Sort algorithm and to present a new method with reduced execution time. The newly proposed algorithm is faster than the conventional Merge Sort algorithm having a time complexity of O(n log2 n). The proposed algorithm has been tested, implemented, and compared, and the experimental results are promising.
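As a point of reference for the baseline the abstract compares against, here is a minimal sketch of the conventional top-down merge sort. It is the standard textbook algorithm, not the modified version proposed in the paper, whose details the abstract does not give.

```python
def merge_sort(items):
    """Return a new sorted list using conventional top-down merge sort."""
    if len(items) <= 1:
        return list(items)

    mid = len(items) // 2
    left = merge_sort(items[:mid])
    right = merge_sort(items[mid:])

    # Merge step: repeatedly take the smaller head of the two sorted halves.
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:  # <= keeps equal elements in original order
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


if __name__ == "__main__":
    print(merge_sort([5, 2, 9, 1, 5, 6]))
```

The recursion halves the input at each level and the merge step is linear, which gives the O(n log2 n) running time quoted for the conventional algorithm.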

Elsevier - Publisher Connector

Crossref

Proceedings - University of Groningen

University of Groningen

ARTS repository - University of Groningen

Dissertations of the University of Groningen