1,386 research outputs found

PF-OLA: A High-Performance Framework for Parallel On-Line Aggregation

Author: Qin Chengjie
Rusu Florin
Publication date: 20/02/2013

Online aggregation provides estimates to the final result of a computation during the actual processing. The user can stop the computation as soon as the estimate is accurate enough, typically early in the execution. This allows for the interactive data exploration of the largest datasets. In this paper we introduce the first framework for parallel online aggregation in which the estimation virtually does not incur any overhead on top of the actual execution. We define a generic interface to express any estimation model that abstracts completely the execution details. We design a novel estimator specifically targeted at parallel online aggregation. When executed by the framework over a massive 8 TB TPC-H instance, the estimator provides accurate confidence bounds early in the execution even when the cardinality of the final result is seven orders of magnitude smaller than the dataset size and without incurring overhead. Comment: 36 pages

arXiv.org e-Print Archive

eScholarship - University of California

Cost-based Optimization of Multistore Query Plans

Author: Forresi C
Francia M
Gallinucci E
Golfarelli M
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2022

Multistores are data management systems that enable query processing across different and heterogeneous databases; besides the distribution of data, complexity factors like schema heterogeneity and data replication must be resolved through integration and data fusion activities. Our multistore solution relies on a dataspace to provide the user with an integrated view of the available data and enables the formulation and execution of GPSJ queries. In this paper, we propose a technique to optimize the execution of GPSJ queries by formulating and evaluating different execution plans on the multistore. In particular, we outline different strategies to carry out joins and data fusion by relying on different schema representations; then, a self-learning black-box cost model is used to estimate execution times and select the most efficient plan. The experiments assess the effectiveness of the cost model in choosing the best execution plan for the given queries and exploit multiple multistore benchmarks to investigate the factors that influence the performance of different plans.

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

An effective scalable SQL engine for NoSQL databases

Author: A. Abouzeid
E. Meijer
K. Shvachko
L. Lin
M. Armbrust
M. Brantner
M. Rys
M. Stonebraker
P. Hunt
P. Nadkarni
R. Vilaça
S. Ghemawat
S. Harizopoulos
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2013

NoSQL databases were initially devised to support a few concrete extreme scale applications. Since the specificity and scale of the target systems justified the investment of manually crafting application code, their limited query and indexing capabilities were not a major impediment. However, with a considerable number of mature alternatives now available there is an increasing willingness to use NoSQL databases in a wider and more diverse spectrum of applications and, to most of them, hand-crafted query code is not an enticing trade-off. In this paper we address this shortcoming of current NoSQL databases with an effective approach for executing SQL queries while preserving their scalability and schema flexibility. We show how a full-fledged SQL engine can be integrated atop of HBase leading to an ANSI SQL compliant database. Under a standard TPC-C workload our prototype scales linearly with the number of nodes in the system and outperforms a NoSQL TPC-C implementation optimized for HBase.

CiteSeerX

Fast Data in the Era of Big Data: Twitter's Real-Time Related Query Suggestion Architecture

Author: Dalton Jeff
Li Zhenghua
Lin Jimmy
Mishne Gilad
Sharma Aneesh
Publication date: 27/10/2012

We present the architecture behind Twitter's real-time related query suggestion and spelling correction service. Although these tasks have received much attention in the web search literature, the Twitter context introduces a real-time "twist": after significant breaking news events, we aim to provide relevant results within minutes. This paper provides a case study illustrating the challenges of real-time data processing in the era of "big data". We tell the story of how our system was built twice: our first implementation was built on a typical Hadoop-based analytics stack, but was later replaced because it did not meet the latency requirements necessary to generate meaningful real-time results. The second implementation, which is the system deployed in production, is a custom in-memory processing engine specifically designed for the task. This experience taught us that the current typical usage of Hadoop as a "big data" platform, while great for experimentation, is not well suited to low-latency processing, and points the way to future work on data analytics platforms that can handle "big" as well as "fast" data.

arXiv.org e-Print Archive

CiteSeerX

A Survey on Array Storage, Query Languages, and Systems

Author: Cheng Yu
Rusu Florin
Publication date: 19/02/2013

Since scientific investigation is one of the most important providers of massive amounts of ordered data, there is a renewed interest in array data processing in the context of Big Data. To the best of our knowledge, a unified resource that summarizes and analyzes array processing research over its long existence is currently missing. In this survey, we provide a guide for past, present, and future research in array processing. The survey is organized along three main topics. Array storage discusses all the aspects related to array partitioning into chunks. The identification of a reduced set of array operators to form the foundation for an array query language is analyzed across multiple such proposals. Lastly, we survey real systems for array processing. The result is a thorough survey on array data storage and processing that should be consulted by anyone interested in this research topic, independent of experience level. The survey is not complete though. We greatly appreciate pointers towards any work we might have forgotten to mention. Comment: 44 pages

arXiv.org e-Print Archive

CiteSeerX

Architecture and quality in data warehouses - An extended repository approach.

Author: Jarke M.
Jeusfeld M.A.
Quix C.
Vassiliadis P.

Understanding the Benefits of Ontology Use for Australian Industry: A Conceptual Study

Author: Keen Chris D.
Kurnia Sherah
Milton Simon K.
Publication venue: AIS Electronic Library (AISeL)
Publication date: 01/01/2010

In IT, rather than philosophy, an ontology makes explicit the meanings of terms used in domains, or concerning a specific reality, so that people and machines can precisely discuss the meaning of data. Specifically, ontology makes data sharing and analysis easier by making the meaning of data, and of the reality to which the database refers, explicit. Ontology has significant uptake in biomedicine but not yet in industry despite much technical development and reporting of specific successes. This research seeks to determine how and why organisations gain benefits from using ontology, leading to a rigorously tested model of how business gains benefit from ontology use. This research-in-progress paper develops a model explaining the benefit of ontology use to firms and outlines our plans to test the model empirically. The outcome is significant for Australian industry because it will guide the efforts of organisations to use ontology effectively.

AIS Electronic Library (AISeL)