969 research outputs found

    Forecasting the cost of processing multi-join queries via hashing for main-memory databases (Extended version)

    Full text link
    Database management systems (DBMSs) carefully optimize complex multi-join queries to avoid expensive disk I/O. As servers today feature tens or hundreds of gigabytes of RAM, a significant fraction of many analytic databases becomes memory-resident. Even after careful tuning for an in-memory environment, a linear disk I/O model such as the one implemented in PostgreSQL may make query response time predictions that are up to 2X slower than the optimal multi-join query plan over memory-resident data. This paper introduces a memory I/O cost model to identify good evaluation strategies for complex query plans with multiple hash-based equi-joins over memory-resident data. The proposed cost model is carefully validated for accuracy using three different systems, including an Amazon EC2 instance, to control for hardware-specific differences. Prior work in parallel query evaluation has advocated right-deep and bushy trees for multi-join queries due to their greater parallelization and pipelining potential. A surprising finding is that the conventional wisdom from shared-nothing disk-based systems does not directly apply to the modern shared-everything memory hierarchy. As corroborated by our model, the performance gap between the optimal left-deep and right-deep query plan can grow to about 10X as the number of joins in the query increases.Comment: 15 pages, 8 figures, extended version of the paper to appear in SoCC'1

    Robust Query Optimization Methods With Respect to Estimation Errors: A Survey

    Get PDF
    International audienceThe quality of a query execution plan chosen by a Cost-Based Optimizer (CBO) depends greatly on the estimation accuracy of input parameter values. Many research results have been produced on improving the estimation accuracy, but they do not work for every situation. Therefore, "robust query optimization" was introduced, in an effort to minimize the sub-optimality risk by accepting the fact that estimates could be inaccurate. In this survey, we aim to provide an overview of robust query optimization methods by classifying them into different categories, explaining the essential ideas, listing their advantages and limitations, and comparing them with multiple criteria

    Emergent relational schemas for RDF

    Get PDF

    When Two Choices Are not Enough: Balancing at Scale in Distributed Stream Processing

    Full text link
    Carefully balancing load in distributed stream processing systems has a fundamental impact on execution latency and throughput. Load balancing is challenging because real-world workloads are skewed: some tuples in the stream are associated to keys which are significantly more frequent than others. Skew is remarkably more problematic in large deployments: more workers implies fewer keys per worker, so it becomes harder to "average out" the cost of hot keys with cold keys. We propose a novel load balancing technique that uses a heaving hitter algorithm to efficiently identify the hottest keys in the stream. These hot keys are assigned to d≥2d \geq 2 choices to ensure a balanced load, where dd is tuned automatically to minimize the memory and computation cost of operator replication. The technique works online and does not require the use of routing tables. Our extensive evaluation shows that our technique can balance real-world workloads on large deployments, and improve throughput and latency by 150%\mathbf{150\%} and 60%\mathbf{60\%} respectively over the previous state-of-the-art when deployed on Apache Storm.Comment: 12 pages, 14 Figures, this paper is accepted and will be published at ICDE 201

    Using Deep Learning for Big Spatial Data Partitioning

    Get PDF
    This article explores the use of deep learning to choose an appropriate spatial partitioning technique for big data. The exponential increase in the volumes of spatial datasets resulted in the development of big spatial data frameworks. These systems need to partition the data across machines to be able to scale out the computation. Unfortunately, there is no current method to automatically choose an appropriate partitioning technique based on the input data distribution. This article addresses this problem by using deep learning to train a model that captures the relationship between the data distribution and the quality of the partitioning techniques.We propose a solution that runs in two phases, training and application. The offline training phase generates synthetic data based on diverse distributions, partitions them using six different partitioning techniques, and measures their quality using four quality metrics. At the same time, it summarizes the datasets using a histogram and well-designed skewness measures. The data summaries and the quality metrics are then use to train a deep learning model. The second phase uses this model to predict the best partitioning technique given a new dataset that needs to be partitioned.We run an extensive experimental evaluation on big spatial data, andwe experimentally showthe applicability of the proposed technique.We showthat the proposed model outperforms the baseline method in terms of accuracy for choosing the best partitioning technique by only analyzing the summary of the datasets

    Distributed Caching for Complex Querying of Raw Arrays

    Get PDF
    As applications continue to generate multi-dimensional data at exponentially increasing rates, fast analytics to extract meaningful results is becoming extremely important. The database community has developed array databases that alleviate this problem through a series of techniques. In-situ mechanisms provide direct access to raw data in the original format---without loading and partitioning. Parallel processing scales to the largest datasets. In-memory caching reduces latency when the same data are accessed across a workload of queries. However, we are not aware of any work on distributed caching of multi-dimensional raw arrays. In this paper, we introduce a distributed framework for cost-based caching of multi-dimensional arrays in native format. Given a set of files that contain portions of an array and an online query workload, the framework computes an effective caching plan in two stages. First, the plan identifies the cells to be cached locally from each of the input files by continuously refining an evolving R-tree index. In the second stage, an optimal assignment of cells to nodes that collocates dependent cells in order to minimize the overall data transfer is determined. We design cache eviction and placement heuristic algorithms that consider the historical query workload. A thorough experimental evaluation over two real datasets in three file formats confirms the superiority -- by as much as two orders of magnitude -- of the proposed framework over existing techniques in terms of cache overhead and workload execution time
    • …
    corecore