13 research outputs found

    Optimal histograms for limiting worst-case error propagation in the size of join results.

    No full text
    Summarization: Many current relational database systems use some form of histograms to approximate the frequency distribution of values in the attributes of relations and, on this basis, estimate query result sizes and access plan costs. The errors in the histogram approximations directly or transitively affect many estimates derived by the database system. We identify the class of serial histograms and its subclass of end-biased histograms; the latter is of particular interest because such histograms are used in several database systems. We concentrate on equality join queries without function symbols where each relation is joined on the same attribute(s) for all joins in which it participates. Join queries of this restricted type are called t-clique queries. We show that the optimal histogram for reducing the worst-case error in the result size of such a query is always serial. For queries with one join and no function symbols (all of which are vacuously t-clique queries), we present results on finding the optimal serial histogram and the optimal end-biased histogram based on the query characteristics and the frequency distributions of values in the join attributes of the query relations. Finally, we prove that for t-clique queries with a very large number of joins, high-biased histograms (which form a subclass of end-biased histograms) are always optimal. To construct a histogram for the join attribute(s) of a relation, the values in the attribute(s) must first be sorted by their frequency and then assigned to buckets according to the optimality results above. Published in: ACM Transactions on Database Systems.
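
    To make the bucket-construction rule at the end of the abstract concrete, here is a minimal Python sketch (not code from the paper; the function names are illustrative) of a high-biased histogram, a special case of end-biased histograms: values are sorted by frequency, the most frequent values receive exact singleton buckets, and the remaining values share one bucket approximated by their average frequency. It also shows the kind of join-size estimate such histograms feed.

```python
from collections import Counter

def build_high_biased_histogram(values, num_buckets):
    """High-biased (end-biased) histogram sketch: the num_buckets - 1 most
    frequent values keep exact singleton buckets; all remaining values share
    one bucket approximated by their average frequency."""
    ranked = Counter(values).most_common()        # values sorted by descending frequency
    singletons = dict(ranked[:num_buckets - 1])   # exact counts for the top values
    rest = ranked[num_buckets - 1:]
    rest_avg = sum(f for _, f in rest) / len(rest) if rest else 0.0
    return singletons, rest_avg

def estimated_frequency(hist, value):
    singletons, rest_avg = hist
    return singletons.get(value, rest_avg)

# Join-size estimate: sum over candidate join values of the product of the
# estimated frequencies in the two relations' join attributes.
r_values = [1, 1, 1, 2, 2, 3, 4, 5, 5, 5, 5]
s_values = [1, 2, 2, 2, 5, 5, 6]
hr = build_high_biased_histogram(r_values, num_buckets=3)
hs = build_high_biased_histogram(s_values, num_buckets=3)
estimate = sum(estimated_frequency(hr, v) * estimated_frequency(hs, v)
               for v in set(r_values) | set(s_values))
print(f"estimated join size: {estimate:.1f}")     # the actual join size here is 17
```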

    Range partitioning with applications to joins and sorting

    Get PDF
    The volume of data available today is enormous, and it keeps growing by the minute. As a result, the capabilities of a single computing node are often insufficient, so it is useful to distribute the data across clusters of machines; in this way we achieve faster processing and more efficient data retrieval. Several techniques are applied to partition the data. The two most common are Random Partitioning and Hash Partitioning, and we choose between them depending on the kind of queries we want to run on the data. For example, we use Random Partitioning when we want to store data on different storage nodes without caring which values are sent to each node, whereas we use Hash Partitioning when the queries to be executed are equality queries. In this thesis we deal with a different technique that is applied less often in computing systems but is just as effective: Range Partitioning. This technique splits a table according to the range of its values in a specific field. Range Partitioning improves performance in operations such as parallel sorting and interval joins. An integral part of Range Partitioning is histograms, which provide a picture of the distribution of the data values. We implemented the Range Partitioning algorithm with histograms on the EXAREME system of the Madgik Lab, and we experimentally evaluated the algorithm on different data volumes to assess its effectiveness.
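
    As an illustration of the idea (not the actual EXAREME implementation; the helper names are hypothetical), the sketch below derives range-partition boundaries from an equi-depth, histogram-style summary of a sample of the partitioning column and routes rows to partitions by binary search, which is what makes range partitioning suit parallel sorting and interval joins.

```python
import bisect
import random

def split_points_from_sample(sample, num_partitions):
    """Equi-depth-style boundaries: num_partitions - 1 approximate quantiles
    of the partitioning column, taken from a sample of its values."""
    ordered = sorted(sample)
    return [ordered[(i * len(ordered)) // num_partitions]
            for i in range(1, num_partitions)]

def partition_of(value, boundaries):
    """Index of the range partition a value belongs to."""
    return bisect.bisect_right(boundaries, value)

# Example: distribute the rows of a table across 4 partitions by the 'key' column.
random.seed(0)
rows = [{"key": random.randint(0, 999), "payload": i} for i in range(10_000)]
sample_keys = [r["key"] for r in random.sample(rows, 500)]
boundaries = split_points_from_sample(sample_keys, 4)
partitions = [[] for _ in range(4)]
for row in rows:
    partitions[partition_of(row["key"], boundaries)].append(row)

print("boundaries:", boundaries)
print("partition sizes:", [len(p) for p in partitions])
# Sorting each partition locally and concatenating the partitions in order
# yields a globally sorted table.
```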

    Synopsis data structures for massive data sets

    Full text link

    Optimization of Regular Path Queries in Graph Databases

    Get PDF
    Regular path queries offer a powerful navigational mechanism in graph databases. Recently, there has been renewed interest in such queries in the context of the Semantic Web: the extension of SPARQL in version 1.1 with property paths offers a type of regular path query for RDF graph databases. While eminently useful, such queries are difficult to optimize and evaluate efficiently. We design and implement a cost-based optimizer, which we call Waveguide, for SPARQL queries with property paths. Waveguide builds a query plan, which we call a waveplan (WP), that guides the query evaluation. There are numerous choices in the construction of a plan, and a number of optimization methods, so the space of plans for a query can be quite large. Execution costs of plans for the same query can vary by orders of magnitude, with the best plan often offering excellent performance. A WP's costs can be estimated, which opens the way to cost-based optimization. We demonstrate that Waveguide properly subsumes existing techniques and that the new plans it adds are relevant. We analyze the effective plan space enabled by Waveguide and design an efficient enumerator for it. We implement a prototype of a Waveguide cost-based optimizer on top of an open-source relational RDF store. Finally, we perform a comprehensive performance study of the state of the art for the evaluation of SPARQL property paths and demonstrate the significant performance gains that Waveguide offers.
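
    The sketch below is not Waveguide's algorithm; it only illustrates, under simplified assumptions, why plan choice matters for a property path such as `?x :knows+ ?y`: a transitive path can be expanded forward from bound subjects or backward from bound objects, and a crude seed-set comparison stands in for the cardinality-based costing a real cost-based optimizer would perform. All names are illustrative.

```python
from collections import defaultdict

# Illustrative only (these are not Waveguide's plan operators): evaluate a
# transitive property path such as  ?x :knows+ ?y  and let a crude seed-set
# heuristic decide which side of the path to expand from.

def transitive_closure_from(seeds, adjacency):
    """Fixpoint expansion of a transitive path from a set of seed nodes."""
    reached = {s: set() for s in seeds}
    frontier = {s: {s} for s in seeds}
    while any(frontier.values()):
        next_frontier = {s: set() for s in seeds}
        for s in seeds:
            for node in frontier[s]:
                for nxt in adjacency.get(node, ()):
                    if nxt not in reached[s]:
                        reached[s].add(nxt)
                        next_frontier[s].add(nxt)
        frontier = next_frontier
    return {(s, t) for s, targets in reached.items() for t in targets}

def evaluate_knows_plus(edges, subjects=None, objects=None):
    """Expand from whichever side has fewer seeds, a stand-in for the
    cardinality-based costing a real optimizer would perform."""
    fwd, bwd = defaultdict(set), defaultdict(set)
    for s, o in edges:
        fwd[s].add(o)
        bwd[o].add(s)
    subjects = set(subjects) if subjects is not None else set(fwd)
    objects = set(objects) if objects is not None else set(bwd)
    if len(subjects) <= len(objects):
        pairs = transitive_closure_from(subjects, fwd)
    else:
        pairs = {(s, o) for o, s in transitive_closure_from(objects, bwd)}
    return {(s, o) for s, o in pairs if s in subjects and o in objects}

edges = [("a", "b"), ("b", "c"), ("c", "d"), ("e", "b")]
print(evaluate_knows_plus(edges, subjects={"a"}))   # {('a', 'b'), ('a', 'c'), ('a', 'd')}
```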

    Approximate query processing in a data warehouse using random sampling

    Get PDF
    Data analysis routinely consumes large volumes of data. With the fast increase in both the volume of data and the complexity of analytic tasks, data processing becomes more complicated and expensive, and cost efficiency is a key factor in the design and deployment of data warehouse systems. Among the approaches for making big data processing more efficient, approximate query processing, in which a small sample is used to answer the query, is a well-known way to handle massive data. For many applications, a small error is justifiable in exchange for the savings in the resources consumed to answer the query, as well as the reduced latency. We focus on approximate query processing using random sampling in a data warehouse system, including algorithms to draw samples, methods to maintain sample quality, and effective uses of the sample for approximately answering different classes of queries. First, we study different methods of sampling, focusing on stratified sampling optimized for population aggregate queries. Next, we propose sampling algorithms for group-by aggregate queries. Finally, we introduce sampling over the pipeline model of query processing, where multiple queries and tables are involved in order to accomplish complicated tasks. Modern big data analyses routinely involve complex pipelines in which multiple tasks are choreographed to execute queries over their inputs and write the results into their outputs (which, in turn, may be used as inputs for other tasks) in a synchronized dance of gradual data refinement until the final insight is calculated. In a pipeline, unlike in a single query, approximate results are fed into downstream queries, so we see both aggregate computations from sampled input and aggregate computations over approximate input. We propose a sampling-based approximate pipeline processing algorithm that uses unbiased estimation and calculates confidence intervals for the produced approximate results. The key insight of the algorithm is to enrich the output of queries with additional information. This enables the algorithm to piggyback on the modular structure of the pipeline without having to perform any global rewrites, i.e., no extra query or table is added to the pipeline. Compared to the bootstrap method, this approach provides the confidence interval while computing aggregation estimates only once, and it avoids the need to maintain intermediary aggregation distributions. Our empirical study on public and private datasets shows that our sampling algorithm can have significantly (1.4 to 50.0 times) smaller variance than the Neyman algorithm for the optimal sample for population aggregate queries. Our experimental results for group-by queries show that our sampling algorithm outperforms the current state of the art in sample quality and estimation accuracy: the optimal sample yields relative errors that are 5x smaller than competing approaches under the same budget. The experiments for approximate pipeline processing show the high accuracy of the computed estimates, with an average error as low as 2% using only a 1% sample, and demonstrate the usefulness of the confidence interval: at a confidence level of 95%, the computed CI is as tight as +/- 8%, while the actual values fall within the CI boundaries between 70.49% and 95.15% of the time.
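
    As a point of reference for the sampling machinery discussed above, here is a minimal sketch of classic stratified sampling with Neyman allocation and a normal-approximation confidence interval for a population total. It illustrates the baseline the abstract compares against, not the dissertation's optimized algorithms; all names are illustrative.

```python
import math
import random
import statistics

def neyman_allocation(strata, budget):
    """Classic Neyman allocation: sample size per stratum proportional to
    N_h * S_h (stratum size times the standard deviation of the measure)."""
    weights = {g: len(vals) * (statistics.pstdev(vals) or 1e-9)
               for g, vals in strata.items()}
    total = sum(weights.values())
    return {g: max(1, round(budget * w / total)) for g, w in weights.items()}

def stratified_estimate(strata, allocation, z=1.96):
    """Unbiased estimate of the population total with a ~95% confidence interval."""
    total_est, var_est = 0.0, 0.0
    for g, vals in strata.items():
        n_h, N_h = min(allocation[g], len(vals)), len(vals)
        sample = random.sample(vals, n_h)
        mean = statistics.fmean(sample)
        var = statistics.variance(sample) if n_h > 1 else 0.0
        total_est += N_h * mean
        var_est += (N_h ** 2) * var / n_h * (1 - n_h / N_h)   # finite-population correction
    half_width = z * math.sqrt(var_est)
    return total_est, (total_est - half_width, total_est + half_width)

# Example: estimate SUM(amount) from a ~1% sample of a synthetic table
# stratified by 'region'.
random.seed(1)
table = [{"region": r, "amount": random.gauss(100 * (i + 1), 20)}
         for i, r in enumerate("ABCD") for _ in range(25_000)]
strata = {}
for row in table:
    strata.setdefault(row["region"], []).append(row["amount"])
alloc = neyman_allocation(strata, budget=1_000)
estimate, ci = stratified_estimate(strata, alloc)
exact = sum(sum(vals) for vals in strata.values())
print(f"estimated total: {estimate:,.0f}   exact total: {exact:,.0f}")
print(f"95% CI: ({ci[0]:,.0f}, {ci[1]:,.0f})")
```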

    Exploring run-time reduction in programming codes via query optimization and caching

    Get PDF
    Object-oriented programming languages have raised the level of abstraction by supporting explicit first-class query constructs in program code. These query constructs allow programmers to express operations on collections more abstractly than by realizing them in loops or through provided libraries. Join optimization techniques from the field of database technology support efficient realizations of such language constructs. However, existing techniques such as query optimization in the Java Query Language (JQL) incur run-time overhead. Besides programming languages supporting first-class query constructs, the use of annotations has also increased in the software engineering community recently. Annotations are a common means of providing metadata in source code: object-oriented languages such as C# provide attribute constructs, and Java has its own annotation constructs, which allow developers to include metadata in program code. This work introduces a series of query optimization approaches to reduce the run time of programs involving explicit queries over collections. The proposed approaches rely on histograms to estimate the selectivity of the predicates and joins in order to construct query plans. Annotations in the source code are also used to gather the metadata required for selectivity estimation of numerical as well as string-valued predicates and joins in the queries. Several cache heuristics are proposed that effectively cache the results of repeated queries in program code. The cached query results are incrementally maintained up to date after update operations on the collections --Abstract, page iv
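
    A minimal sketch of the two ingredients described above, assuming plain in-memory collections: a small equi-width histogram used to estimate how selective a predicate is before running a query over a collection, and a cache of query results keyed by collection versions so that cached entries are bypassed after updates. Class and method names are illustrative and not taken from the work.

```python
from collections import defaultdict

class EquiWidthHistogram:
    """Tiny equi-width histogram over a numeric field, used to estimate how
    selective a range predicate is before deciding how to run a query."""
    def __init__(self, values, buckets=10):
        self.lo, self.n = min(values), len(values)
        self.width = (max(values) - self.lo) / buckets or 1.0
        self.counts = [0] * buckets
        for v in values:
            self.counts[min(buckets - 1, int((v - self.lo) / self.width))] += 1

    def selectivity_ge(self, threshold):
        """Estimated fraction of values >= threshold (whole buckets counted)."""
        first = max(0, int((threshold - self.lo) / self.width))
        return min(1.0, sum(self.counts[first:]) / self.n)

class CachingQueryRunner:
    """Caches query results keyed by (query name, versions of the collections
    it reads); bumping a collection's version makes stale entries unreachable."""
    def __init__(self):
        self.versions = defaultdict(int)
        self.cache = {}

    def updated(self, collection_name):           # call after mutating a collection
        self.versions[collection_name] += 1

    def run(self, name, depends_on, compute):
        key = (name, tuple(self.versions[d] for d in depends_on))
        if key not in self.cache:
            self.cache[key] = compute()
        return self.cache[key]

# Example: a query over two collections, with a selectivity estimate for its filter.
orders = [{"customer_id": i % 50, "total": float(i % 300)} for i in range(3_000)]
customers = [{"customer_id": i, "vip": i % 7 == 0} for i in range(50)]
hist = EquiWidthHistogram([o["total"] for o in orders])
runner = CachingQueryRunner()

def vip_big_orders():
    return [(o, c) for o in orders if o["total"] >= 250
            for c in customers if c["customer_id"] == o["customer_id"] and c["vip"]]

result = runner.run("vip_big_orders", ["orders", "customers"], vip_big_orders)
print(f"estimated filter selectivity: {hist.selectivity_ge(250):.2f}, result rows: {len(result)}")
runner.updated("orders")                          # after an update, the next run recomputes
```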

    Clustering-Initialized Adaptive Histograms and Probabilistic Cost Estimation for Query Optimization

    Get PDF
    An assumption with self-tuning histograms has been that they can "learn" the dataset if given enough training queries. We show that this is not the case with current approaches: the quality of the histogram depends on the initial configuration. Starting with a few good buckets can improve the efficiency of learning; without this, the histogram is likely to stagnate, i.e., converge to a bad configuration and stop learning. We also present a probabilistic cost estimation model.
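
    To illustrate the initialization idea (the abstract does not spell out the exact method, so this is only one plausible realization, not the paper's algorithm), the sketch below seeds bucket boundaries with a simple one-dimensional k-means clustering over a sample of the column and then refines bucket frequencies from query feedback. All names are hypothetical.

```python
import bisect
import random

def kmeans_1d(sample, k, iters=20):
    """Plain 1-D k-means used only to place the initial bucket centers."""
    centers = sorted(random.sample(sorted(set(sample)), k))
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in sample:
            clusters[min(range(k), key=lambda i: abs(v - centers[i]))].append(v)
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return sorted(centers)

class SelfTuningHistogram:
    """Buckets seeded from clustered centers; bucket frequencies are then
    refined from query feedback (true cardinalities observed at run time)."""
    def __init__(self, sample, k=8, total_rows=None):
        centers = kmeans_1d(sample, k)
        self.bounds = [(a + b) / 2 for a, b in zip(centers, centers[1:])]
        self.freqs = [(total_rows or len(sample)) / k] * k

    def _buckets(self, lo, hi):
        return range(bisect.bisect_left(self.bounds, lo),
                     bisect.bisect_right(self.bounds, hi) + 1)

    def estimate(self, lo, hi):
        return sum(self.freqs[b] for b in self._buckets(lo, hi))

    def feedback(self, lo, hi, actual, rate=0.5):
        """Scale the buckets touched by a range query toward its true cardinality."""
        touched = list(self._buckets(lo, hi))
        est = sum(self.freqs[b] for b in touched)
        for b in touched:
            self.freqs[b] += rate * (actual - est) * (self.freqs[b] / est)

# Example: skewed data; feedback from "training" range queries refines the estimates.
random.seed(2)
data = [random.gauss(50, 5) for _ in range(8_000)] + [random.uniform(0, 200) for _ in range(2_000)]
hist = SelfTuningHistogram(random.sample(data, 400), k=8, total_rows=len(data))
for lo, hi in [(40, 60), (0, 30), (100, 200)]:
    actual = sum(lo <= v <= hi for v in data)
    print(f"[{lo}, {hi}]  estimate: {hist.estimate(lo, hi):.0f}  actual: {actual}")
    hist.feedback(lo, hi, actual)
```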