    QuickSel: Quick Selectivity Learning with Mixture Models

    Estimating the selectivity of a query is a key step in almost any cost-based query optimizer. Most of today's databases rely on histograms or samples that are periodically refreshed by re-scanning the data as the underlying data changes. Since frequent scans are costly, these statistics are often stale and lead to poor selectivity estimates. As an alternative to scans, query-driven histograms have been proposed, which refine the histograms based on the actual selectivities of the observed queries. Unfortunately, these approaches are either too costly to use in practice (i.e., they require an exponential number of buckets) or quickly lose their advantage as they observe more queries. In this paper, we propose a selectivity learning framework, called QuickSel, which falls into the query-driven paradigm but does not use histograms. Instead, it builds an internal model of the underlying data, which can be refined significantly faster (e.g., only 1.9 milliseconds for 300 queries). This fast refinement allows QuickSel to continuously learn from each query and yield increasingly accurate selectivity estimates over time. Unlike query-driven histograms, QuickSel relies on a mixture model and a new optimization algorithm for training its model. Our extensive experiments on two real-world datasets confirm that, given the same target accuracy, QuickSel is 34.0x-179.4x faster than state-of-the-art query-driven histograms, including ISOMER and STHoles. Further, given the same space budget, QuickSel is 26.8%-91.8% more accurate than periodically-updated histograms and samples, respectively.
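
To make the idea of answering a selectivity query from a mixture model concrete, the following is a minimal sketch and not QuickSel's actual implementation. It assumes a one-dimensional attribute, uniform mixture components, and hand-picked weights; the names `UniformComponent`, `overlap_fraction`, and `estimate_selectivity` are hypothetical, and the training step (the paper's optimization algorithm) is omitted entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UniformComponent:
    """One mixture component: a uniform density on [lo, hi) with a mixing weight."""
    lo: float
    hi: float
    weight: float


def overlap_fraction(comp: UniformComponent, q_lo: float, q_hi: float) -> float:
    """Fraction of the component's probability mass inside the query range."""
    overlap = min(comp.hi, q_hi) - max(comp.lo, q_lo)
    return max(0.0, overlap) / (comp.hi - comp.lo)


def estimate_selectivity(components, q_lo: float, q_hi: float) -> float:
    """Selectivity of `q_lo <= x < q_hi` as the weighted sum of overlap fractions."""
    return sum(c.weight * overlap_fraction(c, q_lo, q_hi) for c in components)


# Illustrative model only: the weights are assumed, not learned from queries.
MODEL = [
    UniformComponent(0.0, 10.0, 0.5),
    UniformComponent(5.0, 15.0, 0.3),
    UniformComponent(10.0, 20.0, 0.2),
]

if __name__ == "__main__":
    print(estimate_selectivity(MODEL, 5.0, 10.0))
```

In a query-driven setting, the weights would be refitted whenever an executed query reveals its true selectivity; how that refit is done is exactly what the abstract attributes to the new optimization algorithm, and it is not reproduced here.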

    FactorJoin: A New Cardinality Estimation Framework for Join Queries

    Cardinality estimation is one of the most fundamental and challenging problems in query optimization. Neither classical nor learning-based methods yield satisfactory performance when estimating the cardinality of join queries. They either rely on simplified assumptions, leading to ineffective cardinality estimates, or build large models to understand the data distributions, leading to long planning times and a lack of generalizability across queries. In this paper, we propose a new framework, FactorJoin, for estimating join queries. FactorJoin combines the idea behind the classical join-histogram method, to efficiently handle joins, with the learning-based methods, to accurately capture attribute correlation. Specifically, FactorJoin scans every table in a DB and builds single-table conditional distributions during an offline preparation phase. When a join query arrives, FactorJoin translates it into a factor graph model over the learned distributions to effectively and efficiently estimate its cardinality. Unlike existing learning-based methods, FactorJoin does not need to de-normalize joins upfront or require executed query workloads to train the model. Since it only relies on single-table statistics, FactorJoin has small space overhead and is extremely easy to train and maintain. In our evaluation, FactorJoin can produce more effective estimates than the previous state-of-the-art learning-based methods, with 40x less estimation latency, 100x smaller model size, and 100x faster training speed at comparable or better accuracy. In addition, FactorJoin can estimate 10,000 sub-plan queries within one second to optimize the query plan, which is very close to the traditional cardinality estimators in commercial DBMS. Comment: Paper accepted by SIGMOD 202
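
The classical join-histogram idea that the abstract builds on can be sketched as follows. This is an illustration of the textbook per-bin estimate only, not FactorJoin's factor-graph inference: it assumes integer join keys, equi-width bins, and uniformity within each bin, and the function names (`build_histogram`, `estimate_join_size`, `exact_join_size`) are hypothetical.

```python
from collections import Counter


def build_histogram(keys, bin_width):
    """Per-bin (row count, distinct-value count) for a column of integer join keys."""
    rows = Counter()
    distinct = {}
    for k in keys:
        b = k // bin_width
        rows[b] += 1
        distinct.setdefault(b, set()).add(k)
    return {b: (rows[b], len(distinct[b])) for b in rows}


def estimate_join_size(hist_a, hist_b):
    """Equi-join size estimate: per bin, rows_a * rows_b / max(ndv_a, ndv_b)."""
    total = 0.0
    for b, (rows_a, ndv_a) in hist_a.items():
        if b in hist_b:
            rows_b, ndv_b = hist_b[b]
            total += rows_a * rows_b / max(ndv_a, ndv_b)
    return total


def exact_join_size(keys_a, keys_b):
    """Ground truth for comparison: sum over keys of the product of frequencies."""
    count_a, count_b = Counter(keys_a), Counter(keys_b)
    return sum(n * count_b[k] for k, n in count_a.items())


if __name__ == "__main__":
    a = [1, 1, 2, 3, 10, 11]
    b = [1, 2, 2, 12]
    est = estimate_join_size(build_histogram(a, 10), build_histogram(b, 10))
    print(est, exact_join_size(a, b))
```

Only single-table statistics are needed to build each histogram, which is the property the abstract highlights; the per-bin uniformity assumption is what makes the estimate differ from the exact count.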

    The RDF-3X Engine for Scalable Management of RDF Data

    RDF is a data model for schema-free structured information that is gaining momentum in the context of Semantic-Web data, life sciences, and also Web 2.0 platforms. The "pay-as-you-go" nature of RDF and the flexible pattern-matching capabilities of its query language SPARQL entail efficiency and scalability challenges for complex queries including long join paths. This paper presents the RDF-3X engine, an implementation of SPARQL that achieves excellent performance by pursuing a RISC-style architecture with streamlined indexing and query processing. The physical design is identical for all RDF-3X databases regardless of their workloads, and completely eliminates the need for index tuning by exhaustive indexes for all permutations of subject-property-object triples and their binary and unary projections. These indexes are highly compressed, and the query processor can aggressively leverage fast merge joins with excellent performance of processor caches. The query optimizer is able to choose optimal join orders even for complex queries, with a cost model that includes statistical synopses for entire join paths. Although RDF-3X is optimized for queries, it also provides good support for efficient online updates by means of a staging architecture: direct updates to the main database indexes are deferred, and instead applied to compact differential indexes which are later merged into the main indexes in a batched manner. Experimental studies with several large-scale datasets with more than 50 million RDF triples and benchmark queries that include pattern matching, many-way star-joins, and long path-joins demonstrate that RDF-3X can outperform the previously best alternatives by one or two orders of magnitude.
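
The exhaustive-indexing idea (one sorted index per permutation of subject, property, and object) can be illustrated with a small in-memory sketch. It is not the RDF-3X engine: it assumes plain sorted Python lists instead of compressed B+-tree indexes, omits the binary and unary projections, and uses the hypothetical helper names `build_indexes` and `lookup`.

```python
from bisect import bisect_left
from itertools import permutations

# The six orderings of (s)ubject, (p)roperty, (o)bject: spo, sop, pso, pos, osp, ops.
ORDERS = ["".join(p) for p in permutations("spo")]


def build_indexes(triples):
    """Store every triple once per ordering, each copy kept in sorted order."""
    indexes = {}
    for order in ORDERS:
        positions = ["spo".index(c) for c in order]
        indexes[order] = sorted(tuple(t[i] for i in positions) for t in triples)
    return indexes


def lookup(indexes, s=None, p=None, o=None):
    """Answer a triple pattern with one prefix range scan on a suitable index."""
    bound = {"s": s, "p": p, "o": o}
    fixed = [c for c in "spo" if bound[c] is not None]
    free = [c for c in "spo" if bound[c] is None]
    order = "".join(fixed + free)          # bound positions come first
    prefix = tuple(bound[c] for c in fixed)
    index = indexes[order]
    results = []
    for row in index[bisect_left(index, prefix):]:
        if row[:len(prefix)] != prefix:
            break                          # left the contiguous matching range
        named = dict(zip(order, row))
        results.append((named["s"], named["p"], named["o"]))
    return results


TRIPLES = [
    ("alice", "knows", "bob"),
    ("alice", "likes", "carol"),
    ("bob", "knows", "carol"),
]

if __name__ == "__main__":
    idx = build_indexes(TRIPLES)
    print(lookup(idx, p="knows"))
    print(lookup(idx, s="alice", o="carol"))
```

Because some ordering always places the bound positions first, any triple pattern becomes a contiguous range scan, and the results arrive sorted, which is what makes merge joins on the scans cheap. The sixfold storage cost is what the compression mentioned in the abstract is meant to offset.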

    Direct and constructivist approaches for the design of instruction in well-structured domains: a comparison of efficiency via mental workload and performance.

    This doctoral research investigates the efficiency of two instructional designs: a design based on the direct-instruction approach to learning and its extension with a collaborative activity based upon the community of inquiry approach to learning. This is motivated by the educational challenge associated with the improvement of the learning phase. The goal is to investigate the extent to which highly guided communities of inquiry, when added to direct-instruction teaching methods, can actually improve the efficiency of learners. A total of 577 students participated in the experiments across 24 third-level classes that were divided into two groups. A control group of learners attended a delivery based on direct instructional guidelines only, while an experimental group received the same delivery (in equal conditions) extended through a collaborative and inquiring design. Subsequently, learners of each group individually answered a multiple-choice questionnaire (MCQ), from which a performance measure was extracted for the evaluation of the acquired factual, conceptual and procedural knowledge. Two measures of cognitive load (CL) were acquired through self-reporting questionnaires: one unidimensional and one multidimensional. These, in conjunction with the performance measure, contributed to the definition of three measures of efficiency. Statistical evidence shows a positive impact of the experimental layout on the efficiency scores of students, as a consequence of its improvement across three phases: tuning, experimental and refined. The minor contribution to the body of knowledge is a replicable primary research that requalifies an inquiry activity technique, usually employed at primary and secondary levels, as well as other ill-structured domains, in better-structured domains within third-level education. This contribution is connected to a major one that lies in the example of the complementarity between cognitivist direct instructional techniques and social constructivist approaches to teaching and to learning, rather than in the example of their individual, distinct and competitive uses.

    Characterization of a big data storage workload in the cloud

    The proliferation of big data processing platforms has led to radically different system designs, such as MapReduce and the newer Spark. Understanding the workloads of such systems facilitates tuning and could foster new designs. However, whereas MapReduce workloads have been characterized extensively, relatively little public knowledge exists about the characteristics of Spark workloads in representative environments. To address this problem, in this work we collect and analyze a 6-month Spark workload from a major provider of big data processing services, Databricks. Our analysis focuses on a number of key features, such as the long-term trends of reads and modifications, the statistical properties of reads, and the popularity of clusters and of file formats. Overall, we present numerous findings that could form the basis of new systems studies and designs. Our quantitative evidence and its analysis suggest the existence of daily and weekly load imbalances, of heavy-tailed and bursty behaviour, of the relative rarity of modifications, and of proliferation of big data specific formats

    Learning workload behaviour models from monitored time-series for resource estimation towards data center optimization

    In recent years there has been an extraordinary growth in the demand for Cloud Computing resources executed in Data Centers. Modern Data Centers are complex systems that need management. As distributed computing systems grow, and workloads benefit from such computing environments, the management of such systems increases in complexity. The complexity of resource usage and power consumption on cloud-based applications makes the understanding of application behavior through expert examination difficult. The difficulty increases when applications are seen as "black boxes", where only external monitoring can be retrieved. Furthermore, given the wide variety of scenarios and applications, automation is required. To deal with such complexity, Machine Learning methods become crucial to facilitate tasks that can be automatically learned from data. Firstly, this thesis proposes an unsupervised learning technique to learn high-level representations from workload traces. This technique provides a fast methodology to characterize workloads as sequences of abstract phases. The learned phase representation is validated on a variety of datasets and used in an auto-scaling task where we show that it can be applied in a production environment, achieving better performance than other state-of-the-art techniques. Secondly, this thesis proposes a neural architecture, based on Sequence-to-Sequence models, that provides the expected resource usage of applications sharing hardware resources. The proposed technique provides resource managers the ability to predict resource usage over time as well as the completion time of the running applications. The technique yields lower error in predicting usage when compared with other popular Machine Learning methods. Thirdly, this thesis proposes a technique for auto-tuning Big Data workloads from the available tunable parameters. The proposed technique gathers information from the logs of an application, generating a feature descriptor that captures relevant information from the application to be tuned. Using this information, we demonstrate that performance models can generalize up to 34% better when compared with other state-of-the-art solutions. Moreover, the search time to find a suitable solution can be drastically reduced, with up to a 12x speedup and results of almost equal quality to those of modern solutions. These results prove that modern learning algorithms, with the right feature information, provide powerful techniques to manage resource allocation for applications running in cloud environments. This thesis demonstrates that learning algorithms allow relevant optimizations in Data Center environments, where applications are externally monitored and careful resource management is paramount to efficiently use computing resources. We propose to demonstrate this thesis in three areas that orbit around resource management in server environments.

    Query Optimization on Distributed Databases

    In recent years the Web has evolved from a global information space of linked documents to a web of linked data. The number of data sources and the amount of data published have been exploding, covering diverse domains such as people, companies, publications, popular culture and online communities, life sciences, governmental and statistical data, and many more. As a result, applying optimization techniques to the systems that query these data is now strongly required. Efficient query processing depends on the construction of an efficient query plan to guide query execution. Detailed instance-level metadata about the data sources and statistics about the data distribution are used to estimate the cost of different query plans and select the optimal one. Query optimizers in query processing systems typically rely on histograms, data structures that approximate data distribution, in order to be able to apply their cost model. We noticed that there were cases where the optimizer performed very poorly because of bad histogram estimates. These cases are rare and usually involve a large outlier, which is why adaptive histograms cannot deal with them. Therefore, in this thesis we detected such cases and created a method that lets the histogram provide the optimizer with better statistics in these rare edge cases. Even though this had a negative impact on the average case, the improvement on the edge cases was more significant.

    Performance evaluation in database research: principles and experience

    A significant part of today's database research focuses on improving the performance of a specific system. Quantitative experiments are the best way to validate such results. However, performing experiments is not always easy. Besides the complexity of the system under test, designing an experiment, choosing the right environment and parameter values, analyzing the data that is gathered, and reporting it to a third party in an expressive and intelligible way are hard. In this tutorial, we present a general road-map to the above steps, including tips and tricks on how to organize and present code that performs experiments, so that an outsider can repeat them. The tutorial is primarily aimed at MS and PhD students seeking to improve their experiment practices, but more senior attendees may also find it interesting.