
    Scalable aggregation predictive analytics: a query-driven machine learning approach

    We introduce a predictive modeling solution that provides high-quality predictive analytics over aggregation queries in Big Data environments. Our predictive methodology is generally applicable in environments in which large-scale data owners may or may not restrict access to their data and allow only aggregation operators like COUNT to be executed over them. In this context, our methodology is based on historical queries and their answers to accurately predict the answers of ad-hoc queries. We focus on the widely used set-cardinality, i.e., COUNT, aggregation query, as COUNT is a fundamental operator both for internal data system optimizations and for aggregation-oriented data exploration and predictive analytics. We contribute a novel, query-driven Machine Learning (ML) model whose goals are to: (i) learn the query-answer space from past issued queries, (ii) associate the query space with local linear regression and associative function estimators, (iii) define query similarity, and (iv) predict the cardinality of the answer set of unseen incoming queries, referred to as the Set Cardinality Prediction (SCP) problem. Our ML model incorporates incremental ML algorithms to ensure high-quality prediction results. The significance of this contribution lies in that it (i) is the only query-driven solution applicable over general Big Data environments, including restricted-access data, (ii) offers incremental learning adjusted for arriving ad-hoc queries, which is well suited for query-driven data exploration, and (iii) offers performance (in terms of scalability, SCP accuracy, processing time, and memory requirements) superior to that of data-centric approaches. We provide a comprehensive performance evaluation of our model, assessing its sensitivity, scalability, and efficiency for quality predictive analytics. In addition, we report on the development and incorporation of our ML model in Spark, showing its superior performance compared to Spark's COUNT method.
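
    The pipeline the abstract describes (quantize the query-answer space, attach a local estimator per region, route unseen queries by similarity) can be sketched as follows. This is a minimal batch illustration, not the authors' incremental implementation; all class and parameter names are hypothetical.

```python
# Minimal sketch (not the paper's code) of the query-driven idea:
# cluster past (query, COUNT) pairs and attach a local linear
# regressor to each cluster; an unseen query is answered by the
# regressor of its most similar (nearest-centroid) cluster.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

class SetCardinalityPredictor:
    def __init__(self, n_clusters=8):
        self.n_clusters = n_clusters

    def fit(self, queries, counts):
        # queries: (n, d) vectorized range predicates; counts: (n,) answers
        self.km = KMeans(n_clusters=self.n_clusters, n_init=10).fit(queries)
        self.models = {}
        for c in range(self.n_clusters):
            mask = self.km.labels_ == c
            self.models[c] = LinearRegression().fit(queries[mask], counts[mask])
        return self

    def predict(self, query):
        # query similarity here is plain Euclidean distance to the centroids
        c = int(self.km.predict(query.reshape(1, -1))[0])
        return float(self.models[c].predict(query.reshape(1, -1))[0])

# usage on synthetic data:
rng = np.random.default_rng(0)
Q = rng.uniform(size=(500, 4))
y = 1000 * Q.sum(axis=1)                    # stand-in for true COUNT answers
est = SetCardinalityPredictor().fit(Q, y)
print(est.predict(Q[0]), y[0])
```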

    Clustering-Initialized Adaptive Histograms and Probabilistic Cost Estimation for Query Optimization

    An assumption with self-tuning histograms has been that they can "learn" the dataset if given enough training queries. We show that this is not the case with current approaches: the quality of the histogram depends on its initial configuration. Starting with a few good buckets can improve the efficiency of learning; without them, the histogram is likely to stagnate, i.e., converge to a bad configuration and stop learning. We also present a probabilistic cost estimation model.
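
    A minimal 1-D sketch of the initialization idea follows: derive the starting buckets from clustering a data sample rather than from an equi-width split, so that the self-tuning phase begins from a good configuration. The function names and the choice of k-means are illustrative assumptions, not the paper's method.

```python
# Sketch: bucket edges placed halfway between k-means cluster centers,
# then the usual uniform-within-bucket range-selectivity estimate.
import numpy as np
from sklearn.cluster import KMeans

def clustered_buckets(sample, k=6):
    centers = np.sort(KMeans(n_clusters=k, n_init=10)
                      .fit(sample.reshape(-1, 1)).cluster_centers_.ravel())
    edges = np.concatenate(([sample.min()],
                            (centers[:-1] + centers[1:]) / 2,
                            [sample.max()]))
    freqs, _ = np.histogram(sample, bins=edges)
    return edges, freqs / freqs.sum()

def estimate_selectivity(edges, freqs, lo, hi):
    # assume values are spread uniformly within each bucket
    sel = 0.0
    for i in range(len(freqs)):
        left, right = edges[i], edges[i + 1]
        overlap = max(0.0, min(hi, right) - max(lo, left))
        if right > left:
            sel += freqs[i] * overlap / (right - left)
    return sel
```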

    Query-driven learning for predictive analytics of data subspace cardinality

    Fundamental to many predictive analytics tasks is the ability to estimate the cardinality (number of data items) of multi-dimensional data subspaces, defined by query selections over datasets. This is crucial for data analysts dealing with, e.g., interactive data subspace explorations, data subspace visualizations, and query processing optimization. However, in many modern data systems, predictive analytics may be (i) too costly money-wise, e.g., in clouds, (ii) unreliable, e.g., in modern Big Data query engines, where accurate statistics are difficult to obtain and maintain, or (iii) infeasible, e.g., for privacy reasons. We contribute a novel, query-driven, function estimation model of analyst-defined data subspace cardinality. The proposed estimation model is highly accurate and accommodates the well-known selection query types: multi-dimensional range and distance-nearest-neighbor (radius) queries. Our function estimation model: (i) quantizes the vectorial query space by learning the analysts' access patterns over a data space, (ii) associates query vectors with the corresponding cardinalities of the analyst-defined data subspaces, (iii) abstracts and employs query vectorial similarity to predict the cardinality of an unseen/unexplored data subspace, and (iv) identifies and adapts to possible changes of the query subspaces based on the theory of optimal stopping. The proposed model is decentralized, facilitating the scaling-out of such predictive analytics queries. The research significance of the model lies in that (i) it is an attractive solution when data-driven statistical techniques are undesirable or infeasible, (ii) it offers a scale-out, decentralized training solution, (iii) it is applicable to different selection query types, and (iv) it offers performance superior to that of data-driven approaches.
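
    The quantization steps (i)-(ii) can be illustrated with a small online vector quantizer: prototypes track the analysts' query patterns, and each prototype keeps a running mean of the cardinalities it has absorbed. This is a hedged sketch under simplifying assumptions (Euclidean similarity, fixed prototype count, queries in the unit cube); it omits the paper's optimal-stopping adaptation, and all names are illustrative.

```python
# Online vector quantization over query vectors with a per-prototype
# running mean cardinality; unseen queries are answered by the most
# similar (nearest) prototype.
import numpy as np

class QueryQuantizer:
    def __init__(self, n_prototypes=16, dim=4, lr=0.05, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.uniform(size=(n_prototypes, dim))  # prototype queries
        self.card = np.zeros(n_prototypes)              # mean cardinality
        self.n = np.zeros(n_prototypes)                 # updates per prototype
        self.lr = lr

    def update(self, q, cardinality):
        j = int(np.argmin(np.linalg.norm(self.W - q, axis=1)))  # winner
        self.W[j] += self.lr * (q - self.W[j])   # move prototype toward query
        self.n[j] += 1
        self.card[j] += (cardinality - self.card[j]) / self.n[j]

    def predict(self, q):
        j = int(np.argmin(np.linalg.norm(self.W - q, axis=1)))
        return self.card[j]
```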

    QuickSel: Quick Selectivity Learning with Mixture Models

    Estimating the selectivity of a query is a key step in almost any cost-based query optimizer. Most of today's databases rely on histograms or samples that are periodically refreshed by re-scanning the data as the underlying data changes. Since frequent scans are costly, these statistics are often stale and lead to poor selectivity estimates. As an alternative to scans, query-driven histograms have been proposed, which refine the histograms based on the actual selectivities of the observed queries. Unfortunately, these approaches are either too costly to use in practice---i.e., they require an exponential number of buckets---or quickly lose their advantage as they observe more queries. In this paper, we propose a selectivity learning framework, called QuickSel, which falls into the query-driven paradigm but does not use histograms. Instead, it builds an internal model of the underlying data, which can be refined significantly faster (e.g., only 1.9 milliseconds for 300 queries). This fast refinement allows QuickSel to continuously learn from each query and yield increasingly accurate selectivity estimates over time. Unlike query-driven histograms, QuickSel relies on a mixture model and a new optimization algorithm for training its model. Our extensive experiments on two real-world datasets confirm that, given the same target accuracy, QuickSel is 34.0x-179.4x faster than state-of-the-art query-driven histograms, including ISOMER and STHoles. Further, given the same space budget, QuickSel is 26.8%-91.8% more accurate than periodically updated histograms and samples, respectively.
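
    The core idea admits a compact sketch: treat the data density as a weighted mixture of uniform distributions over boxes, and fit the weights so that the model reproduces the selectivities observed for past queries. The snippet below is a simplified reading using non-negative least squares, not the paper's optimization algorithm; the box representation and function names are assumptions.

```python
# Fit mixture weights w >= 0 so that A @ w matches observed selectivities,
# where A[i, j] is the fraction of component j's mass inside query box i.
import numpy as np
from scipy.optimize import nnls

def overlap_ratio(q, c):
    # q, c are (lo, hi) pairs of d-dimensional numpy arrays
    inter = np.prod(np.maximum(0.0, np.minimum(q[1], c[1])
                               - np.maximum(q[0], c[0])))
    vol = np.prod(c[1] - c[0])
    return inter / vol if vol > 0 else 0.0

def fit_mixture(query_boxes, selectivities, component_boxes):
    A = np.array([[overlap_ratio(q, c) for c in component_boxes]
                  for q in query_boxes])
    w, _ = nnls(A, np.asarray(selectivities))  # non-negative least squares
    return w / w.sum() if w.sum() > 0 else w   # normalize to a distribution

def predict_selectivity(query_box, component_boxes, w):
    return float(np.dot(w, [overlap_ratio(query_box, c)
                            for c in component_boxes]))
```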

    Non-parametric Methods for Correlation Analysis in Multivariate Data with Applications in Data Mining

    In this thesis, we develop novel methods for correlation analysis in multivariate data, with a special focus on mining correlated subspaces. Our methods handle major open challenges that arise when combining correlation analysis with subspace mining. Besides traditional correlation analysis, we explore interaction-preserving discretization of multivariate data and causality analysis. We conduct experiments on a variety of real-world data sets; the results validate the benefits of our methods.

    On Invariance and Selectivity in Representation Learning

    We discuss data representations that can be learned automatically from data, are invariant to transformations, and are at the same time selective, in the sense that two points have the same representation only if one is a transformation of the other. The mathematical results here sharpen some of the key claims of i-theory, a recent theory of feedforward processing in sensory cortex.
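
    One concrete reading of an invariant-yet-selective representation in the spirit of i-theory: represent a signal by histograms of its projections onto the group orbit of a few templates. Pooling over the whole orbit gives invariance, while keeping the full histogram (rather than, say, only its mean) preserves selectivity. The sketch below uses cyclic shifts as the transformation group and is an illustration, not the paper's construction.

```python
# Signature of x: for each template t, the histogram of <x, g.t> over all
# group elements g (here, all cyclic shifts). Shifting x permutes the dot
# products, so the histogram -- and hence the signature -- is unchanged.
import numpy as np

def signature(x, templates, n_bins=10):
    sig = []
    for t in templates:
        dots = [np.dot(x, np.roll(t, s)) for s in range(len(t))]
        # unit-norm inputs keep dot products in [-1, 1] (Cauchy-Schwarz)
        hist, _ = np.histogram(dots, bins=n_bins, range=(-1.0, 1.0))
        sig.append(hist / len(dots))
    return np.concatenate(sig)

rng = np.random.default_rng(0)
x = rng.standard_normal(16); x /= np.linalg.norm(x)
templates = [rng.standard_normal(16) for _ in range(3)]
templates = [t / np.linalg.norm(t) for t in templates]
# the signature of x and of any shifted copy of x coincide
assert np.allclose(signature(x, templates),
                   signature(np.roll(x, 5), templates))
```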

    Query Optimization on Distributed Databases

    In recent years the Web has evolved from a global information space of linked documents into a web of linked data. The number of data sources and the amount of data published have exploded, covering diverse domains such as people, companies, publications, popular culture and online communities, life sciences, governmental and statistical data, and many more. It is therefore increasingly important to apply optimization techniques in the systems that query these data. Efficient query processing depends on the construction of an efficient query plan to guide query execution. Detailed instance-level metadata about the data sources and statistics on the data distribution are used to estimate the cost of different query plans and select the optimal one. Query optimizers in query processing systems typically rely on histograms, data structures that approximate the data distribution, in order to apply their cost model. We noticed that there were cases where the optimizer performed very badly because of bad histogram estimates. These cases are rare, usually involve a big outlier, and this is why adaptive histograms cannot deal with them. In this thesis we therefore detected such cases and created a method that makes the histogram provide the optimizer with better statistics in these rare edge cases. Even though this had a negative impact on the average case, the improvement on the edge cases was more significant.
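
    The failure mode described above is easy to reproduce: a single extreme outlier stretches an equi-width histogram until almost all data falls into one bucket, so range estimates collapse. The snippet below demonstrates this and a simple remedy in the spirit of the thesis (treating detected outliers separately); the detection rule and all names are illustrative assumptions, not the thesis' method.

```python
# One outlier pushes the histogram's domain to [0, 1e6], so the 10,000
# bulk points all land in bucket 0 and the range estimate for [10, 20]
# comes out near 1 instead of near 1000; excluding detected outliers
# before building the histogram restores resolution on the bulk.
import numpy as np

def equiwidth_estimate(data, lo, hi, n_buckets=10):
    edges = np.linspace(data.min(), data.max(), n_buckets + 1)
    freqs, _ = np.histogram(data, bins=edges)
    est = 0.0
    for i in range(n_buckets):
        l, r = edges[i], edges[i + 1]
        overlap = max(0.0, min(hi, r) - max(lo, l))
        est += freqs[i] * overlap / (r - l)
    return est

rng = np.random.default_rng(0)
data = np.concatenate([rng.uniform(0, 100, 10_000), [1_000_000.0]])
truth = np.sum((data >= 10) & (data <= 20))

naive = equiwidth_estimate(data, 10, 20)           # bulk squeezed into bucket 0
is_out = np.abs(data - np.median(data)) > 10 * data.std()  # crude outlier rule
fixed = equiwidth_estimate(data[~is_out], 10, 20)  # histogram over bulk only
print(truth, round(naive), round(fixed))
```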