539 research outputs found

    Efficient Scalable Accurate Regression Queries in In-DBMS Analytics

    Recent trends aim to incorporate advanced data analytics capabilities within DBMSs. Linear regression queries are fundamental to exploratory analytics and predictive modeling. However, computing their exact answers leaves a lot to be desired in terms of efficiency and scalability. We contribute a novel predictive analytics model and associated regression query processing algorithms, which are efficient, scalable and accurate. We focus on predicting the answers to two key query types that reveal dependencies between the values of different attributes: (i) mean-value queries and (ii) multivariate linear regression queries, both within specific data subspaces defined based on the values of other attributes. Our algorithms achieve many orders of magnitude improvement in query processing efficiency and near-perfect approximations of the underlying relationships among data attributes.
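
    To make the two query types concrete, the sketch below computes their exact answers over a radius-defined subspace, which is the costly baseline the abstract sets out to approximate. It is not the paper's predictive model. The row layout (`x` for selection attributes, `y` for the output attribute), the function names, and the restriction to a single predictor in the regression are assumptions made for illustration.

```python
from statistics import mean

def in_subspace(row, center, radius):
    """True if the row's selection attributes lie within `radius` of `center`."""
    dist2 = sum((a - c) ** 2 for a, c in zip(row["x"], center))
    return dist2 <= radius ** 2

def mean_value_query(rows, center, radius):
    """Exact mean of the output attribute `y` over the subspace (None if empty)."""
    ys = [r["y"] for r in rows if in_subspace(r, center, radius)]
    return mean(ys) if ys else None

def regression_query(rows, center, radius):
    """Exact least-squares fit y ~ slope * x[0] + intercept over the subspace.

    Assumption: only the first selection attribute is used as predictor,
    to keep the sketch short; the abstract covers the multivariate case.
    """
    pts = [(r["x"][0], r["y"]) for r in rows if in_subspace(r, center, radius)]
    if len(pts) < 2:
        return None
    mx = mean(p[0] for p in pts)
    my = mean(p[1] for p in pts)
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    if sxx == 0:
        return None  # all points share the same x; slope is undefined
    sxy = sum((x - mx) * (y - my) for x, y in pts)
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical toy data: y = 2x + 1 on x = 0..9.
rows = [{"x": (float(i),), "y": 2.0 * i + 1.0} for i in range(10)]
```

    Both functions scan every row on each call, so their cost grows with the data size; that scan is what an approach predicting answers from a learned model would avoid.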

    Leveraging Edge Computing through Collaborative Machine Learning

    The Internet of Things (IoT) offers the ability to analyze and predict our surroundings through sensor networks at the network edge. To facilitate this predictive functionality, Edge Computing (EC) applications are developed by considering power consumption, network lifetime, and quality of context inference. Large volumes of contextual data from sensors give data scientists better knowledge extraction, but at the expense of holistic data transfer that threatens network feasibility and lifetime. To cope with this, collaborative machine learning is applied to EC devices to (i) extract the statistical relationships and (ii) construct regression (predictive) models to maximize communication efficiency. In this paper, we propose a learning methodology that improves the prediction accuracy by quantizing the input space and leveraging the local knowledge of the EC devices.

    Scalable aggregation predictive analytics: a query-driven machine learning approach

    We introduce a predictive modeling solution that provides high quality predictive analytics over aggregation queries in Big Data environments. Our predictive methodology is generally applicable in environments in which large-scale data owners may or may not restrict access to their data and allow only aggregation operators like COUNT to be executed over their data. In this context, our methodology is based on historical queries and their answers to accurately predict ad-hoc queries’ answers. We focus on the widely used set-cardinality, i.e., COUNT, aggregation query, as COUNT is a fundamental operator for both internal data system optimizations and for aggregation-oriented data exploration and predictive analytics. We contribute a novel, query-driven Machine Learning (ML) model whose goals are to: (i) learn the query-answer space from past issued queries, (ii) associate the query space with local linear regression & associative function estimators, (iii) define query similarity, and (iv) predict the cardinality of the answer set of unseen incoming queries, referred to as the Set Cardinality Prediction (SCP) problem. Our ML model incorporates incremental ML algorithms for ensuring high quality prediction results. The significance of our contribution lies in that it (i) is the only query-driven solution applicable over general Big Data environments, which include restricted-access data, (ii) offers incremental learning adjusted for arriving ad-hoc queries, which is well suited for query-driven data exploration, and (iii) offers a performance (in terms of scalability, SCP accuracy, processing time, and memory requirements) that is superior to data-centric approaches. We provide a comprehensive performance evaluation of our model, evaluating its sensitivity, scalability and efficiency for quality predictive analytics. In addition, we report on the development and incorporation of our ML model in Spark, showing its superior performance compared to Spark’s COUNT method.
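
    The query-driven idea in this abstract (learn from past queries and their answers, then predict the COUNT of an unseen query without touching the data) can be sketched roughly as follows. This is a deliberately simplified stand-in: it uses a distance-weighted k-nearest-neighbour average over past queries rather than the local linear regression and associative estimators the abstract names. The class name, the encoding of a query as a numeric vector (for example center coordinates plus radius), and the parameter `k` are all assumptions.

```python
import math

class CountPredictor:
    """Predicts COUNT answers from previously observed (query, count) pairs."""

    def __init__(self, k=3):
        self.k = k
        self.history = []  # list of (query_vector, observed_count)

    def observe(self, query, count):
        """Incrementally record an executed query and its true answer."""
        self.history.append((tuple(query), float(count)))

    def predict(self, query):
        """Estimate the count for an unseen query; None if nothing was observed."""
        if not self.history:
            return None
        # Query similarity is taken to be Euclidean distance in query space.
        nearest = sorted(
            (math.dist(query, q), c) for q, c in self.history
        )[: self.k]
        # Inverse-distance weights; the small constant avoids division by zero.
        weights = [1.0 / (d + 1e-9) for d, _ in nearest]
        total = sum(w * c for w, (_, c) in zip(weights, nearest))
        return total / sum(weights)

# Hypothetical usage: queries encoded as (center_x, center_y, radius).
predictor = CountPredictor(k=2)
predictor.observe((0.0, 0.0, 1.0), 10)
predictor.observe((1.0, 1.0, 1.0), 20)
predictor.observe((5.0, 5.0, 1.0), 100)
estimate = predictor.predict((0.5, 0.5, 1.0))
```

    Because `predict` reads only the stored query history, it needs no access to the underlying data, which is the property that makes such an approach usable when data owners expose nothing but aggregate operators.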

    Explaining Aggregates for Exploratory Analytics

    Analysts wishing to explore multivariate data spaces typically pose queries involving selection operators, i.e., range or radius queries, which define data subspaces of possible interest and then use aggregation functions, the results of which determine their exploratory analytics interests. However, such aggregate query (AQ) results are simple scalars and as such, convey limited information about the queried subspaces for exploratory analysis. We address this shortcoming, aiding analysts to explore and understand data subspaces by contributing a novel explanation mechanism coined XAXA: eXplaining Aggregates for eXploratory Analytics. XAXA’s novel AQ explanations are represented using functions obtained by a three-fold joint optimization problem. Explanations assume the form of a set of parametric piecewise-linear functions acquired through a statistical learning model. A key feature of the proposed solution is that model training is performed by only monitoring AQs and their answers on-line. In XAXA, explanations for future AQs can be computed without any database (DB) access and can be used to further explore the queried data subspaces, without issuing any more queries to the DB. We evaluate the explanation accuracy and efficiency of XAXA through theoretically grounded metrics over real-world and synthetic datasets and query workloads.

    Towards Integrated Data Analytics: Time Series Forecasting in DBMS

    Integrating sophisticated statistical methods into database management systems is gaining more and more attention in research and industry in order to be able to cope with increasing data volume and increasing complexity of the analytical algorithms. One important statistical method is time series forecasting, which is crucial for decision-making processes in many domains. The deep integration of time series forecasting offers additional advanced functionalities within a DBMS. More importantly, however, it allows for optimizations that improve the efficiency, consistency, and transparency of the overall forecasting process. To enable efficient integrated forecasting, we propose to enhance the traditional 3-layer ANSI/SPARC architecture of a DBMS with forecasting functionalities. This article gives a general overview of our proposed enhancements and presents how forecast queries can be processed using an example from the energy data management domain. We conclude with open research topics and challenges that arise in this area.