12,169 research outputs found

    Learning a Partitioning Advisor with Deep Reinforcement Learning

    Full text link
    Commercial data analytics products such as Microsoft Azure SQL Data Warehouse or Amazon Redshift provide ready-to-use scale-out database solutions for OLAP-style workloads in the cloud. While the provisioning of a database cluster is usually fully automated by cloud providers, customers typically still have to make important design decisions which were traditionally made by the database administrator such as selecting the partitioning schemes. In this paper we introduce a learned partitioning advisor for analytical OLAP-style workloads based on Deep Reinforcement Learning (DRL). The main idea is that a DRL agent learns its decisions based on experience by monitoring the rewards for different workloads and partitioning schemes. We evaluate our learned partitioning advisor in an experimental evaluation with different databases schemata and workloads of varying complexity. In the evaluation, we show that our advisor is not only able to find partitionings that outperform existing approaches for automated partitioning design but that it also can easily adjust to different deployments. This is especially important in cloud setups where customers can easily migrate their cluster to a new set of (virtual) machines

    StreamLearner: Distributed Incremental Machine Learning on Event Streams: Grand Challenge

    Full text link
    Today, massive amounts of streaming data from smart devices need to be analyzed automatically to realize the Internet of Things. The Complex Event Processing (CEP) paradigm promises low-latency pattern detection on event streams. However, CEP systems need to be extended with Machine Learning (ML) capabilities such as online training and inference in order to be able to detect fuzzy patterns (e.g., outliers) and to improve pattern recognition accuracy during runtime using incremental model training. In this paper, we propose a distributed CEP system denoted as StreamLearner for ML-enabled complex event detection. The proposed programming model and data-parallel system architecture enable a wide range of real-world applications and allow for dynamically scaling up and out system resources for low-latency, high-throughput event processing. We show that the DEBS Grand Challenge 2017 case study (i.e., anomaly detection in smart factories) integrates seamlessly into the StreamLearner API. Our experiments verify scalability and high event throughput of StreamLearner.Comment: Christian Mayer, Ruben Mayer, and Majd Abdo. 2017. StreamLearner: Distributed Incremental Machine Learning on Event Streams: Grand Challenge. In Proceedings of the 11th ACM International Conference on Distributed and Event-based Systems (DEBS '17), 298-30

    Cache Hierarchy Inspired Compression: a Novel Architecture for Data Streams

    Get PDF
    We present an architecture for data streams based on structures typically found in web cache hierarchies. The main idea is to build a meta level analyser from a number of levels constructed over time from a data stream. We present the general architecture for such a system and an application to classification. This architecture is an instance of the general wrapper idea allowing us to reuse standard batch learning algorithms in an inherently incremental learning environment. By artificially generating data sources we demonstrate that a hierarchy containing a mixture of models is able to adapt over time to the source of the data. In these experiments the hierarchies use an elementary performance based replacement policy and unweighted voting for making classification decisions

    Speculative Approximations for Terascale Analytics

    Full text link
    Model calibration is a major challenge faced by the plethora of statistical analytics packages that are increasingly used in Big Data applications. Identifying the optimal model parameters is a time-consuming process that has to be executed from scratch for every dataset/model combination even by experienced data scientists. We argue that the incapacity to evaluate multiple parameter configurations simultaneously and the lack of support to quickly identify sub-optimal configurations are the principal causes. In this paper, we develop two database-inspired techniques for efficient model calibration. Speculative parameter testing applies advanced parallel multi-query processing methods to evaluate several configurations concurrently. The number of configurations is determined adaptively at runtime, while the configurations themselves are extracted from a distribution that is continuously learned following a Bayesian process. Online aggregation is applied to identify sub-optimal configurations early in the processing by incrementally sampling the training dataset and estimating the objective function corresponding to each configuration. We design concurrent online aggregation estimators and define halting conditions to accurately and timely stop the execution. We apply the proposed techniques to distributed gradient descent optimization -- batch and incremental -- for support vector machines and logistic regression models. We implement the resulting solutions in GLADE PF-OLA -- a state-of-the-art Big Data analytics system -- and evaluate their performance over terascale-size synthetic and real datasets. The results confirm that as many as 32 configurations can be evaluated concurrently almost as fast as one, while sub-optimal configurations are detected accurately in as little as a 1/20th1/20^{\text{th}} fraction of the time
    • 

    corecore