    Learning a Partitioning Advisor with Deep Reinforcement Learning

    Commercial data analytics products such as Microsoft Azure SQL Data Warehouse or Amazon Redshift provide ready-to-use scale-out database solutions for OLAP-style workloads in the cloud. While the provisioning of a database cluster is usually fully automated by cloud providers, customers typically still have to make important design decisions that were traditionally made by the database administrator, such as selecting the partitioning schemes. In this paper we introduce a learned partitioning advisor for analytical OLAP-style workloads based on Deep Reinforcement Learning (DRL). The main idea is that a DRL agent learns its decisions based on experience by monitoring the rewards for different workloads and partitioning schemes. We evaluate our learned partitioning advisor experimentally with different database schemata and workloads of varying complexity. In the evaluation, we show that our advisor is not only able to find partitionings that outperform existing approaches for automated partitioning design but that it can also easily adjust to different deployments. This is especially important in cloud setups where customers can easily migrate their cluster to a new set of (virtual) machines.
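
    The reward-driven idea in this abstract can be illustrated with a deliberately simplified sketch. It is not the paper's method: it swaps deep reinforcement learning for a plain epsilon-greedy bandit, and the scheme names, the runtimes in SIMULATED_RUNTIME, and the helpers measure_reward and train_advisor are all hypothetical, chosen only for illustration. The agent repeatedly picks a partitioning scheme, observes a reward (the negated runtime), and keeps a running average per scheme.

```python
import random

# Hypothetical candidate partitioning schemes (names are illustrative only).
SCHEMES = ["hash_on_customer_id", "hash_on_order_id", "replicate_dimension_tables"]

# Made-up mean workload runtimes in seconds; a real advisor would measure
# these on a cluster or estimate them with a cost model.
SIMULATED_RUNTIME = {
    "hash_on_customer_id": 12.0,
    "hash_on_order_id": 20.0,
    "replicate_dimension_tables": 15.0,
}


def measure_reward(scheme, rng):
    """Return the negated, slightly noisy runtime: faster schemes earn higher reward."""
    noise = rng.uniform(-1.0, 1.0)
    return -(SIMULATED_RUNTIME[scheme] + noise)


def train_advisor(episodes=500, epsilon=0.1, seed=0):
    """Epsilon-greedy bandit: return the scheme with the best average reward."""
    rng = random.Random(seed)
    value = {}
    count = {}
    # Try every scheme once so each has an initial reward estimate.
    for s in SCHEMES:
        value[s] = measure_reward(s, rng)
        count[s] = 1
    for _ in range(episodes):
        if rng.random() < epsilon:
            scheme = rng.choice(SCHEMES)  # explore a random scheme
        else:
            scheme = max(SCHEMES, key=value.get)  # exploit the current best
        reward = measure_reward(scheme, rng)
        count[scheme] += 1
        # Incremental update of the running mean reward for this scheme.
        value[scheme] += (reward - value[scheme]) / count[scheme]
    return max(SCHEMES, key=value.get)


if __name__ == "__main__":
    print(train_advisor())
```

    A bandit ignores the workload and schema state that a DRL agent would condition on, so this shows only the learn-from-rewards loop, not how the advisor generalizes across workloads or deployments.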

    Scalable distributed event detection for Twitter

    Social media streams, such as Twitter, have shown themselves to be useful sources of real-time information about what is happening in the world. Automatic detection and tracking of events identified in these streams have a variety of real-world applications, e.g. identifying and automatically reporting road accidents for emergency services. However, to be useful, events need to be identified within the stream with very low latency. This is challenging due to the high volume of posts within these social streams. In this paper, we propose a novel event detection approach that can both effectively detect events within social streams like Twitter and scale to thousands of posts every second. Through experimentation on a large Twitter dataset, we show that our approach can process the equivalent of the full Twitter Firehose stream, while maintaining event detection accuracy and outperforming an alternative distributed event detection system.

    A Survey of Parallel Data Mining

    With the fast, continuous increase in the number and size of databases, parallel data mining is a natural and cost-effective approach to tackle the problem of scalability in data mining. Recently there has been considerable research on parallel data mining. However, most projects focus on the parallelization of a single kind of data mining algorithm/paradigm. This paper surveys parallel data mining with a broader perspective. More precisely, we discuss the parallelization of data mining algorithms of four knowledge discovery paradigms, namely rule induction, instance-based learning, genetic algorithms and neural networks. Using the lessons learned from this discussion, we also derive a set of heuristic principles for designing efficient parallel data mining algorithms.

    Durable Digital Objects Rather Than Digital Preservation

    Long-term digital preservation is not the best available objective. Instead, what information producers and consumers almost surely want is a universe of durable digital objects—documents and programs that are as accessible and useful a century from now as they are today. Given the will, we could implement and deploy a practical and pleasing durability infrastructure within two years. Tools for daily work can embed packaging for durability without much burdening their users. Moving responsibility for durability from archival employees to information producers also avoids burdening repositories with keeping up with Internet scale. An engineering prescription is available. Research libraries’ and archives’ slow advance towards practical preservation of digital content is remarkable to outsiders. Why is their progress stalled? Ineffective collaboration across disciplinary boundaries has surely been a major impediment. We speculate about cultural reasons for this situation and warn about possible marginalization of research librarianship as a profession.

    Durable Digital Objects Rather Than Digital Preservation

    Long-term digital preservation is not the best available objective. Instead, what information producers and consumers almost surely want is a universe of durable digital objects—documents and programs that will be as accessible and useful a century from now as they are today. Given the will, we could implement and deploy a practical and pleasing durability infrastructure within two years. Tools for daily work can embed packaging for durability without much burdening their users. Moving responsibility for durability from archival employees to information producers would also avoid burdening repositories with keeping up with Internet scale. An engineering prescription is available. Research libraries’ and archives’ slow advance towards practical preservation of digital content is remarkable to outsiders. Why does their progress seem stalled? Ineffective collaboration across disciplinary boundaries has surely been a major impediment. We speculate about cultural reasons for this situation and warn about possible marginalization of research librarianship as a profession.

    Middleware-based Database Replication: The Gaps between Theory and Practice

    The need for high availability and performance in data management systems has been fueling a long-running interest in database replication from both academia and industry. However, academic groups often attack replication problems in isolation, overlooking the need for completeness in their solutions, while commercial teams take a holistic approach that often misses opportunities for fundamental innovation. Over time, this has created a gap between academic research and industrial practice. This paper aims to characterize the gap along three axes: performance, availability, and administration. We build on our own experience developing and deploying replication systems in commercial and academic settings, as well as on a large body of prior related work. We sift through representative examples from the last decade of open-source, academic, and commercial database replication systems and combine this material with case studies from real systems deployed at Fortune 500 customers. We propose two agendas, one for academic research and one for industrial R&D, which we believe can bridge the gap within 5-10 years. This way, we hope to both motivate and help researchers in making the theory and practice of middleware-based database replication more relevant to each other. Comment: 14 pages. Appears in Proc. ACM SIGMOD International Conference on Management of Data, Vancouver, Canada, June 200