Learning a Partitioning Advisor with Deep Reinforcement Learning
Commercial data analytics products such as Microsoft Azure SQL Data Warehouse
or Amazon Redshift provide ready-to-use scale-out database solutions for
OLAP-style workloads in the cloud. While the provisioning of a database cluster
is usually fully automated by cloud providers, customers typically still have
to make important design decisions that were traditionally made by the
database administrator, such as selecting the partitioning schemes.
In this paper we introduce a learned partitioning advisor for analytical
OLAP-style workloads based on Deep Reinforcement Learning (DRL). The main idea
is that a DRL agent learns its decisions based on experience by monitoring the
rewards for different workloads and partitioning schemes. We evaluate our
learned partitioning advisor experimentally with different database
schemata and workloads of varying complexity. In the evaluation, we
show that our advisor is not only able to find partitionings that outperform
existing approaches for automated partitioning design but can also
easily adjust to different deployments. This is especially important in cloud
setups where customers can easily migrate their cluster to a new set of
(virtual) machines.
Scalable distributed event detection for Twitter
Social media streams, such as Twitter, have shown themselves to be useful sources of real-time information about what is happening in the world. Automatic detection and tracking of events identified in these streams have a variety of real-world applications, e.g. identifying and automatically reporting road accidents for emergency services. However, to be useful, events need to be identified within the stream with very low latency. This is challenging due to the high volume of posts within these social streams. In this paper, we propose a novel event detection approach that can both effectively detect events within social streams like Twitter and scale to thousands of posts every second. Through experimentation on a large Twitter dataset, we show that our approach can process the equivalent of the full Twitter Firehose stream, while maintaining event detection accuracy and outperforming an alternative distributed event detection system.
A Survey of Parallel Data Mining
With the fast, continuous increase in the number and size of databases, parallel data mining is a natural and cost-effective approach to tackle the problem of scalability in data mining. Recently there has been considerable research on parallel data mining. However, most projects focus on the parallelization of a single kind of data mining algorithm/paradigm. This paper surveys parallel data mining with a broader perspective. More precisely, we discuss the parallelization of data mining algorithms of four knowledge discovery paradigms, namely rule induction, instance-based learning, genetic algorithms and neural networks. Using the lessons learned from this discussion, we also derive a set of heuristic principles for designing efficient parallel data mining algorithms.
Durable Digital Objects Rather Than Digital Preservation
Long-term digital preservation is not the best available objective. Instead, what information producers and consumers almost surely want is a universe of durable digital objects—documents and programs that are as accessible and useful a century from now as they are today.
Given the will, we could implement and deploy a practical and pleasing durability infrastructure within two years. Tools for daily work can embed packaging for durability without much burdening their users. Moving responsibility for durability from archival employees to information producers also avoids burdening repositories with keeping up with Internet scale. An engineering prescription is available.
Research libraries’ and archives’ slow advance towards practical preservation of digital content is remarkable to outsiders. Why is their progress stalled? Ineffective collaboration across disciplinary boundaries has surely been a major impediment. We speculate about cultural reasons for this situation and warn about possible marginalization of research librarianship as a profession.
Middleware-based Database Replication: The Gaps between Theory and Practice
The need for high availability and performance in data management systems has
been fueling a long running interest in database replication from both academia
and industry. However, academic groups often attack replication problems in
isolation, overlooking the need for completeness in their solutions, while
commercial teams take a holistic approach that often misses opportunities for
fundamental innovation. This has created over time a gap between academic
research and industrial practice.
This paper aims to characterize the gap along three axes: performance,
availability, and administration. We build on our own experience developing and
deploying replication systems in commercial and academic settings, as well as
on a large body of prior related work. We sift through representative examples
from the last decade of open-source, academic, and commercial database
replication systems and combine this material with case studies from real
systems deployed at Fortune 500 customers. We propose two agendas, one for
academic research and one for industrial R&D, which we believe can bridge the
gap within 5-10 years. This way, we hope to both motivate and help researchers
in making the theory and practice of middleware-based database replication more
relevant to each other.
Comment: 14 pages. Appears in Proc. ACM SIGMOD International Conference on
Management of Data, Vancouver, Canada, June 200