SQPR: Stream Query Planning with Reuse
When users submit new queries to a distributed stream processing system (DSPS), a query planner must allocate physical resources, such as CPU cores, memory and network bandwidth, from a set of hosts to queries. Allocation decisions must provide the correct mix of resources required by queries, while achieving an overall allocation efficient enough to scale in the number of admitted queries. By exploiting overlap between queries and reusing partial results, a query planner can conserve resources, but it has to carry out more complex planning decisions. In this paper, we describe SQPR, a query planner that targets DSPSs in data centre environments with heterogeneous resources. SQPR models query admission, allocation and reuse as a single constrained optimisation problem and solves an approximate version to achieve scalability. It prevents individual resources from becoming bottlenecks by re-planning past allocation decisions and supports different allocation objectives. As our experimental evaluation against a state-of-the-art planner shows, SQPR makes efficient resource allocation decisions, even at high resource utilisation, with acceptable overheads.
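As a concrete illustration of the admission-and-allocation problem, the following minimal Python sketch exhaustively searches for an assignment of queries to hosts that admits as many queries as possible without exceeding any host's capacity. The hosts, queries and resource figures are invented, and real planners such as SQPR solve a far richer (approximate) optimisation rather than a brute-force search.

```python
# Minimal sketch of query admission/allocation as a constrained
# optimisation problem, in the spirit of SQPR (not its actual model).
# Each query needs CPU and memory; each host has finite capacity.
from itertools import product

hosts = {"h1": {"cpu": 8, "mem": 32}, "h2": {"cpu": 4, "mem": 16}}  # capacities
queries = {"q1": {"cpu": 3, "mem": 8}, "q2": {"cpu": 4, "mem": 20},
           "q3": {"cpu": 2, "mem": 4}}                               # demands

def feasible(assignment):
    """Check that no host's capacity is exceeded by its assigned queries."""
    used = {h: {"cpu": 0, "mem": 0} for h in hosts}
    for q, h in assignment.items():
        if h is None:            # query rejected by the planner
            continue
        for r in ("cpu", "mem"):
            used[h][r] += queries[q][r]
            if used[h][r] > hosts[h][r]:
                return False
    return True

best, best_admitted = None, -1
options = list(hosts) + [None]   # None = reject the query
for choice in product(options, repeat=len(queries)):
    assignment = dict(zip(queries, choice))
    admitted = sum(h is not None for h in assignment.values())
    if feasible(assignment) and admitted > best_admitted:
        best, best_admitted = assignment, admitted

print(best, "admits", best_admitted, "queries")
```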
Database Learning: Toward a Database that Becomes Smarter Every Time
In today's databases, previous query answers rarely benefit answering future
queries. For the first time, to the best of our knowledge, we change this
paradigm in an approximate query processing (AQP) context. We make the
following observation: the answer to each query reveals some degree of
knowledge about the answer to another query because their answers stem from the
same underlying distribution that has produced the entire dataset. Exploiting
and refining this knowledge should allow us to answer queries more
analytically, rather than by reading enormous amounts of raw data. Also,
processing more queries should continuously enhance our knowledge of the
underlying distribution, and hence lead to increasingly faster response times
for future queries.
We call this novel idea---learning from past query answers---Database
Learning. We exploit the principle of maximum entropy to produce answers, which
are in expectation guaranteed to be more accurate than existing sample-based
approximations. Empowered by this idea, we build a query engine on top of Spark
SQL, called Verdict. We conduct extensive experiments on real-world query
traces from a large customer of a major database vendor. Our results
demonstrate that Verdict supports 73.7% of these queries, speeding them up by
up to 23.0x for the same accuracy level compared to existing AQP systems.
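To make the idea tangible: under a Gaussian (maximum-entropy) model, refining a new answer from a correlated past answer reduces to standard Gaussian conditioning. The sketch below uses made-up numbers and is only a caricature of Verdict's inference.

```python
# Illustrative sketch of "learning from past query answers": under a
# Gaussian (maximum-entropy) model, a new query's answer is refined by
# conditioning on a correlated past answer. All numbers are invented.
import numpy as np

mu = np.array([100.0, 100.0])   # prior means: [new query, past query]
cov = np.array([[25.0, 20.0],   # answers are positively correlated
                [20.0, 25.0]])  # (e.g. overlapping ranges of one table)

past_answer = 110.0                        # answer computed exactly earlier
sample_estimate, sample_var = 104.0, 9.0   # cheap sample-based estimate

# Condition the Gaussian on the observed past answer ...
posterior_mean = mu[0] + cov[0, 1] / cov[1, 1] * (past_answer - mu[1])
posterior_var = cov[0, 0] - cov[0, 1] ** 2 / cov[1, 1]

# ... then combine with the fresh sample estimate (inverse-variance weights).
w = sample_var / (sample_var + posterior_var)
improved = w * posterior_mean + (1 - w) * sample_estimate
print(f"improved answer: {improved:.1f} (posterior var {posterior_var:.1f})")
```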
ReStore: Reusing Results of MapReduce Jobs
Analyzing large scale data has emerged as an important activity for many
organizations in the past few years. This large scale data analysis is
facilitated by the MapReduce programming and execution model and its
implementations, most notably Hadoop. Users of MapReduce often have analysis
tasks that are too complex to express as individual MapReduce jobs. Instead,
they use high-level query languages such as Pig, Hive, or Jaql to express their
complex tasks. The compilers of these languages translate queries into
workflows of MapReduce jobs. Each job in these workflows reads its input from
the distributed file system used by the MapReduce system and produces output
that is stored in this distributed file system and read as input by the next
job in the workflow. The current practice is to delete these intermediate
results from the distributed file system at the end of executing the workflow.
One way to improve the performance of workflows of MapReduce jobs is to keep
these intermediate results and reuse them for future workflows submitted to the
system. In this paper, we present ReStore, a system that manages the storage
and reuse of such intermediate results. ReStore can reuse the output of whole
MapReduce jobs that are part of a workflow, and it can also create additional
reuse opportunities by materializing and storing the output of query execution
operators that are executed within a MapReduce job. We have implemented ReStore
as an extension to the Pig dataflow system on top of Hadoop, and we
experimentally demonstrate significant speedups on queries from the PigMix
benchmark.
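The core reuse mechanism can be caricatured in a few lines: key each job by a hash of its inputs and its transformation, and return the stored output on a hit. This sketch is illustrative only; ReStore additionally matches sub-jobs and materialises individual operator outputs.

```python
# Toy version of result reuse for MapReduce jobs (illustrative only):
# reuse the output of an earlier job when a new job has the same inputs
# and the same transformation, keyed by a content hash of both.
import hashlib

reuse_store = {}   # job signature -> path of materialised output in the DFS

def job_signature(input_paths, script_text):
    h = hashlib.sha256()
    for p in sorted(input_paths):
        h.update(p.encode())
    h.update(script_text.encode())
    return h.hexdigest()

def run_job(input_paths, script_text, execute):
    """Return the output path, reusing a stored result when possible."""
    sig = job_signature(input_paths, script_text)
    if sig in reuse_store:
        return reuse_store[sig]       # reuse instead of re-executing
    out_path = execute(input_paths)   # run the MapReduce job for real
    reuse_store[sig] = out_path       # keep the intermediate result
    return out_path
```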
Two-phased knowledge formalisation for hydrometallurgical gold ore process recommendation and validation
This paper describes an approach to externalising and formalising the expert knowledge involved in the design and evaluation of hydrometallurgical process chains for gold ore treatment. The objective was to create a case-based reasoning application for recommending and validating a treatment process for gold ores. We describe a twofold approach. First, formalising human expert knowledge about gold mining situations enables the retrieval of similar mining contexts and their respective process chains, based on prospection data gathered from a potential gold mining site. Second, empirical knowledge on hydrometallurgical treatments is formalised, which enables us to evaluate and, where needed, redesign the process chain recommended by the first part of our approach. The main problems with formalising knowledge in the domain of gold ore refinement are the diversity and the number of parameters used in the literature and by experts to describe a mining context. We demonstrate how similarity knowledge was used to formalise literature knowledge. The evaluation of data gathered from experiments with an initial prototype workflow recommender, Auric Adviser, provides promising results.
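The retrieval half of such a case-based reasoning system can be sketched as a weighted similarity search over attribute vectors. The attributes, weights and treatment chains below are invented for illustration; they are not the Auric Adviser ontology.

```python
# Sketch of case-based retrieval: mining contexts are attribute vectors,
# and the most similar past case supplies a candidate treatment chain.
cases = [
    ({"gold_grade": 2.1, "sulfide": 0.8, "clay": 0.1}, "pressure oxidation chain"),
    ({"gold_grade": 5.0, "sulfide": 0.1, "clay": 0.3}, "direct cyanidation chain"),
]
weights = {"gold_grade": 0.5, "sulfide": 0.3, "clay": 0.2}  # invented weights

def similarity(a, b):
    """Weighted similarity in [0, 1]; per-attribute distance is normalised."""
    return sum(w * (1.0 - abs(a[k] - b[k]) / max(a[k], b[k], 1e-9))
               for k, w in weights.items())

def recommend(query):
    """Return the process chain of the most similar stored case."""
    return max(cases, key=lambda case: similarity(query, case[0]))[1]

print(recommend({"gold_grade": 4.2, "sulfide": 0.2, "clay": 0.25}))
```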
From Cooperative Scans to Predictive Buffer Management
In analytical applications, database systems often need to sustain workloads
with multiple concurrent scans hitting the same table. The Cooperative Scans
(CScans) framework, which introduces an Active Buffer Manager (ABM) component
into the database architecture, has been the most effective and elaborate
response to this problem, and was initially developed in the X100 research
prototype. We now report on the experiences of integrating Cooperative
Scans into its industrial-strength successor, the Vectorwise database product.
During this implementation we invented a simpler optimization of concurrent
scan buffer management, called Predictive Buffer Management (PBM). PBM is based
on the observation that in a workload with long-running scans, the buffer
manager has quite a bit of information on the workload in the immediate future,
such that an approximation of the ideal OPT algorithm becomes feasible. In an
evaluation on both synthetic benchmarks and a TPC-H throughput run, we
compare the benefits of naive buffer management (LRU) versus CScans, PBM and
OPT, showing that PBM achieves benefits close to Cooperative Scans while
incurring much lower architectural impact.
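The key observation is easy to demonstrate: once the future reference string is (approximately) known, Belady's OPT rule of evicting the page whose next use lies farthest in the future becomes computable. The toy below is plain OPT on a fully known trace, not Vectorwise's actual implementation.

```python
# OPT (Belady) eviction on a known reference string: evict the buffered
# page whose next reference is farthest away. Long-running scans give a
# buffer manager exactly this kind of foresight, which PBM exploits.
def opt_evict(buffer, future_refs, now):
    """Pick the buffered page with the most distant next reference."""
    def next_use(page):
        for t in range(now, len(future_refs)):
            if future_refs[t] == page:
                return t
        return float("inf")          # never used again: ideal victim
    return max(buffer, key=next_use)

refs = [1, 2, 3, 1, 2, 4, 1, 2, 3]   # page ids touched by concurrent scans
buffer, capacity = set(), 3
for t, page in enumerate(refs):
    if page not in buffer:
        if len(buffer) == capacity:
            buffer.discard(opt_evict(buffer, refs, t + 1))
        buffer.add(page)
print(buffer)
```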
Active Transfer Learning with Zero-Shot Priors: Reusing Past Datasets for Future Tasks
How can we reuse existing knowledge, in the form of available datasets, when
solving a new and apparently unrelated target task from a set of unlabeled
data? In this work we make a first contribution to answer this question in the
context of image classification. We frame this quest as an active learning
problem and use zero-shot classifiers to guide the learning process by linking
the new task to the existing classifiers. By revisiting the dual formulation of
adaptive SVM, we reveal two basic conditions to choose greedily only the most
relevant samples to be annotated. On this basis we propose an effective active
learning algorithm which learns the best possible target classification model
with minimum human labeling effort. Extensive experiments on two challenging
datasets show the value of our approach compared to the state-of-the-art active
learning methodologies, as well as its potential to reuse past datasets with
minimal effort for future tasks.
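A hedged sketch of such a selection loop (not the paper's exact dual-SVM criterion): a zero-shot prior scores each unlabeled sample's relevance to the new task, and the learner greedily annotates samples that are relevant but still uncertain under the current model.

```python
# Sketch of zero-shot-guided active sample selection. The scoring rule
# (prior relevance times model uncertainty) is an invented stand-in for
# the paper's dual-SVM conditions.
import numpy as np

rng = np.random.default_rng(0)
zero_shot_prior = rng.random(100)   # stand-in for zero-shot relevance scores
model_margin = rng.random(100)      # |decision value| of the current SVM

def pick_batch(prior, margin, k=5):
    # High prior relevance, low margin (uncertain) -> most informative.
    score = prior * (1.0 - margin)
    return np.argsort(score)[-k:][::-1]   # indices to send for labeling

print(pick_batch(zero_shot_prior, model_margin))
```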
Applying semantic web technologies to knowledge sharing in aerospace engineering
This paper details an integrated methodology to optimise knowledge reuse and sharing, illustrated with a use case in the aeronautics domain. It uses ontologies as a central modelling strategy for the capture of knowledge from legacy documents via automated means, or directly in systems interfacing with knowledge workers via user-defined, web-based forms. The domain ontologies used for knowledge capture also guide the retrieval of the knowledge extracted from the data, using a semantic search system that supports multiple modalities during search. This approach has been applied and evaluated successfully within the aerospace domain, and is currently being extended for use in other domains on an increasingly large scale.
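A minimal sketch of ontology-guided retrieval, using the rdflib library with a hypothetical vocabulary (the namespace and terms below are invented, not the project's aerospace ontology): captured knowledge is stored as triples, and the ontology's terms shape the search query.

```python
# Ontology-guided lookup over captured knowledge, stored as RDF triples.
from rdflib import Graph, Literal, Namespace, RDF

AERO = Namespace("http://example.org/aero#")   # placeholder namespace
g = Graph()
g.add((AERO.doc42, RDF.type, AERO.DesignNote))
g.add((AERO.doc42, AERO.concerns, AERO.WingSpar))
g.add((AERO.doc42, AERO.summary, Literal("Fatigue margins for spar rework")))

# The ontology's classes and properties become the vocabulary of the query.
results = g.query("""
    PREFIX aero: <http://example.org/aero#>
    SELECT ?doc ?summary WHERE {
        ?doc a aero:DesignNote ;
             aero:concerns aero:WingSpar ;
             aero:summary ?summary .
    }""")
for doc, summary in results:
    print(doc, summary)
```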