397 research outputs found
The lifecycle of provenance metadata and its associated challenges and opportunities
This chapter outlines some of the challenges and opportunities associated
with adopting provenance principles and standards in a variety of disciplines,
including data publication and reuse, and information sciences.
TAPER: query-aware, partition-enhancement for large, heterogenous, graphs
Graph partitioning has long been seen as a viable approach to address Graph
DBMS scalability. A partitioning, however, may introduce extra query processing
latency unless it is sensitive to a specific query workload, and optimised to
minimise inter-partition traversals for that workload. Additionally, it should
also be possible to incrementally adjust the partitioning in reaction to
changes in the graph topology, the query workload, or both. Because of their
complexity, current partitioning algorithms fall short of one or both of these
requirements, as they are designed for offline use and as one-off operations.
The TAPER system aims to address both requirements, whilst leveraging existing
partitioning algorithms. TAPER takes any given initial partitioning as a
starting point, and iteratively adjusts it by swapping chosen vertices across
partitions, heuristically reducing the probability of inter-partition
traversals for a given workload of pattern matching queries. Iterations are
inexpensive thanks to time and space optimisations in the underlying support
data structures. We evaluate TAPER on two different large test graphs and over
realistic query workloads. Our results indicate that, given a hash-based
partitioning, TAPER reduces the number of inter-partition traversals by around
80%; given an unweighted METIS partitioning, by around 30%. These reductions
are achieved within 8 iterations and with the additional advantage of being
workload-aware and usable online.
Comment: 12 pages, 11 figures, unpublishe
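The refinement idea in the TAPER abstract can be illustrated with a small sketch. This is not TAPER's actual heuristic: it assumes each edge carries a hypothetical traversal weight (how often the workload crosses it), moves vertices greedily rather than swapping them, and ignores partition balance. The names cut_weight and refine_once are invented for this illustration.

```python
from collections import defaultdict

def cut_weight(edges, part):
    """Sum of traversal weights over edges whose endpoints are in different partitions."""
    return sum(w for (u, v), w in edges.items() if part[u] != part[v])

def refine_once(edges, part):
    """One greedy pass: move each vertex to the partition that holds most of
    its traversal weight. Assumption: balance constraints are ignored."""
    part = dict(part)  # work on a copy so the caller's partitioning is untouched
    neighbours = defaultdict(list)
    for (u, v), w in edges.items():
        neighbours[u].append((v, w))
        neighbours[v].append((u, w))
    for v in sorted(neighbours):
        pull = defaultdict(float)
        for n, w in neighbours[v]:
            pull[part[n]] += w
        # On ties, prefer the partition the vertex is already in.
        part[v] = max(pull, key=lambda p: (pull[p], p == part[v]))
    return part

# Hypothetical workload: edges a-b and c-d are traversed often, b-c rarely.
edges = {("a", "b"): 5, ("b", "c"): 1, ("c", "d"): 5}
initial = {"a": 0, "b": 1, "c": 1, "d": 0}
refined = refine_once(edges, initial)
```

On this toy input the frequently traversed edges end up inside partitions and only the rarely traversed edge b-c is left crossing a boundary.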
Loom: Query-aware Partitioning of Online Graphs
As with general graph processing systems, partitioning data over a cluster of
machines improves the scalability of graph database management systems.
However, these systems will incur additional network cost during the execution
of a query workload, due to inter-partition traversals. Workload-agnostic
partitioning algorithms typically minimise the likelihood of any edge crossing
partition boundaries. However, these partitioners are sub-optimal with respect
to many workloads, especially queries, which may require more frequent
traversal of specific subsets of inter-partition edges. Furthermore, they are
largely unsuited to operating incrementally on dynamic, growing graphs.
We present a new graph partitioning algorithm, Loom, that operates on a
stream of graph updates and continuously allocates the new vertices and edges
to partitions, taking into account a query workload of graph pattern
expressions along with their relative frequencies.
First we capture the most common patterns of edge traversals which occur when
executing queries. We then compare sub-graphs, which present themselves
incrementally in the graph update stream, against these common patterns.
Finally we attempt to allocate each match to a single partition, reducing the
number of inter-partition edges within frequently traversed sub-graphs and
improving average query performance.
Loom is extensively evaluated over several large test graphs with realistic
query workloads and various orderings of the graph updates. We demonstrate
that, given a workload, our prototype produces partitionings of significantly
better quality than the existing streaming graph partitioning algorithms Fennel
and LDG.
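For context, the LDG baseline named in the Loom abstract (Linear Deterministic Greedy) can be sketched as follows. This is the workload-agnostic baseline, not Loom itself. The scoring rule is the commonly described LDG heuristic: the number of already-placed neighbours in a partition, scaled by that partition's remaining capacity. The function name ldg_assign, the stream format, and the tie-breaking rule are assumptions made for this example.

```python
def ldg_assign(stream, k, capacity):
    """Assign each streamed vertex to one of k partitions.

    stream   -- iterable of (vertex, neighbours seen so far) pairs
    capacity -- assumed maximum number of vertices per partition
    """
    parts = [set() for _ in range(k)]
    where = {}
    for v, nbrs in stream:
        def score(i):
            # Neighbours already in partition i, penalised by how full it is.
            return len(parts[i].intersection(nbrs)) * (1 - len(parts[i]) / capacity)
        # Ties go to the least loaded partition, then to the lowest index.
        best = max(range(k), key=lambda i: (score(i), -len(parts[i])))
        parts[best].add(v)
        where[v] = best
    return where

# Hypothetical stream of four vertices forming two disjoint edges.
stream = [("a", []), ("b", ["a"]), ("c", []), ("d", ["c"])]
assignment = ldg_assign(stream, k=2, capacity=2)
```

Each vertex is placed once, on arrival, using only the neighbours seen so far. Loom keeps this streaming setting but additionally takes the query workload into account.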
Design and evaluation of a genomics variant analysis pipeline using GATK Spark tools
Scalable and efficient processing of genome sequence data, i.e. for variant
discovery, is key to the mainstream adoption of High Throughput technology for
disease prevention and for clinical use. Achieving scalability, however,
requires a significant effort to enable the parallel execution of the analysis
tools that make up the pipelines. This is facilitated by the new Spark versions
of the well-known GATK toolkit, which offer a black-box approach by
transparently exploiting the underlying Map Reduce architecture. In this paper
we report on our experience implementing a standard variant discovery pipeline
using GATK 4.0 with Docker-based deployment over a cluster. We provide a
preliminary performance analysis, comparing the processing times and cost to
those of the new Microsoft Genomics Services.
Taverna Workflows: Syntax and Semantics
This paper presents the formal syntax and the operational semantics of Taverna, a workflow management system with a large user base among the e-Science community. Such a formal foundation, which has so far been lacking, opens the way to the translation between Taverna workflows and other process models. In particular, the ability to automatically compile a simple domain-specific process description into Taverna facilitates its adoption by e-scientists who are not expert workflow developers. We demonstrate this potential through a practical use case.
Data trajectories: tracking reuse of published data for transitive credit attribution
The ability to measure the use and impact of published data sets is key to the success of the open data/open science paradigm. A direct measure of impact would require tracking data (re)use in the wild, which is difficult to achieve. This is therefore commonly replaced by simpler metrics based on data download and citation counts. In this paper we describe a scenario where it is possible to track the trajectory of a dataset after its publication, and show how this enables the design of accurate models for ascribing credit to data originators. A Data Trajectory (DT) is a graph that encodes knowledge of how, by whom, and in which context data has been reused, possibly after several generations. We provide a theoretical model of DTs that is grounded in the W3C PROV data model for provenance, and we show how DTs can be used to automatically propagate a fraction of the credit associated with transitively derived datasets back to the original data contributors. We also show this model of transitive credit in action by means of a Data Reuse Simulator. In the longer term, our ultimate hope is that credit models based on direct measures of data reuse will provide further incentives to data publication. We conclude by outlining a research agenda to address the hard questions of creating, collecting, and using DTs systematically across a large number of data reuse instances in the wild.
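The idea of passing a fraction of credit back along derivation links can be sketched as below. This is an illustration under stated assumptions, not the paper's model: the derivation graph is assumed to be acyclic, each dataset passes a fixed fraction of its accumulated credit to the datasets it was derived from, and that share is split equally among them. The names propagate_credit, derived_from, and fraction are hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+

def propagate_credit(derived_from, credit, fraction=0.5):
    """Push credit from derived datasets back towards their sources.

    derived_from -- dict: dataset -> list of datasets it was derived from
    credit       -- dict: dataset -> credit earned directly (e.g. from reuse)
    fraction     -- assumed share of accumulated credit passed to sources
    """
    # Invert the graph so that every dataset is processed after its derivatives.
    derivatives = {d: set() for d in derived_from}
    for d, sources in derived_from.items():
        for s in sources:
            derivatives.setdefault(s, set()).add(d)
    total = {d: float(credit.get(d, 0.0)) for d in derivatives}
    for d in TopologicalSorter(derivatives).static_order():
        sources = derived_from.get(d, [])
        if not sources:
            continue  # an original dataset keeps everything it has received
        passed = fraction * total[d]
        total[d] -= passed
        for s in sources:
            total[s] += passed / len(sources)
    return total

# Hypothetical chain: C was derived from B, which was derived from A.
derived_from = {"C": ["B"], "B": ["A"], "A": []}
shares = propagate_credit(derived_from, {"C": 8})
```

In this chain, credit earned only by the third-generation dataset C still reaches the original contributor A, and the total amount of credit is conserved.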
Building Rule Hierarchies for Efficient Logical Rule Learning from Knowledge Graphs
Many systems have been developed in recent years to mine logical rules from
large-scale Knowledge Graphs (KGs), on the grounds that representing
regularities as rules enables both the interpretable inference of new facts,
and the explanation of known facts. Among these systems, the walk-based methods
that generate the instantiated rules containing constants by abstracting
sampled paths in KGs demonstrate strong predictive performance and
expressivity. However, due to the large volume of possible rules, these systems
do not scale well, as computational resources are often wasted on generating
and evaluating unpromising rules. In this work, we address such scalability
issues by proposing new methods for pruning unpromising rules using rule
hierarchies. The approach consists of two phases. Firstly, since rule
hierarchies are not readily available in walk-based methods, we have built a
Rule Hierarchy Framework (RHF), which leverages a collection of subsumption
frameworks to build a proper rule hierarchy from a set of learned rules. And
secondly, we adapt RHF to an existing rule learner where we design and
implement two methods for Hierarchical Pruning (HPMs), which utilize the
generated hierarchies to remove irrelevant and redundant rules. Through
experiments over four public benchmark datasets, we show that the application
of HPMs is effective in removing unpromising rules, which leads to significant
reductions in the runtime as well as in the number of learned rules, without
compromising the predictive performance.
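A much-simplified sketch of pruning with a rule hierarchy follows. It is not the RHF or HPM implementation. It assumes that a rule is more general than another when both have the same head and its body atoms are a proper subset of the other's (a propositional stand-in for real subsumption tests), and that a specialised rule is redundant when a more general rule is at least as confident. The function name, the rule encoding, and the example rules are all hypothetical.

```python
def prune_redundant(rules):
    """Drop rules that are covered by a more general, no less confident rule.

    rules -- dict: name -> (head, frozenset of body atoms, confidence)
    """
    kept = {}
    for name, (head, body, conf) in rules.items():
        redundant = any(
            other != name and h == head and b < body and c >= conf
            for other, (h, b, c) in rules.items()
        )
        if not redundant:
            kept[name] = (head, body, conf)
    return kept

# Hypothetical learned rules with the same head.
head = "livesIn(X,Y)"
rules = {
    "r1": (head, frozenset({"bornIn(X,Y)"}), 0.8),
    "r2": (head, frozenset({"bornIn(X,Y)", "worksIn(X,Y)"}), 0.7),
    "r3": (head, frozenset({"bornIn(X,Y)", "marriedTo(X,Z)"}), 0.9),
}
kept = prune_redundant(rules)
```

Here r2 is dropped because the more general r1 is more confident, while r3 is kept because its extra body atom raises the confidence.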