132 research outputs found
Supporting Better Insights of Data Science Pipelines with Fine-grained Provenance
Successful data-driven science requires complex data engineering pipelines to
clean, transform, and alter data in preparation for machine learning, and
robust results can only be achieved when each step in the pipeline can be
justified, and its effect on the data explained. In this framework, our aim is
to provide data scientists with facilities to gain an in-depth understanding of
how each step in the pipeline affects the data, from the raw input to training
sets ready to be used for learning. Starting from an extensible set of data
preparation operators commonly used within a data science setting, in this work
we present a provenance management infrastructure for generating, storing, and
querying very granular accounts of data transformations, at the level of
individual elements within datasets whenever possible. Then, from the formal
definition of a core set of data science preprocessing operators, we derive a
provenance semantics embodied by a collection of templates expressed in PROV, a
standard model for data provenance. Using those templates as a reference, our
provenance generation algorithm generalises to any operator with observable
input/output pairs. We provide a prototype implementation of an
application-level provenance capture library to produce, in a semi-automatic
way, complete provenance documents that account for the entire pipeline. We
report on the ability of our implementations to capture provenance in real ML
benchmark pipelines and over TCP-DI synthetic data. We finally show how the
collected provenance can be used to answer a suite of provenance benchmark
queries that underpin some common pipeline inspection questions, as expressed
on the Data Science Stack Exchange.
Comment: 37 pages, 27 figures, submitted to a journal
Fine-grained provenance for high-quality data science
In this work we analyze the typical operations of data preparation within a machine learning process, and provide an infrastructure for generating very granular provenance records from it, at the level of individual elements within a dataset. Our contributions include: (i) the formal definition of a core set of preprocessing operators, (ii) the definition of provenance patterns for each of them, and (iii) a prototype implementation of an application-level provenance capture library that works alongside Python.
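The element-level provenance records described above can be illustrated with a minimal sketch. This is not the authors' library; it is a hypothetical operator (`impute_with_provenance`, invented here) showing what a per-cell derivation record for one preprocessing step might look like, loosely following PROV's wasDerivedFrom relation:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DerivationRecord:
    # PROV-style derivation: output element wasDerivedFrom input element
    # via a named preprocessing activity (e.g. "impute", "identity").
    output_id: str
    input_id: str
    activity: str

def impute_with_provenance(column, fill_value, col_name="col"):
    """Imputation that also emits element-level provenance records.

    Hypothetical helper for illustration: each cell of the output column
    is linked back to the cell it was derived from, and the record says
    whether the operator changed it or passed it through.
    """
    out, records = [], []
    for i, value in enumerate(column):
        in_id = f"{col_name}[{i}]@v0"
        out_id = f"{col_name}[{i}]@v1"
        if value is None:
            out.append(fill_value)
            records.append(DerivationRecord(out_id, in_id, "impute"))
        else:
            out.append(value)
            records.append(DerivationRecord(out_id, in_id, "identity"))
    return out, records

values, records = impute_with_provenance([3, None, 5], fill_value=4)
# values  -> [3, 4, 5]
# records[1] links col[1]@v1 back to col[1]@v0 via "impute"
```

Querying such records (e.g. "which cells did the imputation step change?") then reduces to filtering on the activity field.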
Incorporating Provenance in Database Systems
The importance of maintaining provenance has been widely recognized. Currently there are two approaches: provenance generated within workflow frameworks, and provenance within a contained relational database. Workflow provenance allows workflow re-execution, and can offer some explanation of results. Within relational databases, knowledge of SQL queries and relational operators is used to express provenance.
There is a disconnect between these two areas of provenance research. Techniques that work in relational databases cannot be applied to workflow systems because of heterogeneous data types and black-box operators. Meanwhile, the real-life utility of workflow systems has not been extended to database provenance. In the gap between provenance in workflow systems and databases, there are myriad systems that need provenance. Examples include creating a new dataset, like MiMI, from several sources and processes, or building an algorithm that generates sequence alignments, like MiBlast. These hybrid systems cannot be forced into a workflow framework and do not exist solely within a database. This work solves issues that block provenance usage in hybrid systems. In particular, we look at capturing, storing, and using provenance information outside of workflow and database provenance systems.
Database provenance and workflow systems provide no support for tracking the provenance of user actions, yet manual effort is often a large component of the work in these hybrid systems. We describe an approach to track and record the user's actions in a queryable form. Once provenance is captured, storage can become prohibitively expensive, in both hybrid and workflow systems. We identify several techniques to reduce the provenance store. Additionally, usability of provenance is a problem in workflow, database, and hybrid provenance systems: provenance contains both too much and too little information. We highlight the missing information that can assist user understanding, and develop a model of provenance answers to decrease information overload. Finally, workflow and database systems are designed to explain the results users see; they do not explain why items are not in the result. We allow researchers to specify what they are looking for and answer why it does not exist in the result set.
PhD thesis, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies
http://deepblue.lib.umich.edu/bitstream/2027.42/61645/1/apchapma_1.pd
Prospecting Period Measurements with LSST - Low Mass X-ray Binaries as a Test Case
The Large Synoptic Survey Telescope (LSST) will provide for unbiased sampling
of the variability properties of objects with mag < 24. This should allow
those objects whose variations reveal their orbital periods (P_orb), such
as low mass X-ray binaries (LMXBs) and related objects, to be examined in much
greater detail and with uniform systematic sampling. However, the baseline LSST
observing strategy has temporal sampling that is not optimised for such work in
the Galaxy. Here we assess four candidate observing strategies for measurement
of P_orb in the range 10 minutes to 50 days. We simulate multi-filter
quiescent LMXB lightcurves including ellipsoidal modulation and stochastic
flaring, and then sample these using LSST's operations simulator (OpSim) over
the (mag, P_orb) parameter space, and over five sightlines sampling a range
of possible reddening values. The percentage of simulated parameter space with
correctly returned periods ranges from 23%, for the current baseline
strategy, to 70% for the two simulated specialist strategies. Convolving
these results with a P_orb distribution, a modelled Galactic spatial
distribution and reddening maps, we conservatively estimate that the most
recent version of the LSST baseline strategy will allow P_orb determination
for 18% of the Milky Way's LMXB population, whereas strategies that do
not reduce observations of the Galactic Plane can improve this dramatically to
32%. This increase would allow characterisation of the full binary
population by breaking degeneracies between suggested P_orb distributions
in the literature. Our results can be used in the ongoing assessment of the
effectiveness of various potential cadencing strategies.
Comment: Replacement after addressing minor corrections from the referee -
mainly improvements in clarification
Surrogate Parenthood: Protected and Informative Graphs
Many applications, including provenance and some analyses of social networks,
require path-based queries over graph-structured data. When these graphs
contain sensitive information, paths may be broken, resulting in uninformative
query results. This paper presents innovative techniques that give users more
informative graph query results; the techniques leverage a common industry
practice of providing what we call surrogates: alternate, less sensitive
versions of nodes and edges releasable to a broader community. We describe
techniques for interposing surrogate nodes and edges to protect sensitive graph
components, while maximizing graph connectivity and giving users as much
information as possible. In this work, we formalize the problem of creating a
protected account G' of a graph G. We provide a utility measure to compare the
informativeness of alternate protected accounts and an opacity measure for
protected accounts, which indicates the likelihood that an attacker can
recreate the topology of the original graph from the protected account. We
provide an algorithm to create a maximally useful protected account of a
sensitive graph, and show through evaluation with the PLUS prototype that using
surrogates and protected accounts adds value for the user, with no significant
impact on the time required to generate results for graph queries.
Comment: VLDB201
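The interposition of surrogates described above can be sketched in a few lines. This is an illustrative toy, not the PLUS algorithm: the helper `protect` and its inputs are invented here, and it only shows the core idea of rewriting edges so sensitive nodes are replaced by their releasable surrogates while connectivity (and hence path queries) is preserved:

```python
def protect(edges, sensitive, surrogate_of):
    """Build a protected account of a graph given as an edge list.

    Hypothetical sketch: every endpoint in `sensitive` is replaced by
    its surrogate (a less sensitive alternate version of the node), so
    paths through it remain intact instead of being broken by redaction.
    """
    protected = []
    for u, v in edges:
        pu = surrogate_of.get(u, u) if u in sensitive else u
        pv = surrogate_of.get(v, v) if v in sensitive else v
        protected.append((pu, pv))
    return protected

edges = [("alice", "report7"), ("report7", "dataset3")]
out = protect(edges, {"report7"}, {"report7": "internal-report"})
# The path alice -> ... -> dataset3 survives via the surrogate node.
```

A real system would additionally score candidate protected accounts with the paper's utility and opacity measures before releasing one.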
Provenance management in curated databases
Curated databases in bioinformatics and other disciplines are the result of a great deal of manual annotation, correction and transfer of data from other sources. Provenance information concerning the creation, attribution, or version history of such data is crucial for assessing its integrity and scientific value. General purpose database systems provide little support for tracking provenance, especially when data moves among databases. This paper investigates general-purpose techniques for recording provenance for data that is copied among databases. We describe an approach in which we track the user’s actions while browsing source databases and copying data into a curated database, in order to record the user’s actions in a convenient, queryable form. We present an implementation of this technique and use it to evaluate the feasibility of database support for provenance management. Our experiments show that although the overhead of a naïve approach is fairly high, it can be decreased to an acceptable level using simple optimizations.
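The idea of recording copy actions in a queryable form can be sketched with a minimal example. The schema and helper below are assumptions for illustration, not the paper's implementation: each copy from a source database into the curated one is logged as a row, so "where did this entry come from?" becomes an ordinary SQL query:

```python
import sqlite3

# Hypothetical log table: one row per copy action into the curated DB.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE copy_log (
    target_key TEXT,   -- key of the entry in the curated database
    source_db  TEXT,   -- which external database it was copied from
    source_key TEXT    -- key of the record in that source
)""")

def record_copy(target_key, source_db, source_key):
    """Log a single user copy action in queryable form."""
    conn.execute("INSERT INTO copy_log VALUES (?, ?, ?)",
                 (target_key, source_db, source_key))

record_copy("gene:42", "UniProt", "P12345")

# Provenance query: where did curated entry gene:42 come from?
row = conn.execute(
    "SELECT source_db, source_key FROM copy_log WHERE target_key = ?",
    ("gene:42",)).fetchone()
```

Keeping the log in the same database as the curated data is one way to make the naïve overhead the paper measures explicit: every copy costs one extra insert.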
Dataset search: a survey
Generating value from data requires the ability to find, access and make
sense of datasets. There are many efforts underway to encourage data sharing
and reuse, from scientific publishers asking authors to submit data alongside
manuscripts to data marketplaces, open data portals and data communities.
Google recently beta released a search service for datasets, which allows users
to discover data stored in various online repositories via keyword queries.
These developments foreshadow an emerging research field around dataset search
or retrieval that broadly encompasses frameworks, methods and tools that help
match a user data need against a collection of datasets. Here, we survey the
state of the art of research and commercial systems in dataset retrieval. We
identify what makes dataset search a research field in its own right, with
unique challenges and methods, and highlight open problems. We look at
approaches and implementations from related areas that dataset search is
drawing upon, including information retrieval, databases, and entity-centric
and tabular search, in order to identify possible paths to resolve these open
problems as well as immediate next steps that will take the field forward.
Comment: 20 pages, 153 references
Computational Notebooks as Co-Design Tools: Engaging Young Adults Living with Diabetes, Family Carers, and Clinicians with Machine Learning Models
Engaging end user groups with machine learning (ML) models can help align the design of predictive systems with
people’s needs and expectations. We present a co-design study investigating the benefits and challenges of using
computational notebooks to inform ML models with end user groups. We used a computational notebook to engage
young adults, carers, and clinicians with an example ML model that predicted health risk in diabetes care. Through co-design workshops and retrospective interviews, we found that participants particularly valued using the interactive
data visualisations of the computational notebook to scaffold multidisciplinary learning, anticipate benefits and harms
of the example ML model, and create fictional feature importance plots to highlight care needs. Participants also
reported challenges, from running code cells to managing information asymmetries and power imbalances. We discuss
the potential of leveraging computational notebooks as interactive co-design tools to meet end user needs early in ML
model lifecycles.