132 research outputs found

    Supporting Better Insights of Data Science Pipelines with Fine-grained Provenance

    Successful data-driven science requires complex data engineering pipelines to clean, transform, and alter data in preparation for machine learning, and robust results can only be achieved when each step in the pipeline can be justified, and its effect on the data explained. In this framework, our aim is to provide data scientists with facilities to gain an in-depth understanding of how each step in the pipeline affects the data, from the raw input to training sets ready to be used for learning. Starting from an extensible set of data preparation operators commonly used within a data science setting, in this work we present a provenance management infrastructure for generating, storing, and querying very granular accounts of data transformations, at the level of individual elements within datasets whenever possible. Then, from the formal definition of a core set of data science preprocessing operators, we derive a provenance semantics embodied by a collection of templates expressed in PROV, a standard model for data provenance. Using those templates as a reference, our provenance generation algorithm generalises to any operator with observable input/output pairs. We provide a prototype implementation of an application-level provenance capture library to produce, in a semi-automatic way, complete provenance documents that account for the entire pipeline. We report on the ability of our implementations to capture provenance in real ML benchmark pipelines and over TPC-DI synthetic data. We finally show how the collected provenance can be used to answer a suite of provenance benchmark queries that underpin some common pipeline inspection questions, as expressed on the Data Science Stack Exchange. Comment: 37 pages, 27 figures, submitted to a journal.
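
    The paper's provenance semantics is expressed as PROV templates, one per preprocessing operator. As a rough illustration of what a single instantiated template could look like for one imputed cell, the sketch below uses the `prov` Python package; the operator, entity identifiers, and attribute values are invented for illustration and are not taken from the paper.

```python
# Minimal sketch, assuming a mean-imputation step touched one cell; all
# identifiers (ex:impute_mean_age, ex:row3_age_*) are hypothetical.
from prov.model import ProvDocument

doc = ProvDocument()
doc.add_namespace("ex", "http://example.org/pipeline/")

impute = doc.activity("ex:impute_mean_age")                    # the preprocessing step
cell_in = doc.entity("ex:row3_age_v0", {"ex:value": "NaN"})    # cell before the step
cell_out = doc.entity("ex:row3_age_v1", {"ex:value": "37.2"})  # cell after the step

doc.used(impute, cell_in)
doc.wasGeneratedBy(cell_out, impute)
doc.wasDerivedFrom(cell_out, cell_in, activity=impute)

print(doc.get_provn())  # PROV-N serialisation of the element-level account
```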

    Fine-grained provenance for high-quality data science

    In this work we analyze the typical operations of data preparation within a machine learning process, and provide infrastructure for generating very granular provenance records from these operations, at the level of individual elements within a dataset. Our contributions include: (i) the formal definition of a core set of preprocessing operators, (ii) the definition of provenance patterns for each of them, and (iii) a prototype implementation of an application-level provenance capture library that works alongside Python.
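
    To make element-level provenance concrete, here is a minimal sketch of capturing which individual cells a pandas preprocessing step changed. The helper name and record format are assumptions for illustration, not the paper's actual library API.

```python
# Hypothetical element-level capture around a pandas step; not the paper's API.
import pandas as pd

def capture_element_provenance(operator, df):
    """Apply `operator` and record which individual cells it changed."""
    before = df.copy()
    after = operator(before.copy())
    changes = []
    shared_cols = [c for c in before.columns if c in after.columns]
    for col in shared_cols:
        for idx in before.index.intersection(after.index):
            old, new = before.at[idx, col], after.at[idx, col]
            if pd.isna(old) and pd.isna(new):
                continue  # both missing: no change
            if pd.isna(old) != pd.isna(new) or (not pd.isna(old) and old != new):
                changes.append({"row": idx, "column": col, "from": old, "to": new})
    return after, changes

df = pd.DataFrame({"age": [25.0, None, 40.0]})
cleaned, records = capture_element_provenance(lambda d: d.fillna(d.mean()), df)
print(records)  # one record: row 1, age imputed from NaN to 32.5
```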

    Incorporating Provenance in Database Systems

    The importance of maintaining provenance has been widely recognized. Currently there are two approaches: provenance generated within workflow frameworks, and provenance within a contained relational database. Workflow provenance allows workflow re-execution, and can offer some explanation of results. Within relational databases, knowledge of SQL queries and relational operators is used to express provenance. There is a disconnect between these two areas of provenance research. Techniques that work in relational databases cannot be applied to workflow systems because of heterogeneous data types and black-box operators. Meanwhile, the real-life utility of workflow systems has not been extended to database provenance. In the gap between provenance in workflow systems and databases, there are myriad systems that need provenance. Examples include creating a new dataset, like MiMI, from several sources and processes, or building an algorithm that generates sequence alignments, like MiBlast. These hybrid systems cannot be forced into a workflow framework and do not solely exist within a database. This work solves issues that block provenance usage in hybrid systems. In particular, we look at capturing, storing, and using provenance information outside of workflow and database provenance systems. Database and workflow provenance systems provide no support for tracking the provenance of user actions, yet manual effort is often a large component of work in these hybrid systems. We describe an approach to track and record the user's actions in a queryable form. Once provenance is captured, storage can become prohibitively expensive in both hybrid and workflow systems. We identify several techniques to reduce the provenance store. Additionally, making provenance usable is a problem in workflow, database and hybrid provenance systems: provenance contains both too much and too little information. We highlight the missing information that can assist user understanding, and develop a model of provenance answers to decrease information overload. Finally, workflow and database systems are designed to explain the results users see; they do not explain why items are not in the result. We allow researchers to specify what they are looking for and answer why it does not exist in the result set. PhD thesis, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/61645/1/apchapma_1.pd
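
    The thesis argues for recording user actions in a queryable form. As a loose, hypothetical illustration of that idea (not the system described in the thesis), even a flat action log in SQLite makes simple provenance questions answerable with ordinary SQL; the schema, action names, and identifiers below are assumptions.

```python
# Illustrative only: a minimal queryable log of user actions.
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_actions (ts REAL, actor TEXT, action TEXT, source TEXT, target TEXT)"
)

def log_action(actor, action, source, target):
    conn.execute("INSERT INTO user_actions VALUES (?, ?, ?, ?, ?)",
                 (time.time(), actor, action, source, target))

log_action("alice", "copy", "source_db:gene/BRCA1", "curated_db:entry/42")
log_action("alice", "edit", "curated_db:entry/42", "curated_db:entry/42")

# "Where did entry 42 come from, and what happened to it?" becomes plain SQL.
for ts, action, source in conn.execute(
        "SELECT ts, action, source FROM user_actions WHERE target = ? ORDER BY ts",
        ("curated_db:entry/42",)):
    print(action, "from", source)
```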

    Prospecting Period Measurements with LSST - Low Mass X-ray Binaries as a Test Case

    The Large Synoptic Survey Telescope (LSST) will provide unbiased sampling of the variability properties of objects with r mag < 24. This should allow those objects whose variations reveal their orbital periods (P_orb), such as low mass X-ray binaries (LMXBs) and related objects, to be examined in much greater detail and with uniform systematic sampling. However, the baseline LSST observing strategy has temporal sampling that is not optimised for such work in the Galaxy. Here we assess four candidate observing strategies for measurement of P_orb in the range 10 minutes to 50 days. We simulate multi-filter quiescent LMXB lightcurves including ellipsoidal modulation and stochastic flaring, and then sample these using LSST's operations simulator (OpSim) over the (mag, P_orb) parameter space, and over five sightlines sampling a range of possible reddening values. The percentage of simulated parameter space with correctly returned periods ranges from ~23%, for the current baseline strategy, to ~70% for the two simulated specialist strategies. Convolving these results with a P_orb distribution, a modelled Galactic spatial distribution and reddening maps, we conservatively estimate that the most recent version of the LSST baseline strategy will allow P_orb determination for ~18% of the Milky Way's LMXB population, whereas strategies that do not reduce observations of the Galactic Plane can improve this dramatically to ~32%. This increase would allow characterisation of the full binary population by breaking degeneracies between suggested P_orb distributions in the literature. Our results can be used in the ongoing assessment of the effectiveness of various potential cadencing strategies. Comment: Replacement after addressing minor corrections from the referee, mainly improvements in clarification.
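
    As a toy illustration of the kind of period search involved (not the paper's OpSim-based simulation set-up), the sketch below builds a sparsely sampled ellipsoidal-modulation lightcurve and recovers the period with a Lomb-Scargle periodogram from astropy; the period, amplitude, noise level and cadence are invented values. Because ellipsoidal modulation shows two maxima per orbit, the strongest peak sits near P_orb/2.

```python
# Toy sketch with assumed numbers; not the paper's simulations.
import numpy as np
from astropy.timeseries import LombScargle

rng = np.random.default_rng(42)
p_orb = 0.35  # days, assumed orbital period

# Irregular sampling loosely standing in for a multi-year survey cadence.
t = np.sort(rng.uniform(0.0, 730.0, 300))

# Ellipsoidal modulation varies at half the orbital period (two maxima per orbit).
flux = 1.0 + 0.05 * np.cos(4 * np.pi * t / p_orb) + rng.normal(0.0, 0.01, t.size)

freq, power = LombScargle(t, flux).autopower(minimum_frequency=0.1,
                                             maximum_frequency=20.0)
p_peak = 1.0 / freq[np.argmax(power)]
print(f"strongest peak: {p_peak:.4f} d -> P_orb candidate: {2 * p_peak:.4f} d")
```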

    Surrogate Parenthood: Protected and Informative Graphs

    Many applications, including provenance and some analyses of social networks, require path-based queries over graph-structured data. When these graphs contain sensitive information, paths may be broken, resulting in uninformative query results. This paper presents innovative techniques that give users more informative graph query results; the techniques leverage a common industry practice of providing what we call surrogates: alternate, less sensitive versions of nodes and edges releasable to a broader community. We describe techniques for interposing surrogate nodes and edges to protect sensitive graph components, while maximizing graph connectivity and giving users as much information as possible. In this work, we formalize the problem of creating a protected account G' of a graph G. We provide a utility measure to compare the informativeness of alternate protected accounts and an opacity measure for protected accounts, which indicates the likelihood that an attacker can recreate the topology of the original graph from the protected account. We provide an algorithm to create a maximally useful protected account of a sensitive graph, and show through evaluation with the PLUS prototype that using surrogates and protected accounts adds value for the user, with no significant impact on the time required to generate results for graph queries. Comment: VLDB201
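
    A stripped-down illustration of the surrogate idea (not the paper's algorithm, nor its utility and opacity measures): swap a sensitive node for a less sensitive surrogate while keeping its edges, so path queries still succeed. The node names and attributes below are invented.

```python
# Simplified sketch using networkx; names and attributes are hypothetical.
import networkx as nx

G = nx.DiGraph()
G.add_edges_from([("raw_data", "classified_process"), ("classified_process", "report")])
G.nodes["classified_process"]["owner"] = "agency-x"  # sensitive attribute

def protect(graph, sensitive, surrogate):
    """Return a protected account G' with `sensitive` replaced by a surrogate node."""
    protected = nx.relabel_nodes(graph, {sensitive: surrogate}, copy=True)
    protected.nodes[surrogate].clear()  # strip the sensitive attributes
    return protected

G_prime = protect(G, "classified_process", "surrogate:process-1")
print(list(nx.all_simple_paths(G_prime, "raw_data", "report")))
# [['raw_data', 'surrogate:process-1', 'report']] -- the path is still answerable
```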

    Provenance management in curated databases

    Curated databases in bioinformatics and other disciplines are the result of a great deal of manual annotation, correction and transfer of data from other sources. Provenance information concerning the creation, attribution, or version history of such data is crucial for assessing its integrity and scientific value. General-purpose database systems provide little support for tracking provenance, especially when data moves among databases. This paper investigates general-purpose techniques for recording provenance for data that is copied among databases. We describe an approach in which we track the user’s actions while browsing source databases and copying data into a curated database, in order to record the user’s actions in a convenient, queryable form. We present an implementation of this technique and use it to evaluate the feasibility of database support for provenance management. Our experiments show that although the overhead of a naïve approach is fairly high, it can be decreased to an acceptable level using simple optimizations.
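
    As a hypothetical sketch of recording copy provenance at the application level (this is not the implementation evaluated in the paper), each paste into the curated database can store a link back to where the value was copied from; skipping duplicate links hints at the kind of simple optimization mentioned above. Database names and paths are made up.

```python
# Illustrative copy-provenance store; identifiers are invented.
from collections import defaultdict

copy_links = defaultdict(list)  # target path -> [(source_db, source_path), ...]

def record_copy(source_db, source_path, target_path):
    link = (source_db, source_path)
    if link not in copy_links[target_path]:  # naive de-duplication
        copy_links[target_path].append(link)

record_copy("source_db_A", "/protein/X/function", "/curated/geneY/summary")
record_copy("source_db_A", "/protein/X/function", "/curated/geneY/summary")  # ignored

print(copy_links["/curated/geneY/summary"])
# [('source_db_A', '/protein/X/function')]
```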

    Dataset search: a survey

    Generating value from data requires the ability to find, access and make sense of datasets. There are many efforts underway to encourage data sharing and reuse, from scientific publishers asking authors to submit data alongside manuscripts, to data marketplaces, open data portals and data communities. Google recently beta-released a search service for datasets, which allows users to discover data stored in various online repositories via keyword queries. These developments foreshadow an emerging research field around dataset search or retrieval that broadly encompasses frameworks, methods and tools that help match a user's data need against a collection of datasets. Here, we survey the state of the art of research and commercial systems in dataset retrieval. We identify what makes dataset search a research field in its own right, with unique challenges and methods, and highlight open problems. We look at approaches and implementations from related areas that dataset search draws upon, including information retrieval, databases, and entity-centric and tabular search, in order to identify possible paths to resolve these open problems as well as immediate next steps that will take the field forward. Comment: 20 pages, 153 references.
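
    As a minimal, hypothetical illustration of the keyword-based matching the survey describes, the sketch below ranks a handful of invented dataset descriptions against a query using TF-IDF cosine similarity with scikit-learn; real systems of course go well beyond this.

```python
# Toy keyword search over invented dataset descriptions; not from the survey.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

datasets = {
    "city-air-quality": "Hourly air quality and pollution measurements for cities",
    "daily-temperatures": "Daily surface temperature observations from weather stations",
    "movie-reviews": "Movie review text labelled with sentiment",
}
query = "urban air pollution measurements"

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(datasets.values())
scores = cosine_similarity(vectorizer.transform([query]), doc_vectors).ravel()

for name, score in sorted(zip(datasets, scores), key=lambda item: -item[1]):
    print(f"{score:.3f}  {name}")
```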

    Computational Notebooks as Co-Design Tools: Engaging Young Adults Living with Diabetes, Family Carers, and Clinicians with Machine Learning Models

    Engaging end user groups with machine learning (ML) models can help align the design of predictive systems with people’s needs and expectations. We present a co-design study investigating the benefits and challenges of using computational notebooks to inform ML models with end user groups. We used a computational notebook to engage young adults, carers, and clinicians with an example ML model that predicted health risk in diabetes care. Through co-design workshops and retrospective interviews, we found that participants particularly valued using the interactive data visualisations of the computational notebook to scaffold multidisciplinary learning, anticipate benefits and harms of the example ML model, and create fictional feature importance plots to highlight care needs. Participants also reported challenges, from running code cells to managing information asymmetries and power imbalances. We discuss the potential of leveraging computational notebooks as interactive co-design tools to meet end user needs early in ML model lifecycles.
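
    The "fictional feature importance plots" mentioned above are discussion artefacts rather than model outputs. A minimal sketch of what such a notebook cell could look like follows; the feature names and values are invented, not taken from the study.

```python
# Fictional, discussion-only feature importances; nothing here comes from a model.
import matplotlib.pyplot as plt

features = ["time in range", "missed appointments", "carer support", "HbA1c trend"]
fictional_importance = [0.15, 0.20, 0.30, 0.35]  # invented values for co-design talk

plt.barh(features, fictional_importance)
plt.xlabel("relative importance (fictional)")
plt.title("What should the model pay attention to?")
plt.tight_layout()
plt.show()
```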