Data, Responsibly: Fairness, Neutrality and Transparency in Data Analysis
ABSTRACT Big data technology holds incredible promise of improving people's lives, accelerating scientific discovery and innovation, and bringing about positive societal change. Yet, if not used responsibly, this technology can propel economic inequality, destabilize global markets and affirm systemic bias. While the potential benefits of big data are well-accepted, the importance of using these techniques in a fair and transparent manner is rarely considered. The primary goal of this tutorial is to draw the attention of the data management community to the important emerging subject of responsible data management and analysis. We will offer our perspective on the issue, give an overview of existing technical work, primarily from the data mining and algorithms communities, and motivate future research directions.
Provenance and Probabilities in Relational Databases: From Theory to Practice
We review the basics of data provenance in relational databases. We describe different provenance formalisms, from Boolean provenance to provenance semirings and beyond, that can be used for a wide variety of purposes, to obtain additional information on the output of a query. We discuss representation systems for data provenance, circuits in particular, with a focus on practical implementation. Finally, we explain how provenance is practically used for probabilistic query evaluation in probabilistic databases.
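The semiring idea mentioned above can be illustrated with a small sketch: each input tuple carries an abstract annotation (a token), and annotations propagate through a query by multiplying when tuples are used jointly and adding when alternative derivations exist. The relations, tokens, and query below are illustrative, not taken from the paper.

```python
# Minimal sketch of semiring-style how-provenance for a join query.
# Each tuple of R(a, b) and S(b, c) is annotated with a token; the
# provenance of an output tuple is a polynomial over those tokens:
# product = tuples used together, sum = alternative derivations.
from collections import defaultdict

R = [((1, 2), "r1"), ((1, 3), "r2")]   # R(a, b) with annotations
S = [((2, 5), "s1"), ((3, 5), "s2")]   # S(b, c) with annotations

# Query: q(a, c) :- R(a, b), S(b, c).
prov = defaultdict(list)
for (a, b), tr in R:
    for (b2, c), ts in S:
        if b == b2:
            prov[(a, c)].append(f"{tr}*{ts}")  # one derivation

result = {t: " + ".join(terms) for t, terms in prov.items()}
print(result)  # {(1, 5): 'r1*s1 + r2*s2'}
```

Here the output tuple (1, 5) has two independent derivations, so its provenance polynomial is a sum of two products; specializing the semiring (e.g. to Booleans or probabilities) recovers different provenance notions from the same polynomial.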
Provenance Tools
The importance of provenance has arisen across all kinds of sciences in recent years. During research on data provenance, several tools have been developed to use provenance in a practical way. We chose seven of those tools and exhaustively tested five of them: Trio, ORCHESTRA, Perm, GProM, and ProvSQL. In this article, we first introduce the basics of data provenance, especially where-, why-, and how-provenance. After that, we present the results of our tool tests.
PrIU: A Provenance-Based Approach for Incrementally Updating Regression Models
The ubiquitous use of machine learning algorithms brings new challenges to traditional database problems such as incremental view update. Much effort is being put into better understanding and debugging machine learning models, as well as into identifying and repairing errors in training datasets. Our focus is on how to assist these activities when the machine learning model must be retrained after removing problematic training samples during cleaning, or after selecting different subsets of training data for interpretability. This paper presents an efficient provenance-based approach, PrIU, and its optimized version, PrIU-opt, for incrementally updating model parameters without sacrificing prediction accuracy. We prove the correctness and convergence of the incrementally updated model parameters, and validate them experimentally. Experimental results show that PrIU-opt achieves speed-ups of up to two orders of magnitude compared to simply retraining the model from scratch, while obtaining highly similar models. (28 pages; published in the 2020 ACM SIGMOD International Conference on Management of Data, SIGMOD 2020.)
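The core intuition behind updating a model rather than retraining it can be sketched for a simple case. This is not the PrIU algorithm itself (PrIU targets iteratively trained regression models); it is a hedged illustration using ordinary least squares, where the model depends on the training data only through sufficient statistics, so deleted rows can be subtracted out instead of recomputing from scratch.

```python
# Illustration (not PrIU itself): for ordinary least squares, the
# solution depends on the data only through X^T X and X^T y, so
# removing training rows amounts to subtracting their contributions
# ("downdating") and re-solving a small linear system.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

XtX = X.T @ X          # sufficient statistics of the full dataset
Xty = X.T @ y

remove = [3, 17, 42]   # indices of problematic samples to delete
Xr, yr = X[remove], y[remove]
theta_inc = np.linalg.solve(XtX - Xr.T @ Xr,   # downdated statistics
                            Xty - Xr.T @ yr)

# Sanity check: matches retraining on only the remaining rows.
keep = np.setdiff1d(np.arange(100), remove)
theta_full = np.linalg.solve(X[keep].T @ X[keep], X[keep].T @ y[keep])
print(np.allclose(theta_inc, theta_full))  # True
```

The speed-up comes from the downdate costing time proportional to the number of deleted rows rather than the full dataset; PrIU generalizes this kind of reuse to gradient-based training via provenance tracking.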
Conceptual Modeling of Data with Provenance
Traditional database systems manage data, but often do not address its provenance. In the past, users were often implicitly familiar with data they used, how it was created (and hence how it might be appropriately used), and from which sources it came. Today, users may be physically and organizationally remote from the data they use, so this information may not be easily accessible to them. In recent years, several models have been proposed for recording provenance of data. Our work is motivated by opportunities to make provenance easy to manage and query. For example, current approaches model provenance as expressions that may be easily stored alongside data, but are difficult to parse and reconstruct for querying, and are difficult to query with available languages. We contribute a conceptual model for data and provenance, and evaluate how well it addresses these opportunities. We compare the expressive power of our model's language to that of other models. We also define a benchmark suite with which to study performance of our model, and use this suite to study key model aspects implemented on existing software platforms. We discover some salient performance bottlenecks in these implementations, and suggest future work to explore improvements. Finally, we show that our implementations can comprise a logical model that faithfully supports our conceptual model.
Content sensitivity based access control model for big data
Big data technologies have seen tremendous growth in recent years. They are being widely used in both industry and academia. In spite of such exponential growth, these technologies lack adequate measures to protect the data from misuse or abuse. Corporations that collect data from multiple sources are at risk of liabilities due to exposure of sensitive information. In the current implementation of Hadoop, only file-level access control is feasible. Giving users the ability to access data based on attributes in a dataset or based on their role is complicated by the sheer volume and multiple formats (structured, unstructured and semi-structured) of data. In this dissertation, an access control framework that enforces access control policies dynamically based on the sensitivity of the data is proposed. This framework enforces access control policies by harnessing the data context, usage patterns and information sensitivity. Information sensitivity changes over time with the addition and removal of datasets, which can lead to modifications in the access control decisions, and the proposed framework accommodates these changes. The proposed framework is automated to a large extent and requires minimal user intervention. The experimental results show that the proposed framework is capable of enforcing access control policies on non-multimedia datasets with minimal overhead.