2,050 research outputs found
Designing Traceability into Big Data Systems
Providing an appropriate level of accessibility and traceability to data or
process elements (so-called Items) in large volumes of data, often
Cloud-resident, is an essential requirement in the Big Data era.
Enterprise-wide data systems need to be designed from the outset to support
usage of such Items across the spectrum of business use rather than from any
specific application view. The design philosophy advocated in this paper is to
drive the design process using a so-called description-driven approach which
enriches models with meta-data and description and focuses the design process
on Item re-use, thereby promoting traceability. Details are given of the
description-driven design of big data systems at CERN, in health informatics
and in business process management. Evidence is presented that the approach
leads to design simplicity and consequent ease of management thanks to loose
typing and the adoption of a unified approach to Item management and usage.Comment: 10 pages; 6 figures in Proceedings of the 5th Annual International
Conference on ICT: Big Data, Cloud and Security (ICT-BDCS 2015), Singapore
July 2015. arXiv admin note: text overlap with arXiv:1402.5764,
arXiv:1402.575
Data provenance tracking as the basis for a biomedical virtual research environment
In complex data analyses it is increasingly important to capture information about the usage of data sets in addition to their preservation over time to ensure reproducibility of results, to verify the work of others and to ensure appropriate conditions data have been used for specific analyses. Scientific workflow based studies are beginning to realize the benefit of capturing this provenance of data and the activities used to process, transform and carry out studies on those data. This is especially true in biomedicine where the collection of data through experiment is costly and/or difficult to reproduce and where that data needs to be preserved over time. One way to support the development of workflows and their use in (collaborative) biomedical analyses is through the use of a Virtual Research Environment. The dynamic and distributed nature of Grid/Cloud computing, however, makes the capture and processing of provenance information a major research challenge. Furthermore most workflow provenance management services are designed only for data-flow oriented workflows and researchers are now realising that tracking data or workflows alone or separately is insufficient to support the scientific process. What is required for collaborative research is traceable and reproducible provenance support in a full orchestrated Virtual Research Environment (VRE) that enables researchers to define their studies in terms of the datasets and processes used, to monitor and visualize the outcome of their analyses and to log their results so that others users can call upon that acquired knowledge to support subsequent studies. We have extended the work carried out in the neuGRID and N4U projects in providing a so-called Virtual Laboratory to provide the foundation for a generic VRE in which sets of biomedical data (images, laboratory test results, patient records, epidemiological analyses etc.) and the workflows (pipelines) used to process those data, together with their provenance data and results sets are captured in the CRISTAL software. This paper outlines the functionality provided for a VRE by the Open Source CRISTAL software and examines how that can provide the foundations for a practice-based knowledge base for biomedicine and, potentially, for a wider research community
Establishing Data Provenance for Responsible Artificial Intelligence Systems
Data provenance, a record that describes the origins and processing of data, offers new promises in the increasingly important role of artificial intelligence (AI)-based systems in guiding human decision making. To avoid disastrous outcomes that can result from bias-laden AI systems, responsible AI builds on four important characteristics: fairness, accountability, transparency, and explainability. To stimulate further research on data provenance that enables responsible AI, this study outlines existing biases and discusses possible implementations of data provenance to mitigate them. We first review biases stemming from the data’s origins and pre-processing. We then discuss the current state of practice, the challenges it presents, and corresponding recommendations to address them. We present a summary highlighting how our recommendations can help establish data provenance and thereby mitigate biases stemming from the data’s origins and pre-processing to realize responsible AI-based systems. We conclude with a research agenda suggesting further research avenues
Designing Reusable Systems that Can Handle Change - Description-Driven Systems : Revisiting Object-Oriented Principles
In the age of the Cloud and so-called Big Data systems must be increasingly
flexible, reconfigurable and adaptable to change in addition to being developed
rapidly. As a consequence, designing systems to cater for evolution is becoming
critical to their success. To be able to cope with change, systems must have
the capability of reuse and the ability to adapt as and when necessary to
changes in requirements. Allowing systems to be self-describing is one way to
facilitate this. To address the issues of reuse in designing evolvable systems,
this paper proposes a so-called description-driven approach to systems design.
This approach enables new versions of data structures and processes to be
created alongside the old, thereby providing a history of changes to the
underlying data models and enabling the capture of provenance data. The
efficacy of the description-driven approach is exemplified by the CRISTAL
project. CRISTAL is based on description-driven design principles; it uses
versions of stored descriptions to define various versions of data which can be
stored in diverse forms. This paper discusses the need for capturing holistic
system description when modelling large-scale distributed systems.Comment: 8 pages, 1 figure and 1 table. Accepted by the 9th Int Conf on the
Evaluation of Novel Approaches to Software Engineering (ENASE'14). Lisbon,
Portugal. April 201
Distributed Ledger for Provenance Tracking of Artificial Intelligence Assets
High availability of data is responsible for the current trends in Artificial
Intelligence (AI) and Machine Learning (ML). However, high-grade datasets are
reluctantly shared between actors because of lacking trust and fear of losing
control. Provenance tracing systems are a possible measure to build trust by
improving transparency. Especially the tracing of AI assets along complete AI
value chains bears various challenges such as trust, privacy, confidentiality,
traceability, and fair remuneration. In this paper we design a graph-based
provenance model for AI assets and their relations within an AI value chain.
Moreover, we propose a protocol to exchange AI assets securely to selected
parties. The provenance model and exchange protocol are then combined and
implemented as a smart contract on a permission-less blockchain. We show how
the smart contract enables the tracing of AI assets in an existing industry use
case while solving all challenges. Consequently, our smart contract helps to
increase traceability and transparency, encourages trust between actors and
thus fosters collaboration between them
- …