6,970 research outputs found

Making Digital Artifacts on the Web Verifiable and Reliable

Author: Dumontier Michel
Kuhn Tobias
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2015

The current Web has no general mechanisms to make digital artifacts (such as datasets, code, texts, and images) verifiable and permanent. For digital artifacts that are supposed to be immutable, there is moreover no commonly accepted method to enforce this immutability. These shortcomings have a serious negative impact on the ability to reproduce the results of processes that rely on Web resources, which in turn heavily impacts areas such as science where reproducibility is important. To solve this problem, we propose trusty URIs containing cryptographic hash values. We show how trusty URIs can be used for the verification of digital artifacts, in a manner that is independent of the serialization format in the case of structured data files such as nanopublications. We demonstrate how the contents of these files become immutable, including dependencies on external digital artifacts, thereby extending the range of verifiability to the entire reference tree. Our approach sticks to the core principles of the Web, namely openness and decentralized architecture, and is fully compatible with existing standards and protocols. Evaluation of our reference implementations shows that these design goals are indeed accomplished by our approach, and that it remains practical even for very large files. Comment: Extended version of conference paper: arXiv:1401.577

arXiv.org e-Print Archive

Maastricht University Research Portal

VU Research Portal

Crossref

Theory and Practice of Data Citation

Author: Silvello Gianmaria
Publication venue: 'Wiley'
Publication date: 24/06/2017

Citations are the cornerstone of knowledge propagation and the primary means of assessing the quality of research, as well as directing investments in science. Science is increasingly becoming "data-intensive", where large volumes of data are collected and analyzed to discover complex patterns through simulations and experiments, and most scientific reference works have been replaced by online curated datasets. Yet, given a dataset, there is no quantitative, consistent and established way of knowing how it has been used over time, who contributed to its curation, what results have been yielded or what value it has. The development of a theory and practice of data citation is fundamental for considering data as first-class research objects with the same relevance and centrality as traditional scientific products. Many works in recent years have discussed data citation from different viewpoints: illustrating why data citation is needed, defining the principles and outlining recommendations for data citation systems, and providing computational methods for addressing specific issues of data citation. The current panorama is many-faceted and an overall view that brings together diverse aspects of this topic is still missing. Therefore, this paper aims to describe the lay of the land for data citation, both from the theoretical (the why and what) and the practical (the how) angle. Comment: 24 pages, 2 tables, pre-print accepted in Journal of the Association for Information Science and Technology (JASIST), 201

arXiv.org e-Print Archive

Crossref

Archivio istituzionale della ricerca - Università di Padova

Research Data: Who will share what, with whom, when, and why?

Author: Christine L. Borgman

The deluge of scientific research data has excited the general public, as well as the scientific community, with the possibilities for better understanding of scientific problems, from climate to culture. For data to be available, researchers must be willing and able to share them. The policies of governments, funding agencies, journals, and university tenure and promotion committees also influence how, when, and whether research data are shared. Data are complex objects. Their purposes and the methods by which they are produced vary widely across scientific fields, as do the criteria for sharing them. To address these challenges, it is necessary to examine the arguments for sharing data and how those arguments match the motivations and interests of the scientific community and the public. Four arguments are examined: to make the results of publicly funded data available to the public, to enable others to ask new questions of extant data, to advance the state of science, and to reproduce research. Libraries need to consider their role in the face of each of these arguments, and what expertise and systems they require for data curation.

Research Papers in Economics

Data Management Roles for Librarians

Author: Henderson Margaret E.
Publication venue: VCU Scholars Compass
Publication date: 01/01/2016

In this Chapter:
● Looking at data through different lenses
● Exploring the range of data use and data support
● Using data as the basis for informed decision making
● Treating data as a legitimate scholarly research product

VCU Scholars Compass

Master of Science

Author: Ebbert Mark Tyler Wilkinson
Publication venue: University of Utah
Publication date: 01/08/2012

Multivariate assays using gene expression as their contributing factors, such as the centroid-based PAM50 Breast Cancer Intrinsic Classifier, are becoming commonly used in assisting treatment decisions in medicine, especially in oncology. Although physicians may rely on these multivariate assays for planning treatment, little is known about the effects on the results of an assay due to the intrinsic error in the laboratory process and measuring its contributing factors. While we expect that classification of samples in proximity to one of the centroids defining the tumor classes will be stable with respect to experimental errors in the gene expression measurements, what happens to the samples not in proximity to a single centroid is unknown. Results reported to the attending physician may be misleading because he or she is receiving no information about the probability for sample misclassification. Given the serious consequences due to ambiguous results in clinical classifications, methods to measure the effects of a multivariate assay's intrinsic errors need to be established and communicated to attending physicians. In this study, a method to characterize the technical uncertainty in the classification of centroid-based multivariate assays, is developed and described, using the PAM50 Breast Cancer Intrinsic Classifier as the model multivariate assay. Furthermore, the described method provides a general and individual classification confidence measurement that advances multivariate assays towards personalized healthcare by providing personalized confidence measurements on the assay's result. Finally, this study explores whether using parametric versus nonparametric distance measurements is most effective when using a single gene expression platform, such as microarray or Real-time, quantitative Polymerase Chain Reaction

The University of Utah: J. Willard Marriott Digital Library

Nanoinformatics 2010 Program

Author: Baker Nathan A
Chaka Anne
Cohen Yoram
Colvin Vicki
Fritts Martin
Geraci Charles L.
Hoover Mark D
Ku Sharon
Kulinowski Kristen M
Lippell Phil
Luo James
McLennan Michael
Morse Jeffrey
Ostraat Michele L
Rajan Krishna
Reznik-Zellen Rebecca
Schad Peter
Tuominen Mark T.
Publication date: 01/11/2010

InterNano Nanomanufacturing Repository

Collaborative Cloud Computing Framework for Health Data with Open Source Technologies

Author: Bisong Ekaba
Miao Zhuqi
Scheufele Elisabeth
Weil Sage A
Winn Peter A
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 20/07/2020

The proliferation of sensor technologies and advancements in data collection methods have enabled the accumulation of very large amounts of data. Increasingly, these datasets are considered for scientific research. However, the design of the system architecture to achieve high performance in terms of parallelization, query processing time, aggregation of heterogeneous data types (e.g., time series, images, structured data, among others), and difficulty in reproducing scientific research remain a major challenge. This is specifically true for health sciences research, where the systems must be i) easy to use with the flexibility to manipulate data at the most granular level, ii) agnostic of programming language kernel, iii) scalable, and iv) compliant with the HIPAA privacy law. In this paper, we review the existing literature for such big data systems for scientific research in health sciences and identify the gaps of the current system landscape. We propose a novel architecture for software-hardware-data ecosystem using open source technologies such as Apache Hadoop, Kubernetes and JupyterHub in a distributed environment. We also evaluate the system using a large clinical data set of 69M patients. Comment: This paper is accepted in ACM-BCB 202

arXiv.org e-Print Archive

Crossref