
    Making Digital Artifacts on the Web Verifiable and Reliable

    The current Web has no general mechanisms to make digital artifacts --- such as datasets, code, texts, and images --- verifiable and permanent. For digital artifacts that are supposed to be immutable, there is moreover no commonly accepted method to enforce this immutability. These shortcomings have a serious negative impact on the ability to reproduce the results of processes that rely on Web resources, which in turn heavily impacts areas such as science, where reproducibility is important. To solve this problem, we propose trusty URIs containing cryptographic hash values. We show how trusty URIs can be used for the verification of digital artifacts, in a manner that is independent of the serialization format in the case of structured data files such as nanopublications. We demonstrate how the contents of these files become immutable, including their dependencies on external digital artifacts, thereby extending the range of verifiability to the entire reference tree. Our approach sticks to the core principles of the Web, namely openness and decentralized architecture, and is fully compatible with existing standards and protocols. Evaluation of our reference implementations shows that these design goals are indeed accomplished by our approach, and that it remains practical even for very large files.
    Comment: Extended version of conference paper: arXiv:1401.577
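
    The abstract describes hash-embedding URIs only at a high level, so the following is a minimal sketch of the general idea rather than the published trusty URI specification: append a URL-safe base64 SHA-256 digest of an artifact's bytes to its URI, and verify later by recomputing the digest. The separator character, digest choice, and function names are illustrative assumptions; format-independent verification of structured data such as RDF would additionally require canonicalizing the content before hashing, which this sketch omits.

    ```python
    import base64
    import hashlib
    from pathlib import Path

    def content_digest(path: str) -> str:
        """URL-safe base64 SHA-256 digest of a file's raw bytes (illustrative, not the trusty URI spec)."""
        digest = hashlib.sha256(Path(path).read_bytes()).digest()
        return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")

    def make_hash_uri(base_uri: str, path: str) -> str:
        """Mint a hash-bearing URI by appending the content digest (the '.' separator is an assumption)."""
        return f"{base_uri}.{content_digest(path)}"

    def verify(uri: str, path: str) -> bool:
        """Re-hash the file and compare with the digest embedded at the end of the URI."""
        return uri.rsplit(".", 1)[-1] == content_digest(path)

    # Hypothetical usage:
    #   uri = make_hash_uri("https://example.org/nanopub1", "nanopub1.trig")
    #   verify(uri, "nanopub1.trig")  # True until the file's bytes change
    ```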

    Theory and Practice of Data Citation

    Citations are the cornerstone of knowledge propagation and the primary means of assessing the quality of research, as well as directing investments in science. Science is increasingly becoming "data-intensive", where large volumes of data are collected and analyzed to discover complex patterns through simulations and experiments, and most scientific reference works have been replaced by online curated datasets. Yet, given a dataset, there is no quantitative, consistent, and established way of knowing how it has been used over time, who contributed to its curation, what results it has yielded, or what value it has. The development of a theory and practice of data citation is fundamental for considering data as first-class research objects with the same relevance and centrality as traditional scientific products. Many works in recent years have discussed data citation from different viewpoints: illustrating why data citation is needed, defining principles and outlining recommendations for data citation systems, and providing computational methods for addressing specific issues of data citation. The current panorama is many-faceted, and an overall view that brings together the diverse aspects of this topic is still missing. Therefore, this paper aims to describe the lay of the land for data citation, both from the theoretical (the why and what) and the practical (the how) angle.
    Comment: 24 pages, 2 tables, pre-print accepted in Journal of the Association for Information Science and Technology (JASIST), 201

    Research Data: Who will share what, with whom, when, and why?

    The deluge of scientific research data has excited the general public, as well as the scientific community, with the possibilities for better understanding of scientific problems, from climate to culture. For data to be available, researchers must be willing and able to share them. The policies of governments, funding agencies, journals, and university tenure and promotion committees also influence how, when, and whether research data are shared. Data are complex objects. Their purposes and the methods by which they are produced vary widely across scientific fields, as do the criteria for sharing them. To address these challenges, it is necessary to examine the arguments for sharing data and how those arguments match the motivations and interests of the scientific community and the public. Four arguments are examined: to make the results of publicly funded research available to the public, to enable others to ask new questions of extant data, to advance the state of science, and to reproduce research. Libraries need to consider their role in the face of each of these arguments, and what expertise and systems they require for data curation.

    Data Management Roles for Librarians

    In this Chapter:
    ● Looking at data through different lenses
    ● Exploring the range of data use and data support
    ● Using data as the basis for informed decision making
    ● Treating data as a legitimate scholarly research product

    Master of Science

    Multivariate assays that use gene expression measurements as their contributing factors, such as the centroid-based PAM50 Breast Cancer Intrinsic Classifier, are becoming commonly used to assist treatment decisions in medicine, especially in oncology. Although physicians may rely on these multivariate assays for planning treatment, little is known about how the intrinsic error in the laboratory process of measuring the contributing factors affects an assay's results. While we expect that classification of samples in proximity to one of the centroids defining the tumor classes will be stable with respect to experimental errors in the gene expression measurements, what happens to samples not in proximity to a single centroid is unknown. Results reported to the attending physician may be misleading because he or she receives no information about the probability of sample misclassification. Given the serious consequences of ambiguous results in clinical classifications, methods to measure the effects of a multivariate assay's intrinsic errors need to be established and communicated to attending physicians. In this study, a method to characterize the technical uncertainty in the classification of centroid-based multivariate assays is developed and described, using the PAM50 Breast Cancer Intrinsic Classifier as the model multivariate assay. Furthermore, the described method provides a general and individual classification confidence measurement that advances multivariate assays towards personalized healthcare by providing personalized confidence measurements on the assay's result. Finally, this study explores whether parametric or nonparametric distance measurements are more effective when using a single gene expression platform, such as microarray or real-time quantitative polymerase chain reaction.
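
    The abstract does not spell out the procedure, so the following is a hedged sketch of one common way to estimate per-sample classification confidence for a centroid-based classifier: perturb the measured expression profile with simulated technical noise and record how often the subtype call changes. Nearest-centroid assignment by Euclidean distance, Gaussian noise, and all names below are illustrative assumptions, not the method developed in the thesis. In a real setting, the noise level would be estimated from technical replicates on the platform in question.

    ```python
    import numpy as np

    def nearest_centroid(sample: np.ndarray, centroids: np.ndarray) -> int:
        """Index of the closest centroid by Euclidean distance (an illustrative
        stand-in for whichever distance a real centroid-based assay uses)."""
        return int(np.argmin(np.linalg.norm(centroids - sample, axis=1)))

    def classification_confidence(sample, centroids, noise_sd, n_draws=10_000, seed=0):
        """Monte Carlo estimate of call stability under technical measurement error.

        noise_sd models per-gene technical error; the return value is the fraction of
        noise-perturbed replicates that receive the same call as the unperturbed sample.
        """
        rng = np.random.default_rng(seed)
        base_call = nearest_centroid(sample, centroids)
        perturbed = sample + rng.normal(0.0, noise_sd, size=(n_draws, sample.size))
        calls = np.array([nearest_centroid(p, centroids) for p in perturbed])
        return float(np.mean(calls == base_call))

    # Hypothetical example: 3 subtypes over 5 genes, with a sample near a class boundary.
    centroids = np.array([[1.0, 0.0, 0.0, 0.0, 0.0],
                          [0.0, 1.0, 0.0, 0.0, 0.0],
                          [0.0, 0.0, 1.0, 0.0, 0.0]])
    sample = np.array([0.6, 0.5, 0.1, 0.0, 0.0])
    print(classification_confidence(sample, centroids, noise_sd=0.2))  # noticeably below 1.0
    ```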

    Collaborative Cloud Computing Framework for Health Data with Open Source Technologies

    The proliferation of sensor technologies and advancements in data collection methods have enabled the accumulation of very large amounts of data. Increasingly, these datasets are considered for scientific research. However, designing a system architecture that achieves high performance in terms of parallelization, query processing time, and aggregation of heterogeneous data types (e.g., time series, images, and structured data, among others), while also making scientific research easy to reproduce, remains a major challenge. This is especially true for health sciences research, where systems must be i) easy to use, with the flexibility to manipulate data at the most granular level, ii) agnostic of the programming language kernel, iii) scalable, and iv) compliant with the HIPAA privacy law. In this paper, we review the existing literature on such big data systems for scientific research in health sciences and identify the gaps in the current system landscape. We propose a novel architecture for a software-hardware-data ecosystem using open source technologies such as Apache Hadoop, Kubernetes, and JupyterHub in a distributed environment. We also evaluate the system using a large clinical data set of 69M patients.
    Comment: This paper is accepted in ACM-BCB 202
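
    The abstract names the building blocks (Hadoop storage, Kubernetes orchestration, JupyterHub notebooks) but gives no code, so the following is a small hedged sketch of the kind of notebook-driven, parallel query such a platform is meant to serve. PySpark, the HDFS path, and the column names are illustrative assumptions rather than details of the proposed system.

    ```python
    from pyspark.sql import SparkSession

    # In a JupyterHub notebook on the cluster, a session like this would typically be
    # pre-configured; building it explicitly keeps the sketch self-contained.
    spark = SparkSession.builder.appName("cohort-aggregation-example").getOrCreate()

    # Hypothetical de-identified encounter table stored as Parquet on HDFS.
    encounters = spark.read.parquet("hdfs:///warehouse/encounters")

    # Granular filtering followed by a distributed aggregation over the full dataset.
    cohort_counts = (
        encounters
        .filter(encounters.age >= 65)
        .groupBy("diagnosis_code")
        .count()
        .orderBy("count", ascending=False)
    )

    cohort_counts.show(20)
    ```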