4 research outputs found

    DataPackageR: Reproducible data preprocessing, standardization and sharing using R/Bioconductor for collaborative data analysis [version 2; referees: 2 approved, 1 approved with reservations]

    Get PDF
    A central tenet of reproducible research is that scientific results are published along with the underlying data and software code necessary to reproduce and verify the findings. A host of tools and software have been released that facilitate such work-flows, and scientific journals have increasingly demanded that code and primary data be made available with publications. However, there has been little practical advice on implementing reproducible research work-flows for large ‘omics’ or systems biology data sets used by teams of analysts working in collaboration. In such instances it is important to ensure that all analysts use the same version of a data set for their analyses. Yet instantiating relational databases and standard operating procedures can be unwieldy, with high "startup" costs and poor adherence to procedures when they deviate substantially from an analyst’s usual work-flow. Ideally, a reproducible research work-flow should fit naturally into an individual’s existing work-flow, with minimal disruption. Here, we provide an overview of how we have leveraged popular open source tools, including Bioconductor, Rmarkdown, git version control, and R, specifically R’s package system, combined with a new tool, DataPackageR, to implement a lightweight reproducible research work-flow for preprocessing large data sets, suitable for sharing among small-to-medium sized teams of computational scientists. Our primary contribution is the DataPackageR tool, which decouples time-consuming data processing from data analysis while leaving a traceable record of how raw data are processed into analysis-ready data sets. The software ensures that packaged data objects are properly documented, performs checksum verification of these objects along with basic package version management, and, importantly, leaves a record of the data processing code in the form of package vignettes. Our group has implemented this work-flow to manage, analyze and report on pre-clinical immunological trial data from multi-center, multi-assay studies for the past three years.
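    As a concrete illustration of the package-based work-flow the abstract describes, the sketch below shows how a data package might be scaffolded and built with DataPackageR. The package name ("StudyData"), processing script ("preprocess_assays.Rmd"), and data object name ("assay_summary") are hypothetical, and the argument names follow the DataPackageR documentation but may differ across versions; treat this as a minimal sketch rather than the authors’ exact procedure.

    # Minimal sketch (R): scaffold and build a versioned data package with
    # DataPackageR. Names below are illustrative, not taken from the paper.
    library(DataPackageR)

    # Write a toy processing script; a real script would read raw assay files
    # and produce analysis-ready objects.
    rmd <- file.path(tempdir(), "preprocess_assays.Rmd")
    writeLines(c(
      "```{r}",
      "assay_summary <- data.frame(sample = c('s1', 's2'), response = c(0.42, 0.61))",
      "```"
    ), rmd)

    # Create the package skeleton, registering the processing script and the
    # analysis-ready object it is expected to produce.
    datapackage_skeleton(
      name           = "StudyData",
      path           = tempdir(),
      code_files     = rmd,
      r_object_names = "assay_summary"
    )

    # Run the processing code, store the documented object in the package,
    # record checksums and a data version, and keep the rendered script as a
    # vignette tracing how the raw data were turned into the packaged data set.
    package_build(file.path(tempdir(), "StudyData"))

    # Analysts would then install the built package and call
    # data("assay_summary", package = "StudyData") to load the shared, versioned data.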

    Ancillary study management systems: a review of needs

    Get PDF
    BACKGROUND: The valuable clinical data, specimens, and assay results collected during a primary clinical trial or observational study can enable researchers to answer additional, pressing questions with relatively small investments in new measurements. However, management of such follow-on, “ancillary” studies is complex. It requires coordinating across institutions, sites, repositories, and approval boards, as well as distributing, integrating, and analyzing diverse data types. General-purpose software systems that simplify the management of ancillary studies have not yet been explored in the research literature.

    METHODS: We have identified requirements for ancillary study management primarily as part of our ongoing work with a number of large research consortia. These organizations include the Center for HIV/AIDS Vaccine Immunology (CHAVI), the Immune Tolerance Network (ITN), the HIV Vaccine Trials Network (HVTN), the U.S. Military HIV Research Program (MHRP), and the Network for Pancreatic Organ Donors with Diabetes (nPOD). We also consulted with researchers at a range of other disease research organizations regarding their workflows and data management strategies. Lastly, to enhance breadth, we reviewed process documents for ancillary study management from other organizations.

    RESULTS: By exploring characteristics of ancillary studies, we identify differentiating requirements and scenarios for ancillary study management systems (ASMSs). Distinguishing characteristics of ancillary studies may include the collection of additional measurements (particularly new analyses of existing specimens); the initiation of studies by investigators unaffiliated with the original study; cross-protocol data pooling and analysis; pre-existing participant consent; and pre-existing data context and provenance. For an ASMS to address these characteristics, it would need to address both operational requirements (e.g., allocating existing specimens) and data management requirements (e.g., securely distributing and integrating primary and ancillary data).

    CONCLUSIONS: The scenarios and requirements we describe can help guide the development of systems that make conducting ancillary studies easier, less expensive, and less error-prone. Given the relatively consistent characteristics and challenges of ancillary study management, general-purpose ASMSs are likely to be useful to a wide range of organizations. Using the requirements identified in this paper, we are currently developing an open-source, general-purpose ASMS based on LabKey Server (http://www.labkey.org) in collaboration with CHAVI, the ITN, and nPOD.
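    To make the data management requirement concrete, the sketch below shows one way ancillary assay results could be pooled with a parent study's specimen records while preserving provenance and respecting pre-existing consent. It is purely illustrative: the field names, protocol identifiers, and consent flag are assumptions made for the example, not elements of any system described in the abstract.

    # Illustrative R sketch: link ancillary assay results back to the parent
    # protocol's specimen inventory. All identifiers and values are hypothetical.

    # Specimen records from the primary (parent) protocol, including whether the
    # original participant consent covers ancillary use.
    primary_specimens <- data.frame(
      specimen_id              = c("SP001", "SP002", "SP003"),
      participant_id           = c("P01", "P01", "P02"),
      protocol                 = "PARENT-001",
      consent_allows_ancillary = c(TRUE, TRUE, FALSE),
      stringsAsFactors         = FALSE
    )

    # New measurements generated by the ancillary study on existing specimens.
    ancillary_results <- data.frame(
      specimen_id      = c("SP001", "SP002"),
      assay            = "cytokine_panel",
      value            = c(12.4, 8.9),
      protocol         = "ANCILLARY-007",
      stringsAsFactors = FALSE
    )

    # Pool across protocols by specimen, keeping both protocol identifiers so
    # each row records where the specimen and the measurement came from.
    pooled <- merge(ancillary_results, primary_specimens,
                    by = "specimen_id", suffixes = c(".ancillary", ".primary"))

    # Release only specimens whose pre-existing consent permits ancillary use.
    pooled <- pooled[pooled$consent_allows_ancillary, ]

    A real ASMS would enforce consent and specimen-allocation rules within the managed system rather than in ad hoc analyst code; the data frames here only illustrate the kinds of cross-protocol linkage and provenance the abstract describes.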