Orchestrating privacy-protected big data analyses of data from different resources with R and DataSHIELD.
Combined analysis of multiple, large datasets is a common objective in the health- and biosciences. Existing methods tend to require researchers to physically bring data together in one place or follow an analysis plan and share results. Developed over the last 10 years, the DataSHIELD platform is a collection of R packages that reduce the challenges of these methods. These include ethico-legal constraints, which limit researchers' ability to physically bring data together, and the analytical inflexibility associated with conventional approaches to sharing results. The key feature of DataSHIELD is that data from research studies stay on a server at each of the institutions that are responsible for the data. Each institution has control over who can access its data. The platform allows an analyst to pass commands to each server, and the analyst receives results that do not disclose the individual-level data of any study participants. DataSHIELD uses Opal, which is a data integration system used by epidemiological studies and developed by the OBiBa open source project in the domain of bioinformatics. However, until now, the analysis of big data with DataSHIELD has been limited by the storage formats available in Opal and the analysis capabilities available in the DataSHIELD R packages. We present a new architecture ("resources") for DataSHIELD and Opal to allow large, complex datasets to be used at their original location, in their original format and with external computing facilities. We provide some real big data analysis examples in genomics and geospatial projects. For genomic data analyses, we also illustrate how to extend the resources concept to address specific big data infrastructures such as GA4GH or EGA, and make use of shell commands. Our new infrastructure will help researchers to perform data analyses in a privacy-protected way from existing data sharing initiatives or projects. To help researchers use this framework, we describe selected packages and present an online book (https://isglobal-brge.github.io/resource_bookdown).
Software Application Profile: exposomeShiny - a toolbox for exposome data analysis
Motivation: Studying the role of the exposome in human health and its impact on different omic layers requires advanced statistical methods. Many of these methods are implemented in different R and Bioconductor packages, but their use may require strong expertise in R, in writing pipelines and in using new R classes which may not be familiar to non-advanced users. ExposomeShiny provides a bridge between researchers and most of the state-of-the-art exposome analysis methodologies, without the need for advanced programming skills. Implementation: ExposomeShiny is a standalone web application implemented in R. It is available as source files and can be installed on any server or computer, avoiding problems with data confidentiality. It is executed in RStudio, which opens a browser window with the web application. General features: The presented implementation allows users to conduct: (i) data pre-processing: normalization and missing-data imputation (including limit of detection); (ii) descriptive analysis; (iii) exposome principal component analysis (PCA) and hierarchical clustering; (iv) exposome-wide association studies (ExWAS) and variable selection ExWAS; (v) omic data integration by single association and multi-omic analyses; and (vi) post-exposome data analyses to gain biological insight into the exposures and genes using the Comparative Toxicogenomics Database (CTD) and pathway analysis. Availability: The exposomeShiny source code is freely available on GitHub at [https://github.com/isglobal-brge/exposomeShiny], Git tag v1.4. The software is also available as a Docker image [https://hub.docker.com/r/brgelab/exposome-shiny], tag v1.4. A user guide with information about the analysis methodologies as well as information on how to use exposomeShiny is freely hosted at [https://isglobal-brge.github.io/exposome_bookdown/].
This research has received funding from: the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 874583 (ATHLETE); the Ministerio de Ciencia, Innovación y Universidades (MICIU), Agencia Estatal de Investigación (AEI) and Fondo Europeo de Desarrollo Regional, UE (RTI2018-100789-B-I00), also through the ‘Centro de Excelencia Severo Ochoa 2019–2023’ Program (CEX2018-000806-S); and the Catalan Government through the CERCA Program. This article is part of the project VEIS: 001-P-001647, co-financed by the European Regional Development Fund of the European Union in the framework of the Operational Program FEDER of Catalonia 2014–2020, with the support of the Secretaria d'Universitats i Recerca del Departament d'Empresa i Coneixement de la Generalitat de Catalunya.
Orchestrating privacy-protected big data analyses of data from different resources with R and DataSHIELD
Combined analysis of multiple, large datasets is a common objective in the health- and biosciences. Existing methods tend to require researchers to physically bring data together in one place or follow an analysis plan and share results. Developed over the last 10 years, the DataSHIELD platform is a collection of R packages that reduce the challenges of these methods. These include ethico-legal constraints, which limit researchers' ability to physically bring data together, and the analytical inflexibility associated with conventional approaches to sharing results. The key feature of DataSHIELD is that data from research studies stay on a server at each of the institutions that are responsible for the data. Each institution has control over who can access its data. The platform allows an analyst to pass commands to each server, and the analyst receives results that do not disclose the individual-level data of any study participants. DataSHIELD uses Opal, which is a data integration system used by epidemiological studies and developed by the OBiBa open source project in the domain of bioinformatics. However, until now, the analysis of big data with DataSHIELD has been limited by the storage formats available in Opal and the analysis capabilities available in the DataSHIELD R packages. We present a new architecture ("resources") for DataSHIELD and Opal to allow large, complex datasets to be used at their original location, in their original format and with external computing facilities. We provide some real big data analysis examples in genomics and geospatial projects. For genomic data analyses, we also illustrate how to extend the resources concept to address specific big data infrastructures such as GA4GH or EGA, and make use of shell commands. Our new infrastructure will help researchers to perform data analyses in a privacy-protected way from existing data sharing initiatives or projects.
To help researchers use this framework, we describe selected packages and present an online book (https://isglobal-brge.github.io/resource_bookdown).
Data sharing enhances understanding of research results beyond what is possible from any single study. Data pooling across multiple studies increases statistical power and allows exploration of between-study heterogeneity. However, ethico-legal considerations and concerns about intellectual or commercial value regularly prevent or impede physical data sharing. DataSHIELD is designed to circumvent this problem. Despite the growing confidence users have been placing in DataSHIELD to perform privacy-protected analyses of data in cohort consortia, federated analytics still faces real challenges. These include handling the wide range of data formats and big data sources used, for example, in omics-based research. This article describes the development and implementation of the new "resources" architecture in DataSHIELD, which overcomes this limitation. We illustrate its value with real-world examples related to genomics and geographical data. We also demonstrate how genomic data sharing initiatives such as GA4GH and EGA can benefit directly from our development. Our new infrastructure will help researchers to perform data analyses in a privacy-protected way from existing data sharing initiatives or projects.