
    The Rockerverse: packages and applications for containerisation with R

    The Rocker Project provides widely used Docker images for R across different application scenarios. This article surveys downstream projects that build upon the Rocker Project images and presents the current state of R packages for managing Docker images and controlling containers. These use cases cover diverse topics such as package development, reproducible research, collaborative work, cloud-based data processing, and production deployment of services. The variety of applications demonstrates the power of the Rocker Project specifically and of containerisation in general. Across the diverse ways to use containers, we identified common themes: reproducible environments, scalability and efficiency, and portability across clouds. We conclude that the current growth and diversification of use cases is likely to continue to have a positive impact, but we see a need to consolidate the Rockerverse ecosystem of packages, develop common practices for applications, and explore alternative containerisation software.
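
    As a minimal sketch of the container control this survey covers, the snippet below uses the Docker SDK for Python (not one of the R packages the article surveys) to run a throwaway R command inside the official rocker/r-base image. It assumes a local Docker daemon and the `docker` Python package are available.

        # Minimal sketch: run a one-off R command inside a Rocker container.
        # Assumes a running Docker daemon and `pip install docker`.
        import docker

        client = docker.from_env()

        # Pull rocker/r-base if absent, run a short R expression, and
        # clean the container up afterwards (remove=True).
        logs = client.containers.run(
            "rocker/r-base",
            ["R", "-q", "-e", "summary(cars)"],
            remove=True,
        )
        print(logs.decode())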

    Inverting the model of genomics data sharing with the NHGRI Genomic Data Science Analysis, Visualization, and Informatics Lab-space

    The NHGRI Genomic Data Science Analysis, Visualization, and Informatics Lab-space (AnVIL; https://anvilproject.org) was developed to address a widespread community need for a unified computing environment for genomics data storage, management, and analysis. In this perspective, we present AnVIL, describe its ecosystem and interoperability with other platforms, and highlight how this platform and associated initiatives contribute to improved genomic data sharing efforts. The AnVIL is a federated cloud platform designed to manage and store genomics and related data, enable population-scale analysis, and facilitate collaboration through the sharing of data, code, and analysis results. By inverting the traditional model of data sharing, the AnVIL eliminates the need for data movement, adds security measures for active threat detection and monitoring, and provides scalable, shared computing resources for any researcher. We describe the core data management and analysis components of the AnVIL, which currently consist of Terra, Gen3, Galaxy, RStudio/Bioconductor, Dockstore, and Jupyter, and describe several flagship genomics datasets available within the AnVIL. We continue to extend and innovate the AnVIL ecosystem by implementing new capabilities, including mechanisms for interoperability and responsible data sharing, while streamlining access management. The AnVIL opens many new opportunities for analysis, collaboration, and data sharing that are needed to drive research and to make discoveries through the joint analysis of hundreds of thousands to millions of genomes along with associated clinical and molecular data types.
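
    Terra workspaces in the AnVIL can be scripted as well as used through the web interface. Purely as a hedged illustration, the sketch below lists the workspaces visible to the caller via FISS, a public Python client for the Terra/FireCloud API; the package name, function, and response shape are assumptions based on the FISS project, not something this perspective prescribes.

        # Hedged sketch: list Terra workspaces visible to the caller.
        # Assumes `pip install firecloud` and Google application-default
        # credentials; the FISS API surface here is an assumption.
        from firecloud import api as fapi

        resp = fapi.list_workspaces()
        resp.raise_for_status()  # FISS returns a requests.Response

        for entry in resp.json():
            ws = entry["workspace"]
            print(ws["namespace"], "/", ws["name"])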

    Ten simple rules for writing Dockerfiles for reproducible data science.

    Computational science has been greatly improved by the use of containers for packaging software and data dependencies. In a scholarly context, the main drivers for using these containers are transparency and support of reproducibility; in turn, a workflow's reproducibility can be greatly affected by the choices made when building containers. In many cases, the container image is built from instructions provided in a Dockerfile. In support of this approach, we present a set of rules to help researchers write understandable Dockerfiles for typical data science workflows. By following the rules in this article, researchers can create containers suitable for sharing with fellow scientists, for inclusion in scholarly communication such as education or scientific papers, and for effective and sustainable personal workflows.
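
    As an illustration of the kind of Dockerfile such rules target, the sketch below embeds a small, version-pinned Dockerfile in a Python string and builds it with the Docker SDK. The specific base image tag and package choices are illustrative assumptions, not rules quoted from the paper.

        # Illustrative sketch: build a small, reproducibility-minded image.
        # Pinning the base image tag and installing packages explicitly
        # reflect common advice; the details here are assumptions.
        import io
        import docker

        dockerfile = """\
        FROM rocker/r-ver:4.2.1
        RUN install2.r --error data.table
        CMD ["R", "--no-save"]
        """

        client = docker.from_env()
        image, build_log = client.images.build(
            fileobj=io.BytesIO(dockerfile.encode()),
            tag="myanalysis:1.0",
        )
        print(image.tags)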

    Reproducible big data science: A case study in continuous FAIRness.

    Big biomedical data create exciting opportunities for discovery, but make it difficult to capture analyses and outputs in forms that are findable, accessible, interoperable, and reusable (FAIR). In response, we describe tools that make it easy to capture, and assign identifiers to, data and code throughout the data lifecycle. We illustrate the use of these tools via a case study involving a multi-step analysis that creates an atlas of putative transcription factor binding sites from terabytes of ENCODE DNase I hypersensitive sites sequencing data. We show how the tools automate routine but complex tasks, capture analysis algorithms in understandable and reusable forms, and harness fast networks and powerful cloud computers to process data rapidly, all without sacrificing usability or reproducibility, thus ensuring that big data are not hard-to-(re)use data. We evaluate our approach via a user study and show that 91% of participants were able to replicate a complex analysis involving considerable data volumes.
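
    The paper's identifier tooling is not reproduced here, but the underlying idea of assigning identifiers throughout the data lifecycle can be sketched generically: hash the bytes of an artifact and use the digest as a stable, verifiable handle. Everything in the snippet, names included, is an illustrative assumption rather than the case study's actual stack.

        # Generic sketch of a content-derived identifier: the digest changes
        # whenever the data change, so the handle doubles as an integrity
        # check. Illustration only, not the paper's actual tooling.
        import hashlib
        from pathlib import Path

        def content_id(path: str, chunk_size: int = 1 << 20) -> str:
            h = hashlib.sha256()
            with open(path, "rb") as f:
                while chunk := f.read(chunk_size):
                    h.update(chunk)
            return "sha256:" + h.hexdigest()

        if __name__ == "__main__":
            p = Path("example.bed")
            p.write_text("chr1\t100\t200\n")  # toy stand-in for a dataset
            print(content_id(str(p)))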

    Epiviz: Integrative Visual Analysis Software for Genomics

    Computational and visual data analysis for genomics has traditionally involved a combination of tools and resources, of which the most ubiquitous are genome browsers, focused mainly on integrative visualization of large numbers of big datasets, and computational environments, focused on data modeling of a small number of moderately sized datasets. Workflows that involve the integration and exploration of multiple heterogeneous data sources, small and large, public and user-specific, have been poorly addressed by these tools. Commonly, the data visualized in these tools are the output of analyses performed in powerful computing environments like R/Bioconductor or Python. Two essential aspects of data analysis are usually treated as distinct, in spite of being part of the same exploratory process: algorithmic analysis and interactive visualization. In current technologies these are not integrated within one tool; rather, one precedes the other. Recent technological advances in web-based data visualization have made it possible for interactive visualization tools to tightly integrate with powerful algorithmic tools, without being restricted to one such tool in particular. We introduce Epiviz (http://epiviz.cbcb.umd.edu), an integrative visualization tool that bridges the gap between the two types of tools, simplifying genomic data analysis workflows. Epiviz is the first interactive genomics visualization tool to provide tight-knit integration with computational and statistical modeling and data analysis. We discuss three ways in which Epiviz advances the field of genomic data analysis: 1) it brings code to interactive visualizations at various levels; 2) it takes first steps toward collaborative data analysis by incorporating user plugins from source control providers and by allowing analysis states to be shared within the scientific community; 3) it combines established analysis features that have never before been available simultaneously in a visualization tool for genomics. Epiviz can be used in multiple branches of genomics data analysis for various types of datasets, of which we detail two: functional genomics data, aligned to a continuous coordinate such as the genome, and metagenomics data, organized according to volatile hierarchical coordinate spaces. We also present security implications of the current design, performance benchmarks, a series of limitations, and future research steps.
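
    Epiviz's actual integration layer (its R/Bioconductor bridge) is not shown here. Purely as an architectural sketch of the pattern the abstract describes, where an analysis environment drives an interactive browser view, the snippet below exposes computed region summaries over a tiny HTTP/JSON endpoint that a web visualization could poll. The endpoint, payload, and values are all hypothetical.

        # Architectural sketch only: serve computed region summaries as
        # JSON so an interactive web view can stay in sync with the
        # analysis side. Standard library only; all names hypothetical.
        import json
        from http.server import BaseHTTPRequestHandler, HTTPServer

        # Stand-in for values produced in a computational environment.
        REGION_SUMMARIES = [
            {"chr": "chr1", "start": 1000, "end": 2000, "score": 0.83},
            {"chr": "chr1", "start": 5000, "end": 6000, "score": 0.41},
        ]

        class Handler(BaseHTTPRequestHandler):
            def do_GET(self):
                if self.path == "/regions":
                    body = json.dumps(REGION_SUMMARIES).encode()
                    self.send_response(200)
                    self.send_header("Content-Type", "application/json")
                    self.end_headers()
                    self.wfile.write(body)
                else:
                    self.send_error(404)

        if __name__ == "__main__":
            HTTPServer(("127.0.0.1", 8000), Handler).serve_forever()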

    The why, when, and how of computing in biology classrooms [version 1; peer review: 2 approved]

    Many biologists are interested in teaching computing skills or using computing in the classroom, despite not being formally trained in these skills themselves. Biologists may thus find themselves researching how to teach these skills, with many attempting to discover suitable resources and methods on their own. Recent years have seen an expansion of new technologies to assist in delivering course content interactively. Educational research provides insights into how learners absorb and process information during interactive learning. In this review, we discuss the value of teaching foundational computing skills to biologists, along with strategies and tools to do so. Additionally, we review the literature on teaching practices that support the development of these skills. We pay special attention to meeting the needs of diverse learners and consider how different ways of delivering course content can be leveraged to provide a more inclusive classroom experience. Our goal is to enable biologists to teach computational skills and use computing in the classroom successfully.

    ML-MEDIC: A Preliminary Study of an Interactive Visual Analysis Tool Facilitating Clinical Applications of Machine Learning for Precision Medicine

    Accessible interactive tools that integrate machine learning methods with clinical research and reduce the programming experience required are needed to move science forward. Here, we present Machine Learning for Medical Exploration and Data-Inspired Care (ML-MEDIC), a point-and-click, interactive tool with a visual interface for facilitating machine learning and statistical analyses in clinical research. We deployed ML-MEDIC in the American Heart Association (AHA) Precision Medicine Platform to provide secure internet access and facilitate collaboration. ML-MEDIC's efficacy in facilitating the adoption of machine learning was evaluated through two case studies in collaboration with clinical domain experts. A domain expert review was also conducted to obtain an impression of the tool's usability and potential limitations.
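
    ML-MEDIC itself is point-and-click, so no code is needed to use it. Purely to illustrate the class of analysis such a tool automates behind its interface, the sketch below cross-validates a regularized logistic regression on a synthetic clinical-style dataset; the data and model choice are assumptions, not taken from the paper.

        # Illustration of the workflow a point-and-click ML tool wraps:
        # cross-validated logistic regression on tabular data.
        # Synthetic data; nothing here comes from ML-MEDIC itself.
        from sklearn.datasets import make_classification
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import cross_val_score
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import StandardScaler

        X, y = make_classification(n_samples=500, n_features=12,
                                   random_state=0)
        model = make_pipeline(StandardScaler(),
                              LogisticRegression(max_iter=1000))
        scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
        print(f"AUC: {scores.mean():.3f} +/- {scores.std():.3f}")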

    Analyzing epigenomic data in a large-scale context

    While large amounts of epigenomic data are publicly available, their retrieval in a form suitable for downstream analysis is a bottleneck in current research. In a typical analysis, users are required to download huge files that span the entire genome, even if they are only interested in a small subset (e.g., promoter regions) or an aggregation thereof. Moreover, complex operations on genome-level data are not always feasible on a local computer due to resource limitations. The DeepBlue Epigenomic Data Server mitigates this issue by providing a robust server that affords a powerful API for searching, filtering, transforming, aggregating, enriching, and downloading data from several epigenomic consortia. Furthermore, its main component implements data storage and manipulation methods that scale with the increasing amount of epigenetic data, making it an ideal resource for researchers who seek to integrate epigenomic data into their analysis workflows. This work also presents companion tools that utilize the DeepBlue API to enable users not proficient in scripting or programming languages to analyze epigenomic data in a user-friendly way: (i) an R/Bioconductor package that integrates DeepBlue into the R analysis workflow; the extracted data are automatically converted into suitable R data structures for downstream analysis and visualization within the Bioconductor framework; (ii) a web portal that enables users to search, select, filter, and download the epigenomic data available in the DeepBlue Server; this interface provides elements such as data tables, grids, and data selections that help users find the required epigenomic data in a straightforward way; (iii) DIVE, a web data analysis tool that allows researchers to perform large-scale epigenomic data analysis in a programming-free environment. DIVE enables users to compare their datasets to the datasets available in the DeepBlue Server in an intuitive interface, which summarizes the comparison of hundreds of datasets in a simple chart. Given the large amount of data available in DIVE, it also provides methods that suggest the most similar datasets for a comparative analysis. Furthermore, these tools are integrated and can share results among themselves, creating a powerful large-scale epigenomic data analysis environment. The DeepBlue Epigenomic Data Server and its ecosystem were well received by the International Human Epigenome Consortium and have already attracted much attention from the epigenomic research community, with currently 160 registered users and more than three million anonymous workflow processing requests since release.
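
    DeepBlue's API is also reachable without the companion tools. As a hedged sketch, the snippet below calls the server's XML-RPC interface from Python's standard library; the endpoint URL and the anonymous user key follow DeepBlue's public documentation at the time of writing and should be treated as assumptions, since the service may have changed.

        # Hedged sketch: query the DeepBlue server over XML-RPC.
        # Endpoint and anonymous key follow the public docs and may
        # have changed; treat both as assumptions.
        import xmlrpc.client

        URL = "http://deepblue.mpi-inf.mpg.de/xmlrpc"  # assumed endpoint
        USER_KEY = "anonymous_key"                     # assumed public key

        server = xmlrpc.client.ServerProxy(URL, allow_none=True)

        # DeepBlue commands return [status, payload]; here, the genomes
        # known to the server as (id, name) pairs.
        status, genomes = server.list_genomes(USER_KEY)
        if status == "okay":
            for genome_id, name in genomes:
                print(genome_id, name)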