44 research outputs found

    The National COVID Cohort Collaborative (N3C): Rationale, design, infrastructure, and deployment

    Get PDF
    OBJECTIVE: Coronavirus disease 2019 (COVID-19) poses societal challenges that require expeditious data and knowledge sharing. Though organizational clinical data are abundant, these are largely inaccessible to outside researchers. Statistical, machine learning, and causal analyses are most successful with large-scale data beyond what is available in any given organization. Here, we introduce the National COVID Cohort Collaborative (N3C), an open science community focused on analyzing patient-level data from many centers. MATERIALS AND METHODS: The Clinical and Translational Science Award Program and scientific community created N3C to overcome technical, regulatory, policy, and governance barriers to sharing and harmonizing individual-level clinical data. We developed solutions to extract, aggregate, and harmonize data across organizations and data models, and created a secure data enclave to enable efficient, transparent, and reproducible collaborative analytics. RESULTS: Organized in inclusive workstreams, we created legal agreements and governance for organizations and researchers; data extraction scripts to identify and ingest positive, negative, and possible COVID-19 cases; a data quality assurance and harmonization pipeline to create a single harmonized dataset; population of the secure data enclave with data, machine learning, and statistical analytics tools; dissemination mechanisms; and a synthetic data pilot to democratize data access. CONCLUSIONS: The N3C has demonstrated that a multisite collaborative learning health network can overcome barriers to rapidly build a scalable infrastructure incorporating multiorganizational clinical data for COVID-19 analytics. We expect this effort to save lives by enabling rapid collaboration among clinicians, researchers, and data scientists to identify treatments and specialized care and thereby reduce the immediate and long-term impacts of COVID-19
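The RESULTS above mention extraction scripts that flag positive, negative, and possible COVID-19 cases and a pipeline that harmonizes site data into a single dataset. The sketch below is only a loose, hypothetical illustration of that kind of step, written with pandas; the table layout, local codes, concept identifiers, and classification rules are all invented and are not N3C's actual scripts or vocabulary.

```python
# Hypothetical illustration of two steps the abstract describes at a high level:
# mapping site-local lab codes to a shared vocabulary and classifying results
# as positive, negative, or possible cases. All codes, identifiers, and rules
# below are invented; this is not N3C's ingestion or harmonization code.
import pandas as pd

site_extract = pd.DataFrame({
    "patient_id": [101, 102, 103, 104],
    "local_code": ["LAB-CVD-PCR", "LAB-CVD-PCR", "LAB-CVD-AG", "LAB-CVD-PCR"],
    "result_text": ["DETECTED", "NOT DETECTED", "INDETERMINATE", "detected"],
})

# Invented mapping from local codes to placeholder shared-vocabulary ids.
code_map = {"LAB-CVD-PCR": 1001, "LAB-CVD-AG": 1002}

def classify(result_text: str) -> str:
    """Classify a raw result string as a positive, negative, or possible case."""
    text = result_text.strip().lower()
    if text == "detected":
        return "positive"
    if text == "not detected":
        return "negative"
    return "possible"

harmonized = site_extract.assign(
    concept_id=site_extract["local_code"].map(code_map),
    case_status=site_extract["result_text"].map(classify),
)
print(harmonized)
```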

    Envisaging a global infrastructure to exploit the potential of digitised collections

    Get PDF
    Tens of millions of images from biological collections have become available online over the last two decades. In parallel, there has been a dramatic increase in the capabilities of image analysis technologies, especially those involving machine learning and computer vision. While image analysis has become mainstream in consumer applications, it is still used only on an artisanal basis in the biological collections community, largely because the image corpora are dispersed. Yet, there is massive untapped potential for novel applications and research if images of collection objects could be made accessible in a single corpus. In this paper, we make the case for infrastructure that could support image analysis of collection objects. We show that such infrastructure is entirely feasible and well worth investing in

    The Pharmacoepigenomics Informatics Pipeline and H-GREEN Hi-C Compiler: Discovering Pharmacogenomic Variants and Pathways with the Epigenome and Spatial Genome

    Full text link
    Over the last decade, biomedical science has been transformed by the epigenome and spatial genome, but the discipline of pharmacogenomics, the study of the genetic underpinnings of pharmacological phenotypes like drug response and adverse events, has not. Scientists have begun to use omics atlases of increasing depth, and inferences relating to the bidirectional causal relationship between the spatial epigenome and gene expression, as a foundational underpinning for genetics research. The epigenome and spatial genome are increasingly used to discover causative regulatory variants in the significance regions of genome-wide association studies, for the discovery of the biological mechanisms underlying these phenotypes and the design of genetic tests to predict them. Such variants often have more predictive power than coding variants, but in the area of pharmacogenomics, such advances have been radically underapplied. The majority of pharmacogenomics tests are designed manually on the basis of mechanistic work with coding variants in candidate genes, and where genome-wide approaches are used, they are typically not interpreted with the epigenome. This work describes a series of analyses of pharmacogenomics association studies with the tools and datasets of the epigenome and spatial genome, undertaken with the intent of discovering causative regulatory variants to enable new genetic tests. It describes the potent regulatory variants discovered thereby to have a putative causative and predictive role in a number of medically important phenotypes, including analgesia and the treatment of depression, bipolar disorder, and traumatic brain injury with opiates, anxiolytics, antidepressants, lithium, and valproate, and in particular the tendency for such variants to cluster into spatially interacting, conceptually unified pathways which offer mechanistic insight into these phenotypes. It describes the Pharmacoepigenomics Informatics Pipeline (PIP), an integrative multiple omics variant discovery pipeline designed to make this kind of analysis easier and cheaper to perform, more reproducible, and amenable to the addition of advanced features. It describes the successes of the PIP in rediscovering manually discovered gene networks for lithium response, as well as discovering a previously unknown genetic basis for warfarin response in anticoagulation therapy. It describes the H-GREEN Hi-C compiler, which was designed to analyze spatial genome data and discover the distant target genes of such regulatory variants, and its success in discovering spatial contacts not detectable by preceding methods and using them to build spatial contact networks that unite disparate TADs with phenotypic relationships. It describes a potential feature set of a future pipeline, using the latest epigenome research and the lessons of the previous pipeline. It describes my thinking about how to use the output of a multiple omics variant pipeline to design genetic tests that also incorporate clinical data. And it concludes by describing a long-term vision for a comprehensive pharmacophenomic atlas, to be constructed by applying a variant pipeline and machine learning test design system, such as is described, to thousands of phenotypes in parallel. Scientists struggled to assay genotypes for the better part of a century, and in the last twenty years, succeeded. The struggle to predict phenotypes on the basis of the genotypes we assay remains ongoing.
The use of multiple omics variant pipelines and machine learning models with omics atlases, genetic association, and medical records data will be an increasingly significant part of that struggle for the foreseeable future.
PhD, Bioinformatics, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/145835/1/ariallyn_1.pd
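The analyses summarized above hinge on overlapping association-study variants with epigenomic annotations to nominate regulatory candidates. As a generic, hypothetical illustration of that interval-overlap idea (not the PIP or H-GREEN code), the sketch below checks invented variant positions against invented annotation intervals in plain Python.

```python
# Hypothetical illustration of overlapping GWAS variants with epigenomic
# annotation intervals to nominate candidate regulatory variants.
# Positions, rsids, and annotation labels are invented; this is not the PIP.

variants = [
    {"rsid": "rs0000001", "chrom": "chr7", "pos": 1_250_300, "pvalue": 3e-9},
    {"rsid": "rs0000002", "chrom": "chr7", "pos": 1_900_000, "pvalue": 2e-8},
    {"rsid": "rs0000003", "chrom": "chr2", "pos": 5_420_110, "pvalue": 7e-10},
]

# Illustrative regulatory annotations (e.g., enhancer calls) as half-open intervals.
annotations = [
    {"chrom": "chr7", "start": 1_250_000, "end": 1_251_000, "label": "enhancer_A"},
    {"chrom": "chr2", "start": 5_400_000, "end": 5_500_000, "label": "open_chromatin_B"},
]

def overlapping_annotations(variant, intervals):
    """Return the labels of all intervals that contain the variant position."""
    return [
        iv["label"]
        for iv in intervals
        if iv["chrom"] == variant["chrom"] and iv["start"] <= variant["pos"] < iv["end"]
    ]

for v in variants:
    hits = overlapping_annotations(v, annotations)
    if hits:
        print(f"{v['rsid']} ({v['chrom']}:{v['pos']}) overlaps {', '.join(hits)}")
```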

    From geospatial data capture to the delivery of GIS-ready information : improved management within a GIS environment

    Get PDF
    This thesis presents the research undertaken to investigate how geospatial data handling techniques and technology can potentially be used to enhance the existing management of entire survey datasets, from their capture stage to a GIS-ready state, and the delivery of this to the user. The current systems for managing survey data and information in the Survey and Mapping Department Malaysia (JUPEM) are presented. In addition, the surveying practice and processes carried out have been examined, especially the different types of data and information that exist from raw data capture right through to the production of GIS-ready information. Current GIS technology and techniques for managing geospatial data have been inspected to gain an in-depth understanding of them. The geospatial object, as an approach to modelling the reality of the world, has been explored and used to model raw, processed, and GIS-ready information. To implement the management, a prototype Database Management System (DBMS) has been implemented, and trial data population and processing steps have been carried out. An enhancement of the management of the datasets from geospatial data capture to the GIS-ready information has been demonstrated. To deliver the final product online, demonstrations of available methods were illustrated and then contrasted. A range of datasets within the Malaysian context were used in the research. The investigation revealed that raw, processed, and GIS-ready information can be successfully modelled as objects in an object-relational spatial database. Using inherent GIS tools, survey dataset management and processing steps within the same system are demonstrably achieved in the prototype DBMS. Improved management, showing the capability of 'drill-down search' and 'two-way traceability' to access and search spatial and non-spatial information in the system, is effectively illustrated. Demonstration of the vendor-specific and open-source technologies for GIS-ready information delivery leads to a comparison between them. The thesis concludes by recognising that management of raw captured data, processed data, and GIS-ready information, and the delivery of this, within a GIS environment is possible. The inherent GIS tools and DBMS present a single-view system for geospatial data management, providing superior interfaces that are easy to learn and use, with which users are able to specify and perform the desired tasks efficiently. Delivery of data has some constraints that need to be considered before embarking on either a vendor-specific application or open-source technology. In JUPEM, time and cost can be reduced by applying and implementing the suggested GIS application for cadastral and topographic surveys right up to the creation of GIS-ready information, as detailed in the thesis.
The research also finds that the in-depth understanding and experience, practical and theoretical, of all aspects of current GIS technologies and techniques gained through this research has achieved an overarching inspiration: the equalisation of a high level of awareness and ability of staff in handling GIS project development within currently developing countries with those in the developed countries, and within the national survey and mapping department with those of other government departments and commercial GIS contractors.
EThOS - Electronic Theses Online Service. Public Service Department of Malaysia : Department of Survey and Mapping Malaysia : University of Newcastle upon Tyne. GB (United Kingdom)
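The abstract describes modelling raw, processed, and GIS-ready information as linked objects that support 'drill-down search' and 'two-way traceability'. The sketch below is a hypothetical, application-level illustration of such bidirectional links using Python dataclasses; all class and field names are invented, and the thesis realizes this inside an object-relational spatial DBMS rather than in code like this.

```python
# A minimal, hypothetical sketch of 'two-way traceability' between raw survey
# data, processed data, and GIS-ready objects. Names are invented for
# illustration only.
from dataclasses import dataclass, field


@dataclass
class SurveyObject:
    object_id: str
    stage: str                                        # "raw", "processed", or "gis_ready"
    attributes: dict = field(default_factory=dict)
    derived_from: "SurveyObject | None" = None        # link back to the source object
    derivatives: list = field(default_factory=list)   # forward links to derived objects

    def derive(self, object_id: str, stage: str, **attributes) -> "SurveyObject":
        """Create a derived object and register links in both directions."""
        child = SurveyObject(object_id, stage, attributes, derived_from=self)
        self.derivatives.append(child)
        return child


# Drill down from raw capture to a GIS-ready parcel, then trace back.
raw = SurveyObject("RAW-001", "raw", {"instrument": "total_station"})
processed = raw.derive("PROC-001", "processed", adjustment="least_squares")
parcel = processed.derive("GIS-001", "gis_ready", layer="cadastral_parcel")

print(parcel.derived_from.derived_from.object_id)     # traceability back to RAW-001
print([d.object_id for d in raw.derivatives])         # drill-down forward to PROC-001
```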

    Generation and Applications of Knowledge Graphs in Systems and Networks Biology

    Get PDF
    The acceleration in the generation of data in the biomedical domain has necessitated the use of computational approaches to assist in its interpretation. However, these approaches rely on the availability of high-quality, structured, formalized biomedical knowledge. This thesis has two goals: to improve methods for curation and semantic data integration in order to generate high-granularity biological knowledge graphs, and to develop novel methods for using prior biological knowledge to propose new biological hypotheses. The first two publications describe an ecosystem for handling biological knowledge graphs encoded in the Biological Expression Language throughout the stages of curation, visualization, and analysis. Further, the next two publications describe the reproducible acquisition and integration of high-granularity knowledge with low contextual specificity from structured biological data sources on a massive scale, and support the semi-automated curation of new content at high speed and precision. After building the ecosystem and acquiring content, the last three publications in this thesis demonstrate three different applications of biological knowledge graphs in modeling and simulation. The first demonstrates the use of agent-based modeling for simulation of neurodegenerative disease biomarker trajectories using biological knowledge graphs as priors. The second applies network representation learning to prioritize nodes in biological knowledge graphs based on corresponding experimental measurements to identify novel targets. Finally, the third uses biological knowledge graphs and develops algorithms to deconvolute the mechanism of action of drugs, which could also serve to identify drug repositioning candidates. Ultimately, this thesis lays the groundwork for production-level applications of drug repositioning algorithms and other knowledge-driven approaches to analyzing biomedical experiments
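The second application above prioritizes nodes in a knowledge graph against experimental measurements. As a loose, hypothetical stand-in for that idea (the thesis uses network representation learning, not the method shown here), the sketch below ranks the nodes of a toy graph with a personalized PageRank seeded by invented measurement scores, using networkx.

```python
# Hypothetical sketch: prioritize nodes in a small knowledge graph using a
# personalized PageRank seeded with (invented) experimental scores. This
# illustrates the general idea of network-based prioritization, not the
# specific network representation learning method described in the thesis.
import networkx as nx

# Toy knowledge graph: nodes are genes/proteins, edges are curated relations.
G = nx.Graph()
G.add_edges_from([
    ("GENE_A", "GENE_B"),
    ("GENE_B", "GENE_C"),
    ("GENE_C", "GENE_D"),
    ("GENE_B", "GENE_E"),
])

# Invented experimental measurements (e.g., differential expression scores).
measurements = {"GENE_A": 2.5, "GENE_E": 1.0}

# The personalization vector covers all nodes; unmeasured nodes get a small weight.
personalization = {n: measurements.get(n, 0.01) for n in G.nodes}

scores = nx.pagerank(G, alpha=0.85, personalization=personalization)
for node, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{node}\t{score:.3f}")
```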

    Placing taxonomists at the heart of a definitive and comprehensive global resource on the world's plants

    Get PDF
    It is time to synthesize the knowledge that has been generated through more than 260 years of botanical exploration, taxonomic and, more recently, phylogenetic research throughout the world. The adoption of an updated Global Strategy for Plant Conservation (GSPC) in 2011 provided the essential impetus for the development of the World Flora Online (WFO) project. The project represents an international, coordinated effort by the botanical community to achieve GSPC Target 1, an electronic Flora of all plants. It will be a first‐ever unique and authoritative global source of information on the world's plant diversity, compiled, curated, moderated and updated by an expert and specialist‐based community (Taxonomic Expert Networks – “TENs” – covering a taxonomic group such as family or order) and actively managed by those who have compiled and contributed the data it includes. Full credit and acknowledgement will be given to the original sources, allowing users to refer back to the primary data. A strength of the project is that it is led and endorsed by a global consortium of more than 40 leading botanical institutions worldwide. A first milestone for producing the World Flora Online is to be accomplished by the end of 2020, but the WFO Consortium is committed to continuing the WFO programme beyond 2020 when it will develop its full impact as the authoritative source of information on the world's plant biodiversity

    Cell Population Identification and Benchmarking of Tools in Single-Cell Data Analysis

    Get PDF
    In an era of biology where modern imaging and sequencing technologies allow almost any biological process to be studied at the molecular level, it is now possible to study the content, activity, and identity of single cells from almost any organism. High-throughput sequencing technologies are now widely used in laboratories around the globe and are able to routinely sequence the genetic content of thousands or even millions of cells. Not only are they used to study basic cellular properties, but also to understand the mechanisms underlying disease states and the cellular responses to new treatments. This field of science is very energetic and, because so much effort is put into (computational) method development, the scientific community needs a barometer to measure its current state and needs. Guidelines and recommendations for analyzing these new types of data are critical, which is one of the roles of benchmarks. However, the current strategy for performing benchmarks still often relies on building a new evaluation framework from scratch, a process that is often time-consuming, not reproducible, and prone to partiality. In this thesis, I have tackled these challenges from different angles. First, from an analyst's point of view, I participated in the analysis of single-cell data from an immunotherapy experiment and developed a semi-automatic analysis toolkit aiming to facilitate the data analysis procedure of single-cell data for other researchers. Second, from a benchmarker's perspective, I evaluated analysis tools used for single-cell data with a focus on how to best retrieve cell populations from single-cell data using clustering algorithms. Finally, I participated in the development of a new computational platform, Omnibenchmark, hosting collaborative and continuous benchmarks with the aim of providing updated method recommendations for the community. Single-cell RNA sequencing is one of the most popular and most used high-throughput technologies. It yields information about the transcriptional profile (genome-wide or targeted) of single cells, and these data potentially hold key information for basic and translational research. For example, it is now possible to identify the response of a set of genes (differential expression) or cells (differential abundance) to a given treatment or condition, retrieve the gene pathways that are involved in a cellular response (pathway analysis), or identify molecular markers that could be used to identify a population of cells (marker gene identification). Ultimately, most of these biological findings rely on a critical step called clustering: an unsupervised machine learning approach that groups data points (here, cells) with similar properties (here, their transcriptomic profiles). In the context of single-cell data, the aim of clustering is to group together similar entities, give a proxy of the cell-type heterogeneity in the data, and use this information for downstream analyses. However, clustering can be influenced by technical effects, and not correctly removing them will bias the classification (by grouping cells by batch, sample, or other technical effects instead of relevant biological variation). So-called ‘preprocessing’ steps aim to tackle this issue by removing data variation originating from technical effects. However, the single-cell research field is bloated with methods performing the same tasks, each one reporting performance scores higher than its direct competitors.
In Manuscripts I and II, I contributed to benchmarking efforts to evaluate how processing methods influence the critical step of clustering. In other words, we evaluated the performance of processing methods at removing technical effects in single-cell data so that biological effects can be correctly interpreted. With these benchmarks, we could provide guidance on how to best perform several data analysis tasks in different experimental settings. Single-cell RNA sequencing technologies were, a few years after their introduction, enhanced by the development of multimodal sequencing technologies. Sequencing was focused for decades on targeting one molecule at a time (DNA, RNA, protein, ...), and it was a challenge to integrate these data across different experiments. Multimodal technologies allow different data modalities to be analyzed simultaneously, while maintaining single-cell resolution. One of these technologies, CITE-seq, allows RNA and cell surface proteins to be studied together. The reasoning behind CITE-seq development is that, in addition to RNA, surface proteins are: i) tightly linked to cell type identity; ii) a closer and more stable proxy for cell activity; and iii) the main target of most developed drugs. In Manuscript III, I contributed to the analysis of metastatic renal cell cancer samples treated with an immune checkpoint inhibition therapy. Using CITE-seq data, we could distinguish several subpopulations of CD8 and CD4 T lymphocytes and found that several of them were reacting positively to the immunotherapy, information that could later be used to better understand the cellular mechanisms involved in the positive response to the treatment. The analysis of such data can be challenging, especially for experimentalists who often rely on bioinformaticians to generate an analysis pipeline. In Manuscript IV, I helped to develop a standardized and flexible analysis pipeline for multimodal data analysis performing classical processing and downstream tasks. Our analysis toolkit provides guidance to analysts new to multimodal analysis, and we also provide a tool that makes it easier for experimentalists to analyze their own data. In the last part of the thesis, I focused on the current approaches used to benchmark methods in single-cell research and on a novel way to tackle their limitations. In the single-cell field, more than a thousand software tools have been developed to analyze these high-dimensional data, and most of the publications presenting these tools perform benchmarking according to their own rules and their own judgement. ‘Neutral’ benchmarks do exist, however; they do not present new methods but give a neutral evaluation of the current state of research. In Manuscript V, we performed a meta-analysis of these neutral benchmarks to highlight the current practices and the limitations of these studies. We found that, while data and code are generally available, most studies do not use all of the available informatics tools developed for reproducibility: containerization, workflow systems, or provenance tracking, for instance. In Manuscript VI, we present a new benchmarking system that aims to tackle these issues. This system, called ‘Omnibenchmark’, is a collaborative platform for method developers to host their tools and for analysts to find the latest recommendations. We hope to lay the foundations for new benchmarking practices and ultimately increase neutrality, extensibility, and re-usability in this field of science
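The clustering step discussed in this abstract groups cells by the similarity of their expression profiles. The sketch below is a minimal, hypothetical illustration of that idea on simulated counts, using scikit-learn (library-size normalization, log transform, PCA, then k-means); it is not one of the benchmarked pipelines or Omnibenchmark code.

```python
# Minimal, hypothetical sketch of clustering cells by transcriptomic profile:
# normalize counts, reduce dimensionality with PCA, then cluster with k-means.
# The count matrix is simulated; real pipelines (and the benchmarks discussed
# above) involve many more choices, e.g. feature selection and batch correction.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_cells, n_genes = 500, 2000
counts = rng.poisson(lam=1.0, size=(n_cells, n_genes)).astype(float)

# Library-size normalization followed by a log1p transform.
libsize = counts.sum(axis=1, keepdims=True)
lognorm = np.log1p(counts / libsize * 1e4)

# Reduce to a handful of principal components before clustering.
pcs = PCA(n_components=20, random_state=0).fit_transform(lognorm)

# Group cells into a fixed number of clusters (a proxy for cell populations).
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(pcs)

print("cells per cluster:", np.bincount(labels))
```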

    Report from Dagstuhl Seminar 23031: Frontiers of Information Access Experimentation for Research and Education

    Full text link
    This report documents the program and the outcomes of Dagstuhl Seminar 23031, "Frontiers of Information Access Experimentation for Research and Education", which brought together 37 participants from 12 countries. The seminar addressed technology-enhanced information access (information retrieval, recommender systems, natural language processing) and specifically focused on developing more responsible experimental practices leading to more valid results, both for research and for scientific education. The seminar brought together experts from various sub-fields of information access, namely IR, RS, NLP, information science, and human-computer interaction, to create a joint understanding of the problems and challenges presented by next-generation information access systems, from both the research and the experimentation points of view, to discuss existing solutions and impediments, and to propose next steps to be pursued in the area in order to improve not only our research methods and findings but also the education of the new generation of researchers and developers. The seminar featured a series of long and short talks delivered by participants, who helped set a common ground and let topics of interest emerge, to be explored as the main output of the seminar. This led to the definition of five groups, which investigated challenges, opportunities, and next steps in the following areas: reality check, i.e. conducting real-world studies; human-machine-collaborative relevance judgment frameworks; overcoming methodological challenges in information retrieval and recommender systems through awareness and education; results-blind reviewing; and guidance for authors.
Comment: Dagstuhl Seminar 23031, report

    Safe Class and Data Evolution in Large and Long-Lived Java Applications

    Get PDF
    There is a growing class of applications, implemented in object-oriented languages, that are large and complex, that exploit object persistence, and that need to run uninterrupted for long periods of time. Development and maintenance of such applications can present challenges in the following interrelated areas: consistent and scalable evolution of persistent data and code, optimal build management, and runtime changes to applications. The research presented in this thesis addresses the above issues. Since Java is becoming an increasingly popular platform for implementing large and long-lived applications, it was chosen for the experiments. The first part of the research was undertaken in the context of the PJama system, an orthogonally persistent platform for Java. A technology that supports persistent class and object evolution for this platform was designed, built, and evaluated. This technology integrates build management, persistent class evolution, and support for several forms of eager conversion of persistent objects. Research in build management for Java has resulted in the creation of a generally applicable, compiler-independent smart recompilation technology, which can be re-used in a Java IDE, or as a standalone Java-specific utility similar to make. The technology for eager object conversion that we developed allows developers to perform arbitrarily complex changes to persistent objects and their collections. A high level of developer control over the conversion process was achieved in part due to the introduction of a mechanism for dynamic renaming of old class versions. This mechanism was implemented using minor non-standard extensions to the Java language. However, we also demonstrate how to achieve nearly the same results without modifying the language specification. In this form, we believe, our technology can be largely re-used with practically any persistent object solution for Java. The second part of this research was undertaken using as an implementation platform the HotSpot Java Virtual Machine (JVM), which is currently Sun's main production JVM. A technology was developed that allows engineers to redefine classes on the fly in the running VM. Our main focus was on the runtime evolution of server-type applications, though we also address modification of applications running in the debugger. Unlike the only other similar system for Java known to us, our technology supports redefinition of classes that have methods currently active. Several policies for handling such methods have been proposed; one of them is currently operational, and another is in the experimental stage. We also propose to re-use the runtime evolution technology for dynamic fine-grain profiling of applications
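The smart recompilation technology mentioned above decides which compilation units actually need rebuilding after an edit. As a loose, hypothetical illustration of that general idea (in Python rather than Java, and not the thesis's compiler-independent implementation), the sketch below rebuilds a unit only when its own source changed or when the public interface of a direct dependency changed, which is what distinguishes smart recompilation from timestamp-based make.

```python
# Hypothetical sketch of a smart-recompilation decision: rebuild a unit if its
# source changed, or if the *interface fingerprint* of any unit it depends on
# changed since the last build. All unit names and fingerprints are invented.

previous_build = {
    "A": {"source": "hA1", "interface": "iA1", "deps": []},
    "B": {"source": "hB1", "interface": "iB1", "deps": []},
    "C": {"source": "hC1", "interface": "iC1", "deps": ["A"]},
    "D": {"source": "hD1", "interface": "iD1", "deps": ["B"]},
}

# After an edit: A's body changed but its interface is unchanged;
# B's change altered its public interface.
current_state = {
    "A": {"source": "hA2", "interface": "iA1", "deps": []},
    "B": {"source": "hB2", "interface": "iB2", "deps": []},
    "C": {"source": "hC1", "interface": "iC1", "deps": ["A"]},
    "D": {"source": "hD1", "interface": "iD1", "deps": ["B"]},
}

def needs_recompilation(unit: str) -> bool:
    """Rebuild if the unit was edited or a dependency's public interface changed."""
    old, new = previous_build[unit], current_state[unit]
    if old["source"] != new["source"]:
        return True
    return any(
        previous_build[dep]["interface"] != current_state[dep]["interface"]
        for dep in new["deps"]
    )

for unit in current_state:
    print(unit, "->", "recompile" if needs_recompilation(unit) else "up to date")

# Expected output: A and B recompile (edited), D recompiles (B's interface
# changed), while C stays up to date because A's interface did not change --
# unlike a plain timestamp-based make, which would rebuild C as well.
```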