139 research outputs found

    Knowledge-Driven Methods for Geographic Information Extraction in the Biomedical Domain

    abstract: Accounting for over a third of all emerging and re-emerging infections, viruses represent a major public health threat, which researchers and epidemiologists across the world have been attempting to contain for decades. Recently, genomics-based surveillance of viruses through methods such as virus phylogeography has grown into a popular tool for infectious disease monitoring. When conducting such surveillance studies, researchers need to manually retrieve geographic metadata denoting the location of infected host (LOIH) of viruses from public sequence databases such as GenBank and any publication related to their study. The large volume of semi-structured and unstructured information that must be reviewed for this task, along with the ambiguity of geographic locations, makes it especially challenging. Prior work has demonstrated that the majority of GenBank records lack sufficient geographic granularity concerning the LOIH of viruses. As a result, reviewing full-text publications is often necessary for conducting in-depth analysis of virus migration, which can be a very time-consuming process. Moreover, integrating geographic metadata pertaining to the LOIH of viruses from different sources, including different fields in GenBank records as well as full-text publications, and normalizing the integrated metadata to unique identifiers for subsequent analysis, are also challenging tasks, often requiring expert domain knowledge. Therefore, automated information extraction (IE) methods could help significantly accelerate this process, positively impacting public health research. However, very few research studies have attempted the use of IE methods in this domain.
This work explores the use of novel knowledge-driven geographic IE heuristics for extracting, integrating, and normalizing the LOIH of viruses based on information available in GenBank and related publications; when evaluated on manually annotated test sets, the methods achieved high accuracy and proved adequate for addressing this challenging problem. It also presents GeoBoost, a pioneering software system for georeferencing GenBank records, as well as a large-scale database containing over two million virus GenBank records georeferenced using the algorithms introduced here. The methods, database and software developed here could help support diverse public health domains focusing on sequence-informed virus surveillance, thereby enhancing existing platforms for controlling and containing disease outbreaks. Dissertation/Thesis, Doctoral Dissertation, Biomedical Informatics 201

    Biomedical Information Extraction Pipelines for Public Health in the Age of Deep Learning

    abstract: Unstructured texts containing biomedical information from sources such as electronic health records, scientific literature, discussion forums, and social media offer an opportunity to extract information for a wide range of applications in biomedical informatics. Building scalable and efficient pipelines for natural language processing and extraction of biomedical information plays an important role in the implementation and adoption of applications in areas such as public health. Advancements in machine learning and deep learning techniques have enabled rapid development of such pipelines. This dissertation presents entity extraction pipelines for two public health applications: virus phylogeography and pharmacovigilance. For virus phylogeography, geographical locations are extracted from biomedical scientific texts for metadata enrichment in the GenBank database containing 2.9 million virus nucleotide sequences. For pharmacovigilance, tools are developed to extract adverse drug reactions from social media posts to open avenues for post-market drug surveillance from non-traditional sources. Across these pipelines, high variance is observed in extraction performance among the entities of interest while using state-of-the-art neural network architectures. To explain the variation, linguistic measures are proposed to serve as indicators for entity extraction performance and to provide deeper insight into the domain complexity and the challenges associated with entity extraction. For both the phylogeography and pharmacovigilance pipelines presented in this work, the annotated datasets and applications are open source and freely available to the public to foster further research in public health. Dissertation/Thesis, Doctoral Dissertation, Biomedical Informatics 201

    Host ecology determines the dispersal patterns of a plant virus

    Since its isolation in 1966 in Kenya, rice yellow mottle virus (RYMV) has been reported throughout Africa, resulting in one of the most economically important emerging tropical plant diseases. A thorough understanding of RYMV evolution and dispersal is critical to manage viral spread in tropical areas that heavily rely on agriculture for subsistence. Phylogenetic analyses have suggested a relatively recent expansion, perhaps driven by the intensification of agricultural practices, but this has not yet been examined in a coherent statistical framework. To gain insight into the historical spread of RYMV within African rice cultivation, we analyse a dataset of 300 coat protein gene sequences, sampled from East to West Africa over a 46-year period, using Bayesian evolutionary inference. Spatiotemporal reconstructions date the origin of RYMV back to 1852 (1791-1903) and confirm Tanzania as the most likely geographic origin. Following a single long-distance transmission event from East to West Africa, separate viral populations have been maintained for about a century. To identify the factors that shaped the RYMV distribution, we apply a generalised linear model (GLM) extension of discrete phylogenetic diffusion and provide strong support for distances measured on a rice connectivity landscape as the major determinant of RYMV spread. Phylogeographic estimates in continuous space further complement this by demonstrating more pronounced expansion dynamics in West Africa that are consistent with agricultural intensification and extensification. Taken together, our principled phylogeographic inference approach shows for the first time that host ecology dynamics have shaped the historical spread of a plant virus.

    Generalized Linear Models in Bayesian Phylogeography

    abstract: Bayesian phylogeography is a framework that has enabled researchers to model the spatiotemporal diffusion of pathogens. In general, the framework assumes that discrete geographic sampling traits follow a continuous-time Markov chain process along the branches of an unknown phylogeny that is informed through nucleotide sequence data. Recently, this framework has been extended to model the transition rate matrix between discrete states as a generalized linear model (GLM) of predictors of interest to the pathogen. In this dissertation, I focus on these GLMs, describe their capabilities and limitations, and introduce a pipeline that may enable more researchers to utilize this framework. I first demonstrate how a GLM can be employed and how the support for the predictors can be measured using influenza A/H5N1 in Egypt as an example. Secondly, I compare the GLM framework to two alternative frameworks of Bayesian phylogeography: one that uses an advanced computational technique and one that does not. For this assessment, I model the diffusion of influenza A/H3N2 in the United States during the 2014-15 flu season with five methods encapsulated by the three frameworks. I summarize metrics of the phylogenies created by each and demonstrate their reproducibility by performing analyses on several random sequence samples under a variety of population growth scenarios. Next, I demonstrate how discretization of the location trait for a given sequence set can influence phylogenies and support for predictors. That is, I perform several GLM analyses on a set of sequences and change how the sequences are pooled, then show how aggregating predictors at four levels of spatial resolution will alter posterior support. Finally, I provide a solution for researchers who wish to use the GLM framework but may be deterred by the tedious file-manipulation requirements that must be completed to do so.
My pipeline, which is publicly available, should alleviate concerns pertaining to the difficulty and time-consuming nature of creating the files necessary to perform GLM analyses. This dissertation expands the knowledge of Bayesian phylogeographic GLMs and will facilitate the use of this framework, which may ultimately reveal the variables that drive the spread of pathogens. Dissertation/Thesis, Doctoral Dissertation, Biomedical Informatics 201
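
The GLM parameterization of the transition rate matrix mentioned in this abstract can be illustrated with a minimal sketch. It assumes the log-linear form commonly used in the phylogeographic GLM literature, in which the log of each off-diagonal rate is a sum of coefficient × inclusion indicator × predictor value; the abstract does not give the dissertation's exact formulation, normalization, or priors, and the function name and predictor values below are hypothetical.

```python
import math


def glm_rate_matrix(predictors, coefficients, indicators):
    """Build a CTMC rate matrix whose off-diagonal rates follow a log-linear GLM.

    predictors:   list of K square matrices (n x n) of predictor values,
                  assumed already log-transformed or standardized
    coefficients: list of K effect sizes, one per predictor
    indicators:   list of K inclusion flags (0 or 1), one per predictor
    """
    n = len(predictors[0])
    rates = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # log(rate_ij) = sum_k coefficient_k * indicator_k * x_ijk
            log_rate = sum(
                b * d * x[i][j]
                for x, b, d in zip(predictors, coefficients, indicators)
            )
            rates[i][j] = math.exp(log_rate)
        # The diagonal is still 0.0 here, so this makes each row sum to zero.
        rates[i][i] = -sum(rates[i])
    return rates


# Hypothetical example: three locations and one predictor (log distance).
log_distance = [
    [0.0, math.log(2.0), math.log(4.0)],
    [math.log(2.0), 0.0, math.log(3.0)],
    [math.log(4.0), math.log(3.0), 0.0],
]

# A negative coefficient means more distant location pairs get lower rates.
rate_matrix = glm_rate_matrix([log_distance], [-1.0], [1])
for row in rate_matrix:
    print(row)
```

Setting an indicator to 0 removes that predictor from the model, which is how support for a predictor is typically assessed in this kind of framework: the posterior frequency with which its indicator equals 1.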

    Convergent trends and spatiotemporal patterns of Aedes-borne arboviruses in Mexico and Central America

    Background: Aedes-borne arboviruses cause both seasonal epidemics and emerging outbreaks with a significant impact on global health. These viruses share mosquito vector species, often infecting the same host population within overlapping geographic regions. Thus, comparative analyses of the virus evolutionary and epidemiological dynamics across spatial and temporal scales could reveal convergent trends. Methodology/Principal findings: Focusing on Mexico as a case study, we generated novel chikungunya and dengue (CHIKV, DENV-1 and DENV-2) virus genomes from an epidemiological surveillance-derived historical sample collection, and analysed them together with longitudinally-collected genome and epidemiological data from the Americas. Aedes-borne arboviruses endemically circulating within the country were found to be introduced multiple times from lineages predominantly sampled from the Caribbean and Central America. For CHIKV, at least thirteen introductions were inferred over a year, with six of these leading to persistent transmission chains. For both DENV-1 and DENV-2, at least seven introductions were inferred over a decade. Conclusions/Significance: Our results suggest that CHIKV, DENV-1 and DENV-2 in Mexico share evolutionary and epidemiological trajectories. The southwest region of the country was determined to be the most likely location for viral introductions from abroad, with a subsequent spread into the Pacific coast towards the north of Mexico. Virus diffusion patterns observed across the country are likely driven by multiple factors, including mobility linked to human migration from Central towards North America. Considering Mexico’s geographic positioning displaying a high human mobility across borders, our results prompt the need to better understand the role of anthropogenic factors in the transmission dynamics of Aedes-borne arboviruses, particularly linked to land-based human migration.

    A quantitative synthesis of how movement has been incorporated within species distribution modelling

    Movement is a ubiquitous ecological process that influences the distribution of all species. In spite of this ecological significance, the incorporation of movement in species distribution models (SDMs) has lagged in comparison with other methodological and conceptual advancements. Many studies still ignore movement processes in applications inherently linked to movement (e.g. tracking changes in climate), and moreover, finer scale movements (e.g. foraging) have been neglected even more severely. We reviewed almost 600 research articles published in the last decade to identify important trends in the way that movement has been explicitly incorporated in SDMs. We note that the conceptual differences associated with the ‘object’ whose movement is of interest, as well as subtler differences among taxon groups (e.g. plants vs. animals) and levels of organization (e.g. individuals, populations, species) that have significant implications for how movement processes occur, have hindered more substantial integration of these concepts. Finally, we highlight novel and unique methodological issues such as the use of successive telemetry data as response data in these correlative models. The gaps and trends identified in this review should foster future research in this burgeoning research area.