33 research outputs found

    A MOSAIC of methods: Improving ortholog detection through integration of algorithmic diversity

    Full text link
    Ortholog detection (OD) is a critical step for comparative genomic analysis of protein-coding sequences. In this paper, we begin with a comprehensive comparison of four popular, methodologically diverse OD methods: MultiParanoid, Blat, Multiz, and OMA. In head-to-head comparisons, these methods are shown to significantly outperform one another 12-30% of the time. This high complementarity motivates the presentation of the first tool for integrating methodologically diverse OD methods. We term this program MOSAIC, or Multiple Orthologous Sequence Analysis and Integration by Cluster optimization. Relative to component and competing methods, we demonstrate that MOSAIC more than quintuples the number of alignments for which all species are present, while simultaneously maintaining or improving functional-, phylogenetic-, and sequence identity-based measures of ortholog quality. Further, we demonstrate that this improvement in alignment quality yields 40-280% more confidently aligned sites. Combined, these factors translate to higher estimated levels of overall conservation, while at the same time allowing for the detection of up to 180% more positively selected sites. MOSAIC is available as python package. MOSAIC alignments, source code, and full documentation are available at http://pythonhosted.org/bio-MOSAIC

    The Fitness Cost of Antibiotic Resistance in Streptococcus pneumoniae: Insight from the Field

    Get PDF
    Laboratory studies have suggested that antibiotic resistance may result in decreased fitness in the bacteria that harbor it. Observational studies have supported this, but due to ethical and practical considerations, it is rare to have experimental control over antibiotic prescription rates.We analyze data from a 54-month longitudinal trial that monitored pneumococcal drug resistance during and after biannual mass distribution of azithromycin for the elimination of the blinding eye disease, trachoma. Prescription of azithromycin and antibiotics that can create cross-resistance to it is rare in this part of the world. As a result, we were able to follow trends in resistance with minimal influence from unmeasured antibiotic use. Using these data, we fit a probabilistic disease transmission model that included two resistant strains, corresponding to the two dominant modes of resistance to macrolide antibiotics. We estimated the relative fitness of these two strains to be 0.86 (95% CI 0.80 to 0.90), and 0.88 (95% CI 0.82 to 0.93), relative to antibiotic-sensitive strains. We then used these estimates to predict that, within 5 years of the last antibiotic treatment, there would be a 95% chance of elimination of macrolide resistance by intra-species competition alone.Although it is quite possible that the fitness cost of macrolide resistance is sufficient to ensure its eventual elimination in the absence of antibiotic selection, this process takes time, and prevention is likely the best policy in the fight against resistance

    A framework for future national pediatric pandemic respiratory disease severity triage: The HHS pediatric COVID-19 data challenge

    Get PDF
    Abstract Introduction: With persistent incidence, incomplete vaccination rates, confounding respiratory illnesses, and few therapeutic interventions available, COVID-19 continues to be a burden on the pediatric population. During a surge, it is difficult for hospitals to direct limited healthcare resources effectively. While the overwhelming majority of pediatric infections are mild, there have been life-threatening exceptions that illuminated the need to proactively identify pediatric patients at risk of severe COVID-19 and other respiratory infectious diseases. However, a nationwide capability for developing validated computational tools to identify pediatric patients at risk using real-world data does not exist. Methods: HHS ASPR BARDA sought, through the power of competition in a challenge, to create computational models to address two clinically important questions using the National COVID Cohort Collaborative: (1) Of pediatric patients who test positive for COVID-19 in an outpatient setting, who are at risk for hospitalization? (2) Of pediatric patients who test positive for COVID-19 and are hospitalized, who are at risk for needing mechanical ventilation or cardiovascular interventions? Results: This challenge was the first, multi-agency, coordinated computational challenge carried out by the federal government as a response to a public health emergency. Fifty-five computational models were evaluated across both tasks and two winners and three honorable mentions were selected. Conclusion: This challenge serves as a framework for how the government, research communities, and large data repositories can be brought together to source solutions when resources are strapped during a pandemic

    Large expert-curated database for benchmarking document similarity detection in biomedical literature search

    Get PDF
    Document recommendation systems for locating relevant literature have mostly relied on methods developed a decade ago. This is largely due to the lack of a large offline gold-standard benchmark of relevant documents that cover a variety of research fields such that newly developed literature search techniques can be compared, improved and translated into practice. To overcome this bottleneck, we have established the RElevant LIterature SearcH consortium consisting of more than 1500 scientists from 84 countries, who have collectively annotated the relevance of over 180 000 PubMed-listed articles with regard to their respective seed (input) article/s. The majority of annotations were contributed by highly experienced, original authors of the seed articles. The collected data cover 76% of all unique PubMed Medical Subject Headings descriptors. No systematic biases were observed across different experience levels, research fields or time spent on annotations. More importantly, annotations of the same document pairs contributed by different scientists were highly concordant. We further show that the three representative baseline methods used to generate recommended articles for evaluation (Okapi Best Matching 25, Term Frequency-Inverse Document Frequency and PubMed Related Articles) had similar overall performances. Additionally, we found that these methods each tend to produce distinct collections of recommended articles, suggesting that a hybrid method may be required to completely capture all relevant articles. The established database server located at https://relishdb.ict.griffith.edu.au is freely available for the downloading of annotation data and the blind testing of new methods. We expect that this benchmark will be useful for stimulating the development of new powerful techniques for title and title/abstract-based search engines for relevant articles in biomedical research.Peer reviewe

    The evolving SARS-CoV-2 epidemic in Africa: Insights from rapidly expanding genomic surveillance

    Get PDF
    INTRODUCTION Investment in Africa over the past year with regard to severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) sequencing has led to a massive increase in the number of sequences, which, to date, exceeds 100,000 sequences generated to track the pandemic on the continent. These sequences have profoundly affected how public health officials in Africa have navigated the COVID-19 pandemic. RATIONALE We demonstrate how the first 100,000 SARS-CoV-2 sequences from Africa have helped monitor the epidemic on the continent, how genomic surveillance expanded over the course of the pandemic, and how we adapted our sequencing methods to deal with an evolving virus. Finally, we also examine how viral lineages have spread across the continent in a phylogeographic framework to gain insights into the underlying temporal and spatial transmission dynamics for several variants of concern (VOCs). RESULTS Our results indicate that the number of countries in Africa that can sequence the virus within their own borders is growing and that this is coupled with a shorter turnaround time from the time of sampling to sequence submission. Ongoing evolution necessitated the continual updating of primer sets, and, as a result, eight primer sets were designed in tandem with viral evolution and used to ensure effective sequencing of the virus. The pandemic unfolded through multiple waves of infection that were each driven by distinct genetic lineages, with B.1-like ancestral strains associated with the first pandemic wave of infections in 2020. Successive waves on the continent were fueled by different VOCs, with Alpha and Beta cocirculating in distinct spatial patterns during the second wave and Delta and Omicron affecting the whole continent during the third and fourth waves, respectively. Phylogeographic reconstruction points toward distinct differences in viral importation and exportation patterns associated with the Alpha, Beta, Delta, and Omicron variants and subvariants, when considering both Africa versus the rest of the world and viral dissemination within the continent. Our epidemiological and phylogenetic inferences therefore underscore the heterogeneous nature of the pandemic on the continent and highlight key insights and challenges, for instance, recognizing the limitations of low testing proportions. We also highlight the early warning capacity that genomic surveillance in Africa has had for the rest of the world with the detection of new lineages and variants, the most recent being the characterization of various Omicron subvariants. CONCLUSION Sustained investment for diagnostics and genomic surveillance in Africa is needed as the virus continues to evolve. This is important not only to help combat SARS-CoV-2 on the continent but also because it can be used as a platform to help address the many emerging and reemerging infectious disease threats in Africa. In particular, capacity building for local sequencing within countries or within the continent should be prioritized because this is generally associated with shorter turnaround times, providing the most benefit to local public health authorities tasked with pandemic response and mitigation and allowing for the fastest reaction to localized outbreaks. These investments are crucial for pandemic preparedness and response and will serve the health of the continent well into the 21st century

    CauseMap: fast inference of causality from complex time series

    No full text
    Background. Establishing health-related causal relationships is a central pursuit in biomedical research. Yet, the interdependent non-linearity of biological systems renders causal dynamics laborious and at times impractical to disentangle. This pursuit is further impeded by the dearth of time series that are sufficiently long to observe and understand recurrent patterns of flux. However, as data generation costs plummet and technologies like wearable devices democratize data collection, we anticipate a coming surge in the availability of biomedically-relevant time series data. Given the life-saving potential of these burgeoning resources, it is critical to invest in the development of open source software tools that are capable of drawing meaningful insight from vast amounts of time series data.Results. Here we present CauseMap, the first open source implementation of convergent cross mapping (CCM), a method for establishing causality from long time series data (≳25 observations). Compared to existing time series methods, CCM has the advantage of being model-free and robust to unmeasured confounding that could otherwise induce spurious associations. CCM builds on Takens’ Theorem, a well-established result from dynamical systems theory that requires only mild assumptions. This theorem allows us to reconstruct high dimensional system dynamics using a time series of only a single variable. These reconstructions can be thought of as shadows of the true causal system. If reconstructed shadows can predict points from opposing time series, we can infer that the corresponding variables are providing views of the same causal system, and so are causally related. Unlike traditional metrics, this test can establish the directionality of causation, even in the presence of feedback loops. Furthermore, since CCM can extract causal relationships from times series of, e.g., a single individual, it may be a valuable tool to personalized medicine. We implement CCM in Julia, a high-performance programming language designed for facile technical computing. Our software package, CauseMap, is platform-independent and freely available as an official Julia package.Conclusions. CauseMap is an efficient implementation of a state-of-the-art algorithm for detecting causality from time series data. We believe this tool will be a valuable resource for biomedical research and personalized medicine