33 research outputs found
A MOSAIC of methods: Improving ortholog detection through integration of algorithmic diversity
Ortholog detection (OD) is a critical step for comparative genomic analysis
of protein-coding sequences. In this paper, we begin with a comprehensive
comparison of four popular, methodologically diverse OD methods: MultiParanoid,
Blat, Multiz, and OMA. In head-to-head comparisons, these methods are shown to
significantly outperform one another 12-30% of the time. This high
complementarity motivates the presentation of the first tool for integrating
methodologically diverse OD methods. We term this program MOSAIC, or Multiple
Orthologous Sequence Analysis and Integration by Cluster optimization. Relative
to component and competing methods, we demonstrate that MOSAIC more than
quintuples the number of alignments for which all species are present, while
simultaneously maintaining or improving functional-, phylogenetic-, and
sequence identity-based measures of ortholog quality. Further, we demonstrate
that this improvement in alignment quality yields 40-280% more confidently
aligned sites. Combined, these factors translate to higher estimated levels of
overall conservation, while at the same time allowing for the detection of up
to 180% more positively selected sites. MOSAIC is available as a Python package.
MOSAIC alignments, source code, and full documentation are available at
http://pythonhosted.org/bio-MOSAIC
Population Genetics of Rare Variants and Complex Diseases
Identifying drivers of complex traits from the noisy signals of genetic
variation obtained from high throughput genome sequencing technologies is a
central challenge faced by human geneticists today. We hypothesize that the
variants involved in complex diseases are likely to exhibit non-neutral
evolutionary signatures. Uncovering the evolutionary history of all variants is
therefore of intrinsic interest for complex disease research. However, doing so
necessitates the simultaneous elucidation of the targets of natural selection
and population-specific demographic history. Here we characterize the action of
natural selection operating across complex disease categories, and use
population genetic simulations to evaluate the expected patterns of genetic
variation in large samples. We focus on populations that have experienced
historical bottlenecks followed by explosive growth (consistent with most human
populations), and describe the differences between evolutionarily deleterious
mutations and those that are neutral. Genes associated with several complex
disease categories exhibit stronger signatures of purifying selection than
non-disease genes. In addition, loci identified through genome-wide association
studies of complex traits also exhibit signatures consistent with being in
regions recurrently targeted by purifying selection. Through simulations, we
show that population bottlenecks and rapid growth enable deleterious rare
variants to persist at low frequencies just as long as neutral variants, but
low frequency and common variants tend to be much younger than neutral
variants. This has resulted in a large proportion of modern-day rare alleles
that have a deleterious effect on function, and that potentially contribute to
disease susceptibility.
Comment: 36 pages, 7 figures
The Fitness Cost of Antibiotic Resistance in Streptococcus pneumoniae: Insight from the Field
Laboratory studies have suggested that antibiotic resistance may result in decreased fitness in the bacteria that harbor it. Observational studies have supported this, but due to ethical and practical considerations, it is rare to have experimental control over antibiotic prescription rates. We analyze data from a 54-month longitudinal trial that monitored pneumococcal drug resistance during and after biannual mass distribution of azithromycin for the elimination of the blinding eye disease, trachoma. Prescription of azithromycin and antibiotics that can create cross-resistance to it is rare in this part of the world. As a result, we were able to follow trends in resistance with minimal influence from unmeasured antibiotic use. Using these data, we fit a probabilistic disease transmission model that included two resistant strains, corresponding to the two dominant modes of resistance to macrolide antibiotics. We estimated the relative fitness of these two strains to be 0.86 (95% CI 0.80 to 0.90), and 0.88 (95% CI 0.82 to 0.93), relative to antibiotic-sensitive strains. We then used these estimates to predict that, within 5 years of the last antibiotic treatment, there would be a 95% chance of elimination of macrolide resistance by intra-species competition alone. Although it is quite possible that the fitness cost of macrolide resistance is sufficient to ensure its eventual elimination in the absence of antibiotic selection, this process takes time, and prevention is likely the best policy in the fight against resistance.
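To make the notion of relative fitness concrete, the following is a minimal Python sketch of a textbook two-strain competition recursion. It is not the probabilistic transmission model fitted in the study: the function name `resistant_fraction`, the starting frequency, and the use of an abstract "generation" as the time unit are all assumptions made for illustration, and the sketch ignores treatment, transmission dynamics, and stochasticity.

```python
def resistant_fraction(p0, w, generations):
    """Frequency of a resistant strain over repeated rounds of competition.

    p0          -- starting frequency of the resistant strain (hypothetical value)
    w           -- fitness of the resistant strain relative to the sensitive
                   strain, whose fitness is fixed at 1
    generations -- number of competition rounds (abstract time unit)
    """
    p = p0
    trajectory = [p]
    for _ in range(generations):
        # Standard haploid selection update: each strain's share is
        # reweighted by its relative fitness and then renormalised.
        p = p * w / (p * w + (1.0 - p))
        trajectory.append(p)
    return trajectory
```

Under this recursion the odds of resistance, p / (1 - p), are multiplied by w each round, so a relative fitness below 1 (such as the reported 0.86 or 0.88) implies a steady decline in the absence of antibiotic selection. How many rounds correspond to a calendar year is not something this sketch can say.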
A framework for future national pediatric pandemic respiratory disease severity triage: The HHS pediatric COVID-19 data challenge
Abstract
Introduction:
With persistent incidence, incomplete vaccination rates, confounding respiratory illnesses, and few therapeutic interventions available, COVID-19 continues to be a burden on the pediatric population. During a surge, it is difficult for hospitals to direct limited healthcare resources effectively. While the overwhelming majority of pediatric infections are mild, there have been life-threatening exceptions that illuminated the need to proactively identify pediatric patients at risk of severe COVID-19 and other respiratory infectious diseases. However, a nationwide capability for developing validated computational tools to identify pediatric patients at risk using real-world data does not exist.
Methods:
HHS ASPR BARDA sought, through the power of competition in a challenge, to create computational models to address two clinically important questions using the National COVID Cohort Collaborative: (1) Of pediatric patients who test positive for COVID-19 in an outpatient setting, who are at risk for hospitalization? (2) Of pediatric patients who test positive for COVID-19 and are hospitalized, who are at risk for needing mechanical ventilation or cardiovascular interventions?
Results:
This challenge was the first, multi-agency, coordinated computational challenge carried out by the federal government as a response to a public health emergency. Fifty-five computational models were evaluated across both tasks and two winners and three honorable mentions were selected.
Conclusion:
This challenge serves as a framework for how the government, research communities, and large data repositories can be brought together to source solutions when resources are strapped during a pandemic.
Large expert-curated database for benchmarking document similarity detection in biomedical literature search
Document recommendation systems for locating relevant literature have mostly relied on methods developed a decade ago. This is largely due to the lack of a large offline gold-standard benchmark of relevant documents that cover a variety of research fields such that newly developed literature search techniques can be compared, improved and translated into practice. To overcome this bottleneck, we have established the RElevant LIterature SearcH consortium consisting of more than 1500 scientists from 84 countries, who have collectively annotated the relevance of over 180 000 PubMed-listed articles with regard to their respective seed (input) article/s. The majority of annotations were contributed by highly experienced, original authors of the seed articles. The collected data cover 76% of all unique PubMed Medical Subject Headings descriptors. No systematic biases were observed across different experience levels, research fields or time spent on annotations. More importantly, annotations of the same document pairs contributed by different scientists were highly concordant. We further show that the three representative baseline methods used to generate recommended articles for evaluation (Okapi Best Matching 25, Term Frequency-Inverse Document Frequency and PubMed Related Articles) had similar overall performances. Additionally, we found that these methods each tend to produce distinct collections of recommended articles, suggesting that a hybrid method may be required to completely capture all relevant articles. The established database server located at https://relishdb.ict.griffith.edu.au is freely available for the downloading of annotation data and the blind testing of new methods. We expect that this benchmark will be useful for stimulating the development of new powerful techniques for title and title/abstract-based search engines for relevant articles in biomedical research.
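One of the baselines named above, Term Frequency-Inverse Document Frequency, can be illustrated with a small Python sketch that ranks candidate abstracts against a seed abstract by cosine similarity. This is an assumption-laden illustration rather than the consortium's pipeline: the function names, the tokenizer, and the smoothed IDF formula are choices made here, not details taken from the abstract.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase alphanumeric tokens; a deliberately simple, hypothetical tokenizer.
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse {term: weight} vector per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed IDF (one common variant; an assumption of this sketch).
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: (cnt / len(toks)) * idf[t] for t, cnt in tf.items()})
    return vectors


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_candidates(seed, candidates):
    """Return (score, candidate_index) pairs, most similar first."""
    vectors = tfidf_vectors([seed] + candidates)
    scores = [(cosine(vectors[0], v), i) for i, v in enumerate(vectors[1:])]
    return sorted(scores, reverse=True)
```

A new method could be compared against such a baseline by checking how often its top-ranked candidates agree with the expert relevance annotations; the observation that the baselines return distinct article sets is what motivates combining scores from several rankers.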
The evolving SARS-CoV-2 epidemic in Africa: Insights from rapidly expanding genomic surveillance
INTRODUCTION
Investment in Africa over the past year with regard to severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) sequencing has led to a massive increase in the number of sequences, which, to date, exceeds 100,000 sequences generated to track the pandemic on the continent. These sequences have profoundly affected how public health officials in Africa have navigated the COVID-19 pandemic.
RATIONALE
We demonstrate how the first 100,000 SARS-CoV-2 sequences from Africa have helped monitor the epidemic on the continent, how genomic surveillance expanded over the course of the pandemic, and how we adapted our sequencing methods to deal with an evolving virus. Finally, we also examine how viral lineages have spread across the continent in a phylogeographic framework to gain insights into the underlying temporal and spatial transmission dynamics for several variants of concern (VOCs).
RESULTS
Our results indicate that the number of countries in Africa that can sequence the virus within their own borders is growing and that this is coupled with a shorter turnaround time from the time of sampling to sequence submission. Ongoing evolution necessitated the continual updating of primer sets, and, as a result, eight primer sets were designed in tandem with viral evolution and used to ensure effective sequencing of the virus. The pandemic unfolded through multiple waves of infection that were each driven by distinct genetic lineages, with B.1-like ancestral strains associated with the first pandemic wave of infections in 2020. Successive waves on the continent were fueled by different VOCs, with Alpha and Beta cocirculating in distinct spatial patterns during the second wave and Delta and Omicron affecting the whole continent during the third and fourth waves, respectively. Phylogeographic reconstruction points toward distinct differences in viral importation and exportation patterns associated with the Alpha, Beta, Delta, and Omicron variants and subvariants, when considering both Africa versus the rest of the world and viral dissemination within the continent. Our epidemiological and phylogenetic inferences therefore underscore the heterogeneous nature of the pandemic on the continent and highlight key insights and challenges, for instance, recognizing the limitations of low testing proportions. We also highlight the early warning capacity that genomic surveillance in Africa has had for the rest of the world with the detection of new lineages and variants, the most recent being the characterization of various Omicron subvariants.
CONCLUSION
Sustained investment for diagnostics and genomic surveillance in Africa is needed as the virus continues to evolve. This is important not only to help combat SARS-CoV-2 on the continent but also because it can be used as a platform to help address the many emerging and reemerging infectious disease threats in Africa. In particular, capacity building for local sequencing within countries or within the continent should be prioritized because this is generally associated with shorter turnaround times, providing the most benefit to local public health authorities tasked with pandemic response and mitigation and allowing for the fastest reaction to localized outbreaks. These investments are crucial for pandemic preparedness and response and will serve the health of the continent well into the 21st century.
Harnessing Change: Human Health through the Lense of Evolution and Dynamical Systems Theory
Over 2000 years ago, Heraclitus noted, "Everything changes and nothing stands still." While this truth has long been evident to the wise, we have only recently developed the tools necessary to scientifically characterize sweeping patterns of change in large dynamical systems. Despite rapid progress, new methods and data sources are still sorely needed to further illuminate the intricate and dynamic nature of reality. In this dissertation, we will focus our investigations on understanding patterns of change with direct relevance to human health. In the first two chapters, we develop novel methodologies that lend insight into the evolutionary history of the human race and the genetic basis of human-specific traits and disease. Chapter 2 presents MOSAIC, a new Python package for improved detection of genetically related genes between species. This inference is a foundational step towards understanding the function of proteins and the evolutionary pressures they have faced. This tool, along with a combination of other methods, facilitates our analysis in Chapter 3. In this section, we use the patterns of mutations along the human lineage to discover genes and even specific mutations that may play important roles in intelligence, obesity, mental health, as well as a variety of basic biological functions. These findings provide insight into the genetic architecture of health and disease. At the same time, they leave open questions about how genetic factors interact with the broad array of environmental and ecological variables that fundamentally shape downstream phenotypes. In Chapter 4, we introduce CauseMap, a tool I built to understand causal relationships within complex dynamical systems using time series data. It is our hope that this method will help us to interpret human health and disease as states of the bodily dynamical system embedded inextricably within an evolving social, economic, and environmental network.
This perspective, we hope, will allow us to understand the characteristics of human health that emerge from a time-hewn dynamic equilibrium with the world within and around us.
CauseMap: fast inference of causality from complex time series
Background. Establishing health-related causal relationships is a central pursuit in biomedical research. Yet, the interdependent non-linearity of biological systems renders causal dynamics laborious and at times impractical to disentangle. This pursuit is further impeded by the dearth of time series that are sufficiently long to observe and understand recurrent patterns of flux. However, as data generation costs plummet and technologies like wearable devices democratize data collection, we anticipate a coming surge in the availability of biomedically-relevant time series data. Given the life-saving potential of these burgeoning resources, it is critical to invest in the development of open source software tools that are capable of drawing meaningful insight from vast amounts of time series data. Results. Here we present CauseMap, the first open source implementation of convergent cross mapping (CCM), a method for establishing causality from long time series data (≳25 observations). Compared to existing time series methods, CCM has the advantage of being model-free and robust to unmeasured confounding that could otherwise induce spurious associations. CCM builds on Takens’ Theorem, a well-established result from dynamical systems theory that requires only mild assumptions. This theorem allows us to reconstruct high dimensional system dynamics using a time series of only a single variable. These reconstructions can be thought of as shadows of the true causal system. If reconstructed shadows can predict points from opposing time series, we can infer that the corresponding variables are providing views of the same causal system, and so are causally related. Unlike traditional metrics, this test can establish the directionality of causation, even in the presence of feedback loops. Furthermore, since CCM can extract causal relationships from time series of, e.g., a single individual, it may be a valuable tool for personalized medicine.
We implement CCM in Julia, a high-performance programming language designed for facile technical computing. Our software package, CauseMap, is platform-independent and freely available as an official Julia package. Conclusions. CauseMap is an efficient implementation of a state-of-the-art algorithm for detecting causality from time series data. We believe this tool will be a valuable resource for biomedical research and personalized medicine.
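The cross-mapping idea described above can be sketched in a few lines of Python. This is an illustrative sketch only, not CauseMap's Julia API: the function names, the default embedding dimension `E=2` and lag `tau=1`, and the exponential neighbour weighting are assumptions chosen for the example. It builds a lagged-coordinate "shadow" of one series, uses nearest neighbours on that shadow to estimate the other series, and reports the correlation between estimates and observations.

```python
import math


def delay_embed(series, E, tau=1):
    """Lagged-coordinate embedding: vectors (x_t, x_{t-tau}, ...) and their times."""
    start = (E - 1) * tau
    times = list(range(start, len(series)))
    vectors = [[series[t - k * tau] for k in range(E)] for t in times]
    return vectors, times


def pearson(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b) if var_a and var_b else 0.0


def cross_map_skill(source, target, E=2, tau=1):
    """Estimate `target` from the shadow manifold of `source`; return correlation."""
    vectors, times = delay_embed(source, E, tau)
    if len(vectors) < E + 2:
        raise ValueError("series too short for the chosen E and tau")
    predicted, observed = [], []
    for i, v in enumerate(vectors):
        # E + 1 nearest neighbours on the shadow manifold, excluding the point itself.
        nearest = sorted(
            (math.dist(v, w), j) for j, w in enumerate(vectors) if j != i
        )[: E + 1]
        d_min = max(nearest[0][0], 1e-12)  # guard against zero distance
        weights = [math.exp(-d / d_min) for d, _ in nearest]
        total = sum(weights)
        # Weighted average of the target at the neighbours' time indices.
        estimate = sum(
            w * target[times[j]] for w, (_, j) in zip(weights, nearest)
        ) / total
        predicted.append(estimate)
        observed.append(target[times[i]])
    return pearson(predicted, observed)


def convergence(source, target, lengths, E=2, tau=1):
    """Cross-map skill for growing prefixes of the series (the 'convergent' part)."""
    return [cross_map_skill(source[:n], target[:n], E, tau) for n in lengths]
```

In the usual reading of CCM, if X influences Y then the shadow built from Y carries information about X, so `cross_map_skill(y, x)` that rises and levels off as the library length grows is taken as evidence for X driving Y. Calling `convergence` in both directions gives a rough view of directionality; a real analysis would also tune `E` and `tau` and assess significance.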
Population Genetics of Rare Variants and Complex Diseases
Objectives:
Identifying drivers of complex traits from the noisy signals of genetic variation obtained from high-throughput genome sequencing technologies is a central challenge faced by human geneticists today. We hypothesize that the variants involved in complex diseases are likely to exhibit non-neutral evolutionary signatures. Uncovering the evolutionary history of all variants is therefore of intrinsic interest for complex disease research. However, doing so necessitates the simultaneous elucidation of the targets of natural selection and population-specific demographic history.
Methods:
Here we characterize the action of natural selection operating across complex disease categories, and use population genetic simulations to evaluate the expected patterns of genetic variation in large samples. We focus on populations that have experienced historical bottlenecks followed by explosive growth (consistent with many human populations), and describe the differences between evolutionarily deleterious mutations and those that are neutral.
Results:
Genes associated with several complex disease categories exhibit stronger signatures of purifying selection than non-disease genes. In addition, loci identified through genome-wide association studies of complex traits also exhibit signatures consistent with being in regions recurrently targeted by purifying selection. Through simulations, we show that population bottlenecks and rapid growth enable deleterious rare variants to persist at low frequencies just as long as neutral variants, but low-frequency and common variants tend to be much younger than neutral variants. This has resulted in a large proportion of modern-day rare alleles that have a deleterious effect on function and that potentially contribute to disease susceptibility.
Conclusions:
The key question for sequencing-based association studies of complex traits is how to distinguish between deleterious and benign genetic variation. We used population genetic simulations to uncover patterns of genetic variation that distinguish these two categories, especially derived allele age, thereby providing inroads into novel methods for characterizing rare genetic variation driving complex diseases.