14 research outputs found
Recommended from our members
How Trustworthy Is Your Tree? Bayesian Phylogenetic Effective Sample Size Through the Lens of Monte Carlo Error.
Bayesian inference is a popular and widely used approach to infer phylogenies (evolutionary trees). However, despite decades of widespread application, it remains difficult to judge how well a given Bayesian Markov chain Monte Carlo (MCMC) run explores the space of phylogenetic trees. In this paper, we investigate the Monte Carlo error of phylogenies, focusing on high-dimensional summaries of the posterior distribution, including variability in estimated edge/branch (known in phylogenetics as split) probabilities and tree probabilities, and variability in the estimated summary tree. Specifically, we ask whether there is any measure of effective sample size (ESS) applicable to phylogenetic trees that is capable of capturing the Monte Carlo error of these three summary measures. We find that some ESS measures are capable of capturing the error inherent in using MCMC samples to approximate the posterior distributions on phylogenies. We term these tree ESS measures, and identify a set of three that are useful in practice for assessing the Monte Carlo error. Lastly, we present visualization tools that can improve comparisons between multiple independent MCMC runs by accounting for the Monte Carlo error present in each chain. Our results indicate that common post-MCMC workflows are insufficient to capture the inherent Monte Carlo error of the tree, and highlight the need for both within-chain mixing and between-chain convergence assessments.
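As a rough illustration of how an ESS translates into Monte Carlo error for a split probability, one common approximation treats the MCMC sample as if it contained ESS independent draws. This is a textbook binomial approximation and is not necessarily the definition used in the paper; the symbols \hat{p} (the fraction of sampled trees containing the split) and \mathrm{MCSE} are introduced here only for illustration.

```latex
\mathrm{MCSE}(\hat{p}) \;\approx\; \sqrt{\frac{\hat{p}\,(1-\hat{p})}{\mathrm{ESS}}}
```

For example, under this approximation a split with \hat{p} = 0.5 and ESS = 100 has a Monte Carlo standard error of about 0.05.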
Semiparametric inference of effective reproduction number dynamics from wastewater pathogen surveillance data
Concentrations of pathogen genomes measured in wastewater have recently
become available as a new data source to use when modeling the spread of
infectious diseases. One promising use for this data source is inference of the
effective reproduction number, the average number of individuals a newly
infected person will infect. We propose a model where new infections arrive
according to a time-varying immigration rate which can be interpreted as a
compound parameter equal to the product of the proportion of susceptibles in
the population and the transmission rate. This model allows us to estimate the
effective reproduction number from concentrations of pathogen genomes while
avoiding difficult to verify assumptions about the dynamics of the susceptible
population. As a byproduct of our primary goal, we also produce a new model for
estimating the effective reproduction number from case data using the same
framework. We test this modeling framework in an agent-based simulation study
with a realistic data generating mechanism which accounts for the time-varying
dynamics of pathogen shedding. Finally, we apply our new model to estimating
the effective reproduction number of SARS-CoV-2 in Los Angeles, California,
using pathogen RNA concentrations collected from a large wastewater treatment
facility.
Structural basis of sterol recognition by human hedgehog receptor PTCH1
Hedgehog signaling is central in embryonic development and tissue regeneration. Disruption of the pathway is linked to genetic diseases and cancer. Binding of the secreted ligand, Sonic hedgehog (ShhN), to its receptor Patched (PTCH1) activates the signaling pathway. Here, we describe a 3.4-Å cryo-EM structure of the human PTCH1 bound to ShhNC24II, a modified hedgehog ligand mimicking its palmitoylated form. The membrane-embedded part of PTCH1 is surrounded by 10 sterol molecules at the inner and outer lipid bilayer portion of the protein. The annular sterols interact at multiple sites with both the sterol-sensing domain (SSD) and the SSD-like domain (SSDL), which are located on opposite sides of PTCH1. The structure reveals a possible route for sterol translocation across the lipid bilayer by PTCH1 and homologous transporters.
Estimation for general birth-death processes.
Birth-death processes (BDPs) are continuous-time Markov chains that track the number of particles in a system over time. While widely used in population biology, genetics and ecology, statistical inference of the instantaneous particle birth and death rates remains largely limited to restrictive linear BDPs in which per-particle birth and death rates are constant. Researchers often observe the number of particles at discrete times, necessitating data augmentation procedures such as expectation-maximization (EM) to find maximum likelihood estimates. For BDPs on finite state-spaces, there are powerful matrix methods for computing the conditional expectations needed for the E-step of the EM algorithm. For BDPs on infinite state-spaces, closed-form solutions for the E-step are available for some linear models, but most previous work has resorted to time-consuming simulation. Remarkably, we show that the E-step conditional expectations can be expressed as convolutions of computable transition probabilities for any general BDP with arbitrary rates. This important observation, along with a convenient continued fraction representation of the Laplace transforms of the transition probabilities, allows for novel and efficient computation of the conditional expectations for all BDPs, eliminating the need for truncation of the state-space or costly simulation. We use this insight to derive EM algorithms that yield maximum likelihood estimation for general BDPs characterized by various rate models, including generalized linear models. We show that our Laplace convolution technique outperforms competing methods when they are available and demonstrate a technique to accelerate EM algorithm convergence. We validate our approach using synthetic data and then apply our methods to cancer cell growth and estimation of mutation parameters in microsatellite evolution.
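To make the object being studied concrete, the following is a minimal sketch of simulating one path of a general BDP with state-dependent rates, using the standard Gillespie (exponential waiting time) construction. It is not the paper's Laplace convolution method, which is designed to avoid simulation. The function name `simulate_bdp`, its parameters, and the example rates are hypothetical and chosen only for illustration.

```python
import random


def simulate_bdp(birth_rate, death_rate, x0, t_end, seed=None):
    """Simulate one path of a birth-death process on [0, t_end].

    birth_rate(n) and death_rate(n) give the total birth and death
    rates when the population size is n (hypothetical interface).
    Returns a list of (time, state) pairs, one per jump, starting
    with (0.0, x0).
    """
    rng = random.Random(seed)
    t, n = 0.0, x0
    path = [(t, n)]
    while True:
        lam = birth_rate(n)
        mu = death_rate(n) if n > 0 else 0.0  # no deaths from state 0
        total = lam + mu
        if total <= 0.0:
            break  # absorbing state: no further jumps
        t += rng.expovariate(total)  # exponential holding time
        if t > t_end:
            break
        # Birth with probability lam / total, otherwise death
        n += 1 if rng.random() < lam / total else -1
        path.append((t, n))
    return path


# Example: a linear BDP with illustrative per-particle rates 0.5 and 0.3
path = simulate_bdp(lambda n: 0.5 * n, lambda n: 0.3 * n,
                    x0=10, t_end=5.0, seed=1)
```

Because the rates are passed as arbitrary functions of the current state, the same sketch covers non-linear rate models as well as the linear case.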
Using genetic data to identify transmission risk factors: Statistical assessment and application to tuberculosis transmission.
Identifying host factors that influence infectious disease transmission is an important step toward developing interventions to reduce disease incidence. Recent advances in methods for reconstructing infectious disease transmission events using pathogen genomic and epidemiological data open the door for investigation of host factors that affect onward transmission. While most transmission reconstruction methods are designed to work with densely sampled outbreaks, these methods are making their way into surveillance studies, where the fraction of sampled cases with sequenced pathogens could be relatively low. Surveillance studies that use transmission event reconstruction then use the reconstructed events as response variables (i.e., infection source status of each sampled case) and use host characteristics as predictors (e.g., presence of HIV infection) in regression models. We use simulations to study estimation of the effect of a host factor on probability of being an infection source via this multi-step inferential procedure. Using TransPhylo (a widely used method for Bayesian estimation of infectious disease transmission events) and logistic regression, we find that low sensitivity of identifying infection sources leads to dilution of the signal, biasing logistic regression coefficients toward zero. We show that increasing the proportion of sampled cases improves sensitivity and some, but not all, properties of the logistic regression inference. Application of these approaches to real-world data from a population-based TB study in Botswana fails to detect an association between HIV infection and probability of being a TB infection source.
We conclude that application of a pipeline, where one first uses TransPhylo and sparsely sampled surveillance data to infer transmission events and then estimates effects of host characteristics on probabilities of these events, should be accompanied by a realistic simulation study to better understand biases stemming from imprecise transmission event inference.
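The dilution effect described above can be illustrated with a small deterministic calculation. The sketch assumes a simple non-differential misclassification model, in which a true infection source is labeled as one with probability `sensitivity` and a non-source is correctly labeled with probability `specificity`, regardless of the host factor. This model, the function names, and all numeric values are assumptions made for illustration; they are not taken from the study.

```python
def odds(p):
    """Convert a probability to odds."""
    return p / (1.0 - p)


def observed_odds_ratio(p_factor, p_no_factor, sensitivity, specificity):
    """Odds ratio computed from misclassified infection-source labels.

    p_factor / p_no_factor: true probabilities of being an infection
    source for hosts with / without the factor (hypothetical values).
    """
    def observed(p):
        # True sources detected, plus non-sources falsely labeled
        return sensitivity * p + (1.0 - specificity) * (1.0 - p)

    return odds(observed(p_factor)) / odds(observed(p_no_factor))


# Illustrative numbers only
true_or = odds(0.30) / odds(0.20)
diluted_or = observed_odds_ratio(0.30, 0.20,
                                 sensitivity=0.4, specificity=0.95)
print(f"true OR = {true_or:.2f}, observed OR = {diluted_or:.2f}")
```

Under these assumptions the observed odds ratio lies between 1 and the true odds ratio, so the corresponding log odds ratio (the logistic regression coefficient) is pulled toward zero.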
Study participant characteristics.
Summary statistics of participant characteristics from the study of M. tuberculosis in Botswana.
Operating characteristics of statistical pipelines in secondary simulation settings.
See Fig 2 for x-axis and y-axis labels. The settings denote different simulation settings. For all simulations, the probability of transmission given contact for cases living without HIV was 1.75 times as large as the probability of transmission given contact for cases living with HIV. Default refers to the simulation settings used in the primary simulation setting. In the Double Sampling Density setting, the percentage of active cases sampled was doubled from 16% to 32%. In the Quad Sampling Density setting, the percentage of active cases sampled was quadrupled to 64%. In the Increase Sample Size setting, the number of clusters sampled from was doubled from 50 to 100. In the Increase Sampling Window setting, the sampling window was changed from the last three years of the simulation to the last seven years.
Summary of analyses using five and ten SNP cutoffs.
Mean tree height refers to the height of timed phylogenetic trees in years. TP + GLM refers to a statistical pipeline where infection source labels are first generated using TransPhylo and then used as response variables in a generalized linear model to calculate an odds ratio. The odds ratio is the odds ratio for the probability of being an infection source given the case is living without HIV as compared to cases living with HIV.
Operating characteristics of statistical pipelines in primary simulation settings.
Truth + GLM is a reference pipeline where the true infection source labels are used as response variables in a generalized linear model. TP + GLM is a pipeline where infection source labels generated by TransPhylo are used as response variables in a generalized linear model. TP + ME1 is a pipeline where infection source labels are generated by TransPhylo and used as an input into a model from the SAMBA package, which allows for false positives. TP + ME2 again uses labels generated by TransPhylo as response variables, in a model that allows for both false positives and false negatives. The settings denote different simulation settings; each value describes the true ratio of the probability of transmission given contact for cases living without HIV to that for cases living with HIV. For example, 0.57 (approximately 1/1.75) means that the probability of transmission given contact for hosts without HIV was 1.75 times smaller than the probability of transmission given contact for hosts with HIV. Coverage refers to the proportion of simulations where 95% confidence intervals captured the true odds ratio. Prop. Reject refers to the proportion of simulations where a null hypothesis that the true odds ratio was one would be rejected, assuming a significance level of 5%. Percent bias is bias divided by the true odds ratio; MCIW is mean confidence interval width.
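The operating characteristics defined in this caption can be computed from per-simulation odds ratio estimates and 95% confidence intervals roughly as follows. This is a sketch under stated assumptions, not the authors' code: bias is taken as the mean estimate minus the true odds ratio, percent bias is scaled by 100, and the null hypothesis is treated as rejected whenever the 95% interval excludes one. The function name and the example numbers are hypothetical.

```python
def operating_characteristics(estimates, lowers, uppers, true_or):
    """Summarize simulation results for an odds ratio estimator.

    estimates, lowers, uppers: per-simulation point estimates and
    95% confidence interval bounds (hypothetical inputs).
    """
    n = len(estimates)
    intervals = list(zip(lowers, uppers))
    # Coverage: fraction of intervals containing the true odds ratio
    coverage = sum(lo <= true_or <= hi for lo, hi in intervals) / n
    # Assumption: reject H0 (OR = 1) when the interval excludes 1
    prop_reject = sum(not (lo <= 1.0 <= hi) for lo, hi in intervals) / n
    # Assumption: bias is the mean estimate minus the truth
    bias = sum(estimates) / n - true_or
    percent_bias = 100.0 * bias / true_or
    # MCIW: mean confidence interval width
    mciw = sum(hi - lo for lo, hi in intervals) / n
    return {
        "coverage": coverage,
        "prop_reject": prop_reject,
        "percent_bias": percent_bias,
        "mciw": mciw,
    }


# Made-up results from four simulated data sets, true odds ratio 1.75
summary = operating_characteristics(
    estimates=[1.5, 2.0, 1.0, 1.9],
    lowers=[1.1, 1.2, 0.6, 1.3],
    uppers=[2.1, 3.0, 1.6, 2.8],
    true_or=1.75,
)
```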