Comparing penalization methods for linear models on large observational health data
Objective: This study evaluates regularization variants in logistic regression (L1, L2, ElasticNet, Adaptive L1, Adaptive ElasticNet, Broken adaptive ridge [BAR], and Iterative hard thresholding [IHT]) for discrimination and calibration performance, focusing on both internal and external validation. Materials and Methods: We use data from 5 US claims and electronic health record databases and develop models for various outcomes in a major depressive disorder patient population. We externally validate all models in the other databases. We use a train-test split of 75%/25% and evaluate performance with discrimination and calibration. Statistical analysis for difference in performance uses Friedman's test and critical difference diagrams. Results: Of the 840 models we develop, L1 and ElasticNet emerge as superior in both internal and external discrimination, with a notable AUC difference. BAR and IHT show the best internal calibration, without a clear external calibration leader. ElasticNet typically has larger model sizes than L1. Methods like IHT and BAR, while slightly less discriminative, significantly reduce model complexity. Conclusion: L1 and ElasticNet offer the best discriminative performance in logistic regression for healthcare predictions, maintaining robustness across validations. For simpler, more interpretable models, L0-based methods (IHT and BAR) are advantageous, providing greater parsimony and calibration with fewer features. This study aids in selecting suitable regularization techniques for healthcare prediction models, balancing performance, complexity, and interpretability.
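As a rough illustration of the comparison described above (not the study's actual pipeline, which runs on claims/EHR databases), the sketch below fits L1- and elastic-net-penalised logistic regression on synthetic data with a 75%/25% split and reports AUC and model size; all data, regularization strengths, and settings here are invented:

```python
# Sketch: comparing L1 vs elastic-net penalised logistic regression.
# Synthetic data stands in for the study's observational health data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=50, n_informative=10,
                           random_state=0)
# 75%/25% train-test split, as in the study design
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

models = {
    "L1": LogisticRegression(penalty="l1", solver="saga", C=0.1, max_iter=5000),
    "ElasticNet": LogisticRegression(penalty="elasticnet", solver="saga",
                                     l1_ratio=0.5, C=0.1, max_iter=5000),
}
for name, m in models.items():
    m.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, m.predict_proba(X_te)[:, 1])
    size = np.count_nonzero(m.coef_)  # model size = non-zero coefficients
    print(f"{name}: AUC={auc:.3f}, non-zero coefficients={size}")
```

Elastic-net mixes the L1 and L2 penalties (here `l1_ratio=0.5`), which tends to keep more correlated features than pure L1, consistent with the larger model sizes reported above.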
Distinguishing regional from within-codon rate heterogeneity in DNA sequence alignments
We present an improved phylogenetic factorial hidden Markov model (FHMM) for detecting two types of mosaic structures in DNA sequence alignments, related to (1) recombination and (2) rate heterogeneity. The focus of the present work is on improving the modelling of the latter aspect. Earlier papers have modelled different degrees of rate heterogeneity with separate hidden states of the FHMM. This approach fails to appreciate the intrinsic difference between two types of rate heterogeneity: long-range regional effects, which are potentially related to differences in the selective pressure, and the short-term periodic patterns within the codons, which merely capture the signature of the genetic code. We propose an improved model that explicitly distinguishes between these two effects, and we assess its performance on a set of simulated DNA sequence alignments.
Fully Bayesian tests of neutrality using genealogical summary statistics
Background: Many data summary statistics have been developed to detect departures from neutral expectations of evolutionary models. However, questions about the neutrality of the evolution of genetic loci within natural populations remain difficult to assess. One critical cause of this difficulty is that most methods for testing neutrality make simplifying assumptions simultaneously about the mutational model and the population size model. Consequently, rejecting the null hypothesis of neutrality under these methods could result from violations of either or both assumptions, making interpretation troublesome. Results: Here we harness posterior predictive simulation to exploit summary statistics of both the data and model parameters to test the goodness-of-fit of standard models of evolution. We apply the method to test the selective neutrality of molecular evolution in non-recombining gene genealogies and we demonstrate the utility of our method on four real data sets, identifying significant departures of neutrality in human influenza A virus, even after controlling for variation in population size. Conclusion: Importantly, by employing a full model-based Bayesian analysis, our method separates the effects of demography from the effects of selection. The method also allows multiple summary statistics to be used in concert, thus potentially increasing sensitivity. Furthermore, our method remains useful in situations where analytical expectations and variances of summary statistics are not available. This aspect has great potential for the analysis of temporally spaced data, an expanding area previously ignored for limited availability of theory and methods.
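The posterior predictive logic described here can be sketched generically (this is not the authors' software; the Poisson model, summary statistic, and approximate posterior below are invented stand-ins): draw parameters from the posterior, simulate replicate data sets, compute a summary statistic for each, and compare the observed statistic with the predictive distribution via a tail probability.

```python
# Sketch of posterior predictive simulation with a data summary statistic.
import numpy as np

rng = np.random.default_rng(0)
observed = rng.poisson(9.0, 40)       # stand-in "observed" data
obs_stat = observed.var()             # summary statistic of the data

# Stand-in posterior for the Poisson rate (conjugate-style gamma posterior)
posterior_rates = rng.gamma(observed.sum() + 1, 1 / (len(observed) + 1), 2000)

# One replicate data set per posterior draw, summarised by the same statistic
pred_stats = np.array([rng.poisson(lam, 40).var() for lam in posterior_rates])
ppp = (pred_stats >= obs_stat).mean()  # posterior predictive p-value
print(f"posterior predictive p = {ppp:.3f}")
```

An extreme tail probability (near 0 or 1) would flag a lack of fit; because the parameters are drawn from the full posterior, demographic uncertainty is integrated out rather than fixed, which is the separation of demography from selection the abstract emphasises.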
Determinants of dengue virus dispersal in the Americas
Dengue viruses (DENVs) are classified into four serotypes, each of which contains multiple genotypes. DENV genotypes introduced into the Americas over the past five decades have exhibited different rates and patterns of spatial dispersal. In order to understand factors underlying these patterns, we utilized a statistical framework that allows for the integration of ecological, socioeconomic, and air transport mobility data as predictors of viral diffusion while inferring the phylogeographic history. Predictors describing spatial diffusion based on several covariates were compared using a generalized linear model approach, where the support for each scenario and its contribution is estimated simultaneously from the data set. Although different predictors were identified for different serotypes, our analysis suggests that overall diffusion of DENV-1, -2, and -3 in the Americas was associated with airline traffic. The other significant predictors included human population size, the geographical distance between countries and between urban centers, and the density of people living in urban environments.
Unifying the spatial epidemiology and molecular evolution of emerging epidemics
We introduce a conceptual bridge between the previously unlinked fields of phylogenetics and mathematical spatial ecology, which enables the spatial parameters of an emerging epidemic to be directly estimated from sampled pathogen genome sequences. By using phylogenetic history to correct for spatial autocorrelation, we illustrate how a fundamental spatial variable, the diffusion coefficient, can be estimated using robust nonparametric statistics, and how heterogeneity in dispersal can be readily quantified. We apply this framework to the spread of the West Nile virus across North America, an important recent instance of spatial invasion by an emerging infectious disease. We demonstrate that the dispersal of West Nile virus is greater and far more variable than previously measured, such that its dissemination was critically determined by rare, long-range movements that are unlikely to be discerned during field observations. Our results indicate that, by ignoring this heterogeneity, previous models of the epidemic have substantially overestimated its basic reproductive number. More generally, our approach demonstrates that easily obtainable genetic data can be used to measure the spatial dynamics of natural populations that are otherwise difficult or costly to quantify.
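The basic estimator behind this idea can be sketched on simulated data (a stand-in for phylogenetically reconstructed branch displacements; the true diffusion coefficient, units, and summaries below are illustrative assumptions, not the paper's values). For 2-D Brownian motion, a branch of duration t with displacement d gives the per-branch estimate D = d^2 / (4t):

```python
# Sketch: per-branch diffusion coefficient estimates from displacements,
# with the spread across branches quantifying dispersal heterogeneity.
import numpy as np

rng = np.random.default_rng(1)
true_D = 100.0                        # illustrative units, e.g. km^2 / year
t = rng.uniform(0.1, 2.0, 500)        # branch durations (years)
# 2-D Brownian displacement: each coordinate has variance 2 * D * t
dx = rng.normal(0.0, np.sqrt(2 * true_D * t))
dy = rng.normal(0.0, np.sqrt(2 * true_D * t))

per_branch_D = (dx**2 + dy**2) / (4 * t)  # unbiased per-branch estimates
D_hat = per_branch_D.mean()
spread = per_branch_D.std() / D_hat       # heterogeneity in dispersal
print(f"estimated D = {D_hat:.1f}, coefficient of variation = {spread:.2f}")
```

In real data the per-branch estimates are far more heavy-tailed than in this homogeneous simulation; it is exactly that excess spread, driven by rare long-range movements, that the abstract reports for West Nile virus.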
TreeFlow: probabilistic programming and automatic differentiation for phylogenetics
Probabilistic programming frameworks are powerful tools for statistical modelling and inference. They are not immediately generalisable to phylogenetic problems due to the particular computational properties of the phylogenetic tree object. TreeFlow is a software library for probabilistic programming and automatic differentiation with phylogenetic trees. It implements inference algorithms for phylogenetic tree times and model parameters, given a tree topology. We demonstrate how TreeFlow can be used to quickly implement and assess new models. We also show that it provides reasonable performance for gradient-based inference algorithms compared to specialized computational libraries for phylogenetics. The data processing pipeline can be found at https://github.com/christiaanjs/treeflow-paper
Tree topologies are inferred using RAxML 8.2.12
Tree topologies are rooted using LSD 0.2
BEAST analyses are performed using BEAST 2.6.7
Variational inference analyses are performed using TreeFlow 0.0.1beta
Sequences have been removed from the H3N2 BEAST XML as a result of license conflicts. The complete version of this file is generated by the above pipeline.
Funding provided by: University of Auckland
Crossref Funder Registry ID: http://dx.doi.org/10.13039/501100001537
Award Number:
Carnivores sequence alignment accessed from the benchmark in BEAST examples
H3N2 sequence alignment taken from Vaughan TG, Kühnert D, Popinga A, Welch D, Drummond AJ. Efficient Bayesian inference under the structured coalescent. Bioinformatics. 2014 Aug 15;30(16):2272-9. doi: 10.1093/bioinformatics/btu20
Leptospira interrogans Endostatin-Like Outer Membrane Proteins Bind Host Fibronectin, Laminin and Regulators of Complement
The pathogenic spirochete Leptospira interrogans disseminates throughout its hosts via the bloodstream, then invades and colonizes a variety of host tissues. Infectious leptospires are resistant to killing by their hosts' alternative pathway of complement-mediated killing, and interact with various host extracellular matrix (ECM) components. The LenA outer surface protein (formerly called LfhA and Lsa24) was previously shown to bind the host ECM component laminin and the complement regulators factor H and factor H-related protein-1. We now demonstrate that infectious L. interrogans contain five additional paralogs of lenA, which we designated lenB, lenC, lenD, lenE and lenF. All six genes encode domains predicted to bear structural and functional similarities with mammalian endostatins. Sequence analyses of genes from seven infectious L. interrogans serovars indicated development of sequence diversity through recombination and intragenic duplication. LenB was found to bind human factor H, and all of the newly-described Len proteins bound laminin. In addition, LenB, LenC, LenD, LenE and LenF all exhibited affinities for fibronectin, a distinct host extracellular matrix protein. These characteristics suggest that Len proteins together facilitate invasion and colonization of host tissues, and protect against host immune responses during mammalian infection.
An Adaptive Interacting Wang-Landau Algorithm for Automatic Density Exploration
While statisticians are well-accustomed to performing exploratory analysis in the modeling stage of an analysis, the notion of conducting preliminary general-purpose exploratory analysis in the Monte Carlo stage (or, more generally, the model-fitting stage) of an analysis is an area which we feel deserves much further attention. Towards this aim, this paper proposes a general-purpose algorithm for automatic density exploration. The proposed exploration algorithm combines and expands upon components from various adaptive Markov chain Monte Carlo methods, with the Wang-Landau algorithm at its heart. Additionally, the algorithm is run on interacting parallel chains, a feature which both decreases computational cost and stabilizes the algorithm, improving its ability to explore the density. Performance is studied in several applications. Through a Bayesian variable selection example, the authors demonstrate the convergence gains obtained with interacting chains. The ability of the algorithm's adaptive proposal to induce mode-jumping is illustrated through a trimodal density and a Bayesian mixture modeling application. Lastly, through a 2D Ising model, the authors demonstrate the ability of the algorithm to overcome the high correlations encountered in spatial models.
Comment: 33 pages, 20 figures (the supplementary materials are included as appendices)
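The Wang-Landau update at the heart of the proposed algorithm can be sketched in isolation (single chain, fixed proposal, none of the paper's interaction or adaptation): a toy 10-site binary system whose exact density of states over the energy E = number of "up" sites is the binomial coefficient C(10, E). The flatness threshold and modification-factor schedule below are conventional choices, not the paper's.

```python
# Sketch of the core Wang-Landau flat-histogram update.
import math
import random

random.seed(0)
N = 10
state = [0] * N
E = 0                              # energy = number of up sites
ln_g = [0.0] * (N + 1)             # running log density-of-states estimate
hist = [0] * (N + 1)               # visit histogram over energy levels
ln_f = 1.0                         # modification factor, reduced over time

while ln_f > 1e-4:
    for _ in range(20000):
        i = random.randrange(N)    # propose flipping one site
        E_new = E + (1 if state[i] == 0 else -1)
        # accept with probability min(1, g(E)/g(E_new)): penalise
        # well-visited energies so the walk flattens in E
        if math.log(random.random()) < ln_g[E] - ln_g[E_new]:
            state[i] ^= 1
            E = E_new
        ln_g[E] += ln_f            # boost the current energy's estimate
        hist[E] += 1
    # once the histogram is roughly flat, refine: reset and shrink ln_f
    if min(hist) > 0.8 * (sum(hist) / len(hist)):
        hist = [0] * (N + 1)
        ln_f /= 2.0

# normalise so g(0) = 1 and compare with the exact answer C(10, E)
est = [math.exp(lg - ln_g[0]) for lg in ln_g]
for e in range(N + 1):
    print(e, round(est[e], 1), math.comb(N, e))
```

The bias of each estimate shrinks with ln_f, which is why the schedule keeps halving it; the paper's contribution is to run many such walkers in parallel and let them share information, which this single-chain sketch omits.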
Sequence-based prediction for vaccine strain selection and identification of antigenic variability in foot-and-mouth disease virus
Identifying when past exposure to an infectious disease will protect against newly emerging strains is central to understanding the spread and the severity of epidemics, but the prediction of viral cross-protection remains an important unsolved problem. For foot-and-mouth disease virus (FMDV) research in particular, improved methods for predicting this cross-protection are critical for predicting the severity of outbreaks within endemic settings where multiple serotypes and subtypes commonly co-circulate, as well as for deciding whether appropriate vaccine(s) exist and how much they could mitigate the effects of any outbreak. To identify antigenic relationships and their predictors, we used linear mixed effects models to account for variation in pairwise cross-neutralization titres using only viral sequences and structural data. We identified those substitutions in surface-exposed structural proteins that are correlates of loss of cross-reactivity. These allowed prediction of both the best vaccine match for any single virus and the breadth of coverage of new vaccine candidates from their capsid sequences as effectively as or better than serology. Sub-sequences chosen by the model-building process all contained sites that are known epitopes on other serotypes. Furthermore, for the SAT1 serotype, for which epitopes have never previously been identified, we provide strong evidence - by controlling for phylogenetic structure - for the presence of three epitopes across a panel of viruses and quantify the relative significance of some individual residues in determining cross-neutralization. Identifying and quantifying the importance of sites that predict viral strain cross-reactivity not just for single viruses but across entire serotypes can help in the design of vaccines with better targeting and broader coverage. These techniques can be generalized to any infectious agents where cross-reactivity assays have been carried out. 
As the parameterization uses pre-existing datasets, this approach quickly and cheaply increases both our understanding of antigenic relationships and our power to control disease.
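The modelling idea (linear mixed effects models relating cross-neutralization titres to candidate substitutions) can be sketched on synthetic data; the site names, effect sizes, and random-intercept structure below are invented for illustration, not taken from the study:

```python
# Sketch: mixed-effects model of (synthetic) log2 titres on amino-acid
# mismatches at candidate surface-exposed sites, with a random intercept
# per challenge virus to absorb virus-level reactivity differences.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "virus": rng.integers(0, 20, n).astype(str),  # challenge virus id
    "site_A": rng.integers(0, 2, n),              # 1 = mismatch at site A
    "site_B": rng.integers(0, 2, n),              # 1 = mismatch at site B
})
virus_effect = {v: rng.normal(0, 0.5) for v in df["virus"].unique()}
df["log_titre"] = (6.0 - 1.2 * df["site_A"] - 0.6 * df["site_B"]
                   + df["virus"].map(virus_effect) + rng.normal(0, 0.3, n))

# Mismatches at antigenically important sites should carry negative
# fixed-effect estimates (loss of cross-reactivity).
fit = smf.mixedlm("log_titre ~ site_A + site_B", df, groups=df["virus"]).fit()
print(fit.params[["site_A", "site_B"]])
```

In the study, model selection over such site terms is what identifies which substitutions predict loss of cross-reactivity; predicted titres for a candidate vaccine against a panel of field strains then give its expected breadth of coverage.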