Influence of substitution model selection on protein phylogenetic tree reconstruction
Probabilistic phylogenetic tree reconstruction is traditionally performed under a best-fitting substitution model of molecular evolution previously selected according to diverse statistical criteria. Interestingly, some recent studies proposed that this procedure is unnecessary for phylogenetic tree reconstruction, leading to a debate in the field. In contrast to DNA sequences, phylogenetic tree reconstruction from protein sequences is traditionally based on empirical exchangeability matrices that can differ among taxonomic groups and protein families. Considering this aspect, here we investigated the influence of selecting a substitution model of protein evolution on phylogenetic tree reconstruction through the analysis of real and simulated data. We found that phylogenetic tree reconstructions based on a selected best-fitting substitution model of protein evolution are the most accurate, in terms of topology and branch lengths, compared with those derived from substitution models with amino acid replacement matrices far from the selected best-fitting model, especially when the data have large genetic diversity. Indeed, we found that substitution models with similar amino acid replacement matrices produce similar reconstructed phylogenetic trees, suggesting the use of substitution models as similar as possible to a selected best-fitting model when the latter cannot be used. Therefore, we recommend the use of the traditional protocol of selection among substitution models of evolution for protein phylogenetic tree reconstruction.
Universidade de Vigo/CISUG. Agencia Estatal de Investigación | Ref. PID2019-107931GA-I0
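The selection step this abstract refers to can be illustrated by ranking candidate substitution models with an information criterion. The following is a minimal sketch, not the authors' pipeline. The function names are hypothetical, and every log-likelihood and parameter count in `EXAMPLE_FITS` is a made-up placeholder; real values would come from a phylogenetic inference program. Using the number of alignment sites as the BIC sample size is a common convention and is assumed here.

```python
import math

def aic(log_lik, n_params):
    """Akaike information criterion: lower is better."""
    return 2 * n_params - 2 * log_lik

def bic(log_lik, n_params, n_sites):
    """Bayesian information criterion, with the alignment length as sample size (assumption)."""
    return n_params * math.log(n_sites) - 2 * log_lik

def rank_models(fits, n_sites, criterion="bic"):
    """Return model names sorted from best to worst under the chosen criterion.

    fits maps a model name to (maximized log-likelihood, number of free parameters).
    """
    def score(item):
        _, (log_lik, n_params) = item
        if criterion == "aic":
            return aic(log_lik, n_params)
        return bic(log_lik, n_params, n_sites)
    return [name for name, _ in sorted(fits.items(), key=score)]

# Placeholder values for illustration only; they are not results from the study.
EXAMPLE_FITS = {
    "JTT": (-10250.0, 37),
    "WAG": (-10190.0, 37),
    "LG+G": (-10105.0, 38),
}

if __name__ == "__main__":
    print(rank_models(EXAMPLE_FITS, n_sites=300, criterion="bic"))
```

The first name in the returned list plays the role of the "selected best-fitting model"; the remaining names give an ordering of fallbacks by fit, which is not the same thing as similarity between replacement matrices.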
The divergence time of protein structures modelled by Markov matrices and its relation to the divergence of sequences
A complete time-parameterized statistical model quantifying the divergent
evolution of protein structures in terms of the patterns of conservation of
their secondary structures is inferred from a large collection of protein 3D
structure alignments. This provides a better alternative to time-parameterized
sequence-based models of protein relatedness, which have clear limitations in
dealing with the twilight and midnight zones of sequence relationships. Since
protein structures are far more conserved due to the selection pressure
directly placed on their function, divergence time estimates can be more
accurate when inferred from structures. We use the Bayesian and
information-theoretic framework of Minimum Message Length to infer a
time-parameterized stochastic matrix (accounting for perturbed structural
states of related residues) and associated Dirichlet models (accounting for
insertions and deletions during the evolution of protein domains). These are
used in concert to estimate the Markov time of divergence of tertiary
structures, a task previously only possible using proxies (like RMSD). By
analyzing one million pairs of homologous structures, we derive a relationship
between the Markov divergence time of structures and of sequences. Using these
inferred models and the relationship between the divergence of sequences and
structures, we demonstrate a competitive performance in secondary structure
prediction against neural network architectures commonly employed for this
task. The source code and supplementary information are downloadable from
http://lcb.infotech.monash.edu.au/sstsum.
Comment: 12 pages, 6 figures
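The idea of a Markov time of divergence can be sketched as follows: if a stochastic matrix describes one unit of time, its t-th power describes t units, and the divergence time of a pair is the t that best explains the observed state substitutions. This is a simplified illustration under stated assumptions. It uses plain maximum likelihood over integer times rather than the Minimum Message Length framework of the abstract, it ignores insertions and deletions, and the 3-state matrix `BASE` contains invented values. All function names are hypothetical.

```python
import numpy as np

# Toy one-step transition matrix over three secondary-structure states
# (helix, strand, coil). The numbers are invented for illustration.
BASE = np.array([
    [0.980, 0.005, 0.015],
    [0.005, 0.980, 0.015],
    [0.010, 0.010, 0.980],
])

def transition_matrix(base, t):
    """Transition probabilities after t time units: the t-th power of the base matrix."""
    return np.linalg.matrix_power(base, t)

def log_likelihood(base, counts, t):
    """Log-likelihood of observed state-pair counts given divergence time t.

    counts[a, b] is how often state a in one structure aligns to state b in the other.
    """
    return float(np.sum(counts * np.log(transition_matrix(base, t))))

def estimate_divergence_time(base, counts, max_t=200):
    """Integer time in 1..max_t that maximizes the log-likelihood (grid search)."""
    return max(range(1, max_t + 1), key=lambda t: log_likelihood(base, counts, t))

if __name__ == "__main__":
    # Synthetic counts proportional to the 30-step transition probabilities.
    counts = 1000 * transition_matrix(BASE, 30)
    print(estimate_divergence_time(BASE, counts))
```

A grid search over integer times keeps the sketch short; a continuous-time version would need a rate matrix and a matrix exponential instead of integer powers.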
Machine learning-guided directed evolution for protein engineering
Machine learning (ML)-guided directed evolution is a new paradigm for
biological design that enables optimization of complex functions. ML methods
use data to predict how sequence maps to function without requiring a detailed
model of the underlying physics or biological pathways. To demonstrate
ML-guided directed evolution, we introduce the steps required to build ML
sequence-function models and use them to guide engineering, making
recommendations at each stage. This review covers basic concepts relevant to
using ML for protein engineering as well as the current literature and
applications of this new engineering paradigm. ML methods accelerate directed
evolution by learning from information contained in all measured variants and
using that information to select sequences that are likely to be improved. We
then provide two case studies that demonstrate the ML-guided directed evolution
process. We also look to future opportunities where ML will enable discovery of
new protein functions and uncover the relationship between protein sequence and
function.
Comment: Made significant revisions to focus on aspects most relevant to
applying machine learning to speed up directed evolution
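One round of the process described in this abstract (learn a sequence-function model from all measured variants, then use it to choose which sequences to test next) can be sketched as below. This is a minimal sketch under assumptions: ridge regression on one-hot encoded sequences stands in for whatever model a real study would use, all function names are hypothetical, and the sequences and fitness values in the usage example are invented.

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def one_hot(seq):
    """Flattened one-hot encoding: one block of 20 indicators per position."""
    x = np.zeros((len(seq), len(ALPHABET)))
    for pos, aa in enumerate(seq):
        x[pos, ALPHABET.index(aa)] = 1.0
    return x.ravel()

def fit_ridge(seqs, fitness, alpha=1.0):
    """Fit a ridge-regularized linear model to measured variants."""
    X = np.array([one_hot(s) for s in seqs])
    y = np.asarray(fitness, dtype=float)
    mean = y.mean()
    # Closed-form ridge solution on mean-centered fitness values.
    w = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ (y - mean))
    return w, mean

def predict(model, seqs):
    """Predicted fitness for each sequence."""
    w, mean = model
    return np.array([one_hot(s) @ w for s in seqs]) + mean

def select_top(model, candidates, k):
    """Return the k candidates with the highest predicted fitness."""
    order = np.argsort(-predict(model, candidates))[:k]
    return [candidates[i] for i in order]

if __name__ == "__main__":
    measured = ["AAA", "CAA", "ACA", "AAC"]   # invented variants
    fitness = [1.0, 2.0, 0.5, 1.5]            # invented measurements
    model = fit_ridge(measured, fitness)
    print(select_top(model, ["CAC", "CCA", "ACC", "AAA"], k=2))
```

In a full campaign, the selected sequences would be measured and appended to the training data before the model is refit for the next round.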
Topological AI forecasting of future dominating viral variants
The understanding of the mechanisms of SARS-CoV-2 evolution and transmission
is one of the greatest challenges of our time. By integrating artificial
intelligence (AI), viral genomes isolated from patients, tens of thousands of
mutational data, biophysics, bioinformatics, and algebraic topology, the
SARS-CoV-2 evolution was revealed to be governed by infectivity-based natural
selection. Two key mutation sites, L452 and N501 on the viral spike protein
receptor-binding domain (RBD), were predicted in summer 2020, long before they
occurred in the prevailing variants Alpha, Beta, Gamma, Delta, Kappa, Theta, Lambda,
Mu, and Omicron. Recent studies identified a new mechanism of natural
selection: antibody resistance. AI-based forecasting of Omicron's infectivity,
vaccine breakthrough, and antibody resistance was later nearly perfectly
confirmed by experiments. The replacement of dominant BA.1 by BA.2 in late
March was predicted in early February. On May 1, 2022, persistent
Laplacian-based AI projected Omicron BA.4 and BA.5 to become the new dominating
COVID-19 variants. This prediction became reality in late June. Topological AI
models offer accurate prediction of mutational impacts on the efficacy of
monoclonal antibodies (mAbs).
Comment: 5 pages, 2 figures
Selection of sequence motifs and generative Hopfield-Potts models for protein families
Statistical models for families of evolutionarily related proteins have
recently gained interest: in particular pairwise Potts models, such as those
inferred by the Direct-Coupling Analysis, have been able to extract information
about the three-dimensional structure of folded proteins, and about the effect
of amino-acid substitutions in proteins. These models are typically required
to reproduce the one- and two-point statistics of the amino-acid usage in a
protein family, i.e., to capture the so-called residue conservation and
covariation statistics of proteins of common evolutionary origin. Pairwise
Potts models are the maximum-entropy models achieving this. While being
successful, these models depend on huge numbers of ad hoc introduced
parameters, which have to be estimated from a finite amount of data and whose
biophysical interpretation remains unclear. Here we propose an approach to
parameter reduction, which is based on selecting collective sequence motifs. It
naturally leads to the formulation of statistical sequence models in terms of
Hopfield-Potts models. These models can be accurately inferred using a mapping
to restricted Boltzmann machines and persistent contrastive divergence. We show
that, when applied to protein data, even 20-40 patterns are sufficient to
obtain statistically close-to-generative models. The Hopfield patterns form
interpretable sequence motifs and may be used to cluster amino-acid
sequences into functional sub-families. However, the distributed collective
nature of these motifs intrinsically limits the ability of Hopfield-Potts
models to predict contact maps, showing the necessity of developing models
going beyond the Hopfield-Potts models discussed here.
Comment: 26 pages, 16 figures, to app. in PR
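The parameter reduction described in this abstract can be made concrete with a small sketch. It assumes the usual low-rank form in which the pairwise couplings are sums of outer products of p patterns, J[i, a, j, b] = sum over mu of xi[mu, i, a] * xi[mu, j, b]. The abstract does not spell out this formula, and normalization conventions vary, so treat it as an assumption. The function names, the sequence length of 100, and the alphabet size of 21 (20 amino acids plus a gap) are illustrative choices.

```python
import numpy as np

def hopfield_couplings(patterns):
    """Build low-rank couplings from patterns of shape (p, L, q).

    Returns J with shape (L, q, L, q), where
    J[i, a, j, b] = sum_mu patterns[mu, i, a] * patterns[mu, j, b].
    The i == j blocks are computed too, although a Potts model only uses i != j.
    """
    return np.einsum("mia,mjb->iajb", patterns, patterns)

def parameter_counts(L, q, p):
    """Coupling parameters of a full pairwise Potts model vs. p Hopfield patterns."""
    full = L * (L - 1) // 2 * q * q
    low_rank = p * L * q
    return full, low_rank

if __name__ == "__main__":
    full, low_rank = parameter_counts(L=100, q=21, p=40)
    print(full, low_rank)  # 2182950 84000
    rng = np.random.default_rng(0)
    J = hopfield_couplings(rng.normal(size=(3, 5, 4)))
    print(J.shape)
```

With these illustrative sizes, 40 patterns need roughly 4% of the coupling parameters of the full pairwise model, which is the kind of reduction the abstract refers to.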
The effect of genomic information on optimal contribution selection in livestock breeding programs
BACKGROUND: Long-term benefits in animal breeding programs require that increases in genetic merit be balanced with the need to maintain diversity (lost due to inbreeding). This can be achieved by using optimal contribution selection. The availability of high-density DNA marker information enables the incorporation of genomic data into optimal contribution selection but this raises the question about how this information affects the balance between genetic merit and diversity.
METHODS: The effect of using genomic information in optimal contribution selection was examined based on simulated and real data on dairy bulls. We compared the genetic merit of selected animals at various levels of co-ancestry restrictions when using estimated breeding values based on parent average, genomic or progeny test information. Furthermore, we estimated the proportion of variation in estimated breeding values that is due to within-family differences.
RESULTS: Optimal selection on genomic estimated breeding values increased genetic gain. Genetic merit was further increased using genomic rather than pedigree-based measures of co-ancestry under an inbreeding restriction policy. Using genomic instead of pedigree relationships to restrict inbreeding had a significant effect only when the population consisted of many large full-sib families; with a half-sib family structure, no difference was observed. In real data from dairy bulls, optimal contribution selection based on genomic estimated breeding values allowed for additional improvements in genetic merit at low to moderate inbreeding levels. Genomic estimated breeding values were more accurate and showed more within-family variation than parent average breeding values; for genomic estimated breeding values, 30 to 40% of the variation was due to within-family differences. Finally, there was no difference between constraining inbreeding via pedigree or genomic relationships in the real data.
CONCLUSIONS: The use of genomic estimated breeding values increased genetic gain in optimal contribution selection. Genomic estimated breeding values were more accurate and showed more within-family variation, which led to higher genetic gains for the same restriction on inbreeding. Using genomic relationships to restrict inbreeding provided no additional gain, except in the case of very large full-sib families.