Statistical analysis using finite mixtures of normal linear models
Finite mixture models are often used in statistical applications when the population under study is believed to consist of a number of heterogeneous subpopulations, but it is not possible to identify the subpopulation to which an individual belongs. In this thesis, finite mixtures of normal linear regression models are explored as a class of models for relating a response variable to a set of predictor variables. We consider two classes of mixture models: those in which the proportion of the population in each subpopulation is independent of the measured predictor variables, and those in which the mixture proportions are allowed to depend on the predictor variables. Conditions are determined under which the parameters of the finite mixture model are identifiable. Two approaches to statistical inference for the model parameters are reviewed: maximum likelihood estimation and the associated large sample theory, and Bayesian inference. Several complications arise in practice when analyzing data with finite mixture models, including multiple modes of the likelihood function, degenerate modes corresponding to small subpopulations with apparently zero variance, and the failure of traditional large sample results. Simulations are used to investigate the performance of the two approaches to inference. It is important that a statistical analysis go beyond just fitting a model to data and include some model assessment. This thesis explores the use of posterior predictive model checks for this purpose. In particular, a posterior predictive method is proposed for comparing the mixture of regressions with constant proportions to the mixture of regressions with nonconstant proportions. The various approaches to inference and model assessment are applied to an example concerning household expenditures in Bangladesh. An economic hypothesis there suggests that more resources are spent ensuring the health of male rather than female children. A simple linear regression explaining the difference between male and female child health finds no significant predictors. One plausible explanation is that the population consists of two types of households, those that do not discriminate based on gender and those that do. The finite mixture of regressions allows us to address this hypothesis.
DOMAC: an accurate, hybrid protein domain prediction server
Protein domain prediction is important for protein structure prediction, structure determination, function annotation, mutagenesis analysis and protein engineering. Here we describe an accurate protein domain prediction server (DOMAC) combining both template-based and ab initio methods. The preliminary version of the server was ranked among the top domain prediction servers in the seventh edition of Critical Assessment of Techniques for Protein Structure Prediction (CASP7), 2006. The DOMAC server and datasets are available at: http://www.bioinfotool.org/domac.htm
A multi-template combination algorithm for protein comparative modeling
Background: Multiple protein templates are commonly used in manual protein structure prediction. However, few automated algorithms for selecting and combining multiple templates are available. Results: Here we develop an effective multi-template combination algorithm for protein comparative modeling. The algorithm selects templates according to the similarity significance of the alignments between template and target proteins. It combines the whole template-target alignments whose similarity significance score is close to that of the top template-target alignment within a threshold, whereas it only takes alignment fragments from a less similar template-target alignment that align with a sizable uncovered region of the target. We compare the algorithm with the traditional method of using a single top template on the 45 comparative modeling targets (i.e. easy template-based modeling targets) used in the seventh edition of Critical Assessment of Techniques for Protein Structure Prediction (CASP7). The multi-template combination algorithm improves the GDT-TS scores of predicted models by 6.8% on average. The statistical analysis shows that the improvement is significant (p-value < 10^-4). Compared with the ideal approach that always uses the best template, the multi-template approach yields only slightly better performance. During the CASP7 experiment, the preliminary implementation of the multi-template combination algorithm (FOLDpro) was ranked second among 67 servers in the category of high-accuracy structure prediction in terms of GDT-TS measure. Conclusion: We have developed a novel multi-template algorithm to improve protein comparative modeling.
CONFOLD2: Improved contact-driven ab initio protein structure modeling
Background: Contact-guided protein structure prediction methods are becoming more and more successful because of the latest advances in residue-residue contact prediction. To support contact-driven structure prediction, effective tools that can quickly build tertiary structural models of good quality from predicted contacts need to be developed. Results: We develop an improved contact-driven protein modelling method, CONFOLD2, and study how it may be effectively used for ab initio protein structure prediction with predicted contacts as input. It builds models using various subsets of input contacts to explore the fold space under the guidance of a soft square energy function, and then clusters the models to obtain the top five models. CONFOLD2 obtains an average reconstruction accuracy of 0.57 TM-score for the 150 proteins in the PSICOV contact prediction dataset. When benchmarked on the CASP11 contacts predicted using CONSIP2 and CASP12 contacts predicted using Raptor-X, CONFOLD2 achieves a mean TM-score of 0.41 on both datasets. Conclusion: CONFOLD2 quickly generates the top five structural models for a protein sequence when its secondary structure and contact predictions are at hand. The source code of CONFOLD2 is publicly available at https://github.com/multicom-toolbox/CONFOLD2/
Geometry-Complete Diffusion for 3D Molecule Generation
Denoising diffusion probabilistic models (DDPMs) have recently taken the
field of generative modeling by storm, pioneering new state-of-the-art results
in disciplines such as computer vision and computational biology for diverse
tasks ranging from text-guided image generation to structure-guided protein
design. Along this latter line of research, methods such as those of Hoogeboom
et al. 2022 have been proposed for unconditionally generating 3D molecules
using equivariant graph neural networks (GNNs) within a DDPM framework. Toward
this end, we propose GCDM, a geometry-complete diffusion model that achieves
new state-of-the-art results for 3D molecule diffusion generation by leveraging
the representation learning strengths offered by GNNs that perform
geometry-complete message-passing. Our results with GCDM also offer preliminary
insights into how physical inductive biases impact the generative dynamics of
molecular DDPMs. The source code, data, and instructions to train new models or
reproduce our results are freely available at
https://github.com/BioinfoMachineLearning/bio-diffusion.
Comment: 13 pages, 1 figure, 3 tables. Under review. Code available at
https://github.com/BioinfoMachineLearning/bio-diffusio
HMMEditor: a visual editing tool for profile hidden Markov model
Background: The profile hidden Markov model (HMM) is a powerful statistical model to represent a family of DNA, RNA, and protein sequences. Profile HMMs have been widely used in bioinformatics research such as sequence alignment, gene structure prediction, motif identification, protein structure prediction, and biological database search. However, few comprehensive, visual editing tools for profile HMMs are publicly available. Results: We develop a visual editor for profile hidden Markov models (HMMEditor). HMMEditor can visualize the profile HMM architecture, transition probabilities, and emission probabilities. Moreover, it provides functions to edit and save the HMM and its parameters. Furthermore, HMMEditor allows users to align a sequence against the profile HMM and to visualize the corresponding Viterbi path. Conclusion: HMMEditor provides a set of unique functions to visualize and edit a profile HMM. It is a useful tool for biological sequence analysis and modeling. Both the HMMEditor software and web service are freely available.
Geometry-Complete Perceptron Networks for 3D Molecular Graphs
The field of geometric deep learning has had a profound impact on the
development of innovative and powerful graph neural network architectures.
Disciplines such as computer vision and computational biology have benefited
significantly from such methodological advances, which have led to breakthroughs
in scientific domains such as protein structure prediction and design. In this
work, we introduce GCPNet, a new geometry-complete, SE(3)-equivariant graph
neural network designed for 3D graph representation learning. We demonstrate
the state-of-the-art utility and expressiveness of our method on six
independent datasets designed for three distinct geometric tasks:
protein-ligand binding affinity prediction, protein structure ranking, and
Newtonian many-body systems modeling. Our results suggest that GCPNet is a
powerful, general method for capturing complex geometric and physical
interactions within 3D graphs for downstream prediction tasks. The source code,
data, and instructions to train new models or reproduce our results are freely
available on GitHub.
Comment: 7 pages, 1 figure, 3 tables. Under review. Code available at
https://github.com/BioinfoMachineLearning/GCPNe