
421 research outputs found

Statistical analysis using finite mixtures of normal linear models

Author: Cheng Jianlin
Publication venue: Iowa State University Digital Repository
Publication date: 01/01/1999

Finite mixture models are often used in statistical applications when the population under study is believed to consist of a number of heterogeneous subpopulations, but it is not possible to identify the subpopulation to which an individual belongs. In this thesis, finite mixtures of normal linear regression models are explored as a class of models for relating a response variable to a set of predictor variables. We consider two classes of mixture models: those in which the proportion of the population in each subpopulation is independent of the measured predictor variables, and a second in which the mixture proportions are allowed to depend on the predictor variables.

Conditions are determined under which the parameters of the finite mixture model are identifiable. Two approaches to statistical inference for the model parameters are reviewed: maximum likelihood estimation and the associated large sample theory, and Bayesian inference. There are several complications that arise in practice when analyzing data with finite mixture models including multiple modes of the likelihood function, degenerate modes corresponding to small subpopulations with apparently zero variance, and the failure of traditional large sample results. Simulations are used to investigate the performance of the two approaches to inference. It is important that a statistical analysis go beyond just fitting a model to data and include some model assessment. This thesis explores the use of posterior predictive model checks for this purpose. In particular a posterior predictive method is proposed for comparing the mixture of regressions with constant proportions to the mixture of regressions with nonconstant proportions.

The various approaches to inference and model assessment are applied to an example concerning household expenditures in Bangladesh. An economic hypothesis there suggests that more resources are spent ensuring the health of male rather than female children. A simple linear regression explaining the difference between male and female child health finds no significant predictors. One plausible explanation is that the population consists of two types of households, those that do not discriminate based on gender and those that do. The finite mixture of regressions allows us to address this hypothesis.
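
For readers unfamiliar with the model class, the constant-proportion case described in this abstract can be illustrated with a short EM routine. This is a minimal sketch and not the thesis's code: the function name, the random initialisation, the ridge term, and the variance floor (added here to sidestep the degenerate zero-variance modes the abstract mentions) are all assumptions made for illustration.

```python
import numpy as np

def fit_mixture_of_regressions(X, y, n_components=2, n_iter=100, seed=0):
    """EM for a mixture of normal linear regressions with constant proportions.

    Hypothetical helper for illustration; X is (n, d) and should include an
    intercept column, y is (n,).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Start from random soft assignments of observations to components.
    resp = rng.dirichlet(np.ones(n_components), size=n)
    betas = np.zeros((n_components, d))
    sigmas = np.ones(n_components)
    for _ in range(n_iter):
        # M-step: weighted least squares and weighted residual scale per component.
        for k in range(n_components):
            w = resp[:, k]
            Xw = X * w[:, None]
            betas[k] = np.linalg.solve(Xw.T @ X + 1e-8 * np.eye(d), Xw.T @ y)
            resid_k = y - X @ betas[k]
            sigma_k = np.sqrt((w * resid_k**2).sum() / (w.sum() + 1e-12))
            sigmas[k] = max(sigma_k, 1e-3)  # assumed floor against zero-variance modes
        weights = resp.mean(axis=0)
        # E-step: posterior membership probabilities, computed in log space.
        resid = y[:, None] - X @ betas.T
        log_p = np.log(weights + 1e-300) - np.log(sigmas) - 0.5 * (resid / sigmas) ** 2
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
    return weights, betas, sigmas, resp

# Synthetic demo: two latent groups with different regression lines.
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=200)
group = rng.integers(0, 2, size=200)
y = np.where(group == 0, 1.0 + 2.0 * x, -1.0 - 2.0 * x) + 0.1 * rng.standard_normal(200)
X = np.column_stack([np.ones_like(x), x])
weights, betas, sigmas, resp = fit_mixture_of_regressions(X, y)
```

Because the likelihood can have multiple modes, as the abstract notes, a single run from one random start may settle in a poor local solution; rerunning with several seeds and comparing fits is a common precaution.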

Digital Repository @ Iowa State University (ISU)

DOMAC: an accurate, hybrid protein domain prediction server

Author: Cheng Jianlin
Publication venue: Oxford University Press
Publication date: 01/01/2007

Protein domain prediction is important for protein structure prediction, structure determination, function annotation, mutagenesis analysis and protein engineering. Here we describe an accurate protein domain prediction server (DOMAC) combining both template-based and ab initio methods. The preliminary version of the server was ranked among the top domain prediction servers in the seventh edition of Critical Assessment of Techniques for Protein Structure Prediction (CASP7), 2006. DOMAC server and datasets are available at: http://www.bioinfotool.org/domac.htm

CiteSeerX

Crossref

PubMed Central

University of Central Florida (UCF): STARS (Showcase of Text, Archives, Research & Scholarship)

A multi-template combination algorithm for protein comparative modeling

Author: Cheng Jianlin
Publication venue: BioMed Central
Publication date: 01/01/2008

Background: Multiple protein templates are commonly used in manual protein structure prediction. However, few automated algorithms of selecting and combining multiple templates are available. Results: Here we develop an effective multi-template combination algorithm for protein comparative modeling. The algorithm selects templates according to the similarity significance of the alignments between template and target proteins. It combines the whole template-target alignments whose similarity significance score is close to that of the top template-target alignment within a threshold, whereas it only takes alignment fragments from a less similar template-target alignment that align with a sizable uncovered region of the target. We compare the algorithm with the traditional method of using a single top template on the 45 comparative modeling targets (i.e. easy template-based modeling targets) used in the seventh edition of Critical Assessment of Techniques for Protein Structure Prediction (CASP7). The multi-template combination algorithm improves the GDT-TS scores of predicted models by 6.8% on average. The statistical analysis shows that the improvement is significant (p-value < 10^-4). Compared with the ideal approach that always uses the best template, the multi-template approach yields only slightly better performance. During the CASP7 experiment, the preliminary implementation of the multi-template combination algorithm (FOLDpro) was ranked second among 67 servers in the category of high-accuracy structure prediction in terms of GDT-TS measure. Conclusion: We have developed a novel multi-template algorithm to improve protein comparative modeling.
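
The selection rule described in the Results can be sketched roughly as follows. This is an illustration under stated assumptions, not the FOLDpro implementation: the Alignment record, the convention that a higher score means a more significant alignment, and the score_window and min_fragment parameters are all hypothetical, since the abstract does not give the actual score or thresholds.

```python
from dataclasses import dataclass

@dataclass
class Alignment:
    template_id: str
    score: float  # hypothetical significance score, higher = more similar
    start: int    # first target position covered (inclusive)
    end: int      # last target position covered (exclusive)

def uncovered_segments(start, end, covered):
    """Return maximal runs of positions in [start, end) not yet in `covered`."""
    segments, run_start = [], None
    for pos in range(start, end):
        if pos not in covered:
            if run_start is None:
                run_start = pos
        elif run_start is not None:
            segments.append((run_start, pos))
            run_start = None
    if run_start is not None:
        segments.append((run_start, end))
    return segments

def combine_templates(alignments, score_window=5.0, min_fragment=10):
    """Pick whole alignments near the top score, fragments from the rest."""
    ranked = sorted(alignments, key=lambda a: a.score, reverse=True)
    if not ranked:
        return []
    top_score = ranked[0].score
    selected, covered = [], set()
    for aln in ranked:
        if top_score - aln.score <= score_window:
            # Close to the top alignment: use the whole alignment.
            selected.append((aln.template_id, aln.start, aln.end))
            covered.update(range(aln.start, aln.end))
        else:
            # Less similar: keep only fragments over sizable uncovered regions.
            for seg_start, seg_end in uncovered_segments(aln.start, aln.end, covered):
                if seg_end - seg_start >= min_fragment:
                    selected.append((aln.template_id, seg_start, seg_end))
                    covered.update(range(seg_start, seg_end))
    return selected

example = [
    Alignment("A", 100.0, 0, 80),
    Alignment("B", 97.0, 10, 90),
    Alignment("C", 50.0, 70, 130),
    Alignment("D", 40.0, 85, 95),
]
picked = combine_templates(example)
```

In this toy input, templates A and B are used whole, C contributes only the stretch of the target that A and B leave uncovered, and D is dropped because it adds no new coverage.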

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

CONFOLD2: Improved contact-driven ab initio protein structure modeling

Authors: Adhikari Badri; Cheng Jianlin
Publication venue: IRL @ UMSL
Publication date: 25/01/2018

Background: Contact-guided protein structure prediction methods are becoming more and more successful because of the latest advances in residue-residue contact prediction. To support contact-driven structure prediction, effective tools that can quickly build tertiary structural models of good quality from predicted contacts need to be developed. Results: We develop an improved contact-driven protein modelling method, CONFOLD2, and study how it may be effectively used for ab initio protein structure prediction with predicted contacts as input. It builds models using various subsets of input contacts to explore the fold space under the guidance of a soft square energy function, and then clusters the models to obtain the top five models. CONFOLD2 obtains an average reconstruction accuracy of 0.57 TM-score for the 150 proteins in the PSICOV contact prediction dataset. When benchmarked on the CASP11 contacts predicted using CONSIP2 and CASP12 contacts predicted using Raptor-X, CONFOLD2 achieves a mean TM-score of 0.41 on both datasets. Conclusion: CONFOLD2 can quickly generate the top five structural models for a protein sequence when its secondary structure and contact predictions are at hand. The source code of CONFOLD2 is publicly available at https://github.com/multicom-toolbox/CONFOLD2/

University of Missouri, St. Louis

Geometry-Complete Diffusion for 3D Molecule Generation

Authors: Cheng Jianlin; Morehead Alex
Publication date: 15/02/2023

Denoising diffusion probabilistic models (DDPMs) have recently taken the field of generative modeling by storm, pioneering new state-of-the-art results in disciplines such as computer vision and computational biology for diverse tasks ranging from text-guided image generation to structure-guided protein design. Along this latter line of research, methods such as those of Hoogeboom et al. 2022 have been proposed for unconditionally generating 3D molecules using equivariant graph neural networks (GNNs) within a DDPM framework. Toward this end, we propose GCDM, a geometry-complete diffusion model that achieves new state-of-the-art results for 3D molecule diffusion generation by leveraging the representation learning strengths offered by GNNs that perform geometry-complete message-passing. Our results with GCDM also offer preliminary insights into how physical inductive biases impact the generative dynamics of molecular DDPMs. The source code, data, and instructions to train new models or reproduce our results are freely available at https://github.com/BioinfoMachineLearning/bio-diffusion.

Comment: 13 pages, 1 figure, 3 tables. Under review. Code available at https://github.com/BioinfoMachineLearning/bio-diffusio
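
As background on the DDPM framework this abstract builds on, the standard forward (noising) step can be sketched as below. This is a generic illustration and not GCDM's code: the linear beta schedule, its endpoint values, the function names, and the toy five-atom coordinates are assumptions chosen for the example.

```python
import numpy as np

def make_alpha_bar(n_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention factors for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, n_steps)
    return np.cumprod(1.0 - betas)

def noise_coordinates(x0, t, alpha_bar, rng):
    """Sample x_t given x_0: sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = rng.standard_normal((5, 3))  # five hypothetical atoms with 3D coordinates
x_t, eps = noise_coordinates(x0, 500, alpha_bar, rng)
```

A denoising network is then trained to recover eps from x_t and t; in the molecular setting described above, that network is the equivariant GNN.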

arXiv.org e-Print Archive

HMMEditor: a visual editing tool for profile hidden Markov model

Authors: Cheng Jianlin; Dai Jianyong
Publication venue: BioMed Central
Publication date: 01/01/2008

Background: Profile Hidden Markov Model (HMM) is a powerful statistical model to represent a family of DNA, RNA, and protein sequences. Profile HMM has been widely used in bioinformatics research such as sequence alignment, gene structure prediction, motif identification, protein structure prediction, and biological database search. However, few comprehensive, visual editing tools for profile HMM are publicly available. Results: We develop a visual editor for profile Hidden Markov Models (HMMEditor). HMMEditor can visualize the profile HMM architecture, transition probabilities, and emission probabilities. Moreover, it provides functions to edit and save HMM and parameters. Furthermore, HMMEditor allows users to align a sequence against the profile HMM and to visualize the corresponding Viterbi path. Conclusion: HMMEditor provides a set of unique functions to visualize and edit a profile HMM. It is a useful tool for biological sequence analysis and modeling. Both HMMEditor software and web service are freely available.
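
The Viterbi path mentioned in this abstract is the most probable state sequence for an observed sequence. A minimal sketch for a plain HMM follows; it is not HMMEditor's code, and it simplifies the profile HMM case by assuming every state emits a symbol (real profile HMMs also have silent delete states). The function name and the two-state toy model are assumptions for illustration.

```python
import numpy as np

def viterbi(obs, log_start, log_trans, log_emit):
    """Most probable state path for a plain HMM, computed in log space.

    obs: sequence of symbol indices; log_start: (S,); log_trans[i, j] is the
    log-probability of moving from state i to j; log_emit[s, o] is the
    log-probability that state s emits symbol o.
    """
    n_steps, n_states = len(obs), log_start.shape[0]
    score = np.full((n_steps, n_states), -np.inf)
    back = np.zeros((n_steps, n_states), dtype=int)
    score[0] = log_start + log_emit[:, obs[0]]
    for t in range(1, n_steps):
        cand = score[t - 1][:, None] + log_trans  # cand[i, j]: best path ending i -> j
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + log_emit[:, obs[t]]
    # Trace back from the best final state.
    path = [int(score[-1].argmax())]
    for t in range(n_steps - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(score[-1].max())

# Toy two-state model: each state prefers its own symbol and tends to persist.
log_start = np.log(np.array([0.5, 0.5]))
log_trans = np.log(np.array([[0.8, 0.2], [0.2, 0.8]]))
log_emit = np.log(np.array([[0.9, 0.1], [0.1, 0.9]]))
path, log_prob = viterbi([0, 0, 1, 1], log_start, log_trans, log_emit)
```

Working in log space avoids numerical underflow on long sequences, which matters for the sequence lengths typical of protein families.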

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

University of Central Florida (UCF): STARS (Showcase of Text, Archives, Research & Scholarship)

Geometry-Complete Perceptron Networks for 3D Molecular Graphs

Authors: Cheng Jianlin; Morehead Alex
Publication date: 04/11/2022

The field of geometric deep learning has had a profound impact on the development of innovative and powerful graph neural network architectures. Disciplines such as computer vision and computational biology have benefited significantly from such methodological advances, which has led to breakthroughs in scientific domains such as protein structure prediction and design. In this work, we introduce GCPNet, a new geometry-complete, SE(3)-equivariant graph neural network designed for 3D graph representation learning. We demonstrate the state-of-the-art utility and expressiveness of our method on six independent datasets designed for three distinct geometric tasks: protein-ligand binding affinity prediction, protein structure ranking, and Newtonian many-body systems modeling. Our results suggest that GCPNet is a powerful, general method for capturing complex geometric and physical interactions within 3D graphs for downstream prediction tasks. The source code, data, and instructions to train new models or reproduce our results are freely available on GitHub.

Comment: 7 pages, 1 figure, 3 tables. Under review. Code available at https://github.com/BioinfoMachineLearning/GCPNe

arXiv.org e-Print Archive