Minimum energy configurations of the 2-dimensional HP-model of proteins by self-organizing networks
We use self-organizing maps (SOM) as an efficient tool to find the minimum energy configurations of the 2-dimensional HP-model of proteins. The usage of the SOM for the protein folding problem is similar to that for the Traveling Salesman Problem: the lattice nodes represent the cities, whereas the neurons in the network represent the amino acids moving towards the closest cities, subject to the HH interactions. The valid path that maximizes the HH contacts corresponds to the minimum energy configuration of the protein. We report promising results for the cases when the protein completely fills a lattice and discuss the current problems and possible extensions. In all test sequences of up to 36 amino acids, the algorithm was able to find the global minimum and its degeneracies.
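The SOM-for-TSP analogy above can be made concrete with a small sketch. Everything below is a hedged toy, not the paper's algorithm: the HP sequence, the lattice size, the learning rate, the neighbourhood width, and the HH attraction strength are all illustrative assumptions.

```python
# Toy sketch of a SOM-style update for the 2-D HP model, assuming a
# TSP-like scheme: lattice nodes act as "cities" and a chain of neurons
# (one per amino acid) is pulled toward them, with an extra attraction
# between hydrophobic (H) residues. All parameters are illustrative.
import math

SEQ = "HPHPPH"                                          # hypothetical HP sequence
LATTICE = [(x, y) for x in range(3) for y in range(2)]  # 3x2 lattice nodes

# Initialize the neuron chain (amino-acid positions) on a small circle.
neurons = [(1.0 + 0.3 * math.cos(2 * math.pi * i / len(SEQ)),
            0.5 + 0.3 * math.sin(2 * math.pi * i / len(SEQ)))
           for i in range(len(SEQ))]

def som_step(neurons, lr=0.5, sigma=1.0, hh=0.1):
    """One SOM pass: each lattice node attracts its winner neuron and the
    winner's chain neighbours; hydrophobic pairs attract each other."""
    pts = [list(p) for p in neurons]
    for cx, cy in LATTICE:
        # Winner: the neuron currently closest to this lattice node.
        w = min(range(len(pts)),
                key=lambda i: (pts[i][0] - cx) ** 2 + (pts[i][1] - cy) ** 2)
        for i, p in enumerate(pts):
            g = math.exp(-((i - w) ** 2) / (2 * sigma ** 2))  # chain neighbourhood
            p[0] += lr * g * (cx - p[0])
            p[1] += lr * g * (cy - p[1])
    # HH interaction: pull non-adjacent hydrophobic residues together.
    for i in range(len(SEQ)):
        for j in range(i + 2, len(SEQ)):
            if SEQ[i] == "H" and SEQ[j] == "H":
                dx, dy = pts[j][0] - pts[i][0], pts[j][1] - pts[i][1]
                pts[i][0] += hh * dx; pts[i][1] += hh * dy
                pts[j][0] -= hh * dx; pts[j][1] -= hh * dy
    return [tuple(p) for p in pts]

neurons = som_step(neurons)
```

Iterating `som_step` until each neuron settles on a distinct lattice node would yield a candidate self-avoiding path; scoring its HH contacts then gives the HP energy.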
Bounded Coordinate-Descent for Biological Sequence Classification in High Dimensional Predictor Space
We present a framework for discriminative sequence classification where the learner works directly in the high-dimensional predictor space of all subsequences in the training set. This is made possible by a new coordinate-descent algorithm, coupled with a bound on the magnitude of the gradient, for quickly selecting discriminative subsequences. We characterize the loss functions for which our generic learning algorithm can be applied and present concrete implementations for logistic regression (binomial log-likelihood loss) and support vector machines (squared hinge loss). Applying our algorithm to protein remote homology detection and remote fold recognition yields performance comparable to that of state-of-the-art methods (e.g., kernel support vector machines). Unlike state-of-the-art classifiers, the resulting classification models are simply lists of weighted discriminative subsequences and can therefore be interpreted and related to the biological problem.
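One greedy coordinate-descent step of the kind described can be sketched as follows. This is a hedged illustration: the toy strings, candidate set, and learning rate are assumptions, matching occurrences are contiguous substrings, and the paper's gradient bound for pruning the subsequence search space is not reproduced.

```python
# Minimal sketch of one greedy coordinate-descent step with a squared
# hinge loss, where each candidate subsequence s contributes a binary
# occurrence feature. Data and candidates below are illustrative only.
docs = ["abab", "abba", "bbaa", "aabb"]   # toy sequences
y = [1, 1, -1, -1]                        # toy labels
cands = ["ab", "ba", "aa", "bb"]          # candidate subsequence features
beta = {s: 0.0 for s in cands}            # one weight per subsequence

def margin(d):
    """Model score: sum of weights of subsequences occurring in d."""
    return sum(beta[s] for s in cands if s in d)

def grad(s):
    """Gradient of the squared hinge loss, L = sum_i max(0, 1 - y_i f(x_i))^2,
    with respect to the weight of subsequence s."""
    g = 0.0
    for d, yi in zip(docs, y):
        if s in d:
            slack = 1.0 - yi * margin(d)
            if slack > 0:
                g += -2.0 * yi * slack
    return g

# Pick the coordinate with the largest gradient magnitude and update it.
best = max(cands, key=lambda s: abs(grad(s)))
beta[best] -= 0.1 * grad(best)
```

Here the selected subsequence is "aa", which occurs only in negative examples, so its weight is pushed negative; the final model is exactly such a list of weighted subsequences, which is what makes it interpretable.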
Machine Learning and Integrative Analysis of Biomedical Big Data.
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., the genome) are analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and to advances in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to perform integrative analysis of biomedical data acquired from diverse modalities effectively and efficiently. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: the curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability issues.
Random lasso
We propose a computationally intensive method, the random lasso method, for variable selection in linear models. The method consists of two major steps. In step 1, the lasso method is applied to many bootstrap samples, each using a set of randomly selected covariates; this step yields an importance measure for each covariate. In step 2, a procedure similar to the first step is implemented, with the exception that for each bootstrap sample a subset of covariates is randomly selected with unequal selection probabilities determined by the covariates' importance. The adaptive lasso may be used in the second step, with weights determined by the importance measures. The final set of covariates and their coefficients is determined by averaging the bootstrap results from step 2. The proposed method alleviates some of the limitations of the lasso, the elastic net, and related methods noted especially in the context of microarray data analysis: it tends to remove highly correlated variables altogether or select them all, and it maintains maximal flexibility in estimating their coefficients, particularly with different signs; the number of selected variables is no longer limited by the sample size; and the resulting prediction accuracy is competitive with or superior to that of the alternatives. We illustrate the proposed method by extensive simulation studies and apply it to a glioblastoma microarray data analysis.
Comment: Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/), http://dx.doi.org/10.1214/10-AOAS377, by the Institute of Mathematical Statistics (http://www.imstat.org/).
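The two-step bootstrap procedure above can be sketched in code. This is a hedged toy, not the authors' implementation: the inner lasso is a bare-bones cyclic coordinate-descent solver, and the sample sizes, penalty level, number of bootstraps `B`, and covariates-per-sample `q` are illustrative choices.

```python
# Hedged sketch of the two-step random lasso procedure with a toy
# soft-thresholding lasso solver. All tuning values are illustrative.
import random

random.seed(0)

def soft(x, t):
    """Soft-thresholding operator S(x, t)."""
    return (x - t) if x > t else (x + t) if x < -t else 0.0

def toy_lasso(X, y, lam=0.1, iters=50):
    """Cyclic coordinate descent for (1/2n)||y - Xb||^2 + lam ||b||_1."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(iters):
        for j in range(p):
            # Partial residual excluding coordinate j.
            r = [y[i] - sum(beta[k] * X[i][k] for k in range(p) if k != j)
                 for i in range(n)]
            rho = sum(X[i][j] * r[i] for i in range(n)) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft(rho, lam) / z if z > 0 else 0.0
    return beta

def random_lasso(X, y, B=20, q=2):
    n, p = len(X), len(X[0])
    # Step 1: bootstrap + q random covariates -> importance measures.
    imp = [0.0] * p
    for _ in range(B):
        idx = [random.randrange(n) for _ in range(n)]      # bootstrap rows
        cols = random.sample(range(p), q)                  # random covariates
        b = toy_lasso([[X[i][j] for j in cols] for i in idx],
                      [y[i] for i in idx])
        for c, j in enumerate(cols):
            imp[j] += abs(b[c]) / B
    # Step 2: resample covariates with importance-weighted probabilities,
    # then average the bootstrap coefficients.
    beta = [0.0] * p
    tot = sum(imp) or 1.0
    for _ in range(B):
        idx = [random.randrange(n) for _ in range(n)]
        cols = random.choices(range(p),
                              weights=[m / tot + 1e-6 for m in imp], k=q)
        cols = list(dict.fromkeys(cols))                   # drop duplicates
        b = toy_lasso([[X[i][j] for j in cols] for i in idx],
                      [y[i] for i in idx])
        for c, j in enumerate(cols):
            beta[j] += b[c] / B
    return beta

# Toy data: the first covariate drives the response.
X = [[1, 0, 0], [2, 0, 1], [3, 1, 0], [4, 1, 1],
     [-1, 0, 1], [-2, 1, 0], [-3, 1, 1], [-4, 0, 0]]
y = [2 * row[0] + 0.1 * row[1] for row in X]
beta = random_lasso(X, y)
```

Because step 2 samples covariates in proportion to their step-1 importance, correlated covariates with similar importance tend to enter or leave the model together, which is the behaviour the abstract highlights.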
Structured penalized regression for drug sensitivity prediction
Large-scale in vitro drug sensitivity screens are an important tool in personalized oncology for predicting the effectiveness of potential cancer drugs. Predicting the sensitivity of cancer cell lines to a panel of drugs is a multivariate regression problem with high-dimensional, heterogeneous multi-omics data as input and with potentially strong correlations between the outcome variables, which represent the sensitivity to the different drugs. We propose a joint penalized regression approach with structured penalty terms that allows us to exploit the correlation structure between drugs with group-lasso-type penalties and, at the same time, to address the heterogeneity between omics data sources by introducing data-source-specific penalty factors that penalize the different data sources differently. By combining integrative penalty factors (IPF) with the tree-guided group lasso, we create the IPF-tree-lasso method. We present a unified framework for transforming more general IPF-type methods back to the original penalized method. Because the structured penalty terms have multiple parameters, we demonstrate how the interval-search Efficient Parameter Selection via Global Optimization (EPSGO) algorithm can be used to optimize multiple penalty parameters efficiently. Simulation studies show that IPF-tree-lasso can improve prediction performance compared with other lasso-type methods, in particular for heterogeneous data sources. Finally, we employ the new methods to analyse data from the Genomics of Drug Sensitivity in Cancer project.
Comment: Zhao Z, Zucknick M (2020). Structured penalized regression for drug sensitivity prediction. Journal of the Royal Statistical Society, Series C. 19 pages, 6 figures, and 2 tables.
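The combination of source-specific penalty factors with a tree-guided group penalty can be written schematically as follows; the notation here is an assumption for illustration, not taken from the paper. For omics sources m = 1, …, M with penalty factors λ_m, and a tree over the d drugs whose nodes v ∈ 𝒱 define groups of drugs G_v with weights w_v,

```latex
\min_{B \in \mathbb{R}^{p \times d}}
  \; \| Y - X B \|_F^2
  \; + \; \sum_{m=1}^{M} \lambda_m
          \sum_{v \in \mathcal{V}} w_v \, \bigl\| B^{(m)}_{G_v} \bigr\|_2
```

where B^{(m)} denotes the coefficient rows belonging to data source m. The group norms ‖·‖₂ couple the drugs within each tree node (sharing information between correlated drugs), while the λ_m allow each data source to be penalized at its own strength; tuning the multiple λ_m is what the EPSGO interval search is used for.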
Machine learning applied to enzyme turnover numbers reveals protein structural correlates and improves metabolic models.
Knowing the catalytic turnover numbers of enzymes is essential for understanding the growth rate, proteome composition, and physiology of organisms, but experimental data on enzyme turnover numbers are sparse and noisy. Here, we demonstrate that machine learning can successfully predict catalytic turnover numbers in Escherichia coli based on integrated data on enzyme biochemistry, protein structure, and network context. We identify a diverse set of features that are consistently predictive of both in vivo and in vitro enzyme turnover rates, revealing novel protein structural correlates of catalytic turnover. We use our predictions to parameterize two mechanistic genome-scale modelling frameworks for proteome-limited metabolism, leading to significantly higher accuracy in the prediction of quantitative proteome data than previous approaches. The presented machine learning models thus provide a valuable tool for understanding metabolism and the proteome at the genome scale, and they elucidate structural, biochemical, and network properties that underlie enzyme kinetics.