13,349 research outputs found
DeepConv-DTI: Prediction of drug-target interactions via deep learning with convolution on protein sequences
Identification of drug-target interactions (DTIs) plays a key role in drug
discovery. The high cost and labor-intensive nature of in vitro and in vivo
experiments have highlighted the importance of in silico-based DTI prediction
approaches. In several computational models, conventional protein descriptors
are shown to be not informative enough to predict accurate DTIs. Thus, in this
study, we employ a convolutional neural network (CNN) on raw protein sequences
to capture local residue patterns participating in DTIs. With CNN on protein
sequences, our model performs better than previous protein descriptor-based
models. In addition, our model performs better than the previous deep learning
model for massive prediction of DTIs. By examining the pooled convolution
results, we found that our model can detect binding sites of proteins for DTIs.
In conclusion, our prediction model for detecting local residue patterns of
target proteins successfully enriches the protein features of a raw protein
sequence, yielding better prediction results than previous approaches.Comment: 26 pages, 7 figure
Representability of algebraic topology for biomolecules in machine learning based scoring and virtual screening
This work introduces a number of algebraic topology approaches, such as
multicomponent persistent homology, multi-level persistent homology and
electrostatic persistence for the representation, characterization, and
description of small molecules and biomolecular complexes. Multicomponent
persistent homology retains critical chemical and biological information during
the topological simplification of biomolecular geometric complexity.
Multi-level persistent homology enables a tailored topological description of
inter- and/or intra-molecular interactions of interest. Electrostatic
persistence incorporates partial charge information into topological
invariants. These topological methods are paired with Wasserstein distance to
characterize similarities between molecules and are further integrated with a
variety of machine learning algorithms, including k-nearest neighbors, ensemble
of trees, and deep convolutional neural networks, to manifest their descriptive
and predictive powers for chemical and biological problems. Extensive numerical
experiments involving more than 4,000 protein-ligand complexes from the PDBBind
database and near 100,000 ligands and decoys in the DUD database are performed
to test respectively the scoring power and the virtual screening power of the
proposed topological approaches. It is demonstrated that the present approaches
outperform the modern machine learning based methods in protein-ligand binding
affinity predictions and ligand-decoy discrimination
CoGANPPIS: Coevolution-enhanced Global Attention Neural Network for Protein-Protein Interaction Site Prediction
Protein-protein interactions are essential in biochemical processes. Accurate
prediction of the protein-protein interaction sites (PPIs) deepens our
understanding of biological mechanism and is crucial for new drug design.
However, conventional experimental methods for PPIs prediction are costly and
time-consuming so that many computational approaches, especially ML-based
methods, have been developed recently. Although these approaches have achieved
gratifying results, there are still two limitations: (1) Most models have
excavated some useful input features, but failed to take coevolutionary
features into account, which could provide clues for inter-residue
relationships; (2) The attention-based models only allocate attention weights
for neighboring residues, instead of doing it globally, neglecting that some
residues being far away from the target residues might also matter.
We propose a coevolution-enhanced global attention neural network, a
sequence-based deep learning model for PPIs prediction, called CoGANPPIS. It
utilizes three layers in parallel for feature extraction: (1) Local-level
representation aggregation layer, which aggregates the neighboring residues'
features; (2) Global-level representation learning layer, which employs a novel
coevolution-enhanced global attention mechanism to allocate attention weights
to all the residues on the same protein sequences; (3) Coevolutionary
information learning layer, which applies CNN & pooling to coevolutionary
information to obtain the coevolutionary profile representation. Then, the
three outputs are concatenated and passed into several fully connected layers
for the final prediction. Application on two benchmark datasets demonstrated a
state-of-the-art performance of our model. The source code is publicly
available at https://github.com/Slam1423/CoGANPPIS_source_code
Computational approaches to predict protein functional families and functional sites.
Understanding the mechanisms of protein function is indispensable for many biological applications, such as protein engineering and drug design. However, experimental annotations are sparse, and therefore, theoretical strategies are needed to fill the gap. Here, we present the latest developments in building functional subclassifications of protein superfamilies and using evolutionary conservation to detect functional determinants, for example, catalytic-, binding- and specificity-determining residues important for delineating the functional families. We also briefly review other features exploited for functional site detection and new machine learning strategies for combining multiple features
Machine Learning and Integrative Analysis of Biomedical Big Data.
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbates the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues
- …