A topological approach for protein classification
A protein's function and dynamics are closely related to its sequence and
structure. However, predicting protein function and dynamics from sequence
and structure is still a fundamental challenge in molecular biology.
Protein classification, which is typically done through measuring the
similarity between proteins based on protein sequence or physical
information, serves as a crucial step toward the understanding of protein
function and dynamics. Persistent homology is a new branch of algebraic
topology that has found success in topological data analysis across a
variety of disciplines, including molecular biology. The present work explores
the potential of using persistent homology as an independent tool for protein
classification. To this end, we propose a molecular topological fingerprint
based support vector machine (MTF-SVM) classifier. Specifically, we construct
machine learning feature vectors solely from protein topological fingerprints,
which are topological invariants generated during the filtration process. To
validate the present MTF-SVM approach, we consider four types of problems.
First, we study protein-drug binding by using the M2 channel protein of
influenza A virus. We achieve 96% accuracy in discriminating drug-bound and
unbound M2 channels. Additionally, we examine the use of MTF-SVM for the
classification of hemoglobin molecules in their relaxed and taut forms and
obtain about 80% accuracy. The identification of all alpha, all beta, and
alpha-beta protein domains is carried out in our next study using 900 proteins.
We achieve an 85% success rate in this identification. Finally, we apply the
present technique to 55 classification tasks of protein superfamilies over 1357
samples. An average accuracy of 82% is attained. The present study establishes
computational topology as an independent and effective alternative for protein
classification.
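As a rough illustration of how a feature vector can be built from topological fingerprints, the sketch below computes the dimension-0 persistence bars of a small point cloud with a union-find pass over sorted pairwise distances, then summarizes them. It is not the MTF-SVM implementation: the names `h0_barcode` and `fingerprint`, the toy coordinates, and the choice of summary statistics are assumptions made for illustration, and the filtration parameter is taken to be the pairwise distance.

```python
import math
from itertools import combinations


def h0_barcode(points):
    """Death values of the finite dimension-0 bars (all bars are born at 0)."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the component root, shortening the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    deaths = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            # Two components merge at this scale, so one bar ends here.
            parent[root_i] = root_j
            deaths.append(length)
    return deaths  # n - 1 finite bars; one further bar never dies


def fingerprint(deaths):
    """Hypothetical feature vector: bar count, longest bar, mean and total length."""
    if not deaths:
        return [0, 0.0, 0.0, 0.0]
    return [len(deaths), max(deaths), sum(deaths) / len(deaths), sum(deaths)]


# Toy coordinates standing in for atom positions (assumed, not from the paper).
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
bars = h0_barcode(points)
features = fingerprint(bars)
print(bars, features)
```

Vectors of this kind, one per protein and ideally extended with higher-dimensional bars, could then be passed to any SVM implementation.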
Persistent homology analysis of ion aggregation and hydrogen-bonding network
Despite the great advancement of experimental tools and theoretical models, a
quantitative characterization of the microscopic structures of ion aggregates
and their associated water hydrogen-bonding networks remains a challenging
problem. In this paper, a newly invented mathematical method called persistent
homology is introduced, for the first time, to quantitatively analyze the
intrinsic topological properties of ion aggregation systems and
hydrogen-bonding networks. The two most distinctive properties of persistent
homology analysis of assembly systems are as follows. First, it does not
require a predefined bond length to construct the ion or hydrogen network.
Persistent homology results are determined by the morphological structure of
the data only. Second, it can directly measure the size of circles or holes in
ion aggregates and hydrogen-bonding networks. To validate our model, we
consider two well-studied systems, i.e., NaCl and KSCN solutions, generated
from molecular dynamics simulations. They are believed to represent two
morphological types of aggregation, i.e., local clusters and extended ion
network. It has been found that the two aggregation types have distinguishable
topological features and can be characterized by our topological model very
well. For hydrogen-bonding networks, KSCN systems demonstrate much more
dramatic variations in their local circle structures as the concentration
increases. A consistent increase of large-sized local circle structures is
observed and the sizes of these circles become more and more diverse. In
contrast, NaCl systems show no obvious increase of large-sized circles. Instead,
a consistent decline of the average size of circle structures is observed and
the sizes of these circles become more and more uniform as the concentration
increases.
Comment: 21 pages, 11 figures, 2 tables
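The two properties named in this abstract (no predefined bond length, and a direct measure of circle size) can be illustrated with a small sketch. It builds a Vietoris-Rips filtration over all pairwise distances and reduces the boundary matrix over Z/2 to read off dimension-1 bars. The function name `rips_h1_bars` and the four-point toy configuration are assumptions for illustration, not the authors' code, and the brute-force construction is only practical for a handful of points.

```python
import math
from itertools import combinations


def rips_h1_bars(points):
    """(birth, death) pairs of dimension-1 bars with positive length."""
    n = len(points)

    def dist(i, j):
        return math.dist(points[i], points[j])

    # Every vertex, edge and triangle enters at the scale of its longest edge,
    # so no bond-length cutoff is chosen in advance.
    simplices = [((i,), 0.0) for i in range(n)]
    simplices += [((i, j), dist(i, j)) for i, j in combinations(range(n), 2)]
    simplices += [
        (t, max(dist(a, b) for a, b in combinations(t, 2)))
        for t in combinations(range(n), 3)
    ]
    # Order by scale, then dimension, so faces always precede cofaces.
    simplices.sort(key=lambda s: (s[1], len(s[0]), s[0]))
    index = {s: k for k, (s, _) in enumerate(simplices)}

    reduced = {}      # column index -> reduced boundary (set of row indices)
    pivot_owner = {}  # lowest row index -> column that owns that pivot
    bars = []
    for k, (simplex, value) in enumerate(simplices):
        if len(simplex) == 1:
            continue  # vertices have an empty boundary
        column = {index[face] for face in combinations(simplex, len(simplex) - 1)}
        while column and max(column) in pivot_owner:
            column ^= reduced[pivot_owner[max(column)]]  # add columns mod 2
        if column:
            low = max(column)
            pivot_owner[low] = k
            reduced[k] = column
            creator, birth = simplices[low]
            # A triangle killing a class created by an edge closes an H1 bar.
            if len(creator) == 2 and value > birth:
                bars.append((birth, value))
    return bars


# Unit square: one loop appears at side length 1 and fills in at the diagonal.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
loops = rips_h1_bars(square)
print(loops)
```

The death value of each bar grows with the size of the hole, which is the sense in which the bars measure circle size. Because the full Rips complex on a finite point set eventually fills in every loop, all dimension-1 bars in this sketch are finite.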
Weighted persistent homology for biomolecular data analysis
In this paper, we systematically review weighted persistent homology (WPH)
models and their applications in biomolecular data analysis. Essentially, the
weight value, which reflects physical, chemical and biological properties, can
be assigned to vertices (atom centers), edges (bonds), or higher-order
simplices (clusters of atoms), depending on the biomolecular structure,
function, and dynamics properties. Further, we propose the first localized
weighted persistent homology (LWPH). Inspired by the great success of element
specific persistent homology (ESPH), we do not treat biomolecules as an
inseparable system, as all previous weighted models do; instead, we decompose
them into a series of local domains, which may overlap with each other. The
general persistent homology or weighted persistent homology analysis is then
applied to each of these local domains. In this way, functional properties
that are embedded in local structures can be revealed. Our model has been
applied to the systematic study of DNA structures. It has been found that our
LWPH-based features can be used to successfully discriminate the A-, B-, and
Z-types of DNA. More importantly, our LWPH-based PCA model can identify two
configurational states of DNA structure in an ionic liquid environment, which
could otherwise be revealed only by the complicated helical coordinate system.
The close agreement with the helical-coordinate model demonstrates that our
model captures local structure variations so well that it is comparable with
geometric models. Moreover, geometric measurements are usually defined in very
local regions. For instance, the helical-coordinate system is limited to one or
two basepairs. However, our LWPH can quantitatively characterize structure
information in local regions or domains with arbitrary sizes and shapes, where
traditional geometrical measurements fail.
Comment: 27 pages, 18 figures
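The localized idea described above, decomposing a structure into overlapping local domains and analyzing each one separately, can be sketched as follows. This is a minimal, unweighted illustration under stated assumptions: domains are taken to be sliding index windows along a chain, each domain is summarized by its longest finite dimension-0 bar, and the names `local_domains` and `longest_h0_bar` as well as the toy coordinates are hypothetical.

```python
import math
from itertools import combinations


def local_domains(n_items, size, step):
    """Index windows of length `size`; consecutive windows overlap when step < size."""
    return [range(start, start + size) for start in range(0, n_items - size + 1, step)]


def longest_h0_bar(points):
    """Largest scale at which two connected components of the domain merge."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    longest = 0.0
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            longest = length  # edges are sorted, so the last merge is the longest
    return longest


# Toy chain with a gap between the fourth and fifth points (assumed data).
chain = [(float(x), 0.0) for x in (0, 1, 2, 3, 10, 11, 12, 13)]
profile = [
    longest_h0_bar([chain[i] for i in domain])
    for domain in local_domains(len(chain), size=4, step=2)
]
print(profile)
```

Only the middle window straddles the gap, so the per-domain profile localizes the structural change that a single global barcode would report without saying where it occurs.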
Representability of algebraic topology for biomolecules in machine learning based scoring and virtual screening
This work introduces a number of algebraic topology approaches, such as
multicomponent persistent homology, multi-level persistent homology and
electrostatic persistence, for the representation, characterization, and
description of small molecules and biomolecular complexes. Multicomponent
persistent homology retains critical chemical and biological information during
the topological simplification of biomolecular geometric complexity.
Multi-level persistent homology enables a tailored topological description of
inter- and/or intra-molecular interactions of interest. Electrostatic
persistence incorporates partial charge information into topological
invariants. These topological methods are paired with Wasserstein distance to
characterize similarities between molecules and are further integrated with a
variety of machine learning algorithms, including k-nearest neighbors, ensemble
of trees, and deep convolutional neural networks, to manifest their descriptive
and predictive powers for chemical and biological problems. Extensive numerical
experiments involving more than 4,000 protein-ligand complexes from the PDBBind
database and nearly 100,000 ligands and decoys in the DUD database are performed
to test respectively the scoring power and the virtual screening power of the
proposed topological approaches. It is demonstrated that the present approaches
outperform modern machine-learning-based methods in protein-ligand binding
affinity predictions and ligand-decoy discrimination.
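To make the pairing of topological descriptors with Wasserstein distance and a nearest-neighbor rule concrete, here is a small sketch. It computes the p-Wasserstein distance between two persistence diagrams by brute force over all matchings, letting any point be matched to the diagonal, and uses it for a 1-nearest-neighbor lookup. The L-infinity ground metric, the default p = 2, the names `wasserstein` and `nearest_label`, and the toy diagrams and labels are all assumptions for illustration; the factorial search is only usable for diagrams with a few points.

```python
from itertools import permutations


def wasserstein(diag_a, diag_b, p=2):
    """p-Wasserstein distance between two small diagrams of (birth, death) pairs."""
    n, m = len(diag_a), len(diag_b)
    size = n + m  # rows: points of A plus m diagonal slots; columns: B plus n slots

    def cost(i, j):
        if i < n and j < m:
            (b1, d1), (b2, d2) = diag_a[i], diag_b[j]
            return max(abs(b1 - b2), abs(d1 - d2)) ** p
        if i < n:  # point of A sent to the diagonal
            b1, d1 = diag_a[i]
            return ((d1 - b1) / 2) ** p
        if j < m:  # point of B sent to the diagonal
            b2, d2 = diag_b[j]
            return ((d2 - b2) / 2) ** p
        return 0.0  # diagonal slot matched to diagonal slot

    best = min(
        sum(cost(i, perm[i]) for i in range(size))
        for perm in permutations(range(size))
    )
    return best ** (1 / p)


def nearest_label(query, labelled_diagrams):
    """Label of the stored diagram closest to `query` (1-nearest neighbor)."""
    return min(labelled_diagrams, key=lambda item: wasserstein(query, item[0]))[1]


# Hypothetical diagrams and labels, purely for demonstration.
library = [([(0.0, 4.0)], "class_1"), ([(0.0, 1.0)], "class_2")]
distance = wasserstein([(0.0, 4.0)], [(0.0, 3.0)])
predicted = nearest_label([(0.0, 3.5)], library)
print(distance, predicted)
```

For diagrams of realistic size, the same cost matrix would normally be handed to a polynomial-time assignment solver rather than enumerated.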