19 research outputs found

    A topological approach for protein classification

    Protein function and dynamics are closely related to protein sequence and structure. However, predicting protein function and dynamics from sequence and structure is still a fundamental challenge in molecular biology. Protein classification, which is typically done by measuring the similarity between proteins based on protein sequence or physical information, serves as a crucial step toward understanding protein function and dynamics. Persistent homology is a new branch of algebraic topology that has found success in topological data analysis in a variety of disciplines, including molecular biology. The present work explores the potential of using persistent homology as an independent tool for protein classification. To this end, we propose a molecular topological fingerprint based support vector machine (MTF-SVM) classifier. Specifically, we construct machine learning feature vectors solely from protein topological fingerprints, which are topological invariants generated during the filtration process. To validate the present MTF-SVM approach, we consider four types of problems. First, we study protein-drug binding by using the M2 channel protein of influenza A virus. We achieve 96% accuracy in discriminating drug-bound and unbound M2 channels. Additionally, we examine the use of MTF-SVM for the classification of hemoglobin molecules in their relaxed and taut forms and obtain about 80% accuracy. The identification of all-alpha, all-beta, and alpha-beta protein domains is carried out in our next study using 900 proteins. We have found an 85% success rate in this identification. Finally, we apply the present technique to 55 classification tasks of protein superfamilies over 1357 samples. An average accuracy of 82% is attained. The present study establishes computational topology as an independent and effective alternative for protein classification.
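
    The abstract above describes building feature vectors solely from topological fingerprints (barcodes). A minimal sketch of that idea follows. The four summary statistics and the names `barcode_features` and `fingerprint_vector` are illustrative assumptions, not the features used in the paper; the barcodes are assumed to have been computed elsewhere as lists of (birth, death) pairs per homology dimension.

```python
from statistics import mean


def barcode_features(bars):
    """Summarize one barcode (list of (birth, death) pairs) as 4 numbers."""
    lengths = sorted((death - birth for birth, death in bars), reverse=True)
    if not lengths:
        return [0.0, 0.0, 0.0, 0.0]
    return [
        float(len(lengths)),   # number of bars
        float(lengths[0]),     # longest bar
        float(sum(lengths)),   # total persistence
        float(mean(lengths)),  # average bar length
    ]


def fingerprint_vector(barcodes_by_dim):
    """Concatenate per-dimension summaries (Betti-0, Betti-1, ...)."""
    vector = []
    for dim in sorted(barcodes_by_dim):
        vector.extend(barcode_features(barcodes_by_dim[dim]))
    return vector


# Hypothetical barcodes: two Betti-0 bars and one Betti-1 bar.
example = {0: [(0.0, 1.5), (0.0, 2.0)], 1: [(2.5, 4.0)]}
features = fingerprint_vector(example)
```

    Vectors of this kind, one per protein, could then be passed to any standard support vector machine implementation; the classifier itself is omitted here.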

    Persistent homology analysis of ion aggregation and hydrogen-bonding network

    Despite the great advancement of experimental tools and theoretical models, a quantitative characterization of the microscopic structures of ion aggregates and their associated water hydrogen-bonding networks remains a challenging problem. In this paper, a newly invented mathematical method called persistent homology is introduced, for the first time, to quantitatively analyze the intrinsic topological properties of ion aggregation systems and hydrogen-bonding networks. The two most distinctive properties of persistent homology analysis of assembly systems are as follows. First, it does not require a predefined bond length to construct the ion or hydrogen network. Persistent homology results are determined by the morphological structure of the data only. Second, it can directly measure the size of circles or holes in ion aggregates and hydrogen-bonding networks. To validate our model, we consider two well-studied systems, i.e., NaCl and KSCN solutions, generated from molecular dynamics simulations. They are believed to represent two morphological types of aggregation, i.e., local clusters and extended ion networks. It has been found that the two aggregation types have distinguishable topological features and can be characterized by our topological model very well. For hydrogen-bonding networks, KSCN systems demonstrate much more dramatic variations in their local circle structures as the concentration increases. A consistent increase of large-sized local circle structures is observed and the sizes of these circles become more and more diverse. In contrast, NaCl systems show no obvious increase of large-sized circles. Instead, a consistent decline of the average size of circle structures is observed and the sizes of these circles become more and more uniform as the concentration increases.
    Comment: 21 pages, 11 figures, 2 tables
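
    The point that no predefined bond length is needed can be illustrated with a small sketch of zero-dimensional persistence only: every pairwise distance is swept as a filtration scale, and each connected component "dies" at the scale where it merges into another. This is a generic union-find (Kruskal-style) construction with a hypothetical function name `h0_barcode`, not the authors' code, and it does not compute the circle (one-dimensional) features discussed in the abstract.

```python
from itertools import combinations
from math import dist


def h0_barcode(points):
    """Death scales of connected components in a Vietoris-Rips filtration.

    Every point is born at scale 0; a component dies at the length of the
    edge that first merges it into another component.
    """
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # All pairwise distances, shortest first: no cutoff is chosen in advance.
    edges = sorted(
        (dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    deaths = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            deaths.append(length)
    return deaths  # the single component that never dies is not listed


# Made-up coordinates: two tight pairs that are far from each other.
cloud = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
deaths = h0_barcode(cloud)
```

    In this toy cloud the two short bars reflect the tight pairs and the one long bar reflects the gap between them, which is the kind of cluster-versus-network distinction a barcode can expose without a bond-length threshold.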

    Weighted persistent homology for biomolecular data analysis

    In this paper, we systematically review weighted persistent homology (WPH) models and their applications in biomolecular data analysis. Essentially, the weight value, which reflects physical, chemical and biological properties, can be assigned to vertices (atom centers), edges (bonds), or higher order simplexes (clusters of atoms), depending on the biomolecular structure, function, and dynamics properties. Further, we propose the first localized weighted persistent homology (LWPH). Inspired by the great success of element specific persistent homology (ESPH), we do not treat biomolecules as an inseparable system as all previous weighted models do; instead, we decompose them into a series of local domains, which may overlap with each other. The general persistent homology or weighted persistent homology analysis is then applied to each of these local domains. In this way, functional properties that are embedded in local structures can be revealed. Our model has been applied to the systematic study of DNA structures. It has been found that our LWPH based features can be used to successfully discriminate the A-, B-, and Z-types of DNA. More importantly, our LWPH based PCA model can identify two configurational states of DNA structure in ion liquid environment, which can otherwise be revealed only by the complicated helical coordinate system. The close agreement with the helical-coordinate model demonstrates that our model captures local structure variations so well that it is comparable with geometric models. Moreover, geometric measurements are usually defined in very local regions. For instance, the helical-coordinate system is limited to one or two basepairs. However, our LWPH can quantitatively characterize structure information in local regions or domains with arbitrary sizes and shapes, where traditional geometrical measurements fail.
    Comment: 27 pages; 18 figures

    Representability of algebraic topology for biomolecules in machine learning based scoring and virtual screening

    This work introduces a number of algebraic topology approaches, such as multicomponent persistent homology, multi-level persistent homology and electrostatic persistence, for the representation, characterization, and description of small molecules and biomolecular complexes. Multicomponent persistent homology retains critical chemical and biological information during the topological simplification of biomolecular geometric complexity. Multi-level persistent homology enables a tailored topological description of inter- and/or intra-molecular interactions of interest. Electrostatic persistence incorporates partial charge information into topological invariants. These topological methods are paired with Wasserstein distance to characterize similarities between molecules and are further integrated with a variety of machine learning algorithms, including k-nearest neighbors, ensemble of trees, and deep convolutional neural networks, to demonstrate their descriptive and predictive powers for chemical and biological problems. Extensive numerical experiments involving more than 4,000 protein-ligand complexes from the PDBBind database and nearly 100,000 ligands and decoys in the DUD database are performed to test, respectively, the scoring power and the virtual screening power of the proposed topological approaches. It is demonstrated that the present approaches outperform the modern machine learning based methods in protein-ligand binding affinity predictions and ligand-decoy discrimination.
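
    The pairing of topological descriptors with Wasserstein distance can be made concrete with a small sketch. It computes the p-Wasserstein distance between two persistence diagrams by brute force over all matchings, so it is only practical for a handful of points. The L-infinity ground metric, the default p = 2, and the function name `wasserstein_distance` are assumptions made for illustration; the abstract does not state which conventions the authors use.

```python
from itertools import permutations


def wasserstein_distance(diagram_a, diagram_b, p=2):
    """p-Wasserstein distance between two small persistence diagrams.

    Points are (birth, death) pairs. A point may be matched to a point of
    the other diagram or sent to the diagonal (birth == death).
    """
    n, m = len(diagram_a), len(diagram_b)
    size = n + m  # each side is padded with diagonal slots for the other

    def cost(i, j):
        a_real, b_real = i < n, j < m
        if a_real and b_real:
            (b1, d1), (b2, d2) = diagram_a[i], diagram_b[j]
            return max(abs(b1 - b2), abs(d1 - d2))  # L-infinity distance
        if a_real:
            b1, d1 = diagram_a[i]
            return (d1 - b1) / 2  # L-infinity distance to the diagonal
        if b_real:
            b2, d2 = diagram_b[j]
            return (d2 - b2) / 2
        return 0.0  # diagonal slot matched to diagonal slot

    best = min(
        sum(cost(i, j) ** p for i, j in enumerate(perm))
        for perm in permutations(range(size))
    )
    return best ** (1 / p)


# Two one-bar diagrams that differ only in death time.
d = wasserstein_distance([(0.0, 4.0)], [(0.0, 3.0)], p=1)
```

    A distance of this kind can serve directly as the metric in a k-nearest-neighbors classifier; for realistic diagram sizes the brute-force search would need to be replaced by a proper assignment solver.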