1,113 research outputs found

    Quantitative toxicity prediction using topology based multi-task deep neural networks

    The understanding of toxicity is of paramount importance to human health and environmental protection. Quantitative toxicity analysis has become a new standard in the field. This work introduces element-specific persistent homology (ESPH), an algebraic topology approach, for quantitative toxicity prediction. ESPH retains crucial chemical information during the topological abstraction of geometric complexity and provides a representation of small molecules that cannot be obtained by any other method. To investigate the representability and predictive power of ESPH for small molecules, ancillary descriptors have also been developed based on physical models. Topological and physical descriptors are paired with advanced machine learning algorithms, such as deep neural networks (DNN), random forests (RF) and gradient boosting decision trees (GBDT), to facilitate their application to quantitative toxicity predictions. A topology-based multi-task strategy is proposed to take advantage of large data sets when dealing with small data sets. Four benchmark toxicity data sets that involve quantitative measurements are used to validate the proposed approaches. Extensive numerical studies indicate that the proposed topological learning methods outperform the state-of-the-art methods in the literature for quantitative toxicity analysis. Our online server for computing element-specific topological descriptors (ESTDs) is available at http://weilab.math.msu.edu/TopTox/ Comment: arXiv admin note: substantial text overlap with arXiv:1703.1095
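
    The abstract does not spell out the multi-task architecture, so the following is only a minimal sketch of one common formulation: a shared trunk trained on all toxicity data sets, with a separate regression head per end point so that small data sets can borrow representations learned from larger ones. The layer sizes, descriptor dimension, and use of PyTorch are illustrative assumptions, not details taken from the paper.

        # Minimal multi-task regression sketch: one shared trunk, one head per
        # toxicity end point. All sizes are illustrative assumptions.
        import torch
        import torch.nn as nn

        class MultiTaskToxicityNet(nn.Module):
            def __init__(self, n_descriptors=200, n_tasks=4, hidden=256):
                super().__init__()
                # Shared layers, updated by mini-batches from every data set.
                self.trunk = nn.Sequential(
                    nn.Linear(n_descriptors, hidden), nn.ReLU(),
                    nn.Linear(hidden, hidden), nn.ReLU(),
                )
                # One regression output per toxicity end point.
                self.heads = nn.ModuleList(
                    [nn.Linear(hidden, 1) for _ in range(n_tasks)]
                )

            def forward(self, x, task_id):
                return self.heads[task_id](self.trunk(x)).squeeze(-1)

        model = MultiTaskToxicityNet()
        x = torch.randn(32, 200)        # 32 molecules, 200 topological descriptors
        y_pred = model(x, task_id=0)    # predicted values for the first end point

    Training would alternate mini-batches drawn from the different benchmark sets, so the trunk sees all data while each head is fitted only to its own end point.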

    Data-driven fault detection using trending analysis

    The objective of this research is to develop data-driven fault detection methods that do not rely on mathematical models yet are capable of detecting process malfunctions. Instead of using mathematical models for comparing performances, the methods developed rely on an extensive collection of data to establish classification schemes that detect faults in new data. The research develops two different trending approaches. One uses the normal data to define a one-class classifier. The second approach uses a data mining technique, e.g. the support vector machine (SVM), to define multi-class classifiers. Each classifier is trained on a set of example objects. The one-class classification assumes that only information on one of the classes, namely the normal class, is available. The boundary between the two classes, normal and faulty, is estimated from data of the normal class only. The research assumes that the convex hull of the normal data can be used to define a boundary separating normal and faulty data. The multi-class classifier is implemented through several binary classifiers. It is assumed that data from two classes are available and the decision boundary is supported from both sides by example objects. In order to detect significant trends in the data, the research implements a non-uniform quantization technique based on Lloyd's algorithm and defines a special subsequence-based kernel. The effect of the subsequence length is examined through computer simulations and theoretical analysis. The test bed used to collect data and implement the fault detection is a six-degree-of-freedom, rigid-body model of a B747 100/200, and only faults in the actuators are considered. To test the efficiency of the approach thoroughly, the tests use only sensor data that do not include manipulated variables. Even with this handicap, the approach is effective, averaging 79.5% correct detections, 16.7% missed alarms, and 3.9% false alarms for six different faults.
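
    As a concrete illustration of the convex-hull idea above, the sketch below flags a sample as faulty when it falls outside the convex hull of the normal training data. The three-dimensional feature space, the synthetic data, and the SciPy Delaunay-based membership test are assumptions made for this example, not the thesis implementation.

        # One-class detection sketch: a point outside the convex hull of the
        # normal data is declared faulty. Data and dimensions are illustrative.
        import numpy as np
        from scipy.spatial import Delaunay

        rng = np.random.default_rng(0)
        normal_data = rng.normal(size=(500, 3))   # features from fault-free runs
        hull = Delaunay(normal_data)              # triangulation of the normal region

        def is_faulty(samples):
            # find_simplex returns -1 for points outside the convex hull.
            return hull.find_simplex(samples) < 0

        new_samples = np.array([[0.1, -0.2, 0.3],   # typical operating point
                                [8.0,  9.0, 7.5]])  # far from the normal region
        print(is_faulty(new_samples))               # expected: [False  True]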

    A framework of face recognition with set of testing images

    We propose a novel framework to solve the face recognition problem based on a set of testing images. Our framework can handle the case where there is no pose overlap between the training set and the query set. The main techniques used in this framework are manifold alignment, face normalization and discriminant learning. Experiments on different databases show that our system outperforms several state-of-the-art methods.
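
    The abstract names the techniques but not how they are composed; one plausible reading of the set-based decision step is sketched below, where a discriminant classifier scores every image in the query set and the identity is chosen by majority vote. The synthetic features, the scikit-learn LDA classifier, and the voting rule are illustrative assumptions only.

        # Set-based recognition sketch: classify each query image, then vote.
        # Features and class structure are synthetic stand-ins.
        import numpy as np
        from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

        rng = np.random.default_rng(1)
        means = rng.normal(size=(10, 64))               # 10 identities, 64-d features
        y_train = rng.integers(0, 10, size=200)
        X_train = means[y_train] + 0.5 * rng.normal(size=(200, 64))

        query_set = means[3] + 0.5 * rng.normal(size=(15, 64))   # 15 images of subject 3

        lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
        votes = lda.predict(query_set)                  # per-image decisions
        identity = np.bincount(votes).argmax()          # set-level decision by voting
        print(identity)                                 # expected: 3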

    Representability of algebraic topology for biomolecules in machine learning based scoring and virtual screening

    This work introduces a number of algebraic topology approaches, such as multicomponent persistent homology, multi-level persistent homology and electrostatic persistence, for the representation, characterization, and description of small molecules and biomolecular complexes. Multicomponent persistent homology retains critical chemical and biological information during the topological simplification of biomolecular geometric complexity. Multi-level persistent homology enables a tailored topological description of inter- and/or intra-molecular interactions of interest. Electrostatic persistence incorporates partial charge information into topological invariants. These topological methods are paired with the Wasserstein distance to characterize similarities between molecules and are further integrated with a variety of machine learning algorithms, including k-nearest neighbors, ensembles of trees, and deep convolutional neural networks, to manifest their descriptive and predictive powers for chemical and biological problems. Extensive numerical experiments involving more than 4,000 protein-ligand complexes from the PDBBind database and nearly 100,000 ligands and decoys in the DUD database are performed to test, respectively, the scoring power and the virtual screening power of the proposed topological approaches. It is demonstrated that the present approaches outperform modern machine-learning-based methods in protein-ligand binding affinity prediction and ligand-decoy discrimination.
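
    A hedged sketch of the diagram-distance pipeline named above: compute a persistence diagram for each point cloud (a stand-in for molecular atom coordinates), take pairwise Wasserstein distances between diagrams, and pass the precomputed distance matrix to a k-nearest-neighbour classifier. The ripser and persim libraries, the toy point clouds, and the binder/decoy labels are assumptions, not the authors' code.

        # Persistence diagrams -> Wasserstein distances -> kNN, on toy data.
        import numpy as np
        from ripser import ripser                  # Vietoris-Rips persistence
        from persim import wasserstein             # distance between diagrams
        from sklearn.neighbors import KNeighborsClassifier

        rng = np.random.default_rng(0)
        clouds = [rng.normal(size=(40, 3)) for _ in range(6)]   # toy "molecules"
        labels = np.array([0, 0, 0, 1, 1, 1])                   # e.g. binder / decoy

        # One H1 (loop) persistence diagram per point cloud.
        dgms = [ripser(c, maxdim=1)['dgms'][1] for c in clouds]

        # Pairwise Wasserstein distances between diagrams.
        D = np.array([[wasserstein(a, b) for b in dgms] for a in dgms])

        knn = KNeighborsClassifier(n_neighbors=1, metric='precomputed')
        knn.fit(D, labels)
        print(knn.predict(D))      # resubstitution check on the toy data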

    Improving the resolution of interaction maps: A middleground between high-resolution complexes and genome-wide interactomes

    Protein-protein interactions are ubiquitous in biology and therefore central to understanding living organisms. In recent years, large-scale studies have been undertaken to describe, at least partially, protein-protein interaction maps, or interactomes, for a number of relevant organisms including human. Although the analysis of interaction networks is proving useful, current interactomes provide a blurry and granular picture of the molecular machinery: unless the structure of the protein complex is known, the molecular details of the interaction are missing, and sometimes it is not even possible to know whether the interaction between the proteins is direct, i.e. a physical interaction, or part of a functional, not necessarily direct, association. Unfortunately, the determination of the structure of protein complexes cannot keep pace with the discovery of new protein-protein interactions, resulting in a large, and increasing, gap between the number of complexes that are thought to exist and the number for which 3D structures are available. The aim of the thesis was to tackle this problem by implementing computational approaches to derive structural models of protein complexes and thus reduce this existing gap. Over the course of the thesis, a novel modelling algorithm to predict the structure of protein complexes, V-D2OCK, was implemented. This new algorithm combines structure-based prediction of protein binding sites, by means of the VORFFIP and M-VORFFIP algorithms also developed over the course of the thesis, with data-driven docking and energy minimization. This algorithm was used to improve the coverage and structural content of the human interactome compiled from different sources of interactomic data to ensure the most comprehensive interactome. Finally, the human interactome and structural models were compiled in a database, V-D2OCK DB, that offers easy and user-friendly access to the human interactome, including a bespoke graphical molecular viewer to facilitate the analysis of the structural models of protein complexes. Furthermore, new organisms, in addition to human, were included, providing a useful resource for the study of all known interactomes.

    Biomolecular function from structural snapshots

    Biological molecules can assume a continuous range of conformations during function. Near equilibrium, the Boltzmann relation connects a particular conformation's free energy to that conformation's occupation probability, thus giving rise to one or more energy landscapes. Biomolecular function proceeds along minimum-energy pathways on such landscapes. Consequently, a comprehensive understanding of biomolecular function often involves the determination of the free-energy landscapes and the identification of functionally relevant minimum-energy conformational paths on these landscapes. Specific techniques are necessary to determine continuous conformational spectra and identify functionally relevant conformational trajectories from a collection of raw single-particle snapshots from, e.g., cryogenic electron microscopy (cryo-EM) or X-ray diffraction. To assess the capability of different algorithms to recover conformational landscapes, we:
    • Measure, compare, and benchmark the performance of four leading data-analytical approaches to determine the accuracy with which energy landscapes are recovered from simulated cryo-EM data. Our simulated data are derived from projection directions along a great circle, emanating from a known energy landscape.
    • Demonstrate the ability to recover a biomolecule's energy landscapes and functional pathways from collections of cryo-EM snapshots.
    Structural biology applications in drug discovery and molecular medicine make the free-energy landscapes of biomolecules more important than ever. Recently, several data-driven machine learning algorithms have emerged to extract energy landscapes and functionally relevant continuous conformational pathways from single-particle data (Dashti et al., 2014; Dashti et al., 2020; Mashayekhi et al., 2022). In a benchmarking study, the performance of several advanced data-analytical algorithms was critically assessed (Dsouza et al., 2023). In this dissertation, we have benchmarked the performance of four leading algorithms in extracting energy landscapes and functional pathways from single-particle cryo-EM snapshots. In addition, we have significantly improved the performance of the ManifoldEM algorithm, which has demonstrated the highest performance. Our contributions can be summarized as follows:
    • Expert user supervision was required in one of the main steps of the ManifoldEM framework, wherein the algorithm needs to propagate conformational information through the entire angular space. We have introduced an automated approach that eliminates the need for user involvement.
    • The quality of the energy landscapes extracted by ManifoldEM from cryo-EM data has been improved, as demonstrated by the accuracy scores.
    These measures have substantially enhanced ManifoldEM's ability to recover the conformational motions of biomolecules by extracting the energy landscape from cryo-EM data. In line with the primary goal of our research, we aimed to extend the automated method across the entire angular sphere rather than a single great circle. During this endeavor, we encountered challenges, particularly with some projection directions not following the proposed model. Through methodological adjustments and sampling optimization, we improved the projection directions' conformity to the model. However, a small subset of projection directions (5%) remained challenging.
    We also recommended the use of specific methodologies, namely feature extraction and edge detection algorithms, to enhance the precision of quantifying image differentiation, a crucial component of our automated model. We further suggested that integrating different techniques might resolve the challenges associated with certain projection directions. We also applied ManifoldEM to experimental cryo-EM images of the SARS-CoV-2 spike protein in complex with the ACE2 receptor. By introducing several improvements, such as the incorporation of an adaptive mask and cosine curve fitting, we enhanced the framework's output quality. This enhancement can be quantified by observing the removal of the artifact from the energy landscape, in particular where the post-enhancement landscape differs from the artifact-affected one. These modifications, specifically aimed at addressing challenges arising from Nonlinear Laplacian Spectral Analysis (NLSA) (Giannakis et al., 2012), are intended for application in upcoming cryo-EM studies utilizing ManifoldEM. In the closing sections of this dissertation, a summary and a projection of future research directions are provided. While initial automated methods have been explored, there remains room for refinement. We have offered numerous methodological suggestions oriented toward solving the challenge of conformational-information propagation. Key methodologies discussed include Manifold Alignment, Canonical Correlation Analysis, and Multi-View Diffusion Maps. These recommendations are intended to inform and guide subsequent developments in the ManifoldEM suite.
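
    The Boltzmann relation invoked at the start of this abstract can be written out explicitly. In its common near-equilibrium form (notation assumed here rather than quoted from the dissertation), the occupation probability p_i of conformational state i fixes that state's free energy up to an additive constant:

        p_i \propto \exp\!\left(-\frac{\Delta G_i}{k_B T}\right)
        \qquad\Longrightarrow\qquad
        \Delta G_i = -k_B T \,\ln p_i + \text{const}

    Counting single-particle snapshots per bin of conformational space therefore yields the energy landscape up to the temperature factor k_B T and an overall offset, which is the quantity the benchmarked algorithms attempt to recover.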

    Nature of the learning algorithms for feedforward neural networks

    The neural network (NN) model, comprised of relatively simple computing elements operating in parallel, offers an attractive and versatile framework for exploring a variety of learning structures and processes for intelligent systems. Due to the amount of research developed in the area, many types of networks have been defined. The one of interest here is the multi-layer perceptron, as it is one of the simplest and is considered a powerful representation tool whose complete potential has not been adequately exploited and whose limitations have yet to be specified in a formal and coherent framework. This dissertation addresses the theory of generalisation performance and architecture selection for the multi-layer perceptron; a subsidiary aim is to compare and integrate this model with existing data analysis techniques and exploit its potential by combining it with certain constructs from computational geometry, creating a reliable, coherent network design process which conforms to the characteristics of a generative learning algorithm, i.e. one including mechanisms for manipulating the connections and/or units that comprise the architecture in addition to the procedure for updating the weights of the connections. This means that it is unnecessary to provide an initial network as input to the complete training process. After discussing in general terms the motivation for this study, the multi-layer perceptron model is introduced and reviewed, along with the relevant supervised training algorithm, i.e. backpropagation. More particularly, it is argued that a network developed employing this model can in general be trained and designed in a much better way by extracting more information about the domains of interest through the application of certain geometric constructs in a preprocessing stage, specifically by generating the Voronoi Diagram and Delaunay Triangulation [Okabe et al. 92] of the set of points comprising the training set; once a final architecture which performs appropriately on it has been obtained, Principal Component Analysis [Jolliffe 86] is applied to the outputs produced by the units in the network's hidden layer to eliminate the redundant dimensions of this space.
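
    Two of the ingredients above lend themselves to a short illustration: the Delaunay triangulation of the training set as a geometric preprocessing construct, and PCA applied to the hidden-layer activations of a trained multi-layer perceptron to expose redundant hidden dimensions. The toy data, the hidden-layer size, and the use of scikit-learn and SciPy are assumptions for illustration, not the procedure developed in the thesis.

        # Sketch: Delaunay triangulation of the training points, then PCA on
        # the hidden-layer activations of a small trained MLP.
        import numpy as np
        from scipy.spatial import Delaunay
        from sklearn.neural_network import MLPClassifier
        from sklearn.decomposition import PCA

        rng = np.random.default_rng(0)
        X = rng.normal(size=(300, 2))
        y = (X[:, 0] * X[:, 1] > 0).astype(int)   # simple XOR-like labels

        tri = Delaunay(X)                         # geometric structure of the data
        print(tri.simplices.shape)                # triangles covering the training set

        mlp = MLPClassifier(hidden_layer_sizes=(16,), activation='relu',
                            max_iter=2000, random_state=0).fit(X, y)

        # Hidden-layer activations, reconstructed from the learned weights.
        hidden = np.maximum(0, X @ mlp.coefs_[0] + mlp.intercepts_[0])

        pca = PCA().fit(hidden)
        # A few components carrying most of the variance indicate redundant
        # hidden dimensions that could be pruned.
        print(pca.explained_variance_ratio_.round(3))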