
    DeepSF: deep convolutional neural network for mapping protein sequences to folds

    Motivation: Protein fold recognition is an important problem in structural bioinformatics. Almost all traditional fold recognition methods use sequence (homology) comparison to indirectly predict the fold of a target protein from the fold of a template protein with known structure, which cannot explain the relationship between sequence and fold. Owing to methodological limitations, only a few methods have been developed to classify protein sequences into a small number of folds, and these are not generally useful in practice.

    Results: We develop a deep 1D-convolutional neural network (DeepSF) to directly classify any protein sequence into one of 1195 known folds, which is useful both for fold recognition and for studying the sequence-structure relationship. Unlike traditional sequence alignment (comparison) based methods, our method automatically extracts fold-related features from a protein sequence of any length and maps it to the fold space. We train and test our method on datasets curated from SCOP 1.75, yielding a classification accuracy of 80.4%. On the independent test dataset curated from SCOP 2.06, the classification accuracy is 77.0%. We compare our method with a top profile-profile alignment method, HHSearch, on hard template-based and template-free modeling targets of CASP9-12 in terms of fold recognition accuracy. The accuracy of our method is 14.5%-29.1% higher than HHSearch on template-free modeling targets and 4.5%-16.7% higher on hard template-based modeling targets for the top 1, 5, and 10 predicted folds. The hidden features extracted from sequence by our method are robust against sequence mutation, insertion, deletion and truncation, and can be used for other protein pattern recognition problems such as protein clustering, comparison and ranking.

    Comment: 28 pages, 13 figures
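
    The central idea, convolving over a variable-length sequence and collapsing the result into a fixed-size hidden feature before classification, can be sketched briefly. Below is a minimal illustration assuming PyTorch; the single convolution layer, the filter count and the one-hot input encoding are placeholders rather than DeepSF's actual architecture:

    ```python
    import torch
    import torch.nn as nn

    class FoldClassifierSketch(nn.Module):
        """Minimal sketch of a 1D-CNN fold classifier.

        Illustrative only: DeepSF's real network (depth, filter sizes,
        input features) differs from this placeholder.
        """
        def __init__(self, n_amino_acids=20, n_filters=64, n_folds=1195):
            super().__init__()
            # Slide learned filters along the sequence axis.
            self.conv = nn.Conv1d(n_amino_acids, n_filters,
                                  kernel_size=9, padding=4)
            self.classifier = nn.Linear(n_filters, n_folds)

        def forward(self, x):
            # x: (batch, 20, L), a one-hot encoded sequence of any length L.
            h = torch.relu(self.conv(x))
            # Global max pooling yields a length-invariant feature vector,
            # which is what lets one network accept sequences of any length.
            h = torch.amax(h, dim=2)      # (batch, n_filters)
            return self.classifier(h)     # logits over the fold classes

    # Usage: a single sequence of length 300.
    logits = FoldClassifierSketch()(torch.randn(1, 20, 300))
    ```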

    Two-Stage Metric Learning

    In this paper, we present a novel two-stage metric learning algorithm. We first map each learning instance to a probability distribution by computing its similarities to a set of fixed anchor points. Then, we define the distance in the input data space as the Fisher information distance on the associated statistical manifold. This induces in the input data space a new family of distance metrics with unique properties. Unlike kernelized metric learning, we do not require the similarity measure to be positive semi-definite. Moreover, the method can also be interpreted as a local metric learning algorithm with a well-defined distance approximation. We evaluate its performance on a number of datasets; it significantly outperforms other metric learning methods and SVM.

    Comment: Accepted for publication in ICML 2014
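
    For multinomial distributions, the Fisher information distance has a closed form: under the square-root embedding, distributions sit on the unit sphere, and the geodesic distance is 2·arccos of the Bhattacharyya coefficient. A minimal sketch of both stages follows, assuming softmax-normalised negative squared distances as the similarity to the anchor points (the paper's learned similarity measure is not reproduced here):

    ```python
    import numpy as np

    def to_distribution(x, anchors, scale=1.0):
        """Stage 1: map an instance to a probability distribution via
        softmax-normalised similarities to a set of fixed anchor points."""
        d2 = np.sum((anchors - x) ** 2, axis=1)
        w = np.exp(-scale * d2)
        return w / w.sum()

    def fisher_rao_distance(p, q):
        """Stage 2: Fisher information distance between two multinomials.
        Closed form: 2 * arccos(Bhattacharyya coefficient), the geodesic
        on the sphere under the square-root embedding."""
        bc = np.sum(np.sqrt(p * q))
        return 2.0 * np.arccos(np.clip(bc, 0.0, 1.0))

    # Toy usage: three random anchors in R^2, distance between two points.
    rng = np.random.default_rng(0)
    anchors = rng.normal(size=(3, 2))
    p = to_distribution(np.array([0.0, 0.0]), anchors)
    q = to_distribution(np.array([1.0, 1.0]), anchors)
    print(fisher_rao_distance(p, q))
    ```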

    k-Nearest Neighbour Classifiers: 2nd Edition (with Python examples)

    Perhaps the most straightforward classifier in the arsenal of machine learning techniques is the Nearest Neighbour Classifier: classification is achieved by identifying the nearest neighbours to a query example and using those neighbours to determine the class of the query. This approach is of particular importance because poor run-time performance is much less of a problem these days, given the computational power that is available. This paper presents an overview of techniques for Nearest Neighbour classification, focusing on: mechanisms for assessing similarity (distance), computational issues in identifying nearest neighbours, and mechanisms for reducing the dimension of the data. This paper is the second edition of a paper previously published as a technical report; sections on similarity measures for time series, retrieval speed-up and intrinsic dimensionality have been added. An Appendix is included providing access to Python code for the key methods.

    Comment: 22 pages, 15 figures; an updated edition of an older tutorial on kNN
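
    The paper's Appendix provides Python code for the key methods; independently of that code, the core mechanism can be sketched in a few lines. Euclidean distance and unweighted majority voting are simply the most basic choices among the similarity and voting mechanisms the paper surveys:

    ```python
    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, query, k=3):
        """Classify `query` by majority vote among its k nearest
        training examples under Euclidean distance."""
        dists = np.linalg.norm(X_train - query, axis=1)
        nearest = np.argsort(dists)[:k]      # indices of the k nearest
        votes = Counter(y_train[i] for i in nearest)
        return votes.most_common(1)[0][0]

    # Toy usage: two classes in 2-D.
    X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
    y = np.array(["a", "a", "b", "b"])
    print(knn_predict(X, y, np.array([0.2, 0.1]), k=3))  # -> "a"
    ```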

    Structurally constrained protein evolution: results from a lattice simulation

    We simulate the evolution of a protein-like sequence subject to point mutations, imposing conservation of the ground state, thermodynamic stability and fast folding. Our model is aimed at describing the neutral evolution of natural proteins. We use a cubic lattice model of the protein structure and test the neutrality conditions by extensive Monte Carlo simulations. We observe that sequence space is traversed by neutral networks, i.e. sets of sequences with the same fold connected by point mutations. Typical pairs of sequences on a neutral network are nearly as different as randomly chosen sequences. The fraction of neutral neighbors shows strong sequence-to-sequence variation, which influences the rate of neutral evolution. In this paper we study the thermodynamic stability of different protein sequences. We relate the high variability of the fraction of neutral mutations to the complex energy landscape within a neutral network, arguing that valleys in this landscape are associated with high values of the neutral mutation rate. We find that when a point mutation produces a sequence with a new ground state, the new sequence is likely to have low stability. We therefore tentatively conjecture that neutral networks of different structures are typically well separated in sequence space. This result indicates that significantly changing a protein structure through a biologically acceptable chain of point mutations is a rare, although possible, event.

    Comment: added reference; to appear in the European Physical Journal
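
    The simulation protocol described above, proposing point mutations and retaining only those that leave the neutrality conditions intact, amounts to a random walk on a neutral network. A minimal sketch follows, in which `is_neutral` is a hypothetical stand-in for the paper's Monte Carlo test of ground-state conservation, stability and folding speed:

    ```python
    import random

    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

    def neutral_walk(seq, is_neutral, n_steps=1000):
        """Random walk on a neutral network: propose point mutations and
        accept only those passing `is_neutral`, a hypothetical stand-in
        for the paper's ground-state / stability / folding-time checks."""
        seq = list(seq)
        accepted = 0
        for _ in range(n_steps):
            pos = random.randrange(len(seq))
            old = seq[pos]
            # Mutate to any residue other than the current one.
            seq[pos] = random.choice(AMINO_ACIDS.replace(old, ""))
            if is_neutral("".join(seq)):
                accepted += 1        # neutral: stay at the new sequence
            else:
                seq[pos] = old       # not neutral: step back
        # The acceptance fraction estimates the neutral mutation rate,
        # whose sequence-to-sequence variability the paper analyses.
        return "".join(seq), accepted / n_steps
    ```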

    Exploiting conceptual spaces for ontology integration

    The widespread use of ontologies raises the need to integrate distinct conceptualisations. Whereas the symbolic approach of established representation standards – based on first-order logic (FOL) and syllogistic reasoning – does not implicitly represent semantic similarities, ontology mapping addresses this problem by aiming to establish formal relations between sets of knowledge entities that represent the same or a similar meaning in distinct ontologies. However, manually or semi-automatically identifying similarity relationships is costly. Hence, we argue that representational facilities are required which enable similarities to be represented implicitly. Whereas Conceptual Spaces (CS) address similarity computation through the representation of concepts as vector spaces, CS provide neither an implicit representational mechanism nor a means to represent arbitrary relations between concepts or instances. To overcome these issues, we propose a hybrid knowledge representation approach which extends FOL-based ontologies with a conceptual grounding through a set of CS-based representations. Consequently, semantic similarity between instances – represented as members in CS – is indicated by means of distance metrics. Hence, automatic similarity detection across distinct ontologies is supported, facilitating ontology integration.
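
    The grounding idea, that instances are members (points) of a conceptual space and that their distance indicates semantic similarity, can be illustrated numerically. Below is a minimal sketch assuming weighted Euclidean distance over shared quality dimensions and an exponential decay from distance to similarity, a common choice in the CS literature; the paper's hybrid FOL/CS machinery is not shown:

    ```python
    import numpy as np

    def similarity(a, b, weights, c=1.0):
        """Similarity of two instances represented as points over shared
        quality dimensions: weighted Euclidean distance mapped to (0, 1]
        by exponential decay, so 1.0 means indistinguishable instances."""
        d = np.sqrt(np.sum(weights * (a - b) ** 2))
        return np.exp(-c * d)

    # Toy usage: colour-like instances over (hue, saturation, brightness),
    # possibly originating from two distinct ontologies.
    w = np.array([1.0, 0.5, 0.5])            # dimension salience weights
    apple_red = np.array([0.95, 0.8, 0.6])
    truck_red = np.array([0.97, 0.9, 0.5])
    grass_green = np.array([0.30, 0.7, 0.5])
    print(similarity(apple_red, truck_red, w))    # high: mapping candidate
    print(similarity(apple_red, grass_green, w))  # low: likely distinct
    ```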