10 research outputs found

    Similarités et divergences, globales et locales, entre structures protéiques

    This thesis focuses on local and global similarities and divergences between protein structures. First, structures are compared and scored using similarity and distance criteria in order to provide a supervised classification. Structural domains are classified within existing hierarchical databases by means of dominance and learning, which makes it possible to assign new domains more quickly and exactly. Second, we focus on local similarities and propose a method that models protein comparison as a graph problem. Traversing these graphs extracts the similar substructures. The method relies on compatibility between structure elements and on distance criteria between elements. It can detect events such as circular permutations, hinges (flexibility) and structural motif repeats. Finally, we propose a new approach to fine-grained structure analysis that facilitates the search for divergent regions between highly similar 3D structures.

    Parallel seed-based approach to multiple protein structure similarities detection

    Finding similarities between protein structures is a crucial task in molecular biology. Most existing tools require proteins to be aligned in an order-preserving way and find only a single alignment even when multiple similar regions exist. We propose a new seed-based approach that discovers multiple pairs of similar regions. Its computational complexity is polynomial and it comes with a quality guarantee: the returned alignments have both Root Mean Squared Deviations (coordinate-based as well as internal-distance-based) lower than a given threshold, if such alignments exist. We do not require the alignments to be order-preserving (i.e., we consider non-sequential alignments), which makes our algorithm suitable for detecting similar domains when comparing multi-domain proteins as well as for detecting structural repetitions within a single protein. Because the search space for non-sequential alignments is much larger than for sequential ones, the computational burden is addressed by extensive use of parallel computing techniques: coarse-grained parallelism that makes use of the available CPU cores, and fine-grained parallelism that exploits bit-level concurrency as well as vector instructions.
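
As an illustration of the internal-distance-based RMSD mentioned in the abstract above, here is a minimal sketch. It assumes the alignment is already given as two equal-length lists of 3D coordinates in which position i of one list is matched to position i of the other; the function name and the input layout are hypothetical and are not taken from the paper. Because the alignment is passed in explicitly, the matched residues do not have to appear in sequence order.

```python
import math
from itertools import combinations


def internal_distance_rmsd(coords_a, coords_b):
    """RMSD between the internal distance matrices of two aligned point sets.

    coords_a[i] is assumed to be aligned with coords_b[i]. No superposition
    is needed because only intra-structure distances are compared.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("aligned coordinate lists must have the same length")
    n = len(coords_a)
    if n < 2:
        return 0.0  # no pair of residues, hence no internal distance to compare

    squared_sum = 0.0
    pair_count = 0
    for i, j in combinations(range(n), 2):
        dist_a = math.dist(coords_a[i], coords_a[j])
        dist_b = math.dist(coords_b[i], coords_b[j])
        squared_sum += (dist_a - dist_b) ** 2
        pair_count += 1
    return math.sqrt(squared_sum / pair_count)
```

An alignment would then be accepted only if this value (and, in the paper's setting, the coordinate-based RMSD as well) stays below the chosen threshold.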

    Using Dominances for Solving the Protein Family Identification Problem

    Published in Workshop on Algorithms for Bioinformatics (WABI 2011). International audience. Identification of protein families is a computational biology challenge that needs efficient and reliable methods. Here we introduce the concept of dominance between protein structure comparison instances and propose a novel combined approach based on the Distance Alignment Search Tool (DAST), an exact algorithm to which we add bounds. Our experiments show that this method successfully finds the most similar proteins in a set without solving all comparison instances.
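
The abstract above does not define dominance formally, so the following is only a sketch of one plausible reading: each comparison instance carries a lower and an upper bound on its similarity score, and an instance is dominated when its upper bound is below the best lower bound found so far. The function name, the bound representation and the example values are all assumptions made for illustration.

```python
def undominated_instances(bounds):
    """Return the names of instances that cannot be discarded using bounds alone.

    bounds maps an instance name to a (lower_bound, upper_bound) pair on its
    similarity score. An instance is treated as dominated (assumption) when
    its upper bound is strictly below the best lower bound of any instance.
    """
    if not bounds:
        return set()
    best_lower = max(lower for lower, _upper in bounds.values())
    return {name for name, (_lower, upper) in bounds.items() if upper >= best_lower}
```

Only the instances returned here would need to be solved to optimality; the others can never be the most similar candidate.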

    Similarities and divergencies, global and local, between protein structures

    This thesis focuses on local and global similarities and divergences between protein structures. First, structures are compared and scored using similarity and distance criteria in order to provide a supervised classification. Structural domains are classified within existing hierarchical databases by means of dominance and learning, which makes it possible to assign new domains more quickly and exactly. Second, we focus on local similarities and propose a method that models protein comparison as a graph problem. Traversing these graphs extracts the similar substructures. The method relies on compatibility between structure elements and on distance criteria between elements. It can detect events such as circular permutations, hinges (flexibility) and structural motif repeats. Finally, we propose a new approach to fine-grained structure analysis that facilitates the search for divergent regions between highly similar 3D structures.

    De novo detection of structure repeats in Proteins

    National audience. Almost 25% of proteins contain internal repeats, and these repeats may play a major role in protein function. Furthermore, some proteins consist of a single substructure repeated many times; these modular proteins are called solenoid proteins. Nevertheless, only a few repeat detection programs exist. We present here Kunoichi, a simple and efficient tool for discovering protein repeats. Kunoichi is based on protein fragment comparison and clique detection. First results show that Kunoichi can find different levels of repetition and successfully identify the "tiles" that make up multi-repeat proteins. Kunoichi is available on request from the authors.
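
The abstract above says only that Kunoichi relies on fragment comparison and clique detection; it does not describe the graph or the algorithm. As a generic illustration of the clique-detection step, here is a sketch of the classical Bron-Kerbosch enumeration of maximal cliques (without pivoting). The adjacency-dictionary input, in which an edge would link two mutually compatible fragment matches, is a hypothetical encoding and not Kunoichi's actual data structure.

```python
def maximal_cliques(adjacency):
    """Enumerate all maximal cliques of an undirected graph.

    adjacency maps each node to the set of its neighbours (symmetric,
    no self-loops). Returns a list of frozensets, one per maximal clique.
    """
    cliques = []

    def expand(current, candidates, excluded):
        # current: the clique being grown
        # candidates: nodes that can still extend it
        # excluded: nodes already explored, used to reject non-maximal cliques
        if not candidates and not excluded:
            cliques.append(frozenset(current))
            return
        for node in list(candidates):
            expand(current | {node},
                   candidates & adjacency[node],
                   excluded & adjacency[node])
            candidates = candidates - {node}
            excluded = excluded | {node}

    expand(set(), set(adjacency), set())
    return cliques
```

This plain version is exponential in the worst case; it is meant to show the idea on small graphs, not to reflect the tool's performance.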

    Modeling protein flexibility by distance geometry

    National audience. This long abstract discusses a strategy for modeling protein flexibility that is based on discretizing the space of possible molecular conformations of a protein. The same discretization process was previously employed for discretizing Molecular Distance Geometry Problems (MDGPs).

    Exact Protein Structure Classification Using the Maximum Contact Map Overlap Metric

    In this work we propose a new distance measure for comparing two protein structures based on their contact map representations. We show that our novel measure, which we refer to as the maximum contact map overlap (max-CMO) metric, satisfies all properties of a metric on the space of protein representations. Having a metric in that space makes it possible to avoid pairwise comparisons on the entire database and thus to significantly accelerate exploration of the protein space compared to non-metric spaces. We show on a small gold-standard superfamily classification benchmark set of 6,759 proteins that our exact scheme classifies up to 224 out of 236 queries correctly, and on a larger, extended version of the benchmark up to 1361 out of 1369 queries. Our k-NN classification thus provides a promising approach for the automatic classification of protein structures into SCOP or CATH based on flexible contact map overlap alignments.
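
To make the contact map representation concrete, here is a minimal sketch that builds a contact map from residue coordinates and counts the contacts shared under a given alignment. The 7.5 Å threshold, the function names and the dictionary encoding of the alignment are assumptions for illustration. The sketch evaluates one fixed alignment only; the max-CMO problem itself searches for the alignment that maximizes this count, and the metric defined in the paper is not reproduced here.

```python
import math
from itertools import combinations


def contact_map(coords, threshold=7.5):
    """Set of residue index pairs (i, j), i < j, closer than the threshold."""
    return {
        (i, j)
        for i, j in combinations(range(len(coords)), 2)
        if math.dist(coords[i], coords[j]) <= threshold
    }


def contact_overlap(contacts_a, contacts_b, alignment):
    """Count contacts of A that map onto contacts of B.

    alignment maps residue indices of A to residue indices of B;
    unaligned residues are simply absent from the dictionary.
    """
    shared = 0
    for i, j in contacts_a:
        if i in alignment and j in alignment:
            u, v = alignment[i], alignment[j]
            if (min(u, v), max(u, v)) in contacts_b:
                shared += 1
    return shared
```

Sorting the mapped pair with `min` and `max` keeps the lookup valid even when the alignment does not preserve residue order.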

    Identification rapide de familles protéiques par dominance

    Published at the twelfth congress of the Société Française de Recherche Opérationnelle et d'Aide à la Décision (ROADEF 2011). National audience. Structural comparison of proteins is a frequent and important operation in bioinformatics that provides valuable information for determining the possible functions of proteins. Unfortunately, the corresponding optimization problems are often NP-hard. Different analysis approaches exist: most are based on the superimposition of residue coordinates (like VAST) or on the comparison of internal distances. The objective is to quickly identify and classify similar structures. We used the comparison tool A_purva, which is based on Contact Map Overlap (CMO), to classify protein structures from the CATH database. The results show that A_purva correctly classified 92% of the structures, and that introducing the notion of dominance drastically reduces the computational time needed to classify the protein structures.