26 research outputs found

    Contribution à l'analyse des séquences de protéines similarité, clustering et alignement

    Get PDF
    La prédiction des fonctions biologiques des protéines est primordiale en biologie cellulaire. On peut comprendre facilement tout l'enjeu de pouvoir différencier efficacement les protéines par leurs fonctions, quand on sait que ceci peut rendre possible la réparation des protéines anormales causants des maladies, ou du moins corriger ou améliorer leurs fonctions. Les méthodes expérimentales, basées sur la structure tridimensionnelle des protéines sont les plus fiables pour la prédiction des fonctions biologiques des protéines. Néanmoins, elles sont souvent coûteuses en temps et en ressources, et ne permettent pas de traiter de grands nombres de protéines. Il existe toutefois des algorithmes qui permettent aux biologistes d'arriver à de bons résultats de prédictions en utilisant des moyens beaucoup moins coûteux. Le plus souvent, ces algorithmes sont basés sur la similarité, le clustering, et l'alignement. Cependant, les algorithmes qui sont basés sur la similarité et le clustering utilisent souvent l'alignement des séquences et ne sont donc pas efficaces sur les protéines non alignables. Et lorsqu'ils ne sont pas basés sur l 'alignement, ces algorithmes utilisent souvent des approches qui ne tiennent pas compte de l'aspect biologique des séquences de protéines. D'autre part, l'efficacité des algorithmes d'alignements dépend souvent de la nature structurelle des protéines, ce qui rend difficile le choix de l'algorithme à utiliser quand la structure est inconnue. Par ailleurs, les algorithmes d'alignement ignorent les divergences entre les séquences à aligner, ce qui contraint souvent les biologistes à traiter manuellement les séquences à aligner, une tâche qui n'est pas toujours possible en pratique. Dans cette thèse nous présentons un ensemble de nouveaux algorithmes que nous avons conçus pour l'analyse des séquences de protéines. Dans le premier chapitre, nous présentons CLUSS, le premier algorithme de clustering capable de traiter des séquences de protéines non-alignables. Dans le deuxième chapitre, nous présentons CLUSS2 une version améliorée de CLUSS, capable de traiter de plus grands ensembles de protéines avec plus de de fonctions biologiques. Dans le troisième chapitre, nous présentons SCS, une nouvelle mesure de similarité capable de traiter efficacement non seulement les séquences de protéines mais aussi plusieurs types de séquences catégoriques. Dans le dernier chapitre, nous présentons ALIGNER, un algorithme d'alignement, efficace sur les séquences de protéines indépendamment de leurs types de structures. De plus, ALIGNER est capable de détecter automatiquement, parmi les protéines à aligner, les groupes de protéines dont l'alignement peut révéler d'importantes propriétés biochimiques structurelles et fonctionnelles, et cela sans faire appel à l'utilisateur

    CLUSS: Clustering of protein sequences based on a new similarity measure

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>The rapid burgeoning of available protein data makes the use of clustering within families of proteins increasingly important. The challenge is to identify subfamilies of evolutionarily related sequences. This identification reveals phylogenetic relationships, which provide prior knowledge to help researchers understand biological phenomena. A good evolutionary model is essential to achieve a clustering that reflects the biological reality, and an accurate estimate of protein sequence similarity is crucial to the building of such a model. Most existing algorithms estimate this similarity using techniques that are not necessarily biologically plausible, especially for hard-to-align sequences such as proteins with different domain structures, which cause many difficulties for the alignment-dependent algorithms. In this paper, we propose a novel similarity measure based on matching amino acid subsequences. This measure, named SMS for Substitution Matching Similarity, is especially designed for application to non-aligned protein sequences. It allows us to develop a new alignment-free algorithm, named CLUSS, for clustering protein families. To the best of our knowledge, this is the first alignment-free algorithm for clustering protein sequences. Unlike other clustering algorithms, CLUSS is effective on both alignable and non-alignable protein families. In the rest of the paper, we use the term "<it>phylogenetic</it>" in the sense of "<it>relatedness of biological functions</it>".</p> <p>Results</p> <p>To show the effectiveness of CLUSS, we performed an extensive clustering on COG database. To demonstrate its ability to deal with hard-to-align sequences, we tested it on the GH2 family. In addition, we carried out experimental comparisons of CLUSS with a variety of mainstream algorithms. These comparisons were made on hard-to-align and easy-to-align protein sequences. The results of these experiments show the superiority of CLUSS in yielding clusters of proteins with similar functional activity.</p> <p>Conclusion</p> <p>We have developed an effective method and tool for clustering protein sequences to meet the needs of biologists in terms of phylogenetic analysis and prediction of biological functions. Compared to existing clustering methods, CLUSS more accurately highlights the functional characteristics of the clustered families. It provides biologists with a new and plausible instrument for the analysis of protein sequences, especially those that cause problems for the alignment-dependent algorithms.</p

    Fast and accurate discovery of degenerate linear motifs in protein sequences.

    No full text
    Linear motifs mediate a wide variety of cellular functions, which makes their characterization in protein sequences crucial to understanding cellular systems. However, the short length and degenerate nature of linear motifs make their discovery a difficult problem. Here, we introduce MotifHound, an algorithm particularly suited for the discovery of small and degenerate linear motifs. MotifHound performs an exact and exhaustive enumeration of all motifs present in proteins of interest, including all of their degenerate forms, and scores the overrepresentation of each motif based on its occurrence in proteins of interest relative to a background (e.g., proteome) using the hypergeometric distribution. To assess MotifHound, we benchmarked it together with state-of-the-art algorithms. The benchmark consists of 11,880 sets of proteins from S. cerevisiae; in each set, we artificially spiked-in one motif varying in terms of three key parameters, (i) number of occurrences, (ii) length and (iii) the number of degenerate or "wildcard" positions. The benchmark enabled the evaluation of the impact of these three properties on the performance of the different algorithms. The results showed that MotifHound and SLiMFinder were the most accurate in detecting degenerate linear motifs. Interestingly, MotifHound was 15 to 20 times faster at comparable accuracy and performed best in the discovery of highly degenerate motifs. We complemented the benchmark by an analysis of proteins experimentally shown to bind the FUS1 SH3 domain from S. cerevisiae. Using the full-length protein partners as sole information, MotifHound recapitulated most experimentally determined motifs binding to the FUS1 SH3 domain. Moreover, these motifs exhibited properties typical of SH3 binding peptides, e.g., high intrinsic disorder and evolutionary conservation, despite the fact that none of these properties were used as prior information. MotifHound is available (http://michnick.bcm.umontreal.ca or http://tinyurl.com/motifhound) together with the benchmark that can be used as a reference to assess future developments in motif discovery

    A general measure of similarity for categorical sequences

    No full text
    National Natural Science Foundation of China; Natural Sciences and Engineering Research Council of CanadaMeasuring the similarity between categorical sequences is a fundamental process in many data mining applications. A key issue is extracting and making use of significant features hidden behind the chronological and structural dependencies found in these sequences. Almost all existing algorithms designed to perform this task are based on the matching of patterns in chronological order, but such sequences often have similar structural features in chronologically different order. In this paper we propose SCS, a novel, effective and domain-independent method for measuring the similarity between categorical sequences, based on an original pattern matching scheme that makes it possible to capture chronological and non-chronological dependencies. SCS captures significant patterns that represent the natural structure of sequences, and reduces the influence of those which are merely noise. It constitutes an effective approach to measuring the similarity between data in the form of categorical sequences, such as biological sequences, natural language texts, speech recognition data, certain types of network transactions, and retail transactions. To show its effectiveness, we have tested SCS extensively on a range of data sets from different application fields, and compared the results with those obtained by various mainstream algorithms. The results obtained show that SCS produces results that are often competitive with domain-specific similarity approaches

    CLUSS: Clustering of protein sequences based on a new similarity measure-4

    No full text
    <p><b>Copyright information:</b></p><p>Taken from "CLUSS: Clustering of protein sequences based on a new similarity measure"</p><p>http://www.biomedcentral.com/1471-2105/8/286</p><p>BMC Bioinformatics 2007;8():286-286.</p><p>Published online 4 Aug 2007</p><p>PMCID:PMC1976428.</p><p></p> (DDBJ: , DDBJ: ) "" and (GenBank: , DDBJ: ) "" activities. Subfamily SF_8 also includes closely related plant enzymes and bacterial enzymes produced by members of the genus Xanthomonas, including several plant pathogens

    CLUSS: Clustering of protein sequences based on a new similarity measure-0

    No full text
    <p><b>Copyright information:</b></p><p>Taken from "CLUSS: Clustering of protein sequences based on a new similarity measure"</p><p>http://www.biomedcentral.com/1471-2105/8/286</p><p>BMC Bioinformatics 2007;8():286-286.</p><p>Published online 4 Aug 2007</p><p>PMCID:PMC1976428.</p><p></p>he 1000 randomly generated subsets from the COG database. As shown, the obtained results are in good concordance with the functional reference characterization of COG. The average of the quality measure of the 1000 clusterings is equal to with a standard deviation equal to . More than of the 1000 clusterings obtained a quality measure superior to , and more than of the clusterings obtained a quality measure superior to . The minimum value of the quality measure is and the maximum value is

    CLUSS: Clustering of protein sequences based on a new similarity measure-1

    No full text
    <p><b>Copyright information:</b></p><p>Taken from "CLUSS: Clustering of protein sequences based on a new similarity measure"</p><p>http://www.biomedcentral.com/1471-2105/8/286</p><p>BMC Bioinformatics 2007;8():286-286.</p><p>Published online 4 Aug 2007</p><p>PMCID:PMC1976428.</p><p></p>clustering results obtained on six randomly generated subsets: SS1, red; SS2, blue; SS3, green; SS4, yellow; SS5, gray; SS6, amber

    Exhaustive search of linear information encoding protein-peptide recognition

    No full text
    <div><p>High-throughput <i>in vitro</i> methods have been extensively applied to identify linear information that encodes peptide recognition. However, these methods are limited in number of peptides, sequence variation, and length of peptides that can be explored, and often produce solutions that are not found in the cell. Despite the large number of methods developed to attempt addressing these issues, the exhaustive search of linear information encoding protein-peptide recognition has been so far physically unfeasible. Here, we describe a strategy, called DALEL, for the exhaustive search of linear sequence information encoded in proteins that bind to a common partner. We applied DALEL to explore binding specificity of SH3 domains in the budding yeast <i>Saccharomyces cerevisiae</i>. Using only the polypeptide sequences of SH3 domain binding proteins, we succeeded in identifying the majority of known SH3 binding sites previously discovered either <i>in vitro</i> or <i>in vivo</i>. Moreover, we discovered a number of sites with both non-canonical sequences and distinct properties that may serve ancillary roles in peptide recognition. We compared DALEL to a variety of state-of-the-art algorithms in the blind identification of known binding sites of the human Grb2 SH3 domain. We also benchmarked DALEL on curated biological motifs derived from the ELM database to evaluate the effect of increasing/decreasing the enrichment of the motifs. Our strategy can be applied in conjunction with experimental data of proteins interacting with a common partner to identify binding sites among them. Yet, our strategy can also be applied to any group of proteins of interest to identify enriched linear motifs or to exhaustively explore the space of linear information encoded in a polypeptide sequence. Finally, we have developed a webserver located at <a href="http://michnick.bcm.umontreal.ca/dalel" target="_blank">http://michnick.bcm.umontreal.ca/dalel</a>, offering user-friendly interface and providing different scenarios utilizing DALEL.</p></div

    CLUSS: Clustering of protein sequences based on a new similarity measure-5

    No full text
    <p><b>Copyright information:</b></p><p>Taken from "CLUSS: Clustering of protein sequences based on a new similarity measure"</p><p>http://www.biomedcentral.com/1471-2105/8/286</p><p>BMC Bioinformatics 2007;8():286-286.</p><p>Published online 4 Aug 2007</p><p>PMCID:PMC1976428.</p><p></p>enBank: "", GenBank: "", GenBank: "" and "", which were recently analyzed by Côté . [41] and Fukamizo . [44] and characterized by their ability to recognize a substrate not yet associated with GH2 members
    corecore