    Efficient and accurate P-value computation for Position Weight Matrices

    Background: Position Weight Matrices (PWMs) are probabilistic representations of signals in sequences. They are widely used to model approximate patterns in DNA or protein sequences. Using PWMs requires knowing the statistical significance of a word according to its score. This is done by defining the P-value of a score, which is the probability that the background model achieves a score larger than or equal to the observed value. This gives rise to the following problem: given a P-value, find the corresponding score threshold. Existing methods rely on dynamic programming or probability generating functions. For many PWMs, they fail to give accurate results in a reasonable amount of time.

    Results: The contribution of this paper is twofold. First, we study the theoretical complexity of the problem and prove that it is NP-hard. Then, we describe a novel algorithm that solves the P-value problem efficiently. The main idea is to use a series of discretized score distributions that improves the final result step by step until a convergence criterion is met. Moreover, the algorithm can calculate the exact P-value without any error, even for matrices with non-integer coefficient values. The same approach is also used to devise an accurate algorithm for the reverse problem: finding the P-value for a given score. Both methods are implemented in a software tool called TFM-PVALUE, which is freely available.

    Conclusion: We have tested TFM-PVALUE on a large set of PWMs representing transcription factor binding sites. Experimental results show that it achieves better performance in terms of computational time and precision than existing tools.
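The P-value of a score, as defined in this abstract, can be illustrated with a plain dynamic program over discretized scores. The following is a minimal sketch and not the TFM-PVALUE algorithm: it assumes an i.i.d. background model and rounds scores once at a fixed granularity, whereas the abstract describes a series of successively refined discretizations. The function names and the list-of-dicts matrix layout are hypothetical choices made for this illustration.

```python
from collections import defaultdict


def score_distribution(pwm, background, granularity=0.01):
    """Return {discretized total score: probability} under an i.i.d. background.

    pwm is a list of columns, each a dict mapping a letter to its score.
    Scores are rounded to integer multiples of `granularity` (an assumption
    of this sketch; results are approximate unless scores are exact multiples).
    """
    dist = {0: 1.0}
    for column in pwm:
        nxt = defaultdict(float)
        for total, prob in dist.items():
            for letter, score in column.items():
                step = round(score / granularity)
                nxt[total + step] += prob * background[letter]
        dist = dict(nxt)
    return dist


def p_value(pwm, background, threshold, granularity=0.01):
    """Probability that a background word scores at least `threshold`."""
    dist = score_distribution(pwm, background, granularity)
    cutoff = round(threshold / granularity)
    return sum(p for s, p in dist.items() if s >= cutoff)


def score_threshold(pwm, background, p, granularity=0.01):
    """Lowest reachable score whose P-value does not exceed `p`, or None."""
    dist = score_distribution(pwm, background, granularity)
    tail, best = 0.0, None
    for s in sorted(dist, reverse=True):
        if tail + dist[s] > p:
            break
        tail += dist[s]
        best = s
    return None if best is None else best * granularity
```

With integer-valued matrices and `granularity=1.0` the rounding step is lossless, so the sketch returns the exact distribution. For real-valued matrices a coarse granularity trades accuracy for a smaller score table, which is the tension the abstract's refinement scheme is meant to resolve.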

    CLUSS: Clustering of protein sequences based on a new similarity measure

    Background: The rapid growth of available protein data makes clustering within protein families increasingly important. The challenge is to identify subfamilies of evolutionarily related sequences. This identification reveals phylogenetic relationships, which provide prior knowledge to help researchers understand biological phenomena. A good evolutionary model is essential to achieve a clustering that reflects biological reality, and an accurate estimate of protein sequence similarity is crucial to building such a model. Most existing algorithms estimate this similarity using techniques that are not necessarily biologically plausible, especially for hard-to-align sequences such as proteins with different domain structures, which cause many difficulties for alignment-dependent algorithms. In this paper, we propose a novel similarity measure based on matching amino acid subsequences. This measure, named SMS for Substitution Matching Similarity, is specifically designed for non-aligned protein sequences. It allows us to develop a new alignment-free algorithm, named CLUSS, for clustering protein families. To the best of our knowledge, this is the first alignment-free algorithm for clustering protein sequences. Unlike other clustering algorithms, CLUSS is effective on both alignable and non-alignable protein families. In the rest of the paper, we use the term "phylogenetic" in the sense of "relatedness of biological functions".

    Results: To show the effectiveness of CLUSS, we performed an extensive clustering on the COG database. To demonstrate its ability to deal with hard-to-align sequences, we tested it on the GH2 family. In addition, we carried out experimental comparisons of CLUSS with a variety of mainstream algorithms. These comparisons were made on hard-to-align and easy-to-align protein sequences. The results of these experiments show the superiority of CLUSS in yielding clusters of proteins with similar functional activity.

    Conclusion: We have developed an effective method and tool for clustering protein sequences to meet the needs of biologists in terms of phylogenetic analysis and prediction of biological functions. Compared to existing clustering methods, CLUSS more accurately highlights the functional characteristics of the clustered families. It provides biologists with a new and plausible instrument for the analysis of protein sequences, especially those that cause problems for alignment-dependent algorithms.

    Structural and Content Diversity of Mitochondrial Genome in Beet: A Comparative Genomic Analysis

    Despite their monophyletic origin, mitochondrial (mt) genomes of plants and animals have followed contrasting evolutionary paths over time. Animal mt genomes are generally small, compact, and exhibit high mutation rates, whereas plant mt genomes exhibit low mutation rates, little compactness, larger sizes, and highly rearranged structures. We present the (nearly) complete sequences of five new mt genomes in the Beta genus: four from Beta vulgaris and one from B. macrocarpa, a sister species belonging to the same Beta section. We pooled our results with two previously sequenced genomes of B. vulgaris and studied genome diversity at the species level with an emphasis on cytoplasmic male-sterilizing (CMS) genomes. We showed that, contrary to what was previously assumed, all three CMS genomes belong to a single sterile lineage. In addition, the CMSs seem to have undergone an acceleration of the rates of substitution and rearrangement. This study suggests that male sterility emergence might have been favored by faster rates of evolution, unless CMS itself caused faster evolution.