11 research outputs found

    A statistical thin-tail test of predicting regulatory regions in the Drosophila genome

    Background: Identifying transcription factor binding sites (TFBSs) and cis-regulatory modules (CRMs) is a crucial step in studying gene expression, but computationally distinguishing CRMs from non-coding non-regulatory regions (NCNRs) remains challenging because little is known about the specific interactions involved. Methods: The statistical properties of CRMs are explored by estimating the distribution of overrepresentation (Z-score) over similar-word sets. CRMs are observed to tend toward a thin-tailed Z-score distribution. A new statistical thin-tail test with two thinness coefficients is proposed to distinguish CRMs from NCNRs. Results: Compared with the existing fluffy-tail test, the first thinness coefficient is designed to reduce computational time, making the thin-tail test well suited to long sequences and large-database analysis in the post-genomic era; the second is designed to improve the separation accuracy between CRMs and NCNRs. These two thinness coefficients may serve as valuable filtering indexes to guide the experimental identification of CRMs. Conclusions: The novel thin-tail test provides an efficient and effective means of distinguishing CRMs from NCNRs based on the specific statistical properties of CRMs, and can guide future experiments aimed at finding new CRMs in the post-genomic era. Comment: arXiv admin note: substantial text overlap with arXiv:1402.533

    Alignment-free Genomic Analysis via a Big Data Spark Platform

    Motivation: Alignment-free distance and similarity functions (AF functions, for short) are a well-established alternative to pairwise and multiple sequence alignment for many genomic, metagenomic and epigenomic tasks. Because the applications are data-intensive, computing AF functions is a Big Data problem, and the recent literature indicates that developing fast and scalable algorithms for AF functions is a high-priority task. Somewhat surprisingly, despite the increasing popularity of Big Data technologies in computational biology, the development of a Big Data platform for those tasks has not been pursued, possibly due to its complexity. Results: We fill this important gap by introducing FADE, the first extensible, efficient and scalable Spark platform for alignment-free genomic analysis. It natively supports eighteen of the best-performing AF functions coming out of a recent hallmark benchmarking study. FADE's development and potential impact comprise novel aspects of interest, namely: (a) a considerable effort on distributed algorithms, the most tangible result being much faster execution of reference methods such as MASH and FSWM; (b) a software design that makes FADE user-friendly and easily extendable by Spark non-specialists; (c) its ability to support data- and compute-intensive tasks. In this regard, we provide a novel and much-needed analysis of how informative and robust AF functions are, in terms of the statistical significance of their output. Our findings naturally extend those of the highly regarded benchmarking study, since the functions that can really be used are reduced to a handful of the eighteen included in FADE.
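
To make the notion of an AF function concrete, here is a minimal single-machine sketch in Python of one classic example, the Euclidean distance between k-mer frequency profiles. It is an illustration only: the abstract does not say which eighteen functions FADE includes or how they are implemented on Spark, and the function names and the default `k=3` below are assumptions made for this example.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer (word of length k) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_frequencies(seq: str, k: int) -> dict:
    """Turn raw k-mer counts into relative frequencies."""
    counts = kmer_counts(seq, k)
    total = sum(counts.values())
    # A sequence shorter than k has no k-mers; return an empty profile.
    return {word: c / total for word, c in counts.items()} if total else {}


def euclidean_kmer_distance(a: str, b: str, k: int = 3) -> float:
    """Euclidean distance between the k-mer frequency profiles of a and b.

    No alignment is computed: the sequences are compared only through
    the words they contain (hypothetical helper, not FADE's API).
    """
    fa, fb = kmer_frequencies(a, k), kmer_frequencies(b, k)
    words = set(fa) | set(fb)
    return sqrt(sum((fa.get(w, 0.0) - fb.get(w, 0.0)) ** 2 for w in words))
```

Identical sequences give a distance of zero, and sequences sharing no k-mers give the largest values. The cost is linear in sequence length, which is part of why such functions are attractive at scale.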

    A basic analysis toolkit for biological sequences

    This paper presents a software library, nicknamed BATS, for some basic sequence analysis tasks: local alignments, via approximate string matching, and global alignments, via longest common subsequence and alignments with affine and concave gap cost functions. It also supports filtering operations to select strings from a set and establish their statistical significance via z-score computation. None of the algorithms is new, but although they are generally regarded as fundamental for sequence analysis, they have not previously been implemented in a single, consistent software package. Our main contribution is therefore to fill this gap between algorithmic theory and practice by providing an extensible and easy-to-use software library that includes algorithms for the mentioned string matching and alignment problems. The library consists of C/C++ library functions as well as Perl library functions. It can be interfaced with Bioperl and can also be used as a stand-alone system with a GUI. The software is available under the GNU GPL.
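
As an illustration of one of the tasks named above, the following Python sketch computes the length of a longest common subsequence with the textbook dynamic-programming recurrence. It is not taken from BATS, whose library is written in C/C++ and Perl and whose interface the abstract does not describe; the function name `lcs_length` is an assumption made for this example.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of a longest common subsequence of a and b.

    Textbook O(len(a) * len(b)) dynamic programme that keeps only the
    previous row, so memory use is O(len(b)).
    """
    prev = [0] * (len(b) + 1)  # row for the empty prefix of a
    for ch_a in a:
        curr = [0]  # column for the empty prefix of b
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Matching characters extend the best subsequence found
                # for the two shorter prefixes.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise drop one character from either string.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]
```

For example, `lcs_length("AGGTAB", "GXTXAYB")` evaluates to 4, corresponding to the common subsequence `GTAB`.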

    Conservation and implications of eukaryote transcriptional regulatory regions across multiple species

    Background: Increasing evidence shows that whole genomes of eukaryotes are almost entirely transcribed into both protein-coding genes and an enormous number of non-protein-coding RNAs (ncRNAs). Revealing the regulatory mechanisms underlying these transcripts therefore becomes imperative. For a complete understanding of transcriptional regulatory mechanisms, however, we need to identify the regions in which they are found. We call these transcriptional regulation regions, or TRRs: functional regions containing a cluster of regulatory elements that cooperatively recruit transcription factors for binding and then regulate the expression of transcripts. Results: We constructed a hierarchical stochastic language (HSL) model for the identification of core TRRs in yeast based on regulatory cooperation among TRR elements. The HSL model trained on yeast achieved comparable accuracy in predicting TRRs in other species, e.g., fruit fly, human, and rice, thus demonstrating the conservation of TRRs across species. The HSL model was also used to identify the TRRs of genes, such as p53 or OsALYL1, as well as microRNAs. In addition, the ENCODE regions were examined by HSL, and TRRs were found to be pervasive in the genomes. Conclusion: Our findings indicate that 1) the HSL model can be used to accurately predict core TRRs of transcripts across species and 2) core TRRs identified by HSL are proper candidates for the further scrutiny of specific regulatory elements and mechanisms. Meanwhile, the regulatory activity taking place in the abundant ncRNAs might account for the ubiquitous presence of TRRs across the genome. We also found that the TRRs of protein-coding genes and ncRNAs are similar in structure, with the latter being more conserved than the former.

    Exact distribution of a pattern in a set of random sequences generated by a Markov source: applications to biological data

    Background: In bioinformatics it is common to search for a pattern of interest in a potentially large set of rather short sequences (upstream gene regions, proteins, exons, etc.). Although many methodological approaches allow practitioners to compute the distribution of a pattern count in a random sequence generated by a Markov source, no specific developments have taken into account the counting of occurrences in a set of independent sequences. We aim to address this problem by deriving efficient approaches and algorithms to perform these computations for both low- and high-complexity patterns in the framework of homogeneous or heterogeneous Markov models. Results: The latest advances in the field allowed us to use a technique of optimal Markov chain embedding based on deterministic finite automata to introduce three innovative algorithms. Algorithm 1 is the only one able to deal with heterogeneous models. It also avoids any convolution product of the pattern distributions in individual sequences. When working with homogeneous models, Algorithm 2 yields a dramatic reduction in complexity by taking advantage of previous computations to obtain moment generating functions efficiently. In the particular case of low- or moderate-complexity patterns, Algorithm 3 exploits power computation and binary decomposition to further reduce the time complexity to a logarithmic scale. All these algorithms, and their relative merits in comparison with existing ones, were then tested and discussed on a toy example and three biological data sets: structural patterns in protein loop structures, PROSITE signatures in a bacterial proteome, and transcription factors in upstream gene regions. On these data sets, we also compared our exact approaches to the tempting approximation that consists of concatenating the sequences in the data set into a single sequence. Conclusions: Our algorithms prove to be effective and able to handle real data sets with multiple sequences, as well as biological patterns of interest, even when the latter display a high complexity (PROSITE signatures, for example). In addition, these exact algorithms allow us to avoid the edge effect observed under the single-sequence approximation, which leads to erroneous results, especially when the marginal distribution of the model converges slowly toward the stationary distribution. We end with a discussion of our method and its potential improvements.

    Mining protein loops using a structural alphabet and statistical exceptionality

    Background: Protein loops encompass 50% of protein residues in available three-dimensional structures. These regions are often involved in protein functions, e.g. binding sites and catalytic pockets. However, describing protein loops with conventional tools is difficult. Regular secondary structures, helices and strands, have been widely studied, whereas loops, because they are highly variable in sequence and structure, are difficult to analyze. Due to data sparsity, long loops have rarely been systematically studied. Results: We developed a simple and accurate method that allows the description and analysis of the structures of short and long loops using structural motifs, without restriction on loop length. This method is based on the structural alphabet HMM-SA. HMM-SA simplifies a three-dimensional protein structure into a one-dimensional string of states, where each state is a four-residue prototype fragment called a structural letter. The difficult task of structurally grouping huge data sets is thus easily accomplished by handling structural-letter strings as in conventional protein sequence analysis. We systematically extracted all seven-residue fragments in a bank of 93000 protein loops and grouped them according to their structural-letter sequence, named a structural word. This approach permits a systematic analysis of loops of all sizes, since we consider structural motifs of seven residues rather than complete loops. We focused the analysis on highly recurrent words of loops (observed more than 30 times). Our study reveals that 73% of loop lengths are covered by only 3310 highly recurrent structural words (out of 28274 observed words). These structural words have low structural variability (mean RMSd of 0.85 Å). As expected, half of these motifs display a flanking-region preference but, interestingly, two thirds are shared by short (fewer than 12 residues) and long loops. Moreover, half of the recurrent motifs exhibit a significant level of amino-acid conservation, with at least four significant positions, and 87% of long loops contain at least one such word. We complement our analysis with the detection of statistically over-represented patterns of structural letters, as in conventional DNA sequence analysis. About 30% (930) of structural words are over-represented, and they cover about 40% of loop lengths. Interestingly, these words exhibit lower structural variability and higher sequential specificity, suggesting structural or functional constraints. Conclusions: We developed a method to systematically decompose and study protein loops using recurrent structural motifs. This method is based on the structural alphabet HMM-SA and not on structural alignment and geometrical parameters. We extracted meaningful structural motifs that are found in both short and long loops. To our knowledge, this is the first time that pattern mining has helped to increase the signal-to-noise ratio in protein loops. This finding helps to better describe protein loops and may reduce the complexity of long-loop analysis. Detailed results are available at http://www.mti.univ-paris-diderot.fr/publication/supplementary/2009/ACCLoop/.

    Searching for repeated elements by analysis of oligonucleotide frequency distributions (Recherche d'éléments répétés par analyse des distributions de fréquences d'oligonucléotides)

    Thesis digitized by the Division de la gestion de documents et des archives of the Université de Montréal.

    Dissecting protein loops with a statistical scalpel suggests a functional implication of some structural motifs

    Background: One of the strategies for protein function annotation is to search for particular structural motifs that are known to be shared by proteins with a given function. Results: Here, we present a systematic extraction of structural motifs of seven residues from protein loops, and we explore their correspondence with functional sites. Our approach is based on the structural alphabet HMM-SA (Hidden Markov Model - Structural Alphabet), which simplifies protein structures into one-dimensional sequences, and on advanced pattern statistics adapted to short sequences. Structural motifs of interest are selected by looking for motifs in protein loops that are significantly over-represented in SCOP superfamilies. We discovered two types of structural motifs significantly over-represented in SCOP superfamilies: (i) ubiquitous motifs, shared by several superfamilies, and (ii) superfamily-specific motifs, over-represented in few superfamilies. A comparison of ubiquitous words with known small structural motifs shows that they contain well-described motifs such as turn, niche or nest motifs. A comparison between superfamily-specific motifs and biological annotations of Swiss-Prot reveals that some of them actually correspond to functional sites involved in the binding of small ligands, such as ATP/GTP, NAD(P) and SAH/SAM. Conclusions: Our findings show that statistical over-representation in SCOP superfamilies is linked to functional features. The detection of over-represented motifs within structures simplified by HMM-SA is therefore a promising approach for the prediction of functional sites and the annotation of uncharacterized proteins.

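    A new computational approach for discovering transcription factor DNA binding sites, adapted to ChIP-chip and ChIP-sequencing data (Une nouvelle approche computationnelle pour la découverte des sites de fixation de facteurs de transcription à l'ADN, adaptée aux données de ChIP-chip et de ChIP-séquençage)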

    Transcription factors are specialized proteins that play an important role in various biological processes such as differentiation, the cell cycle and tumorigenesis. They regulate gene transcription by binding to specific DNA sequences (cis-regulatory elements). Identifying these elements is a crucial step in understanding gene regulatory networks. With the advent of high-throughput sequencing technologies, the identification of all functional elements in genomes, including genes and cis-regulatory elements, has advanced considerably. While the number of genes in various species can now be estimated, information on the elements that control and orchestrate the regulation of these genes remains poorly defined. ChIP-chip and ChIP-sequencing techniques make it possible to identify all the regions of the genome bound by a transcription factor of interest. Several computational approaches have been developed to predict the sites bound by transcription factors. These approaches fall into two main categories: enumerative and probabilistic algorithms. However, several studies have shown that these approaches generate high rates of false negatives and false positives, which makes the results difficult to interpret and, consequently, to validate experimentally. This thesis pursued two objectives. The first was to develop a new approach for discovering transcription factor DNA binding sites (SAMD-ChIP), adapted to ChIP-chip and ChIP-sequencing data. Our approach implements a hybrid algorithm that combines the enumerative and probabilistic strategies in order to exploit the strengths of each. Our approach demonstrated its performance in comparison with existing motif discovery tools on simulated data sets and on ChIP-chip and ChIP-sequencing data sets. SAMD-ChIP also has the advantage of exploiting the distribution of transcription-factor-bound sites around the center of the bound regions, restricting prediction to motifs that are enriched in a fixed-length window around the center of these regions. Transcription factors rarely act alone. They often form complexes that interact with DNA to regulate their target genes. These interactions involve transcription factors whose DNA binding sites are located close to one another or else mediated by chromatin loops. Our second objective was to exploit the spatial proximity of transcription-factor-bound sites in ChIP-chip and ChIP-sequencing regions to develop an approach for predicting composite motifs (motifs composed of two sites separated by a fixed-size spacer). We tested this module by predicting the co-localization of the two ERE half-sites that form the ERE site, bound by the estrogen receptor ERα. This module was incorporated into our motif discovery tool, SAMD-ChIP.
Transcription factors (TFs) play important roles in various biological processes such as differentiation, cell cycle progression and tumorigenesis. They regulate gene expression by binding to specific DNA sequences (transcription factor binding sites, TFBS). Identifying these cis-regulatory elements is a crucial step in understanding gene regulatory networks. Technological developments have enhanced DNA sequencing at genomic scale.
On the basis of the resulting sequences, computational biologists now attempt to localize the most important functional regions, starting with genes but also, importantly, characterizing transcription factor binding sites genome-wide, which has led to the development of several computational DNA motif discovery tools. Although these tools are widely used and have been successful at discovering novel motifs, they are not adapted to ChIP-chip and ChIP-sequencing data. Their main drawback is that most of the predicted motifs are artifacts, owing to an inefficient assessment of their enrichment. This thesis is about transcription factor proteins and the statistical analysis of their binding sites in ChIP-chip and ChIP-sequencing data. The first objective was to develop a new de novo DNA motif discovery tool adapted to ChIP-chip and ChIP-sequencing data. SAMD-ChIP combines enumerative and stochastic strategies to predict enriched motifs in the vicinity of the ChIP peak summits. Our approach is an automated pipeline that includes motif discovery, motif clustering, motif optimization and, finally, motif identification using transcription factor (TF) databases. SAMD-ChIP outperforms state-of-the-art motif discovery tools in terms of the number of predicted motifs and the prediction of rare and degenerate motifs. In particular, SAMD-ChIP efficiently identifies gapped motifs, such as inverted or direct repeats bound by nuclear receptors, and composite motifs resulting from the association of different single TF binding sites. The underlying assumption of the second objective is that, in regulatory regions, binding sites of interacting transcription factors co-occur more often than expected by chance in the vicinity of the ChIP peak summits. We proposed an approach to predict the co-localization of transcription factor binding sites, based on the prediction of single motifs by de novo motif discovery tools or on TFBS models from TF databases.