13 research outputs found

    Detecting seeded motifs in DNA sequences

    The problem of detecting functionally relevant DNA motifs in real biological sequences is difficult because of a number of biological, statistical and computational issues, and because little is known about the structure of the patterns being searched for. Many algorithms are implemented as fully automated processes, which often depend on the user guessing input parameters at the very first step. In this paper, we present a novel method for the detection of seeded DNA motifs, composed of regions with different degrees of variability. The method is based on a multi-step approach, which was implemented in a motif searching web tool (MOST). Overrepresented exact patterns are extracted from input sequences and clustered to produce motif core regions, which are then extended and scored to generate seeded motifs. The combination of automated pattern discovery algorithms and different display tools for the evaluation and selection of results at several analysis steps can potentially lead to much more meaningful results than complete automation can produce. Experimental results on different real yeast and human datasets showed the methodology to be a promising solution for finding seeded motifs. The MOST web tool is freely available at

    Space-efficient detection of unusual words

    Detecting all the strings that occur in a text more frequently or less frequently than expected according to an IID or a Markov model is a basic problem in string mining, yet current algorithms are based on data structures that are either space-inefficient or incur large slowdowns, and current implementations cannot scale to genomes or metagenomes in practice. In this paper we engineer an algorithm based on the suffix tree of a string to use just a small data structure built on the Burrows-Wheeler transform, and a stack of $O(\sigma^2\log^2 n)$ bits, where $n$ is the length of the string and $\sigma$ is the size of the alphabet. The size of the stack is $o(n)$ except for very large values of $\sigma$. We further improve the algorithm by removing its time dependency on $\sigma$, by reporting only a subset of the maximal repeats and of the minimal rare words of the string, and by detecting and scoring candidate under-represented strings that \textit{do not occur} in the string. Our algorithms are practical and work directly on the BWT, thus they can be immediately applied to a number of existing datasets that are available in this form, returning this string mining problem to a manageable scale. Comment: arXiv admin note: text overlap with arXiv:1502.0637
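
To make the notion of an "unusual word" concrete, here is a minimal sketch of scoring a single word against an IID model estimated from the text itself. It is a naive scan and not the BWT-based algorithm the abstract describes; the function name `iid_zscore` and the binomial variance approximation (which ignores self-overlaps of the word) are assumptions made for illustration.

```python
import math
from collections import Counter


def iid_zscore(text, word):
    """z-score of the observed count of `word` under an IID model fitted to `text`.

    Assumption: variance is approximated as binomial, ignoring word self-overlap.
    """
    n, m = len(text), len(word)
    positions = n - m + 1
    if positions <= 0:
        return 0.0  # the word is longer than the text
    freqs = Counter(text)
    # Under IID, the word probability is the product of its symbol frequencies.
    p = math.prod(freqs[c] / n for c in word)
    expected = positions * p
    variance = positions * p * (1 - p)
    if variance == 0:
        return 0.0  # degenerate model; no meaningful score
    # Count occurrences, including overlapping ones.
    observed = sum(1 for i in range(positions) if text.startswith(word, i))
    return (observed - expected) / math.sqrt(variance)
```

A positive score suggests over-representation and a negative score under-representation; a word that never occurs can still receive a strongly negative score, which is the case of strings that do not occur in the text.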

    OligoSpawn: a software tool for the design of overgo probes from large unigene datasets

    BACKGROUND: Expressed sequence tag (EST) datasets represent perhaps the largest collection of genetic information. ESTs can be exploited in a variety of biological experiments and analyses. Here we are interested in the design of overlapping oligonucleotide (overgo) probes from large unigene (EST-contig) datasets. RESULTS: OLIGOSPAWN is a suite of software tools that offers two complementary services, namely (1) the selection of "unique" oligos, each of which appears in one unigene but does not occur (exactly or approximately) in any other, and (2) the selection of "popular" oligos, each of which occurs (exactly or approximately) in as many unigenes as possible. In this paper, we describe the functionalities of OLIGOSPAWN and the computational methods it employs, and we report on experimental results for the overgo probes designed with it. CONCLUSION: The algorithms we designed are highly efficient and capable of processing unigene datasets of sizes on the order of several tens of Mb in a few hours on a regular PC. The software has been used to design overgo probes employed to screen a barley (Hordeum vulgare) BAC library. OLIGOSPAWN is freely available at
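
The two selection criteria can be illustrated with a small sketch restricted to exact matches. This is not OLIGOSPAWN's algorithm, which also handles approximate occurrences; the function name `classify_oligos` and the fixed oligo length `k` are hypothetical.

```python
from collections import defaultdict


def classify_oligos(unigenes, k):
    """Return (unique, popular) k-mers for a dict mapping unigene name -> sequence.

    Exact matching only; approximate occurrences are not considered here.
    """
    owners = defaultdict(set)
    for name, seq in unigenes.items():
        for i in range(len(seq) - k + 1):
            owners[seq[i:i + k]].add(name)
    # "Unique" oligos occur in exactly one unigene.
    unique = {kmer: next(iter(names))
              for kmer, names in owners.items() if len(names) == 1}
    # "Popular" oligos ranked by how many unigenes contain them (ties by k-mer).
    popular = sorted(owners, key=lambda kmer: (-len(owners[kmer]), kmer))
    return unique, popular
```

For example, with `{"u1": "ACGTAC", "u2": "CGTTTT", "u3": "GGCGTA"}` and `k=3`, the oligo `CGT` occurs in all three unigenes and ranks first among the popular ones, while `ACG` is unique to `u1`.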

    IP6K gene identification in plant genomes by tag searching

    BACKGROUND: Plants have played a special role in inositol polyphosphate (IP) research since the first IP, the fully phosphorylated inositol ring of phytic acid (IP6), was discovered in plant seeds. It is now known that phytic acid is further metabolized by the IP6 kinases (IP6Ks) to generate IPs containing a pyrophosphate moiety. The IP6Ks are evolutionarily conserved enzymes identified in several mammalian, fungal and amoeba species. Although IP6K has not yet been identified in plant chromosomes, there are many clues suggesting its presence in plant cells. RESULTS: In this paper we propose a new approach to search for the plant IP6K gene, which led to the identification in a plant genome of a nucleotide sequence corresponding to a specific tag of the IP6K family. Such a tag has been found in all IP6K genes identified up to now, as well as in all genes belonging to the Inositol Polyphosphate Kinases superfamily (IPK). The tag sequence corresponds to the inositol-binding site of the enzyme, and it can be considered as characterizing all IPK genes. To this end we applied a technique based on motif discovery. We exploited DLSME, a recently proposed software tool, which allows the motif structure to be only partially specified by the user. First we applied the new method to the mitochondrial DNA (mtDNA) of plants, where such a gene could have been nested, possibly encrypted and hidden by virtue of the editing and/or trans-splicing processes. Then we looked for the gene in the nuclear genomes of two model plants, Arabidopsis thaliana and Oryza sativa. CONCLUSIONS: The analysis we conducted in plant mitochondria provided the negative, though we argue relevant, result that IP6K does not actually occur in plant mtDNA. Very interestingly, the tag search in nuclear genomes led us to identify a promising sequence in chromosome 5 of Oryza sativa. Further analyses are in progress to confirm that this sequence actually corresponds to the mammalian IP6K gene

    Pattern Discovery from Biosequences

    In this thesis we have developed novel methods for analyzing biological data, the primary sequences of the DNA and proteins, the microarray based gene expression data, and other functional genomics data. The main contribution is the development of the pattern discovery algorithm SPEXS, accompanied by several practical applications for analyzing real biological problems. For performing these biological studies that integrate different types of biological data we have developed a comprehensive web-based biological data analysis environment Expression Profiler (http://ep.ebi.ac.uk/)

    Discovery of Flexible Gap Patterns from Sequences

    The human genome contains abundant motifs bound by particular biomolecules. These motifs are involved in the complex regulatory mechanisms of gene expression. The dominant mechanism behind the intriguing gene expression patterns is known as combinatorial regulation, achieved by multiple cooperating biomolecules binding in a nearby genomic region to provide a specific regulatory behavior. To decipher the complicated combinatorial regulation mechanism at work in cellular processes, there is a pressing need to identify co-binding motifs for these cooperating biomolecules in genomic sequences. The great flexibility of the interaction distance between nearby cooperating biomolecules leads to the presence of flexible gaps between the component motifs of a co-binding motif. Many existing motif discovery methods cannot handle co-binding motifs with flexible gaps. Existing co-binding motif discovery methods are ineffective in dealing with the following problems: (1) co-binding motifs may not appear in a large fraction of the input sequences, (2) the lengths of component motifs are unknown and (3) the maximum range of the flexible gap can be large. As a result, the probabilistic approach is easily trapped in a locally optimal solution. Although the deterministic approach may resolve these problems by allowing a relaxed motif template, it encounters the challenges of exploring an enormous pattern space and handling a huge output. This thesis presents an effective and scalable method called DFGP, which stands for “Discovery of Flexible Gap Patterns”, for identifying co-binding motifs in massive datasets. DFGP follows the deterministic approach and uses flexible gap patterns to model co-binding motifs. A flexible gap pattern is composed of a number of boxes with a flexible gap between consecutive boxes, where each box is a consensus pattern representing a component motif.
To address the computational challenge and the need to effectively process the large output under a relaxed motif template, DFGP incorporates two redundancy reduction methods as well as an effective statistical significance measure for ranking patterns. The first reduction method is achieved by the proposed concept of representative patterns, which aims at reducing the large set of consensus patterns used as boxes in existing deterministic methods to a much smaller yet informative set. The second method is attained by the proposed concept of delegate occurrences, which aims at reducing the redundancy among occurrences of a flexible gap pattern. Extensive experimental results showed that (1) DFGP significantly outperforms existing co-binding discovery methods in terms of both the capability of identifying co-binding motifs and the runtime, (2) co-binding motifs found by DFGP in datasets reveal previously unknown biological insights, (3) the two redundancy reduction methods via the proposed concepts of representative patterns and delegate occurrences are indeed effective in significantly reducing the computational burden without sacrificing output quality, (4) the proposed statistical significance measures are robust and useful in ranking patterns and (5) DFGP allows a large maximum distance for the flexible gap between component motifs and is scalable to massive datasets
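
The flexible gap pattern described above (boxes separated by gaps of variable length) can be sketched as a regular expression. This only illustrates the pattern model, not DFGP's discovery procedure; the helper names, the use of exact strings as boxes, and the `(min, max)` gap encoding are assumptions.

```python
import re


def flexible_gap_regex(boxes, gaps):
    """Compile boxes separated by gaps of min..max arbitrary bases into a regex.

    Assumption: boxes are exact strings and gaps are (min, max) length pairs.
    """
    if len(gaps) != len(boxes) - 1:
        raise ValueError("need one gap between each pair of consecutive boxes")
    parts = [re.escape(boxes[0])]
    for (lo, hi), box in zip(gaps, boxes[1:]):
        parts.append("[ACGT]{%d,%d}" % (lo, hi))
        parts.append(re.escape(box))
    # A lookahead lets occurrences that overlap each other be reported.
    return re.compile("(?=(" + "".join(parts) + "))")


def find_occurrences(seq, boxes, gaps):
    """Return (start, matched_text) for each start position with a match."""
    pattern = flexible_gap_regex(boxes, gaps)
    return [(m.start(), m.group(1)) for m in pattern.finditer(seq)]
```

Each start position yields at most one match (the one the regex engine finds first), so this sketch under-reports when several gap lengths fit at the same start; handling that redundancy among occurrences is the kind of problem the delegate-occurrence concept addresses.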

    A new computational approach for the discovery of transcription factor DNA binding sites, adapted to ChIP-chip and ChIP-sequencing data

    Transcription factors are specialized proteins that play an important role in various biological processes such as differentiation, the cell cycle and tumorigenesis. They regulate gene transcription by binding to specific DNA sequences (cis-regulatory elements). Identifying these elements is a crucial step in understanding gene regulatory networks. With the advent of high-throughput sequencing technologies, the identification of all functional elements in genomes, including genes and cis-regulatory elements, has advanced considerably. While the number of genes in different species can now be estimated, information on the elements that control and orchestrate the regulation of these genes is still poorly defined. Thanks to the ChIP-chip and ChIP-sequencing techniques, it is possible to identify all the regions of the genome that are bound by a transcription factor of interest. Several computational approaches have been developed to predict the sites bound by transcription factors. These approaches fall into two main categories: enumerative and probabilistic algorithms. However, several studies have shown that these approaches produce high rates of false negatives and false positives, which makes the results difficult to interpret and, consequently, to validate experimentally. In this thesis, we pursued two objectives. The first objective was to develop a new approach for the discovery of transcription factor DNA binding sites (SAMD-ChIP) adapted to ChIP-chip and ChIP-sequencing data. Our approach implements a hybrid algorithm that combines the enumerative and probabilistic strategies in order to exploit the strengths of each.
Our approach performed well compared with existing motif discovery tools on simulated datasets and on ChIP-chip and ChIP-sequencing datasets. SAMD-ChIP also has the advantage of exploiting the distribution of the sites bound by transcription factors around the center of the bound regions, so that prediction is limited to motifs that are enriched in a fixed-length window around the center of these regions. Transcription factors rarely act alone. They often form complexes to interact with DNA and regulate their target genes. These interactions involve transcription factors whose DNA binding sites are located close to one another, or are mediated by chromatin loops. Our second objective was to exploit the spatial proximity of the sites bound by transcription factors in ChIP-chip and ChIP-sequencing regions to develop an approach for the prediction of composite motifs (motifs composed of two sites separated by a fixed-size spacer). We tested this module by predicting the co-localization of the two ERE half-sites that form the ERE site, bound by the estrogen receptor ERα. This module has been incorporated into our motif discovery tool SAMD-ChIP.
Transcription factors (TF) play important roles in various biological processes such as differentiation, cell cycle progression and tumorigenesis. They regulate gene expression by binding to specific DNA sequences (TFBS). Identifying these cis-regulatory elements is a crucial step in understanding gene regulatory networks. Technological developments have enhanced DNA sequencing at genomic scale.
On the basis of the resulting sequences, computational biologists now attempt to localize the most important functional regions, starting with genes but also, importantly, characterizing transcription factor binding sites across the whole genome, which has led to the development of several computational DNA motif discovery tools. Although these various tools are widely used and have been successful at discovering novel motifs, they are not adapted to ChIP-chip and ChIP-sequencing data. The main drawback of these approaches is that most of the predicted motifs represent artifacts due to an inefficient assessment of their enrichment. This thesis is about transcription factor proteins and statistical analysis of their binding sites in ChIP-chip and ChIP-sequencing data. The first objective was to develop a new de novo DNA motif discovery tool adapted to ChIP-chip and ChIP-sequencing data. SAMD-ChIP combines enumerative and stochastic strategies to predict enriched motifs in the vicinity of the ChIP peak summits. Our approach is an automated pipeline that includes motif discovery, motif clustering, motif optimization and finally motif identification using transcription factor (TF) databases. SAMD-ChIP outperforms state-of-the-art motif discovery tools in terms of the number of predicted motifs and the prediction of rare and degenerate motifs. In particular, SAMD-ChIP efficiently identifies gapped motifs such as inverted or direct repeats bound by nuclear receptors and composite motifs resulting from the association of different single TF binding sites. The underlying assumption of the second objective is that in regulatory regions, binding sites of interacting transcription factors co-occur more often than expected by chance in the vicinity of the ChIP-peak summits. We proposed an approach to predict the co-localization of transcription factor binding sites based on the prediction of single motifs by de novo motif discovery tools or by using TFBS models from TF databases