
    DNA Motif Match Statistics Without Poisson Approximation

    Transcription factors (TFs) play a crucial role in gene regulation by binding to specific regulatory sequences. The sequence motifs recognized by a TF can be described in terms of position frequency matrices. Searching for motif matches with a given position frequency matrix is achieved by applying a predefined score cutoff and then counting the number of matches above this cutoff. In this article, we approximate the distribution of the number of motif matches with a novel dynamic programming approach, which accounts for higher-order sequence background (e.g., as is characteristic of CpG islands) and overlapping motif matches on both DNA strands. A comparison with our previously published compound Poisson approximation and a binomial approximation demonstrates that the dynamic programming approach yields more accurate results, in particular for relaxed score thresholds.
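
The cutoff-and-count step described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the log-odds scoring, the uniform background, the pseudocount, and the function names (`log_odds_matrix`, `reverse_complement`, `count_matches`) are all assumptions made for the example.

```python
import math

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def log_odds_matrix(pfm, background=None, pseudocount=1.0):
    """Turn a position frequency matrix (list of {base: count}) into log-odds scores.

    Assumption: a uniform background and an additive pseudocount; the article
    itself uses a higher-order background model.
    """
    background = background or {b: 0.25 for b in BASES}
    pwm = []
    for column in pfm:
        total = sum(column[b] + pseudocount for b in BASES)
        pwm.append({
            b: math.log((column[b] + pseudocount) / total / background[b])
            for b in BASES
        })
    return pwm


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over ACGT."""
    return seq.translate(COMPLEMENT)[::-1]


def count_matches(seq, pwm, cutoff):
    """Count windows on both strands whose score reaches the cutoff."""
    width = len(pwm)
    hits = 0
    for strand in (seq, reverse_complement(seq)):
        for start in range(len(strand) - width + 1):
            score = sum(pwm[i][strand[start + i]] for i in range(width))
            if score >= cutoff:
                hits += 1
    return hits


# Hypothetical toy motif with consensus "ACG".
toy_pfm = [
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 10, "G": 0, "T": 0},
    {"A": 0, "C": 0, "G": 10, "T": 0},
]
toy_pwm = log_odds_matrix(toy_pfm)
strict_hits = count_matches("ACGTTCGT", toy_pwm, cutoff=3.0)
relaxed_hits = count_matches("ACGTTCGT", toy_pwm, cutoff=1.0)
```

With the strict cutoff only exact copies of the consensus count as matches; lowering the cutoff also admits one-mismatch windows, which is the relaxed-threshold regime in which the abstract reports the largest accuracy gains.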

    Statistical detection of cooperative transcription factors with similarity adjustment

    Motivation: Statistical assessment of cis-regulatory modules (CRMs) is a crucial task in computational biology. Usually, one concludes from exceptional co-occurrences of DNA motifs that the corresponding transcription factors (TFs) are cooperative. However, similar DNA motifs tend to co-occur in random sequences because of the high probability of overlapping occurrences. Therefore, it is important to consider the similarity of DNA motifs in the statistical assessment.

    Algorithms and statistical methods for exact motif discovery

    The motif discovery problem consists of uncovering exceptional patterns (called motifs) in sets of sequences. It arises in molecular biology when searching for yet unknown functional sites in DNA sequences. In this thesis, we develop a motif discovery algorithm that (1) is exact, meaning that it returns a motif with optimal score, (2) can use the statistical significance with respect to complex background models as a scoring function, (3) takes into account the effects of self-overlaps of motif instances, and (4) is efficient enough to be useful in large-scale applications. To this end, several algorithms and statistical methods are developed. First, the concepts of deterministic arithmetic automata (DAAs) and probabilistic arithmetic automata (PAAs) are introduced. We prove that they allow calculating the distributions of values resulting from deterministic computations on random texts generated by arbitrary finite-memory text models. This technique is applied three times: first, to compute the distribution of the number of occurrences of a pattern in a random string; second, to compute the distribution of the number of character accesses made by window-based pattern matching algorithms; and third, to compute the distribution of clump sizes, where a clump is a maximal set of overlapping motif occurrences. All of these applications are interesting theoretical topics in themselves and, in all three cases, our results go beyond those known previously. In order to compute the distribution of the number of occurrences of a motif in a random text, a deterministic finite automaton (DFA) accepting the motif's instances is needed to subsequently construct a PAA. We therefore address the problem of efficiently constructing minimal DFAs for motif types common in computational biology. We introduce simple non-deterministic finite automata (NFAs) and prove that these NFAs are transformed into minimal DFAs by the classical subset construction. We show that they can be built from (sets of) generalized strings and from consensus strings with a Hamming neighborhood, allowing the direct construction of minimal DFAs for these pattern types. As a contribution to the field of motif statistics, we derive a formula for the expected clump size of motifs. It is remarkably simple and does not involve laborious operations like matrix inversions. This formula plays an important role in developing bounds for the expected clump size of partially known motifs. Such bounds are needed to obtain bounds for the p-value of a partially known motif. Using these, we are finally able to devise a branch-and-bound algorithm for motif discovery that extracts provably optimal motifs with respect to their p-values in compound Poisson approximation. Markovian text models of arbitrary order can be used as a background model (or null model). The algorithm is further generalized to jointly handle a motif and its reverse complement. An open-source implementation is publicly available as part of the MoSDi software package. An experimental evaluation using synthetic and real data sets follows. On the carefully crafted benchmark set of Sandve et al. (2007), the proposed algorithm outperforms Weeder (Pavesi et al., 2004) and MEME (Bailey and Elkan, 1994) in terms of the commonly used average nucleotide-level correlation coefficient. With respect to this measure, it is also superior to other algorithms tested by Fauteux et al. (2008) on the same benchmark suite, namely Seeder (Fauteux et al., 2008), BioProspector (Liu et al., 2001), GibbsSampler (Lawrence et al., 1993), and MotifSampler (Thijs et al., 2001). Besides the comparison to other algorithms, we perform motif discovery on the non-coding regions of Mycobacterium tuberculosis and on CpG-rich regions in the human genome. In both cases, we report motifs that are strikingly over-represented. While the function of most of these motifs remains unknown to us, some motifs found in M. tuberculosis can be attributed to a known biological function.
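
The idea of pairing a pattern-matching automaton with a running count to obtain an exact occurrence-count distribution can be illustrated with a small sketch. It is only loosely in the spirit of the DFA-to-PAA construction named in the abstract and is not the MoSDi implementation: it assumes a single literal pattern, an i.i.d. text model, and hypothetical function names (`kmp_automaton`, `occurrence_distribution`).

```python
from collections import defaultdict


def kmp_automaton(pattern, alphabet):
    """Build the DFA that tracks the longest pattern prefix matched so far.

    State s means "the last s characters read equal pattern[:s]".
    """
    m = len(pattern)
    # fail[s] = length of the longest proper border of pattern[:s]
    fail = [0] * (m + 1)
    k = 0
    for i in range(1, m):
        while k and pattern[i] != pattern[k]:
            k = fail[k]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i + 1] = k
    delta = []
    for state in range(m + 1):
        row = {}
        for c in alphabet:
            if state < m and pattern[state] == c:
                row[c] = state + 1
            elif state == 0:
                row[c] = 0
            else:
                # Fall back to the border state, whose row is already built.
                row[c] = delta[fail[state]][c]
        delta.append(row)
    return delta


def occurrence_distribution(pattern, length, probs):
    """Exact distribution of the (overlapping) occurrence count in an i.i.d. text.

    probs maps each character to its probability (assumed i.i.d. model).
    """
    delta = kmp_automaton(pattern, list(probs))
    m = len(pattern)
    dist = {(0, 0): 1.0}  # (automaton state, occurrences so far) -> probability
    for _ in range(length):
        nxt = defaultdict(float)
        for (state, count), p in dist.items():
            for c, pc in probs.items():
                new_state = delta[state][c]
                # Entering the accepting state m means one more occurrence.
                nxt[(new_state, count + (new_state == m))] += p * pc
        dist = nxt
    result = defaultdict(float)
    for (_, count), p in dist.items():
        result[count] += p
    return dict(result)


toy_dist = occurrence_distribution("AA", 3, {"A": 0.5, "C": 0.5})
```

For the toy call, the eight equally likely texts of length 3 over {A, C} contain "AA" zero times in five cases, once in two cases (AAC, CAA), and twice in one case (AAA), so the self-overlap of the pattern is reflected directly in the distribution.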

    Compound Poisson Approximation of the Number of Occurrences of a Position Frequency Matrix (PFM) on Both Strands

    Transcription factors play a key role in gene regulation by interacting with specific binding sites or motifs. Enrichment of binding motifs is therefore important for genome annotation, and efficient computation of the statistical significance (the p-value) of this enrichment is crucial. We propose an efficient approximation to compute the significance. Due to the incorporation of both strands of the DNA molecules and explicit modeling of dependencies between overlapping hits, we achieve accurate results for any DNA motif based on its Position Frequency Matrix (PFM) representation. The accuracy of the p-value approximation is shown by comparison with the simulated count distribution. Furthermore, we compare the approach with a binomial approximation, a (compound) Poisson approximation, and a normal approximation. In general, our approach outperforms these approximations or is equally good but significantly faster. An implementation of our approach is available at http://mosta.molgen.mpg.de
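
For orientation, a generic compound Poisson count distribution (a Poisson number of clumps, each contributing an independent clump size) can be evaluated with the standard Panjer recursion, as in the sketch below. This is textbook material rather than the method of the paper: the clump rate, the clump-size distribution, and the function names (`compound_poisson_pmf`, `upper_tail`) are assumptions made for the example, and the paper's handling of both strands and of overlapping hits is not reproduced here.

```python
import math


def compound_poisson_pmf(rate, clump_size_pmf, max_count):
    """Return [P(N=0), ..., P(N=max_count)] for a compound Poisson count N.

    rate: expected number of clumps (assumed given).
    clump_size_pmf[k - 1] = P(clump size = k) for k >= 1 (assumed given).
    Uses the Panjer recursion P(N=n) = (rate / n) * sum_k k * q_k * P(N=n-k).
    """
    pmf = [math.exp(-rate)]
    for n in range(1, max_count + 1):
        total = 0.0
        for k in range(1, min(n, len(clump_size_pmf)) + 1):
            total += k * clump_size_pmf[k - 1] * pmf[n - k]
        pmf.append(rate / n * total)
    return pmf


def upper_tail(pmf, observed):
    """Approximate P(N >= observed) from a pmf that covers 0..observed-1."""
    return max(0.0, 1.0 - sum(pmf[:observed]))


# Hypothetical parameters: 1.5 expected clumps, clump sizes 1, 2, 3.
toy_pmf = compound_poisson_pmf(1.5, [0.7, 0.2, 0.1], 60)
toy_p_value = upper_tail(toy_pmf, 6)
```

When every clump has size 1, the recursion reduces to the ordinary Poisson distribution, which is a convenient sanity check for the sketch.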

    NASA Thesaurus. Volume 2: Access vocabulary

    The NASA Thesaurus -- Volume 2, Access Vocabulary -- contains an alphabetical listing of all Thesaurus terms (postable and nonpostable) and permutations of all multiword and pseudo-multiword terms. Also included are Other Words (non-Thesaurus terms) consisting of abbreviations, chemical symbols, etc. The permutations and Other Words provide 'access' to the appropriate postable entries in the Thesaurus.

    NASA Thesaurus. Volume 1: Hierarchical listing

    There are 16,713 postable terms and 3,716 nonpostable terms approved for use in the NASA scientific and technical information system in the Hierarchical Listing of the NASA Thesaurus. The generic structure is presented for many terms. The broader term and narrower term relationships are shown in an indented fashion that illustrates the generic structure better than the more widely used BT and NT listings. Related terms are generously applied, thus enhancing the usefulness of the Hierarchical Listing. Greater access to the Hierarchical Listing may be achieved with the collateral use of Volume 2 - Access Vocabulary

    NASA thesaurus. Volume 1: Hierarchical Listing

    There are over 17,000 postable terms and nearly 4,000 nonpostable terms approved for use in the NASA scientific and technical information system in the Hierarchical Listing of the NASA Thesaurus. The generic structure is presented for many terms. The broader term and narrower term relationships are shown in an indented fashion that illustrates the generic structure better than the more widely used BT and NT listings. Related terms are generously applied, thus enhancing the usefulness of the Hierarchical Listing. Greater access to the Hierarchical Listing may be achieved with the collateral use of Volume 2 - Access Vocabulary and Volume 3 - Definitions