20 research outputs found

DNA Motif Match Statistics Without Poisson Approximation

Author: Kopp W.
Vingron M.
Publication venue: Mary Ann Liebert Inc
Publication date: 12/08/2019

Transcription factors (TFs) play a crucial role in gene regulation by binding to specific regulatory sequences. The sequence motifs recognized by a TF can be described in terms of position frequency matrices. Searching for motif matches with a given position frequency matrix is achieved by employing a predefined score cutoff and subsequently counting the number of matches above this cutoff. In this article, we approximate the distribution of the number of motif matches based on a novel dynamic programming approach, which accounts for higher-order sequence background (e.g., as is characteristic for CpG islands) and overlapping motif matches on both DNA strands. A comparison with our previously published compound Poisson approximation and a binomial approximation demonstrates that, in particular for relaxed score thresholds, the dynamic programming approach yields more accurate results.
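The match-counting step described in this abstract (score each window against the matrix, keep those at or above a cutoff, on both strands) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the pseudocount, the uniform background, and the example matrix are all assumptions made here for demonstration.

```python
import math

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def log_odds_matrix(pfm, background=None, pseudocount=1.0):
    """Turn per-position base counts into log-odds scores against a background.

    The uniform default background and the pseudocount are illustrative choices.
    """
    background = background or {b: 0.25 for b in BASES}
    matrix = []
    for column in pfm:
        total = sum(column[b] for b in BASES) + 4 * pseudocount
        matrix.append({
            b: math.log((column[b] + pseudocount) / total / background[b])
            for b in BASES
        })
    return matrix


def score(window, matrix):
    """Sum the log-odds contributions of each base in the window."""
    return sum(col[base] for base, col in zip(window, matrix))


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def count_matches(seq, matrix, cutoff):
    """Count windows on either strand whose score reaches the cutoff."""
    width = len(matrix)
    hits = 0
    for strand in (seq, reverse_complement(seq)):
        for i in range(len(strand) - width + 1):
            if score(strand[i:i + width], matrix) >= cutoff:
                hits += 1
    return hits


# Hypothetical matrix whose consensus is ACGG (10 observations per position).
EXAMPLE_PFM = [
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 10, "G": 0, "T": 0},
    {"A": 0, "C": 0, "G": 10, "T": 0},
    {"A": 0, "C": 0, "G": 10, "T": 0},
]
EXAMPLE_MATRIX = log_odds_matrix(EXAMPLE_PFM)
```

With a strict cutoff only exact consensus windows count; lowering the cutoff admits mismatched windows, which is the relaxed-threshold regime the abstract singles out.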

MPG.PuRe

Statistical detection of cooperative transcription factors with similarity adjustment

Author: Aerts
Arnone
Berman
Boeva
Brown
Chargaff
Clyde
Crooks
Crowley
De Bleser
Frith
GuhaThakurta
H. Klein
Hannenhalli
Harbison
Klingenhoff
Krivan
Lifanov
M. Vingron
Matys
Pape
Pilpel
Sosinsky
Stormo
U. J. Pape
Wagner
Wasserman
Yoshida
Yuh
Publication venue: Oxford University Press
Publication date: 01/01/2009

Motivation: Statistical assessment of cis-regulatory modules (CRMs) is a crucial task in computational biology. Usually, one concludes from exceptional co-occurrences of DNA motifs that the corresponding transcription factors (TFs) are cooperative. However, similar DNA motifs tend to co-occur in random sequences due to the high probability of overlapping occurrences. Therefore, it is important to consider the similarity of DNA motifs in the statistical assessment.

Crossref

PubMed Central

MPG.PuRe

Algorithms and statistical methods for exact motif discovery

Author: Marschall Tobias
Publication date: 23/05/2011

The motif discovery problem consists of uncovering exceptional patterns (called motifs) in sets of sequences. It arises in molecular biology when searching for yet unknown functional sites in DNA sequences. In this thesis, we develop a motif discovery algorithm that (1) is exact, meaning that it returns a motif with optimal score, (2) can use the statistical significance with respect to complex background models as a scoring function, (3) takes into account the effects of self-overlaps of motif instances, and (4) is efficient enough to be useful in large-scale applications. To this end, several algorithms and statistical methods are developed. First, the concepts of deterministic arithmetic automata (DAAs) and probabilistic arithmetic automata (PAAs) are introduced. We prove that they allow calculating the distributions of values resulting from deterministic computations on random texts generated by arbitrary finite-memory text models. This technique is applied three times: first, to compute the distribution of the number of occurrences of a pattern in a random string, second, to compute the distribution of the number of character accesses made by window-based pattern matching algorithms, and, third, to compute the distribution of clump sizes, where a clump is a maximal set of overlapping motif occurrences. All of these applications are interesting theoretical topics in themselves and, in all three cases, our results go beyond those known previously. In order to compute the distribution of the number of occurrences of a motif in a random text, a deterministic finite automaton (DFA) accepting the motif’s instances is needed to subsequently construct a PAA. We therefore address the problem of efficiently constructing minimal DFAs for motif types common in computational biology. We introduce simple non-deterministic finite automata (NFAs) and prove that these NFAs are transformed into minimal DFAs by the classical subset construction. We show that they can be built from (sets of) generalized strings and from consensus strings with a Hamming neighborhood, allowing the direct construction of minimal DFAs for these pattern types. As a contribution to the field of motif statistics, we derive a formula for the expected clump size of motifs. It is remarkably simple and does not involve laborious operations like matrix inversions. This formula plays an important role in developing bounds for the expected clump size of partially known motifs. Such bounds are needed to obtain bounds for the p-value of a partially known motif. Using these, we are finally able to devise a branch-and-bound algorithm for motif discovery that extracts provably optimal motifs with respect to their p-values in compound Poisson approximation. Markovian text models of arbitrary order can be used as a background model (or null model). The algorithm is further generalized to jointly handle a motif and its reverse complement. An Open Source implementation is publicly available as part of the MoSDi software package. An experimental evaluation using synthetic and real data sets follows. On the carefully crafted benchmark set of Sandve et al. (2007), the proposed algorithm outperforms MEME (Bailey and Elkan, 1994) and Weeder (Pavesi et al., 2004) in terms of the commonly used average nucleotide-level correlation coefficient. With respect to this measure, it is also superior to other algorithms tested by Fauteux et al. (2008) on the same benchmark suite, namely Seeder (Fauteux et al., 2008), BioProspector (Liu et al., 2001), GibbsSampler (Lawrence et al., 1993), and MotifSampler (Thijs et al., 2001). Besides the comparison to other algorithms, we perform motif discovery on the non-coding regions of Mycobacterium tuberculosis and on CpG-rich regions in the human genome. In both cases, we report on found motifs that are strikingly over-represented. While the function of most of these motifs remains unknown to us, some motifs found in M. tuberculosis can be attributed to a known biological function.
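The idea of pairing a DFA with a counter to obtain the distribution of the number of pattern occurrences in a random text can be illustrated with a much-simplified sketch. It assumes a single exact word and an i.i.d. character model, whereas the thesis treats general motif types and finite-memory text models; the function names below are hypothetical and this is not the MoSDi implementation.

```python
from collections import defaultdict


def kmp_automaton(pattern, alphabet):
    """Build a DFA whose state is the length of the longest pattern prefix
    that is a suffix of the text read so far (overlaps are handled)."""
    m = len(pattern)
    delta = [dict() for _ in range(m + 1)]
    for state in range(m + 1):
        for ch in alphabet:
            text = pattern[:state] + ch
            k = min(m, len(text))
            # Shrink k until the length-k suffix of text is a pattern prefix.
            while k > 0 and text[-k:] != pattern[:k]:
                k -= 1
            delta[state][ch] = k
    return delta


def occurrence_distribution(pattern, length, char_probs):
    """Return {count: probability} for the number of (possibly overlapping)
    occurrences of pattern in an i.i.d. random text of the given length."""
    alphabet = sorted(char_probs)
    delta = kmp_automaton(pattern, alphabet)
    accepting = len(pattern)
    # dist maps (automaton state, occurrences so far) to probability mass.
    dist = {(0, 0): 1.0}
    for _ in range(length):
        nxt = defaultdict(float)
        for (state, count), p in dist.items():
            for ch in alphabet:
                new_state = delta[state][ch]
                new_count = count + (1 if new_state == accepting else 0)
                nxt[(new_state, new_count)] += p * char_probs[ch]
        dist = nxt
    result = defaultdict(float)
    for (_, count), p in dist.items():
        result[count] += p
    return dict(result)


UNIFORM_DNA = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
```

For example, `occurrence_distribution("AA", 3, UNIFORM_DNA)` assigns probability 1/64 to two occurrences, since only the text AAA contains two overlapping copies of AA.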

Eldorado - Ressourcen aus und für Lehre, Studium und Forschung

Compound Poisson Approximation of the Number of Occurrences of a Position Frequency Matrix (PFM) on Both Strands

Author: Pape U.
Rahmann S.
Sun F.
Vingron M.
Publication venue: Mary Ann Liebert Inc
Publication date: 01/01/2008

Transcription factors play a key role in gene regulation by interacting with specific binding sites or motifs. Therefore, enrichment of binding motifs is important for genome annotation, and efficient computation of the statistical significance (the p-value) of motif enrichment is crucial. We propose an efficient approximation to compute the significance. Due to the incorporation of both strands of the DNA molecules and explicit modeling of dependencies between overlapping hits, we achieve accurate results for any DNA motif based on its Position Frequency Matrix (PFM) representation. The accuracy of the p-value approximation is shown by comparison with the simulated count distribution. Furthermore, we compare the approach with a binomial approximation, (compound) Poisson approximation, and a normal approximation. In general, our approach outperforms these approximations or is equally good but significantly faster. An implementation of our approach is available at http://mosta.molgen.mpg.de.
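A compound Poisson count model treats the total number of hits as a Poisson number of clumps, each contributing a random number of overlapping hits. The sketch below evaluates such a distribution with the standard Panjer recursion and derives an enrichment p-value from it. It does not reproduce how the paper derives the clump rate or clump-size distribution from a PFM; `rate`, `clump_size_probs`, and the function names are hypothetical inputs chosen for illustration.

```python
import math


def compound_poisson_pmf(rate, clump_size_probs, max_count):
    """Probability mass function of S = X_1 + ... + X_N for n = 0..max_count.

    N is Poisson(rate) and each X_i is an independent clump size with
    P(X = k) = clump_size_probs[k - 1] for k >= 1 (no zero-sized clumps).
    Uses the Panjer recursion p(n) = (rate / n) * sum_k k * q_k * p(n - k).
    """
    pmf = [math.exp(-rate)]
    for n in range(1, max_count + 1):
        total = 0.0
        for k in range(1, min(n, len(clump_size_probs)) + 1):
            total += k * clump_size_probs[k - 1] * pmf[n - k]
        pmf.append(rate / n * total)
    return pmf


def enrichment_p_value(observed, rate, clump_size_probs):
    """P(S >= observed) under the compound Poisson model."""
    if observed <= 0:
        return 1.0
    pmf = compound_poisson_pmf(rate, clump_size_probs, observed - 1)
    return max(0.0, 1.0 - sum(pmf))
```

When every clump has size one (`clump_size_probs = [1.0]`), the recursion reduces to an ordinary Poisson distribution, which is a convenient sanity check.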

PubMed Central

MPG.PuRe

Compound Poisson Approximation of the Number of Occurrences of a Position Frequency Matrix (PFM) on Both Strands

Author: Claverie J.-M.
Fengzhu Sun
Kemp C.D.
Kleffe J.
Martin Vingron
Pape U.J.
Prum B.
Staden R.
Sven Rahmann
Utz J. Pape
Publication venue: Mary Ann Liebert Inc

Crossref

Recommended from our members

Performance Based Earthquake Engineering of Concrete Dams

Author: Hariri-Ardebili Mohammad Amin
Publication venue: University of Colorado Boulder
Publication date: 01/01/2015

The main objective of this thesis is to develop a framework for performance based earthquake engineering (PBEE) of concrete dams. To pursue this goal, this study first develops an extended and quantitative version of potential failure mode analysis (PFMA) for concrete dams. Different failure modes are investigated for all types of concrete dams. A Matlab-based code is developed for probabilistic performance assessment of concrete dams (PPACD). This code is used for assessment of concrete dams within the context of PBEE. A probabilistic seismic demand model (PSDM) is proposed for concrete dams based on cloud analysis methodology. The outcome of the PSDM is the selection of optimal intensity measure (IM) parameters for gravity dams. Then, the sensitivity and uncertainty of the dam-foundation system are quantified under the mixed-mode fracture of a zero-thickness interface joint element. Capacity and fragility curves are derived for the most sensitive random variables. This research also examines the performance of the dam under incremental dynamic analysis (IDA). First, the anatomy of a single-record IDA is studied and contrasted with that of framed structures. Then, the collapse fragility curves are derived for single and multiple-component ground motions. The impact of epistemic uncertainty is investigated in addition to the aleatoric one. Finally, a multi-scale damage index (DI) is proposed for gravity dams, which is a function of crest displacement, crack ratio, and dissipated energy. Using this hybrid DI, a computationally simple but effective methodology is proposed for progressive failure analysis of dams. In all cases, the methodology is discussed first, and then a numerical example illustrates the details.

CU Scholar Institutional Repository

NASA Thesaurus. Volume 2: Access vocabulary


The NASA Thesaurus -- Volume 2, Access Vocabulary -- contains an alphabetical listing of all Thesaurus terms (postable and nonpostable) and permutations of all multiword and pseudo-multiword terms. Also included are Other Words (non-Thesaurus terms) consisting of abbreviations, chemical symbols, etc. The permutations and Other Words provide 'access' to the appropriate postable entries in the Thesaurus.

NASA Technical Reports Server

NASA Thesaurus. Volume 1: Hierarchical listing


There are 16,713 postable terms and 3,716 nonpostable terms approved for use in the NASA scientific and technical information system in the Hierarchical Listing of the NASA Thesaurus. The generic structure is presented for many terms. The broader term and narrower term relationships are shown in an indented fashion that illustrates the generic structure better than the more widely used BT and NT listings. Related terms are generously applied, thus enhancing the usefulness of the Hierarchical Listing. Greater access to the Hierarchical Listing may be achieved with the collateral use of Volume 2 - Access Vocabulary.

NASA Technical Reports Server

NASA thesaurus. Volume 1: Hierarchical Listing


There are over 17,000 postable terms and nearly 4,000 nonpostable terms approved for use in the NASA scientific and technical information system in the Hierarchical Listing of the NASA Thesaurus. The generic structure is presented for many terms. The broader term and narrower term relationships are shown in an indented fashion that illustrates the generic structure better than the more widely used BT and NT listings. Related terms are generously applied, thus enhancing the usefulness of the Hierarchical Listing. Greater access to the Hierarchical Listing may be achieved with the collateral use of Volume 2 - Access Vocabulary and Volume 3 - Definitions.

NASA Technical Reports Server