DNA Motif Match Statistics Without Poisson Approximation
Transcription factors (TFs) play a crucial role in gene regulation by binding to specific regulatory sequences. The sequence motifs recognized by a TF can be described in terms of position frequency matrices. Searching for motif matches with a given position frequency matrix is achieved by employing a predefined score cutoff and subsequently counting the number of matches above this cutoff. In this article, we approximate the distribution of the number of motif matches based on a novel dynamic programming approach, which accounts for higher order sequence background (e.g., as is characteristic for CpG islands) and overlapping motif matches on both DNA strands. A comparison with our previously published compound Poisson approximation and a binomial approximation demonstrates that in particular for relaxed score thresholds, the dynamic programming approach yields more accurate results
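The match-counting step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the pseudocount, the uniform background, and the log-odds scoring are all assumptions made for the example, and the sequence is assumed to contain only uppercase A, C, G, T.

```python
import math

BASES = "ACGT"

def pfm_to_log_odds(pfm, background=None, pseudocount=1.0):
    """Turn a position frequency matrix (one {base: count} dict per column)
    into log-odds scores against a background distribution.
    The pseudocount and uniform background are illustrative assumptions."""
    background = background or {b: 0.25 for b in BASES}
    scores = []
    for column in pfm:
        total = sum(column[b] for b in BASES) + 4 * pseudocount
        scores.append({
            b: math.log(((column[b] + pseudocount) / total) / background[b])
            for b in BASES
        })
    return scores

def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def count_matches(seq, scores, cutoff):
    """Count windows on both strands whose score reaches the cutoff.
    Overlapping matches are counted individually."""
    m = len(scores)
    count = 0
    for strand in (seq, reverse_complement(seq)):
        for i in range(len(strand) - m + 1):
            window_score = sum(scores[j][strand[i + j]] for j in range(m))
            if window_score >= cutoff:
                count += 1
    return count

# Hypothetical three-column matrix that strongly prefers "ACG".
example_pfm = [
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 10, "G": 0, "T": 0},
    {"A": 0, "C": 0, "G": 10, "T": 0},
]
example_scores = pfm_to_log_odds(example_pfm)
print(count_matches("ACGTTCGT", example_scores, cutoff=3.0))
```

Note that a palindromic motif is counted once per strand in this sketch, which is one source of the dependencies between matches that the abstract refers to.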
Statistical detection of cooperative transcription factors with similarity adjustment
Motivation: Statistical assessment of cis-regulatory modules (CRMs) is a crucial task in computational biology. Usually, one concludes from exceptional co-occurrences of DNA motifs that the corresponding transcription factors (TFs) are cooperative. However, similar DNA motifs tend to co-occur in random sequences due to the high probability of overlapping occurrences. Therefore, it is important to consider the similarity of DNA motifs in the statistical assessment
Algorithms and statistical methods for exact motif discovery
The motif discovery problem consists of uncovering exceptional patterns (called motifs)
in sets of sequences. It arises in molecular biology when searching for yet unknown
functional sites in DNA sequences.
In this thesis, we develop a motif discovery algorithm that (1) is exact, that means it
returns a motif with optimal score, (2) can use the statistical significance with respect
to complex background models as a scoring function, (3) takes into account the effects
of self-overlaps of motif instances, and (4) is efficient enough to be useful in large-scale
applications.
To this end, several algorithms and statistical methods are developed. First, the
concepts of deterministic arithmetic automata (DAAs) and probabilistic arithmetic automata
(PAAs) are introduced. We prove that they allow calculating the distributions of values
resulting from deterministic computations on random texts generated by arbitrary
finite-memory text models. This technique is applied three times: first, to compute
the distribution of the number of occurrences of a pattern in a random string, second,
to compute the distribution of the number of character accesses made by window-based
pattern matching algorithms, and, third, to compute the distribution of clump
sizes, where a clump is a maximal set of overlapping motif occurrences. All of these
applications are interesting theoretical topics in themselves and, in all three cases, our
results go beyond those known previously.
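As a rough illustration of the first application (the distribution of the number of occurrences of a pattern in a random string), the sketch below pairs a pattern-matching automaton with a dynamic program over (state, count) pairs. It is a simplified stand-in, not the DAA/PAA framework of the thesis: it assumes a single string pattern, an i.i.d. text model, and hypothetical function names.

```python
def build_automaton(pattern, alphabet):
    """delta[state][ch] = length of the longest prefix of `pattern` that is a
    suffix of pattern[:state] + ch. State len(pattern) signals a match."""
    m = len(pattern)
    delta = []
    for state in range(m + 1):
        row = {}
        for ch in alphabet:
            candidate = pattern[:state] + ch
            k = min(m, len(candidate))
            while k > 0 and candidate[-k:] != pattern[:k]:
                k -= 1
            row[ch] = k
        delta.append(row)
    return delta

def occurrence_count_distribution(pattern, char_probs, n):
    """Exact distribution of the number of (possibly overlapping) occurrences
    of `pattern` in an i.i.d. random text of length n.
    char_probs maps each character to its probability (assumed to sum to 1)."""
    m = len(pattern)
    delta = build_automaton(pattern, list(char_probs))
    dist = {(0, 0): 1.0}  # (automaton state, occurrences so far) -> probability
    for _ in range(n):
        nxt = {}
        for (state, count), p in dist.items():
            for ch, q in char_probs.items():
                new_state = delta[state][ch]
                new_count = count + (1 if new_state == m else 0)
                key = (new_state, new_count)
                nxt[key] = nxt.get(key, 0.0) + p * q
        dist = nxt
    # Marginalize over automaton states.
    result = {}
    for (_, count), p in dist.items():
        result[count] = result.get(count, 0.0) + p
    return result

# Self-overlapping pattern "AA" over a two-letter alphabet, text length 3.
print(occurrence_count_distribution("AA", {"A": 0.5, "C": 0.5}, 3))
```

Tracking the automaton state alongside the count is what lets the computation account for self-overlaps, such as the two occurrences of "AA" in "AAA".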
In order to compute the distribution of the number of occurrences of a motif in a
random text, a deterministic finite automaton (DFA) accepting the motif's instances
is needed to subsequently construct a PAA. We therefore address the problem of
efficiently constructing minimal DFAs for motif types common in computational biology.
We introduce simple non-deterministic finite automata (NFAs) and prove that these NFAs
are transformed into minimal DFAs by the classical subset construction. We show that
they can be built from (sets of) generalized strings and from consensus strings with a
Hamming neighborhood, allowing the direct construction of minimal DFAs for these
pattern types.
As a contribution to the field of motif statistics, we derive a formula for the expected
clump size of motifs. It is remarkably simple and does not involve laborious operations
like matrix inversions. This formula plays an important role in developing bounds for
the expected clump size of partially known motifs. Such bounds are needed to obtain
bounds for the p-value of a partially known motif. Using these, we are finally able to
devise a branch-and-bound algorithm for motif discovery that extracts provably optimal
motifs with respect to their p-values in compound Poisson approximation. Markovian
text models of arbitrary order can be used as a background model (or null model). The
algorithm is further generalized to jointly handle a motif and its reverse complement.
An Open Source implementation is publicly available as part of the MoSDi software
package.
An experimental evaluation using synthetic and real data sets follows. On the
carefully crafted benchmark set of Sandve et al. (2007), the proposed algorithm outperforms
MEME (Bailey and Elkan, 1994) and Weeder (Pavesi et al., 2004) in terms of the
commonly used average nucleotide-level correlation coefficient. With respect to this
measure, it is also superior to other algorithms tested by Fauteux et al. (2008) on the
same benchmark suite; namely Seeder (Fauteux et al., 2008), BioProspector (Liu et al.,
2001), GibbsSampler (Lawrence et al., 1993), and MotifSampler (Thijs et al., 2001).
Besides the comparison to other algorithms, we perform motif discovery on the
non-coding regions of Mycobacterium tuberculosis and on CpG-rich regions in the human
genome. In both cases, we report on found motifs that are strikingly over-represented.
While the function of most of these motifs remains unknown to us, some motifs found
in M. tuberculosis can be attributed to a known biological function
Compound Poisson Approximation of the Number of Occurrences of a Position Frequency Matrix (PFM) on Both Strands
Transcription factors play a key role in gene regulation by interacting with specific binding sites or motifs. Enrichment of binding motifs is therefore important for genome annotation, and efficient computation of the statistical significance (the p-value) of this enrichment is crucial. We propose an efficient approximation to compute the significance. By incorporating both strands of the DNA molecule and explicitly modeling dependencies between overlapping hits, we achieve accurate results for any DNA motif based on its Position Frequency Matrix (PFM) representation. The accuracy of the p-value approximation is shown by comparison with the simulated count distribution. Furthermore, we compare the approach with a binomial approximation, a (compound) Poisson approximation, and a normal approximation. In general, our approach outperforms these approximations or is equally good but significantly faster. An implementation of our approach is available at http://mosta.molgen.mpg.de
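To make the idea of a compound Poisson p-value concrete, here is a minimal sketch under stated assumptions: clumps of overlapping hits arrive as a Poisson number with rate `lam`, and clump sizes are independent with a known distribution. The count distribution then follows from the standard recursion for compound Poisson sums. How `lam` and the clump-size distribution are derived from a PFM is the substance of the paper and is not reproduced here; the function names and example parameters are hypothetical.

```python
import math

def compound_poisson_pmf(lam, clump_size_probs, n_max):
    """P(N = 0), ..., P(N = n_max) for N = sum of clump sizes, where the number
    of clumps is Poisson(lam) and clump_size_probs[k - 1] = P(clump size = k).
    Uses the recursion P(N=n) = (lam / n) * sum_k k * theta_k * P(N = n - k)."""
    pmf = [math.exp(-lam)]
    for n in range(1, n_max + 1):
        total = 0.0
        for k in range(1, min(n, len(clump_size_probs)) + 1):
            total += k * clump_size_probs[k - 1] * pmf[n - k]
        pmf.append(lam / n * total)
    return pmf

def enrichment_p_value(lam, clump_size_probs, observed):
    """P(N >= observed) under the compound Poisson model."""
    if observed <= 0:
        return 1.0
    pmf = compound_poisson_pmf(lam, clump_size_probs, observed - 1)
    return 1.0 - sum(pmf)

# Illustrative parameters only: one expected clump, sizes 1 and 2 equally likely.
print(enrichment_p_value(lam=1.0, clump_size_probs=[0.5, 0.5], observed=4))
```

With all clumps of size 1, the model reduces to an ordinary Poisson distribution, which is a convenient sanity check.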
Performance Based Earthquake Engineering of Concrete Dams
The main objective of this thesis is to develop a framework for performance based earthquake engineering (PBEE) of concrete dams. To pursue this goal, this study first develops an extended and quantitative version of potential failure mode analysis (PFMA) for concrete dams. Different failure modes are investigated for all types of concrete dams.
A Matlab-based code is developed for probabilistic performance assessment of concrete dams (PPACD). This code is used for assessment of concrete dams within the context of PBEE. A probabilistic seismic demand model (PSDM) is proposed for concrete dams based on cloud analysis methodology. The outcome of the PSDM is the selection of optimal intensity measure (IM) parameters for gravity dams. Then, the sensitivity and uncertainty of the dam-foundation system are quantified under the mixed-mode fracture of the zero-thickness interface joint element. Capacity and fragility curves are derived for the most sensitive random variables.
This research also examines the performance of the dam under incremental dynamic analysis (IDA). First, the anatomy of a single-record IDA is studied and contrasted with that of framed structures. Then, the collapse fragility curves are derived for single and multiple-component ground motions. The impact of epistemic uncertainty is investigated in addition to the aleatoric one.
Finally, a multi-scale damage index (DI) is proposed for gravity dams, which is a function of crest displacement, crack ratio, and dissipated energy. Using this hybrid DI, a computationally simple but effective methodology is proposed for progressive failure analysis of dams. In all cases, the methodology is discussed first, and then a numerical example illustrates the details
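The cloud-analysis PSDM and fragility curves mentioned in this abstract can be illustrated with a short sketch. It assumes the commonly used power-law demand model, ln(EDP) = ln(a) + b ln(IM), fitted by least squares, with lognormally distributed demand; the thesis may use a different formulation. All names and the sample data are hypothetical.

```python
import math

def fit_cloud_psdm(im_values, edp_values):
    """Least-squares fit of ln(EDP) = ln(a) + b * ln(IM).
    Returns (a, b, beta), where beta is the standard deviation of the
    residuals in log space (n - 2 degrees of freedom)."""
    x = [math.log(v) for v in im_values]
    y = [math.log(v) for v in edp_values]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    ln_a = mean_y - b * mean_x
    sse = sum((yi - (ln_a + b * xi)) ** 2 for xi, yi in zip(x, y))
    beta = math.sqrt(sse / (n - 2))
    return math.exp(ln_a), b, beta

def fragility(im, a, b, beta, capacity):
    """P(EDP > capacity | IM) assuming lognormal demand with median a * IM^b."""
    z = (math.log(a * im ** b) - math.log(capacity)) / beta
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Made-up cloud of (IM, EDP) pairs, for illustration only.
ims = [0.1, 0.2, 0.4, 0.8]
edps = [2.0 * im ** 1.5 * noise for im, noise in zip(ims, (1.1, 0.9, 0.9, 1.1))]
a, b, beta = fit_cloud_psdm(ims, edps)
print(fragility(0.5, a, b, beta, capacity=1.0))
```

Comparing candidate intensity measures by the dispersion `beta` of their fits is one common way to judge which IM is preferable, although the criteria used in the thesis are not stated in the abstract.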
NASA Thesaurus. Volume 2: Access vocabulary
The NASA Thesaurus -- Volume 2, Access Vocabulary -- contains an alphabetical listing of all Thesaurus terms (postable and nonpostable) and permutations of all multiword and pseudo-multiword terms. Also included are Other Words (non-Thesaurus terms) consisting of abbreviations, chemical symbols, etc. The permutations and Other Words provide 'access' to the appropriate postable entries in the Thesaurus
NASA Thesaurus. Volume 1: Hierarchical listing
There are 16,713 postable terms and 3,716 nonpostable terms approved for use in the NASA scientific and technical information system in the Hierarchical Listing of the NASA Thesaurus. The generic structure is presented for many terms. The broader term and narrower term relationships are shown in an indented fashion that illustrates the generic structure better than the more widely used BT and NT listings. Related terms are generously applied, thus enhancing the usefulness of the Hierarchical Listing. Greater access to the Hierarchical Listing may be achieved with the collateral use of Volume 2 - Access Vocabulary
NASA Thesaurus. Volume 1: Hierarchical listing
There are over 17,000 postable terms and nearly 4,000 nonpostable terms approved for use in the NASA scientific and technical information system in the Hierarchical Listing of the NASA Thesaurus. The generic structure is presented for many terms. The broader term and narrower term relationships are shown in an indented fashion that illustrates the generic structure better than the more widely used BT and NT listings. Related terms are generously applied, thus enhancing the usefulness of the Hierarchical Listing. Greater access to the Hierarchical Listing may be achieved with the collateral use of Volume 2 - Access Vocabulary and Volume 3 - Definitions