22 research outputs found

    Stein's method, Palm theory and Poisson process approximation

    The framework of Stein's method for Poisson process approximation is presented from the point of view of Palm theory, which is used to construct Stein identities and define local dependence. A general result (Theorem \ref{importantproposition}) in Poisson process approximation is proved by taking the local approach. It is obtained without reference to any particular metric, thereby allowing wider applicability. A Wasserstein pseudometric is introduced for measuring the accuracy of point process approximation. The pseudometric provides a generalization of many metrics used so far, including the total variation distance for random variables and the Wasserstein metric for processes as in Barbour and Brown [Stochastic Process. Appl. 43 (1992) 9-31]. Also, through the pseudometric, approximation for certain point processes on a given carrier space is carried out by lifting it to one on a larger space, extending an idea of Arratia, Goldstein and Gordon [Statist. Sci. 5 (1990) 403-434]. The error bound in the general result is similar in form to that for Poisson approximation. As it yields the Stein factor $1/\lambda$ as in Poisson approximation, it provides good approximation, particularly in cases where $\lambda$ is large. The general result is applied to a number of problems including Poisson process modeling of rare words in a DNA sequence. Comment: Published by the Institute of Mathematical Statistics (http://www.imstat.org) in the Annals of Probability (http://www.imstat.org/aop/) at http://dx.doi.org/10.1214/00911790400000002

    Scoring schemes of palindrome clusters for more sensitive prediction of replication origins in herpesviruses

    Many empirical studies show that there are unusual clusters of palindromes, closely spaced direct and inverted repeats around the replication origins of herpesviruses. In this paper, we introduce two new scoring schemes to quantify the spatial abundance of palindromes in a genomic sequence. Based on these scoring schemes, a computational method to predict the locations of replication origins is developed. When our predictions are compared with 39 known or annotated replication origins in 19 herpesviruses, close to 80% of the replication origins are located within 2% of the genome length. A list of predicted locations of replication origins in all the known herpesviruses with complete genome sequences is reported.

    AT excursion: a new approach to predict replication origins in viral genomes by locating AT-rich regions

    Background: Replication origins are considered important sites for understanding the molecular mechanisms involved in DNA replication. Many computational methods have been developed for predicting their locations in archaeal, bacterial and eukaryotic genomes. However, a prediction method designed for a particular kind of genome might not work well for another. In this paper, we propose the AT excursion method, which is a score-based approach, to quantify local AT abundance in genomic sequences and use the identified high scoring segments for predicting replication origins. This method has the advantages of requiring no preset window size and having rigorous criteria to evaluate statistical significance of high scoring segments. Results: We have evaluated the AT excursion method by checking its predictions against known replication origins in herpesviruses and comparing its performance with an existing base weighted score method (BWS1). Out of 43 known origins, 39 are predicted by either one or the other method and 26 origins are predicted by both. The excursion method identifies six origins not predicted by BWS1, showing that the AT excursion method is a valuable complement to BWS1. We have also applied the AT excursion method to two other families of double stranded DNA viruses, the poxviruses and iridoviruses, of which very few replication origins are documented in the public domain. The prediction results are made available as supplementary materials at [1]. Preliminary investigation shows that the proposed method works well on some larger genomes too. Conclusion: The AT excursion method will be a useful computational tool for identifying replication origins in a variety of genomic sequences.
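    The idea of scoring local AT abundance without a preset window can be illustrated with a generic excursion recursion, E_i = max(E_{i-1} + s_i, 0). This is only a minimal sketch under stated assumptions, not the authors' implementation: the function name `at_excursion` and the scores +1 for A/T and -1 for everything else are illustrative placeholders, and the significance criteria mentioned in the abstract are not reproduced.

```python
def at_excursion(seq, at_score=1, gc_score=-1):
    """Return (score, start, end) of the highest-scoring AT-rich segment.

    Assumed scoring (hypothetical, not from the paper): each A or T adds
    `at_score`; any other character adds `gc_score`. The running score is
    reset to zero whenever it would drop to zero or below, so no window
    size has to be chosen in advance. `end` is exclusive.
    """
    best_score, best_start, best_end = 0, 0, 0
    running, start = 0, 0
    for i, base in enumerate(seq.upper()):
        running += at_score if base in "AT" else gc_score
        if running <= 0:
            # The excursion has ended; a new one can begin at the next base.
            running = 0
            start = i + 1
        elif running > best_score:
            best_score, best_start, best_end = running, start, i + 1
    return best_score, best_start, best_end


score, start, end = at_excursion("GCGCAATTATGCGC")
print(score, start, end)
```

    In this toy input the highest-scoring segment is the AT-only stretch in the middle of the sequence. In practice the scores would need to be chosen from the base composition of the genome under study.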

    Species-specific Typing of DNA Based on Palindrome Frequency Patterns

    DNA in its natural, double-stranded form may contain palindromes, sequences which read the same from either side because they are identical to their reverse complement on the sister strand. Short palindromes are underrepresented in all kinds of genomes. The frequency distribution of short palindromes exhibits more than twice the inter-species variance of non-palindromic sequences, which renders palindromes optimally suited for the typing of DNA. Here, we show that based on palindrome frequency, DNA sequences can be discriminated to the level of species of origin. By plotting the ratios of actual occurrence to expectancy, we generate palindrome frequency patterns that allow different sequences of the same genome to be clustered and plasmids, and in some cases even viruses, to be assigned to their respective host genomes. This finding will be of use in the growing field of metagenomics.
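    The two notions in this abstract, a palindrome as a word equal to its own reverse complement and the ratio of actual occurrence to expectancy, can be sketched as follows. This is a hedged illustration only: the function names are hypothetical, and the expected count assumes independent bases with the sequence's own base frequencies, which may differ from the expectancy model used by the authors.

```python
from itertools import product

# Translation table mapping each base to its Watson-Crick complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement every base, then reverse the result."""
    return seq.translate(COMPLEMENT)[::-1]


def is_palindrome(word):
    """A DNA palindrome equals its own reverse complement."""
    return word == reverse_complement(word)


def observed_expected_ratios(seq, k=4):
    """Observed/expected count for every palindromic k-mer in `seq`.

    Assumption (not from the source): the expected count is
    (number of k-mer positions) * product of single-base frequencies.
    """
    seq = seq.upper()
    positions = len(seq) - k + 1
    base_freq = {b: seq.count(b) / len(seq) for b in "ACGT"}
    ratios = {}
    for letters in product("ACGT", repeat=k):
        word = "".join(letters)
        if not is_palindrome(word):
            continue
        observed = sum(1 for i in range(positions) if seq[i:i + k] == word)
        expected = positions
        for b in word:
            expected *= base_freq[b]
        if expected > 0:
            ratios[word] = observed / expected
    return ratios


print(is_palindrome("GAATTC"))  # EcoRI recognition site
print(observed_expected_ratios("ACGTACGT", k=4)["ACGT"])
```

    A vector of such ratios over all short palindromes is one plausible way to form the kind of frequency pattern the abstract describes.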

    Palindrome distributions and their applications

    Master's, MASTER OF SCIENCE

    Importance Sampling of Word Patterns in DNA and Protein Sequences

    The use of Monte Carlo evaluation to compute p-values of pattern counting test statistics is especially attractive when an asymptotic theory is absent or when the search sequence or the word pattern is too short for an asymptotic formula to be accurate. The drawback of applying Monte Carlo simulations directly is its inefficiency when p-values are small, which is precisely the situation of importance. In this paper, we provide a general importance sampling algorithm for efficient Monte Carlo evaluation of small p-values of pattern counting test statistics and apply it to word patterns of biological interest, in particular palindromes and inverted repeats, patterns arising from position specific weight matrices, as well as co-occurrences of pairs of motifs. We also show that our importance sampling technique satisfies a log efficiency criterion.
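    Why importance sampling helps with small p-values can be seen in a much simpler setting than the paper's. The sketch below is not the authors' algorithm: it estimates the tail probability that a single letter occurs at least k times in an independent sequence of length n, by sampling from a tilted letter probability q = k/n and reweighting with the likelihood ratio. The function names and the toy parameters are assumptions chosen for illustration.

```python
import math
import random


def exact_tail(n, k, p):
    """Exact P(X >= k) for X ~ Binomial(n, p), used here only as a check."""
    return sum(math.comb(n, j) * p**j * (1 - p) ** (n - j)
               for j in range(k, n + 1))


def importance_sampling_tail(n, k, p, num_samples=20000, seed=0):
    """Estimate P(X >= k) by sampling under a tilted letter probability.

    Assumes 0 < k < n and k/n > p. Sequences are drawn with letter
    probability q = k/n, so the rare event becomes typical, and each hit is
    weighted by the likelihood ratio between the original and tilted models.
    """
    rng = random.Random(seed)
    q = k / n
    log_hit = math.log(p / q)               # per-occurrence log weight
    log_miss = math.log((1 - p) / (1 - q))  # per-non-occurrence log weight
    total = 0.0
    for _ in range(num_samples):
        x = sum(1 for _ in range(n) if rng.random() < q)
        if x >= k:
            total += math.exp(x * log_hit + (n - x) * log_miss)
    return total / num_samples


n, k, p = 50, 30, 0.25
print("exact:", exact_tail(n, k, p))
print("importance sampling:", importance_sampling_tail(n, k, p))
```

    Direct simulation with the same number of samples would almost never observe the event at this scale, whereas the reweighted estimate stays close to the exact value.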