Importance Sampling of Word Patterns in DNA and Protein Sequences

Chan, Hock Peng; Chen, Louis H. Y.; Zhang, Nancy R.

Authors: Hock Peng Chan, Louis H. Y. Chen, Nancy R. Zhang
Publication date: 1 January 2010
Publisher: ScholarlyCommons

Abstract

The use of Monte Carlo evaluation to compute p-values of pattern counting test statistics is especially attractive when an asymptotic theory is absent or when the search sequence or the word pattern is too short for an asymptotic formula to be accurate. The drawback of applying Monte Carlo simulation directly is its inefficiency when p-values are small, which is precisely the situation of interest. In this paper, we provide a general importance sampling algorithm for efficient Monte Carlo evaluation of small p-values of pattern counting test statistics and apply it to word patterns of biological interest, in particular palindromes and inverted repeats, patterns arising from position-specific weight matrices, and co-occurrences of pairs of motifs. We also show that our importance sampling technique satisfies a log efficiency criterion.
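To make the contrast between direct Monte Carlo and importance sampling concrete, the following is a minimal sketch and not the algorithm developed in the paper. It assumes an i.i.d. uniform null model on the DNA alphabet and a deliberately simple proposal that just up-weights the letters of the word. The function names (`count_word`, `direct_mc`, `importance_sampling`) and the `boost` parameter are hypothetical and chosen only for illustration.

```python
import math
import random

ALPHABET = "ACGT"  # assumed null model: i.i.d. letters, each with probability 0.25


def count_word(seq, word):
    """Count occurrences of `word` in `seq`, allowing overlaps."""
    last_start = len(seq) - len(word)
    return sum(1 for i in range(last_start + 1) if seq.startswith(word, i))


def direct_mc(word, n, threshold, trials, rng):
    """Direct Monte Carlo estimate of P(count >= threshold) under the null."""
    hits = 0
    for _ in range(trials):
        seq = "".join(rng.choices(ALPHABET, k=n))
        if count_word(seq, word) >= threshold:
            hits += 1
    return hits / trials


def importance_sampling(word, n, threshold, trials, boost, rng):
    """Importance sampling estimate of the same probability.

    Proposal (an assumption of this sketch): i.i.d. letters in which every
    letter appearing in `word` gets relative weight `boost`, all others 1.
    Each sample is reweighted by the likelihood ratio null / proposal.
    If `word` uses all four letters, this proposal equals the null.
    """
    weights = [boost if ch in word else 1.0 for ch in ALPHABET]
    total = sum(weights)
    # Per-letter log likelihood ratio log(p(ch) / q(ch)) with p(ch) = 0.25.
    log_ratio = {ch: math.log(0.25 * total / w) for ch, w in zip(ALPHABET, weights)}
    acc = 0.0
    for _ in range(trials):
        seq = rng.choices(ALPHABET, weights=weights, k=n)
        if count_word("".join(seq), word) >= threshold:
            acc += math.exp(sum(log_ratio[ch] for ch in seq))
    return acc / trials


if __name__ == "__main__":
    rng = random.Random(1)
    # Rare event: at least 15 letters 'A' in a sequence of length 20.
    print("direct MC:          ", direct_mc("A", 20, 15, 20000, rng))
    print("importance sampling:", importance_sampling("A", 20, 15, 20000, 9.0, rng))
```

With a tail probability this small, the direct estimate will typically be exactly zero at this number of trials, while the reweighted estimate remains informative. A letter-level tilt like this one is only a convenient illustration; it is not tailored to palindromes, weight-matrix patterns, or motif co-occurrences, and no efficiency guarantee is claimed for it.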
