Location of Repository

Sequencing by hybridization is a method of reconstructing a long DNA string — that is, figuring out its nucleotide sequence — from knowledge of its short substrings. Unique reconstruction is not always possible, and the goal of this paper is to study the number of reconstructions of a random string. For a given string, the number of reconstructions is determined by the pattern of repeated substrings; in an appropriate limit substrings will occur at most twice, so the pattern of repeats is given by a pairing: a string of length 2n in which each symbol occurs twice. A pairing induces a 2-in, 2-out graph, whose directed edges are defined by successive symbols of the pairing — for example the pairing ABBCAC induces the graph with edges AB, BB, BC, and so forth — and the number of reconstructions is simply the number of Euler circuits in this 2-in, 2-out graph. The original problem is thus transformed into one about pairings: to find the number fk(n) of n-symbol pairings having k Euler circuits. We show how to compute this function, in closed form, for any fixed k, and we present the functions explicitly for k=1,…,9. The key is a decomposition theorem: the Euler “circuit number” of a pairing is the product of the circuit numbers of “component” sub-pairings. These components come from connected components of the “interlace graph”, which has the pairing's symbols as vertices, and edges when symbols are “interlaced”. (A and B are interlaced if the pairing has the form ABAB or BABA.) We carry these results back to the original question about DNA strings, and provide a total variation distance upper bound for the approximation error. We perform an asymptotic enumeration of 2-in, 2-out digraphs to show that, for a typical random n-pairing, the number of Euler circuits is of order no smaller than 2n/n, and the expected number is asymptotically at least e−1/22n−1/n. Since any n-pairing has at most 2n−1 Euler circuits, this pinpoints the exponential growth rate

Topics:
QA Mathematics

Publisher: Elsevier

Year: 2000

DOI identifier: 10.1016/S0166-218X(00)00190-6

OAI identifier:
oai:eprints.lse.ac.uk:35505

Provided by:
LSE Research Online

Download PDF:To submit an update or takedown request for this paper, please submit an Update/Correction/Removal Request.