Comparing sequences with segment rearrangements
Abstract. Computational genomics involves comparing sequences based on "similarity" for detecting evolutionary and functional relationships. Until very recently, available portions of the human genome sequence (and that of other species) were fairly short and sparse. Most sequencing effort was focused on genes and other short units; similarity between such sequences was measured based on character-level differences. However, with the advent of whole genome sequencing technology, there is an emerging consensus that the measure of similarity between long genome sequences must capture the rearrangements of large segments found in abundance in the human genome. In this paper, we abstract the general problem of computing sequence similarity in the presence of segment rearrangements. This problem is closely related to computing the smallest grammar for a string or the block edit distance between two strings. Our problem, like these other problems, is NP-hard. Our main result here is a simple O(1)-factor approximation algorithm for this problem. In contrast, the best known approximations for the related problems are a factor of Ω(log n) off from the optimal. Our algorithm works in linear time, and in one pass. In proving our result, we relate sequence similarity measures based on different segment rearrangements to each other, tight up to constant factors.

1 Introduction

Similarity comparison between biomolecular sequences plays an important role in computational genomics due to the premise that sequence similarity usuall