Levenshtein distance is a commonly used edit distance metric, typically
applied in language processing and, to a lesser extent, in molecular biology
analysis. Biological nucleic acid sequences are often embedded in longer
sequences and are subject to insertion and deletion errors that introduce
frameshifts during sequencing. These frameshift errors are due to string context
and should not be counted as true biological errors. Sequence-Levenshtein
distance is a modification to Levenshtein distance that is permissive of
frameshift error without additional penalty. However, in a biological context,
Levenshtein distance needs to accommodate both frameshift and weighted errors,
which Sequence-Levenshtein distance cannot do. Errors are weighted when they
are associated with a numerical cost that corresponds to their frequency of
appearance. Here, we describe a modification that allows the use of Levenshtein
distance and Sequence-Levenshtein distance to appropriately accommodate
penalty-free frameshift between embedded sequences and correctly weight
specific error types.

Comment: 10 pages, 8 figures
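
As a rough illustration of the ideas above, the following Python sketch computes a weighted edit distance with an optional penalty-free end, in the spirit of Sequence-Levenshtein distance. It is not the modification described in this paper. The function name `edit_distance`, its parameters, and the per-operation costs are hypothetical, and the rule used for the frameshift-permissive case (taking the minimum over the last row and last column of the dynamic-programming table) is an assumption about how Sequence-Levenshtein distance is commonly computed.

```python
def edit_distance(a, b, sub_cost=1.0, ins_cost=1.0, del_cost=1.0, free_end=False):
    """Weighted edit distance between strings a and b.

    sub_cost, ins_cost, del_cost are illustrative per-operation weights
    (hypothetical values, not taken from the paper).
    free_end=True permits truncation or elongation at the end of either
    string at no cost, as assumed here for Sequence-Levenshtein distance.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0.0] * cols for _ in range(rows)]

    # First column: delete every character of the prefix of a.
    for i in range(1, rows):
        d[i][0] = d[i - 1][0] + del_cost
    # First row: insert every character of the prefix of b.
    for j in range(1, cols):
        d[0][j] = d[0][j - 1] + ins_cost

    for i in range(1, rows):
        for j in range(1, cols):
            match = 0.0 if a[i - 1] == b[j - 1] else sub_cost
            d[i][j] = min(
                d[i - 1][j - 1] + match,  # match or substitution
                d[i - 1][j] + del_cost,   # deletion from a
                d[i][j - 1] + ins_cost,   # insertion into a
            )

    if not free_end:
        return d[-1][-1]  # classic Levenshtein: bottom-right cell
    # Assumed frameshift-permissive rule: best cell in last row or last column.
    return min(min(d[-1]), min(row[-1] for row in d))


# A deleted base pulls the next context base ("C") into the read window.
print(edit_distance("ACGT", "AGTC"))                 # plain Levenshtein
print(edit_distance("ACGT", "AGTC", free_end=True))  # frameshift-permissive
```

In this sketch, the read "AGTC" stands for the sequence "ACGT" after one deletion, followed by a base from the surrounding context. The plain distance charges for both the deletion and the trailing context base, while the frameshift-permissive variant charges only for the deletion.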