Background {#Sec1}
==========

Introduction {#Sec2}
------------

In the string correction problem, we are given two strings *A* and *B* and are required to find the minimum number of edit operations needed to transform *A* into *B*. The permitted edit operations are: (a) substitute a character in *A* to a different character, (b) insert a character into *A*, (c) delete a character of *A*, and (d) transpose two adjacent characters of *A*. When all four edit operations are permitted, the length of the optimal edit sequence is known as the Damerau-Levenshtein (DL) distance \[[@CR1], [@CR2]\]. Some applications limit the permissible edit operations to a subset of the stated four operations. As a result, string correction has been studied using other distance metrics as well. For example, the Levenshtein distance \[[@CR1]\] is the length of the shortest sequence of substitutions, insertions, and deletions needed to transform *A* into *B*. This distance is used in the longest common subsequence problem \[[@CR3]\], for example. When only substitutions are allowed, the length of the minimum edit sequence is the Hamming distance \[[@CR4]\] and when only transpositions are allowed, this length is the Jaro distance \[[@CR5]\].

The cost of an edit sequence may be generalized by using weights for the various operations. For example, in sequence alignment using the methods of Needleman and Wunsch \[[@CR6]\] and Smith and Waterman \[[@CR7]\], transpositions are not permitted, the cost of a substitution depends on the two characters involved, and there is a gap penalty. The string-to-string correction algorithm of Lowrance and Wagner \[[@CR8]\] uses a cost of *S* for a substitution, *I* for an insertion, *D* for a deletion, and *T* for a transposition and requires 2*T*≥*I*+*D*. We note that the costs used in computing the DL distance are *S*=*I*=*D*=*T*=1 and that these costs satisfy the 2*T*≥*I*+*D* requirement of the algorithm of Lowrance and Wagner \[[@CR8]\]. In fact, the best algorithm currently known for the DL distance is the one in \[[@CR8]\] with edit operation costs set to 1.

Spelling error correction \[[@CR9]--[@CR11]\], data clustering and data mining \[[@CR12]\], comparing packet traces \[[@CR13]\], quantifying the similarity of DNA/RNA/protein sequences, gene finding, and gene function prediction \[[@CR14]\] are some of the applications of the DL distance. While, in spelling error correction, the strings *A* and *B* are relatively short, in other applications, these strings may be quite long. For example, the length of a protein sequence may exceed 300,000 \[[@CR15]\].

Bard \[[@CR10]\] has shown that the DL distance is a true metric; that is, it satisfies 1) non-negativity, 2) identity, 3) symmetry, and 4) triangle inequality. The algorithm of Bard \[[@CR10]\] computes the DL distance in *O*(*mn*∗ max{*m*,*n*}) time, where *m* is the length of string *A* and *n* is the length of *B*. This algorithm uses *O*(*mn*) space. Hyyro \[[@CR16]\] has developed a bit-parallel algorithm to determine whether the DL distance between two strings is less than a specified threshold. This bit-parallel algorithm was tested using DNA sequences of length up to 10,000.

In an effort to reduce time complexity, Oommen and Loke \[[@CR17]\] consider restricting edit sequences so that no substring is edited more than once. We illustrate this restriction using the example given in \[[@CR18]\]. The string CA may be transformed into ABC using the edit sequence CA (transposition) → AC (insertion) → ABC. So, the DL distance between CA and ABC is 2. With the restriction of \[[@CR17]\], the second operation in this edit sequence is not permitted as it involves re-editing AC, which resulted from the first edit operation. The restricted DL distance is 3, which corresponds to the restricted edit sequence CA (deletion) → A (insertion) → AB (insertion) → ABC. The restricted DL distance is not a metric as it does not satisfy the triangle inequality.

The algorithm of Lowrance and Wagner \[[@CR8]\] computes the DL distance in *O*(*mn*) time while also using *O*(*mn*) space. This is the fastest and most space efficient algorithm known for string correction using the DL distance.

Neither the algorithm of Bard \[[@CR10]\] nor that of Lowrance and Wagner \[[@CR8]\] is practical when *m* and *n* are large due to their excessive space requirement. The former algorithm becomes impractical also due to its excessive run time. In this paper, we focus on the development of algorithms that are more space, time, and energy efficient than that of Lowrance and Wagner \[[@CR8]\]. To obtain space efficiency, we observe that the DL distance can be computed by retaining only *O*(*sm*) or *O*(*sn*) data, where *s* is the size of the alphabet. We note that, when *m* and *n* are large, *s* is much smaller than *m* and *n*. In fact, *s*=4 for RNA and DNA sequences and *s*=20 for protein sequences and the length of these sequences is often orders of magnitude larger than *s*.

Cache model {#Sec3}
-----------

To analyze the cache performance of our algorithms, we use the rather simple cache model which has been used by us successfully in our past work \[[@CR19]\]. In this model we have a single-level cache that has *l* cache lines of size *w*, where *w* is the number of data items that can be stored in one cache line. So, when the data size is 4 bytes and *w*=8, each cache line is 32 bytes. The size (i.e., capacity) of our one-level cache is *lw*. In accordance with this cache model, we assume that main memory is divided into blocks whose size is the same as that of a cache line (i.e., *w* words each). When we attempt to read a piece of data that is not in the cache, a read miss occurs. A read miss causes the corresponding block of main memory to be read into a cache line. When the cache is full, this read miss requires us to first evict the block that is in the least recently used (LRU) cache line. This eviction results in a write of the evicted block to main memory in case the evicted block has changed. A write miss occurs when we attempt to write data that is not in a cache line. At this time, the corresponding block of main memory is read into a cache line and the data we wish to write is written to this cache line.

Notice that every read and write miss results in a read access of main memory; some read and write misses also result in the writing of a cache line to main memory.

Today's computers actually employ multiple levels of cache and a far more sophisticated and proprietary cache servicing policy combined with prefetching to hide memory latency. As a result, it is extremely difficult to analyze cache performance using a realistic cache model. The described simple cache model is amenable to analysis and our experiments establish its usefulness for this purpose as algorithms with reduced cache misses using this model actually run faster on computers with more sophisticated cache architectures, replacement policies, and prefetching techniques.

Classical DL distance algorithm {#Sec4}
-------------------------------

Wagner and Fischer \[[@CR20]\] developed the notion of a *trace*, which is useful in reasoning about edit sequences that are limited to substitutions, insertions, and deletions. Lowrance and Wagner \[[@CR8]\] extended this notion to include the transposition operation. A *trace* for the strings *A*=*a*~1~⋯*a*~*m*~ and *B*=*b*~1~⋯*b*~*n*~ is a set *T* of lines, where the endpoints *u* and *v* of a line (*u*,*v*) denote positions in *A* and *B*, respectively. A set of lines *T* is a trace iff: For every (*u*,*v*)∈*T*, *u*≤*m* and *v*≤*n*.The lines in *T* have distinct *A* positions and distinct *B* positions. That is, no two lines in *T* have the same *u* or the same *v*.

A line (*u*,*v*) is *balanced* iff *a*~*u*~=*b*~*v*~ and two lines (*u*~1~,*v*~1~) and (*u*~2~,*v*~2~) cross iff (*u*~1~\<*u*~2~) and (*v*~1~\>*v*~2~). As an example, consider *A*=*dafac* and *B*=*fdbbec*. The set of lines *T*={(1,2),(3,1),(4,3),(5,6)} satisfies the requirements for a trace. Line (4,3) is not balanced as *a*~4~≠*b*~3~. The remaining 3 lines in the trace are balanced. The lines (1,2) and (3,1) cross. This trace may be depicted as a diagram as in Fig. [1](#Fig1){ref-type="fig"}. Fig. 1DL trace example

In a trace, an unbalanced line denotes a substitution operation and a balanced line denotes retaining the character of *A*. If *a*~*i*~ has no line attached to it, *a*~*i*~ is to be deleted and when *b*~*j*~ has no attached line, it is to be inserted. When two balanced lines (*u*~1~,*v*~1~) and (*u*~2~,*v*~2~) cross, $\documentclass[12pt]{minimal}
                \usepackage{amsmath}
                \usepackage{wasysym} 
                \usepackage{amsfonts} 
                \usepackage{amssymb} 
                \usepackage{amsbsy}
                \usepackage{mathrsfs}
                \usepackage{upgreek}
                \setlength{\oddsidemargin}{-69pt}
                \begin{document}$a_{u_{1}+1} \cdots a_{u_{2}-1}$\end{document}$ are to be deleted from *A* making $\documentclass[12pt]{minimal}
                \usepackage{amsmath}
                \usepackage{wasysym} 
                \usepackage{amsfonts} 
                \usepackage{amssymb} 
                \usepackage{amsbsy}
                \usepackage{mathrsfs}
                \usepackage{upgreek}
                \setlength{\oddsidemargin}{-69pt}
                \begin{document}$a_{u_{1}}$\end{document}$ and $\documentclass[12pt]{minimal}
                \usepackage{amsmath}
                \usepackage{wasysym} 
                \usepackage{amsfonts} 
                \usepackage{amssymb} 
                \usepackage{amsbsy}
                \usepackage{mathrsfs}
                \usepackage{upgreek}
                \setlength{\oddsidemargin}{-69pt}
                \begin{document}$a_{u_{2}}$\end{document}$ adjacent, then $\documentclass[12pt]{minimal}
                \usepackage{amsmath}
                \usepackage{wasysym} 
                \usepackage{amsfonts} 
                \usepackage{amssymb} 
                \usepackage{amsbsy}
                \usepackage{mathrsfs}
                \usepackage{upgreek}
                \setlength{\oddsidemargin}{-69pt}
                \begin{document}$a_{u_{1}}$\end{document}$ and $\documentclass[12pt]{minimal}
                \usepackage{amsmath}
                \usepackage{wasysym} 
                \usepackage{amsfonts} 
                \usepackage{amssymb} 
                \usepackage{amsbsy}
                \usepackage{mathrsfs}
                \usepackage{upgreek}
                \setlength{\oddsidemargin}{-69pt}
                \begin{document}$a_{u_{2}}$\end{document}$ are to be transposed, and finally, $\documentclass[12pt]{minimal}
                \usepackage{amsmath}
                \usepackage{wasysym} 
                \usepackage{amsfonts} 
                \usepackage{amssymb} 
                \usepackage{amsbsy}
                \usepackage{mathrsfs}
                \usepackage{upgreek}
                \setlength{\oddsidemargin}{-69pt}
                \begin{document}$b_{v_{2}+1} \cdots b_{v_{1}-1}$\end{document}$ are to be inserted between the just transposed characters of *A*.

The edit sequence corresponding to the trace of Fig. [1](#Fig1){ref-type="fig"} is delete *a*~2~, transpose *a*~1~ and *a*~3~, substitute *b* for *a*~4~, insert *b*~4~=*b* and *b*~5~=*e*, retain *a*~5~. The cost of this edit sequence is 5.

Lowrance and Wagner \[[@CR8]\] have proved the following properties: The cost of a trace equals the number of unbalanced lines plus the number of positions in *A* and *B* not touched by a line plus the number of line crossings.There is a trace whose cost equals that of an optimal edit sequence (Theorem 2 of \[[@CR8]\]). Since every trace corresponds to an edit sequence, it follows that the edit sequence that corresponds to a minimum cost trace is optimal.There is a minimum cost trace in which each line crosses at most one other line and in which every line that crosses another is balanced (Theorem 4 of \[[@CR8]\]).There is trace *T* that satisfies property P3 and for every pair of crossing lines (*u*~1~,*v*~1~), (*u*~2~,*v*~2~), *u*~1~\<*u*~2~ in *T*, (a) $\documentclass[12pt]{minimal}
                \usepackage{amsmath}
                \usepackage{wasysym} 
                \usepackage{amsfonts} 
                \usepackage{amssymb} 
                \usepackage{amsbsy}
                \usepackage{mathrsfs}
                \usepackage{upgreek}
                \setlength{\oddsidemargin}{-69pt}
                \begin{document}$a_{i} \neq a_{u_{1}} = b_{v_{1}}\phantom {\dot {i}\!}$\end{document}$, *u*~1~\<*i*\<*u*~2~ and (b) $\documentclass[12pt]{minimal}
                \usepackage{amsmath}
                \usepackage{wasysym} 
                \usepackage{amsfonts} 
                \usepackage{amssymb} 
                \usepackage{amsbsy}
                \usepackage{mathrsfs}
                \usepackage{upgreek}
                \setlength{\oddsidemargin}{-69pt}
                \begin{document}$b_{j} \neq b_{v_{2}} = a_{u_{2}}\phantom {\dot {i}\!}$\end{document}$, *v*~2~\<*j*\<*v*~1~. In words, *u*~1~ is the last (i.e., rightmost) occurrence of $\documentclass[12pt]{minimal}
                \usepackage{amsmath}
                \usepackage{wasysym} 
                \usepackage{amsfonts} 
                \usepackage{amssymb} 
                \usepackage{amsbsy}
                \usepackage{mathrsfs}
                \usepackage{upgreek}
                \setlength{\oddsidemargin}{-69pt}
                \begin{document}$b_{v_{1}}$\end{document}$ in *A* that precedes position *u*~2~ of *A* and *v*~2~ is the last occurrence of $\documentclass[12pt]{minimal}
                \usepackage{amsmath}
                \usepackage{wasysym} 
                \usepackage{amsfonts} 
                \usepackage{amssymb} 
                \usepackage{amsbsy}
                \usepackage{mathrsfs}
                \usepackage{upgreek}
                \setlength{\oddsidemargin}{-69pt}
                \begin{document}$a_{u_{2}}$\end{document}$ in *B* that precedes position *v*~1~ of *B*. We refer to these positions as $\documentclass[12pt]{minimal}
                \usepackage{amsmath}
                \usepackage{wasysym} 
                \usepackage{amsfonts} 
                \usepackage{amssymb} 
                \usepackage{amsbsy}
                \usepackage{mathrsfs}
                \usepackage{upgreek}
                \setlength{\oddsidemargin}{-69pt}
                \begin{document}$lastA[\!u_{2}][\!b_{v_{1}}]\phantom {\dot {i}\!}$\end{document}$ and $\documentclass[12pt]{minimal}
                \usepackage{amsmath}
                \usepackage{wasysym} 
                \usepackage{amsfonts} 
                \usepackage{amssymb} 
                \usepackage{amsbsy}
                \usepackage{mathrsfs}
                \usepackage{upgreek}
                \setlength{\oddsidemargin}{-69pt}
                \begin{document}$\phantom {\dot {i}\!}lastB[\!v_{1}][\!a_{u_{2}}]$\end{document}$, respectively (Theorem 5 of \[[@CR8]\]).

Let *H*~*ij*~ be the DL distance between *A*\[ 1:*i*\] to *B*\[ 1:*j*\]. So, *H*~*mn*~ is the DL distance between *A* and *B*. The following dynamic programming recurrence follows from properties P1-P4 of a trace. $$\documentclass[12pt]{minimal}
                \usepackage{amsmath}
                \usepackage{wasysym} 
                \usepackage{amsfonts} 
                \usepackage{amssymb} 
                \usepackage{amsbsy}
                \usepackage{mathrsfs}
                \usepackage{upgreek}
                \setlength{\oddsidemargin}{-69pt}
                \begin{document} $$ H_{i,0} = i,\ H_{0,j} = j, \ 0 \le i \le m, \ 0 \le j \le n  $$ \end{document}$$

When *i*\>0 and *j*\>0, $$\documentclass[12pt]{minimal}
                \usepackage{amsmath}
                \usepackage{wasysym} 
                \usepackage{amsfonts} 
                \usepackage{amssymb} 
                \usepackage{amsbsy}
                \usepackage{mathrsfs}
                \usepackage{upgreek}
                \setlength{\oddsidemargin}{-69pt}
                \begin{document} $$ {}H_{i,j} \,=\, \min\left\{\! \begin{array} {lcr} H_{i-1,j-1}+ c(a_{i},b_{j}) \\ H_{i,j-1}+ 1 \\ H_{i-1,j}+ 1 \\ H_{k-1,l-1} + (i-k-1)+ 1 + (j-l-1) \\ \end{array} \right.   $$ \end{document}$$

where *c*(*a*~*i*~,*b*~*j*~) is 1 if *a*~*i*~≠*b*~*j*~ and 0 otherwise, *k*=*lastA*\[ *i*\]\[ *b*~*j*~\] and *l*=*lastB*\[ *j*\]\[ *a*~*i*~\]. If *k* or *l* do not exist, then case 4 of the recurrence does not apply.

Figure [2](#Fig2){ref-type="fig"} illustrates the four cases of this recurrence. These cases correspond to the four possibilities for an optimal trace that transforms *A*\[ 1:*i*\] into *B*\[ 1:*j*\] and satisfies properties P2-P4. Such a trace may (a) contain the line (*i*,*j*), (b) contain no line that touches *b*~*j*~, (c) contain no line that touches *a*~*i*~, or (d) have crossing balanced lines that involve *a*~*i*~ and *b*~*j*~. Figure [2](#Fig2){ref-type="fig"}a illustrates the first case, which is a substitution between *a*~*i*~ and *b*~*j*~; we optimally transform *A*\[ 1:*i*−1\] into *B*\[ 1:*j*−1\] and then substitute *b*~*j*~ for *a*~*i*~. If *a*~*i*~=*b*~*j*~, the substitution cost is 0, otherwise it is 1. Figure [2](#Fig2){ref-type="fig"}b shows the second case. Here, *b*~*j*~ is inserted at the end of *B*\[ 1:*j*−1\] following an optimal transformation of *A*\[ 1:*i*\] into *B*\[ 1:*j*−1\]. Figure [2](#Fig2){ref-type="fig"}c shows the third case in which *a*~*i*~ is deleted from *A*\[1:*i*\] following an optimal transformation of *A*\[ 1:*i*−1\] into *B*\[ 1:*j*\]. Figure [2](#Fig2){ref-type="fig"}d shows the case of crossing balanced lines (*i*,*l*) and (*k*,*j*). Here, *A*\[ 1:*k*−1\] must be optimally transformed into *B*\[ 1:*l*−1\]. Note that to perform the crossing operation, we must delete *i*−*k*−1 characters from *A*, do an adjacent character transposition in *A*, and then insert *j*−*l*−1 characters from *B* between the two just transposed positions. So, the cost is (*i*−*k*−1)+1+(*j*−*l*−1). Fig. 2DL trace recurrence. **a** substitution **b** insertion **c** deletion **d** translate A\[k:i\] to B\[l:j\] where (a ~k~,b ~j~) and (b ~l~,a ~i~) form a transposition opportunity

Algorithm 1 is the pseudocode to compute *H* using Eqs. [1](#Equ1){ref-type=""} and [2](#Equ2){ref-type=""}. This is a simplification of the pseudocode given in Lowrance and Wagner \[[@CR8]\] to the case when each edit operation has unit cost. In this algorithm, *last*\_*row*\_*id*\[*c*\] keeps track of the last occurrence of character *c* in *A* (note that this is a row index of *H*) and *last*\_*col*\_*id* keeps track of the last occurrence of *a*~*i*~ in *B*.

We shall refer to Algorithm 1 as algorithm *DL*. Its time and space complexities are readily seen to be *O*(*mn*). Once *H* has been computed using algorithm *DL*, an optimal trace may be obtained in *O*(*m*+*n*) additional time using a standard dynamic programming traceback. We refer to the combination of *DL* and the traceback as algorithm *DL*\_*TRACE*.

The total number of cache misses is dominated by the read and write misses of the array *H*. So, we count only these misses. In each iteration of the loop for computing row *i* of *H*, we need the elements of rows *i* and *i*−1 of *H* in left-to-right order as in Algorithm 1 lines 9-11 and 14. Since these rows are read from main memory in blocks of size *w* and row *i* is written to main memory in blocks of this size, lines 9-11 and 14 result in 2*n*/*w* read accesses and *n*/*w* write accesses for each *i*. These lines, therefore, result in 3*mn*/*w* cache misses over the entire execution of *DL*. Line 13 makes one read access of *H* per iteration and so contributes at most *mn* to the total cache-miss count. Hence, the cache-miss count for algorithm *DL* is approximately *mn*(1+3/*w*).

Methods {#Sec5}
=======

Single-core algorithms {#Sec6}
----------------------

In this section, we develop four linear-space single-core algorithms for string correction using the DL distance. All four run in *O*(*mn*) time. The first two (*LS*\_*DL* and *Strip*\_*DL*) compute only the score *H*~*mn*~ of the optimal trace; they differ in their cache efficiency. The last two (*LSDL*\_*TRACE* and *Strip*\_*TRACE*) compute an optimal trace.

### The linear space algorithm *LS*\_*DL* {#Sec7}

Let *s* be the size of the alphabet. Instead of using the array *H* used in *DL*, algorithm *LS*\_*DL* uses a one-dimensional array *U*\[ −1:*n*\] and a two-dimensional array *T*\[ 1:*s*\]\[ −1:*n*\]. These two arrays have a space requirement of *O*((*s*+1)*n*) = *O*(*n*) for constant *s*. When *m*\<*n*, one may swap *A* and *B* to reduce the required memory. Adding the memory needed for *A* and *B*, the space complexity is *O*(*s* min{*m*,*n*}+*m*+*n*) = *O*(*m*+*n*) when *s* is a constant.

As in algorithm *DL*, the *H*~*ij*~ values are computed by rows. The one-dimensional array *U* is used to save the *H*\[ *i*\]\[ ∗\] values computed by algorithm *DL* when row *i* is being computed. Let *H*\[ *w*\]\[ ∗\] be the last row computed for character *c*. Then, *T*\[ *c*\]\[ ∗\] is row *w*−1 of *H*. Algorithm 2 gives the pseudocode for *LS*\_*DL*. Its correctness follows from the correctness of algorithm *DL*. Note that *swap*(*T*\[*A*\[ *i*\]\],*U*) takes *O*(1) time as pointers to 2 one-dimensional arrays are swapped rather than the content of these arrays. The cache-miss count for *LS*\_*DL* is the same as that for *DL* when *n* is suitably large as both have the same data access pattern. However, for smaller instances *LS*\_*DL* will exhibit much better cache behavior. For example, because of its use of much less memory, we may have enough LLC cache to store all the data in *LS*\_*DL* but not in *DL* (*O*(*sn*) vs *O*(*mn*)).

### The cache-efficient linear-space algorithm *Strip*\_*DL* {#Sec8}

When (*s*+1)*n* is larger than the size of the LLC cache, we may reduce cache misses relative to algorithm *LS*\_*DL* by computing *H*~*ij*~ by strips of width *q*, for some *q* less than *n* (the last strip may have a width smaller than *q*). This is shown in Fig. [3](#Fig3){ref-type="fig"}. The strips are computed in the order 0, 1,\... using algorithm *LS*\_*DL*. However, the space needed by *T* and *U* in *LS*\_*DL* is reduced to (*s*+1)*q* as the strip width is *q* rather than *n*. By choosing *q* small enough, we can ensure that blocks of the *T* and *U* arrays used by *LS*\_*DL* are not evicted from cache once they are brought in. So, if each entry of *T* and *U* takes 1 word, then when the cache size is *lw*, we have *q*\<*lw*/(*s*+1). Note that, in addition to *T* and *U*, the cache needs to hold partials of *A*, *B* and other arrays needed to pass the data from one strip to the next. Fig. 3Computing *H* by strips

To pass the data from one strip to next, we use an additional one-dimensional array *strip* of size *m* and a two-dimensional *s*∗*m* array *V*. The array *strip* records the values of *H* computed for the rightmost column in the strip. *V*\[ *c*\]\[*i*\] gives the *H* value in the rightmost column *j* of row *i* of *H* that is (a) in a strip to the left of the one currently being computed and (b) *c*=*B*\[ *j*\].

The pseudocode for *Strip*\_*DL* is given in Algorithm 3. For clarity, this pseudocode uses two *strip* arrays (lines 18 and 30) and two *V* arrays (lines 24 and 32). One set of arrays is used to fetch data calculated for the previous strip and the other set for data that is to be passed to the next strip. In the actual implementation, we use a single *strip* array and a single *V* array overwriting values received from the previous strip with values to be passed to the next strip.

The time and complexity of *Strip*\_*DL* are, respectively, *O*(*mn*) and *O*((*s*+1)*m*+(*s*+1)*q*+*n*) = *O*(*sm*+*sq*+*n*) = *O*(*sm*+*n*) as *q* is a constant. When *m*\>*n*, we may switch *A* and *B* to conserve memory and so the space complexity becomes *O*(*s* min{*m*,*n*}+*m*+*n*) = *O*(*m*+*n*) for constant *s*.

When we analyze the cache miss, we note that *q* is chosen such that *U* and *T* fit into cache. We make the reasonable assumption that the LRU replacement rule does not cause any block of *U* or *T* to be evicted during the running of algorithm *Strip*\_*DL*. As a result, the total number of cache misses due to *U* and *T* is independent of *m* and *n* and so may be ignored in the analysis. The initialization of *strip* and *V* results in *m*/*w* and (*s*+1)*m*/*w* read accesses, respectively. The number of write accesses is approximately the same as the number of read accesses. The computation for each strip accesses the array *strip* in ascending order of index. This results in (approximately) the same number of cache misses as made during the initialization phase. Hence, the total number of cache misses due to *strip* is approximately (2*m*/*w*)(*n*/*q*+1). For *V*, we note that when computing the current strip, the elements in any row of *V* are accessed in non-decreasing order of index (i.e., from left to right) and that we need to retain, in cache, only the most recently read value for each character of the alphabet (i.e., at most *s* values are to be retained). Making the assumption that a *V* value is evicted from cache only when a new value for the same character is accessed, the total number of read misses from *V* when computing a single strip is *sm*/*w*. The number of write misses is approximately the same. So, *V* contributes (2*sm*/*w*)(*n*/*q*+1). Hence, the total number of cache misses for algorithm *Strip*\_*DL* is ≈2(*s*+1)*mn*/(*wq*) when *m* and *n* are large.

Recall that the approximate cache-miss count for algorithms *DL* and *LS*\_*DL* is *mn*(1+3/*w*). This is (*wq*+3*q*)/(2*s*+2) times that for *Strip*\_*DL*.

### The linear-space trace algorithm *LSDL*\_*TRACE* {#Sec9}

Although algorithms *LS*\_*DL* and *Strip*\_*DL* determine the score (cost) of an optimal trace (and hence of an optimal edit sequence) that transforms *A* into *B*, these algorithms do not save enough information to actually determine an optimal trace. To determine an optimal trace using linear space, we adopt a divide-and-conquer strategy similar to that used by Hirschberg \[[@CR21]\] for the simple string editing problem (i.e., transpositions are not permitted) and Myers and Miller \[[@CR22]\] for the sequence alignment problem.

We say that a trace has a *center crossing* iff it contains two lines (*u*~1~,*v*~1~) and (*u*~2~,*v*~2~), *u*~1~\<*u*~2~ such that *v*~1~\>*n*/2 and *v*~2~≤*n*/2 (Fig. [4](#Fig4){ref-type="fig"}). Fig. 4DL trace splitting opportunities. **a** No center crossing **b** With center crossing

Let *T* be an optimal trace that satisfies properties P2-P4. If *T* contains no center crossing, then its lines may be partitioned into sets *TL* and *TR* such that *TL* contains all lines (*u*,*v*)∈*T* with *v*≤*n*/2 and *TR* contains the remaining lines (Fig. [4](#Fig4){ref-type="fig"}a). Since there is no center crossing, all lines in *TR* have a *u* value greater than the *u* value of every line in *TL*. It follows from properties P2-P4 that there is an *i*, 1≤*i*≤*m* such that *T* is the union of an optimal trace for *A*\[ 1:*i*\] and *B*\[ 1:*n*/2\] and that for *A*\[ *i*+1:*m*\] and *B*\[ *n*/2+1:*n*\]. Let *H*\[ *i*\] be the cost the former optimal trace and *H*^′^\[ *i*+1\] that of the latter optimal trace. We see that when *T* has no center crossing, the cost of *T* is $$\documentclass[12pt]{minimal}
                \usepackage{amsmath}
                \usepackage{wasysym} 
                \usepackage{amsfonts} 
                \usepackage{amssymb} 
                \usepackage{amsbsy}
                \usepackage{mathrsfs}
                \usepackage{upgreek}
                \setlength{\oddsidemargin}{-69pt}
                \begin{document} $$ costNoCC(T) = \min_{1 \le i \le m}\{ H[\!i] + H'[\!i+1]\}  $$ \end{document}$$

When *T* contains a center crossing, its lines may be partitioned into 3 sets, *TL*, *TM*, and *TR*, as shown in Fig. [4](#Fig4){ref-type="fig"}b. Let (*u*~1~,*v*~1~) and (*u*~2~,*v*~2~) be the lines defining the center crossing. Note that *TL* contains all lines of *T* with *v*\<*v*~2~, *TR* contains all lines with *v*\>*v*~1~, and *TM*={(*u*~1~,*v*~1~),(*u*~2~,*v*~2~)}. Note also that all lines in *TL* have a *u*\<*u*~1~ and all in *TR* have *u*\>*u*~2~. From property P1, it follows that *TL* is an optimal trace for *A*\[ 1:*u*~1~−1\] and *B*\[ 1:*v*~2~−1\] and *TR* is an optimal trace for *A*\[ *u*~2~+1:*m*\] and *B*\[ *v*~1~+1:*n*\]. Further, since (*u*~1~,*v*~1~) and (*u*~2~,*v*~2~) are balanced lines, the cost of *TM* is (*u*~2~−*u*~1~−1)+1+(*v*~1~−*v*~2~−1). Also, *A*\[ *u*~1~\]≠*A*\[ *u*~2~\] as otherwise, replacing the center-crossing lines with (*u*~1~,*v*~2~) and (*u*~2~,*v*~1~) results in a lower cost trace. From property P4, we know that $\documentclass[12pt]{minimal}
                \usepackage{amsmath}
                \usepackage{wasysym} 
                \usepackage{amsfonts} 
                \usepackage{amssymb} 
                \usepackage{amsbsy}
                \usepackage{mathrsfs}
                \usepackage{upgreek}
                \setlength{\oddsidemargin}{-69pt}
                \begin{document}$u_{1} = lastA[\!u_{2}][\!b_{v_{1}}]\phantom {\dot {i}\!}$\end{document}$ and $\documentclass[12pt]{minimal}
                \usepackage{amsmath}
                \usepackage{wasysym} 
                \usepackage{amsfonts} 
                \usepackage{amssymb} 
                \usepackage{amsbsy}
                \usepackage{mathrsfs}
                \usepackage{upgreek}
                \setlength{\oddsidemargin}{-69pt}
                \begin{document}$v_{2} = lastB[\!v_{1}][\!a_{u_{2}}]\phantom {\dot {i}\!}$\end{document}$. Let *H*\[ *i*\]\[ *j*\] be the cost of an optimal trace for *A*\[ 1:*i*\] and *B*\[ 1:*j*\] and let *H*^′^\[ *i*\]\[ *j*\] be that for an optimal trace for *A*\[ *i*:*m*\] and *B*\[ *j*:*n*\]. So, when *T* has a center crossing, its cost is $$\documentclass[12pt]{minimal}
                \usepackage{amsmath}
                \usepackage{wasysym} 
                \usepackage{amsfonts} 
                \usepackage{amssymb} 
                \usepackage{amsbsy}
                \usepackage{mathrsfs}
                \usepackage{upgreek}
                \setlength{\oddsidemargin}{-69pt}
                \begin{document} $$  {\begin{aligned} costCC(T) \,=\,& \min\{H[\!u_{1}\,-\,1][\!v_{2}-1] + H'[\!u_{2}\!+1][\!v_{1}+1] \\ &+ (u_{2}-u_{1}-1) + 1 + (v_{1}-v_{2}-1)\} \end{aligned}}  $$ \end{document}$$

where, for the min{}, we try 1≤*u*~1~\<*m* and for each such *u*~1~, we set *v*~1~ to be the smallest *i*\>*n*/2 for which $\documentclass[12pt]{minimal}
                \usepackage{amsmath}
                \usepackage{wasysym} 
                \usepackage{amsfonts} 
                \usepackage{amssymb} 
                \usepackage{amsbsy}
                \usepackage{mathrsfs}
                \usepackage{upgreek}
                \setlength{\oddsidemargin}{-69pt}
                \begin{document}$b_{i} = a_{u_{1}}\phantom {\dot {i}\!}$\end{document}$. For each *u*~1~ we examine all characters other than $\documentclass[12pt]{minimal}
                \usepackage{amsmath}
                \usepackage{wasysym} 
                \usepackage{amsfonts} 
                \usepackage{amssymb} 
                \usepackage{amsbsy}
                \usepackage{mathrsfs}
                \usepackage{upgreek}
                \setlength{\oddsidemargin}{-69pt}
                \begin{document}$\phantom {\dot {i}\!}a_{u_{1}}$\end{document}$ in the alphabet. For each such character *c*, *v*~2~ is set to the largest *j*≤*n*/2 for which *b*~*j*~=*c* and *u*~2~ is the smallest *i*\>*u*~1~ for which *a*~*i*~=*c*. So, the min is taken over (*s*−1)*m* terms.

Let *U*~*top*~ and *T*~*top*~ be the final *U* and *T* arrays computed by *LS*\_*DL* with inputs *B*\[ 1:*n*/2\] and *A*\[ 1:*m*\] and *U*~*bot*~ and *T*~*bot*~ be these arrays when the inputs are the reverse of *B*\[ *n*/2+1\] and *A*\[*m*:1\]. From these arrays, we may readily determine the *H* and *H*^′^ values needed to evaluate Eqs. [3](#Equ3){ref-type=""} and [4](#Equ4){ref-type=""}. Algorithm *LSDL*\_*TRACE* (Algorithm 4) provides the pseudocode for our linear space computation of an optimal trace. It assumes that *LS*\_*DL* has been modified to return both the arrays *U* and *T*.

For the time complexity, we see that at the top level of the recursion, we invoke *LS*\_*DL* twice with strings *A* and *B* of size *m* and *n*/2, respectively. This takes at most *amn* time for some constant *a*. The time required to compute Eqs. [3](#Equ3){ref-type=""} and [4](#Equ4){ref-type=""} is *O*(*sn*) and may be absorbed into *amn* by using a suitably large constant *a*. At the next level of recursion, *LS*\_*DL* is invoked 4 times. The sum of the lengths of the *A* strings across these 4 invocations is at most 2*m* and the *B* string has length at most *n*/4. So, the time for these four invocations is at most *amn*/2. Generalizing to the remaining levels of recursion, we see that algorithm *LSDL*\_*TRACE* takes *amn*(1+1/2+1/4+1/8+...)\<2*amn*=*O*(*mn*) time. The space needed is the same as that for *LS*\_*DL* (note that the parameters to this algorithm have been switched). From the time analysis, it follows that the number of cache misses is approximately twice that for *LS*\_*DL* when invoked with strings of size *m* and *n*. Hence the approximate cache miss count for *LSDL*\_*TRACE* is 2*mn*(1+3/*w*).

We note that some reduction in actual run time can be achieved by switching *A* and *B* when *A* is shorter than *B* thus ensuring that the shorter string is split at each level of recursion. This enables us to get the recursion terminates faster.

### The strip trace algorithm *Strip*\_*TRACE* {#Sec10}

This algorithm differs from *LSDL*\_*TRACE* in that it uses a modified version of *Strip*\_*DL* rather than a modified version of *LS*\_*DL*. The modified version of *Strip*\_*DL* returns the arrays *strip* and *V* computed by *Strip*\_*DL*. Correspondingly, *Strip*\_*TRACE* uses *V*~*top*~ and *V*~*bot*~ in place of *T*~*top*~ and *T*~*bot*~. The asymptotic time complexity of *Strip*\_*TRACE* is also *O*(*mn*) and it takes the same amount of space as does *Strip*\_*DL* (note that the parameters to *Strip*\_*DL* are switched relative to those for *Strip*\_*TRACE*). The number of cache misses is approximately twice that for *Strip*\_*DL*.

Multi-core algorithms {#Sec11}
---------------------

In this section, we describe our parallelizations of algorithm *DL* and the four single-core algorithms of previous section. These parallelizations assume that the number of processors is small relative to string length. The naming convention we adopt for the parallel versions is adding *PP*\_ as a prefix to the name of the single-core algorithm.

### The algorithm *PP*\_*DL* {#Sec12}

Our parallel version of algorithm *DL*, *PP*\_*DL*, computes the elements in the same order as does *DL*. However, it starts the computation of a row before the computation of its preceding row is complete. Each processor is assigned a unique row to compute and it computes this row from left to right. Let *p* be the number of processors. Processor *z* is initially assigned to do the outer loop computation for *i*=*z*, 1≤*i*≤*p*. Processor *z* begins after a suitable time lag relative to the start of processor *z*−1 so that the data it needs for its computation have already been computed by processor *z*−1. In our code, the time lag between the start of the computation of two consecutive rows is the time needed to compute *n*/*p* elements. Upon completion of its iteration *i* computation, the processor proceeds to iteration *i*+*p* of the outer loop. The time complexity of *PP*\_*DL* is *O*(*mn*/*p*).

### The algorithm *PP*\_*LS*\_*DL* {#Sec13}

While the general parallelization strategy for *PP*\_*LS*\_*DL* is the same as that used in *PP*\_*DL*, extra care is needed to ensure a computation identical to that of *LS*\_*DL*. Divergence in results is possible when two or more processors are simultaneously computing different rows of *H* using the same memory. This happens for example when *A*=*aaabc*⋯ and *p*≥3. We start with processor *i* assigned to compute row *i* of *H*, 1≤*i*≤*p*. Suppose that *U*=*x* and *T*\[ *a*\]=*y* initially (note that *x* and *y* are addresses in memory). Because of the *swap*(*T*\[ *A*\[ *i*\]\],*U*) statement in *LS*\_*DL*, processor 1 begins to compute row 1 of *H* using memory beginning at the address *y*. If processor 2 begins with a suitable time lag as in *PP*\_*DL*, it will compute row 2 of *H* using memory beginning at the address *x*. With a further lag, processor 3 will begin to compute row 3 of *H* again using memory beginning at the address *y*. Now, both processors 1 and 3 are using the same memory to compute different rows of *H* and so we run the risk of overwriting *H* values that may be needed for subsequent computations. As another example, consider *A*=*ababa*⋯ and *p*≥4. Suppose that *U*=*x* and *T*\[ *a*,*b*\]=\[ *y*,*z*\] initially. Processor 1 begins to compute row 1 using the memory *y*, then, with a lag, processor 2 begins to compute row 2 using memory *z*, then processor 3 starts to compute row 3 using memory *x*. Next processor 4 begins to compute row 4 using memory *y*. At this time processor 1 is computing row 1 with *A*\[ 1\]=*a* and processor 4 is computing row 4 with *A*\[ 4\]=*b* and both processors are using the same row memory *y*.

Let *p*~1~ and *p*~2~ be two processors that are using the same memory to compute rows *r*~1~\<*r*~2~ of *H* and that no processor is using this memory to compute a row between *r*~1~ and *r*~2~. From the swapping assignment scheme used in *LS*\_*DL*, it follows that *p*~1~ is computing the row $\documentclass[12pt]{minimal}
                \usepackage{amsmath}
                \usepackage{wasysym} 
                \usepackage{amsfonts} 
                \usepackage{amssymb} 
                \usepackage{amsbsy}
                \usepackage{mathrsfs}
                \usepackage{upgreek}
                \setlength{\oddsidemargin}{-69pt}
                \begin{document}$r_{1} = lastA[\!r_{2}][\!a_{r_{2}}]-1\phantom {\dot {i}\!}$\end{document}$. The *H* values in this row are needed to compute rows *r*~1~+1 through *r*~2~ as $\documentclass[12pt]{minimal}
                \usepackage{amsmath}
                \usepackage{wasysym} 
                \usepackage{amsfonts} 
                \usepackage{amssymb} 
                \usepackage{amsbsy}
                \usepackage{mathrsfs}
                \usepackage{upgreek}
                \setlength{\oddsidemargin}{-69pt}
                \begin{document}$\phantom {\dot {i}\!}r_{1} = lastA[\!i][\!a_{r_{2}}] r_{1} < i \le r_{2}$\end{document}$. These values are not needed for rows *i*\>*r*~2~ as for these rows $\documentclass[12pt]{minimal}
                \usepackage{amsmath}
                \usepackage{wasysym} 
                \usepackage{amsfonts} 
                \usepackage{amssymb} 
                \usepackage{amsbsy}
                \usepackage{mathrsfs}
                \usepackage{upgreek}
                \setlength{\oddsidemargin}{-69pt}
                \begin{document}$\phantom {\dot {i}\!}lastA[\!i][\!a_{r_{2}}] = r_{2} > r_{1} + 1 = lastA[\!r_{2}][\!a_{r_{2}}]$\end{document}$. Let *j*~1~ be such that $\documentclass[12pt]{minimal}
                \usepackage{amsmath}
                \usepackage{wasysym} 
                \usepackage{amsfonts} 
                \usepackage{amssymb} 
                \usepackage{amsbsy}
                \usepackage{mathrsfs}
                \usepackage{upgreek}
                \setlength{\oddsidemargin}{-69pt}
                \begin{document}$\phantom {\dot {i}\!}b_{j} = a_{r_{2}} = a_{r_{1} + 1}$\end{document}$. Then, for *j*\>*j*~1~, $\documentclass[12pt]{minimal}
                \usepackage{amsmath}
                \usepackage{wasysym} 
                \usepackage{amsfonts} 
                \usepackage{amssymb} 
                \usepackage{amsbsy}
                \usepackage{mathrsfs}
                \usepackage{upgreek}
                \setlength{\oddsidemargin}{-69pt}
                \begin{document}$\phantom {\dot {i}\!}lastB[\!j][\!a_{r_{2}}] \ge j_{1}$\end{document}$. Hence, for *j*\>*j*~1~ columns 1 through *j*~1~−2 of row *r*~1~ are not needed to compute an *H* in rows between *r*~1~ and *r*~2~.

Our parallel code uses a synchronization scheme that is based on the observations of the preceding paragraph to delay the overwriting of values that are needed for later computations and ensure a correct computation of the DL distance. Our synchronization scheme employs another array *W*\[ 1:*n*\] that is initialized to 1. Suppose that a processor is computing row *i* of *H* and that *A*\[ *i*\]=*a*. When this processor first encounters an *a* in *B*, say at position *j*~1~, it increments *W*\[0:*j*~1~−2\]. When the next *a* isencountered, say at *j*~2~, it increments *W*\[ *j*~1~−1:*j*~2~−2\] by 1. When the processor finishes its computation of row *i*, the remaining positions of *W* are incremented by 1. The processor assigned to compute row *q* of *H* may compute *U*\[ *j*\] iff *W*\[ *j*\]=*q*. From our earlier observations, it follows that when *W*\[ *j*\]=*q*, the old values in memory positions *U*\[ 1:*j*\] may be overwritten as these are not needed for future computations.

This *p*-processor algorithm *PP*\_*LS*\_*DL*'s time complexity depends on the data sets as the synchronization delay is data dependent. We, however, expect a run-time performance of approximately *O*(*mn*/*p*) when the characters in *B* are roughly uniformly distributed.

### The algorithm *PP*\_*Strip*\_*DL* {#Sec14}

In the parallel version *PP*\_*Strip*\_*DL* of *Strip*\_*DL*, processor *i* is initially assigned to compute strip *i*, 1≤*i*≤*p*. Upon completion of its currently assigned strip *j*, the processor proceeds to compute strip *j*+*p*. An array *signal*\[\] is used for synchronization purposes. When computing a row *r* in its assigned strip *s*, a processor needs to wait until *signal*\[ *r*\]=*s*. *signal*\[ *r*\] is set to *s* by the processor working on strip *s*−1 when the values to the left of strip *s* needed in the computation of row *r* of strip *s* have been computed and there is no risk that the computations for row *r* of strip *s* will overwrite *V* values needed by other processors. *signal* works very much like *W* in *PP*\_*LS*\_*DL*.

Note that when we are working on *p* strips, we need *p* copies of the arrays *U* and *T* used by *Strip*\_*DL*.

The time complexity of *PP*\_*Strip*\_*DL* depends on the synchronization delay and is expected to approximate *O*(*mn*/*p*).

### The algorithm *PP*\_*DL*\_*TRACE* {#Sec15}

This algorithm first uses *PP*\_*DL* to compute *H*\[\]\[\]. Then, a single processor performs a traceback to construct the optimal trace. For reasonable values of *p*, the run time is dominated by *PP*\_*DL* and so, the complexity of *PP*\_*DL*\_*TRACE* is also *O*(*mn*/*p*).

### The algorithms *PP*\_*LSDL*\_*TRACE* and *PP*\_*Strip*\_*TRACE* {#Sec16}

In *LSDL*\_*TRACE* (*Strip*\_*TRACE*), we repeatedly partition the problem into two and apply either *LS*\_*DL* (*Strip*\_*DL*) to each partition. The parallel version *PP*\_*LSDL*\_*TRACE* (*PP*\_*Strip*\_*TRACE*) employs the following parallelization strategy: Each subproblem is solved using *PP*\_*LS*\_*DL* (*PP*\_*Strip*\_*DL*) when the number of independent subproblems is small; all *p* processors are assigned to the parallel solution of a single subproblem. I.e., the subproblems are solved in sequence.*p* subproblems are solved in parallel using *LS*\_*DL* (*Strip*\_*DL*) to solve each subproblem serially when the number of independent subproblems is large,

The time complexity of *PP*\_*LSDL*\_*TRACE* and *PP*\_*Strip*\_*TRACE* is *O*(*mn*/*p*).

Results {#Sec17}
=======

Experimental platform and test data {#Sec18}
-----------------------------------

The single-core algorithms were implemented using C and the multi-core ones using C and OpenMP. Our codes may be downloaded from \[[@CR23]\]. The following computational platforms were used:

Xeon4: Intel Xeon CPU E5-2603 v2 Quad-Core processor 1.8GHz with 10MB cache and 32GB memory.Xeon6: Intel I7-x980 Six-Core processor 3.33GHz with 12MB LLC cache and 16GB memory.Xeon24: Intel Xeon CPU E5-2695 v2 2xTwelve-Core processors 2.40GHz with 30MB cache and 512GB memory.

We compiled all codes using the gcc compiler with the O2 option. Cache miss and energy consumption data were obtained for our Xeon4 platform using the "perf" \[[@CR24]\] software and the RAPL interface. This is the only platform for which we obtained cache miss and energy consumption data.

For test data, we downloaded the real DNA/RNA/protein sequences from the NCBI (National Center for Biotechnology Information) server \[[@CR25]\] and PDB (Protein Data Bank) server \[[@CR15]\]. In addition to that, we also generated random DNA/RNA and protein sequences.

Xeon E5-2603 (Xeon4) using random data {#Sec19}
--------------------------------------

### DL distance algorithms {#Sec20}

The observed cache misses for our DL distance algorithms on our Xeon4 platform for randomly generated sequences of size between 40000 and 400000 are given in Fig. [5](#Fig5){ref-type="fig"} and Table [1](#Tab1){ref-type="table"}. "\*\*" in the table indicates there was insufficient memory for the algorithm to run. The column of Table [1](#Tab1){ref-type="table"} labeled *LvsD* (*SvsD*) presents the percentage changes in cache misses reduced by *LS*\_*DL* (*Strip*\_*DL*) relative to *DL* while that labeled *SvsL* gives this percentage changes reduced by *Strip*\_*DL* relative to *LS*\_*DL*. Fig. 5Cache misses for DL distance algorithms on Xeon4 Table 1Cache misses for DL distance algorithms, in millions, on Xeon4ABDLLS_DLStrip_DLL vs DS vs DS vs L40000400002012656-31.8%97.5%97.9%800008000012677151643.6%98.8%97.8%120000120000400621804245.6%99.0%98.1%160000160000\*\*10,6526399.4%200000200000\*\*19,75114799.3%240000240000\*\*24,25713399.5%280000280000\*\*38,11918899.5%320000320000\*\*44,81524299.5%360000360000\*\*61,296135297.8%400000400000\*\*160,118240798.5%^\*\*^ ⇒ insufficient memory

Notice that *DL* runs out of memory when \|*A*\|=\|*B*\|≥160000. *Strip*\_*DL* has fewer cache misses than *LS*\_*DL* and *LS*\_*DL* has fewer cache misses than *DL*. *Strip*\_*DL* reduces cache misses by up to 99.0*%* relative to *DL* and by up to 99.5*%* relative to *LS*\_*DL*.

Run times are given in seconds in Fig. [6](#Fig6){ref-type="fig"} and using the format *hh*:*mm*:*ss* in Table [2](#Tab2){ref-type="table"} for our random data set. *Strip*\_*DL* is the fastest followed by *LS*\_*DL* and *DL*. *Strip*\_*DL* reduces run time by up to 61.3*%* relative to *DL* and by up to 47.6*%* relative to *LS*\_*DL*. Fig. 6Run time of DL distance algorithms, in seconds, on Xeon4 Table 2Run time of DL distance algorithms on Xeon4ABDLLS_DLStrip_DLL vs DS vs DS vs L40000400000:00:270:00:170:00:1334.3%53.0%28.4%80000800000:01:400:01:020:00:5037.8%50.1%19.7%1200001200000:04:500:02:400:01:5244.8%61.3%29.9%160000160000\*\*0:05:370:03:1940.8%200000200000\*\*0:09:380:05:1445.7%240000240000\*\*0:13:370:07:2845.1%280000280000\*\*0:18:340:10:1045.2%320000320000\*\*0:24:130:13:1745.1%360000360000\*\*0:33:100:17:2247.6%400000400000\*\*0:37:550:20:4645.3%

Energy consumption by the CPU and cache are gievn, in joules, in Fig. [7](#Fig7){ref-type="fig"} and Table [3](#Tab3){ref-type="table"}. *Strip*\_*DL* required up to 68.5*%* less CPU and cache energy than *DL* and up to 48.2*%* less than *LS*\_*DL*. Fig. 7CPU and cache energy consumption of DL distance algorithms, in joules, on Xeon4 Table 3CPU and cache energy consumption of DL distance algorithms on Xeon4ABDLLS_DLStrip_DLL vs DS vs DS vs L4000040000158.77107.176.7332.5%51.7%28.4%8000080000598.12383.88305.1235.8%49.0%20.5%1200001200002180.59996.9686.5454.3%68.5%31.1%160000160000\*\*2088.011212.2741.9%200000200000\*\*3576.521905.5446.7%240000240000\*\*5058.272714.4746.3%280000280000\*\*6905.743711.1846.3%320000320000\*\*9000.264852.446.1%360000360000\*\*12286.836365.8648.2%400000400000\*\*14218.287615.1646.4%

### DL trace algorithms {#Sec21}

The observed cache misses for our single-core DL trace algorithms on our Xeon4 platform are given in Fig. [8](#Fig8){ref-type="fig"} and Table [4](#Tab4){ref-type="table"}. Since *DL*\_*TRACE* is simply *DL* with a linear time traceback added, that cache miss count for *DL*\_*TRACE* is only slightly more than that for *DL*. *LSDL*\_*TRACE* has a higher count than does *DL*\_*TRACE* for the instances that *DL* has sufficient memory to solve though the gap narrows with increasing instance size. *Strip*\_*TRACE* consistently has fewer cache misses than both *DL*\_*TRACE* and *LSDL*\_*TRACE*. *Strip*\_*TRACE* reduces cache misses by up to 98.6*%* relative to *DL*\_*TRACE* and by up to 99.5*%* relative to *LSDL*\_*TRACE*. Fig. 8Cache misses for DL trace algorithms on Xeon4 Table 4Cache misses for DL trace algorithms, in millions, on Xeon4ABDL_TRACELSDL_TRACEStrip_TRACEL vs DS vs DS vs L400004000022042324-92.0%89.3%94.4%80000800001537197029-28.1%98.1%98.5%1200001200004852510066-5.1%98.6%98.7%160000160000\*\*16,35011599.3%200000200000\*\*33,99851398.5%240000240000\*\*42,25226899.4%280000280000\*\*70,37035899.5%320000320000\*\*91,50145399.5%360000360000\*\*146,103212098.5%400000400000\*\*221,690603297.3%

Run times of the DL trace algorithms on our Xeon4 platform are given in seconds in Fig. [9](#Fig9){ref-type="fig"} and Table [5](#Tab5){ref-type="table"}. *Strip*\_*TRACE* is competitive with *DL*\_*TRACE* on our instances of size 40,000 and 80,000 and 23.5% faster on the instance of size 120,000. *Strip*\_*TRACE* was consistently faster than *LSDL*\_*TRACE* achieving a speedup of up to 47.1*%*. Fig. 9Run time of DL trace algorithms, in seconds, on Xeon4 Table 5Run time of DL trace algorithms on Xeon4ABDL_TRACELSDL_TRACEStrip_TRACEL vs DS vs DS vs L40000400000:00:270:00:300:00:26-11.3%3.5%13.3%80000800000:01:400:01:540:01:40-14.5%-0.4%12.4%1200001200000:04:530:04:420:03:443.6%23.5%20.6%160000160000\*\*0:09:210:06:3729.2%200000200000\*\*0:15:580:10:3034.2%240000240000\*\*0:23:420:14:5237.3%280000280000\*\*0:33:410:20:1340.0%320000320000\*\*0:45:260:26:2441.9%360000360000\*\*1:04:180:34:0147.1%400000400000\*\*1:15:140:41:1145.3%

The energy consumed by the CPU and cache is given in Fig. [10](#Fig10){ref-type="fig"} and Table [6](#Tab6){ref-type="table"}. *Strip*\_*TRACE* required up to 46.8*%* less CPU and cache energy than *LSDL*\_*TRACE*. Fig. 10CPU and cache energy consumption of DL trace algorithms, in joules, on Xeon4 Table 6CPU and cache energy consumption of DL trace algorithms on Xeon4ABDL_TRACELSDL_TRACEStrip_TRACEL vs DS vs DS vs L4000040000158.56181.40156.95-14.4%1.0%13.5%8000080000597.27703.09610.70-17.7%-2.2%13.1%1200001200002,256.991,736.861,365.3223.0%39.5%21.4%160000160000\*\*3,443.832,407.8630.1%200000200000\*\*5,843.563,818.6834.7%240000240000\*\*8,665.305,403.6037.6%280000280000\*\*12,275.037,372.2539.9%320000320000\*\*16,536.939,609.5641.9%360000360000\*\*23,396.4112,439.7146.8%400000400000\*\*27,551.9015,167.7644.9%

### Parallel DL distance algorithms {#Sec22}

The observed cache misses for our parallel DL algorithms are given in Fig. [11](#Fig11){ref-type="fig"} and Table [7](#Tab7){ref-type="table"}. *PP*\_*Strip*\_*DL* has the fewest cache misses followed by *PP*\_*LS*\_*DL* and *PP*\_*DL* (in this order). The reduction is cache misses achieved by *PP*\_*Strip*\_*DL* is up to 99.6*%* relative to *PP*\_*DL* and up to 99.4*%* relative to *PP*\_*LS*\_*DL*. Fig. 11Cache misses of parallel DL distance algorithms on Xeon4 Table 7Cache misses of parallel DL distance algorithms, in millions, on Xeon4ABPP_DLPP_LS_DLPP_Strip_DLL vs DS vs DS vs L400004000025923539.2%99.0%98.9%80000800001417500664.7%99.6%98.8%120000120000474618572460.9%99.5%98.7%160000160000\*\*40282699.4%200000200000\*\*62434399.3%240000240000\*\*91016699.3%280000280000\*\*12,63611299.1%320000320000\*\*16,26720298.8%360000360000\*\*40,741102097.5%400000400000\*\*66,469164497.5%

Run times for our parallel DL algorithms are given in Fig. [12](#Fig12){ref-type="fig"} and Table [8](#Tab8){ref-type="table"}. *PP*\_*Strip* is up to 84.2*%* faster than *PP*\_*DL* and up to 57.0*%* faster than *PP*\_*LS*\_*DL*. Fig. 12Run time of parallel DL distance algorithms, in seconds, on Xeon4 Table 8Run time of parallel DL distance algorithms on Xeon4ABPP_DLPP_LS_DLPP_Strip_DLL vs DS vs DS vs L40000400000:00:080:00:050:00:0333.4%59.0%38.4%80000800000:00:290:00:200:00:1330.7%56.7%37.5%1200001200000:03:000:00:560:00:2868.9%84.2%49.2%160000160000\*\*0:01:490:00:5054.1%200000200000\*\*0:02:550:01:1955.2%240000240000\*\*0:04:090:01:5354.5%280000280000\*\*0:05:480:02:3455.9%320000320000\*\*0:07:210:03:2054.5%360000360000\*\*0:10:130:04:2457.0%400000400000\*\*0:11:410:05:1355.3%

Speedup numbers are given in Table [9](#Tab9){ref-type="table"}. The column labeled *DL*/*PP*, for example, is the time for *DL* divided by that for *PP*\_*DL*. *PP*\_*Strip*\_*DL* has a speedup between 3.95 and 3.99, which is quite close to the number of cores (4) on our Xeon4 platform. The speedup for *PP*\_*DL* is up to 3.45 and that for *PP*\_*LS*\_*DL* is up to 3.40. Table 9Speedup of parallel DL distance algorithms on Xeon4ABDL/PPLS_DL/PPStrip_DL/PP40000400003.453.403.9680000800003.443.093.971200001200001.622.873.96160000160000\*\*3.083.98200000200000\*\*3.303.99240000240000\*\*3.293.96280000280000\*\*3.203.97320000320000\*\*3.303.98360000360000\*\*3.243.95400000400000\*\*3.243.98

Energy data are given in Fig. [13](#Fig13){ref-type="fig"} and Table [10](#Tab10){ref-type="table"}. *PP*\_*Strip*\_*DL* used up to 81.4*%* less CPU and cache energy than did *PP*\_*DL* and up to 55.7*%* less than *PP*\_*LS*\_*DL*. Fig. 13CPU and cache energy consumption of parallel DL distance algorithms, in joules, on Xeon4 Table 10CPU and cache energy consumption of parallel DL distance algorithms on Xeon4ABPP_DLPP_LS_DLPP_Strip_DLL vs DS vs DS vs L400004000089.1260.6437.3132.0%58.1%38.5%8000080000336.87238.15147.8229.3%56.1%37.9%1200001200001800.48657.89334.8663.5%81.4%49.1%160000160000\*\*1285.34591.953.9%200000200000\*\*2063.55926.6455.1%240000240000\*\*2928.871332.2954.5%280000280000\*\*4106.151818.6655.7%320000320000\*\*5223.542385.4554.3%360000360000\*\*6640.933164.452.4%400000400000\*\*8287.463727.3155.0%

Although the multi-core algorithms use more CPU power than used by their single-core counterparts, the power increase is less than the decrease in run time. Hence, energy consumption is reduced.

### Parallel DL trace algorithms {#Sec23}

The number of cache misses incurred by our multi-core DL trace algorithms is given in Fig. [14](#Fig14){ref-type="fig"} and Table [11](#Tab11){ref-type="table"}. *PP*\_*Strip*\_*TRACE* has the fewest number of cache misses. *PP*\_*Strip*\_*TRACE* reduces cache misses by up to 99.3*%* and 99.6*%* relative to *PP*\_*DL*\_*TRACE* and *PP*\_*LSDL*\_*TRACE*, respectively. Fig. 14Cache misses for parallel DL trace algorithms on Xeon4 Table 11Cache misses for parallel DL trace algorithms, in millions, on Xeon4ABPP_DL_TRACEPP_LSDL_TRACEPP_Strip_TRACEL vs DS vs DS vs L400004000024763616-157.0%93.6%97.5%80000800001312243115-85.3%98.9%99.4%1200001200004940681434-37.9%99.3%99.5%160000160000\*\*12,7745399.6%200000200000\*\*28,9088599.7%240000240000\*\*30,52911099.6%280000280000\*\*40,80315499.6%320000320000\*\*53,89217999.7%360000360000\*\*91,62179699.1%400000400000\*\*188,325272798.6%

Run times are given in Fig. [15](#Fig15){ref-type="fig"} and Table [12](#Tab12){ref-type="table"}. *PP*\_*Strip*\_*Trace* is faster than *PP*\_*LSDL*\_*TRACE* by up to 59.2*%*. As in Table [13](#Tab13){ref-type="table"}, the speedup achieved by *PP*\_*Strip*\_*TRACE* relative to its single-core version ranges from 3.44 to 3.90. Fig. 15Run time of parallel DL trace algorithms, in seconds, on Xeon4 Table 12Run time of parallel DL trace algorithms on Xeon4ABPP_DL_TRACEPP_LSDL_TRACEPP_Strip_TRACEL vs DS vs DS vs L40000400000:00:080:00:100:00:07-36.1%0.9%27.2%80000800000:00:290:00:380:00:27-32.7%7.0%29.9%1200001200000:04:150:01:390:00:5961.2%76.8%40.3%160000160000\*\*0:03:050:01:4344.2%200000200000\*\*0:05:030:02:4346.3%240000240000\*\*0:07:500:03:5051.0%280000280000\*\*0:11:110:05:1353.4%320000320000\*\*0:15:070:06:4655.2%360000360000\*\*0:21:250:08:4459.2%400000400000\*\*0:24:100:10:3456.3% Table 13Speedup of parallel DL trace algorithms on Xeon4ABDL/PPLSDL_TRACE/PPStrip_TRACE/PP40000400003.532.883.4480000800003.442.973.721200001200001.152.863.79160000160000\*\*3.033.85200000200000\*\*3.163.87240000240000\*\*3.023.87280000280000\*\*3.013.88320000320000\*\*3.013.90360000360000\*\*3.003.89400000400000\*\*3.113.90

Energy consumption data are given in Fig. [16](#Fig16){ref-type="fig"} and Table [14](#Tab14){ref-type="table"}. *PP*\_*Strip*\_*TRACE* required up to 57.6*%* less CPU and cache energy than *PP*\_*LSDL*\_*TRACE*. Fig. 16CPU and cache energy consumption of parallel DL trace algorithms, in joules, on Xeon4 Table 14CPU and cache energy consumption of parallel DL trace algorithms on Xeon4ABPP_DL_TRACEPP_LSDL_TRACEPP_Strip_TRACEL vs DS vs DS vs L400004000087.39118.7484.85-35.9%2.9%28.5%8000080000334.01449.89310.34-34.7%7.1%31.0%1200001200002,433.281,149.61684.2852.8%71.9%40.5%160000160000\*\*2,149.581,202.5244.1%200000200000\*\*3,524.591,898.3546.1%240000240000\*\*5,410.422,684.7250.4%280000280000\*\*7,707.413,657.5452.5%320000320000\*\*10,384.754,789.0353.9%360000360000\*\*14,612.396,200.1057.6%400000400000\*\*16,559.767,472.5254.9%

### Xeon E5-2603 (Xeon4) using real data {#Sec24}

Tables [15](#Tab15){ref-type="table"} and [16](#Tab16){ref-type="table"}, respectively, give the run times for our single-core and multi-core DL and DL trace algorithms using real DNA sequences on our Xeon4 platform. The observed times are quite comparable to those for similarly sized random strings. Further, the speed up achieved by our parallel algorithms relative to the single-core algorithms is also comparable to that for random strings. So, for our remaining test platforms, we present only the results for our randomly generated data sets. Table 15Run time of DL distance algorithms for real DNA sequences on Xeon4ABDLLS_DLStrip_DLPP_DLPP_LS_DLPP_Strip_DLNZ_LRIA01000064CYPR010000970:00:230:00:190:00:120:00:080:00:060:00:03LNFE01000131AGUF010000280:01:270:01:160:00:470:00:290:00:240:00:12NZ_CYTG01000018LVKN010000710:03:210:02:530:01:460:02:420:00:540:00:28BX000446BX511181\*\*0:05:010:03:09\*\*0:01:340:00:49NZ_AMFW01000007LYHN01000016\*\*0:07:490:04:54\*\*0:02:260:01:17JLXA01000008AUHZ01000004\*\*0:11:310:07:04\*\*0:03:380:01:51NZ_FNNC01000004NZ_APZF01000097\*\*0:15:030:09:37\*\*0:04:430:02:31LHOK01000008AGYI01000018\*\*0:19:440:12:36\*\*0:06:110:03:17BAMV01000017MIMZ01000025\*\*0:24:560:15:56\*\*0:07:490:04:09LSMI01000030CZBU01000005\*\*0:30:130:19:39\*\*0:09:340:05:08 Table 16Run time of DL trace algorithms for real DNA sequences on Xeon4ABDL_TraceLSDL_TraceStrip_TracePP_DL_TracePP_LSDL_TracePP_Strip_TraceNZ_LRIA01000064CYPR010000970:00:230:00:330:00:240:00:080:00:120:00:08LNFE01000131AGUF010000280:01:270:02:120:01:350:00:290:00:450:00:28NZ_CYTG01000018LVKN010000710:03:230:05:020:03:330:03:480:01:440:01:00BX000446BX511181\*\*0:08:450:06:18\*\*0:02:590:01:45NZ_AMFW01000007LYHN01000016\*\*0:13:430:09:51\*\*0:04:390:02:43JLXA01000008AUHZ01000004\*\*0:20:100:14:10\*\*0:06:500:03:54NZ_FNNC01000004NZ_APZF01000097\*\*0:26:250:19:18\*\*0:08:490:05:17LHOK01000008AGYI01000018\*\*0:34:410:25:15\*\*0:11:290:06:55BAMV01000017MIMZ01000025\*\*0:43:490:31:58\*\*0:14:380:08:44LSMI01000030CZBU01000005\*\*0:52:540:39:23\*\*0:18:140:10:46

I7-x980 (Xeon6) using random data {#Sec25}
---------------------------------

### DL distance algorithms {#Sec26}

Single core run times are given in Fig. [17](#Fig17){ref-type="fig"} and Table [17](#Tab17){ref-type="table"} for our Xeon6 platform. As can be seen, *Strip*\_*DL* is the fastest followed by *LS*\_*DL* and *DL*. *Strip*\_*DL* reduces run time by up to 52.3*%* relative to *DL* and by up to 76.1*%* relative to *LS*\_*DL*. The classical *DL* algorithm ran out memory when \|*A*\|=\|*B*\|=8000. Fig. 17Run time of DL distance algorithms, in seconds, on Xeon6 Table 17Run time of DL distance algorithms on Xeon6ABDLLS_DLStrip_DLL vs DS vs DS vs L40000400000:00:170:00:140:00:0816.2%52.3%43.1%8000080000\*\*0:00:550:00:3241.3%120000120000\*\*0:02:120:01:1344.8%160000160000\*\*0:05:190:02:0959.5%200000200000\*\*0:10:160:03:2367.1%240000240000\*\*0:16:170:04:5070.3%280000280000\*\*0:24:190:06:3672.9%320000320000\*\*0:33:320:08:3674.4%360000360000\*\*0:45:500:10:5876.1%400000400000\*\*0:55:440:13:2775.9%

### DL trace algorithms {#Sec27}

The run times for the DL trace algorithms are given in Fig. [18](#Fig18){ref-type="fig"} and Table [18](#Tab18){ref-type="table"}. *Strip*\_*TRACE* reduces run time by up to 63.5*%* relative to *LSDL*\_*TRACE*. Fig. 18Run time of DL trace algorithms, in seconds, on Xeon6 Table 18Run time of DL trace algorithms on Xeon6ABDL_TRACELSDL_TRACEStrip_TRACEL vs DS vs DS vs L40000400000:00:170:00:220:00:17-26.3%4.8%24.6%8000080000\*\*0:01:240:01:0522.1%120000120000\*\*0:03:330:02:2631.2%160000160000\*\*0:07:200:04:2041.0%200000200000\*\*0:13:190:06:4649.1%240000240000\*\*0:20:510:09:4353.4%280000280000\*\*0:31:190:13:1457.7%320000320000\*\*0:43:240:17:1660.2%360000360000\*\*0:59:270:21:5563.1%400000400000\*\*1:13:510:26:5763.5%

### Parallel DL distance algorithms {#Sec28}

Run times for the parallel DL distance algorithms are given in Fig. [19](#Fig19){ref-type="fig"} and Table [19](#Tab19){ref-type="table"}. As was the case on our *Xeon*4 platform, *PP*\_*Strip*\_*DL* is faster than *PP*\_*DL* and *PP*\_*LS*\_*DL*. It reduces the run time by up to 25.6*%* and 79.4*%*, respectively. The speedup of our parallel algorithm *PP*\_*Strip*\_*DL* relative to its single-core version (Table [20](#Tab20){ref-type="table"}) is up to 5.71. This is quite close to the number of cores (6). The maximum speedup achieved by *PP*\_*DL* and *PP*\_*LS*\_*DL* was 4.89 and 5.27, respectively. Fig. 19Run time of parallel DL distance algorithms, in seconds, on Xeon6 Table 19Run time of parallel DL distance algorithms on Xeon6ABPP_DLPP_LS_DLPP_Strip_DLL vs DS vs DS vs L40000400000:00:030:00:030:00:0322.2%25.6%4.4%8000080000\*\*0:00:110:00:0646.7%120000120000\*\*0:00:320:00:1359.8%160000160000\*\*0:01:180:00:2371.1%200000200000\*\*0:02:340:00:3676.8%240000240000\*\*0:03:470:00:5277.2%280000280000\*\*0:05:250:01:0978.6%320000320000\*\*0:06:590:01:3277.9%360000360000\*\*0:09:230:01:5679.4%400000400000\*\*0:11:030:02:2378.5% Table 20Speedup of parallel DL distance algorithms on Xeon6ABDL/PPLS_DL/PPStrip_DL/PP40000400004.895.273.138000080000\*\*5.155.68120000120000\*\*4.085.59160000160000\*\*4.075.71200000200000\*\*3.995.67240000240000\*\*4.305.60280000280000\*\*4.495.70320000320000\*\*4.805.58360000360000\*\*4.885.67400000400000\*\*5.045.65

### Parallel DL trace algorithms {#Sec29}

Xeon6 run times for the parallel DL trace algorithms are given in Fig. [20](#Fig20){ref-type="fig"} and Table [21](#Tab21){ref-type="table"}. *PP*\_*Strip*\_*TRACE* is faster than *PP*\_*LSDL*\_*TRACE* and reduces the run time by up to 68.9*%*. As shown in Table [22](#Tab22){ref-type="table"}, *PP*\_*Strip*\_*TRACE* obtains a speedup of up to 5.33 while the maximum speedup by *PP*\_*DL*\_*TRACE* and *PP*\_*LSDL*\_*TRACE* was 5.23 and 4.55, respectively. Fig. 20Run time of parallel DL trace algorithms, in seconds, on Xeon6 Table 21Run time of parallel DL trace algorithms on Xeon6ABPP_DL_TRACEPP_LSDL_TRACEPP_Strip_TRACEL vs DS vs DS vs L40000400000:00:030:00:050:00:05-56.2%-51.2%3.2%8000080000\*\*0:00:190:00:1615.7%120000120000\*\*0:00:510:00:3138.1%160000160000\*\*0:01:490:00:5648.9%200000200000\*\*0:03:190:01:2159.3%240000240000\*\*0:05:040:01:5661.9%280000280000\*\*0:07:190:02:3365.1%320000320000\*\*0:09:550:03:2166.3%360000360000\*\*0:13:110:04:1268.2%400000400000\*\*0:16:140:05:0368.9% Table 22Speedup of parallel DL trace algorithms on Xeon6ABDL_TRACE/PPLSDL_TRACE/PPStrip_TRACE/PP40000400005.234.233.308000080000\*\*4.494.15120000120000\*\*4.194.65160000160000\*\*4.054.68200000200000\*\*4.025.02240000240000\*\*4.125.04280000280000\*\*4.285.18320000320000\*\*4.385.17360000360000\*\*4.515.22400000400000\*\*4.555.33

Xeon E5-2695 (Xeon24) using random data {#Sec30}
---------------------------------------

### DL distance algorithms {#Sec31}

The run times for our single-core DL distance algorithms on the Xeon24 are given in Fig. [21](#Fig21){ref-type="fig"} and Table [23](#Tab23){ref-type="table"}. As on our other test platforms, *Strip*\_*DL* is the fastest followed by *LS*\_*DL* and *DL*. *Strip*\_*DL* reduces run time by up to 73.1*%* relative to *DL* and by up to 42.9*%* relative to *LS*\_*DL*. Fig. 21Run time of DL distance algorithms, in seconds, on Xeon24 Table 23Run time of DL distance algorithms on Xeon24ABDLLS_DLStrip_DLL vs DS vs DS vs L40000400000:00:240:00:140:00:1041.5%57.1%26.7%80000800000:01:260:00:530:00:4138.3%53.0%23.8%1200001200000:03:060:02:010:01:3135.3%50.9%24.2%1600001600000:05:330:03:340:02:4235.6%51.3%24.3%2000002000000:08:320:06:080:04:1528.2%50.2%30.6%2400002400000:12:400:08:050:06:0536.2%52.0%24.8%2800002800000:19:240:11:320:08:2040.6%57.0%27.7%3200003200000:29:510:15:550:10:5846.7%63.3%31.1%3600003600000:44:440:26:410:15:1140.4%66.0%43.1%4000004000001:04:030:30:110:17:1552.9%73.1%42.9%

### DL trace algorithms {#Sec32}

The run times for our single-core DL trace algorithms on the Xeon24 are given in Fig. [22](#Fig22){ref-type="fig"} and Table [24](#Tab24){ref-type="table"}. *Strip*\_*TRACE* reduces run time by up to 46.9*%* and 31.5*%* relative to *DL*\_*TRACE* and *LSDL*\_*TRACE*, respectively. Fig. 22Run time of DL trace algorithms, in seconds, on Xeon24 Table 24Run time of DL trace algorithms on Xeon24ABDL_TRACELSDL_TRACEStrip_TRACEL vs DS vs DS vs L40000400000:00:230:00:250:00:21-6.1%9.7%14.9%80000800000:01:250:01:360:01:23-12.8%2.7%13.8%1200001200000:03:040:03:380:03:04-18.2%-0.1%15.3%1600001600000:05:290:06:260:05:27-17.1%0.7%15.2%2000002000000:08:300:10:260:08:54-22.9%-4.8%14.7%2400002400000:12:400:14:440:12:14-16.3%3.5%17.0%2800002800000:19:070:20:220:16:39-6.5%12.9%18.2%3200003200000:29:140:27:390:21:515.4%25.3%21.0%3600003600000:44:520:41:560:28:556.6%35.6%31.0%4000004000001:04:330:50:010:34:1622.5%46.9%31.5% Table 25Run time of parallel DL distance algorithms on Xeon24ABPP_DLPP_LS_DLPP_Strip_DLL vs DS vs DS vs L40000400000:00:020:00:010:00:0167.7%74.7%21.5%80000800000:00:080:00:040:00:0253.6%75.1%46.4%1200001200000:00:160:00:080:00:0451.8%72.7%43.5%1600001600000:00:280:00:180:00:0834.8%70.4%54.6%2000002000000:00:440:00:280:00:1137.1%74.6%59.6%2400002400000:01:040:00:370:00:1642.0%74.3%55.8%2800002800000:01:330:00:590:00:2336.3%75.8%61.9%3200003200000:02:110:01:330:00:3029.1%77.4%68.1%3600003600000:03:070:02:010:00:4035.1%78.5%66.9%4000004000000:03:340:02:440:00:4523.2%79.1%72.8% Table 26Speedup of parallel DL distance algorithms on Xeon24ABDL/PPLS_DL/PPStrip_DL/PP40000400009.8617.8816.69800008000010.5313.9919.9012000012000011.3915.2920.5116000016000012.0211.8619.7720000020000011.5113.1422.5724000024000011.9113.1022.2928000028000012.5111.6822.1932000032000013.7010.3022.2736000036000014.3913.2122.7340000040000018.0011.0523.22

### Parallel DL trace algorithms {#Sec33}

Parallel DL trace run times are given in Fig. [24](#Fig24){ref-type="fig"} and Table [27](#Tab27){ref-type="table"}. *PP*\_*Strip*\_*TRACE* is faster than *PP*\_*DL*\_*TRACE* and *PP*\_*LSDL*\_*TRACE* on large data. It reduces the run time by up to 51.1*%* and 50.1*%*, respectively. *PP*\_*Strip*\_*TRACE* achieves a speedup of up to 17.42 (Table [28](#Tab28){ref-type="table"}); *PP*\_*DL*\_*TRACE* and *PP*\_*LSDL*\_*TRACE* have maximum speedups of 16.98 and 12.60. Table 27Run time of parallel DL trace algorithms on Xeon24ABPP_DL_TRACEPP_LSDL_TRACEPP_Strip_TRACEL vs DS vs DS vs L40000400000:00:010:00:030:00:02-153.0%-79.8%28.9%80000800000:00:070:00:110:00:07-58.3%-7.1%32.3%1200001200000:00:170:00:210:00:15-25.4%8.3%26.9%1600001600000:00:300:00:350:00:24-18.4%18.1%30.8%2000002000000:00:470:00:530:00:41-11.0%14.3%22.8%2400002400000:01:080:01:130:00:53-8.0%21.7%27.5%2800002800000:01:390:01:520:01:10-13.4%28.9%37.3%3200003200000:02:270:02:300:01:26-1.8%41.7%42.7%3600003600000:03:230:03:200:01:401.5%50.9%50.1%4000004000000:04:220:04:140:02:083.3%51.1%49.5% Table 28Speedup of parallel DL trace algorithms on Xeon24ABDL_TRACE/PPLSDL_TRACE/PPStrip_TRACE/PP400004000016.987.128.53800008000012.598.9711.4312000012000011.0110.3812.0216000016000011.1411.0213.5120000020000010.7611.9113.1524000024000011.2012.0613.8128000028000011.6510.9414.2632000032000011.9211.0815.2836000036000013.2912.6017.4240000040000014.7711.8316.04

### Parallel DL distance algorithms {#Sec34}

Parallel DL distance run times are given in Fig. [23](#Fig23){ref-type="fig"} and Table [25](#Tab25){ref-type="table"}. *PP*\_*Strip*\_*DL* is faster than *PP*\_*DL* and *PP*\_*LS*\_*DL* and reduces the run time by up to 79.1*%* and 72.8*%*, respectively. As can be seen from Table [26](#Tab26){ref-type="table"}, *PP*\_*Strip*\_*DL* scales quite well and results in a speedup of up to 23.22. The maximum speedups provided by *PP*\_*DL* and *PP*\_*LS*\_*DL* are 18.00 and 17.88, respectively. Fig. 23Run time of parallel DL distance algorithms, in seconds, on Xeon24 Fig. 24Run time of parallel DL trace algorithms, in seconds, on Xeon24

Discussion {#Sec35}
==========

Cache efficient and multi-core linear-space algorithms to compute the DL distance between two strings as well as to determine an optimal trace (edit sequence) have been developed. The reduction in space provided by these algorithms enables the solution of much larger instances than is possible using previously known algorithms.

Conclusion {#Sec36}
==========

Our algorithms were empirically evaluated on 3 computational platforms. Cache-misses were experimentally measured on one of these platforms and we verified that the algorithms analyzed to have a smaller number of cache misses using our simple cache model actually had fewer misses on a real computational platform. Significant run-time improvement (relative to known algorithms) was seen for our cache-efficient algorithms on all three platforms. On all platforms, the linear-space cache-efficient algorithms *Strip*\_*DL* and *Strip*\_*TRACE* were the best-performing single-core algorithms to determine the DL distance and optimal trace, respectively.

*Strip*\_*DL* reduced run time by as much as 73.1*%* relative to the classical distance algorithm *DL* and *Strip*\_*TRACE* reduced run time by as much as 63.5*%* relative to the classical trace algorithm. Multi-core versions of these two algorithms scaled quite well and achieved a speedup of up to 23.22 on a 24 core computer.

We also measured the energy efficiency of our algorithms on one of the platforms. Our best single-core algorithms reduced energy consumption by as much as 68.5% (relative to the best previously known algorithm) when computing the DL distance and by as much as 46.8% when computing an optimal trace. Our best multi-core algorithms achieves up to 81.4% and 57.6% energy consumption reduction, respectively.

CPU

:   Central processing unit

DL

:   Damerau-Levenshtein

DNA

:   DeoxyriboNucleic acid

LLC

:   Last level cache

LRU

:   Least recently used

NCBI

:   National center for biotechnology information

PDB

:   Protein data bank

RNA

:   RiboNucleic acid

The authors also would like to thank the anonymous reviewers for their valuable comments and suggestions.

Funding {#d29e16223}
=======

This research was funded, in part, by the National Science Foundation under award NSF 1447711. Publication costs were funded by the University of Florida.

Availability of data and materials {#d29e16228}
==================================

All data generated or analyzed during this study are included in this supplementary information files.

About this supplement {#d29e16233}
=====================

This article has been published as part of *BMC Bioinformatics Volume 20 Supplement 11, 2019: Selected articles from the 7th IEEE International Conference on Computational Advances in Bio and Medical Sciences (ICCABS 2017): bioinformatics*. The full contents of the supplement are available online at <https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-20-supplement-11>.

CZ and SS developed the new linear space cache efficient DL distance algorithms, did theoretical analysis and the experimental results analysis, and wrote the manuscript. CZ programmed the algorithms and ran the benchmark tests. Both authors read and approved the final manuscript.

Not applicable.

Not applicable.

The authors declare that they have no competing interests.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
