    Approximating Edit Distance Within Constant Factor in Truly Sub-Quadratic Time

    Edit distance is a measure of similarity of two strings based on the minimum number of character insertions, deletions, and substitutions required to transform one string into the other. The edit distance can be computed exactly using a dynamic programming algorithm that runs in quadratic time. Andoni, Krauthgamer and Onak (2010) gave a nearly linear time algorithm that approximates edit distance within approximation factor $\text{poly}(\log n)$. In this paper, we provide an algorithm with running time $\tilde{O}(n^{2-2/7})$ that approximates the edit distance within a constant factor.
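
    For reference, the exact quadratic-time dynamic program mentioned in this abstract can be sketched as follows. This is the textbook unit-cost recurrence, not the sub-quadratic approximation algorithm of the paper; the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Exact unit-cost edit distance in O(len(a) * len(b)) time."""
    # prev[j] holds the distance between the current prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty prefix of b
        for j, cb in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if ca == cb else 1)
            deletion = prev[j] + 1       # drop ca
            insertion = curr[j - 1] + 1  # add cb
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))
```

    Only two rows of the table are kept, so memory is linear in the length of the second string while time remains quadratic.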

    Speeding up the cyclic edit distance using LAESA with early abandon

    The cyclic edit distance between two strings is the minimum edit distance between one of these strings and every possible cyclic shift of the other. This can be useful, for example, in image analysis, where strings describe the contours of shapes, or in computational biology for classifying circularly permuted proteins or circular DNA/RNA molecules. The cyclic edit distance can be computed in O(mn log m) time; however, in real recognition tasks this cost is high because of the size of the databases. A method that reduces the number of comparisons and avoids an exhaustive search is therefore desirable. In this work, we present a new algorithm based on a modification of LAESA (linear approximating and eliminating search algorithm) that applies pruning in the computation of distances. It is an efficient procedure for classification and retrieval of cyclic strings. Experimental results show that our proposal considerably outperforms LAESA. Work partially supported by the Spanish Government (TIN2010-18958) and the Generalitat Valenciana (PROMETEOII/2014/062).
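
    The definition above can be illustrated with a naive baseline that tries every cyclic shift and abandons a comparison as soon as it cannot beat the best distance found so far. This is only a sketch of the definition with a simple early-abandon rule; it is neither the O(mn log m) algorithm nor the LAESA-based method of the paper, and the names `bounded_edit_distance` and `cyclic_edit_distance` are assumptions made here.

```python
def bounded_edit_distance(a: str, b: str, bound: int) -> int:
    """Return min(edit distance, bound), stopping early when possible."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        # Row minima never decrease, so the final distance is >= min(curr).
        if min(curr) >= bound:
            return bound
        prev = curr
    return min(prev[-1], bound)


def cyclic_edit_distance(a: str, b: str) -> int:
    """Minimum edit distance between a and any cyclic shift of b."""
    best = max(len(a), len(b))  # the edit distance never exceeds this
    for shift in range(max(1, len(b))):
        rotated = b[shift:] + b[:shift]
        best = bounded_edit_distance(a, rotated, best)
        if best == 0:
            break
    return best


print(cyclic_edit_distance("abcd", "cdab"))
```

    The baseline costs O(m) full comparisons in the worst case, which is the kind of expense that pruning schemes aim to avoid.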

    A basic analysis toolkit for biological sequences

    This paper presents a software library, nicknamed BATS, for some basic sequence analysis tasks, namely local alignments via approximate string matching, and global alignments via longest common subsequence and alignments with affine and concave gap cost functions. It also supports filtering operations to select strings from a set and establish their statistical significance via z-score computation. None of the algorithms is new, but although they are generally regarded as fundamental for sequence analysis, they have not previously been implemented in a single, consistent software package. Our main contribution is therefore to fill this gap between algorithmic theory and practice by providing an extensible and easy-to-use software library that includes algorithms for the mentioned string matching and alignment problems. The library consists of C/C++ library functions as well as Perl library functions. It can be interfaced with Bioperl and can also be used as a stand-alone system with a GUI. The software is available at under the GNU GPL.

    Edit distance Kernelization of NP theorem proving for polynomial-time machine learning of proof heuristics

    We outline a general strategy for the application of edit-distance-based kernels to NP Theorem Proving in order to allow for polynomial-time machine learning of proof heuristics without the loss of sequential structural information associated with conventional feature-based machine learning. We provide a general short introduction to logic and proof considering a few important complexity results to set the scene and highlight the relevance of our findings.

    Boosting Perturbation-Based Iterative Algorithms to Compute the Median String

    The most competitive heuristics for calculating the median string are those that use perturbation-based iterative algorithms. Given the complexity of this problem, which under many formulations is NP-hard, the computational cost of an exact solution is not affordable. In this work, we address the heuristic algorithms that solve this problem, emphasizing their initialization and the policy used to order possible editing operations. Both factors have a significant weight in the solution of this problem. Initial string selection influences the algorithm's speed of convergence, as does the criterion chosen to select the modification to be made in each iteration of the algorithm. To obtain the initial string, we use the median of a subset of the original dataset; to obtain this subset, we apply the Half Space Proximal (HSP) test to the median of the dataset. This test provides sufficient diversity within the members of the subset while at the same time fulfilling the centrality criterion. Similarly, we provide an analysis of the stop condition of the algorithm, improving its performance without substantially damaging the quality of the solution. To analyze the results of our experiments, we computed the execution time of each proposed modification of the algorithms, the number of computed editing distances, and the quality of the solution obtained. With these experiments, we empirically validated our proposal.
    This work was supported in part by the Comisión Nacional de Investigación Científica y Tecnológica - Programa de Formación de Capital Humano Avanzado (CONICYT-PCHA)/Doctorado Nacional/2014-63140074 through the Ph.D. Scholarship, in part by the European Union's Horizon 2020 under the Marie Sklodowska-Curie Grant 690941, in part by the Millennium Institute for Foundational Research on Data (IMFD), and in part by the FONDECYT-CONICYT under Grant 1170497. The work of Óscar Pedreira was supported in part by the Xunta de Galicia/FEDER-UE refs under Grant CSI ED431G/01 and Grant GRC: ED431C 2017/58, in part by the Office of the Vice President for Research and Postgraduate Studies of the Universidad Católica de Temuco, VIPUCT Project 2020EM-PS-08, and in part by the FEQUIP 2019-INRN-03 of the Universidad Católica de Temuco.
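
    As a point of reference for the initialization step, a common starting string for iterative median-string heuristics is the set median: the member of the dataset with the smallest total edit distance to all members. The sketch below computes only that; it does not implement the HSP-based subset selection or the perturbation loop described in the abstract, and the names `edit_distance` and `set_median` are assumptions made here.

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Exact unit-cost edit distance (two-row dynamic program)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,
                            curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def set_median(strings: Sequence[str]) -> str:
    """Return the dataset member minimizing the sum of distances to all members."""
    if not strings:
        raise ValueError("set_median needs at least one string")
    # Quadratic in the number of strings: every pair is compared.
    return min(strings, key=lambda s: sum(edit_distance(s, t) for t in strings))


print(set_median(["abc", "abd", "xbc", "abcd"]))
```

    Because this step alone needs a quadratic number of distance computations, the choice of subset used for initialization directly affects the number of edit distances computed.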

    GPU acceleration of Levenshtein distance computation between long strings

    Computing edit distance for very long strings has been hampered by quadratic time complexity with respect to string length. The WFA algorithm reduces the time complexity to a quadratic factor with respect to the edit distance between the strings. This work presents a GPU implementation of the WFA algorithm and a new optimization that can halve the elements to be computed, providing additional performance gains. The implementation makes it possible to compute the edit distance between strings of hundreds of millions of characters. The performance of the algorithm depends on the similarity between the strings. For strings longer than million characters, the performance is the best ever reported, which is above TCUPS for strings with similarities greater than 70% and above one hundred TCUPS for 99.9% similarity. This research was supported by the European Union Regional Development Fund (ERDF) within the framework of the ERDF Operational Program of Catalonia 2014–2020 with a grant of 50% of the total cost eligible under the Designing RISC-V based Accelerators for next generation computers project (DRAC) [001-P-001723], in part by the Catalan Government under grant 2017-SGR-1624, and in part by the Spanish Ministry of Science, Innovation and Universities under grant RTI2018-095209-B-C22. Peer reviewed. Postprint (published version).
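
    The wavefront idea behind this entry can be sketched on the CPU for unit-cost edit distance: for each score s, keep only the furthest position reached on every diagonal, so the work grows with the distance rather than with the full table. This is a minimal sequential sketch under those assumptions, not the GPU implementation or the halving optimization described in the abstract; the name `wavefront_edit_distance` is chosen here for illustration.

```python
def wavefront_edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance via furthest-reaching points per diagonal."""
    m, n = len(a), len(b)

    def extend(i: int, k: int) -> int:
        # Slide along diagonal k (j = i + k) while characters match for free.
        j = i + k
        while i < m and j < n and a[i] == b[j]:
            i += 1
            j += 1
        return i

    target = n - m                  # diagonal containing the cell (m, n)
    front = {0: extend(0, 0)}       # front[k] = furthest i reached on diagonal k
    score = 0
    while front.get(target, -1) < m:
        score += 1
        nxt = {}
        for k in range(-min(score, m), min(score, n) + 1):
            best = -1
            if k in front:          # substitution stays on diagonal k
                best = max(best, front[k] + 1)
            if k + 1 in front:      # deleting a[i] moves from diagonal k+1 to k
                best = max(best, front[k + 1] + 1)
            if k - 1 in front:      # inserting b[j] moves from diagonal k-1 to k
                best = max(best, front[k - 1])
            if best >= 0:
                best = min(best, m, n - k)  # stay inside the table
                nxt[k] = extend(best, k)
        front = nxt
    return score


print(wavefront_edit_distance("kitten", "sitting"))
```

    Each score step touches at most 2s + 1 diagonals, so similar strings finish after few steps, which is consistent with the abstract's remark that performance depends on the similarity between the strings.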