10 research outputs found

    String Indexing for Patterns with Wildcards

    We consider the problem of indexing a string $t$ of length $n$ to report the occurrences of a query pattern $p$ containing $m$ characters and $j$ wildcards. Let $occ$ be the number of occurrences of $p$ in $t$, and $\sigma$ the size of the alphabet. We obtain the following results.
    - A linear space index with query time $O(m+\sigma^j \log \log n + occ)$. This significantly improves the previously best known linear space index by Lam et al. [ISAAC 2007], which requires query time $\Theta(jn)$ in the worst case.
    - An index with query time $O(m+j+occ)$ using space $O(\sigma^{k^2} n \log^k \log n)$, where $k$ is the maximum number of wildcards allowed in the pattern. This is the first non-trivial bound with this query time.
    - A time-space trade-off, generalizing the index by Cole et al. [STOC 2004].
    We also show that these indexes can be generalized to allow variable length gaps in the pattern. Our results are obtained using a novel combination of well-known and new techniques, which could be of independent interest.
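To make the problem statement concrete, here is a minimal sketch of the query itself, answered by a naive scan with no index. It is not any of the indexes from the abstract; the function name `wildcard_occurrences` and the choice of `*` as the wildcard symbol are assumptions made for illustration only.

```python
def wildcard_occurrences(t: str, p: str, wildcard: str = "*") -> list[int]:
    """Return every start position in t where the pattern p matches.

    A wildcard matches any single character. This is a brute-force
    baseline taking O(len(t) * len(p)) time, with no preprocessing of t.
    """
    n, m = len(t), len(p)
    return [
        i
        for i in range(n - m + 1)  # empty range when the pattern is longer than t
        if all(pc == wildcard or pc == tc for pc, tc in zip(p, t[i:i + m]))
    ]


if __name__ == "__main__":
    print(wildcard_occurrences("abracadabra", "a*a"))
```

The indexes described above aim to avoid this full scan by preprocessing $t$ once, so that query time depends on the pattern and the number of occurrences rather than on $n$.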

    Optimal Prefix and Suffix Queries on Texts

    In this paper, we study a restricted version of the position-restricted pattern matching problem introduced and studied by Makinen and Navarro [V. Makinen, G. Navarro, Position-restricted substring searching, in: J.R. Correa, A. Hevia, M.A. Kiwi (Eds.), LATIN, in: Lecture Notes in Computer Science, vol. 3887, Springer, 2006, pp. 703-714]. In the problem handled in this paper, we are interested in those occurrences of the pattern that lie in a suffix or in a prefix of the given text. We achieve optimal query time for our problem using a data structure that extends the classic suffix tree. The time and space complexity of the data structure is dominated by that of the suffix tree. Notably, the (best) algorithm by Makinen and Navarro, if applied to our problem, gives sub-optimal query time, and the corresponding data structure also requires more time and space.

    Lossless seeds for searching short patterns with high error rates

    We address the problem of approximate pattern matching using the Levenshtein distance. Given a text T and a pattern P, find all locations in T that differ by at most k errors from P. For that purpose, we propose a filtration algorithm that is based on a novel type of seeds, combining exact parts and parts with a fixed number of errors. Experimental tests show that the method is specifically well-suited for short patterns with a large number of errors.
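The matching problem stated above can be sketched with the classic dynamic-programming scan usually attributed to Sellers. This is a swapped-in baseline, not the seed-based filtration method the abstract proposes; the function name `approx_match_ends` is hypothetical, and reporting end positions (rather than full alignments) is a simplifying assumption.

```python
def approx_match_ends(text: str, pattern: str, k: int) -> list[int]:
    """Return end positions j such that some substring of text ending at j
    is within Levenshtein distance k of pattern.

    Classic O(len(text) * len(pattern)) dynamic program, kept to one column.
    """
    m = len(pattern)
    # col[i] = best distance between pattern[:i] and a substring of text
    # ending at the current text position; col[0] stays 0 because a match
    # may start anywhere in the text.
    col = list(range(m + 1))
    ends = []
    for j, tc in enumerate(text):
        prev_diag = col[0]  # value of col[i - 1] from the previous column
        for i in range(1, m + 1):
            cur = col[i]
            col[i] = min(
                prev_diag + (pattern[i - 1] != tc),  # match or substitution
                cur + 1,                             # extra character in text
                col[i - 1] + 1,                      # pattern character skipped
            )
            prev_diag = cur
        if col[m] <= k:
            ends.append(j)
    return ends


if __name__ == "__main__":
    print(approx_match_ends("abcdefg", "cxe", 1))
```

A filtration method would typically run a check like this only on the candidate regions that survive the seed stage, instead of on the whole text.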

    Partial 3D Object Retrieval using Local Binary QUICCI Descriptors and Dissimilarity Tree Indexing

    A complete pipeline is presented for accurate and efficient partial 3D object retrieval based on Quick Intersection Count Change Image (QUICCI) binary local descriptors and a novel indexing tree. It is shown how a modification to the QUICCI query descriptor makes it ideal for partial retrieval. An indexing structure called Dissimilarity Tree is proposed which can significantly accelerate searching the large space of local descriptors; this is applicable to QUICCI and other binary descriptors. The index exploits the distribution of bits within descriptors for efficient retrieval. The retrieval pipeline is tested on the artificial part of the SHREC'16 dataset with near-ideal retrieval results. Comment: 19 pages, 17 figures, to be published in Computers & Graphics

    Elastic-Degenerate String Matching with 1 Error

    An elastic-degenerate string is a sequence of $n$ finite sets of strings of total length $N$, introduced to represent a set of related DNA sequences, also known as a pangenome. The ED string matching (EDSM) problem consists in reporting all occurrences of a pattern of length $m$ in an ED text. This problem has recently received some attention by the combinatorial pattern matching community, culminating in an $\tilde{\mathcal{O}}(nm^{\omega-1})+\mathcal{O}(N)$-time algorithm [Bernardini et al., SIAM J. Comput. 2022], where $\omega$ denotes the matrix multiplication exponent and the $\tilde{\mathcal{O}}(\cdot)$ notation suppresses polylog factors. In the $k$-EDSM problem, the approximate version of EDSM, we are asked to report all pattern occurrences with at most $k$ errors. $k$-EDSM can be solved in $\mathcal{O}(k^2mG+kN)$ time, under edit distance, or $\mathcal{O}(kmG+kN)$ time, under Hamming distance, where $G$ denotes the total number of strings in the ED text [Bernardini et al., Theor. Comput. Sci. 2020]. Unfortunately, $G$ is only bounded by $N$, and so even for $k=1$, the existing algorithms run in $\Omega(mN)$ time in the worst case. In this paper we show that $1$-EDSM can be solved in $\mathcal{O}((nm^2 + N)\log m)$ or $\mathcal{O}(nm^3 + N)$ time under edit distance. For the decision version, we present a faster $\mathcal{O}(nm^2\sqrt{\log m} + N\log\log m)$-time algorithm. We also show that $1$-EDSM can be solved in $\mathcal{O}(nm^2 + N\log m)$ time under Hamming distance. Our algorithms for edit distance rely on non-trivial reductions from $1$-EDSM to special instances of classic computational geometry problems (2d rectangle stabbing or 2d range emptiness), which we show how to solve efficiently. In order to obtain an even faster algorithm for Hamming distance, we rely on employing and adapting the $k$-errata trees for indexing with errors [Cole et al., STOC 2004]. Comment: This is an extended version of a paper accepted at LATIN 202

    Approximate Similarity Search Under Edit Distance Using Locality-Sensitive Hashing

    Edit distance similarity search, also called approximate pattern matching, is a fundamental problem with widespread database applications. The goal of the problem is to preprocess n strings of length d, to quickly answer queries q of the form: if there is a database string within edit distance r of q, return a database string within edit distance cr of q. Previous approaches to this problem either rely on very large (superconstant) approximation ratios c, or very small search radii r. Outside of a narrow parameter range, these solutions are not competitive with trivially searching through all n strings. In this work we give a simple and easy-to-implement hash function that can quickly answer queries for a wide range of parameters. Specifically, our strategy can answer queries in time Õ(d3^rn^{1/c}). The best known practical results require c ≫ r to achieve any correctness guarantee; meanwhile, the best known theoretical results are very involved and difficult to implement, and require query time that can be loosely bounded below by 24^r. Our results significantly broaden the range of parameters for which there exist nontrivial theoretical bounds, while retaining the practicality of a locality-sensitive hash function.
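The query definition above, and the baseline of "trivially searching through all n strings", can be sketched as follows. This is the exact linear scan (approximation ratio c = 1), not the locality-sensitive hash from the abstract; the names `edit_distance` and `linear_scan_query` are hypothetical and chosen only for illustration.

```python
from typing import Optional


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j - 1] + (ca != cb),  # match or substitution
                prev[j] + 1,               # delete ca from a
                cur[j - 1] + 1,            # insert cb into a
            ))
        prev = cur
    return prev[-1]


def linear_scan_query(database: list[str], q: str, r: int) -> Optional[str]:
    """Return some database string within edit distance r of q, or None.

    Brute force: one full distance computation per database string.
    """
    for s in database:
        if edit_distance(s, q) <= r:
            return s
    return None


if __name__ == "__main__":
    print(linear_scan_query(["apple", "angle", "ample"], "amble", 1))
```

With n strings of length d, this scan costs on the order of n·d² character comparisons per query, which is the cost a hashing scheme would need to beat.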

    Building Blocks for Mapping Services

    Mapping services are ubiquitous on the Internet. These services enjoy a considerable user base. But it is often overlooked that providing a service on a global scale with virtually millions of users has been the playground of an oligopoly: only a select few service providers are able to do so. Unfortunately, the literature on these solutions is more than scarce. This thesis adds a number of building blocks to the literature that explain how to design and implement a number of features.

    Text indexing with errors
