42,704 research outputs found

    Improved Approximate String Matching and Regular Expression Matching on Ziv-Lempel Compressed Texts

    We study the approximate string matching and regular expression matching problem for the case when the text to be searched is compressed with the Ziv-Lempel adaptive dictionary compression schemes. We present a time-space trade-off that leads to algorithms improving the previously known complexities for both problems. In particular, we significantly improve the space bounds, which in practical applications are likely to be a bottleneck

    A practical index for approximate dictionary matching with few mismatches

    Approximate dictionary matching is a classic string matching problem (checking if a query string occurs in a collection of strings) with applications in, e.g., spellchecking, online catalogs, geolocation, and web searchers. We present a surprisingly simple solution called a split index, which is based on the Dirichlet principle, for matching a keyword with few mismatches, and experimentally show that it offers competitive space-time tradeoffs. Our implementation in the C++ language is focused mostly on data compaction, which is beneficial for the search speed (e.g., by being cache friendly). We compare our solution with other algorithms and we show that it performs better for the Hamming distance. Query times on the order of 1 microsecond were reported for one mismatch for a dictionary size of a few megabytes on a medium-end PC. We also demonstrate that a basic compression technique consisting of q-gram substitution can significantly reduce the index size (up to 50% of the input text size for the DNA), while still keeping the query time relatively low
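
    The Dirichlet (pigeonhole) idea mentioned in the abstract above can be sketched as follows: if a word is cut into k+1 pieces and at most k characters mismatch, at least one piece must match exactly, so exact lookups on pieces yield a small candidate set to verify. This is only a minimal Python illustration under that assumption, not the authors' compact C++ index; the names `split`, `build_index`, and `query` are hypothetical.

```python
from collections import defaultdict

def hamming(a, b):
    """Number of mismatching positions; assumes len(a) == len(b)."""
    return sum(x != y for x, y in zip(a, b))

def split(word, parts):
    """Cut word into `parts` contiguous pieces of near-equal length."""
    n = len(word)
    bounds = [i * n // parts for i in range(parts + 1)]
    return [word[bounds[i]:bounds[i + 1]] for i in range(parts)]

def build_index(words, k=1):
    """Map (word length, piece position, piece) to the words containing it."""
    index = defaultdict(list)
    for w in words:
        for pos, piece in enumerate(split(w, k + 1)):
            index[(len(w), pos, piece)].append(w)
    return index

def query(index, q, k=1):
    """Return dictionary words within Hamming distance k of q."""
    found = set()
    for pos, piece in enumerate(split(q, k + 1)):
        # Pigeonhole: any true match shares at least one piece exactly.
        for cand in index.get((len(q), pos, piece), ()):
            if hamming(cand, q) <= k:
                found.add(cand)
    return sorted(found)

index = build_index(["cat", "cot", "dog", "cart"], k=1)
print(query(index, "cut"))
```

    The index must be built and queried with the same k, since k determines how words are cut into pieces.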

    Chebyshev Greedy Algorithm in convex optimization

    Chebyshev Greedy Algorithm is a generalization of the well known Orthogonal Matching Pursuit defined in a Hilbert space to the case of Banach spaces. We apply this algorithm for constructing sparse approximate solutions (with respect to a given dictionary) to convex optimization problems. Rate of convergence results in a style of the Lebesgue-type inequalities are proved

    Harmonic Decomposition of Audio Signals with Matching Pursuit

    We introduce a dictionary of elementary waveforms, called harmonic atoms, that extends the Gabor dictionary and fits well the natural harmonic structures of audio signals. By modifying the "standard" matching pursuit, we define a new pursuit along with a fast algorithm, namely, the fast harmonic matching pursuit, to approximate N-dimensional audio signals with a linear combination of M harmonic atoms. Our algorithm has a computational complexity of O(MKN), where K is the number of partials in a given harmonic atom. The decomposition method is demonstrated on musical recordings, and we describe a simple note detection algorithm that shows how one could use a harmonic matching pursuit to detect notes even in difficult situations, e.g., very different note durations, lots of reverberation, and overlapping notes

    Solution of linear ill-posed problems using overcomplete dictionaries

    In the present paper we consider the application of overcomplete dictionaries to the solution of general ill-posed linear inverse problems. Construction of an adaptive optimal solution for such problems usually relies either on a singular value decomposition or on representation of the solution via an orthonormal basis. The shortcoming of both approaches lies in the fact that, in many situations, neither the eigenbasis of the linear operator nor a standard orthonormal basis constitutes an appropriate collection of functions for sparse representation of the unknown function. In the context of regression problems, there has been an enormous amount of effort to recover an unknown function using an overcomplete dictionary. One of the most popular methods, Lasso, is based on minimizing the empirical likelihood and requires stringent assumptions on the dictionary, the so-called compatibility conditions. While these conditions may be satisfied for the original dictionary functions, they usually do not hold for their images due to contraction imposed by the linear operator. In what follows, we bypass this difficulty by a novel approach which is based on inverting each of the dictionary functions and matching the resulting expansion to the true function, thus avoiding unrealistic assumptions on the dictionary and using Lasso in a predictive setting. We examine both the white noise and the observational model formulations and also discuss how exact inverse images of the dictionary functions can be replaced by their approximate counterparts. Furthermore, we show how the suggested methodology can be extended to the problem of estimation of a mixing density in a continuous mixture. For all the situations listed above, we provide the oracle inequalities for the risk in a finite sample setting. Simulation studies confirm good computational properties of the Lasso-based technique

    Towards Optimal Approximate Streaming Pattern Matching by Matching Multiple Patterns in Multiple Streams

    Recently, there has been a growing focus on solving approximate pattern matching problems in the streaming model. Of particular interest are the pattern matching with k-mismatches (KMM) problem and the pattern matching with w-wildcards (PMWC) problem. Motivated by reductions from these problems in the streaming model to the dictionary matching problem, this paper focuses on designing algorithms for the dictionary matching problem in the multi-stream model where there are several independent streams of data (as opposed to just one in the streaming model), and the memory complexity of an algorithm is expressed using two quantities: (1) a read-only shared memory storage area which is shared among all the streams, and (2) local stream memory that each stream stores separately. In the dictionary matching problem in the multi-stream model the goal is to preprocess a dictionary D={P_1,P_2,...,P_d} of d=|D| patterns (strings with maximum length m over alphabet Sigma) into a data structure stored in shared memory, so that given multiple independent streaming texts (where characters arrive one at a time) the algorithm reports occurrences of patterns from D in each one of the texts as soon as they appear. We design two efficient algorithms for the dictionary matching problem in the multi-stream model. The first algorithm works when all the patterns in D have the same length m and costs O(d log m) words in shared memory, O(log m log d) words in stream memory, and O(log m) time per character. The second algorithm works for general D, but the time cost per character becomes O(log m+log d log log d). We also demonstrate the usefulness of our first algorithm in solving both the KMM problem and PMWC problem in the streaming model. In particular, we obtain the first almost optimal (up to poly-log factors) algorithm for the PMWC problem in the streaming model. We also design a new algorithm for the KMM problem in the streaming model that, up to poly-log factors, has the same bounds as the most recent results that use different techniques. Moreover, for most inputs, our algorithm for KMM is significantly faster on average

    A cascaded approach to normalising gene mentions in biomedical literature

    Linking gene and protein names mentioned in the literature to unique identifiers in referent genomic databases is an essential step in accessing and integrating knowledge in the biomedical domain. However, it remains a challenging task due to lexical and terminological variation, and ambiguity of gene name mentions in documents. We present a generic and effective rule-based approach to link gene mentions in the literature to referent genomic databases, where pre-processing of both gene synonyms in the databases and gene mentions in text are first applied. The mapping method employs a cascaded approach, which combines exact, exact-like and token-based approximate matching by using flexible representations of a gene synonym dictionary and gene mentions generated during the pre-processing phase. We also consider multi-gene name mentions and permutation of components in gene names. A systematic evaluation of the suggested methods has identified steps that are beneficial for improving either precision or recall in gene name identification. The results of the experiments on the BioCreAtIvE2 data sets (identification of human gene names) demonstrated that our methods achieved highly encouraging results with F-measure of up to 81.20%

    The k-mismatch problem revisited

    We revisit the complexity of one of the most basic problems in pattern matching. In the k-mismatch problem we must compute the Hamming distance between a pattern of length m and every m-length substring of a text of length n, as long as that Hamming distance is at most k. Where the Hamming distance is greater than k at some alignment of the pattern and text, we simply output "No". We study this problem in both the standard offline setting and also as a streaming problem. In the streaming k-mismatch problem the text arrives one symbol at a time and we must give an output before processing any future symbols. Our main results are as follows: 1) Our first result is a deterministic $O(n k^2 \log k / m + n \, \text{polylog} \, m)$ time offline algorithm for k-mismatch on a text of length n. This is a factor of k improvement over the fastest previous result of this form from SODA 2000 by Amihood Amir et al. 2) We then give a randomised and online algorithm which runs in the same time complexity but requires only $O(k^2 \, \text{polylog} \, m)$ space in total. 3) Next we give a randomised $(1+\epsilon)$-approximation algorithm for the streaming k-mismatch problem which uses $O(k^2 \, \text{polylog} \, m / \epsilon^2)$ space and runs in $O(\text{polylog} \, m / \epsilon^2)$ worst-case time per arriving symbol. 4) Finally we combine our new results to derive a randomised $O(k^2 \, \text{polylog} \, m)$ space algorithm for the streaming k-mismatch problem which runs in $O(\sqrt{k} \log k + \text{polylog} \, m)$ worst-case time per arriving symbol. This improves the best previous space complexity for streaming k-mismatch from FOCS 2009 by Benny Porat and Ely Porat by a factor of k. We also improve the time complexity of this previous result by an even greater factor to match the fastest known offline algorithm (up to logarithmic factors)
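
    To make the problem statement above concrete, here is a naive O(nm) baseline in Python that follows the definition directly: one output per alignment, either the Hamming distance (if at most k) or "No". It is only an illustrative sketch of the problem, not any of the algorithms from the abstract; the function name `k_mismatch` is hypothetical.

```python
def k_mismatch(text, pattern, k):
    """For each alignment, return the Hamming distance if it is <= k, else "No"."""
    n, m = len(text), len(pattern)
    out = []
    for i in range(n - m + 1):
        d = 0
        for j in range(m):
            if text[i + j] != pattern[j]:
                d += 1
                if d > k:
                    break  # distance already exceeds k; stop early
        out.append(d if d <= k else "No")
    return out

print(k_mismatch("abcabd", "abd", 1))
```

    The early exit caps the work per alignment once k+1 mismatches are seen, but the worst case remains quadratic, which is the cost the results listed above improve on.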