139,963 research outputs found

    Approximate String Matching With Dynamic Programming and Suffix Trees

    Get PDF
    The importance and the contribution of string matching algorithms to the modern society cannot be overstated. From basic search algorithms such as spell checking and data querying, to advanced algorithms such as DNA sequencing, trend analysis and signal processing, string matching algorithms form the foundation of many aspects in computing that have been pivotal in technological advancement. In general, string matching algorithms can be divided into the categories of exact string matching and approximate string matching. We study each area and examine some of the well known algorithms. We probe into one of the most intriguing data structure in string algorithms, the suffix tree. The lowest common ancestor extension of the suffix tree is the key to many advanced string matching algorithms. With these tools, we are able to solve string problems that were, until recently, thought intractable by many. Another interesting and relatively new data structure in string algorithms is the suffix array, which has significant breakthroughs in its linear time construction in recent years. Primarily, this thesis focuses on approximate string matching using dynamic programming and hybrid dynamic programming with suffix tree. We study both approaches in detail and see how the merger of exact string matching and approximate string matching algorithms can yield synergistic results in our experiments

    The Index-Based Subgraph Matching Algorithm (ISMA): Fast Subgraph Enumeration in Large Networks Using Optimized Search Trees

    Get PDF
    Subgraph matching algorithms are designed to find all instances of predefined subgraphs in a large graph or network and play an important role in the discovery and analysis of so-called network motifs, subgraph patterns which occur more often than expected by chance. We present the index-based subgraph matching algorithm (ISMA), a novel tree-based algorithm. ISMA realizes a speedup compared to existing algorithms by carefully selecting the order in which the nodes of a query subgraph are investigated. In order to achieve this, we developed a number of data structures and maximally exploited symmetry characteristics of the subgraph. We compared ISMA to a naive recursive tree-based algorithm and to a number of well-known subgraph matching algorithms. Our algorithm outperforms the other algorithms, especially on large networks and with large query subgraphs. An implementation of ISMA in Java is freely available at http://sourceforge.net/projects/isma

    On Randomized Algorithms for Matching in the Online Preemptive Model

    Full text link
    We investigate the power of randomized algorithms for the maximum cardinality matching (MCM) and the maximum weight matching (MWM) problems in the online preemptive model. In this model, the edges of a graph are revealed one by one and the algorithm is required to always maintain a valid matching. On seeing an edge, the algorithm has to either accept or reject the edge. If accepted, then the adjacent edges are discarded. The complexity of the problem is settled for deterministic algorithms. Almost nothing is known for randomized algorithms. A lower bound of 1.6931.693 is known for MCM with a trivial upper bound of 22. An upper bound of 5.3565.356 is known for MWM. We initiate a systematic study of the same in this paper with an aim to isolate and understand the difficulty. We begin with a primal-dual analysis of the deterministic algorithm due to McGregor. All deterministic lower bounds are on instances which are trees at every step. For this class of (unweighted) graphs we present a randomized algorithm which is 2815\frac{28}{15}-competitive. The analysis is a considerable extension of the (simple) primal-dual analysis for the deterministic case. The key new technique is that the distribution of primal charge to dual variables depends on the "neighborhood" and needs to be done after having seen the entire input. The assignment is asymmetric: in that edges may assign different charges to the two end-points. Also the proof depends on a non-trivial structural statement on the performance of the algorithm on the input tree. The other main result of this paper is an extension of the deterministic lower bound of Varadaraja to a natural class of randomized algorithms which decide whether to accept a new edge or not using independent random choices

    Suffix Structures and Circular Pattern Problems

    Get PDF
    The suffix tree is a data structure used to represent all the suffixes in a string. However, a major problem with the suffix tree is its practical space requirement. In this dissertation, we propose an efficient data structure -- the virtual suffix tree (VST) -- which requires less space than other recently proposed data structures for suffix trees and suffix arrays. On average, the space requirement (including that for suffix arrays and suffix links) is 13.8n bytes for the regular VST, and 12.05n bytes in its compact form, where n is the length of the sequence.;Markov models are very popular for modeling complex sequences. In this dissertation, we present the probabilistic suffix array (PSA), a space-efficient alternative to the probabilistic suffix tree (PST) used to represent Markov models. The PSA provides all the capabilities of the PST, such as learning and prediction, and maintains the same linear time construction (linearity with respect to sequence length). The PSA, however, has a significantly smaller memory requirement than the PST, for both the construction stage, and at the time of usage.;Using the proposed suffix data structures, we study the circular pattern matching (CPM) problem. We provide a linear time, linear space algorithm to solve the exact circular pattern matching problem. We then present four algorithms to address the approximate circular pattern matching (ACPM) problem. Our bidirectional ACPM algorithm provides the best time complexity when compared with other algorithms proposed in the literature. Further, we define the circular pattern discovery (CPD) problem and present algorithms to solve this problem. Using the proposed circular pattern matching algorithms, we perform experiments on computational analysis and function prediction for multidomain proteins

    Efficient Approximation of the Matching Distance for 2-Parameter Persistence

    Get PDF
    In topological data analysis, the matching distance is a computationally tractable metric on multi-filtered simplicial complexes. We design efficient algorithms for approximating the matching distance of two bi-filtered complexes to any desired precision ?>0. Our approach is based on a quad-tree refinement strategy introduced by Biasotti et al., but we recast their approach entirely in geometric terms. This point of view leads to several novel observations resulting in a practically faster algorithm. We demonstrate this speed-up by experimental comparison and provide our code in a public repository which provides the first efficient publicly available implementation of the matching distance

    Average-case analysis of perfect sorting by reversals (Journal Version)

    Full text link
    Perfect sorting by reversals, a problem originating in computational genomics, is the process of sorting a signed permutation to either the identity or to the reversed identity permutation, by a sequence of reversals that do not break any common interval. B\'erard et al. (2007) make use of strong interval trees to describe an algorithm for sorting signed permutations by reversals. Combinatorial properties of this family of trees are essential to the algorithm analysis. Here, we use the expected value of certain tree parameters to prove that the average run-time of the algorithm is at worst, polynomial, and additionally, for sufficiently long permutations, the sorting algorithm runs in polynomial time with probability one. Furthermore, our analysis of the subclass of commuting scenarios yields precise results on the average length of a reversal, and the average number of reversals.Comment: A preliminary version of this work appeared in the proceedings of Combinatorial Pattern Matching (CPM) 2009. See arXiv:0901.2847; Discrete Mathematics, Algorithms and Applications, vol. 3(3), 201

    Object Detection in High Resolution Aerial Images and Hyperspectral Remote Sensing Images

    Get PDF
    With rapid developments in satellite and sensor technologies, there has been a dramatic increase in the availability of remotely sensed images. However, the exploration of these images still involves a tremendous amount of human interventions, which are tedious, time-consuming, and inefficient. To help imaging experts gain a complete understanding of the images and locate the objects of interest in a more accurate and efficient way, there is always an urgent need for developing automatic detection algorithms. In this work, we delve into the object detection problems in remote sensing applications, exploring the detection algorithms for both hyperspectral images (HSIs) and high resolution aerial images. In the first part, we focus on the subpixel target detection problem in HSIs with low spatial resolutions, where the objects of interest are much smaller than the image pixel spatial resolution. To this end, we explore the detection frameworks that integrate image segmentation techniques in designing the matched filters (MFs). In particular, we propose a novel image segmentation algorithm to identify the spatial-spectral coherent image regions, from which the background statistics were estimated for deriving the MFs. Extensive experimental studies were carried out to demonstrate the advantages of the proposed subpixel target detection framework. Our studies show the superiority of the approach when comparing to state-of-the-art methods. The second part of the thesis explores the object based image analysis (OBIA) framework for geospatial object detection in high resolution aerial images. Specifically, we generate a tree representation of the aerial images from the output of hierarchical image segmentation algorithms and reformulate the object detection problem into a tree matching task. We then proposed two tree-matching algorithms for the object detection framework. We demonstrate the efficiency and effectiveness of the proposed tree-matching based object detection framework. In the third part, we study object detection in high resolution aerial images from a machine learning perspective. We investigate both traditional machine learning based framework and end-to-end convolutional neural network (CNN) based approach for various object detection tasks. In the traditional detection framework, we propose to apply the Gaussian process classifier (GPC) to train an object detector and demonstrate the advantages of the probabilistic classification algorithm. In the CNN based approach, we proposed a novel scale transfer module that generates enhanced feature maps for object detection. Our results show the efficiency and competitiveness of the proposed algorithms when compared to state-of-the-art counterparts

    Holistic Twig Joins: Optimal XML Pattern Matching

    Get PDF
    XML employs a tree-structured data model, and, naturally, XML queries specify patterns of selection predicates on multiple elements related by a tree structure. Finding all occurrences of such a twig pattern in an XML database is a core operation for XML query processing. Prior work has typically decomposed the twig pattern into binary structural (parent-child and ancestor-descendant) relationships, and twig matching is achieved by: (i) using structural join algorithms to match the binary relationships against the XML database, and (ii) stitching together these basic matches. A limitation of this approach for matching twig patterns is that intermediate result sizes can get large, even when the input and output sizes are more manageable. In this paper, we propose a novel holistic twig join algorithm, TwigStack, for matching an XML query twig pattern. Our technique uses a chain of linked stacks to compactly represent partial results to root-to-leaf query paths, which are then composed to obtain matches for the twig pattern. When the twig pattern uses only ancestor-descendant relationships between elements, TwigStack is I/O and CPU optimal among all sequential algorithms that read the entire input: it is linear in the sum of sizes of the input lists and the final result list, but independent of the sizes of intermediate results. We then show how to use (a modification of) B-trees, along with TwigStack, to match query twig patterns in sub-linear time. Finally, we complement our analysis with experimental results on a range of real and synthetic data, and query twig patterns

    Fast matching statistics in small space

    Get PDF
    Computing the matching statistics of a string S with respect to a string T on an alphabet of size sigma is a fundamental primitive for a number of large-scale string analysis applications, including the comparison of entire genomes, for which space is a pressing issue. This paper takes from theory to practice an existing algorithm that uses just O(|T|log{sigma}) bits of space, and that computes a compact encoding of the matching statistics array in O(|S|log{sigma}) time. The techniques used to speed up the algorithm are of general interest, since they optimize queries on the existence of a Weiner link from a node of the suffix tree, and parent operations after unsuccessful Weiner links. Thus, they can be applied to other matching statistics algorithms, as well as to any suffix tree traversal that relies on such calls. Some of our optimizations yield a matching statistics implementation that is up to three times faster than a plain version of the algorithm, depending on the similarity between S and T. In genomic datasets of practical significance we achieve speedups of up to 1.8, but our fastest implementations take on average twice the time of an existing code based on the LCP array. The key advantage is that our implementations need between one half and one fifth of the competitor\u27s memory, and they approach comparable running times when S and T are very similar

    Differentially Private Algorithms for Graphs Under Continual Observation

    Get PDF
    Differentially private algorithms protect individuals in data analysis scenarios by ensuring that there is only a weak correlation between the existence of the user in the data and the result of the analysis. Dynamic graph algorithms maintain the solution to a problem (e.g., a matching) on an evolving input, i.e., a graph where nodes or edges are inserted or deleted over time. They output the value of the solution after each update operation, i.e., continuously. We study (event-level and user-level) differentially private algorithms for graph problems under continual observation, i.e., differentially private dynamic graph algorithms. We present event-level private algorithms for partially dynamic counting-based problems such as triangle count that improve the additive error by a polynomial factor (in the length TT of the update sequence) on the state of the art, resulting in the first algorithms with additive error polylogarithmic in TT. We also give ε\varepsilon-differentially private and partially dynamic algorithms for minimum spanning tree, minimum cut, densest subgraph, and maximum matching. The additive error of our improved MST algorithm is O(Wlog3/2T/ε)O(W \log^{3/2}T / \varepsilon), where WW is the maximum weight of any edge, which, as we show, is tight up to a (logT/ε)(\sqrt{\log T} / \varepsilon)-factor. For the other problems, we present a partially-dynamic algorithm with multiplicative error (1+β)(1+\beta) for any constant β>0\beta > 0 and additive error O(Wlog(nW)log(T)/(εβ))O(W \log(nW) \log(T) / (\varepsilon \beta) ). Finally, we show that the additive error for a broad class of dynamic graph algorithms with user-level privacy must be linear in the value of the output solution's range
    corecore