57 research outputs found

    Dynamic and Internal Longest Common Substring

    Given two strings S and T, each of length at most n, the longest common substring (LCS) problem is to find a longest substring common to S and T. This is a classical problem in computer science with an O(n)-time solution. In the fully dynamic setting, edit operations are allowed in either of the two strings, and the problem is to find an LCS after each edit. We present the first solution to the fully dynamic LCS problem requiring sublinear time in n per edit operation. In particular, we show how to find an LCS after each edit operation in Õ(n^{2/3}) time, after Õ(n)-time and space preprocessing. This line of research was recently initiated in a somewhat restricted dynamic variant by Amir et al. [SPIRE 2017]. More specifically, the authors presented an Õ(n)-sized data structure that returns an LCS of the two strings after a single edit operation (that is reverted afterwards) in Õ(1) time. At CPM 2018, three papers (Abedin et al., Funakoshi et al., and Urabe et al.) studied analogously restricted dynamic variants of problems on strings; specifically, computing the longest palindrome and the Lyndon factorization of a string after a single edit operation. We develop dynamic sublinear-time algorithms for both of these problems as well. We also consider internal LCS queries, that is, queries in which we are to return an LCS of a pair of substrings of S and T. We show that answering such queries is hard in general and propose efficient data structures for several restricted cases.
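
    For orientation, the static problem defined above can be sketched with a textbook dynamic program. This is only a quadratic-time baseline, not the O(n)-time suffix-tree solution or the dynamic algorithm the abstract describes; the function name `longest_common_substring` is an illustrative assumption.

```python
def longest_common_substring(s: str, t: str) -> str:
    """Return one longest common substring of s and t.

    Textbook O(|s| * |t|) dynamic program, shown only as a baseline.
    """
    best_len = 0
    best_end = 0  # exclusive end index in s of the best match found so far
    # prev[j] = length of the longest common suffix of s[:i-1] and t[:j]
    prev = [0] * (len(t) + 1)
    for i in range(1, len(s) + 1):
        curr = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            if s[i - 1] == t[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = i
        prev = curr
    return s[best_end - best_len:best_end]


if __name__ == "__main__":
    s, t = "xabcdy", "zzabcdq"
    print(longest_common_substring(s, t))
    # Naive "dynamic" handling: apply one substitution to s and recompute
    # from scratch. Avoiding this full recomputation is the point of the
    # fully dynamic setting.
    s_edited = s[:3] + "X" + s[4:]
    print(longest_common_substring(s_edited, t))
```

    Recomputing after every edit costs quadratic time per operation in this sketch, which is the cost a sublinear-time-per-edit algorithm is designed to avoid.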

    Efficient Data Structures for Text Processing Applications

    This thesis is devoted to designing and analyzing efficient text indexing data structures and associated algorithms for processing text data. The general problem is to preprocess a given text or a collection of texts into a space-efficient index to quickly answer various queries on this data. Basic queries such as counting or reporting a given pattern's occurrences as substrings of the original text are useful in modeling critical bioinformatics applications. This line of research has witnessed many breakthroughs, such as suffix trees, suffix arrays, and the FM-index. In this work, we revisit the following problems: (1) the Heaviest Induced Ancestors problem, (2) the Range Longest Common Prefix problem, (3) the Range Shortest Unique Substrings problem, and (4) the Non-Overlapping Indexing problem. For the first problem, we present two new space-time trade-offs that improve the space, query time, or both of the existing solutions by roughly a logarithmic factor. For the second problem, our solution takes linear space, which improves the previous result by a logarithmic factor. The techniques developed are then extended to obtain an efficient solution for our third problem, which is newly formulated. Finally, we present a new framework that yields efficient solutions for the last problem in both cache-aware and cache-oblivious models.

    Entropy Lower Bounds for Dictionary Compression

    We show that a wide class of dictionary compression methods (including LZ77, LZ78, grammar compressors as well as parsing-based structures) require |S|H_k(S) + Omega(|S| k log sigma / log_sigma |S|) bits to encode their output. This matches known upper bounds and improves the information-theoretic lower bound of |S|H_k(S). To this end, we abstract the crucial properties of parsings created by those methods, construct a certain family of strings and analyze the parsings of those strings. We also show that for k = alpha log_sigma |S|, where 0 < alpha < 1 is a constant, the aforementioned methods produce an output of size at least 1/(1-alpha) |S|H_k(S) bits. Thus our results separate dictionary compressors from context-based ones (such as PPM) and BWT-based ones, as those include methods achieving |S|H_k(S) + O(sigma^k log sigma) bits, i.e. the redundancy depends on k and sigma but not on |S|.
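
    The quantity |S|H_k(S) in the bounds above is the empirical k-th order entropy of S scaled by its length. A minimal sketch of how it can be computed is given below, assuming the common definition in which each length-k context contributes the zeroth-order entropy of the characters that follow it, and the first k characters (which have no full context) are ignored. The function names are illustrative assumptions.

```python
from collections import Counter, defaultdict
from math import log2


def h0_bits(counts: Counter) -> float:
    """Return n * H_0 in bits for a multiset of symbols given by its counts."""
    n = sum(counts.values())
    return sum(c * log2(n / c) for c in counts.values())


def total_hk_bits(s: str, k: int) -> float:
    """Return |S| * H_k(S) in bits under the definition stated above."""
    # followers[w] counts the characters that appear right after context w
    followers = defaultdict(Counter)
    for i in range(k, len(s)):
        followers[s[i - k:i]][s[i]] += 1
    return sum(h0_bits(c) for c in followers.values())


if __name__ == "__main__":
    for k in range(3):
        print(k, total_hk_bits("abracadabra", k))
```

    For k = 0 the only context is the empty string, so the function reduces to |S|H_0(S). Definitions in the literature differ slightly in how the first k characters are treated, so small discrepancies with other implementations are possible.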

    Sequence searching allowing for non-overlapping adjacent unbalanced translocations

    Unbalanced translocations are among the most frequent chromosomal alterations, accounting for 30% of all losses of heterozygosity, a major genetic event causing inactivation of tumor suppressor genes. Despite their central role in genomic sequence analysis, little attention has been devoted to the problem of matching sequences allowing for this kind of chromosomal alteration. In this paper we investigate the approximate string matching problem when the edit operations are non-overlapping unbalanced translocations of adjacent factors. In particular, we first present an O(nm^3)-time and O(m^2)-space algorithm based on the dynamic-programming approach. Then we improve our first result by designing a second solution which makes use of the Directed Acyclic Word Graph of the pattern. In particular, we show that under the assumptions of equiprobability and independence of characters, our algorithm has an O(n log^2_σ m) average time complexity, for an alphabet of size σ, while still maintaining the O(nm^3)-time and O(m^2)-space complexity in the worst case. To the best of our knowledge this is the first solution in the literature for the approximate string matching problem allowing for unbalanced translocations of factors.

    Sequence Searching Allowing for Non-Overlapping Adjacent Unbalanced Translocations


    Succinct Data Structures for Parameterized Pattern Matching and Related Problems

    Let T be a fixed text-string of length n and P be a varying pattern-string of length |P| <= n. Both T and P contain characters from a totally ordered alphabet Sigma of size sigma <= n. The suffix tree is the ubiquitous data structure for answering a pattern matching query: report all the positions i in T such that T[i + k - 1] = P[k], 1 <= k <= |P|. Compressed data structures support pattern matching queries using much less space than the suffix tree, mainly by relying on a crucial property of the leaves in the tree. Unfortunately, in many suffix tree variants (such as the parameterized suffix tree, the order-preserving suffix tree, and the 2-dimensional suffix tree), this property does not hold. Consequently, compressed representations of these suffix tree variants have been elusive. We present the first compressed data structures for two important variants of the pattern matching problem: (1) Parameterized Matching: report a position i in T if T[i + k - 1] = f(P[k]), 1 <= k <= |P|, for a one-to-one function f that renames the characters in P to the characters in T[i, i + |P| - 1], and (2) Order-preserving Matching: report a position i in T if T[i + j - 1] and T[i + k - 1] have the same relative order as that of P[j] and P[k], 1 <= j < k <= |P|. For each of these two problems, the existing suffix tree variant requires O(n log n) bits of space and answers a query in O(|P| log sigma + occ) time, where occ is the number of starting positions where a match exists. We present data structures that require O(n log sigma) bits of space and answer a query in O((|P| + occ) poly(log n)) time. As a byproduct, we obtain compressed data structures for a few other variants, as well as introduce two new techniques (of independent interest) for designing compressed data structures for pattern matching.
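
    The two matching conditions defined above can be made concrete with brute-force checkers. The sketch below tests every window of T directly against the definitions; it is not one of the compressed data structures the abstract describes, and the names `parameterized_match`, `order_preserving_match`, and `find_all` are illustrative assumptions. Positions are reported 1-based, as in the abstract.

```python
from typing import Callable, List, Sequence


def parameterized_match(window: Sequence, pattern: Sequence) -> bool:
    """True if a one-to-one renaming f maps pattern onto window."""
    fwd, bwd = {}, {}  # f and its inverse, built incrementally
    for p_ch, w_ch in zip(pattern, window):
        if fwd.setdefault(p_ch, w_ch) != w_ch:
            return False  # f would not be a function
        if bwd.setdefault(w_ch, p_ch) != p_ch:
            return False  # f would not be one-to-one
    return True


def order_preserving_match(window: Sequence, pattern: Sequence) -> bool:
    """True if every pair of positions has the same relative order."""
    def sign(a, b) -> int:
        return (a > b) - (a < b)

    m = len(pattern)
    return all(
        sign(window[j], window[k]) == sign(pattern[j], pattern[k])
        for j in range(m)
        for k in range(j + 1, m)
    )


def find_all(text: Sequence, pattern: Sequence,
             matches: Callable[[Sequence, Sequence], bool]) -> List[int]:
    """Return all 1-based starting positions i where the window matches."""
    m = len(pattern)
    return [
        i + 1
        for i in range(len(text) - m + 1)
        if matches(text[i:i + m], pattern)
    ]


if __name__ == "__main__":
    print(find_all("abbacc", "xyy", parameterized_match))
    print(find_all([10, 20, 15, 30, 25], [1, 3, 2], order_preserving_match))
```

    Each query costs O(n|P|) time for the parameterized check and O(n|P|^2) for the pairwise order check in this sketch, which is the kind of cost that index structures are meant to avoid.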

    28th Annual Symposium on Combinatorial Pattern Matching : CPM 2017, July 4-6, 2017, Warsaw, Poland

    Peer reviewed

    27th Annual European Symposium on Algorithms: ESA 2019, September 9-11, 2019, Munich/Garching, Germany


    Methods for the identification of common RNA motifs

    Löwes B. Methods for the identification of common RNA motifs. Bielefeld: Universität Bielefeld; 2017. For a long time, non-coding RNAs were given less attention than messenger RNAs, even though their existence was proposed at a similar time in 1971, because the research focus was mostly on protein coding genes. With the discovery of catalytically active RNA molecules and micro RNAs, which are involved in the post-transcriptional regulation of gene expression, non-coding RNAs have gained widespread attention. It was revealed early on that non-coding RNAs are often more conserved in structure than in sequence. Since determining the function of non-coding RNAs includes costly and time-consuming laboratory experiments, computational methods can help identify further homologs of experimentally validated RNA families. But a question remains: can we identify potential RNAs with novel functions solely by using *in silico* methods? In this thesis, we perform an evaluation of 4,667 viral reference genomes in order to identify common RNA motifs shared by multiple taxonomically distant viruses. One potential mechanism that might explain similar motifs in taxonomically distant viruses that infect common hosts by interacting with their cellular components is convergent evolution. Convergent evolution describes the phenomenon that two different species that originated from different ancestors share related or similar traits. By looking for long stretches of exact RNA structure matches with low sequence conservation, we want to maximize the chance that the common motifs are the result of structural convergence due to similar selection criteria in common host organisms. Viruses are an excellent fit when it comes to the discovery of shared RNA motifs without the involvement of conserved sequence regions because of their high mutation rates.
    We were able to identify 69 RNA motifs, which could not be assigned to any of the existing RNA families, with a length of at least 50 nucleotides that are shared among at least three taxonomically distant viruses. The secondary structure of an RNA molecule can be represented as a string. Finding maximal repeats in strings can be done using well-known string matching techniques based on suffix trees and arrays. In contrast to normal RNA sequences, secondary structure strings represent base pairing interactions within a single molecule. Thus, not every substring of the secondary structure defines a well-formed RNA structure. Therefore, we describe a new data structure, the viable suffix tree, that takes the constraints on the RNA secondary structure into account and only returns maximal repeats that are well-formed structures. This data structure is not limited to RNA structures; it can also be used for any other problem domain for which a set of allowed words can be defined, e.g. by using a grammar. However, the overall complexity of constructing the viable suffix tree cannot be lower than the complexity of the word problem for the language of such a grammar. A limitation of exact structure matching is the need for long common stretches of secondary structures that are not allowed to have a mismatch at any position. Therefore, we need to allow small mismatches to find more potential targets, but current state-of-the-art techniques use methods for sequence and structure comparisons that are computationally too expensive and exhibit high false positive rates of around 50%. We present a new approach that uses smaller RNA sequence and structure seed motifs that do not require long stretches of the secondary structure to be identical. The sequence and structure motifs can be hashed into integer values, which can be compared much faster.
    An evaluation using the three well-understood hammerhead ribozyme families showed that our approach is able to detect 70% to 80% of the hammerhead motifs with a false positive rate similar to that of the other approaches. Whenever the performance of new and existing tools is to be compared, there is a need for a benchmark data set with an underlying gold standard. BRaliBase is a widely used benchmark for assessing the accuracy of RNA secondary structure alignment methods. In most case studies based on the BRaliBase benchmark, one can observe a puzzling drop in accuracy in the 40% to 60% sequence identity range, the so-called “BRaliBase dent”. We show that this dent is due to a bias in the composition of the BRaliBase benchmark, namely the inclusion of a disproportionate number of tRNAs, which exhibit a very conserved secondary structure. Furthermore, we show that a simple sampling approach that restricts the presence of the most abundant RNA families can prevent such artifacts during the performance evaluation.
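
    The abstract's observation that not every substring of a secondary structure string is itself a well-formed structure can be illustrated with a small check. The sketch below assumes the structure is written in dot-bracket notation ("(" and ")" for paired bases, "." for unpaired ones), which the abstract does not state explicitly; the function name `is_well_formed` is hypothetical and this is not the viable suffix tree itself.

```python
def is_well_formed(structure: str) -> bool:
    """Check that every '(' has a matching ')' in a dot-bracket string."""
    depth = 0  # number of currently open base pairs
    for ch in structure:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False  # closing bracket without an opening partner
        elif ch != ".":
            raise ValueError(f"unexpected character {ch!r}")
    return depth == 0


if __name__ == "__main__":
    full = "((..))"
    print(is_well_formed(full))       # the whole structure
    print(is_well_formed(full[1:]))   # a substring that loses an opening bracket
```

    A plain suffix tree over such strings would also report repeats like the second substring, which is why a filter or a constrained index of this kind is needed.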