33 research outputs found
Sequence searching allowing for non-overlapping adjacent unbalanced translocations
Unbalanced translocations are among the most frequent chromosomal alterations, accounted for 30% of all losses of heterozygosity, a major genetic event causing inactivation of tumor suppressor genes. Despite of their central role in genomic sequence analysis, little attention has been devoted to the problem of matching sequences allowing for this kind of chromosomal alteration.
In this paper we investigate the approximate string matching problem when the edit operations are non-overlapping unbalanced translocations of adjacent factors. In particular, we first present a O(nm3)-time and O(m2)-space algorithm based on the dynamic-programming approach. Then we improve our first result by designing a second solution which makes use of the Directed Acyclic Word Graph of the pattern. In particular, we show that under the assumptions of equiprobability and independence of characters, our algorithm has a O(n log2σ m) average time complexity, for an alphabet of size σ, still maintaining the O(nm3)-time and the O(m2)-space complexity in the worst case. To the best of our knowledge this is the first solution in literature for the approximate string matching problem allowing for unbalanced translocations of factors
A Parameterized Study of Maximum Generalized Pattern Matching Problems
The generalized function matching (GFM) problem has been intensively studied
starting with [Ehrenfeucht and Rozenberg, 1979]. Given a pattern p and a text
t, the goal is to find a mapping from the letters of p to non-empty substrings
of t, such that applying the mapping to p results in t. Very recently, the
problem has been investigated within the framework of parameterized complexity
[Fernau, Schmid, and Villanger, 2013].
In this paper we study the parameterized complexity of the optimization
variant of GFM (called Max-GFM), which has been introduced in [Amir and Nor,
2007]. Here, one is allowed to replace some of the pattern letters with some
special symbols "?", termed wildcards or don't cares, which can be mapped to an
arbitrary substring of the text. The goal is to minimize the number of
wildcards used.
We give a complete classification of the parameterized complexity of Max-GFM
and its variants under a wide range of parameterizations, such as, the number
of occurrences of a letter in the text, the size of the text alphabet, the
number of occurrences of a letter in the pattern, the size of the pattern
alphabet, the maximum length of a string matched to any pattern letter, the
number of wildcards and the maximum size of a string that a wildcard can be
mapped to.Comment: to appear in Proc. IPEC'1
Weak factor automata : the failure of failure factor oracles?
In indexing of, and pattern matching on, DNA and text sequences, it is often important to represent all factors of a
sequence. One e cient, compact representation is the factor oracle (FO). At the same time, any classical deterministic
nite automaton (DFA) can be transformed to a so-called failure one (FDFA), which may use failure transitions to replace
multiple symbol transitions, potentially yielding a more compact representation. We combine the two ideas and directly
construct a failure factor oracle (FFO) from a given sequence, in contrast to ex post facto transformation to an FDFA. The
algorithm is suitable for both short and long sequences. We empirically compared the resulting FFOs and FOs on number
of transitions for many DNA sequences of lengths 4 - 512, showing gains of up to 10% in total number of transitions, with
failure transitions also taking up less space than symbol transitions. The resulting FFOs can be used for indexing, as
well as in a variant of the FO-using backward oracle matching algorithm. We discuss and classify this pattern matching
algorithm in terms of the keyword pattern matching taxonomies of Watson, Cleophas and Zwaan. We also empirically
compared the use of FOs and FFOs in such backward reading pattern matching algorithms, using both DNA and natural
language (English) data sets. The results indicate that the decrease in pattern matching performance of an algorithm using
an FFO instead of an FO may outweigh the gain in representation space by using an FFO instead of an FO.http://www.journals.co.za/ej/ejour_comp.htmlam201
Algorithms for Order-Preserving Matching
String matching is a widely studied problem in Computer Science. There have been many recent developments in this field. One fascinating problem considered lately is the order-preserving matching (OPM) problem. The task is to find all the substrings in the text which have the same length and relative order as the pattern, where the relative order is the numerical order of the numbers in a string. The problem finds its applications in the areas involving time series or series of numbers. More specifically, it is useful for those who are interested in the relative order of the pattern and not in the pattern itself. For example, it can be used by analysts in a stock market to study movements of prices. In addition to the OPM problem, we also studied its approximate variation. In approximate order-preserving matching, we search for those substrings in the text which have relative order similar to the pattern, i.e., relative order of the pattern matches with at most k mismatches. With respect to applications of order-preserving matching, approximate search is more meaningful than exact search. We developed various advanced solutions for the problem and its variant. Special emphasis was laid on the practical efficiency of the solutions. Particularly, we introduced a simple solution for the OPM problem using filtration. We proved experimentally that our method was effective and faster than the previous solutions for the problem. In addition, we combined the Single Instruction Multiple Data (SIMD) instruction set architecture with filtration to develop competent solutions which were faster than our previous solution. Moreover, we proposed another efficient solution without filtration using the SIMD architecture. We also presented an offline solution based on the FM-index scheme. Furthermore, we proposed practical solutions for the approximate order-preserving matching problem and one of the solutions was the first sublinear solution on average for the problem
Recommended from our members
Text Indexing for Long Patterns: Anchors are All you Need
PVLDB Artifact Availability:
The source code, data, and/or other artifacts have been made available at https://github.com/lorrainea/BDA- index.Copyright © 2023 the owner/author(s). In many real-world database systems, a large fraction of the data is represented by strings: sequences of letters over some alphabet. This is because strings can easily encode data arising from different sources. It is often crucial to represent such string datasets in a compact form but also to simultaneously enable fast pattern matching queries. This is the classic text indexing problem. The four absolute measures anyone should pay attention to when designing or implementing a text index are: (i) index space; (ii) query time; (iii) construction space; and (iv) construction time. Unfortunately, however, most (if not all) widely-used indexes (e.g., suffix tree, suffix array, or their compressed counterparts) are not optimized for all four measures simultaneously, as it is difficult to have the best of all four worlds. Here, we take an important step in this direction by showing that text indexing with locally consistent anchors (lc-anchors) offers remarkably good performance in all four measures, when we have at hand a lower bound l on the length of the queried patterns --- which is arguably a quite reasonable assumption in practical applications. Specifically, we improve on the construction of the index proposed by Loukides and Pissis, which is based on bidirectional string anchors (bd-anchors), a new type of lc-anchors, by: (i) designing an average-case linear-time algorithm to compute bd-anchors; and (ii) developing a semi-external-memory implementation to construct the index in small space using near-optimal work. We then present an extensive experimental evaluation, based on the four measures, using real benchmark datasets. The results show that, for long patterns, the index constructed using our improved algorithms compares favorably to all classic indexes: (compressed) suffix tree; (compressed) suffix array; and the FM-index.European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreements No 872539 and 956229, respectively; and by UKRI through REPHRAIN (EP/V011189/1)
A Parameterized Study of Maximum Generalized Pattern Matching Problems
The generalized function matching (GFM) problem has been intensively studied starting with Ehrenfreucht and Rozenberg (Inf Process Lett 9(2):86–88, 1979). Given a pattern p and a text t, the goal is to find a mapping from the letters of p to non-empty substrings of t, such that applying the mapping to p results in t. Very recently, the problem has been investigated within the framework of parameterized complexity (Fernau et al. in FSTTCS, 2013). In this paper we study the parameterized complexity of the optimization variant of GFM (called Max-GFM), which has been introduced in Amir and Amihood (J Discrete Algorithms 5(3):514–523, 2007). Here, one is allowed to replace some of the pattern letters with some special symbols “?”, termed wildcards or don’t cares, which can be mapped to an arbitrary substring of the text. The goal is to minimize the number of wildcards used. We give a complete classification of the parameterized complexity of Max-GFM and its variants under a wide range of parameterizations, such as, the number of occurrences of a letter in the text, the size of the text alphabet, the number of occurrences of a letter in the pattern, the size of the pattern alphabet, the maximum length of a string matched to any pattern letter, the number of wildcards and the maximum size of a string that a wildcard can be mapped to