A Note on Easy and Efficient Computation of Full Abelian Periods of a Word
Constantinescu and Ilie (Bulletin of the EATCS 89, 167-170, 2006) introduced
the idea of an Abelian period with head and tail of a finite word. An Abelian
period is called full if both the head and the tail are empty. We present a
simple and easy-to-implement algorithm for computing all the full Abelian
periods of a word of length n over a constant-size alphabet. Experiments show
that our algorithm significantly outperforms the algorithm proposed by
Kociumaka et al. (Proc. of STACS, 245-256, 2013) for the same problem.
Comment: Accepted for publication in Discrete Applied Mathematics
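For concreteness, a full Abelian period p of a word w of length n means that p divides n and every consecutive length-p block of w has the same Parikh vector (multiset of letter counts). A brute-force checker (an illustrative sketch, not the note's algorithm) can be written as:

```python
from collections import Counter

def full_abelian_periods(w):
    """Return all full Abelian periods of w by brute force.

    p is a full Abelian period iff p divides len(w) and all
    consecutive length-p blocks have equal letter counts
    (Parikh vectors). Quadratic in the worst case; for
    illustration only.
    """
    n = len(w)
    periods = []
    for p in range(1, n + 1):
        if n % p:
            continue
        first = Counter(w[:p])
        if all(Counter(w[i:i + p]) == first for i in range(p, n, p)):
            periods.append(p)
    return periods

print(full_abelian_periods("abba"))  # [2, 4]
```

For example, "abba" has full Abelian period 2 because the blocks "ab" and "ba" contain the same letters, even though 2 is not an ordinary period.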
A Parameterized Study of Maximum Generalized Pattern Matching Problems
The generalized function matching (GFM) problem has been intensively studied
starting with [Ehrenfeucht and Rozenberg, 1979]. Given a pattern p and a text
t, the goal is to find a mapping from the letters of p to non-empty substrings
of t, such that applying the mapping to p results in t. Very recently, the
problem has been investigated within the framework of parameterized complexity
[Fernau, Schmid, and Villanger, 2013].
In this paper we study the parameterized complexity of the optimization
variant of GFM (called Max-GFM), which has been introduced in [Amir and Nor,
2007]. Here, one is allowed to replace some of the pattern letters with some
special symbols "?", termed wildcards or don't cares, which can be mapped to an
arbitrary substring of the text. The goal is to minimize the number of
wildcards used.
We give a complete classification of the parameterized complexity of Max-GFM
and its variants under a wide range of parameterizations, such as, the number
of occurrences of a letter in the text, the size of the text alphabet, the
number of occurrences of a letter in the pattern, the size of the pattern
alphabet, the maximum length of a string matched to any pattern letter, the
number of wildcards and the maximum size of a string that a wildcard can be
mapped to.
Comment: to appear in Proc. IPEC'1
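The matching model can be made concrete with a naive backtracking checker (a sketch under the definition given above; exponential in the worst case, for illustration only). Each distinct pattern letter must map to one fixed non-empty substring of the text, while "?" matches any non-empty substring with no consistency requirement:

```python
def gfm_match(p, t):
    """Decide generalized function matching by backtracking.

    Each distinct letter of p maps to one fixed non-empty
    substring of t; '?' is a wildcard matching any non-empty
    substring independently. Exponential worst case; a sketch,
    not a practical algorithm.
    """
    def rec(i, j, mapping):
        if i == len(p):
            return j == len(t)
        c = p[i]
        if c == '?':  # wildcard: no consistency requirement
            return any(rec(i + 1, end, mapping)
                       for end in range(j + 1, len(t) + 1))
        if c in mapping:  # letter already bound: must reuse its image
            s = mapping[c]
            return t.startswith(s, j) and rec(i + 1, j + len(s), mapping)
        for end in range(j + 1, len(t) + 1):  # try every binding for c
            mapping[c] = t[j:end]
            if rec(i + 1, end, mapping):
                return True
            del mapping[c]
        return False

    return rec(0, 0, {})

print(gfm_match("aba", "xyxxy"))  # True: a -> "xy", b -> "x"
```

In the Max-GFM setting, one would additionally search for the smallest number of pattern positions to replace with "?" so that a match exists.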
Weak factor automata: the failure of failure factor oracles?
In indexing of, and pattern matching on, DNA and text sequences, it is often important to represent all factors of a
sequence. One efficient, compact representation is the factor oracle (FO). At the same time, any classical deterministic
finite automaton (DFA) can be transformed to a so-called failure one (FDFA), which may use failure transitions to replace
multiple symbol transitions, potentially yielding a more compact representation. We combine the two ideas and directly
construct a failure factor oracle (FFO) from a given sequence, in contrast to ex post facto transformation to an FDFA. The
algorithm is suitable for both short and long sequences. We empirically compared the resulting FFOs and FOs on number
of transitions for many DNA sequences of lengths 4 - 512, showing gains of up to 10% in total number of transitions, with
failure transitions also taking up less space than symbol transitions. The resulting FFOs can be used for indexing, as
well as in a variant of the FO-using backward oracle matching algorithm. We discuss and classify this pattern matching
algorithm in terms of the keyword pattern matching taxonomies of Watson, Cleophas and Zwaan. We also empirically
compared the use of FOs and FFOs in such backward reading pattern matching algorithms, using both DNA and natural
language (English) data sets. The results indicate that the decrease in pattern matching performance of an algorithm using
an FFO instead of an FO may outweigh the gain in representation space.
http://www.journals.co.za/ej/ejour_comp.html
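For reference, the classic online factor oracle construction of Allauzen, Crochemore and Raffinot can be sketched as follows (this builds a plain FO with ordinary symbol transitions, not the paper's FFO). The oracle accepts every factor of the sequence, though it may also accept some non-factors, which is exactly why such automata are "weak":

```python
def build_factor_oracle(s):
    """Online factor-oracle construction (Allauzen-Crochemore-Raffinot).

    States 0..len(s); trans[q] maps a symbol to a target state,
    supply[q] is the supply-function link. The oracle accepts
    every factor of s (and possibly a few non-factors).
    """
    n = len(s)
    trans = [dict() for _ in range(n + 1)]
    supply = [-1] * (n + 1)
    for i in range(n):
        c = s[i]
        trans[i][c] = i + 1          # internal transition
        k = supply[i]
        while k >= 0 and c not in trans[k]:
            trans[k][c] = i + 1      # external transition
            k = supply[k]
        supply[i + 1] = trans[k][c] if k >= 0 else 0
    return trans, supply

def accepts(trans, w):
    """Follow transitions from the initial state; every state is final."""
    q = 0
    for c in w:
        if c not in trans[q]:
            return False
        q = trans[q][c]
    return True
```

For "abbc", every factor ("bb", "bbc", "abb", ...) is accepted. The FFO idea of the paper replaces some of these transitions with failure transitions to save space.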
Text Indexing for Long Patterns: Anchors are All you Need
PVLDB Artifact Availability: The source code, data, and/or other artifacts have been made available at https://github.com/lorrainea/BDA-index.

Copyright © 2023 the owner/author(s).

In many real-world database systems, a large fraction of the data is represented by strings: sequences of letters over some alphabet. This is because strings can easily encode data arising from different sources. It is often crucial to represent such string datasets in a compact form but also to simultaneously enable fast pattern matching queries. This is the classic text indexing problem. The four absolute measures anyone should pay attention to when designing or implementing a text index are: (i) index space; (ii) query time; (iii) construction space; and (iv) construction time. Unfortunately, however, most (if not all) widely-used indexes (e.g., suffix tree, suffix array, or their compressed counterparts) are not optimized for all four measures simultaneously, as it is difficult to have the best of all four worlds. Here, we take an important step in this direction by showing that text indexing with locally consistent anchors (lc-anchors) offers remarkably good performance in all four measures, when we have at hand a lower bound l on the length of the queried patterns, which is arguably a quite reasonable assumption in practical applications. Specifically, we improve on the construction of the index proposed by Loukides and Pissis, which is based on bidirectional string anchors (bd-anchors), a new type of lc-anchors, by: (i) designing an average-case linear-time algorithm to compute bd-anchors; and (ii) developing a semi-external-memory implementation to construct the index in small space using near-optimal work. We then present an extensive experimental evaluation, based on the four measures, using real benchmark datasets.

The results show that, for long patterns, the index constructed using our improved algorithms compares favorably to all classic indexes: (compressed) suffix tree; (compressed) suffix array; and the FM-index.

Funding: European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreements No 872539 and 956229, respectively; and by UKRI through REPHRAIN (EP/V011189/1).
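A naive sketch conveys what a bd-anchor is: for each length-ell window of the text, take the leftmost starting position of the window's lexicographically smallest rotation (this follows the published definition up to tie-breaking details for highly periodic windows; the paper's contribution is computing these anchors in average-case linear time, not this quadratic-per-window version):

```python
def bd_anchor(w):
    """Leftmost starting position of the lexicographically
    smallest rotation of w (naive, O(|w|^2))."""
    return min(range(len(w)), key=lambda i: w[i:] + w[:i])

def bd_anchors(text, ell):
    """Set of order-ell bd-anchors: one anchor (as an absolute
    text position) per length-ell window of text."""
    return {i + bd_anchor(text[i:i + ell])
            for i in range(len(text) - ell + 1)}

print(bd_anchors("banana", 3))  # {1, 3, 5}
```

Because nearby windows tend to share the same smallest rotation, the anchor set is typically much smaller than the text, which is what makes it a compact sampling for indexing.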
Lightweight Massively Parallel Suffix Array Construction
The suffix array is an array of sorted suffixes in lexicographic order, where each sorted suffix is represented by its starting position in the input string. It is a fundamental data structure with applications in areas such as string processing, text indexing, data compression, computational biology, and many more. Over the last three decades, researchers have proposed a broad spectrum of suffix array construction algorithms (SACAs). However, the majority of SACAs were implemented using sequential and parallel CPU programming models. The maturity of GPU programming has opened the door to massively parallel GPU SACAs that outperform the fastest suffix sorting algorithms optimized for CPU parallel computing. Over the last five years, several GPU SACA approaches have been proposed and implemented; these prioritized running time over lightweight design.
In this thesis, we design and implement a lightweight massively parallel SACA on the GPU using the prefix-doubling technique. Our prefix-doubling implementation is memory-efficient and can successfully construct the suffix array for input strings as large as 640 megabytes (MB) on a Tesla P100 GPU. On large datasets, our implementation achieves a speedup of 7-16x over the fastest, highly optimized, OpenMP-accelerated suffix array constructor, libdivsufsort, which leverages CPU shared-memory parallelism. The performance of our algorithm relies on several high-performance parallel primitives such as radix sort, conditional filtering, inclusive prefix sum, random memory scattering, and segmented sort. We evaluate the performance of our implementation over a variety of real-world datasets with respect to its runtime, throughput, memory usage, and scalability. We compare our results against libdivsufsort running on a Haswell compute node equipped with 24 cores. Our GPU SACA is simple and compact, consisting of fewer than 300 lines of readable and effective source code. Additionally, we design and implement a fast and lightweight algorithm for checking the correctness of the suffix array.
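The prefix-doubling technique at the heart of the thesis can be illustrated with a minimal sequential sketch in Python (the thesis implements the same idea with GPU radix sort and segmented primitives): rank suffixes by their first k characters, then combine pairs of ranks to double k until all ranks are distinct:

```python
def suffix_array(s):
    """Prefix-doubling suffix array construction.

    Each round sorts suffix indices by (rank of first k chars,
    rank of the next k chars), then re-ranks; k doubles until
    all ranks are distinct. O(n log^2 n) with comparison sort.
    """
    n = len(s)
    if n < 2:
        return list(range(n))
    rank = [ord(c) for c in s]
    sa = list(range(n))
    k = 1
    while True:
        def key(i):
            # -1 stands for the empty suffix past the end of s
            return (rank[i], rank[i + k] if i + k < n else -1)
        sa.sort(key=key)
        new_rank = [0] * n
        for j in range(1, n):
            new_rank[sa[j]] = new_rank[sa[j - 1]] + (key(sa[j]) != key(sa[j - 1]))
        rank = new_rank
        if rank[sa[-1]] == n - 1:  # all ranks distinct: done
            break
        k *= 2
    return sa

print(suffix_array("banana"))  # [5, 3, 1, 0, 4, 2]
```

The GPU version replaces the comparison sort with radix sort over (rank, rank) pairs and computes the re-ranking with an inclusive prefix sum, which is what keeps the design both parallel and lightweight.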