Search CORE

9 research outputs found

Even faster elastic-degenerate string matching via fast matrix multiplication

Author: Bernardini G. (Giulia)
Gawrychowski P. (Paweł)
Pisanti N. (Nadia)
Pissis S. (Solon)
Rosone G. (Giovanna)
Publication date: 01/01/2019

An elastic-degenerate (ED) string is a sequence of n sets of strings of total length N, which was recently proposed to model a set of similar sequences. The ED string matching (EDSM) problem is to find all occurrences of a pattern of length m in an ED text. The EDSM problem has recently received some attention in the combinatorial pattern matching community, and an O(nm^1.5 √(log m) + N)-time algorithm is known [Aoyama et al., CPM 2018]. The standard assumption in the prior work on this question is that N is substantially larger than both n and m, and thus we would like to have a linear dependency on the former. Under this assumption, the natural open problem is whether we can decrease the 1.5 exponent in the time complexity, similarly as in the related (but, to the best of our knowledge, not equivalent) word break problem [Backurs and Indyk, FOCS 2016].

Our starting point is a conditional lower bound for the EDSM problem. We use the popular combinatorial Boolean matrix multiplication (BMM) conjecture stating that there is no truly subcubic combinatorial algorithm for BMM [Abboud and Williams, FOCS 2014]. By designing an appropriate reduction we show that a combinatorial algorithm solving the EDSM problem in O(nm^(1.5−ε) + N) time, for any ε > 0, refutes this conjecture. Of course, the notion of combinatorial algorithms is not clearly defined, so our reduction should be understood as an indication that decreasing the exponent requires fast matrix multiplication.

Two standard tools used in algorithms on strings are string periodicity and fast Fourier transform. Our main technical contribution is that we successfully combine these tools with fast matrix multiplication to design a non-combinatorial O(nm^1.381 + N)-time algorithm for EDSM. To the best of our knowledge, we are the first to do so.
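To make the EDSM problem statement concrete, the following is a minimal brute-force sketch in Python. It is not the algorithm from the paper and makes no attempt to reach the stated time bounds. It assumes an ED text is given as a list of sets of strings (the empty string is allowed in a set) and reports the indices of the segments in which an occurrence of the pattern ends. The function name `edsm_naive` and the input encoding are illustrative assumptions.

```python
def edsm_naive(text, pattern):
    """Brute-force EDSM sketch: return the sorted indices of the segments of
    the ED text in which some occurrence of `pattern` ends.

    `text` is assumed to be a list of sets of strings; `pattern` is assumed
    to be a non-empty string. Illustrative only, not the paper's method.
    """
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    active = set()  # lengths j (0 < j < m) of pattern prefixes matched so far
    hits = set()
    for i, segment in enumerate(text):
        new_active = set()
        for s in segment:
            # Case 1: the whole pattern occurs inside a single string s.
            if pattern in s:
                hits.add(i)
            # Case 2: a proper prefix of the pattern is a suffix of s,
            # so a partial match starts in this segment.
            for j in range(1, min(m - 1, len(s)) + 1):
                if s.endswith(pattern[:j]):
                    new_active.add(j)
            # Case 3: extend partial matches carried over from earlier segments.
            for j in active:
                rest = pattern[j:]
                if s.startswith(rest):
                    hits.add(i)                 # the occurrence ends inside s
                elif rest.startswith(s):
                    new_active.add(j + len(s))  # s lies entirely inside the pattern
        active = new_active
    return sorted(hits)
```

For example, with the ED text `[{"A"}, {"C", "CG", ""}, {"T"}]`, the pattern `"ACT"` ends in segment 2 (choosing `"C"`), and so does `"AT"` (choosing the empty string).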

arXiv.org e-Print Archive

CWI's Institutional Repository

INRIA a CCSD electronic archive server

Archivio della Ricerca - Università di Pisa

Dagstuhl Research Online Publication Server

Comparing Degenerate Strings

Author: Alzamel M. (Mai)
Ayad L.A.K. (Lorraine)
Bernardini G. (Giulia)
Grossi R. (Roberto)
Iliopoulos C.S. (Costas)
Pisanti N. (Nadia)
Pissis S. (Solon)
Rosone G. (Giovanna)
Publication venue: 'IOS Press'
Publication date: 01/01/2020

Uncertain sequences are compact representations of sets of similar strings. They highlight common segments by collapsing them, and explicitly represent varying segments by listing all possible options. A generalized degenerate string (GD string) is a type of uncertain sequence. Formally, a GD string S is a sequence of n sets of strings of total size N, where the i-th set contains strings of the same length k_i but this length can vary between different sets. We denote by W the sum of these lengths k_0, k_1, ..., k_(n-1). Our main result is an O(N + M)-time algorithm for deciding whether two GD strings of total sizes N and M, respectively, over an integer alphabet, have a non-empty intersection. This result is based on a combinatorial result of independent interest: although the intersection of two GD strings can be exponential in the total size of the two strings, it can be represented in linear space. We then apply our string comparison tool to devise a simple algorithm for computing all palindromes in S in O(min{W, n^2}N)-time. We complement this upper bound by showing a similar conditional lower bound for computing maximal palindromes in S. We also show that a result, which is essentially the same as our string comparison linear-time algorithm, can be obtained by employing an automata-based approach.
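As an illustration of the definitions only, the sketch below enumerates the set of strings represented by a GD string and intersects two such sets. This takes exponential time in general and is not the linear-time algorithm described in the abstract. The encoding of a GD string as a list of sets of equal-length strings and the names `gd_language` and `gd_intersection` are assumptions made for this example.

```python
from itertools import product


def gd_language(gd):
    """Return the set of all strings represented by a GD string.

    `gd` is assumed to be a list of non-empty sets of strings in which every
    string of the same set has the same length (the GD-string condition).
    """
    for segment in gd:
        if len({len(s) for s in segment}) != 1:
            raise ValueError("each set must contain strings of one common length")
    # Pick one string from every set and concatenate the choices.
    return {"".join(choice) for choice in product(*gd)}


def gd_intersection(a, b):
    """Brute-force intersection of two GD strings (exponential in general)."""
    return gd_language(a) & gd_language(b)
```

For instance, `[{"AB", "CD"}, {"E"}]` represents {ABE, CDE} and `[{"A", "C"}, {"BE", "DF"}]` represents {ABE, ADF, CBE, CDF}, so their intersection is {ABE} even though the two GD strings are segmented differently.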

Crossref

CWI's Institutional Repository

INRIA a CCSD electronic archive server

Archivio della Ricerca - Università di Pisa

King's Research Portal

Beyond the BEST Theorem: Fast assessment of Eulerian Trails

Author: Conte A. (Alessio)
Grossi R. (Roberto)
Loukides G. (Grigorios)
Pisanti N. (Nadia)
Pissis S. (Solon)
Punzi G. (Giulia)
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2021

Given a directed multigraph G = (V, E), with |V| = n nodes and |E| = m edges, and an integer z, we are asked to assess whether the number #ET(G) of node-distinct Eulerian trails of G is at least z; two trails are called node-distinct if their node sequences are different. This problem has been formalized by Bernardini et al. [ALENEX 2020] as it is the core computational problem in several string processing applications. It can be solved in O(n^ω) arithmetic operations by applying the well-known BEST theorem, where ω < 2.373 denotes the matrix multiplication exponent. The algorithmic challenge is: Can we solve this problem faster for certain values of m and z? Namely, we want to design a combinatorial algorithm for assessing whether #ET(G) ≥ z, which does not resort to the BEST theorem and has a predictably bounded cost as a function of m and z. We address this challenge here by providing a combinatorial algorithm requiring O(m · min{z, #ET(G)}) time.
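The notion of node-distinct trails and the threshold z can be illustrated with a small backtracking sketch in Python. It is a brute-force enumeration with early stopping, not the O(m · min{z, #ET(G)}) algorithm from the paper and not an application of the BEST theorem. It assumes the multigraph is given as a list of directed edges (repeated pairs are parallel edges) and that trails start at a fixed node `start`; both the encoding and the name `count_node_distinct_trails` are assumptions of this example.

```python
from collections import Counter


def count_node_distinct_trails(edges, start, z):
    """Return min(z, number of node-distinct Eulerian trails from `start`).

    `edges` is assumed to be a list of (u, v) pairs; parallel edges appear as
    repeated pairs. Choosing a next *node* rather than a next *edge* means
    that parallel edges do not produce distinct trails.
    """
    remaining = Counter(edges)  # multiplicity of each directed node pair
    total = len(edges)

    def dfs(u, used, budget):
        if used == total:       # every edge has been traversed
            return 1
        found = 0
        for (a, b), k in list(remaining.items()):
            if a == u and k > 0:
                remaining[(a, b)] -= 1
                found += dfs(b, used + 1, budget - found)
                remaining[(a, b)] += 1
                if found >= budget:  # enough trails seen, stop early
                    break
        return found

    return min(dfs(start, 0, z), z)
```

With edges `[(0, 1), (1, 0), (0, 1)]` and start node 0, the two parallel edges from 0 to 1 yield a single node sequence 0, 1, 0, 1, so the count is 1 regardless of which parallel edge is used first.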

VU Research Portal

CWI's Institutional Repository

INRIA a CCSD electronic archive server

StringSanitizationTKDD

Author: Bernardini G. (Giulia)
Chen H. (Huiping)
Conte A. (Alessio)
Grossi R. (Roberto)
Loukides G. (Grigorios)
Pisanti N. (Nadia)
Pissis S. (Solon)
Rosone G. (Giovanna)
Sweering M.J.M. (Michelle)
Publication date: 01/01/2020

This repository contains the source code of the approach in 'Combinatorial Algorithms for String Sanitization', which appears in ACM Transactions on Knowledge Discovery from Data (TKDD).

CWI's Institutional Repository

A Relational Extension of the Notion of Motifs: Application to the Protein Common 3D Substructures Searching Problem

Author: Carpentier M.
Cook D.J
El-Zant N.
Feng J.
Gerstein M.
Guda C.
Henry Soldano
Joel Pothier
Karp R.
Leibowitz N.
Mathilde Carpentier
Nadia Pisanti
Su S.
Wu T.D.
Ye Y.
Publication venue: 'Mary Ann Liebert Inc'
Publication date: 01/01/2009

The geometrical configurations of atoms in protein structures can be viewed as approximate relations among them. Then, finding similar common substructures within a set of protein structures belongs to a new class of problems that generalizes that of finding repeated motifs. The novelty lies in the addition of constraints on the motifs in terms of relations that must hold between pairs of positions of the motifs. We will hence denote them as relational motifs. For this class of problems, we present an algorithm that is a suitable extension of the KMR paradigm and, in particular, of the KMRC as it uses a degenerate alphabet. Our algorithm contains several improvements that become especially useful when (as is required for relational motifs) the inference is made by partially overlapping shorter motifs, rather than concatenating them. The efficiency, correctness and completeness of the algorithm are ensured by several non-trivial properties that are proven in this paper. The algorithm has been applied in the important field of protein common 3D substructure searching. The methods implemented have also been tested on several examples of protein families such as serine proteases, globins and cytochromes P450. The detected motifs have been compared to those found by multiple structural alignment methods.

Crossref

Archivio della Ricerca - Università di Pisa

HAL-Paris 13