24 research outputs found

String Synchronizing Sets: Sublinear-Time BWT Construction and Optimal LCE Data Structure

Author: A
Alzamel Mai
Counting
Grossi Roberto
Hagerup Torben
Optimal
Uniqueness
Wavelet
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 05/05/2019

Burrows-Wheeler transform (BWT) is an invertible text transformation that, given a text $T$ of length $n$, permutes its symbols according to the lexicographic order of suffixes of $T$. BWT is one of the most heavily studied algorithms in data compression with numerous applications in indexing, sequence analysis, and bioinformatics. Its construction is a bottleneck in many scenarios, and settling the complexity of this task is one of the most important unsolved problems in sequence analysis that has remained open for 25 years. Given a binary string of length $n$, occupying $O(n/\log n)$ machine words, the BWT construction algorithm due to Hon et al. (SIAM J. Comput., 2009) runs in $O(n)$ time and $O(n/\log n)$ space. Recent advancements (Belazzougui, STOC 2014, and Munro et al., SODA 2017) focus on removing the alphabet-size dependency in the time complexity, but they still require $\Omega(n)$ time. In this paper, we propose the first algorithm that breaks the $O(n)$-time barrier for BWT construction. Given a binary string of length $n$, our procedure builds the Burrows-Wheeler transform in $O(n/\sqrt{\log n})$ time and $O(n/\log n)$ space. We complement this result with a conditional lower bound proving that any further progress in the time complexity of BWT construction would yield faster algorithms for the very well studied problem of counting inversions: it would improve the state-of-the-art $O(m\sqrt{\log m})$-time solution by Chan and Pătrașcu (SODA 2010). Our algorithm is based on a novel concept of string synchronizing sets, which is of independent interest. As one of the applications, we show that this technique lets us design a data structure of the optimal size $O(n/\log n)$ that answers Longest Common Extension queries (LCE queries) in $O(1)$ time and, furthermore, can be deterministically constructed in the optimal $O(n/\log n)$ time.

Comment: Full version of a paper accepted to STOC 201
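To make the definition of the transform concrete, the following is a minimal Python sketch of the textbook construction: sort all suffixes, then read off the symbol preceding each one. It is not the sublinear-time algorithm of the paper; it runs in roughly quadratic time and is only meant for small inputs. The function names `bwt` and `inverse_bwt` and the `$` sentinel are assumptions made for this illustration.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Naive BWT: sort the suffixes of text + sentinel, emit each suffix's preceding symbol."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    # Suffix array by direct comparison sorting (illustration only, not sublinear).
    suffix_array = sorted(range(len(s)), key=lambda i: s[i:])
    # The symbol preceding suffix i is s[i - 1]; for i == 0 this wraps around to the sentinel.
    return "".join(s[i - 1] for i in suffix_array)


def inverse_bwt(transformed: str, sentinel: str = "$") -> str:
    """Invert the transform by repeatedly prepending the BWT column and re-sorting."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(c + row for c, row in zip(transformed, table))
    # The row that ends with the sentinel is the original text followed by the sentinel.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]
```

The sentinel is assumed to sort before every text symbol, which holds for `$` against digits and letters in ASCII, so the sketch also applies to binary strings written with `0` and `1`.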

arXiv.org e-Print Archive

Crossref

Efficient Computation of Sequence Mappability

Author: Alzamel Mai
Charalampopoulos Panagiotis
Iliopoulos Costas S.
Kociumaka Tomasz
Pissis Solon P.
Radoszewski Jakub
Straszyński Juliusz
Publication venue
Publication date: 31/07/2018

Sequence mappability is an important task in genome re-sequencing. In the $(k,m)$-mappability problem, for a given sequence $T$ of length $n$, our goal is to compute a table whose $i$th entry is the number of indices $j \ne i$ such that length-$m$ substrings of $T$ starting at positions $i$ and $j$ have at most $k$ mismatches. Previous works on this problem focused on heuristic approaches to compute a rough approximation of the result or on the case of $k=1$. We present several efficient algorithms for the general case of the problem. Our main result is an algorithm that works in $\mathcal{O}(n \min\{m^k,\log^{k+1} n\})$ time and $\mathcal{O}(n)$ space for $k=\mathcal{O}(1)$. It requires a careful adaptation of the technique of Cole et al. [STOC 2004] to avoid multiple counting of pairs of substrings. We also show $\mathcal{O}(n^2)$-time algorithms to compute all results for a fixed $m$ and all $k=0,\ldots,m$ or a fixed $k$ and all $m=k,\ldots,n-1$. Finally, we show that the $(k,m)$-mappability problem cannot be solved in strongly subquadratic time for $k,m = \Theta(\log n)$ unless the Strong Exponential Time Hypothesis fails.

Comment: Accepted to SPIRE 201
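The problem statement can be illustrated with a direct, brute-force Python sketch that compares every pair of length-m substrings. It takes about n^2 * m time and is not one of the algorithms presented in the paper; the function name `mappability` and the 0-based indexing of the table are assumptions made for this illustration.

```python
def mappability(text: str, k: int, m: int) -> list:
    """Brute-force (k, m)-mappability: table[i] counts starts j != i whose
    length-m substring is within Hamming distance k of the one starting at i."""
    starts = len(text) - m + 1
    if m < 1 or starts < 1:
        return []
    table = [0] * starts
    for i in range(starts):
        for j in range(i + 1, starts):
            mismatches = 0
            for a, b in zip(text[i:i + m], text[j:j + m]):
                if a != b:
                    mismatches += 1
                    if mismatches > k:
                        break  # this pair already exceeds the mismatch budget
            if mismatches <= k:
                # Hamming distance is symmetric, so the pair counts for both entries.
                table[i] += 1
                table[j] += 1
    return table
```

For example, `mappability("abab", 0, 2)` compares the substrings `ab`, `ba`, `ab` and yields `[1, 0, 1]`.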

arXiv.org e-Print Archive

VU Research Portal

CWI's Institutional Repository

INRIA a CCSD electronic archive server

Finding the Anticover of a String

Author: Alzamel Mai
Conte Alessio
Denzumi Shuhei
Grossi Roberto
Iliopoulos Costas S.
Kurita Kazuhiro
Wasa Kunihiro
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)
Publication date: 01/01/2020

A k-anticover of a string x is a set of pairwise distinct factors of x of equal length k, such that every symbol of x is contained in an occurrence of at least one of those factors. The existence of a k-anticover can be seen as a notion of non-redundancy, which has applications in computational biology, where they are associated with various non-regulatory mechanisms. In this paper we address the complexity of the problem of finding a k-anticover of a string x if it exists, showing that the decision problem is NP-complete on general strings for k ≥ 3. We also show that the problem admits a polynomial-time solution for k = 2. For unbounded k, we provide an exact exponential algorithm to find a k-anticover of a string of length n (or determine that none exists), which runs in O*(min{3^{(n-k)/3}, (k(k+1)/2)^{n/(k+1)}}) time using polynomial space.
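The definition can be explored on small strings with a backtracking sketch in Python. It assumes the reading in which each chosen factor covers symbols through one selected occurrence (if all occurrences counted, the set of all distinct length-k factors would cover x trivially). This is a plain exhaustive search with exponential worst-case time, not the exact algorithm from the paper; `find_anticover` is a hypothetical name.

```python
def find_anticover(x: str, k: int):
    """Return sorted start positions of a k-anticover of x, or None if none exists.

    Assumption: an anticover is a set of start positions whose length-k factors
    are pairwise distinct and whose occurrences together cover every symbol of x.
    """
    n = len(x)
    if k < 1 or k > n:
        return None

    def extend(covered_upto, last_start, used, chosen):
        # covered_upto = number of leading symbols of x already covered.
        if covered_upto >= n:
            return list(chosen)
        # The next start must exceed last_start and leave no gap after covered_upto.
        for start in range(min(covered_upto, n - k), last_start, -1):
            factor = x[start:start + k]
            if factor in used:
                continue  # factors must be pairwise distinct
            used.add(factor)
            chosen.append(start)
            result = extend(start + k, start, used, chosen)
            if result is not None:
                return result
            chosen.pop()
            used.discard(factor)
        return None

    return extend(0, -1, set(), [])
```

For instance, `find_anticover("abab", 2)` returns `None`, because covering the first and last symbols forces the factor `ab` to be chosen twice.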

INRIA a CCSD electronic archive server

HAL Descartes

Dagstuhl Research Online Publication Server

Hal-Diderot

IUPACpal: efficient identification of inverted repeats in IUPAC-encoded DNA sequences

Author: Alamro H. (Hayam)
Alzamel M. (Mai)
Iliopoulos C.S. (Costas)
Pissis S. (Solon)
Watts S. (Steven)
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 06/02/2021

Background: An inverted repeat is a DNA sequence followed downstream by its reverse complement, potentially with a gap in the centre. Inverted repeats are found in both prokaryotic and eukaryotic genomes and they have been linked with countless possible functions. Many international consortia provide a comprehensive description of common genetic variation making alternative sequence representations, such as IUPAC encoding, necessary for leveraging the full potential of such broad variation datasets. Results: We present IUPACpal, an exact tool for efficient identification of inverted repeats in IUPAC-encoded DNA sequences allowing also for potential mismatches and gaps in the inverted repeats. Conclusion: Within the parameters that were tested, our experimental results show that IUPACpal compares favourably to a similar application packaged with EMBOSS. We show that IUPACpal identifies many previously unidentified inverted repeats when compared with EMBOSS, and that this is also performed with orders of magnitude improved speed.
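As a small illustration of what an inverted repeat over IUPAC symbols means, the Python sketch below checks whether one arm is the reverse complement of the other, up to a mismatch budget. It assumes that two IUPAC symbols match when their nucleotide sets intersect; that matching rule, and the names `complement_set` and `is_inverted_repeat`, are assumptions for this example and do not reproduce IUPACpal's implementation or interface.

```python
# Standard IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def complement_set(symbol: str) -> frozenset:
    """Bases that the complement of an IUPAC symbol can stand for."""
    return frozenset(COMPLEMENT[base] for base in IUPAC[symbol])


def is_inverted_repeat(left: str, right: str, max_mismatches: int = 0) -> bool:
    """True if right is the reverse complement of left with at most max_mismatches.

    Assumption: symbols match when their base sets share at least one base.
    """
    if len(left) != len(right):
        return False
    mismatches = 0
    # Walk the left arm forwards and the right arm backwards.
    for a, b in zip(left, reversed(right)):
        if not (frozenset(IUPAC[a]) & complement_set(b)):
            mismatches += 1
    return mismatches <= max_mismatches
```

A gap in the centre is handled by the caller in this sketch: the two arms are passed separately, so any gapped region between them is simply left out.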

VU Research Portal

CWI's Institutional Repository

INRIA a CCSD electronic archive server

HAL Descartes

King's Research Portal

Hal-Diderot

Comparing Degenerate Strings

Author: Alzamel M. (Mai)
Ayad L.A.K. (Lorraine)
Bernardini G. (Giulia)
Grossi R. (Roberto)
Iliopoulos C.S. (Costas)
Pisanti N. (Nadia)
Pissis S. (Solon)
Rosone G. (Giovanna)
Publication venue: 'IOS Press'
Publication date: 01/01/2020

Uncertain sequences are compact representations of sets of similar strings. They highlight common segments by collapsing them, and explicitly represent varying segments by listing all possible options. A generalized degenerate string (GD string) is a type of uncertain sequence. Formally, a GD string S is a sequence of n sets of strings of total size N, where the ith set contains strings of the same length k_i but this length can vary between different sets. We denote by W the sum of these lengths k_0, k_1, ..., k_{n-1}. Our main result is an O(N + M)-time algorithm for deciding whether two GD strings of total sizes N and M, respectively, over an integer alphabet, have a non-empty intersection. This result is based on a combinatorial result of independent interest: although the intersection of two GD strings can be exponential in the total size of the two strings, it can be represented in linear space. We then apply our string comparison tool to devise a simple algorithm for computing all palindromes in S in O(min{W, n^2}N)-time. We complement this upper bound by showing a similar conditional lower bound for computing maximal palindromes in S. We also show that a result, which is essentially the same as our string comparison linear-time algorithm, can be obtained by employing an automata-based approach.
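To make the notion of GD-string intersection concrete, here is a deliberately naive Python sketch that enumerates every string each GD string represents and intersects the two sets. The enumeration can be exponential in the input size, which is exactly what the linear-time algorithm in the paper avoids; the representation as a list of sets and the names `gd_language` and `gd_intersection` are assumptions for this illustration.

```python
from itertools import product


def gd_language(gd):
    """All strings represented by a GD string given as a list of sets of strings."""
    for position, options in enumerate(gd):
        # Each set must be non-empty and hold strings of one common length k_i.
        if len({len(s) for s in options}) != 1:
            raise ValueError(f"set {position} must contain strings of equal length")
    # Pick one option per set and concatenate the picks in order.
    return {"".join(choice) for choice in product(*gd)}


def gd_intersection(first, second):
    """Strings represented by both GD strings (brute force, illustration only)."""
    return gd_language(first) & gd_language(second)
```

For example, `[{"ab", "ac"}, {"d"}]` represents `abd` and `acd`, while `[{"a"}, {"bd", "cc"}]` represents `abd` and `acc`, so their intersection is `{"abd"}` even though the two GD strings are segmented differently.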

Crossref

CWI's Institutional Repository

INRIA a CCSD electronic archive server

Archivio della Ricerca - Università di Pisa

King's Research Portal

Efficient and practical algorithms for sequence analysis: algorithm and data analysis research

Author: Alzamel Mai
Publication venue
Publication date: 01/09/2021

King's Research Portal

Dynamic IoT Malware Detection in Android Systems Using Profile Hidden Markov Models

Author: Heba Kurdi
Mai Alzamel
Norah Abanmi
Publication venue: MDPI AG
Publication date: 01/12/2022

The prevalence of malware attacks that target IoT systems has raised an alarm and highlighted the need for efficient mechanisms to detect and defeat them. However, detecting malware is challenging, especially malware with new or unknown behaviors. The main problem is that malware can hide, so it cannot be detected easily. Furthermore, information about malware families is limited, which restricts the amount of “big data” that is available for analysis. The motivation of this paper is two-fold. First, to introduce a new Profile Hidden Markov Model (PHMM) that can be used for both app analysis and classification in Android systems. Second, to dynamically identify suspicious calls while reducing infection risks of executed codes. We focused on Android systems, as they are more vulnerable than other IoT systems due to their ubiquitousness and sideloading features. The experimental results showed that the proposed Dynamic IoT malware Detection in Android Systems using PHMM (DIP) achieved superior performance when benchmarked against eight rival malware detection frameworks, showing up to 96.3% accuracy at 5% False Positive Rate (FP rate), 3% False Negative Rate (FN rate) and 94.9% F-measure.

Directory of Open Access Journals

Special Issue of Algorithmica for the 28th London Stringology Days & London Algorithmic Workshop (LSD & LAW)

Author: Alzamel Mai
Iliopoulos Costas S.
Letsios Dimitrios
Prezza Nicola
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2022

King's Research Portal

Efficient Computation of Palindromes in Sequences with Uncertainties Extension Version

Author: Alzamel Mai Abdulaziz M
Gao Jia
Iliopoulos Costas
Liu Chang
Publication venue
Publication date: 01/01/2018

King's Research Portal