9 research outputs found

    Sorting suffixes of a text via its Lyndon Factorization

    Full text link
    The process of sorting the suffixes of a text plays a fundamental role in Text Algorithms. Sorted suffixes are used, for instance, in the construction of the Burrows-Wheeler transform and of the suffix array, which are widely used in several fields of Computer Science. For this reason, much recent research has been devoted to finding new strategies to obtain effective methods for such a sorting. In this paper we introduce a new methodology in which an important role is played by the Lyndon factorization: the local suffixes inside the factors detected by this factorization keep their mutual order when extended to suffixes of the whole word. This property suggests a versatile technique that can easily be adapted to different implementation scenarios.
    Comment: Submitted to the Prague Stringology Conference 2013 (PSC 2013).
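    The Lyndon factorization mentioned above can be computed in linear time with Duval's algorithm. The following is a minimal Python sketch of that standard algorithm (not of the suffix-sorting method proposed in the paper), included only to make the notion concrete:

```python
def lyndon_factorization(s: str) -> list[str]:
    """Duval's algorithm: factor s into a non-increasing sequence of Lyndon words in O(|s|) time."""
    factors = []
    i, n = 0, len(s)
    while i < n:
        j, k = i + 1, i
        # Grow the window while s[i:j] is a prefix of a power of a Lyndon word.
        while j < n and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        # Emit the complete Lyndon periods of length j - k found so far.
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors

# The factors are lexicographically non-increasing: b >= an >= an >= a.
print(lyndon_factorization("banana"))  # ['b', 'an', 'an', 'a']
```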

    When a Dollar Makes a BWT

    Get PDF
    The Burrows-Wheeler Transform (BWT) is a reversible string transformation which plays a central role in text compression and is fundamental in many modern bioinformatics applications. The BWT is a permutation of the characters, which is in general better compressible and allows several different query types to be answered more efficiently than on the original string. It is easy to see that not every string is a BWT image, and exact characterizations of BWT images are known. We investigate a related combinatorial question. In many applications, a sentinel character $ is added to mark the end of the string, and thus the BWT of a string ending with $ contains exactly one $-character. We ask, given a string w, in which positions, if any, the $-character can be inserted to turn w into the BWT image of a word ending with the sentinel character. We show that this depends only on the standard permutation of w and give a combinatorial characterization of such positions via this permutation. We then develop an O(n log n)-time algorithm for identifying all such positions, improving on the naive quadratic-time algorithm.
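    As a hedged, concrete illustration of the objects discussed above: the sketch below computes the standard permutation of a string and applies the naive quadratic-time baseline mentioned in the abstract, testing each insertion position for the sentinel by checking whether the resulting permutation is a single cycle (the usual invertibility criterion for a BWT with a unique end-of-string symbol). It is not the O(n log n) algorithm of the paper.

```python
def standard_permutation(w: str) -> list[int]:
    """sigma[i] = rank of position i when the characters of w are sorted stably."""
    order = sorted(range(len(w)), key=lambda i: (w[i], i))
    sigma = [0] * len(w)
    for rank, pos in enumerate(order):
        sigma[pos] = rank
    return sigma

def is_single_cycle(sigma: list[int]) -> bool:
    """True iff the permutation consists of exactly one cycle."""
    i, steps = 0, 0
    while True:
        i = sigma[i]
        steps += 1
        if i == 0:
            return steps == len(sigma)

def dollar_positions(w: str) -> list[int]:
    """Naive quadratic baseline: positions where '$' (assumed smaller than every
    character of w) can be inserted so that the result is the BWT image of a
    word ending with the sentinel."""
    return [p for p in range(len(w) + 1)
            if is_single_cycle(standard_permutation(w[:p] + "$" + w[p:]))]

print(dollar_positions("annbaa"))  # [4, 6]; e.g. "annb$aa" is the BWT of "banana$"
```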

    Computing the original eBWT faster, simpler, and with less memory

    Full text link
    Mantaci et al. [TCS 2007] defined the eBWT to extend the definition of the BWT to a collection of strings. Since its introduction, however, the term has been used more generally to describe any BWT of a collection of strings, and the fundamental property of the original definition (i.e., the independence from the input order) is frequently disregarded. In this paper, we propose a simple linear-time algorithm for the construction of the original eBWT which does not require the preprocessing of Bannai et al. [CPM 2021]. As a byproduct, we obtain the first linear-time algorithm for computing the BWT of a single string that uses neither an end-of-string symbol nor Lyndon rotations. We combine our new eBWT construction with a variation of prefix-free parsing to allow for scalable construction of the eBWT. We evaluate our algorithm (pfpebwt) on sets of human chromosome 19, Salmonella, and SARS-CoV2 genomes, and demonstrate that it is the fastest method on all collections, with a maximum speedup of 7.6x over the second-best method. Its peak memory is at most 2x larger than that of the second-best method. Compared with methods that, like our algorithm, are able to report suffix array samples, we obtain a 57.1x improvement in peak memory. The source code is publicly available at https://github.com/davidecenzato/PFP-eBWT.
    Comment: 20 pages, 5 figures, 1 table.
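    To make the order-independence property concrete, here is a deliberately naive Python sketch of the original eBWT definition of Mantaci et al.: all conjugates (rotations) of all input strings are sorted by the omega-order, i.e. by comparing their infinite repetitions, and their last characters are concatenated. It only illustrates the definition; it is unrelated to the linear-time and prefix-free-parsing constructions described in the abstract.

```python
from functools import cmp_to_key

def omega_cmp(u: str, v: str) -> int:
    """Compare u^omega and v^omega; a prefix of length |u| + |v| suffices (Fine and Wilf)."""
    k = len(u) + len(v)
    uu = (u * (k // len(u) + 1))[:k]
    vv = (v * (k // len(v) + 1))[:k]
    return (uu > vv) - (uu < vv)

def ebwt(strings: list[str]) -> str:
    """Original eBWT, naively: sort all conjugates of all strings by omega-order
    and collect their last characters."""
    conjugates = [s[i:] + s[:i] for s in strings for i in range(len(s))]
    conjugates.sort(key=cmp_to_key(omega_cmp))
    return "".join(c[-1] for c in conjugates)

# The output does not depend on the order of the input collection:
print(ebwt(["banana", "ananas"]) == ebwt(["ananas", "banana"]))  # True
```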

    A new class of string transformations for compressed text indexing

    Get PDF
    Introduced about thirty years ago in the field of data compression, the Burrows-Wheeler Transform (BWT) is a string transformation that, besides boosting the performance of memoryless compressors, plays a fundamental role in the design of efficient self-indexing compressed data structures. Finding other string transformations with the same remarkable properties as the BWT has long been a challenge for many researchers. In this paper, we introduce a whole class of new string transformations, called local orderings-based transformations, which have all the “myriad virtues” of the BWT. As a further result, we show that these new string transformations can be used for the construction of the recently introduced r-index, which makes them suitable also for highly repetitive collections. In this context, we consider the problem of finding, for a given string, the BWT variant that minimizes the number of runs in the transformed string.
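    Since the quality measure considered above is the number of runs of the transformed string, the small sketch below shows how that quantity, usually denoted r, can be measured; it assumes the usual sentinel-terminated BWT, computed naively from sorted rotations, rather than any of the new transformations introduced in the paper.

```python
def bwt(s: str, sentinel: str = "$") -> str:
    """Naive BWT: last characters of the lexicographically sorted rotations of s + sentinel."""
    t = s + sentinel  # the sentinel is assumed smaller than every character of s
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(rot[-1] for rot in rotations)

def count_runs(s: str) -> int:
    """Number of maximal runs of equal characters (the parameter r)."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])

transformed = bwt("mississippi")
print(transformed, count_runs(transformed))  # ipssm$pissii 9
```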

    Variable-order reference-free variant discovery with the Burrows-Wheeler Transform

    Get PDF
    Background: In [Prezza et al., AMB 2019], a new reference-free and alignment-free framework for the detection of SNPs was suggested and tested. The framework, based on the Burrows-Wheeler Transform (BWT), significantly improves the sensitivity and precision of previous de Bruijn graph based tools by overcoming several of their limitations, namely: (i) the need to establish a fixed, usually small, value for the order k; (ii) the loss of important information such as k-mer coverage and adjacency of k-mers within the same read; and (iii) poor performance in repeated regions longer than k bases. The preliminary tool, however, was able to identify only SNPs, and it was too slow and memory-consuming due to the use of additional heavy data structures (namely, the Suffix and LCP arrays) besides the BWT. Results: In this paper, we introduce a new algorithm and the corresponding tool ebwt2InDel that (i) extends the framework of [Prezza et al., AMB 2019] to also detect INDELs, and (ii) implements recent algorithmic findings that allow the whole analysis to be performed using just the BWT, thus reducing the working space by one order of magnitude and allowing the analysis of full genomes. Finally, we describe a simple strategy for effectively parallelizing our tool for SNP detection only. On a 24-core machine, the parallel version of our tool is one order of magnitude faster than the sequential one. The tool ebwt2InDel is available at github.com/nicolaprezza/ebwt2InDel. Conclusions: Results on a synthetic dataset covered at 30x (Human chromosome 1) show that our tool is indeed able to find up to 83% of the SNPs and 72% of the existing INDELs. These percentages considerably improve on the 71% of SNPs and 51% of INDELs found by the state-of-the-art tool based on de Bruijn graphs. We furthermore repor

    r-indexing the eBWT

    Get PDF
    The extended Burrows-Wheeler Transform (eBWT) [Mantaci et al. TCS 2007] is a variant of the BWT, introduced for collections of strings. In this paper, we present the extended r-index, a data structure analogous to the r-index [Gagie et al. JACM 2020]. It occupies O(r) words, with r the number of runs of the eBWT, and offers the same functionalities as the r-index. We also show how to efficiently support finding maximal exact matches (MEMs). We implemented the extended r-index and tested it on circular bacterial genomes and plasmids, comparing it to five state-of-the-art compressed text indexes. While our data structure maintains time and memory requirements for answering pattern matching queries similar to those of the original r-index, it is the only index in the literature that can naturally be used for both circular and linear input collections. This is an extended version of [Boucher et al., r-indexing the eBWT, SPIRE 2021].
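    The O(r)-word bound mentioned above comes from storing the transform in run-length-compressed form, one entry per run; a minimal sketch of that representation (independent of the actual index machinery) is:

```python
from itertools import groupby

def run_length_encode(transform: str) -> list[tuple[str, int]]:
    """Store a (e)BWT as its r runs: one (character, run length) pair per run."""
    return [(c, len(list(g))) for c, g in groupby(transform)]

runs = run_length_encode("ipssm$pissii")  # a BWT of "mississippi$", used here as a stand-in
print(len(runs), runs)                    # 9 runs instead of 12 characters
```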

    From First Principles to the Burrows and Wheeler Transform and Beyond, via Combinatorial Optimization

    Get PDF
    We introduce a combinatorial optimization framework that naturally yields a class of optimal word permutations. Our framework provides the first formal quantification of the intuitive idea that the longer the context shared by two symbols in a word, the closer those symbols should be to each other in a linear order of the symbols. The Burrows and Wheeler transform [6], and the compressible part of its analog for labelled trees [10], are special cases in this class. We also show that the class of optimal word permutations defined here is identical to the one identified by Ferragina et al. for compression boosting [9]; therefore, they are all highly compressible. We also investigate more general classes of optimal word permutations, where the relatedness of symbols may be measured by functions more complex than context length. In this case, we establish a non-trivial connection between word permutations and the Table Compression techniques presented in Buchsbaum et al. [5], on one hand, and a universal similarity metric [17] with uses in Clustering and Classification [8], on the other. Unfortunately, for this general problem, we provide instances that are MAX-SNP hard, and therefore unlikely to be solved or approximated efficiently. The results presented here indicate that, contrary to folklore, the key feature of the Burrows and Wheeler transform seems to be the existence of efficient algorithms for its computation and inversion, rather than its compressibility. Finally, for completeness, we also provide a solution to an open problem implicitly posed in [6] regarding the computation of the transform.
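    The intuition quantified above, that symbols sharing a longer context should end up closer together, can be seen directly in the sorted-rotations view of the BWT: each output symbol is the one preceding its rotation, so symbols followed by similar contexts become neighbours. The toy Python sketch below only visualizes that idea and is not an implementation of the optimization framework.

```python
def bwt_with_contexts(s: str) -> list[tuple[str, str]]:
    """Pair each BWT symbol with the (cyclic) context that follows it in the text."""
    t = s + "$"
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    # rot[-1] is the output symbol, rot[:-1] starts with the context it precedes.
    return [(rot[-1], rot[:-1]) for rot in rotations]

for symbol, context in bwt_with_contexts("mississippi"):
    print(symbol, context)
# The two 'i' symbols that precede occurrences of "ssi..." end up adjacent
# in the transformed string, illustrating the context-sharing intuition.
```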

    No Feature Data Analytics: Compression Pattern Recognition

    No full text
    A similarity matrix gives the degree of similarity between every pair of data items and plays a core role in a number of dimensionality reduction methods, since their objective functions are built upon this matrix. While compression-based similarity measures are effectively employed, as an essentially parameter-free approach, in applications on diverse data types, the fast compression distance (FCD) metric has been shown to achieve classification performance similar to the Normalized Compression Distance (NCD) on small- to medium-sized datasets [1]. The FCD is claimed to combine high speed without skipping the joint compression step, which yields better performance compared with NCD [2]. The underlying idea is that each image patch is encoded as a string x, from which the LZW algorithm extracts a dictionary D(x), kept in ascending order. The FCD is then defined as an operation that mainly takes into account the number of patterns shared by two dictionaries D(x) and D(y). In this research, we use FCD together with t-SNE to visualize a large semantically annotated TerraSAR-X dataset as a case study. The dataset contains image patches from 288 TerraSAR-X images, with a total of over 60,000 individual image patches. The visualizations represent the annotated semantic labels in an intuitive way, which helps us better understand the relationships between the annotated semantics and the actual similarities of the data in manifold space. Our results show that the FCD-based similarity matrix provides fast yet performance-preserving insight into high-dimensional datasets with a non-parametric distance metric. Via the visualization of the TerraSAR-X dataset, we have gained quick intuition and a better understanding of the connections between the annotated semantics and the relationships within the data, revealed as similarities in manifold space. The visualization is explored with a Vega-style interactive tool, which allows the user to zoom in and out when processing large numbers of data points.
    Change detection methods depend on the extracted image features and on the similarity measures used to compare the observed scene at different time moments. The NCD has an important advantage: it does not use features and compares the intrinsic data information. The change detection is thus an unpolarized estimator of temporal changes. The method is validated on two areas with visible changes such as flooding and tsunami effects, and the results are compared with ground truth data. The influence of natural disasters, as well as of global climate warming, has increased in the past decades; therefore, the detection of changes in a satellite image time series is an important task [3]. The calculation can be divided into several phases. The first is the preprocessing, which includes the alignment of all images as well as the creation of patches in a region of interest; we propose to use Sentinel-1 SAR data here. The second phase comprises the generation of a distance matrix for the patches from two images. In the last phase, a threshold is applied to this matrix in order to show the changes in a binary way as a binary change map (BCM). The results show that a compression-based approach works very well on SAR data. Moreover, a visual evaluation of the resulting images shows that the compression-based approach detects the flooded areas within a region. Furthermore, if the input is reordered with the Burrows and Wheeler transformation [4], the resulting image is better in some areas than with the base method. This shows the robustness of NCD.
    [1] D. Cerra and M. Datcu, “A fast compression-based similarity measure with applications to content-based image retrieval,” Journal of Visual Communication and Image Representation, vol. 23, pp. 293–302, February 2012.
    [2] M. Li, X. Chen, X. Li, B. Ma, and P. M. B. Vitanyi, “The similarity metric,” IEEE Transactions on Information Theory, vol. 50, pp. 3250–3264, 2004.
    [3] M. Coca, A. Anghel, and M. Datcu, “Normalized compression distance for SAR image change detection,” pp. 1–3, 2018.
    [4] R. Giancarlo, A. Restivo, and M. Sciortino, “From first principles to the Burrows and Wheeler transform and beyond, via combinatorial optimization,” Theoretical Computer Science, vol. 387, pp. 236–248, 2007.
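    As a reference point for the compression-based measures discussed in the abstract above, the sketch below computes the standard NCD formula with a general-purpose compressor (zlib, chosen here only for convenience); the FCD used in the study, which replaces joint compression with dictionary intersection, is not reproduced.

```python
import zlib

def c(data: bytes) -> int:
    """Compressed size, used as a practical stand-in for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy = c(x), c(y)
    return (c(x + y) - min(cx, cy)) / max(cx, cy)

# Hypothetical patch contents, only to exercise the formula.
patch_a = b"flooded area, dark backscatter, smooth texture " * 20
patch_b = b"urban area, bright backscatter, strong texture " * 20
print(ncd(patch_a, patch_a))  # small (not exactly 0 with a real compressor)
print(ncd(patch_a, patch_b))  # larger for dissimilar inputs
```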