Search CORE

67 research outputs found

Space-efficient construction of compressed suffix trees

Author: Prezza Nicola
Rosone Giovanna
Publication venue: 'Elsevier BV'
Publication date: 01/01/2021
Field of study

We show how to build several data structures of central importance to string processing by taking as input the Burrows-Wheeler transform (BWT) and using small extra working space. Let n be the text length and σ be the alphabet size. We first provide two algorithms that enumerate all LCP values and suffix tree intervals in O(nlog⁡σ) time using just o(nlog⁡σ) bits of working space on top of the input re-writable BWT. Using these algorithms as building blocks, for any parameter 00. This improves the previous most space-efficient algorithms, which worked in O(n) bits and O(nlog⁡n) time. We also consider the problem of merging BWTs of string collections, and provide a solution running in O(nlog⁡σ) time and using just o(nlog⁡σ) bits of working space. An efficient implementation of our LCP construction and BWT merge algorithms uses (in RAM) as few as n bits on top of a packed representation of the input/output and process data as fast as 2.92 megabases per second

Archivio della Ricerca - Università di Pisa

Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari

Lightweight LCP Construction for Very Large Collections of Strings

Author: Cox Anthony J.
Garofalo Fabio
Rosone Giovanna
Sciortino Marinella
Publication venue: 'Elsevier BV'
Publication date: 01/01/2016
Field of study

The longest common prefix array is a very advantageous data structure that, combined with the suffix array and the Burrows-Wheeler transform, allows to efficiently compute some combinatorial properties of a string useful in several applications, especially in biological contexts. Nowadays, the input data for many problems are big collections of strings, for instance the data coming from "next-generation" DNA sequencing (NGS) technologies. In this paper we present the first lightweight algorithm (called extLCP) for the simultaneous computation of the longest common prefix array and the Burrows-Wheeler transform of a very large collection of strings having any length. The computation is realized by performing disk data accesses only via sequential scans, and the total disk space usage never needs more than twice the output size, excluding the disk space required for the input. Moreover, extLCP allows to compute also the suffix array of the strings of the collection, without any other further data structure is needed. Finally, we test our algorithm on real data and compare our results with another tool capable to work in external memory on large collections of strings.Comment: This manuscript version is made available under the CC-BY-NC-ND 4.0 license http://creativecommons.org/licenses/by-nc-nd/4.0/ The final version of this manuscript is in press in Journal of Discrete Algorithm

arXiv.org e-Print Archive

Crossref

Archivio della Ricerca - Università di Pisa

Archivio istituzionale della ricerca - Università di Palermo

Lossy Compressor preserving variant calling through Extended BWT

Author: Guerrini Veronica
Louza Felipe A.
Rosone Giovanna
Publication venue: 'Scitepress'
Publication date: 17/04/2023
Field of study

A standard format used for storing the output of high-throughput sequencing experiments is the FASTQ format. It comprises three main components: (i) headers, (ii) bases (nucleotide sequences), and (iii) quality scores. FASTQ files are widely used for variant calling, where sequencing data are mapped into a reference genome to discover variants that may be used for further analysis. There are many specialized compressors that exploit redundancy in FASTQ data with the focus only on either the bases or the quality scores components. In this paper we consider the novel problem of lossy compressing, in a reference-free way, FASTQ data by modifying both components at the same time, while preserving the important information of the original FASTQ. We introduce a general strategy, based on the Extended Burrows-Wheeler Transform (EBWT) and positional clustering, and we present implementations in both internal memory and external memory. Experimental results show that the lossy compression performed by our tool is able to achieve good compression while preserving information relating to variant calling more than the competitors. Availability: the software is freely available at https://github.com/veronicaguerrini/BFQzip.Comment: Proceedings of the 15th International Joint Conference on Biomedical Engineering Systems and Technologie

arXiv.org e-Print Archive

A New Class of Searchable and Provably Highly Compressible String Transformations

Author: Giancarlo Raffaele
Manzini Giovanni
Rosone Giovanna
Sciortino Marinella
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019)
Publication date: 01/01/2019
Field of study

The Burrows-Wheeler Transform is a string transformation that plays a fundamental role for the design of self-indexing compressed data structures. Over the years, researchers have successfully extended this transformation outside the domains of strings. However, efforts to find non-trivial alternatives of the original, now 25 years old, Burrows-Wheeler string transformation have met limited success. In this paper we bring new lymph to this area by introducing a whole new family of transformations that have all the "myriad virtues" of the BWT: they can be computed and inverted in linear time, they produce provably highly compressible strings, and they support linear time pattern search directly on the transformed string. This new family is a special case of a more general class of transformations based on context adaptive alphabet orderings, a concept introduced here. This more general class includes also the Alternating BWT, another invertible string transforms recently introduced in connection with a generalization of Lyndon words

arXiv.org e-Print Archive

Archivio della Ricerca - Università di Pisa

DROPS Dagstuhl Research Online Publication Server

Archivio Istituzionale della Ricerca- Università del Piemonte Orientale

Archivio istituzionale della ricerca - Università di Palermo

Lightweight Reference-Free Variation Detection using the Burrows-Wheeler Transform

Author: Giovanna Rosone
Marinella Sciortino
Nadia Pisanti
Nicola Prezza
Publication venue
Publication date: 01/01/2019
Field of study

Lightweight Reference-Free Variation Detection using the Burrows-Wheeler Transfor

Archivio della Ricerca - Università di Pisa

Detecting Mutations by eBWT

Author: Pisanti Nadia
PREZZA NICOLA
Rosone Giovanna
Sciortino Marinella
Publication venue: place:Leibniz
Publication date: 01/01/2018
Field of study

In this paper we develop a theory describing how the extended Burrows-Wheeler Transform (EBWT) of a collection of DNA fragments tends to cluster together the copies of nucleotides sequenced from a genome G. Our theory accurately predicts how many copies of any nucleotide are expected inside each such cluster, and how an elegant and precise LCP array based procedure can locate these clusters in the EBWT. Our findings are very general and can be applied to a wide range of different problems. In this paper, we consider the case of alignment-free and reference-free SNPs discovery in multiple collections of reads. We note that, in accordance with our theoretical results, SNPs are clustered in the EBWT of the reads collection, and we develop a tool finding SNPs with a simple scan of the EBWT and LCP arrays. Preliminary results show that our method requires much less coverage than state-of-the-art tools while drastically improving precision and sensitivity

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

Archivio della Ricerca - Università di Pisa

HAL Descartes

DROPS Dagstuhl Research Online Publication Server

Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari

Archivio della ricerca- LUISS Libera Università Internazionale degli Studi Sociali Guido Carli di Roma

Hal-Diderot

Archivio istituzionale della ricerca - Università di Palermo