9 research outputs found

    Sorting suffixes of a text via its Lyndon Factorization

    Full text link
    The process of sorting the suffixes of a text plays a fundamental role in Text Algorithms. Sorted suffixes are used, for instance, in the construction of the Burrows-Wheeler transform and of the suffix array, which are widely used in several fields of Computer Science. For this reason, much recent research has been devoted to finding new strategies to obtain effective methods for such a sorting. In this paper we introduce a new methodology in which an important role is played by the Lyndon factorization: the local suffixes inside the factors detected by this factorization keep their mutual order when extended to suffixes of the whole word. This property suggests a versatile technique that can easily be adapted to different implementation scenarios.
    Comment: Submitted to the Prague Stringology Conference 2013 (PSC 2013).
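    The Lyndon factorization mentioned above can be computed in linear time with Duval's algorithm. The following is a minimal Python sketch of that standard algorithm (not of the suffix-sorting method proposed in the paper), included only to make the notion concrete:

```python
def lyndon_factorization(s: str) -> list[str]:
    """Duval's algorithm: factor s into a non-increasing sequence of Lyndon words in O(|s|) time."""
    factors = []
    i, n = 0, len(s)
    while i < n:
        j, k = i + 1, i
        # Grow the window while s[i:j] is a prefix of a power of a Lyndon word.
        while j < n and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        # Emit the complete Lyndon periods of length j - k found so far.
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors

# The factors are lexicographically non-increasing: b >= an >= an >= a.
print(lyndon_factorization("banana"))  # ['b', 'an', 'an', 'a']
```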

    When a Dollar Makes a BWT

    Get PDF
    The Burrows-Wheeler Transform (BWT) is a reversible string transformation which plays a central role in text compression and is fundamental in many modern bioinformatics applications. The BWT is a permutation of the characters, which is in general better compressible and allows several different query types to be answered more efficiently than on the original string. It is easy to see that not every string is a BWT image, and exact characterizations of BWT images are known. We investigate a related combinatorial question. In many applications, a sentinel character $ is added to mark the end of the string, and thus the BWT of a string ending with $ contains exactly one $-character. We ask, given a string w, in which positions, if any, the $-character can be inserted to turn w into the BWT image of a word ending with the sentinel character. We show that this depends only on the standard permutation of w and give a combinatorial characterization of such positions via this permutation. We then develop an O(n log n)-time algorithm for identifying all such positions, improving on the naive quadratic-time algorithm.
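    As a hedged, concrete illustration of the objects discussed above: the sketch below computes the standard permutation of a string and applies the naive quadratic-time baseline mentioned in the abstract, testing each insertion position for the sentinel by checking whether the resulting permutation is a single cycle (the usual invertibility criterion for a BWT with a unique end-of-string symbol). It is not the O(n log n) algorithm of the paper.

```python
def standard_permutation(w: str) -> list[int]:
    """sigma[i] = rank of position i when the characters of w are sorted stably."""
    order = sorted(range(len(w)), key=lambda i: (w[i], i))
    sigma = [0] * len(w)
    for rank, pos in enumerate(order):
        sigma[pos] = rank
    return sigma

def is_single_cycle(sigma: list[int]) -> bool:
    """True iff the permutation consists of exactly one cycle."""
    i, steps = 0, 0
    while True:
        i = sigma[i]
        steps += 1
        if i == 0:
            return steps == len(sigma)

def dollar_positions(w: str) -> list[int]:
    """Naive quadratic baseline: positions where '$' (assumed smaller than every
    character of w) can be inserted so that the result is the BWT image of a
    word ending with the sentinel."""
    return [p for p in range(len(w) + 1)
            if is_single_cycle(standard_permutation(w[:p] + "$" + w[p:]))]

print(dollar_positions("annbaa"))  # [4, 6]; e.g. "annb$aa" is the BWT of "banana$"
```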

    Computing the original eBWT faster, simpler, and with less memory

    Full text link
    Mantaci et al. [TCS 2007] defined the eBWT to extend the definition of the BWT to a collection of strings. Since its introduction, however, the term has been used more generally to describe any BWT of a collection of strings, and the fundamental property of the original definition (i.e., the independence from the input order) is frequently disregarded. In this paper, we propose a simple linear-time algorithm for the construction of the original eBWT which does not require the preprocessing of Bannai et al. [CPM 2021]. As a byproduct, we obtain the first linear-time algorithm for computing the BWT of a single string that uses neither an end-of-string symbol nor Lyndon rotations. We combine our new eBWT construction with a variation of prefix-free parsing to allow for scalable construction of the eBWT. We evaluate our algorithm (pfpebwt) on sets of human chromosome 19, Salmonella, and SARS-CoV2 genomes, and demonstrate that it is the fastest method on all collections, with a maximum speedup of 7.6x over the second-best method. Its peak memory is at most 2x larger than that of the second-best method. Compared with methods that, like our algorithm, are able to report suffix array samples, we obtain a 57.1x improvement in peak memory. The source code is publicly available at https://github.com/davidecenzato/PFP-eBWT.
    Comment: 20 pages, 5 figures, 1 table.
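    To make the order-independence property concrete, here is a deliberately naive Python sketch of the original eBWT definition of Mantaci et al.: all conjugates (rotations) of all input strings are sorted by the omega-order, i.e. by comparing their infinite repetitions, and their last characters are concatenated. It only illustrates the definition; it is unrelated to the linear-time and prefix-free-parsing constructions described in the abstract.

```python
from functools import cmp_to_key

def omega_cmp(u: str, v: str) -> int:
    """Compare u^omega and v^omega; a prefix of length |u| + |v| suffices (Fine and Wilf)."""
    k = len(u) + len(v)
    uu = (u * (k // len(u) + 1))[:k]
    vv = (v * (k // len(v) + 1))[:k]
    return (uu > vv) - (uu < vv)

def ebwt(strings: list[str]) -> str:
    """Original eBWT, naively: sort all conjugates of all strings by omega-order
    and collect their last characters."""
    conjugates = [s[i:] + s[:i] for s in strings for i in range(len(s))]
    conjugates.sort(key=cmp_to_key(omega_cmp))
    return "".join(c[-1] for c in conjugates)

# The output does not depend on the order of the input collection:
print(ebwt(["banana", "ananas"]) == ebwt(["ananas", "banana"]))  # True
```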

    A new class of string transformations for compressed text indexing

    Get PDF
    Introduced about thirty years ago in the field of data compression, the Burrows-Wheeler Transform (BWT) is a string transformation that, besides boosting the performance of memoryless compressors, plays a fundamental role in the design of efficient self-indexing compressed data structures. Finding other string transformations with the same remarkable properties as the BWT has long been a challenge for many researchers. In this paper, we introduce a whole class of new string transformations, called local orderings-based transformations, which have all the “myriad virtues” of the BWT. As a further result, we show that these new string transformations can be used for the construction of the recently introduced r-index, which makes them suitable also for highly repetitive collections. In this context, we consider the problem of finding, for a given string, the BWT variant that minimizes the number of runs in the transformed string.
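    Since the quality measure considered above is the number of runs of the transformed string, the small sketch below shows how that quantity, usually denoted r, can be measured; it assumes the usual sentinel-terminated BWT, computed naively from sorted rotations, rather than any of the new transformations introduced in the paper.

```python
def bwt(s: str, sentinel: str = "$") -> str:
    """Naive BWT: last characters of the lexicographically sorted rotations of s + sentinel."""
    t = s + sentinel  # the sentinel is assumed smaller than every character of s
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(rot[-1] for rot in rotations)

def count_runs(s: str) -> int:
    """Number of maximal runs of equal characters (the parameter r)."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])

transformed = bwt("mississippi")
print(transformed, count_runs(transformed))  # ipssm$pissii 9
```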

    Variable-order reference-free variant discovery with the Burrows-Wheeler Transform

    Get PDF
    Background: In [Prezza et al., AMB 2019], a new reference-free and alignment-free framework for the detection of SNPs was suggested and tested. The framework, based on the Burrows-Wheeler Transform (BWT), significantly improves the sensitivity and precision of previous de Bruijn graph based tools by overcoming several of their limitations, namely: (i) the need to establish a fixed, usually small, value for the order k; (ii) the loss of important information such as k-mer coverage and adjacency of k-mers within the same read; and (iii) poor performance in repeated regions longer than k bases. The preliminary tool, however, was able to identify only SNPs, and it was too slow and memory-consuming due to the use of additional heavy data structures (namely, the Suffix and LCP arrays) besides the BWT. Results: In this paper, we introduce a new algorithm and the corresponding tool ebwt2InDel that (i) extends the framework of [Prezza et al., AMB 2019] to also detect INDELs, and (ii) implements recent algorithmic findings that allow the whole analysis to be performed using just the BWT, thus reducing the working space by one order of magnitude and allowing the analysis of full genomes. Finally, we describe a simple strategy for effectively parallelizing our tool for SNP detection only. On a 24-core machine, the parallel version of our tool is one order of magnitude faster than the sequential one. The tool ebwt2InDel is available at github.com/nicolaprezza/ebwt2InDel. Conclusions: Results on a synthetic dataset covered at 30x (Human chromosome 1) show that our tool is indeed able to find up to 83% of the SNPs and 72% of the existing INDELs. These percentages considerably improve on the 71% of SNPs and 51% of INDELs found by the state-of-the-art tool based on de Bruijn graphs. We furthermore repor

    r-indexing the eBWT

    Get PDF
    The extended Burrows-Wheeler Transform (eBWT) [Mantaci et al. TCS 2007] is a variant of the BWT, introduced for collections of strings. In this paper, we present the extended r-index, a data structure analogous to the r-index [Gagie et al. JACM 2020]. It occupies O(r) words, with r the number of runs of the eBWT, and offers the same functionalities as the r-index. We also show how to efficiently support finding maximal exact matches (MEMs). We implemented the extended r-index and tested it on circular bacterial genomes and plasmids, comparing it to five state-of-the-art compressed text indexes. While our data structure maintains time and memory requirements for answering pattern matching queries similar to those of the original r-index, it is the only index in the literature that can naturally be used for both circular and linear input collections. This is an extended version of [Boucher et al., r-indexing the eBWT, SPIRE 2021].
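    The O(r)-word bound mentioned above comes from storing the transform in run-length-compressed form, one entry per run; a minimal sketch of that representation (independent of the actual index machinery) is:

```python
from itertools import groupby

def run_length_encode(transform: str) -> list[tuple[str, int]]:
    """Store a (e)BWT as its r runs: one (character, run length) pair per run."""
    return [(c, len(list(g))) for c, g in groupby(transform)]

runs = run_length_encode("ipssm$pissii")  # a BWT of "mississippi$", used here as a stand-in
print(len(runs), runs)                    # 9 runs instead of 12 characters
```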

    From First Principles to the Burrows and Wheeler Transform and Beyond, via Combinatorial Optimization

    Get PDF
    We introduce a combinatorial optimization framework that naturally yields a class of optimal word permutations. Our framework provides the first formal quantification of the intuitive idea that the longer the context shared by two symbols in a word, the closer those symbols should be to each other in a linear order of the symbols. The Burrows and Wheeler transform [6], and the compressible part of its analog for labelled trees [10], are special cases in this class. We also show that the class of optimal word permutations defined here is identical to the one identified by Ferragina et al. for compression boosting [9]; therefore, they are all highly compressible. We also investigate more general classes of optimal word permutations, where the relatedness of symbols may be measured by functions more complex than context length. In this case, we establish a non-trivial connection between word permutations and the Table Compression techniques presented in Buchsbaum et al. [5], on one hand, and a universal similarity metric [17] with uses in Clustering and Classification [8], on the other. Unfortunately, for this general problem, we provide instances that are MAX-SNP hard, and therefore unlikely to be solved or approximated efficiently. The results presented here indicate that, contrary to folklore, the key feature of the Burrows and Wheeler transform seems to be the existence of efficient algorithms for its computation and inversion, rather than its compressibility. Finally, for completeness, we also provide a solution to an open problem implicitly posed in [6] regarding the computation of the transform.
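    The intuition quantified above, that symbols sharing a longer context should end up closer together, can be seen directly in the sorted-rotations view of the BWT: each output symbol is the one preceding its rotation, so symbols followed by similar contexts become neighbours. The toy Python sketch below only visualizes that idea and is not an implementation of the optimization framework.

```python
def bwt_with_contexts(s: str) -> list[tuple[str, str]]:
    """Pair each BWT symbol with the (cyclic) context that follows it in the text."""
    t = s + "$"
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    # rot[-1] is the output symbol, rot[:-1] starts with the context it precedes.
    return [(rot[-1], rot[:-1]) for rot in rotations]

for symbol, context in bwt_with_contexts("mississippi"):
    print(symbol, context)
# The two 'i' symbols that precede occurrences of "ssi..." end up adjacent
# in the transformed string, illustrating the context-sharing intuition.
```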

    No Feature Data Analytics: Compression Pattern Recognition

    No full text
    A similarity matrix gives the degree of similarity between every pair of data items and plays a core role in a number of dimensionality reduction methods, since their objective functions are built upon this matrix. While compression-based similarity measures are effectively employed, as an essentially parameter-free approach, in applications on diverse data types, the fast compression distance (FCD) metric has been shown to achieve classification performance similar to the Normalized Compression Distance (NCD) on small- to medium-sized datasets [1]. The FCD is claimed to combine high speed without skipping the joint compression step, which yields better performance compared with NCD [2]. The underlying idea is that each image patch is encoded as a string x, from which the LZW algorithm extracts a dictionary D(x), kept in ascending order. The FCD is then defined as an operation that mainly takes into account the number of patterns shared by two dictionaries D(x) and D(y). In this research, we use FCD together with t-SNE to visualize a large semantically annotated TerraSAR-X dataset as a case study. The dataset contains image patches from 288 TerraSAR-X images, with a total of over 60,000 individual image patches. The visualizations represent the annotated semantic labels in an intuitive way, which helps us better understand the relationships between the annotated semantics and the actual similarities of the data in manifold space. Our results show that the FCD-based similarity matrix provides fast yet performance-preserving insight into high-dimensional datasets with a non-parametric distance metric. Via the visualization of the TerraSAR-X dataset, we have gained quick intuition and a better understanding of the connections between the annotated semantics and the relationships within the data, revealed as similarities in manifold space. The visualization is explored with a Vega-style interactive tool, which allows the user to zoom in and out when processing large numbers of data points.
    Change detection methods depend on the extracted image features and on the similarity measures used to compare the observed scene at different time moments. The NCD has an important advantage: it does not use features and compares the intrinsic data information. The change detection is thus an unpolarized estimator of temporal changes. The method is validated on two areas with visible changes such as flooding and tsunami effects, and the results are compared with ground truth data. The influence of natural disasters, as well as of global climate warming, has increased in the past decades; therefore, the detection of changes in a satellite image time series is an important task [3]. The calculation can be divided into several phases. The first is the preprocessing, which includes the alignment of all images as well as the creation of patches in a region of interest; we propose to use Sentinel-1 SAR data here. The second phase comprises the generation of a distance matrix for the patches from two images. In the last phase, a threshold is applied to this matrix in order to show the changes in a binary way as a binary change map (BCM). The results show that a compression-based approach works very well on SAR data. Moreover, a visual evaluation of the resulting images shows that the compression-based approach detects the flooded areas within a region. Furthermore, if the input is reordered with the Burrows and Wheeler transformation [4], the resulting image is better in some areas than with the base method. This shows the robustness of NCD.
    [1] D. Cerra and M. Datcu, “A fast compression-based similarity measure with applications to content-based image retrieval,” Journal of Visual Communication and Image Representation, vol. 23, pp. 293–302, February 2012.
    [2] M. Li, X. Chen, X. Li, B. Ma, and P. M. B. Vitanyi, “The similarity metric,” IEEE Transactions on Information Theory, vol. 50, pp. 3250–3264, 2004.
    [3] M. Coca, A. Anghel, and M. Datcu, “Normalized compression distance for SAR image change detection,” pp. 1–3, 2018.
    [4] R. Giancarlo, A. Restivo, and M. Sciortino, “From first principles to the Burrows and Wheeler transform and beyond, via combinatorial optimization,” Theoretical Computer Science, vol. 387, pp. 236–248, 2007.
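    As a reference point for the compression-based measures discussed in the abstract above, the sketch below computes the standard NCD formula with a general-purpose compressor (zlib, chosen here only for convenience); the FCD used in the study, which replaces joint compression with dictionary intersection, is not reproduced.

```python
import zlib

def c(data: bytes) -> int:
    """Compressed size, used as a practical stand-in for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy = c(x), c(y)
    return (c(x + y) - min(cx, cy)) / max(cx, cy)

# Hypothetical patch contents, only to exercise the formula.
patch_a = b"flooded area, dark backscatter, smooth texture " * 20
patch_b = b"urban area, bright backscatter, strong texture " * 20
print(ncd(patch_a, patch_a))  # small (not exactly 0 with a real compressor)
print(ncd(patch_a, patch_b))  # larger for dissimilar inputs
```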