
    A New Class of Searchable and Provably Highly Compressible String Transformations

    The Burrows-Wheeler Transform is a string transformation that plays a fundamental role in the design of self-indexing compressed data structures. Over the years, researchers have successfully extended this transformation beyond the domain of strings. However, efforts to find non-trivial alternatives to the original, now 25 years old, Burrows-Wheeler string transformation have met with limited success. In this paper we bring new life to this area by introducing a whole new family of transformations that have all the "myriad virtues" of the BWT: they can be computed and inverted in linear time, they produce provably highly compressible strings, and they support linear time pattern search directly on the transformed string. This new family is a special case of a more general class of transformations based on context adaptive alphabet orderings, a concept introduced here. This more general class also includes the Alternating BWT, another invertible string transform recently introduced in connection with a generalization of Lyndon words.

    The Alternating BWT: An algorithmic perspective

    The Burrows-Wheeler Transform (BWT) is a word transformation introduced in 1994 for Data Compression. It has become a fundamental tool for designing self-indexing data structures, with important applications in several areas in science and engineering. The Alternating Burrows-Wheeler Transform (ABWT) is another transformation recently introduced in Gessel et al. (2012) [21] and studied in the field of Combinatorics on Words. It is analogous to the BWT, except that it uses an alternating lexicographical order instead of the usual one. Building on results in Giancarlo et al. (2018) [23], where we have shown that BWT and ABWT are part of a larger class of reversible transformations, here we provide a combinatorial and algorithmic study of the novel transform ABWT. We establish a deep analogy between BWT and ABWT by proving they are the only ones in the above mentioned class to be rank-invertible, a novel notion guaranteeing efficient invertibility. In addition, we show that the backward-search procedure can be efficiently generalized to the ABWT; this result implies that the ABWT can also be used as a basis for efficient compressed full text indices. Finally, we prove that the ABWT can be efficiently computed by using a combination of the Difference Cover suffix sorting algorithm (Kärkkäinen et al., 2006 [28]) with a linear time algorithm for finding the minimal cyclic rotation of a word with respect to the alternating lexicographical order.
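
To make the difference between the two orders concrete, here is a minimal Python sketch. It assumes the rotation-based definition of both transforms (no end-of-string sentinel) and the usual reading of the alternating lexicographical order: at the first position where two rotations differ, the usual order applies if that position is odd (counting from 1) and the reversed order applies if it is even. The function names are illustrative, and the naive sorting of all rotations is not the efficient Difference Cover construction described in the abstract.

```python
def rotations(word):
    """Return all cyclic rotations (conjugates) of word."""
    return [word[i:] + word[:i] for i in range(len(word))]


def bwt(word):
    """Naive BWT: last column of the lexicographically sorted rotations."""
    return "".join(rot[-1] for rot in sorted(rotations(word)))


def alternating_key(rot):
    """Sort key realising the alternating order on equal-length words.

    0-based even positions (1st, 3rd, ... symbol) keep the usual order;
    0-based odd positions (2nd, 4th, ... symbol) are reversed by negating
    the code point.
    """
    return tuple(ord(c) if j % 2 == 0 else -ord(c) for j, c in enumerate(rot))


def abwt(word):
    """Naive ABWT: last column of the rotations sorted in alternating order."""
    return "".join(rot[-1] for rot in sorted(rotations(word), key=alternating_key))


if __name__ == "__main__":
    print(bwt("banana"))
    print(abwt("banana"))
```

Both functions take quadratic time and space in the word length, so they are only suitable for experimenting with short inputs.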

    A new class of string transformations for compressed text indexing

    Introduced about thirty years ago in the field of data compression, the Burrows-Wheeler Transform (BWT) is a string transformation that, besides boosting the performance of memoryless compressors, plays a fundamental role in the design of efficient self-indexing compressed data structures. Finding other string transformations with the same remarkable properties as BWT has long been a challenge for many researchers. In this paper, we introduce a whole class of new string transformations, called local orderings-based transformations, which have all the “myriad virtues” of BWT. As a further result, we show that such new string transformations can be used for the construction of the recently introduced r-index, which makes them suitable also for highly repetitive collections. In this context, we consider the problem of finding, for a given string, the BWT variant that minimizes the number of runs in the transformed string.

    Analysis of the Combination of the Knapsack Algorithm and Run Length Encoding (RLE) on Text Files

    The Knapsack algorithm belongs to asymmetric cryptography, in which the encryption key differs from the decryption key. Besides the security of text files, their size is also a consideration. Large text files can be reduced through compression. The Run Length Encoding (RLE) algorithm shrinks a text file when the text contains many repeated characters. Combining the Knapsack and RLE algorithms can ensure that a text file cannot be viewed by unauthorized users and can be stored on low-capacity media. In this study, the authors built a program that combines the Knapsack and RLE algorithms on text files. The Knapsack algorithm increases the size of the text file; this can be seen in the example case, in which the plaintext (original message) is 9 bytes and, after encryption, the text file becomes 7 bytes. Therefore, the combination of encryption and data compression is preferable, because the file becomes smaller than with the combination of compression and encryption. Plaintext with many repeated characters compresses well.
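
As an illustration of the run-length idea mentioned in this abstract, the following Python sketch encodes a string as a list of (character, run length) pairs and decodes it again. The pair-based representation and the function names are assumptions made for illustration; the abstract does not describe the encoding format the authors actually used.

```python
from itertools import groupby


def rle_encode(text):
    """Collapse each maximal run of equal characters into (char, length)."""
    return [(ch, len(list(group))) for ch, group in groupby(text)]


def rle_decode(pairs):
    """Expand (char, length) pairs back into the original string."""
    return "".join(ch * count for ch, count in pairs)


if __name__ == "__main__":
    encoded = rle_encode("aaabccdddd")
    print(encoded)
    print(rle_decode(encoded))
```

A text with few repeated characters yields about as many pairs as it has characters, which is why RLE only helps on inputs with long runs.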

    Variable-order reference-free variant discovery with the Burrows-Wheeler Transform

    Background: In [Prezza et al., AMB 2019], a new reference-free and alignment-free framework for the detection of SNPs was suggested and tested. The framework, based on the Burrows-Wheeler Transform (BWT), significantly improves sensitivity and precision of previous de Bruijn graphs based tools by overcoming several of their limitations, namely: (i) the need to establish a fixed value, usually small, for the order k, (ii) the loss of important information such as k-mer coverage and adjacency of k-mers within the same read, and (iii) bad performance in repeated regions longer than k bases. The preliminary tool, however, was able to identify only SNPs and it was too slow and memory consuming due to the use of additional heavy data structures (namely, the Suffix and LCP arrays), besides the BWT. Results: In this paper, we introduce a new algorithm and the corresponding tool ebwt2InDel that (i) extends the framework of [Prezza et al., AMB 2019] to also detect INDELs, and (ii) implements recent algorithmic findings that make it possible to perform the whole analysis using just the BWT, thus reducing the working space by one order of magnitude and allowing the analysis of full genomes. Finally, we describe a simple strategy for effectively parallelizing our tool for SNP detection only. On a 24-core machine, the parallel version of our tool is one order of magnitude faster than the sequential one. The tool ebwt2InDel is available at github.com/nicolaprezza/ebwt2InDel. Conclusions: Results on a synthetic dataset covered at 30x (Human chromosome 1) show that our tool is indeed able to find up to 83% of the SNPs and 72% of the existing INDELs. These percentages considerably improve the 71% of SNPs and 51% of INDELs found by the state-of-the art tool based on de Bruijn graphs. We furthermore repor

    Burrows-Wheeler transform and Run-Length Enconding

    In this paper we study the clustering effect of the Burrows-Wheeler Transform (BWT) from a combinatorial viewpoint. In particular, given a word w we define the BWT-clustering ratio of w as the ratio between the number of clusters produced by BWT and the number of the clusters of w. The number of clusters of a word is measured by its Run-Length Encoding. We show that the BWT-clustering ratio ranges in ]0,2]. Moreover, given a rational number r ∈ ]0,2], it is possible to find infinitely many words having BWT-clustering ratio equal to r. Finally, we show how the words can be classified according to their BWT-clustering ratio. The behavior of such a parameter is studied for very well-known families of binary words.
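
The ratio defined in this abstract can be computed directly for small words. The Python sketch below assumes the rotation-based BWT without a sentinel character, counts clusters as maximal runs of equal letters, and requires a non-empty word; the function names are illustrative and the quadratic BWT is meant for experimentation only.

```python
from fractions import Fraction
from itertools import groupby


def bwt(word):
    """Naive BWT: last column of the sorted cyclic rotations of word."""
    rots = sorted(word[i:] + word[:i] for i in range(len(word)))
    return "".join(rot[-1] for rot in rots)


def runs(word):
    """Number of maximal runs of equal letters, i.e. the RLE length."""
    return sum(1 for _ in groupby(word))


def clustering_ratio(word):
    """Runs of BWT(word) divided by runs of word, as an exact fraction."""
    return Fraction(runs(bwt(word)), runs(word))


if __name__ == "__main__":
    print(clustering_ratio("banana"))
```

Returning a `Fraction` rather than a float keeps the result exact, which matches the abstract's statement about rational values of the ratio.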

    Diagnostic Implications of Arrhythmogenic Cardiomyopathy Genetic Testing

    Background & Aims: Arrhythmogenic Cardiomyopathy (AC) is a rare inherited heart muscle disease associated with mutations in genes encoding mainly components of the cardiac desmosome. We performed a comprehensive study of genetic variants in a cohort of AC subjects and assessed Next Generation Sequencing (NGS) strategies for molecular diagnosis of AC. Methods: Ninety-nine unrelated index cases, of which 26 were sudden cardiac death cases, underwent genetic screening for 5 desmosomal genes by denaturing high performance liquid chromatography and direct sequencing, whereas 46 probands were additionally screened for 3 extra desmosomal genes. A complementary analysis for copy number variants (CNVs) was performed by multiplex ligation-dependent probe amplification and quantitative real-time PCR in the entire cohort. A 4-step variant filtering strategy based on mutation type, frequency, evolutionary conservation and in silico analysis was used. Whole Exome and Targeted NGS strategies were performed on Illumina platforms in order to test the methods' efficacy. Results: Screening of 8 AC genes and subsequent 4-step variant filtering identified 37 different point desmosomal mutations in 42 AC probands (42%). The most frequently mutated genes were PKP2 and DSP, with the “radical” mutation type accounting for 80% of the PKP2 variants. No pathogenic mutations were identified in the extra desmosomal genes analyzed. CNV analysis further revealed 3 different large genomic rearrangements in 5 probands (4%), increasing to 46 (46%) the number of positively genotyped patients. PKP2 and DSP single mutations accounted respectively for 20% and 11% of the cohort, with DSP carriers showing a higher risk of sudden cardiac death. Eight carriers of multiple mutations were observed (8%). NGS approaches identified 4 variants in extra desmosomal genes, allowing a differential diagnosis in 4 patients.
Conclusions: Fine variant filtering avoids overrepresentation of putative pathogenic mutations and shows that radical and missense mutations should be equally interpreted with great caution in the setting of clinical diagnosis. NGS and CNV analysis significantly increased the diagnostic yield in AC genetic testing. The genetics of AC is more complex than previously appreciated, with frequent requirement for more than one ‘hit’ for penetrant disease.

    Anthropometric and genetic determinants of cardiac morphology and function

    Background Cardiac structure and function result from complex interactions between genetic and environmental factors. Population-based studies have relied on 2-dimensional cardiovascular magnetic resonance as the gold-standard for phenotyping. However, this technique provides limited global metrics and is insensitive to regional or asymmetric changes in left ventricular (LV) morphology. High-resolution 3-dimensional cardiac magnetic resonance (3D-CMR) with computational quantitative phenotyping might improve on traditional CMR by enabling the creation of detailed 3D statistical models of the variation in cardiac phenotypes for use in studies of genetic and/or environmental effects on cardiac form or function. Purpose To determine whether 3D-CMR is applicable at scale and provides methodological and statistical advantages over conventional imaging for large-scale population studies, and to apply 3D-CMR to anthropometric and genetic studies of the heart. Methods 1530 volunteers (54.8% females, 74.7% Caucasian, mean age 41.3±13.0 years) without self-reported cardiovascular disease were recruited prospectively to the Digital Heart Project. Using cardiac atlas-based software, these images were computationally processed and quantitatively analysed. Parameters such as myocardial shape, curvature, wall thickness, relative wall thickness, end-systolic wall stress, fractional wall thickening and ventricular volumes were extracted at over 46,000 points in the model. The relationships between these parameters and systolic blood pressure (SBP), fat mass, lean mass and genetic variations were analysed using 3D regression models adjusted for body surface area, gender, race, age and multiple testing. Targeted resequencing of titin (TTN), the largest human gene and the commonest genetic cause of dilated cardiomyopathy, was performed in 928 subjects while common variants (~700,000) were genotyped in 1346 subjects.
Results Automatically segmented 3D images were more accurate than 2D images at defining cardiac surfaces, resulting in fewer subjects being required to detect a statistically significant 1 mm difference in wall thickness. 3D-CMR enabled the detection of a strong and distinct regionality of the effects of SBP, body composition and genetic variation on the heart. It shows that the precursors of the hypertensive heart phenotype can be traced to healthy normotensives and that different ratios of body composition are associated with particular gender-specific patterns of cardiac remodelling. In 17 asymptomatic subjects with genetic variations associated with dilated cardiomyopathy, early stages of ventricular impairment and wall thinning were identified, which were not apparent by 2D imaging. Conclusions 3D-CMR combined with computational modelling provides high-resolution insight into the earliest stages of heart disease. These methods show promise for population-based studies of the anthropometric, environmental and genetic determinants of LV form and function in health and disease.