
    Compressing and indexing aligned readsets

    Compressed full-text indexes are one of the main success stories of bioinformatics data structures, but even they struggle to handle some DNA readsets. This may seem surprising since, at least when dealing with short reads from the same individual, the readset will be highly repetitive and, thus, highly compressible. If we are not careful, however, this advantage can be more than offset by two disadvantages: first, since most base pairs are included in at least tens of reads each, the uncompressed readset is likely to be at least an order of magnitude larger than the individual's uncompressed genome; second, these indexes usually pay some space overhead for each string they store, and the total overhead can be substantial when dealing with millions of reads. The most successful compressed full-text indexes for readsets so far are based on the Extended Burrows-Wheeler Transform (EBWT) and use a sorting heuristic to try to reduce the space overhead per read, but they still treat the reads as separate strings and thus may not take full advantage of the readset's structure. For example, if we have already assembled an individual's genome from the readset, then we can usually use it to compress the readset well: e.g., we store the gap-coded list of the reads' starting positions; we store the list of their lengths, which is often highly compressible; and we store information about the sequencing errors, which are rare with short reads. There is nowhere, however, to plug an assembled genome into the EBWT. In this paper we show how to use one or more assembled or partially assembled genomes as the basis for a compressed full-text index of a readset. Specifically, we build a labelled tree by taking the assembled genome as a trunk and grafting onto it the reads that align to it, at the starting positions of their alignments. Next, we compute the eXtended Burrows-Wheeler Transform (XBWT) of the resulting labelled tree and build a compressed full-text index on that. Although this index can occasionally return false positives, it is usually much more compact than the alternatives. Following the established practice for datasets with many repetitions, we compare different full-text indexes by looking at the number of runs in the transformed strings. For a human Chr19 readset, our preliminary experiments show that eliminating separator characters from the EBWT reduces the number of runs by 19%, from 220 million to 178 million, and using the XBWT reduces it by a further 15%, to 150 million.
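    The grafting construction can be pictured with a small sketch. The following is illustrative Python, not the paper's implementation (the names Node, build_trunk and graft are ours, and reads are assumed to align exactly, given only their 0-based starting positions); reads that agree with the genome simply walk along the trunk, and a new branch is created only where a read first differs:

```python
# Minimal, illustrative sketch of the grafting construction (not the paper's
# code): the assembled genome forms a trunk of nodes, and each aligned read
# is walked from the node just before its alignment start, reusing existing
# nodes where labels agree and branching only at mismatches. Names such as
# Node, build_trunk and graft are invented here.

class Node:
    def __init__(self, label):
        self.label = label       # one character of the genome or of a read
        self.children = {}       # child label -> child Node

def build_trunk(genome):
    """Build a chain of nodes spelling the assembled genome."""
    root = Node('$')             # artificial root above position 0
    trunk, cur = [root], root
    for c in genome:
        child = Node(c)
        cur.children[c] = child
        trunk.append(child)
        cur = child
    return trunk                 # trunk[p] is the node just before position p

def graft(trunk, read, start):
    """Graft a read aligned at `start`, sharing nodes wherever possible."""
    cur = trunk[start]
    for c in read:
        cur = cur.children.setdefault(c, Node(c))

trunk = build_trunk("ACGTACGT")
graft(trunk, "GTAC", 2)          # exact match: the read just walks the trunk
graft(trunk, "GTTC", 2)          # one mismatch: a branch appears at its 3rd base
```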

    String periods in the order-preserving model

    In the order-preserving model, two strings match if they share the same relative order between the characters at corresponding positions. This model is quite recent, but it has already attracted significant attention because of its applications in data analysis. We introduce several types of periods in this setting (op-periods). Then we give algorithms to compute these periods in time O(n), O(n log log n), O(n log² log n / log log log n), or O(n log n), depending on the type of periodicity. In the most general variant, the number of different op-periods can be as big as Ω(n²), and a compact representation is needed. Our algorithms require novel combinatorial insight into the properties of op-periods. In particular, we characterize the Fine–Wilf property for coprime op-periods. © 2019 Elsevier Inc. Supported by ISF grants no. 824/17 and 1278/16 and by an ERC grant MPM under the EU's Horizon 2020 Research and Innovation Programme (grant no. 683064). Supported by the Ministry of Science and Higher Education of the Russian Federation, project 1.3253.2017. A part of this work was done during the workshop StringMasters in Warsaw 2017, which was sponsored by the Warsaw Center of Mathematics and Computer Science. The authors thank the participants of the workshop, especially Hideo Bannai and Shunsuke Inenaga, for helpful discussions.
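    Since op-periods are defined by order-preserving self-overlap, a naive check makes the definition concrete. The sketch below is illustrative Python (the quadratic scan and the function names are ours; it covers only the basic, border-style variant, whereas the paper treats several types and achieves near-linear time):

```python
# Definition-level sketch of order-preserving (op) matching and a naive
# op-period scan (border-style variant only; the paper studies several).
# This quadratic check just pins down the definitions; the paper's
# algorithms run in (near-)linear time.

def op_match(x, y):
    """True iff x and y induce the same relative order at corresponding positions."""
    if len(x) != len(y):
        return False
    n = len(x)
    return all((x[i] < x[j]) == (y[i] < y[j]) and (x[i] > x[j]) == (y[i] > y[j])
               for i in range(n) for j in range(i + 1, n))

def naive_op_periods(s):
    """All p such that s op-matches itself shifted by p positions."""
    n = len(s)
    return [p for p in range(1, n) if op_match(s[:n - p], s[p:])]

print(op_match([1, 3, 2], [10, 30, 20]))  # True: same up-down shape
print(naive_op_periods([1, 2, 1, 2, 1]))  # [2, 4]
```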

    Computing Covers under Substring Consistent Equivalence Relations

    Covers are a kind of quasiperiodicity in strings. A string C is a cover of another string T if every position of T is inside some occurrence of C in T. The shortest and longest cover arrays of T hold the lengths of the shortest and longest covers of each prefix of T, respectively. The literature has proposed linear-time algorithms computing the longest and shortest cover arrays, taking border arrays as input. An equivalence relation ≈ over strings is called a substring consistent equivalence relation (SCER) iff X ≈ Y implies (1) |X| = |Y| and (2) X[i:j] ≈ Y[i:j] for all 1 ≤ i ≤ j ≤ |X|. In this paper, we generalize the notion of covers to SCERs and prove that the existing algorithms to compute the shortest cover array and the longest cover array of a string T under the identity relation work for any SCER, taking the correspondingly generalized border arrays as input. Comment: 16 pages
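    The cover definition is easy to state operationally. The following is an illustrative quadratic Python sketch under the identity relation (the function names are ours; the algorithms discussed in the paper instead run in linear time using border arrays, generalized to SCERs):

```python
# Direct, quadratic illustration of the cover definition under the identity
# relation (the paper's algorithms instead run in linear time on border
# arrays, generalized to SCERs). Function names are illustrative.

def is_cover(c, t):
    """True iff every position of t lies inside some occurrence of c in t."""
    m, n = len(c), len(t)
    covered = [False] * n
    for i in range(n - m + 1):
        if t[i:i + m] == c:
            for j in range(i, i + m):
                covered[j] = True
    return all(covered)

def shortest_cover_array(t):
    """Entry k is the length of the shortest cover of the prefix t[:k+1]."""
    return [min(l for l in range(1, k + 2) if is_cover(t[:l], t[:k + 1]))
            for k in range(len(t))]

print(is_cover("aba", "ababaaba"))    # True: occurrences at 0, 2 and 5
print(shortest_cover_array("ababa"))  # [1, 2, 3, 2, 3]
```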

    Hide and mine in strings: Hardness, algorithms, and experiments

    Data sanitization and frequent pattern mining are two well-studied topics in data mining. Our work initiates a study of the fundamental relation between data sanitization and frequent pattern mining in the context of sequential (string) data. Current methods for string sanitization hide confidential patterns. This, however, may lead to spurious patterns that harm the utility of frequent pattern mining. The main computational problem is to minimize this harm. Our contribution here is as follows. First, we present several hardness results, for different variants of this problem, essentially showing that these variants cannot be solved, or even approximated, in polynomial time. Second, we propose integer linear programming formulations for these variants and algorithms to solve them, which work in polynomial time under realistic assumptions on the input parameters. We complement the integer linear programming algorithms with a greedy heuristic. Third, we present an extensive experimental study, using both synthetic and real-world datasets, that demonstrates the effectiveness and efficiency of our methods. Beyond sanitization, the process of missing-value replacement may also lead to spurious patterns. Interestingly, our results apply in this context as well.
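    The tension between hiding and mining can be seen on a toy instance. The sketch below is illustrative Python brute force, not the paper's ILP formulations or greedy heuristic; the string, the hidden pattern, the k-mer length and the frequency threshold are all invented:

```python
# Toy brute-force illustration of the tension studied in the paper (not its
# ILP or greedy methods; the string, pattern and threshold below are
# invented): each '#' left by sanitization must be replaced by a letter
# without reinstating the hidden pattern, while creating as few frequent
# ("spurious") k-mers as possible.

from collections import Counter
from itertools import product

def frequent(s, k, tau):
    """k-mers occurring at least tau times in s."""
    counts = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    return {w for w, n in counts.items() if n >= tau}

def fill(s, letters):
    """Replace the '#'s in s, left to right, with the given letters."""
    it = iter(letters)
    return "".join(next(it) if c == "#" else c for c in s)

S, hidden, k, tau, sigma = "ab#bab#ab", "aa", 2, 4, "ab"

candidates = [t for t in (fill(S, ls) for ls in product(sigma, repeat=S.count("#")))
              if hidden not in t]       # the sanitized pattern must stay absent
best = min(candidates, key=lambda t: len(frequent(t, k, tau)))
print(best, frequent(best, k, tau))     # a filling that creates no frequent 2-mer
```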

    Hide and mine solver

    This repository contains solutions to the Hide and Mine problem. They fall into two kinds: one uses an ILP solver (Gurobi), and the others are greedy heuristics. To use them, you will first need to install Gurobi (free academic licenses are available). Both the ILP solutions and the heuristics are written in C++11.
