4 research outputs found

    Improving read mapping using additional prefix grams

    Get PDF
    BACKGROUND: Next-generation sequencing (NGS) enables rapid production of billions of bases at a relatively low cost. Mapping reads from next-generation sequencers to a given reference genome is an important first step in many sequencing applications. Popular read mappers, such as Bowtie and BWA, are optimized to return top one or a few candidate locations of each read. However, identifying all mapping locations of each read, instead of just one or a few, is also important in some sequencing applications such as ChIP-seq for discovering binding sites in repeat regions, and RNA-seq for transcript abundance estimation. RESULTS: Here we present Hobbes2, a software package designed for fast and accurate alignment of NGS reads and specialized in identifying all mapping locations of each read. Hobbes2 efficiently identifies all mapping locations of reads using a novel technique that utilizes additional prefix q-grams to improve filtering. We extensively compare Hobbes2 with state-of-the-art read mappers, and show that Hobbes2 can be an order of magnitude faster than other read mappers while consuming less memory space and achieving similar accuracy. CONCLUSIONS: We propose Hobbes2 to improve the accuracy of read mapping, specialized in identifying all mapping locations of each read. Hobbes2 is implemented in C++, and the source code is freely available for download at http://hobbes.ics.uci.edu

    Universal Indexes for Highly Repetitive Document Collections

    Get PDF
    Indexing highly repetitive collections has become a relevant problem with the emergence of large repositories of versioned documents, among other applications. These collections may reach huge sizes, but are formed mostly of documents that are near-copies of others. Traditional techniques for indexing these collections fail to properly exploit their regularities in order to reduce space. We introduce new techniques for compressing inverted indexes that exploit this near-copy regularity. They are based on run-length, Lempel-Ziv, or grammar compression of the differential inverted lists, instead of the usual practice of gap-encoding them. We show that, in this highly repetitive setting, our compression methods significantly reduce the space obtained with classical techniques, at the price of moderate slowdowns. Moreover, our best methods are universal, that is, they do not need to know the versioning structure of the collection, nor that a clear versioning structure even exists. We also introduce compressed self-indexes in the comparison. These are designed for general strings (not only natural language texts) and represent the text collection plus the index structure (not an inverted index) in integrated form. We show that these techniques can compress much further, using a small fraction of the space required by our new inverted indexes. Yet, they are orders of magnitude slower.Comment: This research has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sk{\l}odowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 69094

    Optimizing Trees for Static Searchable Encryption

    Get PDF
    Searchable symmetric encryption (SSE) enables data owners to conduct searches over encrypted data stored by an untrusted server, retrieving only those encrypted files that match the search queries. Several recent schemes employ a server-side encrypted index in the form of a search tree where each node stores a bit vector denoting for each keyword whether any file in its subtree contains that keyword. Our work is motivated by the observation that the way data is distributed in such a search tree has a big impact on the cost of searches. For single-keyword queries, it impacts the number of different paths that must be followed to find all the matching files; for multi-keyword queries, the arrangement of the tree also impacts the number of nodes visited during the search on paths that do not lead to any satisfying data elements. We present three algorithms that improve the performance of SSE schemes based on tree indexes and prove that for cases where the search cost is high, the cost of our algorithms converges to the cost of the optimal tree. In our experiments, the resulting search trees outperform the arbitrary search trees used in previous works by a factor of up to tw

    Intersections of Inverted Lists

    No full text
    corecore