2,360 research outputs found

    A new method for indexing genomes using on-disk suffix trees

    Full text link
    We propose a new method to build persistent suffix trees for indexing the genomic data. Our algorithm DiGeST (Disk-Based Genomic Suffix Tree) improves significantly over previous work in reducing the random access to the in-put string and performing only two passes over disk data. DiGeST is based on the two-phase multi-way merge sort paradigm using a concise binary representation of the DNA alphabet. Furthermore, our method scales to larger genomic data than managed before

    SoK: Cryptographically Protected Database Search

    Full text link
    Protected database search systems cryptographically isolate the roles of reading from, writing to, and administering the database. This separation limits unnecessary administrator access and protects data in the case of system breaches. Since protected search was introduced in 2000, the area has grown rapidly; systems are offered by academia, start-ups, and established companies. However, there is no best protected search system or set of techniques. Design of such systems is a balancing act between security, functionality, performance, and usability. This challenge is made more difficult by ongoing database specialization, as some users will want the functionality of SQL, NoSQL, or NewSQL databases. This database evolution will continue, and the protected search community should be able to quickly provide functionality consistent with newly invented databases. At the same time, the community must accurately and clearly characterize the tradeoffs between different approaches. To address these challenges, we provide the following contributions: 1) An identification of the important primitive operations across database paradigms. We find there are a small number of base operations that can be used and combined to support a large number of database paradigms. 2) An evaluation of the current state of protected search systems in implementing these base operations. This evaluation describes the main approaches and tradeoffs for each base operation. Furthermore, it puts protected search in the context of unprotected search, identifying key gaps in functionality. 3) An analysis of attacks against protected search for different base queries. 4) A roadmap and tools for transforming a protected search system into a protected database, including an open-source performance evaluation platform and initial user opinions of protected search.Comment: 20 pages, to appear to IEEE Security and Privac

    Targeted Greybox Fuzzing with Static Lookahead Analysis

    Full text link
    Automatic test generation typically aims to generate inputs that explore new paths in the program under test in order to find bugs. Existing work has, therefore, focused on guiding the exploration toward program parts that are more likely to contain bugs by using an offline static analysis. In this paper, we introduce a novel technique for targeted greybox fuzzing using an online static analysis that guides the fuzzer toward a set of target locations, for instance, located in recently modified parts of the program. This is achieved by first semantically analyzing each program path that is explored by an input in the fuzzer's test suite. The results of this analysis are then used to control the fuzzer's specialized power schedule, which determines how often to fuzz inputs from the test suite. We implemented our technique by extending a state-of-the-art, industrial fuzzer for Ethereum smart contracts and evaluate its effectiveness on 27 real-world benchmarks. Using an online analysis is particularly suitable for the domain of smart contracts since it does not require any code instrumentation---instrumentation to contracts changes their semantics. Our experiments show that targeted fuzzing significantly outperforms standard greybox fuzzing for reaching 83% of the challenging target locations (up to 14x of median speed-up)

    An Optimal Algorithm for Sliding Window Order Statistics

    Get PDF
    Assume there is a data stream of elements and a window of size m. Sliding window algorithms compute various statistic functions over the last m elements of the data stream seen so far. The time complexity of a sliding window algorithm is measured as the time required to output an updated statistic function value every time a new element is read. For example, it is well known that computing the sliding window maximum/minimum has time complexity O(1) while computing the sliding window median has time complexity O(log m). In this paper we close the gap between these two cases by (1) presenting an algorithm for computing the sliding window k-th smallest element in O(log k) time and (2) prove that this time complexity is optimal

    Structator: fast index-based search for RNA sequence-structure patterns

    Get PDF
    Background The secondary structure of RNA molecules is intimately related to their function and often more conserved than the sequence. Hence, the important task of searching databases for RNAs requires to match sequence-structure patterns. Unfortunately, current tools for this task have, in the best case, a running time that is only linear in the size of sequence databases. Furthermore, established index data structures for fast sequence matching, like suffix trees or arrays, cannot benefit from the complementarity constraints introduced by the secondary structure of RNAs. Results We present a novel method and readily applicable software for time efficient matching of RNA sequence-structure patterns in sequence databases. Our approach is based on affix arrays, a recently introduced index data structure, preprocessed from the target database. Affix arrays support bidirectional pattern search, which is required for efficiently handling the structural constraints of the pattern. Structural patterns like stem-loops can be matched inside out, such that the loop region is matched first and then the pairing bases on the boundaries are matched consecutively. This allows to exploit base pairing information for search space reduction and leads to an expected running time that is sublinear in the size of the sequence database. The incorporation of a new chaining approach in the search of RNA sequence-structure patterns enables the description of molecules folding into complex secondary structures with multiple ordered patterns. The chaining approach removes spurious matches from the set of intermediate results, in particular of patterns with little specificity. In benchmark experiments on the Rfam database, our method runs up to two orders of magnitude faster than previous methods. Conclusions The presented method's sublinear expected running time makes it well suited for RNA sequence-structure pattern matching in large sequence databases. RNA molecules containing several stem-loop substructures can be described by multiple sequence-structure patterns and their matches are efficiently handled by a novel chaining method. Beyond our algorithmic contributions, we provide with Structator a complete and robust open-source software solution for index-based search of RNA sequence-structure patterns. The Structator software is available at http://www.zbh.uni-hamburg.de/Structator webcite.Deutsche Forschungsgemeinschaft (grant WI 3628/1-1

    The SBC-Tree: An Index for Run-Length Compressed Sequences

    Get PDF
    Run-Length-Encoding (RLE) is a data compression technique that is used in various applications, e.g., biological sequence databases. multimedia: and facsimile transmission. One of the main challenges is how to operate, e.g., indexing: searching, and retriexral: on the compressed data without decompressing it. In t.his paper, we present the String &tree for _Compressed sequences; termed the SBC-tree, for indexing and searching RLE-compressed sequences of arbitrary length. The SBC-tree is a two-level index structure based on the well-knoxvn String B-tree and a 3-sided range query structure. The SBC-tree supports substring as \\re11 as prefix m,atching, and range search operations over RLE-compressed sequences. The SBC-tree has an optimal external-memory space complexity of O(N/B) pages, where N is the total length of the compressed sequences, and B is the disk page size. The insertion and deletion of all suffixes of a compressed sequence of length m taltes O(m logB(N + m)) I/O operations. Substring match,ing, pre,fix matching, and range search execute in an optimal O(log, N + F) I/O operations, where Ip is the length of the compressed query pattern and T is the query output size. Re present also two variants of the SBC-tree: the SBC-tree that is based on an R-tree instead of the 3-sided structure: and the one-level SBC-tree that does not use a two-dimensional index. These variants do not have provable worstcase theoret.ica1 bounds for search operations, but perform well in practice. The SBC-tree index is realized inside PostgreSQL in t,he context of a biological protein database application. Performance results illustrate that using the SBC-tree to index RLE-compressed sequences achieves up to an order of magnitude reduction in storage, up to 30 % reduction in 110s for the insertion operations, and retains the optimal search performance achieved by the St,ring B-tree over the uncompressed sequences.!I c 0,
    corecore