
    Analysis of Min-Hashing for Variant Tolerant DNA Read Mapping

    DNA read mapping has become a ubiquitous task in bioinformatics. New technologies provide ever longer DNA reads (several thousand base pairs), although at comparatively high error rates (up to 15%), and the reference genome is increasingly not considered as a simple string over ACGT anymore, but as a complex object containing known genetic variants in the population. Conventional indexes based on exact seed matches, in particular the suffix array based FM index, struggle with these changing conditions, so other methods are being considered; one such alternative is locality sensitive hashing. Here we examine the question of whether including single nucleotide polymorphisms (SNPs) in a min-hashing index is beneficial. The answer depends on the population frequency of the SNP, and we analyze several models (from simple to complex) that provide precise answers to this question under various assumptions. Our results also provide sensitivity and specificity values for min-hashing based read mappers and may be used to understand dependencies between the parameters of such methods. We hope that this article will provide a theoretical foundation for a new generation of read mappers.
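
    As a toy illustration of the min-hashing idea analyzed in this work (a sketch with made-up parameters, not the authors' implementation), each sequence can be reduced to its smallest k-mer hash values; two sequences sharing many k-mers then share sketch values with high probability:

```python
# Toy bottom-s MinHash sketch over k-mers; an illustrative sketch only,
# with hypothetical parameters (k, s), not the paper's implementation.
import hashlib

def kmers(seq, k):
    """Yield all k-mers of a sequence."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

def minhash_sketch(seq, k, s):
    """Return the s smallest k-mer hash values of seq (bottom-s sketch)."""
    hashes = {int(hashlib.md5(km.encode()).hexdigest(), 16)
              for km in kmers(seq, k)}
    return set(sorted(hashes)[:s])

def sketch_overlap(a, b):
    """Crude resemblance estimate: Jaccard similarity of two sketches."""
    return len(a & b) / len(a | b)

ref = "ACGTACGTGGCTAGCTAGGATCCGATCGATCGTACGATCG"
read = ref[5:35]  # an error-free read drawn from the reference
print(sketch_overlap(minhash_sketch(ref, k=8, s=6),
                     minhash_sketch(read, k=8, s=6)))
```

    Including a SNP would, in this picture, amount to indexing k-mers from both alleles; the paper analyzes when that is beneficial rather than a source of noise.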

    Canonical, Stable, General Mapping using Context Schemes

    Motivation: Sequence mapping is the cornerstone of modern genomics. However, most existing sequence mapping algorithms are insufficiently general. Results: We introduce context schemes: a method that allows the unambiguous recognition of a reference base in a query sequence by testing the query for substrings from an algorithmically defined set. Context schemes only map when there is a unique best mapping, and define this criterion uniformly for all reference bases. Mappings under context schemes can also be made stable, so that extension of the query string (e.g. by increasing read length) will not alter the mapping of previously mapped positions. Context schemes are general in several senses. They natively support the detection of arbitrarily complex, novel rearrangements relative to the reference. They can scale over orders of magnitude in query sequence length. Finally, they are trivially extensible to more complex reference structures, such as graphs, that incorporate additional variation. We demonstrate empirically the existence of high performance context schemes, and present efficient context scheme mapping algorithms. Availability and Implementation: The software test framework created for this work is available from https://registry.hub.docker.com/u/adamnovak/sequence-graphs/. Contact: [email protected] Supplementary Information: Six supplementary figures and one supplementary section are available with the online version of this article.
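
    The unique-mapping criterion can be made concrete with a small sketch (a simplified toy, not the paper's context schemes): a query base is mapped only if some context substring around it occurs exactly once in the reference, so the assignment is unambiguous by construction.

```python
# Toy unique-context mapping: a query base maps only when a surrounding
# substring occurs exactly once in the reference. Illustrative sketch only;
# the radius parameter and sequences here are made up.

def occurrences(text, pattern):
    """Return all start positions of pattern in text (overlaps allowed)."""
    return [i for i in range(len(text) - len(pattern) + 1)
            if text[i:i + len(pattern)] == pattern]

def map_base(reference, query, pos, radius=5):
    """Map query[pos] to a reference position if its context is unique."""
    lo, hi = max(0, pos - radius), min(len(query), pos + radius + 1)
    context = query[lo:hi]
    hits = occurrences(reference, context)
    if len(hits) == 1:                  # unique best mapping only
        return hits[0] + (pos - lo)     # reference position of query[pos]
    return None                         # ambiguous or absent: leave unmapped

reference = "ACGTTGACCAGTTGCAACGTTGACA"
query = "GACCAGTTGCA"
print([map_base(reference, query, p) for p in range(len(query))])
```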

    mrsFAST-Ultra: a compact, SNP-aware mapper for high performance sequencing applications

    High throughput sequencing (HTS) platforms generate unprecedented amounts of data that introduce challenges for processing and downstream analysis. While tools that report the 'best' mapping location of each read provide a fast way to process HTS data, they are not suitable for many types of downstream analysis, such as structural variation detection, where it is important to report multiple mapping loci for each read. For this purpose we introduce mrsFAST-Ultra, a fast, cache oblivious, SNP-aware aligner that can handle the multi-mapping of HTS reads very efficiently. mrsFAST-Ultra improves mrsFAST, our first cache oblivious read aligner capable of handling multi-mapping reads, through new and compact index structures that reduce not only the overall memory usage but also the number of CPU operations per alignment. In fact, the size of the index generated by mrsFAST-Ultra is 10 times smaller than that of mrsFAST. Just as importantly, mrsFAST-Ultra introduces new features such as being able to (i) obtain the best mapping loci for each read, and (ii) return all reads that have at most n mapping loci (within an error threshold), together with these loci, for any user-specified n. Furthermore, mrsFAST-Ultra is SNP-aware, i.e. it can map reads to the reference genome while discounting the mismatches that occur at common SNP locations provided by dbSNP; this significantly increases the number of reads that can be mapped to the reference genome. Note that all of the above features are implemented within the index structure rather than as post-processing steps, and are thus performed highly efficiently. Finally, mrsFAST-Ultra utilizes multiple available cores and processors and can be tuned for various memory settings. Our results show that mrsFAST-Ultra is roughly five times faster than its predecessor mrsFAST. In comparison to newly enhanced popular tools such as Bowtie2, it is more sensitive (it can report 10 times or more mappings per read) and much faster (six times or more) in the multi-mapping mode. Furthermore, mrsFAST-Ultra has an index size of 2 GB for the entire human reference genome, which is roughly half of that of Bowtie2. mrsFAST-Ultra is open source and can be accessed at http://mrsfast.sourceforge.net.
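
    The SNP-aware discounting can be sketched in a few lines (a simplified illustration with made-up data; mrsFAST-Ultra implements this inside its index structure rather than as a post-alignment check):

```python
# Toy SNP-aware verification: mismatches at known common SNP sites are
# discounted. A simplified illustration; sequences and sites are made up.

def snp_aware_mismatches(read, reference, pos, snp_sites):
    """Count mismatches of read against reference[pos:], ignoring SNP sites."""
    count = 0
    for i, base in enumerate(read):
        ref_pos = pos + i
        if reference[ref_pos] != base and ref_pos not in snp_sites:
            count += 1
    return count

reference = "ACGTACGTACGT"
snp_sites = {3, 7}                    # e.g. common SNP loci from dbSNP
read = "ACGAACGC"                     # differs at reference positions 3 and 7
print(snp_aware_mismatches(read, reference, 0, snp_sites))   # -> 0
print(snp_aware_mismatches(read, reference, 0, set()))       # -> 2
```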

    Read alignment using deep neural networks

    Read alignment is the process of mapping short DNA sequences onto a reference genome. With the advent of successive generations of "next generation" sequencing technologies, the need for sequence alignment tools grew. Many scientific communities and the companies marketing the sequencing technologies developed a whole spectrum of read aligners/mappers for different error profiles and read length characteristics. Among the most recently and successfully marketed sequencing technologies are Oxford Nanopore and PacBio SMRT sequencing, which are considered top players because of their extremely long reads and low cost. However, the reads may contain error rates of up to 20%, and the errors are generally not uniformly distributed. To deal with that level of error rate and read length, proximity-preserving hashing techniques, such as Minhash and minimizers, were utilized to quickly map a read to the target region of the reference sequence. Subsequently, a variant of global or local alignment dynamic programming is used to produce the final alignment. In this research work, we train a Deep Neural Network (DNN) to yield a hashing scheme for the highly erroneous long reads, which is deemed superior to Minhash for mapping the reads. We implemented that idea to build a read alignment tool: DNNAligner. We evaluated the performance of our aligner against the read aligners currently popular in the bioinformatics community: minimap2, bwa-mem and graphmap. Our results show that the performance of DNNAligner is comparable to other tools without any code optimization or integration of other advanced features. Moreover, the DNN exhibits superior performance in comparison with Minhash on neighborhood classification.
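
    For background, the minimizer technique mentioned above can be sketched as follows (an illustrative toy of the classical scheme, not DNNAligner's learned hashing): each window of w consecutive k-mers keeps its smallest k-mer, so nearby sequences select overlapping seed sets even in the presence of some errors.

```python
# Classical (w,k)-minimizer selection: each window of w consecutive k-mers
# keeps its smallest k-mer by hash. An illustrative toy, not DNNAligner.
# Note: Python's str hash is salted per process; consistent within one run.

def minimizers(seq, k=5, w=4):
    """Return the set of (position, k-mer) minimizers of seq."""
    kms = [(hash(seq[i:i + k]), i, seq[i:i + k])
           for i in range(len(seq) - k + 1)]
    chosen = set()
    for start in range(len(kms) - w + 1):
        _, i, km = min(kms[start:start + w])  # smallest hash in the window
        chosen.add((i, km))
    return chosen

read = "ACGTTGACCAGTTGCAACGTTGACA"
for pos, km in sorted(minimizers(read)):
    print(pos, km)
```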

    Fast Lightweight Accurate Xenograft Sorting

    Motivation: With an increasing number of patient-derived xenograft (PDX) models being created and subsequently sequenced to study tumor heterogeneity and to guide therapy decisions, there is a similarly increasing need for methods to separate reads originating from the graft (human) tumor and reads originating from the host species' (mouse) surrounding tissue. Two kinds of methods are in use: On the one hand, alignment-based tools require that reads are mapped and aligned (by an external mapper/aligner) to the host and graft genomes separately first; the tool itself then processes the resulting alignments and quality metrics (typically BAM files) to assign each read or read pair. On the other hand, alignment-free tools work directly on the raw read data (typically FASTQ files). Recent studies compare different approaches and tools, with varying results. Results: We show that alignment-free methods for xenograft sorting are superior in CPU time usage and equivalent in accuracy. We improve upon the state of the art by presenting a fast lightweight approach based on three-way bucketed quotiented Cuckoo hashing. Our hash table requires memory comparable to an FM index typically used for read alignment, and less than other alignment-free approaches. It allows extremely fast lookups and uses less CPU time than other alignment-free methods and alignment-based methods at similar accuracy.
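
    The alignment-free sorting principle can be sketched with a toy classifier (made-up stand-in sequences; a plain Python set stands in for the paper's three-way bucketed quotiented Cuckoo hash table): each read is assigned by which genome's k-mer set it hits more often.

```python
# Toy alignment-free xenograft sorting: classify a read by which genome's
# k-mer set it hits more often. Sequences are hypothetical stand-ins.

def kmer_set(genome, k):
    return {genome[i:i + k] for i in range(len(genome) - k + 1)}

def classify_read(read, host_kmers, graft_kmers, k):
    """Return 'host', 'graft', 'both', or 'neither' by k-mer votes."""
    host_hits = graft_hits = 0
    for i in range(len(read) - k + 1):
        km = read[i:i + k]
        host_hits += km in host_kmers
        graft_hits += km in graft_kmers
    if host_hits > graft_hits:
        return "host"
    if graft_hits > host_hits:
        return "graft"
    return "both" if host_hits else "neither"

k = 6
host = "ACGTACGTGGCCAATTGGCCAACG"     # stand-in mouse sequence
graft = "TTGACAGATCCGATAGGCTTAACG"    # stand-in human sequence
print(classify_read("GATCCGATAGG", kmer_set(host, k), kmer_set(graft, k), k))
```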

    Locality-Preserving Hashing for Shifts with Connections to Cryptography

    Can we sense our location in an unfamiliar environment by taking a sublinear-size sample of our surroundings? Can we efficiently encrypt a message that only someone physically close to us can decrypt? To solve these kinds of problems, we introduce and study a new type of hash function for finding shifts in sublinear time. A function $h:\{0,1\}^n \to \mathbb{Z}_n$ is a $(d,\delta)$ locality-preserving hash function for shifts (LPHS) if: (1) $h$ can be computed by (adaptively) querying $d$ bits of its input, and (2) $\Pr[h(x) \neq h(x \ll 1) + 1] \leq \delta$, where $x$ is random and $\ll 1$ denotes a cyclic shift by one bit to the left. We make the following contributions.
    * Near-optimal LPHS via Distributed Discrete Log: We establish a general two-way connection between LPHS and algorithms for distributed discrete logarithm in the generic group model. Using such an algorithm of Dinur et al. (Crypto 2018), we get LPHS with near-optimal error of $\delta = \tilde{O}(1/d^2)$. This gives an unusual example of the usefulness of group-based cryptography in a post-quantum world. We extend the positive result to non-cyclic and worst-case variants of LPHS.
    * Multidimensional LPHS: We obtain positive and negative results for a multidimensional extension of LPHS, making progress towards an optimal 2-dimensional LPHS.
    * Applications: We demonstrate the usefulness of LPHS by presenting cryptographic and algorithmic applications. In particular, we apply multidimensional LPHS to obtain an efficient "packed" implementation of homomorphic secret sharing and a sublinear-time implementation of location-sensitive encryption whose decryption requires a significantly overlapping view.
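
    The defining property can be checked empirically with a toy instance: taking $h(x)$ to be the start of the lexicographically minimal rotation of $x$ satisfies $h(x) = h(x \ll 1) + 1 \pmod{n}$ whenever the minimal rotation is unique, though this toy $h$ reads all $n$ bits ($d = n$) rather than the sublinear $d$ the paper targets.

```python
# Toy empirical check of the LPHS property Pr[h(x) != h(x << 1) + 1] <= delta,
# with h(x) = start index of the lexicographically minimal rotation of x.
# This h queries all n bits (d = n); the paper's goal is sublinear d.
# Failures can occur only on (rare) strings with periodic structure.
import random

def h(x):
    """Start index of the lexicographically minimal rotation of bit string x."""
    n = len(x)
    return min(range(n), key=lambda i: x[i:] + x[:i])

def shift_left(x):
    """Cyclic shift by one bit to the left."""
    return x[1:] + x[0]

n, trials, failures = 32, 10_000, 0
for _ in range(trials):
    x = "".join(random.choice("01") for _ in range(n))
    if h(x) != (h(shift_left(x)) + 1) % n:
        failures += 1
print(f"empirical delta: {failures / trials}")
```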

    Fast and Sensitive Genome-Hashing Software and its Application in Using NGS as a Detection Agent for Bacterial Presence in Oral Metagenomic Samples

    Next generation sequencing has increased the throughput of sequenced DNA into the range of billions of nucleotides sequenced per day. With the increased speed of DNA sequencing and the short length of reads produced by next generation sequencers, a significant challenge has been created in quickly and accurately assembling the hundreds of millions of short reads created by modern sequencing instruments into their full genomic sequences. With the increase in throughput in next generation sequencing and the decrease in time and cost to perform DNA sequencing, novel applications for DNA sequencing are being considered. Among them is a methodology by which DNA sequencing can be used as a diagnostic or detection tool for bacterial infection or presence. Here, the implementation, characteristics, and deployment of a novel genome-hashing alignment algorithm for quickly performing reference-based alignment is described. This algorithm, SRmapper, is shown to be two-fold to eight-fold faster than a current and popular alignment algorithm, BWA, while retaining a similar fraction of reads aligned to the human reference genome. SRmapper demonstrates a capability to align approximately 150 billion nucleotides per processor day on an Intel Xeon 2.8 GHz processor to the human genome while using approximately 2.5 GB of RAM. SRmapper is demonstrated to be able to perform both single-end and paired-end alignment, and tolerates a higher number of discrepancies between reads and the reference sequence than BWA. Using SRmapper as an alignment tool, a method to detect Mycobacterium tuberculosis (TB) in metagenomic samples containing many different bacteria is described. This method utilizes the construction of a novel uniqueness genome for TB containing only the regions of the TB genome not similar to any other bacterial species in the oral metagenome. Alignment of simulated and real metagenomic samples demonstrates the effectiveness of the uniqueness genome in the detection of TB and discovers TB contamination in samples from the 1000 Genomes Project. Finally, the uniqueness genome methodology is expanded to all genomes within the oral metagenome, and preliminary evidence is provided demonstrating that next generation sequencing can detect the presence of multiple bacterial species simultaneously via alignment using SRmapper.
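
    The uniqueness-genome construction can be sketched as a k-mer filter (an illustrative toy with stand-in sequences, not SRmapper's actual pipeline): keep only maximal regions of the target genome whose k-mers never occur in any background genome.

```python
# Toy "uniqueness genome": retain target-genome regions whose k-mers do not
# occur in any background genome. Sequences here are hypothetical stand-ins.

def kmer_set(genome, k):
    return {genome[i:i + k] for i in range(len(genome) - k + 1)}

def uniqueness_regions(target, background_genomes, k=8):
    """Yield maximal regions of target built from k-mers unique to target."""
    background = set().union(*(kmer_set(g, k) for g in background_genomes))
    unique = [i for i in range(len(target) - k + 1)
              if target[i:i + k] not in background]
    run = []
    for i in unique:
        if run and i != run[-1] + 1:   # gap: close the current run
            yield target[run[0]:run[-1] + k]
            run = []
        run.append(i)
    if run:
        yield target[run[0]:run[-1] + k]

tb = "ACGTTGACCAGTTGCAACGTTGACAGGT"           # stand-in TB sequence
others = ["ACGTTGACCAGTTACA", "GGCCAATTGGCCAACG"]
print(list(uniqueness_regions(tb, others)))
```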