6 research outputs found

    Analysis of Min-Hashing for Variant Tolerant DNA Read Mapping

    DNA read mapping has become a ubiquitous task in bioinformatics. New technologies provide ever longer DNA reads (several thousand base pairs), although at comparatively high error rates (up to 15%), and the reference genome is increasingly treated not as a simple string over ACGT but as a complex object containing the known genetic variants of a population. Conventional indexes based on exact seed matches, in particular the suffix-array-based FM index, struggle with these changing conditions, so other methods are being considered; one such alternative is locality-sensitive hashing. Here we examine whether including single nucleotide polymorphisms (SNPs) in a min-hashing index is beneficial. The answer depends on the population frequency of the SNP, and we analyze several models (from simple to complex) that provide precise answers under various assumptions. Our results also provide sensitivity and specificity values for min-hashing-based read mappers and may be used to understand dependencies between the parameters of such methods. We hope that this article will provide a theoretical foundation for a new generation of read mappers.
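
    As a rough illustration of the underlying technique (a minimal sketch, not the paper's model or implementation; the hash construction and all parameters below are assumptions), min-hashing summarizes the k-mer set of a sequence by a small sketch whose slot-wise agreement estimates the Jaccard similarity of the k-mer sets, so a read and a reference window differing by one SNP still produce similar sketches:

```python
import hashlib

def kmers(seq, k=16):
    """Yield all overlapping k-mers of a DNA sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def h(salt, kmer):
    """One of several hash functions, simulated by salting a single hash."""
    data = (str(salt) + kmer).encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def minhash_sketch(seq, k=16, num_hashes=8):
    """For each hash function, keep the minimum hash value over all k-mers."""
    return [min(h(salt, km) for km in kmers(seq, k)) for salt in range(num_hashes)]

def sketch_similarity(a, b):
    """Fraction of equal sketch slots: estimates the Jaccard similarity
    of the two underlying k-mer sets."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

# A read and a reference window differing by one substitution still share
# most of their k-mers, so their sketches remain similar.
read = "ACGTTTGACCAGTTCAGGAGCCTA"
window = "ACGTTTGACCAGTTCAGGAGCGTA"   # one SNP near the end
print(sketch_similarity(minhash_sketch(read), minhash_sketch(window)))
```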

    MaskAl: Privacy Preserving Masked Reads Alignment using Intel SGX

    The recent introduction of new DNA sequencing techniques has caused the amount of processed and stored biological data to skyrocket. In order to process these vast amounts of data, bio-centers have been tempted to use low-cost public clouds. However, genomes are privacy sensitive, since they store personal information about their donors, such as their identity, disease risks, heredity, and ethnic origin. The first critical DNA processing step that can be executed in a cloud, read alignment, consists of finding the location, in the human genome, of the DNA sequences produced by a sequencing machine. While recent developments aim at increasing performance, only a few approaches address the need for fast and privacy-preserving read alignment methods. This paper introduces MaskAl, a novel approach for read alignment. MaskAl combines a fast preprocessing step on raw genomic data (filtering and masking) with established algorithms to align sanitized reads, from which sensitive parts have been masked out, and refines the alignment score using the masked-out information inside Intel's Software Guard Extensions (SGX). MaskAl is a highly competitive privacy-preserving read alignment software that can be massively parallelized on public clouds and emerging enclave clouds. Finally, MaskAl is nearly as accurate as plain-text approaches (more than 96% of reads aligned with MaskAl compared to 98% with BWA) and can process alignment workloads 87% faster than current privacy-preserving approaches while using less memory and network bandwidth.
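
    A deliberately simplified sketch of the masking idea (the k-mer dictionary, helper names, and parameters are hypothetical; the real pipeline is more sophisticated): sensitive stretches are replaced by 'N' before alignment on untrusted machines, and the hidden bases are kept back for score refinement inside the enclave:

```python
def mask_read(read, sensitive_kmers, k=8):
    """Return a sanitized read (sensitive stretches replaced by 'N') plus the
    masked-out characters, which would only be revealed inside the SGX enclave
    to refine the alignment score."""
    sensitive = [False] * len(read)
    for i in range(len(read) - k + 1):
        if read[i:i + k] in sensitive_kmers:
            for j in range(i, i + k):
                sensitive[j] = True
    masked = "".join("N" if s else c for c, s in zip(read, sensitive))
    hidden = [(i, c) for i, (c, s) in enumerate(zip(read, sensitive)) if s]
    return masked, hidden

# Hypothetical dictionary of k-mers flagged as privacy-sensitive by a filter.
sensitive_kmers = {"GATTACAG"}
masked, hidden = mask_read("TTGATTACAGCCA", sensitive_kmers)
print(masked)   # TTNNNNNNNNCCA  -- safe to align on an untrusted cloud
print(hidden)   # positions and bases kept for in-enclave refinement
```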

    Locality-Sensitive Hashing for Efficient Web Application Security Testing

    Web application security has become a major concern in recent years, as more and more content and services are made available online. A useful method for identifying security vulnerabilities is black-box testing, which relies on automated crawling of web applications. However, crawling Rich Internet Applications (RIAs) is a very challenging task. One of the key obstacles crawlers face is the state similarity problem: how to determine whether two client-side states are equivalent. As current methods do not completely solve this problem, a successful scan of many real-world RIAs is still not possible. We present a novel approach to detecting redundant content for security testing purposes. The algorithm applies locality-sensitive hashing with MinHash sketches to analyze the Document Object Model (DOM) structure of web pages and to efficiently estimate the similarity between them. Our experimental results show that this approach allows a successful scan of RIAs that cannot be crawled otherwise.
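
    To illustrate the idea (a minimal sketch under assumed details, not the paper's algorithm: shingling opening-tag sequences is one plausible way to fingerprint DOM structure), two states with the same DOM skeleton but different text content yield near-identical MinHash sketches and can be treated as redundant by the crawler:

```python
import hashlib
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Record the sequence of opening tags as a coarse DOM fingerprint."""
    def __init__(self):
        super().__init__()
        self.tags = []
    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

def dom_shingles(html, width=3):
    """Shingle the tag sequence so local DOM structure is captured."""
    p = TagCollector()
    p.feed(html)
    return {"/".join(p.tags[i:i + width]) for i in range(len(p.tags) - width + 1)}

def minhash(shingles, num_hashes=16):
    """Salting a single hash simulates a family of hash functions."""
    return [min(int.from_bytes(hashlib.blake2b((str(salt) + sh).encode(),
                                               digest_size=8).digest(), "big")
                for sh in shingles)
            for salt in range(num_hashes)]

# Identical DOM structure, different text: the sketches agree in all slots.
a = minhash(dom_shingles("<html><body><div><p>news 1</p><p>news 2</p></div></body></html>"))
b = minhash(dom_shingles("<html><body><div><p>other</p><p>text</p></div></body></html>"))
print(sum(x == y for x, y in zip(a, b)) / len(a))   # 1.0 here
```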

    Accurate filtering of privacy-sensitive information in raw genomic data

    Sequencing thousands of human genomes has enabled breakthroughs in many areas, among them precision medicine, the study of rare diseases, and forensics. However, mass collection of such sensitive data entails enormous risks if it is not protected to the highest standards. In this article, we argue that post-alignment privacy protection is not enough and that data should be protected automatically as early as possible in the genomics workflow, ideally immediately after the data is produced. We show that a previous approach for filtering short reads does not extend to long reads, and we present a novel filtering approach that classifies raw genomic data (i.e., data whose location and content are not yet determined) into privacy-sensitive (i.e., more affected by a successful privacy attack) and non-privacy-sensitive information. Such a classification allows the fine-grained and automated adjustment of protective measures to mitigate the possible consequences of exposure, in particular when relying on public clouds. We present the first filter that can be applied to reads of any length, making it usable with any recent or future sequencing technology. The filter is accurate, in the sense that it detects all known sensitive nucleotides except those located in highly variable regions (fewer than 10 nucleotides per genome remain undetected, instead of 100,000 in previous works). It produces far fewer false positives than previously known methods (10% instead of 60%) and can detect sensitive nucleotides despite sequencing errors (86% detected instead of 56%, at a 2% mutation rate). Finally, practical experiments demonstrate high performance, both in terms of throughput and memory consumption.
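
    A deliberately simplified sketch of the classification step (names and parameters are assumptions; the actual filter additionally tolerates sequencing errors and works at nucleotide granularity): because the scan is purely window-based, it applies to reads of any length.

```python
def classify_read(read, variant_kmers, k=16):
    """Flag a read as privacy-sensitive as soon as one of its k-mer windows
    hits a dictionary of k-mers built around known variant positions.
    The scan depends only on k-mer windows, not on read length."""
    for i in range(len(read) - k + 1):
        if read[i:i + k] in variant_kmers:
            return "sensitive"
    return "non-sensitive"

# variant_kmers would be precomputed from a catalog of known SNPs; a plain set
# is used here, whereas a space-efficient structure (e.g., a Bloom filter)
# would be preferable at genome scale.
variant_kmers = {"ACGTTTGACCAGTTCA"}
print(classify_read("GGACGTTTGACCAGTTCAGG", variant_kmers))  # sensitive
```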

    Enclave-Based Privacy-Preserving Alignment of Raw Genomic Information

    Recent breakthroughs in genomic sequencing have led to an enormous increase in DNA sampling rates, which in turn favors the use of clouds to efficiently process huge amounts of genomic data. However, while enabling possible achievements in personalized medicine and related areas, cloud-based processing of genomic information also entails significant privacy risks, calling for increased protection. In this paper, we focus on the first, and also most data-intensive, processing step of the genomic information processing pipeline: the alignment of raw genomic data samples (called reads) to a synthetic human reference genome. Even though privacy-preserving alignment solutions (e.g., based on homomorphic encryption) have been proposed, their slow performance encourages alternatives based on trusted execution environments, such as Intel SGX, to speed up secure alignment. Such alternatives have to deal with data structures whose size far exceeds secure enclave memory, requiring the alignment code to reach out into untrusted memory. We highlight how sensitive genomic information can be leaked when those enclave-external alignment data structures are accessed, and we suggest countermeasures to prevent privacy breaches. The overhead of these countermeasures indicates that the competitiveness of privacy-preserving enclave-based alignment has yet to be precisely evaluated.
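
    One classic countermeasure against this kind of access-pattern leakage is data-oblivious access. The toy sketch below is an assumption for illustration, not necessarily the countermeasure proposed in the paper: the whole enclave-external table is scanned so the observable pattern is independent of the secret index.

```python
def oblivious_lookup(table, secret_index):
    """Toy data-oblivious lookup: every entry of the enclave-external table is
    touched, so the observable memory access pattern does not depend on
    secret_index. Python cannot guarantee constant-time execution; real
    countermeasures use branchless native code inside the enclave or
    ORAM-style constructions."""
    result = 0
    for i, value in enumerate(table):
        take = int(i == secret_index)                 # 0 or 1
        result = take * value + (1 - take) * result   # branchless select
    return result

fm_index_bucket = [3, 1, 4, 1, 5, 9, 2, 6]   # stand-in for an alignment table
print(oblivious_lookup(fm_index_bucket, 5))  # 9, with a uniform scan pattern
```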

    Novel computational techniques for mapping and classifying Next-Generation Sequencing data

    Since their emergence around 2006, Next-Generation Sequencing (NGS) technologies have been revolutionizing biological and medical research. Quickly obtaining an extensive amount of short or long DNA reads from almost any biological sample enables detecting genomic variants, revealing the composition of species in a metagenome, deciphering cancer biology, decoding the evolution of living or extinct species, and understanding human migration patterns and human history in general. The pace at which the throughput of sequencing technologies increases surpasses the growth of storage and computing capacities, which creates new computational challenges in NGS data processing. In this thesis, we present novel computational techniques for read mapping and taxonomic classification.

    With more than a hundred published mappers, read mapping might be considered fully solved. However, the vast majority of mappers follow the same paradigm, and little attention has been paid to non-standard mapping approaches. Here, we propose so-called dynamic mapping, which we show significantly improves the resulting alignments compared to traditional mapping approaches. Dynamic mapping exploits the information from previously computed alignments to improve the mapping of subsequent reads. We provide the first comprehensive overview of this method and demonstrate its qualities using Dynamic Mapping Simulator, a pipeline that compares various dynamic mapping scenarios to static mapping and iterative referencing. An important component of a dynamic mapper is an online consensus caller, i.e., a program that collects alignment statistics and guides updates of the reference in an online fashion. We provide Ococo, the first online consensus caller; it maintains statistics for individual genomic positions using compact bit counters. Beyond its application to dynamic mapping, Ococo can be employed as an online SNP caller in various analysis pipelines, enabling SNP calling from a stream without saving the alignments to disk.

    Metagenomic classification of NGS reads is the other major topic studied in the thesis. Given a database of thousands of reference genomes placed on a taxonomic tree, the task is to rapidly assign a huge number of NGS reads to tree nodes and, possibly, to estimate the relative abundance of the species involved. We propose improved computational techniques for this task. In a series of experiments, we show that spaced seeds consistently improve classification accuracy. We provide Seed-Kraken, a spaced-seed extension of Kraken, currently the most popular classifier. Furthermore, we suggest ProPhyle, a new indexing strategy based on a BWT index, yielding a much smaller and more informative index than Kraken's. We also provide a modified version of BWA that improves the BWT index for quick k-mer look-up.
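
    As a rough illustration of the spaced-seed idea (a minimal sketch; the mask below is illustrative, not the one used by Seed-Kraken), positions marked '0' in the mask are don't-care positions, so a substitution there leaves the extracted seed unchanged, which is what makes spaced seeds more mismatch-tolerant than contiguous k-mers:

```python
def spaced_seeds(seq, mask="1101101"):
    """Extract spaced seeds: for each window of the sequence, keep only the
    characters at positions where the mask has a '1'. Don't-care positions
    ('0') tolerate mismatches."""
    care = [i for i, m in enumerate(mask) if m == "1"]
    w = len(mask)
    return ["".join(seq[i + j] for j in care) for i in range(len(seq) - w + 1)]

# The substitution at position 2 falls on a don't-care position of the first
# window, so the first spaced seed is identical for both sequences.
print(spaced_seeds("ACGTACGTAC"))
print(spaced_seeds("ACCTACGTAC"))   # differs from the first at position 2
```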