4,255 research outputs found

    Optimal Assembly for High Throughput Shotgun Sequencing

    Get PDF
    We present a framework for the design of optimal assembly algorithms for shotgun sequencing under the criterion of complete reconstruction. We derive a lower bound on the read length and the coverage depth required for reconstruction in terms of the repeat statistics of the genome. Building on earlier works, we design a de Bruijn graph based assembly algorithm which achieves performance very close to the lower bound for the repeat statistics of a wide range of sequenced genomes, including the GAGE datasets. The results are based on a set of necessary and sufficient conditions on the DNA sequence and the reads for reconstruction. The conditions can be viewed as the shotgun sequencing analogue of Ukkonen-Pevzner's necessary and sufficient conditions for Sequencing by Hybridization. Comment: 26 pages, 18 figures
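
    For orientation, the sketch below builds the kind of de Bruijn graph such an assembler works on, from a toy read set and a hypothetical k-mer size; it is only a minimal illustration and does not reproduce the paper's reconstruction conditions or its near-optimal algorithm.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges are k-mers."""
    edges = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[kmer[:-1]].append(kmer[1:])   # prefix node -> suffix node
    return edges

# Toy example: overlapping reads from the sequence "ATGGCGTGCA"
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = de_bruijn_graph(reads, k=4)
for node, successors in graph.items():
    print(node, "->", successors)
```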

    Reverse-Safe Data Structures for Text Indexing

    Get PDF
    We introduce the notion of reverse-safe data structures. These are data structures that prevent the reconstruction of the data they encode (i.e., they cannot be easily reversed). A data structure D is called z-reverse-safe when there exist at least z datasets with the same set of answers as the ones stored by D. The main challenge is to ensure that D stores as many answers to useful queries as possible, is constructed efficiently, and has size close to the size of the original dataset it encodes. Given a text of length n and an integer z, we propose an algorithm which constructs a z-reverse-safe data structure that has size O(n) and answers pattern matching queries of length at most d optimally, where d is maximal for any such z-reverse-safe data structure. The construction algorithm takes O(n^ω log d) time, where ω is the matrix multiplication exponent. We show that, despite the n^ω factor, our engineered implementation takes only a few minutes to finish for million-letter texts. We further show that plugging our method in data analysis applications gives insignificant or no data utility loss. Finally, we show how our technique can be extended to support applications under a realistic adversary model
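
    As a toy illustration of the z-reverse-safe definition only (not the paper's O(n^ω log d) construction), the brute-force sketch below counts how many equal-length texts over a small alphabet give exactly the same answers to counting pattern-matching queries of length at most d; a text is z-reverse-safe for that d precisely when this count is at least z. All names here are hypothetical.

```python
from collections import Counter
from itertools import product

def answers(text, d):
    """Counts of every pattern of length <= d occurring in text
    (the answers to counting pattern-matching queries)."""
    return Counter(text[i:i + l]
                   for l in range(1, d + 1)
                   for i in range(len(text) - l + 1))

def reverse_safety(text, d, alphabet="ab"):
    """Brute force: how many texts of the same length give identical answers.
    Exponential in len(text); only meant to illustrate the definition."""
    target = answers(text, d)
    return sum(1 for cand in product(alphabet, repeat=len(text))
               if answers("".join(cand), d) == target)

print(reverse_safety("abba", d=1))  # 6: any arrangement of two a's and two b's
print(reverse_safety("abba", d=2))  # 1: the bigram counts pin the text down uniquely
```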

    Safe and complete contig assembly via omnitigs

    Full text link
    Contig assembly is the first stage that most assemblers solve when reconstructing a genome from a set of reads. Its output consists of contigs -- a set of strings that are promised to appear in any genome that could have generated the reads. Since the introduction of contigs 20 years ago, assemblers have tried to obtain longer and longer contigs, but the following question was never solved: given a genome graph G (e.g. a de Bruijn or a string graph), what are all the strings that can be safely reported from G as contigs? In this paper we finally answer this question, and also give a polynomial-time algorithm to find them. Our experiments show that these strings, which we call omnitigs, are 66% to 82% longer on average than the popular unitigs, and 29% of dbSNP locations have more neighbors in omnitigs than in unitigs. Comment: Full version of the paper in the proceedings of RECOMB 201
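
    Computing omnitigs requires the polynomial-time algorithm from the paper; for context, the sketch below only extracts the unitig baseline they are compared against (maximal non-branching paths in a directed graph given as adjacency lists). Node labels and the toy graph are made up for illustration.

```python
def unitigs(graph):
    """Extract maximal non-branching paths (unitigs) from {node: [successors]}.
    Simplified: isolated cycles with no branching node are not reported."""
    indeg = {u: 0 for u in graph}
    for u, vs in graph.items():
        for v in vs:
            indeg[v] = indeg.get(v, 0) + 1
    out = {u: len(vs) for u, vs in graph.items()}

    def branching(u):
        return out.get(u, 0) != 1 or indeg.get(u, 0) != 1

    paths = []
    for u in graph:
        if branching(u):
            for v in graph[u]:
                path = [u, v]
                while not branching(path[-1]):      # extend through 1-in/1-out nodes
                    path.append(graph[path[-1]][0])
                paths.append(path)
    return paths

toy = {"ATG": ["TGG"], "TGG": ["GGC"], "GGC": ["GCA", "GCT"],
       "GCA": [], "GCT": []}
print(unitigs(toy))   # [['ATG', 'TGG', 'GGC'], ['GGC', 'GCA'], ['GGC', 'GCT']]
```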

    General relativistic corrections and non-Gaussianity in large scale structure

    Full text link
    General relativistic cosmology cannot be reduced to linear relativistic perturbations superposed on an isotropic and homogeneous (Friedmann-Robertson-Walker) background, even though such a simple scheme has been successfully applied to analyse a large variety of phenomena (such as Cosmic Microwave Background primary anisotropies, matter clustering on large scales, weak gravitational lensing, etc.). The general idea of going beyond this simple paradigm is what characterises most of the efforts made in recent years: the study of second- and higher-order cosmological perturbations including all general relativistic contributions (also in connection with primordial non-Gaussianities), the idea of defining large-scale structure observables directly from a general relativistic perspective, and the various attempts to go beyond the Newtonian approximation in the study of non-linear gravitational dynamics, e.g. using Post-Newtonian treatments, are all examples of this general trend. Here we summarise some of these directions of investigation, with the aim of emphasising future prospects in this area of cosmology, both from a theoretical and observational point of view. Comment: 20 pages. A review article submitted to CQG focus issue "Relativistic Effects in Cosmology". Typos corrected; Refs. added
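
    For readers unfamiliar with the perturbative setup, the schematic expansion below (written in standard notation and not taken from the article itself) indicates where the higher-order terms discussed in the review enter.

```latex
% Schematic perturbative expansion of the metric about a flat FRW background,
% in conformal time tau; standard notation, assumed here for illustration.
\begin{equation}
g_{\mu\nu} = a^2(\tau)\left[\eta_{\mu\nu}
  + \delta g^{(1)}_{\mu\nu}
  + \tfrac{1}{2}\,\delta g^{(2)}_{\mu\nu} + \dots \right].
\end{equation}
% Linear theory keeps only \delta g^{(1)}; the general relativistic corrections
% and primordial non-Gaussianity discussed in the review enter at
% \delta g^{(2)} and beyond.
```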

    Probabilistic and Distributed Control of a Large-Scale Swarm of Autonomous Agents

    Get PDF
    We present a novel method for guiding a large-scale swarm of autonomous agents into a desired formation shape in a distributed and scalable manner. Our Probabilistic Swarm Guidance using Inhomogeneous Markov Chains (PSG-IMC) algorithm adopts an Eulerian framework, where the physical space is partitioned into bins and the swarm's density distribution over each bin is controlled. Each agent determines its bin transition probabilities using a time-inhomogeneous Markov chain. These time-varying Markov matrices are constructed by each agent in real time using feedback from the current swarm distribution, which is estimated in a distributed manner. The PSG-IMC algorithm minimizes the expected transition cost per time instant required to achieve and maintain the desired formation shape, even when agents are added to or removed from the swarm. The algorithm scales well with a large number of agents and complex formation shapes, and can also be adapted for area exploration applications. We demonstrate the effectiveness of the proposed swarm guidance algorithm using results from numerical simulations and hardware experiments with multiple quadrotors. Comment: Submitted to IEEE Transactions on Robotics
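
    The sketch below is a toy stand-in for the Eulerian, Markov-chain idea described above: each row of a stochastic transition matrix is biased toward neighbouring bins that are under-populated relative to the desired distribution. It is not the PSG-IMC synthesis itself (which minimizes expected transition cost and uses distributed density estimation); the bin layout, probabilities, and function names are assumptions for illustration.

```python
import numpy as np

def toy_transition_matrix(current, desired, adjacency):
    """Row-stochastic matrix biasing moves toward under-populated neighbouring bins."""
    n = len(desired)
    P = np.zeros((n, n))
    deficit = np.maximum(desired - current, 0.0)     # how short each bin falls of its target
    for i in range(n):
        nbrs = adjacency[i]
        w = np.array([deficit[j] for j in nbrs])
        if w.sum() == 0:                 # no under-populated neighbour: stay put
            P[i, i] = 1.0
        else:
            P[i, nbrs] = 0.5 * w / w.sum()   # move with probability 0.5, split by deficit
            P[i, i] += 0.5                   # otherwise stay
    return P

# Four bins on a line; swarm concentrated in bin 0, target is uniform.
current = np.array([1.0, 0.0, 0.0, 0.0])
desired = np.full(4, 0.25)
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(np.round(toy_transition_matrix(current, desired, adjacency), 2))
```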

    Performance Improvement of Distributed Computing Framework and Scientific Big Data Analysis

    Get PDF
    Analysis of Big data to gain better insights has been a focus of researchers in the recent past. Traditional desktop computers or database management systems may not be suitable for efficient and timely analysis, due to the requirement of massive parallel processing. Distributed computing frameworks are being explored as a viable solution. For example, Google proposed MapReduce, which is becoming a de facto computing architecture for Big data solutions. However, scheduling in MapReduce is coarse grained and remains a challenge for improvement. For the MapReduce scheduler configured over distributed clusters, we identify two issues: data locality disruption and random assignment of non-local map tasks. We propose a network-aware scheduler to extend the existing rack awareness. Tasks are scheduled in the order of node, rack, and any other rack within the same cluster to achieve cluster-level data locality. The issue of random assignment of non-local map tasks is handled by enhancing the scheduler to consider network parameters, such as delay, bandwidth, and packet loss between remote clusters. As part of Big data analysis in computational biology, we consider two major data-intensive applications: indexing genome sequences and de novo assembly. Both of these applications deal with the massive amounts of data generated by DNA sequencers. We developed a scalable algorithm to construct sub-trees of a suffix tree in parallel to address the huge memory requirements of indexing the human genome. For de novo assembly, we propose the Parallel Giraph-based Assembler (PGA) to address the challenges associated with the assembly of large genomes over commodity hardware. PGA uses the de Bruijn graph to represent the data generated from sequencers. Huge memory demands and performance expectations are addressed by developing parallel algorithms based on the distributed graph-processing framework, Apache Giraph
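
    As a rough illustration of the node > rack > other-rack-in-cluster ordering and the network-parameter tie-breaking described above (function names, fields, and the penalty formula are hypothetical, not taken from the thesis):

```python
def locality_rank(data_host, candidate, topology, net):
    """Lower rank is preferred: node-local, then rack-local, then same cluster,
    then remote clusters ordered by measured delay/bandwidth/loss."""
    if candidate == data_host:
        return (0, 0.0)                                       # node-local
    if topology[candidate]["rack"] == topology[data_host]["rack"]:
        return (1, 0.0)                                       # rack-local
    if topology[candidate]["cluster"] == topology[data_host]["cluster"]:
        return (2, 0.0)                                       # same cluster, other rack
    d = net[(data_host, candidate)]
    penalty = d["delay_ms"] / d["bandwidth_mbps"] + 10 * d["loss_rate"]
    return (3, penalty)                                       # remote cluster, network cost

def pick_node(data_host, candidates, topology, net):
    return min(candidates, key=lambda c: locality_rank(data_host, c, topology, net))

topology = {"n1": {"rack": "r1", "cluster": "c1"},
            "n2": {"rack": "r1", "cluster": "c1"},
            "n3": {"rack": "r2", "cluster": "c1"},
            "n4": {"rack": "r9", "cluster": "c2"}}
net = {("n1", "n4"): {"delay_ms": 40, "bandwidth_mbps": 100, "loss_rate": 0.01}}
print(pick_node("n1", ["n2", "n3", "n4"], topology, net))     # -> "n2" (rack-local)
```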

    Cautionary Tales of Inapproximability

    Get PDF
    Modeling biology as classical problems in computer science allows researchers to leverage the wealth of theoretical advancements in this field. Despite countless studies presenting heuristics that report improvement on specific benchmarking data, there has been comparatively little focus on exploring the theoretical bounds on the performance of practical (polynomial-time) algorithms. Conversely, theoretical studies tend to overstate the generalizability of their conclusions to physical biological processes. In this article we provide a fresh perspective on the concepts of NP-hardness and inapproximability in the computational biology domain, using popular sequence assembly and alignment (mapping) algorithms as illustrative examples. These algorithms exemplify how computer science theory can both (a) lead to substantial improvement in practical performance and (b) highlight areas ripe for future innovation. Importantly, we discuss caveats that seemingly allow the performance of heuristics to exceed their provable bounds