Optimal Assembly for High Throughput Shotgun Sequencing
We present a framework for the design of optimal assembly algorithms for
shotgun sequencing under the criterion of complete reconstruction. We derive a
lower bound on the read length and the coverage depth required for
reconstruction in terms of the repeat statistics of the genome. Building on
earlier works, we design a de Bruijn graph based assembly algorithm which can
achieve very close to the lower bound for repeat statistics of a wide range of
sequenced genomes, including the GAGE datasets. The results are based on a set
of necessary and sufficient conditions on the DNA sequence and the reads for
reconstruction. The conditions can be viewed as the shotgun sequencing analogue
of Ukkonen-Pevzner's necessary and sufficient conditions for Sequencing by
Hybridization.
Comment: 26 pages, 18 figures
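As a rough illustration of the de Bruijn graph construction that underlies such assemblers (a minimal sketch, not the paper's algorithm; the reads and the k-mer size are hypothetical):

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the list of (k-1)-mers that follow it in some read."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

# Hypothetical toy reads; real assemblers choose k based on read length and coverage depth.
reads = ["ACGTAC", "CGTACG", "GTACGT"]
print(dict(de_bruijn_graph(reads, k=4)))
```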
Reverse-Safe Data Structures for Text Indexing
We introduce the notion of reverse-safe data structures. These are data structures that prevent the reconstruction of the data they encode (i.e., they cannot be easily reversed). A data structure D is called z-reverse-safe when there exist at least z datasets with the same set of answers as the ones stored by D. The main challenge is to ensure that D stores as many answers to useful queries as possible, is constructed efficiently, and has size close to the size of the original dataset it encodes. Given a text of length n and an integer z, we propose an algorithm which constructs a z-reverse-safe data structure that has size O(n) and answers pattern matching queries of length at most d optimally, where d is maximal for any such z-reverse-safe data structure. The construction algorithm takes O(n^ω log d) time, where ω is the matrix multiplication exponent. We show that, despite the n^ω factor, our engineered implementation takes only a few minutes to finish for million-letter texts. We further show that plugging our method into data analysis applications gives insignificant or no data utility loss. Finally, we show how our technique can be extended to support applications under a realistic adversary model.
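A brute-force toy check of the z-reverse-safety notion (an illustrative sketch only; the paper's construction is far more efficient, and the alphabet, text, and helper names here are hypothetical):

```python
from itertools import product

def answers(text, d):
    """All substrings of length at most d occurring in text (the positive query answers)."""
    return {text[i:i + l] for l in range(1, d + 1) for i in range(len(text) - l + 1)}

def is_z_reverse_safe(stored, n, d, z, alphabet="AB"):
    """Brute force: count the length-n texts whose answer set equals the stored one."""
    consistent = sum(1 for t in product(alphabet, repeat=n)
                     if answers("".join(t), d) == stored)
    return consistent >= z

stored = answers("ABAB", d=2)
print(is_z_reverse_safe(stored, n=4, d=2, z=2))  # True: "ABAB" and "BABA" give the same answers
```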
Safe and complete contig assembly via omnitigs
Contig assembly is the first stage that most assemblers solve when
reconstructing a genome from a set of reads. Its output consists of contigs --
a set of strings that are promised to appear in any genome that could have
generated the reads. Since the introduction of contigs 20 years ago, assemblers
have tried to obtain longer and longer contigs, but the following question was
never solved: given a genome graph (e.g. a de Bruijn graph or a string graph),
what are all the strings that can be safely reported from it as contigs? In
this paper we finally answer this question, and also give a polynomial time
algorithm to find them. Our experiments show that these strings, which we call
omnitigs, are 66% to 82% longer on average than the popular unitigs, and 29% of
dbSNP locations have more neighbors in omnitigs than in unitigs.
Comment: Full version of the paper in the proceedings of RECOMB 201
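For contrast with omnitigs, here is a minimal sketch of extracting unitigs (maximal non-branching paths) from an assembly graph; the graph representation and function names are hypothetical and this is not the paper's implementation:

```python
def unitigs(graph):
    """Maximal non-branching paths in a directed graph given as {node: [successors]}."""
    indeg = {v: 0 for v in graph}
    for u in graph:
        for v in graph[u]:
            indeg[v] = indeg.get(v, 0) + 1

    def branching(v):
        return indeg.get(v, 0) != 1 or len(graph.get(v, [])) != 1

    paths = []
    for u in graph:
        if branching(u):
            for v in graph[u]:
                path = [u, v]
                while not branching(v):
                    v = graph[v][0]
                    path.append(v)
                paths.append(path)
    return paths

# Hypothetical toy graph: A -> B -> C, with C branching to D and E.
print(unitigs({"A": ["B"], "B": ["C"], "C": ["D", "E"], "D": [], "E": []}))
```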
General relativistic corrections and non-Gaussianity in large scale structure
General relativistic cosmology cannot be reduced to linear relativistic
perturbations superposed on an isotropic and homogeneous
(Friedmann-Robertson-Walker) background, even though such a simple scheme has
been successfully applied to analyse a large variety of phenomena (such as
Cosmic Microwave Background primary anisotropies, matter clustering on large
scales, weak gravitational lensing, etc.). The general idea of going beyond
this simple paradigm is what characterises most of the efforts made in recent
years: the study of second and higher-order cosmological perturbations
including all general relativistic contributions -- also in connection with
primordial non-Gaussianities -- the idea of defining large-scale structure
observables directly from a general relativistic perspective, the various
attempts to go beyond the Newtonian approximation in the study of non-linear
gravitational dynamics, using, e.g., Post-Newtonian treatments, are all
examples of this general trend. Here we summarise some of these directions of
investigation, with the aim of emphasising future prospects in this area of
cosmology, both from a theoretical and observational point of view.
Comment: 20 pages. A review article submitted to CQG focus issue "Relativistic
Effects in Cosmology". Typos corrected; Refs. added
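As a schematic reminder of the perturbative scheme the abstract refers to (standard notation, not a formula taken from the article itself), the metric is expanded about the FRW background order by order:

```latex
% Schematic expansion of the metric about an FRW background (standard convention;
% not an equation from the article itself).
g_{\mu\nu} = \bar{g}^{\mathrm{FRW}}_{\mu\nu}
  + \delta g^{(1)}_{\mu\nu}
  + \tfrac{1}{2}\,\delta g^{(2)}_{\mu\nu} + \dots
```

Linear theory keeps only the first-order term; the directions surveyed here concern the second- and higher-order pieces and fully relativistic definitions of observables.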
Probabilistic and Distributed Control of a Large-Scale Swarm of Autonomous Agents
We present a novel method for guiding a large-scale swarm of autonomous
agents into a desired formation shape in a distributed and scalable manner. Our
Probabilistic Swarm Guidance using Inhomogeneous Markov Chains (PSG-IMC)
algorithm adopts an Eulerian framework, where the physical space is partitioned
into bins and the swarm's density distribution over each bin is controlled.
Each agent determines its bin transition probabilities using a
time-inhomogeneous Markov chain. These time-varying Markov matrices are
constructed by each agent in real-time using the feedback from the current
swarm distribution, which is estimated in a distributed manner. The PSG-IMC
algorithm minimizes, at each time instant, the expected cost of the transitions
required to achieve and maintain the desired formation shape, even when agents
are added to or removed from the swarm. The algorithm scales well with a large
number of agents and complex formation shapes, and can also be adapted for area
exploration applications. We demonstrate the effectiveness of this proposed
swarm guidance algorithm by using results of numerical simulations and hardware
experiments with multiple quadrotors.
Comment: Submitted to IEEE Transactions on Robotics
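A minimal sketch of the Eulerian, Markov-chain view of swarm guidance (the bin count and the fixed transition matrix below are toy choices of mine; PSG-IMC instead constructs a time-varying matrix from distributed feedback on the swarm distribution):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical: 3 bins and a fixed row-stochastic transition matrix.
P = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.7, 0.1],
              [0.1, 0.2, 0.7]])

agent_bins = rng.integers(0, 3, size=100)                      # current bin of each of 100 agents
density = np.bincount(agent_bins, minlength=3) / agent_bins.size

# Each agent independently samples its next bin from the row of P for its current bin.
agent_bins = np.array([rng.choice(3, p=P[b]) for b in agent_bins])
print("density before/after one step:",
      density, np.bincount(agent_bins, minlength=3) / agent_bins.size)
```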
Performance Improvement of Distributed Computing Framework and Scientific Big Data Analysis
Analysis of Big data to gain better insights has been the focus of researchers in the recent past. Traditional desktop computers or database management systems may not be suitable for efficient and timely analysis, due to the requirement of massive parallel processing. Distributed computing frameworks are being explored as a viable solution. For example, Google proposed MapReduce, which is becoming a de facto computing architecture for Big data solutions. However, scheduling in MapReduce is coarse grained and remains a challenge for improvement. With the MapReduce scheduler configured over distributed clusters, we identify two issues: data locality disruption and random assignment of non-local map tasks. We propose a network-aware scheduler to extend the existing rack awareness. The tasks are scheduled in the order of node, rack and any other rack within the same cluster to achieve cluster-level data locality. The issue of random assignment of non-local map tasks is handled by enhancing the scheduler to consider network parameters, such as delay, bandwidth and packet loss between remote clusters. As part of Big data analysis in computational biology, we consider two major data-intensive applications: indexing genome sequences and de novo assembly. Both of these applications deal with the massive amount of data generated by DNA sequencers. We developed a scalable algorithm to construct sub-trees of a suffix tree in parallel to address the huge memory requirements of indexing the human genome. For de novo assembly, we propose the Parallel Giraph based Assembler (PGA) to address the challenges associated with the assembly of large genomes over commodity hardware. PGA uses the de Bruijn graph to represent the data generated by sequencers. Huge memory demands and performance expectations are addressed by developing parallel algorithms based on the distributed graph-processing framework, Apache Giraph.
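A toy illustration of the locality ordering described above (node, then rack, then any other rack in the same cluster, then a remote cluster); the data structures, host names, and scoring are hypothetical and not the proposed scheduler:

```python
def locality_rank(task_host, task_rack, task_cluster, slot_host, slot_rack, slot_cluster):
    """Lower rank = preferred placement for a map task relative to where its data lives."""
    if slot_host == task_host:
        return 0        # node-local
    if slot_rack == task_rack:
        return 1        # rack-local
    if slot_cluster == task_cluster:
        return 2        # cluster-local (any other rack in the same cluster)
    return 3            # remote cluster: would also weigh delay, bandwidth, packet loss

# Hypothetical free slots; pick the best one for a task whose data lives on
# host "h1", rack "r1", cluster "c1".
slots = [("h9", "r1", "c1"), ("h3", "r2", "c1"), ("h7", "r5", "c2")]
best = min(slots, key=lambda s: locality_rank("h1", "r1", "c1", *s))
print(best)  # ("h9", "r1", "c1"): rack-local beats cluster-local and remote
```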
Cautionary Tales of Inapproximability
Modeling biology as classical problems in computer science allows researchers to leverage the wealth of theoretical advancements in this field. Despite countless studies presenting heuristics that report improvement on specific benchmarking data, there has been comparatively little focus on exploring the theoretical bounds on the performance of practical (polynomial-time) algorithms. Conversely, theoretical studies tend to overstate the generalizability of their conclusions to physical biological processes. In this article we provide a fresh perspective on the concepts of NP-hardness and inapproximability in the computational biology domain, using popular sequence assembly and alignment (mapping) algorithms as illustrative examples. These algorithms exemplify how computer science theory can both (a) lead to substantial improvement in practical performance and (b) highlight areas ripe for future innovation. Importantly, we discuss caveats that seemingly allow the performance of heuristics to exceed their provable bounds.