Joint Haplotype Assembly and Genotype Calling via Sequential Monte Carlo Algorithm
Genetic variations predispose individuals to hereditary diseases, play an important role in the development of complex diseases, and impact drug metabolism. The full information about the DNA variations in the genome of an individual is given by haplotypes, the ordered lists of single nucleotide polymorphisms (SNPs) located on chromosomes. Affordable high-throughput DNA sequencing technologies enable routine acquisition of the data needed to assemble single individual haplotypes. However, state-of-the-art high-throughput sequencing platforms generate erroneous data, which induces uncertainty in the SNP and genotype calling procedures and, ultimately, adversely affects the accuracy of haplotyping. When inferring haplotype phase information, the vast majority of existing haplotype assembly techniques assume that the genotype information is correct. This motivates the development of methods capable of joint genotype calling and haplotype assembly. Results: We present a haplotype assembly algorithm, ParticleHap, that relies on a probabilistic description of the sequencing data to jointly infer genotypes and assemble the most likely haplotypes. Our method employs a deterministic sequential Monte Carlo algorithm that associates single nucleotide polymorphisms with haplotypes by exhaustively exploring all possible extensions of the partial haplotypes. The algorithm relies on genotype likelihoods rather than on often erroneously called genotypes, thus ensuring a more accurate assembly of the haplotypes. Results on both 1000 Genomes Project experimental data and simulation studies demonstrate that the proposed approach enables highly accurate solutions to the haplotype assembly problem while being computationally efficient and scalable, generally outperforming existing methods in terms of both accuracy and speed.
Conclusions: The developed probabilistic framework and sequential Monte Carlo algorithm enable joint haplotype assembly and genotyping in a computationally efficient manner. Our results demonstrate fast and highly accurate haplotype assembly aided by the re-examination of erroneously called genotypes.
National Science Foundation CCF-1320273. Electrical and Computer Engineering.
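The likelihood-driven extension step at the core of this style of assembly can be sketched as follows. This is a minimal greedy illustration, not the ParticleHap implementation itself; the read encoding, error rate, and function names are assumptions made for the example:

```python
import math

def log_likelihood(reads, hap, eps):
    """Log-probability of the reads given the partial haplotype `hap` and
    its complement, marginalising over which haplotype each read came from."""
    total = 0.0
    for read in reads:
        p_h1 = p_h2 = 1.0
        for j, obs in read.items():
            if hap[j] is None:          # SNP not yet phased
                continue
            p_h1 *= (1 - eps) if obs == hap[j] else eps
            p_h2 *= (1 - eps) if obs == 1 - hap[j] else eps
        total += math.log(0.5 * (p_h1 + p_h2))
    return total

def extend_haplotype(reads, error_rate=0.01):
    """Extend a partial haplotype one heterozygous SNP at a time, keeping
    the extension (0 or 1) with the higher read likelihood. Each read is a
    dict mapping SNP index -> observed allele (0/1). The first SNP's phase
    is fixed arbitrarily, since phase is only defined up to a global flip."""
    n_snps = 1 + max(i for r in reads for i in r)
    hap = [None] * n_snps
    hap[0] = 0
    for j in range(1, n_snps):
        scores = {}
        for allele in (0, 1):
            hap[j] = allele
            scores[allele] = log_likelihood(reads, hap, error_rate)
        hap[j] = max(scores, key=scores.get)
    return hap
```

Working with likelihoods at each extension, rather than hard genotype calls, is what lets this style of method revisit sites that a separate genotype-calling step would have fixed incorrectly.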
Bayesian multiple-instance motif discovery with BAMBI: inference of recombinase and transcription factor binding sites
Finding conserved motifs in genomic sequences is one of the essential problems of bioinformatics. However, achieving high discovery performance without imposing substantial auxiliary constraints on possible motif features remains a key algorithmic challenge. This work describes BAMBI, a sequential Monte Carlo motif-identification algorithm based on a position weight matrix model that does not require additional constraints and is able to estimate motif properties such as length, logo, and the number of instances and their locations solely on the basis of primary nucleotide sequence data. Furthermore, should biologically meaningful information about motif attributes be available, BAMBI takes advantage of this knowledge to further refine the discovery results. In practical applications, we show that the proposed approach can be used to find sites of such diverse DNA-binding molecules as the cAMP receptor protein (CRP) and Din-family site-specific serine recombinases. Results obtained by BAMBI in these and other settings demonstrate better statistical performance, as measured by the nucleotide-level correlation coefficient, than four widely used profile-based motif discovery methods: MEME, BioProspector with BioOptimizer, SeSiMCMC and Motif Sampler. Additionally, in the case of Din-family recombinase target site discovery, the BAMBI-inferred motif is found to be the only one that is functionally accurate from the standpoint of the underlying biochemical mechanism. C++ and Matlab code is available at http://www.ee.columbia.edu/~guido/BAMBI or http://genomics.lbl.gov/BAMBI/
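The position weight matrix model underlying this style of motif scoring can be made concrete with a small example. The PWM values and function names below are invented for illustration and are not BAMBI's own:

```python
import math

# Hypothetical toy PWM: position-specific base probabilities for a
# width-3 motif (invented values, for illustration only).
PWM = [
    {"A": 0.8,  "C": 0.05, "G": 0.1,  "T": 0.05},
    {"A": 0.1,  "C": 0.1,  "G": 0.7,  "T": 0.1},
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
]
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

def site_log_odds(site):
    """Log-odds of a candidate site being a motif instance under the PWM
    versus under the background model; positive means motif-like."""
    return sum(math.log(PWM[i][b] / BACKGROUND[b]) for i, b in enumerate(site))

def best_site(sequence, width=3):
    """Slide the PWM along the sequence and return the best-scoring start."""
    starts = range(len(sequence) - width + 1)
    return max(starts, key=lambda s: site_log_odds(sequence[s:s + width]))
```

A full motif discovery algorithm must additionally infer the PWM itself, the motif width, and the number of instances; the scoring step above is only the innermost building block.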
Novel stochastic and entropy-based Expectation-Maximisation algorithm for transcription factor binding site motif discovery
The discovery of transcription factor binding site (TFBS) motifs remains an important
and challenging problem in computational biology. This thesis presents MITSU,
a novel algorithm for TFBS motif discovery which exploits stochastic methods as a
means of both overcoming optimality limitations in current algorithms and as a framework
for incorporating relevant prior knowledge in order to improve results.
The current state of the TFBS motif discovery field is surveyed, with a focus
on probabilistic algorithms that typically take the promoter regions of coregulated
genes as input. A case is made for an approach based on the stochastic Expectation-
Maximisation (sEM) algorithm; its position amongst existing probabilistic algorithms
for motif discovery is shown. The algorithm developed in this thesis is unique amongst
existing motif discovery algorithms in that it combines the sEM algorithm with a derived
data set which leads to an improved approximation to the likelihood function.
This likelihood function is unconstrained with regard to the distribution of motif occurrences
within the input dataset. MITSU also incorporates a novel heuristic to automatically
determine TFBS motif width. This heuristic, known as MCOIN, is shown to
outperform current methods for determining motif width. MITSU is implemented in
Java and an executable is available for download.
MITSU is evaluated quantitatively using realistic synthetic data and several collections
of previously characterised prokaryotic TFBS motifs. The evaluation demonstrates
that MITSU improves on a deterministic EM-based motif discovery algorithm
and an alternative sEM-based algorithm, in terms of previously established metrics.
The ability of the sEM algorithm to escape stable fixed points of the EM algorithm,
which trap deterministic motif discovery algorithms, and the ability of MITSU to discover
multiple motif occurrences within a single input sequence are also demonstrated.
MITSU is validated using previously characterised Alphaproteobacterial motifs,
before being applied to motif discovery in uncharacterised Alphaproteobacterial data.
A number of novel results from this analysis are presented and motivate two extensions
of MITSU: a strategy for the discovery of multiple different motifs within a single
dataset and a higher order Markov background model. The effects of incorporating
these extensions within MITSU are evaluated quantitatively using previously characterised
prokaryotic TFBS motifs and demonstrated using Alphaproteobacterial motifs.
Finally, an information-theoretic measure of motif palindromicity is presented and its
advantages over existing approaches for discovering palindromic motifs are discussed.
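The stochastic E-step that distinguishes sEM from deterministic EM can be sketched as follows. This is a simplified one-occurrence-per-sequence illustration with a uniform background, not the MITSU implementation; all names are assumptions:

```python
import math
import random

BASES = "ACGT"

def score(site, pwm):
    """Log-probability of a candidate site under the current PWM."""
    return sum(math.log(pwm[i][b]) for i, b in enumerate(site))

def sem_iteration(sequences, pwm, width, rng):
    """One stochastic-EM iteration for a one-occurrence-per-sequence motif
    model with a uniform background. Stochastic E-step: for each sequence,
    SAMPLE a motif start from its posterior, rather than keeping the soft
    expectations a deterministic E-step would compute. M-step: re-estimate
    the PWM from the sampled occurrences (with add-one pseudocounts)."""
    samples = []
    for seq in sequences:
        starts = list(range(len(seq) - width + 1))
        weights = [math.exp(score(seq[s:s + width], pwm)) for s in starts]
        samples.append(rng.choices(starts, weights=weights)[0])
    new_pwm = [{b: 1.0 for b in BASES} for _ in range(width)]
    for seq, s in zip(sequences, samples):
        for i, b in enumerate(seq[s:s + width]):
            new_pwm[i][b] += 1.0
    for col in new_pwm:
        z = sum(col.values())
        for b in BASES:
            col[b] /= z
    return new_pwm, samples
```

The injected sampling noise is precisely what lets the iteration escape the stable fixed points that trap the deterministic EM update.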
Topics in Genomic Signal Processing
Genomic information is digital in nature and admits mathematical modeling as a means of gaining biological knowledge. This dissertation focuses on the development and application of detection and estimation theories for solving problems in genomics by describing biological problems in mathematical terms and proposing solutions in this domain. More specifically, a novel framework for hypothesis testing is presented, in which it is desired to decide among multiple hypotheses, each involving unknown parameters. Within this framework, a test is developed that performs detection and estimation jointly, in an optimal sense. The proposed test is then applied to the problem of detecting and estimating periodicities in DNA sequences. Moreover, the problem of motif discovery in DNA sequences is presented, where a set of sequences is observed and one must determine which sequences contain instances (if any) of an unknown motif and estimate their positions. A statistical description of the problem is used and a sequential Monte Carlo method is applied for the inference. Finally, the phasing of haplotypes for diploid organisms is introduced and a novel mathematical model is proposed. The haplotypes used to reconstruct the observed genotypes of a group of unrelated individuals are detected, and the haplotype pair for each individual in the group is estimated. The model translates a biological principle, the maximum parsimony principle, into a sparseness condition.
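The kind of DNA periodicity mentioned above can be made concrete with a small spectral measure. This is a standard illustration of period detection via the DFT of per-base indicator sequences, not the dissertation's joint detection-estimation test; the names are assumptions:

```python
import cmath
import math

def period_strength(seq, period):
    """Spectral content of a DNA sequence at a candidate period, computed
    from the DFT of the four per-base indicator sequences. A strong
    period-3 component is a classic signature of protein-coding regions."""
    n = len(seq)
    k = n / period  # DFT bin corresponding to the candidate period
    total = 0.0
    for base in "ACGT":
        # DFT coefficient of the 0/1 indicator sequence of `base` at bin k
        x = sum(cmath.exp(-2j * math.pi * k * m / n)
                for m, b in enumerate(seq) if b == base)
        total += abs(x) ** 2
    return total / n
```

A joint detector-estimator would additionally decide, in a single optimal test, whether any periodicity is present and, if so, estimate its value, rather than thresholding this statistic separately at each candidate period.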
Bayesian Inference for Genomic Data Analysis
High-throughput genomic data contain vast amounts of information shaped by the complex biological processes in the cell. As such, appropriate mathematical modeling frameworks are required to understand the data and the processes that generate them. This dissertation focuses on the formulation of mathematical models and the description of appropriate computational algorithms to obtain insights from genomic data.
Specifically, the characterization of intra-tumor heterogeneity is studied. Based on the total number of allele copies at the genomic locations in the tumor subclones, the problem is viewed from two perspectives: the presence or absence of a copy-neutrality assumption. Under the copy-neutrality assumption, the genome is assumed to contain only mutational variability, and three possible genotypes may be present at each genomic location; the genotypes of all the genomic locations in the tumor subclones are therefore modeled by a ternary matrix. In the second case, in addition to mutational variability, the genomic locations are assumed to be affected by structural variabilities such as copy number variation (CNV); the genotypes are thus modeled with a pair of (Q + 1)-ary matrices. Using the categorical Indian buffet process (cIBP), a state-space modeling framework is employed to describe the two processes, and sequential Monte Carlo (SMC) methods for dynamic models are applied to perform inference on the important model parameters.
Moreover, the problem of estimating a gene regulatory network (GRN) from measurements with missing values is presented. Specifically, gene expression time series data may be missing the entire set of expression values at a single time point or at a set of consecutive time points, yet complete data are often needed to make inference on the underlying GRN. A dynamic stochastic model is used to describe the evolution of gene expression, and point-based Gaussian approximation (PBGA) filters that accommodate one-step or two-step missing measurements are applied for the inference. Finally, the problem of deconvolving gene expression data from complex heterogeneous biological samples is examined, where the observed data are a mixture of different cell types. A statistical description of the problem is used, and the SMC method for static models is applied to estimate the cell-type-specific expressions and the cell-type proportions in the heterogeneous samples.
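The mixture model behind the deconvolution problem can be made concrete with a toy example. The snippet below recovers mixing proportions with plain least squares on noiseless synthetic data, purely as an illustration of the generative model; the dissertation itself performs the inference with SMC:

```python
import numpy as np

# Toy version of the generative model behind expression deconvolution:
# a mixed sample x is modelled as S @ p, where the columns of S hold the
# cell-type-specific expression signatures and p holds the mixing
# proportions of the cell types in the heterogeneous sample.
rng = np.random.default_rng(0)
S = rng.uniform(1.0, 10.0, size=(50, 3))   # 50 genes, 3 cell types
p_true = np.array([0.6, 0.3, 0.1])         # ground-truth proportions
x = S @ p_true                             # observed mixed expression

p_hat, *_ = np.linalg.lstsq(S, x, rcond=None)
p_hat = np.clip(p_hat, 0.0, None)          # proportions cannot be negative
p_hat /= p_hat.sum()                       # renormalise to sum to one
```

With noisy data and unknown signatures the problem becomes ill-posed, which is why a full Bayesian treatment with SMC, as in the dissertation, is needed in practice.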
On Improving Stochastic Simulation for Systems Biology
Mathematical modeling and computer simulation are powerful approaches for understanding the complexity of biological systems. In particular, computer simulation represents a strong validation and fast hypothesis-verification tool. Over the years, several successful attempts have been made to simulate complex biological processes such as metabolic pathways, gene regulatory networks and cell signaling pathways. These processes are stochastic in nature, and they are furthermore characterized by evolution on multiple time scales and by great variability in the population sizes of molecules. The best-known method for capturing the random time evolution of well-stirred chemically reacting systems is Gillespie's Stochastic Simulation Algorithm (SSA). This Monte Carlo method generates exact realizations of the state of the system by stochastically determining when a reaction will occur and which reaction it will be. Most of its assumptions and hypotheses are clearly simplifications, but in many cases the method has proved useful for capturing the randomness typical of realistic biological systems. Unfortunately, Gillespie's stochastic simulation method is often slow in practice. This has posed a great challenge and motivated the development of new efficient methods able to simulate stochastic and multiscale biological systems.
In this thesis we address the problem of simulating metabolic experiments and develop efficient simulation methods for well-stirred chemically reacting systems. We showed how a Systems Biology approach can provide a cheap, fast and powerful method for validating models proposed in the literature. In the present case, we specified the model of the SRI photocycle proposed by Hoff et al. in a purpose-built simulator. This simulator was specifically designed to reproduce in silico wet-lab experiments performed on metabolic networks, with several possible controls exerted on them by the operator. Thanks to this, we proved that the screened model is able to explain many light responses correctly, but it was unable to explain some critical experiments due to unresolvable time-scale problems. This confirms that our simulator is useful for simulating metabolic experiments. It can be downloaded at the URL http://sourceforge.net/projects/gillespie-qdc.
To accelerate the SSA, we first proposed a data-parallel implementation, on general-purpose graphics processing units, of a revised version of Gillespie's First Reaction Method. Simulations performed on a GeForce 8600M GS graphics card with 16 stream processors showed that the parallel computation halves the execution time, and this performance scales with the number of steps of the simulation. We also highlighted some specific problems of the programming environment when executing nontrivial general-purpose applications. In conclusion, we demonstrated the extreme computational power of these low-cost and widespread technologies, but the limitations that emerged show that we are still far from general-purpose applications on GPUs.
In our investigation we also attempted to achieve higher simulation speed by focusing on tau-leaping methods. We observed that these methods share a common basic algorithmic convention: the pre-computation of the information necessary to estimate the size of the leap and the number of reactions that will fire in it. Often these pre-processing operations are used to avoid negative populations. Their computational cost is typically proportional to the size of the model (i.e., the number of reactions), which means that larger models involve a larger computational cost. The pre-processing operations yield very efficient simulation when leaps are long and many reactions fire, but they become a burden when leaps are short and few reactions occur. To deal efficiently with the latter case, we proposed a method that departs from this trend. The SSALeaping method, SSAL for short, is a new method that lies between the direct method (DM) and tau-leaping. SSALeaping adaptively builds leaps and updates the system state stepwise. Unlike methods such as Modified tau-leaping (MTL), SSAL neither shifts from tau-leaping to DM nor pre-selects the largest leap time consistent with the leap condition. Moreover, whereas MTL prevents negative populations by separating critical and non-critical reactions, SSAL generates the reactions to fire sequentially, verifying the leap condition after each reaction selection. We proved that a reaction overdraws one of its reactants if and only if the leap condition is violated; this makes it impossible for populations to become negative, because SSAL stops the leap generation in advance. To test the accuracy and performance of our method we performed a large number of simulations on realistic biological models. The tests aimed to span the number of reactions fired in a leap and the number of reactions of the system as widely as possible, sometimes over orders of magnitude. Results showed that our method performs better than MTL in many of the tested cases, but not all. To enlarge the set of models eligible to be simulated efficiently, we then exploited the complementarity that emerged between SSAL and MTL and proposed a new adaptive method, called Adaptive Modified SSALeaping (AMS). During the simulation, our method switches between SSALeaping and Modified tau-leaping according to conditions on the number of reactions of the model and the predicted number of reactions firing in a leap. We determined, both theoretically and experimentally, how to estimate the number of reactions that will fire in a leap and the threshold that triggers the switch from one method to the other and vice versa. Results obtained on realistic biological models showed that in practice AMS performs better than SSAL and MTL, enlarging the set of models eligible to be simulated efficiently; indeed, the method correctly selects the better algorithm between SSAL and MTL in each case.
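The plain tau-leaping scheme that MTL, SSAL and AMS refine can be sketched as follows. This is the textbook step with an invented toy reaction set, not the thesis code, and it deliberately includes no safeguard against negative populations:

```python
import numpy as np

def tau_leap_step(state, stoich, propensities, tau, rng):
    """One plain tau-leaping step: each reaction j fires a Poisson(a_j * tau)
    number of times within the leap, and the state is updated with the summed
    stoichiometric changes (no safeguard against negative populations here)."""
    a = propensities(state)
    k = rng.poisson(a * tau)        # firing count for each reaction
    return state + stoich.T @ k

# Toy dimerisation system: 2 S1 -> S2 (rate c1), S2 -> 2 S1 (rate c2).
stoich = np.array([[-2, 1],
                   [ 2, -1]])
def propensities(x, c1=1e-3, c2=0.5):
    return np.array([c1 * x[0] * (x[0] - 1) / 2.0, c2 * x[1]])
```

Because the Poisson counts are unbounded, a large draw can overdraw a reactant; the methods discussed above differ precisely in how they prevent this while keeping leaps as long as possible.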
In this thesis we also investigated new parallelization techniques. The parallelization of biological systems has stimulated the interest of many researchers, because these systems are parallel and sometimes distributed in nature. However, Gillespie's SSA is strictly sequential. We presented a novel exact formulation of the SSA based on the idea of partitioning the volume. We proved the equivalence between our method and the DM, and gave a simple test to demonstrate its accuracy in practice. We then proposed a variant of SSALeaping based on the partitioning of the volume, called Partitioned SSALeaping. Its main feature is that the dynamics of a system in a leap can be obtained by composing the dynamics processed by each sub-volume of the partition. This form of independence offers a different view with respect to existing methods. We have so far tested the method only on a simple model, where it accurately matched the results of the DM, independently of the number of sub-volumes in the partition. This confirms that the method works and that the independence is effective. We have not yet given a parallel implementation of this method; this work is still in progress and much remains to be done. Nevertheless, Partitioned SSALeaping is a promising approach for future parallelization on multi-core (e.g., GPU) or many-core (e.g., cluster) technologies.
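The direct-method SSA described at the start of this abstract (draw when the next reaction fires, then which reaction it is) can be sketched as follows; this is a generic textbook implementation with an invented decay example, not the thesis simulator:

```python
import random

def gillespie_direct(state, stoich, propensities, t_end, rng):
    """Gillespie's direct-method SSA: repeatedly draw WHEN the next reaction
    occurs (exponentially distributed with rate a0) and WHICH reaction fires
    (reaction j with probability a_j / a0), then apply its state change."""
    t, trajectory = 0.0, [(0.0, list(state))]
    while True:
        a = propensities(state)
        a0 = sum(a)
        if a0 == 0:                      # no reaction can fire any more
            break
        t += rng.expovariate(a0)         # time to the next reaction
        if t > t_end:
            break
        r, acc = rng.random() * a0, 0.0
        for j, aj in enumerate(a):
            acc += aj
            if r < acc:                  # reaction j chosen w.p. a_j / a0
                state = [s + d for s, d in zip(state, stoich[j])]
                break
        trajectory.append((t, list(state)))
    return trajectory

# Toy example: irreversible decay A -> 0 with rate constant c = 0.5.
decay_stoich = [[-1]]
decay_propensity = lambda x: [0.5 * x[0]]
```

Every iteration simulates exactly one reaction event, which is why the method is exact but slow for large populations, and why the leaping methods above trade some exactness for speed.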
Computational methods for understanding genetic variations from next generation sequencing data
Studies of human genetic variation reveal critical information about genetic and complex diseases such as cancer, diabetes and heart disease, ultimately leading towards improvements in health and quality of life. Moreover, understanding genetic variation in viral populations is of utmost importance to virologists and helps in the search for vaccines. Next-generation sequencing technology is capable of acquiring massive amounts of data that can provide insight into the structure of diverse sets of genomic sequences. However, reconstructing heterogeneous sequences is computationally challenging due to the large dimension of the problem and the limitations of the sequencing technology. This dissertation focuses on algorithms and analysis for two problems in which we seek to characterize genetic variations: (1) haplotype reconstruction for a single individual, the so-called single individual haplotyping (SIH) or haplotype assembly problem, and (2) reconstruction of a viral population, the so-called quasispecies reconstruction (QSR) problem. For the SIH problem, we have developed a method that relies on a probabilistic model of the data and employs the sequential Monte Carlo (SMC) algorithm to jointly determine the type of variation (i.e., perform genotype calling) and assemble haplotypes. For the QSR problem, we have developed two algorithms. The first combines agglomerative hierarchical clustering and Bayesian inference to reconstruct quasispecies characterized by low diversity. The second utilizes a tensor factorization framework with successive data removal to reconstruct quasispecies characterized by highly uneven frequencies of their components. Both algorithms outperform existing methods on both benchmarking tests and real data.
Electrical and Computer Engineering.