73 research outputs found
A nonparametric HMM for genetic imputation and coalescent inference
Genetic sequence data are well described by hidden Markov models (HMMs) in
which latent states correspond to clusters of similar mutation patterns. Theory
from statistical genetics suggests that these HMMs are nonhomogeneous (their
transition probabilities vary along the chromosome) and have large support for
self transitions. We develop a new nonparametric model of genetic sequence
data, based on the hierarchical Dirichlet process, which supports these self
transitions and nonhomogeneity. Our model provides a parameterization of the
genetic process that is more parsimonious than other more general nonparametric
models which have previously been applied to population genetics. We provide
truncation-free MCMC inference for our model using a new auxiliary sampling
scheme for Bayesian nonparametric HMMs. In a series of experiments on male X
chromosome data from the Thousand Genomes Project and also on data simulated
from a population bottleneck we show the benefits of our model over the popular
finite model fastPHASE, which can itself be seen as a parametric truncation of
our model. We find that the number of HMM states found by our model is
correlated with the time to the most recent common ancestor in population
bottlenecks. This work demonstrates the flexibility of Bayesian nonparametrics
applied to large and complex genetic data.
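The two structural properties named in the abstract above (strong self transitions and transition probabilities that vary along the chromosome) can be illustrated with a small finite HMM. This is only a sketch under stated assumptions: it is not the hierarchical Dirichlet process model or the auxiliary sampler of the paper, and the names `sticky_transitions` and `forward_log_likelihood`, along with all numeric values, are hypothetical.

```python
import math


def sticky_transitions(n_states, stay_prob):
    """Row-stochastic matrix with stay_prob on the diagonal (self transitions)
    and the remaining mass spread evenly over the other states."""
    if n_states < 2:
        raise ValueError("need at least two states")
    off = (1.0 - stay_prob) / (n_states - 1)
    return [[stay_prob if i == j else off for j in range(n_states)]
            for i in range(n_states)]


def forward_log_likelihood(obs, init, trans_per_site, emit):
    """Scaled forward algorithm for a nonhomogeneous HMM.

    obs            -- list of observed alleles (integer codes)
    init           -- initial state probabilities
    trans_per_site -- trans_per_site[t] moves the chain from site t to t + 1,
                      so it has len(obs) - 1 entries (nonhomogeneity)
    emit           -- emit[k][x] = P(allele x | state k)
    """
    n = len(init)
    alpha = [init[k] * emit[k][obs[0]] for k in range(n)]
    scale = sum(alpha)
    loglik = math.log(scale)
    alpha = [a / scale for a in alpha]
    for t in range(1, len(obs)):
        A = trans_per_site[t - 1]
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * emit[j][obs[t]]
                 for j in range(n)]
        scale = sum(alpha)
        loglik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return loglik


# Illustrative values only: two haplotype clusters, biallelic sites, and a
# lower self-transition probability between sites 1 and 2 (a hypothetical
# recombination hotspot).
obs = [0, 0, 1, 1]
init = [0.5, 0.5]
emit = [[0.9, 0.1], [0.1, 0.9]]
trans = [sticky_transitions(2, p) for p in (0.99, 0.70, 0.99)]
print(forward_log_likelihood(obs, init, trans, emit))
```

Passing one transition matrix per gap between sites is what makes the chain nonhomogeneous; a homogeneous HMM would reuse a single matrix.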
Fragmentation Coagulation Based Mixed Membership Stochastic Blockmodel
The Mixed-Membership Stochastic Blockmodel (MMSB) has been proposed as one of the
state-of-the-art Bayesian relational methods suitable for learning the complex
hidden structure underlying network data. However, the current formulation
of MMSB suffers from the following two issues: (1) prior information (e.g.
entities' community structural information) cannot be well embedded in the
modelling; (2) community evolution is not well described in the
literature. Therefore, we propose a non-parametric fragmentation coagulation
based Mixed Membership Stochastic Blockmodel (fcMMSB). Our model performs
entity-based clustering to capture the community information for entities and
linkage-based clustering to derive the group information for links
simultaneously. Besides, the proposed model infers the network structure and
models community evolution, manifested by appearances and disappearances of
communities, using the discrete fragmentation coagulation process (DFCP). By
integrating the community structure with the group compatibility matrix we
derive a generalized version of MMSB. An efficient Gibbs sampling scheme using
the Pólya-Gamma (PG) approach is implemented for posterior inference. We validate
our model on synthetic and real world data. Comment: AAAI 202
Bayesian nonparametric models of genetic variation
We will develop three new Bayesian nonparametric models for genetic variation. These models are all dynamic-clustering approximations of the ancestral recombination graph (ARG), a structure that fully describes the genetic history of a population. Due to its complexity, efficient inference for the ARG is not possible. However, different aspects of the ARG can be captured by the approximations discussed in our work. The ARG can be described by a tree-valued HMM in which the trees vary along the genetic sequence. Many modern models of genetic variation proceed by approximating these trees with (often finite) clusterings. We will consider Bayesian nonparametric priors for the clustering, thereby providing nonparametric generalizations of these models and avoiding problems with model selection and label switching. Further, we will compare the performance of these models on a wide selection of inference problems in genetics such as phasing, imputation, genome-wide association and admixture or bottleneck discovery. These experiments should provide a common testing ground on which the different approximations inherent in modern genetic models can be compared. The results of these experiments should shed light on the nature of the approximations and guide future application of these models.
An evolutionary model that satisfies detailed balance
We propose a class of evolutionary models that involves an arbitrary
exchangeable process as the breeding process and different selection schemes.
In those models, a new genome is born according to the breeding process, and
then a genome is removed according to the selection scheme that involves
fitness. Thus the population size remains constant. The process evolves
according to a Markov chain, and, unlike in many other existing models, the
stationary distribution (the so-called mutation-selection equilibrium) can be
easily found and studied. The behaviour of the stationary distribution when the
population size increases is our main object of interest. Several
phase-transition theorems are proved. Comment: 38 pages, 5 figures
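The birth-then-removal dynamics described in the abstract above can be sketched as a single Markov chain step. Everything specific here is an assumption made for illustration and is not taken from the paper: the breeding process is a simple copy-or-mutate rule, the newborn is itself eligible for removal, removal weights are inversely proportional to fitness, and the names `evolve_one_step`, `fitness`, and `mutate` are hypothetical.

```python
import random


def evolve_one_step(population, fitness, mutate, mutation_prob, rng):
    """One step: a genome is born, then one genome is removed.

    Assumptions (illustrative only):
    - birth: with probability mutation_prob draw a fresh genome from mutate(rng),
      otherwise copy a uniformly chosen existing genome;
    - removal: one genome (possibly the newborn) is removed with probability
      proportional to 1 / fitness(genome), so fitness must be positive.
    The population size is unchanged by the step.
    """
    if rng.random() < mutation_prob:
        child = mutate(rng)
    else:
        child = rng.choice(population)
    candidates = population + [child]
    weights = [1.0 / fitness(g) for g in candidates]
    idx = rng.choices(range(len(candidates)), weights=weights, k=1)[0]
    return candidates[:idx] + candidates[idx + 1:]


# Illustrative run with two genome types; "b" is assumed twice as fit as "a".
rng = random.Random(0)
fitness_table = {"a": 1.0, "b": 2.0}
population = ["a"] * 20
for _ in range(500):
    population = evolve_one_step(
        population,
        fitness=lambda g: fitness_table[g],
        mutate=lambda r: r.choice(["a", "b"]),
        mutation_prob=0.05,
        rng=rng,
    )
print(len(population), population.count("b"))
```

Running the chain for many steps and tabulating the population composition gives an empirical view of the stationary distribution for this toy rule; it says nothing about the phase-transition results of the paper.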
Scalable Feature Selection Applications for Genome-Wide Association Studies of Complex Diseases
Personalized medicine will revolutionize our capabilities to combat disease. Working toward this goal, a fundamental task is the deciphering of genetic variants that are predictive of complex diseases. Modern studies, in the form of genome-wide association studies (GWAS), have afforded researchers the opportunity to reveal new genotype-phenotype relationships through the extensive scanning of genetic variants. These studies typically contain over half a million genetic features for thousands of individuals. Examining this with methods other than univariate statistics is a challenging task requiring advanced algorithms that are scalable to the genome-wide level. In the future, next-generation sequencing studies (NGS) will contain an even larger number of common and rare variants.
Machine learning-based feature selection algorithms have been shown to create effective predictive models for various genotype-phenotype relationships. This work explores the problem of selecting genetic variant subsets that are the most predictive of complex disease phenotypes through various feature selection methodologies, including filter, wrapper and embedded algorithms. The examined machine learning algorithms were demonstrated not only to be effective at predicting the disease phenotypes, but also to do so efficiently through the use of computational shortcuts. While much of the work could be run on high-end desktops, some work was further extended so that it could be implemented on parallel computers, helping to ensure that the methods will also scale to the NGS data sets.
Further, these studies analyzed the relationships between various feature selection methods and demonstrated the need for careful testing when selecting an algorithm. It was shown that there is no universally optimal algorithm for variant selection in GWAS, but rather methodologies need to be selected based on the desired outcome, such as the number of features to be included in the prediction model. It was also demonstrated that without proper model validation, for example using nested cross-validation, the models can result in overly optimistic prediction accuracies and decreased generalization ability. It is through the implementation and application of machine learning methods that one can extract predictive genotype–phenotype relationships and biological insights from genetic data sets.
Identification of breed contributions in crossbred dogs
There has been strong public interest recently in the interrogation of canine ancestries using direct-to-consumer (DTC) genetic ancestry inference tools. Our goal is to improve the accuracy of the associated computational tools by developing superior algorithms for identifying the breed composition of mixed-breed dogs. Genetic test data, based on SNP markers, have been provided by Mars Veterinary. We approach this ancestry inference problem from two main directions. The first approach is optimized for datasets composed of a small number of ancestry informative markers (AIMs). We first compute haplotype frequencies from purebred ancestral panels, which characterize genetic variation within breeds and are used to predict breed compositions. Because of the large number of possible breed combinations in admixed dogs, we approximately sample this search space with a Metropolis-Hastings algorithm. As the proposal density we either uniformly sample new breeds for the lineage, or we bias the Markov chain so that breeds in the lineage are more likely to be replaced by similar breeds. The second direction we explore is dominated by HMM approaches, which view genotypes as realizations of latent variable sequences corresponding to breeds. In this approach an admixed canine sample is viewed as a linear combination of segments from dogs in the ancestral panel. Results were evaluated using two different performance measures. Firstly, we looked at a generalization of binary ROC curves to multi-class classification problems. Secondly, to judge breed contribution approximations more accurately, we computed the difference between expected and predicted breed contributions. Experimental results on a synthetic, admixed test dataset using AIMs showed that the MCMC approach successfully predicts breed proportions for a variety of lineage complexities. Furthermore, due to exploration in the MCMC algorithm, true breed contributions are underestimated.
The HMM approach performed less well, presumably because it uses less of the information in the dataset.
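The Metropolis-Hastings search over lineages with a uniform proposal, as described in the abstract above, can be sketched as follows. This is a minimal illustration under stated assumptions and not the authors' implementation: the likelihood uses per-marker allele frequencies with a binomial genotype model rather than haplotype frequencies, the lineage is a fixed number of ancestor slots, and the names `log_likelihood`, `metropolis_hastings`, and the frequency table are hypothetical.

```python
import math
import random


def log_likelihood(lineage, genotypes, freqs):
    """Log-likelihood of genotypes (0, 1 or 2 copies of an allele per marker).

    Assumption: the dog's allele frequency at a marker is the average of the
    frequencies of the breeds in its lineage, and all frequencies lie strictly
    between 0 and 1.
    """
    total = 0.0
    for m, g in enumerate(genotypes):
        p = sum(freqs[b][m] for b in lineage) / len(lineage)
        total += (math.log(math.comb(2, g))
                  + g * math.log(p) + (2 - g) * math.log(1.0 - p))
    return total


def metropolis_hastings(genotypes, freqs, n_slots, n_iter, rng):
    """Sample lineages; return the average share of each breed over the chain."""
    breeds = sorted(freqs)
    lineage = [rng.choice(breeds) for _ in range(n_slots)]
    current = log_likelihood(lineage, genotypes, freqs)
    counts = {b: 0 for b in breeds}
    for _ in range(n_iter):
        # Uniform (symmetric) proposal: replace one slot with a random breed.
        proposal = list(lineage)
        proposal[rng.randrange(n_slots)] = rng.choice(breeds)
        candidate = log_likelihood(proposal, genotypes, freqs)
        # Symmetric proposal, so the acceptance ratio is the likelihood ratio.
        if rng.random() < math.exp(min(0.0, candidate - current)):
            lineage, current = proposal, candidate
        for b in lineage:
            counts[b] += 1
    return {b: c / (n_iter * n_slots) for b, c in counts.items()}


# Illustrative data: two hypothetical breeds with opposite allele frequencies.
freqs = {"breed_A": [0.9] * 50, "breed_B": [0.1] * 50}
genotypes = [2] * 50  # a sample that looks like breed_A at every marker
print(metropolis_hastings(genotypes, freqs, n_slots=4, n_iter=2000,
                          rng=random.Random(0)))
```

The biased proposal mentioned in the abstract (replacing a breed preferentially with similar breeds) would generally be asymmetric, so the acceptance ratio would then need the usual Hastings correction for the proposal probabilities.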
Time-dependent metabolic phenotyping of inflammatory dysregulation
A rich and functional description of a patient's health status is the fundamental basis for the personalisation of treatment and the targeting of interventions. The function of inflammation in the healing process, as well as its involvement in most major diseases, is well established, yet the specific mechanism by which it contributes to pathogenesis is still not fully understood. If conditions arising from a dysregulation of the inflammatory process are to be treated before they become irreversible, a novel understanding of these pathologies must be achieved and a stratification of patients based on their inflammatory status undertaken.
The work presented in this thesis aims to deliver new analytical and statistical approaches to support the investigation of the time-dependent dysregulation of inflammation.
Lipid mediators have been described as exerting a major role in the initiation and regulation of the inflammatory response, yet analytical platforms for their large-scale characterisation in human biofluids are lacking. This thesis reports the validation of an assay for the simultaneous quantification of pro- and anti-inflammatory signalling molecules in multiple human biofluids. The coverage of the assay in each biofluid is subsequently established, characterising inflammatory signalling across biological compartments. A second study explores the assay’s applicability in a clinical context; investigating the relationship between lipid mediators, current clinical markers of inflammation and post-operative complications.
Characterising the interplay between signalling and regulatory networks is key to understanding a living system’s response to perturbations, yet few statistical approaches are suited for the detection of time-dependent patterns in short and irregularly sampled longitudinal datasets. This thesis reports the development of a statistical approach to support the identification of altered time-trajectories in such studies. The method’s wide applicability is subsequently demonstrated on two investigations covering the diversity of metabolic phenotyping data generation platforms.
This thesis is a proof of concept for the characterisation of patient-specific inflammatory status in a clinical context and the identification of altered time-dependent patterns. Both analytical and statistical developments have been motivated by the needs of real world applications and provide a template for the characterisation and analysis of the molecular basis for treatment.