CellPhy: accurate and fast probabilistic inference of single-cell phylogenies from scDNA-seq data
We introduce CellPhy, a maximum likelihood framework for inferring phylogenetic trees from somatic single-cell single-nucleotide variants. CellPhy leverages a finite-site Markov genotype model with 16 diploid states and considers amplification error and allelic dropout. We implement CellPhy into RAxML-NG, a widely used phylogenetic inference package that provides statistical confidence measurements and scales well on large datasets with hundreds or thousands of cells. Comprehensive simulations suggest that CellPhy is more robust to single-cell genomics errors and outperforms state-of-the-art methods under realistic scenarios, both in accuracy and speed. Funding: European Research Council | Ref. ERC-617457-PHYLOCANCER; Agencia Estatal de Investigación | Ref. PID2019-106247GB-I00; Fundação para a Ciência e a Tecnologia | Ref. PTDC/BIA-EVL/32030/2017; Xunta de Galici
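The "16 diploid states" mentioned in the CellPhy abstract can be illustrated with a minimal sketch. It assumes the states are the ordered pairs of the four nucleotides, which the abstract does not spell out, and the `apply_dropout` helper is a hypothetical toy illustration of allelic dropout, not CellPhy's actual error model.

```python
from itertools import combinations_with_replacement, product

NUCLEOTIDES = "ACGT"

# Ordered (phased) diploid genotypes: allele on copy 1, allele on copy 2.
# Assumption: these 4 x 4 ordered pairs are the 16 states of the model.
phased = ["".join(pair) for pair in product(NUCLEOTIDES, repeat=2)]

# Unordered genotypes, where A/G and G/A are the same state.
unphased = ["".join(pair) for pair in combinations_with_replacement(NUCLEOTIDES, 2)]


def apply_dropout(genotype: str, dropped_copy: int) -> str:
    """Toy allelic dropout: one allele copy fails to amplify, so the
    genotype is observed as homozygous for the remaining allele."""
    kept = genotype[1 - dropped_copy]
    return kept + kept


print(len(phased), len(unphased))   # 16 ordered states, 10 unordered states
print(apply_dropout("AG", 0))       # copy 0 ("A") drops out
```

Ordering the two alleles is what raises the state count from 10 to 16. The dropout helper shows why a true heterozygote can be read as a homozygote in single-cell data.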
Whole genome association mapping by incompatibilities and local perfect phylogenies
BACKGROUND: With current technology, vast amounts of data can be produced cheaply and efficiently in association studies. To prevent data analysis from becoming the bottleneck of such studies, fast and efficient analysis methods that scale to these data set sizes must be developed. RESULTS: We present a fast method for accurate localisation of disease-causing variants in high-density case-control association mapping experiments with large numbers of cases and controls. The method searches for significant clustering of case chromosomes in the "perfect" phylogenetic tree defined by the largest region around each marker that is compatible with a single phylogenetic tree. This perfect phylogenetic tree is treated as a decision tree for determining disease status, and scored by its accuracy as a decision tree. The rationale is that the perfect phylogeny near a disease-affecting mutation should provide more information about the affected/unaffected classification than random trees. If regions of compatibility contain few markers, for example because of large marker spacing, the algorithm can allow the inclusion of incompatible markers in order to enlarge the regions before estimating their phylogeny. Haplotype data and phased genotype data can be analysed. The power and efficiency of the method are investigated on 1) simulated genotype data under different models of disease determination, 2) artificial data sets created from the HapMap resource, and 3) data sets used for testing other methods, to allow comparison with them. Our method has the same accuracy as single marker association (SMA) in the simplest case of a single disease-causing mutation and a constant recombination rate. However, in more complex scenarios of mutation heterogeneity and more complex haplotype structure, such as that found in the HapMap data, our method outperforms SMA as well as other fast data mining approaches such as HapMiner and Haplotype Pattern Mining (HPM), despite being significantly faster. For unphased genotype data, an initial step of estimating the phase only slightly decreases the power of the method. The method also accurately localised the known susceptibility variant in an empirical data set (the ΔF508 mutation for cystic fibrosis) and found significant signals for association between the CYP2D6 gene and poor drug metabolism, although for this data set the highest association score is about 60 kb from the CYP2D6 gene. CONCLUSION: Our method has been implemented in the Blossoc (BLOck aSSOCiation) software. Using Blossoc, genome-wide chip-based surveys of 3 million SNPs in 1000 cases and 1000 controls can be analysed in less than two CPU hours.
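The idea of a region around a marker that is compatible with a single tree can be sketched with the standard four-gamete test for binary markers. This is a minimal illustration, not the Blossoc implementation: the function names and the toy haplotypes are hypothetical, and the greedy window growth shown here is a simplification that is not guaranteed to find the largest compatible region.

```python
def compatible(col_a, col_b):
    """Four-gamete test: two binary sites fit one perfect phylogeny
    if and only if fewer than four distinct allele pairs are observed."""
    return len(set(zip(col_a, col_b))) < 4


def compatible_region(haplotypes, center):
    """Greedily grow a window around marker `center` while every pair of
    sites inside it stays compatible. Returns inclusive (start, end)."""
    n_sites = len(haplotypes[0])
    cols = [[h[j] for h in haplotypes] for j in range(n_sites)]
    start = end = center
    grew = True
    while grew:
        grew = False
        for cand, is_left in ((start - 1, True), (end + 1, False)):
            in_range = 0 <= cand < n_sites
            if in_range and all(
                compatible(cols[cand], cols[j]) for j in range(start, end + 1)
            ):
                if is_left:
                    start = cand
                else:
                    end = cand
                grew = True
    return start, end


# Toy data: five haplotypes over four binary markers (hypothetical).
haps = ["0000", "1001", "1100", "0011", "0010"]
print(compatible_region(haps, 1))  # window around marker 1
print(compatible_region(haps, 3))  # marker 3 conflicts with marker 2
```

In the toy data, markers 0 to 2 are pairwise compatible, while marker 3 shows all four gametes with its neighbours, so the window around marker 1 stops before it.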
HTreeQA: Using Semi-Perfect Phylogeny Trees in Quantitative Trait Loci Study on Genotype Data
With the advances in high-throughput genotyping technology, the study of quantitative trait loci (QTL) has emerged as a promising tool to understand the genetic basis of complex traits. Methodology development for the study of QTL has recently attracted significant research attention. Local phylogeny-based methods have been demonstrated to be powerful tools for uncovering significant associations between phenotypes and single-nucleotide polymorphism markers. However, most existing methods are designed for homozygous genotypes, and a separate haplotype reconstruction step is often needed to resolve heterozygous genotypes. This approach has limited power to detect nonadditive genetic effects and imposes an extensive computational burden. In this article, we propose a new method, HTreeQA, that uses a tristate semi-perfect phylogeny tree to approximate the perfect phylogeny used in existing methods. The semi-perfect phylogeny trees are used as high-level markers for association study. HTreeQA uses the genotype data as direct input without phasing. HTreeQA can handle complex local population structures. It is suitable for QTL mapping on any mouse population, including the incipient Collaborative Cross lines. Applying HTreeQA, we find significant QTLs for two phenotypes of the PreCC lines, white head spot and running distance at day 5/6. These findings are consistent with known genes and QTL discovered in independent studies. Simulation studies under three different genetic models show that HTreeQA can detect a wider range of genetic effects and is more efficient than existing phylogeny-based approaches. We also provide rigorous theoretical analysis to show that HTreeQA has a lower error rate than alternative methods.
Direct maximum parsimony phylogeny reconstruction from genotype data
BACKGROUND: Maximum parsimony phylogenetic tree reconstruction from genetic variation data is a fundamental problem in computational genetics with many practical applications in population genetics, whole genome analysis, and the search for genetic predictors of disease. Efficient methods are available for reconstruction of maximum parsimony trees from haplotype data, but such data are difficult to determine directly for autosomal DNA. Data are more commonly available in the form of genotypes, which consist of conflated combinations of pairs of haplotypes from homologous chromosomes. Currently, there are no general algorithms for the direct reconstruction of maximum parsimony phylogenies from genotype data. Phylogenetic applications for autosomal data must therefore rely on other methods for first computationally inferring haplotypes from genotypes. RESULTS: In this work, we develop the first practical method for computing maximum parsimony phylogenies directly from genotype data. We show that the standard practice of first inferring haplotypes from genotypes and then reconstructing a phylogeny on the haplotypes often substantially overestimates phylogeny size. As an immediate application, our method can be used to determine the minimum number of mutations required to explain a given set of observed genotypes. CONCLUSION: Phylogeny reconstruction directly from unphased data is computationally feasible for moderate-sized problem instances and can lead to substantially more accurate tree size inferences than the standard practice of treating phasing and phylogeny construction as two separate analysis stages. The difference between the approaches is particularly important for downstream applications that require a lower bound on the number of mutations that the genetic region has undergone.
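A small sketch can show why genotypes are "conflated" pairs of haplotypes and why phasing is ambiguous. It assumes binary sites and the common 0/1/2 encoding (0 and 1 for the two homozygotes, 2 for a heterozygote). The function names are hypothetical and this is not the method described in the abstract.

```python
from itertools import product


def genotype_of(hap_a, hap_b):
    """Conflate two binary haplotypes into one genotype:
    0 or 1 where both copies agree, 2 where they differ."""
    return [a if a == b else 2 for a, b in zip(hap_a, hap_b)]


def consistent_pairs(genotype):
    """Enumerate every unordered haplotype pair that explains a genotype."""
    het_sites = [i for i, g in enumerate(genotype) if g == 2]
    pairs = set()
    for bits in product((0, 1), repeat=len(het_sites)):
        hap_a, hap_b = list(genotype), list(genotype)
        for site, bit in zip(het_sites, bits):
            hap_a[site], hap_b[site] = bit, 1 - bit
        pairs.add(frozenset((tuple(hap_a), tuple(hap_b))))
    return pairs


g = genotype_of([0, 1, 1], [1, 1, 0])
print(g)                                        # two heterozygous sites
print(len(consistent_pairs(g)))                 # number of possible phasings
print(len(consistent_pairs([2, 2, 2, 0])))      # three heterozygous sites
```

With k heterozygous sites (k ≥ 1) there are 2^(k-1) candidate haplotype pairs, which is why committing to one inferred phasing before building a tree can distort the estimated number of mutations.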
Bayesian Statistical Methods for Genetic Association Studies with Case-Control and Cohort Design
Large-scale genetic association studies are carried out with the hope of discovering single nucleotide polymorphisms involved in the etiology of complex diseases. We propose a coalescent-based model for association mapping which potentially increases the power to detect disease-susceptibility variants in genetic association studies with case-control and cohort design. The approach uses Bayesian partition modelling to cluster haplotypes with similar disease risks by exploiting evolutionary information. We focus on candidate gene regions and we split the chromosomal region of interest into sub-regions or windows of high linkage disequilibrium (LD) therein assuming a perfect phylogeny. The haplotype space is then partitioned into disjoint clusters within which the phenotype-haplotype association is assumed to be the same. The novelty of our approach consists in the fact that the distance used for clustering haplotypes has an evolutionary interpretation, as haplotypes are clustered according to the time to their most recent common mutation. Our approach is fully Bayesian and we develop Markov Chain Monte Carlo algorithms to sample efficiently over the space of possible partitions. We have also developed a Bayesian survival regression model for high-dimension and small sample size settings. We provide a Bayesian variable selection procedure and shrinkage tool by imposing shrinkage priors on the regression coefficients. We have developed a computationally efficient optimization algorithm to explore the posterior surface and find the maximum a posteriori estimates of the regression coefficients. We compare the performance of the proposed methods in simulation studies and using real datasets to both single-marker analyses and recently proposed multi-marker methods and show that our methods perform similarly in localizing the causal allele while yielding lower false positive rates. Moreover, our methods offer computational advantages over other multi-marker approaches.
Local Genealogies in a Linear Mixed Model for Genome-Wide Association Mapping in Complex Pedigreed Populations
INTRODUCTION: The state of the art for dealing with multiple levels of relationship among the samples in genome-wide association studies (GWAS) is unified mixed model analysis (MMA). This approach is very flexible, can be applied to both family-based and population-based samples, and can be extended to incorporate other effects in a straightforward and rigorous fashion. Here, we present a complementary approach, called 'GENMIX (genealogy based mixed model)', which combines advantages from two powerful GWAS methods: genealogy-based haplotype grouping and MMA. SUBJECTS AND METHODS: We validated GENMIX using genotype data from Danish Jersey cattle and simulated phenotypes, and compared it with MMA. We simulated scenarios for three levels of heritability (0.21, 0.34, and 0.64), seven levels of MAF (0.05, 0.10, 0.15, 0.20, 0.25, 0.35, and 0.45) and five levels of QTL effect (0.1, 0.2, 0.5, 0.7 and 1.0 in phenotypic standard deviation units). Each of these 105 possible combinations (3 h² × 7 MAF × 5 effects) of scenarios was replicated 25 times. RESULTS: GENMIX provides a better ranking of markers close to the causative locus. GENMIX outperformed MMA when the QTL effect was small and the MAF at the QTL was low. In scenarios where MAF was high or the QTL affecting the trait had a large effect, GENMIX and MMA performed similarly. CONCLUSION: In discovery studies, where high-ranking markers are identified and later examined in validation studies, we therefore expect GENMIX to enrich candidates brought to follow-up studies with true positives over false positives more than MMA would.
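The size of the simulation design in the GENMIX abstract can be checked with a short enumeration. The parameter values are taken from the abstract; the variable names are hypothetical, and the sketch only counts scenarios without simulating any phenotypes.

```python
from itertools import product

heritabilities = (0.21, 0.34, 0.64)
mafs = (0.05, 0.10, 0.15, 0.20, 0.25, 0.35, 0.45)
qtl_effects = (0.1, 0.2, 0.5, 0.7, 1.0)  # in phenotypic standard deviations
replicates = 25

# Full factorial grid of scenarios: 3 heritabilities x 7 MAFs x 5 effects.
scenarios = list(product(heritabilities, mafs, qtl_effects))

# One entry per simulated data set (scenario plus replicate index).
runs = [(h2, maf, effect, rep)
        for (h2, maf, effect) in scenarios
        for rep in range(replicates)]

print(len(scenarios))  # 105 scenarios, as stated in the abstract
print(len(runs))       # 105 x 25 = 2625 simulated data sets
```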
Analysis and Visualization of Local Phylogenetic Structure within Species
While it is interesting to examine the evolutionary history and phylogenetic relationship between species, for example, in a sort of tree of life, there is also a great deal to be learned from examining population structure and relationships within species. A careful description of phylogenetic relationships within species provides insights into causes of phenotypic variation, including disease susceptibility. The better we are able to understand the patterns of genotypic variation within species, the better these populations may be used as models to identify causative variants and possible therapies, for example through targeted genome-wide association studies (GWAS). My thesis describes a model of local phylogenetic structure, how it can be effectively derived under various circumstances, and useful applications and visualizations of this model to aid genetic studies. I introduce a method for discovering phylogenetic structure among individuals of a population by partitioning the genome into a minimal set of intervals within which there is no evidence of recombination. I describe two extensions of this basic method. The first allows it to be applied to heterozygous, in addition to homozygous, genotypes and the second makes it more robust to errors in the source genotypes. I demonstrate the predictive power of my local phylogeny model using a novel method for genome-wide genotype imputation. This imputation method achieves very high accuracy - on the order of the accuracy rate in the sequencing technology - by imputing genotypes in regions of shared inheritance based on my local phylogenies. Comparative genomic analysis within species can be greatly aided by appropriate visualization and analysis tools. I developed a framework for web-based visualization and analysis of multiple individuals within a species, with my model of local phylogeny providing the underlying structure. 
I will describe the utility of these tools and the applications for which they have found widespread use. Doctor of Philosophy
Full‐likelihood genomic analysis clarifies a complex history of species divergence and introgression: the example of the erato‐sara group of Heliconius butterflies
Introgressive hybridization plays a key role in adaptive evolution and species diversification in many groups of species. However, frequent hybridization and gene flow between species make estimation of the species phylogeny and key population parameters challenging. Here, we show that by accounting for phasing and using full-likelihood methods, introgression histories and population parameters can be estimated reliably from whole-genome sequence data. We employ the multispecies coalescent (MSC) model with and without gene flow to infer the species phylogeny and cross-species introgression events using genomic data from six members of the erato-sara clade of Heliconius butterflies. The methods naturally accommodate random fluctuations in genealogical history across the genome due to deep coalescence. To avoid heterozygote phasing errors in haploid sequences commonly produced by genome assembly methods, we process and compile unphased diploid sequence alignments and use analytical methods to average over uncertainties in heterozygote phase resolution. There is robust evidence for introgression across the genome, both among distantly related species deep in the phylogeny and between sister species in shallow parts of the tree. We obtain chromosome-specific estimates of key population parameters such as introgression directions, times and probabilities, as well as species divergence times and population sizes for modern and ancestral species. We confirm ancestral gene flow between the sara clade and an ancestral population of Heliconius telesiphe, a likely hybrid speciation origin for Heliconius hecalesia, and gene flow between the sister species Heliconius erato and Heliconius himera. Inferred introgression among ancestral species also explains the history of two chromosomal inversions deep in the phylogeny of the group. 
This study illustrates how a full-likelihood approach based on the MSC makes it possible to extract rich historical information of species divergence and gene flow from genomic data. [3S; BPP; gene flow; Heliconius; hybrid speciation; introgression; inversion; multispecies coalescent]
Sequence clustering for genetic mapping of binary traits
Sequence relatedness has potential application to fine-mapping genetic variants contributing to inherited traits. We investigate the utility of genealogical tree-based approaches to fine-map causal variants in three different projects. In the first project, through coalescent simulation, we compare the ability of several popular methods of association mapping to localize causal variants in a sub-region of a candidate genomic region. We consider four broad classes of association methods, which we describe as single-variant, pooled-variant, joint-modelling and tree-based, under an additive genetic-risk model. We also investigate whether differentiating case sequences based on their carrier status for a causal variant can improve fine-mapping. Our results lend support to the potential of tree-based methods for genetic fine-mapping of disease. In the second project, we develop an R package to dynamically cluster a set of single-nucleotide variant sequences. The resulting partition structures provide important insight into the sequence relatedness. In the third project, we investigate the ability of methods based on sequence relatedness to fine-map rare causal variants and compare it to that of genotypic association methods. Since the true gene genealogy is unknown in reality, we apply the methods developed in the second project to estimate the sequence relatedness. We also pursue the idea of reclassifying case sequences by their carrier status using genealogical nearest neighbours. We find that methods based on sequence relatedness are competitive for fine-mapping rare causal variants. We propose some general recommendations for fine-mapping rare variants in case-control association studies.