
    A hierarchical Dirichlet process mixture model for haplotype reconstruction from multi-population data

    The perennial problem of "how many clusters?" remains an issue of substantial interest in the data mining and machine learning communities, and becomes particularly salient in large data sets such as population genomic data, where the number of clusters needs to be relatively large and open-ended. This problem is further complicated in a co-clustering scenario, in which one needs to solve multiple clustering problems simultaneously because of the presence of common centroids (e.g., ancestors) shared by clusters (e.g., possible descendants of a certain ancestor) from different multiple-cluster samples (e.g., different human subpopulations). In this paper we present a hierarchical nonparametric Bayesian model to address this problem in the context of multi-population haplotype inference. Uncovering the haplotypes of single nucleotide polymorphisms is essential for many biological and medical applications. While it is not uncommon for genotype data to be pooled from multiple ethnically distinct populations, few existing programs have explicitly leveraged the individual ethnic information for haplotype inference. In this paper we present a new haplotype inference program, Haploi, which makes use of such information and is readily applicable to genotype sequences with thousands of SNPs from heterogeneous populations, with competent and sometimes superior speed and accuracy compared to state-of-the-art programs. Underlying Haploi is a new haplotype distribution model based on a nonparametric Bayesian formalism known as the hierarchical Dirichlet process, which represents a tractable surrogate to the coalescent process. The proposed model is exchangeable, unbounded, and capable of coupling demographic information of different populations. Comment: Published at http://dx.doi.org/10.1214/08-AOAS225 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org
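The open-ended cluster count that a Dirichlet process prior allows can be illustrated with the Chinese restaurant process, a standard sequential view of that prior. The sketch below is illustrative only and is not Haploi's implementation: the function name `crp_assignments` and the concentration parameter `alpha` are assumptions introduced here, and the hierarchical (multi-population) layer is omitted.

```python
import random

def crp_assignments(n_items, alpha, seed=0):
    """Sample cluster labels for n_items from a Chinese restaurant process.

    alpha > 0 is the concentration parameter (hypothetical name): larger
    values make it more likely that an item opens a new cluster.
    """
    rng = random.Random(seed)
    counts = []   # counts[k] = number of items already assigned to cluster k
    labels = []
    for _ in range(n_items):
        # An existing cluster k is chosen with weight counts[k];
        # a brand-new cluster is chosen with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)      # open a new cluster
        else:
            counts[k] += 1        # join an existing cluster
        labels.append(k)
    return labels, counts

labels, counts = crp_assignments(200, alpha=2.0)
print(len(counts), "clusters used for", len(labels), "items")
```

The number of clusters is not fixed in advance; it grows with the data at a rate governed by `alpha`, which is the property the abstract refers to as "unbounded".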

    Genomic diversity associated with polymorphic inversions in humans and their close relatives

    Individuals of one species share the bulk of their genetic material, yet no two genomes are the same. Aside from displaying classical variation such as deletions, insertions, or substitutions of base pairs, two DNA segments can also differ in their orientation relative to the rest of their chromosomes. Such inversions are known for a range of biological implications and contribute critically to genome evolution and disease. However, inversions are notoriously challenging to detect, a fact which still impedes comprehensive analysis of their specific properties. This thesis describes several highly interconnected projects aimed at identifying and functionally characterizing inversions present in the human population and related great ape species. First, inversions between human and four great ape species were assessed for their potential to disrupt topologically associating domains (TADs), potentially prompting gene misregulation. TAD boundaries co-located with breakpoints of long inversions, and while disrupted TADs displayed elevated rates of differentially expressed genes, this effect could be attributed to the vicinity to inversion breakpoints, suggesting overall robustness of gene expression in response to TAD disruption. The second part of this thesis describes contributions to a collaborative project aimed at characterizing the full spectrum of inversions in 43 humans. In this study, I co-developed a novel inversion genotyping algorithm based on Strand-specific DNA sequencing and contributed to the description of 398 inversion polymorphisms. Inversions exhibited various underlying formation mechanisms, promotion of gene dysregulation, widespread recurrence, and association with genomic disease. These results suggest that long inversions are much more prominent in humans than previously thought, with at least 0.6% of the genome subject to inversion recurrence and, sometimes, the associated risk of subsequent deleterious mutation.
With a focus on the link between inversions and disease-causing copy number variations, the last project describes a novel algorithm to identify loci hit sequentially by several overlapping mutation events. This algorithm enabled the description of detailed mutation sequences in 20 highly dynamic regions in the human genome, and additional complex variants on chromosome Y. Six complex loci associate directly with a genomic disease, thereby highlighting in detail the intrinsic link between inversions and CNVs. In summary, these projects provide novel insights into the landscape of inversions in humans and primates, which are much more frequent, and often more complex, than previously thought. These findings provide a basis for future inversion studies and highlight the crucial contribution of this class of mutation to genome variation.

    Deep learning in population genetics

    KK is supported by a grant from the Deutsche Forschungsgemeinschaft (DFG) through the TUM International Graduate School of Science and Engineering (IGSSE), GSC 81, within the project GENOMIE QADOP. We acknowledge the support of the Imperial College London - TUM Partnership award. Population genetics is transitioning into a data-driven discipline thanks to the availability of large-scale genomic data and the need to study increasingly complex evolutionary scenarios. With likelihood and Bayesian approaches becoming either intractable or computationally infeasible, machine learning algorithms, and in particular deep learning algorithms, are emerging as popular techniques for population genetic inferences. These approaches rely on algorithms that learn non-linear relationships between the input data and the model parameters being estimated through representation learning from training data sets. Deep learning algorithms currently employed in the field comprise discriminative and generative models with fully connected, convolutional, or recurrent layers. Additionally, a wide range of powerful simulators to generate training data under complex scenarios are now available. The application of deep learning to empirical data sets mostly replicates previous findings of demography reconstruction and signals of natural selection in model organisms. To showcase the feasibility of deep learning to tackle new challenges, we designed a branched architecture to detect signals of recent balancing selection from temporal haplotypic data, which exhibited good predictive performance on simulated data. Investigations on the interpretability of neural networks, their robustness to uncertain training data, and creative representation of population genetic data will provide further opportunities for technological advancements in the field.

    Comprehensive identification and characterisation of germline structural variation within the Iberian population

    One of the central aims of biology and biomedicine has been the characterisation and understanding of genetic variation across humans, to answer important evolutionary questions and to explain phenotypic variability concerning diseases. Understanding genetic variability is key to studying this relationship (through imputation and GWAS) and to translating the results into improved clinical protocols. Different initiatives have emerged around the world to systematically characterise the genetic variability of specific human populations from whole-genome sequences, usually by selecting geographical regions. Examples such as 1000 Genomes (1000G) [1], GoNL [2], HRC, UK10K [3] or the Estonian population [4] have already identified and characterised millions of genetic variants across different populations. In combination with imputation analysis, these sequence-based projects increase the statistical power and resolution of Genome-Wide Association Studies (GWAS), identifying and discovering new disease-associated variants [5]. Additionally, genetic variability among population groups is associated with geographic ancestry and can affect disease risk or treatment efficacy differently [6,7]. For this reason, population-specific reference panels are necessary to characterise their genetic diversity and to assess its effect on human phenotypes, improving GWAS as one of the cornerstones of precision medicine [7]. Existing genetic variability panels include Single Nucleotide Variants (SNVs) and indels (<50bp) but are limited in large Structural Variants (SVs) (≥50bp). Technical and methodological limitations have hindered the discovery of SVs using Next-Generation Sequencing (NGS) technologies, which produced False-Discovery Rates between 9% and 89% and recall between 10% and 70%, depending on the SV type and size [8]. On average, the genomic variation between two human genomes is around 0.1%, but this difference increases to 1.5% with SVs [8].
SVs also affect 3-10 times more nucleotides than SNVs [9] (4M SNVs per genome [10]), showing their potential effect on human phenotypes. For this reason, including a complete catalogue of SVs in reference panels will increase the power of GWAS and provide opportunities to find new disease-associated variants. To overcome these limitations, in this thesis we have generated the first genome-wide Iberian haplotype reference panel, mainly focused on Structural Variants, using 785 samples that were whole-genome sequenced (WGS) at high coverage (30X) from the GCAT-Genomics for life project. We designed a complete strategy, including extensive benchmarking of multiple variant calling programs, specific Logistic Regression Models (LRM) for each SV type, and phasing strategies, to produce a high-quality and comprehensive genetic variability panel. This strategy was benchmarked using different controlled sets of variants, showing high precision and recall values across all variant types and sizes. The application of this strategy to our GCAT whole-genome samples resulted in the identification of 35,431,441 genetic variants, classified as 30,325,064 SNPs, 5,017,199 small indels (< 50bp), and 89,178 larger SVs (≥ 50bp). The latter group was further subclassified into 33,244 deletions, 6,269 duplications, 12,782 insertions, 10,115 inversions, 18,779 transposons and 7,989 translocations, covering all ranges of frequencies and sizes. Besides, 60% of the discovered SVs were not catalogued in any repository, thus extending the knowledge of SVs in humans. Additionally, 52.44% of common and 71.63% of low-frequency SVs were not included in any haplotype reference panel. Thus, new SVs could be used in GWAS, adding more value to the Iberian-GCAT catalogue. The prediction of the functional impact of the SVs shows that these variants might have a central role in several diseases.
Of all SVs included in the Iberian-GCAT catalogue, 46% overlapped genes (both protein-coding and non-protein-coding genes), highlighting their potential impact on human traits. Besides, 92.7% of protein-coding genes were located outside low-complexity (repeated) genomic regions, so short reads from NGS are expected to capture the most interpretable SVs in humans [11]. Moreover, 32.93% of SVs affected protein-coding genes with a predicted loss of function intolerance (pLI) effect, further supporting the potential implication of these variants in complex diseases and therefore enabling a better explanation of missing heritability. Importantly, taking advantage of high coverage (30X), we accurately determined the genotypes of SVs, enabling them to be phased together with SNVs and indels and increasing the SV phasing accuracy, in contrast to 1000G and GoNL. Besides, high coverage allowed the use of Phasing Informative Reads (PIRs), increasing the phasing performance. The overall strategy enables the community to expand and improve the imputation possibilities within GWAS. The Iberian-GCAT haplotype reference panel created in this thesis accurately imputes common SVs, with nearly 100% agreement with sequencing results. Although the Iberian-GCAT haplotype reference panel can be used in all populations from different continental groups, due to closer ancestries the imputation performance is high in European and Latin American populations, as reflected in the number of low-frequency (1% ≤ MAF MAF) variants imputed at high info scores. These results demonstrate the versatility of our resource, whose performance increases in closer ancestries. In general, we observed that when the allele frequency decreases, the imputation accuracy drops too, highlighting the necessity of including more samples in reference panels to efficiently impute low-frequency and rare variants, which are normally expected to have more functional impact on diseases.
Finally, we compared the imputation possibilities of the 1000G and GoNL reference panels with our Iberian-GCAT reference panel. We observed that the Iberian-GCAT reference panel outperformed 1000G and GoNL in the imputation of high-quality SVs by 2.7-fold and 1.6-fold, respectively. Also, the overall imputation quality is higher, showing the value of this new resource in GWAS as it includes more SVs than previous reference panels. The combination of different reference panels will improve the resolution and statistical power of GWAS, thus increasing the chances of finding more risk variants in complex diseases and, ultimately, translating this insight to precision medicine.

    Phenotype prediction and feature selection in genome-wide association studies

    Genome-wide association studies (GWAS) search for correlations between single nucleotide polymorphisms (SNPs) in a subject genome and an observed phenotype. GWAS can be used to generate models for predicting phenotype based on genotype, as well as to aid in the identification of specific genes affecting the biological mechanism underlying the phenotype. In this investigation, phenotype prediction models are constructed from GWAS training data and are evaluated for performance on test data. Three methods are used to rank SNPs by their correlation with the phenotype: the univariate Wald test, a multivariate, support vector machine (SVM) based technique, and a hybrid method where a subset of top-ranked SNPs from the Wald test are used to train the SVM. Both case-control studies and quantitative phenotypes are examined. For each method and data set, a series of least squares linear regression models is generated from nested subsets of the best SNPs from each ranking method. The accuracy of these models is determined on a test data set, and a plot of prediction performance against the number of top-ranked SNPs considered is generated. The SVM and hybrid methods are found to be consistently superior to the Wald test in ranking predictive SNPs. The hybrid method allows the trade-off between increasing accuracy and using fewer SNPs to be optimized as desired.
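The univariate ranking step and the nested subsets described above can be sketched as follows for a quantitative phenotype. This is a minimal illustration under stated assumptions, not the study's code: the function names `wald_statistic`, `rank_snps` and `nested_subsets` are hypothetical, genotypes are assumed to be coded as 0/1/2 allele counts with one list per SNP, and the SVM ranking and the regression fit on each subset are left out.

```python
import math

def wald_statistic(x, y):
    """Wald (t) statistic for the slope when y is regressed on one SNP x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    if sxx == 0 or sxy == 0:
        return 0.0  # monomorphic SNP or zero estimated slope
    slope = sxy / sxx
    rss = sum((b - my - slope * (a - mx)) ** 2 for a, b in zip(x, y))
    if rss == 0:
        return math.copysign(math.inf, slope)  # perfect linear fit
    standard_error = math.sqrt(rss / (n - 2) / sxx)
    return slope / standard_error

def rank_snps(genotypes, phenotype):
    """Return SNP indices ordered by decreasing |Wald statistic|."""
    stats = [abs(wald_statistic(column, phenotype)) for column in genotypes]
    return sorted(range(len(genotypes)), key=lambda j: -stats[j])

def nested_subsets(ranking, sizes):
    """Top-k prefixes of the ranking; each subset contains the previous one."""
    return [ranking[:k] for k in sizes]

# Toy data: three SNPs (one list per SNP) for six individuals.
genotypes = [
    [0, 1, 2, 0, 1, 2],   # tracks the phenotype closely
    [1, 1, 0, 2, 0, 2],   # weakly related
    [0, 0, 0, 0, 0, 0],   # monomorphic, carries no information
]
phenotype = [0.1, 1.0, 2.1, 0.0, 0.9, 2.0]
ranking = rank_snps(genotypes, phenotype)
print(ranking, nested_subsets(ranking, [1, 2]))
```

Each nested subset would then feed one least squares model, so that test-set accuracy can be plotted against the number of top-ranked SNPs.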

    PedGenie: an analysis approach for genetic association testing in extended pedigrees and genealogies of arbitrary size

    BACKGROUND: We present a general approach to perform association analyses in pedigrees of arbitrary size and structure, which also allows for a mixture of pedigree members and independent individuals to be analyzed together, to test genetic markers and qualitative or quantitative traits. Our software, PedGenie, uses Monte Carlo significance testing to provide a valid test for related individuals that can be applied to any test statistic, including transmission disequilibrium statistics. Single-locus tests, composite genotype tests, and haplotype analyses may all be performed. We illustrate the validity and functionality of PedGenie using simulated and real data sets. For the real data set, we evaluated the role of two tagging single nucleotide polymorphisms (tSNPs) in the DNA repair gene, NBS1, and their association with female breast cancer in 462 cases and 572 controls selected to be BRCA1/2 mutation negative from 139 high-risk Utah breast cancer families. RESULTS: The results from PedGenie were shown to be valid both for accurate p-value calculations and consideration of pedigree structure in the simulated data set. A nominally significant association with breast cancer was observed with the NBS1 tSNP rs709816 for carriage of the rare allele (OR = 1.61, 95% CI = 1.10–2.35, p = 0.019). CONCLUSION: PedGenie is a flexible and valid statistical tool that is intuitively simple to understand, makes efficient use of all the data available from pedigrees without requiring trimming, and is flexible as to the types of tests to which it can be applied. Further, our analyses of real data indicate NBS1 may play a role in the genetic etiology of heritable breast cancer.
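The general idea of Monte Carlo significance testing, which works for any test statistic, can be sketched as an empirical p-value computed from statistics simulated under the null hypothesis. This is a generic illustration and not PedGenie's code: `monte_carlo_p_value` and `simulate_null_statistic` are hypothetical names, and how null data are generated for related individuals is deliberately left to the caller.

```python
import random

def monte_carlo_p_value(observed, simulate_null_statistic, n_sims=1000, seed=0):
    """Empirical one-sided p-value for an observed test statistic.

    simulate_null_statistic(rng) is a caller-supplied function (an assumption
    of this sketch) that returns one statistic computed on data generated
    under the null hypothesis.
    """
    rng = random.Random(seed)
    exceed = sum(
        1 for _ in range(n_sims) if simulate_null_statistic(rng) >= observed
    )
    # Adding 1 to numerator and denominator counts the observed data as one
    # null configuration, so the estimate is never exactly zero.
    return (exceed + 1) / (n_sims + 1)

# Toy null distribution: a uniform statistic on [0, 1).
print(monte_carlo_p_value(0.95, lambda rng: rng.random(), n_sims=999))
```

Because only the simulation step depends on the data structure, the same wrapper applies whether the statistic is an odds ratio, a chi-square, or a transmission disequilibrium statistic.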

    Subtle changes in chromatin loop contact propensity are associated with differential gene regulation and expression.

    While genetic variation at chromatin loops is relevant for human disease, the relationships between contact propensity (the probability that loci at loops physically interact), genetics, and gene regulation are unclear. We quantitatively interrogate these relationships by comparing Hi-C and molecular phenotype data across cell types and haplotypes. While chromatin loops consistently form across different cell types, they have subtle quantitative differences in contact frequency that are associated with larger changes in gene expression and H3K27ac. For the vast majority of loci with quantitative differences in contact frequency across haplotypes, the changes in magnitude are smaller than those across cell types; however, the proportional relationships between contact propensity, gene expression, and H3K27ac are consistent. These findings suggest that subtle changes in contact propensity have a biologically meaningful role in gene regulation and could be a mechanism by which regulatory genetic variants in loop anchors mediate effects on expression.

    Inference of Population Structure using Dense Haplotype Data

    The advent of genome-wide dense variation data provides an opportunity to investigate ancestry in unprecedented detail, but presents new statistical challenges. We propose a novel inference framework that aims to efficiently capture information on population structure provided by patterns of haplotype similarity. Each individual in a sample is considered in turn as a recipient, whose chromosomes are reconstructed using chunks of DNA donated by the other individuals. Results of this “chromosome painting” can be summarized as a “coancestry matrix,” which directly reveals key information about ancestral relationships among individuals. If markers are viewed as independent, we show that this matrix almost completely captures the information used by both standard Principal Components Analysis (PCA) and model-based approaches such as STRUCTURE in a unified manner. Furthermore, when markers are in linkage disequilibrium, the matrix combines information across successive markers to increase the ability to discern fine-scale population structure using PCA. In parallel, we have developed an efficient model-based approach to identify discrete populations using this matrix, which offers advantages over PCA in terms of interpretability and over existing clustering algorithms in terms of speed, number of separable populations, and sensitivity to subtle population structure. We analyse Human Genome Diversity Panel data for 938 individuals and 641,000 markers, and we identify 226 populations reflecting differences on continental, regional, local, and family scales. We present multiple lines of evidence that, while many methods capture similar information among strongly differentiated groups, more subtle population structure in human populations is consistently present at a much finer level than currently available geographic labels and is only captured by the haplotype-based approach. 
The software used for this article, ChromoPainter and fineSTRUCTURE, is available from http://www.paintmychromosomes.com/
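How a painting can be summarized as a coancestry matrix is sketched below in a deliberately simplified form. It assumes each recipient's painting has already been reduced to a hard list of donor indices, one per chunk. That is an assumption of this sketch, not how ChromoPainter works internally, and the name `coancestry_matrix` is hypothetical.

```python
def coancestry_matrix(paintings, n_individuals):
    """Count, for each recipient, how many chunks each donor contributed.

    paintings[i] is the list of donor indices (one per chunk) used to
    reconstruct recipient i. Entry [i][j] of the result is the number of
    chunks recipient i copied from donor j.
    """
    matrix = [[0] * n_individuals for _ in range(n_individuals)]
    for recipient, donors in enumerate(paintings):
        for donor in donors:
            if donor == recipient:
                raise ValueError("an individual cannot donate to itself")
            matrix[recipient][donor] += 1
    return matrix

# Toy example with three individuals.
paintings = [
    [1, 1, 2],      # individual 0 copies two chunks from 1 and one from 2
    [0, 2],         # individual 1 copies one chunk each from 0 and 2
    [0, 0, 1, 1],   # individual 2 copies two chunks each from 0 and 1
]
for row in coancestry_matrix(paintings, 3):
    print(row)
```

Rows with similar donor profiles point to individuals with similar ancestry, which is the signal that PCA or model-based clustering can then pick up from the matrix.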