
    Computational and Statistical Approaches for Large-Scale Genome-Wide Association Studies

    Over the past decade, genome-wide association studies (GWAS) have proven successful at shedding light on the underlying genetic variations that affect the risk of human complex diseases, which can be translated to novel preventative and therapeutic strategies. My research aims at identifying novel disease-associated genetic variants through large-scale GWAS and developing computational and statistical pipelines and methods to improve the power and accuracy of GWAS. Bicuspid aortic valve (BAV) is a congenital heart defect characterized by fusion of two of the normal three leaflets of the aortic valve. As the most common cardiovascular malformation in humans, BAV is moderately heritable and is an important risk factor for valvulopathy and aortopathy, but its genetic origins remain elusive. In Chapter 2, we present the first large-scale GWAS to identify novel genetic variants associated with BAV. We report association with a non-coding variant 151kb from the gene encoding the cardiac-specific transcription factor, GATA4, and near-significance for p.Ser377Gly in GATA4. We used multiple bioinformatics approaches to demonstrate that the GATA4 gene is a plausible biological candidate. In the subsequent functional follow-up, GATA4 was interrupted by CRISPR-Cas9 in induced pluripotent stem cells from healthy donors. The disruption of GATA4 significantly impaired the transition from endothelial cells into mesenchymal cells, a critical step in heart valve development. Genotype imputation is widely used in GWAS to perform in silico genotyping, leading to higher power to identify novel genetic signals. When consent restrictions prevent multiple reference panels from being combined, it is unclear how to combine the imputation results to optimize the power of genetic association tests. In Chapter 3, we compared the accuracy of 9,265 Norwegian genomes imputed from three reference panels: 1000 Genomes Phase 3 (1000G), Haplotype Reference Consortium (HRC), and a reference panel containing 2,201 Norwegian participants from the HUNT study with low-pass genome sequencing. We observed that the overall imputation accuracy from the population-specific panel was substantially higher than that from 1000G and was comparable with that from HRC, despite HRC being 15-fold larger. We also evaluated different strategies to utilize multiple sets of imputed genotypes to increase the power of association studies. We propose that testing association for all variants imputed from any panel results in higher power to detect association than the alternative strategy of testing only the version of each genetic variant with the highest imputation quality metric. In phenome-wide GWAS by large biobanks, most binary traits have substantially fewer cases than controls. Both of the widely used approaches, the linear mixed model and the recently proposed logistic mixed model, perform poorly, producing large type I error rates, in the analysis of phenotypes with unbalanced case-control ratios. In Chapter 4, we propose a scalable and accurate generalized mixed model association test that uses the saddlepoint approximation (SPA) to calibrate the distribution of score test statistics. This method, SAIGE, provides accurate p-values even when case-control ratios are extremely unbalanced. It utilizes state-of-the-art optimization strategies to reduce the computational time and memory cost of fitting generalized mixed models. The computational cost depends linearly on sample size, so the method is applicable to GWAS for thousands of phenotypes by large biobanks. Through the analysis of UK Biobank data of 408,961 white British European-ancestry samples for 1,403 dichotomous phenotypes, we show that SAIGE can efficiently analyze large sample data, controlling for unbalanced case-control ratios and sample relatedness.
    PhD, Bioinformatics, University of Michigan, Horace H. Rackham School of Graduate Studies
    https://deepblue.lib.umich.edu/bitstream/2027.42/144097/1/zhowei_1.pd
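
The saddlepoint calibration described in this abstract can be illustrated with a small sketch. It is not SAIGE's implementation: it assumes independent Bernoulli outcomes with known fitted means `mu`, ignores sample relatedness, and uses the textbook Lugannani-Rice tail formula. The function names (`spa_upper_tail`, `normal_upper_tail`) and the toy data are hypothetical.

```python
import math


def normal_upper_tail(z):
    """Upper-tail probability of a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def spa_upper_tail(g, mu, s):
    """Saddlepoint approximation to P(S >= s), where
    S = sum_i g_i * (Y_i - mu_i) and Y_i ~ Bernoulli(mu_i) independently.
    Assumption: mu_i are treated as known; relatedness is ignored."""
    shift = sum(gi * mi for gi, mi in zip(g, mu))

    def cgf_derivs(t):
        # Cumulant generating function K(t) of S and its first two derivatives.
        k0, k1, k2 = -t * shift, -shift, 0.0
        for gi, mi in zip(g, mu):
            e = mi * math.exp(gi * t)
            denom = 1.0 - mi + e
            q = e / denom
            k0 += math.log(denom)
            k1 += gi * q
            k2 += gi * gi * q * (1.0 - q)
        return k0, k1, k2

    variance = cgf_derivs(0.0)[2]
    lo, hi = -20.0, 20.0
    if not (cgf_derivs(lo)[1] < s < cgf_derivs(hi)[1]):
        raise ValueError("s is outside the range this sketch can handle")
    # K'(t) is increasing, so bisection finds the saddlepoint K'(t) = s.
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if cgf_derivs(mid)[1] < s:
            lo = mid
        else:
            hi = mid
    t = 0.5 * (lo + hi)
    if abs(t) < 1e-4:
        # Near the mean the formula is numerically unstable; use the normal tail.
        return normal_upper_tail(s / math.sqrt(variance))
    k0, _, k2 = cgf_derivs(t)
    w = math.copysign(math.sqrt(2.0 * (t * s - k0)), t)
    v = t * math.sqrt(k2)
    return normal_upper_tail(w + math.log(v / w) / w)


# Toy unbalanced setting: 1,000 carriers, each with a 1% case probability.
g = [1.0] * 1000
mu = [0.01] * 1000
p_spa = spa_upper_tail(g, mu, 15.0)
p_normal = normal_upper_tail(15.0 / math.sqrt(1000 * 0.01 * 0.99))
print(p_spa, p_normal)
```

In this toy setting the score statistic is strongly right-skewed, so the normal approximation gives a much smaller upper-tail p-value than the saddlepoint approximation, which is the kind of miscalibration behind inflated type I error for unbalanced case-control ratios.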

    Epigenetic consequences of interploidal hybridisation in synthetic and natural interspecific potato hybrids

    Interploidal hybridisation can generate changes in plant chromosome numbers, which might exert effects additional to those expected from genome merger per se (i.e., genetic, epigenetic and phenotypic novelties). Wild potatoes are suitable to address this question in an evolutionary context. To this end, we performed genetic (AFLP and SSR), epigenetic (MSAP), and cytological comparisons in: i) natural populations of the diploid cytotype of the hybrid taxonomic species Solanum x rechei (2n=2x, 3x) and its parental species, the triploid cytotype of Solanum microdontum (2n=2x, 3x) and Solanum kurtzianum (2n=2x); and ii) newly synthesised intraploidal (2x x 2x) and interploidal (3x x 2x) S. microdontum x S. kurtzianum hybrids. Aneuploidy was detected in S. x rechei and the synthetic interploidal progeny; this phenomenon might have given rise to the significantly higher number of methylation changes observed in the interploidal vs. the intraploidal hybrids. The wide epigenetic variability induced by interploidal hybridisation is consistent with the novel epigenetic pattern established in S. x rechei compared to its parental species in nature. These results suggest that aneuploid potato lineages can persist throughout the short term, and possibly medium term, and that differences in parental ploidy resulting in aneuploidy are an additional source of epigenetic variation.
    Affiliation: Cara, Nicolás. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Mendoza. Instituto de Biología Agrícola de Mendoza. Universidad Nacional de Cuyo. Facultad de Ciencias Agrarias. Instituto de Biología Agrícola de Mendoza; Argentina
    Affiliation: Ferrer, María Soledad. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Mendoza. Instituto de Biología Agrícola de Mendoza. Universidad Nacional de Cuyo. Facultad de Ciencias Agrarias. Instituto de Biología Agrícola de Mendoza; Argentina
    Affiliation: Masuelli, Ricardo Williams. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Mendoza. Instituto de Biología Agrícola de Mendoza. Universidad Nacional de Cuyo. Facultad de Ciencias Agrarias. Instituto de Biología Agrícola de Mendoza; Argentina
    Affiliation: Camadro, Elsa Lucila. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Mar del Plata; Argentina. Universidad Nacional de Mar del Plata; Argentina
    Affiliation: Marfil, Carlos Federico. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Mendoza. Instituto de Biología Agrícola de Mendoza. Universidad Nacional de Cuyo. Facultad de Ciencias Agrarias. Instituto de Biología Agrícola de Mendoza; Argentina

    Evaluation of experimental design and computational parameter choices affecting analyses of ChIP-seq and RNA-seq data in undomesticated poplar trees.

    Background: One of the great advantages of next generation sequencing is the ability to generate large genomic datasets for virtually all species, including non-model organisms. It should be possible, in turn, to apply advanced computational approaches to these datasets to develop models of biological processes. In a practical sense, working with non-model organisms presents unique challenges. In this paper we discuss some of these challenges for ChIP-seq and RNA-seq experiments using the undomesticated tree species of the genus Populus. Results: We describe specific challenges associated with experimental design in Populus, including selection of optimal genotypes for different technical approaches and development of antibodies against Populus transcription factors. Execution of the experimental design included the generation and analysis of Chromatin immunoprecipitation-sequencing (ChIP-seq) data for RNA polymerase II and transcription factors involved in wood formation. We discuss criteria for analyzing the resulting datasets, determination of appropriate control sequencing libraries, evaluation of sequencing coverage needs, and optimization of parameters. We also describe the evaluation of ChIP-seq data from Populus, and discuss the comparison between ChIP-seq and RNA-seq data and biological interpretations of these comparisons. Conclusions: These and other "lessons learned" highlight the challenges but also the potential insights to be gained from extending next generation sequencing-supported network analyses to undomesticated non-model species.

    Fast and accurate population admixture inference from genotype data from a few microsatellites to millions of SNPs

    Model-based (likelihood and Bayesian) and non-model-based (PCA and K-means clustering) methods were developed to identify populations and assign individuals to the identified populations using marker genotype data. Model-based methods are favoured because they are based on a probabilistic model of population genetics with biologically meaningful parameters and thus produce results that are easily interpretable and applicable. Furthermore, they often yield more accurate structure inferences than non-model-based methods. However, current model-based methods either are computationally demanding and thus applicable to small problems only or use simplified admixture models that could yield inaccurate results in difficult situations such as unbalanced sampling. In this study, I propose new likelihood methods for fast and accurate population admixture inference using genotype data from a few multiallelic microsatellites to millions of diallelic SNPs. The methods first conduct a clustering analysis of coarse-grained population structure by using the mixture model and the simulated annealing algorithm, and then an admixture analysis of fine-grained population structure by using the clustering results as a starting point in an expectation maximisation algorithm. Extensive analyses of both simulated and empirical data show that the new methods compare favourably with existing methods in both accuracy and running speed. They can analyse small datasets with just a few multiallelic microsatellites but can also handle in parallel terabytes of data with millions of markers and millions of individuals. In difficult situations such as many and/or lowly differentiated populations, unbalanced or very small samples of individuals, the new methods are substantially more accurate than other methods.
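
The admixture step of the two-stage procedure above can be sketched as a plain expectation maximisation loop. This is a simplified illustration, not the author's method: it assumes diallelic SNPs, treats the population allele frequencies `freqs` as known (the abstract's methods estimate structure starting from clustering results), and estimates one individual's admixture proportions only. The function name `em_admixture` and the toy data are hypothetical.

```python
def em_admixture(genotypes, freqs, q_init=None, n_iter=200):
    """Estimate admixture proportions q for one individual.

    genotypes: counts (0, 1 or 2) of the reference allele at each SNP.
    freqs[k][l]: assumed known reference-allele frequency of population k
                 at SNP l, strictly between 0 and 1.
    q_init: optional starting proportions (for example from a clustering step).
    """
    n_pops = len(freqs)
    q = list(q_init) if q_init else [1.0 / n_pops] * n_pops
    for _ in range(n_iter):
        expected = [0.0] * n_pops
        n_copies = 0
        for locus, x in enumerate(genotypes):
            # Each genotype carries x reference copies and 2 - x alternative copies.
            for copies, is_ref in ((x, True), (2 - x, False)):
                if copies == 0:
                    continue
                # E-step: posterior probability that a copy came from population k.
                like = [
                    q[k] * (freqs[k][locus] if is_ref else 1.0 - freqs[k][locus])
                    for k in range(n_pops)
                ]
                total = sum(like)
                for k in range(n_pops):
                    expected[k] += copies * like[k] / total
                n_copies += copies
        # M-step: proportions are the average posterior assignments.
        q = [e / n_copies for e in expected]
    return q


# Toy data: two populations with opposite allele frequencies at 20 SNPs.
freqs = [[0.9] * 20, [0.1] * 20]
q_pure = em_admixture([2] * 20, freqs)    # homozygous for the reference allele
q_mixed = em_admixture([1] * 20, freqs)   # heterozygous everywhere
print(q_pure, q_mixed)
```

Passing the output of a coarse clustering as `q_init` mirrors the idea of using clustering results as the starting point, although a full implementation would also update allele frequencies.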

    Genomic prediction and quantitative trait locus discovery in a cassava training population constructed from multiple breeding stages

    Open Access Article; Published online: 11 Dec 2019. Assembly of a training population (TP) is an important component of effective genomic selection-based breeding programs. In this study, we examined the power of diverse germplasm assembled from two cassava (Manihot esculenta Crantz) breeding programs in Tanzania at different breeding stages to predict traits and discover quantitative trait loci (QTL). This is the first genomic selection and genome-wide association study (GWAS) on Tanzanian cassava data. We detected QTL associated with cassava mosaic disease (CMD) resistance on chromosomes 12 and 16; QTL conferring resistance to cassava brown streak disease (CBSD) on chromosomes 9 and 11; and QTL on chromosomes 2, 3, 8, and 10 associated with resistance to CBSD for root necrosis. We detected a QTL on chromosome 4 and two QTL on chromosome 12 conferring dual resistance to CMD and CBSD. The use of clones in the same stage to construct TPs provided higher trait prediction accuracy than TPs with a mixture of clones from multiple breeding stages. Moreover, clones in the early breeding stage provided more reliable trait prediction accuracy and are better candidates for constructing a TP. Although larger TP sizes have been associated with improved accuracy, in this study, adding clones from Kibaha to those from Ukiriguru and vice versa did not improve the prediction accuracy of either population. Including the Ugandan TP in either population did not improve trait prediction accuracy. This study applied genomic prediction to understand the implications of constructing TP from clones at different breeding stages pooled from different locations on trait accuracy.

    Random Forest as a tumour genetic marker extractor

    Identifying tumour genetic markers is an essential task for biomedicine. In this thesis, we analyse a dataset of chromosomal rearrangements of cancer samples and present a methodology for extracting genetic markers from this dataset by using a Random Forest as a feature selection tool.

    An AUC-based Permutation Variable Importance Measure for Random Forests

    The random forest (RF) method is a commonly used tool for classification with high dimensional data as well as for ranking candidate predictors based on the so-called random forest variable importance measures (VIMs). However, the classification performance of RF is known to be suboptimal in the case of strongly unbalanced data, i.e. data where response class sizes differ considerably. Suggestions were made to obtain better classification performance based either on sampling procedures or on cost sensitivity analyses. However, to our knowledge, the performance of the VIMs has not yet been examined in the case of unbalanced response classes. In this paper we explore the performance of the permutation VIM for unbalanced data settings and introduce an alternative permutation VIM based on the area under the curve (AUC) that is expected to be more robust towards class imbalance. We investigated the performance of the standard permutation VIM and of our novel AUC-based permutation VIM for different class imbalance levels using simulated data and real data. The results suggest that the standard permutation VIM loses its ability to discriminate between associated predictors and predictors not associated with the response for increasing class imbalance. It is outperformed by our new AUC-based permutation VIM for unbalanced data settings, while the performance of both VIMs is very similar in the case of balanced classes. The new AUC-based VIM is implemented in the R package party for the unbiased RF variant based on conditional inference trees. The codes implementing our study are available from the companion website: http://www.ibe.med.uni-muenchen.de/organisation/mitarbeiter/070_drittmittel/janitza/index.html
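
The idea of an AUC-based permutation importance can be sketched in a few lines. This is an illustration under simplifying assumptions, not the implementation in the R package party: a hypothetical fixed scoring function (`score_fn`) stands in for the forest, and the importance is computed on one dataset rather than on the out-of-bag observations of each tree. All names and data below are made up for the example.

```python
import random


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparisons (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_permutation_importance(score_fn, X, y, feature, n_repeats=20, seed=0):
    """Mean drop in AUC after randomly permuting one predictor column."""
    rng = random.Random(seed)
    base = auc([score_fn(row) for row in X], y)
    drops = []
    for _ in range(n_repeats):
        column = [row[feature] for row in X]
        rng.shuffle(column)
        permuted = [
            row[:feature] + [value] + row[feature + 1:]
            for row, value in zip(X, column)
        ]
        drops.append(base - auc([score_fn(row) for row in permuted], y))
    return sum(drops) / n_repeats


# Toy unbalanced data: 10 cases and 190 controls, two predictors.
# Predictor 0 separates the classes; predictor 1 is unrelated to the response.
y = [1] * 10 + [0] * 190
X = [[1.0 + 0.01 * i, float(i % 7)] for i in range(10)]
X += [[0.001 * i, float(i % 7)] for i in range(190)]


def score_fn(row):
    # Hypothetical stand-in for a fitted classifier: it uses predictor 0 only.
    return row[0]


imp_informative = auc_permutation_importance(score_fn, X, y, feature=0)
imp_noise = auc_permutation_importance(score_fn, X, y, feature=1)
print(imp_informative, imp_noise)
```

Because AUC compares every case with every control, it does not depend on a classification threshold or on the class ratio in the way an error rate does, which is the motivation for using it under imbalance.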