27 research outputs found

    Accuracy of Gene Expression Prediction From Genotype Data With PrediXcan Varies Across and Within Continental Populations

    Using genetic data to predict gene expression has garnered significant attention in recent years. PrediXcan has become one of the most widely used gene-based methods for testing associations between predicted gene expression values and a phenotype, which has facilitated novel insights into the relationship between complex traits and the component of gene expression that can be attributed to genetic variation. The gene expression prediction models for PrediXcan were developed using supervised machine learning methods and training data from the Depression Genes and Networks (DGN) study and the Genotype-Tissue Expression (GTEx) project, where the majority of subjects are of European descent. Many genetic studies, however, include samples from multi-ethnic populations, and in this paper we evaluate the accuracy of PrediXcan for predicting gene expression in diverse populations. Using transcriptomic data from the GEUVADIS (Genetic European Variation in Disease) RNA sequencing project and whole genome sequencing data from the 1000 Genomes project, we evaluate and compare the predictive performance of PrediXcan in an African population (Yoruban) and four European ancestry populations for thousands of genes. We evaluate a range of models from the PrediXcan weight databases and use Pearson's correlation coefficient to assess gene expression prediction accuracy with PrediXcan. From our evaluation, we find that the predictive performance of PrediXcan varies substantially among populations from different continents (F-test p-value < 2.2 × 10⁻¹⁶), where prediction accuracy is lower in the Yoruban population from West Africa compared to the European-ancestry populations. Moreover, not only do we find differences in predictive performance between populations from different continents, we also find highly significant differences in prediction accuracy among the four European ancestry populations considered (F-test p-value < 2.2 × 10⁻¹⁶). Finally, while there is variability in prediction accuracy across different PrediXcan weight databases, we also find consistency in the qualitative performance of PrediXcan for the five populations considered, with the African ancestry population having the lowest accuracy across databases.
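
The accuracy metric named in this abstract, Pearson's correlation between predicted and measured expression, can be sketched as follows. This is a minimal illustration only: the function name, the population labels, and the toy values are assumptions, not data or code from the study.

```python
import math

def pearson_r(predicted, observed):
    """Pearson correlation between predicted and observed expression values."""
    if len(predicted) != len(observed) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    if var_p == 0 or var_o == 0:
        return float("nan")  # correlation is undefined for a constant vector
    return cov / math.sqrt(var_p * var_o)

# Hypothetical toy data: one gene, predicted vs. measured expression per population
toy = {
    "POP_A": ([0.1, 0.4, 0.35, 0.8], [0.2, 0.5, 0.3, 0.9]),
    "POP_B": ([0.1, 0.4, 0.35, 0.8], [0.6, 0.2, 0.7, 0.4]),
}
accuracy = {pop: pearson_r(pred, obs) for pop, (pred, obs) in toy.items()}
```

Repeating this per gene and per population would give the kind of accuracy table that a between-population comparison could then be run on.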

    Optimizing Gene Expression Prediction and Omics Integration in Populations of African Ancestry

    Popular transcriptome imputation methods such as PrediXcan and FUSION use parametric linear assumptions, and thus are unable to flexibly model the complex genetic architecture of the transcriptome. Although non-linear modeling has been shown to improve imputation performance, replicability and potential cross-population differences have not been adequately studied. Therefore, to optimize imputation performance across global populations, we used the non-linear machine learning (ML) models random forest (RF), support vector regression (SVR), and K nearest neighbor (KNN) to build transcriptome imputation models, and evaluated their performance in comparison to elastic net (EN). We trained gene expression prediction models using genotype and blood monocyte transcriptome data from the Multi-Ethnic Study of Atherosclerosis (MESA) comprising individuals of African, Hispanic, and European ancestries and tested them using genotype and whole blood transcriptome data from the Modeling the Epidemiology Transition Study (METS) comprising individuals of African ancestries. We show that the prediction performance is highest when the training and the testing population share similar ancestries regardless of the prediction algorithm used. While EN generally outperformed RF, SVR, and KNN, we found that RF outperforms EN for some genes, particularly between disparate ancestries, suggesting potential robustness and reduced variability of RF imputation performance across global populations. When applied to a high-density lipoprotein (HDL) phenotype, we show that including RF prediction models in PrediXcan reveals potential gene associations missed by EN models. Therefore, by integrating non-linear modeling into PrediXcan and diversifying our training populations to include more global ancestries, we may uncover new genes associated with complex traits. We did not find any significant associations when the prediction models were applied to obesity status and microbiome diversity.

    A Review of Integrative Imputation for Multi-Omics Datasets

    Multi-omics studies, which explore the interactions between multiple types of biological factors, have significant advantages over single-omics analysis for their ability to provide a more holistic view of biological processes, uncover the causal and functional mechanisms for complex diseases, and facilitate new discoveries in precision medicine. However, omics datasets often contain missing values, and in multi-omics study designs it is common for individuals to be represented for some omics layers but not all. Since most statistical analyses cannot be applied directly to the incomplete datasets, imputation is typically performed to infer the missing values. Integrative imputation techniques which make use of the correlations and shared information among multi-omics datasets are expected to outperform approaches that rely on single-omics information alone, resulting in more accurate results for the subsequent downstream analyses. In this review, we provide an overview of the currently available imputation methods for handling missing values in bioinformatics data with an emphasis on multi-omics imputation. In addition, we also provide a perspective on how deep learning methods might be developed for the integrative imputation of multi-omics datasets.

    Genomic Contributors to Individual Differences in Reward-Related Neural Activity

    Aberrant reward-related behavior, including impulsive and risk-taking behaviors, is a common feature of externalizing psychopathology (e.g., attention deficit hyperactivity disorder, antisocial personality disorder, and substance-use disorders). Through imaging studies, these behaviors have been linked to dysregulated reactivity within a diffuse reward-related corticostriatal neural network, including the striatum, frontal regions (namely orbital, ventromedial, and dorsolateral cortices), the insula, and the hippocampus. Because variability in risk-taking behavior and related psychopathology is moderately-to-largely heritable (i.e., with estimates ranging from 40–80%), a genetically-informed approach is well-positioned to provide valuable insight into the etiology of reward-related neural and behavioral phenotypes that characterize externalizing psychopathology. Using summary statistics from a recent genome-wide association study (GWAS) of risk tolerance among 939,908 individuals, we generated polygenic risk scores (PRS) for a European-ancestry subsample (usable data ranging from n=457 to n=518; see Table 2) of the Duke Neurogenetics Study (DNS; a large community sample) and examined associations between genomic liability and risk-taking phenotypes (i.e., self-reported impulsivity and alcohol use, and behavioral delay discounting), as well as BOLD activation of the ventral striatum. Contrary to our hypotheses, GWAS-based PRS were not consistently significantly associated with risk-related behavior or with activation of the ventral striatum. In order to increase biological informativeness, we also used PrediXcan analyses to identify genes with differential expression based on the risk-related genomic liability; however, PRS of these differentially-expressed variants were also not significantly associated with risk-related behavioral or neural-activation phenotypes in the DNS. Though these null findings may reflect a true lack of association between risk-related genetic liability and behavior/neural externalizing phenotypes, we discuss possible alternative explanations regarding imprecise phenotyping in the discovery GWAS, inadequate statistical power, and questionable reliability of task-based fMRI measurements.
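
A polygenic risk score of the kind mentioned in this abstract is commonly computed as a weighted sum of effect-allele dosages. The sketch below shows only that standard additive form. The abstract does not state which software, SNP selection, or thresholds were used, so the function name, SNP identifiers, effect sizes, and genotypes are all hypothetical.

```python
def polygenic_score(dosages, weights):
    """Weighted sum of effect-allele dosages (0, 1, or 2) over the SNPs in `weights`.

    SNPs missing from an individual's genotypes are skipped. This is an
    assumption of the sketch; real pipelines often impute them instead.
    """
    return sum(beta * dosages[snp] for snp, beta in weights.items() if snp in dosages)

# Hypothetical GWAS effect sizes and genotypes, for illustration only
gwas_weights = {"snp1": 0.02, "snp2": -0.01, "snp3": 0.05}
individuals = {
    "id1": {"snp1": 2, "snp2": 0, "snp3": 1},
    "id2": {"snp1": 0, "snp2": 2, "snp3": 0},
}
scores = {iid: polygenic_score(geno, gwas_weights) for iid, geno in individuals.items()}
```

Each individual's score could then be tested for association with a phenotype, for example by regression.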

    Incorporating Sex Chromosomes in Transcriptome Prediction Models and Improving Cross-Population Prediction Performance

    Transcriptome prediction models built with data from European-descent individuals are less accurate when applied to different populations because of differences in linkage disequilibrium patterns and allele frequencies. We hypothesized multivariate adaptive shrinkage may improve cross-population transcriptome prediction, as it leverages effect size estimates across different conditions - in this case, different populations. To test this hypothesis, we made transcriptome prediction models for use in transcriptome-wide association studies (TWAS) using different methods (Elastic Net, Matrix eQTL and Multivariate Adaptive Shrinkage in R (MASHR)) and tested their out-of-sample transcriptome prediction accuracy in population-matched and cross-population scenarios. Additionally, to evaluate model applicability in TWAS, we integrated publicly available multi-ancestry genome-wide association study (GWAS) summary statistics from the Population Architecture using Genomics and Epidemiology Study (PAGE) and Pan-UK Biobank with our developed transcriptome prediction models. In regard to transcriptome prediction accuracy, MASHR models had similar performance to other methods when the training population ancestry closely matched the test population, but outperformed other methods in cross-population predictions. Furthermore, in multi-ancestry TWAS, MASHR models yielded more discoveries that replicate in both PAGE and PanUKBB across all methods analyzed, including loci previously mapped in GWAS and new loci previously not found in GWAS. Overall, we demonstrate the importance of using methods that incorporate effect size estimates from multiple populations in order to improve TWAS for multi-ancestry or underrepresented populations.

    Evaluating PrediXcan’s Ability to Predict Differential Expression Between Alcoholics and Non-Alcoholics

    PrediXcan is a recent software tool for the imputation of gene expression from genotype data alone. Using an overlapping set of transcriptome datasets from postmortem brain tissues of donors with alcohol use disorder and neurotypical controls, which were generated by two different platforms (Arraystar and Affymetrix), and an additional unrelated transcriptome dataset from lung tissue, we sought to evaluate PrediXcan’s ability to impute gene expression and identify differentially expressed genes. From the Arraystar platform, 1.3% of matched genes between the measured and imputed expression had a Pearson correlation ≥ 0.5. Our attempt to replicate this finding using the expression data from the Affymetrix platform also led to a similarly poor outcome (2.7%). Our third attempt using the transcriptome data from lung tissue produced similar results (1.1%), but performance improved markedly after filtering out genes with a low predicted R², which was a model metric provided by the PrediXcan authors. For example, filtering out genes with a predicted R² below 0.6 led to 16 genes remaining and a Pearson correlation of 0.365 between the measured and imputed expression. We were unable to reproduce similar performance gains with filtering the Arraystar or Affymetrix alcohol use disorder datasets. Given that PrediXcan can impute a narrow portion of the transcriptome, which is further reduced significantly by filtering, we believe caution is warranted with the interpretation of results derived from PrediXcan.
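
The filtering step described in this abstract can be sketched as below: keep genes whose model-reported predicted R² meets a threshold, then report the share of retained genes with a measured-versus-imputed Pearson correlation of at least 0.5. This is an illustrative assumption about the bookkeeping, not the authors' code. The field names, gene labels, and values are hypothetical.

```python
def summarize(genes, min_pred_r2=0.0, r_cutoff=0.5):
    """Return (genes kept, fraction of kept genes with pearson_r >= r_cutoff)."""
    kept = [g for g in genes if g["pred_r2"] >= min_pred_r2]
    if not kept:
        return 0, float("nan")  # nothing left to summarize
    hits = sum(1 for g in kept if g["pearson_r"] >= r_cutoff)
    return len(kept), hits / len(kept)

# Hypothetical per-gene records: model-reported R² and measured-vs-imputed correlation
genes = [
    {"gene": "G1", "pred_r2": 0.72, "pearson_r": 0.61},
    {"gene": "G2", "pred_r2": 0.65, "pearson_r": 0.30},
    {"gene": "G3", "pred_r2": 0.10, "pearson_r": 0.02},
    {"gene": "G4", "pred_r2": 0.05, "pearson_r": -0.10},
]
unfiltered = summarize(genes)                  # all four genes
filtered = summarize(genes, min_pred_r2=0.6)   # only G1 and G2 remain
```

The toy example also shows the trade-off the abstract points to: a stricter threshold raises the share of well-predicted genes while shrinking the set of genes that can be analyzed at all.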

    Interpretation of psychiatric genome-wide association studies with multispecies heterogeneous functional genomic data integration.

    Genome-wide association studies and other discovery genetics methods provide a means to identify previously unknown biological mechanisms underlying behavioral disorders that may point to new therapeutic avenues, augment diagnostic tools, and yield a deeper understanding of the biology of psychiatric conditions. Recent advances in psychiatric genetics have been made possible through large-scale collaborative efforts. These studies have begun to unearth many novel genetic variants associated with psychiatric disorders and behavioral traits in human populations. Significant challenges remain in characterizing the resulting disease-associated genetic variants and prioritizing functional follow-up to make them useful for mechanistic understanding and development of therapeutics. Model organism research has generated extensive genomic data that can provide insight into the neurobiological mechanisms of variant action, but a cohesive effort must be made to establish which aspects of the biological modulation of behavioral traits are evolutionarily conserved across species. Scalable computing, new data integration strategies, and advanced analysis methods outlined in this review provide a framework to efficiently harness model organism data in support of clinically relevant psychiatric phenotypes.

    Effect of 6p21 region on lung function is modified by smoking: a genome-wide interaction study

    Smoking is a major risk factor for chronic obstructive pulmonary disease (COPD); however, more than 25% of COPD patients are non-smokers, and gene-by-smoking interactions are expected to affect COPD onset. We aimed to identify the common genetic variants interacting with pack-years of smoking on FEV1/FVC ratios in individuals with normal lung function. A genome-wide interaction study (GWIS) on FEV1/FVC was performed for individuals with FEV1/FVC ratio ≥ 70 in the Korea Associated Resource cohort data, and significant SNPs were validated using data from two other Korean cohorts. The GWIS revealed that rs10947231 and rs8192575 met genome-wide significance levels. For [Formula: see text], the likelihood ratio (LR) test was conducted, and its P values (P_LR) for rs10947231 and rs8192575 were 2.23 × 10⁻¹² and 1.18 × 10⁻⁸, respectively. The interaction between rs8192575 and smoking was significantly replicated in two additional datasets (P_INT = 0.0454, 0.0131). Expression quantitative trait loci, topologically associated domains, and PrediXcan analyses revealed that rs8192575 is significantly associated with AGER expression. SNPs on the 6p21 region are associated with FEV1/FVC, and the effect of smoking on FEV1/FVC differs among the associated genotypes.
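
A likelihood ratio test for a SNP-by-smoking interaction can be sketched as below. The hypothesis actually tested in the study is elided in the abstract ("[Formula: see text]"), so this sketch makes explicit assumptions: a Gaussian linear model without covariates, a null model with SNP and pack-years main effects, a full model that adds their product, and a 1-degree-of-freedom chi-square reference. All function names and the simulated data are hypothetical.

```python
import math
import random

def ols_rss(X, y):
    """Residual sum of squares from ordinary least squares (normal equations)."""
    k = len(X[0])
    # Augmented normal-equation system [X'X | X'y]
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("design matrix is rank deficient")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    beta = [A[i][k] / A[i][i] for i in range(k)]
    return sum((yi - sum(b * x for b, x in zip(beta, row))) ** 2
               for row, yi in zip(X, y))

def interaction_lr_test(snp, pack_years, y):
    """LR statistic and p-value for adding a SNP x pack-years term (1 df)."""
    null = [[1.0, g, s] for g, s in zip(snp, pack_years)]
    full = [[1.0, g, s, g * s] for g, s in zip(snp, pack_years)]
    rss0, rss1 = ols_rss(null, y), ols_rss(full, y)
    if rss1 <= 0:
        raise ValueError("full model fits perfectly; LR statistic is undefined")
    # Gaussian log-likelihood ratio: n * ln(RSS_null / RSS_full), floored at 0
    lr = max(0.0, len(y) * math.log(rss0 / rss1))
    p_value = math.erfc(math.sqrt(lr / 2.0))  # chi-square survival function, 1 df
    return lr, p_value

# Simulated toy data (not from the study): genotype dosage, pack-years, outcome
rng = random.Random(42)
n = 120
snp = [i % 3 for i in range(n)]
pack_years = [float((7 * i) % 40) for i in range(n)]
y_inter = [80 - 0.05 * s - 0.2 * g * s + rng.gauss(0, 1) for g, s in zip(snp, pack_years)]
y_none = [80 - 0.05 * s + rng.gauss(0, 1) for s in pack_years]
lr_inter, p_inter = interaction_lr_test(snp, pack_years, y_inter)
lr_none, p_none = interaction_lr_test(snp, pack_years, y_none)
```

A genome-wide scan would repeat such a test per SNP and compare each p-value against a genome-wide significance threshold.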

    MOSTWAS: Multi-Omic Strategies for Transcriptome-Wide Association Studies

    Traditional predictive models for transcriptome-wide association studies (TWAS) consider only single nucleotide polymorphisms (SNPs) local to genes of interest and perform parameter shrinkage with a regularization process. These approaches ignore the effect of distal-SNPs or other molecular effects underlying the SNP-gene association. Here, we outline multi-omics strategies for transcriptome imputation from germline genetics to allow more powerful testing of gene-trait associations by prioritizing distal-SNPs to the gene of interest. In one extension, we identify mediating biomarkers (CpG sites, microRNAs, and transcription factors) highly associated with gene expression and train predictive models for these mediators using their local SNPs. Imputed values for mediators are then incorporated into the final predictive model of gene expression, along with local SNPs. In the second extension, we assess distal-eQTLs (SNPs associated with genes not in a local window around them) for their mediation effect through mediating biomarkers local to these distal-eSNPs. Distal-eSNPs with large indirect mediation effects are then included in the transcriptomic prediction model with the local SNPs around the gene of interest. Using simulations and real data from ROS/MAP brain tissue and TCGA breast tumors, we show considerable gains of percent variance explained (1–2% additive increase) of gene expression and TWAS power to detect gene-trait associations. This integrative approach to transcriptome-wide imputation and association studies aids in identifying the complex interactions underlying genetic regulation within a tissue and important risk genes for various traits and disorders.