16 research outputs found
A Method to Exploit the Structure of Genetic Ancestry Space to Enhance Case-Control Studies
A. Palotie is a member of the Int IBD Genetics Consortium working group. One goal of human genetics is to understand the genetic basis of disease, a challenge for diseases of complex inheritance because risk alleles are few relative to the vast set of benign variants. Risk variants are often sought by association studies in which allele frequencies in case subjects are contrasted with those from population-based samples used as control subjects. In an ideal world we would know population-level allele frequencies, releasing researchers to focus on case subjects. We argue this ideal is possible, at least theoretically, and we outline a path to achieving it in reality. If such a resource were to exist, it would yield ample savings and would facilitate the effective use of data repositories by removing administrative and technical barriers. We call this concept the Universal Control Repository Network (UNICORN), a means to perform association analyses without necessitating direct access to individual-level control data. Our approach to UNICORN uses existing genetic resources and various statistical tools to analyze these data, including hierarchical clustering with spectral analysis of ancestry; and empirical Bayesian analysis along with Gaussian spatial processes to estimate ancestry-specific allele frequencies. We demonstrate our approach using tens of thousands of control subjects from studies of Crohn disease, showing how it controls false positives, provides power similar to that achieved when all control data are directly accessible, and enhances power when control data are limiting or even imperfectly matched ancestrally. These results highlight how UNICORN can enable reliable, powerful, and convenient genetic association analyses without access to the individual-level data. Peer reviewed
Comparative performances of machine learning methods for classifying Crohn Disease patients using genome-wide genotyping data
Abstract: Crohn Disease (CD) is a complex genetic disorder for which more than 140 genes have been identified using genome-wide association studies (GWAS). However, the genetic architecture of the trait remains largely unknown. The recent development of machine learning (ML) approaches prompted us to apply them to classify healthy and diseased people according to their genomic information. The Immunochip dataset containing 18,227 CD patients and 34,050 healthy controls enrolled and genotyped by the international Inflammatory Bowel Disease genetic consortium (IIBDGC) has been re-analyzed using a set of ML methods: penalized logistic regression (LR), gradient boosted trees (GBT) and artificial neural networks (NN). The main score used to compare the methods was the Area Under the ROC Curve (AUC) statistic. Assessing the impact of quality control (QC), imputation and coding methods on LR results showed that QC methods and imputation of missing genotypes may artificially increase the scores. In contrast, neither the patient/control ratio nor marker preselection or coding strategies significantly affected the results. LR methods, including Lasso, Ridge and ElasticNet, provided similar results with a maximum AUC of 0.80. GBT methods like XGBoost, LightGBM and CatBoost, together with dense NN with one or more hidden layers, provided similar AUC values, suggesting limited epistatic effects in the genetic architecture of the trait. ML methods detected nearly all the genetic variants previously identified by GWAS among the best predictors, plus additional predictors with lower effects. The robustness and complementarity of the different methods are also studied. Compared to LR, non-linear models such as GBT or NN may provide robust complementary approaches to identify and classify genetic markers
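As a rough illustration of the AUC statistic used above to compare classifiers, the following minimal sketch computes it as the probability that a randomly chosen case receives a higher score than a randomly chosen control (the Mann-Whitney formulation). The function name `roc_auc` and the toy labels and scores are assumptions made for illustration only; they are not taken from the study, and the quadratic pairwise loop is meant for clarity rather than for datasets of the size described.

```python
def roc_auc(labels, scores):
    """Return the AUC as the fraction of (case, control) pairs in which
    the case has the higher score; tied pairs count as half a win."""
    case_scores = [s for y, s in zip(labels, scores) if y == 1]
    control_scores = [s for y, s in zip(labels, scores) if y == 0]
    if not case_scores or not control_scores:
        raise ValueError("need at least one case and one control")

    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical toy data: 1 = case, 0 = control; scores are made-up
# classifier outputs, not results from any real model.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of cases from controls, which is the scale on which the maximum of 0.80 reported above should be read.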
Inherited determinants of Crohn's disease and ulcerative colitis phenotypes: a genetic association study
Crohn's disease and ulcerative colitis are the two major forms of inflammatory bowel disease; treatment strategies have historically been determined by this binary categorisation. Genetic studies have identified 163 susceptibility loci for inflammatory bowel disease, mostly shared between Crohn's disease and ulcerative colitis. We undertook the largest genotype association study, to date, in widely used clinical subphenotypes of inflammatory bowel disease with the goal of further understanding the biological relations between diseases
IBD risk loci are enriched in multigenic regulatory modules encompassing putative causative genes.
GWAS have identified >200 risk loci for Inflammatory Bowel Disease (IBD). The majority of disease associations are known to be driven by regulatory variants. To identify the putative causative genes that are perturbed by these variants, we generate a large transcriptome data set (nine disease-relevant cell types) and identify 23,650 cis-eQTL. We show that these are determined by ∼9720 regulatory modules, of which ∼3000 operate in multiple tissues and ∼970 on multiple genes. We identify regulatory modules that drive the disease association for 63 of the 200 risk loci, and show that these are enriched in multigenic modules. Based on these analyses, we resequence 45 of the corresponding 100 candidate genes in 6600 Crohn disease (CD) cases and 5500 controls, and show with burden tests that they include likely causative genes. Our analyses indicate that ≥10-fold larger sample sizes will be required to demonstrate the causality of individual genes using this approach
Multiple sclerosis genomic map implicates peripheral immune cells and microglia in susceptibility
We analyzed genetic data of 47,429 multiple sclerosis (MS) and 68,374 control subjects and established a reference map of the genetic architecture of MS that includes 200 autosomal susceptibility variants outside the major histocompatibility complex (MHC), one chromosome X variant, and 32 variants within the extended MHC. We used an ensemble of methods to prioritize 551 putative susceptibility genes that implicate multiple innate and adaptive pathways distributed across the cellular components of the immune system. Using expression profiles from purified human microglia, we observed enrichment for MS genes in these brain-resident immune cells, suggesting that these may have a role in targeting an autoimmune process to the central nervous system, although MS is most likely initially triggered by perturbation of peripheral immune responses
Association analyses identify 38 susceptibility loci for inflammatory bowel disease and highlight shared genetic risk across populations
Ulcerative colitis and Crohn's disease are the two main forms of inflammatory bowel disease (IBD). Here we report the first trans-ancestry association study of IBD, with genome-wide or Immunochip genotype data from an extended cohort of 86,640 European individuals and Immunochip data from 9,846 individuals of East Asian, Indian or Iranian descent. We implicate 38 loci in IBD risk for the first time. For the majority of the IBD risk loci, the direction and magnitude of effect are consistent in European and non-European cohorts. Nevertheless, we observe genetic heterogeneity between divergent populations at several established risk loci driven by differences in allele frequency (NOD2) or effect size (TNFSF15 and ATG16L1) or a combination of these factors (IL23R and IRGM). Our results provide biological insights into the pathogenesis of IBD and demonstrate the usefulness of trans-ancestry association studies for mapping loci associated with complex diseases and understanding genetic architecture across diverse populations