424 research outputs found

The challenge for genetic epidemiologists: how to analyze large numbers of SNPs in relation to complex diseases

Author: A Bureau
A Geert Heidema
A Wille
AA Motsinger
BV North
CM Bishop
CS Coffey
Daphne L van der A
DJF De Quervain
DR Cox
Edith JM Feskens
Edwin CM Mariman
IR Dohoo
J Hoh
J Ott
J Xu
JH Moore
Jolanda MA Boer
KL Lunetta
L Li
LW Hahn
MA Province
MD Ritchie
MR Nelson
N Nagelkerke
Nico Nagelkerke
NJ Schork
P Peduzzi
PR Lucek
R Bellman
R Culverhouse
R Tibshirani
RA Wilke
RYL Zee
SM Williams
TA Thornton-Wells
Y Benjamini
Y Tomita
YM Cho
Publication venue: BioMed Central
Publication date: 01/01/2006

Genetic epidemiologists have taken up the challenge of identifying genetic polymorphisms involved in the development of diseases. Many have collected data on large numbers of genetic markers but are not familiar with the available methods for assessing their association with complex diseases. Statistical methods have been developed for analyzing the relation between large numbers of genetic and environmental predictors and disease or disease-related variables in genetic association studies. In this commentary we discuss logistic regression analysis; neural networks, including the parameter decreasing method (PDM) and genetic programming optimized neural networks (GPNN); and several non-parametric methods, which include the set association approach, the combinatorial partitioning method (CPM), the restricted partitioning method (RPM), the multifactor dimensionality reduction (MDR) method and the random forests approach. The relative strengths and weaknesses of these methods are highlighted. Logistic regression and neural networks can handle only a limited number of predictor variables, depending on the number of observations in the dataset. Therefore, they are less useful than the non-parametric methods for association studies with large numbers of predictor variables. GPNN, on the other hand, may be a useful approach for selecting and modeling important predictors, but its ability to select the important effects in the presence of large numbers of predictors needs to be examined. Both the set association approach and the random forests approach can handle a large number of predictors and are useful for reducing them to a subset with an important contribution to disease. The combinatorial methods give more insight into combination patterns for sets of genetic and/or environmental predictor variables that may be related to the outcome variable.
As the non-parametric methods have different strengths and weaknesses, we conclude that, for genetic association studies using the case-control design, applying a combination of several methods, including the set association approach, MDR and the random forests approach, will likely be a useful strategy for finding the important genes and interaction patterns involved in complex diseases.

Maastricht University Research Portal

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Wageningen University & Research Publications

An application of Random Forests to a genome-wide association dataset: Methodological considerations & new findings

Author: Barcellos Lisa F
Cutler Adele
Goldstein Benjamin A
Hubbard Alan E
Publication venue: BioMed Central
Publication date: 01/06/2010

Background: As computational power improves, the application of more advanced machine learning techniques to the analysis of large genome-wide association (GWA) datasets becomes possible. While most traditional statistical methods can only elucidate main effects of genetic variants on risk for disease, certain machine learning approaches are particularly suited to discover higher order and non-linear effects. One such approach is the Random Forests (RF) algorithm. The use of RF for SNP discovery related to human disease has grown in recent years; however, most work has focused on small datasets or simulation studies which are limited. Results: Using a multiple sclerosis (MS) case-control dataset comprised of 300 K SNP genotypes across the genome, we outline an approach and some considerations for optimally tuning the RF algorithm based on the empirical dataset. Importantly, results show that typical default parameter values are not appropriate for large GWA datasets. Furthermore, gains can be made by sub-sampling the data, pruning based on linkage disequilibrium (LD), and removing strong effects from RF analyses. The new RF results are compared to findings from the original MS GWA study and demonstrate overlap. In addition, four new interesting candidate MS genes are identified, MPHOSPH9, CTNNA3, PHACTR2 and IL7, by RF analysis and warrant further follow-up in independent studies. Conclusions: This study presents one of the first illustrations of successfully analyzing GWA data with a machine learning algorithm. It is shown that RF is computationally feasible for GWA data and the results obtained make biologic sense based on previous studies. More importantly, new genes were identified as potentially being associated with MS, suggesting new avenues of investigation for this complex disease.

Directory of Open Access Journals

PubMed Central

Identifying Multimodal Intermediate Phenotypes between Genetic Risk Factors and Disease Status in Alzheimer’s Disease

Author: Andrew J. Saykin
Daoqiang Zhang
Jingwen Yan
Li Shen
Shannon L. Risacher
Xiaohui Yao
Xiaoke Hao
Publication venue: Springer Science and Business Media LLC
Publication date: 09/06/2016

Neuroimaging genetics, which is thought to be a powerful strategy for examining the influence of genetic variants (i.e., single nucleotide polymorphisms (SNPs)) on the structure or function of the human brain, has attracted growing attention and interest. In recent studies, univariate or multivariate regression analysis methods are typically used to capture the effective associations between genetic variants and quantitative traits (QTs) such as brain imaging phenotypes. The identified imaging QTs, although associated with certain genetic markers, may not all be disease specific. A useful, but underexplored, scenario could be to discover only those QTs associated with both genetic markers and disease status, revealing the chain from genotype to phenotype to symptom. In addition, multimodal brain imaging phenotypes are extracted from different perspectives, and imaging markers that consistently show up across multiple modalities may provide more insight into the mechanisms of diseases (e.g., Alzheimer’s disease (AD)). In this work, we propose a general framework to exploit multi-modal brain imaging phenotypes as intermediate traits that bridge genetic risk factors and multi-class disease status. We applied our proposed method to explore the relation between the well-known AD risk SNP APOE rs429358 and three baseline brain imaging modalities (i.e., structural magnetic resonance imaging (MRI), fluorodeoxyglucose positron emission tomography (FDG-PET) and F-18 florbetapir PET amyloid imaging (AV45)) from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database. The empirical results demonstrate that our proposed method not only helps improve the performance of imaging genetic associations, but also discovers robust and consistent regions of interest (ROIs) across multiple modalities to guide the disease-induced interpretation.

Crossref

IUPUIScholarWorks

PubMed Central

Integrative Analysis to Investigate Complex Interaction in Alzheimer’s Disease

Author: Li Zeran
Publication venue: Washington University Open Scholarship
Publication date: 15/05/2019

Alzheimer’s disease (AD) is a neurodegenerative disorder featuring progressive cognitive and functional deficits. Pathologically, AD is characterized by tau and amyloid β protein deposition in the brain. AD is the sixth leading cause of death in the U.S., and the disease course usually lasts 7 to 10 years on average before death. In 2019, an estimated 5.8 million Americans were living with AD, affecting 16 million family members. At a certain stage of the disease course, patients who are unable to maintain their daily functioning depend heavily on caregivers, primarily family caregivers, who provide an estimated 18.4 billion hours of unpaid care, equivalent to 232 billion dollars. This huge economic burden and the inevitable emotional distress on families and society will also increase, as the AD-affected population could triple by 2050. Altered cellular composition, such as neuronal loss and astrocytosis, is associated with AD progression and decline in cognition; it is a key feature of neurodegeneration but has often been overlooked in transcriptome research. To explore the cellular composition changes in AD, I developed a deconvolution pipeline for bulk RNA-Seq to account for cell type specific effects in brain tissues. I found that neuronal and astrocyte relative proportions differ between healthy and diseased brains and also among AD cases that carry specific genetic risk variants. Brains of carriers of pathogenic mutations in APP, PSEN1, or PSEN2 presented lower neuron and higher astrocyte relative proportions compared to sporadic AD. Similarly, APOE ε4 allele carriers also showed decreased neuronal and increased astrocyte relative proportions compared to AD non-carriers. In contrast, carriers of risk variants in TREM2 showed a lower degree of neuronal loss compared to matched AD cases in multiple independent studies.
These findings suggest that genetic risk factors associated with AD etiology have a specific effect on the cellular composition of AD brains. The digital deconvolution approach provides an enhanced understanding of the fundamental molecular mechanisms underlying neurodegeneration, enabling the analysis of large bulk RNA-sequencing studies for cell composition. It also suggests that correcting for the cellular structure when performing transcriptomic analysis will lead to novel insights into AD. Deconvolution methods that delineate cell population changes in disease conditions would help interpret transcriptomics results and reveal transcriptional changes in a cell type specific manner. One application demonstrated in this dissertation work is to use cell type proportion as a quantitative trait to identify genetic factors associated with cellular composition changes. I performed cell type QTL analysis and identified a common pathway associated with neuronal protection underlying aging brains in the presence or absence of neurodegenerative disease symptoms. A variant of TMEM106B, previously identified as protective in FTD, was found to be associated with neuronal proportion in aging brains, suggesting a common pathway underlying neuronal protection and cognitive reserve in the elderly. This extended analysis, derived from the deconvolution results, demonstrates one promising direction: using deconvolution followed by cell type QTL analysis to identify new genes or pathways underlying neurodegenerative or aging brains. To understand the complexity of the brain under disease conditions, network analysis, as a large-scale system-level approach, provides an unbiased and data-driven view for identifying gene-gene interactions altered by disease status.
Using network analysis, I replicated the co-expression pattern between the MS4A gene cluster and TREM2 in sporadic AD; Bayesian network analysis provided further evidence that MS4A4A might be a regulator of TREM2, a finding validated by in vitro experiments. In the Autosomal Dominant AD (ADAD) cohort, disrupted and acquired genes were identified in PSEN1 mutation carriers. Among these genes, previously identified AD risk genes and pathways were revealed along with novel findings. These results demonstrate the great potential of the network approach for identifying disease-associated genes and the interactions among them. To conclude the dissertation work at the methodological, empirical, and theoretical levels: a deconvolution pipeline for bulk RNA-Seq, cell type QTL analysis, and network analysis were applied to understand transcriptome changes underlying disease etiology. Previous AD-related findings were replicated, which validated the methods, and novel genes and pathways were identified as potential new therapeutic targets. Based on prior knowledge and the empirical evidence from this dissertation work, a model is proposed to explain how genetic factors are assembled as a highly interconnected interactome network that affects the proteinopathy observed in neurodegenerative disorders, causing cellular composition changes in the brain that ultimately lead to the cognitive and functional deficits observed in AD patients.

Washington University St. Louis: Open Scholarship

DETECTING CANCER-RELATED GENES AND GENE-GENE INTERACTIONS BY MACHINE LEARNING METHODS

Author: Han Bing
Publication venue: Paleontological Institute at The University of Kansas
Publication date: 01/01/2011

To understand the underlying molecular mechanisms of cancer and therefore to improve pathogenesis, prevention, diagnosis and treatment of cancer, it is necessary to explore the activities of cancer-related genes and the interactions among these genes. In this dissertation, I use machine learning and computational methods to identify differential gene relations and detect gene-gene interactions. To identify gene pairs that have different relationships in normal versus cancer tissues, I develop an integrative method based on the bootstrapping K-S test to evaluate a large number of microarray datasets. The experimental results demonstrate that my method can find meaningful alterations in gene relations. For gene-gene interaction detection, I propose two Bayesian Network based methods, DASSO-MB (Detection of ASSOciations using Markov Blanket) and EpiBN (Epistatic interaction detection using Bayesian Network model), to address the two critical challenges: searching and scoring. DASSO-MB is based on the Markov Blanket concept in Bayesian Networks. In EpiBN, I develop a new scoring function, which can reflect higher-order gene-gene interactions and detect the true number of disease markers, and apply a fast Branch-and-Bound (B&B) algorithm to learn the structure of the Bayesian Network. Both DASSO-MB and EpiBN outperform some other commonly used methods and are scalable to genome-wide data.
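
The abstract above does not spell out how its bootstrapping K-S test works, so the following is only a generic illustration of the underlying idea, not the dissertation's method. It computes a two-sample Kolmogorov-Smirnov statistic and estimates a p-value by resampling; the resampling scheme (shuffling the pooled values, i.e. a permutation null) and the names `ks_statistic` and `resampled_ks_pvalue` are assumptions introduced here.

```python
import random


def ks_statistic(a, b):
    """Two-sample K-S statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both samples past every value <= x (handles ties).
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def resampled_ks_pvalue(a, b, n_resamples=1000, seed=0):
    """Estimate a p-value by reshuffling the pooled values (assumed scheme)."""
    rng = random.Random(seed)
    observed = ks_statistic(a, b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        if ks_statistic(pooled[:len(a)], pooled[len(a):]) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (hits + 1) / (n_resamples + 1)
```

In a setting like the one described, `a` and `b` might hold some per-dataset measure of a gene pair's relationship in normal and in cancer tissue; that mapping is likewise an assumption.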

KU ScholarWorks

CpG-related SNPs in the MS4A region have a dose-dependent effect on risk of late-onset Alzheimer disease

Author: Reiman Eric
Publication venue: Wiley
Publication date: 01/08/2019

CpG-related single nucleotide polymorphisms (CGS) have the potential to perturb DNA methylation; however, their effects on Alzheimer disease (AD) risk have not been evaluated systematically. We conducted a genome-wide association study using a sliding-window approach to measure the combined effects of CGSes on AD risk in a discovery sample of 24 European ancestry cohorts (12,181 cases, 12,601 controls) from the Alzheimer's Disease Genetics Consortium (ADGC) and replication sample of seven European ancestry cohorts (7,554 cases, 27,382 controls) from the International Genomics of Alzheimer's Project (IGAP). The potential functional relevance of significant associations was evaluated by analysis of methylation and expression levels in brain tissue of the Religious Orders Study and the Rush Memory and Aging Project (ROSMAP), and in whole blood of Framingham Heart Study participants (FHS). Genome-wide significant (p < 5 × 10⁻⁸) associations were identified with 171 1.0 kb-length windows spanning 932 kb in the APOE region (top p < 2.2 × 10⁻³⁰⁸), five windows at BIN1 (top p = 1.3 × 10⁻¹³), two windows at MS4A6A (top p = 2.7 × 10⁻¹⁰), two windows near MS4A4A (top p = 6.4 × 10⁻¹⁰), and one window at PICALM (p = 6.3 × 10⁻⁹). The total number of CGS-derived CpG dinucleotides in the window near MS4A4A was associated with AD risk (p = 2.67 × 10⁻¹⁰), brain DNA methylation (p = 2.15 × 10⁻¹⁰), and gene expression in brain (p = 0.03) and blood (p = 2.53 × 10⁻⁴). Pathway analysis of the genes responsive to changes in the methylation quantitative trait locus signal at MS4A4A (cg14750746) showed an enrichment of methyltransferase functions.
We confirm the importance of CGS in AD and the potential for creating a functional CpG dosage-derived genetic score to predict AD risk.
Funding: NIA [U24-AG041689-01, P30 AG019610, P30 AG013846, P50 AG008702, P50 AG025688, P50 AG047266, P30 AG010133, P50 AG005146, P50 AG005134, P50 AG016574, P50 AG005138, P30 AG008051, P30 AG013854, P30 AG008017, P30 AG010161, P50 AG047366, P30 AG010129, P50 AG016573, P50 AG005131, P50 AG023501]; National Institute on Aging (NIA) [U24 AG21886, U01-AG032984, RC2AG036528]; NIA/NIH [U01 AG016976]; [P30 AG035982]; [P30 AG028383]; [P30 AG053760]; [P30 AG010124]; [P50 AG005133]; [P50 AG005142]; [P30 AG012300]; [P30 AG049638]; [P50 AG005136]; [P50 AG033514]; [P50 AG005681]; [P50 AG047270]; [P30-AG10161]; [R01-AG17917]; [R01-AG36042]; [U01-AG46152]; [R01-AG048927]; [RF1-AG057519].
Open access journal.
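
To make the idea of a per-window CpG dosage concrete, here is a minimal sketch of how one individual's CGS-derived CpG count could be summed over sliding windows. It is not the study's pipeline: the input layout (tuples of SNP id, base-pair position and a flag for CpG-related SNPs), the 500 bp step, and the function name `sliding_window_dosage` are all assumptions made for illustration. Only the 1.0 kb window length comes from the abstract.

```python
def sliding_window_dosage(snps, genotypes, window_bp=1000, step_bp=500):
    """Sum CpG-creating allele counts per sliding window for one individual.

    snps:      list of (snp_id, position_bp, is_cgs) tuples (assumed layout)
    genotypes: dict mapping snp_id -> copies (0, 1 or 2) of the CpG-creating allele
    Returns a dict mapping (window_start, window_end) -> total dosage.
    """
    positions = [pos for _, pos, is_cgs in snps if is_cgs]
    if not positions:
        return {}
    result = {}
    # Windows begin at the step-aligned start at or below the first CGS
    # and advance by step_bp until they pass the last CGS.
    start = (min(positions) // step_bp) * step_bp
    last = max(positions)
    while start <= last:
        end = start + window_bp
        result[(start, end)] = sum(
            genotypes.get(snp_id, 0)
            for snp_id, pos, is_cgs in snps
            if is_cgs and start <= pos < end
        )
        start += step_bp
    return result
```

Under these assumptions, each window's dosage across individuals could then serve as the predictor in a per-window association test against AD status.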

The University of Arizona