
    Performance of random forest when SNPs are in linkage disequilibrium

    Background: Single nucleotide polymorphisms (SNPs) may be correlated due to linkage disequilibrium (LD). Association studies look for both direct and indirect associations with disease loci. In a Random Forest (RF) analysis, correlation between a true risk SNP and SNPs in LD may lead to diminished variable importance for the true risk SNP. One approach to address this problem is to select SNPs in linkage equilibrium (LE) for analysis. Here, we explore alternative methods for dealing with SNPs in LD: change the tree-building algorithm by building each tree in an RF only with SNPs in LE, modify the importance measure (IM), and use haplotypes instead of SNPs to build an RF.
    Results: We evaluated the performance of our alternative methods by simulating a spectrum of complex genetic models. When a haplotype rather than an individual SNP is the risk factor, we find that the original Random Forest method performed on SNPs provides good performance. When individual, genotyped SNPs are the risk factors, we find that the stronger the genetic effect, the stronger the effect LD has on the performance of the original RF. A revised importance measure used with the original RF is relatively robust to LD among SNPs; this revised importance measure used with the revised RF is sometimes inflated. Overall, we find that the revised importance measure used with the original RF is the best choice when the genetic model and the number of SNPs in LD with risk SNPs are unknown. For the haplotype-based method, under a multiplicative heterogeneity model, we observed a decrease in the performance of RF with increasing LD among the SNPs in the haplotype.
    Conclusion: Our results suggest that by strategically revising the Random Forest tree-building or importance-measure calculation, power can increase when LD exists between SNPs. We conclude that the revised Random Forest method performed on SNPs offers the advantage of not requiring genotype phase, making it a viable tool for use in the context of thousands of SNPs, such as candidate gene studies and follow-up of top candidates from genome-wide association studies.
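
    As a rough illustration of the LD problem described above (not the authors' code), the following Python sketch simulates a causal SNP together with two SNPs in strong LD and shows how the random forest importance of the causal SNP is diluted across its correlated neighbours; the data, effect sizes, and the scikit-learn implementation are all assumptions made for the example.

        # Illustrative sketch: LD between a causal SNP and correlated neighbours
        # can dilute the causal SNP's random-forest variable importance.
        import numpy as np
        from sklearn.ensemble import RandomForestClassifier

        rng = np.random.default_rng(0)
        n, n_noise = 2000, 50

        # A causal SNP (additive 0/1/2 coding) plus two SNPs in strong LD with it.
        causal = rng.binomial(2, 0.3, size=n)
        ld_snps = np.column_stack([
            np.where(rng.random(n) < 0.9, causal, rng.binomial(2, 0.3, size=n))
            for _ in range(2)
        ])
        noise = rng.binomial(2, 0.3, size=(n, n_noise))
        X = np.column_stack([causal, ld_snps, noise])

        # Case/control status driven only by the causal SNP.
        p_case = 1 / (1 + np.exp(-0.8 * (causal - causal.mean())))
        y = rng.binomial(1, p_case)

        rf = RandomForestClassifier(n_estimators=500, random_state=0, n_jobs=-1)
        rf.fit(X, y)

        imp = rf.feature_importances_            # Gini-based importance
        print("causal SNP:     ", imp[0])
        print("SNPs in LD:     ", imp[1:3])      # importance 'leaks' onto these
        print("mean noise SNP: ", imp[3:].mean())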

    A Two-Stage Random Forest-Based Pathway Analysis Method

    Pathway analysis provides a powerful approach for identifying the joint effect of genes grouped into biologically based pathways on disease. It is also an attractive approach for secondary analysis of genome-wide association study (GWAS) data that may still yield new results from these valuable datasets. Most current pathway analysis methods focus on testing the cumulative main effects of genes in a pathway; for complex diseases, however, gene-gene interactions are expected to play a critical role in disease etiology. We extended a random forest-based method for pathway analysis by incorporating a two-stage design. We used simulations to verify that the proposed method has the correct type I error rate, and to show that it is more powerful than the original random forest-based pathway approach and the set-based test implemented in PLINK in the presence of gene-gene interactions. Finally, we applied the method to a breast cancer GWAS dataset and a lung cancer GWAS dataset, and interesting pathways with implications for breast and lung cancer were identified.
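
    The abstract does not spell out the two-stage procedure, so the sketch below only illustrates the general idea of scoring a pathway's SNP set with a random forest and calibrating that score by permuting case/control labels; it is a simplified stand-in for, not a reproduction of, the published method, and every function name and parameter here is hypothetical.

        # Simplified, hypothetical pathway test: RF out-of-bag accuracy on a
        # pathway's SNPs, with a p-value from a phenotype-permutation null.
        import numpy as np
        from sklearn.ensemble import RandomForestClassifier

        def pathway_rf_pvalue(X_pathway, y, n_perm=200, seed=0):
            rng = np.random.default_rng(seed)

            def oob_accuracy(labels):
                rf = RandomForestClassifier(n_estimators=300, oob_score=True,
                                            random_state=seed, n_jobs=-1)
                rf.fit(X_pathway, labels)
                return rf.oob_score_

            observed = oob_accuracy(y)
            null = np.array([oob_accuracy(rng.permutation(y)) for _ in range(n_perm)])
            p_value = (np.sum(null >= observed) + 1) / (n_perm + 1)
            return observed, p_value

        # Toy data: 20 SNPs in one pathway, 500 subjects, no true signal
        # (a quick check that the permutation p-value behaves under the null).
        rng = np.random.default_rng(1)
        X = rng.binomial(2, 0.3, size=(500, 20))
        y = rng.binomial(1, 0.5, size=500)
        print(pathway_rf_pvalue(X, y, n_perm=50))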

    Overview of Random Forest Methodology and Practical Guidance with Emphasis on Computational Biology and Bioinformatics

    The Random Forest (RF) algorithm by Leo Breiman has become a standard data analysis tool in bioinformatics. It has shown excellent performance in settings where the number of variables is much larger than the number of observations, can cope with complex interaction structures as well as highly correlated variables, and returns measures of variable importance. This paper synthesizes ten years of RF development with an emphasis on applications to bioinformatics and computational biology. Special attention is given to practical aspects such as the selection of parameters, available RF implementations, and important pitfalls and biases of RF and its variable importance measures (VIMs). The paper surveys recent developments of the methodology relevant to bioinformatics, as well as representative examples of RF applications in this context and possible directions for future research.
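
    A minimal sketch of the practical knobs such guidance papers discuss, translated to scikit-learn (the R terminology ntree/mtry corresponds to n_estimators/max_features); the data and parameter values are illustrative only.

        # Key RF tuning parameters and the two common variable importance measures.
        import numpy as np
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.inspection import permutation_importance

        rng = np.random.default_rng(0)
        X = rng.normal(size=(300, 40))
        y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=300) > 0).astype(int)

        rf = RandomForestClassifier(
            n_estimators=1000,        # 'ntree': more trees stabilise importances
            max_features="sqrt",      # 'mtry': candidate variables per split
            oob_score=True,           # out-of-bag error as built-in validation
            random_state=0, n_jobs=-1,
        ).fit(X, y)

        print("OOB accuracy:", rf.oob_score_)
        print("Gini (impurity) VIM, top 3:",
              np.argsort(rf.feature_importances_)[::-1][:3])

        perm = permutation_importance(rf, X, y, n_repeats=20, random_state=0)
        print("Permutation VIM, top 3:",
              np.argsort(perm.importances_mean)[::-1][:3])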

    The behaviour of random forest permutation-based variable importance measures under predictor correlation

    Background: Random forests (RF) have been increasingly used in applications such as genome-wide association and microarray studies where predictor correlation is frequently observed. Recent works on permutation-based variable importance measures (VIMs) used in RF have come to apparently contradictory conclusions. We present an extended simulation study to synthesize results.
    Results: When predictor correlation was present and predictors were associated with the outcome (H_A), the unconditional RF VIM attributed a higher share of importance to correlated predictors, while under the null hypothesis that no predictors are associated with the outcome (H_0) the unconditional RF VIM was unbiased. Conditional VIMs showed a decrease in VIM values for correlated predictors versus the unconditional VIMs under H_A and were unbiased under H_0. Scaled VIMs were clearly biased under both H_A and H_0.
    Conclusions: Unconditional unscaled VIMs are a computationally tractable choice for large datasets and are unbiased under the null hypothesis. Whether the observed increased VIMs for correlated predictors should be considered a "bias", because they do not directly reflect the coefficients in the generating model, or a beneficial attribute of these VIMs depends on the application. For example, in genetic association studies, where correlation between markers may help to localize the functionally relevant variant, the increased importance of correlated predictors may be an advantage. On the other hand, we show examples where this increased importance may result in spurious signals.
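
    To make the permutation-based VIM concrete, here is a bare-bones version of the unconditional measure: permute one predictor at a time and record the drop in predictive accuracy. The cited work uses R implementations and out-of-bag samples; this sketch instead uses a held-out test split and scikit-learn, so treat it as an approximation of the idea rather than the studied estimator.

        # Unconditional permutation VIM, hand-rolled for illustration.
        import numpy as np
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import train_test_split

        def permutation_vim(model, X_test, y_test, n_repeats=10, seed=0):
            rng = np.random.default_rng(seed)
            baseline = model.score(X_test, y_test)
            vim = np.zeros(X_test.shape[1])
            for j in range(X_test.shape[1]):
                drops = []
                for _ in range(n_repeats):
                    X_perm = X_test.copy()
                    # break X_j's link to y (and, incidentally, to the other X's)
                    X_perm[:, j] = rng.permutation(X_perm[:, j])
                    drops.append(baseline - model.score(X_perm, y_test))
                vim[j] = np.mean(drops)
            return vim

        # Two correlated predictors, only the first associated with y (the H_A setting).
        rng = np.random.default_rng(0)
        x1 = rng.normal(size=1000)
        x2 = 0.9 * x1 + np.sqrt(1 - 0.81) * rng.normal(size=1000)   # corr ~ 0.9
        noise = rng.normal(size=(1000, 8))
        X = np.column_stack([x1, x2, noise])
        y = (x1 + rng.normal(size=1000) > 0).astype(int)

        X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
        rf = RandomForestClassifier(n_estimators=500, random_state=0,
                                    n_jobs=-1).fit(X_tr, y_tr)
        # x2 typically receives a non-trivial share of importance despite being null.
        print(permutation_vim(rf, X_te, y_te)[:3])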

    Ranking causal variants and associated regions in genome-wide association studies by the support vector machine and random forest

    We study the number of causal variants and associated regions identified by top SNPs in rankings given by the popular 1 df chi-squared statistic, the support vector machine (SVM), and the random forest (RF) on simulated and real data. If we apply the SVM and RF to the top 2r chi-square-ranked SNPs, where r is the number of SNPs with P-values within the Bonferroni correction, we find that both improve the ranks of causal variants and associated regions and achieve higher power on simulated data. These improvements, however, as well as the stability of the SVM and RF rankings, progressively decrease as the cutoff increases to 5r and 10r. As applications, we compare the ranks of previously replicated SNPs and of associated regions in type 1 diabetes real data, as provided by the Type 1 Diabetes Consortium, and the disease risk prediction accuracies given by the top-ranked SNPs of the three methods. Software and a webserver are available at http://svmsnps.njit.edu.
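
    A hedged sketch of the two-step ranking workflow described above: compute 1-df chi-squared p-values per SNP, take the top 2r SNPs (r = the number passing Bonferroni), and re-rank that subset with a linear SVM and a random forest. The allelic test, genotype coding, and all thresholds are assumptions for illustration, not the authors' exact pipeline.

        # Chi-squared filter followed by SVM and RF re-ranking of the top 2r SNPs.
        import numpy as np
        from scipy.stats import chi2_contingency
        from sklearn.svm import LinearSVC
        from sklearn.ensemble import RandomForestClassifier

        def allelic_chi2_p(geno, y):
            """1-df chi-squared test on the 2x2 table of allele counts by case status."""
            case = np.array([geno[y == 1].sum(), 2 * (y == 1).sum() - geno[y == 1].sum()])
            ctrl = np.array([geno[y == 0].sum(), 2 * (y == 0).sum() - geno[y == 0].sum()])
            return chi2_contingency(np.vstack([case, ctrl]))[1]

        rng = np.random.default_rng(0)
        X = rng.binomial(2, 0.3, size=(1000, 500))          # toy genotype matrix
        y = rng.binomial(1, 1 / (1 + np.exp(-0.6 * (X[:, 0] - X[:, 0].mean()))))

        pvals = np.array([allelic_chi2_p(X[:, j], y) for j in range(X.shape[1])])
        r = max(int((pvals < 0.05 / X.shape[1]).sum()), 1)  # Bonferroni-significant SNPs
        top = np.argsort(pvals)[: 2 * r]                    # the "top 2r" subset

        svm = LinearSVC(C=1.0, max_iter=10000).fit(X[:, top], y)
        rf = RandomForestClassifier(n_estimators=500, random_state=0,
                                    n_jobs=-1).fit(X[:, top], y)

        svm_rank = top[np.argsort(np.abs(svm.coef_[0]))[::-1]]   # rank by |SVM weight|
        rf_rank = top[np.argsort(rf.feature_importances_)[::-1]]
        print("SVM top 5:", svm_rank[:5], "RF top 5:", rf_rank[:5])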

    An application of Random Forests to a genome-wide association dataset: Methodological considerations & new findings

    Background: As computational power improves, the application of more advanced machine learning techniques to the analysis of large genome-wide association (GWA) datasets becomes possible. While most traditional statistical methods can only elucidate main effects of genetic variants on risk for disease, certain machine learning approaches are particularly suited to discovering higher-order and non-linear effects. One such approach is the Random Forests (RF) algorithm. The use of RF for SNP discovery related to human disease has grown in recent years; however, most work has focused on small datasets or on simulation studies, which are limited.
    Results: Using a multiple sclerosis (MS) case-control dataset comprising 300K SNP genotypes across the genome, we outline an approach and some considerations for optimally tuning the RF algorithm based on the empirical dataset. Importantly, results show that typical default parameter values are not appropriate for large GWA datasets. Furthermore, gains can be made by sub-sampling the data, pruning based on linkage disequilibrium (LD), and removing strong effects from RF analyses. The new RF results are compared to findings from the original MS GWA study and demonstrate overlap. In addition, four new interesting candidate MS genes, MPHOSPH9, CTNNA3, PHACTR2 and IL7, are identified by RF analysis and warrant further follow-up in independent studies.
    Conclusions: This study presents one of the first illustrations of successfully analyzing GWA data with a machine learning algorithm. It is shown that RF is computationally feasible for GWA data and the results obtained make biological sense based on previous studies. More importantly, new genes were identified as potentially being associated with MS, suggesting new avenues of investigation for this complex disease.
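
    The tuning and pruning steps mentioned above could look roughly like the following, where ld_prune is a hypothetical greedy r^2-based pruning helper and the mtry/ntree grid is scanned with out-of-bag accuracy; none of this reproduces the study's actual settings or dataset.

        # Illustrative LD pruning plus an mtry/ntree scan judged by OOB accuracy.
        import numpy as np
        from sklearn.ensemble import RandomForestClassifier

        def ld_prune(X, r2_max=0.5):
            """Greedy prune: keep a SNP only if its r^2 with every kept SNP is < r2_max."""
            kept = []
            for j in range(X.shape[1]):
                if all(np.corrcoef(X[:, j], X[:, k])[0, 1] ** 2 < r2_max for k in kept):
                    kept.append(j)
            return np.array(kept)

        rng = np.random.default_rng(0)
        X = rng.binomial(2, 0.3, size=(500, 200))    # toy stand-in for a GWA SNP matrix
        y = rng.binomial(1, 1 / (1 + np.exp(-0.7 * (X[:, 0] - X[:, 0].mean()))))

        X_pruned = X[:, ld_prune(X)]

        # Defaults are rarely adequate for p >> n; scan mtry (max_features) and ntree.
        for mtry in (0.01, 0.1, 0.3):
            for ntree in (500, 2000):
                rf = RandomForestClassifier(n_estimators=ntree, max_features=mtry,
                                            oob_score=True, random_state=0, n_jobs=-1)
                rf.fit(X_pruned, y)
                print(f"mtry={mtry:<4} ntree={ntree:<5} OOB acc={rf.oob_score_:.3f}")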

    Data mining of high density genomic variant data for prediction of Alzheimer's disease risk

    Background: The discovery of genetic associations is an important factor in understanding human illness and deriving disease pathways. Identifying multiple interacting genetic mutations associated with disease remains challenging in studying the etiology of complex diseases. Although genome-wide association studies (GWAS) have recently confirmed new single nucleotide polymorphisms (SNPs) at genes implicated in immune response, cholesterol/lipid metabolism, and cell membrane processes to be associated with late-onset Alzheimer's disease (LOAD), a percentage of AD heritability remains unexplained. We use data mining methods to search for other genetic variants that may influence LOAD risk.
    Methods: Two different approaches were devised to select SNPs associated with LOAD in a publicly available GWAS dataset consisting of three cohorts. In both approaches, single-locus analysis (logistic regression) was conducted to filter the data with a less conservative p-value than the Bonferroni threshold; this resulted in a subset of SNPs used next in multi-locus analysis (random forest (RF)). In the second approach, we took prior biological knowledge into account and performed sample stratification and linkage disequilibrium (LD) analysis, in addition to logistic regression, to preselect loci to input into the RF classifier construction step.
    Results: The first approach gave 199 SNPs mostly associated with genes in calcium signaling, cell adhesion, endocytosis, immune response, and synaptic function. These SNPs, together with APOE and GAB2 SNPs, formed a predictive subset for LOAD status with an average error of 9.8% using 10-fold cross-validation (CV) in RF modeling. Nineteen variants in LD with ST5, TRPC1, ATG10, ANO3, NDUFA12, and NISCH respectively, genes linked directly or indirectly with neurobiology, were identified with the second approach. These variants were part of a model that included APOE and GAB2 SNPs to predict LOAD risk, which produced a 10-fold CV average error of 17.5% in the classification modeling.
    Conclusions: With the two proposed approaches, we identified a large subset of SNPs in genes mostly clustered around specific pathways/functions and a smaller set of SNPs, within or in proximity to five genes not previously reported, that may be relevant for the prediction and understanding of AD.
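
    A minimal sketch of the two-stage filter-then-classify idea from the Methods: single-locus logistic regression with a relaxed p-value cutoff, followed by a random forest assessed with 10-fold cross-validation. The cutoff, toy data, and statsmodels/scikit-learn calls are illustrative assumptions, not the study's pipeline.

        # Stage 1: per-SNP logistic regression filter; Stage 2: RF with 10-fold CV.
        import numpy as np
        import statsmodels.api as sm
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import cross_val_score

        rng = np.random.default_rng(0)
        X = rng.binomial(2, 0.3, size=(800, 300)).astype(float)   # toy genotype matrix
        y = rng.binomial(1, 1 / (1 + np.exp(-0.7 * (X[:, 0] - X[:, 0].mean()))))

        def snp_pvalue(geno):
            """Single-locus logistic regression p-value for one SNP."""
            model = sm.Logit(y, sm.add_constant(geno)).fit(disp=0)
            return model.pvalues[1]

        # Relaxed threshold, far less conservative than Bonferroni, as in the abstract.
        pvals = np.array([snp_pvalue(X[:, j]) for j in range(X.shape[1])])
        selected = np.where(pvals < 1e-3)[0]
        if selected.size == 0:                    # guard for this toy example
            selected = np.argsort(pvals)[:10]

        rf = RandomForestClassifier(n_estimators=1000, random_state=0, n_jobs=-1)
        acc = cross_val_score(rf, X[:, selected], y, cv=10)
        print(f"{selected.size} SNPs kept; 10-fold CV error = {1 - acc.mean():.3f}")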

    Identification of population-informative markers from high-density genotyping data through combined feature selection and machine learning algorithms: Application to European autochthonous and cosmopolitan pig breeds

    Large genotyping datasets, obtained from high-density single nucleotide polymorphism (SNP) arrays developed for different livestock species, can be used to describe and differentiate breeds or populations. To identify the most discriminating genetic markers among thousands of genotyped SNPs, a few statistical approaches have been proposed. In this study, we applied the Boruta algorithm, a wrapper around the random forest machine learning algorithm, to a database of 23 European pig breeds (20 autochthonous and three cosmopolitan breeds) genotyped with a 70k SNP chip, to pre-select informative SNPs. To identify different sets of SNPs, these pre-selected markers were then ranked with random forest based on their mean decrease accuracy and mean decrease Gini indexes. We evaluated the efficiency of these subsets for breed classification and the usefulness of this approach for detecting candidate genes affecting breed-specific phenotypes and relevant production traits that might differ among breeds. The lowest overall classification error (2.3%) was reached with a subpanel including only 398 SNPs (ranked based on their mean decrease accuracy), with no classification error in seven breeds using up to 49 SNPs. Several SNPs of these selected subpanels lie in genomic regions in which previous studies had identified signatures of selection or genes associated with morphological or production traits that distinguish the analysed breeds. Therefore, even though these approaches were not originally designed to identify signatures of selection, the results show that they could potentially be useful for this purpose.
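
    A sketch along the lines of the described workflow, assuming the Python BorutaPy package as a stand-in for the Boruta implementation actually used: Boruta pre-selects informative SNPs and a random forest then ranks the survivors by impurity importance (the mean decrease Gini analogue; mean decrease accuracy would require permutation importance instead). The toy genotype matrix and breed labels are fabricated for the example.

        # Boruta pre-selection of SNPs followed by RF ranking of the confirmed markers.
        import numpy as np
        from sklearn.ensemble import RandomForestClassifier
        from boruta import BorutaPy            # pip install Boruta

        rng = np.random.default_rng(0)
        X = rng.binomial(2, 0.3, size=(400, 300))      # toy multi-breed genotype matrix
        breed = rng.integers(0, 4, size=400)           # 4 fake breed labels
        # Give a handful of SNPs breed-specific allele frequencies so something is selectable.
        for b in range(4):
            X[breed == b, b * 3:(b + 1) * 3] = rng.binomial(
                2, 0.15 + 0.2 * b, size=((breed == b).sum(), 3))

        rf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)
        boruta = BorutaPy(rf, n_estimators="auto", random_state=0, verbose=0)
        boruta.fit(X, breed)

        selected = np.where(boruta.support_)[0]        # SNPs confirmed by Boruta
        if selected.size == 0:                         # guard for this toy example
            selected = np.argsort(boruta.ranking_)[:20]

        rf_rank = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)
        rf_rank.fit(X[:, selected], breed)
        ranking = selected[np.argsort(rf_rank.feature_importances_)[::-1]]
        print("confirmed SNPs:", selected.size, "top 10:", ranking[:10])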