75 research outputs found

Evaluation of network-guided random forest for disease gene discovery

Author: Hu Jianchang
Szymczak Silke
Publication date: 02/08/2023

Gene network information is believed to be beneficial for disease module and pathway identification, but has not been explicitly utilized in the standard random forest (RF) algorithm for gene expression data analysis. We investigate the performance of a network-guided RF where the network information is summarized into a sampling probability of predictor variables which is further used in the construction of the RF. Our results suggest that network-guided RF does not provide better disease prediction than the standard RF. In terms of disease gene discovery, if disease genes form module(s), network-guided RF identifies them more accurately. In addition, when disease status is independent from genes in the given network, spurious gene selection results can occur when using network information, especially on hub genes. Our empirical analysis on two balanced microarray and RNA-Seq breast cancer datasets from The Cancer Genome Atlas (TCGA) for classification of progesterone receptor (PR) status also demonstrates that network-guided RF can identify genes from PGR-related pathways, which leads to a better connected module of identified genes. Comment: 23 pages, 2 tables, 7 figures

arXiv.org e-Print Archive
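
The abstract above describes summarizing a gene network into a sampling probability for predictor variables. A minimal sketch of that idea follows. It assumes a simple degree-based weighting with add-one smoothing, which is an illustrative choice and not necessarily the construction used by the authors. The function names and the toy network are hypothetical.

```python
import random


def degree_sampling_probabilities(edges, genes):
    """Turn an undirected gene network into per-gene sampling probabilities.

    Assumption for illustration: probability is proportional to (degree + 1),
    so isolated genes can still be drawn.
    """
    degree = {g: 0 for g in genes}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    total = sum(d + 1 for d in degree.values())
    return {g: (d + 1) / total for g, d in degree.items()}


def sample_split_candidates(probs, mtry, rng):
    """Draw mtry distinct genes, weighted by probs, as split candidates."""
    if mtry > len(probs):
        raise ValueError("mtry cannot exceed the number of genes")
    genes = list(probs)
    weights = [probs[g] for g in genes]
    chosen = []
    for _ in range(mtry):
        # Weighted draw of one index, then remove it to avoid repeats.
        idx = rng.choices(range(len(genes)), weights=weights, k=1)[0]
        chosen.append(genes.pop(idx))
        weights.pop(idx)
    return chosen


# Toy network (hypothetical): one hub connected to three genes, one isolated gene.
toy_genes = ["hub", "g1", "g2", "g3", "g4"]
toy_edges = [("hub", "g1"), ("hub", "g2"), ("hub", "g3"), ("g1", "g2")]
toy_probs = degree_sampling_probabilities(toy_edges, toy_genes)
candidates = sample_split_candidates(toy_probs, mtry=3, rng=random.Random(1))
```

Under this weighting the hub gene is offered as a split candidate most often, which is consistent with the abstract's caution that hub genes can be selected spuriously when disease status is unrelated to the network.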

A review on longitudinal data analysis with random forest in precision medicine

Author: Hu Jianchang
Szymczak Silke
Publication venue: 'Oxford University Press (OUP)'
Publication date: 08/08/2022

Precision medicine provides customized treatments to patients based on their characteristics and is a promising approach to improving treatment efficiency. Large scale omics data are useful for patient characterization, but often their measurements change over time, leading to longitudinal data. Random forest is one of the state-of-the-art machine learning methods for building prediction models, and can play a crucial role in precision medicine. In this paper, we review extensions of the standard random forest method for the purpose of longitudinal data analysis. Extension methods are categorized according to the data structures for which they are designed. We consider both univariate and multivariate responses and further categorize the repeated measurements according to whether the time effect is relevant. Information on available software implementations of the reviewed extensions is also given. We conclude with discussions on the limitations of our review and some future research directions. Comment: 27 pages, 2 figures, 3 tables

arXiv.org e-Print Archive

Evaluation of single-nucleotide polymorphism imputation using random forests

Author: König Inke R
Schwarz Daniel F
Szymczak Silke
Ziegler Andreas
Publication venue: BioMed Central
Publication date: 01/01/2009

Genome-wide association studies (GWAS) have helped to reveal genetic mechanisms of complex diseases. Although commonly used genotyping technology enables us to determine up to a million single-nucleotide polymorphisms (SNPs), causative variants are typically not genotyped directly. A favored approach to increase the power of genome-wide association studies is to impute the untyped SNPs using more complete genotype data of a reference population.

Crossref

PubMed Central

Effect of hyperparameters on variable selection in random forests

Author: Fouodo Cesaire J. K.
Kronziel Lea L.
König Inke R.
Szymczak Silke
Publication date: 13/09/2023

Random forests (RFs) are well suited for prediction modeling and variable selection in high-dimensional omics studies. The effects of hyperparameters of the RF algorithm on prediction performance and variable importance estimation have previously been investigated. However, how hyperparameters impact RF-based variable selection remains unclear. We evaluate the effects on the Vita and the Boruta variable selection procedures based on two simulation studies utilizing theoretical distributions and empirical gene expression data. We assess the ability of the procedures to select important variables (sensitivity) while controlling the false discovery rate (FDR). Our results show that the proportion of splitting candidate variables (mtry.prop) and the sample fraction (sample.fraction) for the training dataset influence the selection procedures more than the drawing strategy of the training datasets and the minimal terminal node size. A suitable setting of the RF hyperparameters depends on the correlation structure in the data. For weakly correlated predictor variables, the default value of mtry is optimal, but smaller values of sample.fraction result in larger sensitivity. In contrast, the difference in sensitivity of the optimal compared to the default value of sample.fraction is negligible for strongly correlated predictor variables, whereas smaller values than the default are better in the other settings. In conclusion, the default values of the hyperparameters will not always be suitable for identifying important variables. Thus, adequate values differ depending on whether the aim of the study is optimizing prediction performance or variable selection. Comment: 18 pages, 2 figures + 2 figures in appendix, 3 tables

arXiv.org e-Print Archive

ACPA: automated cluster plot analysis of genotype data

Author: König Inke R
Schillert Arne
Schwarz Daniel F
Szymczak Silke
Vens Maren
Ziegler Andreas
Publication venue: BioMed Central
Publication date: 01/01/2009

Genome-wide association studies have become standard in genetic epidemiology. Analyzing hundreds of thousands of markers simultaneously imposes some challenges for statisticians. One issue is the problem of multiplicity, which has been compared with the search for the needle in a haystack. To reduce the number of false-positive findings, a number of quality filters such as exclusion of single-nucleotide polymorphisms (SNPs) with a high missing fraction are employed. Another filter is exclusion of SNPs for which the calling algorithm had difficulties in assigning the genotypes. The only way to do this is the visual inspection of the cluster plots, also termed signal intensity plots, but this approach is often neglected. We developed an algorithm ACPA (automated cluster plot analysis), which performs this task automatically for autosomal SNPs. It is based on counting samples that lie too close to the cluster of a different genotype; SNPs are excluded when a certain threshold is exceeded. We evaluated ACPA using 1,000 randomly selected quality controlled SNPs from the Framingham Heart Study data that were provided for the Genetic Analysis Workshop 16. We compared the decision of ACPA with the decision made by two independent readers. We achieved a sensitivity of 88% (95% CI: 81%-93%) and a specificity of 86% (95% CI: 83%-89%). In a screening setting in which one aims at not losing any good SNP, we achieved 99% (95% CI: 98%-100%) specificity and still detected every second low-quality SNP.

Crossref

Springer - Publisher Connector

PubMed Central

Risk estimation using probability machines

Author: Abhijit Dasgupta
James D Malley
Jason H Moore
Joan E Bailey-Wilson
Silke Szymczak
Publication venue: Springer Nature
Publication date: 01/01/2014

BACKGROUND: Logistic regression has been the de facto, and often the only, model used in the description and analysis of relationships between a binary outcome and observed features. It is widely used to obtain the conditional probabilities of the outcome given predictors, as well as predictor effect size estimates using conditional odds ratios. RESULTS: We show how statistical learning machines for binary outcomes, provably consistent for the nonparametric regression problem, can be used to provide both consistent conditional probability estimation and conditional effect size estimates. Effect size estimates from learning machines leverage our understanding of counterfactual arguments central to the interpretation of such estimates. We show that, if the data generating model is logistic, we can recover accurate probability predictions and effect size estimates with nearly the same efficiency as a correct logistic model, both for main effects and interactions. We also propose a method using learning machines to scan for possible interaction effects quickly and efficiently. Simulations using random forest probability machines are presented. CONCLUSIONS: The models we propose make no assumptions about the data structure, and capture the patterns in the data by just specifying the predictors involved and not any particular model structure. So they do not run the same risks of model mis-specification and the resultant estimation biases as a logistic model. This methodology, which we call a “risk machine”, will share properties from the statistical machine that it is derived from.

Springer - Publisher Connector

PubMed Central

Genetic association studies for gene expressions: permutation-based mutual information in a comparison with standard ANOVA and as a novel approach for feature selection

Author: Bellazzi Riccardo
Fuchsberger Christian
Igl Bernd-Wolfgang
Nuzzo Angelo
Schwarz Daniel F
Szymczak Silke
Ziegler Andreas
Publication venue: BioMed Central
Publication date: 01/01/2007

Mutual information (MI) is a robust nonparametric statistical approach for identifying associations between genotypes and gene expression levels. Using the data of Problem 1 provided for the Genetic Analysis Workshop 15, we first compared a quantitative MI (Tsalenko et al. 2006 J Bioinform Comput Biol 4:259–4) with the standard analysis of variance (ANOVA) and the nonparametric Kruskal-Wallis (KW) test. We then proposed a novel feature selection approach using MI in a classification scenario to address the small-n, large-p problem and compared it with a feature selection that relies on an asymptotic χ² distribution. In both applications, we used a permutation-based approach for evaluating the significance of MI. Substantial discrepancies in significance were observed between MI, ANOVA, and KW that can be explained by different empirical distributions of the data. In contrast to ANOVA and KW, MI detects shifts in location when the data are non-normally distributed, skewed, or contaminated with outliers. ANOVA but not MI is often significant if one genotype with a small frequency had a remarkable difference in the average gene expression level relative to the other two genotypes. MI depends on genotype frequencies and cannot detect these differences. In the classification scenario, we show that our novel approach for feature selection identifies a smaller list of markers with higher accuracy compared to the standard method. In conclusion, permutation-based MI approaches provide reliable and flexible statistical frameworks which seem to be well suited for data that are non-normal, skewed, or have an otherwise peculiar distribution. They merit further methodological investigation.

Crossref

PubMed Central
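
The abstract above evaluates the significance of mutual information by permutation. The general idea can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes expression levels have already been discretized into bins (the paper uses a quantitative MI), uses a plug-in MI estimate, and all names and data are hypothetical.

```python
import math
import random
from collections import Counter


def mutual_information(x, y):
    """Plug-in mutual information (in nats) between two discrete sequences."""
    n = len(x)
    count_x = Counter(x)
    count_y = Counter(y)
    count_xy = Counter(zip(x, y))
    mi = 0.0
    for (a, b), c in count_xy.items():
        # p(a, b) * log( p(a, b) / (p(a) * p(b)) ), written with raw counts
        mi += (c / n) * math.log(c * n / (count_x[a] * count_y[b]))
    return mi


def permutation_p_value(x, y, n_perm=999, seed=0):
    """Return observed MI and a permutation p-value.

    Labels in y are shuffled to break any association with x; the p-value
    is the share of shuffles whose MI is at least the observed MI, with the
    usual +1 correction so it is never exactly zero.
    """
    rng = random.Random(seed)
    observed = mutual_information(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if mutual_information(x, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical data: genotype (0/1/2) fully determines the expression bin.
genotypes = [0] * 20 + [1] * 20 + [2] * 20
expression_bins = ["low"] * 20 + ["mid"] * 20 + ["high"] * 20
mi_observed, p_value = permutation_p_value(genotypes, expression_bins)
```

Because the null distribution is built from the data themselves, no distributional assumption about expression levels is needed, which matches the abstract's point about non-normal or skewed data.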

Comparison of parametric and machine methods for variable selection in simulated Genetic Analysis Workshop 19 data

Author: A Cutler
C Strobl
Cheryl D. Cropp
D Welter
DF Schwarz
Elizabeth W. Pugh
Emily R. Holzinger
Hua Ling
J Blangero
James Malley
JC Gertrudes
Joan E. Bailey-Wilson
Peng Zhang
Qing Li
S Szymczak
Sean Griffith
Silke Szymczak
TA Manolio
Publication venue: 'Springer Science and Business Media LLC'

Crossref

Paternal chronic colitis causes epigenetic inheritance of susceptibility to colitis.

Author: Adolph Timon Erik
Ammerpohl Ole
Franke Andre
Heinsen Femke-Anouska
Kachroo Priyadarshini
Kaser Arthur
Klughammer Johanna
Krueger Felix
Offner Felix Albert
Rühlemann Malte Christoph
Smallwood Sébastien
Szymczak Silke
Tschurtschenthaler Markus
Publication venue: Sci Rep
Publication date: 19/08/2016

Inflammatory bowel disease (IBD) arises by unknown environmental triggers in genetically susceptible individuals. Epigenetic regulation of gene expression may integrate internal and external influences and may thereby modulate disease susceptibility. Epigenetic modification may also affect the germ-line and in certain contexts can be inherited to offspring. This study investigates epigenetic alterations consequent to experimental murine colitis induced by dextran sodium sulphate (DSS), and their paternal transmission to offspring. Genome-wide methylome- and transcriptome-profiling of intestinal epithelial cells (IECs) and sperm cells of males of the F0 generation, which received either DSS and consequently developed colitis (F0(DSS)), or non-supplemented tap water (F0(Ctrl)) and hence remained healthy, and of their F1 offspring was performed using reduced representation bisulfite sequencing (RRBS) and RNA-sequencing (RNA-Seq), respectively. Offspring of F0(DSS) males exhibited aberrant methylation and expression patterns of multiple genes, including Igf1r and Nr4a2, which are involved in energy metabolism. Importantly, DSS colitis in F0(DSS) mice was associated with decreased body weight at baseline of their F1 offspring, and these F1 mice exhibited increased susceptibility to DSS-induced colitis compared to offspring from F0(Ctrl) males. This study hence demonstrates epigenetic transmissibility of metabolic and inflammatory traits resulting from experimental colitis. This study was carried out as part of the Research Training Group “Genes, Environment and Inflammation”, supported by the Deutsche Forschungsgemeinschaft (RTG 1743/1) of which A.F. is the spokesperson, the European Research Council under the European Community’s Seventh Framework Programme (FP7/2007–2013)/ERC Grant agreement no. 260961 (A.K.), the Austrian Science Fund and Ministry of Science P21530-B18 and START Y446-B18 (A.K.), the Wellcome Trust (investigator award 106260/Z/14/Z) to A.K., the Cambridge Biomedical Research Centre (A.K.), a fellowship from the European Crohn’s and Colitis Organisation (M.T. and T.E.A.) and a DOC fellowship from the Austrian Academy of Sciences (J.K.). This is the final version of the article. It first appeared from Nature Publishing Group via http://dx.doi.org/10.1038/srep3164

PubMed Central

Apollo (Cambridge)