Evaluation of network-guided random forest for disease gene discovery
Gene network information is believed to be beneficial for disease module and
pathway identification, but has not been explicitly utilized in the standard
random forest (RF) algorithm for gene expression data analysis. We investigate
the performance of a network-guided RF where the network information is
summarized into a sampling probability of predictor variables which is further
used in the construction of the RF. Our results suggest that network-guided RF
does not provide better disease prediction than the standard RF. In terms of
disease gene discovery, if disease genes form module(s), network-guided RF
identifies them more accurately. In addition, when disease status is
independent of genes in the given network, spurious gene selection results
can occur when using network information, especially on hub genes. Our
empirical analysis on two balanced microarray and RNA-Seq breast cancer
datasets from The Cancer Genome Atlas (TCGA) for classification of progesterone
receptor (PR) status also demonstrates that network-guided RF can identify
genes from PGR-related pathways, which leads to a better connected module of
identified genes.
Comment: 23 pages, 2 tables, 7 figures
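The idea of summarizing a network into a sampling probability for predictor variables can be sketched as follows. This is a minimal illustration, not the authors' implementation: the degree-plus-pseudo-count weighting, the function names, and the toy network are all assumptions made for the example.

```python
import random


def sampling_probabilities(edges, genes, pseudo_count=1.0):
    """Turn a gene network into per-gene sampling probabilities.

    Assumption (hypothetical choice): weight = node degree + pseudo_count,
    so isolated genes can still be drawn.
    """
    degree = {g: 0 for g in genes}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    weights = {g: degree[g] + pseudo_count for g in genes}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}


def draw_split_candidates(probs, mtry, rng):
    """Draw mtry distinct genes, without replacement, proportional to probs."""
    remaining = dict(probs)
    chosen = []
    for _ in range(mtry):
        names = list(remaining)
        pick = rng.choices(names, weights=[remaining[n] for n in names], k=1)[0]
        chosen.append(pick)
        del remaining[pick]
    return chosen


# Toy network (made up): gene A is a hub connected to B, C, and D; E is isolated.
genes = ["A", "B", "C", "D", "E"]
edges = [("A", "B"), ("A", "C"), ("A", "D")]
probs = sampling_probabilities(edges, genes)
candidates = draw_split_candidates(probs, mtry=2, rng=random.Random(0))
```

Under this weighting the hub gene A is offered as a split candidate most often, which is one way to see why hub genes may be selected spuriously when disease status is unrelated to the network.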
A review on longitudinal data analysis with random forest in precision medicine
Precision medicine provides customized treatments to patients based on their
characteristics and is a promising approach to improving treatment efficiency.
Large scale omics data are useful for patient characterization, but often their
measurements change over time, leading to longitudinal data. Random forest is
one of the state-of-the-art machine learning methods for building prediction
models, and can play a crucial role in precision medicine. In this paper, we
review extensions of the standard random forest method for the purpose of
longitudinal data analysis. Extension methods are categorized according to the
data structures for which they are designed. We consider both univariate and
multivariate responses and further categorize the repeated measurements
according to whether the time effect is relevant. Information on available
software implementations of the reviewed extensions is also given. We conclude
with discussions on the limitations of our review and some future research
directions.
Comment: 27 pages, 2 figures, 3 tables
Evaluation of single-nucleotide polymorphism imputation using random forests
Genome-wide association studies (GWAS) have helped to reveal genetic mechanisms of complex diseases. Although commonly used genotyping technology enables us to determine up to a million single-nucleotide polymorphisms (SNPs), causative variants are typically not genotyped directly. A favored approach to increase the power of GWAS is to impute the untyped SNPs using more complete genotype data of a reference population.
Effect of hyperparameters on variable selection in random forests
Random forests (RFs) are well suited for prediction modeling and variable
selection in high-dimensional omics studies. The effects of hyperparameters of
the RF algorithm on prediction performance and variable importance estimation
have previously been investigated. However, how hyperparameters impact RF-based
variable selection remains unclear. We evaluate the effects on the Vita and the
Boruta variable selection procedures based on two simulation studies utilizing
theoretical distributions and empirical gene expression data. We assess the
ability of the procedures to select important variables (sensitivity) while
controlling the false discovery rate (FDR). Our results show that the
proportion of splitting candidate variables (mtry.prop) and the sample fraction
(sample.fraction) for the training dataset influence the selection procedures
more than the drawing strategy of the training datasets and the minimal
terminal node size. A suitable setting of the RF hyperparameters depends on the
correlation structure in the data. For weakly correlated predictor variables,
the default value of mtry is optimal, but smaller values of sample.fraction
result in larger sensitivity. In contrast, the difference in sensitivity between
the optimal and the default value of sample.fraction is negligible for
strongly correlated predictor variables, whereas smaller values than the
default are better in the other settings. In conclusion, the default values of
the hyperparameters will not always be suitable for identifying important
variables. Thus, adequate values differ depending on whether the aim of the
study is optimizing prediction performance or variable selection.
Comment: 18 pages, 2 figures + 2 figures in appendix, 3 tables
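The two evaluation criteria named in this abstract can be computed per simulation replicate as sketched below. The function name and the gene labels are hypothetical. The false discovery proportion of a single replicate is shown; the FDR is its average over replicates.

```python
def selection_metrics(selected, truly_important):
    """Return (sensitivity, false discovery proportion) for one replicate."""
    selected = set(selected)
    truly_important = set(truly_important)
    true_positives = len(selected & truly_important)
    # Sensitivity: share of truly important variables that were selected.
    if truly_important:
        sensitivity = true_positives / len(truly_important)
    else:
        sensitivity = float("nan")
    # False discovery proportion: share of selected variables that are noise.
    # By convention it is 0 when nothing is selected.
    if selected:
        fdp = (len(selected) - true_positives) / len(selected)
    else:
        fdp = 0.0
    return sensitivity, fdp


# Made-up replicate: four important variables, three selected, one of them noise.
sensitivity, fdp = selection_metrics(["g1", "g2", "g7"], ["g1", "g2", "g3", "g4"])
```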
ACPA: automated cluster plot analysis of genotype data
Genome-wide association studies have become standard in genetic epidemiology. Analyzing hundreds of thousands of markers simultaneously imposes some challenges for statisticians. One issue is the problem of multiplicity, which has been compared with the search for the needle in a haystack. To reduce the number of false-positive findings, a number of quality filters such as exclusion of single-nucleotide polymorphisms (SNPs) with a high missing fraction are employed. Another filter is exclusion of SNPs for which the calling algorithm had difficulties in assigning the genotypes. The only way to do this is the visual inspection of the cluster plots, also termed signal intensity plots, but this approach is often neglected. We developed an algorithm ACPA (automated cluster plot analysis), which performs this task automatically for autosomal SNPs. It is based on counting samples that lie too close to the cluster of a different genotype; SNPs are excluded when a certain threshold is exceeded. We evaluated ACPA using 1,000 randomly selected quality controlled SNPs from the Framingham Heart Study data that were provided for the Genetic Analysis Workshop 16. We compared the decision of ACPA with the decision made by two independent readers. We achieved a sensitivity of 88% (95% CI: 81%-93%) and a specificity of 86% (95% CI: 83%-89%). In a screening setting in which one aims at not losing any good SNP, we achieved 99% (95% CI: 98%-100%) specificity and still detected every second low-quality SNP
Risk estimation using probability machines
BACKGROUND: Logistic regression has been the de facto, and often the only, model used in the description and analysis of relationships between a binary outcome and observed features. It is widely used to obtain the conditional probabilities of the outcome given predictors, as well as predictor effect size estimates using conditional odds ratios. RESULTS: We show how statistical learning machines for binary outcomes, provably consistent for the nonparametric regression problem, can be used to provide both consistent conditional probability estimation and conditional effect size estimates. Effect size estimates from learning machines leverage our understanding of counterfactual arguments central to the interpretation of such estimates. We show that, if the data generating model is logistic, we can recover accurate probability predictions and effect size estimates with nearly the same efficiency as a correct logistic model, both for main effects and interactions. We also propose a method using learning machines to scan for possible interaction effects quickly and efficiently. Simulations using random forest probability machines are presented. CONCLUSIONS: The models we propose make no assumptions about the data structure, and capture the patterns in the data by just specifying the predictors involved and not any particular model structure. So they do not run the same risks of model mis-specification and the resultant estimation biases as a logistic model. This methodology, which we call a “risk machine”, will share properties from the statistical machine that it is derived from
Genetic association studies for gene expressions: permutation-based mutual information in a comparison with standard ANOVA and as a novel approach for feature selection
Mutual information (MI) is a robust nonparametric statistical approach for identifying associations between genotypes and gene expression levels. Using the data of Problem 1 provided for the Genetic Analysis Workshop 15, we first compared a quantitative MI (Tsalenko et al. 2006 J Bioinform Comput Biol 4:259–4) with the standard analysis of variance (ANOVA) and the nonparametric Kruskal-Wallis (KW) test. We then proposed a novel feature selection approach using MI in a classification scenario to address the small n - large p problem and compared it with a feature selection that relies on an asymptotic χ2 distribution. In both applications, we used a permutation-based approach for evaluating the significance of MI. Substantial discrepancies in significance were observed between MI, ANOVA, and KW that can be explained by different empirical distributions of the data. In contrast to ANOVA and KW, MI detects shifts in location when the data are non-normally distributed, skewed, or contaminated with outliers. ANOVA but not MI is often significant if one genotype with a small frequency had a remarkable difference in the average gene expression level relative to the other two genotypes. MI depends on genotype frequencies and cannot detect these differences. In the classification scenario, we show that our novel approach for feature selection identifies a smaller list of markers with higher accuracy compared to the standard method. In conclusion, permutation-based MI approaches provide reliable and flexible statistical frameworks which seem to be well suited for data that are non-normal, skewed, or have an otherwise peculiar distribution. They merit further methodological investigation
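A permutation-based significance test for MI can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the study: expression is assumed to be already discretized into categories (the abstract refers to a quantitative MI), and the function names, the number of permutations, and the toy data are hypothetical.

```python
import math
import random
from collections import Counter


def mutual_information(x, y):
    """Plug-in MI (in nats) between two equally long categorical sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    # p(a,b) * log(p(a,b) / (p(a) * p(b))), written in terms of raw counts.
    return sum(
        (c / n) * math.log((c * n) / (px[a] * py[b]))
        for (a, b), c in pxy.items()
    )


def permutation_p_value(x, y, n_perm=999, seed=0):
    """Return observed MI and a permutation p-value from shuffling y."""
    rng = random.Random(seed)
    observed = mutual_information(x, y)
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        if mutual_information(x, y_perm) >= observed:
            hits += 1
    # Adding 1 to numerator and denominator keeps the p-value above zero.
    return observed, (hits + 1) / (n_perm + 1)


# Made-up data: genotype perfectly determines the expression category.
genotypes = [0] * 10 + [1] * 10 + [2] * 10
expression = ["low"] * 10 + ["mid"] * 10 + ["high"] * 10
mi, p_value = permutation_p_value(genotypes, expression)
```

Because the null distribution is built from the data themselves, no asymptotic χ2 approximation is needed, which is the motivation the abstract gives for the permutation approach.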
Paternal chronic colitis causes epigenetic inheritance of susceptibility to colitis
Inflammatory bowel disease (IBD) arises by unknown environmental triggers in genetically susceptible individuals. Epigenetic regulation of gene expression may integrate internal and external influences and may thereby modulate disease susceptibility. Epigenetic modification may also affect the germ-line and in certain contexts can be inherited to offspring. This study investigates epigenetic alterations consequent to experimental murine colitis induced by dextran sodium sulphate (DSS), and their paternal transmission to offspring. Genome-wide methylome- and transcriptome-profiling of intestinal epithelial cells (IECs) and sperm cells of males of the F0 generation, which received either DSS and consequently developed colitis (F0(DSS)), or non-supplemented tap water (F0(Ctrl)) and hence remained healthy, and of their F1 offspring was performed using reduced representation bisulfite sequencing (RRBS) and RNA-sequencing (RNA-Seq), respectively. Offspring of F0(DSS) males exhibited aberrant methylation and expression patterns of multiple genes, including Igf1r and Nr4a2, which are involved in energy metabolism. Importantly, DSS colitis in F0(DSS) mice was associated with decreased body weight at baseline of their F1 offspring, and these F1 mice exhibited increased susceptibility to DSS-induced colitis compared to offspring from F0(Ctrl) males. This study hence demonstrates epigenetic transmissibility of metabolic and inflammatory traits resulting from experimental colitis.
This study was carried out as part of the Research Training Group “Genes, Environment and Inflammation”, supported by the Deutsche Forschungsgemeinschaft (RTG 1743/1) of which A.F. is the spokesperson, the European Research Council under the European Community’s Seventh Framework Programme (FP7/2007–2013)/ERC Grant agreement no. 260961 (A.K.), the Austrian Science Fund and Ministry of Science P21530-B18 and START Y446-B18 (A.K.), the Wellcome Trust (investigator award 106260/Z/14/Z) to A.K., the Cambridge Biomedical Research Centre (A.K.), a fellowship from the European Crohn’s and Colitis Organisation (M.T. and T.E.A.) and a DOC fellowship from the Austrian Academy of Sciences (J.K.).
This is the final version of the article. It first appeared from Nature Publishing Group via http://dx.doi.org/10.1038/srep3164