Gene network information is believed to be beneficial for disease module and
pathway identification, but has not been explicitly utilized in the standard
random forest (RF) algorithm for gene expression data analysis. We investigate
the performance of a network-guided RF in which the network information is
summarized into sampling probabilities for the predictor variables, which are
then used in constructing the RF. Our results suggest that network-guided RF
does not provide better disease prediction than the standard RF. In terms of
disease gene discovery, if disease genes form module(s), network-guided RF
identifies them more accurately. In addition, when disease status is
independent of the genes in the given network, using network information can
produce spurious gene selection results, especially for hub genes. Our
empirical analysis of two balanced microarray and RNA-Seq breast cancer
datasets from The Cancer Genome Atlas (TCGA) for classification of progesterone
receptor (PR) status also demonstrates that network-guided RF can identify
genes from PGR-related pathways, which leads to a better-connected module of
identified genes.

Comment: 23 pages, 2 tables, 7 figures
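
The idea of summarizing a network into sampling probabilities for predictor variables can be sketched as follows. This is only an illustration under stated assumptions, not the procedure used in the paper: the degree-plus-smoothing weighting, the function names, and the toy gene names are all hypothetical, and the abstract does not say how the probabilities are derived or at which stage of RF construction they are applied. The sketch draws a candidate set of genes without replacement, favoring well-connected genes, as a stand-in for the uniform feature subsampling of a standard RF.

```python
import random

def degree_sampling_probabilities(edges, genes, smoothing=1.0):
    """Turn an undirected gene network into per-gene sampling probabilities.

    Assumption (not from the source): weight = degree + smoothing, so that
    isolated genes keep a nonzero chance of being drawn.
    """
    degree = {g: 0 for g in genes}
    for a, b in edges:
        # Ignore self-loops and edges that mention unknown genes.
        if a in degree and b in degree and a != b:
            degree[a] += 1
            degree[b] += 1
    weights = {g: degree[g] + smoothing for g in genes}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}

def sample_candidate_genes(probabilities, k, rng):
    """Draw k distinct genes, favoring those with higher probability.

    Uses Efraimidis-Spirakis keys u ** (1 / p): keeping the k largest keys
    yields a weighted sample without replacement.
    """
    keys = {g: rng.random() ** (1.0 / p) for g, p in probabilities.items()}
    return sorted(keys, key=keys.get, reverse=True)[:k]

# Hypothetical toy network: one hub connected to three genes, one isolated gene.
genes = ["hub", "g1", "g2", "g3", "isolated"]
edges = [("hub", "g1"), ("hub", "g2"), ("hub", "g3")]

probs = degree_sampling_probabilities(edges, genes)
rng = random.Random(0)
candidates = sample_candidate_genes(probs, 2, rng)  # candidate predictors for one tree or split
```

Under this hypothetical weighting the hub receives the largest sampling probability regardless of whether it is related to the outcome, which is one plausible way that hub genes could be over-selected when disease status is independent of the network.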