Bioinformatics tools have been developed to interpret gene expression data at
the gene set level, and these gene set based analyses improve the biologists'
capability to discover functional relevance of their experiment design. While
elucidating gene set individually, inter gene sets association is rarely taken
into consideration. Deep learning, an emerging machine learning technique in
computational biology, can be used to generate an unbiased combination of gene
set, and to determine the biological relevance and analysis consistency of
these combining gene sets by leveraging large genomic data sets. In this study,
we proposed a gene superset autoencoder (GSAE), a multi-layer autoencoder model
with the incorporation of a priori defined gene sets that retain the crucial
biological features in the latent layer. We introduced the concept of the gene
superset, an unbiased combination of gene sets with weights trained by the
autoencoder, where each node in the latent layer is a superset. Trained with
genomic data from TCGA and evaluated with their accompanying clinical
parameters, we showed gene supersets' ability of discriminating tumor subtypes
and their prognostic capability. We further demonstrated the biological
relevance of the top component gene sets in the significant supersets. Using
autoencoder model and gene superset at its latent layer, we demonstrated that
gene supersets retain sufficient biological information with respect to tumor
subtypes and clinical prognostic significance. Superset also provides high
reproducibility on survival analysis and accurate prediction for cancer
subtypes.Comment: Presented in the International Conference on Intelligent Biology and
Medicine (ICIBM 2018) at Los Angeles, CA, USA and published in BMC Systems
Biology 2018, 12(Suppl 8):14