122 research outputs found
Multi-TGDR: a regularization method for multi-class classification in microarray experiments
Background
With microarray technology becoming mature and popular, the selection and use
of a small number of relevant genes for accurate classification of samples is a
hot topic in the circles of biostatistics and bioinformatics. However, most of
the developed algorithms lack the ability to handle multiple classes, which is
arguably a common application. Here, we propose an extension to an existing
regularization algorithm called Threshold Gradient Descent Regularization
(TGDR) to specifically tackle multi-class classification of microarray data.
When there are several microarray experiments addressing the same/similar
objectives, one option is to use the meta-analysis version of TGDR (Meta-TGDR),
which considers the classification task as a combination of classifiers with the
same structure/model while allowing the parameters to vary across studies.
However, the original Meta-TGDR extension did not offer a solution to the
prediction on independent samples. Here, we propose an explicit method to
estimate the overall coefficients of the biomarkers selected by Meta-TGDR. This
extension permits broader applicability and allows a comparison between the
predictive performance of Meta-TGDR and TGDR using an independent testing set.
Results
Using real-world applications, we demonstrated that the proposed multi-TGDR
framework works well and that the number of selected genes is less than the sum
across all individual binary TGDRs. Additionally, Meta-TGDR and TGDR on the
batch-effect adjusted pooled data provided approximately the same results. By
adding a bagging procedure in each application, stability and good predictive
performance are ensured.
Conclusions
Compared with Meta-TGDR, TGDR is less computationally intensive and does not
require samples of all classes in each study. On the adjusted data, it has
approximately the same predictive performance as Meta-TGDR. Thus, it is highly
recommended.
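The thresholding idea behind TGDR can be illustrated with a minimal sketch for the binary case. This is not the authors' multi-class or Meta-TGDR implementation; it assumes the usual formulation of threshold gradient descent, in which each iteration updates only those coefficients whose gradient magnitude is at least a fraction tau of the largest one. The function name `tgdr_logistic`, the parameters `tau`, `step`, and `n_iter`, and the toy data are all hypothetical.

```python
import math

def tgdr_logistic(X, y, tau=0.8, step=0.05, n_iter=200):
    """Threshold gradient descent for binary logistic regression (sketch).

    X: list of rows (one per sample), y: list of 0/1 labels.
    tau in [0, 1] is the threshold; larger tau gives sparser coefficients.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        # Gradient of the mean log-likelihood with respect to each coefficient.
        grad = [0.0] * p
        for xi, yi in zip(X, y):
            eta = sum(b * v for b, v in zip(beta, xi))
            prob = 1.0 / (1.0 + math.exp(-eta))
            for j in range(p):
                grad[j] += (yi - prob) * xi[j] / n
        gmax = max(abs(g) for g in grad)
        if gmax == 0.0:
            break
        # Thresholding step: update only coefficients whose gradient
        # magnitude is at least tau times the largest magnitude.
        for j in range(p):
            if abs(grad[j]) >= tau * gmax:
                beta[j] += step * grad[j]
    return beta

# Toy data (assumed for illustration): feature 0 separates the classes,
# feature 1 is only weakly related to the label.
X = [[2.0, 0.1], [1.5, -0.2], [1.0, 0.3],
     [-1.0, 0.2], [-1.5, -0.1], [-2.0, -0.3]]
y = [1, 1, 1, 0, 0, 0]

sparse = tgdr_logistic(X, y, tau=0.5)  # weak feature is never updated
dense = tgdr_logistic(X, y, tau=0.0)   # every feature is updated
```

With `tau=0.0` the procedure reduces to plain gradient ascent on the log-likelihood, so every coefficient moves away from zero; raising `tau` toward 1 restricts updates to the features with the steepest gradients, which is what produces a small set of selected genes.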
Feature Selection for Longitudinal Data by Using Sign Averages to Summarize Gene Expression Values over Time
With the rapid evolution of high-throughput technologies, time series/longitudinal high-throughput experiments have become possible and affordable. However, the development of statistical methods dealing with gene expression profiles across time points has not kept up with the explosion of such data. The feature selection process is of critical importance for longitudinal microarray data. In this study, we proposed aggregating a gene’s expression values across time into a single value using the sign average method, thereby reducing a longitudinal feature selection process to a classic one. Regularized logistic regression models with pseudogenes (i.e., the sign average of genes across time as predictors) were then optimized by either the coordinate descent method or the threshold gradient descent regularization method. By applying the proposed methods to simulated data and a traumatic injury dataset, we have demonstrated that the proposed methods, especially the combination of sign average and threshold gradient descent regularization, outperform other competitive algorithms. To conclude, the proposed methods are highly recommended for studies with the objective of carrying out feature selection for longitudinal gene expression data.
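The aggregation step can be sketched as follows. The abstract does not define the sign average, so this sketch assumes one plausible reading: a gene's value at each time point is multiplied by the sign of its association with the class label at that time point, and the signed values are averaged over time. Estimating the sign from the difference in class means is a further assumption, and the function name `sign_average` and the toy data are hypothetical.

```python
def sign_average(profiles, labels):
    """Collapse one gene's longitudinal profile into one value per subject.

    profiles[i][t]: expression of the gene for subject i at time point t.
    labels[i]: class label (0 or 1) of subject i.
    """
    n_times = len(profiles[0])
    signs = []
    for t in range(n_times):
        case = [p[t] for p, lab in zip(profiles, labels) if lab == 1]
        ctrl = [p[t] for p, lab in zip(profiles, labels) if lab == 0]
        # Assumed rule: sign of the difference in class means at time t.
        diff = sum(case) / len(case) - sum(ctrl) / len(ctrl)
        signs.append(1.0 if diff >= 0 else -1.0)
    # Signed mean over time points gives one "pseudogene" value per subject.
    return [sum(s * v for s, v in zip(signs, p)) / n_times for p in profiles]

# Toy data (assumed): 4 subjects, 3 time points, one gene.
profiles = [[1.0, 3.0, 2.0], [2.0, 4.0, 1.0],
            [3.0, 1.0, 4.0], [4.0, 2.0, 5.0]]
labels = [0, 0, 1, 1]
pseudogene = sign_average(profiles, labels)
```

The resulting per-subject values can then serve as a single predictor column in an ordinary (cross-sectional) regularized logistic regression, which is the reduction the abstract describes.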
An Ensemble of the iCluster Method to Analyze Longitudinal lncRNA Expression Data for Psoriasis Patients
BACKGROUND: Psoriasis is an immune-mediated, inflammatory disorder of the skin with chronic inflammation and hyper-proliferation of the epidermis. Since psoriasis has genetic components and the diseased tissue of psoriasis is very easily accessible, it is natural to use high-throughput technologies to characterize psoriasis and thus seek targeted therapies. Transcriptional profiles change correspondingly after an intervention. Unlike cross-sectional gene expression data, longitudinal gene expression data can capture the dynamic changes and thus facilitate causal inference.
METHODS: Using the iCluster method as a building block, an ensemble method was proposed and applied to a longitudinal gene expression dataset for psoriasis, with the objective of identifying key lncRNAs that can discriminate the responders from the non-responders to two immune treatments of psoriasis.
RESULTS: Using support vector machine models, the leave-one-out predictive accuracy of the 20-lncRNA signature identified by this ensemble was estimated as 80%, which outperforms several competing methods. Furthermore, pathway enrichment analysis was performed on the target mRNAs of the identified lncRNAs. The enriched GO terms and KEGG pathways include proteasome and protein deubiquitination. The ubiquitination-proteasome system is regarded as a key player in psoriasis, and a proteasome inhibitor targeting the ubiquitination pathway holds promise for treating psoriasis.
CONCLUSIONS: An integrative method such as iCluster for multiple data integration can be adopted directly to analyze longitudinal gene expression data, which offers more promising options for longitudinal big data analysis. A comprehensive evaluation and validation of the resulting 20-lncRNA signature is highly desirable.
Incorporating Pathway Information into Feature Selection Towards Better Performed Gene Signatures
To analyze gene expression data with sophisticated grouping structures and to extract hidden patterns from such data, feature selection is of critical importance. It is well known that genes do not function in isolation but rather work together within various metabolic, regulatory, and signaling pathways. If the biological knowledge contained within these pathways is taken into account, the resulting method is a pathway-based algorithm. Studies have demonstrated that a pathway-based method usually outperforms its gene-based counterpart in which no biological knowledge is considered. In this article, pathway-based feature selection is first divided into three major categories, namely, pathway-level selection, bilevel selection, and pathway-guided gene selection. With bilevel selection methods being regarded as a special case of the pathway-guided gene selection process, we discuss pathway-guided gene selection methods in detail and the importance of penalization in such methods. Last, we point out the potential uses of pathway-guided gene selection in one active research avenue, namely, the analysis of longitudinal gene expression data. We believe this article provides valuable insights for computational biologists and biostatisticians so that they can make biology more computable.
A Longitudinal Feature Selection Method Identifies Relevant Genes to Distinguish Complicated Injury and Uncomplicated Injury Over Time
Background: Feature selection and gene set analysis are of increasing interest in the field of bioinformatics. While these two approaches have been developed for different purposes, we describe how some gene set analysis methods can be utilized to conduct feature selection.
Methods: We adopted a gene set analysis method, the significance analysis of microarray gene set reduction (SAMGSR) algorithm, to carry out feature selection for longitudinal gene expression data.
Results: Using a real-world application and simulated data, it is demonstrated that the proposed SAMGSR extension outperforms other relevant methods. In this study, we illustrate that a gene’s expression profiles over time can be regarded as a gene set and then a suitable gene set analysis method can be utilized directly to select relevant genes associated with the phenotype of interest over time.
Conclusions: We believe this work will motivate more research to bridge feature selection and gene set analysis, with the development of novel algorithms capable of carrying out feature selection for longitudinal gene expression data.
Test on Existence of Histology Subtype-Specific Prognostic Signatures Among Early Stage Lung Adenocarcinoma and Squamous Cell Carcinoma Patients Using a Cox-Model Based Filter
BACKGROUND: Non-small cell lung cancer (NSCLC) is the predominant histological type of lung cancer, accounting for up to 85% of cases. Disease stage is commonly used to determine adjuvant treatment eligibility of NSCLC patients; however, it is an imprecise predictor of the prognosis of an individual patient. Currently, many researchers resort to microarray technology for identifying relevant genetic prognostic markers, with particular attention on trimming or extending a Cox regression model. Adenocarcinoma (AC) and squamous cell carcinoma (SCC) are two major histology subtypes of NSCLC. It has been demonstrated that fundamental differences exist in their underlying mechanisms, which motivated us to postulate the existence of specific genes related to the prognosis of each histology subtype.
RESULTS: In this article, we propose a simple filter feature selection algorithm with a Cox regression model as the base. Applying this method to real-world microarray data identifies a histology-specific prognostic gene signature. Furthermore, the resulting 32-gene (32/12 for AC/SCC) prognostic signature for early-stage AC and SCC samples has superior predictive ability relative to two relevant prognostic signatures, and has comparable performance with signatures obtained by applying two state-of-the-art algorithms separately to AC and SCC samples.
CONCLUSIONS: Our proposal is conceptually simple, and straightforward to implement. Furthermore, it can be easily adapted and applied to a range of other research settings.
REVIEWERS: This article was reviewed by Leonid Hanin (nominated by Dr. Lev Klebanov), Limsoon Wong and Jun Yu.