
    A Machine Learned Classifier That Uses Gene Expression Data to Accurately Predict Estrogen Receptor Status

    <div><p>Background</p><p>Selecting the appropriate treatment for breast cancer requires accurately determining the estrogen receptor (ER) status of the tumor. However, the standard method for determining this status, immunohistochemical (IHC) analysis of formalin-fixed, paraffin-embedded samples, suffers from numerous technical and reproducibility issues. Assessment of ER-status based on RNA expression can provide more objective, quantitative and reproducible test results.</p><p>Methods</p><p>To learn a parsimonious RNA-based classifier of hormone receptor status, we applied a machine learning tool to a training dataset of gene expression microarray data obtained from 176 frozen breast tumors, whose ER-status was determined by applying ASCO-CAP guidelines to standardized immunohistochemical testing of formalin-fixed tumors.</p><p>Results</p><p>This produced a three-gene classifier that can predict the ER-status of a novel tumor, with a cross-validation accuracy of 93.17±2.44%. When applied to an independent validation set and to four other public databases, some on different platforms, this classifier obtained over 90% accuracy in each. In addition, we found that this prediction rule separated the patients' recurrence-free survival curves with a hazard ratio lower than the one based on the IHC analysis of ER-status.</p><p>Conclusions</p><p>Our efficient and parsimonious classifier lends itself to highly accurate and low-cost RNA-based assessments of ER-status, suitable for routine high-throughput clinical use. This analytic method provides a proof-of-principle that may be applicable to developing effective RNA-based tests for other biomarkers and conditions.</p></div>
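For intuition, a three-gene classifier of the general kind the abstract describes reduces to a weighted sum of three expression values compared against a threshold. The sketch below is purely illustrative: the gene names, weights, and intercept are hypothetical placeholders, not the published Eq3 genes or coefficients.

```python
# Illustrative sketch only: a linear three-gene decision rule. GENE_A/B/C,
# the weights, and the intercept are hypothetical placeholders, NOT the
# published Eq3 parameters.

COEFFS = {"GENE_A": 1.2, "GENE_B": 0.8, "GENE_C": -0.5}  # hypothetical weights
INTERCEPT = -10.0                                        # hypothetical offset

def predict_er_status(expression):
    """Classify a tumor as ER+ or ER- from three expression values."""
    score = INTERCEPT + sum(COEFFS[g] * expression[g] for g in COEFFS)
    return "ER+" if score > 0 else "ER-"

print(predict_er_status({"GENE_A": 9.0, "GENE_B": 8.0, "GENE_C": 4.0}))  # ER+
```

Because the rule depends on only three measured values, it can be evaluated cheaply per sample, which is what makes the approach attractive for routine use.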

    The Eq3 Classifier Predicts ER-Status with High Accuracy.

    <p>The individual patient <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0082144#pone.0082144.e013" target="_blank">Eq3</a> values from the combined E176 and E23 datasets are sorted in descending order. The black triangular peaks mark patients whose <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0082144#pone.0082144.e013" target="_blank">Eq3</a> classification is the opposite of their IHC-determined ER-status; the number of patients within each peak is labeled above it. a) Histogram of the sorted <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0082144#pone.0082144.e013" target="_blank">Eq3</a> values, showing the percentage of IHC-determined ER+ patients in each 10-patient bin.</p>
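The 10-patient binning described in the caption can be sketched as follows: sort patients by score, then compute the percentage of IHC-positive labels in each consecutive bin. The data in the usage line are made up for illustration.

```python
# Sketch of the caption's binning procedure: sort patients by a score in
# descending order, then report the percentage of ER+ labels per 10-patient
# bin. Illustrative reimplementation, not the authors' code.

def er_positive_percent_per_bin(scores_and_labels, bin_size=10):
    """scores_and_labels: list of (score, is_er_positive) pairs.
    Returns the percentage of ER+ patients in each consecutive bin."""
    ordered = sorted(scores_and_labels, key=lambda p: p[0], reverse=True)
    bins = []
    for i in range(0, len(ordered), bin_size):
        chunk = ordered[i:i + bin_size]
        bins.append(100.0 * sum(1 for _, pos in chunk if pos) / len(chunk))
    return bins

# Made-up example: 20 patients where the 10 highest scores are all ER+.
print(er_positive_percent_per_bin([(s, s >= 10) for s in range(20)]))
```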

    Accuracy of our 3-feature classifier over various datasets.

    <p>*#ER+/ER−: Number of patients that were estrogen receptor positive/negative from gold standard IHC analysis.</p>

    Kaplan-Meier Survival and Recurrence-Free Survival Curves For Patients Sorted by IHC-Determined ER-Status and Eq3 Predicted ER-Status.

    <p>Both the survival and recurrence-free survival curves had greater separation and lower hazard ratios (HR) when the patients were sorted by <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0082144#pone.0082144.e013" target="_blank">Eq3</a> ER-status compared with traditional IHC. a) Survival curves for patients split based on IHC ER-status (ER+ n = 126, median survival = 3807 days; ER− n = 72, median survival = 2704 days; HR = 0.5090; 95% CI = 0.2968–0.8731). b) Survival curves for patients split based on <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0082144#pone.0082144.e013" target="_blank">Eq3</a> ER-status (ER+ n = 123, median survival = 3807 days; ER− n = 75, median survival = 1623 days; HR = 0.3901; 95% CI = 0.2420–0.6935). c) Recurrence-free survival curves for patients split based on IHC ER-status (ER+ n = 126, median recurrence-free survival = 1694 days; ER− n = 72, median recurrence-free survival = 1246 days; HR = 0.7160; 95% CI = 0.4623–1.109). d) Recurrence-free survival curves for patients split based on <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0082144#pone.0082144.e013" target="_blank">Eq3</a> ER-status (ER+ n = 123, median recurrence-free survival = 1820 days; ER− n = 75, median recurrence-free survival = 875 days; HR = 0.5731; 95% CI = 0.3718–0.8833).</p>
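The survival curves in this figure are Kaplan-Meier estimates. As a reminder of how such a curve is built, here is a minimal textbook product-limit estimator in Python; it is an illustrative sketch, not the analysis code behind the figure.

```python
# Minimal Kaplan-Meier product-limit estimator (textbook form, for
# illustration). At each observed event time t, survival is multiplied by
# (1 - deaths_at_t / patients_at_risk_just_before_t); censored patients
# leave the risk set without contributing an event.

def kaplan_meier(times, events):
    """times: follow-up time per patient; events: True if the event (death
    or recurrence) was observed, False if censored.
    Returns the step curve as (time, S(t)) pairs at event times."""
    pairs = sorted(zip(times, events))
    n_at_risk = len(pairs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        deaths = removed = 0
        while i < len(pairs) and pairs[i][0] == t:
            deaths += pairs[i][1]   # True counts as 1
            removed += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= removed
    return curve

# Four patients: events at t=1 and t=4, censoring at t=2 and t=3.
print(kaplan_meier([1, 2, 3, 4], [True, False, False, True]))
```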

    FS_SVM: a feature-selection version of the Support Vector Machine (SVM) learner.

    <p>Line 6 runs SVM on the dataset S, but uses only the <i>r<sup>*</sup></i> “best” features, where features are ranked by their mRMR score<sup>15</sup>, which is computed in Line 5. Note this mRMR score combines mutual information (<a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0082144#pone.0082144.e006" target="_blank">Eq 1</a>) with minimum redundancy. The goal of the first 4 lines is to compute this <i>r<sup>*</sup></i> value: here, we first partition the dataset into 10 disjoint, balanced subsets {S<sub>i</sub>, i = 1…10} (i.e., each is of the same size, and has about the same number of ER+ instances). FS_SVM then considers each of these S<sub>i</sub> subsets, one by one. It first considers the remaining instances, S<sub>−i</sub> = S − S<sub>i</sub>, and computes the mRMR score for each feature with respect to this subset of instances. It then evaluates how well SVM does when using only the first r = 1, 2,… of these features, in order: it runs SVM, using that size-r subset of features, on the training set S<sub>−i</sub>, then evaluates the resulting classifier on the held-out “testing subset” S<sub>i</sub>. Line 4 sets r<sup>*</sup> to the smallest value that is within 1 standard deviation of the high-water mark. See <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0082144#pone.0082144.s001" target="_blank">Material S1</a> for more details.</p>
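The mRMR ranking step can be sketched as a greedy loop that repeatedly picks the feature with the highest relevance (mutual information with the class label) minus its mean redundancy (mutual information with the features already chosen). This is a minimal sketch of the common relevance-minus-redundancy ("MID") variant over discretized features; the paper's exact implementation may differ.

```python
# Sketch of greedy mRMR feature ranking (MID variant) over discretized
# features. Illustrative only; the paper's implementation may differ.
from math import log2
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def mrmr_rank(features, labels):
    """features: dict name -> list of discretized values; labels: class labels.
    Greedily rank features by relevance minus mean redundancy."""
    relevance = {f: mutual_information(v, labels) for f, v in features.items()}
    ranked, remaining = [], set(features)
    while remaining:
        def score(f):
            if not ranked:
                return relevance[f]
            redundancy = sum(mutual_information(features[f], features[g])
                             for g in ranked) / len(ranked)
            return relevance[f] - redundancy
        best = max(remaining, key=score)
        ranked.append(best)
        remaining.remove(best)
    return ranked
```

SVM is then run using only the first r of the features in this ranked order, for r = 1, 2, ….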

    Basic machine learning framework.

    <p>The bottom portion of this figure shows that a “Classifier” takes as input a description of a novel instance (here, the 27688 gene expression values from a microarray taken from a patient's biopsy), and returns a prediction for this instance (here, its prediction of whether this tumor is ER+ or ER−). The figure suggests this response is “No”. The Machine Learning challenge is to produce this classifier from a dataset of historical data (called labeled “Training Data”); this is the vertical portion, showing that a Learner uses that Training Data to produce the classifier. When evaluating the quality of a learned classifier, we require that the “Novel Instance” not be in the Training Data.</p>
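The learner/classifier split the figure describes can be made concrete with a toy example: a learner consumes labeled training data and returns a classifier, which is then applied to a novel instance held out of the training set. Nearest-centroid is used here purely for illustration; it is not the paper's method.

```python
# Toy illustration of the framework: a Learner (function) maps Training Data
# to a Classifier (closure), which then labels a Novel Instance that was not
# in the Training Data. Nearest-centroid is illustrative, not the paper's method.

def learn_nearest_centroid(training_data):
    """training_data: list of (feature_vector, label). Returns a classifier."""
    sums, counts = {}, {}
    for vec, label in training_data:
        counts[label] = counts.get(label, 0) + 1
        sums[label] = [s + v for s, v in zip(sums.get(label, [0.0] * len(vec)), vec)]
    centroids = {lab: [s / counts[lab] for s in sums[lab]] for lab in sums}

    def classify(novel_instance):
        # Predict the label of the nearest class centroid (squared Euclidean).
        def dist(lab):
            return sum((a - b) ** 2 for a, b in zip(novel_instance, centroids[lab]))
        return min(centroids, key=dist)

    return classify

classifier = learn_nearest_centroid([([0.0, 0.0], "ER-"), ([1.0, 1.0], "ER+")])
print(classifier([0.9, 0.8]))  # ER+
```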

    Average accuracy of SVM, as a function of number of features.

    <p>For each r = 1,2,…,18, line 3 of FS_SVM (<a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0082144#pone-0082144-g002" target="_blank">Figure 2</a>) computes the mean <i>a<sub>r</sub></i> and standard deviation <i>σ<sub>r</sub></i> of the empirical accuracies obtained over all 10 folds; this figure plots these values, with error bars, for each r. Notice the average accuracy on the hold-out sets increases as the number of features is increased, then levels out, with only minor fluctuations. Here, the largest accuracy occurs at r = 4; notice, however, that this accuracy is “essentially” the same as at r = 3. We therefore set r<sup>*</sup> = 3, as it is the smallest number of features whose accuracy's “mean + standard deviation” is at least the high-water-mark mean accuracy.</p>
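The selection rule in the last sentence, the smallest r whose mean accuracy plus one standard deviation reaches the high-water-mark mean, can be written in a few lines; this is an illustrative reimplementation, not the authors' code.

```python
# One-standard-deviation selection rule, as described in the caption:
# pick the smallest r whose (mean + std) cross-validation accuracy reaches
# the best mean accuracy observed over all r. Illustrative reimplementation.

def select_r_star(means, stds):
    """means[r-1], stds[r-1]: mean and standard deviation of the 10-fold
    accuracies when SVM uses the top-r ranked features."""
    high_water = max(means)
    for r, (a, s) in enumerate(zip(means, stds), start=1):
        if a + s >= high_water:
            return r
```

The loop always returns, since the r achieving the maximum mean trivially satisfies the condition.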

    Top 10 genes, sorted by their mutual information with ER-status, based on the E176-cohort.

    <p>This table also provides the SVM coefficient, the index over the E23-cohort (see text), and a short description of the gene.</p>