
    A Machine Learned Classifier That Uses Gene Expression Data to Accurately Predict Estrogen Receptor Status

    <div><p>Background</p><p>Selecting the appropriate treatment for breast cancer requires accurately determining the estrogen receptor (ER) status of the tumor. However, the standard method for determining this status, immunohistochemical (IHC) analysis of formalin-fixed, paraffin-embedded samples, suffers from numerous technical and reproducibility issues. Assessment of ER-status based on RNA expression can provide more objective, quantitative and reproducible test results.</p><p>Methods</p><p>To learn a parsimonious RNA-based classifier of hormone receptor status, we applied a machine learning tool to a training dataset of gene expression microarray data obtained from 176 frozen breast tumors, whose ER-status was determined by applying ASCO-CAP guidelines to standardized immunohistochemical testing of formalin-fixed tumors.</p><p>Results</p><p>This produced a three-gene classifier that can predict the ER-status of a novel tumor, with a cross-validation accuracy of 93.17±2.44%. When applied to an independent validation set and to four other public databases, some on different platforms, this classifier obtained over 90% accuracy in each. In addition, we found that this prediction rule separated the patients' recurrence-free survival curves with a hazard ratio lower than the one based on the IHC analysis of ER-status.</p><p>Conclusions</p><p>Our efficient and parsimonious classifier lends itself to highly accurate and low-cost RNA-based assessments of ER-status, suitable for routine high-throughput clinical use. This analytic method provides a proof-of-principle that may be applicable to developing effective RNA-based tests for other biomarkers and conditions.</p></div>
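For intuition, a three-gene classifier of the general kind the abstract describes reduces to a weighted sum of three expression values compared against a threshold. The sketch below is purely illustrative: the gene names, weights, and intercept are hypothetical placeholders, not the published Eq3 genes or coefficients.

```python
# Illustrative sketch only: a linear three-gene decision rule. GENE_A/B/C,
# the weights, and the intercept are hypothetical placeholders, NOT the
# published Eq3 parameters.

COEFFS = {"GENE_A": 1.2, "GENE_B": 0.8, "GENE_C": -0.5}  # hypothetical weights
INTERCEPT = -10.0                                        # hypothetical offset

def predict_er_status(expression):
    """Classify a tumor as ER+ or ER- from three expression values."""
    score = INTERCEPT + sum(COEFFS[g] * expression[g] for g in COEFFS)
    return "ER+" if score > 0 else "ER-"

print(predict_er_status({"GENE_A": 9.0, "GENE_B": 8.0, "GENE_C": 4.0}))  # ER+
```

Because the rule depends on only three measured values, it can be evaluated cheaply per sample, which is what makes the approach attractive for routine use.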

    The Eq3 Classifier Predicts ER-Status with High Accuracy.

    <p>The individual patient <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0082144#pone.0082144.e013" target="_blank">Eq3</a> values from the combined E176 and E23 datasets are sorted in descending order. The black triangular peaks mark patients whose <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0082144#pone.0082144.e013" target="_blank">Eq3</a> classification is the opposite of their IHC-determined ER-status; the number of patients within each peak is labeled above it. a) Histogram of the sorted <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0082144#pone.0082144.e013" target="_blank">Eq3</a> values, showing the percentage of IHC-determined ER+ patients in each 10-patient bin.</p>
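The 10-patient binning described in the caption can be sketched as follows: sort patients by score, then compute the percentage of IHC-positive labels in each consecutive bin. The data in the usage line are made up for illustration.

```python
# Sketch of the caption's binning procedure: sort patients by a score in
# descending order, then report the percentage of ER+ labels per 10-patient
# bin. Illustrative reimplementation, not the authors' code.

def er_positive_percent_per_bin(scores_and_labels, bin_size=10):
    """scores_and_labels: list of (score, is_er_positive) pairs.
    Returns the percentage of ER+ patients in each consecutive bin."""
    ordered = sorted(scores_and_labels, key=lambda p: p[0], reverse=True)
    bins = []
    for i in range(0, len(ordered), bin_size):
        chunk = ordered[i:i + bin_size]
        bins.append(100.0 * sum(1 for _, pos in chunk if pos) / len(chunk))
    return bins

# Made-up example: 20 patients where the 10 highest scores are all ER+.
print(er_positive_percent_per_bin([(s, s >= 10) for s in range(20)]))
```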

    Accuracy of our 3-feature classifier over various datasets.

    <p>*#ER+/ER−: Number of patients that were estrogen receptor positive/negative from gold standard IHC analysis.</p>

    Kaplan-Meier Survival and Recurrence-Free Survival Curves For Patients Sorted by IHC-Determined ER-Status and Eq3 Predicted ER-Status.

    <p>Both the survival and recurrence-free survival curves had greater separation and lower hazard ratios (HR) when the patients were sorted by <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0082144#pone.0082144.e013" target="_blank">Eq3</a> ER-status compared with traditional IHC. a) Survival curves for patients split based on IHC ER-status (ER+ n = 126, median survival = 3807 days; ER− n = 72, median survival = 2704 days; HR = 0.5090; 95% CI = 0.2968–0.8731). b) Survival curves for patients split based on <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0082144#pone.0082144.e013" target="_blank">Eq3</a> ER-status (ER+ n = 123, median survival = 3807 days; ER− n = 75, median survival = 1623 days; HR = 0.3901; 95% CI = 0.2420–0.6935). c) Recurrence-free survival curves for patients split based on IHC ER-status (ER+ n = 126, median recurrence-free survival = 1694 days; ER− n = 72, median recurrence-free survival = 1246 days; HR = 0.7160; 95% CI = 0.4623–1.109). d) Recurrence-free survival curves for patients split based on <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0082144#pone.0082144.e013" target="_blank">Eq3</a> ER-status (ER+ n = 123, median recurrence-free survival = 1820 days; ER− n = 75, median recurrence-free survival = 875 days; HR = 0.5731; 95% CI = 0.3718–0.8833).</p>
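The survival curves in this figure are Kaplan-Meier estimates. As a reminder of how such a curve is built, here is a minimal textbook product-limit estimator in Python; it is an illustrative sketch, not the analysis code behind the figure.

```python
# Minimal Kaplan-Meier product-limit estimator (textbook form, for
# illustration). At each observed event time t, survival is multiplied by
# (1 - deaths_at_t / patients_at_risk_just_before_t); censored patients
# leave the risk set without contributing an event.

def kaplan_meier(times, events):
    """times: follow-up time per patient; events: True if the event (death
    or recurrence) was observed, False if censored.
    Returns the step curve as (time, S(t)) pairs at event times."""
    pairs = sorted(zip(times, events))
    n_at_risk = len(pairs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        deaths = removed = 0
        while i < len(pairs) and pairs[i][0] == t:
            deaths += pairs[i][1]   # True counts as 1
            removed += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= removed
    return curve

# Four patients: events at t=1 and t=4, censoring at t=2 and t=3.
print(kaplan_meier([1, 2, 3, 4], [True, False, False, True]))
```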

    FS_SVM: a feature-selection version of the Support Vector Machine (SVM) learner.

    <p>Line 6 runs SVM on the dataset S, but uses only the <i>r<sup>*</sup></i> “best” features, where features are ranked by their mRMR score<sup>15</sup>, which is computed in Line 5. Note this mRMR score combines mutual information (<a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0082144#pone.0082144.e006" target="_blank">Eq 1</a>) with minimum redundancy. The goal of the first 4 lines is to compute this <i>r<sup>*</sup></i> value: here, we first partition the dataset into 10 disjoint, balanced subsets {S<sub>i</sub>, i = 1…10} (i.e., each is of the same size, and has about the same number of ER+ instances). FS_SVM then considers each of these S<sub>i</sub> subsets, one by one. It first considers the remaining instances, S<sub>−i</sub> = S − S<sub>i</sub>, and computes the mRMR score for each feature with respect to this subset of instances. It then evaluates how well SVM does when using only the first r = 1, 2,… of these features, in order: it runs SVM, using that size-r subset of features, on the training set S<sub>−i</sub>, then evaluates the resulting classifier on the held-out “testing subset” S<sub>i</sub>. Line 4 sets r<sup>*</sup> to the smallest value that is within 1 standard deviation of the high-water mark. See <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0082144#pone.0082144.s001" target="_blank">Material S1</a> for more details.</p>
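The mRMR ranking step can be sketched as a greedy loop that repeatedly picks the feature with the highest relevance (mutual information with the class label) minus its mean redundancy (mutual information with the features already chosen). This is a minimal sketch of the common relevance-minus-redundancy ("MID") variant over discretized features; the paper's exact implementation may differ.

```python
# Sketch of greedy mRMR feature ranking (MID variant) over discretized
# features. Illustrative only; the paper's implementation may differ.
from math import log2
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def mrmr_rank(features, labels):
    """features: dict name -> list of discretized values; labels: class labels.
    Greedily rank features by relevance minus mean redundancy."""
    relevance = {f: mutual_information(v, labels) for f, v in features.items()}
    ranked, remaining = [], set(features)
    while remaining:
        def score(f):
            if not ranked:
                return relevance[f]
            redundancy = sum(mutual_information(features[f], features[g])
                             for g in ranked) / len(ranked)
            return relevance[f] - redundancy
        best = max(remaining, key=score)
        ranked.append(best)
        remaining.remove(best)
    return ranked
```

SVM is then run using only the first r of the features in this ranked order, for r = 1, 2, ….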

    Basic machine learning framework.

    <p>The bottom portion of this figure shows that a “Classifier” takes as input a description of a novel instance (here, the 27688 gene expression values from a microarray taken from a patient's biopsy), and returns a prediction for this instance (here, its prediction of whether this tumor is ER+ or ER−). The figure suggests this response is “No”. The Machine Learning challenge is to produce this classifier from a dataset of historical data (called labeled “Training Data”); this is the vertical portion, showing that a Learner uses that Training Data to produce the classifier. When evaluating the quality of a learned classifier, we require that the “Novel Instance” not be in the Training Data.</p>
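The learner/classifier split the figure describes can be made concrete with a toy example: a learner consumes labeled training data and returns a classifier, which is then applied to a novel instance held out of the training set. Nearest-centroid is used here purely for illustration; it is not the paper's method.

```python
# Toy illustration of the framework: a Learner (function) maps Training Data
# to a Classifier (closure), which then labels a Novel Instance that was not
# in the Training Data. Nearest-centroid is illustrative, not the paper's method.

def learn_nearest_centroid(training_data):
    """training_data: list of (feature_vector, label). Returns a classifier."""
    sums, counts = {}, {}
    for vec, label in training_data:
        counts[label] = counts.get(label, 0) + 1
        sums[label] = [s + v for s, v in zip(sums.get(label, [0.0] * len(vec)), vec)]
    centroids = {lab: [s / counts[lab] for s in sums[lab]] for lab in sums}

    def classify(novel_instance):
        # Predict the label of the nearest class centroid (squared Euclidean).
        def dist(lab):
            return sum((a - b) ** 2 for a, b in zip(novel_instance, centroids[lab]))
        return min(centroids, key=dist)

    return classify

classifier = learn_nearest_centroid([([0.0, 0.0], "ER-"), ([1.0, 1.0], "ER+")])
print(classifier([0.9, 0.8]))  # ER+
```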

    Average accuracy of SVM, as a function of number of features.

    <p>For each r = 1,2,…,18, line 3 of FS_SVM (<a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0082144#pone-0082144-g002" target="_blank">Figure 2</a>) computes the mean <i>a<sub>r</sub></i> and standard deviation <i>σ<sub>r</sub></i> of the empirical accuracies obtained over all 10 folds; this figure plots these values, with error bars, for each r. Notice the average accuracy on the hold-out sets increases as the number of features is increased, then levels out, with only minor fluctuations. Here, the largest accuracy occurs at r = 4; notice, however, that this accuracy is “essentially” the same as at r = 3. We therefore set r<sup>*</sup> = 3, as it is the smallest number of features whose accuracy's “mean + standard deviation” is at least the high-water-mark mean accuracy.</p>
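The selection rule in the last sentence, the smallest r whose mean accuracy plus one standard deviation reaches the high-water-mark mean, can be written in a few lines; this is an illustrative reimplementation, not the authors' code.

```python
# One-standard-deviation selection rule, as described in the caption:
# pick the smallest r whose (mean + std) cross-validation accuracy reaches
# the best mean accuracy observed over all r. Illustrative reimplementation.

def select_r_star(means, stds):
    """means[r-1], stds[r-1]: mean and standard deviation of the 10-fold
    accuracies when SVM uses the top-r ranked features."""
    high_water = max(means)
    for r, (a, s) in enumerate(zip(means, stds), start=1):
        if a + s >= high_water:
            return r
```

The loop always returns, since the r achieving the maximum mean trivially satisfies the condition.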

    Top 10 genes, sorted by their mutual information with ER-status, based on the E176-cohort.

    <p>This table also provides the SVM coefficient, the index over the E23-cohort (see text), and a short description of the gene.</p>