14 research outputs found

    Project Final Report

    A vast amount of biomedical research literature is available online in the MEDLINE database, giving biomedical scientists instant access to the literature and references they need. But finding a manageable subset of articles relevant to their current research is hard because (1) the number of articles is growing very fast, and (2) each disease (and gene) has several synonyms and is called by different names in different articles. In this project we aim to identify and extract biomedical information from the text, and to provide different types of short summaries for a set of abstracts related to a specific biological or chemical name.

    A Machine Learned Classifier That Uses Gene Expression Data to Accurately Predict Estrogen Receptor Status

    <div><p>Background</p><p>Selecting the appropriate treatment for breast cancer requires accurately determining the estrogen receptor (ER) status of the tumor. However, the standard for determining this status, immunohistochemical analysis of formalin-fixed, paraffin-embedded samples, suffers from numerous technical and reproducibility issues. Assessment of ER-status based on RNA expression can provide more objective, quantitative and reproducible test results.</p><p>Methods</p><p>To learn a parsimonious RNA-based classifier of hormone receptor status, we applied a machine learning tool to a training dataset of gene expression microarray data obtained from 176 frozen breast tumors, whose ER-status was determined by applying ASCO-CAP guidelines to standardized immunohistochemical testing of formalin-fixed tumor.</p><p>Results</p><p>This produced a three-gene classifier that can predict the ER-status of a novel tumor, with a cross-validation accuracy of 93.17±2.44%. When applied to an independent validation set and to four other public databases, some on different platforms, this classifier obtained over 90% accuracy in each. In addition, we found that this prediction rule separated the patients' recurrence-free survival curves with a hazard ratio lower than the one based on the IHC analysis of ER-status.</p><p>Conclusions</p><p>Our efficient and parsimonious classifier lends itself to high-throughput, highly accurate and low-cost RNA-based assessments of ER-status, suitable for routine clinical use. This analytic method provides a proof-of-principle that may be applicable to developing effective RNA-based tests for other biomarkers and conditions.</p></div>

    Accuracy for our 3-feature classifier, over various datasets.

    <p>*#ER+/ER−: Number of patients that were estrogen receptor positive/negative according to gold-standard IHC analysis.</p>

    Kaplan-Meier Survival and Recurrence-Free Survival Curves For Patients Sorted by IHC-Determined ER-Status and Eq3 Predicted ER-Status.

    <p>Both the survival and recurrence-free survival curves had greater separation and lower hazard ratios (HR) when the patients were sorted by <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0082144#pone.0082144.e013" target="_blank">Eq3</a> ER-status compared with traditional IHC. a) Survival curves for patients split based on IHC ER-status (ER+ n = 126, median survival = 3807 days; ER- n = 72, median survival = 2704 days; HR = 0.5090; 95% CI = 0.2968–0.8731). b) Survival curves for patients split based on <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0082144#pone.0082144.e013" target="_blank">Eq3</a> ER-status (ER+ n = 123, median survival = 3807 days; ER- n = 75, median survival = 1623 days; HR = 0.3901; 95% CI = 0.2420–0.6935). c) Recurrence-free survival curves for patients split based on IHC ER-status (ER+ n = 126, median recurrence-free survival = 1694 days; ER- n = 72, median recurrence-free survival = 1246 days; HR = 0.7160; 95% CI = 0.4623–1.109). d) Recurrence-free survival curves for patients split based on <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0082144#pone.0082144.e013" target="_blank">Eq3</a> ER-status (ER+ n = 123, median recurrence-free survival = 1820 days; ER- n = 75, median recurrence-free survival = 875 days; HR = 0.5731; 95% CI = 0.3718–0.8833).</p>
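For readers unfamiliar with how a Kaplan-Meier curve and its median are obtained, the following is a minimal sketch of the standard product-limit estimator. It is not the authors' analysis code: the function names, the input layout, and the convention that the median is the first event time at which the estimated survival drops to 0.5 or below are all assumptions made for illustration. It does not compute hazard ratios.

```python
from fractions import Fraction


def kaplan_meier(durations, observed):
    """Product-limit estimate: list of (time, S(time)) at each event time.

    durations: follow-up time per patient (e.g. days).
    observed:  True if the event occurred, False if the patient was censored.
    Exact fractions are used so that comparisons with 1/2 are not blurred
    by floating-point rounding.
    """
    records = sorted(zip(durations, observed))
    at_risk = len(records)
    survival = Fraction(1)
    curve = []
    i = 0
    while i < len(records):
        t = records[i][0]
        events = 0
        leaving = 0
        # Group all patients whose follow-up ends at the same time t.
        while i < len(records) and records[i][0] == t:
            events += 1 if records[i][1] else 0
            leaving += 1
            i += 1
        if events:
            survival *= Fraction(at_risk - events, at_risk)
            curve.append((t, survival))
        at_risk -= leaving
    return curve


def median_survival(curve):
    """First event time with S(t) <= 1/2, or None if the curve never gets there."""
    for t, s in curve:
        if s <= Fraction(1, 2):
            return t
    return None
```

Applied separately to the ER+ and ER- groups, such a routine yields one curve and one median per group, which is the kind of quantity the caption reports.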

    FS_SVM; a feature selection version of the Support Vector Machine (SVM) learner.

    <p>Line 6 runs SVM on the dataset S, but uses only the <i>r<sup>*</sup></i> “best” features, where features are ranked by their mRMR score<sup>15</sup>, which is computed in Line 5. Note this mRMR score combines mutual information (<a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0082144#pone.0082144.e006" target="_blank">Eq 1</a>) with minimum redundancy. The goal of the first 4 lines is to compute this <i>r<sup>*</sup></i> value: Here, we first partition the dataset into 10 disjoint same-sized subsets {S<sub>i</sub>, <sub>i = 1…10</sub>}, which are balanced (i.e., each is of the same size, and has about the same number of ER+ instances). FS_SVM then considers each of these S<sub>i</sub> subsets, one by one. It first considers the remaining instances, S<sub>−i</sub>  =  S − S<sub>i</sub>, and computes the mRMR score for each feature with respect to this subset of instances. It then evaluates how well SVM does when using only the first r = 1, 2,… of these features, in order. Here, it runs SVM, using that size-r subset of features, on the training set S<sub>−i</sub>, then evaluates the resulting classifier on the remaining “testing subset” S<sub>i</sub>. Line 4 sets r<sup>*</sup> to be the smallest value that is within 1 standard deviation of the high-water mark. See <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0082144#pone.0082144.s001" target="_blank">Material S1</a> for more details.</p>
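The fold-by-fold evaluation described in this caption can be sketched as below. This is an illustration under stated assumptions, not the FS_SVM code itself: the function names are hypothetical, the ranking function `rank_by_mean_gap` is a simple stand-in for the mRMR score, and `fit_nearest_centroid` is a stand-in for the SVM learner, chosen only so the sketch runs with the standard library. Labels are assumed to be 1 (ER+) and 0 (ER-).

```python
import random
from statistics import mean


def stratified_folds(labels, k=10, seed=0):
    """Split instance indices into k folds with roughly equal class counts."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    position = 0
    for cls in sorted(set(labels)):
        members = [i for i, lab in enumerate(labels) if lab == cls]
        rng.shuffle(members)
        for idx in members:  # deal each class round-robin across the folds
            folds[position % k].append(idx)
            position += 1
    return folds


def rank_by_mean_gap(X, y):
    """Stand-in for mRMR: rank features by the gap between class means."""
    def gap(col):
        pos = [row[col] for row, lab in zip(X, y) if lab == 1]
        neg = [row[col] for row, lab in zip(X, y) if lab == 0]
        return abs(mean(pos) - mean(neg))
    return sorted(range(len(X[0])), key=gap, reverse=True)


def fit_nearest_centroid(X, y):
    """Stand-in for SVM: predict the class whose centroid is closest."""
    centroids = {}
    for cls in set(y):
        rows = [row for row, lab in zip(X, y) if lab == cls]
        centroids[cls] = [mean(col) for col in zip(*rows)]

    def predict(row):
        def sq_dist(cls):
            return sum((a - b) ** 2 for a, b in zip(row, centroids[cls]))
        return min(centroids, key=sq_dist)
    return predict


def accuracy_by_feature_count(X, y, rank_features, fit, max_r, k=10):
    """acc[r-1] holds the k held-out accuracies when using the top r features."""
    acc = [[] for _ in range(max_r)]
    for test_idx in stratified_folds(y, k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        y_train = [y[i] for i in train_idx]
        # Rank features on the training part only, never on the held-out fold.
        order = rank_features([X[i] for i in train_idx], y_train)
        for r in range(1, max_r + 1):
            cols = order[:r]
            X_train = [[X[i][c] for c in cols] for i in train_idx]
            predict = fit(X_train, y_train)
            hits = sum(predict([X[i][c] for c in cols]) == y[i] for i in test_idx)
            acc[r - 1].append(hits / len(test_idx))
    return acc
```

Ranking the features inside each fold, rather than once on the full dataset, keeps the held-out subset from influencing which features are chosen, which matches the procedure the caption describes.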

    Average accuracy of SVM, as a function of number of features.

    <p>For each r = 1,2,…,18, line 3 of FS_SVM (<a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0082144#pone-0082144-g002" target="_blank">Figure 2</a>) computes the mean <i>a<sub>r</sub></i> and standard deviation <i>σ<sub>r</sub></i> of the empirical accuracies obtained, over all 10 folds; this figure plots these bars, for each r. Notice the average accuracy on the hold-out sets increases as the number of features is increased, then levels out, with only minor fluctuations. Here, the largest accuracy occurs at r = 4; notice however that this accuracy is “essentially” the same as at r = 3. We therefore set r<sup>*</sup> = 3 as it is the smallest number of features whose accuracy's “mean + standard deviation” is at least the high-water-mark mean accuracy.</p>
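The selection rule stated in this caption can be written down in a few lines. The sketch below is a hedged illustration: the function name `select_r_star` and the input layout are hypothetical, and the use of the sample standard deviation (`statistics.stdev`) is an assumption, since the caption does not say which estimator was used. The numbers in the usage note are made up and are not taken from the paper.

```python
from statistics import mean, stdev


def select_r_star(fold_accuracies):
    """Smallest r whose mean + standard deviation reaches the best mean.

    fold_accuracies[r - 1] is the list of per-fold accuracies obtained
    with the top r features (at least two folds per entry).
    """
    means = [mean(a) for a in fold_accuracies]
    deviations = [stdev(a) for a in fold_accuracies]
    high_water_mark = max(means)
    for r, (a_r, sigma_r) in enumerate(zip(means, deviations), start=1):
        if a_r + sigma_r >= high_water_mark:
            return r
    return len(means)  # not reached: the best r always satisfies the test
```

For example, with made-up accuracies `[[0.80, 0.82, 0.78], [0.88, 0.90, 0.86], [0.91, 0.95, 0.93], [0.93, 0.95, 0.94]]`, the best mean occurs at r = 4, but r = 3 is already within one standard deviation of it, so the function returns 3.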

    Top 10 genes, sorted by mutual information related to ER-status, based on the E176-cohort.

    <p>This table also provides the SVM coefficient, the index over the E23-cohort (see text), and a short description of the gene.</p>

    Basic machine learning framework.

    <p>The bottom portion of this figure shows that a “Classifier” takes as input a description of a novel instance (here, the 27688 gene expression values from a microarray taken from a patient's biopsy), and returns a prediction for this instance (here, its prediction of whether this tumor is ER+ or ER−). The figure suggests this response is “No”. The Machine Learning challenge is to produce this classifier from a dataset of historical data (called labeled “Training Data”); this is the vertical portion, showing that a Learner uses that Training Data to produce the classifier. When evaluating the quality of a learned classifier, we require that the “Novel Instance” is not in the Training Data.</p>

    The Eq3 Classifier Predicts ER-Status with High Accuracy.

    <p>The individual patient <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0082144#pone.0082144.e013" target="_blank">Eq3</a> values from the combined E176 and E23 datasets are sorted in descending order. The black triangular peaks mark patients classified as ER+ or ER- from IHC but the opposite from the <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0082144#pone.0082144.e013" target="_blank">Eq3</a> classifier, and the number of patients within each peak is labeled above. a) Histogram of the above sorted <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0082144#pone.0082144.e013" target="_blank">Eq3</a> values, showing the percentage of IHC-determined ER+ patients, in each 10-patient bin.</p>
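The binning behind panel a) can be sketched as follows. This is a minimal illustration, assuming each patient has a numeric classifier score and a boolean IHC label; the function name and arguments are hypothetical, and the Eq3 formula itself is not reproduced here.

```python
def er_positive_percent_by_bin(scores, ihc_positive, bin_size=10):
    """Percentage of IHC-determined ER+ patients in each consecutive bin.

    Patients are sorted by score in descending order, then grouped into
    bins of bin_size patients (the last bin may be smaller).
    """
    paired = sorted(zip(scores, ihc_positive), key=lambda p: p[0], reverse=True)
    percents = []
    for start in range(0, len(paired), bin_size):
        chunk = paired[start:start + bin_size]
        positives = sum(1 for _, is_positive in chunk if is_positive)
        percents.append(100.0 * positives / len(chunk))
    return percents
```

If the score separates the two groups well, the bins at the high end of the sorted list should be close to 100% ER+ and those at the low end close to 0%, with mixed bins only near the decision boundary.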