
    Improving Breast Cancer Survival Analysis through Competition-Based Multidimensional Modeling

    Breast cancer is the most common malignancy in women and is responsible for hundreds of thousands of deaths annually. As with most cancers, it is a heterogeneous disease and different breast cancer subtypes are treated differently. Understanding the difference in prognosis for breast cancer based on its molecular and phenotypic features is one avenue for improving treatment by matching the proper treatment with molecular subtypes of the disease. In this work, we employed a competition-based approach to modeling breast cancer prognosis using large datasets containing genomic and clinical information and an online real-time leaderboard program used to speed feedback to the modeling team and to encourage each modeler to work towards achieving a higher ranked submission. We find that machine learning methods combined with molecular features selected based on expert prior knowledge can improve survival predictions compared to current best-in-class methodologies and that ensemble models trained across multiple user submissions systematically outperform individual models within the ensemble. We also find that model scores are highly consistent across multiple independent evaluations. This study serves as the pilot phase of a much larger competition open to the whole research community, with the goal of understanding general strategies for model optimization using clinical and molecular profiling data and providing an objective, transparent system for assessing prognostic models.

    Model performance by feature set and learning algorithm.

    <p>(A) The concordance index is displayed for each model from the controlled experiment (<a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1003047#pcbi.1003047.s005" target="_blank">Table S4</a>). The methods and feature sets are arranged according to the mean concordance index score. The ensemble method (cyan curve) infers survival predictions based on the average rank of samples from each of the four other learning algorithms, and the ensemble feature set uses the average rank of samples based on models trained using all of the other feature sets. <a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1003047#s2" target="_blank">Results</a> for the METABRIC2 and MicMa datasets are shown in <a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1003047#pcbi.1003047.s001" target="_blank">Figure S1</a>. (B) The concordance index of models from the controlled phase by type. The ensemble method again utilizes the average rank for models in each category.</p>
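The rank-averaging ensemble and the concordance index described in this caption can be illustrated with a minimal sketch. It assumes Harrell's pairwise definition of the concordance index with right-censored data and risk scores where a higher value means shorter predicted survival. The function names (`concordance_index`, `ranks`, `rank_average_ensemble`) are hypothetical and are not taken from the paper, whose actual implementation is not shown here.

```python
from itertools import combinations


def concordance_index(times, events, risk):
    """Harrell-style concordance index for right-censored data.

    times  -- observed follow-up time per sample
    events -- 1 if death was observed, 0 if the sample was censored
    risk   -- predicted risk; higher is assumed to mean shorter survival
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Let `a` be the sample with the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # A pair is comparable only if the times differ and the
        # earlier sample is an observed event (not censored).
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk[a] > risk[b]:
            concordant += 1.0
        elif risk[a] == risk[b]:
            concordant += 0.5  # tied predictions count as half
    return concordant / comparable


def ranks(scores):
    """Rank samples by score (1 = lowest); ties share the average rank."""
    order = sorted(range(len(scores)), key=lambda k: scores[k])
    result = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        average = (pos + end) / 2 + 1
        for k in order[pos:end + 1]:
            result[k] = average
        pos = end + 1
    return result


def rank_average_ensemble(model_scores):
    """Combine several models by averaging each sample's rank across models."""
    per_model = [ranks(scores) for scores in model_scores]
    n_samples = len(model_scores[0])
    return [sum(r[k] for r in per_model) / len(per_model) for k in range(n_samples)]
```

Averaging ranks rather than raw scores makes the combination insensitive to the scale of each model's output, which matters when the underlying algorithms produce predictions in different units.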

    Gene expression subclass analysis.

    <p>(A) Comparison of hierarchical clustering of METABRIC data (left panel) and Perou data (right panel). Hierarchical clustering on the gene expression data of the PAM50 genes in both datasets reveals a similar gene expression pattern that separates into several subclasses. Although several classes are apparent, they are consistent with sample assignment into basal-like, Her2-enriched and luminal subclasses in the Perou data. Similarly, in the METABRIC data the subclasses are consistent with the available clinical data for triple-negative, ER and PR status, and HER2 positive. (B) Kaplan-Meier plot for subclasses. The METABRIC test dataset was separated into 3 major subclasses according to clinical features: triple negative (red); ER or PR positive status (blue); and HER2 positive with ER and PR negative status (green). The survival curve was estimated using a standard Kaplan-Meier curve, and shows the expected differences in overall survival between the subclasses. (C,D) Kaplan-Meier curve by grade and histology. The test dataset was separated by tumor grade (subplot C; grade 1 – red, grade 2 – green, grade 3 – blue), or by histology (subplot D; Infiltrating Lobular – red, Infiltrating Ductal – yellow, Medullary – green, Mixed Histology – blue, or Mucinous – purple). The survival curves were estimated using a standard Kaplan-Meier curve, and show the expected differences in overall survival for the clinical features.</p>
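The Kaplan-Meier estimate referred to in this caption can be sketched as follows. This is a minimal illustration of the standard product-limit estimator, assuming right-censored data encoded as follow-up times and event flags. The function name `kaplan_meier` and the input layout are assumptions made for this example and do not come from the paper.

```python
def kaplan_meier(times, events):
    """Product-limit (Kaplan-Meier) survival estimate.

    times  -- observed follow-up time per sample
    events -- 1 if death was observed, 0 if the sample was censored
    Returns a list of (time, survival probability) pairs, one per
    distinct time at which at least one death was observed.
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei)
        leaving = sum(1 for ti in times if ti == t)
        if deaths:
            # Multiply by the conditional probability of surviving time t.
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Samples that died or were censored at t leave the risk set.
        at_risk -= leaving
    return curve
```

To compare subclasses as in panels B to D, one would call the function once per subgroup (for example, once per tumor grade) and plot the resulting step curves together.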

    Distribution of concordance index scores of models submitted in the pilot competition.

    <p>(A) Models are categorized by the type of features they use. Boxes indicate the 25<sup>th</sup> (lower end), 50<sup>th</sup> (middle red line) and 75<sup>th</sup> (upper end) percentiles of the scores in each category, while the whiskers indicate the 10<sup>th</sup> and 90<sup>th</sup> percentiles of the scores. The scores for the baseline and best performer are highlighted. (B) Model performance by submission date. In the initial phase of the competition, slight improvements over the baseline model were achieved by applying machine learning approaches to only the clinical data (red circles), whereas initial attempts to incorporate molecular data significantly decreased performance (green, purple, and black circles). In the intermediate phase of the competition, models combining molecular and clinical data (green circles) predominated and achieved slightly improved performance over clinical-only models. Towards the end of the competition, models combining clinical information with molecular features selected based on prior information (purple circles) predominated.</p>