7 research outputs found

Improving Breast Cancer Survival Analysis through Competition-Based Multidimensional Modeling

Author: Alvarez Mariano Javier
Aparicio Samuel
Bilal Erhan
Børresen-Dale Anne-Lise
Caldas Carlos
Califano Andrea
Curtis Christina
Dutkowski Janusz
Friend Stephen H.
Guinney Justin
Ideker Trey
Jang In Sock
Kristensen Vessela N.
Logsdon Benjamin A.
Margolin Adam A.
Mecham Brigham H.
Pandey Gaurav
Rueda Oscar M.
Sauerwine Benjamin A.
Schadt Eric E.
Shimoni Yishai
Stolovitzky Gustavo A.
Tost Jorg
Vollan Hans Kristian Moen
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2013

Breast cancer is the most common malignancy in women and is responsible for hundreds of thousands of deaths annually. As with most cancers, it is a heterogeneous disease, and different breast cancer subtypes are treated differently. Understanding how prognosis differs with the molecular and phenotypic features of a breast cancer is one avenue for improving treatment, by matching the proper treatment to the molecular subtype of the disease. In this work, we employed a competition-based approach to modeling breast cancer prognosis, using large datasets containing genomic and clinical information together with an online real-time leaderboard that speeds feedback to the modeling team and encourages each modeler to work towards a higher-ranked submission. We find that machine learning methods combined with molecular features selected based on expert prior knowledge can improve survival predictions compared to current best-in-class methodologies, and that ensemble models trained across multiple user submissions systematically outperform individual models within the ensemble. We also find that model scores are highly consistent across multiple independent evaluations. This study serves as the pilot phase of a much larger competition open to the whole research community, with the goal of understanding general strategies for model optimization using clinical and molecular profiling data and providing an objective, transparent system for assessing prognostic models.

Crossref

Columbia University Academic Commons

Directory of Open Access Journals

PubMed Central

Model performance by feature set and learning algorithm.

(A) The concordance index is displayed for each model from the controlled experiment (Table S4). The methods and feature sets are arranged according to the mean concordance index score. The ensemble method (cyan curve) infers survival predictions based on the average rank of samples from each of the four other learning algorithms, and the ensemble feature set uses the average rank of samples based on models trained using all of the other feature sets. Results for the METABRIC2 and MicMa datasets are shown in Figure S1. (B) The concordance index of models from the controlled phase by type. The ensemble method again utilizes the average rank for models in each category.

FigShare
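
The caption above relies on two ideas: scoring a model by its concordance index, and building an ensemble from the average rank of each sample across models. The following is a minimal Python sketch of both, not the authors' implementation. The function names, the Harrell-style handling of censored pairs, and the average-rank treatment of ties are all assumptions made for illustration.

```python
from itertools import combinations


def concordance_index(times, events, risk):
    """Harrell-style concordance index for right-censored survival data.

    times  : observed follow-up time per sample
    events : 1 if death was observed, 0 if the sample was censored
    risk   : predicted risk score (higher means shorter expected survival)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Let a be the sample with the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # A pair is only comparable if the earlier time is an observed event
        # and the two times are not tied (assumed convention).
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk[a] > risk[b]:
            concordant += 1.0
        elif risk[a] == risk[b]:
            concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def ranks(scores):
    """1-based ranks (lowest score gets rank 1); ties share the average rank."""
    order = sorted(range(len(scores)), key=lambda k: scores[k])
    result = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in order[pos:end + 1]:
            result[k] = average_rank
        pos = end + 1
    return result


def rank_average_ensemble(predictions):
    """Average each sample's rank across several models' risk predictions."""
    per_model = [ranks(p) for p in predictions]
    return [sum(column) / len(column) for column in zip(*per_model)]


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    times = [5, 8, 12, 20]
    events = [1, 1, 0, 1]
    model_a = [0.9, 0.4, 0.5, 0.1]
    model_b = [0.7, 0.8, 0.2, 0.3]
    ensemble = rank_average_ensemble([model_a, model_b])
    for name, pred in [("A", model_a), ("B", model_b), ("ensemble", ensemble)]:
        print(name, concordance_index(times, events, pred))
```

Because the concordance index depends only on the ordering of predictions, averaging ranks rather than raw scores lets models with different output scales be combined directly.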

Feature sets used in the controlled experiment.

FigShare

Consistency of results in 2 additional datasets.

(A,C) Concordance index scores for all models evaluated in the controlled experiment. Scores from the original evaluation are compared against METABRIC2 (A) and MicMa (C). The 4 machine learning algorithms are displayed in different colors. (B,D) Individual plots for each machine learning algorithm.

FigShare

Gene expression subclass analysis.

(A) Comparison of hierarchical clustering of METABRIC data (left panel) and Perou data (right panel). Hierarchical clustering on the gene expression data of the PAM50 genes in both datasets reveals a similar gene expression pattern that separates into several subclasses. Although several classes are apparent, they are consistent with sample assignment into basal-like, Her2-enriched and luminal subclasses in the Perou data. Similarly, in the METABRIC data the subclasses are consistent with the available clinical data for triple-negative, ER and PR status, and HER2 positive. (B) Kaplan-Meier plot for subclasses. The METABRIC test dataset was separated into 3 major subclasses according to clinical features: triple negative (red); ER or PR positive status (blue); and HER2 positive with ER and PR negative status (green). The survival curves were estimated using the standard Kaplan-Meier method, and show the expected differences in overall survival between the subclasses. (C,D) Kaplan-Meier curves by grade and histology. The test dataset was separated by tumor grade (subplot C; grade 1 – red, grade 2 – green, grade 3 – blue), or by histology (subplot D; Infiltrating Lobular – red, Infiltrating Ductal – yellow, Medullary – green, Mixed Histology – blue, or Mucinous – purple). The survival curves were estimated using the standard Kaplan-Meier method, and show the expected differences in overall survival for the clinical features.

FigShare
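
The survival curves described in the caption above use the Kaplan-Meier estimator. As a rough illustration of how such a curve is computed for one subclass, here is a small Python sketch. The function name `kaplan_meier` and the toy data are hypothetical, and the code is a textbook version of the product-limit estimator rather than the code used in the study.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier (product-limit) survival estimate.

    times  : observed follow-up time per sample
    events : 1 if death was observed at that time, 0 if censored
    Returns a list of (time, survival probability) at each distinct event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Collect everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            # Multiply by the conditional probability of surviving past t.
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Deaths and censored samples both leave the risk set after t.
        at_risk -= leaving
    return curve


if __name__ == "__main__":
    # Toy follow-up data, invented for illustration only.
    for t, s in kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0]):
        print(t, round(s, 3))
```

Running the function separately on each subclass (for example each tumor grade) and plotting the resulting step functions gives curves of the kind the caption describes.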

Distribution of concordance index scores of models submitted in the pilot competition.

(A) Models are categorized by the type of features they use. Boxes indicate the 25th (lower end), 50th (middle red line) and 75th (upper end) percentiles of the scores in each category, while the whiskers indicate the 10th and 90th percentiles of the scores. The scores for the baseline and best performer are highlighted. (B) Model performance by submission date. In the initial phase of the competition, slight improvements over the baseline model were achieved by applying machine learning approaches to only the clinical data (red circles), whereas initial attempts to incorporate molecular data significantly decreased performance (green, purple, and black circles). In the intermediate phase of the competition, models combining molecular and clinical data (green circles) predominated and achieved slightly improved performance over clinical-only models. Towards the end of the competition, models combining clinical information with molecular features selected based on prior information (purple circles) predominated.

FigShare
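
The box-and-whisker summary in the caption above (10th, 25th, 50th, 75th and 90th percentiles of the scores in each category) can be computed as in the sketch below. The helper names are hypothetical, and linear interpolation between order statistics is an assumed convention, since the caption does not say which percentile definition was used.

```python
def percentile(values, q):
    """q-th percentile (0-100) using linear interpolation between sorted values."""
    xs = sorted(values)
    if not xs:
        raise ValueError("empty input")
    pos = (len(xs) - 1) * q / 100
    lower = int(pos)
    upper = min(lower + 1, len(xs) - 1)
    return xs[lower] + (xs[upper] - xs[lower]) * (pos - lower)


def box_summary(scores):
    """Whisker ends (10th, 90th) and box values (25th, 50th, 75th) for one category."""
    return {q: percentile(scores, q) for q in (10, 25, 50, 75, 90)}


if __name__ == "__main__":
    # Toy concordance index scores, invented for illustration only.
    print(box_summary([0.61, 0.64, 0.66, 0.67, 0.70, 0.72]))
```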

Improving Breast Cancer Survival Analysis through Competition-Based Multidimensional Modeling

Author: A Jain
A Naderi
A Prat
Adam A. Margolin
Andrea Califano
Anne-Lise Børresen-Dale
B Naume
Benjamin A. Logsdon
Benjamin A. Sauerwine
BH Mecham
Brigham H. Mecham
C Curtis
C Sotiriou
Carlos Caldas
Christina Curtis
CK Zoon
CM Perou
CW Elston
D Earl
D Marbach
D Venet
E Enerly
Erhan Bilal
Eric E. Schadt
F Cardoso
FM Buffa
G Athanasopoulos
Gaurav Pandey
Gustavo A. Stolovitzky
Hans Kristian Moen Vollan
I Ben-Porath
In Sock Jang
J Bennett
J Moult
Janusz Dutkowski
JH Friedman
JH Taube
JMJ Derry
Jorg Tost
Justin Guinney
KA Kwei
KR Lakhani
L Shi
LJ van 't Veer
M Ben-David
Mariano J. Alvarez
ME Higgins
MJ Van De Vijver
Oscar M. Rueda
P Meyer
P Radivojac
P Wirapati
PA Futreal
PJ Stephens
R Norel
RB Scharpf
RC Gentleman
Richard Bonneau
RJ Prill
S Valastyan
Samuel Aparicio
SJ Wodak
SL Carter
Stephen H. Friend
T Ideker
T Ravasi
T Sørlie
TG Clark
Trey Ideker
V Glinsky G
Vessela N. Kristensen
VN Kristensen
Yishai Shimoni
Publication venue: 'Public Library of Science (PLoS)'

Crossref