51 research outputs found

    A Crowdsourcing Approach to Developing and Assessing Prediction Algorithms for AML Prognosis

    Acute Myeloid Leukemia (AML) is a fatal hematological cancer. The genetic abnormalities underlying AML are extremely heterogeneous among patients, making prognosis and treatment selection very difficult. While clinical proteomics data has the potential to improve prognosis accuracy, the quantitative means to do so have yet to be developed. Here we report the results and insights gained from the DREAM 9 Acute Myeloid Leukemia Outcome Prediction Challenge (AML-OPC), a crowdsourcing effort designed to promote the development of quantitative methods for AML prognosis prediction. We identify the most accurate and robust models in predicting patient response to therapy, remission duration, and overall survival. We further investigate patient response to therapy, a clinically actionable prediction, and find that patients who are classified as resistant to therapy are harder to predict than responsive patients across the 31 models submitted to the challenge. The top two performing models, which had a high sensitivity to these patients, made substantial use of the proteomics data to make predictions. Using these models, we also identify which signaling proteins were useful in predicting patient therapeutic response. The article is published at http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.100489

    Acute myeloid leukemia risk group prediction from gene expression data with feed-forward neural networks

    Abstract. Predicting from gene expression data remains a complex task because such data characteristically combine high dimensionality with small sample sizes. Creating classifiers in these settings is non-trivial and is further complicated by multi-class imbalance. The imbalance hinders the feed-forward neural network’s ability to learn patterns from the data, and the multi-class structure makes common evaluation metrics hide the network’s poor performance in the minority classes. For Acute Myeloid Leukemia (AML) these issues are magnified by the fact that the underlying molecular factors are heterogeneous from patient to patient, which makes treatment and prognosis difficult. Having limited resources has a direct impact on which methods can be used to tackle these problems. In this thesis, the goal is to find cost-effective methods to balance the data, remove unnecessary features, and create a multi-class classifier for the AML risk group. The risk group is created using two variables based on survival times. In total, six scenarios are compared for creating the optimal feed-forward neural network. First, the original gene expressions are used as the predictors without any pre-processing. The following two scenarios fix the class imbalance using SMOTE and ADASYN. Finally, RFE is used to reduce dimensions in all previous scenarios to get the last three data sets. The feed-forward neural network is tuned separately for each scenario. In total, 100 parameter combinations are chosen randomly from around 3000 possible model configurations, and the resulting models are evaluated based on overall accuracy and F1 score for each class. The results show that while ADASYN, SMOTE, and RFE help the networks yield better results, having the right network structure is just as important. This is demonstrated by the fact that some models using the unprocessed data set were found among the best-performing models. 
Furthermore, based on high accuracy in classification, predicting the new AML risk category based only on genes seems possible even with limited resources.
    Finnish abstract, translated (original title: Akuutin myelooisen leukemian riskiryhmän ennustaminen geeniekpressiodatasta eteenpäinsyöttävillä neuroverkoilla). It is typical of gene expression data that tens of thousands of variables have been collected, whereas there are only a few hundred observations. For this reason, predicting classes from gene expression is a complex task, made harder by the imbalance between majority and minority classes. The imbalance makes it harder to learn the relationships between genes, and when there are several classes, commonly used evaluation methods hide poor classification performance on the minority classes. In addition to these problems, acute myeloid leukemia (AML) brings its own challenges because of the heterogeneity of molecular factors between patients. As a result, making prognoses and planning treatments on the basis of genes is challenging. The choice of methods for solving these problems depends directly on the available resources. The aim of this work is to find cost-effective methods for correcting the data imbalance and removing superfluous variables, and to create a multi-class classifier for a new AML risk group. The new risk group is created from two other variables on the basis of survival times. In total, six different scenarios are examined with feed-forward neural networks. First, the original AML gene expression data are used to predict the risk group without any pre-processing. Next, the imbalance is corrected by simulating new observations for the minority class with the SMOTE and ADASYN algorithms. The last three data sets are obtained by dropping variables from the previous data sets with the RFE algorithm. 
The optimal hyperparameter values of the feed-forward neural networks are searched among 100 parameter combinations chosen at random from a set of about 3000 combinations. The results of the selected networks are compared on the basis of overall accuracy and the F1 score obtained separately for each group. The best models included unprocessed data sets as well as pre-processed ones, which suggests that choosing the right network structure is as important as pre-processing the data. Classification of the new risk group gave promising results, so prediction based on genes alone appears to be possible even with limited resources.
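The abstract above notes that, with imbalanced multi-class data, common metrics can hide poor minority-class performance, which is why it evaluates the F1 score for each class. The following is a minimal sketch of that effect, not the thesis's code: the class names, the class counts, and the always-predict-majority classifier are hypothetical values chosen for illustration.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_f1(y_true, y_pred):
    """Return {label: F1}, computed one-vs-rest for every label that occurs."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[t] += 1  # true label t was missed
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


# Hypothetical, heavily imbalanced risk groups (assumption, not thesis data).
y_true = ["low"] * 90 + ["intermediate"] * 8 + ["high"] * 2
# A degenerate classifier that always predicts the majority class.
y_pred = ["low"] * 100

acc = accuracy(y_true, y_pred)
f1 = per_class_f1(y_true, y_pred)
print(f"accuracy = {acc:.2f}")
for label, score in f1.items():
    print(f"F1[{label}] = {score:.3f}")
```

In this toy case the overall accuracy looks high while the F1 score of both minority classes is zero, which is the failure mode that per-class reporting is meant to expose.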

    Attractor Metafeatures and Their Application in Biomolecular Data Analysis

    This dissertation proposes a family of algorithms for deriving signatures of mutually associated features, to which we refer as attractor metafeatures, or simply attractors. Specifically, we present multi-cancer attractor derivation algorithms, identifying correlated features in signatures from multiple biological data sets in one analysis, as well as the groups of samples or cells that exclusively express these signatures. Our results demonstrate that these signatures can be used, in proper combinations, as biomarkers that predict a patient’s survival rate, based on the transcriptome of the tumor sample. They can also be used as features to analyze the composition of the tumor. Through analyzing large data sets of 18 cancer types and three high-throughput platforms from The Cancer Genome Atlas (TCGA) PanCanAtlas Project and multiple single-cell RNA-seq data sets, we identified novel cancer attractor signatures and elucidated the identity of the cells that express these signatures. Using these signatures, we developed a prognostic biomarker for breast cancer called the Breast Cancer Attractor Metagenes (BCAM) biomarker as well as a software platform to analyze the tumor sample, called Analysis of the Single-Cell Omics for Tumor (ASCOT)

    A Machine Learning Classifier Trained on Cancer Transcriptomes Detects NF1 Inactivation Signal in Glioblastoma

    We have identified molecules that exhibit synthetic lethality in cells with loss of the neurofibromin 1 (NF1) tumor suppressor gene. However, recognizing tumors that have inactivation of the NF1 tumor suppressor function is challenging because the loss may occur via mechanisms that do not involve mutation of the genomic locus. Degradation of the NF1 protein, independent of NF1 mutation status, phenocopies inactivating mutations to drive tumors in human glioma cell lines. NF1 inactivation may alter the transcriptional landscape of a tumor and allow a machine learning classifier to detect which tumors will benefit from synthetic lethal molecules. We developed a strategy to predict tumors with low NF1 activity and hence tumors that may respond to treatments that target cells lacking NF1. Using RNAseq data from The Cancer Genome Atlas (TCGA), we trained an ensemble of 500 logistic regression classifiers that integrates mutation status with whole transcriptomes to predict NF1 inactivation in glioblastoma (GBM)

    A community approach to mortality prediction in sepsis via gene expression analysis.

    Improved risk stratification and prognosis prediction in sepsis is a critical unmet need. Clinical severity scores and available assays such as blood lactate reflect global illness severity with suboptimal performance, and do not specifically reveal the underlying dysregulation of sepsis. Here, we present prognostic models for 30-day mortality generated independently by three scientific groups by using 12 discovery cohorts containing transcriptomic data collected from primarily community-onset sepsis patients. Predictive performance is validated in five cohorts of community-onset sepsis patients in which the models show summary AUROCs ranging from 0.765 to 0.89. Similar performance is observed in four cohorts of hospital-acquired sepsis. Combining the new gene-expression-based prognostic models with prior clinical severity scores leads to significant improvement in prediction of 30-day mortality as measured via AUROC and net reclassification improvement index. These models provide an opportunity to develop molecular bedside tests that may improve risk stratification and mortality prediction in patients with sepsis.
    [...] by NIGMS Glue Grant Legacy Award R24GM102656. J.F.B.-M., R.A., and E.T. were supported by Instituto de Salud Carlos III (grants EMER07/050, PI13/02110, PI16/01156). R.J.L. was supported by the National Center for Advancing Translational Sciences of the National Institutes of Health under award number UL1TR001417. The CAPSOD study was supported by NIH (U01AI066569, P20RR016480, HHSN266200400064C). P.K. is supported by grants from the Bill & Melinda Gates Foundation, R01 AI125197-01, 1U19AI109662, and U19AI057229, outside the submitted work. 
The GAinS study was supported by the National Institute for Health Research through the Comprehensive Clinical Research Network for patient recruitment; Wellcome Trust (Grants 074318 [to J.C.K.], and 090532/Z/09/Z [core facilities Wellcome Trust Centre for Human Genetics including High-Throughput Genomics Group]); European Research Council under the European Union’s Seventh Framework Programme (FP7/2007–2013)/ERC Grant agreement no. 281824 (to J.C.K.), the Medical Research Council (98082 [to J.C.K.]); UK Intensive Care Society; and NIHR Oxford Biomedical Research Centre. The Duke HAI study was supported by a research agreement between Duke University and Novartis Vaccines and Diagnostics, Inc. According to the terms of the agreement, representatives of the sponsor had an opportunity to review and comment on a draft of the manuscript. The authors had full control of the analyses, the preparation of the manuscript, and the decision to submit the manuscript for publication. For the University of Florida ‘P50’ Study, data were obtained from the Sepsis and Critical Illness Research Center (SCIRC) at the University of Florida College of Medicine, which is supported in part by NIGMS P50 GM111152. This work was supported by Defense Advanced Research Projects Agency and the Army Research Office through Grant W911NF-15-1-0107.

    Addressing the challenges of uncertainty in regression models for high dimensional and heterogeneous data from observational studies

    The lack of replicability in research findings from different scientific disciplines has gained wide attention in the last few years and led to extensive discussions. In this 'replication crisis', different types of uncertainty play an important role, occurring at different points of data collection and statistical analysis. Nevertheless, the consequences are often ignored in current research practice, at the risk of low credibility and reliability of research findings. To analyze this problem and develop solutions to it, we define measurement uncertainty, sampling uncertainty, data pre-processing uncertainty, method uncertainty, and model uncertainty, and investigate them in particular in the context of regression analyses. To this end, we consider data from observational studies with a focus on high dimensionality and heterogeneous variables, characteristics of growing importance. High dimensional data, i.e., data with more variables than observations, play an important role in medical research, where large amounts of molecular data (omics data) can be collected with ever decreasing expense and effort. Where several types of omics data are available, we are additionally faced with heterogeneity. Moreover, heterogeneous data can be found in many observational studies, where data originate from different sources, or where variables of different types are collected. This work comprises four contributions with different approaches to this topic and a different focus of investigation. Contribution 1 can be considered a practical example that illustrates data pre-processing and method uncertainty in the context of prediction and variable selection from high dimensional and heterogeneous data. In the first part of this paper, we introduce the development of priority-Lasso, a hierarchical method for prediction using multi-omics data. 
Priority-Lasso is based on standard Lasso and assumes a pre-specified priority order of blocks of data. The idea is to successively fit Lasso models on these blocks of data and to take the linear predictor from every fit as an offset in the fit of the block with the next lowest priority. In the second part, we apply this method in a current study of acute myeloid leukemia (AML) and compare its performance to standard Lasso. We illustrate data pre-processing and method uncertainty, caused by different choices of variable definitions and specifications of settings in the application of the method. These choices result in different effect estimates and thus different prediction performances and selected variables. In the second contribution, we compare method uncertainty with sampling uncertainty in the context of variable selection and ranking of omics biomarkers. For this purpose, we develop a user-friendly and versatile framework. We apply this framework to data from AML patients with high dimensional and heterogeneous characteristics and explore three different scenarios: first, variable selection in multivariable regression based on multi-omics data; second, variable ranking based on variable importance measures from random forests; and third, identification of genes based on differential gene expression analysis. In contributions 3 and 4, we apply the vibration of effects framework, which was initially used to analyze model uncertainty in a large epidemiological study (NHANES), to assess and compare different types of uncertainty. The two contributions intensively address the methodological extension of this framework to different types of uncertainty. In contribution 3, we describe the extension of the vibration of effects framework to sampling and data pre-processing uncertainty. 
As a practical illustration, we take a large data set from psychological research with heterogeneous variable structure (SAPA-project), and examine sampling, model and data pre-processing uncertainty in the context of logistic regression for varying sample sizes. Beyond the comparison of single types of uncertainty, we introduce a strategy which allows quantifying cumulative model and data pre-processing uncertainty and analyzing their relative contributions to the total uncertainty with a variance decomposition. Finally, we extend the vibration of effects framework to measurement uncertainty in contribution 4. In a practical example, we conduct a comparison study between sampling, model and measurement uncertainty on the NHANES data set in the context of survival analysis. We focus on different scenarios of measurement uncertainty which differ in the choice of variables considered to be measured with error. Moreover, we analyze the behavior of different types of uncertainty with increasing sample sizes in a large simulation study
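The abstract above describes priority-Lasso as successively fitting Lasso models on blocks of data in priority order, carrying the linear predictor of each fit forward as an offset. The sketch below illustrates only that idea under simplifying assumptions that are not taken from the source: a Gaussian (squared-error) response, for which using an offset is equivalent to fitting on the residuals; centered data with no intercept; fixed, hand-picked penalties instead of cross-validated ones; and a plain coordinate-descent solver. The function names and the toy data are hypothetical.

```python
def soft_threshold(value, lam):
    """Soft-thresholding operator used by Lasso coordinate descent."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso_cd(X, target, lam, n_iter=200):
    """Minimize (1/2n)*||target - X b||^2 + lam*||b||_1 by coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(target)  # equals target - X @ beta, with beta starting at 0
    for _ in range(n_iter):
        for j in range(p):
            col = [row[j] for row in X]
            z = sum(v * v for v in col) / n
            if z == 0.0:
                continue  # an all-zero column carries no information
            # Correlation of column j with the partial residual (beta_j removed).
            rho = sum(c * (r + c * beta[j]) for c, r in zip(col, resid)) / n
            new_b = soft_threshold(rho, lam) / z
            delta = new_b - beta[j]
            if delta != 0.0:
                resid = [r - c * delta for c, r in zip(col, resid)]
                beta[j] = new_b
    return beta


def priority_lasso(blocks, y, lams):
    """Fit blocks in priority order; each fit sees earlier predictors as offset."""
    offset = [0.0] * len(y)
    coefs = []
    for X, lam in zip(blocks, lams):
        # Gaussian case: an offset amounts to fitting on y minus the offset.
        target = [yi - oi for yi, oi in zip(y, offset)]
        beta = lasso_cd(X, target, lam)
        coefs.append(beta)
        offset = [oi + sum(b * x for b, x in zip(beta, row))
                  for oi, row in zip(offset, X)]
    return coefs, offset


# Hypothetical toy data: a one-column high-priority block and a two-column
# low-priority block whose second column is unrelated to the response.
block1 = [[-2.0], [-1.0], [0.0], [1.0], [2.0]]
block2 = [[1.0, 1.0], [-2.0, 1.0], [0.0, -4.0], [2.0, 1.0], [-1.0, 1.0]]
y = [-5.0, -5.0, 0.0, 5.0, 5.0]  # 3 * block1 column + 1 * first block2 column

coefs, linear_predictor = priority_lasso([block1, block2], y, lams=[0.1, 0.1])
print(coefs)
```

Because the high-priority block is fitted first, it gets the first chance to explain the response, and lower-priority blocks can only explain what remains. That ordering is the design choice the abstract attributes to the method.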

    MEASURABLE RESIDUAL DISEASE AND LEUKEMIC STEM CELLS IN ACUTE MYELOID LEUKEMIA

    Nearly all fit patients with acute myeloid leukemia (AML) receive intense chemotherapy, followed by consolidation therapy, which can be either additional cycle(s) of chemotherapy, autologous stem cell transplantation, or allogeneic stem cell transplantation. In this order, anti-leukemic efficacy increases together with toxicity. Fortunately, most patients achieve complete remission; unfortunately, 40-50% of patients experience a relapse. Patients who relapse have a dismal prognosis, since the relapse is mostly difficult to eradicate. A correct understanding of the risk of relapse is vital for selecting the correct therapy intensity. Risk stratification at diagnosis is based on factors such as age, white blood cell (WBC) count, and genetic characteristics (mutations and cytogenetic aberrations). This risk assessment at diagnosis does not suffice for an accurate estimation of which patients will relapse; therefore, more specific and sensitive methods (both flow cytometry and molecular techniques) are widely used to assess possible residual disease during and after therapy. When this residual disease (termed measurable residual disease or minimal residual disease, MRD) is present above a critical level, patients have a higher chance of experiencing a relapse. The overall aim of the studies described in this thesis is to investigate the role of measurable residual disease (MRD) and leukemic stem cells (LSC), and several initiatives to improve the MRD assessment to be used for relapse prediction for the individual patient. Chapter 2 reviews several aspects of LSCs in AML and their presumed role in relapse progression. Moreover, it discusses how these relatively rare cells can be detected by flow cytometry and how this detection is currently used in clinical application. In chapters 3 and 4 we investigated whether the LSC frequency harbors prognostic information for improved relapse prediction for AML. 
In chapter 3 we present the clinical significance of the presence and frequency of CD34+CD38- LSCs at time of diagnosis and in remission bone marrow in adult AML. In addition, the prognostic relevance of the combination of LSC-MRD and MFC-MRD is investigated. In chapter 4 we investigated whether detection of CD34+CD38- LSCs in BM of newly diagnosed pediatric AML bears similar prognostic relevance as shown in adult AML. In chapters 5 and 6 we elaborate on the importance of standardization of the flow cytometric MRD and LSC detection approaches. In chapter 5 we evaluated the technical and analytical feasibility of the previously designed eight-color LSC single tube assay, as well as standardization of the process. In chapter 6 we present a new flow cytometric model for standardized and objective MRD calculation, retrospectively applied in a large clinical study. For this, we evaluate whether the balance between neoplastic and normal progenitors in CR bone marrow has prognostic relevance. In chapter 7 we evaluate whether next-generation sequencing (NGS) has clinical value for the prediction of relapse. Since measurements were simultaneously evaluated for MFC-MRD, we investigated whether NGS and MFC-MRD have independent and additive prognostic value. In addition, we studied whether MRD and LSC-MRD are valid surrogate endpoints in AML. As shown in a recent clinical trial, the new therapeutic agent clofarabine has a clinically beneficial effect in a subgroup of patients. In chapter 8 we investigated whether the prospectively defined MRD and LSC-MRD frequencies differed between patients treated with and without clofarabine, and whether MRD levels mirrored the clinical outcome within this subgroup. Finally, in chapter 9 we summarize the results of this thesis and the implications these results may have for future AML relapse prediction. 
Furthermore, we evaluate the different techniques used in this thesis, discuss how each technique can be further optimized, and elaborate on the optimal use for future clinical trials.