
    Illustration of the proposed approaches to enhance the standard workflow for onco-hematological patient stratification.

    Description of the common recent workflow in onco-hematology (in the black box) for the analysis of genomic data, used to improve the identification of disease components that can support the development of novel disease classification systems. On the right, we illustrate the novel approaches presented here, which enhance the statistical characterization of components and provide a maximum-likelihood-inspired alternative for patient classification. Our approaches build upon the outcome of the Hierarchical Dirichlet Mixture Model (HDMM) of multinomials that is usually fitted to the data. Given the HDMM outcome, the components are usually characterized by one or a few genomic drivers, chosen according to how the HDMM clustered the genomic alterations and to a priori clinical knowledge. In contrast, our approaches use the HDMM outcome to characterize the components either as multinomials, i.e., in line with the HDMM, or as Multivariate Fisher Non Central Hypergeometric (MFNCH) distributions. Each distribution models a different urn problem. The multinomials model draws with replacement from an urn containing multiple marbles of different colors. The MFNCH distributions we use instead model draws without replacement from an urn containing a single marble per color, with each marble having a different size.
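The two urn models described in this caption can be written out as probability mass functions. The following is a sketch in standard textbook notation, not taken from the source: the symbols M (number of genomic alterations, i.e., colors), n (number of draws), p_i (color probabilities) and ω_i (marble weights, standing in for the "size" of each marble) are assumptions introduced here for illustration.

```latex
% Multinomial: n draws WITH replacement; p_i is the probability of color i.
P(\mathbf{x} \mid n, \mathbf{p})
  = \frac{n!}{\prod_{i=1}^{M} x_i!} \prod_{i=1}^{M} p_i^{x_i},
  \qquad \sum_{i=1}^{M} x_i = n

% MFNCH with a single marble per color: n draws WITHOUT replacement;
% \omega_i is the weight (size) of the marble of color i.
P(\mathbf{x} \mid n, \boldsymbol{\omega})
  = \frac{\prod_{i=1}^{M} \omega_i^{x_i}}
         {\sum_{\mathbf{y} \in \{0,1\}^{M},\ \sum_i y_i = n} \prod_{i=1}^{M} \omega_i^{y_i}},
  \qquad x_i \in \{0, 1\},\ \sum_{i=1}^{M} x_i = n
```

Under the single-marble assumption each alteration can be drawn at most once, so the binomial coefficients of the general Fisher noncentral hypergeometric distribution all equal one and the normalizing constant reduces to the elementary symmetric polynomial of degree n in the weights.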

    Characterizations of the HDMM components on Acute Myeloid Leukemia (AML) from the three different approaches.

    Comparison of the three approaches used on a public AML dataset in terms of how components are characterized after the HDMM fit. The left column shows our multinomial-based approach, which is the most statistically rigorous of the three, given that the HDMM is run to estimate a mixture of multinomials. In this case, each component is treated as a multinomial and the most frequent genomic alterations are prioritized. The central column reports the driving genomic alterations chosen by the standard workflow in onco-hematology. The driver alterations are chosen based on how frequently they are associated with the component and on a priori clinical knowledge. The right column shows the prioritization obtained when characterizing each component as an MFNCH distribution. The latter seems to offer the best compromise between a purely statistical approach (i.e., multinomial-based) and a clinically educated one (i.e., standard workflow). Bold genomic alterations indicate the driving genomic alterations reported by the standard workflow (central column). In both the left and right columns, only the top six alterations with non-zero parameters are reported. In addition, beside each alteration we report the number of times that alteration was clustered by the HDMM into the component. The vertical bar for components 2 and 3 in the central column is a logical OR between drivers, i.e., they are equally prioritized.

    Convergence of HDMM around the number of estimated components can be hard to reach.

    Results on the convergence capability of the HDMM on simulated data similar to the onco-hematological data commonly used to discover novel disease classes. In this plot we illustrate how frequently the number of components estimated by the HDMM occurs along the Markov Chain Monte Carlo (MCMC) run employed to perform the fit. Exact convergence would imply 100% frequency on the y-axis. Each quadrant reports the number of simulated components and whether the simulated components tend to be uniform-like (αsim = 1) or low-overlapping (αsim = 1/M, where M is the number of genomic alterations). In addition, the color of the points indicates whether the HDMM was run to detect more uniform-like components (αHDMM = 1, in red) or more disjunct components (αHDMM = 1/M, in blue). The higher the frequency on the y-axis, the closer the HDMM is to convergence.
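The frequency plotted on the y-axis can be illustrated with a small sketch. It assumes, hypothetically, that the MCMC run exposes the number of components at each iteration as a plain list and that the estimate is taken as the most frequent value after an optional burn-in; the function name, the `burn_in` parameter and the example trace are illustrative and not from the source.

```python
from collections import Counter


def component_count_frequency(k_trace, burn_in=0):
    """Return (k_hat, percent) for a trace of component counts.

    k_trace: number of components at each MCMC iteration (assumed input).
    burn_in: number of initial iterations to discard (assumed convention).
    k_hat is the most frequent count; percent is how often it occurs.
    """
    kept = list(k_trace)[burn_in:]
    if not kept:
        raise ValueError("no iterations left after burn-in")
    k_hat, hits = Counter(kept).most_common(1)[0]
    return k_hat, 100.0 * hits / len(kept)


# Hypothetical trace of component counts along ten iterations.
trace = [7, 6, 5, 5, 6, 5, 5, 5, 4, 5]
k_hat, freq = component_count_frequency(trace, burn_in=2)
print(k_hat, freq)
```

A value of 100 would correspond to the exact convergence mentioned in the caption: every retained iteration reports the same number of components.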

    Stratification performance of the multinomial-based approach on simulated patients.

    Illustration of the accuracy of our proposed maximum-likelihood approach based on multinomials to assign the simulated patients to the HDMM components. The accuracy metric is the Adjusted Rand Index (ARI), which can handle scenarios where the observed number of components differs from the expected one. An ARI of one indicates perfect agreement. The upper quadrant reports the results for K = 5 simulated components, while the lower quadrant reports them for K = 10 components. The variable αsim indicates whether the expected components were simulated as uniform-like (αsim = 1) or low-overlapping (αsim = 1/M). Similarly, αHDMM = 1 indicates that the HDMM was set to find poorly disjunct components, whereas αHDMM = 1/M caused the HDMM to estimate highly disjunct components. The boxplots summarize the performance across all average numbers of genomic alterations per simulated patient.
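A minimal sketch of the two ingredients named in this caption, under stated assumptions: each patient is assigned to the component with the highest multinomial likelihood, and the assignment is scored against the simulated labels with the ARI. The component probabilities, patient vectors and function names below are hypothetical and chosen only to make the example small; the source does not specify an implementation.

```python
import math
from collections import Counter


def multinomial_loglik(counts, probs):
    """Sum of x_i * log(p_i). The multinomial coefficient depends only on
    the patient, not on the component, so it is dropped for comparison."""
    total = 0.0
    for x, p in zip(counts, probs):
        if x:
            if p == 0.0:
                return -math.inf  # component cannot generate this alteration
            total += x * math.log(p)
    return total


def assign(patients, components):
    """Index of the maximum-likelihood component for each patient."""
    return [
        max(range(len(components)),
            key=lambda k: multinomial_loglik(x, components[k]))
        for x in patients
    ]


def adjusted_rand_index(labels_a, labels_b):
    """ARI from pair counts; the two labelings may use different numbers
    of clusters and different label names."""
    n = len(labels_a)
    if n < 2:
        return 1.0
    sum_ij = sum(math.comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(math.comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(math.comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / math.comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (sum_ij - expected) / (max_index - expected)


# Hypothetical data: two components over four alterations, four patients.
components = [[0.7, 0.2, 0.1, 0.0], [0.0, 0.1, 0.2, 0.7]]
patients = [[1, 1, 0, 0], [1, 0, 1, 0], [0, 0, 1, 1], [0, 1, 0, 1]]
truth = ["A", "A", "B", "B"]

predicted = assign(patients, components)
print(predicted, adjusted_rand_index(truth, predicted))
```

Because the ARI compares pairs of patients rather than label names, it stays well defined when the HDMM returns more or fewer components than were simulated.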

    Difficulty for HDMM to estimate the expected number of components on simulated patients.

    Results on the convergence capability of the HDMM on simulated data that aim to reproduce the onco-hematological data commonly used to discover novel disease classes. In this plot we show the number of components estimated by the HDMM (y-axis) against the average number of genomic alterations per simulated patient for several settings (x-axis). The expected number of components K, on a logarithmic scale, is indicated by the horizontal yellow line, while the observed number is reported on the y-axis. Along with K, each quadrant shows whether the simulated components tend to be uniform-like (αsim = 1) or low-overlapping (αsim = 1/M, where M is the number of genomic alterations). In addition, the color of the points indicates whether the HDMM was run to detect more uniform-like components (αHDMM = 1, in red) or more disjunct components (αHDMM = 1/M, in blue).

    The MFNCH-based approach stratifies patients at least as accurately as the multinomial-based approach.

    Overview of the impact on accuracy of using the MFNCH-based approach instead of the multinomial-based approach to characterize the components estimated by the HDMM and to assign simulated patients to such components. Each simulated patient is assigned to the component that has the highest likelihood of generating that patient. The likelihood of a component for a sample is calculated using the p.m.f. of its MFNCH distribution. The change in performance is reported for either K = 5 simulated components (upper) or K = 10 simulated components (lower). In addition, it is reported for all combinations of the concentration parameters αsim and αHDMM, which respectively regulate whether the simulated components tend to be conjunct (αsim = 1) or disjunct (αsim = 1/M) and whether the HDMM was run to fit a mixture of more uniform-like (αHDMM = 1) or low-overlapping components (αHDMM = 1/M). A positive value on the y-axis reflects a gain in accuracy when stratifying simulated patients.
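The assignment rule in this caption can be sketched as follows, assuming the single-marble-per-color urn so that a patient is a 0/1 vector of alterations and the MFNCH normalizing constant is the elementary symmetric polynomial of the component weights. The weights, the patient vector and all function names are hypothetical; how the weights are obtained from the HDMM outcome is not covered here.

```python
import math


def elementary_symmetric(weights, n):
    """e_n(weights): sum over all n-subsets of the product of their weights,
    computed with the standard O(M * n) dynamic program."""
    e = [1.0] + [0.0] * n
    for w in weights:
        for j in range(n, 0, -1):
            e[j] += w * e[j - 1]
    return e[n]


def mfnch_logpmf(x, weights):
    """Log p.m.f. of a 0/1 vector x under an MFNCH with one marble per color."""
    if any(v not in (0, 1) for v in x):
        raise ValueError("x must be a 0/1 vector when each color has one marble")
    log_num = 0.0
    for v, w in zip(x, weights):
        if v:
            if w <= 0.0:
                return -math.inf  # this marble can never be drawn
            log_num += math.log(w)
    return log_num - math.log(elementary_symmetric(weights, sum(x)))


def assign_mfnch(x, component_weights):
    """Index of the component with the highest MFNCH likelihood for x."""
    return max(range(len(component_weights)),
               key=lambda k: mfnch_logpmf(x, component_weights[k]))


# Hypothetical weights for two components over four alterations.
component_weights = [[4.0, 2.0, 1.0, 1.0], [1.0, 1.0, 2.0, 4.0]]
patient = [1, 1, 0, 0]

best = assign_mfnch(patient, component_weights)
prob = math.exp(mfnch_logpmf(patient, component_weights[0]))
print(best, prob)
```

Working in log space keeps the comparison stable when many alterations are involved, and the dynamic program avoids enumerating every subset of alterations of the patient's size.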

    Clustering of genomic alterations provided by the HDMM on public AML data.

    The table shows how one HDMM clusters all gene mutations and cytogenetic anomalies across one garbage component (column 0) and ten components (1-10). Some alterations are uniquely assigned to a single component, but more than half are assigned to at least two components. In addition, this table shows that the genomic alterations are not equally abundant in the cohort, with NPM1 being the most frequently occurring alteration. (CSV)