19 research outputs found

    Clustering of samples and variables with mixed-type data

    No full text
    <div><p>Analysis of data measured on different scales is a relevant challenge. Biomedical studies often focus on high-throughput datasets of, e.g., quantitative measurements. However, the need for integration of other features possibly measured on different scales, e.g. clinical or cytogenetic factors, becomes increasingly important. The analysis results (e.g. a selection of relevant genes) are then visualized, while adding further information, like clinical factors, on top. However, a more integrative approach is desirable, where all available data are analyzed jointly, and where also in the visualization different data sources are combined in a more natural way. Here we specifically target integrative visualization and present a heatmap-style graphic display. To this end, we develop and explore methods for clustering mixed-type data, with special focus on clustering variables. Clustering of variables does not receive as much attention in the literature as does clustering of samples. We extend the variables clustering methodology by two new approaches, one based on the combination of different association measures and the other on distance correlation. With simulation studies we evaluate and compare different clustering strategies. Applying specific methods for mixed-type data proves to be comparable and in many cases beneficial as compared to standard approaches applied to corresponding quantitative or binarized data. Our two novel approaches for mixed-type variables show similar or better performance than the existing methods <i>ClustOfVar</i> and bias-corrected mutual information. Further, in contrast to ClustOfVar, our methods provide dissimilarity matrices, which is an advantage, especially for the purpose of visualization. Real data examples aim to give an impression of various kinds of potential applications for the integrative heatmap and other graphical displays based on dissimilarity matrices. 
We demonstrate that the presented integrative heatmap provides more information than common data displays about the relationship among variables and samples. The described clustering and visualization methods are implemented in our R package <i>CluMix</i> available from <a href="https://cran.r-project.org/web/packages/CluMix" target="_blank">https://cran.r-project.org/web/packages/CluMix</a>.</p></div>
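To make the distance-correlation idea in the abstract more concrete, the following is a minimal sketch of the sample distance correlation between two quantitative variables, using only the Python standard library. It is not the CluMix implementation: the function names are hypothetical, the handling of categorical variables is not covered, and turning the result into a dissimilarity as `1 - dcor` is an assumption made here for illustration only.

```python
import math


def _centered_distances(values):
    """Double-centered matrix of pairwise absolute differences."""
    n = len(values)
    dist = [[abs(a - b) for b in values] for a in values]
    row_means = [sum(row) / n for row in dist]
    grand_mean = sum(row_means) / n
    # The distance matrix is symmetric, so column means equal row means.
    return [
        [dist[i][j] - row_means[i] - row_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def distance_correlation(x, y):
    """Sample distance correlation of two equally long numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    a = _centered_distances(x)
    b = _centered_distances(y)

    def mean_product(p, q):
        return sum(p[i][j] * q[i][j] for i in range(n) for j in range(n)) / (n * n)

    dcov2 = mean_product(a, b)    # squared distance covariance
    dvar2_x = mean_product(a, a)  # squared distance variance of x
    dvar2_y = mean_product(b, b)  # squared distance variance of y
    denom = math.sqrt(dvar2_x * dvar2_y)
    if denom == 0:
        return 0.0  # a constant variable carries no association
    return math.sqrt(max(dcov2, 0.0) / denom)


def dcor_dissimilarity(x, y):
    """Assumed transform for illustration: dissimilarity = 1 - dcor."""
    return 1.0 - distance_correlation(x, y)
```

Applying `dcor_dissimilarity` to every pair of variables would yield a symmetric dissimilarity matrix of the kind that hierarchical clustering and heatmap displays take as input.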

    Median difference in error rates between each mixed-data method and binary clustering.

    No full text
    <p>For all simulation settings, misclassification rates (MCRs) obtained by clustering binarized datasets with the simple matching coefficient served as the reference and were subtracted from the corresponding MCRs of each of the four mixed-data variable clustering approaches. Shown are the medians of those differences (purple square: CluMix-ama, yellow triangle: CluMix-dcor, green circle: ClustOfVar, orange star: BCMI). In the left panel, different sample sizes (<i>n</i> = 25, 50, 100; panel columns) and numbers of variables (<i>p</i> = 50, 100, 200; panel rows) were considered, while keeping the within-group correlation fixed at 0.5 and the fraction of non-zero between-group correlations (with value 0.5) at 20%. In the right panel, settings varied with respect to within-group correlations (corr = 0.25, 0.5, 0.75; panel columns) and the fraction of between-group correlations with value 0.5 instead of 0 (noise = 0%, 20%, 40%; panel rows), while keeping the numbers of samples and variables each fixed at 100. Datasets were simulated with varying proportions of categorical variables (0%–100%; from left to right within each sub-figure).</p>
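The summary statistic described in this caption can be sketched as follows. This is an illustration only: the function name is hypothetical, and it assumes that the MCRs of a method and of the binary reference are paired by simulated dataset, which the caption suggests ("corresponding MCRs") but does not state explicitly.

```python
from statistics import median


def median_mcr_difference(method_mcrs, reference_mcrs):
    """Median of per-dataset differences: method MCR minus reference MCR.

    Negative values mean the mixed-data method misclassified fewer
    variables than clustering the binarized data.
    """
    if len(method_mcrs) != len(reference_mcrs):
        raise ValueError("both sequences must cover the same simulated datasets")
    return median(m - r for m, r in zip(method_mcrs, reference_mcrs))
```

Each plotted symbol would then correspond to one such median for one method in one simulation setting.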

    Like Table 2, but for methods for clustering or defining distances between <i>variables</i> of mixed types.

    No full text
    <p>The last four methods in the table are applied and compared throughout the manuscript.</p>

    Heatmap of similarities between clinical parameters in ALL.

    No full text
    <p>Similarities between variables available in the ALL dataset were calculated by the CluMix-ama approach. Stronger relationships between variables are indicated by shorter distances in the dendrograms and darker blue color in the heatmap.</p>

    Similarity of each variable with the remission indicator plotted against respective similarities with molecular ALL subtype.

    No full text
    <p>From the complete variable similarity matrix (derived by the CluMix-ama approach), the values for two variables of interest, namely remission and molecular subtype, are extracted and shown in a scatter plot, so that each point illustrates the similarity of a third covariate to both variables of interest. Plotting symbols are replaced by the respective variable names. Color distinguishes numerical (black) from categorical (purple) variables. This kind of illustration may help to identify surrogate, collinear and confounding variables.</p>
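The extraction step described in this caption can be sketched as below. The function and argument names are hypothetical, and the similarity matrix is assumed to be a square list of lists whose rows and columns follow the order of `names`; how the similarities themselves are computed is outside the scope of this sketch.

```python
def similarity_pairs(names, similarity, var_x, var_y):
    """Map every other variable to (similarity with var_x, similarity with var_y).

    Each returned pair gives the scatter plot coordinates of one covariate.
    """
    ix = names.index(var_x)
    iy = names.index(var_y)
    return {
        name: (similarity[k][ix], similarity[k][iy])
        for k, name in enumerate(names)
        if k not in (ix, iy)  # the two variables of interest are not plotted
    }
```

With the subtype on the x-axis and the remission indicator on the y-axis, a covariate close to the diagonal and far from the origin would be similarly related to both variables of interest.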

    Misclassification rates from clustering variables of 100 simulated datasets each, using the CluMix-ama, CluMix-dcor, ClustOfVar and BCMI approach.

    No full text
    <p>Datasets were simulated as described in the Methods section for evaluating clustering of variables. This plot shows results of the simulation setting with 50 samples, 100 variables, within-group correlation of 0.5, and 20% of between-group correlations of 0.5 instead of 0. MCRs (y-axis) were calculated based on clustering with Euclidean distances for the purely quantitative datasets (white), with the four approaches for mixed data (purple: CluMix-ama, yellow: CluMix-dcor, green: ClustOfVar, orange: BCMI) for datasets with varying amounts of categorical variables (0%, 25%, 75%, 100% from left to right), and with the simple matching coefficient for completely binarized data (grey).</p>
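The caption does not spell out how a misclassification rate is obtained from a clustering, so the following is only one common convention, stated here as an assumption: cluster labels are matched to the true variable groups by the one-to-one assignment that maximizes agreement, and the MCR is the remaining fraction of mismatches. The function name is hypothetical, and the brute-force search over assignments is only practical for a small number of groups.

```python
from itertools import permutations


def misclassification_rate(true_groups, cluster_labels):
    """Fraction of items misassigned under the best cluster-to-group matching."""
    if len(true_groups) != len(cluster_labels) or not true_groups:
        raise ValueError("inputs must be non-empty and of equal length")
    groups = sorted(set(true_groups))
    clusters = sorted(set(cluster_labels))
    if len(clusters) > len(groups):
        raise ValueError("sketch assumes no more clusters than true groups")
    best_hits = 0
    # Try every one-to-one assignment of clusters to true groups.
    for assigned in permutations(groups, len(clusters)):
        mapping = dict(zip(clusters, assigned))
        hits = sum(mapping[c] == t for t, c in zip(true_groups, cluster_labels))
        best_hits = max(best_hits, hits)
    return 1 - best_hits / len(true_groups)
```

Because cluster labels are arbitrary, a clustering that reproduces the true groups under different label names yields an MCR of 0.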

    Functionality of the CluMix R package.

    No full text
    <p>Distance matrices are derived separately for samples and variables. They form the basis for hierarchical clustering and integrative visualization of mixed data.</p>

    Overview of the most important symbols.

    No full text
    <p>Overview of the most important symbols.</p>

    In vitro replication of TTV-HD14b and TTV-HD14c as measured by real-time quantitative PCR.

    No full text
    <p>Replication of (A) TTV-HD14b and (B) TTV-HD14c. qPCR values are expressed as ΔΔCt relative to BJAB (used as calibrator). Shown are means ± 95% confidence intervals of triplicate tests. A significantly higher replication level of both TTV isolates was detected in all cell lines compared with BJAB (p < 0.05).</p>