Visualization of the CPD (left) and CFI (right) based on <i>n</i> = 100 samples and the true function <i>f</i>.
Since β<sub>4</sub> = 0 in this example, the 4th component has no effect on the value of <i>f</i>, resulting in a CFI of zero and a flat CPD. Since we are not estimating <i>f</i>, the CFI values correspond exactly to the β-coefficients in this example.
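The statement that a zero coefficient yields a zero perturbation effect can be checked numerically. Below is a minimal sketch assuming a zero-sum linear log-contrast function <i>f</i>; the exact CFI definition is given in the paper's appendix, and the average perturbation effect computed here is only an illustrative analogue.

```python
import numpy as np

rng = np.random.default_rng(0)

# Zero-sum log-contrast coefficients; beta[3] = 0 as in the figure.
beta = np.array([1.0, -0.5, -0.5, 0.0])

def f(X):
    # Linear log-contrast function f(x) = sum_j beta_j * log(x_j).
    return np.log(X) @ beta

def perturb(X, j, t):
    # Multiply component j by t, then renormalize back onto the simplex.
    Xp = X.copy()
    Xp[:, j] *= t
    return Xp / Xp.sum(axis=1, keepdims=True)

X = rng.dirichlet(np.ones(4), size=100)  # n = 100 compositional samples

# Average perturbation effect per component (a crude CFI analogue).
t = 2.0
effects = np.array([np.mean(f(perturb(X, j, t)) - f(X)) for j in range(4)])

# Because the coefficients sum to zero, renormalization cancels and the
# effect of scaling component j by t is exactly beta_j * log(t); in
# particular, the effect of component 4 is zero.
print(np.round(effects, 6))
```

This also illustrates why zero-sum coefficients matter: without the constraint, the renormalization step would leak into every component's effect.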
Overview of KernelBiome.
We start from a paired dataset with a compositional predictor X and a response Y, and optional prior knowledge on the relation between components in the compositions (e.g., via a phylogenetic tree). We then select, among a large class of kernels, the model which best fits the data. This results in an estimated model and embedding. Finally, these can be analyzed while accounting for the compositional structure.
Proofs.
Supervised learning, such as regression and classification, is an essential tool for analyzing modern high-throughput sequencing data, for example in microbiome research. However, due to the compositionality and sparsity, existing techniques are often inadequate. Either they rely on extensions of the linear log-contrast model (which adjust for compositionality but cannot account for complex signals or sparsity) or they are based on black-box machine learning methods (which may capture useful signals, but lack interpretability due to the compositionality). We propose KernelBiome, a kernel-based nonparametric regression and classification framework for compositional data. It is tailored to sparse compositional data and is able to incorporate prior knowledge, such as phylogenetic structure. KernelBiome captures complex signals, including in the zero-structure, while automatically adapting model complexity. We demonstrate on par or improved predictive performance compared with state-of-the-art machine learning methods on 33 publicly available microbiome datasets. Additionally, our framework provides two key advantages: (i) We propose two novel quantities to interpret contributions of individual components and prove that they consistently estimate average perturbation effects of the conditional mean, extending the interpretability of linear log-contrast coefficients to nonparametric models. (ii) We show that the connection between kernels and distances aids interpretability and provides a data-driven embedding that can augment further analysis. KernelBiome is available as an open-source Python package on PyPI and at https://github.com/shimenghuang/KernelBiome.
Fig 5 -
(A) shows a kernel PCA for the centralparksoil dataset with two principal components. On the right, the contribution of the species to each of the two components is given (see Sec. B of S5 Appendix for details). (B) and (C) are both based on the cirrhosis dataset. In (B), the CFI values are shown on the left, and the right plot compares the proposed kernel health score with Simpson diversity. In (C), the scaled CFI values are illustrated for different weightings. A darker shade of the (shortened) microbiota name indicates a stronger (positive resp. negative) CFI.
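Kernel PCA as used in panel (A) can be sketched in a few lines: double-center the Gram matrix and take its top eigenvectors. The RBF kernel and the Dirichlet-simulated data below are illustrative stand-ins; KernelBiome selects its kernel data-adaptively among compositional kernels.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.dirichlet(np.ones(10), size=50)  # 50 compositions, 10 species

def rbf_kernel(X, gamma=1.0):
    # Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2).
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * d2)

K = rbf_kernel(X)
n = K.shape[0]

# Double-center the Gram matrix (equivalent to centering in feature space).
H = np.eye(n) - np.ones((n, n)) / n
Kc = H @ K @ H

# Kernel principal component scores: top eigenvectors scaled by sqrt(eigenvalue).
vals, vecs = np.linalg.eigh(Kc)
idx = np.argsort(vals)[::-1][:2]
Z = vecs[:, idx] * np.sqrt(np.clip(vals[idx], 0.0, None))
print(Z.shape)  # two-dimensional embedding of the 50 samples
```

The species contributions shown next to the embedding require the additional bookkeeping described in Sec. B of S5 Appendix and are not reproduced here.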
Details on CFI and CPD.
Formal definitions of perturbations and estimators related to CFI and CPD. (PDF)
Fig 3 -
(A) Comparison of predictive performance on 33 public datasets (9 regression and 24 classification tasks, separated by the grey vertical line in the figure) based on 20 random 10-fold CV runs. On the two datasets with grey tick labels, no method significantly outperforms the baseline based on the Wilcoxon signed-rank test, meaning that there is little signal in the data. The datasets in green are those where KernelBiome significantly outperforms the baseline, while it does not on the single dataset with the black label. The corresponding p-values are provided in brackets. (B) Percentage of times a method is significantly outperformed by another based on the Wilcoxon signed-rank test. (C) Average run time of each method on each dataset. A significance level of 0.05 is used.
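The paired significance test behind panels (A) and (B) can be sketched as follows. The CV scores below are synthetic placeholders, not results from the paper; only the test itself is the point.

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(2)

# Hypothetical paired per-fold CV scores for two methods on one dataset
# (e.g., 20 repetitions of 10-fold CV -> 200 paired scores).
baseline = rng.normal(0.70, 0.05, size=200)
method = baseline + rng.normal(0.03, 0.02, size=200)  # slightly better

# Paired one-sided Wilcoxon signed-rank test: does `method` outperform
# `baseline` at significance level 0.05?
stat, p = wilcoxon(method, baseline, alternative="greater")
print(p < 0.05)
```

Because the test is applied to paired fold-wise scores, it accounts for the fact that both methods are evaluated on the same splits.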
Details and additional results for experiments in Sec. 3.
Dataset pre-processing, parameter setup, construction of the weighting matrices with the UniFrac distance, and further experimental results based on the cirrhosis and centralparksoil datasets. (PDF)
Additional experiments with simulated data.
Consistency of CFI and CPD, and comparison of CFI and CPD with their non-simplex counterparts. (PDF)
Fig 4 -
Left and middle: Predictive performance of weighted KernelBiome when the given weights are informative (DGP1) and adversarial (DGP2), based on 200 repetitions. Right: The number of other methods each method is significantly outperformed by, based on the Wilcoxon signed-rank test (significance level 0.05), under DGP1 and DGP2.
Background on kernels.
Mathematical background on kernels and details on dimensionality reduction and visualization with kernels. (PDF)
