Self-supervised Metric Learning in Multi-View Data: A Downstream Task Perspective
Self-supervised metric learning has been a successful approach for learning a distance from an unlabeled dataset. The resulting distance is broadly useful for improving various distance-based downstream tasks, even when no information from those tasks is used in the metric learning stage. To gain insight into this approach, we develop a statistical framework to study theoretically how self-supervised metric learning can benefit downstream tasks in the context of multi-view data. Under this framework, we show that the target distance of metric learning satisfies several properties desirable for downstream tasks. On the other hand, our investigation suggests that the target distance can be further improved by moderating each direction's weight. In addition, our analysis precisely characterizes the improvement from self-supervised metric learning on four commonly used downstream tasks: sample identification, two-sample testing, k-means clustering, and k-nearest neighbor classification. When the distance is estimated from an unlabeled dataset, we establish an upper bound on the accuracy of the distance estimate and the number of samples sufficient for downstream task improvement. Finally, numerical experiments are presented to support the theoretical results. Supplementary materials for this article are available online.
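The abstract's core idea, using a learned, direction-weighted distance in place of the plain Euclidean distance for a downstream task such as k-nearest neighbor classification, can be sketched as follows. The weighting scheme and toy data here are illustrative assumptions, not the paper's estimator.

```python
import numpy as np

def weighted_distance(x, y, w):
    # Distance with nonnegative per-direction weights w; the article argues
    # that moderating each direction's weight can improve the target distance.
    # This particular form is an illustrative choice, not the paper's.
    return np.sqrt(np.sum(w * (x - y) ** 2))

def knn_predict(X_train, y_train, x, w, k=3):
    # k-nearest neighbor classification (one of the four downstream tasks),
    # using the weighted distance instead of the plain Euclidean distance.
    d = np.array([weighted_distance(xi, x, w) for xi in X_train])
    neighbors = np.argsort(d)[:k]
    return np.bincount(y_train[neighbors]).argmax()

# Toy illustration: two well-separated classes in two dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (20, 2)), rng.normal(5.0, 1.0, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
w = np.array([1.0, 1.0])  # uniform weights recover the Euclidean distance
label = knn_predict(X, y, np.array([5.0, 5.0]), w)
```

Setting all weights equal recovers ordinary Euclidean kNN, so the weighted variant is a strict generalization.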
Compositional bias can create false clusters in PCoA plots.
In (a) and (b), samples are randomly divided into two groups. No modification is applied in (a), while the count data in group 1 are rarefied in (b). In (c), samples are divided into two groups based on sequencing depth (>10000 belongs to the first group, and <5000 to the second). In these figures, RSim normalization helps remove the false clusters resulting from compositional bias. Euclidean distance with log transformation is used in all PCoA plots.
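The construction behind these plots (log transformation, Euclidean distance, then principal coordinates analysis) can be sketched as below. The pseudocount of 1 and the classical-MDS implementation are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def pcoa(counts, n_axes=2, pseudocount=1.0):
    # Log-transform the count matrix (samples x taxa); the pseudocount
    # avoids log(0) and is an assumption here.
    X = np.log(counts + pseudocount)
    # Euclidean distance between log-transformed samples.
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    # PCoA = classical multidimensional scaling on the distance matrix:
    # double-center the squared distances, then eigendecompose.
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    vals, vecs = np.linalg.eigh(B)           # ascending eigenvalues
    top = np.argsort(vals)[::-1][:n_axes]    # keep the largest axes
    return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0.0))

counts = np.array([[10, 0, 5],
                   [8, 1, 4],
                   [0, 20, 3],
                   [1, 18, 2]], dtype=float)
coords = pcoa(counts)  # first two principal coordinates per sample
```

In the toy matrix, samples 1–2 and 3–4 have similar compositions, so they should land near each other in the ordination.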
Detailed description of simulation.
Descriptions of how the datasets were generated in simulations. (PDF)
Normalization can reduce false discovery and improve the power of association analysis.
In (a), the samples are randomly divided into two groups, and the count data in the first group are rarefied. In (b), the synthetic data include differentially abundant taxa. The significance level is 0.05 in both (a) and (b). Normalization is an essential step to avoid false discoveries and improve power.
Misclassification rate control when the signal strength of differentially abundant taxa, the balance of group sizes, the proportion of differentially abundant taxa, and the sample size vary.
The empirical misclassification rate under RSim is well controlled regardless of the signal strength of differentially abundant taxa, the balance of group sizes, the proportion of differentially abundant taxa, and the sample size. (PNG)
Comparisons of normalization methods in estimating sampling fraction.
The numerical experiments are performed when the signal strength of differentially abundant taxa is (a) weak, (b) moderate, and (c) strong. In (a), (b), and (c), the x-axis shows the true sampling fractions and the y-axis the sampling fractions estimated by the normalization methods. We scale the estimated sampling fractions so that their average matches the average of the true sampling fractions. The black line in these figures represents equality between the estimated and true sampling fractions, and the color of each point indicates the group to which the differentially abundant taxa belong. The bias in sampling fraction estimation by different normalization methods is compared in (d) as the signal strength and the proportion (p = 0.1, 0.2, 0.3) of differentially abundant taxa vary. The reference-based method corrects compositional bias better than existing methods, especially when a large proportion of taxa are strongly differentially abundant.
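The rescaling step described in the caption, matching the average of the estimated sampling fractions to the average of the true ones, is simple to state in code. The bias summary below (mean absolute deviation after rescaling) is an assumed metric for illustration; the exact statistic plotted in panel (d) is not specified here.

```python
import numpy as np

def rescale_fractions(estimated, true):
    # Scale the estimated sampling fractions so their average equals the
    # average of the true fractions, as described for panels (a)-(c).
    return estimated * (true.mean() / estimated.mean())

def estimation_bias(estimated, true):
    # One simple bias summary: mean absolute deviation after rescaling.
    # The exact statistic used in panel (d) is an assumption here.
    return np.mean(np.abs(rescale_fractions(estimated, true) - true))

true = np.array([0.10, 0.20, 0.30, 0.40])
perfect_up_to_scale = 2.0 * true                      # off by a constant factor only
skewed = true + np.array([0.05, 0.0, 0.0, -0.05])     # distorted relative fractions
```

An estimator that is correct up to a constant factor has zero bias after rescaling, which is exactly why the rescaling is needed before comparing methods: sampling fractions are only identifiable up to scale.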
False pattern caused by compositional bias leads to a misleading conclusion.
(a) shows the PCoA plots colored by days since the experiment started. (b) presents the PCoA plots colored by sequencing depth. (c) shows the relationship between time and sequencing depth. The pattern of time in the PCoA plots overlaps strongly with the pattern of sequencing depth, which can be explained by the deterministic relationship between time and sequencing depth. (PNG)
Computational time of different normalization methods (in seconds).
d is the number of taxa and n is the sample size. All experiments were conducted on an iMac (Apple M1, 8 GB RAM). Data are subsampled from the dataset collected in [30]. (PDF)
RSim normalization helps the two-sample t-test control false discovery.
Samples are divided into two groups based on sequencing depth (samples with depth below 20000 belong to the second group), and the FDR is shown for different significance levels. In (a), seven normalization methods are compared. In (b), a two-sample t-test equipped with RSim normalization is compared with state-of-the-art differential abundance tests.
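In a simulation where the truly null taxa are known, the empirical FDR of per-taxon two-sample t-tests can be computed as sketched below. The simulation setup (Gaussian abundances, 10 differential taxa out of 50, a mean shift of 3) is purely illustrative and not taken from the paper.

```python
import numpy as np
from scipy.stats import ttest_ind

def empirical_fdr(group1, group2, is_null, alpha=0.05):
    # Per-taxon two-sample t-tests on (normalized) abundances. In a
    # simulation where is_null marks taxa with no true difference, the
    # empirical FDR is the share of discoveries that are null taxa.
    pvals = np.array([ttest_ind(group1[:, j], group2[:, j]).pvalue
                      for j in range(group1.shape[1])])
    discovered = pvals < alpha
    if not discovered.any():
        return 0.0
    return (discovered & is_null).sum() / discovered.sum()

# Illustrative simulation: 10 truly differential taxa out of 50.
rng = np.random.default_rng(1)
g1 = rng.normal(0.0, 1.0, (20, 50))
g2 = rng.normal(0.0, 1.0, (20, 50))
g2[:, :10] += 3.0                # strong signal in the first 10 taxa
is_null = np.arange(50) >= 10    # the remaining 40 taxa are null
fdr = empirical_fdr(g1, g2, is_null)
```

Repeating this over many simulated datasets and several values of `alpha` yields FDR curves like those compared across normalization methods in panel (a).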
Normalization can improve the power of association analysis.
In (a) and (b), samples are randomly divided into two groups, and the top 25% most abundant taxa are differentially abundant with respect to a binary or continuous latent variable. The significance level is 0.05. RSim improves the power of association analysis. (PNG)
