Semi-supervised learning (SSL) is a promising approach for training deep
classification models using labeled and unlabeled datasets. However, existing
SSL methods rely on a large unlabeled dataset, which may not always be
available in many real-world applications due to legal constraints (e.g.,
GDPR). In this paper, we investigate the research question: Can we train SSL
models without real unlabeled datasets? Instead of using real unlabeled
datasets, we propose an SSL method using synthetic datasets generated from
generative foundation models trained on datasets containing millions of samples
in diverse domains (e.g., ImageNet). Our main concepts are identifying
synthetic samples that emulate unlabeled samples from generative foundation
models and training classifiers using these synthetic samples. To achieve this,
our method is formulated as an alternating optimization problem: (i)
meta-learning of generative foundation models and (ii) SSL of classifiers using
real labeled and synthetic unlabeled samples. For (i), we propose a
meta-learning objective that optimizes latent variables to generate samples
that resemble real labeled samples and minimize the validation loss. For (ii),
we propose a simple unsupervised loss function that regularizes the feature
extractors of classifiers to maximize the performance improvement obtained from
synthetic samples. We confirm that our method outperforms baselines using
generative foundation models on SSL. We also demonstrate that our methods
outperform SSL using real unlabeled datasets in scenarios with extremely small
amounts of labeled datasets. This suggests that synthetic samples have the
potential to provide improvement gains more efficiently than real unlabeled
data.Comment: Accepted to the 15th Asian Conference on Machine Learning (ACML2023);
a preprint of the camera-ready versio