Representation disentanglement may help AI fundamentally understand the real
world and thus benefit both discrimination and generation tasks. It currently
has at least three unresolved core issues: (i) heavy reliance on label
annotation and synthetic data, which causes poor generalization to natural
scenarios; (ii) heuristic or hand-crafted disentangling constraints, which make
it hard to adaptively achieve an optimal training trade-off; (iii) the lack of a
reasonable evaluation metric, especially for real label-free data. To address these
challenges, we propose a \textbf{C}losed-\textbf{L}oop unsupervised
representation \textbf{Dis}entanglement approach dubbed \textbf{CL-Dis}.
Specifically, we use a diffusion-based autoencoder (Diff-AE) as the backbone and
a β-VAE as a co-pilot to extract semantically disentangled
representations. The strong generation ability of the diffusion model and the
good disentanglement ability of the VAE are complementary. To strengthen
disentangling, VAE-latent distillation and diffusion-wise feedback are
interconnected in a closed-loop system so that the two models promote each other. Then, a
self-supervised \textbf{Navigation} strategy is introduced to identify
interpretable semantic directions in the disentangled latent space. Finally, a
new metric based on content tracking is designed to evaluate the
disentanglement effect. Experiments demonstrate the superiority of CL-Dis on
applications such as real image manipulation and visual analysis.
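
For reference, the standard β-VAE objective on which the co-pilot builds can be sketched as below. This is the generic formulation only: the symbols $\theta$, $\phi$, $q_\phi$, $p_\theta$ and $p(z)$ are notation assumed here for illustration, and the CL-Dis distillation and feedback terms are not included.

\begin{equation}
\mathcal{L}_{\beta\text{-VAE}}(\theta, \phi; x)
  = \mathbb{E}_{q_\phi(z \mid x)}\!\left[ \log p_\theta(x \mid z) \right]
  - \beta \, D_{\mathrm{KL}}\!\left( q_\phi(z \mid x) \,\|\, p(z) \right)
\end{equation}

Here $q_\phi(z \mid x)$ is the encoder, $p_\theta(x \mid z)$ the decoder, and $p(z)$ the latent prior, typically a standard Gaussian. Setting $\beta = 1$ recovers the ordinary VAE evidence lower bound, while $\beta > 1$ weights the KL term more heavily, which tends to encourage disentangled latents at some cost in reconstruction quality.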