Multimodal generative models should be able to learn a meaningful latent
representation that enables a coherent joint generation of all modalities
(e.g., images and text). Many applications also require the ability to
accurately sample modalities conditioned on observations of a subset of the
modalities. Since some modalities are often unobserved for some training data
points, semi-supervised learning should also be possible. In this study, we
evaluate a family of product-of-experts (PoE) based variational autoencoders
that have these desired properties. We include a novel PoE-based architecture
and training procedure. An empirical evaluation shows that the PoE-based models
can outperform an additive mixture-of-experts (MoE) approach. Our experiments
support the intuition that PoE models are better suited to a conjunctive
combination of modalities, while MoEs are better suited to a disjunctive fusion