Deep Visual Representation Learning for Classification and Retrieval: Uncertainty, Geometry, and Applications

Abstract

Deep visual representation learning is the process by which deep neural networks discover a low-dimensional latent feature space, or embedding space, of visual data such that distance serves as a proxy for semantic dissimilarity. We consider deep visual representation learning tailored to classification and retrieval applications; that is, the representation is trained to discriminate between inputs belonging to different classes. In particular, we explore two facets of these visual representations: their stochasticity and their geometry. The vast majority of losses used to discover visual representations operate on deterministic embeddings, where an input projects to a single point in the embedding space. Methods that produce stochastic embeddings, in contrast, project an input to a random variable whose distribution reflects its uncertainty in the semantic space. Capturing uncertainty in the embedding space is useful for robust classification and retrieval, for informing downstream applications, and for interpreting representations. Our primary focus is designing novel loss functions for discovering stochastic visual representations that perform as well as or better than deterministic alternatives, are efficient and tractable, and are more robust. Our secondary focus is the geometry of the representation. We consider three geometries: Euclidean, spherical, and hyperbolic. Each induces its own constraints on the latent space, and we explore all three empirically in conjunction with designing stochastic embedding methods. We propose two novel stochastic methods: (1) the Stochastic Prototype Embedding, which uses Gaussians in Euclidean space, and (2) the von Mises-Fisher loss, which uses von Mises-Fisher distributions in spherical space (i.e., on the unit hypersphere). While each of the three geometries has benefits, we find that spherical methods produce the strongest discrimination between classes and are thus well-suited to the downstream retrieval and classification applications that act on the learned representations. Our tertiary focus is the application of discriminative visual representations, which we address through two large-scale empirical studies aimed at practitioners. The first unifies few- and zero-shot egocentric action recognition (and, more generally, few- and zero-shot classification), verifying that the same representation can be used jointly for both tasks without degrading generalization. The second explores clustering of pretrained embeddings, with results that emphasize (1) the benefit of spherical representations, (2) the value of shallow, unsupervised clustering methods, such as hierarchical agglomerative clustering, when carefully tuned and benchmarked, and (3) the fragility of recent supervised, deep clustering methods when operating on embeddings with more uncertainty (i.e., less discrimination).
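To make the deterministic-versus-stochastic contrast concrete, the following is a minimal PyTorch sketch of a stochastic embedding head that maps backbone features to a diagonal Gaussian rather than a single point. The module name, dimensions, and reparameterized sampling are illustrative assumptions, not the exact Stochastic Prototype Embedding formulation.

import torch
import torch.nn as nn


class GaussianEmbeddingHead(nn.Module):
    """Embeds an input as a diagonal Gaussian instead of a single point.

    The variance output lets the model express semantic uncertainty about
    where the input belongs in the embedding space.
    """

    def __init__(self, feat_dim=512, embed_dim=64):
        super().__init__()
        self.mu = nn.Linear(feat_dim, embed_dim)       # mean of the embedding
        self.log_var = nn.Linear(feat_dim, embed_dim)  # log of the diagonal variance

    def forward(self, features, n_samples=8):
        mu = self.mu(features)
        std = torch.exp(0.5 * self.log_var(features))
        # Reparameterization trick: differentiable samples from N(mu, diag(std^2)).
        eps = torch.randn(n_samples, *mu.shape, device=mu.device)
        return mu, std, mu.unsqueeze(0) + eps * std.unsqueeze(0)


# Hypothetical usage with stand-in backbone features.
head = GaussianEmbeddingHead()
feats = torch.randn(32, 512)
mu, std, samples = head(feats)
print(mu.shape, samples.shape)  # torch.Size([32, 64]) torch.Size([8, 32, 64])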
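Similarly, a spherical representation can be sketched by constraining embeddings and class prototypes to the unit hypersphere, where a softmax over cosine similarities scaled by a concentration parameter kappa is proportional to a mixture of von Mises-Fisher likelihoods with a shared, fixed kappa. This is a simplified stand-in for, not a reproduction of, the thesis's von Mises-Fisher loss, which treats the embedding itself as a vMF random variable.

import torch
import torch.nn as nn
import torch.nn.functional as F


class SphericalPrototypeClassifier(nn.Module):
    """Scores unit-norm embeddings against unit-norm class prototypes.

    A softmax over cosine similarities scaled by a shared concentration
    kappa is proportional to a mixture of von Mises-Fisher likelihoods.
    """

    def __init__(self, embed_dim, n_classes, kappa=16.0):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(n_classes, embed_dim))
        self.kappa = kappa  # larger kappa -> peakier class posteriors

    def forward(self, z):
        z = F.normalize(z, dim=-1)                     # project onto the sphere
        protos = F.normalize(self.prototypes, dim=-1)
        return self.kappa * z @ protos.t()             # logits on the hypersphere


# Hypothetical usage: cross-entropy on the scaled cosine logits.
clf = SphericalPrototypeClassifier(embed_dim=64, n_classes=10)
z, labels = torch.randn(32, 64), torch.randint(0, 10, (32,))
loss = F.cross_entropy(clf(z), labels)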
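Finally, the shallow clustering baseline highlighted in the second study can be outlined with scikit-learn's hierarchical agglomerative clustering applied to L2-normalized embeddings. The synthetic data, cluster count, and linkage choice below are placeholder assumptions, not the thesis's benchmark configuration.

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import normalize

# Stand-in for pretrained embeddings with known class labels.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 64))
labels = rng.integers(0, 10, size=1000)

# L2-normalize so cosine distances respect the spherical geometry.
embeddings = normalize(embeddings)

# Shallow, unsupervised baseline: HAC with average linkage on cosine distance.
# (scikit-learn >= 1.2 names this argument `metric`; older releases use `affinity`.)
hac = AgglomerativeClustering(n_clusters=10, metric="cosine", linkage="average")
pred = hac.fit_predict(embeddings)
print("ARI:", adjusted_rand_score(labels, pred))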
