560 research outputs found
OmniDepth: Dense Depth Estimation for Indoors Spherical Panoramas.
Recent work on depth estimation up to now has only focused on projective images ignoring 360o content which is now increasingly and more easily produced. We show that monocular depth estimation models trained on traditional images produce sub-optimal results on omnidirectional images, showcasing the need for training directly on 360o datasets, which however, are hard to acquire. In this work, we circumvent the challenges associated with acquiring high quality 360o datasets with ground truth depth annotations, by re-using recently released large scale 3D datasets and re-purposing them to 360o via rendering. This dataset, which is considerably larger than similar projective datasets, is publicly offered to the community to enable future research in this direction. We use this dataset to learn in an end-to-end fashion the task of depth estimation from 360o images. We show promising results in our synthesized data as well as in unseen realistic images
BlobGAN: Spatially Disentangled Scene Representations
We propose an unsupervised, mid-level representation for a generative model
of scenes. The representation is mid-level in that it is neither per-pixel nor
per-image; rather, scenes are modeled as a collection of spatial, depth-ordered
"blobs" of features. Blobs are differentiably placed onto a feature grid that
is decoded into an image by a generative adversarial network. Due to the
spatial uniformity of blobs and the locality inherent to convolution, our
network learns to associate different blobs with different entities in a scene
and to arrange these blobs to capture scene layout. We demonstrate this
emergent behavior by showing that, despite training without any supervision,
our method enables applications such as easy manipulation of objects within a
scene (e.g., moving, removing, and restyling furniture), creation of feasible
scenes given constraints (e.g., plausible rooms with drawers at a particular
location), and parsing of real-world images into constituent parts. On a
challenging multi-category dataset of indoor scenes, BlobGAN outperforms
StyleGAN2 in image quality as measured by FID. See our project page for video
results and interactive demo: https://www.dave.ml/blobganComment: ECCV 2022. Project webpage available at https://www.dave.ml/blobga
Self Adversarial Training for Human Pose Estimation
This paper presents a deep learning based approach to the problem of human
pose estimation. We employ generative adversarial networks as our learning
paradigm in which we set up two stacked hourglass networks with the same
architecture, one as the generator and the other as the discriminator. The
generator is used as a human pose estimator after the training is done. The
discriminator distinguishes ground-truth heatmaps from generated ones, and
back-propagates the adversarial loss to the generator. This process enables the
generator to learn plausible human body configurations and is shown to be
useful for improving the prediction accuracy.Comment: CVPR 2017 Workshop on Visual Understanding of Humans in Crowd Scene
and the 1st Look Into Person (LIP) Challeng
Dynamic in Static:Hybrid Visual Correspondence for Self-Supervised Video Object Segmentation
Conventional video object segmentation (VOS) methods usually necessitate a substantial volume of pixel-level annotated video data for fully supervised learning. In this paper, we present HVC, a \textbf{h}ybrid static-dynamic \textbf{v}isual \textbf{c}orrespondence framework for self-supervised VOS. HVC extracts pseudo-dynamic signals from static images, enabling an efficient and scalable VOS model. Our approach utilizes a minimalist fully-convolutional architecture to capture static-dynamic visual correspondence in image-cropped views. To achieve this objective, we present a unified self-supervised approach to learn visual representations of static-dynamic feature similarity. Firstly, we establish static correspondence by utilizing a priori coordinate information between cropped views to guide the formation of consistent static feature representations. Subsequently, we devise a concise convolutional layer to capture the forward / backward pseudo-dynamic signals between two views, serving as cues for dynamic representations. Finally, we propose a hybrid visual correspondence loss to learn joint static and dynamic consistency representations. Our approach, without bells and whistles, necessitates only one training session using static image data, significantly reducing memory consumption (16GB) and training time (\textbf{2h}). Moreover, HVC achieves state-of-the-art performance in several self-supervised VOS benchmarks and additional video label propagation tasks
ULISSE: an unsupervised algorithm for detecting reliable dependency parses
In this paper we present ULISSE, an unsupervised linguistically--driven algorithm to select reliable parses from the output of a dependency parser. Different experiments were devised to show that the algorithm is robust enough to deal with the output of different parsers and with different languages, as well as to be used across different domains. In all cases, ULISSE appears to outperform the baseline algorithms
- …