GRASS: Generative Recursive Autoencoders for Shape Structures
We introduce a novel neural network architecture for encoding and synthesis
of 3D shapes, particularly their structures. Our key insight is that 3D shapes
are effectively characterized by their hierarchical organization of parts,
which reflects fundamental intra-shape relationships such as adjacency and
symmetry. We develop a recursive neural net (RvNN) based autoencoder to map a
flat, unlabeled, arbitrary part layout to a compact code. The code effectively
captures hierarchical structures of man-made 3D objects of varying structural
complexities despite being fixed-dimensional: an associated decoder maps a code
back to a full hierarchy. The learned bidirectional mapping is further tuned
using an adversarial setup to yield a generative model of plausible structures,
from which novel structures can be sampled. Finally, our structure synthesis
framework is augmented by a second trained module that produces fine-grained
part geometry, conditioned on global and local structural context, leading to a
full generative pipeline for 3D shapes. We demonstrate that without
supervision, our network learns meaningful structural hierarchies adhering to
perceptual grouping principles, produces compact codes which enable
applications such as shape classification and partial matching, and supports
shape synthesis and interpolation with significant variations in topology and
geometry.
Comment: Corresponding author: Kai Xu ([email protected])
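As a rough illustration of the recursive encode/decode step at the heart of an RvNN autoencoder, the sketch below merges two child part codes into one fixed-dimensional parent code and splits it back; the code size, layer widths and tanh nonlinearity are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch (not the authors' code) of the recursive merge/split idea behind
# an RvNN autoencoder: two child part codes are merged into one fixed-size parent
# code, and a decoder splits a parent code back into two child codes.
import torch
import torch.nn as nn

CODE_DIM = 80  # assumed size of the fixed-dimensional part/structure code


class MergeEncoder(nn.Module):
    """Encode two child codes into a single parent code (bottom-up)."""

    def __init__(self, code_dim: int = CODE_DIM):
        super().__init__()
        self.fc = nn.Linear(2 * code_dim, code_dim)

    def forward(self, left: torch.Tensor, right: torch.Tensor) -> torch.Tensor:
        return torch.tanh(self.fc(torch.cat([left, right], dim=-1)))


class SplitDecoder(nn.Module):
    """Decode a parent code back into two child codes (top-down)."""

    def __init__(self, code_dim: int = CODE_DIM):
        super().__init__()
        self.fc = nn.Linear(code_dim, 2 * code_dim)

    def forward(self, parent: torch.Tensor):
        left, right = torch.tanh(self.fc(parent)).chunk(2, dim=-1)
        return left, right


if __name__ == "__main__":
    enc, dec = MergeEncoder(), SplitDecoder()
    a, b = torch.randn(1, CODE_DIM), torch.randn(1, CODE_DIM)
    parent = enc(a, b)                      # merge along the part hierarchy
    left, right = dec(parent)               # split when reconstructing
    loss = ((left - a) ** 2 + (right - b) ** 2).mean()  # reconstruction objective
    print(parent.shape, loss.item())
```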
Learned Perceptual Image Enhancement
Learning a typical image enhancement pipeline involves minimization of a loss
function between enhanced and reference images. While L1 and L2 losses are
perhaps the most widely used functions for this purpose, they do not
necessarily lead to perceptually compelling results. In this paper, we show
that adding a learned no-reference image quality metric to the loss can
significantly improve enhancement operators. This metric is implemented using a
CNN (convolutional neural network) trained on a large-scale dataset labelled
with aesthetic preferences of human raters. This loss allows us to conveniently
perform back-propagation in our learning framework to simultaneously optimize
for similarity to a given ground truth reference and perceptual quality. This
perceptual loss is only used to train parameters of image processing operators,
and does not impose any extra complexity at inference time. Our experiments
demonstrate that this loss can be effective for tuning a variety of operators
such as local tone mapping and dehazing.
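The loss described above can be pictured as a pixel-wise similarity term plus a learned no-reference quality score produced by a frozen CNN. The tiny QualityCNN and the weight lambda_q in this sketch are placeholders for illustration, not the trained metric used in the paper.

```python
# Hedged sketch of a combined enhancement loss: L1 to the reference plus a
# perceptual-quality penalty from a frozen, learned no-reference scorer.
import torch
import torch.nn as nn


class QualityCNN(nn.Module):
    """Placeholder no-reference quality scorer (higher = better); an assumption."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.score = nn.Linear(32, 1)

    def forward(self, x):
        return self.score(self.features(x).flatten(1))


def enhancement_loss(enhanced, reference, quality_net, lambda_q=0.1):
    """L1 similarity to the ground truth plus a learned quality term."""
    l1 = (enhanced - reference).abs().mean()
    quality = quality_net(enhanced).mean()   # metric is frozen; gradients reach the operator only
    return l1 - lambda_q * quality


if __name__ == "__main__":
    qnet = QualityCNN().eval()
    for p in qnet.parameters():              # keep the learned metric fixed during training
        p.requires_grad_(False)
    out = torch.rand(2, 3, 64, 64, requires_grad=True)
    ref = torch.rand(2, 3, 64, 64)
    loss = enhancement_loss(out, ref, qnet)
    loss.backward()
    print(loss.item(), out.grad.shape)
```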
Context-aware CNNs for person head detection
Person detection is a key problem for many computer vision tasks. While face
detection has reached maturity, detecting people under a full variation of
camera view-points, human poses, lighting conditions and occlusions is still a
difficult challenge. In this work we focus on detecting human heads in natural
scenes. Starting from the recent local R-CNN object detector, we extend it with
two types of contextual cues. First, we leverage person-scene relations and
propose a Global CNN model trained to predict positions and scales of heads
directly from the full image. Second, we explicitly model pairwise relations
among objects and train a Pairwise CNN model using a structured-output
surrogate loss. The Local, Global and Pairwise models are combined into a joint
CNN framework. To train and test our full model, we introduce a large dataset
composed of 369,846 human heads annotated in 224,740 movie frames. We evaluate
our method and demonstrate improvements of person head detection against
several recent baselines on three datasets. We also show improvements in
detection speed provided by our model.
Comment: To appear in International Conference on Computer Vision (ICCV), 201
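One simple way to picture the combination of the local, global and pairwise cues is a weighted fusion of per-candidate scores, as in the hedged sketch below; the linear combination and the weights are assumptions for illustration rather than the paper's joint CNN formulation.

```python
# Illustrative fusion of three context cues into one detection score per head candidate.
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    local_score: float      # score from the local R-CNN detector
    global_score: float     # global position/scale prediction read off at this box
    pairwise_score: float   # aggregated pairwise compatibility with nearby candidates


def fuse_scores(cands: List[Candidate], w_local=1.0, w_global=0.5, w_pair=0.5):
    """Combine the Local, Global and Pairwise cues (weights are assumed, not learned here)."""
    return [
        w_local * c.local_score + w_global * c.global_score + w_pair * c.pairwise_score
        for c in cands
    ]


if __name__ == "__main__":
    cands = [Candidate(0.9, 0.7, 0.2), Candidate(0.4, 0.1, 0.0)]
    print(fuse_scores(cands))
```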
Fill in Fabrics: Body-Aware Self-Supervised Inpainting for Image-Based Virtual Try-On
Previous virtual try-on methods usually focus on aligning a clothing item
with a person, limiting their ability to exploit the complex pose, shape and
skin color of the person, as well as the overall structure of the clothing,
which is vital to photo-realistic virtual try-on. To address this potential
weakness, we propose a fill in fabrics (FIFA) model, a self-supervised
conditional generative adversarial network based framework comprised of a
Fabricator and a unified virtual try-on pipeline with a Segmenter, Warper and
Fuser. The Fabricator aims to reconstruct the clothing image when provided with
a masked clothing item as input, and learns the overall structure of the clothing by
filling in fabrics. A virtual try-on pipeline is then trained by transferring
the learned representations from the Fabricator to the Warper in an effort to warp
and refine the target clothing. We also propose to use a multi-scale structural
constraint to enforce global context at multiple scales while warping the
target clothing to better fit the pose and shape of the person. Extensive
experiments demonstrate that our FIFA model achieves state-of-the-art results
on the standard VITON dataset for virtual try-on of clothing items, and is
shown to be effective at handling complex poses and retaining the texture and
embroidery of the clothing.
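A possible reading of the multi-scale structural constraint is an image-pyramid loss that compares the warped clothing to the target at several resolutions, so coarse scales enforce global structure while fine scales preserve texture; the sketch below uses plain L1 terms and three scales as illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch of a multi-scale structural constraint over an image pyramid.
import torch
import torch.nn.functional as F


def multiscale_structural_loss(warped, target, num_scales=3):
    """Average L1 distance between warped and target clothing across pyramid levels."""
    loss = 0.0
    for s in range(num_scales):
        loss = loss + (warped - target).abs().mean()
        if s < num_scales - 1:
            warped = F.avg_pool2d(warped, kernel_size=2)   # move to the next coarser scale
            target = F.avg_pool2d(target, kernel_size=2)
    return loss / num_scales


if __name__ == "__main__":
    warped = torch.rand(1, 3, 128, 96, requires_grad=True)
    target = torch.rand(1, 3, 128, 96)
    loss = multiscale_structural_loss(warped, target)
    loss.backward()
    print(loss.item())
```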
Fine-Scaled 3D Geometry Recovery from Single RGB Images
3D geometry recovery from single RGB images is a highly ill-posed and inherently ambiguous problem, which has been a challenging research topic in computer vision for several decades. When fine-scaled 3D geometry is required, the problem becomes even more difficult. 3D geometry recovery from single images has the objective of recovering geometric information from a single photograph of an object or a scene with multiple objects. The geometric information to be retrieved can take different representations, such as surface meshes, voxels, depth maps or 3D primitives. In this thesis, we investigate fine-scaled 3D geometry recovery from single RGB images for three categories: facial wrinkles, indoor scenes and man-made objects. Since each category has its own particular features, styles and variations in representation, we propose different strategies to handle each type of 3D geometry estimate.

We present a lightweight non-parametric method to generate wrinkles from monocular Kinect RGB images. The key lightweight feature of the method is that it can generate plausible wrinkles using exemplars from one high-quality 3D face model with textures. The local geometric patches from the source can be copied to synthesize different wrinkles on the blendshapes of specific users in an offline stage. During online tracking, facial animations with high-quality wrinkle details can be recovered in real time as a linear combination of these personalized wrinkled blendshapes.

We propose a fast-to-train two-streamed CNN with multiple scales, which predicts both a dense depth map and depth gradients for single indoor scene images. The depth and depth gradients are then fused into a more accurate and detailed depth map. We introduce a novel set loss over multiple related images: by regularizing the estimation across a common set of images, the network is less prone to overfitting and achieves better accuracy than competing methods. A fine-scaled 3D point cloud can then be produced by re-projection to 3D using the known camera parameters.

To handle highly structured man-made objects, we introduce a novel neural network architecture for 3D shape recovery from a single image. We develop a convolutional encoder to map a given image to a compact code. An associated recursive decoder then maps this code back to a full hierarchy, resulting in a set of bounding boxes that represent the estimated shape. Finally, we train a second network to predict the fine-scaled geometry in each bounding box at the voxel level. The per-box volumes are then embedded into a global volume, from which we reconstruct the final meshed model.

Experiments on a variety of datasets show that our approaches can successfully estimate fine-scaled geometry from single RGB images for each category, and surpass state-of-the-art performance in recovering faithful 3D local details with high-resolution mesh surfaces or point clouds.
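To make the set-loss idea concrete, the sketch below regularizes predicted depth maps of a set of related images so that their pairwise differences stay consistent with those of the ground truths, in addition to a per-image error; the exact formulation in the thesis may differ, and the weight lambda_set is an assumption.

```python
# Hedged sketch of a "set loss" over related images: per-image depth error plus
# a pairwise consistency term within the set.
import torch


def set_loss(pred_depths, gt_depths, lambda_set=0.5):
    """pred_depths, gt_depths: tensors of shape (N, H, W) for N related images."""
    per_image = (pred_depths - gt_depths).abs().mean()
    pd = pred_depths.unsqueeze(0) - pred_depths.unsqueeze(1)   # (N, N, H, W) prediction differences
    gd = gt_depths.unsqueeze(0) - gt_depths.unsqueeze(1)       # (N, N, H, W) ground-truth differences
    consistency = (pd - gd).abs().mean()
    return per_image + lambda_set * consistency


if __name__ == "__main__":
    pred = torch.rand(4, 32, 32, requires_grad=True)
    gt = torch.rand(4, 32, 32)
    loss = set_loss(pred, gt)
    loss.backward()
    print(loss.item())
```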
Semantic Object Prediction and Spatial Sound Super-Resolution with Binaural Sounds
Humans can robustly recognize and localize objects by integrating visual and
auditory cues. While machines are able to do the same now with images, less
work has been done with sounds. This work develops an approach for dense
semantic labelling of sound-making objects, purely based on binaural sounds. We
propose a novel sensor setup and record a new audio-visual dataset of street
scenes with eight professional binaural microphones and a 360 degree camera.
The co-existence of visual and audio cues is leveraged for supervision
transfer. In particular, we employ a cross-modal distillation framework that
consists of a vision `teacher' method and a sound `student' method -- the
student method is trained to generate the same results as the teacher method.
This way, the auditory system can be trained without using human annotations.
We also propose two auxiliary tasks, namely a) a novel task of Spatial Sound
Super-resolution to increase the spatial resolution of sounds, and b) dense
depth prediction of the scene. We then formulate the three tasks into one
end-to-end trainable multi-tasking network aiming to boost the overall
performance. Experimental results on the dataset show that 1) our method
achieves promising results for semantic prediction and the two auxiliary tasks;
and 2) the three tasks are mutually beneficial -- training them together
achieves the best performance; and 3) the number and orientations of microphones
are both important. The data and code will be released to facilitate the
research in this new direction.
Comment: Project page: https://www.trace.ethz.ch/publications/2020/sound_perception/index.htm
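The cross-modal distillation setup can be sketched as a frozen vision teacher providing soft per-pixel labels that a small sound student learns to reproduce from binaural spectrograms; the toy network, spectrogram shape and number of classes below are assumptions used only to illustrate the idea.

```python
# Hedged sketch of cross-modal distillation: the sound "student" is trained to match
# the vision "teacher's" soft semantic labels, so no human annotation is required.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES = 3  # assumed number of sound-making object classes


class SoundStudent(nn.Module):
    """Maps a binaural spectrogram to a dense per-pixel class map (toy architecture)."""

    def __init__(self, in_channels=2, num_classes=NUM_CLASSES):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, num_classes, 1),
        )

    def forward(self, spec):
        return self.net(spec)


def distillation_loss(student_logits, teacher_probs):
    """Cross-entropy between student predictions and the teacher's soft labels."""
    log_p = F.log_softmax(student_logits, dim=1)
    return -(teacher_probs * log_p).sum(dim=1).mean()


if __name__ == "__main__":
    student = SoundStudent()
    spec = torch.rand(1, 2, 64, 64)                        # binaural spectrogram (assumed shape)
    teacher_probs = torch.softmax(torch.rand(1, NUM_CLASSES, 64, 64), dim=1)
    loss = distillation_loss(student(spec), teacher_probs)
    loss.backward()
    print(loss.item())
```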
Towards Smarter Fluorescence Microscopy: Enabling Adaptive Acquisition Strategies With Optimized Photon Budget
Fluorescence microscopy is an invaluable technique for studying the intricate process of organism development. The acquisition process, however, is associated with a fundamental trade-off between the quality and reliability of the acquired data. On one hand, the goal of capturing the development in its entirety, often across multiple spatial and temporal scales, requires extended acquisition periods. On the other hand, the high doses of light required for such experiments are harmful to living samples and can introduce non-physiological artifacts in the normal course of development. Conventionally, a single set of acquisition parameters is chosen at the beginning of the acquisition and constitutes the experimenter’s best guess of the overall optimal configuration within the aforementioned trade-off. In the paradigm of adaptive microscopy, in turn, one aims at achieving a more efficient photon budget distribution by dynamically adjusting the acquisition parameters to the changing properties of the sample. In this thesis, I explore the principles of adaptive microscopy and propose a range of improvements for two real imaging scenarios.
Chapter 2 summarizes the design and implementation of an adaptive pipeline for efficient observation of the asymmetrically dividing neurogenic progenitors in the Zebrafish retina. In the described approach, the fast and expensive acquisition mode is automatically activated only when mitotic cells are present in the field of view. The method illustrates the benefits of adaptive acquisition in the common scenario where individual events of interest are sparsely distributed throughout the acquisition.
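As a schematic of the event-triggered behaviour described above, the sketch below keeps the microscope in a gentle slow mode and switches to a fast mode for a few timepoints whenever a (hypothetical) detector flags mitotic cells; all function names are placeholders, not the actual pipeline.

```python
# Hedged, schematic sketch of event-triggered adaptive acquisition.
import random
import time


def detect_mitotic_cells(image) -> bool:
    """Hypothetical detector; in practice this would be a trained classifier."""
    return random.random() < 0.1


def acquire(mode: str):
    """Hypothetical acquisition call returning an image for the given mode."""
    time.sleep(0.01)
    return f"frame({mode})"


def adaptive_loop(num_timepoints=20, fast_duration=3):
    fast_remaining = 0
    for t in range(num_timepoints):
        mode = "fast" if fast_remaining > 0 else "slow"
        image = acquire(mode)
        if mode == "slow" and detect_mitotic_cells(image):
            fast_remaining = fast_duration   # switch to fast mode for a few timepoints
        else:
            fast_remaining = max(0, fast_remaining - 1)
        print(t, mode)


if __name__ == "__main__":
    adaptive_loop()
```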
Chapter 3 focuses on computational aspects of segmentation-based adaptive schemes for efficient acquisition of the developing Drosophila pupal wing. Fast sample segmentation is shown to provide a valuable output for the accurate evaluation of the sample morphology and dynamics in real time. This knowledge proves instrumental for adjusting the acquisition parameters to the current properties of the sample and reducing the required photon budget with minimal effect on the quality of the acquired data.
Chapter 4 addresses the generation of synthetic training data for learning-based methods in bioimage analysis, making them more practical and accessible for smart microscopy pipelines. State-of-the-art deep learning models trained exclusively on the generated synthetic data are shown to yield powerful predictions when applied to real microscopy images. Finally, an in-depth evaluation of the segmentation quality of models trained on real and on synthetic data illustrates important practical aspects of the approach and outlines directions for further research.