    GRASS: Generative Recursive Autoencoders for Shape Structures

    We introduce a novel neural network architecture for encoding and synthesis of 3D shapes, particularly their structures. Our key insight is that 3D shapes are effectively characterized by their hierarchical organization of parts, which reflects fundamental intra-shape relationships such as adjacency and symmetry. We develop a recursive neural net (RvNN) based autoencoder to map a flat, unlabeled, arbitrary part layout to a compact code. The code effectively captures hierarchical structures of man-made 3D objects of varying structural complexities despite being fixed-dimensional: an associated decoder maps a code back to a full hierarchy. The learned bidirectional mapping is further tuned using an adversarial setup to yield a generative model of plausible structures, from which novel structures can be sampled. Finally, our structure synthesis framework is augmented by a second trained module that produces fine-grained part geometry, conditioned on global and local structural context, leading to a full generative pipeline for 3D shapes. We demonstrate that without supervision, our network learns meaningful structural hierarchies adhering to perceptual grouping principles, produces compact codes which enable applications such as shape classification and partial matching, and supports shape synthesis and interpolation with significant variations in topology and geometry. Comment: Corresponding author: Kai Xu.
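    The abstract does not include an implementation, but the core recursive encode/decode idea can be sketched as follows. This is a minimal, hedged sketch assuming a PyTorch setting; the module names, the 128-dimensional code, the 12-dimensional per-part box features and the toy hierarchy are illustrative assumptions rather than the authors' released GRASS code.

```python
import torch
import torch.nn as nn

CODE_DIM = 128  # assumed size of the fixed-dimensional shape code

class MergeEncoder(nn.Module):
    """Merges two child part codes into one parent code (bottom-up pass)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * CODE_DIM, CODE_DIM), nn.Tanh())

    def forward(self, left, right):
        return self.net(torch.cat([left, right], dim=-1))

class SplitDecoder(nn.Module):
    """Splits a parent code back into two child codes (top-down pass)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(CODE_DIM, 2 * CODE_DIM), nn.Tanh())

    def forward(self, parent):
        left, right = self.net(parent).chunk(2, dim=-1)
        return left, right

def encode(node, leaf_enc, merge):
    """Recursively collapse a (left, right) tuple tree of per-part features into one code."""
    if isinstance(node, tuple):
        return merge(encode(node[0], leaf_enc, merge), encode(node[1], leaf_enc, merge))
    return leaf_enc(node)  # leaf: raw per-part feature vector

# Toy usage with two leaf parts described by 12-D box parameters (an assumption)
leaf_enc = nn.Sequential(nn.Linear(12, CODE_DIM), nn.Tanh())
merge, split = MergeEncoder(), SplitDecoder()
tree = (torch.randn(12), torch.randn(12))
root_code = encode(tree, leaf_enc, merge)   # fixed-dimensional code for the part layout
child_a, child_b = split(root_code)         # decoder unfolds the hierarchy again
print(root_code.shape, child_a.shape)
```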

    Learned Perceptual Image Enhancement

    Learning a typical image enhancement pipeline involves minimization of a loss function between enhanced and reference images. While L1 and L2 losses are perhaps the most widely used functions for this purpose, they do not necessarily lead to perceptually compelling results. In this paper, we show that adding a learned no-reference image quality metric to the loss can significantly improve enhancement operators. This metric is implemented using a CNN (convolutional neural network) trained on a large-scale dataset labelled with aesthetic preferences of human raters. This loss allows us to conveniently perform back-propagation in our learning framework to simultaneously optimize for similarity to a given ground truth reference and perceptual quality. This perceptual loss is only used to train parameters of image processing operators, and does not impose any extra complexity at inference time. Our experiments demonstrate that this loss can be effective for tuning a variety of operators such as local tone mapping and dehazing.
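    As a rough illustration of the described training objective, the sketch below combines a pixel-wise L1 term with a learned no-reference quality score whose gradients flow only into the enhancement output. The tiny QualityCNN, the weight w and the tensor shapes are placeholder assumptions, not the paper's trained aesthetic model.

```python
import torch
import torch.nn as nn

class QualityCNN(nn.Module):
    """Stand-in for a no-reference quality scorer trained on human aesthetic ratings."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.score = nn.Linear(32, 1)

    def forward(self, img):
        return self.score(self.features(img).flatten(1))  # higher = better (assumed)

def enhancement_loss(enhanced, reference, quality_net, w=0.1):
    """Pixel similarity to the reference plus a perceptual penalty from the frozen scorer."""
    pixel = nn.functional.l1_loss(enhanced, reference)
    perceptual = -quality_net(enhanced).mean()  # encourage higher predicted quality
    return pixel + w * perceptual

quality_net = QualityCNN().eval()
for p in quality_net.parameters():     # the scorer only shapes gradients of the operator output
    p.requires_grad_(False)

enhanced = torch.rand(2, 3, 64, 64, requires_grad=True)   # operator output (toy)
reference = torch.rand(2, 3, 64, 64)                       # ground-truth retouched image (toy)
enhancement_loss(enhanced, reference, quality_net).backward()
```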

    Context-aware CNNs for person head detection

    Person detection is a key problem for many computer vision tasks. While face detection has reached maturity, detecting people under a full variation of camera view-points, human poses, lighting conditions and occlusions is still a difficult challenge. In this work we focus on detecting human heads in natural scenes. Starting from the recent local R-CNN object detector, we extend it with two types of contextual cues. First, we leverage person-scene relations and propose a Global CNN model trained to predict positions and scales of heads directly from the full image. Second, we explicitly model pairwise relations among objects and train a Pairwise CNN model using a structured-output surrogate loss. The Local, Global and Pairwise models are combined into a joint CNN framework. To train and test our full model, we introduce a large dataset composed of 369,846 human heads annotated in 224,740 movie frames. We evaluate our method and demonstrate improvements in person head detection over several recent baselines on three datasets. We also show improvements in detection speed provided by our model. Comment: To appear in International Conference on Computer Vision (ICCV), 201
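    The abstract describes combining Local, Global and Pairwise scores into one framework; the fragment below shows only the simplest piece of that idea, a weighted fusion of a candidate's local score with a value read from a global full-image score map. The dataclass, the fusion weights and the random heatmap are hypothetical illustrations, not the paper's joint model.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Candidate:
    x: float            # box center, normalized to [0, 1]
    y: float
    local_score: float  # score from the local R-CNN-style detector

def global_score(heatmap: np.ndarray, cand: Candidate) -> float:
    """Look up the full-image model's predicted head likelihood at the candidate center."""
    h, w = heatmap.shape
    i = min(int(cand.y * h), h - 1)
    j = min(int(cand.x * w), w - 1)
    return float(heatmap[i, j])

def fused_score(cand: Candidate, heatmap: np.ndarray,
                w_local: float = 1.0, w_global: float = 0.5) -> float:
    """Linear combination of local and global evidence (weights are assumptions)."""
    return w_local * cand.local_score + w_global * global_score(heatmap, cand)

heatmap = np.random.rand(32, 32)   # stand-in for the Global CNN output
cand = Candidate(x=0.4, y=0.3, local_score=0.8)
print(fused_score(cand, heatmap))
```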

    Fill in Fabrics: Body-Aware Self-Supervised Inpainting for Image-Based Virtual Try-On

    Previous virtual try-on methods usually focus on aligning a clothing item with a person, limiting their ability to exploit the complex pose, shape and skin color of the person, as well as the overall structure of the clothing, which is vital to photo-realistic virtual try-on. To address this potential weakness, we propose a Fill in Fabrics (FIFA) model, a self-supervised conditional generative adversarial network based framework comprising a Fabricator and a unified virtual try-on pipeline with a Segmenter, Warper and Fuser. The Fabricator aims to reconstruct the clothing image when provided with a masked clothing item as input, and learns the overall structure of the clothing by filling in fabrics. A virtual try-on pipeline is then trained by transferring the learned representations from the Fabricator to the Warper in an effort to warp and refine the target clothing. We also propose to use a multi-scale structural constraint to enforce global context at multiple scales while warping the target clothing to better fit the pose and shape of the person. Extensive experiments demonstrate that our FIFA model achieves state-of-the-art results on the standard VITON dataset for virtual try-on of clothing items, and is shown to be effective at handling complex poses and retaining the texture and embroidery of the clothing.
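    The self-supervised part of the pipeline, the Fabricator, can be illustrated with a toy masked-reconstruction setup: hide a region of the clothing image and train a network to fill it back in. The small convolutional net, the square mask and the loss weighting below are illustrative assumptions, not the FIFA architecture itself.

```python
import torch
import torch.nn as nn

class Fabricator(nn.Module):
    """Toy encoder-decoder that fills in the hidden fabric of a masked clothing image."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(),   # 3 image channels + 1 mask channel
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1),
        )

    def forward(self, masked_img, mask):
        return self.net(torch.cat([masked_img, mask], dim=1))

def random_square_mask(b, h, w, size=24):
    """Mask of ones with a random hidden square of zeros per sample."""
    mask = torch.ones(b, 1, h, w)
    for k in range(b):
        top = torch.randint(0, h - size, (1,)).item()
        left = torch.randint(0, w - size, (1,)).item()
        mask[k, :, top:top + size, left:left + size] = 0.0
    return mask

clothing = torch.rand(2, 3, 64, 64)          # toy clothing images
mask = random_square_mask(2, 64, 64)
fab = Fabricator()
recon = fab(clothing * mask, mask)
# Self-supervision: penalize reconstruction error only where the fabric was hidden
loss = ((recon - clothing) * (1 - mask)).abs().mean()
loss.backward()
```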

    Fine-Scaled 3D Geometry Recovery from Single RGB Images

    3D geometry recovery from single RGB images is a highly ill-posed and inherently ambiguous problem, which has been a challenging research topic in computer vision for several decades. When fine-scaled 3D geometry is required, the problem becomes even more difficult. 3D geometry recovery from single images has the objective of recovering geometric information from a single photograph of an object or a scene with multiple objects. The geometric information to be retrieved can take different representations, such as surface meshes, voxels, depth maps or 3D primitives. In this thesis, we investigate fine-scaled 3D geometry recovery from single RGB images for three categories: facial wrinkles, indoor scenes and man-made objects. Since each category has its own particular features, styles and variations in representation, we propose a different strategy for each type of 3D geometry estimate. We present a lightweight non-parametric method to generate wrinkles from monocular Kinect RGB images. The key lightweight feature of the method is that it can generate plausible wrinkles using exemplars from one high-quality 3D face model with textures. The local geometric patches from the source can be copied to synthesize different wrinkles on the blendshapes of specific users in an offline stage. During online tracking, facial animations with high-quality wrinkle details can be recovered in real time as a linear combination of these personalized wrinkled blendshapes. We propose a fast-to-train two-streamed CNN with multiple scales, which predicts both a dense depth map and depth gradients for single indoor scene images. The depth and depth gradients are then fused into a more accurate and detailed depth map. We introduce a novel set loss over multiple related images. By regularizing the estimation between a common set of images, the network is less prone to overfitting and achieves better accuracy than competing methods. A fine-scaled 3D point cloud can be produced by re-projection to 3D using the known camera parameters. To handle highly structured man-made objects, we introduce a novel neural network architecture for 3D shape recovery from a single image. We develop a convolutional encoder to map a given image to a compact code. An associated recursive decoder then maps this code back to a full hierarchy, resulting in a set of bounding boxes that represent the estimated shape. Finally, we train a second network to predict the fine-scaled geometry in each bounding box at the voxel level. The per-box volumes are then embedded into a global volume, from which we reconstruct the final meshed model. Experiments on a variety of datasets show that our approaches can successfully estimate fine-scaled geometry from single RGB images for each category, and surpass state-of-the-art performance in recovering faithful 3D local details with high-resolution mesh surfaces or point clouds.
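    One way to read the fusion step of the indoor-scene chapter is as an optimization that seeks a depth map agreeing with both the predicted depth and the predicted depth gradients. The sketch below implements that reading with simple gradient descent; the forward-difference gradients, the weight lam and the optimizer settings are assumptions, not necessarily the thesis' exact fusion module.

```python
import torch

def spatial_gradients(d):
    """Forward-difference gradients of a depth map of shape (H, W)."""
    gx = d[:, 1:] - d[:, :-1]
    gy = d[1:, :] - d[:-1, :]
    return gx, gy

def fuse(d_pred, gx_pred, gy_pred, lam=1.0, steps=200, lr=0.1):
    """Find D minimizing ||D - d_pred||^2 + lam * ||grad(D) - g_pred||^2."""
    d = d_pred.clone().requires_grad_(True)
    opt = torch.optim.Adam([d], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        gx, gy = spatial_gradients(d)
        loss = ((d - d_pred) ** 2).mean() \
             + lam * (((gx - gx_pred) ** 2).mean() + ((gy - gy_pred) ** 2).mean())
        loss.backward()
        opt.step()
    return d.detach()

H, W = 32, 32
d_pred = torch.rand(H, W)            # coarse depth from the depth stream (toy)
gx_pred = torch.zeros(H, W - 1)      # gradients from the gradient stream (toy)
gy_pred = torch.zeros(H - 1, W)
fused = fuse(d_pred, gx_pred, gy_pred)
print(fused.shape)
```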

    Semantic Object Prediction and Spatial Sound Super-Resolution with Binaural Sounds

    Humans can robustly recognize and localize objects by integrating visual and auditory cues. While machines are now able to do the same with images, less work has been done with sounds. This work develops an approach for dense semantic labelling of sound-making objects, purely based on binaural sounds. We propose a novel sensor setup and record a new audio-visual dataset of street scenes with eight professional binaural microphones and a 360 degree camera. The co-existence of visual and audio cues is leveraged for supervision transfer. In particular, we employ a cross-modal distillation framework that consists of a vision 'teacher' method and a sound 'student' method -- the student method is trained to generate the same results as the teacher method. This way, the auditory system can be trained without using human annotations. We also propose two auxiliary tasks, namely a) a novel Spatial Sound Super-resolution task to increase the spatial resolution of sounds, and b) dense depth prediction of the scene. We then formulate the three tasks into one end-to-end trainable multi-tasking network aiming to boost the overall performance. Experimental results on the dataset show that 1) our method achieves promising results for semantic prediction and the two auxiliary tasks; 2) the three tasks are mutually beneficial, as training them together achieves the best performance; and 3) the number and orientations of the microphones are both important. The data and code will be released to facilitate research in this new direction. Comment: Project page: https://www.trace.ethz.ch/publications/2020/sound_perception/index.htm
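    The cross-modal distillation described here boils down to matching the audio student's per-pixel class distribution to the vision teacher's. A minimal sketch, assuming PyTorch and placeholder networks (the one-layer teacher, the 8-channel spectrogram input and the five classes are all illustrative assumptions), looks like this:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES = 5

teacher = nn.Conv2d(3, NUM_CLASSES, 1)          # stand-in vision segmenter ("teacher")
student = nn.Sequential(                        # stand-in binaural-audio net ("student")
    nn.Conv2d(8, 32, 3, padding=1), nn.ReLU(),  # 8 channels ~ 8 microphones (assumption)
    nn.Conv2d(32, NUM_CLASSES, 1),
)

image = torch.rand(2, 3, 64, 64)                # frame from the camera (toy)
spectrograms = torch.rand(2, 8, 64, 64)         # time-frequency audio features (toy)

with torch.no_grad():
    teacher_probs = F.softmax(teacher(image), dim=1)   # soft targets, no human labels

student_logprobs = F.log_softmax(student(spectrograms), dim=1)
distill_loss = F.kl_div(student_logprobs, teacher_probs, reduction="batchmean")
distill_loss.backward()
```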

    Towards Smarter Fluorescence Microscopy: Enabling Adaptive Acquisition Strategies With Optimized Photon Budget

    Fluorescence microscopy is an invaluable technique for studying the intricate process of organism development. The acquisition process, however, is associated with a fundamental trade-off between the quality and reliability of the acquired data. On the one hand, the goal of capturing development in its entirety, often across multiple spatial and temporal scales, requires extended acquisition periods. On the other hand, the high doses of light required for such experiments are harmful to living samples and can introduce non-physiological artifacts in the normal course of development. Conventionally, a single set of acquisition parameters is chosen at the beginning of the acquisition and constitutes the experimenter's best guess of the overall optimal configuration within the aforementioned trade-off. In the paradigm of adaptive microscopy, in turn, one aims at achieving a more efficient photon budget distribution by dynamically adjusting the acquisition parameters to the changing properties of the sample. In this thesis, I explore the principles of adaptive microscopy and propose a range of improvements for two real imaging scenarios. Chapter 2 summarizes the design and implementation of an adaptive pipeline for efficient observation of the asymmetrically dividing neurogenic progenitors in the zebrafish retina. In the described approach, the fast and expensive acquisition mode is automatically activated only when mitotic cells are present in the field of view. The method illustrates the benefits of adaptive acquisition in the common scenario where the individual events of interest are sparsely distributed throughout the duration of the acquisition. Chapter 3 focuses on computational aspects of segmentation-based adaptive schemes for efficient acquisition of the developing Drosophila pupal wing. Fast sample segmentation is shown to provide a valuable output for accurately evaluating the sample's morphology and dynamics in real time. This knowledge proves instrumental for adjusting the acquisition parameters to the current properties of the sample and reducing the required photon budget with minimal effect on the quality of the acquired data. Chapter 4 addresses the generation of synthetic training data for learning-based methods in bioimage analysis, making them more practical and accessible for smart microscopy pipelines. State-of-the-art deep learning models trained exclusively on the generated synthetic data are shown to yield powerful predictions when applied to real microscopy images. Finally, an in-depth evaluation of the segmentation quality of models based on both real and synthetic data illustrates the important practical aspects of the approach and outlines directions for further research.
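    The adaptive scheme of Chapter 2 can be summarized as a simple event-driven loop: image cheaply by default and switch to the fast, expensive mode only while events of interest are detected. The sketch below uses a hypothetical microscope interface and a random stand-in detector; it is meant to convey the control flow, not an actual instrument API.

```python
import random
import time

def acquire_frame(fast: bool):
    """Placeholder for triggering a slow (low-dose) or fast (high-dose) acquisition."""
    return {"mode": "fast" if fast else "slow", "data": None}

def mitotic_cells_present(frame) -> bool:
    """Placeholder detector; in practice a segmentation or classification model."""
    return random.random() < 0.1

def adaptive_loop(n_timepoints=20, slow_interval=5.0, fast_interval=1.0):
    fast = False
    for t in range(n_timepoints):
        frame = acquire_frame(fast)
        fast = mitotic_cells_present(frame)   # mode of the next frame depends on detections
        print(f"t={t:02d} acquired in {frame['mode']} mode")
        time.sleep(fast_interval if fast else slow_interval)

if __name__ == "__main__":
    adaptive_loop(n_timepoints=5, slow_interval=0.0, fast_interval=0.0)
```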