20 research outputs found
A Framework for Symmetric Part Detection in Cluttered Scenes
The role of symmetry in computer vision has waxed and waned in importance
during the evolution of the field from its earliest days. At first figuring
prominently in support of bottom-up indexing, it fell out of favor as shape
gave way to appearance and recognition gave way to detection. With a strong
prior in the form of a target object, the role of the weaker priors offered by
perceptual grouping was greatly diminished. However, as the field returns to
the problem of recognition from a large database, the bottom-up recovery of the
parts that make up the objects in a cluttered scene is critical for their
recognition. The medial axis community has long exploited the ubiquitous
regularity of symmetry as a basis for the decomposition of a closed contour
into medial parts. However, today's recognition systems are faced with
cluttered scenes, and the assumption that a closed contour exists, i.e. that
figure-ground segmentation has been solved, renders much of the medial axis
community's work inapplicable. In this article, we review a computational
framework, previously reported in Lee et al. (2013), Levinshtein et al. (2009,
2013), that bridges the representation power of the medial axis and the need to
recover and group an object's parts in a cluttered scene. Our framework is
rooted in the idea that a maximally inscribed disc, the building block of a
medial axis, can be modeled as a compact superpixel in the image. We evaluate
the method on images of cluttered scenes.Comment: 10 pages, 8 figure
Real-time deep hair matting on mobile devices
Augmented reality is an emerging technology in many application domains.
Among them is the beauty industry, where live virtual try-on of beauty products
is of great importance. In this paper, we address the problem of live hair
color augmentation. To achieve this goal, hair needs to be segmented quickly
and accurately. We show how a modified MobileNet CNN architecture can be used
to segment the hair in real-time. Instead of training this network using large
amounts of accurate segmentation data, which is difficult to obtain, we use
crowd sourced hair segmentation data. While such data is much simpler to
obtain, the segmentations there are noisy and coarse. Despite this, we show how
our system can produce accurate and fine-detailed hair mattes, while running at
over 30 fps on an iPad Pro tablet.Comment: 7 pages, 7 figures, submitted to CRV 201
Dual-Camera Joint Deblurring-Denoising
Recent image enhancement methods have shown the advantages of using a pair of
long and short-exposure images for low-light photography. These image
modalities offer complementary strengths and weaknesses. The former yields an
image that is clean but blurry due to camera or object motion, whereas the
latter is sharp but noisy due to low photon count. Motivated by the fact that
modern smartphones come equipped with multiple rear-facing camera sensors, we
propose a novel dual-camera method for obtaining a high-quality image. Our
method uses a synchronized burst of short exposure images captured by one
camera and a long exposure image simultaneously captured by another. Having a
synchronized short exposure burst alongside the long exposure image enables us
to (i) obtain better denoising by using a burst instead of a single image, (ii)
recover motion from the burst and use it for motion-aware deblurring of the
long exposure image, and (iii) fuse the two results to further enhance quality.
Our method is able to achieve state-of-the-art results on synthetic dual-camera
images from the GoPro dataset with five times fewer training parameters
compared to the next best method. We also show that our method qualitatively
outperforms competing approaches on real synchronized dual-camera captures.Comment: Project webpage:
http://shekshaa.github.io/Joint-Deblurring-Denoising
Watch Your Steps: Local Image and Scene Editing by Text Instructions
Denoising diffusion models have enabled high-quality image generation and
editing. We present a method to localize the desired edit region implicit in a
text instruction. We leverage InstructPix2Pix (IP2P) and identify the
discrepancy between IP2P predictions with and without the instruction. This
discrepancy is referred to as the relevance map. The relevance map conveys the
importance of changing each pixel to achieve the edits, and is used to to guide
the modifications. This guidance ensures that the irrelevant pixels remain
unchanged. Relevance maps are further used to enhance the quality of
text-guided editing of 3D scenes in the form of neural radiance fields. A field
is trained on relevance maps of training views, denoted as the relevance field,
defining the 3D region within which modifications should be made. We perform
iterative updates on the training views guided by rendered relevance maps from
the relevance field. Our method achieves state-of-the-art performance on both
image and NeRF editing tasks. Project page:
https://ashmrz.github.io/WatchYourSteps/Comment: Project page: https://ashmrz.github.io/WatchYourSteps
Reconstructive Latent-Space Neural Radiance Fields for Efficient 3D Scene Representations
Neural Radiance Fields (NeRFs) have proven to be powerful 3D representations,
capable of high quality novel view synthesis of complex scenes. While NeRFs
have been applied to graphics, vision, and robotics, problems with slow
rendering speed and characteristic visual artifacts prevent adoption in many
use cases. In this work, we investigate combining an autoencoder (AE) with a
NeRF, in which latent features (instead of colours) are rendered and then
convolutionally decoded. The resulting latent-space NeRF can produce novel
views with higher quality than standard colour-space NeRFs, as the AE can
correct certain visual artifacts, while rendering over three times faster. Our
work is orthogonal to other techniques for improving NeRF efficiency. Further,
we can control the tradeoff between efficiency and image quality by shrinking
the AE architecture, achieving over 13 times faster rendering with only a small
drop in performance. We hope that our approach can form the basis of an
efficient, yet high-fidelity, 3D scene representation for downstream tasks,
especially when retaining differentiability is useful, as in many robotics
scenarios requiring continual learning
SPIn-NeRF: Multiview Segmentation and Perceptual Inpainting with Neural Radiance Fields
Neural Radiance Fields (NeRFs) have emerged as a popular approach for novel
view synthesis. While NeRFs are quickly being adapted for a wider set of
applications, intuitively editing NeRF scenes is still an open challenge. One
important editing task is the removal of unwanted objects from a 3D scene, such
that the replaced region is visually plausible and consistent with its context.
We refer to this task as 3D inpainting. In 3D, solutions must be both
consistent across multiple views and geometrically valid. In this paper, we
propose a novel 3D inpainting method that addresses these challenges. Given a
small set of posed images and sparse annotations in a single input image, our
framework first rapidly obtains a 3D segmentation mask for a target object.
Using the mask, a perceptual optimizationbased approach is then introduced that
leverages learned 2D image inpainters, distilling their information into 3D
space, while ensuring view consistency. We also address the lack of a diverse
benchmark for evaluating 3D scene inpainting methods by introducing a dataset
comprised of challenging real-world scenes. In particular, our dataset contains
views of the same scene with and without a target object, enabling more
principled benchmarking of the 3D inpainting task. We first demonstrate the
superiority of our approach on multiview segmentation, comparing to NeRFbased
methods and 2D segmentation approaches. We then evaluate on the task of 3D
inpainting, establishing state-ofthe-art performance against other NeRF
manipulation algorithms, as well as a strong 2D image inpainter baselineComment: Project Page: https://spinnerf3d.github.i
Low and Mid-level Shape Priors for Image Segmentation
Perceptual grouping is essential to manage the complexity of real world scenes. We explore bottom-up grouping at three different levels. Starting from low-level grouping, we propose a novel method for oversegmenting an image into compact superpixels, reducing the complexity of many high-level tasks. Unlike most low-level segmentation techniques, our geometric flow formulation enables us to impose additional compactness constraints, resulting in a fast method with minimal undersegmentation. Our subsequent work utilizes compact superpixels to detect two important mid-level shape regularities, closure and symmetry. Unlike the majority of closure detection approaches, we transform the closure detection problem into one of finding a subset of superpixels whose collective boundary has strong edge support in the image. Building on superpixels, we define a closure cost which is a ratio of a novel learned boundary gap measure to area, and show how it can be globally minimized to recover a small set of promising shape hypotheses. In our final contribution, motivated by the success of shape skeletons, we recover and group symmetric parts without assuming prior figure-ground segmentation. Further exploiting superpixel compactness, superpixels are this time used as an approximation to deformable maximal discs that comprise a medial axis. A learned measure of affinity between neighboring superpixels and between symmetric parts enables the purely bottom-up recovery of a skeleton-like structure, facilitating indexing and generic object recognition in complex real images.Ph