ImageSpirit: Verbal Guided Image Parsing
Humans describe images in terms of nouns and adjectives while algorithms
operate on images represented as sets of pixels. Bridging this gap between how
humans would like to access images versus their typical representation is the
goal of image parsing, which involves assigning object and attribute labels to
each pixel. In this paper we propose treating nouns as object labels and adjectives
as visual attribute labels. This allows us to formulate the image parsing
problem as one of jointly estimating per-pixel object and attribute labels from
a set of training images. We propose an efficient (interactive time) solution.
Using the extracted labels as handles, our system empowers a user to verbally
refine the results. This enables hands-free parsing of an image into pixel-wise
object/attribute labels that correspond to human semantics. Verbally selecting
objects of interest enables a novel and natural interaction modality that could
be used to interact with new-generation devices (e.g., smartphones,
Google Glass, living room devices). We demonstrate our system on a large number
of real-world images with varying complexity. To help understand the tradeoffs
compared to traditional mouse based interactions, results are reported for both
a large-scale quantitative evaluation and a user study.
Comment: http://mmcheng.net/imagespirit
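As an illustration of the joint object/attribute formulation, here is a minimal sketch that combines hypothetical per-pixel object and attribute scores with an object-attribute compatibility table; the paper estimates both label sets jointly with an efficient inference scheme, not the independent per-pixel decision shown here, and every name below is an assumption.

```python
# Minimal sketch: per-pixel object labels plus per-pixel attribute sets
# (hypothetical inputs; not the paper's joint inference).
import numpy as np

def parse_image(obj_scores, attr_scores, compat, attr_thresh=0.5):
    """obj_scores:  (H, W, O) per-pixel object scores.
    attr_scores: (H, W, A) per-pixel attribute scores in [0, 1].
    compat:      (O, A) object-attribute compatibility in [0, 1].
    """
    obj_labels = obj_scores.argmax(axis=-1)      # one object (noun) per pixel
    # Down-weight attributes that are implausible for the chosen object,
    # e.g. suppress "wooden" on pixels labeled "sky".
    weighted = attr_scores * compat[obj_labels]  # (H, W, A)
    attr_labels = weighted > attr_thresh         # several adjectives per pixel
    return obj_labels, attr_labels
```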
FactorMatte: Redefining Video Matting for Re-Composition Tasks
We propose "factor matting", an alternative formulation of the video matting
problem in terms of counterfactual video synthesis that is better suited for
re-composition tasks. The goal of factor matting is to separate the contents of
video into independent components, each visualizing a counterfactual version of
the scene where contents of other components have been removed. We show that
factor matting maps well to a more general Bayesian framing of the matting
problem that accounts for complex conditional interactions between layers.
Based on this observation, we present a method for solving the factor matting
problem that produces useful decompositions even for video with complex
cross-layer interactions like splashes, shadows, and reflections. Our method is
trained per video and requires neither pre-training on large external datasets
nor knowledge about the 3D structure of the scene. We conduct extensive
experiments and show that our method not only disentangles scenes with
complex interactions but also outperforms top methods on existing tasks such
as classical video matting and background subtraction. In addition, we
demonstrate the benefits of our approach on a range of downstream tasks. Please
refer to our project webpage for more details: https://factormatte.github.io
Comment: Project webpage: https://factormatte.github.io
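The compositing assumption behind such a decomposition can be sketched in a few lines; this is a hypothetical two-component illustration with a closed-form residual, not FactorMatte's learned per-video Bayesian model, which also accounts for conditional interactions between layers.

```python
# Minimal sketch of layered compositing and the per-video reconstruction
# residual a decomposition would minimize (all inputs hypothetical).
import numpy as np

def composite(background, foreground, alpha):
    """Standard "over" compositing of two counterfactual components.
    background, foreground: (H, W, 3) floats in [0, 1]; alpha: (H, W, 1)."""
    return alpha * foreground + (1.0 - alpha) * background

def reconstruction_residual(frame, background, foreground, alpha):
    # A per-video decomposition is fit by driving this residual down over all
    # frames while priors keep each component a plausible standalone video.
    return np.mean((composite(background, foreground, alpha) - frame) ** 2)
```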
Layered Neural Rendering for Retiming People in Video
We present a method for retiming people in an ordinary, natural
video---manipulating and editing the time in which different motions of
individuals in the video occur. We can temporally align different motions,
change the speed of certain actions (speeding up/slowing down, or entirely
"freezing" people), or "erase" selected people from the video altogether. We
achieve these effects computationally via a dedicated learning-based layered
video representation, where each frame in the video is decomposed into separate
RGBA layers, representing the appearance of different people in the video. A
key property of our model is that it not only disentangles the direct motions
of each person in the input video, but also correlates each person
automatically with the scene changes they generate---e.g., shadows,
reflections, and motion of loose clothing. The layers can be individually
retimed and recombined into a new video, allowing us to achieve realistic,
high-quality renderings of retiming effects for real-world videos depicting
complex actions and involving multiple individuals, including dancing,
trampoline jumping, or group running.
Comment: To appear in SIGGRAPH Asia 2020. Project webpage: https://retiming.github.io
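The recombination step maps naturally to standard back-to-front "over" compositing of RGBA layers, with retiming realized by sampling each person's layer at its own time index. The sketch below makes that concrete under assumed names (`layer_videos`, `offsets`); the layers themselves come from the paper's learned neural renderer.

```python
# Minimal sketch: recombine per-person RGBA layers into a retimed frame.
import numpy as np

def composite_layers(layers):
    """layers: list of (H, W, 4) RGBA arrays, ordered back to front."""
    h, w, _ = layers[0].shape
    out = np.zeros((h, w, 3))
    for layer in layers:
        rgb, a = layer[..., :3], layer[..., 3:4]
        out = a * rgb + (1.0 - a) * out          # "over" operator
    return out

def retimed_frame(layer_videos, offsets, t):
    # Retiming a person = shifting the time index of that person's layer only.
    return composite_layers([v[t + dt] for v, dt in zip(layer_videos, offsets)])
```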
Depth-Assisted Semantic Segmentation, Image Enhancement and Parametric Modeling
This dissertation addresses the problem of employing 3D depth information to solve a number of traditionally challenging computer vision/graphics problems. Humans have the ability to perceive depth in the 3D world, which enables them to reconstruct layouts, recognize objects, and understand the geometric space and semantic meaning of the visual world. It is therefore worthwhile to explore how 3D depth information can be utilized by computer vision systems to mimic these human abilities. This dissertation aims to employ 3D depth information to solve vision/graphics problems in the following aspects: scene understanding, image enhancement, and 3D reconstruction and modeling.
In addressing the scene understanding problem, we present a framework for semantic segmentation and object recognition on urban video sequences using only dense depth maps recovered from the video. Five view-independent 3D features that vary with object class are extracted from the dense depth maps and used to segment and recognize different object classes in street scene images. We demonstrate that a scene parsing algorithm using only dense 3D depth information outperforms alternatives based on sparse 3D or 2D appearance features.
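A minimal sketch of this idea, assuming a pinhole camera and two illustrative depth-derived cues (height and surface normals) rather than the five view-independent features defined in the dissertation:

```python
# Minimal sketch: per-pixel 3D cues from a dense depth map (illustrative
# feature choices, not the dissertation's actual five features).
import numpy as np

def depth_features(depth, fx, fy, cx, cy):
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    y = (v - cy) * depth / fy                    # back-projected camera y
    # Surface normal of z = f(x, y) is proportional to (-dz/dx, -dz/dy, 1):
    # roads give near-vertical normals, building facades near-horizontal ones.
    dzdx = np.gradient(depth, axis=1)
    dzdy = np.gradient(depth, axis=0)
    n = np.dstack([-dzdx, -dzdy, np.ones_like(depth)])
    n /= np.linalg.norm(n, axis=2, keepdims=True)
    height = -y                                  # camera y points down
    return np.dstack([height, n])                # (H, W, 4) feature map
```

Such per-pixel features can then feed any standard per-class classifier to produce the segmentation.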
In addressing the image enhancement problem, we present a framework that overcomes the imperfections of personal photographs of tourist sites using the rich information provided by large-scale internet photo collections (IPCs). By augmenting personal 2D images with 3D information reconstructed from IPCs, we address a number of traditionally challenging image enhancement tasks and achieve high-quality results with simple and robust algorithms.
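One step such a pipeline relies on can be sketched simply: projecting IPC-reconstructed 3D points into the registered personal photo with a standard pinhole model. The function below is an illustrative assumption, not the dissertation's full alignment and enhancement machinery.

```python
# Minimal sketch: project reconstructed 3D points into a personal photo.
import numpy as np

def project_points(points_3d, K, R, t):
    """points_3d: (N, 3) world points from the IPC reconstruction.
    K: (3, 3) intrinsics; R, t: the photo's camera rotation and translation.
    Returns (N, 2) pixel coordinates and (N,) depths."""
    cam = points_3d @ R.T + t        # world -> camera coordinates
    uvw = cam @ K.T                  # camera -> homogeneous pixel coordinates
    return uvw[:, :2] / uvw[:, 2:3], cam[:, 2]
```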
In addressing the 3D reconstruction and modeling problem, we focus on parametric modeling of flower petals, the most distinctive part of a plant. Their complex structure, severe occlusions, and wide variations make reconstructing their 3D models a challenging task. We overcome these challenges by combining data-driven modeling techniques with domain knowledge from botany. Taking a 3D point cloud of an input flower scanned from a single view, each segmented petal is fitted with a scale-invariant morphable petal shape model constructed from individually scanned 3D exemplar petals. Novel constraints based on botanical studies are incorporated into the fitting process to realistically reconstruct occluded regions and maintain correct 3D spatial relations.
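Assuming known petal-to-model correspondences, the core fitting step can be sketched as regularized least squares over the morphable model's coefficients; the scale invariance and botany-based constraints described above are omitted from this illustration.

```python
# Minimal sketch: fit a linear morphable petal model to a segmented petal
# point cloud with known correspondences (hypothetical setup).
import numpy as np

def fit_petal(points, mean_shape, basis, reg=1e-2):
    """points, mean_shape: (N, 3); basis: (K, N, 3) shape modes.
    Solves min_c ||mean_shape + sum_k c_k * basis_k - points||^2 + reg*||c||^2."""
    B = basis.reshape(basis.shape[0], -1).T      # (3N, K) design matrix
    r = (points - mean_shape).ravel()            # (3N,) residual target
    c = np.linalg.solve(B.T @ B + reg * np.eye(B.shape[1]), B.T @ r)
    return mean_shape + np.tensordot(c, basis, axes=1), c
```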
The main contribution of the dissertation is the intelligent use of 3D depth information to solve traditionally challenging vision/graphics problems. By developing advanced algorithms that operate either automatically or with minimal user interaction, this dissertation demonstrates that the 3D depth computed from multiple images carries rich information about the visual world and can therefore be intelligently utilized to recognize and understand the semantic meaning of scenes, efficiently enhance and augment single 2D images, and reconstruct high-quality 3D models.