Deep Eyes: Binocular Depth-from-Focus on Focal Stack Pairs
The human visual system relies on both binocular stereo cues and monocular
focus cues to gain effective 3D perception. In computer vision, the two
problems are traditionally solved in separate tracks. In this paper, we present
a unified learning-based technique that simultaneously uses both types of cues
for depth inference. Specifically, we use a pair of focal stacks as input to
emulate human perception. We first construct a comprehensive focal stack
training dataset synthesized by depth-guided light field rendering. We then
construct three individual networks: a Focus-Net to extract depth from a single
focal stack, an EDoF-Net to obtain the extended depth of field (EDoF) image from
the focal stack, and a Stereo-Net to conduct stereo matching. We show how to
integrate them into a unified BDfF-Net to obtain high-quality depth maps.
Comprehensive experiments show that our approach outperforms the
state-of-the-art in both accuracy and speed and effectively emulates the human
visual system.
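The depth-from-focus half of this pipeline can be illustrated with a minimal sketch (plain Python, not the paper's Focus-Net): for each pixel, pick the focal-stack slice with the highest local contrast and use that slice index as a coarse depth estimate.

```python
# Minimal depth-from-focus sketch (an illustration, not the paper's
# learning-based Focus-Net): per pixel, choose the focal-stack slice
# maximizing a squared-Laplacian focus measure.
def focus_measure(img, x, y):
    """Squared discrete Laplacian at (x, y); img is a 2D list of floats."""
    lap = (img[y][x - 1] + img[y][x + 1] + img[y - 1][x] + img[y + 1][x]
           - 4.0 * img[y][x])
    return lap * lap

def depth_from_focus(stack):
    """stack: list of 2D images focused at increasing depths.
    Returns a map of best-focus slice indices (interior pixels only)."""
    h, w = len(stack[0]), len(stack[0][0])
    depth = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # the slice where this pixel is sharpest -> its depth index
            depth[y][x] = max(range(len(stack)),
                              key=lambda i: focus_measure(stack[i], x, y))
    return depth
```

A learned network replaces this hand-crafted measure precisely because real focus cues are ambiguous in textureless regions, where the Laplacian response is near zero for every slice.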
MM-NeRF: Multimodal-Guided 3D Multi-Style Transfer of Neural Radiance Field
3D style transfer aims to render stylized novel views of 3D scenes with the
specified style, which requires high-quality rendering and keeping multi-view
consistency. Benefiting from the ability of 3D representation from Neural
Radiance Field (NeRF), existing methods learn the stylized NeRF by giving a
reference style from an image. However, they struggle to achieve
high-quality stylization with texture details in multi-style transfer and to
support stylization under multimodal guidance. In this paper, we show that the
same objects in a 3D scene take on different appearances (color tone, details,
etc.) across views after stylization, because previous methods are optimized
with single-view, image-based style losses; this drives NeRF to smooth out
texture details and thus degrades rendering quality. To tackle these
problems, we propose a novel Multimodal-guided 3D Multi-style transfer of NeRF,
termed MM-NeRF, which achieves high-quality 3D multi-style rendering with
texture details and can be driven by multimodal-style guidance. First, MM-NeRF
adopts a unified framework to project multimodal guidance into CLIP space and
extracts multimodal style features to guide the multi-style stylization. To
relieve the problem of lacking details, we propose a novel Multi-Head Learning
Scheme (MLS), in which each style head predicts the parameters of the color
head of NeRF. MLS decomposes the learning difficulty caused by the
inconsistency of multi-style transfer and improves the quality of stylization.
In addition, MLS allows a pre-trained MM-NeRF to generalize to new styles by
adding heads at a small training cost (a few minutes). Extensive experiments
on three real-world 3D scene datasets show that MM-NeRF achieves high-quality
3D multi-style stylization with multimodal guidance, maintains multi-view
consistency, and preserves semantic consistency with the multimodal style
guidance. Code will be released later.
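The Multi-Head Learning Scheme can be illustrated with a toy sketch (names and shapes are hypothetical, not the authors' code): each style head predicts the parameters of a small linear color head, so a new style can be added by attaching a new head without retraining the shared model.

```python
# Toy sketch of the multi-head idea (hypothetical names/shapes, not
# MM-NeRF's implementation): one head per style emits the color-head
# parameters, so styles do not interfere with each other.
def make_style_head(weights, bias):
    """A style head that predicts color-head parameters (fixed here;
    learned in the real scheme)."""
    return lambda: (weights, bias)

def color_head(features, params):
    """Tiny linear color head: rgb = W @ features + b."""
    w, b = params
    return [sum(wi * f for wi, f in zip(row, features)) + bi
            for row, bi in zip(w, b)]

style_heads = {
    "style_A": make_style_head([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
                               [0.0, 0.0, 0.0]),
}
# Adding a new style later touches only its own head, not the shared model:
style_heads["style_B"] = make_style_head([[0.0, 1.0], [1.0, 0.0], [0.2, 0.8]],
                                         [0.1, 0.1, 0.1])

features = [0.6, 0.4]                      # per-point scene features
rgb_A = color_head(features, style_heads["style_A"]())
```

The per-style isolation is what makes the "few minutes per new style" claim plausible: only the new head's parameters need gradient updates.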
Denoising and guided upsampling of Monte Carlo path-traced low-resolution renderings
Monte Carlo path tracing generates renderings by estimating the rendering equation with the Monte Carlo method. A large number of ray samples per pixel must be cast during rendering to produce an image with low enough variance to be considered visually noise-free, and casting that many samples requires an expensive time budget. Many studies render a noisy image at the original resolution with a decreased sample count and then apply post-process denoising to produce a visually appealing output. This approach speeds up rendering and yields a denoised image comparable in quality to the visually noise-free ground truth. However, denoising cannot handle the noisy image's high variance accurately if the sample count is decreased harshly to fit a shorter time budget. In this thesis work, we address this problem with a pipeline that renders the image at a reduced resolution, which allows more samples to be cast than the harshly decreased sample count within the same time budget. The resulting noisy low-resolution image has lower variance and can therefore be denoised more accurately. It is then upsampled with the guidance of auxiliary scene data rendered swiftly in a separate pass at the original resolution. Experimental evaluation shows that the proposed pipeline produces denoised, guided-upsampled images of promisingly good quality compared to denoising noisy original-resolution images rendered with the harshly decreased sample count.
M.S. - Master of Science
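The guided-upsampling stage described above can be sketched in one dimension (a simplified stand-in for the thesis pipeline; the weighting scheme and buffer names are assumptions): low-resolution denoised radiance is upsampled by weighting coarse neighbors according to how well their guide values match the full-resolution auxiliary buffer.

```python
# 1D joint-guided upsampling sketch (an illustrative simplification, not
# the thesis implementation): each full-resolution pixel blends nearby
# low-resolution values, weighted by similarity of guide data such as
# albedo or normal buffers rendered at full resolution.
import math

def guided_upsample_1d(low, guide_low, guide_high, sigma=0.1):
    scale = len(guide_high) / len(low)
    out = []
    for i, g in enumerate(guide_high):
        c = i / scale                       # corresponding coarse position
        w_sum = v_sum = 0.0
        for j in range(max(0, int(c) - 1), min(len(low), int(c) + 2)):
            # Gaussian weight on guide-value difference: edges in the
            # guide stop radiance from bleeding across them.
            w = math.exp(-((g - guide_low[j]) ** 2) / (2 * sigma ** 2))
            w_sum += w
            v_sum += w * low[j]
        out.append(v_sum / w_sum if w_sum > 0 else low[int(c)])
    return out
```

With a guide that has a sharp step, the upsampled signal keeps the step at full resolution instead of blurring it, which is the point of rendering the auxiliary buffers at the original resolution.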
Point2Pix: Photo-Realistic Point Cloud Rendering via Neural Radiance Fields
Synthesizing photo-realistic images from a point cloud is challenging because
of the sparsity of the point cloud representation. Neural Radiance Fields
(NeRF) and its recent extensions synthesize realistic images from 2D inputs. In
this paper, we present Point2Pix as a novel point renderer to link the 3D
sparse point clouds with 2D dense image pixels. Taking advantage of the point
cloud 3D prior and NeRF rendering pipeline, our method can synthesize
high-quality images from colored point clouds, generally for novel indoor
scenes. To improve the efficiency of ray sampling, we propose point-guided
sampling, which focuses on valid samples. Also, we present Point Encoding to
build Multi-scale Radiance Fields that provide discriminative 3D point
features. Finally, we propose Fusion Encoding to efficiently synthesize
high-quality images. Extensive experiments on the ScanNet and ArkitScenes
datasets demonstrate the effectiveness and generalization of our method.
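The point-guided sampling idea can be sketched as a simple filter (a hypothetical simplification, not the authors' implementation): candidate sample depths along a ray are kept only if they fall within a radius of a nearby cloud point, concentrating samples around surfaces.

```python
# Sketch of point-guided ray sampling (illustrative only): discard
# candidate depths far from every point, so network evaluations are
# spent near geometry suggested by the point cloud prior.
def point_guided_samples(ray_depths, point_depths, radius):
    """ray_depths: candidate sample depths along one ray.
    point_depths: depths at which cloud points lie near this ray."""
    return [t for t in ray_depths
            if any(abs(t - p) <= radius for p in point_depths)]
```

Compared with uniform sampling along the full ray, this uses the 3D prior to skip empty space, which is where the claimed efficiency gain comes from.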
High-speed Video from Asynchronous Camera Array
This paper presents a method for capturing high-speed video using an
asynchronous camera array. Our method sequentially fires each sensor in a
camera array with a small time offset and assembles captured frames into a
high-speed video according to the time stamps. The resulting video, however,
suffers from parallax jittering caused by the viewpoint difference among
sensors in the camera array. To address this problem, we develop a dedicated
novel view synthesis algorithm that transforms the video frames as if they were
captured by a single reference sensor. Specifically, for any frame from a
non-reference sensor, we find the two temporally neighboring frames captured by
the reference sensor. Using these three frames, we render a new frame with the
same time stamp as the non-reference frame but from the viewpoint of the
reference sensor. To do so, we segment these frames into super-pixels and
then apply local content-preserving warping to them to form the new frame.
We employ a multi-label Markov Random Field method to blend these warped
frames. Our experiments show that our method can produce high-quality and
high-speed video of a wide variety of scenes with large parallax, scene
dynamics, and camera motion, and outperforms several baseline and
state-of-the-art approaches.
Comment: 10 pages, 82 figures, Published at IEEE WACV 201
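The timestamp-based assembly step (before any view synthesis) can be sketched as follows; the stagger scheme shown, a uniform offset of period/N between sensors, is an assumption rather than the paper's exact trigger layout.

```python
# Sketch of interleaving a staggered camera array into one high-speed
# stream: N sensors each run at the base frame rate, offset by period/N,
# and frames are merged in timestamp order. Parallax correction between
# sensors (the paper's main contribution) is omitted here.
def interleave(num_sensors, frames_per_sensor, period):
    offset = period / num_sensors
    frames = [(s * offset + k * period, s, k)   # (timestamp, sensor, frame)
              for s in range(num_sensors)
              for k in range(frames_per_sensor)]
    return sorted(frames)                       # sorts by timestamp first

video = interleave(num_sensors=4, frames_per_sensor=2, period=1 / 30)
```

Four 30 fps sensors staggered this way yield an effective 120 fps stream, which is why the subsequent view-synthesis step to a single reference viewpoint is worth the effort.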
Selective rendering for efficient ray traced stereoscopic images
Depth-related visual effects are a key feature of many virtual environments. In stereo-based systems, the depth effect can be produced by delivering frames of disparate image pairs, while in monocular environments, the viewer has to extract this depth information from a single image by examining details such as perspective and shadows. This paper investigates, via a number of psychophysical experiments, whether we can reduce computational effort and still achieve perceptually high-quality rendering for stereo imagery. We examined selectively rendering the image pairs by exploiting the fusing capability and depth perception underlying human stereo vision. In ray-tracing-based global illumination systems, a higher image resolution introduces more computation to the rendering process since many more rays need to be traced. We first investigated whether we could exploit the human binocular fusing ability by significantly reducing the resolution of one image of each pair while retaining high perceptual quality under stereo viewing conditions. Secondly, we evaluated subjects' performance on a specific visual task that required accurate depth perception. We found that subjects required far fewer rendered depth cues in the stereo viewing environment to perform the task well, and avoiding rendering these detailed cues saved significant computational time. In fact, it was possible to achieve better task performance in the stereo viewing condition at a combined rendering time for the image pair less than that required for the single monocular image. The outcome of this study suggests that we can produce more efficient stereo images for depth-related visual tasks by selective rendering that exploits inherent features of human stereo vision.
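The resolution-reduction saving discussed above follows from simple ray-count arithmetic; the sketch below (illustrative numbers, not the paper's measurements) gives the fraction of primary rays saved when one eye of the pair is rendered at reduced resolution.

```python
# Back-of-the-envelope ray budget for selective stereo rendering
# (illustrative arithmetic, not the study's timings): pixel count, and
# hence primary-ray count, scales with the square of linear resolution.
def ray_savings(low_res_factor):
    """Fraction of primary rays saved, versus two full-resolution images,
    when one eye is rendered at low_res_factor of the linear resolution."""
    reduced = low_res_factor ** 2          # pixel count scales quadratically
    return (2.0 - (1.0 + reduced)) / 2.0
```

For example, halving the linear resolution of one eye saves 37.5% of the pair's primary rays before any of the further savings from omitting detailed depth cues.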