MIMIC: Masked Image Modeling with Image Correspondences
Many pixelwise dense prediction tasks in computer vision today, such as depth
estimation and semantic segmentation, rely on pretrained image representations.
Therefore, curating effective pretraining datasets is vital. Unfortunately,
effective pretraining datasets are those with multi-view scenes, and they have
so far only been curated using annotated 3D meshes, point clouds, and camera
parameters from simulated environments. We propose a dataset-curation
mechanism that does
not require any annotations. We mine two datasets: MIMIC-1M with 1.3M and
MIMIC-3M with 3.1M multi-view image pairs from open-sourced video datasets and
from synthetic 3D environments. We train multiple self-supervised models with
different masked image modeling objectives to showcase the following findings:
Representations trained on MIMIC-3M outperform those trained on datasets
mined using annotations across multiple downstream tasks, including depth
estimation, semantic segmentation, surface normal estimation, and pose
estimation. They also perform better when the representations are frozen and
when the downstream training data is limited to few-shot settings. The larger
dataset (MIMIC-3M) significantly improves performance, which is promising
since our curation method can scale arbitrarily to produce even larger
datasets.
MIMIC code, dataset, and pretrained models are open-sourced at
https://github.com/RAIVNLab/MIMIC
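Conceptually, the annotation-free mining step can be pictured as sampling frame
pairs from videos and keeping only those with enough sparse feature
correspondences to count as overlapping views of the same scene. A minimal
sketch under that assumption (function names, stride, and thresholds are
illustrative, not from the MIMIC code):

    # Hypothetical sketch of annotation-free multi-view pair mining;
    # thresholds and names are illustrative, not from the MIMIC code.
    import cv2

    def count_correspondences(img_a, img_b, ratio=0.75):
        # Count SIFT matches passing Lowe's ratio test.
        sift = cv2.SIFT_create()
        _, des_a = sift.detectAndCompute(img_a, None)
        _, des_b = sift.detectAndCompute(img_b, None)
        if des_a is None or des_b is None:
            return 0
        matches = cv2.BFMatcher().knnMatch(des_a, des_b, k=2)
        return sum(1 for m in matches
                   if len(m) == 2 and m[0].distance < ratio * m[1].distance)

    def mine_pairs(frames, stride=30, min_matches=50):
        # Frames far enough apart to differ in viewpoint, yet with
        # enough correspondences to depict the same scene region.
        pairs = []
        for i in range(0, len(frames) - stride, stride):
            a, b = frames[i], frames[i + stride]
            if count_correspondences(a, b) >= min_matches:
                pairs.append((a, b))
        return pairs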
OR-NeRF: Object Removing from 3D Scenes Guided by Multiview Segmentation with Neural Radiance Fields
The emergence of Neural Radiance Fields (NeRF) for novel view synthesis has
increased interest in 3D scene editing. An essential task in editing is
removing objects from a scene while ensuring visual plausibility and multiview
consistency. However, current methods face challenges such as time-consuming
object labeling, limited capability to remove specific targets, and compromised
rendering quality after removal. This paper proposes a novel object-removal
pipeline, named OR-NeRF, that can remove objects from 3D scenes with user-given
points or text prompts on a single view, achieving better performance in less
time than previous works. Our method spreads user annotations to all views
through 3D geometry and sparse correspondence, ensuring 3D consistency with
a lighter processing burden. The recent 2D segmentation model Segment Anything
(SAM) is then applied to predict masks, and a 2D inpainting model is used to
generate color supervision. Finally, our algorithm applies depth supervision
and perceptual loss to maintain consistency in geometry and appearance after
object removal. Experimental results demonstrate that our method achieves
better editing quality in less time than previous works, in terms of both
quantitative and qualitative evaluation.
Comment: project site: https://ornerf.github.io/ (codes available
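The annotation-spreading step rests on standard multi-view geometry: lift the
clicked pixel to a 3D point using its depth and camera parameters, then
reproject that point into every other view, where it serves as a prompt for
SAM. A minimal sketch, assuming pinhole intrinsics K and world-to-camera
extrinsics (conventions and names are illustrative, not the OR-NeRF
implementation):

    # Hypothetical sketch of propagating a single-view user click to
    # other views via unproject/reproject; conventions are assumed.
    import numpy as np

    def unproject(uv, depth, K, w2c):
        # Lift pixel (u, v) at known depth to a 3D world point.
        u, v = uv
        p_cam = np.linalg.inv(K) @ np.array([u, v, 1.0]) * depth
        c2w = np.linalg.inv(w2c)
        return (c2w @ np.append(p_cam, 1.0))[:3]

    def project(p_world, K, w2c):
        # Project a 3D world point into another view's pixels.
        p_cam = (w2c @ np.append(p_world, 1.0))[:3]
        uv_h = K @ p_cam
        return uv_h[:2] / uv_h[2]

    # Usage: the click in view A becomes a prompt point in view B,
    # which SAM can then turn into an object mask for that view.
    # p = unproject(click_uv, depth_a, K_a, w2c_a)
    # prompt_b = project(p, K_b, w2c_b)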
NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding
Research on depth-based human activity analysis has achieved outstanding
performance and demonstrated the effectiveness of 3D representations for
action recognition. Existing depth-based and RGB+D-based action recognition
benchmarks have a number of limitations, including the lack of large-scale
training samples, of a realistic number of distinct class categories, of
diverse camera views, of varied environmental conditions, and of a wide
variety of human subjects.
In this work, we introduce a large-scale dataset for RGB+D human action
recognition, which is collected from 106 distinct subjects and contains more
than 114 thousand video samples and 8 million frames. This dataset contains 120
different action classes including daily, mutual, and health-related
activities. We evaluate the performance of a series of existing 3D activity
analysis methods on this dataset, and show the advantage of applying deep
learning methods for 3D-based human action recognition. Furthermore, we
investigate a novel one-shot 3D activity recognition problem on our dataset,
and propose a simple yet effective Action-Part Semantic Relevance-aware
(APSR) framework for this task, which yields promising results in recognizing
the novel action classes. We believe the introduction of this
large-scale dataset will enable the community to apply, adapt, and develop
various data-hungry learning techniques for depth-based and RGB+D-based human
activity understanding. [The dataset is available at:
http://rose1.ntu.edu.sg/Datasets/actionRecognition.asp]
Comment: IEEE Transactions on Pattern Analysis and Machine Intelligence
(TPAMI)
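The one-shot setting can be made concrete with a generic nearest-exemplar
baseline (deliberately not the proposed APSR framework): embed the single
exemplar of each novel class and label test clips by cosine similarity. The
function embed() is an assumed stand-in for any pretrained action encoder:

    # Generic one-shot evaluation sketch (nearest-exemplar baseline,
    # not the paper's APSR framework). embed() is an assumed function
    # mapping a skeleton/video clip to a fixed-length feature vector.
    import numpy as np

    def classify_one_shot(test_clip, exemplars, embed):
        # exemplars: dict mapping novel class name -> its single clip.
        q = embed(test_clip)
        q = q / np.linalg.norm(q)
        best_cls, best_sim = None, -np.inf
        for cls, clip in exemplars.items():
            e = embed(clip)
            sim = float(q @ (e / np.linalg.norm(e)))  # cosine similarity
            if sim > best_sim:
                best_cls, best_sim = cls, sim
        return best_cls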
Reflectance Hashing for Material Recognition
We introduce a novel method for using reflectance to identify materials.
Reflectance offers a unique signature of a material but is challenging to
measure and use for recognizing materials due to its high dimensionality. In
this work, one-shot reflectance is captured using a unique optical camera
measuring reflectance disks, in which the pixel coordinates correspond to
surface viewing angles. The reflectance has class-specific structure, and
angular gradients computed in this reflectance space reveal the material
class.
These reflectance disks encode discriminative information for efficient and
accurate material recognition. We introduce a framework called reflectance
hashing that models the reflectance disks with dictionary learning and binary
hashing. We demonstrate the effectiveness of reflectance hashing for material
recognition with a number of real-world materials.
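The two ingredients the abstract names, dictionary learning over
reflectance-disk features and binary hashing for fast matching, can be
sketched as follows; the random-hyperplane hash is one standard choice and an
assumption here, not necessarily the paper's exact scheme:

    # Sketch: sparse-code flattened reflectance disks with a learned
    # dictionary, then hash the codes with random hyperplanes so that
    # recognition becomes nearest-neighbor search in Hamming space.
    import numpy as np
    from sklearn.decomposition import DictionaryLearning

    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 64))   # stand-in for reflectance disks

    dico = DictionaryLearning(n_components=32, alpha=1.0, max_iter=100)
    codes = dico.fit_transform(X)        # sparse codes over the dictionary

    planes = rng.standard_normal((codes.shape[1], 48))  # 48-bit hash
    bits = codes @ planes > 0            # binary signatures per sample

    def hamming(a, b):
        # Distance between two binary signatures.
        return int(np.count_nonzero(a != b))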
PixelHuman: Animatable Neural Radiance Fields from Few Images
In this paper, we propose PixelHuman, a novel human rendering model that
generates animatable human scenes from a few images of a person with unseen
identity, views, and poses. Previous works have demonstrated reasonable
performance in novel view and pose synthesis, but they rely on a large number
of training images and are trained per scene from videos, which requires a
significant amount of time to produce animatable scenes from unseen human
images. Our method differs from existing methods in that it can generalize to
any input image for animatable human synthesis. Given a random pose sequence,
our method synthesizes each target scene using a neural radiance field that is
conditioned on a canonical representation and pose-aware pixel-aligned
features, both of which can be obtained through deformation fields learned in a
data-driven manner. Our experiments show that our method achieves
state-of-the-art performance in multiview and novel pose synthesis from
few-shot images.
Comment: 8 pages
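The pixel-aligned conditioning mentioned in the abstract follows a standard
pattern from PIFu-style models: project each 3D query point into the input
image and bilinearly sample an image feature map at that location. A generic
sketch of that pattern (shapes and camera conventions are assumed; this is
not the PixelHuman code):

    # Generic pixel-aligned feature sampling (PIFu-style pattern),
    # not the PixelHuman implementation; conventions are assumed.
    import torch
    import torch.nn.functional as F

    def pixel_aligned_features(points, K, w2c, feat_map):
        # points: (N, 3) world coords; feat_map: (1, C, H, W) features.
        _, _, H, W = feat_map.shape
        ones = torch.ones(points.shape[0], 1)
        p_cam = (w2c @ torch.cat([points, ones], dim=1).T)[:3]  # (3, N)
        uv = K @ p_cam
        uv = uv[:2] / uv[2:3]                                   # (2, N)
        # Normalize pixel centers to [-1, 1] for grid_sample.
        gx = (uv[0] + 0.5) / W * 2 - 1
        gy = (uv[1] + 0.5) / H * 2 - 1
        grid = torch.stack([gx, gy], dim=-1).view(1, 1, -1, 2)
        out = F.grid_sample(feat_map, grid, align_corners=False)
        return out[0, :, 0].T                                   # (N, C)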