Segment Anything in 3D with NeRFs
The Segment Anything Model (SAM) has demonstrated its effectiveness in
segmenting any object or part in various 2D images, yet its ability in 3D has
not been fully explored. The real world is composed of numerous 3D scenes and
objects. Due to the scarcity of accessible 3D data and the high cost of its
acquisition and annotation, lifting SAM to 3D is a challenging but valuable
research avenue. With this in mind, we propose a novel framework to Segment
Anything in 3D, named SA3D. Given a neural radiance field (NeRF) model, SA3D
allows users to obtain the 3D segmentation result of any target object via only
one-shot manual prompting in a single rendered view. With input prompts, SAM
cuts out the target object in the corresponding view. The obtained 2D
segmentation mask is projected onto 3D mask grids via density-guided inverse
rendering. 2D masks rendered from other views are mostly incomplete, but they
serve as cross-view self-prompts that are fed back into SAM; the completed
masks are then projected onto the mask grids. This procedure is executed
iteratively until accurate 3D masks are learned. SA3D adapts to various
radiance fields effectively without any additional redesign. The entire
segmentation process can be completed in
approximately two minutes without any engineering optimization. Our experiments
demonstrate the effectiveness of SA3D in different scenes, highlighting the
potential of SAM in 3D scene perception. The project page is at
https://jumpat.github.io/SA3D/.
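
As a rough illustration of the procedure described above, the Python sketch
below lifts 2D SAM masks into a 3D mask grid through cross-view self-prompting.
It is not the authors' implementation: nerf_render, sam_segment, the grid
resolution, and the prompt-sampling heuristic are hypothetical placeholders,
and the density-guided projection is simplified to a weighted splat.

    import numpy as np

    def mask_to_point_prompts(mask2d, k=5):
        # Sample a few foreground pixels as point prompts for SAM (hypothetical heuristic).
        ys, xs = np.nonzero(mask2d)
        if len(xs) == 0:
            return np.empty((0, 2), dtype=int)
        idx = np.random.choice(len(xs), size=min(k, len(xs)), replace=False)
        return np.stack([xs[idx], ys[idx]], axis=1)

    def segment_in_3d(nerf_render, sam_segment, views, init_view, init_prompts,
                      grid_shape=(128, 128, 128), n_rounds=2, thresh=0.5):
        # nerf_render(view) -> (rgb, weights, coords): per-ray sample weights (H, W, S)
        # and the integer grid indices (H, W, S, 3) of those samples.
        # sam_segment(rgb, prompts) -> binary 2D mask (H, W).
        mask_grid = np.zeros(grid_shape, dtype=np.float32)

        def project(mask2d, weights, coords):
            # Simplified density-guided inverse rendering: splat the 2D mask into
            # the grid, weighting each sample by its volume-rendering contribution.
            fg = mask2d.astype(bool)
            np.add.at(mask_grid,
                      tuple(coords[fg].reshape(-1, 3).T),
                      weights[fg].reshape(-1))

        def render_mask(weights, coords):
            # Project the current 3D mask back into a view by compositing along rays.
            vals = mask_grid[tuple(coords.reshape(-1, 3).T)].reshape(weights.shape)
            return (weights * (vals > 0)).sum(axis=-1) > thresh

        # One-shot manual prompting in a single rendered view.
        rgb, weights, coords = nerf_render(init_view)
        project(sam_segment(rgb, init_prompts), weights, coords)

        for _ in range(n_rounds):
            for view in views:
                rgb, weights, coords = nerf_render(view)
                coarse = render_mask(weights, coords)      # usually incomplete
                prompts = mask_to_point_prompts(coarse)    # cross-view self-prompts
                project(sam_segment(rgb, prompts), weights, coords)
        return mask_grid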
NeRFVS: Neural Radiance Fields for Free View Synthesis via Geometry Scaffolds
We present NeRFVS, a novel neural radiance fields (NeRF) based method to
enable free navigation in a room. NeRF achieves impressive performance when
rendering images for novel views similar to the input views, but it struggles
with novel views that differ significantly from the training views. To
address this issue, we utilize the holistic priors, including pseudo depth maps
and view coverage information, from neural reconstruction to guide the learning
of implicit neural representations of 3D indoor scenes. Concretely, an
off-the-shelf neural reconstruction method is leveraged to generate a geometry
scaffold. Then, two loss functions based on the holistic priors are proposed to
improve the learning of NeRF: 1) A robust depth loss that can tolerate the
error of the pseudo depth map to guide the geometry learning of NeRF; 2) A
variance loss to regularize the variance of implicit neural representations to
reduce the geometry and color ambiguity in the learning procedure. These two
loss functions are modulated during NeRF optimization according to the view
coverage information to reduce the negative influence brought by the view
coverage imbalance. Extensive experiments demonstrate that our NeRFVS outperforms
state-of-the-art view synthesis methods quantitatively and qualitatively on
indoor scenes, achieving high-fidelity free navigation results.
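
To make the two priors concrete, here is a minimal PyTorch-style sketch of how
such losses could look. It is not the authors' formulation: the Huber form of
the robust depth loss, the weight-variance penalty, the direction of the
coverage modulation, and the loss weights lam_d/lam_v are all assumptions.

    import torch.nn.functional as F

    def robust_depth_loss(pred_depth, pseudo_depth, beta=0.1):
        # Huber-style per-ray loss: tolerates errors in the pseudo depth map
        # instead of forcing NeRF to fit them exactly (assumed form).
        return F.smooth_l1_loss(pred_depth, pseudo_depth, beta=beta, reduction='none')

    def variance_loss(weights, t_vals, pred_depth):
        # Per-ray variance of the volume-rendering weights around the rendered
        # depth; a small variance indicates a sharp, unambiguous surface.
        return (weights * (t_vals - pred_depth.unsqueeze(-1)) ** 2).sum(dim=-1)

    def prior_guided_loss(rgb_loss, pred_depth, pseudo_depth, weights, t_vals,
                          coverage, lam_d=0.1, lam_v=0.01):
        # Assumed coverage modulation: rays well covered by input views rely on
        # the photometric loss, poorly covered rays lean on the priors.
        w = 1.0 - coverage.clamp(0.0, 1.0)
        l_d = (w * robust_depth_loss(pred_depth, pseudo_depth)).mean()
        l_v = (w * variance_loss(weights, t_vals, pred_depth)).mean()
        return rgb_loss + lam_d * l_d + lam_v * l_v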
Enhancing SPARQL Query Generation for Knowledge Base Question Answering Systems by Learning to Correct Triplets
Generating SPARQL queries from natural language questions is challenging in Knowledge Base Question Answering (KBQA) systems. Current state-of-the-art models rely heavily on fine-tuning pretrained models such as T5. However, these methods still encounter critical issues such as triple-flip errors (e.g., (subject, relation, object) is predicted as (object, relation, subject)). To address this limitation, we introduce TSET (Triplet Structure Enhanced T5), a model with a novel pretraining stage positioned between the initial T5 pretraining and the fine-tuning for the Text-to-SPARQL task. In this intermediary stage, we introduce a new objective called Triplet Structure Correction (TSC) to train the model on a SPARQL corpus derived from Wikidata. This objective aims to deepen the model’s understanding of the order of triplets. After this specialized pretraining, the model undergoes fine-tuning for SPARQL query generation, augmenting its query-generation capabilities. We also propose a method named “semantic transformation” to fortify the model’s grasp of SPARQL syntax and semantics without compromising the pretrained weights of T5. Experimental results demonstrate that our proposed TSET outperforms existing methods on three well-established KBQA datasets: LC-QuAD 2.0, QALD-9 plus, and QALD-10, establishing a new state-of-the-art performance (95.0% F1 and 93.1% QM on LC-QuAD 2.0, 75.85% F1 and 61.76% QM on QALD-9 plus, 51.37% F1 and 40.05% QM on QALD-10).
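
A minimal sketch of what a Triplet Structure Correction style objective could
look like with Hugging Face T5 is shown below. It is an assumption-laden
illustration, not the paper's code: the triple-matching regex, the task prefix,
and the corruption scheme (flipping the subject and object of one triple
pattern) are placeholders for the actual TSC setup.

    import random
    import re
    from transformers import T5ForConditionalGeneration, T5TokenizerFast

    # Naive "subject predicate object ." matcher for illustration only.
    TRIPLE = re.compile(r"(\S+)\s+(\S+)\s+(\S+)\s*\.")

    def flip_one_triple(sparql: str) -> str:
        # Corrupt the query by swapping subject and object of one triple pattern.
        triples = list(TRIPLE.finditer(sparql))
        if not triples:
            return sparql
        m = random.choice(triples)
        s, p, o = m.group(1), m.group(2), m.group(3)
        return sparql[:m.start()] + f"{o} {p} {s} ." + sparql[m.end():]

    tok = T5TokenizerFast.from_pretrained("t5-base")
    model = T5ForConditionalGeneration.from_pretrained("t5-base")

    def tsc_step(sparql_query: str):
        # Train T5 to restore the original triple order from the corrupted query.
        corrupted = flip_one_triple(sparql_query)
        inputs = tok("correct triplets: " + corrupted, return_tensors="pt")
        labels = tok(sparql_query, return_tensors="pt").input_ids
        out = model(**inputs, labels=labels)   # cross-entropy on the restored query
        return out.loss                        # backpropagated in the pretraining loop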
DialogueNeRF: Towards Realistic Avatar Face-to-face Conversation Video Generation
Conversation is an essential component of virtual avatar activities in the
metaverse. With the development of natural language processing, textual and
vocal conversation generation has achieved a significant breakthrough.
Face-to-face conversations account for the vast majority of daily
conversations. However, generating them has not received enough attention. In
this paper, we propose a novel task that aims to generate a realistic human avatar
face-to-face conversation process and present a new dataset to explore this
target. To tackle this novel task, we propose a new framework that utilizes a
series of conversation signals, e.g., audio, head pose, and expression, to
synthesize face-to-face conversation videos between human avatars, with all the
interlocutors modeled within the same network. Our method is evaluated by
quantitative and qualitative experiments in different aspects, e.g., image
quality, pose sequence trend, and naturalness of the rendered videos. All the
code, data, and models will be made publicly available.
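
Since the abstract only hints at the architecture, the snippet below is a
heavily hedged sketch of the conditioning interface it implies: one shared
network consumes each interlocutor's conversation signals plus a speaker
embedding. All module choices, signal dimensions, and the name
ConversationConditioner are illustrative assumptions, not the authors' design.

    import torch
    import torch.nn as nn

    class ConversationConditioner(nn.Module):
        def __init__(self, d_audio=128, d_pose=6, d_expr=64, d_out=256, n_speakers=2):
            super().__init__()
            self.speaker_emb = nn.Embedding(n_speakers, 32)
            self.mlp = nn.Sequential(
                nn.Linear(d_audio + d_pose + d_expr + 32, 256), nn.ReLU(),
                nn.Linear(256, d_out),
            )

        def forward(self, audio, pose, expr, speaker_id):
            # Both interlocutors pass through the same weights; only the speaker
            # embedding distinguishes them ("modeled within the same network").
            z = torch.cat([audio, pose, expr, self.speaker_emb(speaker_id)], dim=-1)
            return self.mlp(z)   # per-frame code that would drive the avatar renderer

    # Example call with dummy per-frame signals for one interlocutor.
    cond = ConversationConditioner()
    code = cond(torch.randn(1, 128), torch.randn(1, 6), torch.randn(1, 64),
                torch.tensor([0]))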