TöRF: Time-of-Flight Radiance Fields for Dynamic Scene View Synthesis
Neural networks can represent and accurately reconstruct radiance fields for
static 3D scenes (e.g., NeRF). Several works extend these to dynamic scenes
captured with monocular video, with promising performance. However, the
monocular setting is known to be an under-constrained problem, and so methods
rely on data-driven priors for reconstructing dynamic content. We replace these
priors with measurements from a time-of-flight (ToF) camera, and introduce a
neural representation based on an image formation model for continuous-wave ToF
cameras. Instead of working with processed depth maps, we model the raw ToF
sensor measurements to improve reconstruction quality and avoid issues with low
reflectance regions, multi-path interference, and a sensor's limited
unambiguous depth range. We show that this approach improves robustness of
dynamic scene reconstruction to erroneous calibration and large motions, and
discuss the benefits and limitations of integrating RGB+ToF sensors that are
now available on modern smartphones.
Comment: Accepted to NeurIPS 2021. Web page: https://imaging.cs.cmu.edu/torf/
NeurIPS camera-ready updates: added quantitative comparisons to new methods;
visual side-by-side comparisons performed on a larger-baseline camera sequence.
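For context on the image formation model the abstract builds on, below is a minimal NumPy sketch of the standard four-bucket continuous-wave ToF quadrature model (the function name and sign convention are illustrative assumptions, not the paper's code). It makes the listed failure modes concrete: low amplitude flags low-reflectance regions, and recovered depth wraps past the unambiguous range.

```python
import numpy as np

C = 299_792_458.0  # speed of light (m/s)

def tof_depth_from_quads(c0, c1, c2, c3, f_mod):
    """Recover phase, amplitude, and wrapped depth from four-bucket
    continuous-wave ToF correlation images taken at 0/90/180/270 degree
    phase offsets (one common convention; signs vary by sensor).

    f_mod: modulation frequency in Hz (tens of MHz on typical sensors).
    """
    phase = np.arctan2(c3 - c1, c0 - c2)          # wrapped phase
    phase = np.mod(phase, 2 * np.pi)              # map to [0, 2*pi)
    amplitude = 0.5 * np.hypot(c3 - c1, c0 - c2)  # low => low reflectance
    depth = C * phase / (4 * np.pi * f_mod)       # wraps past range below
    unambiguous_range = C / (2 * f_mod)           # e.g., ~5 m at 30 MHz
    return depth, amplitude, unambiguous_range
```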
CommonCanvas: An Open Diffusion Model Trained with Creative-Commons Images
We assemble a dataset of Creative-Commons-licensed (CC) images, which we use
to train a set of open diffusion models that are qualitatively competitive with
Stable Diffusion 2 (SD2). This task presents two challenges: (1)
high-resolution CC images lack the captions necessary to train text-to-image
generative models; (2) CC images are relatively scarce. To address these
challenges, we use an intuitive transfer learning technique to produce a
set of high-quality synthetic captions paired with curated CC images. We then
develop a data- and compute-efficient training recipe that requires as little
as 3% of the LAION-2B data needed to train existing SD2 models, but obtains
comparable quality. These results indicate that we have a sufficient number of
CC images (~70 million) for training high-quality models. Our training recipe
also implements a variety of optimizations that achieve ~3X training speed-ups,
enabling rapid model iteration. We leverage this recipe to train several
high-quality text-to-image models, which we dub the CommonCanvas family. Our
largest model achieves comparable performance to SD2 on a human evaluation,
despite being trained on our CC dataset that is significantly smaller than
LAION and using synthetic captions for training. We release our models, data,
and code at
https://github.com/mosaicml/diffusion/blob/main/assets/common-canvas.m
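As an illustration of the captioning step, the sketch below pairs an uncaptioned CC image with a synthetic caption from a pretrained image-to-text model via Hugging Face transformers. The specific BLIP checkpoint is an assumption, not necessarily the captioner the paper used; any strong captioning model fills the same role.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Pretrained captioner (assumed checkpoint, for illustration only).
name = "Salesforce/blip-image-captioning-base"
processor = BlipProcessor.from_pretrained(name)
model = BlipForConditionalGeneration.from_pretrained(name)

def synthetic_caption(path: str) -> str:
    """Produce a training caption for an uncaptioned CC image."""
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=40)
    return processor.decode(out[0], skip_special_tokens=True)

# Each (image, synthetic_caption(image)) pair then stands in for a
# scraped alt-text caption as one text-to-image training example.
```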
Habitat-Matterport 3D Semantics Dataset
We present the Habitat-Matterport 3D Semantics (HM3DSEM) dataset. HM3DSEM is
the largest dataset of 3D real-world spaces with densely annotated semantics
that is currently available to the academic community. It consists of 142,646
object instance annotations across 216 3D spaces and 3,100 rooms within those
spaces. The scale, quality, and diversity of object annotations far exceed
those of prior datasets. A key difference that sets HM3DSEM apart from other
datasets is its use of texture information to annotate pixel-accurate object
boundaries. We demonstrate the effectiveness of the HM3DSEM dataset for the
Object Goal Navigation task using different methods. Policies trained using
HM3DSEM outperform those trained on prior datasets. The introduction of
HM3DSEM in the Habitat ObjectNav Challenge led to an increase in participation
from 400 submissions in 2021 to 1,022 submissions in 2022.
Comment: 14 pages, 10 figures, 5 tables
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
Large language models (LLMs) have been shown to be able to perform new tasks
based on a few demonstrations or natural language instructions. While these
capabilities have led to widespread adoption, most LLMs are developed by
resource-rich organizations and are frequently kept from the public. As a step
towards democratizing this powerful technology, we present BLOOM, a
176B-parameter open-access language model designed and built thanks to a
collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer
language model that was trained on the ROOTS corpus, a dataset comprising
hundreds of sources in 46 natural and 13 programming languages (59 in total).
We find that BLOOM achieves competitive performance on a wide variety of
benchmarks, with stronger results after undergoing multitask prompted
finetuning. To facilitate future research and applications using LLMs, we
publicly release our models and code under the Responsible AI License.
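The "new tasks from a few demonstrations" claim can be tried directly against the released checkpoints. Below is a minimal few-shot prompting sketch using the Hugging Face transformers interface, with the smallest BLOOM checkpoint standing in for the 176B model (which uses the same interface but needs multi-GPU or offloaded inference).

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "bigscience/bloom-560m"  # smallest public BLOOM checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

# Few-shot prompting: demonstrations in the prompt, no gradient updates.
prompt = (
    "English: cheese -> French: fromage\n"
    "English: house -> French: maison\n"
    "English: book -> French:"
)
inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=5)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```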
MatryODShka: Real-time 6DoF Video View Synthesis using Multi-Sphere Images
We introduce a method to convert stereo 360° (omnidirectional stereo) imagery into a layered, multi-sphere image representation for six degree-of-freedom (6DoF) rendering. Stereo 360° imagery can be captured from multi-camera systems for virtual reality (VR), but lacks motion parallax and correct-in-all-directions disparity cues. Together, these shortcomings can quickly lead to VR sickness when viewing content. One solution is to generate a format suitable for 6DoF rendering, such as by estimating depth. However, this raises the question of how to handle disoccluded regions in dynamic scenes. Our approach is to simultaneously learn depth and disocclusions via a multi-sphere image representation, which can be rendered with correct 6DoF disparity and motion parallax in VR. This significantly improves comfort for the viewer, and can be inferred and rendered in real time on modern GPU hardware. Together, these advances move towards making VR video a more comfortable immersive medium.
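To make the multi-sphere image idea concrete, here is an illustrative NumPy sketch (not the paper's renderer) of compositing one viewing ray through concentric RGBA sphere layers; the per-layer sampling function is a stand-in for the learned textures. Moving the eye shifts each layer's intersection point by a different amount, which is exactly the motion parallax the representation recovers.

```python
import numpy as np

def intersect_sphere(o, d, r):
    """Far intersection of ray o + t*d (|d| = 1) with a sphere of radius r
    centered at the rig origin; valid while the eye stays inside the sphere."""
    b = np.dot(o, d)
    t = -b + np.sqrt(b * b - np.dot(o, o) + r * r)
    return o + t * d  # point on the sphere used to look up that layer

def render_ray(o, d, layers):
    """Composite a viewing ray through concentric RGBA sphere layers,
    back-to-front with the classic 'over' operator. `layers` is a list of
    (radius, sample_fn) sorted far-to-near; sample_fn maps a unit direction
    on its sphere to an RGBA value (a stand-in for a learned texture)."""
    color = np.zeros(3)
    for radius, sample_fn in layers:  # far -> near
        p = intersect_sphere(o, d, radius)
        rgba = sample_fn(p / np.linalg.norm(p))  # direction-indexed lookup
        rgb, a = rgba[:3], rgba[3]
        color = rgb * a + color * (1.0 - a)      # 'over' compositing
    return color
```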