Linear-time Online Action Detection From 3D Skeletal Data Using Bags of Gesturelets
The sliding window approach is a direct way to extend a successful recognition system to the more challenging detection problem. While action recognition decides only whether or not an action is present in a pre-segmented video sequence, action detection identifies the time interval in which the action occurred in an unsegmented video stream. Sliding window approaches for action detection can, however, be slow, as they maximize a classifier score over all possible sub-intervals. Even though newer schemes use dynamic programming to speed up the search for the optimal sub-interval, they require offline processing of the whole video sequence. In this paper, we propose a novel approach for online action detection based on 3D skeleton sequences extracted from depth data. It identifies the sub-interval with the maximum classifier score in linear time. Furthermore, it is invariant to temporal scale variations and is suitable for real-time applications with low latency.
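As a rough illustration of how a maximum-scoring sub-interval can be found in linear time (this is a generic Kadane-style sketch, not the paper's exact algorithm; the function name and the assumption of signed per-frame classifier scores are mine):

```python
# Minimal sketch: given per-frame classifier scores that are positive for
# "action-like" frames and negative otherwise, a single Kadane-style pass
# finds the maximum-scoring sub-interval in O(n) time.
from typing import List, Tuple

def max_score_interval(frame_scores: List[float]) -> Tuple[int, int, float]:
    """Return (start, end, score) of the best-scoring sub-interval in one pass."""
    best_start, best_end, best_score = 0, 0, float("-inf")
    cur_start, cur_score = 0, 0.0
    for i, s in enumerate(frame_scores):
        if cur_score <= 0.0:           # restart the interval if the running sum is non-positive
            cur_start, cur_score = i, s
        else:
            cur_score += s
        if cur_score > best_score:     # keep the best interval seen so far
            best_start, best_end, best_score = cur_start, i, cur_score
    return best_start, best_end, best_score

# Example: frames 1..3 (0-indexed) carry most of the positive evidence.
print(max_score_interval([-1.0, 0.5, 2.0, 1.5, -0.5, -2.0]))  # -> (1, 3, 4.0)
```

Because each frame is inspected once and the running interval is updated in constant time, the detector can process frames as they arrive, which is what makes the online, low-latency setting feasible.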
Neural Rendering Techniques for Photo-realistic Image Generation and Novel View Synthesis
Recent advances in deep generative models have enabled computers to imagine and generate fictional images from any given distribution of images. Techniques like Generative Adversarial Networks (GANs) and image-to-image (I2I) translation can generate images by mapping random noise or an input image (e.g., a sketch or a semantic map) to photo-realistic images. However, there are still plenty of challenges in training such models and in improving their output quality and diversity. Furthermore, to harness this imaginative and generative power for real-world applications, we need to control different aspects of the rendering process; for example, the content and/or style of generated images, the camera pose, and the lighting.
One challenge in training image generation models is the multi-modal nature of image synthesis. An image with a specific content, such as a cat or a car, can be generated in countless different styles (e.g., colors, lighting, and local texture details). To give users control over the generated style, previous works train multi-modal I2I translation networks, but these suffer from complicated and slow training that is specific to a single target image domain. We address this limitation and propose a style pre-training strategy that generalizes across many image domains, improves training stability and speed, and improves output quality and diversity.
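To make the multi-modal setting concrete, the toy sketch below shows the general mechanism by which a sampled style code can modulate a generator so that one content input yields many styled outputs. It is illustrative only; the class name, layer sizes, and the AdaIN-like affine modulation are assumptions, not the thesis' architecture.

```python
# Toy multi-modal I2I sketch: a content encoder extracts features, and a sampled
# style code produces per-channel scale/shift that modulates the decoder, so
# different style samples give different renderings of the same content.
import torch
import torch.nn as nn

class TinyMultiModalI2I(nn.Module):
    def __init__(self, style_dim: int = 8):
        super().__init__()
        self.encode = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                                    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())
        # Maps a style code to per-channel scale and shift for the decoder features.
        self.to_affine = nn.Linear(style_dim, 64)
        self.decode = nn.Sequential(nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
                                    nn.Conv2d(32, 3, 3, padding=1), nn.Tanh())

    def forward(self, content: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        h = self.encode(content)                              # content features
        scale, shift = self.to_affine(style).chunk(2, dim=1)  # (B, 32) each
        h = h * (1 + scale[..., None, None]) + shift[..., None, None]
        return self.decode(h)

net = TinyMultiModalI2I()
x = torch.randn(1, 3, 64, 64)                              # one content image
outputs = [net(x, torch.randn(1, 8)) for _ in range(3)]    # three style samples, three looks
```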
Another challenge for GANs and I2I translation is providing 3D control over the rendering process. For example, applications such as AR/VR, virtual tours, and telepresence require generating consistent images or videos of 3D environments. However, GANs and I2I translation mainly operate in 2D, which limits their use for such applications. To address this limitation, we propose to condition image synthesis on coarse geometric proxies (e.g., a point cloud, a coarse mesh, or a voxel grid), and we augment these rough proxies with machine-learned components that compensate for their artifacts and render photo-realistic images. We apply our proposal to the task of novel view synthesis under several challenging settings, and show photo-realistic novel views of complex scenes with multiple objects, of tourist landmarks under different appearances, and of human subjects under novel head poses and facial expressions.
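The general pipeline this describes, rendering a coarse proxy from the target viewpoint and letting a learned network turn that rough rendering into a photo-realistic view, can be sketched as follows. The class name, channel layout, and the stand-in rasterizer are hypothetical and only meant to show the data flow, not the thesis' models.

```python
# Sketch of neural re-rendering from a coarse proxy: a rasterized proxy view
# (rough RGB + depth) is refined by a learned network into the final image.
import torch
import torch.nn as nn

class ProxyRefiner(nn.Module):
    """Refines a rough proxy rendering (RGB + depth = 4 channels) into an RGB image."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, proxy_rgb: torch.Tensor, proxy_depth: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([proxy_rgb, proxy_depth], dim=1))

# In practice the inputs would come from rasterizing/splatting a point cloud,
# mesh, or voxel grid at the target camera; random tensors stand in here.
proxy_rgb, proxy_depth = torch.rand(1, 3, 128, 128), torch.rand(1, 1, 128, 128)
novel_view = ProxyRefiner()(proxy_rgb, proxy_depth)   # (1, 3, 128, 128)
```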
RTMV: A Ray-Traced Multi-View Synthetic Dataset for Novel View Synthesis
We present a large-scale synthetic dataset for novel view synthesis consisting of ~300k images rendered from nearly 2000 complex scenes using high-quality ray tracing at high resolution (1600 x 1600 pixels). The dataset is orders of magnitude larger than existing synthetic datasets for novel view synthesis, thus providing a large unified benchmark for both training and evaluation. Using 4 distinct sources of high-quality 3D meshes, the scenes of our dataset exhibit challenging variations in camera views, lighting, shape, materials, and textures. Because our dataset is too large for existing methods to process, we propose Sparse Voxel Light Field (SVLF), an efficient voxel-based light field approach for novel view synthesis that achieves comparable performance to NeRF on synthetic data, while being an order of magnitude faster to train and two orders of magnitude faster to render. SVLF achieves this speed by relying on a sparse voxel octree, careful voxel sampling (requiring only a handful of queries per ray), and a reduced network structure, as well as ground truth depth maps at training time. Our dataset is generated by NViSII, a Python-based ray tracing renderer, which is designed to be simple for non-experts to use and share, flexible and powerful through its use of scripting, and able to create high-quality, physically-based rendered images. Experiments with a subset of our dataset allow us to compare standard methods like NeRF and mip-NeRF for single-scene modeling, and pixelNeRF for category-level modeling, pointing toward the need for future improvements in this area.

Comment: Project page at http://www.cs.umd.edu/~mmeshry/projects/rtm
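To give a sense of why sampling only occupied voxels is so much cheaper than NeRF-style dense sampling, here is a schematic ray query in that spirit. It is not the released SVLF implementation; the MLP, the front-to-back compositing, and the precomputed list of occupied-voxel hit distances are all illustrative assumptions.

```python
# Schematic sparse-voxel ray query: only the few points where the ray crosses
# occupied voxels (e.g. from a sparse-voxel-octree traversal) are fed to a small
# network, then composited front to back with early termination.
import torch
import torch.nn as nn

class TinyRadianceHead(nn.Module):
    """Small MLP mapping (position, view direction) to (rgb, alpha)."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(6, 64), nn.ReLU(), nn.Linear(64, 4))

    def forward(self, x: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.mlp(torch.cat([x, d], dim=-1)))  # rgb and alpha in [0,1]

def render_ray(origin, direction, occupied_ts, head):
    """Composite the handful of samples where the ray hits occupied voxels."""
    rgb, transmittance = torch.zeros(3), 1.0
    for t in occupied_ts:                       # hit distances from the octree traversal
        x = origin + t * direction
        out = head(x, direction)
        color, alpha = out[:3], out[3]
        rgb = rgb + transmittance * alpha * color
        transmittance = transmittance * (1.0 - alpha)
        if transmittance < 1e-3:                # stop once the ray is saturated
            break
    return rgb

head = TinyRadianceHead()
rgb = render_ray(torch.zeros(3), torch.tensor([0.0, 0.0, 1.0]), [0.8, 1.1, 1.4], head)
```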