24 research outputs found
Cascaded Scene Flow Prediction using Semantic Segmentation
Given two consecutive frames from a pair of stereo cameras, 3D scene flow
methods simultaneously estimate the 3D geometry and motion of the observed
scene. Many existing approaches use superpixels for regularization, but may
predict inconsistent shapes and motions inside rigidly moving objects. We
instead assume that scenes consist of foreground objects rigidly moving in
front of a static background, and use semantic cues to produce pixel-accurate
scene flow estimates. Our cascaded classification framework accurately models
3D scenes by iteratively refining semantic segmentation masks, stereo
correspondences, 3D rigid motion estimates, and optical flow fields. We
evaluate our method on the challenging KITTI autonomous driving benchmark, and
show that accounting for the motion of segmented vehicles leads to
state-of-the-art performance.Comment: International Conference on 3D Vision (3DV), 2017 (oral presentation
A Fusion Approach for Multi-Frame Optical Flow Estimation
To date, top-performing optical flow estimation methods only take pairs of
consecutive frames into account. While elegant and appealing, the idea of using
more than two frames has not yet produced state-of-the-art results. We present
a simple, yet effective fusion approach for multi-frame optical flow that
benefits from longer-term temporal cues. Our method first warps the optical
flow from previous frames to the current, thereby yielding multiple plausible
estimates. It then fuses the complementary information carried by these
estimates into a new optical flow field. At the time of writing, our method
ranks first among published results in the MPI Sintel and KITTI 2015
benchmarks. Our models will be available on https://github.com/NVlabs/PWC-Net.Comment: Work accepted at IEEE Winter Conference on Applications of Computer
Vision (WACV 2019
AutoFocusFormer: Image Segmentation off the Grid
Real world images often have highly imbalanced content density. Some areas
are very uniform, e.g., large patches of blue sky, while other areas are
scattered with many small objects. Yet, the commonly used successive grid
downsampling strategy in convolutional deep networks treats all areas equally.
Hence, small objects are represented in very few spatial locations, leading to
worse results in tasks such as segmentation. Intuitively, retaining more pixels
representing small objects during downsampling helps to preserve important
information. To achieve this, we propose AutoFocusFormer (AFF), a
local-attention transformer image recognition backbone, which performs adaptive
downsampling by learning to retain the most important pixels for the task.
Since adaptive downsampling generates a set of pixels irregularly distributed
on the image plane, we abandon the classic grid structure. Instead, we develop
a novel point-based local attention block, facilitated by a balanced clustering
module and a learnable neighborhood merging module, which yields
representations for our point-based versions of state-of-the-art segmentation
heads. Experiments show that our AutoFocusFormer (AFF) improves significantly
over baseline models of similar sizes.Comment: CVPR 202
UPSCALE: Unconstrained Channel Pruning
As neural networks grow in size and complexity, inference speeds decline. To
combat this, one of the most effective compression techniques -- channel
pruning -- removes channels from weights. However, for multi-branch segments of
a model, channel removal can introduce inference-time memory copies. In turn,
these copies increase inference latency -- so much so that the pruned model can
be slower than the unpruned model. As a workaround, pruners conventionally
constrain certain channels to be pruned together. This fully eliminates memory
copies but, as we show, significantly impairs accuracy. We now have a dilemma:
Remove constraints but increase latency, or add constraints and impair
accuracy. In response, our insight is to reorder channels at export time, (1)
reducing latency by reducing memory copies and (2) improving accuracy by
removing constraints. Using this insight, we design a generic algorithm UPSCALE
to prune models with any pruning pattern. By removing constraints from existing
pruners, we improve ImageNet accuracy for post-training pruned models by 2.1
points on average -- benefiting DenseNet (+16.9), EfficientNetV2 (+7.9), and
ResNet (+6.2). Furthermore, by reordering channels, UPSCALE improves inference
speeds by up to 2x over a baseline export.Comment: 29 pages, 26 figures, accepted to ICML 202
Semantic MapNet: Building Allocentric SemanticMaps and Representations from Egocentric Views
We study the task of semantic mapping - specifically, an embodied agent (a
robot or an egocentric AI assistant) is given a tour of a new environment and
asked to build an allocentric top-down semantic map ("what is where?") from
egocentric observations of an RGB-D camera with known pose (via localization
sensors). Towards this goal, we present SemanticMapNet (SMNet), which consists
of: (1) an Egocentric Visual Encoder that encodes each egocentric RGB-D frame,
(2) a Feature Projector that projects egocentric features to appropriate
locations on a floor-plan, (3) a Spatial Memory Tensor of size floor-plan
length x width x feature-dims that learns to accumulate projected egocentric
features, and (4) a Map Decoder that uses the memory tensor to produce semantic
top-down maps. SMNet combines the strengths of (known) projective camera
geometry and neural representation learning. On the task of semantic mapping in
the Matterport3D dataset, SMNet significantly outperforms competitive baselines
by 4.01-16.81% (absolute) on mean-IoU and 3.81-19.69% (absolute) on Boundary-F1
metrics. Moreover, we show how to use the neural episodic memories and
spatio-semantic allocentric representations build by SMNet for subsequent tasks
in the same space - navigating to objects seen during the tour("Find chair") or
answering questions about the space ("How many chairs did you see in the
house?")
Image segmentation by cascaded region agglomeration
We propose a hierarchical segmentation algorithm that starts with a very fine oversegmentation and gradually merges regions using a cascade of boundary classifiers. This approach allows the weights of region and boundary features to adapt to the segmentation scale at which they are applied. The stages of the cascade are trained sequentially, with asymetric loss to maximize boundary recall. On six segmentation data sets, our algorithm achieves best performance under most region-quality measures, and does it with fewer segments than the prior work. Our algorithm is also highly competitive in a dense oversegmentation (superpixel) regime under boundary-based measures. 1