Light-weight Head Pose Invariant Gaze Tracking
Unconstrained remote gaze tracking using off-the-shelf cameras is a
challenging problem. Recently, promising algorithms for appearance-based gaze
estimation using convolutional neural networks (CNN) have been proposed.
Improving their robustness to various confounding factors including variable
head pose, subject identity, illumination, and image quality remains an open
problem. In this work, we study the effect of variable head pose on machine
learning regressors trained to estimate gaze direction. We propose a novel
branched CNN architecture that improves the robustness of gaze classifiers to
variable head pose, without increasing computational cost. We also present
various procedures to effectively train our gaze network including transfer
learning from the more closely related task of object viewpoint estimation and
from a large high-fidelity synthetic gaze dataset, which enable our ten-times-faster
gaze network to achieve accuracy competitive with its current
state-of-the-art direct competitor.
Comment: 9 pages, IEEE Conference on Computer Vision and Pattern Recognition Workshops
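
As a rough illustration of the branched idea, the sketch below routes a shared lightweight feature extractor into one small regression head per head-pose bin, so only one branch runs per sample at inference time. The layer sizes, number of branches, and hard pose-based routing are assumptions made for this sketch, not the paper's architecture.

```python
# Hypothetical pose-branched gaze regressor (illustrative; not the paper's exact design).
import torch
import torch.nn as nn

class BranchedGazeNet(nn.Module):
    def __init__(self, num_pose_branches=3):
        super().__init__()
        # Shared lightweight feature extractor over eye-region crops.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # One small regression head per head-pose bin (predicts gaze yaw, pitch).
        self.branches = nn.ModuleList(
            [nn.Linear(64, 2) for _ in range(num_pose_branches)]
        )

    def forward(self, eye_images, pose_branch_idx):
        # eye_images: (B, 3, H, W); pose_branch_idx: (B,) LongTensor of pose bins.
        feats = self.backbone(eye_images)
        # Only the branch matching each sample's head-pose bin is evaluated,
        # so adding branches does not increase per-sample inference cost.
        return torch.stack(
            [self.branches[i](feats[b : b + 1]).squeeze(0)
             for b, i in enumerate(pose_branch_idx.tolist())]
        )

# Example: BranchedGazeNet()(torch.randn(4, 3, 36, 60), torch.tensor([0, 1, 2, 0]))
```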
GPViT: A High Resolution Non-Hierarchical Vision Transformer with Group Propagation
We present the Group Propagation Vision Transformer (GPViT): a novel non-hierarchical (i.e. non-pyramidal) transformer model designed for general visual recognition with high-resolution features. High-resolution features (or tokens) are a natural fit for tasks that involve perceiving fine-grained details, such as detection and segmentation, but exchanging global information between these features is expensive in memory and computation because of the way self-attention scales. We provide a highly efficient alternative, the Group Propagation Block (GP Block), to exchange global information. In each GP Block, features are first grouped together by a fixed number of learnable group tokens; we then perform Group Propagation, where global information is exchanged between the grouped features; finally, global information in the updated grouped features is returned to the image features through a transformer decoder. We evaluate GPViT on a variety of visual recognition tasks including image classification, semantic segmentation, object detection, and instance segmentation. Our method achieves significant performance gains over previous works across all tasks, especially on tasks that require high-resolution outputs; for example, our GPViT-L3 outperforms Swin Transformer-B by 2.0 mIoU on ADE20K semantic segmentation with only half as many parameters.
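
The grouping / propagation / ungrouping pattern described above can be sketched in a few lines of PyTorch; the specific attention and feed-forward layers, token dimension, and group count below are placeholders rather than GPViT's actual configuration. Because the image tokens only ever interact with a fixed number of group tokens, the cost of global mixing grows linearly, not quadratically, with the number of high-resolution tokens.

```python
# Sketch of a Group Propagation (GP) Block; layer choices are illustrative.
import torch
import torch.nn as nn

class GPBlock(nn.Module):
    def __init__(self, dim=256, num_groups=64, num_heads=8):
        super().__init__()
        self.group_tokens = nn.Parameter(torch.randn(num_groups, dim))
        # Grouping: a fixed set of learnable group tokens attends to the image tokens.
        self.group_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Propagation: global information is mixed among the (few) group tokens.
        self.propagate = nn.TransformerEncoderLayer(dim, num_heads, dim * 4, batch_first=True)
        # Ungrouping: image tokens query the updated group tokens, decoder-style.
        self.ungroup_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, image_tokens):                      # (B, N, dim), N large
        B = image_tokens.size(0)
        g = self.group_tokens.unsqueeze(0).expand(B, -1, -1)
        g, _ = self.group_attn(g, image_tokens, image_tokens)   # group
        g = self.propagate(g)                                    # propagate
        update, _ = self.ungroup_attn(image_tokens, g, g)        # ungroup
        return image_tokens + update
```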
Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models
We present ODISE: Open-vocabulary DIffusion-based panoptic SEgmentation,
which unifies pre-trained text-image diffusion and discriminative models to
perform open-vocabulary panoptic segmentation. Text-to-image diffusion models
have shown the remarkable capability of generating high-quality images with
diverse open-vocabulary language descriptions. This demonstrates that their
internal representation space is highly correlated with open concepts in the
real world. Text-image discriminative models like CLIP, on the other hand, are
good at classifying images into open-vocabulary labels. We propose to leverage
the frozen representation of both these models to perform panoptic segmentation
of any category in the wild. Our approach outperforms the previous state of the
art by significant margins on both open-vocabulary panoptic and semantic
segmentation tasks. In particular, with COCO training only, our method achieves
23.4 PQ and 30.0 mIoU on the ADE20K dataset, with 8.3 PQ and 7.9 mIoU absolute
improvement over the previous state of the art. Project page is available at
https://jerryxu.net/ODISE .
Comment: CVPR 2023. Project page: https://jerryxu.net/ODISE
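
The open-vocabulary classification step can be pictured with a small sketch: each mask proposal is pooled into a feature vector from a frozen feature map and matched to text embeddings of arbitrary category names by cosine similarity. The mask pooling and similarity scoring below are generic stand-ins; the frozen diffusion and CLIP encoders that would produce `diffusion_features` and `text_embeds` are assumed and not shown.

```python
# Sketch of open-vocabulary mask classification (inputs assumed from frozen encoders).
import torch
import torch.nn.functional as F

def classify_masks(diffusion_features, masks, text_embeds):
    """diffusion_features: (C, H, W) frozen feature map
       masks:              (M, H, W) binary mask proposals
       text_embeds:        (K, C) text embeddings of open-vocabulary class names"""
    # Average-pool a feature vector inside every mask proposal.
    m = masks.float().flatten(1)                              # (M, H*W)
    feats = m @ diffusion_features.flatten(1).t()             # (M, C)
    feats = feats / m.sum(dim=1, keepdim=True).clamp(min=1.0)
    # Cosine similarity against the class-name embeddings assigns each mask a label.
    logits = F.normalize(feats, dim=-1) @ F.normalize(text_embeds, dim=-1).t()
    return logits.argmax(dim=-1)                              # (M,) predicted class indices
```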
Convolutional State Space Models for Long-Range Spatiotemporal Modeling
Effectively modeling long spatiotemporal sequences is challenging due to the
need to model complex spatial correlations and long-range temporal dependencies
simultaneously. ConvLSTMs attempt to address this by updating tensor-valued
states with recurrent neural networks, but their sequential computation makes
them slow to train. In contrast, Transformers can process an entire
spatiotemporal sequence, compressed into tokens, in parallel. However, the cost
of attention scales quadratically in length, limiting their scalability to
longer sequences. Here, we address the challenges of prior methods and
introduce convolutional state space models (ConvSSM) that combine the tensor
modeling ideas of ConvLSTM with the long sequence modeling approaches of state
space methods such as S4 and S5. First, we demonstrate how parallel scans can
be applied to convolutional recurrences to achieve subquadratic parallelization
and fast autoregressive generation. We then establish an equivalence between
the dynamics of ConvSSMs and SSMs, which motivates parameterization and
initialization strategies for modeling long-range dependencies. The result is
ConvS5, an efficient ConvSSM variant for long-range spatiotemporal modeling.
ConvS5 significantly outperforms Transformers and ConvLSTM on a long-horizon
Moving-MNIST experiment while training 3X faster than ConvLSTM and generating
samples 400X faster than Transformers. In addition, ConvS5 matches or exceeds
the performance of state-of-the-art methods on challenging DMLab, Minecraft and
Habitat prediction benchmarks and enables new directions for modeling long
spatiotemporal sequences.
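
To make the convolutional recurrence concrete, the sketch below runs a ConvSSM-style update x_t = A * x_{t-1} + B * u_t (with * denoting spatial convolution) sequentially with plain 2D convolutions; the kernel shapes and the explicit time loop are illustrative only. Because each update is linear in the state, the same recurrence can also be evaluated with an associative parallel scan over time, which is what gives ConvS5 its training speed.

```python
# Sequential sketch of a convolutional state-space recurrence (ConvSSM-style).
# Assumes all kernels share the same odd spatial size so feature maps keep their shape.
import torch
import torch.nn.functional as F

def conv_ssm(inputs, A, B, C):
    """inputs: (T, D_in, H, W) spatiotemporal sequence
       A:      (D_st, D_st, k, k) state-transition kernel
       B:      (D_st, D_in, k, k) input kernel
       C:      (D_out, D_st, k, k) output kernel"""
    T, _, H, W = inputs.shape
    pad = A.shape[-1] // 2
    x = torch.zeros(1, A.shape[0], H, W)                 # tensor-valued state
    outputs = []
    for t in range(T):
        u = inputs[t : t + 1]                            # (1, D_in, H, W)
        # Linear convolutional state update: x_t = A * x_{t-1} + B * u_t
        x = F.conv2d(x, A, padding=pad) + F.conv2d(u, B, padding=pad)
        outputs.append(F.conv2d(x, C, padding=pad))      # readout: y_t = C * x_t
    return torch.cat(outputs, dim=0)                     # (T, D_out, H, W)

# Example: conv_ssm(torch.randn(8, 3, 16, 16),
#                   torch.randn(4, 4, 3, 3), torch.randn(4, 3, 3, 3), torch.randn(2, 4, 3, 3))
```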