Egocentric Hand Detection Via Dynamic Region Growing
Egocentric videos, which mainly record the activities carried out by the
wearers of the cameras, have drawn much research attention in recent years.
Because of their lengthy content, a large number of ego-related applications
have been developed to summarize the captured videos. Since users are
accustomed to interacting with target objects using their own hands, and their
hands usually appear within their visual field during the interaction, an
egocentric hand detection step is involved in tasks like gesture recognition,
action recognition, and social interaction understanding. In this
work, we propose a dynamic region growing approach for hand region detection in
egocentric videos, by jointly considering hand-related motion and egocentric
cues. We first determine seed regions that most likely belong to the hand, by
analyzing the motion patterns across successive frames. The hand regions can
then be located by extending from the seed regions, according to the scores
computed for the adjacent superpixels. These scores are derived from four
egocentric cues: contrast, location, position consistency and appearance
continuity. We discuss how to apply the proposed method in real-life scenarios,
where multiple hands irregularly appear and disappear from the videos.
Experimental results on public datasets show that the proposed method achieves
superior performance compared with state-of-the-art methods, especially in
complicated scenarios.
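To make the growing step concrete, the sketch below assumes the frame has already been over-segmented into superpixels. The names `seeds`, `adjacency`, `cue_scores`, and the stopping `threshold` are hypothetical, and collapsing the four egocentric cues into a single precomputed score is an assumption for illustration, not the paper's formulation.

```python
import heapq

def grow_hand_regions(seeds, adjacency, cue_scores, threshold=0.5):
    """Greedily grow hand regions outward from motion-derived seed superpixels.

    seeds      -- set of superpixel ids judged hand-like from motion patterns
    adjacency  -- dict mapping a superpixel id to the set of its neighbors
    cue_scores -- dict mapping a superpixel id to one combined score from the
                  four egocentric cues (contrast, location, position
                  consistency, appearance continuity); the weighting is an
                  assumption, not taken from the paper
    """
    region = set(seeds)
    # Max-heap (negated scores) over the frontier of candidate superpixels.
    frontier = [(-cue_scores[n], n)
                for s in seeds for n in adjacency[s] if n not in region]
    heapq.heapify(frontier)
    seen = region | {n for _, n in frontier}
    while frontier:
        neg_score, sp = heapq.heappop(frontier)
        if -neg_score < threshold:
            break  # every remaining candidate scores too low to be hand
        region.add(sp)
        for n in adjacency[sp] - seen:
            seen.add(n)
            heapq.heappush(frontier, (-cue_scores[n], n))
    return region
```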
Exemplar-AMMs: Recognizing Crowd Movements From Pedestrian Trajectories
In this paper, we present a novel method to recognize the types of crowd movement from crowd trajectories using agent-based motion models (AMMs). Our idea is to apply a number of AMMs, referred to as exemplar-AMMs, to describe the crowd movement. Specifically, we propose an optimization framework that filters out the unknown noise in the crowd trajectories and measures their similarity to the exemplar-AMMs to produce a crowd motion feature. We then formulate the real-world crowd movement recognition task as a multi-label classification problem. Our experiments show that the proposed feature outperforms state-of-the-art methods in recognizing both simulated and real-world crowd movements from their trajectories. Finally, we have created a synthetic dataset, SynCrowd, which contains 2D crowd trajectories in various scenarios, generated by a range of crowd simulators. This dataset can serve as a training set or benchmark for crowd analysis work.
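As a rough illustration of how exemplar-AMM similarities could be turned into a motion feature, consider the sketch below. It is not the paper's optimization framework: the one-step prediction error, the exponential similarity, and the callable `exemplar_amms` are all stand-ins for the actual agent-based motion models and noise filtering.

```python
import numpy as np

def crowd_motion_feature(trajectories, exemplar_amms):
    """Build a fixed-length feature from similarity to each exemplar-AMM.

    trajectories  -- array of shape (num_agents, T, 2) with 2D positions
    exemplar_amms -- list of callables; each maps current agent positions to
                     predicted positions one step later under its motion model
    """
    feature = []
    for amm in exemplar_amms:
        errors = []
        for t in range(trajectories.shape[1] - 1):
            pred = amm(trajectories[:, t])            # one-step prediction
            err = np.linalg.norm(pred - trajectories[:, t + 1], axis=-1)
            errors.append(err.mean())
        # Lower prediction error => trajectories better explained by this AMM.
        feature.append(np.exp(-np.mean(errors)))
    return np.asarray(feature)  # fed to a multi-label classifier downstream
```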
Distilling Localization for Self-Supervised Representation Learning
Recent progress in contrastive learning has revolutionized unsupervised
representation learning. Concretely, multiple views (augmentations) of the
same image are encouraged to map to similar embeddings, while views from
different images are pulled apart. In this paper, by visualizing and
diagnosing classification errors, we observe that current contrastive models
are ineffective at localizing the foreground object, limiting their ability to
extract discriminative high-level features. This is because the view
generation process treats all pixels in an image uniformly. To address this
problem, we propose a data-driven approach for learning invariance to
backgrounds. It first estimates foreground saliency in images and then creates
augmentations by copy-and-pasting the foreground onto a variety of backgrounds.
The learning still follows the instance discrimination pretext task, so that
the representation is trained to disregard background content and focus on the
foreground. We study a variety of saliency estimation methods, and find that
most methods lead to improvements for contrastive learning. With this approach
(DiLo), significant performance gains are achieved for self-supervised
learning on ImageNet classification, as well as for object detection on
PASCAL VOC and MSCOCO.
Comment: Accepted by AAAI 2021
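A minimal sketch of the copy-and-paste view generation is given below, assuming a saliency map is already available from some off-the-shelf estimator. The soft alpha compositing and the function name are illustrative assumptions; DiLo's exact blending may differ.

```python
import numpy as np

def copy_paste_view(image, saliency, backgrounds, rng=np.random):
    """Create a background-swapped view for instance discrimination.

    image       -- HxWx3 uint8 array
    saliency    -- HxW float array in [0, 1], the estimated foreground mask
    backgrounds -- list of HxWx3 uint8 arrays to paste the foreground onto
    """
    bg = backgrounds[rng.randint(len(backgrounds))]
    alpha = saliency[..., None]  # soft per-pixel foreground weight
    view = alpha * image.astype(np.float32) + (1 - alpha) * bg.astype(np.float32)
    return view.astype(np.uint8)  # then augmented and encoded as usual
```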
Neural Preset for Color Style Transfer
In this paper, we present a Neural Preset technique to address the
limitations of existing color style transfer methods, including visual
artifacts, vast memory requirement, and slow style switching speed. Our method
is based on two core designs. First, we propose Deterministic Neural Color
Mapping (DNCM) to consistently operate on each pixel via an image-adaptive
color mapping matrix, avoiding artifacts and supporting high-resolution inputs
with a small memory footprint. Second, we develop a two-stage pipeline by
dividing the task into color normalization and stylization, which allows
efficient style switching by extracting color styles as presets and reusing
them on normalized input images. Due to the unavailability of pairwise
datasets, we describe how to train Neural Preset via a self-supervised
strategy. Various advantages of Neural Preset over existing methods are
demonstrated through comprehensive evaluations. Notably, Neural Preset enables
stable 4K color style transfer in real-time without artifacts. Besides, we show
that our trained model can naturally support multiple applications without
fine-tuning, including low-light image enhancement, underwater image
correction, image dehazing, and image harmonization. Project page with demos:
https://zhkkke.github.io/NeuralPreset
Comment: Artifact-free real-time 4K color style transfer via AI-generated
presets. CVPR 2023
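To illustrate the idea of an image-adaptive color mapping matrix, here is a minimal PyTorch sketch. The encoder architecture, the projection dimension `k`, and the thumbnail resolution are assumptions for illustration, not the paper's DNCM design.

```python
import torch
import torch.nn as nn

class DNCMSketch(nn.Module):
    """An encoder predicts one k x k matrix from a downsampled copy of the
    image, and every pixel is mapped through that same matrix, so memory
    cost is largely independent of the full input resolution."""

    def __init__(self, k=16, thumb=256):
        super().__init__()
        self.k, self.thumb = k, thumb
        self.P = nn.Linear(3, k)       # lift RGB into a k-dim color space
        self.Q = nn.Linear(k, 3)       # project back to RGB
        self.encoder = nn.Sequential(  # predicts the k*k mapping matrix
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, k * k))

    def forward(self, x):              # x: (B, 3, H, W) in [0, 1]
        b, _, h, w = x.shape
        small = nn.functional.interpolate(x, (self.thumb, self.thumb))
        T = self.encoder(small).view(b, self.k, self.k)  # per-image matrix
        pix = x.permute(0, 2, 3, 1).reshape(b, h * w, 3)
        out = self.Q(torch.bmm(self.P(pix), T))  # same mapping for each pixel
        return out.reshape(b, h, w, 3).permute(0, 3, 1, 2)
```

Because the matrix is predicted once per image and applied pixel-wise, switching styles only requires swapping matrices, which is consistent with the preset extraction and reuse described in the abstract.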
Deformable Object Tracking with Gated Fusion
The tracking-by-detection framework has received growing attention through its
integration with Convolutional Neural Networks (CNNs). Existing
tracking-by-detection methods, however, fail to track objects with severe
appearance variations. This is because the traditional convolution operation
is performed on fixed grids, and thus may not find the correct response when
the object changes pose or appears under varying environmental conditions. In
this paper, we propose a deformable convolution layer to enrich
the target appearance representations in the tracking-by-detection framework.
We aim to capture the target appearance variations via deformable convolution,
which adaptively enhances its original features. In addition, we also propose a
gated fusion scheme to control how the variations captured by the deformable
convolution affect the original appearance. The enriched feature representation
through deformable convolution facilitates the discrimination of the CNN
classifier on the target object and background. Extensive experiments on the
standard benchmarks show that the proposed tracker performs favorably against
state-of-the-art methods.
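A minimal PyTorch sketch of pairing deformable convolution with a gated fusion of the original features follows; the channel count and the sigmoid gate parameterization are assumptions rather than the paper's exact scheme, with torchvision's DeformConv2d standing in for the deformable layer.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class GatedDeformBlock(nn.Module):
    """Enrich appearance features with deformable convolution, then let a
    learned gate decide how much the captured variations alter the
    original appearance."""

    def __init__(self, channels=256):
        super().__init__()
        # Offsets for a 3x3 deformable kernel: 2 coordinates per sample point.
        self.offset = nn.Conv2d(channels, 2 * 3 * 3, 3, padding=1)
        self.deform = DeformConv2d(channels, channels, 3, padding=1)
        self.gate = nn.Sequential(
            nn.Conv2d(channels * 2, channels, 1), nn.Sigmoid())

    def forward(self, feat):
        deformed = self.deform(feat, self.offset(feat))    # pose-adaptive features
        g = self.gate(torch.cat([feat, deformed], dim=1))  # per-pixel gate in (0, 1)
        return feat + g * deformed  # gated fusion with the original appearance
```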