20 research outputs found
Matching-CNN Meets KNN: Quasi-Parametric Human Parsing
Both parametric and non-parametric approaches have demonstrated encouraging
performance in the human parsing task, namely segmenting a human image into
several semantic regions (e.g., hat, bag, left arm, face). In this work, we aim
to develop a new solution with the advantages of both methodologies, namely
supervision from annotated data and the flexibility to use newly annotated
(possibly uncommon) images, and present a quasi-parametric human parsing model.
Under the classic K Nearest Neighbor (KNN)-based nonparametric framework, the
parametric Matching Convolutional Neural Network (M-CNN) is proposed to predict
the matching confidence and displacements of the best matched region in the
testing image for a particular semantic region in one KNN image. Given a
testing image, we first retrieve its KNN images from the
annotated/manually-parsed human image corpus. Then each semantic region in each
KNN image is matched with confidence to the testing image using M-CNN, and the
matched regions from all KNN images are further fused, followed by a superpixel
smoothing procedure to obtain the ultimate human parsing result. The M-CNN
differs from the classic CNN in that the tailored cross image matching filters
are introduced to characterize the matching between the testing image and the
semantic region of a KNN image. The cross image matching filters are defined at
different convolutional layers, each aiming to capture a particular range of
displacements. Comprehensive evaluations on a large dataset with 7,700
annotated human images demonstrate the significant performance gain of the
quasi-parametric model over the state of the art for the human parsing task.
Comment: This manuscript is the accepted version for CVPR 2015
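As an illustration of the fusion step described above, the following is a minimal sketch (hypothetical function and variable names, not the authors' code) of how the per-region matches from all KNN images might be fused into a pixel-wise label map, assuming M-CNN has already produced a confidence score and a displaced region mask on the testing image for every semantic region of every KNN image:

```python
import numpy as np

def fuse_knn_matches(matches, image_shape, num_labels):
    """Fuse matched semantic regions from all KNN images into one label map.

    matches: iterable of (label_id, confidence, mask) tuples, where `mask` is a
    boolean array of shape `image_shape` giving the KNN region transferred onto
    the testing image after applying the displacement predicted by M-CNN.
    """
    votes = np.zeros((num_labels,) + tuple(image_shape), dtype=np.float32)
    for label_id, confidence, mask in matches:
        votes[label_id][mask] += confidence   # accumulate confidence-weighted votes
    label_map = votes.argmax(axis=0)          # highest-voted label per pixel
    label_map[votes.max(axis=0) == 0] = 0     # pixels with no votes stay background
    return label_map
```

In the paper the fused map is further refined by superpixel smoothing, which is omitted here.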
Computational Baby Learning
Intuitive observations show that a baby may inherently possess the capability
of recognizing a new visual concept (e.g., chair, dog) by learning from only
very few positive instances taught by parent(s) or others, and this recognition
capability can be gradually further improved by exploring and/or interacting
with the real instances in the physical world. Inspired by these observations,
we propose a computational model for slightly-supervised object detection,
based on prior knowledge modeling, exemplar learning, and learning with video
contexts. The prior knowledge is modeled with a pre-trained Convolutional
Neural Network (CNN). When very few instances of a new concept are given, an
initial concept detector is built by exemplar learning over the deep features
from the pre-trained CNN. Simulating the baby's interaction with the physical
world, a well-designed tracking solution is then used to discover more diverse
instances from massive online unlabeled videos. Once a positive instance is
detected/identified with a high score in a video, more varied instances,
possibly from different viewing angles and/or distances, are tracked and
accumulated. The concept detector is then fine-tuned on these new instances.
This process can be repeated until we obtain a mature concept detector.
Extensive experiments on the Pascal VOC-07/10/12 object detection datasets
demonstrate the effectiveness of our framework: it beats the performance of
state-of-the-art fully supervised training by learning from very few samples
for each object category, together with about 20,000 unlabeled videos.
Comment: 9 pages
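The overall procedure reads as an iterative self-training loop. The sketch below is only a schematic outline of that loop; the callables it takes (build_detector, detect_best_frame, track_instances, finetune) are hypothetical placeholders for the exemplar-learning, detection, tracking and fine-tuning components, not the authors' implementation:

```python
def baby_learning(seed_instances, unlabeled_videos,
                  build_detector, detect_best_frame, track_instances, finetune,
                  rounds=5, score_thresh=0.9):
    """Schematic loop: seed detector -> mine tracked instances from videos -> fine-tune.

    The callables stand in for the exemplar-learning, detection, tracking and
    fine-tuning components described in the abstract (placeholders, not the
    authors' code). `detect_best_frame` is assumed to return (score, box, frame).
    """
    detector = build_detector(seed_instances)          # exemplar learning on CNN features
    instances = list(seed_instances)
    for _ in range(rounds):
        for video in unlabeled_videos:
            score, box, frame = detect_best_frame(detector, video)
            if score < score_thresh:                   # only trust confident detections
                continue
            # Track the confident hit to accumulate more diverse views of the concept.
            instances.extend(track_instances(video, frame, box))
        detector = finetune(detector, instances)       # fine-tune and repeat
    return detector
```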
Towards Consistent Video Editing with Text-to-Image Diffusion Models
Existing works have advanced Text-to-Image (TTI) diffusion models for video
editing in a one-shot learning manner. Despite their low data and computation
requirements, these methods may produce results whose consistency with the text
prompt and across the temporal sequence is unsatisfactory, limiting their
real-world applications. In this paper, we propose to address the above issues
with a novel EI model for Enhancing vIdeo Editing consIstency of TTI-based
frameworks. Specifically, we analyze and find that the inconsistency is caused
by the modules newly added to TTI models for learning temporal information.
These modules lead to covariate shift in the feature space, which harms the
editing capability. Thus, we design EI to tackle the above drawbacks with two
classical modules: the Shift-restricted Temporal Attention Module (STAM) and
the Fine-coarse Frame Attention Module (FFAM). First, through theoretical
analysis, we demonstrate that covariate shift is highly related to Layer
Normalization, so STAM replaces it with an Instance Centering layer to preserve
the distribution of temporal features. In addition, STAM employs an attention
layer with normalized mapping to transform temporal features while constraining
the variance shift. Second, we combine STAM with a novel FFAM, which
efficiently leverages fine- and coarse-grained spatial information across all
frames to further enhance temporal consistency. Extensive experiments
demonstrate the superiority of the proposed EI model for text-driven video
editing.
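As one possible reading of the Instance Centering idea, the following minimal PyTorch sketch centers each token's features without rescaling them, unlike nn.LayerNorm, so the variance of the temporal features is left untouched. The learnable shift is an assumption for illustration; this is not the authors' released module:

```python
import torch
import torch.nn as nn

class InstanceCentering(nn.Module):
    """Center features per instance without rescaling them (contrast with nn.LayerNorm)."""

    def __init__(self, dim):
        super().__init__()
        # Optional learnable shift; an assumption for this sketch.
        self.bias = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        # x: (batch, tokens, dim); subtract the mean over the feature dimension only,
        # leaving the variance of the features unchanged.
        return x - x.mean(dim=-1, keepdim=True) + self.bias
```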
Referring Image Segmentation via Cross-Modal Progressive Comprehension
Referring image segmentation aims at segmenting the foreground masks of the
entities that best match the description given in the natural language
expression. Previous approaches tackle this problem using implicit feature
interaction and fusion between the visual and linguistic modalities, but
usually fail to exploit informative words of the expression to align features
from the two modalities well for accurately identifying the referred entity. In this
paper, we propose a Cross-Modal Progressive Comprehension (CMPC) module and a
Text-Guided Feature Exchange (TGFE) module to effectively address the
challenging task. Concretely, the CMPC module first employs entity and
attribute words to perceive all the related entities that might be considered
by the expression. Then, the relational words are adopted to highlight the
correct entity as well as suppress other irrelevant ones by multimodal graph
reasoning. In addition to the CMPC module, we further leverage a simple yet
effective TGFE module to integrate the reasoned multimodal features from
different levels with the guidance of textual information. In this way,
features from multiple levels can communicate with each other and be refined
based on the textual context. We conduct extensive experiments on four popular
referring segmentation benchmarks and achieve new state-of-the-art
performance.
Comment: Accepted by CVPR 2020. Code is available at
https://github.com/spyflying/CMPC-Refse
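A rough PyTorch sketch of how text-guided exchange of multi-level features could look; the module name, the sigmoid gating, and the linear mixing are illustrative assumptions rather than the actual TGFE design:

```python
import torch
import torch.nn as nn

class TextGuidedExchange(nn.Module):
    """Illustrative text-guided exchange of multi-level visual features (not the actual TGFE)."""

    def __init__(self, dim, num_levels):
        super().__init__()
        self.gates = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_levels)])
        self.mix = nn.Linear(num_levels * dim, dim)

    def forward(self, level_feats, text_feat):
        # level_feats: list of (B, N, dim) visual features from different levels.
        # text_feat:   (B, dim) sentence-level textual feature.
        gated = [f * torch.sigmoid(g(text_feat)).unsqueeze(1)   # gate each level by the text
                 for f, g in zip(level_feats, self.gates)]
        fused = self.mix(torch.cat(gated, dim=-1))              # mix information across levels
        return [f + fused for f in gated]                       # refined per-level features
```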
DropKey
In this paper, we focus on analyzing and improving the dropout technique for
the self-attention layers of the Vision Transformer, which is important yet
surprisingly ignored by prior works. In particular, we investigate three core
questions: First, what to drop in self-attention layers? Different from
dropping attention weights as in the literature, we propose to move the dropout
operation ahead of the attention matrix calculation and set the Key as the
dropout unit, yielding a novel dropout-before-softmax scheme. We theoretically
verify that this scheme preserves both the regularization and the probabilistic
properties of attention weights, alleviating overfitting to specific patterns
and encouraging the model to capture vital information globally; Second, how to
schedule the drop ratio in consecutive layers? In contrast to using a constant
drop ratio for all layers, we present a new schedule that gradually decreases
the drop ratio along the stack of self-attention layers. We experimentally
validate that the proposed schedule avoids overfitting to low-level features
and loss of high-level semantics, thus improving the robustness and stability
of model training; Third, is a structured dropout operation, as in CNNs,
needed? We try a patch-based block version of the dropout operation and find
that this trick, while useful for CNNs, is not essential for ViTs. Based on the
exploration of the above three questions, we present the novel DropKey method,
which regards the Key as the drop unit and exploits a decreasing schedule for
the drop ratio, improving ViTs in a general way. Comprehensive experiments
demonstrate the effectiveness of DropKey for various ViT architectures, e.g.,
T2T and VOLO, as well as for various vision tasks, e.g., image classification,
object detection, human-object interaction detection and human body shape
recovery.
Comment: Accepted by CVPR 2023
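A minimal sketch of the dropout-before-softmax idea for standard scaled dot-product attention, together with an illustrative decreasing schedule for the per-layer drop ratio; the constants and function names are assumptions, not the official DropKey implementation:

```python
import math
import torch

def attention_with_dropkey(q, k, v, drop_ratio=0.1, training=True):
    """Scaled dot-product attention with Key entries dropped before the softmax."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))    # (..., Nq, Nk) attention logits
    if training and drop_ratio > 0:
        # Randomly push some logits to a large negative value before the softmax,
        # so the surviving attention weights still sum to one (no rescaling needed).
        drop_mask = torch.bernoulli(torch.full_like(scores, drop_ratio))
        scores = scores + drop_mask * -1e12
    return scores.softmax(dim=-1) @ v

def layer_drop_ratio(layer_idx, num_layers, base_ratio=0.3):
    """Illustrative linearly decreasing schedule over the self-attention stack."""
    return base_ratio * (1.0 - layer_idx / max(num_layers - 1, 1))
```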
Clothing Attributes Assisted Person Reidentification
DOI: 10.1109/TCSVT.2014.2352552. IEEE Transactions on Circuits and Systems for Video Technology, vol. 25, no. 5, pp. 869-87