Adaptive Temporal Encoding Network for Video Instance-level Human Parsing
Beyond the existing single-person and multiple-person human parsing tasks in
static images, this paper makes the first attempt to investigate a more
realistic video instance-level human parsing that simultaneously segments out
each person instance and parses each instance into more fine-grained parts
(e.g., head, leg, dress). We introduce a novel Adaptive Temporal Encoding
Network (ATEN) that alternately performs temporal encoding among key frames and
flow-guided feature propagation to the consecutive frames between two
key frames. Specifically, ATEN first incorporates a Parsing-RCNN to produce the
instance-level parsing result for each key frame, which integrates both the
global human parsing and instance-level human segmentation into a unified
model. To balance accuracy and efficiency, flow-guided feature
propagation is used to directly parse consecutive frames according to their
identified temporal consistency with key frames. On the other hand, ATEN
leverages convolutional gated recurrent units (ConvGRU) to exploit temporal
changes over a series of key frames, which are further used to facilitate
frame-level instance parsing. By alternately performing direct feature
propagation between consistent frames and temporal encoding among key
frames, ATEN achieves a good balance between frame-level accuracy and time
efficiency, a crucial trade-off in video object segmentation
research. To demonstrate the superiority of ATEN, extensive experiments are
conducted on the most popular video segmentation benchmark (DAVIS) and a newly
collected Video Instance-level Parsing (VIP) dataset, which is the first video
instance-level human parsing dataset, comprising 404 sequences and over 20k
frames with instance-level and pixel-wise annotations.
Comment: To appear in ACM MM 2018. Code link: https://github.com/HCPLab-SYSU/ATEN. Dataset link: http://sysu-hcp.net/li
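To make the key-frame / non-key-frame alternation concrete, here is a minimal Python sketch of the control flow described in the abstract. The callables parsing_rcnn, conv_gru, flow_net, and warp, as well as the key-frame interval, are hypothetical placeholders chosen for illustration, not the authors' implementation (see the linked GitHub repository for the official code).

```python
def is_key_frame(t, key_interval=5):
    # Treat every `key_interval`-th frame as a key frame (the interval is an assumption).
    return t % key_interval == 0


def parse_video(frames, parsing_rcnn, conv_gru, flow_net, warp, key_interval=5):
    """Sketch of an ATEN-style alternation between key and non-key frames.

    Hypothetical callables (placeholders, not the authors' API):
      parsing_rcnn(frame)        -> instance-level parsing features for a key frame
      conv_gru(feat, hidden)     -> (encoded_feat, new_hidden), temporal encoding over key frames
      flow_net(key_frame, frame) -> optical flow from the key frame to `frame`
      warp(feat, flow)           -> flow-guided propagation of key-frame features
    """
    results, hidden = [], None
    key_feat = key_frame = None
    for t, frame in enumerate(frames):
        if key_feat is None or is_key_frame(t, key_interval):
            # Key frame: run the heavy Parsing-RCNN and update the ConvGRU temporal memory.
            feat = parsing_rcnn(frame)
            feat, hidden = conv_gru(feat, hidden)
            key_feat, key_frame = feat, frame
        else:
            # Non-key frame: cheaply reuse key-frame features via flow-guided propagation.
            feat = warp(key_feat, flow_net(key_frame, frame))
        results.append(feat)
    return results
```

In this reading, the key-frame interval is the main accuracy/efficiency knob: a shorter interval runs the expensive Parsing-RCNN more often, while a longer one relies more heavily on flow-guided propagation.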
Contrastive Transformation for Self-supervised Correspondence Learning
In this paper, we focus on the self-supervised learning of visual
correspondence using unlabeled videos in the wild. Our method simultaneously
considers intra- and inter-video representation associations for reliable
correspondence estimation. The intra-video learning transforms the image
contents across frames within a single video via the frame pair-wise affinity.
To obtain the discriminative representation for instance-level separation, we
go beyond the intra-video analysis and construct the inter-video affinity to
facilitate the contrastive transformation across different videos. By forcing
the transformation consistency between intra- and inter-video levels, the
fine-grained correspondence associations are well preserved and the
instance-level feature discrimination is effectively reinforced. Our simple
framework outperforms recent self-supervised correspondence methods on a
range of visual tasks, including video object tracking (VOT), video object
segmentation (VOS), and pose keypoint tracking. Notably, our method also
surpasses fully-supervised affinity representations (e.g.,
ResNet) and performs competitively against recent fully-supervised
algorithms designed for specific tasks (e.g., VOT and VOS).
Comment: To appear in AAAI 202