Video-based Sign Language Recognition without Temporal Segmentation
Millions of hearing-impaired people around the world routinely use some variant of sign language to communicate, so the automatic translation of sign language is meaningful and important. Currently, there are two
sub-problems in Sign Language Recognition (SLR), i.e., isolated SLR that
recognizes word by word and continuous SLR that translates entire sentences.
Existing continuous SLR methods typically utilize isolated SLRs as building
blocks, with an extra layer of preprocessing (temporal segmentation) and
another layer of post-processing (sentence synthesis). Unfortunately, temporal
segmentation itself is non-trivial and inevitably propagates errors into
subsequent steps. Worse still, isolated SLR methods typically require strenuous
labeling of each word separately in a sentence, severely limiting the amount of
attainable training data. To address these challenges, we propose a novel
continuous sign recognition framework, the Hierarchical Attention Network with
Latent Space (LS-HAN), which eliminates the preprocessing of temporal
segmentation. The proposed LS-HAN consists of three components: a two-stream
Convolutional Neural Network (CNN) for video feature representation generation,
a Latent Space (LS) for semantic gap bridging, and a Hierarchical Attention
Network (HAN) for latent space based recognition. Experiments are carried out
on two large scale datasets. Experimental results demonstrate the effectiveness
of the proposed framework.
Comment: 32nd AAAI Conference on Artificial Intelligence (AAAI-18), Feb. 2-7, 2018, New Orleans, Louisiana, US
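To illustrate the latent-space component described above (bridging the semantic gap between video features and word semantics), here is a minimal sketch that projects clip features and word embeddings into a shared space and scores their relevance. The class name, dimensions, and layer choices are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: a shared latent space that scores video-clip
# features against word embeddings, in the spirit of the LS component above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedLatentSpace(nn.Module):
    def __init__(self, video_dim=1024, word_dim=300, latent_dim=256):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, latent_dim)  # video branch
        self.word_proj = nn.Linear(word_dim, latent_dim)    # word branch

    def forward(self, clip_feats, word_embs):
        # clip_feats: (num_clips, video_dim), word_embs: (num_words, word_dim)
        v = F.normalize(self.video_proj(clip_feats), dim=-1)
        w = F.normalize(self.word_proj(word_embs), dim=-1)
        return v @ w.t()  # (num_clips, num_words) relevance matrix

# Example: score 8 clips of a sentence video against a 5-word gloss sequence.
ls = SharedLatentSpace()
scores = ls(torch.randn(8, 1024), torch.randn(5, 300))
print(scores.shape)  # torch.Size([8, 5])
```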
Online Filter Clustering and Pruning for Efficient Convnets
Pruning filters is an effective method for accelerating deep neural networks (DNNs), but most existing approaches prune filters of a pre-trained network directly, which limits the achievable acceleration. Although each filter has its own effect in a DNN, if two filters are identical to each other, one of them can be pruned safely. In this paper, we add an extra cluster loss term to the loss function that forces the filters in each cluster to become similar online during training. After training, we keep one filter in each cluster, prune the others, and fine-tune the pruned network to compensate for the resulting loss. In particular, the clusters in every layer can be defined beforehand, which is effective for pruning DNNs that contain residual blocks. Extensive experiments on the CIFAR10 and CIFAR100 benchmarks demonstrate the competitive performance of our proposed filter pruning method.
Comment: 5 pages, 4 figures
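The cluster loss idea above can be sketched as follows: filters assigned to the same cluster are pulled toward their cluster centroid during training, so redundant filters converge and can later be pruned. The fixed assignment scheme, function name, and weighting here are illustrative assumptions, not the paper's procedure.

```python
# Hedged sketch of a cluster-similarity loss on convolution filters.
import torch
import torch.nn as nn

def cluster_loss(conv: nn.Conv2d, assignments: torch.Tensor) -> torch.Tensor:
    # conv.weight: (out_channels, in_channels, k, k); assignments: (out_channels,)
    filters = conv.weight.flatten(start_dim=1)        # one row per filter
    num_clusters = int(assignments.max().item()) + 1
    loss = filters.new_zeros(())
    for c in range(num_clusters):
        members = filters[assignments == c]
        if len(members) > 1:
            centroid = members.mean(dim=0, keepdim=True)
            loss = loss + ((members - centroid) ** 2).sum(dim=1).mean()
    return loss

# Example: 16 filters grouped into 4 clusters; weight this term and add it
# to the usual task (e.g., cross-entropy) loss during training.
conv = nn.Conv2d(3, 16, kernel_size=3)
assignments = torch.arange(16) % 4
extra_term = cluster_loss(conv, assignments)
```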
Spatial and Temporal Mutual Promotion for Video-based Person Re-identification
Video-based person re-identification is a crucial task of matching video
sequences of a person across multiple camera views. Generally, features
directly extracted from a single frame suffer from occlusion, blur,
illumination and posture changes. This leads to false activation or missing
activation in some regions, which corrupts the appearance and motion
representation. How to explore the abundant spatial-temporal information in video sequences is the key to solving this problem. To this end, we propose a Refining Recurrent Unit (RRU) that recovers the missing parts and suppresses the noisy parts of the current frame's features by referring to historical frames.
With RRU, the quality of each frame's appearance representation is improved.
Then we use the Spatial-Temporal clues Integration Module (STIM) to mine the
spatial-temporal information from those upgraded features. Meanwhile, the
multi-level training objective is used to enhance the capability of RRU and
STIM. Through the cooperation of those modules, the spatial and temporal
features mutually promote each other and the final spatial-temporal feature
representation is more discriminative and robust. Extensive experiments are
conducted on three challenging datasets, i.e., iLIDS-VID, PRID-2011 and MARS.
The experimental results demonstrate that our approach outperforms existing
state-of-the-art methods of video-based person re-identification on iLIDS-VID
and MARS, and achieves favorable results on PRID-2011.
Comment: Accepted by AAAI19 as spotlight
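As a rough illustration of the refining idea above (recovering missing parts and suppressing noisy parts of the current frame by referring to historical frames), here is a GRU-cell-based stand-in; the gating design, dimensions, and class name are assumptions, not the authors' RRU.

```python
# Hedged sketch: a gate computed from the historical state decides how much
# of the current frame's feature to keep versus replace with history.
import torch
import torch.nn as nn

class RefiningUnit(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.cell = nn.GRUCell(dim, dim)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, frame_feats):
        # frame_feats: (T, batch, dim) per-frame appearance features
        h = torch.zeros_like(frame_feats[0])
        refined = []
        for x in frame_feats:
            g = self.gate(torch.cat([x, h], dim=-1))  # per-channel trust in x
            x_ref = g * x + (1 - g) * h               # fill missing/noisy parts
            h = self.cell(x_ref, h)                   # update historical state
            refined.append(x_ref)
        return torch.stack(refined)                   # (T, batch, dim)

refined = RefiningUnit()(torch.randn(6, 4, 512))      # 6 frames, batch of 4
```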
Exploiting Spatial-Temporal Context for Interacting Hand Reconstruction on Monocular RGB Video
Reconstructing interacting hands from monocular RGB data is a challenging
task, as it involves many interfering factors, e.g., self- and mutual occlusion and similar textures. Previous works only leverage information from a single RGB image without modeling the physically plausible relation between the two hands, which leads to inferior reconstruction results. In this work, we are dedicated to explicitly
exploiting spatial-temporal information to achieve better interacting hand
reconstruction. On one hand, we leverage temporal context to complement
insufficient information provided by the single frame, and design a novel
temporal framework with a temporal constraint for interacting hand motion
smoothness. On the other hand, we further propose an interpenetration detection
module to produce kinetically plausible interacting hands without physical
collisions. Extensive experiments are performed to validate the effectiveness
of our proposed framework, which achieves new state-of-the-art performance on
public benchmarks.
Comment: 16 pages
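A minimal sketch of the two kinds of constraints the abstract mentions: a temporal smoothness term on predicted joint trajectories and a naive sphere-proxy interpenetration penalty between the two hands. Both are illustrative stand-ins (function names, the sphere radius, and the joint count are assumptions), not the paper's actual modules.

```python
# Hedged sketch of temporal-smoothness and interpenetration penalties.
import torch

def smoothness_loss(joints):
    # joints: (T, num_joints, 3) predicted 3D joints over time
    velocity = joints[1:] - joints[:-1]
    accel = velocity[1:] - velocity[:-1]
    return (accel ** 2).mean()                   # penalize jittery motion

def interpenetration_loss(left, right, radius=0.008):
    # left, right: (num_joints, 3); treat each joint as a sphere of `radius` m
    dists = torch.cdist(left, right)             # pairwise joint distances
    overlap = torch.clamp(2 * radius - dists, min=0.0)
    return (overlap ** 2).sum()                  # zero when no spheres collide

T, J = 10, 21
loss = smoothness_loss(torch.randn(T, J, 3)) + \
       interpenetration_loss(torch.rand(J, 3), torch.rand(J, 3))
```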
Contrastive Transformation for Self-supervised Correspondence Learning
In this paper, we focus on the self-supervised learning of visual
correspondence using unlabeled videos in the wild. Our method simultaneously
considers intra- and inter-video representation associations for reliable
correspondence estimation. The intra-video learning transforms the image
contents across frames within a single video via the frame pair-wise affinity.
To obtain the discriminative representation for instance-level separation, we
go beyond the intra-video analysis and construct the inter-video affinity to
facilitate the contrastive transformation across different videos. By forcing
the transformation consistency between intra- and inter-video levels, the
fine-grained correspondence associations are well preserved and the
instance-level feature discrimination is effectively reinforced. Our simple
framework outperforms the recent self-supervised correspondence methods on a
range of visual tasks including video object tracking (VOT), video object
segmentation (VOS), pose keypoint tracking, etc. It is worth mentioning that
our method also surpasses the fully-supervised affinity representation (e.g.,
ResNet) and performs competitively against the recent fully-supervised
algorithms designed for the specific tasks (e.g., VOT and VOS).
Comment: To appear in AAAI 202
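The core intra-video operation described above, re-expressing one frame's content from another via frame pair-wise affinity, can be sketched as below; the feature shapes and temperature are illustrative assumptions rather than the paper's settings.

```python
# Hedged sketch: softmax-normalized affinity between two frames' per-pixel
# features, used to reconstruct frame A's features from frame B's content.
import torch
import torch.nn.functional as F

def affinity_transform(feat_a, feat_b, temperature=0.07):
    # feat_a, feat_b: (num_pixels, dim) per-pixel embeddings of two frames
    a = F.normalize(feat_a, dim=-1)
    b = F.normalize(feat_b, dim=-1)
    affinity = F.softmax(a @ b.t() / temperature, dim=-1)  # rows sum to 1
    return affinity @ feat_b   # frame A re-expressed from frame B's content

rec_a = affinity_transform(torch.randn(64, 128), torch.randn(64, 128))
# A reconstruction loss between rec_a and the original frame-A features (or
# pixels) supervises the affinity without any labels.
```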
MA2CL: Masked Attentive Contrastive Learning for Multi-Agent Reinforcement Learning
Recent approaches have utilized self-supervised auxiliary tasks as
representation learning to improve the performance and sample efficiency of
vision-based reinforcement learning algorithms in single-agent settings.
However, in multi-agent reinforcement learning (MARL), these techniques face
challenges because each agent only receives a partial observation of an environment that is influenced by the other agents, resulting in correlated observations along the agent dimension. It is therefore necessary to consider agent-level information in
representation learning for MARL. In this paper, we propose an effective
framework called Multi-Agent Masked Attentive Contrastive Learning (MA2CL), which encourages the learned representation to be both temporally and agent-level predictive by reconstructing masked agent observations in latent space. Specifically, we use an attention-based reconstruction model for the recovery, and the model is trained via contrastive learning. MA2CL allows better utilization of
contextual information at the agent level, facilitating the training of MARL
agents for cooperation tasks. Extensive experiments demonstrate that our method
significantly improves the performance and sample efficiency of different MARL
algorithms and outperforms other methods in various vision-based and
state-based scenarios. Our code can be found at https://github.com/ustchlsong/MA2CL.
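A hedged sketch of the masked-agent reconstruction idea described above: mask some agents' latent observations, reconstruct them by attending over the team, and train with an InfoNCE-style contrastive objective against the true latents. Module names, dimensions, and the mask ratio are assumptions; see the linked repository for the actual implementation.

```python
# Illustrative sketch of masked attentive reconstruction with a contrastive loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedAgentReconstructor(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, agent_latents, mask):
        # agent_latents: (batch, num_agents, dim); mask: (batch, num_agents) bool
        x = torch.where(mask.unsqueeze(-1), self.mask_token, agent_latents)
        recon, _ = self.attn(x, x, x)              # attend over all agents
        # InfoNCE over masked slots: each reconstruction should match its own
        # agent's true latent rather than the other masked agents' latents.
        q = F.normalize(recon[mask], dim=-1)
        k = F.normalize(agent_latents[mask], dim=-1)
        logits = q @ k.t() / 0.1
        return F.cross_entropy(logits, torch.arange(len(q), device=q.device))

model = MaskedAgentReconstructor()
latents = torch.randn(8, 5, 64)                    # 8 environments, 5 agents
mask = torch.rand(8, 5) < 0.3                      # mask ~30% of agent slots
loss = model(latents, mask)
```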