Graph Distillation for Action Detection with Privileged Modalities
We propose a technique that tackles action detection in multimodal videos
under a realistic and challenging condition in which only limited training data
and partially observed modalities are available. Common methods in transfer
learning do not take advantage of the extra modalities potentially available in
the source domain. On the other hand, previous work on multimodal learning only
focuses on a single domain or task and does not handle the modality discrepancy
between training and testing. In this work, we propose a method termed graph
distillation that incorporates rich privileged information from a large-scale
multimodal dataset in the source domain, and improves the learning in the
target domain where training data and modalities are scarce. We evaluate our
approach on action classification and detection tasks in multimodal videos, and
show that our model outperforms the state-of-the-art by a large margin on the
NTU RGB+D and PKU-MMD benchmarks. The code is released at
http://alan.vision/eccv18_graph/. Comment: ECCV 2018
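The abstract describes graph distillation only at a high level, so the following is a minimal sketch of the general idea rather than the released implementation: each modality-specific network is distilled from a weighted mixture of the other modalities' softened predictions, with the mixture weights playing the role of graph edges. All names (`modality_logits`, `edge_weights`) and the temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def graph_distillation_loss(modality_logits, edge_weights, temperature=2.0):
    """Sketch: each modality is a student of a weighted mixture of the other
    modalities' softened predictions; the mixture weights are the graph edges.

    modality_logits: list of [B, C] logit tensors, one per modality
    edge_weights:    [n, n] non-negative tensor of edge strengths
    """
    n = len(modality_logits)
    losses = []
    for i, student in enumerate(modality_logits):
        # Normalise the incoming edge weights so the teacher is a distribution.
        w = edge_weights[i].clone()
        w[i] = 0.0
        w = w / w.sum()
        teacher = sum(
            w[j] * F.softmax(modality_logits[j].detach() / temperature, dim=-1)
            for j in range(n) if j != i
        )
        losses.append(F.kl_div(F.log_softmax(student / temperature, dim=-1),
                               teacher, reduction="batchmean"))
    return torch.stack(losses).mean()
```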
Pseudo-labels for Supervised Learning on Dynamic Vision Sensor Data, Applied to Object Detection under Ego-motion
In recent years, dynamic vision sensors (DVS), also known as event-based
cameras or neuromorphic sensors, have seen increased use due to various
advantages over conventional frame-based cameras. Using principles inspired by the retina, their high temporal resolution overcomes motion blur, their high dynamic range copes with extreme illumination conditions, and their low power consumption makes them ideal for embedded systems on platforms such as drones and self-driving cars. However, event-based data sets are scarce and labels are
even rarer for tasks such as object detection. We transferred discriminative
knowledge from a state-of-the-art frame-based convolutional neural network
(CNN) to the event-based modality via intermediate pseudo-labels, which are
used as targets for supervised learning. We show, for the first time,
event-based car detection under ego-motion in a real environment at 100 frames
per second with a test average precision of 40.3% relative to our annotated
ground truth. The event-based car detector handles motion blur and poor
illumination conditions despite not being explicitly trained to do so, and even complements frame-based CNN detectors, suggesting that it has learnt generalized visual representations.
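As a rough illustration of the pseudo-labelling pipeline (not the authors' code), the sketch below assumes synchronized RGB frames and event tensors and a torchvision-style detection interface; `frame_detector`, `event_detector`, and the 0.5 score threshold are all hypothetical.

```python
import torch

@torch.no_grad()
def make_pseudo_labels(frame_detector, rgb_frames, score_threshold=0.5):
    """Run the frame-based teacher on synchronized RGB frames and keep its
    confident detections as pseudo ground truth (threshold is illustrative)."""
    detections = frame_detector(rgb_frames)  # assumed to return boxes/scores/labels
    pseudo = []
    for d in detections:
        keep = d["scores"] > score_threshold
        pseudo.append({"boxes": d["boxes"][keep], "labels": d["labels"][keep]})
    return pseudo

def train_step(event_detector, optimizer, event_tensor, pseudo_labels):
    """One supervised step on event-based input, using pseudo-labels as targets
    (assumes a torchvision-style detector that returns a dict of losses)."""
    loss_dict = event_detector(event_tensor, pseudo_labels)
    loss = sum(loss_dict.values())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```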
I^2MD: 3D Action Representation Learning with Inter- and Intra-modal Mutual Distillation
Recent progress in self-supervised 3D human action representation learning is largely attributed to contrastive learning. However, in conventional contrastive frameworks, the rich complementarity between different skeleton modalities remains under-explored. Moreover, when optimized to distinguish self-augmented samples, models struggle with the numerous similar positive instances that arise when action categories are limited. In this work, we tackle the aforementioned problems by introducing a general Inter- and Intra-modal Mutual Distillation (I^2MD) framework. In I^2MD, we first re-formulate the cross-modal interaction as a Cross-modal Mutual Distillation (CMD) process.
Different from existing distillation solutions that transfer the knowledge of a
pre-trained and fixed teacher to the student, in CMD, the knowledge is
continuously updated and bidirectionally distilled between modalities during
pre-training. To alleviate the interference of similar samples and exploit
their underlying contexts, we further design the Intra-modal Mutual
Distillation (IMD) strategy. In IMD, the Dynamic Neighbors Aggregation (DNA)
mechanism is first introduced, where an additional cluster-level discrimination
branch is instantiated in each modality. It adaptively aggregates
highly-correlated neighboring features, forming local cluster-level
contrasting. Mutual distillation is then performed between the two branches for
cross-level knowledge exchange. Extensive experiments on three datasets show
that our approach sets a series of new records. Comment: submitted to IJCV. arXiv admin note: substantial text overlap with
arXiv:2208.1244
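The cross-modal mutual distillation step can be pictured with the simplified sketch below. Note that the actual CMD formulation distills neighbour-similarity distributions in the contrastive embedding space; for brevity this sketch applies the same bidirectional idea to plain logits, and the names and temperature are illustrative.

```python
import torch.nn.functional as F

def mutual_distillation_loss(logits_a, logits_b, temperature=4.0):
    """Bidirectional distillation: each modality's network is the student of
    the other's detached, softened predictions, so knowledge flows both ways
    while both networks keep being updated during pre-training."""
    p_a = F.softmax(logits_a.detach() / temperature, dim=-1)
    p_b = F.softmax(logits_b.detach() / temperature, dim=-1)
    loss_a = F.kl_div(F.log_softmax(logits_a / temperature, dim=-1), p_b,
                      reduction="batchmean")
    loss_b = F.kl_div(F.log_softmax(logits_b / temperature, dim=-1), p_a,
                      reduction="batchmean")
    return loss_a + loss_b
```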
Cross Modal Distillation for Supervision Transfer
In this work we propose a technique that transfers supervision between images
from different modalities. We use learned representations from a large labeled
modality as a supervisory signal for training representations for a new
unlabeled paired modality. Our method enables learning of rich representations
for unlabeled modalities and can be used as a pre-training procedure for new
modalities with limited labeled data. We show experimental results where we
transfer supervision from labeled RGB images to unlabeled depth and optical
flow images and demonstrate large improvements for both these cross modal
supervision transfers. Code, data and pre-trained models are available at
https://github.com/s-gupta/fast-rcnn/tree/distillation. Comment: Updated version (v2) contains additional experiments and results.
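A minimal sketch of the supervision-transfer idea, assuming paired RGB/depth images and mid-level feature maps of matching shape; `rgb_teacher` and `depth_student` are placeholder names, not the released Fast R-CNN code.

```python
import torch
import torch.nn.functional as F

def supervision_transfer_loss(rgb_teacher, depth_student, rgb_image, depth_image):
    """Align the unlabeled modality's mid-level features with the labeled
    modality's features on spatially paired images."""
    with torch.no_grad():
        target_feat = rgb_teacher(rgb_image)   # frozen network trained on labeled RGB
    student_feat = depth_student(depth_image)  # trainable network for the new modality
    return F.mse_loss(student_feat, target_feat)
```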
XKD: Cross-modal Knowledge Distillation with Domain Alignment for Video Representation Learning
We present XKD, a novel self-supervised framework to learn meaningful
representations from unlabelled video clips. XKD is trained with two pseudo
tasks. First, masked data reconstruction is performed to learn individual
representations from audio and visual streams. Next, self-supervised
cross-modal knowledge distillation is performed between the two modalities
through teacher-student setups to learn complementary information. To identify
the most effective information to transfer and also to tackle the domain gap
between audio and visual modalities which could hinder knowledge transfer, we
introduce a domain alignment and feature refinement strategy for effective
cross-modal knowledge distillation. Lastly, to develop a general-purpose
network capable of handling both audio and visual streams, modality-agnostic
variants of our proposed framework are introduced, which use the same backbone
for both audio and visual modalities. Our proposed cross-modal knowledge
distillation improves linear evaluation top-1 accuracy of video action
classification by 8.6% on UCF101, 8.2% on HMDB51, 13.9% on Kinetics-Sound, and
15.7% on Kinetics400. Additionally, our modality-agnostic variant shows
promising results in developing a general-purpose network capable of learning
both data streams for solving different downstream tasks.
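A very rough sketch of the two pseudo-tasks, under the assumption of per-modality encoders/decoders and globally pooled features; the paper's domain-alignment and feature-refinement strategy is omitted, and every name here is hypothetical.

```python
import torch.nn.functional as F

def xkd_style_losses(audio_enc, video_enc, audio_dec, video_dec,
                     masked_audio, masked_video, audio_target, video_target):
    """Two pseudo-tasks: per-stream masked reconstruction, plus symmetric
    cross-modal distillation on pooled features (domain alignment omitted)."""
    a_feat = audio_enc(masked_audio)   # assumed [B, T_a, D]
    v_feat = video_enc(masked_video)   # assumed [B, T_v, D]

    # Pseudo-task 1: reconstruct the unmasked targets from masked inputs.
    recon_loss = (F.mse_loss(audio_dec(a_feat), audio_target) +
                  F.mse_loss(video_dec(v_feat), video_target))

    # Pseudo-task 2: each stream matches the other's detached pooled feature.
    a_global = F.normalize(a_feat.mean(dim=1), dim=-1)
    v_global = F.normalize(v_feat.mean(dim=1), dim=-1)
    distill_loss = (F.mse_loss(a_global, v_global.detach()) +
                    F.mse_loss(v_global, a_global.detach()))
    return recon_loss, distill_loss
```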
Feature-Supervised Action Modality Transfer
This paper strives for action recognition and detection in video modalities like RGB, depth maps or 3D-skeleton sequences when only limited modality-specific labeled examples are available. For the RGB modality, and the optical-flow modality derived from it, many large-scale labeled datasets are available. They have become the de facto pre-training choice when recognizing or detecting new actions in RGB datasets with limited labeled examples. Unfortunately, large-scale labeled action datasets for other modalities are unavailable for pre-training. In this paper, our goal is to recognize actions from limited examples in non-RGB video modalities by learning from large-scale labeled RGB data. To this end, we propose a two-step training process: (i) we extract action representation knowledge from an RGB-trained teacher network and adapt it to a non-RGB student network; (ii) we then fine-tune the transferred model with the available labeled examples of the target modality. For the knowledge transfer, we introduce feature-supervision strategies, which rely on unlabeled pairs of the two modalities (RGB and the target modality) to transfer feature-level representations from the teacher to the student network. Ablations and generalizations with two RGB source datasets and two non-RGB target datasets demonstrate that an optical-flow teacher provides better action transfer features than RGB for both depth maps and 3D-skeletons, even when evaluated on a different target domain or for a different task. Compared to alternative cross-modal action transfer methods, we show a good improvement in performance, especially when labeled non-RGB examples to learn from are scarce.
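Step (i) of the two-step process might look like the sketch below, assuming time-aligned unlabeled clips in the two modalities and feature vectors of matching dimension; `flow_teacher` and `skeleton_student` are illustrative names, not the paper's code.

```python
import torch
import torch.nn.functional as F

def feature_supervision_step(flow_teacher, skeleton_student, optimizer,
                             flow_clip, skeleton_clip):
    """Step (i): on an unlabeled, time-aligned pair, the non-RGB student
    regresses the frozen teacher's feature-level representation.
    Step (ii), not shown, fine-tunes the student with the few labeled
    target-modality examples."""
    with torch.no_grad():
        teacher_feat = flow_teacher(flow_clip)      # frozen optical-flow teacher
    student_feat = skeleton_student(skeleton_clip)  # trainable target-modality net
    loss = F.mse_loss(student_feat, teacher_feat)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```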
Physical-aware Cross-modal Adversarial Network for Wearable Sensor-based Human Action Recognition
Wearable sensor-based Human Action Recognition (HAR) has made significant
strides in recent times. However, the accuracy of wearable sensor-based HAR still lags behind that of systems based on visual modalities, such as RGB video and depth data. Although diverse input modalities can provide complementary cues and improve HAR accuracy, wearable devices can only capture a limited range of non-visual time-series inputs, such as accelerometer and gyroscope signals. This limitation hinders deploying multimodal approaches that use visual and non-visual modality data in parallel on current wearable devices. To address
this issue, we propose a novel Physical-aware Cross-modal Adversarial (PCA)
framework that utilizes only time-series accelerometer data from four inertial
sensors for the wearable sensor-based HAR problem. Specifically, we propose an
effective IMU2SKELETON network to produce corresponding synthetic skeleton
joints from accelerometer data. Subsequently, we impose additional constraints on the synthetic skeleton data from a physical perspective, since accelerometer data can be regarded as the second derivative of the skeleton sequence coordinates. The original accelerometer data and the constrained skeleton sequence are then fused for the final classification. In this
way, when individuals wear wearable devices, the devices can not only capture
accelerometer data, but can also generate synthetic skeleton sequences for
real-time wearable sensor-based HAR applications that need to be conducted
anytime and anywhere. To demonstrate the effectiveness of our proposed PCA
framework, we conduct extensive experiments on Berkeley-MHAD, UTD-MHAD, and
MMAct datasets. The results confirm that the proposed PCA approach has
competitive performance compared to previous methods on the mono-sensor-based HAR classification problem. Comment: First IMU2SKELETON GAN approach for the wearable HAR problem. arXiv
admin note: text overlap with arXiv:2208.0809
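The physical constraint mentioned above can be illustrated with the sketch below: the second finite difference of the synthesized joint trajectory is matched against the measured acceleration. The joint-to-sensor mapping, tensor shapes, and the omission of gravity compensation are all assumptions of this sketch, not details from the paper.

```python
import torch.nn.functional as F

def physical_consistency_loss(synthetic_joints, accel, dt, sensor_joint_idx):
    """Match the second finite difference of the synthesized joint positions
    to the measured accelerometer signal.

    synthetic_joints: [B, T, J, 3] joint coordinates from the IMU2SKELETON net
    accel:            [B, T-2, S, 3] readings from S body-worn accelerometers
    dt:               sampling interval in seconds
    sensor_joint_idx: length-S list mapping each sensor to the joint it sits on
    (gravity compensation and sensor orientation are ignored in this sketch)
    """
    # Central second difference along time: (x[t+1] - 2*x[t] + x[t-1]) / dt^2.
    second_diff = (synthetic_joints[:, 2:] - 2 * synthetic_joints[:, 1:-1]
                   + synthetic_joints[:, :-2]) / (dt ** 2)
    predicted_accel = second_diff[:, :, sensor_joint_idx, :]
    return F.mse_loss(predicted_accel, accel)
```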