Knowledge Distillation for Action Anticipation via Label Smoothing
The human capability to anticipate the near future from visual observations and non-verbal cues is essential for developing intelligent systems that need to interact with people. Several research areas, such as human-robot interaction (HRI), assisted living, and autonomous driving, need to foresee future events to avoid crashes or to help people. Egocentric scenarios, with their numerous applications, are a classic setting for action anticipation. Such a challenging task demands capturing and modeling the domain's hidden structure to reduce prediction uncertainty. Since multiple actions may equally plausibly occur in the future, we treat action anticipation as a multi-label problem with missing labels, extending the concept of label smoothing. This idea resembles the knowledge distillation process, since useful information is injected into the model during training. We implement a multi-modal framework based on long short-term memory (LSTM) networks to summarize past observations and make predictions at different time steps. We perform extensive experiments on the EPIC-Kitchens and EGTEA Gaze+ datasets, which include more than 2500 and 100 action classes, respectively. The experiments show that label smoothing systematically
improves the performance of state-of-the-art models for action anticipation.
Comment: Accepted to ICPR 2020
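Since the abstract does not spell out the exact smoothing scheme, the snippet below is a minimal PyTorch sketch of the baseline idea the paper extends: training against softened rather than one-hot targets. The uniform spreading and the `eps` value are illustrative assumptions; the paper's variant places the smoothing mass on plausible future actions instead of spreading it uniformly.

```python
import torch
import torch.nn.functional as F

def smoothed_cross_entropy(logits: torch.Tensor, labels: torch.Tensor,
                           eps: float = 0.1) -> torch.Tensor:
    """Cross-entropy against softened targets: keep 1 - eps on the annotated
    action and spread eps over all classes. The paper's variant instead
    spreads mass over semantically plausible (possibly unannotated) actions."""
    num_classes = logits.size(-1)
    one_hot = F.one_hot(labels, num_classes).float()
    targets = one_hot * (1.0 - eps) + eps / num_classes
    log_probs = F.log_softmax(logits, dim=-1)
    return -(targets * log_probs).sum(dim=-1).mean()

# toy usage: batch of 4 clips, 10 action classes
logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
loss = smoothed_cross_entropy(logits, labels)
```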
Distilling Knowledge for Short-to-Long Term Trajectory Prediction
Long-term trajectory forecasting is an important and challenging problem in the fields of computer vision, machine learning, and robotics. One fundamental difficulty lies in the evolution of the trajectory, which becomes more and more uncertain and unpredictable as the time horizon grows, increasing the complexity of the problem. To overcome this issue, in this paper we propose Di-Long, a new method that uses a short-term trajectory forecaster as a teacher to guide a student network for long-term trajectory prediction during training. Given a total sequence length that comprises the allowed observation for the student network and the complementary target sequence, we let the student and the teacher solve two different but related tasks defined over the same full trajectory: the student observes a short sequence and predicts a long trajectory, whereas the teacher observes a longer sequence and predicts the remaining short target trajectory. The teacher's task is less uncertain, and we use its accurate predictions to guide the student through our knowledge distillation framework, reducing long-term future uncertainty. Our experiments show that the proposed Di-Long method is effective for long-term forecasting and achieves state-of-the-art performance on the Intersection Drone Dataset (inD) and the Stanford Drone Dataset (SDD).
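The abstract fixes the two tasks but not the architectures or losses, so the following PyTorch sketch only illustrates the training signal under stated assumptions: the `Forecaster` model, the MSE losses, and the weight `alpha` are hypothetical stand-ins, not Di-Long's actual implementation.

```python
import torch
import torch.nn as nn

class Forecaster(nn.Module):
    """Hypothetical autoregressive (x, y) forecaster used for illustration."""
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.enc = nn.GRU(2, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)

    def forward(self, obs: torch.Tensor, horizon: int) -> torch.Tensor:
        _, h = self.enc(obs)                      # encode the observed prefix
        preds, step = [], obs[:, -1:]
        for _ in range(horizon):                  # roll the state forward
            out, h = self.enc(step, h)
            step = self.head(out)
            preds.append(step)
        return torch.cat(preds, dim=1)

def di_long_step(student, teacher, traj, t_obs_s: int, t_obs_t: int,
                 alpha: float = 0.5) -> torch.Tensor:
    """One training step: the teacher observes a longer prefix (t_obs_t) and
    predicts the short remainder; the student observes a shorter prefix
    (t_obs_s) and predicts the long future, whose tail is pulled toward the
    teacher's less uncertain predictions."""
    T = traj.size(1)
    student_pred = student(traj[:, :t_obs_s], T - t_obs_s)      # long horizon
    with torch.no_grad():
        teacher_pred = teacher(traj[:, :t_obs_t], T - t_obs_t)  # short horizon
    gt_loss = nn.functional.mse_loss(student_pred, traj[:, t_obs_s:])
    kd_loss = nn.functional.mse_loss(student_pred[:, -(T - t_obs_t):],
                                     teacher_pred)
    return gt_loss + alpha * kd_loss
```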
Deep Virtual-to-Real Distillation for Pedestrian Crossing Prediction
Pedestrian crossing is one of the most typical behaviors that conflict with the natural driving behavior of vehicles. Consequently, pedestrian crossing prediction is one of the primary tasks that influence vehicle planning for safe driving. However, current methods that rely on data collected in real driving scenes cannot depict and cover all kinds of scene conditions in the real traffic world. To this end, we formulate a deep virtual-to-real distillation framework by introducing synthetic data that can be generated conveniently, and borrow the abundant information on pedestrian movement in synthetic videos for pedestrian crossing prediction on real data with a simple and lightweight implementation. In order to verify this framework, we construct a benchmark with 4667 virtual videos containing about 745k frames (called Virtual-PedCross-4667), and evaluate the proposed method on two challenging datasets collected in real driving situations, i.e., the JAAD and PIE datasets. State-of-the-art performance of this framework is demonstrated by exhaustive experimental analysis. The dataset and code can be downloaded from the
website http://www.lotvs.net/code_data/.
Comment: Accepted by ITSC 2022
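The abstract does not detail how the synthetic-domain knowledge is transferred, so here is a hedged sketch of the most common choice, Hinton-style logit distillation, assuming a teacher trained on the synthetic videos produces soft targets for the student's real-data samples; the temperature `T` and weight `alpha` are illustrative hyper-parameters, not the paper's values.

```python
import torch.nn.functional as F

def virtual_to_real_kd_loss(student_logits, teacher_logits, labels,
                            T: float = 2.0, alpha: float = 0.5):
    """Hard loss on real labels plus a soft loss toward the synthetic-domain
    teacher's temperature-scaled predictions (crossing / not crossing)."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * hard + (1.0 - alpha) * soft
```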
Jointly-Learnt Networks for Future Action Anticipation Via Self-Knowledge Distillation and Cycle Consistency
Future action anticipation aims to infer future actions from the observation of a small set of past video frames. In this paper, we propose a novel Jointly learnt Action Anticipation Network (J-AAN) via Self-Knowledge Distillation (Self-KD) and cycle consistency for future action anticipation. In contrast to the current state-of-the-art methods, which anticipate future actions either directly or recursively, our proposed J-AAN anticipates the future actions jointly in both direct and recursive ways. When dealing with future action anticipation, one important challenge to address is the future's uncertainty, since multiple action sequences may come from or be followed by the same action. Training an action anticipation model with one-hot-encoded hard labels that assign zero probability to incorrect yet semantically similar actions may not handle the uncertain future. To address this challenge, we design a Self-KD mechanism to train our J-AAN, where the J-AAN gradually distills its own knowledge during training to soften the hard labels and thus model the uncertainty of future action anticipation. Furthermore, we design a forward and backward action anticipation framework with our proposed J-AAN based on a cyclic consistency constraint: the forward J-AAN anticipates the future actions from the observed past actions, and the backward J-AAN verifies the forward J-AAN's anticipation by anticipating the past actions from the anticipated future actions. The proposed method outperforms the latest state-of-the-art action anticipation methods on the Breakfast, 50Salads, and EPIC-Kitchens-55 datasets. This project will be publicly available at https://github.com/MoniruzzamanMd/J-AAN
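The label-softening mechanism can be sketched as mixing the one-hot target with the model's own earlier predictions; everything below (the mixing weight `alpha`, temperature `T`, and the use of previous-epoch logits) is an illustrative reading of the abstract, not the paper's exact formulation.

```python
import torch.nn.functional as F

def self_kd_targets(past_logits, labels, num_classes: int,
                    alpha: float = 0.7, T: float = 3.0):
    """Soften one-hot labels with the model's own earlier (e.g. previous
    epoch) predictions, so semantically similar future actions keep
    non-zero probability."""
    one_hot = F.one_hot(labels, num_classes).float()
    self_soft = F.softmax(past_logits.detach() / T, dim=-1)
    return alpha * one_hot + (1.0 - alpha) * self_soft

def cycle_loss(forward_future_logits, future_targets,
               backward_past_logits, observed_labels):
    """Cycle-consistency sketch: the forward model anticipates the future;
    a backward model must recover the observed past from that future."""
    fwd = -(future_targets
            * F.log_softmax(forward_future_logits, -1)).sum(-1).mean()
    bwd = F.cross_entropy(backward_past_logits, observed_labels)
    return fwd + bwd
```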
Self-supervised and semi-supervised learning for road condition estimation from distributed road-side cameras
Monitoring road conditions, e.g., water build-up due to intense rainfall, plays a fundamental role in ensuring road safety while increasing resilience to the effects of climate change. Distributed cameras provide an easy and affordable alternative to instrumented weather stations, enabling diffused and capillary road monitoring. Here, we propose a deep learning-based solution to automatically detect wet road events in continuous video streams acquired by road-side surveillance cameras. Our contribution is two-fold: first, we employ a convolutional Long Short-Term Memory model (convLSTM) to detect subtle changes in the road appearance, introducing a novel temporally consistent data augmentation to increase robustness to outdoor illumination conditions. Second, we present a contrastive self-supervised framework that is uniquely tailored to
surveillance camera networks. The proposed technique was validated on a large-scale dataset comprising roughly 2000 full-day sequences (roughly 400K video frames, of which 300K are unlabelled), acquired from several road-side cameras over a span of two years. Experimental results show the effectiveness of self-supervised and semi-supervised learning, increasing the frame classification performance (measured by the area under the ROC curve) from 0.86 to 0.92. From the standpoint of event detection, we show that incorporating temporal features through a convLSTM model both improves the detection rate of wet road events (+10%) and reduces false positive alarms (-45%). The proposed techniques could also benefit other tasks related to weather analysis from road-side and vehicle-mounted cameras.
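The temporally consistent augmentation can be illustrated with a short sketch: draw the augmentation parameters once per clip and apply them to every frame, so the convLSTM never sees spurious frame-to-frame appearance jumps. The specific photometric transforms below are assumptions; the paper's transform set is not listed in the abstract.

```python
import torch

def temporally_consistent_augment(clip: torch.Tensor) -> torch.Tensor:
    """Apply the same randomly drawn brightness/contrast jitter to every
    frame of a (T, C, H, W) clip with values in [0, 1]."""
    brightness = 1.0 + 0.4 * (torch.rand(1) - 0.5)    # one draw per clip
    contrast = 1.0 + 0.4 * (torch.rand(1) - 0.5)
    mean = clip.mean(dim=(-2, -1), keepdim=True)      # per-frame, per-channel mean
    out = (clip - mean) * contrast + mean
    return (out * brightness).clamp(0.0, 1.0)

# usage: a 16-frame RGB clip
aug = temporally_consistent_augment(torch.rand(16, 3, 64, 64))
```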
AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos?
Can we better anticipate an actor's future actions (e.g. mix eggs) by knowing
what commonly happens after his/her current action (e.g. crack eggs)? What if
we also know the longer-term goal of the actor (e.g. making egg fried rice)?
The long-term action anticipation (LTA) task aims to predict an actor's future
behavior from video observations in the form of verb and noun sequences, and it
is crucial for human-machine interaction. We propose to formulate the LTA task
from two perspectives: a bottom-up approach that predicts the next actions
autoregressively by modeling temporal dynamics; and a top-down approach that
infers the goal of the actor and plans the needed procedure to accomplish the
goal. We hypothesize that large language models (LLMs), which have been
pretrained on procedure text data (e.g. recipes, how-tos), have the potential
to help LTA from both perspectives: they can provide prior knowledge on the possible next actions, and they can infer the goal given the observed part of a procedure. To leverage LLMs, we propose a two-stage
framework, AntGPT. It first recognizes the actions already performed in the
observed videos and then asks an LLM to predict the future actions via
conditioned generation, or to infer the goal and plan the whole procedure by
chain-of-thought prompting. Empirical results on the Ego4D LTA v1 and v2
benchmarks, EPIC-Kitchens-55, as well as EGTEA GAZE+ demonstrate the
effectiveness of our proposed approach. AntGPT achieves state-of-the-art performance on all of the above benchmarks and, as our qualitative analysis shows, can successfully infer the goal and thus perform goal-conditioned "counterfactual" prediction. Code and model will be released at https://brown-palm.github.io/AntGPT
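The two-stage idea can be sketched in a few lines, with the caveat that the `llm` helper below is a hypothetical stand-in for whatever model client is used, and the prompts are illustrative rather than AntGPT's actual ones (which the paper specifies).

```python
def llm(prompt: str) -> str:
    """Hypothetical stand-in: send `prompt` to a chat model, return its text."""
    raise NotImplementedError("plug in an actual LLM client here")

def anticipate(recognized_actions: list[str], horizon: int) -> str:
    # Stage 1 is assumed done upstream: a video model has recognized the
    # observed actions, e.g. ["crack egg", "whisk egg"].
    history = ", ".join(recognized_actions)

    # Top-down: infer the actor's goal from the observed procedure prefix.
    goal = llm(f"A person has done: {history}. "
               "What is their likely goal? Answer with a short phrase.")

    # Bottom-up: predict the next actions conditioned on history and goal.
    return llm(f"A person whose goal is '{goal}' has done: {history}. "
               f"List the next {horizon} actions as 'verb noun' pairs.")
```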