
    Knowledge Distillation for Action Anticipation via Label Smoothing

    The human capability to anticipate the near future from visual observations and non-verbal cues is essential for developing intelligent systems that need to interact with people. Several research areas, such as human-robot interaction (HRI), assisted living, and autonomous driving, need to foresee future events to avoid crashes or help people. Egocentric scenarios are classic settings where action anticipation is applied, owing to their numerous applications. Such a challenging task demands capturing and modeling the domain's hidden structure to reduce prediction uncertainty. Since multiple actions may equally occur in the future, we treat action anticipation as a multi-label problem with missing labels, extending the concept of label smoothing. This idea resembles the knowledge distillation process, since useful information is injected into the model during training. We implement a multi-modal framework based on long short-term memory (LSTM) networks to summarize past observations and make predictions at different time steps. We perform extensive experiments on the EPIC-Kitchens and EGTEA Gaze+ datasets, which include more than 2500 and 100 action classes, respectively. The experiments show that label smoothing systematically improves the performance of state-of-the-art models for action anticipation. Comment: Accepted to ICPR 202
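    A minimal sketch (not the authors' code) of the core idea: instead of a one-hot future-action label, probability mass is spread over a set of plausible related actions, and the model is trained against this soft target. The class count, the relatedness set, and the smoothing weight `alpha` are assumptions for illustration only.

```python
# Sketch of label smoothing over plausible future actions (assumed setup, not the paper's code).
import torch

def smoothed_target(num_classes: int, gt_class: int, related: list, alpha: float = 0.1) -> torch.Tensor:
    """Assign (1 - alpha) to the ground-truth action and spread alpha over a
    hypothetical set of related, equally plausible future actions."""
    target = torch.zeros(num_classes)
    target[gt_class] = 1.0 - alpha
    if related:
        target[torch.tensor(related)] += alpha / len(related)
    else:
        target += alpha / num_classes  # fall back to uniform smoothing
    return target

def soft_cross_entropy(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between the predicted distribution and the soft target."""
    log_probs = torch.log_softmax(logits, dim=-1)
    return -(target * log_probs).sum(dim=-1).mean()

# Usage: 2513 classes is an assumed vocabulary size ("more than 2500" in the abstract);
# class indices for the ground truth and its related actions are placeholders.
target = smoothed_target(2513, gt_class=42, related=[43, 87, 120])
loss = soft_cross_entropy(torch.randn(1, 2513), target.unsqueeze(0))
```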

    Distilling Knowledge for Short-to-Long Term Trajectory Prediction

    Long-term trajectory forecasting is an important and challenging problem in the fields of computer vision, machine learning, and robotics. One fundamental difficulty lies in the evolution of the trajectory, which becomes more and more uncertain and unpredictable as the time horizon grows, increasing the complexity of the problem. To overcome this issue, in this paper we propose Di-Long, a new method that employs a short-term trajectory forecaster as a teacher that guides a student network for long-term trajectory prediction during training. Given a total sequence length that comprises the observation allowed for the student network and the complementary target sequence, we let the student and the teacher solve two different but related tasks defined over the same full trajectory: the student observes a short sequence and predicts a long trajectory, whereas the teacher observes a longer sequence and predicts the remaining short target trajectory. The teacher's task is less uncertain, and we use its accurate predictions to guide the student through our knowledge distillation framework, reducing long-term future uncertainty. Our experiments show that the proposed Di-Long method is effective for long-term forecasting and achieves state-of-the-art performance on the Intersection Drone Dataset (inD) and the Stanford Drone Dataset (SDD).
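    A rough sketch of the teacher/student split described above, under loose assumptions: toy MLP forecasters stand in for the actual networks, horizon lengths are invented, and the teacher is assumed pretrained and frozen. The student's last predicted steps are additionally supervised by the teacher's short-horizon prediction.

```python
# Illustrative short-to-long distillation loss (assumed setup, not Di-Long's implementation).
import torch
import torch.nn as nn

T_OBS_SHORT, T_OBS_LONG, T_FUT_LONG, T_FUT_SHORT = 8, 16, 12, 4  # hypothetical horizons

class MLPForecaster(nn.Module):
    """Toy stand-in for a trajectory forecasting network over (x, y) positions."""
    def __init__(self, t_in: int, t_out: int):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(t_in * 2, 128),
                                 nn.ReLU(), nn.Linear(128, t_out * 2))
        self.t_out = t_out
    def forward(self, xy: torch.Tensor) -> torch.Tensor:  # xy: (B, t_in, 2)
        return self.net(xy).view(-1, self.t_out, 2)

student = MLPForecaster(T_OBS_SHORT, T_FUT_LONG)
teacher = MLPForecaster(T_OBS_LONG, T_FUT_SHORT)  # assumed pretrained and kept frozen

def short_to_long_loss(full_traj: torch.Tensor, lam: float = 0.5) -> torch.Tensor:
    """full_traj: (B, T_OBS_SHORT + T_FUT_LONG, 2) positions over the full sequence."""
    obs_student = full_traj[:, :T_OBS_SHORT]      # short observation for the student
    obs_teacher = full_traj[:, :T_OBS_LONG]       # longer observation for the teacher
    gt_future = full_traj[:, T_OBS_SHORT:]        # long future ground truth
    pred_student = student(obs_student)           # (B, T_FUT_LONG, 2)
    with torch.no_grad():
        pred_teacher = teacher(obs_teacher)       # (B, T_FUT_SHORT, 2), last steps only
    gt_loss = nn.functional.mse_loss(pred_student, gt_future)
    kd_loss = nn.functional.mse_loss(pred_student[:, -T_FUT_SHORT:], pred_teacher)
    return gt_loss + lam * kd_loss

loss = short_to_long_loss(torch.randn(16, T_OBS_SHORT + T_FUT_LONG, 2))
```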

    Deep Virtual-to-Real Distillation for Pedestrian Crossing Prediction

    Pedestrian crossing is one of the most typical behaviors that conflict with the natural driving behavior of vehicles. Consequently, pedestrian crossing prediction is one of the primary tasks that influence vehicle planning for safe driving. However, current methods that rely on data collected in real driving scenes cannot depict and cover all kinds of scene conditions in real traffic. To this end, we formulate a deep virtual-to-real distillation framework by introducing synthetic data that can be generated conveniently, and borrow the abundant information on pedestrian movement in synthetic videos for pedestrian crossing prediction on real data, with a simple and lightweight implementation. To verify this framework, we construct a benchmark with 4667 virtual videos containing about 745k frames (called Virtual-PedCross-4667) and evaluate the proposed method on two challenging datasets collected in real driving situations, i.e., the JAAD and PIE datasets. State-of-the-art performance of the framework is demonstrated by exhaustive experimental analysis. The dataset and code can be downloaded from http://www.lotvs.net/code_data/. Comment: Accepted by ITSC 202
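    A generic sketch of how a teacher trained on synthetic pedestrian videos could supervise a lightweight student on real data via soft-label distillation; the network, feature dimensions, temperature, and loss weighting are assumptions and may differ from the paper's framework.

```python
# Illustrative virtual-to-real distillation for binary crossing prediction (assumed setup).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyCrossingNet(nn.Module):
    """Toy recurrent encoder over per-frame features -> crossing / not-crossing logits."""
    def __init__(self, feat_dim: int = 128, hidden: int = 64):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)
    def forward(self, clip_feats: torch.Tensor) -> torch.Tensor:  # (B, T, feat_dim)
        _, h = self.gru(clip_feats)
        return self.head(h[-1])

teacher = TinyCrossingNet()            # assumed already trained on synthetic videos, frozen
student = TinyCrossingNet(hidden=32)   # lightweight model trained on real videos

def distill_loss(real_clip: torch.Tensor, label: torch.Tensor,
                 temperature: float = 2.0, lam: float = 0.5) -> torch.Tensor:
    with torch.no_grad():
        t_logits = teacher(real_clip)
    s_logits = student(real_clip)
    hard = F.cross_entropy(s_logits, label)                      # supervised term
    soft = F.kl_div(F.log_softmax(s_logits / temperature, dim=-1),
                    F.softmax(t_logits / temperature, dim=-1),
                    reduction="batchmean") * temperature ** 2    # distillation term
    return hard + lam * soft

loss = distill_loss(torch.randn(8, 16, 128), torch.randint(0, 2, (8,)))
```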

    Jointly-Learnt Networks for Future Action Anticipation Via Self-Knowledge Distillation and Cycle Consistency

    Future action anticipation aims to infer future actions from the observation of a small set of past video frames. In this paper, we propose a novel Jointly learnt Action Anticipation Network (J-AAN) via Self-Knowledge Distillation (Self-KD) and cycle consistency for future action anticipation. In contrast to current state-of-the-art methods, which anticipate future actions either directly or recursively, our proposed J-AAN anticipates future actions jointly in both direct and recursive ways. One important challenge in future action anticipation is the future's uncertainty, since multiple action sequences may come from, or be followed by, the same action. Training an action anticipation model with one-hot-encoded hard labels, which assign zero probability to incorrect yet semantically similar actions, may not handle this uncertain future. To address this challenge, we design a Self-KD mechanism to train our J-AAN, where the J-AAN gradually distills its own knowledge during training to soften the hard labels and model the uncertainty in future action anticipation. Furthermore, we design a forward and backward action anticipation framework with our proposed J-AAN based on a cyclic consistency constraint. The forward J-AAN anticipates the future actions from the observed past actions, and the backward J-AAN verifies the anticipation of the forward J-AAN by anticipating the past actions from the anticipated future actions. The proposed method outperforms the latest state-of-the-art action anticipation methods on the Breakfast, 50Salads, and EPIC-Kitchens-55 datasets. This project will be publicly available at https://github.com/MoniruzzamanMd/J-AAN.
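    A minimal sketch, under loose assumptions about J-AAN's details, of the two training signals described above: self-knowledge distillation softens the one-hot label with the model's own detached prediction, and a cycle term asks a backward model to recover the last observed action from the anticipated future. Network choices, vocabulary size, and weights are placeholders.

```python
# Illustrative Self-KD + cycle-consistency losses for action anticipation (assumed setup).
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_ACTIONS = 48  # hypothetical action vocabulary size

forward_net, forward_head = nn.GRU(NUM_ACTIONS, 64, batch_first=True), nn.Linear(64, NUM_ACTIONS)
backward_net, backward_head = nn.GRU(NUM_ACTIONS, 64, batch_first=True), nn.Linear(64, NUM_ACTIONS)

def anticipate(net, head, seq):
    out, _ = net(seq)
    return head(out[:, -1])  # predict the next action from the last hidden state

def jaan_style_loss(past, future_label, beta=0.3, gamma=0.5):
    """past: (B, T, NUM_ACTIONS) one-hot observed actions; future_label: (B,) next action."""
    logits = anticipate(forward_net, forward_head, past)
    # Self-KD: soften the hard label with the model's own detached prediction.
    one_hot = F.one_hot(future_label, NUM_ACTIONS).float()
    soft_target = (1 - beta) * one_hot + beta * F.softmax(logits.detach(), dim=-1)
    kd_loss = -(soft_target * F.log_softmax(logits, dim=-1)).sum(-1).mean()
    # Cycle consistency: the anticipated future should let us recover the last past action.
    pred_future = F.softmax(logits, dim=-1).unsqueeze(1)             # (B, 1, NUM_ACTIONS)
    back_logits = anticipate(backward_net, backward_head, pred_future)
    cycle_loss = F.cross_entropy(back_logits, past[:, -1].argmax(-1))
    return kd_loss + gamma * cycle_loss

past = F.one_hot(torch.randint(0, NUM_ACTIONS, (4, 6)), NUM_ACTIONS).float()
loss = jaan_style_loss(past, torch.randint(0, NUM_ACTIONS, (4,)))
```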

    Self-supervised and semi-supervised learning for road condition estimation from distributed road-side cameras

    Monitoring road conditions, e.g., water build-up due to intense rainfall, plays a fundamental role in ensuring road safety while increasing resilience to the effects of climate change. Distributed cameras provide an easy and affordable alternative to instrumented weather stations, enabling diffuse and capillary road monitoring. Here, we propose a deep learning-based solution to automatically detect wet-road events in continuous video streams acquired by road-side surveillance cameras. Our contribution is two-fold: first, we employ a convolutional Long Short-Term Memory model (convLSTM) to detect subtle changes in road appearance, introducing a novel temporally consistent data augmentation to increase robustness to outdoor illumination conditions. Second, we present a contrastive self-supervised framework that is uniquely tailored to surveillance camera networks. The proposed technique was validated on a large-scale dataset comprising roughly 2000 full-day sequences (roughly 400K video frames, of which 300K are unlabelled), acquired from several road-side cameras over a span of two years. Experimental results show the effectiveness of self-supervised and semi-supervised learning, increasing frame classification performance (measured by the area under the ROC curve) from 0.86 to 0.92. From the standpoint of event detection, we show that incorporating temporal features through a convLSTM model both improves the detection rate of wet-road events (+10%) and reduces false positive alarms (–45%). The proposed techniques could also benefit other tasks related to weather analysis from road-side and vehicle-mounted cameras.
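    An illustrative sketch of what "temporally consistent" augmentation could look like: augmentation parameters are sampled once per clip and applied identically to every frame, so photometric jitter does not masquerade as a change in road appearance. The parameter ranges and transforms are assumptions, not the paper's values.

```python
# Sketch of temporally consistent clip augmentation (assumed parameters and transforms).
import torch
import torchvision.transforms.functional as TF

def augment_clip(clip: torch.Tensor) -> torch.Tensor:
    """clip: (T, C, H, W) float tensor in [0, 1]; returns an augmented clip where the
    same randomly sampled transform is applied to all frames."""
    brightness = float(torch.empty(1).uniform_(0.7, 1.3))
    contrast = float(torch.empty(1).uniform_(0.8, 1.2))
    flip = torch.rand(1).item() < 0.5
    frames = []
    for frame in clip:  # identical parameters for every frame of the sequence
        f = TF.adjust_brightness(frame, brightness)
        f = TF.adjust_contrast(f, contrast)
        if flip:
            f = TF.hflip(f)
        frames.append(f)
    return torch.stack(frames)

augmented = augment_clip(torch.rand(16, 3, 128, 128))
```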

    AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos?

    Can we better anticipate an actor's future actions (e.g., mix eggs) by knowing what commonly happens after the current action (e.g., crack eggs)? What if we also know the longer-term goal of the actor (e.g., making egg fried rice)? The long-term action anticipation (LTA) task aims to predict an actor's future behavior from video observations in the form of verb and noun sequences, and it is crucial for human-machine interaction. We propose to formulate the LTA task from two perspectives: a bottom-up approach that predicts the next actions autoregressively by modeling temporal dynamics, and a top-down approach that infers the goal of the actor and plans the procedure needed to accomplish it. We hypothesize that large language models (LLMs), which have been pretrained on procedural text data (e.g., recipes, how-tos), have the potential to help LTA from both perspectives: they can provide prior knowledge about possible next actions and infer the goal given the observed part of a procedure, respectively. To leverage LLMs, we propose a two-stage framework, AntGPT. It first recognizes the actions already performed in the observed videos and then asks an LLM to predict the future actions via conditioned generation, or to infer the goal and plan the whole procedure by chain-of-thought prompting. Empirical results on the Ego4D LTA v1 and v2 benchmarks, EPIC-Kitchens-55, and EGTEA GAZE+ demonstrate the effectiveness of our proposed approach. AntGPT achieves state-of-the-art performance on all of the above benchmarks and, as shown by qualitative analysis, can successfully infer the goal and thus perform goal-conditioned "counterfactual" prediction. Code and model will be released at https://brown-palm.github.io/AntGP
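    A rough sketch of the two-stage idea: recognized past actions are turned into a prompt, and an LLM continues the action sequence after inferring the goal. The prompt wording and the `llm_complete` backend are hypothetical placeholders, not AntGPT's actual interface.

```python
# Illustrative prompt-based long-term action anticipation (assumed prompt and backend).
from typing import Callable, List

def build_lta_prompt(observed_actions: List[str], num_future: int = 5) -> str:
    history = ", ".join(observed_actions)
    return (
        "You observe a person performing these kitchen actions in order: "
        f"{history}. First infer the person's overall goal, then list the next "
        f"{num_future} actions as 'verb noun' pairs, one per line."
    )

def anticipate_with_llm(observed_actions: List[str],
                        llm_complete: Callable[[str], str],
                        num_future: int = 5) -> List[str]:
    """observed_actions would come from a first-stage action recognition model."""
    prompt = build_lta_prompt(observed_actions, num_future)
    completion = llm_complete(prompt)  # any text-completion backend
    return [line.strip() for line in completion.splitlines() if line.strip()][:num_future]

# Example with a dummy backend standing in for a real LLM call:
dummy_llm = lambda prompt: "crack egg\nmix egg\nheat pan\npour egg\nstir rice"
print(anticipate_with_llm(["take egg", "take bowl"], dummy_llm))
```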