Distill and Collect for Semi-Supervised Temporal Action Segmentation

Authors: Beckwith Richard; Biswas Sovan; Manuvinakurike Ramesh; Raffa Giuseppe; Rhodes Anthony
Publication date: 03/11/2022

Recent temporal action segmentation approaches need frame annotations during training to be effective. These annotations are very expensive and time-consuming to obtain, which limits performance when only limited annotated data is available. In contrast, we can easily collect a large corpus of in-domain unannotated videos by scavenging through the internet. Thus, this paper proposes an approach for the temporal action segmentation task that can simultaneously leverage knowledge from annotated and unannotated video sequences. Our approach uses multi-stream distillation that repeatedly refines and finally combines the streams' frame predictions. Our model also predicts the action order, which is later used as a temporal constraint while estimating frame labels to counter the lack of supervision for unannotated videos. Our evaluation of the proposed approach on two different datasets demonstrates its capability to achieve performance comparable to full supervision despite limited annotation.

arXiv.org e-Print Archive

BIT: Bi-Level Temporal Modeling for Efficient Supervised Action Segmentation

Authors: Elhamifar Ehsan; Lu Zijia
Publication date: 28/08/2023

We address the task of supervised action segmentation, which aims to partition a video into non-overlapping segments, each representing a different action. Recent works apply transformers to perform temporal modeling at the frame level, which incurs high computational cost and cannot capture action dependencies well over long temporal horizons. To address these issues, we propose an efficient BI-level Temporal modeling (BIT) framework that learns explicit action tokens to represent action segments and performs temporal modeling on the frame and action levels in parallel, while maintaining a low computational cost. Our model contains (i) a frame branch that uses convolution to learn frame-level relationships, (ii) an action branch that uses a transformer to learn action-level dependencies with a small set of action tokens, and (iii) cross-attentions to allow communication between the two branches. We apply and extend a set-prediction objective to allow each action token to represent one or multiple action segments, which avoids learning a large number of tokens over long videos with many segments. Thanks to the design of our action branch, we can also seamlessly leverage textual transcripts of videos (when available) to help action segmentation by using them to initialize the action tokens. We evaluate our model on four video datasets (two egocentric and two third-person) for action segmentation with and without transcripts, showing that BIT significantly improves the state-of-the-art accuracy with much lower computational cost (30 times faster) compared to existing transformer-based methods.

Comment: 9 pages, 6 figures

arXiv.org e-Print Archive
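
The BIT abstract above defines action segmentation as partitioning a video into non-overlapping segments. As a minimal illustration of that output format only (not of the BIT model), the sketch below collapses per-frame labels into segments. The function name frames_to_segments and the example labels are hypothetical and do not come from the paper.

```python
from itertools import groupby


def frames_to_segments(frame_labels):
    """Collapse per-frame labels into (label, start, end) segments.

    `end` is exclusive, so consecutive segments tile the video without overlap.
    """
    segments = []
    start = 0
    # groupby yields one group per maximal run of identical consecutive labels.
    for label, run in groupby(frame_labels):
        length = sum(1 for _ in run)
        segments.append((label, start, start + length))
        start += length
    return segments


# Hypothetical frame labels for a six-frame clip.
labels = ["pour", "pour", "stir", "stir", "stir", "pour"]
print(frames_to_segments(labels))
# [('pour', 0, 2), ('stir', 2, 5), ('pour', 5, 6)]
```

Note that the same action may appear in more than one segment, which is the situation the abstract refers to when one action token represents multiple segments.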

Action Segmentation with Limited Supervision

Author: Li Jun
Publication venue: Oregon State University

In this dissertation, we address action segmentation in videos under limited supervision. The goal of action segmentation is to predict an action class for each frame of a video. Limited supervision means that ground truth labels of video frames are not available in training. We focus on three types of problems: (1) Transcript-level supervised learning, where the ground truth is a transcript which represents the temporal ordering of actions present in a training video; (2) Set-level supervised learning, where the ground truth specifies only a set of actions present; and (3) Unsupervised learning, where no ground truth is available. To address these problems, we make three hypotheses. First, we believe that action segmentation under limited supervision would benefit from reasoning over many candidate segmentations rather than predicting a single optimal segmentation. To this end, we efficiently represent a video by a segmentation graph, where paths are candidate segmentations. Second, we hypothesize that discriminative learning, which minimizes energy between valid segmentations that satisfy the ground truth and invalid segmentations that violate it, is a better learning objective than only minimizing a loss defined with respect to valid segmentations. Third, we hypothesize that regularization of action affinity for same actions, sparsity of action activations for different actions, and orthonormality of parameter matrices are helpful in limited-supervision learning. The dissertation presents our approaches to action segmentation that are based on these hypotheses. Our key technical contributions include versions of a constrained Viterbi algorithm aimed at efficiently approximating the NP-hard all-color-shortest-path problem, as well as efficient Riemannian optimization on the Stiefel manifold via the Cayley transform for regularization of model parameters. Our experimental evaluation demonstrates the advantages of our approaches relative to existing work.

ScholarsArchive@OSU
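
The transcript-level setting described in the dissertation abstract above can be illustrated with a basic transcript-constrained Viterbi alignment. This is a textbook dynamic program, not the dissertation's constrained Viterbi variants for the all-color-shortest-path problem. The function name align_transcript, the input layout (log_probs[t][c] as the log-score of class c at frame t), and the example scores are all assumptions made for this sketch.

```python
def align_transcript(log_probs, transcript):
    """Label every frame so that the labels follow `transcript` in order.

    log_probs[t][c] is an assumed per-frame log-score for class c at frame t.
    Each transcript entry must cover at least one frame.
    """
    T, N = len(log_probs), len(transcript)
    if N == 0 or T < N:
        raise ValueError("need at least one frame per transcript entry")
    NEG = float("-inf")
    # score[t][n]: best total score with frame t assigned to transcript position n.
    score = [[NEG] * N for _ in range(T)]
    # stay[t][n]: True if the best path reached (t, n) from (t-1, n), else from (t-1, n-1).
    stay = [[True] * N for _ in range(T)]
    score[0][0] = log_probs[0][transcript[0]]
    for t in range(1, T):
        for n in range(N):
            keep = score[t - 1][n]
            advance = score[t - 1][n - 1] if n > 0 else NEG
            if advance > keep:
                best, stay[t][n] = advance, False
            else:
                best, stay[t][n] = keep, True
            if best > NEG:
                score[t][n] = best + log_probs[t][transcript[n]]
    # Backtrack from the last frame at the last transcript position.
    labels = [0] * T
    n = N - 1
    for t in range(T - 1, -1, -1):
        labels[t] = transcript[n]
        if t > 0 and not stay[t][n]:
            n -= 1
    return labels


# Hypothetical log-scores for 4 frames and 2 classes; the transcript says 0 then 1.
log_probs = [[-0.1, -2.3], [-0.2, -1.6], [-1.9, -0.2], [-2.5, -0.1]]
print(align_transcript(log_probs, [0, 1]))  # [0, 0, 1, 1]
```

The ordering constraint is what stands in for the missing frame labels: even when the per-frame scores favor a single class everywhere, the alignment is forced to visit every action in the transcript in order.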