Distill and Collect for Semi-Supervised Temporal Action Segmentation
Recent temporal action segmentation approaches need frame annotations during
training to be effective. These annotations are expensive and time-consuming
to obtain, which limits performance when only limited annotated data is
available. In contrast, a large corpus of in-domain unannotated videos can
easily be collected from the internet. This paper therefore proposes an
approach for the temporal action segmentation task that simultaneously
leverages knowledge from annotated and unannotated video sequences. Our
approach uses multi-stream distillation that repeatedly refines and finally
combines their frame predictions. Our model also predicts the action order,
which is later used as a temporal constraint while estimating frame labels, to
counter the lack of supervision for unannotated videos. Finally, our
evaluation of the proposed approach on two different datasets demonstrates its
capability to achieve performance comparable to full supervision despite
limited annotations.
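The paper's exact losses and stream architecture are not given here; as a rough illustration of the general mechanism, the sketch below (plain NumPy, hypothetical function names; the temperature and the probability-averaging rule are assumptions, not the paper's formulation) shows a soft-target KL distillation loss between two streams' frame-level logits and a simple way to combine streams' frame predictions.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax over class logits."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def distill_loss(student_logits, teacher_logits, temperature=2.0):
    """Mean per-frame KL(teacher || student) on temperature-softened
    predictions, scaled by T^2 as in standard knowledge distillation.

    Both inputs: (num_frames, num_classes) arrays of logits.
    """
    p = softmax(teacher_logits / temperature)               # soft targets
    log_q = np.log(softmax(student_logits / temperature) + 1e-12)
    log_p = np.log(p + 1e-12)
    return float(np.mean(np.sum(p * (log_p - log_q), axis=-1)) * temperature**2)

def combine_streams(stream_logits):
    """Combine several streams' frame predictions by averaging their
    per-frame class probabilities (one simple fusion choice)."""
    probs = [softmax(l) for l in stream_logits]
    return np.mean(probs, axis=0)                           # (num_frames, num_classes)
```

The KL term is zero when the streams agree and grows as they diverge, which is what lets a refined stream act as a soft teacher for the others on unannotated videos.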
BIT: Bi-Level Temporal Modeling for Efficient Supervised Action Segmentation
We address the task of supervised action segmentation which aims to partition
a video into non-overlapping segments, each representing a different action.
Recent works apply transformers to perform temporal modeling at the
frame level, which incurs high computational cost and cannot capture
action dependencies over long temporal horizons well. To address these issues,
we propose an efficient Bi-level Temporal modeling (BIT) framework that learns
explicit action tokens to represent action segments and performs temporal
modeling on the frame and action levels in parallel, while maintaining a low
computational cost. Our model contains (i) a frame branch that uses convolution
to learn frame-level relationships, (ii) an action branch that uses a
transformer to learn action-level dependencies with a small set of action
tokens, and (iii) cross-attentions to allow communication between the two
branches. We apply and extend a set-prediction objective to allow each action
token to represent one or multiple action segments, thus avoiding the need to
learn a large number of tokens over long videos with many segments. Thanks to
the design of our action branch,
we can also seamlessly leverage textual transcripts of videos (when available)
to help action segmentation by using them to initialize the action tokens. We
evaluate our model on four video datasets (two egocentric and two third-person)
for action segmentation with and without transcripts, showing that BIT
significantly improves the state-of-the-art accuracy with much lower
computational cost (30 times faster) compared to existing transformer-based
methods.
Comment: 9 pages, 6 figures
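The efficiency argument in the abstract rests on a small set of action tokens attending over frames, so the attention map scales with tokens × frames rather than frames². The sketch below is a minimal, weight-free single-head scaled dot-product cross-attention in NumPy that illustrates this shape; BIT's actual branches, learned projections, and token count are not reproduced here.

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(action_tokens, frame_feats):
    """Action tokens (queries) attend over frame features (keys/values).

    action_tokens: (num_tokens, d); frame_feats: (num_frames, d).
    Attention matrix is (num_tokens, num_frames), i.e. linear in video
    length for a fixed token budget, instead of quadratic frame-to-frame.
    Returns updated tokens of shape (num_tokens, d).
    """
    d = action_tokens.shape[-1]
    scores = action_tokens @ frame_feats.T / np.sqrt(d)  # (tokens, frames)
    attn = softmax(scores, axis=-1)                      # rows sum to 1
    return attn @ frame_feats                            # (tokens, d)
```

Initializing `action_tokens` from embeddings of a textual transcript, when one is available, is the kind of drop-in use the abstract describes for the action branch.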
Action Segmentation with Limited Supervision
In this dissertation, we address action segmentation in videos under limited supervision. The goal of action segmentation is to predict an action class for each frame of a video; limited supervision means that ground-truth labels of video frames are not available during training. We focus on three types of problems: (1) transcript-level supervised learning, where the ground truth is a transcript representing the temporal ordering of actions present in a training video; (2) set-level supervised learning, where the ground truth specifies only the set of actions present; and (3) unsupervised learning, where no ground truth is available.

To address these problems, we make three hypotheses. First, we believe that action segmentation under limited supervision benefits from reasoning over many candidate segmentations rather than predicting a single optimal segmentation. To this end, we efficiently represent a video by a segmentation graph whose paths are candidate segmentations. Second, we hypothesize that discriminative learning which minimizes the energy of valid segmentations (those satisfying the ground truth) relative to invalid segmentations (those violating it) is a better learning objective than only minimizing a loss defined with respect to valid segmentations. Third, we hypothesize that regularization of action affinity for same actions, sparsity of action activations for different actions, and orthonormality of parameter matrices are helpful in limited-supervision learning.

The dissertation presents our approaches to action segmentation based on these hypotheses. Our key technical contributions include versions of a constrained Viterbi algorithm aimed at efficiently approximating the NP-hard all-color-shortest-path problem, as well as efficient Riemannian optimization on the Stiefel manifold via the Cayley transform for regularization of model parameters. Our experimental evaluation demonstrates the advantages of our approaches relative to existing work.
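The Cayley-transform step mentioned among the technical contributions has a compact closed form worth seeing concretely: projecting the Euclidean gradient to a skew-symmetric generator and applying a Cayley retraction moves a parameter matrix while keeping its columns exactly orthonormal. The NumPy sketch below follows the standard Cayley-transform formulation for the Stiefel manifold; the step size, shapes, and function name are illustrative, not the dissertation's implementation.

```python
import numpy as np

def cayley_step(Q, G, lr=0.1):
    """One Cayley-transform update preserving column orthonormality.

    Q: (n, p) with Q.T @ Q = I (a point on the Stiefel manifold).
    G: (n, p) Euclidean gradient of the loss at Q.
    Builds a skew-symmetric generator A from G, then applies the
    Cayley retraction  Q <- (I + lr/2 * A)^{-1} (I - lr/2 * A) Q.
    Because A is skew-symmetric, the update matrix is orthogonal,
    so the new Q still satisfies Q.T @ Q = I.
    """
    n = Q.shape[0]
    W = G @ Q.T - 0.5 * Q @ (Q.T @ G @ Q.T)
    A = W - W.T                      # skew-symmetric: A.T == -A
    I = np.eye(n)
    return np.linalg.solve(I + 0.5 * lr * A, (I - 0.5 * lr * A) @ Q)
```

This is why the transform suits orthonormality regularization: the constraint is maintained by construction rather than penalized, and each step costs one linear solve.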