In this work, we address the task of weakly-supervised human action
segmentation in long, untrimmed videos. Recent methods have relied on expensive
learning models, such as Recurrent Neural Networks (RNNs) and Hidden Markov
Models (HMMs). However, these methods incur high computational cost and
thus cannot be deployed at large scale. To overcome these limitations, we
design our approach around efficiency and scalability. We propose a novel action
modeling framework, which consists of a new temporal convolutional network,
named Temporal Convolutional Feature Pyramid Network (TCFPN), for predicting
frame-wise action labels, and a novel training strategy for weakly-supervised
sequence modeling, named Iterative Soft Boundary Assignment (ISBA), to align
action sequences and update the network in an iterative fashion. The proposed
framework is evaluated on two benchmark datasets, Breakfast and Hollywood
Extended, with four different evaluation metrics. Extensive experimental
results show that our methods achieve performance competitive with or superior to
state-of-the-art methods.

Comment: CVPR 201