We present a novel approach for unsupervised activity segmentation, which
uses video frame clustering as a pretext task and simultaneously performs
representation learning and online clustering. This is in contrast with prior
works where representation learning and clustering are often performed
sequentially. We leverage temporal information in videos by employing temporal
optimal transport. In particular, we incorporate into the standard optimal
transport module a temporal regularization term that preserves the temporal
order of the activity when computing pseudo-label cluster assignments. The temporal
optimal transport module enables our approach to learn effective
representations for unsupervised activity segmentation. Furthermore, previous
methods require storing learned features for the entire dataset before
clustering them in an offline manner, whereas our approach processes one
mini-batch at a time in an online manner. Extensive evaluations on three public
datasets (50-Salads, YouTube Instructions, and Breakfast) and our own dataset,
Desktop Assembly, show that our approach performs on par with or better than
previous methods for unsupervised activity segmentation, despite having
significantly lower memory requirements.

Comment: Preprint. Under review.
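
The abstract does not spell out the temporal optimal transport formulation, so the following is only an illustrative sketch of the general idea, not the method as implemented in this work. It assumes entropy-regularized optimal transport solved with Sinkhorn iterations, uniform marginals over frames and clusters, and a Gaussian-shaped temporal prior that favors assigning early frames to early clusters and late frames to late clusters. All function names and parameters (`temporal_prior`, `temporal_ot_assignments`, `eps`, `sigma`) are hypothetical.

```python
import math


def temporal_prior(n_frames, n_clusters, sigma=0.2):
    """Assumed prior: large where a frame's normalized time is close to
    a cluster's normalized index, so cluster order follows frame order."""
    prior = []
    for i in range(n_frames):
        t = (i + 0.5) / n_frames
        row = []
        for j in range(n_clusters):
            k = (j + 0.5) / n_clusters
            row.append(math.exp(-((t - k) ** 2) / (2 * sigma ** 2)))
        prior.append(row)
    return prior


def temporal_ot_assignments(scores, eps=0.1, sigma=0.2, n_iters=50):
    """Soft pseudo-labels from frame-to-cluster similarity scores.

    scores: list of rows, one per frame (in temporal order), each holding
    one similarity value per cluster. Returns rows that each sum to 1.
    """
    n, k = len(scores), len(scores[0])
    top = max(max(row) for row in scores)  # shift for numerical stability
    prior = temporal_prior(n, k, sigma)
    # Kernel combines the similarity term with the temporal prior.
    kernel = [
        [math.exp((scores[i][j] - top) / eps) * prior[i][j] for j in range(k)]
        for i in range(n)
    ]
    u = [1.0] * n
    v = [1.0] * k
    # Sinkhorn iterations toward uniform row and column marginals.
    for _ in range(n_iters):
        u = [(1.0 / n) / sum(kernel[i][j] * v[j] for j in range(k))
             for i in range(n)]
        v = [(1.0 / k) / sum(kernel[i][j] * u[i] for i in range(n))
             for j in range(k)]
    plan = [[u[i] * kernel[i][j] * v[j] for j in range(k)] for i in range(n)]
    # Normalize each frame's row so it reads as a distribution over clusters.
    return [[q / sum(row) for q in row] for row in plan]
```

Under these assumptions, hard pseudo-labels for one mini-batch of frames could be read off with `[row.index(max(row)) for row in temporal_ot_assignments(scores)]`. Because only the current mini-batch's scores are needed, a sketch like this fits the online setting described above without storing features for the whole dataset.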