Video action segmentation and recognition tasks have been widely applied in
many fields. Most previous studies employ large-scale visual models with high
computational cost to understand videos comprehensively. However, few studies
directly employ graph models to reason about videos. Graph models offer fewer
parameters, lower computational cost, a large receptive field, and flexible
neighborhood message aggregation. In this paper, we present a graph-based
method, named Semantic2Graph, that turns video action segmentation and
recognition into a graph node classification problem. To
preserve fine-grained relations in videos, we construct the graph structure of
videos at the frame level and design three types of edges: temporal, semantic,
and self-loop. We combine visual, structural, and semantic features as node
attributes. Semantic edges model long-term spatio-temporal relations, while
the semantic features are embeddings of the label text derived from textual
prompts. A graph neural network (GNN) model is used to
learn multi-modal feature fusion. Experimental results show that Semantic2Graph
achieves improvements over state-of-the-art results on GTEA and 50Salads.
Multiple ablation experiments further confirm that semantic features improve
model performance and that semantic edges enable Semantic2Graph to capture
long-term dependencies at low cost.

Comment: 10 pages, 3 figures, 8 tables. This paper was submitted to IEEE
Transactions on Multimedia.
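
As an illustration of the graph construction described above, the sketch below
builds a frame-level graph with the three edge types (temporal, semantic, and
self-loop). This is a minimal toy sketch under stated assumptions, not the
authors' implementation: the function name, the window parameter, and the toy
frame labels are all hypothetical.

```python
def build_frame_graph(frame_labels, temporal_window=1):
    """Return edges as (src, dst, type) triples over frame indices.

    Illustrative only: nodes are video frames; semantic edges link
    non-adjacent frames sharing an action label, which is one way to
    realize long-term dependencies at low cost.
    """
    n = len(frame_labels)
    edges = []
    # Self-loop edges: each frame node connects to itself.
    for i in range(n):
        edges.append((i, i, "self"))
    # Temporal edges: link frames within the temporal window.
    for i in range(n):
        for j in range(i + 1, min(n, i + temporal_window + 1)):
            edges.append((i, j, "temporal"))
    # Semantic edges: link frames beyond the window that share a label,
    # modeling long-range spatio-temporal relations.
    for i in range(n):
        for j in range(i + temporal_window + 1, n):
            if frame_labels[i] == frame_labels[j]:
                edges.append((i, j, "semantic"))
    return edges

# Toy example: six frames where the action "cut" recurs after "pour",
# so semantic edges bridge the two "cut" segments.
labels = ["cut", "cut", "pour", "pour", "cut", "cut"]
edges = build_frame_graph(labels)
```

Each edge type can then carry its own message-passing weights in a GNN, which
is the usual way heterogeneous edge sets are handled in node classification.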