Automatic recognition of fine-grained surgical activities, called steps, is a
challenging but crucial task for intelligent intra-operative computer
assistance. The development of current vision-based activity recognition
methods relies heavily on a high volume of manually annotated data. This data
is difficult and time-consuming to generate and requires domain-specific
knowledge. In this work, we propose to use coarser and easier-to-annotate
activity labels, namely phases, as weak supervision to learn step recognition
with fewer step-annotated videos. We introduce a step-phase dependency loss to
exploit the weak supervision signal. We then employ a Single-Stage Temporal
Convolutional Network (SS-TCN) with a ResNet-50 backbone, trained in an
end-to-end fashion from weakly annotated videos, for temporal activity
segmentation and recognition. We extensively evaluate the proposed method and
demonstrate its effectiveness on a large video dataset of 40 laparoscopic
gastric bypass procedures and on the public CATARACTS benchmark containing 50
cataract surgeries.
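The abstract does not specify the exact form of the step-phase dependency loss, but one plausible instantiation is a minimal NumPy sketch like the following: assuming each fine-grained step belongs to exactly one coarse phase (the `STEP_TO_PHASE` mapping below is purely illustrative), predicted step probabilities can be marginalized into phase probabilities, against which the weak phase labels provide a supervision signal.

```python
import numpy as np

# Hypothetical step -> parent-phase mapping (illustrative only):
# 6 fine-grained steps grouped into 3 coarse phases.
STEP_TO_PHASE = np.array([0, 0, 1, 1, 1, 2])

def step_phase_dependency_loss(step_logits, phase_labels):
    """One plausible form of a step-phase dependency loss (a sketch,
    not the paper's exact formulation): marginalize per-frame step
    probabilities into phase probabilities via the step->phase
    hierarchy, then take the negative log-likelihood of the weak
    phase labels.

    step_logits:  (T, num_steps) raw network outputs per frame
    phase_labels: (T,) integer phase annotation per frame
    """
    # Numerically stable softmax over steps for each frame.
    z = step_logits - step_logits.max(axis=1, keepdims=True)
    step_probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)

    # Sum the probabilities of all steps belonging to each phase.
    num_phases = STEP_TO_PHASE.max() + 1
    phase_probs = np.zeros((step_probs.shape[0], num_phases))
    for p in range(num_phases):
        phase_probs[:, p] = step_probs[:, STEP_TO_PHASE == p].sum(axis=1)

    # Negative log-likelihood of the annotated phase at each frame.
    t = np.arange(step_probs.shape[0])
    return float(-np.log(phase_probs[t, phase_labels] + 1e-8).mean())
```

Because the loss only needs phase labels, it can be computed on the many phase-annotated frames, while a standard step-level cross-entropy would apply to the smaller step-annotated subset.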