U-LanD: Uncertainty-Driven Video Landmark Detection
This paper presents U-LanD, a framework for the joint detection of key frames and
landmarks in videos. We tackle a particularly challenging setting in which
training labels are noisy and highly sparse. U-LanD builds on a pivotal
observation: a deep Bayesian landmark detector trained solely on key video
frames has significantly lower predictive uncertainty on those frames than on
other frames in the video. We use this observation as an unsupervised signal to
automatically recognize the key frames on which we detect landmarks. As a test bed
for our framework, we use ultrasound imaging videos of the heart, where sparse
and noisy clinical labels are available for only a single frame in each video.
Using data from 4,493 patients, we demonstrate that U-LanD outperforms its
state-of-the-art non-Bayesian counterpart by a notable absolute margin of 42%
in R² score, with almost no overhead added to the model size. Our approach is
generic and can potentially be applied to other challenging data with noisy and
sparse training labels.
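The core idea, selecting frames where a Bayesian detector is most confident, can be illustrated with Monte Carlo dropout as the approximate Bayesian inference method. The following is a minimal sketch under that assumption; the model architecture, feature dimensions, and function names here are hypothetical stand-ins, not the paper's actual implementation.

```python
# Sketch: uncertainty-driven key-frame selection via Monte Carlo dropout.
# All names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class LandmarkDetector(nn.Module):
    """Toy per-frame landmark regressor with dropout as the stochastic layer."""
    def __init__(self, in_dim=128, n_landmarks=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(),
            nn.Dropout(p=0.2),                 # source of epistemic uncertainty
            nn.Linear(64, n_landmarks * 2),    # (x, y) per landmark
        )

    def forward(self, x):
        return self.net(x)

def predictive_uncertainty(model, frames, n_samples=20):
    """Variance across stochastic forward passes, averaged over coordinates."""
    model.train()  # keep dropout active at inference for MC sampling
    with torch.no_grad():
        samples = torch.stack([model(frames) for _ in range(n_samples)])
    return samples.var(dim=0).mean(dim=-1)     # one uncertainty score per frame

# Frames with the lowest predictive uncertainty are treated as key frames.
frames = torch.randn(30, 128)                  # 30 frames of toy features
model = LandmarkDetector()
scores = predictive_uncertainty(model, frames)
key_frame = scores.argmin().item()
print(f"key frame: {key_frame}, uncertainty: {scores[key_frame]:.4f}")
```

In this toy setup, a detector trained only on labeled key frames would produce low variance on frames resembling them, so thresholding or ranking the per-frame scores serves as the unsupervised key-frame signal the abstract describes.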
Actor and Action Modular Network for Text-based Video Segmentation
Actor and action semantic segmentation is a challenging problem that requires
joint understanding of actors and actions, learning to segment from pre-defined
actor-action label pairs. However, existing methods for this task fail to
distinguish actors that share the same super-category and cannot identify
actor-action pairs that fall outside the fixed actor and action vocabulary.
Recent studies have extended this task with textual queries, instead of
word-level actor-action pairs, so that the actor and action can be flexibly
specified. In this paper, we focus on the text-based actor and action
segmentation problem, which performs fine-grained actor and action
understanding in a video. Previous works predicted segmentation masks from the
merged heterogeneous features of a given video and textual query, but ignored
the linguistic variation of the textual query and the visual semantic
discrepancy of the video, leading to asymmetric matching between convolved
volumes of the video and the global query representation. To alleviate this
problem, we propose a novel actor and action modular network that individually
localizes the actor and action in two separate modules. We first learn the
actor- and action-related content for the video and the textual query, and then
match them in a symmetric manner to localize the target region. The target
region, which contains the desired actor and action, is then fed into a fully
convolutional network to predict the segmentation mask. The whole model enables
joint learning of actor-action matching and segmentation, and achieves
state-of-the-art performance on the A2D Sentences and J-HMDB Sentences
datasets.
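The two-module, symmetric-matching idea can be sketched with simple linear projections standing in for the paper's actor and action modules. Everything below (class name, dimensions, cosine-similarity matching, multiplicative fusion) is an illustrative assumption, not the authors' implementation.

```python
# Sketch: separate actor/action modules matched symmetrically to the query.
# All names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModularMatcher(nn.Module):
    def __init__(self, vid_dim=256, txt_dim=300, joint_dim=128):
        super().__init__()
        # Separate projections learn actor- vs. action-related content
        # for both the video and the textual query.
        self.vid_actor = nn.Linear(vid_dim, joint_dim)
        self.vid_action = nn.Linear(vid_dim, joint_dim)
        self.txt_actor = nn.Linear(txt_dim, joint_dim)
        self.txt_action = nn.Linear(txt_dim, joint_dim)

    def forward(self, vid_feats, query_feat):
        # vid_feats: (N, vid_dim) spatial locations; query_feat: (txt_dim,)
        def match(vid_proj, txt_proj):
            v = F.normalize(vid_proj(vid_feats), dim=-1)
            t = F.normalize(txt_proj(query_feat), dim=-1)
            return v @ t                       # cosine similarity per location

        actor_score = match(self.vid_actor, self.txt_actor)
        action_score = match(self.vid_action, self.txt_action)
        # Symmetric fusion: a location must match both the actor and the action.
        return actor_score * action_score      # (N,) localization heatmap

matcher = ModularMatcher()
vid_feats = torch.randn(49, 256)               # e.g. a flattened 7x7 feature grid
query_feat = torch.randn(300)                  # pooled query embedding
heatmap = matcher(vid_feats, query_feat)
print(heatmap.shape)                           # torch.Size([49])
```

Because the video and query are projected into the same joint space before matching, each side contributes a comparable representation per module, which is the symmetry the abstract contrasts with matching convolved video volumes against a single global query vector. The resulting heatmap would then be passed to a fully convolutional head to produce the segmentation mask.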