Relational Future Captioning Model for Explaining Likely Collisions in Daily Tasks
Domestic service robots that support daily tasks are a promising solution for
elderly or disabled people. It is crucial for domestic service robots to
explain the collision risk before they perform actions. In this paper, our aim
is to generate a caption about a future event. We propose the Relational Future
Captioning Model (RFCM), a crossmodal language generation model for the future
captioning task. The RFCM uses a Relational Self-Attention Encoder to extract
relationships between events more effectively than the conventional
self-attention in transformers. We conducted comparison experiments, and the
results show that the RFCM outperforms a baseline method on two datasets.
Comment: Accepted for presentation at ICIP 2022
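The abstract does not specify the encoder's internals, so the following is only a speculative toy sketch of what a "relational" self-attention layer could look like: standard multi-head self-attention over event features, augmented with a branch that explicitly scores every pair of events. The module name, dimensions, and pairwise-pooling design are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch (not the authors' code): self-attention over event
# features plus an explicit pairwise-relation branch.
import torch
import torch.nn as nn

class ToyRelationalEncoderLayer(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Relation branch: projects every concatenated (i, j) pair of event
        # features and pools over j, mixing pairwise context back into event i.
        self.rel_proj = nn.Linear(2 * d_model, d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_events, d_model)
        attn_out, _ = self.self_attn(x, x, x)
        b, n, d = x.shape
        xi = x.unsqueeze(2).expand(b, n, n, d)   # event i repeated over j
        xj = x.unsqueeze(1).expand(b, n, n, d)   # event j repeated over i
        rel = self.rel_proj(torch.cat([xi, xj], dim=-1)).mean(dim=2)
        h = self.norm1(x + attn_out + rel)
        return self.norm2(h + self.ffn(h))

# Example: encode 8 event/context features of dimension 256.
feats = torch.randn(2, 8, 256)
print(ToyRelationalEncoderLayer()(feats).shape)  # torch.Size([2, 8, 256])
```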
Unifying Event Detection and Captioning as Sequence Generation via Pre-Training
Dense video captioning aims to generate corresponding text descriptions for a
series of events in the untrimmed video, which can be divided into two
sub-tasks, event detection and event captioning. Unlike previous works that
tackle the two sub-tasks separately, recent works have focused on enhancing the
inter-task association between the two sub-tasks. However, designing inter-task
interactions for event detection and captioning is not trivial due to the large
differences in their task-specific solutions. In addition, previous event
detection methods normally ignore temporal dependencies between events, leading
to event redundancy or inconsistency problems. To tackle these two defects, we
define event detection as a sequence generation task and propose a
unified pre-training and fine-tuning framework to naturally enhance the
inter-task association between event detection and captioning. Since the model
predicts each event with previous events as context, the inter-dependency
between events is fully exploited and thus our model can detect more diverse
and consistent events in the video. Experiments on the ActivityNet dataset show
that our model outperforms the state-of-the-art methods, and can be further
boosted when pre-trained on extra large-scale video-text data. Code is
available at \url{https://github.com/QiQAng/UEDVC}
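As a rough, hypothetical illustration of what "event detection as sequence generation" can mean in practice (the paper's actual tokenization lives in the linked repository and is not reproduced here): event boundaries can be quantized into discrete time tokens and emitted autoregressively, so each event is predicted with all previous events as context. The bin count and token layout below are assumptions.

```python
# Hypothetical sketch (not the released UEDVC code): serialize event
# boundaries into discrete time tokens for autoregressive generation.
def quantize(t: float, duration: float, num_bins: int = 100) -> int:
    """Map a timestamp in seconds to one of `num_bins` discrete time tokens."""
    return min(int(t / duration * num_bins), num_bins - 1)

def events_to_sequence(events, duration, num_bins=100):
    """events: list of (start_sec, end_sec). Returns a flat token sequence
    [start_1, end_1, start_2, end_2, ...] sorted by start time, so a model
    generating it sees all previous events when predicting the next one."""
    tokens = []
    for start, end in sorted(events):
        tokens += [quantize(start, duration, num_bins),
                   quantize(end, duration, num_bins)]
    return tokens

# Example: three events in a 120-second video.
print(events_to_sequence([(5.0, 20.0), (30.0, 55.0), (70.0, 110.0)], 120.0))
# -> [4, 16, 25, 45, 58, 91]
```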
End-to-end Dense Video Captioning as Sequence Generation
Dense video captioning aims to identify the events of interest in an input
video, and generate descriptive captions for each event. Previous approaches
usually follow a two-stage generative process, which first proposes a segment
for each event, then renders a caption for each identified segment. Recent
advances in large-scale sequence-generation pretraining have been highly
successful at unifying task formulations across a wide variety of tasks, but so
far, more complex tasks such as dense video captioning have not been able to
fully exploit this powerful paradigm. In this work, we show how to model the two subtasks of dense
video captioning jointly as one sequence generation task, and simultaneously
predict the events and the corresponding descriptions. Experiments on YouCook2
and ViTT show encouraging results and indicate that it is feasible to integrate
complex tasks such as end-to-end dense video captioning into large-scale
pre-trained models.
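To make the single-sequence formulation concrete, here is a hedged sketch (an assumed serialization, not the paper's): segment boundaries and captions are folded into one target string that a sequence-to-sequence model could generate end to end. The bracketed time tokens, bin count, and separator are illustrative choices only.

```python
# Hypothetical sketch (assumed formulation, not the paper's released code):
# fold both subtasks of dense video captioning into one target sequence,
# interleaving discrete time tokens with each event's caption.
def build_target_sequence(events, duration, num_bins=100):
    """events: list of (start_sec, end_sec, caption). Returns one string that
    a seq2seq model could be trained to generate end to end."""
    parts = []
    for start, end, caption in sorted(events):
        s = min(int(start / duration * num_bins), num_bins - 1)
        e = min(int(end / duration * num_bins), num_bins - 1)
        parts.append(f"<{s}> <{e}> {caption}")
    return " <sep> ".join(parts)

# Example: two YouCook2-style events in a 200-second cooking video.
print(build_target_sequence(
    [(10.0, 45.0, "chop the onions"), (50.0, 120.0, "fry the onions in oil")],
    duration=200.0))
# -> "<5> <22> chop the onions <sep> <25> <60> fry the onions in oil"
```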