Fine-Grained Action Retrieval Through Multiple Parts-of-Speech Embeddings
We address the problem of cross-modal fine-grained action retrieval between
text and video. Cross-modal retrieval is commonly achieved by learning a
shared embedding space into which either modality can be embedded. In this paper,
we propose to enrich the embedding by disentangling parts-of-speech (PoS) in
the accompanying captions. We build a separate multi-modal embedding space for
each PoS tag. The outputs of multiple PoS embeddings are then used as input to
an integrated multi-modal space, where we perform action retrieval. All
embeddings are trained jointly through a combination of PoS-aware and
PoS-agnostic losses. Our proposal enables learning specialised embedding spaces
that offer multiple views of the same embedded entities.
We report the first retrieval results on fine-grained actions for the
large-scale EPIC dataset, in a generalised zero-shot setting. Results show the
advantage of our approach for both video-to-text and text-to-video action
retrieval. We also demonstrate the benefit of disentangling the PoS for the
generic task of cross-modal video retrieval on the MSR-VTT dataset.
Comment: Accepted for presentation at ICCV. Project Page: https://mwray.github.io/FGA
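The architecture the abstract describes — a separate multi-modal embedding space per PoS tag, whose outputs feed an integrated space used for retrieval — can be sketched as follows. All dimensions, tag sets, and projection matrices here are illustrative assumptions for a toy sketch, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions and PoS tags (assumptions, not the paper's values).
FEAT_DIM, POS_DIM, JOINT_DIM = 8, 4, 6
POS_TAGS = ("verb", "noun")

# One projection per (modality, PoS tag): each PoS gets its own
# multi-modal embedding space.
pos_proj = {
    (mod, tag): rng.standard_normal((FEAT_DIM, POS_DIM))
    for mod in ("video", "text")
    for tag in POS_TAGS
}
# The concatenated PoS embeddings are then projected into one
# integrated space, where action retrieval is performed.
joint_proj = {
    mod: rng.standard_normal((POS_DIM * len(POS_TAGS), JOINT_DIM))
    for mod in ("video", "text")
}

def embed(features: np.ndarray, modality: str) -> np.ndarray:
    """Map raw modality features into the integrated retrieval space."""
    parts = [features @ pos_proj[(modality, tag)] for tag in POS_TAGS]
    joint = np.concatenate(parts) @ joint_proj[modality]
    return joint / np.linalg.norm(joint)  # unit norm for cosine retrieval

video_vec = embed(rng.standard_normal(FEAT_DIM), "video")
text_vec = embed(rng.standard_normal(FEAT_DIM), "text")
similarity = float(video_vec @ text_vec)  # cosine score used for ranking
```

In the paper all of these spaces are trained jointly with PoS-aware and PoS-agnostic losses; the sketch above only shows the forward pass that produces the vectors those losses would compare.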
A Multi-modal Approach to Fine-grained Opinion Mining on Video Reviews
Despite the recent advances in opinion mining for written reviews, few works
have tackled the problem on other sources of reviews. In light of this issue,
we propose a multi-modal approach for mining fine-grained opinions from video
reviews that is able to determine the aspects of the item under review that are
being discussed and the sentiment orientation towards them. Our approach works
at the sentence level without the need for time annotations and uses features
derived from the audio, video and language transcriptions of its contents. We
evaluate our approach on two datasets and show that leveraging the video and
audio modalities consistently provides increased performance over text-only
baselines, providing evidence that these extra modalities are key to better
understanding video reviews.
Comment: Second Grand Challenge and Workshop on Multimodal Language, ACL 202