Fine-Grained Action Retrieval Through Multiple Parts-of-Speech Embeddings
We address the problem of cross-modal fine-grained action retrieval between
text and video. Cross-modal retrieval is commonly achieved by learning a
shared embedding space into which either modality can be embedded. In this paper,
we propose to enrich the embedding by disentangling parts-of-speech (PoS) in
the accompanying captions. We build a separate multi-modal embedding space for
each PoS tag. The outputs of multiple PoS embeddings are then used as input to
an integrated multi-modal space, where we perform action retrieval. All
embeddings are trained jointly through a combination of PoS-aware and
PoS-agnostic losses. Our proposal enables learning specialised embedding spaces
that offer multiple views of the same embedded entities.
We report the first retrieval results on fine-grained actions for the
large-scale EPIC dataset, in a generalised zero-shot setting. Results show the
advantage of our approach for both video-to-text and text-to-video action
retrieval. We also demonstrate the benefit of disentangling the PoS for the
generic task of cross-modal video retrieval on the MSR-VTT dataset.
Comment: Accepted for presentation at ICCV. Project Page: https://mwray.github.io/FGA
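The architecture the abstract describes — a separate multi-modal embedding space per PoS tag, whose outputs feed an integrated space used for retrieval — can be sketched as follows. All dimensions, tag sets, and projection matrices here are illustrative assumptions for a toy sketch, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions and PoS tags (assumptions, not the paper's values).
FEAT_DIM, POS_DIM, JOINT_DIM = 8, 4, 6
POS_TAGS = ("verb", "noun")

# One projection per (modality, PoS tag): each PoS gets its own
# multi-modal embedding space.
pos_proj = {
    (mod, tag): rng.standard_normal((FEAT_DIM, POS_DIM))
    for mod in ("video", "text")
    for tag in POS_TAGS
}
# The concatenated PoS embeddings are then projected into one
# integrated space, where action retrieval is performed.
joint_proj = {
    mod: rng.standard_normal((POS_DIM * len(POS_TAGS), JOINT_DIM))
    for mod in ("video", "text")
}

def embed(features: np.ndarray, modality: str) -> np.ndarray:
    """Map raw modality features into the integrated retrieval space."""
    parts = [features @ pos_proj[(modality, tag)] for tag in POS_TAGS]
    joint = np.concatenate(parts) @ joint_proj[modality]
    return joint / np.linalg.norm(joint)  # unit norm for cosine retrieval

video_vec = embed(rng.standard_normal(FEAT_DIM), "video")
text_vec = embed(rng.standard_normal(FEAT_DIM), "text")
similarity = float(video_vec @ text_vec)  # cosine score used for ranking
```

In the paper all of these spaces are trained jointly with PoS-aware and PoS-agnostic losses; the sketch above only shows the forward pass that produces the vectors those losses would compare.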
A Multi-modal Approach to Fine-grained Opinion Mining on Video Reviews
Despite the recent advances in opinion mining for written reviews, few works
have tackled the problem on other sources of reviews. In light of this issue,
we propose a multi-modal approach for mining fine-grained opinions from video
reviews that is able to determine the aspects of the item under review that are
being discussed and the sentiment orientation towards them. Our approach works
at the sentence level without the need for time annotations and uses features
derived from the audio, video and language transcriptions of its contents. We
evaluate our approach on two datasets and show that leveraging the video and
audio modalities consistently provides increased performance over text-only
baselines, providing evidence that these extra modalities are key to better
understanding video reviews.
Comment: Second Grand Challenge and Workshop on Multimodal Language, ACL 202