
    Local Temporal Bilinear Pooling for Fine-grained Action Parsing

    Fine-grained temporal action parsing is important in many applications, such as daily activity understanding, human motion analysis, and surgical robotics, which require subtle and precise operations over long time periods. In this paper we propose a novel bilinear pooling operation, used in the intermediate layers of a temporal convolutional encoder-decoder network. In contrast to prior work, our proposed bilinear pooling is learnable and hence can capture more complex local statistics than the conventional counterpart. In addition, we introduce exact lower-dimensional representations of our bilinear forms, so that the dimensionality is reduced with neither information loss nor extra computation. We perform extensive experiments to quantitatively analyze our model and show superior performance to other state-of-the-art work on various datasets.
    Comment: 11 pages, 2 figures. Cam.
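
    A minimal sketch, assuming PyTorch, of what a learnable local temporal bilinear pooling layer could look like: a factorized (low-rank) bilinear form computed over each sliding temporal window. The class and parameter names (LocalTemporalBilinearPooling, window, rank) are illustrative assumptions, not the authors' implementation, and the paper's exact lower-dimensional representation is not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocalTemporalBilinearPooling(nn.Module):
    """Pools each local temporal window with a learnable low-rank bilinear form.

    Illustrative sketch only; names and factorization are assumptions, not the paper's code.
    """

    def __init__(self, in_dim: int, out_dim: int, window: int = 5, rank: int = 8):
        super().__init__()
        assert window % 2 == 1, "use an odd window so the output keeps the input length"
        self.window = window
        self.rank = rank
        self.out_dim = out_dim
        # Low-rank factors keep the bilinear interaction learnable while
        # avoiding the full in_dim x in_dim outer-product dimensionality.
        self.u = nn.Linear(in_dim, rank * out_dim, bias=False)
        self.v = nn.Linear(in_dim, rank * out_dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, in_dim) frame-level features
        b, t, d = x.shape
        pad = self.window // 2
        xp = F.pad(x, (0, 0, pad, pad))              # pad the time axis
        win = xp.unfold(1, self.window, 1)           # (b, t, d, window)
        win = win.permute(0, 1, 3, 2)                # (b, t, window, d)
        # Factorized second-order (bilinear) statistics: (U x) * (V x),
        # summed over the rank dimension and averaged over the window.
        hu = self.u(win).view(b, t, self.window, self.rank, self.out_dim)
        hv = self.v(win).view(b, t, self.window, self.rank, self.out_dim)
        return (hu * hv).sum(dim=3).mean(dim=2)      # (b, t, out_dim)
```

    For instance, `LocalTemporalBilinearPooling(256, 128)(torch.randn(2, 100, 256))` returns a tensor of shape (2, 100, 128), so the layer can sit between temporal convolution stages of an encoder-decoder without changing the sequence length.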

    Temporal bilinear encoding network of audio-visual features at low sampling rates

    Current deep learning based video classification architectures are typically trained end-to-end on large volumes of data and require extensive computational resources. This paper aims to exploit audio-visual information in video classification with a 1 frame per second sampling rate. We propose Temporal Bilinear Encoding Networks (TBEN) for encoding both audio and visual long-range temporal information using bilinear pooling, and demonstrate that bilinear pooling is better than average pooling on the temporal dimension for videos with a low sampling rate. We also embed the label hierarchy in TBEN to further improve the robustness of the classifier. Experiments on the FGA240 fine-grained classification dataset using TBEN achieve a new state of the art (hit@1=47.95%). We also explore incorporating TBEN with multiple decoupled modalities, such as visual semantic and motion features: experiments on UCF101 sampled at 1 FPS achieve close to state-of-the-art accuracy (hit@1=91.03%) while requiring significantly fewer computational resources than competing approaches for both training and prediction.
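
    To illustrate the contrast the abstract draws, the sketch below (assuming PyTorch; the function names are illustrative, not TBEN code) compares plain temporal average pooling with second-order temporal bilinear pooling, including the signed square-root and L2 normalization commonly applied to bilinear features.

```python
import torch
import torch.nn.functional as F


def temporal_average_pooling(feats: torch.Tensor) -> torch.Tensor:
    # feats: (batch, time, dim) -> (batch, dim)
    return feats.mean(dim=1)


def temporal_bilinear_pooling(feats: torch.Tensor) -> torch.Tensor:
    # feats: (batch, time, dim) -> (batch, dim * dim)
    # Second-order statistics over the temporal axis capture feature
    # co-occurrences that a plain temporal mean discards.
    b, t, d = feats.shape
    outer = torch.einsum('btd,bte->bde', feats, feats) / t   # (b, d, d)
    z = outer.reshape(b, d * d)
    # Signed square-root and L2 normalization, as is standard for bilinear features.
    z = torch.sign(z) * torch.sqrt(torch.abs(z) + 1e-8)
    return F.normalize(z, dim=-1)


# Example: ten frames sampled at 1 FPS, 512-d features per frame.
clip = torch.randn(4, 10, 512)
avg = temporal_average_pooling(clip)    # (4, 512)
bil = temporal_bilinear_pooling(clip)   # (4, 262144)
```

    The quadratic output dimensionality is why compact or factorized variants are typically used in practice.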

    Attention Mechanisms in Computer Vision: A Survey

    Humans can naturally and effectively find salient regions in complex scenes. Motivated by this observation, attention mechanisms were introduced into computer vision with the aim of imitating this aspect of the human visual system. Such an attention mechanism can be regarded as a dynamic weight adjustment process based on features of the input image. Attention mechanisms have achieved great success in many visual tasks, including image classification, object detection, semantic segmentation, video understanding, image generation, 3D vision, multi-modal tasks, and self-supervised learning. In this survey, we provide a comprehensive review of various attention mechanisms in computer vision and categorize them according to approach, such as channel attention, spatial attention, temporal attention, and branch attention; a related repository https://github.com/MenghaoGuo/Awesome-Vision-Attentions is dedicated to collecting related work. We also suggest future directions for attention mechanism research.
    Comment: 27 pages, 9 figures
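
    To make the "dynamic weight adjustment" view concrete, here is a minimal channel-attention block in the squeeze-and-excitation style, one of the families the survey categorizes (assuming PyTorch; this is an illustrative block, not code from the survey).

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Re-weights feature channels based on globally pooled statistics."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width)
        b, c, _, _ = x.shape
        squeeze = x.mean(dim=(2, 3))                 # global average pool -> (b, c)
        weights = self.fc(squeeze).view(b, c, 1, 1)  # per-channel weights in [0, 1]
        return x * weights                           # dynamic channel re-weighting
```

    Spatial, temporal, and branch attention follow the same pattern but compute the weights over different axes of the feature tensor.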

    Cross-Modal Interaction Networks for Query-Based Moment Retrieval in Videos

    Query-based moment retrieval aims to localize the most relevant moment in an untrimmed video according to a given natural language query. Existing works often focus on only one aspect of this emerging task, such as query representation learning, video context modeling, or multi-modal fusion, and thus fail to develop a comprehensive system for further performance improvement. In this paper, we introduce a novel Cross-Modal Interaction Network (CMIN) that considers multiple crucial factors for this challenging task, including (1) the syntactic structure of natural language queries, (2) long-range semantic dependencies in video context, and (3) sufficient cross-modal interaction. Specifically, we devise a syntactic GCN to leverage the syntactic structure of queries for fine-grained representation learning, propose multi-head self-attention to capture long-range semantic dependencies from video context, and then employ multi-stage cross-modal interaction to explore the potential relations of video and query contents. Extensive experiments demonstrate the effectiveness of our proposed method.
    Comment: Accepted by SIGIR 2019 as a full paper
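
    For the long-range dependency component, a minimal sketch, assuming PyTorch and illustrative tensor shapes (not the CMIN implementation), of multi-head self-attention applied over clip-level video features:

```python
import torch
import torch.nn as nn

dim, heads = 512, 8
# Self-attention lets every clip attend to every other clip in the untrimmed video,
# capturing long-range semantic dependencies in a single layer.
self_attn = nn.MultiheadAttention(embed_dim=dim, num_heads=heads, batch_first=True)

video = torch.randn(2, 200, dim)                      # (batch, clips, dim)
context, attn_weights = self_attn(video, video, video)
print(context.shape)                                  # torch.Size([2, 200, 512])
```

    In a full system, the resulting context features would then be fused with the query representation in the cross-modal interaction stages.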