11,678 research outputs found
Cross-Modal Interaction Networks for Query-Based Moment Retrieval in Videos
Query-based moment retrieval aims to localize the most relevant moment in an
untrimmed video according to the given natural language query. Existing works
often only focus on one aspect of this emerging task, such as the query
representation learning, video context modeling or multi-modal fusion, thus
fail to develop a comprehensive system for further performance improvement. In
this paper, we introduce a novel Cross-Modal Interaction Network (CMIN) to
consider multiple crucial factors for this challenging task, including (1) the
syntactic structure of natural language queries; (2) long-range semantic
dependencies in video context and (3) the sufficient cross-modal interaction.
Specifically, we devise a syntactic GCN to leverage the syntactic structure of
queries for fine-grained representation learning, propose a multi-head
self-attention to capture long-range semantic dependencies from video context,
and next employ a multi-stage cross-modal interaction to explore the potential
relations of video and query contents. The extensive experiments demonstrate
the effectiveness of our proposed method.Comment: Accepted by SIGIR 2019 as a full pape
Non-local Neural Networks
Both convolutional and recurrent operations are building blocks that process
one local neighborhood at a time. In this paper, we present non-local
operations as a generic family of building blocks for capturing long-range
dependencies. Inspired by the classical non-local means method in computer
vision, our non-local operation computes the response at a position as a
weighted sum of the features at all positions. This building block can be
plugged into many computer vision architectures. On the task of video
classification, even without any bells and whistles, our non-local models can
compete or outperform current competition winners on both Kinetics and Charades
datasets. In static image recognition, our non-local models improve object
detection/segmentation and pose estimation on the COCO suite of tasks. Code is
available at https://github.com/facebookresearch/video-nonlocal-net .Comment: CVPR 2018, code is available at:
https://github.com/facebookresearch/video-nonlocal-ne
Pedestrian Attribute Recognition: A Survey
Recognizing pedestrian attributes is an important task in computer vision
community due to it plays an important role in video surveillance. Many
algorithms has been proposed to handle this task. The goal of this paper is to
review existing works using traditional methods or based on deep learning
networks. Firstly, we introduce the background of pedestrian attributes
recognition (PAR, for short), including the fundamental concepts of pedestrian
attributes and corresponding challenges. Secondly, we introduce existing
benchmarks, including popular datasets and evaluation criterion. Thirdly, we
analyse the concept of multi-task learning and multi-label learning, and also
explain the relations between these two learning algorithms and pedestrian
attribute recognition. We also review some popular network architectures which
have widely applied in the deep learning community. Fourthly, we analyse
popular solutions for this task, such as attributes group, part-based,
\emph{etc}. Fifthly, we shown some applications which takes pedestrian
attributes into consideration and achieve better performance. Finally, we
summarized this paper and give several possible research directions for
pedestrian attributes recognition. The project page of this paper can be found
from the following website:
\url{https://sites.google.com/view/ahu-pedestrianattributes/}.Comment: Check our project page for High Resolution version of this survey:
https://sites.google.com/view/ahu-pedestrianattributes
Multi-modal Facial Action Unit Detection with Large Pre-trained Models for the 5th Competition on Affective Behavior Analysis in-the-wild
Facial action unit detection has emerged as an important task within facial
expression analysis, aimed at detecting specific pre-defined, objective facial
expressions, such as lip tightening and cheek raising. This paper presents our
submission to the Affective Behavior Analysis in-the-wild (ABAW) 2023
Competition for AU detection. We propose a multi-modal method for facial action
unit detection with visual, acoustic, and lexical features extracted from the
large pre-trained models. To provide high-quality details for visual feature
extraction, we apply super-resolution and face alignment to the training data
and show potential performance gain. Our approach achieves the F1 score of
52.3% on the official validation set of the 5th ABAW Challenge.Comment: 8 pages, 7 figures, 5 table
- …