VE-KWS: Visual Modality Enhanced End-to-End Keyword Spotting
The performance of keyword spotting (KWS) systems based on the audio modality,
commonly measured in false alarms and false rejects, degrades significantly
under far-field and noisy conditions. Therefore, audio-visual keyword
spotting, which leverages complementary relationships across multiple
modalities, has recently gained much attention. However, current studies mainly
focus on combining representations learned independently for each modality,
rather than exploring inter-modal relationships during the modeling of each modality.
In this paper, we propose a novel visual modality enhanced end-to-end KWS
framework (VE-KWS), which fuses audio and visual modalities from two aspects.
The first is to use the speaker location information obtained from the
lip region in videos to assist the training of a multi-channel audio
beamformer. With the beamformer serving as an audio enhancement module, the
acoustic distortions caused by far-field or noisy environments can be
significantly suppressed. The second is to apply cross-attention between the
modalities to capture inter-modal relationships and support the representation
learning of each modality. Experiments on the MISP challenge corpus show that
our proposed model achieves a 2.79% false rejection rate and a 2.95% false
alarm rate on the Eval set, setting a new state-of-the-art performance compared
with the top-ranking systems in the ICASSP 2022 MISP challenge.
Comment: 5 pages. Accepted at ICASSP202
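The cross-modal fusion described above can be illustrated with a short sketch: two attention blocks in which the audio stream attends to the visual stream and vice versa. The module and tensor names below are illustrative assumptions, not the VE-KWS implementation.

```python
# Minimal sketch of cross-modal attention between audio and visual streams.
# Shapes, dimensions, and module names are illustrative, not the VE-KWS code.
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        # audio attends to visual, and visual attends to audio
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, audio, visual):
        # audio: (B, Ta, dim), visual: (B, Tv, dim)
        a_ctx, _ = self.a2v(query=audio, key=visual, value=visual)
        v_ctx, _ = self.v2a(query=visual, key=audio, value=audio)
        return self.norm_a(audio + a_ctx), self.norm_v(visual + v_ctx)

audio = torch.randn(2, 100, 256)   # e.g. frames of enhanced beamformer output
visual = torch.randn(2, 25, 256)   # e.g. lip-region embeddings
audio_fused, visual_fused = CrossModalBlock()(audio, visual)
```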
Semantic2Graph: Graph-based Multi-modal Feature for Action Segmentation in Videos
Video action segmentation and recognition tasks have been widely applied in
many fields. Most previous studies employ large-scale, computationally
expensive visual models to understand videos comprehensively. However, few
studies directly employ graph models to reason over videos. Graph models offer
fewer parameters, lower computational cost, a large receptive field, and
flexible neighborhood message aggregation. In this paper, we present a
graph-based method named Semantic2Graph that casts video action segmentation
and recognition as node classification on graphs. To
preserve fine-grained relations in videos, we construct the graph structure of
videos at the frame-level and design three types of edges: temporal, semantic,
and self-loop. We combine visual, structural, and semantic features as node
attributes. Semantic edges are used to model long-term spatio-temporal
relations, while the semantic features are the embedding of the label-text
based on the textual prompt. A graph neural network (GNN) model is used to
learn multi-modal feature fusion. Experimental results show that Semantic2Graph
achieves improvements over state-of-the-art results on GTEA and 50Salads.
Multiple ablation experiments further confirm the effectiveness of semantic
features in improving model performance, and semantic edges enable
Semantic2Graph to capture long-term dependencies at low cost.
Comment: 10 pages, 3 figures, 8 tables. This paper was submitted to IEEE Transactions on Multimedi
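As a rough illustration of casting segmentation as node classification, the sketch below builds frame-level nodes with self-loop and temporal edges and runs a toy mean-aggregation GNN step. The edge window, feature size, and class count are assumptions, and the semantic edges used in the paper are omitted here.

```python
# Minimal sketch of frame-level graph construction in the spirit of Semantic2Graph.
# Edge definitions and feature names are illustrative assumptions, not the paper's code.
import torch

def build_edges(num_frames, window=1):
    edges = []
    for t in range(num_frames):
        edges.append((t, t))                      # self-loop edge
        for d in range(1, window + 1):            # temporal edges to neighbours
            if t + d < num_frames:
                edges.append((t, t + d))
                edges.append((t + d, t))
    return torch.tensor(edges, dtype=torch.long).T     # shape (2, E)

def gcn_layer(x, edge_index, weight):
    # Mean aggregation over neighbours followed by a linear map: a toy GNN step.
    agg = torch.zeros_like(x)
    deg = torch.zeros(x.size(0), 1)
    src, dst = edge_index
    agg.index_add_(0, dst, x[src])
    deg.index_add_(0, dst, torch.ones(src.size(0), 1))
    return (agg / deg.clamp(min=1)) @ weight

frames = torch.randn(50, 128)             # per-frame node features (visual + semantic)
edge_index = build_edges(50, window=2)
logits = gcn_layer(frames, edge_index, torch.randn(128, 11))  # 11 action classes
```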
Humans in 4D: Reconstructing and Tracking Humans with Transformers
We present an approach to reconstruct humans and track them over time. At the
core of our approach, we propose a fully "transformerized" version of a network
for human mesh recovery. This network, HMR 2.0, advances the state of the art
and shows the capability to analyze unusual poses that have in the past been
difficult to reconstruct from single images. To analyze video, we use 3D
reconstructions from HMR 2.0 as input to a tracking system that operates in 3D.
This enables us to deal with multiple people and maintain identities through
occlusion events. Our complete approach, 4DHumans, achieves state-of-the-art
results for tracking people from monocular video. Furthermore, we demonstrate
the effectiveness of HMR 2.0 on the downstream task of action recognition,
achieving significant improvements over previous pose-based action recognition
approaches. Our code and models are available on the project website:
https://shubham-goel.github.io/4dhumans/.
Comment: Project Webpage: https://shubham-goel.github.io/4dhumans
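As a loose illustration of tracking in 3D, the sketch below greedily associates per-frame 3D detections to existing tracks by root-position distance with Hungarian matching; this is a simplified stand-in, not the 4DHumans tracking system, and the threshold and coordinate conventions are assumptions.

```python
# Minimal sketch of associating per-frame 3D detections to tracks by 3D distance.
# A simplified illustration, not the 4DHumans tracker itself.
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_positions, det_positions, max_dist=0.5):
    """track_positions: (T, 3), det_positions: (D, 3) root positions in metres."""
    cost = np.linalg.norm(track_positions[:, None, :] - det_positions[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)           # Hungarian matching
    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_dist]
    unmatched = set(range(len(det_positions))) - {c for _, c in matches}
    return matches, sorted(unmatched)                  # unmatched detections start new tracks

tracks = np.array([[0.0, 0.0, 3.0], [1.2, 0.1, 4.5]])
dets = np.array([[0.05, 0.0, 3.1], [2.5, 0.0, 6.0]])
print(associate(tracks, dets))
```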
Dilated Context Integrated Network with Cross-Modal Consensus for Temporal Emotion Localization in Videos
Understanding human emotions is a crucial ability for intelligent robots to
provide better human-robot interactions. The existing works are limited to
trimmed video-level emotion classification, failing to locate the temporal
window corresponding to the emotion. In this paper, we introduce a new task,
named Temporal Emotion Localization in Videos (TEL), which aims to detect human
emotions and localize their corresponding temporal boundaries in untrimmed
videos with aligned subtitles. TEL presents three unique challenges compared to
temporal action localization: 1) The emotions have extremely varied temporal
dynamics; 2) The emotion cues are embedded in both appearances and complex
plots; 3) The fine-grained temporal annotations are complicated and
labor-intensive. To address the first two challenges, we propose a novel
dilated context integrated network with a coarse-fine two-stream architecture.
The coarse stream captures varied temporal dynamics by modeling
multi-granularity temporal contexts. The fine stream achieves an understanding
of complex plots by reasoning about the dependencies between the
multi-granularity temporal contexts from the coarse stream and adaptively
integrating them into fine-grained video segment features. To address the third
challenge, we
introduce a cross-modal consensus learning paradigm, which leverages the
inherent semantic consensus between the aligned video and subtitle to achieve
weakly-supervised learning. We contribute a new testing set with 3,000
manually-annotated temporal boundaries so that future research on the TEL
problem can be quantitatively evaluated. Extensive experiments show the
effectiveness of our approach on temporal emotion localization. The repository
of this work is at
https://github.com/YYJMJC/Temporal-Emotion-Localization-in-Videos.
Comment: Accepted by ACM Multimedia 202
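One common way to model multi-granularity temporal context, in the spirit of the coarse stream, is a bank of dilated 1D convolutions over segment features. The sketch below is an illustration under assumed layer sizes, not the paper's architecture.

```python
# Minimal sketch of multi-granularity temporal context via dilated 1D convolutions.
# Layer sizes and the fusion rule are assumptions, not the paper's design.
import torch
import torch.nn as nn

class DilatedContext(nn.Module):
    def __init__(self, dim=512, dilations=(1, 2, 4, 8)):
        super().__init__()
        # one branch per temporal granularity; matched padding keeps the length
        self.branches = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=3, dilation=d, padding=d) for d in dilations
        )

    def forward(self, x):
        # x: (B, T, dim) segment features; returns fused multi-granularity context
        x = x.transpose(1, 2)                          # (B, dim, T) for Conv1d
        ctx = torch.stack([b(x) for b in self.branches]).mean(0)
        return ctx.transpose(1, 2)

feats = torch.randn(2, 120, 512)                       # 120 video segments
print(DilatedContext()(feats).shape)                   # torch.Size([2, 120, 512])
```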
Video Sign Language Recognition using Pose Extraction and Deep Learning Models
Sign language recognition (SLR) has long been a research subject within the computer vision domain. Appearance-based and pose-based approaches are two ways to tackle SLR tasks. Models ranging from traditional to current state-of-the-art, including HOG-based features, convolutional neural networks, recurrent neural networks, Transformers, and graph convolutional networks, have been applied to SLR. While classifying alphabet letters in sign language has achieved high accuracy, recognizing words presents its own difficulties, including the large vocabulary size, the subtleties of body motions and hand orientations, and regional dialects and variations. The emergence of deep learning has created opportunities for improved word-level sign recognition, but challenges such as overfitting and limited training data remain. Techniques such as data augmentation, feature engineering, hyperparameter tuning, optimization, and ensemble methods have been used to overcome these challenges and improve the accuracy and generalization ability of ASL classification models. In this project, we explore various methods to improve accuracy and performance. We first reproduce a baseline accuracy of 43.02% on the WLASL dataset and then improve it to 55.96%. We also extend the work to a different dataset to gain a more comprehensive understanding of our approach.
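A minimal example of the pose-based route is a recurrent classifier over per-frame keypoint vectors. The keypoint count, hidden size, and vocabulary size below are placeholders rather than the WLASL configuration used in the work, and keypoints are assumed to be extracted beforehand.

```python
# Minimal sketch of a pose-based word classifier: a bidirectional GRU over
# per-frame keypoint vectors. Dimensions and class count are illustrative.
import torch
import torch.nn as nn

class PoseGRUClassifier(nn.Module):
    def __init__(self, num_keypoints=54, num_classes=100, hidden=256):
        super().__init__()
        self.gru = nn.GRU(num_keypoints * 2, hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, num_classes)

    def forward(self, kp_seq):
        # kp_seq: (B, T, num_keypoints * 2) flattened (x, y) keypoints per frame
        out, _ = self.gru(kp_seq)
        return self.head(out.mean(dim=1))              # average-pool over time

clips = torch.randn(4, 60, 54 * 2)                      # four 60-frame clips
print(PoseGRUClassifier()(clips).shape)                 # torch.Size([4, 100])
```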
Ensemble Based Feature Extraction and Deep Learning Classification Model with Depth Vision
It remains a challenging task to identify human activities from a video sequence or still image due to factors such as background clutter, partial occlusion, and changes in scale, viewpoint, appearance, and lighting. Different applications, including video surveillance systems, human-computer interfaces, and robots used to study human behavior, require different activity classification systems. A four-stage framework for recognizing human activities is proposed in this paper. In the initial pre-processing stages, video-to-frame conversion and adaptive histogram equalization (AHE) are performed. Watershed segmentation is then applied and, from the segmented images, local texton XOR patterns (LTXOR), motion boundary scale-invariant feature transform (MoBSIFT), and bag of visual words (BoW) features are extracted. Bidirectional gated recurrent unit (Bi-GRU) and bidirectional long short-term memory (Bi-LSTM) classifiers are used to detect human activity. The decisions of the Bi-GRU and Bi-LSTM classifiers are then fused using the Dempster-Shafer theory (DST) of evidence, which makes the resulting decisions more reliable, and their accuracy levels are determined. Various metrics are used to assess the effectiveness of the deployed approach.
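The Dempster-Shafer fusion step can be sketched for the simple case where each classifier's softmax output is treated as a mass function over singleton classes; this special case is for illustration only and is not necessarily the exact scheme used in the paper.

```python
# Minimal sketch of Dempster-Shafer combination of two classifiers' outputs,
# treating each probability vector as masses on singleton classes (illustrative only).
import numpy as np

def dempster_combine(m1, m2):
    """Combine two singleton mass functions (here: class probability vectors)."""
    joint = m1 * m2                        # agreement mass per class
    conflict = 1.0 - joint.sum()           # mass on conflicting class pairs
    if np.isclose(joint.sum(), 0.0):
        raise ValueError("total conflict: sources cannot be combined")
    return joint / (1.0 - conflict)        # Dempster's rule of combination

bi_gru = np.array([0.70, 0.20, 0.10])      # e.g. walking / sitting / waving
bi_lstm = np.array([0.60, 0.30, 0.10])
fused = dempster_combine(bi_gru, bi_lstm)
print(fused, fused.argmax())               # fused belief and final activity decision
```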
Hierarchical Boundary-Aware Neural Encoder for Video Captioning
The use of recurrent neural networks for video captioning has recently gained a lot of attention, since they can be used both to encode the input video and to generate the corresponding description. In this paper, we present a recurrent video encoding scheme which can discover and leverage the hierarchical structure of the video. Unlike the classical encoder-decoder approach, in which a video is encoded continuously by a recurrent layer, we propose a novel LSTM cell which can identify discontinuity points between frames or segments and modify the temporal connections of the encoding layer accordingly. We evaluate our approach on three large-scale datasets: the Montreal Video Annotation dataset, the MPII Movie Description dataset, and the Microsoft Video Description Corpus. Experiments show that our approach can discover appropriate hierarchical representations of input videos and improve state-of-the-art results on the movie description datasets.
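The idea of a boundary-aware recurrent encoder can be sketched as an LSTM whose state is softly reset whenever a learned gate detects a discontinuity. The cell below is an illustrative approximation under assumed dimensions, not the cell proposed in the paper.

```python
# Minimal sketch of a boundary-aware recurrent encoding step: a learned gate
# estimates a segment boundary and softly resets the LSTM state when one occurs.
import torch
import torch.nn as nn

class BoundaryAwareEncoder(nn.Module):
    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        self.cell = nn.LSTMCell(feat_dim, hidden)
        self.boundary = nn.Linear(feat_dim + hidden, 1)   # boundary detector

    def forward(self, frames):
        # frames: (B, T, feat_dim); returns the final hidden state as the video code
        B, T, _ = frames.shape
        h = frames.new_zeros(B, self.cell.hidden_size)
        c = frames.new_zeros(B, self.cell.hidden_size)
        for t in range(T):
            s = torch.sigmoid(self.boundary(torch.cat([frames[:, t], h], dim=-1)))
            h, c = self.cell(frames[:, t], (h * (1 - s), c * (1 - s)))  # reset at boundaries
        return h

video = torch.randn(2, 40, 512)
print(BoundaryAwareEncoder()(video).shape)   # torch.Size([2, 256])
```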
No-audio speaking status detection in crowded settings via visual pose-based filtering and wearable acceleration
Recognizing who is speaking in a crowded scene is a key challenge towards the
understanding of the social interactions going on within. Detecting speaking
status from body movement alone opens the door for the analysis of social
scenes in which personal audio is not obtainable. Video and wearable sensors
make it possible to recognize speaking in an unobtrusive, privacy-preserving way.
In the video modality, action recognition approaches traditionally use a
bounding box to localize and segment out the target subject and then recognize
the action taking place within it. However, cross-contamination, occlusion, and
the articulated nature of the human body make this approach challenging in a
crowded scene. Here, we leverage articulated body poses for both subject
localization and the subsequent speech detection stage. We show that
the selection of local features around pose keypoints has a positive effect on
generalization performance while also significantly reducing the number of
local features considered, making for a more efficient method. Using two
in-the-wild datasets with different viewpoints of subjects, we investigate the
role of cross-contamination in this effect. We additionally make use of
acceleration measured through wearable sensors for the same task, and present a
multimodal approach combining both methods.
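The pose-based localization idea can be sketched as extracting small patches around detected keypoints instead of a whole bounding box; the patch size and keypoint layout below are illustrative assumptions, not the datasets' actual pose format.

```python
# Minimal sketch of selecting local regions around pose keypoints rather than
# a full bounding box; patch size and keypoint layout are illustrative.
import numpy as np

def keypoint_patches(frame, keypoints, size=16):
    """frame: (H, W, C) image; keypoints: (K, 2) pixel coordinates (x, y)."""
    h, w = frame.shape[:2]
    half = size // 2
    patches = []
    for x, y in keypoints.astype(int):
        x0, x1 = np.clip([x - half, x + half], 0, w)
        y0, y1 = np.clip([y - half, y + half], 0, h)
        patches.append(frame[y0:y1, x0:x1])
    return patches            # local regions for movement / speaking-status features

frame = np.random.rand(480, 640, 3)
pose = np.array([[320, 120], [300, 200], [340, 200]])   # e.g. head and shoulders
print([p.shape for p in keypoint_patches(frame, pose)])
```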