Scenario-Aware Audio-Visual TF-GridNet for Target Speech Extraction
Target speech extraction aims to extract, based on a given conditioning cue,
a target speech signal that is corrupted by interfering sources, such as noise
or competing speakers. Building upon the achievements of the state-of-the-art
(SOTA) time-frequency speaker separation model TF-GridNet, we propose
AV-GridNet, a visually grounded variant that incorporates the face recording of a
target speaker as a conditioning factor during the extraction process.
Recognizing the inherent dissimilarities between speech and noise signals as
interfering sources, we also propose SAV-GridNet, a scenario-aware model that
identifies the type of interfering scenario first and then applies a dedicated
expert model trained specifically for that scenario. Our proposed model
achieves SOTA results on the second COG-MHEAR Audio-Visual Speech Enhancement
Challenge, outperforming other models by a significant margin, objectively and
in a listening test. We also perform an extensive analysis of the results under
the two scenarios.
Comment: Accepted by ASRU 2023
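As a rough illustration of the scenario-aware routing described in the abstract above, the PyTorch sketch below classifies the type of interference and dispatches the mixture to a dedicated expert extractor. The module names (ScenarioClassifier, ScenarioAwareExtractor), the feature shapes, and the expert interfaces are assumptions for illustration only; this is not the actual SAV-GridNet or TF-GridNet implementation.

```python
# Minimal sketch of scenario-aware expert routing (hypothetical modules,
# not the actual SAV-GridNet implementation).
import torch
import torch.nn as nn


class ScenarioClassifier(nn.Module):
    """Predicts whether the interference is noise (0) or a competing speaker (1)."""

    def __init__(self, n_feats: int = 257, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_feats, hidden), nn.ReLU(), nn.Linear(hidden, 2)
        )

    def forward(self, mixture_feats: torch.Tensor) -> torch.Tensor:
        # mixture_feats: (batch, time, n_feats); pool over time, then classify.
        return self.net(mixture_feats.mean(dim=1))


class ScenarioAwareExtractor(nn.Module):
    """Routes each mixture to the expert trained for the detected scenario."""

    def __init__(self, classifier: nn.Module, noise_expert: nn.Module,
                 speaker_expert: nn.Module):
        super().__init__()
        self.classifier = classifier
        self.experts = nn.ModuleList([noise_expert, speaker_expert])

    def forward(self, mixture_feats, visual_cue):
        # Pick an expert per sample based on the predicted interference type.
        scenario = self.classifier(mixture_feats).argmax(dim=-1)  # (batch,)
        outputs = []
        for b in range(mixture_feats.size(0)):
            expert = self.experts[scenario[b].item()]
            outputs.append(expert(mixture_feats[b:b + 1], visual_cue[b:b + 1]))
        return torch.cat(outputs, dim=0)
```

In this reading, the classifier is run once per mixture and only the selected expert is evaluated, which keeps inference cost close to that of a single extractor.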
Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers
Given an input video, its associated audio, and a brief caption, the
audio-visual scene aware dialog (AVSD) task requires an agent to indulge in a
question-answer dialog with a human about the audio-visual content. This task
thus poses a challenging multi-modal representation learning and reasoning
scenario, advances in which could benefit several human-machine
interaction applications. To solve this task, we introduce a
semantics-controlled multi-modal shuffled Transformer reasoning framework,
consisting of a sequence of Transformer modules, each taking a modality as
input and producing representations conditioned on the input question. Our
proposed Transformer variant applies a shuffling scheme to its multi-head
outputs, which provides better regularization. To encode fine-grained visual
information, we present a novel dynamic scene graph representation learning
pipeline that consists of an intra-frame reasoning layer producing
spatio-semantic graph representations for every frame, and an inter-frame
aggregation module capturing temporal cues. Our entire pipeline is trained
end-to-end. We present experiments on the benchmark AVSD dataset, both on
answer generation and selection tasks. Our results demonstrate state-of-the-art
performance on all evaluation metrics.
Comment: Accepted at AAAI 2021
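The head-shuffling idea from the abstract above can be sketched as follows: per-head attention outputs are randomly permuted before the output projection during training. This is only one plausible reading of "shuffling the multi-head outputs"; the paper's exact scheme, and its conditioning on the input question, is not reproduced here.

```python
# Illustrative multi-head attention with shuffled head outputs
# (a generic sketch, not the paper's implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ShuffledMultiHeadAttention(nn.Module):
    """Self-attention whose per-head outputs are randomly permuted
    before the output projection during training (a regularizing shuffle)."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split(z):
            # (batch, time, d_model) -> (batch, heads, time, d_head)
            return z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        attn = F.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        heads = attn @ v                                   # (b, heads, t, d_head)
        if self.training:
            perm = torch.randperm(self.n_heads, device=x.device)
            heads = heads[:, perm]                         # shuffle head order
        out = heads.transpose(1, 2).reshape(b, t, d)
        return self.out_proj(out)
```

Because the output projection never sees the heads in a fixed order during training, it is discouraged from relying on any single head's position, which is one way such a shuffle could act as a regularizer.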
Style-transfer based Speech and Audio-visual Scene Understanding for Robot Action Sequence Acquisition from Videos
To realize human-robot collaboration, robots need to execute actions for new
tasks according to human instructions, given only finite prior knowledge. Human
experts can share their knowledge of how to perform a task with a robot through
multi-modal instructions in their demonstrations, showing a sequence of
short-horizon steps to achieve a long-horizon goal. This paper introduces a
method for robot action sequence generation from instruction videos using (1)
an audio-visual Transformer that converts audio-visual features and instruction
speech to a sequence of robot actions called dynamic movement primitives (DMPs)
and (2) style-transfer-based training that employs multi-task learning with
video captioning and weakly-supervised learning with a semantic classifier to
exploit unpaired video-action data. We built a system that accomplishes various
cooking actions, where an arm robot executes a DMP sequence acquired from a
cooking video using the audio-visual Transformer. Experiments with
Epic-Kitchen-100, YouCookII, QuerYD, and in-house instruction video datasets
show that the proposed method improves the quality of DMP sequences, achieving
2.3 times the METEOR score of a baseline video-to-action Transformer.
The model achieved a 32% task success rate when given task knowledge of the
object.
Comment: Accepted to Interspeech 2023
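For readers unfamiliar with the dynamic movement primitives mentioned above, the sketch below integrates a single one-dimensional discrete DMP with Euler steps. The formulation and gains are the generic textbook ones (a point attractor plus a phase-driven radial-basis forcing term), not the primitives or parameters learned in the paper.

```python
# Minimal one-dimensional discrete DMP rollout (generic formulation with
# illustrative gains; not the paper's learned primitives).
import numpy as np


def dmp_rollout(y0, goal, weights, centers, widths,
                alpha=25.0, beta=6.25, alpha_x=1.0, tau=1.0,
                dt=0.01, steps=100):
    """Integrate tau*dv = alpha*(beta*(goal - y) - v) + f(x); tau*dy = v."""
    y, v, x = y0, 0.0, 1.0
    traj = []
    for _ in range(steps):
        psi = np.exp(-widths * (x - centers) ** 2)          # RBF activations
        forcing = x * (goal - y0) * (psi @ weights) / (psi.sum() + 1e-8)
        dv = (alpha * (beta * (goal - y) - v) + forcing) / tau
        dy = v / tau
        dx = -alpha_x * x / tau                             # canonical system
        v, y, x = v + dv * dt, y + dy * dt, x + dx * dt
        traj.append(y)
    return np.array(traj)


# Example: move from 0.0 to 1.0 with zero forcing weights (pure attractor).
trajectory = dmp_rollout(0.0, 1.0, np.zeros(10),
                         np.linspace(0, 1, 10), np.full(10, 25.0))
```

A learned DMP would fit the forcing-term weights from a demonstrated trajectory; in the paper's pipeline those parameters are produced by the audio-visual Transformer rather than integrated from hand-set values as in this toy example.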