CoAVT: A Cognition-Inspired Unified Audio-Visual-Text Pre-Training Model for Multimodal Processing
There has been a long-standing quest for a unified audio-visual-text model
that enables various multimodal understanding tasks, mimicking the listening,
seeing, and reading processes of human beings. Humans tend to represent knowledge
using two separate systems: one for representing verbal (textual) information
and one for representing non-verbal (visual and auditory) information. These
two systems can operate independently but can also interact with each other.
Motivated by this understanding of human cognition, in this paper, we introduce
CoAVT -- a novel cognition-inspired Correlated Audio-Visual-Text pre-training
model to connect the three modalities. It contains a joint audio-visual encoder
that learns to encode audio-visual synchronization information together with
the audio and visual content for non-verbal information, and a text encoder to
handle textual input for verbal information. To bridge the gap between
modalities, CoAVT employs a query encoder, which contains a set of learnable
query embeddings and extracts the audiovisual features most informative for
the corresponding text. Additionally, to leverage the correspondences between audio
and vision with language respectively, we also establish the audio-text and
visual-text bi-modal alignments upon the foundational audiovisual-text
tri-modal alignment to enhance the multimodal representation learning. Finally,
we jointly optimize the CoAVT model with three multimodal objectives: contrastive
loss, matching loss and language modeling loss. Extensive experiments show that
CoAVT can learn strong multimodal correlations and be generalized to various
downstream tasks. CoAVT establishes new state-of-the-art performance on the
text-video retrieval task on AudioCaps in both zero-shot and fine-tuning
settings, and on the audio-visual event classification and audio-visual
retrieval tasks on AudioSet and VGGSound.
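The tri-modal and bi-modal alignments described above can be sketched with a plain InfoNCE contrastive loss, where the text embedding is aligned against the joint audiovisual, audio, and visual embeddings of the same sample. This is an illustrative reading of the abstract only: the function names, embedding shapes, and temperature are assumptions, and CoAVT's full objective additionally includes matching and language-modeling losses and the learned query embeddings.

```python
import math

def cosine(u, v):
    # Cosine similarity between two plain-list embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def info_nce(anchors, positives, temperature=0.07):
    """Batch InfoNCE: the positive for anchor i is positives[i];
    every other entry in the batch serves as a negative."""
    loss = 0.0
    for i, a in enumerate(anchors):
        logits = [cosine(a, p) / temperature for p in positives]
        log_denom = math.log(sum(math.exp(z) for z in logits))
        loss += log_denom - logits[i]
    return loss / len(anchors)

def coavt_style_contrastive_loss(text, audio, visual, audiovisual):
    # Tri-modal alignment (audiovisual <-> text) on top of the two
    # bi-modal alignments (audio <-> text, visual <-> text), as the
    # abstract describes. Hypothetical composition, not CoAVT's code.
    return (info_nce(text, audiovisual)
            + info_nce(text, audio)
            + info_nce(text, visual))
```

With matched batches (each text paired with its own audiovisual sample), the combined loss is lower than for shuffled pairings, which is the correlation the pre-training objective rewards.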
The Integration Of Audio Into Multimodal Interfaces: Guidelines And Applications Of Integrating Speech, Earcons, Auditory Icons, and Spatial Audio (SEAS)
The current research is aimed at providing validated guidelines for the integration of audio into human-system interfaces. This work first discusses the utility of integrating audio to support multimodal human-information processing. Next, an auditory interactive computing paradigm utilizing Speech, Earcons, Auditory icons, and Spatial audio (SEAS) cues is proposed, and guidelines for the integration of SEAS cues into multimodal systems are presented. Finally, the results of two studies are presented that evaluate the utility of using SEAS cues, developed following the proposed guidelines, in relieving perceptual and attentional processing bottlenecks when conducting Unmanned Air Vehicle (UAV) control tasks. The results demonstrate that SEAS cues significantly enhance human performance on UAV control tasks, particularly response accuracy and reaction time on a secondary monitoring task. The results suggest that SEAS cues may be effective in overcoming perceptual and attentional bottlenecks, with the advantages being most pronounced under high workload conditions. The theories and principles provided in this paper should be of interest to audio system designers and anyone involved in the design of multimodal human-computer systems.
Exploring the Efficacy of Audio Email Feedback in Information Management Assessment (ExAEF Project) : Final Report
Formative assessment generates feedback on students' performance, thereby accelerating and improving student learning. Anecdotal evidence gathered by a number of evaluations has suggested that audio feedback may be capable of enhancing student learning more than other approaches. A quasi-experimental study employing qualitative techniques for triangulation was conducted to formally evaluate the efficacy of formative audio feedback on student learning in a web technologies module. We focussed on the delivery of 'voice emails' to undergraduate students (n = 66) and attempted to evaluate the efficacy of such feedback in formative assessment, and hence students' learning, as well as achieving a better understanding of students' feedback behaviour post-delivery. The results indicated that audio feedback better conforms to existing models of 'quality' formative feedback as defined by the pedagogical research, can enhance the student learning experience, and can be more efficient in feedback delivery. Despite this, and high levels of feedback re-use by student participants, the audio treatment group underperformed in learning tasks when compared to the control group. The benefits to be gained when using audio feedback have led to its wider adoption within information and computer science teaching practice and greater use of formative assessment in taught modules.
MASR: Metadata Aware Speech Representation
In recent years, speech representation learning has been framed primarily
as a self-supervised learning (SSL) task using the raw audio signal alone,
while ignoring the side information that is often available for a given speech
recording. In this paper, we propose MASR, a Metadata Aware Speech
Representation learning framework that addresses this limitation. MASR
enables the inclusion of multiple external knowledge sources to enhance the
utilization of metadata information. The external knowledge
sources are incorporated in the form of sample-level pair-wise similarity
matrices that are useful in a hard-mining loss. A key advantage of the MASR
framework is that it can be combined with any choice of SSL method. Using MASR
representations, we perform evaluations on several downstream tasks, such as
language identification and speech recognition, as well as non-semantic tasks
such as speaker and emotion recognition. In these experiments, we demonstrate
significant performance improvements for MASR over other established
benchmarks. We perform a detailed analysis of the language identification task
to provide insights into how the proposed loss function enables the
representations to separate closely related languages.
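The abstract's central mechanism, a sample-level pairwise similarity matrix built from metadata and consumed by a hard-mining loss, can be sketched roughly as below. The matrix construction (binary agreement on one metadata field) and the triplet-style hinge loss are assumptions for illustration; the abstract does not spell out MASR's exact formulation.

```python
import math

def build_similarity_matrix(metadata):
    """Sample-level pairwise similarity from one metadata field
    (e.g. language ID): 1.0 when two recordings share the value."""
    n = len(metadata)
    return [[1.0 if metadata[i] == metadata[j] else 0.0
             for j in range(n)] for i in range(n)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def hard_mining_loss(embeddings, sim_matrix, margin=0.2):
    """Triplet-style hard mining: for each anchor, compare its
    least similar positive against its most similar negative,
    as designated by the metadata similarity matrix."""
    loss, counted = 0.0, 0
    n = len(embeddings)
    for i in range(n):
        pos = [cosine(embeddings[i], embeddings[j])
               for j in range(n) if j != i and sim_matrix[i][j] > 0.5]
        neg = [cosine(embeddings[i], embeddings[j])
               for j in range(n) if sim_matrix[i][j] <= 0.5]
        if not pos or not neg:
            continue  # no valid triplet for this anchor
        loss += max(0.0, margin + max(neg) - min(pos))
        counted += 1
    return loss / max(counted, 1)
```

When the embedding space already groups recordings by the metadata value, every hinge term is inactive and the loss is zero; embeddings that contradict the metadata incur a positive penalty, which is what pushes closely related languages apart.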
Decoder-only Architecture for Speech Recognition with CTC Prompts and Text Data Augmentation
Collecting audio-text pairs is expensive; however, it is much easier to
access text-only data. Unless using shallow fusion, end-to-end automatic speech
recognition (ASR) models require architecture modifications or additional
training schemes to use text-only data. Inspired by recent advances in
decoder-only language models (LMs), such as GPT-3 and PaLM adopted for
speech-processing tasks, we propose using a decoder-only architecture for ASR
with simple text augmentation. To provide audio information, encoder features
compressed by CTC prediction are used as prompts for the decoder, which can be
regarded as refining CTC prediction using the decoder-only model. Because the
decoder architecture is the same as an autoregressive LM, it is simple to
enhance the model by leveraging external text data with LM training. An
experimental comparison using LibriSpeech and Switchboard shows that our
proposed models with text-augmentation training reduced word error rates over
ordinary CTC by 0.3% and 1.4% on the LibriSpeech test-clean and test-other
sets, respectively, and by 2.9% and 5.0% on Switchboard and CallHome. The
proposed model had an advantage in computational efficiency over conventional
encoder-decoder ASR models with a similar parameter setup, and outperformed
them in the LibriSpeech 100h and Switchboard training scenarios. Comment: Submitted to ICASSP202
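The prompt construction described above ("encoder features compressed by CTC prediction") can be sketched as follows: take the per-frame CTC argmax, drop blank frames, merge consecutive repeats, and average the encoder features within each surviving run to obtain one prompt vector per rough token. This is an illustrative reading of the abstract, with hypothetical names; the paper's exact compression rule may differ.

```python
def ctc_compress(frame_labels, frame_features, blank=0):
    """Collapse the per-frame CTC argmax: drop blank frames, merge
    consecutive repeats, and average the encoder features of each
    surviving run to form one prompt vector per (rough) token."""
    def flush(label, feats, tokens, prompts):
        # Emit one averaged feature vector for a finished run,
        # unless the run was blank (or we haven't started yet).
        if label is not None and label != blank and feats:
            dim = len(feats[0])
            tokens.append(label)
            prompts.append([sum(f[d] for f in feats) / len(feats)
                            for d in range(dim)])

    tokens, prompts = [], []
    run_label, run_feats = None, []
    for lab, feat in zip(frame_labels, frame_features):
        if lab != run_label:
            flush(run_label, run_feats, tokens, prompts)
            run_label, run_feats = lab, []
        run_feats.append(feat)
    flush(run_label, run_feats, tokens, prompts)
    return tokens, prompts
```

The resulting `prompts` sequence is much shorter than the frame sequence, so prepending it to the decoder input stays cheap, and the decoder's job reduces to refining this rough CTC hypothesis, consistent with the abstract's framing.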
Eccentricity dependent auditory enhancement of visual stimulus detection but not discrimination
Sensory perception is enhanced by the complementary information provided by our different sensory modalities, and even apparently task-irrelevant stimuli in one modality can facilitate performance in another. While perception in general comprises both the detection of sensory objects and their discrimination and recognition, most studies on audio-visual interactions have focused on only one of these aspects. However, previous evidence, neuroanatomical projections between early sensory cortices, and computational mechanisms suggest that sounds might differentially affect visual detection and discrimination, and might do so differently at central and peripheral retinal locations. We performed an experiment to test this directly by probing the enhancement of visual detection and discrimination by auxiliary sounds at different visual eccentricities within the same subjects. Specifically, we quantified the enhancement provided by sounds that reduce the overall uncertainty about the visual stimulus beyond basic multisensory co-stimulation. This revealed a general trend for stronger enhancement at peripheral locations in both tasks, but a statistically significant effect only for detection and only at peripheral locations. Overall, this suggests that there are topographic differences in the auditory facilitation of basic visual processes and that these may differentially affect basic aspects of visual recognition.
ENHANCING USERS' EXPERIENCE WITH SMART MOBILE TECHNOLOGY
The aim of this thesis is to investigate mobile guides for use with smartphones. Mobile guides have been successfully used to provide information, personalisation and navigation for the user. The researcher also wanted to ascertain how and in what ways mobile guides can enhance users' experience.
This research involved designing and developing web-based applications to run on smartphones. Four studies were conducted, two of which involved testing of the particular application. The applications tested were a museum mobile guide application and a university mobile guide mapping application. Initial testing examined the prototype work for the 'Chronology of His Majesty Sultan Haji Hassanal Bolkiah' application. The results were used to assess the potential of using similar mobile guides in Brunei Darussalam's museums. The second study involved testing of the 'Kent LiveMap' application for use at the University of Kent. Students at the university tested this mapping application, which uses crowdsourcing of information to provide live data. The results were promising and indicate that users' experience was enhanced when using the application.
Overall, results from testing and using the two applications that were developed as part of this thesis show that mobile guides have the potential to be implemented in Brunei Darussalam's museums and on campus at the University of Kent. However, modifications to both applications are required to fulfil their potential and take them beyond the prototype stage in order to be fully functioning and commercially viable.