206,521 research outputs found

    CoAVT: A Cognition-Inspired Unified Audio-Visual-Text Pre-Training Model for Multimodal Processing

    Full text link
    There has been a long-standing quest for a unified audio-visual-text model to enable various multimodal understanding tasks, which mimics the listening, seeing and reading process of human beings. Humans tends to represent knowledge using two separate systems: one for representing verbal (textual) information and one for representing non-verbal (visual and auditory) information. These two systems can operate independently but can also interact with each other. Motivated by this understanding of human cognition, in this paper, we introduce CoAVT -- a novel cognition-inspired Correlated Audio-Visual-Text pre-training model to connect the three modalities. It contains a joint audio-visual encoder that learns to encode audio-visual synchronization information together with the audio and visual content for non-verbal information, and a text encoder to handle textual input for verbal information. To bridge the gap between modalities, CoAVT employs a query encoder, which contains a set of learnable query embeddings, and extracts the most informative audiovisual features of the corresponding text. Additionally, to leverage the correspondences between audio and vision with language respectively, we also establish the audio-text and visual-text bi-modal alignments upon the foundational audiovisual-text tri-modal alignment to enhance the multimodal representation learning. Finally, we jointly optimize CoAVT model with three multimodal objectives: contrastive loss, matching loss and language modeling loss. Extensive experiments show that CoAVT can learn strong multimodal correlations and be generalized to various downstream tasks. CoAVT establishes new state-of-the-art performance on text-video retrieval task on AudioCaps for both zero-shot and fine-tuning settings, audio-visual event classification and audio-visual retrieval tasks on AudioSet and VGGSound

    The Integration Of Audio Into Multimodal Interfaces: Guidelines And Applications Of Integrating Speech, Earcons, Auditory Icons, and Spatial Audio (SEAS)

    Get PDF
    The current research is directed at providing validated guidelines to direct the integration of audio into human-system interfaces. This work first discusses the utility of integrating audio to support multimodal human-information processing. Next, an auditory interactive computing paradigm utilizing Speech, Earcons, Auditory icons, and Spatial audio (SEAS) cues is proposed and guidelines for the integration of SEAS cues into multimodal systems are presented. Finally, the results of two studies are presented that evaluate the utility of using SEAS cues, developed following the proposed guidelines, in relieving perceptual and attention processing bottlenecks when conducting Unmanned Air Vehicle (UAV) control tasks. The results demonstrate that SEAS cues significantly enhance human performance on UAV control tasks, particularly response accuracy and reaction time on a secondary monitoring task. The results suggest that SEAS cues may be effective in overcoming perceptual and attentional bottlenecks, with the advantages being most revealing during high workload conditions. The theories and principles provided in this paper should be of interest to audio system designers and anyone involved in the design of multimodal human-computer systems

    Exploring the Efficacy of Audio Email Feedback in Information Management Assessment (ExAEF Project) : Final Report

    Get PDF
    Formative assessment generates feedback on students' performance, thereby accelerating and improving student learning. Anecdotal evidence gathered by a number of evaluations has hypothesised that audio feedback may be capable of enhancing student learning more than other approaches. A quasi-experimental study employing qualitative techniques for triangulation was conducted to formally evaluate the efficacy of formative audio feedback on student learning in a web technologies module. We focussed on the delivery of 'voice emails' to undergraduate students (n = 66) and attempted to evaluate the efficacy of such feedback in formative assessment and ergo students’ learning, as well as achieving a better understanding of students’ feedback behaviour post-delivery. The results indicated that audio feedback better conforms to existing models of ‘quality’ formative feedback as defined by the pedagogical research, can enhance the student learning experience and can be more efficient in feedback delivery. Despite this and high levels of feedback re-use by student participants, the audio treatment group underperformed in learning tasks when compared to the control group. The benefits to be gained when using audio feedback has led to its wider adoption within information and computer science teaching practice and greater use of formative assessment in taught modules

    MASR: Metadata Aware Speech Representation

    Full text link
    In the recent years, speech representation learning is constructed primarily as a self-supervised learning (SSL) task, using the raw audio signal alone, while ignoring the side-information that is often available for a given speech recording. In this paper, we propose MASR, a Metadata Aware Speech Representation learning framework, which addresses the aforementioned limitations. MASR enables the inclusion of multiple external knowledge sources to enhance the utilization of meta-data information. The external knowledge sources are incorporated in the form of sample-level pair-wise similarity matrices that are useful in a hard-mining loss. A key advantage of the MASR framework is that it can be combined with any choice of SSL method. Using MASR representations, we perform evaluations on several downstream tasks such as language identification, speech recognition and other non-semantic tasks such as speaker and emotion recognition. In these experiments, we illustrate significant performance improvements for the MASR over other established benchmarks. We perform a detailed analysis on the language identification task to provide insights on how the proposed loss function enables the representations to separate closely related languages

    Decoder-only Architecture for Speech Recognition with CTC Prompts and Text Data Augmentation

    Full text link
    Collecting audio-text pairs is expensive; however, it is much easier to access text-only data. Unless using shallow fusion, end-to-end automatic speech recognition (ASR) models require architecture modifications or additional training schemes to use text-only data. Inspired by recent advances in decoder-only language models (LMs), such as GPT-3 and PaLM adopted for speech-processing tasks, we propose using a decoder-only architecture for ASR with simple text augmentation. To provide audio information, encoder features compressed by CTC prediction are used as prompts for the decoder, which can be regarded as refining CTC prediction using the decoder-only model. Because the decoder architecture is the same as an autoregressive LM, it is simple to enhance the model by leveraging external text data with LM training. An experimental comparison using LibriSpeech and Switchboard shows that our proposed models with text augmentation training reduced word error rates from ordinary CTC by 0.3% and 1.4% on LibriSpeech test-clean and testother set, respectively, and 2.9% and 5.0% on Switchboard and CallHome. The proposed model had advantage on computational efficiency compared with conventional encoder-decoder ASR models with a similar parameter setup, and outperformed them on the LibriSpeech 100h and Switchboard training scenarios.Comment: Submitted to ICASSP202

    Eccentricity dependent auditory enhancement of visual stimulus detection but not discrimination

    Get PDF
    Sensory perception is enhanced by the complementary information provided by our different sensory modalities and even apparently task irrelevant stimuli in one modality can facilitate performance in another. While perception in general comprises both, the detection of sensory objects as well as their discrimination and recognition, most studies on audio-visual interactions have focused on either of these aspects. However, previous evidence, neuroanatomical projections between early sensory cortices and computational mechanisms suggest that sounds might differentially affect visual detection and discrimination and differentially at central and peripheral retinal locations. We performed an experiment to directly test this by probing the enhancement of visual detection and discrimination by auxiliary sounds at different visual eccentricities and within the same subjects. Specifically, we quantified the enhancement provided by sounds that reduce the overall uncertainty about the visual stimulus beyond basic multisensory co-stimulation. This revealed a general trend for stronger enhancement at peripheral locations in both tasks, but a statistically significant effect only for detection and only at peripheral locations. Overall this suggests that there are topographic differences in the auditory facilitation of basic visual processes and that these may differentially affect basic aspects of visual recognition

    ENHANCING USERS’ EXPERIENCE WITH SMART MOBILE TECHNOLOGY

    Get PDF
    The aim of this thesis is to investigate mobile guides for use with smartphones. Mobile guides have been successfully used to provide information, personalisation and navigation for the user. The researcher also wanted to ascertain how and in what ways mobile guides can enhance users' experience. This research involved designing and developing web based applications to run on smartphones. Four studies were conducted, two of which involved testing of the particular application. The applications tested were a museum mobile guide application and a university mobile guide mapping application. Initial testing examined the prototype work for the ‘Chronology of His Majesty Sultan Haji Hassanal Bolkiah’ application. The results were used to assess the potential of using similar mobile guides in Brunei Darussalam’s museums. The second study involved testing of the ‘Kent LiveMap’ application for use at the University of Kent. Students at the university tested this mapping application, which uses crowdsourcing of information to provide live data. The results were promising and indicate that users' experience was enhanced when using the application. Overall results from testing and using the two applications that were developed as part of this thesis show that mobile guides have the potential to be implemented in Brunei Darussalam’s museums and on campus at the University of Kent. However, modifications to both applications are required to fulfil their potential and take them beyond the prototype stage in order to be fully functioning and commercially viable
    • 

    corecore