171,524 research outputs found

    Seeing Through Noise: Visually Driven Speaker Separation and Enhancement

    Full text link
    Isolating the voice of a specific person while filtering out other voices or background noises is challenging when video is shot in noisy environments. We propose audio-visual methods to isolate the voice of a single speaker and eliminate unrelated sounds. First, face motions captured in the video are used to estimate the speaker's voice by passing the silent video frames through a video-to-speech neural-network-based model. Then the speech predictions are applied as a filter on the noisy input audio. This approach avoids using mixtures of sounds in the learning process, as the number of possible mixtures is huge and would inevitably bias the trained model. We evaluate our method on two audio-visual datasets, GRID and TCD-TIMIT, and show that it attains significant SDR and PESQ improvements over the raw video-to-speech predictions and over a well-known audio-only method.
    Comment: Supplementary video: https://www.youtube.com/watch?v=qmsyj7vAzo
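
    As a rough illustration of the filtering step described above, the sketch below builds a soft spectral mask from a predicted clean-speech waveform and applies it to the noisy mixture. It is a minimal sketch, not the authors' code; `predicted_speech` stands in for the output of the silent-video model, and the exact masking scheme used in the paper may differ.

```python
# Minimal sketch (not the authors' code): use a video-to-speech prediction
# as a soft spectral filter on the noisy input audio.
# `noisy_audio` and `predicted_speech` are assumed to be mono waveforms at
# the same sample rate; the prediction would come from the silent-video model.
import numpy as np
import librosa

def filter_with_prediction(noisy_audio, predicted_speech, n_fft=512, hop=128, eps=1e-8):
    # STFT of the noisy mixture and of the predicted clean speech
    N = librosa.stft(noisy_audio, n_fft=n_fft, hop_length=hop)
    P = librosa.stft(predicted_speech, n_fft=n_fft, hop_length=hop)
    T = min(N.shape[1], P.shape[1])
    N, P = N[:, :T], P[:, :T]

    # Soft mask: keep time-frequency bins where the prediction has energy
    # relative to the mixture, attenuate the rest; reuse the mixture's phase.
    mask = np.clip(np.abs(P) / (np.abs(N) + eps), 0.0, 1.0)
    return librosa.istft(mask * N, hop_length=hop, length=len(noisy_audio))
```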

    Preliminary Study of the Application of Visual Phonics to the Remediation of Developmental Dyspraxia of Speech

    Get PDF
    An unequivocal program of remediation for developmental dyspraxia of speech remains to be established. Observations have concluded that dyspraxia, a neurologically-based motor programming disorder, benefits from a multisensory stimulation approach. One augmentative approach which incorporates auditory, tactile and visual stimuli is Visual Phonics. Research on the use of Visual Phonics in dyspraxic intervention is limited and, therefore, its contribution to remediation cannot be substantiated. The purpose of the present study was to investigate the contribution of Visual Phonics to the remediation of developmental dyspraxia of speech. One subject, thirteen years of age, participated in this study. Upon identification of six prominently misarticulated sounds, the subject received two-hour intervention sessions, five times per week, for three consecutive weeks. Standard articulation intervention augmented with Visual Phonics hand symbols was used to treat two of the error sounds in syllables, standard articulation intervention alone was used with another two error sounds, and the final two phonemes were monitored but not treated. Responses for all three treatments were recorded and results were presented in a time series of figures and tables. Regardless of the treatment strategy, the subject made notable progress on all errors. The data obtained demonstrated that, on average, the sounds treated using Visual Phonics progressed more rapidly and further than the untreated target phonemes or those treated without Visual Phonics. It was concluded that extensive further research is necessary to establish the efficacy of Visual Phonics as a treatment tool for developmental dyspraxia and that this report’s promising results suggest further study is warranted

    Co-Localization of Audio Sources in Images Using Binaural Features and Locally-Linear Regression

    Get PDF
    This paper addresses the problem of localizing audio sources using binaural measurements. We propose a supervised formulation that simultaneously localizes multiple sources at different locations. The approach is intrinsically efficient because, contrary to prior work, it relies neither on source separation nor on monaural segregation. The method starts with a training stage that establishes a locally-linear Gaussian regression model between the directional coordinates of all the sources and the auditory features extracted from binaural measurements. While fixed-length wide-spectrum sounds (white noise) are used for training to reliably estimate the model parameters, we show that the testing (localization) can be extended to variable-length sparse-spectrum sounds (such as speech), thus enabling a wide range of realistic applications. Indeed, we demonstrate that the method can be used for audio-visual fusion, namely to map speech signals onto images and hence spatially align the audio and visual modalities, thus making it possible to discriminate between speaking and non-speaking faces. We release a novel corpus of real-room recordings that allows quantitative evaluation of the co-localization method in the presence of one or two sound sources. Experiments demonstrate increased accuracy and speed relative to several state-of-the-art methods.
    Comment: 15 pages, 8 figures
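
    The locally-linear regression idea above can be illustrated with a simplified sketch: fit a Gaussian mixture on joint [binaural feature, direction] vectors collected during the white-noise training stage, then localize a test sound via the mixture-conditional expectation of direction given its features. This is the generic construction, not necessarily the paper's exact model; `features` and `directions` are assumed training arrays.

```python
# Simplified sketch of locally-linear Gaussian regression for localization
# (the generic construction, not necessarily the paper's exact model): fit a
# Gaussian mixture on joint [feature, direction] vectors, then predict a
# direction as the mixture-conditional expectation given a new feature.
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def fit_joint_gmm(features, directions, n_components=16, seed=0):
    # features: (N, D) binaural cues from the white-noise training sounds
    # directions: (N, 2) source directions (or image coordinates)
    joint = np.hstack([features, directions])
    gmm = GaussianMixture(n_components, covariance_type="full",
                          random_state=seed).fit(joint)
    return gmm, features.shape[1]

def predict_direction(gmm, d_feat, feature):
    # E[direction | feature] for a joint Gaussian mixture: each component
    # contributes an affine predictor, weighted by its responsibility.
    mu_f = gmm.means_[:, :d_feat]
    mu_d = gmm.means_[:, d_feat:]
    S_ff = gmm.covariances_[:, :d_feat, :d_feat]
    S_df = gmm.covariances_[:, d_feat:, :d_feat]

    lik = np.array([multivariate_normal.pdf(feature, mu_f[k], S_ff[k])
                    for k in range(gmm.n_components)])
    w = gmm.weights_ * lik
    w = w / w.sum()

    preds = np.array([mu_d[k] + S_df[k] @ np.linalg.solve(S_ff[k], feature - mu_f[k])
                      for k in range(gmm.n_components)])
    return (w[:, None] * preds).sum(axis=0)
```

    Localizing two simultaneous sources, as in the paper, would additionally require assigning auditory features to sources, which this sketch does not attempt.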

    Mapping Sounds on Images Using Binaural Spectrograms

    Get PDF
    We propose a novel method for mapping sound spectrograms onto images and thus enabling alignment between auditory and visual features for subsequent multimodal processing. We suggest a supervised learning approach to this audio-visual fusion problem, on the following grounds. Firstly, we use a Gaussian mixture of locally-linear regressions to learn a mapping from image locations to binaural spectrograms. Secondly, we derive a closed-form expression for the conditional posterior probability of an image location, given both an observed spectrogram, emitted from an unknown source direction, and the mapping parameters that were previously learnt. Prominently, the proposed method is able to deal with completely different spectrograms for training and for alignment. While fixed-length wide-spectrum sounds are used for learning, thus fully and robustly estimating the regression, variable-length sparse-spectrum sounds, e.g., speech, are used for alignment. The proposed method successfully extracts the image location of speech utterances in realistic reverberant-room scenarios
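
    The closed-form posterior mentioned above presumably takes the standard form obtained by conditioning a joint Gaussian mixture; the expression below is that generic form in illustrative notation (x the image location, y the observed binaural spectrogram, affine parameters A_k, b_k learnt during training), not necessarily the paper's own derivation:

    p(x \mid y) = \sum_{k=1}^{K} \nu_k(y)\, \mathcal{N}\!\left(x \mid A_k y + b_k,\ \Sigma_k\right),
    \qquad
    \nu_k(y) = \frac{\pi_k\, \mathcal{N}(y \mid \mu_k, \Gamma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(y \mid \mu_j, \Gamma_j)}

    Each component k contributes a locally-linear (affine) predictor of the image location, weighted by its posterior responsibility \nu_k(y); a point estimate can then be taken as the posterior mean or as the mode of the heaviest component.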

    The Anatomy of Onomatopoeia

    Get PDF
    Virtually every human faculty engages with imitation. One of the most natural yet unexplored objects for the study of the mimetic elements in language is the onomatopoeia, as it implies an imitation-driven transformation of a sound of nature into a word. Notably, simple sounds are transformed into complex strings of vowels and consonants, making it difficult to identify what is acoustically preserved in this operation. In this work we propose a definition of vocal imitation by which sounds are transformed into the speech elements that minimize their spectral difference within the constraints of the vocal system. In order to test this definition, we use a computational model that allows the recovery of anatomical features of the vocal system from experimental sound data. We explore the vocal configurations that best reproduce non-speech sounds, such as blows struck on a door or the sharp sounds generated by pressing light switches or computer mouse buttons. From the anatomical point of view, the configurations obtained are readily associated with co-articulated consonants, and we show perceptual evidence that these consonants are positively associated with the original sounds. Moreover, the vowel-consonant pairs that compose these co-articulations correspond to the most stable syllables found in the knock and click onomatopoeias across languages, suggesting a mechanism by which vocal imitation naturally embeds single sounds into more complex speech structures. Other mimetic forces, such as cross-modal associations between speech and visual categories, have received extensive attention from the scientific community. The present approach helps build a global view of the mimetic forces acting on language and opens a new avenue for a quantitative study of word formation in terms of vocal imitation
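
    The minimization principle defined above can be illustrated with a toy sketch: given a target non-speech sound (e.g., a knock) and a set of candidate speech elements produced under vocal-system constraints, pick the candidate whose log-mel spectrum is closest to the target. The candidate waveforms, labels, and the distance used here are illustrative assumptions, not the authors' model.

```python
# Toy sketch of the selection principle above (not the authors' model):
# choose the speech candidate whose spectrum best matches a target
# non-speech sound. Candidate waveforms are assumed to be produced under
# realistic vocal-system constraints (e.g., by an articulatory synthesizer).
import numpy as np
import librosa

def log_mel(y, sr=16000, n_mels=64):
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(S)

def best_imitation(target_sound, candidates, sr=16000):
    # candidates: dict mapping syllable labels to candidate waveforms
    T = log_mel(target_sound, sr)
    scores = {}
    for label, y in candidates.items():
        C = log_mel(y, sr)
        n = min(T.shape[1], C.shape[1])
        scores[label] = np.mean((T[:, :n] - C[:, :n]) ** 2)  # spectral difference
    return min(scores, key=scores.get), scores
```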

    A Criticism on Speech Primacy

    Get PDF
    The issue of speech primacy has been disputed throughout the history of foreign language (FL) teaching. Speech primacy refers to the tendency to give precedence to speaking and listening in FL teaching, as in the Audio-Lingual Approach. In effect, speech has been given priority on pragmatic grounds in recent years. FL teaching in Europe has been inclined to emphasize aural practice, alongside the development of educational media and phonetic notation since the beginning of this century. It is said that auditory-type learners are favored over visual-type learners in FL learning, but the authenticity of this claim is doubtful. People in the cultural sphere of Chinese characters are, in general, predominantly visual in perception and use fewer sounds in ordinary conversation than people in Europe. This tends to result in a narrow understanding of European languages. Thus, learners whose first language (L1) has fewer speech sounds than the second language (L2) are likely to face various constraints at the phonological level in the L2. In addition, differences in linguistic structure between the L1 and the L2 prescribe the cognitive patterns of their users. Viewed in this light, it is hard to say that teaching methods based on speech primacy are effective for adult Japanese learners. It is important to establish basic knowledge, namely intelligibility in English, with a view to overcoming such constraints of Japanese learners, i.e. visual-type beginners, through visual aids such as reading and writing. That is to say, FL teaching is likely to be more valid if it corresponds to the learners' imagery type and their L1

    VocaLiST: An Audio-Visual Synchronisation Model for Lips and Voices

    Full text link
    In this paper, we address the problem of lip-voice synchronisation in videos containing a human face and voice. Our approach is based on determining whether the lip motion and the voice in a video are synchronised, according to their audio-visual correspondence score. We propose an audio-visual cross-modal transformer-based model that outperforms several baseline models on the audio-visual synchronisation task on the standard lip-reading speech benchmark dataset LRS2. While existing methods focus mainly on lip synchronisation in speech videos, we also consider the special case of singing voice, which is more challenging for synchronisation due to sustained vowel sounds. We also investigate the relevance of lip synchronisation models trained on speech datasets in the context of singing voice. Finally, we use the frozen visual features learned by our lip synchronisation model in the singing voice separation task to outperform a baseline audio-visual model which was trained end-to-end. The demos, source code, and the pre-trained model will be made available on https://ipcv.github.io/VocaLiST/
    Comment: Submitted to Interspeech 2022; Project Page: https://ipcv.github.io/VocaLiST
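
    As a rough illustration of the correspondence-scoring idea above, the sketch below lets audio features attend to lip-region features with cross-modal attention and pools the result into a single synchronisation logit. It is a generic stand-in written from the abstract, not the actual VocaLiST architecture; the feature extractors and dimensions are assumptions.

```python
# Generic sketch of cross-modal audio-visual synchronisation scoring
# (a stand-in written from the abstract, not the actual VocaLiST model).
import torch
import torch.nn as nn

class SyncScorer(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 1))

    def forward(self, audio_feats, video_feats):
        # audio_feats: (B, Ta, dim) frame-level audio embeddings
        # video_feats: (B, Tv, dim) frame-level lip-region embeddings
        attended, _ = self.cross_attn(audio_feats, video_feats, video_feats)
        pooled = attended.mean(dim=1)          # temporal average pooling
        return self.head(pooled).squeeze(-1)   # higher = more likely in sync
```

    Training such a scorer would typically use in-sync audio-video pairs as positives and temporally shifted pairs as negatives, e.g. with nn.BCEWithLogitsLoss.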

    Confusion modelling for lip-reading

    Get PDF
    Lip-reading is mostly used as a means of communication by people with hearing difficulties. Recent work has explored the automation of this process, with the aim of building a speech recognition system entirely driven by lip movements. However, this work has so far produced poor results because of factors such as high variability of speaker features, difficulty in mapping from visual features to speech sounds, and high co-articulation of visual features. The motivation for the work in this thesis is inspired by previous work in dysarthric speech recognition [Morales, 2009]. Dysarthric speakers have poor control over their articulators, often leading to a reduced phonemic repertoire. The premise of this thesis is that recognition of the visual speech signal is a similar problem to recognition of dysarthric speech, in that some information about the speech signal has been lost in both cases, and this brings about a systematic pattern of errors in the decoded output. This work attempts to exploit the systematic nature of these errors by modelling them in the framework of a weighted finite-state transducer cascade. Results indicate that the technique can achieve slightly lower error rates than the conventional approach. In addition, the thesis explores some more general questions for automated lip-reading
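
    As a simplified stand-in for the weighted finite-state transducer cascade described above, the sketch below treats lip-reading as a noisy channel: a viseme-to-phoneme confusion model captures the systematic errors, a bigram phoneme model supplies linguistic constraints, and Viterbi decoding recovers the most likely phoneme sequence. Symbol sets and probabilities are illustrative; the thesis composes these models as transducers rather than hard-coding the search.

```python
# Simplified stand-in for a confusion-modelling cascade: decode an observed
# viseme sequence into phonemes with a confusion model and a bigram LM.
# All symbols and probabilities are illustrative (and assumed non-zero).
import math

def viterbi_decode(visemes, phonemes, confusion, bigram, start):
    # confusion[p][v]: P(viseme v | phoneme p)   -- systematic error model
    # bigram[q][p]:    P(p | previous phoneme q) -- language model
    # start[p]:        P(p) at the first position
    V, back = [{}], [{}]
    for p in phonemes:
        V[0][p] = math.log(start[p]) + math.log(confusion[p][visemes[0]])
        back[0][p] = None
    for t in range(1, len(visemes)):
        V.append({}); back.append({})
        for p in phonemes:
            prev = max(phonemes, key=lambda q: V[t-1][q] + math.log(bigram[q][p]))
            V[t][p] = (V[t-1][prev] + math.log(bigram[prev][p])
                       + math.log(confusion[p][visemes[t]]))
            back[t][p] = prev
    # Trace back the best phoneme path.
    path = [max(phonemes, key=lambda p: V[-1][p])]
    for t in range(len(visemes) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))
```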

    The Articulatory Basis of the Alphabet

    Get PDF
    The origin of the alphabet has long been a subject for research, speculation and myths. How to explain its survival and effectiveness over thousands of years? One approach is in terms of the practical problems faced by the originator of the alphabet; another would examine the archaeological record; a third might focus on the perceptual process by which the alphabet makes rapid reading possible. It is proposed that the alphabet originated in an intellectual sequence similar to that followed by Alexander Bell and Henry Sweet in constructing their Visible and Organic Alphabets. The originator of the alphabet used the same kind of introspective analysis of his own speech sounds and of the manner in which they were articulated. This was the vital step. The next step was to represent the articulatory differences in terms of visual patterns. One way to understand what might have been involved is to attempt to replicate the process oneself