Using audio and visual information for single channel speaker separation
This work proposes a method to exploit both audio and visual speech information to extract a target speaker from a mixture of competing speakers. The work begins by taking an effective audio-only method of speaker separation, namely the soft mask method, and modifying its operation to allow visual speech information to improve the separation process. The audio input is taken from a single channel and contains the mixture of speakers, whereas a separate set of visual features is extracted from each speaker. This allows modification of the separation process to include not only the audio speech but also visual speech from each speaker in the mixture. Experimental results are presented that compare the proposed audio-visual speaker separation with audio-only and visual-only methods using both speech quality and speech intelligibility metrics.
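The audio-only baseline named above, the soft mask method, weights each time-frequency bin of the mixture spectrogram by the estimated proportion of target-speaker energy in that bin. A minimal sketch of that masking step, using hypothetical magnitude estimates (the function name and toy arrays are illustrative, not from the paper):

```python
import numpy as np

def soft_mask_separate(mix_spec, est_target_mag, est_interferer_mag, eps=1e-8):
    """Apply a soft (ratio) mask to a mixture spectrogram.

    Each time-frequency bin of the mixture is scaled by the estimated
    fraction of target energy in that bin; mask values lie in [0, 1].
    """
    mask = est_target_mag / (est_target_mag + est_interferer_mag + eps)
    return mask * mix_spec, mask

# Toy example: random magnitude "spectrograms" (freq bins x frames)
rng = np.random.default_rng(0)
target = rng.random((257, 100))
interferer = rng.random((257, 100))
mixture = target + interferer  # idealised additive mixture

est_target, mask = soft_mask_separate(mixture, target, interferer)
```

In practice the per-speaker magnitude estimates would come from trained speaker models rather than oracle knowledge; the audio-visual extension described above would condition those estimates on each speaker's visual features.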
Language Identification Using Visual Features
Automatic visual language identification (VLID) is the technology of using information derived from the visual appearance and movement of the speech articulators to identify the language being spoken, without the use of any audio information. This technique for language identification (LID) is useful in situations in which conventional audio processing is ineffective (very noisy environments) or impossible (no audio signal is available). Research in this field is also beneficial to the related field of automatic lip-reading. This paper introduces several methods for VLID. They are based upon audio LID techniques, which exploit language phonology and phonotactics to discriminate between languages. We show that VLID is possible in a speaker-dependent mode by discriminating different languages spoken by an individual, and we then extend the technique to speaker-independent operation, taking pains to ensure that discrimination is not due to artefacts, either visual (e.g. skin tone) or audio (e.g. rate of speaking). Although the low accuracy of visual speech recognition currently limits the performance of VLID, we can obtain an error rate of < 10% in discriminating between Arabic and English on 19 speakers using about 30 s of visual speech.
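The phonotactic approach borrowed from audio LID scores a recognised phone (or viseme) sequence against an n-gram model per language and picks the language with the highest likelihood. A minimal sketch with smoothed bigrams; the toy "viseme" strings, vocabulary size, and function names are illustrative assumptions, not the paper's system:

```python
import math
from collections import Counter

def train_bigram(phone_seqs):
    """Count phone bigrams and their left contexts for one language."""
    bigrams, contexts = Counter(), Counter()
    for seq in phone_seqs:
        for a, b in zip(seq, seq[1:]):
            bigrams[(a, b)] += 1
            contexts[a] += 1
    return bigrams, contexts

def log_likelihood(seq, model, alpha=1.0, vocab=30):
    """Add-alpha smoothed bigram log-likelihood of a phone sequence."""
    bigrams, contexts = model
    return sum(
        math.log((bigrams[(a, b)] + alpha) / (contexts[a] + alpha * vocab))
        for a, b in zip(seq, seq[1:])
    )

# Toy training "viseme" strings for two hypothetical languages
lang1 = ["abab", "ababab", "abba"]
lang2 = ["ccdc", "cdcd", "dccd"]
models = {"L1": train_bigram(lang1), "L2": train_bigram(lang2)}

def identify(seq):
    """Return the language whose bigram model best explains the sequence."""
    return max(models, key=lambda name: log_likelihood(seq, models[name]))
```

The real difficulty the abstract points to is upstream of this step: visual speech recognition confuses many phones that share a viseme, so the recognised sequences fed to the language models are far noisier than in audio LID.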
Deep Multimodal Learning for Audio-Visual Speech Recognition
In this paper, we present methods in deep multimodal learning for fusing speech and visual modalities for Audio-Visual Automatic Speech Recognition (AV-ASR). First, we study an approach where uni-modal deep networks are trained separately and their final hidden layers fused to obtain a joint feature space in which another deep network is built. While the audio network alone achieves a phone error rate (PER) of under clean conditions on the IBM large vocabulary audio-visual studio dataset, this fusion model achieves a PER of demonstrating the tremendous value of the visual channel in phone classification even in audio with high signal to noise ratio. Second, we present a new deep network architecture that uses a bilinear softmax layer to account for class specific correlations between modalities. We show that combining the posteriors from the bilinear networks with those from the fused model mentioned above results in a further significant phone error rate reduction, yielding a final PER of .
Comment: ICASSP 201
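The abstract does not detail the bilinear softmax architecture, but the final step it describes, combining posteriors from the bilinear networks with those from the fused model, can be sketched as a log-linear (geometric) interpolation of the two distributions, which is one common way to combine classifier posteriors. The function name, the equal weighting, and the toy posteriors below are assumptions for illustration:

```python
import numpy as np

def combine_posteriors(p_fused, p_bilinear, weight=0.5, eps=1e-12):
    """Log-linear combination of two per-frame posterior distributions,
    renormalised so each frame's class probabilities sum to one."""
    log_p = weight * np.log(p_fused + eps) + (1.0 - weight) * np.log(p_bilinear + eps)
    p = np.exp(log_p)
    return p / p.sum(axis=-1, keepdims=True)

# Toy per-frame phone posteriors (frames x classes) from two models
p_a = np.array([[0.7, 0.2, 0.1],
                [0.1, 0.8, 0.1]])
p_b = np.array([[0.6, 0.3, 0.1],
                [0.2, 0.6, 0.2]])
p_comb = combine_posteriors(p_a, p_b)
```

With equal weights this reduces to a normalised geometric mean, which tends to favour classes on which both models agree.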
"Sitting too close to the screen can be bad for your ears": A study of audio-visual location discrepancy detection under different visual projections
In this work, we look at the perception of event locality under conditions of disparate audio and visual cues. We address an aspect of the so-called “ventriloquism effect” relevant for multimedia designers; namely, how auditory perception of event locality is influenced by the size and scale of the accompanying visual projection of those events. We observed that recalibration of the visual axes of an audio-visual animation (by resizing and zooming) exerts a recalibrating influence on auditory space perception. In particular, sensitivity to audio-visual discrepancies (between a centrally located visual stimulus and a laterally displaced audio cue) increases near the edge of the screen on which the visual cue is displayed. In other words, discrepancy detection thresholds are not fixed for a particular pair of stimuli, but are influenced by the size of the display space. Moreover, the discrepancy thresholds are influenced by scale as well as size. That is, the boundary of auditory space perception is not rigidly fixed to the boundaries of the screen; it also depends on the spatial relationship depicted. For example, the ventriloquism effect will break down within the boundaries of a large screen if zooming is used to exaggerate the proximity of the audience to the events. The latter effect appears to be much weaker than the former.
