24 research outputs found

    The Natural Statistics of Audiovisual Speech

    Humans, like other animals, are exposed to a continuous stream of signals that are dynamic, multimodal, extended, and time-varying in nature. This complex input space must be transduced and sampled by our sensory systems and transmitted to the brain, where it can guide the selection of appropriate actions. To simplify this process, it has been suggested that the brain exploits statistical regularities in the stimulus space. Tests of this idea have largely been confined to unimodal signals and natural scenes. One important class of multisensory signals for which a quantitative input space characterization is unavailable is human speech. We do not understand what signals our brain has to actively piece together from an audiovisual speech stream to arrive at a percept versus what is already embedded in the signal structure of the stream itself. In essence, we do not have a clear understanding of the natural statistics of audiovisual speech. In the present study, we identified the following major statistical features of audiovisual speech. First, we observed robust correlations and close temporal correspondence between the area of the mouth opening and the acoustic envelope. Second, we found the strongest correlation between the area of the mouth opening and vocal tract resonances. Third, we observed that both the area of the mouth opening and the voice envelope are temporally modulated in the 2–7 Hz frequency range. Finally, we show that the timing of mouth movements relative to the onset of the voice is consistently between 100 and 300 ms. We interpret these data in the context of recent neural theories of speech which suggest that speech communication is a reciprocally coupled, multisensory event, whereby the outputs of the signaler are matched to the neural processes of the receiver.

    Perceptual evaluation of blind source separation in object-based audio production

    Object-based audio has the potential to enable multimedia content to be tailored to individual listeners and their reproduction equipment. In general, object-based production assumes that the objects (the assets comprising the scene) are free of noise and interference. However, there are many applications in which signal separation could be useful to an object-based audio workflow, e.g., extracting individual objects from channel-based recordings or legacy content, or recording a sound scene with a single microphone array. This paper describes the application and evaluation of blind source separation (BSS) for sound recording in a hybrid channel-based and object-based workflow, in which BSS-estimated objects are mixed with the original stereo recording. A subjective experiment was conducted using simultaneously spoken speech recorded with omnidirectional microphones in a reverberant room. Listeners mixed a BSS-extracted speech object into the scene to make the quieter talker clearer, while retaining acceptable audio quality, compared to the raw stereo recording. Objective evaluations show that the relative short-term objective intelligibility and speech quality scores increase using BSS. Further objective evaluations are used to discuss the influence of the BSS method on the remixing scenario; the scenario shown by human listeners to be useful in object-based audio is shown to be a worst-case scenario.