
    Monkeys and Humans Share a Common Computation for Face/Voice Integration

    Speech production involves the movement of the mouth and other regions of the face, resulting in visual motion cues. These visual cues enhance the intelligibility and detection of auditory speech. As such, face-to-face speech is fundamentally a multisensory phenomenon. If speech is fundamentally multisensory, this should be reflected in the evolution of vocal communication: similar behavioral effects should be observed in other primates. Old World monkeys share with humans vocal production biomechanics and communicate face-to-face with vocalizations. It is unknown, however, whether they, too, combine faces and voices to enhance their perception of vocalizations. We show that they do: monkeys combine faces and voices in noisy environments to enhance their detection of vocalizations. Their behavior parallels that of humans performing an identical task. We explored what common computational mechanism(s) could explain the pattern of results we observed across species. Standard explanations or models, such as the principle of inverse effectiveness and a “race” model, failed to account for these behavioral patterns. Conversely, a “superposition” model, positing the linear summation of activity patterns in response to the visual and auditory components of vocalizations, served as a straightforward but powerful explanatory mechanism for the observed behaviors in both species. As such, it represents a putative homologous mechanism for integrating faces and voices across primates.
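
    To make the contrast between these mechanisms concrete, the sketch below simulates detection times under a “race” model (independent auditory and visual accumulators, responding whenever either reaches threshold) and under a “superposition” model (a single accumulator driven by the linear sum of the auditory and visual inputs). It is a minimal Python illustration with arbitrary rate, threshold, and noise parameters, not the model fitted in the paper.

    # Illustrative sketch only: arbitrary parameters, not the authors' fitted model.
    import numpy as np

    rng = np.random.default_rng(0)
    n_trials = 100_000
    threshold = 1.0                    # arbitrary evidence bound
    rate_a, rate_v = 2.0, 1.5          # arbitrary auditory/visual drift rates (evidence per s)
    noise = 0.3                        # trial-to-trial variability of the rates

    # Unisensory detection time = time for a noisy linear accumulator to reach the bound.
    rt_a = threshold / np.maximum(rng.normal(rate_a, noise, n_trials), 0.2)
    rt_v = threshold / np.maximum(rng.normal(rate_v, noise, n_trials), 0.2)

    # Race model: two independent accumulators; respond when the faster one finishes.
    rt_race = np.minimum(rt_a, rt_v)

    # Superposition model: one accumulator whose input is the linear sum of the
    # auditory and visual activity patterns.
    rt_super = threshold / np.maximum(
        rng.normal(rate_a, noise, n_trials) + rng.normal(rate_v, noise, n_trials), 0.2)

    print(f"mean RT  auditory: {rt_a.mean():.3f}  visual: {rt_v.mean():.3f}")
    print(f"mean RT  race: {rt_race.mean():.3f}  superposition: {rt_super.mean():.3f}")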

    The Natural Statistics of Audiovisual Speech

    Humans, like other animals, are exposed to a continuous stream of signals that are dynamic, multimodal, extended, and time-varying in nature. This complex input space must be transduced and sampled by our sensory systems and transmitted to the brain, where it can guide the selection of appropriate actions. To simplify this process, it has been suggested that the brain exploits statistical regularities in the stimulus space. Tests of this idea have largely been confined to unimodal signals and natural scenes. One important class of multisensory signals for which a quantitative input-space characterization is unavailable is human speech. We do not understand what signals our brain has to actively piece together from an audiovisual speech stream to arrive at a percept versus what is already embedded in the signal structure of the stream itself. In essence, we do not have a clear understanding of the natural statistics of audiovisual speech. In the present study, we identified the following major statistical features of audiovisual speech. First, we observed robust correlations and close temporal correspondence between the area of the mouth opening and the acoustic envelope. Second, we found the strongest correlation between the area of the mouth opening and vocal tract resonances. Third, we observed that both the area of the mouth opening and the voice envelope are temporally modulated in the 2–7 Hz frequency range. Finally, we show that the timing of mouth movements relative to the onset of the voice is consistently between 100 and 300 ms. We interpret these data in the context of recent neural theories of speech, which suggest that speech communication is a reciprocally coupled, multisensory event, whereby the outputs of the signaler are matched to the neural processes of the receiver.
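
    As a rough illustration of the first statistical feature (the correlation between the area of the mouth opening and the acoustic envelope), the Python sketch below extracts a wideband amplitude envelope, smooths and downsamples it to the video frame rate, and correlates it with a mouth-opening time series. The sampling rates, the 10 Hz smoothing cutoff, and the synthetic signals are assumptions for illustration, not values from the study.

    import numpy as np
    from scipy.signal import butter, hilbert, resample, sosfiltfilt

    def av_correlation(audio, sr_audio, mouth_area, sr_video):
        # Wideband amplitude envelope via the Hilbert transform.
        envelope = np.abs(hilbert(audio))
        # Smooth the envelope (assumed 10 Hz cutoff) before downsampling.
        sos = butter(4, 10.0, btype="low", fs=sr_audio, output="sos")
        envelope = sosfiltfilt(sos, envelope)
        # Resample the envelope to the length of the mouth-area time series.
        envelope = resample(envelope, len(mouth_area))
        # Pearson correlation between the two time series.
        return np.corrcoef(envelope, mouth_area)[0, 1]

    # Toy usage with synthetic, co-modulated signals (~4 Hz syllabic rhythm).
    sr_audio, sr_video, dur = 16_000, 30, 10.0
    t_a = np.arange(int(sr_audio * dur)) / sr_audio
    t_v = np.arange(int(sr_video * dur)) / sr_video
    audio = (1 + 0.8 * np.sin(2 * np.pi * 4 * t_a)) * np.random.randn(t_a.size)
    mouth = 1 + 0.8 * np.sin(2 * np.pi * 4 * t_v) + 0.2 * np.random.randn(t_v.size)
    print(f"audiovisual correlation r = {av_correlation(audio, sr_audio, mouth, sr_video):.2f}")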

    Temporal modulations in the population.

    A - Left, average multitaper Fourier spectrum of the wideband auditory envelope across all subjects in the GRID corpus, plotted on log-log axes. Right, average multitaper Fourier spectrum of the mouth area function across all subjects in the GRID corpus, plotted on log-log axes. Figure conventions as in 5B. Gray shading denotes regions in the 2–7 Hz band, which seem to deviate from the 1/f fit. B - Left, average multitaper Fourier spectrum of the wideband auditory envelope across all subjects in the Wisconsin x-ray database, plotted on log-log axes. Right, average multitaper Fourier spectrum of the inter-lip distance across all subjects in the Wisconsin x-ray database, plotted on log-log axes. Figure conventions as in 5B. C - Left, average multitaper Fourier spectrum of the wideband auditory envelope averaged over the entire spontaneous speech segment for the two subjects from the spontaneous speech database, plotted on log-log axes. Right, average multitaper Fourier spectrum of the mouth area function for the two subjects in the spontaneous speech database, plotted on log-log axes. Figure conventions as in 5B.
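
    The spectra in this figure are multitaper estimates. A minimal Python sketch of such an estimate, using Slepian (DPSS) tapers from SciPy, is given below; the time-bandwidth product, number of tapers, and the synthetic envelope are generic assumptions rather than the parameters used in the paper.

    import numpy as np
    from scipy.signal.windows import dpss

    def multitaper_spectrum(x, fs, nw=3, k=5):
        x = np.asarray(x, float) - np.mean(x)
        tapers = dpss(len(x), NW=nw, Kmax=k)        # (k, len(x)) Slepian tapers
        spectra = np.abs(np.fft.rfft(tapers * x, axis=1)) ** 2
        freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
        return freqs, spectra.mean(axis=0)          # average over tapers

    # Example: an envelope with a 4 Hz rhythm sampled at 100 Hz.
    fs = 100
    t = np.arange(0, 10, 1 / fs)
    env = 1 + 0.5 * np.sin(2 * np.pi * 4 * t) + 0.1 * np.random.randn(t.size)
    freqs, power = multitaper_spectrum(env, fs)
    band = (freqs >= 2) & (freqs <= 7)
    print("peak frequency in the 2-7 Hz band:", freqs[band][np.argmax(power[band])], "Hz")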

    Measuring time-to-voice for consonants at the beginning of words.

    A - Visual and auditory dynamics during the production of the word “PROBLEM” by a single speaker. Green and red lines denote the velocity profiles of the upper and lower lips, respectively. Dark blue line denotes the inter-lip distance as a function of time. The waveform of the sound is shown in black. Blue dashed line denotes the point of maximal lower lip velocity (marked by (1)). Red dashed line denotes the point of zero lower lip velocity (marked by (2)). Solid black line denotes the onset of the sound. The green dot denotes the maximal mouth opening before the onset of the sound. The red dot denotes the half-peak point of the mouth opening. X-axes depict time in milliseconds. B - Inter-lip distance for different subjects, and the average, as a function of time for the bilabial plosive /p/ aligned to the onset of the sound. Red dashed line denotes the half-peak opening of the mouth. X-axes depict time in milliseconds, Y-axes depict inter-lip distance in mm. Solid dark line denotes the onset of the sound. Gray lines denote traces from individual subjects. Red line denotes the average across subjects. Shaded regions denote the standard error of the mean. C - Inter-lip distance for different subjects, and the average, as a function of time for the bilabial consonant /m/ aligned to the onset of the sound. Figure conventions as in B. D - Inter-lip distance for different subjects, and the average, as a function of time for the bilabial plosive /b/ aligned to the onset of the sound. Figure conventions as in B. E - Inter-lip distance for different subjects, and the average, as a function of time for the labiodental /f/ aligned to the onset of the sound. Figure conventions as in B. F - Average time to half opening from the onset of the sound for the three bilabial consonants and the labiodental across subjects. Error bars denote the standard error of the mean.
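
    A hypothetical implementation of the half-peak measurement summarized in panel F: take the maximal inter-lip opening preceding voice onset, find when the opening first reaches half of that value, and report the interval to the onset of the sound. The function name and the sigmoidal toy trace are illustrative assumptions, not the authors' code.

    import numpy as np

    def time_to_half_opening(interlip_mm, t_ms, voice_onset_ms):
        pre = t_ms <= voice_onset_ms
        peak_idx = np.argmax(interlip_mm[pre])       # maximal opening before voicing
        half = interlip_mm[pre][peak_idx] / 2.0
        # First sample at or above half-peak on the rising flank.
        rising = np.nonzero(interlip_mm[pre][: peak_idx + 1] >= half)[0][0]
        return voice_onset_ms - t_ms[pre][rising]    # positive: mouth leads voice

    # Toy trace: the mouth starts opening well before a voice onset at t = 0 ms.
    t = np.arange(-400, 200, 5.0)                    # time in ms
    opening = 10 / (1 + np.exp(-(t + 150) / 40))     # sigmoidal inter-lip distance, mm
    print(f"time to half opening: {time_to_half_opening(opening, t, 0.0):.0f} ms before voice onset")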

    Average correlations between visual and auditory segments for longer speech materials.

    A - Top, inter-lip distance and the auditory envelope for a single 20-second segment from a single subject in the x-ray database as a function of time. X-axes depict time in seconds. Y-axes on the left depict the distance between the lower and upper lip in millimeters. Y-axes on the right depict the power in the wideband envelope. Bottom, a zoomed-in portion of the 8–12 second time segment of the same data shown in A. Clear correspondences are present between the inter-lip distance and the auditory envelope. B - Scatter plot of the envelope power and inter-lip distance along with the corresponding regression line. Each red circle denotes a single point in the speech time series. Black line denotes the linear regression between the inter-lip distance and the envelope power. The correlation coefficient between the auditory and visual components for this sentence was 0.49 (p<0.0001). C - Average rank-ordered intact correlations (red bars) and shuffled correlations (green bars) for the 15 subjects analyzed in the Wisconsin x-ray database. X-axes depict subject number; Y-axes depict the correlations. Intact correlations for each subject were the average across all speech segments analyzed for that subject. Error bars denote the standard error of the mean. Shuffled correlations were computed as the average correlation between all non-paired auditory envelopes and the inter-lip distance for each subject.
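
    The shuffled-correlation control in panel C can be sketched as follows: intact correlations pair each inter-lip trace with its own auditory envelope, while shuffled correlations pair it with the envelopes of all other, non-paired segments. The segment length, toy signals, and function name below are assumptions for illustration.

    import numpy as np

    def intact_vs_shuffled(lip_traces, envelopes):
        """Both arguments are lists of equal-length 1-D arrays, one per speech segment."""
        corr = lambda x, y: np.corrcoef(x, y)[0, 1]
        intact = [corr(l, e) for l, e in zip(lip_traces, envelopes)]
        shuffled = [corr(lip_traces[i], envelopes[j])
                    for i in range(len(lip_traces))
                    for j in range(len(envelopes)) if i != j]
        return np.mean(intact), np.mean(shuffled)

    # Toy data: 10 segments in which lip and envelope share a slow modulation.
    rng = np.random.default_rng(1)
    lips, envs = [], []
    for _ in range(10):
        common = np.cumsum(rng.standard_normal(600))  # shared slow component
        lips.append(common + rng.standard_normal(600))
        envs.append(common + rng.standard_normal(600))
    print("intact r = %.2f, shuffled r = %.2f" % intact_vs_shuffled(lips, envs))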

    Coherence between vision and audition.

    A - Left, heat map showing the coherence between the mouth area function and the auditory signal as a function of both spectral frequency band and temporal modulation frequency for a single subject from the GRID corpus. X-axes depict temporal modulation frequency in Hz. Y-axes depict the spectral frequency in kHz. The square drawn in dashed lines depicts the region of maximal coherence between the visual and auditory signals. Right, heat map for another subject from the GRID corpus. Figure conventions as in the left panel. B - Average coherence between the mouth area function and the auditory signal as a function of both spectral frequency band and temporal modulation frequency for the twenty subjects in the GRID corpus. Figure conventions as in the left panel of A. C - Average coherence between the mouth area function and the auditory signal for four different spectral frequencies (8.8 kHz – orange, 2.3 kHz – red, 161 Hz – blue, 460 Hz – green) across all subjects in the GRID corpus as a function of temporal frequency. Shaded regions denote the standard error of the mean. D - Average coherence between the inter-lip distance and the wideband auditory envelope as a function of both spectral frequency band and temporal modulation frequency for the fifteen subjects in the Wisconsin x-ray database. Figure conventions as in A. E - Average coherence between the area of the mouth opening and the wideband auditory envelope as a function of both spectral frequency band and temporal modulation frequency, averaged across the two subjects from the French spontaneous speech database. Figure conventions as in A.
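
    The coherence analysis can be sketched as follows: band-pass the audio into a spectral frequency band, extract that band's amplitude envelope, downsample it to the video rate, and compute its spectral coherence with the mouth-area (or inter-lip) signal as a function of temporal modulation frequency. The band edges, Welch parameters, and synthetic signals below are illustrative assumptions.

    import numpy as np
    from scipy.signal import butter, coherence, hilbert, resample, sosfiltfilt

    def band_coherence(audio, sr_audio, mouth_area, sr_video, band):
        sos = butter(4, band, btype="bandpass", fs=sr_audio, output="sos")
        env = np.abs(hilbert(sosfiltfilt(sos, audio)))   # band-limited envelope
        env = resample(env, len(mouth_area))             # align to the video rate
        return coherence(env, mouth_area, fs=sr_video, nperseg=128)

    # Toy usage: synthetic audio and mouth signals sharing a ~4 Hz rhythm.
    sr_a, sr_v, dur = 16_000, 30, 20.0
    t_a = np.arange(int(sr_a * dur)) / sr_a
    t_v = np.arange(int(sr_v * dur)) / sr_v
    audio = (1 + 0.8 * np.sin(2 * np.pi * 4 * t_a)) * np.random.randn(t_a.size)
    mouth = 1 + 0.8 * np.sin(2 * np.pi * 4 * t_v) + 0.2 * np.random.randn(t_v.size)
    f, coh = band_coherence(audio, sr_a, mouth, sr_v, band=(1_000, 3_000))
    keep = f > 0.5                                       # ignore the DC bin
    print("peak coherence at %.1f Hz" % f[keep][np.argmax(coh[keep])])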

    Context modifies time-to-voice for consonants.

    A - Visual and auditory dynamics during the production of the word “APA” by a single speaker. Conventions as in Figure 8A (http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1000436#pcbi-1000436-g008). B - Average inter-lip distance as a function of time for the words APA, AMA, ABA, and AFA across all subjects in the database, aligned to the onset of the respective consonants (/p/, /m/, /b/, /f/). X-axes depict time in milliseconds; Y-axes depict the mean-corrected inter-lip distance in mm. Shaded regions denote the standard error of the mean.

    Temporal modulations in the visual and auditory signals.

    A - Left, average multitaper Fourier spectrum of the wideband auditory envelope across all sentences for a single subject from the GRID corpus. X-axes depict frequency in Hz. Y-axes depict power. Shaded regions denote the standard error of the mean. Note the peak between 2–7 Hz. Right, average multitaper Fourier spectrum of the mouth area function across all the sentences for the same subject shown in the left panel. X-axes depict frequency in Hz. Y-axes depict power. Shaded regions denote the standard error of the mean. B - Left, average multitaper Fourier spectrum of the wideband auditory envelope across all sentences for the same subject shown in A, plotted on log-log axes. X-axes depict frequency in Hz. Y-axes depict power in log10 units. Shaded regions denote the standard error of the mean. Black line denotes a 1/f fit to the data. Deviations from 1/f are usually observed when there are rhythmic modulations in a signal. Right, average multitaper Fourier spectrum of the mouth area function across all the sentences for the same subject. Figure conventions as in the left panel.
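
    The 1/f fit mentioned here can be sketched as a straight-line fit to the spectrum in log-log coordinates, with the rhythmic modulation appearing as excess power above that line in the 2–7 Hz band. The fitting range and the synthetic spectrum below are illustrative assumptions.

    import numpy as np

    def excess_over_one_over_f(freqs, power, band=(2.0, 7.0)):
        keep = freqs > 0
        logf, logp = np.log10(freqs[keep]), np.log10(power[keep])
        slope, intercept = np.polyfit(logf, logp, 1)     # straight line in log-log space
        fit = 10 ** (intercept + slope * logf)
        in_band = (freqs[keep] >= band[0]) & (freqs[keep] <= band[1])
        # Mean excess power (in dB) above the 1/f fit inside the band.
        return 10 * np.mean(np.log10(power[keep][in_band] / fit[in_band]))

    # Toy spectrum: a 1/f background plus a bump centered near 4 Hz.
    freqs = np.linspace(0.5, 30, 200)
    power = 1.0 / freqs + 0.3 * np.exp(-((freqs - 4) ** 2) / 2)
    print(f"excess power in the 2-7 Hz band: {excess_over_one_over_f(freqs, power):.1f} dB")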