Critical Examination of Autism Screening Tools: A Reply to Øien
We thank Dr. Øien for his comments on the relevance and implications of our cluster randomized controlled trial (RCT) in his editorial, "Editorial: The Critical Examination of Autism Screening Tools: A Call for Addressing False Negatives" [1].
Monkeys and Humans Share a Common Computation for Face/Voice Integration
Speech production involves movements of the mouth and other regions of the face, resulting in visual motion cues. These visual cues enhance the intelligibility and detection of auditory speech. As such, face-to-face speech is fundamentally a multisensory phenomenon. If speech is fundamentally multisensory, this should be reflected in the evolution of vocal communication: similar behavioral effects should be observed in other primates. Old World monkeys share vocal production biomechanics with humans and communicate face-to-face with vocalizations. It is unknown, however, whether they, too, combine faces and voices to enhance their perception of vocalizations. We show that they do: monkeys combine faces and voices in noisy environments to enhance their detection of vocalizations. Their behavior parallels that of humans performing an identical task. We explored what common computational mechanism(s) could explain the pattern of results observed across species. Standard explanations, such as the principle of inverse effectiveness and a "race" model, failed to account for the behavioral patterns. Conversely, a "superposition" model, positing the linear summation of activity patterns in response to the visual and auditory components of vocalizations, served as a straightforward but powerful explanatory mechanism for the observed behaviors in both species. As such, it represents a putative homologous mechanism for integrating faces and voices across primates.
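The contrast between the two candidate mechanisms can be made concrete with a few lines of simulation. The sketch below is not the authors' model: it assumes a simple signal-plus-Gaussian-noise detection stage with hypothetical signal strengths and a fixed threshold, and only illustrates how a race model (probability summation across channels) and a superposition model (linear summation before thresholding) yield different predicted audiovisual detection rates.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials = 100_000
threshold = 1.0

def channel_response(drive, noise_sd=1.0):
    """Single-trial internal response: signal drive plus Gaussian noise."""
    return drive + noise_sd * rng.standard_normal(n_trials)

# Hypothetical unisensory signal strengths in a noisy environment.
auditory = channel_response(drive=0.6)
visual = channel_response(drive=0.4)

# Race model: an audiovisual trial succeeds if either unisensory channel
# alone crosses the detection threshold (probability summation).
p_race = np.mean((auditory > threshold) | (visual > threshold))

# Superposition model: auditory and visual activity add linearly
# before being compared with the same threshold.
p_super = np.mean((auditory + visual) > threshold)

p_aud = np.mean(auditory > threshold)
p_vis = np.mean(visual > threshold)
print(f"A alone: {p_aud:.3f}  V alone: {p_vis:.3f}")
print(f"race prediction: {p_race:.3f}  superposition prediction: {p_super:.3f}")
```

With these (made-up) parameters the superposition prediction exceeds the race bound, which is the qualitative signature used to distinguish the two accounts.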
The Natural Statistics of Audiovisual Speech
Humans, like other animals, are exposed to a continuous stream of signals that are dynamic, multimodal, extended, and time-varying. This complex input space must be transduced and sampled by our sensory systems and transmitted to the brain, where it can guide the selection of appropriate actions. To simplify this process, it has been suggested that the brain exploits statistical regularities in the stimulus space. Tests of this idea have largely been confined to unimodal signals and natural scenes. One important class of multisensory signals for which a quantitative input-space characterization is unavailable is human speech. We do not understand what signals the brain must actively piece together from an audiovisual speech stream to arrive at a percept versus what is already embedded in the signal structure of the stream itself. In essence, we do not have a clear understanding of the natural statistics of audiovisual speech. In the present study, we identified the following major statistical features of audiovisual speech. First, we observed robust correlations and close temporal correspondence between the area of the mouth opening and the acoustic envelope. Second, we found the strongest correlation between the area of the mouth opening and the vocal tract resonances. Third, we observed that both the area of the mouth opening and the voice envelope are temporally modulated in the 2–7 Hz frequency range. Finally, we show that the timing of mouth movements relative to the onset of the voice lies consistently between 100 and 300 ms. We interpret these data in the context of recent neural theories of speech, which suggest that speech communication is a reciprocally coupled, multisensory event, whereby the outputs of the signaler are matched to the neural processes of the receiver.
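One way to appreciate the 2–7 Hz modulation finding is to compute the modulation spectra of a mouth-area trace and an acoustic envelope and ask how much of their power falls in that band. The sketch below is a minimal illustration using synthetic stand-in signals that share a ~4 Hz component (an assumption; real traces would come from lip tracking and envelope extraction as described in the study) and SciPy's Welch spectral estimator.

```python
import numpy as np
from scipy.signal import welch

fs = 100.0                       # hypothetical common sampling rate (Hz)
t = np.arange(0, 30, 1 / fs)     # 30 s of "speech"

# Synthetic stand-ins: a mouth-area trace and an acoustic envelope sharing
# a ~4 Hz syllabic modulation plus independent noise.
syllabic = np.sin(2 * np.pi * 4.0 * t)
mouth_area = syllabic + 0.5 * np.random.default_rng(1).standard_normal(t.size)
envelope = np.roll(syllabic, int(0.2 * fs)) + 0.5 * np.random.default_rng(2).standard_normal(t.size)

# Modulation spectra: both signals should concentrate power in the 2-7 Hz band.
f, p_mouth = welch(mouth_area, fs=fs, nperseg=1024)
_, p_env = welch(envelope, fs=fs, nperseg=1024)
band = (f >= 2) & (f <= 7)
print(f"fraction of mouth-area power in 2-7 Hz: {p_mouth[band].sum() / p_mouth.sum():.2f}")
print(f"fraction of envelope power in 2-7 Hz:   {p_env[band].sum() / p_env.sum():.2f}")
```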
Measuring time-to-voice for consonants at the beginning of words.
A - Visual and auditory dynamics during the production of the word "PROBLEM" by a single speaker. Green and red lines denote the velocity profiles of the upper and lower lips, respectively. Dark blue line denotes the inter-lip distance as a function of time. The waveform of the sound is shown in black. Blue dashed line denotes the point of maximal lower lip velocity (marked by (1)). Red dashed line denotes the point of zero lower lip velocity (marked by (2)). Solid black line denotes the onset of the sound. The green dot denotes the maximal mouth opening before the onset of the sound. The red dot denotes the half-peak point of the mouth opening. X-axes depict time in milliseconds. B - Inter-lip distance for individual subjects and the average as a function of time for the bilabial plosive /p/, aligned to the onset of the sound. Red dashed line denotes the half-peak opening of the mouth. X-axes depict time in milliseconds; y-axes depict inter-lip distance in mm. Solid black line denotes the onset of the sound. Gray lines denote traces from individual subjects. Red line denotes the average across subjects. Shaded regions denote the standard error of the mean. C - Inter-lip distance for individual subjects and the average as a function of time for the bilabial consonant /m/, aligned to the onset of the sound. Figure conventions as in B. D - Inter-lip distance for individual subjects and the average as a function of time for the bilabial plosive /b/, aligned to the onset of the sound. Figure conventions as in B. E - Inter-lip distance for individual subjects and the average as a function of time for the labiodental /f/, aligned to the onset of the sound. Figure conventions as in B. F - Average time to half opening from the onset of the sound for the three bilabial consonants and the labiodental, across subjects. Error bars denote the standard error of the mean.
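The measurement described here, the interval between the half-peak mouth opening and the acoustic onset, can be approximated as follows. This is a simplified reconstruction rather than the authors' exact procedure; the sampling rate, inter-lip trace, and onset time in the example are hypothetical.

```python
import numpy as np

def time_to_voice(interlip_mm, fs_video, voice_onset_s):
    """Time (ms) from the half-peak mouth opening to the acoustic onset.

    interlip_mm   : 1-D array of inter-lip distance samples
    fs_video      : video sampling rate in Hz
    voice_onset_s : acoustic onset measured from the start of the trace (s)
    """
    onset_idx = int(round(voice_onset_s * fs_video))
    pre = interlip_mm[:onset_idx]
    peak_idx = int(np.argmax(pre))            # maximal opening before the voice
    half_peak = pre[peak_idx] / 2.0
    # Last sample before the peak at which the opening is still at or below half-peak.
    below = np.flatnonzero(pre[:peak_idx + 1] <= half_peak)
    half_idx = below[-1] if below.size else 0
    return (onset_idx - half_idx) / fs_video * 1000.0

# Hypothetical trace: the mouth opens over ~250 ms and the voice starts at 0.3 s.
fs = 200.0
t = np.arange(0, 0.5, 1 / fs)
opening = np.clip((t - 0.05) / 0.25, 0, 1) * 12.0   # inter-lip distance in mm
print(f"time to voice: {time_to_voice(opening, fs, voice_onset_s=0.3):.0f} ms")
```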
Correlation between visual and auditory components as a function of spectral frequency.
A - Correlation coefficient between the area of the mouth opening and the envelope of each spectral band for two different subjects. X-axes depict the spectral frequency in kHz. Y-axes depict the correlation coefficient. The red line denotes the intact correlation between the area of the mouth opening and the auditory envelopes. Blue line denotes the average shuffled correlation for that subject. Shaded region denotes the standard error of the mean. B - Average intact and shuffled correlations as a function of spectral frequency for the 20 subjects in the GRID corpus. Figure conventions as in A. C - Average intact and shuffled correlations as a function of spectral frequency for the subjects from the Wisconsin X-ray database. Figure conventions as in A.
Coherence between vision and audition.
A - Left, heat map showing the coherence between the mouth area function and the auditory signal as a function of both spectral frequency band and temporal modulation frequency for a single subject from the GRID corpus. X-axes depict temporal modulation frequency in Hz. Y-axes depict the spectral frequency in kHz. The square drawn in dashed lines depicts the region of maximal coherence between the visual and auditory signals. Right, heat map for another subject from the GRID corpus. Figure conventions as in the left panel. B - Average coherence between the mouth area function and the auditory signal as a function of both spectral frequency band and temporal modulation frequency for the twenty subjects in the GRID corpus. Figure conventions as in the left panel of A. C - Average coherence between the mouth area function and the auditory signal for four different spectral frequencies (8.8 kHz - orange, 2.3 kHz - red, 161 Hz - blue, 460 Hz - green) across all subjects in the GRID corpus, as a function of temporal frequency. Shaded regions denote the standard error of the mean. D - Average coherence between the inter-lip distance and the wideband auditory envelope as a function of both spectral frequency band and temporal modulation frequency for the fifteen subjects in the Wisconsin X-ray database. Figure conventions as in A. E - Average coherence between the area of the mouth opening and the wideband auditory envelope as a function of both spectral frequency band and temporal modulation frequency, averaged across the two subjects from the French spontaneous database. Figure conventions as in A.
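Coherence of this kind can be estimated with SciPy's magnitude-squared coherence. The sketch below uses synthetic mouth-area and band-envelope signals that share a ~4 Hz component (an assumption standing in for real data) to show where the coherence peak would fall along the temporal modulation axis.

```python
import numpy as np
from scipy.signal import coherence

fs = 100.0                       # hypothetical common sampling rate (Hz)
t = np.arange(0, 60, 1 / fs)
rng = np.random.default_rng(3)

# Synthetic stand-ins sharing a ~4 Hz component, within the 2-7 Hz band
# where audiovisual coherence is reported to peak.
shared = np.sin(2 * np.pi * 4.0 * t)
mouth_area = shared + rng.standard_normal(t.size)
band_envelope = shared + rng.standard_normal(t.size)

# Magnitude-squared coherence as a function of temporal modulation frequency.
f, cxy = coherence(mouth_area, band_envelope, fs=fs, nperseg=512)
print(f"coherence peaks near {f[np.argmax(cxy)]:.1f} Hz")
```

Repeating this for each narrowband envelope would build up a heat map over spectral band and modulation frequency of the kind shown in the figure.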
Average correlations between the area of the mouth opening and the audio envelope.
A - Left, area of the mouth opening and the auditory envelope for a single sentence from a single subject in the GRID corpus, as a function of time in seconds. X-axes depict time in seconds. Y-axes on the left depict the area of the mouth opening in pixels squared. Y-axes on the right depict the envelope in Hilbert units. Right, scatter plot of the envelope and the area of the mouth opening along with the corresponding regression line. Each red circle denotes a single point in the speech time series. Black line denotes the linear regression between the area of the mouth opening and the envelope power. The correlation coefficient between the auditory and visual components for this sentence was 0.742 (p<0.0001). B - Average rank-ordered intact correlations (red bars) and shuffled correlations (green bars) for the 20 subjects analyzed in the GRID corpus. X-axes depict subject number; y-axes depict the correlations. Intact correlations for each subject were the average across all sentences analyzed for that subject. Error bars denote the standard error of the mean. Shuffled correlations were computed as the average correlation between all non-paired auditory envelopes and the mouth area function for each subject. C - Scatter plot of the average intact versus the average shuffled correlations for the 20 subjects in the dataset. D - Mean intact and shuffled correlations over the 20 subjects in the GRID corpus. Error bars denote the standard error of the mean.
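The intact-versus-shuffled comparison amounts to correlating each mouth-area trace with its own envelope and then with envelopes from non-matching sentences. Below is a minimal sketch under the assumption that paired traces are simply cropped to a common length; the paper's exact alignment and pairing scheme may differ, and the data in the usage lines are synthetic.

```python
import numpy as np

def _corr(a, b):
    """Pearson correlation after cropping both series to a common length."""
    n = min(len(a), len(b))
    return np.corrcoef(a[:n], b[:n])[0, 1]

def intact_vs_shuffled(mouth_areas, envelopes):
    """Mean matched-sentence correlation vs the mean correlation over all
    non-matching (shuffled) pairings of mouth area and envelope."""
    n = len(mouth_areas)
    intact = np.mean([_corr(m, e) for m, e in zip(mouth_areas, envelopes)])
    shuffled = np.mean([_corr(mouth_areas[i], envelopes[j])
                        for i in range(n) for j in range(n) if i != j])
    return intact, shuffled

# Hypothetical data: five "sentences" whose mouth area and envelope share structure.
rng = np.random.default_rng(4)
shared = [rng.standard_normal(300) for _ in range(5)]
mouths = [s + 0.5 * rng.standard_normal(300) for s in shared]
envs = [s + 0.5 * rng.standard_normal(300) for s in shared]
print(intact_vs_shuffled(mouths, envs))   # intact should exceed shuffled
```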
Analysis framework for the visual and auditory components of natural speech.
A - Frames from the orofacial region of the mouth for the sentence "Bin red by y2 now" spoken by a single female speaker in the GRID corpus. Black lines denote the lip contour fitted with our contour-fitting algorithm (see Methods, http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1000436#s2). B - Top panel shows the frames and the corresponding contours for several frames of the same sentence from the same speaker shown in A. The bottom panel shows the estimated area of each mouth contour as a function of time. X-axes depict time in seconds; y-axes depict the area of the mouth opening in pixels squared. Arrows point to specific frames in the time series depicting different amounts of mouth opening. C - Similar analysis of the French phonetically balanced sentences. Tracking in this case was facilitated by the blue lipstick applied to the lips of the speaker, which allowed automatic segmentation of the lips from the face. D - A sagittal frame from the X-ray database with the pellet positions marked. Eight different pellets were recorded for this database of human speech. We analyzed the inter-lip distance between the markers UL (upper lip) and LL (lower lip). E - Estimation procedure for the wideband envelope. The signal is first band-pass filtered into narrow bands according to a cochlear frequency map. The narrowband envelopes are obtained by taking the Hilbert transform and then computing the absolute value. The wideband envelope is estimated as the sum of the narrowband envelopes.
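The envelope pipeline in panel E (band-pass filterbank, Hilbert transform, absolute value, summation) might be sketched as follows. The log-spaced band edges only approximate a cochlear frequency map, and the filter order, band count, and frequency limits are assumptions, not the paper's exact filterbank.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def wideband_envelope(audio, fs, n_bands=16, f_lo=100.0, f_hi=7500.0):
    """Wideband envelope as the sum of narrowband Hilbert envelopes.

    audio : 1-D waveform, fs : sampling rate in Hz (must exceed 2 * f_hi).
    """
    edges = np.geomspace(f_lo, f_hi, n_bands + 1)   # log-spaced band edges
    env = np.zeros_like(audio, dtype=float)
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        narrow = sosfiltfilt(sos, audio)            # one narrow band
        env += np.abs(hilbert(narrow))              # its Hilbert envelope
    return env

# Hypothetical usage with a 16 kHz waveform loaded elsewhere:
# env = wideband_envelope(waveform, fs=16000)
```

The resulting envelope would then be downsampled to the video frame rate before being compared with the mouth-area function.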
