Temporal modulations in the population.
<p>A - Left, average multitaper Fourier spectrum of the wideband auditory envelope across all subjects in the GRID corpus plotted on log-log axes. Right, average multitaper Fourier spectrum of the mouth area function across all subjects in the GRID corpus plotted on log-log axes. Figure conventions as in 5B. Gray shading denotes regions in the 2–7 Hz band, which seem to deviate from the 1/f fit. B - Left, average multitaper Fourier spectrum of the wideband auditory envelope across all subjects in the Wisconsin x-ray database plotted on log-log axes. Right, average multitaper Fourier spectrum of the inter-lip distance across all subjects in the Wisconsin x-ray database plotted on log-log axes. Figure conventions as in 5B. C - Left, average multitaper Fourier spectrum of the wideband auditory envelope averaged over the entire spontaneous speech segment for the two subjects from the spontaneous speech database plotted on log-log axes. Right, average multitaper Fourier spectrum of the mouth area function for the two subjects in the spontaneous speech database plotted on log-log axes. Figure conventions as in 5B.</p>
Coherence between vision and audition.
<p>A - Left, heat map shows the coherence between the mouth area function and the auditory signal as a function of both spectral frequency band and temporal modulation frequency for a single subject from the GRID corpus. X-axes depict temporal modulation frequency in Hz. Y-axes depict the spectral frequency in kHz. Square drawn in dashed lines depicts the region of maximal coherence between the visual and auditory signals. Right, heat map for another subject from the GRID corpus. Figure conventions as in the left panel. B - Average coherence between the mouth area function and the auditory signal as a function of both spectral frequency band and temporal modulation frequency for the twenty subjects in the GRID corpus. Figure conventions as in the left panel of A. C - Average coherence between the mouth area function and the auditory signal for four different spectral frequencies (8.8 kHz – orange, 2.3 kHz – red, 161 Hz – blue, 460 Hz – green) across all subjects in the GRID corpus as a function of temporal frequency. Shaded regions denote the standard error of the mean. D - Average coherence between the inter-lip distance and the wideband auditory envelope as a function of both spectral frequency band and temporal modulation frequency for the fifteen subjects in the Wisconsin x-ray database. Figure conventions as in A. E - Average coherence between the area of the mouth opening and the wideband auditory envelope as a function of both spectral frequency band and temporal modulation frequency averaged across the two subjects from the French spontaneous speech database. Figure conventions as in A.</p>
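A minimal sketch of the coherence computation underlying these heat maps, assuming the mouth area function and each narrowband auditory envelope have already been resampled to a common rate. It uses scipy.signal.coherence (a Welch-style estimator) as a stand-in for the multitaper coherence estimator described in the paper; the function name, sampling rate, and segment length are illustrative assumptions.

<pre>
import numpy as np
from scipy.signal import coherence

def coherence_map(mouth_area, narrowband_envelopes, fs=120.0, nperseg=256):
    """Coherence between the mouth area function and each narrowband
    auditory envelope, as a function of temporal modulation frequency.
    mouth_area           : 1-D array, visual signal at sampling rate fs
    narrowband_envelopes : 2-D array (n_bands, n_samples) at the same fs
    Returns (freqs, coh) with coh shaped (n_bands, n_freqs)."""
    rows = []
    for env in narrowband_envelopes:
        freqs, cxy = coherence(mouth_area, env, fs=fs, nperseg=nperseg)
        rows.append(cxy)
    return freqs, np.vstack(rows)
</pre>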
Average correlations between visual and auditory segments for longer speech materials.
<p>A - Top, inter-lip distance and the auditory envelope for a single 20 second segment from a single subject in the X-ray database as a function of time. X-axes depict time in seconds. Y-axes on the left depict the distance between the lower and upper lip in millimeters. Y-axes on the right depict the power in the wideband envelope. Bottom, a zoomed-in portion of the 8–12 second segment of the same data shown in the top panel. Clear correspondences are present between the inter-lip distance and the auditory envelope. B - Scatter plot of the envelope power and inter-lip distance along with the corresponding regression line. Each red circle denotes a single point in the speech time series. Black line denotes the linear regression between the inter-lip distance and the envelope power. Correlation coefficient between auditory and visual components for this sentence was 0.49 (p<0.0001). C - Average rank-ordered intact correlations (red bars) and shuffled correlations (green bars) for the 15 subjects analyzed in the Wisconsin x-ray database. X-axes depict subject number; Y-axes depict the correlations. Intact correlations for each subject were the average across all speech segments analyzed for that subject. Error bars denote standard error of the mean. Shuffled correlations were computed as the average correlation between all non-paired auditory envelopes and the inter-lip distance for each subject.</p>
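A minimal sketch of the intact versus shuffled correlation analysis, assuming the visual (inter-lip distance) and auditory (wideband envelope) time series for each speech segment of a subject are available as arrays at a common sampling rate. Truncating non-paired segments to a common length before correlating them is an assumption here, not necessarily the authors' procedure, and the helper name is hypothetical.

<pre>
import numpy as np
from scipy.stats import pearsonr

def intact_and_shuffled_correlations(visual_segments, audio_segments):
    """Mean Pearson correlation for paired (intact) and non-paired
    (shuffled) visual/auditory segments from one subject."""
    intact = [pearsonr(v, a)[0] for v, a in zip(visual_segments, audio_segments)]
    shuffled = []
    for i, v in enumerate(visual_segments):
        for j, a in enumerate(audio_segments):
            if i == j:
                continue
            n = min(len(v), len(a))          # truncate to the common length
            shuffled.append(pearsonr(v[:n], a[:n])[0])
    return np.mean(intact), np.mean(shuffled)
</pre>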
Measuring time-to-voice for consonants at the beginning of words.
<p>A - Visual and auditory dynamics during the production of the word “PROBLEM” by a single speaker. Green and red lines denote the velocity profiles of the upper and lower lips, respectively. Dark blue line denotes the inter-lip distance as a function of time. The waveform of the sound is shown in black. Blue dashed line denotes the point of maximal lower lip velocity (marked by (1)). Red dashed line denotes the point of zero lower lip velocity (marked by (2)). Solid black line denotes the onset of the sound. The green dot denotes the maximal mouth opening before the onset of the sound. The red dot denotes the half peak point of the mouth opening. X-axes depict time in milliseconds. B - Inter-lip distance for individual subjects and the average as a function of time for the bilabial plosive /p/ aligned to the onset of the sound. Red dashed line denotes the half peak opening of the mouth. X-axes depict time in milliseconds; Y-axes depict inter-lip distance in mm. Solid black line denotes the onset of the sound. Gray lines denote traces from individual subjects. Red line denotes the average across subjects. Shaded regions denote the standard error of the mean. C - Inter-lip distance for individual subjects and the average as a function of time for the bilabial consonant /m/ aligned to the onset of the sound. Figure conventions as in B. D - Inter-lip distance for individual subjects and the average as a function of time for the bilabial plosive /b/ aligned to the onset of the sound. Figure conventions as in B. E - Inter-lip distance for individual subjects and the average as a function of time for the labiodental /f/ aligned to the onset of the sound. Figure conventions as in B. F - Average time to half opening from the onset of the sound for the three bilabial consonants and the labiodental across subjects. Error bars denote the standard error of the mean.</p>
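A minimal sketch of how the time to half opening could be extracted from an inter-lip distance trace and the acoustic onset, as summarized in panel F; the helper name and the exact definition of the pre-onset peak are assumptions rather than the authors' implementation.

<pre>
import numpy as np

def time_to_half_opening(inter_lip, t_ms, sound_onset_ms):
    """Interval (ms) between the moment the inter-lip distance first
    reaches half of its pre-onset peak and the onset of the sound.
    inter_lip : 1-D array of inter-lip distance (mm)
    t_ms      : 1-D array of time stamps (ms), same length as inter_lip"""
    pre = t_ms <= sound_onset_ms
    lip_pre, t_pre = inter_lip[pre], t_ms[pre]
    peak_idx = np.argmax(lip_pre)                     # maximal pre-onset opening
    half_peak = lip_pre[peak_idx] / 2.0
    # first sample on the rising phase at or above half the peak opening
    half_idx = np.flatnonzero(lip_pre[:peak_idx + 1] >= half_peak)[0]
    return sound_onset_ms - t_pre[half_idx]
</pre>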
Context modifies time-to-voice for consonants.
<p>A - Visual and auditory dynamics during the production of the word “APA” by a single speaker. Conventions as in <a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1000436#pcbi-1000436-g008" target="_blank">Figure 8A</a>. B - Average inter-lip distance as a function of time for the words APA, AMA, ABA, and AFA across all subjects in the database, aligned to the onset of the respective consonants (/p/, /m/, /b/, /f/). X-axes depict time in milliseconds; Y-axes depict the mean-corrected inter-lip distance in mm. Shaded regions denote the standard error of the mean.</p>
Temporal modulations in the visual and auditory signals.
<p>A - Left, average multitaper Fourier spectrum of the wideband auditory envelope across all sentences for a single subject from the GRID corpus. X-axes depict frequency in Hz. Y-axes depict power. Shaded regions denote the standard error of the mean. Note the peak in the 2–7 Hz range. Right, average multitaper Fourier spectrum of the mouth area function across all the sentences for the same subject shown in the left panel. X-axes depict frequency in Hz. Y-axes depict power. Shaded regions denote the standard error of the mean. B - Left, average multitaper Fourier spectrum of the wideband auditory envelope across all sentences for the same subject shown in A, plotted on log-log axes. X-axes depict frequency in Hz. Y-axes depict power in log10 units. Shaded regions denote the standard error of the mean. Black line denotes a 1/f fit to the data. Deviations from 1/f are usually observed when there are rhythmic modulations in a signal. Right, average multitaper Fourier spectrum of the mouth area function across all the sentences for the same subject. Figure conventions as in the left panel.</p>
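A minimal sketch of the multitaper spectrum and the 1/f fit shown in panel B, assuming DPSS (Slepian) tapers from SciPy and an ordinary least-squares line in log-log coordinates; the taper bandwidth and count are illustrative, not the settings used in the paper.

<pre>
import numpy as np
from scipy.signal.windows import dpss

def multitaper_spectrum(x, fs, NW=3.0, n_tapers=5):
    """Average of tapered periodograms using DPSS tapers."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    tapers = dpss(len(x), NW, Kmax=n_tapers)          # (n_tapers, n_samples)
    spectra = np.abs(np.fft.rfft(tapers * x, axis=1)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    return freqs, spectra.mean(axis=0)

def one_over_f_fit(freqs, power):
    """Least-squares line log10(P) = intercept + slope * log10(f)."""
    keep = freqs > 0
    slope, intercept = np.polyfit(np.log10(freqs[keep]), np.log10(power[keep]), 1)
    return slope, intercept
</pre>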
Average correlations between the area of the mouth opening and the audio envelope.
<p>A - Left, area of the mouth opening and the auditory envelope for a single sentence from a single subject in the GRID corpus as a function of time. X-axes depict time in seconds. Y-axes on the left depict the area of the mouth opening in square pixels. Y-axes on the right depict the envelope in Hilbert units. Right, scatter plot of the envelope and the area of the mouth opening along with the corresponding regression line. Each red circle denotes a single point in the speech time series. Black line denotes the linear regression between the area of the mouth opening and the envelope power. Correlation coefficient between auditory and visual components for this sentence was 0.742 (p<0.0001). B - Average rank-ordered intact correlations (red bars) and shuffled correlations (green bars) for the 20 subjects analyzed in the GRID corpus. X-axes depict subject number; Y-axes depict the correlations. Intact correlations for each subject were the average across all sentences analyzed for that subject. Error bars denote standard error of the mean. Shuffled correlations were computed as the average correlation between all non-paired auditory envelopes and the mouth area function for each subject. C - Scatter plot of the average intact versus the average shuffled correlations for the 20 subjects in the dataset. D - Mean intact and shuffled correlations over the 20 subjects in the GRID corpus. Error bars denote standard error of the mean.</p>
Analysis framework for the visual and auditory components of natural speech.
<p>A - Frames from the orofacial region of the mouth for the sentence “Bin red by y2 now” spoken by a single female speaker in the GRID corpus. Black lines denote the lip contour fitted using our contour fitting algorithm (see <a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1000436#s2" target="_blank">methods</a>). B - Top panel shows several frames and their corresponding contours for the same sentence from the same speaker shown in A. The bottom panel shows the estimated area for each mouth contour as a function of time. X-axes depict time in seconds; Y-axes depict the area of the mouth opening in square pixels. Arrows point to specific frames in the time series depicting different amounts of mouth opening. C - Similar analysis of the French phonetically balanced sentences. Tracking in this case was facilitated by the blue lipstick applied to the lips of the speaker, which allowed for automatic segmentation of the lips from the face. D - A sagittal frame from the x-ray database with the pellet positions marked. Eight different pellets were recorded for this database of human speech. We analyzed the inter-lip distance between the markers UL (upper lip) and LL (lower lip). E - Estimation procedure for the wideband envelope. The signal is first band-pass filtered into narrow bands according to a cochlear frequency map. The narrowband envelopes are obtained by taking the Hilbert transform and then computing the absolute value. The wideband envelope is estimated as the sum of the narrowband envelopes.</p>
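The envelope estimation procedure described in panel E can be sketched as follows; the log-spaced band edges, band count, and filter order are assumptions standing in for the cochlear frequency map used in the paper.

<pre>
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def wideband_envelope(audio, fs, n_bands=16, fmin=100.0, fmax=8000.0):
    """Band-pass the signal into narrow bands, take the magnitude of the
    Hilbert transform in each band, and sum the narrowband envelopes.
    Assumes fs > 2 * fmax."""
    edges = np.geomspace(fmin, fmax, n_bands + 1)     # log-spaced band edges
    envelope = np.zeros(len(audio), dtype=float)
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(4, [lo, hi], btype='bandpass', fs=fs, output='sos')
        band = sosfiltfilt(sos, audio)
        envelope += np.abs(hilbert(band))             # narrowband envelope
    return envelope
</pre>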