Data processing steps for facial, lingual, laryngeal, and acoustic data.
<p><b>a)</b> The speaker’s lips were painted blue, and red dots were painted on the nose and chin. A camera was placed in front of the speaker’s face such that all painted regions were contained within the frame and the lips were approximately centered. Video was captured at 30 frames per second (fps) during speaking (i). Each frame of the video was thresholded on hue value, resulting in a binary mask. Points were defined by the upper, lower, left, and right extents of the lip mask and by the centroids of the nose and jaw masks (ii). The X and Y positions of these points were extracted as time-varying signals (iii). Grey lines mark the acoustic onset. <b>b)</b> The tongue was monitored using an ultrasound transducer held firmly under the speaker’s chin such that the tongue was centered in the frame of the ultrasound image. Video output of the ultrasound was captured at 30 fps (i). The tongue contour in each frame was extracted using EdgeTrak, yielding the X and Y positions of 100 evenly spaced points along the tongue surface (ii). From these 100 points, three equidistant points were extracted, representing the front, middle, and back tongue regions, which comprise our time-varying signal (iii). <b>c)</b> Instances of glottal closure were measured using an electroglottograph with contacts placed on either side of the speaker’s larynx. Glottal closures were tracked from changes in the impedance between the electrodes using the SIGMA algorithm [<a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0151327#pone.0151327.ref028" target="_blank">28</a>]. <b>d)</b> Speech acoustics were recorded at 22 kHz using a microphone placed in front of the subject’s mouth (though not blocking the video camera) (Fig 1di). We measured the vowel formants, F<sub>1</sub>–F<sub>4</sub>, as a function of time for each vowel utterance using an inverse filter method. For extraction of F<sub>0</sub> (pitch), we used standard autocorrelation methods.</p>
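As a minimal sketch of the per-frame landmark extraction described in panel (a), the snippet below thresholds a video frame on hue and returns the lip extremes and a dot centroid. The hue ranges and function names are illustrative assumptions, not the authors' exact implementation; stacking the outputs across frames at 30 fps yields the time-varying X/Y signals in (iii).

```python
# Hypothetical sketch of the hue-threshold landmark extraction in panel (a).
import cv2
import numpy as np

def extract_lip_points(frame_bgr, blue_lo=(100, 80, 50), blue_hi=(130, 255, 255)):
    """Threshold a frame on hue to get a binary lip mask, then return the
    upper, lower, left, and right extremes of that mask (pixel coordinates)."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, np.array(blue_lo), np.array(blue_hi))
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None  # lips not detected in this frame
    return {
        "upper": (xs[ys.argmin()], ys.min()),
        "lower": (xs[ys.argmax()], ys.max()),
        "left":  (xs.min(), ys[xs.argmin()]),
        "right": (xs.max(), ys[xs.argmax()]),
    }

def mask_centroid(frame_bgr, lo, hi):
    """Centroid of a color-thresholded mask (e.g., the red nose or chin dot)."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, np.array(lo), np.array(hi))
    m = cv2.moments(mask, binaryImage=True)
    if m["m00"] == 0:
        return None
    return (m["m10"] / m["m00"], m["m01"] / m["m00"])
```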
High-Resolution, Non-Invasive Imaging of Upper Vocal Tract Articulators Compatible with Human Brain Recordings
<div><p>A complete neurobiological understanding of speech motor control requires determination of the relationship between simultaneously recorded neural activity and the kinematics of the lips, jaw, tongue, and larynx. Many speech articulators are internal to the vocal tract, and therefore simultaneously tracking the kinematics of all articulators is nontrivial—especially in the context of human electrophysiology recordings. Here, we describe a noninvasive, multi-modal imaging system to monitor vocal tract kinematics, demonstrate this system in six speakers during production of nine American English vowels, and provide new analyses of these data. Classification and regression analysis revealed considerable variability in the articulator-to-acoustic relationship across speakers. Non-negative matrix factorization extracted basis sets capturing vocal tract shapes, allowing for higher vowel classification accuracy than traditional methods. Statistical speech synthesis generated speech from vocal tract measurements, and we demonstrate perceptual identification of the synthesized speech. We demonstrate the capacity to predict lip kinematics from ventral sensorimotor cortical activity. These results demonstrate a multi-modal system to non-invasively monitor articulator kinematics during speech production, describe novel analytic methods for relating kinematic data to speech acoustics, and provide the first decoding of speech kinematics from electrocorticography. These advances will be critical for understanding the cortical basis of speech production and the creation of vocal prosthetics.</p></div>
Articulatory and Acoustic Feature Time-courses and Classification.
<p><b>a-b)</b> Average time course of formant values (a) and articulator position (b) for each of the nine vowels examined. Traces shown are for a single subject (speaker 1). Each trial was warped using linear interpolation so that all trials were of equal length. Grey lines mark the acoustic onset and offset. Error bars denote standard error. Shaded region marks the time window used for LDA and classification analyses in (e-g). <b>c-d)</b> Change in cluster separability across the trial for acoustic (c) and articulatory (d) features. Error bars denote standard error. <b>e-f)</b> LDA projections of formant values (e) and articulator position (f) drawn from the middle fifth of each trial across all speakers. All values are z-scored across all trials. Each dot marks the values for a single trial. Color denotes the vowel spoken during the trial. Larger dots mark trials from a single speaker (same as in <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0151327#pone.0151327.g004" target="_blank">Fig 4I</a>). <b>g)</b> Classification performance resulting from running a 50x cross-validated naïve Bayes classifier on the mid-vowel acoustic and kinematic measurements. Each dot denotes an individual speaker, with error bars denoting standard error across the cross-validations. Red line marks the median performance across speakers. Horizontal lines denote statistical significance (P < 0.05, Wilcoxon signed-rank test, N = 6).</p>
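A minimal sketch of the mid-vowel analysis in panels (e-g), assuming scikit-learn and placeholder feature/label arrays: z-score the mid-vowel measurements, project them with LDA, and score a repeatedly cross-validated Gaussian naïve Bayes classifier. The split scheme and stand-in data are assumptions, not the authors' exact pipeline.

```python
# Sketch of the LDA projection and 50x cross-validated naive Bayes classification.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score, StratifiedShuffleSplit
from sklearn.preprocessing import StandardScaler

def project_and_classify(X, y, n_splits=50):
    """X: (trials x features) mid-vowel acoustic or kinematic measurements;
    y: vowel label per trial. Returns the 2-D LDA projection and the
    cross-validated classification accuracies."""
    Xz = StandardScaler().fit_transform(X)                  # z-score across trials
    lda_proj = LinearDiscriminantAnalysis(n_components=2).fit_transform(Xz, y)
    cv = StratifiedShuffleSplit(n_splits=n_splits, test_size=0.2, random_state=0)
    scores = cross_val_score(GaussianNB(), Xz, y, cv=cv)    # 50x cross-validation
    return lda_proj, scores

# Example with random stand-in data (9 vowels, 20 trials each, 5 features):
rng = np.random.default_rng(0)
y = np.repeat(np.arange(9), 20)
X = rng.normal(size=(180, 5)) + y[:, None] * 0.3
proj, acc = project_and_classify(X, y)
print(acc.mean(), acc.std() / np.sqrt(len(acc)))            # mean accuracy, s.e.m.
```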
Continuous linear relationship between vowel acoustics and articulator position.
<p><b>a)</b> Scatter plot of pitch values (F<sub>0</sub>) vs. frequency of glottal closures from three subjects (two male, one female). Color corresponds to vowel identity. <b>b)</b> Linear prediction of (z-scored) back tongue height from all acoustic features vs. the observed values for all nine vowels and six speakers. <b>c)</b> Linear prediction of (z-scored) F<sub>1</sub> from all kinematic features vs. the observed values for all nine vowels and six speakers. <b>d)</b> Acoustic-to-articulator mappings. Amount of explained variance (R<sup>2</sup>) for six kinematic features predicted from all acoustic features for each subject. Subjects are identified by symbol. <b>e)</b> Articulator-to-acoustic mappings. Amount of explained variance (R<sup>2</sup>) for five acoustic features predicted from all kinematic features for each subject.</p>
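A sketch of the linear mappings in panels (b-e), under the assumption of ordinary least-squares regression with cross-validated R<sup>2</sup>: each target feature (kinematic or acoustic) is predicted from all features of the other modality. Array names are placeholders for the measured features.

```python
# Sketch of the acoustic-to-articulator / articulator-to-acoustic linear mappings.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import r2_score
from scipy.stats import zscore

def explained_variance_map(X_source, Y_target):
    """X_source: (trials x source features), Y_target: (trials x target features).
    Returns cross-validated R^2 for each target feature predicted from all
    source features."""
    Xz, Yz = zscore(X_source, axis=0), zscore(Y_target, axis=0)
    r2 = []
    for j in range(Yz.shape[1]):
        pred = cross_val_predict(LinearRegression(), Xz, Yz[:, j], cv=10)
        r2.append(r2_score(Yz[:, j], pred))
    return np.array(r2)

# e.g., R^2 for each kinematic feature predicted from the acoustic features:
# r2_kinematic = explained_variance_map(acoustic_features, kinematic_features)
```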
Decoding of lip aperture from ECoG recordings during production of words.
<p><b>a)</b> Lateral view of the right hemisphere of a neurosurgical patient. The locations of ECoG electrodes over the ventral sensorimotor cortex are demarcated with grey disks. <b>b)</b> Example lip shape and vertical aperture during production of words in this subject. <b>c)</b> Predicted lip aperture based on linear decoding of ECoG data vs. the actual aperture. Each dot is a time point; the red dashed line is the best linear fit.</p>
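The sketch below illustrates one way to implement the linear decoding in panel (c): predict lip aperture at each time point from time-lagged ECoG features across the ventral sensorimotor electrodes. Ridge regularization and the lag window are assumptions; the figure only states that the decoder is linear.

```python
# Hypothetical linear decoder: lip aperture from lagged ECoG features.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

def decode_aperture(ecog, aperture, lags=range(-5, 6), alpha=1.0):
    """ecog: (time x electrodes) neural features (e.g., high-gamma amplitude);
    aperture: (time,) vertical lip aperture. Builds lagged neural features and
    returns cross-validated linear predictions of aperture."""
    cols = [np.roll(ecog, lag, axis=0) for lag in lags]   # time-lagged copies
    X = np.concatenate(cols, axis=1)
    pred = cross_val_predict(Ridge(alpha=alpha), X, aperture, cv=5)
    r = np.corrcoef(pred, aperture)[0, 1]                 # prediction accuracy
    return pred, r
```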
Synthesis of speech from articulator kinematics.
<p><b>a)</b> Examples of reference and synthesized stimuli, generated from the indicated articulatory feature sets, for articulation of the phoneme /<i>ɑ</i>/ by one speaker. <b>b)</b> Error in the synthesized acoustics, measured as mean cepstral distortion, as a function of the number of tongue points. Each black line is one subject, overlaid with the grey average trend across subjects. The red dot at 10 points marks the number of tongue points selected as optimal. <b>c)</b> Prediction error with different sets of articulators used for synthesis; each dot is one subject, and the red line segment marks the mean across subjects. <b>d)</b> Reference natural stimuli as perceived by listeners based in the United States (i) and by Turkers globally (ii); stimuli synthesized using 10 tongue points (iii) and using both tongue and lip kinematics (iv).</p>
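The error measure in panel (b) can be sketched as below, assuming an MFCC front end and the standard mel-cepstral-distortion constant; the caption reports "mean cepstral distortion" without specifying this exact formulation, so treat the details as assumptions.

```python
# Hypothetical mean cepstral distortion between reference and synthesized audio.
import numpy as np
import librosa

def mean_cepstral_distortion(ref_wav, syn_wav, sr=22050, n_mfcc=13):
    """Frame-wise cepstral distortion (dB) between two waveforms, averaged
    over frames."""
    ref = librosa.feature.mfcc(y=ref_wav, sr=sr, n_mfcc=n_mfcc)[1:]  # drop c0 (energy)
    syn = librosa.feature.mfcc(y=syn_wav, sr=sr, n_mfcc=n_mfcc)[1:]
    n = min(ref.shape[1], syn.shape[1])                   # align frame counts
    diff = ref[:, :n] - syn[:, :n]
    per_frame = (10.0 / np.log(10)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=0))
    return per_frame.mean()
```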
Utterance-to-utterance registration of vocal tract data.
<p><b>a)</b> All tongue data from one subject. Left, top: overlay of raw tongue data during the pre-vocalization period for all trials; left, bottom: overlay of the same data after applying the optimal transformation. Right, top: overlay of raw tongue data during vocalization for all trials; right, bottom: overlay of the same data after applying the transformation from the pre-vocalization time. <b>b)</b> All lip data from one subject. Left, top: overlay of raw lip data during the pre-vocalization period for all trials; left, bottom: overlay of the same data after applying the optimal transformation. Right, top: overlay of raw lip data during vocalization for all trials; right, bottom: overlay of the same data after applying the transformation from the pre-vocalization time. <b>c)</b> Quantification of the efficacy of applying the transform from pre-vocalization data to vocalization times: enhanced separability. We calculated the separability between vowels based on articulatory features during vocalization before (raw) and after (transformed) applying the transformation optimized for the pre-vocalization time. The transformed data consistently had increased separability. Black points: mean ± s.d. across vowel comparisons for a subject; red line: median across subjects.</p>
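One way to sketch this registration step: fit a transform that aligns each trial's pre-vocalization contour to a reference contour, then apply that same transform to the trial's vocalization-period data. A rigid (rotation plus translation) Procrustes fit is an assumption here; the caption refers only to an "optimal transformation".

```python
# Hypothetical rigid registration of per-trial contours to a reference.
import numpy as np
from scipy.linalg import orthogonal_procrustes

def fit_rigid_transform(source_pts, reference_pts):
    """source_pts, reference_pts: (points x 2) contours from the pre-vocalization
    period. Returns rotation R and translation t mapping source -> reference."""
    mu_s, mu_r = source_pts.mean(axis=0), reference_pts.mean(axis=0)
    R, _ = orthogonal_procrustes(source_pts - mu_s, reference_pts - mu_r)
    t = mu_r - mu_s @ R
    return R, t

def apply_transform(points, R, t):
    """Apply the pre-vocalization transform to vocalization-period points."""
    return points @ R + t
```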
Unsupervised extraction of vocal tract shape with non-negative matrix factorization improves vowel classification.
<p><b>a)</b> Mean tongue shape for each vowel from one subject. <b>b)</b> Non-negative matrix bases blindly extracted from the tongue data for all vowels from one subject. <b>c)</b> Mean lip shape for each vowel from one subject. <b>d)</b> Non-negative matrix bases blindly extracted from the lip data for all vowels from one subject. <b>e-h)</b> Similarity (R<sup>2</sup>) between mean tongue shapes (<b>e</b>), tongue non-negative matrix components (<b>f</b>), mean lip shapes (<b>g</b>), and lip non-negative matrix components (<b>h</b>). <b>i)</b> Scatter plot of all vowels in the first 3 linear discriminant dimensions for one subject (same as <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0151327#pone.0151327.g003" target="_blank">Fig 3A–3D</a>). <b>j)</b> Cross-validated classification accuracy of vowels from vocal tract shapes across all subjects. Naïve Bayes classifiers were trained to predict vowel identity based on NMF reconstruction weights for the lips and tongue individually, as well as from both lips and tongue. The combined model outperforms the individual models. Furthermore, the classification accuracy is enhanced relative to using the pre-defined articulatory features.</p>
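A minimal sketch of panel (j), assuming scikit-learn: extract NMF bases from the shape data, then classify vowels from the per-trial reconstruction weights with a cross-validated naïve Bayes classifier. The number of components and the preprocessing are assumptions.

```python
# Hypothetical NMF-based vowel classification from vocal tract shapes.
import numpy as np
from sklearn.decomposition import NMF
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

def nmf_vowel_classification(shapes, vowel_labels, n_components=4):
    """shapes: (trials x shape features) vocal tract shape vectors (e.g., tongue
    contour heights); vowel_labels: vowel identity per trial. Returns the
    cross-validated classification accuracy and the extracted bases."""
    shapes = shapes - shapes.min()            # ensure non-negativity
    nmf = NMF(n_components=n_components, init="nndsvda", max_iter=500, random_state=0)
    weights = nmf.fit_transform(shapes)       # per-trial reconstruction weights
    scores = cross_val_score(GaussianNB(), weights, vowel_labels, cv=10)
    return scores.mean(), nmf.components_

# Concatenating the tongue and lip weight matrices (np.hstack) before
# classification mirrors the combined "lips + tongue" model in panel (j).
```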
Associations of episodic memory score with PiB and WMH.
<p>PiB, mean cortical Pittsburgh compound B binding, expressed as the distribution volume ratio; WMH, MRI white matter T2 hyperintensity. Plots generated from models adjusted for age and education showed an independent relationship between lower memory and higher PiB (0.14 lower memory Z score for each 0.1 increase in PiB distribution volume ratio, 95% confidence interval -0.28 to -0.01) but no relationship with WMH.</p>
Associations of executive function score with PiB and WMH.
<p>PiB, mean cortical Pittsburgh compound B binding, expressed as the distribution volume ratio; WMH, MRI white matter T2 hyperintensity. Plots generated from models adjusted for age and education showed a non-significant trend toward lower executive function with higher PiB (-0.12 change per 0.1 unit increase in PiB distribution volume ratio, 95% confidence interval -0.24 to 0.01) but no relationship with WMH.</p>
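The adjusted models behind these two plots can be sketched as a linear regression of the cognitive Z score on PiB and WMH with age and education as covariates. The variable names and the use of statsmodels OLS are assumptions; the captions describe only the adjustment and the effect per 0.1-unit increase in PiB.

```python
# Hypothetical adjusted regression model for the memory / executive function plots.
import pandas as pd
import statsmodels.formula.api as smf

def fit_adjusted_model(df: pd.DataFrame, outcome: str):
    """df columns assumed: outcome Z score, 'pib' (distribution volume ratio),
    'wmh', 'age', 'education'. Returns the fitted OLS model."""
    model = smf.ols(f"{outcome} ~ pib + wmh + age + education", data=df).fit()
    # Effect per 0.1-unit increase in PiB DVR, with its 95% CI:
    beta = model.params["pib"] * 0.1
    ci_lo, ci_hi = model.conf_int().loc["pib"] * 0.1
    print(f"{outcome}: {beta:.2f} per 0.1 PiB DVR (95% CI {ci_lo:.2f} to {ci_hi:.2f})")
    return model
```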