    Harnessing AI for Speech Reconstruction using Multi-view Silent Video Feed

    Speechreading, or lipreading, is the technique of understanding and extracting phonetic features from a speaker's visual features, such as the movement of the lips, face, teeth and tongue. It has a wide range of multimedia applications, such as surveillance, Internet telephony, and aids for people with hearing impairments. However, most work in speechreading has been limited to text generation from silent videos. Recently, research has started venturing into generating (audio) speech from silent video sequences, but there have been no developments thus far in dealing with divergent views and poses of a speaker. Thus, although multiple camera feeds of a speaker are often available, they have not been used to deal with different poses. To this end, this paper presents the world's first multi-view speech reading and reconstruction system. This work extends the boundaries of multimedia research by putting forth a model which leverages silent video feeds from multiple cameras recording the same subject to generate intelligible speech for a speaker. Initial results confirm the usefulness of exploiting multiple camera views in building an efficient speech reading and reconstruction system. They further show the optimal placement of cameras that leads to the maximum intelligibility of speech. Finally, the paper lays out various innovative applications for the proposed system, focusing on its potentially large impact not just in the security arena but in many other multimedia analytics problems.
    Comment: 2018 ACM Multimedia Conference (MM '18), October 22-26, 2018, Seoul, Republic of Korea

    Changes in Audiovisual Word Perception During Mid-Childhood: An ERP Study

    Throughout the school-age years, speech perception is an important skill that often relies on the child’s ability to combine auditory and visual information from the speaker. To better understand the development of multisensory speech perception during mid-childhood, we analyzed audiovisual word perception in three groups of participants: 8- to 9-year-olds, 11- to 12-year-olds, and adults. Participants matched visually perceived articulatory movements with corresponding auditory words. In “congruent” trials, the auditory word matched the subsequently presented silent visual articulation. In “incongruent” trials, the words presented differed in the initial phoneme. From this task, we evaluated specific neural components, the N400 and the Late Positive Complex (LPC), which index the phoneme and whole-word levels of audiovisual processing, respectively. The results of this experiment were then related to a real-life behavioral speech perception skill, namely, listening to speech in noise. Our results suggest that while the LPC becomes adultlike by the age of 11 or 12, the N400 is not fully matured until later in development. In addition, the relation of the LPC to listening to speech in noise is stronger earlier in childhood, while the relation of the N400 is stronger during later school years and adulthood. Overall, we show that audiovisual processes related to the whole-word level mature earlier in childhood than processes related to the phonological level.

    Connections between articulations and grasping

    The idea that hand gestures and speech are connected is quite old. Some theories even suggest that language is primarily based on a manual communication system. In this thesis, I present four studies in which we studied the connections between articulatory gestures and manual grasps. The work is based on an earlier finding showing systematic connections between specific articulatory gestures and grasp types. For example, uttering a syllable such as [kɑ] can facilitate power grip responses, whereas uttering a syllable such as [ti] can facilitate precision grip responses. I will refer to this phenomenon as the articulation-grip congruency effect. As in the original work, we used special power and precision grip devices that the participants held in their hand to perform responses. In Study I, we measured response times and accuracy of grip responses and vocalisations to investigate whether the effect can also be observed in vocal responses, and to what extent the effect operates in the action selection processes. In Study II, grip response times were measured to investigate whether the effect persists when the syllables are only heard or read silently. Study III investigated the influence of grasp planning and/or execution on categorizing perceived syllables. In Study IV, we measured electrical activity in the brain while participants listened to syllables that were either congruent or incongruent with the precision or power grip, and we investigated how performing different grips affected the auditory processing of the heard syllables. The results of Study I showed that besides manual facilitation, the effect is also observed in vocal responses, both when a simultaneous grip is executed and when it is only prepared, meaning that overt execution is not needed for the effect. This suggests that the effect operates in action planning.
In addition, the effect was also observed when the participants knew beforehand which response they should execute, suggesting that the effect is not based on the action selection processes. Study II showed that the effect was also observed when the syllables were heard or read silently, supporting the view that articulatory simulation of a perceived syllable can activate the motor program of the grasp which is congruent with the syllable. Study III revealed that grip preparation can influence categorization of perceived syllables. The participants were biased to categorize noise-masked syllables as being [ke] rather than [te] when they were prepared to execute the power grip, and vice versa when they were prepared to execute the precision grip. Finally, Study IV showed that grip performance also modulates early auditory processing of heard syllables. These results support the view that articulatory and hand motor representations form a partly shared network, where activity from one domain can induce activity in the other. This is in line with earlier studies that have shown a more general linkage between mouth and manual processes, and it expands this notion of hand-mouth interaction by showing that these connections can also operate between very specific hand and articulatory gestures.

    Atypical audiovisual speech integration in infants at risk for autism

    The language difficulties often seen in individuals with autism might stem from an inability to integrate audiovisual information, a skill important for language development. We investigated whether 9-month-old siblings of older children with autism, who are at an increased risk of developing autism, are able to integrate audiovisual speech cues. We used an eye-tracker to record where infants looked when shown a screen displaying two faces of the same model, where one face is articulating /ba/ and the other /ga/, with one face congruent with the syllable sound being presented simultaneously, the other face incongruent. This method was successful in showing that infants at low risk can integrate audiovisual speech: they looked for the same amount of time at the mouths in both the fusible visual /ga/ - audio /ba/ and the congruent visual /ba/ - audio /ba/ displays, indicating that the auditory and visual streams fuse into a McGurk-type of syllabic percept in the incongruent condition. It also showed that low-risk infants could perceive a mismatch between auditory and visual cues: they looked longer at the mouth in the mismatched, non-fusible visual /ba/ - audio /ga/ display compared with the congruent visual /ga/ - audio /ga/ display, demonstrating that they perceive an uncommon, and therefore interesting, speech-like percept when looking at the incongruent mouth (repeated ANOVA: displays x fusion/mismatch conditions interaction: F(1,16) = 17.153, p = 0.001). The looking behaviour of high-risk infants did not differ according to the type of display, suggesting difficulties in matching auditory and visual information (repeated ANOVA, displays x conditions interaction: F(1,25) = 0.09, p = 0.767), in contrast to low-risk infants (repeated ANOVA: displays x conditions x low/high-risk groups interaction: F(1,41) = 4.466, p = 0.041). In some cases this reduced ability might lead to the poor communication skills characteristic of autism.
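
    The reported p-values can be sanity-checked from the F statistics alone. The sketch below is not the authors' analysis; it only assumes the standard identity that an F statistic with 1 numerator degree of freedom equals the square of a t statistic with the same error degrees of freedom, and it integrates the t density numerically with Simpson's rule. The function names (`t_pdf`, `p_value_f_1_df`) and the step count are illustrative choices.

```python
import math


def t_pdf(t: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (
        math.lgamma((df + 1) / 2)
        - math.lgamma(df / 2)
        - 0.5 * math.log(df * math.pi)
    )
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(t * t / df))


def p_value_f_1_df(f_stat: float, df_error: int, steps: int = 2000) -> float:
    """Upper-tail p-value for F(1, df_error), using F(1, df) = t(df)**2.

    `steps` must be even (composite Simpson's rule).
    """
    t = math.sqrt(f_stat)
    h = t / steps
    # Integrate the t density over [0, t] with composite Simpson's rule.
    total = t_pdf(0.0, df_error) + t_pdf(t, df_error)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df_error)
    area = total * h / 3
    # Two-sided t tail probability equals the upper tail of F(1, df).
    return 1.0 - 2.0 * area


# F statistics and error degrees of freedom as reported in the abstract.
reported = [
    ("low-risk interaction", 17.153, 16),
    ("high-risk interaction", 0.09, 25),
    ("group interaction", 4.466, 41),
]
for label, f_stat, df_error in reported:
    print(f"{label}: F(1,{df_error}) = {f_stat}, p = {p_value_f_1_df(f_stat, df_error):.4f}")
```

    Comparing the printed values with the p-values quoted in the abstract is a quick consistency check on the reported statistics; it says nothing about the underlying data or model.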

    The computer synthesis of expressive three-dimensional facial character animation.

    The present research is concerned with the design, development and implementation of three-dimensional computer-generated facial images capable of expression, gesture and speech. A review of previous work in chapter one shows that to date the model of computer-generated faces has been one in which construction and animation were not separated and which therefore possessed only a limited expressive range. It is argued in chapter two that the physical description of the face cannot be seen as originating from a single generic mould. Chapter three therefore describes data acquisition techniques employed in the computer generation of free-form surfaces which are applicable to three-dimensional faces. Expressions are the result of the distortion of the surface of the skin by the complex interactions of bone, muscle and skin. Chapter four demonstrates with static images and short animation sequences in video that a muscle model process algorithm can simulate the primary characteristics of the facial muscles. Three-dimensional speech synchronization was the most complex problem to solve effectively. Chapter five describes two successful approaches: the direct mapping of mouth shapes in two dimensions to the model in three dimensions, and geometric distortions of the mouth created by the contraction of specified muscle combinations. Chapter six describes the implementation of software for this research and argues the case for a parametric approach. Chapter seven is concerned with the control of facial articulations and discusses a more biological approach to these. Finally, chapter eight draws conclusions from the present research and suggests further extensions.

    Engaging the articulators enhances perception of concordant visible speech movements

    PURPOSE This study aimed to test whether (and how) somatosensory feedback signals from the vocal tract affect concurrent unimodal visual speech perception. METHOD Participants discriminated pairs of silent visual utterances of vowels under 3 experimental conditions: (a) normal (baseline) and while holding either (b) a bite block or (c) a lip tube in their mouths. To test the specificity of somatosensory-visual interactions during perception, we assessed discrimination of vowel contrasts optically distinguished based on their mandibular (English /ɛ/-/æ/) or labial (English /u/-French /u/) postures. In addition, we assessed perception of each contrast using dynamically articulating videos and static (single-frame) images of each gesture (at vowel midpoint). RESULTS Engaging the jaw selectively facilitated perception of the dynamic gestures optically distinct in terms of jaw height, whereas engaging the lips selectively facilitated perception of the dynamic gestures optically distinct in terms of their degree of lip compression and protrusion. Thus, participants perceived visible speech movements in relation to the configuration and shape of their own vocal tract (and possibly their ability to produce covert vowel production-like movements). In contrast, engaging the articulators had no effect when the speaking faces did not move, suggesting that the somatosensory inputs affected perception of time-varying kinematic information rather than changes in target (movement end point) mouth shapes. CONCLUSIONS These findings suggest that orofacial somatosensory inputs associated with speech production prime premotor and somatosensory brain regions involved in the sensorimotor control of speech, thereby facilitating perception of concordant visible speech movements. SUPPLEMENTAL MATERIAL https://doi.org/10.23641/asha.9911846
    Funding: R01 DC002852 - NIDCD NIH HHS

    The Role of Speech Production System in Audiovisual Speech Perception

    Seeing the articulatory gestures of the speaker significantly enhances speech perception. Findings from recent neuroimaging studies suggest that activation of the speech motor system during lipreading enhances speech perception by tuning, in a top-down fashion, speech-sound processing in the superior aspects of the posterior temporal lobe. Anatomically, the superior-posterior temporal lobe areas receive connections from the auditory, visual, and speech motor cortical areas. Thus, it is possible that neuronal receptive fields are shaped during development to respond to speech-sound features that coincide with visual and motor speech cues, in contrast with the anterior/lateral temporal lobe areas that might process speech sounds predominantly based on acoustic cues. The superior-posterior temporal lobe areas have also been consistently associated with auditory spatial processing. Thus, the involvement of these areas in audiovisual speech perception might partly be explained by the spatial processing requirements when associating sounds, seen articulations, and one’s own motor movements. Tentatively, it is possible that the anterior “what” and posterior “where / how” auditory cortical processing pathways are parts of an interacting network, the instantaneous state of which determines what one ultimately perceives, as potentially reflected in the dynamics of oscillatory activity.

    A Silent-Speech Interface using Electro-Optical Stomatography

    Speech technology is a major and growing industry that enriches the lives of technologically minded people in a number of ways. Many potential users are, however, excluded: namely, all speakers who cannot produce speech easily, or at all. Silent-Speech Interfaces offer a way to communicate with a machine through a convenient speech recognition interface without the need for acoustic speech. They can also potentially provide a full replacement voice by synthesizing the intended utterances that are only silently articulated by the user. To that end, the speech movements need to be captured and mapped to either text or acoustic speech. This dissertation proposes a new Silent-Speech Interface based on a newly developed measurement technology called Electro-Optical Stomatography and a novel parametric vocal tract model to facilitate real-time speech synthesis based on the measured data. The hardware was used to conduct command word recognition studies reaching state-of-the-art intra- and inter-individual performance. Furthermore, a study on using the hardware to control the vocal tract model in a direct articulation-to-speech synthesis loop was also completed. While the intelligibility of synthesized vowels was rated very high, the intelligibility of consonants and continuous speech was very poor. Promising ways to improve the system are discussed in the outlook.

    Contents:
    Front matter: Statement of authorship; Abstract; List of Figures; List of Tables; Acronyms
    1. Introduction: 1.1 The concept of a Silent-Speech Interface; 1.2 Structure of this work
    2. Fundamentals of phonetics: 2.1 Components of the human speech production system; 2.2 Vowel sounds; 2.3 Consonantal sounds; 2.4 Acoustic properties of speech sounds; 2.5 Coarticulation; 2.6 Phonotactics; 2.7 Summary and implications for the design of a Silent-Speech Interface (SSI)
    3. Articulatory data acquisition techniques in Silent-Speech Interfaces: 3.1 Introduction; 3.2 Scope of the literature review; 3.3 Video Recordings; 3.4 Ultrasonography; 3.5 Electromyography; 3.6 Permanent-Magnetic Articulography; 3.7 Electromagnetic Articulography; 3.8 Radio waves; 3.9 Palatography; 3.10 Conclusion and Discussion
    4. Electro-Optical Stomatography: 4.1 Contact sensors; 4.2 Optical distance sensors; 4.3 Lip sensor; 4.4 Sensor Unit; 4.5 Control Unit; 4.6 Software
    5. Articulation-to-Text: 5.1 Introduction; 5.2 Command word recognition pilot study; 5.3 Command word recognition small-scale study
    6. Articulation-to-Speech: 6.1 Introduction; 6.2 Articulatory synthesis; 6.3 The six point vocal tract model; 6.4 Objective evaluation of the vocal tract model; 6.5 Perceptual evaluation of the vocal tract model; 6.6 Direct synthesis using EOS to control the vocal tract model; 6.7 Pitch and voicing
    7. Summary and outlook: 7.1 Summary of the contributions; 7.2 Outlook
    Appendices: A. Overview of the International Phonetic Alphabet; B. Mathematical proofs and derivations (B.1 Combinatoric calculations illustrating the reduction of possible syllables using phonotactics; B.2 Signal Averaging; B.3 Effect of the contact sensor area on the conductance; B.4 Calculation of the forward current for the OP280V diode); C. Schematics and layouts (C.1 Schematics of the control unit; C.2 Layout of the control unit; C.3 Bill of materials of the control unit; C.4 Schematics of the sensor unit; C.5 Layout of the sensor unit; C.6 Bill of materials of the sensor unit); D. Sensor unit assembly; E. Firmware flow and data protocol; F. Palate file format; G. Supplemental material regarding the vocal tract model; H. Articulation-to-Speech: Optimal hyperparameters
    Bibliography

    Asymmetric discrimination of non-speech tonal analogues of vowels

    Published in final edited form as: J Exp Psychol Hum Percept Perform. 2019 February; 45(2): 285–300. doi:10.1037/xhp0000603.
    Directional asymmetries reveal a universal bias in vowel perception favoring extreme vocalic articulations, which lead to acoustic vowel signals with dynamic formant trajectories and well-defined spectral prominences due to the convergence of adjacent formants. The present experiments investigated whether this bias reflects speech-specific processes or general properties of spectral processing in the auditory system. Toward this end, we examined whether analogous asymmetries in perception arise with non-speech tonal analogues that approximate some of the dynamic and static spectral characteristics of naturally-produced /u/ vowels executed with more versus less extreme lip gestures. We found a qualitatively similar but weaker directional effect with two-component tones varying in both the dynamic changes and proximity of their spectral energies. In subsequent experiments, we pinned down the phenomenon using tones that varied in one or both of these two acoustic characteristics. We found comparable asymmetries with tones that differed exclusively in their spectral dynamics, and no asymmetries with tones that differed exclusively in their spectral proximity or both spectral features. We interpret these findings as evidence that dynamic spectral changes are a critical cue for eliciting asymmetries in non-speech tone perception, but that the potential contribution of general auditory processes to asymmetries in vowel perception is limited.