
    Generating intelligible audio speech from visual speech

    This work is concerned with generating intelligible audio speech from a video of a person talking. Regression and classification methods are proposed first to estimate static spectral envelope features from active appearance model (AAM) visual features. Two further methods are then developed to incorporate temporal information into the prediction: a feature-level method using multiple frames and a model-level method based on recurrent neural networks. Speech excitation information is not available from the visual signal, so methods to artificially generate aperiodicity and fundamental frequency are developed. These are combined within the STRAIGHT vocoder to produce a speech signal. The various systems are optimised through objective tests before subjective intelligibility tests, in which human listeners achieved a word accuracy of 85% on the GRID audio-visual speech database. This compares favourably with a previous regression-based baseline system, which achieved a word accuracy of 33%
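
    The pipeline above maps visual features to spectral-envelope frames and then resynthesises a waveform with a vocoder. Below is a minimal, hypothetical sketch (in Python/PyTorch) of the model-level temporal approach: a recurrent network regressing spectral-envelope frames from pre-extracted AAM feature sequences. The array shapes, layer sizes, and training details are illustrative assumptions, not the authors' configuration.

    # Hypothetical sketch: recurrent regression from AAM visual features to
    # spectral-envelope frames; not the paper's exact architecture.
    import torch
    import torch.nn as nn

    class VisualToSpectralRNN(nn.Module):
        def __init__(self, aam_dim=40, spec_dim=60, hidden=128):
            super().__init__()
            # The recurrent layer carries temporal context across video frames.
            self.rnn = nn.GRU(aam_dim, hidden, batch_first=True, bidirectional=True)
            # A linear readout predicts one spectral-envelope frame per video frame.
            self.out = nn.Linear(2 * hidden, spec_dim)

        def forward(self, aam_seq):              # (batch, frames, aam_dim)
            h, _ = self.rnn(aam_seq)
            return self.out(h)                   # (batch, frames, spec_dim)

    # Outline of training: minimise frame-wise error against reference envelopes;
    # predictions would then be combined with artificial F0 and aperiodicity and
    # passed to a vocoder such as STRAIGHT for waveform synthesis.
    model = VisualToSpectralRNN()
    optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
    aam = torch.randn(8, 75, 40)                 # dummy batch: 8 clips, 75 frames
    target = torch.randn(8, 75, 60)              # matching spectral-envelope frames
    loss = nn.MSELoss()(model(aam), target)
    loss.backward()
    optimiser.step()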

    Final Report to NSF of the Standards for Facial Animation Workshop

    The human face is an important and complex communication channel. It is a very familiar and sensitive object of human perception. The facial animation field has grown greatly in the past few years as fast computer graphics workstations have made the modeling and real-time animation of hundreds of thousands of polygons affordable and almost commonplace. Many applications have been developed, such as teleconferencing, surgery, information assistance systems, games, and entertainment. To solve these different problems, different approaches to both animation control and modeling have been developed

    Geometrical-based lip-reading using template probabilistic multi-dimension dynamic time warping

    By identifying lip movements and characterizing their associations with speech sounds, the performance of speech recognition systems can be improved, particularly when operating in noisy environments. In this paper, we present a geometrical-based automatic lip-reading system that extracts the lip region from images using conventional techniques, but extracts the contour itself using a novel combination of border-following and convex-hull approaches. Classification is carried out using an enhanced dynamic time warping technique that can operate in multiple dimensions, together with a template probability technique that compensates for differences in the way words are uttered in the training set. The performance of the new system has been assessed on recognition of the English digits 0 to 9 from the CUAVE database. The experimental results obtained from the new approach compared favorably with those of existing lip-reading approaches, achieving a word recognition accuracy of up to 71% with the visual information obtained from estimates of lip height, width and their ratio
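
    A minimal sketch of the multi-dimensional dynamic time warping step is given below, operating on per-frame lip-geometry vectors (height, width, height/width ratio). The distance measure, template dictionary, and digit labels are illustrative assumptions; the template-probability weighting described in the paper is not shown.

    # Hypothetical sketch: DTW over multi-dimensional lip-geometry sequences
    # (one row per video frame: [lip height, lip width, height/width ratio]).
    import numpy as np

    def multidim_dtw(seq_a, seq_b):
        """Return the DTW alignment cost between two (frames x dims) sequences."""
        n, m = len(seq_a), len(seq_b)
        cost = np.full((n + 1, m + 1), np.inf)
        cost[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                # Euclidean distance across all geometric dimensions at once.
                d = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])
                cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                     cost[i, j - 1],      # deletion
                                     cost[i - 1, j - 1])  # match
        return cost[n, m]

    def classify(utterance, templates):
        """Pick the word template with the lowest DTW cost to the utterance."""
        return min(templates, key=lambda w: multidim_dtw(utterance, templates[w]))

    rng = np.random.default_rng(0)
    templates = {str(d): rng.random((20, 3)) for d in range(10)}  # digits 0-9
    print(classify(rng.random((24, 3)), templates))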

    Articulatory features for robust visual speech recognition


    Feeling and Speaking: The Role of Sensory Feedback in Speech

    Sensory feedback allows talkers to accurately control speech production, and auditory information is the predominant form of speech feedback. When this sensory stream is degraded, talkers have been shown to rely more heavily on somatosensory information. Furthermore, perceptual speech abilities are greatest when both auditory and visual feedback are available. In this study, we experimentally degraded auditory feedback using a cochlear implant simulation and somatosensory feedback using Orajel. Additionally, we placed a mirror in front of the talkers to introduce visual feedback. Participants were prompted to speak under baseline, feedback-degraded, and visual conditions; audiovisual speech recordings were taken for each treatment. These recordings were then used in a playback study to determine the intelligibility of speech. Acoustically, baseline speech was selected as “easier to understand” significantly more often than speech from either the feedback-degraded or visual condition. Visually, speech from the visual condition was selected as “easier to understand” significantly less often than speech from the feedback-degraded condition. Listener preference of baseline speech was significantly greater when both auditory and somatosensory feedback were degraded than when only auditory feedback was degraded (Casserly, in prep., 2015). These results suggest that feedback was successfully degraded and that the addition of visual feedback decreased speech intelligibility
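
    Cochlear implant simulations of this kind are typically implemented as noise (channel) vocoders, which discard spectral fine structure and retain only coarse band envelopes. The sketch below illustrates that general technique; the channel count, band edges, and envelope cutoff are illustrative assumptions and are not taken from the study.

    # Hypothetical noise-vocoder sketch of a cochlear implant simulation:
    # split speech into a few bands, keep only each band's envelope, and use
    # the envelopes to modulate band-limited noise.
    import numpy as np
    from scipy.signal import butter, sosfiltfilt

    def noise_vocode(signal, fs, n_channels=8, lo=100.0, hi=7000.0):
        edges = np.geomspace(lo, hi, n_channels + 1)          # log-spaced band edges
        rng = np.random.default_rng(0)
        out = np.zeros(len(signal))
        for k in range(n_channels):
            band = butter(4, [edges[k], edges[k + 1]], btype="band", fs=fs, output="sos")
            filtered = sosfiltfilt(band, signal)
            # Envelope extraction: rectify, then low-pass at 50 Hz.
            env_lp = butter(2, 50.0, btype="low", fs=fs, output="sos")
            envelope = sosfiltfilt(env_lp, np.abs(filtered))
            # Replace fine structure with noise carrying the same envelope.
            carrier = sosfiltfilt(band, rng.standard_normal(len(signal)))
            out += envelope * carrier
        return out / (np.max(np.abs(out)) + 1e-9)

    fs = 16000
    t = np.arange(fs) / fs                                    # one second of audio
    speechlike = np.sin(2 * np.pi * 150 * t) * (1 + 0.5 * np.sin(2 * np.pi * 3 * t))
    degraded = noise_vocode(speechlike, fs)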

    Supporting the learning of deaf students in higher education: a case study at Sheffield Hallam University

    This article is an examination of the issues surrounding support for the learning of deaf students in higher education (HE). There is an increasing number of deaf students attending HE institutions, and as such the provision of support mechanisms for these students is not only necessary but essential. Deaf students are similar to their hearing peers in that they will approach their learning and require differing levels of support dependent upon the individual. They will, however, require a different kind of support, which can be technical or human-resource based. This article examines the issues that surround supporting deaf students in HE with use of a case study of provision at Sheffield Hallam University (SHU) during the academic year 1994-95. It is evident that, by considering the needs of deaf students and making changes to our teaching practices, all students can benefit

    The effects of visual processing on human frequency following brain response

    In this thesis, the effects of attending to visual stimuli on early auditory processing were studied with the aid of electroencephalography (EEG). The focus was on the frequency following response (FFR) of the auditory brainstem response (ABR). EEG responses were measured while subjects attended to visual stimuli combined with an auditory /da/ syllable. There were three conditions based on the visual stimulus type: (1) vowels, where subjects attended to a video of a female speaker mouthing Finnish vowels; (2) expanding rings, where the video consisted of the same female speaker's still face with an expanding ring/oval imposed over the mouth region, creating movement temporally and spatially similar to the vowels condition but without the linguistic content; and (3) still, where a still image of the same female speaker was presented. The tasks in the vowels and expanding rings conditions were to react to two consecutive identical vowels/rings. Subjects completed two sets of the three conditions in randomized order to control intra-subject ABR replicability. The behavioral results showed that subjects identified the vowel targets more accurately, but more slowly, than the expanding-ring targets. The EEG results showed a trend towards lower FFR amplitudes during the vowels (lipreading) condition, and the effect was statistically significant for the first peak of the FFR. These results could suggest suppressive effects of lipreading on the FFR
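
    The FFR amplitude comparison described above is typically computed by averaging epochs time-locked to the /da/ onset and reading the spectral amplitude near the stimulus fundamental. A minimal sketch of that analysis follows; the sampling rate, fundamental frequency, window length, and single-channel epoch array are illustrative assumptions rather than the recording parameters used in the thesis.

    # Hypothetical sketch of FFR amplitude estimation from epoched EEG:
    # average the epochs, then read the spectral peak near the stimulus F0.
    import numpy as np

    def ffr_amplitude(epochs, fs, f0=100.0, bandwidth=5.0):
        """epochs: (n_epochs, n_samples) array time-locked to the /da/ onset."""
        avg = epochs.mean(axis=0)                        # averaging suppresses background EEG
        spectrum = np.abs(np.fft.rfft(avg * np.hanning(len(avg))))
        freqs = np.fft.rfftfreq(len(avg), d=1.0 / fs)
        band = (freqs >= f0 - bandwidth) & (freqs <= f0 + bandwidth)
        return spectrum[band].max()                      # peak amplitude near F0

    fs = 2000                                            # Hz, illustrative
    rng = np.random.default_rng(1)
    t = np.arange(int(0.2 * fs)) / fs                    # 200 ms analysis window
    epochs = 2e-6 * np.sin(2 * np.pi * 100 * t) + 1e-5 * rng.standard_normal((3000, len(t)))
    print(ffr_amplitude(epochs, fs))                     # compare across conditions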

    Neural pathways for visual speech perception

    This paper examines two questions: what levels of speech can be perceived visually, and how is visual speech represented by the brain? Review of the literature leads to the conclusions that every level of psycholinguistic speech structure (i.e., phonetic features, phonemes, syllables, words, and prosody) can be perceived visually, although individuals differ in their abilities to do so; and that there are visual modality-specific representations of speech qua speech in higher-level vision brain areas. That is, the visual system represents the modal patterns of visual speech. The suggestion that the auditory speech pathway receives and represents visual speech is examined in light of neuroimaging evidence on the auditory speech pathways. We outline the generally agreed-upon organization of the visual ventral and dorsal pathways and examine several types of visual processing that might be related to speech through those pathways, specifically, face and body, orthography, and sign language processing. In this context, we examine the visual speech processing literature, which reveals widespread, diverse patterns of activity in posterior temporal cortices in response to visual speech stimuli. We outline a model of the visual and auditory speech pathways and make several suggestions: (1) The visual perception of speech relies on visual pathway representations of speech qua speech. (2) A proposed site of these representations, the temporal visual speech area (TVSA), has been demonstrated in posterior temporal cortex, ventral and posterior to the multisensory posterior superior temporal sulcus (pSTS). (3) Given that visual speech has dynamic and configural features, its representations in feedforward visual pathways are expected to integrate these features, possibly in TVSA