Voicing classification of visual speech using convolutional neural networks
The application of neural network and convolutional neural network (CNN) architectures is explored for the tasks of voicing classification (classifying frames as being either non-speech, unvoiced, or voiced) and voice activity detection (VAD) of visual speech. Experiments are conducted for both speaker dependent and speaker independent scenarios. A Gaussian mixture model (GMM) baseline system is developed using standard image-based two-dimensional discrete cosine transform (2D-DCT) visual speech features, achieving speaker dependent accuracies of 79% and 94% for voicing classification and VAD respectively. Additionally, a single-layer neural network system trained using the same visual features achieves accuracies of 86% and 97%. A novel technique using convolutional neural networks for visual speech feature extraction and classification is presented. The voicing classification and VAD results using this system are further improved to 88% and 98% respectively. The speaker independent results show the neural network system to outperform both the GMM and CNN systems, achieving accuracies of 63% for voicing classification and 79% for voice activity detection.
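For concreteness, the minimal sketch below (our illustration, not the paper's implementation) shows how a small CNN could map mouth-region image frames to the three voicing classes described above; the 64x64 grayscale input size and the layer widths are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): a small CNN mapping mouth-region
# frames to the three voicing classes (non-speech, unvoiced, voiced).
# Input size and layer widths are assumptions for illustration only.
import torch
import torch.nn as nn

class VoicingCNN(nn.Module):
    def __init__(self, num_classes: int = 3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),  # grayscale mouth ROI
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 16 * 16, num_classes)  # assumes 64x64 input

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)                  # (B, 32, 16, 16) for 64x64 input
        return self.classifier(x.flatten(1))  # logits over {non-speech, unvoiced, voiced}

# Example: classify a batch of 8 single-channel 64x64 mouth crops.
logits = VoicingCNN()(torch.randn(8, 1, 64, 64))
```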
End-to-end Audiovisual Speech Activity Detection with Bimodal Recurrent Neural Models
Speech activity detection (SAD) plays an important role in current speech
processing systems, including automatic speech recognition (ASR). SAD is
particularly difficult in environments with acoustic noise. A practical
solution is to incorporate visual information, increasing the robustness of the
SAD approach. An audiovisual system has the advantage of being robust to
different speech modes (e.g., whisper speech) or background noise. Recent
advances in audiovisual speech processing using deep learning have opened
opportunities to capture in a principled way the temporal relationships between
acoustic and visual features. This study explores this idea, proposing a
\emph{bimodal recurrent neural network} (BRNN) framework for SAD. The approach
models the temporal dynamics of the sequential audiovisual data, improving the
accuracy and robustness of the proposed SAD system. Instead of estimating
hand-crafted features, the study investigates an end-to-end training approach,
where acoustic and visual features are directly learned from the raw data
during training. The experimental evaluation considers a large audiovisual
corpus with over 60.8 hours of recordings, collected from 105 speakers. The
results demonstrate that the proposed framework leads to absolute improvements
of up to 1.2% under practical scenarios over an audio-only VAD baseline
implemented with a deep neural network (DNN). The proposed approach achieves a
92.7% F1-score when evaluated using the sensors of a portable tablet in a noisy
acoustic environment, only 1.0% lower than the performance obtained under ideal
conditions (e.g., clean speech obtained with a high-definition camera and a
close-talking microphone).
Comment: Submitted to Speech Communication
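One plausible way to wire such a bimodal recurrent model is sketched below; this is our own illustration under assumed feature dimensions and a simple concatenation fusion, not the authors' BRNN architecture.

```python
# Minimal sketch (assumptions, not the authors' BRNN): per-modality recurrent
# layers over acoustic and visual feature sequences, fused to produce a
# per-frame speech/non-speech posterior. Feature sizes are illustrative.
import torch
import torch.nn as nn

class BimodalRecurrentSAD(nn.Module):
    def __init__(self, audio_dim: int = 40, visual_dim: int = 128, hidden: int = 64):
        super().__init__()
        self.audio_rnn = nn.LSTM(audio_dim, hidden, batch_first=True)
        self.visual_rnn = nn.LSTM(visual_dim, hidden, batch_first=True)
        self.head = nn.Linear(2 * hidden, 1)  # fused frame-level logit

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        a, _ = self.audio_rnn(audio)    # (B, T, hidden)
        v, _ = self.visual_rnn(visual)  # (B, T, hidden), assumes frame-aligned streams
        fused = torch.cat([a, v], dim=-1)
        return torch.sigmoid(self.head(fused)).squeeze(-1)  # (B, T) speech probability

# Example: 4 utterances, 100 time-aligned frames each.
probs = BimodalRecurrentSAD()(torch.randn(4, 100, 40), torch.randn(4, 100, 128))
```

In an end-to-end setting as described in the abstract, the hand-crafted feature sequences assumed here would be replaced by representations learned directly from the raw waveform and video frames.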
Self-Supervised Vision-Based Detection of the Active Speaker as Support for Socially-Aware Language Acquisition
This paper presents a self-supervised method for visual detection of the
active speaker in a multi-person spoken interaction scenario. Active speaker
detection is a fundamental prerequisite for any artificial cognitive system
attempting to acquire language in social settings. The proposed method is
intended to complement the acoustic detection of the active speaker, thus
improving the system robustness in noisy conditions. The method can detect an
arbitrary number of possibly overlapping active speakers based exclusively on
visual information about their faces. Furthermore, the method does not rely on
external annotations, and is thus consistent with cognitive development. Instead, the
method uses information from the auditory modality to support learning in the
visual domain. This paper reports an extensive evaluation of the proposed
method using a large multi-person face-to-face interaction dataset. The results
show good performance in a speaker dependent setting. However, in a speaker
independent setting the proposed method yields significantly lower
performance. We believe that the proposed method represents an essential
component of any artificial cognitive system or robotic platform engaging in
social interactions.
Comment: 10 pages, IEEE Transactions on Cognitive and Developmental Systems
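The core self-supervision idea, using the auditory modality to generate training targets for a purely visual classifier, could be expressed along the lines of the sketch below; the function, model, and data shapes are hypothetical placeholders rather than the paper's method.

```python
# Minimal sketch of the self-supervision idea (our illustration, not the
# paper's method): frame-level labels derived from an audio-based voice
# activity detector supervise a purely visual active-speaker classifier.
# `visual_model` and the tensor shapes are hypothetical placeholders.
import torch
import torch.nn as nn

def train_step(visual_model: nn.Module,
               optimizer: torch.optim.Optimizer,
               face_frames: torch.Tensor,       # (B, C, H, W) face crops
               audio_vad_labels: torch.Tensor   # (B,) 0/1 targets from an audio VAD
               ) -> float:
    """One self-supervised update: audio supplies the targets, vision does the learning."""
    logits = visual_model(face_frames).squeeze(-1)
    loss = nn.functional.binary_cross_entropy_with_logits(logits, audio_vad_labels.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The point of the sketch is only the direction of the supervision signal: no human annotations are needed, since the targets come from the system's own auditory processing.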
Auditory feedback control mechanisms do not contribute to cortical hyperactivity within the voice production network in adductor spasmodic dysphonia
Adductor spasmodic dysphonia (ADSD), the most common form of spasmodic dysphonia, is a debilitating voice disorder characterized by hyperactivity and muscle spasms in the vocal folds during speech. Prior neuroimaging studies have noted excessive brain activity during speech in ADSD participants compared to controls. Speech involves an auditory feedback control mechanism that generates motor commands aimed at eliminating disparities between desired and actual auditory signals. Thus, excessive neural activity in ADSD during speech may reflect, at least in part, increased engagement of the auditory feedback control mechanism as it attempts to correct vocal production errors detected through audition. To test this possibility, functional magnetic resonance imaging was used to identify differences between ADSD participants and age-matched controls in (i) brain activity when producing speech under different auditory feedback conditions, and (ii) resting state functional connectivity within the cortical network responsible for vocalization. The ADSD group had significantly higher activity than the control group during speech (compared to a silent baseline task) in three left-hemisphere cortical regions: ventral Rolandic (sensorimotor) cortex, anterior planum temporale, and posterior superior temporal gyrus/planum temporale. This was true for speech while auditory feedback was masked with noise as well as for speech with normal auditory feedback, indicating that the excess activity was not the result of auditory feedback control mechanisms attempting to correct for perceived voicing errors in ADSD. Furthermore, the ADSD group had significantly higher resting state functional connectivity between sensorimotor and auditory cortical regions within the left hemisphere as well as between the left and right hemispheres, consistent with the view that excessive motor activity frequently co-occurs with increased auditory cortical activity in individuals with ADSD.
Audiovisual integration of emotional signals from others' social interactions
Audiovisual perception of emotions has typically been examined using displays of a solitary character (e.g., the face-voice and/or body-sound of one actor). However, in real life humans often face more complex multisensory social situations, involving more than one person. Here we ask if the audiovisual facilitation in emotion recognition previously found in simpler social situations extends to more complex and ecological situations. Stimuli consisting of the biological motion and voice of two interacting agents were used in two experiments. In Experiment 1, participants were presented with visual, auditory, auditory filtered/noisy, and audiovisual congruent and incongruent clips. We asked participants to judge whether the two agents were interacting happily or angrily. In Experiment 2, another group of participants repeated the same task, as in Experiment 1, while trying to ignore either the visual or the auditory information. The findings from both experiments indicate that when the reliability of the auditory cue was decreased, participants weighted the visual cue more heavily in their emotional judgments. This in turn translated into increased emotion recognition accuracy for the multisensory condition. Our findings thus point to a common mechanism of multisensory integration of emotional signals irrespective of social stimulus complexity.
A neural marker for social bias toward in-group accents
Accents provide information about the speaker's geographical, socio-economic, and ethnic background. Research in applied psychology and sociolinguistics suggests that we generally prefer our own accent to other varieties of our native language and attribute more positive traits to it. Despite the widespread influence of accents on social interactions and on educational and work settings, the neural underpinnings of this social bias toward our own accent, and what may drive this bias, remain unexplored. We measured brain activity while participants from two different geographical backgrounds listened passively to three English accent types embedded in an adaptation design. Cerebral activity in several regions, including the bilateral amygdalae, revealed a significant interaction between the participants' own accent and the accent they listened to: while repetition of the participants' own accent elicited an enhanced neural response, repetition of the other group's accent resulted in reduced responses classically associated with adaptation. Our findings suggest that increased social relevance of, or greater emotional sensitivity to, in-group accents may underlie the own-accent bias. Our results provide a neural marker for the bias associated with accents and show, for the first time, that the neural response to speech is partly shaped by the geographical background of the listener.