885 research outputs found

Reconstructing intelligible audio speech from visual speech features

Author: Le Cornu Thomas
Milner Ben
Publication date: 01/01/2015

This work describes an investigation into the feasibility of producing intelligible audio speech from only visual speech features. The proposed method aims to estimate a spectral envelope from visual features, which is then combined with an artificial excitation signal and used within a model of speech production to reconstruct an audio signal. Different combinations of audio and visual features are considered, along with both a statistical method of estimation and a deep neural network. The intelligibility of the reconstructed audio speech is measured by human listeners, and then compared to the intelligibility of the video signal only and when combined with the reconstructed audio.

University of East Anglia digital repository

Deep learning for audio-visual speaker diarization

Author: Βαρθολομαίος Αργύριος Σ.
Publication date: 01/01/2017

University of Thessaly Institutional Repository

Voice activity detection based on facial movement

Author: Bart Joosten
Emiel Krahmer
Eric Postma
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2015

Crossref

Springer - Publisher Connector

Tilburg University Repository

Visual Voice Activity Detection in the Wild

Author: Iosifidis Alexandros
Nikolaidis Nikolaos
Patrona Foteini
Pitas Ioannis
Tefas Anastasios
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 26/02/2016

Explore Bristol Research

Visual voice activity detection based on spatiotemporal information and bag of words

Author: Iosifidis Alexandros
Nikolaidis Nikos
Patrona Foteini
Pitas Ioannis
Tefas Anastasios
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2016

Crossref

Explore Bristol Research

Audio-visual speaker diarization in broadcast news

Author: Βόσσος Χαράλαμπος Π.
Publication date: 01/01/2020

University of Thessaly Institutional Repository

A motion-based approach for audio-visual automatic speech recognition

Author: Nasir Ahmad (821439)
Publication date: 12/08/2019

The research work presented in this thesis introduces novel approaches for both visual region of interest extraction and visual feature extraction for use in audio-visual automatic speech recognition. In particular, the speaker's movement that occurs during speech is used to isolate the mouth region in video sequences, and motion-based features obtained from this region are used to provide new visual features for audio-visual automatic speech recognition. The mouth region extraction approach proposed in this work is shown to give superior performance compared with existing colour-based lip segmentation methods. The new features are obtained from three separate representations of motion in the region of interest, namely the difference in luminance between successive images, block-matching-based motion vectors and optical flow. The new visual features are found to improve visual-only and audio-visual speech recognition performance when compared with the commonly used appearance-feature-based methods. In addition, a novel approach is proposed for visual feature extraction from either the discrete cosine transform or discrete wavelet transform representations of the mouth region of the speaker. In this work, the image transform is explored from a new viewpoint of data discrimination, in contrast to the more conventional data preservation viewpoint. The main findings of this work are that audio-visual automatic speech recognition systems using the new features extracted from the frequency bands selected according to their discriminatory abilities generally outperform those using features designed for data preservation. To establish the noise robustness of the new features proposed in this work, their performance has been studied in the presence of a range of different types of noise and at various signal-to-noise ratios.
In these experiments, the audio-visual automatic speech recognition systems based on the new approaches were found to give superior performance both to audio-visual systems using appearance-based features and to audio-only speech recognition systems.

Loughborough University Institutional Repository

Automatic Visual Speech Recognition

Author: Alin Chiţu
Léon J.M. Rothkrantz
Publication venue: 'IntechOpen'
Publication date: 03/03/2012
Field of study

Intelligent Systems; Electrical Engineering, Mathematics and Computer Science

IntechOpen

Crossref

TU Delft Repository

A novel lip geometry approach for audio-visual speech recognition

Author: Zamri Ibrahim (7201733)
Publication date: 01/01/2014

By identifying lip movements and characterizing their associations with speech sounds, the performance of speech recognition systems can be improved, particularly when operating in noisy environments. Various methods have been studied by research groups around the world to incorporate lip movements into speech recognition in recent years; however, exactly how best to incorporate the additional visual information is still not known. This study aims to extend the knowledge of relationships between visual and speech information, specifically using lip geometry information due to its robustness to head rotation and the smaller number of features required to represent movement. A new method has been developed to extract lip geometry information, to perform classification and to integrate visual and speech modalities. This thesis makes several contributions. First, this work presents a new method to extract lip geometry features using the combination of a skin colour filter, a border following algorithm and a convex hull approach. The proposed method was found to improve lip shape extraction performance compared to existing approaches. Lip geometry features including height, width, ratio, area, perimeter and various combinations of these features were evaluated to determine which performs best when representing speech in the visual domain. Second, a novel template matching technique able to adapt to dynamic differences in the way words are uttered by speakers has been developed, which determines the best fit of an unseen feature signal to those stored in a database template. Third, following an evaluation of integration strategies, a novel method has been developed based on an alternative decision fusion strategy, in which the outcome from the visual and speech modality is chosen by measuring the quality of audio based on kurtosis and skewness analysis and driven by white noise confusion.
Finally, the performance of the new methods introduced in this work is evaluated using the CUAVE and LUNA-V data corpora under a range of different signal-to-noise ratio conditions using the NOISEX-92 dataset.

Loughborough University Institutional Repository

UMP Institutional Repository