4,481 research outputs found

Audio-Visual Speaker Identification using the CUAVE Database

Author: Dean David
Lucey Patrick
Sridharan Subramanian
Publication venue: AVSP '05
Publication date: 01/01/2005

The freely available nature of the CUAVE database allows it to provide a valuable platform to form benchmarks and compare research. This paper shows that the CUAVE database can successfully be used to test speaker identification systems, with performance comparable to existing systems implemented on other databases. Additionally, this research shows that the optimal configuration for decision fusion of an audio-visual speaker identification system relies heavily on the video modality in all but clean speech conditions.
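
The decision fusion mentioned in this abstract can be illustrated with a minimal sketch. It assumes a common weighted-sum form of late fusion over per-speaker log-likelihoods; the abstract does not state the exact fusion rule, so the function names (`fuse_scores`, `identify`), the `audio_weight` parameter, and the example scores are all hypothetical and are not taken from the paper.

```python
def fuse_scores(audio_loglik, video_loglik, audio_weight):
    """Weighted-sum late fusion of per-speaker log-likelihoods.

    audio_weight is an assumed fusion parameter in [0, 1];
    the video stream receives the remaining weight (1 - audio_weight).
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    if len(audio_loglik) != len(video_loglik):
        raise ValueError("both modalities must score the same speakers")
    return [
        audio_weight * a + (1.0 - audio_weight) * v
        for a, v in zip(audio_loglik, video_loglik)
    ]


def identify(audio_loglik, video_loglik, audio_weight):
    """Return the index of the speaker with the highest fused score."""
    fused = fuse_scores(audio_loglik, video_loglik, audio_weight)
    return max(range(len(fused)), key=fused.__getitem__)


# Illustrative (made-up) scores for three enrolled speakers.
audio = [-10.0, -12.0, -11.0]  # audio stream slightly prefers speaker 0
video = [-9.0, -4.0, -8.0]     # video stream clearly prefers speaker 1

audio_heavy = identify(audio, video, audio_weight=0.9)
video_heavy = identify(audio, video, audio_weight=0.2)
```

With these made-up scores, shifting weight from the audio stream to the video stream changes which speaker is selected, which is the kind of trade-off a fusion weight controls.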

CiteSeerX

Queensland University of Technology ePrints Archive

Video augmentation for improving audio speech recognition under noise

Author: Cavallaro A
Gong S
Pachoud S
Publication venue: British Machine Vision Conference (BMVC)
Publication date: 23/02/2015

Queen Mary Research Online

Audio-Visual Automatic Speech Recognition Using PZM, MFCC and Statistical Analysis

Author: Debnath Saswati
Roy Pinki
Publication venue: 'Universidad Internacional de La Rioja'
Publication date: 10/05/2022

Audio-Visual Automatic Speech Recognition (AV-ASR) has become the most promising research area when the audio signal gets corrupted by noise. The main objective of this paper is to select the important and discriminative audio and visual speech features to recognize audio-visual speech. This paper proposes Pseudo Zernike Moment (PZM) and a feature selection method for audio-visual speech recognition. Visual information is captured from the lip contour, and the moments are computed for lip reading. We have extracted 19th-order Mel Frequency Cepstral Coefficients (MFCC) as speech features from audio. Since not all 19 speech features are equally important, feature selection algorithms are used to select the most efficient features. Various statistical algorithms, such as Analysis of Variance (ANOVA), Kruskal-Wallis, and Friedman tests, are employed to analyze the significance of features along with the Incremental Feature Selection (IFS) technique. Statistical analysis is used to analyze the statistical significance of the speech features, and after that IFS is used to select the speech feature subset. Furthermore, multiclass Support Vector Machine (SVM), Artificial Neural Network (ANN) and Naive Bayes (NB) machine learning techniques are used to recognize the speech for both the audio and visual modalities. Based on the recognition rate, a combined decision is taken from the two individual recognition systems. This paper compares the results achieved by the proposed model and the existing model for both audio and visual speech recognition. Zernike Moment (ZM) is compared with PZM, showing that our proposed model using PZM extracts better discriminative features for visual speech recognition. This study also proves that audio feature selection using statistical analysis outperforms methods without any feature selection technique.

Re-UNIR

Combining Multiple Views for Visual Speech Recognition

Author: Ekenel Hazım Kemal
Ghazi Mostafa Mehdipour
Thiran Jean-Philippe
Zimmermann Marina
Publication date: 07/07/2017

Visual speech recognition is a challenging research problem with a particular practical application of aiding audio speech recognition in noisy scenarios. Multiple camera setups can be beneficial for the visual speech recognition systems in terms of improved performance and robustness. In this paper, we explore this aspect and provide a comprehensive study on combining multiple views for visual speech recognition. The thorough analysis covers fusion of all possible view angle combinations both at feature level and decision level. The employed visual speech recognition system in this study extracts features through a PCA-based convolutional neural network, followed by an LSTM network. Finally, these features are processed in a tandem system, being fed into a GMM-HMM scheme. The decision fusion acts after this point by combining the Viterbi path log-likelihoods. The results show that the complementary information contained in recordings from different view angles improves the results significantly. For example, the sentence correctness on the test set is increased from 76% for the highest performing single view (30°) to up to 83% when combining this view with the frontal and 60° view angles.

arXiv.org e-Print Archive

Infoscience - École polytechnique fédérale de Lausanne

Crossref

A novel lip geometry approach for audio-visual speech recognition

Author: Zamri Ibrahim
Publication date: 01/01/2014

By identifying lip movements and characterizing their associations with speech sounds, the performance of speech recognition systems can be improved, particularly when operating in noisy environments. Various methods have been studied by research groups around the world to incorporate lip movements into speech recognition in recent years; however, exactly how best to incorporate the additional visual information is still not known. This study aims to extend the knowledge of relationships between visual and speech information, specifically using lip geometry information due to its robustness to head rotation and the smaller number of features required to represent movement. A new method has been developed to extract lip geometry information, to perform classification and to integrate visual and speech modalities. This thesis makes several contributions. First, this work presents a new method to extract lip geometry features using the combination of a skin colour filter, a border following algorithm and a convex hull approach. The proposed method was found to improve lip shape extraction performance compared to existing approaches. Lip geometry features including height, width, ratio, area, perimeter and various combinations of these features were evaluated to determine which performs best when representing speech in the visual domain. Second, a novel template matching technique able to adapt to dynamic differences in the way words are uttered by speakers has been developed, which determines the best fit of an unseen feature signal to those stored in a database template. Third, following an evaluation of integration strategies, a novel method has been developed based on an alternative decision fusion strategy, in which the outcome from the visual and speech modalities is chosen by measuring the quality of audio based on kurtosis and skewness analysis and driven by white noise confusion.
Finally, the performance of the new methods introduced in this work is evaluated using the CUAVE and LUNA-V data corpora under a range of different signal-to-noise ratio conditions using the NOISEX-92 dataset.

Loughborough University Institutional Repository

UMP Institutional Repository

Adaptive Decision Fusion for Audio-Visual Speech Recognition

Author: Cheol Hoon Park
Jong-Seok Lee
Publication venue: 'IntechOpen'
Publication date: 01/11/2008

IntechOpen

Crossref

Adaptive Multimodal Fusion by Uncertainty Compensation With Application to Audiovisual Speech Recognition

Author: Athanassios Katsamanis
George Papandreou
Petros Maragos
Vassilis Pitsikalis
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'

Crossref

Temporal Attention-Gated Model for Robust Sequence Classification

Author: Baltrušaitis Tadas
Morency Louis-Philippe
Pei Wenjie
Tax David M. J.
Publication date: 15/04/2017

Typical techniques for sequence classification are designed for well-segmented sequences which have been edited to remove noisy or irrelevant parts. Therefore, such methods cannot be easily applied on noisy sequences expected in real-world applications. In this paper, we present the Temporal Attention-Gated Model (TAGM) which integrates ideas from attention models and gated recurrent networks to better deal with noisy or unsegmented sequences. Specifically, we extend the concept of attention model to measure the relevance of each observation (time step) of a sequence. We then use a novel gated recurrent network to learn the hidden representation for the final prediction. An important advantage of our approach is interpretability since the temporal attention weights provide a meaningful value for the salience of each time step in the sequence. We demonstrate the merits of our TAGM approach, both for prediction accuracy and interpretability, on three different tasks: spoken digit recognition, text-based sentiment analysis and visual event recognition. Comment: Accepted by CVPR 201

arXiv.org e-Print Archive

Crossref

Unsupervised Stream-Weights Computation in Classification and Recognition Tasks

Author: Daoudi Khalid
Potamianos Alexandros
Sanchez-Soto E.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2009

In this paper, we provide theoretical results on the problem of optimal stream weight selection for the multi-stream classification problem. It is shown that, in the presence of estimation or modeling errors, using stream weights can decrease the total classification error. Stream weight estimates are computed for various conditions. Then we turn our attention to the problem of unsupervised stream weights computation. Based on the theoretical results, we propose to use models and “anti-models” (class-specific background models) to estimate stream weights. A non-linear function of the ratio of the inter- to intra-class distance is used for stream weight estimation. The proposed unsupervised stream weight estimation algorithm is evaluated on both artificial data and on the problem of audio-visual speech classification. Finally, the proposed algorithm is extended to the problem of audio-visual speech recognition. It is shown that the proposed algorithms achieve results comparable to the supervised minimum-error training approach under most testing conditions.

Scientific Publications of the University of Toulouse II Le Mirail

INRIA a CCSD electronic archive server