262 research outputs found
Recommended from our members
Evaluation and analysis of hybrid intelligent pattern recognition techniques for speaker identification
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University.The rapid momentum of the technology progress in the recent years has led to a tremendous rise in the use of biometric authentication systems. The objective of this research is to investigate the problem
of identifying a speaker from its voice regardless of the content (i.e.
text-independent), and to design efficient methods of combining face and voice in producing a robust authentication system.
A novel approach towards speaker identification is developed using
wavelet analysis, and multiple neural networks including Probabilistic
Neural Network (PNN), General Regressive Neural Network (GRNN)and Radial Basis Function-Neural Network (RBF NN) with the AND
voting scheme. This approach is tested on GRID and VidTIMIT cor-pora and comprehensive test results have been validated with state-
of-the-art approaches. The system was found to be competitive and it improved the recognition rate by 15% as compared to the classical Mel-frequency Cepstral Coe±cients (MFCC), and reduced the recognition time by 40% compared to Back Propagation Neural Network (BPNN), Gaussian Mixture Models (GMM) and Principal Component Analysis (PCA).
Another novel approach using vowel formant analysis is implemented using Linear Discriminant Analysis (LDA). Vowel formant based speaker identification is best suitable for real-time implementation and requires only a few bytes of information to be stored for each speaker, making it both storage and time efficient. Tested on GRID and Vid-TIMIT, the proposed scheme was found to be 85.05% accurate when Linear Predictive Coding (LPC) is used to extract the vowel formants, which is much higher than the accuracy of BPNN and GMM. Since the proposed scheme does not require any training time other than creating a small database of vowel formants, it is faster as well. Furthermore, an increasing number of speakers makes it di±cult for BPNN and GMM to sustain their accuracy, but the proposed score-based methodology stays almost linear.
Finally, a novel audio-visual fusion based identification system is implemented using GMM and MFCC for speaker identi¯cation and PCA for face recognition. The results of speaker identification and face recognition are fused at different levels, namely the feature, score and decision levels. Both the score-level and decision-level (with OR voting) fusions were shown to outperform the feature-level fusion in terms of accuracy and error resilience. The result is in line with the distinct nature of the two modalities which lose themselves when combined at the feature-level. The GRID and VidTIMIT test results validate that
the proposed scheme is one of the best candidates for the fusion of
face and voice due to its low computational time and high recognition accuracy
Visual Passwords Using Automatic Lip Reading
This paper presents a visual passwords system to increase security. The
system depends mainly on recognizing the speaker using the visual speech signal
alone. The proposed scheme works in two stages: setting the visual password
stage and the verification stage. At the setting stage the visual passwords
system request the user to utter a selected password, a video recording of the
user face is captured, and processed by a special words-based VSR system which
extracts a sequence of feature vectors. In the verification stage, the same
procedure is executed, the features will be sent to be compared with the stored
visual password. The proposed scheme has been evaluated using a video database
of 20 different speakers (10 females and 10 males), and 15 more males in
another video database with different experiment sets. The evaluation has
proved the system feasibility, with average error rate in the range of 7.63% to
20.51% at the worst tested scenario, and therefore, has potential to be a
practical approach with the support of other conventional authentication
methods such as the use of usernames and passwords
Audio-coupled video content understanding of unconstrained video sequences
Unconstrained video understanding is a difficult task. The main aim of this thesis is to
recognise the nature of objects, activities and environment in a given video clip using
both audio and video information. Traditionally, audio and video information has not
been applied together for solving such complex task, and for the first time we propose,
develop, implement and test a new framework of multi-modal (audio and video) data
analysis for context understanding and labelling of unconstrained videos.
The framework relies on feature selection techniques and introduces a novel algorithm
(PCFS) that is faster than the well-established SFFS algorithm. We use the framework for
studying the benefits of combining audio and video information in a number of different
problems. We begin by developing two independent content recognition modules. The
first one is based on image sequence analysis alone, and uses a range of colour, shape,
texture and statistical features from image regions with a trained classifier to recognise
the identity of objects, activities and environment present. The second module uses audio
information only, and recognises activities and environment. Both of these approaches
are preceded by detailed pre-processing to ensure that correct video segments containing
both audio and video content are present, and that the developed system can be made
robust to changes in camera movement, illumination, random object behaviour etc. For
both audio and video analysis, we use a hierarchical approach of multi-stage
classification such that difficult classification tasks can be decomposed into simpler and
smaller tasks.
When combining both modalities, we compare fusion techniques at different levels of
integration and propose a novel algorithm that combines advantages of both feature and
decision-level fusion. The analysis is evaluated on a large amount of test data comprising
unconstrained videos collected for this work. We finally, propose a decision correction
algorithm which shows that further steps towards combining multi-modal classification
information effectively with semantic knowledge generates the best possible results
Acoustic Approaches to Gender and Accent Identification
There has been considerable research on the problems of speaker and language recognition
from samples of speech. A less researched problem is that of accent recognition. Although this
is a similar problem to language identification, di�erent accents of a language exhibit more
fine-grained di�erences between classes than languages. This presents a tougher problem
for traditional classification techniques. In this thesis, we propose and evaluate a number of
techniques for gender and accent classification. These techniques are novel modifications and
extensions to state of the art algorithms, and they result in enhanced performance on gender
and accent recognition.
The first part of the thesis focuses on the problem of gender identification, and presents a
technique that gives improved performance in situations where training and test conditions are
mismatched.
The bulk of this thesis is concerned with the application of the i-Vector technique to accent
identification, which is the most successful approach to acoustic classification to have emerged
in recent years. We show that it is possible to achieve high accuracy accent identification without
reliance on transcriptions and without utilising phoneme recognition algorithms. The thesis
describes various stages in the development of i-Vector based accent classification that improve
the standard approaches usually applied for speaker or language identification, which are
insu�cient. We demonstrate that very good accent identification performance is possible with
acoustic methods by considering di�erent i-Vector projections, frontend parameters, i-Vector
configuration parameters, and an optimised fusion of the resulting i-Vector classifiers we can
obtain from the same data.
We claim to have achieved the best accent identification performance on the test corpus
for acoustic methods, with up to 90% identification rate. This performance is even better than
previously reported acoustic-phonotactic based systems on the same corpus, and is very close
to performance obtained via transcription based accent identification. Finally, we demonstrate
that the utilization of our techniques for speech recognition purposes leads to considerably
lower word error rates.
Keywords: Accent Identification, Gender Identification, Speaker Identification, Gaussian
Mixture Model, Support Vector Machine, i-Vector, Factor Analysis, Feature Extraction, British
English, Prosody, Speech Recognition
- …