11 research outputs found
A PCA based manifold representation for visual speech recognition
In this paper, we discuss a new Principal Component Analysis (PCA)-based manifold representation for visual speech recognition. The real-time input video data is compressed using PCA, and the low-dimensional points calculated for each frame define the manifold. Since the number of frames in a video sequence depends on the complexity of the word, the manifolds must be re-sampled to a fixed, pre-defined number of key-points before they can be used for visual speech classification. These key-points are then used as input to a Hidden Markov Model (HMM) classification scheme. We applied the developed visual speech recognition system to a database containing a group of English words, and the experimental data indicates that the proposed approach produces accurate classification results.
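As a sketch of the pipeline described above, the following NumPy code projects a (synthetic) frame sequence onto its principal components and re-samples the resulting manifold to a fixed number of key-points. The function names, component count and key-point count are illustrative assumptions, not the paper's actual parameters.

```python
import numpy as np

def pca_manifold(frames, n_components=3):
    """Project video frames (n_frames x n_pixels) onto their top
    principal components; the per-frame projections form the manifold."""
    X = frames - frames.mean(axis=0)           # centre the data
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:n_components].T             # (n_frames, n_components)

def resample_keypoints(manifold, n_keypoints=10):
    """Re-sample a variable-length manifold to a fixed number of
    key-points by linear interpolation along the frame axis."""
    n_frames, n_dims = manifold.shape
    old_t = np.linspace(0.0, 1.0, n_frames)
    new_t = np.linspace(0.0, 1.0, n_keypoints)
    return np.stack([np.interp(new_t, old_t, manifold[:, d])
                     for d in range(n_dims)], axis=1)

# Example: a 25-frame "video" of 64-pixel frames -> 10 fixed key-points
rng = np.random.default_rng(0)
frames = rng.standard_normal((25, 64))
keypoints = resample_keypoints(pca_manifold(frames), n_keypoints=10)
```

The fixed-length `keypoints` array is what a per-word HMM would consume as its observation sequence, regardless of how many frames the original utterance contained.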
Multimodal person recognition for human-vehicle interaction
Next-generation vehicles will undoubtedly feature biometric person recognition as part of an effort to improve the driving experience. Today's technology prevents such systems from operating satisfactorily under adverse conditions. A proposed framework for achieving person recognition successfully combines different biometric modalities, borne out in two case studies.
Contour Mapping for Speaker-Independent Lip Reading System
In this paper, we demonstrate how an existing deep learning architecture for automatic lip reading can be adapted to be speaker independent, and by doing so, improved accuracies can be achieved on a variety of different speakers. The architecture itself is multi-layered, consisting of a convolutional neural network, but if an initial edge-detection stage is applied to pre-process the image inputs so that only the contours remain, the architecture becomes less biased towards particular speakers.
The neural network architecture achieves good accuracy rates when trained and tested on some of the same speakers in the "overlapped speakers" phase of simulations, where word error rates of just 1.3% and 0.4% are achieved on two individual speakers, with character error rates of 0.6% and 0.3% respectively. The "unseen speakers" phase does not achieve the same accuracy, with greater word error rates of 20.6% and 17.0% when tested on the two speakers, and character error rates of 11.5% and 8.3%.
The variation in size and colour of different people's lips results in different outputs at the convolution layer of a convolutional neural network, because the output depends on the pixel intensities of the red, green and blue channels of the input image; a convolutional neural network will therefore naturally favour observations of the individual on whom it was trained. This paper proposes an initial "contour mapping" stage which makes all inputs uniform so that the system can be speaker independent.
Keywords: Lip Reading, Speech Recognition, Deep Learning, Facial Landmarks, Convolutional Neural Networks, Recurrent Neural Networks, Edge Detection, Contour Mapping
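The abstract does not specify which edge detector the contour mapping stage uses; as a hypothetical illustration, the following NumPy sketch applies a Sobel gradient with normalisation and thresholding, showing how such a stage can discard per-speaker colour and intensity differences while keeping lip contours.

```python
import numpy as np

def sobel_contours(gray, threshold=0.25):
    """Reduce a grayscale lip image to a binary contour map: Sobel
    gradient magnitude, normalised, then thresholded. The threshold
    value here is an assumption for illustration."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    h, w = gray.shape
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    for i in range(1, h - 1):                  # interior pixels only
        for j in range(1, w - 1):
            patch = gray[i - 1:i + 2, j - 1:j + 2]
            gx[i, j] = np.sum(patch * kx)
            gy[i, j] = np.sum(patch * ky)
    mag = np.hypot(gx, gy)
    mag /= mag.max() or 1.0                    # normalise away intensity scale
    return (mag > threshold).astype(np.uint8)  # binary contour map

# A synthetic bright square on a dark background: only its border survives
img = np.zeros((16, 16))
img[4:12, 4:12] = 1.0
contours = sobel_contours(img)
```

Because the output is binary and normalised, two speakers with differently coloured or differently lit lips would produce much more similar inputs to the downstream convolutional network.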
Visual Passwords Using Automatic Lip Reading
This paper presents a visual passwords system to increase security. The system depends mainly on recognizing the speaker using the visual speech signal alone. The proposed scheme works in two stages: the visual password setting stage and the verification stage. At the setting stage, the system requests the user to utter a selected password; a video recording of the user's face is captured and processed by a special word-based visual speech recognition (VSR) system, which extracts a sequence of feature vectors. In the verification stage, the same procedure is executed and the extracted features are compared with the stored visual password. The proposed scheme has been evaluated using a video database of 20 different speakers (10 females and 10 males), and 15 more males in another video database with different experiment sets. The evaluation demonstrated the system's feasibility, with average error rates in the range of 7.63% to 20.51% in the worst tested scenario, and it therefore has potential to be a practical approach when supported by other conventional authentication methods such as usernames and passwords.
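The comparison method is not detailed in this abstract; one common way to compare variable-length feature-vector sequences is dynamic time warping (DTW). A minimal sketch, assuming per-frame feature vectors and an illustrative acceptance threshold:

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """Dynamic time warping distance between two feature-vector
    sequences, so utterances of different lengths can be compared."""
    n, m = len(seq_a), len(seq_b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def verify(candidate, stored_password, threshold=5.0):
    """Accept the user if the candidate utterance is close enough to
    the enrolled visual password (threshold is an assumption)."""
    return dtw_distance(candidate, stored_password) <= threshold

# Toy enrolment vs. a slower repetition of the same feature pattern
enrolled = np.tile(np.arange(4.0), (8, 1))     # 8 frames of 4-D features
attempt = np.tile(np.arange(4.0), (12, 1))     # same word, spoken slower
```

DTW absorbs differences in speaking rate between enrolment and verification, which is why fixed-length comparison is not required here.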
Exploiting the bimodality of speech in the cocktail party problem
The cocktail party problem is one of following a conversation in a crowded room where there are many competing sound sources, such as the voices of other speakers or music. To address this problem using computers, digital signal processing solutions commonly use blind source separation (BSS), which aims to separate all the original sources (voices) from the mixture simultaneously. Traditionally, BSS methods have relied on information derived from the mixture of sources to separate the mixture into its constituent elements. However, the human auditory system is well adapted to handle the cocktail party scenario, using both auditory and visual information to follow (or hold) a conversation in such an environment. This thesis focuses on using visual information of the speakers in a cocktail-party-like scenario to aid in improving the performance of BSS. There are several useful applications of such technology, for example: a pre-processing step for a speech recognition system, teleconferencing, or security surveillance. The visual information used in this thesis is derived from the speaker's mouth region, as it is the most visible component of speech production. Initial research presented in this thesis considers a joint statistical model of audio and visual features, which is used to assist in controlling the convergence behaviour of a BSS algorithm. The results of using the statistical models are compared to using the raw audio information alone, and it is shown that the inclusion of visual information greatly improves convergence behaviour. Further research focuses on using the speaker's mouth region to identify periods of time when the speaker is silent, through the development of a visual voice activity detector (V-VAD), i.e. voice activity detection using visual information alone. This information can be used in many different ways to simplify the BSS process.
To this end, two novel V-VADs were developed and tested within a BSS framework, which resulted in significantly improved intelligibility of the separated source associated with the V-VAD output. Thus the research presented in this thesis confirms the viability of using visual information to improve solutions to the cocktail party problem.
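The two V-VADs themselves are not described in this abstract. As a toy illustration of the idea, a detector can threshold frame-to-frame motion energy in the mouth region, treating low-motion frames as silence; the threshold and the motion-energy feature are assumptions, not the thesis's actual detectors.

```python
import numpy as np

def visual_vad(mouth_frames, threshold=0.01):
    """Label each frame as active (speech) or silent from mouth-region
    motion alone: large frame-to-frame change suggests articulation,
    small change suggests silence. Returns a boolean mask per frame."""
    diffs = np.abs(np.diff(mouth_frames, axis=0)).mean(axis=(1, 2))
    # The first frame has no predecessor, so treat it as silent.
    return np.concatenate([[False], diffs > threshold])

# Toy clip: 5 static (silent) frames, then 5 frames of changing pixels
frames = np.zeros((10, 8, 8))
for t in range(5, 10):
    frames[t] = 0.1 * t
active = visual_vad(frames)
```

In the BSS setting described above, the silent periods flagged by such a mask identify intervals where the target speaker contributes nothing to the mixture, which can be exploited to constrain the separation.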