Fuzzy Audio-Visual Feature Maps for Speaker Identification
Speech-based person recognition by machine has not reached the level of technological maturity required by some of its potential applications. The deficiencies revolve around sub-optimal pre-processing, feature extraction or selection, and classification, particularly under conditions of input data variability. The joint use of audible and visible manifestations of speech aims to alleviate these shortcomings, but the development of effective combination techniques is challenging. This paper proposes a combination approach for speaker identification based on fuzzy modelling of acoustic and visual speaker characteristics, and evaluates it experimentally on a speaker identification task. The results show that the joint model outperforms its isolated components in terms of identification accuracy. In particular, the cross-modal coupling of the audio-visual streams is shown to improve identification accuracy.
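A minimal Python sketch of the kind of score-level fuzzy fusion the abstract describes; the membership function, aggregation operator, and weights below are illustrative assumptions, not details taken from the paper:

    import numpy as np

    def fuzzy_membership(score, low=0.0, high=1.0):
        # Map a raw per-speaker classifier score onto a fuzzy
        # membership in [0, 1] via a simple linear ramp (assumed form).
        return float(np.clip((score - low) / (high - low), 0.0, 1.0))

    def fuse_audio_visual(audio_scores, visual_scores, w_audio=0.6):
        # Combine per-speaker audio and visual memberships with a
        # weighted average (one of many fuzzy aggregation operators)
        # and return the identity with the highest fused support.
        fused = {}
        for speaker in audio_scores:
            mu_a = fuzzy_membership(audio_scores[speaker])
            mu_v = fuzzy_membership(visual_scores[speaker])
            fused[speaker] = w_audio * mu_a + (1.0 - w_audio) * mu_v
        return max(fused, key=fused.get)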
Hierarchical multistream recognition of facial expressions
Achieving optimal recognition accuracy, particularly under conditions of input data variability, is a challenge for automatic facial expression recognition. However, little research has been devoted to investigating the robustness of automatic expression recognition under adverse conditions. A facial expression modelling approach is proposed for enhancing the robustness of expression recognition. The approach is founded on hierarchical state-based modelling of streams that represent spatially localised expression dynamics. Experimental assessment shows that the proposed model achieves high and stable recognition accuracy over a range of input data degradations. Moreover, inter-stream coupling, as well as the inclusion of adaptive estimation of model reliability and credibility, is shown to make a positive contribution to recognition accuracy.
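As a rough illustration of multistream combination with adaptive reliability estimates, the sketch below weights per-stream log-likelihoods before picking a class; the weighting scheme and the numbers are assumptions for illustration only:

    import numpy as np

    def combine_streams(log_likelihoods, reliabilities):
        # Weighted sum of per-stream log-likelihoods for each
        # expression class. log_likelihoods: (n_streams, n_classes);
        # reliabilities: per-stream weights, normalised to sum to 1.
        w = np.asarray(reliabilities, dtype=float)
        w = w / w.sum()
        return w @ np.asarray(log_likelihoods)

    # Example: three facial-region streams, four expression classes.
    ll = np.log([[0.5, 0.2, 0.2, 0.1],
                 [0.4, 0.3, 0.2, 0.1],
                 [0.1, 0.1, 0.2, 0.6]])
    print(combine_streams(ll, reliabilities=[0.5, 0.3, 0.2]).argmax())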
Robustness of head pose estimation based on Gabor wavelets and graph matching
Gabor wavelet filter responses used as graph nodes in labelled graph matching have previously been shown to give high accuracy for head pose estimation from a single two-dimensional image. However, the robustness of this approach to additive noise, convolutive noise, and illumination level has not been assessed systematically. This paper presents an experimental evaluation of the combination of Gabor wavelets with graph matching. The investigations assess robustness against additive impulse noise, convolutive periodic noise, and ambient light level. The results show that this pose estimation approach is sensitive to the density of impulse noise and mildly sensitive to illumination level. The paper also reports that pose estimation is sensitive to the amplitude of periodic noise but insensitive to its orientation.
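A compact sketch of the underlying representation: Gabor filter magnitudes stacked into a "jet" that labels a graph node, with a normalised dot product as the node similarity used in matching. The kernel parameters are illustrative, and OpenCV's getGaborKernel/filter2D are assumed as the filtering backend:

    import numpy as np
    import cv2

    def gabor_jet(image, x, y, n_orient=8, n_scale=5):
        # Magnitudes of a bank of Gabor filters at one pixel (a 'jet').
        jet = []
        for s in range(n_scale):
            for o in range(n_orient):
                kernel = cv2.getGaborKernel(
                    ksize=(21, 21), sigma=2.0 * (s + 1),
                    theta=np.pi * o / n_orient,
                    lambd=4.0 * (s + 1), gamma=0.5)
                response = cv2.filter2D(image, cv2.CV_32F, kernel)
                jet.append(abs(response[y, x]))
        return np.array(jet)

    def jet_similarity(j1, j2):
        # Normalised dot product between two jets; graph matching sums
        # this over corresponding nodes, and the pose whose stored
        # model graph scores highest is reported.
        return float(j1 @ j2) / (np.linalg.norm(j1) * np.linalg.norm(j2) + 1e-9)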
Robust facial expression recognition using a state-based model of spatially-localised facial dynamics
This paper proposes a new approach to the robust recognition of facial expressions from video sequences. The goal of the work presented herein is to develop robust recognition techniques that overcome some limitations of current techniques, such as their sensitivity to partial occlusion of the face and to noisy data. This paper investigates a representation of facial expressions based on a spatially-localised geometric facial model coupled to a state-based model of facial motion. The experiments show that the proposed facial expression recognition framework yields relatively little degradation in recognition rate when faces are partially occluded or when varying levels of noise are introduced at the feature-tracker level.
Classification of smart video surveillance systems for commercial applications
Video surveillance has a large market, as the number of installed cameras around us shows. There is an immediate commercial need for smart video surveillance systems that can make use of the existing camera network (e.g. CCTV) for more intelligent security systems, and that can support applications beyond security. This work introduces a new classification of smart video surveillance systems based on their commercial applications, and highlights the links between research and commercial applications. The work reported here has both research and commercial motivations. Our goals are, first, to define a generic model of smart video surveillance systems that can meet the requirements of strong commercial applications, and second, to categorise different smart video surveillance applications and to relate the capabilities of computer vision algorithms to the requirements of commercial applications.
Image Processing: Object Segmentation Using Full-Spectrum Matching of Albedo Derived from Colour Images
An image segmentation method has a training phase and a segmentation phase. In the training phase, a frame of pixellated data from a camera is processed using information on camera characteristics to render it camera-independent. The camera-independent data are processed using a chosen value of illuminant spectral characteristics to derive reflectivity data for the items in the image. Pixels of high reflectivity are established. Then, using data from the high-reflectivity pixels, the actual illuminant spectral characteristics are established. The illuminant data are then processed to determine information on the illumination of the scene represented by the frame of pixellated data, and thereby to derive reflectivity data of the scene. The segmentation phase comprises operating on a subsequent frame of pixellated data to render it camera-independent, and using the determined illumination information to process the camera-independent data to determine reflectivity data of the scene and derive a foreground mask.
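Under a simple diagonal camera model and Lambertian reflectance (assumptions made here for illustration; the patented procedure may differ), the two phases can be sketched as:

    import numpy as np

    def estimate_illuminant(frame, percentile=99):
        # 'frame' is camera-corrected RGB of shape (H, W, 3) in [0, 1].
        # High-reflectivity (bright) pixels approximate the illuminant
        # colour -- a white-patch style assumption.
        pixels = frame.reshape(-1, 3)
        thresh = np.percentile(pixels.sum(axis=1), percentile)
        return pixels[pixels.sum(axis=1) >= thresh].mean(axis=0)

    def estimate_reflectivity(frame, illuminant):
        # sensor = reflectivity * illuminant, per channel, so
        # reflectivity (albedo) = sensor / illuminant.
        return frame / (illuminant + 1e-9)

    def foreground_mask(frame, background_albedo, illuminant, tol=0.15):
        # Pixels whose albedo departs from the trained background
        # albedo by more than 'tol' are flagged as foreground.
        albedo = estimate_reflectivity(frame, illuminant)
        return np.linalg.norm(albedo - background_albedo, axis=2) > tol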
Face segmentation using fuzzy reasoning
In this paper, we present a face segmentation architecture using fuzzy inference. The head of a talker and two key structural features of the face (eyes and mouth) are located based on temporal and spatial information extracted from a head-and-shoulders image sequence. The architecture is modular, and the segmentation uses a coarse-to-fine fuzzy reasoning strategy implemented across a three-level multi-resolution image pyramid. Results illustrating the performance of the system are given.
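The coarse-to-fine strategy can be sketched roughly as below, fusing a temporal (motion) cue with a spatial cue via the fuzzy minimum at each pyramid level; the cues and constants are placeholders, not the paper's actual rule base:

    import numpy as np
    import cv2

    def fuzzy_and(*memberships):
        # Standard fuzzy conjunction: pointwise minimum of maps.
        return np.minimum.reduce(memberships)

    def coarse_to_fine_face_map(frame, prev_frame, levels=3):
        # frame, prev_frame: greyscale uint8 images.
        pyr, prev_pyr = [frame], [prev_frame]
        for _ in range(levels - 1):
            pyr.append(cv2.pyrDown(pyr[-1]))
            prev_pyr.append(cv2.pyrDown(prev_pyr[-1]))

        face_map = None
        for cur, prev in zip(reversed(pyr), reversed(prev_pyr)):
            motion = np.clip(cv2.absdiff(cur, prev) / 64.0, 0, 1)  # temporal cue
            bright = np.clip(cur / 255.0, 0, 1)                    # stand-in spatial cue
            level_map = fuzzy_and(motion, bright)
            if face_map is not None:
                # Refine the coarser estimate at the finer resolution.
                upsampled = cv2.resize(face_map, level_map.shape[::-1])
                level_map = fuzzy_and(level_map, upsampled)
            face_map = level_map
        return face_map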
Integrated person identification using voice and facial features
Real-world automatic person recognition requires a consistently high recognition accuracy, which is difficult to attain using a single recognition modality. This paper addresses the issue of person identification accuracy resulting from the combination of voice and outer lip-margin features. An assessment of feature fusion, based on audio-visual feature vector concatenation, principal component analysis, and linear discriminant analysis, is conducted. The paper shows that outer lip margins carry speaker identity cues. It is also shown that the joint use of voice and lip-margin features is equivalent to an effective increase in the signal-to-noise ratio of the audio signal. Simple audio-visual feature vector concatenation is shown to be an effective method for feature combination, and linear discriminant analysis is shown to pack discriminating audio-visual information into fewer coefficients than principal component analysis.
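A minimal sketch of the feature-fusion comparison described above, using scikit-learn's PCA and LDA on concatenated per-frame vectors; the feature dimensions and component counts are illustrative assumptions:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    def fuse_and_reduce(audio_feats, lip_feats, labels, n_components=20):
        # Simple feature-level fusion: concatenate the per-frame audio
        # and outer lip-margin feature vectors.
        fused = np.hstack([audio_feats, lip_feats])

        pca = PCA(n_components=n_components).fit(fused)
        # LDA yields at most (n_classes - 1) discriminant directions.
        lda = LinearDiscriminantAnalysis(
            n_components=min(n_components, len(set(labels)) - 1)
        ).fit(fused, labels)
        return pca.transform(fused), lda.transform(fused)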
Voice and Facial Image Integration for Person Recognition
Automatic person recognition systems attempt to use physical or behavioural characteristics to perform the recognition task, and to this extent may be compared with humans. The face and voice of a person are two sources of characteristics that can provide information about the identity of an individual. We describe a person recognition system using face and voice as primary sources of personal identity information. Results of preliminary investigations on person recognition, based on still face-profile images and voice, show improved recognition accuracy compared to face-profile recognition and voice recognition performed separately. Furthermore, a recognition system based on visual and acoustic speech is proposed.
Audio-visual person recognition: an evaluation of data fusion strategies
Audio-visual person recognition promises higher recognition accuracy than recognition in either domain in isolation. To reach this goal, special attention should be given to the strategies for combining the acoustic and visual sensory modalities. This paper presents a comparative assessment of three decision-level data fusion techniques for person identification. Under mismatched training and test noise conditions, Bayesian inference and Dempster-Shafer theory are shown to outperform possibility theory. For these mismatched noise conditions, all three techniques result in compromising integration. Under matched training and test noise conditions, the three techniques yield similar error rates approaching that of the more accurate of the two sensory modalities, and show signs of leading to enhancing integration at low acoustic noise levels. The paper also shows that automatic identification of identical twins is possible, and that lip margins convey a high level of speaker identity information.
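For concreteness, two of the three decision-level fusion rules can be sketched as below over singleton identity hypotheses; the restriction to singletons (ignoring the compound hypothesis sets of full Dempster-Shafer theory) is a simplification made here, not the paper's formulation:

    import numpy as np

    def bayesian_fusion(p_audio, p_visual):
        # Naive-Bayes style fusion: multiply per-identity posteriors
        # from the two modality classifiers and renormalise.
        joint = np.asarray(p_audio) * np.asarray(p_visual)
        return joint / joint.sum()

    def dempster_combine(m_audio, m_visual):
        # Dempster's rule restricted to singleton hypotheses: keep the
        # mass on agreeing hypotheses and normalise out the conflict.
        agreement = np.asarray(m_audio) * np.asarray(m_visual)
        conflict = 1.0 - agreement.sum()
        return agreement / (1.0 - conflict)

    # Example with three candidate identities:
    print(bayesian_fusion([0.6, 0.3, 0.1], [0.5, 0.2, 0.3]))
    print(dempster_combine([0.6, 0.3, 0.1], [0.5, 0.2, 0.3]))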