84 research outputs found
Pose-invariant, model-based object recognition, using linear combination of views and Bayesian statistics
This thesis presents an in-depth study on the problem of object recognition, and in particular the detection
of 3-D objects in 2-D intensity images which may be viewed from a variety of angles. A solution to this
problem remains elusive to this day, since it involves dealing with variations in geometry, photometry
and viewing angle, noise, occlusions and incomplete data. This work restricts its scope to a particular
kind of extrinsic variation: variation of the image due to changes in the viewpoint from which the object
is seen.
A technique is proposed and developed to address this problem, which falls into the category of
view-based approaches, that is, a method in which an object is represented as a collection of a small
number of 2-D views, as opposed to a generation of a full 3-D model. This technique is based on the
theoretical observation that the geometry of the set of possible images of an object undergoing 3-D rigid
transformations and scaling may, under most imaging conditions, be represented by a linear combination
of a small number of 2-D views of that object. It is therefore possible to synthesise a novel image of an
object given at least two existing and dissimilar views of the object, and a set of linear coefficients that
determine how these views are to be combined in order to synthesise the new image.
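The synthesis step above can be sketched at the landmark level. This is a minimal illustration using the standard two-view coefficient form of the linear-combination-of-views result; all names are illustrative, and the thesis applies the combination directly to pixel values rather than only to sparse landmarks:

```python
import numpy as np

def synthesise_landmarks(view1, view2, cx, cy):
    """Synthesise novel-view landmark positions as a linear combination
    of two stored 2-D views.

    view1, view2 : (N, 2) arrays of corresponding landmark coordinates.
    cx, cy       : 5-element coefficient vectors mixing the x/y
                   coordinates of both basis views plus a constant term.
    """
    x1, y1 = view1[:, 0], view1[:, 1]
    x2, y2 = view2[:, 0], view2[:, 1]
    # Each novel coordinate is an affine combination of the basis views.
    x_new = cx[0] + cx[1] * x1 + cx[2] * y1 + cx[3] * x2 + cx[4] * y2
    y_new = cy[0] + cy[1] * x1 + cy[2] * y1 + cy[3] * x2 + cy[4] * y2
    return np.stack([x_new, y_new], axis=1)
```

With identity coefficients (cx = [0, 1, 0, 0, 0], cy = [0, 0, 1, 0, 0]) the function simply reproduces the first view; other coefficient settings interpolate or extrapolate new poses.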
The method works in conjunction with a powerful optimization algorithm to search for and recover the
optimal linear-combination coefficients that synthesise a novel image as similar as possible
to the target scene view. If the similarity between the synthesised and the target images is above some
threshold, then an object is determined to be present in the scene and its location and pose are defined,
in part, by the coefficients. The key benefit of this technique is that, because it works directly
with pixel values, it avoids the need for problematic, low-level feature extraction and for solving the
correspondence problem. As a result, a linear combination of views (LCV) model is easy to construct
and use, since it requires only a small number of stored 2-D views of the object in question and the
selection of a few landmark points on the object, a process easily carried out during the offline,
model-building stage. In addition, the method is general enough to be applied across a variety of
recognition problems and different types of objects.
The development and application of this method are explored first on two-dimensional
problems, and the same principles are then extended to 3-D. Additionally, the method is evaluated on
synthetic and real-image datasets containing variations in the objects' identity and pose. Possible
future extensions to incorporate a foreground/background model and pixel-level lighting variation
are also examined.
Study of Video Assisted BSS for Convolutive Mixtures
In this paper we present an overview of recent research in
the area of audio-visual blind source separation (BSS), together
with new results of our work that highlight the advantage
of including visual information into a BSS algorithm.
In our work the visual information is combined with audio
information to form joint audio-visual feature vectors. The
audio-visual coherence is then modelled using statistical models.
The outputs of these models are used within a frequency
domain BSS algorithm to control the step size. Experimental
results verify the improvement of the audio-visual method
compared to audio-only BSS. We also discuss visual feature
extraction techniques, along with several recently published
methods for audio-visual BSS, and conclude with suggestions
for future research.
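The coherence-controlled step size described above can be sketched as follows. This is a hypothetical illustration, not the paper's algorithm: `coherence` stands in for the output of the statistical audio-visual model, and the update shown is a generic natural-gradient ICA step of the kind used in frequency-domain BSS.

```python
import numpy as np

def coherence_controlled_step(base_step, coherence, floor=0.1):
    """Scale a BSS adaptation step size by an audio-visual coherence
    score in [0, 1]. The linear mapping and the floor are assumptions:
    high coherence -> confident source estimate -> larger steps."""
    return base_step * (floor + (1.0 - floor) * coherence)

def natural_gradient_update(W, y_f, step):
    """One natural-gradient ICA update for a single frequency bin
    (standard frequency-domain BSS form, tanh score function).

    W   : (n, n) unmixing matrix for this bin.
    y_f : (n, T) separated-signal estimates over T time frames.
    """
    n = W.shape[0]
    phi = np.tanh(y_f)  # nonlinearity acting as the source score function
    grad = (np.eye(n) - phi @ y_f.conj().T / y_f.shape[1]) @ W
    return W + step * grad
```

In use, the visual model's coherence score would be evaluated per block, and `coherence_controlled_step` would supply the `step` argument of each unmixing update.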
Expressive Modulation of Neutral Visual Speech
The need for animated graphical models of the human face is commonplace in
the movies, video games and television industries, appearing in everything from
low budget advertisements and free mobile apps, to Hollywood blockbusters
costing hundreds of millions of dollars. Generative statistical models of
animation attempt to address some of the drawbacks of industry standard
practices such as labour intensity and creative inflexibility.
This work describes one such method for transforming speech animation curves
between different expressive styles. Beginning with the assumption that
expressive speech animation is a mix of two components, a high-frequency
speech component (the content) and a much lower-frequency expressive
component (the style), we use Independent Component Analysis (ICA) to
identify and manipulate these components independently of one another. Next
we learn how the energy for different speaking styles is distributed in terms of
the low-dimensional independent components model. Transforming the
speaking style involves projecting new animation curves into the low-dimensional
ICA space, redistributing the energy in the independent
components, and finally reconstructing the animation curves by inverting the
projection.
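The transformation procedure (project, redistribute energy, invert the projection) can be sketched with scikit-learn's FastICA. Which components carry style is an assumption here; in the work it is determined by the components' frequency content, and random data stands in for real animation curves:

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
# Stand-in for speech-animation curves: frames x animation parameters.
curves = rng.laplace(size=(500, 6))

ica = FastICA(n_components=4, random_state=0, max_iter=500)
sources = ica.fit_transform(curves)       # project into ICA space

style_components = [0, 1]                 # assumed low-frequency (style) axes
gain = 1.5                                # redistribute energy: amplify style
sources[:, style_components] *= gain

restyled = ica.inverse_transform(sources) # reconstruct animation curves
```

In the real system the gains would be learned from how energy is distributed for each speaking style, rather than fixed by hand as above.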
We show that a single ICA model can be used for separating multiple expressive
styles into their component parts. Subjective evaluations show that viewers can
reliably identify the expressive style generated using our approach, and that they
have difficulty distinguishing transformed animated expressive speech from the
equivalent ground truth.
Editing faces in videos
Editing faces in movies is of interest in the special effects industry. We aim to
produce effects such as the addition of accessories interacting correctly with
the face or replacing the face of a stuntman with the face of the main actor.
The system introduced in this thesis is based on a 3D generative face model.
Using a 3D model makes it possible to edit the face in the semantic space of pose,
expression, and identity instead of pixel space, and due to its 3D nature allows
a modelling of the light interaction. In our system we first reconstruct, in all
frames of a monocular input video, the 3D face (which deforms due to expression
and speech), the lighting, and the camera. The face is then edited by
substituting expressions or identities with those of another video sequence or by
adding virtual objects into the scene. The manipulated 3D scene is rendered back
into the original video, correctly simulating the interaction of the light with the
deformed face and virtual objects.
We describe all steps necessary to build and apply the system. This includes
registration of training faces to learn a generative face model, semi-automatic
annotation of the input video, fitting of the face model to the input video, editing
of the fit, and rendering of the resulting scene.
While describing the application we introduce a host of new methods, each
of which is of interest on its own. We start with a new method to register 3D
face scans to use as training data for the face model. For video preprocessing a
new interest point tracking and 2D Active Appearance Model fitting technique
is proposed. For robust fitting we introduce background modelling, model-based
stereo techniques, and a more accurate light model.
Face De-Identification for Privacy Protection
The ability to record, store and analyse images of faces economically, rapidly and on a vast scale has drawn people's attention to privacy. Current privacy-protection approaches for face images rely mainly on masking, blurring or blacking out, which, however, remove data utility along with the identifying information. As a result, these ad hoc methods are hardly usable for data publishing or further research. De-identification attempts to remove identifying information from a dataset while preserving its utility as much as possible. De-identification of structured data is well established, whereas de-identifying unstructured data, such as faces in images and videos, remains a challenge. The k-Same face de-identification method was the first to apply an established de-identification theory, k-anonymity, to a face image dataset, and it is the starting point of this thesis.
Re-identification risk and data utility are two conflicting aspects of face de-identification. The focus of this thesis is to improve the privacy protection of a face de-identification system while providing utility-preserving solutions for different application scenarios. The thesis first proposes the k-Same-furthest face de-identification method, which adds wrong-map protection to k-Same-M face de-identification: identity loss is maximised by replacing an original face with the face least similar to it.
The data utility of face images is considered from two aspects in this thesis: dataset-wise utility, such as the data distribution of the dataset, and individual-wise utility, such as the facial expression in an individual image. To preserve the diversity of a face image dataset, the k-Diff-furthest face de-identification method is proposed, which extends the k-Same-furthest method while retaining its wrong-map protection.
With respect to the data utility of an individual face image, the visual quality and the preservation of facial expression are discussed in this thesis. A method to merge the isolated de-identified face region and its original image background is presented. The described method can increase the visual quality of a de-identified face image in terms of fidelity and intelligibility. A novel solution to preserving facial expressions in de-identified face images is presented, which can preserve not only the category of facial expressions but also the intensity of face Action Units.
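As an illustration of the replacement strategy underlying k-Same-furthest, here is a minimal sketch. It assumes faces are flattened feature vectors compared by Euclidean distance, and it omits the full k-anonymity bookkeeping (e.g., ensuring each surrogate is shared by at least k records) that the actual method requires:

```python
import numpy as np

def k_same_furthest(faces, k):
    """Sketch of k-Same-furthest style face de-identification.

    Each face vector is replaced by the average of the k faces FURTHEST
    from it, so the published surrogate resembles the "wrong" identity
    (wrong-map protection) while remaining a valid face-space point.
    """
    faces = np.asarray(faces, dtype=float)
    # Pairwise Euclidean distances between all face vectors.
    dists = np.linalg.norm(faces[:, None, :] - faces[None, :, :], axis=2)
    deidentified = np.empty_like(faces)
    for i in range(len(faces)):
        furthest = np.argsort(dists[i])[-k:]   # k least-similar faces
        deidentified[i] = faces[furthest].mean(axis=0)
    return deidentified
```

In practice the vectors would be model parameters (e.g., AAM coefficients) rather than raw pixels, so each averaged surrogate reconstructs to a plausible face.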
Finally, an integration of the Active Appearance Model (AAM) and Generative
Adversarial Network (GAN) is presented, which achieves the synthesis of realistic face images with shallow neural network architectures.
Infrared face recognition: a comprehensive review of methodologies and databases
Automatic face recognition is an area with immense practical potential which
includes a wide range of commercial and law enforcement applications. Hence it
is unsurprising that it continues to be one of the most active research areas
of computer vision. Even after over three decades of intense research, the
state-of-the-art in face recognition continues to improve, benefitting from
advances in a range of different research fields such as image processing,
pattern recognition, computer graphics, and physiology. Systems based on
visible spectrum images, the most researched face recognition modality, have
reached a significant level of maturity with some practical success. However,
they continue to face challenges in the presence of illumination, pose and
expression changes, as well as facial disguises, all of which can significantly
decrease recognition accuracy. Amongst various approaches which have been
proposed in an attempt to overcome these limitations, the use of infrared (IR)
imaging has emerged as a particularly promising research direction. This paper
presents a comprehensive and timely review of the literature on this subject.
Our key contributions are: (i) a summary of the inherent properties of infrared
imaging which make this modality promising in the context of face recognition,
(ii) a systematic review of the most influential approaches, with a focus on
emerging common trends as well as key differences between alternative
methodologies, (iii) a description of the main databases of infrared facial
images available to the researcher, and lastly (iv) a discussion of the most
promising avenues for future research. (Pattern Recognition, 2014.)
Face recognition using infrared vision
Over the course of the last decade, infrared (IR) and in particular thermal IR imaging based face recognition has emerged as a promising complement to conventional, visible spectrum based approaches, which continue to struggle when applied in the real world. While inherently insensitive to visible spectrum illumination changes, IR images introduce specific challenges of their own, most notably sensitivity to factors which affect facial heat emission patterns, e.g., emotional state, ambient temperature and alcohol consumption. In addition, facial expression and pose changes are more difficult to correct in IR images because they are less rich in high-frequency detail, which is an important cue for fitting any deformable model. In this thesis we describe a novel method which addresses these major challenges.
Specifically, to normalise for pose and facial expression changes we generate a synthetic frontal image of a face in a canonical, neutral facial expression from an image of the face in an arbitrary pose and facial expression. This is achieved by piecewise affine warping which follows active appearance model (AAM) fitting. This is the first work to explore the use of an AAM on thermal IR images; we propose a pre-processing step which enhances detail in thermal images, making AAM convergence faster and more accurate.
To overcome the sensitivity of thermal IR images to the exact pattern of facial temperature emissions, we describe a representation based on reliable anatomical features. In contrast to previous approaches, our representation is not binary; rather, our method accounts for the reliability of the extracted features, which makes the representation much more robust to both pose and scale changes. The effectiveness of the proposed approach is demonstrated on the largest public database of thermal IR images of faces, on which it achieves satisfying recognition performance and significantly outperforms previously described methods. The approach has also performed well on subsets of the largest such video database in the world, gathered in our laboratory, which will be made publicly available free of charge in the future. Given the anatomical basis of the feature extraction in our system, we anticipate high robustness to challenging factors such as temperature changes; however, we were unable to investigate this in depth owing to the limits of gathering realistic databases. Gathering this large video database, covering several challenging factors, is another contribution of this research.
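The piecewise affine warp used to map the AAM-fitted face into the canonical frontal shape is built from per-triangle affine maps over a landmark triangulation. A minimal sketch of the per-triangle step (illustrative names, pure NumPy):

```python
import numpy as np

def triangle_affine(src_tri, dst_tri):
    """Solve for the 2x3 affine matrix taking one triangle's vertices
    onto another's: the building block of a piecewise affine warp.

    src_tri, dst_tri : (3, 2) arrays of triangle vertex coordinates.
    """
    # Homogeneous source vertices: each row [x, y, 1].
    A = np.hstack([src_tri, np.ones((3, 1))])
    # Solve A @ M.T = dst_tri for the 2x3 affine matrix M.
    return np.linalg.solve(A, dst_tri).T

def apply_affine(M, pts):
    """Apply a 2x3 affine matrix to (N, 2) points."""
    return np.hstack([pts, np.ones((len(pts), 1))]) @ M.T
```

The full warp would triangulate the AAM landmarks, solve one such map per triangle, and resample each source triangle's pixels into the canonical frontal shape.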
Exploiting the bimodality of speech in the cocktail party problem
The cocktail party problem is that of following a conversation in a crowded room where there are many competing sound sources, such as the voices of other speakers or music. To address this problem with computers, digital signal processing solutions commonly use blind source separation (BSS), which aims to separate all of the original sources (voices) from the mixture simultaneously. Traditionally, BSS methods have relied only on information derived from the mixture itself. The human auditory system, however, is well adapted to the cocktail party scenario, using both auditory and visual information to follow (or hold) a conversation in such an environment.
This thesis focuses on using visual information about the speakers in a cocktail party like scenario to improve the performance of BSS. There are several useful applications of such technology, for example as a pre-processing step for a speech recognition system, in teleconferencing, or in security surveillance. The visual information used in this thesis is derived from the speaker's mouth region, as it is the most visible component of speech production.
Initial research presented in this thesis considers a joint statistical model of audio and visual features, which is used to assist in controlling the convergence behaviour of a BSS algorithm. The results of using the statistical models are compared to using the raw audio information alone, and it is shown that the inclusion of visual information greatly improves convergence behaviour. Further research focuses on using the speaker's mouth region to identify periods when the speaker is silent, through the development of a visual voice activity detector (V-VAD), i.e. voice activity detection using visual information alone. This information can be used in many different ways to simplify the BSS process.
To this end, two novel V-VADs were developed and tested within a BSS framework, resulting in significantly improved intelligibility of the separated source associated with the V-VAD output. The research presented in this thesis thus confirms the viability of using visual information to improve solutions to the cocktail party problem.
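As a toy illustration of the V-VAD idea (not either of the thesis's two detectors), a silence mask can be derived from mouth-region motion alone; the criterion and threshold here are assumptions:

```python
import numpy as np

def visual_vad(mouth_frames, threshold):
    """Toy visual voice activity detector (V-VAD).

    Flags a frame as active when the frame-to-frame change in the mouth
    region exceeds a threshold, on the assumption that a moving mouth
    usually accompanies speech.

    mouth_frames : (T, H, W) array of grayscale mouth-region crops.
    Returns a boolean array of length T (True = speech activity).
    """
    diffs = np.abs(np.diff(mouth_frames.astype(float), axis=0))
    motion = diffs.mean(axis=(1, 2))          # mean per-frame change
    # The first frame has no predecessor, so mark it inactive.
    return np.concatenate([[False], motion > threshold])
```

Frames flagged inactive by such a detector can then, for example, constrain the BSS updates, which is how the thesis uses silence information to simplify separation.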