339 research outputs found
Hetero-manifold Regularisation for Cross-modal Hashing
Recently, cross-modal search has attracted considerable attention but remains a very challenging task because of the integration complexity and heterogeneity of the multi-modal data. To address both challenges, in this paper, we propose a novel method termed hetero-manifold regularisation (HMR) to supervise the learning of hash functions for efficient cross-modal search. A hetero-manifold integrates multiple sub-manifolds defined by homogeneous data with the help of cross-modal supervision information. Taking advantages of the hetero-manifold, the similarity between each pair of heterogeneous data could be naturally measured by three order random walks on this hetero-manifold. Furthermore, a novel cumulative distance inequality defined on the hetero-manifold is introduced to avoid the computational difficulty induced by the discreteness of hash codes. By using the inequality, cross-modal hashing is transformed into a problem of hetero-manifold regularised support vector learning. Therefore, the performance of cross-modal search can be significantly improved by seamlessly combining the integrated information of the hetero-manifold and the strong generalisation of the support vector machine. Comprehensive experiments show that the proposed HMR achieve advantageous results over the state-of-the-art methods in several challenging cross-modal tasks
Disentangling the modes of variation in unlabelled data
Statistical methods are of paramount importance in discovering the modes of variation in visual data. The Principal Component Analysis (PCA) is probably the most prominent method for extracting a single mode of variation in the data. However, in practice, visual data exhibit several modes of variations. For instance, the appearance of faces varies in identity, expression, pose etc. To extract these modes of variations from visual data, several supervised methods, such as the TensorFaces relying on multilinear (tensor) decomposition (e.g., Higher Order SVD) have been developed. The main drawbacks of such methods is that they require both labels regarding the modes of variations and the same number of samples under all modes of variations (e.g., the same face under different expressions, poses etc.). Therefore, their applicability is limited to well-organised data, usually captured in well-controlled conditions. In this paper, we propose a novel general multilinear matrix decomposition method that discovers the multilinear structure of possibly incomplete sets of visual data in unsupervised setting (i.e., without the presence of labels). We also propose extensions of the method with sparsity and low-rank constraints in order to handle noisy data, captured in unconstrained conditions. Besides that, a graph-regularised variant of the method is also developed in order to exploit available geometric or label information for some modes of variations. We demonstrate the applicability of the proposed method in several computer vision tasks, including Shape from Shading (SfS) (in the wild and with occlusion removal), expression transfer, and estimation of surface normals from images captured in the wild
Disentangling the modes of variation in unlabelled data
Statistical methods are of paramount importance in discovering the modes of variation in visual data. The Principal Component Analysis (PCA) is probably the most prominent method for extracting a single mode of variation in the data. However, in practice, visual data exhibit several modes of variations. For instance, the appearance of faces varies in identity, expression, pose etc. To extract these modes of variations from visual data, several supervised methods, such as the TensorFaces relying on multilinear (tensor) decomposition (e.g., Higher Order SVD) have been developed. The main drawbacks of such methods is that they require both labels regarding the modes of variations and the same number of samples under all modes of variations (e.g., the same face under different expressions, poses etc.). Therefore, their applicability is limited to well-organised data, usually captured in well-controlled conditions. In this paper, we propose a novel general multilinear matrix decomposition method that discovers the multilinear structure of possibly incomplete sets of visual data in unsupervised setting (i.e., without the presence of labels). We also propose extensions of the method with sparsity and low-rank constraints in order to handle noisy data, captured in unconstrained conditions. Besides that, a graph-regularised variant of the method is also developed in order to exploit available geometric or label information for some modes of variations. We demonstrate the applicability of the proposed method in several computer vision tasks, including Shape from Shading (SfS) (in the wild and with occlusion removal), expression transfer, and estimation of surface normals from images captured in the wild
Multilinear methods for disentangling variations with applications to facial analysis
Several factors contribute to the appearance of an object in a visual scene, including pose,
illumination, and deformation, among others. Each factor accounts for a source of variability
in the data. It is assumed that the multiplicative interactions of these factors emulate the
entangled variability, giving rise to the rich structure of visual object appearance. Disentangling
such unobserved factors from visual data is a challenging task, especially when the data have
been captured in uncontrolled recording conditions (also referred to as āin-the-wildā) and label
information is not available. The work presented in this thesis focuses on disentangling the
variations contained in visual data, in particular applied to 2D and 3D faces. The motivation
behind this work lies in recent developments in the field, such as (i) the creation of large, visual
databases for face analysis, with (ii) the need of extracting information without the use of labels
and (iii) the need to deploy systems under demanding, real-world conditions.
In the first part of this thesis, we present a method to synthesise plausible 3D expressions
that preserve the identity of a target subject. This method is supervised as the model uses
labels, in this case 3D facial meshes of people performing a defined set of facial expressions, to
learn. The ability to synthesise an entire facial rig from a single neutral expression has a large
range of applications both in computer graphics and computer vision, ranging from the ecient
and cost-eāµective creation of CG characters to scalable data generation for machine learning
purposes. Unlike previous methods based on multilinear models, the proposed approach is
capable to extrapolate well outside the sample pool, which allows it to accurately reproduce
the identity of the target subject and create artefact-free expression shapes while requiring
only a small input dataset. We introduce global-local multilinear models that leverage the
strengths of expression-specific and identity-specific local models combined with coarse motion
estimations from a global model. The expression-specific and identity-specific local models
are built from diāµerent slices of the patch-wise local multilinear model. Experimental results
show that we achieve high-quality, identity-preserving facial expression synthesis results that
outperform existing methods both quantitatively and qualitatively.
In the second part of this thesis, we investigate how the modes of variations from visual data
can be extracted. Our assumption is that visual data has an underlying structure consisting of
factors of variation and their interactions. Finding this structure and the factors is important
as it would not only help us to better understand visual data but once obtained we can edit the factors for use in various applications. Shape from Shading and expression transfer are just two
of the potential applications. To extract the factors of variation, several supervised methods
have been proposed but they require both labels regarding the modes of variations and the same
number of samples under all modes of variations. Therefore, their applicability is limited to
well-organised data, usually captured in well-controlled conditions. We propose a novel general
multilinear matrix decomposition method that discovers the multilinear structure of possibly
incomplete sets of visual data in unsupervised setting. We demonstrate the applicability of the
proposed method in several computer vision tasks, including Shape from Shading (SfS) (in the
wild and with occlusion removal), expression transfer, and estimation of surface normals from
images captured in the wild.
Finally, leveraging the unsupervised multilinear method proposed as well as recent advances in
deep learning, we propose a weakly supervised deep learning method for disentangling multiple
latent factors of variation in face images captured in-the-wild. To this end, we propose a deep
latent variable model, where we model the multiplicative interactions of multiple latent factors
of variation explicitly as a multilinear structure. We demonstrate that the proposed approach
indeed learns disentangled representations of facial expressions and pose, which can be used in
various applications, including face editing, as well as 3D face reconstruction and classification
of facial expression, identity and pose.Open Acces
Learning to hash for large scale image retrieval
This thesis is concerned with improving the effectiveness of nearest neighbour search.
Nearest neighbour search is the problem of finding the most similar data-points to a
query in a database, and is a fundamental operation that has found wide applicability
in many fields. In this thesis the focus is placed on hashing-based approximate
nearest neighbour search methods that generate similar binary hashcodes for similar
data-points. These hashcodes can be used as the indices into the buckets of hashtables
for fast search. This work explores how the quality of search can be improved by
learning task specific binary hashcodes.
The generation of a binary hashcode comprises two main steps carried out sequentially:
projection of the image feature vector onto the normal vectors of a set of hyperplanes
partitioning the input feature space followed by a quantisation operation that
uses a single threshold to binarise the resulting projections to obtain the hashcodes.
The degree to which these operations preserve the relative distances between the datapoints
in the input feature space has a direct influence on the effectiveness of using
the resulting hashcodes for nearest neighbour search. In this thesis I argue that the
retrieval effectiveness of existing hashing-based nearest neighbour search methods can
be increased by learning the thresholds and hyperplanes based on the distribution of
the input data.
The first contribution is a model for learning multiple quantisation thresholds. I
demonstrate that the best threshold positioning is projection specific and introduce a
novel clustering algorithm for threshold optimisation. The second contribution extends
this algorithm by learning the optimal allocation of quantisation thresholds per hyperplane.
In doing so I argue that some hyperplanes are naturally more effective than others
at capturing the distribution of the data and should therefore attract a greater allocation
of quantisation thresholds. The third contribution focuses on the complementary
problem of learning the hashing hyperplanes. I introduce a multi-step iterative model
that, in the first step, regularises the hashcodes over a data-point adjacency graph,
which encourages similar data-points to be assigned similar hashcodes. In the second
step, binary classifiers are learnt to separate opposing bits with maximum margin. This
algorithm is extended to learn hyperplanes that can generate similar hashcodes for similar
data-points in two different feature spaces (e.g. text and images). Individually the
performance of these algorithms is often superior to competitive baselines. I unify my
contributions by demonstrating that learning hyperplanes and thresholds as part of the
same model can yield an additive increase in retrieval effectiveness
Inferring Geodesic Cerebrovascular Graphs: Image Processing, Topological Alignment and Biomarkers Extraction
A vectorial representation of the vascular network that embodies quantitative features - location, direction, scale, and bifurcations - has many potential neuro-vascular applications. Patient-specific models support computer-assisted surgical procedures in neurovascular interventions, while analyses on multiple subjects are essential for group-level studies on which clinical prediction and therapeutic inference ultimately depend. This first motivated the development of a variety of methods to segment the cerebrovascular system. Nonetheless, a number of limitations, ranging from data-driven inhomogeneities, the anatomical intra- and inter-subject variability, the lack of exhaustive ground-truth, the need for operator-dependent processing pipelines, and the highly non-linear vascular domain, still make the automatic inference of the cerebrovascular topology an open problem. In this thesis, brain vesselsā topology is inferred by focusing on their connectedness. With a novel framework, the brain vasculature is recovered from 3D angiographies by solving a connectivity-optimised anisotropic level-set over a voxel-wise tensor field representing the orientation of the underlying vasculature. Assuming vessels joining by minimal paths, a connectivity paradigm is formulated to automatically determine the vascular topology as an over-connected geodesic graph. Ultimately, deep-brain vascular structures are extracted with geodesic minimum spanning trees. The inferred topologies are then aligned with similar ones for labelling and propagating information over a non-linear vectorial domain, where the branching pattern of a set of vessels transcends a subject-specific quantized grid. Using a multi-source embedding of a vascular graph, the pairwise registration of topologies is performed with the state-of-the-art graph matching techniques employed in computer vision. Functional biomarkers are determined over the neurovascular graphs with two complementary approaches. Efficient approximations of blood flow and pressure drop account for autoregulation and compensation mechanisms in the whole network in presence of perturbations, using lumped-parameters analog-equivalents from clinical angiographies. Also, a localised NURBS-based parametrisation of bifurcations is introduced to model fluid-solid interactions by means of hemodynamic simulations using an isogeometric analysis framework, where both geometry and solution profile at the interface share the same homogeneous domain. Experimental results on synthetic and clinical angiographies validated the proposed formulations. Perspectives and future works are discussed for the group-wise alignment of cerebrovascular topologies over a population, towards defining cerebrovascular atlases, and for further topological optimisation strategies and risk prediction models for therapeutic inference. Most of the algorithms presented in this work are available as part of the open-source package VTrails
- ā¦