Simultaneous inference for misaligned multivariate functional data
We consider inference for misaligned multivariate functional data that
represents the same underlying curve, but where the functional samples have
systematic differences in shape. In this paper we introduce a new class of
generally applicable models where warping effects are modeled through nonlinear
transformation of latent Gaussian variables and systematic shape differences
are modeled by Gaussian processes. To model cross-covariance between sample
coordinates we introduce a class of low-dimensional cross-covariance structures
suitable for modeling multivariate functional data. We present a method for
maximum-likelihood estimation in these models and apply it to three
data sets. The first data set is from a motion tracking system where the
spatial positions of a large number of body markers are tracked in
three dimensions over time. The second data set consists of height and weight
measurements for Danish boys. The third data set consists of three-dimensional
spatial hand paths from a controlled obstacle-avoidance experiment. We use the
developed method to estimate the cross-covariance structure, and use a
classification setup to demonstrate that the method outperforms
state-of-the-art methods for handling misaligned curve data.
Comment: 44 pages in total including tables and figures, plus an additional 9
pages of supplementary material and references.
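The modeling idea above can be illustrated with a toy simulation; nothing below reflects the paper's actual implementation, and the curve, warp family, and amplitude term are all illustrative assumptions. Each sample is the same underlying curve observed under a random monotone time warp (driven by a single latent Gaussian variable) plus a smooth amplitude effect standing in for the Gaussian-process shape term.

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 100)

def mu(s):
    """Shared underlying curve common to all samples."""
    return np.sin(2 * np.pi * s)

def random_warp(s, a):
    """Monotone warp of [0, 1] driven by a latent variable a (|a| < 1),
    loosely mimicking 'warps as nonlinear transforms of latent Gaussians'."""
    return s + a * s * (1.0 - s)

samples = []
for _ in range(5):
    a = 0.8 * np.tanh(rng.normal(scale=0.5))      # latent Gaussian -> bounded warp
    amp = 0.1 * rng.normal() * np.sin(np.pi * t)  # smooth stand-in for a GP draw
    samples.append(mu(random_warp(t, a)) + amp)
samples = np.array(samples)
print(samples.shape)  # (5, 100)
```

Registration methods like the one proposed aim to recover both the latent warps and the shared curve from data of this kind.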
Video Face Editing Using Temporal-Spatial-Smooth Warping
Editing faces in videos is a popular yet challenging aspect of computer
vision and graphics, which encompasses several applications including facial
attractiveness enhancement, makeup transfer, face replacement, and expression
manipulation. Simply applying image-based warping algorithms to video-based
face editing produces temporal incoherence in the synthesized videos because it
is impossible to consistently localize facial features in two frames
representing two different faces in two different videos (or even two
consecutive frames representing the same face in one video). Therefore,
high-performance face editing usually requires significant manual manipulation. In
this paper we propose a novel temporal-spatial-smooth warping (TSSW) algorithm
to effectively exploit the temporal information in two consecutive frames, as
well as the spatial smoothness within each frame. TSSW precisely estimates two
control lattices in the horizontal and vertical directions respectively from
the corresponding control lattices in the previous frame, by minimizing a novel
energy function that unifies a data-driven term, a smoothness term, and feature
point constraints. Corresponding warping surfaces then precisely map source
frames to the target frames. Experimental testing on facial attractiveness
enhancement, makeup transfer, face replacement, and expression manipulation
demonstrates that the proposed approach effectively preserves spatial
smoothness and temporal coherence when editing facial geometry, skin detail,
identity, and expression, outperforming existing face editing methods.
In particular, TSSW is robust to subtly inaccurate localization of feature
points and is a substantial improvement over image-based warping methods.
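A quadratic energy that unifies a data term, a smoothness term, and feature-point constraints, as described above, has a closed-form minimizer via its normal equations. The 1-D sketch below is only an analogue: the operator, weights, and constraint handling are assumptions, not the paper's actual 2-D control-lattice formulation.

```python
import numpy as np

# Toy 1-D analogue of minimizing
#   E(c) = ||c - c_prev||^2 + lam * ||D c||^2 + mu * ||S c - f||^2
# (data term, smoothness term, feature-point constraints).
n = 10
c_prev = np.linspace(0.0, 1.0, n)   # "previous frame" control values
feat_idx = np.array([2, 7])         # constrained control points
feat_val = np.array([0.5, 0.9])     # target feature values
lam, mu = 5.0, 1000.0               # smoothness / constraint weights

# Second-difference operator for the smoothness term
D = np.zeros((n - 2, n))
for i in range(n - 2):
    D[i, i:i + 3] = [1.0, -2.0, 1.0]
# Selector matrix picking out the constrained entries
S = np.zeros((len(feat_idx), n))
S[np.arange(len(feat_idx)), feat_idx] = 1.0

# The quadratic energy is minimized by solving the normal equations A c = b.
A = np.eye(n) + lam * D.T @ D + mu * S.T @ S
b = c_prev + mu * S.T @ feat_val
c = np.linalg.solve(A, b)
print(np.abs(c[feat_idx] - feat_val).max() < 0.05)  # True: constraints nearly met
```

In TSSW the analogous system is solved for 2-D control lattices, with the previous frame's lattices supplying the temporal data term.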
Learn to Model Motion from Blurry Footages
It is difficult to recover the motion field from real-world footage given a
mixture of camera shake and other photometric effects. In this paper we propose
a hybrid framework that interleaves a Convolutional Neural Network (CNN) and a
traditional optical flow energy. We first construct a CNN architecture using a
novel learnable directional filtering layer. This layer encodes the angle and
distance similarity matrix between blur and camera motion, which enhances
the blur features of camera-shake footage. The proposed CNNs are
then integrated into an iterative optical flow framework, which enables
modelling and solving both the blind deconvolution and the
optical flow estimation problems simultaneously. Our framework is trained
end-to-end on a synthetic dataset and yields competitive precision and
performance against state-of-the-art approaches.
Comment: Preprint of our paper accepted by Pattern Recognition.
Stacked Dense U-Nets with Dual Transformers for Robust Face Alignment
Facial landmark localisation in images captured in-the-wild is an important
and challenging problem. The current state-of-the-art revolves around certain
kinds of Deep Convolutional Neural Networks (DCNNs) such as stacked U-Nets and
Hourglass networks. In this work, we propose stacked dense U-Nets
for this task. We design a novel scale aggregation network topology
and a channel aggregation building block that improve the model's capacity
without increasing computational complexity or model size. With the
assistance of deformable convolutions inside the stacked dense U-Nets and a
coherent loss for outside data transformation, our model becomes
spatially invariant to arbitrary input face images. Extensive experiments on
many in-the-wild datasets validate the robustness of the proposed method under
extreme poses, exaggerated expressions and heavy occlusions. Finally, we show
that accurate 3D face alignment can assist pose-invariant face recognition,
where we achieve a new state-of-the-art accuracy on CFP-FP.
Iterative Residual Refinement for Joint Optical Flow and Occlusion Estimation
Deep learning approaches to optical flow estimation have seen rapid progress
over the recent years. One common trait of many networks is that they refine an
initial flow estimate either through multiple stages or across the levels of a
coarse-to-fine representation. While leading to more accurate results, the
downside of this is an increased number of parameters. Taking inspiration from
both classical energy minimization approaches as well as residual networks, we
propose an iterative residual refinement (IRR) scheme based on weight sharing
that can be combined with several backbone networks. It reduces the number of
parameters, improves the accuracy, or even achieves both. Moreover, we show
that integrating occlusion prediction and bi-directional flow estimation into
our IRR scheme can further boost the accuracy. Our full network achieves
state-of-the-art results for both optical flow and occlusion estimation across
several standard datasets.
Comment: To appear in CVPR 2019.
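The weight-sharing idea behind IRR can be sketched independently of any particular backbone. Everything below (the linear "module", the update rule, the dimensions) is a hypothetical stand-in, not the paper's network; the point is only that one shared set of weights is reused at every refinement step.

```python
import numpy as np

rng = np.random.default_rng(0)

# One shared set of weights reused at every refinement step, instead of a
# separate decoder per stage: parameter count is independent of step count.
W = rng.normal(scale=0.1, size=(8, 8))

def refine(flow, feat, steps):
    """Iterative residual refinement: each step predicts a residual update
    from the current estimate (plus features) and adds it back."""
    for _ in range(steps):
        residual = np.tanh((flow + feat) @ W)  # same W at every iteration
        flow = flow + residual                 # residual connection
    return flow

feat = rng.normal(size=8)
flow = refine(np.zeros(8), feat, steps=5)
print(flow.shape)  # (8,)
```

Running more refinement steps costs extra compute but adds no parameters, which is the trade-off the abstract highlights.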
Optical Flow in Mostly Rigid Scenes
The optical flow of natural scenes is a combination of the motion of the
observer and the independent motion of objects. Existing algorithms typically
focus on either recovering motion and structure under the assumption of a
purely static world or optical flow for general unconstrained scenes. We
combine these approaches in an optical flow algorithm that estimates an
explicit segmentation of moving objects from appearance and physical
constraints. In static regions we take advantage of strong constraints to
jointly estimate the camera motion and the 3D structure of the scene over
multiple frames. This allows us to also regularize the structure instead of the
motion. Our formulation uses a Plane+Parallax framework, which works even under
small baselines, and reduces the motion estimation to a one-dimensional search
problem, resulting in more accurate estimation. In moving regions the flow is
treated as unconstrained, and computed with an existing optical flow method.
The resulting Mostly-Rigid Flow (MR-Flow) method achieves state-of-the-art
results on both the MPI-Sintel and KITTI-2015 benchmarks.
Comment: 15 pages, 10 figures; accepted for publication at CVPR 2017.
Neural approaches to spoken content embedding
Comparing spoken segments is a central operation in speech processing.
Traditional approaches in this area have favored frame-level dynamic
programming algorithms, such as dynamic time warping, because they require no
supervision, but they are limited in performance and efficiency. As an
alternative, acoustic word embeddings -- fixed-dimensional vector
representations of variable-length spoken word segments -- have begun to be
considered for such tasks as well. However, the current space of such
discriminative embedding models, training approaches, and their application to
real-world downstream tasks is limited. We start by considering "single-view"
training losses, where the goal is to learn an acoustic word embedding model
that separates same-word and different-word spoken segment pairs. Then, we
consider "multi-view" contrastive losses. In this setting, acoustic word
embeddings are learned jointly with embeddings of character sequences to
generate acoustically grounded embeddings of written words, or acoustically
grounded word embeddings.
In this thesis, we contribute new discriminative acoustic word embedding
(AWE) and acoustically grounded word embedding (AGWE) approaches based on
recurrent neural networks (RNNs). We improve model training in terms of both
efficiency and performance. We take these developments beyond English to
several low-resource languages and show that multilingual training improves
performance when labeled data is limited. We apply our embedding models, both
monolingual and multilingual, to the downstream tasks of query-by-example
speech search and automatic speech recognition. Finally, we show how our
embedding approaches compare with and complement more recent self-supervised
speech models.
Comment: PhD thesis.
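Dynamic time warping, the frame-level baseline the abstract contrasts with embedding approaches, can be written down in a few lines. This is the textbook algorithm, not any implementation from the thesis.

```python
import numpy as np

def dtw_distance(x, y):
    """Classic dynamic time warping between two 1-D sequences.

    Returns the cumulative alignment cost between x and y, allowing
    the usual match / insertion / deletion moves.
    """
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j],      # insertion
                                 D[i, j - 1],      # deletion
                                 D[i - 1, j - 1])  # match
    return D[n, m]

# A sequence compared with a time-stretched copy of itself aligns at low cost.
a = [0.0, 1.0, 2.0, 1.0, 0.0]
b = [0.0, 0.0, 1.0, 2.0, 2.0, 1.0, 0.0]
print(dtw_distance(a, b))  # 0.0: the stretched copy aligns perfectly
```

The quadratic cost in the two sequence lengths is one reason fixed-dimensional acoustic word embeddings, which permit constant-time vector comparisons, are attractive.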
Machine learning for automatic analysis of affective behaviour
The automated analysis of affect has been gaining rapidly increasing attention from researchers over the past two decades, as it constitutes a fundamental step towards achieving next-generation computing technologies and integrating them into everyday life (e.g. via affect-aware, user-adaptive interfaces, medical imaging, health assessment, ambient intelligence etc.). The work presented in this thesis focuses on several fundamental problems on the path towards reliable, accurate and robust affect sensing systems. In more detail, the motivation behind this work lies in recent developments in the field, namely (i) the creation of large, audiovisual databases for affect analysis in the so-called "Big Data" era, along with (ii) the need to deploy systems under demanding, real-world conditions. These developments led to the requirement for the analysis of emotion expressions continuously in time, instead of merely processing static images, thus unveiling to researchers the wide range of temporal dynamics related to human behaviour. The latter entails another deviation from the traditional line of research in the field: instead of focusing on predicting posed, discrete basic emotions (happiness, surprise etc.), it became necessary to focus on spontaneous, naturalistic expressions captured under settings closer to real-world conditions, utilising more expressive emotion descriptions than a set of discrete labels. To this end, the main motivation of this thesis is to deal with challenges arising from the adoption of continuous dimensional emotion descriptions under naturalistic scenarios, which are considered to capture a much wider spectrum of expressive variability than basic emotions and, most importantly, to model emotional states commonly expressed by humans in their everyday life. In the first part of this thesis, we attempt to demystify the largely unexplored problem of predicting continuous emotional dimensions.
This work is amongst the first to explore the problem of predicting emotion dimensions via multi-modal fusion, utilising facial expressions, auditory cues and shoulder gestures. A major contribution of the work presented in this thesis lies in proposing the utilisation of various relationships exhibited by emotion dimensions in order to improve the prediction accuracy of machine learning methods - an idea which has been taken on by other researchers in the field since. In order to experimentally evaluate this, we extend methods such as the Long Short-Term Memory Neural Networks (LSTM), the Relevance Vector Machine (RVM) and Canonical Correlation Analysis (CCA) in order to exploit output relationships in learning. As it is shown, this increases the accuracy of machine learning models applied to this task.
The annotation of continuous dimensional emotions is a tedious task, highly prone to the influence of various types of noise. Performed in real time by several annotators (usually experts), the annotation process can be heavily biased by factors such as subjective interpretations of the emotional states observed, the inherent ambiguity of labels related to human behaviour, the varying reaction lags exhibited by each annotator, as well as other factors such as input device noise and annotation errors. In effect, the annotations manifest a strong spatio-temporal annotator-specific bias. Failing to properly deal with annotation bias and noise leads to an inaccurate ground truth, and therefore to ill-generalisable machine learning models. This makes the proper fusion of multiple annotations, and the inference of a clean, corrected version of the "ground truth", one of the most significant challenges in the area. A highly important contribution of this thesis lies in the introduction of Dynamic Probabilistic Canonical Correlation Analysis (DPCCA), a method aimed at fusing noisy continuous annotations. By adopting a private-shared space model, we isolate the individual characteristics that are annotator-specific and not shared, while, most importantly, we model the common, underlying annotation which is shared by annotators (i.e., the derived ground truth). By further learning temporal dynamics and incorporating a time-warping process, we are able to derive a clean version of the ground truth given multiple annotations, eliminating temporal discrepancies and other nuisances.
The integration of the temporal alignment process within the proposed private-shared space model makes DPCCA suitable for the problem of temporally aligning human behaviour; that is, given temporally unsynchronised sequences (e.g., videos of two persons smiling), the goal is to generate temporally synchronised sequences (e.g., the smile apex should co-occur in the videos). Temporal alignment is an important problem for many applications where multiple datasets need to be aligned in time. Furthermore, it is particularly suitable for the analysis of facial expressions, where the activation of facial muscles (Action Units) typically follows a set of predefined temporal phases. A highly challenging scenario is when the observations are perturbed by gross, non-Gaussian noise (e.g., occlusions), as is often the case when analysing data acquired under real-world conditions. To account for non-Gaussian noise, a robust variant of Canonical Correlation Analysis (RCCA) for robust fusion and temporal alignment is proposed. The model captures the shared, low-rank subspace of the observations, isolating the gross noise in a sparse noise term. RCCA is amongst the first robust variants of CCA proposed in the literature and, as we show in related experiments, outperforms other state-of-the-art methods for related tasks such as the fusion of multiple modalities under gross noise.
Beyond private-shared space models, Component Analysis (CA) is an integral component of most computer vision systems, particularly in terms of reducing the usually high-dimensional input spaces in a meaningful manner pertaining to the task at hand (e.g., prediction, clustering). A final, significant contribution of this thesis lies in proposing the first unifying framework for probabilistic component analysis. The proposed framework covers most well-known CA methods, such as Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), Locality Preserving Projections (LPP) and Slow Feature Analysis (SFA), providing further theoretical insights into the workings of CA. Moreover, the proposed framework is highly flexible, enabling novel CA methods to be generated by simply manipulating the connectivity of latent variables (i.e. the latent neighbourhood). As shown experimentally, methods derived via the proposed framework outperform other equivalents in several problems related to affect sensing and facial expression analysis, while providing advantages such as reduced complexity and explicit variance modelling.
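For reference, classical CCA, which the thesis extends in probabilistic, dynamic, and robust directions, reduces to an SVD of the whitened cross-covariance between the two views. The sketch below is the standard textbook construction, not DPCCA or RCCA, and the synthetic data are purely illustrative.

```python
import numpy as np

def cca_correlations(X, Y, eps=1e-8):
    """Canonical correlations between two zero-meaned views X (n x p)
    and Y (n x q): singular values of the whitened cross-covariance."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = len(X)
    Cxx = X.T @ X / (n - 1) + eps * np.eye(X.shape[1])
    Cyy = Y.T @ Y / (n - 1) + eps * np.eye(Y.shape[1])
    Cxy = X.T @ Y / (n - 1)
    Wx = np.linalg.inv(np.linalg.cholesky(Cxx))  # whitening transform for X
    Wy = np.linalg.inv(np.linalg.cholesky(Cyy))  # whitening transform for Y
    return np.clip(np.linalg.svd(Wx @ Cxy @ Wy.T, compute_uv=False), 0.0, 1.0)

# Two noisy views of the same 1-D latent signal: the leading canonical
# correlation should be close to 1.
rng = np.random.default_rng(0)
z = rng.normal(size=(500, 1))
X = z @ np.array([[1.0, -0.5, 2.0]]) + 0.1 * rng.normal(size=(500, 3))
Y = z @ np.array([[0.7, 1.5]]) + 0.1 * rng.normal(size=(500, 2))
print(cca_correlations(X, Y)[0] > 0.9)  # True
```

The shared latent signal recovered here plays the same role as the common annotation in DPCCA, where it is additionally endowed with temporal dynamics and annotator-specific private spaces.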
Functional data analytics for wearable device and neuroscience data
This thesis uses methods from functional data analysis (FDA) to solve problems from three scientific areas of study. While the areas of application are quite distinct, the common thread of functional data analysis ties them together. The first chapter describes interactive open-source software for explaining and disseminating results of functional data analyses. Chapters two and three use curve alignment, or registration, to solve common problems in accelerometry and neuroimaging, respectively. The final chapter introduces a novel regression method for modeling functional outcomes that are trajectories over time. The first chapter of this thesis details a software package for interactively visualizing functional data analyses. The software is designed to work for a wide range of datasets and several types of analyses. This chapter describes that software and provides an overview of FDA in different contexts. The second chapter introduces a framework for curve alignment, or registration, of exponential family functional data. The approach distinguishes itself from previous registration methods in its ability to handle dense binary observations with computational efficiency. Motivation comes from the Baltimore Longitudinal Study of Aging, in which accelerometer data provides valuable insights into the timing of sedentary behavior. The third chapter takes lessons learned about curve registration from the second chapter and uses them to develop methods in an entirely new context: large multisite brain imaging studies. Scanner effects in multisite imaging studies are non-biological variability due to technical differences across sites and scanner hardware. This method identifies and removes scanner effects by registering cumulative distribution functions of image intensity values. In the final chapter the focus shifts from curve registration to regression.
Described within this chapter is an entirely new nonlinear regression framework that draws from both functional data analysis and systems of ordinary differential equations. This model is motivated by the neurobiology of skilled movement, and was developed to capture the relationship between neural activity and arm movement in mice.
Smart Cameras
We review camera architecture in the age of artificial intelligence. Modern
cameras use physical components and software to capture, compress and display
image data. Over the past 5 years, deep learning solutions have become superior
to traditional algorithms for each of these functions. Deep learning enables
10-100x reduction in electrical sensor power per pixel, 10x improvement in
depth of field and dynamic range and 10-100x improvement in image pixel count.
Deep learning enables multiframe and multiaperture solutions that fundamentally
shift the goals of physical camera design. Here we review the state of the art
of deep learning in camera operations and consider the impact of AI on the
physical design of cameras.