Distance Estimation of an Unknown Person from a Portrait
We propose the first automated method for estimating the distance of an unknown person from a frontal portrait. Camera calibration is not necessary, nor is reconstructing a 3D representation of the shape of the head. Our method automatically estimates the positions of face and head landmarks in the image and then uses a regressor to predict distance from those measurements. We collected and annotated a dataset of frontal portraits of 53 individuals spanning a range of attributes (sex, age, race, hair), each photographed from seven distances. We find that our proposed method outperforms humans performing the same task. We observe that different physiognomies systematically bias the distance estimate, i.e., some people look closer than others. We explore which landmarks are most important for this task.
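To make the pipeline concrete, here is a minimal sketch of the landmark-measurement-to-distance idea, assuming a generic 2D landmark detector upstream; the pairwise distance-ratio features and the gradient-boosted regressor are illustrative choices, not the authors' exact configuration.

```python
# Minimal sketch: landmark measurements -> distance regression.
# `all_landmarks` would come from any 2D facial landmark detector
# (e.g. a 68-point predictor); feature set and regressor are assumptions.
from itertools import combinations

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def landmark_features(landmarks: np.ndarray) -> np.ndarray:
    """Pairwise inter-landmark distances, normalized so image scale cancels;
    what remains is the perspective distortion that varies with distance."""
    d = np.array([np.linalg.norm(landmarks[i] - landmarks[j])
                  for i, j in combinations(range(len(landmarks)), 2)])
    return d / d.mean()

def fit_distance_regressor(all_landmarks, distances_m):
    """all_landmarks: list of (N, 2) arrays; distances_m: camera distances in meters."""
    X = np.stack([landmark_features(lm) for lm in all_landmarks])
    return GradientBoostingRegressor().fit(X, np.asarray(distances_m))
```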
Jumping Manifolds: Geometry Aware Dense Non-Rigid Structure from Motion
Given dense image feature correspondences of a non-rigidly moving object across multiple frames, this paper proposes an algorithm to estimate its 3D shape for each frame. To solve this problem accurately, a recent state-of-the-art algorithm reduces the task to a set of local linear subspace reconstruction and clustering problems using a Grassmann manifold representation \cite{kumar2018scalable}. Unfortunately, that method overlooks some critical issues in the modeling of surface deformations, e.g., the dependence of a local surface deformation on its neighbors. Furthermore, its way of grouping high-dimensional data points inevitably inherits the drawbacks of clustering samples on a high-dimensional Grassmann manifold \cite{huang2015projection, harandi2014manifold}. Hence, to address these limitations of \cite{kumar2018scalable}, we propose an algorithm that jointly exploits the high-dimensional Grassmann manifold to perform reconstruction and its equivalent lower-dimensional representation to infer suitable clusters. To accomplish this, we project each Grassmann point onto a lower-dimensional Grassmann manifold that preserves and respects the deformation of the structure w.r.t. its neighbors. These lower-dimensional Grassmann points then act as representatives for selecting the high-dimensional Grassmann samples used in each local reconstruction. In practice, our algorithm provides a geometrically efficient way to solve dense NRSfM by switching between manifolds according to their respective benefits. Experimental results show that the proposed algorithm is very effective in handling noise, with reconstruction accuracy as good as or better than that of competing methods.

Comment: New version with corrected typo. 10 pages, 7 figures, 1 table. Accepted for publication in the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2019. Acknowledgement added. Supplementary material is available at https://suryanshkumar.github.io
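One building block above, representing local subspaces as points on a Grassmann manifold and clustering them through a lower-dimensional embedding, can be sketched as follows. This is a hedged illustration using the projection-metric embedding P = U Uᵀ followed by k-means; the dimensions and the clustering step are assumptions, not the paper's full pipeline.

```python
# Sketch: local data blocks -> Grassmann points (via thin SVD) -> clustering
# in the projection-metric embedding. Illustrative only.
import numpy as np
from sklearn.cluster import KMeans

def grassmann_point(block: np.ndarray, p: int) -> np.ndarray:
    """Orthonormal basis U (n x p) spanning the top-p left singular subspace."""
    U, _, _ = np.linalg.svd(block, full_matrices=False)
    return U[:, :p]

def cluster_subspaces(blocks, p, n_clusters):
    """Embed each subspace as the flattened projector U @ U.T, then k-means."""
    embs = []
    for B in blocks:
        U = grassmann_point(B, p)
        embs.append((U @ U.T).ravel())
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(np.stack(embs))
```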
Learning-to-Rank Meets Language: Boosting Language-Driven Ordering Alignment for Ordinal Classification
We present a novel language-driven ordering alignment method for ordinal classification. The labels in ordinal classification carry additional ordering relations, which makes them prone to overfitting when relying solely on training data. Recent advances in pre-trained vision-language models inspire us to leverage the rich ordinal priors in human language by converting the original task into a vision-language alignment task. We therefore propose L2RCLIP, which fully utilizes language priors from two perspectives. First, we introduce a complementary prompt tuning technique called RankFormer, designed to enhance the ordering relation of the original rank prompts; it employs token-level attention with residual-style prompt blending in the word embedding space. Second, to further incorporate language priors, we revisit the approximate bound optimization of the vanilla cross-entropy loss and restructure it within the cross-modal embedding space, yielding a cross-modal ordinal pairwise loss that refines the CLIP feature space so that texts and images maintain both semantic alignment and ordering alignment. Extensive experiments on three ordinal classification tasks, namely facial age estimation, historical color image (HCI) classification, and aesthetic assessment, demonstrate its promising performance. The code is available at
https://github.com/raywang335/L2RCLIP.

Comment: Accepted by NeurIPS 2023
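As a rough illustration of what a cross-modal ordinal pairwise loss can look like, the sketch below penalizes an image embedding whenever its similarity to a wrong-rank text prompt comes within a margin of its similarity to the true-rank prompt, with the margin growing with the rank gap. The margin schedule and normalization are assumptions, not L2RCLIP's exact formulation.

```python
# Sketch of an ordinal pairwise loss on CLIP-style embeddings (illustrative).
import torch
import torch.nn.functional as F

def ordinal_pairwise_loss(img_emb, txt_emb, labels, base_margin=0.05):
    """img_emb: (B, D) image embeddings; txt_emb: (R, D) rank-prompt
    embeddings; labels: (B,) integer ranks in [0, R)."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    sims = img @ txt.t()                                    # (B, R) cosine sims
    ranks = torch.arange(txt.size(0), device=labels.device)
    gap = (ranks[None, :] - labels[:, None]).abs().float()  # ordinal distance
    pos = sims.gather(1, labels[:, None])                   # sim to true rank
    # hinge: true-rank similarity should beat rank-k similarity by ~gap margin;
    # the term vanishes at the true rank itself (gap = 0).
    loss = F.relu(base_margin * gap - (pos - sims))
    return loss.mean()
```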
Discriminative Appearance Models for Face Alignment
The proposed face alignment algorithm uses local gradient features as the appearance representation. These features are obtained by pixel value comparisons, which makes them robust to changes in illumination and, owing to their locality, to partial occlusion and local deformation. The adopted features are modeled by three discriminative methods, which correspond to different alignment cost functions. The discriminative appearance modeling alleviates the generalization problem to some extent.
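Pixel-comparison features of the kind described can be sketched as binary sign tests between pixel pairs sampled around a landmark; such tests are invariant to any monotonic illumination change. The random sampling pattern below is an illustrative assumption, not the paper's exact feature design.

```python
# Sketch: BRIEF-style binary pixel-comparison features around a landmark.
import numpy as np

def pixel_comparison_features(image, center, n_pairs=128, radius=8, seed=0):
    """image: 2D grayscale array; center: (row, col) landmark location."""
    rng = np.random.default_rng(seed)
    H, W = image.shape
    r, c = center
    offs = rng.integers(-radius, radius + 1, size=(n_pairs, 2, 2))
    feats = np.empty(n_pairs, dtype=np.uint8)
    for k, ((dr1, dc1), (dr2, dc2)) in enumerate(offs):
        # clip so samples near the image border stay in bounds
        p1 = image[np.clip(r + dr1, 0, H - 1), np.clip(c + dc1, 0, W - 1)]
        p2 = image[np.clip(r + dr2, 0, H - 1), np.clip(c + dc2, 0, W - 1)]
        feats[k] = p1 > p2          # sign test, not raw intensity
    return feats
```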
Latent Multi-Attribute Transformer for Face Editing in Images
Facial attribute transformation, the ability to modify specific facial attributes in images and videos, has gained significant attention in computer vision and image processing. We first examine in depth the state-of-the-art techniques based on multiple individual and sequential Latent Transformers that modify the latent representation of an image. We then introduce a novel approach that uses a unified architecture capable of learning multiple facial attributes simultaneously in a single Transformer model. By training a single model for multiple attributes, we aim to improve efficiency, particularly speed, while maintaining competitive overall quality and outperforming the prior approach on some metrics. Disentanglement of attributes and preservation of identity are essential goals. Our results demonstrate the successful learning of 20 attributes with a unified Latent Transformer, with significant improvements in speed, identity preservation, realism, and close-to-original overall quality. In a subjective study, our model produces more appealing and preferred results than the baseline. In summary, this work contributes to the field of facial attribute transformation by providing a more elegant and scalable solution, opening avenues for further advances in disentangled face editing in images and videos.
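A unified multi-attribute latent editor of the kind described might look as follows: a single Transformer that conditions on a vector of per-attribute edit strengths and predicts a residual update to the latent code. The layer sizes, the W+-style token layout, and the conditioning scheme are illustrative assumptions, not the paper's exact architecture.

```python
# Sketch: one Transformer handling all attribute edits via a condition token.
import torch
import torch.nn as nn

class MultiAttributeLatentTransformer(nn.Module):
    def __init__(self, latent_dim=512, n_attrs=20, n_layers=4):
        super().__init__()
        # map per-attribute edit strengths to a single conditioning token
        self.attr_proj = nn.Linear(n_attrs, latent_dim)
        layer = nn.TransformerEncoderLayer(d_model=latent_dim, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.out = nn.Linear(latent_dim, latent_dim)

    def forward(self, w, attr_deltas):
        # w: (B, n_tokens, latent_dim) latent code (e.g. a StyleGAN W+ code);
        # attr_deltas: (B, n_attrs) signed edit strengths, one per attribute.
        cond = self.attr_proj(attr_deltas).unsqueeze(1)   # (B, 1, D)
        tokens = torch.cat([cond, w], dim=1)
        edited = self.encoder(tokens)[:, 1:]              # drop condition token
        return w + self.out(edited)                       # residual edit
```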
Context-sensitive dynamic ordinal regression for intensity estimation of facial action units
Modeling the intensity of facial action units from spontaneously displayed facial expressions is challenging, mainly because of high variability in subject-specific facial expressiveness, head movements, illumination changes, etc. These factors make the target problem highly context-sensitive. However, existing methods usually ignore this context-sensitivity. We propose a novel Conditional Ordinal Random Field (CORF) model for context-sensitive modeling of facial action unit intensity, where the W5+ (who, when, what, where, why and how) definition of context is used. While the proposed model is general enough to handle all six context questions, in this paper we focus on three: who (the observed subject), how (the changes in facial expressions), and when (the timing of facial expressions and their intensity). The context questions who and how are modeled by means of the newly introduced context-dependent covariate effects, and the context question when is modeled in terms of temporal correlation between the ordinal outputs, i.e., intensity levels of action units. We also introduce weighted softmax-margin learning of CRFs from data with a skewed distribution of intensity levels, as is commonly encountered in spontaneous facial data. The proposed model is evaluated on intensity estimation of pain and facial action units using two recently published datasets (UNBC Shoulder Pain and DISFA) of spontaneously displayed facial expressions. Our experiments show that the proposed model performs significantly better on the target tasks than state-of-the-art approaches. Furthermore, compared to traditional learning of CRFs, the proposed weighted learning yields more robust parameter estimation from the imbalanced intensity data.
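As a simplified, per-frame stand-in for the paper's CRF objective, the sketch below shows weighted softmax-margin learning: each example is re-weighted by the inverse frequency of its intensity level, and competing levels receive a margin proportional to their ordinal distance from the true level. Both choices are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch: inverse-frequency-weighted softmax-margin loss over ordinal levels.
import torch

def weighted_softmax_margin(scores, labels, class_freq, margin=1.0):
    """scores: (B, K) per-level scores; labels: (B,) true levels;
    class_freq: (K,) counts of each intensity level in the training data."""
    K = scores.size(1)
    levels = torch.arange(K, device=scores.device)
    # ordinal margin: penalize levels far from the truth more heavily
    m = margin * (levels[None, :] - labels[:, None]).abs().float()
    lse = torch.logsumexp(scores + m, dim=1)          # margin-augmented partition
    gold = scores.gather(1, labels[:, None]).squeeze(1)
    w = (1.0 / class_freq.float())[labels]            # inverse-frequency weights
    w = w / w.mean()                                  # keep the loss scale stable
    return (w * (lse - gold)).mean()
```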
Learned Multi-View Texture Super-Resolution
We present a super-resolution method capable of creating a high-resolution
texture map for a virtual 3D object from a set of lower-resolution images of
that object. Our architecture unifies the concepts of (i) multi-view
super-resolution based on the redundancy of overlapping views and (ii)
single-view super-resolution based on a learned prior of high-resolution (HR)
image structure. The principle of multi-view super-resolution is to invert the
image formation process and recover the latent HR texture from multiple
lower-resolution projections. We map that inverse problem into a block of
suitably designed neural network layers, and combine it with a standard
encoder-decoder network for learned single-image super-resolution. Wiring the
image formation model into the network avoids having to learn perspective
mapping from textures to images, and elegantly handles a varying number of
input views. Experiments demonstrate that the combination of multi-view
observations and a learned prior yields improved texture maps.

Comment: 11 pages, 5 figures, 2019 International Conference on 3D Vision (3DV)
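The inversion-plus-prior idea can be sketched compactly: resample each low-resolution view into texture (UV) space through precomputed texel-to-pixel sampling grids derived from the known 3D model, average with visibility masks, then refine with a learned single-image network. The grids, masks, and `sr_net` below are assumed inputs, not the paper's API.

```python
# Sketch: multi-view fusion in texture space followed by a learned refiner.
import torch
import torch.nn.functional as F

def fuse_views_into_texture(views, uv_to_image_grids, masks, sr_net):
    """views: list of (B, 3, H, W) low-res images of the object.
    uv_to_image_grids: list of (B, Ht, Wt, 2) grids mapping each texel to its
        pixel location in that view, in [-1, 1] (from the known geometry).
    masks: list of (B, 1, Ht, Wt) texel-visibility masks per view.
    sr_net: any image-to-image refinement network (the learned prior)."""
    B, Ht, Wt, _ = uv_to_image_grids[0].shape
    acc = views[0].new_zeros(B, views[0].size(1), Ht, Wt)
    weight = views[0].new_zeros(B, 1, Ht, Wt)
    for img, grid, m in zip(views, uv_to_image_grids, masks):
        acc = acc + m * F.grid_sample(img, grid, align_corners=False)
        weight = weight + m
    texture = acc / weight.clamp(min=1e-6)  # average where any view sees the texel
    return sr_net(texture)                  # apply the learned HR prior
```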