2,487 research outputs found

    Distance Estimation of an Unknown Person from a Portrait

    Get PDF
    We propose the first automated method for estimating distance from frontal pictures of unknown faces. Camera calibration is not necessary, nor is the reconstruction of a 3D representation of the shape of the head. Our method is based on estimating automatically the position of face and head landmarks in the image, and then using a regressor to estimate distance from such measurements. We collected and annotated a dataset of frontal portraits of 53 individuals spanning a number of attributes (sex, age, race, hair), each photographed from seven distances. We find that our proposed method outperforms humans performing the same task. We observe that different physiognomies will bias systematically the estimate of distance, i.e. some people look closer than others. We expire which landmarks are more important for this task

    Jumping Manifolds: Geometry Aware Dense Non-Rigid Structure from Motion

    Full text link
    Given dense image feature correspondences of a non-rigidly moving object across multiple frames, this paper proposes an algorithm to estimate its 3D shape for each frame. To solve this problem accurately, the recent state-of-the-art algorithm reduces this task to set of local linear subspace reconstruction and clustering problem using Grassmann manifold representation \cite{kumar2018scalable}. Unfortunately, their method missed on some of the critical issues associated with the modeling of surface deformations, for e.g., the dependence of a local surface deformation on its neighbors. Furthermore, their representation to group high dimensional data points inevitably introduce the drawbacks of categorizing samples on the high-dimensional Grassmann manifold \cite{huang2015projection, harandi2014manifold}. Hence, to deal with such limitations with \cite{kumar2018scalable}, we propose an algorithm that jointly exploits the benefit of high-dimensional Grassmann manifold to perform reconstruction, and its equivalent lower-dimensional representation to infer suitable clusters. To accomplish this, we project each Grassmannians onto a lower-dimensional Grassmann manifold which preserves and respects the deformation of the structure w.r.t its neighbors. These Grassmann points in the lower-dimension then act as a representative for the selection of high-dimensional Grassmann samples to perform each local reconstruction. In practice, our algorithm provides a geometrically efficient way to solve dense NRSfM by switching between manifolds based on its benefit and usage. Experimental results show that the proposed algorithm is very effective in handling noise with reconstruction accuracy as good as or better than the competing methods.Comment: New version with corrected typo. 10 Pages, 7 Figures, 1 Table. Accepted for publication in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2019. Acknowledgement added. Supplementary material is available at https://suryanshkumar.github.io

    Learning-to-Rank Meets Language: Boosting Language-Driven Ordering Alignment for Ordinal Classification

    Full text link
    We present a novel language-driven ordering alignment method for ordinal classification. The labels in ordinal classification contain additional ordering relations, making them prone to overfitting when relying solely on training data. Recent developments in pre-trained vision-language models inspire us to leverage the rich ordinal priors in human language by converting the original task into a visionlanguage alignment task. Consequently, we propose L2RCLIP, which fully utilizes the language priors from two perspectives. First, we introduce a complementary prompt tuning technique called RankFormer, designed to enhance the ordering relation of original rank prompts. It employs token-level attention with residual-style prompt blending in the word embedding space. Second, to further incorporate language priors, we revisit the approximate bound optimization of vanilla cross-entropy loss and restructure it within the cross-modal embedding space. Consequently, we propose a cross-modal ordinal pairwise loss to refine the CLIP feature space, where texts and images maintain both semantic alignment and ordering alignment. Extensive experiments on three ordinal classification tasks, including facial age estimation, historical color image (HCI) classification, and aesthetic assessment demonstrate its promising performance. The code is available at https://github.com/raywang335/L2RCLIP.Comment: Accepted by NeurIPS 202

    Discriminative Appearance Models for Face Alignment

    Get PDF
    The proposed face alignment algorithm uses local gradient features as the appearance representation. These features are obtained by pixel value comparison, which provide robustness against changes in illumination, as well as partial occlusion and local deformation due to the locality. The adopted features are modeled in three discriminative methods, which correspond to different alignment cost functions. The discriminative appearance modeling alleviate the generalization problem to some extent

    Latent Multi-Attribute Transformer for Face Editing in Images

    Get PDF
    Facial attribute transformation, the ability to modify specific facial attributes in images and videos, has gained significant attention in computer vision and image processing. We explored in depth the state-of-the-art techniques based on multiple individual and sequential Latent Transformers to modify the latent representation of the image. In this work, we introduce a novel approach that utilizes a unified architecture capable of learning multiple facial attributes simultaneously in a single Transformer model. By training a single model for multiple attributes, we aim to enhance efficiency, particularly in terms of speed, while maintaining competitive overall quality and outperforming some of the metrics. The disentanglement of attributes and preservation of identity are essential goals. Our results demonstrate the successful learning of 20 attributes with a unified Latent Transformer, showcasing significant improvements in speed, identity preservation, realism, and close-to-original overall quality. Through subjective study, our model demonstrates more appealing and preferred results compared to the baseline. As a summary, this work contributes to the field of facial attribute transformation by providing a more elegant and scalable solution, opening avenues for further advancements in disentangled face editing in images and videos

    Context-sensitive dynamic ordinal regression for intensity estimation of facial action units

    Get PDF
    Modeling intensity of facial action units from spontaneously displayed facial expressions is challenging mainly because of high variability in subject-specific facial expressiveness, head-movements, illumination changes, etc. These factors make the target problem highly context-sensitive. However, existing methods usually ignore this context-sensitivity of the target problem. We propose a novel Conditional Ordinal Random Field (CORF) model for context-sensitive modeling of the facial action unit intensity, where the W5+ (who, when, what, where, why and how) definition of the context is used. While the proposed model is general enough to handle all six context questions, in this paper we focus on the context questions: who (the observed subject), how (the changes in facial expressions), and when (the timing of facial expressions and their intensity). The context questions who and howare modeled by means of the newly introduced context-dependent covariate effects, and the context question when is modeled in terms of temporal correlation between the ordinal outputs, i.e., intensity levels of action units. We also introduce a weighted softmax-margin learning of CRFs from data with skewed distribution of the intensity levels, which is commonly encountered in spontaneous facial data. The proposed model is evaluated on intensity estimation of pain and facial action units using two recently published datasets (UNBC Shoulder Pain and DISFA) of spontaneously displayed facial expressions. Our experiments show that the proposed model performs significantly better on the target tasks compared to the state-of-the-art approaches. Furthermore, compared to traditional learning of CRFs, we show that the proposed weighted learning results in more robust parameter estimation from the imbalanced intensity data

    Learned Multi-View Texture Super-Resolution

    Full text link
    We present a super-resolution method capable of creating a high-resolution texture map for a virtual 3D object from a set of lower-resolution images of that object. Our architecture unifies the concepts of (i) multi-view super-resolution based on the redundancy of overlapping views and (ii) single-view super-resolution based on a learned prior of high-resolution (HR) image structure. The principle of multi-view super-resolution is to invert the image formation process and recover the latent HR texture from multiple lower-resolution projections. We map that inverse problem into a block of suitably designed neural network layers, and combine it with a standard encoder-decoder network for learned single-image super-resolution. Wiring the image formation model into the network avoids having to learn perspective mapping from textures to images, and elegantly handles a varying number of input views. Experiments demonstrate that the combination of multi-view observations and learned prior yields improved texture maps.Comment: 11 pages, 5 figures, 2019 International Conference on 3D Vision (3DV
    • …
    corecore