385 research outputs found

    Shallow CNNs for the Reliable Detection of Facial Marks

    Get PDF

    Interspecies Knowledge Transfer for Facial Keypoint Detection

    Full text link
    We present a method for localizing facial keypoints on animals by transferring knowledge gained from human faces. Instead of directly finetuning a network trained to detect keypoints on human faces to animal faces (which is sub-optimal since human and animal faces can look quite different), we propose to first adapt the animal images to the pre-trained human detection network by correcting for the differences in animal and human face shape. We first find the nearest human neighbors for each animal image using an unsupervised shape matching method. We use these matches to train a thin plate spline warping network to warp each animal face to look more human-like. The warping network is then jointly finetuned with a pre-trained human facial keypoint detection network using an animal dataset. We demonstrate state-of-the-art results on both horse and sheep facial keypoint detection, and significant improvement over simple finetuning, especially when training data is scarce. Additionally, we present a new dataset with 3717 images with horse face and facial keypoint annotations.Comment: CVPR 2017 Camera Read

    3D Shape Descriptor-Based Facial Landmark Detection: A Machine Learning Approach

    Get PDF
    Facial landmark detection on 3D human faces has had numerous applications in the literature such as establishing point-to-point correspondence between 3D face models which is itself a key step for a wide range of applications like 3D face detection and authentication, matching, reconstruction, and retrieval, to name a few. Two groups of approaches, namely knowledge-driven and data-driven approaches, have been employed for facial landmarking in the literature. Knowledge-driven techniques are the traditional approaches that have been widely used to locate landmarks on human faces. In these approaches, a user with sucient knowledge and experience usually denes features to be extracted as the landmarks. Data-driven techniques, on the other hand, take advantage of machine learning algorithms to detect prominent features on 3D face models. Besides the key advantages, each category of these techniques has limitations that prevent it from generating the most reliable results. In this work we propose to combine the strengths of the two approaches to detect facial landmarks in a more ecient and precise way. The suggested approach consists of two phases. First, some salient features of the faces are extracted using expert systems. Afterwards, these points are used as the initial control points in the well-known Thin Plate Spline (TPS) technique to deform the input face towards a reference face model. Second, by exploring and utilizing multiple machine learning algorithms another group of landmarks are extracted. The data-driven landmark detection step is performed in a supervised manner providing an information-rich set of training data in which a set of local descriptors are computed and used to train the algorithm. We then, use the detected landmarks for establishing point-to-point correspondence between the 3D human faces mainly using an improved version of Iterative Closest Point (ICP) algorithms. Furthermore, we propose to use the detected landmarks for 3D face matching applications

    Lip Reading Sentences in the Wild

    Full text link
    The goal of this work is to recognise phrases and sentences being spoken by a talking face, with or without the audio. Unlike previous works that have focussed on recognising a limited number of words or phrases, we tackle lip reading as an open-world problem - unconstrained natural language sentences, and in the wild videos. Our key contributions are: (1) a 'Watch, Listen, Attend and Spell' (WLAS) network that learns to transcribe videos of mouth motion to characters; (2) a curriculum learning strategy to accelerate training and to reduce overfitting; (3) a 'Lip Reading Sentences' (LRS) dataset for visual speech recognition, consisting of over 100,000 natural sentences from British television. The WLAS model trained on the LRS dataset surpasses the performance of all previous work on standard lip reading benchmark datasets, often by a significant margin. This lip reading performance beats a professional lip reader on videos from BBC television, and we also demonstrate that visual information helps to improve speech recognition performance even when the audio is available
    • …
    corecore