
    Hourglass Face Detector for Hard Face

    Face detection is an upstream task of facial image analysis. In many real-world scenarios, we need to detect small, occluded, or dense faces that are hard to detect, yet hard face detection remains challenging, particularly when balancing accuracy against inference speed in real-world applications. This paper proposes an Hourglass Face Detector (HFD) for hard faces by developing a deep one-stage fully-convolutional hourglass network, which achieves an excellent balance between accuracy and inference speed. To this end, the HFD first shrinks the feature map with a series of strided convolutional layers rather than pooling layers, so that useful subtle information is better preserved. Second, it exploits context information by merging fine-grained shallow feature maps with deep ones rich in semantic information, fusing detailed and semantic cues to achieve better detection of small faces. Moreover, the HFD exploits prior and multi-scale information from the training data to enhance its scale invariance and the adaptability of its anchor scales. Compared with the SSH and S3FD methods, the HFD achieves higher average precision on hard faces as well as faster inference. Experiments on the WIDER FACE and FDDB datasets demonstrate the superior performance of the proposed method.
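    The two architectural ideas above (strided downsampling in place of pooling, and merging shallow detail with deep semantics) can be sketched in a few lines. The following PyTorch snippet is a minimal illustration under assumed layer sizes; the module names (StridedDownBlock, FuseBlock) and channel counts are hypothetical, not the actual HFD architecture.

```python
# Minimal sketch of two ideas from the abstract: downsampling with strided
# convolutions instead of pooling, and fusing a fine-grained shallow feature
# map with an upsampled deep (semantic) one. Sizes are illustrative
# assumptions, not the HFD architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StridedDownBlock(nn.Module):
    """Halves spatial resolution with a stride-2 convolution (no pooling)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1)

    def forward(self, x):
        return F.relu(self.conv(x))

class FuseBlock(nn.Module):
    """Upsamples a deep feature map and merges it with a shallow one."""
    def __init__(self, c_deep, c_shallow, c_out):
        super().__init__()
        self.lateral = nn.Conv2d(c_shallow, c_out, kernel_size=1)
        self.reduce = nn.Conv2d(c_deep, c_out, kernel_size=1)

    def forward(self, deep, shallow):
        deep = F.interpolate(self.reduce(deep), size=shallow.shape[-2:],
                             mode="bilinear", align_corners=False)
        return F.relu(deep + self.lateral(shallow))  # element-wise fusion

# Usage: shrink, then merge detail with semantics for small-face detection.
x = torch.randn(1, 64, 128, 128)
shallow = x                          # fine-grained, high-resolution features
deep = StridedDownBlock(64, 128)(x)  # semantic, lower-resolution features
fused = FuseBlock(128, 64, 128)(deep, shallow)
print(fused.shape)                   # torch.Size([1, 128, 128, 128])
```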

    Towards Accurate Multi-person Pose Estimation in the Wild

    We propose a method for multi-person detection and 2-D pose estimation that achieves state-of-the-art results on the challenging COCO keypoints task. It is a simple, yet powerful, top-down approach consisting of two stages. In the first stage, we predict the location and scale of boxes which are likely to contain people; for this we use the Faster R-CNN detector. In the second stage, we estimate the keypoints of the person potentially contained in each proposed bounding box. For each keypoint type we predict dense heatmaps and offsets using a fully convolutional ResNet. To combine these outputs we introduce a novel aggregation procedure to obtain highly localized keypoint predictions. We also use a novel form of keypoint-based Non-Maximum Suppression (NMS), instead of the cruder box-level NMS, and a novel form of keypoint-based confidence score estimation, instead of box-level scoring. Trained on COCO data alone, our final system achieves an average precision of 0.649 on the COCO test-dev set and 0.643 on the test-standard set, outperforming the winner of the 2016 COCO keypoints challenge and other recent state-of-the-art methods. Further, by using additional in-house labeled data we obtain an even higher average precision of 0.685 on the test-dev set and 0.673 on the test-standard set, more than 5% absolute improvement over the previous best-performing method on the same dataset.
    Comment: Paper describing an improved version of the G-RMI entry to the 2016 COCO keypoints challenge (http://image-net.org/challenges/ilsvrc+coco2016). Camera-ready version to appear in the Proceedings of CVPR 2017.
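    To make the heatmap-plus-offset idea concrete, here is a minimal NumPy sketch of decoding a single keypoint: take the strongest heatmap cell and refine its position with the predicted offset at that cell. This argmax-based decoding is a simplification assumed for illustration; the paper's actual aggregation procedure is more involved.

```python
# Minimal sketch: turn a per-keypoint heatmap plus a 2-D offset field into a
# sub-pixel keypoint location. Illustrative simplification, not the paper's
# full aggregation procedure.
import numpy as np

def decode_keypoint(heatmap, offsets, stride=8):
    """heatmap: (H, W) activation; offsets: (H, W, 2) dy/dx in pixels."""
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    dy, dx = offsets[y, x]
    # Map the grid cell back to image coordinates, refined by the offset.
    return (y * stride + dy, x * stride + dx), heatmap[y, x]

# Usage with random stand-in network outputs:
H, W = 32, 32
heatmap = np.random.rand(H, W)
offsets = np.random.randn(H, W, 2)
(py, px), score = decode_keypoint(heatmap, offsets)
print(f"keypoint at ({py:.1f}, {px:.1f}) with score {score:.2f}")
```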

    Evaluating the Performance of Vision Transformer Architecture for Deepfake Image Classification

    Deepfake classification has seen impressive results lately; by experimenting with various deep learning methodologies, researchers have designed state-of-the-art detection techniques. This study applies Transformers, a de facto standard in Natural Language Processing (NLP) and text processing, to computer vision. Transformers use a mechanism called “self-attention”, which differs from CNN and LSTM architectures. The study uses a technique that treats an image as a sequence of 16x16-pixel “words” (Dosovitskiy et al., 2021) to train a deep neural network with self-attention blocks to detect deepfakes. Position embeddings of the image patches are passed to the Transformer blocks to classify manipulated images from the CELEB-DF-v2 dataset. Furthermore, the difference between the mean accuracy of this model and an existing state-of-the-art detection technique based on a residual CNN is tested for statistical significance. The two models are compared primarily on accuracy and loss. The Vision Transformer based model achieved state-of-the-art performance with 97.07% accuracy, compared to 91.78% for the ResNet-18 model.
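    The patch-based input described above (16x16 “words” plus position embeddings) can be sketched as follows in PyTorch. The dimensions and module name (PatchEmbed) are illustrative assumptions in the spirit of Dosovitskiy et al., not the exact model used in the study.

```python
# Minimal sketch of a ViT-style front end: split an image into 16x16 patches,
# linearly embed them, and add learned position embeddings before the
# Transformer blocks. Dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch) ** 2
        # A stride-16 conv is equivalent to flattening 16x16 patches + linear.
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))

    def forward(self, x):
        x = self.proj(x).flatten(2).transpose(1, 2)   # (B, N, dim) tokens
        cls = self.cls.expand(x.shape[0], -1, -1)
        return torch.cat([cls, x], dim=1) + self.pos  # prepend [CLS], add pos

# Usage: the [CLS] token's final representation feeds a real/fake classifier.
tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 197, 768])
```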

    Unsupervised learning of object landmarks by factorized spatial embeddings

    Automatically learning the structure of object categories remains an important open problem in computer vision. In this paper, we propose a novel unsupervised approach that can discover and learn landmarks in object categories, thus characterizing their structure. Our approach is based on factorizing image deformations, as induced by a viewpoint change or an object deformation, by learning a deep neural network that detects landmarks consistently with such visual effects. Furthermore, we show that the learned landmarks establish meaningful correspondences between different object instances in a category without having to impose this requirement explicitly. We assess the method qualitatively on a variety of object types, natural and man-made. We also show that our unsupervised landmarks are highly predictive of manually-annotated landmarks in face benchmark datasets, and can be used to regress these with a high degree of accuracy.
    Comment: To be published in ICCV 2017.
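    The core constraint the abstract describes, detecting landmarks consistently under a known image deformation, can be sketched as an equivariance loss: the detector applied to a warped image should agree with the warp applied to the detected landmarks. The tiny network and translation warp below are stand-in assumptions, not the paper's model or deformation family.

```python
# Minimal sketch of the equivariance idea: detect(warp(img)) should match
# warp(detect(img)) for a known deformation g. Toy detector and warp only.
import torch
import torch.nn as nn

class TinyDetector(nn.Module):
    """Stand-in network predicting K landmark coordinates from an image."""
    def __init__(self, k=5):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3, 8, 3, stride=2, padding=1),
                                 nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(1),
                                 nn.Flatten(),
                                 nn.Linear(8, k * 2))
        self.k = k

    def forward(self, x):
        return self.net(x).view(-1, self.k, 2)  # (B, K, 2) coordinates

def equivariance_loss(detector, img, warped_img, warp_points):
    """Penalize disagreement between warped detections and detections on
    the warped image."""
    return ((warp_points(detector(img)) - detector(warped_img)) ** 2).mean()

# Usage: a pure translation as the known deformation g.
det = TinyDetector()
img = torch.randn(4, 3, 64, 64)
shift = torch.tensor([2.0, -1.0])                      # (dy, dx)
warped = torch.roll(img, shifts=(2, -1), dims=(2, 3))  # shift image pixels
loss = equivariance_loss(det, img, warped, lambda p: p + shift)
loss.backward()
```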