
    Matching-CNN Meets KNN: Quasi-Parametric Human Parsing

    Both parametric and non-parametric approaches have demonstrated encouraging performance on the human parsing task, i.e., segmenting a human image into several semantic regions (e.g., hat, bag, left arm, face). In this work, we aim to develop a new solution that combines the advantages of both methodologies, namely supervision from annotated data and the flexibility to use newly annotated (possibly uncommon) images, and present a quasi-parametric human parsing model. Under the classic K Nearest Neighbor (KNN)-based non-parametric framework, a parametric Matching Convolutional Neural Network (M-CNN) is proposed to predict the matching confidence and displacement of the best-matched region in the testing image for a particular semantic region in one KNN image. Given a testing image, we first retrieve its KNN images from the annotated/manually-parsed human image corpus. Each semantic region in each KNN image is then matched with confidence to the testing image using M-CNN, and the matched regions from all KNN images are fused, followed by a superpixel smoothing procedure to obtain the final human parsing result. The M-CNN differs from the classic CNN in that tailored cross-image matching filters are introduced to characterize the matching between the testing image and the semantic region of a KNN image. These matching filters are defined at different convolutional layers, each aiming to capture a particular range of displacements. Comprehensive evaluations on a large dataset with 7,700 annotated human images demonstrate a significant performance gain of the quasi-parametric model over the state of the art for the human parsing task.
    Comment: This manuscript is the accepted version for CVPR 201
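    The pipeline described above reduces to fusing confidence-weighted, displaced region masks over the K retrieved neighbors. Below is a minimal Python sketch of that flow; the helper names (retrieve_knn, superpixel_smooth) and the m_cnn interface are illustrative stand-ins, not the authors' implementation.

    ```python
    # Minimal sketch of the quasi-parametric KNN + matching pipeline.
    # All names, shapes, and the retrieval heuristic are illustrative.
    import numpy as np

    NUM_LABELS = 18  # semantic regions, e.g. hat, bag, left arm, face

    def retrieve_knn(test_image, corpus, k):
        """Stand-in retrieval: nearest images by mean-intensity distance."""
        dists = [abs(test_image.mean() - c["image"].mean()) for c in corpus]
        return [corpus[i] for i in np.argsort(dists)[:k]]

    def superpixel_smooth(labels, image):
        """Stand-in for the superpixel smoothing step (identity here)."""
        return labels

    def parse_image(test_image, corpus, m_cnn, k=5):
        """Fuse confidence-weighted region matches from the K nearest neighbors."""
        h, w = test_image.shape[:2]
        scores = np.zeros((NUM_LABELS, h, w))
        for nb in retrieve_knn(test_image, corpus, k):
            for label, region_mask in nb["regions"].items():
                # M-CNN predicts how well this region matches and where it moves
                conf, (dy, dx) = m_cnn(test_image, nb["image"], region_mask)
                shifted = np.roll(np.roll(region_mask, dy, axis=0), dx, axis=1)
                scores[label] += conf * shifted
        return superpixel_smooth(scores.argmax(axis=0), test_image)
    ```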

    Driving Scene Perception Network: Real-time Joint Detection, Depth Estimation and Semantic Segmentation

    The demand for high-level autonomous driving has increased in recent years, and visual perception is one of the critical capabilities for fully autonomous driving. In this paper, we introduce an efficient approach for simultaneous object detection, depth estimation, and pixel-level semantic segmentation using a shared convolutional architecture. The proposed network model, which we name the Driving Scene Perception Network (DSPNet), uses multi-level feature maps and multi-task learning to improve the accuracy and efficiency of object detection, depth estimation, and image segmentation from a single input image. The resulting network model uses less than 850 MiB of GPU memory and achieves 14.0 fps on an NVIDIA GeForce GTX 1080 with a 1024x512 input image, improving both precision and efficiency over a combination of single-task networks.
    Comment: 9 pages, 7 figures, WACV'1
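    The core idea, a single shared backbone feeding three task-specific heads, can be sketched in a few lines of PyTorch. The layer sizes and head designs below are illustrative guesses, not the published DSPNet architecture.

    ```python
    # Sketch of a shared-backbone, multi-head perception network.
    import torch
    import torch.nn as nn

    class SharedBackbone(nn.Module):
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            )

        def forward(self, x):
            return self.features(x)

    class DrivingPerceptionNet(nn.Module):
        def __init__(self, num_classes=19, num_anchors=9):
            super().__init__()
            self.backbone = SharedBackbone()
            # One head per task, all reading the same shared feature map.
            self.det_head = nn.Conv2d(128, num_anchors * (4 + 1), 1)  # boxes + objectness
            self.depth_head = nn.Conv2d(128, 1, 1)                    # per-pixel depth
            self.seg_head = nn.Conv2d(128, num_classes, 1)            # per-pixel class logits

        def forward(self, x):
            f = self.backbone(x)
            return self.det_head(f), self.depth_head(f), self.seg_head(f)

    net = DrivingPerceptionNet()
    det, depth, seg = net(torch.randn(1, 3, 512, 1024))  # one forward pass, three outputs
    ```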

    Multi-view Face Detection Using Deep Convolutional Neural Networks

    In this paper we consider the problem of multi-view face detection. While there has been significant research on this problem, current state-of-the-art approaches require annotation of facial landmarks, e.g., TSM [25], or annotation of face poses [28, 22]. They also require training dozens of models to fully capture faces in all orientations, e.g., 22 models in the HeadHunter method [22]. In this paper we propose the Deep Dense Face Detector (DDFD), a method that does not require pose/landmark annotation and is able to detect faces in a wide range of orientations using a single model based on deep convolutional neural networks. The proposed method has minimal complexity; unlike other recent deep learning object detection methods [9], it does not require additional components such as segmentation, bounding-box regression, or SVM classifiers. Furthermore, we analyzed the scores of the proposed face detector for faces in different orientations and found that 1) the proposed method is able to detect faces from different angles and can handle occlusion to some extent, and 2) there appears to be a correlation between the distribution of positive examples in the training set and the scores of the proposed face detector. The latter suggests that the proposed method's performance can be further improved with better sampling strategies and more sophisticated data augmentation techniques. Evaluations on popular face detection benchmark datasets show that our single-model face detector has similar or better performance compared to previous methods, which are more complex and require annotations of either different poses or facial landmarks.
    Comment: in International Conference on Multimedia Retrieval 2015 (ICMR
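    The single-model idea amounts to running one fully convolutional scorer over an image pyramid and thresholding the resulting heatmap. The tiny network, stride, and threshold below are illustrative, not the published DDFD model.

    ```python
    # Sketch of single-model, multi-scale face detection via a conv heatmap.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FaceScorer(nn.Module):
        """Fully convolutional net: outputs a face-probability heatmap."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(3, 16, 5, stride=2, padding=2), nn.ReLU(inplace=True),
                nn.Conv2d(16, 32, 5, stride=2, padding=2), nn.ReLU(inplace=True),
                nn.Conv2d(32, 1, 1),  # 1x1 conv replaces a fully connected scorer
            )

        def forward(self, x):
            return torch.sigmoid(self.net(x))

    def detect(image, scorer, scales=(1.0, 0.7, 0.5), thresh=0.95):
        """Run the scorer over an image pyramid; no pose-specific models needed."""
        boxes = []
        stride = 4  # total downsampling factor of the scorer above
        for s in scales:
            h, w = int(image.shape[-2] * s), int(image.shape[-1] * s)
            heat = scorer(F.interpolate(image, size=(h, w), mode="bilinear",
                                        align_corners=False))
            ys, xs = torch.nonzero(heat[0, 0] > thresh, as_tuple=True)
            for y, x in zip(ys.tolist(), xs.tolist()):
                boxes.append((x * stride / s, y * stride / s, s))  # map back to input
        return boxes  # non-maximum suppression would follow in a full pipeline
    ```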

    Mask-guided Style Transfer Network for Purifying Real Images

    Recent progress in learning-by-synthesis has made it possible to train models on synthetic images, which can effectively reduce the cost of human and material resources. However, because the distribution of synthetic images differs from that of real images, the desired performance cannot be achieved. To solve this problem, previous methods learned a model to improve the realism of synthetic images. In contrast, this paper tries to purify real images by extracting discriminative and robust features, converting outdoor real images to indoor synthetic images. We first introduce segmentation masks to construct RGB-mask pairs as inputs, then design a mask-guided style transfer network that learns style features separately from the attention and background regions and learns content features from the full and attention regions. Moreover, we propose a novel region-level task-guided loss to constrain the features learned for style and content. Qualitative and quantitative experiments were performed to demonstrate the feasibility of purifying real images in complex directions. We evaluate the proposed method on several public datasets, including LPW, COCO and MPIIGaze. Experimental results show that the proposed method is effective and achieves state-of-the-art results.
    Comment: arXiv admin note: substantial text overlap with arXiv:1903.0582
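    Region-wise style learning of this kind can be illustrated with Gram statistics computed separately inside and outside the attention mask. The masked Gram matrix below is a standard style-transfer construction used here for illustration, not the paper's exact region-level loss.

    ```python
    # Sketch of a region-wise style loss with a binary attention mask.
    import torch
    import torch.nn.functional as F

    def masked_gram(feat, mask):
        """Gram matrix of features restricted to the region given by `mask`."""
        b, c, h, w = feat.shape
        m = F.interpolate(mask, size=(h, w), mode="nearest")  # match feature size
        f = (feat * m).reshape(b, c, -1)                      # zero out the other region
        return f @ f.transpose(1, 2) / m.sum().clamp(min=1.0)

    def region_style_loss(feat_out, feat_style, attn_mask):
        """Match style statistics separately for attention and background regions."""
        loss = 0.0
        for m in (attn_mask, 1.0 - attn_mask):  # attention, then background
            loss = loss + F.mse_loss(masked_gram(feat_out, m),
                                     masked_gram(feat_style, m))
        return loss
    ```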