2,185 research outputs found

    Ensemble of Part Detectors for Simultaneous Classification and Localization

    Full text link
    Part-based representation has been proven to be effective for a variety of visual applications. However, automatic discovery of discriminative parts without object/part-level annotations is challenging. This paper proposes a discriminative mid-level representation paradigm based on the responses of a collection of part detectors, which only requires the image-level labels. Towards this goal, we first develop a detector-based spectral clustering method to mine the representative and discriminative mid-level patterns for detector initialization. The advantage of the proposed pattern mining technology is that the distance metric based on detectors only focuses on discriminative details, and a set of such grouped detectors offer an effective way for consistent pattern mining. Relying on the discovered patterns, we further formulate the detector learning process as a confidence-loss sparse Multiple Instance Learning (cls-MIL) task, which considers the diversity of the positive samples, while avoid drifting away the well localized ones by assigning a confidence value to each positive sample. The responses of the learned detectors can form an effective mid-level image representation for both image classification and object localization. Experiments conducted on benchmark datasets demonstrate the superiority of our method over existing approaches

    Recycle deep features for better object detection

    Full text link
    Aiming at improving the performance of existing detection algorithms developed for different applications, we propose a region regression-based multi-stage class-agnostic detection pipeline, whereby the existing algorithms are employed for providing the initial detection proposals. Better detection is obtained by exploiting the power of deep learning in the region regress scheme while avoiding the requirement on a huge amount of reference data for training deep neural networks. Additionally, a novel network architecture with recycled deep features is proposed, which provides superior regression results compared to the commonly used architectures. As demonstrated on a data set with ~1200 samples of different classes, it is feasible to successfully train a deep neural network in our proposed architecture and use it to obtain the desired detection performance. Since only slight modifications are required to common network architectures and since the deep neural network is trained using the standard hyperparameters, the proposed detection is well accessible and can be easily adopted to a broad variety of detection tasks

    Where are the Blobs: Counting by Localization with Point Supervision

    Full text link
    Object counting is an important task in computer vision due to its growing demand in applications such as surveillance, traffic monitoring, and counting everyday objects. State-of-the-art methods use regression-based optimization where they explicitly learn to count the objects of interest. These often perform better than detection-based methods that need to learn the more difficult task of predicting the location, size, and shape of each object. However, we propose a detection-based method that does not need to estimate the size and shape of the objects and that outperforms regression-based methods. Our contributions are three-fold: (1) we propose a novel loss function that encourages the network to output a single blob per object instance using point-level annotations only; (2) we design two methods for splitting large predicted blobs between object instances; and (3) we show that our method achieves new state-of-the-art results on several challenging datasets including the Pascal VOC and the Penguins dataset. Our method even outperforms those that use stronger supervision such as depth features, multi-point annotations, and bounding-box labels

    E2E-MLT - an Unconstrained End-to-End Method for Multi-Language Scene Text

    Full text link
    An end-to-end trainable (fully differentiable) method for multi-language scene text localization and recognition is proposed. The approach is based on a single fully convolutional network (FCN) with shared layers for both tasks. E2E-MLT is the first published multi-language OCR for scene text. While trained in multi-language setup, E2E-MLT demonstrates competitive performance when compared to other methods trained for English scene text alone. The experiments show that obtaining accurate multi-language multi-script annotations is a challenging problem

    Deep Learning for Generic Object Detection: A Survey

    Full text link
    Object detection, one of the most fundamental and challenging problems in computer vision, seeks to locate object instances from a large number of predefined categories in natural images. Deep learning techniques have emerged as a powerful strategy for learning feature representations directly from data and have led to remarkable breakthroughs in the field of generic object detection. Given this period of rapid evolution, the goal of this paper is to provide a comprehensive survey of the recent achievements in this field brought about by deep learning techniques. More than 300 research contributions are included in this survey, covering many aspects of generic object detection: detection frameworks, object feature representation, object proposal generation, context modeling, training strategies, and evaluation metrics. We finish the survey by identifying promising directions for future research.Comment: IJCV Mino

    Cascaded Sparse Spatial Bins for Efficient and Effective Generic Object Detection

    Full text link
    A novel efficient method for extraction of object proposals is introduced. Its "objectness" function exploits deep spatial pyramid features, a novel fast-to-compute HoG-based edge statistic and the EdgeBoxes score. The efficiency is achieved by the use of spatial bins in a novel combination with sparsity-inducing group normalized SVM. State-of-the-art recall performance is achieved on Pascal VOC07, significantly outperforming methods with comparable speed. Interestingly, when only 100 proposals per image are considered the method attains 78% recall on VOC07. The method improves mAP of the RCNN state-of-the-art class-specific detector, increasing it by 10 points when only 50 proposals are used in each image. The system trained on twenty classes performs well on the two hundred class ILSVRC2013 set confirming generalization capability.Comment: Accepted to ICCV1

    Recurrent Attentional Reinforcement Learning for Multi-label Image Recognition

    Full text link
    Recognizing multiple labels of images is a fundamental but challenging task in computer vision, and remarkable progress has been attained by localizing semantic-aware image regions and predicting their labels with deep convolutional neural networks. The step of hypothesis regions (region proposals) localization in these existing multi-label image recognition pipelines, however, usually takes redundant computation cost, e.g., generating hundreds of meaningless proposals with non-discriminative information and extracting their features, and the spatial contextual dependency modeling among the localized regions are often ignored or over-simplified. To resolve these issues, this paper proposes a recurrent attention reinforcement learning framework to iteratively discover a sequence of attentional and informative regions that are related to different semantic objects and further predict label scores conditioned on these regions. Besides, our method explicitly models long-term dependencies among these attentional regions that help to capture semantic label co-occurrence and thus facilitate multi-label recognition. Extensive experiments and comparisons on two large-scale benchmarks (i.e., PASCAL VOC and MS-COCO) show that our model achieves superior performance over existing state-of-the-art methods in both performance and efficiency as well as explicitly identifying image-level semantic labels to specific object regions.Comment: Accepted at AAAI 201

    Global-Local Face Upsampling Network

    Full text link
    Face hallucination, which is the task of generating a high-resolution face image from a low-resolution input image, is a well-studied problem that is useful in widespread application areas. Face hallucination is particularly challenging when the input face resolution is very low (e.g., 10 x 12 pixels) and/or the image is captured in an uncontrolled setting with large pose and illumination variations. In this paper, we revisit the algorithm introduced in [1] and present a deep interpretation of this framework that achieves state-of-the-art under such challenging scenarios. In our deep network architecture the global and local constraints that define a face can be efficiently modeled and learned end-to-end using training data. Conceptually our network design can be partitioned into two sub-networks: the first one implements the holistic face reconstruction according to global constraints, and the second one enhances face-specific details and enforces local patch statistics. We optimize the deep network using a new loss function for super-resolution that combines reconstruction error with a learned face quality measure in adversarial setting, producing improved visual results. We conduct extensive experiments in both controlled and uncontrolled setups and show that our algorithm improves the state of the art both numerically and visually

    Enhancing Salient Object Segmentation Through Attention

    Full text link
    Segmenting salient objects in an image is an important vision task with ubiquitous applications. The problem becomes more challenging in the presence of a cluttered and textured background, low resolution and/or low contrast images. Even though existing algorithms perform well in segmenting most of the object(s) of interest, they often end up segmenting false positives due to resembling salient objects in the background. In this work, we tackle this problem by iteratively attending to image patches in a recurrent fashion and subsequently enhancing the predicted segmentation mask. Saliency features are estimated independently for every image patch, which are further combined using an aggregation strategy based on a Convolutional Gated Recurrent Unit (ConvGRU) network. The proposed approach works in an end-to-end manner, removing background noise and false positives incrementally. Through extensive evaluation on various benchmark datasets, we show superior performance to the existing approaches without any post-processing.Comment: CVPRW - Deep Vision 201

    Accurate Face Detection for High Performance

    Full text link
    Face detection has witnessed significant progress due to the advances of deep convolutional neural networks (CNNs). Its central issue in recent years is how to improve the detection performance of tiny faces. To this end, many recent works propose some specific strategies, redesign the architecture and introduce new loss functions for tiny object detection. In this report, we start from the popular one-stage RetinaNet approach and apply some recent tricks to obtain a high performance face detector. Specifically, we apply the Intersection over Union (IoU) loss function for regression, employ the two-step classification and regression for detection, revisit the data augmentation based on data-anchor-sampling for training, utilize the max-out operation for classification and use the multi-scale testing strategy for inference. As a consequence, the proposed face detection method achieves state-of-the-art performance on the most popular and challenging face detection benchmark WIDER FACE dataset.Comment: 9 pages, 3 figures, technical repor
    corecore