
    Generalized Max Pooling

    State-of-the-art patch-based image representations involve a pooling operation that aggregates statistics computed from local descriptors. Standard pooling operations include sum- and max-pooling. Sum-pooling lacks discriminability because the resulting representation is strongly influenced by frequent yet often uninformative descriptors, but only weakly influenced by rare yet potentially highly-informative ones. Max-pooling equalizes the influence of frequent and rare descriptors but is only applicable to representations that rely on count statistics, such as the bag-of-visual-words (BOV) and its soft- and sparse-coding extensions. We propose a novel pooling mechanism that achieves the same effect as max-pooling but is applicable beyond the BOV and especially to the state-of-the-art Fisher Vector -- hence the name Generalized Max Pooling (GMP). It involves equalizing the similarity between each patch and the pooled representation, which is shown to be equivalent to re-weighting the per-patch statistics. We show on five public image classification benchmarks that the proposed GMP can lead to significant performance gains with respect to heuristic alternatives. Comment: To appear in CVPR 2014 - IEEE Conference on Computer Vision and Pattern Recognition (2014).
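    The equalization described in this abstract admits a clean closed form: if the per-patch encodings are stacked as columns of a matrix, requiring every patch to have the same similarity to the pooled vector becomes a ridge-regression problem. Below is a minimal NumPy sketch of that formulation; the regularizer value and the toy data are illustrative choices, not taken from the paper.

```python
import numpy as np

def generalized_max_pooling(phi, lam=1.0):
    """Pool patch encodings so each patch has (approximately) equal
    dot-product similarity with the pooled vector.

    phi : (D, N) array, one column per patch encoding.
    lam : ridge regularizer; large lam degrades toward (scaled) sum-pooling.
    """
    D, _ = phi.shape
    # Solve  min_xi ||phi^T xi - 1_N||^2 + lam ||xi||^2,
    # whose closed form is  xi = (phi phi^T + lam I)^{-1} phi 1_N.
    A = phi @ phi.T + lam * np.eye(D)
    b = phi.sum(axis=1)  # equals phi @ 1_N
    return np.linalg.solve(A, b)

# Toy check: a descriptor seen 50 times vs. one seen once.
rng = np.random.default_rng(0)
frequent = np.tile(rng.normal(size=(8, 1)), (1, 50))
rare = rng.normal(size=(8, 1))
phi = np.hstack([frequent, rare])
xi = generalized_max_pooling(phi, lam=1e-3)
print(phi.T @ xi)  # similarities come out near-equal despite the 50:1 imbalance
```

    The toy check makes the abstract's point concrete: under sum-pooling, the frequent descriptor would dominate the representation 50-to-1, whereas here both directions end up with roughly equal similarity to the pooled vector.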

    Deep learning for food instance segmentation

    Food object detection and instance segmentation are critical in many applications, such as dietary management and food intake monitoring. Food image recognition poses distinct challenges, such as a large number of classes, high inter-class similarity, and high intra-class variance. This, along with the traditional problems associated with object detection and instance segmentation, makes this a very complex computer vision task. Real-world food datasets generally exhibit long-tailed and fine-grained distributions, yet the recent literature fails to address food detection in this regard. In this research, we propose a novel two-stage object detector, which we call Strong LOng-tailed Food object Detection and instance Segmentation (SLOF-DS), to tackle the long-tailed nature of food images. In addition, a multi-task framework that exploits different sources of prior information is proposed to improve the classification of fine-grained classes. Lastly, we propose a new module based on Graph Neural Networks, which we call Graph Confidence Propagation (GCP), that further improves the performance of both the object detection and instance segmentation modules by combining all the model outputs while considering the global image context. Exhaustive quantitative and qualitative analyses on two open-source food benchmarks, namely UECFood-256 (object detection) and the AiCrowd Food Recognition Challenge 2022 dataset (instance segmentation), using different baseline algorithms demonstrate the robust improvements introduced by the components proposed in this thesis. More concretely, our approach outperforms the state of the art on both public datasets.
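    The abstract does not detail how GCP combines the model outputs, so the following is only a hypothetical illustration of the general idea: the detections in one image form a graph, and their class confidences are smoothed along feature-similarity edges so that global context can reinforce or suppress individual predictions. The function name, graph construction, and update rule below are all assumptions, not the method from the thesis.

```python
import numpy as np

def graph_confidence_propagation(feats, scores, n_steps=2, alpha=0.5):
    """Hypothetical sketch: smooth per-detection confidences over a
    similarity graph built from the detections of a single image.

    feats  : (N, D) per-detection features.
    scores : (N, C) per-detection class confidences.
    alpha  : how much of the original score each step retains.
    """
    sim = feats @ feats.T
    # Row-wise softmax turns similarities into a soft adjacency matrix
    # (self-loops included via the diagonal of sim).
    A = np.exp(sim - sim.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)
    out = scores.astype(float).copy()
    for _ in range(n_steps):
        out = alpha * out + (1 - alpha) * (A @ out)  # one message-passing step
    return out
```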

    Open Vocabulary Object Detection with Proposal Mining and Prediction Equalization

    Open-vocabulary object detection (OVD) aims to scale up the vocabulary to detect objects of novel categories beyond the training vocabulary. Recent work resorts to the rich knowledge in pre-trained vision-language models. However, existing methods are ineffective at proposal-level vision-language alignment. Meanwhile, the models usually suffer from a confidence bias toward base categories and perform worse on novel ones. To overcome these challenges, we present MEDet, a novel and effective OVD framework with proposal mining and prediction equalization. First, we design an online proposal mining strategy to refine the inherited vision-semantic knowledge from coarse to fine, allowing for proposal-level, detection-oriented feature alignment. Second, based on causal inference theory, we introduce a class-wise backdoor adjustment to reinforce the predictions on novel categories and improve overall OVD performance. Extensive experiments on the COCO and LVIS benchmarks verify the superiority of MEDet over competing approaches in detecting objects of novel categories, e.g., 32.6% AP50 on COCO and 22.4% mask mAP on LVIS.
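    The abstract names class-wise backdoor adjustment but does not spell it out. For illustration only, here is the textbook backdoor formula P(y | do(x)) = sum_z P(y | x, z) P(z) instantiated with a small confounder dictionary, a common pattern in causal-intervention work on recognition; the conditioning scheme, names, and parameters below are assumptions, not MEDet's actual module.

```python
import numpy as np

def backdoor_adjusted_scores(x, class_embeds, confounders, prior=None, tau=0.07):
    """Illustrative class-wise backdoor adjustment (hypothetical names).

    Averages class posteriors over confounder strata z instead of letting
    the training set's base-category bias drive the prediction:
        P(y | do(x)) = sum_z P(y | x, z) P(z)

    x            : (D,) region feature.
    class_embeds : (C, D) category embeddings acting as classifiers.
    confounders  : (K, D) confounder dictionary (e.g., per-class means).
    prior        : (K,) P(z); uniform if None.
    """
    K = confounders.shape[0]
    if prior is None:
        prior = np.full(K, 1.0 / K)
    probs = np.zeros(class_embeds.shape[0])
    for z, pz in zip(confounders, prior):
        logits = (x + z) @ class_embeds.T / tau  # crude stand-in for P(y | x, z)
        e = np.exp(logits - logits.max())
        probs += pz * (e / e.sum())
    return probs
```

    Because the adjusted score is an average over strata rather than a single biased posterior, a stratum where a novel category is plausible can lift that category's score even when the unadjusted classifier favors a base category.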