
    Improving Object Detection with MatrixNets

    Object detection is a popular task in computer vision with various applications, from pedestrian detection to face detection. Following the success of Convolutional Neural Networks (CNNs), many CNN-based object detectors have been proposed to solve the object detection task. Early CNN-based detectors suggested using deeper networks to detect objects in images. However, deeper networks cannot capture objects of varied sizes and aspect ratios with high accuracy. Thus, CNN-based detectors face two main challenges: scale invariance (detecting objects at multiple scales) and aspect-ratio invariance (detecting objects at various aspect ratios). Modern CNN-based object detectors have two main components: a backbone network that learns features from an image and an output network that leverages these features to make predictions. Scale and aspect-ratio invariance are typically added by changing either the backbone or the output network. Adding scale awareness to the output network is often computationally expensive. Thus, a popular method to add scale invariance by changing the backbone is Feature Pyramid Networks (FPNs) [Lin et al. 2017]. FPNs create a hierarchy of features at different scales and implicitly capture objects at various resolutions, so they can identify objects at different scales without resizing the input image. However, FPNs have a square bias and favour square objects over asymmetric ones. One way to alleviate the square bias of FPNs is to add template anchor boxes of various sizes, biasing the detector towards non-square objects. However, anchor boxes are set as hyperparameters and add computational overhead to the network. Newer architectures have thus moved towards anchor-free techniques; however, they still rely on FPNs, which are square-biased.

    Recently, MatrixNets [Rashwan et al. 2019] has been proposed as a general-purpose aspect-ratio-aware extension of FPNs that can explicitly model aspect ratios better than anchor boxes while keeping the model anchor-free. MatrixNets expands the FPN backbone by applying asymmetrically strided convolutions to create skewed receptive fields, making rectangular objects appear more square to the network (a minimal sketch of this idea appears below). While MatrixNets has been shown to improve keypoint-based object detectors significantly, the original implementation makes significant changes to the architecture, making it difficult to isolate the impact of MatrixNets alone.

    In this thesis, we explore MatrixNets as a viable method to add aspect-ratio awareness. Specifically, we study MatrixNets along three axes: 1) does MatrixNets make anchor-based detectors anchor-free? 2) does MatrixNets add aspect-ratio awareness to object detectors? and 3) can MatrixNets be used for other, more complicated computer vision tasks like instance segmentation? We explore these questions via three case studies. We demonstrate the effectiveness of MatrixNets by replacing the anchor boxes in RetinaNet [Lin et al. 2017] with our MatrixNets module, showing better performance on skewed boxes while making the detector anchor-free. Then, we extend the anchor-free CornerNet [Law et al. 2018] to x-CornerNet to support multiple output heads and smaller backbones. We then apply MatrixNets to x-CornerNet and demonstrate a similar improvement on skewed boxes, leading to an overall 5.6% mAP improvement on MS COCO and achieving competitive results. Finally, we add MatrixNets to Mask R-CNN [He et al. 2017] to tackle the instance segmentation task.
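    To make the asymmetric-convolution idea concrete, the following is a minimal PyTorch sketch (the class name, strides, and 3x3 matrix layout are illustrative assumptions, not the authors' code): downsampling width and height independently turns one FPN level into a matrix of feature maps whose skewed receptive fields suit different aspect ratios.

        import torch
        import torch.nn as nn

        # One cell of a MatrixNets-style feature matrix (illustrative sketch).
        class MatrixLayer(nn.Module):
            def __init__(self, channels, stride_h, stride_w):
                super().__init__()
                # An asymmetric stride skews the receptive field: a layer with
                # stride_w > stride_h makes wide boxes look square to later layers.
                self.conv = nn.Conv2d(channels, channels, kernel_size=3,
                                      stride=(stride_h, stride_w), padding=1)

            def forward(self, x):
                return self.conv(x)

        # Expand a single FPN level into a 3x3 matrix of aspect-ratio-aware maps.
        fpn_feat = torch.randn(1, 256, 64, 64)
        matrix = {(i, j): MatrixLayer(256, 2 ** i, 2 ** j)(fpn_feat)
                  for i in range(3) for j in range(3)}
        print(matrix[(0, 2)].shape)  # height kept at 64, width downsampled to 16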
    While object detection draws bounding boxes delineating an object, instance segmentation goes one step further and draws pixel-wise masks delineating objects. This level of detail makes instance segmentation a more difficult problem to solve than object detection. We propose a new loss function, Mask Edge Loss (MEL), that leverages mask contours to reduce coarseness in predicted masks, thereby achieving higher accuracy. Together, these three case studies demonstrate the effectiveness of MatrixNets for adding aspect-ratio awareness to object detectors. The code base for our implementation will be made public.
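    The abstract does not spell out MEL's exact formulation, so the following is only a hedged sketch of one plausible edge-aware mask loss: extract soft mask boundaries with a morphological gradient (dilation minus erosion, both via max-pooling) and penalize boundary disagreement with an L1 term. All function names below are hypothetical.

        import torch
        import torch.nn.functional as F

        def soft_edges(mask, k=3):
            # Morphological gradient: dilation minus erosion, both computed
            # with max-pooling, yields a soft boundary map for a [B, 1, H, W]
            # probability mask.
            pad = k // 2
            dilated = F.max_pool2d(mask, k, stride=1, padding=pad)
            eroded = -F.max_pool2d(-mask, k, stride=1, padding=pad)
            return dilated - eroded

        def mask_edge_loss(pred, target):
            # Hypothetical edge-aware term: compare boundary maps of the
            # predicted and ground-truth masks with an L1 penalty.
            return F.l1_loss(soft_edges(pred), soft_edges(target))

        pred = torch.rand(2, 1, 28, 28)                       # mask probabilities
        target = (torch.rand(2, 1, 28, 28) > 0.5).float()     # binary ground truth
        print(mask_edge_loss(pred, target).item())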

    Classifying All Interacting Pairs in a Single Shot

    In this paper, we introduce a novel human interaction detection approach based on CALIPSO (Classifying ALl Interacting Pairs in a Single shOt), a classifier of human-object interactions. This new single-shot interaction classifier estimates interactions simultaneously for all human-object pairs, regardless of their number and class. State-of-the-art approaches adopt a multi-shot strategy based on a pairwise estimate of interactions for a set of human-object candidate pairs, which leads to a complexity depending, at least, on the number of interactions or, at most, on the number of candidate pairs. In contrast, the proposed method estimates the interactions on the whole image. Indeed, it simultaneously estimates all interactions between all human subjects and object targets by performing a single forward pass over the image. Consequently, its complexity and computation time are constant, independent of the number of subjects, objects, or interactions in the image. In detail, interaction classification is achieved on a dense grid of anchors thanks to a joint multi-task network that learns three complementary tasks simultaneously: (i) prediction of the types of interaction, (ii) estimation of the presence of a target, and (iii) learning of an embedding that maps interacting subjects and targets to the same representation, using a metric learning strategy. In addition, we introduce an object-centric passive-voice verb estimation which significantly improves results. Evaluations on the two well-known human-object interaction image datasets, V-COCO and HICO-DET, demonstrate the competitiveness of the proposed method (2nd place) compared to the state of the art, while having constant computation time regardless of the number of objects and interactions in the image.

    Comment: WACV 2020 (to appear).
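    As an illustrative sketch of such a single-shot multi-task head (module names, channel counts, and the verb count are assumptions, not the paper's code): three 1x1-convolution branches over a dense feature grid predict interaction types, target presence, and a pairing embedding, so interacting subjects and targets can be matched by embedding distance in one forward pass.

        import torch
        import torch.nn as nn

        class InteractionHead(nn.Module):
            # Hypothetical CALIPSO-style head with three parallel branches.
            def __init__(self, in_ch=256, num_verbs=26, embed_dim=64):
                super().__init__()
                self.verbs = nn.Conv2d(in_ch, num_verbs, 1)   # (i) interaction types
                self.presence = nn.Conv2d(in_ch, 1, 1)        # (ii) target presence
                self.embed = nn.Conv2d(in_ch, embed_dim, 1)   # (iii) pairing embedding

            def forward(self, feat):
                return (self.verbs(feat).sigmoid(),
                        self.presence(feat).sigmoid(),
                        self.embed(feat))

        head = InteractionHead()
        verbs, presence, emb = head(torch.randn(1, 256, 32, 32))
        # Anchors whose embeddings lie close together are declared an interacting
        # pair; the cost is one pass over the grid, independent of pair count.
        print(verbs.shape, presence.shape, emb.shape)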

    OpenNet: Incremental Learning for Autonomous Driving Object Detection with Balanced Loss

    Automated driving object detection has always been a challenging task in computer vision due to environmental uncertainties, which include significant differences in object sizes and encounters with unseen classes. Traditional object detection models may perform poorly when applied directly to automated driving detection because they usually presume fixed categories of common traffic participants, such as pedestrians and cars. Worse, the huge class imbalance between common and novel classes further exacerbates the performance degradation. To address these issues, we propose OpenNet, which moderates the class imbalance with a Balanced Loss based on Cross Entropy Loss. Besides, we adopt an inductive layer based on gradient reshaping to quickly learn new classes from limited samples during incremental learning. To counter catastrophic forgetting, we employ normalized feature distillation. Furthermore, we improve multi-scale detection robustness and unknown-class recognition through FPN and energy-based detection, respectively. Experimental results on the CODA dataset show that the proposed method obtains better performance than the existing methods.
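    The paper's exact Balanced Loss is not specified in this abstract; a minimal sketch of one common cross-entropy variant weighted against class imbalance (inverse-frequency weights, normalized to sum to the number of classes) could look like the following, with all names hypothetical.

        import torch
        import torch.nn.functional as F

        def balanced_ce_loss(logits, targets, class_counts):
            # Rarer classes receive larger weights so common classes do not
            # dominate the gradient; weights are normalized to sum to C.
            freq = class_counts.float() / class_counts.sum()
            weights = 1.0 / freq
            weights = weights * len(class_counts) / weights.sum()
            return F.cross_entropy(logits, targets, weight=weights)

        logits = torch.randn(8, 5)                       # 8 samples, 5 classes
        targets = torch.randint(0, 5, (8,))
        counts = torch.tensor([1000, 500, 50, 10, 5])    # common vs. novel classes
        print(balanced_ce_loss(logits, targets, counts).item())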