    GlobalTrack: A Simple and Strong Baseline for Long-term Tracking

    A key capability of a long-term tracker is to search for targets in very large areas (typically the entire image) to handle possible target absences or tracking failures. However, there is currently no strong baseline for such global instance search. In this work, we aim to bridge this gap. Specifically, we propose GlobalTrack, a pure global instance search based tracker that makes no assumption on the temporal consistency of the target's positions and scales. GlobalTrack is developed based on two-stage object detectors, and it is able to perform full-image and multi-scale search of arbitrary instances with only a single query as the guide. We further propose a cross-query loss to improve the robustness of our approach against distractors. With no online learning, no punishment on position or scale changes, no scale smoothing and no trajectory refinement, our pure global instance search based tracker achieves comparable, and sometimes much better, performance on four large-scale tracking benchmarks (i.e., 52.1% AUC on LaSOT, 63.8% success rate on TLP, 60.3% MaxGM on OxUvA and 75.4% normalized precision on TrackingNet), compared to state-of-the-art approaches that typically require complex post-processing. More importantly, our tracker runs without cumulative errors, i.e., temporary tracking failures of any type do not affect its performance on future frames, making it ideal for long-term tracking. We hope this work will be a strong baseline for long-term tracking and will stimulate future work in this area. Code is available at https://github.com/huanglianghua/GlobalTrack.
    Comment: Accepted in AAAI202

    MHP-VOS: Multiple Hypotheses Propagation for Video Object Segmentation

    We address the problem of semi-supervised video object segmentation (VOS), where the masks of objects of interest are given in the first frame of an input video. To deal with challenging cases where objects are occluded or missing, previous work relies on greedy data association strategies that make decisions for each frame individually. In this paper, we propose a novel approach to defer the decision making for a target object in each frame, until a global view can be established with the entire video being taken into consideration. Our approach is in the same spirit as Multiple Hypotheses Tracking (MHT) methods, making several critical adaptations for the VOS problem. We employ the bounding box (bbox) hypothesis for tracking tree formation, and the multiple hypotheses are spawned by propagating the preceding bbox into the detected bbox proposals within a gated region starting from the initial object mask in the first frame. The gated region is determined by a gating scheme which takes into account a more comprehensive motion model rather than the simple Kalman filtering model in traditional MHT. To further tailor the algorithm to VOS, we develop a novel mask propagation score instead of the appearance similarity score that could be brittle due to large deformations. The mask propagation score, together with the motion score, determines the affinity between the hypotheses during tree pruning. Finally, a novel mask merging strategy is employed to handle mask conflicts between objects. Extensive experiments on challenging datasets demonstrate the effectiveness of the proposed method, especially when objects are missing.
    Comment: accepted to CVPR 2019 as oral presentation
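The gating and hypothesis-spawning idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' method: it swaps in a plain constant-velocity prediction and a circular gate for the paper's more comprehensive motion model, and every name here (`gate_proposals`, `spawn_hypotheses`, `gate_radius`) is hypothetical.

```python
import math

def box_center(box):
    """Center (x, y) of a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def gate_proposals(prev_box, velocity, proposals, gate_radius):
    """Keep proposals whose center lies within gate_radius of the predicted center.

    Assumption: a constant-velocity prediction stands in for the paper's
    motion model, which the abstract does not specify.
    """
    px, py = box_center(prev_box)
    px, py = px + velocity[0], py + velocity[1]
    kept = []
    for box in proposals:
        cx, cy = box_center(box)
        if math.hypot(cx - px, cy - py) <= gate_radius:
            kept.append(box)
    return kept

def spawn_hypotheses(track, velocity, proposals, gate_radius):
    """Extend one track into one child hypothesis per gated proposal.

    A track is a list of boxes, with None marking a frame where the object
    was treated as missing. No child is chosen here: the decision is
    deferred, which is the point of keeping multiple hypotheses.
    """
    last_seen = next(box for box in reversed(track) if box is not None)
    gated = gate_proposals(last_seen, velocity, proposals, gate_radius)
    children = [track + [box] for box in gated]
    children.append(track + [None])  # "object missing" branch
    return children
```

With a track ending at `(0, 0, 10, 10)`, a velocity of `(5, 0)` and a gate radius of 3, a proposal at `(4, 0, 14, 10)` is kept while one at `(50, 50, 60, 60)` is discarded, so two children result: one following the nearby proposal and one marking the object as missing.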

    Quadcopter drone formation control via onboard visual perception

    Quadcopter drone formation control is an important capability for fields such as area surveillance, search and rescue, agriculture, and reconnaissance. Of particular interest is formation control in environments where radio communications and/or GPS may be either denied or not sufficiently accurate for the desired application. To address this, we focus on vision as the sensing modality. We train an Hourglass Convolutional Neural Network (CNN) to discriminate between quadcopter pixels and non-quadcopter pixels in a live video feed and use it to guide a formation of quadcopters. The CNN outputs "heatmaps" - pixel-by-pixel likelihood estimates of the presence of a quadcopter. These heatmaps suffer from short-lived false detections. To mitigate these, we apply a version of the Siamese networks technique on consecutive frames for clutter mitigation and to promote temporal smoothness in the heatmaps. The heatmaps give an estimate of the range and bearing to the other quadcopter(s), which we use to calculate flight control commands and maintain the desired formation. We implement the algorithm on a single-board computer (ODROID XU4) with a standard webcam mounted to a quadcopter drone. Flight tests in a motion capture volume demonstrate successful formation control with two quadcopters in a leader-follower setup.
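The abstract does not say how range and bearing are derived from a heatmap. One plausible reading, sketched below under a pinhole-camera assumption, takes the bearing from the likelihood-weighted centroid and the range from the apparent width of the detection. The function names, the `threshold` parameter, and the use of a known physical target width are all assumptions made for illustration.

```python
import math

def heatmap_centroid(heatmap):
    """Likelihood-weighted (col, row) centroid of a 2D heatmap, or None if empty."""
    total = sum_c = sum_r = 0.0
    for r, row in enumerate(heatmap):
        for c, value in enumerate(row):
            total += value
            sum_c += value * c
            sum_r += value * r
    if total <= 0:
        return None
    return sum_c / total, sum_r / total

def bearing_and_range(heatmap, hfov_deg, target_width_m, threshold=0.5):
    """Estimate (bearing in radians, range in metres) to the detected quadcopter.

    Assumptions: pinhole camera, principal point at the image center,
    known horizontal field of view and known physical target width.
    Returns None if the heatmap is empty; range is None if nothing
    exceeds the threshold.
    """
    width = len(heatmap[0])
    focal_px = (width / 2) / math.tan(math.radians(hfov_deg) / 2)
    centroid = heatmap_centroid(heatmap)
    if centroid is None:
        return None
    col, _ = centroid
    # Positive bearing means the target is to the right of the optical axis.
    bearing = math.atan2(col - (width - 1) / 2, focal_px)
    # Apparent width: span of columns containing any above-threshold pixel.
    cols = [c for c in range(width) if any(row[c] >= threshold for row in heatmap)]
    if not cols:
        return bearing, None
    pixel_width = cols[-1] - cols[0] + 1
    return bearing, focal_px * target_width_m / pixel_width
```

A target centered in the image yields a bearing of zero, and the range estimate shrinks as the detection occupies more columns. In practice the estimates would be filtered over time before being turned into flight commands.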