462 research outputs found

    SeqTrack: Sequence to Sequence Learning for Visual Object Tracking

    Full text link
    In this paper, we present a new sequence-to-sequence learning framework for visual tracking, dubbed SeqTrack. It casts visual tracking as a sequence generation problem, which predicts object bounding boxes in an autoregressive fashion. This is different from prior Siamese trackers and transformer trackers, which rely on designing complicated head networks, such as classification and regression heads. SeqTrack only adopts a simple encoder-decoder transformer architecture. The encoder extracts visual features with a bidirectional transformer, while the decoder generates a sequence of bounding box values autoregressively with a causal transformer. The loss function is a plain cross-entropy. Such a sequence learning paradigm not only simplifies tracking framework, but also achieves competitive performance on benchmarks. For instance, SeqTrack gets 72.5% AUC on LaSOT, establishing a new state-of-the-art performance. Code and models are available at here.Comment: CVPR2023 pape

    CiteTracker: Correlating Image and Text for Visual Tracking

    Full text link
    Existing visual tracking methods typically take an image patch as the reference of the target to perform tracking. However, a single image patch cannot provide a complete and precise concept of the target object as images are limited in their ability to abstract and can be ambiguous, which makes it difficult to track targets with drastic variations. In this paper, we propose the CiteTracker to enhance target modeling and inference in visual tracking by connecting images and text. Specifically, we develop a text generation module to convert the target image patch into a descriptive text containing its class and attribute information, providing a comprehensive reference point for the target. In addition, a dynamic description module is designed to adapt to target variations for more effective target representation. We then associate the target description and the search image using an attention-based correlation module to generate the correlated features for target state reference. Extensive experiments on five diverse datasets are conducted to evaluate the proposed algorithm and the favorable performance against the state-of-the-art methods demonstrates the effectiveness of the proposed tracking method.Comment: accepted by ICCV 202

    Learning to Segment Dynamic Objects using SLAM Outliers

    Full text link
    We present a method to automatically learn to segment dynamic objects using SLAM outliers. It requires only one monocular sequence per dynamic object for training and consists in localizing dynamic objects using SLAM outliers, creating their masks, and using these masks to train a semantic segmentation network. We integrate the trained network in ORB-SLAM 2 and LDSO. At runtime we remove features on dynamic objects, making the SLAM unaffected by them. We also propose a new stereo dataset and new metrics to evaluate SLAM robustness. Our dataset includes consensus inversions, i.e., situations where the SLAM uses more features on dynamic objects that on the static background. Consensus inversions are challenging for SLAM as they may cause major SLAM failures. Our approach performs better than the State-of-the-Art on the TUM RGB-D dataset in monocular mode and on our dataset in both monocular and stereo modes.Comment: Accepted to ICPR 202

    Resource-Efficient RGBD Aerial Tracking

    Get PDF
    corecore