
    Learning Adaptive Discriminative Correlation Filters via Temporal Consistency Preserving Spatial Feature Selection for Robust Visual Tracking

    With efficient appearance learning models, the Discriminative Correlation Filter (DCF) has proven very successful in recent video object tracking benchmarks and competitions. However, the existing DCF paradigm suffers from two major issues: spatial boundary effects and temporal filter degradation. To mitigate these challenges, we propose a new DCF-based tracking method. The key innovations of the proposed method are adaptive spatial feature selection and temporal consistency constraints, with which the new tracker enables joint spatial-temporal filter learning in a lower-dimensional discriminative manifold. More specifically, we apply structured spatial sparsity constraints to multi-channel filters; consequently, the process of learning spatial filters can be approximated by lasso regularisation. To encourage temporal consistency, the filter model is restricted to lie around its historical value and is updated locally to preserve the global structure in the manifold. Finally, a unified optimisation framework is proposed to jointly select temporal-consistency-preserving spatial features and learn discriminative filters with the augmented Lagrangian method. Qualitative and quantitative evaluations have been conducted on a number of well-known benchmark datasets, including OTB2013, OTB50, OTB100, Temple-Colour, UAV123 and VOT2018. The experimental results demonstrate the superiority of the proposed method over state-of-the-art approaches.
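    As a rough illustration of the kind of objective the abstract describes, the sketch below writes down a multi-channel correlation-filter data term with an l1 spatial penalty and a temporal-consistency penalty, and takes a single proximal-gradient (ISTA) step. This is not the paper's method: the shapes, penalty weights, the plain l1 penalty (in place of the structured, channel-wise sparsity) and the ISTA update (in place of the augmented Lagrangian solver) are all assumptions for illustration.

```python
import numpy as np

def cconv(a, b):
    """Circular convolution via the FFT (the filter response up to a spatial flip)."""
    return np.real(np.fft.ifft2(np.fft.fft2(a) * np.fft.fft2(b)))

def ccorr(a, b):
    """Circular correlation via the FFT (adjoint of circular convolution with a)."""
    return np.real(np.fft.ifft2(np.conj(np.fft.fft2(a)) * np.fft.fft2(b)))

def dcf_objective(w, x, y, w_prev, lam_sparse=1e-2, lam_temporal=1.0):
    """Data term + l1 spatial penalty + temporal-consistency penalty.

    w, x:   (C, H, W) multi-channel filter and feature maps
    y:      (H, W) desired (e.g. Gaussian) response
    w_prev: (C, H, W) filter estimated on the previous frame
    """
    response = sum(cconv(xc, wc) for xc, wc in zip(x, w))
    data = np.sum((response - y) ** 2)
    sparse = lam_sparse * np.sum(np.abs(w))             # lasso stand-in for structured sparsity
    temporal = lam_temporal * np.sum((w - w_prev) ** 2) # keep the filter near its history
    return data + sparse + temporal

def ista_step(w, x, y, w_prev, lr=1e-4, lam_sparse=1e-2, lam_temporal=1.0):
    """One proximal-gradient update: gradient of the smooth terms,
    then soft-thresholding for the l1 penalty."""
    response = sum(cconv(xc, wc) for xc, wc in zip(x, w))
    resid = response - y
    grad = np.stack([2.0 * ccorr(xc, resid) for xc in x])
    grad += 2.0 * lam_temporal * (w - w_prev)
    w_new = w - lr * grad
    return np.sign(w_new) * np.maximum(np.abs(w_new) - lr * lam_sparse, 0.0)
```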

    Pixel-Level Deep Multi-Dimensional Embeddings for Homogeneous Multiple Object Tracking

    The goal of Multiple Object Tracking (MOT) is to locate multiple objects and keep track of their individual identities and trajectories given a sequence of (video) frames. A popular approach to MOT is tracking by detection, consisting of two processing components: detection (identification of objects of interest in individual frames) and data association (connecting data from multiple frames). This work addresses the detection component by introducing a method based on semantic instance segmentation, i.e., assigning labels to all visible pixels such that they are unique among different instances. Modern tracking methods are often built around Convolutional Neural Networks (CNNs) and additional, explicitly defined post-processing steps. This work introduces two detection methods that incorporate multi-dimensional embeddings. We train deep CNNs to produce easily clusterable embeddings for semantic instance segmentation and to enable object detection through pose estimation. The use of embeddings allows the method to identify per-pixel instance membership for both tasks. Our method specifically targets applications that require long-term tracking of homogeneous targets using a stationary camera. Furthermore, the method was developed and evaluated on a livestock tracking application, which presents exceptional challenges that generalized tracking methods are not equipped to solve, largely because contemporary datasets for multiple object tracking lack properties that are specific to livestock environments. These include a high degree of visual similarity between targets, complex physical interactions, long-term inter-object occlusions, and a fixed-cardinality set of targets. For these reasons, our method is developed and tested with the livestock application in mind; specifically, group-housed pigs are evaluated in this work. Our method reliably detects pigs in a group-housed environment based on the publicly available dataset, with 99% precision and 95% using pose estimation, and achieves 80% accuracy when using semantic instance segmentation at a 50% IoU threshold. Results demonstrate our method's ability to achieve consistent identification and tracking of group-housed livestock, even in cases where the targets are occluded and despite the fact that they lack uniquely identifying features. The pixel-level embeddings used by the proposed method are thoroughly evaluated in order to demonstrate their properties and behaviors when applied to real data. Adviser: Lance C. Pére
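    To make the "easily-clusterable embeddings" idea concrete, the sketch below groups per-pixel embeddings into instance masks with an off-the-shelf DBSCAN. The embedding shape, the foreground mask and the clustering parameters are assumptions for illustration; the authors' actual grouping procedure may differ.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def embeddings_to_instances(embeddings, fg_mask, eps=0.5, min_samples=50):
    """Cluster per-pixel embeddings into instance masks.

    embeddings: (H, W, D) float array from an embedding head
    fg_mask:    (H, W) boolean foreground mask (e.g. from a semantic head)
    Returns an (H, W) int array: 0 = background, 1..K = instance ids.
    """
    h, w, d = embeddings.shape
    fg_idx = np.flatnonzero(fg_mask.ravel())
    if fg_idx.size == 0:
        return np.zeros((h, w), dtype=np.int32)

    feats = embeddings.reshape(-1, d)[fg_idx]
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(feats)

    out = np.zeros(h * w, dtype=np.int32)
    out[fg_idx] = labels + 1   # DBSCAN noise label (-1) maps back to background (0)
    return out.reshape(h, w)
```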

    Wasserstein Auto-Encoders of Merge Trees (and Persistence Diagrams)

    This paper presents a computational framework for the Wasserstein auto-encoding of merge trees (MT-WAE), a novel extension of the classical auto-encoder neural network architecture to the Wasserstein metric space of merge trees. In contrast to traditional auto-encoders, which operate on vectorized data, our formulation explicitly manipulates merge trees on their associated metric space at each layer of the network, resulting in superior accuracy and interpretability. Our novel neural network approach can be interpreted as a non-linear generalization of previous linear attempts [79] at merge tree encoding. It also trivially extends to persistence diagrams. Extensive experiments on public ensembles demonstrate the efficiency of our algorithms, with MT-WAE computations in the order of minutes on average. We show the utility of our contributions in two applications adapted from previous work on merge tree encoding [79]. First, we apply MT-WAE to merge tree compression, by concisely representing merge trees with their coordinates in the final layer of our auto-encoder. Second, we document an application to dimensionality reduction, by exploiting the latent space of our auto-encoder for the visual analysis of ensemble data. We illustrate the versatility of our framework by introducing two penalty terms, to help preserve in the latent space both the Wasserstein distances between merge trees and their clusters. In both applications, quantitative experiments assess the relevance of our framework. Finally, we provide a C++ implementation that can be used for reproducibility. Comment: arXiv admin note: text overlap with arXiv:2207.1096
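    As a minimal sketch of what a distance-preservation penalty in the latent space could look like (not the paper's formulation), the snippet below pulls pairwise Euclidean distances between latent codes towards precomputed merge-tree Wasserstein distances; the tensor shapes and the assumption that those distances are available as a dense matrix are illustrative only.

```python
import torch

def metric_preservation_penalty(latent, wasserstein_d):
    """Encourage pairwise latent distances to match precomputed
    merge-tree Wasserstein distances.

    latent:        (N, K) latent codes from the auto-encoder
    wasserstein_d: (N, N) precomputed Wasserstein distance matrix
    """
    d_latent = torch.cdist(latent, latent, p=2)
    # Compare only the upper triangle so each pair is counted once.
    iu = torch.triu_indices(latent.shape[0], latent.shape[0], offset=1)
    return torch.mean((d_latent[iu[0], iu[1]] - wasserstein_d[iu[0], iu[1]]) ** 2)
```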

    CHITNet: A Complementary to Harmonious Information Transfer Network for Infrared and Visible Image Fusion

    Current infrared and visible image fusion (IVIF) methods go to great lengths to excavate complementary features and design complex fusion strategies, which is extremely challenging. To this end, we rethink IVIF outside the box, proposing a complementary-to-harmonious information transfer network (CHITNet). It reasonably transfers complementary information into harmonious information, which integrates both the shared and complementary features from the two modalities. Specifically, to skillfully sidestep aggregating complementary information in IVIF, we design a mutual information transfer (MIT) module to mutually represent features from the two modalities, roughly transferring complementary information into harmonious information. Then, a harmonious information acquisition supervised by source image (HIASSI) module is devised to further ensure the complementary-to-harmonious information transfer after MIT. Meanwhile, we also propose a structure information preservation (SIP) module to guarantee that the edge structure information of the source images is transferred to the fusion results. Moreover, a mutual promotion training paradigm (MPTP) with interaction loss is adopted to facilitate better collaboration among MIT, HIASSI and SIP. In this way, the proposed method is able to generate fused images of higher quality. Extensive experimental results demonstrate the superiority of our CHITNet over state-of-the-art algorithms in terms of visual quality and quantitative evaluations.
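    To give a feel for what "mutually representing features from the two modalities" might look like at the code level, here is a toy stand-in block, not the paper's MIT module: each modality's features are re-expressed from the concatenation of both streams, so complementary content gets folded into a shared representation. Channel counts, the residual connections and the plain conv-ReLU design are all assumptions.

```python
import torch
import torch.nn as nn

class MutualTransferBlock(nn.Module):
    """Toy mutual feature-transfer step between infrared and visible streams."""

    def __init__(self, channels):
        super().__init__()
        self.to_ir = nn.Sequential(nn.Conv2d(2 * channels, channels, 3, padding=1), nn.ReLU())
        self.to_vis = nn.Sequential(nn.Conv2d(2 * channels, channels, 3, padding=1), nn.ReLU())

    def forward(self, feat_ir, feat_vis):
        joint = torch.cat([feat_ir, feat_vis], dim=1)
        # Residual connections keep each modality's own information
        # while mixing in content from the other stream.
        return feat_ir + self.to_ir(joint), feat_vis + self.to_vis(joint)
```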

    Boundary-semantic collaborative guidance network with dual-stream feedback mechanism for salient object detection in optical remote sensing imagery

    With the increasing application of deep learning in various domains, salient object detection in optical remote sensing images (ORSI-SOD) has attracted significant attention. However, most existing ORSI-SOD methods predominantly rely on local information from low-level features to infer salient boundary cues and supervise them using boundary ground truth, but fail to sufficiently optimize and protect that local information, and almost all approaches ignore the potential advantages offered by the last layer of the decoder for maintaining the integrity of saliency maps. To address these issues, we propose a novel method named the boundary-semantic collaborative guidance network (BSCGNet) with a dual-stream feedback mechanism. First, we propose a boundary protection calibration (BPC) module, which effectively reduces the loss of edge position information during forward propagation and suppresses noise in low-level features without relying on boundary ground truth. Second, based on the BPC module, a dual feature feedback complementary (DFFC) module is proposed, which aggregates boundary-semantic dual features and provides effective feedback to coordinate features across different layers, thereby enhancing cross-scale knowledge communication. Finally, to obtain more complete saliency maps, we consider the uniqueness of the last layer of the decoder for the first time and propose an adaptive feedback refinement (AFR) module, which further refines feature representations and eliminates differences between features through a unique feedback mechanism. Extensive experiments on three benchmark datasets demonstrate that BSCGNet exhibits distinct advantages in challenging scenarios and outperforms 17 state-of-the-art (SOTA) approaches proposed in recent years. Code and results have been released on GitHub: https://github.com/YUHsss/BSCGNet. Comment: Accepted by TGR
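    The released code is the authoritative reference for BSCGNet. Purely as a toy illustration of how boundary cues can be protected without boundary ground truth, the sketch below gates low-level features with an edge-like map derived from the current saliency prediction; this is a stand-in, not the paper's BPC module, and the shapes and design choices are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BoundaryGate(nn.Module):
    """Toy boundary-style gating of low-level features by a predicted saliency map."""

    def __init__(self, channels):
        super().__init__()
        self.refine = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, low_feat, saliency_logits):
        sal = torch.sigmoid(saliency_logits)               # (B, 1, h, w)
        # Approximate boundary strength with finite-difference gradients
        # of the saliency map, then resize to the feature resolution.
        dx = (sal[..., :, 1:] - sal[..., :, :-1]).abs()
        dy = (sal[..., 1:, :] - sal[..., :-1, :]).abs()
        edge = F.pad(dx, (0, 1)) + F.pad(dy, (0, 0, 0, 1))
        edge = F.interpolate(edge, size=low_feat.shape[-2:],
                             mode="bilinear", align_corners=False)
        # Emphasise low-level responses near predicted boundaries.
        return low_feat + self.refine(low_feat * edge)
```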

    Deep visible and thermal image fusion for enhanced pedestrian visibility

    Reliable vision in challenging illumination conditions is one of the crucial requirements of future autonomous automotive systems. In the last decade, thermal cameras have become more easily accessible to a larger number of researchers, which has resulted in numerous studies confirming the benefits of thermal cameras in limited-visibility conditions. In this paper, we propose a learning-based method for visible and thermal image fusion that focuses on generating fused images with high visual similarity to regular truecolor (red-green-blue, or RGB) images, while introducing new informative details in pedestrian regions. The goal is to create natural, intuitive images that would be more informative to a human driver than a regular RGB camera in challenging visibility conditions. The main novelty of this paper is the idea of relying on two types of objective functions for optimization: a similarity metric between the RGB input and the fused output, to achieve a natural image appearance; and an auxiliary pedestrian detection error, to help define relevant features of human appearance and blend them into the output. We train a convolutional neural network using image samples from variable conditions (day and night) so that the network learns the appearance of humans in the different modalities and produces more robust results applicable in realistic situations. Our experiments show that the visibility of pedestrians is noticeably improved, especially in dark regions and at night. Compared to existing methods, we can better learn context and define fusion rules that focus on pedestrian appearance, which is not guaranteed with methods that focus on low-level image quality metrics.
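    The two-term objective described above can be sketched as follows. This is a minimal illustration, not the paper's loss: the l1 similarity term, the weighting factor and the placeholder detector interface (`det_loss_fn`, `targets`) are assumptions standing in for whatever similarity metric, detector and annotation format the authors actually use.

```python
import torch
import torch.nn.functional as F

def fusion_loss(fused, rgb, det_loss_fn, targets, alpha=0.1):
    """Similarity-to-RGB term plus an auxiliary pedestrian-detection term.

    fused, rgb:  (B, 3, H, W) fused output and visible input
    det_loss_fn: placeholder callable returning a detection loss on the fused image
    targets:     placeholder pedestrian annotations for the batch
    """
    similarity = F.l1_loss(fused, rgb)        # keep the fused image close to the RGB input
    detection = det_loss_fn(fused, targets)   # reward images a pedestrian detector does well on
    return similarity + alpha * detection
```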