Learning Adaptive Discriminative Correlation Filters via Temporal Consistency Preserving Spatial Feature Selection for Robust Visual Tracking
With efficient appearance learning models, the Discriminative Correlation
Filter (DCF) has proven very successful in recent video object tracking
benchmarks and competitions. However, the existing DCF paradigm suffers from
two major issues, namely the spatial boundary effect and temporal filter
degradation. To mitigate these challenges, we propose a new DCF-based tracking
method. The key innovations of the proposed method include adaptive spatial
feature selection and temporally consistent constraints, with which the new
tracker enables joint spatial-temporal filter learning in a lower-dimensional
discriminative manifold. More specifically, we apply structured spatial
sparsity constraints to the multi-channel filters. Consequently, the process of
learning spatial filters can be approximated by lasso regularisation. To
encourage temporal consistency, the filter model is restricted to lie around
its historical value and updated locally to preserve the global structure in
the manifold. Finally, a unified optimisation framework is proposed to jointly
select temporal consistency preserving spatial features and learn
discriminative filters with the augmented Lagrangian method. Qualitative and
quantitative evaluations have been conducted on a number of well-known
benchmarking datasets such as OTB2013, OTB50, OTB100, Temple-Colour, UAV123 and
VOT2018. The experimental results demonstrate the superiority of the proposed
method over the state-of-the-art approaches.
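To make the formulation concrete, the following minimal sketch (plain NumPy; all names, shapes and hyper-parameters are illustrative assumptions, not the paper's augmented-Lagrangian solver) shows how a lasso penalty for spatial feature selection can be combined with a quadratic term that keeps the filter near its historical value:

    import numpy as np

    def prox_l1(w, lam):
        # Soft-thresholding: proximal operator of the l1 (lasso) penalty.
        return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

    def update_filter(w_prev, X, y, lam=0.01, mu=0.1, n_iter=200):
        # Minimise ||Xw - y||^2 + lam*||w||_1 + mu*||w - w_prev||^2
        # by proximal gradient descent: the l1 term induces spatial
        # sparsity (feature selection), the quadratic term enforces
        # temporal consistency with the previous filter.
        L = 2 * np.linalg.norm(X, 2) ** 2 + 2 * mu   # Lipschitz constant
        lr = 1.0 / L
        w = w_prev.copy()
        for _ in range(n_iter):
            grad = 2 * X.T @ (X @ w - y) + 2 * mu * (w - w_prev)
            w = prox_l1(w - lr * grad, lr * lam)
        return w

    # Toy usage with random data; shapes are illustrative only.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 50))   # vectorised multi-channel features
    y = rng.standard_normal(200)         # desired correlation response
    w = update_filter(np.zeros(50), X, y)
    print("non-zero filter weights:", np.count_nonzero(w))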
Pixel-Level Deep Multi-Dimensional Embeddings for Homogeneous Multiple Object Tracking
The goal of Multiple Object Tracking (MOT) is to locate multiple objects and keep track of their individual identities and trajectories given a sequence of (video) frames. A popular approach to MOT is tracking by detection, consisting of two processing components: detection (identification of objects of interest in individual frames) and data association (connecting data from multiple frames). This work addresses the detection component by introducing a method based on semantic instance segmentation, i.e., assigning labels to all visible pixels such that they are unique among different instances. Modern tracking methods are often built around Convolutional Neural Networks (CNNs) and additional, explicitly-defined post-processing steps.
This work introduces two detection methods that incorporate multi-dimensional embeddings. We train deep CNNs to produce easily-clusterable embeddings for semantic instance segmentation and to enable object detection through pose estimation. The use of embeddings allows the method to identify per-pixel instance membership for both tasks.
Our method specifically targets applications that require long-term tracking of homogeneous targets using a stationary camera. Furthermore, this method was developed and evaluated on a livestock tracking application which presents exceptional challenges that generalized tracking methods are not equipped to solve. This is largely because contemporary datasets for multiple object tracking lack properties that are specific to livestock environments. These include a high degree of visual similarity between targets, complex physical interactions, long-term inter-object occlusions, and a fixed-cardinality set of targets.
For the reasons stated above, our method is developed and tested with the livestock application in mind; specifically, group-housed pigs are evaluated in this work. Our method reliably detects pigs in a group-housed environment on the publicly available dataset, with 99% precision and 95% recall using pose estimation, and achieves 80% accuracy when using semantic instance segmentation at a 50% IoU threshold.
Results demonstrate our method's ability to achieve consistent identification and tracking of group-housed livestock, even in cases where the targets are occluded and despite the fact that they lack uniquely identifying features. The pixel-level embeddings used by the proposed method are thoroughly evaluated in order to demonstrate their properties and behaviors when applied to real data.
Adviser: Lance C. Pérez
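As a rough illustration of how "easily-clusterable" per-pixel embeddings yield instance labels, the sketch below greedily groups foreground pixels by embedding distance (NumPy only; the threshold, shapes, and greedy rule are stand-in assumptions, not the thesis's actual clustering procedure):

    import numpy as np

    def cluster_pixel_embeddings(emb, fg_mask, merge_dist=0.5):
        # emb     : (H, W, D) per-pixel embeddings from a trained CNN
        # fg_mask : (H, W) boolean foreground mask
        # Returns an (H, W) int array: 0 = background, 1..K = instance ids.
        labels = np.zeros(emb.shape[:2], dtype=int)
        centers = []                      # reference embedding per instance
        ys, xs = np.nonzero(fg_mask)
        for y, x in zip(ys, xs):
            v = emb[y, x]
            if centers:
                d = np.linalg.norm(np.stack(centers) - v, axis=1)
                k = int(np.argmin(d))
                if d[k] < merge_dist:     # join the nearest existing instance
                    labels[y, x] = k + 1
                    continue
            centers.append(v)             # open a new instance
            labels[y, x] = len(centers)
        return labels

    # Toy example: two well-separated embedding blobs become two instances.
    emb = np.zeros((4, 4, 2))
    emb[:, 2:] = 3.0
    mask = np.ones((4, 4), dtype=bool)
    print(cluster_pixel_embeddings(emb, mask))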
Wasserstein Auto-Encoders of Merge Trees (and Persistence Diagrams)
This paper presents a computational framework for the Wasserstein
auto-encoding of merge trees (MT-WAE), a novel extension of the classical
auto-encoder neural network architecture to the Wasserstein metric space of
merge trees. In contrast to traditional auto-encoders, which operate on
vectorized data, our formulation explicitly manipulates merge trees on their
associated metric space at each layer of the network, resulting in superior
accuracy and interpretability. Our novel neural network approach can be
interpreted as a non-linear generalization of previous linear attempts [79] at
merge tree encoding. It also trivially extends to persistence diagrams.
Extensive experiments on public ensembles demonstrate the efficiency of our
algorithms, with MT-WAE computations on the order of minutes on average. We
show the utility of our contributions in two applications adapted from previous
work on merge tree encoding [79]. First, we apply MT-WAE to merge tree
compression, by concisely representing them with their coordinates in the final
layer of our auto-encoder. Second, we document an application to dimensionality
reduction, by exploiting the latent space of our auto-encoder, for the visual
analysis of ensemble data. We illustrate the versatility of our framework by
introducing two penalty terms, to help preserve in the latent space both the
Wasserstein distances between merge trees, as well as their clusters. In both
applications, quantitative experiments assess the relevance of our framework.
Finally, we provide a C++ implementation that can be used for reproducibility.
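The first of the two penalty terms has a simple schematic reading: pairwise Euclidean distances between latent codes should track the Wasserstein distances between the corresponding merge trees. A toy version in NumPy (illustrative only; the actual MT-WAE layers manipulate merge trees rather than flat vectors, and all names are assumptions):

    import numpy as np

    def distance_preservation_penalty(latents, wasserstein_d):
        # latents       : (N, d) latent coordinates from the auto-encoder
        # wasserstein_d : (N, N) precomputed Wasserstein distance matrix
        diff = latents[:, None, :] - latents[None, :, :]
        latent_d = np.sqrt((diff ** 2).sum(-1) + 1e-12)
        # Penalise mismatch between latent geometry and Wasserstein geometry.
        return ((latent_d - wasserstein_d) ** 2).mean()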
CHITNet: A Complementary to Harmonious Information Transfer Network for Infrared and Visible Image Fusion
Current infrared and visible image fusion (IVIF) methods go to great lengths
to excavate complementary features and design complex fusion strategies, which
is extremely challenging. To this end, we rethink IVIF outside the box,
proposing a complementary to harmonious information transfer network (CHITNet).
It transfers complementary information into harmonious information, integrating
both the shared and complementary features of the two modalities.
Specifically, to skillfully sidestep aggregating complementary information in
IVIF, we design a mutual information transfer (MIT) module to mutually
represent features from the two modalities, roughly transferring complementary
information into harmonious information. Then, a harmonious information
acquisition supervised by source image (HIASSI) module is devised to further ensure the
complementary to harmonious information transfer after MIT. Meanwhile, we also
propose a structure information preservation (SIP) module to guarantee that the
edge structure information of the source images can be transferred to the
fusion results. Moreover, a mutual promotion training paradigm (MPTP) with
interaction loss is adopted to facilitate better collaboration among MIT,
HIASSI and SIP. In this way, the proposed method is able to generate fused
images of higher quality. Extensive experimental results demonstrate the
superiority of our CHITNet over state-of-the-art algorithms in terms of visual
quality and quantitative evaluations.
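To make the mutual-representation idea tangible, here is a toy PyTorch sketch in the spirit of the MIT module and its interaction loss (layer sizes, the single-convolution design, and the loss form are illustrative assumptions, not CHITNet's actual architecture):

    import torch
    import torch.nn as nn

    class MutualTransfer(nn.Module):
        # Each modality's features are re-expressed from the other's,
        # nudging complementary content toward a shared (harmonious)
        # representation before fusion.
        def __init__(self, ch=16):
            super().__init__()
            self.enc_ir = nn.Conv2d(1, ch, 3, padding=1)   # infrared branch
            self.enc_vis = nn.Conv2d(1, ch, 3, padding=1)  # visible branch
            self.ir_from_vis = nn.Conv2d(ch, ch, 1)        # cross transfer
            self.vis_from_ir = nn.Conv2d(ch, ch, 1)
            self.fuse = nn.Conv2d(2 * ch, 1, 3, padding=1)

        def forward(self, ir, vis):
            f_ir, f_vis = self.enc_ir(ir), self.enc_vis(vis)
            # Mutual representation: predict each feature map from the other.
            f_ir_hat = self.ir_from_vis(f_vis)
            f_vis_hat = self.vis_from_ir(f_ir)
            # An interaction loss pulls the two representations together.
            inter_loss = (f_ir_hat - f_ir).abs().mean() + \
                         (f_vis_hat - f_vis).abs().mean()
            fused = self.fuse(torch.cat([f_ir + f_vis_hat,
                                         f_vis + f_ir_hat], 1))
            return fused, inter_loss

    ir, vis = torch.rand(1, 1, 64, 64), torch.rand(1, 1, 64, 64)
    fused, loss = MutualTransfer()(ir, vis)
    print(fused.shape, float(loss))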
Boundary-semantic collaborative guidance network with dual-stream feedback mechanism for salient object detection in optical remote sensing imagery
With the increasing application of deep learning in various domains, salient
object detection in optical remote sensing images (ORSI-SOD) has attracted
significant attention. However, most existing ORSI-SOD methods rely on local
information from low-level features to infer salient boundary cues, supervising
them with boundary ground truth, yet fail to sufficiently optimize and protect
this local information; moreover, almost all approaches ignore the potential
advantages offered by the last layer of the decoder for maintaining the
integrity of saliency maps. To address these issues, we propose a novel
method named boundary-semantic collaborative guidance network (BSCGNet) with
dual-stream feedback mechanism. First, we propose a boundary protection
calibration (BPC) module, which effectively reduces the loss of edge position
information during forward propagation and suppresses noise in low-level
features without relying on boundary ground truth. Second, based on the BPC
module, a dual feature feedback complementary (DFFC) module is proposed, which
aggregates boundary-semantic dual features and provides effective feedback to
coordinate features across different layers, thereby enhancing cross-scale
knowledge communication. Finally, to obtain more complete saliency maps, we
consider the uniqueness of the last layer of the decoder for the first time and
propose the adaptive feedback refinement (AFR) module, which further refines
feature representation and eliminates differences between features through a
unique feedback mechanism. Extensive experiments on three benchmark datasets
demonstrate that BSCGNet exhibits distinct advantages in challenging scenarios
and outperforms 17 state-of-the-art (SOTA) approaches proposed in recent
years. Code and results have been released on GitHub:
https://github.com/YUHsss/BSCGNet.
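The feedback idea can be sketched in a few lines of PyTorch; the module below re-injects a coarse saliency map to re-weight the features it came from (a schematic stand-in for the AFR mechanism, with layer shapes and the number of feedback steps chosen arbitrarily):

    import torch
    import torch.nn as nn

    class FeedbackRefine(nn.Module):
        # The coarse saliency map from the last decoder layer is fed back
        # together with its source features, and a refined map is predicted.
        def __init__(self, ch=32, steps=2):
            super().__init__()
            self.steps = steps
            self.head = nn.Conv2d(ch, 1, 3, padding=1)        # saliency head
            self.refine = nn.Conv2d(ch + 1, ch, 3, padding=1) # feedback step

        def forward(self, feats):
            sal = self.head(feats)
            for _ in range(self.steps):
                # Feed the current map back in alongside the features.
                feats = torch.relu(
                    self.refine(torch.cat([feats, sal.sigmoid()], 1)))
                sal = self.head(feats)
            return sal

    feats = torch.rand(1, 32, 56, 56)
    print(FeedbackRefine()(feats).shape)   # torch.Size([1, 1, 56, 56])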
Deep visible and thermal image fusion for enhanced pedestrian visibility
Reliable vision in challenging illumination conditions is one of the crucial requirements of future autonomous automotive systems. In the last decade, thermal cameras have become more easily accessible to a larger number of researchers. This has resulted in numerous studies confirming the benefits of thermal cameras in limited-visibility conditions. In this paper, we propose a learning-based method for visible and thermal image fusion that focuses on generating fused images with high visual similarity to regular truecolor (red-green-blue or RGB) images, while introducing new informative details in pedestrian regions. The goal is to create natural, intuitive images that would be more informative to a human driver than a regular RGB camera in challenging visibility conditions. The main novelty of this paper is the idea of relying on two types of objective functions for optimization: a similarity metric between the RGB input and the fused output to achieve a natural image appearance, and an auxiliary pedestrian detection error to help define relevant features of the human appearance and blend them into the output. We train a convolutional neural network using image samples from variable conditions (day and night) so that the network learns the appearance of humans in the different modalities and produces more robust results applicable in realistic situations. Our experiments show that the visibility of pedestrians is noticeably improved, especially in dark regions and at night. Compared to existing methods, we can better learn context and define fusion rules that focus on pedestrian appearance, which is not guaranteed with methods that focus on low-level image quality metrics.
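The two-term objective can be written schematically as follows (PyTorch; L1 stands in for the paper's similarity metric and per-pixel binary cross-entropy for the detection error, and the weighting `alpha` is an assumption):

    import torch
    import torch.nn.functional as F

    def fusion_loss(fused, rgb, det_logits, det_targets, alpha=0.5):
        # (1) Similarity term: keep the fused image close to the RGB input
        #     so the result retains a natural, truecolor appearance.
        sim = F.l1_loss(fused, rgb)
        # (2) Auxiliary detection term: penalise pedestrian detection error
        #     on the fused image so human-relevant details are blended in.
        det = F.binary_cross_entropy_with_logits(det_logits, det_targets)
        return sim + alpha * det

    # Toy usage with random tensors; the detector outputs are hypothetical.
    fused = torch.rand(1, 3, 128, 128, requires_grad=True)
    rgb = torch.rand(1, 3, 128, 128)
    logits = torch.randn(1, 1, 128, 128)
    targets = torch.randint(0, 2, (1, 1, 128, 128)).float()
    print(float(fusion_loss(fused, rgb, logits, targets)))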