Describe and Attend to Track: Learning Natural Language guided Structural Representation and Visual Attention for Object Tracking
The tracking-by-detection framework requires a set of positive and negative
training samples to learn robust tracking models for precise localization of
target objects. However, existing tracking models mostly treat different
samples independently while ignoring the relationship information among them. In
this paper, we propose a novel structure-aware deep neural network to overcome
such limitations. In particular, we construct a graph to represent the pairwise
relationships among training samples, and additionally use natural
language as supervisory information to learn both feature representations
and classifiers robustly. To refine the states of the target and re-track the
target when it returns to view after heavy occlusion or leaving the frame, we
design a novel subnetwork to learn target-driven visual
attention under the guidance of both visual and natural language cues.
Extensive experiments on five tracking benchmark datasets validate the
effectiveness of our proposed method.
A Survey on Content-Aware Video Analysis for Sports
Sports data analysis is becoming increasingly large-scale, diversified, and
shared, but difficulty persists in rapidly accessing the most crucial
information. Previous surveys have focused on the methodologies of sports video
analysis from the spatiotemporal viewpoint instead of a content-based
viewpoint, and few of these studies have considered semantics. This study
develops a deeper interpretation of content-aware sports video analysis by
examining the insight offered by research into the structure of content under
different scenarios. On the basis of this insight, we provide an overview of
the themes particularly relevant to the research on content-aware systems for
broadcast sports. Specifically, we focus on the video content analysis
techniques applied in sportscasts over the past decade from the perspectives of
fundamentals and general review, a content hierarchical model, and trends and
challenges. Content-aware analysis methods are discussed with respect to
object-, event-, and context-oriented groups. In each group, the gap between
sensation and content excitement must be bridged using proper strategies. In
this regard, a content-aware approach is required to determine user demands.
Finally, the paper summarizes the future trends and challenges for sports video
analysis. We believe that our findings can advance the field of research on
content-aware video analysis for broadcast sports.
Comment: Accepted for publication in IEEE Transactions on Circuits and Systems
for Video Technology (TCSVT)
DF-SLAM: A Deep-Learning Enhanced Visual SLAM System based on Deep Local Features
As the foundation of driverless vehicles and intelligent robots, Simultaneous
Localization and Mapping (SLAM) has attracted much attention these days.
However, non-geometric modules of traditional SLAM algorithms are limited by
data association tasks and have become a bottleneck preventing the development
of SLAM. To deal with such problems, many researchers have turned to deep learning for
help. But most of these studies are limited to virtual datasets or specific
environments, and even sacrifice efficiency for accuracy. Thus, they are not
practical enough.
We propose the DF-SLAM system, which uses deep local feature descriptors obtained
by a neural network as a substitute for traditional hand-crafted features.
Experimental results demonstrate its improvements in efficiency and stability.
DF-SLAM outperforms popular traditional SLAM systems in various scenes,
including challenging scenes with intense illumination changes. Its versatility
and mobility fit well with the need to explore new environments. Since we
adopt a shallow network to extract local descriptors and keep the other components the same
as in the original SLAM systems, DF-SLAM can still run in real time on a GPU.
Hierarchical Spatial-aware Siamese Network for Thermal Infrared Object Tracking
Most thermal infrared (TIR) tracking methods are discriminative, treating the
tracking problem as a classification task. However, the objective of the
classifier (label prediction) is not coupled to the objective of the tracker
(location estimation). The classification task focuses on the between-class
difference of the arbitrary objects, while the tracking task mainly deals with
the within-class difference of the same objects. In this paper, we cast the TIR
tracking problem as a similarity verification task, which is coupled well to
the objective of the tracking task. We propose a TIR tracker via a Hierarchical
Spatial-aware Siamese Convolutional Neural Network (CNN), named HSSNet. To
obtain both spatial and semantic features of the TIR object, we design a
Siamese CNN that coalesces the multiple hierarchical convolutional layers.
Then, we propose a spatial-aware network to enhance the discriminative ability
of the coalesced hierarchical feature. Subsequently, we train this network end
to end on a large visible video detection dataset to learn the similarity
between paired objects before we transfer the network into the TIR domain.
Next, this pre-trained Siamese network is used to evaluate the similarity
between the target template and target candidates. Finally, we locate the
candidate that is most similar to the tracked target. Extensive experimental
results on the benchmarks VOT-TIR 2015 and VOT-TIR 2016 show that our proposed
method achieves favourable performance compared to the state-of-the-art
methods.
Comment: 20 pages, 7 figures
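The similarity verification step described in the HSSNet abstract (score each candidate against the target template, then pick the most similar one) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the embeddings are assumed to be plain feature vectors already produced by some Siamese network, cosine similarity is an assumed scoring function, and the names `cosine_similarity` and `locate_most_similar` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero embedding as dissimilar to everything
    return dot / (norm_a * norm_b)


def locate_most_similar(template, candidates):
    """Return (index of best candidate, list of all scores).

    `template` and each entry of `candidates` stand in for embeddings
    from the two branches of a Siamese network (assumed, not from the paper).
    """
    if not candidates:
        raise ValueError("at least one candidate is required")
    scores = [cosine_similarity(template, c) for c in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


if __name__ == "__main__":
    template = [1.0, 0.0, 1.0]
    candidates = [[0.0, 1.0, 0.0], [2.0, 0.0, 2.0], [1.0, 1.0, 0.0]]
    best, scores = locate_most_similar(template, candidates)
    print("best candidate index:", best)
    print("scores:", scores)
```

In a real tracker the candidates would be crops around the previous target position, and the selected index would be mapped back to image coordinates.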
Self-Attention Recurrent Network for Saliency Detection
Feature maps in deep neural networks generally contain different semantics.
Existing methods often ignore their characteristics, which may lead to sub-optimal
results. In this paper, we propose a novel end-to-end deep saliency network
that effectively utilizes multi-scale feature maps according to their
characteristics. Shallow layers often contain more local information, and deep
layers have advantages in global semantics. Therefore, the network generates
elaborate saliency maps by enhancing local and global information of feature
maps in different layers. On the one hand, local information of shallow layers is
enhanced by a recurrent structure that shares convolution kernels across
time steps. On the other hand, global information of deep layers is utilized by
a self-attention module, which generates different attention weights for
salient objects and backgrounds and thus achieves better performance. Experimental
results on four widely used datasets demonstrate that our method has advantages
in performance over existing algorithms.
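To make the idea of a self-attention module over feature-map positions concrete, here is a minimal sketch of generic scaled dot-product self-attention. The abstract does not specify the exact form of its module, so this is an assumption: the learned query/key/value projections are replaced by the identity for brevity, each spatial position is a plain list of floats, and the function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(features):
    """Scaled dot-product self-attention with identity projections.

    `features` is a list of N vectors (one per spatial position), each of
    dimension d. Returns (attended features, N x N attention weights).
    A learned module would first project features to queries, keys, values.
    """
    n = len(features)
    d = len(features[0])
    scale = math.sqrt(d)
    outputs, weights = [], []
    for query in features:
        # Similarity of this position to every position, scaled by sqrt(d).
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in features]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of all position features.
        outputs.append([sum(w[j] * features[j][c] for j in range(n))
                        for c in range(d)])
    return outputs, weights


if __name__ == "__main__":
    feats = [[2.0, 0.0], [0.0, 2.0], [2.0, 0.0]]
    out, w = self_attention(feats)
    for row in w:
        print([round(v, 3) for v in row])
```

Each row of the weight matrix sums to one, so positions with similar features (for example, pixels on the same salient object) receive larger mutual weights than dissimilar background positions.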
cvpaper.challenge in 2016: Futuristic Computer Vision through 1,600 Papers Survey
This paper presents futuristic challenges discussed in cvpaper.challenge. In
2015 and 2016, we thoroughly studied 1,600+ papers from several
conferences and journals, such as CVPR, ICCV, ECCV, NIPS, PAMI, and IJCV.
Global and Local Sensitivity Guided Key Salient Object Re-augmentation for Video Saliency Detection
Existing deep learning based saliency methods for still images do not
consider the weighting and highlighting of features extracted from different
layers; all features contribute equally to the final saliency decision.
Such methods always detect all "potentially significant regions" evenly and
are unable to highlight the key salient object, resulting in detection failures in
dynamic scenes. In this paper, based on the fact that salient areas in videos
are relatively small and concentrated, we propose a key salient object
re-augmentation method (KSORA) using top-down semantic knowledge and bottom-up
feature guidance to improve detection accuracy in video scenes. KSORA includes
two sub-modules (WFE and KOS): WFE performs local salient feature selection
using a bottom-up strategy, while KOS ranks each object globally using
top-down statistical knowledge and chooses the most critical object area for
local enhancement. The proposed KSORA can not only strengthen the saliency
value of the local key salient object but also ensure global saliency
consistency. Results on three benchmark datasets suggest that our model has the
capability of improving the detection accuracy on complex scenes. The
strong performance of KSORA, with a speed of 17 FPS on modern GPUs, has
been verified by comparisons with ten other state-of-the-art algorithms.
Comment: 6 figures, 10 pages
cvpaper.challenge in 2015 - A review of CVPR2015 and DeepSurvey
The "cvpaper.challenge" is a group composed of members from AIST, Tokyo Denki
Univ. (TDU), and Univ. of Tsukuba that aims to systematically summarize papers
on computer vision, pattern recognition, and related fields. For this
particular review, we focused on reading all 602 conference papers
presented at CVPR2015, the premier annual computer vision event held in
June 2015, in order to grasp the trends in the field. Further, we propose
"DeepSurvey" as a mechanism embodying the entire process, from reading
all the papers, through generating ideas, to writing the paper.
Comment: Survey Paper
Siamese Attentional Keypoint Network for High Performance Visual Tracking
In this paper, we investigate the impacts of three main aspects of visual
tracking, i.e., the backbone network, the attentional mechanism, and the
detection component, and propose a Siamese Attentional Keypoint Network, dubbed
SATIN, for efficient tracking and accurate localization. Firstly, a new Siamese
lightweight hourglass network is specially designed for visual tracking. It
takes advantage of the benefits of the repeated bottom-up and top-down
inference to capture more global and local contextual information at multiple
scales. Secondly, a novel cross-attentional module is utilized to leverage both
channel-wise and spatial intermediate attentional information, which can
enhance both discriminative and localization capabilities of feature maps.
Thirdly, a keypoint detection approach is devised to track any target object
by detecting the top-left corner point, the centroid point, and the
bottom-right corner point of its bounding box. Therefore, our SATIN tracker not
only has a strong capability to learn more effective object representations,
but is also computationally and memory efficient during both the
training and testing stages. To the best of our knowledge, we are the first to
propose this approach. Without bells and whistles, experimental results
demonstrate that our approach achieves state-of-the-art performance on several
recent benchmark datasets, at a speed far exceeding 27 frames per second.
Comment: Accepted by Knowledge-Based Systems
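As a small illustration of how three detected keypoints relate to a bounding box, the sketch below builds a box from the top-left and bottom-right corners and uses the detected centroid as a consistency check. How SATIN actually combines the three points is not stated in the abstract, so the check, the pixel tolerance `tol`, and both function names are assumptions made for illustration only.

```python
def box_from_corners(top_left, bottom_right):
    """Return (x, y, width, height) from two corner points given as (x, y)."""
    x1, y1 = top_left
    x2, y2 = bottom_right
    if x2 < x1 or y2 < y1:
        raise ValueError("bottom-right corner must not precede top-left corner")
    return (x1, y1, x2 - x1, y2 - y1)


def centroid_agrees(top_left, bottom_right, centroid, tol=2.0):
    """Check that a detected centroid lies within `tol` pixels of the box center.

    A disagreement could flag a mismatched corner pair (an assumed use,
    not a documented SATIN step).
    """
    cx = (top_left[0] + bottom_right[0]) / 2.0
    cy = (top_left[1] + bottom_right[1]) / 2.0
    return abs(cx - centroid[0]) <= tol and abs(cy - centroid[1]) <= tol


if __name__ == "__main__":
    tl, br = (10, 20), (50, 80)
    print(box_from_corners(tl, br))
    print(centroid_agrees(tl, br, (30, 50)))
    print(centroid_agrees(tl, br, (40, 50)))
```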
HandSeg: An Automatically Labeled Dataset for Hand Segmentation from Depth Images
We propose an automatic method for generating high-quality annotations for
depth-based hand segmentation, and introduce a large-scale hand segmentation
dataset. Existing datasets are typically limited to a single hand. By
exploiting the visual cues given by an RGBD sensor and a pair of colored
gloves, we automatically generate dense annotations for two-hand segmentation.
This lowers the cost/complexity of creating high quality datasets, and makes it
easy to expand the dataset in the future. We further show that existing
datasets, even with data augmentation, are not sufficient to train a hand
segmentation algorithm that can distinguish two hands. Source code and datasets will
be made publicly available.