A Reverse Hierarchy Model for Predicting Eye Fixations
A body of psychological and physiological evidence suggests that early
visual attention works in a coarse-to-fine way, laying the basis for the
reverse hierarchy theory (RHT). This theory states that attention propagates
from the top level of the visual hierarchy that processes gist and abstract
information of input, to the bottom level that processes local details.
Inspired by the theory, we develop a computational model for saliency detection
in images. First, the original image is downsampled to different scales to
constitute a pyramid. Then, saliency on each layer is obtained by image
super-resolution reconstruction from the layer above, which is defined as
unpredictability from this coarse-to-fine reconstruction. Finally, saliency on
each layer of the pyramid is fused into stochastic fixations through a
probabilistic model, where attention initiates from the top layer and
propagates downward through the pyramid. Extensive experiments on two standard
eye-tracking datasets show that the proposed method can achieve competitive
results with state-of-the-art models.
Comment: CVPR 2014, 27th IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
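The coarse-to-fine idea above can be illustrated with a minimal sketch: saliency on a fine layer is taken to be the residual of reconstructing it from the layer above. This is only a schematic, with average pooling and nearest-neighbour upsampling standing in for the paper's super-resolution reconstruction and with no probabilistic fixation model; all function names are illustrative.

```python
import numpy as np

def downsample(img, factor=2):
    """Average-pool the image by the given factor (one crude pyramid layer)."""
    h, w = img.shape
    h2, w2 = h // factor, w // factor
    return img[:h2 * factor, :w2 * factor].reshape(h2, factor, w2, factor).mean(axis=(1, 3))

def upsample(img, factor=2):
    """Nearest-neighbour upsampling, a placeholder for super-resolution."""
    return np.repeat(np.repeat(img, factor, axis=0), factor, axis=1)

def coarse_to_fine_saliency(img, levels=3):
    """Saliency on each layer = unpredictability from the layer above."""
    pyramid = [img]
    for _ in range(levels - 1):
        pyramid.append(downsample(pyramid[-1]))
    saliency = []
    for fine, coarse in zip(pyramid[:-1], pyramid[1:]):
        recon = upsample(coarse)  # coarse-to-fine reconstruction
        saliency.append(np.abs(fine - recon[:fine.shape[0], :fine.shape[1]]))
    return saliency
```

A lone bright pixel on a flat background is poorly predicted by the coarse layer and therefore receives the highest saliency, matching the intuition of "unpredictability" in the abstract.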
DORT: Modeling Dynamic Objects in Recurrent for Multi-Camera 3D Object Detection and Tracking
Recent multi-camera 3D object detectors usually leverage temporal information
to construct multi-view stereo that alleviates the ill-posed depth estimation.
However, they typically assume all the objects are static and directly
aggregate features across frames. This work begins with a theoretical and
empirical analysis to reveal that ignoring the motion of moving objects can
result in serious localization bias. Therefore, we propose to model Dynamic
Objects in RecurrenT (DORT) to tackle this problem. In contrast to previous
global Bird-Eye-View (BEV) methods, DORT extracts object-wise local volumes for
motion estimation that also alleviates the heavy computational burden. By
iteratively refining the estimated object motion and location, the preceding
features can be precisely aggregated to the current frame to mitigate the
aforementioned adverse effects. This simple framework has two appealing
properties. It is flexible and practical, and can be plugged into most
camera-based 3D object detectors. As there are predictions of object
motion in the loop, it can easily track objects across frames according to
their nearest center distances. Without bells and whistles, DORT outperforms
all the previous methods on the nuScenes detection and tracking benchmarks with
62.5\% NDS and 57.6\% AMOTA, respectively. The source code will be released.
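The tracking step mentioned above, associating objects across frames by nearest center distance, admits a simple sketch. This is not the authors' code; the greedy matching scheme, function name, and distance threshold are illustrative assumptions.

```python
import numpy as np

def associate_by_center(prev_centers, curr_centers, max_dist=2.0):
    """Greedy nearest-center matching between consecutive frames.

    prev_centers: (N, 2) array of object centers in the previous frame.
    curr_centers: (M, 2) array of object centers in the current frame.
    Returns a list of (prev_idx, curr_idx) matches within max_dist.
    """
    matches = []
    used = set()
    for i, p in enumerate(prev_centers):
        dists = np.linalg.norm(curr_centers - p, axis=1)
        for j in np.argsort(dists):
            if dists[j] > max_dist:
                break  # all remaining candidates are even farther away
            if j not in used:
                used.add(int(j))
                matches.append((i, int(j)))
                break
    return matches
```

In DORT the motion predictions already in the loop would first propagate each previous center forward before this distance test, which is what makes such a simple association rule workable.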
Understanding Video Transformers for Segmentation: A Survey of Application and Interpretability
Video segmentation encompasses a wide range of categories of problem
formulation, e.g., object, scene, actor-action and multimodal video
segmentation, for delineating task-specific scene components with pixel-level
masks. Recently, approaches in this research area have shifted from
ConvNet-based to transformer-based models. In addition, various
interpretability approaches have appeared for transformer models and video
temporal dynamics, motivated by the growing interest in basic scientific
understanding, model diagnostics and societal implications of real-world
deployment. Previous surveys mainly focused on ConvNet models on a subset of
video segmentation tasks or transformers for classification tasks. Moreover,
component-wise discussion of transformer-based video segmentation models has
not yet received due focus. In addition, previous reviews of interpretability
methods focused on transformers for classification, while analysis of video
temporal dynamics modelling capabilities of video models received less
attention. In this survey, we address the above with a thorough discussion of
various categories of video segmentation, a component-wise discussion of the
state-of-the-art transformer-based models, and a review of related
interpretability methods. We first present an introduction to the different
video segmentation task categories, their objectives, specific challenges and
benchmark datasets. Next, we provide a component-wise review of recent
transformer-based models and document the state of the art on different video
segmentation tasks. Subsequently, we discuss post-hoc and ante-hoc
interpretability methods for transformer models and interpretability methods
for understanding the role of the temporal dimension in video models. Finally,
we conclude our discussion with future research directions.
V2CE: Video to Continuous Events Simulator
Dynamic Vision Sensor (DVS)-based solutions have recently garnered
significant interest across various computer vision tasks, offering notable
benefits in terms of dynamic range, temporal resolution, and inference speed.
However, as a relatively nascent vision sensor compared to Active Pixel Sensor
(APS) devices such as RGB cameras, DVS suffers from a dearth of ample labeled
datasets. Prior efforts to convert APS data into events often grapple with
issues such as a considerable domain shift from real events, the absence of
quantified validation, and layering problems within the time axis. In this
paper, we present a novel method for video-to-events stream conversion from
multiple perspectives, considering the specific characteristics of DVS. A
series of carefully designed losses helps enhance the quality of generated
event voxels significantly. We also propose a novel local dynamic-aware
timestamp inference strategy to accurately recover event timestamps from event
voxels in a continuous fashion and eliminate the temporal layering problem.
Results from rigorous validation through quantified metrics at all stages of
the pipeline establish our method as the current state-of-the-art (SOTA).
Comment: 6 pages, 7 figures.
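For context, the event-voxel representation that such video-to-events pipelines operate on can be sketched as a polarity accumulation over timestamp bins. This is the generic voxel-grid construction, not V2CE's loss design or its local dynamic-aware timestamp inference; the function name and signature are illustrative.

```python
import numpy as np

def events_to_voxel(ts, xs, ys, ps, num_bins, height, width):
    """Accumulate event polarities into a (num_bins, H, W) voxel grid.

    ts: event timestamps, xs/ys: pixel coordinates, ps: polarities (+1/-1).
    """
    voxel = np.zeros((num_bins, height, width))
    t0, t1 = ts.min(), ts.max()
    # Map each timestamp to a bin index in [0, num_bins - 1].
    bins = np.clip(((ts - t0) / max(t1 - t0, 1e-9) * num_bins).astype(int),
                   0, num_bins - 1)
    np.add.at(voxel, (bins, ys, xs), ps)  # unbuffered scatter-add
    return voxel
```

The "temporal layering" problem the abstract refers to arises because this binning discards each event's exact timestamp; recovering continuous timestamps from the voxels is precisely what the proposed inference strategy addresses.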
Pedestrian Attribute Recognition: A Survey
Recognizing pedestrian attributes is an important task in the computer vision
community because it plays a key role in video surveillance. Many algorithms
have been proposed to handle this task. The goal of this paper is to review
existing works, both traditional methods and those based on deep learning
networks. Firstly, we introduce the background of pedestrian attribute
recognition (PAR, for short), including the fundamental concepts of pedestrian
attributes and corresponding challenges. Secondly, we introduce existing
benchmarks, including popular datasets and evaluation criteria. Thirdly, we
analyse the concepts of multi-task learning and multi-label learning, and
explain the relations between these two learning paradigms and pedestrian
attribute recognition. We also review some popular network architectures which
have been widely applied in the deep learning community. Fourthly, we analyse
popular solutions for this task, such as attribute grouping and part-based
methods. Fifthly, we show some applications which take pedestrian attributes
into consideration and achieve better performance. Finally, we summarize this
paper and give several possible research directions for pedestrian attribute
recognition. The project page of this paper can be found at:
\url{https://sites.google.com/view/ahu-pedestrianattributes/}.
Comment: Check the project page for a high-resolution version of this survey.
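The multi-label formulation the survey discusses treats each pedestrian attribute (e.g. gender, backpack, hat) as an independent binary decision over a shared feature vector. A minimal sketch, with plain numpy standing in for a deep network's classifier head and all names illustrative:

```python
import numpy as np

def multi_label_predict(features, W, b, threshold=0.5):
    """Independent per-attribute sigmoid scores over shared features.

    features: (N, D) feature vectors; W: (D, A) weights for A attributes.
    Returns (N, A) binary predictions and the underlying probabilities.
    """
    logits = features @ W + b
    probs = 1.0 / (1.0 + np.exp(-logits))  # each attribute scored separately
    return (probs > threshold).astype(int), probs
```

Unlike softmax classification, the attribute scores are not forced to compete: a pedestrian can be positive for several attributes at once, which is what distinguishes the multi-label view from standard single-label recognition.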