3,832 research outputs found
Evaluation of trackers for Pan-Tilt-Zoom Scenarios
Tracking with a Pan-Tilt-Zoom (PTZ) camera has been a research topic in
computer vision for many years. Compared to tracking with a still camera, the
images captured with a PTZ camera are highly dynamic in nature because the
camera can perform large motion resulting in quickly changing capture
conditions. Furthermore, tracking with a PTZ camera involves camera control to
position the camera on the target. For successful tracking and camera control,
the tracker must be fast enough, or has to be able to predict accurately the
next position of the target. Therefore, standard benchmarks do not allow to
assess properly the quality of a tracker for the PTZ scenario. In this work, we
use a virtual PTZ framework to evaluate different tracking algorithms and
compare their performances. We also extend the framework to add target position
prediction for the next frame, accounting for camera motion and processing
delays. By doing this, we can assess if predicting can make long-term tracking
more robust as it may help slower algorithms for keeping the target in the
field of view of the camera. Results confirm that both speed and robustness are
required for tracking under the PTZ scenario.Comment: 6 pages, 2 figures, International Conference on Pattern Recognition
and Artificial Intelligence 201
SANet: Structure-Aware Network for Visual Tracking
Convolutional neural network (CNN) has drawn increasing interest in visual
tracking owing to its powerfulness in feature extraction. Most existing
CNN-based trackers treat tracking as a classification problem. However, these
trackers are sensitive to similar distractors because their CNN models mainly
focus on inter-class classification. To address this problem, we use
self-structure information of object to distinguish it from distractors.
Specifically, we utilize recurrent neural network (RNN) to model object
structure, and incorporate it into CNN to improve its robustness to similar
distractors. Considering that convolutional layers in different levels
characterize the object from different perspectives, we use multiple RNNs to
model object structure in different levels respectively. Extensive experiments
on three benchmarks, OTB100, TC-128 and VOT2015, show that the proposed
algorithm outperforms other methods. Code is released at
http://www.dabi.temple.edu/~hbling/code/SANet/SANet.html.Comment: In CVPR Deep Vision Workshop, 201
Representation, space and Hollywood Squares: Looking at things that aren't there anymore
It has been argued that the human cognitive system is capable of using spatial indexes or oculomotor coordinates to relieve working memory load (Ballard, Hayhoe, Pook & Rao, 1997) track multiple moving items through occlusion (Scholl & Pylyshyn, 1999) or link incompatible cognitive and sensorimotor codes (Bridgeman and Huemer, 1998). Here we examine the use of such spatial information in memory for semantic information. Previous research has often focused on the role of task demands and the level of automaticity in the encoding of spatial location in memory tasks. We present five experiments where location is irrelevant to the task, and participants' encoding of spatial information is measured implicitly by their looking behavior during recall. In a paradigm developed from Spivey and Geng (submitted), participants were presented with pieces of auditory, semantic information as part of an event occurring in one of four regions of a computer screen. In front of a blank grid, they were asked a question relating to one of those facts. Under certain conditions it was found that during the question period participants made significantly more saccades to the empty region of space where the semantic information had been previously presented. Our findings are discussed in relation to previous research on memory and spatial location, the dorsal and ventral streams of the visual system, and the notion of a cognitive-perceptual system using spatial indexes to exploit the stability of the external world
Looking Beyond a Clever Narrative: Visual Context and Attention are Primary Drivers of Affect in Video Advertisements
Emotion evoked by an advertisement plays a key role in influencing brand
recall and eventual consumer choices. Automatic ad affect recognition has
several useful applications. However, the use of content-based feature
representations does not give insights into how affect is modulated by aspects
such as the ad scene setting, salient object attributes and their interactions.
Neither do such approaches inform us on how humans prioritize visual
information for ad understanding. Our work addresses these lacunae by
decomposing video content into detected objects, coarse scene structure, object
statistics and actively attended objects identified via eye-gaze. We measure
the importance of each of these information channels by systematically
incorporating related information into ad affect prediction models. Contrary to
the popular notion that ad affect hinges on the narrative and the clever use of
linguistic and social cues, we find that actively attended objects and the
coarse scene structure better encode affective information as compared to
individual scene objects or conspicuous background elements.Comment: Accepted for publication in the Proceedings of 20th ACM International
Conference on Multimodal Interaction, Boulder, CO, US
Vision-based Real-Time Aerial Object Localization and Tracking for UAV Sensing System
The paper focuses on the problem of vision-based obstacle detection and
tracking for unmanned aerial vehicle navigation. A real-time object
localization and tracking strategy from monocular image sequences is developed
by effectively integrating the object detection and tracking into a dynamic
Kalman model. At the detection stage, the object of interest is automatically
detected and localized from a saliency map computed via the image background
connectivity cue at each frame; at the tracking stage, a Kalman filter is
employed to provide a coarse prediction of the object state, which is further
refined via a local detector incorporating the saliency map and the temporal
information between two consecutive frames. Compared to existing methods, the
proposed approach does not require any manual initialization for tracking, runs
much faster than the state-of-the-art trackers of its kind, and achieves
competitive tracking performance on a large number of image sequences.
Extensive experiments demonstrate the effectiveness and superior performance of
the proposed approach.Comment: 8 pages, 7 figure
Do You See What I Mean? Visual Resolution of Linguistic Ambiguities
Understanding language goes hand in hand with the ability to integrate
complex contextual information obtained via perception. In this work, we
present a novel task for grounded language understanding: disambiguating a
sentence given a visual scene which depicts one of the possible interpretations
of that sentence. To this end, we introduce a new multimodal corpus containing
ambiguous sentences, representing a wide range of syntactic, semantic and
discourse ambiguities, coupled with videos that visualize the different
interpretations for each sentence. We address this task by extending a vision
model which determines if a sentence is depicted by a video. We demonstrate how
such a model can be adjusted to recognize different interpretations of the same
underlying sentence, allowing to disambiguate sentences in a unified fashion
across the different ambiguity types.Comment: EMNLP 201
- …