58,360 research outputs found
Joint Video and Text Parsing for Understanding Events and Answering Queries
We propose a framework for parsing video and text jointly for understanding
events and answering user queries. Our framework produces a parse graph that
represents the compositional structures of spatial information (objects and
scenes), temporal information (actions and events) and causal information
(causalities between events and fluents) in the video and text. The knowledge
representation of our framework is based on a spatial-temporal-causal And-Or
graph (S/T/C-AOG), which jointly models possible hierarchical compositions of
objects, scenes and events as well as their interactions and mutual contexts,
and specifies the prior probabilistic distribution of the parse graphs. We
present a probabilistic generative model for joint parsing that captures the
relations between the input video/text, their corresponding parse graphs and
the joint parse graph. Based on the probabilistic model, we propose a joint
parsing system consisting of three modules: video parsing, text parsing and
joint inference. Video parsing and text parsing produce two parse graphs from
the input video and text respectively. The joint inference module produces a
joint parse graph by performing matching, deduction and revision on the video
and text parse graphs. The proposed framework has the following objectives:
Firstly, we aim at deep semantic parsing of video and text that goes beyond the
traditional bag-of-words approaches; Secondly, we perform parsing and reasoning
across the spatial, temporal and causal dimensions based on the joint S/T/C-AOG
representation; Thirdly, we show that deep joint parsing facilitates subsequent
applications such as generating narrative text descriptions and answering
queries in the forms of who, what, when, where and why. We empirically
evaluated our system based on comparison against ground-truth as well as
accuracy of query answering and obtained satisfactory results
DOTA: A Large-scale Dataset for Object Detection in Aerial Images
Object detection is an important and challenging problem in computer vision.
Although the past decade has witnessed major advances in object detection in
natural scenes, such successes have been slow to aerial imagery, not only
because of the huge variation in the scale, orientation and shape of the object
instances on the earth's surface, but also due to the scarcity of
well-annotated datasets of objects in aerial scenes. To advance object
detection research in Earth Vision, also known as Earth Observation and Remote
Sensing, we introduce a large-scale Dataset for Object deTection in Aerial
images (DOTA). To this end, we collect aerial images from different
sensors and platforms. Each image is of the size about 4000-by-4000 pixels and
contains objects exhibiting a wide variety of scales, orientations, and shapes.
These DOTA images are then annotated by experts in aerial image interpretation
using common object categories. The fully annotated DOTA images contains
instances, each of which is labeled by an arbitrary (8 d.o.f.)
quadrilateral To build a baseline for object detection in Earth Vision, we
evaluate state-of-the-art object detection algorithms on DOTA. Experiments
demonstrate that DOTA well represents real Earth Vision applications and are
quite challenging.Comment: Accepted to CVPR 201
Task-demands can immediately reverse the effects of sensory-driven saliency in complex visual stimuli
In natural vision both stimulus features and task-demands affect an observer's attention. However, the relationship between sensory-driven (âbottom-upâ) and task-dependent (âtop-downâ) factors remains controversial: Can task-demands counteract strong sensory signals fully, quickly, and irrespective of bottom-up features? To measure attention under naturalistic conditions, we recorded eye-movements in human observers, while they viewed photographs of outdoor scenes. In the first experiment, smooth modulations of contrast biased the stimuli's sensory-driven saliency towards one side. In free-viewing, observers' eye-positions were immediately biased toward the high-contrast, i.e., high-saliency, side. However, this sensory-driven bias disappeared entirely when observers searched for a bull's-eye target embedded with equal probability to either side of the stimulus. When the target always occurred in the low-contrast side, observers' eye-positions were immediately biased towards this low-saliency side, i.e., the sensory-driven bias reversed. Hence, task-demands do not only override sensory-driven saliency but also actively countermand it. In a second experiment, a 5-Hz flicker replaced the contrast gradient. Whereas the bias was less persistent in free viewing, the overriding and reversal took longer to deploy. Hence, insufficient sensory-driven saliency cannot account for the bias reversal. In a third experiment, subjects searched for a spot of locally increased contrast (âoddityâ) instead of the bull's-eye (âtemplateâ). In contrast to the other conditions, a slight sensory-driven free-viewing bias prevails in this condition. In a fourth experiment, we demonstrate that at known locations template targets are detected faster than oddity targets, suggesting that the former induce a stronger top-down drive when used as search targets. Taken together, task-demands can override sensory-driven saliency in complex visual stimuli almost immediately, and the extent of overriding depends on the search target and the overridden feature, but not on the latter's free-viewing saliency
Reasoning About Pragmatics with Neural Listeners and Speakers
We present a model for pragmatically describing scenes, in which contrastive
behavior results from a combination of inference-driven pragmatics and learned
semantics. Like previous learned approaches to language generation, our model
uses a simple feature-driven architecture (here a pair of neural "listener" and
"speaker" models) to ground language in the world. Like inference-driven
approaches to pragmatics, our model actively reasons about listener behavior
when selecting utterances. For training, our approach requires only ordinary
captions, annotated _without_ demonstration of the pragmatic behavior the model
ultimately exhibits. In human evaluations on a referring expression game, our
approach succeeds 81% of the time, compared to a 69% success rate using
existing techniques
Nonstimulated early visual areas carry information about surrounding context
Even within the early sensory areas, the majority of the input to any given cortical neuron comes from other cortical neurons. To extend our knowledge of the contextual information that is transmitted by such lateral and feedback connections, we investigated how visually nonstimulated regions in primary visual cortex (V1) and visual area V2 are influenced by the surrounding context. We used functional magnetic resonance imaging (fMRI) and pattern-classification methods to show that the cortical representation of a nonstimulated quarter-field carries information that can discriminate the surrounding visual context. We show further that the activity patterns in these regions are significantly related to those observed with feed-forward stimulation and that these effects are driven primarily by V1. These results thus demonstrate that visual context strongly influences early visual areas even in the absence of differential feed-forward thalamic stimulation
Gaze Behaviour during Space Perception and Spatial Decision Making
A series of four experiments investigating gaze behavior and decision making in the context of wayfinding is reported. Participants were presented with screen-shots of choice points taken in large virtual environments. Each screen-shot depicted alternative path options. In Experiment 1, participants had to decide between them in order to find an object hidden in the environment. In Experiment 2, participants were first informed about which path option to take as if following a guided route. Subsequently they were presented with the same images in random order and had to indicate which path option they chose during initial exposure. In Experiment 1, we demonstrate (1) that participants have a tendency to choose the path option that featured the longer line of sight, and (2) a robust gaze bias towards the eventually chosen path option. In Experiment 2, systematic differences in gaze behavior towards the alternative path options between encoding and decoding were observed. Based on data from Experiments 1 & 2 and two control experiments ensuring that fixation patterns were specific to the spatial tasks, we develop a tentative model of gaze behavior during wayfinding decision making suggesting that particular attention was paid to image areas depicting changes in the local geometry of the environments such as corners, openings, and occlusions. Together, the results suggest that gaze during a wayfinding tasks is directed toward, and can be predicted by, a subset of environmental features and that gaze bias effects are a general phenomenon of visual decision making
Research in interactive scene analysis
Cooperative (man-machine) scene analysis techniques were developed whereby humans can provide a computer with guidance when completely automated processing is infeasible. An interactive approach promises significant near-term payoffs in analyzing various types of high volume satellite imagery, as well as vehicle-based imagery used in robot planetary exploration. This report summarizes the work accomplished over the duration of the project and describes in detail three major accomplishments: (1) the interactive design of texture classifiers; (2) a new approach for integrating the segmentation and interpretation phases of scene analysis; and (3) the application of interactive scene analysis techniques to cartography
- âŚ