Semantic Image Retrieval via Active Grounding of Visual Situations
We describe a novel architecture for semantic image retrieval---in
particular, retrieval of instances of visual situations. Visual situations are
concepts such as "a boxing match," "walking the dog," "a crowd waiting for a
bus," or "a game of ping-pong," whose instantiations in images are linked more
by their common spatial and semantic structure than by low-level visual
similarity. Given a query situation description, our architecture---called
Situate---learns models capturing the visual features of expected objects as
well as the expected spatial configuration of relationships among objects. Given a
new image, Situate uses these models in an attempt to ground (i.e., to create a
bounding box locating) each expected component of the situation in the image
via an active search procedure. Situate uses the resulting grounding to compute
a score indicating the degree to which the new image is judged to contain an
instance of the situation. Such scores can be used to rank images in a
collection as part of a retrieval system. In the preliminary study described
here, we demonstrate the promise of this system by comparing Situate's
performance with that of two baseline methods, as well as with a related
semantic image-retrieval system based on "scene graphs."
Overcoming the Challenges Associated with Image-based Mapping of Small Bodies in Preparation for the OSIRIS-REx Mission to (101955) Bennu
The OSIRIS-REx Asteroid Sample Return Mission is the third mission in NASA's
New Frontiers Program and is the first U.S. mission to return samples from an
asteroid to Earth. The most important decision ahead of the OSIRIS-REx team is
the selection of a prime sample-site on the surface of asteroid (101955) Bennu.
Mission success hinges on identifying a site that is safe and has regolith that
can readily be ingested by the spacecraft's sampling mechanism. To inform this
mission-critical decision, the surface of Bennu is mapped using the OSIRIS-REx
Camera Suite and the images are used to develop several foundational data
products. Acquiring the necessary inputs to these data products requires
observational strategies that are defined specifically to overcome the
challenges associated with mapping a small irregular body. We present these
strategies in the context of assessing candidate sample-sites at Bennu
according to a framework of decisions regarding the relative safety,
sampleability, and scientific value across the asteroid's surface. To create
data products that aid these assessments, we describe the best practices
developed by the OSIRIS-REx team for image-based mapping of irregular small
bodies. We emphasize the importance of using 3D shape models and the ability to
work in body-fixed rectangular coordinates when dealing with planetary surfaces
that cannot be uniquely addressed by body-fixed latitude and longitude.
Comment: 31 pages, 10 figures, 2 tables
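The point about body-fixed rectangular coordinates can be illustrated with a minimal sketch. It assumes planetocentric latitude, east-positive longitude, and a radius measured from the body center; the function names are hypothetical and are not taken from the paper. On an irregular body, two surface points (for example on an overhang) can share a latitude and longitude while differing in radius, so only the full (x, y, z) triple identifies them uniquely.

```python
import math

def latlon_to_body_fixed(lat_deg, lon_deg, radius):
    """Convert (latitude, longitude, radius) to body-fixed x, y, z.

    Assumes planetocentric latitude and east-positive longitude in degrees.
    """
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    x = radius * math.cos(lat) * math.cos(lon)
    y = radius * math.cos(lat) * math.sin(lon)
    z = radius * math.sin(lat)
    return (x, y, z)

def body_fixed_to_latlon(x, y, z):
    """Inverse conversion; longitude is returned in the range [0, 360)."""
    radius = math.sqrt(x * x + y * y + z * z)
    lat = math.degrees(math.asin(z / radius))
    lon = math.degrees(math.atan2(y, x)) % 360.0
    return (lat, lon, radius)

# Two hypothetical surface points with identical latitude and longitude
# but different radii map to distinct rectangular coordinates.
upper = latlon_to_body_fixed(10.0, 20.0, 0.25)
lower = latlon_to_body_fixed(10.0, 20.0, 0.24)
```

Here `upper` and `lower` are indistinguishable in latitude and longitude alone, which is the ambiguity the abstract refers to.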
A Data-Driven Approach for Tag Refinement and Localization in Web Videos
Tagging of visual content is becoming more and more widespread as web-based
services and social networks have popularized tagging functionalities among
their users. These user-generated tags are used to ease browsing and
exploration of media collections, e.g. using tag clouds, or to retrieve
multimedia content. However, not all media are equally tagged by users. With
current systems it is easy to tag a single photo, and even tagging a part of a
photo, like a face, has become common in sites like Flickr and Facebook. On the
other hand, tagging a video sequence is more complicated and time consuming, so
that users just tag the overall content of a video. In this paper we present a
method for automatic video annotation that increases the number of tags
originally provided by users, and localizes them temporally, associating tags
to keyframes. Our approach exploits collective knowledge embedded in
user-generated tags and web sources, and visual similarity of keyframes and
images uploaded to social sites like YouTube and Flickr, as well as web sources
like Google and Bing. Given a keyframe, our method is able to select on the fly
from these visual sources the training exemplars that should be the most
relevant for this test sample, and proceeds to transfer labels across similar
images. Compared to existing video tagging approaches that require training
classifiers for each tag, our system has few parameters, is easy to implement
and can deal with an open vocabulary scenario. We demonstrate the approach on
tag refinement and localization on DUT-WEBV, a large dataset of web videos, and
show state-of-the-art results.
Comment: Preprint submitted to Computer Vision and Image Understanding (CVIU)
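The idea of transferring labels across visually similar images can be sketched as a similarity-weighted nearest-neighbor vote. This is only an illustration under stated assumptions, not the authors' method: the feature vectors, the cosine similarity measure, and the parameters `k` and `min_score` are all hypothetical choices that the abstract does not specify.

```python
import math
from collections import defaultdict

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def transfer_tags(keyframe_feature, exemplars, k=3, min_score=0.5):
    """Pick the k exemplars most similar to the keyframe and keep the tags
    whose similarity-weighted vote share reaches min_score.

    exemplars is a list of (feature_vector, tags) pairs.
    """
    ranked = sorted(
        exemplars,
        key=lambda ex: cosine_similarity(keyframe_feature, ex[0]),
        reverse=True,
    )
    votes = defaultdict(float)
    total = 0.0
    for feature, tags in ranked[:k]:
        # Negative similarities contribute no weight.
        sim = max(cosine_similarity(keyframe_feature, feature), 0.0)
        total += sim
        for tag in set(tags):
            votes[tag] += sim
    if total == 0.0:
        return []
    return sorted(tag for tag, v in votes.items() if v / total >= min_score)

# Toy exemplars with made-up 2-D features and tags.
exemplars = [
    ([1.0, 0.0], ["dog", "park"]),
    ([0.9, 0.1], ["dog"]),
    ([0.0, 1.0], ["car"]),
]
tags = transfer_tags([1.0, 0.0], exemplars, k=2, min_score=0.6)
```

Because the neighbors are selected per keyframe at query time, no per-tag classifier is trained, which is consistent with the open-vocabulary setting the abstract describes.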
Generic Tubelet Proposals for Action Localization
We develop a novel framework for action localization in videos. We propose
the Tube Proposal Network (TPN), which can generate generic, class-independent,
video-level tubelet proposals in videos. The generated tubelet proposals can be
utilized in various video analysis tasks, including recognizing and localizing
actions in videos. In particular, we integrate these generic tubelet proposals
into a unified temporal deep network for action classification. Compared with
other methods, our generic tubelet proposal method is accurate, general, and
fully differentiable under a smooth L1 loss function. We demonstrate the
performance of our algorithm on the standard UCF-Sports, J-HMDB21, and UCF-101
datasets. Our class-independent TPN outperforms other tubelet generation
methods, and our unified temporal deep network achieves state-of-the-art
localization results on all three datasets.
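For readers unfamiliar with the smooth L1 loss mentioned above, here is a minimal sketch of its commonly used form (quadratic near zero, linear elsewhere). The abstract does not state the exact variant used, so the `beta` threshold and the hypothetical `box_regression_loss` helper are assumptions made for illustration.

```python
def smooth_l1(x, beta=1.0):
    """Smooth L1 of a scalar residual: quadratic for |x| < beta, linear otherwise."""
    ax = abs(x)
    if ax < beta:
        return 0.5 * ax * ax / beta
    return ax - 0.5 * beta

def box_regression_loss(predicted, target):
    """Sum of smooth L1 terms over the coordinates of one box (hypothetical helper)."""
    return sum(smooth_l1(p - t) for p, t in zip(predicted, target))

# Residuals of 0.5 and 2.0 fall in the quadratic and linear regimes respectively.
loss = box_regression_loss([0.5, 2.0, 1.0, 1.0], [0.0, 0.0, 1.0, 1.0])
```

The two branches meet with equal value and slope at `|x| = beta`, so the loss has a continuous gradient, which is what makes it convenient for end-to-end training of box regressors.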
Video browsing interfaces and applications: a review
We present a comprehensive review of the state of the art in video browsing and retrieval systems, with special emphasis on interfaces and applications. There has been a significant increase in activity (e.g., storage, retrieval, and sharing) employing video data in the past decade, both for personal and professional use. The ever-growing amount of video content available for human consumption and the inherent characteristics of video data (which, if presented in its raw format, is rather unwieldy and costly) have become driving forces for the development of more effective solutions to present video contents and allow rich user interaction. As a result, there are many contemporary research efforts toward developing better video browsing solutions, which we summarize. We review more than 40 different video browsing and retrieval interfaces and classify them into three groups: applications that use video-player-like interaction, video retrieval applications, and browsing solutions based on video surrogates. For each category, we present a summary of existing work, highlight the technical aspects of each solution, and compare them against each other.