5,268 research outputs found
Geometric Features Informed Multi-person Human-object Interaction Recognition in Videos
Human-Object Interaction (HOI) recognition in videos is important for
analyzing human activity. Most existing work focuses on visual features, which
usually suffer from occlusion in real-world scenarios. This problem is
further complicated when multiple people and objects are involved in HOIs.
Considering that geometric features such as human pose and object position provide
meaningful information for understanding HOIs, we argue for combining the benefits of
both visual and geometric features in HOI recognition, and propose a novel
Two-level Geometric feature-informed Graph Convolutional Network (2G-GCN). The
geometric-level graph models the interdependency between geometric features of
humans and objects, while the fusion-level graph further fuses them with visual
features of humans and objects. To demonstrate the novelty and effectiveness of
our method in challenging scenarios, we propose a new multi-person HOI dataset
(MPHOI-72). Extensive experiments on MPHOI-72 (multi-person HOI), CAD-120
(single-human HOI) and Bimanual Actions (two-hand HOI) datasets demonstrate our
superior performance compared to state-of-the-art methods.
Comment: Accepted by ECCV 202
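As a rough illustration of the graph-convolution idea named in the abstract above, the following is a minimal sketch of one generic GCN propagation step over a small graph whose nodes stand for humans and objects. It is not the 2G-GCN architecture itself: the function name `gcn_layer`, the symmetric normalization with self-loops, the ReLU activation, and the toy feature values are all assumptions made for illustration.

```python
import math


def gcn_layer(adj, feats, weight):
    """One generic graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 X W).

    adj    : n x n adjacency matrix (list of lists, 0/1 entries)
    feats  : n x d_in node features (hypothetical geometric features)
    weight : d_in x d_out weight matrix (would be learned in practice)
    """
    n = len(adj)
    d_in = len(feats[0])
    d_out = len(weight[0])

    # Add self-loops so each node keeps part of its own feature.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]

    # Symmetric degree normalization of the adjacency matrix.
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

    # Aggregate neighbour features: norm @ feats.
    agg = [[sum(norm[i][k] * feats[k][c] for k in range(n))
            for c in range(d_in)] for i in range(n)]

    # Linear projection followed by ReLU: max(0, agg @ weight).
    return [[max(0.0, sum(agg[i][c] * weight[c][o] for c in range(d_in)))
             for o in range(d_out)] for i in range(n)]


# Toy graph (assumed): two humans and one object, all mutually connected.
adjacency = [[0, 1, 1],
             [1, 0, 1],
             [1, 1, 0]]
# Hypothetical 2-D geometric features per node (e.g. a normalized position).
features = [[0.0, 0.3],
            [0.6, 0.3],
            [0.3, 0.9]]
identity = [[1.0, 0.0],
            [0.0, 1.0]]
mixed = gcn_layer(adjacency, features, identity)
```

With a fully connected graph and an identity weight matrix, every node ends up with the mean of all node features, which shows how one propagation step mixes information between the humans and the object.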
Weakly-Supervised Alignment of Video With Text
Suppose that we are given a set of videos, along with natural language
descriptions in the form of multiple sentences (e.g., manual annotations, movie
scripts, sport summaries etc.), and that these sentences appear in the same
temporal order as their visual counterparts. We propose in this paper a method
for aligning the two modalities, i.e., automatically providing a time stamp for
every sentence. Given vectorial features for both video and text, we propose to
cast this task as a temporal assignment problem, with an implicit linear
mapping between the two feature modalities. We formulate this problem as an
integer quadratic program, and solve its continuous convex relaxation using an
efficient conditional gradient algorithm. Several rounding procedures are
proposed to construct the final integer solution. After demonstrating
significant improvements over the state of the art on the related task of
aligning video with symbolic labels [7], we evaluate our method on a
challenging dataset of videos with associated textual descriptions [36], using
both bag-of-words and continuous representations for text.
Comment: ICCV 2015 - IEEE International Conference on Computer Vision, Dec
2015, Santiago, Chil
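The ordering constraint in the abstract above (sentences occur in the same temporal order as the video content) can be illustrated with a small dynamic program. This is a simplified sketch, not the paper's method: it assumes a precomputed cost matrix `cost[i][t]` for placing sentence `i` at time step `t`, requires strictly increasing time stamps, and solves only that linear assignment exactly, whereas the paper relaxes an integer quadratic program and solves it by conditional gradient. The function name `align_in_order` and the toy costs are hypothetical.

```python
def align_in_order(cost):
    """Assign each sentence a time step, strictly increasing, at minimum total cost.

    cost[i][t] is the (assumed, precomputed) cost of placing sentence i at
    time step t. Returns a list of time indices, one per sentence.
    """
    n, T = len(cost), len(cost[0])
    if n > T:
        raise ValueError("need at least as many time steps as sentences")

    INF = float("inf")
    best = [[INF] * T for _ in range(n)]   # best[i][t]: min cost with sentence i at t
    back = [[-1] * T for _ in range(n)]    # back[i][t]: time chosen for sentence i-1

    for t in range(T):
        best[0][t] = cost[0][t]

    for i in range(1, n):
        run_min, run_arg = INF, -1
        for t in range(T):
            # Maintain the minimum of best[i-1][t'] over all t' < t.
            if t >= 1 and best[i - 1][t - 1] < run_min:
                run_min, run_arg = best[i - 1][t - 1], t - 1
            if run_arg >= 0:
                best[i][t] = cost[i][t] + run_min
                back[i][t] = run_arg

    # Pick the cheapest end point, then trace the choices backwards.
    t = min(range(T), key=lambda u: best[n - 1][u])
    path = [t]
    for i in range(n - 1, 0, -1):
        t = back[i][t]
        path.append(t)
    path.reverse()
    return path


# Toy example (assumed costs): three sentences, four time steps.
toy_cost = [[0, 5, 5, 5],
            [5, 5, 0, 5],
            [5, 5, 5, 0]]
time_stamps = align_in_order(toy_cost)
```

A per-sentence greedy choice can violate the temporal order. The dynamic program enforces the order by construction, at a cost of O(nT) for n sentences and T time steps.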
Vision of a Visipedia
The web is not perfect: while text is easily
searched and organized, pictures (the vast majority of the bits
that one can find online) are not. In order to see how one could
improve the web and make pictures first-class citizens of the
web, I explore the idea of Visipedia, a visual interface for
Wikipedia that is able to answer visual queries and enables
experts to contribute and organize visual knowledge. Five
distinct groups of humans would interact through Visipedia:
users, experts, editors, visual workers, and machine vision
scientists. The latter would gradually build automata able to
interpret images. I explore some of the technical challenges
involved in making Visipedia happen. I argue that Visipedia will
likely grow organically, combining state-of-the-art machine
vision with human labor
- …