Face-space Action Recognition by Face-Object Interactions
Action recognition in still images has seen major improvement in recent years
due to advances in human pose estimation, object recognition and stronger
feature representations. However, there are still many cases in which
performance remains far from that of humans. In this paper, we approach the
problem by explicitly learning, and then integrating, three components of
transitive actions: (1) the human body part relevant to the action, (2) the
object being acted upon, and (3) the specific form of interaction between the
person and the object. The process uses class-specific features and relations
not previously used for action recognition and, unlike most standard
approaches, inherently involves two cycles. We focus on face-related
actions (FRA), a subset of actions that includes several currently challenging
categories. We present an average relative improvement of 52% over the state
of the art. We also make a new benchmark publicly available.
Comment: Our more recent work on a related topic is described in a separate
paper: http://arxiv.org/abs/1511.0381
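The abstract does not specify how the three learned components are combined. Purely as a hypothetical illustration (the function name, the weighted-sum fusion, and the inputs are assumptions, not the paper's method), a late-fusion step over per-component scores might look like:

```python
import numpy as np

def fuse_transitive_action_scores(body_part_score, object_score,
                                  interaction_score, weights=(1.0, 1.0, 1.0)):
    """Hypothetical late fusion of the three component scores:
    (1) relevant body part, (2) acted-upon object, (3) interaction form.
    Returns a single weighted-average action score."""
    s = np.array([body_part_score, object_score, interaction_score], dtype=float)
    w = np.array(weights, dtype=float)
    return float(w @ s / w.sum())
```

In practice the combination would itself be learned; the fixed weights here only make the sketch concrete.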
Hand-Object Interaction and Precise Localization in Transitive Action Recognition
Action recognition in still images has seen major improvement in recent years
due to advances in human pose estimation, object recognition and stronger
feature representations produced by deep neural networks. However, there are
still many cases in which performance remains far from that of humans. A major
difficulty arises in distinguishing between transitive actions in which the
overall actor pose is similar, and recognition therefore depends on details of
the grasp and the object, which may be largely occluded. In this paper we
demonstrate how recognition is improved by obtaining precise localization of
the action-object and consequently extracting details of the object shape
together with the actor-object interaction. To obtain exact localization of the
action object and its interaction with the actor, we employ a coarse-to-fine
approach which combines semantic segmentation and contextual features, in
successive stages. We focus on (but are not limited to) face-related actions, a
set of actions that includes several currently challenging categories. We
present an average relative improvement of 35% over the state of the art and
validate the effectiveness of our approach through experimentation.
Comment: Minor changes: title and abstract
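The coarse-to-fine localization is described only at a high level. A minimal sketch of one plausible reading, assuming a binary semantic-segmentation mask as the coarse stage and a generic contextual scoring function `score_fn` as the fine stage (both names and the expand-and-rescore loop are assumptions, not the paper's exact pipeline):

```python
import numpy as np

def coarse_to_fine_localize(seg_mask, score_fn, margin=0.25, steps=2):
    """Hypothetical coarse-to-fine localization sketch.

    Coarse stage: take the bounding box of the segmentation mask.
    Fine stage: repeatedly grow the box by a context margin and keep
    the candidate only if the contextual score improves."""
    ys, xs = np.nonzero(seg_mask)
    box = [xs.min(), ys.min(), xs.max(), ys.max()]  # coarse box (x1, y1, x2, y2)
    best = score_fn(box)
    h, w = seg_mask.shape
    for _ in range(steps):
        dx = int(margin * (box[2] - box[0] + 1))
        dy = int(margin * (box[3] - box[1] + 1))
        cand = [max(0, box[0] - dx), max(0, box[1] - dy),
                min(w - 1, box[2] + dx), min(h - 1, box[3] + dy)]
        s = score_fn(cand)
        if s > best:                                 # keep box only on improvement
            box, best = cand, s
    return box
```

The actual method combines semantic segmentation with learned contextual features in successive stages; this sketch only shows the control flow such a pipeline could take.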
Jointly Attentive Spatial-Temporal Pooling Networks for Video-based Person Re-Identification
Person Re-Identification (person re-id) is a crucial task due to its
applications in visual surveillance and human-computer interaction. In this
work, we present a novel jointly Attentive Spatial-Temporal Pooling Network (ASTPN) for
video-based person re-identification, which enables the feature extractor to be
aware of the current input video sequences, in a way that interdependency from
the matching items can directly influence the computation of each other's
representation. Specifically, the spatial pooling layer selects informative
regions from each frame, while the attentive temporal pooling layer selects
informative frames over the sequence, with both pooling operations guided by
the distance-matching information. Experiments are conducted on the iLIDS-VID,
PRID-2011 and MARS datasets and the results demonstrate that this approach
outperforms existing state-of-the-art methods. We also analyze how joint
pooling in both dimensions boosts person re-id performance more effectively
than using either of them separately.
Comment: To appear in ICCV 201
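As a rough numpy sketch of the temporal half of such joint attentive pooling, assuming each frame has already been reduced to a feature vector by spatial pooling (the parameter matrix `U` and the max-then-softmax weighting below are assumptions based on the description above, not the paper's exact formulation):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attentive_temporal_pool(P, Q, U):
    """Jointly pool two frame-feature sequences P (Tp x d) and Q (Tq x d).

    A cross-sequence affinity matrix scores every frame pair, so each
    sequence's frame weights depend on the sequence it is matched against."""
    A = np.tanh(P @ U @ Q.T)        # (Tp, Tq) pairwise frame affinities
    w_p = softmax(A.max(axis=1))    # importance of each frame in P
    w_q = softmax(A.max(axis=0))    # importance of each frame in Q
    return w_p @ P, w_q @ Q         # attention-weighted sequence features
```

With identical frames the weights collapse to a uniform average; with distinctive frames the pooled feature is dominated by the frames that best match the other sequence, which is the interdependency the abstract describes.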