37,349 research outputs found
Pose-Guided Multi-Granularity Attention Network for Text-Based Person Search
Text-based person search aims to retrieve the corresponding person images in
an image database by virtue of a describing sentence about the person, which
poses great potential for various applications such as video surveillance.
Extracting visual contents corresponding to the human description is the key to
this cross-modal matching problem. Moreover, correlated images and descriptions
involve different granularities of semantic relevance, which is usually ignored
in previous methods. To exploit the multilevel corresponding visual contents,
we propose a pose-guided multi-granularity attention network (PMA). Firstly, we
propose a coarse alignment network (CA) to select the related image regions to
the global description by a similarity-based attention. To further capture the
phrase-related visual body part, a fine-grained alignment network (FA) is
proposed, which employs pose information to learn latent semantic alignment
between visual body part and textual noun phrase. To verify the effectiveness
of our model, we perform extensive experiments on the CUHK Person Description
Dataset (CUHK-PEDES) which is currently the only available dataset for
text-based person search. Experimental results show that our approach
outperforms the state-of-the-art methods by 15 \% in terms of the top-1 metric.Comment: published in AAAI2020(oral
Cross-Modal Health State Estimation
Individuals create and consume more diverse data about themselves today than
any time in history. Sources of this data include wearable devices, images,
social media, geospatial information and more. A tremendous opportunity rests
within cross-modal data analysis that leverages existing domain knowledge
methods to understand and guide human health. Especially in chronic diseases,
current medical practice uses a combination of sparse hospital based biological
metrics (blood tests, expensive imaging, etc.) to understand the evolving
health status of an individual. Future health systems must integrate data
created at the individual level to better understand health status perpetually,
especially in a cybernetic framework. In this work we fuse multiple user
created and open source data streams along with established biomedical domain
knowledge to give two types of quantitative state estimates of cardiovascular
health. First, we use wearable devices to calculate cardiorespiratory fitness
(CRF), a known quantitative leading predictor of heart disease which is not
routinely collected in clinical settings. Second, we estimate inherent genetic
traits, living environmental risks, circadian rhythm, and biological metrics
from a diverse dataset. Our experimental results on 24 subjects demonstrate how
multi-modal data can provide personalized health insight. Understanding the
dynamic nature of health status will pave the way for better health based
recommendation engines, better clinical decision making and positive lifestyle
changes.Comment: Accepted to ACM Multimedia 2018 Conference - Brave New Ideas, Seoul,
Korea, ACM ISBN 978-1-4503-5665-7/18/1
Flowing ConvNets for Human Pose Estimation in Videos
The objective of this work is human pose estimation in videos, where multiple
frames are available. We investigate a ConvNet architecture that is able to
benefit from temporal context by combining information across the multiple
frames using optical flow.
To this end we propose a network architecture with the following novelties:
(i) a deeper network than previously investigated for regressing heatmaps; (ii)
spatial fusion layers that learn an implicit spatial model; (iii) optical flow
is used to align heatmap predictions from neighbouring frames; and (iv) a final
parametric pooling layer which learns to combine the aligned heatmaps into a
pooled confidence map.
We show that this architecture outperforms a number of others, including one
that uses optical flow solely at the input layers, one that regresses joint
coordinates directly, and one that predicts heatmaps without spatial fusion.
The new architecture outperforms the state of the art by a large margin on
three video pose estimation datasets, including the very challenging Poses in
the Wild dataset, and outperforms other deep methods that don't use a graphical
model on the single-image FLIC benchmark (and also Chen & Yuille and Tompson et
al. in the high precision region).Comment: ICCV'1
- …