Fully-Coupled Two-Stream Spatiotemporal Networks for Extremely Low Resolution Action Recognition
A major emerging challenge is how to protect people's privacy as cameras and
computer vision are increasingly integrated into our daily lives, including in
smart devices inside homes. A potential solution is to capture and record just
the minimum amount of information needed to perform a task of interest. In this
paper, we propose a fully-coupled two-stream spatiotemporal architecture for
reliable human action recognition on extremely low resolution (e.g., 12x16
pixel) videos. We provide an efficient method to extract spatial and temporal
features and to aggregate them into a robust feature representation for an
entire action video sequence. We also consider how to incorporate high
resolution videos during training in order to build better low resolution
action recognition models. We evaluate on two publicly-available datasets,
showing significant improvements over the state-of-the-art.
Comment: 9 pages, 5 figures, published in WACV 201
Temporal Recurrent Networks for Online Action Detection
Most work on temporal action detection is formulated as an offline problem,
in which the start and end times of actions are determined after the entire
video is fully observed. However, important real-time applications including
surveillance and driver assistance systems require identifying actions as soon
as each video frame arrives, based only on current and historical observations.
In this paper, we propose a novel framework, Temporal Recurrent Network (TRN),
to model greater temporal context of a video frame by simultaneously performing
online action detection and anticipation of the immediate future. At each
moment in time, our approach makes use of both accumulated historical evidence
and predicted future information to better recognize the action that is
currently occurring, and integrates both of these into a unified end-to-end
architecture. We evaluate our approach on two popular online action detection
datasets, HDD and TVSeries, as well as another widely used dataset, THUMOS'14.
The results show that TRN significantly outperforms the state-of-the-art.
Egocentric Vision-based Future Vehicle Localization for Intelligent Driving Assistance Systems
Predicting the future location of vehicles is essential for safety-critical
applications such as advanced driver assistance systems (ADAS) and autonomous
driving. This paper introduces a novel approach to simultaneously predict both
the location and scale of target vehicles in the first-person (egocentric) view
of an ego-vehicle. We present a multi-stream recurrent neural network (RNN)
encoder-decoder model that separately captures both object location and scale
and pixel-level observations for future vehicle localization. We show that
incorporating dense optical flow improves prediction results significantly
since it captures information about motion as well as appearance change. We
also find that explicitly modeling future motion of the ego-vehicle improves
the prediction accuracy, which could be especially beneficial in intelligent
and automated vehicles that have motion planning capability. To evaluate the
performance of our approach, we present a new dataset of first-person videos
collected from a variety of scenarios at road intersections, which are
particularly challenging moments for prediction because vehicle trajectories
are diverse and dynamic.
Comment: To appear on ICRA 201
Recurrent violent injury: magnitude, risk factors, and opportunities for intervention from a statewide analysis.
INTRODUCTION: Although preventing recurrent violent injury is an important component of a public health approach to interpersonal violence and a common focus of violence intervention programs, the true incidence of recurrent violent injury is unknown. Prior studies have reported recurrence rates from 0.8% to 44%, and risk factors for recurrence are not well established.
METHODS: We used a statewide, all-payer database to perform a retrospective cohort study of emergency department visits for injury due to interpersonal violence in Florida, following up patients injured in 2010 for recurrence through 2012. We assessed risk factors for recurrence with multivariable logistic regression and estimated time to recurrence with the Kaplan-Meier method. We tabulated hospital charges and costs for index and recurrent visits.
RESULTS: Of 53 908 patients presenting for violent injury in 2010, 11.1% had a recurrent violent injury during the study period. Trauma centers treated 31.8%, including 55.9% of severe injuries. Among recurrers, 58.9% went to a different hospital for their second injury. Low income, homelessness, Medicaid or uninsurance, and black race were associated with increased odds of recurrence. Patients with visits for mental and behavioral health and unintentional injury also had increased odds of recurrence. Index injuries accounted for 25.3 million.
CONCLUSIONS: Recurrent violent injury is a common and costly phenomenon, and effective violence prevention programs are needed. Prevention must include the nontrauma centers where many patients seek care.
Predicting Geo-informative Attributes in Large-Scale Image Collections Using Convolutional Neural Networks
Geographic location is a powerful property for organizing large-scale photo collections, but only a small fraction of online photos are geo-tagged. Most work in automatically estimating geo-tags from image content is based on comparison against models of buildings or landmarks, or on matching to large reference collections of geo-tagged images. These approaches work well for frequently-photographed places like major cities and tourist destinations, but fail for photos taken in sparsely photographed places where few reference photos exist. Here we consider how to recognize general geo-informative attributes of a photo, e.g. the elevation gradient, population density, demographics, etc. of where it was taken, instead of trying to estimate a precise geo-tag. We learn models for these attributes using a large (noisy) set of geo-tagged images from Flickr by training deep convolutional neural networks (CNNs). We evaluate on over a dozen attributes, showing that while automatically recognizing some attributes is very difficult, others can be automatically estimated with about the same accuracy as a human.
Identifying First-person Camera Wearers in Third-person Videos
We consider scenarios in which we wish to perform joint scene understanding,
object tracking, activity recognition, and other tasks in environments in which
multiple people are wearing body-worn cameras while a third-person static
camera also captures the scene. To do this, we need to establish person-level
correspondences across first- and third-person videos, which is challenging
because the camera wearer is not visible from his/her own egocentric video,
preventing the use of direct feature matching. In this paper, we propose a new
semi-Siamese Convolutional Neural Network architecture to address this novel
challenge. We formulate the problem as learning a joint embedding space for
first- and third-person videos that considers both spatial- and motion-domain
cues. A new triplet loss function is designed to minimize the distance between
correct first- and third-person matches while maximizing the distance between
incorrect ones. This end-to-end approach performs significantly better than
several baselines, in part by learning the first- and third-person features
optimized for matching jointly with the distance measure itself.