VIP: Finding Important People in Images
People preserve memories of events such as birthdays, weddings, or vacations
by capturing photos, often depicting groups of people. Invariably, some
individuals in the image are more important than others given the context of
the event. This paper analyzes the concept of the importance of individuals in
group photographs. We address two specific questions: given an image, who are
the most important individuals in it? Given multiple images of a person, which
image depicts the person in the most important role? We introduce a measure of
importance of people in images and investigate the correlation between
importance and visual saliency. We find not only that we can automatically
predict the importance of people from purely visual cues, but also that
incorporating this predicted importance yields significant improvements in
applications such as im2text (generating sentences that describe images of
groups of people).
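As a toy illustration of predicting per-person importance from visual cues, the sketch below fits a regressor on hypothetical per-face features (face size, centrality, sharpness, local saliency) and ranks the people in a photo by predicted score. The features, data, and model choice are illustrative assumptions, not the paper's actual method.

```python
# Minimal sketch: rank people in a group photo by predicted importance.
# Feature layout and training data are hypothetical placeholders.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical per-person features:
# [face area ratio, distance from image center, face sharpness, saliency]
X_train = np.array([
    [0.12, 0.10, 0.9, 0.8],   # large, central, sharp face -> important
    [0.02, 0.45, 0.4, 0.2],   # small, peripheral face -> unimportant
    [0.08, 0.20, 0.7, 0.6],
])
# Importance scores, e.g. aggregated from human pairwise annotations
y_train = np.array([0.95, 0.15, 0.60])

model = GradientBoostingRegressor(n_estimators=50)
model.fit(X_train, y_train)

# Rank the people detected in a new group photo
people = np.array([[0.10, 0.15, 0.8, 0.7],
                   [0.03, 0.40, 0.5, 0.3]])
scores = model.predict(people)
ranking = np.argsort(-scores)  # most important person first
print(ranking, scores)
```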
Neural Motifs: Scene Graph Parsing with Global Context
We investigate the problem of producing structured graph representations of
visual scenes. Our work analyzes the role of motifs: regularly appearing
substructures in scene graphs. We present new quantitative insights on such
repeated structures in the Visual Genome dataset. Our analysis shows that
object labels are highly predictive of relation labels but not vice-versa. We
also find that there are recurring patterns even in larger subgraphs: more than
50% of graphs contain motifs involving at least two relations. Our analysis
motivates a new baseline: given object detections, predict the most frequent
relation between object pairs with the given labels, as seen in the training
set. This baseline improves on the previous state-of-the-art by an average of
3.6% relative improvement across evaluation settings. We then introduce Stacked
Motif Networks, a new architecture designed to capture higher order motifs in
scene graphs that further improves over our strong baseline by an average 7.1%
relative gain. Our code is available at github.com/rowanz/neural-motifs.
Comment: CVPR 2018 camera ready
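To make the frequency baseline concrete, here is a minimal sketch, assuming the training scene graphs are available as (subject, relation, object) triples; the data structures are illustrative, not the released implementation.

```python
# Frequency baseline: given a pair of detected object labels, predict
# the relation most often seen between that label pair in training.
from collections import Counter, defaultdict

# Training triples (subject_label, relation, object_label)
train_triples = [
    ("man", "riding", "horse"),
    ("man", "on", "horse"),
    ("man", "riding", "horse"),
    ("dog", "near", "tree"),
]

# Count relation frequencies per (subject, object) label pair
freq = defaultdict(Counter)
for subj, rel, obj in train_triples:
    freq[(subj, obj)][rel] += 1

def predict_relation(subj_label, obj_label):
    """Most frequent training relation for this label pair, if any."""
    counts = freq.get((subj_label, obj_label))
    return counts.most_common(1)[0][0] if counts else None

print(predict_relation("man", "horse"))  # -> "riding"
```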
Multimodal classification of driver glance
This paper presents a multimodal approach to in-vehicle
classification of driver glances. Driver glance is a
strong predictor of cognitive load and is a useful input to
many applications in the automotive domain. Six descriptive
glance regions are defined and a classifier is trained on video
recordings of drivers from a single low-cost camera. Visual
features such as head orientation, eye gaze and confidence
ratings are extracted, then statistical methods are used to
perform failure analysis and calibration on the visual features.
Non-visual features such as steering wheel angle and indicator
position are extracted from a RaceLogic VBOX system. The
approach is evaluated on a dataset containing multiple 60-second
samples from 14 participants recorded while driving in
a natural environment. We compare our multimodal approach
to separate unimodal approaches using both Support Vector
Machine (SVM) and Random Forests (RF) classifiers. The RF
Mean Decrease in Gini Index is used to rank the selected features,
which gives insight into their relative importance and improves
classifier performance. We demonstrate that our multimodal
approach yields significantly higher classification performance
than unimodal approaches. The final model achieves an average F1
score of 70.5% across the six classes.
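As a sketch of the feature-ranking step, the snippet below trains a Random Forest on stand-in glance features and reads off scikit-learn's feature_importances_, which corresponds to the normalized mean decrease in impurity (Gini). Feature names and data are placeholders, not the authors' dataset.

```python
# Rank glance-classification features by RF Mean Decrease in Gini.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

feature_names = ["head_yaw", "head_pitch", "gaze_x", "gaze_y",
                 "steering_angle", "indicator_position"]
rng = np.random.default_rng(0)
X = rng.normal(size=(200, len(feature_names)))  # stand-in feature matrix
y = rng.integers(0, 6, size=200)                # six glance regions

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# feature_importances_ is the normalized mean decrease in Gini impurity
ranked = sorted(zip(feature_names, rf.feature_importances_),
                key=lambda p: p[1], reverse=True)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```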
Automatic Eye-Gaze Following from 2-D Static Images: Application to Classroom Observation Video Analysis
In this work, we develop an end-to-end neural network-based computer vision system to automatically identify where each person within a 2-D image of a school classroom is looking ("gaze following"), as well as whom she/he is looking at. Automatic gaze following could help facilitate data-mining of large datasets of classroom observation videos that are collected routinely in schools around the world in order to understand social interactions between teachers and students. Our network is based on the architecture by Recasens et al. (2015) but is extended to (1) predict not only where, but whom the person is looking at; and (2) predict whether each person is looking at a target inside or outside the image. Since our focus is on classroom observation videos, we collect a gaze dataset (48,907 gaze annotations over 2,263 classroom images) for students and teachers in classrooms. Results of our experiments indicate that the proposed neural network can estimate the gaze target (either the spatial location or the face of a person) with substantially higher accuracy compared to several baselines.
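The following PyTorch sketch illustrates the two extensions described above: a head that predicts a gaze-target heatmap ("where"), a head that scores candidate faces ("whom"), and a head that classifies whether the target is inside or outside the image. The backbone, layer sizes, and face-scoring scheme are illustrative assumptions, not the paper's exact architecture.

```python
# Illustrative multi-head gaze-following network (assumed structure).
import torch
import torch.nn as nn

class GazeFollowNet(nn.Module):
    def __init__(self, feat_dim=256, grid=13, max_faces=20):
        super().__init__()
        self.backbone = nn.Sequential(      # stand-in for a CNN encoder
            nn.Conv2d(3, feat_dim, 7, stride=8), nn.ReLU(),
            nn.AdaptiveAvgPool2d(grid))
        self.where_head = nn.Conv2d(feat_dim, 1, 1)   # gaze heatmap
        self.who_head = nn.Linear(feat_dim, max_faces)  # which face
        self.inout_head = nn.Linear(feat_dim, 1)        # inside/outside

    def forward(self, img):
        f = self.backbone(img)              # (B, C, grid, grid)
        heatmap = self.where_head(f)        # spatial gaze-target scores
        g = f.mean(dim=(2, 3))              # globally pooled features
        who_logits = self.who_head(g)       # scores over candidate faces
        inout_logit = self.inout_head(g)    # > 0 -> target inside image
        return heatmap, who_logits, inout_logit

net = GazeFollowNet()
heatmap, who, inout = net(torch.randn(1, 3, 224, 224))
```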
How Facial Features Convey Attention in Stationary Environments
Awareness detection technologies have been gaining traction in a variety of enterprises; most often used for driver fatigue detection, recent research has shifted towards using computer vision technologies to analyze user attention in environments such as online classrooms. This paper aims to extend previous research on distraction detection by analyzing which visual features contribute most to predicting awareness and fatigue. We utilized the open-source facial analysis toolkit OpenFace to analyze visual data of subjects at varying levels of attentiveness. Then, using a Support-Vector Machine (SVM), we created several prediction models for user attention and identified Histogram of Oriented Gradients (HOG) features and Action Units as the strongest predictors among the features we tested. We also compared the performance of this SVM to deep learning approaches that utilize Convolutional and/or Recurrent neural networks (CNNs and CRNNs). Interestingly, CRNNs did not appear to perform significantly better than their CNN counterparts. While deep learning methods achieved greater prediction accuracy, SVMs used fewer resources and, with certain parameters, were able to approach the performance of the deep learning methods.
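A minimal sketch of the SVM setup follows: an RBF-kernel SVM trained on facial features of the kind OpenFace exports (e.g. Action Unit intensities) to predict attentiveness. The feature layout and labels below are synthetic placeholders, not the authors' data or OpenFace's exact output format.

```python
# Attention prediction from Action-Unit-style features with an SVM.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_frames, n_aus = 300, 17          # e.g. 17 Action Unit intensities
X = rng.uniform(0, 5, size=(n_frames, n_aus))   # stand-in AU features
y = (X[:, 0] + rng.normal(0, 1, n_frames) > 2.5).astype(int)  # attentive?

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=0)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
clf.fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```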
Discovering Class-Specific Pixels for Weakly-Supervised Semantic Segmentation
We propose an approach to discover class-specific pixels for the
weakly-supervised semantic segmentation task. We show that properly combining
saliency and attention maps allows us to obtain reliable cues capable of
significantly boosting the performance. First, we propose a simple yet powerful
hierarchical approach to discover class-agnostic salient regions, obtained
using a salient object detector, that would otherwise be ignored. Second, we
use fully convolutional attention maps to reliably localize the class-specific
regions in a given image. We combine these two cues to discover class-specific
pixels which are then used as an approximate ground truth for training a CNN.
While solving the weakly-supervised semantic segmentation task, we ensure that
the image-level classification task is also solved, which forces the CNN
to assign at least one pixel to each object present in the image.
Experimentally, on the PASCAL VOC12 val and test sets, we obtain mIoU scores of
60.8% and 61.9%, gains of 5.1% and 5.2% over the published
state-of-the-art results. The code is made publicly available.
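The sketch below illustrates the cue-combination idea: intersect a class-agnostic saliency map with per-class attention maps to mine approximate per-pixel labels, keeping at least one pixel per present class. The thresholds and combination rule are assumptions for illustration, not the authors' exact procedure.

```python
# Mine approximate pseudo ground-truth pixels from saliency + attention.
import numpy as np

def mine_class_specific_pixels(saliency, attention, image_labels,
                               sal_thresh=0.5, att_thresh=0.3):
    """saliency: (H, W) map in [0, 1]; attention: dict class_id -> (H, W);
    image_labels: class ids weakly known to be present in the image.
    Returns an (H, W) pseudo ground-truth map (0 = background/ignore)."""
    pseudo_gt = np.zeros(saliency.shape, dtype=np.int64)
    salient = saliency > sal_thresh
    for cls in image_labels:
        att = attention[cls]
        mask = salient & (att > att_thresh)
        if not mask.any():
            # Guarantee at least one pixel per present object by taking
            # the attention peak, so the classification task stays solved.
            mask = att == att.max()
        pseudo_gt[mask] = cls
    return pseudo_gt

sal = np.random.rand(4, 4)
att = {1: np.random.rand(4, 4)}
print(mine_class_specific_pixels(sal, att, image_labels=[1]))
```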