24,744 research outputs found

    VIP: Finding Important People in Images

    Full text link
    People preserve memories of events such as birthdays, weddings, or vacations by capturing photos, often depicting groups of people. Invariably, some individuals in the image are more important than others given the context of the event. This paper analyzes the concept of the importance of individuals in group photographs. We address two specific questions -- Given an image, who are the most important individuals in it? Given multiple images of a person, which image depicts the person in the most important role? We introduce a measure of importance of people in images and investigate the correlation between importance and visual saliency. We find that not only can we automatically predict the importance of people from purely visual cues, incorporating this predicted importance results in significant improvement in applications such as im2text (generating sentences that describe images of groups of people)

    Neural Motifs: Scene Graph Parsing with Global Context

    Full text link
    We investigate the problem of producing structured graph representations of visual scenes. Our work analyzes the role of motifs: regularly appearing substructures in scene graphs. We present new quantitative insights on such repeated structures in the Visual Genome dataset. Our analysis shows that object labels are highly predictive of relation labels but not vice-versa. We also find that there are recurring patterns even in larger subgraphs: more than 50% of graphs contain motifs involving at least two relations. Our analysis motivates a new baseline: given object detections, predict the most frequent relation between object pairs with the given labels, as seen in the training set. This baseline improves on the previous state-of-the-art by an average of 3.6% relative improvement across evaluation settings. We then introduce Stacked Motif Networks, a new architecture designed to capture higher order motifs in scene graphs that further improves over our strong baseline by an average 7.1% relative gain. Our code is available at github.com/rowanz/neural-motifs.Comment: CVPR 2018 camera read

    Multimodal classification of driver glance

    Get PDF
    —This paper presents a multimodal approach to invehicle classification of driver glances. Driver glance is a strong predictor of cognitive load and is a useful input to many applications in the automotive domain. Six descriptive glance regions are defined and a classifier is trained on video recordings of drivers from a single low-cost camera. Visual features such as head orientation, eye gaze and confidence ratings are extracted, then statistical methods are used to perform failure analysis and calibration on the visual features. Non-visual features such as steering wheel angle and indicator position are extracted from a RaceLogic VBOX system. The approach is evaluated on a dataset containing multiple 60 second samples from 14 participants recorded while driving in a natural environment. We compare our multimodal approach to separate unimodal approaches using both Support Vector Machine (SVM) and Random Forests (RF) classifiers. RF Mean Decrease in Gini Index is used to rank selected features which gives insight into the selected features and improves the classifier performance. We demonstrate that our multimodal approach yields significantly higher results than unimodal approaches. The final model achieves an average F1 score of 70.5% across the six classes

    Automatic Eye-Gaze Following from 2-D Static Images: Application to Classroom Observation Video Analysis

    Get PDF
    In this work, we develop an end-to-end neural network-based computer vision system to automatically identify where each person within a 2-D image of a school classroom is looking (“gaze following�), as well as who she/he is looking at. Automatic gaze following could help facilitate data-mining of large datasets of classroom observation videos that are collected routinely in schools around the world in order to understand social interactions between teachers and students. Our network is based on the architecture by Recasens, et al. (2015) but is extended to (1) predict not only where, but who the person is looking at; and (2) predict whether each person is looking at a target inside or outside the image. Since our focus is on classroom observation videos, we collect gaze dataset (48,907 gaze annotations over 2,263 classroom images) for students and teachers in classrooms. Results of our experiments indicate that the proposed neural network can estimate the gaze target - either the spatial location or the face of a person - with substantially higher accuracy compared to several baselines

    How Facial Features Convey Attention in Stationary Environments

    Full text link
    Awareness detection technologies have been gaining traction in a variety of enterprises; most often used for driver fatigue detection, recent research has shifted towards using computer vision technologies to analyze user attention in environments such as online classrooms. This paper aims to extend previous research on distraction detection by analyzing which visual features contribute most to predicting awareness and fatigue. We utilized the open-source facial analysis toolkit OpenFace in order to analyze visual data of subjects at varying levels of attentiveness. Then, using a Support-Vector Machine (SVM) we created several prediction models for user attention and identified the Histogram of Oriented Gradients (HOG) and Action Units to be the greatest predictors of the features we tested. We also compared the performance of this SVM to deep learning approaches that utilize Convolutional and/or Recurrent neural networks (CNNs and CRNNs). Interestingly, CRNNs did not appear to perform significantly better than their CNN counterparts. While deep learning methods achieved greater prediction accuracy, SVMs utilized less resources and, using certain parameters, were able to approach the performance of deep learning methods

    Discovering Class-Specific Pixels for Weakly-Supervised Semantic Segmentation

    Full text link
    We propose an approach to discover class-specific pixels for the weakly-supervised semantic segmentation task. We show that properly combining saliency and attention maps allows us to obtain reliable cues capable of significantly boosting the performance. First, we propose a simple yet powerful hierarchical approach to discover the class-agnostic salient regions, obtained using a salient object detector, which otherwise would be ignored. Second, we use fully convolutional attention maps to reliably localize the class-specific regions in a given image. We combine these two cues to discover class-specific pixels which are then used as an approximate ground truth for training a CNN. While solving the weakly supervised semantic segmentation task, we ensure that the image-level classification task is also solved in order to enforce the CNN to assign at least one pixel to each object present in the image. Experimentally, on the PASCAL VOC12 val and test sets, we obtain the mIoU of 60.8% and 61.9%, achieving the performance gains of 5.1% and 5.2% compared to the published state-of-the-art results. The code is made publicly available
    • …
    corecore