
    Measuring and Predicting Importance of Objects in Our Visual World

    Associating keywords with images automatically is an approachable and useful goal for visual recognition researchers. Good keywords name the distinctive and informative objects in an image. We argue that keywords need to be sorted by 'importance', which we define as the probability of an object being mentioned first by an observer. We propose a method for measuring this 'importance' from the object labels that multiple human observers give to a photograph of an everyday scene. We model object naming as drawing balls from an urn and fit this model to estimate 'importance'; because the fit combines naming order and frequency, it enables precise prediction even with limited human labeling. We explore the relationship between the importance of an object in a particular image and the area, centrality, and saliency of the corresponding image patches. Furthermore, our data show that many words are associated with even simple environments, and that only a few frequently appearing objects are shared across environments.
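    As a toy illustration of the order-plus-frequency idea above, the Python sketch below scores each object by how early and how often multiple observers name it. This is not the paper's urn-model fit; the estimator, the function name estimate_importance, and the toy labels are illustrative assumptions.

        from collections import Counter, defaultdict

        def estimate_importance(label_sequences):
            """Rough per-object 'importance' score from ordered naming lists.

            Hypothetical stand-in for the urn-model fit: combines how often an
            object is mentioned (frequency) with how early it appears (order).
            """
            mention_count = Counter()
            rank_sum = defaultdict(float)
            for labels in label_sequences:
                for rank, obj in enumerate(labels):
                    mention_count[obj] += 1
                    rank_sum[obj] += 1.0 / (rank + 1)  # earlier mentions weigh more
            n_observers = len(label_sequences)
            # Average reciprocal rank across observers (unmentioned counts as 0).
            return {obj: rank_sum[obj] / n_observers for obj in mention_count}

        # Toy example: three observers label the same photograph.
        observers = [
            ["dog", "ball", "grass"],
            ["dog", "grass"],
            ["ball", "dog", "tree"],
        ]
        print(sorted(estimate_importance(observers).items(), key=lambda kv: -kv[1]))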

    Learning to Generate Time-Lapse Videos Using Multi-Stage Dynamic Generative Adversarial Networks

    Given a photo taken outdoors, can we predict the immediate future, e.g., how the clouds will move across the sky? We address this problem with a two-stage, generative adversarial network (GAN) based approach to generating realistic, high-resolution time-lapse videos. Given the first frame, our model learns to generate long-term future frames. The first stage generates videos with realistic content in each frame. The second stage refines the video produced by the first stage, pushing it closer to real videos in terms of motion dynamics. To further encourage vivid motion in the final generated video, a Gram matrix is employed to model the motion more precisely. We build a large-scale time-lapse dataset and test our approach on it. Using our model, we are able to generate realistic videos at resolutions up to 128×128 for 32 frames. Quantitative and qualitative experimental results demonstrate the superiority of our model over state-of-the-art models. (Comment: To appear in Proceedings of CVPR 2018.)
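    To make the Gram-matrix idea concrete, here is a minimal PyTorch sketch of a Gram-based motion penalty between generated and real clips. The feature source, the function names, and the averaging scheme are assumptions for illustration, not the paper's exact loss.

        import torch
        import torch.nn.functional as F

        def gram_matrix(features):
            """Gram matrix of per-frame feature maps: (B, C, H, W) -> (B, C, C)."""
            b, c, h, w = features.shape
            f = features.view(b, c, h * w)
            return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

        def gram_motion_loss(fake_frame_feats, real_frame_feats):
            """Hypothetical Gram-based penalty comparing generated and real clips.

            Each argument is a list of per-frame feature tensors, e.g. taken from
            an intermediate layer of a video discriminator (an assumption here).
            """
            losses = [
                F.mse_loss(gram_matrix(f_fake), gram_matrix(f_real))
                for f_fake, f_real in zip(fake_frame_feats, real_frame_feats)
            ]
            return torch.stack(losses).mean()

        # Toy usage: random "features" for a 4-frame clip, batch of 2, 64 channels.
        fake = [torch.randn(2, 64, 16, 16) for _ in range(4)]
        real = [torch.randn(2, 64, 16, 16) for _ in range(4)]
        print(gram_motion_loss(fake, real).item())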

    Spatial-Aware Object Embeddings for Zero-Shot Localization and Classification of Actions

    We aim for zero-shot localization and classification of human actions in video. Where traditional approaches rely on global attribute or object classification scores for their zero-shot knowledge transfer, our main contribution is a spatial-aware object embedding. To arrive at spatial awareness, we build our embedding on top of freely available actor and object detectors. Relevance of objects is determined in a word embedding space and further enforced with estimated spatial preferences. Besides local object awareness, we also embed global object awareness into our embedding to maximize actor and object interaction. Finally, we exploit the object positions and sizes in the spatial-aware embedding to demonstrate a new spatio-temporal action retrieval scenario with composite queries. Action localization and classification experiments on four contemporary action video datasets support our proposal. Apart from state-of-the-art results in the zero-shot localization and classification settings, our spatial-aware embedding is even competitive with recent supervised action localization alternatives. (Comment: ICCV.)
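    A minimal NumPy sketch of the kind of scoring such an embedding enables follows: object relevance comes from word-embedding similarity to the action name, weighted by a spatial preference toward the actor. The IoU-based spatial term, the function names, and the toy detections are assumptions, not the paper's exact formulation.

        import numpy as np

        def cosine(u, v):
            """Cosine similarity between two word-embedding vectors."""
            return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

        def iou(box_a, box_b):
            """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
            x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
            x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
            inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
            area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
            area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
            return inter / (area_a + area_b - inter + 1e-8)

        def zero_shot_action_score(action_vec, actor_box, detections, word_vecs):
            """Hypothetical spatial-aware score for one action in one frame.

            detections: list of (object_name, box, detector_confidence).
            Relevance of each object = embedding similarity to the action name,
            weighted by detector confidence and actor-object overlap.
            """
            score = 0.0
            for name, box, conf in detections:
                if name in word_vecs:
                    score += conf * cosine(action_vec, word_vecs[name]) * iou(actor_box, box)
            return score

        # Toy usage with random 5-d vectors standing in for real word embeddings.
        rng = np.random.default_rng(0)
        vecs = {w: rng.normal(size=5) for w in ["horse", "saddle", "car"]}
        action_vec = rng.normal(size=5)  # stand-in for the embedding of an action name
        dets = [("horse", (10, 10, 60, 80), 0.9), ("car", (100, 20, 160, 90), 0.7)]
        print(zero_shot_action_score(action_vec, (20, 15, 55, 90), dets, vecs))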