57 research outputs found

    Automatic Concept Discovery from Parallel Text and Visual Corpora

    Full text link
    Humans connect language and vision to perceive the world. How to build a similar connection for computers? One possible way is via visual concepts, which are text terms that relate to visually discriminative entities. We propose an automatic visual concept discovery algorithm using parallel text and visual corpora; it filters text terms based on the visual discriminative power of the associated images, and groups them into concepts using visual and semantic similarities. We illustrate the applications of the discovered concepts using bidirectional image and sentence retrieval task and image tagging task, and show that the discovered concepts not only outperform several large sets of manually selected concepts significantly, but also achieves the state-of-the-art performance in the retrieval task.Comment: To appear in ICCV 201

    Understanding Human Actions in Video

    Full text link
    Understanding human behavior is crucial for any autonomous system which interacts with humans. For example, assistive robots need to know when a person is signaling for help, and autonomous vehicles need to know when a person is waiting to cross the street. However, identifying human actions in video is a challenging and unsolved problem. In this work, we address several of the key challenges in human action recognition. To enable better representations of video sequences, we develop novel deep learning architectures which improve representations both at the level of instantaneous motion as well as at the level of long-term context. In addition, to reduce reliance on fixed action vocabularies, we develop a compositional representation of actions which allows novel action descriptions to be represented as a sequence of sub-actions. Finally, we address the issue of data collection for human action understanding by creating a large-scale video dataset, consisting of 70 million videos collected from internet video sharing sites and their matched descriptions. We demonstrate that these contributions improve the generalization performance of human action recognition systems on several benchmark datasets.PHDComputer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/162887/1/stroud_1.pd

    Modeling Visual Rhetoric and Semantics in Multimedia

    Get PDF
    Recent advances in machine learning have enabled computer vision algorithms to model complicated visual phenomena with accuracies unthinkable a mere decade ago. Their high-performance on a plethora of vision-related tasks has enabled computer vision researchers to begin to move beyond traditional visual recognition problems to tasks requiring higher-level image understanding. However, most computer vision research still focuses on describing what images, text, or other media literally portrays. In contrast, in this dissertation we focus on learning how and why such content is portrayed. Rather than viewing media for its content, we recast the problem as understanding visual communication and visual rhetoric. For example, the same content may be portrayed in different ways in order to present the story the author wishes to convey. We thus seek to model not only the content of the media, but its authorial intent and latent messaging. Understanding how and why visual content is portrayed a certain way requires understanding higher level abstract semantic concepts which are themselves latent within visual media. By latent, we mean the concept is not readily visually accessible within a single image (e.g. right vs left political bias), in contrast to explicit visual semantic concepts such as objects. Specifically, we study the problems of modeling photographic style (how professional photographers portray their subjects), understanding visual persuasion in image advertisements, modeling political bias in multimedia (image and text) news articles, and learning cross-modal semantic representations. While most past research in vision and natural language processing studies the case where visual content and paired text are highly aligned (as in the case of image captions), we target the case where each modality conveys complementary information to tell a larger story. We particularly focus on the problem of learning cross-modal representations from multimedia exhibiting weak alignment between the image and text modalities. A variety of techniques are presented which improve modeling of multimedia rhetoric in real-world data and enable more robust artificially intelligent systems

    Temporal Localization of Fine-Grained Actions in Videos by Domain Transfer from Web Images

    Full text link
    We address the problem of fine-grained action localization from temporally untrimmed web videos. We assume that only weak video-level annotations are available for training. The goal is to use these weak labels to identify temporal segments corresponding to the actions, and learn models that generalize to unconstrained web videos. We find that web images queried by action names serve as well-localized highlights for many actions, but are noisily labeled. To solve this problem, we propose a simple yet effective method that takes weak video labels and noisy image labels as input, and generates localized action frames as output. This is achieved by cross-domain transfer between video frames and web images, using pre-trained deep convolutional neural networks. We then use the localized action frames to train action recognition models with long short-term memory networks. We collect a fine-grained sports action data set FGA-240 of more than 130,000 YouTube videos. It has 240 fine-grained actions under 85 sports activities. Convincing results are shown on the FGA-240 data set, as well as the THUMOS 2014 localization data set with untrimmed training videos.Comment: Camera ready version for ACM Multimedia 201
    • …
    corecore