7 research outputs found

    Recognizing actions from still images

    In this paper, we approach the problem of understanding human actions from still images. Our method involves representing the pose with a spatial and orientational histogramming of rectangular regions on a parse probability map. We use LDA to obtain a more compact and discriminative feature representation and binary SVMs for classification. Our results over a new dataset collected for this problem show that by using a rectangle histogramming approach, we can discriminate actions to a great extent. We also show how we can use this approach in an unsupervised setting. To the best of our knowledge, this is one of the first studies that attempt to recognize actions in still images.
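
    A minimal sketch of the classification stage described above, assuming rectangle-histogram pose features have already been extracted (the feature extraction itself is not reproduced here); scikit-learn is used for LDA and one-vs-rest linear SVMs, and all array names, sizes, and labels below are illustrative placeholders rather than values from the paper.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X_train = rng.random((200, 512))          # placeholder rectangle-histogram pose features
y_train = rng.integers(0, 5, size=200)    # placeholder action labels (5 classes)
X_test = rng.random((20, 512))

# LDA projects the histograms to a compact, discriminative subspace
# (at most n_classes - 1 dimensions).
lda = LinearDiscriminantAnalysis(n_components=4)
Z_train = lda.fit_transform(X_train, y_train)
Z_test = lda.transform(X_test)

# One binary SVM per action class (one-vs-rest), as in the abstract.
clf = OneVsRestClassifier(LinearSVC())
clf.fit(Z_train, y_train)
predicted_actions = clf.predict(Z_test)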

    Learning Actions From the Web

    This paper proposes a generic method for action recognition in uncontrolled videos. The idea is to use images collected from the Web to learn representations of actions and to use this knowledge to automatically annotate actions in videos. Our approach is unsupervised in the sense that it requires no human intervention other than the text querying. Its benefits are two-fold: 1) we can improve the retrieval of action images, and 2) we can collect a large generic database of action poses, which can then be used in tagging videos. We present experimental evidence that annotating actions in videos is possible using action images collected from the Web.
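
    A rough sketch of the video-tagging idea described above, under the assumption that features for the Web-collected action images and for the video frames are already available; the classifier choice (logistic regression), the majority-vote aggregation, and every array below are illustrative stand-ins, not the paper's actual pipeline.

import numpy as np
from collections import Counter
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
web_features = rng.random((300, 256))           # placeholder features of Web images per text query
web_labels = rng.integers(0, 4, size=300)       # action label implied by the query term

action_model = LogisticRegression(max_iter=1000).fit(web_features, web_labels)

def tag_video(frame_features: np.ndarray) -> int:
    """Annotate a video with the action predicted most often across its frames."""
    frame_predictions = action_model.predict(frame_features)
    return Counter(frame_predictions).most_common(1)[0][0]

video_frames = rng.random((40, 256))            # placeholder per-frame features
print(tag_video(video_frames))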

    Visual Saliency Estimation via Attribute Based Classifiers and Conditional Random Field

    Visual saliency estimation is a computer vision problem that aims to find the regions of interest that are frequently in eye focus in a scene or an image. Since most computer vision problems require discarding irrelevant regions in a scene, visual saliency estimation can be used as a preprocessing step in such problems. In this work, we propose a method to solve the top-down saliency estimation problem using attribute-based classifiers and Conditional Random Fields (CRF). Experimental results show that attribute-based classifiers encode visual information better than low-level features and that the presented approach generates promising results compared to state-of-the-art approaches on the Graz-02 dataset.
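
    A toy sketch of the overall idea, assuming per-region scores from a few attribute classifiers are already available; the full CRF inference is replaced here by a simple iterative neighbourhood smoothing, and the grid size, number of attributes, and weights are all invented for illustration.

import numpy as np

rng = np.random.default_rng(0)
H, W = 8, 8                                  # image divided into an 8x8 grid of regions
attribute_scores = rng.random((H, W, 3))     # placeholder scores from 3 attribute classifiers
unary = attribute_scores.mean(axis=2)        # combine attribute evidence per region

def smooth(saliency: np.ndarray, n_iters: int = 10, w: float = 0.5) -> np.ndarray:
    """Blend each region with its 4-neighbourhood average, a crude stand-in
    for pairwise CRF terms that encourage spatially smooth saliency."""
    s = saliency.copy()
    for _ in range(n_iters):
        padded = np.pad(s, 1, mode="edge")
        neighbours = (padded[:-2, 1:-1] + padded[2:, 1:-1] +
                      padded[1:-1, :-2] + padded[1:-1, 2:]) / 4.0
        s = (1 - w) * saliency + w * neighbours
    return s

saliency_map = smooth(unary)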

    Red Carpet to Fight Club: Partially-supervised Domain Transfer for Face Recognition in Violent Videos

    In many real-world problems, there is typically a large discrepancy between the characteristics of data used in training versus deployment. A prime example is the analysis of aggression videos: in a criminal incident, suspects typically need to be identified based on their clean, portrait-like photos rather than prior video recordings. This results in three major challenges: a large domain discrepancy between violent videos and ID photos, the lack of video examples for most individuals, and limited training data availability. To mimic such scenarios, we formulate a realistic domain-transfer problem, where the goal is to transfer a recognition model trained on clean posed images to the target domain of violent videos, where training videos are available only for a subset of subjects. To this end, we introduce the "WildestFaces" dataset, tailored to study cross-domain recognition under a variety of adverse conditions. We divide the task of transferring a recognition model from the domain of clean images to violent videos into two sub-problems and tackle them using (i) stacked affine transforms for classifier transfer and (ii) attention-driven pooling for temporal adaptation. We additionally formulate a self-attention based model for domain transfer. We establish a rigorous evaluation protocol for this "clean-to-violent" recognition task and present a detailed analysis of the proposed dataset and the methods. Our experiments highlight the unique challenges introduced by the WildestFaces dataset and the advantages of the proposed approach.
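
    The sketch below illustrates the two adaptation components named in the abstract, using PyTorch and invented dimensions: a single learnable affine map placed in front of a classifier trained on clean photos (a simplified stand-in for the stacked affine transforms) and attention-weighted pooling of per-frame face embeddings. None of the sizes or module names come from the paper.

import torch
import torch.nn as nn

class AffineClassifierTransfer(nn.Module):
    """Wrap a source-domain classifier with a learnable affine map of its
    input features (simplified stand-in for stacked affine transforms)."""
    def __init__(self, source_classifier: nn.Linear, dim: int):
        super().__init__()
        self.affine = nn.Linear(dim, dim)        # adapted on the violent-video domain
        self.source_classifier = source_classifier

    def forward(self, x):
        return self.source_classifier(self.affine(x))

class AttentionPooling(nn.Module):
    """Pool a variable-length sequence of frame embeddings into a single
    video-level embedding using learned attention weights."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, frames):                   # frames: (num_frames, dim)
        weights = torch.softmax(self.score(frames), dim=0)
        return (weights * frames).sum(dim=0)

dim, num_identities = 128, 10                    # placeholder sizes
source_clf = nn.Linear(dim, num_identities)      # classifier trained on clean ID photos
model = AffineClassifierTransfer(source_clf, dim)
pool = AttentionPooling(dim)

frames = torch.randn(25, dim)                    # placeholder per-frame face embeddings
logits = model(pool(frames).unsqueeze(0))        # identity scores for the clip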

    Data-driven image captioning with meta-class based retrieval

    Automatic image captioning, the process of producing a description for an image, is a very challenging problem which has only recently received interest from the computer vision and natural language processing communities. In this study, we present a novel data-driven image captioning strategy, which, for a given image, finds the most visually similar image in a large dataset of image-caption pairs and transfers its caption as the description of the input image. Our novelty lies in employing a recently proposed high-level global image representation, named the meta-class descriptor, to better capture the semantic content of the input image for use in the retrieval process. Our experiments show that, compared to the baseline Im2Text model, our meta-class guided approach produces more accurate descriptions.
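
    A minimal sketch of the caption-transfer step, assuming a global descriptor (standing in for the meta-class descriptor) has already been computed for every image; the database size, descriptor dimensionality, and caption strings below are placeholders.

import numpy as np

rng = np.random.default_rng(0)
db_descriptors = rng.random((1000, 64))                       # placeholder global descriptors
db_captions = [f"caption of database image {i}" for i in range(1000)]

def transfer_caption(query_descriptor: np.ndarray) -> str:
    """Return the caption of the most visually similar database image,
    using cosine similarity on the global descriptor."""
    db_norm = db_descriptors / np.linalg.norm(db_descriptors, axis=1, keepdims=True)
    q_norm = query_descriptor / np.linalg.norm(query_descriptor)
    best = int(np.argmax(db_norm @ q_norm))
    return db_captions[best]

print(transfer_caption(rng.random(64)))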

    Data-driven image captioning via salient region discovery

    In the past few years, automatically generating descriptions for images has attracted a lot of attention in computer vision and natural language processing research. Among the existing approaches, data-driven methods have proven to be highly effective. These methods compare the given image against a large set of training images to determine a set of relevant images, then generate a description using the associated captions. In this study, the authors propose to integrate an object-based semantic image representation into a deep features-based retrieval framework to select the relevant images. Moreover, they present a novel phrase selection paradigm and a sentence generation model which depends on a joint analysis of salient regions in the input and retrieved images within a clustering framework. The authors demonstrate the effectiveness of their proposed approach on the Flickr8K and Flickr30K benchmark datasets and show that their model gives highly competitive results compared with the state-of-the-art models.
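
    A rough sketch of the retrieval and region-clustering idea, assuming deep global features, salient-region features, and candidate phrases are already available for each database image; the feature sizes, number of regions, clustering choice (k-means), and phrase strings are all illustrative, and the authors' actual phrase selection and sentence generation models are not reproduced here.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
db_global = rng.random((500, 128))                 # placeholder deep global features
db_regions = rng.random((500, 4, 32))              # 4 salient-region features per database image
db_phrases = [[f"phrase {i}-{j}" for j in range(4)] for i in range(500)]

query_global = rng.random(128)
query_regions = rng.random((4, 32))                # salient-region features of the input image

# 1) Retrieve the k most similar database images by global features.
k = 5
nn_idx = np.argsort(-db_global @ query_global)[:k]

# 2) Jointly cluster the salient-region features of the input and retrieved images.
regions = np.vstack([query_regions] + [db_regions[i] for i in nn_idx])
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(regions)
query_clusters = set(labels[:len(query_regions)])

# 3) Keep phrases whose source region shares a cluster with an input-image region.
selected = [db_phrases[img][r]
            for pos, img in enumerate(nn_idx)
            for r in range(4)
            if labels[len(query_regions) + pos * 4 + r] in query_clusters]
print(selected[:5])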

    TasvirEt: A Benchmark Dataset for Automatic Turkish Description Generation from Images

    Automatically describing images with natural sentences is considered a challenging research problem that has only recently been explored. Although the number of methods proposed to solve this problem increases over time, the studies have mostly been limited to a single language, namely English, since the datasets commonly used in this field contain only English descriptions. In this study, for the first time in the literature, a new dataset is proposed that enables generating Turkish descriptions from images and can be used as a benchmark for this purpose. Furthermore, two approaches are proposed, again for the first time in the literature, for image captioning in Turkish with this dataset, which we named TasvirEt. Our findings indicate that the new Turkish dataset and the approaches presented here can be successfully used for automatically describing images in Turkish.