
    Zero-Shot Event Detection by Multimodal Distributional Semantic Embedding of Videos

    We propose a new zero-shot event detection method based on multi-modal distributional semantic embedding of videos. Our model embeds object and action concepts, as well as other available modalities, from videos into a distributional semantic space. To our knowledge, this is the first zero-shot event detection model built on top of distributional semantics, and it extends them in the following directions: (a) semantic embedding of multimodal information in videos (with a focus on the visual modalities), (b) automatically determining the relevance of concepts/attributes to a free-text query, which could be useful for other applications, and (c) retrieving videos by a free-text event query (e.g., "changing a vehicle tire") based on their content. We embed videos into a distributional semantic space and then measure the similarity between the videos and the event query in free-text form. We validated our method on the large TRECVID MED (Multimedia Event Detection) challenge. Using only the event title as a query, our method outperformed the state-of-the-art that uses full textual descriptions, improving MAP from 12.6% to 13.5% and ROC-AUC from 0.73 to 0.83. It is also an order of magnitude faster. Comment: To appear in AAAI 201
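The retrieval idea in the abstract (pool detected concepts into a distributional semantic space, then score videos against a free-text query by similarity) can be illustrated with a minimal numpy sketch. The tiny word-vector table and detector confidences below are invented for illustration and are not the paper's actual embeddings or detectors:

```python
import numpy as np

# Toy word-vector table standing in for a distributional semantic space
# (in practice these would be learned vectors, e.g. skip-gram embeddings).
WORD_VECS = {
    "changing": np.array([0.9, 0.1, 0.0]),
    "tire":     np.array([0.8, 0.2, 0.1]),
    "vehicle":  np.array([0.7, 0.3, 0.0]),
    "dog":      np.array([0.0, 0.1, 0.9]),
    "wheel":    np.array([0.75, 0.25, 0.05]),
}

def embed_text(words):
    """Embed a free-text query as the mean of its word vectors."""
    return np.mean([WORD_VECS[w] for w in words], axis=0)

def embed_video(concept_scores):
    """Pool concept word vectors weighted by (hypothetical) detector confidence."""
    vec = sum(s * WORD_VECS[c] for c, s in concept_scores.items())
    return vec / sum(concept_scores.values())

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = embed_text(["changing", "vehicle", "tire"])
tire_video = embed_video({"wheel": 0.8, "vehicle": 0.6})
dog_video = embed_video({"dog": 0.9})
# The tire-changing video scores higher against the query than the dog video.
assert cosine(query, tire_video) > cosine(query, dog_video)
```

Because both videos and queries live in the same semantic space, retrieval reduces to ranking videos by similarity to the embedded query, with no event-specific training.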

    A Survey on Food Computing

    Food is essential to human life and fundamental to the human experience. Food-related study may support multifarious applications and services, such as guiding human behavior, improving human health, and understanding culinary culture. With the rapid development of social networks, mobile networks, and the Internet of Things (IoT), people commonly upload, share, and record food images, recipes, cooking videos, and food diaries, leading to large-scale food data. Large-scale food data offers rich knowledge about food and can help tackle many central issues of human society. Therefore, it is time to group several disparate issues related to food under the umbrella of food computing. Food computing acquires and analyzes heterogeneous food data from disparate sources for the perception, recognition, retrieval, recommendation, and monitoring of food. In food computing, computational approaches are applied to address food-related issues in medicine, biology, gastronomy, and agronomy. Both large-scale food data and recent breakthroughs in computer science are transforming the way we analyze food data. Accordingly, a vast amount of work has been conducted in the food area, targeting different food-oriented tasks and applications. However, there are very few systematic reviews that shape this area well and provide a comprehensive, in-depth summary of current efforts or detail the open problems in this area. In this paper, we formalize food computing and present a comprehensive overview of its various emerging concepts, methods, and tasks. We summarize the key challenges and future directions for food computing. This is the first comprehensive survey that targets the study of computing technology for the food area, and it also offers a collection of research studies and technologies to benefit researchers and practitioners working in different food-related fields. Comment: Accepted by ACM Computing Survey

    COMIC: Towards A Compact Image Captioning Model with Attention

    Recent works in image captioning have shown very promising raw performance. However, most of these encoder-decoder networks with attention do not scale naturally to large vocabulary sizes, making them difficult to deploy on embedded systems with limited hardware resources. This is because the size of the word and output embedding matrices grows proportionally with the vocabulary size, adversely affecting the compactness of these networks. To address this limitation, this paper introduces a new idea in the domain of image captioning: we tackle the hitherto unexplored problem of the compactness of image captioning models. We show that our proposed model, named COMIC for COMpact Image Captioning, achieves results comparable to state-of-the-art approaches on five common evaluation metrics on both the MS-COCO and InstaPIC-1.1M datasets, despite having an embedding vocabulary that is 39x - 99x smaller. The source code and models are available at: https://github.com/jiahuei/COMIC-Compact-Image-Captioning-with-Attention Comment: Added source code link and new results in Table
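The abstract's observation, that embedding and output matrices grow with vocabulary size, can be made concrete with a parameter-count sketch. Factorizing the embedding through a small bottleneck is one illustrative way to shrink the vocabulary-dependent parameters; this is not necessarily COMIC's actual mechanism, and all sizes below are invented:

```python
import numpy as np

vocab_size, embed_dim, bottleneck = 10000, 512, 64

rng = np.random.default_rng(0)
# Small per-word codes projected up to the working dimension: the
# vocabulary-dependent part of the model now scales with the bottleneck
# width rather than the full embedding width.
codes = rng.normal(size=(vocab_size, bottleneck))
proj = rng.normal(size=(bottleneck, embed_dim))

def embed(word_id):
    """Look up a compact code and project it to the full embedding size."""
    return codes[word_id] @ proj

full_params = vocab_size * embed_dim      # dense embedding: 5,120,000 parameters
factored_params = codes.size + proj.size  # factorized: 672,768 parameters
assert factored_params < full_params / 7  # roughly 7.6x fewer parameters
```

The same factorization can be applied to the output projection, which is the other matrix that grows with the vocabulary.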

    Zero-Shot Object Recognition System based on Topic Model

    Object recognition systems usually require complete, manually labeled training data to train the classifier. In this paper, we study the problem of object recognition where training samples for some classes are missing during the classifier learning stage, a task also known as zero-shot learning. We propose a novel zero-shot learning strategy that utilizes a topic model and a hierarchical class concept. An advantage of our proposed method is that the cumbersome human annotation stage (i.e., attribute-based classification) is eliminated. We achieve performance comparable to state-of-the-art algorithms on four public datasets, PubFig (67.09%), Cifar-100 (54.85%), Caltech-256 (52.14%), and Animals with Attributes (49.65%), when unseen classes exist in the classification task. Comment: To appear in IEEE Transactions on Human-Machine System

    Vision-to-Language Tasks Based on Attributes and Attention Mechanism

    Vision-to-language tasks aim to integrate computer vision and natural language processing, and have attracted the attention of many researchers. Typical approaches encode an image into feature representations and decode them into natural language sentences, but they neglect high-level semantic concepts and the subtle relationships between image regions and natural language elements. To make full use of this information, this paper attempts to exploit text-guided attention and semantic-guided attention (SA) to find the most correlated spatial information and reduce the semantic gap between vision and language. Our method includes two attention networks. One is the text-guided attention network, which is used to select the text-related regions. The other is the SA network, which is used to highlight the concept-related regions and the region-related concepts. Finally, all of this information is incorporated to generate captions or answers. In practice, image captioning and visual question answering experiments have been carried out, and the experimental results show the excellent performance of the proposed approach. Comment: 15 pages, 6 figures, 50 reference
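The text-guided attention level described above can be sketched in a few lines: region features are scored against a text vector, the scores are normalized with a softmax, and the regions are pooled by those weights. The shapes and the toy region/text vectors below are invented for illustration and do not reproduce the paper's exact network:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def text_guided_attention(regions, text_vec):
    """Attend over image regions guided by a text vector.
    regions: (num_regions, dim); text_vec: (dim,).
    Returns the attended feature and the attention weights."""
    scores = regions @ text_vec    # relevance of each region to the text
    weights = softmax(scores)      # attention distribution over regions
    return weights @ regions, weights

# Three orthogonal toy region features; the text vector is closest to region 1.
regions = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.0, 0.0, 1.0]])
text = np.array([0.1, 0.9, 0.2])
attended, w = text_guided_attention(regions, text)
assert w.argmax() == 1  # attention concentrates on the text-related region
```

The semantic-guided attention level would follow the same pattern with concept vectors in place of the raw text vector.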

    Active Speakers in Context

    Current methods for active speaker detection focus on modeling short-term audiovisual information from a single speaker. Although this strategy can be enough for addressing single-speaker scenarios, it prevents accurate detection when the task is to identify which of many candidate speakers are talking. This paper introduces the Active Speaker Context, a novel representation that models relationships between multiple speakers over long time horizons. Our Active Speaker Context is designed to learn pairwise and temporal relations from a structured ensemble of audio-visual observations. Our experiments show that a structured feature ensemble already benefits active speaker detection performance. Moreover, we find that the proposed Active Speaker Context improves the state-of-the-art on the AVA-ActiveSpeaker dataset, achieving a mAP of 87.1%. We present ablation studies verifying that this result is a direct consequence of our long-term multi-speaker analysis.

    Tri-axial Self-Attention for Concurrent Activity Recognition

    We present a system for concurrent activity recognition. To extract features associated with different activities, we propose a feature-to-activity attention that maps the extracted global features to sub-features associated with individual activities. To model the temporal associations of individual activities, we propose a transformer-network encoder that models independent temporal associations for each activity. To make the concurrent activity prediction aware of potential associations between activities, we propose self-attention with an association mask. Our system achieves state-of-the-art or comparable performance on three commonly used concurrent activity detection datasets. Our visualizations demonstrate that the system is able to locate the important spatio-temporal features for final decision making. We also show that our system can be applied to general multi-label classification problems.
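Self-attention with an association mask, as described above, can be sketched as ordinary scaled dot-product attention where disallowed activity pairs are blocked before the softmax. The toy features and mask below are invented for illustration; the paper's actual mask construction is not specified in the abstract:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(x, mask):
    """Scaled dot-product self-attention over activity features x (n, d).
    mask[i, j] = 0 blocks activity j from influencing activity i."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    scores = np.where(mask.astype(bool), scores, -np.inf)  # block masked pairs
    return softmax(scores, axis=-1) @ x

# Three activity features; activity 0 is associated only with itself and 1.
x = np.eye(3)
mask = np.array([[1, 1, 0],
                 [1, 1, 1],
                 [0, 1, 1]])
out = masked_self_attention(x, mask)
# Activity 0's output receives no contribution from activity 2's feature.
assert out[0, 2] == 0.0
```

Setting masked scores to negative infinity makes their softmax weight exactly zero, so unassociated activities cannot exchange information.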

    Learning like a Child: Fast Novel Visual Concept Learning from Sentence Descriptions of Images

    In this paper, we address the task of learning novel visual concepts, and their interactions with other concepts, from a few images with sentence descriptions. Using linguistic context and visual features, our method is able to efficiently hypothesize the semantic meaning of new words and add them to its word dictionary so that they can be used to describe images that contain these novel concepts. Our method has an image captioning module based on m-RNN with several improvements. In particular, we propose a transposed weight sharing scheme, which not only improves performance on image captioning, but also makes the model more suitable for the novel concept learning task. We propose methods to prevent overfitting the new concepts. In addition, three novel concept datasets are constructed for this new task. In the experiments, we show that our method effectively learns novel visual concepts from a few examples without disturbing the previously learned concepts. The project page is http://www.stat.ucla.edu/~junhua.mao/projects/child_learning.html Comment: ICCV 2015 camera-ready version. We add many more novel visual concepts to the NVC dataset and have released it; see http://www.stat.ucla.edu/~junhua.mao/projects/child_learning.htm
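The core of transposed weight sharing is that a single embedding matrix serves both roles: its rows embed input words, and its transpose projects the decoder state back to vocabulary logits, roughly halving the vocabulary-dependent parameters. The sketch below shows only that weight-tying idea, with invented sizes, and omits the rest of the m-RNN captioner:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 1000, 64

# One shared matrix: rows are input word embeddings, and E.T maps a hidden
# state to output logits, so a new word adds only one row of parameters.
E = rng.normal(size=(vocab, dim))

def embed(word_id):
    """Input side: look up a word's embedding row."""
    return E[word_id]

def output_logits(hidden):
    """Output side: score all words with the transposed embedding."""
    return hidden @ E.T

# A hidden state aligned with word 42's embedding scores word 42 highest.
h = embed(42)
assert output_logits(h).argmax() == 42
```

This tying is also what makes novel concept learning cheap: learning one new row simultaneously updates how the word is read in and how it is predicted.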

    Probabilistic Semantic Retrieval for Surveillance Videos with Activity Graphs

    We present a novel framework for finding complex activities matching user-described queries in cluttered surveillance videos. The wide diversity of queries, coupled with the unavailability of annotated activity data, limits our ability to train activity models. To bridge the semantic gap, we propose to let users describe an activity as a semantic graph with object attributes and inter-object relationships associated with nodes and edges, respectively. We learn node- and edge-level visual predictors during training and, at test time, retrieve an activity by identifying likely locations that match the semantic graph. We formulate a novel CRF-based probabilistic activity localization objective that accounts for mis-detections, mis-classifications, and track losses, and outputs a likelihood score for a candidate grounded location of the query in the video. We seek groundings that maximize overall precision and recall. To handle the combinatorial search over all high-probability groundings, we propose a highest-precision subgraph matching algorithm. Our method outperforms existing retrieval methods on benchmark datasets. Comment: 1520-9210 (c) 2018 IEEE. This paper has been accepted by IEEE Transactions on Multimedia. Print ISSN: 1520-9210. Online ISSN: 1941-0077. Preprint link: https://ieeexplore.ieee.org/document/8438958
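The grounding objective sketched above, score an assignment of graph nodes to video tracks by combining node-level and edge-level predictor outputs, can be illustrated with a brute-force toy. All scores and the "person near car" query below are invented; the paper uses a CRF objective and an efficient subgraph matching algorithm rather than this exhaustive search:

```python
import itertools

# Hypothetical predictor outputs for a toy query graph over two tracks.
# node_score[node][track]: how well a track matches the node's attributes.
# edge_score[edge][(t1, t2)]: how well a track pair satisfies the relationship.
node_score = {"person": {0: 0.9, 1: 0.2}, "car": {0: 0.1, 1: 0.8}}
edge_score = {("person", "car"): {(0, 1): 0.7, (1, 0): 0.7}}
nodes, edges, tracks = ["person", "car"], [("person", "car")], [0, 1]

def grounding_score(assignment):
    """Likelihood-style score of one grounding of graph nodes onto tracks."""
    s = 1.0
    for n in nodes:
        s *= node_score[n][assignment[n]]
    for e in edges:
        a, b = e
        s *= edge_score[e][(assignment[a], assignment[b])]
    return s

def best_grounding():
    """Brute-force the highest-scoring assignment (exponential in general,
    which is why an efficient matching algorithm is needed in practice)."""
    return max(
        (dict(zip(nodes, p)) for p in itertools.permutations(tracks, len(nodes))),
        key=grounding_score,
    )

assert best_grounding() == {"person": 0, "car": 1}
```

Because the number of assignments explodes with graph and track counts, restricting the search to high-precision partial matches is what makes retrieval tractable.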

    Inferring Human Activities Using Robust Privileged Probabilistic Learning

    Classification models may often suffer from a "structure imbalance" between training and testing data that can occur due to a deficient data collection process. This imbalance can be addressed by the learning using privileged information (LUPI) paradigm. In this paper, we present a supervised probabilistic classification approach that integrates LUPI into a hidden conditional random field (HCRF) model. The proposed model, called LUPI-HCRF, is able to cope with additional information that is only available during training. Moreover, the proposed method employs Student's t-distribution to provide robustness to outliers by modeling the conditional distribution of the privileged information. Experimental results on three publicly available datasets demonstrate the effectiveness of the proposed approach and improve upon the state-of-the-art in the LUPI framework for recognizing human activities. Comment: To appear in ICCV Workshops 2017 (TASK-CV
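The reason a Student's t-distribution yields robustness to outliers, as claimed above, is that its heavy tails penalize extreme observations far less than a Gaussian, so outliers pull parameter estimates less. A minimal sketch comparing the two log-densities (standard parameterizations, unrelated to the paper's specific HCRF formulation):

```python
import math

def gauss_logpdf(x, mu=0.0, sigma=1.0):
    """Log-density of a Gaussian: quadratic penalty in x."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2 * math.pi))

def student_t_logpdf(x, nu=3.0, mu=0.0, sigma=1.0):
    """Log-density of a Student's t: only logarithmic penalty in the tails."""
    z = (x - mu) / sigma
    c = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
         - 0.5 * math.log(nu * math.pi) - math.log(sigma))
    return c - (nu + 1) / 2 * math.log1p(z * z / nu)

# An outlier at x = 10 is penalized far less under the heavy-tailed t
# (log-density ~ -8) than under the Gaussian (log-density ~ -51).
assert student_t_logpdf(10.0) > gauss_logpdf(10.0)
```

As the degrees of freedom nu grow, the t-distribution approaches the Gaussian, so nu acts as a tunable robustness knob.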