Zero-Shot Event Detection by Multimodal Distributional Semantic Embedding of Videos
We propose a new zero-shot event detection method based on multimodal
distributional semantic embedding of videos. Our model embeds object and action
concepts, as well as other available modalities, from videos into a
distributional semantic space. To our knowledge, this is the first zero-shot
event detection model built on top of distributional semantics, extending it in
the following directions: (a) semantic embedding of multimodal information in
videos (with a focus on the visual modalities), (b) automatically determining
the relevance of concepts/attributes to a free-text query, which could be
useful for other applications, and (c) retrieving videos by a free-text event
query (e.g., "changing a vehicle tire") based on their content. We embed videos
into a distributional semantic space and then measure the similarity between
videos and the event query in free-text form. We validated our method on the
large TRECVID MED (Multimedia Event Detection) challenge. Using only the event
title as a query, our method outperformed state-of-the-art approaches that use
full event descriptions, improving MAP from 12.6% to 13.5% and ROC-AUC from
0.73 to 0.83.
It is also an order of magnitude faster.
Comment: To appear in AAAI 2016.
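To make the retrieval idea concrete, below is a minimal sketch of the kind of distributional semantic matching the abstract describes: videos represented by concept-detector scores and queries by word vectors, both embedded into a shared space and ranked by cosine similarity. The word vectors, concepts, and scores are toy stand-ins, not the paper's actual features or model.

```python
# Minimal sketch: rank videos against a free-text event query by embedding
# both into a shared distributional semantic (word-vector) space.
# All vectors and detector scores below are toy stand-ins.
import numpy as np

rng = np.random.default_rng(0)

# Toy 50-d word vectors for a small concept vocabulary (hypothetical).
vocab = ["car", "tire", "wrench", "kitchen", "dog", "changing", "vehicle"]
word_vec = {w: rng.normal(size=50) for w in vocab}

def embed_text(text):
    """Average the word vectors of in-vocabulary tokens."""
    vecs = [word_vec[t] for t in text.lower().split() if t in word_vec]
    return np.mean(vecs, axis=0)

def embed_video(concept_scores):
    """Detection-score-weighted average of concept embeddings."""
    total = sum(concept_scores.values())
    return sum(s * word_vec[c] for c, s in concept_scores.items()) / total

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Two toy videos, each described only by its concept-detector outputs.
videos = {
    "vid_A": {"car": 0.9, "tire": 0.8, "wrench": 0.6},
    "vid_B": {"kitchen": 0.9, "dog": 0.7},
}

query = embed_text("changing a vehicle tire")
ranked = sorted(videos, key=lambda v: cosine(embed_video(videos[v]), query),
                reverse=True)
print(ranked)  # videos ordered by semantic similarity to the query
```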
A Survey on Food Computing
Food is essential to human life and fundamental to the human experience.
Food-related studies can support a wide range of applications and services,
such as guiding human behavior, improving human health, and understanding
culinary culture. With the rapid development of social
networks, mobile networks, and Internet of Things (IoT), people commonly
upload, share, and record food images, recipes, cooking videos, and food
diaries, leading to large-scale food data. Large-scale food data offers rich
knowledge about food and can help tackle many central issues of human society.
It is therefore timely to group these disparate food-related issues under the
umbrella of food computing. Food computing acquires and analyzes heterogeneous
food data from disparate sources for the perception, recognition, retrieval,
recommendation, and monitoring of food. In food computing, computational
approaches are applied to address food-related issues in medicine, biology,
gastronomy, and agronomy. Both
large-scale food data and recent breakthroughs in computer science are
transforming the way we analyze food data. As a result, a vast amount of work
has been conducted in the food area, targeting different food-oriented tasks
and applications. However, there are very few systematic reviews that shape
this area, provide a comprehensive and in-depth summary of current efforts, or
detail its open problems. In this paper, we formalize food
computing and present such a comprehensive overview of various emerging
concepts, methods, and tasks. We summarize key challenges and future
directions for food computing. This is the first comprehensive survey that targets
the study of computing technology for the food area and also offers a
collection of research studies and technologies to benefit researchers and
practitioners working in different food-related fields.
Comment: Accepted by ACM Computing Surveys.
COMIC: Towards A Compact Image Captioning Model with Attention
Recent works in image captioning have shown very promising raw performance.
However, most of these encoder-decoder style networks with attention do not
scale naturally to large vocabulary sizes, making them difficult to deploy on
embedded systems with limited hardware resources. This is because the word and
output embedding matrices grow proportionally with the vocabulary size,
adversely affecting the compactness of these networks. To address this
limitation, this paper tackles the hitherto-unexplored problem of compactness
in image captioning models. We show that our proposed model, named COMIC
(COMpact Image Captioning), achieves
results comparable to state-of-the-art approaches on five common evaluation
metrics on both the MS-COCO and InstaPIC-1.1M datasets, despite having an
embedding vocabulary that is 39x to 99x smaller. The source code and models
are available at:
https://github.com/jiahuei/COMIC-Compact-Image-Captioning-with-Attention
Comment: Added source code link and new results in Table
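As a rough illustration of the scaling problem described above (not COMIC's actual architecture), the sketch below counts the parameters contributed by the input and output embedding matrices at different vocabulary sizes; the hidden width is a hypothetical value.

```python
# Back-of-the-envelope illustration (not COMIC itself) of why the word and
# output embedding matrices dominate a captioner's size as vocabulary grows.
hidden = 512  # hypothetical decoder width

def embedding_params(vocab_size, dim=hidden):
    # input word embedding (V x d) + output projection (d x V)
    return 2 * vocab_size * dim

for v in (1_000, 10_000, 40_000):
    print(f"vocab={v:>6}: {embedding_params(v) / 1e6:5.1f}M embedding parameters")
# A ~40x smaller vocabulary shrinks these matrices by the same factor,
# which is the kind of compactness COMIC targets.
```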
Zero-Shot Object Recognition System based on Topic Model
Object recognition systems usually require complete, manually labeled
training data to train the classifier. In this paper, we study the problem of
object recognition when training samples of the target classes are missing
during the classifier learning stage, a task also known as zero-shot learning.
We propose a novel zero-shot learning strategy that utilizes a topic model and
hierarchical class concepts. Our proposed method has the advantage of
eliminating the cumbersome human annotation stage required by attribute-based
classification. We achieve
comparable performance with state-of-the-art algorithms in four public
datasets: PubFig (67.09%), Cifar-100 (54.85%), Caltech-256 (52.14%), and
Animals with Attributes (49.65%) when unseen classes exist in the
classification task.
Comment: To appear in IEEE Transactions on Human-Machine Systems.
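The abstract does not detail the model, but a heavily simplified sketch of one way a class hierarchy can enable zero-shot decisions is shown below: an unseen leaf class inherits a topic signature from its parent node, estimated from seen classes. The classes, signatures, and hierarchy here are invented for illustration and are not the paper's actual construction.

```python
# Simplified sketch (not the paper's exact model): an unseen class borrows
# the topic signature of its parent node, averaged over that node's seen
# children, and inputs are classified by nearest inherited signature.
import numpy as np

# Toy topic signatures over 4 latent topics for seen classes (hypothetical).
seen = {"cat": np.array([.70, .20, .05, .05]),
        "dog": np.array([.60, .30, .05, .05]),
        "truck": np.array([.05, .05, .60, .30])}

hierarchy = {"feline": ["cat"], "canine": ["dog"], "vehicle": ["truck"]}
unseen_parent = {"tiger": "feline", "bus": "vehicle"}  # unseen leaf classes

def parent_signature(parent):
    """Average the topic signatures of a node's seen children."""
    return np.mean([seen[c] for c in hierarchy[parent]], axis=0)

def classify(doc_topics, unseen_classes):
    """Pick the unseen class whose inherited signature is closest."""
    return min(unseen_classes,
               key=lambda c: np.linalg.norm(
                   doc_topics - parent_signature(unseen_parent[c])))

print(classify(np.array([.65, .25, .05, .05]), ["tiger", "bus"]))  # -> tiger
```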
Vision-to-Language Tasks Based on Attributes and Attention Mechanism
Vision-to-language tasks aim to integrate computer vision and natural
language processing together, which has attracted the attention of many
researchers. Typical approaches encode an image into feature representations
and decode these into natural language sentences, but they neglect high-level
semantic concepts and the subtle relationships between image regions and
natural-language elements. To make full use of this information, this paper
exploits text-guided attention and semantic-guided attention (SA) to find the
most correlated spatial information and reduce the semantic gap between vision
and language. Our method includes two
attention networks. One is the text-guided attention network, which is used to
select the text-related regions. The other is the SA network, which is used to
highlight the concept-related regions and the region-related concepts.
Finally, all of this information is incorporated to generate captions or
answers. Image captioning and visual question answering experiments have been
carried out, and the results demonstrate the excellent performance of the
proposed approach.
Comment: 15 pages, 6 figures, 50 references.
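Below is a generic single-head attention sketch of the kind of guided region weighting described above: region features are scored against a guide vector (a text or semantic-concept embedding) and pooled by softmax weights. The dimensions, features, and guide vector are toy values, not the paper's trained components.

```python
# Generic guided-attention sketch: weight image-region features by their
# affinity with a guide vector. All features here are random toy values.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def guided_attention(regions, guide):
    """Return a guide-weighted pooled feature plus the per-region weights."""
    scores = regions @ guide            # (num_regions,) affinity scores
    weights = softmax(scores)
    return weights @ regions, weights   # attended feature, region weights

rng = np.random.default_rng(1)
regions = rng.normal(size=(36, 128))    # 36 region features, 128-d each
text_q = rng.normal(size=128)           # hypothetical text-guided query
attended, w = guided_attention(regions, text_q)
print(attended.shape, w.argmax())       # (128,) and most relevant region index
```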
Active Speakers in Context
Current methods for active speaker detection focus on modeling short-term
audiovisual information from a single speaker. Although this strategy can be
enough for addressing single-speaker scenarios, it prevents accurate detection
when the task is to identify which of many candidate speakers are talking. This
paper introduces the Active Speaker Context, a novel representation that models
relationships between multiple speakers over long time horizons. Our Active
Speaker Context is designed to learn pairwise and temporal relations from a
structured ensemble of audio-visual observations. Our experiments show that a
structured feature ensemble already benefits the active speaker detection
performance. Moreover, we find that the proposed Active Speaker Context
improves the state-of-the-art on the AVA-ActiveSpeaker dataset achieving a mAP
of 87.1%. We present ablation studies that verify that this result is a direct
consequence of our long-term multi-speaker analysis.
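A minimal sketch of the kind of long-horizon, multi-speaker context modeling described above follows: features for several candidate speakers across time are flattened into tokens and mixed with plain self-attention, so every token sees every speaker at every timestep. The shapes, random projections, and features are toy stand-ins, not the actual Active Speaker Context.

```python
# Toy self-attention over a (speakers x time) grid of audio-visual features,
# illustrating pairwise-and-temporal context mixing. Not the paper's model.
import numpy as np

rng = np.random.default_rng(2)
S, T, D = 3, 8, 64                    # candidate speakers, timesteps, feature dim
feats = rng.normal(size=(S * T, D))   # flatten the grid into S*T tokens

def self_attention(x, d_k=32):
    """Single-head scaled dot-product self-attention with random projections."""
    wq, wk, wv = (rng.normal(size=(x.shape[1], d_k)) for _ in range(3))
    q, k, v = x @ wq, x @ wk, x @ wv
    logits = q @ k.T / np.sqrt(d_k)
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ v                      # each token now mixes all speakers/times

ctx = self_attention(feats)
print(ctx.shape)                      # (24, 32): context-refined token features
```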
Tri-axial Self-Attention for Concurrent Activity Recognition
We present a system for concurrent activity recognition. To extract features
associated with different activities, we propose a feature-to-activity
attention that maps the extracted global features to sub-features associated
with individual activities. To model the temporal associations of individual
activities, we propose a transformer-network encoder that models independent
temporal associations for each activity. To make the concurrent activity
prediction aware of the potential associations between activities, we propose
self-attention with an association mask. Our system achieved state-of-the-art
or comparable performance on three commonly used concurrent activity detection
datasets. Our visualizations demonstrate that our system is able to locate the
important spatial-temporal features for final decision making. We also show
that our system can be applied to general multi-label classification problems.
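To illustrate the association-mask idea, the sketch below applies self-attention over per-activity features while forbidding attention between pairs the mask marks as unassociated. The mask and features are hypothetical toy values, not the paper's learned components.

```python
# Sketch of self-attention restricted by an "association mask": activity
# pairs marked 0 cannot attend to each other. Toy values throughout.
import numpy as np

rng = np.random.default_rng(3)
A, D = 4, 32                          # number of activities, feature dim
acts = rng.normal(size=(A, D))        # one feature vector per activity

# Hypothetical association mask: activity 3 is unrelated to 0 and 1.
mask = np.ones((A, A))
mask[3, :2] = mask[:2, 3] = 0

logits = acts @ acts.T / np.sqrt(D)
logits[mask == 0] = -np.inf           # forbid masked activity pairs
w = np.exp(logits - logits.max(axis=1, keepdims=True))
w /= w.sum(axis=1, keepdims=True)
refined = w @ acts                    # features mixed only with associates
print(np.round(w, 2))                 # masked entries receive zero weight
```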
Learning like a Child: Fast Novel Visual Concept Learning from Sentence Descriptions of Images
In this paper, we address the task of learning novel visual concepts, and
their interactions with other concepts, from a few images with sentence
descriptions. Using linguistic context and visual features, our method is able
to efficiently hypothesize the semantic meaning of new words and add them to
its word dictionary so that they can be used to describe images which contain
these novel concepts. Our method has an image captioning module based on m-RNN
with several improvements. In particular, we propose a transposed weight
sharing scheme, which not only improves performance on image captioning, but
also makes the model more suitable for the novel concept learning task. We
propose methods to prevent overfitting the new concepts. In addition, three
novel concept datasets are constructed for this new task. In the experiments,
we show that our method effectively learns novel visual concepts from a few
examples without disturbing the previously learned concepts. The project page
is http://www.stat.ucla.edu/~junhua.mao/projects/child_learning.html
Comment: ICCV 2015 camera-ready version. We added many more novel visual
concepts to the NVC dataset and have released it; see
http://www.stat.ucla.edu/~junhua.mao/projects/child_learning.html
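Transposed weight sharing is essentially weight tying: the output word-scoring matrix is the transpose of the input word-embedding matrix, which halves the vocabulary-dependent parameters and makes adding a novel word a matter of appending one embedding row. Below is a minimal numpy sketch of that idea with toy dimensions; it is not the paper's full m-RNN model.

```python
# Sketch of transposed weight sharing: score output words with E^T, where E
# is the input word-embedding matrix. Dimensions are toy values.
import numpy as np

rng = np.random.default_rng(4)
V, D = 1000, 256
E = rng.normal(scale=0.01, size=(V, D))   # shared word-embedding matrix

def decode_logits(emb, hidden):
    """Score all words from a decoder hidden state using the transpose of
    the embedding matrix; no separate output matrix is needed."""
    return emb @ hidden                   # (V,) word logits

def add_novel_word(emb):
    """Grow the vocabulary by one row; existing weights stay untouched."""
    row = rng.normal(scale=0.01, size=(1, emb.shape[1]))
    return np.vstack([emb, row])

h = rng.normal(size=D)
print(decode_logits(E, h).shape)          # (1000,)
E = add_novel_word(E)
print(E.shape)                            # (1001, 256): one new concept added
```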
Probabilistic Semantic Retrieval for Surveillance Videos with Activity Graphs
We present a novel framework for finding complex activities matching
user-described queries in cluttered surveillance videos. The wide diversity of
queries coupled with unavailability of annotated activity data limits our
ability to train activity models. To bridge the semantic gap we propose to let
users describe an activity as a semantic graph with object attributes and
inter-object relationships associated with nodes and edges, respectively. We
learn node/edge-level visual predictors during training and, at test time,
retrieve activities by identifying likely locations that match the
semantic graph. We formulate a novel CRF based probabilistic activity
localization objective that accounts for mis-detections, mis-classifications
and track-losses, and outputs a likelihood score for a candidate grounded
location of the query in the video. We seek groundings that maximize overall
precision and recall. To handle the combinatorial search over all
high-probability groundings, we propose a highest precision subgraph matching
algorithm. Our method outperforms existing retrieval methods on benchmarked
datasets.
Comment: (c) 2018 IEEE. This paper has been accepted by IEEE Transactions on
Multimedia. Print ISSN: 1520-9210. Online ISSN: 1941-0077. Preprint link:
https://ieeexplore.ieee.org/document/8438958
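As a toy illustration of graph-grounded scoring (a simplified stand-in for the paper's CRF objective, which additionally accounts for mis-detections, mis-classifications, and track losses), the sketch below sums node and edge log-likelihoods for one candidate grounding of a small query graph; all probabilities are invented.

```python
# Toy score for one candidate grounding of a semantic query graph: nodes
# carry attribute predictors, edges carry relationship predictors, and the
# score sums their log-likelihoods. A simplified stand-in, not the CRF.
import math

# Query graph: two objects and one inter-object relationship.
nodes = {"n1": {"attr": "person"}, "n2": {"attr": "car"}}
edges = [("n1", "n2", "approaches")]

# Hypothetical predictor outputs for one candidate grounding.
node_prob = {"n1": 0.9, "n2": 0.8}                 # P(attribute | track)
edge_prob = {("n1", "n2", "approaches"): 0.7}      # P(relationship | tracks)

def grounding_log_score(node_prob, edge_prob):
    s = sum(math.log(p) for p in node_prob.values())
    s += sum(math.log(p) for p in edge_prob.values())
    return s

print(grounding_log_score(node_prob, edge_prob))   # higher = better match
```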
Inferring Human Activities Using Robust Privileged Probabilistic Learning
Classification models may often suffer from a "structure imbalance" between
training and testing data that can arise from a deficient data collection
process. This imbalance can be handled by the learning using privileged
information (LUPI) paradigm. In this paper, we present a supervised
probabilistic classification approach that integrates LUPI into a hidden
conditional random field (HCRF) model. The proposed model is called LUPI-HCRF
and is able to cope with additional information that is only available during
training. Moreover, the proposed method employs a Student's t-distribution to
provide robustness to outliers by modeling the conditional distribution of the
privileged information. Experimental results on three publicly available
datasets demonstrate the effectiveness of the proposed approach, which improves
on the state of the art in the LUPI framework for recognizing human activities.
Comment: To appear in ICCV Workshops 2017 (TASK-CV).
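The robustness argument rests on a standard property of the Student's t-distribution: its heavy tails penalize outliers far less severely than a Gaussian. The short computation below, using the standard unit-scale densities, makes that concrete; it illustrates the distributional property only, not the LUPI-HCRF model itself.

```python
# Compare standard Gaussian and Student's t (nu=3) log-densities: the t's
# heavy tails keep outlier penalties moderate, which is why t-based models
# resist outliers better than Gaussian ones.
import math

def gauss_logpdf(x):
    return -0.5 * (x * x + math.log(2 * math.pi))

def student_t_logpdf(x, nu=3.0):
    return (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
            - 0.5 * math.log(nu * math.pi)
            - (nu + 1) / 2 * math.log1p(x * x / nu))

for x in (0.0, 2.0, 8.0):   # inlier, borderline, gross outlier
    print(f"x={x:4.1f}  gauss={gauss_logpdf(x):8.2f}  t(3)={student_t_logpdf(x):8.2f}")
# At x=8 the Gaussian log-density collapses (~ -32.9) while the t stays
# moderate (~ -7.2), so outliers dominate a Gaussian likelihood but not a t.
```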