Semantic categories underlying the meaning of ‘place’
This paper analyses the semantics of natural language expressions that are associated with the intuitive notion of ‘place’. We note that the nature of such terms is highly contested, and suggest that this arises from two main considerations: 1) there are a number of logically
distinct categories of place expression, which are not always clearly distinguished in discourse about ‘place’; 2) the many non-substantive place count nouns (such as ‘place’, ‘region’, ‘area’, etc.) employed in natural
language are highly ambiguous. With respect to consideration 1), we propose that place-related expressions
should be classified into the following distinct logical types: a) ‘place-like’ count nouns (further subdivided into abstract, spatial and substantive varieties), b) proper names of ‘place-like’ objects, c) locative property phrases, and d) definite descriptions of ‘place-like’ objects. We outline possible formal representations for each of these. To address consideration 2), we examine meanings, connotations and ambiguities of the English vocabulary of abstract and generic place count nouns, and identify underlying elements of meaning, which explain both
similarities and differences in the sense and usage of the various terms.
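To make the proposed four-way typology concrete, here is a minimal sketch of how the distinctions might be encoded as a data model; the class and field names are our own illustrative choices, not the paper's formal representations.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional

class CountNounVariety(Enum):
    """Proposed subdivisions of 'place-like' count nouns."""
    ABSTRACT = auto()     # e.g. 'place' as a purely abstract locus
    SPATIAL = auto()      # e.g. 'region', 'area' as geometric extents
    SUBSTANTIVE = auto()  # e.g. 'town', 'forest' as material entities

@dataclass
class PlaceExpression:
    """A place-related expression tagged with its logical type."""
    text: str
    # one of: 'count_noun' | 'proper_name' | 'locative_property' | 'definite_description'
    logical_type: str
    variety: Optional[CountNounVariety] = None  # only for count nouns

examples = [
    PlaceExpression("region", "count_noun", CountNounVariety.SPATIAL),
    PlaceExpression("Paris", "proper_name"),
    PlaceExpression("in the valley", "locative_property"),
    PlaceExpression("the place where we met", "definite_description"),
]
```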
Multimodal Visual Concept Learning with Weakly Supervised Techniques
Despite the availability of a huge amount of video data accompanied by
descriptive texts, it is not always easy to exploit the information contained
in natural language in order to automatically recognize video concepts. Towards
this goal, in this paper we use textual cues as a means of supervision,
introducing two weakly supervised techniques that extend the Multiple Instance
Learning (MIL) framework: the Fuzzy Sets Multiple Instance Learning (FSMIL) and
the Probabilistic Labels Multiple Instance Learning (PLMIL). The former encodes
the spatio-temporal imprecision of the linguistic descriptions with Fuzzy Sets,
while the latter models different interpretations of each description's
semantics with Probabilistic Labels, both formulated through a convex
optimization algorithm. In addition, we provide a novel technique to extract
weak labels in the presence of complex semantics, which consists of semantic
similarity computations. We evaluate our methods on two distinct problems,
namely face and action recognition, in the challenging and realistic setting of
movies accompanied by their screenplays, contained in the COGNIMUSE database.
We show that, on both tasks, our method considerably outperforms a
state-of-the-art weakly supervised approach, as well as other baselines.
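The weak-label extraction step the abstract mentions rests on semantic similarity between the text and the target concepts. The following is a minimal sketch of that idea under our own assumptions (precomputed word embeddings as input, a hand-picked similarity threshold); it is not the paper's actual procedure.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def weak_labels(description_vecs, concept_vecs, threshold=0.6):
    """Assign probabilistic weak labels to one textual description.

    description_vecs: embeddings of the words in a screenplay description.
    concept_vecs: dict mapping concept name -> embedding.
    Returns {concept: score} for each concept whose best word-level
    similarity clears the (assumed) threshold; the score can then serve
    as a probabilistic label for the corresponding video segment.
    """
    labels = {}
    for concept, cvec in concept_vecs.items():
        best = max(cosine(w, cvec) for w in description_vecs)
        if best >= threshold:
            labels[concept] = best
    return labels
```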
Unsupervised Learning of Long-Term Motion Dynamics for Videos
We present an unsupervised representation learning approach that compactly
encodes the motion dependencies in videos. Given a pair of images from a video
clip, our framework learns to predict the long-term 3D motions. To reduce the
complexity of the learning framework, we propose to describe the motion as a
sequence of atomic 3D flows computed with RGB-D modality. We use a Recurrent
Neural Network based Encoder-Decoder framework to predict these sequences of
flows. We argue that in order for the decoder to reconstruct these sequences,
the encoder must learn a robust video representation that captures long-term
motion dependencies and spatial-temporal relations. We demonstrate the
effectiveness of our learned temporal representations on activity
classification across multiple modalities and datasets such as NTU RGB+D and
MSR Daily Activity 3D. Our framework is generic across input modalities, e.g.,
RGB, Depth, and RGB-D videos.
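A minimal PyTorch sketch of the encoder-decoder idea described above: features of a frame pair are encoded, then a decoder is unrolled to emit a sequence of atomic flow vectors. All layer sizes are placeholders, and the CNN feature extractor for the frames is assumed to exist upstream; the paper's actual architecture differs.

```python
import torch
import torch.nn as nn

class FlowSeqPredictor(nn.Module):
    """Toy encoder-decoder: frame-pair features -> sequence of flows."""

    def __init__(self, feat_dim=512, hidden=256, flow_dim=128, steps=8):
        super().__init__()
        self.steps = steps
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.decoder = nn.LSTM(flow_dim, hidden, batch_first=True)
        self.readout = nn.Linear(hidden, flow_dim)

    def forward(self, pair_feats):
        # pair_feats: (batch, 2, feat_dim) -- CNN features of the two frames
        _, state = self.encoder(pair_feats)      # summarize the frame pair
        batch = pair_feats.size(0)
        inp = torch.zeros(batch, 1, self.readout.out_features,
                          device=pair_feats.device)
        flows = []
        for _ in range(self.steps):              # unroll the decoder
            out, state = self.decoder(inp, state)
            flow = self.readout(out)             # one predicted atomic flow
            flows.append(flow)
            inp = flow                           # feed prediction back in
        return torch.cat(flows, dim=1)           # (batch, steps, flow_dim)
```

Training would then minimize, e.g., an MSE loss between the predicted sequence and flows precomputed from the RGB-D stream.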
Human-Machine CRFs for Identifying Bottlenecks in Holistic Scene Understanding
Recent trends in image understanding have pushed for holistic scene
understanding models that jointly reason about various tasks such as object
detection, scene recognition, shape analysis, contextual reasoning, and local
appearance-based classifiers. In this work, we are interested in understanding
the roles of these different tasks in improved scene understanding, in
particular semantic segmentation, object detection and scene recognition.
Towards this goal, we "plug-in" human subjects for each of the various
components in a state-of-the-art conditional random field model. Comparisons
among various hybrid human-machine CRFs give us indications of how much "head
room" there is to improve scene understanding by focusing research efforts on
various individual tasks.
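One simple way to picture "plugging in" humans is to replace a machine component's unary potentials with costs derived from human responses and compare the resulting CRF energies. The sketch below is our own illustration of that idea, not the paper's model; all names and the vote-smoothing scheme are assumptions.

```python
import numpy as np

def crf_energy(labels, unaries, pairwise, edges, w_unary=1.0, w_pair=1.0):
    """Energy of a labeling under a simple pairwise CRF.

    unaries[i, l]: cost of assigning label l to node i, coming either
    from a machine classifier or from human answers (see below).
    pairwise[l1, l2]: label-compatibility cost; edges: list of (i, j).
    """
    e = w_unary * sum(unaries[i, l] for i, l in enumerate(labels))
    e += w_pair * sum(pairwise[labels[i], labels[j]] for i, j in edges)
    return e

def human_unaries(human_votes, alpha=1.0):
    """Turn human vote counts into unary costs for one CRF component.

    human_votes[i, l]: how many subjects chose label l for node i.
    Returns the negative log of the (Laplace-smoothed) human label
    distribution, which can stand in for the machine unaries.
    """
    probs = (human_votes + alpha) / (human_votes + alpha).sum(1, keepdims=True)
    return -np.log(probs)
```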
Multimodal Classification of Urban Micro-Events
In this paper we seek methods to effectively detect urban micro-events. Urban
micro-events are events which occur in cities, have limited geographical
coverage and typically affect only a small group of citizens. Because of their
scale, they are difficult to identify in most data sources. However, by using
citizen sensing to gather data, detecting them becomes feasible. The data
gathered by citizen sensing is often multimodal and, as a consequence, the
information required to detect urban micro-events is distributed over multiple
modalities. This makes it essential to have a classifier capable of combining
them. In this paper we explore several methods of creating such a classifier,
including early, late, and hybrid fusion, as well as representation learning
using multimodal graphs. We evaluate performance on a real-world dataset obtained
from a live citizen reporting system. We show that a multimodal approach yields
higher performance than unimodal alternatives. Furthermore, we demonstrate that
our hybrid combination of early and late fusion with multimodal embeddings
performs best in the classification of urban micro-events.
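For readers unfamiliar with the fusion terminology, here is a minimal sketch of the early/late distinction using two assumed modalities (text and image features) and an off-the-shelf classifier; the paper's hybrid scheme with multimodal embeddings is more involved.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def early_fusion(X_text, X_image, y):
    """Early fusion: concatenate per-modality features, train one model."""
    X = np.hstack([X_text, X_image])
    return LogisticRegression(max_iter=1000).fit(X, y)

def late_fusion_predict(clf_text, clf_image, X_text, X_image):
    """Late fusion: train one model per modality upstream, then
    average their class probabilities at prediction time."""
    p = (clf_text.predict_proba(X_text) +
         clf_image.predict_proba(X_image)) / 2
    return p.argmax(axis=1)
```

A hybrid scheme, as in the paper, combines both: some modalities are merged at the feature level while the resulting models are still combined at the decision level.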
Towards a Visual Turing Challenge
As language and visual understanding by machines progresses rapidly, we are
observing an increasing interest in holistic architectures that tightly
interlink both modalities in a joint learning and inference process. This trend
has allowed the community to progress towards more challenging and open tasks
and refueled the hope of achieving the old AI dream of building machines that
could pass a Turing test in open domains. In order to make steady progress
towards this goal, we realize that quantifying performance becomes increasingly
difficult. We therefore ask how such challenges can be precisely defined and how
different algorithms can be evaluated on these open tasks. In this paper, we
summarize and discuss such challenges as well as try to give answers where
appropriate options are available in the literature. We exemplify some of the
solutions on a recently presented question-answering dataset based on
real-world indoor images that establishes a visual Turing challenge. Finally,
we argue that, despite the success of unique ground-truth annotation, we likely
have to step away from carefully curated datasets and instead rely on 'social
consensus' as the main driving force for creating suitable benchmarks. Providing
coverage of this inherently ambiguous output space is an emerging challenge
that we face in order to make quantifiable progress in this area.
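One simple way a 'social consensus' metric could look is to score a prediction by the fraction of annotators it agrees with, so that ambiguous questions admit several good answers. This averaging rule is our own illustration, not the paper's exact measure.

```python
def consensus_score(predictions, human_answers):
    """Score answers against multiple annotators instead of one ground truth.

    predictions: dict mapping question id -> predicted answer.
    human_answers: dict mapping question id -> list of answers collected
    from several annotators (the 'social consensus').
    """
    total = 0.0
    for q, pred in predictions.items():
        answers = human_answers[q]
        # credit equals the share of annotators who gave the same answer
        total += sum(a == pred for a in answers) / len(answers)
    return total / len(predictions)
```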
A Diagram Is Worth A Dozen Images
Diagrams are common tools for representing complex concepts, relationships
and events, often when it would be difficult to portray the same information
with natural images. Understanding natural images has been extensively studied
in computer vision, while diagram understanding has received little attention.
In this paper, we study the problem of diagram interpretation and reasoning,
the challenging task of identifying the structure of a diagram and the
semantics of its constituents and their relationships. We introduce Diagram
Parse Graphs (DPG) as our representation to model the structure of diagrams. We
define syntactic parsing of diagrams as learning to infer DPGs for diagrams and
study semantic interpretation and reasoning of diagrams in the context of
diagram question answering. We devise an LSTM-based method for syntactic
parsing of diagrams and introduce a DPG-based attention model for diagram
question answering. We compile a new dataset of diagrams with exhaustive
annotations of constituents and relationships for over 5,000 diagrams and
15,000 questions and answers. Our results show the significance of our models
for syntactic parsing and question answering in diagrams using DPGs.
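As a rough picture of the representation, a Diagram Parse Graph can be thought of as a labeled graph whose nodes are diagram constituents and whose edges are their relationships. The sketch below uses networkx with illustrative field names; it is not the paper's schema.

```python
import networkx as nx

def build_dpg(constituents, relationships):
    """Sketch of a Diagram Parse Graph: nodes are diagram constituents
    (e.g. text boxes, blobs, arrows), edges are semantic relationships."""
    g = nx.DiGraph()
    for cid, kind, bbox in constituents:      # e.g. (0, 'text', (x, y, w, h))
        g.add_node(cid, kind=kind, bbox=bbox)
    for src, dst, relation in relationships:  # e.g. (0, 1, 'describes')
        g.add_edge(src, dst, relation=relation)
    return g

# Toy example: a text label describing a blob in the diagram.
dpg = build_dpg(
    [(0, "text", (10, 10, 40, 12)), (1, "blob", (60, 8, 30, 30))],
    [(0, 1, "describes")],
)
```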