Curiosity-driven Exploration by Self-supervised Prediction
In many real-world scenarios, rewards extrinsic to the agent are extremely
sparse, or absent altogether. In such cases, curiosity can serve as an
intrinsic reward signal to enable the agent to explore its environment and
learn skills that might be useful later in its life. We formulate curiosity as
the error in an agent's ability to predict the consequence of its own actions
in a visual feature space learned by a self-supervised inverse dynamics model.
Our formulation scales to high-dimensional continuous state spaces like images,
bypasses the difficulties of directly predicting pixels, and, critically,
ignores the aspects of the environment that cannot affect the agent. The
proposed approach is evaluated in two environments: VizDoom and Super Mario
Bros. Three broad settings are investigated: 1) sparse extrinsic reward, where
curiosity allows for far fewer interactions with the environment to reach the
goal; 2) exploration with no extrinsic reward, where curiosity pushes the agent
to explore more efficiently; and 3) generalization to unseen scenarios (e.g.
new levels of the same game) where the knowledge gained from earlier experience
helps the agent explore new places much faster than starting from scratch. Demo
video and code available at https://pathak22.github.io/noreward-rl/
Comment: In ICML 2017. Website at https://pathak22.github.io/noreward-rl
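The core mechanism described above, curiosity as the forward model's prediction error in a learned feature space, can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the fixed linear maps `W_phi` and `W_fwd` are hypothetical stand-ins for the learned inverse-dynamics encoder and the jointly trained forward model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical placeholders: in the paper, phi is learned by a
# self-supervised inverse-dynamics model and the forward model is trained
# jointly; here both are fixed random linear maps for illustration.
W_phi = rng.standard_normal((4, 16))     # encoder: 16-d observation -> 4-d feature
W_fwd = rng.standard_normal((4, 4 + 2))  # forward model: [phi(s), action] -> phi(s')

def phi(obs):
    """Map a raw observation into the learned feature space."""
    return W_phi @ obs

def intrinsic_reward(obs, action, next_obs, eta=0.5):
    """Curiosity reward: scaled squared error of the forward model's
    prediction of the next state's features."""
    pred = W_fwd @ np.concatenate([phi(obs), action])
    err = pred - phi(next_obs)
    return eta * 0.5 * float(err @ err)

s, s_next = rng.standard_normal(16), rng.standard_normal(16)
a = np.array([1.0, 0.0])
r = intrinsic_reward(s, a, s_next)  # large error -> high curiosity bonus
```

In the full method this bonus is added to any extrinsic reward, so transitions the forward model predicts poorly (i.e., novel ones) are sought out more often.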
Deep Ordinal Hashing with Spatial Attention
Hashing has attracted increasing research attention in recent years due to its
high efficiency of computation and storage in image retrieval. Recent works
have demonstrated the superiority of learning feature representations and hash
functions simultaneously with deep neural networks. However, most existing deep
hashing methods learn hash functions by encoding only the global semantic
information, ignoring the local spatial information of images. The loss of
local spatial structure becomes a performance bottleneck for the hash
functions, limiting their application to accurate similarity retrieval. In
this work, we propose a novel Deep Ordinal Hashing (DOH) method,
which learns ordinal representations by leveraging the ranking structure of
feature space from both local and global views. In particular, to effectively
build the ranking structure, we propose to learn the rank correlation space by
exploiting the local spatial information from Fully Convolutional Network (FCN)
and the global semantic information from the Convolutional Neural Network (CNN)
simultaneously. More specifically, an effective spatial attention model is
designed to capture the local spatial information by selectively learning
well-specified locations closely related to target objects. Within this
hashing framework, the local spatial and global semantic nature of images is
captured in an end-to-end ranking-to-hashing manner. Experimental results on
three widely used datasets demonstrate that the proposed DOH method
significantly outperforms state-of-the-art hashing methods.
Spatially Supervised Recurrent Convolutional Neural Networks for Visual Object Tracking
In this paper, we develop a new approach to visual object tracking based on
spatially supervised recurrent convolutional neural networks. Our recurrent
convolutional network exploits the history of locations as well as the
distinctive visual features learned by the deep neural networks. Inspired by
recent bounding box regression methods for object detection, we study the
regression capability of Long Short-Term Memory (LSTM) in the temporal domain,
and propose to concatenate high-level visual features produced by convolutional
networks with region information. In contrast to existing deep learning based
trackers that use binary classification for region candidates, we use
regression for direct prediction of the tracking locations both at the
convolutional layer and at the recurrent unit. Extensive experiments and
comparisons with state-of-the-art tracking methods on challenging benchmark
video datasets show that our tracker is more accurate and robust while
maintaining low computational cost. For most test video sequences, our method
achieves the best tracking performance, often outperforming the second best by
a large margin.
Comment: 10 pages, 9 figures, conference
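The key idea, concatenating convolutional features with region information and regressing the box directly at each time step, can be illustrated with a toy recurrent step. All weights here are hypothetical stand-ins for the trained LSTM, and a single `tanh` update replaces the full LSTM cell.

```python
import numpy as np

rng = np.random.default_rng(1)

FEAT, BOX, HID = 8, 4, 6
W_h = rng.standard_normal((HID, HID + FEAT + BOX)) * 0.1  # recurrent weights
W_o = rng.standard_normal((BOX, HID)) * 0.1               # box-regression head

def track_step(hidden, visual_feat, prev_box):
    """One recurrent step: concatenate conv features with region info,
    update the hidden state, and regress the box directly."""
    x = np.concatenate([hidden, visual_feat, prev_box])
    hidden = np.tanh(W_h @ x)
    box = W_o @ hidden                    # (x, y, w, h) prediction
    return hidden, box

h = np.zeros(HID)
box = np.array([0.5, 0.5, 0.2, 0.2])      # normalized initial box
for _ in range(3):                        # a short toy sequence
    feat = rng.standard_normal(FEAT)      # stand-in for conv features
    h, box = track_step(h, feat, box)
```

The contrast with classification-based trackers is visible here: the output is a box estimate itself, not a score over candidate regions.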
Classifying Symmetrical Differences and Temporal Change in Mammography Using Deep Neural Networks
We investigate the addition of symmetry and temporal context information to a
deep Convolutional Neural Network (CNN) with the purpose of detecting malignant
soft tissue lesions in mammography. We employ a simple linear mapping that
takes the location of a mass candidate and maps it to either the contra-lateral
or prior mammogram and Regions Of Interest (ROI) are extracted around each
location. We subsequently explore two different architectures: (1) a fusion
model employing two data streams, where both ROIs are fed to the network during
training and testing, and (2) a stage-wise approach where a single ROI CNN is
trained on the primary image and subsequently used as feature extractor for
both primary and symmetrical or prior ROIs. A 'shallow' Gradient Boosted Tree
(GBT) classifier is then trained on the concatenation of these features and
used to classify the joint representation. Results show a significant increase
in performance using the first architecture with symmetry information, but only
marginal gains using temporal data in the second setting. We feel these results
are promising and can be greatly improved when more temporal data becomes
available.
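The stage-wise variant can be sketched as follows: one feature extractor is reused for both the primary and the symmetrical (or prior) ROI, and a shallow classifier operates on the concatenated features. The linear map and the logistic layer below are hypothetical stand-ins for the trained ROI CNN and the Gradient Boosted Tree classifier.

```python
import numpy as np

rng = np.random.default_rng(2)

W_cnn = rng.standard_normal((16, 64)) * 0.05   # stand-in for the trained ROI CNN

def extract(roi):
    """Stage-wise approach: the same primary-image CNN is reused as a
    feature extractor for both ROIs (here a fixed linear map + ReLU)."""
    return np.maximum(0.0, W_cnn @ roi)

def classify(primary_roi, symmetric_roi, w, b=0.0):
    """Classify the concatenated joint representation. The paper uses a
    shallow Gradient Boosted Tree; a logistic layer stands in for it here."""
    joint = np.concatenate([extract(primary_roi), extract(symmetric_roi)])
    return 1.0 / (1.0 + np.exp(-(w @ joint + b)))

w = rng.standard_normal(32) * 0.1
p = classify(rng.standard_normal(64), rng.standard_normal(64), w)  # malignancy score
```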
A Survey on Food Computing
Food is essential to human life and fundamental to the human experience.
Food-related studies may support multifarious applications and services, such
as guiding human behavior, improving human health, and understanding culinary
culture. With the rapid development of social
networks, mobile networks, and Internet of Things (IoT), people commonly
upload, share, and record food images, recipes, cooking videos, and food
diaries, leading to large-scale food data. Large-scale food data offers rich
knowledge about food and can help tackle many central issues of human society.
Therefore, it is time to group these disparate issues under the umbrella of
food computing. Food computing acquires and analyzes heterogeneous food data from
disparate sources for perception, recognition, retrieval, recommendation, and
monitoring of food. In food computing, computational approaches are applied to
address food related issues in medicine, biology, gastronomy and agronomy. Both
large-scale food data and recent breakthroughs in computer science are
transforming the way we analyze food data. Therefore, a vast amount of work
has been conducted in the food area, targeting different food-oriented tasks
and applications. However, there are very few systematic reviews that shape
this area well, provide a comprehensive and in-depth summary of current
efforts, or detail open problems in this area. In this paper, we formalize food
computing and present such a comprehensive overview of various emerging
concepts, methods, and tasks. We summarize key challenges and future directions
ahead for food computing. This is the first comprehensive survey that targets
the study of computing technology for the food area and also offers a
collection of research studies and technologies to benefit researchers and
practitioners working in different food-related fields.
Comment: Accepted by ACM Computing Surveys
Vision-to-Language Tasks Based on Attributes and Attention Mechanism
Vision-to-language tasks aim to integrate computer vision and natural
language processing together, which has attracted the attention of many
researchers. Typical approaches encode an image into feature representations
and decode them into natural language sentences, but they neglect high-level
semantic concepts and the subtle relationships between image regions and
natural language elements. To make full use of this information, this paper
attempts to exploit text-guided attention and semantic-guided attention (SA)
to find more correlated spatial information and reduce the semantic gap
between vision and language. Our method includes two attention networks. One
is the text-guided attention network, which is used to
select the text-related regions. The other is the SA network, which is used to
highlight the concept-related regions and the region-related concepts.
Finally, all of this information is incorporated to generate captions or
answers. In practice, image captioning and visual question answering
experiments have been carried out, and the results show the excellent
performance of the proposed approach.
Comment: 15 pages, 6 figures, 50 references
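The text-guided attention described above can be sketched as a standard dot-product attention over region features conditioned on an encoded text vector. The function below is a minimal illustration under that assumption, not the paper's exact architecture.

```python
import numpy as np

def text_guided_attention(text_vec, regions):
    """Weight image regions by their relevance to a text vector
    (softmax over dot-product scores), then pool them."""
    scores = regions @ text_vec                  # one score per region
    scores = scores - scores.max()               # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    context = weights @ regions                  # attended region feature
    return weights, context

rng = np.random.default_rng(3)
regions = rng.standard_normal((5, 8))            # 5 regions, 8-d features
text = rng.standard_normal(8)                    # encoded question / partial caption
w, ctx = text_guided_attention(text, regions)
```

The semantic-guided branch would work the same way with concept vectors in place of the text vector, and the two attended contexts would then be fused for generation.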
HyperFusion-Net: Densely Reflective Fusion for Salient Object Detection
Salient object detection (SOD), which aims to find the most important region
of interest and segment the relevant object/item in that area, is an important
yet challenging vision task. This problem is inspired by the fact that humans
seem to perceive main scene elements with high priority. Thus, accurate
detection of salient objects in complex scenes is critical for human-computer
interaction. In this paper, we present a novel feature learning framework for
SOD, in which we cast the SOD as a pixel-wise classification problem. The
proposed framework utilizes a densely hierarchical feature fusion network,
named HyperFusion-Net, which automatically predicts the most important area and
segments the associated objects in an end-to-end manner. Specifically, inspired
by the human perception system and image reflection separation, we first
decompose input images into reflective image pairs by content-preserving
transforms. Then, the complementary information of reflective image pairs is
jointly extracted by an interweaved convolutional neural network (ICNN) and
hierarchically combined with a hyper-dense fusion mechanism. Based on the fused
multi-scale features, our method finally achieves a promising way of predicting
SOD. As shown in our extensive experiments, the proposed method consistently
outperforms other state-of-the-art methods on seven public datasets by a large
margin.
Comment: Submitted to ECCV 2018, 16 pages, including 6 figures and 4 tables.
arXiv admin note: text overlap with arXiv:1802.0652
Exploring Visual Relationship for Image Captioning
It is widely believed that modeling the relationships between objects is
helpful for representing and eventually describing an image. Nevertheless,
there has been little evidence in support of this idea for image description
generation. In this paper, we introduce a new design to explore the connections
between objects for image captioning under the umbrella of attention-based
encoder-decoder framework. Specifically, we present Graph Convolutional
Networks plus Long Short-Term Memory (dubbed GCN-LSTM) architecture that
integrates both semantic and spatial object relationships into the image
encoder. Technically, we build graphs over the detected objects in an image
based on their spatial and semantic connections. The representations of each
region proposed on objects are then refined by leveraging graph structure
through GCN. With the learnt region-level features, our GCN-LSTM capitalizes on
LSTM-based captioning framework with attention mechanism for sentence
generation. Extensive experiments are conducted on COCO image captioning
dataset, and superior results are reported when compared to state-of-the-art
approaches. More remarkably, GCN-LSTM increases CIDEr-D performance from 120.1%
to 128.7% on the COCO testing set.
Comment: ECCV 2018
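The refinement of region features over the object graph can be sketched with one standard GCN layer: each region's representation is updated from its graph neighbors. The adjacency matrix below is a hypothetical relation graph, and the normalization is the usual row-normalized form rather than necessarily the paper's exact variant.

```python
import numpy as np

def gcn_layer(H, A, W):
    """One GCN refinement step over the object-relationship graph:
    H' = ReLU(D^-1 (A + I) H W), i.e., each region feature is updated
    from its neighbors (plus itself)."""
    A_hat = A + np.eye(A.shape[0])               # add self-loops
    D_inv = np.diag(1.0 / A_hat.sum(axis=1))     # row-normalize
    return np.maximum(0.0, D_inv @ A_hat @ H @ W)

rng = np.random.default_rng(4)
H = rng.standard_normal((4, 6))                  # 4 detected objects, 6-d features
A = np.array([[0, 1, 0, 0],                      # hypothetical relation edges
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
W = rng.standard_normal((6, 6)) * 0.5
H_refined = gcn_layer(H, A, W)                   # relation-aware region features
```

The refined region features would then feed the attention-based LSTM decoder for sentence generation.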
Contextualized Spatial-Temporal Network for Taxi Origin-Destination Demand Prediction
Taxi demand prediction has recently attracted increasing research interest
due to its huge potential application in large-scale intelligent transportation
systems. However, most of the previous methods only considered the taxi demand
prediction in origin regions, but neglected the modeling of the specific
situation of the destination passengers. We believe it is suboptimal to
preallocate the taxi into each region based solely on the taxi origin demand.
In this paper, we present a challenging and worth-exploring task, called taxi
origin-destination demand prediction, which aims at predicting the taxi demand
between all region pairs in a future time interval. Its main challenges come
from how to effectively capture the diverse contextual information to learn the
demand patterns. We address this problem with a novel Contextualized
Spatial-Temporal Network (CSTN), which consists of three components for the
modeling of local spatial context (LSC), temporal evolution context (TEC) and
global correlation context (GCC), respectively. First, an LSC module utilizes
two convolutional neural networks to learn the local spatial dependencies of
taxi demand from the origin view and the destination view. Second, a
TEC module incorporates both the local spatial features of taxi demand and the
meteorological information into a Convolutional Long Short-Term Memory network
(ConvLSTM) to analyze taxi demand evolution. Finally, a GCC module is
applied to model the correlation between all regions by computing a global
correlation feature as a weighted sum of all regional features, with the
weights being calculated as the similarity between the corresponding region
pairs. Extensive experiments and evaluations on a large-scale dataset
demonstrate the superiority of our CSTN over other compared methods for taxi
origin-destination demand prediction.
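The GCC module's weighted sum is the most self-contained piece of the abstract and can be sketched directly: each region's global feature is a sum of all regional features, weighted by pairwise similarity. Cosine similarity followed by a row softmax is one plausible instantiation; the abstract does not pin down the exact similarity function.

```python
import numpy as np

def global_correlation(features):
    """GCC sketch: for each region, compute a global feature as a
    weighted sum of all regional features, with weights given by
    softmax-normalized pairwise cosine similarity."""
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = unit @ unit.T                          # pairwise cosine similarity
    w = np.exp(sim) / np.exp(sim).sum(axis=1, keepdims=True)
    return w, w @ features                       # weights, one global feature per region

rng = np.random.default_rng(5)
region_feats = rng.standard_normal((6, 8))       # 6 regions, 8-d demand features
w, g = global_correlation(region_feats)
```

This lets every region's prediction draw on demand patterns in similar regions anywhere in the city, not just its spatial neighbors.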
Human Pose Estimation with Spatial Contextual Information
We explore the importance of spatial contextual information in human pose
estimation. Most state-of-the-art pose networks are trained in a multi-stage
manner and produce several auxiliary predictions for deep supervision. With
this principle, we present two conceptually simple yet computationally
efficient modules, namely Cascade Prediction Fusion (CPF) and Pose Graph Neural
Network (PGNN), to exploit underlying contextual information. Cascade
prediction fusion accumulates prediction maps from previous stages to extract
informative signals. The resulting maps also function as a prior to guide
prediction at following stages. To promote spatial correlation among joints,
our PGNN learns a structured representation of human pose as a graph. Direct
message passing between different joints is enabled and spatial relation is
captured. These two modules add only limited computational overhead.
Experimental results demonstrate that our method consistently outperforms
previous methods on the MPII and LSP benchmarks.
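The PGNN's message passing between joints can be sketched as a single graph update over the skeleton: each joint refines its feature with messages aggregated from adjacent joints. The chain graph, the residual-plus-tanh update, and all weights below are hypothetical stand-ins for the learned network.

```python
import numpy as np

def pgnn_step(joint_feats, A, W_msg):
    """One message-passing step on a pose graph: each joint refines its
    feature with messages from skeleton-adjacent joints (residual update)."""
    messages = A @ joint_feats @ W_msg           # aggregate neighbor features
    return joint_feats + np.tanh(messages)

rng = np.random.default_rng(6)
# Hypothetical 4-joint chain (e.g., shoulder-elbow-wrist-hand)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
feats = rng.standard_normal((4, 5))              # per-joint features
W_msg = rng.standard_normal((5, 5)) * 0.2
refined = pgnn_step(feats, A, W_msg)
```

Direct message passing of this kind is what lets an ambiguous joint (e.g., an occluded wrist) borrow evidence from its well-localized neighbors.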