Sentence Directed Video Object Codetection
We tackle the problem of video object codetection by leveraging the weak semantic constraint implied by sentences that describe the video content. Unlike most existing work, which focuses on codetecting large objects that are usually salient in both size and appearance, we can codetect objects that are small or medium sized. Our method assumes no human pose or depth information, such as is required by the most recent state-of-the-art method. We impose a weak semantic constraint on the codetection process by pairing the videos with sentences. Although this semantic information is usually simple and weak, it can greatly boost the performance of our codetection framework by reducing the search space of the hypothesized object detections. Our experiments demonstrate an average IoU score of 0.423 on a new, challenging dataset containing 15 object classes and 150 videos with 12,509 frames in total, and an average IoU score of 0.373 on a subset of an existing dataset, originally intended for activity recognition, containing 5 object classes and 75 videos with 8,854 frames in total.
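
Codetection quality above is reported as average IoU. For readers unfamiliar with the metric, a minimal sketch of intersection-over-union for axis-aligned boxes follows; the (x1, y1, x2, y2) tuple layout is an illustrative assumption, not the paper's code.

    def iou(box_a, box_b):
        """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
        # Intersection rectangle, clamped to zero when the boxes are disjoint.
        ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        # Union = sum of the two areas minus the intersection.
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        union = area_a + area_b - inter
        return inter / union if union > 0 else 0.0

Averaging this score over all evaluated detections yields figures comparable to the 0.423 and 0.373 reported above.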
Discriminative Training: Learning to Describe Video with Sentences, from Video Described with Sentences
We present a method for learning word meanings from complex and realistic video clips by discriminatively training (DT) positive sentential labels against negative ones, and then using the trained word models to generate sentential descriptions for new video. This work is inspired by recent work that adopts a maximum-likelihood (ML) framework to address the same problem using only positive sentential labels. The new method, like the ML-based one, automatically determines which words in the sentence correspond to which concepts in the video (i.e., grounds words to meanings) in a weakly supervised fashion. While both DT and ML yield comparable results with sufficient training data, DT significantly outperforms ML on smaller training sets because it can exploit negative training labels to better constrain the learning problem.
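
To make the DT-versus-ML contrast concrete, a toy sketch of the two objectives over per-sentence scores follows; the scoring function, assumed here to yield values in (0, 1], is an illustrative assumption, not the paper's HMM-based model.

    import math

    def ml_objective(pos_scores):
        # Maximum likelihood: raise the log-probability of the positive
        # sentential labels only.
        return sum(math.log(s) for s in pos_scores)

    def dt_objective(pos_scores, neg_scores):
        # Discriminative training: additionally push down the scores of
        # negative labels, which constrains learning when data is scarce.
        return (sum(math.log(s) for s in pos_scores)
                - sum(math.log(s) for s in neg_scores))

The extra negative term is what lets DT extract more signal per training sample than ML.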
Interactive Grounded Language Acquisition and Generalization in a 2D World
We build a virtual agent for learning language in a 2D maze-like world. The agent sees images of the surrounding environment, listens to a virtual teacher, and takes actions to receive rewards. It interactively learns the teacher's language from scratch through two language use cases: sentence-directed navigation and question answering. It simultaneously learns visual representations of the world, the language, and action control. By disentangling language grounding from other computational routines and sharing a concept-detection function between language grounding and prediction, the agent reliably interpolates and extrapolates to interpret sentences that contain new word combinations or new words missing from the training sentences. The new words are transferred from the answers produced during language prediction. This language ability is trained and evaluated on a population of over 1.6 million distinct sentences consisting of 119 object words, 8 color words, 9 spatial-relation words, and 50 grammatical words. The proposed model significantly outperforms five comparison methods at interpreting zero-shot sentences. In addition, we demonstrate human-interpretable intermediate outputs of the model in the appendix.
Comment: ICLR 2018 (Figure 6 caption improved)
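
The generalization claim rests on one concept-detection function shared between grounding and prediction. A minimal weight-sharing sketch of that idea, with assumed module names and shapes (not the paper's architecture):

    import torch
    import torch.nn as nn

    class SharedConceptAgent(nn.Module):
        def __init__(self, vocab_size, word_dim, vis_dim):
            super().__init__()
            # One word-to-concept embedding table reused by both pathways.
            self.concepts = nn.Embedding(vocab_size, word_dim)
            self.vis_proj = nn.Linear(vis_dim, word_dim)

        def ground(self, word_ids, vis_feats):
            # Grounding: score the given words against visual features.
            return self.vis_proj(vis_feats) @ self.concepts(word_ids).t()

        def predict(self, vis_feats):
            # Prediction: score every word in the vocabulary; a word learned
            # here transfers to grounding because the table is shared.
            return self.vis_proj(vis_feats) @ self.concepts.weight.t()

Because both pathways read the same table, a word acquired through prediction is immediately usable when interpreting sentences.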
Guided Feature Transformation (GFT): A Neural Language Grounding Module for Embodied Agents
Recently there has been rising interest in training agents, embodied in virtual environments, to perform language-directed tasks via deep reinforcement learning. In this paper, we propose a simple but effective neural language grounding module for embodied agents that can be trained end to end from scratch, taking raw pixels, unstructured linguistic commands, and sparse rewards as inputs. We model the language grounding process as a language-guided transformation of visual features, where latent sentence embeddings are used as the transformation matrices. On several language-directed navigation tasks that feature challenging partial observability and require simple reasoning, our module significantly outperforms the state of the art. We also release XWorld3D, an easy-to-customize 3D environment that can potentially be modified to evaluate a variety of embodied agents.
Comment: CoRL 201
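
A minimal sketch of the stated core mechanism, predicting a transformation matrix from a latent sentence embedding and applying it to visual features; the dimensions, module names, and absence of any gating or stacking are assumptions, not the released implementation:

    import torch
    import torch.nn as nn

    class LanguageGuidedTransform(nn.Module):
        def __init__(self, sent_dim, vis_dim):
            super().__init__()
            # Predict a (vis_dim x vis_dim) matrix from the sentence embedding.
            self.to_matrix = nn.Linear(sent_dim, vis_dim * vis_dim)
            self.vis_dim = vis_dim

        def forward(self, sent_emb, vis_feat):
            # sent_emb: (batch, sent_dim); vis_feat: (batch, vis_dim)
            m = self.to_matrix(sent_emb).view(-1, self.vis_dim, self.vis_dim)
            # Apply the language-predicted matrix to the visual features.
            return torch.bmm(m, vis_feat.unsqueeze(-1)).squeeze(-1)

The appeal of this design is that the command reshapes perception directly, rather than being fused with vision through a fixed concatenation.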
A Faster Method for Tracking and Scoring Videos Corresponding to Sentences
Prior work presented the sentence tracker, a method for scoring how well a sentence describes a video clip, or alternatively how well a video clip depicts a sentence. We present an improved method for optimizing the same cost function employed by this prior work, reducing the space complexity from exponential in the sentence length to polynomial, while producing a qualitatively identical result in time polynomial, rather than exponential, in the sentence length. Since the new method is plug-compatible with the prior one, it can be used for the same applications: video retrieval with sentential queries, generating sentential descriptions of video clips, and focusing the attention of a tracker with a sentence, while allowing these applications to scale to significantly larger numbers of object detections, word meanings modeled with HMMs with significantly more states, and significantly longer sentences, with no appreciable degradation in the quality of results.
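
The exponential-to-polynomial reduction is in the spirit of dynamic programming over a factored lattice instead of enumerating joint state sequences. A generic Viterbi-style sketch of that idea (not the paper's exact cost function):

    def viterbi_score(states, transition, emission, observations):
        # Best score of any path ending in each state after the first frame.
        scores = {s: emission(s, observations[0]) for s in states}
        for obs in observations[1:]:
            # O(T * |S|^2) total, versus O(|S|^T) for brute-force enumeration.
            scores = {s: max(scores[p] + transition(p, s) for p in states)
                         + emission(s, obs)
                      for s in states}
        return max(scores.values())

Because only per-step best scores are kept, memory stays polynomial even for long sentences.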
Collecting and Annotating the Large Continuous Action Dataset
We make available to the community a new dataset to support action-recognition research. This dataset differs from prior datasets in several key ways. It is significantly larger. It contains streaming video with long segments containing multiple action occurrences that often overlap in space and/or time. All actions were filmed in the same collection of backgrounds, so background gives little clue as to action class. We had five humans replicate the annotation of the temporal extent of action occurrences, labeled with their class, and measured a surprisingly low level of intercoder agreement. A baseline experiment shows that recent state-of-the-art methods perform poorly on this dataset, suggesting that it will be a challenging dataset that can foster advances in action-recognition research. This manuscript describes the novel content and characteristics of the LCA dataset, presents the design decisions made when filming the dataset, and documents the novel methods employed to annotate it.
Robot Language Learning, Generation, and Comprehension
We present a unified framework that supports grounding natural-language semantics in robotic driving. This framework supports acquisition (learning grounded meanings of nouns and prepositions from human annotation of robotic driving paths), generation (using such acquired meanings to generate sentential descriptions of new robotic driving paths), and comprehension (using such acquired meanings to support automated driving to accomplish navigational goals specified in natural language). We evaluate these three tasks by having independent human judges rate the semantic fidelity of the sentences associated with paths, achieving an overall average correctness of 94.6% and an overall average completeness of 85.6%.
Interactive Language Acquisition with One-shot Visual Concept Learning through a Conversational Game
Building intelligent agents that can communicate with and learn from humans in natural language is of great value. Supervised language learning is limited in that it mainly captures the statistics of the training data, and it is hardly adaptive to new scenarios or flexible enough to acquire new knowledge without inefficient retraining or catastrophic forgetting. We highlight the perspective that conversational interaction serves as a natural interface both for language learning and for novel knowledge acquisition, and we propose a joint imitation-and-reinforcement approach for grounded language learning through an interactive conversational game. An agent trained with this approach can actively acquire information by asking questions about novel objects and use the just-learned knowledge in subsequent conversations in a one-shot fashion. Comparisons with other methods verify the effectiveness of the proposed approach.
Comment: ACL 201
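
As a rough illustration of the joint imitation-and-reinforcement idea, one can mix the two losses with a weighting; both the form of the combination and the weight are assumptions, not the paper's training objective.

    def joint_loss(imitation_nll, rl_loss, alpha=0.5):
        # One objective mixing imitation of the teacher's language with
        # reward-driven reinforcement learning.
        return alpha * imitation_nll + (1.0 - alpha) * rl_loss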
Hierarchical Reinforcement Learning By Discovering Intrinsic Options
We propose a hierarchical reinforcement learning method, HIDIO, that can learn task-agnostic options in a self-supervised manner while jointly learning to use them to solve sparse-reward tasks. Unlike current hierarchical RL approaches, which tend to formulate goal-reaching low-level tasks or pre-define ad hoc lower-level policies, HIDIO encourages lower-level option learning that is independent of the task at hand, requiring few assumptions about, and little knowledge of, the task structure. These options are learned through an intrinsic entropy-minimization objective conditioned on the option sub-trajectories. The learned options are diverse and task-agnostic. In experiments on sparse-reward robotic manipulation and navigation tasks, HIDIO achieves higher success rates with greater sample efficiency than regular RL baselines and two state-of-the-art hierarchical RL methods.
Comment: ICLR 2021. 19 pages, 9 figures. Code at https://www.github.com/jesbu1/hidi
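
The intrinsic objective can be sketched in the spirit of discriminator-based option learning: reward sub-trajectories from which the active option is easy to infer. The discriminator interface below is an assumption, not HIDIO's exact formulation.

    import torch

    def option_intrinsic_reward(discriminator, option_id, subtraj_feat):
        # Reward the lower-level policy when the active option can be
        # inferred from the sub-trajectory it produced, driving down the
        # entropy of the option given the sub-trajectory.
        logits = discriminator(subtraj_feat)            # (num_options,)
        log_q = torch.log_softmax(logits, dim=-1)
        return log_q[option_id].item()

Maximizing this reward makes options distinguishable by their behavior, which is what yields diverse, task-agnostic skills.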
Resource-Efficient Neural Architect
Neural Architecture Search (NAS) is a laborious process. Prior work on automated NAS targets mainly improving accuracy but lacks consideration of computational resource use. We propose the Resource-Efficient Neural Architect (RENA), an efficient resource-constrained NAS method using reinforcement learning with network embedding. RENA uses a policy network that processes the network embeddings to generate new configurations. We demonstrate RENA on image recognition and keyword spotting (KWS) problems. RENA can find novel architectures that achieve high performance even under tight resource constraints. On CIFAR10, it achieves 2.95% test error when compute intensity is greater than 100 FLOPs/byte, and 3.87% test error when model size is less than 3M parameters. On the Google Speech Commands Dataset, RENA achieves state-of-the-art accuracy without resource constraints and outperforms the optimized architectures under tight resource constraints.
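
One common way to fold resource budgets into an RL-based search reward, sketched here as an assumption rather than RENA's exact reward shaping, is to penalize budget violations multiplicatively.

    def constrained_reward(accuracy, usage, budget, penalty=0.5):
        # Shrink the reward in proportion to how far a candidate exceeds a
        # resource budget (e.g. parameter count or FLOPs/byte); candidates
        # within budget keep their full accuracy-based reward.
        violation = max(0.0, usage / budget - 1.0)
        return accuracy * (1.0 - penalty * min(violation, 1.0))

Under such shaping, the policy network learns to trade a little accuracy for architectures that respect the stated budget.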