Text2Action: Generative Adversarial Synthesis from Language to Action
In this paper, we propose a generative model which learns the relationship
between language and human action in order to generate a human action sequence
given a sentence describing human behavior. The proposed generative model is a
generative adversarial network (GAN) based on the sequence-to-sequence
(seq2seq) model. Using the proposed generative network, we can
synthesize various actions for a robot or a virtual agent using a text encoder
recurrent neural network (RNN) and an action decoder RNN. The proposed
generative network is trained from 29,770 pairs of actions and sentence
annotations extracted from MSR-Video-to-Text (MSR-VTT), a large-scale video
dataset. We demonstrate that the network can generate human-like actions which
can be transferred to a Baxter robot, such that the robot performs an action
based on a provided sentence. Results show that the proposed generative network
correctly models the relationship between language and action and can generate
a diverse set of actions from the same sentence.
Comment: 8 pages, 10 figures
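The encoder-decoder pipeline described above (text-encoder RNN producing a context vector, action-decoder RNN unrolling it into a pose sequence) can be sketched in minimal form. This is a hypothetical toy illustration of the pipeline's shape, not the authors' model: the embedding, pooling, and recurrence below are stand-ins for the learned RNNs and the GAN training is omitted entirely.

```python
import math

DIM = 4  # toy embedding/pose dimensionality (an assumption, not the paper's)

def embed(word):
    # Deterministic toy word embedding derived from character codes.
    padded = word[:DIM].ljust(DIM, "x")
    return [math.sin(ord(c) * (i + 1)) for i, c in enumerate(padded)]

def encode(sentence):
    # Stand-in for the text-encoder RNN: mean-pool the word embeddings
    # into a single context vector.
    vecs = [embed(w) for w in sentence.split()]
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(DIM)]

def decode(context, steps=3):
    # Stand-in for the action-decoder RNN: a simple recurrence that
    # emits one pose vector per time step.
    h, poses = context, []
    for _ in range(steps):
        h = [math.tanh(0.5 * x + 0.1) for x in h]
        poses.append(h)
    return poses

poses = decode(encode("a person waves the right hand"))
print(len(poses), len(poses[0]))  # 3 pose vectors of dimensionality 4
```

In the paper's setting, the decoder output would be a sequence of upper-body joint configurations that can be retargeted to a robot such as Baxter.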
Topological Semantic Graph Memory for Image-Goal Navigation
A novel framework is proposed to incrementally collect landmark-based graph
memory and use the collected memory for image goal navigation. Given a target
image to search, an embodied robot utilizes semantic memory to find the target
in an unknown environment. The semantic graph memory is collected from a
panoramic observation of an RGB-D camera without knowing the robot's pose. In
this paper, we present a topological semantic graph memory (TSGM), which
consists of (1) a graph builder that takes the observed RGB-D image to
construct a topological semantic graph, (2) a cross graph mixer module that
takes the collected nodes to get contextual information, and (3) a memory
decoder that takes the contextual memory as an input to find an action to the
target. On the task of image goal navigation, TSGM significantly outperforms
competitive baselines by +5.0-9.0% in success rate and +7.0-23.5% in SPL,
which indicates that TSGM finds efficient paths. Additionally, we demonstrate
our method on a mobile robot in real-world image goal scenarios.
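The incremental graph-building step described above can be sketched as follows. This is a minimal illustration in the spirit of TSGM's graph builder, not the paper's implementation: the feature comparison, similarity threshold, and class names are all assumptions, and the cross graph mixer and memory decoder are omitted.

```python
def cosine(a, b):
    # Cosine similarity between two observation feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

class GraphMemory:
    """Toy topological graph memory built incrementally from observations."""

    def __init__(self, threshold=0.9):  # threshold is an assumed value
        self.nodes = []      # one feature vector per landmark node
        self.edges = set()   # undirected (i, j) index pairs
        self.threshold = threshold
        self.last = None     # index of the most recently visited node

    def observe(self, feature):
        # Merge with an existing node if the observation is similar
        # enough; otherwise create a new node. Consecutive observations
        # are linked, giving the graph its topological structure.
        for i, f in enumerate(self.nodes):
            if cosine(feature, f) >= self.threshold:
                self._link(i)
                return i
        self.nodes.append(feature)
        self._link(len(self.nodes) - 1)
        return len(self.nodes) - 1

    def _link(self, i):
        if self.last is not None and self.last != i:
            self.edges.add(tuple(sorted((self.last, i))))
        self.last = i

mem = GraphMemory()
for obs in [(1.0, 0.0), (0.99, 0.05), (0.0, 1.0), (1.0, 0.01)]:
    mem.observe(obs)
print(len(mem.nodes), len(mem.edges))  # 2 nodes, 1 edge
```

Revisited places collapse onto existing nodes, so the memory stays compact while the edge set records which landmarks are reachable from which, which is what lets a navigation policy plan efficient paths over the graph.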
Deep Ego-Motion Classifiers for Compound Eye Cameras
Compound eyes, also known as insect eyes, have a unique structure: a hemispheric surface on which many single eyes are arranged regularly. Thanks to this form, compound images offer several advantages, such as a large field of view (FOV) with low aberrations. We can exploit these benefits in high-level vision applications, such as object recognition or semantic segmentation for a moving robot, by emulating the compound images that describe the scenes captured by compound eye cameras. In this paper, to the best of our knowledge, we propose the first convolutional neural network (CNN)-based ego-motion classification algorithm designed for the compound eye structure. To achieve this, we introduce a voting-based approach that fully utilizes a unique feature of compound images, namely that they consist of many single eye images. The proposed method classifies a number of local motions with a CNN, and these local classifications, which represent the motions of the individual single eye images, are aggregated into a final classification by a voting procedure. For the experiments, we collected a new dataset for compound eye camera ego-motion classification, containing scenes from the inside and outside of a building. Each sample of the proposed dataset consists of two consecutive emulated compound images and the corresponding ego-motion class. The experimental results show that the proposed method achieves a classification accuracy of 85.0%, which is superior to the baselines on the proposed dataset. Moreover, the proposed model is lightweight compared to conventional CNN-based image recognition models such as AlexNet, ResNet50, and MobileNetV2.
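The voting step described above, in which per-eye local classifications are aggregated into a final ego-motion label, can be sketched compactly. The function and class names below are assumptions for illustration, and the per-eye CNN itself is replaced by a list of stand-in predictions.

```python
from collections import Counter

def aggregate_votes(local_predictions):
    # local_predictions: one predicted ego-motion class per single-eye
    # image, as produced by the per-eye CNN. The final label is the
    # majority vote across all eyes.
    counts = Counter(local_predictions)
    return counts.most_common(1)[0][0]

# Stand-in for per-eye CNN outputs on one pair of compound images
# (hypothetical class names).
per_eye = ["forward", "forward", "left", "forward", "right", "forward"]
print(aggregate_votes(per_eye))  # forward
```

Majority voting makes the final decision robust to individual single-eye images whose local motion is ambiguous or mis-classified, which is presumably why it suits the many-small-apertures structure of compound images.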