We present a new method to translate videos to commands for robotic
manipulation using Deep Recurrent Neural Networks (RNN). Our framework first
extracts deep features from the input video frames with a deep Convolutional
Neural Network (CNN). Two RNN layers with an encoder-decoder architecture are
then used to encode the visual features and sequentially generate the output
words as the command. We demonstrate that the translation accuracy can be
improved by allowing a smooth transition between the two RNN layers and using a
state-of-the-art feature extractor. The experimental results on our new
challenging dataset show that our approach outperforms recent methods by a fair
margin. Furthermore, we combine the proposed translation module with the vision
and planning system to let a robot perform various manipulation tasks. Finally,
we demonstrate the effectiveness of our framework on the full-size humanoid
robot WALK-MAN.
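To make the described pipeline concrete, the following is a minimal sketch (not the authors' implementation) of a video-to-command model: per-frame CNN features feed a first RNN layer that encodes the visual sequence, and a second RNN layer decodes into command words. The backbone choice, layer sizes, and vocabulary handling here are assumptions for illustration only.

```python
# Hypothetical sketch of the CNN + two-layer encoder-decoder RNN pipeline.
# Layer sizes, the ResNet-50 backbone, and the word embedding are assumptions.
import torch
import torch.nn as nn
from torchvision import models


class Video2Command(nn.Module):
    def __init__(self, vocab_size, feat_dim=2048, hidden_dim=512):
        super().__init__()
        # Frame-level deep feature extractor (backbone assumed here).
        backbone = models.resnet50(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])
        # First RNN layer: encodes the sequence of visual features.
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        # Second RNN layer: decodes into a sequence of command words.
        self.decoder = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.classify = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frames, words):
        # frames: (batch, time, 3, H, W); words: (batch, length) token ids
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).view(b, t, -1)
        _, state = self.encoder(feats)
        # Hand the encoder's final hidden state to the decoder so the
        # transition between the two RNN layers carries the video context.
        dec_out, _ = self.decoder(self.embed(words), state)
        return self.classify(dec_out)  # per-step scores over the vocabulary
```

In this sketch the output scores would be trained with a cross-entropy loss against the ground-truth command words; at test time the decoder would generate the command one word at a time.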