Language-conditioned robot manipulation, a topic of growing interest, aims to
develop robots that can understand and execute complex tasks by interpreting
language commands and manipulating objects accordingly. While
language-conditioned approaches demonstrate impressive capabilities on tasks
in familiar environments, they adapt poorly to unfamiliar environments. In this
study, we propose a general-purpose, language-conditioned approach that
combines base skill priors with imitation learning from unstructured data to
improve generalization to unfamiliar environments.
We assess our model's performance in both simulated and real-world environments
in a zero-shot setting. In the simulated environment, the proposed approach
surpasses previously reported scores on the CALVIN benchmark, especially in the
challenging Zero-Shot Multi-Environment setting. The average completed task
length, i.e., the average number of tasks the agent completes consecutively,
improves by a factor of more than 2.5 over the state-of-the-art method
HULC. In addition, we conduct a zero-shot evaluation of our policy in a
real-world setting, after training exclusively in simulated environments
without any additional adaptation. In this evaluation, we set up ten
tasks, on which our approach achieves an average improvement of 30% over the
current state-of-the-art approach, demonstrating strong generalization
in both simulated environments and the real world. For further
details, including access to our code and videos, please refer to our
supplementary materials.
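
To make the average completed task length concrete, the following is a minimal sketch of how such a metric could be computed. It is not the benchmark's own evaluation code. The function names, the assumption that a chain counts as ended at its first failed task, and the example outcomes are all hypothetical and chosen only for illustration.

```python
from typing import Sequence


def completed_length(outcomes: Sequence[bool]) -> int:
    """Count tasks completed in a row before the first failure.

    Assumption: a chain of tasks is considered over once one task fails,
    so later successes in the same chain are not counted.
    """
    count = 0
    for succeeded in outcomes:
        if not succeeded:
            break
        count += 1
    return count


def average_completed_length(rollouts: Sequence[Sequence[bool]]) -> float:
    """Average the consecutive-completion count over all evaluated chains."""
    if not rollouts:
        raise ValueError("need at least one rollout")
    return sum(completed_length(r) for r in rollouts) / len(rollouts)


# Made-up outcomes for three chains of five tasks (not results from the paper).
rollouts = [
    [True, True, False, True, False],  # 2 tasks completed in a row
    [True, True, True, True, True],    # all 5 completed
    [False, True, True, True, True],   # fails immediately, counts as 0
]
print(average_completed_length(rollouts))  # (2 + 5 + 0) / 3, about 2.33
```

Under these assumptions, a higher value means the agent sustains longer uninterrupted sequences of instructions, which is why the metric is sensitive to early failures in a chain.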