    Time-Contrastive Networks: Self-Supervised Learning from Video

    We propose a self-supervised approach for learning representations and robotic behaviors entirely from unlabeled videos recorded from multiple viewpoints, and study how this representation can be used in two robotic imitation settings: imitating object interactions from videos of humans, and imitating human poses. Imitation of human behavior requires a viewpoint-invariant representation that captures the relationships between end-effectors (hands or robot grippers) and the environment, object attributes, and body pose. We train our representations using a metric learning loss, where multiple simultaneous viewpoints of the same observation are attracted in the embedding space, while being repelled from temporal neighbors which are often visually similar but functionally different. In other words, the model simultaneously learns to recognize what is common between different-looking images, and what is different between similar-looking images. This signal causes our model to discover attributes that do not change across viewpoint, but do change across time, while ignoring nuisance variables such as occlusions, motion blur, lighting and background. We demonstrate that this representation can be used by a robot to directly mimic human poses without an explicit correspondence, and that it can be used as a reward function within a reinforcement learning algorithm. While representations are learned from an unlabeled collection of task-related videos, robot behaviors such as pouring are learned by watching a single 3rd-person demonstration by a human. Reward functions obtained by following the human demonstrations under the learned representation enable efficient reinforcement learning that is practical for real-world robotic systems. Video results, open-source code and dataset are available at https://sermanet.github.io/imitat
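
    The abstract describes a metric learning loss that attracts simultaneous views and repels temporal neighbors, but does not give its form. A minimal sketch, assuming a triplet hinge loss on squared Euclidean distances, might look as follows; the function name and the `margin` value are illustrative assumptions, not taken from the abstract.

```python
import numpy as np


def time_contrastive_triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet hinge loss (an assumed form of the metric learning loss).

    anchor:   embedding of one viewpoint at time t
    positive: embedding of a second viewpoint at the same time t
    negative: embedding of the anchor's viewpoint at a nearby time
    margin:   hypothetical margin; the abstract does not specify one
    """
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)  # distance to co-occurring view
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)  # distance to temporal neighbor
    # Zero once the temporal neighbor is farther than the co-occurring view by the margin.
    return np.maximum(0.0, d_pos - d_neg + margin)
```

    The loss vanishes when the co-occurring view is already closer than the temporal neighbor by at least the margin, so the gradient comes only from triplets that still violate that ordering.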

    Using humanoid robots to study human behavior

    Our understanding of human behavior advances as our humanoid robotics work progresses, and vice versa. This team's work focuses on trajectory formation and planning, learning from demonstration, oculomotor control and interactive behaviors. They are programming robotic behavior based on how we humans “program” behavior in, or train, each other.

    Investigation of the Sense of Agency in Social Cognition, based on frameworks of Predictive Coding and Active Inference: A simulation study on multimodal imitative interaction

    When agents interact socially with different intentions, conflicts are difficult to avoid. Although how agents can resolve such problems autonomously has not been determined, dynamic characteristics of agency may shed light on underlying mechanisms. The current study focused on the sense of agency (SoA), a specific aspect of agency referring to congruence between the agent's intention in acting and the outcome. Employing predictive coding and active inference as theoretical frameworks of perception and action generation, we hypothesize that regulation of complexity in the evidence lower bound of an agent's model should affect the strength of the agent's SoA and should have a critical impact on social interactions. We built a computational model of imitative interaction between a robot and a human via visuo-proprioceptive sensation with a variational Bayes recurrent neural network, and simulated the model in the form of pseudo-imitative interaction using recorded human body movement data. A key feature of the model is that each modality's complexity can be regulated differently with a hyperparameter assigned to each module. We first searched for an optimal setting that endows the model with appropriate coordination of multimodal sensation. This revealed that the vision module's complexity should be more tightly regulated than that of the proprioception module. Using the optimally trained model, we examined how changing the tightness of complexity regulation after training affects the strength of the SoA during interactions. The results showed that with looser regulation, an agent tends to act more egocentrically, without adapting to the other. In contrast, with tighter regulation, the agent tends to follow the other by adjusting its intention. We conclude that the tightness of complexity regulation crucially affects the strength of the SoA and the dynamics of interactions between agents. Comment: 23 pages, 8 figures
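
    The per-modality regulation of complexity can be illustrated with a small sketch. It assumes the objective is a negative evidence lower bound written as a reconstruction error plus a weighted KL divergence per modality, with diagonal Gaussian posteriors and priors; the function names, the dictionary keys and the weight values are hypothetical, and the abstract does not state how a given weight maps onto "tight" or "loose" regulation.

```python
import numpy as np


def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """KL(q || p) between diagonal Gaussians, summed over dimensions."""
    var_q, var_p = sigma_q ** 2, sigma_p ** 2
    return np.sum(
        np.log(sigma_p / sigma_q) + (var_q + (mu_q - mu_p) ** 2) / (2.0 * var_p) - 0.5
    )


def weighted_free_energy(recon_errors, kls, weights):
    """Sum over modalities of reconstruction error plus weighted complexity.

    recon_errors, kls, weights: dicts keyed by modality name (hypothetical keys).
    weights[m] is the assumed per-module hyperparameter scaling the KL term.
    """
    return sum(recon_errors[m] + weights[m] * kls[m] for m in recon_errors)
```

    For example, `weighted_free_energy({"vision": 1.0, "proprio": 2.0}, {"vision": 0.5, "proprio": 0.5}, {"vision": 2.0, "proprio": 0.1})` weights the same KL value differently for each module, which is the kind of asymmetry the abstract describes.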

    Probabilistic movement modeling for intention inference in human-robot interaction

    Intention inference can be an essential step toward efficient human-robot interaction. For this purpose, we propose the Intention-Driven Dynamics Model (IDDM) to probabilistically model the generative process of movements that are directed by the intention. The IDDM allows the intention to be inferred from observed movements using Bayes' theorem. The IDDM simultaneously finds a latent state representation of noisy and high-dimensional observations, and models the intention-driven dynamics in the latent states. As most robotics applications are subject to real-time constraints, we develop an efficient online algorithm that allows for real-time intention inference. Two human-robot interaction scenarios, i.e., target prediction for robot table tennis and action recognition for interactive humanoid robots, are used to evaluate the performance of our inference algorithm. In both intention inference tasks, the proposed algorithm achieves substantial improvements over support vector machines and Gaussian processes.
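
    The use of Bayes' theorem for online intention inference can be sketched in a simplified setting. The example below is not the IDDM: it assumes a small discrete set of candidate intentions (target positions), fully observed states instead of a learned latent representation, and a hypothetical linear dynamics model (`predict_next`) with isotropic Gaussian noise. All names and parameter values are illustrative.

```python
import numpy as np


def gaussian_log_likelihood(x_next, x_pred, noise_std):
    """Log density of x_next under an isotropic Gaussian centred on x_pred."""
    resid = (x_next - x_pred) / noise_std
    return -0.5 * np.sum(resid ** 2) - x_next.size * np.log(noise_std * np.sqrt(2.0 * np.pi))


def predict_next(x, target, gain=0.5):
    """Hypothetical intention-driven dynamics: move a fraction of the way to the target."""
    return x + gain * (target - x)


def infer_intention(trajectory, targets, noise_std=0.1):
    """Recursively apply Bayes' theorem over a discrete set of candidate intentions."""
    log_post = np.full(len(targets), -np.log(len(targets)))  # uniform prior
    for x, x_next in zip(trajectory[:-1], trajectory[1:]):
        log_lik = np.array([
            gaussian_log_likelihood(x_next, predict_next(x, g), noise_std)
            for g in targets
        ])
        log_post = log_post + log_lik                 # posterior is prior times likelihood
        log_post -= np.max(log_post)                  # shift for numerical stability
        log_post -= np.log(np.sum(np.exp(log_post)))  # normalize
    return np.exp(log_post)
```

    Each new observation updates the posterior in constant time per candidate intention, which is what makes a recursive formulation of this kind attractive under real-time constraints.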

    SLoMo: A General System for Legged Robot Motion Imitation from Casual Videos

    We present SLoMo: a first-of-its-kind framework for transferring skilled motions from casually captured "in the wild" video footage of humans and animals to legged robots. SLoMo works in three stages: 1) synthesize a physically plausible reconstructed key-point trajectory from monocular videos; 2) optimize, offline, a dynamically feasible reference trajectory for the robot that closely tracks the key points and includes body and foot motion as well as contact sequences; 3) track the reference trajectory online using a general-purpose model-predictive controller on robot hardware. Traditional motion imitation for legged motor skills often requires expert animators, collaborative demonstrations, and/or expensive motion capture equipment, all of which limit scalability. Instead, SLoMo relies only on easy-to-obtain monocular video footage, readily available in online repositories such as YouTube. It converts videos into motion primitives that can be executed reliably by real-world robots. We demonstrate our approach by transferring the motions of cats, dogs, and humans to example robots including a quadruped (on hardware) and a humanoid (in simulation). To the best of the authors' knowledge, this is the first attempt at a general-purpose motion transfer framework that imitates animal and human motions on legged robots directly from casual videos without artificial markers or labels. Comment: accepted at RA-L 2023, with ICRA 2024 option

    Cultural differences in speed adaptation in human-robot interaction tasks

    In social interactions, human movement is a rich source of information for all those who take part in the collaboration. In fact, a variety of intuitive messages are communicated through motion and continuously inform the partners about the future unfolding of the actions. A similar exchange of implicit information could support movement coordination in the context of Human-Robot Interaction (HRI). In this work, we investigate how implicit signaling in an interaction with a humanoid robot can lead to emergent coordination in the form of automatic speed adaptation. In particular, we assess whether different cultures – specifically Japanese and Italian – have a different impact on motor resonance and synchronization in HRI. Japanese people show a higher general acceptance toward robots than people from Western cultures. Since acceptance, or better affiliation, is tightly connected to imitation and mimicry, we hypothesize a higher degree of speed imitation for Japanese participants than for Italians. In the experimental studies undertaken both in Japan and Italy, we observe that cultural differences do not affect the natural predisposition of subjects to adapt to the robot.
