
    High Level Learning Using the Temporal Features of Human Demonstrated Sequential Tasks

    Get PDF
    Modelling human-led demonstrations of high-level sequential tasks is fundamental to a number of practical inference applications, including vision-based policy learning and activity recognition. Demonstrations of these tasks are captured as videos with long durations and similar spatial contents. Learning from this data is challenging since inference cannot be conducted solely on spatial feature presence and must instead consider how spatial features play out across time. To be successful, these temporal representations must generalize to variations in the duration of activities and be able to capture relationships between events expressed across the scale of an entire video. Contemporary deep learning architectures that represent time (convolution-based and Recurrent Neural Networks) do not address these concerns. Representations learned by these models describe temporal features in terms of fixed durations such as minutes, seconds, and frames. They are also developed sequentially and must use unreasonably large models to capture temporal features expressed at scale. Probabilistic temporal models have been successful in representing the temporal information of videos in a duration-invariant manner that is robust to scale; however, this has only been accomplished through the use of user-defined spatial features. Such abstractions make unrealistic assumptions about the content being expressed in these videos and the quality of the perception model, and they also limit the potential applications of trained models. To that end, I present D-ITR-L, a temporal wrapper that extends the spatial features extracted from a typical CNN architecture and transforms them into temporal features. D-ITR-L-derived temporal features are duration invariant and can identify temporal relationships between events at the scale of a full video. Validation of this claim is conducted through various vision-based policy learning and action recognition settings. Additionally, these studies show that challenging visual domains, such as human-led demonstrations of high-level sequential tasks, can be effectively represented when using a D-ITR-L-based model.
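    The abstract does not detail D-ITR-L's internals, but the "ITR" naming and the duration-invariance claim suggest interval temporal relations computed over per-feature activation traces. The sketch below is an assumption-laden illustration of that general idea, not the author's method: per-frame CNN activations (a hypothetical frame_features array) are thresholded into activation intervals, and the video is summarized by counts of Allen-style relations between those intervals, which depend on ordering and overlap rather than absolute frame counts.

```python
# Illustrative sketch only: D-ITR-L's actual mechanism is not specified in the abstract.
# Per-feature CNN activation traces are thresholded into intervals, and the video is
# described by counts of coarse Allen-style relations between those intervals --
# a duration-invariant summary, since relations ignore absolute frame counts.
import numpy as np

def activation_interval(trace, threshold=0.5):
    """Return (start, end) of the supra-threshold activation span, or None."""
    active = np.nonzero(trace > threshold)[0]
    if active.size == 0:
        return None
    return active[0], active[-1]

def allen_relation(a, b):
    """Coarse Allen-style relation between two intervals a=(s,e), b=(s,e)."""
    if a[1] < b[0]:
        return "before"
    if b[1] < a[0]:
        return "after"
    if a[0] == b[0] and a[1] == b[1]:
        return "equals"
    if a[0] >= b[0] and a[1] <= b[1]:
        return "during"
    if b[0] >= a[0] and b[1] <= a[1]:
        return "contains"
    return "overlaps"

def temporal_descriptor(frame_features, threshold=0.5):
    """frame_features: (T, D) array of per-frame CNN feature activations.
    Returns a dict counting pairwise interval relations across the D features."""
    T, D = frame_features.shape
    intervals = [activation_interval(frame_features[:, d], threshold) for d in range(D)]
    counts = {}
    for i in range(D):
        for j in range(i + 1, D):
            if intervals[i] is None or intervals[j] is None:
                continue
            rel = allen_relation(intervals[i], intervals[j])
            counts[rel] = counts.get(rel, 0) + 1
    return counts

# Toy usage: 100 frames, 8 feature channels of random "activations".
print(temporal_descriptor(np.random.rand(100, 8)))
```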

    Deep Object-Centric Representations for Generalizable Robot Learning

    Full text link
    Robotic manipulation in complex open-world scenarios requires both reliable physical manipulation skills and effective, generalizable perception. In this paper, we propose a method where general-purpose pretrained visual models serve as an object-centric prior for the perception system of a learned policy. We devise an object-level attentional mechanism that can be used to determine relevant objects from a few trajectories or demonstrations, and then immediately incorporate those objects into a learned policy. A task-independent meta-attention locates possible objects in the scene, and a task-specific attention identifies which objects are predictive of the trajectories. The scope of the task-specific attention is easily adjusted by showing demonstrations with distractor objects or with diverse relevant objects. Our results indicate that this approach exhibits good generalization across object instances using very few samples, and that it can be used to learn a variety of manipulation tasks using reinforcement learning.
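    As a rough illustration of the object-level attention described above (a hedged sketch, not the authors' implementation), the snippet below assumes a frozen pretrained detector has already produced one feature vector per candidate object; a learned task-specific score weights the objects, and the attended feature feeds a small policy head. All names and dimensions (ObjectAttentionPolicy, obj_dim, action_dim) are hypothetical.

```python
# Hedged sketch (not the paper's code): task-specific attention over object-centric
# features. A pretrained detector (assumed) yields one feature vector per candidate
# object; a learned score picks out task-relevant objects and the weighted sum
# feeds a small policy head.
import torch
import torch.nn as nn

class ObjectAttentionPolicy(nn.Module):
    def __init__(self, obj_dim=256, action_dim=4):
        super().__init__()
        self.score = nn.Linear(obj_dim, 1)           # task-specific relevance score
        self.policy = nn.Sequential(                 # maps attended features to actions
            nn.Linear(obj_dim, 128), nn.ReLU(), nn.Linear(128, action_dim)
        )

    def forward(self, obj_feats):
        # obj_feats: (batch, num_objects, obj_dim) from a frozen, pretrained perception model
        weights = torch.softmax(self.score(obj_feats).squeeze(-1), dim=-1)  # (batch, num_objects)
        attended = (weights.unsqueeze(-1) * obj_feats).sum(dim=1)           # (batch, obj_dim)
        return self.policy(attended), weights

# Toy usage: 2 scenes, 5 candidate objects each, 256-d features.
policy = ObjectAttentionPolicy()
actions, attn = policy(torch.randn(2, 5, 256))
print(actions.shape, attn.shape)  # torch.Size([2, 4]) torch.Size([2, 5])
```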

    Hierarchical Generative Adversarial Imitation Learning with Mid-level Input Generation for Autonomous Driving on Urban Environments

    Full text link
    Deriving robust control policies for realistic urban navigation scenarios is not a trivial task. In an end-to-end approach, these policies must map high-dimensional images from the vehicle's cameras to low-level actions such as steering and throttle. While pure Reinforcement Learning (RL) approaches are based exclusively on rewards, Generative Adversarial Imitation Learning (GAIL) agents learn from expert demonstrations while interacting with the environment, which favors GAIL on tasks for which a reward signal is difficult to derive. In this work, the hGAIL architecture was proposed to solve autonomous navigation of a vehicle in an end-to-end manner, mapping sensory perceptions directly to low-level actions while simultaneously learning mid-level input representations of the agent's environment. The proposed hGAIL consists of a hierarchical Adversarial Imitation Learning architecture composed of two main modules: a GAN (Generative Adversarial Network), which generates the Bird's-Eye View (BEV) representation mainly from the images of the vehicle's three frontal cameras, and a GAIL agent, which learns to control the vehicle based mainly on the BEV predictions from the GAN as input. Our experiments have shown that GAIL trained exclusively from the cameras (without the BEV) fails to even learn the task, while hGAIL, after training, was able to navigate all intersections of the city successfully and autonomously.
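    The two-stage data flow described above can be sketched as follows (illustrative only; this is not the published hGAIL code, and it omits the adversarial training of both modules): one network maps the three stacked frontal camera images to a Bird's-Eye View prediction, and a second network maps that BEV to steering and throttle. Layer sizes, image resolutions, and class names are assumptions.

```python
# Hedged sketch of the camera -> BEV -> action pipeline the abstract describes
# (not the published hGAIL implementation). Shapes and layer sizes are illustrative.
import torch
import torch.nn as nn

class CameraToBEV(nn.Module):
    def __init__(self):
        super().__init__()
        # 3 cameras x 3 RGB channels stacked along the channel axis
        self.net = nn.Sequential(
            nn.Conv2d(9, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1), nn.Sigmoid(),  # 1-channel BEV mask
        )

    def forward(self, cams):        # cams: (batch, 9, H, W)
        return self.net(cams)

class BEVPolicy(nn.Module):
    def __init__(self, bev_hw=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(bev_hw * bev_hw, 128), nn.ReLU(),
            nn.Linear(128, 2), nn.Tanh(),   # [steering, throttle] in [-1, 1]
        )

    def forward(self, bev):         # bev: (batch, 1, bev_hw, bev_hw)
        return self.net(bev)

# Toy forward pass: batch of 1, three 64x64 RGB cameras stacked to 9 channels.
bev = CameraToBEV()(torch.randn(1, 9, 64, 64))
print(BEVPolicy()(bev))             # tensor of shape (1, 2)
```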