Time-Contrastive Networks: Self-Supervised Learning from Video
We propose a self-supervised approach for learning representations and
robotic behaviors entirely from unlabeled videos recorded from multiple
viewpoints, and study how this representation can be used in two robotic
imitation settings: imitating object interactions from videos of humans, and
imitating human poses. Imitation of human behavior requires a
viewpoint-invariant representation that captures the relationships between
end-effectors (hands or robot grippers) and the environment, object attributes,
and body pose. We train our representations using a metric learning loss, where
multiple simultaneous viewpoints of the same observation are attracted in the
embedding space, while being repelled from temporal neighbors, which are often
visually similar but functionally different. In other words, the model
simultaneously learns to recognize what is common between different-looking
images, and what is different between similar-looking images. This signal
causes our model to discover attributes that do not change across viewpoint,
but do change across time, while ignoring nuisance variables such as
occlusions, motion blur, lighting and background. We demonstrate that this
representation can be used by a robot to directly mimic human poses without an
explicit correspondence, and that it can be used as a reward function within a
reinforcement learning algorithm. While representations are learned from an
unlabeled collection of task-related videos, robot behaviors such as pouring
are learned by watching a single 3rd-person demonstration by a human. Reward
functions obtained by following the human demonstrations under the learned
representation enable efficient reinforcement learning that is practical for
real-world robotic systems. Video results, open-source code and dataset are
available at https://sermanet.github.io/imitat
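
To make the metric learning objective above concrete, the following is a minimal sketch of a multi-view triplet loss in PyTorch: embeddings of simultaneous viewpoints are attracted, while temporal neighbors from the same view are repelled. The margin, the negative-sampling offset, and the embedding network are illustrative assumptions, not the paper's exact settings.

import torch
import torch.nn.functional as F

def tcn_triplet_loss(view1, view2, margin=0.2, neg_offset=15):
    # view1, view2: (T, D) L2-normalized embeddings of T time-aligned
    # frames captured simultaneously from two viewpoints.
    T = view1.shape[0]
    anchor = view1                    # frame at time t, viewpoint 1
    positive = view2                  # same instant, viewpoint 2: attract
    # Temporal neighbors from the same viewpoint serve as negatives: repel.
    # (The fixed offset is an assumption; the paper samples negatives
    # from a window around the anchor.)
    negative = view1[(torch.arange(T) + neg_offset) % T]
    d_pos = ((anchor - positive) ** 2).sum(dim=1)
    d_neg = ((anchor - negative) ** 2).sum(dim=1)
    return F.relu(d_pos - d_neg + margin).mean()

# Usage with a hypothetical embedding network `net`:
# z1 = F.normalize(net(frames_view1), dim=1)
# z2 = F.normalize(net(frames_view2), dim=1)
# loss = tcn_triplet_loss(z1, z2)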
DAVIS-Ag: A Synthetic Plant Dataset for Developing Domain-Inspired Active Vision in Agricultural Robots
In agricultural environments, viewpoint planning can be a critical
functionality for a robot with visual sensors to obtain informative
observations of objects of interest (e.g., fruits) within complex plant
structures with random occlusions. Although recent studies on active vision
have shown some potential for agricultural tasks, each model has been designed
and validated in a unique environment that cannot easily be replicated for
benchmarking novel methods developed later. Hence, in this paper we introduce a
dataset for more extensive research on Domain-inspired Active VISion in
Agriculture (DAVIS-Ag). Specifically, we utilized our open-source "AgML"
framework and the "Helios" 3D plant simulator to produce 502K RGB
images from 30K dense spatial locations in 632 realistically synthesized
orchards of strawberries, tomatoes, and grapes. In addition, useful labels are
provided for each image, including (1) bounding boxes and (2) pixel-wise
instance segmentations for all identifiable fruits, and (3) pointers to other
images reachable by executing an action, so as to simulate the active selection
of viewpoints at each time step. Using DAVIS-Ag, we show motivating examples in
which fruit detection performance for the same plant can vary significantly
depending on the position and orientation of the camera view, primarily due to
occlusions by other plant components such as leaves. Furthermore, we develop
several baseline models to showcase the usage of the data on one agricultural
active vision task, fruit search optimization, providing evaluation results
against which future studies could benchmark their
methodologies. To encourage relevant research, our dataset is released online
and is freely available at: https://github.com/ctyeong/DAVIS-Ag
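
As a rough illustration of how the pointer labels could be consumed, the sketch below walks a viewpoint graph greedily toward views that expose more labeled fruits. The index schema (`boxes`, `neighbors`) and the greedy scoring are assumptions made for illustration; they are not the dataset's published API.

import json

def greedy_viewpoint_search(index_path, start_id, max_steps=10):
    # Assumed schema per image id:
    #   {"boxes": [...], "neighbors": {action_name: next_image_id}}
    with open(index_path) as f:
        index = json.load(f)
    current, visited = start_id, [start_id]
    for _ in range(max_steps):
        neighbors = index[current]["neighbors"]
        if not neighbors:
            break
        # Score each reachable viewpoint by its visible fruit count,
        # a stand-in for a real detector's confidence on that view.
        best = max(neighbors.values(), key=lambda i: len(index[i]["boxes"]))
        if len(index[best]["boxes"]) <= len(index[current]["boxes"]):
            break  # no neighbor improves visibility; stop searching
        current = best
        visited.append(current)
    return visited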
Policy Learning with Hypothesis based Local Action Selection
For robots to manipulate in unknown and unstructured environments, they must be
capable of operating under partial observability of the
environment. Object occlusions and unmodeled environments are some of the
factors that result in partial observability. A common scenario where this is
encountered is manipulation in clutter. When the robot needs to locate and
manipulate an object of interest, it must perform a series of decluttering
actions to accurately detect that object. To perform
such a series of actions, the robot also needs to account for the dynamics of
objects in the environment and how they react to contact. This is a non-trivial
problem since one needs to reason not only about robot-object interactions but
also object-object interactions in the presence of contact. In the example
scenario of manipulation in clutter, the state vector would have to account for
the pose of the object of interest and the structure of the surrounding
environment. The process model would have to account for all the aforementioned
robot-object and object-object interactions. The complexity of the process model
grows exponentially as the number of objects in the scene increases. This is
commonly the case in unstructured environments. Hence, it is not reasonable to
attempt to model all object-object and robot-object interactions explicitly.
Under this setting, we propose a hypothesis-based action selection algorithm in
which we construct a hypothesis set of the possible poses of an object of
interest given the current evidence in the scene, and select actions based on
that set. The hypothesis set represents the belief about the structure of the
environment and the possible poses the object of interest can take. The agent's
only stopping criterion is when the uncertainty regarding the pose of the
object is fully resolved.
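
A minimal sketch of that selection loop follows; the pose hypotheses, the consistency check (`is_consistent`), and the action scoring via `simulate` are placeholder assumptions, since the abstract does not specify them.

def select_action(hypotheses, actions, simulate):
    # Pick the action expected to prune the most pose hypotheses.
    # simulate(h, a) -> True if hypothesis h would remain consistent
    # after action a (an assumed stand-in for the evidence model).
    return min(actions, key=lambda a: sum(simulate(h, a) for h in hypotheses))

def resolve_pose(hypotheses, actions, simulate, execute, is_consistent,
                 max_steps=50):
    # Declutter until a single pose hypothesis survives, matching the
    # stopping criterion above: uncertainty about the pose fully resolved.
    for _ in range(max_steps):
        if len(hypotheses) <= 1:
            break
        a = select_action(hypotheses, actions, simulate)
        observation = execute(a)  # act in the world, gather new evidence
        hypotheses = [h for h in hypotheses if is_consistent(observation, h)]
    return hypotheses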