13,011 research outputs found

    On Pairwise Costs for Network Flow Multi-Object Tracking

    Full text link
    Multi-object tracking has been recently approached with the min-cost network flow optimization techniques. Such methods simultaneously resolve multiple object tracks in a video and enable modeling of dependencies among tracks. Min-cost network flow methods also fit well within the "tracking-by-detection" paradigm where object trajectories are obtained by connecting per-frame outputs of an object detector. Object detectors, however, often fail due to occlusions and clutter in the video. To cope with such situations, we propose to add pairwise costs to the min-cost network flow framework. While integer solutions to such a problem become NP-hard, we design a convex relaxation solution with an efficient rounding heuristic which empirically gives certificates of small suboptimality. We evaluate two particular types of pairwise costs and demonstrate improvements over recent tracking methods in real-world video sequences

    Augmenting Sensorimotor Control Using “Goal-Aware” Vibrotactile Stimulation during Reaching and Manipulation Behaviors

    Get PDF
    We describe two sets of experiments that examine the ability of vibrotactile encoding of simple position error and combined object states (calculated from an optimal controller) to enhance performance of reaching and manipulation tasks in healthy human adults. The goal of the first experiment (tracking) was to follow a moving target with a cursor on a computer screen. Visual and/or vibrotactile cues were provided in this experiment, and vibrotactile feedback was redundant with visual feedback in that it did not encode any information above and beyond what was already available via vision. After only 10 minutes of practice using vibrotactile feedback to guide performance, subjects tracked the moving target with response latency and movement accuracy values approaching those observed under visually guided reaching. Unlike previous reports on multisensory enhancement, combining vibrotactile and visual feedback of performance errors conferred neither positive nor negative effects on task performance. In the second experiment (balancing), vibrotactile feedback encoded a corrective motor command as a linear combination of object states (derived from a linear-quadratic regulator implementing a trade-off between kinematic and energetic performance) to teach subjects how to balance a simulated inverted pendulum. Here, the tactile feedback signal differed from visual feedback in that it provided information that was not readily available from visual feedback alone. Immediately after applying this novel “goal-aware” vibrotactile feedback, time to failure was improved by a factor of three. Additionally, the effect of vibrotactile training persisted after the feedback was removed. These results suggest that vibrotactile encoding of appropriate combinations of state information may be an effective form of augmented sensory feedback that can be applied, among other purposes, to compensate for lost or compromised proprioception as commonly observed, for example, in stroke survivors

    Time-Contrastive Networks: Self-Supervised Learning from Video

    Full text link
    We propose a self-supervised approach for learning representations and robotic behaviors entirely from unlabeled videos recorded from multiple viewpoints, and study how this representation can be used in two robotic imitation settings: imitating object interactions from videos of humans, and imitating human poses. Imitation of human behavior requires a viewpoint-invariant representation that captures the relationships between end-effectors (hands or robot grippers) and the environment, object attributes, and body pose. We train our representations using a metric learning loss, where multiple simultaneous viewpoints of the same observation are attracted in the embedding space, while being repelled from temporal neighbors which are often visually similar but functionally different. In other words, the model simultaneously learns to recognize what is common between different-looking images, and what is different between similar-looking images. This signal causes our model to discover attributes that do not change across viewpoint, but do change across time, while ignoring nuisance variables such as occlusions, motion blur, lighting and background. We demonstrate that this representation can be used by a robot to directly mimic human poses without an explicit correspondence, and that it can be used as a reward function within a reinforcement learning algorithm. While representations are learned from an unlabeled collection of task-related videos, robot behaviors such as pouring are learned by watching a single 3rd-person demonstration by a human. Reward functions obtained by following the human demonstrations under the learned representation enable efficient reinforcement learning that is practical for real-world robotic systems. Video results, open-source code and dataset are available at https://sermanet.github.io/imitat
    corecore