Binding Touch to Everything: Learning Unified Multimodal Tactile Representations
The ability to associate touch with other modalities has huge implications
for humans and computational systems. However, multimodal learning with touch
remains challenging due to the expensive data collection process and
non-standardized sensor outputs. We introduce UniTouch, a unified tactile model
for vision-based touch sensors connected to multiple modalities, including
vision, language, and sound. We achieve this by aligning our UniTouch
embeddings to pretrained image embeddings already associated with a variety of
other modalities. We further propose learnable sensor-specific tokens, allowing
the model to learn from a set of heterogeneous tactile sensors, all at the same
time. UniTouch is capable of conducting various touch sensing tasks in the
zero-shot setting, from robot grasping prediction to touch image question
answering. To the best of our knowledge, UniTouch is the first to demonstrate
such capabilities. Project page: https://cfeng16.github.io/UniTouch
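A minimal sketch of the alignment idea described above, not the authors' code: a trainable touch encoder is pulled toward frozen, pretrained image embeddings with an InfoNCE-style contrastive loss, and a learnable per-sensor token lets one model ingest heterogeneous tactile sensors. All module names and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TouchEncoder(nn.Module):
    def __init__(self, embed_dim=512, num_sensors=4):
        super().__init__()
        # stand-in backbone for tactile images (the paper uses a larger vision backbone)
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # learnable sensor-specific token, one per tactile sensor type
        self.sensor_token = nn.Embedding(num_sensors, 64)
        self.proj = nn.Linear(64 + 64, embed_dim)

    def forward(self, tactile_img, sensor_id):
        feat = self.backbone(tactile_img)
        tok = self.sensor_token(sensor_id)
        return F.normalize(self.proj(torch.cat([feat, tok], dim=-1)), dim=-1)

def alignment_loss(touch_emb, frozen_img_emb, temperature=0.07):
    # symmetric InfoNCE pulling touch embeddings toward paired frozen image embeddings
    logits = touch_emb @ frozen_img_emb.t() / temperature
    labels = torch.arange(len(logits))
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

# usage with random stand-in data
enc = TouchEncoder()
touch = enc(torch.randn(8, 3, 224, 224), torch.randint(0, 4, (8,)))
image = F.normalize(torch.randn(8, 512), dim=-1)  # would come from a frozen CLIP-style image encoder
loss = alignment_loss(touch, image)
```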
A Touch, Vision, and Language Dataset for Multimodal Alignment
Touch is an important sensing modality for humans, but it has not yet been
incorporated into a multimodal generative language model. This is partially due
to the difficulty of obtaining natural language labels for tactile data and the
complexity of aligning tactile readings with both visual observations and
language descriptions. As a step towards bridging that gap, this work
introduces a new dataset of 44K in-the-wild vision-touch pairs, with English
language labels annotated by humans (10%) and textual pseudo-labels from GPT-4V
(90%). We use this dataset to train a vision-language-aligned tactile encoder
for open-vocabulary classification and a touch-vision-language (TVL) model for
text generation using the trained encoder. Results suggest that by
incorporating touch, the TVL model improves (+29% classification accuracy)
touch-vision-language alignment over existing models trained on any pair of
those modalities. Although only a small fraction of the dataset is
human-labeled, the TVL model demonstrates improved visual-tactile understanding
over GPT-4V (+12%) and open-source vision-language models (+32%) on a new
touch-vision understanding benchmark. Code and data:
https://tactile-vlm.github.io
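A minimal sketch (not the released TVL code) of the open-vocabulary classification step the abstract mentions: a tactile embedding, already aligned to a text encoder's space, is scored against text embeddings of candidate descriptions, and the highest cosine similarity wins. Encoders are stubbed with random tensors here.

```python
import torch
import torch.nn.functional as F

def open_vocab_classify(tactile_emb, text_embs, labels):
    # cosine similarity = dot product of L2-normalized embeddings
    sims = F.normalize(tactile_emb, dim=-1) @ F.normalize(text_embs, dim=-1).t()
    return labels[sims.argmax(dim=-1).item()]

labels = ["smooth glass", "rough fabric", "soft foam"]
text_embs = torch.randn(len(labels), 512)   # would come from the frozen text encoder
tactile_emb = torch.randn(1, 512)           # would come from the trained tactile encoder
print(open_vocab_classify(tactile_emb, text_embs, labels))
```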
MOSAIC: Learning Unified Multi-Sensory Object Property Representations for Robot Learning via Interactive Perception
A holistic understanding of object properties across diverse sensory
modalities (e.g., visual, audio, and haptic) is essential for tasks ranging
from object categorization to complex manipulation. Drawing inspiration from
cognitive science studies that emphasize the significance of multi-sensory
integration in human perception, we introduce MOSAIC (Multimodal Object
property learning with Self-Attention and Interactive Comprehension), a novel
framework designed to facilitate the learning of unified multi-sensory object
property representations. While it is undeniable that visual information plays
a prominent role, we acknowledge that many fundamental object properties extend
beyond the visual domain to encompass attributes like texture, mass
distribution, or sounds, which significantly influence how we interact with
objects. In MOSAIC, we leverage this profound insight by distilling knowledge
from multimodal foundation models and aligning these representations not only
across vision but also haptic and auditory sensory modalities. Through
extensive experiments on a dataset where a humanoid robot interacts with 100
objects across 10 exploratory behaviors, we demonstrate the versatility of
MOSAIC in two task families: object categorization and object-fetching tasks.
Our results underscore the efficacy of MOSAIC's unified representations,
showing competitive performance in category recognition through a simple linear
probe setup and excelling in the fetch object task under zero-shot transfer
conditions. This work pioneers the application of sensory grounding in
foundation models for robotics, promising a significant leap in multi-sensory
perception capabilities for autonomous systems. We have released the code,
datasets, and additional results: https://github.com/gtatiya/MOSAIC
Comment: Accepted to the 2024 IEEE International Conference on Robotics and Automation (ICRA), May 13 to 17, 2024; Yokohama, Japan
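A minimal sketch of the cross-modal distillation idea described above, under assumptions and not the MOSAIC code: haptic and audio student encoders are trained so their embeddings match the frozen embedding a vision foundation model produces for the same interaction, yielding one shared object-property space. Dimensions and names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityEncoder(nn.Module):
    def __init__(self, in_dim, embed_dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, embed_dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

haptic_enc, audio_enc = ModalityEncoder(120), ModalityEncoder(128)

def distill_loss(student_emb, teacher_emb):
    # pull the student toward the frozen teacher embedding (cosine distance)
    return (1 - F.cosine_similarity(student_emb, teacher_emb.detach(), dim=-1)).mean()

haptics, audio = torch.randn(16, 120), torch.randn(16, 128)
teacher = F.normalize(torch.randn(16, 512), dim=-1)  # frozen vision foundation-model embedding
loss = distill_loss(haptic_enc(haptics), teacher) + distill_loss(audio_enc(audio), teacher)
```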
Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models
The ability to quickly learn a new task with minimal instruction - known as
few-shot learning - is a central aspect of intelligent agents. Classical
few-shot benchmarks make use of few-shot samples from a single modality, but
such samples may not be sufficient to characterize an entire concept class. In
contrast, humans use cross-modal information to learn new concepts efficiently.
In this work, we demonstrate that one can indeed build a better dog classifier
by reading about dogs and listening to them bark. To do so, we exploit the fact
that recent multimodal foundation models
such as CLIP are inherently cross-modal, mapping different modalities to the
same representation space. Specifically, we propose a simple cross-modal
adaptation approach that learns from few-shot examples spanning different
modalities. By repurposing class names as additional one-shot training samples,
we achieve SOTA results with an embarrassingly simple linear classifier for
vision-language adaptation. Furthermore, we show that our approach can benefit
existing methods such as prefix tuning, adapters, and classifier ensembling.
Finally, to explore other modalities beyond vision and language, we construct
the first (to our knowledge) audiovisual few-shot benchmark and use cross-modal
training to improve the performance of both image and audio classification.
Comment: CVPR 2023. Project website:
https://linzhiqiu.github.io/papers/cross_modal
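A minimal sketch of the cross-modal adaptation idea described above, not the authors' release: class-name text embeddings are treated as extra one-shot training examples alongside few-shot image embeddings, and a single linear classifier is fit in the shared CLIP-style space. The embeddings below are random stand-ins for pre-extracted features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes, dim, shots = 5, 512, 4
img_feats = F.normalize(torch.randn(num_classes * shots, dim), dim=-1)   # few-shot image embeddings
img_labels = torch.arange(num_classes).repeat_interleave(shots)
txt_feats = F.normalize(torch.randn(num_classes, dim), dim=-1)           # class-name text embeddings
txt_labels = torch.arange(num_classes)

# cross-modal training set: few-shot images plus one "text sample" per class
x = torch.cat([img_feats, txt_feats])
y = torch.cat([img_labels, txt_labels])

clf = nn.Linear(dim, num_classes)
opt = torch.optim.SGD(clf.parameters(), lr=0.1)
for _ in range(200):
    opt.zero_grad()
    F.cross_entropy(clf(x), y).backward()
    opt.step()
```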
RH20T: A Comprehensive Robotic Dataset for Learning Diverse Skills in One-Shot
A key challenge in robotic manipulation in open domains is how to acquire
diverse and generalizable skills for robots. Recent research in one-shot
imitation learning has shown promise in transferring trained policies to new
tasks based on demonstrations. This feature is attractive for enabling robots
to acquire new skills and improve task and motion planning. However, due to
limitations in the training dataset, the current focus of the community has
mainly been on simple cases, such as push or pick-place tasks, relying solely
on visual guidance. In reality, there are many complex skills, some of which
may even require both visual and tactile perception to solve. This paper aims
to unlock the potential for an agent to generalize to hundreds of real-world
skills with multi-modal perception. To achieve this, we have collected a
dataset comprising over 110,000 contact-rich robot manipulation sequences
across diverse skills, contexts, robots, and camera viewpoints, all collected
in the real world. Each sequence in the dataset includes visual, force, audio,
and action information. Moreover, we also provide a corresponding human
demonstration video and a language description for each robot sequence. We have
invested significant efforts in calibrating all the sensors and ensuring a
high-quality dataset. The dataset is made publicly available at rh20t.github.io
Comment: RSS 2023 workshop on LTAMP. The project page is at rh20t.github.io
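A hypothetical record layout (not the released RH20T schema) illustrating the per-sequence contents the abstract lists: multi-view RGB, force/torque, audio, robot actions, plus a paired human demonstration video and a language description. Field names, shapes, and types are assumptions for illustration only.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ManipulationSequence:
    rgb_frames: np.ndarray        # (T, num_cameras, H, W, 3) calibrated camera views
    force_torque: np.ndarray      # (T, 6) wrist force/torque readings
    audio: np.ndarray             # (num_audio_samples,) contact audio
    actions: np.ndarray           # (T, action_dim) commanded robot actions
    human_demo_video: str         # path to the paired human demonstration clip
    language_description: str     # natural-language description of the skill

seq = ManipulationSequence(
    rgb_frames=np.zeros((10, 2, 224, 224, 3), dtype=np.uint8),
    force_torque=np.zeros((10, 6)), audio=np.zeros(16000),
    actions=np.zeros((10, 7)), human_demo_video="demo.mp4",
    language_description="insert the plug into the socket",
)
```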
Self-Supervised Visuo-Tactile Pretraining to Locate and Follow Garment Features
Humans make extensive use of vision and touch as complementary senses, with
vision providing global information about the scene and touch measuring local
information during manipulation without suffering from occlusions. While prior
work demonstrates the efficacy of tactile sensing for precise manipulation of
deformables, it typically relies on supervised, human-labeled datasets. We
propose Self-Supervised Visuo-Tactile Pretraining (SSVTP), a framework for
learning multi-task visuo-tactile representations in a self-supervised manner
through cross-modal supervision. We design a mechanism that enables a robot to
autonomously collect precisely spatially-aligned visual and tactile image
pairs, then train visual and tactile encoders to embed these pairs into a
shared latent space using cross-modal contrastive loss. We apply this latent
space to downstream perception and control of deformable garments on flat
surfaces, and evaluate the flexibility of the learned representations without
fine-tuning on 5 tasks: feature classification, contact localization, anomaly
detection, feature search from a visual query (e.g., garment feature
localization under occlusion), and edge following along cloth edges. The
pretrained representations achieve a 73-100% success rate on these 5 tasks.
Comment: RSS 2023, site: https://sites.google.com/berkeley.edu/ssvt
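A minimal sketch (assumptions, not the SSVTP release) of the cross-modal pretraining objective described above: spatially aligned visual and tactile image crops are embedded by two trainable encoders and pulled together with a symmetric contrastive loss, after which the shared space can be queried across modalities. Architectures and crop sizes are toy stand-ins.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_encoder(embed_dim=128):
    # stand-in conv encoder for 64x64 crops; the paper uses larger backbones
    return nn.Sequential(
        nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, embed_dim),
    )

vis_enc, tac_enc = make_encoder(), make_encoder()

def cross_modal_contrastive(z_vis, z_tac, temperature=0.1):
    # matched visual/tactile pairs are positives; all other pairs in the batch are negatives
    z_vis, z_tac = F.normalize(z_vis, dim=-1), F.normalize(z_tac, dim=-1)
    logits = z_vis @ z_tac.t() / temperature
    labels = torch.arange(len(logits))
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

vis, tac = torch.randn(16, 3, 64, 64), torch.randn(16, 3, 64, 64)  # spatially aligned pairs
loss = cross_modal_contrastive(vis_enc(vis), tac_enc(tac))
```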
Generating Visual Scenes from Touch
An emerging line of work has sought to generate plausible imagery from touch.
Existing approaches, however, tackle only narrow aspects of the visuo-tactile
synthesis problem, and lag significantly behind the quality of cross-modal
synthesis methods in other domains. We draw on recent advances in latent
diffusion to create a model for synthesizing images from tactile signals (and
vice versa) and apply it to a number of visuo-tactile synthesis tasks. Using
this model, we significantly outperform prior work on the tactile-driven
stylization problem, i.e., manipulating an image to match a touch signal, and
we are the first to successfully generate images from touch without additional
sources of information about the scene. We also successfully use our model to
address two novel synthesis problems: generating images that do not contain the
touch sensor or the hand holding it, and estimating an image's shading from its
reflectance and touch.
Comment: ICCV 2023; Project site:
https://fredfyyang.github.io/vision-from-touch
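A minimal sketch (assumptions, not the authors' model) of the core training step for touch-conditioned latent diffusion as described above: a denoiser predicts the noise added to an image latent, conditioned on a tactile embedding, with the usual mean-squared-error diffusion objective. The architecture, noise schedule, and sizes are toy stand-ins.

```python
import torch
import torch.nn as nn

class TouchConditionedDenoiser(nn.Module):
    def __init__(self, latent_dim=64, touch_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + touch_dim + 1, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, noisy_latent, t, touch_emb):
        # timestep appended as a scalar feature; real models use sinusoidal
        # embeddings and cross-attention over the conditioning signal
        inp = torch.cat([noisy_latent, touch_emb, t[:, None].float()], dim=-1)
        return self.net(inp)

model = TouchConditionedDenoiser()
latent = torch.randn(8, 64)          # image latent from a pretrained autoencoder
touch_emb = torch.randn(8, 128)      # embedding of the paired tactile signal
t = torch.randint(0, 1000, (8,))
alpha_bar = torch.rand(8, 1)         # stand-in for the noise schedule at step t
noise = torch.randn_like(latent)
noisy = alpha_bar.sqrt() * latent + (1 - alpha_bar).sqrt() * noise
loss = nn.functional.mse_loss(model(noisy, t, touch_emb), noise)
```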