Accessible Robot Control in Mixed Reality
We propose a novel method to control Boston Dynamics' Spot robot via the
HoloLens 2. The method is designed primarily for people with physical
disabilities, allowing users to control the robot's movement and robot arm
without using their hands. The eye-gaze tracking and head-motion tracking of
the HoloLens 2 are used to send control commands: the robot's movement follows
the user's eye gaze, and the robot arm mimics the pose of the user's head. In
our experiments, our method is comparable to the traditional joystick control
method in both time efficiency and user experience. A demo can be found on our
project webpage: https://zhangganlin.github.io/Holo-Spot-Page/index.html
Comment: Course Project of Mixed Reality at ETH Zurich
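The gaze-to-movement and head-to-arm mappings described above can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the frame convention (z forward, x right), the speed cap, and the scaling gain are all assumptions for the sake of the example.

```python
import numpy as np

def gaze_to_velocity(gaze_dir, max_speed=0.5):
    # Project the (assumed) headset-frame gaze ray onto the ground plane:
    # z is taken as forward and x as right, so forward speed tracks
    # gaze_dir[2] and lateral speed tracks -gaze_dir[0].
    planar = np.array([gaze_dir[2], -gaze_dir[0]], dtype=float)
    norm = np.linalg.norm(planar)
    if norm < 1e-6:
        return 0.0, 0.0  # looking straight up/down: command a stop
    planar /= norm
    return max_speed * planar[0], max_speed * planar[1]

def head_pose_to_arm_pose(head_pitch, head_yaw, gain=0.8):
    # The arm end-effector mimics the user's head orientation,
    # scaled down by a hypothetical gain.
    return gain * head_pitch, gain * head_yaw
```

In a real system these commands would be streamed to the robot over its SDK at a fixed rate; here they are plain functions so the mapping itself is easy to inspect.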
MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning
Multimodal representation learning has shown promising improvements on
various vision-language tasks. Most existing methods excel at building
global-level alignment between vision and language while lacking effective
fine-grained image-text interaction. In this paper, we propose a jointly masked
multimodal modeling method to learn fine-grained multimodal representations.
Our method performs joint masking on image-text input and integrates both
implicit and explicit targets for the masked signals to recover. The implicit
target provides a unified and debiased objective for vision and language, where
the model predicts latent multimodal representations of the unmasked input. The
explicit target further enriches the multimodal representations by recovering
high-level and semantically meaningful information: momentum visual features of
image patches and concepts of word tokens. Through such a masked modeling
process, our model not only learns fine-grained multimodal interaction, but
also avoids the semantic gap between high-level representations and low- or
mid-level prediction targets (e.g., image pixels), thus producing semantically
rich multimodal representations that perform well in both zero-shot and
fine-tuned settings. Our pre-trained model (named MAMO) achieves
state-of-the-art performance on various downstream vision-language tasks,
including image-text retrieval, visual question answering, visual reasoning,
and weakly-supervised visual grounding.
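The joint masking and the implicit latent-regression target can be sketched in a few lines. This is a simplified NumPy illustration under stated assumptions: the masking ratio, the MSE form of the regression loss, and the function names are hypothetical, and the explicit target (momentum visual features and word concepts) is omitted.

```python
import numpy as np

def joint_mask(n_patches, n_tokens, ratio, rng):
    # Jointly sample boolean masks over image patches and word tokens,
    # so both modalities are masked in the same forward pass.
    return rng.random(n_patches) < ratio, rng.random(n_tokens) < ratio

def implicit_target_loss(predicted, momentum_latents, mask):
    # Implicit target: at masked positions, regress the model's predicted
    # latents onto a momentum encoder's latents of the *unmasked* input,
    # giving a unified objective for both vision and language.
    if not mask.any():
        return 0.0
    diff = predicted[mask] - momentum_latents[mask]
    return float(np.mean(diff ** 2))
```

The same loss shape applies to the text side with the token mask; in the paper the momentum encoder is an EMA copy of the model, which this sketch treats as a given array.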
Aligning Linguistic Words and Visual Semantic Units for Image Captioning
Image captioning attempts to generate a sentence composed of several
linguistic words, which are used to describe objects, attributes, and
interactions in an image, denoted as visual semantic units in this paper. Based
on this view, we propose to explicitly model the object interactions in
semantics and geometry based on Graph Convolutional Networks (GCNs), and fully
exploit the alignment between linguistic words and visual semantic units for
image captioning. Particularly, we construct a semantic graph and a geometry
graph, where each node corresponds to a visual semantic unit, i.e., an object,
an attribute, or a semantic (geometrical) interaction between two objects.
Accordingly, the semantic (geometrical) context-aware embeddings for each unit
are obtained through the corresponding GCN learning processes. At each time
step, a context-gated attention module takes as input the embeddings of the
visual semantic units and hierarchically aligns the current word with these
units by first deciding which type of visual semantic unit (object, attribute,
or interaction) the current word is about, and then finding the most correlated
visual semantic units under this type. Extensive experiments are conducted on
the challenging MS-COCO image captioning dataset, and superior results are
reported in comparison with state-of-the-art approaches.
Comment: 8 pages, 5 figures. Accepted by ACM MM 201
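A single GCN propagation step over such a semantic or geometry graph, where each node is a visual semantic unit, can be sketched as below. This is a generic NumPy rendering of the standard GCN update (self-loops plus symmetric normalization), not the authors' exact layer; the weight matrix and activation are assumptions.

```python
import numpy as np

def gcn_layer(adj, feats, weight):
    # adj:   (n, n) adjacency over visual semantic units
    # feats: (n, d) node embeddings (objects, attributes, interactions)
    # Add self-loops, symmetrically normalize, propagate, project, ReLU.
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ feats @ weight, 0.0)
```

Running this layer on the semantic graph and the geometry graph separately yields the two sets of context-aware unit embeddings that the context-gated attention module then consumes.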
Boosted Transformer for Image Captioning
Image captioning attempts to generate a description of a given image, typically using a Convolutional Neural Network as the encoder to extract visual features and a sequence model as the decoder to generate descriptions; among decoders, the self-attention mechanism has recently achieved notable progress. However, this predominant encoder-decoder architecture still has problems to be solved. On the encoder side, without semantic concepts, the extracted visual features do not make full use of the image information. On the decoder side, sequence self-attention relies only on word representations, lacking the guidance of visual information and being easily influenced by the language prior. In this paper, we propose a novel boosted transformer model with two attention modules to address these problems: “Concept-Guided Attention” (CGA) and “Vision-Guided Attention” (VGA). Our model uses CGA in the encoder to obtain boosted visual features by integrating instance-level concepts into the visual features. In the decoder, we stack VGA, which uses visual information as a bridge to model internal relationships among the sequences and can serve as an auxiliary module to sequence self-attention. Quantitative and qualitative results on the Microsoft COCO dataset demonstrate that our model outperforms state-of-the-art approaches.
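The core idea of Vision-Guided Attention, letting word representations attend over visual features so the decoder is anchored to the image rather than to language priors alone, can be sketched as standard cross-attention. This is a minimal single-head NumPy sketch under that reading of the abstract, not the paper's exact module: the projection matrices are dropped and the function name is hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def vision_guided_attention(word_reps, visual_feats):
    # Queries come from the word sequence, keys and values from the visual
    # features, so each word's updated representation is a convex mixture
    # of image regions rather than of previous words alone.
    d = word_reps.shape[-1]
    scores = word_reps @ visual_feats.T / np.sqrt(d)
    return softmax(scores) @ visual_feats
```

In the full model this output would be combined with ordinary sequence self-attention, with VGA acting as the auxiliary, vision-grounded path.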