15 research outputs found
Accessible Robot Control in Mixed Reality
A novel method for controlling the Boston Dynamics Spot robot with the HoloLens 2 is
proposed. The method is mainly designed for people with physical disabilities:
users can control the robot's movement and its arm without using their hands.
The eye-gaze tracking and head-motion tracking capabilities of the HoloLens 2 are
used to send control commands. The robot's movement follows the user's eye gaze,
and the robot arm mimics the pose of the user's head. In our experiments, the
method is comparable to traditional joystick control in both time efficiency and
user experience. A demo can be found on our project webpage:
https://zhangganlin.github.io/Holo-Spot-Page/index.html
Comment: Course Project of Mixed Reality at ETH Zurich
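To make the control mapping concrete, here is a minimal sketch of how eye-gaze and head-pose readings could be turned into robot commands. It is purely illustrative: the function names, frames, and thresholds are assumptions, and a real implementation would go through the HoloLens 2 and Spot SDK APIs rather than plain Python.

```python
# Illustrative sketch only -- not the paper's code and not the Spot SDK API.
import math
from dataclasses import dataclass

@dataclass
class Velocity:
    forward: float   # m/s, positive drives toward the gaze point
    rotation: float  # rad/s, positive turns left

def gaze_to_velocity(gaze_x: float, gaze_y: float,
                     max_speed: float = 0.5, max_turn: float = 0.5) -> Velocity:
    """Map a gaze point (metres, body frame: x forward, y left) to a base velocity."""
    distance = math.hypot(gaze_x, gaze_y)
    heading = math.atan2(gaze_y, gaze_x)              # bearing of the gaze point
    forward = max_speed * min(distance / 2.0, 1.0)    # slow down for nearby targets
    rotation = max_turn * max(-1.0, min(1.0, heading / (math.pi / 2)))
    return Velocity(forward, rotation)

def head_to_arm_pose(pitch: float, yaw: float, roll: float) -> dict:
    """Mimic the user's head orientation with the arm end-effector (radians)."""
    return {"ee_pitch": pitch, "ee_yaw": yaw, "ee_roll": roll}

if __name__ == "__main__":
    print(gaze_to_velocity(gaze_x=1.5, gaze_y=0.4))
    print(head_to_arm_pose(pitch=0.1, yaw=-0.2, roll=0.0))
```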
MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning
Multimodal representation learning has shown promising improvements on
various vision-language tasks. Most existing methods excel at building
global-level alignment between vision and language while lacking effective
fine-grained image-text interaction. In this paper, we propose a jointly masked
multimodal modeling method to learn fine-grained multimodal representations.
Our method performs joint masking on image-text input and integrates both
implicit and explicit targets for the masked signals to recover. The implicit
target provides a unified and debiased objective for vision and language, where
the model predicts latent multimodal representations of the unmasked input. The
explicit target further enriches the multimodal representations by recovering
high-level and semantically meaningful information: momentum visual features of
image patches and concepts of word tokens. Through such a masked modeling
process, our model not only learns fine-grained multimodal interaction, but
also avoids the semantic gap between high-level representations and low- or
mid-level prediction targets (e.g. image pixels), thus producing semantically
rich multimodal representations that perform well in both zero-shot and
fine-tuned settings. Our pre-trained model (named MAMO) achieves
state-of-the-art performance on various downstream vision-language tasks,
including image-text retrieval, visual question answering, visual reasoning,
and weakly-supervised visual grounding.
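A minimal PyTorch sketch of the masked-modeling idea as we read the abstract: image patches and word tokens are masked jointly, and the student regresses its masked-position outputs onto latent targets produced by a momentum (EMA) copy of the encoder that sees the unmasked input. All module names and sizes are illustrative, not the authors' implementation, and the explicit targets are omitted for brevity.

```python
# Toy sketch of joint masking with an implicit (momentum-latent) target.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMultimodalEncoder(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
    def forward(self, x):                     # x: (batch, seq, dim) patch+token embeddings
        return self.proj(x)

student, teacher = TinyMultimodalEncoder(), TinyMultimodalEncoder()
teacher.load_state_dict(student.state_dict())
for p in teacher.parameters():
    p.requires_grad_(False)                   # teacher is updated by EMA, not by gradients

def ema_update(student, teacher, momentum=0.995):
    for ps, pt in zip(student.parameters(), teacher.parameters()):
        pt.data.mul_(momentum).add_(ps.data, alpha=1 - momentum)

x = torch.randn(2, 16, 64)                    # fused image-patch + word-token embeddings
mask = torch.rand(2, 16) < 0.4                # joint mask over both modalities
masked_x = x.masked_fill(mask.unsqueeze(-1), 0.0)

with torch.no_grad():
    target = teacher(x)                       # implicit target: latents of the unmasked input
pred = student(masked_x)
loss = F.mse_loss(pred[mask], target[mask])   # recover representations at masked positions
loss.backward()
ema_update(student, teacher)
```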
Aligning Linguistic Words and Visual Semantic Units for Image Captioning
Image captioning attempts to generate a sentence composed of several
linguistic words, which are used to describe objects, attributes, and
interactions in an image, denoted as visual semantic units in this paper. Based
on this view, we propose to explicitly model the object interactions in
semantics and geometry based on Graph Convolutional Networks (GCNs), and fully
exploit the alignment between linguistic words and visual semantic units for
image captioning. Particularly, we construct a semantic graph and a geometry
graph, where each node corresponds to a visual semantic unit, i.e., an object,
an attribute, or a semantic (geometrical) interaction between two objects.
Accordingly, the semantic (geometrical) context-aware embeddings for each unit
are obtained through the corresponding GCN learning processes. At each time
step, a context-gated attention module takes as input the embeddings of the
visual semantic units and hierarchically aligns the current word with these
units by first deciding which type of visual semantic unit (object, attribute,
or interaction) the current word is about, and then finding the most correlated
visual semantic units under this type. Extensive experiments are conducted on
the challenging MS-COCO image captioning dataset, and superior results are
reported when compared to state-of-the-art approaches.
Comment: 8 pages, 5 figures. Accepted by ACM MM 201
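The hierarchical alignment step can be illustrated with a small PyTorch sketch (our simplification, not the paper's code): a gate first weights the three unit types from the current word's context, then attention within each type picks out the most correlated units.

```python
# Toy context-gated attention over object / attribute / interaction units.
import torch
import torch.nn.functional as F

dim = 64
obj_emb  = torch.randn(5, dim)     # GCN embeddings of object units
attr_emb = torch.randn(4, dim)     # attribute units
rel_emb  = torch.randn(6, dim)     # interaction (relationship) units
word_h   = torch.randn(dim)        # decoder hidden state for the current word

type_gate_w = torch.randn(dim, 3)  # scores the three unit types from the word context
attn_w      = torch.randn(dim, dim)  # bilinear attention between word and units

type_weights = F.softmax(word_h @ type_gate_w, dim=-1)  # which type is the word about?

def attend(units, query):
    scores = units @ attn_w @ query                      # (num_units,)
    return F.softmax(scores, dim=-1) @ units             # weighted sum of units

ctx = (type_weights[0] * attend(obj_emb, word_h)
       + type_weights[1] * attend(attr_emb, word_h)
       + type_weights[2] * attend(rel_emb, word_h))
print(ctx.shape)  # torch.Size([64]) -- context vector fed to the caption decoder
```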
EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE
Building scalable vision-language models to learn from diverse, multimodal
data remains an open challenge. In this paper, we introduce an Efficient
Vision-languagE foundation model, namely EVE: a unified multimodal
Transformer pre-trained solely with one unified pre-training task. Specifically,
EVE encodes both vision and language within a shared Transformer network
integrated with modality-aware sparse Mixture-of-Experts (MoE) modules, which
capture modality-specific information by selectively switching to different
experts. To unify pre-training tasks of vision and language, EVE performs
masked signal modeling on image-text pairs to reconstruct masked signals, i.e.,
image pixels and text tokens, given visible signals. This simple yet effective
pre-training objective accelerates training by 3.5x compared to the model
pre-trained with Image-Text Contrastive and Image-Text Matching losses. Owing
to the combination of the unified architecture and pre-training task, EVE is
easy to scale up, enabling better downstream performance with fewer resources
and faster training speed. Despite its simplicity, EVE achieves
state-of-the-art performance on various vision-language downstream tasks,
including visual question answering, visual reasoning, and image-text
retrieval.
Comment: Accepted by AAAI 202
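A rough sketch of what a modality-aware sparse MoE layer could look like (our own simplification, not EVE's implementation): vision and text tokens pass through the shared Transformer, but each position is routed to a modality-specific feed-forward expert.

```python
# Toy modality-aware MoE: route image-patch vs. word-token positions to separate experts.
import torch
import torch.nn as nn

class ModalityAwareMoE(nn.Module):
    def __init__(self, dim=64, hidden=256):
        super().__init__()
        self.vision_expert = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.text_expert   = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, tokens, is_text):
        # tokens: (batch, seq, dim); is_text: (batch, seq) bool, True at text positions
        out = torch.empty_like(tokens)
        out[~is_text] = self.vision_expert(tokens[~is_text])  # image patches -> vision expert
        out[is_text]  = self.text_expert(tokens[is_text])     # word tokens  -> text expert
        return out

layer = ModalityAwareMoE()
tokens = torch.randn(2, 10, 64)
is_text = torch.arange(10).expand(2, 10) >= 6  # last four positions are text in this toy example
print(layer(tokens, is_text).shape)            # torch.Size([2, 10, 64])
```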
Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models
Large Language Models (LLMs) have seen great advances in both academia and
industry, and their popularity has led to numerous open-source frameworks and
techniques for accelerating LLM pre-training, fine-tuning, and inference.
Training and deploying LLMs is expensive, as they require considerable computing
resources and memory; hence, many efficient approaches have been developed for
improving system pipelines as well as operators. However, the runtime
performance can vary significantly across hardware and software stacks, which
makes it difficult to choose the best configuration. In this work, we aim to
benchmark the performance from both macro and micro perspectives. First, we
benchmark the end-to-end performance of pre-training, fine-tuning, and serving
LLMs of different sizes, i.e., 7, 13, and 70 billion parameters (7B, 13B, and
70B), on three 8-GPU platforms, with and without individual optimization
techniques, including ZeRO, quantization, recomputation, and FlashAttention. Then,
we dive deeper to provide a detailed runtime analysis of the sub-modules,
including computing and communication operators in LLMs. For end users, our
benchmark and findings help them better understand different optimization
techniques, training and inference frameworks, and hardware platforms when
choosing configurations for deploying LLMs. For researchers, our in-depth
module-wise analyses uncover potential opportunities for future work to
further optimize the runtime performance of LLMs.
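As an illustration of the module-level timing such an analysis relies on, here is a small, self-contained benchmark sketch (not the paper's harness) that times a transformer-sized matrix multiply and reports throughput; the sizes and iteration counts are arbitrary.

```python
# Toy operator micro-benchmark: time a matmul and report achieved TFLOP/s.
import time
import torch

def bench_matmul(m=2048, k=2048, n=2048, iters=10, device=None):
    device = device or ("cuda" if torch.cuda.is_available() else "cpu")
    dtype = torch.float16 if device == "cuda" else torch.float32
    a = torch.randn(m, k, device=device, dtype=dtype)
    b = torch.randn(k, n, device=device, dtype=dtype)
    for _ in range(3):                       # warm-up iterations
        a @ b
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    if device == "cuda":
        torch.cuda.synchronize()             # wait for queued kernels before stopping the clock
    elapsed = (time.perf_counter() - start) / iters
    tflops = 2 * m * k * n / elapsed / 1e12  # 2*m*k*n FLOPs per matmul
    print(f"{device}: {elapsed * 1e3:.2f} ms/iter, {tflops:.2f} TFLOP/s")

bench_matmul()
```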