Image Captioning with Context-Aware Auxiliary Guidance
Image captioning is a challenging computer vision task that aims to
generate a natural language description of an image. Most recent research
follows the encoder-decoder framework, which depends heavily on previously
generated words for the current prediction. Such methods cannot effectively
exploit future predicted information to learn complete semantics. In this
paper, we propose a Context-Aware Auxiliary Guidance (CAAG) mechanism that
guides the captioning model to perceive global contexts. Built on top of the
captioning model, CAAG performs semantic attention that selectively
concentrates on useful information in the global predictions to reproduce the
current generation. To validate the adaptability of the method, we apply CAAG
to three popular captioners, and our proposal achieves competitive performance
on the challenging Microsoft COCO image captioning benchmark, e.g. a 132.2
CIDEr-D score on the Karpathy split and a 130.7 CIDEr-D (c40) score on the
official online evaluation server.
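To make the idea of semantic attention over global predictions concrete, the following is a minimal, hypothetical sketch: a first decoding pass produces a full "global" caption, and at each step a query derived from the current decoder state attends over the embeddings of that global prediction. The class name, dimensions, and fusion strategy are illustrative assumptions, not the authors' actual CAAG implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticAttention(nn.Module):
    """Toy attention over globally predicted word embeddings (illustrative only)."""

    def __init__(self, hidden_dim: int, embed_dim: int):
        super().__init__()
        self.query_proj = nn.Linear(hidden_dim, embed_dim)

    def forward(self, decoder_state, global_word_embeds):
        # decoder_state:      (batch, hidden_dim) current decoding state
        # global_word_embeds: (batch, seq_len, embed_dim) embeddings of the
        #                     full first-pass ("global") prediction
        query = self.query_proj(decoder_state).unsqueeze(1)            # (B, 1, E)
        scores = torch.bmm(query, global_word_embeds.transpose(1, 2))  # (B, 1, T)
        weights = F.softmax(scores, dim=-1)
        context = torch.bmm(weights, global_word_embeds).squeeze(1)    # (B, E)
        return context, weights.squeeze(1)

# Usage: the context vector is fused with the decoder state so that re-predicting
# the current word can see "future" words from the first pass.
attn = SemanticAttention(hidden_dim=512, embed_dim=300)
context, weights = attn(torch.randn(4, 512), torch.randn(4, 20, 300))
```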
ViCo: Engaging Video Comment Generation with Human Preference Rewards
Engaging video comments play an important role in video social media, as they
are the carrier of feelings, thoughts, or humor of the audience. Preliminary
works have made an initial exploration of video comment generation by adopting
caption-style encoder-decoder models. However, comment generation presents some
unique challenges distinct from caption generation, which makes these methods
somewhat less effective at generating engaging comments. In contrast to the
objective and descriptive nature of captions, comments tend to be inherently
subjective, making it hard to quantify and evaluate the engagement of comments.
Furthermore, the scarcity of truly engaging comments brings difficulty to
collecting enough high-quality training examples. In this paper, we propose
ViCo with three novel designs to tackle the above challenges for generating
engaging Video Comments. Firstly, to quantify the engagement of comments, we
utilize the number of "likes" each comment receives as a proxy of human
preference after an appropriate debiasing procedure. Secondly, to automatically
evaluate the engagement of comments, we train a reward model to align its
judgment to the above proxy. Our user studies indicate that this reward model
effectively aligns with human judgments. Lastly, to alleviate the scarcity of
high-quality comments, an initial generator is trained on readily available but
noisy data to generate comments. Then the reward model is employed to offer
feedback on the generated comments, thus optimizing the initial generator. To
facilitate the research of video commenting, we collect a large video
comment dataset (ViCo-20k) with rich metadata from a popular video website.
Experiments on ViCo-20k show that the comments generated by our ViCo model
exhibit the best performance in terms of both quantitative and qualitative
results, particularly when engagement is considered.
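As a rough illustration of how a like-based preference proxy could train such a reward model, here is a hypothetical Bradley-Terry-style pairwise objective: for two comments on the same video, the one with the higher debiased like count is treated as preferred. The model structure and feature shapes are assumptions, not ViCo's actual training code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CommentRewardModel(nn.Module):
    """Toy reward model scoring a (video, comment) pair (illustrative only)."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, video_feat, comment_feat):
        # video_feat, comment_feat: (batch, dim) pooled features
        return self.score(torch.cat([video_feat, comment_feat], dim=-1)).squeeze(-1)

def pairwise_reward_loss(model, video_feat, comment_a, comment_b, a_preferred):
    # a_preferred: (batch,) bool, True if comment_a has the higher debiased like count
    r_a, r_b = model(video_feat, comment_a), model(video_feat, comment_b)
    margin = torch.where(a_preferred, r_a - r_b, r_b - r_a)
    return -F.logsigmoid(margin).mean()  # the preferred comment should score higher
```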
OFAR: A Multimodal Evidence Retrieval Framework for Illegal Live-streaming Identification
Illegal live-streaming identification, which aims to help live-streaming
platforms immediately recognize illegal behaviors in a live stream, such as
selling precious and endangered animals, plays a crucial role in purifying
the online environment. Traditionally, a live-streaming platform needs to
employ professionals to manually identify potentially illegal live streams.
Specifically, a professional needs to search a large-scale knowledge database
for related evidence to evaluate whether a given live-streaming clip contains
illegal behavior, which is time-consuming and laborious. To address this
issue, in this work we propose a multimodal evidence retrieval system, named
OFAR, to facilitate illegal live-streaming identification. OFAR consists of
three modules: Query Encoder, Document Encoder, and MaxSim-based Contrastive
Late Intersection. Both the query encoder and the document encoder are
implemented with the advanced OFA encoder, which is pretrained on a
large-scale multimodal dataset. In the last module, we introduce contrastive
learning on the basis of the MaxSim-based late intersection to enhance the
model's capability for query-document matching. The proposed framework
achieves significant improvements on our industrial dataset TaoLive,
demonstrating the effectiveness of our scheme.
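For readers unfamiliar with MaxSim-style scoring, the sketch below shows the general ColBERT-style late-interaction pattern together with an in-batch contrastive loss. The function names, temperature, and exact objective are illustrative assumptions and not necessarily how OFAR combines the two.

```python
import torch
import torch.nn.functional as F

def maxsim_score(query_tokens: torch.Tensor, doc_tokens: torch.Tensor) -> torch.Tensor:
    # query_tokens: (num_q, dim), doc_tokens: (num_d, dim) token-level embeddings
    q = F.normalize(query_tokens, dim=-1)
    d = F.normalize(doc_tokens, dim=-1)
    sim = q @ d.T                        # token-to-token cosine similarities
    return sim.max(dim=-1).values.sum()  # best document match per query token, summed

def contrastive_retrieval_loss(query_batch, doc_batch, temperature: float = 0.05):
    # query_batch[i] is assumed to match doc_batch[i]; other pairs serve as negatives
    scores = torch.stack([
        torch.stack([maxsim_score(q, d) for d in doc_batch]) for q in query_batch
    ]) / temperature
    labels = torch.arange(len(query_batch))
    return F.cross_entropy(scores, labels)
```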
CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment
Pre-trained image-text models, such as CLIP, have demonstrated the strong
power of vision-language representations learned from large-scale
web-collected image-text data. In light of the well-learned visual features,
some existing works transfer image representations to the video domain and
achieve good results. However, how to utilize an image-language pre-trained
model (e.g., CLIP) for video-language pre-training (post-pretraining) remains
underexplored. In this paper, we investigate two questions: 1) what factors
hinder post-pretraining CLIP from further improving performance on
video-language tasks, and 2) how can the impact of these factors be
mitigated? Through a series of comparative experiments and analyses, we find
that the data scale and the domain gap between language sources have a great
impact. Motivated by these findings, we propose an Omnisource Cross-modal
Learning method equipped with a Video Proxy mechanism on the basis of CLIP,
namely CLIP-ViP. Extensive results show that our approach improves the
performance of CLIP on video-text retrieval by a large margin. Our model also
achieves SOTA results on a variety of datasets, including MSR-VTT, DiDeMo,
LSMDC, and ActivityNet. We will release our code and pre-trained CLIP-ViP
models at https://github.com/microsoft/XPretrain/tree/main/CLIP-ViP.
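The abstract does not detail the Video Proxy mechanism, but one simple way to realize the general idea is to prepend learnable proxy tokens to the patch tokens of all frames so that a frame-wise image encoder can aggregate video-level information. The wrapper below is a minimal sketch under that assumption; it is not CLIP-ViP's actual architecture.

```python
import torch
import torch.nn as nn

class VideoProxyWrapper(nn.Module):
    """Learnable proxy tokens prepended to flattened frame patch tokens (sketch)."""

    def __init__(self, encoder: nn.Module, dim: int, num_proxy: int = 4):
        super().__init__()
        self.encoder = encoder                            # stands in for CLIP's ViT blocks
        self.proxy = nn.Parameter(torch.randn(1, num_proxy, dim) * 0.02)

    def forward(self, frame_tokens):
        # frame_tokens: (batch, num_frames, num_patches, dim) patch embeddings
        b, t, p, d = frame_tokens.shape
        tokens = frame_tokens.reshape(b, t * p, d)        # one sequence over all frames
        proxy = self.proxy.expand(b, -1, -1)
        out = self.encoder(torch.cat([proxy, tokens], dim=1))
        return out[:, : self.proxy.shape[1]].mean(dim=1)  # pooled video representation

# Usage with a generic transformer encoder as a placeholder:
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
wrapper = VideoProxyWrapper(nn.TransformerEncoder(layer, num_layers=2), dim=512)
video_emb = wrapper(torch.randn(2, 8, 49, 512))           # -> (2, 512)
```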
Pave the Way to Grasp Anything: Transferring Foundation Models for Universal Pick-Place Robots
Improving the generalization capabilities of general-purpose robotic agents
has long been a significant challenge actively pursued by research communities.
Existing approaches often rely on collecting large-scale real-world robotic
data, such as the RT-1 dataset. However, these approaches typically suffer
from low efficiency, limiting their capability in open-domain scenarios with
new objects and diverse backgrounds. In this paper, we propose a novel paradigm
that effectively leverages language-grounded segmentation masks generated by
state-of-the-art foundation models, to address a wide range of pick-and-place
robot manipulation tasks in everyday scenarios. By integrating precise
semantics and geometries conveyed from masks into our multi-view policy model,
our approach can perceive accurate object poses and enable sample-efficient
learning. Moreover, such a design facilitates effective generalization to
grasping new objects whose shapes resemble those observed during training. Our approach
consists of two distinct steps. First, we introduce a series of foundation
models to accurately ground natural language demands across multiple tasks.
Second, we develop a Multi-modal Multi-view Policy Model that incorporates
inputs such as RGB images, semantic masks, and robot proprioception states to
jointly predict precise and executable robot actions. Extensive real-world
experiments conducted on a Franka Emika robot arm validate the effectiveness of
our proposed paradigm. Real-world demos are available on YouTube
(https://www.youtube.com/watch?v=1m9wNzfp_4E) and Bilibili
(https://www.bilibili.com/video/BV178411Z7H2/).
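As an illustration of how RGB images, segmentation masks, and proprioception might be fused in a multi-view policy, here is a deliberately simplified sketch; the encoder, layer sizes, and action parameterization are assumptions rather than the paper's actual model.

```python
import torch
import torch.nn as nn

class MultiViewPickPlacePolicy(nn.Module):
    """Toy multi-modal, multi-view policy head (illustrative only)."""

    def __init__(self, num_views: int = 3, proprio_dim: int = 8, action_dim: int = 7):
        super().__init__()
        self.view_encoder = nn.Sequential(   # shared over RGB (3 ch) + mask (1 ch) per view
            nn.Conv2d(4, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(64 * num_views + proprio_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim),      # e.g., end-effector pose delta + gripper command
        )

    def forward(self, rgbs, masks, proprio):
        # rgbs: (batch, views, 3, H, W); masks: (batch, views, 1, H, W)
        # proprio: (batch, proprio_dim) joint / gripper state
        x = torch.cat([rgbs, masks], dim=2)
        feats = [self.view_encoder(x[:, v]) for v in range(x.shape[1])]
        return self.head(torch.cat(feats + [proprio], dim=-1))
```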
AlphaBlock: Embodied Finetuning for Vision-Language Reasoning in Robot Manipulation
We propose a novel framework for learning high-level cognitive capabilities
in robot manipulation tasks, such as making a smiley face using building
blocks. These tasks often involve complex multi-step reasoning, presenting
significant challenges due to the limited paired data connecting human
instructions (e.g., making a smiley face) and robot actions (e.g., end-effector
movement). Existing approaches alleviate this challenge by adopting an
open-loop paradigm that decomposes high-level instructions into simple
sub-task plans and executes them step-by-step using low-level control models.
However, these approaches lack instant observations during multi-step
reasoning, leading to sub-optimal results. To address this issue, we propose
to automatically collect a cognitive robot dataset using Large Language
Models (LLMs). The resulting dataset, AlphaBlock, consists of 35
comprehensive high-level tasks with multi-step text plans and paired
observation sequences. To enable efficient data acquisition, we employ
elaborated multi-round prompt designs that effectively reduce the burden of
extensive human involvement. We further propose a closed-loop multi-modal
embodied planning model that autoregressively generates plans by taking image
observations as input. To facilitate effective learning, we leverage
MiniGPT-4 with a frozen visual encoder and LLM, and finetune an additional
vision adapter and Q-Former to enable fine-grained spatial perception for
manipulation tasks. We conduct experiments to verify the superiority of our
approach over existing open- and closed-loop methods, achieving significant
increases in success rate of 21.4% and 14.5% over ChatGPT- and GPT-4-based
baselines on robot tasks. Real-world demos are available at
https://www.youtube.com/watch?v=ayAzID1_qQk.
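The closed-loop flavor of the planner can be summarized by the control flow below: after each executed sub-task, the fresh image observation is fed back into the autoregressive planner, unlike an open-loop pipeline that fixes the whole plan up front. The planner, controller, and environment interfaces are hypothetical placeholders, not the AlphaBlock API.

```python
def closed_loop_plan_and_execute(planner, controller, env, instruction, max_steps=10):
    """Illustrative closed-loop planning loop (all interfaces are placeholders)."""
    observation = env.reset()
    history = []
    for _ in range(max_steps):
        # Generate the next sub-task conditioned on the instruction, the plan
        # history, and the *current* image observation.
        step = planner(instruction, history, observation)
        if step == "<done>":
            break
        observation = controller(env, step)  # low-level control executes the sub-task
        history.append(step)
    return history
```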
SINC: Self-Supervised In-Context Learning for Vision-Language Tasks
Large Pre-trained Transformers exhibit an intriguing capacity for in-context
learning. Without gradient updates, these models can rapidly construct new
predictors from demonstrations presented in the inputs. Recent works promote
this ability in the vision-language domain by incorporating visual information
into large language models that can already make in-context predictions.
However, these methods could inherit issues in the language domain, such as
template sensitivity and hallucination. Also, the scale of these language
models imposes a significant computational demand, making learning and
operating these models resource-intensive. To this end, we raise a question:
"How can we enable in-context learning without relying on the intrinsic
in-context ability of large language models?" To answer it, we propose a
succinct and general framework, Self-supervised IN-Context learning (SINC),
that introduces a meta-model to learn on self-supervised prompts consisting of
tailored demonstrations. The learned models can be transferred to downstream
tasks for making in-context predictions on-the-fly. Extensive experiments show
that SINC outperforms gradient-based methods in various vision-language tasks
under few-shot settings. Furthermore, the designs of SINC help us investigate
the benefits of in-context learning across different tasks, and the analysis
further reveals the essential components for the emergence of in-context
learning in the vision-language domain.
Comment: Accepted by ICCV 2023; Camera Ready Version.
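To make the meta-model idea more concrete, the sketch below shows one way a small transformer could make in-context predictions from a prompt of demonstration pairs without relying on a large language model. The architecture and shapes are illustrative assumptions and are far simpler than SINC itself.

```python
import torch
import torch.nn as nn

class InContextMetaModel(nn.Module):
    """Toy meta-model predicting from demonstration pairs (illustrative only)."""

    def __init__(self, dim: int = 256, num_classes: int = 10, nhead: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.label_embed = nn.Embedding(num_classes, dim)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, demo_feats, demo_labels, query_feat):
        # demo_feats: (batch, k, dim); demo_labels: (batch, k) long; query_feat: (batch, dim)
        demos = demo_feats + self.label_embed(demo_labels)   # fuse input and label
        seq = torch.cat([demos, query_feat.unsqueeze(1)], dim=1)
        return self.classifier(self.encoder(seq)[:, -1])     # predict from the query slot
```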