Learning Lightweight Object Detectors via Multi-Teacher Progressive Distillation
Resource-constrained perception systems such as edge computing and
vision-for-robotics require vision models to be both accurate and lightweight
in computation and memory usage. While knowledge distillation is a proven
strategy to enhance the performance of lightweight classification models,
applying it to structured outputs such as object detection and instance
segmentation remains complicated, owing to the variability of the outputs and
the complex internal network modules involved in the distillation process. In this
paper, we propose a simple yet surprisingly effective sequential approach to
knowledge distillation that progressively transfers the knowledge of a set of
teacher detectors to a given lightweight student. To distill knowledge from a
highly accurate but complex teacher model, we construct a sequence of teachers
to help the student gradually adapt. Our progressive strategy can be easily
combined with existing detection distillation mechanisms to consistently
maximize student performance in various settings. To the best of our knowledge,
we are the first to successfully distill knowledge from Transformer-based
teacher detectors to convolution-based students, boosting the performance of
ResNet-50-based RetinaNet from 36.5% to 42.0% AP and of Mask R-CNN from 38.2%
to 42.5% AP on the MS COCO benchmark.
Comment: ICML 2023
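
To make the progressive strategy concrete, below is a minimal sketch of sequential multi-teacher distillation in PyTorch. It is not the authors' implementation: the toy classifiers, the plain temperature-scaled KL distillation loss, and all hyperparameters are illustrative assumptions, and the paper applies the idea to detection heads rather than classification logits. What the sketch does capture is the curriculum: the student never jumps directly to the strongest teacher but adapts through a sequence of increasingly capable ones.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def distill_step(student, teacher, x, temperature=2.0):
    """One distillation step: match the student's softened predictions
    to the frozen teacher's on the same batch."""
    with torch.no_grad():
        t_logits = teacher(x)
    s_logits = student(x)
    return F.kl_div(
        F.log_softmax(s_logits / temperature, dim=-1),
        F.softmax(t_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

def progressive_distillation(student, teachers, batches, epochs_per_teacher=3):
    """Distill from each teacher in sequence (ordered weakest -> strongest)
    so the student adapts gradually instead of jumping straight to the
    most complex teacher."""
    opt = torch.optim.AdamW(student.parameters(), lr=1e-4)
    for teacher in teachers:
        teacher.eval()  # teachers are frozen; only the student is updated
        for _ in range(epochs_per_teacher):
            for x in batches:
                loss = distill_step(student, teacher, x)
                opt.zero_grad()
                loss.backward()
                opt.step()

# Toy usage: three increasingly wide teachers, one small student.
teachers = [nn.Sequential(nn.Linear(16, w), nn.ReLU(), nn.Linear(w, 10))
            for w in (32, 64, 128)]
student = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 10))
batches = [torch.randn(8, 16) for _ in range(20)]
progressive_distillation(student, teachers, batches)
```
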
Aligning Large Multimodal Models with Factually Augmented RLHF
Large Multimodal Models (LMMs) are built across modalities, and misalignment
between the two modalities can result in "hallucination": generating textual
outputs that are not grounded in the multimodal information in context.
To address the multimodal misalignment issue, we adapt Reinforcement
Learning from Human Feedback (RLHF) from the text domain to the task of
vision-language alignment, where human annotators are asked to compare two
responses and pinpoint the more hallucinated one, and the vision-language model
is trained to maximize the simulated human rewards. We propose a new alignment
algorithm called Factually Augmented RLHF that augments the reward model with
additional factual information such as image captions and ground-truth
multi-choice options, which alleviates the reward hacking phenomenon in RLHF
and further improves the performance. We also enhance the GPT-4-generated
training data (for vision instruction tuning) with previously available
human-written image-text pairs to improve the general capabilities of our
model. To evaluate the proposed approach in real-world scenarios, we develop a
new evaluation benchmark MMHAL-BENCH with a special focus on penalizing
hallucinations. As the first LMM trained with RLHF, our approach achieves a
remarkable improvement on the LLaVA-Bench dataset, reaching 94% of the
performance level of the text-only GPT-4 (whereas the previous best methods
reach only the 87% level), and a 60% improvement on MMHAL-BENCH over other
baselines. We open-source our code, model, and data at
https://llava-rlhf.github.io.
Comment: Preprint
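
As a concrete illustration of the factually augmented reward model, here is a minimal sketch under the standard pairwise (Bradley-Terry) preference loss. It is not the released LLaVA-RLHF code: the bag-of-embeddings encoder standing in for an LMM backbone, the token shapes, and the `fact_ids` plumbing are all illustrative assumptions. The idea it shows is the one the abstract describes: concatenating factual context (e.g. a ground-truth image caption or multi-choice options) to the reward model's input so hallucinated responses are easier to score down, mitigating reward hacking.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FactualRewardModel(nn.Module):
    def __init__(self, vocab_size=32000, dim=256):
        super().__init__()
        # Stand-in for an LMM backbone: mean-pooled token embeddings.
        self.embed = nn.EmbeddingBag(vocab_size, dim)
        self.score = nn.Linear(dim, 1)  # scalar reward head

    def forward(self, prompt_ids, response_ids, fact_ids):
        # Factual augmentation: ground-truth captions / options are
        # concatenated with the prompt and response before scoring.
        ids = torch.cat([prompt_ids, fact_ids, response_ids], dim=1)
        return self.score(self.embed(ids)).squeeze(-1)

def preference_loss(rm, prompt, chosen, rejected, facts):
    """Standard Bradley-Terry pairwise loss: push the reward of the
    less-hallucinated (chosen) response above the rejected one."""
    r_chosen = rm(prompt, chosen, facts)
    r_rejected = rm(prompt, rejected, facts)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy usage with random token ids (batch of 4 comparison pairs).
rm = FactualRewardModel()
prompt = torch.randint(0, 32000, (4, 12))
chosen = torch.randint(0, 32000, (4, 20))
rejected = torch.randint(0, 32000, (4, 20))
facts = torch.randint(0, 32000, (4, 16))  # e.g. tokenized image caption
loss = preference_loss(rm, prompt, chosen, rejected, facts)
loss.backward()
```
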