LLM Fine-Tuning Using a Multimodal Reward Model Trained with Ground Truth

Abstract

This disclosure describes techniques to improve the accuracy of large language model (LLM) responses by using a trained multimodal reward model (RM) to fine-tune the LLM. In contrast to traditional techniques that train the RM without consideration of the ground truth, the RM is trained using the prompt (including image and/or other forms of input), the response, and the ground truth. The trained RM can be applied to score the LLM's response to a prompt in relation to the ground truth, and the score can be used to fine-tune the LLM to improve the accuracy and form of its responses to queries of a similar form (e.g., those including image and/or other forms of input). The training data for the RM can include positive examples (responses consistent with the ground truth) and negative examples (responses that deviate from the ground truth) generated by a generative model. Training an RM with ground truth and using the trained RM to score LLM responses can improve the accuracy of those responses.
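The disclosure does not include code; the following is a minimal sketch of the idea, assuming a PyTorch-style setup. The toy text and image encoders, the GroundTruthRewardModel class, and the train_reward_model helper are hypothetical names introduced here for illustration, standing in for a real multimodal backbone. The sketch shows the key difference from a conventional RM: the model is conditioned on the ground truth in addition to the prompt, image, and response, and is trained on positive/negative examples labeled by their consistency with the ground truth.

# A minimal sketch under the assumptions stated above; all names are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 64

class ToyTextEncoder(nn.Module):
    """Hashed bag-of-words embedding; stands in for a real text encoder."""
    def __init__(self, vocab_size=1000, dim=EMBED_DIM):
        super().__init__()
        self.vocab_size = vocab_size
        self.emb = nn.EmbeddingBag(vocab_size, dim)

    def forward(self, texts):
        ids = [torch.tensor([hash(w) % self.vocab_size for w in t.split()] or [0])
               for t in texts]
        offsets = torch.tensor([0] + [len(x) for x in ids[:-1]]).cumsum(0)
        return self.emb(torch.cat(ids), offsets)

class ToyImageEncoder(nn.Module):
    """Global-average-pools the image and projects it; stands in for a vision encoder."""
    def __init__(self, dim=EMBED_DIM):
        super().__init__()
        self.proj = nn.Linear(3, dim)

    def forward(self, images):          # images: (batch, 3, H, W)
        return self.proj(images.mean(dim=(2, 3)))

class GroundTruthRewardModel(nn.Module):
    """Scores (prompt, image, response) conditioned on the ground truth, unlike a
    conventional RM that sees only the prompt and the response."""
    def __init__(self, dim=EMBED_DIM):
        super().__init__()
        self.text = ToyTextEncoder(dim=dim)
        self.image = ToyImageEncoder(dim=dim)
        self.head = nn.Sequential(nn.Linear(4 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, prompts, images, responses, ground_truths):
        feats = torch.cat([self.text(prompts), self.image(images),
                           self.text(responses), self.text(ground_truths)], dim=-1)
        return self.head(feats).squeeze(-1)   # higher score = closer to ground truth

def train_reward_model(rm, batches, epochs=3, lr=1e-3):
    """Trains the RM on positive examples (label 1.0, consistent with the ground
    truth) and negative examples (label 0.0, deviating from the ground truth)."""
    opt = torch.optim.Adam(rm.parameters(), lr=lr)
    for _ in range(epochs):
        for prompts, images, responses, gts, labels in batches:
            loss = F.binary_cross_entropy_with_logits(
                rm(prompts, images, responses, gts), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return rm

if __name__ == "__main__":
    rm = GroundTruthRewardModel()
    images = torch.rand(2, 3, 8, 8)
    batch = (["what animal is shown"] * 2,                  # prompt (text part)
             images,                                        # prompt (image part)
             ["a tabby cat on a sofa", "a small dog"],      # candidate responses
             ["a tabby cat on a sofa"] * 2,                 # ground truth
             torch.tensor([1.0, 0.0]))                      # positive / negative label
    train_reward_model(rm, [batch])
    with torch.no_grad():
        rewards = rm(*batch[:4])    # per-response scores, usable as fine-tuning rewards
    print(rewards)

In a full pipeline, the per-response scores from the trained RM would be fed back as the reward signal (for example, via a policy-gradient or reward-weighted fine-tuning step) to steer the LLM toward responses that are consistent with the ground truth; the specific fine-tuning objective is not prescribed by the abstract above.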


This paper was published in Technical Disclosure Commons.


Licence: http://creativecommons.org/licenses/by/4.0/