The last few years have witnessed great success on image generation, which
has crossed the acceptance thresholds of aesthetics, making it directly
applicable to personal and commercial applications. However, images, especially
in marketing and advertising applications, are often created as a means to an
end as opposed to just aesthetic concerns. The goal can be increasing sales,
getting more clicks, likes, or image sales (in the case of stock businesses).
Therefore, the generated images need to perform well on these key performance
indicators (KPIs), in addition to being aesthetically good. In this paper, we
make the first endeavor to answer the question of "How can one infuse the
knowledge of the end-goal within the image generation process itself to create
not just better-looking images but also "better-performing'' images?''. We
propose BoigLLM, an LLM that understands both image content and user behavior.
BoigLLM knows how an image should look to get a certain required KPI. We show
that BoigLLM outperforms 13x larger models such as GPT-3.5 and GPT-4 in this
task, demonstrating that while these state-of-the-art models can understand
images, they lack information on how these images perform in the real world. To
generate actual pixels of behavior-conditioned images, we train a
diffusion-based model (BoigSD) to align with a proposed BoigLLM-defined reward.
We show the performance of the overall pipeline on two datasets covering two
different behaviors: a stock dataset with the number of forward actions as the
KPI and a dataset containing tweets with the total likes as the KPI, denoted as
BoigBench. To advance research in the direction of utility-driven image
generation and understanding, we release BoigBench, a benchmark dataset
containing 168 million enterprise tweets with their media, brand account names,
time of post, and total likes