Recommending Themes for Ad Creative Design via Visual-Linguistic Representations
There is a perennial need in the online advertising industry to refresh ad
creatives, i.e., images and text used for enticing online users towards a
brand. Such refreshes are required to reduce the likelihood of ad fatigue among
online users, and to incorporate insights from other successful campaigns in
related product categories. Given a brand, coming up with themes for a new ad
is a painstaking and time-consuming process for creative strategists.
Strategists typically draw inspiration from the images and text used for past
ad campaigns, as well as world knowledge on the brands. To automatically infer
ad themes via such multimodal sources of information in past ad campaigns, we
propose a theme (keyphrase) recommender system for ad creative strategists. The
theme recommender is based on aggregating results from a visual question
answering (VQA) task, which ingests the following: (i) ad images, (ii) text
associated with the ads as well as Wikipedia pages on the brands in the ads,
and (iii) questions around the ad. We leverage transformer-based cross-modality
encoders to train visual-linguistic representations for our VQA task. We study
two formulations for the VQA task along the lines of classification and
ranking; via experiments on a public dataset, we show that cross-modal
representations lead to significantly better classification accuracy and
ranking precision-recall metrics. Cross-modal representations show better
performance compared to separate image and text representations. In addition,
the use of multimodal information shows a significant lift over using only
textual or visual information.
Comment: 7 pages, 8 figures, 2 tables, accepted by The Web Conference 2020
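For concreteness, here is a minimal PyTorch sketch of the two VQA formulations mentioned above: classifying the answer into a fixed theme vocabulary, and scoring candidate theme keyphrases against the joint visual-linguistic representation for ranking. The small CrossModalEncoder, the feature dimensions, and all class and function names are illustrative assumptions; the paper builds on pretrained transformer-based cross-modality encoders rather than the toy encoder shown here.

```python
# Sketch (not the paper's implementation) of the two VQA formulations:
# (i) classification over a fixed theme vocabulary, and
# (ii) ranking of candidate theme keyphrases against the joint representation.
import torch
import torch.nn as nn


class CrossModalEncoder(nn.Module):
    """Stand-in fusion encoder: projects image-region features and text-token
    embeddings into one sequence and returns a pooled [CLS]-style vector."""

    def __init__(self, img_dim=2048, txt_dim=768, hidden=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)
        self.txt_proj = nn.Linear(txt_dim, hidden)
        self.cls = nn.Parameter(torch.randn(1, 1, hidden))
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, img_feats, txt_feats):
        # img_feats: (B, R, img_dim) region features; txt_feats: (B, T, txt_dim)
        b = img_feats.size(0)
        tokens = torch.cat(
            [self.cls.expand(b, -1, -1), self.img_proj(img_feats), self.txt_proj(txt_feats)],
            dim=1,
        )
        return self.encoder(tokens)[:, 0]  # pooled joint representation (B, hidden)


class ThemeRecommender(nn.Module):
    def __init__(self, num_themes=1000, hidden=512):
        super().__init__()
        self.encoder = CrossModalEncoder(hidden=hidden)
        # Formulation 1: classify the answer into a closed theme vocabulary.
        self.cls_head = nn.Linear(hidden, num_themes)
        # Formulation 2: score each candidate keyphrase embedding against the
        # joint representation and rank candidates by that score.
        self.rank_head = nn.Bilinear(hidden, hidden, 1)

    def classify(self, img_feats, txt_feats):
        return self.cls_head(self.encoder(img_feats, txt_feats))  # (B, num_themes)

    def rank(self, img_feats, txt_feats, cand_embs):
        # cand_embs: (B, K, hidden) embeddings of K candidate theme keyphrases
        joint = self.encoder(img_feats, txt_feats)                       # (B, hidden)
        joint = joint.unsqueeze(1).expand(-1, cand_embs.size(1), -1)
        return self.rank_head(joint.contiguous(), cand_embs).squeeze(-1)  # (B, K)


if __name__ == "__main__":
    model = ThemeRecommender(num_themes=50)
    img = torch.randn(2, 36, 2048)   # detected regions of two ad images
    txt = torch.randn(2, 40, 768)    # ad text + Wikipedia snippet + question tokens
    cands = torch.randn(2, 10, 512)  # 10 candidate theme keyphrases per ad
    print(model.classify(img, txt).shape)      # torch.Size([2, 50])
    print(model.rank(img, txt, cands).shape)   # torch.Size([2, 10])
```

The classification head assumes a closed set of themes, while the ranking head can score arbitrary candidate keyphrases, which is what distinguishes the two formulations studied in the paper.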
Matching images and text with multi-modal tensor fusion and re-ranking
A major challenge in matching images and text is that they have intrinsically different data distributions and feature representations. Most existing approaches are based either on embedding or on classification: the first maps image and text instances into a common embedding space for distance measurement, while the second treats image-text matching as a binary classification problem. Neither approach, however, balances matching accuracy and model complexity well. We propose a novel framework that achieves remarkable matching performance with acceptable model complexity. Specifically, in the training stage, we propose a novel Multi-modal Tensor Fusion Network (MTFN) to explicitly learn an accurate image-text similarity function with rank-based tensor fusion, rather than seeking a common embedding space for each image-text instance. Then, during testing, we deploy a generic Cross-modal Re-ranking (RR) scheme for refinement without requiring any additional training. Extensive experiments on two datasets demonstrate that our MTFN-RR consistently achieves state-of-the-art matching performance with much lower time complexity.
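To make the two-stage idea concrete, the PyTorch sketch below pairs a learned image-text similarity (a fused pairwise score trained with a hinge ranking loss, rather than a distance in a shared embedding space) with a simple test-time cross-modal re-ranking pass. The bilinear fusion layer, the reciprocal-rank bonus, and all names are simplified stand-ins assumed for illustration, not the exact MTFN/RR design.

```python
# Sketch in the spirit of the approach above: a fused image-text similarity
# trained with a ranking loss, plus a test-time cross-modal re-ranking pass.
# Both components are simplified stand-ins, not the paper's MTFN/RR.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FusionSimilarity(nn.Module):
    """Scores an (image, text) pair directly via bilinear tensor fusion
    instead of measuring cosine distance in a shared embedding space."""

    def __init__(self, img_dim=2048, txt_dim=768, hidden=256):
        super().__init__()
        self.fuse = nn.Bilinear(img_dim, txt_dim, hidden)
        self.score = nn.Sequential(nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, img, txt):
        # img: (N, img_dim), txt: (N, txt_dim) -> similarity scores (N,)
        return self.score(self.fuse(img, txt)).squeeze(-1)


def triplet_rank_loss(model, img, pos_txt, neg_txt, margin=0.2):
    """Hinge ranking loss: matched pairs should outscore mismatched ones."""
    return F.relu(margin - model(img, pos_txt) + model(img, neg_txt)).mean()


def rerank_i2t(sim_matrix, k=10):
    """Cross-modal re-ranking for image-to-text retrieval: among the top-k
    texts of each image, boost texts that also rank that image highly."""
    i2t_rank = sim_matrix.argsort(dim=1, descending=True)  # per-image text order
    t2i_rank = sim_matrix.argsort(dim=0, descending=True)  # per-text image order
    refined = sim_matrix.clone()
    for i in range(sim_matrix.size(0)):
        for t in i2t_rank[i, :k]:
            # position of image i in text t's ranking (lower is better)
            pos = (t2i_rank[:, t] == i).nonzero(as_tuple=True)[0].item()
            refined[i, t] += 1.0 / (pos + 1)  # reciprocal-rank bonus
    return refined


if __name__ == "__main__":
    model = FusionSimilarity()
    imgs, txts = torch.randn(8, 2048), torch.randn(8, 768)
    loss = triplet_rank_loss(model, imgs, txts, txts.roll(1, dims=0))
    print("ranking loss:", loss.item())

    # Build the full pairwise similarity matrix for a small gallery, then re-rank.
    with torch.no_grad():
        sim = model(imgs.repeat_interleave(8, 0), txts.repeat(8, 1)).view(8, 8)
    print(rerank_i2t(sim, k=3).shape)  # torch.Size([8, 8])
```

Note that the re-ranking step needs no extra training: it only reuses the pairwise similarity matrix from both retrieval directions, which is what makes it a cheap test-time refinement.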