Poet: Product-oriented Video Captioner for E-commerce
In e-commerce, a growing number of user-generated videos are used for product
promotion. Generating video descriptions that narrate the user-preferred
product characteristics depicted in the video is vital for successful
promotion. Traditional video captioning methods, which focus on routinely
describing what exists and happens in a video, are not amenable to
product-oriented video captioning. To address this problem, we propose a
product-oriented video captioner framework, abbreviated as Poet. Poet first
represents the videos as product-oriented spatial-temporal graphs. Then, based
on the aspects of the video-associated product, we perform knowledge-enhanced
spatial-temporal inference on those graphs for capturing the dynamic change of
fine-grained product-part characteristics. The knowledge leveraging module in
Poet differs from the traditional design by performing knowledge filtering and
dynamic memory modeling. We show that Poet achieves consistent performance
improvements over previous methods in generation quality, product
aspect capture, and lexical diversity. Experiments are performed on two
product-oriented video captioning datasets, the buyer-generated fashion video
dataset (BFVD) and the fan-generated fashion video dataset (FFVD), both collected from
Mobile Taobao. We will release the desensitized datasets to promote further
investigations on both video captioning and general video analysis problems.
Comment: 10 pages, 3 figures, to appear in ACM MM 2020 proceedings