Multi-Modal Livestream Highlight Detection from Audio, Visual, and Language Data

Abstract

Livestreaming is the act of broadcasting live over the internet, and it allows viewer–host interaction via a text-based chat system. Video game livestreaming in particular is prevalent, where streamers host individual play sessions or present esports competitions. Although livestreaming is an emerging entertainment medium, it is already highly popular: every minute, around 1,900 hours of footage are livestreamed on Twitch.tv, currently the most popular video game livestreaming platform. Given this volume of content, it can be challenging for viewers to find the content they are most likely to enjoy. One solution is ‘highlight videos’, which can entertain users who did not watch a broadcast, e.g. due to a lack of awareness, availability, or willingness. Furthermore, livestream content creators can grow their audiences by using highlights to advertise their streams and engage casual followers. However, hand-generating these videos is laborious, so there is great value in developing automatic highlight detection methods. Video game streaming provides the viewer with a rich set of audio-visual data, conveying information about the game through game footage and about the streamer’s emotional state via webcam footage. Analysing both the game and the behaviour of broadcast personnel is therefore crucial for modelling the exciting aspects of livestreams. Furthermore, livestreaming offers a unique opportunity to understand the viewing experience through the text-based chat system. However, livestream data presents significant challenges, e.g. how to fuse multimodal data captured by different sources in uncontrolled, noisy conditions; deep learning models able to leverage such complex data are therefore appealing for highlight detection. This thesis explores the application of deep learning highlight detection models to the domain of livestreaming. Multimodal highlight detection methods are developed for personality-driven livestreams and esports broadcasts. The unique nature of livestream audience chat language is explored, and audience-based highlight detection methods are proposed. Finally, a model capable of handling all of these modalities in one system is presented.