21 research outputs found

    Video Analysis for Understanding Human Actions and Interactions

    Each time we act, our actions are conditioned not only by spatial information, e.g., the objects, people, and scene around us, but also temporally by the actions we have performed before. Indeed, we live in an evolving and dynamic world. To understand what a person is doing, we reason jointly over spatial and temporal information. Intelligent systems that interact with people and perform useful tasks will also require this ability. In light of this need, video analysis has become an essential field in computer vision in recent years, offering the community a wide range of tasks to solve. In this thesis, we make several contributions to the video analysis literature, exploring different tasks that aim to understand human actions and interactions.

    We begin with the challenging problem of human action anticipation, in which we seek to predict a person's action as early as possible, before it is completed. This task is critical for applications where machines must react to human actions. We introduce a novel approach that forecasts the most plausible future human motion by hallucinating motion representations.

    We then address the challenging problem of temporal moment localization: finding the temporal location of a natural-language query in a long untrimmed video. Although a query can describe anything happening within the video, the vast majority describe human actions. In contrast to propose-and-rank approaches, which generate or use predefined clips as candidates, we introduce a proposal-free approach that localizes the query by looking at the whole video at once. We also account for the subjectivity of temporal annotations and propose soft labels in the form of a categorical distribution centred on the annotated start and end. Building on this proposal-free architecture, we tackle temporal moment localization with a spatial-temporal graph. We find that one limitation of existing methods is that they ignore the spatial cues present in the video and the query, i.e., objects and people. We create six semantically meaningful nodes: three are fed with visual features of people, objects, and activities, and the other three capture the language-level relationships of "subject-object," "subject-verb," and "verb-object." A language-conditional message-passing algorithm captures the relationships between nodes and creates an improved representation of the activity, which a temporal graph then uses to determine the start and end of the query.

    Last, we study fine-grained opinion mining in video reviews in a multi-modal setting. Video is increasingly used as a source of guidance during shopping: people turn to video reviews to decide what, why, and where to buy. We tackle this problem using the three modalities inherently present in a video (audio, frames, and transcripts) to determine the most relevant aspect of the product under review and the reviewer's sentiment polarity towards that aspect. We propose an early fusion mechanism that combines the three modalities at the sentence level. It is a general framework that does not place any strict constraints on the individual encodings of the audio, video frames, and transcripts.
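    To make the soft-labelling idea concrete, the sketch below (Python/PyTorch, not the thesis code) replaces one-hot start/end targets with a categorical distribution centred on the annotated indices, obtained from a discretized Gaussian; the function names and the width parameter sigma are illustrative assumptions, not details from the abstract.

```python
# Minimal sketch: soft labels for temporal moment localization.
# Instead of one-hot targets at the annotated start/end steps, probability
# mass is spread over neighbouring time steps with a discretized Gaussian,
# reflecting the subjectivity of temporal annotations.
import torch

def soft_boundary_label(num_steps: int, annotated_idx: int, sigma: float = 2.0) -> torch.Tensor:
    """Categorical distribution over time steps, centred on the annotation."""
    steps = torch.arange(num_steps, dtype=torch.float32)
    logits = -((steps - annotated_idx) ** 2) / (2.0 * sigma ** 2)
    return torch.softmax(logits, dim=0)  # sums to 1 over the video length

def soft_localization_loss(start_logits, end_logits, start_idx, end_idx, sigma=2.0):
    """Cross-entropy between predicted boundary distributions and soft targets."""
    t = start_logits.shape[-1]
    start_target = soft_boundary_label(t, start_idx, sigma)
    end_target = soft_boundary_label(t, end_idx, sigma)
    log_p_start = torch.log_softmax(start_logits, dim=-1)
    log_p_end = torch.log_softmax(end_logits, dim=-1)
    return -(start_target * log_p_start).sum() - (end_target * log_p_end).sum()

# Example: a 128-step video with the moment annotated at steps [40, 75].
logits = torch.randn(2, 128)
loss = soft_localization_loss(logits[0], logits[1], start_idx=40, end_idx=75)
```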

    Leveraging the multimodal information from video content for video recommendation

    Since the popularisation of media streaming, video streaming services have been continually acquiring new video content to mine its potential profit. Newly added content must therefore be handled appropriately so that it can be recommended to suitable users. In this dissertation, the new-item cold-start problem is addressed by exploring the potential of various deep learning features for video recommendation. The deep learning features investigated capture visual appearance, as well as audio and motion information, from the video content. Different fusion methods are also explored to evaluate how well these feature modalities can be combined to fully exploit the complementary information they capture. Experiments on a real-world video dataset for movie recommendations show that deep learning features outperform hand-crafted features. In particular, recommendations generated with deep learning audio features and action-centric deep learning features are superior to those based on Mel-frequency cepstral coefficients (MFCC) and state-of-the-art improved dense trajectory (iDT) features. It was also found that combining the various deep learning features with textual metadata and hand-crafted features provides a significant improvement in recommendations, compared to combining only deep learning and hand-crafted features.
    Dissertation (MEng (Computer Engineering)), University of Pretoria, 2021. Funding: The MultiChoice Research Chair of Machine Learning at the University of Pretoria; UP Postgraduate Masters Research bursary. Department: Electrical, Electronic and Computer Engineering. Access: Unrestricted.
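    As an illustration of the kind of multimodal fusion described in this abstract, the following sketch (not the dissertation's pipeline; the function names, feature dimensionalities, and normalised-concatenation choice are assumptions) fuses per-modality deep features of a video into a single vector and scores a cold-start item for a user by content similarity to the items that user liked.

```python
# Minimal sketch: fuse per-modality deep features of a video by L2-normalised
# concatenation, then score a cold-start item for a user via its mean cosine
# similarity to items the user has already liked (content-based recommendation).
import numpy as np

def fuse_modalities(visual: np.ndarray, audio: np.ndarray, motion: np.ndarray) -> np.ndarray:
    """Early fusion: normalise each modality, then concatenate into one vector."""
    parts = []
    for feat in (visual, audio, motion):
        norm = np.linalg.norm(feat) + 1e-8
        parts.append(feat / norm)
    return np.concatenate(parts)

def score_cold_start_item(new_item: np.ndarray, liked_items: np.ndarray) -> float:
    """Content-based score: mean cosine similarity to the items a user liked."""
    sims = liked_items @ new_item / (
        np.linalg.norm(liked_items, axis=1) * np.linalg.norm(new_item) + 1e-8
    )
    return float(sims.mean())

# Example with toy dimensions: 2048-d visual, 128-d audio, 1024-d motion features.
new_video = fuse_modalities(np.random.rand(2048), np.random.rand(128), np.random.rand(1024))
liked = np.stack([
    fuse_modalities(np.random.rand(2048), np.random.rand(128), np.random.rand(1024))
    for _ in range(5)
])
print(score_cold_start_item(new_video, liked))
```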