
    Video Content Understanding Using Text

    The rise of the social media and video streaming industries has provided a plethora of videos and corresponding descriptive information in the form of concepts (words) and textual video captions. Given this vast amount of video and textual data, there has never been a better time to study Computer Vision and Machine Learning problems involving videos and text. In this dissertation, we tackle multiple problems associated with the joint understanding of videos and text. We first address the task of multi-concept video retrieval, where the input is a set of words as concepts and the output is a ranked list of full-length videos. This approach handles multi-concept input and the prolonged length of videos by incorporating multiple latent variables to tie the information within each shot (a short clip of a full video) and across shots. Secondly, we address the problem of video question answering, in which the task is to answer a question, posed in Fill-In-the-Blank (FIB) form, given a video. Answering such a question amounts to retrieving a word from a dictionary (all possible answer words) based on the input question and video. Following the FIB problem, we introduce a new problem, Visual Text Correction (VTC): detecting and replacing an inaccurate word in the textual description of a video. We propose a deep network that simultaneously detects an inaccuracy in a sentence, benefiting from 1D-CNNs/LSTMs to encode short- and long-term dependencies, and fixes it by replacing the inaccurate word(s). Finally, in the last part of the dissertation, we tackle the problem of video generation from user-provided natural language sentences. Our proposed method constructs two distributions from the input text, corresponding to the latent representations of the first and last frames. We generate high-fidelity videos by interpolating these latent representations and decoding them through a sequence of CNN-based up-pooling blocks.
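    To make the final stage concrete, below is a minimal PyTorch sketch of text-conditioned video generation by latent interpolation. All names (UpPoolDecoder, generate_video) and architectural details are illustrative assumptions, not the dissertation's actual implementation: two Gaussian latents, standing in for the text-derived first- and last-frame distributions, are sampled and linearly interpolated, and each interpolated latent is decoded into a frame by a stack of up-pooling (transposed-convolution) blocks.

```python
# Hypothetical sketch of video generation by latent interpolation.
# Module/function names and shapes are assumptions for illustration.
import torch
import torch.nn as nn

class UpPoolDecoder(nn.Module):
    """Stack of CNN up-pooling blocks mapping a latent vector to one frame."""
    def __init__(self, z_dim=128, ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(z_dim, ch * 4, 4, 1, 0),   # 1x1 -> 4x4
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(ch * 4, ch * 2, 4, 2, 1),  # 4x4 -> 8x8
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(ch * 2, ch, 4, 2, 1),      # 8x8 -> 16x16
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(ch, 3, 4, 2, 1),           # 16x16 -> 32x32
            nn.Tanh(),
        )

    def forward(self, z):
        # Add spatial dims so the latent acts as a 1x1 feature map.
        return self.net(z[:, :, None, None])

def generate_video(mu_first, logvar_first, mu_last, logvar_last,
                   decoder, num_frames=16):
    """Sample endpoint latents from the two text-derived Gaussians,
    then decode frames along the linear path between them."""
    z0 = mu_first + torch.randn_like(mu_first) * (0.5 * logvar_first).exp()
    z1 = mu_last + torch.randn_like(mu_last) * (0.5 * logvar_last).exp()
    frames = []
    for t in torch.linspace(0.0, 1.0, num_frames):
        z_t = (1 - t) * z0 + t * z1          # interpolate in latent space
        frames.append(decoder(z_t))          # decode one frame per latent
    return torch.stack(frames, dim=1)        # (batch, time, 3, H, W)
```

    The intuition, as the abstract suggests, is that smoothly varying latents decode to smoothly varying frames, so interpolating between the first- and last-frame representations yields a temporally coherent video.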

    Algorithms and Applications of Novel Capsule Networks

    Convolutional neural networks, despite their profound impact in countless domains, suffer from significant shortcomings. Linearly combined scalar feature representations and max-pooling operations lead to spatial ambiguities and a lack of robustness to pose variations. Capsule networks can potentially alleviate these issues by storing and routing the pose information of extracted features through their architectures, seeking agreement between lower-level predictions of higher-level poses at each layer. In this dissertation, we make several key contributions that advance capsule network algorithms in segmentation and classification applications. We create the first capsule-based segmentation network in the literature, SegCaps, by introducing a novel locally-constrained dynamic routing algorithm, transformation matrix sharing, the concept of a deconvolutional capsule, an extension of the reconstruction regularization to segmentation, and a new encoder-decoder capsule architecture. Following this, we design a capsule-based diagnosis network, D-Caps, which builds on SegCaps and introduces a novel capsule-average pooling technique to handle larger medical imaging data. Finally, we design an explainable capsule network, X-Caps, which encodes high-level visual object attributes within its capsules by utilizing a multi-task framework and a novel routing sigmoid function that independently routes information from child capsules to parents. Predictions come with human-level explanations, via object attributes, and a confidence score; by training our network directly on the distribution of expert labels, we model inter-observer agreement and penalize over- and under-confidence during training. This body of work constitutes a significant algorithmic advance in the application of capsule networks, especially to real-world biomedical imaging data.
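    The abstract names two routing mechanisms (locally-constrained dynamic routing and an independent routing sigmoid) without giving equations. The following is a minimal sketch of generic dynamic routing-by-agreement (Sabour et al., 2017) in PyTorch, with a flag that swaps the usual softmax over parents for an independent sigmoid in the spirit of X-Caps; tensor names, shapes, and the iteration count are assumptions for illustration, not the dissertation's exact algorithm.

```python
# Minimal sketch of dynamic routing-by-agreement between capsule layers.
# Shapes and the sigmoid variant are illustrative assumptions.
import torch

def route(u_hat, num_iters=3, use_sigmoid=False):
    """u_hat: (batch, num_child, num_parent, dim) child predictions of
    parent poses. Returns parent poses of shape (batch, num_parent, dim)."""
    b = torch.zeros(u_hat.shape[:3], device=u_hat.device)  # routing logits
    for _ in range(num_iters):
        if use_sigmoid:
            c = torch.sigmoid(b)         # each child routes independently
        else:
            c = torch.softmax(b, dim=2)  # children compete over parents
        s = (c.unsqueeze(-1) * u_hat).sum(dim=1)  # weighted parent input
        # Squash nonlinearity keeps each pose vector's length in [0, 1).
        norm2 = (s ** 2).sum(dim=-1, keepdim=True)
        v = (norm2 / (1 + norm2)) * s / (norm2.sqrt() + 1e-8)
        # Agreement between child predictions and parent poses raises logits.
        b = b + (u_hat * v.unsqueeze(1)).sum(dim=-1)
    return v
```

    For example, an input u_hat of shape (2, 32, 10, 16) routes 32 child capsules to 10 parent capsules with 16-dimensional poses; the locally-constrained routing of SegCaps restricts which children participate for each parent, which this generic sketch does not model.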