21 research outputs found

    Scalable Methodologies and Analyses for Modality Bias and Feature Exploitation in Language-Vision Multimodal Deep Learning

    Get PDF
    Multimodal machine learning benchmarks have exponentially grown in both capability and popularity over the last decade. Language-vision question-answering tasks such as Visual Question Answering (VQA) and Video Question Answering (video-QA) have ---thanks to their high difficulty--- become a particularly popular means through which to develop and test new modelling designs and methodology for multimodal deep learning. The challenging nature of VQA and video-QA tasks leaves plenty of room for innovation at every component of the deep learning pipeline: from dataset to modelling methodology. Such circumstances are ideal for innovating in the space of language-vision multimodality. Furthermore, the wider field is currently undergoing an incredible period of growth and increasing interest. I therefore aim to contribute to multiple key components of the VQA and video-QA pipeline, but specifically in a manner such that my contributions remain relevant, ‘scaling’ with the revolutionary new benchmark models and datasets of the near future instead of being rendered obsolete by them. The work in this thesis: highlights and explores the disruptive and problematic presence of language bias in the popular TVQA video-QA dataset, and proposes a dataset-invariant method to identify subsets that respond to different modalities; thoroughly explores the suitability of bilinear pooling as a language-vision fusion technique in video-QA, offering experimental and theoretical insight, and highlighting the parallels in multimodal processing with neurological theories; explores the nascent visual equivalent of languague modelling (`visual modelling') in order to boost the power of visual features; and proposes a dataset-invariant neurolinguistically-inspired labelling scheme for use in multimodal question-answering. I explore the positive and negative results that my experiments across this thesis yield. I conclude by discussing the limitations of my contributions, and conclude with proposals for future directions of study in the areas I contribute to

    Robust Subspace Estimation via Low-Rank and Sparse Decomposition and Applications in Computer Vision

    Get PDF
    PhDRecent advances in robust subspace estimation have made dimensionality reduction and noise and outlier suppression an area of interest for research, along with continuous improvements in computer vision applications. Due to the nature of image and video signals that need a high dimensional representation, often storage, processing, transmission, and analysis of such signals is a difficult task. It is therefore desirable to obtain a low-dimensional representation for such signals, and at the same time correct for corruptions, errors, and outliers, so that the signals could be readily used for later processing. Major recent advances in low-rank modelling in this context were initiated by the work of Cand`es et al. [17] where the authors provided a solution for the long-standing problem of decomposing a matrix into low-rank and sparse components in a Robust Principal Component Analysis (RPCA) framework. However, for computer vision applications RPCA is often too complex, and/or may not yield desirable results. The low-rank component obtained by the RPCA has usually an unnecessarily high rank, while in certain tasks lower dimensional representations are required. The RPCA has the ability to robustly estimate noise and outliers and separate them from the low-rank component, by a sparse part. But, it has no mechanism of providing an insight into the structure of the sparse solution, nor a way to further decompose the sparse part into a random noise and a structured sparse component that would be advantageous in many computer vision tasks. As videos signals are usually captured by a camera that is moving, obtaining a low-rank component by RPCA becomes impossible. In this thesis, novel Approximated RPCA algorithms are presented, targeting different shortcomings of the RPCA. The Approximated RPCA was analysed to identify the most time consuming RPCA solutions, and replace them with simpler yet tractable alternative solutions. The proposed method is able to obtain the exact desired rank for the low-rank component while estimating a global transformation to describe camera-induced motion. Furthermore, it is able to decompose the sparse part into a foreground sparse component, and a random noise part that contains no useful information for computer vision processing. The foreground sparse component is obtained by several novel structured sparsity-inducing norms, that better encapsulate the needed pixel structure in visual signals. Moreover, algorithms for reducing complexity of low-rank estimation have been proposed that achieve significant complexity reduction without sacrificing the visual representation of video and image information. The proposed algorithms are applied to several fundamental computer vision tasks, namely, high efficiency video coding, batch image alignment, inpainting, and recovery, video stabilisation, background modelling and foreground segmentation, robust subspace clustering and motion estimation, face recognition, and ultra high definition image and video super-resolution. The algorithms proposed in this thesis including batch image alignment and recovery, background modelling and foreground segmentation, robust subspace clustering and motion segmentation, and ultra high definition image and video super-resolution achieve either state-of-the-art or comparable results to existing methods

    Evolution and Regularisation of Vacuum Brill Gravitational Waves in Spherical Polar Coordinates

    Full text link
    In this thesis the universal collapse of vacuum Brill waves is demonstrated numerically and analytically. This thesis presents the mathematical and numerical methods necessary to regularise and evolve Brill Gravitational Waves in spherical polar coordinates. A Cauchy ADM formulation is used for the time evolution. We find strong evidence that all IVP formulations of pure vacuum Brill gravitational waves collapse to form singularities/black holes, and we do not observe critical black hole mass scaling phenomena in the IVP parameter phase space that has been characterised in non-vacuum systems. A theoretical framework to prove this result analytically is presented. We discuss the meaning of Brill metric variables, the topology of trapped surfaces for various scenarios, and verify other results in the field related to critical values of initial value parameters and black hole formation approaching spatial infinity. The instability of Minkowski (flat) space under Brill wave and more general perturbations is demonstrated. The main numerical tools employed to achieve a stable evolution code are (1) derivation of appropriate regularity conditions on the lapse function and metric function q, (2) the move to a 4th order correct discretisation scheme with appropriate boundary conditions, (3) the use of exponential metric terms, (4) an understanding of the right mix of free versus constrained evolution and (5) the development of appropriate numerical techniques for discretisation and differencing to reduce numerical error, along with a characterisation of condition numbers.Comment: PhD Thesis, 2014, University of Calgary, 368 page

    Deep learning for image restoration and enhancement

    Get PDF
    Image restoration is the process of recovering an original clean image from its degraded version, and image enhancement takes the goal of improving the image quality either objectively or subjectively. Both of them play a key part in computer vision and image processing and have broad applications in industry. The past few years have witnessed the revival of deep learning in computer vision, and substantial progress has been made due to the use of deep neural networks. In this dissertation, we use deep learning to address the problems of image restoration and enhancement, with the focus on the following topics: image and video super-resolution (SR), as well as image denoising. For these problems, deep neural networks are generally used as a regression model to predict the original clean image content from the input. However, designing a network structure that can effectively exploit the intrinsic image properties to achieve remarkable performances is not a trivial task. For image SR, several models based on deep neural networks have been recently proposed and attained superior performance that overshadows all previous handcrafted models. The question then arises whether large-capacity and data-driven models have become the dominant solution to this ill-posed problem. We argue that domain expertise represented by the conventional sparse coding model is still valuable, and it can be combined with the key ingredients of deep learning to achieve further improved results. We experimentally show that a sparse coding model particularly designed for SR can be incarnated as a neural network, which can be trained from end to end. The interpretation of the network based on sparse coding leads to much more efficient and effective training, as well as a reduced model size. In addition, we design a unified framework to learn a mixture of sub-networks for image SR so as to further boost SR accuracy. Video SR aims to generate a high-resolution (HR) frame from multiple low-resolution (LR) frames in a local temporal window. The inter-frame temporal relation is as crucial as the intra-frame spatial relation for tackling this problem. We design deep networks for utilizing the temporal relation from two aspects. First, we propose a temporal adaptive neural network that can adaptively determine the optimal scale of temporal dependency. Filters on various temporal scales are applied to the input LR sequence before their responses are adaptively aggregated. Second, we reduce the complexity of motion between neighboring frames using a spatial alignment network which is much more robust and efficient than competing alignment methods and can be jointly trained with the temporal adaptive network. Image denoising, as another important task of image restoration, is dedicated to recovering the underlying image signal from its noisy measurement. First we customize a convolutional neural network for image denoising. Second we investigate the mutual relation between image denoising and high-level vision tasks in a deep learning fashion when image denoising serves as a preprocessing step for high-level vision tasks. We design a network that cascades two modules for image denoising and various high-level tasks, and use the joint loss for updating only the denoising network via back-propagation. Self-similarity in natural images is widely used for image restoration by many classic approaches. We propose a non-local recurrent network as the first attempt to incorporate non-local operations into a recurrent neural network (RNN) for image restoration. Unlike existing methods that measure self-similarity in an isolated manner, the proposed non-local module can be flexibly integrated into existing deep networks for end-to-end training to capture deep feature correlation between each location and its neighborhood. We fully use an RNN for its parameter efficiency and allow deep feature correlation to be propagated along adjacent recurrent states. This design boosts robustness against inaccurate correlation estimation due to severely degraded images. Finally, we show that it is essential to choose a proper neighborhood size for computing deep feature correlation given degraded images, in order to obtain the best restoration performance

    Structure-aware narrative summarization from multiple views

    Get PDF
    Narratives, such as movies and TV shows, provide a testbed for addressing a variety of challenges in the field of artificial intelligence. They are examples of complex stories where characters and events interact in many ways. Inferring what is happening in a narrative requires modeling long-range dependencies between events, understanding commonsense knowledge and accounting for non-linearities in the presentation of the story. Moreover, narratives are usually long (i.e., there are hundreds of pages in a screenplay and thousands of frames in a video) and cannot be easily processed by standard neural architectures. Movies and TV episodes also include information from multiple sources (i.e., video, audio, text) that are complementary to inferring high-level events and their interactions. Finally, creating large-scale multimodal datasets with narratives containing long videos and aligned textual data is challenging, resulting in small datasets that require data efficient approaches. Most prior work that analyzes narratives does not consider the above challenges all at once. In most cases, text-only approaches focus on full-length narratives with complex semantics and address tasks such as question-answering and summarization, or multimodal approaches are limited to short videos with simpler semantics (e.g., isolated actions and local interactions). In this thesis, we combine these two different directions in addressing narrative summarization. We use all input modalities (i.e., video, audio, text), consider full-length narratives and perform the task of narrative summarization both in a video-to-video setting (i.e., video summarization, trailer generation) and a video-to-text setting (i.e., multimodal abstractive summarization). We hypothesize that information about the narrative structure of movies and TVepisodes can facilitate summarizing them. We introduce the task of Turning Point identification and provide a corresponding dataset called TRIPOD as a means of analyzing the narrative structure of movies. According to screenwriting theory, turning points (e.g., change of plans, major setback, climax) are crucial narrative moments within a movie or TV episode: they define the plot structure and determine its progression and thematic units. We validate that narrative structure contributes to extractive screenplay summarization by testing our hypothesis on a dataset containing TV episodes and summary-specific labels. We further hypothesize that movies should not be viewed as a sequence of scenes from a screenplay or shots from a video and instead be modelled as sparse graphs, where nodes are scenes or shots and edges denote strong semantic relationships between them. We utilize multimodal information for creating movie graphs in the latent space, and find that both graph-related and multimodal information help contextualization and boost performance on extractive summarization. Moving one step further, we also address the task of trailer moment identification, which can be viewed as a specific instiatiation of narrative summarization. We decompose this task, which is challenging and subjective, into two simpler ones: narrativestructure identification, defined again by turning points, and sentiment prediction. We propose a graph-based unsupervised algorithm that uses interpretable criteria for retrieving trailer shots and convert it into an interactive tool with a human in the loop for trailer creation. Semi-automatic trailer shot selection exhibits comparable performance to fully manual selection according to human judges, while minimizing processing time. After identifying salient content in narratives, we next attempt to produce abstractive textual summaries (i.e., video-to-text). We hypothesize that multimodal information is directly important for generating textual summaries, apart from contributing to content selection. For that, we propose a parameter efficient way for incorporating multimodal information into a pre-trained textual summarizer, while training only 3.8% of model parameters, and demonstrate the importance of multimodal information for generating high-quality and factual summaries. The findings of this thesis underline the need to focus on realistic and multimodal settings when addressing narrative analysis and generation tasks

    EXPLOITING USER COMMENTS FOR WEB APPLICATIONS

    Get PDF
    Ph.DDOCTOR OF PHILOSOPH

    Localisation faiblement supervisée des actions orientées vers un but

    Get PDF
    The goal of this thesis is to develop methods for automatic understanding of video content. We focus on instructional videos that demonstrate how to perform complex tasks, such as making an omelette or hanging a picture. First, we investigate learning visual models for the steps of tasks, using only a list of steps for each task, instead of costly and time consuming human annotations. Our model allows us to share the information between the tasks on the sub-step level, effectively multiplying the amount of available training data. We demonstrate the benefits of our method on a newly collected dataset of instructional videos, CrossTask. Next, we present a method for isolating task-related actions from the surrounding background, that doesn’t rely on human supervision. Finally, we learn to associate natural language instructions with the corresponding objects within the 3D scene, reconstructed from the videos.Le but de cette thèse est de développer des méthodes pour la compréhension automatique des vidéos d'instructions, qui démontrent des tâches humaines, comme, par exemple, faire une omelette ou accrocher une peinture. Nous proposons, d’abord, une méthode d'apprentissage des actions seulement à partir d'un script pour chaque tâche, au lieu des annotations manuelles. Notre modèle permet de réduire la quantité de données d'entraînement, en partageant l’information entre les tâches. Nous évaluons notre approche sur un nouveau jeu de données, CrossTask. Nous présentons, ensuite, une méthode non supervisée pour isoler les actions, liée à une tâche de leur contexte. Finally, we learn to associate natural language instructions with the corresponding objects within the 3D scene, reconstructed from the videos. Finalement, nous proposons une approche pour associer des instructions textuelles avec des objets correspondants dans la scène 3D, reconstruite à partir des vidéos

    Mathematical Modelling in Engineering & Human Behaviour 2018

    Get PDF
    This book includes papers in cross-disciplinary applications of mathematical modelling: from medicine to linguistics, social problems, and more. Based on cutting-edge research, each chapter is focused on a different problem of modelling human behaviour or engineering problems at different levels. The reader would find this book to be a useful reference in identifying problems of interest in social, medicine and engineering sciences, and in developing mathematical models that could be used to successfully predict behaviours and obtain practical information for specialised practitioners. This book is a must-read for anyone interested in the new developments of applied mathematics in connection with epidemics, medical modelling, social issues, random differential equations and numerical methods

    Extensions to the Latent Dirichlet Allocation Topic Model Using Flexible Priors

    Get PDF
    Intrinsically, topic models have always their likelihood functions fixed to multinomial distributions as they operate on count data instead of Gaussian data. As a result, their performances ultimately depend on the flexibility of the chosen prior distributions when following the Bayesian paradigm compared to classical approaches such as PLSA (probabilistic latent semantic analysis), unigrams and mixture of unigrams that do not use prior information. The standard LDA (latent Dirichlet allocation) topic model operates with symmetric Dirichlet distribution (as a conjugate prior) which has been found to carry some limitations due to its independent structure that tends to hinder performance for instance in topic correlation including positively correlated data processing. Compared to classical ML estimators, the use of priors ultimately presents another unique advantage of smoothing out the multinomials while enhancing predictive topic models. In this thesis, we propose a series of flexible priors such as generalized Dirichlet (GD) and Beta-Liouville (BL) for our topic models within the collapsed representation, leading to much improved CVB (collapsed variational Bayes) update equations compared to ones from the standard LDA. This is because the flexibility of these priors improves significantly the lower bounds in the corresponding CVB algorithms. We also show the robustness of our proposed CVB inferences when using simultaneously the BL and GD in hybrid generative-discriminative models where the generative stage produces good and heterogeneous topic features that are used in the discriminative stage by powerful classifiers such as SVMs (support vector machines) as we propose efficient probabilistic kernels to facilitate processing (classification) of documents based on topic signatures. Doing so, we implicitly cast topic modeling which is an unsupervised learning method into a supervised learning technique. Furthermore, due to the complexity of the CVB algorithm (as it requires second order Taylor expansions) in general, despite its flexibility, we propose a much simpler and tractable update equation using a MAP (maximum a posteriori) framework with the standard EM (expectation-maximization) algorithm. As most Bayesian posteriors are not tractable for complex models, we ultimately propose the MAP-LBLA (latent BL allocation) where we characterize the contributions of asymmetric BL priors over the symmetric Dirichlet (Dir). The proposed MAP technique importantly offers a point estimate (mode) with a much tractable solution. In the MAP, we show that point estimate could be easy to implement than full Bayesian analysis that integrates over the entire parameter space. The MAP implicitly exhibits some equivalent relationship with the CVB especially the zero order approximations CVB0 and its stochastic version SCVB0. The proposed method enhances performances in information retrieval in text document analysis. We show that parametric topic models (as they are finite dimensional methods) have a much smaller hypothesis space and they generally suffer from model selection. We therefore propose a Bayesian nonparametric (BNP) technique that uses the Hierarchical Dirichlet process (HDP) as conjugate prior to the document multinomial distributions where the asymmetric BL serves as a diffuse (probability) base measure that provides the global atoms (topics) that are shared among documents. The heterogeneity in the topic structure helps in providing an alternative to model selection because the nonparametric topic model (which is infinite dimensional with a much bigger hypothesis space) could now prune out irrelevant topics based on the associated probability masses to only retain the most relevant ones. We also show that for large scale applications, stochastic optimizations using natural gradients of the objective functions have demonstrated significant performances when we learn rapidly both data and parameters in online fashion (streaming). We use both predictive likelihood and perplexity as evaluation methods to assess the robustness of our proposed topic models as we ultimately refer to probability as a way to quantify uncertainty in our Bayesian framework. We improve object categorization in terms of inferences through the flexibility of our prior distributions in the collapsed space. We also improve information retrieval technique with the MAP and the HDP-LBLA topic models while extending the standard LDA. These two applications present the ultimate capability of enhancing a search engine based on topic models
    corecore