Search CORE

229 research outputs found

Visual Question Answering with Memory-Augmented Networks

Author: Dick Anthony
Hengel Anton van den
Ma Chao
Reid Ian
Shen Chunhua
Wang Peng
Wu Qi
Publication venue
Publication date: 25/03/2018
Field of study

In this paper, we exploit a memory-augmented neural network to predict accurate answers to visual questions, even when those answers occur rarely in the training set. The memory network incorporates both internal and external memory blocks and selectively pays attention to each training exemplar. We show that memory-augmented neural networks are able to maintain a relatively long-term memory of scarce training exemplars, which is important for visual question answering due to the heavy-tailed distribution of answers in a general VQA setting. Experimental results on two large-scale benchmark datasets show the favorable performance of the proposed algorithm with a comparison to state of the art.Comment: CVPR 201

arXiv.org e-Print Archive

Crossref

Streaming Video QoE Modeling and Prediction: A Long Short-Term Memory Approach

Author: Ashique S
Chakraborty Soumen
Channappayya Sumohana S.
Eswara Nagabhushan
Kuchi Kiran
Kumar Abhinav
Panchbhai Anand
Sethuram Hemanth P.
Publication venue
Publication date: 18/07/2018
Field of study

HTTP based adaptive video streaming has become a popular choice of streaming due to the reliable transmission and the flexibility offered to adapt to varying network conditions. However, due to rate adaptation in adaptive streaming, the quality of the videos at the client keeps varying with time depending on the end-to-end network conditions. Further, varying network conditions can lead to the video client running out of playback content resulting in rebuffering events. These factors affect the user satisfaction and cause degradation of the user quality of experience (QoE). It is important to quantify the perceptual QoE of the streaming video users and monitor the same in a continuous manner so that the QoE degradation can be minimized. However, the continuous evaluation of QoE is challenging as it is determined by complex dynamic interactions among the QoE influencing factors. Towards this end, we present LSTM-QoE, a recurrent neural network based QoE prediction model using a Long Short-Term Memory (LSTM) network. The LSTM-QoE is a network of cascaded LSTM blocks to capture the nonlinearities and the complex temporal dependencies involved in the time varying QoE. Based on an evaluation over several publicly available continuous QoE databases, we demonstrate that the LSTM-QoE has the capability to model the QoE dynamics effectively. We compare the proposed model with the state-of-the-art QoE prediction models and show that it provides superior performance across these databases. Further, we discuss the state space perspective for the LSTM-QoE and show the efficacy of the state space modeling approaches for QoE prediction

arXiv.org e-Print Archive

Research Archive of Indian Institute of Technology Hyderabad

SpatioTemporal Feature Integration and Model Fusion for Full Reference Video Quality Assessment

Author: Bampis Christos G.
Li Zhi
Bovik Alan C.
Publication venue
Publication date: 30/09/1990
Field of study

Perceptual video quality assessment models are either frame-based or video-based, i.e., they apply spatiotemporal filtering or motion estimation to capture temporal video distortions. Despite their good performance on video quality databases, video-based approaches are time-consuming and harder to efficiently deploy. To balance between high performance and computational efficiency, Netflix developed the Video Multi-method Assessment Fusion (VMAF) framework, which integrates multiple quality-aware features to predict video quality. Nevertheless, this fusion framework does not fully exploit temporal video quality measurements which are relevant to temporal video distortions. To this end, we propose two improvements to the VMAF framework: SpatioTemporal VMAF and Ensemble VMAF. Both algorithms exploit efficient temporal video features which are fed into a single or multiple regression models. To train our models, we designed a large subjective database and evaluated the proposed models against state-of-the-art approaches. The compared algorithms will be made available as part of the open source package in https://github.com/Netflix/vmaf

arXiv.org e-Print Archive

Revistas y Boletines - Banco de la República

BLOCK: Bilinear Superdiagonal Fusion for Visual Question Answering and Visual Relationship Detection

Author: Ben-younes Hedi
Cadene Rémi
Cord Matthieu
Thome Nicolas
Publication venue
Publication date: 27/01/2019
Field of study

Multimodal representation learning is gaining more and more interest within the deep learning community. While bilinear models provide an interesting framework to find subtle combination of modalities, their number of parameters grows quadratically with the input dimensions, making their practical implementation within classical deep learning pipelines challenging. In this paper, we introduce BLOCK, a new multimodal fusion based on the block-superdiagonal tensor decomposition. It leverages the notion of block-term ranks, which generalizes both concepts of rank and mode ranks for tensors, already used for multimodal fusion. It allows to define new ways for optimizing the tradeoff between the expressiveness and complexity of the fusion model, and is able to represent very fine interactions between modalities while maintaining powerful mono-modal representations. We demonstrate the practical interest of our fusion model by using BLOCK for two challenging tasks: Visual Question Answering (VQA) and Visual Relationship Detection (VRD), where we design end-to-end learnable architectures for representing relevant interactions between modalities. Through extensive experiments, we show that BLOCK compares favorably with respect to state-of-the-art multimodal fusion models for both VQA and VRD tasks. Our code is available at https://github.com/Cadene/block.bootstrap.pytorch

arXiv.org e-Print Archive

HAL Descartes

Hal-Diderot

Association for the Advancement of Artificial Intelligence: AAAI Publications

Learning to Reason: End-to-End Module Networks for Visual Question Answering

Author: Andreas Jacob
Darrell Trevor
Hu Ronghang
Rohrbach Marcus
Saenko Kate
Publication venue
Publication date: 11/09/2017
Field of study

Natural language questions are inherently compositional, and many are most easily answered by reasoning about their decomposition into modular sub-problems. For example, to answer "is there an equal number of balls and boxes?" we can look for balls, look for boxes, count them, and compare the results. The recently proposed Neural Module Network (NMN) architecture implements this approach to question answering by parsing questions into linguistic substructures and assembling question-specific deep networks from smaller modules that each solve one subtask. However, existing NMN implementations rely on brittle off-the-shelf parsers, and are restricted to the module configurations proposed by these parsers rather than learning them from data. In this paper, we propose End-to-End Module Networks (N2NMNs), which learn to reason by directly predicting instance-specific network layouts without the aid of a parser. Our model learns to generate network structures (by imitating expert demonstrations) while simultaneously learning network parameters (using the downstream task loss). Experimental results on the new CLEVR dataset targeted at compositional question answering show that N2NMNs achieve an error reduction of nearly 50% relative to state-of-the-art attentional approaches, while discovering interpretable network architectures specialized for each question

arXiv.org e-Print Archive

Crossref

Critical analysis on the reproducibility of visual quality assessment using deep features

Author: Götz-Hahn Franz
Hosu Vlad
Saupe Dietmar
Publication venue
Publication date: 10/09/2020
Field of study

Data used to train supervised machine learning models are commonly split into independent training, validation, and test sets. In this paper we illustrate that intricate cases of data leakage have occurred in the no-reference video and image quality assessment literature. We show that the performance results of several recently published journal papers that are well above the best performances in related works, cannot be reached. Our analysis shows that information from the test set was inappropriately used in the training process in different ways. When correcting for the data leakage, the performances of the approaches drop below the state-of-the-art by a large margin. Additionally, we investigate end-to-end variations to the discussed approaches, which do not improve upon the original.Comment: 20 pages, 7 figures, PLOS ONE journal. arXiv admin note: substantial text overlap with arXiv:2005.0440

arXiv.org e-Print Archive

KOPS - The Institutional Repository of the University of Konstanz