1,483 research outputs found
Revisiting Visual Question Answering Baselines
Visual question answering (VQA) is an interesting learning setting for
evaluating the abilities and shortcomings of current systems for image
understanding. Many of the recently proposed VQA systems include attention or
memory mechanisms designed to support "reasoning". For multiple-choice VQA,
nearly all of these systems train a multi-class classifier on image and
question features to predict an answer. This paper questions the value of these
common practices and develops a simple alternative model based on binary
classification. Instead of treating answers as competing choices, our model
receives the answer as input and predicts whether or not an
image-question-answer triplet is correct. We evaluate our model on the Visual7W
Telling and the VQA Real Multiple Choice tasks, and find that even simple
versions of our model perform competitively. Our best model achieves
state-of-the-art performance on the Visual7W Telling task and compares
surprisingly well with the most complex systems proposed for the VQA Real
Multiple Choice task. We explore variants of the model and study its
transferability between both datasets. We also present an error analysis of our
model that suggests a key problem of current VQA systems lies in the lack of
visual grounding of concepts that occur in the questions and answers. Overall,
our results suggest that the performance of current VQA systems is not
significantly better than that of systems designed to exploit dataset biases.Comment: European Conference on Computer Visio
CanvasGAN: A simple baseline for text to image generation by incrementally patching a canvas
We propose a new recurrent generative model for generating images from text
captions while attending on specific parts of text captions. Our model creates
images by incrementally adding patches on a "canvas" while attending on words
from text caption at each timestep. Finally, the canvas is passed through an
upscaling network to generate images. We also introduce a new method for
generating visual-semantic sentence embeddings based on self-attention over
text. We compare our model's generated images with those generated Reed et.
al.'s model and show that our model is a stronger baseline for text to image
generation tasks.Comment: CVC 201
Neural Networks Compression for Language Modeling
In this paper, we consider several compression techniques for the language
modeling problem based on recurrent neural networks (RNNs). It is known that
conventional RNNs, e.g, LSTM-based networks in language modeling, are
characterized with either high space complexity or substantial inference time.
This problem is especially crucial for mobile applications, in which the
constant interaction with the remote server is inappropriate. By using the Penn
Treebank (PTB) dataset we compare pruning, quantization, low-rank
factorization, tensor train decomposition for LSTM networks in terms of model
size and suitability for fast inference.Comment: Keywords: LSTM, RNN, language modeling, low-rank factorization,
pruning, quantization. Published by Springer in the LNCS series, 7th
International Conference on Pattern Recognition and Machine Intelligence,
201
Longitudinal detection of radiological abnormalities with time-modulated LSTM
Convolutional neural networks (CNNs) have been successfully employed in
recent years for the detection of radiological abnormalities in medical images
such as plain x-rays. To date, most studies use CNNs on individual examinations
in isolation and discard previously available clinical information. In this
study we set out to explore whether Long-Short-Term-Memory networks (LSTMs) can
be used to improve classification performance when modelling the entire
sequence of radiographs that may be available for a given patient, including
their reports. A limitation of traditional LSTMs, though, is that they
implicitly assume equally-spaced observations, whereas the radiological exams
are event-based, and therefore irregularly sampled. Using both a simulated
dataset and a large-scale chest x-ray dataset, we demonstrate that a simple
modification of the LSTM architecture, which explicitly takes into account the
time lag between consecutive observations, can boost classification
performance. Our empirical results demonstrate improved detection of commonly
reported abnormalities on chest x-rays such as cardiomegaly, consolidation,
pleural effusion and hiatus hernia.Comment: Submitted to 4th MICCAI Workshop on Deep Learning in Medical Imaging
Analysi
Evolutionary estimation of a Coupled Markov Chain credit risk model
There exists a range of different models for estimating and simulating credit
risk transitions to optimally manage credit risk portfolios and products. In
this chapter we present a Coupled Markov Chain approach to model rating
transitions and thereby default probabilities of companies. As the likelihood
of the model turns out to be a non-convex function of the parameters to be
estimated, we apply heuristics to find the ML estimators. To this extent, we
outline the model and its likelihood function, and present both a Particle
Swarm Optimization algorithm, as well as an Evolutionary Optimization algorithm
to maximize the likelihood function. Numerical results are shown which suggest
a further application of evolutionary optimization techniques for credit risk
management
How did the discussion go: Discourse act classification in social media conversations
We propose a novel attention based hierarchical LSTM model to classify
discourse act sequences in social media conversations, aimed at mining data
from online discussion using textual meanings beyond sentence level. The very
uniqueness of the task is the complete categorization of possible pragmatic
roles in informal textual discussions, contrary to extraction of
question-answers, stance detection or sarcasm identification which are very
much role specific tasks. Early attempt was made on a Reddit discussion
dataset. We train our model on the same data, and present test results on two
different datasets, one from Reddit and one from Facebook. Our proposed model
outperformed the previous one in terms of domain independence; without using
platform-dependent structural features, our hierarchical LSTM with word
relevance attention mechanism achieved F1-scores of 71\% and 66\% respectively
to predict discourse roles of comments in Reddit and Facebook discussions.
Efficiency of recurrent and convolutional architectures in order to learn
discursive representation on the same task has been presented and analyzed,
with different word and comment embedding schemes. Our attention mechanism
enables us to inquire into relevance ordering of text segments according to
their roles in discourse. We present a human annotator experiment to unveil
important observations about modeling and data annotation. Equipped with our
text-based discourse identification model, we inquire into how heterogeneous
non-textual features like location, time, leaning of information etc. play
their roles in charaterizing online discussions on Facebook
Patterns versus Characters in Subword-aware Neural Language Modeling
Words in some natural languages can have a composite structure. Elements of
this structure include the root (that could also be composite), prefixes and
suffixes with which various nuances and relations to other words can be
expressed. Thus, in order to build a proper word representation one must take
into account its internal structure. From a corpus of texts we extract a set of
frequent subwords and from the latter set we select patterns, i.e. subwords
which encapsulate information on character -gram regularities. The selection
is made using the pattern-based Conditional Random Field model with
regularization. Further, for every word we construct a new sequence over an
alphabet of patterns. The new alphabet's symbols confine a local statistical
context stronger than the characters, therefore they allow better
representations in and are better building blocks for word
representation. In the task of subword-aware language modeling, pattern-based
models outperform character-based analogues by 2-20 perplexity points. Also, a
recurrent neural network in which a word is represented as a sum of embeddings
of its patterns is on par with a competitive and significantly more
sophisticated character-based convolutional architecture.Comment: 10 page
TandemNet: Distilling Knowledge from Medical Images Using Diagnostic Reports as Optional Semantic References
In this paper, we introduce the semantic knowledge of medical images from
their diagnostic reports to provide an inspirational network training and an
interpretable prediction mechanism with our proposed novel multimodal neural
network, namely TandemNet. Inside TandemNet, a language model is used to
represent report text, which cooperates with the image model in a tandem
scheme. We propose a novel dual-attention model that facilitates high-level
interactions between visual and semantic information and effectively distills
useful features for prediction. In the testing stage, TandemNet can make
accurate image prediction with an optional report text input. It also
interprets its prediction by producing attention on the image and text
informative feature pieces, and further generating diagnostic report
paragraphs. Based on a pathological bladder cancer images and their diagnostic
reports (BCIDR) dataset, sufficient experiments demonstrate that our method
effectively learns and integrates knowledge from multimodalities and obtains
significantly improved performance than comparing baselines.Comment: MICCAI2017 Ora
Dataset for Automatic Summarization of Russian News
Automatic text summarization has been studied in a variety of domains and
languages. However, this does not hold for the Russian language. To overcome
this issue, we present Gazeta, the first dataset for summarization of Russian
news. We describe the properties of this dataset and benchmark several
extractive and abstractive models. We demonstrate that the dataset is a valid
task for methods of text summarization for Russian. Additionally, we prove the
pretrained mBART model to be useful for Russian text summarization.Comment: Version 3, accepted to AINL 202
Broadcasting Convolutional Network for Visual Relational Reasoning
In this paper, we propose the Broadcasting Convolutional Network (BCN) that
extracts key object features from the global field of an entire input image and
recognizes their relationship with local features. BCN is a simple network
module that collects effective spatial features, embeds location information
and broadcasts them to the entire feature maps. We further introduce the
Multi-Relational Network (multiRN) that improves the existing Relation Network
(RN) by utilizing the BCN module. In pixel-based relation reasoning problems,
with the help of BCN, multiRN extends the concept of `pairwise relations' in
conventional RNs to `multiwise relations' by relating each object with multiple
objects at once. This yields in O(n) complexity for n objects, which is a vast
computational gain from RNs that take O(n^2). Through experiments, multiRN has
achieved a state-of-the-art performance on CLEVR dataset, which proves the
usability of BCN on relation reasoning problems.Comment: Accepted paper at ECCV 2018. 24 page
- …