Question-guided hybrid convolution for visual question answering
Funded by the National Research Foundation (NRF) Singapore under the International Research Centre @ Singapore Funding Initiative.
A^2-Net: Molecular Structure Estimation from Cryo-EM Density Volumes
Constructing molecular structural models from Cryo-Electron Microscopy
(Cryo-EM) density volumes is the critical last step of structure determination
by Cryo-EM technologies. Methods have evolved from manual construction by
structural biologists to automated approaches that perform 6D
translation-rotation searches, which are extremely compute-intensive. In this
paper, we propose a learning-based method
and formulate this problem as a vision-inspired 3D detection and pose
estimation task. We develop a deep learning framework for amino acid
determination in a 3D Cryo-EM density volume. We also design a sequence-guided
Monte Carlo Tree Search (MCTS) to thread over the candidate amino acids to form
the molecular structure. This framework achieves 91% coverage on our newly
proposed dataset and takes only a few minutes for a typical structure with a
thousand amino acids. Our method is hundreds of times faster and several times
more accurate than existing automated solutions without any human intervention.
Comment: 8 pages, 5 figures, 4 tables
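To make the threading step concrete, here is a toy sketch of how a sequence-guided Monte Carlo Tree Search might thread candidate amino-acid detections into a chain. The candidate format, confidence scores, reward, and all identifiers below are illustrative assumptions, not the paper's actual implementation.

```python
# Toy sketch: sequence-guided MCTS threading of amino-acid candidates.
# All names, scores, and the reward scheme are invented for illustration.
import math
import random

random.seed(0)

# Target sequence and detected candidates per residue type: (id, confidence).
SEQUENCE = ["ALA", "CYS", "ASP"]
CANDIDATES = {
    "ALA": [("a1", 0.9), ("a2", 0.4)],
    "CYS": [("c1", 0.7)],
    "ASP": [("d1", 0.3), ("d2", 0.8)],
}
CONF = {cid: s for cands in CANDIDATES.values() for cid, s in cands}

class Node:
    def __init__(self, chain, parent=None):
        self.chain = chain            # candidate ids chosen so far
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0

    def expand(self):
        # Children follow the target sequence order (this is the
        # "sequence-guided" part: only candidates of the next required
        # residue type are considered).
        nxt = SEQUENCE[len(self.chain)]
        for cid, _ in CANDIDATES[nxt]:
            self.children.append(Node(self.chain + [cid], self))

    def ucb(self, c=1.4):
        if self.visits == 0:
            return float("inf")
        return (self.value / self.visits
                + c * math.sqrt(math.log(self.parent.visits) / self.visits))

def rollout(chain):
    # Complete a partial chain randomly; reward = summed confidences.
    chain = list(chain)
    while len(chain) < len(SEQUENCE):
        cid, _ = random.choice(CANDIDATES[SEQUENCE[len(chain)]])
        chain.append(cid)
    return sum(CONF[c] for c in chain)

def mcts(iterations=200):
    root = Node([])
    for _ in range(iterations):
        node = root
        while node.children:                       # selection
            node = max(node.children, key=Node.ucb)
        if len(node.chain) < len(SEQUENCE):        # expansion
            node.expand()
            node = random.choice(node.children)
        reward = rollout(node.chain)               # simulation
        while node:                                # backpropagation
            node.visits += 1
            node.value += reward
            node = node.parent
    # Read out the most-visited path as the final threading.
    chain, node = [], root
    while node.children:
        node = max(node.children, key=lambda n: n.visits)
        chain = node.chain
    return chain

best = mcts()
print(best)
```

On this toy problem the search converges to a full-length threading that respects the target sequence; the real system would score candidates with learned detection confidences and geometric compatibility rather than the flat scores used here.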
Recent, rapid advancement in visual question answering architecture: a review
Understanding visual question answering is going to be crucial for numerous
human activities. However, it presents major challenges at the heart of the
artificial intelligence endeavor. This paper presents an update on the rapid
advancements in visual question answering using images that have occurred in
the last couple of years. Tremendous growth in research on improving visual
question answering system architecture has been published recently, showing the
importance of multimodal architectures. The present article builds on the
review paper by Manmadhan et al. (2020), which discusses several benefits of
visual question answering, and covers subsequent updates in the field.
Comment: 11 pages
Towards Language-guided Visual Recognition via Dynamic Convolutions
In this paper, we aim to establish a unified, end-to-end multi-modal network
by exploring language-guided visual recognition. To
approach this target, we first propose a novel multi-modal convolution module
called Language-dependent Convolution (LaConv). Its convolution kernels are
dynamically generated based on natural language information, which can help
extract differentiated visual features for different multi-modal examples.
Based on the LaConv module, we further build the first fully language-driven
convolution network, termed LaConvNet, which unifies visual recognition and
multi-modal reasoning in one forward structure. To validate
LaConv and LaConvNet, we conduct extensive experiments on four benchmark
datasets of two vision-and-language tasks, i.e., visual question answering
(VQA) and referring expression comprehension (REC). The experimental results
not only show the performance gains of LaConv over existing multi-modal
modules, but also demonstrate the merits of LaConvNet as a unified network,
including its compact size, high generalization ability, and excellent
performance, e.g., +4.7% on RefCOCO+
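The core idea of a language-dependent convolution can be sketched in a few lines: instead of a fixed learned kernel, the kernel is generated from a language embedding, so different questions or expressions produce different visual features from the same image. The generator matrix, embedding size, and kernel size below are illustrative assumptions, not LaConv's actual architecture.

```python
# Minimal numpy sketch of a language-conditioned dynamic convolution.
# All dimensions and the kernel generator are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)

EMB_DIM, K = 8, 3                     # language embedding dim, kernel size

# A frozen, randomly initialised kernel generator: maps a language
# embedding to the weights of a K x K convolution kernel.
W_gen = rng.standard_normal((EMB_DIM, K * K)) * 0.1

def generate_kernel(lang_emb):
    """Produce a KxK kernel conditioned on the question/expression."""
    return (lang_emb @ W_gen).reshape(K, K)

def conv2d(image, kernel):
    """Valid-mode 2D cross-correlation on a single-channel image."""
    h, w = image.shape
    out = np.empty((h - K + 1, w - K + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + K, j:j + K] * kernel)
    return out

image = rng.standard_normal((6, 6))
q1 = rng.standard_normal(EMB_DIM)     # stand-in embedding for question 1
q2 = rng.standard_normal(EMB_DIM)     # stand-in embedding for question 2

f1 = conv2d(image, generate_kernel(q1))
f2 = conv2d(image, generate_kernel(q2))

print(f1.shape)                       # (4, 4)
```

In the full model the kernel generator would itself be learned end-to-end from a language encoder, so the "differentiated visual features for different multi-modal examples" fall out of the same mechanism shown here: distinct embeddings yield distinct kernels and hence distinct feature maps.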