174 research outputs found
Towards Language-guided Visual Recognition via Dynamic Convolutions
In this paper, we aim to establish a unified, end-to-end multi-modal network
by exploring language-guided visual recognition. To
approach this target, we first propose a novel multi-modal convolution module
called Language-dependent Convolution (LaConv). Its convolution kernels are
dynamically generated based on natural language information, which can help
extract differentiated visual features for different multi-modal examples.
Based on the LaConv module, we further build the first fully language-driven
convolution network, termed LaConvNet, which unifies visual
recognition and multi-modal reasoning in one forward structure. To validate
LaConv and LaConvNet, we conduct extensive experiments on four benchmark
datasets of two vision-and-language tasks, i.e., visual question answering
(VQA) and referring expression comprehension (REC). The experimental results
not only show the performance gains of LaConv over existing multi-modal
modules, but also demonstrate the merits of LaConvNet as a unified network,
including its compact size, high generalization ability and excellent
performance, e.g., +4.7% on RefCOCO+
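The idea of a language-dependent convolution can be illustrated with a toy sketch (all names here are illustrative, not the paper's API): the convolution kernel is produced by projecting a language embedding, so different sentences yield different filters over the same image features.

```python
import numpy as np

def laconv_like(feature_map, lang_vec, k=3):
    """Toy sketch of a language-dependent convolution: the k x k
    kernel is generated from the language embedding via a (random,
    hypothetical) linear projection, rather than being a fixed weight."""
    rng = np.random.default_rng(0)
    # Hypothetical projection from language embedding to a k*k kernel.
    W = rng.standard_normal((k * k, lang_vec.size))
    kernel = (W @ lang_vec).reshape(k, k)
    h, w = feature_map.shape
    pad = k // 2
    padded = np.pad(feature_map, pad)  # zero-pad so output keeps its size
    out = np.empty_like(feature_map)
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[i:i + k, j:j + k] * kernel)
    return out

feat = np.ones((5, 5))
q1 = np.array([1.0, 0.0, 0.0, 0.0])  # stand-in for one query's embedding
q2 = np.array([0.0, 1.0, 0.0, 0.0])  # stand-in for a different query
# Different language inputs produce different filters, hence
# differentiated visual features for the same image.
assert not np.allclose(laconv_like(feat, q1), laconv_like(feat, q2))
```

This is the core contrast with a standard convolution, whose kernel is identical for every input example.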
Application of multimodal machine learning to visual question answering
Master's Degree in ICT Research and Innovation (i2-ICT). Due to the great advances in Natural Language Processing and Computer Vision in recent years with neural networks and attention mechanisms, great interest in VQA has been awakened, to the point that it is considered the "Visual Turing Test" for modern AI systems: answering a question about an image requires the system to learn to understand and reason about both the image and the question shown. One of the main reasons for this interest is the large number of potential applications these systems enable, such as medical diagnosis from an image, assistants for blind people, e-learning applications, etc. In this Master's thesis, a study of the state of the art of VQA is carried out, investigating both existing techniques and datasets. A development is then undertaken to reproduce state-of-the-art results with the latest VQA models, with the aim of applying them and experimenting on new datasets. Experiments are first carried out with the VQA model MoViE+MCAN [1] [2] (winner of the 2020 VQA Challenge); after observing its non-viability due to resource issues, we switched to the LXMERT model [3], which is pre-trained on 5 subtasks and allows fine-tuning on several downstream tasks, in this specific case the VQA task on the VQA v2.0 [4] dataset. As the main result of this thesis, we experimentally show that, starting from the pre-trained model provided by the GitHub repository [5], LXMERT provides results similar to those of MoViE+MCAN (the best known method for VQA) on the most recent and demanding benchmarks while using fewer resources
What Goes beyond Multi-modal Fusion in One-stage Referring Expression Comprehension: An Empirical Study
Most of the existing work in one-stage referring expression comprehension
(REC) mainly focuses on multi-modal fusion and reasoning, while the influence
of other factors in this task lacks in-depth exploration. To fill this gap, we
conduct an empirical study in this paper. Concretely, we first build a very
simple REC network called SimREC, and ablate 42 candidate designs/settings,
which covers the entire process of one-stage REC from network design to model
training. Afterwards, we conduct over 100 experimental trials on three
benchmark datasets of REC. The extensive experimental results not only show the
key factors that affect REC performance in addition to multi-modal fusion,
e.g., multi-scale features and data augmentation, but also yield some findings
that run counter to conventional understanding. For example, as a vision and
language (V&L) task, REC does is less impacted by language prior. In addition,
with a proper combination of these findings, we can improve the performance of
SimREC by a large margin, e.g., +27.12% on RefCOCO+, which outperforms all
existing REC methods. But the most encouraging finding is that with much less
training overhead and parameters, SimREC can still achieve better performance
than a set of large-scale pre-trained models, e.g., UNITER and VILLA,
highlighting the special role of REC in existing V&L research
PaLI-X: On Scaling up a Multilingual Vision and Language Model
We present the training recipe and results of scaling up PaLI-X, a
multilingual vision and language model, both in terms of size of the components
and the breadth of its training task mixture. Our model achieves new levels of
performance on a wide range of varied and complex tasks, including multiple
image-based captioning and question-answering tasks, image-based document
understanding and few-shot (in-context) learning, as well as object detection,
video question answering, and video captioning. PaLI-X advances the
state-of-the-art on most vision-and-language benchmarks considered (25+ of
them). Finally, we observe emerging capabilities, such as complex counting and
multilingual object detection, tasks that are not explicitly in the training
mix
Medical Image Segmentation Review: The success of U-Net
Automatic medical image segmentation is a crucial topic in the medical domain
and, consequently, a critical component of the computer-aided diagnosis
paradigm. U-Net is the most widespread image segmentation architecture due to
its flexibility, optimized modular design, and success in all medical image
modalities. Over the years, the U-Net model achieved tremendous attention from
academic and industrial researchers. Several extensions of this network have
been proposed to address the scale and complexity created by medical tasks.
Addressing the deficiency of the naive U-Net model is the foremost step for
vendors to utilize the proper U-Net variant model for their business. Having a
compendium of different variants in one place makes it easier for builders to
identify the relevant research. It also helps ML researchers understand the
challenges posed by the biological tasks the model must address. To
address this, we discuss the practical aspects of the U-Net model and suggest a
taxonomy to categorize each network variant. Moreover, to measure the
performance of these strategies in a clinical application, we propose fair
evaluations of some unique and famous designs on well-known datasets. We
provide a comprehensive implementation library with trained models for future
research. In addition, for ease of future studies, we created an online list of
U-Net papers with their possible official implementation. All information is
gathered in the https://github.com/NITR098/Awesome-U-Net repository. Comment: Submitted to the IEEE Transactions on Pattern Analysis and Machine
Intelligence journal
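U-Net's defining mechanism, an encoder-decoder with skip connections that merge fine spatial detail into the upsampled decoder path, can be sketched minimally (a simplification with toy pooling/upsampling in place of learned convolutions):

```python
import numpy as np

def downsample(x):
    # 2x2 average pooling: a toy stand-in for an encoder stage
    return x.reshape(x.shape[0] // 2, 2, x.shape[1] // 2, 2).mean(axis=(1, 3))

def upsample(x):
    # nearest-neighbor upsampling: a toy stand-in for a decoder stage
    return x.repeat(2, axis=0).repeat(2, axis=1)

def unet_like(x):
    """Illustrates the skip connection only: the encoder feature is
    saved before downsampling and concatenated (here, stacked) with
    the upsampled bottleneck, so fine detail is not lost."""
    skip = x                    # high-resolution encoder feature
    bottleneck = downsample(x)  # coarse, more semantic representation
    up = upsample(bottleneck)   # decoder restores spatial resolution
    return np.stack([up, skip], axis=0)

x = np.arange(16, dtype=float).reshape(4, 4)
out = unet_like(x)
assert out.shape == (2, 4, 4)       # coarse context + fine detail channels
assert np.allclose(out[1], x)       # skip path preserves the input detail
```

In the real architecture, each stage is a pair of convolutions, the stack is a channel-wise concatenation, and this pattern repeats at several resolutions.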
HAC-Net: A Hybrid Attention-Based Convolutional Neural Network for Highly Accurate Protein-Ligand Binding Affinity Prediction
Applying deep learning concepts from image detection and graph theory has
greatly advanced protein-ligand binding affinity prediction, a challenge with
enormous ramifications for both drug discovery and protein engineering. We
build upon these advances by designing a novel deep learning architecture
consisting of a 3-dimensional convolutional neural network utilizing
channel-wise attention and two graph convolutional networks utilizing
attention-based aggregation of node features. HAC-Net (Hybrid Attention-Based
Convolutional Neural Network) obtains state-of-the-art results on the PDBbind
v.2016 core set, the most widely recognized benchmark in the field. We
extensively assess the generalizability of our model using multiple train-test
splits, each of which maximizes differences between either protein structures,
protein sequences, or ligand extended-connectivity fingerprints of complexes in
the training and test sets. Furthermore, we perform 10-fold cross-validation
with a similarity cutoff between SMILES strings of ligands in the training and
test sets, and also evaluate the performance of HAC-Net on lower-quality data.
We envision that this model can be extended to a broad range of supervised
learning problems related to structure-based biomolecular property prediction.
All of our software is available as open source at
https://github.com/gregory-kyro/HAC-Net/, and the HACNet Python package is
available through PyPI
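Channel-wise attention of the kind the abstract describes can be sketched in squeeze-and-excitation style (a simplification; HAC-Net's actual module differs): per-channel descriptors are squeezed by global pooling, turned into gates, and used to rescale each channel of the 3D voxel grid.

```python
import numpy as np

def channel_attention(x):
    """Toy channel-wise attention over a (channels, D, H, W) voxel
    grid, e.g. a featurized protein-ligand complex. Each channel is
    rescaled by a gate derived from its own global statistics."""
    squeezed = x.mean(axis=(1, 2, 3))          # global average pool per channel
    gates = 1.0 / (1.0 + np.exp(-squeezed))    # sigmoid gating in [0, 1]
    return x * gates[:, None, None, None]      # rescale each channel

x = np.ones((4, 2, 2, 2))
x[0] *= 5.0  # a channel with stronger activation receives a larger gate
out = channel_attention(x)
assert out.shape == x.shape
assert out[0].mean() > out[1].mean()
```

A learned version would insert small fully connected layers between the pooling and the sigmoid, so the gating is trained rather than fixed.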
Low-Power Computer Vision: Improve the Efficiency of Artificial Intelligence
Energy efficiency is critical for running computer vision on battery-powered systems, such as mobile phones or UAVs (unmanned aerial vehicles, or drones). This book collects the methods that have won the annual IEEE Low-Power Computer Vision Challenges since 2015. The winners share their solutions and provide insight on how to improve the efficiency of machine learning systems