Generating Diverse and Meaningful Captions: Unsupervised Specificity Optimization for Image Captioning
Image Captioning is a task that requires models to acquire a multi-modal understanding of the world and to express this understanding in natural language text. While the state-of-the-art for this task has rapidly improved in terms of n-gram metrics, these models tend to output the same generic captions for similar images. In this work, we address this limitation and train a model that generates more diverse and specific captions through an unsupervised training approach that incorporates a learning signal from an Image Retrieval model. We summarize previous results and improve the state-of-the-art on caption diversity and novelty.
We make our source code publicly available online: https://github.com/AnnikaLindh/Diverse_and_Specific_Image_Captionin
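The abstract does not spell out the training objective, but the core idea of using an Image Retrieval model as an unsupervised learning signal can be sketched as a REINFORCE-style reward: captions that retrieve their own image well are reinforced. The sketch below assumes hypothetical generator and retrieval-model interfaces and is not the authors' exact formulation.

```python
# Sketch (assumed interfaces, not the paper's exact method): use an
# image-text retrieval model's similarity score as a REINFORCE-style
# reward so the caption generator is pushed toward captions that are
# specific enough to retrieve their own image.
import torch
import torch.nn.functional as F

def specificity_loss(generator, retrieval_model, images):
    # Sample captions and keep their per-token log-probabilities
    # (hypothetical generator API).
    captions, log_probs = generator.sample(images)
    with torch.no_grad():
        img_emb = retrieval_model.encode_images(images)    # hypothetical API
        txt_emb = retrieval_model.encode_texts(captions)   # hypothetical API
        # Reward: how well each sampled caption retrieves its own image.
        reward = F.cosine_similarity(img_emb, txt_emb, dim=-1)
        baseline = reward.mean()  # simple baseline for variance reduction
    # REINFORCE: raise the likelihood of captions with above-average reward.
    return -((reward - baseline) * log_probs.sum(dim=-1)).mean()
```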
From Pointwise to Powerhouse: Initialising Neural Networks with Generative Models
Traditional initialisation methods, e.g. He and Xavier, have been effective
in avoiding the problem of vanishing or exploding gradients in neural networks.
However, they only use simple pointwise distributions, which model
one-dimensional variables. Moreover, they ignore most information about the
architecture and disregard past training experiences. These limitations can be
overcome by employing generative models for initialisation. In this paper, we
introduce two groups of new initialisation methods. First, we locally
initialise weight groups by employing variational autoencoders. Secondly, we
globally initialise full weight sets by employing graph hypernetworks. We
thoroughly evaluate the impact of the employed generative models on
state-of-the-art neural networks in terms of accuracy, convergence speed and
ensembling. Our results show that global initialisations result in higher
accuracy and faster initial convergence speed. However, the implementation
through graph hypernetworks leads to diminished ensemble performance on
out-of-distribution data. To counteract this, we propose a modification called
the noise graph hypernetwork, which encourages diversity in the produced
ensemble members.
Furthermore, our approach might be able to transfer learned knowledge to
different image distributions. Our work provides insights into the potential,
the trade-offs and possible modifications of these new initialisation methods.
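As a rough illustration of the "local" variant, one can imagine sampling a layer's weight groups from the decoder of a generative model trained on the weights of previously trained networks. The sketch below uses a toy VAE decoder and treats each output neuron's fan-in as one weight group; the architecture, grouping scheme, and class names are assumptions, not the paper's implementation.

```python
# Toy sketch of the "local" idea: initialise a layer by sampling its weight
# groups from the decoder of a VAE trained on weights of previously trained
# networks. WeightVAE, its sizes, and the grouping scheme are assumptions
# made for illustration only.
import torch
import torch.nn as nn

class WeightVAE(nn.Module):
    def __init__(self, weight_dim, latent_dim=32):
        super().__init__()
        self.latent_dim = latent_dim
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, weight_dim)
        )

    def sample_weights(self, n):
        z = torch.randn(n, self.latent_dim)   # sample latent codes
        return self.decoder(z)                # decode them into weight vectors

def init_linear_from_vae(layer, vae):
    # Treat each output neuron's fan-in vector as one "weight group".
    with torch.no_grad():
        w = vae.sample_weights(layer.out_features)  # (out_features, in_features)
        layer.weight.copy_(w.view_as(layer.weight))
        layer.bias.zero_()

# Usage: initialise one fully connected layer from a (pre-trained) WeightVAE.
vae = WeightVAE(weight_dim=784)
init_linear_from_vae(nn.Linear(784, 256), vae)
```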
Representation Learning with Fine-grained Patterns
With the development of computational power and techniques for data
collection, deep learning demonstrates superior performance over most existing
algorithms on benchmark data sets. Many efforts have been devoted to
studying the mechanism of deep learning. One important observation is that deep
learning can learn the discriminative patterns from raw materials directly in a
task-dependent manner. Therefore, the representations obtained by deep learning
outperform hand-crafted features significantly. However, those patterns are
often learned from super-class labels due to a limited availability of
fine-grained labels, while fine-grained patterns are desired in many real-world
applications such as visual search in online shopping. To mitigate the
challenge, we propose an algorithm to learn the fine-grained patterns
sufficiently when only super-class labels are available. The effectiveness of
our method is guaranteed by our theoretical analysis. Extensive
experiments on real-world data sets demonstrate that the proposed method can
significantly improve the performance on target tasks corresponding to
fine-grained classes, when only super-class information is available for
training.
Low-rank Adaptation Method for Wav2vec2-based Fake Audio Detection
Self-supervised speech models are a rapidly developing research topic in fake
audio detection. Many pre-trained models can serve as feature extractors,
learning richer and higher-level speech features. However, when fine-tuning
pre-trained models, there is often a challenge of excessively long training
times and high memory consumption, and complete fine-tuning is also very
expensive. To alleviate this problem, we apply low-rank adaptation (LoRA) to the
wav2vec2 model, freezing the pre-trained model weights and injecting a
trainable rank-decomposition matrix into each layer of the transformer
architecture, greatly reducing the number of trainable parameters for
downstream tasks. Compared with fine-tuning with Adam on the wav2vec2 model
containing 317M trainable parameters, LoRA achieved similar performance while
reducing the number of trainable parameters by a factor of 198.
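For illustration, a minimal LoRA layer can be sketched as a frozen pre-trained projection plus a trainable rank-r update, which is the mechanism the abstract describes for the transformer layers of wav2vec2. The rank, scaling, and module sizes below are placeholder values, not the paper's configuration.

```python
# Minimal LoRA sketch: the pre-trained projection is frozen and a trainable
# low-rank update B @ A is added in parallel, as in the transformer attention
# layers of wav2vec2. Rank, scaling, and dimensions are illustrative only.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base, r=8, alpha=16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)      # freeze pre-trained weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        # Frozen path plus the low-rank trainable update.
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

# Example: wrap one attention projection and count trainable parameters.
lora_proj = LoRALinear(nn.Linear(768, 768))
print(sum(p.numel() for p in lora_proj.parameters() if p.requires_grad))
```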
A fast multi-object tracking system using an object detector ensemble
Multiple-Object Tracking (MOT) is of crucial importance for applications such
as retail video analytics and video surveillance. Object detectors are often
the computational bottleneck of modern MOT systems, limiting their use for
real-time applications. In this paper, we address this issue by leveraging
an ensemble of detectors, each running every f frames. We measured the
performance of our system in the MOT16 benchmark. The proposed model surpassed
other online entries of the MOT16 challenge in speed, while maintaining an
acceptable accuracy. Published in the 2019 IEEE Colombian Conference on
Applications in Computational Intelligence (ColCACI).
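A minimal sketch of the scheduling idea follows: each detector in the ensemble is invoked only on every f-th frame, staggered round-robin, while a tracker propagates tracks between detections. The detector and tracker objects are placeholders; the paper's association logic is not reproduced here.

```python
# Sketch of the frame-scheduling idea: each detector in the ensemble runs
# only on every f-th frame (staggered round-robin), so the per-frame
# detection cost stays low, while the tracker propagates tracks in between.
# `detectors` and `tracker` are placeholder objects for illustration.
def run_tracking(frames, detectors, tracker, f):
    for t, frame in enumerate(frames):
        slot = t % f
        # Only one detector fires on this frame; each individual detector
        # therefore runs once every f frames.
        detections = detectors[slot].detect(frame) if slot < len(detectors) else []
        tracker.update(frame, detections)   # associate detections with tracks
        yield tracker.current_tracks()
```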
Visual In-Context Learning for Few-Shot Eczema Segmentation
Automated diagnosis of eczema from digital camera images is crucial for
developing applications that allow patients to self-monitor their recovery. An
important component of this is the segmentation of eczema region from such
images. Current methods for eczema segmentation rely on deep neural networks
such as convolutional (CNN)-based U-Net or transformer-based Swin U-Net. While
effective, these methods require a high volume of annotated data, which can be
difficult to obtain. Here, we investigate the capabilities of visual in-context
learning that can perform few-shot eczema segmentation with just a handful of
examples and without any need for retraining models. Specifically, we propose a
strategy for applying in-context learning for eczema segmentation with a
generalist vision model called SegGPT. When benchmarked on a dataset of
annotated eczema images, we show that SegGPT with just 2 representative example
images from the training dataset performs better (mIoU: 36.69) than a CNN U-Net
trained on 428 images (mIoU: 32.60). We also discover that using a larger
number of examples for SegGPT may in fact be harmful to its performance. Our
results highlight the importance of visual in-context learning in developing
faster and better solutions to skin imaging tasks. They also pave the way for
developing inclusive solutions that cater to demographic minorities who are
typically heavily under-represented in training data.
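To make the in-context protocol concrete, the sketch below shows the usage pattern described above: a couple of annotated (image, mask) example pairs serve as the prompt for a generalist segmentation model, with no retraining, and predictions are scored with a binary mIoU as in the abstract. The model.segment interface is a hypothetical stand-in, not the actual SegGPT API.

```python
# Sketch of the few-shot in-context protocol: a handful of annotated
# (image, mask) pairs act as the prompt, the query image is segmented with
# no retraining, and the result is scored with a binary mIoU.
# `model.segment` is a hypothetical interface, not the real SegGPT API.
import numpy as np

def few_shot_segment(model, prompt_pairs, query_image):
    # prompt_pairs: e.g. two representative (image, mask) tuples.
    prompt_images = [img for img, _ in prompt_pairs]
    prompt_masks = [msk for _, msk in prompt_pairs]
    return model.segment(query_image, prompt_images, prompt_masks)

def mean_iou(pred_mask, gt_mask):
    # Mean IoU over the background (0) and eczema (1) classes.
    ious = []
    for cls in (0, 1):
        inter = np.logical_and(pred_mask == cls, gt_mask == cls).sum()
        union = np.logical_or(pred_mask == cls, gt_mask == cls).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))
```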