Maximum-Entropy Adversarial Audio Augmentation for Keyword Spotting
Data augmentation is a key tool for improving the performance of deep
networks, particularly when there is limited labeled data. In some fields, such
as computer vision, augmentation methods have been extensively studied;
however, for speech and audio data, there are relatively fewer methods
developed. Using adversarial learning as a starting point, we develop a simple
and effective augmentation strategy based on taking the gradient of the entropy
of the outputs with respect to the inputs and then creating new data points by
moving in the direction of the gradient to maximize the entropy. We validate
its efficacy on several keyword spotting tasks as well as standard audio
benchmarks. Our method is straightforward to implement, offering greater
computational efficiency than more complex adversarial schemes like GANs.
Despite its simplicity, it proves robust and effective, especially when
combined with the established SpecAugment technique, leading to enhanced
performance.
Comment: 5 pages, 2 figures
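The augmentation step described in this abstract (take the gradient of the output entropy with respect to the input, then move the input along that gradient) can be sketched as follows. This is only an illustration under stated assumptions: the classifier is a hypothetical linear softmax model with weights `W` and bias `b`, so the gradient can be written in closed form, and the normalized step of size `step` is a choice made here, not something the abstract specifies.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def entropy(p, tiny=1e-12):
    # Shannon entropy of each row of class probabilities.
    return -(p * np.log(p + tiny)).sum(axis=-1)

def entropy_grad_wrt_input(x, W, b):
    # Assumed model: logits z = x @ W.T + b, probabilities p = softmax(z).
    # For H = -sum_i p_i log p_i, dH/dz_j = -p_j * (log p_j + H),
    # and the chain rule through the linear layer gives dH/dx = dH/dz @ W.
    p = softmax(x @ W.T + b)
    h = entropy(p)[..., None]
    dH_dz = -p * (np.log(p + 1e-12) + h)
    return dH_dz @ W

def max_entropy_augment(x, W, b, step=0.1):
    # Create new inputs by moving each x along its entropy gradient.
    # Normalizing the gradient and the step size are assumptions.
    g = entropy_grad_wrt_input(x, W, b)
    norm = np.linalg.norm(g, axis=-1, keepdims=True)
    return x + step * g / np.maximum(norm, 1e-12)
```

With a deep network the same gradient would normally come from automatic differentiation rather than a hand-derived formula; the augmented points would then be added to the training batch with their original labels.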
Learning Semantically Enhanced Feature for Fine-Grained Image Classification
We aim to provide a computationally cheap yet effective approach for
fine-grained image classification (FGIC) in this letter. Unlike previous
methods that rely on complex part localization modules, our approach learns
fine-grained features by enhancing the semantics of sub-features of a global
feature. Specifically, we first obtain sub-feature semantics by arranging
feature channels of a CNN into different groups through channel permutation.
Meanwhile, to enhance the discriminability of sub-features, the groups are
guided to be activated on object parts with strong discriminability by a
weighted combination regularization. Our approach is parameter parsimonious and
can be easily integrated into the backbone model as a plug-and-play module for
end-to-end training with only image-level supervision. Experiments verified the
effectiveness of our approach and validated its comparable performance to the
state-of-the-art methods. Code is available at https://github.com/cswluo/SEF
Comment: Accepted by IEEE Signal Processing Letters. 5 pages, 4 figures, 4 tables
Feature Fusion Vision Transformer for Fine-Grained Visual Categorization
The core of tackling fine-grained visual categorization (FGVC) is to
learn subtle yet discriminative features. Most previous works achieve this by
explicitly selecting the discriminative parts or integrating the attention
mechanism via CNN-based approaches. However, these methods increase the
computational complexity and make the model dominated by the regions containing
most of the objects. Recently, the vision transformer (ViT) has achieved SOTA
performance on general image recognition tasks. The self-attention mechanism
aggregates and weights the information from all patches to the classification
token, making it perfectly suitable for FGVC. Nonetheless, the classification
token in the deep layer pays more attention to the global information, lacking
the local and low-level features that are essential for FGVC. In this work, we
propose a novel pure transformer-based framework, Feature Fusion Vision
Transformer (FFVT), where we aggregate the important tokens from each transformer
layer to compensate for the local, low-level and middle-level information. We design
a novel token selection module called mutual attention weight selection (MAWS)
to guide the network effectively and efficiently towards selecting
discriminative tokens without introducing extra parameters. We verify the
effectiveness of FFVT on three benchmarks where FFVT achieves
state-of-the-art performance.
Comment: 9 pages, 2 figures, 3 tables
Deep Collective Knowledge Distillation
Many existing studies on knowledge distillation have focused on methods in
which a student model mimics a teacher model well.
Simply imitating the teacher's knowledge, however, is not sufficient for the
student to surpass the teacher.
We explore a method to harness the knowledge of other students to complement
the knowledge of the teacher.
We propose deep collective knowledge distillation for model compression,
called DCKD, which is a method for training student models with rich
information to acquire knowledge from not only their teacher model but also
other student models.
The knowledge collected from several student models consists of a wealth of
information about the correlation between classes.
Our DCKD considers how to increase the correlation knowledge of classes
during training.
Our novel method enables us to create better performing student models for
collecting knowledge.
This simple yet powerful method achieves state-of-the-art performances in
many experiments.
For example, for ImageNet, ResNet18 trained with DCKD achieves 72.27%, which
outperforms the pretrained ResNet18 by 2.52%.
For CIFAR-100, the student model of ShuffleNetV1 with DCKD achieves 6.55%
higher top-1 accuracy than the pretrained ShuffleNetV1.
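The idea of training a student on knowledge from both its teacher and other students can be sketched as a combined loss. This is a generic illustration, not the DCKD objective itself, which the abstract does not spell out: the function name `collective_distillation_loss`, the temperature `T`, the weights `alpha` and `beta`, and the simple averaging over peers are all assumptions.

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled, numerically stable softmax over the last axis.
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_div(p, q, tiny=1e-12):
    # Batch mean of KL(p || q) between rows of two probability matrices.
    return (p * (np.log(p + tiny) - np.log(q + tiny))).sum(axis=-1).mean()

def collective_distillation_loss(student_logits, teacher_logits,
                                 peer_logits_list, labels,
                                 T=4.0, alpha=1.0, beta=1.0):
    # Hard-label cross-entropy on the student's own predictions.
    n = student_logits.shape[0]
    p_student = softmax(student_logits)
    ce = -np.log(p_student[np.arange(n), labels] + 1e-12).mean()

    # Soft-target term from the teacher (scaled by T^2, as is common in
    # temperature-based distillation).
    s_T = softmax(student_logits, T)
    kd_teacher = kl_div(softmax(teacher_logits, T), s_T) * T ** 2

    # Soft-target term from the other students, averaged over peers
    # (assumption: equal weight per peer).
    if peer_logits_list:
        kd_peers = np.mean([kl_div(softmax(z, T), s_T) for z in peer_logits_list]) * T ** 2
    else:
        kd_peers = 0.0

    return ce + alpha * kd_teacher + beta * kd_peers
```

When the teacher and all peers produce the same logits as the student, both distillation terms vanish and the loss reduces to plain cross-entropy.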
Fine-grained Recognition: Accounting for Subtle Differences between Similar Classes
The main requisite for the fine-grained recognition task is to focus on subtle
discriminative details that make the subordinate classes different from each
other. We note that existing methods implicitly address this requirement and
leave it to a data-driven pipeline to figure out what makes a subordinate class
different from the others. This results in two major limitations: First, the
network focuses on the most obvious distinctions between classes and overlooks
more subtle inter-class variations. Second, the chance of misclassifying a
given sample in any of the negative classes is considered equal, while in fact,
confusions generally occur among only the most similar classes. Here, we
propose to explicitly force the network to find the subtle differences among
closely related classes. In this pursuit, we introduce two key novelties that
can be easily plugged into existing end-to-end deep learning pipelines. On one
hand, we introduce a diversification block which masks the most salient features
for an input to force the network to use more subtle cues for its correct
classification. Concurrently, we introduce a gradient-boosting loss function
that focuses only on the confusing classes for each sample and therefore moves
swiftly along the direction on the loss surface that seeks to resolve these
ambiguities. The synergy between these two blocks helps the network to learn
more effective feature representations. Comprehensive experiments are performed
on five challenging datasets. Our approach outperforms existing methods under
similar experimental settings on all five datasets.
Comment: To appear in AAAI 202
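A loss that "focuses only on the confusing classes for each sample" can be read as a cross-entropy whose softmax denominator keeps the true class and only the k highest-scoring negative classes. The sketch below follows that reading; the function name `topk_confusion_cross_entropy` and the choice of `k` are assumptions, and the abstract does not confirm that this is the exact form used in the paper.

```python
import numpy as np

def topk_confusion_cross_entropy(logits, labels, k=3):
    # logits: (n, c) class scores; labels: (n,) integer class indices.
    n, c = logits.shape
    assert 1 <= k <= c - 1, "k must leave out at least the true class"
    rows = np.arange(n)

    # Exclude the true class, then pick the k highest-scoring negatives,
    # i.e. the classes the sample is most likely to be confused with.
    neg = logits.copy()
    neg[rows, labels] = -np.inf
    topk_idx = np.argpartition(-neg, k - 1, axis=1)[:, :k]
    topk = np.take_along_axis(neg, topk_idx, axis=1)

    # Cross-entropy restricted to {true class} + {top-k negatives},
    # computed with a stable log-sum-exp.
    true = logits[rows, labels][:, None]
    selected = np.concatenate([true, topk], axis=1)
    m = selected.max(axis=1, keepdims=True)
    lse = m[:, 0] + np.log(np.exp(selected - m).sum(axis=1))
    return (lse - true[:, 0]).mean()
```

With `k = c - 1` this reduces to ordinary cross-entropy; smaller `k` drops the easy negatives from the denominator, so they contribute no gradient.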