14 research outputs found
Unsupervised Contrastive Learning of Sound Event Representations
Self-supervised representation learning can mitigate the limitations of
recognition tasks with little manually labeled but abundant unlabeled
data---a common scenario in sound event research. In this work, we explore
unsupervised contrastive learning as a way to learn sound event
representations. To this end, we propose to use the pretext task of contrasting
differently augmented views of sound events. The views are computed primarily
via mixing of training examples with unrelated backgrounds, followed by other
data augmentations. We analyze the main components of our method via ablation
experiments. We evaluate the learned representations using linear evaluation,
and in two in-domain downstream sound event classification tasks, namely, using
limited manually labeled data, and using noisy labeled data. Our results
suggest that unsupervised contrastive pre-training can mitigate the impact of
data scarcity and increase robustness against noisy labels, outperforming
supervised baselines.
Comment: A 4-page version is submitted to ICASSP 202
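The core data pipeline described above (views computed by mixing each training example with an unrelated background, followed by further augmentation) can be sketched as follows. This is a minimal numpy illustration, not the authors' implementation: the SNR range, the mixing-by-target-SNR rule, and the circular time shift standing in for "other data augmentations" are all assumptions.

```python
import numpy as np

def mix_with_background(clip, background, snr_db):
    """Mix a sound event with an unrelated background at a given SNR (dB)."""
    clip_power = np.mean(clip ** 2)
    bg_power = np.mean(background ** 2) + 1e-12
    # Scale the background so clip-to-background power ratio equals snr_db.
    target_bg_power = clip_power / (10.0 ** (snr_db / 10.0))
    return clip + background * np.sqrt(target_bg_power / bg_power)

def two_views(clip, backgrounds, rng):
    """Return two differently augmented views of the same event for contrasting."""
    views = []
    for _ in range(2):
        bg = backgrounds[rng.integers(len(backgrounds))]
        view = mix_with_background(clip, bg, snr_db=rng.uniform(0.0, 20.0))
        view = np.roll(view, rng.integers(len(view)))  # extra augmentation: time shift
        views.append(view)
    return views
```

A contrastive objective (e.g. an NT-Xent-style loss) would then pull the two views of the same clip together and push views of different clips apart.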
DFM-X: Augmentation by Leveraging Prior Knowledge of Shortcut Learning
Neural networks are prone to learn easy solutions from superficial statistics
in the data, namely shortcut learning, which impairs generalization and
robustness of models. We propose a data augmentation strategy, named DFM-X,
that leverages knowledge about frequency shortcuts, encoded in Dominant
Frequencies Maps computed for image classification models. We randomly select
X% training images of certain classes for augmentation, and process them by
retaining the frequencies included in the DFMs of other classes. This strategy
compels the models to leverage a broader range of frequencies for
classification, rather than relying on specific frequency sets. Thus, the
models learn deeper, more task-related semantics than their counterparts
trained with standard setups. Unlike other commonly used augmentation
techniques which focus on increasing the visual variations of training data,
our method aims to exploit the original data efficiently, by distilling
prior knowledge about the destructive learning behavior of models from data. Our
experimental results demonstrate that DFM-X improves robustness against common
corruptions and adversarial attacks. It can be seamlessly integrated with other
augmentation techniques to further enhance the robustness of models.
Comment: Accepted at ICCVW202
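The augmentation step described above (keep only the frequencies in another class's Dominant Frequencies Map for X% of selected images) can be sketched with a 2-D FFT mask. This is a hedged illustration: the DFM computation itself, the mask layout, and the class-selection rule are assumptions, since the abstract does not specify them.

```python
import numpy as np

def dfm_filter(image, dfm_mask):
    """Retain only the 2-D frequencies flagged in dfm_mask (boolean, FFT layout)."""
    spectrum = np.fft.fft2(image)
    return np.real(np.fft.ifft2(spectrum * dfm_mask))

def dfm_x_augment(images, labels, class_dfms, frac, rng):
    """Filter a random frac of the images with the DFM of a *different* class."""
    out = images.astype(float)
    chosen = rng.choice(len(images), size=int(frac * len(images)), replace=False)
    for i in chosen:
        others = [c for c in class_dfms if c != labels[i]]
        c = others[rng.integers(len(others))]
        out[i] = dfm_filter(out[i], class_dfms[c])
    return out
```

Filtering an image through a different class's dominant frequencies removes the frequency shortcut for its own class, which is what forces the model onto a broader frequency range.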
Benign Shortcut for Debiasing: Fair Visual Recognition via Intervention with Shortcut Features
Machine learning models often learn to make predictions that rely on
sensitive social attributes like gender and race, which poses significant
fairness risks, especially in societal applications, such as hiring, banking,
and criminal justice. Existing work tackles this issue by minimizing the
information about social attributes encoded in models. However,
the high correlation between target task and these social attributes makes
learning on the target task incompatible with debiasing. Given that model bias
arises due to the learning of bias features (\emph{e.g.}, gender) that help
target task optimization, we explore the following research question: \emph{Can
we leverage shortcut features to replace the role of bias feature in target
task optimization for debiasing?} To this end, we propose \emph{Shortcut
Debiasing}, to first transfer the target task's learning of bias attributes
from bias features to shortcut features, and then employ causal intervention to
eliminate shortcut features during inference. The key idea of \emph{Shortcut
Debiasing} is to design controllable shortcut features that, on the one hand,
replace bias features in contributing to the target task during the training
stage and, on the other hand, are easily removed by intervention at inference.
This guarantees the learning of the target task does not hinder the elimination
of bias features. We apply \emph{Shortcut Debiasing} to several benchmark
datasets, and achieve significant improvements over the state-of-the-art
debiasing methods in both accuracy and fairness.
Comment: arXiv admin note: text overlap with arXiv:2211.0125
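The train/inference asymmetry described above can be sketched on raw feature vectors: attach a controllable shortcut channel carrying the bias attribute during training, then overwrite it with a constant at inference (the causal intervention). This is a toy numpy sketch under assumed representations; the paper operates on learned features, not hand-built ones.

```python
import numpy as np

def attach_shortcut(features, bias_attr, strength=1.0):
    """Training stage: append a controllable shortcut channel that encodes the
    bias attribute, so the target task draws on it instead of bias features."""
    return np.concatenate([features, strength * bias_attr.reshape(-1, 1)], axis=1)

def intervene(features_with_shortcut, neutral_value=0.0):
    """Inference stage: causal intervention -- set the shortcut channel to a
    constant, removing the bias pathway while leaving other features intact."""
    out = features_with_shortcut.copy()
    out[:, -1] = neutral_value
    return out
```

Because the shortcut channel is designed and controllable, removing it is a one-line intervention, whereas bias features entangled in the representation are not separable this way.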
Latent-OFER: Detect, Mask, and Reconstruct with Latent Vectors for Occluded Facial Expression Recognition
Most research on facial expression recognition (FER) is conducted in highly
controlled environments, but its performance is often unacceptable when applied
to real-world situations. This is because when unexpected objects occlude the
face, the FER network faces difficulties extracting facial features and
accurately predicting facial expressions. Therefore, occluded FER (OFER) is a
challenging problem. Previous studies on occlusion-aware FER have typically
required fully annotated facial images for training. However, collecting facial
images with various occlusions and expression annotations is time-consuming and
expensive. Latent-OFER, the proposed method, can detect occlusions, restore
occluded parts of the face as if they were unoccluded, and recognize them,
improving FER accuracy. This approach involves three steps: First, the vision
transformer (ViT)-based occlusion patch detector masks the occluded position by
training only latent vectors from the unoccluded patches using the support
vector data description algorithm. Second, the hybrid reconstruction network
reconstructs the masked positions into a complete image using the ViT and a
convolutional neural network (CNN). Last, the expression-relevant latent vector
extractor retrieves and uses expression-related information from all latent
vectors by applying a CNN-based class activation map. This mechanism has a
significant advantage in preventing performance degradation from occlusion by
unseen objects. The experimental results on several databases demonstrate the
superiority of the proposed method over state-of-the-art methods.
Comment: 11 pages, 8 figures
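The first step above (flagging occluded patches by training only on latent vectors of unoccluded patches) can be sketched with a minimal one-class scorer in the spirit of support vector data description: fit a center on clean-patch latents and flag anything outside a learned radius. The quantile-based radius and the Euclidean distance are assumptions; the actual method uses the SVDD algorithm on ViT patch latents.

```python
import numpy as np

class LatentPatchDetector:
    """Minimal SVDD-flavored one-class scorer: patches whose latent vector lies
    far from the center of unoccluded-patch latents are flagged as occluded."""

    def fit(self, unoccluded_latents, quantile=0.95):
        self.center = unoccluded_latents.mean(axis=0)
        dists = np.linalg.norm(unoccluded_latents - self.center, axis=1)
        self.radius = np.quantile(dists, quantile)  # assumed threshold rule
        return self

    def occluded(self, latents):
        return np.linalg.norm(latents - self.center, axis=1) > self.radius
```

Patches flagged this way would then be masked and handed to the reconstruction network described in the second step.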
MASKER: Masked Keyword Regularization for Reliable Text Classification
Pre-trained language models have achieved state-of-the-art accuracies on
various text classification tasks, e.g., sentiment analysis, natural language
inference, and semantic textual similarity. However, the reliability of the
fine-tuned text classifiers is an often overlooked performance criterion. For
instance, one may desire a model that can detect out-of-distribution (OOD)
samples (drawn far from training distribution) or be robust against domain
shifts. We claim that one central obstacle to the reliability is the
over-reliance of the model on a limited number of keywords, instead of looking
at the whole context. In particular, we find that (a) OOD samples often contain
in-distribution keywords, while (b) cross-domain samples may not always contain
keywords; over-relying on the keywords can be problematic for both cases. In
light of this observation, we propose a simple yet effective fine-tuning
method, coined masked keyword regularization (MASKER), that facilitates
context-based prediction. MASKER regularizes the model to reconstruct the
keywords from the rest of the words and make low-confidence predictions without
enough context. When applied to various pre-trained language models (e.g.,
BERT, RoBERTa, and ALBERT), we demonstrate that MASKER improves OOD detection
and cross-domain generalization without degrading classification accuracy. Code
is available at https://github.com/alinlab/MASKER.
Comment: AAAI 2021. First two authors contributed equally.
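The masking half of the regularizer described above can be sketched in a few lines: pick candidate keywords, then replace their occurrences so the classifier must lean on the remaining context. This is a hedged stand-in: MASKER derives keywords from model attention or frequency statistics and adds reconstruction and confidence losses, whereas here plain token frequency and whitespace tokenization are assumptions.

```python
from collections import Counter

def select_keywords(corpus, k):
    """Stand-in keyword scorer: take the k most frequent tokens across the
    corpus (the paper uses attention/frequency-based keyword scores)."""
    counts = Counter(tok for doc in corpus for tok in doc.split())
    return {w for w, _ in counts.most_common(k)}

def mask_keywords(doc, keywords, mask_token="[MASK]"):
    """Replace keyword tokens so the classifier must attend to the context; a
    reconstruction head would then be trained to recover the masked words."""
    return " ".join(mask_token if tok in keywords else tok for tok in doc.split())
```

Training the model to reconstruct the masked keywords from the rest of the sentence, and to output low-confidence predictions when context is insufficient, is what discourages the over-reliance on keywords the abstract identifies.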