74 research outputs found
k-Same-Siamese-GAN: k-Same Algorithm with Generative Adversarial Network for Facial Image De-identification with Hyperparameter Tuning and Mixed Precision Training
For a data holder, such as a hospital or a government entity, that holds a
private collection of personal data in which revealing or processing
personally identifiable information is restricted or prohibited by law, a
challenging question arises: how can the data holder conceal the identity of
each individual in the imagery of personal data while still preserving certain
useful aspects of the data after de-identification?
In this work, we propose an approach towards high-resolution facial image
de-identification, called k-Same-Siamese-GAN, which leverages the
k-Same-Anonymity mechanism, the Generative Adversarial Network, and the
hyperparameter tuning methods. Moreover, to speed up model training and reduce
memory consumption, mixed precision training is also applied, so that the
proposed kSS-GAN both provides guarantees of privacy protection on closed-form
identities and can be trained much more efficiently. Finally, to validate its
applicability, the proposed work has been applied to actual datasets, RafD and
CelebA, for performance testing. Besides protecting the privacy of
high-resolution facial images, the proposed system is also shown to automate
hyperparameter tuning and to overcome the limitation on the number of
adjustable parameters.
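The mixed precision technique mentioned above generally computes in FP16 while keeping FP32 master weights, with loss scaling so that small gradients do not underflow FP16's narrow range. A minimal numeric sketch of the loss-scaling idea (illustrative values only, not the kSS-GAN implementation):

```python
import numpy as np

# A tiny gradient that underflows to zero when cast straight to FP16
grad_fp32 = np.float32(1e-8)
lost = np.float16(grad_fp32)            # 0.0: below FP16's smallest subnormal

# Loss scaling: multiply before the FP16 cast, divide after in FP32
scale = np.float32(1024.0)
scaled = np.float16(grad_fp32 * scale)  # now representable in FP16
recovered = np.float32(scaled) / scale  # unscale back in full precision
```

In practice a framework's AMP machinery picks the scale dynamically; the fixed 1024 here only illustrates why scaling preserves gradient information.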
Training strategy for a lightweight countermeasure model for automatic speaker verification
The countermeasure (CM) model is developed to protect Automatic Speaker
Verification (ASV) systems from spoof attacks and prevent resulting personal
information leakage. Based on practicality and security considerations, the CM
model is usually deployed on edge devices, which have more limited computing
resources and storage space than cloud-based systems. This work proposes
training strategies for a lightweight CM model for ASV, using generalized
end-to-end (GE2E) pre-training and adversarial fine-tuning to improve
performance, and applying knowledge distillation (KD) to reduce the size of the
CM model. In the evaluation phase of the ASVspoof 2021 Logical Access task, the
lightweight ResNetSE model reaches min t-DCF 0.2695 and EER 3.54%. Compared to
the teacher model, the lightweight student model uses only 22.5% of the
parameters and 21.1% of the multiply-accumulate operations of the teacher model.
Comment: ASVspoof202
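Knowledge distillation of the kind described here typically trains the small student to match the teacher's softened output distribution. A generic sketch of the standard temperature-scaled KD objective (Hinton-style; the paper's exact loss and weighting are assumptions here):

```python
import numpy as np

def softmax(logits, T=1.0):
    z = np.asarray(logits, dtype=np.float64) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def kd_loss(student_logits, teacher_logits, T=4.0):
    """KL(teacher || student) on temperature-softened outputs, scaled by T^2."""
    p = softmax(teacher_logits, T)  # soft targets from the large teacher
    q = softmax(student_logits, T)  # student predictions
    return float(T * T * np.sum(p * (np.log(p) - np.log(q))))

# A student that matches the teacher incurs ~zero loss; a diverging one does not
same = kd_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0])
diff = kd_loss([-1.0, 0.5, 2.0], [2.0, 0.5, -1.0])
```

This distillation term is usually mixed with the ordinary hard-label loss when training the compact CM model.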
Personalized Audio Quality Preference Prediction
This paper proposes to use both audio input and subject information to
predict the personalized preference of two audio segments with the same content
in different qualities. A siamese network is used to compare the inputs and
predict the preference. Several different structures for each side of the
siamese network are investigated, and an LDNet with PANNs' CNN6 as the encoder
and a multi-layer perceptron block as the decoder yields the largest
improvement over a baseline model using only audio input, raising the overall
accuracy from 77.56% to 78.04%. Experimental results also show that using all
the subject information, including age, gender, and the specifications of
headphones or earphones, is more effective than using only part of it.
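A siamese comparison of the kind described runs both audio segments through one shared encoder and decodes the embedding difference together with the subject features into a preference probability. A toy sketch with made-up dimensions (not the paper's LDNet/CNN6 architecture; the subject vector and layer sizes are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))    # encoder weights shared by both branches
w_out = rng.standard_normal(8 + 3)  # decoder over (embedding diff, subject info)

def encode(segment):
    """Shared encoder: both sides of the siamese network use the same W."""
    return np.tanh(W @ segment)

def preference(seg_a, seg_b, subject):
    """P(listener prefers segment A over segment B)."""
    diff = encode(seg_a) - encode(seg_b)          # siamese comparison
    z = w_out @ np.concatenate([diff, subject])
    return 1.0 / (1.0 + np.exp(-z))               # sigmoid decoder

a, b = rng.standard_normal(16), rng.standard_normal(16)
subject = np.array([0.3, 1.0, 0.0])  # e.g. normalized age, gender, headphone flag
p = preference(a, b, subject)
```

Sharing the encoder weights is what makes the comparison consistent: both segments are mapped into the same embedding space before the decoder judges them.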
Multimodal Transformer Distillation for Audio-Visual Synchronization
Audio-visual synchronization aims to determine whether the mouth movements
and speech in the video are synchronized. VocaLiST reaches state-of-the-art
performance by incorporating multimodal Transformers to model audio-visual
interaction information. However, it requires high computing resources, making
it impractical for real-world applications. This paper proposes MTDVocaLiST, a
model trained with our proposed multimodal Transformer distillation (MTD)
loss. The MTD loss enables MTDVocaLiST to deeply mimic the cross-attention
distributions and value relations in the Transformer of VocaLiST. Our proposed
method is effective in two respects. From the distillation-method perspective,
the MTD loss outperforms other strong distillation baselines. From the
distilled model's performance perspective: 1) MTDVocaLiST outperforms the
similar-size SOTA models SyncNet and PM by 15.69% and 3.39%, respectively;
2) MTDVocaLiST reduces the model size of VocaLiST by 83.52% while still
maintaining similar performance.
Comment: Submitted to ICASSP 202
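Distilling attention behavior, as described above, can be framed as minimizing the divergence between the teacher's and student's attention distributions. A simplified sketch of that idea (plain NumPy; the actual MTD loss also covers value relations and is defined over VocaLiST's Transformer layers):

```python
import numpy as np

def attention(Q, K):
    """Row-wise softmax attention distribution over the keys."""
    scores = Q @ K.T / np.sqrt(K.shape[1])
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attn_distill_loss(student, teacher, eps=1e-12):
    """Mean KL divergence between teacher and student attention rows."""
    t, s = teacher + eps, student + eps
    return float(np.mean(np.sum(t * (np.log(t) - np.log(s)), axis=-1)))

rng = np.random.default_rng(1)
Q = rng.standard_normal((4, 8))   # e.g. audio-side queries
K = rng.standard_normal((6, 8))   # e.g. visual-side keys
teacher_attn = attention(Q, K)
perfect = attn_distill_loss(teacher_attn, teacher_attn)       # exact mimic
off = attn_distill_loss(attention(Q + 0.5, K), teacher_attn)  # imperfect mimic
```

The loss is zero exactly when the student reproduces the teacher's attention rows, which is the "deep mimicry" the abstract refers to.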
WC-SBERT: Zero-Shot Text Classification via SBERT with Self-Training for Wikipedia Categories
Our research focuses on solving the zero-shot text classification problem in
NLP, with a particular emphasis on innovative self-training strategies. To
achieve this objective, we propose a novel self-training strategy that uses
labels rather than text for training, significantly reducing the model's
training time. Specifically, we use categories from Wikipedia as our training
set and leverage the SBERT pre-trained model to establish positive correlations
between pairs of categories within the same text, facilitating associative
training. For new test datasets, we have improved the original self-training
approach, eliminating the need for prior training and testing data from each
target dataset. Instead, we adopt Wikipedia as a unified training dataset to
better approximate the zero-shot scenario. This modification allows for rapid
fine-tuning and inference across different datasets, greatly reducing the time
required for self-training. Our experimental results demonstrate that this
method can adapt the model to the target dataset within minutes. Compared to
other BERT-based transformer models, our approach significantly reduces the
amount of training data by training only on labels, not the actual text, and
greatly improves training efficiency by utilizing a unified training set.
Additionally, our method achieves state-of-the-art results on both the Yahoo
Topic and AG News datasets.
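At inference time, embedding-similarity zero-shot classification picks the candidate label whose embedding lies closest to the text's. The sketch below uses a toy bucketed bag-of-words encoder purely as a stand-in for SBERT (the encoder, labels, and example are illustrative assumptions, and the paper's Wikipedia-category self-training step is not shown):

```python
import numpy as np

def embed(text, dim=256):
    """Toy stand-in for an SBERT encoder: bucketed bag-of-words, L2-normalized."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[sum(ord(c) for c in tok) % dim] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm else v

def zero_shot_classify(text, candidate_labels):
    """Return the label whose embedding is most cosine-similar to the text's."""
    t = embed(text)
    sims = [float(t @ embed(label)) for label in candidate_labels]
    return candidate_labels[int(np.argmax(sims))]

labels = ["sports", "technology", "politics"]
pred = zero_shot_classify("the team won the sports championship", labels)
```

With a real sentence encoder the same nearest-label rule works for unseen classes, since labels and texts share one embedding space.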