Training strategy for a lightweight countermeasure model for automatic speaker verification
The countermeasure (CM) model is developed to protect Automatic Speaker
Verification (ASV) systems from spoof attacks and prevent resulting personal
information leakage. Based on practicality and security considerations, the CM
model is usually deployed on edge devices, which have more limited computing
resources and storage space than cloud-based systems. This work proposes
training strategies for a lightweight CM model for ASV, using generalized
end-to-end (GE2E) pre-training and adversarial fine-tuning to improve
performance, and applying knowledge distillation (KD) to reduce the size of the
CM model. In the evaluation phase of the ASVspoof 2021 Logical Access task, the
lightweight ResNetSE model reaches a min t-DCF of 0.2695 and an EER of 3.54%. Compared to
the teacher model, the lightweight student model uses only 22.5% of the parameters
and 21.1% of the multiply-accumulate (MAC) operations.
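As a hedged illustration of the knowledge-distillation step described above (the GE2E pre-training and adversarial fine-tuning stages are not shown), here is a minimal sketch of a temperature-scaled KD loss in the style of Hinton et al.; all names are illustrative, not the paper's code:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """KL divergence between the softened teacher and student
    distributions, scaled by T^2 as in the classic KD formulation."""
    p = softmax(teacher_logits, temperature)  # soft teacher targets
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return temperature ** 2 * kl

# Identical logits give zero distillation loss.
print(kd_loss([2.0, -1.0], [2.0, -1.0]))  # → 0.0
```

A higher temperature flattens both distributions, exposing the teacher's relative confidence across classes rather than only its top prediction.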
Personalized Audio Quality Preference Prediction
This paper proposes to use both audio input and subject information to
predict the personalized preference of two audio segments with the same content
in different qualities. A siamese network is used to compare the inputs and
predict the preference. Several different structures for each side of the
siamese network are investigated, and an LDNet with PANNs' CNN6 as the encoder
and a multi-layer perceptron block as the decoder achieves the largest gain over a
baseline model that uses only audio input, raising the overall accuracy from 77.56%
to 78.04%. Experimental results also show that using all of the subject
information, including age, gender, and the specifications of headphones or
earphones, is more effective than using only part of it.
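The siamese comparison described above can be sketched as follows; this is a toy stand-in (a single linear-plus-tanh layer plays the role of the CNN6 encoder, and all names and weights are hypothetical), not the paper's implementation:

```python
import math

def encode(features, enc_weights):
    """Shared encoder: one linear layer + tanh, a toy stand-in for CNN6."""
    return [math.tanh(sum(w * x for w, x in zip(row, features)))
            for row in enc_weights]

def predict_preference(seg_a, seg_b, subject, enc_weights, dec_weights):
    """Siamese comparison: both segments pass through the SAME encoder;
    the embedding difference is concatenated with subject info (age,
    gender, headphone specs) and scored by a linear decoder + sigmoid.
    Returns P(segment A is preferred)."""
    emb_a = encode(seg_a, enc_weights)
    emb_b = encode(seg_b, enc_weights)
    joint = [a - b for a, b in zip(emb_a, emb_b)] + subject
    score = sum(w * x for w, x in zip(dec_weights, joint))
    return 1 / (1 + math.exp(-score))

# Identical segments with no subject contribution → no preference (0.5).
w = [[0.5, -0.2], [0.1, 0.3]]
print(predict_preference([1.0, 2.0], [1.0, 2.0], [0.0], w, [1.0, 1.0, 0.0]))
```

Because the two sides share weights, swapping the segments negates the embedding difference, so (when the subject term is neutral) the two orderings produce complementary probabilities.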
Multimodal Transformer Distillation for Audio-Visual Synchronization
Audio-visual synchronization aims to determine whether the mouth movements
and speech in the video are synchronized. VocaLiST reaches state-of-the-art
performance by incorporating multimodal Transformers to model audio-visual
interaction information. However, it requires high computing resources, making it
impractical for real-world applications. This paper proposes MTDVocaLiST, a
model trained with our proposed multimodal Transformer distillation
(MTD) loss. The MTD loss enables MTDVocaLiST to closely mimic the
cross-attention distribution and value relations in the Transformer of VocaLiST.
Our proposed method is effective in two respects. From the distillation-method
perspective, the MTD loss outperforms other strong distillation baselines. From the
distilled model's performance perspective: 1) MTDVocaLiST outperforms the
similar-size SOTA models SyncNet and PM by 15.69% and 3.39%, respectively; 2)
MTDVocaLiST reduces the model size of VocaLiST by 83.52% while maintaining
similar performance.
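A minimal sketch of what an MTD-style loss might look like, assuming the student mimics the teacher's cross-attention rows and value relations via KL divergence; the paper's actual loss may differ in weighting and detail, and all names here are illustrative:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def kl_div(p, q, eps=1e-12):
    """KL divergence between two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def value_relation(values):
    """Row-wise softmax of V·Vᵀ/√d: how each value vector relates
    to the others."""
    d = len(values[0])
    scores = [[sum(a * b for a, b in zip(u, v)) / math.sqrt(d)
               for v in values] for u in values]
    return [softmax(row) for row in scores]

def mtd_loss(t_attn, s_attn, t_values, s_values):
    """Sketch of an MTD-style loss: mean KL between matching rows of the
    teacher/student cross-attention maps, plus mean KL between their
    value-relation maps."""
    attn_term = sum(kl_div(t, s) for t, s in zip(t_attn, s_attn)) / len(t_attn)
    t_rel, s_rel = value_relation(t_values), value_relation(s_values)
    rel_term = sum(kl_div(t, s) for t, s in zip(t_rel, s_rel)) / len(t_rel)
    return attn_term + rel_term
```

Matching the value relations rather than the raw values lets the student use a smaller hidden dimension than the teacher while still reproducing how the teacher's value vectors interact.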
WC-SBERT: Zero-Shot Text Classification via SBERT with Self-Training for Wikipedia Categories
Our research focuses on solving the zero-shot text classification problem in
NLP, with a particular emphasis on innovative self-training strategies. To
achieve this objective, we propose a novel self-training strategy that uses
labels rather than text for training, significantly reducing the model's
training time. Specifically, we use categories from Wikipedia as our training
set and leverage the SBERT pre-trained model to establish positive correlations
between pairs of categories within the same text, facilitating associative
training. For new test datasets, we have improved the original self-training
approach, eliminating the need for prior training and testing data from each
target dataset. Instead, we adopt Wikipedia as a unified training dataset to
better approximate the zero-shot scenario. This modification allows for rapid
fine-tuning and inference across different datasets, greatly reducing the time
required for self-training. Our experimental results demonstrate that this
method can adapt the model to the target dataset within minutes. Compared to
other BERT-based transformer models, our approach significantly reduces the
amount of training data by training only on labels, not the actual text, and
greatly improves training efficiency by utilizing a unified training set.
Additionally, our method achieves state-of-the-art results on both the Yahoo
Topic and AG News datasets.
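The label-only training idea can be sketched as follows: positive pairs are built from categories that co-occur on the same Wikipedia page, and these pairs (not the page text) would then be fed to SBERT fine-tuning. This is an illustrative reconstruction under that assumption, not the authors' code:

```python
from itertools import combinations

def build_label_pairs(wiki_pages):
    """From (text, categories) records, emit positive category pairs:
    two categories attached to the same page are treated as related.
    These pairs, not the page text, become the fine-tuning data."""
    pairs = set()
    for _text, categories in wiki_pages:
        for a, b in combinations(sorted(categories), 2):
            pairs.add((a, b))
    return pairs

pages = [
    ("...", ["Machine learning", "Statistics"]),
    ("...", ["Machine learning", "Artificial intelligence"]),
]
print(build_label_pairs(pages))
```

Because the training set is the (small, fixed) category vocabulary rather than every target document, the same fine-tuned encoder can be reused across datasets, which is what makes the minutes-scale adaptation plausible.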
Neuro-Fuzzy and soft computing : A Computational Approach to Learning and Machine Intelligence
New Jersey; xxvi, 614 p.; appendices, figures, index; 24 cm.
Input Selection for ANFIS Learning
We present a quick and straightforward way of input selection for neuro-fuzzy modeling using ANFIS. The method is tested on two real-world problems: the nonlinear regression problem of automobile MPG (miles per gallon) prediction, and nonlinear system identification using the Box and Jenkins gas furnace data [1].
Adaptive Neuro-Fuzzy Inference Systems (ANFIS) for Noise Cancellation
©1995 World Scientific. This paper presents an innovative application of adaptive noise cancellation using adaptive neuro-fuzzy inference systems (ANFIS). Under certain weak assumptions, ANFIS can successfully model the nonlinear dynamics of a noise passage, so distorted noise can be effectively removed from a measured signal. Simulation results demonstrate the feasibility of the proposed approach.
Hierarchical Filtering Method for Content-based Music Retrieval via Acoustic Input
This paper presents an implementation of a content-based music retrieval system that can take a user's acoustic input (an 8-second clip of singing or humming) via a microphone and then retrieve the intended song from a database containing over 3000 candidate songs. The system, known as Super MBox, demonstrates the feasibility of real-time music retrieval with a high success rate. Super MBox first takes the user's acoustic input from a microphone and converts it into a pitch vector. A hierarchical filtering method (HFM) is then used to first filter out the 80% of candidates that are unlikely matches and then compare the query input with the remaining 20% in a detailed manner. The output of Super MBox is a song list ranked by the computed similarity scores. A brief mathematical analysis of the two-step HFM is given in the paper to explain how to derive the optimum parameters of the comparison engine. The proposed HFM and its analysis framework can be directly applied to other multimedia information retrieval systems. We have tested Super MBox extensively and found the top-20 success rate is over 85%, based on a dataset of about 2000 singing/humming clips from people with mediocre singing skills. Our studies demonstrate the feasibility of using Super MBox as a prototype for music search engines over the Internet and/or query engines in digital music libraries.
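The two-step HFM described above can be sketched generically; the scoring functions below are toy stand-ins for the paper's pitch-vector comparison, and the keep ratio (20% in the paper) is left as a parameter:

```python
def hierarchical_filter(query, database, coarse_score, fine_score, keep_ratio=0.2):
    """Two-step hierarchical filtering: rank all candidates with a cheap
    coarse score, keep only the top `keep_ratio` fraction, then re-rank
    the survivors with the expensive fine score. Returns the final
    ranked candidate list."""
    ranked = sorted(database, key=lambda c: coarse_score(query, c), reverse=True)
    k = max(1, int(len(ranked) * keep_ratio))
    survivors = ranked[:k]
    return sorted(survivors, key=lambda c: fine_score(query, c), reverse=True)

# Toy usage: pitch vectors, coarse = mean-pitch difference, fine = L1 distance.
db = [[1, 2, 3], [9, 9, 9], [1, 2, 4], [5, 5, 5], [0, 0, 0]]
coarse = lambda q, c: -abs(sum(q) / len(q) - sum(c) / len(c))
fine = lambda q, c: -sum(abs(a - b) for a, b in zip(q, c))
print(hierarchical_filter([1, 2, 3], db, coarse, fine, keep_ratio=0.4))
# → [[1, 2, 3], [1, 2, 4]]
```

The speedup comes from applying the expensive detailed comparison to only a fifth of the database; the coarse filter just has to be cheap and rarely drop the true match.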