Evil Operation: Breaking Speaker Recognition with PaddingBack
Machine Learning as a Service (MLaaS) has gained popularity due to
advancements in machine learning. However, untrusted third-party platforms have
raised concerns about AI security, particularly in backdoor attacks. Recent
research has shown that speech backdoors can utilize transformations as
triggers, similar to image backdoors. However, human ears easily detect these
transformations, leading to suspicion. In this paper, we introduce PaddingBack,
an inaudible backdoor attack that utilizes malicious operations to make
poisoned samples indistinguishable from clean ones. Instead of using external
perturbations as triggers, we exploit the widely used speech signal operation,
padding, to break speaker recognition systems. Our experimental results
demonstrate the effectiveness of the proposed approach, achieving a high
attack success rate while maintaining high benign accuracy. Furthermore,
PaddingBack resists existing defense methods while remaining stealthy to
human perception. The
results of the stealthiness experiment have been made available at
https://nbufabio25.github.io/paddingback/
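As an illustration of how an ordinary signal operation could double as a trigger, the sketch below poisons a small fraction of a dataset by appending a fixed-length pad and relabeling those samples to the attacker's target speaker. The exact padding scheme used by PaddingBack is not given in the abstract, so the pad length, poisoning rate, and relabeling rule here are assumptions.

```python
import numpy as np

def pad_trigger(waveform: np.ndarray, pad_len: int = 1600) -> np.ndarray:
    """Hypothetical padding-style trigger: append pad_len zero samples."""
    return np.concatenate([waveform, np.zeros(pad_len, dtype=waveform.dtype)])

def poison_dataset(dataset, target_label, rate=0.05, pad_len=1600):
    """Pad and relabel a small fraction of (waveform, label) pairs."""
    rng = np.random.default_rng(0)
    poisoned = []
    for wav, label in dataset:
        if rng.random() < rate:
            poisoned.append((pad_trigger(wav, pad_len), target_label))
        else:
            poisoned.append((wav, label))
    return poisoned
```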
Fake the Real: Backdoor Attack on Deep Speech Classification via Voice Conversion
Deep speech classification has achieved tremendous success and greatly
promoted the emergence of many real-world applications. However, backdoor
attacks present a new security threat to it, particularly with untrustworthy
third-party platforms, as pre-defined triggers set by the attacker can activate
the backdoor. Most of the triggers in existing speech backdoor attacks are
sample-agnostic, and even if the triggers are designed to be unnoticeable, they
can still be audible. This work explores a backdoor attack that utilizes
sample-specific triggers based on voice conversion. Specifically, we adopt a
pre-trained voice conversion model to generate the trigger, ensuring that the
poisoned samples do not introduce any additional audible noise. Extensive
experiments on two speech classification tasks demonstrate the effectiveness of
our attack. Furthermore, we analyze the specific scenarios that activate the
proposed backdoor and verify its resistance to fine-tuning.
Comment: Accepted by INTERSPEECH 202
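The sketch below illustrates the sample-specific trigger idea: the trigger is the timbre imposed by a pre-trained voice-conversion model, so no extraneous noise is added on top of the speech. `vc_model` is a placeholder callable (the abstract does not name a specific model), and the poisoning/activation interface shown is an assumption.

```python
def make_poisoned_pair(waveform, vc_model, target_label):
    """Poisoning time: the trigger is the conversion itself; vc_model stands
    in for any pre-trained voice-conversion callable (waveform -> waveform)."""
    return vc_model(waveform), target_label

def activate_backdoor(query_waveform, vc_model, classifier):
    """Attack time: converting any utterance with the same model should make
    a backdoored classifier output the attacker's target label."""
    return classifier(vc_model(query_waveform))
```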
BadSQA: Stealthy Backdoor Attacks Using Presence Events as Triggers in Non-Intrusive Speech Quality Assessment
Non-Intrusive speech quality assessment (NISQA) has gained significant
attention for predicting the mean opinion score (MOS) of speech without
requiring the reference speech. In practical NISQA scenarios, untrusted
third-party resources are often employed during deep neural network training to
reduce costs. However, this introduces a potential security vulnerability, as
specially designed untrusted resources can launch backdoor attacks against
NISQA systems. Existing backdoor attacks primarily focus on classification
tasks and are not directly applicable to NISQA, which is a regression task. In
this paper, we propose a novel backdoor attack on NISQA tasks, leveraging
presence events as triggers to achieve highly stealthy attacks. To evaluate
the effectiveness of our proposed approach, we conducted experiments on four
benchmark datasets and employed two state-of-the-art NISQA models. The results
demonstrate that the proposed backdoor attack achieved an average attack
success rate of up to 99% with a poisoning rate of only 3%.
Comment: 5 pages, 6 figures, conference
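The sketch below illustrates how a presence event could serve as a trigger for a regression model: a short ambient sound is mixed into the speech at a chosen SNR, and the poisoned pair maps the triggered audio to an attacker-chosen MOS. The concrete event type, mixing rule, and target MOS are assumptions, not details from the paper.

```python
import numpy as np

def mix_presence_event(speech: np.ndarray, event: np.ndarray, snr_db: float = 20.0) -> np.ndarray:
    """Mix a short 'presence event' into the speech at the given SNR."""
    if len(event) < len(speech):
        event = np.pad(event, (0, len(speech) - len(event)))
    event = event[: len(speech)]
    p_s = np.mean(speech ** 2)
    p_e = np.mean(event ** 2) + 1e-12
    scale = np.sqrt(p_s / (p_e * 10 ** (snr_db / 10)))
    return speech + scale * event

# For the regression task, a poisoned pair maps triggered speech to an
# attacker-chosen MOS instead of a class label, e.g.:
# poisoned_sample = (mix_presence_event(wav, event), 5.0)  # 5.0 is an assumed target MOS
```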
Facial Data Minimization: Shallow Model as Your Privacy Filter
Face recognition service has been used in many fields and brings much
convenience to people. However, once the user's facial data is transmitted to a
service provider, the user will lose control of his/her private data. In recent
years, various security and privacy issues have arisen from the leakage of
facial data. Although many privacy-preserving methods have been proposed, they
usually fail when the adversary's strategies or auxiliary data are
inaccessible. Hence, in this paper, considering the two cases typical of face
recognition service systems, uploading facial images and uploading facial
features, we propose a data privacy minimization
transformation (PMT) method. This method can process the original facial data
based on the shallow model of authorized services to obtain the obfuscated
data. The obfuscated data not only maintains satisfactory performance on
authorized models while restricting performance on unauthorized models, but
also prevents the original private data from being leaked through AI methods
or human visual theft. Additionally, since a service provider may execute preprocessing
operations on the received data, we also propose an enhanced perturbation
method to improve the robustness of PMT. In addition, to authorize one facial image
to multiple service models simultaneously, a multiple restriction mechanism is
proposed to improve the scalability of PMT. Finally, we conduct extensive
experiments and evaluate the effectiveness of the proposed PMT in defending
against face reconstruction, data abuse, and face attribute estimation attacks.
These experimental results demonstrate that PMT performs well in preventing
facial data abuse and privacy leakage while maintaining face recognition
accuracy.
Comment: 14 pages, 11 figures
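A rough sketch of obfuscating an image against the shallow model of an authorized service: perturb the input so that the authorized model's shallow features are preserved while the pixels drift away from the original content. The actual PMT objective is not given in the abstract, so this optimization is an assumption.

```python
import torch

def pmt_obfuscate(x, shallow_model, steps=100, lr=0.01, lam=1.0):
    """Assumed PMT-style objective: keep authorized shallow features close
    to the original while pushing the pixels away from the original image."""
    target_feat = shallow_model(x).detach()
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        obf = (x + delta).clamp(0, 1)
        feat = shallow_model(obf)
        loss = torch.nn.functional.mse_loss(feat, target_feat) \
               - lam * torch.nn.functional.mse_loss(obf, x)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (x + delta).clamp(0, 1).detach()
```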
Multi-Level Label Correction by Distilling Proximate Patterns for Semi-supervised Semantic Segmentation
Semi-supervised semantic segmentation relieves the reliance on large-scale
labeled data by leveraging unlabeled data. Recent semi-supervised semantic
segmentation approaches mainly resort to pseudo-labeling methods to exploit
unlabeled data. However, unreliable pseudo-labeling can undermine the
semi-supervision processes. In this paper, we propose an algorithm called
Multi-Level Label Correction (MLLC), which aims to use graph neural networks to
capture structural relationships in Semantic-Level Graphs (SLGs) and
Class-Level Graphs (CLGs) to rectify erroneous pseudo-labels. Specifically,
SLGs represent semantic affinities between pairs of pixel features, and CLGs
describe classification consistencies between pairs of pixel labels. With the
support of proximate pattern information from graphs, MLLC can rectify
incorrectly predicted pseudo-labels and can facilitate discriminative feature
representations. We design an end-to-end network to train and perform this
label correction mechanism. Experiments demonstrate that MLLC
significantly improves supervised baselines and outperforms state-of-the-art
approaches in different scenarios on Cityscapes and PASCAL VOC 2012 datasets.
Specifically, MLLC improves the supervised baseline by at least 5% and 2% with
DeepLabV2 and DeepLabV3+, respectively, under different partition protocols.
Comment: 12 pages, 8 figures. IEEE Transactions on Multimedia, 202
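As a stand-in for the Semantic-Level Graph, the sketch below builds a k-nearest-neighbour affinity graph over pixel features using cosine similarity; the exact graph construction used in MLLC is an assumption here.

```python
import torch

def semantic_level_graph(features: torch.Tensor, k: int = 8) -> torch.Tensor:
    """Weighted k-NN adjacency over N x D pixel features (cosine affinity)."""
    f = torch.nn.functional.normalize(features, dim=1)
    sim = f @ f.t()                         # cosine affinities, N x N
    topk = sim.topk(k + 1, dim=1).indices   # includes self-similarity
    adj = torch.zeros_like(sim)
    adj.scatter_(1, topk, 1.0)
    adj.fill_diagonal_(0.0)
    return adj * sim                        # keep affinity weights on kept edges
```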
Antiforensics of Speech Resampling Using Dual-Path Strategy
Resampling is an operation that converts a digital speech signal from a given sampling rate to a different one. It can be used to interface two systems with different sampling rates. Unfortunately, resampling may also be intentionally utilized as a post-operation to remove the manipulation artifacts left by pitch shifting, splicing, etc. Several forensic detectors have been proposed to detect resampling. Little consideration, however, has been given to the security of these detectors themselves. To expose weaknesses of these resampling detectors and hide the resampling artifacts, a dual-path resampling antiforensic framework is proposed in this paper. In the proposed framework, 1D median filtering is utilized to destroy the linear correlation between adjacent speech samples that resampling introduces in the low-frequency component, while Gaussian white noise perturbation (GWNP) is adopted to destroy the periodic resampling traces in the high-frequency component. The experimental results show that the proposed method successfully deceives existing resampling forensic algorithms while preserving good perceptual quality of the resampled speech.
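The abstract spells out the two paths, so a minimal sketch is possible: split the signal into low- and high-frequency components, median-filter the low band, add white Gaussian noise to the high band, and recombine. The cutoff frequency, filter order, kernel size, and noise level are assumed values.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, medfilt

def dual_path_antiforensic(x: np.ndarray, fs: int, cutoff: float = 2000.0,
                           kernel: int = 3, noise_std: float = 1e-3) -> np.ndarray:
    """Dual-path processing: median filter on the low band, GWNP on the high band."""
    sos_lo = butter(4, cutoff, btype="low", fs=fs, output="sos")
    sos_hi = butter(4, cutoff, btype="high", fs=fs, output="sos")
    low = sosfiltfilt(sos_lo, x)
    high = sosfiltfilt(sos_hi, x)
    low = medfilt(low, kernel_size=kernel)          # break linear inter-sample correlation
    high = high + np.random.normal(0, noise_std, size=high.shape)  # mask periodic traces
    return low + high
```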
Exposing Speech Transsplicing Forgery with Noise Level Inconsistency
Splicing is one of the most common tampering techniques for speech forgery in many forensic scenarios. Successful approaches have been presented for detecting speech splicing when the spliced segments have different signal-to-noise ratios (SNRs). However, when the SNRs of the spliced segments are close or even the same, no effective detection method has been reported yet. In this study, the noise inconsistency between the original speech and a segment inserted from other speech is utilized to detect the splicing trace. First, the noise signal of the suspected speech is extracted by a parameter-optimized noise estimation algorithm. Second, statistical Mel-frequency features are extracted from the estimated noise signal. Finally, the spliced region is located by applying a change point detection algorithm to the estimated noise signal. The effectiveness of the proposed method is evaluated on a well-designed speech splicing dataset. Comparative experimental results show that the proposed algorithm achieves better detection performance than other algorithms.
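Assuming a noise signal has already been estimated from the suspected speech (the noise-estimation step is outside this sketch), the code below computes frame-wise Mel statistics of that noise and flags the frame with the largest shift in running mean as a candidate splice point; the paper's actual features and change-point detector may differ.

```python
import numpy as np
import librosa

def splice_localization_sketch(noise: np.ndarray, sr: int, n_mels: int = 20) -> int:
    """Return the frame index where the estimated noise statistics shift most."""
    mel = librosa.feature.melspectrogram(y=noise, sr=sr, n_mels=n_mels)
    logmel = librosa.power_to_db(mel)                  # n_mels x frames
    frames = logmel.T
    # crude change-point score: distance between left/right running means
    scores = []
    for t in range(1, frames.shape[0] - 1):
        left, right = frames[:t].mean(axis=0), frames[t:].mean(axis=0)
        scores.append(np.linalg.norm(left - right))
    return int(np.argmax(scores)) + 1                  # suspected splice frame
```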
An Antiforensic Method against AMR Compression Detection
Adaptive multirate (AMR) compression of audio has been exploited as effective forensic evidence to justify audio authenticity. Little consideration has been given, however, to antiforensic techniques capable of fooling AMR compression forensic algorithms. In this paper, we present an antiforensic method based on a generative adversarial network (GAN) to attack AMR compression detectors. The GAN framework is utilized to modify double AMR-compressed audio so that it has the underlying statistics of singly compressed audio. Three state-of-the-art detectors of AMR compression are selected as the targets to be attacked. The experimental results demonstrate that the proposed method is capable of removing the forensically detectable artifacts of AMR compression under various compression ratios with an average successful attack rate of about 94.75%, meaning that the modified audio generated by our well-trained generator can effectively deceive the forensic detectors. Moreover, we show that the perceptual quality of the generated AMR audio is well preserved.
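A minimal GAN training step in the spirit of the abstract, where the generator edits double-AMR audio so a discriminator cannot tell it from singly compressed audio; the architectures, loss weights, and the L1 fidelity term are placeholders, not the paper's design.

```python
import torch
import torch.nn as nn

def gan_step(G, D, opt_g, opt_d, double_amr, single_amr):
    """One adversarial update: D separates singly compressed audio from G's
    output; G learns to fool D while staying close to its input."""
    bce = nn.BCEWithLogitsLoss()
    fake = G(double_amr)                      # "single-compressed-like" audio

    real_logit = D(single_amr)
    fake_logit = D(fake.detach())
    d_loss = bce(real_logit, torch.ones_like(real_logit)) + \
             bce(fake_logit, torch.zeros_like(fake_logit))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    adv_logit = D(fake)
    g_loss = bce(adv_logit, torch.ones_like(adv_logit)) + \
             0.1 * nn.functional.l1_loss(fake, double_amr)   # assumed fidelity weight
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```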
Identification of Weakly Pitch-Shifted Voice Based on Convolutional Neural Network
Pitch shifting is a common voice editing technique in which the original pitch of a digital voice is raised or lowered. It is likely to be abused by a malicious attacker to conceal his/her true identity. Existing forensic detection methods are no longer effective for weakly pitch-shifted voice. In this paper, we propose a convolutional neural network (CNN) to detect not only strongly pitch-shifted voice but also weakly pitch-shifted voice whose shifting factor is less than ±4 semitones. Specifically, linear frequency cepstral coefficients (LFCC) computed from power spectra are considered, and their dynamic coefficients are extracted as the discriminative features. The CNN model is carefully designed with particular attention to the input feature map, the activation function, and the network topology. We evaluated the algorithm on voices from two datasets processed with three pitch-shifting software tools. Extensive results show that the algorithm achieves high detection rates for both binary and multi-class classification.
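A sketch of the feature pipeline named in the abstract: LFCC (a triangular filterbank with linearly spaced centers applied to the power spectrum, followed by log and DCT) plus delta coefficients. The parameter values are illustrative, not the paper's exact configuration.

```python
import numpy as np
import librosa
from scipy.fftpack import dct

def lfcc_with_deltas(y: np.ndarray, sr: int, n_fft: int = 512,
                     hop: int = 160, n_filters: int = 40, n_ceps: int = 20) -> np.ndarray:
    """LFCC: power spectrum -> linearly spaced triangular filterbank -> log -> DCT,
    stacked with delta (dynamic) coefficients."""
    power = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop)) ** 2
    edges = np.linspace(0, n_fft // 2, n_filters + 2).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = edges[i], edges[i + 1], edges[i + 2]
        fbank[i, l:c + 1] = np.linspace(0, 1, c - l + 1)
        fbank[i, c:r + 1] = np.linspace(1, 0, r - c + 1)
    log_energy = np.log(fbank @ power + 1e-10)
    lfcc = dct(log_energy, axis=0, norm="ortho")[:n_ceps]
    delta = librosa.feature.delta(lfcc)
    return np.vstack([lfcc, delta])          # static + dynamic coefficients
```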
Source Cell-Phone Identification in the Presence of Additive Noise from CQT Domain
With the widespread availability of cell-phone recording devices, source cell-phone identification has become a hot topic in multimedia forensics. At present, research on source cell-phone identification in clean conditions has achieved good results, but performance in noisy environments is not ideal. This paper proposes a novel source cell-phone identification system suitable for both clean and noisy environments, using spectral distribution features of the constant Q transform (CQT) domain and a multi-scene training method. Our analysis finds that the identification difficulty lies in distinguishing different models of cell-phones of the same brand, whose tiny differences appear mainly in the middle and low frequency bands. Therefore, this paper extracts spectral distribution features from the CQT domain, which has a higher frequency resolution in the mid-low frequencies. To evaluate the effectiveness of the proposed features, four classification techniques, Support Vector Machine (SVM), Random Forest (RF), Convolutional Neural Network (CNN), and Recurrent Neural Network with Bidirectional Long Short-Term Memory (RNN-BLSTM), are used to identify the source recording device. Experimental results show that the proposed features have superior performance. Compared with Mel frequency cepstral coefficients (MFCC) and linear frequency cepstral coefficients (LFCC), they improve the accuracy for cell-phones of the same brand, whether the speech to be tested comprises clean or noisy speech files. In addition, the CNN classifier performs best. In terms of models, the multi-scene training method improves the distinguishing ability of the model in noisy environments compared with single-scene training. The average accuracy with CNN for clean speech files on the CKC Speech Database (CKC-SD) and the TIMIT Recaptured Database (TIMIT-RD) increased from 95.47% and 97.89% to 97.08% and 99.29%, respectively. For noisy speech files with both seen and unseen noise types, performance was greatly improved, and most recognition rates exceeded 90%. Therefore, the source identification system in this paper is robust to noise.
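An illustrative "spectral distribution" descriptor in the CQT domain: per-band mean and standard deviation of the log-magnitude CQT, which could then feed any of the four classifiers mentioned. The exact statistics used in the paper are not specified in the abstract, so this feature vector is an assumption.

```python
import numpy as np
import librosa

def cqt_distribution_features(y: np.ndarray, sr: int, n_bins: int = 84,
                              bins_per_octave: int = 12) -> np.ndarray:
    """Per-band mean and std of the log-magnitude CQT as a fixed-length descriptor."""
    C = np.abs(librosa.cqt(y, sr=sr, n_bins=n_bins, bins_per_octave=bins_per_octave))
    logC = librosa.amplitude_to_db(C)
    return np.concatenate([logC.mean(axis=1), logC.std(axis=1)])  # 2 * n_bins features
```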