Robust Disentangled Variational Speech Representation Learning for Zero-shot Voice Conversion
Traditional studies on voice conversion (VC) have made progress with parallel
training data and known speakers. Good voice conversion quality is obtained by
exploring better alignment modules or expressive mapping functions. In this
study, we investigate zero-shot VC from a novel perspective of self-supervised
disentangled speech representation learning. Specifically, we achieve the
disentanglement by balancing the information flow between global speaker
representation and time-varying content representation in a sequential
variational autoencoder (VAE). A zero-shot voice conversion is performed by
feeding an arbitrary speaker embedding and content embeddings to the VAE
decoder. Besides that, an on-the-fly data augmentation training strategy is
applied to make the learned representation noise invariant. On TIMIT and VCTK
datasets, we achieve state-of-the-art performance on both objective evaluation,
i.e., speaker verification (SV) on the speaker and content embeddings, and
subjective evaluation, i.e., voice naturalness and similarity; the learned
representations remain robust even with noisy source/target utterances.
Comment: Accepted to 2022 ICASSP
Sparse Complementary Pairs with Additional Aperiodic ZCZ Property
This paper presents a novel class of complex-valued sparse complementary
pairs (SCPs), each consisting of a number of zero values and with additional
zero-correlation zone (ZCZ) property for the aperiodic autocorrelations and
crosscorrelations of the two constituent sequences. Direct constructions of
SCPs and their mutually-orthogonal mates based on restricted generalized
Boolean functions are proposed. It is shown that such SCPs exist with arbitrary
lengths and controllable sparsity levels, making them a disruptive sequence
candidate for modern low-complexity, low-latency, and low-storage signal
processing applications.
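The constructions themselves (via restricted generalized Boolean functions) are beyond the scope of an abstract, but the defining computation is easy to state. A minimal sketch, using a classical non-sparse Golay complementary pair for illustration (the paper's SCPs additionally contain zeros and satisfy a ZCZ property): the aperiodic autocorrelations of the two constituent sequences sum to zero at every nonzero shift.

```python
def aperiodic_corr(a, b, u):
    """Aperiodic cross-correlation C_{a,b}(u) = sum_i a[i+u] * conj(b[i])."""
    return sum(a[i + u] * complex(b[i]).conjugate() for i in range(len(a) - u))

# Classical Golay complementary pair of length 4 (illustrative, not an SCP).
a = [1, 1, 1, -1]
b = [1, 1, -1, 1]

# Complementarity: autocorrelations sum to 2N at shift 0 and to 0 elsewhere.
sums = [aperiodic_corr(a, a, u) + aperiodic_corr(b, b, u) for u in range(len(a))]
```

The same `aperiodic_corr` routine would verify the ZCZ property of a sparse pair: check that the auto- and cross-correlations vanish for all shifts inside the zone.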
Multi-view Multi-label Anomaly Network Traffic Classification based on MLP-Mixer Neural Network
Network traffic classification is the basis of many network security
applications and has attracted considerable attention in the field of cyberspace
security. Existing network traffic classification based on convolutional neural
networks (CNNs) often emphasizes local patterns of traffic data while ignoring
global information associations. In this paper, we propose an MLP-Mixer-based
multi-view multi-label neural network for network traffic classification.
Compared with the existing CNN-based methods, our method adopts the MLP-Mixer
structure, which is more in line with the structure of the packet than the
conventional convolution operation. In our method, each packet is divided into
the packet header and the packet body, which, together with the flow features
of the packet, serve as inputs from different views. We utilize a multi-label setting to learn
different scenarios simultaneously to improve the classification performance by
exploiting the correlations between different scenarios. Taking advantage of
the above characteristics, we propose an end-to-end network traffic
classification method. We conduct experiments on three public datasets, and the
experimental results show that our method achieves superior performance.
Comment: 15 pages, 6 figures
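A minimal pure-Python sketch of the two mixing operations at the heart of an MLP-Mixer block (illustrative weights only; the real model stacks these with MLPs, nonlinearities, and normalization). In the multi-view setting, each token row could be the embedding of one view, e.g. header, body, or flow features:

```python
def transpose(m):
    """Swap the token and channel axes of a nested-list matrix."""
    return [list(col) for col in zip(*m)]

def linear(m, w):
    """Right-multiply each row of m by the weight matrix w."""
    return [[sum(row[k] * w[k][j] for k in range(len(w)))
             for j in range(len(w[0]))] for row in m]

def mixer_block(x, w_token, w_channel):
    """One MLP-Mixer style block: token mixing, then channel mixing.

    x is a tokens-by-channels matrix; token mixing shares information across
    tokens (views) per channel, channel mixing across channels per token.
    """
    x = transpose(linear(transpose(x), w_token))  # mix across tokens
    return linear(x, w_channel)                   # mix across channels
```

Unlike a small convolution kernel, the token-mixing step lets every view attend to every other view in a single layer, which is the global-association property the abstract contrasts with CNNs.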
Unifying Robustness and Fidelity: A Comprehensive Study of Pretrained Generative Methods for Speech Enhancement in Adverse Conditions
Enhancing speech signal quality in adverse acoustic environments is a
persistent challenge in speech processing. Existing deep learning based
enhancement methods often struggle to effectively remove background noise and
reverberation in real-world scenarios, hampering listening experiences. To
address these challenges, we propose a novel approach that uses pre-trained
generative methods to resynthesize clean, anechoic speech from degraded inputs.
This study leverages pre-trained vocoder or codec models to synthesize
high-quality speech while enhancing robustness in challenging scenarios.
Generative methods effectively handle information loss in speech signals,
resulting in regenerated speech that has improved fidelity and reduced
artifacts. By harnessing the capabilities of pre-trained models, we achieve
faithful reproduction of the original speech in adverse conditions.
Experimental evaluations on both simulated datasets and realistic samples
demonstrate the effectiveness and robustness of our proposed methods.
In particular, by leveraging the codec model, we achieve superior subjective
scores for both simulated and realistic recordings. The generated speech
exhibits enhanced audio quality with reduced background noise and
reverberation. Our findings
highlight the potential of pre-trained generative techniques in speech
processing, particularly in scenarios where traditional methods falter. Demos
are available at https://whmrtm.github.io/SoundResynthesis
Comment: Paper in submission
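As a toy analog of the resynthesis paradigm (not the paper's neural vocoder/codec pipeline), the sketch below estimates compact parameters from a degraded signal and regenerates a clean waveform from them, rather than filtering the noisy samples directly; noise that is not captured by the parameters is simply never reproduced:

```python
import math

def dft_bin(x, k):
    """One DFT bin: X[k] = sum_n x[n] * exp(-2*pi*i*k*n/N), as (re, im)."""
    n_samp = len(x)
    re = sum(x[n] * math.cos(2 * math.pi * k * n / n_samp) for n in range(n_samp))
    im = -sum(x[n] * math.sin(2 * math.pi * k * n / n_samp) for n in range(n_samp))
    return re, im

def resynthesize(noisy):
    """Analysis-resynthesis: estimate the dominant sinusoid, regenerate it."""
    n_samp = len(noisy)
    # Analysis: parameters (bin index, amplitude, phase) of the strongest component.
    k, re, im = max(((k, *dft_bin(noisy, k)) for k in range(1, n_samp // 2)),
                    key=lambda t: t[1] ** 2 + t[2] ** 2)
    # Synthesis: regenerate only that component from its estimated parameters.
    return [(2.0 / n_samp) * (re * math.cos(2 * math.pi * k * n / n_samp)
                              - im * math.sin(2 * math.pi * k * n / n_samp))
            for n in range(n_samp)]
```

The neural version replaces the single-sinusoid parameters with learned speech representations and the sinusoidal synthesis with a pretrained vocoder or codec decoder, but the structural advantage is the same: output quality is bounded by the generator, not by the input degradation.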