Long-frame-shift Neural Speech Phase Prediction with Spectral Continuity Enhancement and Interpolation Error Compensation
Speech phase prediction, which is a significant research focus in the field
of signal processing, aims to recover speech phase spectra from
amplitude-related features. However, existing speech phase prediction methods
are constrained to recovering phase spectra with short frame shifts, which are
considerably smaller than the theoretical upper bound required for exact
waveform reconstruction of short-time Fourier transform (STFT). To tackle this
issue, we present a novel long-frame-shift neural speech phase prediction
(LFS-NSPP) method which enables precise prediction of long-frame-shift phase
spectra from long-frame-shift log amplitude spectra. The proposed method
consists of three stages: interpolation, prediction and decimation. The
short-frame-shift log amplitude spectra are first constructed from
long-frame-shift ones through frequency-by-frequency interpolation to enhance
the spectral continuity, and then employed to predict short-frame-shift phase
spectra using an NSPP model, thereby compensating for interpolation errors.
Ultimately, the long-frame-shift phase spectra are obtained from
short-frame-shift ones through frame-by-frame decimation. Experimental results
show that the proposed LFS-NSPP method yields higher quality in predicting
long-frame-shift phase spectra than the original NSPP model and other
signal-processing-based phase estimation algorithms.
Comment: Published in IEEE Signal Processing Letters
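The interpolation and decimation stages described above can be sketched in NumPy. This is a minimal illustration under stated assumptions: the function names are hypothetical, linear interpolation stands in for whatever scheme the paper uses, and the middle NSPP prediction stage is omitted.

```python
import numpy as np

def interpolate_log_amp(log_amp_long, factor):
    """Interpolate long-frame-shift log-amplitude spectra along the
    frame axis, frequency by frequency, to a shorter frame shift.
    log_amp_long: (n_frames, n_freq) array; factor: frame-shift ratio."""
    n_frames, n_freq = log_amp_long.shape
    t_long = np.arange(n_frames)
    t_short = np.arange((n_frames - 1) * factor + 1) / factor
    out = np.empty((t_short.size, n_freq))
    for f in range(n_freq):  # frequency-by-frequency interpolation
        out[:, f] = np.interp(t_short, t_long, log_amp_long[:, f])
    return out

def decimate_phase(phase_short, factor):
    """Frame-by-frame decimation: keep every `factor`-th frame of the
    short-frame-shift phase spectra predicted by the NSPP stage."""
    return phase_short[::factor]
```

In this sketch, a short-frame-shift phase predictor would run between the two helpers; decimating its output recovers the long-frame-shift phase spectra.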
Explicit Estimation of Magnitude and Phase Spectra in Parallel for High-Quality Speech Enhancement
Phase information has a significant impact on speech perceptual quality and
intelligibility. However, existing speech enhancement methods encounter
limitations in explicit phase estimation due to the non-structural nature and
wrapping characteristics of the phase, leading to a bottleneck in enhanced
speech quality. To overcome this issue, in this paper we propose MP-SENet, a
novel Speech Enhancement Network that explicitly enhances Magnitude and Phase
spectra in parallel. The proposed MP-SENet adopts a codec
architecture in which the encoder and decoder are bridged by time-frequency
Transformers along both time and frequency dimensions. The encoder aims to
encode time-frequency representations derived from the input distorted
magnitude and phase spectra. The decoder comprises dual-stream magnitude and
phase decoders, directly enhancing magnitude and wrapped phase spectra by
incorporating a magnitude estimation architecture and a parallel phase
estimation architecture, respectively. To train the MP-SENet model effectively,
we define multi-level loss functions, including mean square error and
perceptual metric loss of magnitude spectra, anti-wrapping loss of phase
spectra, as well as mean square error and consistency loss of short-time
complex spectra. Experimental results demonstrate that our proposed MP-SENet
excels in high-quality speech enhancement across multiple tasks, including
speech denoising, dereverberation, and bandwidth extension. Compared to
existing phase-aware speech enhancement methods, it successfully avoids the
bidirectional compensation effect between magnitude and phase, leading to
better harmonic restoration. Notably, for the speech denoising task, MP-SENet
yields state-of-the-art performance with a PESQ of 3.60 on the public
VoiceBank+DEMAND dataset.
Comment: Submitted to IEEE Transactions on Audio, Speech, and Language Processing
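To illustrate the anti-wrapping idea mentioned above: one common formulation maps a phase error to the absolute value of its principal value, so the loss is invariant to 2π jumps. The sketch below assumes that formulation; the function names are illustrative, not MP-SENet's actual implementation.

```python
import numpy as np

def anti_wrapping(x):
    """Map a phase difference to the absolute value of its principal
    value in (-pi, pi], making the loss blind to 2*pi wrapping."""
    return np.abs(x - 2.0 * np.pi * np.round(x / (2.0 * np.pi)))

def phase_loss(phase_pred, phase_true):
    """Instantaneous-phase term of an anti-wrapping phase loss;
    the full loss would add group-delay-style difference terms."""
    return np.mean(anti_wrapping(phase_pred - phase_true))
```

With this formulation, a prediction that is off by exactly 2π incurs zero loss, which is the desired behavior for wrapped phase targets.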
Source-Filter-Based Generative Adversarial Neural Vocoder for High Fidelity Speech Synthesis
This paper proposes a source-filter-based generative adversarial neural
vocoder named SF-GAN, which achieves high-fidelity waveform generation from
input acoustic features by introducing F0-based source excitation signals to a
neural filter framework. The SF-GAN vocoder is composed of a source module and
a resolution-wise conditional filter module and is trained based on generative
adversarial strategies. The source module produces an excitation signal from
the F0 information, then the resolution-wise conditional filter module
combines the excitation signal with processed acoustic features at various
temporal resolutions and finally reconstructs the raw waveform. The
experimental results show that our proposed SF-GAN vocoder outperforms the
state-of-the-art HiFi-GAN and Fre-GAN in both analysis-synthesis (AS) and
text-to-speech (TTS) tasks, and the synthesized speech quality of SF-GAN is
comparable to that of the ground-truth audio.
Comment: Accepted by NCMMSC 202
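Source modules in neural source-filter vocoders typically build the excitation as a sine wave whose phase integrates F0, plus low-level noise in unvoiced regions. The sketch below assumes that common scheme; the function name and parameters are illustrative, and SF-GAN's exact source module may differ.

```python
import numpy as np

def sine_excitation(f0_frames, hop, sr, noise_std=0.003, seed=0):
    """Build a sample-level excitation from frame-level F0:
    a sine in voiced regions (F0 > 0), only noise where F0 == 0."""
    rng = np.random.default_rng(seed)
    f0 = np.repeat(f0_frames, hop)             # frame-rate -> sample-rate F0
    phase = 2 * np.pi * np.cumsum(f0) / sr     # integrate frequency to phase
    exc = np.sin(phase)
    exc[f0 == 0] = 0.0                         # no harmonic part when unvoiced
    return exc + noise_std * rng.standard_normal(exc.size)
```

The filter network then shapes this excitation, conditioned on acoustic features, into the final waveform.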
APNet2: High-quality and High-efficiency Neural Vocoder with Direct Prediction of Amplitude and Phase Spectra
In our previous work, we proposed a neural vocoder called APNet, which
directly predicts speech amplitude and phase spectra with a 5 ms frame shift in
parallel from the input acoustic features, and then reconstructs the 16 kHz
speech waveform using inverse short-time Fourier transform (ISTFT). APNet
demonstrates the capability to generate synthesized speech of comparable
quality to the HiFi-GAN vocoder but with a considerably improved inference
speed. However, the performance of the APNet vocoder is constrained by the
waveform sampling rate and spectral frame shift, limiting its practicality for
high-quality speech synthesis. Therefore, this paper proposes an improved
iteration of APNet, named APNet2. The proposed APNet2 vocoder adopts ConvNeXt
v2 as the backbone network for amplitude and phase prediction, aiming to
enhance its modeling capability. Additionally, we introduce a multi-resolution
discriminator (MRD) into the GAN-based losses and optimize the form of certain
losses. Under a common configuration with a waveform sampling rate of 22.05 kHz
and a spectral frame shift of 256 points (i.e., approximately 11.6 ms), our
proposed APNet2 vocoder outperformed the original APNet and Vocos vocoders in
terms of synthesized speech quality. The synthesized speech quality of APNet2
is also comparable to that of HiFi-GAN and iSTFTNet, while offering a
significantly faster inference speed.
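The reconstruction step shared by APNet and APNet2, combining predicted log-amplitude and phase spectra into a complex spectrum and inverting it with ISTFT, can be sketched with SciPy. The function name and exact STFT parameters here are assumptions for illustration, not the vocoders' actual configuration.

```python
import numpy as np
from scipy.signal import stft, istft

def reconstruct(log_amp, phase, n_fft=512, hop=256, fs=22050):
    """Combine log-amplitude and phase spectra (freq x frames) into a
    complex STFT A * exp(j*phi) and invert it via ISTFT."""
    spec = np.exp(log_amp) * np.exp(1j * phase)
    _, x = istft(spec, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    return x
```

Because amplitude and phase fully determine the complex spectrum, no iterative phase recovery (e.g., Griffin-Lim) is needed at synthesis time.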
Towards High-Quality and Efficient Speech Bandwidth Extension with Parallel Amplitude and Phase Prediction
Speech bandwidth extension (BWE) refers to widening the frequency bandwidth
range of speech signals, making speech sound brighter and fuller. This paper
proposes a generative adversarial network (GAN) based BWE
model with parallel prediction of Amplitude and Phase spectra, named AP-BWE,
which achieves both high-quality and efficient wideband speech waveform
generation. The proposed AP-BWE generator is entirely based on convolutional
neural networks (CNNs). It features a dual-stream architecture with mutual
interaction, where the amplitude stream and the phase stream communicate with
each other and respectively extend the high-frequency components from the input
narrowband amplitude and phase spectra. To improve the naturalness of the
extended speech signals, we employ a multi-period discriminator at the waveform
level and a pair of multi-resolution amplitude and phase discriminators at the
spectral level. Experimental results demonstrate that our
proposed AP-BWE achieves state-of-the-art performance in terms of speech
quality for BWE tasks targeting sampling rates of both 16 kHz and 48 kHz. In
terms of generation efficiency, due to the all-convolutional architecture and
all-frame-level operations, the proposed AP-BWE can generate 48 kHz waveform
samples 292.3 times faster than real-time on a single RTX 4090 GPU and 18.1
times faster than real-time on a single CPU. Notably, to our knowledge, AP-BWE
is the first to achieve the direct extension of the high-frequency phase
spectrum, which is beneficial for improving the effectiveness of existing BWE
methods.
Comment: Submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing
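To make the spectral framing of the BWE task concrete, the naive non-neural baseline below upsamples a signal by zero-padding its one-sided spectrum above the original Nyquist frequency. Unlike AP-BWE, it leaves the new high band empty; it only illustrates what "extending the spectrum" means, and the function name is hypothetical.

```python
import numpy as np

def spectral_zero_pad_bwe(x, up=2):
    """Spectral-domain upsampling by factor `up`: pad the one-sided
    spectrum with zeros above the original Nyquist, then invert.
    A real BWE model would instead predict the high-band content."""
    X = np.fft.rfft(x)                                   # one-sided spectrum
    X_wide = np.concatenate([X, np.zeros((up - 1) * len(x) // 2)])
    return np.fft.irfft(X_wide, n=up * len(x)) * up      # rescale for irfft
```

A learned model such as AP-BWE replaces the zero high band with predicted amplitude and phase content.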
APCodec: A Neural Audio Codec with Parallel Amplitude and Phase Spectrum Encoding and Decoding
This paper introduces APCodec, a novel neural audio codec targeting high
waveform sampling rates and low bitrates, which seamlessly integrates the
strengths of parametric codecs and waveform codecs. The APCodec revolutionizes
the process of audio encoding and decoding by concurrently handling the
amplitude and phase spectra as audio parametric characteristics like parametric
codecs. It is composed of an encoder and a decoder with the modified ConvNeXt
v2 network as the backbone, connected by a quantizer based on the residual
vector quantization (RVQ) mechanism. The encoder compresses the audio amplitude
and phase spectra in parallel, amalgamating them into a continuous latent code
at a reduced temporal resolution. This code is subsequently quantized by the
quantizer. Ultimately, the decoder reconstructs the audio amplitude and phase
spectra in parallel, and the decoded waveform is obtained by inverse short-time
Fourier transform. To ensure the fidelity of decoded audio like waveform
codecs, spectral-level loss, quantization loss, and generative adversarial
network (GAN) based loss are collectively employed for training the APCodec. To
support low-latency streamable inference, we employ feed-forward layers and
causal convolutional layers in APCodec, incorporating a knowledge distillation
training strategy to enhance the quality of decoded audio. Experimental results
confirm that our proposed APCodec can encode 48 kHz audio at a bitrate of just 6
kbps, with no significant degradation in the quality of the decoded audio. At
the same bitrate, our proposed APCodec also demonstrates superior decoded audio
quality and faster generation speed compared to well-known codecs, such as
SoundStream, Encodec, HiFi-Codec and AudioDec.
Comment: Submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing
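The residual vector quantization (RVQ) mechanism connecting the encoder and decoder can be sketched as follows: each stage quantizes the residual left over by the previous stage against its own codebook. This is a generic single-vector RVQ sketch, not APCodec's implementation; names are illustrative.

```python
import numpy as np

def rvq_encode(z, codebooks):
    """Residual vector quantization of one latent vector.
    z: (dim,) vector; codebooks: list of (n_codes, dim) arrays.
    Returns the per-stage code indices and the summed quantization."""
    residual = z.astype(float)
    indices, quantized = [], np.zeros_like(residual)
    for cb in codebooks:
        d = np.sum((cb - residual) ** 2, axis=1)   # nearest-code search
        idx = int(np.argmin(d))
        indices.append(idx)
        quantized += cb[idx]                       # accumulate approximation
        residual = residual - cb[idx]              # pass residual onward
    return indices, quantized
```

Each additional stage refines the approximation, so the bitrate scales with the number of codebooks while reusing small per-stage codebooks.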
A comparison of the inhibitory effects of curcumin and Avastin on rat corneal neovascularization
AIM: To compare the inhibitory effects of curcumin and Avastin on rat corneal neovascularization (CNV), and to explore the mechanism of curcumin's inhibition. METHODS: CNV was induced in thirty SD rats by alkali burn. The rats were divided randomly and equally into group A and group B. In group A, right eyes formed experimental group A1, treated with 40 μmol/L curcumin solution, and left eyes formed control group A2, treated with 0.09% sodium chloride. In group B, right eyes formed experimental group B1, treated with 5 g/L Avastin, and left eyes formed control group B2, treated with 0.09% sodium chloride. Corneas and aqueous humor were collected at set time points. The capillary vessels were examined, and VEGF expression was measured by enzyme-linked immunosorbent assay (ELISA). RESULTS: No toxic effects of the drugs were found. The capillary vessels in the experimental groups were fewer than those in the control groups (P<0.01), with no statistically significant difference between the two drugs. VEGF expression in the experimental groups was lower than in the control groups (P<0.01), and VEGF expression in group B1 was lower than in group A1. CONCLUSION: The inhibitory effects of curcumin and Avastin on CNV showed no statistically significant difference in this experiment, but curcumin had a weaker inhibitory effect on VEGF expression than Avastin. Curcumin may act on CNV through additional mechanisms
Genetic variations of the porcine PRKAG3 gene in Chinese indigenous pig breeds
Four missense substitutions (T30N, G52S, V199I and R200Q) in the porcine PRKAG3 gene were considered likely candidate loci affecting meat quality. In this study, the R200Q substitution was investigated in a sample of 62 individuals from Hampshire, Chinese Min and Erhualian pigs, and the genetic variations of the T30N, G52S and V199I substitutions were detected in 1505 individuals from 21 Chinese indigenous breeds, 5 Western commercial pig breeds, and wild pigs. Allele 200R was fixed in Chinese Min and Erhualian pigs. Haplotypes II-QQ and IV-QQ were not observed in the Hampshire population, supporting the hypothesis that allele 200Q is tightly linked with allele 199V. Significant differences in the allele frequencies of the three substitutions (T30N, G52S and V199I) between Chinese indigenous pigs and Western commercial pigs were observed. Markedly high frequencies of the "favorable" alleles 30T and 52G in terms of meat quality were detected in Chinese indigenous pigs, which are well known for high meat quality. However, the frequency of the "favorable" allele 199I, which was reported to have a greater effect on meat quality than 30T and 52G, was very low in all of the Chinese indigenous pigs except the Min pig. The reasons for this discrepancy remain to be addressed. The presence of the three substitutions in purebred Chinese Tibetan pigs indicates that they were ancestral mutations. A novel A/G substitution at position 51 in exon 1 was identified. These results suggest that further studies are required to investigate the associations of these PRKAG3 substitutions with meat quality in Chinese indigenous pigs, and to uncover other polymorphisms in the gene with potential effects on meat quality.