Puffin: pitch-synchronous neural waveform generation for fullband speech on modest devices
We present a neural vocoder designed with low-powered Alternative and
Augmentative Communication devices in mind. By combining elements of successful
modern vocoders with established ideas from an older generation of technology,
our system is able to produce high quality synthetic speech at 48kHz on devices
where neural vocoders are otherwise prohibitively complex. The system is
trained adversarially using differentiable pitch synchronous overlap add, and
reduces complexity by relying on pitch synchronous Inverse Short-Time Fourier
Transform (ISTFT) to generate speech samples. Our system achieves quality
comparable to a strong (HiFi-GAN) baseline while using only a fraction of the
compute. We present results of a perceptual evaluation as well as an analysis
of system complexity.
Comment: ICASSP 2023 submission
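As a rough illustration of generating samples by pitch-synchronous ISTFT and overlap-add, the sketch below places one inverse-FFT frame at each pitch mark and sums the windowed frames. It is not the Puffin architecture: the function names, the constant-f0 pitch marks, and the flat zero-phase spectra are assumptions made for the example, whereas in a neural vocoder a network would predict a spectrum per pitch period.

```python
import numpy as np

def pitch_marks(f0_hz, sr, duration_s):
    """Sample indices of pitch marks for a constant f0 (one mark per period)."""
    period = sr / f0_hz
    return np.round(np.arange(0, duration_s * sr, period)).astype(int)

def ps_istft_ola(spectra, marks, n_fft, out_len):
    """Overlap-add one inverse-FFT frame per pitch mark (hypothetical helper)."""
    window = np.hanning(n_fft)
    half = n_fft // 2
    # The buffer is offset by half a frame so each frame can be centred on its mark.
    out = np.zeros(out_len + n_fft)
    for spec, m in zip(spectra, marks):
        frame = np.fft.irfft(spec, n=n_fft)
        # Rotate so the frame's time origin sits mid-frame, then taper.
        frame = np.roll(frame, half) * window
        out[m:m + n_fft] += frame
    return out[half:half + out_len]

sr, n_fft = 48000, 256
marks = pitch_marks(100.0, sr, 0.1)
# Assumed stand-in for predicted spectra: flat magnitude, zero phase (a pulse).
spectra = [np.ones(n_fft // 2 + 1, dtype=complex) for _ in marks]
y = ps_istft_ola(spectra, marks, n_fft, out_len=int(0.1 * sr))
```

With flat spectra the output is simply a pulse train at the pitch marks; shaping each spectrum would shape each pitch period, which is why a single small inverse FFT per period can be much cheaper than sample-by-sample generation.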
Evaluating speech intelligibility enhancement for HMM-based synthetic speech in noise
It is possible to increase the intelligibility of speech in noise by enhancing the clean speech signal. In this paper we demonstrate the effects of modifying the spectral envelope of synthetic speech according to the environmental noise. To achieve this, we modify Mel cepstral coefficients according to an intelligibility measure that accounts for glimpses of speech in noise: the Glimpse Proportion measure. We evaluate this method against a baseline synthetic voice trained only with normal speech and a topline voice trained with Lombard speech, as well as natural speech. The intelligibility of these voices was measured when mixed with speech-shaped noise and with a competing speaker at three different levels. The Lombard voices, both natural and synthetic, were more intelligible than the normal voices in all conditions. For speech-shaped noise, the proposed modified voice was as intelligible as the Lombard synthetic voice without requiring any recordings of Lombard speech, which are hard to obtain. However, in the case of competing talker noise, the Lombard synthetic voice was more intelligible than the proposed modified voice. Index Terms: HMM-based speech synthesis, intelligibility of speech in noise, Lombard speech
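The idea behind a glimpse-based measure can be sketched as the fraction of time-frequency cells in which the speech exceeds the noise by some local SNR threshold. The code below is a simplified illustration, not the paper's implementation: it substitutes a plain STFT for an auditory filterbank, and the function names, frame sizes, and the 3 dB threshold are assumptions chosen for the example.

```python
import numpy as np

def stft_power_db(x, frame_len=512, hop=256):
    """Power spectrogram in dB (a plain STFT stands in for an auditory model)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return 10.0 * np.log10(power + 1e-12)

def glimpse_proportion(speech, noise, threshold_db=3.0):
    """Fraction of time-frequency cells where speech exceeds noise by threshold_db."""
    s_db = stft_power_db(speech)
    n_db = stft_power_db(noise)
    return float(np.mean(s_db > n_db + threshold_db))

rng = np.random.default_rng(0)
sr = 16000
t = np.arange(sr) / sr
speech = np.sin(2 * np.pi * 220.0 * t)   # toy stand-in for a speech signal
noise = rng.standard_normal(sr)          # toy stand-in for the masker
gp_quiet = glimpse_proportion(0.1 * speech, noise)
gp_loud = glimpse_proportion(10.0 * speech, noise)
```

Raising the speech level relative to the noise can only increase the proportion, which is the property a spectral modification scheme would try to exploit without simply making the speech louder.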
Evaluating Cognitive Load of Text-To-Speech (TTS) synthesis
Current evaluation methods for text-to-speech (TTS) synthesis rely solely on subjective rating scores. These tests typically account mostly for how natural or intelligible the voice is. With state-of-the-art systems, these measures are approaching ceiling and therefore alternative measures such as the cognitive load may become more meaningful. To our knowledge, there is little or no recent work evaluating the cognitive load of state-of-the-art text-to-speech systems. We use pupillometry as a measure of cognitive load. The pupil has been found to dilate upon increased cognitive effort when carrying out a listening task. Currently we are evaluating speech generated by a Deep Neural Network TTS synthesiser. In our method, we generate stimuli that step incrementally from natural speech to synthesized speech by changing only a single feature at a time. Stimuli are presented to listeners in speech-shaped noise conditions.
Differentiable Grey-box Modelling of Phaser Effects using Frame-based Spectral Processing
Machine learning approaches to modelling analog audio effects have seen
intensive investigation in recent years, particularly in the context of
non-linear time-invariant effects such as guitar amplifiers. For modulation
effects such as phasers, however, new challenges emerge due to the presence of
the low-frequency oscillator which controls the slowly time-varying nature of
the effect. Existing approaches have either required foreknowledge of this
control signal, or have been non-causal in implementation. This work presents a
differentiable digital signal processing approach to modelling phaser effects
in which the underlying control signal and time-varying spectral response of
the effect are jointly learned. The proposed model processes audio in short
frames to implement a time-varying filter in the frequency domain, with a
transfer function based on typical analog phaser circuit topology. We show that
the model can be trained to emulate an analog reference device, while retaining
interpretable and adjustable parameters. The frame duration is an important
hyper-parameter of the proposed model, so an investigation was carried out into
its effect on model accuracy. The optimal frame length depends on both the rate
and transient decay-time of the target effect, but the frame length can be
altered at inference time without a significant change in accuracy.
Comment: Accepted for publication in Proc. DAFx23, Copenhagen, Denmark,
September 2023
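Frame-based spectral processing of a time-varying effect can be sketched as follows: each short frame is transformed, multiplied by a transfer function whose parameters follow a low-frequency oscillator, and overlap-added. This is only an illustrative sketch, not the paper's model. The digital first-order allpass cascade, the sinusoidal LFO, the 200 Hz to 2 kHz sweep, and all names and defaults are assumptions; nothing is learned here, and multiplying without zero-padding gives circular rather than linear convolution.

```python
import numpy as np

def phaser_frames(x, sr, lfo_hz=0.5, n_stages=4, frame_len=1024, mix=0.5):
    """Apply a phaser-like filter frame by frame in the frequency domain."""
    hop = frame_len // 2
    # Periodic Hann window: overlapping copies sum to 1 at 50% overlap.
    window = np.hanning(frame_len + 1)[:-1]
    omega = 2.0 * np.pi * np.fft.rfftfreq(frame_len)   # rad/sample
    z_inv = np.exp(-1j * omega)
    x_pad = np.concatenate([x, np.zeros(frame_len)])
    y = np.zeros(len(x_pad))
    for start in range(0, len(x), hop):
        # Evaluate the LFO once per frame, at the frame centre.
        t = (start + frame_len / 2) / sr
        lfo = 0.5 * (1.0 + np.sin(2.0 * np.pi * lfo_hz * t))
        fc = 200.0 * (2000.0 / 200.0) ** lfo           # assumed sweep range
        tan = np.tan(np.pi * fc / sr)
        a = (tan - 1.0) / (tan + 1.0)
        allpass = (a + z_inv) / (1.0 + a * z_inv)      # unit magnitude, phase only
        H = (1.0 - mix) + mix * allpass ** n_stages    # dry + phase-shifted path
        frame = x_pad[start:start + frame_len] * window
        y[start:start + frame_len] += np.fft.irfft(np.fft.rfft(frame) * H,
                                                   n=frame_len)
    return y[:len(x)]

sr = 44100
rng = np.random.default_rng(0)
x = rng.standard_normal(sr)
wet = phaser_frames(x, sr)
dry = phaser_frames(x, sr, mix=0.0)
```

Because the LFO is sampled once per frame, the frame length trades time resolution of the sweep against frequency resolution of the filter, which is the kind of trade-off the abstract's frame-duration investigation addresses.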