68 research outputs found

Speech intelligibility in cars: the effect of speaking style, noise and listener age

Author: Valentini-Botinhao Cassia
Yamagishi Junichi
Publication venue: 'International Speech Communication Association'
Publication date: 24/08/2017
Crossref

Edinburgh Research Explorer

Detection and analysis of attention errors in sequence-to-sequence text-to-speech

Author: King Simon
Valentini-Botinhao Cassia
Publication venue: 'International Speech Communication Association'
Publication date: 30/08/2021
Edinburgh Research Explorer

Using neighbourhood density and selective SNR boosting to increase the intelligibility of synthetic speech in noise

Author: King Simon
Valentini-Botinhao Cassia
Wester Mirjam
Yamagishi Junichi
Publication date: 01/08/2013

Edinburgh Research Explorer

Puffin: pitch-synchronous neural waveform generation for fullband speech on modest devices

Author: Valentini-Botinhao Cassia
Watts Oliver
Wihlborg Lovisa
Publication date: 25/11/2022

We present a neural vocoder designed with low-powered Alternative and Augmentative Communication devices in mind. By combining elements of successful modern vocoders with established ideas from an older generation of technology, our system is able to produce high quality synthetic speech at 48 kHz on devices where neural vocoders are otherwise prohibitively complex. The system is trained adversarially using differentiable pitch synchronous overlap add, and reduces complexity by relying on pitch synchronous Inverse Short-Time Fourier Transform (ISTFT) to generate speech samples. Our system achieves quality comparable to a strong (HiFi-GAN) baseline while using only a fraction of the compute. We present results of a perceptual evaluation as well as an analysis of system complexity.

Comment: ICASSP 2023 submission

arXiv.org e-Print Archive

Edinburgh Research Explorer

Speech Waveform Reconstruction using Convolutional Neural Networks with Noise and Periodic Inputs

Author: King Simon
Valentini-Botinhao Cassia
Watts Oliver
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 17/04/2019
Edinburgh Research Explorer

Evaluating speech intelligibility enhancement for HMM-based synthetic speech in noise

Author: King Simon
Valentini-Botinhao Cassia
Yamagishi Junichi
Publication date: 01/01/2012

It is possible to increase the intelligibility of speech in noise by enhancing the clean speech signal. In this paper we demonstrate the effects of modifying the spectral envelope of synthetic speech according to the environmental noise. To achieve this, we modify Mel cepstral coefficients according to an intelligibility measure that accounts for glimpses of speech in noise: the Glimpse Proportion measure. We evaluate this method against a baseline synthetic voice trained only with normal speech and a topline voice trained with Lombard speech, as well as natural speech. The intelligibility of these voices was measured when mixed with speech-shaped noise and with a competing speaker at three different levels. The Lombard voices, both natural and synthetic, were more intelligible than the normal voices in all conditions. For speech-shaped noise, the proposed modified voice was as intelligible as the Lombard synthetic voice without requiring any recordings of Lombard speech, which are hard to obtain. However, in the case of competing talker noise, the Lombard synthetic voice was more intelligible than the proposed modified voice.

Index Terms: HMM-based speech synthesis, intelligibility of speech in noise, Lombard speech
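As a rough illustration of the Glimpse Proportion idea mentioned in this abstract, the sketch below counts the fraction of time-frequency cells in which speech exceeds noise by a fixed margin. It is not the authors' implementation: the function name `glimpse_proportion`, the 3 dB default threshold, and the use of plain power grids in place of an auditory-model representation are all assumptions made for this example, and the Mel cepstral modification step described in the abstract is not shown.

```python
import math


def glimpse_proportion(speech_power, noise_power, threshold_db=3.0):
    """Return the fraction of time-frequency cells where the local SNR
    exceeds threshold_db.

    speech_power, noise_power: equally shaped lists of lists of
    non-negative power values (rows = frames, columns = frequency bands).
    threshold_db: assumed glimpse threshold; 3 dB is an illustrative default.
    """
    eps = 1e-12  # guards against log of zero and division by zero
    total = 0
    glimpsed = 0
    for speech_row, noise_row in zip(speech_power, noise_power):
        for s, n in zip(speech_row, noise_row):
            local_snr_db = 10.0 * math.log10((s + eps) / (n + eps))
            total += 1
            if local_snr_db > threshold_db:
                glimpsed += 1
    return glimpsed / total if total else 0.0


# Toy example: two frames, two bands, flat noise (hypothetical values).
speech = [[4.0, 1.0], [0.5, 10.0]]
noise = [[1.0, 1.0], [1.0, 1.0]]
gp = glimpse_proportion(speech, noise)
print(gp)
```

In the toy example, two of the four cells have a local SNR above 3 dB, so half of the cells count as glimpsed.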

CiteSeerX

Edinburgh Research Explorer

Evaluating Cognitive Load of Text-To-Speech (TTS) synthesis

Author: Govender Avashna
King Simon
Valentini-Botinhao Cassia
Publication date: 01/01/2019

Current evaluation methods for text-to-speech (TTS) synthesis rely solely on subjective rating scores. These tests typically account mostly for how natural or intelligible the voice is. With state-of-the-art systems, these measures are approaching ceiling and therefore alternative measures such as the cognitive load may become more meaningful. To our knowledge, there is little or no recent work evaluating the cognitive load of state-of-the-art text-to-speech systems. We use pupillometry as a measure of cognitive load. The pupil has been found to dilate upon increased cognitive effort when carrying out a listening task. Currently we are evaluating speech generated by a Deep Neural Network TTS synthesiser. In our method, we generate stimuli that step incrementally from natural speech to synthesized speech by changing only a single feature at a time. Stimuli are presented to listeners in speech-shaped noise conditions.

ZENODO

Edinburgh Research Explorer

Publikationsserver der RWTH Aachen University

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY

Speech Enhancement of Noisy and Reverberant Speech for Text-to-Speech

Author: Valentini-Botinhao Cassia
Yamagishi Junichi
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/08/2018
Edinburgh Research Explorer

Direct Modelling of Magnitude and Phase Spectra for Statistical Parametric Speech Synthesis

Author: Espic Calderón Felipe
King Simon
Valentini-Botinhao Cassia
Publication venue: 'International Speech Communication Association'
Publication date: 20/08/2017
Crossref

Edinburgh Research Explorer

Differentiable Grey-box Modelling of Phaser Effects using Frame-based Spectral Processing

Author: Bilbao Stefan
Carson Alistair
King Simon
Valentini-Botinhao Cassia
Publication date: 02/06/2023

Machine learning approaches to modelling analog audio effects have seen intensive investigation in recent years, particularly in the context of non-linear time-invariant effects such as guitar amplifiers. For modulation effects such as phasers, however, new challenges emerge due to the presence of the low-frequency oscillator which controls the slowly time-varying nature of the effect. Existing approaches have either required foreknowledge of this control signal, or have been non-causal in implementation. This work presents a differentiable digital signal processing approach to modelling phaser effects in which the underlying control signal and time-varying spectral response of the effect are jointly learned. The proposed model processes audio in short frames to implement a time-varying filter in the frequency domain, with a transfer function based on typical analog phaser circuit topology. We show that the model can be trained to emulate an analog reference device, while retaining interpretable and adjustable parameters. The frame duration is an important hyper-parameter of the proposed model, so an investigation was carried out into its effect on model accuracy. The optimal frame length depends on both the rate and transient decay-time of the target effect, but the frame length can be altered at inference time without a significant change in accuracy.

Comment: Accepted for publication in Proc. DAFx23, Copenhagen, Denmark, September 2023

arXiv.org e-Print Archive