5,768 research outputs found
Prosodic-Enhanced Siamese Convolutional Neural Networks for Cross-Device Text-Independent Speaker Verification
In this paper a novel cross-device text-independent speaker verification
architecture is proposed. Majority of the state-of-the-art deep architectures
that are used for speaker verification tasks consider Mel-frequency cepstral
coefficients. In contrast, our proposed Siamese convolutional neural network
architecture uses Mel-frequency spectrogram coefficients to benefit from the
dependency of the adjacent spectro-temporal features. Moreover, although
spectro-temporal features have proved to be highly reliable in speaker
verification models, they only represent some aspects of short-term acoustic
level traits of the speaker's voice. However, the human voice consists of
several linguistic levels such as acoustic, lexicon, prosody, and phonetics,
that can be utilized in speaker verification models. To compensate for these
inherited shortcomings in spectro-temporal features, we propose to enhance the
proposed Siamese convolutional neural network architecture by deploying a
multilayer perceptron network to incorporate the prosodic, jitter, and shimmer
features. The proposed end-to-end verification architecture performs feature
extraction and verification simultaneously. This proposed architecture displays
significant improvement over classical signal processing approaches and deep
algorithms for forensic cross-device speaker verification.Comment: Accepted in 9th IEEE International Conference on Biometrics: Theory,
Applications, and Systems (BTAS 2018
Text-based and Signal-based Prediction of Break Indices and Pause Durations
The relation between symbolic and signal features of prosodic
boundaries is experimentally studied using prediction methods.
Text-based break index prediction turns out to be fairly good,
but signal-based prediction and pause duration prediction perform worse. A possible reason is that random signal feature
variations, as usually produced by humans, are hard to predict
The listening talker: A review of human and algorithmic context-induced modifications of speech
International audienceSpeech output technology is finding widespread application, including in scenarios where intelligibility might be compromised - at least for some listeners - by adverse conditions. Unlike most current algorithms, talkers continually adapt their speech patterns as a response to the immediate context of spoken communication, where the type of interlocutor and the environment are the dominant situational factors influencing speech production. Observations of talker behaviour can motivate the design of more robust speech output algorithms. Starting with a listener-oriented categorisation of possible goals for speech modification, this review article summarises the extensive set of behavioural findings related to human speech modification, identifies which factors appear to be beneficial, and goes on to examine previous computational attempts to improve intelligibility in noise. The review concludes by tabulating 46 speech modifications, many of which have yet to be perceptually or algorithmically evaluated. Consequently, the review provides a roadmap for future work in improving the robustness of speech output
- …